# Imports

First we import some of the libraries that we're going to use throughout this exercise:


### Pandas
[Pandas](https://pandas.pydata.org/) is a library designed to help with data analysis. It includes tools for organizing data and for common analysis tasks such as grouping, joining, summing, and averaging. It's built on top of another common (and more advanced) library called [numpy](http://www.numpy.org/), which enables extremely fast numerical computations. In general, if you can find a way to do what you want in pandas/numpy, it's going to be much faster than anything you're likely to code up yourself.

### Plotly
[Plotly](https://plot.ly/python/) is a library for making pretty interactive graphs.

### Requests
[Requests](http://docs.python-requests.org/en/master/), as the name might suggest, makes HTTP web requests easy, earning it the tongue-in-cheek descriptor: "HTTP for humans".

### google.colab
[A library designed specifically for the google.colab coding environment](https://colab.research.google.com/notebooks/io.ipynb) in which this notebook runs. We'll use it for accessing files within our own google drive folders, but it can do a bunch of other stuff too.


### urllib.parse

A python [standard library module](https://docs.python.org/3/library/urllib.parse.html), from which we'll use the "urlparse" function only. As the name implies it allows us to easily parse the different parts of a URL (path, parameters, protocol, etc).
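
As a quick illustration, here is a minimal sketch of what `urlparse` returns. The URL below is made up for the example.

```python
from urllib.parse import urlparse

# Hypothetical URL, used only for illustration
parts = urlparse("https://www.example.com/collections/shoes?page=2#reviews")

print(parts.scheme)    # protocol
print(parts.netloc)    # host
print(parts.path)      # path
print(parts.query)     # query string
print(parts.fragment)  # fragment
```

Each part is available as a named attribute, so we can pick out just the path without slicing strings by hand.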

### re

Another member of the [standard library](https://docs.python.org/3/library/re.html) that provides a number of functions that help with using [regular expressions](https://en.wikipedia.org/wiki/Regular_expression).
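
For a feel of what regular expressions look like in practice, the sketch below labels URL paths with `re`. The patterns and paths are illustrative assumptions; `(?!...)` is a negative lookahead, which matches only when the text that follows does *not* fit the inner pattern.

```python
import re

# Illustrative patterns: a product page has "/products/" somewhere in its path;
# a collection page starts with "/collections" and has no "/products" after it.
PRODUCT = re.compile(r"/products/")
COLLECTION = re.compile(r"^/collections(?!.*/products)")

def label_path(path):
    """Return a coarse page type for a URL path."""
    if PRODUCT.search(path):
        return "Products"
    if COLLECTION.search(path):
        return "Collections"
    return "N/A"

# Hypothetical paths
labels = [label_path(p) for p in [
    "/collections/shoes",
    "/collections/shoes/products/red-sneaker",
    "/pages/about-us",
]]
print(labels)
```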

### time
Standard library module that handles time, and for our purposes provides us with a sleep function.

### IPython.display
We'll import the HTML function, which allows us to display HTML files in the ipython notebook (or in this case, the google colab notebook) from the filesystem.


### Sklearn

[Sklearn](https://scikit-learn.org/stable/) is one of the most popular libraries for [Machine Learning](https://en.wikipedia.org/wiki/Machine_learning) in python (and in general). It implements everything from simple linear regression, to Support Vector Machines, to even small neural nets. 

Sklearn makes complicated machine learning algorithms accessible to everyone, and also has a number of functions for making sure you're practicing [good experimental design](https://en.wikipedia.org/wiki/Cross-validation_(statistics%29), and for [evaluating how good](https://en.wikipedia.org/wiki/Confusion_matrix) the models you've trained are.
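
To make the confusion matrix idea concrete, here is a small pure-Python sketch (no sklearn needed) that tallies true labels against predicted labels. The labels and predictions are invented for illustration. Rows stand for the true class and columns for the predicted class, which is also the orientation `sklearn.metrics.confusion_matrix(y_true, y_pred)` uses.

```python
from collections import Counter

def tally_confusion(y_true, y_pred, classes):
    """Return a nested list where entry [i][j] counts samples whose
    true class is classes[i] and whose predicted class is classes[j]."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in classes] for t in classes]

# Made-up labels for illustration
y_true = ["cat", "cat", "dog", "dog", "dog"]
y_pred = ["cat", "dog", "dog", "dog", "cat"]

matrix = tally_confusion(y_true, y_pred, classes=["cat", "dog"])

# Correct predictions sit on the diagonal
accuracy = sum(matrix[i][i] for i in range(len(matrix))) / len(y_true)
print(matrix, accuracy)
```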

In [0]:
import requests
import pandas as pd
import numpy as np
import re
import time
import plotly.offline as py
import plotly.graph_objs as go

from urllib.parse import urlparse
from google.colab import files, drive
from IPython.display import HTML


from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.metrics import confusion_matrix


import itertools
import matplotlib.pyplot as plt


# Files inside Google Colab

## Local Filesystem

If you click on the right-facing arrow at the top left of the google colab screen, a window pane will open up that has 3 tabs in it: "Table of Contents", "Code snippets", and "Files". 

Clicking on the "Files" tab, you'll be able to see what files we can access with our code. Folders and files listed here act just like they do on your computer; you can even right-click any of the files you see and select "download" to save them permanently to your computer. 

With the exception of the "sample_data" folder that appears in every google colab notebook, files uploaded here are ephemeral: they will disappear when you close the colab notebook, and you'll have to upload them again next time you want to use them in your code.


## Google Drive

Some of the files we'll generate from the code in this notebook are okay to throw away, like the HTML files that hold our plotly graphs. Some of them, however, we would like to save permanently because they take a lot of time to create, and we don't want to have to wait for them to get created again every time we want to play around with the code.

To solve this problem, google provides a way for us to access all the files we have in our Google Drive: 

1. Run the command in the cell below, `drive.mount("/content/gdrive")`. You'll be presented with a URL to click and a text box. 
2. Clicking the link will open a new tab with a page asking you to give permission to access your google drive. Accept the permission and you'll be shown a string of text (called a "token"). 
3. Copy that token, navigate back here to the notebook, and paste the token into the text box provided earlier. 
4. Once you've pasted the token, hit enter to confirm and wait a few seconds; your Google Drive should then be mounted to the google colab notebook.
5. Click on the "refresh" button in the files tab on the left. You should now see a folder called "gdrive" with a sub-folder called "My Drive". You'll find all the files you have in your google drive within that folder.

You can now treat the gdrive/My Drive/ folder as a permanent storage space for any files you generate in the code below.
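
The load-or-create pattern this enables can be sketched roughly as follows. `load_or_build` and `expensive_build` are hypothetical names, and a temporary directory stands in for the mounted `gdrive/My Drive/` folder so the sketch runs anywhere.

```python
import os
import tempfile

import pandas as pd

def load_or_build(csv_path, build):
    """Read a cached CSV if it exists; otherwise build the data and cache it."""
    try:
        return pd.read_csv(csv_path)
    except FileNotFoundError:
        df = build()
        df.to_csv(csv_path, index=False)
        return df

calls = []

def expensive_build():
    # Stand-in for a slow step such as an API download
    calls.append(1)
    return pd.DataFrame({"page": ["/a", "/b"], "users": [10, 20]})

# A temporary directory stands in for the Drive folder
cache_dir = tempfile.mkdtemp()
cache_path = os.path.join(cache_dir, "site_traffic.csv")

first = load_or_build(cache_path, expensive_build)   # builds and saves
second = load_or_build(cache_path, expensive_build)  # reads the saved file
print(len(calls), second["users"].sum())
```

The slow step only runs when the file is missing, which is the whole point of keeping the file somewhere permanent.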

In [0]:
drive.mount("/content/gdrive")

# Solution 1 

## Part 1: Winners and Losers

### Getting The Data
We want to analyze some traffic data from google analytics, and we're going to use the [Google Analytics API]() to do it. 

Google, like most sites out there nowadays, requires us to verify that we are who we say we are before we can access our google analytics data. In short, we need a temporary password that lets google know that our data requests are legit. 

How do we get this temporary password? Well, we're going to cheat a tiny bit by getting it from the [Google Query Explorer](). The "correct" way to do this is beyond the scope of this tutorial. For now: if it works, it works!

1. Head on over to the Query Explorer
2. Click on the button at the top that says "Click here to Authorize" and follow the steps provided.
3. Use the dropdown menu to select the website you want to get data from
4. Don't worry too much about most of the parameters; some will be filled in for you. The only one you need to fill in for now is the "metrics" parameter.
  * Select any metric at this point; we only need to run the query explorer once to get the token we need. I selected "**Users**" since it was at the top of the list.
5. Hit "Run Query" and let it run
6. Scroll down to the bottom of the page and look for the text-box that says "API Query URI". 
  * Check the box underneath it that says "*Include current access_token in the Query URI (will expire in ~60 minutes).*"
  * At the end of the URL in the text box you should now see access_token=string-of-text-here. Copy that string of text and paste it below in the variable called `token` (make sure to paste it inside the quotes)
  
7. Now, scroll back up to where we built the query, and look for the parameter that was filled in for you called "**ids**". Copy the value in this box and paste it in the variable below called `gaid`. Again, inside the quotes.

8. Run the cell once you've filled in the `gaid` and `token` variables to instantiate them, and we're good to go!
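
If copying the values by hand feels error-prone, one option is to paste the whole "API Query URI" and pull them out with `urllib.parse`. This is only a sketch: the URI below is a made-up placeholder in roughly the shape described above, not a real token or view ID.

```python
from urllib.parse import urlparse, parse_qs

# Placeholder URI (all values are fake)
query_uri = (
    "https://www.googleapis.com/analytics/v3/data/ga"
    "?ids=ga%3A12345678&start-date=30daysAgo&end-date=yesterday"
    "&metrics=ga%3Ausers&access_token=ya29.FAKE-token_value"
)

# parse_qs decodes percent-escapes and returns a list of values per key
params = parse_qs(urlparse(query_uri).query)

gaid = params["ids"][0]
token = params["access_token"][0]
print(gaid, token)
```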

In [0]:
metrics = ",".join(["ga:users","ga:newUsers"])
dimensions = ",".join(["ga:landingPagePath", "ga:date"])
segment = "gaid::-5"


# Required, please fill in with your own GA information example: ga:23322342
gaid = ""

# Example: ya29.GltOBqPcbInIS41UouaCztVeYt_-bYyvsb5bs2Du_vz62r21yhEBveNPK-D3k4Du1WVmbAw-zIt8gyjjLAHN4HiWBiEY0IONIwJ_e2swrDK4aZUkBu5ZoVw4nvjS
token = ""

# Example https://www.example.com or http://example.org
base_site_url = ""


# You can change the start and end dates as you like
start = "2017-06-01"
end = "2018-06-30"

The following functions use the variables we filled in above to get google analytics data. The specifics of how this works are left as an exercise for the reader, but if you need more information you might try looking [here](https://developers.google.com/analytics/devguides/reporting/core/v3/reference#q_summary).
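
The core idea behind the functions is pagination: request a page, hand it back, and keep following the response's `nextLink` field until there isn't one. Here is a minimal sketch of that loop with the network call swapped out for a dictionary of fake pages, so it runs offline. `fetch_json`, `paged_results`, and `FAKE_PAGES` are hypothetical stand-ins, not part of any API.

```python
# Fake API responses keyed by URL; only the last page lacks "nextLink"
FAKE_PAGES = {
    "page-1": {"rows": [["/a", "3"], ["/b", "5"]], "nextLink": "page-2"},
    "page-2": {"rows": [["/c", "1"]], "nextLink": "page-3"},
    "page-3": {"rows": [["/d", "7"]]},
}

def fetch_json(url):
    """Stand-in for `requests.get(url).json()`."""
    return FAKE_PAGES[url]

def paged_results(first_url, fetch=fetch_json):
    """Yield one response at a time, following "nextLink" until it runs out."""
    data = fetch(first_url)
    yield data
    while data.get("nextLink"):
        data = fetch(data["nextLink"])
        yield data

# Flatten the rows from every page into one list
all_rows = [row for page in paged_results("page-1") for row in page["rows"]]
print(len(all_rows))
```

Because `paged_results` is a generator, each page is only fetched when the consumer asks for it.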



In [0]:
def GAData(gaid, start, end, metrics, dimensions, 
           segment, token, max_results=10000):
  """Creates a generator that yields GA API data 
     in chunks of size `max_results`"""
  
  #build uri w/ params
  api_uri = "https://www.googleapis.com/analytics/v3/data/ga?ids={gaid}&"\
             "start-date={start}&end-date={end}&metrics={metrics}&"\
             "dimensions={dimensions}&segment={segment}&access_token={token}&"\
             "max-results={max_results}"
  
  # insert uri params
  api_uri = api_uri.format(
      gaid=gaid,
      start=start,
      end=end,
      metrics=metrics,
      dimensions=dimensions,
      segment=segment,
      token=token,
      max_results=max_results
  )
  
  # Using yield to make a generator in an
  # attempt to be memory efficient, since data is downloaded in chunks
  r = requests.get(api_uri)
  data = r.json()
  yield data
  # Keep following "nextLink" until the API stops returning one
  while data.get("nextLink"):
    new_uri = data.get("nextLink")
    new_uri += "&access_token={token}".format(token=token)
    r = requests.get(new_uri)
    data = r.json()
    yield data

      
def to_df(gadata):
  """Takes in a generator from GAData() 
     creates a dataframe from the rows"""
  
  frames = []
  total_rows = 0
  for data in gadata:
    # A page with no results may come back without a 'rows' key
    frame = pd.DataFrame(
        data.get('rows', []), 
        columns=[x['name'] for x in data['columnHeaders']]
    )
    frames.append(frame)
    total_rows += len(frame)
    print("Gathered {} rows".format(total_rows))
  # DataFrame.append was removed in pandas 2.0, so concatenate once at the end
  return pd.concat(frames, ignore_index=True)

In [0]:
# Here we first try to see if the traffic data already exists in our google drive
# Otherwise we call the Analytics API to request the data with the above functions
# We don't want to have to call the Analytics API if we don't have to.


try:
  data = pd.read_csv("/content/gdrive/My Drive/site_traffic.csv")
except FileNotFoundError:
  data = GAData(gaid=gaid, metrics=metrics, start=start, 
                end=end, dimensions=dimensions, segment=segment, 
                token=token)
  
  data = to_df(data)
  # If we do end up calling the API for the data: 
  # we then save it, so we don't have to again
  # index=False keeps the row index from coming back as an extra column
  data.to_csv("/content/gdrive/My Drive/site_traffic.csv", index=False)

In [0]:
data['path'] = data['ga:landingPagePath'].apply(lambda x: urlparse(x).path)
data['url'] = base_site_url + data['path']
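# ga:date is formatted as YYYYMMDD and is read back from the CSV as an integer,
# so convert it to a string and parse it with an explicit format
data['ga:date'] = pd.to_datetime(data['ga:date'].astype(str), format="%Y%m%d")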
data['ga:users'] = pd.to_numeric(data['ga:users'])
data['ga:newUsers'] = pd.to_numeric(data['ga:newUsers'])

In [0]:
before_shopify = data[data['ga:date'] < pd.to_datetime("2017-12-15")]
after_shopify = data[data['ga:date'] >= pd.to_datetime("2017-12-15")]


# Traffic totals before shopify switch
totals_before = before_shopify[["ga:landingPagePath", "ga:newUsers"]]\
                .groupby("ga:landingPagePath").sum()

totals_before = totals_before.reset_index()\
                .sort_values("ga:newUsers", ascending=False)



# Traffic totals after shopify switch
totals_after = after_shopify[["ga:landingPagePath", "ga:newUsers"]]\
               .groupby("ga:landingPagePath").sum()

totals_after = totals_after.reset_index()\
               .sort_values("ga:newUsers", ascending=False)

In [0]:
# Comparing pages from before and after the switch
change = totals_after.merge(totals_before, 
                            left_on="ga:landingPagePath", 
                            right_on="ga:landingPagePath", 
                            suffixes=["_after", "_before"], 
                            how="outer")

change.fillna(0, inplace=True)


change['difference'] = change['ga:newUsers_after'] - change['ga:newUsers_before']
change['percent_change'] = change['difference'] / change['ga:newUsers_before']


winners = change[change['percent_change'] > 0]
losers = change[change['percent_change'] < 0]
no_change = change[change['percent_change'] == 0]


In [0]:
# Checking that the total traffic adds up
data['ga:newUsers'].sum() == change[['ga:newUsers_after', 'ga:newUsers_before']].sum().sum()

In [0]:
losers.sort_values("difference").head(10)
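
The before/after comparison above can be tried on a toy dataset. This is a sketch with invented numbers and shortened column names (`page`, `new_users`). Note that a page with no traffic before the switch has a zero denominator, so its percent change comes out as infinity rather than a finite number.

```python
import numpy as np
import pandas as pd

# Invented totals per landing page, before and after a site change
before = pd.DataFrame({"page": ["/a", "/b", "/c"], "new_users": [100, 50, 20]})
after = pd.DataFrame({"page": ["/a", "/b", "/d"], "new_users": [150, 25, 10]})

# An outer merge keeps pages that appear on only one side
change = after.merge(before, on="page", how="outer",
                     suffixes=("_after", "_before"))
change = change.fillna(0)

change["difference"] = change["new_users_after"] - change["new_users_before"]
change["percent_change"] = change["difference"] / change["new_users_before"]

winners = change[change["percent_change"] > 0]
losers = change[change["percent_change"] < 0]

# Pages that are brand new after the change show up as infinite growth
n_infinite = int(np.isinf(change["percent_change"]).sum())
print(change.sort_values("difference"))
```

Sorting by `difference` rather than `percent_change` sidesteps the infinite values when ranking pages.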

# Solution 1b

In [0]:



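def get_redirects(url):
  """Returns (url, status_code, redirect_url) for a single URL.
     requests.head doesn't follow redirects by default, 
     so we get to see the redirect response itself."""
  try:
    r = requests.head(url, stream=True)
  except requests.exceptions.RequestException:
    # Connection problems, invalid URLs, etc. are recorded as "Error"
    return (url, None, "Error")
  if r.status_code in [301, 302, 307, 308]:
    return (url, r.status_code, r.headers.get('Location'))
  else:
    # Everything else (200, 404, ...) has no redirect target
    return (url, r.status_code, None)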


    

In [0]:

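# Kept at module level so partial results survive if the crawl is interrupted
results = []
def crawl_redirects(urls, sleep_time=.15):
  """Checks each URL for a redirect, pausing between requests.
     Returns the accumulated list of result tuples."""
  global results
  for i, url in enumerate(urls):
    result = get_redirects(url)
    results.append(result)
    if i % 1000 == 0:
      print(i,":", result)
    time.sleep(sleep_time)
  # Without this return, callers that assign the result would get None
  return results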

In [0]:
try:
  redirects = pd.read_csv("/content/gdrive/My Drive/site_redirects.csv")
except FileNotFoundError:
  results = crawl_redirects(data['url'].unique().tolist())
  redirects = pd.DataFrame(results, columns=["url", "status_code", "redirect_url"]).drop_duplicates()
  redirects.to_csv("/content/gdrive/My Drive/site_redirects.csv")

In [0]:
data_redirects = data.merge(redirects, left_on="url", right_on="url", how="outer")

In [0]:
data_redirects['true_url'] = data_redirects['redirect_url'].combine_first(data_redirects['path'])
data_redirects['true_url'] = data_redirects['true_url'].apply(lambda x: urlparse(x).path)

In [0]:
true_before = data_redirects[data_redirects['ga:date'] < pd.to_datetime("2017-12-15")]
true_after = data_redirects[data_redirects['ga:date'] >= pd.to_datetime("2017-12-15")]


# Traffic totals before shopify switch
true_totals_before = true_before[["true_url", "ga:newUsers"]]\
                     .groupby("true_url").sum()

true_totals_before = true_totals_before.reset_index()\
                     .sort_values("ga:newUsers", ascending=False)




# Traffic totals after shopify switch
true_totals_after = true_after[["true_url", "ga:newUsers"]]\
                    .groupby("true_url").sum()

true_totals_after = true_totals_after.reset_index()\
                    .sort_values("ga:newUsers", ascending=False)

In [0]:
# Comparing pages from before and after the switch
true_change = true_totals_after.merge(true_totals_before, 
                            left_on="true_url", 
                            right_on="true_url", 
                            suffixes=["_after", "_before"], 
                            how="outer")

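# fillna(inplace=True) on a .loc selection only modifies a temporary copy,
# so assign the filled columns back instead
user_cols = ["ga:newUsers_after", "ga:newUsers_before"]
true_change[user_cols] = true_change[user_cols].fillna(0)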


true_change['difference'] = true_change['ga:newUsers_after'] - true_change['ga:newUsers_before']
true_change['percent_change'] = true_change['difference'] / true_change['ga:newUsers_before']


true_winners = true_change[true_change['percent_change'] > 0]
true_losers = true_change[true_change['percent_change'] < 0]
true_no_change = true_change[true_change['percent_change'] == 0]

In [0]:
# Checking again that the total traffic adds up
true_change[["ga:newUsers_before", "ga:newUsers_after"]].sum().sum() == data['ga:newUsers'].sum()

# Solution 2

In [0]:
# ^/collections/.*/products
# ^/collections(?!.*/products.*)

In [0]:
data_redirects['group'] = "N/A"
data_redirects.loc[data_redirects['true_url'].str.contains(r"/collections(?!.*/products.*)(?!.*/product.*)"), "group"] = "Collections"
data_redirects.loc[data_redirects['true_url'].str.contains(r".*/products/.*|.*/product/.*"), "group"] = "Products"

In [0]:
grouped_data = data_redirects[['group', "ga:newUsers", "ga:date"]].groupby(["group", "ga:date"]).sum().reset_index()

In [0]:
grouped_before = grouped_data[grouped_data['ga:date'] < pd.to_datetime("2017-12-15")]
grouped_after = grouped_data[grouped_data['ga:date'] >= pd.to_datetime("2017-12-15")]

grouped_before_total = grouped_before[["group", "ga:newUsers"]].groupby("group").sum().reset_index()
grouped_after_total = grouped_after[["group", "ga:newUsers"]].groupby("group").sum().reset_index()

grouped_change = grouped_before_total.merge(grouped_after_total, left_on="group", right_on="group", suffixes=["_before", "_after"])

In [0]:
grouped_change['difference'] = grouped_change['ga:newUsers_after'] - grouped_change['ga:newUsers_before']
grouped_change['percent_change'] = grouped_change['difference'] / grouped_change['ga:newUsers_before']
grouped_change.sort_values("difference")

In [0]:

plot_data = [
    go.Bar(
        x = grouped_change['group'].tolist(),
        y = grouped_change['difference'].tolist(),
        marker = dict(
          color = 'red'
        ),
        name = 'Traffic Difference'
    ),
    go.Bar(
      x = grouped_change['group'],
      y = grouped_change['ga:newUsers_before'],
      marker = dict(
        color = 'blue'
      ),
      name = "Traffic Before"
    ),
    go.Bar(
      x = grouped_change['group'],
      y = grouped_change['ga:newUsers_after'],
      marker = dict(
        color = 'orange'
      ),
      name = "Traffic After"
    )
]



fig = go.Figure(data=plot_data)
py.plot(fig, filename="base-bar.html")



HTML(filename="./base-bar.html")

In [0]:
line_data = []

for group in grouped_data['group'].unique().tolist():
  line = go.Scatter(
    x = grouped_data.loc[grouped_data['group'] == group, 'ga:date'],
    y = grouped_data.loc[grouped_data['group'] == group, 'ga:newUsers'],
    name = group,
    mode="lines"
  )
  
  line_data.append(line)
  

  


layout = go.Layout(
    shapes=
     [
        {
            'type': 'line',
            'x0': "2017-12-15",
            'y0': 0,
            'x1': "2017-12-15",
            'y1': max(grouped_data['ga:newUsers'])+200,
            'line': {
                'color': 'grey',
                'width': .5,
             },
        },
        {
            'type': 'line',
            'x0': "2018-05-15",
            'y0': 0,
            'x1': "2018-05-15",
            'y1': max(grouped_data['ga:newUsers'])+200,
            'line': {
                'color': 'grey',
                'width': .5,
             },
        }
     ],
                   
    annotations = [dict(
        showarrow = True,
        x = "2017-12-15",
        y = 1000,
        text = "Shopify Switch",
        xanchor = "left",
        ax=50,
        opacity = 1
      ),
      dict(
        showarrow = True,
        x = "2018-05-15",
        y = 1000,
        text = "???",
        xanchor = "left",
        ax=-50,
        opacity = 1
      )]
)
       
  
  
  
fig = go.Figure(data=line_data, layout=layout)
py.plot(fig, filename="base-line.html")



HTML(filename="./base-line.html")

# Solution 3

In [0]:
import sklearn as sk
from bs4 import BeautifulSoup

In [0]:
urls = data_redirects.loc[~data_redirects['true_url'].str.match("Error"), "true_url"]
urls = pd.DataFrame(urls.unique(), columns=["url"])


# Save the full list of unique URLs to crawl
urls.to_csv("./crawl_urls.csv")

In [0]:
from urllib.request import urlopen
from PIL import ImageFile


def geturl(url):
  
  scheme, host, path, params, query, fragment = urlparse(url)
  if not path:
      path = "/"
  if params:
      path = path + ";" + params
  if query:
      path = path + "?" + query

  url = host + path
  
  return "http://" + url
 

def getsizes(uri):
    # get file size *and* image size (None if not known)
    with requests.get(uri, stream=True) as file:
      size = file.headers.get("content-length")
      if size: 
        size = int(size)
      p = ImageFile.Parser()
      data = file.iter_content(chunk_size=512)
      for datum in data:
        p.feed(datum)
        if p.image:
          return size, p.image.size
    return size, None


In [0]:
img_counts = pd.read_csv("/content/gdrive/My Drive/img_sizes.csv", index_col=0)
form_counts = pd.read_csv("/content/gdrive/My Drive/form_counts.csv", index_col=0)

In [0]:
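# 0 - 999
# 1000 - 1999
# 2000 - 2999
# ...
# 50000+

def img_size_group(size):
  """Maps a file size to a 1000-wide bucket label"""
  max_size = 50000
  if size >= max_size:
    return str(max_size)+"+"
  
  # Pairs of (lower, upper) bounds: (0, 1000), (1000, 2000), ..., (49000, 50000)
  img_size_groups = zip(
      range(0, max_size, 1000), 
      range(1000, max_size + 1, 1000)
  )
  
  for lower, upper in img_size_groups:
    # The lower bound is inclusive so sizes on a boundary aren't dropped
    if lower <= size < upper:
      return "{}-{}".format(lower, upper)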
    
    

    



In [0]:
img_counts['filesize_group'] = img_counts['filesize'].apply(img_size_group)
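
An alternative to looping over bucket boundaries is to compute the bucket directly with integer division. The sketch below is one way to do that, not the approach used in the cell above; `size_bucket` is a hypothetical name, and the 1000 width and 50000 cap mirror the values in `img_size_group`.

```python
def size_bucket(size, width=1000, max_size=50000):
    """Label a size with its bucket, e.g. 1500 -> "1000-2000"."""
    if size >= max_size:
        return "{}+".format(max_size)
    lower = (int(size) // width) * width  # round down to the bucket start
    return "{}-{}".format(lower, lower + width)

labels = [size_bucket(s) for s in [0, 999, 1000, 49999, 50000, 123456]]
print(labels)
```

This assumes non-negative numeric sizes; missing values would need to be handled before calling it.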

In [0]:
onehot_img = img_counts[['url']].join(pd.get_dummies(img_counts['filesize_group'], drop_first=True))
onehot_img = onehot_img.groupby("url").sum().reset_index()
onehot_img = img_counts[["url"]].merge(onehot_img, on="url", how="left").drop_duplicates()

In [0]:
data = form_counts.merge(onehot_img, on="url")
data.loc[:, 'group'] = "N/A"
data.loc[data['url'].str.contains(r".*/products/.*|.*/product/.*"), "group"] = "Products"
data.loc[data['url'].str.contains(r"/collections(?!.*/products.*)(?!.*/product.*)"), "group"] = "Category"

# data = data.loc[data['group'] == "Category", :].append(data.loc[data['group'] == "Products", :].sample(500))

In [0]:


def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')

    print(cm)

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)
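    # Turn off grid lines on the current axes (the `b` argument no longer exists
    # in recent matplotlib, and plt.axes() would create a new set of axes)
    plt.gca().grid(False)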
     
    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.ylabel('True label')
    plt.xlabel('Predicted label')
#     plt.tight_layout()
    
    fig = plt.gcf()
    fig.set_size_inches(15,12)
  
    plt.show()


np.set_printoptions(precision=2)

X_train, X_test, y_train, y_test = train_test_split(data.drop(["group", "url"], axis=1), data['group'], test_size=0.2, random_state=42)

In [0]:

names = [
         "Naive Bayes",
         "Linear SVM",
         "Logistic Regression",
         "Random Forest",
         "Multilayer Perceptron"
        ]

classifiers = [
    MultinomialNB(),
    LinearSVC(),
    LogisticRegression(),
    RandomForestClassifier(),
    MLPClassifier()
]

parameters = [
              {'alpha': (1e-2, 1e-3, 1e-4)},
              {'C': (np.logspace(-5, 1, 5))},
              {'C': (np.logspace(-5, 1, 5))},
              {'max_depth': (1, 2, 3, 4, 5)},
              {'alpha': (1e-2, 1e-3), "max_iter": [1000, 2000, 3000]}
             ]


rows = []
cms = []
for name, classifier, params in zip(names, classifiers, parameters):
    gs_clf = GridSearchCV(classifier, param_grid=params, n_jobs=-1)
    clf = gs_clf.fit(X_train, y_train)
    score = clf.score(X_test, y_test)
    
    predictions = clf.predict(X_test)
    
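    # confusion_matrix expects (y_true, y_pred): rows are true labels,
    # columns are predicted labels, matching the axis labels in the plot
    cm = confusion_matrix(y_test, predictions, labels=clf.classes_)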
    cms.append((name, cm, clf.classes_))
   
    
    
    
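    print("{} best cross-validation score: {}".format(name, clf.best_score_))
    print("{} score on held-out test set: {}".format(name, score))
    print("Grid scores from cross-validation on the training set:")
    print()
    means = clf.cv_results_['mean_test_score']
    stds = clf.cv_results_['std_test_score']
    # cv_params avoids shadowing the outer loop's `params` variable
    for mean, std, cv_params in zip(means, stds, clf.cv_results_['params']):
        print("%0.3f (+/-%0.03f) for %r"
              % (mean, std * 2, cv_params))
        
        row = (name, mean, str(cv_params))
        rows.append(row)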
        
    print()

In [0]:
for name, conf_matrix, classes in cms:
  plot_confusion_matrix(conf_matrix, classes, title=name)

In [0]:
results = pd.DataFrame(rows, columns=["algorithm", "score", "params"])
results

In [0]:

plot_data = []

for name in names:
  bar = go.Bar(
      x = results[results['algorithm'] == name]['params'].tolist(),
      y = results[results['algorithm'] == name]['score'].tolist(),
      name = name
  )
  plot_data.append(bar)
  



fig = go.Figure(data=plot_data)
py.plot(fig, filename="base-bar-results.html")



HTML(filename="./base-bar-results.html")