# HW3: Initial Model Development and Analysis

##### Author: Yuji Mori
##### Last Updated: 02/10/2021

Tasks:
- Perform feature engineering
- Estimate 1 (baseline) model and evaluate model performance
- Hint: Determine what metric(s) is/are appropriate for your use case
- Estimate 1 (different) model and/or loss function to improve model performance
- Interpret results of model
- Use results to answer business question you posed
- Write-up a summary of what you did and why in “Methodology” section of README, referencing 3+ cells, figures and/or tables 


In [112]:
import os
import json
import numpy as np
import pandas as pd
from pandas.io.json import json_normalize

from sklearn.model_selection import train_test_split
from sklearn import tree
from sklearn.ensemble import RandomForestRegressor
from sklearn import metrics


from matplotlib import pyplot as plt

In [67]:
# JSON flattening code
# This code was widely used by most Kaggle competitiors due to JSON structure of some fields
# credit goes to:
# https://www.kaggle.com/julian3833/1-quick-start-read-csv-and-flatten-json-fields

def load_df(csv_path='../data/train.csv', nrows=None):
    JSON_COLUMNS = ['device', 'geoNetwork', 'totals', 'trafficSource']
    df = pd.read_csv(csv_path, 
                     converters={column: json.loads for column in JSON_COLUMNS}, 
                     dtype={'fullVisitorId': 'str'}, # Important!!
                     nrows=nrows)
    
    for column in JSON_COLUMNS:
        column_as_df = json_normalize(df[column])
        column_as_df.columns = [f"{column}.{subcolumn}" for subcolumn in column_as_df.columns]
        df = df.drop(column, axis=1).merge(column_as_df, right_index=True, left_index=True)
    print(f"Loaded {os.path.basename(csv_path)}. Shape: {df.shape}")
    return df

**NOTE:** The Kaggle competition supplies a training and testing set. However, the supplied testing set does not contain the target variable. Therefore, I will use the competition 'training set' and split further into training and testing sets for the purposes of this project.

# Data Pre-processing 

In [68]:
train_df = load_df(csv_path='./data/train.csv')

  column_as_df = json_normalize(df[column])


Loaded train.csv. Shape: (903653, 55)


#### Dropping Columns

We drop some columns that are uninformative for analysis. This includes: 

- columns without unique values
- ID/string columns with only unique values 
- timestamp columns


In [69]:
# dropping columns without unique values:
nunique_cols = [col for col in train_df.columns if train_df[col].nunique() == 1]
train_df.drop(nunique_cols, axis=1,inplace=True)

In [70]:
# dropping string/ID columns that are uninformative or irrelevant for statistical analysis:
# - sessionId
# - visitId
# - fullvisitorID (TEMPORARY -- might aggregate later)
# - visitStartTime
# - trafficSource.keyword: The keyword of the traffic source
# - trafficSource.referralPath: referral URL (if available)
# - trafficSource.adwordsClickInfo.gclId: google click ID
# - networkDomain: name of ISP provider
string_cols = ['sessionId','visitId','fullVisitorId','visitStartTime','trafficSource.keyword', 'trafficSource.referralPath',
               'trafficSource.adwordsClickInfo.gclId', 'geoNetwork.networkDomain']
train_df.drop(string_cols, axis=1,inplace=True)

In [71]:
train_df.shape

(903653, 24)

#### Filling NAs (for numeric columns):

In [72]:
train_df['totals.transactionRevenue'] = train_df['totals.transactionRevenue'].fillna(0)
train_df['totals.pageviews'] = train_df['totals.pageviews'].fillna(0)

In [73]:
# once NaNs are filled, we can convert to int/float:
train_df = train_df.astype({'totals.hits': 'int64', 
                            'totals.pageviews': 'int64',
                            'totals.transactionRevenue':'float'})

In [78]:
train_df.dtypes

channelGrouping                                  object
date                                              int64
fullVisitorId                                    object
visitNumber                                       int64
device.browser                                   object
device.operatingSystem                           object
device.isMobile                                    bool
device.deviceCategory                            object
geoNetwork.continent                             object
geoNetwork.subContinent                          object
geoNetwork.country                               object
geoNetwork.region                                object
geoNetwork.metro                                 object
geoNetwork.city                                  object
totals.hits                                       int64
totals.pageviews                                  int64
totals.transactionRevenue                       float64
trafficSource.campaign                          

--------

#### Grouping low-frequency categories

I arbitrarily set a threshold of 100 for all categorical columns. This will help reduce the computational/memory load when dummy encoding.

I am adapting code that I found here: https://stackoverflow.com/questions/41577468/replace-low-frequency-categorical-values-from-pandas-dataframe-while-ignoring-na

In [77]:
train_df = train_df.apply(lambda x: x.mask(x.map(x.value_counts())<100, 'Other') if x.dtypes == 'O' else x)

In [94]:
train_df.head()

Unnamed: 0,channelGrouping,date,fullVisitorId,visitNumber,device.browser,device.operatingSystem,device.isMobile,device.deviceCategory,geoNetwork.continent,geoNetwork.subContinent,...,totals.transactionRevenue,trafficSource.campaign,trafficSource.source,trafficSource.medium,trafficSource.adwordsClickInfo.page,trafficSource.adwordsClickInfo.slot,trafficSource.adwordsClickInfo.adNetworkType,trafficSource.adContent,Purchase_Flag,Log_Revenue
0,Organic Search,20160902,Other,1,Chrome,Windows,False,desktop,Asia,Western Asia,...,0.0,(not set),google,organic,,,,,False,0.0
1,Organic Search,20160902,Other,1,Firefox,Macintosh,False,desktop,Oceania,Australasia,...,0.0,(not set),google,organic,,,,,False,0.0
2,Organic Search,20160902,Other,1,Chrome,Windows,False,desktop,Europe,Southern Europe,...,0.0,(not set),google,organic,,,,,False,0.0
3,Organic Search,20160902,Other,1,UC Browser,Linux,False,desktop,Asia,Southeast Asia,...,0.0,(not set),google,organic,,,,,False,0.0
4,Organic Search,20160902,Other,2,Chrome,Android,True,mobile,Europe,Northern Europe,...,0.0,(not set),google,organic,,,,,False,0.0


-----
# Statistical Modelling

I log-transform the revenue from each visit as the response variable:

$$ln(totals.transactionRevenue + 1)$$

In [81]:
train_df['Log_Revenue'] = np.log(train_df['totals.transactionRevenue'] + 1)

The summary ouputs below demonstrate how the transformation affects the response variable distribution:

In [146]:
train_df['Log_Revenue'].describe()

count    903653.000000
mean          0.227118
std           2.003710
min           0.000000
25%           0.000000
50%           0.000000
75%           0.000000
max          23.864375
Name: Log_Revenue, dtype: float64

In [147]:
train_df['totals.transactionRevenue'].describe()

count    9.036530e+05
mean     1.704273e+06
std      5.277866e+07
min      0.000000e+00
25%      0.000000e+00
50%      0.000000e+00
75%      0.000000e+00
max      2.312950e+10
Name: totals.transactionRevenue, dtype: float64

#### Encode Categorical Variables

In [82]:
# dummy encode categorical (object dtype) columns using Pandas get_dummies()
# dummy_na=True: parameter creates extra column to represent NaN (no need to impute/fill!)

train_df_dummies = pd.get_dummies(train_df, dummy_na=True)

In [83]:
train_df_dummies.columns

Index(['date', 'visitNumber', 'device.isMobile', 'totals.hits',
       'totals.pageviews', 'totals.transactionRevenue', 'Purchase_Flag',
       'Log_Revenue', 'channelGrouping_(Other)', 'channelGrouping_Affiliates',
       ...
       'trafficSource.adContent_Ad from 12/13/16',
       'trafficSource.adContent_Display Ad created 3/11/14',
       'trafficSource.adContent_Display Ad created 3/11/15',
       'trafficSource.adContent_Full auto ad IMAGE ONLY',
       'trafficSource.adContent_Google Merchandise Collection',
       'trafficSource.adContent_Google Online Store',
       'trafficSource.adContent_Other',
       'trafficSource.adContent_{KeyWord:Google Brand Items}',
       'trafficSource.adContent_{KeyWord:Google Merchandise}',
       'trafficSource.adContent_nan'],
      dtype='object', length=718)

---------

Aside: 

I attempted to build a classification model on the `Purchase_Flag` column, but due to some potential leakage, I found that the predictions were perfect. Therefore, I will no longer consider a binary classification model and focus on the initial task of predicting log revenue.

In [148]:
# CREATING A BINARY RESPONSE VARIABLE: IF transactionRevenue > 0, THEN TRUE:
'''
train_df['Purchase_Flag'] = train_df['totals.transactionRevenue'] > 0
train_df['Purchase_Flag'].value_counts()

y = train_df_dummies['Purchase_Flag']
X = train_df_dummies.drop(['Purchase_Flag','Log_Revenue','totals.transactionRevenue'],axis=1)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

tree_class = tree.DecisionTreeClassifier(max_depth=5)
tree_class.fit(X_train, y_train)
yhat_tree_class = tree_class.predict(X_test)
metrics.confusion_matrix(y_test, yhat_tree_class)
'''

"\ntrain_df['Purchase_Flag'] = train_df['totals.transactionRevenue'] > 0\ntrain_df['Purchase_Flag'].value_counts()\n\ny = train_df_dummies['Purchase_Flag']\nX = train_df_dummies.drop(['Purchase_Flag','Log_Revenue','totals.transactionRevenue'],axis=1)\n\nX_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)\n\ntree_class = tree.DecisionTreeClassifier(max_depth=5)\ntree_class.fit(X_train, y_train)\nyhat_tree_class = tree_class.predict(X_test)\nmetrics.confusion_matrix(y_test, yhat_tree_class)\n"

-------

### Model 1 - Single Regression Tree

In [105]:
# y = train_df['totals.transactionRevenue']
# X = train_df.drop(['totals.transactionRevenue'],axis=1)

# WORKING WITH SUBSET FOR NOW:
y = train_df_dummies['Log_Revenue']
X = train_df_dummies.drop(['Log_Revenue','totals.transactionRevenue'],axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

In [106]:
tree_reg = tree.DecisionTreeRegressor(max_depth=5)
tree_reg.fit(X_train, y_train)

DecisionTreeRegressor(max_depth=5)

##### Model Evaluation:

In [107]:
yhat_tree_reg = tree_reg.predict(X_test)

In [109]:
metrics.mean_squared_error(y_test, yhat_tree_reg)

2.9018251720330053

In [128]:
metrics.mean_squared_error(y_test[y_test > 0], yhat_tree_reg[y_test > 0])

170.00662721243324

----
### Model 2 -  Random Forest Classifier

In [113]:
rf_reg = RandomForestRegressor(n_estimators=10, random_state=42)
rf_reg.fit(X_train, y_train)

NameError: name 'regressor' is not defined

##### Model Evaluation:

In [114]:
yhat_rf_reg = rf_reg.predict(X_test)

In [115]:
metrics.mean_squared_error(y_test, yhat_rf_reg)

2.960154080933274

In [142]:
metrics.mean_squared_error(np.exp(y_test[25:30]), np.exp(yhat_rf_reg[25:30]))

73651220000000.2

In [127]:
metrics.mean_squared_error(y_test[y_test > 0], yhat_rf_reg[y_test > 0])

148.88062848715484