This Kaggle dataset covers customer conversion on the Google Merchandise Store (also known as GStore, where Google swag is sold). The main purpose of this analysis is to predict revenue per customer and to make recommendations on promotional strategies. The main technical challenge in predicting revenue is the presence of multiple high-cardinality categorical features. Through careful data exploration, followed by a deliberate choice of feature treatments and machine learning algorithm, I show that a solution based on feature engineering and extreme gradient-boosted decision trees reaches an area under the precision-recall curve of 0.997. Crucially, these results were obtained without artificially balancing the data, which makes this approach suitable for real-world applications.

<a id='top'></a>
#### Outline: 
#### 1. <a href='#Sampling'>Random sampling of training set</a>


#### 2. <a href='#clean'>Data Cleaning</a>
21. <a href='#cleanID'>Cleaning of IDs</a>
22. <a href='#cleanTotals'>Cleaning of Variable Totals</a>
23. <a href='#cleanTS'>Cleaning of Time Series Variables</a>
24. <a href='#cleanLoc'>Cleaning of Location Variables</a>
25. <a href='#cleanDev'>Cleaning of Device Variables</a>
26. <a href='#cleanCust'>Cleaning of Custom Dimension Variables</a>


#### 3. <a href='#feature-eng'>Feature Engineering</a>
31. <a href='#dropUnif'>Drop uninformative categorical variables</a>
32. <a href='#dropMissing'>Drop variables with too many missing values</a>
33. <a href='#encodeCat'>Encode Categorical Variables</a>
34. <a href='#encodeCat1'>Encode Network Domain</a>
35. <a href='#encodeCat2'>Encode Operating Systems</a>

#### 4. <a href='#EDA'>Exploratory Data Analysis</a>

#### 5. <a href='#ML'>Machine Learning to Predict Transactions</a>
51. <a href='#rfEnum'>Random Forest with Enum Encoding</a>
52. <a href='#gbmEnum'>GBM with Enum Encoding</a>
53. <a href='#gbmTarget'>GBM with sort_by_response Encoding</a>
54. <a href='#gbmTargetDev'>GBM with Target Encoding Channel/Device Analysis</a>


#### 6. <a href='#visualization'>Visualization</a>
61. <a href='#varimp'>Variable Importance Plot</a>
62. <a href='#pdp'>Partial Dependency Plot</a>
63. <a href='#pdp2dim'>Two Variable Partial Dependency Plot</a>
64. <a href='#Treeplot'>Major Decision Trees Plot</a>


#### 7. <a href='#conclusion'>Conclusion</a>

In [1]:
import pandas as pd
import json
import random
import csv
import numpy as np
from datetime import datetime

In [2]:
import seaborn
import matplotlib.pyplot as plt
%matplotlib inline
seaborn.set(rc={'figure.figsize':(15,12)})

In [3]:
import subprocess
from IPython.display import Image

In [4]:
def json_to_series(text):
    keys, values = zip(*[item for item in json.loads(text).items()])
    return pd.Series(values, index=keys)
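
To make the effect of `json_to_series` concrete, here is a minimal sketch on a toy `totals`-like column. The two JSON strings are made up for illustration and are not taken from the dataset. The sketch also shows `pd.json_normalize` as one possible alternative that avoids building a Series per row, which may be faster on large frames.

```python
import json
import pandas as pd

def json_to_series(text):
    # Same idea as the helper above: one JSON object -> one row of columns
    keys, values = zip(*json.loads(text).items())
    return pd.Series(values, index=keys)

# Toy data (hypothetical values); the keys differ between rows on purpose
toy = pd.DataFrame({"totals": ['{"visits": "1", "pageviews": "3"}',
                               '{"visits": "1", "bounces": "1"}']})

# Row-by-row expansion with the helper: keys missing from a row become NaN
expanded = pd.DataFrame([json_to_series(t) for t in toy["totals"]])

# Alternative: parse every row to a dict, then flatten all of them in one call
normalized = pd.json_normalize(toy["totals"].map(json.loads).tolist())

print(expanded)
print(normalized)
```

Both tables end up with one column per JSON key. The values are still strings at this point, which is why the cleaning cells cast them to integers afterwards.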

<a id='Sampling'></a>
# 1. Random sample of 90,000 rows from the training set
<a href='#top'>back to top</a>

In [7]:
filename = ".\\all\\train_v2.csv"

In [None]:
# Count the data rows (header excluded) without leaving the file handle open
with open(filename, encoding="utf8") as f:
    n = sum(1 for _ in f) - 1

In [None]:
s = 90000
# Skip all but s randomly chosen data rows; row 0 is the header and is always kept
skip = sorted(random.sample(range(1, n + 1), n - s))
train_top = pd.read_csv(filename, skiprows=skip)

In [None]:
train_top.to_csv("train_top.csv",index=False)
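
The sampling trick above (pick the rows to skip, then let `read_csv` drop them) can be checked on a small in-memory file. The sketch below is an illustration under assumptions: the CSV content, the sizes, and the seed 42 are all hypothetical, and the seed is only there so that the sample can be reproduced.

```python
import io
import random
import pandas as pd

# Hypothetical CSV with a header line followed by 100 data rows
n_rows = 100
csv_text = "id,value\n" + "\n".join(f"{i},{i * i}" for i in range(1, n_rows + 1))

sample_size = 10
rng = random.Random(42)  # fixed seed (assumption) so the draw is repeatable

# Line 0 is the header, so candidate lines to skip are 1..n_rows
skip = sorted(rng.sample(range(1, n_rows + 1), n_rows - sample_size))

# read_csv never parses the skipped lines, so memory use follows the sample size
sample = pd.read_csv(io.StringIO(csv_text), skiprows=skip)
print(sample.shape)
```

Because the skip list never contains 0, the header survives and the column names are preserved.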

In [8]:
df = pd.read_csv("train_top.csv")

In [37]:
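# Overrides the sampled frame loaded above: the cells below run on the first 100,000 rows of the raw file
df = pd.read_csv(filename,nrows=100000)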

<a id='clean'></a>
# 2. Data Cleaning

<a id='cleanID'></a>
### 2.1 Clean ID
<a href='#top'>back to top</a>

sessionId = fullVisitorId_visitId    
visitNumber may be a strong indicator, since a high value means the visitor has come back multiple times.

In [38]:
ct = df.groupby(by='fullVisitorId')['visitId'].count()
visitn = df.groupby(by='fullVisitorId')['visitNumber'].max()
visitn = pd.concat([visitn,ct],axis=1)
visitn.columns = ["max_visitNumber","count"]
visitn.reset_index(inplace=True)
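
The same per-visitor summary can be written in one step with named aggregation. This is a minimal sketch on toy data (the IDs and visit numbers are hypothetical). The extra `missing_visits` column is an illustrative assumption, not part of the original analysis: it shows how the gap between the highest `visitNumber` and the number of sessions actually present hints at visits that fall outside the loaded rows.

```python
import pandas as pd

# Toy sessions (hypothetical values)
toy = pd.DataFrame({
    "fullVisitorId": ["a", "a", "b", "a"],
    "visitId": [101, 102, 201, 103],
    "visitNumber": [1, 2, 1, 5],
})

# One row per visitor: highest visit number seen and number of sessions present
visitn = (toy.groupby("fullVisitorId")
             .agg(max_visitNumber=("visitNumber", "max"),
                  count=("visitId", "count"))
             .reset_index())

# Hypothetical helper column: sessions implied by visitNumber but not present here
visitn["missing_visits"] = visitn["max_visitNumber"] - visitn["count"]
print(visitn)
```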

In [39]:
df.drop(['hits','trafficSource'],axis=1,inplace=True)

<a id='cleanTotals'></a>
### 2.2 Clean Totals
<a href='#top'>back to top</a>

In [40]:
df = pd.concat([df, df['totals'].apply(json_to_series)], axis=1)

In [41]:
df.drop(['totals'],axis=1,inplace=True)

In [42]:
# Cast the count-like totals to integers, treating missing values as 0
#df['hits'] = df['hits'].map(lambda x: int(x) if not pd.isnull(x) else 0)
count_cols = ['pageviews','newVisits','visits','bounces','transactionRevenue','totalTransactionRevenue']
for col in count_cols:
    df[col] = df[col].map(lambda x: int(x) if not pd.isnull(x) else 0)

In [43]:
df.drop(['totalTransactionRevenue'],axis=1,inplace=True)

In [44]:
# log(1 + x) keeps zero-revenue sessions at 0 and compresses the long right tail
df['log_transactionRevenue'] = np.log1p(df['transactionRevenue'])
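
A short worked example of the transform, using hypothetical revenue values: `np.log1p` computes log(1 + x), so sessions without revenue map to exactly 0, and `np.expm1` is its inverse, which is what one would apply to bring predictions made on the log scale back to the original scale.

```python
import numpy as np

# Hypothetical raw revenue values, including a session with no purchase
revenue = np.array([0, 1_000_000, 37_860_000])

log_revenue = np.log1p(revenue)    # log(1 + x), defined at x = 0
recovered = np.expm1(log_revenue)  # exp(y) - 1, the inverse transform

print(log_revenue)
print(recovered)
```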

<a id='cleanTS'></a>
### 2.3 Clean Time Series
<a href='#top'>back to top</a>

In [45]:
# Convert the POSIX timestamp (seconds, UTC) to a datetime, then keep only the hour of day
df["raw_visitStartTime"] = pd.to_datetime(df["visitStartTime"], unit="s")
df["visitStartTime"] = df["raw_visitStartTime"].dt.hour

In [77]:
df['nvisits'] = df.groupby(by='fullVisitorId')['fullVisitorId'].cumcount()+1

In [83]:
df.sort_values(by=['fullVisitorId','nvisits','raw_visitStartTime'],inplace=True)

In [84]:
df['pre_visitStartTime'] = df.groupby(by='fullVisitorId')['raw_visitStartTime'].shift(1)

In [85]:
df['diff_lastVisitTime'] = (df['raw_visitStartTime']-df['pre_visitStartTime'])

In [86]:
# Gap since the previous visit in whole hours (NaN for a visitor's first visit)
df['diff_lastVisitTime'] = df['diff_lastVisitTime'].map(lambda x: int(x.seconds//3600)+x.days*24 if not pd.isnull(x) else np.nan)

In [87]:
ids = list(df.loc[df['diff_lastVisitTime']<0,'fullVisitorId'].unique())
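
In the cells above, `nvisits` is assigned in file order before the frame is sorted, so a visitor whose rows are not stored chronologically can end up with a negative gap; those are the visitors collected in `ids`. As a sketch of one alternative (an assumption about the intent, not the method used above), sorting by visitor and start time first guarantees non-negative gaps. The toy timestamps are hypothetical.

```python
import pandas as pd

# Toy sessions stored out of chronological order for visitor "a" (hypothetical values)
toy = pd.DataFrame({
    "fullVisitorId": ["a", "a", "b", "a"],
    "visitStartTime": [1500003600, 1500000000, 1500000000, 1500090000],
})
toy["raw_visitStartTime"] = pd.to_datetime(toy["visitStartTime"], unit="s")

# Sort chronologically within each visitor before numbering the visits
toy = toy.sort_values(["fullVisitorId", "raw_visitStartTime"])
toy["nvisits"] = toy.groupby("fullVisitorId").cumcount() + 1

# Time since the previous visit of the same visitor, in whole hours
gap = toy.groupby("fullVisitorId")["raw_visitStartTime"].diff()
toy["diff_lastVisitTime"] = gap.dt.total_seconds() // 3600
print(toy)
```

The first visit of each visitor has no predecessor, so its gap stays missing, matching the convention used in the cells above.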

In [90]:
df['week'] = df['raw_visitStartTime'].map(lambda x: x.isocalendar()[1])
df['day_of_week'] = df['raw_visitStartTime'].map(lambda x: x.isocalendar()[2])
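
The week and weekday features can also be extracted in a vectorized way with `Series.dt.isocalendar()`. The sketch below uses two hypothetical timestamps and illustrates one property of ISO weeks worth keeping in mind: the first days of January can belong to the last week of the previous ISO year, so `week` alone does not always identify the calendar year.

```python
import pandas as pd

# Two hypothetical timestamps (seconds since the epoch, UTC);
# the second one is 2017-01-01 00:00:00
stamps = pd.Series(pd.to_datetime([1500000000, 1483228800], unit="s"))

# isocalendar() returns a DataFrame with columns year, week and day
iso = stamps.dt.isocalendar()
week = iso["week"].astype(int)
day_of_week = iso["day"].astype(int)  # 1 = Monday ... 7 = Sunday

print(iso)
```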

In [91]:
df.drop(['date'],axis=1,inplace=True)

<a id='cleanLoc'></a>
### 2.4 Clean Location
<a href='#top'>back to top</a>

In [92]:
df = pd.concat([df, df['geoNetwork'].apply(json_to_series)], axis=1)

In [93]:
df.drop(['geoNetwork'],axis=1,inplace=True)

<a id='cleanDev'></a>
### 2.5 Clean Device
<a href='#top'>back to top</a>

In [94]:
df = pd.concat([df, df['device'].apply(json_to_series)], axis=1)

In [95]:
df.drop(['device'],axis=1,inplace=True)

<a id='cleanCust'></a>
### 2.6 Clean customDimensions
<a href='#top'>back to top</a>

In [96]:
df['customDimensions'] = df['customDimensions'].map(lambda x: x[1:-1])

In [97]:
df['customDimensions'] = df['customDimensions'].map(lambda x: x.replace("'","\""))

In [98]:
df.loc[df['customDimensions']=="",'customDimensions'] = "{\"index\":\"NaN\",\"value\":\"NaN\"}"

In [99]:
df = pd.concat([df, df['customDimensions'].apply(json_to_series)], axis=1)

In [100]:
df.drop(['customDimensions'],axis=1,inplace=True)
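
The string surgery above (strip the brackets, swap the quotes, patch empty values) suggests that `customDimensions` holds Python-style list literals. Under that assumption, `ast.literal_eval` is one way to parse the field without rewriting quotes, which also survives values that contain an apostrophe. This is a sketch on hypothetical values, and `parse_custom_dimensions` is an illustrative helper name, not part of the notebook.

```python
import ast
import numpy as np
import pandas as pd

def parse_custom_dimensions(text):
    # Assumed raw format: "[]" or "[{'index': '4', 'value': '...'}]"
    items = ast.literal_eval(text)
    if not items:
        # Empty list: no custom dimension recorded for this session
        return {"index": np.nan, "value": np.nan}
    return items[0]

# Hypothetical raw values; the last one would break a naive quote replacement
toy = pd.Series([
    "[{'index': '4', 'value': 'North America'}]",
    "[]",
    "[{'index': '4', 'value': \"Cote d'Ivoire\"}]",
])

parsed = pd.DataFrame([parse_custom_dimensions(t) for t in toy])
print(parsed)
```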

<a id='feature-eng'></a>
# 3. Feature Engineering
<a href='#top'>back to top</a>

In [101]:
ID = ["fullVisitorId","visitId"]

<a id='dropUnif'></a>
### 3.1 Drop uninformative categorical variables
<a href='#top'>back to top</a>

In [102]:
# Drop columns that hold a single distinct (non-missing) value
constant_cols = [col for col in df.columns if df[col].nunique()==1]
df.drop(constant_cols,axis=1,inplace=True)

<a id='dropMissing'></a>
### 3.2 Drop variables with too many missing values
<a href='#top'>back to top</a>

In [103]:
df.replace("not available in demo dataset",np.nan,inplace=True)

In [104]:
df.replace("NaN",np.nan,inplace=True)

In [105]:
df.replace("(not set)",np.nan,inplace=True)

In [106]:
# Fraction of missing values per column; show the columns above 10%
na_df = df.isna().mean()
na_df[na_df>0.1]

sessionQualityDim     0.53872
timeOnSite            0.49974
transactions          0.98998
pre_visitStartTime    0.90097
diff_lastVisitTime    0.90097
region                0.57240
metro                 0.77939
city                  0.58195
networkDomain         0.28537
index                 0.20214
value                 0.20214
dtype: float64

In [107]:
df.drop(['index',"isMobile","transactions","metro"],axis=1,inplace=True)

<a id='encodeCat'></a>
### 3.3 Encode Categorical Variables
<a href='#top'>back to top</a>

In [108]:
import category_encoders as ce

<a id='encodeCat1'></a>
### 3.3.1 Encode Categorical Variables -- networkDomain
<a href='#top'>back to top</a>

In [109]:
df.loc[df['networkDomain'].isna(),'networkDomain'] = 'unknown.unknown'

df["raw_networkDomain"] = df["networkDomain"]

In [110]:
min_sample_leaf = int(df.shape[0]*0.005)

encoder = ce.TargetEncoder(cols=["networkDomain"],min_samples_leaf=min_sample_leaf,smoothing=1)

In [111]:
encoder.fit(df.loc[:,~df.columns.isin(["transactionRevenue","log_transactionRevenue"])], df["log_transactionRevenue"])

TargetEncoder(cols=['networkDomain'], drop_invariant=False,
       handle_unknown='impute', impute_missing=True, min_samples_leaf=500,
       return_df=True, smoothing=1.0, verbose=0)

In [112]:
df_x = encoder.transform(df.loc[:,~df.columns.isin(["transactionRevenue","log_transactionRevenue"])])

In [113]:
df['networkDomain'] = df_x['networkDomain']
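
To show what the `min_samples_leaf` and `smoothing` arguments control, here is a hand-rolled sketch of smoothed target encoding in plain pandas. It assumes the sigmoid blending between the global mean and the per-category mean that `TargetEncoder` documents; the library's exact behaviour may differ between versions, so treat this as an illustration of the idea rather than a reimplementation. The function name and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd

def smoothed_target_encode(categories, target, min_samples_leaf=1, smoothing=1.0):
    """Blend each category mean with the global mean, weighted by category size."""
    prior = target.mean()
    stats = target.groupby(categories).agg(["count", "mean"])
    # Weight near 1 for categories well above min_samples_leaf, near 0 for rare ones
    weight = 1.0 / (1.0 + np.exp(-(stats["count"] - min_samples_leaf) / smoothing))
    table = prior * (1.0 - weight) + stats["mean"] * weight
    return categories.map(table), table

# Toy data: "a" is frequent, "b" appears only once (hypothetical values)
categories = pd.Series(list("aaaaaab"))
target = pd.Series([1.0] * 6 + [0.0])

encoded, table = smoothed_target_encode(categories, target,
                                        min_samples_leaf=3, smoothing=1.0)
print(table)
```

With `min_samples_leaf` set to 0.5% of the rows, as in the cell above, only domains seen in at least a few hundred sessions keep an encoding close to their own mean; rarer domains are pulled toward the global mean, which limits the noise introduced by high-cardinality levels.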

In [114]:
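# Map each encoded value back to the raw domains that produced it
# (collapsed to 'Others' when more than 10 raw domains share one encoding)
cl='networkDomain'

networkDomain_dict={}

for k,g in df.groupby(by=cl):
    raw_values = g['raw_'+cl].unique()
    networkDomain_dict[k] = list(raw_values) if len(raw_values)<=10 else ['Others']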

<a id='encodeCat2'></a>
### 3.3.2 Encode Categorical Variables -- operatingSystem, browser, region, city, country
<a href='#top'>back to top</a>

In [115]:
df['raw_operatingSystem'] = df['operatingSystem']
df['raw_browser'] = df['browser']
df['raw_region'] = df['region']
df['raw_city'] = df['city']
df['raw_country'] = df['country']

In [116]:
cols = ["operatingSystem","browser","region","city","country"]

min_sample_leaf = int(df.shape[0]*0.005)

encoder = ce.TargetEncoder(cols=cols,
                           min_samples_leaf=min_sample_leaf,smoothing=1)

In [117]:
encoder.fit(df.loc[:,~df.columns.isin(["transactionRevenue","log_transactionRevenue"])], df["log_transactionRevenue"])

TargetEncoder(cols=['operatingSystem', 'browser', 'region', 'city', 'country'],
       drop_invariant=False, handle_unknown='impute', impute_missing=True,
       min_samples_leaf=500, return_df=True, smoothing=1.0, verbose=0)

In [118]:
df_x = encoder.transform(df.loc[:,~df.columns.isin(["transactionRevenue","log_transactionRevenue"])])

In [119]:
for col in cols:
    df[col] = df_x[col]

In [120]:
def encoded_to_raw(frame, col, max_labels=10):
    """Map each encoded value of `col` to the raw labels behind it ('Others' if more than max_labels)."""
    mapping = {}
    for k,g in frame.groupby(by=col):
        raw_values = g['raw_'+col].unique()
        mapping[k] = list(raw_values) if len(raw_values)<=max_labels else ['Others']
    return mapping

# One reverse-lookup dictionary per target-encoded column
operatingSystem_dict = encoded_to_raw(df,'operatingSystem')
browser_dict = encoded_to_raw(df,'browser')
region_dict = encoded_to_raw(df,'region')
city_dict = encoded_to_raw(df,'city')
country_dict = encoded_to_raw(df,'country')

In [125]:
df.to_csv("train.csv",index=False)