This Kaggle dataset covers customer conversion on the Google Merchandise Store (also known as GStore), where Google swag is sold. The main purpose of this analysis is to predict revenue per customer and to make recommendations on promotional strategies. The main technical challenge in predicting revenue is the presence of multiple high-cardinality categorical features. Through careful data exploration, followed by a considered choice of feature treatments and machine learning algorithm, I show that a solution based on feature engineering and extreme gradient-boosted decision trees reaches a predictive power of 0.997, as measured by the area under the precision-recall curve. Crucially, these results were obtained without artificial balancing of the data, which makes this approach suitable for real-world applications.

<a id='top'></a>
#### Outline: 
#### 0. <a href='#Sampling'>Random sampling of training set</a>


#### 1. <a href='#clean'>Data Cleaning</a>
11. <a href='#cleanID'>Cleaning of IDs</a>
12. <a href='#cleanTotals'>Cleaning of Variable Totals</a>
13. <a href='#cleanTS'>Cleaning of Time Series Variables</a>
14. <a href='#cleanLoc'>Cleaning of Location Variables</a>
15. <a href='#cleanDev'>Cleaning of Device Variables</a>
16. <a href='#cleanCust'>Cleaning of Custom Dimension Variables</a>


#### 2. <a href='#feature-eng'>Feature Engineering</a>
21. <a href='#dropUnif'>Drop uninformative categorical variables</a>
22. <a href='#dropMissing'>Drop variables with too many missing values</a>
23. <a href='#encodeCat'>Encode Categorical Variables</a>
24. <a href='#encodeCat1'>Encode Network Domain</a>
25. <a href='#encodeCat2'>Encode Operating Systems</a>

#### 3. <a href='#EDA'>Exploratory Data Analysis</a>

#### 4. <a href='#ML'>Machine Learning to Predict Transactions</a>
41. <a href='#rfEnum'>Random Forest with Enum Encoding</a>
42. <a href='#gbmEnum'>GBM with Enum Encoding</a>
43. <a href='#gbmTarget'>GBM with sort_by_response Encoding</a>
44. <a href='#gbmTargetDev'>GBM with Target Encoding Channel/Device Analysis</a>


#### 5. <a href='#visualization'>Visualization</a>
51. <a href='#varimp'>Variable Importance Plot</a>
52. <a href='#pdp'>Partial Dependency Plot</a>
53. <a href='#pdp2dim'>Two Variable Partial Dependency Plot</a>
54. <a href='#Treeplot'>Major Decision Trees Plot</a>

#### 6. <a href='#Out-Of-Sample Walk Forward Testing'>Out-Of-Sample Walk Forward Testing</a>


#### 7. <a href='#conclusion'>Conclusion</a>

# 4. Machine Learning to Predict Transactions

In [5]:
import pandas as pd  # used below for pd.read_csv and pd.DataFrame
import h2o
from h2o.estimators.glm import H2OGeneralizedLinearEstimator
from h2o.estimators.gbm import H2OGradientBoostingEstimator
from h2o.estimators.random_forest import H2ORandomForestEstimator
from h2o.estimators.deeplearning import H2ODeepLearningEstimator
from h2o.grid.grid_search import H2OGridSearch
# Start (or connect to) a local H2O cluster
h2o.init()

In [15]:
Path = "train.csv"
df = pd.read_csv(Path)

In [135]:
names_y = "log_transactionRevenue"
ig_cols = ['fullVisitorId','visitId','raw_networkDomain','transactionRevenue','raw_operatingSystem','raw_browser',
          'raw_country','raw_region','raw_city','pre_visitStartTime','raw_visitStartTime','visitNumber']
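
The target `log_transactionRevenue` is prepared earlier in the notebook and its exact definition is not shown in this section. As a rough illustration only, the sketch below assumes a `log1p` transform of the raw revenue, which keeps zero-revenue sessions at exactly 0 and can be inverted with `expm1`. The helper names and the sample values are hypothetical.

```python
import math

def to_log_revenue(revenue):
    """Map raw revenue to the log scale (assumed transform: log(1 + x))."""
    return math.log1p(revenue)

def from_log_revenue(log_revenue):
    """Invert the assumed transform to get back to the revenue scale."""
    return math.expm1(log_revenue)

# Hypothetical raw revenues: most sessions have no purchase at all
raw = [0.0, 25.0, 1200.0]
logged = [to_log_revenue(r) for r in raw]
restored = [from_log_revenue(v) for v in logged]

for r, v in zip(raw, logged):
    print("revenue={0:.1f} -> log scale={1:.3f}".format(r, v))
```

If the notebook used a plain `log` or a different offset, predictions would need the matching inverse before being read as revenue.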

In [136]:
given_types = {'channelGrouping':'enum', 'fullVisitorId':'numeric', 'visitId':'numeric', 'visitStartTime':'enum', 
               'hits':'numeric','pageviews':'numeric', 'timeOnSite':'numeric', 'newVisits':'enum', 
               'sessionQualityDim':'numeric','transactionRevenue':'numeric', 'bounces':'enum', 'week':'enum', 
               'day_of_week':'enum', 'continent':'enum','subContinent':'enum', 'raw_country':'enum', 'country':'numeric',
               'raw_region':'enum','region':'numeric','raw_city':'enum', 'city':'numeric', 
               'raw_networkDomain':'enum', 'raw_browser':'enum','browser':'numeric','operatingSystem':"numeric",
               'deviceCategory':'enum', 'value':'enum','networkDomain':'numeric',"log_transactionRevenue":"numeric",
              "raw_operatingSystem":"enum",'pre_visitStartTime':"time",'diff_lastVisitTime':"numeric",
              "raw_visitStartTime":"time","visitNumber":"numeric",'nvisits':'numeric'}

In [137]:
train = h2o.H2OFrame(df,column_types=given_types)

Parse progress: |█████████████████████████████████████████████████████████| 100%


In [138]:
types = train.types

In [139]:
cat = []
for key,value in types.items():
    if value == 'enum':
        cat.append(key)
print(cat)

['channelGrouping', 'visitStartTime', 'bounces', 'newVisits', 'week', 'day_of_week', 'continent', 'subContinent', 'deviceCategory', 'value', 'raw_networkDomain', 'raw_operatingSystem', 'raw_browser', 'raw_region', 'raw_city', 'raw_country']


In [140]:
# Collect the remaining (non-enum) columns: numeric and time types
non_cat = []
for key, value in types.items():
    if value != 'enum':
        non_cat.append(key)
print(non_cat)

['fullVisitorId', 'visitId', 'visitNumber', 'hits', 'pageviews', 'sessionQualityDim', 'timeOnSite', 'transactionRevenue', 'log_transactionRevenue', 'raw_visitStartTime', 'nvisits', 'pre_visitStartTime', 'diff_lastVisitTime', 'country', 'region', 'city', 'networkDomain', 'browser', 'operatingSystem']


In [141]:
names_x = list(types.keys())
names_x.remove("log_transactionRevenue")              
names_x = [i for i in names_x if i not in ig_cols ]

In [144]:
train,test,te_holdout,valid = train.split_frame(ratios=[0.7,0.1,0.1])
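
Passing three ratios to `split_frame` yields four frames: the last one (`valid` here) receives the remaining share, about 0.1. To my understanding the split is probabilistic, so the sizes are approximate rather than exact, and `split_frame` accepts a `seed` argument if a reproducible split is wanted. The sketch below mimics that behaviour in plain Python; `probabilistic_split` is a hypothetical helper, not H2O's implementation.

```python
import random

def probabilistic_split(n_rows, ratios, seed=42):
    """Assign each row index to one of len(ratios) + 1 parts.

    Each row draws a uniform number and falls into the first part whose
    cumulative ratio exceeds it; rows beyond the last ratio go to the
    remainder part.
    """
    rng = random.Random(seed)
    thresholds = []
    total = 0.0
    for r in ratios:
        total += r
        thresholds.append(total)

    parts = [[] for _ in range(len(ratios) + 1)]
    for i in range(n_rows):
        u = rng.random()
        for k, t in enumerate(thresholds):
            if u < t:
                parts[k].append(i)
                break
        else:
            # No threshold exceeded u: the row belongs to the remainder
            parts[-1].append(i)
    return parts

parts = probabilistic_split(10000, [0.7, 0.1, 0.1])
sizes = [len(p) for p in parts]
print(sizes)  # four parts, roughly 70% / 10% / 10% / 10%
```

Note that `te_holdout` is created by the split but is not used by the models in this section.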

<a id='rfEnum'></a>
### 4.1 Random Forest with Enum Encoding
<a href='#top'>back to top</a>

In [None]:
drf0 = H2ORandomForestEstimator(ntrees=300, max_depth=8, nfolds=8,nbins_cats=10)

In [None]:
drf0.train(x=names_x,y=names_y,
             training_frame=train,
             validation_frame=valid)

In [None]:
train_mse_drf = drf0.model_performance(train.rbind(valid)).mse()
test_mse_drf  = drf0.model_performance(test).mse()

In [None]:
print("train_mse: {0:.3f}".format(train_mse_drf))
print("test_mse: {0:.3f}".format(test_mse_drf))

In [None]:
drf0.varimp_plot()

<a id='gbmEnum'></a>
### 4.2 GBM with Enum Encoding
<a href='#top'>back to top</a>

train test error plot

In [328]:
gbm_params = {'learn_rate': [0.01,0.05,0.1],
                'max_depth': [4,6],
              'nbins_cats': [8,12],
              'ntrees':[50,100,300,500],
              'col_sample_rate':[0.6,1]}
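
Before launching a Cartesian grid search it can help to count how many models it implies. The sketch below enumerates the same hyperparameter values with `itertools.product`; `grid_space` is a copy made for illustration, and the fit count assumes each combination trains one model per fold plus one final model when `nfolds=5`.

```python
from itertools import product

# Same values as gbm_params, copied so this cell is self-contained
grid_space = {'learn_rate': [0.01, 0.05, 0.1],
              'max_depth': [4, 6],
              'nbins_cats': [8, 12],
              'ntrees': [50, 100, 300, 500],
              'col_sample_rate': [0.6, 1]}

# Every combination of one value per hyperparameter
combos = [dict(zip(grid_space, values))
          for values in product(*grid_space.values())]

n_models = len(combos)
nfolds = 5
n_fits = n_models * (nfolds + 1)  # assumed: nfolds CV models + 1 final model

print("grid size: {0}".format(n_models))
print("model fits with {0}-fold CV: {1}".format(nfolds, n_fits))
```

With 96 combinations the full grid is expensive, which is one reason to consider a random search or a smaller grid on a sample first.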

In [329]:
gbm_grid = H2OGridSearch(model=H2OGradientBoostingEstimator,
                          grid_id='gbm_grid',
                          hyper_params=gbm_params)

In [331]:
gbm_grid.train(x=names_x, y=names_y,
                training_frame=train,
                validation_frame=valid,
                sample_rate=0.6,nfolds=5)

gbm Grid Build progress: |████████████████████████████████████████████████| 100%


In [332]:
gbm_gridperf = gbm_grid.get_grid(decreasing=True)

In [333]:
gbm_gridperf

      col_sample_rate learn_rate max_depth nbins_cats ntrees  \
0                 0.6       0.01         6         12    300   
1                 0.6       0.01         6          8    300   
2                 0.6       0.01         6         12    500   
3                 0.6       0.05         6          8     50   
4                 0.6       0.01         6          8    500   
5                 0.6       0.05         6         12     50   
6                 1.0       0.01         6          8    300   
7                 1.0       0.01         6         12    300   
8                 1.0       0.05         6          8     50   
9                 1.0       0.05         6         12     50   
10                0.6       0.05         6          8    100   
11                1.0       0.01         6         12    500   
12                0.6       0.05         6         12    100   
13                1.0       0.01         6          8    500   
14                0.6       0.01        



In [198]:
gbm0 = H2OGradientBoostingEstimator(ntrees=300,max_depth=6,nfolds=5,nbins_cats=12,
                                    learn_rate=0.01,sample_rate=0.6,col_sample_rate=0.6,distribution='poisson')

In [199]:
gbm0.train(x=names_x,y=names_y,
             training_frame=train,
             validation_frame=valid)

gbm Model Build progress: |███████████████████████████████████████████████| 100%


In [200]:
gbm0.model_performance(train.rbind(valid))


ModelMetricsRegression: gbm
** Reported on test data. **

MSE: 1.8541677016666736
RMSE: 1.3616782665764602
MAE: 0.17704177101781846
RMSLE: 0.23306538961192025
Mean Residual Deviance: -0.1294195877979888




In [201]:
gbm0.model_performance(test)


ModelMetricsRegression: gbm
** Reported on test data. **

MSE: 2.4314480603178064
RMSE: 1.5593101232012208
MAE: 0.20730831703993052
RMSLE: 0.27055235507974545
Mean Residual Deviance: 0.5737893493989076




In [189]:
train_mse_gbm = gbm0.model_performance(train.rbind(valid))
test_mse_gbm = gbm0.model_performance(test)

In [190]:
train_mse_gbm


ModelMetricsRegression: gbm
** Reported on test data. **

MSE: 2.1419715617089086
RMSE: 1.463547594616898
MAE: 0.2012081784290787
RMSLE: 0.25185820178603613
Mean Residual Deviance: 36461.70262570637




In [191]:
test_mse_gbm


ModelMetricsRegression: gbm
** Reported on test data. **

MSE: 2.484945455996547
RMSE: 1.5763709766411418
MAE: 0.2185291873739998
RMSLE: 0.27072966501444745
Mean Residual Deviance: 272428.9997538041




In [188]:
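# train_mse_gbm and test_mse_gbm hold metrics objects here, so extract the MSE before formatting
print("train_mse: {0:.3f}".format(train_mse_gbm.mse()))
print("test_mse: {0:.3f}".format(test_mse_gbm.mse()))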

train_mse: 2.142
test_mse: 2.485


In [147]:
train_mse_gbm = gbm0.model_performance(train.rbind(valid)).mse()
test_mse_gbm = gbm0.model_performance(test).mse()

In [148]:
print("train_mse: {0:.3f}".format(train_mse_gbm))
print("test_mse: {0:.3f}".format(test_mse_gbm))

train_mse: 1.804
test_mse: 2.273


In [None]:
# Return the variable importances directly as a pandas DataFrame with named columns
gbm0.varimp(use_pandas=True)

In [149]:
names_x.remove("diff_lastVisitTime")

In [150]:
names_x.remove('nvisits')

In [151]:
gbm1 = H2OGradientBoostingEstimator(ntrees=300,max_depth=6,nfolds=5,nbins_cats=12,
                                    learn_rate=0.01,sample_rate=0.6,col_sample_rate=0.6)

In [152]:
gbm1.train(x=names_x,y=names_y,
             training_frame=train,
             validation_frame=valid)

gbm Model Build progress: |███████████████████████████████████████████████| 100%


In [153]:
train_mse_gbm = gbm1.model_performance(train.rbind(valid)).mse()
test_mse_gbm = gbm1.model_performance(test).mse()

In [154]:
print("train_mse: {0:.3f}".format(train_mse_gbm))
print("test_mse: {0:.3f}".format(test_mse_gbm))

train_mse: 1.808
test_mse: 2.300


In [155]:
names_x.remove("hits")

In [156]:
names_x.remove('continent')

In [157]:
gbm2 = H2OGradientBoostingEstimator(ntrees=300,max_depth=6,nfolds=5,nbins_cats=12,
                                    learn_rate=0.01,sample_rate=0.6,col_sample_rate=0.6)

In [158]:
gbm2.train(x=names_x,y=names_y,
             training_frame=train,
             validation_frame=valid)

gbm Model Build progress: |███████████████████████████████████████████████| 100%


In [159]:
train_mse_gbm = gbm2.model_performance(train.rbind(valid)).mse()
test_mse_gbm = gbm2.model_performance(test).mse()

In [160]:
print("train_mse: {0:.3f}".format(train_mse_gbm))
print("test_mse: {0:.3f}".format(test_mse_gbm))

train_mse: 1.838
test_mse: 2.289


In [161]:
names_x.append("diff_lastVisitTime")

In [162]:
names_x.append('nvisits')

In [163]:
gbm3 = H2OGradientBoostingEstimator(ntrees=300,max_depth=6,nfolds=5,nbins_cats=12,
                                    learn_rate=0.01,sample_rate=0.6,col_sample_rate=0.6)

In [164]:
gbm3.train(x=names_x,y=names_y,
             training_frame=train,
             validation_frame=valid)

gbm Model Build progress: |███████████████████████████████████████████████| 100%


In [165]:
train_mse_gbm = gbm3.model_performance(train.rbind(valid)).mse()
test_mse_gbm = gbm3.model_performance(test).mse()

In [166]:
print("train_mse: {0:.3f}".format(train_mse_gbm))
print("test_mse: {0:.3f}".format(test_mse_gbm))

train_mse: 1.826
test_mse: 2.265


XGBoost
mean absolute errors

<a id='gbmTarget'></a>
### 4.3 GBM with sort_by_response Encoding
<a href='#top'>back to top</a>
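
With `categorical_encoding='sort_by_response'`, the levels of each categorical variable are reordered by their mean response, so the level with the lowest mean maps to 0, the next lowest to 1, and so on. The sketch below illustrates that idea in plain Python on made-up data. It is not H2O's implementation, and `sort_by_response_codes`, the domain names and the revenue values are all hypothetical.

```python
from collections import defaultdict

def sort_by_response_codes(levels, response):
    """Map each level to an integer rank based on its mean response."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for level, y in zip(levels, response):
        sums[level] += y
        counts[level] += 1
    means = {level: sums[level] / counts[level] for level in sums}
    # Lowest mean response -> 0; ties are broken by level name
    ordered = sorted(means, key=lambda level: (means[level], level))
    return {level: code for code, level in enumerate(ordered)}

# Toy data: a categorical column and a log-revenue response
domains = ["comcast.net", "(not set)", "google.com",
           "comcast.net", "google.com", "(not set)"]
log_rev = [0.0, 0.0, 4.2, 1.5, 3.8, 0.2]

codes = sort_by_response_codes(domains, log_rev)
encoded = [codes[d] for d in domains]
print(codes)
print(encoded)
```

Because the resulting codes are ordered by the target, a tree can separate high-revenue levels from low-revenue ones with a single threshold split, which is helpful for high-cardinality columns such as the network domain.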

In [172]:
gbm4 = H2OGradientBoostingEstimator(ntrees=300,max_depth=6,nfolds=5,nbins_cats=12,
                                    learn_rate=0.01,sample_rate=0.6,col_sample_rate=0.6,
                                    categorical_encoding='sort_by_response')

In [173]:
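# Swap the pre-encoded numeric columns for their raw categorical versions
names_x.remove("networkDomain")
names_x.append("raw_networkDomain")
names_x.remove("operatingSystem")
names_x.append("raw_operatingSystem")
# diff_lastVisitTime and nvisits were already re-added before training gbm3;
# append them only if missing so names_x holds no duplicate columns
for col in ("diff_lastVisitTime", "nvisits"):
    if col not in names_x:
        names_x.append(col)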

In [174]:
gbm4.train(x=names_x,y=names_y,
             training_frame=train,
             validation_frame=valid)

gbm Model Build progress: |███████████████████████████████████████████████| 100%


In [175]:
train_mse_gbm = gbm4.model_performance(train.rbind(valid)).mse()
test_mse_gbm = gbm4.model_performance(test).mse()

In [176]:
print("train_mse: {0:.3f}".format(train_mse_gbm))
print("test_mse: {0:.3f}".format(test_mse_gbm))

train_mse: 1.783
test_mse: 2.242


In [None]:
# Plot importances for the sort_by_response model trained in this section
gbm4.varimp_plot()

<a id='gbmTargetDev'></a>
### 4.4 GBM with Target Encoding Channel/Device Analysis
<a href='#top'>back to top</a>

In [None]:
sessionQuality = ['pageviews','sessionQualityDim','timeOnSite','hits']
names_x_sub = [i for i in names_x if i not in sessionQuality ]

In [None]:
gbm2 = H2OGradientBoostingEstimator(ntrees=300,max_depth=5,nfolds=5,nbins_cats=5,learn_rate=0.01,sample_rate=0.67,
                                   categorical_encoding='sort_by_response')

In [None]:
gbm2.train(x=names_x_sub,y=names_y,
             training_frame=train,
             validation_frame=valid)

In [None]:
train_mse_gbm = gbm2.model_performance(train.rbind(valid)).mse()
test_mse_gbm = gbm2.model_performance(test).mse()

In [None]:
prediction = gbm2.predict(test)

In [None]:
print("train_mse: {0:.3f}".format(train_mse_gbm))
print("test_mse: {0:.3f}".format(test_mse_gbm))

In [None]:
gbm2.varimp_plot()

##### Mac OS
path=""
mojo_file_name = 'gbm2.zip'
modelfile = gbm2.download_mojo(mojo_file_name,get_genmodel_jar=False)
print("Model saved to " + modelfile)

In [42]:
path = 'C://Users//XuL//Desktop//kaggle'
mojo_file_name = '//model//gbm0.zip'
modelfile = gbm0.download_mojo(path+mojo_file_name,get_genmodel_jar=False)
print("Model saved to " + modelfile)

Model saved to C:\Users\XuL\Desktop\kaggle\model\gbm0.zip
