# Modeling Plan

### Features
- Use `user_info`.
- Use the `actions` vector.
- Try action types, action details, and actions.
- Also try a tf-idf transformation.
- Use the `actions_sec` vector in the same way as the `actions` vector.
- Try combinations of the above.

### The best feature combination
(based on 5-fold CV accuracy)
- user_info
- actions
- without seconds
- without tf-idf 
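
A comparison like the one above can be sketched as a loop over candidate column sets, each scored with 5-fold cross-validation. This is only an illustration on synthetic data: the column names (`info_a`, `act_x`, ...) and the data itself are assumptions, not the real user/session table.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the real data; all column names are hypothetical.
rng = np.random.RandomState(0)
n = 400
demo = pd.DataFrame({
    'info_a': rng.randint(0, 3, n),
    'info_b': rng.randint(0, 5, n),
    'act_x': rng.poisson(2.0, n),
    'act_y': rng.poisson(1.0, n),
})
# The label depends on an action count only, so action columns should help.
demo_y = (demo['act_x'] + rng.normal(0, 1, n) > 2).astype(int)

feature_sets = {
    'info only': ['info_a', 'info_b'],
    'actions only': ['act_x', 'act_y'],
    'info + actions': ['info_a', 'info_b', 'act_x', 'act_y'],
}

# Mean 5-fold CV accuracy for each candidate feature set.
cv_results = {}
for name, cols in feature_sets.items():
    scores = cross_val_score(LogisticRegression(max_iter=1000),
                             demo[cols], demo_y, cv=5, scoring='accuracy')
    cv_results[name] = scores.mean()
    print('%-15s %.3f (std=%.3f)' % (name, scores.mean(), scores.std()))
```

Using the same `cv` and `scoring` for every feature set keeps the scores comparable.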

### Model
- We first use only user information and apply a decision tree to understand the relation between user information and booking. 
- Next, we combine user information with user activity vectors and apply a logistic regression. We choose a logistic regression because the feature space is high-dimensional (~340 columns), where the classes should be (relatively) easier to separate linearly. 
- We apply a random forest for feature analysis and comparison. 

# Summary

### A decision tree with user information
- We see little to no improvement over always guessing the majority class (about 71% test accuracy vs. the 69% base accuracy).

### A logistic regression with activity vectors.
- With user activity vectors, accuracy rises by around 8 percentage points over the base accuracy. 

### A random forest: feature analysis
- With a random forest, accuracy rises by around 9 percentage points over the base accuracy. 
- ageCat has the highest importance, consistent with the exploratory analysis. (missing age -> no booking)
- signup_method plays an important role, although we did not see this in the exploratory analysis. This needs further analysis. 
- Activities such as 'pending' and 'requested' are important features for booking analysis. They could be a good starting point for analyzing session data. 

### Using all data
- Overall, we achieve around 79% accuracy, a gain of about 10 percentage points over the base accuracy of 69%. 

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [2]:
df_origin = pd.read_csv('train_user_session_merged.csv')

In [3]:
df_origin.shape

(65136, 676)

# 1. Data Setup

## Preprocess

In [4]:
from sklearn.preprocessing import LabelEncoder

user_info = ['gender', 'ageCat','signup_method', 'signup_flow',
           'language', 'affiliate_channel', 'affiliate_provider',
           'signup_app', 'first_device_type', 'first_browser','signup_device']

for col in user_info:
    df_origin[col] = LabelEncoder().fit_transform(df_origin[col])

target = 'country_destination'

actions = pd.read_csv('actions.csv', header=None, names=['action']).action.tolist()
actions_sec = [action+'_sec' for action in actions]

## Sampling


In [5]:
n_sample = 20000
df = df_origin.sample(n_sample, random_state=0)

## Base Accuracy

In [6]:
# original data
y_cnt = df_origin.country_destination.value_counts()
y_cnt_percent = y_cnt/y_cnt.sum()*100
y_cnt_percent

NDF    69.149165
US     30.850835
Name: country_destination, dtype: float64

In [7]:
# sampled data
y_cnt = df.country_destination.value_counts()
y_cnt_percent = y_cnt/y_cnt.sum()*100
y_cnt_percent

NDF    69.2
US     30.8
Name: country_destination, dtype: float64

- The sampled data closely reflects the overall booking rate. 
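
The base accuracy is the accuracy of always predicting the most frequent class (NDF). One way to get the same number as a model score is scikit-learn's `DummyClassifier`. This is a minimal sketch on toy labels with roughly the class balance reported above; the `labels` array is an assumption, not the real target column.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Toy labels with roughly the reported class balance (69% NDF, 31% US).
labels = np.array(['NDF'] * 69 + ['US'] * 31)
# DummyClassifier ignores the features, so a placeholder column is enough.
placeholder_X = np.zeros((len(labels), 1))

# Always predict the most frequent class seen during fit.
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(placeholder_X, labels)
baseline_acc = baseline.score(placeholder_X, labels)
print('base accuracy: %.2f %%' % (baseline_acc * 100))
```

Any model in the following sections should beat this score to be considered useful.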

# 2. Modeling with user information using a decision tree. 

In [8]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import AdaBoostClassifier

from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
import time

def printFeatureImportance(clf, X,y):
    print('\n feature importance:')
    clf.fit(X, y)
    features = X.columns
    importances = clf.feature_importances_
    arg_sort = np.argsort(importances)[::-1]
    n = min(len(features), 10)
    for i in range(n):
        idx = arg_sort[i]
        print('%2d. %-*s %.2f %%' %(i+1,15, features[idx],importances[idx]*100))

def fitTree(X,y, estimator=DecisionTreeClassifier, param_grid=None, cv=5):
    t1 = time.time()
    
    X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.3)
    
    clf = estimator()
     
    if param_grid:
        gs = GridSearchCV(estimator=clf,
                    param_grid=param_grid,
                    scoring='accuracy',
                    cv=cv)
        gs = gs.fit(X_train, y_train)
        clf = gs.best_estimator_
        print('best param: ' + str(gs.best_params_))
        print('grid search time: %.2f sec.' %(time.time()-t1))
    
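    # Note: each call refits clones of clf with cv-fold cross-validation inside the
    # given split, so the "test" score is CV within the held-out 30% of the data,
    # not the accuracy of a model trained on X_train and scored on X_test.
    train_scores = cross_val_score(estimator=clf, X=X_train, y=y_train, cv=cv)
    test_scores = cross_val_score(estimator=clf, X=X_test, y=y_test, cv=cv)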
    
    print('train mean accuracy: %.2f %% (std=%.2f %%)' 
              %(np.mean(train_scores)*100, np.std(train_scores)*100))
    print('test mean accuracy: %.2f %% (std=%.2f %%)' 
              %(np.mean(test_scores)*100, np.std(test_scores)*100))
    
    printFeatureImportance(clf,X,y)
    print('\n total time: %.2f sec.' %(time.time()-t1))
    return clf

In [9]:
feature_columns = []
feature_columns += user_info

X = df[feature_columns]
y = df[target]

In [10]:
X.shape

(20000, 11)

In [11]:
fitTree(X,y)

train mean accuracy: 71.91 % (std=0.66 %)
test mean accuracy: 70.65 % (std=1.31 %)

 feature importance:
 1. ageCat          44.04 %
 2. signup_method   14.45 %
 3. gender          7.52 %
 4. affiliate_channel 7.50 %
 5. first_browser   7.04 %
 6. first_device_type 4.27 %
 7. affiliate_provider 4.09 %
 8. language        3.85 %
 9. signup_device   2.94 %
10. signup_flow     2.35 %

 total time: 0.46 sec.


DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')

- We see little to no improvement over always guessing the majority class (about 71% test accuracy vs. the 69% base accuracy).
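
`fitTree` reports cross-validated accuracy computed separately within the train and test splits. A complementary check is to fit once on the training split and score once on the held-out split, then compare against the majority-class share. The sketch below does this on synthetic data. The arrays, the 10% label noise, and `max_depth=3` are assumptions for illustration, not settings used in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic label-encoded features; the label depends on column 0 only.
rng = np.random.RandomState(0)
toy_X = rng.randint(0, 4, size=(1000, 5))
toy_y = (toy_X[:, 0] >= 2).astype(int)
# Flip about 10% of the labels to mimic noise.
toy_y = np.where(rng.rand(1000) < 0.1, 1 - toy_y, toy_y)

X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.3, random_state=0, stratify=toy_y)

# Fit on the training split only, then score once on the held-out split.
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)
train_acc = tree.score(X_tr, y_tr)
test_acc = tree.score(X_te, y_te)

# Majority-class share of the held-out split, for reference.
majority_acc = max(np.mean(y_te), 1 - np.mean(y_te))
print('train %.3f  test %.3f  majority %.3f' % (train_acc, test_acc, majority_acc))
```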

# 3. Modeling with user activity vectors using a logistic regression

In [12]:
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.preprocessing import OneHotEncoder

from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
import time

def fitLR(X,y, estimator=LogisticRegression, categorical_features=[], param_grid=None, cv=5, normalize=False, tfidf=False):
    t1 = time.time()
    
    X = X.copy()
    numeric_features = [col for idx, col in enumerate(X.columns) if idx not in categorical_features]
    if normalize and numeric_features:
        stds = StandardScaler()
        X[numeric_features] = stds.fit_transform(X[numeric_features])
    
    if tfidf and numeric_features:
        tfidf = TfidfTransformer()
        X[numeric_features] = tfidf.fit_transform(X[numeric_features]).toarray()
    
    X_ohe = X
    if categorical_features:
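        # Note: the categorical_features argument was removed from OneHotEncoder in
        # scikit-learn 0.22, so this call requires an older release.
        ohe = OneHotEncoder(categorical_features=categorical_features)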
        X_ohe = ohe.fit_transform(X).toarray()
    
    X_train, X_test, y_train, y_test = train_test_split(X_ohe,y, test_size=0.3)
    
    clf = estimator()
    
    if param_grid:
        gs = GridSearchCV(estimator=clf,
                    param_grid=param_grid,
                    scoring='accuracy',
                    cv=cv)
        gs = gs.fit(X_train, y_train)
        clf = gs.best_estimator_
        print('best param: ' + str(gs.best_params_))
        print('grid search time: %.2f sec.' %(time.time()-t1))
    
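    # Note: each call refits clones of clf with cv-fold cross-validation inside the
    # given split, so the "test" score is CV within the held-out 30% of the data,
    # not the accuracy of a model trained on X_train and scored on X_test.
    train_scores = cross_val_score(estimator=clf, X=X_train, y=y_train, cv=cv)
    test_scores = cross_val_score(estimator=clf, X=X_test, y=y_test, cv=cv)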
    
    print('train mean accuracy: %.2f %% (std=%.2f %%)' 
              %(np.mean(train_scores)*100, np.std(train_scores)*100))
    print('test mean accuracy: %.2f %% (std=%.2f %%)' 
              %(np.mean(test_scores)*100, np.std(test_scores)*100))
    
    print('\n total time: %.2f sec.' %(time.time()-t1))
    return clf

In [13]:
feature_columns = []
feature_columns += user_info
feature_columns += actions

X = df[feature_columns]
y = df[target]

In [14]:
X.shape

(20000, 340)

## Logistic Regression

In [15]:
fitLR(X, y,categorical_features=[0,1,2,3,4,5,6,7,8,9,10])

train mean accuracy: 78.65 % (std=0.47 %)
test mean accuracy: 77.48 % (std=0.70 %)

 total time: 14.62 sec.


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

## Logistic Regression with Normalization

In [16]:
fitLR(X, y,categorical_features=[0,1,2,3,4,5,6,7,8,9,10], normalize=True)

train mean accuracy: 78.54 % (std=0.21 %)
test mean accuracy: 76.60 % (std=1.26 %)

 total time: 40.68 sec.


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

## Logistic Regression with tf-idf Transformation

In [17]:
fitLR(X, y,categorical_features=[0,1,2,3,4,5,6,7,8,9,10], tfidf=True)

train mean accuracy: 78.60 % (std=0.42 %)
test mean accuracy: 77.65 % (std=0.68 %)

 total time: 9.56 sec.


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

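- With user activity vectors, accuracy rises by around 8 percentage points over the base accuracy. Neither normalization nor the tf-idf transformation changes the result noticeably. 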
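
`fitLR` one-hot encodes the categorical columns through the `categorical_features` argument of `OneHotEncoder`, which is not available in recent scikit-learn releases. A rough equivalent, assuming scikit-learn 0.20 or later, routes columns through a `ColumnTransformer` inside a `Pipeline`. The data and the column names (`cat_a`, `act_x`, ...) below are hypothetical stand-ins for `user_info` and `actions`.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in: two label-encoded categorical columns, two count columns.
rng = np.random.RandomState(0)
n = 300
toy = pd.DataFrame({
    'cat_a': rng.randint(0, 3, n),
    'cat_b': rng.randint(0, 4, n),
    'act_x': rng.poisson(2.0, n),
    'act_y': rng.poisson(1.0, n),
})
toy_target = (toy['act_x'] > toy['act_y']).astype(int)

categorical_cols = ['cat_a', 'cat_b']
count_cols = ['act_x', 'act_y']

# One-hot encode the categorical columns and pass the counts through unchanged.
# 'passthrough' could be swapped for StandardScaler() or TfidfTransformer()
# to mirror the normalize / tfidf options of fitLR.
preprocess = ColumnTransformer([
    ('onehot', OneHotEncoder(handle_unknown='ignore'), categorical_cols),
    ('counts', 'passthrough', count_cols),
])
model = Pipeline([('prep', preprocess),
                  ('lr', LogisticRegression(max_iter=1000))])

pipe_scores = cross_val_score(model, toy, toy_target, cv=5, scoring='accuracy')
print('mean accuracy: %.3f (std=%.3f)' % (pipe_scores.mean(), pipe_scores.std()))
```

Because the preprocessing lives inside the pipeline, it is refit on the training part of each fold rather than on the full data.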

# 3. Advanced Algorithms

## Random Forest

In [18]:
param_grid = [{'n_estimators':[20,50], 'max_features':[50,100,150], 'max_depth':[5,10,15]}]
fitTree(X,y, estimator=RandomForestClassifier, param_grid=param_grid)

best param: {'n_estimators': 20, 'max_features': 150, 'max_depth': 10}
grid search time: 164.69 sec.
train mean accuracy: 78.91 % (std=0.65 %)
test mean accuracy: 77.98 % (std=0.74 %)

 feature importance:
 1. ageCat          25.08 %
 2. pending         12.98 %
 3. signup_method   7.38 %
 4. requested       7.34 %
 5. missing         6.04 %
 6. gender          3.68 %
 7. verify          3.26 %
 8. manage_listing  1.79 %
 9. create          1.21 %
10. show            1.16 %

 total time: 177.45 sec.


RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=10, max_features=150, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=20, n_jobs=1, oob_score=False, random_state=None,
            verbose=0, warm_start=False)

- With a random forest, accuracy rises by around 9 percentage points over the base accuracy. 
- ageCat has the highest importance, consistent with the exploratory analysis. (missing age -> no booking)
- signup_method plays an important role, although we did not see this in the exploratory analysis. This needs further analysis. 
- Activities such as 'pending' and 'requested' are important features for booking analysis. They could be a good starting point for analyzing session data. 
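
As one possible follow-up on findings like the signup_method one, the impurity-based importances printed above can be cross-checked with permutation importance on a held-out split. This technique is not used in this notebook; the sketch assumes scikit-learn 0.22 or later and runs on synthetic data with hypothetical column names.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: only the 'signal' column determines the label.
rng = np.random.RandomState(0)
n = 600
toy_feats = pd.DataFrame({
    'signal': rng.randint(0, 4, n),
    'noise_a': rng.randint(0, 4, n),
    'noise_b': rng.rand(n),
})
toy_lab = (toy_feats['signal'] >= 2).astype(int)

F_tr, F_te, l_tr, l_te = train_test_split(
    toy_feats, toy_lab, test_size=0.3, random_state=0)
forest = RandomForestClassifier(
    n_estimators=50, max_depth=5, random_state=0).fit(F_tr, l_tr)

# Shuffle one column at a time on the held-out split and record the accuracy drop.
perm = permutation_importance(forest, F_te, l_te, n_repeats=10, random_state=0)
ranked = sorted(zip(toy_feats.columns, perm.importances_mean),
                key=lambda pair: pair[1], reverse=True)
for name, drop in ranked:
    print('%-10s %.3f' % (name, drop))
```

A feature that ranks high on both measures is less likely to be an artifact of how impurity-based importance is computed.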

# 4. Using All Data

- Finally, we test the models on all ~65K rows for comparison.

In [19]:
feature_columns = []
feature_columns += user_info
feature_columns += actions

X = df_origin[feature_columns]
y = df_origin[target]

In [20]:
fitLR(X, y,categorical_features=[0,1,2,3,4,5,6,7,8,9,10])

train mean accuracy: 78.85 % (std=0.20 %)
test mean accuracy: 78.92 % (std=0.52 %)

 total time: 79.20 sec.


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [21]:
param_grid = [{'n_estimators':[50], 'max_features':[100], 'max_depth':[10]}]
fitTree(X,y, estimator=RandomForestClassifier, param_grid=param_grid)

best param: {'n_estimators': 50, 'max_features': 100, 'max_depth': 10}
grid search time: 59.15 sec.
train mean accuracy: 80.03 % (std=0.42 %)
test mean accuracy: 79.10 % (std=0.48 %)

 feature importance:
 1. ageCat          27.14 %
 2. pending         14.37 %
 3. missing         9.02 %
 4. requested       7.83 %
 5. signup_method   5.98 %
 6. gender          5.12 %
 7. verify          2.29 %
 8. create          1.73 %
 9. at_checkpoint   1.68 %
10. travel_plans_current 1.37 %

 total time: 140.53 sec.


RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=10, max_features=100, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=50, n_jobs=1, oob_score=False, random_state=None,
            verbose=0, warm_start=False)

- We now get an increase of around 10 percentage points in accuracy over the base accuracy from either model. 
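
To make the size of the gain concrete, the short calculation below takes the base accuracy and the two test accuracies reported in this notebook and expresses the improvement both in percentage points and as a relative reduction in error rate. The helper `gains` is a hypothetical name introduced only for this illustration.

```python
base_acc = 69.15   # majority-class (NDF) share on all data, in percent
lr_acc = 78.92     # logistic regression test accuracy on all data
rf_acc = 79.10     # random forest test accuracy on all data

def gains(acc, base=base_acc):
    """Return (percentage-point gain, relative error reduction in percent)."""
    point_gain = acc - base
    error_reduction = point_gain / (100.0 - base) * 100.0
    return point_gain, error_reduction

lr_gain, lr_err_red = gains(lr_acc)
rf_gain, rf_err_red = gains(rf_acc)
print('LR: +%.2f points, %.1f %% fewer errors' % (lr_gain, lr_err_red))
print('RF: +%.2f points, %.1f %% fewer errors' % (rf_gain, rf_err_red))
```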