This is a hackathon project for AmExpert 2018 - Machine Learning (ML), hosted by Analytics Vidhya and sponsored by American Express. The goal is to predict the probability that an ad is clicked. Each prediction is based on a specific web session belonging to a specific user, together with the user's characteristics and the parameters categorizing the displayed product.

### Problem Statement

Recent years have witnessed a surge in the number of internet-savvy users. Companies in the financial services domain leverage this huge internet traffic arriving at their interface by strategically placing ads/promotions for cross-selling of various financial products on a plethora of web pages. The digital analytics unit of Best Cards Company uses cutting-edge data science and machine learning for successful promotion of its valuable card products. They believe that a predictive model that forecasts whether a session involves a click on the ad/promotion would help them extract the maximum out of the huge clickstream data that they have collected. Your job as a consultant is to build an efficient model that predicts whether a user will click on an ad, given the following data:

* **Clickstream data/train data for duration: (2nd July 2017 – 7th July 2017)**
* **Test data for duration: (8th July 2017 – 9th July 2017)**
* **User features (demographics, user behaviour/activity, buying power etc.)**
* **Historical transactional data of the previous month with timestamp info (28th May 2017 – 1st July 2017) (User views/interest registered)**
* **Ad features (product category, webpage, campaign for ad etc.)**
* **Date time features (exact timestamp of the user session)**

Variable | Definition
--- | ---
session_id	| Unique ID for a session
DateTime	| Timestamp
user_id	| Unique ID for user
product	| Product ID
campaign_id	| Unique ID for ad campaign
webpage_id	| Webpage ID at which the ad is displayed
product_category_1	| Product category 1 (Ordered)
product_category_2	| Product category 2
user_group_id	| Customer segmentation ID
gender	| Gender of the user
age_level	| Age level of the user
user_depth	| Interaction level of user with the web platform (1 - low, 2 - medium, 3 - High)
city_development_index	| Scaled development index of the residence city
var_1	| Anonymised session feature
is_click	| 0 - no click, 1 - click


In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import datetime

%matplotlib inline
sns.set(style='white', context='notebook', palette='deep')

import warnings
warnings.filterwarnings("ignore")

from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from xgboost.sklearn import XGBClassifier
import lightgbm as lgb

from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV

from keras.models import Sequential
from keras.layers import Dense
from keras.utils import to_categorical
from keras.callbacks import EarlyStopping

from catboost import CatBoostClassifier


Using TensorFlow backend.


In [2]:
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
combine = pd.concat([train.drop('is_click', axis=1), test])
log = pd.read_csv('historical_user_logs.csv')
click = train['is_click']

In [247]:
train.head()

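index | session_id | DateTime | user_id | product | campaign_id | webpage_id | product_category_1 | product_category_2 | user_group_id | gender | age_level | user_depth | city_development_index | var_1 | is_click
--- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | ---
0 | 140690 | 2017-07-02 00:00 | 858557 | C | 359520 | 13787 | 4 | NaN | 10.0 | Female | 4.0 | 3.0 | 3.0 | 0 | 0
1 | 333291 | 2017-07-02 00:00 | 243253 | C | 105960 | 11085 | 5 | NaN | 8.0 | Female | 2.0 | 2.0 | NaN | 0 | 0
2 | 129781 | 2017-07-02 00:00 | 243253 | C | 359520 | 13787 | 4 | NaN | 8.0 | Female | 2.0 | 2.0 | NaN | 0 | 0
3 | 464848 | 2017-07-02 00:00 | 1097446 | I | 359520 | 13787 | 3 | NaN | 3.0 | Male | 3.0 | 3.0 | 2.0 | 1 | 0
4 | 90569 | 2017-07-02 00:01 | 663656 | C | 405490 | 60305 | 3 | NaN | 2.0 | Male | 2.0 | 3.0 | 2.0 | 1 | 0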


In [248]:
train.describe()

statistic | session_id | user_id | campaign_id | webpage_id | product_category_1 | product_category_2 | user_group_id | age_level | user_depth | city_development_index | var_1 | is_click
--- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | ---
count | 463291.0 | 463291.0 | 463291.0 | 463291.0 | 463291.0 | 97437.0 | 445048.0 | 445048.0 | 445048.0 | 338162.0 | 463291.0 | 463291.0
mean | 285544.090725 | 546049.7 | 308474.540069 | 29685.878994 | 3.072427 | 162753.345105 | 3.477396 | 2.782266 | 2.878415 | 2.557121 | 0.422169 | 0.067627
std | 168577.345887 | 329462.5 | 126517.101294 | 21542.053106 | 1.304233 | 78743.74272 | 2.412889 | 1.069701 | 0.40013 | 0.921345 | 0.493906 | 0.251105
min | 2.0 | 4.0 | 82320.0 | 1734.0 | 1.0 | 18595.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0
25% | 137856.5 | 257855.0 | 118601.0 | 13787.0 | 2.0 | 82527.0 | 2.0 | 2.0 | 3.0 | 2.0 | 0.0 | 0.0
50% | 285429.0 | 531801.0 | 359520.0 | 13787.0 | 3.0 | 146115.0 | 3.0 | 3.0 | 3.0 | 2.0 | 0.0 | 0.0
75% | 435535.5 | 827849.0 | 405490.0 | 53587.0 | 4.0 | 254132.0 | 4.0 | 3.0 | 3.0 | 3.0 | 1.0 | 0.0
max | 595812.0 | 1141729.0 | 414149.0 | 60305.0 | 5.0 | 450184.0 | 12.0 | 6.0 | 3.0 | 4.0 | 1.0 | 1.0


In [249]:
na_percent = train.isnull().sum(axis=0)/train.shape[0]
na_total = train.isnull().sum()
unique_values = train.apply(lambda x: x.nunique())

pd.concat([na_total,na_percent,unique_values], axis=1).rename(columns={0:'missing_total',1:'missing_percent',2:'unique_values'})

column | missing_total | missing_percent | unique_values
--- | --- | --- | ---
session_id | 0 | 0.0 | 463291
DateTime | 0 | 0.0 | 8610
user_id | 0 | 0.0 | 150347
product | 0 | 0.0 | 10
campaign_id | 0 | 0.0 | 10
webpage_id | 0 | 0.0 | 9
product_category_1 | 0 | 0.0 | 5
product_category_2 | 365854 | 0.789685 | 29
user_group_id | 18243 | 0.039377 | 13
gender | 18243 | 0.039377 | 2


In [250]:
product_campaign=list(zip(train['product'],train['campaign_id']))
print(set(product_campaign))

{('C', 414149), ('B', 98970), ('H', 359520), ('D', 118601), ('D', 360936), ('D', 105960), ('F', 404347), ('A', 396664), ('F', 405490), ('H', 414149), ('D', 414149), ('A', 82320), ('I', 118601), ('D', 405490), ('H', 405490), ('F', 359520), ('E', 98970), ('G', 98970), ('C', 105960), ('C', 360936), ('D', 404347), ('G', 404347), ('D', 359520), ('D', 98970), ('J', 82320), ('I', 396664), ('J', 396664), ('B', 360936), ('B', 105960), ('C', 405490), ('A', 405490), ('C', 404347), ('H', 82320), ('C', 359520), ('I', 414149), ('B', 405490), ('D', 396664), ('H', 396664), ('B', 404347), ('I', 404347), ('E', 82320), ('B', 359520), ('C', 82320), ('G', 105960), ('F', 118601), ('F', 98970), ('G', 118601), ('A', 414149), ('C', 396664), ('I', 360936), ('I', 105960), ('B', 82320), ('G', 414149), ('E', 414149), ('E', 404347), ('I', 82320), ('B', 396664), ('A', 105960), ('C', 98970), ('A', 118601), ('H', 98970), ('A', 404347), ('H', 404347), ('F', 396664), ('B', 118601), ('I', 359520), ('F', 414149), ('H', 10

Observation 1: A campaign can feature multiple products, and a product can appear in multiple campaigns.
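The same observation can be checked without scanning the printed set by counting distinct partners in each direction. The sketch below uses a small made-up frame in place of train; the values are illustrative only.

```python
import pandas as pd

# Toy stand-in for train[['product', 'campaign_id']]; values are illustrative only.
toy = pd.DataFrame({
    'product':     ['C', 'C', 'D', 'D', 'I'],
    'campaign_id': [359520, 105960, 359520, 118601, 118601],
})

# Distinct products per campaign and distinct campaigns per product.
products_per_campaign = toy.groupby('campaign_id')['product'].nunique()
campaigns_per_product = toy.groupby('product')['campaign_id'].nunique()

# A many-to-many relationship shows up as a maximum above 1 in both directions.
is_many_to_many = bool(products_per_campaign.max() > 1 and campaigns_per_product.max() > 1)

print(products_per_campaign)
print(campaigns_per_product)
print(is_many_to_many)
```

On the real data the same two groupby calls would be applied to train directly.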

In [251]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 463291 entries, 0 to 463290
Data columns (total 15 columns):
session_id                463291 non-null int64
DateTime                  463291 non-null object
user_id                   463291 non-null int64
product                   463291 non-null object
campaign_id               463291 non-null int64
webpage_id                463291 non-null int64
product_category_1        463291 non-null int64
product_category_2        97437 non-null float64
user_group_id             445048 non-null float64
gender                    445048 non-null object
age_level                 445048 non-null float64
user_depth                445048 non-null float64
city_development_index    338162 non-null float64
var_1                     463291 non-null int64
is_click                  463291 non-null int64
dtypes: float64(5), int64(7), object(3)
memory usage: 53.0+ MB


Categorical variables represented by numerical values:

user_id, product_category_1, campaign_id, webpage_id, user_group_id, var_1

In [252]:
train['DateTime']

0         2017-07-02 00:00
1         2017-07-02 00:00
2         2017-07-02 00:00
3         2017-07-02 00:00
4         2017-07-02 00:01
5         2017-07-02 00:01
6         2017-07-02 00:01
7         2017-07-02 00:01
8         2017-07-02 00:02
9         2017-07-02 00:02
10        2017-07-02 00:02
11        2017-07-02 00:02
12        2017-07-02 00:02
13        2017-07-02 00:02
14        2017-07-02 00:03
15        2017-07-02 00:03
16        2017-07-02 00:03
17        2017-07-02 00:03
18        2017-07-02 00:03
19        2017-07-02 00:03
20        2017-07-02 00:03
21        2017-07-02 00:03
22        2017-07-02 00:03
23        2017-07-02 00:03
24        2017-07-02 00:04
25        2017-07-02 00:04
26        2017-07-02 00:04
27        2017-07-02 00:04
28        2017-07-02 00:04
29        2017-07-02 00:04
                ...       
463261    2017-07-07 23:56
463262    2017-07-07 23:56
463263    2017-07-07 23:57
463264    2017-07-07 23:57
463265    2017-07-07 23:57
463266    2017-07-07 23:57
4

Observation 2: The rows are chronologically ordered.

In [253]:
# Use distinct names so the target Series `click` from In [2] is not overwritten.
clicked = train[train['is_click']==1]
not_clicked = train[train['is_click']==0]
click_col = 'blue'
noclick_col = 'red'

print("Clicked: %i (%.1f percent), Not Clicked: %i (%.1f percent), Total: %i"
      % (len(clicked), 100.0*len(clicked)/len(train),
         len(not_clicked), 100.0*len(not_clicked)/len(train), len(train)))

Clicked: 31331 (6.8 percent), Not Clicked: 431960 (93.2 percent), Total: 463291


Observation 3: The target class is imbalanced.
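To put numbers on the imbalance, the sketch below recomputes the click rate and the negative-to-positive ratio from the counts printed above. It also shows why plain accuracy is not informative here: a model that always predicts "no click" already scores about 93%, which is one reason to evaluate with a ranking metric such as AUC, as the models below do.

```python
# Counts taken from the printed output above.
n_click, n_noclick = 31331, 431960
n_total = n_click + n_noclick

click_rate = n_click / n_total        # share of sessions with a click
neg_pos_ratio = n_noclick / n_click   # non-click sessions per click session

# Accuracy of a trivial model that always predicts "no click".
majority_baseline_accuracy = n_noclick / n_total

print(f"click rate: {click_rate:.4f}")
print(f"negatives per positive: {neg_pos_ratio:.2f}")
print(f"always-'no click' accuracy: {majority_baseline_accuracy:.4f}")
```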

## Feature Engineering

Group the historical log data and extract the following features: user-product view count, user-product interest count, user view count, user interest count, user days active (span between the first and last log entry), user unique active days, product view count, and product interest count.
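As a minimal sketch of the counting idea, the example below builds the three view counts on a tiny made-up log. It uses groupby(...).size() rather than the count-and-rename pattern in the cells that follow, so it illustrates the same features but is not the notebook's exact code; the toy rows are assumptions.

```python
import pandas as pd

# Toy stand-in for historical_user_logs.csv (columns: DateTime, user_id, product, action).
log_toy = pd.DataFrame({
    'DateTime': pd.to_datetime(['2017-06-01 10:00', '2017-06-01 11:00',
                                '2017-06-03 09:30', '2017-06-03 09:45']),
    'user_id':  [1, 1, 1, 2],
    'product':  ['A', 'A', 'B', 'A'],
    'action':   ['view', 'interest', 'view', 'view'],
})

views = log_toy[log_toy['action'] == 'view']

# Number of views per (user, product), per user, and per product.
up_view = views.groupby(['user_id', 'product']).size().rename('up_view').reset_index()
u_view = views.groupby('user_id').size().rename('u_view').reset_index()
p_view = views.groupby('product').size().rename('p_view').reset_index()

print(up_view, u_view, p_view, sep='\n\n')
```

The interest counts follow the same pattern with action == 'interest'.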

In [4]:
user_product_view = log[log['action']=='view'].groupby(['user_id','product','action']).count().rename(columns={'DateTime':'up_view'}).reset_index()
user_product_view.drop(['action'],axis=1,inplace=True)
user_product_interest = log[log['action']=='interest'].groupby(['user_id','product','action']).count().rename(columns={'DateTime':'up_interest'}).reset_index()
user_product_interest.drop(['action'],axis=1,inplace=True)

In [5]:
user_view = log[log['action']=='view'].groupby(['user_id','action']).count().rename(columns={'DateTime':'u_view'}).reset_index()
user_view.drop(['action','product'],axis=1,inplace=True)
user_interest = log[log['action']=='interest'].groupby(['user_id','action']).count().rename(columns={'DateTime':'u_interest'}).reset_index()
user_interest.drop(['action','product'],axis=1,inplace=True)


In [6]:
product_view = log[log['action']=='view'].groupby(['product','action']).count().rename(columns={'DateTime':'p_view'}).reset_index()
product_view.drop(['user_id','action'],axis=1,inplace=True)
product_interest = log[log['action']=='interest'].groupby(['product','action']).count().rename(columns={'DateTime':'p_interest'}).reset_index()
product_interest.drop(['user_id','action'],axis=1,inplace=True)

In [7]:
log['DateTime'] = pd.to_datetime(log['DateTime'])
days_active = log.reset_index().groupby(['user_id'])['DateTime'].agg(lambda x: (x.max() - x.min()).days+1 if (x.max() - x.min()).days !=0 else 1)
unique_days_active = log.reset_index().groupby(['user_id'])['DateTime'].agg(lambda x: len(np.unique(x.dt.dayofyear)))
user_time_features = days_active.reset_index().merge(unique_days_active.reset_index(),on='user_id',how = 'left')
user_time_features.columns = ['user_id','days_active','unique_days_active']
user_time_features.head()

index | user_id | days_active | unique_days_active
--- | --- | --- | ---
0 | 4 | 1 | 1
1 | 19 | 14 | 14
2 | 25 | 11 | 5
3 | 26 | 14 | 8
4 | 30 | 14 | 11
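The two time features differ in what they measure: days_active is the span between a user's first and last log entry, while unique_days_active counts the distinct calendar days with activity. A small sketch on made-up timestamps for one hypothetical user:

```python
import pandas as pd

# Toy timestamps for one hypothetical user: active on 1, 2 and 11 June.
ts = pd.Series(pd.to_datetime(['2017-06-01 08:00', '2017-06-02 21:00',
                               '2017-06-02 22:30', '2017-06-11 09:00']))

# Whole days between the first and the last event.
span = (ts.max() - ts.min()).days

# Span plus one; the lambda's special case for a zero span also yields 1.
days_active = span + 1

# Distinct calendar days on which the user appears in the log.
unique_days_active = ts.dt.dayofyear.nunique()

print(days_active, unique_days_active)
```

A large gap between the two values points to a user who returns occasionally over a long period rather than daily.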


In [14]:
combine = pd.merge(combine, user_view,  how='left', on=['user_id'])

combine = combine.merge(user_interest,on=['user_id'],how='left')
combine = combine.merge(product_view, on=['product'],how='left')
combine = combine.merge(product_interest, on=['product'],how='left')
combine = combine.merge(user_product_view, on=['user_id','product'],how='left')
combine = combine.merge(user_product_interest,on=['user_id','product'],how='left')
combine = combine.merge(user_time_features, on=['user_id'],how='left')

In [15]:
combine['new_user_ind'] = combine.days_active.isna().astype(int)

missing_cols_float = ['up_view','up_interest','u_interest','u_view','days_active','unique_days_active']
for col in missing_cols_float:
    combine[col] = combine[col].fillna(0)
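The order of the two steps above matters: users without any history get NaN from the left merge, and the indicator has to be derived before those NaNs are replaced with 0. A toy sketch with made-up frames:

```python
import pandas as pd

sessions = pd.DataFrame({'user_id': [1, 2, 3]})
history = pd.DataFrame({'user_id': [1, 3], 'days_active': [4, 1]})  # user 2 has no log rows

# A left merge keeps every session; users missing from `history` get NaN.
merged = sessions.merge(history, on='user_id', how='left')

# Flag users without history first, then fill the gaps with 0.
merged['new_user_ind'] = merged['days_active'].isna().astype(int)
merged['days_active'] = merged['days_active'].fillna(0)

print(merged)
```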

**Cumulative counts across combinations of categorical variables:**

For each group below, count the sessions seen so far within that group. In addition, compute the number of earlier clicks per user. is_click is set to 0 for the test rows, whose labels are unknown.

['user_id'], ['user_id','weekday'], ['user_id','campaign_id'], ['user_id','campaign_id','product'], ['user_id','campaign_id','product','webpage_id'], ['user_id','product'], ['user_id','webpage_id'], ['campaign_id','weekday'], ['campaign_id','weekday','hour'], ['hour','weekday']

In [10]:
def create_features(df, grouped_variable_list):
    # Derive calendar features from the session timestamp.
    df['DateTime'] = pd.to_datetime(df['DateTime'])
    df['weekday'] = df['DateTime'].dt.weekday
    df['hour'] = df['DateTime'].dt.hour
    for group in grouped_variable_list:
        name = '_'.join(group)
        # Number of earlier rows in the same group (rows are chronologically ordered).
        df['cumcount_' + name] = df.groupby(group)['is_click'].cumcount()
    # Clicks by the same user before the current row (current row excluded).
    df['cum_clicks_user_id'] = df.groupby('user_id')['is_click'].cumsum() - df['is_click']
    return df

In [16]:
combine.loc[:len(train)-1,'is_click'] = click  # .loc slices include the end label
combine.loc[len(train):,'is_click'] = 0

combine = create_features(combine,[['user_id'],['user_id','weekday'],['user_id','campaign_id'],
                                   ['user_id','campaign_id','product'],['user_id','campaign_id','product','webpage_id'],
                                   ['user_id','product'],['user_id','webpage_id'],['campaign_id','weekday'],
                                   ['campaign_id','weekday','hour'],['hour','weekday']])

In [21]:
combine.columns

Index(['session_id', 'DateTime', 'user_id', 'product', 'campaign_id',
       'webpage_id', 'product_category_1', 'product_category_2',
       'user_group_id', 'gender', 'age_level', 'user_depth',
       'city_development_index', 'var_1', 'u_view', 'u_interest', 'p_view',
       'p_interest', 'up_view', 'up_interest', 'days_active',
       'unique_days_active', 'new_user_ind', 'is_click', 'weekday', 'hour',
       'cumcount_user_id', 'cumcount_user_id_weekday',
       'cumcount_user_id_campaign_id', 'cumcount_user_id_campaign_id_product',
       'cumcount_user_id_campaign_id_product_webpage_id',
       'cumcount_user_id_product', 'cumcount_user_id_webpage_id',
       'cumcount_campaign_id_weekday', 'cumcount_campaign_id_weekday_hour',
       'cumcount_hour_weekday', 'cum_clicks_user_id'],
      dtype='object')
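The cumcount_* and cum_clicks_user_id columns listed above can be illustrated on a toy frame. The rows are made up and assumed to be in chronological order, as in the real data.

```python
import pandas as pd

toy = pd.DataFrame({
    'user_id':  [1, 2, 1, 1, 2],
    'is_click': [0, 1, 1, 0, 0],
})  # rows assumed to be in chronological order

# How many earlier sessions the same user already has (0 for the first one).
toy['cumcount_user_id'] = toy.groupby('user_id')['is_click'].cumcount()

# Running click total per user, minus the current row, so only earlier clicks count.
toy['cum_clicks_user_id'] = toy.groupby('user_id')['is_click'].cumsum() - toy['is_click']

print(toy)
```

Subtracting the current row's is_click keeps a row's own label out of its feature value.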

### Data preprocessing

* Convert categorical variables.

* Handle missing values by filling each affected column with a new sentinel level.

* Drop product_category_2 because of its large share of missing values (about 79%).

In [22]:
combine.drop(['product_category_2','session_id'], axis=1,inplace=True)
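# Fill missing values with a new level one above the maximum observed in train.
combine['user_group_id'] = combine['user_group_id'].fillna(13.0)
combine['age_level'] = combine['age_level'].fillna(7.0)
combine['user_depth'] = combine['user_depth'].fillna(4.0)
combine['city_development_index'] = combine['city_development_index'].fillna(5.0)
combine['gender'] = combine['gender'].fillna('other')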

cat_col_names = ['user_id','product','campaign_id','webpage_id','product_category_1',
                 'user_group_id','gender','age_level','user_depth','city_development_index','var_1',
                 'new_user_ind','weekday','hour']

for col in cat_col_names:
    combine[col] = combine[col].astype('category')
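The effect of the sentinel fill followed by the category cast can be seen on a toy frame: the missing entries become a level of their own instead of being dropped or imputed. The values below are illustrative only.

```python
import pandas as pd

toy = pd.DataFrame({'age_level': [2.0, None, 4.0],
                    'gender': ['Male', None, 'Female']})

# Missing values become an explicit extra level.
toy['age_level'] = toy['age_level'].fillna(7.0)   # 7.0 lies outside the observed 0-6 range
toy['gender'] = toy['gender'].fillna('other')

# Cast to the pandas category dtype so downstream models can treat them as categorical.
for col in ['age_level', 'gender']:
    toy[col] = toy[col].astype('category')

age_levels = toy['age_level'].cat.categories.tolist()
print(toy.dtypes)
print(age_levels)
```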

In [23]:
combine.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 592149 entries, 0 to 592148
Data columns (total 35 columns):
DateTime                                           592149 non-null datetime64[ns]
user_id                                            592149 non-null category
product                                            592149 non-null category
campaign_id                                        592149 non-null category
webpage_id                                         592149 non-null category
product_category_1                                 592149 non-null category
user_group_id                                      592149 non-null category
gender                                             592149 non-null category
age_level                                          592149 non-null category
user_depth                                         592149 non-null category
city_development_index                             592149 non-null category
var_1                                          

In [24]:
test = combine.iloc[len(train):]
train = combine.iloc[:len(train)]

Split the training set into train and validation sets.

In [38]:
X_train,X_val,y_train,y_val = train_test_split(train.drop(['DateTime','is_click'],axis=1),click,
                                                 test_size=0.25,random_state = 1994)

### Modeling: CatBoost
CatBoost accepts the indices of the categorical columns (cat_features) and encodes those columns itself. Categorical features whose number of distinct values is less than or equal to one_hot_max_size are one-hot encoded.

Before feature engineering on the click data, the AUC of the CatBoost model was 0.63XX.

Let's see the improvement after feature engineering.


In [32]:
categorical_features_indices = np.where(X_train.dtypes =='category')[0]
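The line above returns the positional indices of all columns with the category dtype, which is the form cat_features expects when the data is passed as a plain array. A toy sketch with made-up columns, cross-checked against select_dtypes:

```python
import numpy as np
import pandas as pd

X_toy = pd.DataFrame({
    'product': pd.Series(['C', 'I', 'C'], dtype='category'),
    'u_view':  [3.0, 0.0, 12.0],
    'hour':    pd.Series([0, 0, 1], dtype='category'),
})

# Positions of the category-typed columns, in column order.
cat_idx = np.where(X_toy.dtypes == 'category')[0]

# The same columns by name, as a cross-check.
cat_names = X_toy.select_dtypes(include='category').columns.tolist()

print(cat_idx, cat_names)
```

Because the indices are positional, they stay valid only while the column order of the array matches the frame they were computed from.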

In [36]:
cat_model = CatBoostClassifier(
    n_estimators=1000,  # deliberately large; early stopping selects the best iteration
    eval_metric='AUC',
    random_seed=1994,
    learning_rate=0.05,
    depth=8,
    # Options tried and left disabled:
    # reg_lambda=1.0, l2_leaf_reg=4.0, boosting_type='Ordered', subsample=0.8,
    # rsm=0.7, silent=True, max_ctr_complexity=5 (no. of categorical cols combined),
    # od_type='IncToDec', od_wait=20 (overfitting params), bagging_temperature=1.0
)
# learning_rate=0.05 without the od_* settings gave the highest score


In [35]:
cat_model.fit(X_train.values,y_train.values,cat_features=categorical_features_indices,eval_set=(X_val, y_val),
        plot=False,early_stopping_rounds=100,use_best_model=True,metric_period=100) # early stopping set to 100 to prevent overfitting



0:	test: 0.6146178	best: 0.6146178 (0)	total: 1.93s	remaining: 32m 7s
100:	test: 0.7762897	best: 0.7763092 (98)	total: 3m 50s	remaining: 34m 9s
200:	test: 0.7790746	best: 0.7812953 (125)	total: 8m 18s	remaining: 33m
300:	test: 0.7877423	best: 0.7877680 (299)	total: 12m 26s	remaining: 28m 54s
400:	test: 0.7897826	best: 0.7897870 (398)	total: 16m 42s	remaining: 24m 57s
500:	test: 0.7905833	best: 0.7905981 (498)	total: 20m 45s	remaining: 20m 40s
600:	test: 0.7953792	best: 0.7957575 (578)	total: 24m 49s	remaining: 16m 28s
700:	test: 0.8013848	best: 0.8013939 (699)	total: 27m 59s	remaining: 11m 56s
800:	test: 0.8076197	best: 0.8076404 (798)	total: 30m 51s	remaining: 7m 40s
900:	test: 0.8123267	best: 0.8123601 (883)	total: 33m 38s	remaining: 3m 41s
999:	test: 0.8152397	best: 0.8152625 (997)	total: 36m 53s	remaining: 0us

bestTest = 0.8152624928
bestIteration = 997

Shrink model to first 998 iterations.


<catboost.core.CatBoostClassifier at 0x1ae81cd29b0>

### LightGBM model
LightGBM handles categorical features automatically: every column with the pandas category dtype is treated as categorical.
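As far as I understand, LightGBM reads a pandas category column through its integer codes, so the train, validation and test frames need to share the same category list. In this notebook that holds because the columns are cast on the combined frame before it is split. A toy sketch of that property, using pandas only and made-up values:

```python
import pandas as pd

combined = pd.DataFrame({'product': ['A', 'B', 'C', 'A']})
combined['product'] = combined['product'].astype('category')  # cast before splitting

train_part = combined.iloc[:2]   # contains only A and B
test_part = combined.iloc[2:]    # contains C and A

# Both slices keep the full category list, so a given code means the same level in each.
same_categories = train_part['product'].cat.categories.equals(
    test_part['product'].cat.categories)
codes_test = test_part['product'].cat.codes.tolist()

print(same_categories, codes_test)
```

Casting each part separately would instead give train_part the categories A and B only, and the codes would no longer line up.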

In [39]:
dtrain = lgb.Dataset(X_train, y_train)
dval = lgb.Dataset(X_val, y_val, reference=dtrain)  # validation data shares the training set's binning
params = {
    
    'num_leaves' : 256,
    'learning_rate':0.003,
    'metric':'auc',
    'objective':'binary',
    'early_stopping_round': 40,
    'max_depth':10,
    'bagging_fraction':0.5,
    'feature_fraction':0.6,
    'bagging_seed':2017,
    'feature_fraction_seed':2017,
    'verbose' : 1
    
    
}

In [40]:
clf = lgb.train(params, dtrain,num_boost_round=800,valid_sets=dval,verbose_eval=20)

Training until validation scores don't improve for 40 rounds.
[20]	valid_0's auc: 0.633238
[40]	valid_0's auc: 0.634146
[60]	valid_0's auc: 0.63355
[80]	valid_0's auc: 0.63418
[100]	valid_0's auc: 0.634111
[120]	valid_0's auc: 0.634384
[140]	valid_0's auc: 0.634502
[160]	valid_0's auc: 0.634778
[180]	valid_0's auc: 0.634893
[200]	valid_0's auc: 0.635089
[220]	valid_0's auc: 0.635291
[240]	valid_0's auc: 0.63565
[260]	valid_0's auc: 0.635849
[280]	valid_0's auc: 0.636044
[300]	valid_0's auc: 0.636271
[320]	valid_0's auc: 0.636431
[340]	valid_0's auc: 0.636571
[360]	valid_0's auc: 0.636752
[380]	valid_0's auc: 0.636992
[400]	valid_0's auc: 0.637171
[420]	valid_0's auc: 0.637368
[440]	valid_0's auc: 0.637537
[460]	valid_0's auc: 0.637674
[480]	valid_0's auc: 0.637752
[500]	valid_0's auc: 0.637845
[520]	valid_0's auc: 0.637991
[540]	valid_0's auc: 0.638116
[560]	valid_0's auc: 0.638201
[580]	valid_0's auc: 0.638299
[600]	valid_0's auc: 0.638421
[620]	valid_0's auc: 0.638511
[640]	valid_0's

In [33]:
# cols_to_use is not defined in this notebook; the training feature columns are assumed here.
preds = clf.predict(test[X_train.columns])

In [34]:
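# session_id was dropped from combine in In [22], so it is re-read from test.csv (row order is unchanged by the left merges).
sub = pd.DataFrame({'session_id': pd.read_csv('test.csv')['session_id'].values, 'is_click': preds})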
sub.to_csv('lgb.csv', index=False)