### Initial model building on a subset of the Avazu data

This notebook builds a model to predict whether an ad will be clicked, given features describing the ad placement and when and how it is seen. A subset of the data (500,000 samples) is used to explore which model(s) do best before running on the full dataset.

Process:
- use the data subset
- one-hot encode the features
- test several sklearn models

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.simplefilter('ignore')  # suppress deprecation warnings
%matplotlib inline

In [2]:
from sklearn import metrics
from sklearn.model_selection import cross_val_score, train_test_split, RandomizedSearchCV, GridSearchCV
from sklearn.preprocessing import MinMaxScaler, StandardScaler, QuantileTransformer, LabelBinarizer
from sklearn.linear_model import LogisticRegression, SGDClassifier, Perceptron, Ridge
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23
from scipy import stats
import xgboost as xgb

In [3]:
np.random.seed(10)

In [4]:
data = pd.read_pickle('train_subset_df.pkl')
X = data.drop(['click', 'hour', 'day', 'month'], axis=1)
X = pd.get_dummies(X, columns=X.columns)
y = data['click']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
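
The encoding and split step can be illustrated on a small synthetic frame. This is only a sketch: the data is randomly generated, the column names are borrowed from the real features, and `stratify=` is an addition that the cell above does not use. Stratifying keeps the click rate the same in the train and test splits, which may matter when clicks are a minority class.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the Avazu subset (values are made up).
rng = np.random.default_rng(10)
n = 1000
toy = pd.DataFrame({
    "click": (rng.random(n) < 0.17).astype(int),
    "banner_pos": rng.choice([0, 1, 7], size=n),
    "device_type": rng.choice([0, 1, 4], size=n),
})

# One-hot encode every feature column, as in the cell above.
X_toy = pd.get_dummies(toy.drop(columns=["click"]),
                       columns=["banner_pos", "device_type"])
y_toy = toy["click"]

# stratify=y_toy is an assumption added here, not part of the original split.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=10
)

print(list(X_toy.columns))
print("train click rate:", y_tr.mean(), "test click rate:", y_te.mean())
```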

In [11]:
X_train.columns

Index(['C1_1001', 'C1_1002', 'C1_1005', 'C1_1007', 'C1_1008', 'C1_1010',
       'C1_1012', 'banner_pos_0', 'banner_pos_1', 'banner_pos_2',
       'banner_pos_4', 'banner_pos_5', 'banner_pos_7', 'device_type_0',
       'device_type_1', 'device_type_4', 'device_type_5', 'device_conn_type_0',
       'device_conn_type_2', 'device_conn_type_3', 'device_conn_type_5',
       'C15_120', 'C15_216', 'C15_300', 'C15_320', 'C15_480', 'C15_728',
       'C15_768', 'C15_1024', 'C16_20', 'C16_36', 'C16_50', 'C16_90',
       'C16_250', 'C16_320', 'C16_480', 'C16_768', 'C16_1024', 'C18_0',
       'C18_1', 'C18_2', 'C18_3', 'hour_of_day_0', 'hour_of_day_1',
       'hour_of_day_2', 'hour_of_day_3'],
      dtype='object')

#### Modelling algorithms:
Try out one each from a few different families:

- xgboost classifier (ensembles)
- support vector machine (SVM)
- logistic regression (linear model)
- multinomial NB (naive Bayes)
- KNN classifier (neighbours)

In [5]:
lrC = LogisticRegression()
knnC = KNeighborsClassifier()
svcC = SVC(kernel='rbf', C=1e3, gamma=0.5)
xgbC = xgb.XGBClassifier(**{
    "learning_rate": 0.1,
    "n_estimators": 1000,
    "max_depth": 5,
    "min_child_weight": 1,
    "random_state": 10
})
nbC = MultinomialNB()

In [6]:
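# Benchmarking function: fits the classifier, prints the log loss and the confusion matrix,
# and returns the fitted classifier, its name, the accuracy and the log loss.
# Note: the log loss here is computed from hard class predictions, not predicted probabilities.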
def benchmark(clf, X_train, y_train, X_val, y_val):
    clf.fit(X_train, y_train)
    prediction = clf.predict(X_val)
    accuracy = metrics.accuracy_score(y_val, prediction)
    logloss = metrics.log_loss(y_val, prediction)
    clf_description = str(clf).split('(')[0]
    print(clf_description)
    print("logloss: {}".format(logloss))
    print(metrics.confusion_matrix(y_val, prediction))
    
    return clf, clf_description, accuracy, logloss
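
Because `benchmark` passes hard class labels to `metrics.log_loss`, the reported value is essentially the misclassification rate multiplied by a large constant, rather than a measure of how well the probabilities are calibrated. A minimal sketch of the difference, on synthetic data (the data, `X_demo`, `y_demo` and the other variable names are illustrative assumptions, not part of the notebook):

```python
import numpy as np
from sklearn import metrics
from sklearn.linear_model import LogisticRegression

# Synthetic binary features with roughly 17% positives (made-up data).
rng = np.random.default_rng(10)
X_demo = rng.integers(0, 2, size=(2000, 5)).astype(float)
y_demo = (rng.random(2000) < 0.10 + 0.15 * X_demo[:, 0]).astype(int)

clf = LogisticRegression().fit(X_demo, y_demo)

hard_labels = clf.predict(X_demo).astype(float)  # 0.0 or 1.0 only
proba = clf.predict_proba(X_demo)[:, 1]          # estimated P(click)

# Log loss on hard labels: every wrong prediction gets the maximum penalty.
logloss_hard = metrics.log_loss(y_demo, hard_labels)
# Log loss on probabilities: the quantity usually reported for CTR models.
logloss_proba = metrics.log_loss(y_demo, proba)
# ROC AUC also needs scores or probabilities, not hard labels.
auc = metrics.roc_auc_score(y_demo, proba)

print("log loss (hard labels):  ", logloss_hard)
print("log loss (probabilities):", logloss_proba)
print("ROC AUC:", auc)
```

If `benchmark` were changed along these lines, `SVC` would need to be constructed with `probability=True`, since it only exposes `predict_proba` in that case.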

In [7]:
models = [nbC, lrC, svcC, xgbC]
for model in models:
    benchmark(model, X_train, y_train, X_test, y_test)

MultinomialNB
logloss: 5.952777230037412
[[81146  2391]
 [14844  1619]]
LogisticRegression
logloss: 5.686118757894146
[[83537     0]
 [16463     0]]
SVC
logloss: 5.687500436885531
[[83521    16]
 [16451    12]]
XGBClassifier
logloss: 5.686464241609786
[[83525    12]
 [16452    11]]


Unfortunately, none of the classifiers does particularly well: apart from MultinomialNB, they label almost every test sample as not clicked, even though about 16.5% of the test samples are clicks. This could be because we are ignoring important features (we are using only a small subset of the available features), because the hyperparameters are not well tuned, or because none of these models suits the problem.
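
To put the log-loss values above in context, the counts from the confusion matrices (83,537 negatives and 16,463 positives in the test split) are enough for a back-of-the-envelope check. The sketch below assumes that `log_loss` clipped predictions at `1e-15`, which appears to match the recorded outputs; the variable names are illustrative.

```python
import numpy as np

# Class counts read off the confusion matrices above.
n_neg, n_pos = 83537, 16463
p = n_pos / (n_neg + n_pos)  # observed click rate in the test split

# Log loss of a constant predictor that always outputs the click rate.
baseline_logloss = -(p * np.log(p) + (1 - p) * np.log(1 - p))

# Log loss of a hard "never clicked" prediction, assuming clipping at eps = 1e-15:
# every positive sample costs -log(eps), every negative costs almost nothing.
eps = 1e-15
hard_all_negative = p * -np.log(eps)

print("constant-probability baseline:", baseline_logloss)
print("hard all-negative prediction: ", hard_all_negative)
```

Under that assumption the second value reproduces the LogisticRegression figure, and a constant predictor scores about 0.45. Any model evaluated on predicted probabilities would need to beat that baseline to be useful.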

Some approaches to try:
- hyperparameter tuning on the whole dataset (with the feature subset)
- retry the models with all the features (potentially using sklearn's feature selection functions or featuretools)
- factorisation machines have seen a lot of success in CTR prediction, so they are worth trying
- use entity embeddings as features
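
As a starting point for the first item, a randomized search scored on log loss might look like the sketch below. It runs on synthetic data, and the choice of estimator, the search range for `C`, `n_iter` and `cv` are all illustrative assumptions rather than settings taken from this notebook.

```python
import numpy as np
from scipy import stats
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

# Synthetic one-hot-like features with a minority positive class (made-up data).
rng = np.random.default_rng(10)
X_demo = rng.integers(0, 2, size=(1000, 8)).astype(float)
y_demo = (rng.random(1000) < 0.10 + 0.20 * X_demo[:, 0]).astype(int)

search = RandomizedSearchCV(
    LogisticRegression(max_iter=1000),
    # Sample the inverse regularisation strength on a log scale (assumed range).
    param_distributions={"C": stats.loguniform(1e-3, 1e2)},
    n_iter=10,
    scoring="neg_log_loss",  # evaluated on predicted probabilities
    cv=3,
    random_state=10,
)
search.fit(X_demo, y_demo)

best_C = search.best_params_["C"]
best_logloss = -search.best_score_  # flip the sign back to a loss
print("best C:", best_C, "cross-validated log loss:", best_logloss)
```

Scoring with `neg_log_loss` makes the search optimise probability quality directly, instead of the hard-label accuracy that most of the models above achieve by predicting the majority class.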