1. The dataset is available at https://www.kaggle.com/c/avazu-ctr-prediction/data
2. Download it via the Kaggle CLI: `kaggle competitions download -c avazu-ctr-prediction`
3. The training data is quite big (> 6 GB, with 40428968 lines), so take a subset of it first for initial exploration
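
How the subset was produced is not stated here. One possible approach, shown as a minimal sketch, is to copy only the first lines of the CSV without loading the whole file into memory. The function name `write_head_subset` and the default of 500,000 lines are assumptions made for illustration.

```python
from itertools import islice


def write_head_subset(src_path, dst_path, n_lines=500_000):
    """Copy the first n_lines lines of src_path (header included) to dst_path."""
    with open(src_path, "r", encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        # islice stops after n_lines, so the rest of the file is never read.
        dst.writelines(islice(src, n_lines))
```

A call such as `write_head_subset("data/train.csv", "data/train_subset.csv")` would then create the subset file (the source path is assumed). Note that a head-of-file subset only covers the earliest rows of the file, so a random sample may be more representative for anything beyond a first look.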

### Initial exploration

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
from IPython.display import display
pd.set_option('display.max_columns', None)
sns.set()
%matplotlib inline

In [2]:
train_df = pd.read_csv('data/train_subset.csv')

In [3]:
train_df.head()

Unnamed: 0,id,click,hour,C1,banner_pos,site_id,site_domain,site_category,app_id,app_domain,app_category,device_id,device_ip,device_model,device_type,device_conn_type,C14,C15,C16,C17,C18,C19,C20,C21
0,1.000009e+18,0,14102100,1005,0,1fbe01fe,f3845767,28905ebd,ecad2386,7801e8d9,07d7df22,a99f214a,ddd2926e,44956a24,1,2,15706,320,50,1722,0,35,-1,79
1,1.000017e+19,0,14102100,1005,0,1fbe01fe,f3845767,28905ebd,ecad2386,7801e8d9,07d7df22,a99f214a,96809ac8,711ee120,1,0,15704,320,50,1722,0,35,100084,79
2,1.000037e+19,0,14102100,1005,0,1fbe01fe,f3845767,28905ebd,ecad2386,7801e8d9,07d7df22,a99f214a,b3cf8def,8a4875bd,1,0,15704,320,50,1722,0,35,100084,79
3,1.000064e+19,0,14102100,1005,0,1fbe01fe,f3845767,28905ebd,ecad2386,7801e8d9,07d7df22,a99f214a,e8275b8f,6332421a,1,0,15706,320,50,1722,0,35,100084,79
4,1.000068e+19,0,14102100,1005,1,fe8cc448,9166c161,0569f928,ecad2386,7801e8d9,07d7df22,a99f214a,9644d0bf,779d90c2,1,0,18993,320,50,2161,0,35,-1,157


Check that there aren't any missing values in the dataset:

In [11]:
pd.isnull(train_df).sum(axis=0)

id                  0
click               0
hour                0
C1                  0
banner_pos          0
site_id             0
site_domain         0
site_category       0
app_id              0
app_domain          0
app_category        0
device_id           0
device_ip           0
device_model        0
device_type         0
device_conn_type    0
C14                 0
C15                 0
C16                 0
C17                 0
C18                 0
C19                 0
C20                 0
C21                 0
dtype: int64

Check the number of unique values in each feature. This gives an indication of which features you might want to use initially for training (e.g. the id column has as many unique values as the dataset has rows, so it is not useful for building a model).

In [14]:
train_df.nunique()

id                  499999
click                    2
hour                     4
C1                       7
banner_pos               6
site_id               1704
site_domain           1586
site_category           21
app_id                1641
app_domain             122
app_category            20
device_id            41413
device_ip           171304
device_model          3967
device_type              4
device_conn_type         4
C14                    540
C15                      8
C16                      9
C17                    154
C18                      4
C19                     40
C20                    154
C21                     34
dtype: int64
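
The unique counts can also be used to pick candidate features programmatically. The sketch below is one way to do that. The helper name `low_cardinality_columns`, its parameters, and the toy frame are made up for illustration.

```python
import pandas as pd


def low_cardinality_columns(df, max_unique=10, exclude=("click",)):
    """Return columns with fewer than max_unique distinct values, skipping `exclude`."""
    counts = df.nunique()
    return [col for col in df.columns
            if col not in exclude and counts[col] < max_unique]


# Tiny made-up frame, only to show the call.
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "click": [0, 1, 0, 0],
    "banner_pos": [0, 1, 0, 1],
    "device_ip": ["a", "b", "c", "d"],
})
print(low_cardinality_columns(toy, max_unique=3))  # ['banner_pos']
```

Applied to the counts listed above with a threshold of 10, this should select hour, C1, banner_pos, device_type, device_conn_type, C15, C16 and C18.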

Average CTR for the subset:

In [14]:
train_df.click.mean()

0.16407232814465628

##### Feature engineering

- remove the id column (every value is unique)
- extract the actual hour/day/month from the 'hour' column (the data only spans 11 consecutive days, so the hour of day is probably the main influence; the day and month columns are added now since they might be of use on the full dataset)
- use `LabelBinarizer` to convert the unique classes into binary (0/1) indicator columns
- use the features that have < 10 unique values to test out a few initial classifiers
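
As a small illustration of the encoding step, and assuming the binarizer meant above is scikit-learn's `LabelBinarizer`, the sketch below shows what it produces: one 0/1 column per class rather than a single integer code. The sample values are made up in the style of the C1 column.

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer

# Made-up values in the style of the C1 column, only for illustration.
c1_values = np.array([1005, 1002, 1005, 1010])

binarizer = LabelBinarizer()
encoded = binarizer.fit_transform(c1_values)

print(binarizer.classes_)  # sorted distinct classes
print(encoded)             # one row per sample, one 0/1 column per class
```

One thing to keep in mind is that for a feature with exactly two classes `LabelBinarizer` returns a single column instead of two.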

In [24]:
# .copy() gives an independent DataFrame, so adding columns below does not raise SettingWithCopyWarning
data = train_df[['hour', 'C1', 'banner_pos', 'device_type', 'device_conn_type', 'C15', 'C16', 'C18', 'click']].copy()
# 'hour' is encoded as YYMMDDHH, e.g. 14102100 is 2014-10-21, hour 00
data['hour_of_day'] = data['hour'].map(lambda x: int(str(x)[-2:]))
data['day'] = data['hour'].map(lambda x: int(str(x)[4:6]))
data['month'] = data['hour'].map(lambda x: int(str(x)[2:4]))
y = data['click']
x = data[['hour_of_day', 'day', 'month', 'C1', 'banner_pos', 'device_type', 'device_conn_type', 'C15', 'C16', 'C18']]
x.head()


Unnamed: 0,hour_of_day,day,month,C1,banner_pos,device_type,device_conn_type,C15,C16,C18
0,0,21,10,1005,0,1,2,320,50,0
1,0,21,10,1005,0,1,0,320,50,0
2,0,21,10,1005,0,1,0,320,50,0
3,0,21,10,1005,0,1,0,320,50,0
4,0,21,10,1005,1,1,0,320,50,0
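
The string slicing above works, but the same parts can also be obtained by parsing the YYMMDDHH values into timestamps once. The following is a sketch of that alternative, not the approach used in this notebook. The `day_of_week` column is an extra added for illustration.

```python
import pandas as pd

# Sample values in the YYMMDDHH layout used by the 'hour' column.
hours = pd.Series([14102100, 14102103, 14103023])

# Parse once into timestamps, then read the parts off the .dt accessor.
parsed = pd.to_datetime(hours.astype(str), format="%y%m%d%H")
parts = pd.DataFrame({
    "hour_of_day": parsed.dt.hour,
    "day": parsed.dt.day,
    "month": parsed.dt.month,
    "day_of_week": parsed.dt.dayofweek,  # Monday=0; extra column, not in the original
})
print(parts)
```

Parsing to real timestamps also makes it easier to derive further calendar features later without re-slicing strings.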


##### Are the features meaningful?
If the features we have selected contribute to CTR prediction, then the CTR should differ across the categories of each feature.
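
When comparing per-category CTRs, it can help to look at how many rows sit behind each mean, since a category with only a handful of rows can show an extreme CTR (0, 0.5 or 1) purely by chance. A minimal sketch, with the helper name `ctr_by_category` and the toy data made up for illustration:

```python
import pandas as pd


def ctr_by_category(df, col, target="click"):
    """Per-category mean of `target` together with the number of rows behind it."""
    return (df.groupby(col)[target]
              .agg(ctr="mean", n="count")
              .sort_values("n", ascending=False))


# Made-up rows: four impressions for one category, a single one for the other.
toy = pd.DataFrame({
    "C15":   [320, 320, 320, 320, 1024],
    "click": [0,   1,   0,   0,   1],
})
print(ctr_by_category(toy, "C15"))
```

In the toy frame the category with a single row has a CTR of 1.0, which says little about how that category behaves in general.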



In [26]:
# calculates CTR per feature; most of the features do seem to have an influence on CTR
for col in data.columns:
    print(data.groupby([col])['click'].mean())

hour
14102100    0.174714
14102101    0.173695
14102102    0.150696
14102103    0.169235
Name: click, dtype: float64
C1
1001    0.071429
1002    0.217071
1005    0.164956
1007    0.029354
1008    0.107807
1010    0.070757
1012    0.041190
Name: click, dtype: float64
banner_pos
0    0.155870
1    0.195157
2    0.121622
4    0.176471
5    0.107807
7    0.066667
Name: click, dtype: float64
device_type
0    0.217071
1    0.164505
4    0.070583
5    0.073107
Name: click, dtype: float64
device_conn_type
0    0.169392
2    0.124203
3    0.081895
5    0.036281
Name: click, dtype: float64
C15
120     0.000000
216     0.171764
300     0.434305
320     0.153875
480     0.500000
728     0.065196
768     0.000000
1024    1.000000
Name: click, dtype: float64
C16
20      0.000000
36      0.171764
50      0.154066
90      0.065196
250     0.463291
320     0.500000
480     0.091667
768     1.000000
1024    0.000000
Name: click, dtype: float64
C18
0    0.166491
1    0.063748
2    0.363755
3    0.109160
