# Feature Engineering Ad Clicks Exercise

The dataset we will be using for this exercise is a sample from the 'TalkingData AdTracking' Kaggle competition. All positive samples (where is_attributed == 1) were kept, while 99% of negative samples were discarded. This sample has roughly 20% positive examples.
The data has the following features: (note that ip, app, device, os, and channel are encoded).

The overall goal is to predict whether a user will download an app after clicking a mobile ad.

[1. Baseline Modeling](#baseline)  
[2. Categorical Encoding](#encode)  
[3. Feature Generation](#generate)


#### Note to self:
- **click_data** is the raw imported data
- **clicks** is timestamp and label encoded

<a id='baseline'></a>
## 1. Baseline Modeling
Using what we learned in the overview, let's set up our baseline model.


In [1]:
%reload_ext autoreload
%autoreload 2
import pandas as pd
import numpy as np
from my_modules import data_imports as data
from my_modules import feature_engineering as fe
from termcolor import colored, cprint

click_data = data.import_ad_clicks_data()
click_data.head(10)

score_dict = {}

Ad Click data imported


### Basic Feature Engineering
Before jumping into the deep end of feature engineering, we need a base model to build on. So let's do some simple engineering:
- Dealing with Timestamps
- Label Encoding
#### Timestamps

In [2]:
# Timestamps
clicks = click_data.copy()
clicks['day'] = clicks['click_time'].dt.day.astype('uint8')
clicks['hour'] = clicks['click_time'].dt.hour.astype('uint8')
clicks['minute'] = clicks['click_time'].dt.minute.astype('uint8')
clicks['second'] = clicks['click_time'].dt.second.astype('uint8')

print(clicks.head())


ip  app  device  os  channel          click_time      attributed_time  \
0   89489    3       1  13      379 2017-11-06 15:13:23                  NaN   
1  204158   35       1  13       21 2017-11-06 15:41:07  2017-11-07 08:17:19   
2    3437    6       1  13      459 2017-11-06 15:42:32                  NaN   
3  167543    3       1  13      379 2017-11-06 15:56:17                  NaN   
4  147509    3       1  13      379 2017-11-06 15:57:01                  NaN   

   is_attributed  day  hour  minute  second  
0              0    6    15      13      23  
1              1    6    15      41       7  
2              0    6    15      42      32  
3              0    6    15      56      17  
4              0    6    15      57       1  


#### Label Encoding
For the baseline model, let's just use scikit-learn's ```LabelEncoder``` to create new features in the clicks **DataFrame**. The new columns should be the original name with *_labels* appended.


In [3]:
from sklearn.preprocessing import LabelEncoder

cat_features = ['ip', 'app', 'device', 'os', 'channel']
encoder = LabelEncoder()

for feature in cat_features:
    encoded = encoder.fit_transform(clicks[feature])
    clicks[feature + '_labels'] = encoded

clicks.head()


Unnamed: 0,ip,app,device,os,channel,click_time,attributed_time,is_attributed,day,hour,minute,second,ip_labels,app_labels,device_labels,os_labels,channel_labels
0,89489,3,1,13,379,2017-11-06 15:13:23,,0,6,15,13,23,27226,3,1,13,120
1,204158,35,1,13,21,2017-11-06 15:41:07,2017-11-07 08:17:19,1,6,15,41,7,110007,35,1,13,10
2,3437,6,1,13,459,2017-11-06 15:42:32,,0,6,15,42,32,1047,6,1,13,157
3,167543,3,1,13,379,2017-11-06 15:56:17,,0,6,15,56,17,76270,3,1,13,120
4,147509,3,1,13,379,2017-11-06 15:57:01,,0,6,15,57,1,57862,3,1,13,120


### Train/Test/Validate Splits
There is one thing we need to be careful with regarding our data.

#### Time Series
Our data is a *time series*. Date and time matter when building the train and test sets. Since our model needs to predict events in the future, we must also make sure we validate the model on events in the future. If the data is mixed up between training and test sets, then future data will leak into the model and our validation results will overestimate the performance on new data.

Let's first sort in order of increasing time. The first 80% of rows will become the train set, the next 10% will be validation, and the last 10% will be our test set.
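
To make the idea concrete, here is a minimal sketch of a time-ordered 80/10/10 split. The helper name `time_ordered_splits`, its `valid_fraction` parameter, and the toy data are assumptions for illustration only; the actual `fe.get_ad_splits` lives in `my_modules` and may differ.

```python
import pandas as pd


def time_ordered_splits(df, time_col="click_time", valid_fraction=0.1):
    """Hypothetical helper: sort by time, then cut into train/valid/test."""
    ordered = df.sort_values(time_col)
    valid_rows = int(len(ordered) * valid_fraction)
    # Everything except the last two slices is used for training
    train = ordered[:-2 * valid_rows]
    # The next slice (later in time) is for validation
    valid = ordered[-2 * valid_rows:-valid_rows]
    # The most recent slice is held out as the test set
    test = ordered[-valid_rows:]
    return train, valid, test


# Toy data, deliberately stored in reverse time order
toy = pd.DataFrame({
    "click_time": pd.date_range("2017-11-06", periods=20, freq="min")[::-1],
    "is_attributed": [0, 1] * 10,
})

train, valid, test = time_ordered_splits(toy)
print(len(train), len(valid), len(test))
```

Because the rows are sorted before slicing, every validation click happens after every training click, and every test click happens after every validation click.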

In [4]:
base_feature_cols = ['day', 'hour', 'minute', 'second',
                'ip_labels', 'app_labels', 'device_labels',
                'os_labels', 'channel_labels']
# Split implementation moved to my_modules
base_train, base_valid, base_test = fe.get_ad_splits(clicks)

### LightGBM Training
Now let's construct LightGBM dataset objects for each of the smaller datasets.

In [5]:
# bst, valid_score = fe.train_ad_model(base_train, base_valid, feature_cols=base_feature_cols)

<a id='encode'></a>
## <font color=blue>2. Categorical Encoding</font>
Now let's get started by implementing some more complicated encodings and testing them using our ```train_ad_model``` function. We'll be implementing the following encodings:
- Count Encoding
- Target Encoding
- CatBoost Encoding

In [6]:
# click_data is the clean imported data
# clicks is timestamp and label encoded

# Let's do a check on unencoded data
print(colored("Unencoded Model", 'yellow'))
train, valid, test = fe.get_ad_splits(click_data)
unencoded_sc = fe.train_ad_model(train, valid)
print(colored("Baseline Model (Timestamp + Simple Label Encoding)", 'yellow'))
baseline_sc = fe.train_ad_model(base_train, base_valid, feature_cols=base_feature_cols)

Unencoded Model
Training model...
Validation AUC score: 0.9618
Baseline Model (Timestamp + Simple Label Encoding)
Training model...
Validation AUC score: 0.9623


In [7]:
score_dict["Unencoded"] = unencoded_sc[1]
score_dict["Baseline"] = baseline_sc[1]

#### Quick note on encoding and leakages
The model we're working with will use calculated statistics and counts. The current columns are:
- ip
- app
- device
- os
- channel
- click_time
- attributed_time
- is_attributed (target)

With regard to data leakage, we need to make sure that we calculate the encodings from ONLY the training set, to avoid overestimating the model's performance.

### Count encodings
- Count encoding is based on *replacing categories with their counts* computed on the train set.

First off, let's count encode the features ```['ip', 'app', 'device', 'os', 'channel'] ``` using ```CountEncoder```. 

Because we need to avoid data leakage as pointed out above, we need to be sure to first **fit** *then* **transform** our data. This way, our encoding is only fit to the training data and not the valid/test sets.
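
The fit/transform distinction can be sketched with plain pandas. This is only an illustration on toy data, not a description of what ```CountEncoder``` does internally; in particular, filling unseen categories with 0 is an assumption made here.

```python
import pandas as pd

# Toy train/valid sets (values are made up for illustration)
train = pd.DataFrame({"app": [3, 3, 3, 35, 6, 6]})
valid = pd.DataFrame({"app": [3, 6, 99]})

# "fit": learn the counts from the training set only
app_counts = train["app"].value_counts()

# "transform": look the learned counts up for each row
train_count = train["app"].map(app_counts)
# 99 never appears in train, so it has no count; 0 is our assumed fallback
valid_count = valid["app"].map(app_counts).fillna(0).astype(int)

print(train_count.tolist())  # [3, 3, 3, 1, 2, 2]
print(valid_count.tolist())  # [3, 2, 0]
```

The validation rows never contribute to `app_counts`, which is exactly what keeps the encoding free of leakage.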

In [8]:
import category_encoders as ce

cat_features = ['ip', 'app', 'device', 'os', 'channel']
train, valid, test = fe.get_ad_splits(clicks)

# count encoder
count_enc = ce.CountEncoder(cols=cat_features)

# learn the encoding from the training set
count_enc.fit(train[cat_features])

# apply the fit encoding to the training and valid sets
train_encoded = train.join(count_enc.transform(train[cat_features]).add_suffix('_count'))
valid_encoded = valid.join(count_enc.transform(valid[cat_features]).add_suffix('_count'))

count_sc = fe.train_ad_model(train_encoded, valid_encoded)

Training model...
Validation AUC score: 0.9650


In [9]:
score_dict["Count"] = count_sc[1]
fe.print_scores(score_dict)

Scores so far...
Unencoded: 0.9618
Baseline: 0.9623
Count: 0.9650


#### Sidenote: Why is count encoding effective?
One reason count encoding works is that rare values tend to have similar counts (with values like 1 or 2), so you can easily classify rare values together at prediction time. Common values with large counts are unlikely to have the same count as other values. So the common/important values get their own groupings with count encoding.

### Target Encoding
- Target encoding is the *process of replacing a categorical value with the mean of the target variable* based on the train set (be careful of data leakage).

We'll be using the same ```['ip', 'app', 'device', 'os', 'channel'] ``` for target encoding the train set.
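
Before reaching for the library, here is a rough pandas sketch of the basic idea on toy data. It is deliberately unsmoothed, whereas ```TargetEncoder``` applies smoothing, so the numbers will not match the library's output. Falling back to the overall training mean for unseen categories is an assumption made for this sketch.

```python
import pandas as pd

# Toy train/valid sets (values are made up for illustration)
train = pd.DataFrame({
    "app": [3, 3, 3, 35, 35, 6],
    "is_attributed": [0, 0, 1, 1, 1, 0],
})
valid = pd.DataFrame({"app": [3, 35, 99]})

# Overall mean of the target on the training set, used as a fallback
prior = train["is_attributed"].mean()

# "fit": mean of the target per category, from the training set only
app_means = train.groupby("app")["is_attributed"].mean()

# "transform": replace each category with its learned mean
valid_target = valid["app"].map(app_means).fillna(prior)
print(valid_target.tolist())
```

Here app 3 maps to 1/3, app 35 maps to 1.0, and the unseen app 99 falls back to the prior of 0.5.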


In [10]:
cat_features = ['ip', 'app', 'device', 'os', 'channel']
train, valid, test = fe.get_ad_splits(clicks)

# target_encoder
target_enc = ce.TargetEncoder(cols=cat_features)

target_enc.fit(train[cat_features], train['is_attributed'])

# apply the encoding
train_encoded = train.join(target_enc.transform(train[cat_features]).add_suffix('_target'))
valid_encoded = valid.join(target_enc.transform(valid[cat_features]).add_suffix('_target'))

target_sc = fe.train_ad_model(train_encoded, valid_encoded)

Training model...
Validation AUC score: 0.9541


In [11]:
score_dict["Target w/ ip"] = target_sc[1]
fe.print_scores(score_dict)

Scores so far...
Unencoded: 0.9618
Baseline: 0.9623
Count: 0.9650
Target w/ ip: 0.9541


Well, our target score isn't as good as we had hoped. As pointed out by our great friend Kaggle, upon dropping ```ip``` from the target-encoded features, our model actually improves. Why is that?

Target encoding attempts to measure the population mean of the target for each level in the categorical feature. This means when there is less data per level, the estimated mean will be further away from the 'true' mean. There is very little data per IP address, so the estimates are likely to be much noisier than for the other features. The model will rely heavily on this feature because it looks extremely predictive on the training set. This overfits the model, which then performs poorly when it encounters new IP addresses. Going forward, let's leave IP out of it.
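
A quick simulation can illustrate the noise argument. Everything here is an assumption for demonstration: a made-up true download rate of 0.2, 1000 levels, and either 2 or 2000 rows per level. It does not use the click data.

```python
import numpy as np

rng = np.random.default_rng(0)
true_rate = 0.2  # assumed 'true' mean of the target for every level


def estimated_means(rows_per_level, n_levels=1000):
    """Simulate the per-level target mean when each level has few or many rows."""
    samples = rng.binomial(1, true_rate, size=(n_levels, rows_per_level))
    return samples.mean(axis=1)


small = estimated_means(2)     # like an IP address seen only twice
large = estimated_means(2000)  # like an app seen thousands of times

# Spread of the estimates around the true rate
print("std with 2 rows per level:   ", small.std())
print("std with 2000 rows per level:", large.std())
```

With only two rows per level the estimated mean can only be 0, 0.5, or 1, so it swings wildly around 0.2, while the estimates from 2000 rows cluster tightly around the true rate.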

In [12]:
cat_features = ['app', 'device', 'os', 'channel']
train, valid, test = fe.get_ad_splits(clicks)

# target_encoder
target_enc = ce.TargetEncoder(cols=cat_features)

target_enc.fit(train[cat_features], train['is_attributed'])

# apply the encoding
train_encoded = train.join(target_enc.transform(train[cat_features]).add_suffix('_target'))
valid_encoded = valid.join(target_enc.transform(valid[cat_features]).add_suffix('_target'))

target_sc_no_ip = fe.train_ad_model(train_encoded, valid_encoded)

Training model...
Validation AUC score: 0.9627


In [13]:
score_dict["Target w/o ip"] = target_sc_no_ip[1]
fe.print_scores(score_dict)

Scores so far...
Unencoded: 0.9618
Baseline: 0.9623
Count: 0.9650
Target w/ ip: 0.9541
Target w/o ip: 0.9627


### CatBoost
CatBoost encoding is very similar to leave-one-out encoding, but calculates the values 'on-the-fly': for each row, the target statistic is computed only from the rows that come before it. It is said to work fairly well with LightGBM.

#### Sidenote: ```leave-one-out``` encoding
```leave-one-out``` encoding is very similar to target encoding but excludes the current row's target when calculating the mean target for a level.
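
The difference between the two ideas can be sketched with pandas on toy data. This is an illustration rather than the library's implementation: the fallback to the overall mean for single-row levels and the smoothing weight of 1 in the ordered version are assumptions made for this sketch.

```python
import pandas as pd

# Toy training data, assumed to already be in time order
df = pd.DataFrame({
    "app": [3, 3, 3, 35, 35, 6],
    "is_attributed": [0, 0, 1, 1, 1, 0],
})
prior = df["is_attributed"].mean()  # overall target mean
g = df.groupby("app")["is_attributed"]

# Leave-one-out: mean of the level's target, excluding the current row
level_sum = g.transform("sum")
level_count = g.transform("count")
loo = (level_sum - df["is_attributed"]) / (level_count - 1)
loo = loo.fillna(prior)  # levels with a single row have nothing left over

# Ordered (CatBoost-style): use only the rows that came before the current one
prev_sum = g.cumsum() - df["is_attributed"]
prev_count = g.cumcount()
ordered = (prev_sum + prior) / (prev_count + 1)

print(loo.tolist())
print(ordered.round(3).tolist())
```

In both versions a row never sees its own target, which is what makes them less prone to leakage than a plain per-level mean.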


In [14]:
cat_features = ['app', 'device', 'os', 'channel']
train, valid, test = fe.get_ad_splits(clicks)

cb_enc = ce.CatBoostEncoder(cols=cat_features)

cb_enc.fit(train[cat_features], train['is_attributed'])

train_encoded = train.join(cb_enc.transform(train[cat_features]).add_suffix('_cb'))
valid_encoded = valid.join(cb_enc.transform(valid[cat_features]).add_suffix('_cb'))

cb_sc = fe.train_ad_model(train_encoded, valid_encoded)

Training model...
Validation AUC score: 0.9627


In [15]:
score_dict["CatBoosting"] = cb_sc[1]
fe.print_scores(score_dict)

Scores so far...
Unencoded: 0.9618
Baseline: 0.9623
Count: 0.9650
Target w/ ip: 0.9541
Target w/o ip: 0.9627
CatBoosting: 0.9627


### Results
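Out of all of the encodings we tried, 'Count' performed the best (0.9650). 'Target w/o ip' and 'CatBoosting' tied at 0.9627, only slightly above the baseline (0.9623), while 'Target w/ ip' (0.9541) scored below even the unencoded model.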
<a id='generate'></a>
## 3. Feature Generation
Now let's go over some feature generation with our data (clicks).

#### Adding interaction features
Let's add interaction features (combining categorical features) for each pair of categorical features (app, device, os, channel). The easiest way to accomplish this is by using ```itertools.combinations```. For each new column, we must join the values as strings with an underscore, so 13 and 47 will become ```"13_47"```. We also need to make sure to label encode the new features.


In [16]:
from itertools import combinations

# ip is left out, matching the pairs described above and the output below
cat_features = ['app', 'device', 'os', 'channel']
inter_clicks = clicks
interactions = pd.DataFrame(index=inter_clicks.index)
label_enc = LabelEncoder()
for comb in combinations(cat_features, 2):
    new_feat = comb[0] + "_" + comb[1]
    interactions[new_feat]= label_enc.fit_transform(
        inter_clicks[comb[0]].astype(str) + "_" + inter_clicks[comb[1]].astype(str)
    )
cprint('Label Encoded Interactions', 'cyan')
print(interactions.head(10))

inter_clicks = inter_clicks.join(interactions)
cprint('Clicks w/ interactions')
print(inter_clicks.head(10))

Label Encoded Interactions
   app_device  app_os  app_channel  device_os  device_channel  os_channel
0        3543    3973          621        795            1534        1123
1        3486    3715          561        795            1465        1059
2        4180    5063          777        795            1570        1154
3        3543    3973          621        795            1534        1123
4        3543    3973          621        795            1534        1123
5        1187    1306          199        795            1450        1045
6        1097    1032          154        795            1534        1123
7        3415    3374          507        857            1578        3237
8        3085    2110          318        933            1449        8761
9        3543    3985          610        811            1463        1532
Clicks w/ interactions
       ip  app  device  os  channel          click_time      attributed_time  \
0   89489    3       1  13      379 2017-11

In [17]:
train, valid, test = fe.get_ad_splits(inter_clicks)
inter_score = fe.train_ad_model(train, valid)

Training model...
Validation AUC score: 0.9625


In [18]:
score_dict["Interactions"] = inter_score[1]
fe.print_scores(score_dict)

Scores so far...
Unencoded: 0.9618
Baseline: 0.9623
Count: 0.9650
Target w/ ip: 0.9541
Target w/o ip: 0.9627
CatBoosting: 0.9627
Interactions: 0.9625
