# Feature Engineering Ad Clicks Exercise

The dataset we will be using for this exercise is a sample from the 'TalkingData AdTracking' Kaggle competition. All positive samples (where is_attributed == 1) were kept, while 99% of negative samples were discarded. This sample has roughly 20% positive examples.
The data has the following features: ip, app, device, os, channel, click_time, attributed_time, and is_attributed (note that ip, app, device, os, and channel are encoded).

The overall goal is to predict whether a user will download an app after clicking a mobile ad.

[1. Baseline Modeling](#baseline)  
[2. Categorical Encoding](#encode)  
[3. Feature Generation](#generate)  
[4. Feature Selection](#select)


#### Note to self:
- **click_data** is the raw imported data
- **clicks** is timestamp and label encoded

<a id='baseline'></a>
## 1. Baseline Modeling
Using what we learned in the overview, let's set up our baseline model.


In [1]:
%reload_ext autoreload
%autoreload 2
import pandas as pd
import numpy as np
from my_modules import data_imports as data
from my_modules import feature_engineering as fe
from termcolor import colored, cprint

click_data = data.import_ad_clicks_data()
click_data.head(10)

score_dict = {}

Ad Click data imported


### Basic Feature Engineering
Before jumping into the deep end of feature engineering, we need a base model to build on. So let's do some simple engineering:
- Dealing with Timestamps
- Label Encoding

#### Timestamps

In [2]:
# Timestamps
clicks = click_data.copy()
clicks['day'] = clicks['click_time'].dt.day.astype('uint8')
clicks['hour'] = clicks['click_time'].dt.hour.astype('uint8')
clicks['minute'] = clicks['click_time'].dt.minute.astype('uint8')
clicks['second'] = clicks['click_time'].dt.second.astype('uint8')

print(clicks.head())


ip  app  device  os  channel          click_time      attributed_time  \
0   89489    3       1  13      379 2017-11-06 15:13:23                  NaN   
1  204158   35       1  13       21 2017-11-06 15:41:07  2017-11-07 08:17:19   
2    3437    6       1  13      459 2017-11-06 15:42:32                  NaN   
3  167543    3       1  13      379 2017-11-06 15:56:17                  NaN   
4  147509    3       1  13      379 2017-11-06 15:57:01                  NaN   

   is_attributed  day  hour  minute  second  
0              0    6    15      13      23  
1              1    6    15      41       7  
2              0    6    15      42      32  
3              0    6    15      56      17  
4              0    6    15      57       1  


#### Label Encoding
For the baseline model, let's just use scikit-learn's ```LabelEncoder``` to create new features in the clicks **DataFrame**. The new columns should be the original name with *_labels* appended.


In [3]:
from sklearn.preprocessing import LabelEncoder

cat_features = ['ip', 'app', 'device', 'os', 'channel']
encoder = LabelEncoder()

for feature in cat_features:
    encoded = encoder.fit_transform(clicks[feature])
    clicks[feature + '_labels'] = encoded

clicks.head()


Unnamed: 0,ip,app,device,os,channel,click_time,attributed_time,is_attributed,day,hour,minute,second,ip_labels,app_labels,device_labels,os_labels,channel_labels
0,89489,3,1,13,379,2017-11-06 15:13:23,,0,6,15,13,23,27226,3,1,13,120
1,204158,35,1,13,21,2017-11-06 15:41:07,2017-11-07 08:17:19,1,6,15,41,7,110007,35,1,13,10
2,3437,6,1,13,459,2017-11-06 15:42:32,,0,6,15,42,32,1047,6,1,13,157
3,167543,3,1,13,379,2017-11-06 15:56:17,,0,6,15,56,17,76270,3,1,13,120
4,147509,3,1,13,379,2017-11-06 15:57:01,,0,6,15,57,1,57862,3,1,13,120


### Train/Test/Validate Splits
There is one thing we need to be careful with regarding our data.

#### Time Series
Our data is a *time series*. Date and time matter when building train and test sets. Since our model needs to predict events in the future, we must also make sure we validate the model on events in the future. If the data is mixed up between training and test sets, then future data will leak into the model and our validation results will overestimate the performance on new data.

Let's first sort in order of increasing time. The first 80% of rows will become the train set, the next 10% will be validation, and last 10% will be our test set.
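
The split itself lives in `fe.get_ad_splits`, whose implementation is not shown here. As a rough illustration only, a time-ordered 80/10/10 split could be sketched as below. The helper name `time_ordered_splits`, its parameters, and the toy data are assumptions made for this example.

```python
import pandas as pd

def time_ordered_splits(df, time_col="click_time", valid_fraction=0.1):
    """Hypothetical helper: sort by time, then slice into train/valid/test."""
    df_sorted = df.sort_values(time_col)
    valid_rows = int(len(df_sorted) * valid_fraction)
    # Everything before the last 2 * valid_rows rows is training data
    train = df_sorted[:-valid_rows * 2]
    valid = df_sorted[-valid_rows * 2:-valid_rows]
    test = df_sorted[-valid_rows:]
    return train, valid, test

# Toy data: 20 clicks, one per minute, deliberately shuffled
toy = pd.DataFrame({
    "click_time": pd.date_range("2017-11-06 15:00", periods=20, freq="min"),
    "is_attributed": [0, 1] * 10,
}).sample(frac=1, random_state=0)

toy_train, toy_valid, toy_test = time_ordered_splits(toy)
```

Because the rows are sorted before slicing, every validation click happens after every training click, and every test click happens after every validation click.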

In [4]:
base_feature_cols = ['day', 'hour', 'minute', 'second',
                'ip_labels', 'app_labels', 'device_labels',
                'os_labels', 'channel_labels']
# Split implementation moved to my_modules
base_train, base_valid, base_test = fe.get_ad_splits(clicks)

### LightGBM Training
Now let's construct LightGBM dataset objects for each of the smaller datasets.

In [5]:
# bst, valid_score = fe.train_ad_model(base_train, base_valid, feature_cols=base_feature_cols)

<a id='encode'></a>
## <font color=blue>2. Categorical Encoding</font>
Now let's get started by implementing some more complicated encodings and testing them using our ```train_ad_model``` function. We'll be implementing the following encodings:
- Count Encoding
- Target Encoding
- CatBoost Encoding

In [6]:
# click_data is the clean imported data
# clicks is timestamp and label encoded

# Let's do a check on unencoded data
print(colored("Unencoded Model", 'yellow'))
train, valid, test = fe.get_ad_splits(click_data)
unencoded_sc = fe.train_ad_model(train, valid)
print(colored("Baseline Model (Timestamp + Simple Label Encoding)", 'yellow'))
baseline_sc = fe.train_ad_model(base_train, base_valid, feature_cols=base_feature_cols)

Unencoded Model
Training model...
Validation AUC score: 0.9618
Baseline Model (Timestamp + Simple Label Encoding)
Training model...
Validation AUC score: 0.9623


In [7]:
score_dict["Unencoded"] = unencoded_sc[1]
score_dict["Baseline"] = baseline_sc[1]

#### Quick note on encoding and leakages
The encodings we're about to work with are all calculated from statistics such as counts and means. The current columns are:
- ip
- app
- device
- os
- channel
- click_time
- attributed_time
- is_attributed (target)

With regard to data leakage, we need to make sure that we calculate the encodings from ONLY the training set, to avoid overestimating the model's performance.

### Count encodings
- Count encoding is based on *replacing categories with their counts* computed on the train set.

First off, let's count encode the features ```['ip', 'app', 'device', 'os', 'channel'] ``` using ```CountEncoder```. 

Because we need to avoid data leakage as pointed out above, we need to be sure to first **fit** *then* **transform** our data. This way, our encoding is only fit to the training data and not the valid/test sets.
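
To make the fit-then-transform idea concrete, here is a minimal hand-rolled sketch of count encoding in plain pandas. It is not how `CountEncoder` is implemented internally; the toy frames and the choice to give unseen categories a count of 0 are assumptions made for this example.

```python
import pandas as pd

train_toy = pd.DataFrame({"app": [3, 3, 3, 35, 6, 6]})
valid_toy = pd.DataFrame({"app": [3, 6, 99]})  # 99 never appears in train

# "Fit": learn the counts from the training rows only
app_counts = train_toy["app"].value_counts()

# "Transform": look up each value; unseen categories get a count of 0 here
train_toy["app_count"] = train_toy["app"].map(app_counts)
valid_toy["app_count"] = valid_toy["app"].map(app_counts).fillna(0).astype(int)
```

The validation rows never contribute to `app_counts`, which is the property that keeps the encoding free of leakage.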

In [8]:
import category_encoders as ce

cat_features = ['ip', 'app', 'device', 'os', 'channel']
train, valid, test = fe.get_ad_splits(clicks)

# count encoder
count_enc = ce.CountEncoder(cols=cat_features)

# learn the encoding from the training set
count_enc.fit(train[cat_features])

# apply the fit encoding to the training and valid sets
train_encoded = train.join(count_enc.transform(train[cat_features]).add_suffix('_count'))
valid_encoded = valid.join(count_enc.transform(valid[cat_features]).add_suffix('_count'))

count_sc = fe.train_ad_model(train_encoded, valid_encoded)

Training model...
Validation AUC score: 0.9650


In [9]:
score_dict["Count"] = count_sc[1]
fe.print_scores(score_dict)

Scores so far...
Unencoded: 0.9618
Baseline: 0.9623
Count: 0.9650
Best score so far: Count : 0.9650


#### Sidenote: Why is count encoding effective?
One reason count encoding works is that rare values tend to have similar counts (with values like 1 or 2), so you can easily classify rare values together at prediction time. Common values with large counts are unlikely to have the same count as other values. So the common/important values get their own groupings with count encoding.

### Target Encoding
- Target encoding is the *process of replacing a categorical value with the mean of the target variable* based on the train set (be careful of data leakage).

We'll be using the same ```['ip', 'app', 'device', 'os', 'channel'] ``` for target encoding the train set.


In [10]:
cat_features = ['ip', 'app', 'device', 'os', 'channel']
train, valid, test = fe.get_ad_splits(clicks)

# target_encoder
target_enc = ce.TargetEncoder(cols=cat_features)

target_enc.fit(train[cat_features], train['is_attributed'])

# apply the encoding
train_encoded = train.join(target_enc.transform(train[cat_features]).add_suffix('_target'))
valid_encoded = valid.join(target_enc.transform(valid[cat_features]).add_suffix('_target'))

target_sc = fe.train_ad_model(train_encoded, valid_encoded)

Training model...
Validation AUC score: 0.9541


In [11]:
score_dict["Target w/ ip"]= target_sc[1]
fe.print_scores(score_dict)

Scores so far...
Unencoded: 0.9618
Baseline: 0.9623
Count: 0.9650
Target w/ ip: 0.9541
Best score so far: Count : 0.9650


Well, our target score isn't as good as we had hoped. As pointed out by our great friend Kaggle, upon dropping ```ip``` from our categorical encoded features, our model actually improves. Why is that?

Target encoding attempts to measure the population mean of the target for each level in the categorical feature. This means that when there is less data per level, the estimated mean will be further away from the 'true' mean. There is very little data per IP address, so the estimates are likely to be much noisier than for the other features. The model will rely heavily on this feature because it looks extremely predictive on the training set. This, in a way, overfits our model, and it will perform poorly when introduced to new IP addresses. Going forward, let's leave IP out of it.
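
The noise problem is easy to see on a toy example. The sketch below computes a plain per-level target mean by hand; `ce.TargetEncoder` also applies smoothing, which this sketch leaves out. The toy data and the fallback to the overall training mean for unseen levels are assumptions made for illustration.

```python
import pandas as pd

train_toy = pd.DataFrame({
    "ip":            [1, 2, 3, 4, 4, 4, 4],
    "is_attributed": [1, 0, 0, 1, 0, 0, 0],
})

# Mean of the target per level, learned from the training rows only
stats = train_toy.groupby("ip")["is_attributed"].agg(["mean", "count"])

# Levels seen once can only get a mean of exactly 0.0 or 1.0
single_row_levels = stats[stats["count"] == 1]

# Unseen levels fall back to the overall training mean in this sketch
global_mean = train_toy["is_attributed"].mean()
valid_toy = pd.DataFrame({"ip": [4, 1, 99]})
valid_toy["ip_target"] = valid_toy["ip"].map(stats["mean"]).fillna(global_mean)
```

An ip seen only once looks like a perfect predictor on the training rows, yet a single observation says very little about how that ip behaves on new clicks.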

In [12]:
cat_features = ['app', 'device', 'os', 'channel']
train, valid, test = fe.get_ad_splits(clicks)

# target_encoder
target_enc = ce.TargetEncoder(cols=cat_features)

target_enc.fit(train[cat_features], train['is_attributed'])

# apply the encoding
train_encoded = train.join(target_enc.transform(train[cat_features]).add_suffix('_target'))
valid_encoded = valid.join(target_enc.transform(valid[cat_features]).add_suffix('_target'))

target_sc_no_ip = fe.train_ad_model(train_encoded, valid_encoded)

Training model...
Validation AUC score: 0.9627


In [13]:
score_dict["Target w/o ip"] = target_sc_no_ip[1]
fe.print_scores(score_dict)

Scores so far...
Unencoded: 0.9618
Baseline: 0.9623
Count: 0.9650
Target w/ ip: 0.9541
Target w/o ip: 0.9627
Best score so far: Count : 0.9650


### CatBoost
CatBoost Encoder is very similar to leave-one-out encoding, but calculates the values 'on-the-fly'. It is said to work fairly well with LightGBM.

#### Sidenote: ```leave-one-out``` encoding
```leave-one-out``` encoding is very similar to target encoding but excludes the current row's target when calculating the mean target for a level.
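
As a rough sketch of the leave-one-out idea, the per-row value can be computed by removing the current row's target from both the level's sum and its count. The toy data and the column name `app_loo` are assumptions made for this example, and levels that appear only once would need separate handling because the denominator becomes zero.

```python
import pandas as pd

toy = pd.DataFrame({
    "app":           [3, 3, 3, 6, 6],
    "is_attributed": [1, 0, 1, 0, 1],
})

grouped = toy.groupby("app")["is_attributed"]
level_sum = grouped.transform("sum")      # total downloads per app
level_count = grouped.transform("count")  # number of rows per app

# Mean of the target over all *other* rows that share the same level
toy["app_loo"] = (level_sum - toy["is_attributed"]) / (level_count - 1)
```

Two rows with the same `app` can now receive different encoded values, because each row's own label is excluded from its estimate.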


In [14]:
cat_features = ['app', 'device', 'os', 'channel']
train, valid, test = fe.get_ad_splits(clicks)

cb_enc = ce.CatBoostEncoder(cols=cat_features)

cb_enc.fit(train[cat_features], train['is_attributed'])

train_encoded = train.join(cb_enc.transform(train[cat_features]).add_suffix('_cb'))
valid_encoded = valid.join(cb_enc.transform(valid[cat_features]).add_suffix('_cb'))

cb_sc = fe.train_ad_model(train_encoded, valid_encoded)

Training model...
Validation AUC score: 0.9627


In [15]:
score_dict["CatBoosting"] = cb_sc[1]
fe.print_scores(score_dict)

Scores so far...
Unencoded: 0.9618
Baseline: 0.9623
Count: 0.9650
Target w/ ip: 0.9541
Target w/o ip: 0.9627
CatBoosting: 0.9627
Best score so far: Count : 0.9650


### Results
Out of all of the encodings we tried, count encoding performed the best (0.9650). 'Target w/o ip' and 'CatBoosting' tied for second at 0.9627.

<a id='generate'></a>
## 3. Generating Features
Now let's go over some feature generation with our data (clicks).

### Adding interaction features
Let's add interaction features (combining categorical features) for each pair of categorical features (ip, app, device, os, channel). The easiest way to accomplish this is by using ```itertools.combinations```. For each new column, we must join the values as strings with an underscore, so 13 and 47 will become ```"13_47"```. We also need to make sure to label encode the new features.


In [16]:
from itertools import combinations

cat_features = ['ip', 'app', 'device', 'os', 'channel']
interactions = pd.DataFrame(index=clicks.index)
label_enc = LabelEncoder()
for comb in combinations(cat_features, 2):
    new_feat = comb[0] + "_" + comb[1]
    interactions[new_feat]= label_enc.fit_transform(
        clicks[comb[0]].astype(str) + "_" + clicks[comb[1]].astype(str)
    )
cprint('Label Encoded Interactions', 'cyan')
print(interactions.head(10))

clicks = clicks.join(interactions)
cprint('Clicks w/ interactions', 'cyan')
print(clicks.head(10))

Label Encoded Interactions
   ip_app  ip_device   ip_os  ip_channel  app_device  app_os  app_channel  \
0  838251     327558  844429     1204595        3543    3973          621   
1  324479     110989  324393      473773        3486    3715          561   
2  590903     264762  590544      795240        4180    5063          777   
3  219558      67781  221287      333763        3543    3973          621   
4  163918      44449  166639      260146        3543    3973          621   
5  765699     314168  769653     1077852        1187    1306          199   
6  787895     318291  792428     1116302        1097    1032          154   
7  277551      91661  278376      411425        3415    3374          507   
8   68168      12767   70679      118063        3085    2110          318   
9  674039     296638  674993      919684        3543    3985          610   

   device_os  device_channel  os_channel  
0        795            1534        1123  
1        795            1465  

In [17]:
train, valid, test = fe.get_ad_splits(clicks)
inter_score = fe.train_ad_model(train, valid)


Training model...
Validation AUC score: 0.9624


In [18]:
score_dict["Interactions"] = inter_score[1]
fe.print_scores(score_dict)

Scores so far...
Unencoded: 0.9618
Baseline: 0.9623
Count: 0.9650
Target w/ ip: 0.9541
Target w/o ip: 0.9627
CatBoosting: 0.9627
Interactions: 0.9624
Best score so far: Count : 0.9650


### Adding numerical features
Adding interactions for categorical columns is a quick and easy way to create more categorical data. It's also a good idea to add numeric features, which will typically improve our model. Numerical features take a little bit of experimenting and brainstorming, but luckily our Kaggle mentors have given us ideas.

#### Number of events in the past six hours
The first feature we'll create is the number of events from the same ip in the last six hours. It's likely that someone who is visiting often will be more likely to download the app.

Let's implement a function ```count_past_events``` that takes a Series of click times (timestamps) and returns another Series with the number of events in the last six hours.
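
Before applying this to the full dataset, it helps to see the time-based rolling window on a handful of timestamps. The toy timestamps and variable names below are made up for illustration.

```python
import pandas as pd

times = pd.Series(pd.to_datetime([
    "2017-11-06 00:00", "2017-11-06 01:00",
    "2017-11-06 05:00", "2017-11-06 08:00",
]))

# Put the timestamps in the index so a time-based window can be used
by_time = pd.Series(times.index, index=times)

# Count rows in each 6-hour window, minus 1 to exclude the current event
past_6h = by_time.rolling("6h").count() - 1
```

The 08:00 event only sees the 05:00 event, since the clicks at 00:00 and 01:00 have already fallen out of its six-hour window.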

In [19]:
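def count_past_events(series):
    """
    Return the number of events in the six hours before each event.
        series: Series of click times (timestamps)
    """
    # Assign time as the index so we can sort and roll
    conv = pd.Series(series.index, index=series).sort_index()
    # Subtract 1 so the current event isn't counted
    return conv.rolling('6h').count() - 1

# Note: called on the whole column, this counts events across all ips
past_events = count_past_events(clicks.click_time)
# .values assigns by position, which assumes clicks is already sorted by click_time
clicks['ip_past_6hr_counts'] = past_events.values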
train, valid, _ = fe.get_ad_splits(clicks)
num_feat_score = fe.train_ad_model(train, valid)

Training model...
Validation AUC score: 0.9619


In [20]:
score_dict["Num Feature"] = num_feat_score[1]
fe.print_scores(score_dict)

Scores so far...
Unencoded: 0.9618
Baseline: 0.9623
Count: 0.9650
Target w/ ip: 0.9541
Target w/o ip: 0.9627
CatBoosting: 0.9627
Interactions: 0.9624
Num Feature: 0.9619
Best score so far: Count : 0.9650


#### Time since last event
Next we'll implement a ```time_diff``` method that calculates the time since the last event in seconds from a Series of timestamps.
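
A small toy example shows what a per-ip time difference looks like. This sketch uses the built-in groupby `diff` rather than a custom function passed to `transform`, and the data is made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "ip": [1, 2, 1, 1, 2],
    "click_time": pd.to_datetime([
        "2017-11-06 15:00:00", "2017-11-06 15:00:10",
        "2017-11-06 15:00:30", "2017-11-06 15:02:00",
        "2017-11-06 15:05:10",
    ]),
})

# Difference to the previous click from the same ip, in seconds
toy["time_since_last"] = toy.groupby("ip")["click_time"].diff().dt.total_seconds()
```

The first click from each ip has no predecessor, so it comes out as NaN and has to be filled or otherwise handled before training.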

In [21]:
def time_diff(series):
    """Returns a series with the time since the last timestamp in seconds"""
    return series.diff().dt.total_seconds()

timedeltas = clicks.groupby('ip')['click_time'].transform(time_diff)
timedeltas

0             NaN
1             NaN
2             NaN
3             NaN
4             NaN
            ...  
2300556    2820.0
2300557      22.0
2300558      78.0
2300559    1774.0
2300560      23.0
Name: click_time, Length: 2300561, dtype: float64

In [22]:
# let's fill in NaNs with the median and reindex
timedeltas = timedeltas.fillna(timedeltas.median()).reindex(clicks.index)
cprint('Fixed timedeltas', 'cyan')
print(timedeltas.head(10))
clicks['time_since_last'] = timedeltas

Fixed timedeltas
0    1309.0
1    1309.0
2    1309.0
3    1309.0
4    1309.0
5    1309.0
6    1309.0
7    1309.0
8    1309.0
9    1309.0
Name: click_time, dtype: float64


In [23]:
train, valid, _ = fe.get_ad_splits(clicks)
timedelta_score = fe.train_ad_model(train, valid)

Training model...
Validation AUC score: 0.9636


In [24]:
score_dict['timedeltas'] = timedelta_score[1]
fe.print_scores(score_dict)

Scores so far...
Unencoded: 0.9618
Baseline: 0.9623
Count: 0.9650
Target w/ ip: 0.9541
Target w/o ip: 0.9627
CatBoosting: 0.9627
Interactions: 0.9624
Num Feature: 0.9619
timedeltas: 0.9636
Best score so far: Count : 0.9650


#### Number of previous app downloads
It's likely that if a visitor downloaded an app previously, it'll affect the likelihood they'll download one again. Let's implement ```previous_attributions``` that returns a Series with the number of times an app has been downloaded (where ```is_attributed == 1```) before the current event.
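
The expanding-window trick is easiest to follow on a tiny Series. The values below are made up, and the sketch also shows what changes when `min_periods=2` is passed.

```python
import pandas as pd

downloads = pd.Series([0, 1, 0, 1, 1])

# Running total including the current row, then subtract the current row
previous = downloads.expanding().sum() - downloads

# With min_periods=2 the first row has too few observations and becomes NaN
with_min_periods = downloads.expanding(min_periods=2).sum() - downloads
```

Both versions agree from the second row onward. The only difference is whether the very first event gets 0 or NaN.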

In [25]:
# Hint: Here we want a window that always starts at the first row but expands as we get further in the data. 
# We can use the .expanding methods for this. Also, the current row is included in the window, so we need to subtract that off as well

def previous_attributions(series):
    """ Returns a series with the number of previous downloads """
    # min_periods=2 leaves NaN in the first row (no earlier events to count)
    return series.expanding(min_periods=2).sum() - series

past = previous_attributions(clicks.is_attributed)
clicks['prev_clicks'] = past
train, valid, _ = fe.get_ad_splits(clicks)
prev_clicks_score = fe.train_ad_model(train, valid)

Training model...
Validation AUC score: 0.9629


In [26]:
score_dict['prev_clicks'] = prev_clicks_score[1]
fe.print_scores(score_dict)

Scores so far...
Unencoded: 0.9618
Baseline: 0.9623
Count: 0.9650
Target w/ ip: 0.9541
Target w/o ip: 0.9627
CatBoosting: 0.9627
Interactions: 0.9624
Num Feature: 0.9619
timedeltas: 0.9636
prev_clicks: 0.9629
Best score so far: Count : 0.9650


<a id='select'></a>
## 4. Feature Selection
In this last part of our exercise, we'll use feature selection algorithms to improve our model. Let's get set up for this section.


In [27]:
# inter_clicks, num_feat_clicks, time_clicks, prev_clicks
# n = clicks.merge(inter_clicks, how='outer', suffixes=('', ''))
train, valid, _ = fe.get_ad_splits(clicks)
cprint('Feature Selection Baseline Score', 'cyan')
_ = fe.train_ad_model(train, valid)
cprint(f'Number of columns: {len(clicks.columns)}', 'green')


Feature Selection Baseline Score
Training model...
Validation AUC score: 0.9629
Number of columns: 30


Moving forward, we'll want to make sure to only use the training set for selecting which features to use. Currently, clicks has 30 columns (27 candidate features once click_time, attributed_time, and is_attributed are excluded), which may cause overfitting. Removing some features will help counteract this, though it may decrease performance slightly. In exchange, the model becomes smaller and faster without losing too much performance.

### Univariate Feature Selection
Let's start by reviewing/using ```SelectKBest``` with the ```f_classif``` scoring function.
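
As a quick warm-up on synthetic data, the sketch below shows `SelectKBest` picking out the one column that is actually related to the target. It reads the kept columns from the selector's `get_support()` mask, which is an alternative to inspecting the variance of the inverse-transformed data. The column names and the data are invented for this example.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
y = pd.Series(rng.integers(0, 2, size=200), name="target")
X = pd.DataFrame({
    "informative": y + rng.normal(scale=0.1, size=200),  # tracks the target
    "noise_a": rng.normal(size=200),                      # unrelated
    "noise_b": rng.normal(size=200),                      # unrelated
})

selector = SelectKBest(f_classif, k=1).fit(X, y)

# get_support() returns a boolean mask aligned with the input columns
kept = X.columns[selector.get_support()]
dropped = X.columns[~selector.get_support()]
```

Because the mask lines up with `X.columns`, the kept and dropped column names can be read off directly without rebuilding a DataFrame.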

In [28]:
from sklearn.feature_selection import SelectKBest, f_classif

# remove target-related columns and split
feature_cols = clicks.columns.drop(['click_time', 'attributed_time', 'is_attributed'])
train, valid, test = fe.get_ad_splits(clicks)

# create the selector, let's keep 20 features
selector = SelectKBest(f_classif, k=20)

# the first value of prev_clicks is NaN (see previous_attributions), so fill it with 0
train = train.fillna({'prev_clicks': 0})

# use selector to get the best features
X_new = selector.fit_transform(train[feature_cols], train['is_attributed'])
# X_new

# get back the kept features as a DataFrame with the dropped columns as all 0s
selected_features = pd.DataFrame(selector.inverse_transform(X_new),
                                 index=train.index,
                                 columns=feature_cols)
# selected_features

# Find the columns that were kept/dropped
dropped_columns = selected_features.columns[selected_features.var() == 0]
selected_features = selected_features.columns[selected_features.var() != 0]

In [29]:
cprint(f'Dropped Columns ({len(dropped_columns)})', 'blue')
print(dropped_columns)
cprint(f'Kept Columns ({len(selected_features)})', 'green')
print(selected_features)

Dropped Columns (7)
Index(['device', 'os', 'day', 'minute', 'second', 'os_channel',
       'time_since_last'],
      dtype='object')
Kept Columns (20)
Index(['ip', 'app', 'channel', 'hour', 'ip_labels', 'app_labels',
       'device_labels', 'os_labels', 'channel_labels', 'ip_app', 'ip_device',
       'ip_os', 'ip_channel', 'app_device', 'app_os', 'app_channel',
       'device_os', 'device_channel', 'ip_past_6hr_counts', 'prev_clicks'],
      dtype='object')


In [30]:
cprint('Feature Selection KBest Score', 'cyan')
_ = fe.train_ad_model(train.drop(dropped_columns, axis=1), valid.drop(dropped_columns, axis=1))
cprint('Feature Selection KBest Score (Using dropped columns)', 'cyan')
_ = fe.train_ad_model(train.drop(selected_features, axis=1), valid.drop(selected_features, axis=1)) 

Feature Selection KBest Score
Training model...
Validation AUC score: 0.9612
Feature Selection KBest Score (Using dropped columns)
Training model...
Validation AUC score: 0.9169


With this method, we can find the best K features, but we still have to choose K ourselves. So how do we know what the "best" value of K is?

The solution is basically brute force. We would want to train multiple models with increasing values of K, and find which one performed the best.

### L1 Regularization 
Now let's try a more powerful approach using L1 Regularization. Let's implement a function for this (```select_features_l1```) that returns a list of features to keep. This idea can be used in the real world when training a model.

Let's use a ```LogisticRegression``` classifier model with an L1 penalty to select the features. For the model, let's set the random state to 7 and the regularization param to 0.1. We can fit the model, then use ```SelectFromModel``` to return a model with the selected features.

In [31]:
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectFromModel

def select_features_l1(X, y):
    """ Return selected features using logistic regression with an L1 penalty """
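    # Regularization param to 0.1, l1 penalty and random_state=7
    # (liblinear is set explicitly because the default solver in newer scikit-learn does not support L1)
    log = LogisticRegression(C=0.1, penalty="l1", solver="liblinear", random_state=7, verbose=1).fit(X, y)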
    
    # Select from the model and transform
    model = SelectFromModel(log, prefit=True)
    X_new = model.transform(X)
    # X_new

    # Get the selected features via inverse transform
    selected_features = pd.DataFrame(model.inverse_transform(X_new),
                                     index=X.index,
                                     columns=X.columns) 

    # Dropped columns have all 0s, keep the others
    selected_columns = selected_features.columns[selected_features.var() != 0]

    return selected_columns

train, valid, _ = fe.get_ad_splits(clicks)
train = train.fillna({'prev_clicks': 0})
X, y = train[train.columns.drop(['click_time', 'attributed_time', 'is_attributed'])], train['is_attributed']
selected_features = select_features_l1(X, y)
selected_features

[LibLinear]

In [0]:
# selected_features is an Index of kept column names, so drop those from X's columns
dropped_cols = X.columns.drop(selected_features)
dropped_cols