# Feature Engineering Ad Clicks Exercise

The dataset for this exercise is a sample from the Kaggle 'TalkingData AdTracking' competition. All positive samples (where is_attributed == 1) were kept, while 99% of negative samples were discarded. This sample has roughly 20% positive examples.
The data has the features shown in the preview below (note that ip, app, device, os, and channel are encoded).

The overall goal is to predict whether a user will download an app after clicking a mobile ad.
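
The sampling scheme described above (keep every positive, keep 1% of negatives) can be sketched as follows. This is a minimal illustration, not the code used to build the competition sample: `downsample_negatives` is a hypothetical helper, and the synthetic frame with a 0.25% positive rate is an assumption chosen only to show how the class balance shifts.

```python
import numpy as np
import pandas as pd


def downsample_negatives(df, target="is_attributed", keep_frac=0.01, seed=0):
    """Keep all positive rows and a random fraction of the negative rows."""
    positives = df[df[target] == 1]
    negatives = df[df[target] == 0].sample(frac=keep_frac, random_state=seed)
    # Restore the original row order after combining the two groups.
    return pd.concat([positives, negatives]).sort_index()


# Synthetic stand-in for the full click log (hypothetical data).
# The 0.25% positive rate is an assumption chosen for illustration.
rng = np.random.default_rng(0)
full = pd.DataFrame({"is_attributed": (rng.random(100_000) < 0.0025).astype(int)})

sample = downsample_negatives(full)
print(f"positive rate before: {full['is_attributed'].mean():.4f}")
print(f"positive rate after:  {sample['is_attributed'].mean():.4f}")
```

Because the negatives are thinned so heavily, probabilities predicted by a model trained on the sample are not calibrated to the original click stream. Ranking metrics such as AUC are much less sensitive to this than raw probabilities are.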

## 1. Baseline Modeling
Using what we learned in the overview, let's set up our baseline model.


In [1]:
import pandas as pd
from my_modules import data_imports as data
from my_modules import feature_engineering as fe

click_data = data.import_ad_clicks_data()
click_data.head(10)

Ad Click data imported


Unnamed: 0,ip,app,device,os,channel,click_time,attributed_time,is_attributed
0,89489,3,1,13,379,2017-11-06 15:13:23,,0
1,204158,35,1,13,21,2017-11-06 15:41:07,2017-11-07 08:17:19,1
2,3437,6,1,13,459,2017-11-06 15:42:32,,0
3,167543,3,1,13,379,2017-11-06 15:56:17,,0
4,147509,3,1,13,379,2017-11-06 15:57:01,,0
5,71421,15,1,13,153,2017-11-06 16:00:00,,0
6,76953,14,1,13,379,2017-11-06 16:00:01,,0
7,187909,2,1,25,477,2017-11-06 16:00:01,,0
8,116779,1,1,8,150,2017-11-06 16:00:01,,0
9,47857,3,1,15,205,2017-11-06 16:00:01,,0


### Basic Feature Engineering
Before jumping into the deep end of feature engineering, we need a base model to build on. So let's do some simple engineering:
- Dealing with Timestamps
- Label Encoding

#### Timestamps

In [2]:
# Timestamps
clicks = click_data.copy()
clicks['day'] = clicks['click_time'].dt.day.astype('uint8')
clicks['hour'] = clicks['click_time'].dt.hour.astype('uint8')
clicks['minute'] = clicks['click_time'].dt.minute.astype('uint8')
clicks['second'] = clicks['click_time'].dt.second.astype('uint8')

print(clicks.head())


       ip  app  device  os  channel          click_time      attributed_time  \
0   89489    3       1  13      379 2017-11-06 15:13:23                  NaN   
1  204158   35       1  13       21 2017-11-06 15:41:07  2017-11-07 08:17:19   
2    3437    6       1  13      459 2017-11-06 15:42:32                  NaN   
3  167543    3       1  13      379 2017-11-06 15:56:17                  NaN   
4  147509    3       1  13      379 2017-11-06 15:57:01                  NaN   

   is_attributed  day  hour  minute  second  
0              0    6    15      13      23  
1              1    6    15      41       7  
2              0    6    15      42      32  
3              0    6    15      56      17  
4              0    6    15      57       1  


#### Label Encoding
For the baseline model, let's just use scikit-learn's ```LabelEncoder``` to create new features in the clicks **DataFrame**. Each new column takes the original column name with *_labels* appended.


In [3]:
from sklearn.preprocessing import LabelEncoder

cat_features = ['ip', 'app', 'device', 'os', 'channel']
encoder = LabelEncoder()

for feature in cat_features:
    encoded = encoder.fit_transform(clicks[feature])
    clicks[feature + '_labels'] = encoded

clicks.head()


Unnamed: 0,ip,app,device,os,channel,click_time,attributed_time,is_attributed,day,hour,minute,second,ip_labels,app_labels,device_labels,os_labels,channel_labels
0,89489,3,1,13,379,2017-11-06 15:13:23,,0,6,15,13,23,27226,3,1,13,120
1,204158,35,1,13,21,2017-11-06 15:41:07,2017-11-07 08:17:19,1,6,15,41,7,110007,35,1,13,10
2,3437,6,1,13,459,2017-11-06 15:42:32,,0,6,15,42,32,1047,6,1,13,157
3,167543,3,1,13,379,2017-11-06 15:56:17,,0,6,15,56,17,76270,3,1,13,120
4,147509,3,1,13,379,2017-11-06 15:57:01,,0,6,15,57,1,57862,3,1,13,120
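
To see what the encoder is doing, here is a small standalone example using a few channel values from the preview. `LabelEncoder` sorts the distinct values and replaces each one with its position in that sorted list, which is why a raw channel id such as 379 shows up as a much smaller code in the output above.

```python
from sklearn.preprocessing import LabelEncoder

channels = [379, 21, 459, 379]
encoder = LabelEncoder()
codes = encoder.fit_transform(channels)

print(codes)             # [1 0 2 1]
print(encoder.classes_)  # [ 21 379 459]
```

Note that calling `fit_transform` again on the same encoder object overwrites `classes_`, so after the loop above the encoder only remembers the mapping for the last column. Keep one encoder per column if the mappings need to be reused later.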


### Train/Test/Validate Splits
There is one thing we need to be careful with regarding our data.

#### Time Series
Our data is a *time series*. Date and time matter when building the train and test sets. Since our model needs to predict events in the future, we must also make sure we validate the model on events in the future. If the data is mixed up between the training and test sets, then future data will leak into the model and our validation results will overestimate the performance on new data.

Let's first sort in order of increasing time. The first 80% of rows will become the train set, the next 10% will be the validation set, and the last 10% will be our test set.

In [4]:
feature_cols = ['day', 'hour', 'minute', 'second',
                'ip_labels', 'app_labels', 'device_labels',
                'os_labels', 'channel_labels']
# Split implementation moved to my_modules
train, valid, test = fe.get_ad_splits(clicks)
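
The body of `get_ad_splits` lives in `my_modules` and is not shown here. As a rough sketch of the 80/10/10 time-ordered split described above, something like the following would work; `time_ordered_splits` and its parameters are hypothetical names, and the real helper may differ.

```python
import pandas as pd


def time_ordered_splits(df, time_col="click_time", valid_fraction=0.1):
    """Split into train/valid/test by time: 80% / 10% / 10% by default."""
    ordered = df.sort_values(time_col)
    valid_rows = int(len(ordered) * valid_fraction)
    if valid_rows == 0:
        raise ValueError("Not enough rows to form validation and test sets")
    # Earliest rows train the model; the two latest slices are held out.
    train = ordered.iloc[:-2 * valid_rows]
    valid = ordered.iloc[-2 * valid_rows:-valid_rows]
    test = ordered.iloc[-valid_rows:]
    return train, valid, test


# Shuffled toy data (hypothetical) to show that the split restores time order.
demo = pd.DataFrame({
    "click_time": pd.date_range("2017-11-06", periods=20, freq="min"),
    "is_attributed": [0, 1] * 10,
}).sample(frac=1, random_state=0)

train_demo, valid_demo, test_demo = time_ordered_splits(demo)
print(len(train_demo), len(valid_demo), len(test_demo))  # 16 2 2
```

Every row in the validation slice is later than every training row, and every test row is later than every validation row, so no future clicks leak backwards into training.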

### LightGBM Training
Now let's construct LightGBM dataset objects for the train and validation splits and train the model.

In [5]:
bst, valid_score = fe.train_ad_model(train, valid, feature_cols=feature_cols)

Training model...
Validation AUC score: 0.9623


## 2. Feature Engineering (intermediate encoding)
Now let's implement some more complicated encodings and test them using our ```train_ad_model``` function. We'll be implementing the following encodings:
- Count Encoding
- Target Encoding
- Leave-one-out Encoding
- CatBoost Encoding
- Feature embedding with SVD