# Feature Engineering Ad Clicks Exercise

The dataset we will be using for this exercise is a sample from the 'TalkingData AdTracking' Kaggle competition. All positive samples (where is_attributed == 1) were kept, while 99% of negative samples were discarded. This sample has roughly 20% positive examples.
The data has the following features: ip, app, device, os, channel, click_time, attributed_time, and is_attributed (the target). Note that ip, app, device, os, and channel are encoded.

The overall goal is to predict whether a user will download an app after clicking a mobile ad.

## 1. Baseline Modeling
Using what we learned in the overview, let's set up our baseline model.


In [ ]:
import pandas as pd
from my_modules import data_imports as data
from my_modules import feature_engineering as fe
from termcolor import colored

click_data = data.import_ad_clicks_data()
click_data.head(10)

### Basic Feature Engineering
Before jumping into the deep end of feature engineering, we need a base model to build on. So let's do some simple engineering:
- Dealing with Timestamps
- Label Encoding

#### Timestamps

In [ ]:
# Timestamps
clicks = click_data.copy()
clicks['day'] = clicks['click_time'].dt.day.astype('uint8')
clicks['hour'] = clicks['click_time'].dt.hour.astype('uint8')
clicks['minute'] = clicks['click_time'].dt.minute.astype('uint8')
clicks['second'] = clicks['click_time'].dt.second.astype('uint8')

print(clicks.head())
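
The ```.dt``` accessor used above only works on datetime columns. As a minimal sketch on made-up rows, assuming ```click_time``` arrives as strings (it may already be parsed by ```data.import_ad_clicks_data()```; that helper is not shown here), the same extraction could look like this:

```python
import pandas as pd

# Hypothetical toy data: click_time stored as strings, as in a raw CSV.
toy = pd.DataFrame({"click_time": ["2017-11-06 15:13:23", "2017-11-07 00:05:59"]})

# The .dt accessor raises AttributeError on string columns, so parse first.
toy["click_time"] = pd.to_datetime(toy["click_time"])

# Every component is at most 59 (31 for day), so uint8 is wide enough.
for part in ["day", "hour", "minute", "second"]:
    toy[part] = getattr(toy["click_time"].dt, part).astype("uint8")

print(toy)
```

Casting to ```uint8``` keeps the new columns small, which matters once the same step runs over millions of clicks.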


#### Label Encoding
For the baseline model, let's just use scikit-learn's ```LabelEncoder``` to create new features in the clicks **DataFrame**. Each new column should use the original name with *_labels* appended.


In [ ]:
from sklearn.preprocessing import LabelEncoder

cat_features = ['ip', 'app', 'device', 'os', 'channel']
encoder = LabelEncoder()

for feature in cat_features:
    encoded = encoder.fit_transform(clicks[feature])
    clicks[feature + '_labels'] = encoded

clicks.head()
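
To see what ```LabelEncoder``` does to a single column, here is a small sketch on a made-up ```app``` column (the values are illustrative, not taken from the dataset):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy column standing in for one categorical feature.
toy = pd.DataFrame({"app": [12, 3, 12, 7, 3]})

encoder = LabelEncoder()
toy["app_labels"] = encoder.fit_transform(toy["app"])

# classes_ holds the sorted unique values; a value's label is its index there.
print(encoder.classes_)
print(toy)
```

Note that the loop above reuses one encoder and refits it for every column, so afterwards ```encoder.classes_``` only describes the last feature. If the mappings are needed later, one option is to keep a separate encoder per feature.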


### Train/Test/Validate Splits
There is one thing we need to be careful with regarding our data.

#### Time Series
Our data is a *time series*: date and time matter when building the train and test sets. Since our model needs to predict events in the future, we must also validate it on events in the future. If the data is mixed up between the training and test sets, future data will leak into the model and our validation results will overestimate performance on new data.

Let's first sort the rows by increasing time. The first 80% of rows will become the training set, the next 10% the validation set, and the last 10% the test set.

In [ ]:
base_feature_cols = ['day', 'hour', 'minute', 'second',
                'ip_labels', 'app_labels', 'device_labels',
                'os_labels', 'channel_labels']
# Split implementation moved to my_modules
base_train, base_valid, base_test = fe.get_ad_splits(clicks)
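
Since the split itself lives in ```my_modules```, it is not visible here. A minimal sketch of a time-ordered 80/10/10 split might look like the following; ```time_ordered_splits``` and the toy data are hypothetical and are not the actual ```fe.get_ad_splits``` implementation:

```python
import pandas as pd

def time_ordered_splits(df, time_col="click_time"):
    """Hypothetical stand-in for fe.get_ad_splits: 80/10/10 split by time."""
    ordered = df.sort_values(time_col)
    n = len(ordered)
    train_end = n * 8 // 10   # integer arithmetic avoids float rounding
    valid_end = n * 9 // 10
    # Earliest rows train the model; the two most recent slices are held out.
    return (ordered.iloc[:train_end],
            ordered.iloc[train_end:valid_end],
            ordered.iloc[valid_end:])

# Toy data: 20 clicks one minute apart, shuffled to mimic unsorted input.
toy = pd.DataFrame(
    {"click_time": pd.date_range("2017-11-06", periods=20, freq="min")}
).sample(frac=1, random_state=0)

train, valid, test = time_ordered_splits(toy)
print(len(train), len(valid), len(test))
```

Because the rows are sorted before slicing, every validation click happens after every training click, and every test click after every validation click.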

### LightGBM Training
Now let's construct LightGBM dataset objects for each of the smaller datasets.

In [ ]:
bst, valid_score = fe.train_ad_model(base_train, base_valid, feature_cols=base_feature_cols)

## 2. Feature Engineering (intermediate encoding)
Now let's get started by implementing some more complex encodings and testing them with our ```train_ad_model``` function. We'll be implementing the following encodings:
- Count Encoding
- Target Encoding
- Leave-one-out Encoding
- CatBoost Encoding
- Feature embedding with SVD

In [ ]:
# click_data is the clean imported data
# clicks is timestamp and label encoded

# Let's do a check on unencoded data
print(colored("Unencoded Model", 'yellow'))
train, valid, test = fe.get_ad_splits(click_data)
_ = fe.train_ad_model(train, valid)
print(colored("Baseline Model (Timestamp + Simple Label Encoding)", 'yellow'))
_ = fe.train_ad_model(base_train, base_valid, feature_cols=base_feature_cols)

#### Quick note on encoding and leakages
The encodings we're working with are calculated from statistics and counts of the data. The current columns are:
- ip
- app
- device
- os
- channel
- click_time
- attributed_time
- is_attributed (target)

To avoid data leakage, we must calculate the encodings from ONLY the training set. Otherwise information from the validation and test rows seeps into the features, and we overestimate the model's performance.
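
As a small illustration of the "training set only" rule, the sketch below learns a simple count statistic from made-up training rows and then applies it to a made-up validation set. The toy frames and the ```app_count``` column name are hypothetical:

```python
import pandas as pd

# Hypothetical toy splits; in the notebook these would come from fe.get_ad_splits.
train = pd.DataFrame({"app": [3, 3, 3, 9, 9, 12]})
valid = pd.DataFrame({"app": [3, 12, 40]})   # 40 never appears in train

# Learn the statistic (here, a simple count) from the training rows only.
train_counts = train["app"].value_counts()

# Apply the same mapping to every split; unseen categories get a count of 0.
train["app_count"] = train["app"].map(train_counts)
valid["app_count"] = valid["app"].map(train_counts).fillna(0).astype(int)

print(valid)
```

Had the counts been computed over train and validation together, the validation features would already contain information about the validation rows themselves, which is exactly the leak described above.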

### Count encodings
First off, let's count encode the features ```['ip', 'app', 'device', 'os', 'channel']``` using ```CountEncoder```.

In [ ]:
import category_encoders as ce

cat_features = ['ip', 'app', 'device', 'os', 'channel']
train, valid, test = fe.get_ad_splits(clicks)