# Feature Engineering Ad Clicks Exercise

The dataset used in this exercise is a sample from the 'TalkingData AdTracking' Kaggle competition. All positive samples (where is_attributed == 1) were kept, while 99% of negative samples were discarded. This sample has roughly 20% positive examples.
The columns used in this exercise are ip, app, device, os, channel, click_time, and is_attributed (note that ip, app, device, os, and channel are encoded).
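
To make the sampling step concrete, here is a minimal sketch of how such a sample could be produced. It runs on synthetic data, not the competition data: the helper name `downsample_negatives` and the 0.25% positive rate are assumptions, chosen so that the arithmetic lands near the 20% figure quoted above.

```python
import numpy as np
import pandas as pd

def downsample_negatives(df, label_col="is_attributed", keep_frac=0.01, seed=0):
    """Keep every positive row and a random fraction of the negative rows."""
    positives = df[df[label_col] == 1]
    negatives = df[df[label_col] == 0].sample(frac=keep_frac, random_state=seed)
    # Concatenate and restore the original row order
    return pd.concat([positives, negatives]).sort_index()

# Synthetic stand-in for a full click log (assumed 0.25% positive rate)
rng = np.random.default_rng(0)
full = pd.DataFrame({"is_attributed": (rng.random(100_000) < 0.0025).astype("uint8")})

sample = downsample_negatives(full)
positive_share = sample["is_attributed"].mean()
print(f"Positive share after sampling: {positive_share:.2f}")
```

Because every positive is retained, the sample's class balance no longer reflects the real click stream, which is worth remembering when interpreting predicted probabilities.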

The overall goal is to predict whether a user will download an app after clicking a mobile ad.

## 1. Baseline Modeling
Using what we learned in the overview, let's set up our baseline model.


In [ ]:
import pandas as pd
from my_modules import data_imports as data

click_data = data.import_ad_clicks_data()
click_data.head(10)

### Basic Feature Engineering
Before jumping into the deep end of feature engineering, we need a base model to build on. So let's do some simple engineering:
- Dealing with Timestamps
- Label Encoding

#### Timestamps

In [ ]:
# Timestamps
clicks = click_data.copy()
clicks['day'] = clicks['click_time'].dt.day.astype('uint8')
clicks['hour'] = clicks['click_time'].dt.hour.astype('uint8')
clicks['minute'] = clicks['click_time'].dt.minute.astype('uint8')
clicks['second'] = clicks['click_time'].dt.second.astype('uint8')

print(clicks.head())
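
The `.dt` accessor used above only works when `click_time` is already a datetime column. As a small sketch on a toy frame (the values are made up; only the column name comes from the exercise), this is how string timestamps could be parsed first and then split into the same parts:

```python
import pandas as pd

# Toy data; these timestamps are illustrative, not taken from the dataset
toy = pd.DataFrame({"click_time": ["2017-11-06 14:32:21", "2017-11-07 09:05:59"]})

# Parse strings into datetime64 so that the .dt accessor is available
toy["click_time"] = pd.to_datetime(toy["click_time"])

# Each part fits comfortably in an unsigned 8-bit integer (max value 59)
for part in ["day", "hour", "minute", "second"]:
    toy[part] = getattr(toy["click_time"].dt, part).astype("uint8")

print(toy)
```

Whether the parsing step is needed depends on what `data.import_ad_clicks_data()` returns, which is not shown here.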


#### Label Encoding
For the baseline model, let's just use scikit-learn's ```LabelEncoder``` to create new features in the clicks **DataFrame**. The new columns should be the original name with *_labels* appended.


In [ ]:
from sklearn.preprocessing import LabelEncoder

cat_features = ['ip', 'app', 'device', 'os', 'channel']
encoder = LabelEncoder()

for feature in cat_features:
    encoded = encoder.fit_transform(clicks[feature])
    clicks[feature + '_labels'] = encoded

clicks.head()
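
To see what the encoder actually produces, here is a small sketch on made-up values. It also keeps one fitted encoder per column in a dictionary (the `encoders` name is an assumption, not part of the exercise), which is one way to retain the mapping back to the original values.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy values; the real columns are already anonymized integers
toy = pd.DataFrame({"app": [12, 3, 12, 9], "channel": [497, 259, 259, 497]})

encoders = {}
for feature in ["app", "channel"]:
    enc = LabelEncoder()
    # Classes are sorted, then mapped to 0..n_classes-1
    toy[feature + "_labels"] = enc.fit_transform(toy[feature])
    encoders[feature] = enc

print(toy)
print(encoders["app"].inverse_transform([2, 0]))
```

Reusing a single encoder across columns, as in the loop above, yields the same label columns, but after each refit only the last column's mapping remains available.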


### Train/Test/Validate Splits
There is one thing we need to be careful with regarding our data.

#### Time Series
Our data is a *time series*, so date and time matter when building the train and test sets. Since our model needs to predict events in the future, we must also validate it on events that occur after the training data. If the data is mixed up between the training and test sets, future data will leak into the model and our validation results will overestimate the performance on new data.

Let's first sort in order of increasing time. The first 80% of rows will become the train set, the next 10% will be validation, and last 10% will be our test set.

In [ ]:
feature_cols = ['day', 'hour', 'minute', 'second',
                'ip_labels', 'app_labels', 'device_labels',
                'os_labels', 'channel_labels']
valid_fraction = 0.1
clicks_sorted = clicks.sort_values('click_time')
valid_rows = int(len(clicks_sorted) * valid_fraction)
train = clicks_sorted[ : -valid_rows * 2]
valid = clicks_sorted[-valid_rows * 2 : -valid_rows]
test = clicks_sorted[-valid_rows : ]
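
The same split can be wrapped in a small function and checked for leakage. This is a sketch on synthetic timestamps; the function name `time_split` and the guard against an empty hold-out set are additions for illustration.

```python
import pandas as pd

def time_split(df, time_col="click_time", valid_fraction=0.1):
    """Split chronologically into train, validation, and test frames."""
    ordered = df.sort_values(time_col)
    n = int(len(ordered) * valid_fraction)
    if n == 0:
        # With n == 0 the negative slices below would degenerate
        raise ValueError("Too few rows for the requested valid_fraction")
    train = ordered.iloc[: -2 * n]
    valid = ordered.iloc[-2 * n : -n]
    test = ordered.iloc[-n:]
    return train, valid, test

# 100 synthetic clicks, one per minute, shuffled to mimic unsorted input
toy = pd.DataFrame({"click_time": pd.date_range("2017-11-06", periods=100, freq="min")})
toy = toy.sample(frac=1, random_state=0)

train, valid, test = time_split(toy)
print(len(train), len(valid), len(test))
print(train["click_time"].max() < valid["click_time"].min())
```

Checking that the latest training timestamp precedes the earliest validation timestamp is a cheap way to confirm that no future rows ended up in the training set.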

### LightGBM Training
Now let's construct LightGBM dataset objects for each of the smaller datasets.

In [ ]:
import lightgbm as lgb

dtrain = lgb.Dataset(train[feature_cols], label=train['is_attributed'])
dvalid = lgb.Dataset(valid[feature_cols], label=valid['is_attributed'])
dtest = lgb.Dataset(test[feature_cols], label=test['is_attributed'])

params = {'num_leaves' : 64, 'objective' : 'binary'}
params['metric'] = 'auc'
num_rounds = 1000
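# Stop once validation AUC has not improved for 10 rounds. The callback form is used
# because the early_stopping_rounds argument of lgb.train was removed in LightGBM 4.0.
bst = lgb.train(params, dtrain, num_rounds, valid_sets=[dvalid],
                callbacks=[lgb.early_stopping(stopping_rounds=10)])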
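
The early-stopping rule used in training can be illustrated without LightGBM. The sketch below is a simplified stand-in, not LightGBM's implementation: the function name and the validation scores are made up, and rounds are counted from zero.

```python
def best_round_with_early_stopping(valid_scores, patience=10):
    """Return (best_round, stopped_round) for a higher-is-better metric such as AUC."""
    best_score, best_round = float("-inf"), -1
    for i, score in enumerate(valid_scores):
        if score > best_score:
            # Strict improvement resets the patience window
            best_score, best_round = score, i
        elif i - best_round >= patience:
            # No improvement for `patience` consecutive rounds: stop here
            return best_round, i
    # Ran out of rounds without triggering the stop
    return best_round, len(valid_scores) - 1

# Hypothetical validation AUC per boosting round
scores = [0.90, 0.93, 0.95, 0.94, 0.95, 0.94, 0.93]
best, stopped = best_round_with_early_stopping(scores, patience=3)
print(best, stopped)
```

The point of the rule is that training keeps the model from the best round on the validation set, which is why the validation set cannot double as the final test set.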

### Evaluation
And finally, let's evaluate the boosted model on the test set and see how it has done.

In [ ]:
from sklearn import metrics

ypred = bst.predict(test[feature_cols])
score = metrics.roc_auc_score(test['is_attributed'], ypred)
print(f"Test Score: {score}")
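
To build some intuition for the score, AUC can be read as the probability that a randomly chosen positive example is ranked above a randomly chosen negative one. The sketch below checks that reading against `roc_auc_score` on four made-up predictions.

```python
from sklearn import metrics

# Toy labels and predicted probabilities (illustrative values only)
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

auc = metrics.roc_auc_score(y_true, y_score)

# Pairwise view: fraction of (positive, negative) pairs ranked correctly,
# counting ties as half a win
pos = [s for s, t in zip(y_score, y_true) if t == 1]
neg = [s for s, t in zip(y_score, y_true) if t == 0]
wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
pairwise_auc = wins / (len(pos) * len(neg))

print(auc, pairwise_auc)  # both 0.75
```

A score of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, so the baseline's test score should be read against that range.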

## 2.
