
# Introduction

You will use a small sample of the data, dropping 99% of negative records (where the app wasn't downloaded) to make the target more balanced.
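
The notebook does not show how the sample was produced. As a rough sketch of the idea, assuming the full data sits in a pandas DataFrame with an `is_attributed` target column, one could keep every positive row and a random 1% of the negative rows. The helper name `downsample_negatives`, its parameters, and the toy data are illustrative assumptions, not part of the original pipeline.

```python
import pandas as pd

def downsample_negatives(df, target='is_attributed', keep_frac=0.01, seed=0):
    """Keep every positive row and a random `keep_frac` of the negative rows."""
    positives = df[df[target] == 1]
    negatives = df[df[target] == 0].sample(frac=keep_frac, random_state=seed)
    # Restore the original row order after concatenating the two parts
    return pd.concat([positives, negatives]).sort_index()

# Hypothetical toy data: 10 downloads among 10,010 clicks
demo = pd.DataFrame({'is_attributed': [1] * 10 + [0] * 10_000})
sample = downsample_negatives(demo)
print(sample['is_attributed'].value_counts().to_dict())
```

Dropping negatives changes the class ratio, so predicted probabilities from a model trained on the sample are not calibrated to the original click stream. A ranking metric such as AUC is much less sensitive to this than accuracy.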

After building a baseline model, you'll be able to see how your feature engineering and selection efforts improve the model's performance.

## Setup

Begin by running the code cell below.

## Baseline Model

The first thing you'll do is construct a baseline model. We'll begin by looking at the data.

In [12]:
import pandas as pd

click_data = pd.read_csv('/content/input/train_sample.csv',
                         parse_dates=['click_time'])
click_data.head()

Unnamed: 0,ip,app,device,os,channel,click_time,attributed_time,is_attributed
0,89489,3,1,13,379,2017-11-06 15:13:23,,0
1,204158,35,1,13,21,2017-11-06 15:41:07,2017-11-07 08:17:19,1
2,3437,6,1,13,459,2017-11-06 15:42:32,,0
3,167543,3,1,13,379,2017-11-06 15:56:17,,0
4,147509,3,1,13,379,2017-11-06 15:57:01,,0


### 1) Construct features from timestamps

Notice that the `click_data` DataFrame has a `'click_time'` column with timestamp data.

Use this column to create features for the corresponding day, hour, minute, and second.

Store these as new integer columns `day`, `hour`, `minute`, and `second` in a new DataFrame `clicks`.

In [13]:
# Add new columns for timestamp features day, hour, minute, and second
clicks = click_data.copy()
clicks['day'] = clicks['click_time'].dt.day.astype('uint8')
# Repeat for the remaining time components (each value is at most 59, so it fits in uint8)
clicks['hour'] = clicks['click_time'].dt.hour.astype('uint8')
clicks['minute'] = clicks['click_time'].dt.minute.astype('uint8')
clicks['second'] = clicks['click_time'].dt.second.astype('uint8')

### 2) Label Encoding
For each of the categorical features `['ip', 'app', 'device', 'os', 'channel']`, use scikit-learn's `LabelEncoder` to create new features in the `clicks` DataFrame. The new column names should be the original column name with `'_labels'` appended, like `ip_labels`.

In [14]:
from sklearn import preprocessing

cat_features = ['ip', 'app', 'device', 'os', 'channel']

# Create new columns in clicks using preprocessing.LabelEncoder()
encoder = preprocessing.LabelEncoder()
for feature in cat_features:
    encoded = encoder.fit_transform(clicks[feature])
    clicks[feature + '_labels'] = encoded
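
To make the encoding concrete, here is a small sketch of what `LabelEncoder` does to a handful of values. The list `ips` is a toy input built from `ip` values in the table above; it is not part of the exercise.

```python
from sklearn import preprocessing

encoder = preprocessing.LabelEncoder()
# Toy input: a few values from the `ip` column, with one repeat
ips = [89489, 204158, 3437, 89489]
encoded = encoder.fit_transform(ips)

print(list(encoder.classes_))  # distinct values, in sorted order
print(list(encoded))           # index of each input value within classes_
```

Each call to `fit_transform` refits the encoder, so after the loop above the shared `encoder` object only remembers the mapping for the last feature, `channel`. If you need to invert or reuse the mappings later, one option is to keep a separate encoder per feature.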

Run the next code cell to view your new DataFrame.

In [15]:
clicks.head()

Unnamed: 0,ip,app,device,os,channel,click_time,attributed_time,is_attributed,day,hour,minute,second,ip_labels,app_labels,device_labels,os_labels,channel_labels
0,89489,3,1,13,379,2017-11-06 15:13:23,,0,6,15,13,23,19608,3,1,13,112
1,204158,35,1,13,21,2017-11-06 15:41:07,2017-11-07 08:17:19,1,6,15,41,7,93932,34,1,13,10
2,3437,6,1,13,459,2017-11-06 15:42:32,,0,6,15,42,32,759,6,1,13,146
3,167543,3,1,13,379,2017-11-06 15:56:17,,0,6,15,56,17,62708,3,1,13,112
4,147509,3,1,13,379,2017-11-06 15:57:01,,0,6,15,57,1,45648,3,1,13,112


## Train, validation, and test sets
With our baseline features ready, we need to split our data into training and validation sets. We should also hold out a test set to measure the final performance of the model.

### Create train/validation/test splits

Here we'll create training, validation, and test splits. First, the `clicks` DataFrame is sorted in order of increasing time. The first 80% of the rows are the train set, the next 10% are the validation set, and the last 10% are the test set.

In [16]:
feature_cols = ['day', 'hour', 'minute', 'second', 
                'ip_labels', 'app_labels', 'device_labels',
                'os_labels', 'channel_labels']

valid_fraction = 0.1
clicks_srt = clicks.sort_values('click_time')
valid_rows = int(len(clicks_srt) * valid_fraction)
train = clicks_srt[:-valid_rows * 2]
# valid size == test size, last two sections of the data
valid = clicks_srt[-valid_rows * 2:-valid_rows]
test = clicks_srt[-valid_rows:]
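
As a sanity check on the slicing logic, the sketch below applies the same split to a small synthetic frame and confirms the sizes and the time ordering. The frame `clicks_demo` is a hypothetical stand-in for `clicks`, with one click per minute.

```python
import pandas as pd

# Hypothetical stand-in for `clicks`: 20 clicks, one per minute
clicks_demo = pd.DataFrame({
    'click_time': pd.date_range('2017-11-06 15:00:00', periods=20, freq='min'),
    'is_attributed': [0, 1] * 10,
})

valid_fraction = 0.1
clicks_srt = clicks_demo.sort_values('click_time')
valid_rows = int(len(clicks_srt) * valid_fraction)
train = clicks_srt[:-valid_rows * 2]
valid = clicks_srt[-valid_rows * 2:-valid_rows]
test = clicks_srt[-valid_rows:]

split_sizes = (len(train), len(valid), len(test))
# Every training click should precede every validation click, and so on
no_time_overlap = (train['click_time'].max() < valid['click_time'].min()
                   and valid['click_time'].max() < test['click_time'].min())
print(split_sizes, no_time_overlap)
```

Note that the negative slices rely on `valid_rows` being at least 1. With fewer than 10 rows, `valid_rows` is 0, `[:-0]` returns an empty train set, and `[-0:]` returns every row as the test set.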

### Train with LightGBM

Now we can create LightGBM dataset objects for each of the smaller datasets and train the baseline model.

In [17]:
import lightgbm as lgb

dtrain = lgb.Dataset(train[feature_cols], label=train['is_attributed'])
dvalid = lgb.Dataset(valid[feature_cols], label=valid['is_attributed'])
dtest = lgb.Dataset(test[feature_cols], label=test['is_attributed'])

param = {'num_leaves': 64, 'objective': 'binary'}
param['metric'] = 'auc'
num_round = 1000
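# Pass early stopping as a callback: the `early_stopping_rounds` keyword
# was removed from lgb.train() in LightGBM 4.0
bst = lgb.train(param, dtrain, num_round, valid_sets=[dvalid],
                callbacks=[lgb.early_stopping(stopping_rounds=10)])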

[1]	valid_0's auc: 0.957185
Training until validation scores don't improve for 10 rounds.
[2]	valid_0's auc: 0.957406
[3]	valid_0's auc: 0.957654
[4]	valid_0's auc: 0.961531
[5]	valid_0's auc: 0.962276
[6]	valid_0's auc: 0.963218
[7]	valid_0's auc: 0.963622
[8]	valid_0's auc: 0.964125
[9]	valid_0's auc: 0.964551
[10]	valid_0's auc: 0.964523
[11]	valid_0's auc: 0.964676
[12]	valid_0's auc: 0.964746
[13]	valid_0's auc: 0.96536
[14]	valid_0's auc: 0.96534
[15]	valid_0's auc: 0.965495
[16]	valid_0's auc: 0.965761
[17]	valid_0's auc: 0.966125
[18]	valid_0's auc: 0.966389
[19]	valid_0's auc: 0.966697
[20]	valid_0's auc: 0.966788
[21]	valid_0's auc: 0.967056
[22]	valid_0's auc: 0.967468
[23]	valid_0's auc: 0.967599
[24]	valid_0's auc: 0.967768
[25]	valid_0's auc: 0.967887
[26]	valid_0's auc: 0.968028
[27]	valid_0's auc: 0.968135
[28]	valid_0's auc: 0.968268
[29]	valid_0's auc: 0.968473
[30]	valid_0's auc: 0.968637
[31]	valid_0's auc: 0.968777
[32]	valid_0's auc: 0.968898
[33]	valid_0's auc: 0

## Evaluate the model
Finally, with the model trained, we evaluate its performance on the test set. 

In [18]:
from sklearn import metrics

ypred = bst.predict(test[feature_cols])
score = metrics.roc_auc_score(test['is_attributed'], ypred)
print(f"Test score: {score}")

Test score: 0.9781856503287598


This will be our baseline score for the model. When we transform features, add new ones, or perform feature selection, we should be improving on this score. However, because this score comes from the test set, we should compare intermediate experiments on the validation set and look at the test score only once all our changes are final.
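
For intuition about the score itself: the AUC can be read as the probability that a randomly chosen positive record receives a higher predicted score than a randomly chosen negative one, with ties counted as half. The sketch below computes it that way on made-up labels and scores. The function `pairwise_auc` is an illustrative helper, not something the notebook uses.

```python
def pairwise_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy labels and predicted scores (made up for illustration)
toy_auc = pairwise_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(toy_auc)  # 0.75
```

This brute-force version compares every positive with every negative, so it is only practical for small inputs. On real data, `metrics.roc_auc_score` is the tool to use.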