**This notebook is an exercise in the [Feature Engineering](https://www.kaggle.com/learn/feature-engineering) course.  You can reference the tutorial at [this link](https://www.kaggle.com/matleonard/baseline-model).**

---


# Introduction

In this exercise, you will work with data from the TalkingData AdTracking competition.  The goal of the competition is to predict whether a user will download an app after clicking through an ad.

<center><a href="https://www.kaggle.com/c/talkingdata-adtracking-fraud-detection"><img src="https://i.imgur.com/srKxEkD.png" width=600px></a></center>

For this course you will use a small sample of the data, dropping 99% of negative records (where the app wasn't downloaded) to make the target more balanced.

After building a baseline model, you'll be able to see how your feature engineering and selection efforts improve the model's performance.
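
As a rough illustration of the downsampling described above, the sketch below keeps every positive row and 1% of the negative rows. The toy DataFrame `full` and its row counts are hypothetical; the procedure actually used to build the course sample is not shown in this notebook.

```python
import pandas as pd

# Hypothetical toy data: 1,000 negative records and 10 positive records
full = pd.DataFrame({'is_attributed': [0] * 1000 + [1] * 10})

negatives = full[full['is_attributed'] == 0]
positives = full[full['is_attributed'] == 1]

# Drop 99% of the negatives (keep 1%) and keep all of the positives
sampled = pd.concat([negatives.sample(frac=0.01, random_state=0), positives])

print(sampled['is_attributed'].value_counts())
```

With these toy counts the sampled frame ends up with as many negatives as positives, which is the kind of rebalancing the course sample aims for.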

## Setup

Begin by running the code cell below to set up the exercise.

In [None]:
# Set up code checking
from learntools.core import binder
binder.bind(globals())
from learntools.feature_engineering.ex1 import *

## Baseline Model

The first thing you'll do is construct a baseline model. We'll begin by looking at the data.

Data fields  
Each row of the training data contains a click record, with the following features.  

- ip: ip address of click.
- app: app id for marketing.
- device: device type id of user mobile phone (e.g., iphone 6 plus, iphone 7, huawei mate 7, etc.)
- os: os version id of user mobile phone
- channel: channel id of mobile ad publisher
- click_time: timestamp of click (UTC)
- attributed_time: if the user downloads the app after clicking an ad, this is the time of the app download
- is_attributed: the target that is to be predicted, indicating the app was downloaded  

Note that ip, app, device, os, and channel are encoded.

The test data is similar, with the following differences:
- click_id: reference for making predictions
- is_attributed: not included

In [None]:
import pandas as pd

click_data = pd.read_csv('../input/feature-engineering-data/train_sample.csv',
                         parse_dates=['click_time'])

In [None]:
print(click_data.shape)
click_data.head()

### Competition submission

In [None]:
competition_test_data = pd.read_csv('../input/talkingdata-adtracking-fraud-detection/test.csv',
                         parse_dates=['click_time'])

In [None]:
print(competition_test_data.shape)
competition_test_data.head()

### 1) Construct features from timestamps

Notice that the `click_data` DataFrame has a `'click_time'` column with timestamp data.

Use this column to create features for the corresponding day, hour, minute, and second.

Store these as new integer columns `day`, `hour`, `minute`, and `second` in a new DataFrame `clicks`.

In [None]:
# Add new columns for timestamp features day, hour, minute, and second
clicks = click_data.copy()
clicks['day'] = clicks['click_time'].dt.day.astype('uint8')
# Fill in the rest
clicks['hour'] = clicks['click_time'].dt.hour.astype('uint8')
clicks['minute'] = clicks['click_time'].dt.minute.astype('uint8')
clicks['second'] = clicks['click_time'].dt.second.astype('uint8')

# Check your answer
q_1.check()

In [None]:
clicks.head()

In [None]:
# Uncomment these if you need guidance
#q_1.hint()
#q_1.solution()

### Competition submission

In [None]:
# Add new columns for timestamp features day, hour, minute, and second
competition_test_data = competition_test_data.copy()
competition_test_data['day'] = competition_test_data['click_time'].dt.day.astype('uint8')
# Fill in the rest
competition_test_data['hour'] = competition_test_data['click_time'].dt.hour.astype('uint8')
competition_test_data['minute'] = competition_test_data['click_time'].dt.minute.astype('uint8')
competition_test_data['second'] = competition_test_data['click_time'].dt.second.astype('uint8')

In [None]:
competition_test_data.head()
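
Since the same four timestamp features are computed for both `clicks` and `competition_test_data`, one option is to wrap the logic in a small helper. This is only a sketch: `add_time_features` and the `demo` DataFrame are hypothetical names for illustration, not part of the exercise.

```python
import pandas as pd

def add_time_features(df, time_col='click_time'):
    """Return a copy of df with day, hour, minute, and second columns added."""
    out = df.copy()
    for part in ['day', 'hour', 'minute', 'second']:
        # Each of these values is at most 59, so it fits comfortably in uint8
        out[part] = getattr(out[time_col].dt, part).astype('uint8')
    return out

# Hypothetical two-row example
demo = pd.DataFrame({'click_time': pd.to_datetime(['2017-11-06 14:32:21',
                                                   '2017-11-09 00:05:59'])})
demo = add_time_features(demo)
print(demo)
```

The same call could then be applied to both DataFrames, which keeps the training and competition features consistent.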

### 2) Label Encoding
For each of the categorical features `['ip', 'app', 'device', 'os', 'channel']`, use scikit-learn's `LabelEncoder` to create new features in the `clicks` DataFrame. The new column names should be the original column name with `'_labels'` appended, like `ip_labels`.

In [None]:
type(clicks['app'])

In [None]:
type(clicks['app'].values.reshape(-1, 1))

In [None]:
x = clicks['app'].values.reshape(1, -1)#.tolist()
x

### Question: is `LabelEncoder` meant for input features?

The scikit-learn documentation for `sklearn.preprocessing.LabelEncoder` says:

"Encode target labels with value between 0 and n_classes-1.

This transformer should be used to encode **target values**, i.e. y, and not the input X."

In [None]:
from sklearn import preprocessing

cat_features = ['ip', 'app', 'device', 'os', 'channel']

#encoder = preprocessing.LabelEncoder() - Incorrect, we need a label encoder for each feature
# Create new columns in clicks using preprocessing.LabelEncoder()

for feature in cat_features:
    encoder = preprocessing.LabelEncoder()
    encoded = encoder.fit_transform(clicks[feature])
    clicks[feature+'_labels'] = encoded
    
    #encoded[feature+'_labels'] = clicks[feature].apply(encoder.fit_transform) - Incorrect 
    #ValueError: y should be a 1d array, got an array of shape () instead.
    
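    # Competition submission: reusing this fitted encoder on the competition test set fails,
    # because that data contains values the encoder never saw in train_sample.
    #competition_encoded = encoder.transform(competition_test_data[feature])
    #ValueError: y contains previously unseen labels: [0, 2, 3, 4, 5,
    #competition_test_data[feature+'_labels'] = competition_encoded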


# Check your answer
q_2.check()
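
The `ValueError: y contains previously unseen labels` noted in the cell above is what `LabelEncoder` does when it meets a value it was not fitted on. One possible workaround, sketched here under the assumption that scikit-learn 0.24 or later is available, is `OrdinalEncoder` with `handle_unknown='use_encoded_value'`, which maps unseen values to a placeholder. The toy DataFrames are hypothetical, and the exercise checker still expects the `LabelEncoder` columns.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy data; app 7 appears only in the test set
train_df = pd.DataFrame({'app': [3, 12, 3, 9], 'os': [13, 19, 13, 13]})
test_df = pd.DataFrame({'app': [12, 7], 'os': [19, 13]})

# Unseen categories are encoded as -1 instead of raising an error
encoder = OrdinalEncoder(handle_unknown='use_encoded_value',
                         unknown_value=-1, dtype=np.int64)
train_enc = encoder.fit_transform(train_df)
test_enc = encoder.transform(test_df)

print(train_enc)
print(test_enc)
```

Whether a single "unknown" bucket is acceptable depends on how many test values are unseen; fitting on the full competition training data, as this notebook suggests later, avoids most of the problem.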

In [None]:
clicks.head()

### How different is the `preprocessing.LabelEncoder()` output from the original dataset encoding?

In [None]:
print('Original dataset size', clicks.shape)
for feature in cat_features:
    # Share of rows whose label-encoded value differs from the original encoded value
    print(feature, '{:.2%} different'.format((clicks[feature] != clicks[feature+'_labels']).mean()))

In [None]:
# Uncomment these if you need guidance
#q_2.hint()
#q_2.solution()

Run the next code cell to view your new DataFrame.

In [None]:
clicks.head()

### 3) One-hot Encoding

In the code cell above, you used label encoded features.  Would it have also made sense to instead use one-hot encoding for the categorical variables `'ip'`, `'app'`, `'device'`, `'os'`, or `'channel'`?

**Note**: If you're not familiar with one-hot encoding, please check out **[this lesson](https://www.kaggle.com/alexisbcook/categorical-variables)** from the Intermediate Machine Learning course.

Run the following line after you've decided your answer.

In [None]:
print('Original dataset size', clicks.shape)
for feature in cat_features:
    print(feature, len(clicks[feature].unique()), len(clicks[feature+'_labels'].unique()))

In [None]:
# Check your answer (Run this code cell to receive credit!)
q_3.solution()
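
To make the unique-value counts printed above concrete, here is a small sketch of how one-hot encoding scales with cardinality. The toy DataFrame is hypothetical; `pd.get_dummies` creates one new column per distinct value.

```python
import pandas as pd

# Hypothetical toy data: 'device' has 2 distinct values, 'ip' has 4
toy = pd.DataFrame({'device': [1, 1, 2, 1],
                    'ip': [101, 205, 333, 478]})

one_hot = pd.get_dummies(toy, columns=['device', 'ip'])

# 2 device columns + 4 ip columns
print(one_hot.shape)
print(list(one_hot.columns))
```

A feature with tens of thousands of distinct values would add that many columns, which is worth keeping in mind when comparing the two encodings.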

### Typo ??? The ip column has 58,000 values

## Train, validation, and test sets
With our baseline features ready, we need to split our data into training and validation sets. We should also hold out a test set to measure the final accuracy of the model.

### 4) Train/test splits with time series data
This is time series data. Are there any special considerations when creating train/test splits for time series? If so, what are they?

Run the following line after you've decided your answer.

In [None]:
# Check your answer (Run this code cell to receive credit!)
q_4.solution()

### Create train/validation/test splits

Here we'll create training, validation, and test splits. First, the `clicks` DataFrame is sorted in order of increasing time. The first 80% of the rows are the train set, the next 10% are the validation set, and the last 10% are the test set.

In [None]:
clicks.head()

In [None]:
feature_cols = ['day', 'hour', 'minute', 'second', 
                'ip_labels', 'app_labels', 'device_labels',
                'os_labels', 'channel_labels']

valid_fraction = 0.1
clicks_srt = clicks.sort_values('click_time')
valid_rows = int(len(clicks_srt) * valid_fraction)
train = clicks_srt[:-valid_rows * 2]
# valid size == test size, last two sections of the data
valid = clicks_srt[-valid_rows * 2:-valid_rows]
test = clicks_srt[-valid_rows:]

In [None]:
print(clicks.shape, train.shape, valid.shape, test.shape, sep='\n')
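
Besides checking the shapes, it can be useful to confirm that the splits really are chronological, so that no validation or test click precedes a training click. The sketch below repeats the slicing logic above on hypothetical toy data; `time_split` is an illustrative helper, not part of the exercise.

```python
import pandas as pd

def time_split(df, time_col, valid_fraction=0.1):
    """Sort by time, then cut off the last two slices as validation and test."""
    ordered = df.sort_values(time_col)
    n = int(len(ordered) * valid_fraction)
    if n == 0:
        # ordered[:-0] would be empty, so very small inputs need special handling
        raise ValueError('not enough rows for the requested fraction')
    return ordered[:-2 * n], ordered[-2 * n:-n], ordered[-n:]

# Hypothetical toy data: 20 daily timestamps in shuffled order
toy = pd.DataFrame({'click_time': pd.date_range('2017-11-06', periods=20, freq='D')})
toy = toy.sample(frac=1, random_state=0)

train_toy, valid_toy, test_toy = time_split(toy, 'click_time')

# Each split should end no later than the next one begins
print(train_toy['click_time'].max() <= valid_toy['click_time'].min())
print(valid_toy['click_time'].max() <= test_toy['click_time'].min())
```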

### Train with LightGBM

Now we can create LightGBM dataset objects for each of the smaller datasets and train the baseline model.

In [None]:
import lightgbm as lgb

dtrain = lgb.Dataset(train[feature_cols], label=train['is_attributed'])
dvalid = lgb.Dataset(valid[feature_cols], label=valid['is_attributed'])
dtest = lgb.Dataset(test[feature_cols], label=test['is_attributed'])

param = {'num_leaves': 64, 'objective': 'binary'}
param['metric'] = 'auc'
num_round = 1000
bst = lgb.train(param, dtrain, num_round, valid_sets=[dvalid], early_stopping_rounds=10)

In [None]:
type(bst)

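### Note: `lgb.plot_metric` raises `TypeError: booster must be dict or LGBMModel`

`lgb.plot_metric` does not accept the `Booster` object returned by `lgb.train()`. Its documentation describes the first argument as:

- booster (dict or LGBMModel) – Dictionary returned from lightgbm.train() or LGBMModel instance.
https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.plot_metric.html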

In [None]:
#lgb.plot_metric(bst, metric=metrics.roc_auc_score, dataset_names=[dtrain, dvalid, dtest]) #??? TypeError: booster must be dict or LGBMModel
#, ax=None, xlim=None, ylim=None, title='Metric during training', xlabel='Iterations', ylabel='auto', figsize=None, dpi=None, grid=True)[source]

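The dictionary that `lgb.plot_metric` expects can be collected during training through the `evals_result` argument of `lgb.train()`, which the documentation describes as:

- evals_result (dict or None, optional (default=None)) – This dictionary is used to store all evaluation results of all the items in valid_sets.
https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.train.html

The `validation_metrics` approach below is inspired by:
https://github.com/Microsoft/LightGBM/blob/2e93cdab9eee02d4d7f5cb3b6b31128dec94e25e/examples/python-guide/plot_example.py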

In [None]:
#Record eval results for plotting
validation_metrics = {}  

bst = lgb.train(param, 
                dtrain, 
                num_round, 
                valid_sets=[dvalid], 
                early_stopping_rounds=10,
                evals_result=validation_metrics,
                verbose_eval=10)

In [None]:
validation_metrics

### Plot validation AUC during training

In [None]:
import matplotlib.pyplot as plt
plt.rcParams["figure.figsize"] = [16,9]

ax = lgb.plot_metric(validation_metrics, metric='auc');
#plt.show();

## ML explainability: a closer look at feature importance and individual trees
Inspired by: https://github.com/Microsoft/LightGBM/blob/2e93cdab9eee02d4d7f5cb3b6b31128dec94e25e/examples/python-guide/plot_example.py


In [None]:
print('Plot feature importances...')
ax = lgb.plot_importance(bst, max_num_features=15)
plt.show()

In [None]:
print('Plot 84th tree...')  # one tree use categorical feature to split
ax = lgb.plot_tree(bst, tree_index=83, figsize=(64, 36), show_info=['split_gain'])
plt.show()

In [None]:
print('Plot 84th tree with graphviz...')
graph = lgb.create_tree_digraph(bst, tree_index=83, name='Tree84')
graph.render(view=True)

### Download `Tree84.gv.pdf` from the working directory

## Evaluate the model
Finally, with the model trained, we evaluate its performance on the test set. 

In [None]:
from sklearn import metrics

ypred = bst.predict(test[feature_cols])
score = metrics.roc_auc_score(test['is_attributed'], ypred)
print(f"Test score: {score}")

This will be our baseline score for the model. When we transform features, add new ones, or perform feature selection, we should be improving on this score. However, since this is the test set, we only want to look at it at the end of all our manipulations. At the very end of this course you'll look at the test score again to see if you improved on the baseline model.
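
For intuition about the metric itself: ROC AUC can be read as the fraction of (positive, negative) pairs in which the positive example receives the higher predicted score. The toy labels and scores below are hypothetical and contain no tied scores, so the pairwise count and `metrics.roc_auc_score` agree.

```python
from sklearn import metrics

# Hypothetical toy labels and predicted scores
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# Every (positive score, negative score) pair
pairs = [(p, n)
         for p, tp in zip(y_score, y_true) if tp == 1
         for n, tn in zip(y_score, y_true) if tn == 0]

# Fraction of pairs where the positive example is ranked above the negative one
manual_auc = sum(p > n for p, n in pairs) / len(pairs)
auc = metrics.roc_auc_score(y_true, y_score)

print(manual_auc, auc)  # both 0.75: 3 of the 4 pairs are ordered correctly
```

A score of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, which gives a scale for judging later improvements over the baseline.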

# Keep Going
Now that you have a baseline model, you are ready to **[use categorical encoding techniques](https://www.kaggle.com/matleonard/categorical-encodings)** to improve it.

# Submit test predictions to the TalkingData AdTracking Fraud Detection Challenge

This section starts a competition submission based on the 2.3M-record `train_sample` used in this notebook. Note that the official competition has its own train and test datasets, which should be used instead of `train_sample`.

Because of the encoding issue with values that appear only in the competition data, the submission needs a new notebook that uses the much larger competition dataset:
https://www.kaggle.com/c/talkingdata-adtracking-fraud-detection/

## Read official competition data

In [None]:
test_data = pd.read_csv('../input/talkingdata-adtracking-fraud-detection/test.csv',
                         parse_dates=['click_time'])

In [None]:
test_data.shape

In [None]:
test_data.head()

## Apply the same feature engineering we applied to the training and validation data

See the notebook "TalkingData AdTracking Competition - Baseline Model":
https://www.kaggle.com/georgezoto/talkingdata-adtracking-competition-baseline-model


---




*Have questions or comments? Visit the [Learn Discussion forum](https://www.kaggle.com/learn-forum/161443) to chat with other Learners.*