# Introduction

In this exercise you'll apply more advanced encodings to the categorical variables to improve your classifier model. The encodings you will implement are:

- Count Encoding
- Target Encoding
- Leave-one-out Encoding
- CatBoost Encoding
- Feature embedding with SVD 

You'll refit the classifier after each encoding to check its performance on hold-out data. First, run the next cell to repeat the work you did in the last exercise.

In [1]:
import numpy as np
import pandas as pd
from sklearn import preprocessing, metrics
import lightgbm as lgb

# Set up code checking
# This can take a few seconds, thanks for your patience
from learntools.core import binder
binder.bind(globals())
from learntools.feature_engineering.ex2 import *

clicks = pd.read_parquet('../input/feature-engineering-data/baseline_data.pqt')



Here I'll define a couple functions to help test the new encodings.

In [2]:
def get_data_splits(dataframe, valid_fraction=0.1):
    """ Splits a dataframe into train, validation, and test sets. First, orders by 
        the column 'click_time'. Set the size of the validation and test sets with
        the valid_fraction keyword argument.
    """

    dataframe = dataframe.sort_values('click_time')
    valid_rows = int(len(dataframe) * valid_fraction)
    train = dataframe[:-valid_rows * 2]
    # valid size == test size, last two sections of the data
    valid = dataframe[-valid_rows * 2:-valid_rows]
    test = dataframe[-valid_rows:]
    
    return train, valid, test

def train_model(train, valid, test=None, feature_cols=None):
    if feature_cols is None:
        feature_cols = train.columns.drop(['click_time', 'attributed_time',
                                           'is_attributed'])
    dtrain = lgb.Dataset(train[feature_cols], label=train['is_attributed'])
    dvalid = lgb.Dataset(valid[feature_cols], label=valid['is_attributed'])
    
    param = {'num_leaves': 64, 'objective': 'binary', 
             'metric': 'auc', 'seed': 7}
    num_round = 1000
    print("Training model!")
    bst = lgb.train(param, dtrain, num_round, valid_sets=[dvalid], 
                    early_stopping_rounds=20, verbose_eval=False)
    
    valid_pred = bst.predict(valid[feature_cols])
    valid_score = metrics.roc_auc_score(valid['is_attributed'], valid_pred)
    print(f"Validation AUC score: {valid_score}")
    
    if test is not None: 
        test_pred = bst.predict(test[feature_cols])
        test_score = metrics.roc_auc_score(test['is_attributed'], test_pred)
        return bst, valid_score, test_score
    else:
        return bst, valid_score
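
To make the time-ordered split concrete, here is a minimal sketch on a toy frame. The helper name `split_by_time` and the toy data are illustrative only; it mirrors the slicing in `get_data_splits` but takes the time column as a parameter.

```python
import pandas as pd

def split_by_time(df, time_col, valid_fraction=0.1):
    """Sort by time, then cut the last two equal-sized slices off as valid and test."""
    df = df.sort_values(time_col)
    n = int(len(df) * valid_fraction)
    # Earliest rows train the model; the two most recent slices are held out
    return df[:-2 * n], df[-2 * n:-n], df[-n:]

# Toy data: 20 rows, one per day (hypothetical values)
toy = pd.DataFrame({
    "click_time": pd.date_range("2017-11-06", periods=20, freq="D"),
    "is_attributed": [0, 1] * 10,
})

tr, va, te = split_by_time(toy, "click_time")
print(len(tr), len(va), len(te))
# 16 2 2

# Every training row is older than every validation row
print(tr["click_time"].max() < va["click_time"].min())
# True
```

Because the slices are taken from the end of the sorted frame, the model is always validated on clicks that happened after the ones it was trained on. Note that if `valid_fraction` is small enough that `n` rounds down to 0, the slice `df[:-0]` is empty, so this pattern assumes a reasonably large frame.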

Run this cell to get a baseline score. If your encodings do better than this, you can keep them.

In [3]:
print("Baseline model")
train, valid, test = get_data_splits(clicks)
_ = train_model(train, valid)

Baseline model
Training model!
Validation AUC score: 0.9622743228943659


### 1) Categorical encodings and leakage

These encodings are all based on statistics calculated from the dataset like counts and means. Considering this, what data should you be using to calculate the encodings?

Run the following line after you've decided your answer.

In [4]:
# Check your answer (Run this code cell to receive credit!)
q_1.solution()

<span style="color:#33cc99">Solution:</span> You should calculate the encodings from the training set only. If you include data from the validation and test sets in the encodings, you'll overestimate the model's performance. In general, you should be vigilant to avoid leakage, that is, letting any information from the validation and test sets into the model. For a review of this topic, see our lesson on [data leakage](https://www.kaggle.com/alexisbcook/data-leakage).
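
As a rough illustration of the difference, the sketch below computes a count statistic two ways on toy data: once from the training rows only, and once from train and validation combined. The frames and variable names are made up for the example.

```python
import pandas as pd

# Toy splits (hypothetical values)
train_toy = pd.DataFrame({"app": [3, 3, 3, 35, 6]})
valid_toy = pd.DataFrame({"app": [3, 35, 35, 99]})

# Leak-free: the statistic is learned from the training rows only
train_counts = train_toy["app"].value_counts()
valid_clean = valid_toy["app"].map(train_counts)

# Leaky: the statistic also sees the validation rows it is later applied to
all_counts = pd.concat([train_toy, valid_toy])["app"].value_counts()
valid_leaky = valid_toy["app"].map(all_counts)

print(valid_clean.isna().tolist())
# [False, False, False, True]
print(valid_leaky.tolist())
# [4, 3, 3, 1]
```

In the leak-free version, app `99` never appeared in training, so it has no learned value; that is the situation the model will face on genuinely new data. In the leaky version every validation value already carries information about the validation set itself.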

### 2) Count encodings

Here, encode the categorical features `['ip', 'app', 'device', 'os', 'channel']` using the count of each value in the data set. Using `CountEncoder` from the `category_encoders` library, fit the encoding using the categorical feature columns defined in `cat_features`. Then apply the encodings to the train and validation sets, adding them as new columns with names suffixed `"_count"`.

In [5]:
train.head(2)

Unnamed: 0,ip,app,device,os,channel,click_time,attributed_time,is_attributed,day,hour,minute,second
0,27226,3,1,13,120,2017-11-06 15:13:23,,0,6,15,13,23
1,110007,35,1,13,10,2017-11-06 15:41:07,2017-11-07 08:17:19,1,6,15,41,7


In [6]:
valid.head(2)

Unnamed: 0,ip,app,device,os,channel,click_time,attributed_time,is_attributed,day,hour,minute,second
1840441,12059,8,1,17,41,2017-11-09 04:50:14,,0,9,4,50,14
1840440,10357,9,1,40,145,2017-11-09 04:50:14,,0,9,4,50,14


In [7]:
import category_encoders as ce

cat_features = ['ip', 'app', 'device', 'os', 'channel']
train, valid, test = get_data_splits(clicks)

# Create the count encoder
count_enc = ce.CountEncoder(cols=cat_features)

# Learn encoding from the training set
count_enc.fit(train[cat_features])

# Apply encoding to the train and validation sets
train_encoded = train.join(count_enc.transform(train[cat_features]).add_suffix('_count'))
valid_encoded = valid.join(count_enc.transform(valid[cat_features]).add_suffix('_count'))

# Check your answer
q_2.check()

<span style="color:#33cc33">Correct</span>

Let's validate how `CountEncoder` works. If we group the training data by `app` and count the rows, app `3` appears 292,254 times, which is the same number stored in the `app_count` feature.

In [8]:
train_encoded.head(2)

Unnamed: 0,ip,app,device,os,channel,click_time,attributed_time,is_attributed,day,hour,minute,second,ip_count,app_count,device_count,os_count,channel_count
0,27226,3,1,13,120,2017-11-06 15:13:23,,0,6,15,13,23,68,292254,1648091,370652,26760
1,110007,35,1,13,10,2017-11-06 15:41:07,2017-11-07 08:17:19,1,6,15,41,7,4,60114,1648091,370652,41256


In [9]:
train.groupby(['app'])['click_time'].transform('count')

0          292254
1           60114
2           19564
3          292254
4          292254
            ...  
1840445    129507
1840449    140369
1840446       845
1840443    104701
1840442     33072
Name: click_time, Length: 1840449, dtype: int64

In [10]:
# Uncomment if you need some guidance
q_2.hint()
q_2.solution()

<span style="color:#3366cc">Hint:</span> CountEncoder works like scikit-learn classes with a `.fit` method to calculate counts and a `.transform` method to apply the encoding. You can join two dataframes with the same index using `.join` and add suffixes to column names with `.add_suffix`.

<span style="color:#33cc99">Solution:</span> 
```python

    # Create the count encoder
    count_enc = ce.CountEncoder(cols=cat_features)

    # Learn encoding from the training set
    count_enc.fit(train[cat_features])

    # Apply encoding to the train and validation sets
    train_encoded = train.join(count_enc.transform(train[cat_features]).add_suffix('_count'))
    valid_encoded = valid.join(count_enc.transform(valid[cat_features]).add_suffix('_count'))
    
```
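
The `.join` and `.add_suffix` pattern used in the solution can be tried on a toy frame. In this sketch, `encoded_toy` is a hand-written stand-in for an encoder's output (same index and column names as the input); the values are hypothetical.

```python
import pandas as pd

raw_toy = pd.DataFrame({"app": [3, 35, 3], "os": [13, 13, 17]}, index=[10, 11, 12])

# Stand-in for encoder output: same index, same column names as the input
encoded_toy = pd.DataFrame({"app": [2, 1, 2], "os": [2, 2, 1]}, index=raw_toy.index)

# Rename the encoded columns first, then align on the shared index
combined = raw_toy.join(encoded_toy.add_suffix("_count"))
print(combined.columns.tolist())
# ['app', 'os', 'app_count', 'os_count']
```

The suffix matters: joining two frames that share column names without renaming one side raises a `ValueError` about overlapping columns.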

In [11]:
# Train the model on the encoded datasets
# This can take around 30 seconds to complete
_ = train_model(train_encoded, valid_encoded)

Training model!
Validation AUC score: 0.9653051135205329


Count encoding improved our model's validation AUC from about 0.9623 (baseline) to about 0.9653!

### 3) Why is count encoding effective?
At first glance, it could be surprising that count encoding helps make accurate models. 
Why do you think count encoding is a good idea, or how does it improve the model score?

Run the following line after you've decided your answer.

In [12]:
# Check your answer (Run this code cell to receive credit!)
q_3.solution()

<span style="color:#33cc99">Solution:</span> 
    Rare values tend to have similar counts (with values like 1 or 2), so you can classify rare 
    values together at prediction time. Common values with large counts are unlikely to have 
    the same exact count as other values. So, the common/important values get their own 
    grouping.
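
The grouping effect described above can be seen by inverting a count mapping. This is a small sketch with made-up category values, not the exercise data.

```python
from collections import Counter, defaultdict

# Two common values and several rare ones (hypothetical data)
values = ["a"] * 50 + ["b"] * 20 + ["c", "d", "e", "f", "f", "g", "g"]
counts = Counter(values)

# Invert the mapping: which categories end up with the same encoded value?
by_count = defaultdict(list)
for value, n in counts.items():
    by_count[n].append(value)

for n in sorted(by_count):
    print(n, by_count[n])
# 1 ['c', 'd', 'e']
# 2 ['f', 'g']
# 20 ['b']
# 50 ['a']
```

The rare values collapse into shared buckets (count 1 and count 2), while each common value keeps a count of its own.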
    

### 4) Target encoding

Here you'll try some supervised encodings that use the labels (the targets) to transform categorical features. The first one is target encoding. Create the target encoder from the `category_encoders` library. Then, learn the encodings from the training dataset, apply the encodings to all the datasets and retrain the model.
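
To show the idea behind target encoding, here is a simplified sketch that blends each category's mean target with the global mean. The function `smoothed_target_means`, the sigmoid weighting, and the toy data are assumptions for illustration; the exact formula and defaults in `category_encoders` may differ between versions.

```python
import numpy as np
import pandas as pd

def smoothed_target_means(categories, target, min_samples_leaf=1, smoothing=1.0):
    """Blend each category's target mean with the global mean (the prior)."""
    stats = target.groupby(categories).agg(["mean", "count"])
    prior = target.mean()
    # Assumed weighting: close to 0 for tiny groups, close to 1 for large ones
    weight = 1 / (1 + np.exp(-(stats["count"] - min_samples_leaf) / smoothing))
    return prior * (1 - weight) + stats["mean"] * weight

# Toy training data (hypothetical values)
train_toy = pd.DataFrame({
    "app": [3, 3, 3, 3, 35, 35, 6],
    "is_attributed": [0, 0, 1, 0, 1, 1, 0],
})
encoding = smoothed_target_means(train_toy["app"], train_toy["is_attributed"])
prior = train_toy["is_attributed"].mean()

# Apply to new rows; categories never seen in training fall back to the prior
valid_apps = pd.Series([3, 99])
valid_app_target = valid_apps.map(encoding).fillna(prior)
print(encoding)
print(valid_app_target)
```

App `6` appears once with target 0, yet its encoded value is pulled halfway toward the global mean instead of being exactly 0. That shrinkage is what keeps single-row categories from looking perfectly predictive.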

In [32]:
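# `ip` is left out here (see question 5); the checker for this question expects it,
# which is why it reports a missing `ip_target` column
cat_features = ['app', 'device', 'os', 'channel']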
train, valid, test = get_data_splits(clicks)

In [33]:
train.head()

Unnamed: 0,ip,app,device,os,channel,click_time,attributed_time,is_attributed,day,hour,minute,second
0,27226,3,1,13,120,2017-11-06 15:13:23,,0,6,15,13,23
1,110007,35,1,13,10,2017-11-06 15:41:07,2017-11-07 08:17:19,1,6,15,41,7
2,1047,6,1,13,157,2017-11-06 15:42:32,,0,6,15,42,32
3,76270,3,1,13,120,2017-11-06 15:56:17,,0,6,15,56,17
4,57862,3,1,13,120,2017-11-06 15:57:01,,0,6,15,57,1


In [34]:
# Create the target encoder. You can find this easily by using tab completion.
# Start typing ce. then press Tab to bring up a list of classes and functions.
target_enc = ce.TargetEncoder(cols=cat_features)

# Learn encoding from the training set. Use the 'is_attributed' column as the target.
target_enc.fit(train[cat_features], train['is_attributed'])

TargetEncoder(cols=['app', 'device', 'os', 'channel'], drop_invariant=False,
              handle_missing='value', handle_unknown='value',
              min_samples_leaf=1, return_df=True, smoothing=1.0, verbose=0)

In [35]:
# Apply encoding to the train and validation sets as new columns
# Make sure to add `_target` as a suffix to the new columns
train_encoded = train.join(target_enc.transform(train[cat_features]).add_suffix('_target'))
valid_encoded = valid.join(target_enc.transform(valid[cat_features]).add_suffix('_target'))
               
               
# Check your answer
q_4.check()

<span style="color:#cc3333">Incorrect:</span> Expected `train_encoded` to have column `ip_target`

In [36]:
# Uncomment these if you need some guidance
q_4.hint()
q_4.solution()

<span style="color:#3366cc">Hint:</span> TargetEncoder works like scikit-learn classes with a `.fit` method to learn the encoding and a `.transform` method to apply the encoding. Also note that you'll need to tell it which columns are categorical variables.

<span style="color:#33cc99">Solution:</span> 
```python

    # Have to tell it which features are categorical when they aren't strings
    target_enc = ce.TargetEncoder(cols=cat_features)

    # Learn encoding from the training set
    target_enc.fit(train[cat_features], train['is_attributed'])

    # Apply encoding to the train and validation sets
    train_encoded = train.join(target_enc.transform(train[cat_features]).add_suffix('_target'))
    valid_encoded = valid.join(target_enc.transform(valid[cat_features]).add_suffix('_target'))
    
```

In [37]:
train_encoded.head()

Unnamed: 0,ip,app,device,os,channel,click_time,attributed_time,is_attributed,day,hour,minute,second,app_target,device_target,os_target,channel_target
0,27226,3,1,13,120,2017-11-06 15:13:23,,0,6,15,13,23,0.028328,0.152087,0.138712,0.034043
1,110007,35,1,13,10,2017-11-06 15:41:07,2017-11-07 08:17:19,1,6,15,41,7,0.995841,0.152087,0.138712,0.950262
2,1047,6,1,13,157,2017-11-06 15:42:32,,0,6,15,42,32,0.009252,0.152087,0.138712,0.019378
3,76270,3,1,13,120,2017-11-06 15:56:17,,0,6,15,56,17,0.028328,0.152087,0.138712,0.034043
4,57862,3,1,13,120,2017-11-06 15:57:01,,0,6,15,57,1,0.028328,0.152087,0.138712,0.034043


In [38]:
_ = train_model(train_encoded, valid_encoded)

Training model!
Validation AUC score: 0.9627457957514338


### 5) Try removing IP encoding

Try leaving `ip` out of the encoded features and retrain the model with target encoding again. You should find that the score increases and is above the baseline score! Why do you think the score is below baseline when we encode the IP address but above baseline when we don't?

Run the following line after you've decided your answer.

In [39]:
# Check your answer (Run this code cell to receive credit!)
q_5.solution()

<span style="color:#33cc99">Solution:</span> 
    Target encoding attempts to measure the population mean of the target for each 
    level in a categorical feature. When there is less data per level, the estimated 
    mean has more variance and tends to lie further from the "true" mean. 
    There is little data per IP address, so the estimates are likely much noisier
    than for the other features. The model will rely heavily on this feature since it 
    looks extremely predictive on the training data. This causes it to make fewer splits 
    on other features, and those features are fit only on the errors left over after 
    accounting for IP address. So, the model will perform very poorly when seeing new 
    IP addresses that weren't in the training data (which is likely most new data). 
    Going forward, we'll leave out the IP feature when trying different encodings.
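
The link between group size and noise can be checked with a small simulation. This sketch assumes a made-up true conversion rate of 0.2 and measures how much the estimated mean varies across repeated samples of different sizes.

```python
import random
import statistics

random.seed(0)
TRUE_RATE = 0.2  # hypothetical "true" mean of the target for one category

def spread_of_estimates(n_rows, n_trials=2000):
    """Std. dev. of the estimated mean across repeated samples of n_rows labels."""
    estimates = [
        sum(random.random() < TRUE_RATE for _ in range(n_rows)) / n_rows
        for _ in range(n_trials)
    ]
    return statistics.pstdev(estimates)

spreads = {n: spread_of_estimates(n) for n in (3, 30, 300)}
for n, s in spreads.items():
    print(n, round(s, 3))
```

With only 3 rows per category the estimate swings widely around 0.2, while with 300 rows it stays close. A category like `ip`, where most values have only a handful of rows, sits at the noisy end of this range.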
    

### 6) CatBoost Encoding

The CatBoost encoder is supposed to work well with the LightGBM model. Encode the categorical features with `CatBoostEncoder` and train the model on the encoded data again.
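
CatBoost-style encoding is also based on the target, but each row's value is computed only from rows that came before it. The sketch below illustrates that ordering idea in plain Python; the function name, the smoothing parameter `a`, and the toy data are assumptions, and the `category_encoders` implementation may differ in its details.

```python
def ordered_target_stats(categories, targets, prior, a=1.0):
    """Encode row i from the targets of earlier rows with the same category only."""
    sums, counts, encoded = {}, {}, []
    for cat, y in zip(categories, targets):
        s, n = sums.get(cat, 0.0), counts.get(cat, 0)
        # Use what has been seen so far, smoothed toward the prior
        encoded.append((s + prior * a) / (n + a))
        # Only now add this row's own target to the running totals
        sums[cat], counts[cat] = s + y, n + 1
    return encoded

cats = ["x", "x", "y", "x", "y"]
ys = [1, 0, 0, 1, 1]
prior = sum(ys) / len(ys)

print([round(v, 3) for v in ordered_target_stats(cats, ys, prior)])
# [0.6, 0.8, 0.6, 0.533, 0.3]
```

Because a row's own label never contributes to its own encoded value, this scheme limits the target leakage that plain target encoding can introduce on the training rows. The result depends on row order, which in this exercise is the `click_time` ordering.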

In [47]:
# remove IP from the encoded features
cat_features = ['app', 'device', 'os', 'channel']

train, valid, test = get_data_splits(clicks)

# Create the CatBoost encoder
cb_enc = ce.CatBoostEncoder(cols=cat_features, random_state=7)

# Learn encoding from the training set
cb_enc.fit(train[cat_features], train['is_attributed'])

# Apply encoding to the train and validation sets as new columns
# Make sure to add `_cb` as a suffix to the new columns
train_encoded = train.join(cb_enc.transform(train[cat_features]).add_suffix('_cb'))
valid_encoded = valid.join(cb_enc.transform(valid[cat_features]).add_suffix('_cb'))

# Check your answer
q_6.check()

<span style="color:#33cc33">Correct</span>

In [46]:
# Uncomment these if you need some guidance
q_6.hint()
q_6.solution()

<span style="color:#3366cc">Hint:</span> CatBoostEncoder works like scikit-learn classes with a `.fit` method to learn the encoding and a `.transform` method to apply the encoding. Also note that you'll need to tell it which columns are categorical variables.

<span style="color:#33cc99">Solution:</span> 
```python

    # Have to tell it which features are categorical when they aren't strings
    cb_enc = ce.CatBoostEncoder(cols=cat_features, random_state=7)

    # Learn encoding from the training set
    cb_enc.fit(train[cat_features], train['is_attributed'])

    # Apply encoding to the train and validation sets
    train_encoded = train.join(cb_enc.transform(train[cat_features]).add_suffix('_cb'))
    valid_encoded = valid.join(cb_enc.transform(valid[cat_features]).add_suffix('_cb'))
    
```

In [49]:
_ = train_model(train_encoded, valid_encoded)

Training model!
Validation AUC score: 0.962868024575231


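Of the supervised encodings tried here, the CatBoost encodings give the highest validation AUC (about 0.9629, versus about 0.9627 for target encoding), so we'll keep those.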

In [50]:
encoded = cb_enc.transform(clicks[cat_features])

In [51]:
encoded

Unnamed: 0,app,device,os,channel
0,0.028329,0.152087,0.138712,0.034049
1,0.995828,0.152087,0.138712,0.950244
2,0.009261,0.152087,0.138712,0.019384
3,0.028329,0.152087,0.138712,0.034049
4,0.028329,0.152087,0.138712,0.034049
...,...,...,...,...
2300556,0.026755,0.152087,0.157243,0.016611
2300557,0.026518,0.152087,0.138712,0.031812
2300558,0.011220,0.026726,0.109914,0.012445
2300559,0.011220,0.152087,0.090235,0.129124


In [52]:
for col in encoded:
    clicks.insert(len(clicks.columns), col + '_cb', encoded[col])

In [53]:
clicks

Unnamed: 0,ip,app,device,os,channel,click_time,attributed_time,is_attributed,day,hour,minute,second,app_cb,device_cb,os_cb,channel_cb
0,27226,3,1,13,120,2017-11-06 15:13:23,,0,6,15,13,23,0.028329,0.152087,0.138712,0.034049
1,110007,35,1,13,10,2017-11-06 15:41:07,2017-11-07 08:17:19,1,6,15,41,7,0.995828,0.152087,0.138712,0.950244
2,1047,6,1,13,157,2017-11-06 15:42:32,,0,6,15,42,32,0.009261,0.152087,0.138712,0.019384
3,76270,3,1,13,120,2017-11-06 15:56:17,,0,6,15,56,17,0.028329,0.152087,0.138712,0.034049
4,57862,3,1,13,120,2017-11-06 15:57:01,,0,6,15,57,1,0.028329,0.152087,0.138712,0.034049
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2300556,9791,2,1,19,166,2017-11-09 15:59:59,,0,9,15,59,59,0.026755,0.152087,0.157243,0.016611
2300557,6240,14,1,13,146,2017-11-09 15:59:59,,0,9,15,59,59,0.026518,0.152087,0.138712,0.031812
2300558,15098,12,2,17,50,2017-11-09 16:00:00,,0,9,16,0,0,0.011220,0.026726,0.109914,0.012445
2300559,10538,12,1,15,41,2017-11-09 16:00:00,,0,9,16,0,0,0.011220,0.152087,0.090235,0.129124
