
# Introduction

Apply more advanced encodings to the categorical variables to improve the classifier model. The encodings implemented are:

- Count Encoding
- Target Encoding
- CatBoost Encoding

Refit the classifier after each encoding to check its performance on hold-out data. 

Begin by running the next code cells to set up the notebook. They install the required packages and load the baseline data prepared in the previous exercise.

In [7]:
!pip install fastparquet

In [12]:
!pip install category_encoders

In [8]:
import numpy as np
import pandas as pd
from sklearn import preprocessing, metrics
import lightgbm as lgb

clicks = pd.read_parquet('/content/input/baseline_data.pqt')

Next, we define a couple of functions that you'll use to test the encodings that you implement in this exercise.

In [9]:
def get_data_splits(dataframe, valid_fraction=0.1):
    """Splits a dataframe into train, validation, and test sets.

    First, orders by the column 'click_time'. Set the size of the 
    validation and test sets with the valid_fraction keyword argument.
    """

    dataframe = dataframe.sort_values('click_time')
    valid_rows = int(len(dataframe) * valid_fraction)
    train = dataframe[:-valid_rows * 2]
    # valid size == test size, last two sections of the data
    valid = dataframe[-valid_rows * 2:-valid_rows]
    test = dataframe[-valid_rows:]
    
    return train, valid, test

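def train_model(train, valid, test=None, feature_cols=None):
    """Trains a LightGBM classifier and prints the validation AUC.

    Returns (model, valid_score), plus test_score when a test set is given.
    """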
    if feature_cols is None:
        feature_cols = train.columns.drop(['click_time', 'attributed_time',
                                           'is_attributed'])
    dtrain = lgb.Dataset(train[feature_cols], label=train['is_attributed'])
    dvalid = lgb.Dataset(valid[feature_cols], label=valid['is_attributed'])
    
    param = {'num_leaves': 64, 'objective': 'binary', 
             'metric': 'auc', 'seed': 7}
    num_round = 1000
    bst = lgb.train(param, dtrain, num_round, valid_sets=[dvalid], 
                    early_stopping_rounds=20, verbose_eval=False)
    
    valid_pred = bst.predict(valid[feature_cols])
    valid_score = metrics.roc_auc_score(valid['is_attributed'], valid_pred)
    print(f"Validation AUC score: {valid_score}")
    
    if test is not None: 
        test_pred = bst.predict(test[feature_cols])
        test_score = metrics.roc_auc_score(test['is_attributed'], test_pred)
        return bst, valid_score, test_score
    else:
        return bst, valid_score
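
To make the slicing in `get_data_splits` concrete, here is a minimal sketch on a plain Python list that stands in for the time-sorted dataframe. The helper name `split_by_position` and the toy data are illustrative assumptions and are not part of the notebook.

```python
def split_by_position(rows, valid_fraction=0.1):
    """Mimic the slicing in get_data_splits on an already time-sorted list."""
    valid_rows = int(len(rows) * valid_fraction)
    train = rows[:-valid_rows * 2]             # everything before the last two blocks
    valid = rows[-valid_rows * 2:-valid_rows]  # second-to-last block
    test = rows[-valid_rows:]                  # last block (most recent rows)
    return train, valid, test

rows = list(range(20))  # stand-in for 20 clicks sorted by click_time
train_part, valid_part, test_part = split_by_position(rows)
print(len(train_part), len(valid_part), len(test_part))
print(valid_part, test_part)
```

Because the rows are ordered by time, the validation and test blocks always come after the training rows, which mirrors how the model would be used on future clicks.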

Run the next cell to get a baseline score.

In [10]:
print("Baseline model")
train, valid, test = get_data_splits(clicks)
baseline_score = train_model(train, valid)

Baseline model
Validation AUC score: 0.9622743228943659


### 1) Categorical encodings and leakage

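These encodings are all based on statistics calculated from the dataset, such as counts and means. To avoid leakage, calculate them from the training set only; the validation and test sets should be transformed with encodings learned from the training data.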
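
Since the encodings are statistics of the data, it matters which rows they are computed from. The sketch below uses made-up toy rows rather than the click data and computes a per-category target mean twice: once from the training rows only, and once from the training and validation rows together. The helper `target_means` is hypothetical and exists only for this illustration.

```python
from collections import defaultdict

def target_means(rows):
    """Return {category: mean label} for a list of (category, label) pairs."""
    sums, counts = defaultdict(float), defaultdict(int)
    for category, label in rows:
        sums[category] += label
        counts[category] += 1
    return {c: sums[c] / counts[c] for c in counts}

# Toy (app, is_attributed) pairs; the values are made up for illustration.
train_rows = [("a", 0), ("a", 0), ("a", 1), ("b", 0), ("b", 0)]
valid_rows = [("b", 1), ("c", 1)]

clean = target_means(train_rows)               # fitted on the training rows only
leaky = target_means(train_rows + valid_rows)  # validation labels leak in

for category, label in valid_rows:
    print(category, label, clean.get(category), round(leaky[category], 3))
```

In the leaky version, the encoding for `c` is exactly its own validation label, so a validation score computed on such features would likely overstate how well the model generalizes.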

### 2) Count encodings

Begin by running the next code cell to get started.

In [13]:
import category_encoders as ce

cat_features = ['ip', 'app', 'device', 'os', 'channel']
train, valid, test = get_data_splits(clicks)


Next, encode the categorical features `['ip', 'app', 'device', 'os', 'channel']` using the count of each value in the data set. 
- Using `CountEncoder` from the `category_encoders` library, fit the encoding using the categorical feature columns defined in `cat_features`. 
- Then apply the encodings to the train and validation sets, adding them as new columns with names suffixed `"_count"`.
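
As a rough picture of what count encoding does, the following sketch reproduces the idea with the standard library only. It is not the `CountEncoder` implementation: the helpers `fit_counts` and `transform_counts`, the toy values, and the choice to map unseen values to 0 are all assumptions of this example.

```python
from collections import Counter

def fit_counts(values):
    """Learn how often each category value occurs in the training data."""
    return Counter(values)

def transform_counts(values, counts, unseen=0):
    """Replace each value with its training count; unseen values get `unseen`."""
    return [counts.get(v, unseen) for v in values]

train_os = ["13", "13", "19", "13", "19", "7", "22"]  # made-up values
valid_os = ["13", "7", "99"]                          # "99" never appears in train

counts = fit_counts(train_os)
print(transform_counts(train_os, counts))
print(transform_counts(valid_os, counts))
```

Note that the two rare values, `7` and `22`, receive the same code, while the counts for the validation rows come entirely from the training data.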

In [14]:
# Create the count encoder
count_enc = ce.CountEncoder(cols=cat_features)

# Learn encoding from the training set
count_enc.fit(train[cat_features])

# Apply encoding to the train and validation sets as new columns
# Make sure to add `_count` as a suffix to the new columns
train_encoded = train.join(count_enc.transform(train[cat_features]).add_suffix('_count'))
valid_encoded = valid.join(count_enc.transform(valid[cat_features]).add_suffix('_count'))


Run the next code cell to see how count encoding changes the results.

In [15]:
# Train the model on the encoded datasets
# This can take around 30 seconds to complete
count_enc_score = train_model(train_encoded, valid_encoded)

Validation AUC score: 0.9653051135205329


Count encoding improved our model's score!

### 3) Why is count encoding effective?
At first glance, it could be surprising that count encoding helps make accurate models. 
Why do you think count encoding is a good idea, or how does it improve the model score?

### 4) Target encoding

Here you'll try some supervised encodings that use the labels (the targets) to transform categorical features. The first one is target encoding. 
- Create the target encoder from the `category_encoders` library. 
- Then, learn the encodings from the training dataset, apply the encodings to all the datasets, and retrain the model.
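
The idea behind target encoding can be sketched in a few lines of plain Python. This is a simplified illustration and not the formula used by `ce.TargetEncoder`: the function names, the toy data, and the additive smoothing weight `m` are assumptions of this example. With `m=0` each category is replaced by the raw mean of its training labels; a larger `m` pulls small categories toward the global mean.

```python
from collections import defaultdict

def fit_target_encoding(categories, labels, m=0.0):
    """Learn an (optionally smoothed) mean label per category.

    m is a hypothetical smoothing weight: m=0 gives the raw per-category
    mean, larger m shrinks small categories toward the global mean.
    """
    sums, counts = defaultdict(float), defaultdict(int)
    for category, label in zip(categories, labels):
        sums[category] += label
        counts[category] += 1
    prior = sum(labels) / len(labels)  # global mean of the training labels
    encoding = {c: (sums[c] + m * prior) / (counts[c] + m) for c in counts}
    return encoding, prior

def transform_target(categories, encoding, prior):
    """Map each category to its learned value; unseen ones get the prior."""
    return [encoding.get(c, prior) for c in categories]

train_channel = ["x", "x", "y", "y", "y", "z"]  # made-up values
train_label = [1, 0, 0, 0, 1, 0]
valid_channel = ["y", "z", "new"]

raw, prior = fit_target_encoding(train_channel, train_label, m=0.0)
smooth, _ = fit_target_encoding(train_channel, train_label, m=2.0)
print(transform_target(valid_channel, raw, prior))
print(transform_target(valid_channel, smooth, prior))
```

The labels are used only when fitting; transforming the validation rows needs just the learned mapping, which is why the same fitted encoder can be applied to every split.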

In [16]:
# Create the target encoder. You can find this easily by using tab completion.
# Start typing `ce.` then press Tab to bring up a list of classes and functions.
target_enc = ce.TargetEncoder(cols=cat_features)

# Learn encoding from the training set. Use the 'is_attributed' column as the target.
target_enc.fit(train[cat_features], train['is_attributed'])

# Apply encoding to the train and validation sets as new columns
# Make sure to add `_target` as a suffix to the new columns
train_encoded = train.join(target_enc.transform(train[cat_features]).add_suffix('_target'))
valid_encoded = valid.join(target_enc.transform(valid[cat_features]).add_suffix('_target'))


Run the next cell to see how target encoding affects your results.

In [17]:
target_enc_score = train_model(train_encoded, valid_encoded)

Validation AUC score: 0.9540530347873288


### 5) Try removing IP encoding

If you leave `ip` out of the encoded features and retrain the model with target encoding, you should find that the score increases and is above the baseline score! Why do you think the score is below baseline when we encode the IP address but above baseline when we don't?
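
One plausible part of the answer is how thinly the data is spread across categories. The sketch below, on made-up columns, measures the share of rows whose category value occurs exactly once. The helper `singleton_share` and both toy columns are assumptions of this example and say nothing about the actual click data.

```python
from collections import Counter

def singleton_share(values):
    """Fraction of rows whose category value occurs exactly once."""
    counts = Counter(values)
    return sum(1 for v in values if counts[v] == 1) / len(values)

# Made-up columns: `ip_like` has many distinct values, `app_like` has few.
ip_like = ["10.0.0.%d" % i for i in range(8)] + ["10.0.0.1", "10.0.0.2"]
app_like = ["3", "3", "9", "9", "9", "3", "12", "12", "3", "9"]

print(singleton_share(ip_like))   # many rows are the only example of their value
print(singleton_share(app_like))  # every value is backed by several rows
```

For a row that is the only example of its category, the raw target mean is simply that row's own label. A model can then fit the training labels through the encoded feature, but the pattern is unlikely to carry over to validation rows.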

### 6) CatBoost Encoding

The CatBoost encoder is supposed to work well with the LightGBM model. Encode the categorical features with `CatBoostEncoder` and train the model on the encoded data again.
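
CatBoost-style encoding is also based on the target, but each row's value is built from the rows that precede it. The following is a simplified sketch of that ordered target statistic, not the `category_encoders` implementation. The function name, the `prior` fallback (here the global label mean), and its weight `a` are assumptions of this example.

```python
from collections import defaultdict

def ordered_target_encode(categories, labels, prior, a=1.0):
    """Encode each row from the labels of earlier rows with the same category.

    prior is the fallback value and a is its weight; both are assumptions
    of this sketch.
    """
    sums, counts = defaultdict(float), defaultdict(int)
    encoded = []
    for category, label in zip(categories, labels):
        # Use only the statistics accumulated before this row.
        encoded.append((sums[category] + prior * a) / (counts[category] + a))
        # Then update the running statistics with this row's label.
        sums[category] += label
        counts[category] += 1
    return encoded

apps = ["a", "b", "a", "a", "b"]  # made-up values, assumed sorted by time
labels = [1, 0, 0, 1, 1]
prior = sum(labels) / len(labels)
encoded = ordered_target_encode(apps, labels, prior)
print([round(value, 3) for value in encoded])
```

Because a row's own label never contributes to its encoded value, this scheme is less prone to the self-leakage that plain target means show on rare categories.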

In [18]:
# Remove IP from the encoded features
cat_features = ['app', 'device', 'os', 'channel']

# Create the CatBoost encoder
cb_enc = ce.CatBoostEncoder(cols=cat_features, random_state=7)

# Learn encoding from the training set
cb_enc.fit(train[cat_features], train['is_attributed'])

# Apply encoding to the train and validation sets as new columns
# Make sure to add `_cb` as a suffix to the new columns
train_encoded = train.join(cb_enc.transform(train[cat_features]).add_suffix('_cb'))
valid_encoded = valid.join(cb_enc.transform(valid[cat_features]).add_suffix('_cb'))


Run the next code cell to see how the CatBoost encoder changes your results.

In [19]:
catboost_enc_score = train_model(train_encoded, valid_encoded)

Validation AUC score: 0.962868024575231
