In this notebook, we walk through the design of target encoding. We start with a motivating example, the Criteo dataset, to show why target encoding is preferred over one-hot encoding and label encoding. The concepts and optimizations of target encoding are then introduced step by step. The key takeaway is that target encoding differs from traditional sklearn-style encoders in the following aspects:

- The ground truth column `target` is used as input for encoding.
- The training data and test data are transformed differently.
- Multi-column joint transformation is supported by target encoding.

### Table of contents
[1. Motivation](#motivation)<br>
> [Criteo data](#criteo)<br>
[Why not one-hot encoding?](#onehot)<br>
[Label encoding](#lbl)<br>
[Train XGB with label encoding features](#lblxgb)<br>

[2. Target encoding](#tar)<br>
> [A naive implementation](#naive)<br>
[A K-fold cross-validated implementation](#kfold)<br>
[An optimized implementation](#opt)<br>
[Multi-column joint encoding](#multi)<br>

[3. Conclusion](#conclusions)<br>

In [None]:
import os
GPU_id = '0,1,2,3'
os.environ['CUDA_VISIBLE_DEVICES'] = GPU_id
num_gpus = len(GPU_id.split(','))

In [None]:
import warnings
warnings.filterwarnings("ignore")

import numpy as np
import cudf as gd
import cupy as cp
from cuml.preprocessing.LabelEncoder import LabelEncoder
from cuml.preprocessing.TargetEncoder import TargetEncoder
import dask as dask, dask_cudf
from dask.distributed import Client, wait
from dask_cuda import LocalCUDACluster
import xgboost as xgb
import matplotlib.pyplot as plt
import time

<a id="motivation"></a>
## 1. Motivation

<a id="criteo"></a>
### Criteo data
The [Criteo 1-TB benchmark](https://github.com/rambler-digital-solutions/criteo-1tb-benchmark) is a well-known dataset for click-through rate modeling. To keep the problem simple, we use only three of its categorical features.

In [None]:
%%time

path = '/datasets/criteo/raw_csvs/split_train_data'
train_name = f'{path}/day_0_part_0000'
valid_name = f'{path}/day_0_part_0001'
num_cols = ['num_%d'%i for i in range(13)]
cat_cols = ['cat_%d'%i for i in range(26)]
cols = ['label']+num_cols+cat_cols
dtypes = {i:'str' if i.startswith('cat_') else 'float32' for i in cols}
train = gd.read_csv(train_name, sep = '\t', header=None, names=cols, dtypes=dtypes)
valid = gd.read_csv(valid_name, sep = '\t', header=None, names=cols, dtypes=dtypes)

used_cols = ['label']+cat_cols[:3]

train = train[used_cols]
valid = valid[used_cols]
train.head()

The categorical columns are originally strings, so we need some kind of encoding to turn them into numerical columns.

<a id="onehot"></a>
### Why not one-hot encoding?

In [None]:
for col in cat_cols[:3]:
    print(col,'cardinality',len(train[col].unique()), len(valid[col].unique()))

With such high cardinality, one-hot encoding is inefficient: a dense representation consumes a huge amount of memory, and a sparse representation is less well optimized for running on GPU.
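To make the memory argument concrete, here is a back-of-the-envelope estimate. The row count and cardinality below are made-up round numbers chosen for illustration, not measurements from the Criteo files, and the helper names are hypothetical.

```python
def dense_one_hot_bytes(n_rows, cardinality, bytes_per_value=4):
    """Size of a dense one-hot matrix stored as float32."""
    return n_rows * cardinality * bytes_per_value


def encoded_column_bytes(n_rows, bytes_per_value=4):
    """Size of a single numeric column (label- or target-encoded)."""
    return n_rows * bytes_per_value


# Hypothetical sizes, for illustration only.
n_rows = 1_000_000
cardinality = 100_000

one_hot_gb = dense_one_hot_bytes(n_rows, cardinality) / 1e9
single_col_mb = encoded_column_bytes(n_rows) / 1e6
print(f"dense one-hot: {one_hot_gb:.0f} GB")              # 400 GB
print(f"single encoded column: {single_col_mb:.0f} MB")   # 4 MB
```

An encoding that produces one numeric column per categorical column avoids this blow-up entirely.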

Therefore, we use label encoding to transform such string columns to numerical columns.

<a id="lbl"></a>
### Label encoding

In [None]:
%%time
for col in cat_cols[:3]:
    train[col] = train[col].fillna('None')
    valid[col] = valid[col].fillna('None')
    lbl = LabelEncoder()
    lbl.fit(gd.concat([train[col],valid[col]]))
    train[col] = lbl.transform(train[col])
    valid[col] = lbl.transform(valid[col])

In [None]:
train.head()

Label encoding transforms string columns to integer columns. However, the mapping from a string to an integer is arbitrary, which makes the encoded features less informative. For example, the first three rows of `cat_2` are `9218`, `5875` and `5199`. Although `5875` is closer to `5199` than to `9218`, there is absolutely no guarantee that the string behind `5875` is more similar to the string behind `5199` than to the one behind `9218`. In other words, a tree classifier has to make many splits to learn the pattern buried within such encoded features.
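A toy sketch of why arbitrary integer codes cost extra splits. The categories and click rates are invented, and the codes are assumed to follow sorted string order (as sklearn's `LabelEncoder` does; any other arbitrary order makes the same point). The helper `min_errors_one_split` is hypothetical: it counts how many categories a single threshold must misplace.

```python
# Made-up categories with a made-up click rate each (illustration only).
click_rate = {"a1f": 0.90, "b07": 0.05, "c3e": 0.85, "d42": 0.10}
is_high = [rate > 0.5 for rate in click_rate.values()]

# Assumption: integer codes follow sorted string order.
codes = {cat: i for i, cat in enumerate(sorted(click_rate))}


def min_errors_one_split(values, is_high):
    """Fewest categories misplaced by one threshold on `values`."""
    pairs = sorted(zip(values, is_high))
    best = len(pairs)
    for cut in range(len(pairs) + 1):
        left = [h for _, h in pairs[:cut]]
        right = [h for _, h in pairs[cut:]]
        # Either side of the threshold may be the "high click rate" side.
        err_a = sum(left) + (len(right) - sum(right))
        err_b = (len(left) - sum(left)) + sum(right)
        best = min(best, err_a, err_b)
    return best


label_codes = [codes[cat] for cat in click_rate]
rates = list(click_rate.values())
print("errors with one split on label codes:", min_errors_one_split(label_codes, is_high))
print("errors with one split on mean target:", min_errors_one_split(rates, is_high))
```

With the label codes, high and low click rates alternate, so one split cannot separate them. Ordering the categories by their mean target makes a single split sufficient, which is the intuition behind target encoding.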

<a id="lblxgb"></a>
### Train XGB with label encoding features

In [None]:
xgb_parms = { 
    'max_depth':6, 
    'learning_rate':0.1, 
    'subsample':0.8,
    'colsample_bytree':1.0, 
    'eval_metric':'auc',
    'objective':'binary:logistic',
    'tree_method':'gpu_hist',
}

In [None]:
NROUND = 100
VERBOSE_EVAL = 10
ESR = 10

start = time.time(); print('Creating DMatrix...')
dtrain = xgb.DMatrix(data=train.drop('label',axis=1),label=train['label'])
dvalid = xgb.DMatrix(data=valid.drop('label',axis=1),label=valid['label'])
print('Took %.1f seconds'%(time.time()-start))

start = time.time(); print('Training...')
model = xgb.train(xgb_parms, 
                       dtrain=dtrain,
                       evals=[(dtrain,'train'),(dvalid,'valid')],
                       num_boost_round=NROUND,
                       early_stopping_rounds=ESR,
                       verbose_eval=VERBOSE_EVAL) 

As shown above, using label encoding features results in a validation AUC of about 0.64, the value used for label encoding in the comparison charts later in this notebook. Let's see if target encoding can improve this score.

<a id="tar"></a>
## 2. Target encoding

The idea of target encoding is simple: we encode each value of a categorical column by the mean of `target` over the rows that share that value, where `target` is the ground truth column to be predicted. In other words, it is essentially a groupby-aggregation-merge, or a `groupby-transform` in pandas terms:<br> `df['fea_encode'] = df.groupby('fea')['target'].transform(lambda x: x.mean())`
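For readers who want to see the arithmetic without a dataframe library, here is a minimal stdlib-only sketch of the same group mean on a toy column. The data and the helper name `group_means` are made up for illustration.

```python
from collections import defaultdict


def group_means(categories, targets):
    """Mean target per category value."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for cat, y in zip(categories, targets):
        sums[cat] += y
        counts[cat] += 1
    return {cat: sums[cat] / counts[cat] for cat in sums}


# Toy data: one categorical feature and a binary target.
fea = ["a", "b", "a", "c", "b", "a"]
target = [1, 0, 0, 1, 0, 1]

means = group_means(fea, target)
# Each row receives the mean target of its own category.
fea_encode = [means[cat] for cat in fea]
print(means)
print(fea_encode)
```

Category `a` appears three times with targets 1, 0, 1, so every `a` row is encoded as 2/3. Category `c` appears once, so its encoding is simply that row's own target.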

<a id="naive"></a>
### A naive implementation
Let's implement target encoding following the idea above and study where we can improve.

In [None]:
%%time
for col in cat_cols[:3]:
    tmp = train.groupby(col, as_index=False).agg({'label':'mean'})
    tmp.columns = [col, f'{col}_TE']
    train = train.merge(tmp, on=col, how='left')
    valid = valid.merge(tmp, on=col, how='left')
    del tmp
train.head()

We will only use the target encoding features to train XGB.

In [None]:
te_cols = [col for col in train.columns if col.endswith('TE')]
print(te_cols)

start = time.time(); print('Creating DMatrix...')
dtrain = xgb.DMatrix(data=train[te_cols],label=train['label'])
dvalid = xgb.DMatrix(data=valid[te_cols],label=valid['label'])
print('Took %.1f seconds'%(time.time()-start))

start = time.time(); print('Training...')
model = xgb.train(xgb_parms, 
                       dtrain=dtrain,
                       evals=[(dtrain,'train'),(dvalid,'valid')],
                       num_boost_round=NROUND,
                       early_stopping_rounds=ESR,
                       verbose_eval=VERBOSE_EVAL) 

However, the validation AUC does not improve with the naive target encoding. Furthermore, the much larger gap between `train auc` and `valid auc` is alarming: it means the naive target encoding suffers from overfitting.

In [None]:
labels = ['Label encoding', 'Target encoding naive']
train_auc = [0.65, 0.84]
valid_auc = [0.64, 0.63]

x = np.arange(len(labels))  # the label locations
width = 0.35  # the width of the bars

fig, ax = plt.subplots()
rects1 = ax.bar(x - width/2, train_auc, width, label='train auc', color='m')
rects2 = ax.bar(x + width/2, valid_auc, width, label='valid auc', color='c')

ax.set_ylabel('Auc')
ax.set_title('The overfitting problem of naive target encoding')
ax.set_xticks(x)
ax.set_xticklabels(labels)
ax.legend()

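The cause is straightforward: we use the ground truth column directly to create the features for the training data, so each training row's feature already contains its own label. That signal is not available for validation data, so the model does not generalize.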
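The leak is easiest to see in an extreme case. The sketch below uses made-up data in which every category value occurs exactly once and the labels are pure noise, which is the worst case rather than a description of the Criteo columns.

```python
import random
from collections import defaultdict

random.seed(0)

# Worst case for leakage: unique category per row, random labels.
n = 1000
cats = [f"id_{i}" for i in range(n)]
labels = [random.randint(0, 1) for _ in range(n)]

# Naive target encoding: group mean over all training rows.
sums, counts = defaultdict(float), defaultdict(int)
for c, y in zip(cats, labels):
    sums[c] += y
    counts[c] += 1
naive_te = [sums[c] / counts[c] for c in cats]

# On the training rows the "feature" is just a copy of the label.
train_matches = sum(te == y for te, y in zip(naive_te, labels))
print(f"training rows where encoding == label: {train_matches}/{n}")

# A validation row with a new category has no encoding to look up.
new_cat = "id_unseen"
print("unseen category has an encoding:", new_cat in sums)
```

A model trained on `naive_te` would look perfect on the training data while having learned nothing that transfers. Real columns sit between this extreme and the ideal, and rare category values are the ones that leak the most.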

<a id="kfold"></a>
### A K-fold cross-validated implementation
To alleviate this overfitting, we can encode the training data in k folds, so that a sample's own ground truth is never used when computing its target encoding feature: each fold is encoded with group means computed from the other folds. The procedure is shown in the animation below.<br>
![K-fold target encoding animation](https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F100236%2F64cc45bbe25144503bc93cf4b9e102f1%2Fmte.gif?generation=1594620515929361&alt=media "K-fold target encoding")

In [None]:
# drop the naive TE columns
train = train.drop(te_cols, axis=1)
valid = valid.drop(te_cols, axis=1)

In [None]:
%%time
FOLDS = 10
train['fold'] = cp.arange(len(train))%FOLDS
train['row_id'] = cp.arange(len(train))
mean = train['label'].mean()
for col in cat_cols[:3]:
    res = []
    out_col = f'{col}_TE'
    for i in range(FOLDS):
        tmp = train[train['fold']!=i].groupby(col, as_index=False).agg({'label':'mean'})
        tmp.columns = [col, out_col]
        tr = train[train['fold']==i][['row_id',col]]
        tr = tr.merge(tmp,on=col,how='left')
        res.append(tr)
        del tmp
    res = gd.concat(res)
    res = res.sort_values('row_id')
    train[out_col] = res[out_col].fillna(mean).values
    del res
    tmp = train.groupby(col, as_index=False).agg({'label':'mean'})
    tmp.columns = [col, out_col]
    valid = valid.merge(tmp, on=col, how='left')
    del tmp
train.head()

A key observation here is that training data and test/validation data are encoded differently. The training data is encoded in the *k-fold cross-validated* fashion, while test data is encoded with the plain *group mean*. Compared to `LabelEncoder`, the implication is that with `TargetEncoder` we can't use exactly the same `transform` API for both `training data` and `test data`.

```
# Using transform for both datasets works
lbl = LabelEncoder()
lbl.fit(gd.concat([train[col],valid[col]]))
train[col] = lbl.transform(train[col])
valid[col] = lbl.transform(valid[col])

# Using transform for both datasets doesn't work:
# the training rows would get the plain group mean instead of the k-fold encoding
tar = TargetEncoder()
tar.fit(train[col], train['target'])
train[col] = tar.transform(train[col]) 
valid[col] = tar.transform(valid[col])
```
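One way to express the two code paths is to give the encoder a `fit_transform` for training rows and keep `transform` for unseen rows. The class below is a hypothetical stdlib-only sketch of that split, not cuML's `TargetEncoder`. Falling back to the global mean for unseen categories is an assumption made here for simplicity.

```python
from collections import defaultdict


class ToyTargetEncoder:
    """Hypothetical illustration only; not the cuML implementation."""

    def __init__(self, n_folds=2):
        self.n_folds = n_folds

    @staticmethod
    def _means(cats, ys):
        sums, counts = defaultdict(float), defaultdict(int)
        for c, y in zip(cats, ys):
            sums[c] += y
            counts[c] += 1
        return {c: sums[c] / counts[c] for c in sums}

    def fit(self, cats, ys):
        # Statistics over all training rows, used later for unseen data.
        self.global_mean_ = sum(ys) / len(ys)
        self.means_ = self._means(cats, ys)
        return self

    def transform(self, cats):
        # Validation/test path: plain group mean from the full training set.
        return [self.means_.get(c, self.global_mean_) for c in cats]

    def fit_transform(self, cats, ys):
        # Training path: each row is encoded from the other folds only.
        self.fit(cats, ys)
        out = [None] * len(cats)
        for fold in range(self.n_folds):
            rest = [i for i in range(len(cats)) if i % self.n_folds != fold]
            means = self._means([cats[i] for i in rest], [ys[i] for i in rest])
            for i in range(fold, len(cats), self.n_folds):
                out[i] = means.get(cats[i], self.global_mean_)
        return out


train_cat = ["a", "a", "b", "a", "b", "a"]
train_y = [1, 0, 1, 1, 0, 1]
valid_cat = ["a", "b", "z"]

enc = ToyTargetEncoder(n_folds=2)
train_te = enc.fit_transform(train_cat, train_y)  # k-fold encoding
valid_te = enc.transform(valid_cat)               # group-mean encoding
print(train_te)
print(valid_te)
```

Calling `enc.transform(train_cat)` instead would return the leaky group means for the training rows, which is exactly the naive encoding from the previous section.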

In [None]:
te_cols = [col for col in train.columns if col.endswith('TE')]
print(te_cols)

start = time.time(); print('Creating DMatrix...')
dtrain = xgb.DMatrix(data=train[te_cols],label=train['label'])
dvalid = xgb.DMatrix(data=valid[te_cols],label=valid['label'])
print('Took %.1f seconds'%(time.time()-start))

start = time.time(); print('Training...')
model = xgb.train(xgb_parms, 
                       dtrain=dtrain,
                       evals=[(dtrain,'train'),(dvalid,'valid')],
                       num_boost_round=NROUND,
                       early_stopping_rounds=ESR,
                       verbose_eval=VERBOSE_EVAL) 

In [None]:
labels = ['Label encoding', 'Target encoding naive', 'Target encoding kfold for loop']
train_auc = [0.65, 0.84, 0.71]
valid_auc = [0.64, 0.63, 0.7]

x = np.arange(len(labels))  # the label locations
width = 0.35  # the width of the bars

fig, ax = plt.subplots()
fig.set_figwidth(15)
rects1 = ax.bar(x - width/2, train_auc, width, label='train auc', color='m')
rects2 = ax.bar(x + width/2, valid_auc, width, label='valid auc', color='c')

ax.set_ylabel('Auc')
ax.set_title('The overfitting problem is fixed by kfold target encoding')
ax.set_xticks(x)
ax.set_xticklabels(labels)
ax.legend()

<a id="opt"></a>
### An optimized implementation
We can make further improvements:
- Calculate the encoding in one shot instead of in a for loop over folds.
- Encode one column or many columns jointly.
- Smooth the encoding so that it is not skewed by infrequent values.
- Support both single and multiple GPUs.
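The first and third items can be sketched together on toy data. The out-of-fold statistics for a row equal the totals for its category minus the totals of its own (category, fold) group, so one pass over the data replaces the loop over folds. The smoothing form used here, `(sum + smooth * global_mean) / (count + smooth)`, is a common choice and an assumption of this sketch. This notebook does not state the exact formula used inside `TargetEncoder`, and the function name is hypothetical.

```python
from collections import defaultdict


def kfold_te_one_shot(cats, ys, n_folds, smooth):
    """Out-of-fold target encoding without looping over folds.

    Assumed smoothing: (sum + smooth * global_mean) / (count + smooth).
    """
    global_mean = sum(ys) / len(ys)
    folds = [i % n_folds for i in range(len(cats))]  # interleaved split

    # One pass: totals per category and per (category, fold).
    cat_sum, cat_cnt = defaultdict(float), defaultdict(int)
    cf_sum, cf_cnt = defaultdict(float), defaultdict(int)
    for c, f, y in zip(cats, folds, ys):
        cat_sum[c] += y
        cat_cnt[c] += 1
        cf_sum[(c, f)] += y
        cf_cnt[(c, f)] += 1

    out = []
    for c, f in zip(cats, folds):
        # Statistics of the other folds = totals minus the row's own fold.
        s = cat_sum[c] - cf_sum[(c, f)]
        n = cat_cnt[c] - cf_cnt[(c, f)]
        if n + smooth == 0:
            # Category absent from the other folds and no smoothing.
            out.append(global_mean)
        else:
            out.append((s + smooth * global_mean) / (n + smooth))
    return out


cats = ["a", "a", "b", "a", "b", "a"]
ys = [1, 0, 1, 1, 0, 1]
unsmoothed = kfold_te_one_shot(cats, ys, n_folds=2, smooth=0)
smoothed = kfold_te_one_shot(cats, ys, n_folds=2, smooth=1)
print(unsmoothed)
print(smoothed)
```

In this toy run the second row sees only one other-fold `a` row, so its unsmoothed encoding is an extreme 1.0. With `smooth=1` it is pulled toward the global mean, which is the effect the smoothing item describes for infrequent values.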

In [None]:
# drop the previous TE columns
train = train.drop(te_cols, axis=1)
valid = valid.drop(te_cols, axis=1)

Note that the optimized implementation is about 6x faster than the previous `for loop` based implementation.

In [None]:
%%time
SMOOTH = 0.001
SPLIT = 'interleaved'
for col in cat_cols[:3]:
    out_col = f'{col}_TE'
    encoder = TargetEncoder(n_folds=FOLDS, smooth=SMOOTH, split_method=SPLIT)
    # Training rows get the k-fold encoding via fit_transform;
    # validation rows get the plain group mean via transform.
    train[out_col] = encoder.fit_transform(train[col], train['label'])
    valid[out_col] = encoder.transform(valid[col])

In [None]:
te_cols = [col for col in train.columns if col.endswith('TE')]
print(te_cols)

start = time.time(); print('Creating DMatrix...')
dtrain = xgb.DMatrix(data=train[te_cols],label=train['label'])
dvalid = xgb.DMatrix(data=valid[te_cols],label=valid['label'])
print('Took %.1f seconds'%(time.time()-start))

start = time.time(); print('Training...')
model = xgb.train(xgb_parms, 
                       dtrain=dtrain,
                       evals=[(dtrain,'train'),(dvalid,'valid')],
                       num_boost_round=NROUND,
                       early_stopping_rounds=ESR,
                       verbose_eval=VERBOSE_EVAL) 

The optimized version is slightly more accurate, and it can be up to 10x faster than the `kfold for loop` implementation.

In [None]:
labels = ['Label encoding', 'Target encoding naive', 'Target encoding kfold for loop', 'Target encoding optimized']
train_auc = [0.65, 0.84, 0.71, 0.72]
valid_auc = [0.64, 0.63, 0.7, 0.704]

x = np.arange(len(labels))  # the label locations
width = 0.35  # the width of the bars

fig, ax = plt.subplots()
fig.set_figwidth(15)
rects1 = ax.bar(x - width/2, train_auc, width, label='train auc', color='m')
rects2 = ax.bar(x + width/2, valid_auc, width, label='valid auc', color='c')

ax.set_ylabel('Auc')
ax.set_title('The overfitting problem is fixed by kfold target encoding')
ax.set_xticks(x)
ax.set_xticklabels(labels)
ax.legend()

<a id="multi"></a>
### Multi-column joint encoding
Instead of encoding one column at a time, we can also encode multiple columns jointly into one new feature.
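Conceptually, joint encoding groups by the combination of values rather than by a single value. The stdlib-only sketch below uses made-up data and plain group means (no k-fold step or smoothing, for brevity) to show a case where each column alone carries no signal but the pair does. The helper name is hypothetical.

```python
from collections import defaultdict


def joint_group_means(columns, targets):
    """Mean target per combination of values across several columns."""
    sums, counts = defaultdict(float), defaultdict(int)
    for key, y in zip(zip(*columns), targets):
        sums[key] += y
        counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}


# Toy data: the label depends on the pair (cat_0, cat_1), not on either alone.
cat_0 = ["x", "x", "y", "y", "x", "y"]
cat_1 = ["p", "q", "p", "q", "p", "q"]
label = [1, 0, 0, 1, 1, 1]

single_0 = joint_group_means([cat_0], label)
single_1 = joint_group_means([cat_1], label)
joint = joint_group_means([cat_0, cat_1], label)
cat_0_cat_1_TE = [joint[key] for key in zip(cat_0, cat_1)]

print(single_0)
print(single_1)
print(cat_0_cat_1_TE)
```

Here every single-column group mean equals 2/3, while the joint encoding is 1.0 or 0.0 and separates the labels. Whether real data has such interactions depends on the dataset.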

In [None]:
%%time
for cols in [['cat_0', 'cat_1'],
             ['cat_0', 'cat_2'],
             ['cat_1', 'cat_2'],
             ['cat_0', 'cat_1', 'cat_2']
            ]:
    out_col = '_'.join(cols)+'_TE'
    encoder = TargetEncoder(n_folds=FOLDS,smooth=SMOOTH, split_method=SPLIT)
    train[out_col] = encoder.fit_transform(train[cols], train['label'])
    valid[out_col] = encoder.transform(valid[cols])
    del encoder

In [None]:
te_cols = [col for col in train.columns if col.endswith('TE')]
print(te_cols)

start = time.time(); print('Creating DMatrix...')
dtrain = xgb.DMatrix(data=train[te_cols],label=train['label'])
dvalid = xgb.DMatrix(data=valid[te_cols],label=valid['label'])
print('Took %.1f seconds'%(time.time()-start))

start = time.time(); print('Training...')
model = xgb.train(xgb_parms, 
                       dtrain=dtrain,
                       evals=[(dtrain,'train'),(dvalid,'valid')],
                       num_boost_round=NROUND,
                       early_stopping_rounds=ESR,
                       verbose_eval=VERBOSE_EVAL) 

Although the validation AUC doesn't improve much on this dataset, multi-column joint encoding is a necessary capability and might improve the predictions on other datasets.

<a id="conclusions"></a>
## 3. Conclusion

In this notebook, we explained the key design choices of target encoding. The takeaways are:
- The ground truth column `target` is used as input for encoding.
- The training data and test data are transformed differently.
- Multi-column joint transformation is supported by target encoding.