We get 24 encoders from 4 libraries:

| library | one-hot encoders | other simple encoders | contrast encoders | target/Bayesian encoders |
| --- | --- | --- | --- | --- |
| [sklearn.preprocessing](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing) | OneHotEncoder | LabelEncoder <br> OrdinalEncoder <br> LabelBinarizer | | |
| [category_encoders](https://contrib.scikit-learn.org/category_encoders) | OneHotEncoder | OrdinalEncoder <br> BinaryEncoder <br> BaseNEncoder <br> CountEncoder <br> **HashingEncoder** | HelmertEncoder <br> SumEncoder <br> BackwardDifferenceEncoder <br> **PolynomialEncoder** | **TargetEncoder <br> LeaveOneOutEncoder <br> CatBoostEncoder** <br> MEstimateEncoder <br> WOEEncoder <br> JamesSteinEncoder <br> GLMMEncoder |
| [pandas](https://pandas.pydata.org) | get_dummies | factorize | | |
| [keras.utils](https://keras.io/api/utils) | to_categorical | | | |


<br>This notebook explores step by step the Hashing Encoder, the Polynomial Encoder and some flavours of the target encoder. All flavours of target encoder peek into the target; we therefore need to be mindful of data leakage. Options are available to control data leakage and overfitting. *TargetEncoder* is the vanilla flavour. *LeaveOneOutEncoder* is the conservative option, where a given sample sees other samples' targets but is blindfolded from its own. *CatBoostEncoder* is sensitive to row ordering; a given sample sees the targets of preceding samples only. *JamesSteinEncoder* is intended for normal distributions.

In particular, this notebook has:
* ```TargetEncoder``` (the vanilla form) demonstrated in detail to show the principle behind target encoding, which underlies all flavours of target encoding. A manual back-of-envelope derivation is compared with the automated output from ```TargetEncoder```.
* ```LeaveOneOutEncoder``` demonstrated in detail as a conservative step up to reduce data leakage and overfitting. A manual back-of-envelope derivation is compared with the automated output from ```LeaveOneOutEncoder```.

<br> This notebook continues from an earlier notebook, [Category Encoders: Catalog & Experiments (Part 1)](https://www.kaggle.com/marychin/category-encoders-catalog-experiments-part-1).

<br>
When to use which encoder to solve what problems? There is a good guide here: [Encode Smarter: How to Easily Integrate Categorical Encoding into Your Machine Learning Pipeline](https://innovation.alteryx.com/encode-smarter).

In [None]:
from sklearn.preprocessing import LabelEncoder
from category_encoders import HashingEncoder, PolynomialEncoder
from category_encoders import TargetEncoder, LeaveOneOutEncoder, MEstimateEncoder, WOEEncoder, JamesSteinEncoder, CatBoostEncoder, GLMMEncoder
from keras import utils

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import warnings, gc, time
warnings.simplefilter('ignore') # once | error | always | default | module

from tqdm import tqdm_notebook

# We shall be compiling a summary table as we go along.
summary = pd.DataFrame({'inp2out_map': pd.Series(dtype=object),   # input-to-output map
                        'nunique'    : pd.Series(dtype=int),      # number of unique (or distinct) values in output
                        'unique'     : pd.Series(dtype='object'), # unique values in output
                        'shape'      : pd.Series(dtype=int),      # rows-by-columns of output array
                        'tictoc'     : pd.Series(dtype=float)})   # computation time in seconds
summary.index.name = 'encoder'
# The grand summary is printed at the end of this notebook.

In [None]:
train = pd.read_csv('/kaggle/input/tabular-playground-series-mar-2021/train.csv', index_col='id')
train.sample(5)

In [None]:
# Let's zoom into a single column.
train['cat10'].nunique(), train['cat10'].unique()
# cat10 alone has 299 unique values altogether. This value is termed 'cardinality'.
# This is an extreme case. Cardinalities are usually lower e.g. exam grades = A, B, C, D, E would have cardinality=5.

In [None]:
# Now pick another column; just to have a look.
train['cat5'].nunique(), train['cat5'].unique()
# Lower cardinality in this column; 84 only.

# 1. Hashing Encoder
What is a Hashing Encoder? The name explains itself once we read the word ```hashing``` in the light of [MD5, SHA, ...](https://en.wikipedia.org/wiki/Secure_Hash_Algorithms). Yes, the hashing encoder is built on that same kind of ```hash```: each category is passed through a hash function, and the hash value decides which output column the category lands in.
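
To make the idea concrete, here is a minimal sketch of the hashing trick written from scratch. It is an illustration under stated assumptions, not the code inside ```HashingEncoder```: the helper names ```hash_bucket``` and ```hash_encode``` are made up for this notebook, MD5 is picked simply because it is mentioned above, and the toy categories are hypothetical.

```python
import hashlib
import numpy as np

def hash_bucket(value, n_components=8):
    """Map a category to a column index via MD5 (an illustrative choice)."""
    digest = hashlib.md5(str(value).encode('utf-8')).hexdigest()
    return int(digest, 16) % n_components

def hash_encode(values, n_components=8):
    """Return a 0/1 array with one row per value and n_components columns."""
    encoded = np.zeros((len(values), n_components), dtype=int)
    for row, value in enumerate(values):
        # Switch on the single column that the hash points to.
        encoded[row, hash_bucket(value, n_components)] = 1
    return encoded

toy = ['A', 'B', 'C', 'A']          # hypothetical categories
toy_encoded = hash_encode(toy, n_components=8)
print(toy_encoded)
```

The same category always lands in the same column, so the first and last rows are identical. Nothing stops two different categories from sharing a column, which is where information can get lost.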

```HashingEncoder``` takes ```n_components``` as an argument. Let us do a test with ```n_components```= 8, 16, 32:

In [None]:
%%time
for n_components in [8, 16, 32]:
    inp = train['cat10']
    tic = time.time()
    out = HashingEncoder(n_components=n_components).fit_transform(inp)
    tictoc = time.time() - tic

    inp2out_map = pd.concat([pd.DataFrame({'inp': train['cat10']}, columns=['inp']),
                             pd.DataFrame(out, index=train.index)], axis=1).drop_duplicates()
    inp2out_map.set_index('inp', inplace=True, drop=True)
    unik = np.unique(inp2out_map.values)
    summary.loc[f'HashingEncoder, {n_components}'] = inp2out_map, len(unik), unik, inp2out_map.shape, tictoc
columns_show = ['nunique', 'unique', 'shape', 'tictoc']
summary[columns_show]
# We find that no matter what n_components we asked for, the mapped values always consist of 0 and 1, and nothing else.
# When we ask for n_components=8, we get 8 columns in the output. When we ask for n_components=16, we get 16 columns. When we ask for n_components=32, we get 32 columns.

In [None]:
summary.loc['HashingEncoder, 8', 'inp2out_map']
# Some rows in inp2out_map contain null values. 
# This shows that some categories in the original input (train['cat10']) are not mapped to anything. 
# HashingEncoder therefore doesn't map one-to-one i.e. some of the original info is lost in the encoding process.

In [None]:
# Now we filter out the null rows and show only non-null rows.
non_null_idx = ~summary.loc['HashingEncoder, 8', 'inp2out_map'].isnull().any(axis=1)
non_null_rows = summary.loc['HashingEncoder, 8', 'inp2out_map'].loc[non_null_idx]
non_null_rows

In [None]:
# Next, we see if there are any duplicate rows that can be removed.
non_null_rows.drop_duplicates()
# We are left with just 8 rows! That means many input categories got mapped to the same output value. Such a clash is called a *collision*, and it is where info gets lost.

In [None]:
# Let's repeat what we did in the previous two cells for ```n_components``` = 8, 16, 32:
print('{:15s}{}'.format('n_components', 'unique values of output'))
for n_components in [8, 16, 32]:
    non_null_idx = ~summary.loc[f'HashingEncoder, {n_components}', 'inp2out_map'].isnull().any(axis=1)
    non_null_rows = summary.loc[f'HashingEncoder, {n_components}', 'inp2out_map'].loc[non_null_idx]
    print('{:<15d}{}'.format(n_components, len(non_null_rows.drop_duplicates())))
# The fewer unique output values we get, the more collisions there are, i.e. the greater the info loss.
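
Why collisions are unavoidable here can be shown with a small counting sketch. The assumptions are loud ones: the 299 labels below are made up (only their number mirrors the cardinality of ```cat10```), and the MD5-modulo bucketing is an illustrative stand-in rather than the library's internals. With more categories than buckets, at least two categories must share a bucket.

```python
import hashlib

def hash_bucket(value, n_buckets):
    """Illustrative bucketing: MD5 digest taken modulo the number of buckets."""
    digest = hashlib.md5(str(value).encode('utf-8')).hexdigest()
    return int(digest, 16) % n_buckets

labels = [f'cat_{i}' for i in range(299)]    # made-up labels, 299 of them

occupied = {}
print('{:15s}{}'.format('n_buckets', 'occupied buckets'))
for n_buckets in [8, 16, 32]:
    # Count how many distinct buckets the 299 labels fall into.
    occupied[n_buckets] = len({hash_bucket(label, n_buckets) for label in labels})
    print('{:<15d}{}'.format(n_buckets, occupied[n_buckets]))
```

The number of occupied buckets can never exceed the number of buckets, so 299 labels squeezed into 8, 16 or 32 buckets necessarily collide.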

# 2. Polynomial Encoder

In [None]:
inp = train['cat5']
tic = time.time()
out = PolynomialEncoder().fit_transform(inp)
tictoc = time.time() - tic

inp2out_map = pd.concat([pd.DataFrame({'inp': train['cat5']}, columns=['inp']),
                         pd.DataFrame(out, index=train.index)], axis=1).drop_duplicates()
inp2out_map.set_index('inp', inplace=True, drop=True)
unik = np.unique(inp2out_map.values)
summary.loc['PolynomialEncoder'] = inp2out_map, len(unik), unik, inp2out_map.shape, tictoc
summary.loc['PolynomialEncoder', 'inp2out_map']
# As shown in [Part 1](https://www.kaggle.com/marychin/category-encoders-catalog-experiments-part-1) of this notebook series, contrast encoders output an ```intercept``` column.

In [None]:
# Now we filter out the null rows and show only non-null rows.
non_null_idx = ~summary.loc['PolynomialEncoder', 'inp2out_map'].isnull().any(axis=1)
non_null_rows = summary.loc['PolynomialEncoder', 'inp2out_map'].loc[non_null_idx]
non_null_rows

In [None]:
# Next, we see if there are any duplicate rows that can be removed.
non_null_rows.drop_duplicates()
# We get 84 rows still. No collision in this case (unlike HashingEncoder).
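
For intuition on why every category keeps a distinct row, here is a hedged sketch of orthogonal polynomial contrasts, the textbook coding that polynomial encoders are usually based on. It is an assumption that ```PolynomialEncoder``` follows this scheme; its exact scaling and its separate ```intercept``` column are not reproduced here. The helper ```poly_contrasts``` is hypothetical.

```python
import numpy as np

def poly_contrasts(n_levels):
    """Orthonormal polynomial contrasts for n_levels equally spaced levels."""
    scores = np.arange(n_levels, dtype=float)
    scores -= scores.mean()                                 # centre the level scores
    vander = np.vander(scores, n_levels, increasing=True)   # columns 1, x, x^2, ...
    q, _ = np.linalg.qr(vander)                             # orthonormalise the columns
    contrasts = q[:, 1:]                                    # drop the constant column
    contrasts *= np.sign(contrasts[-1])                     # make the last level positive
    return contrasts

contrasts = poly_contrasts(4)    # 4 levels give linear, quadratic and cubic columns
print(np.round(contrasts, 4))
```

A category with k levels yields k-1 columns that are mutually orthogonal and each sum to zero, and no two levels share the same row.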

# 3. Target Encoders

In [None]:
feature = 'cat5'
for which in [TargetEncoder, LeaveOneOutEncoder, MEstimateEncoder, WOEEncoder, JamesSteinEncoder, GLMMEncoder, CatBoostEncoder]:
#   Grab the encoder's class name to use as its label in the summary:
    label = which.__name__

    tic = time.time()
    out = which().fit_transform(train[feature], train['target'])
    tictoc = time.time() - tic
    inp2out_map = pd.concat([pd.DataFrame({'inp': train[feature]}, columns=['inp']),
                             pd.DataFrame(out, index=train.index)], axis=1).drop_duplicates()
    inp2out_map.set_index('inp', inplace=True, drop=True)
    unik = np.unique(inp2out_map.values)
    summary.loc[label] = inp2out_map, len(unik), unik, inp2out_map.shape, tictoc

#   Test if encoding depends on the order of rows.
    shuffled = train[[feature, 'target']].copy()
    shuffled = shuffled.sample(frac=1)
    out_shuffled = which().fit_transform(shuffled[feature], shuffled['target'])
    out.rename(columns={feature: 'tis'}, inplace=True)
    out_shuffled.rename(columns={feature: 'tat'}, inplace=True)
    tistat = pd.concat([out, out_shuffled], names=['tis', 'tat'], axis=1)
    if not np.allclose(tistat['tis'], tistat['tat']):
        print(label, 'is order-dependent.')
columns_show = ['nunique', 'unique', 'shape', 'tictoc']
summary[columns_show]
# Output reports that GLMMEncoder and CatBoostEncoder depend on the order of rows.
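
Why an encoder that sees only the targets of preceding rows must be order-dependent can be sketched on a toy frame. This is a simplified illustration of an ordered target statistic, not the exact formula inside ```CatBoostEncoder```: the helper ```ordered_target_encode```, the smoothing weight ```a``` and the toy data are all assumptions.

```python
import pandas as pd

def ordered_target_encode(cat, target, a=1.0):
    """Encode each row from the targets of EARLIER rows of the same category."""
    prior = target.mean()                    # global mean, used as the starting value
    grouped = target.groupby(cat)
    prev_sum = grouped.cumsum() - target     # sum of targets on earlier rows only
    prev_count = grouped.cumcount()          # number of earlier rows
    return (prev_sum + prior * a) / (prev_count + a)

toy = pd.DataFrame({'cat': ['a', 'a', 'b', 'a'], 'target': [1, 0, 1, 1]})
forward = ordered_target_encode(toy['cat'], toy['target'])

# Same rows, visited in reverse order, then put back in the original order.
reversed_toy = toy.iloc[::-1]
backward = ordered_target_encode(reversed_toy['cat'], reversed_toy['target']).sort_index()

print(pd.DataFrame({'forward': forward, 'backward': backward}))
```

The first row of each category has no predecessors and falls back on the prior; which row counts as first depends on the ordering, so the two columns disagree.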

# 3.1 Target Encoder (vanilla)

In [None]:
ncoda = LabelEncoder()
x = ncoda.fit_transform(train['cat5'])
y = train['target']
z = TargetEncoder().fit_transform(train['cat5'], train['target'])

fig = plt.figure(figsize=(15, 15))
ax = fig.add_subplot(projection='3d')
ax.scatter3D(x, y, z, c=z, cmap='hot')
ax.set_xlabel('cat5'); ax.set_ylabel('target'); ax.set_zlabel('TargetEncoder')
_ = ax.set_xticks(np.arange(0, len(ncoda.classes_), 5))
_ = ax.set_xticklabels(ncoda.classes_[::5])
# The 3D plot shows how the encoder's output depends on both cat5 and target.

In [None]:
# How does TargetEncoder encode? It takes the mean of the target over the rows of the given category.
# Let's have a go at doing this manually, then compare with TargetEncoder's output.
manual_auto = pd.DataFrame( {'manual': train.groupby('cat5')['target'].mean()} )
manual_auto = pd.concat([manual_auto, summary.loc['TargetEncoder', 'inp2out_map']], axis=1)
np.allclose(manual_auto['manual'], manual_auto['cat5'], atol=1e-7)
# So it is confirmed that our manual back-of-envelope calculation agrees with the output of TargetEncoder.
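
The data-leakage concern is easiest to see on a toy frame. The sketch below uses a plain group mean, without the smoothing that library encoders may additionally apply, and the data are made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame({'cat':    ['a', 'a', 'a', 'b', 'b', 'c'],
                    'target': [1, 0, 1, 0, 1, 1]})

# Plain mean-target encoding: every row gets the mean target of its category.
toy['mean_enc'] = toy.groupby('cat')['target'].transform('mean')
print(toy)
```

Category ```c``` appears on a single row, so its encoded value is that row's own target. A model trained on this feature would be handed the answer for that row, which is the leakage the other flavours try to limit.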

# 3.2 Leave-One-Out Encoder
LeaveOneOutEncoder is the conservative step up from the vanilla TargetEncoder. It reduces data leakage and overfitting by taking the target mean over the rows of the same category, excluding the given row itself. A worked back-of-envelope example explains it best:

In [None]:
# train contains too many rows. To avoid prohibitive runtimes let's reduce train to a manageable subset.
reduced = train.sample(10000, random_state=77).reset_index(drop=True)

# Now let us try encoding leave-one-out manually.
manual = pd.Series(dtype=float)  # explicit dtype so the results stay numeric
for grp_idx, grp_data in tqdm_notebook(reduced.groupby('cat5'), total=reduced['cat5'].nunique()):
    for row_idx, row_data in grp_data.iterrows():
        manual.loc[row_idx] = grp_data.drop(row_idx)['target'].mean()
manual.name = 'loo_manual'
reduced = pd.concat([reduced, manual], axis=1)
reduced['loo_auto'] = LeaveOneOutEncoder().fit_transform(reduced['cat5'], reduced['target'])
plt.plot(reduced['loo_auto'], reduced['loo_manual'], '.'); plt.axis('square'); plt.grid(True)
# Looks good. Eyeballing the plot suggests good agreement between our manual encoding and LeaveOneOutEncoder's output.

In [None]:
# Next, put the comparison through a quantitative litmus test.
np.allclose(reduced['loo_auto'].values, reduced['loo_manual'].values, atol=1)
# It fails the litmus test. There is disagreement that escaped eyeballing of the plot in the previous cell.

In [None]:
# First suspect: null values?
reduced.loc[reduced['loo_manual'].isnull(), 'cat5']
# Indeed, that's the culprit.

In [None]:
# Next line of investigation: where do those null values originate from?
pblem_grp = reduced.loc[reduced['loo_manual'].isnull(), 'cat5'].values
reduced['cat5'].value_counts().loc[pblem_grp]
# They come from categories that appear on a single row only.
# Our manual calculation was correct: by definition it is not possible to encode Leave-One-Out for categories with count=1.
# In Leave-One-Out encoding a given row leaves itself out; for a single-row category that leaves no row to average over.

In [None]:
# So how did LeaveOneOutEncoder get a non-null value?
reduced.loc[reduced['cat5'].isin(pblem_grp)][['target', 'loo_manual', 'loo_auto']]
# So LeaveOneOutEncoder plugs in as surrogate a constant value it found somewhere, 0.2626.

In [None]:
# Let us make a wild guess where LeaveOneOutEncoder found the value 0.2626.
reduced['target'].mean()
# Voila.
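
The findings of this section can be folded into a vectorised sketch that avoids the slow row-by-row loop. It is a hand-rolled illustration, not the source of ```LeaveOneOutEncoder```: the helper ```leave_one_out_encode``` and the toy data are assumptions, and the fallback to the global target mean simply mirrors what was observed above for single-row categories.

```python
import pandas as pd

def leave_one_out_encode(cat, target):
    """Mean target of the other rows in the same category, per row."""
    target = target.astype(float)
    grouped = target.groupby(cat)
    total = grouped.transform('sum')      # sum of targets within the category
    count = grouped.transform('count')    # number of rows within the category
    loo = (total - target) / (count - 1)  # leave the row's own target out
    # Single-row categories have nothing left to average: use the global mean.
    return loo.where(count > 1, target.mean())

toy = pd.DataFrame({'cat':    ['a', 'a', 'a', 'b', 'b', 'c'],
                    'target': [1, 0, 1, 0, 1, 1]})
toy['loo'] = leave_one_out_encode(toy['cat'], toy['target'])
print(toy)
```

Subtracting a row's own target from its group sum gives the same numbers as dropping the row and averaging the rest, but in one pass over the data.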