We shall explore 24 encoders from 4 libraries:

| library | one-hot encoders | other simple encoders | contrast encoders | target/Bayesian encoders |
| --- | --- | --- | --- | --- |
| [sklearn.preprocessing](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing) | OneHotEncoder | LabelEncoder <br> OrdinalEncoder <br> LabelBinarizer | | |
| [category_encoders](https://contrib.scikit-learn.org/category_encoders) | OneHotEncoder | OrdinalEncoder <br> BinaryEncoder <br> BaseNEncoder <br> CountEncoder <br> HashingEncoder| HelmertEncoder <br> SumEncoder <br> BackwardDifferenceEncoder <br> PolynomialEncoder | TargetEncoder <br> MEstimateEncoder <br> WOEEncoder <br> JamesSteinEncoder <br> LeaveOneOutEncoder <br> CatBoostEncoder <br> GLMMEncoder |
| [pandas](https://pandas.pydata.org) | get_dummies | factorize | | |
| [keras.utils](https://keras.io/api/utils) | to_categorical | | | |

<br>
Encoders map the original categories (often dtype=string) to a set of representative values (often dtype=int for simple encoders; dtype=float for target encoders). This notebook tours the encoders listed in the table, exploring each non-target encoder one by one and producing a comparison table at the end. Target encoders are explored in detail in a separate notebook.
<br><br>
Which encoder suits which problem? There is a good guide here: [Encode Smarter: How to Easily Integrate Categorical Encoding into Your Machine Learning Pipeline](https://innovation.alteryx.com/encode-smarter).

In [None]:
from sklearn import preprocessing
from category_encoders import OrdinalEncoder, OneHotEncoder, BinaryEncoder, BaseNEncoder, CountEncoder, HashingEncoder
from category_encoders import HelmertEncoder, SumEncoder, BackwardDifferenceEncoder, PolynomialEncoder
from category_encoders import TargetEncoder, MEstimateEncoder, WOEEncoder, JamesSteinEncoder, LeaveOneOutEncoder, CatBoostEncoder, GLMMEncoder
from keras import utils

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import warnings, gc, time
warnings.simplefilter('ignore') # once | error | always | default | module

# We shall be compiling a summary table as we go along.
summary = pd.DataFrame({'inp2out_map': pd.Series(dtype=object),   # input-to-output map
                        'nunique'    : pd.Series(dtype=int),      # number of unique (or distinct) values in output
                        'unique'     : pd.Series(dtype=object),   # unique values in output
                        'shape'      : pd.Series(dtype=object),   # rows-by-columns of output array (a tuple)
                        'tictoc'     : pd.Series(dtype=float)})   # computation time in seconds
summary.index.name = 'encoder'
# The grand summary is printed at the end of this notebook.

In [None]:
train = pd.read_csv('/kaggle/input/tabular-playground-series-mar-2021/train.csv', index_col='id') # [['cat10', 'cat5', 'target']]
train.sample(5)

# Is encoding optional?
Not always. Some packages can't digest string-type data without encoding. 'Donkey', 'horse' and 'mule', for instance, would not work whereas 0, 1 and 2 would.

Even when a package can digest unencoded data, it sometimes learns better from encoded data.
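
To make the first point concrete, here is a minimal, dependency-free sketch. The `animals` list, the `mean` helper and the `codes` mapping are all made up for illustration; none of them comes from the competition data. The mean of arbitrary codes carries no meaning by itself; the only point is that a numeric routine chokes on strings and runs on codes.

```python
# Toy data, invented for illustration only.
animals = ['donkey', 'horse', 'mule', 'horse']

def mean(values):
    """Arithmetic mean; stands in for any routine that expects numbers."""
    return sum(float(v) for v in values) / len(values)

try:
    mean(animals)                 # float('donkey') raises ValueError
    strings_ok = True
except ValueError as complaint:
    strings_ok = False
    print('Strings rejected:', complaint)

# Map each category to an integer code (sorted order), then retry.
codes = {name: i for i, name in enumerate(sorted(set(animals)))}
encoded = [codes[a] for a in animals]
print(encoded, mean(encoded))
```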

# Encoder types: a broad-stroke scan
We've got 2 dozen encoders here. Let's take an overview by trying to group them into families according to observable behaviors.

In [None]:
%%time
# Would the output differ whether or not we supply the target as input?
# Let's run a test with 8 encoders which optionally accept the target as input:
pick = train.columns[train.columns.str.startswith('cat')]
for ncoda in [OrdinalEncoder, HelmertEncoder, SumEncoder, OneHotEncoder, BinaryEncoder, BaseNEncoder, CountEncoder, BackwardDifferenceEncoder]:
    tis = ncoda().fit_transform(train[pick])
    tat = ncoda().fit_transform(train[pick], train['target']) 
#   Print 'True' if same; print 'False' otherwise
    print((tis==tat).all().all(), ncoda)

In [None]:
# Some encoders use the target for computing the output; they can't run without being given the target. These are the target encoders.
for ncoda in [TargetEncoder, MEstimateEncoder, WOEEncoder, JamesSteinEncoder, LeaveOneOutEncoder, CatBoostEncoder, GLMMEncoder]:
    try:
#       Run without train['target']:
        tis = ncoda().fit_transform(train[pick])
        print('Passed:', ncoda)    
    except Exception as complaint:
        print(complaint)
        print('See, told ya it was going to break:', ncoda)    
    gc.collect()

In [None]:
# Let's scan for encoders that output a column named 'intercept'; this suggests contrast encoding, which we will see in the last section.
for ncoda in [OrdinalEncoder, OneHotEncoder, BinaryEncoder, BaseNEncoder, CountEncoder, HashingEncoder,
              HelmertEncoder, SumEncoder, BackwardDifferenceEncoder, PolynomialEncoder]:
    out = ncoda().fit_transform(train[pick])
    if 'intercept' in out.columns:
        print(str(ncoda))

# One-to-one simple, target-independent encoders

In [None]:
# Let's zoom into a single column.
train['cat10'].nunique(), train['cat10'].unique()
# cat10 alone has 299 unique values altogether. This value is termed 'cardinality'.
# This is an extreme case. Cardinalities are usually lower e.g. exam grades = A, B, C, D, E would have cardinality=5.

In [None]:
%%time
for which in [preprocessing.LabelEncoder, preprocessing.OrdinalEncoder, OrdinalEncoder,  # Section 1
              preprocessing.OneHotEncoder, OneHotEncoder,                                # Section 2
              preprocessing.LabelBinarizer, BinaryEncoder, BaseNEncoder,                 # Section 3
              CountEncoder,                                                              # Section 4
              HelmertEncoder, SumEncoder, BackwardDifferenceEncoder]:                    # Section 5
    if which==preprocessing.OrdinalEncoder or which==preprocessing.OneHotEncoder: 
        inp = train['cat10'].values.reshape(-1, 1)
    else:
        inp = train['cat10']

    tic = time.time()
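    if which==preprocessing.OneHotEncoder: 
#       Note: scikit-learn 1.2 renamed `sparse` to `sparse_output`; adjust this argument on newer versions.
        out = which(sparse=False).fit_transform(inp)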
    else:
        out = which().fit_transform(inp)
    tictoc = time.time() - tic

    inp2out_map = pd.concat([pd.DataFrame({'inp': train['cat10']}, columns=['inp']),
                             pd.DataFrame(out, index=train.index)], axis=1).drop_duplicates()
    inp2out_map.set_index('inp', inplace=True, drop=True)
    unik = np.unique(inp2out_map.values)
#   Grab the label, apply some minor hiding cosmetics:
    label = str(which).replace("<class '", "").replace("'>", "")
    if inp2out_map.isnull().any().any():
        print(label, "doesn't map one-to-one")
    summary.loc[label] = inp2out_map, len(unik), unik, inp2out_map.shape, tictoc
columns_show = ['nunique', 'unique', 'shape', 'tictoc']
summary[columns_show]

# 1. Label & Ordinal encoders
In the table, the first 3 rows:
* sklearn.preprocessing._label.LabelEncoder
* sklearn.preprocessing._encoders.OrdinalEncoder
* category_encoders.ordinal.OrdinalEncoder

are rather similar to each other:
* they all output 299 unique numbers, where 299 is the cardinality of the original input;
* they all output a single column;
* they basically do one-to-one mapping of the original input;
* they run quickly compared to the rest.
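
One detail the table hides: encoders of this family may agree on the number of codes yet disagree on which code goes to which category. Here is a dependency-free sketch of two common conventions. The `grades` list is invented for illustration; the comments name the library behaviour each convention is meant to mimic, and the sketch is not the libraries' own code.

```python
# Toy column, invented for illustration.
grades = ['C', 'A', 'B', 'A', 'C']

# Convention 1: codes follow the sorted order of the categories
# (scikit-learn's LabelEncoder stores its classes sorted).
by_sorted = {g: i for i, g in enumerate(sorted(set(grades)))}

# Convention 2: codes follow the order of first appearance
# (the default behaviour of pd.factorize).
by_appearance = {}
for g in grades:
    by_appearance.setdefault(g, len(by_appearance))

print([by_sorted[g] for g in grades])       # [2, 0, 1, 0, 2]
print([by_appearance[g] for g in grades])   # [0, 1, 2, 1, 0]
```

Either way the mapping is one-to-one, so the cardinality of the output equals that of the input.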

## 1.1 LabelEncoder vs OrdinalEncoder
* LabelEncoder encodes one variable at a time; meant for encoding target labels (as in classification problems). 
* OrdinalEncoder encodes multiple variables/columns at a time; meant for encoding features (plural).

Let's see that in action:

In [None]:
try:
    out = preprocessing.LabelEncoder().fit_transform(train[['cat10', 'cat5']])
except Exception as complaint:
    print(complaint)
    print('See, told ya it was going to break.')    

In [None]:
out = OrdinalEncoder().fit_transform(train[['cat10', 'cat5']])
# no complaints

## 1.2 pandas does label encoding too

In [None]:
def redressOutput(out):
    inp2out_map = pd.concat([pd.DataFrame({'inp': train['cat10']}, columns=['inp']),
                             pd.DataFrame(out, index=train.index)], axis=1).drop_duplicates()
    inp2out_map.set_index('inp', inplace=True, drop=True)
    unik = np.unique(inp2out_map.values)
    return inp2out_map, len(unik), unik, inp2out_map.shape

In [None]:
tic = time.time()
out = pd.factorize(train['cat10'])[0]
tictoc = time.time() - tic
summary.loc['pd.factorize'] = redressOutput(out) + (tictoc, )

labelordinal_encoders = ['sklearn.preprocessing._label.LabelEncoder',
                         'sklearn.preprocessing._encoders.OrdinalEncoder',
                         'category_encoders.ordinal.OrdinalEncoder',
                         'pd.factorize']
summary.loc[labelordinal_encoders, columns_show ]

In [None]:
# like scikit's LabelEncoder, pd.factorize can only handle one column at a time
try:
    out = pd.factorize(train[['cat10', 'cat5']])
except Exception as complaint:
    print(complaint)
    print('See, told ya it was going to break.')    

# 2. One-hot encoders
## 2.1 by scikit-learn and category_encoders

In [None]:
summary.loc[ summary.index.str.contains('OneHot') , columns_show ]
# We've got two one-hot encoders so far. One from sklearn.preprocessing; another by category_encoders. Both work in a similar way. We can use either.

Compared to label and ordinal encoders, we find that with one-hot encoders:
* ```nunique``` dropped from 299 to 2;
* the number of columns increased from 1 to 299.
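
The same trade-off shows up on a toy column. This is a minimal sketch of the idea in plain Python, not any library's implementation; the `colours` list is invented for illustration.

```python
# Toy column, invented for illustration.
colours = ['red', 'green', 'blue', 'green']
categories = sorted(set(colours))            # ['blue', 'green', 'red']

# One output column per category; 1 marks the row's own category.
one_hot = [[int(c == cat) for cat in categories] for c in colours]

for c, row in zip(colours, one_hot):
    print(f'{c:>5}', row)

n_columns = len(one_hot[0])                            # equals the cardinality
distinct_values = {v for row in one_hot for v in row}  # only 0 and 1 occur
print(n_columns, distinct_values)
```

Three categories give three columns, and the only values that ever appear are 0 and 1.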

Let's see how a one-hot encoder maps input to output:

In [None]:
inp2out_map = summary.loc['category_encoders.one_hot.OneHotEncoder', 'inp2out_map']
inp2out_map

One-hot encoding is so named because each row contains exactly one 1; all other columns are zero. Let's do a quick check:

In [None]:
for row_idx, row_data in inp2out_map.iterrows():
    vcount = row_data.value_counts().sort_index()
    if not (vcount==pd.Series({0: 298, 1: 1})).all():
        print('oopsy')
# Loop passes without any oopsy, confirming that each row had strictly 1 one and 298 zeros.

In [None]:
# Let's take the chance to visualise the input-to-output mapping.
plt.imshow(inp2out_map, cmap='gray'); plt.axis('equal'); _ = plt.axis('off')
# black = zero; white = one. We find strictly 1 one on each row, zero everywhere else.

## 2.2 by pandas

In [None]:
tic = time.time()
out = pd.get_dummies(train['cat10'])
tictoc = time.time() - tic
summary.loc['pd.get_dummies'] = redressOutput(out) + (tictoc, )

onehot_encoders = ['sklearn.preprocessing._encoders.OneHotEncoder',
                   'category_encoders.one_hot.OneHotEncoder',
                   'pd.get_dummies']
summary.loc[onehot_encoders, columns_show ]

## 2.3 by keras
```keras.utils.to_categorical``` does one-hot encoding too, but with numeric input only. ```cat10``` is string, not numeric, so we first need to convert it from string to numeric.

In [None]:
try:
    utils.to_categorical(train['cat10'])
except Exception as complaint:
    print(complaint)
    print('See, told ya it was going to break.')

In [None]:
tic = time.time()
borrow = preprocessing.LabelEncoder().fit_transform(train['cat10'])
out = utils.to_categorical(borrow)
tictoc = time.time() - tic
summary.loc['utils.to_categorical'] = redressOutput(out) + (tictoc, )

onehot_encoders = ['sklearn.preprocessing._encoders.OneHotEncoder',
                   'category_encoders.one_hot.OneHotEncoder',
                   'pd.get_dummies',
                   'utils.to_categorical']
summary.loc[onehot_encoders, columns_show ]
# We have at our disposal 4 one-hot encoders by different libraries.

### Warning
```keras.utils.to_categorical``` doesn't work with negative input.

In [None]:
inp = [0, 1, 2, 3, 4]
out = utils.to_categorical(inp)
len(pd.DataFrame(out).drop_duplicates())
# All good: 5 unique values in, 5 unique values out.

In [None]:
inp = [-1, 0, 1, 2, 3]
out = utils.to_categorical(inp)
len(pd.DataFrame(out).drop_duplicates())
# 5 unique values in but just 4 out. What's happening here?

In [None]:
inp = [-2, -1, 0, 1, 2]
out = utils.to_categorical(inp)
len(pd.DataFrame(out).drop_duplicates())
# Now it's even worse: 5 unique values in, just 3 unique values out.

In [None]:
for before, after in zip(inp, out):
    print(before, after)
# This explains why. Negative values weren't mapped the way we thought. 
# -2 was mapped to the same outcome as 1. 
# -1 got mapped to the same outcome as 2.
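
Why would negative labels collide like that? One plausible mechanism is Python-style negative indexing: if the encoder allocates a row of zeros and sets position `label` to 1, a negative label counts from the end of the row. The function below is a hypothetical re-implementation written for this note, not Keras's code; it is here only because it reproduces the same collisions.

```python
def naive_to_categorical(labels):
    """Hypothetical sketch, for illustration only: allocate a row of zeros
    per label, then set position `label` to 1."""
    num_classes = max(labels) + 1
    rows = []
    for label in labels:
        row = [0] * num_classes
        row[label] = 1          # a negative label indexes from the end
        rows.append(row)
    return rows

labels = [-2, -1, 0, 1, 2]
rows = naive_to_categorical(labels)
for label, row in zip(labels, rows):
    print(f'{label:>2}', row)
```

With 3 classes, index -2 is the same slot as index 1, and index -1 the same slot as index 2. Shifting the labels so the smallest is 0 avoids the issue.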

# 3. Binary & Base-N Encoders
Base-N encoding is a superset of 
* binary encoding (N=2);
* one-hot encoding (N=1).

By default ```category_encoders.BaseNEncoder``` takes N=2; its output is therefore identical to that of ```category_encoders.BinaryEncoder```:
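
The idea behind base-N encoding can be sketched in a few lines: give each category an ordinal code, then write that code as digits in base N, one digit per output column. The helper `to_base_n` below is a hypothetical name introduced for this note, and the sketch assumes N is at least 2 (N=1 is a special case that plain digit expansion doesn't cover).

```python
def to_base_n(code, base, width):
    """Write a non-negative integer code as `width` base-`base` digits,
    most significant digit first."""
    digits = []
    for _ in range(width):
        code, remainder = divmod(code, base)
        digits.append(remainder)
    return digits[::-1]

# Ordinal codes 1..5 for five hypothetical categories.
for code in range(1, 6):
    print(code, to_base_n(code, base=2, width=3), to_base_n(code, base=3, width=2))

# A code as large as 299 still fits in 9 binary digits.
print(to_base_n(299, base=2, width=9))
```

A larger base needs fewer columns but more distinct values per column.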

In [None]:
summary.loc[ ['category_encoders.binary.BinaryEncoder', 'category_encoders.basen.BaseNEncoder'] ][ columns_show ]

In [None]:
tis = summary.loc['category_encoders.binary.BinaryEncoder', 'inp2out_map']
tat = summary.loc['category_encoders.basen.BaseNEncoder', 'inp2out_map']
(tis==tat).all().all()

#### But do we really need 10 columns?
Since 2^8 = 256 < 299 ≤ 512 = 2^9, we should only need 9 columns to binary-encode 299 categories.

In [None]:
tis.apply(lambda x: np.unique(x))
# Column cat10_0 is all zeros and is therefore redundant.

In [None]:
# We can pass the option drop_invariant=True to avoid that redundancy.
BaseNEncoder(drop_invariant=True).fit_transform(train['cat10'])
# Now the redundant column disappears; we get 9 columns instead of 10.
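
Conceptually, dropping invariant columns just means discarding any column that holds a single value across all rows. Here is a rough sketch of that idea in plain Python; the function and the toy `encoded` matrix are written for this note and are not the library's implementation.

```python
def drop_invariant(rows):
    """Remove every column that holds a single value across all rows."""
    keep = [j for j, column in enumerate(zip(*rows)) if len(set(column)) > 1]
    return [[row[j] for j in keep] for row in rows]

# Hypothetical 3-bit binary codes for categories 1..3; the leading bit is always 0.
encoded = [[0, 0, 1],
           [0, 1, 0],
           [0, 1, 1]]
print(drop_invariant(encoded))   # [[0, 1], [1, 0], [1, 1]]
```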

# 4. Count Encoder

In [None]:
summary.loc['category_encoders.count.CountEncoder', 'inp2out_map']
# The count encoder seems to output all sorts of integers.

In [None]:
# Let's see where those values come from. As a sample, we take the last 3 values and try to derive them.
out = CountEncoder().fit_transform(train['cat10'], train['target'])
train['cat10'].tail(3), out.tail(3)

In [None]:
# Where did 3011 come from?
(train['cat10']=='HC').sum()

In [None]:
# Where did 565 come from?
(train['cat10']=='BF').sum()

In [None]:
# Where did 5917 come from?
(train['cat10']=='LM').sum()
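
So count encoding simply replaces each category by the number of times it occurs. A dependency-free sketch on a toy column (the `animals` list is invented for illustration):

```python
from collections import Counter

# Toy column, invented for illustration.
animals = ['horse', 'mule', 'horse', 'donkey', 'horse', 'mule']

counts = Counter(animals)               # frequency of each category
encoded = [counts[a] for a in animals]  # replace each value by its frequency
print(encoded)                          # [3, 2, 3, 1, 3, 2]
```

Note that two categories with the same frequency would receive the same code, so count encoding is not guaranteed to be one-to-one.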

# 5. Contrast Encoders
These are [contrast encoders](https://stats.idre.ucla.edu/r/library/r-library-contrast-coding-systems-for-categorical-variables) characterised by the presence of an ```intercept``` in the output.

## 5.1 Helmert Encoder

In [None]:
summary.loc[ 'category_encoders.helmert.HelmertEncoder', 'inp2out_map' ]

## 5.2 Sum Encoder

In [None]:
inp2out_map = summary.loc[ 'category_encoders.sum_coding.SumEncoder', 'inp2out_map' ]
inp2out_map

In [None]:
column_sum = inp2out_map.sum()
column_sum
# This is the signature of sum encoding: except for the 'intercept' column, all columns sum to zero.

In [None]:
column_sum[ column_sum!= 0 ]
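
To see where the zero sums come from, here is a small sketch that builds a sum (deviation) contrast matrix by hand for 4 categories. The function name `sum_coding` is made up for this note, and treating the last category as the one that gets the row of -1s is an assumption about the convention, not something read from the library's source.

```python
def sum_coding(k):
    """Sum (deviation) contrast matrix for k categories, with a leading
    intercept column: k rows, 1 + (k - 1) columns."""
    rows = []
    for i in range(k):
        if i < k - 1:
            contrasts = [int(i == j) for j in range(k - 1)]  # indicator row
        else:
            contrasts = [-1] * (k - 1)                        # last category
        rows.append([1] + contrasts)
    return rows

matrix = sum_coding(4)
for row in matrix:
    print(row)

column_sums = [sum(col) for col in zip(*matrix)]
print(column_sums)   # [4, 0, 0, 0]
```

Each contrast column holds a single 1, a single -1 and zeros elsewhere, hence the zero sum; only the intercept column adds up to the number of categories.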

## 5.3 Backward-Difference Encoder

In [None]:
summary.loc[ 'category_encoders.backward_difference.BackwardDifferenceEncoder', 'inp2out_map' ]
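
For reference, here is a sketch of the backward-difference contrast matrix in the textbook convention used on the UCLA page linked above, where each column compares a level with the level before it. The function name is made up for this note; whether `category_encoders` orders or scales its columns in exactly this way is an assumption, and its output additionally carries an `intercept` column.

```python
from fractions import Fraction

def backward_difference_coding(k):
    """Backward-difference contrast matrix for k ordered levels
    (k rows, k - 1 columns). In column j, levels 1..j get -(k - j)/k
    and the remaining levels get j/k."""
    return [[Fraction(-(k - j), k) if i <= j else Fraction(j, k)
             for j in range(1, k)]
            for i in range(1, k + 1)]

bd_matrix = backward_difference_coding(4)
for row in bd_matrix:
    print([str(x) for x in row])

bd_column_sums = [sum(col) for col in zip(*bd_matrix)]
print(bd_column_sums)
```

As with sum encoding, every contrast column sums to zero.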

# Grand summary

In [None]:
summary[columns_show]

This notebook is getting a little long. We've covered simple, one-to-one mapping encoders. Let's do target encoders in another notebook!