<a id='table-of-contents'></a>
# Table of Contents
---
* [0. Introduction](#0)
* [1. Label Encoder](#1)
* [2. One-Hot Encoder](#2)
* [3. Sum Encoder](#3)
* [4. Helmert Encoder](#4)
* [5. Target Encoder](#5)
* [6. M-Estimate Encoder](#6)
* [7. Weight Of Evidence Encoder](#7)
* [8. James-Stein Encoder](#8)
* [9. Leave-one-out Encoder](#9)
* [10. Catboost Encoder](#10)
* [11. Validation](#11)
* [12. Reducing Cardinality](#12)

<a id="0"></a>
# 0. Introduction
---
1. **References**
- [11 Categorical Encoders and Benchmark](https://www.kaggle.com/subinium/11-categorical-encoders-and-benchmark) - The encoders' description is from this notebook 
- [CategoricalEncodingBenchmark](https://github.com/DenisVorotyntsev/CategoricalEncodingBenchmark)

2. **Methodology**
- Some feature engineering (only on `Ticket` and `Cabin`)
- Fill nulls with the `mean` (integer columns), the `median` (float columns) or `'x'` (object columns)
- Remove `Name`
- `KFold(5)` for Cross Validation
- LightGBM for Modeling

In [None]:
import pandas as pd
import numpy as np

from category_encoders.ordinal import OrdinalEncoder
from category_encoders.woe import WOEEncoder
from category_encoders.target_encoder import TargetEncoder
from category_encoders.sum_coding import SumEncoder
from category_encoders.m_estimate import MEstimateEncoder
from category_encoders.leave_one_out import LeaveOneOutEncoder
from category_encoders.helmert import HelmertEncoder
from category_encoders.cat_boost import CatBoostEncoder
from category_encoders.james_stein import JamesSteinEncoder
from category_encoders.one_hot import OneHotEncoder

import warnings
warnings.filterwarnings("ignore", category=FutureWarning)

In [None]:
%%time
def concat_cols(df, cols):
    it = iter(cols)
    concated = df[next(it)].copy()
    for i in it:
        concated += df[i]
    return concated

def ticket_features(df):
    s = df['Ticket'].str.split(expand=True)[0]
    check = ~(s.str.isdigit().fillna(True))
    df['letter_ticket'] = s[check].str.lower().str.replace(r"[\.|\s|\/]+", "", regex=True)
    df['number_ticket'] = df['Ticket'].str.extract(r'(\d+)')[0].astype(float)

    return df

def cabin_features(df):
    df['letter_cabin'] = df['Cabin'].str.extract(r'(\D+)')[0].str.lower()
    df['number_cabin'] = df['Cabin'].str.extract(r'(\d+)')[0].astype(float)
    
    return df

def fill_null(df):
    cols = df.dtypes.to_dict()
    
    for i, _type  in cols.items():
        if _type == np.int_:
            df[i] = df[i].fillna(df[i].mean())
        if _type == np.float64:  # np.float_ was removed in NumPy 2.0
            df[i] = df[i].fillna(df[i].median())
        if _type == np.object_:
            df[i] = df[i].fillna('x')
    
    return df

df = (
    pd.read_csv('../input/tabular-playground-series-apr-2021/train.csv', index_col='PassengerId')
    .pipe(ticket_features)
    .pipe(cabin_features)
    .pipe(fill_null)
    .drop(['Name', 'Ticket', 'Cabin'], axis=1)
)

cat_features = ['Sex', 'Embarked', 'letter_ticket', 'letter_cabin']
target_col = 'Survived'
target = df[target_col]

df

[back to top](#table-of-contents)
<a id="1"></a>
# 1. Label Encoder
---
Label encoding converts categorical values into integers.
The code is very simple; to encode a specific column you can proceed as follows:

``` python
from sklearn.preprocessing import LabelEncoder
label = LabelEncoder()

train[column_name] = label.fit_transform(train[column_name])
```

The idea is simple: every occurrence of the same class is mapped to the same integer,
so the labels run from 0 to n-1.

The disadvantage is that the integer order is arbitrary (alphabetical for `LabelEncoder`, or the order in which values appear in the data), so the encoding imposes an unintended order on the labels, which can add noise. In other words, nominal data is treated as ordinal (ordered) data, which can lead to unintended consequences.
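
To make the arbitrary ordering concrete, here is a minimal pure-Python sketch. The helper `label_encode` and the sample values are hypothetical and only illustrate the idea; they are not the library implementation.

```python
def label_encode(values, sort_classes=True):
    """Map each distinct category to an integer in 0..n-1."""
    # sort_classes=True orders classes alphabetically;
    # sort_classes=False keeps the order of first appearance.
    classes = sorted(set(values)) if sort_classes else list(dict.fromkeys(values))
    mapping = {category: code for code, category in enumerate(classes)}
    return [mapping[v] for v in values], mapping


embarked = ["S", "C", "S", "Q", "C"]  # illustrative values
codes_sorted, mapping_sorted = label_encode(embarked)
codes_seen, mapping_seen = label_encode(embarked, sort_classes=False)
```

The same column gets two different integer orderings depending on a choice that has nothing to do with the target, which is exactly the noise described above.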

If you use `Category-Encoders` it will look like this code below.

In [None]:
%%time
label_encoder = OrdinalEncoder(cat_features).fit(df)

[back to top](#table-of-contents)
<a id="2"></a>
# 2. One-Hot Encoder (OHE, dummy encoder)
---
So what can you do to give each category its own value instead of imposing an order?

Create one column per category value. If a feature has N distinct labels under label encoding, one-hot encoding creates N columns.

It is called one-hot encoding because, in each row, only the column for that row's category is set to 1. It is also called dummy encoding, because it creates dummy variables.
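
As a small illustration of the "one column per category" idea, here is a pure-Python sketch. The helper `one_hot` and the sample values are hypothetical and stand in for what a library encoder would produce.

```python
def one_hot(values):
    """Return the sorted category list and one 0/1 row per input value."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


categories, rows = one_hot(["male", "female", "female"])
```

Each row contains exactly one 1, in the column of its own category.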


For example, with pandas on `train` and `test` DataFrames:

``` python
traintest = pd.concat([train, test])
dummies = pd.get_dummies(traintest, columns=traintest.columns, drop_first=True, sparse=True)
train_ohe = dummies.iloc[:train.shape[0], :]
test_ohe = dummies.iloc[train.shape[0]:, :]
train_ohe = train_ohe.sparse.to_coo().tocsr()
test_ohe = test_ohe.sparse.to_coo().tocsr()
```

If you use `Category-Encoders` it will look like this code below.

In [None]:
%%time
OHE_encoder = OneHotEncoder(cat_features).fit(df)

[back to top](#table-of-contents)
<a id="3"></a>
# 3. Sum Encoder (Deviation Encoder, Effect Encoder)
---
This encoding technique is also known as Deviation Encoding or Effect Encoding. Sum encoding is similar to dummy encoding, with one difference: dummy coding represents the data with 0 and 1, whereas effect encoding uses three values, 1, 0 and -1. As a result, it compares the mean of the dependent variable (target) for a given level of a categorical column to the overall mean of the target.
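
A minimal sketch of the textbook sum (deviation) contrast, assuming the last level is used as the reference level. The helper `sum_coding` is hypothetical, and the column layout of a library encoder may differ.

```python
def sum_coding(levels):
    """Sum (deviation) contrast rows: n levels -> n-1 columns."""
    n = len(levels)
    coding = {}
    for i, level in enumerate(levels):
        if i < n - 1:
            # Non-reference level: a single 1 in its own column.
            coding[level] = [1 if j == i else 0 for j in range(n - 1)]
        else:
            # Reference level (assumed to be the last one): -1 everywhere.
            coding[level] = [-1] * (n - 1)
    return coding


coding = sum_coding(["C", "Q", "S"])
```

Every column sums to zero across the levels, which is what makes the coefficients deviations from the overall mean rather than from a baseline category.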

If you use `Category-Encoders` it will look like this code below.

In [None]:
%%time
SE_encoder = SumEncoder(cat_features).fit(df, target)

[back to top](#table-of-contents)
<a id="4"></a>
# 4. Helmert Encoder
---

**Helmert Encoding** is a third commonly used type of categorical encoding for regression along with OHE and Sum Encoding. 

It compares each level of a categorical variable to the mean of the subsequent levels. 

This type of encoding can be useful in certain situations where levels of the categorical variable are ordered.
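
The comparison described above can be sketched directly on per-level target means. This only illustrates which quantities are compared, assuming equally weighted level means; it does not reproduce the columns an encoder emits. The helper `helmert_comparisons` and the numbers are hypothetical.

```python
def helmert_comparisons(level_means):
    """For each level except the last: its mean minus the mean of the later levels' means."""
    names = list(level_means)  # dict order is taken as the level order
    comparisons = {}
    for i, name in enumerate(names[:-1]):
        later = [level_means[other] for other in names[i + 1:]]
        comparisons[name] = level_means[name] - sum(later) / len(later)
    return comparisons


# Illustrative mean target per ordered level.
means = {"low": 0.2, "mid": 0.4, "high": 0.6}
diffs = helmert_comparisons(means)
```

With n levels there are n-1 comparisons, which matches the number of columns a contrast encoding needs.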

If you use `Category-Encoders` it will look like this code below.

In [None]:
%%time
HE_encoder = HelmertEncoder(cat_features).fit(df, target)

[back to top](#table-of-contents)
<a id="5"></a>
# 5. Target Encoder
---

Target encoding is used in many kernels.

The encoded category values are calculated according to the following formulas:

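$$s = \frac{1}{1+\exp\left(-\frac{n-mdl}{a}\right)}$$

$$\hat{x}^k = prior \cdot (1-s) + s \cdot \frac{n^{+}}{n}$$

- $n$ is the number of observations in category $k$ and $n^{+}$ is the number of positives among them
- mdl means **'min data in leaf'**
- a means **'smooth parameter, power of regularization'**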
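
The two formulas above can be sketched in a few lines of plain Python. The function name `smoothed_target_encoding` and the numbers are hypothetical; `mdl` and `a` follow the symbols used in the formulas.

```python
import math


def smoothed_target_encoding(n, n_pos, prior, mdl=1, a=1.0):
    """Blend the category mean n_pos / n with the prior, following the formulas above."""
    s = 1.0 / (1.0 + math.exp(-(n - mdl) / a))  # weight given to the category mean
    return prior * (1.0 - s) + s * (n_pos / n)


# Illustrative numbers: the global positive rate (prior) is 0.4.
rare = smoothed_target_encoding(n=2, n_pos=2, prior=0.4, mdl=20, a=10.0)
common = smoothed_target_encoding(n=500, n_pos=400, prior=0.4, mdl=20, a=10.0)
```

The rare category is pulled strongly towards the prior even though its raw mean is 1.0, while the common category stays close to its own mean of 0.8.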

Target Encoder is powerful, but it has a huge disadvantage:

> **target leakage**: it uses information about the target. 

To reduce the effect of target leakage, you can:

- Increase regularization
- Add random noise to the representation of the category in the train dataset (a kind of augmentation)
- Use double validation (encode inside a separate validation loop)

Let's use it while being careful about overfitting.

If you use `Category-Encoders` it will look like this code below.

In [None]:
%%time
TE_encoder = TargetEncoder(cat_features).fit(df, target)

[back to top](#table-of-contents)
<a id="6"></a>
# 6. M-Estimate Encoder
---

**M-Estimate Encoder** is a **simplified version of Target Encoder**. It has only one hyperparameter, $m$:

$$\hat{x}^k = \frac{n^+ + prior \cdot m}{n + m}$$

Here $n$ is the number of observations in category $k$ and $n^+$ is the number of positives among them.

A higher value of m results in stronger shrinking towards the prior. Recommended values for m are in the range of 1 to 100.
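
A minimal sketch of the m-estimate, assuming the standard form with the category count `n` in the denominator. The function name `m_estimate` and the numbers are hypothetical.

```python
def m_estimate(n, n_pos, prior, m=1.0):
    """Category mean shrunk towards the prior; m acts like m pseudo-observations."""
    return (n_pos + prior * m) / (n + m)


# Illustrative category: 10 rows, 8 positives, global positive rate 0.4.
no_shrink = m_estimate(n=10, n_pos=8, prior=0.4, m=0.0)
some_shrink = m_estimate(n=10, n_pos=8, prior=0.4, m=10.0)
heavy_shrink = m_estimate(n=10, n_pos=8, prior=0.4, m=1000.0)
```

With `m=0` the result is the raw category mean, and as `m` grows the estimate moves towards the prior.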

If you use `Category-Encoders` it will look like this code below.

In [None]:
%%time
MEE_encoder = MEstimateEncoder(cat_features).fit(df, target)

[back to top](#table-of-contents)
<a id="7"></a>
# 7. Weight of Evidence Encoder 
---
**Weight Of Evidence** is a commonly used target-based encoder in credit scoring. 

It is a measure of the “strength” of a grouping for separating good and bad risk (default). 

It is calculated from the basic odds ratio:

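``` python
import numpy as np

a = 0.30  # distribution of good credit outcomes in this group (illustrative value)
b = 0.10  # distribution of bad credit outcomes in this group (illustrative value)
WoE = np.log(a / b)
```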

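However, if we use the formula as is, it might lead to **target leakage** (and overfitting).

To avoid that, a regularization parameter a is introduced and WoE is calculated in the following way:

$$nominator = \frac{n^+ + a}{y^+ + 2a}$$

$$denominator = \frac{n - n^+ + a}{y - y^+ + 2a}$$

$$\hat{x}^k = \ln\left(\frac{nominator}{denominator}\right)$$

Here $n$ and $n^+$ are the total and positive counts in category $k$, and $y$ and $y^+$ are the total and positive counts in the whole train dataset.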
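
A minimal sketch of the regularized WoE. It assumes `n` and `n_pos` are the total and positive counts in the category, `y` and `y_pos` are the total and positive counts in the whole train dataset, and `a` is the regularization parameter. The function name and the numbers are hypothetical.

```python
import math


def weight_of_evidence(n, n_pos, y, y_pos, a=1.0):
    """Regularized WoE of one category."""
    # Share of all positives that fall in this category (regularized).
    nominator = (n_pos + a) / (y_pos + 2 * a)
    # Share of all negatives that fall in this category (regularized).
    denominator = ((n - n_pos) + a) / ((y - y_pos) + 2 * a)
    return math.log(nominator / denominator)


# Illustrative dataset: 100 rows, 50 positives.
neutral = weight_of_evidence(n=10, n_pos=5, y=100, y_pos=50)
mostly_positive = weight_of_evidence(n=10, n_pos=9, y=100, y_pos=50)
mostly_negative = weight_of_evidence(n=10, n_pos=1, y=100, y_pos=50)
```

A category with the same positive share as the whole dataset gets a WoE of 0; the sign then tells which class the category leans towards.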

If you use `Category-Encoders` it will look like this code below.

In [None]:
%%time
WOE_encoder = WOEEncoder(cat_features).fit(df, target)

[back to top](#table-of-contents)
<a id="8"></a>

# 8. James-Stein Encoder
---
**James-Stein Encoder** is a target-based encoder.

The idea behind James-Stein Encoder is simple. Estimation of the mean target for category k could be calculated according to the following formula:

$$\hat{x}^k = (1-B) * \frac{n^+}{n} + B * \frac{y^+}{y} $$

One way to select B is to tune it like a hyperparameter via cross-validation, but Charles Stein came up with another solution to the problem:

$$B = \frac{Var[y^k]}{Var[y^k] + Var[y]}$$

This seems quite fair, but the James-Stein estimator has a big disadvantage: it is defined only for normally distributed targets (which is not the case for any classification task).

To avoid that, we can either convert binary targets with a log-odds ratio, as is done in the WoE Encoder (this is the default because it is simple), or use a beta distribution.
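
The shrinkage formula and Stein's choice of B can be sketched as follows. This is a simplified illustration that applies the formulas directly to raw 0/1 targets, uses sample variances, and assumes at least two observations per category; it does not include the log-odds conversion mentioned above. The function name `james_stein` is hypothetical.

```python
from statistics import mean, variance


def james_stein(category_targets, all_targets):
    """Shrink the category mean towards the global mean with weight B."""
    var_k = variance(category_targets)  # Var[y^k], sample variance (assumption)
    var_all = variance(all_targets)     # Var[y]
    b = var_k / (var_k + var_all)
    return (1 - b) * mean(category_targets) + b * mean(all_targets)


# Illustrative targets: the category holds 4 of the 8 rows.
all_targets = [1, 1, 1, 0, 0, 0, 0, 0]
encoded = james_stein([1, 1, 1, 0], all_targets)
```

The noisier a category's targets are relative to the whole dataset, the larger B becomes and the more the estimate leans on the global mean.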

If you use `Category-Encoders` it will look like this code below.

In [None]:
%%time
JSE_encoder = JamesSteinEncoder(cat_features).fit(df, target)

[back to top](#table-of-contents)
<a id="9"></a>

# 9. Leave-one-out Encoder (LOO or LOOE)
---

**Leave-one-out Encoding** is another example of target-based encoders.

This encoder calculates the mean target of category k for observation i as if observation i were removed from the dataset:

$$\hat{x}^k_i = \frac{\sum_{j \neq i} y_j \cdot (x_j == k)}{\sum_{j \neq i} (x_j == k)}$$

While encoding the test dataset, a category is replaced with the mean target of category k in the train dataset:

$$\hat{x}^k = \frac{\sum_j y_j \cdot (x_j == k)}{\sum_j (x_j == k)}$$
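
A minimal sketch of the train-time leave-one-out computation. Falling back to the global mean for categories that occur only once is an assumption made here for illustration; the function name `leave_one_out_encode` is hypothetical.

```python
def leave_one_out_encode(categories, targets):
    """Encode each row with the mean target of its category, excluding the row itself."""
    sums, counts = {}, {}
    for c, y in zip(categories, targets):
        sums[c] = sums.get(c, 0) + y
        counts[c] = counts.get(c, 0) + 1

    global_mean = sum(targets) / len(targets)
    encoded = []
    for c, y in zip(categories, targets):
        if counts[c] > 1:
            # Remove the row's own target from both the sum and the count.
            encoded.append((sums[c] - y) / (counts[c] - 1))
        else:
            # Assumption: a category seen once has no other rows to average.
            encoded.append(global_mean)
    return encoded


categories = ["a", "a", "a", "b", "b"]
targets = [1, 0, 1, 1, 0]
encoded = leave_one_out_encode(categories, targets)
```

Rows of the same category receive different values depending on their own target, which is what distinguishes this from plain target encoding.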

If you use `Category-Encoders` it will look like this code below.

In [None]:
%%time
LOOE_encoder = LeaveOneOutEncoder(cat_features).fit(df, target)

[back to top](#table-of-contents)
<a id="10"></a>

# 10. Catboost Encoder
---
**Catboost** is a recently created target-based categorical encoder. 

It is intended to overcome target leakage problems inherent in LOO. 
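
One way to see how leakage can be reduced is the ordered target statistic idea: each row is encoded using only the targets of the rows that come before it, plus a prior. The sketch below assumes that idea; the function name, the prior weight `a` and the use of the global mean as the prior are illustrative assumptions, not necessarily what the library does by default.

```python
def ordered_target_encode(categories, targets, a=1.0):
    """Encode each row from the targets of earlier rows of the same category only."""
    prior = sum(targets) / len(targets)  # assumption: global mean as the prior
    sums, counts = {}, {}
    encoded = []
    for c, y in zip(categories, targets):
        seen_sum = sums.get(c, 0.0)
        seen_count = counts.get(c, 0)
        # The row's own target is not used for its own encoding.
        encoded.append((seen_sum + prior * a) / (seen_count + a))
        # Only now does the row become visible to later rows.
        sums[c] = seen_sum + y
        counts[c] = seen_count + 1
    return encoded


encoded = ordered_target_encode(["a", "a", "b", "a"], [1, 0, 1, 1])
```

Because the result depends on row order, shuffling the training data before encoding changes the encoded values.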

If you use `Category-Encoders` it will look like this code below.

In [None]:
%%time
CBE_encoder = CatBoostEncoder(cat_features).fit(df, target)

[back to top](#table-of-contents)
<a id="11"></a>

# 11. Validation

In [None]:
from sklearn.model_selection import KFold
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from sklearn.linear_model import LogisticRegression
import lightgbm as lgb

In [None]:
cat_features = ['Sex', 'Embarked', 'letter_ticket', 'letter_cabin']

encoders = {
    'Label Encoder': OrdinalEncoder(cat_features),
    'One-Hot Encoder': OneHotEncoder(cat_features),
    'Sum Encoder': SumEncoder(cat_features),
    'Helmert Encoder': HelmertEncoder(cat_features),
    'Target Encoder': TargetEncoder(cat_features),
    'M-Estimate Encoder': MEstimateEncoder(cat_features),
    'Weight Of Evidence Encoder': WOEEncoder(cat_features),
    'James-Stein Encoder': JamesSteinEncoder(cat_features),
    'Leave-one-out Encoder': LeaveOneOutEncoder(cat_features),
    'Catboost Encoder': CatBoostEncoder(cat_features),
}

In [None]:
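def run_model(df_raw, model, encoder, target='Survived'):
    """Return the mean ROC AUC over 5 folds for one encoder/model pair."""
    features_raw = df_raw.columns.drop(target)

    X = df_raw[features_raw]
    y = df_raw[target]

    cv = KFold(n_splits=5)
    auc_scores = []
    for train, test in cv.split(X, y):
        # Fit the encoder on the training fold only, so target-based
        # encoders never see the validation fold's targets.
        y_train = y.iloc[train]
        X_train = encoder.fit_transform(X.iloc[train], y_train)
        y_test = y.iloc[test]
        X_test = encoder.transform(X.iloc[test])

        # The regressor's continuous predictions serve as scores for the AUC.
        model.fit(X_train, y_train)
        auc_scores.append(roc_auc_score(y_test, model.predict(X_test)))
    return np.mean(auc_scores)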

In [None]:
%%time
results = {}
for name, i in encoders.items():
    results[name] = run_model(df, lgb.LGBMRegressor(), i)

In [None]:
pd.DataFrame.from_dict(results, orient='index', columns=['auc']).style.highlight_max()

[back to top](#table-of-contents)
<a id="12"></a>

# 12. Reducing Cardinality

In [None]:
def concat_cols(df, cols):
    it = iter(cols)
    aux = df[next(it)].copy()  # copy so the first column is not modified in place
    for i in it:
        aux += df[i]
    return aux

def ticket_reducing_cat(df):
    contain_letter = lambda letter, regex: df['letter_ticket'].str.contains(letter, regex=regex).astype('int').astype('str')
    letter_dict = {
        'letter_ticket__a': contain_letter('a', regex=False),
        'letter_ticket__c': contain_letter('c', regex=False),
        'letter_ticket__p': contain_letter('p', regex=False),
        'letter_ticket__s': contain_letter(r'(?!stono|sotono)s', regex=True),
        'letter_ticket__stono': contain_letter(r'(stono|sotono)', regex=True),
        'letter_ticket__paris': contain_letter(r'(paris)', regex=True),
    }
    
    new_df = df.assign(**letter_dict)
    new_df['letter_ticket__concat'] = concat_cols(new_df, letter_dict.keys())
    new_df = new_df.drop(list(letter_dict.keys())+['letter_ticket'], axis=1)
    return new_df


In [None]:
# `letter_ticket` is dropped by ticket_reducing_cat, so it is not listed here
cat_features = ['Sex', 'Embarked', 'letter_cabin', 'letter_ticket__concat']

encoders = {
    'Label Encoder': OrdinalEncoder(cat_features),
    'One-Hot Encoder': OneHotEncoder(cat_features),
    'Sum Encoder': SumEncoder(cat_features),
    'Helmert Encoder': HelmertEncoder(cat_features),
    'Target Encoder': TargetEncoder(cat_features),
    'M-Estimate Encoder': MEstimateEncoder(cat_features),
    'Weight Of Evidence Encoder': WOEEncoder(cat_features),
    'James-Stein Encoder': JamesSteinEncoder(cat_features),
    'Leave-one-out Encoder': LeaveOneOutEncoder(cat_features),
    'Catboost Encoder': CatBoostEncoder(cat_features),
}

df_filter = ticket_reducing_cat(df)

In [None]:
%%time
results = {}
for name, i in encoders.items():
    results[name] = run_model(df_filter, lgb.LGBMRegressor(), i)

In [None]:
pd.DataFrame.from_dict(results, orient='index', columns=['auc']).style.highlight_max()

[back to top](#table-of-contents)