In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Explore the data:

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')

import warnings
warnings.simplefilter(action='ignore')

# show all columns and rows when displaying DataFrames
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

In [None]:
train = pd.read_csv('/kaggle/input/house-prices-advanced-regression-techniques/train.csv')
test = pd.read_csv('/kaggle/input/house-prices-advanced-regression-techniques/test.csv')

train_test = pd.concat((train, test), sort=False).reset_index(drop=True)
train_test.drop(['Id'], axis=1, inplace=True)

train_test.head()

In [None]:
print('Number of categorical columns: ', len(train.select_dtypes(include='object').columns))
print('Categorical columns: ', train.select_dtypes(include='object').columns)

In [None]:
# see different categorical encoders

from category_encoders.ordinal import OrdinalEncoder
from category_encoders.woe import WOEEncoder
from category_encoders.target_encoder import TargetEncoder
from category_encoders.sum_coding import SumEncoder
from category_encoders.m_estimate import MEstimateEncoder
from category_encoders.leave_one_out import LeaveOneOutEncoder
from category_encoders.helmert import HelmertEncoder
from category_encoders.cat_boost import CatBoostEncoder
from category_encoders.james_stein import JamesSteinEncoder
from category_encoders.one_hot import OneHotEncoder

**Each encoder is demonstrated on a single categorical column: this is enough to show how it works, and it avoids any problems with RAM.**


# 1 - Ordinal Encoder

One of the most popular encoders is the Ordinal Encoder (OE). It assigns an integer to each unique category value. For example, in the LotConfig column 'Inside' becomes 1, 'FR2' becomes 2, and so on. It is also called integer encoding, and it is easy to reverse.

An ordinal encoding may be enough when the categories have a natural order: the integers then carry that order, and machine learning algorithms may be able to learn and exploit it.

This approach has a serious disadvantage: for nominal variables it imposes an order where none may exist. This can mislead the model, and a one-hot encoding may be used instead.
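
To make the mapping concrete, here is a minimal pure-Python sketch of integer encoding and its reversal. The helper `fit_ordinal` and the toy values are made up for illustration, and the codes are assigned in order of first appearance; the library encoder may number the categories differently.

```python
def fit_ordinal(values):
    """Map each distinct category to an integer, in order of first appearance."""
    mapping = {}
    for value in values:
        if value not in mapping:
            mapping[value] = len(mapping) + 1  # codes start at 1
    return mapping


lot_config = ["Inside", "FR2", "Inside", "Corner", "FR2"]  # toy values
mapping = fit_ordinal(lot_config)
encoded = [mapping[v] for v in lot_config]

# Reversing the encoding is a dictionary lookup in the other direction.
inverse = {code: category for category, code in mapping.items()}
decoded = [inverse[c] for c in encoded]

print(mapping)
print(encoded)
print(decoded == lot_config)
```

Note that the integers 1, 2, 3 here say nothing about which lot configuration is "larger"; that is exactly the artificial order the paragraph above warns about.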


In [None]:
%%time
encoder = OrdinalEncoder()
ordinal_encoder_example  = encoder.fit_transform(train_test['LotConfig'])

In [None]:
# see  data
ordinal_encoder_example['Original_data'] = train_test['LotConfig']
ordinal_encoder_example = ordinal_encoder_example.rename(columns={'LotConfig': 'Ordinal_data'})
ordinal_encoder_example.head()

# 2 - WOE Encoder

Weight of Evidence (WoE) is a target-based encoder commonly used in credit scoring.

It is a measure of the “strength” of a grouping for separating good and bad risk (default). It is calculated from the basic odds ratio:

$a = \text{Distribution of Goods}$

$b = \text{Distribution of Bads}$

$WoE = \ln(a / b)$

In the formulas above:
- $\text{Distribution of Goods}$ - % of Good Customers in a particular group;
- $\text{Distribution of Bads}$ - % of Bad Customers in a particular group;
- $\ln$ - natural log.

Keep in mind that a positive WoE means $\text{Distribution of Goods} > \text{Distribution of Bads}$. A negative WoE means $\text{Distribution of Goods} < \text{Distribution of Bads}$.

*Hint:* The log of a number greater than 1 is positive; the log of a number less than 1 is negative.

Used as is, however, these formulas can lead to target leakage and overfitting. To avoid that, a regularization parameter $a$ is introduced (not to be confused with the distribution $a$ above), and WoE is calculated in the following way:

$\text{numerator} = \frac{n^+ + a}{y^+ + 2a}$

$\text{denominator} = \frac{n - n^+ + a}{y - y^+ + 2a}$

$x^k = \ln\left(\frac{\text{numerator}}{\text{denominator}}\right)$


**However, the WOE Encoder is difficult to apply to this dataset, because it needs a binary target.**

Documentation: `fit(X, y)` - Fit encoder according to X and **binary** y.
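
Since SalePrice is not binary, the sketch below applies the regularized formula to a made-up 0/1 target instead. It assumes the following reading of the notation, which the text does not spell out: $n$ is the number of rows in category $k$, $n^+$ the number of those rows with a positive target, and $y$, $y^+$ are the same counts over the whole dataset. The function name `woe_encode` and the toy data are hypothetical.

```python
import math


def woe_encode(categories, target, a=1.0):
    """Regularized WoE per category for a binary (0/1) target."""
    y_total = len(target)   # y: all rows
    y_pos = sum(target)     # y+: all positive rows
    stats = {}              # category -> (n, n+)
    for c, t in zip(categories, target):
        n, n_pos = stats.get(c, (0, 0))
        stats[c] = (n + 1, n_pos + t)

    woe = {}
    for c, (n, n_pos) in stats.items():
        numerator = (n_pos + a) / (y_pos + 2 * a)
        denominator = (n - n_pos + a) / (y_total - y_pos + 2 * a)
        woe[c] = math.log(numerator / denominator)
    return woe


cats = ["A", "A", "A", "B", "B", "B"]  # toy categories
y = [1, 1, 0, 0, 0, 1]                 # toy binary target
woe = woe_encode(cats, y)
print(woe)
```

Category A holds two of the three positives and gets a positive value; category B holds two of the three negatives and gets a negative one, which matches the sign rule stated earlier.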

# 3 - Target Encoder

Target Encoding is the most popular type of encoding in Kaggle competitions. It uses information about the target to encode categories, which makes it extremely powerful. The encoded category values are calculated according to the following formulas:

$s = \frac{1}{1 + \exp\left(-\frac{n - mdl}{a}\right)}$

$x^k = prior \cdot (1 - s) + s \cdot \frac{n^+}{n}$

- $mdl$ - min data (samples) in leaf,
- $a$ - smoothing parameter, which controls the strength of regularization.

Recommended values for both are in the range (1, 100). New category values, and values that appear only once in the training dataset, are replaced with the prior.

However, it has a huge disadvantage: target leakage, because it uses information about the target. Because of the leakage, the model overfits the training data, which results in unreliable validation and lower test scores.

You can reduce the effect of target leakage by increasing regularization, adding random noise to the encoded categories in the training dataset (a form of augmentation), or using Double Validation.
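
The two formulas can be traced by hand on a few rows. This is a sketch under assumptions, not the library's implementation: `prior` is taken to be the global target mean, $n^+ / n$ the target mean within the category, and the function name and toy prices are invented for illustration.

```python
import math


def target_encode(categories, target, mdl=1, a=1.0):
    """Blend each category mean with the global mean using the sigmoid weight s."""
    prior = sum(target) / len(target)
    sums, counts = {}, {}
    for c, t in zip(categories, target):
        sums[c] = sums.get(c, 0.0) + t
        counts[c] = counts.get(c, 0) + 1

    encoding = {}
    for c, n in counts.items():
        s = 1 / (1 + math.exp(-(n - mdl) / a))  # more rows -> s closer to 1
        encoding[c] = prior * (1 - s) + s * sums[c] / n
    return encoding, prior


shapes = ["Reg", "Reg", "Reg", "IR1", "IR1", "IR2"]   # toy categories
prices = [100.0, 120.0, 110.0, 200.0, 220.0, 400.0]  # toy target
encoding, prior = target_encode(shapes, prices)

# Categories never seen during fitting fall back to the prior.
unseen = encoding.get("IR3", prior)
print(encoding, unseen)
```

With `mdl=1` and `a=1`, a category with a single row gets `s = 0.5`, so its value sits halfway between its own target and the prior; raising either parameter pulls rare categories further towards the prior.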

In [None]:
%%time
TE_encoder = TargetEncoder()
train_te = TE_encoder.fit_transform(train['LotShape'], train['SalePrice'])
test_te = TE_encoder.transform(test['LotShape'])

In [None]:
# see for train data
target_encoder_example = train_te.rename(columns={'LotShape': 'Target_encoder_data'})
target_encoder_example['Original_data'] = train['LotShape']
target_encoder_example.head()

# 4 - Sum Encoder

It is also called Deviation Encoding or Effect Encoding.

Sum Encoder compares the mean of the target variable for a given level of a categorical column to the overall mean of the target. It is commonly used in Linear Regression (LR) types of models. In a model with Sum Encoder, the intercept represents the grand mean (across all conditions) and the coefficients can be interpreted directly as the main effects.
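
The encoding itself does not look at the target; the interpretation above comes from the contrast matrix it uses. Below is a rough sketch of a standard sum (deviation) contrast for $k$ levels with $k - 1$ columns. The helper name is hypothetical, and which level receives the row of -1s is a convention that may differ from the library's output.

```python
def sum_contrast(levels):
    """Return {level: row} for sum coding: k levels, k - 1 columns."""
    k = len(levels)
    rows = {}
    for i, level in enumerate(levels):
        if i < k - 1:
            # Indicator row: 1 in the level's own column, 0 elsewhere.
            rows[level] = [1 if j == i else 0 for j in range(k - 1)]
        else:
            # The last level is coded -1 in every column.
            rows[level] = [-1] * (k - 1)
    return rows


levels = ["Attchd", "Detchd", "BuiltIn"]  # toy subset of GarageType values
contrast = sum_contrast(levels)
for level, row in contrast.items():
    print(level, row)

# Every column sums to zero, which is what ties the intercept to the grand mean.
column_sums = [sum(col) for col in zip(*contrast.values())]
print(column_sums)
```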

In [None]:
%%time
SE_encoder = SumEncoder(cols=['GarageType'])  # pass cols by keyword: the first positional argument is verbose
train_se = SE_encoder.fit_transform(train['GarageType'], train['SalePrice'])
test_se = SE_encoder.transform(test['GarageType'])

In [None]:
# see for train data
sum_encoder_example = train_se.rename(columns={'GarageType': 'Sum_encoder_data'})
sum_encoder_example['Original_data'] = train['GarageType']
sum_encoder_example.head()

# 5 - M-Estimate Encoder

The M-Estimate Encoder is based on the m-probability estimate of likelihood.

It is a simplified version of Target Encoder. It has only one hyperparameter, $m$, which represents the strength of regularization: the higher the value of $m$, the stronger the shrinking. Recommended values for $m$ are in the range of 1 to 100.
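
The text gives no formula here, so the sketch below uses the usual form of the m-estimate, $(\text{sum}_k + m \cdot prior) / (n_k + m)$, as an assumption about what the encoder computes. The function name and the toy prices are invented for illustration.

```python
def m_estimate_encode(categories, target, m=1.0):
    """Shrink each category mean towards the global mean with strength m."""
    prior = sum(target) / len(target)
    sums, counts = {}, {}
    for c, t in zip(categories, target):
        sums[c] = sums.get(c, 0.0) + t
        counts[c] = counts.get(c, 0) + 1
    return {c: (sums[c] + prior * m) / (counts[c] + m) for c in counts}


quality = ["Gd", "Gd", "TA", "TA", "TA", "Ex"]        # toy categories
price = [200.0, 220.0, 150.0, 160.0, 140.0, 300.0]   # toy target, mean 195

weak = m_estimate_encode(quality, price, m=1.0)
strong = m_estimate_encode(quality, price, m=100.0)
print(weak)
print(strong)
```

With `m=100` every value ends up close to the global mean of 195, which illustrates the "higher $m$, stronger shrinking" statement above.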



In [None]:
%%time
MEE_encoder = MEstimateEncoder()
train_mee = MEE_encoder.fit_transform(train['KitchenQual'], train['SalePrice'])
test_mee = MEE_encoder.transform(test['KitchenQual'])

In [None]:
# see for train data
me_encoder_example = train_mee.rename(columns={'KitchenQual': 'ME_encoder_data'})
me_encoder_example['Original_data'] = train['KitchenQual']
me_encoder_example.head()

# 6 - Leave One Out Encoder

Leave-one-out Encoding (LOO or LOOE) is another example of a target-based encoder. The name of the method speaks for itself: for observation $i$ we calculate the mean target of category $k$ with observation $i$ removed from the dataset:

$x_i^k = \frac{\sum_{j \neq i} y_j \cdot (x_j == k)}{\sum_{j \neq i} (x_j == k)}$

While encoding the test dataset, a category is replaced with the mean target of category $k$ in the train dataset:

$x^k = \frac{\sum_j y_j \cdot (x_j == k)}{\sum_j (x_j == k)}$

One of the disadvantages of LOO, as with all other target-based encoders, is target leakage (the same problem we have with Target Encoder). With LOO the problem becomes dramatic, because for a binary target we may perfectly classify the training dataset by making a single split: the optimal threshold for category $k$ can be calculated with the following formula:

$t^k = \frac{\sum_j y_j \cdot (x_j == k) - 0.5}{\sum_j (x_j == k)}$

Another problem with LOO is a shift between the encoded values in the train and test samples. It has a strong influence if we work with tree-based models.

Looking closer, the encoding algorithm differs slightly between the training and test datasets. For the training dataset, the record under consideration is left out, hence the name Leave One Out. For validation or prediction data, we don’t need to leave the current record out, and we don’t need the optional randomness factor (random noise added to the training encoding).
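
The train/test difference is easy to see on a handful of rows. The following is a minimal sketch with hypothetical helper names and toy prices; falling back to the global mean for a category with a single row is an assumption of this sketch, and no random noise is added.

```python
def _group_stats(categories, target):
    """Collect the per-category target sum and row count."""
    sums, counts = {}, {}
    for c, t in zip(categories, target):
        sums[c] = sums.get(c, 0.0) + t
        counts[c] = counts.get(c, 0) + 1
    return sums, counts


def loo_encode_train(categories, target):
    """Train rows: category mean computed without the row's own target."""
    sums, counts = _group_stats(categories, target)
    prior = sum(target) / len(target)
    encoded = []
    for c, t in zip(categories, target):
        if counts[c] > 1:
            encoded.append((sums[c] - t) / (counts[c] - 1))
        else:
            encoded.append(prior)  # assumption: singleton categories use the global mean
    return encoded


def loo_encode_test(categories, target):
    """Test rows: plain category mean from the train data."""
    sums, counts = _group_stats(categories, target)
    return {c: sums[c] / counts[c] for c in counts}


finish = ["Fin", "Fin", "Fin", "Unf", "Unf"]   # toy categories
price = [300.0, 200.0, 250.0, 100.0, 140.0]    # toy target

train_values = loo_encode_train(finish, price)
test_values = loo_encode_test(finish, price)
print(train_values)
print(test_values)
```

Each train row of the same category gets a different value, while the test set sees a single value per category. That gap is the train/test shift mentioned above.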

In [None]:
%%time
LOOE_encoder = LeaveOneOutEncoder()
train_looe = LOOE_encoder.fit_transform(train['GarageFinish'], train['SalePrice'])
test_looe = LOOE_encoder.transform(test['GarageFinish'])

In [None]:
# see for train data
loo_encoder_example = train_looe.rename(columns={'GarageFinish': 'LOO_encoder_data'})
loo_encoder_example['Original_data'] = train['GarageFinish']
loo_encoder_example.head()

# 7 - Helmert Encoder

Helmert coding is another commonly used type of categorical encoding for regression.

It compares each level of a categorical variable to the mean of the subsequent levels.

This type of encoding can be useful in certain situations where levels of the categorical variable are ordered, say, from lowest to highest, or from smallest to largest.
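
The comparison described above can be computed directly from group means. The sketch below does only that: it is not the library's contrast matrix, the ordering of the levels is a made-up assumption, and "mean of the subsequent levels" is taken to be the average of their level means.

```python
from statistics import mean


def helmert_comparisons(ordered_levels, categories, target):
    """For each level except the last: its target mean minus the
    average of the target means of all subsequent levels."""
    groups = {level: [] for level in ordered_levels}
    for c, t in zip(categories, target):
        groups[c].append(t)
    level_means = [mean(groups[level]) for level in ordered_levels]
    return {
        level: level_means[i] - mean(level_means[i + 1:])
        for i, level in enumerate(ordered_levels[:-1])
    }


levels = ["Slab", "CBlock", "PConc"]  # hypothetical ordering, for illustration only
foundation = ["Slab", "Slab", "CBlock", "CBlock", "PConc", "PConc"]
price = [90.0, 110.0, 140.0, 160.0, 190.0, 210.0]  # toy target

comparisons = helmert_comparisons(levels, foundation, price)
print(comparisons)
```

With $k$ levels there are $k - 1$ such comparisons, which is why the encoder produces $k - 1$ contrast columns alongside the intercept.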

In [None]:
%%time
HE_encoder = HelmertEncoder(cols=['Foundation'])  # pass cols by keyword: the first positional argument is verbose
train_he = HE_encoder.fit_transform(train['Foundation'], train['SalePrice'])
test_he = HE_encoder.transform(test['Foundation'])

In [None]:
# see for train data
he_encoder_example = train_he.rename(columns={'Foundation': 'HE_encoder_data'})
he_encoder_example['Original_data'] = train['Foundation']
he_encoder_example.head()

# 8 - CatBoost Encoder

CatBoost Encoder is a recently created target-based categorical encoder. It is intended to overcome the target leakage problems inherent in LOO. To do that, the authors of CatBoost introduced the idea of “time”: the order of observations in the dataset. The value of the target statistic for each example relies only on the observed history. To calculate the statistic for observation $i$ in the train dataset, we may use only the targets of observations collected before it, i.e. those with $j < i$. The formula below sums over $j \le i$ and then subtracts $y_i$, which removes the observation's own target:

$x_i^k = \frac{\sum_{j=0}^{j \le i} y_j \cdot (x_j == k) - y_i + prior}{\sum_{j=0}^{j \le i} (x_j == k)}$


To prevent overfitting, the target encoding of the train dataset is repeated several times on shuffled versions of the dataset and the results are averaged. Encoded values of the test data are calculated the same way as in LOO Encoder:

$x^k = \frac{\sum_j y_j \cdot (x_j == k) + prior}{\sum_j (x_j == k)}$
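
A single pass over the rows is enough to illustrate the ordered statistic for the train data. This sketch assumes `prior` is the global target mean and does not do the shuffling and averaging described above; the function name and toy prices are invented.

```python
def catboost_encode_train(categories, target, prior):
    """Ordered target statistic: each row only sees targets of earlier rows."""
    sums, counts = {}, {}
    encoded = []
    for c, t in zip(categories, target):
        past_sum = sums.get(c, 0.0)   # targets of earlier rows in this category
        past_count = counts.get(c, 0)
        # Same as (sum over j <= i, minus y_i, plus prior) / (count over j <= i).
        encoded.append((past_sum + prior) / (past_count + 1))
        # Only now does the current row become part of the history.
        sums[c] = past_sum + t
        counts[c] = past_count + 1
    return encoded


hood = ["A", "B", "A", "A", "B"]                # toy categories
price = [100.0, 200.0, 300.0, 200.0, 400.0]     # toy target
prior = sum(price) / len(price)                 # 240.0

ordered = catboost_encode_train(hood, price, prior)
print(ordered)
```

The first row of each category has no history and receives the prior. Because the result depends on row order, averaging over several shuffles makes it more stable.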




In [None]:
%%time
CB_encoder = CatBoostEncoder()
train_cb = CB_encoder.fit_transform(train['Neighborhood'], train['SalePrice'])
test_cb = CB_encoder.transform(test['Neighborhood'])

In [None]:
# see for train data
cb_encoder_example = train_cb.rename(columns={'Neighborhood': 'CB_encoder_data'})
cb_encoder_example['Original_data'] = train['Neighborhood']
cb_encoder_example.head()

# 9 - James-Stein Encoder

James-Stein Encoder is a target-based encoder.

The idea behind James-Stein Encoder is simple. The mean target for category $k$ is estimated according to the following formula:

$x^k = (1 - B) \cdot \frac{n^+}{n} + B \cdot \frac{y^+}{y}$

The encoding aims to improve the estimate of the category’s mean target (the first term of the sum) by shrinking it towards a more central average (the second term). The only hyperparameter in the formula is $B$, the strength of shrinking. It can be understood as the strength of regularization: larger values of $B$ give more weight to the global mean (underfit), while smaller values give more weight to the category mean (overfit).
One way to select $B$ is to tune it like a hyperparameter via cross-validation, but Charles Stein came up with another solution to the problem:

$B = \frac{Var[y^k]}{Var[y^k] + Var[y]}$
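
Taking the two formulas literally, the shrinkage can be sketched as follows. The assumptions are that $n^+ / n$ is the category's target mean, $y^+ / y$ is the global target mean, and both variances are population variances of the raw targets. The library's estimator may differ in detail, and the names and toy prices are made up.

```python
from statistics import mean, pvariance


def james_stein_encode(categories, target):
    """Shrink each category mean towards the global mean with weight B."""
    global_mean = mean(target)
    global_var = pvariance(target)
    groups = {}
    for c, t in zip(categories, target):
        groups.setdefault(c, []).append(t)

    encoding, weights = {}, {}
    for c, values in groups.items():
        group_var = pvariance(values)
        total = group_var + global_var
        b = group_var / total if total > 0 else 0.0  # guard: all targets identical
        weights[c] = b
        encoding[c] = (1 - b) * mean(values) + b * global_mean
    return encoding, weights


condition = ["Normal", "Normal", "Normal", "Abnorml", "Abnorml"]  # toy categories
price = [200.0, 210.0, 190.0, 100.0, 200.0]                       # toy target, mean 180

js_values, js_weights = james_stein_encode(condition, price)
print(js_values)
print(js_weights)
```

The noisier category ("Abnorml", targets 100 and 200) gets a larger $B$ and is pulled further towards the global mean than the tightly clustered "Normal" category.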

In [None]:
%%time
JS_encoder = JamesSteinEncoder()
train_js = JS_encoder.fit_transform(train['SaleCondition'], train['SalePrice'])
test_js = JS_encoder.transform(test['SaleCondition'])

In [None]:
# see for train data
js_encoder_example = train_js.rename(columns={'SaleCondition': 'JS_encoder_data'})
js_encoder_example['Original_data'] = train['SaleCondition']
js_encoder_example.head()

# 10 - OneHotEncoder

One Hot Encoding is another simple way to work with categorical columns. It takes a categorical column that has been label encoded and splits it into multiple columns, one per category. The numbers are replaced by 1s and 0s, depending on which column corresponds to the row's value.
OHE expands the size of your dataset, which makes it a memory-inefficient encoder. There are several strategies to overcome the memory problem with OHE, one of which is working with a sparse rather than dense data representation.
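
A small pure-Python sketch shows both the dense layout and the idea behind a sparse one. The helper `one_hot` and the toy values are hypothetical; real sparse formats store more bookkeeping than the single index per row used here.

```python
def one_hot(values):
    """Return the column order, the dense 0/1 rows, and a sparse form
    that stores only the position of the single 1 in each row."""
    levels = sorted(set(values))
    position = {level: i for i, level in enumerate(levels)}
    dense = [[1 if v == level else 0 for level in levels] for v in values]
    sparse = [position[v] for v in values]
    return levels, dense, sparse


roof = ["Gable", "Hip", "Gable", "Flat"]  # toy values
levels, dense, sparse = one_hot(roof)
print(levels)
print(dense)
print(sparse)
```

The dense form needs one cell per row and category, while the sparse form needs one number per row. That difference is what makes sparse storage attractive for high-cardinality columns.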

In [None]:
%%time
# using category_encoders here; the most common approach is pandas get_dummies
OHE_encoder = OneHotEncoder(cols=['RoofStyle'])  # pass cols by keyword: the first positional argument is verbose
train_ohe = OHE_encoder.fit_transform(train['RoofStyle'])
test_ohe = OHE_encoder.transform(test['RoofStyle'])

In [None]:
# see for train data
oh_encoder_example = train_ohe.rename(columns={'RoofStyle': 'OH_encoder_data'})
oh_encoder_example['Original_data'] = train['RoofStyle']
oh_encoder_example.head()