A common technique for handling categorical features in non-deep machine learning models is one-hot encoding. (For deep learning, they can instead be embedded in a vector space, as in **Factorization Machines**.)

However, one-hot encoding has some drawbacks in practice. If the number of categories is large, it creates a lot of sparse features, which can confuse the machine learning models. For tree-based models, a common technique to reduce overfitting is to use a proper subset of the features for each tree or level. With too many sparse features, such a subset may contain very little useful information.
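As a rough illustration of the sparsity problem, here is a minimal sketch on a made-up column (the frame `toy` and the column name `city` are hypothetical, not part of the dataset used later). It one-hot encodes 1,000 distinct categories and measures how much of the resulting matrix is nonzero.

```python
import pandas as pd

# Hypothetical toy column with 1,000 distinct categories, one row each.
toy = pd.DataFrame({"city": [f"city_{i}" for i in range(1000)]})

# One-hot encoding creates one indicator column per category.
one_hot = pd.get_dummies(toy["city"])

n_rows, n_cols = one_hot.shape
# Fraction of nonzero entries in the encoded matrix.
density = one_hot.to_numpy().mean()
print(n_rows, n_cols, density)
```

A single categorical column turns into as many columns as it has categories, and almost every entry is zero.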

Another popular encoding is **mean encoding**, i.e. encoding a category by its target mean. This can still cause trouble: categories with the same or very close target means get mixed together, but we still want to differentiate them, as they might have very different interactions with other features. After all, the main goal of encoding is to encode, not to provide correlation with the target. Decision trees can find the correlations on their own during training.
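The collision can be seen in a minimal sketch on made-up data (the frame `toy` and its columns `color` and `target` are hypothetical). Here `red` and `blue` have the same target mean, so mean encoding maps them to the same number.

```python
import pandas as pd

# Hypothetical toy data: "red" and "blue" both have target mean 0.5.
toy = pd.DataFrame({
    "color":  ["red", "red", "blue", "blue", "green", "green"],
    "target": [1, 0, 0, 1, 1, 1],
})

# Mean encoding: replace each category by the mean of the target within it.
target_mean = toy.groupby("color")["target"].mean()
mean_encoded = toy["color"].map(target_mean)

# Three categories go in, but only two distinct codes come out.
print(toy["color"].nunique(), mean_encoded.nunique())
```

After this encoding, no split on the encoded column can separate `red` from `blue`.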

Therefore I chose to encode all categorical features by the **rank** of their target mean, which is also a very nice application of pandas. In case of a tie, we just break the tie arbitrarily, so tied categories still get distinct codes.

In [1]:
from tqdm import tnrange, tqdm_notebook, tqdm
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
plt.ion()
import warnings; warnings.simplefilter('ignore')
tqdm.pandas(tqdm_notebook)
from scipy.stats import skew, skewtest
from scipy.special import boxcox1p

Using pandas' `groupby` method, we can conveniently realize the encoding. Here we will treat `nan` values just as another category. Since `groupby` drops `nan` keys by default, we first replace `nan` by an impossible value, here for example `-1000000`. Then we calculate the target mean of each group and sort the means, breaking ties arbitrarily.

Now to get the integer ranking, we use the `reset_index()` method, which moves the category names into a column and gives the sorted groups a fresh integer index 0, 1, 2, ... that serves as the rank. Finally we replace the category names by the rankings, and return the encoding information.
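To make these steps concrete, here is a minimal sketch on a made-up frame. The frame `toy`, its columns `color` and `target`, and the name `SENTINEL` are assumptions for illustration only. The cast to `object` is a precaution so that an integer sentinel can be stored next to string labels.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with missing categories.
toy = pd.DataFrame({
    "color":  ["red", "blue", np.nan, "blue", "red", np.nan],
    "target": [1, 0, 1, 0, 0, 1],
})

SENTINEL = -1000000  # stands in for NaN so that groupby keeps the group

# Step 1: replace NaN by the sentinel.
keys = toy["color"].astype(object).fillna(SENTINEL)

# Step 2: target mean per category, sorted ascending.
means = toy["target"].groupby(keys).mean().sort_values()

# Step 3: reset_index() moves the categories into a column and
# numbers the sorted groups 0, 1, 2, ...
prob = means.reset_index()

# Step 4: category -> rank lookup, then encode the column.
ranks = dict(zip(prob["color"], prob.index))
encoded = keys.map(ranks)
print(ranks)             # {'blue': 0, 'red': 1, -1000000: 2}
print(encoded.tolist())  # [1, 0, 2, 0, 1, 2]
```

The missing values end up as an ordinary category with its own rank, here the highest one because their target mean is the largest.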

In [2]:
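def rank_encoding(df, col, tar):
    """Encode df[col] in place by the rank of each category's target mean; return the mapping."""
    # groupby drops NaN keys by default, so group on a copy with NaN replaced by a sentinel.
    # The cast to object lets the integer sentinel sit next to string labels.
    keys = df[col].astype(object).fillna(-1000000)
    # Mean target per category, sorted ascending; reset_index() numbers the groups 0, 1, 2, ...
    prob = df[tar].groupby(keys).mean().sort_values().reset_index()

    # Sentinel-keyed lookup used to encode the column.
    ranks = {key: int(ind) for ind, key in zip(prob.index, prob[col])}
    df[col] = keys.map(ranks)
    # In the returned mapping, missing values are keyed by NaN rather than the sentinel.
    coding = {(np.nan if key == -1000000 else key): rank for key, rank in ranks.items()}
    return coding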

In [None]:
df = pd.read_csv('train.csv', index_col = 0)

`cat_dict` stores all the encodings for the categorical features.

In [None]:
cat_dict = {}
for col in df.columns:
    if df[col].dtype == 'object':
        coding = rank_encoding(df, col, 'TARGET')
        cat_dict[col] = coding
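One natural use of `cat_dict` is to apply the same encodings to data that was not used to compute them. The following is only a sketch under my own assumptions: the helper `apply_rank_encoding`, its `unseen` parameter, and the choice of `-1` for categories that never appeared in training are all hypothetical and not part of the code above.

```python
import numpy as np
import pandas as pd

def apply_rank_encoding(df, cat_dict, unseen=-1):
    """Hypothetical helper: encode df in place with previously learned mappings."""
    for col, coding in cat_dict.items():
        # Handle the NaN entry separately so nothing depends on NaN dictionary keys.
        nan_rank = next((r for k, r in coding.items() if pd.isna(k)), unseen)
        known = {k: r for k, r in coding.items() if not pd.isna(k)}

        is_missing = df[col].isna()
        encoded = df[col].map(known)                  # unseen and missing become NaN
        encoded = encoded.where(~is_missing, nan_rank)  # missing values get the NaN rank
        df[col] = encoded.fillna(unseen).astype(int)  # remaining NaN are unseen categories
    return df

# Usage on made-up data: "purple" was never seen when the mapping was built.
example_dict = {"color": {"blue": 0, "red": 1, np.nan: 2}}
new_data = pd.DataFrame({"color": ["red", np.nan, "purple", "blue"]})
apply_rank_encoding(new_data, example_dict)
print(new_data["color"].tolist())  # [1, 2, -1, 0]
```

Whether `-1` is a sensible code for unseen categories depends on the model, so treat it as a placeholder.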