Hi guys, I know there are already a lot of notebooks, blog posts, papers, and discussions on this topic, but sometimes they just don't feel like enough, or you want a more compiled set of notes. For me, this is that compiled version. If I find more material, I will of course add it to this notebook.

References are at the bottom, listing where I compiled the information from.

As of 25/11/2019, I am just going to compile the notes. I'll include the code at the next update of this notebook. :) cheers

28/11/2019: added the code :)

Here are the seven data types:
- Useless — useless for machine learning algorithms, that is — discrete
- Nominal — groups without order — discrete
- Binary — either/or — discrete
- Ordinal — groups with order — discrete
- Count — the number of occurrences — discrete
- Time — cyclical numbers with a temporal component — continuous
- Interval — positive and/or negative numbers without a temporal component — continuous

Here’s the list of Category Encoders functions with their descriptions and the type of data they would be most appropriate to encode.

Classic Encoders
- Ordinal — convert string labels to integer values 1 through k. Ordinal.
- OneHot — one column for each value to compare vs. all other values. Nominal, ordinal.
- Binary — convert each integer to binary digits. Each binary digit gets one column. Some info loss but fewer dimensions. Ordinal.
- BaseN — Ordinal, Binary, or higher encoding. Nominal, ordinal. Doesn’t add much functionality. Probably avoid.
- Hashing — Like OneHot but fewer dimensions, some info loss due to collisions. Nominal, ordinal.

Contrast Encoders
- The contrast encoders (four are listed below) all have multiple issues that I argue make them unlikely to be useful for machine learning. They all output one column for each column value. I would avoid them in most cases. Their stated intents are below.
- Helmert (reverse) — The mean of the dependent variable for a level is compared to the mean of the dependent variable over all previous levels.
- Sum — compares the mean of the dependent variable for a given level to the overall mean of the dependent variable over all the levels.
- Backward Difference — the mean of the dependent variable for a level is compared with the mean of the dependent variable for the prior level.
- Polynomial — orthogonal polynomial contrasts. The coefficients taken on by polynomial coding for k=4 levels are the linear, quadratic, and cubic trends in the categorical variable.

Bayesian Encoders
- The Bayesian encoders use information from the dependent variable in their encodings. They output one column and can work well with high cardinality data.
- Target — use the mean of the DV, must take steps to avoid overfitting/ response leakage. Nominal, ordinal. For classification tasks.
- LeaveOneOut — similar to target but avoids contamination. Nominal, ordinal. For classification tasks.
- WeightOfEvidence — added in v1.3. Not documented in the docs as of April 11, 2019. The method is explained in this post(https://www.listendata.com/2015/03/weight-of-evidence-woe-and-information.html).
- James-Stein — forthcoming in v1.4. Described in the code here (https://github.com/scikit-learn-contrib/categorical-encoding/blob/master/category_encoders/james_stein.py).
- M-estimator — forthcoming in v1.4. Described in the code here (https://github.com/scikit-learn-contrib/categorical-encoding/blob/master/category_encoders/m_estimate.py). Simplified target encoder.
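
To make the Target and LeaveOneOut ideas above a bit more concrete, here is a minimal pandas sketch on made-up data. The frame and the column names (`city`, `y`) are hypothetical, and this is not how category_encoders implements these encoders (the library also deals with smoothing and unseen categories); it only shows where the leakage comes from and how leaving the current row out removes it.

```python
import pandas as pd

# Toy data; the column names are hypothetical.
df = pd.DataFrame({
    "city": ["A", "A", "A", "B", "B", "C"],
    "y":    [1,   0,   1,   0,   0,   1],
})

grp = df.groupby("city")["y"]

# Plain target (mean) encoding: each row gets the mean of y for its category.
# The row's own label is part of that mean, which is the source of leakage.
df["city_target"] = grp.transform("mean")

# Leave-one-out: remove the current row's label from its own category mean.
n = grp.transform("count")
s = grp.transform("sum")
df["city_loo"] = (s - df["y"]) / (n - 1)  # NaN when a category has a single row

print(df)
```

Note how category `C`, which has only one row, has no leave-one-out value at all. In practice you would fall back to something like the global mean for such rows.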



Terminology

- These are the more common terms used in this context. You are free to describe them any way you want, as long as you understand them.
- k is the original number of unique values in your data column. High cardinality means a lot of unique values (a large k). A column with hundreds of zip codes is an example of a high cardinality feature.
- High dimensionality means a matrix with many dimensions. High dimensionality comes with the Curse of Dimensionality — a thorough treatment of this topic can be found here. The take away is that high dimensionality requires many observations and often results in overfitting.
- Sparse data is a matrix with lots of zeroes relative to other values. If your encoders transform your data so that it becomes sparse, some algorithms may not work well. Sparsity can often be managed by flagging it, but many algorithms don’t work well unless the data is dense.
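
As a quick illustration of these terms, the sketch below measures the cardinality k of a made-up `zip` column and the sparsity of its one-hot encoding. The data and the column name are hypothetical; the point is only that one-hot encoding a column with k values gives k columns and a matrix in which a fraction 1 - 1/k of the entries are zero.

```python
import pandas as pd

# Hypothetical toy column; in practice this would be a real feature such as a zip code.
df = pd.DataFrame({"zip": ["10001", "94105", "10001", "60601", "73301", "94105"]})

# Cardinality k: number of distinct values in the column.
k = df["zip"].nunique()

# One-hot encoding yields k columns, and every row has exactly one non-zero entry.
one_hot = pd.get_dummies(df["zip"])
n_rows, n_cols = one_hot.shape

# Sparsity: share of zero entries in the encoded matrix.
sparsity = 1 - one_hot.to_numpy().sum() / (n_rows * n_cols)

print(k, n_cols, sparsity)
```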


Okay, since this notebook is meant to focus on categorical encoding, we will stick to that. I will borrow code from some other references to quickly preprocess (clean) our data.

Data preprocessing consists of multiple steps such as: (a) Correcting (b) Completing (c) Creating (d) Converting (e) Correlating (f) Classifying

You can read more about them at these notebooks. Feel free to go through them. :) https://www.kaggle.com/ldfreeman3/a-data-science-framework-to-achieve-99-accuracy https://www.kaggle.com/startupsci/titanic-data-science-solutions

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from sklearn import feature_selection
from sklearn import model_selection
from sklearn import metrics
from sklearn.ensemble import (RandomForestClassifier, AdaBoostClassifier, 
                              GradientBoostingClassifier, ExtraTreesClassifier)
from sklearn.svm import SVC
from sklearn.model_selection import KFold
import xgboost as xgb

import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.tools as tls


# Load dataset.
train = pd.read_csv('../input/titanic/train.csv')
test = pd.read_csv('../input/titanic/test.csv')

PassengerId = test['PassengerId']

#fill NaN values in the age column with the mean of that column
train['Age'] = train['Age'].fillna(train['Age'].mean())
#fill test NaN values in the age column with the mean from the train set
test['Age'] = test['Age'].fillna(train['Age'].mean())

#fill NaN values in the embarked column with the mode of that column
train['Embarked'] = train['Embarked'].fillna(train['Embarked'].mode()[0])
#fill test NaN values in the embarked column with the mode from the train set
test['Embarked'] = test['Embarked'].fillna(train['Embarked'].mode()[0])

#fill NaN values in the fare column with the median of that column
train['Fare'] = train['Fare'].fillna(train['Fare'].median())
#fill test NaN values in the fare column with the median from the train set
test['Fare'] = test['Fare'].fillna(train['Fare'].median())

#delete the cabin feature/column and others 
drop_column = ['PassengerId','Cabin', 'Ticket']
train.drop(drop_column, axis=1, inplace = True)
test.drop(drop_column, axis=1, inplace = True)

#create a new column which is the combination of the sibsp and parch column
train['FamilySize'] = train ['SibSp'] + train['Parch'] + 1
test['FamilySize'] = test ['SibSp'] + test['Parch'] + 1

#create a new column and initialize it with 1
train['IsAlone'] = 1 #initialize to yes/1 is alone
train.loc[train['FamilySize'] > 1, 'IsAlone'] = 0 # now update to no/0 if family size is greater than 1
test['IsAlone'] = 1 #initialize to yes/1 is alone
test.loc[test['FamilySize'] > 1, 'IsAlone'] = 0 # now update to no/0 if family size is greater than 1

#quick and dirty code split title from the name column
train['Title'] = train['Name'].str.split(", ", expand=True)[1].str.split(".", expand=True)[0]
test['Title'] = test['Name'].str.split(", ", expand=True)[1].str.split(".", expand=True)[0]

#Continuous variable bins; qcut vs cut: https://stackoverflow.com/questions/30211923/what-is-the-difference-between-pandas-qcut-and-pandas-cut
#Fare Bins/Buckets using qcut or frequency bins: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.qcut.html
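train['FareBin'], fare_bins = pd.qcut(train['Fare'], 4, retbins=True)
#bin the test fares with the bin edges learned from the train set (the original line binned train['Fare'] again by mistake)
test['FareBin'] = pd.cut(test['Fare'], bins=fare_bins, include_lowest=True)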

#alternatively, you can split them yourselves based on the bins you prefer, and you can do the same for the age too
#     #Mapping Fare
#     dataset.loc[ dataset['Fare'] <= 7.91, 'Fare'] 						        = 0
#     dataset.loc[(dataset['Fare'] > 7.91) & (dataset['Fare'] <= 14.454), 'Fare'] = 1
#     dataset.loc[(dataset['Fare'] > 14.454) & (dataset['Fare'] <= 31), 'Fare']   = 2
#     dataset.loc[ dataset['Fare'] > 31, 'Fare'] 							        = 3
#     # Mapping Age
#     dataset.loc[ dataset['Age'] <= 16, 'Age'] 					       = 0
#     dataset.loc[(dataset['Age'] > 16) & (dataset['Age'] <= 32), 'Age'] = 1
#     dataset.loc[(dataset['Age'] > 32) & (dataset['Age'] <= 48), 'Age'] = 2
#     dataset.loc[(dataset['Age'] > 48) & (dataset['Age'] <= 64), 'Age'] = 3
#     dataset.loc[ dataset['Age'] > 64, 'Age'] = 4 ;

#Age Bins/Buckets using cut or value bins: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.cut.html
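train['AgeBin'], age_bins = pd.cut(train['Age'].astype(int), 5, retbins=True)
#bin the test ages with the bin edges learned from the train set (the original line binned train['Age'] again by mistake)
test['AgeBin'] = pd.cut(test['Age'].astype(int), bins=age_bins)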

#so create stat_min and any titles less than 10 will be put into Misc category
stat_min = 10 #while small is arbitrary, we'll use the common minimum in statistics: http://nicholasjjackson.com/2012/03/08/sample-size-is-10-a-magic-number/
title_names = (train['Title'].value_counts() < stat_min) #this will create a true false series with title name as index
title_names_test = (test['Title'].value_counts() < stat_min)

#apply and lambda functions are quick and dirty code to find and replace with fewer lines of code: https://community.modeanalytics.com/python/tutorial/pandas-groupby-and-python-lambda-functions/
train['Title'] = train['Title'].apply(lambda x: 'Misc' if title_names.loc[x] == True else x)
test['Title'] = test['Title'].apply(lambda x: 'Misc' if title_names_test.loc[x] == True else x)

In [None]:
train.tail()

The most common encoding methods are one-hot and ordinal, and they are the ones you see in most posts. Nevertheless, I will describe a few more here, along with where their use cases might be.

-	Ordinal
    - If your column values are truly ordinal, that means that the integer assigned to each value is meaningful. Assignment should be done with intention. Say your column had the string values “First”, “Third”, and “Second” in it. Those values should be mapped to 1, 3, and 2 respectively, not to whatever integers an automatic encoder happens to assign.
    - LabelEncoder can turn [dog,cat,dog,mouse,cat] into [1,0,1,2,0] (it assigns codes in alphabetical order, starting at 0), but then the imposed ordinality means that the average of cat and mouse is dog. Still, there are algorithms like decision trees and random forests that can work with categorical variables just fine, and LabelEncoder can be used to store values using less disk space.
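
For the “First”, “Third”, “Second” case, one simple way to assign the integers with intention is an explicit mapping dict. This is only a sketch: the series and the `rank_map` name are made up for illustration, and any label missing from the dict would come out as NaN.

```python
import pandas as pd

# Hypothetical column holding ordered labels in arbitrary row order.
s = pd.Series(["First", "Third", "Second", "First"])

# Explicit mapping chosen on purpose, so the integers carry the intended order.
rank_map = {"First": 1, "Second": 2, "Third": 3}
encoded = s.map(rank_map)

print(encoded.tolist())
```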

In [None]:
#This is to show the codes for before and after
train_enc1 = train.copy()
train_enc2 = train.copy()
train_enc3 = train.copy()
train_enc4 = train.copy()
train_enc5 = train.copy()
train_enc1.head()
# train_enc2.info()

In [None]:
#There are a few ways to do this, so I will demonstrate for each of them.
from sklearn.preprocessing import LabelEncoder

#1st method
label = LabelEncoder()  
train_enc1['Sex_Code'] = label.fit_transform(train_enc1['Sex'])
#train_enc1.head() #now look at the Sex_code column that is created

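#2nd method
#note: LabelEncoder above assigns codes alphabetically (female=0, male=1), the opposite of this manual mapping
train_enc1['Sex'] = train_enc1['Sex'].map({'male': 0, 'female': 1}) #the Sex column is replaced.
train_enc1.head() #now look at the sex column
# train['Embarked_Code'] = label.fit_transform(train['Embarked'])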

#of course there might be other methods that do such similar things, you are free to choose :)

- OneHot
    - The one-hot encoder creates one column for each value to compare against all other values. For each new column, a row gets a 1 if the row contained that column’s value and a 0 if it did not.
    - One-hot encoding can perform very well, but the number of new features is equal to k, the number of unique values. This feature expansion can create serious memory problems if your data set has high cardinality features. One-hot-encoded data can also be difficult for decision-tree-based algorithms — see discussion here(https://roamanalytics.com/2016/10/28/are-categorical-variables-getting-lost-in-your-random-forests/).
    - In another context/explanation, One-Hot-Encoding has the advantage that the result is binary rather than ordinal and that everything sits in an orthogonal vector space. The disadvantage is as above: for high cardinality, the feature space can blow up quickly and you start fighting with the curse of dimensionality. In these cases, I typically employ one-hot-encoding followed by PCA for dimensionality reduction. I find that the judicious combination of one-hot plus PCA can seldom be beaten by other encoding schemes. PCA finds the linear overlap, so it will naturally tend to group similar features into the same feature. (more on this in future updates)
    - For example, your dataset has the feature "Temperature", which can be Cool, Mild, or Hot. In this case, the relationship Cool < Mild < Hot is semantically meaningful, so you can probably get away with ordinal encoding if you want.
    - However, you still have to ensure that the ordinal relationship is preserved in your encoding, and scikit-learn's LabelEncoder does not take care of this for you. To illustrate, look at how "Temperature" is being encoded:
    - Hot => 1
    - Mild => 2
    - Cool => 0
    - That's no good! You want Hot to have the highest value.
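
One way to keep the Cool < Mild < Hot order is to spell the order out yourself. The sketch below uses a pandas ordered Categorical instead of LabelEncoder (a swapped-in technique, not something LabelEncoder offers); the `temps` series is made up for illustration.

```python
import pandas as pd

temps = pd.Series(["Hot", "Mild", "Cool", "Mild", "Hot"])

# Codes assigned by alphabetical order of the labels, which reproduces the
# Hot => 1, Mild => 2, Cool => 0 mapping shown above.
alphabetical = temps.astype("category").cat.codes

# Ordered categorical with the levels listed from lowest to highest.
ordered = pd.Categorical(temps, categories=["Cool", "Mild", "Hot"], ordered=True)
codes = pd.Series(ordered.codes)

print(alphabetical.tolist())
print(codes.tolist())
```

With the explicit category list, Hot gets the highest code and Cool the lowest, regardless of how the labels sort alphabetically.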

In [None]:
#There are a few ways to do this, so I will demonstrate for each of them.
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
import category_encoders as ce
from sklearn.compose import ColumnTransformer

#1st method
oneHot = OneHotEncoder(handle_unknown='ignore') #sklearn's encoder, created here but not used below
ce_ohe = ce.OneHotEncoder(cols = ['Sex'])
train_enc2 = ce_ohe.fit_transform(train_enc2)
print(ce_ohe) #prints the fitted encoder; check train_enc2.head() to see that Sex_1 and Sex_2 replace the original Sex column
#based on the docs, there doesn't seem to be a param for dropping one column to prevent the dummy variable trap, so you might need to remove it manually

#2nd method
# ct = ColumnTransformer(
#     [('oh_enc', OneHotEncoder(sparse=False), [1]),],  # the column numbers I want to apply this to
#     remainder='passthrough'  # This leaves the rest of my columns in place
# )
# ct2 = ct.fit_transform(train_enc2)
# # df = pd.DataFrame(ct2)
# # df.head()
# print(ct.get_feature_names_out())  # this was get_feature_names() in older scikit-learn versions
#using ColumnTransformer returns an array, so the column names are lost; Idk if they will change it in the near future or not.
#you can use the feature-name method above to get the column names, but it still needs extra steps, refer to the link below:
#https://stackoverflow.com/questions/54646709/sklearn-pipeline-get-feature-name-after-onehotencode-in-columntransformer

#2.1 method: using sklearn's OneHotEncoder is slightly more complicated, you can refer more at the link below:
#https://stackoverflow.com/questions/43588679/issue-with-onehotencoder-for-categorical-features
#IMO, 2nd and 2.1 method is not that friendly, so why bother with extra headaches when the others seem to be able to do the same thing?

#3rd method
train_enc2 = pd.get_dummies(train_enc2, columns=['Embarked'])
train_enc2.head()#as you can see 3 new columns are created (Embarked_C, Embarked_Q and so on),you can use the drop_first param to remove one column to prevent the Dummy Variable trap

#of course there might be other methods that do such similar things, you are free to choose :)

-	Binary
    - Binary can be thought of as a hybrid of one-hot and hashing encoders. Binary creates fewer features than one-hot, while preserving some uniqueness of values in the column. It can work well with higher dimensionality ordinal data.
    - Here’s how it works:
        - The categories are encoded by OrdinalEncoder if they aren’t already in numeric form.
        - Then those integers are converted into binary code, so for example 5 becomes 101 and 10 becomes 1010
        - Then the digits from that binary string are split into separate columns. So if there are 4–7 values in an ordinal column then 3 new columns are created: one for the first bit, one for the second, and one for the third.
    - Each observation is encoded across the columns in its binary form.
    #- The first column has no variance, so it isn’t doing anything to help the model. (code to be added)
    - With only three levels, the information embedded becomes muddled. There are many collisions and the model can’t glean much information from the features. Just one-hot encode a column if it only has a few values.
    - In contrast, binary really shines when the cardinality of the column is higher — with the 50 US states, for example.
    - Binary encoding creates fewer columns than one-hot encoding. It is more memory efficient. It also reduces the chances of dimensionality problems with higher cardinality.
    - Most similar values overlap with each other across many of the new columns. This allows many machine learning algorithms to learn the similarity between values. Binary encoding is a decent compromise for ordinal data with high cardinality.
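
The three steps described above can be sketched in plain Python. This is an illustration of the idea, not the category_encoders implementation; `binary_encode` is a hypothetical helper, and starting the integer codes at 1 is an assumption chosen so that 4 to 7 distinct values give 3 columns, as stated above.

```python
def binary_encode(values):
    """Map each distinct value to an integer (starting at 1), then to a list of bits."""
    # Step 1: ordinal codes in order of first appearance.
    codes = {}
    for v in values:
        codes.setdefault(v, len(codes) + 1)
    # Step 2: number of bits needed to write the largest code in base 2.
    n_bits = max(codes.values()).bit_length()
    # Step 3: one output column per bit, most significant bit first.
    return [[(codes[v] >> shift) & 1 for shift in range(n_bits - 1, -1, -1)]
            for v in values]

ports = ["S", "C", "S", "Q", "C"]
rows = binary_encode(ports)
for value, bits in zip(ports, rows):
    print(value, bits)
```

Three distinct values need only two bit columns here, while one-hot would need three; the gap widens quickly as the cardinality grows.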

In [None]:
import category_encoders as ce
ce_bin = ce.BinaryEncoder(cols = ['Embarked'])
train_enc3 = ce_bin.fit_transform(train_enc3)
train_enc3.head()

-	BaseN
    - When the BaseN base = 1 it is basically the same as one hot encoding. When base = 2 it is basically the same as binary encoding. McGinnis said, “Practically, this adds very little new functionality, rarely do people use base-3 or base-8 or any base other than ordinal or binary in real problems.”
    - The main reason for its existence is to possibly make grid searching easier. You could use BaseN with GridSearchCV. However, if you’re going to grid search with some of these encoding options, you’re going to make that search part of your workflow anyway. I don’t see a compelling reason to use BaseN. If you do, please share in the comments.

In [None]:
import category_encoders as ce
ce_basen = ce.BaseNEncoder(cols = ['Embarked']) #the default base is 2, i.e. the same as binary encoding; pass base=... to change it
train_enc4 = ce_basen.fit_transform(train_enc4)
train_enc4.head()

- Hashing
    - HashingEncoder implements the hashing trick. It is similar to one-hot encoding but with fewer new dimensions and some info loss due to collisions. The collisions do not significantly affect performance unless there is a great deal of overlap. An excellent discussion of the hashing trick and guidelines for selecting the number of output features can be found here.
    - You can pass a hashing algorithm of your choice to HashingEncoder; the default is md5. Hashing algorithms have been very successful in some Kaggle competitions. It’s worth trying HashingEncoder for nominal and ordinal data if you have high cardinality features.
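
To show what the hashing trick does under the hood, here is a small standard-library sketch. It is not the HashingEncoder implementation: `hash_encode` is a hypothetical helper, and the 8 output columns are an arbitrary choice for the example. Each value is hashed with md5 and the digest picks which column gets the 1, so two different values can land in the same column (a collision).

```python
import hashlib

def hash_encode(value, n_components=8):
    """Return a one-hot-like row whose active column is chosen by hashing the value."""
    digest = hashlib.md5(str(value).encode("utf-8")).hexdigest()
    index = int(digest, 16) % n_components  # fixed number of columns, whatever k is
    row = [0] * n_components
    row[index] = 1
    return row

values = ["S", "C", "Q", "S"]
encoded = [hash_encode(v) for v in values]
for value, row in zip(values, encoded):
    print(value, row)
```

The number of columns stays fixed no matter how many distinct values show up, which is why this scales to high cardinality features, at the cost of collisions.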



In [None]:
import category_encoders as ce
ce_hash = ce.HashingEncoder(cols = ['Embarked'])
train_enc5 = ce_hash.fit_transform(train_enc5)
train_enc5.head()

Specific case for Tree based methods:
-	When a categorical feature is ordinal, label encoding can lead to better quality if it preserves the correct order of values. In this case a split made by a tree will divide the feature into values 'lower' and 'higher' than the value chosen for this split.


Conclusion:
- For nominal columns try OneHot, Hashing, LeaveOneOut, and Target encoding. Avoid OneHot for high cardinality columns and decision tree-based algorithms.
- For ordinal columns try Ordinal (Integer), Binary, OneHot, LeaveOneOut, and Target. Helmert, Sum, BackwardDifference and Polynomial are less likely to be helpful, but if you have time or theoretic reason you might want to try them.
- The Bayesian encoders can work well for some machine learning tasks. For example, Owen Zhang used the leave one out encoding method to perform well in a Kaggle classification challenge (https://www.slideshare.net/OwenZhang2/tips-for-data-science-competitions).
- IMO, if your dataset is small, try more than one encoding and compare them. This might introduce a bit of overfitting, though, as you are constantly making decisions based on the validation set, and YES, there is such a thing as overfitting to the validation set; for more info please google :) Overfitting is a very big topic, and will be discussed elsewhere.


References:
- *https://towardsdatascience.com/smarter-ways-to-encode-categorical-data-for-machine-learning-part-1-of-3-6dca2f71b159*
- *https://datascience.stackexchange.com/questions/9443/when-to-use-one-hot-encoding-vs-labelencoder-vs-dictvectorizor*
- *https://towardsdatascience.com/choosing-the-right-encoding-method-label-vs-onehot-encoder-a4434493149b*
- *https://datascience.stackexchange.com/questions/9777/one-hot-vector-representation-vs-label-encoding-for-categorical-variables*
- *https://datascience.stackexchange.com/questions/61048/should-we-use-one-hot-encoder-class-in-data-having-2-as-maximum-numeric-represen*
- *https://hackernoon.com/what-is-one-hot-encoding-why-and-when-do-you-have-to-use-it-e3c6186d008f*