# Dealing with Categorical Variables

1. Binary, Ordinal, and Nominal Variables 
2. Binary Encoding
3. Ordinal Feature Encoding
4. Nominal Features
5. Other Encoding Techniques

## 1. Binary, Ordinal, and Nominal Variables

Categorical data is a common type of non-numerical data that contains label values rather than numbers. Some examples include:
- Colors: Red, Green, Blue
- Cities: New York, Austin, Denver
- Gender: Male, Female
- Place: First, Second, Third

According to Wikipedia, “a categorical variable is a variable that can take on one of a limited, and usually fixed number of possible values.”

It is common to refer to a possible value of a categorical variable as a level.

There are several different types of categorical data, including:
- Binary: A variable that has only 2 values. For example, True/False or Yes/No.
- Ordinal: A variable that has some order associated with it, like our place example above.
- Nominal: A variable whose values have no inherent order, for example color or city.

Many machine learning algorithms cannot work with categorical data directly. They require data to be numeric. Therefore, it is essential to know how to encode categorical variables.
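
To make the idea of levels and of ordered versus unordered categories concrete, here is a minimal sketch using the pandas categorical dtype. The series names `color` and `place` are illustrative and not part of the dataset used later.

```python
import pandas as pd

# Nominal: the levels have no inherent order
color = pd.Series(['Red', 'Green', 'Blue', 'Green'], dtype='category')

# Ordinal: declare the order of the levels explicitly
place = pd.Series(
    pd.Categorical(
        ['Second', 'First', 'Third', 'First'],
        categories=['First', 'Second', 'Third'],
        ordered=True,
    )
)

print(list(color.cat.categories))  # the levels of the nominal variable
print(place.cat.ordered)           # whether the levels are ordered
print(place.min())                 # comparisons respect the declared order
```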

## 2. Binary Encoding

Binary features are those with only two possible values.

In [None]:
# Load libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os 

%matplotlib inline

# Display all the columns of the dataframe
pd.set_option('display.max_columns', None)

In [None]:
# Create dataframe
df = pd.DataFrame([
    ['green', 'M',   10.1, 'class1'],
    ['red',   'L',   13.5, 'class2'],
    ['blue',  'XL',  15.3, 'class1'],
    ['black', 'XXL', 17.1, 'class2'],
    ['grey',  np.nan, 19.1, 'class1'],
    ])

#Create column names
df.columns = ['Color','Size','Price','ClassLabel']

# See dataframe
df

In [None]:
# Because ClassLabel is a binary feature, we can use pandas' replace() to encode it
# Here we pass a dictionary to replace() with the current value as the key and the desired value as the value

df['ClassLabel_Encoded'] = df['ClassLabel'].replace({'class1':0, 'class2':1})

df
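
For comparison, here is a minimal sketch of two other common ways to encode a binary feature in pandas. The series `labels` is a stand-alone copy of the `ClassLabel` values, created here only for illustration.

```python
import pandas as pd

labels = pd.Series(['class1', 'class2', 'class1', 'class2', 'class1'])

# Option 1: map() with an explicit dictionary; values missing from the
# dictionary become NaN, which makes unexpected labels easy to spot
encoded_map = labels.map({'class1': 0, 'class2': 1})

# Option 2: a boolean comparison cast to integers
encoded_cmp = (labels == 'class2').astype(int)

print(encoded_map.tolist())
print(encoded_cmp.tolist())
```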

Using ```replace()``` is very helpful for binary features, but what if we have categorical features with more categories?

## 3. Ordinal Feature Encoding

Ordinal features are those with some order associated with them. In our sample dataframe above, the `Size` feature is ordinal: its values have an order that may be important.

The machine learning model may be able to use the order information to make better predictions and we want to preserve it.

In [None]:
# See our dataframe
df

For ordinal features, we use integer encoding. To integer encode our data we simply convert labels to integer values.

While there are many methods for integer encoding, we will discuss two here:
- scikit-learn’s LabelEncoder()
- pandas’ map()

We can label encode data with scikit-learn’s LabelEncoder():

In [None]:
# import library
from sklearn.preprocessing import LabelEncoder

# Copy data set
ordinal_features = df.copy()

# LabelEncoder can't handle missing values, so fill them with a placeholder first
ordinal_features['Size'] = ordinal_features['Size'].fillna('None')

# Label encode Size feature
label_encoder = LabelEncoder()

ordinal_features['Size_Encoded'] = label_encoder.fit_transform(ordinal_features['Size'])

# Print sample of dataset
ordinal_features

Above we see the encoded feature ```Size```. We can see the values have been encoded according to alphabetical order (L-0, M-1, None-2, XL-3, XXL-4).
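
One way to check which integer was assigned to which label is the fitted encoder's `classes_` attribute. The sketch below refits an encoder on a stand-alone list of the same size labels; the names `sizes`, `encoder` and `mapping` are illustrative.

```python
from sklearn.preprocessing import LabelEncoder

# 'None' stands in for the missing value, as in the cell above
sizes = ['M', 'L', 'XL', 'XXL', 'None']

encoder = LabelEncoder()
codes = encoder.fit_transform(sizes)

# classes_ holds the sorted labels; a label's position is its integer code
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}
print(mapping)
```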

While using ```LabelEncoder()``` is very quick and easy, it may not be the best choice here: the order of our encoding is not right, since L is encoded below M and the placeholder for the missing value lands in the middle. Also, we had to handle our null values before being able to use it.

Another downside to ```LabelEncoder()``` is the fact that the documentation states it should be used for encoding target values (y) and not for the inputs (x). Let’s explore a different method for encoding our ordinal features.

Another option here is to use ```map()```.

pandas’ ```map()``` substitutes each value with another specified value, similar to the ```replace()``` method that we used above. Here we create a dictionary with our desired mapping and apply the mapping to our series:

In [None]:
# Create dictionary of ordinal to integer mapping
size_map = {
            np.nan : 0,
            'M'    : 1, 
            'L'    : 2, 
            'XL'   : 3, 
            'XXL'  : 4
        }

# Apply using map
df['Size_Encoded'] = df['Size'].map(size_map)

# See our data
df

Using ```map()``` allowed us to specify the order of the values in our categorical feature to ensure they are in a meaningful arrangement.
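
scikit-learn also provides `OrdinalEncoder`, which is intended for input features and accepts an explicit category order. The following is a minimal sketch, not part of the original walkthrough: the order in `size_order` (smallest to largest) and the `'Missing'` placeholder are assumptions made for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# 'Missing' is a hypothetical placeholder for the null value
sizes = pd.DataFrame({'Size': ['M', 'L', 'XL', 'XXL', 'Missing']})

# Assumed order of the levels; a level's position becomes its code
size_order = ['Missing', 'M', 'L', 'XL', 'XXL']

encoder = OrdinalEncoder(categories=[size_order])

# fit_transform expects 2-D input and returns a 2-D float array
encoded = encoder.fit_transform(sizes[['Size']])
sizes['Size_Encoded'] = encoded.ravel().astype(int)

print(sizes)
```

Unlike a hand-written dictionary, the fitted encoder can be reused to transform new data with the same mapping.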

These methods should only be used for ordinal features, where the order matters. For features where order is not important we must explore other techniques.

## 4. Nominal Features

Nominal features are categorical features that have no numerical importance. Their order does not matter, as with the `Color` feature in our example.

**One-hot encoding is a better technique when order doesn’t matter.**

In [None]:
# Working with Color - Nominal Feature
df

In one-hot encoding, a new binary (dummy) variable is created for each unique value in the categorical variable. In our dataset, we have 5 unique colors and so we create 5 new features, one for each color. Each new feature holds a 1 if the row has that color and a 0 otherwise.
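
To show what happens under the hood, here is a minimal hand-rolled sketch with NumPy. It is for illustration only; the array names `colors`, `levels` and `one_hot` are not used elsewhere in this notebook.

```python
import numpy as np

colors = np.array(['green', 'red', 'blue', 'black', 'grey'])

# Sorted unique values: one new column per level
levels = np.unique(colors)

# Compare every row with every level; True becomes 1 and False becomes 0
one_hot = (colors[:, None] == levels[None, :]).astype(int)

print(levels)
print(one_hot)
```

Each row contains exactly one 1, in the column of that row's color.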

Here we can use scikit-learn’s `OneHotEncoder()` to one-hot encode our nominal features.


In [None]:
# Import Library
from sklearn.preprocessing import OneHotEncoder

# Creating one hot encoder object 
onehotencoder = OneHotEncoder()

# Reshape the 1-D Color array to 2-D, as fit_transform expects 2-D input, then fit and transform
X = onehotencoder.fit_transform(df['Color'].values.reshape(-1, 1)).toarray()

# Build a dataframe of the encoded columns, named after each category
dfOneHot = pd.DataFrame(X, columns=[f"Color_{c}" for c in onehotencoder.categories_[0]])
df1 = pd.concat([df, dfOneHot], axis=1)

# Drop the original Color column
df1 = df1.drop(['Color'], axis=1)

# Printing to verify 
df1.head()


Alternatively, we can use pandas’ `get_dummies()` to one-hot encode our nominal features.

This method converts a categorical variable to dummy variables and returns a dataframe. The `drop_first` parameter is helpful to get k-1 dummies by removing the first level.

In [None]:
# One-hot encoding without dropping the first column, drop_first = False
nominal_features_F = pd.get_dummies(df[['Color', 'Size_Encoded', 'Price', 'ClassLabel_Encoded']], drop_first=False)
nominal_features_F

In [None]:
# One-hot encoding with dropping the first column, drop_first = True
nominal_features_T = pd.get_dummies(df[['Color', 'Size_Encoded', 'Price', 'ClassLabel_Encoded']], drop_first=True)
nominal_features_T

We can see from the sample of our one-hot encoded nominal features above that this type of encoding can greatly increase our number of columns. We input 1 column (`Color`) and after encoding have 5, or 4 with `drop_first=True`. It also introduces sparsity in the dataset, i.e., most entries in the new columns are 0s and only a few are 1s. In other words, it creates multiple dummy features in the dataset without adding much information.
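
The growth in columns and the share of zeros can be measured directly. The sketch below rebuilds a small two-column frame, `sample`, for illustration; the names `encoded`, `dummy_cols` and `zero_share` are likewise illustrative.

```python
import pandas as pd

sample = pd.DataFrame({
    'Color': ['green', 'red', 'blue', 'black', 'grey'],
    'Price': [10.1, 13.5, 15.3, 17.1, 19.1],
})

# One-hot encode only the Color column, keeping integer dummies
encoded = pd.get_dummies(sample, columns=['Color'], dtype=int)
dummy_cols = [c for c in encoded.columns if c.startswith('Color_')]

# Share of zero entries among the dummy columns
zero_share = (encoded[dummy_cols] == 0).to_numpy().mean()

print(sample.shape[1], '->', encoded.shape[1], 'columns')
print(f'{zero_share:.0%} of the dummy entries are 0')
```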

For high-cardinality features, those with many possible values, we may need to do some manipulation prior to encoding. For example, we could group values that occur only a small percentage of the time into an “other” category.
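
A minimal sketch of this grouping follows. The data in `cities` and the cutoff `min_share` of 15% are made up for illustration; a suitable threshold depends on the dataset.

```python
import pandas as pd

cities = pd.Series(['New York'] * 5 + ['Austin'] * 3 + ['Denver', 'Boise'])

# Hypothetical cutoff: levels seen in under 15% of rows count as rare
min_share = 0.15

# Relative frequency of each level
share = cities.value_counts(normalize=True)
rare_levels = share[share < min_share].index

# Keep frequent levels as they are and replace rare ones with 'Other'
grouped = cities.where(~cities.isin(rare_levels), 'Other')

print(grouped.value_counts())
```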

One-hot encoding can also lead to the dummy variable trap, a situation where the dummy features are highly correlated: the value of one dummy variable can be predicted exactly from the others. Dropping one level, as `drop_first=True` does, avoids this.
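
The redundancy can be checked directly. In this minimal sketch (the frame `colors` is rebuilt here for illustration), the dropped column is recovered exactly from the remaining ones.

```python
import pandas as pd

colors = pd.DataFrame({'Color': ['green', 'red', 'blue', 'black', 'grey']})

# k dummy columns, one per level
full = pd.get_dummies(colors['Color'], dtype=int)

# k-1 dummy columns: the first level ('black') is dropped
reduced = pd.get_dummies(colors['Color'], drop_first=True, dtype=int)

# With all k dummies every row sums to 1, so any one column is
# fully determined by the others
print(full.sum(axis=1).tolist())

# Recover the dropped column from the remaining k-1 columns
recovered = 1 - reduced.sum(axis=1)
print((recovered == full['black']).all())
```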

Because of the large increase in the number of columns, one-hot encoding can slow down training and make the model more computationally expensive, and it may also hurt overall performance. Further, one-hot encoding is often not the best choice for tree-based models.

## 5. Other Encoding Techniques

There are many other techniques for dealing with categorical variables that I encourage you to explore on your own.
Here is a list of some of them:
1. Helmert Encoding
2. Frequency Encoding
3. Mean Encoding
4. Weight of Evidence Encoding
5. Probability Ratio Encoding
6. Hashing Encoding
7. Backward Difference Encoding
8. James-Stein Encoding
9. M-estimator Encoding
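
As a starting point, here is a minimal sketch of two techniques from the list, frequency encoding and mean encoding, written in plain pandas. The frame `data` and its `Target` column are made up for illustration. In practice, the means for mean encoding would normally be computed on the training data only, so that target information does not leak into evaluation.

```python
import pandas as pd

data = pd.DataFrame({
    'Color': ['red', 'blue', 'red', 'green', 'blue', 'red'],
    'Target': [1, 0, 1, 0, 1, 0],
})

# Frequency encoding: replace each level with its relative frequency
freq = data['Color'].value_counts(normalize=True)
data['Color_Freq'] = data['Color'].map(freq)

# Mean encoding: replace each level with the mean target for that level
means = data.groupby('Color')['Target'].mean()
data['Color_Mean'] = data['Color'].map(means)

print(data)
```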