In [186]:
import numpy as np
import pandas as pd
import random
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder
from feature_engine.categorical_encoders import OneHotCategoricalEncoder

from IPython.display import display # Allows the use of display() for DataFrames

import warnings
warnings.filterwarnings('ignore')

## What is Categorical Data

Categorical variables take their values from a fixed group of categories or labels. Typically, a categorical attribute represents discrete values that belong to a specific finite set of categories or classes. When the attribute is the one a model has to predict (popularly known as the response variable), these values are often called classes or labels. The discrete values can be text or numeric in nature (or even unstructured data like images!).

#### There are two major classes of categorical data, nominal and ordinal.

In a nominal categorical attribute, there is no concept of ordering amongst the values. Consider a simple example of weather categories such as sunny, cloudy, rainy, and windy. These have no notion of order (windy doesn’t always occur before sunny, nor is it smaller or bigger than sunny).

For example, the variable may be “color” and may take on the values “red,” “green,” and “blue.” Similarly, the variable Gender with the values of male or female is categorical, and so is the variable marital status with the values of never married, married, divorced, or widowed.

As another example, in a survey about the preferred brand of car that respondents own, the result would be categorical (e.g. Tesla, Toyota, Ford, None, etc.). Responses fall into a fixed set of categories.

**Ordinal categorical** attributes have some sense or notion of order amongst their values. Take shirt sizes, for instance. It is quite evident that order, or in this case ‘size’, matters when thinking about shirts (S is smaller than M, which is smaller than L, and so on).
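As a minimal sketch of how such an order can be made explicit in pandas, the example below declares the shirt sizes as an ordered categorical. The sample values and the order S < M < L are assumptions made for illustration only.

```python
import pandas as pd

# Hypothetical shirt sizes; the explicit order S < M < L is an assumption of this sketch
sizes = pd.Series(['M', 'S', 'L', 'M'])
ordered_sizes = pd.Categorical(sizes, categories=['S', 'M', 'L'], ordered=True)

# .codes gives the integer position of each value in the declared order
size_codes = ordered_sizes.codes.tolist()
print(size_codes)
```

Because the order is declared by hand, the integer codes respect the meaning of the sizes rather than their alphabetical order.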

---

You will get an error if you try to plug these variables into most machine learning models without "encoding" them first.

Almost all machine learning algorithms, including deep neural networks, require numeric input and output variables, so categorical data must be encoded as numbers before we can use it to fit and evaluate a model.

There are quite a few techniques to encode categorical variables for modeling, although the three most common are as follows:

- Integer Encoding: Where each unique label is mapped to an integer.

- One Hot Encoding: Where each label is mapped to a binary vector.

- Learned Embedding: Where a distributed representation of the categories is learned.

In some categorical variables, the labels have an intrinsic order, for example, in the variable Student's grade, the values of A, B, C, or Fail are ordered, A being the highest grade and Fail the lowest. These are called ordinal categorical variables. Variables in which the categories do not have an intrinsic order are called nominal categorical variables, such as the variable City, with the values of London, Manchester, Bristol, and so on.

The values of categorical variables are often encoded as strings. Most scikit-learn estimators do not accept strings as input values, so we need to transform those strings into numbers. The act of replacing strings with numbers is called categorical encoding.
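To make the idea of replacing strings with numbers concrete, here is a small sketch of integer encoding in plain pandas. The city values come from the example above; building the mapping from the sorted unique labels is just one possible convention, chosen here for illustration.

```python
import pandas as pd

cities = pd.Series(['London', 'Manchester', 'Bristol', 'London'])

# Build an explicit label -> integer mapping from the sorted unique labels
mapping = {label: idx for idx, label in enumerate(sorted(cities.unique()))}
cities_encoded = cities.map(mapping)

print(mapping)
print(cities_encoded.tolist())
```

Note that for a nominal variable like City these integers carry no real order, which is one reason one-hot encoding is often preferred.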

## One-hot Encoding

One-hot encoding is where you represent each possible value for a category as a separate feature.

In one-hot encoding, we represent a categorical variable as a group of binary variables, where each binary variable represents one category. The binary variable indicates whether the category is present in an observation (1) or not (0).

One-hot encoding is the most widespread approach, and it works very well unless the categorical variable takes on a large number of values (e.g. more than 20 different values).

![img](https://i.imgur.com/5td19b8.jpg)

Another example with a variable named 'color'. The values in the variable are Red, Yellow and Green. And then we create a separate column for each possible value. Wherever the original value was Red, we put a 1 in the Red column.

![img](https://i.imgur.com/kdltIHI.png)

From the above Gender variable, we can derive the binary variable of Female, which shows the value of 1 for females, or the binary variable of Male, which takes the value of 1 for the males in the dataset.
For the categorical variable of Color with the values of red, green, and blue, we can create three variables called red, green, and blue. These variables will take the value of 1 if the observation is red, green, or blue, respectively, or 0 otherwise.
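The construction described above can be sketched by hand, without any encoder, by comparing the column against each category. The sample values, and the assumption that the three colors are red, green, and blue, are for illustration only.

```python
import pandas as pd

colors = pd.Series(['red', 'green', 'blue', 'green'], name='color')

# One binary column per category: 1 where the observation has that category, else 0
one_hot = pd.DataFrame({
    category: (colors == category).astype(int)
    for category in ['red', 'green', 'blue']
})
print(one_hot)
```

Each row contains exactly one 1, which is where the name "one-hot" comes from.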

A categorical variable with k unique categories can be encoded in k-1 binary variables. For Gender, k is 2 as it contains two labels (male and female), therefore we need to create only one binary variable (k - 1 = 1) to capture all of the information. For the color variable, which has three categories (k=3; red, green, and blue), we need to create two (k - 1 = 2) binary variables to capture all the information, so that the following occurs:

- If the observation is red, it will be captured by the variable red (red = 1, green = 0).

- If the observation is green, it will be captured by the variable green (red = 0, green = 1).

- If the observation is blue, it will be captured by the combination of red and green (red = 0, green = 0).
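The k-1 scheme in the list above can be checked with a short sketch. It assumes the three colors are red, green, and blue, and uses `pd.get_dummies` with `drop_first=True`, which drops the first category in sorted order (here `blue`). The `dtype=int` argument is only there to print 0/1 values.

```python
import pandas as pd

colors = pd.Series(['red', 'green', 'blue'], name='color')

# drop_first=True drops the first category in sorted order ('blue' here),
# leaving k - 1 = 2 binary columns
k_minus_1 = pd.get_dummies(colors, drop_first=True, dtype=int)
print(k_minus_1)
```

The row for the dropped category is all zeros, so no information is lost.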

There are a few occasions in which we may prefer to encode the categorical variables with k binary variables:

- When training decision trees, as they do not evaluate the entire feature space at the same time
- When selecting features recursively
- When determining the importance of each category within a variable

In [187]:
breast_cancer_df = pd.read_csv('../input/breast-cancer-data/breast-cancer.data')
print('Breast Cancer df number of rows and columns are ', breast_cancer_df.shape)

Breast Cancer df number of rows and columns are  (285, 10)


In [188]:
# Replace the question marks in the dataset with NumPy NaN values:
breast_cancer_df = breast_cancer_df.replace('?', np.nan)

In [189]:
# Create a list with the variable names.
# The dataframe has 10 columns, as we know from its shape,
# so create a list of 10 column headings from 'A1' to 'A10'
# by traversing range(1, 11).
column_labels = ['A' + str(s) for s in range(1, 11)]
column_labels

['A1', 'A2', 'A3', 'A4', 'A5', 'A6', 'A7', 'A8', 'A9', 'A10']

In [190]:
# Now assign the above list as the column labels
breast_cancer_df.columns = column_labels

In [191]:
# Make lists with categorical and numerical variables:

category_columns = [c for c in breast_cancer_df.columns if breast_cancer_df[c].dtypes == 'O' ]
numeric_columns = [c for c in breast_cancer_df.columns if breast_cancer_df[c].dtypes != 'O' ]

print('breast_cancer_category_columns ', category_columns)
print('breast_cancer_numeric_columns ', numeric_columns)

breast_cancer_category_columns  ['A1', 'A2', 'A3', 'A4', 'A5', 'A6', 'A8', 'A9', 'A10']
breast_cancer_numeric_columns  ['A7']


From the above we see that 'A7' is the only numeric column; all the rest are categorical columns.
Now, re-cast the numerical variable to the float type:

In [192]:
breast_cancer_df['A7'] = breast_cancer_df['A7'].astype(float)

## Binary encoding - Re-code the target variable as binary:
Binary encoding is a special case of categorical encoding, for features that take only two values. Here we apply it to the column 'A10', mapping each 'yes' to 1 and each 'no' to 0 (zero).

In [193]:
breast_cancer_df['A10'] = breast_cancer_df['A10'].map({'yes':1, 'no':0})
breast_cancer_df.head()

Unnamed: 0,A1,A2,A3,A4,A5,A6,A7,A8,A9,A10
0,no-recurrence-events,40-49,premeno,20-24,0-2,no,2.0,right,right_up,0
1,no-recurrence-events,40-49,premeno,20-24,0-2,no,2.0,left,left_low,0
2,no-recurrence-events,60-69,ge40,15-19,0-2,no,2.0,right,left_up,0
3,no-recurrence-events,40-49,premeno,0-4,0-2,no,2.0,right,right_low,0
4,no-recurrence-events,60-69,ge40,15-19,0-2,no,2.0,left,left_low,0


In [194]:
# Fill in the missing data
breast_cancer_df[numeric_columns] =  breast_cancer_df[numeric_columns].fillna(0)
breast_cancer_df[category_columns] = breast_cancer_df[category_columns].fillna(0)

In [195]:
# separate the data into train and test sets:
X_train, X_test, y_train, y_test = train_test_split(breast_cancer_df.drop(labels=['A10'], axis=1), breast_cancer_df['A10'], test_size=0.3, random_state=0)

In [196]:
#  Let's inspect the unique categories of the A3 variable
X_train['A3'].unique()

array(['ge40', 'premeno', 'lt40'], dtype=object)

So A3 has three unique categories: `ge40`, `premeno`, and `lt40`.

---

## One-hot encoding using pandas get_dummies()

Let's encode A3 into k-1 binary variables using pandas and then inspect the first five rows of the resulting dataframe:

In [197]:
tmp_1 = pd.get_dummies(X_train['A3'], drop_first=True)
tmp_1.head()

Unnamed: 0,lt40,premeno
18,0,0
156,0,1
235,0,1
233,0,0
234,0,1


In [198]:
tmp_2 = pd.get_dummies(X_train['A3'], drop_first=False)
tmp_2.head()

Unnamed: 0,ge40,lt40,premeno
18,1,0,0
156,0,0,1
235,0,0,1
233,1,0,0
234,0,0,1


The pandas `get_dummies()` function converts categorical variables into indicator variables. By default it ignores missing data; if we pass `dummy_na=True`, it returns missing data as an additional category.

With `drop_first=True`, each remaining label of A3 becomes a binary variable named after the label, and we get two new columns (k - 1 = 2). To encode the variable into k binaries instead, use `drop_first=False`, which returns three columns.
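The handling of missing data can be seen in a small sketch. The three sample values are made up for illustration; `dummy_na` is the `get_dummies()` argument that adds a separate indicator for missing values, and `dtype=int` is only used to print 0/1 values.

```python
import numpy as np
import pandas as pd

menopause = pd.Series(['ge40', np.nan, 'premeno'])

# Default: the missing value gets 0 in every indicator column
default_dummies = pd.get_dummies(menopause, dtype=int)

# dummy_na=True adds an extra indicator column for missing values
with_nan_column = pd.get_dummies(menopause, dummy_na=True, dtype=int)

print(default_dummies)
print(with_nan_column)
```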

To see how `get_dummies()` behaves on a small example, take a look at the code below:

In [199]:
df = pd.DataFrame({'country': ['russia', 'germany', 'australia','korea']})
df_get_dummied = pd.get_dummies(df['country'], prefix='country')
df_get_dummied

Unnamed: 0,country_australia,country_germany,country_korea,country_russia
0,0,0,0,1
1,0,1,0,0
2,1,0,0,0
3,0,0,1,0


![img](https://i.imgur.com/DgTHD0B.jpg)

To encode several categorical variables at the same time, let's first make a list with their names. The list excludes:
 - A7, which is numerical data,
 - A10, which is the target variable and was already re-coded as binary, and
 - the columns that hold ranges, i.e. 'A2', 'A4', and 'A5'.

In [200]:
vars_categorical = ['A1', 'A3', 'A6', 'A8', 'A9' ]

# Now, let's encode all of the categorical variables into k-1 binaries each, capturing the result in a new dataframe:

X_train_dummy_encoded_pandas = pd.get_dummies(X_train[vars_categorical], drop_first=True)
X_test_dummy_encoded = pd.get_dummies(X_test[vars_categorical], drop_first=True )

X_train_dummy_encoded_pandas.head()

Unnamed: 0,A1_recurrence-events,A3_lt40,A3_premeno,A6_no,A6_yes,A8_right,A9_left_low,A9_left_up,A9_right_low,A9_right_up
18,0,0,0,1,0,1,0,0,0,1
156,0,0,1,1,0,0,0,0,0,0
235,1,0,1,1,0,1,1,0,0,0
233,1,0,0,1,0,1,0,1,0,0
234,1,0,1,0,1,0,1,0,0,0


As we can see above, pandas' `get_dummies()` function creates one binary variable per category it finds. Hence, if there are more categories in the train set than in the test set, `get_dummies()` will return more columns in the transformed train set than in the transformed test set.
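One common workaround, shown here only as a sketch on made-up toy data, is to align the test columns to the train columns with `DataFrame.reindex`. The frames and their values are hypothetical and are not the notebook's `X_train` and `X_test`.

```python
import pandas as pd

# Toy data: 'lt40' appears only in the train set
train = pd.DataFrame({'A3': ['ge40', 'premeno', 'lt40']})
test = pd.DataFrame({'A3': ['premeno', 'ge40']})

train_enc = pd.get_dummies(train, dtype=int)
test_enc = pd.get_dummies(test, dtype=int)

# Align the test columns to the train columns;
# categories that never occur in test become all-zero columns
test_aligned = test_enc.reindex(columns=train_enc.columns, fill_value=0)
print(test_aligned)
```

The same call also drops any column that exists only in the test set, which may or may not be what you want.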

---

## Now one-hot encoding using scikit-learn

First, create a label (category) encoder object with `LabelEncoder()`, a utility class that normalizes labels so that they contain only values between 0 and n_classes-1.

#### Why we need Label Encoding

Datasets in machine learning usually contain labels in one or more columns. These labels are often words, which keeps the data in a human-readable form.

Label encoding converts these labels into numeric form so that they become machine-readable. Machine learning algorithms can then operate on those labels. It is an important pre-processing step for structured datasets in supervised learning.

In [201]:
example_df = pd.DataFrame(['India', 'Australia', 'USA'], columns= ['Country'])
example_df

Unnamed: 0,Country
0,India
1,Australia
2,USA


#### How do we do numeric encoding of the Country feature in the above DataFrame?

The answer is a scikit-learn transformer called `LabelEncoder`:

In [202]:
example_df['Country_encoded'] = LabelEncoder().fit_transform(example_df['Country'])
example_df

Unnamed: 0,Country,Country_encoded
0,India,1
1,Australia,0
2,USA,2


Let's take a closer look at what the LabelEncoder is doing

In [None]:
encoder = LabelEncoder()
encoder.fit(example_df['Country'])
encoder.classes_

This gives the output `array(['Australia', 'India', 'USA'], dtype=object)`.

We see that the position of each class in this sorted list corresponds to its numeric value. The transformation is therefore: Australia → 0, India → 1, USA → 2.
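A short sketch makes the correspondence between `classes_` and the integer codes explicit, and shows that `inverse_transform` recovers the original labels. The list of countries mirrors the example above.

```python
from sklearn.preprocessing import LabelEncoder

countries = ['India', 'Australia', 'USA']

encoder = LabelEncoder()
codes = encoder.fit_transform(countries)

# classes_ is sorted; each label's position in classes_ is its integer code
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform recovers the original labels from the codes
restored = encoder.inverse_transform(codes).tolist()
print(restored)
```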


#### Now apply LabelEncoder() to our Breast-Cancer dataset

In [203]:
enc = LabelEncoder()

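# Note: this fits the encoder on the column *names* in vars_categorical,
# not on the values stored in those columns
enc.fit(vars_categorical)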

# View the labels (if you want)
print("label (category) encoder List: ", list(enc.classes_))
# ['A1', 'A3', 'A6', 'A8', 'A9']

new_cat_features = enc.transform(vars_categorical)
print(new_cat_features) # [0 1 2 3 4]

new_cat_features = new_cat_features.reshape(-1, 1)

label (category) encoder List:  ['A1', 'A3', 'A6', 'A8', 'A9']
[0 1 2 3 4]


Then create a `OneHotEncoder` transformer that encodes into k-1 binary variables and returns a NumPy array.

Scikit-learn's `OneHotEncoder` class only encodes the categories learned from the train set. If there are new categories in the test set, we can instruct the encoder to ignore them with the `handle_unknown='ignore'` argument, or to raise an error with the `handle_unknown='error'` argument.
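The effect of `handle_unknown='ignore'` can be sketched on made-up toy data in which the test set contains a category the encoder never saw. The frames below are hypothetical, and the sketch deliberately leaves out `drop` and the dense-output argument so that it does not depend on a particular scikit-learn version.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy data: 'lt40' appears in the test set but not in the train set
train = pd.DataFrame({'A3': ['ge40', 'premeno', 'ge40']})
test = pd.DataFrame({'A3': ['premeno', 'lt40']})

# handle_unknown='ignore' encodes unseen categories as an all-zero row
encoder = OneHotEncoder(handle_unknown='ignore')
encoder.fit(train)

# The default output is a sparse matrix, so convert it to a dense array to inspect it
test_encoded = encoder.transform(test).toarray()
print(encoder.categories_)
print(test_encoded)
```

With the default `handle_unknown='error'`, the same `transform` call would raise an error for the unseen `lt40` value.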

We set the `categories='auto'` argument so that the transformer learns the categories to encode from the train set; `drop='first'` so that the transformer drops the first binary variable, returning k-1 binary features per categorical variable; and `sparse=False` so that the transformer returns a NumPy array (the default is to return a sparse matrix). In scikit-learn 1.2 and later, this last argument is named `sparse_output`.

Now, let's create the encoder that will produce the NumPy arrays with the binary variables:

In [204]:
ohe_scikit = OneHotEncoder(sparse=False, categories='auto', drop='first')

#### Now fit the encoder, i.e. make scikit-learn learn the categories to encode:

In [205]:
output = ohe_scikit.fit_transform(new_cat_features)
print(output)

[[1. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0.]
 [0. 0. 1. 0. 0.]
 [0. 0. 0. 1. 0.]
 [0. 0. 0. 0. 1.]]


Unfortunately, the feature names are not preserved in the NumPy array, so identifying which feature was derived from which variable is not straightforward.

The beauty of pandas' `get_dummies()` function is that it returns feature names that clearly indicate which variable and which category each feature represents. On the downside, `get_dummies()` does not persist the information learned from the train set to the test set.

In contrast, scikit-learn's `OneHotEncoder` class can persist the information from the train set, but it returns a NumPy array, where the information about the meaning of the features is lost.

Scikit-learn's `OneHotEncoder` will create binary indicators from all variables in the dataset it is given, so be mindful not to pass numerical variables when fitting or transforming your datasets.
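Both caveats can be worked around by hand, as the following sketch suggests: fit the encoder only on the categorical columns, then rebuild readable column names from the fitted `categories_` attribute. The toy frame and the `column_category` naming scheme are assumptions made for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy frame with one numerical and two categorical columns
df = pd.DataFrame({
    'A7': [2.0, 3.0, 1.0],
    'A6': ['no', 'yes', 'no'],
    'A8': ['left', 'right', 'left'],
})
categorical = ['A6', 'A8']

# Fit only on the categorical columns so the numerical column is left untouched
encoder = OneHotEncoder()
encoded = encoder.fit_transform(df[categorical]).toarray()

# Rebuild readable column names from the learned categories
names = [f'{col}_{cat}'
         for col, cats in zip(categorical, encoder.categories_)
         for cat in cats]
encoded_df = pd.DataFrame(encoded, columns=names, index=df.index)

# Put the numerical column back next to the encoded ones
result = pd.concat([df[['A7']], encoded_df], axis=1)
print(result)
```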

## Implement one-hot encoding with Feature-engine

`Feature-engine` has multiple advantages:

- First, it allows us to select the variables to encode directly in the transformer.
- Second, it returns a pandas dataframe with clear variable names.
- Third, it preserves the information learned from the train set, therefore returning the same number of columns in both train and test sets.

#### With that, Feature-engine overcomes the limitations of pandas' `get_dummies()` method and scikit-learn's `OneHotEncoder()` class.

From its [documentation](https://feature-engine.readthedocs.io/en/latest/encoders/OneHotCategoricalEncoder.html)

The OneHotCategoricalEncoder() replaces categorical variables by a set of binary variables, one per unique category. The encoder has the option to create k or k-1 binary variables, where k is the number of unique categories.

The encoder can also create binary variables for the n most popular categories, n being determined by the user. This means that if we encode the 6 most popular categories, we will only create binary variables for those categories, and the rest will be dropped.

The OneHotCategoricalEncoder() works only with categorical variables. A list of variables can be indicated, or the encoder will automatically select all categorical variables in the train set.

In [None]:
one_hot_enc_feature_engine = OneHotCategoricalEncoder(top_categories=None, drop_last=True)

With `top_categories=None`, we indicate that we want to encode all of the categories present in the categorical variables.
Feature-engine detects the categorical variables automatically. To encode only a subset of the categorical variables, we can pass the variable names in a list, as below:

`one_hot_enc_feature_engine = OneHotCategoricalEncoder(variables=['A1', 'A4'])`

Now, let's fit the encoder to the train set so that it learns the categories and variables to encode:


In [None]:
one_hot_enc_feature_engine.fit(X_train)

# Let's encode the categorical variables in train and test sets, and display the first five rows of the encoded train set:

X_train_enc_feature_engine = one_hot_enc_feature_engine.transform(X_train)
X_test_enc_feature_engine = one_hot_enc_feature_engine.transform(X_test)

X_train_enc_feature_engine.head()

## One-hot encoding of frequent categories

#### What is high cardinality of a dataset?

A dataset has high cardinality when its columns (features) contain a high number of unique values. Variables that have a multitude of categories are likewise called variables with high cardinality. If we have categorical variables with many labels, one-hot encoding will expand the feature space dramatically, which is not an ideal situation to be in.

One approach used by many in Kaggle competitions is to replace each label of the categorical variable by its count, that is, the number of times the label appears in the dataset, or by its frequency, that is, the percentage of observations within that category.

But the cost of the above strategy is some loss of information, because I am effectively turning a categorical feature into a "popularity" feature.

Check the "Count Encoding" section on [this link](https://www.kaggle.com/matleonard/categorical-encodings)
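A minimal sketch of count and frequency encoding in plain pandas follows. The sample labels are made up, and for simplicity the counts are computed on the same series that is encoded; in a real workflow you would probably learn them from the train set only.

```python
import pandas as pd

labels = pd.Series(['A', 'A', 'A', 'B', 'C', 'C', 'D'])

# Count encoding: replace each label by how many times it appears
counts = labels.value_counts()
count_encoded = labels.map(counts)

# Frequency encoding: replace each label by its share of the observations
frequencies = labels.value_counts(normalize=True)
freq_encoded = labels.map(frequencies)

print(count_encoded.tolist())
print(freq_encoded.round(3).tolist())
```

Labels B and D end up with the same encoded value, which illustrates the loss of information mentioned above.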

When dealing with a highly cardinal dataset, one thing to ensure is that the cardinality of the categorical information in the training set resembles that in the test/validation sets. That is, if I have a feature with values {A,A,A,B,C,C,D} in train, but test only has {A,B,B}, then eliminating the C and D records, and undersampling the A or oversampling the B records, may help resist overfitting. Also, for individual features with low cardinality, it's often worth bucketing them. In the above example, you may end up with replacement values for A and C, and then bucket B and D into an "Other" category.
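The bucketing idea from the example above can be sketched as follows. The threshold `min_count` is a hypothetical choice for illustration; with it, A and C are kept and B and D are grouped into "Other".

```python
import pandas as pd

labels = pd.Series(['A', 'A', 'A', 'B', 'C', 'C', 'D'])

# Hypothetical threshold: categories seen fewer than 2 times are treated as rare
min_count = 2
counts = labels.value_counts()
frequent = counts[counts >= min_count].index

# Keep frequent labels and replace everything else with 'Other'
bucketed = labels.where(labels.isin(frequent), 'Other')
print(bucketed.tolist())
```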

We will deal with high cardinality using **OneHotCategoricalEncoder** from `feature_engine`.

One-hot encoding represents each category of a categorical variable with a binary variable. Hence, one-hot encoding of highly cardinal variables or datasets with multiple categorical features can expand the feature space dramatically. To reduce the number of binary variables, we can perform one-hot encoding of the most frequent categories only. One-hot encoding of top categories is equivalent to treating the remaining, less frequent categories as a single, unique category.
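The idea of encoding only the most frequent categories can be sketched in plain pandas, independently of any encoder library. The sample labels, the choice of `top_n = 2`, and the `label_` column prefix are assumptions of this sketch; in practice the top categories would be learned from the train set.

```python
import pandas as pd

labels = pd.Series(['A', 'A', 'A', 'B', 'C', 'C', 'D'], name='label')

# Hypothetical choice: keep binary variables for the 2 most frequent categories only
top_n = 2
top_categories = labels.value_counts().head(top_n).index.tolist()

# One binary column per top category; all other categories are 0 in every column
top_encoded = pd.DataFrame({
    f'label_{category}': (labels == category).astype(int)
    for category in top_categories
})
print(top_encoded)
```

Rows for B and D are all zeros, so the less frequent categories are effectively treated as one group.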




