<div style="color:white;
           display:fill;
           border-radius:5px;
           background-color:#5642C5;
           font-size:220%;
           font-family:Nexa;
           letter-spacing:0.5px">
        <p style="padding: 15px;
              color:white;">
            <b> Encoding Categorical and Numerical Data </b>
        </p>
</div>
<div class="alert alert-block alert-info" style="font-size:16px; font-family:Nexa;">
    In this notebook, I will show you how to encode categorical and numerical columns in Python.
</div>

![](https://2ubrsn5y54ao0ufa2mpsbmg3-wpengine.netdna-ssl.com/wp-content/uploads/2021/02/Hedof-x-Youtube_10-750x371.jpg)

<div style="color:white;
           display:fill;
           border-radius:5px;
           background-color:#EEB03B;
           font-size:220%;
           font-family:Nexa;
           letter-spacing:0.5px">
        <p style="padding: 15px;
              color:white;">
            <b> 1. Encoding Categorical Data </b>
        </p>
</div>
<div class="alert alert-block alert-info" style="font-size:16px; font-family:Nexa;">
    There are two main types of data: numerical and categorical. We start with the categorical columns in your dataset. Most machine learning models, linear regression for example, cannot work with text directly, so categorical values have to be converted to numbers first. Suppose we have a Sex column with the values Male and Female. We can turn that single column into two columns, Male and Female. The Male column is 1 when the row is Male and 0 otherwise, and the Female column works the same way for Female. This is called encoding. We will learn a few ways to do it below.
</div>
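<div class="alert alert-block alert-info" style="font-size:16px; font-family:Nexa;">
    As a minimal sketch of the idea above, the snippet below one-hot encodes a made-up <code>Sex</code> column. The <code>toy</code> DataFrame and the <code>Sex_...</code> column names are assumptions for illustration only and are not part of the customer dataset used later.
</div>

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# hypothetical toy data, not the customer dataset
toy = pd.DataFrame({"Sex": ["Male", "Female", "Female", "Male"]})

encoder = OneHotEncoder()  # returns a sparse matrix by default
encoded = encoder.fit_transform(toy[["Sex"]]).toarray().astype(int)

# categories_ holds the learned categories in sorted order: Female, Male
columns = [f"Sex_{cat}" for cat in encoder.categories_[0]]
one_hot = pd.DataFrame(encoded, columns=columns)
print(one_hot)
```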

***
## Types of Categorical Data :
   * **Ordinal Categorical** : The categories have a natural order. Suppose you have a column of education degrees: school, undergraduate, graduate and so on. Here, there is an order between them. If you change school to 1, undergraduate to 2 and graduate to 3, then you will maintain the order between them. 
       - **Ordinal Encoder** : Used only for input or X columns. 
       - **Label Encoder** : Used only for output or Y columns.
       
   
   * **Cardinal Categorical** (more commonly called *nominal*) : The categories have no order. For example, a gender column of Male and Female. Female is not better or worse than Male, so you cannot fit them in an order. In these cases, you cannot number the categories 1, 2, 3, because 3 is always more than 2 and 2 is more than 1. You have to change every category to a new column that holds a binary value of 1 or 0. 
       - **One Hot Encoder** : `pd.get_dummies` can be used in place of OneHotEncoder, but it does not remember the categories it saw in the training data, so the columns it produces can differ between the training set and new data. For machine learning, it is best to use OneHotEncoder, which learns the categories in `fit` and reuses them in `transform`. 
***
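A small sketch of why the order matters for ordinal columns, using a made-up education column (the values and variable names here are assumptions for illustration). By default `OrdinalEncoder` numbers the categories in sorted order, which is alphabetical for strings, so the real order has to be passed through `categories=`. Note that the codes start at 0 rather than 1.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# hypothetical education column with a known order: school < undergraduate < graduate
education = np.array([["school"], ["graduate"], ["undergraduate"], ["school"]])

# default behaviour: categories are sorted alphabetically
# graduate -> 0, school -> 1, undergraduate -> 2
default_codes = OrdinalEncoder().fit_transform(education).ravel().astype(int)

# explicit order: school -> 0, undergraduate -> 1, graduate -> 2
ordered = OrdinalEncoder(categories=[["school", "undergraduate", "graduate"]])
ordered_codes = ordered.fit_transform(education).ravel().astype(int)

print(default_codes, ordered_codes)
```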
## **Dummy Variable Trap :**

When you are using One Hot Encoder, get rid of 1 category. Suppose you have K categories in the column. After encoding them into K columns, any one column can be worked out from the others because each row sums to 1, which causes multicollinearity in linear models such as linear regression. So drop any 1 of them. The total number of columns after encoding should be (K - 1). This is the reason why this is also called **K - 1 Encoding**.


***
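To illustrate K - 1 encoding, here is a minimal sketch on a made-up `color` column with K = 3 categories (the column and its values are assumptions for illustration). With `drop="first"`, the first category in sorted order is dropped and is represented by a row of all zeros.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# hypothetical nominal column with K = 3 categories
color = np.array([["red"], ["green"], ["blue"], ["green"]])

# full one-hot encoding: K columns, and every row sums to 1
full = OneHotEncoder().fit_transform(color).toarray()

# K - 1 encoding: the first category in sorted order ("blue") is dropped
reduced = OneHotEncoder(drop="first").fit_transform(color).toarray()

print(full.shape, reduced.shape)
print(reduced)
```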

In [None]:
# importing necessary libraries
import numpy as np
import pandas as pd
import seaborn as sns
sns.set_theme(style="dark")
import matplotlib.pyplot as plt

In [None]:
df_cx = pd.read_csv("../input/customer/customer.csv", index_col=None)

In [None]:
df_cx.head()

In [None]:
plt.figure(figsize=(18,6), facecolor='#f6f5f5')
plt.suptitle("Categorical Columns", fontsize=17)
# one count plot per column, skipping the first column (age)
for i, col in enumerate(df_cx.columns[1:]):
    plt.subplot(1, 4, i + 1)
    sns.countplot(data = df_cx, x = col, palette = "rainbow")
plt.tight_layout()
plt.show()

***
**In the plots above, you can see that we have only 1 cardinal column, that is gender.<br>
The other 3 are ordinal columns. So we will use OneHotEncoder only for gender. For the rest, we will use Ordinal Encoder for the X columns and Label Encoder for Y. The Purchase column is the output or Y here, and we only use Label Encoder for Y columns or dependent variables.**
***

In [None]:
# X_train and Y_train
from sklearn.model_selection import train_test_split

X_train, X_test, y_train , y_test = train_test_split(df_cx.iloc[:,0:4], df_cx.iloc[:,4], test_size = 0.2, random_state = 2)
print(X_train.shape, y_train.shape)
print( X_test.shape, y_test.shape)

<div class="alert alert-block alert-info" style="font-size:16px; font-family:Nexa;">
    One Hot Encoding for the Gender column. <br>
    Ordinal Encoding for the ordinal columns of X, that is, the Review and Education columns.
</div>

In [None]:
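from sklearn.preprocessing import OneHotEncoder, LabelEncoder, OrdinalEncoder
from sklearn.compose import ColumnTransformer


CatEncod = ColumnTransformer(transformers=[
    # drop="first" removes 1 column to avoid the dummy variable trap;
    # sparse_output replaced the older `sparse` argument in scikit-learn 1.2
    ("genderEncod", OneHotEncoder(drop="first", sparse_output=False), [1]),
    # [2,3] are the positions of the ordinal columns; without `categories=`,
    # OrdinalEncoder numbers the categories in sorted (alphabetical) order
    ("Xordinal", OrdinalEncoder(), [2,3])
], remainder="passthrough")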

In [None]:
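# fit on the training data only, then reuse the fitted encoders on the test data
X_train_enc = CatEncod.fit_transform(X_train)
X_test_enc = CatEncod.transform(X_test)
X_test_enc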

<div class="alert alert-block alert-info" style="font-size:16px; font-family:Nexa;">
    Label Encoding for the Purchase column, as this is the dependent variable.
</div>
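<div class="alert alert-block alert-info" style="font-size:16px; font-family:Nexa;">
    A minimal standalone sketch of what Label Encoder does, assuming the labels are the strings <code>"No"</code> and <code>"Yes"</code> (the real values in the Purchase column may differ). The names <code>toy_labels</code> and <code>toy_le</code> are made up for this example.
</div>

```python
from sklearn.preprocessing import LabelEncoder

# hypothetical target labels, not read from the customer dataset
toy_labels = ["No", "Yes", "Yes", "No"]

toy_le = LabelEncoder()
codes = toy_le.fit_transform(toy_labels)

# classes_ lists the labels in sorted order; the position of a label is its code
print(list(toy_le.classes_), codes)

# inverse_transform maps the integer codes back to the original labels
restored = toy_le.inverse_transform(codes)
print(list(restored))
```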

In [None]:
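le = LabelEncoder()

# learn the label-to-integer mapping from the training labels, then reuse it on the test labels
y_train_enc = le.fit_transform(y_train)
y_test_enc = le.transform(y_test)
y_test_enc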

***

<div style="color:white;
           display:fill;
           border-radius:5px;
           background-color:#EEB03B;
           font-size:220%;
           font-family:Nexa;
           letter-spacing:0.5px">
        <p style="padding: 15px;
              color:white;">
            <b> 2. Encoding Numerical Data </b>
        </p>
</div>
<div class="alert alert-block alert-info" style="font-size:16px; font-family:Nexa;">
    We have learned how to encode categorical columns. Now, we will learn how to do the same for numerical columns. Below we will discuss the ways of encoding numerical columns. 
</div>

![binning.PNG](https://cdn.discordapp.com/attachments/517815672613503006/873213661605400666/binning.PNG)

<div style="color:white;
           display:fill;
           border-radius:5px;
           background-color:#0F6395;
           font-size:220%;
           font-family:Nexa;
           letter-spacing:0.5px">
        <p style="padding:15px;
              color:white;">
            <b> a. Equal Width Binning (Uniform Binning)</b>
        </p>
</div>
<div class="alert alert-block alert-info" style="font-size:16px; font-family:Nexa;">
    This splits the range of the numerical column, from its minimum to its maximum, into bins of equal width. The number of bins is set by the <code>n_bins</code> parameter.
</div>
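<div class="alert alert-block alert-info" style="font-size:16px; font-family:Nexa;">
    A small sketch of equal width binning on made-up ages (the <code>toy_ages</code> values are an assumption for illustration). With a minimum of 18, a maximum of 78 and 3 bins, every bin is 20 years wide.
</div>

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# hypothetical ages, one column
toy_ages = np.array([[18], [22], [25], [30], [41], [58], [60], [78]])

uniform_binner = KBinsDiscretizer(n_bins=3, strategy="uniform", encode="ordinal")
uniform_bins = uniform_binner.fit_transform(toy_ages).ravel().astype(int)

# bin_edges_ holds the learned edges for each column: 18, 38, 58, 78
uniform_edges = uniform_binner.bin_edges_[0]
print(uniform_edges)
print(uniform_bins)
```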

In [None]:
# this is the only class that we will need for all the encoding that we will do for numerical columns
from sklearn.preprocessing import KBinsDiscretizer

In [None]:
def AutoEncodePlot(strat, numbin, pal):
    """Plot the age column before and after binning with the given strategy."""
    NumEncod = ColumnTransformer(transformers = [
        ("binning", KBinsDiscretizer(strategy = strat, n_bins=numbin, encode = "ordinal"), [0])
    ])
    background_color = "#f6f5f5"
    
    plt.figure(figsize=(10,6), facecolor=background_color)

    plt.subplot(1,2,1)
    sns.histplot(data = X_train["age"], color="#D6265D", legend=False)
    plt.xlabel("AGE")
    plt.ylabel("")
    plt.title("Before Binning")

    plt.subplot(1,2,2)
    sns.histplot(data = NumEncod.fit_transform(X_train), palette=pal, legend=False)
    plt.xlabel("AGE")
    plt.ylabel("")
    plt.title("After Binning")
    
    plt.show()

In [None]:
AutoEncodePlot("uniform", 4, "YlOrBr") #uniform category of 4 age groups

<div style="color:white;
           display:fill;
           border-radius:5px;
           background-color:#D6265D;
           font-size:220%;
           font-family:Nexa;
           letter-spacing:0.5px">
        <p style="padding:15px;
              color:white;">
            <b> b. Equal Frequency Binning (Quantile Binning) </b>
        </p>
</div>
<div class="alert alert-block alert-info" style="font-size:16px; font-family:Nexa;">
    This creates bins that each hold roughly the same number of values, so the bin edges follow the quantiles of the data instead of being evenly spaced. For example, with 100 age values and 10 bins, each bin will hold about 10 ages. 
</div>
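<div class="alert alert-block alert-info" style="font-size:16px; font-family:Nexa;">
    A small sketch that contrasts quantile binning with uniform binning on made-up, skewed values (<code>toy_values</code> is an assumption for illustration). Quantile binning puts the same number of values in every bin, while uniform binning leaves most values in the first bin.
</div>

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# hypothetical skewed data: many small values and a few large ones
toy_values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 100, 200, 300, 400]).reshape(-1, 1)

quantile_binner = KBinsDiscretizer(n_bins=4, strategy="quantile", encode="ordinal")
quantile_bins = quantile_binner.fit_transform(toy_values).ravel().astype(int)

uniform_binner_skewed = KBinsDiscretizer(n_bins=4, strategy="uniform", encode="ordinal")
uniform_bins_skewed = uniform_binner_skewed.fit_transform(toy_values).ravel().astype(int)

# number of values that landed in each bin
quantile_counts = np.bincount(quantile_bins, minlength=4)
uniform_counts = np.bincount(uniform_bins_skewed, minlength=4)
print("quantile:", quantile_counts)
print("uniform: ", uniform_counts)
```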

In [None]:
AutoEncodePlot("quantile", 5, "cool") #quantile binning of 5 bins

<div style="color:white;
           display:fill;
           border-radius:5px;
           background-color:#7768F8;
           font-size:220%;
           font-family:Nexa;
           letter-spacing:0.5px">
        <p style="padding:15px;
              color:white;">
            <b> c. K - Means Binning </b>
        </p>
</div>
<div class="alert alert-block alert-info" style="font-size:16px; font-family:Nexa;">
    This runs K - Means Clustering on the values of the column and places the bin edges halfway between neighbouring cluster centers, so each value falls into the bin of its nearest center. The total number of bins is set by the input parameter. 
</div>
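<div class="alert alert-block alert-info" style="font-size:16px; font-family:Nexa;">
    A small sketch of K - Means binning on made-up values that form three obvious groups (<code>toy_groups</code> is an assumption for illustration). The learned edges fall between the groups instead of at evenly spaced positions.
</div>

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# hypothetical data with three well separated groups
toy_groups = np.array([1, 2, 3, 20, 21, 22, 50, 51, 52]).reshape(-1, 1)

kmeans_binner = KBinsDiscretizer(n_bins=3, strategy="kmeans", encode="ordinal")
kmeans_bins = kmeans_binner.fit_transform(toy_groups).ravel().astype(int)

# inner edges sit halfway between neighbouring cluster centers
kmeans_edges = kmeans_binner.bin_edges_[0]
print(kmeans_edges)
print(kmeans_bins)
```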

In [None]:
AutoEncodePlot("kmeans", 3, "viridis") #3 bins using k-means binning

<div style="color:white;
           display:fill;
           border-radius:5px;
           background-color:#EEB03B;
           font-size:220%;
           font-family:Nexa;
           letter-spacing:0.5px">
        <p style="padding: 15px;
              color:white;">
            <b> 3. Binarization </b>
        </p>
</div>
<div class="alert alert-block alert-info" style="font-size:16px; font-family:Nexa;">
    This is the simplest form of encoding numerical columns. You pick a threshold. Suppose the threshold is 50: any age greater than 50 goes into category 1, and any age of 50 or below goes into category 0. The result is a binary category, hence the name.
</div>
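<div class="alert alert-block alert-info" style="font-size:16px; font-family:Nexa;">
    A minimal sketch of the threshold rule on made-up ages (<code>toy_age_column</code> is an assumption for illustration). Note that an age of exactly 50 is not greater than the threshold, so it maps to 0.
</div>

```python
import numpy as np
from sklearn.preprocessing import Binarizer

# hypothetical ages, one column
toy_age_column = np.array([[30], [50], [51], [72]])

# values strictly greater than the threshold become 1, everything else becomes 0
binarized = Binarizer(threshold=50).fit_transform(toy_age_column).ravel()
print(binarized)
```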

In [None]:
#importing binarizing class
from sklearn.preprocessing import Binarizer

In [None]:
BinarCol = ColumnTransformer(transformers = [
    # ages above the threshold become 1, the rest 0; the default copy=True leaves X_train untouched
    ("agebin", Binarizer(threshold=50), [0])
]
)

In [None]:
# ravel() flattens the (n_rows, 1) output without hard-coding the number of rows
X = {"AgeCat":BinarCol.fit_transform(X_train).ravel(),
"Age": X_train["age"]}
pd.DataFrame(X).head(10)

<div class="alert alert-block alert-info" style="font-size:16px; font-family:Nexa;">
    See how the numerical column has been turned into a binary column. Any age above 50 becomes 1, and any age of 50 or below becomes 0. It is very easy. Try these yourself. <br>
</div>

***

<div style="color:white;
           display:fill;
           border-radius:5px;
           background-color:#EEB03B;
           font-size:220%;
           font-family:Nexa;
           letter-spacing:0.5px">
        <p style="padding: 15px;
              color:white;">
            <b> Conclusion </b>
        </p>
</div>
<div class="alert alert-block alert-info" style="font-size:16px; font-family:Nexa;">
    Binning rarely improves the accuracy of your model. Many models, tree-based ones for example, already split numerical columns at thresholds they learn themselves. Most of the time, the main benefit is visualization: you can gain an intuitive understanding of the data simply by categorizing the numerical columns. Still, try these techniques for yourself and see if your model improves in any way. It might improve if the number of bins is low and the bins divide the distribution in a meaningful way, but for that you will need domain knowledge. A big part of feature engineering is domain knowledge. If you have data where you know that every change of 1000 points has a similar effect on the outcome, then you know what to do. Bin the column into equal-width bins of 1000. Simple. That can benefit your hypothesis a lot. <br>
    However, I suggest you use trial and error as well. You never know what will increase the accuracy and give your model the edge you need. Happy kaggling!
</div>