
In this short example we will learn how to do:


*   1.1 OneHot Encoding using Pandas
*   1.2 OneHot Encoding using Scikit-Learn
*   2.1 Label Encoding using Scikit-Learn
*   2.2 Label Encoding using Pandas
*   3.1 Ordinal Encoding using Pandas
*   3.2 Ordinal Encoding using Scikit-Learn

In [0]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder,LabelEncoder,OrdinalEncoder

In [0]:
dataframe = pd.DataFrame({"Age":[20,21,22,23,23,45,43,48,50,60],
                         "Fav_Color":["RED","BLUE","BLUE","GREEEN","YELLOW","GREEEN","RED","BLUE","YELLOW","RED"],
                         "IQ_Level":["EXCELLENT","AVERAGE","AVERAGE","GOOD","VERY GOOD","EXCELLENT","AVERAGE","AVERAGE","GOOD","VERY GOOD"]}
                         )


In [69]:
dataframe

Unnamed: 0,Age,Fav_Color,IQ_Level
0,20,RED,EXCELLENT
1,21,BLUE,AVERAGE
2,22,BLUE,AVERAGE
3,23,GREEEN,GOOD
4,23,YELLOW,VERY GOOD
5,45,GREEEN,EXCELLENT
6,43,RED,AVERAGE
7,48,BLUE,AVERAGE
8,50,YELLOW,GOOD
9,60,RED,VERY GOOD


In [70]:
dataframe.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 3 columns):
Age          10 non-null int64
Fav_Color    10 non-null object
IQ_Level     10 non-null object
dtypes: int64(1), object(2)
memory usage: 368.0+ bytes


As we can see above, there are two object-type columns: "Fav_Color" and "IQ_Level". A quick examination tells us that the IQ_Level values can be compared with one another: AVERAGE < GOOD < VERY GOOD < EXCELLENT. So IQ_Level is ordinal ('O' for order among values and 'O' for ordinal, that's how I remember it). Fav_Color has no such order among its values, so it is nominal.
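The inspection described above can also be done programmatically. The following is a minimal sketch, assuming the same sample data as in this notebook; the names `categorical_columns` and `distinct_values` are only illustrative.

```python
import pandas as pd

dataframe = pd.DataFrame({
    "Age": [20, 21, 22, 23, 23, 45, 43, 48, 50, 60],
    "Fav_Color": ["RED", "BLUE", "BLUE", "GREEEN", "YELLOW",
                  "GREEEN", "RED", "BLUE", "YELLOW", "RED"],
    "IQ_Level": ["EXCELLENT", "AVERAGE", "AVERAGE", "GOOD", "VERY GOOD",
                 "EXCELLENT", "AVERAGE", "AVERAGE", "GOOD", "VERY GOOD"],
})

# Treat every non-numeric column as a candidate for categorical encoding
categorical_columns = [
    col for col in dataframe.columns
    if not pd.api.types.is_numeric_dtype(dataframe[col])
]

# Distinct values per categorical column, sorted alphabetically for display
distinct_values = {
    col: sorted(dataframe[col].unique()) for col in categorical_columns
}
for col, values in distinct_values.items():
    print(col, "->", values)
```

Whether a column is ordinal or nominal still has to be decided by looking at what the values mean; the code only lists them.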





# SECTION 1: ONE HOT ENCODING

1.1 OneHot Encoding using Pandas

In [0]:
#pandas get_dummies() returns a one-hot-encoded dataframe without the original column on which the
#one-hot encoding was applied. The other columns of the original "dataframe" are retained
dataframe_onehot_pd = pd.get_dummies(prefix=["Fav_Color"],columns=["Fav_Color"],data=dataframe)

Note that the original column, "**Fav_Color**", on which we applied One Hot Encoding has not been included in the new dataframe, **dataframe_onehot_pd**. If we want to retain that column in dataframe_onehot_pd, we need to add it back explicitly.

In [0]:
dataframe_onehot_pd["Fav_Color"] = dataframe["Fav_Color"]

In [73]:
dataframe_onehot_pd

Unnamed: 0,Age,IQ_Level,Fav_Color_BLUE,Fav_Color_GREEEN,Fav_Color_RED,Fav_Color_YELLOW,Fav_Color
0,20,EXCELLENT,0,0,1,0,RED
1,21,AVERAGE,1,0,0,0,BLUE
2,22,AVERAGE,1,0,0,0,BLUE
3,23,GOOD,0,1,0,0,GREEEN
4,23,VERY GOOD,0,0,0,1,YELLOW
5,45,EXCELLENT,0,1,0,0,GREEEN
6,43,AVERAGE,0,0,1,0,RED
7,48,AVERAGE,1,0,0,0,BLUE
8,50,GOOD,0,0,0,1,YELLOW
9,60,VERY GOOD,0,0,1,0,RED
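An alternative way to keep the original column is to encode only that column and join the result back. This is a minimal sketch, not the approach used above; `dataframe_onehot_keep` is an illustrative name. Passing `dtype=int` is an assumption made here so that the new columns hold 0/1, because pandas 2.0 and later return boolean columns from `get_dummies` by default.

```python
import pandas as pd

dataframe = pd.DataFrame({
    "Age": [20, 21, 22, 23, 23, 45, 43, 48, 50, 60],
    "Fav_Color": ["RED", "BLUE", "BLUE", "GREEEN", "YELLOW",
                  "GREEEN", "RED", "BLUE", "YELLOW", "RED"],
})

# Encode only the Fav_Color Series; dtype=int yields 0/1 instead of True/False
dummies = pd.get_dummies(dataframe["Fav_Color"], prefix="Fav_Color", dtype=int)

# join() aligns on the index, so the original Fav_Color column is kept as-is
dataframe_onehot_keep = dataframe.join(dummies)
print(dataframe_onehot_keep.head())
```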


1.2 OneHot Encoding using Scikit-Learn

In [0]:
oh_instance = OneHotEncoder()
#fit_transform expects a 2d array. ".values" converts the pandas Series to a 1d array and ".reshape(-1,1)" converts that into a 2d array with 1 column.
#fit_transform is then applied on this 2d array and returns a "sparse matrix", which consumes less memory.
#To build a DataFrame from it, we convert the sparse matrix to a dense 2d numpy array by calling "toarray()"
one_hot = oh_instance.fit_transform(dataframe["Fav_Color"].values.reshape(-1,1)).toarray()
#we got a 2d array but the column headers are still missing. We can get them from oh_instance.categories_.
#oh_instance.categories_ returns a list with only one element, which is an array of the categories (i.e. BLUE, GREEEN, RED, YELLOW).
#oh_instance.categories_[0] returns this whole array and oh_instance.categories_[0][i] accesses each element of it so that column headers
#can be created
dataframe_onehot_sk = pd.DataFrame(one_hot,columns=["Fav_Color_"+str(oh_instance.categories_[0][i]) for i in range(len(oh_instance.categories_[0]))])
#dataframe_onehot_sk only has the new columns created during one-hot encoding. Normally we want these columns to be added back to the original
#dataframe. (However, here I am doing the opposite, i.e. adding the original dataframe columns to dataframe_onehot_sk, since I need "dataframe" in its original form for showing the other encoding techniques.)
dataframe_onehot_sk = pd.concat([dataframe,dataframe_onehot_sk],axis=1)

In [75]:
dataframe_onehot_sk

Unnamed: 0,Age,Fav_Color,IQ_Level,Fav_Color_BLUE,Fav_Color_GREEEN,Fav_Color_RED,Fav_Color_YELLOW
0,20,RED,EXCELLENT,0.0,0.0,1.0,0.0
1,21,BLUE,AVERAGE,1.0,0.0,0.0,0.0
2,22,BLUE,AVERAGE,1.0,0.0,0.0,0.0
3,23,GREEEN,GOOD,0.0,1.0,0.0,0.0
4,23,YELLOW,VERY GOOD,0.0,0.0,0.0,1.0
5,45,GREEEN,EXCELLENT,0.0,1.0,0.0,0.0
6,43,RED,AVERAGE,0.0,0.0,1.0,0.0
7,48,BLUE,AVERAGE,1.0,0.0,0.0,0.0
8,50,YELLOW,GOOD,0.0,0.0,0.0,1.0
9,60,RED,VERY GOOD,0.0,0.0,1.0,0.0


In [76]:
print(oh_instance.categories_[0].shape)

(4,)


# SECTION 2: LABEL ENCODING

2.1 Label Encoding using Scikit-Learn

In [77]:
#returns a normal 1d numpy array
dataframe_labelenc_array = LabelEncoder().fit_transform(dataframe["IQ_Level"])
#Create a dataframe out of the array and add a column header
dataframe_IQ_Level = pd.DataFrame(dataframe_labelenc_array,columns=["Label_Encoded"])
#add the above one-column Dataframe to the original "dataframe"
dataframe_labelenc_sk = pd.concat([dataframe,dataframe_IQ_Level],axis=1)
dataframe_labelenc_sk


Unnamed: 0,Age,Fav_Color,IQ_Level,Label_Encoded
0,20,RED,EXCELLENT,1
1,21,BLUE,AVERAGE,0
2,22,BLUE,AVERAGE,0
3,23,GREEEN,GOOD,2
4,23,YELLOW,VERY GOOD,3
5,45,GREEEN,EXCELLENT,1
6,43,RED,AVERAGE,0
7,48,BLUE,AVERAGE,0
8,50,YELLOW,GOOD,2
9,60,RED,VERY GOOD,3
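The codes in the output above follow the alphabetical order of the labels, because `LabelEncoder` sorts the classes it finds. A minimal sketch of how to inspect that mapping and reverse it, using only the IQ_Level values from this notebook (the names `code_to_label` and `decoded` are illustrative):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

iq_level = pd.Series(
    ["EXCELLENT", "AVERAGE", "AVERAGE", "GOOD", "VERY GOOD",
     "EXCELLENT", "AVERAGE", "AVERAGE", "GOOD", "VERY GOOD"],
    name="IQ_Level",
)

label_encoder = LabelEncoder()
codes = label_encoder.fit_transform(iq_level)

# classes_ holds the sorted labels; the position of a label is its code
code_to_label = dict(enumerate(label_encoder.classes_))
print(code_to_label)

# inverse_transform maps the integer codes back to the original labels
decoded = label_encoder.inverse_transform(codes)
```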


2.2 Label Encoding using Pandas

In [78]:
#factorize returns a tuple with 2 entries - the first is the array of label-encoded values, the second holds the unique values
dataframe_labelencoded_array = pd.factorize(dataframe["IQ_Level"])[0]
#Create a dataframe out of the array and add a column header
dataframe_IQ_Lvl = pd.DataFrame(dataframe_labelencoded_array,columns=["Label_Encoded"])
#add the above one-column Dataframe to the original "dataframe"
dataframe_labelenc_pd = pd.concat([dataframe,dataframe_IQ_Lvl],axis=1)
dataframe_labelenc_pd

Unnamed: 0,Age,Fav_Color,IQ_Level,Label_Encoded
0,20,RED,EXCELLENT,0
1,21,BLUE,AVERAGE,1
2,22,BLUE,AVERAGE,1
3,23,GREEEN,GOOD,2
4,23,YELLOW,VERY GOOD,3
5,45,GREEEN,EXCELLENT,0
6,43,RED,AVERAGE,1
7,48,BLUE,AVERAGE,1
8,50,YELLOW,GOOD,2
9,60,RED,VERY GOOD,3
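The pandas codes differ from the Scikit-Learn ones because `pd.factorize` numbers the values in order of first appearance, whereas `LabelEncoder` sorts them. A minimal sketch showing both behaviours of `factorize`, using only the IQ_Level values (variable names are illustrative):

```python
import pandas as pd

iq_level = pd.Series(
    ["EXCELLENT", "AVERAGE", "AVERAGE", "GOOD", "VERY GOOD",
     "EXCELLENT", "AVERAGE", "AVERAGE", "GOOD", "VERY GOOD"],
    name="IQ_Level",
)

# Default: codes are assigned in order of first appearance
codes_by_appearance, uniques_by_appearance = pd.factorize(iq_level)

# sort=True: unique values are sorted first, giving alphabetical codes
codes_sorted, uniques_sorted = pd.factorize(iq_level, sort=True)

print(list(uniques_by_appearance), codes_by_appearance.tolist())
print(list(uniques_sorted), codes_sorted.tolist())
```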


# SECTION 3: ORDINAL ENCODING

3.1 Ordinal Encoding using Pandas

In [79]:
label_encode_mapping = {"AVERAGE":0,"GOOD":1,"VERY GOOD":2,"EXCELLENT":3}
series_ordinalenc_pd = dataframe["IQ_Level"].map(label_encode_mapping)
#We got a Series. So let's set the column header by creating a DataFrame from the Series values (otherwise the new
#ordinal-encoded column would also be named "IQ_Level")
dataframe_ordinalenc_pd = pd.DataFrame(series_ordinalenc_pd.values,columns=["IQ_Level_ord_enc"])
dataframe_ordinalenc_pd = pd.concat([dataframe,dataframe_ordinalenc_pd],axis=1)
dataframe_ordinalenc_pd


Unnamed: 0,Age,Fav_Color,IQ_Level,IQ_Level_ord_enc
0,20,RED,EXCELLENT,3
1,21,BLUE,AVERAGE,0
2,22,BLUE,AVERAGE,0
3,23,GREEEN,GOOD,1
4,23,YELLOW,VERY GOOD,2
5,45,GREEEN,EXCELLENT,3
6,43,RED,AVERAGE,0
7,48,BLUE,AVERAGE,0
8,50,YELLOW,GOOD,1
9,60,RED,VERY GOOD,2
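Another pandas option is an ordered categorical, which stores the order together with the data instead of in a separate dictionary. This is a minimal sketch of that alternative, not the method used above; `ordered_levels` and `iq_categorical` are illustrative names.

```python
import pandas as pd

iq_level = pd.Series(
    ["EXCELLENT", "AVERAGE", "AVERAGE", "GOOD", "VERY GOOD",
     "EXCELLENT", "AVERAGE", "AVERAGE", "GOOD", "VERY GOOD"],
    name="IQ_Level",
)

# Lowest to highest; the position in this list becomes the code
ordered_levels = ["AVERAGE", "GOOD", "VERY GOOD", "EXCELLENT"]
iq_categorical = pd.Categorical(iq_level, categories=ordered_levels, ordered=True)

# Integer codes that respect the declared order
ordinal_codes = iq_categorical.codes
print(ordinal_codes.tolist())

# Ordered categoricals also support comparisons against a level
above_good = iq_categorical > "GOOD"
```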


3.2 Ordinal Encoding using Scikit-Learn

In [0]:
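#since fit_transform() expects a 2d array, we could either do dataframe["IQ_Level"].values.reshape(-1,1), the way we did
#earlier, or pass dataframe[["IQ_Level"]], which is a DataFrame and hence 2d. fit_transform() returns a dense 2d array here,
#so no toarray() call is needed.
#Note: without an explicit "categories" argument, OrdinalEncoder orders the categories alphabetically
#(AVERAGE=0, EXCELLENT=1, GOOD=2, VERY GOOD=3), which is not the real ranking of IQ_Level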
label_encoded_2darray = OrdinalEncoder().fit_transform(dataframe[["IQ_Level"]])
#Let's convert this 2d-array to a DataFrame
dataframe_label_encoded = pd.DataFrame(label_encoded_2darray.astype(int),columns=["IQ_Level_ord_enc"])
dataframe_label_encoded = pd.concat([dataframe,dataframe_label_encoded],axis=1)


In [81]:
dataframe_label_encoded

Unnamed: 0,Age,Fav_Color,IQ_Level,IQ_Level_ord_enc
0,20,RED,EXCELLENT,1
1,21,BLUE,AVERAGE,0
2,22,BLUE,AVERAGE,0
3,23,GREEEN,GOOD,2
4,23,YELLOW,VERY GOOD,3
5,45,GREEEN,EXCELLENT,1
6,43,RED,AVERAGE,0
7,48,BLUE,AVERAGE,0
8,50,YELLOW,GOOD,2
9,60,RED,VERY GOOD,3
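In the output above EXCELLENT is encoded as 1 and GOOD as 2, because `OrdinalEncoder` falls back to alphabetical order when no order is given. A minimal sketch of passing the intended order through the `categories` parameter, so that the result matches the pandas mapping from 3.1 (the name `ordered_levels` is illustrative):

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

dataframe = pd.DataFrame({
    "IQ_Level": ["EXCELLENT", "AVERAGE", "AVERAGE", "GOOD", "VERY GOOD",
                 "EXCELLENT", "AVERAGE", "AVERAGE", "GOOD", "VERY GOOD"],
})

# Lowest to highest; one inner list per encoded column
ordered_levels = ["AVERAGE", "GOOD", "VERY GOOD", "EXCELLENT"]
ordinal_encoder = OrdinalEncoder(categories=[ordered_levels])

# fit_transform returns a 2d float array with one column per input column
encoded_2d = ordinal_encoder.fit_transform(dataframe[["IQ_Level"]])

dataframe_ordinal = dataframe.copy()
dataframe_ordinal["IQ_Level_ord_enc"] = encoded_2d.astype(int).ravel()
print(dataframe_ordinal)
```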
