
We are going to apply different encoding techniques to the BigMart Sales data from Kaggle.

Link : https://www.kaggle.com/brijbhushannanda1979/bigmart-sales-data

Things to learn -

* Identifying data types as ordinal, nominal and continuous.
* Applying different types of encoding.
* Challenges with different encoding techniques.
* Choosing the appropriate encoding techniques.

In [None]:
import pandas as pd #import pandas
import numpy as np #import numpy
from sklearn.preprocessing import LabelEncoder  #importing LabelEncoder


In [None]:
train = pd.read_csv('../input/bigmart-sales-data/Train.csv')

In [None]:
#check the head of dataset
train.head(5)

In [None]:
#check the size of the dataset
print('Data has {} Number of rows'.format(train.shape[0]))
print('Data has {} Number of columns'.format(train.shape[1]))

In [None]:
#check the information of the dataset
train.info()


As we can see here, we have 7 categorical variables and 5 numeric variables. The first task is to identify these categorical variables as nominal or ordinal.
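As a side note, the split between categorical and numeric columns does not have to be read off `info()` by eye. The sketch below uses `select_dtypes` on a tiny made-up frame (the rows are toy values for illustration, not rows from the BigMart data); the same two calls should work on `train`.

```python
import pandas as pd

# Toy frame with made-up values, only to illustrate the idea
toy = pd.DataFrame({
    'Item_Identifier': ['A1', 'B2'],
    'Item_Weight': [9.3, 5.9],
    'Outlet_Size': ['Medium', 'Small'],
    'Item_MRP': [249.8, 48.3],
})

# Numeric columns, and everything that is not numeric
num_cols = toy.select_dtypes(include='number').columns.tolist()
cat_cols = toy.select_dtypes(exclude='number').columns.tolist()

print('Categorical:', cat_cols)
print('Numeric:', num_cols)
```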

In [None]:
#let's keep our categorical variables in one table
cat_data = train[['Item_Identifier','Item_Fat_Content','Item_Type','Outlet_Identifier','Outlet_Size','Outlet_Location_Type','Outlet_Type']].copy()  #copy so that later column assignments do not act on a view of train

In [None]:
cat_data.head()   #check the head of categorical data

In [None]:
cat_data.nunique() #check the number of unique values in each column

Now think about which encoding technique we can apply here.

* The first thought would be to apply one hot encoding to the features that have 3-5 unique categories.
* But what if there is some kind of ordering between the categories? So first we should identify which variables are nominal and which are ordinal.
* Let's check them one by one.

In [None]:
#check the top 10 frequency in Item_Identifier
cat_data['Item_Identifier'].value_counts().head(10)

The values in Item_Identifier have no ordering, as we can see. This is a nominal categorical variable.

The first column has 1559 unique values. If we apply one hot encoding here (dropping the first category) we will get 1558 new features. Feeding this many sparse features into a model makes it more complex and prone to overfitting, which can reduce its accuracy.

In [None]:
pd.get_dummies(cat_data['Item_Identifier'],drop_first=True)  #applying one hot encoding

As expected, a single feature has now become 1558 features. So it is a bad idea to apply one hot encoding here. We should avoid one hot encoding when there are too many categories.

So one hot encoding has failed us here. To the rescue we could turn to label encoding, but we know that label encoding assigns integers to the categories in alphabetical order, which implies a ranking that does not exist for a nominal variable. So we cannot apply label encoding either.

So we have one thing left that we have learnt previously: binary encoding. Let's apply it and see what we get.

In [None]:
#apply binary encoding on Item_Identifier
import category_encoders as ce                              #import category_encoders
encoder = ce.BinaryEncoder(cols=['Item_Identifier'])        #create instance of binary encoder
df_binary = encoder.fit_transform(cat_data)                 #fit and transform on cat_data
df_binary.head(5)


The binary encoder has given us 11 new features, far fewer than the 1558 we were getting from one hot encoding. So binary encoding has rescued us here.

We have applied binary encoding, but the output gives us little intuition about how these new features are made. What happens is that the labels are first ordinal encoded, then each integer code is converted into its binary representation, and finally each digit of that binary string becomes a separate feature.

There are other, more intuitive ways to reduce the number of features. We will look at them later.
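To make the three steps described above concrete, here is a minimal hand-rolled sketch on a toy column. It only illustrates the idea and is not the `category_encoders` implementation; the item labels, the choice to start codes at 1, and the generated column names are assumptions made for the example.

```python
import pandas as pd

# Toy column with made-up labels (not taken from the BigMart data)
items = pd.Series(['A', 'B', 'C', 'D', 'E', 'B'])

# Step 1: ordinal encode, starting at 1, in order of first appearance
codes = pd.factorize(items)[0] + 1

# Step 2: number of binary digits needed for the largest code
n_bits = int(codes.max()).bit_length()

# Step 3: write each code in binary and give every digit its own column
binary_cols = pd.DataFrame(
    [[int(bit) for bit in format(int(code), f'0{n_bits}b')] for code in codes],
    columns=[f'Item_Identifier_{i}' for i in range(n_bits)],
)
print(binary_cols)

# The same arithmetic for 1559 categories: the largest code needs 11 digits
print((1559).bit_length())
```

Five categories need only 3 columns here, and by the same logic 1559 categories fit into 11 columns, which matches the number of features we got above.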

**Encoding Item_Fat_Content**

In [None]:
#check the unique values 
cat_data['Item_Fat_Content'].unique()

Here we have 5 unique values, but if we look at them closely there are really only 2: Low Fat and Regular. The others are just abbreviations or lowercase spellings of these two.

In [None]:
low_fat = ['LF','low fat']
cat_data['Item_Fat_Content'] = cat_data['Item_Fat_Content'].replace(low_fat,'Low Fat')   #replace 'LF' and 'low fat' with 'Low Fat'
cat_data['Item_Fat_Content'] = cat_data['Item_Fat_Content'].replace('reg','Regular')     #replace 'reg' with 'Regular'

In [None]:
cat_data['Item_Fat_Content'].unique()

Here we have 2 categories in Item_Fat_Content and there is an ordering between them: Low Fat has less fat content than Regular. So it is an ordinal variable.

In [None]:
#Apply LabelEncoder
le = LabelEncoder()
cat_data['Item_Fat_Content_temp'] = le.fit_transform(cat_data['Item_Fat_Content'])
print(cat_data['Item_Fat_Content'].head())
print(cat_data['Item_Fat_Content_temp'].head())

Here we only had 2 categories, 'Low Fat' and 'Regular', so LabelEncoder has worked. It has mapped:

* Low Fat ------- 0
* Regular ------- 1

Here the alphabetical ordering happened to match the real ordering, but you will not always be this lucky.
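A small sketch of a case where the luck runs out. The list below is a toy three-level feature whose real order is assumed to be Small < Medium < High; LabelEncoder sorts the categories alphabetically, so the codes come out reversed.

```python
from sklearn.preprocessing import LabelEncoder

# Toy ordinal feature; the intended order is Small < Medium < High
sizes = ['Small', 'Medium', 'High', 'Medium', 'Small']

le_demo = LabelEncoder()
encoded = le_demo.fit_transform(sizes)

print(list(le_demo.classes_))  # categories in the order LabelEncoder uses
print(list(encoded))           # 'High' gets 0 and 'Small' gets 2
```

The integers are valid labels, but they no longer reflect the size ordering, so a model that treats them as ordered would be misled.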

**We can use map to do ordinal encoding**

In [None]:
#prepare a dict to map
mapping = {'Low Fat' : 0,'Regular': 1} #map Low Fat as 0 and Regular as 1
cat_data['Item_Fat_Content_temp1'] = cat_data['Item_Fat_Content'].map(mapping)
cat_data['Item_Fat_Content_temp1'].head()


Mapping is useful when our categories have an ordering, because we decide explicitly which integer each category gets.
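One thing to watch for with `map`: any category missing from the dictionary becomes NaN. The sketch below uses a toy three-level feature, and the value 'Tiny' is a made-up unmapped category added only to show this behaviour.

```python
import pandas as pd

# Toy feature; 'Tiny' is deliberately left out of the mapping
sizes = pd.Series(['Small', 'High', 'Medium', 'Tiny'])

# We choose the order ourselves: Small < Medium < High
size_order = {'Small': 0, 'Medium': 1, 'High': 2}
encoded = sizes.map(size_order)

print(encoded)
print('Unmapped values:', encoded.isna().sum())
```

Checking `isna().sum()` after mapping is a cheap way to catch spelling variants that were not cleaned beforehand.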

**Use Pandas pd.factorize method.**

It assigns integer codes in the order in which the categories first appear. If Low Fat appears first it is encoded as 0 and Regular as 1; if Regular appears first, the codes are swapped.



In [None]:
factorized,index = pd.factorize(cat_data['Item_Fat_Content'])  #using pd.factorize it gives us factorized array and index values
print(factorized)
print(index)
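Because the codes depend on order of appearance, the same categories can get different codes in differently ordered data. The toy series below (made up for illustration) show this, along with the `sort=True` option of `pd.factorize`, which makes the codes follow sorted order instead.

```python
import pandas as pd

# Same categories, different order of first appearance
a = pd.Series(['Low Fat', 'Regular', 'Low Fat'])
b = pd.Series(['Regular', 'Low Fat', 'Low Fat'])

codes_a, uniques_a = pd.factorize(a)
codes_b, uniques_b = pd.factorize(b)
print(codes_a, list(uniques_a))
print(codes_b, list(uniques_b))

# sort=True assigns codes by sorted category order, independent of row order
codes_sorted, uniques_sorted = pd.factorize(b, sort=True)
print(codes_sorted, list(uniques_sorted))
```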

In this notebook we have seen 2 new encoding techniques:

* Mapping
* pd.factorize

We have seen the usage of different methods, their advantages and disadvantages.

In [None]:
#Let's look at item type column
print(cat_data['Item_Type'].nunique())  #check number of unique values
print(cat_data['Item_Type'].unique())   #check the unique values

We don't have any ordering between these categories, so we have to apply a nominal encoding technique. I leave it up to you to decide which technique to apply, and we will have a look at other techniques in our next notebook.

Link to kernel 3 : https://www.kaggle.com/krishnaheroor/encoding-technique-3