# Feature Encoding

## Datasets generally contain 2 types of features


1.   Numerical (Integer, floats)


2.   Categorical (Nominal, ordinal)

---
Most machine learning models cannot work with categorical values directly, so we need to convert them into numeric features.



### Categorical variables are of 2 types: ordinal and nominal

*   Ordinal variables have a natural order. (Good, Better, Best), (First, Second, Third)


*   Nominal variables have no ordering between them. (Cat, Dog, Monkey), (Apple, Banana, Mango)

We apply different encoding techniques depending on whether a categorical variable is ordinal or nominal.
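As a minimal sketch of how an ordinal variable can keep its order when converted to numbers, the example below uses `pd.Categorical` with an explicit category order. The `quality` values and the chosen order are illustrative assumptions and are not part of the dataset used in this notebook.

```python
import pandas as pd

# Hypothetical ordinal feature; the order Good < Better < Best is stated explicitly
quality = ['Good', 'Best', 'Better', 'Good']
order = ['Good', 'Better', 'Best']

# An ordered categorical assigns integer codes that follow the given order
quality_cat = pd.Categorical(quality, categories=order, ordered=True)
quality_codes = list(quality_cat.codes)
print(quality_codes)
```

For a nominal variable there is no such order to state, which is why a different technique is needed.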

In [1]:
# let's create a dataframe
import pandas as pd
df = pd.DataFrame ({'country' : ['India','U.S','Australia','India','Australia','India','U.S'],
                    'Age' : [44,34,28,27,30,42,25],
                    'Salary' : [72000,44000,35000,27000,32000,56000,45000],
                    'Purchased' : ['yes','no','yes','yes','no','yes','no']
                    })

In [2]:
df

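|   | country | Age | Salary | Purchased |
|---|---|---|---|---|
| 0 | India | 44 | 72000 | yes |
| 1 | U.S | 34 | 44000 | no |
| 2 | Australia | 28 | 35000 | yes |
| 3 | India | 27 | 27000 | yes |
| 4 | Australia | 30 | 32000 | no |
| 5 | India | 42 | 56000 | yes |
| 6 | U.S | 25 | 45000 | no |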


### Explore the Categorical Features

In [4]:
categorical_features = [feature for feature in df.columns if df[feature].dtypes=='O']
categorical_features

['country', 'Purchased']

In [8]:
for feature in categorical_features:
    print('The feature is {} and number of categories are {}'.format(feature,len(df[feature].unique())))

The feature is country and number of categories are 3
The feature is Purchased and number of categories are 2


In [3]:
# Let's check our dataframe
print(df)

     country  Age  Salary Purchased
0      India   44   72000       yes
1        U.S   34   44000        no
2  Australia   28   35000       yes
3      India   27   27000       yes
4  Australia   30   32000        no
5      India   42   56000       yes
6        U.S   25   45000        no


In [4]:
# check the datatypes
df.dtypes

country      object
Age           int64
Salary        int64
Purchased    object
dtype: object

## Here we have 2 categorical features

*   country
*   Purchased

---

Age and Salary have numeric values.

As noted above, we cannot pass categorical values directly to our models.


### Label Encoding

#### country

In [5]:
df['country'].unique() # check unique 

array(['India', 'U.S', 'Australia'], dtype=object)

In [6]:
df['country'].value_counts()

India        3
Australia    2
U.S          2
Name: country, dtype: int64

#### Purchased

In [7]:
df['Purchased'].unique()

array(['yes', 'no'], dtype=object)

In [8]:
df['Purchased'].value_counts()

yes    4
no     3
Name: Purchased, dtype: int64

## So here we have 3 categories in the country column


*   India
*   U.S
*   Australia



In label encoding, each category is assigned a unique integer from 0 to n-1, where n is the number of categories.

In [9]:
from sklearn.preprocessing import LabelEncoder  # import LabelEncoder from the sklearn library
le = LabelEncoder()    # create an instance of LabelEncoder

df['country_temp'] = le.fit_transform(df['country'])   # label-encode the country column

In [10]:
df['country_temp']

0    1
1    2
2    0
3    1
4    0
5    1
6    2
Name: country_temp, dtype: int32

Here we can see that the **country feature** has been transformed **into numeric values**. **Label encoding assigns the codes in alphabetical (sorted) order**, as we can see here.
*   Australia -----> 0
*   India  --------> 1
*   U.S   ---------> 2
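The mapping above can be reproduced by hand. This is only an illustrative sketch of the idea (sort the unique labels, then number them), not scikit-learn's actual implementation; the variable names are chosen for this example.

```python
countries = ['India', 'U.S', 'Australia', 'India', 'Australia', 'India', 'U.S']

# Sort the unique labels, then assign 0..n-1 in that order
classes = sorted(set(countries))
mapping = {label: code for code, label in enumerate(classes)}

# Replace every label by its integer code
encoded = [mapping[c] for c in countries]
print(mapping)
print(encoded)
```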

### Problem With Label Encoding
Here we have assigned the numeric values 0 (Australia), 1 (India) and 2 (U.S) in the same column. The problem is that a machine learning model may not treat these values as plain labels: since 0 < 1 < 2, it might interpret them as having an order. But our country feature has no ordering; we cannot say Australia < India < U.S.

We use **One Hot encoding** to overcome this problem. It is also known as nominal encoding. Here we create 3 different columns [India, Australia, U.S] and assign 1 if that label is present in a particular row and 0 otherwise.

In [11]:
# we will use get_dummies to do One Hot encoding
# (newer pandas versions return True/False instead of 1/0 unless dtype=int is passed)
pd.get_dummies(df['country']) # one column per category, 1 where the row has that category

|   | Australia | India | U.S |
|---|---|---|---|
| 0 | 0 | 1 | 0 |
| 1 | 0 | 0 | 1 |
| 2 | 1 | 0 | 0 |
| 3 | 0 | 1 | 0 |
| 4 | 1 | 0 | 0 |
| 5 | 0 | 1 | 0 |
| 6 | 0 | 0 | 1 |


*  Here in the first row, India is assigned 1 and Australia and U.S are assigned 0.
*  Similarly, in the 2nd row U.S is assigned 1 and the other columns are assigned 0.

We can drop the first column here, since it only adds a redundant feature.
Reason: even with just two columns, say India and U.S, the third label is implied. When both of these are 0, the row must be Australia.
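The redundancy argument can be checked with a short sketch. It rebuilds the country values as a standalone Series (so it runs on its own) and passes `dtype=int` so that the dummies are 0/1 regardless of the pandas version; the variable names are chosen for this example.

```python
import pandas as pd

country = pd.Series(['India', 'U.S', 'Australia', 'India', 'Australia', 'India', 'U.S'])

full = pd.get_dummies(country, dtype=int)                      # Australia, India, U.S
reduced = pd.get_dummies(country, drop_first=True, dtype=int)  # India, U.S

# A row belongs to the dropped category exactly when all remaining columns are 0
recovered_australia = 1 - reduced.sum(axis=1)
print((recovered_australia == full['Australia']).all())
```

If the check prints `True`, the dropped column carries no information that the other two do not already contain.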

In [12]:
# Dropping the first column
pd.get_dummies(df['country'],drop_first=True)

|   | India | U.S |
|---|---|---|
| 0 | 1 | 0 |
| 1 | 0 | 1 |
| 2 | 0 | 0 |
| 3 | 1 | 0 |
| 4 | 0 | 0 |
| 5 | 1 | 0 |
| 6 | 0 | 1 |


Here we have done one hot encoding on only a single feature, but real-world datasets often have many categorical features. Suppose our dataset has 50 categorical features with 3 different labels in each. If we apply one hot encoding, the number of features grows to 150 columns, or 100 if we drop the first column of each feature. This makes our model more complex.
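A quick way to check the column count is to build a toy frame and count the dummy columns. The frame below is a hypothetical stand-in (50 features named `feature_0` to `feature_49`, each with the labels `a`, `b`, `c`), not real data. One hot encoding gives 3 columns per feature, or 2 per feature with `drop_first=True`.

```python
import pandas as pd

# Hypothetical dataset: 50 categorical features, each with the same 3 labels
toy = pd.DataFrame({f'feature_{i}': ['a', 'b', 'c'] for i in range(50)})

# Count the columns produced with and without dropping the first category
n_full = pd.get_dummies(toy).shape[1]
n_reduced = pd.get_dummies(toy, drop_first=True).shape[1]
print(n_full, n_reduced)
```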

### Depending on the dataset, there are different techniques we can apply to overcome this dimensionality problem.

### Binary Encoding
This technique is not as intuitive as the previous ones. The labels are first ordinal-encoded as integers, the integers are then written as binary codes, and each digit of the binary code becomes a separate feature.

In [13]:
# create 1 more column occupation here
df['occupation'] = ['Self-employeed','Freelancer','Family-business','Data-scientist','Pensioner','Manager','Daily-wage-worker']
print(df['occupation'])

0       Self-employeed
1           Freelancer
2      Family-business
3       Data-scientist
4            Pensioner
5              Manager
6    Daily-wage-worker
Name: occupation, dtype: object


We have seven different categories here, and there is no ordering among them either.

In [14]:
# install category_encoders first
# Either open your Anaconda prompt and run: conda install -c conda-forge category_encoders (answer Y when prompted)
# or install it with pip:

!pip install category_encoders





In [15]:
# we will use BinaryEncoder from category_encoders library to do binary encoding
import category_encoders as ce
encoder = ce.BinaryEncoder(cols = ['occupation'])
df_binary = encoder.fit_transform(df)
print(df_binary)

     country  Age  Salary Purchased  country_temp  occupation_0  occupation_1  \
0      India   44   72000       yes             1             0             0   
1        U.S   34   44000        no             2             0             0   
2  Australia   28   35000       yes             0             0             0   
3      India   27   27000       yes             1             0             1   
4  Australia   30   32000        no             0             0             1   
5      India   42   56000       yes             1             0             1   
6        U.S   25   45000        no             2             0             1   

   occupation_2  occupation_3  
0             0             1  
1             1             0  
2             1             1  
3             0             0  
4             0             1  
5             1             0  
6             1             1  


### Let's have a look at how 'Binary Encoding' actually works

![binary.PNG](attachment:binary.PNG)
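The steps can also be sketched by hand. This is an illustration of the idea only, not the `category_encoders` implementation. It assumes the ordinal codes are assigned in order of first appearance starting at 1, which is consistent with the printed output above. It uses just as many binary digits as the largest code needs, so it yields three columns, and its hypothetical column names are shifted by one relative to the four-column output above.

```python
import pandas as pd

occupations = ['Self-employeed', 'Freelancer', 'Family-business', 'Data-scientist',
               'Pensioner', 'Manager', 'Daily-wage-worker']

# Step 1: ordinal codes in order of first appearance, starting at 1 (assumption)
codes = {}
for label in occupations:
    codes.setdefault(label, len(codes) + 1)

# Step 2: number of binary digits needed for the largest code
width = max(codes.values()).bit_length()

# Step 3: write each code in binary and split the digits into separate columns
rows = [[int(bit) for bit in format(codes[label], f'0{width}b')] for label in occupations]
manual_binary = pd.DataFrame(rows, columns=[f'occupation_{i}' for i in range(width)])
print(manual_binary)
```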

### Why binary encoding plays a vital role among encoding techniques

- We had **7 different categories** in occupation. **One hot encoding** would have given us **7 features**, while **binary encoding** needs only **3** binary digits to distinguish them. (In the output above there are four `occupation_*` columns, but `occupation_0` is 0 in every row, so only three carry information.)


- **Binary encoding** is very useful when we have **many categories within a single feature**. It helps us reduce the dimensionality.
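To see how the saving grows with the number of categories, the sketch below compares column counts. The helper names are hypothetical, and it assumes the ordinal codes run from 1 to n (as in the occupation example), so the number of binary columns is the bit length of n; a particular library version may add or drop a column.

```python
def one_hot_columns(n_categories):
    # One column per category
    return n_categories

def binary_columns(n_categories):
    # Codes run from 1 to n_categories, so we need enough bits for the largest code
    return n_categories.bit_length()

for n in (7, 100, 1000):
    print(n, one_hot_columns(n), binary_columns(n))
```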

'''We have seen 3 basic feature encoding techniques here; there are many more.
              We will look at them with some practical uses and some real-world datasets.'''