## Feature Encoding in Machine Learning

We can generally divide categorical variables (features) into three types:

 - Binary: Two groups.
 
        (Yes, No), (True, False)
        
 - Ordinal: Specific ordered groups.

         economic status ("low income", "middle income", "high income"),
         
         education level ("high school", "BS", "MS", "PhD"),
         
         income level ("less than 50K", "50K-100K", "over 100K"),
         
         satisfaction rating ("extremely dislike", "dislike", "neutral", "like", "extremely like")
        
 - Nominal: Unordered groups.
 
        (cat, dog, tiger), (pizza, burger, coke)

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [2]:
# load the dataset
df = pd.read_excel("https://raw.githubusercontent.com/atulpatelDS/Data_Files/master/Feature_Engineering/feature_encoding.xlsx")

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26 entries, 0 to 25
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   CUSTID             26 non-null     int64 
 1   PAYMENT_MODE_CARD  26 non-null     object
 2   CITY               26 non-null     object
 3   RATING             26 non-null     object
 4   EDUCATION          26 non-null     object
 5   PURCHAGE_AMOUNT    26 non-null     int64 
 6   CUSTOMER_TYPE      20 non-null     object
dtypes: int64(2), object(5)
memory usage: 1.6+ KB


In [4]:
df_train = df.loc[0:19]
df_test = df.loc[20:]

In [5]:
# Let's examine the binary, ordinal and nominal columns in the dataset.
# Binary : "PAYMENT_MODE_CARD","CUSTOMER_TYPE"
# Ordinal : "RATING","EDUCATION"
# Nominal : "CITY"

## Encoding of Binary Features

- Binary features have only two (usable) values.
- A simple mapping always works for binary features, for example with `map`, `replace` or `apply`.
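
One detail worth checking with a dictionary mapping is what happens to values the dictionary does not cover. The sketch below uses a small hypothetical Series (not the notebook's data) with one unexpected value, and assumes that a missing key should surface as NaN rather than pass through silently.

```python
import pandas as pd

# Hypothetical toy column; the Y/N values mirror the flags used in this notebook,
# and "maybe" stands in for an unexpected entry.
payment = pd.Series(["Y", "N", "Y", "maybe"], name="PAYMENT_MODE_CARD")

mapping = {"Y": 1, "N": 0}

# map() turns every value that is missing from the dictionary into NaN,
# which makes unexpected categories easy to spot.
mapped = payment.map(mapping)

# Count the values that the mapping did not cover.
n_unmapped = int(mapped.isna().sum())
print(mapped.tolist(), n_unmapped)
```

Counting the NaN values right after the mapping is a cheap sanity check before the column is passed to a model.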

In [6]:
df_train.head()

Unnamed: 0,CUSTID,PAYMENT_MODE_CARD,CITY,RATING,EDUCATION,PURCHAGE_AMOUNT,CUSTOMER_TYPE
0,1,Y,A,extremely like,phd,200,Premium
1,2,N,A,like,ms,100,NonPremium
2,3,Y,C,neutral,bs,130,NonPremium
3,4,Y,B,extremely dislike,bs,120,NonPremium
4,5,Y,D,extremely like,ms,160,Premium


In [7]:
# Work on an explicit copy so the column assignments below do not raise SettingWithCopyWarning
df_train = df_train.copy()

In [8]:
df_train['PAYMENT_MODE_CARD'] = df_train['PAYMENT_MODE_CARD'].map({'Y':1,'N':0})

In [9]:
df_train.head(2)

Unnamed: 0,CUSTID,PAYMENT_MODE_CARD,CITY,RATING,EDUCATION,PURCHAGE_AMOUNT,CUSTOMER_TYPE
0,1,1,A,extremely like,phd,200,Premium
1,2,0,A,like,ms,100,NonPremium


## Feature Encoding of Ordinal Features

 - Use this encoding technique when the categorical feature is ordinal. Retaining the order is important here, so the encoding should reflect the sequence.
 - Each label is converted into an integer value.

### OrdinalEncoder- Sklearn

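By default, each label is assigned a unique integer based on the sorted (alphabetical) order of the labels, which is not necessarily their true rank. In the output below, "extremely like" is encoded as 2.0 and "like" as 3.0. To encode the real order, pass the ordered labels through the `categories` parameter.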

In [10]:
df_train.head()

Unnamed: 0,CUSTID,PAYMENT_MODE_CARD,CITY,RATING,EDUCATION,PURCHAGE_AMOUNT,CUSTOMER_TYPE
0,1,1,A,extremely like,phd,200,Premium
1,2,0,A,like,ms,100,NonPremium
2,3,1,C,neutral,bs,130,NonPremium
3,4,1,B,extremely dislike,bs,120,NonPremium
4,5,1,D,extremely like,ms,160,Premium


In [11]:
from sklearn.preprocessing import OrdinalEncoder
oenc_feat = OrdinalEncoder()
df_train[["RATING",'EDUCATION']] = oenc_feat.fit_transform(df_train[["RATING",'EDUCATION']])

In [12]:
df_train.head(2)

Unnamed: 0,CUSTID,PAYMENT_MODE_CARD,CITY,RATING,EDUCATION,PURCHAGE_AMOUNT,CUSTOMER_TYPE
0,1,1,A,2.0,2.0,200,Premium
1,2,0,A,3.0,1.0,100,NonPremium
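
To make the integers follow the real rank of the labels, `OrdinalEncoder` accepts one ordered list per column through `categories`. The sketch below uses a hypothetical toy frame that reuses this notebook's label strings; the two orderings are assumptions about what the ranks should be.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy frame that reuses the RATING and EDUCATION labels from this notebook.
toy = pd.DataFrame({
    "RATING": ["extremely like", "like", "neutral", "extremely dislike", "dislike"],
    "EDUCATION": ["phd", "ms", "bs", "bs", "ms"],
})

# One ordered list per column, lowest rank first (assumed orderings).
rating_order = ["extremely dislike", "dislike", "neutral", "like", "extremely like"]
education_order = ["bs", "ms", "phd"]

# The position of a label in its list becomes its integer code.
ordered_enc = OrdinalEncoder(categories=[rating_order, education_order])
encoded = ordered_enc.fit_transform(toy[["RATING", "EDUCATION"]])
print(encoded)
```

With this setup "extremely dislike" maps to 0 and "extremely like" to 4, so larger codes mean higher satisfaction.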


### LabelEncoder- Sklearn

 - Each label is assigned a unique integer based on the sorted (alphabetical) order of the labels.
 - scikit-learn intends `LabelEncoder` for target labels: it encodes them with values between 0 and n_classes-1 and works on one column at a time.

In [13]:
from sklearn.preprocessing import LabelEncoder
lenc = LabelEncoder()

In [14]:
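# Only one column at a time. RATING already holds the codes from In [11],
# so this step simply recodes the floats 2.0, 3.0, ... as the integers 2, 3, ...
df_train['RATING'] = lenc.fit_transform(df_train["RATING"])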
df_train.head(2)

Unnamed: 0,CUSTID,PAYMENT_MODE_CARD,CITY,RATING,EDUCATION,PURCHAGE_AMOUNT,CUSTOMER_TYPE
0,1,1,A,2,2.0,200,Premium
1,2,0,A,3,1.0,100,NonPremium
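
Since `LabelEncoder` is meant for targets, a typical use is encoding a label column and later translating predictions back. The sketch below is a minimal illustration on hypothetical values modelled on the CUSTOMER_TYPE column, not the notebook's actual rows.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical target values modelled on the CUSTOMER_TYPE column.
target = pd.Series(["Premium", "NonPremium", "NonPremium", "Premium"])

target_enc = LabelEncoder()
y = target_enc.fit_transform(target)

# classes_ holds the sorted labels; the index of each label is its integer code.
print(list(target_enc.classes_), list(y))

# inverse_transform recovers the original labels from the codes,
# which is useful for turning model predictions back into readable labels.
restored = target_enc.inverse_transform(y)
print(list(restored))
```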


## Feature Encoding of Nominal Features

### One-Hot Encoding

- Every unique value in the category is added as a feature and represented as a one-hot vector.
- Each category is mapped to a binary variable containing either 0 or 1, where 0 represents the absence and 1 the presence of that category.
- These newly created binary features are known as dummy variables.
- A variable with k levels yields k dummy variables, or k-1 when one level is dropped as the baseline (`drop_first=True`).
- Suppose a dataset has a "City" category with cities such as "Delhi", "Mumbai", "Bangalore" and "Pune". The CITY feature in this dataset (values A, B, C, D) is one-hot encoded in the same way.

In [15]:
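# drop_first=True drops the first level (A), which becomes the all-zero baseline.
# The same encoding can be done with sklearn as well.
df_train_dummy = pd.get_dummies(df_train["CITY"],drop_first=True)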
df_train = pd.concat([df_train,df_train_dummy.astype(int)],axis = 1)
df_train.head(2)

Unnamed: 0,CUSTID,PAYMENT_MODE_CARD,CITY,RATING,EDUCATION,PURCHAGE_AMOUNT,CUSTOMER_TYPE,B,C,D
0,1,1,A,2,2.0,200,Premium,0,0,0
1,2,0,A,3,1.0,100,NonPremium,0,0,0
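
The scikit-learn route mentioned in the code comment can be sketched as follows. It fits the encoder on training rows and reuses it on held-out rows, so both get the same columns. The toy frames, the `CITY_` column prefix and the unseen city "E" are assumptions made for illustration; unlike the `get_dummies` call above, this sketch keeps all levels instead of dropping the first one.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: the test frame contains a city ("E") never seen in training.
train = pd.DataFrame({"CITY": ["A", "B", "C", "D", "A"]})
test = pd.DataFrame({"CITY": ["B", "E"]})

# handle_unknown="ignore" encodes unseen categories as an all-zero row
# instead of raising an error at transform time.
ohe = OneHotEncoder(handle_unknown="ignore")
ohe.fit(train[["CITY"]])

# Build column names from the categories learned during fit.
cols = [f"CITY_{c}" for c in ohe.categories_[0]]

# The default output is a sparse matrix, so convert it to a dense array first.
test_ohe = pd.DataFrame(
    ohe.transform(test[["CITY"]]).toarray().astype(int),
    columns=cols,
    index=test.index,
)
print(test_ohe)
```

Fitting once and transforming both splits avoids the column mismatch that can occur when `get_dummies` is called separately on train and test data.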


### Frequency Encoding

We can also encode a category by how often it occurs, replacing each label with its relative frequency in the training data. This method can be effective at times for nominal features.

In [25]:
df_train.head(2)

Unnamed: 0,CUSTID,PAYMENT_MODE_CARD,CITY,RATING,EDUCATION,PURCHAGE_AMOUNT,CUSTOMER_TYPE,B,C,D
0,1,1,A,2,2.0,200,Premium,0,0,0
1,2,0,A,3,1.0,100,NonPremium,0,0,0


In [17]:
# Let's group the "CITY" variable by frequency
frq = df_train.groupby("CITY").size()

In [18]:
frq

CITY
A    7
B    6
C    4
D    3
dtype: int64

In [19]:
len(df_train)

20

In [37]:
# Count within df_train only, so the frequencies sum to 1 and no test rows leak in
frq_dis = (df_train.groupby("CITY").size())/len(df_train)
frq_dis

CITY
A    0.35
B    0.30
C    0.20
D    0.15
dtype: float64

In [38]:
df_train["CITY_FRQ_ENC"] = df_train.CITY.map(frq_dis)

In [40]:
df_train.head(2)

Unnamed: 0,CUSTID,PAYMENT_MODE_CARD,CITY,RATING,EDUCATION,PURCHAGE_AMOUNT,CUSTOMER_TYPE,B,C,D,CITY_FRQ_ENC
0,1,1,A,2,2.0,200,Premium,0,0,0,0.35
1,2,0,A,3,1.0,100,NonPremium,0,0,0,0.35
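
When the encoding is later applied to held-out rows, the frequencies are usually taken from the training rows only and reused as a lookup table. The sketch below rebuilds the training city counts shown earlier (7, 6, 4, 3) in a toy frame; the test rows, the unseen city "E" and the choice to fill unseen cities with 0.0 are assumptions.

```python
import pandas as pd

# Toy training frame with the same city counts as df_train (A=7, B=6, C=4, D=3).
train = pd.DataFrame({"CITY": ["A"] * 7 + ["B"] * 6 + ["C"] * 4 + ["D"] * 3})
# Hypothetical held-out rows, including a city that never appears in training.
test = pd.DataFrame({"CITY": ["A", "D", "E"]})

# Relative frequency of each city, computed from the training rows only.
freq = train["CITY"].value_counts(normalize=True)

train["CITY_FRQ_ENC"] = train["CITY"].map(freq)
# Unseen categories have no training frequency; 0.0 is one possible fallback.
test["CITY_FRQ_ENC"] = test["CITY"].map(freq).fillna(0.0)
print(test)
```

Note that two categories with the same frequency receive the same code, so the model cannot tell them apart from this feature alone.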
