# Feature/Data Encoding

Feature encoding is the process of transforming `categorical` features into `numeric` features. This is necessary because most machine learning algorithms operate on numeric input and cannot use text categories directly. There are many ways to encode categorical features, and each method has its own advantages and disadvantages. In this notebook, we explore some of the most popular methods:

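- Label Encoding
  - Converts each category to an integer (0, 1, 2, 3, ...).
- Ordinal Encoding
  - Like label encoding, but we define the order of the categories ourselves.
- One-Hot Encoding
  - Creates one 0/1 column per category, so each row contains a single 1.
- Binary Encoding
  - Like one-hot encoding, but it writes each category's integer code in binary, which needs fewer columns.
- Frequency/Count Encoding
  - Replaces each category with how often it occurs in the feature (its `value_counts`).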

Choosing a suitable encoding also keeps the number of dimensions, and therefore the computation required, manageable when this data is fed to ML models/algorithms.

Computers work with numbers far more easily than with text.

## Benefits

The benefits of feature encoding are:

- Algorithm compatibility.
- Efficiency and performance.
  - Faster computation and more efficient storage.
- Feature representation.
  - The same thing can go by different names, for example in different languages, which can introduce bias. Writing it as 1 or 2 gives that feature a single consistent representation.
- Support for unseen categories.
- Lower memory usage.
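
As a rough illustration of the memory point, the sketch below compares a text column with the same information stored as integer codes. The column is a hypothetical stand-in built inline, not one of the datasets used later, and the exact byte counts depend on the pandas version.

```python
import pandas as pd

# Hypothetical column: 10,000 repeated text labels.
days = pd.Series(["Thur", "Fri", "Sat", "Sun"] * 2500)

# The same information as small integer codes (one code per category).
codes = days.astype("category").cat.codes

text_bytes = days.memory_usage(deep=True)
code_bytes = codes.memory_usage(deep=True)
print(f"text column:  {text_bytes:,} bytes")
print(f"coded column: {code_bytes:,} bytes")
```

The coded column should come out much smaller, because each row stores a one-byte integer instead of a string.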

In [1]:
# import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
# data load
df = sns.load_dataset('tips')
df.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.5,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4


In [3]:
df['time'].value_counts()

time
Dinner    176
Lunch      68
Name: count, dtype: int64

- **Label Encoder**

In [4]:
# encode the time column with sklearn's LabelEncoder
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df['encoded_time'] = le.fit_transform(df['time'])
df.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,encoded_time
0,16.99,1.01,Female,No,Sun,Dinner,2,0
1,10.34,1.66,Male,No,Sun,Dinner,3,0
2,21.01,3.5,Male,No,Sun,Dinner,3,0
3,23.68,3.31,Male,No,Sun,Dinner,2,0
4,24.59,3.61,Female,No,Sun,Dinner,4,0


In [5]:
df['encoded_time'].value_counts()

encoded_time
0    176
1     68
Name: count, dtype: int64
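
To check which integer stands for which category, a fitted `LabelEncoder` exposes `classes_` and `inverse_transform`. The sketch below uses a small hand-made stand-in for the `time` column (an assumption, so that it runs without downloading the dataset).

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Small stand-in for the `time` column of the tips dataset.
time = pd.Series(["Dinner", "Dinner", "Lunch", "Dinner", "Lunch"])

le = LabelEncoder()
codes = le.fit_transform(time)

# classes_ holds the categories in sorted order; the position is the code.
mapping = {str(label): code for code, label in enumerate(le.classes_)}
print(mapping)

# inverse_transform turns the codes back into the original labels.
decoded = le.inverse_transform(codes)
print(list(decoded))
```

Because the classes are sorted, `Dinner` gets 0 and `Lunch` gets 1, which agrees with the counts shown above.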

- **Ordinal Encoding**

In [6]:
df['day'].value_counts()

day
Sat     87
Sun     76
Thur    62
Fri     19
Name: count, dtype: int64

In [7]:
# ordinal encoding the day column using specific order
from sklearn.preprocessing import OrdinalEncoder
oe = OrdinalEncoder(categories=[['Thur', 'Fri', 'Sat', 'Sun']])
df['encoded_day'] = oe.fit_transform(df[['day']])
df.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,encoded_time,encoded_day
0,16.99,1.01,Female,No,Sun,Dinner,2,0,3.0
1,10.34,1.66,Male,No,Sun,Dinner,3,0,3.0
2,21.01,3.5,Male,No,Sun,Dinner,3,0,3.0
3,23.68,3.31,Male,No,Sun,Dinner,2,0,3.0
4,24.59,3.61,Female,No,Sun,Dinner,4,0,3.0


In [8]:
df['encoded_day'].value_counts()

encoded_day
2.0    87
3.0    76
0.0    62
1.0    19
Name: count, dtype: int64
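
The same ordered mapping can also be sketched with pandas alone, using an ordered `Categorical`. The `day` values below are a small made-up sample, and the `order` list repeats the one passed to `OrdinalEncoder` above.

```python
import pandas as pd

# Small stand-in for the `day` column.
day = pd.Series(["Sun", "Thur", "Sat", "Fri", "Sun"])

# The order we want the integers to follow.
order = ["Thur", "Fri", "Sat", "Sun"]

# An ordered Categorical stores one integer code per row, following `order`.
ordered_day = pd.Categorical(day, categories=order, ordered=True)
encoded_day = pd.Series(ordered_day.codes, index=day.index)
print(encoded_day.tolist())
```

Unlike `OrdinalEncoder`, which returns floats such as `3.0`, this gives integer codes. A value that is not in `order` would receive the code -1.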

- **One-Hot Encoding**

In [9]:
# one hot encoding on sex column
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder()
ohe.fit_transform(df[['sex']]).toarray()

array([[1., 0.],
       [0., 1.],
       [0., 1.],
       [0., 1.],
       [1., 0.],
       ...])

In [10]:
# example of one hot encoding
titanic = sns.load_dataset('titanic')
titanic.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


In [13]:
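# one-hot encode the embarked column and append the new columns to titanic
onehot_encoder = OneHotEncoder()
# fit_transform returns a sparse matrix, so convert it to a dense array
# before building a DataFrame from it
embarked_onehot = onehot_encoder.fit_transform(titanic[['embarked']]).toarray()
embarked_onehot_df = pd.DataFrame(embarked_onehot, columns=onehot_encoder.get_feature_names_out(['embarked']))
titanic = pd.concat([titanic.reset_index(drop=True), embarked_onehot_df.reset_index(drop=True)], axis=1)
titanic.head()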

- **Binary Encoding**

In [15]:
# !pip install category_encoders

In [16]:
df = sns.load_dataset('tips')
df.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.5,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4


In [17]:
from category_encoders import BinaryEncoder

binary_encoder = BinaryEncoder()
df_binary = binary_encoder.fit_transform(df['day'])

In [18]:
df_binary

Unnamed: 0,day_0,day_1,day_2
0,0,0,1
1,0,0,1
2,0,0,1
3,0,0,1
4,0,0,1
...,...,...,...
239,0,1,0
240,0,1,0
241,0,1,0
242,0,1,0
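
To see where the three `day_*` columns come from, the idea behind binary encoding can be sketched by hand with pandas. This is an illustration on a small made-up sample, not the internals of `category_encoders`; numbering the categories from 1 in order of first appearance is an assumption chosen because it reproduces the rows shown above.

```python
import pandas as pd

# Small stand-in for the `day` column.
day = pd.Series(["Sun", "Sun", "Sat", "Thur", "Fri", "Sat"], name="day")

# Step 1: one integer per category, numbered from 1 in order of first appearance.
codes = pd.factorize(day)[0] + 1  # Sun=1, Sat=2, Thur=3, Fri=4

# Step 2: how many bits the largest code needs (4 needs 3 bits).
n_bits = int(codes.max()).bit_length()

# Step 3: one column per bit, most significant bit first.
binary = pd.DataFrame(
    {f"day_{i}": (codes >> (n_bits - 1 - i)) & 1 for i in range(n_bits)},
    index=day.index,
)
print(binary)
```

Four categories fit into three columns here, whereas one-hot encoding would need four; the saving grows with the number of categories.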


- **Using Pandas for Feature Encoding**

In [19]:
# use pandas for feature encoding

df = sns.load_dataset('tips')
df.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.5,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4


In [20]:
# one-hot encode the day column with pandas get_dummies
df_dummies = pd.get_dummies(df, columns=['day'])
df_dummies.head()

Unnamed: 0,total_bill,tip,sex,smoker,time,size,day_Thur,day_Fri,day_Sat,day_Sun
0,16.99,1.01,Female,No,Dinner,2,False,False,False,True
1,10.34,1.66,Male,No,Dinner,3,False,False,False,True
2,21.01,3.5,Male,No,Dinner,3,False,False,False,True
3,23.68,3.31,Male,No,Dinner,2,False,False,False,True
4,24.59,3.61,Female,No,Dinner,4,False,False,False,True
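
`pd.get_dummies` also accepts `drop_first` and `dtype`, which may be useful when one column per category is redundant or when 0/1 integers are preferred over booleans. A minimal sketch on a small made-up frame (`tips_small` is a hypothetical stand-in for the tips data):

```python
import pandas as pd

# Hypothetical miniature of the tips data.
tips_small = pd.DataFrame(
    {"tip": [1.01, 1.66, 3.50, 3.31], "day": ["Sun", "Thur", "Sat", "Fri"]}
)

# drop_first=True drops the first category (here "Fri", the first in sorted
# order); dtype=int writes 0/1 instead of False/True.
dummies = pd.get_dummies(tips_small, columns=["day"], drop_first=True, dtype=int)
print(dummies)
```

A row with zeros in all remaining `day_*` columns then stands for the dropped category.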


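1. `Label encoding`: This assigns each category an arbitrary integer. It is mainly used for target labels, or for features where the numeric order does not matter to the model. scikit-learn's `LabelEncoder` numbers the categories in sorted order, so for a variable called "education level" with categories "high school", "college", and "graduate school", it assigns 0 to "college", 1 to "graduate school", and 2 to "high school". When the order of the categories is meaningful, use ordinal encoding instead.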

2. `One-hot encoding`: This is used when the categories are nominal, and there is no inherent order or hierarchy between the categories. For example, if you have a variable called "color" with categories "red", "green", and "blue", you can use one-hot encoding to create three binary variables, where each variable represents one category. The value of the variable is 1 if the category is present, and 0 otherwise.

3. `Ordinal encoding`: This is used when the categories are ordinal, and the order of the categories is important. For example, if you have a variable called "income level" with categories "low", "medium", and "high", you can use ordinal encoding to assign the values 0, 1, and 2 to these categories, respectively.

4. `Hash encoding`: This is used when the categories are nominal, and there are too many categories to use one-hot encoding. Hash encoding uses a hash function to map each category to one of a fixed number of columns, so the output width stays the same no matter how many categories exist. The column a category hashes to is set to 1, and the others are 0. Different categories can collide in the same column, which is the price of the fixed size.

5. `Frequency encoding`: This is used when the categories are nominal, and the frequency of each category is informative. For example, if you have a variable called "city" with categories "New York", "Los Angeles", and "Chicago", you can use frequency encoding to replace each category with the share of rows it appears in, such as 0.33, 0.25, and 0.17 if those are the cities' relative frequencies in the dataset.
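
Hash and frequency encoding have no cell of their own in this notebook, so here is a minimal sketch of both on a made-up `city` column. The helper `hash_bucket`, the use of MD5, and the choice of 4 buckets are assumptions made for illustration; dedicated encoders in libraries may work differently.

```python
import hashlib

import pandas as pd

# Made-up column for illustration.
city = pd.Series(
    ["New York", "Chicago", "New York", "Los Angeles", "New York", "Chicago"],
    name="city",
)

# Frequency encoding: replace each category with its share of the rows.
freq = city.value_counts(normalize=True)
city_freq = city.map(freq)
print(city_freq.round(2).tolist())

# Hash encoding: map each category to one of a fixed number of buckets.
N_BUCKETS = 4  # hypothetical size; real data would usually use more


def hash_bucket(value: str, n_buckets: int = N_BUCKETS) -> int:
    """Return a stable bucket index for a category (hypothetical helper)."""
    # hashlib is used instead of the built-in hash(), which changes between runs.
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets


buckets = city.map(hash_bucket)

# One 0/1 column per bucket; reindex so unused buckets still get a column.
city_hashed = pd.get_dummies(buckets, prefix="city_hash", dtype=int)
city_hashed = city_hashed.reindex(
    columns=[f"city_hash_{i}" for i in range(N_BUCKETS)], fill_value=0
)
print(city_hashed)
```

The hashed frame always has `N_BUCKETS` columns regardless of how many cities appear, and two different cities may land in the same bucket.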