# Feature Encoding

Feature encoding is the process of transforming `categorical features` into `numeric features`. This is necessary because most machine learning algorithms expect numeric input. There are many ways to encode categorical features, and each method has its own advantages and disadvantages. In this notebook, we explore some of the most popular methods:

- Label encoding
- Ordinal encoding
- One-hot encoding
- Binary encoding

In [1]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
# data load
df = sns.load_dataset('tips')
df.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.5,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4


In [7]:
df['time'].value_counts()

time
Dinner    176
Lunch      68
Name: count, dtype: int64

In [10]:
df['encoded_time'].value_counts()

encoded_time
0    176
1     68
Name: count, dtype: int64

In [15]:
# Label-encode the time column with scikit-learn's LabelEncoder
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, OrdinalEncoder
le = LabelEncoder()
df['encoded_time'] = le.fit_transform(df['time'])
df.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,encoded_time
0,16.99,1.01,Female,No,Sun,Dinner,2,0
1,10.34,1.66,Male,No,Sun,Dinner,3,0
2,21.01,3.5,Male,No,Sun,Dinner,3,0
3,23.68,3.31,Male,No,Sun,Dinner,2,0
4,24.59,3.61,Female,No,Sun,Dinner,4,0
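
To see which integer each label received, the fitted encoder can be inspected. The following is a minimal sketch on a small hand-made stand-in for the `time` column (the sample values are an assumption, not the full dataset). `LabelEncoder` stores the sorted labels in `classes_`, which is why `Dinner` maps to 0 and `Lunch` to 1.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Small stand-in for the `time` column of the tips dataset (made-up sample)
time_values = pd.Series(["Dinner", "Lunch", "Dinner", "Lunch", "Dinner"])

le = LabelEncoder()
codes = le.fit_transform(time_values)

# classes_ holds the labels in sorted order; the position is the integer code
mapping = {str(label): code for code, label in enumerate(le.classes_)}
print(mapping)

# inverse_transform recovers the original labels from the integer codes
decoded = [str(label) for label in le.inverse_transform(codes)]
print(decoded)
```

Keeping the fitted encoder around is what makes it possible to translate model outputs back into readable labels later.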


In [16]:
df['day'].value_counts()

day
Sat     87
Sun     76
Thur    62
Fri     19
Name: count, dtype: int64

In [22]:
df['encoded_day'].value_counts()

encoded_day
2.0    87
3.0    76
0.0    62
1.0    19
Name: count, dtype: int64

In [21]:
# ordinal encoding the day column using specific order
oe = OrdinalEncoder(categories=[['Thur', 'Fri', 'Sat', 'Sun']])
df['encoded_day'] = oe.fit_transform(df[['day']])
df.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,encoded_time,encoded_day
0,16.99,1.01,Female,No,Sun,Dinner,2,0,3.0
1,10.34,1.66,Male,No,Sun,Dinner,3,0,3.0
2,21.01,3.5,Male,No,Sun,Dinner,3,0,3.0
3,23.68,3.31,Male,No,Sun,Dinner,2,0,3.0
4,24.59,3.61,Female,No,Sun,Dinner,4,0,3.0
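
`OrdinalEncoder` returns floats by default, which is why the codes above appear as `3.0`. As an alternative sketch, an ordered pandas `Categorical` produces integer codes for the same day order. The short `days` series below is a made-up sample rather than the real column.

```python
import pandas as pd

# Made-up sample of day values; the order list matches the one used above
days = pd.Series(["Sun", "Thur", "Sat", "Fri", "Sun"])
day_order = ["Thur", "Fri", "Sat", "Sun"]

# An ordered Categorical assigns integer codes by position in `categories`
ordered_days = pd.Categorical(days, categories=day_order, ordered=True)
day_codes = pd.Series(ordered_days.codes, index=days.index)

print(day_codes.tolist())
```

With this approach, any value that is not listed in `categories` receives the code -1, so it is worth checking for negative codes before training.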


In [25]:
# one hot encoding on sex column
ohe = OneHotEncoder()
ohe.fit_transform(df[['sex']]).toarray()

array([[1., 0.],
       [0., 1.],
       [0., 1.],
       [0., 1.],
       [1., 0.],
       [0., 1.],
       [0., 1.],
       [0., 1.],
       [0., 1.],
       [0., 1.],
       [0., 1.],
       [1., 0.],
       [0., 1.],
       [0., 1.],
       [1., 0.],
       [0., 1.],
       [1., 0.],
       [0., 1.],
       [1., 0.],
       [0., 1.],
       [0., 1.],
       [1., 0.],
       [1., 0.],
       [0., 1.],
       [0., 1.],
       [0., 1.],
       [0., 1.],
       [0., 1.],
       [0., 1.],
       [1., 0.],
       [0., 1.],
       [0., 1.],
       [1., 0.],
       [1., 0.],
       [0., 1.],
       [0., 1.],
       [0., 1.],
       [1., 0.],
       [0., 1.],
       [0., 1.],
       [0., 1.],
       [0., 1.],
       [0., 1.],
       [0., 1.],
       [0., 1.],
       [0., 1.],
       [0., 1.],
       [0., 1.],
       [0., 1.],
       [0., 1.],
       [0., 1.],
       [1., 0.],
       [1., 0.],
       [0., 1.],
       [0., 1.],
       [0., 1.],
       [0., 1.],
       [1., 0.],
       [0., 1.

In [27]:
# example of one hot encoding
titanic = sns.load_dataset('titanic')
titanic.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


In [26]:
# example of one hot encoding
titanic = sns.load_dataset('titanic')

# sparse_output replaces the older `sparse` argument (scikit-learn >= 1.2)
onehot_encoder = OneHotEncoder(sparse_output=False)
embarked_onehot = onehot_encoder.fit_transform(titanic[['embarked']])
embarked_onehot_df = pd.DataFrame(embarked_onehot, columns=onehot_encoder.get_feature_names_out(['embarked']))
titanic = pd.concat([titanic.reset_index(drop=True), embarked_onehot_df.reset_index(drop=True)], axis=1)
titanic.head()



Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone,embarked_C,embarked_Q,embarked_S,embarked_nan
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False,0.0,0.0,1.0,0.0
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False,1.0,0.0,0.0,0.0
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True,0.0,0.0,1.0,0.0
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False,0.0,0.0,1.0,0.0
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True,0.0,0.0,1.0,0.0


In [2]:
!pip install category_encoders


In [31]:
df = sns.load_dataset('tips')
df.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.5,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4


In [35]:
from category_encoders import BinaryEncoder

binary_encoder = BinaryEncoder()
df_binary = binary_encoder.fit_transform(df['day'])
df_binary

Unnamed: 0,day_0,day_1,day_2
0,0,0,1
1,0,0,1
2,0,0,1
3,0,0,1
4,0,0,1
...,...,...,...
239,0,1,0
240,0,1,0
241,0,1,0
242,0,1,0


## Assignment: how many types of feature encoding are there?

## Assignment: When to use which type of feature encoding?

# When to Use Different Encoding Techniques

When dealing with categorical data in machine learning, different types of encoding techniques are used to convert categorical values into numerical ones. The choice of encoding technique depends on the type of categorical data, the number of unique categories, and the specific requirements of the machine learning model. Here’s a summary of when to use each type of encoding:

## 1. One-Hot Encoding (OHE)

- **Use Case**:
  - When the categorical feature has a small number of unique categories.
  - When the categories are nominal (no inherent order).
  - When the model you're using can handle a large number of features (because OHE increases the dimensionality).
- **How It Works**:
  - Each category is converted into a new binary column (0 or 1). For example, if you have a "Color" column with values "Red," "Green," and "Blue," OHE will create three new columns, one for each color.
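
The "Color" example can be sketched with `pd.get_dummies`. The small data frame is made up for illustration, and `dtype=int` is used only so the output shows 0/1 rather than booleans.

```python
import pandas as pd

# Made-up "Color" column with three distinct categories
colors = pd.DataFrame({"Color": ["Red", "Green", "Blue", "Green"]})

# One new 0/1 column per category; column names are sorted alphabetically
color_dummies = pd.get_dummies(colors, columns=["Color"], dtype=int)

print(color_dummies)
```

Each row contains exactly one 1, in the column that matches its original color.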

## 2. Label Encoding (LE)

- **Use Case**:
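  - When the categorical feature has an ordinal relationship (inherent order) between the categories, and the integer codes are assigned in that order.
  - When there are a large number of unique categories, and you want to avoid the high dimensionality caused by OHE.
  - Suitable for tree-based algorithms like decision trees or random forests, which split on thresholds and are less sensitive to the exact spacing of the integer codes.
- **How It Works**:
  - Each category is assigned a unique integer value. For example, "Low" = 1, "Medium" = 2, "High" = 3. Note that scikit-learn's `LabelEncoder` assigns codes in sorted (alphabetical) order and is intended for target labels; to control the order for a feature, specify it explicitly, as with `OrdinalEncoder(categories=...)` earlier in this notebook.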
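
The "Low" / "Medium" / "High" mapping can be reproduced with an explicit dictionary, which is one simple way to guarantee that the integers follow the intended order. The sample values and the name `level_map` are assumptions made for this sketch.

```python
import pandas as pd

# Made-up ordinal feature
levels = pd.Series(["Low", "High", "Medium", "Low"])

# Explicit mapping so the integer codes follow the intended order
level_map = {"Low": 1, "Medium": 2, "High": 3}
encoded_levels = levels.map(level_map)

print(encoded_levels.tolist())
```

Values missing from the dictionary would become `NaN`, so the mapping should cover every category present in the data.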

## 3. Binary Encoding (BE)

- **Use Case**:
  - When the categorical feature has a large number of unique categories.
  - When you want to reduce the dimensionality compared to OHE but still avoid the ordinal nature of Label Encoding.
  - Suitable for algorithms that benefit from reduced dimensionality without assuming any inherent order in the data.
- **How It Works**:
  - Each category is first converted into a numerical value (similar to Label Encoding), and then that number is converted into binary form. The binary digits are used as features.
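  - For example, if a category is assigned the integer 5, it is represented in binary as 101, so it receives the values 1, 0, 1 across three binary columns. The columns are shared by all categories, so n categories need only about log2(n) columns rather than n. In the `BinaryEncoder` output above, the four values of `day` are spread across three columns (`day_0`, `day_1`, `day_2`).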
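
The two steps (category to integer, integer to binary digits) can be sketched by hand. This is an illustrative re-implementation, not the code used by `category_encoders`; the function name `binary_encode`, the `bit_` column names, and the choice to start codes at 1 are assumptions.

```python
import pandas as pd

def binary_encode(values):
    """Hand-rolled binary encoding: category -> integer -> one column per bit."""
    categories = sorted(set(values))
    # Step 1: assign integer codes, starting at 1 (an assumption of this sketch)
    codes = {cat: i for i, cat in enumerate(categories, start=1)}
    # Number of bits needed to represent the largest code
    n_bits = max(codes.values()).bit_length()
    # Step 2: write each code as a zero-padded binary string, one digit per column
    rows = [[int(bit) for bit in format(codes[v], f"0{n_bits}b")] for v in values]
    columns = [f"bit_{i}" for i in range(n_bits)]
    return pd.DataFrame(rows, columns=columns), codes

# Five made-up categories fit into three binary columns
cities = ["A", "B", "C", "D", "E", "A"]
encoded, codes = binary_encode(cities)

print(codes)
print(encoded)
```

Here category `E` receives the integer 5 and therefore the row 1, 0, 1, matching the example in the text.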

## Summary of When to Use

- **One-Hot Encoding (OHE)**: Use for nominal categorical data with a small number of categories.
- **Label Encoding (LE)**: Use for ordinal categorical data (with codes assigned in the intended order) or with tree-based algorithms that tolerate integer-coded categories.
- **Binary Encoding (BE)**: Use for categorical data with a large number of categories where you want to reduce dimensionality without introducing ordinality.

These encoding methods help in converting categorical data into a format that machine learning algorithms can understand and process effectively.



In [5]:
# Tell me different scenarios where I would use different types of feature encoding
# 1.  encoding features for text data
# 2.  encoding features for image data
# 3.  encoding features for audio data
# 4.  encoding features for categorical data
# 5.  encoding features for numerical data
# 6.  encoding features for time series data
# 7.  encoding features for spatial data
# 8.  encoding features for network data
# 9.  encoding features for graph data
# 10. encoding features for text data with sentiment analysis
# 11. encoding features for text data with named entity recognition
# 12. encoding features for text data with topic modeling
# 13. encoding features for text data with word embeddings
# 14. encoding features for image data with object detection
# 15. encoding features for image data with image segmentation
# 16. encoding features for audio data with speech recognition
# 17. encoding features for categorical data with clustering
# 18. encoding features for numerical data with regression
# 19. encoding features for time series data with forecasting
# 20. encoding features for spatial data with clustering
# Now tell me the conditions under which I should use One-Hot Encoding (OHE), Label Encoding (LE), and Binary Encoding (BE)


In [36]:
# use pandas for feature encoding

df = sns.load_dataset('tips')
df.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.5,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4


In [41]:
# use pandas get dummies
get_dummies = pd.get_dummies(df, columns=['day'])
get_dummies.head()

Unnamed: 0,total_bill,tip,sex,smoker,time,size,day_Thur,day_Fri,day_Sat,day_Sun
0,16.99,1.01,Female,No,Dinner,2,False,False,False,True
1,10.34,1.66,Male,No,Dinner,3,False,False,False,True
2,21.01,3.5,Male,No,Dinner,3,False,False,False,True
3,23.68,3.31,Male,No,Dinner,2,False,False,False,True
4,24.59,3.61,Female,No,Dinner,4,False,False,False,True
