## [Encoding Categorical Data](https://towardsdatascience.com/encoding-categorical-data-explained-a-visual-guide-with-code-example-for-beginners-b169ac4193ae/)

> Six ways of matchmaking categories and numbers

Categorical data is like the descriptive labels we use in everyday life. It represents characteristics or qualities that can be grouped into categories.

#### Types of Categorical Data

1. **Nominal**: These are categories with no inherent order, such as `Outlook` (sunny, rainy, overcast).
2. **Ordinal**: These categories have a meaningful order, such as `Temperature` (Low, High, Extreme).
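In pandas, the nominal/ordinal distinction can be made explicit with the `Categorical` type. The following is a minimal sketch, assuming the order Low < High < Extreme for temperature and using a few values borrowed from the dataset built below:

```python
import pandas as pd

# Nominal: no order among the categories.
outlook = pd.Categorical(['sunny', 'rainy', 'overcast', 'sunny'])

# Ordinal: the category order is declared explicitly
# (assumed here: Low < High < Extreme).
temperature = pd.Categorical(
    ['High', 'Low', 'Extreme', 'Low'],
    categories=['Low', 'High', 'Extreme'],
    ordered=True,
)

print(outlook.ordered)                       # False
print(temperature.ordered)                   # True
print(temperature.min(), temperature.max())  # Low Extreme
```

Only the ordered categorical supports `min()` and `max()`, which is one practical consequence of declaring an order.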

In [1]:
!pip install -q numpy pandas scikit-learn matplotlib

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import warnings
warnings.filterwarnings('ignore')

data = {
    'Date': ['03-25', '03-26', '03-27', '03-28', '03-29', '03-30', '03-31', '04-01', '04-02', '04-03', '04-04', '04-05'],
    'Weekday': ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun', 'Mon', 'Tue', 'Wed', 'Thu', 'Fri'],
    'Month': ['Mar', 'Mar', 'Mar', 'Mar', 'Mar', 'Mar', 'Mar', 'Apr', 'Apr', 'Apr', 'Apr', 'Apr'],
    'Temperature': ['High', 'Low', 'High', 'Extreme', 'Low', 'High', 'High', 'Low', 'High', 'Extreme', 'High', 'Low'],
    'Humidity': ['Dry', 'Humid', 'Dry', 'Dry', 'Humid', 'Humid', 'Dry', 'Humid', 'Dry', 'Dry', 'Humid', 'Dry'],
    'Wind': ['No', 'Yes', 'Yes', 'Yes', 'No', 'No', 'Yes', 'No', 'Yes', 'Yes', 'No', 'Yes'],
    'Outlook': ['sunny', 'rainy', 'overcast', 'sunny', 'rainy', 'overcast', 'sunny', 'rainy', 'sunny', 'overcast', 'sunny', 'rainy'],
    'Crowdedness': [85, 30, 65, 45, 25, 90, 95, 35, 70, 50, 80, 45]
}

# Create a DataFrame from the dictionary
df = pd.DataFrame(data)

df.head()

,Date,Weekday,Month,Temperature,Humidity,Wind,Outlook,Crowdedness
0,03-25,Mon,Mar,High,Dry,No,sunny,85
1,03-26,Tue,Mar,Low,Humid,Yes,rainy,30
2,03-27,Wed,Mar,High,Dry,Yes,overcast,65
3,03-28,Thu,Mar,Extreme,Dry,Yes,sunny,45
4,03-29,Fri,Mar,Low,Humid,No,rainy,25


#### Label Encoding

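Label Encoding assigns a unique integer to each category in a categorical variable. `pd.factorize` numbers the categories in order of first appearance, so `Mon` becomes 0, `Tue` becomes 1, and so on. The resulting integers imply an order, which is arbitrary when the variable is nominal.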

In [3]:
# 1. Label Encoding for Weekday
df['Weekday_label'] = pd.factorize(df['Weekday'])[0]

df.head()

,Date,Weekday,Month,Temperature,Humidity,Wind,Outlook,Crowdedness,Weekday_label
0,03-25,Mon,Mar,High,Dry,No,sunny,85,0
1,03-26,Tue,Mar,Low,Humid,Yes,rainy,30,1
2,03-27,Wed,Mar,High,Dry,Yes,overcast,65,2
3,03-28,Thu,Mar,Extreme,Dry,Yes,sunny,45,3
4,03-29,Fri,Mar,Low,Humid,No,rainy,25,4
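The integer assigned to a category depends on the tool. As a rough comparison (not part of the original walkthrough), the sketch below puts `pd.factorize` next to scikit-learn's `LabelEncoder`, which sorts the categories before numbering them:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

weekdays = pd.Series(['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun', 'Mon'])

# pd.factorize numbers categories in order of first appearance.
codes, uniques = pd.factorize(weekdays)
print(codes.tolist())          # [0, 1, 2, 3, 4, 5, 6, 0]

# LabelEncoder sorts the categories first, so the integers
# follow alphabetical order rather than weekday order.
encoder = LabelEncoder()
sk_codes = encoder.fit_transform(weekdays)
print(list(encoder.classes_))  # ['Fri', 'Mon', 'Sat', 'Sun', 'Thu', 'Tue', 'Wed']
print(sk_codes.tolist())       # [1, 5, 6, 4, 0, 2, 3, 1]
```

Neither numbering is more correct than the other for a nominal variable. Note that scikit-learn documents `LabelEncoder` as a tool for target labels rather than input features.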


#### One-Hot Encoding
One-Hot Encoding creates a new binary column for each category in a categorical variable. Each row has a 1 in the column of its own category and 0 in the others.

In [4]:
# 2. One-Hot Encoding for Outlook
df = pd.get_dummies(df, columns=['Outlook'], prefix='Outlook', dtype=int)

df.head()

,Date,Weekday,Month,Temperature,Humidity,Wind,Crowdedness,Weekday_label,Outlook_overcast,Outlook_rainy,Outlook_sunny
0,03-25,Mon,Mar,High,Dry,No,85,0,0,0,1
1,03-26,Tue,Mar,Low,Humid,Yes,30,1,0,1,0
2,03-27,Wed,Mar,High,Dry,Yes,65,2,1,0,0
3,03-28,Thu,Mar,Extreme,Dry,Yes,45,3,0,0,1
4,03-29,Fri,Mar,Low,Humid,No,25,4,0,1,0
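`pd.get_dummies` derives its columns from whatever data it is given, so new data containing an unseen category would produce a different set of columns. One possible alternative, sketched here with scikit-learn's `OneHotEncoder`, learns the categories once and reuses them. The `'foggy'` value is a hypothetical unseen category added for illustration:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({'Outlook': ['sunny', 'rainy', 'overcast', 'sunny']})
new = pd.DataFrame({'Outlook': ['rainy', 'foggy']})  # 'foggy' was never seen in training

# handle_unknown='ignore' encodes unseen categories as all zeros
# instead of raising an error.
encoder = OneHotEncoder(handle_unknown='ignore')
encoder.fit(train)

# The transform result is sparse by default; convert it for display.
encoded = encoder.transform(new).toarray()
print(encoder.categories_[0].tolist())  # ['overcast', 'rainy', 'sunny']
print(encoded.tolist())                 # [[0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
```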


#### Binary Encoding
Binary Encoding represents each category as a binary number, with one column per bit. For a two-category variable such as `Wind`, a single 0/1 column is enough.

In [5]:
# 3. Binary Encoding for Wind
df['Wind_binary'] = (df['Wind'] == 'Yes').astype(int)

df.head()

,Date,Weekday,Month,Temperature,Humidity,Wind,Crowdedness,Weekday_label,Outlook_overcast,Outlook_rainy,Outlook_sunny,Wind_binary
0,03-25,Mon,Mar,High,Dry,No,85,0,0,0,1,0
1,03-26,Tue,Mar,Low,Humid,Yes,30,1,0,1,0,1
2,03-27,Wed,Mar,High,Dry,Yes,65,2,1,0,0,1
3,03-28,Thu,Mar,Extreme,Dry,Yes,45,3,0,0,1,1
4,03-29,Fri,Mar,Low,Humid,No,25,4,0,1,0,0
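For a variable with more than two categories, binary encoding first assigns each category an integer and then writes that integer in base 2, one column per bit. The sketch below applies this to the seven weekdays. The `Weekday_bit*` column names are made up for illustration:

```python
import numpy as np
import pandas as pd

weekdays = pd.Series(['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'])

# Step 1: one integer code per category (order of first appearance).
codes, uniques = pd.factorize(weekdays)

# Step 2: number of bits needed to represent the largest code.
n_bits = max(1, (len(uniques) - 1).bit_length())

# Step 3: extract each bit, most significant first.
shifts = np.arange(n_bits - 1, -1, -1)
bits = (codes[:, None] >> shifts) & 1

binary_df = pd.DataFrame(
    bits,
    columns=[f'Weekday_bit{i}' for i in range(n_bits)],  # hypothetical names
    index=weekdays.index,
)
print(binary_df.assign(Weekday=weekdays))
```

Seven categories fit in three bit columns, compared with seven columns under one-hot encoding.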


#### Target Encoding
Target Encoding replaces each category with the mean of the target variable for that category. Here the target is `Crowdedness`.

In [6]:
# 4. Target Encoding for Humidity
df['Humidity_target'] = df.groupby('Humidity')['Crowdedness'].transform('mean')

df.head()

,Date,Weekday,Month,Temperature,Humidity,Wind,Crowdedness,Weekday_label,Outlook_overcast,Outlook_rainy,Outlook_sunny,Wind_binary,Humidity_target
0,03-25,Mon,Mar,High,Dry,No,85,0,0,0,1,0,65.0
1,03-26,Tue,Mar,Low,Humid,Yes,30,1,0,1,0,1,52.0
2,03-27,Wed,Mar,High,Dry,Yes,65,2,1,0,0,1,65.0
3,03-28,Thu,Mar,Extreme,Dry,Yes,45,3,0,0,1,1,65.0
4,03-29,Fri,Mar,Low,Humid,No,25,4,0,1,0,0,52.0
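Because the encoded value is computed from the target, fitting it on the same rows used for evaluation leaks target information into the feature. A common precaution is to compute the category means on the training rows only and then map them onto new rows. The sketch below assumes the first six rows of the dataset as the training split and adds a hypothetical unseen category, `'Misty'`, which falls back to the global training mean:

```python
import pandas as pd

train = pd.DataFrame({
    'Humidity': ['Dry', 'Humid', 'Dry', 'Dry', 'Humid', 'Humid'],
    'Crowdedness': [85, 30, 65, 45, 25, 90],
})
test = pd.DataFrame({'Humidity': ['Dry', 'Humid', 'Misty']})  # 'Misty' is unseen

# Learn the per-category means from the training rows only.
category_means = train.groupby('Humidity')['Crowdedness'].mean()
global_mean = train['Crowdedness'].mean()

# Apply the learned means; unseen categories get the global mean.
test['Humidity_target'] = test['Humidity'].map(category_means).fillna(global_mean)
print(test.round(3))
```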


#### Ordinal Encoding
Ordinal Encoding assigns ordered integers to ordinal categories based on their inherent order.

In [7]:
# 5. Ordinal Encoding for Temperature
temp_order = {'Low': 1, 'High': 2, 'Extreme': 3}
df['Temperature_ordinal'] = df['Temperature'].map(temp_order)

df.head()

,Date,Weekday,Month,Temperature,Humidity,Wind,Crowdedness,Weekday_label,Outlook_overcast,Outlook_rainy,Outlook_sunny,Wind_binary,Humidity_target,Temperature_ordinal
0,03-25,Mon,Mar,High,Dry,No,85,0,0,0,1,0,65.0,2
1,03-26,Tue,Mar,Low,Humid,Yes,30,1,0,1,0,1,52.0,1
2,03-27,Wed,Mar,High,Dry,Yes,65,2,1,0,0,1,65.0,2
3,03-28,Thu,Mar,Extreme,Dry,Yes,45,3,0,0,1,1,65.0,3
4,03-29,Fri,Mar,Low,Humid,No,25,4,0,1,0,0,52.0,1
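The same mapping can also be expressed with scikit-learn's `OrdinalEncoder`, shown here as an alternative sketch rather than a replacement for the dictionary approach. The order has to be passed explicitly, and the codes start at 0 instead of 1:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df_temp = pd.DataFrame({'Temperature': ['High', 'Low', 'High', 'Extreme']})

# Pass the order explicitly; without it the categories are
# sorted alphabetically (Extreme, High, Low).
encoder = OrdinalEncoder(categories=[['Low', 'High', 'Extreme']])
df_temp['Temperature_ordinal'] = encoder.fit_transform(df_temp[['Temperature']]).ravel()

print(df_temp)
```

With this order, `Low` maps to 0.0, `High` to 1.0, and `Extreme` to 2.0.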


#### Cyclic Encoding
Cyclic Encoding maps a cyclical categorical variable to two numerical features, a sine and a cosine component, so that the last category ends up next to the first (December next to January).

In [8]:
# 6. Cyclic Encoding for Month
month_order = {'Jan': 1, 'Feb': 2, 'Mar': 3, 'Apr': 4, 'May': 5, 'Jun': 6,
               'Jul': 7, 'Aug': 8, 'Sep': 9, 'Oct': 10, 'Nov': 11, 'Dec': 12}
df['Month_num'] = df['Month'].map(month_order)
df['Month_sin'] = np.sin(2 * np.pi * (df['Month_num']-1) / 12)
df['Month_cos'] = np.cos(2 * np.pi * (df['Month_num']-1) / 12)

df.head()

,Date,Weekday,Month,Temperature,Humidity,Wind,Crowdedness,Weekday_label,Outlook_overcast,Outlook_rainy,Outlook_sunny,Wind_binary,Humidity_target,Temperature_ordinal,Month_num,Month_sin,Month_cos
0,03-25,Mon,Mar,High,Dry,No,85,0,0,0,1,0,65.0,2,3,0.866025,0.5
1,03-26,Tue,Mar,Low,Humid,Yes,30,1,0,1,0,1,52.0,1,3,0.866025,0.5
2,03-27,Wed,Mar,High,Dry,Yes,65,2,1,0,0,1,65.0,2,3,0.866025,0.5
3,03-28,Thu,Mar,Extreme,Dry,Yes,45,3,0,0,1,1,65.0,3,3,0.866025,0.5
4,03-29,Fri,Mar,Low,Humid,No,25,4,0,1,0,0,52.0,1,3,0.866025,0.5
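To see why two features are used, it helps to compare distances between months. With plain integers, December (12) and January (1) are 11 apart. On the sine/cosine circle they are as close as any two neighbouring months. The sketch below uses the same formula as the cell above; `encode_month` and `cyclic_distance` are helper names introduced here for illustration:

```python
import numpy as np

def encode_month(month_num):
    """Map a month number (1-12) to a point on the unit circle."""
    angle = 2 * np.pi * (month_num - 1) / 12
    return np.array([np.sin(angle), np.cos(angle)])

def cyclic_distance(a, b):
    """Euclidean distance between two months in (sin, cos) space."""
    return float(np.linalg.norm(encode_month(a) - encode_month(b)))

print(round(cyclic_distance(12, 1), 3))  # Dec vs Jan -> 0.518
print(round(cyclic_distance(1, 2), 3))   # Jan vs Feb -> 0.518
print(round(cyclic_distance(1, 7), 3))   # Jan vs Jul -> 2.0
```

Months half a year apart sit on opposite sides of the circle, which gives the maximum distance of 2.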


In [9]:
# Select and rearrange the encoded columns (Date is kept as a row identifier)
numerical_columns = [
    'Date', 'Weekday_label',
    'Month_sin', 'Month_cos',
    'Temperature_ordinal',
    'Humidity_target',
    'Wind_binary',
    'Outlook_sunny', 'Outlook_overcast', 'Outlook_rainy', 
    'Crowdedness'
]

# Display the rearranged numerical columns
print(df[numerical_columns].round(3))

     Date  Weekday_label  Month_sin  Month_cos  Temperature_ordinal  \
0   03-25              0      0.866        0.5                    2   
1   03-26              1      0.866        0.5                    1   
2   03-27              2      0.866        0.5                    2   
3   03-28              3      0.866        0.5                    3   
4   03-29              4      0.866        0.5                    1   
5   03-30              5      0.866        0.5                    2   
6   03-31              6      0.866        0.5                    2   
7   04-01              0      1.000        0.0                    1   
8   04-02              1      1.000        0.0                    2   
9   04-03              2      1.000        0.0                    3   
10  04-04              3      1.000        0.0                    2   
11  04-05              4      1.000        0.0                    1   

    Humidity_target  Wind_binary  Outlook_sunny  Outlook_overcast  \
0      