# 5. Data Encoding

In this notebook, we will explain why data encoding is necessary and provide examples of One-Hot Encoding and Label Encoding.

## Why Data Encoding?

- Qualitative (categorical) variables take their values from a finite set of modalities, such as country names, rather than from numbers.
- Most machine learning algorithms take numerical values as input, so these modalities must be transformed into numerical data before a model can use them.

## One-Hot Encoding

- One-Hot Encoding (or 1-of-N encoding) is common in machine learning.
- It encodes a variable with n modalities on n bits (one binary column per modality): the bit for the modality taken by the variable is set to 1, and all the others are set to 0.
- **Advantage**: It is easy to apply.
- **Disadvantage**: Memory use grows with the number of modalities, since the encoding uses as many bits as there are modalities.
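
Before turning to scikit-learn, the definition above can be sketched by hand in plain Python. This is only an illustration: the helper name `one_hot` is hypothetical, and sorting the modalities alphabetically is an assumption made so that the column order is deterministic.

```python
# Hand-rolled one-hot encoding, for illustration only.
def one_hot(values):
    """Return (categories, rows); each row contains exactly one 1."""
    categories = sorted(set(values))              # one column per modality
    index = {cat: i for i, cat in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)               # n positions, all set to 0 ...
        row[index[value]] = 1                     # ... except the modality taken
        rows.append(row)
    return categories, rows

categories, rows = one_hot(['France', 'Spain', 'Germany', 'Spain'])
print(categories)  # ['France', 'Germany', 'Spain']
print(rows)        # [[1, 0, 0], [0, 0, 1], [0, 1, 0], [0, 0, 1]]
```

Each row has length n (here 3), which is the source of the memory cost listed as the disadvantage.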

Let's demonstrate One-Hot Encoding with a generated DataFrame.

In [5]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Creating a sample DataFrame
data = {
    'Country': ['France', 'Spain', 'Germany', 'Spain', 'Germany', 'France', 'Spain']
}

df = pd.DataFrame(data)

In [6]:
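# One-Hot Encoding
# `sparse_output=False` returns a dense NumPy array
# (this argument was named `sparse` before scikit-learn 1.2)
encoder = OneHotEncoder(sparse_output=False)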
one_hot_encoded = encoder.fit_transform(df[['Country']])

df_one_hot_encoded = pd.DataFrame(one_hot_encoded, columns=encoder.get_feature_names_out(['Country']))

print("Original DataFrame:\n", df)
print("\nOne-Hot Encoded DataFrame:\n", df_one_hot_encoded)


Original DataFrame:
    Country
0   France
1    Spain
2  Germany
3    Spain
4  Germany
5   France
6    Spain

One-Hot Encoded DataFrame:
    Country_France  Country_Germany  Country_Spain
0             1.0              0.0            0.0
1             0.0              0.0            1.0
2             0.0              1.0            0.0
3             0.0              0.0            1.0
4             0.0              1.0            0.0
5             1.0              0.0            0.0
6             0.0              0.0            1.0
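
The memory cost of one-hot encoding is one reason `OneHotEncoder` returns a SciPy sparse matrix when called with its default arguments: only the non-zero entries are stored. The sketch below rebuilds the same sample DataFrame so that it runs on its own; the variable names `sparse_encoded` and `dense` are illustrative.

```python
import pandas as pd
from scipy import sparse
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({
    'Country': ['France', 'Spain', 'Germany', 'Spain', 'Germany', 'France', 'Spain']
})

# With default arguments, fit_transform returns a SciPy sparse matrix
sparse_encoded = OneHotEncoder().fit_transform(df[['Country']])

print(sparse.issparse(sparse_encoded))  # True
print(sparse_encoded.shape)             # (7, 3)
print(sparse_encoded.nnz)               # 7 stored values: a single 1 per row

# Convert to a dense array only when a downstream step requires it
dense = sparse_encoded.toarray()
```

With 3 modalities the saving is negligible, but for a variable with thousands of modalities the sparse form stores one value per row instead of thousands.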


## Label Encoding

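- Label Encoding maps each distinct label of a qualitative variable to an integer, so that the variable can be read by algorithms that expect numerical input.
- The resulting integers carry an implied order (here France < Germany < Spain) that has no meaning for a nominal variable, which can mislead models that treat the values as magnitudes.
- In scikit-learn, `LabelEncoder` is designed for target labels; `OrdinalEncoder` is its counterpart for input features.

Let's demonstrate Label Encoding on the same DataFrame.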

In [7]:
from sklearn.preprocessing import LabelEncoder

# Label Encoding
label_encoder = LabelEncoder()
label_encoded = label_encoder.fit_transform(df['Country'])

df_label_encoded = df.copy()
df_label_encoded['Country'] = label_encoded

print("Original DataFrame:\n", df)
print("\nLabel Encoded DataFrame:\n", df_label_encoded)

Original DataFrame:
    Country
0   France
1    Spain
2  Germany
3    Spain
4  Germany
5   France
6    Spain

Label Encoded DataFrame:
    Country
0        0
1        2
2        1
3        2
4        1
5        0
6        2
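
To see which integer was assigned to which label, the fitted encoder can be inspected. The sketch below is a minimal example that re-creates the country list so it runs on its own; `mapping` and `decoded` are illustrative names. It relies on `classes_`, which holds the labels in sorted order, and on `inverse_transform`, which converts codes back to labels.

```python
from sklearn.preprocessing import LabelEncoder

countries = ['France', 'Spain', 'Germany', 'Spain', 'Germany', 'France', 'Spain']

label_encoder = LabelEncoder()
codes = label_encoder.fit_transform(countries)

# classes_ stores the labels in sorted order; a label's position is its code
mapping = {str(label): code for code, label in enumerate(label_encoder.classes_)}
print(mapping)  # {'France': 0, 'Germany': 1, 'Spain': 2}

# inverse_transform recovers the original labels from the integer codes
decoded = label_encoder.inverse_transform(codes)
print(list(decoded) == countries)  # True
```

The codes follow alphabetical order rather than order of appearance, which is why Spain is encoded as 2 even though it appears second in the data.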
