
# Label Encoding

[Code Link in Kaggle](https://www.kaggle.com/code/tanviranjomsiddique/label-encoding-one-hot-encoding)

## Print Unique Values of One Column: df['Species'].unique()

In [1]:
# Import libraries 
import numpy as np 
import pandas as pd 

# Import dataset 
df = pd.read_csv('/kaggle/input/iriscsv/Iris.csv') 

display( df.columns )

Index(['Id', 'SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm',
       'Species'],
      dtype='object')

In [2]:
display( df['Species'].unique() )

array(['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'], dtype=object)

## After applying label encoding with LabelEncoder(), each categorical value is replaced with an integer.

In [3]:
# Import label encoder 
from sklearn import preprocessing 

# label_encoder object knows 
# how to understand word labels. 
label_encoder = preprocessing.LabelEncoder() 

# Encode labels in column 'Species'. 
df['Species'] = label_encoder.fit_transform(df['Species']) 

df['Species'].unique() 

array([0, 1, 2])
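
To see which integer stands for which species, and to map integers back to labels, the fitted encoder can be inspected. The following is a minimal sketch that uses the three species names from the output above as a small in-memory list (an assumption, so that it runs without the Kaggle CSV file):

```python
from sklearn.preprocessing import LabelEncoder

# Small stand-in for the 'Species' column (assumed sample, not the full dataset)
species = ["Iris-setosa", "Iris-versicolor", "Iris-virginica", "Iris-setosa"]

label_encoder = LabelEncoder()
codes = label_encoder.fit_transform(species)

# classes_ lists the labels in sorted order; a label's position is its integer code
print(label_encoder.classes_.tolist())
print(codes.tolist())  # [0, 1, 2, 0]

# inverse_transform maps integer codes back to the original labels
print(label_encoder.inverse_transform([2, 0]).tolist())  # ['Iris-virginica', 'Iris-setosa']
```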

## Limitation of Label Encoding
Label encoding converts categorical data into numbers by assigning a unique integer (starting from 0) to each class. This can create `priority issues` during model training: `a label with a higher value may be treated as having higher priority` than a label with a lower value, even when the classes have no natural order.

### Example of the Limitation of Label Encoding
Consider an attribute with the classes Mexico, Paris, and Dubai. Suppose label encoding replaces Mexico with 0, Paris with 1, and Dubai with 2.

A model could then interpret Dubai as having higher priority than Mexico and Paris, even though no such priority relation exists between these cities.
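
The city example can be tried directly. This is a minimal sketch on a made-up list of the three cities. Note that scikit-learn's `LabelEncoder` assigns integers in sorted order, so the actual codes differ from the illustrative numbering above; either way, the integers carry an order that the cities themselves do not have.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical city values, taken from the example in the text
cities = ["Mexico", "Paris", "Dubai", "Paris"]

encoder = LabelEncoder()
codes = encoder.fit_transform(cities)

# Build a label -> code mapping from the sorted classes_ array
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}
print(mapping)         # {'Dubai': 0, 'Mexico': 1, 'Paris': 2}
print(codes.tolist())  # [1, 2, 0, 2]
```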

# What is One Hot Encoding?
One Hot Encoding is a method for converting categorical variables into a `binary format`. It creates `new binary columns (0s and 1s) for each category` in the original variable. Each category in the original column is represented as a separate column, where a value of 1 indicates the presence of that category, and 0 indicates its absence.

## Why Use One Hot Encoding?
The primary purpose of One Hot Encoding is to ensure that `categorical data can be effectively used` in machine learning models. Key reasons why this technique is beneficial:

### Eliminating Ordinality: 
Many categorical variables have `no inherent order` (e.g., “Male” and “Female”). If we were to assign numerical values (e.g., Male = 0, Female = 1), the model might `mistakenly interpret this as a ranking`, leading to `biased predictions`. One Hot Encoding eliminates this risk by `treating each category independently`.

### Improving Model Performance:
By giving each category its own column, One Hot Encoding lets a model `learn a separate effect for every category` instead of forcing all categories onto a single numeric scale. This can improve performance for models that would otherwise read meaning into arbitrary integer codes.

### Compatibility with Algorithms: 
Many machine learning algorithms, particularly those based on `linear regression and gradient descent`, require `numerical input`. One Hot Encoding ensures that categorical variables are converted into a suitable format.

### How One-Hot Encoding Works: An Example
To grasp the concept better, let’s explore a simple example. Imagine we have a dataset of fruits, with each fruit's name as a categorical value and a corresponding price. Using one-hot encoding, we can transform these categorical values into numerical form.

Wherever the fruit is “Apple,” the Apple column will have a value of 1, while the other fruit columns (like Mango or Orange) will contain 0. This pattern ensures that each categorical value gets its own column, represented with binary values (1 or 0), making it usable for machine learning models.
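
The fruit example can be sketched as follows. The fruit names come from the text; the prices and the column names `Fruit` and `Price` are made up for illustration.

```python
import pandas as pd

# Hypothetical fruit data; the prices are invented for this sketch
fruits = pd.DataFrame({
    "Fruit": ["Apple", "Mango", "Apple", "Orange"],
    "Price": [5, 10, 15, 20],
})

# One binary column per fruit; dtype=int yields 0/1 instead of booleans
encoded = pd.get_dummies(fruits, columns=["Fruit"], dtype=int)
print(encoded)
```

Each row has exactly one 1 across the three fruit columns, in the column that matches its original value.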

## Using Pandas
Pandas offers the `get_dummies` function, which is a simple and effective way to perform one-hot encoding. This method converts categorical variables into multiple binary columns.

For example, the `Gender column` with `values 'M' and 'F'` becomes two binary columns: `Gender_F and Gender_M`.

`drop_first=True` in pandas `drops the first category's column`, which is redundant (e.g., it drops Gender_F and keeps only Gender_M to **avoid multicollinearity**).

In [4]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Create a dummy employee dataset
data = {
    'Employee id': [10, 20, 15, 25, 30],
    'Gender': ['M', 'F', 'F', 'M', 'F'],
    'Remarks': ['Good', 'Nice', 'Good', 'Great', 'Nice']
}

# Convert the data into a pandas DataFrame
df = pd.DataFrame(data)
print("Original Employee Data:")
display(df)

# Use pd.get_dummies() to one-hot encode the categorical columns
df_pandas_encoded = pd.get_dummies(df, 
                                   columns=['Gender', 'Remarks'], 
                                   drop_first=True)

print("One-Hot Encoded Data using Pandas:")
display(df_pandas_encoded)


Original Employee Data:


|   | Employee id | Gender | Remarks |
|---|---|---|---|
| 0 | 10 | M | Good |
| 1 | 20 | F | Nice |
| 2 | 15 | F | Good |
| 3 | 25 | M | Great |
| 4 | 30 | F | Nice |


One-Hot Encoded Data using Pandas:


|   | Employee id | Gender_M | Remarks_Great | Remarks_Nice |
|---|---|---|---|---|
| 0 | 10 | True | False | False |
| 1 | 20 | False | False | True |
| 2 | 15 | False | False | False |
| 3 | 25 | True | True | False |
| 4 | 30 | False | False | True |


### Select Categorical columns

In [5]:
categorical_columns = df.select_dtypes(include=['object']).columns.tolist()
print(categorical_columns)

['Gender', 'Remarks']


In [6]:
# Initialize OneHotEncoder
encoder = OneHotEncoder( sparse_output=False )

# Fit and transform the categorical columns
one_hot_encoded = encoder.fit_transform( df[categorical_columns] )

# Create a DataFrame with the encoded columns
one_hot_df = pd.DataFrame(one_hot_encoded, 
                          columns = encoder.get_feature_names_out( categorical_columns ))

# Concatenate the one-hot encoded columns with the original DataFrame
df_sklearn_encoded = pd.concat([ df.drop(categorical_columns, axis=1),
                                 one_hot_df ],
                               axis=1)

print("One-Hot Encoded Data using Scikit-Learn:")
display( df_sklearn_encoded )

One-Hot Encoded Data using Scikit-Learn:


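|   | Employee id | Gender_F | Gender_M | Remarks_Good | Remarks_Great | Remarks_Nice |
|---|---|---|---|---|---|---|
| 0 | 10 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 |
| 1 | 20 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 2 | 15 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 3 | 25 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 |
| 4 | 30 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 |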


The output contains 3 Remarks columns and 2 Gender columns. However, a variable with n unique labels needs only n-1 columns. For example, if we keep only the Gender_F column and drop the Gender_M column, no information is lost: a value of 1 means female and a value of 0 means male. This way we can encode the categorical data and reduce the number of parameters as well.

## One Hot Encoding using Scikit Learn Library
Scikit-learn (sklearn) is a popular machine-learning library in Python that provides numerous tools for data preprocessing. Its `OneHotEncoder` class encodes categorical variables into binary vectors. To find the columns to encode, we first use the pandas method `df.select_dtypes(include=['object'])`:

- This selects only the columns with categorical data (data type object).
- In this case, ['Gender', 'Remarks'] are identified as categorical columns.

In [7]:
#one hot encoding using OneHotEncoder of Scikit-Learn

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

#Building a dummy employee dataset for example
data = {'Employee id': [10, 20, 15, 25, 30],
        'Gender': ['M', 'F', 'F', 'M', 'F'],
        'Remarks': ['Good', 'Nice', 'Good', 'Great', 'Nice'],
        }
df = pd.DataFrame(data)
print(f"Employee data : \n{df}")

#Extract categorical columns from the dataframe
#Here we extract the columns with object datatype as they are the categorical columns
categorical_columns = df.select_dtypes(include=['object']).columns.tolist()
print( categorical_columns )


Employee data : 
   Employee id Gender Remarks
0           10      M    Good
1           20      F    Nice
2           15      F    Good
3           25      M   Great
4           30      F    Nice
['Gender', 'Remarks']


In [8]:
encoder = OneHotEncoder(sparse_output=False)
# Apply one-hot encoding to the categorical columns
one_hot_encoded = encoder.fit_transform(df[categorical_columns])

#Create a DataFrame with the one-hot encoded columns
#We use get_feature_names_out() to get the column names for the encoded data
one_hot_df = pd.DataFrame(one_hot_encoded, 
                          columns=encoder.get_feature_names_out(categorical_columns))

# Concatenate the one-hot encoded dataframe with the original dataframe
df_encoded = pd.concat([df, one_hot_df], axis=1)

# Drop the original categorical columns
df_encoded = df_encoded.drop(categorical_columns, axis=1)
print("Encoded Employee data : \n")
display(df_encoded)

Encoded Employee data : 



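|   | Employee id | Gender_F | Gender_M | Remarks_Good | Remarks_Great | Remarks_Nice |
|---|---|---|---|---|---|---|
| 0 | 10 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 |
| 1 | 20 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 2 | 15 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 3 | 25 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 |
| 4 | 30 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 |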


# One-Hot Encoding: A Key Technique for Handling Categorical Data

## Overview of Tools for One-Hot Encoding
Both **Pandas** and **Scikit-Learn** offer robust solutions for one-hot encoding:

- **Use Pandas `get_dummies()`**:
  - Best for quick and simple encoding tasks.
- **Use Scikit-Learn `OneHotEncoder()`**:
  - Ideal for integration within machine learning pipelines.
  - Provides finer control over encoding behavior.
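
As a rough illustration of the pipeline use case, the sketch below wraps `OneHotEncoder` in a `ColumnTransformer` and a `Pipeline`. The `Promoted` target, the choice of `LogisticRegression`, and the unseen value "Poor" are assumptions made up for this example.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical training data; the "Promoted" target is invented for illustration
train = pd.DataFrame({
    "Gender": ["M", "F", "F", "M", "F", "M"],
    "Remarks": ["Good", "Nice", "Good", "Great", "Nice", "Great"],
    "Promoted": [0, 1, 0, 1, 1, 1],
})

preprocess = ColumnTransformer(
    transformers=[
        # handle_unknown="ignore" encodes categories unseen during fit as all zeros
        ("onehot", OneHotEncoder(handle_unknown="ignore"), ["Gender", "Remarks"]),
    ]
)

model = Pipeline(steps=[
    ("preprocess", preprocess),
    ("classifier", LogisticRegression()),
])
model.fit(train[["Gender", "Remarks"]], train["Promoted"])

# "Poor" never appeared during training, yet prediction still works
new_rows = pd.DataFrame({"Gender": ["F"], "Remarks": ["Poor"]})
predictions = model.predict(new_rows)
print(predictions)
```

Because the encoder is fitted inside the pipeline, the same category-to-column mapping is reused at prediction time, which is harder to guarantee when calling `get_dummies()` separately on training and new data.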

---

## Advantages and Disadvantages of One-Hot Encoding

### Advantages
1. **Facilitates Model Compatibility**:
   - Enables the use of categorical variables in models requiring numerical inputs.
2. **Improved Model Performance**:
   - Lets the model learn a separate effect for each category instead of reading meaning into arbitrary integer codes.
3. **Avoids Ordinality Issues**:
   - Prevents the model from inferring an order that does not exist in nominal categories (e.g., "Mexico", "Paris", "Dubai").

### Disadvantages
1. **Increased Dimensionality**:
   - Creates a separate column for each category, making the model more complex and slower to train.
2. **Sparse Data**:
   - Most entries will have a value of `0` in the encoded columns.
3. **Risk of Overfitting**:
   - Especially problematic with many categories in a variable and a relatively small sample size.
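
The first two disadvantages can be made concrete with a small sketch. It assumes a synthetic feature with 200 distinct codes across 1,000 rows (made-up data) and counts the resulting columns and the share of non-zero entries.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Synthetic high-cardinality feature: 1,000 rows cycling through 200 distinct codes
city_codes = (np.arange(1000) % 200).astype(str).reshape(-1, 1)

# By default OneHotEncoder returns a SciPy sparse matrix
encoded = OneHotEncoder().fit_transform(city_codes)

n_rows, n_cols = encoded.shape
density = encoded.nnz / (n_rows * n_cols)
print(n_cols)   # 200, one column per distinct category
print(density)  # 0.005, since each row holds exactly one non-zero entry
```

Keeping the result in sparse form, as the encoder does by default, avoids storing all of those zeros explicitly.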

> **Key Insight**: While powerful, one-hot encoding should be used cautiously to avoid issues like dimensionality increase, sparsity, and overfitting. Alternative techniques may be more suitable depending on the context.

---

## Best Practices for One-Hot Encoding

1. **Limit the Number of Categories**:
   - For high-cardinality variables, consider grouping or feature engineering to reduce the number of unique categories.
2. **Use Feature Selection**:
   - Apply feature selection techniques to retain only the most relevant features, reducing dimensionality and enhancing model performance.
3. **Monitor Model Performance**:
   - Regularly evaluate the model after encoding. Address overfitting or inefficiencies by exploring alternative encoding methods.
4. **Understand Your Data**:
   - Assess the nature of categorical variables to determine the appropriateness of one-hot encoding.
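
The first practice, grouping rare categories before encoding, might look like the sketch below. The city values, the `min_count` threshold, and the "Other" label are all assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical column with two frequent and three rare categories
cities = pd.Series(["Paris", "Paris", "Paris", "Dubai", "Dubai", "Lima", "Oslo", "Cairo"])

min_count = 2  # assumed threshold; tune it for the dataset at hand
counts = cities.value_counts()
rare = counts[counts < min_count].index

# Replace every rare category with a shared "Other" label
grouped = cities.where(~cities.isin(rare), "Other")

encoded = pd.get_dummies(grouped, prefix="City", dtype=int)
print(encoded.columns.tolist())  # ['City_Dubai', 'City_Other', 'City_Paris']
```

Five original categories collapse into three encoded columns. Recent scikit-learn versions also expose a `min_frequency` option on `OneHotEncoder` that serves a similar purpose.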

---

## Alternatives to One-Hot Encoding

### 1. **Label Encoding**
- Assigns a unique integer to each category.
- Suitable for ordinal variables (e.g., "Low", "Medium", "High").
- Can introduce a misleading hierarchy when applied to nominal data, so it is best reserved for ordinal variables.
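
For an ordinal feature, the integer order should match the real order of the categories. The sketch below swaps in scikit-learn's `OrdinalEncoder` (rather than `LabelEncoder`, which sorts labels alphabetically and is aimed at target columns) so that the order can be stated explicitly. The `Priority` column is made up.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical ordinal feature
levels = pd.DataFrame({"Priority": ["Medium", "Low", "High", "Low"]})

# Passing the categories explicitly fixes the order Low < Medium < High;
# without it, the encoder would sort alphabetically (High, Low, Medium)
encoder = OrdinalEncoder(categories=[["Low", "Medium", "High"]])
codes = encoder.fit_transform(levels)
print(codes.ravel().tolist())  # [1.0, 0.0, 2.0, 0.0]
```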

### 2. **Binary Encoding**
- Converts categories into binary numbers and creates binary columns.
- Reduces dimensionality while preserving information.
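
Binary encoding can be sketched by hand with pandas and NumPy. This is an illustrative implementation under assumed data (a made-up `colors` feature), not the API of any particular encoding library: each category gets an integer code, and the code is then written out bit by bit.

```python
import numpy as np
import pandas as pd

# Hypothetical feature with five categories
colors = pd.Series(["red", "green", "blue", "white", "black", "green"])

# Step 1: map each category to an integer code (sorted order, starting at 0)
codes, categories = pd.factorize(colors, sort=True)

# Step 2: number of bits needed to represent the largest code
n_bits = max(1, (len(categories) - 1).bit_length())

# Step 3: extract each bit, most significant first, into its own column
bits = (codes[:, None] >> np.arange(n_bits - 1, -1, -1)) & 1

encoded = pd.DataFrame(bits, columns=[f"color_bit{i}" for i in range(n_bits)])
print(encoded)
```

Five categories fit into 3 binary columns here, whereas one-hot encoding would need 5.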

### 3. **Target Encoding**
- Replaces categories with the mean of the target variable for each category.
- Effective for high-cardinality variables but requires careful handling to avoid data leakage.
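
A bare-bones version of target encoding is sketched below. The data and the `Bought` target are made up, and the approach is deliberately simple: the per-category means are learned from the training rows only and then applied to new rows, with the global mean as a fallback for unseen categories.

```python
import pandas as pd

# Hypothetical data; "Bought" is an invented binary target
train = pd.DataFrame({
    "City": ["Paris", "Paris", "Dubai", "Dubai", "Dubai", "Lima"],
    "Bought": [1, 0, 1, 1, 0, 1],
})
test = pd.DataFrame({"City": ["Paris", "Lima", "Oslo"]})

# Learn the per-category target mean from the training rows only
category_means = train.groupby("City")["Bought"].mean()
global_mean = train["Bought"].mean()

# Apply the learned means; unseen categories fall back to the global mean
train["City_encoded"] = train["City"].map(category_means)
test["City_encoded"] = test["City"].map(category_means).fillna(global_mean)
print(test)
```

Even this version lets each training row see its own target value, so in practice smoothing or out-of-fold estimates are commonly used to limit leakage.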

---

By understanding the advantages, disadvantages, and best practices of one-hot encoding, you can make informed decisions to handle categorical data effectively in your machine learning projects.
