
## One-Hot Encoding

One-hot encoding is a technique used to convert categorical variables into a numerical format that can be used by machine learning algorithms. It creates one new binary column for each unique category in the original column. A value of 1 in a new column indicates that the row belongs to that category, and 0 indicates that it does not.

**Example:**

Consider a 'Color' column with values: 'Red', 'Blue', 'Green'. One-hot encoding would create three new columns: 'Color_Red', 'Color_Blue', 'Color_Green'.

| Color | Color_Red | Color_Blue | Color_Green |
|---|---|---|---|
| Red | 1 | 0 | 0 |
| Blue | 0 | 1 | 0 |
| Green | 0 | 0 | 1 |

## Dummy Variable Trap and Multicollinearity

The **Dummy Variable Trap** occurs when you include all the dummy variables created by one-hot encoding in a model that also has an intercept term. This leads to perfect **Multicollinearity**, a situation where one predictor variable in a regression model is an exact linear combination of the others.

In the 'Color' example above, if you include 'Color_Red', 'Color_Blue', and 'Color_Green' in your model, the three columns always sum to one: `Color_Red + Color_Blue + Color_Green = 1`. That sum is identical to the intercept column, so the design matrix is rank-deficient and a linear model without regularization has no unique solution. The independent effect of each color cannot be estimated, and the fitted coefficients become unstable and unreliable.

To avoid the dummy variable trap and multicollinearity, you should drop one of the dummy variables. This is often done by dropping the dummy variable corresponding to the most frequent category or a reference category.
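The linear dependence can be checked numerically. The following is a minimal sketch, assuming a design matrix built by hand from an intercept column plus the dummy columns; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

colors = pd.Series(['Red', 'Blue', 'Green', 'Red', 'Blue', 'Red'], name='Color')

# All k dummy columns versus k-1 dummy columns
full = pd.get_dummies(colors, dtype=float)
reduced = pd.get_dummies(colors, drop_first=True, dtype=float)

# Prepend an intercept column of ones, as a linear model would
intercept = np.ones((len(colors), 1))
X_full = np.hstack([intercept, full.to_numpy()])
X_reduced = np.hstack([intercept, reduced.to_numpy()])

# A matrix has full column rank only if rank == number of columns
rank_full = np.linalg.matrix_rank(X_full)
rank_reduced = np.linalg.matrix_rank(X_reduced)

print("k dummies:  ", X_full.shape[1], "columns, rank", rank_full)
print("k-1 dummies:", X_reduced.shape[1], "columns, rank", rank_reduced)
```

With all three dummies the matrix has more columns than its rank, which is the trap; after dropping one dummy, the rank matches the column count.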

## One-Hot Encoding using Pandas

You can perform one-hot encoding easily using the `get_dummies()` function in pandas. By default, `get_dummies()` keeps a column for every category (`drop_first=False`); pass `drop_first=True` to drop the first category and avoid the dummy variable trap.

In [None]:
import pandas as pd

# Create a sample DataFrame
data = {'Color': ['Red', 'Blue', 'Green', 'Red', 'Blue', 'Red']}
df = pd.DataFrame(data)

print("Original DataFrame:")
display(df)

# Perform one-hot encoding
df_encoded = pd.get_dummies(df, columns=['Color'], drop_first=True)

print("\nDataFrame after one-hot encoding:")
display(df_encoded)

Original DataFrame:


Unnamed: 0,Color
0,Red
1,Blue
2,Green
3,Red
4,Blue
5,Red



DataFrame after one-hot encoding:


Unnamed: 0,Color_Green,Color_Red
0,False,True
1,False,False
2,True,False
3,False,True
4,False,False
5,False,True
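For comparison, here is a small sketch of `get_dummies()` with and without `drop_first`, using the same 'Color' data. The `dtype=int` argument is an optional addition that returns 0/1 integers instead of booleans.

```python
import pandas as pd

df = pd.DataFrame({'Color': ['Red', 'Blue', 'Green', 'Red', 'Blue', 'Red']})

# Without drop_first: one column per category (k columns)
full = pd.get_dummies(df, columns=['Color'], dtype=int)

# With drop_first=True: the first category in sorted order ('Blue')
# is dropped and acts as the reference (k-1 columns)
reduced = pd.get_dummies(df, columns=['Color'], drop_first=True, dtype=int)

print(list(full.columns))
print(list(reduced.columns))
```

A row where every remaining dummy is 0 then represents the dropped reference category.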


## One-Hot Encoding using Most Frequent Variables

In some cases, especially with high cardinality categorical features, you might want to limit the number of new columns created by one-hot encoding. You can achieve this by encoding only the most frequent categories and grouping the rest into an 'Other' category or dropping them.

Here's an example of how to encode only the top N most frequent categories using scikit-learn's `OneHotEncoder` and pandas:

In [None]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Create a sample DataFrame with more categories
data = {'City': ['New York', 'London', 'Paris', 'New York', 'Tokyo', 'London', 'Paris', 'New York', 'Tokyo', 'Mumbai']}
df_city = pd.DataFrame(data)

print("Original DataFrame:")
display(df_city)

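# Get the top 3 most frequent cities
# Note: London, Paris, and Tokyo are tied at 2 rows each; nlargest() keeps
# the first ones it encounters, so which tied city is cut is order-dependent
top_n = 3
top_cities = df_city['City'].value_counts().nlargest(top_n).index.tolist()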

# Create a new column with top cities or 'Other'
df_city['City_Encoded'] = df_city['City'].apply(lambda x: x if x in top_cities else 'Other')

# Perform one-hot encoding on the new column
encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
encoded_cities = encoder.fit_transform(df_city[['City_Encoded']])
encoded_city_df = pd.DataFrame(encoded_cities, columns=encoder.get_feature_names_out(['City_Encoded']))

# Concatenate the encoded DataFrame with the original (optional, depending on your needs)
df_city_final = pd.concat([df_city, encoded_city_df], axis=1)

print("\nDataFrame after encoding top cities:")
display(df_city_final)

Original DataFrame:


Unnamed: 0,City
0,New York
1,London
2,Paris
3,New York
4,Tokyo
5,London
6,Paris
7,New York
8,Tokyo
9,Mumbai



DataFrame after encoding top cities:


Unnamed: 0,City,City_Encoded,City_Encoded_London,City_Encoded_New York,City_Encoded_Other,City_Encoded_Paris
0,New York,New York,0.0,1.0,0.0,0.0
1,London,London,1.0,0.0,0.0,0.0
2,Paris,Paris,0.0,0.0,0.0,1.0
3,New York,New York,0.0,1.0,0.0,0.0
4,Tokyo,Other,0.0,0.0,1.0,0.0
5,London,London,1.0,0.0,0.0,0.0
6,Paris,Paris,0.0,0.0,0.0,1.0
7,New York,New York,0.0,1.0,0.0,0.0
8,Tokyo,Other,0.0,0.0,1.0,0.0
9,Mumbai,Other,0.0,0.0,1.0,0.0


In [None]:
import numpy as np
import pandas as pd

In [None]:
df = pd.read_csv('/content/cars.csv')
df.sample(4)

Unnamed: 0,brand,km_driven,fuel,owner,selling_price
3315,Toyota,70000,Petrol,Second Owner,145000
5664,Mahindra,110000,Diesel,Second Owner,750000
3071,Maruti,35000,Diesel,First Owner,830000
1827,Ford,93468,Diesel,Second Owner,975000


In [None]:
df.shape

(8128, 5)

In [None]:
df["owner"].value_counts()

owner,count
First Owner,5289
Second Owner,2105
Third Owner,555
Fourth & Above Owner,174
Test Drive Car,5


In [None]:
df["fuel"].value_counts()

fuel,count
Diesel,4402
Petrol,3631
CNG,57
LPG,38


### **1. One-Hot Encoding using pandas**

In [None]:
pd.get_dummies(df,columns=["fuel","owner"])

Unnamed: 0,brand,km_driven,selling_price,fuel_CNG,fuel_Diesel,fuel_LPG,fuel_Petrol,owner_First Owner,owner_Fourth & Above Owner,owner_Second Owner,owner_Test Drive Car,owner_Third Owner
0,Maruti,145500,450000,False,True,False,False,True,False,False,False,False
1,Skoda,120000,370000,False,True,False,False,False,False,True,False,False
2,Honda,140000,158000,False,False,False,True,False,False,False,False,True
3,Hyundai,127000,225000,False,True,False,False,True,False,False,False,False
4,Maruti,120000,130000,False,False,False,True,True,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...
8123,Hyundai,110000,320000,False,False,False,True,True,False,False,False,False
8124,Hyundai,119000,135000,False,True,False,False,False,True,False,False,False
8125,Maruti,120000,382000,False,True,False,False,True,False,False,False,False
8126,Tata,25000,290000,False,True,False,False,True,False,False,False,False


### **2. K-1 One-Hot Encoding**

In [None]:
pd.get_dummies(df,columns=["fuel","owner"],drop_first=True)

Unnamed: 0,brand,km_driven,selling_price,fuel_Diesel,fuel_LPG,fuel_Petrol,owner_Fourth & Above Owner,owner_Second Owner,owner_Test Drive Car,owner_Third Owner
0,Maruti,145500,450000,True,False,False,False,False,False,False
1,Skoda,120000,370000,True,False,False,False,True,False,False
2,Honda,140000,158000,False,False,True,False,False,False,True
3,Hyundai,127000,225000,True,False,False,False,False,False,False
4,Maruti,120000,130000,False,False,True,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...
8123,Hyundai,110000,320000,False,False,True,False,False,False,False
8124,Hyundai,119000,135000,True,False,False,True,False,False,False
8125,Maruti,120000,382000,True,False,False,False,False,False,False
8126,Tata,25000,290000,True,False,False,False,False,False,False


### **3. One-Hot Encoding using scikit-learn**

In [None]:
from sklearn.model_selection import train_test_split

# Features: brand, km_driven, fuel, owner; target: selling_price
X_train,X_test,y_train,y_test = train_test_split(df.iloc[:,0:4],
                                                 df.iloc[:,-1],
                                                 test_size=0.2,
                                                 random_state=2)

In [None]:
X_train.head()

Unnamed: 0,brand,km_driven,fuel,owner
5571,Hyundai,35000,Diesel,First Owner
2038,Jeep,60000,Diesel,First Owner
2957,Hyundai,25000,Petrol,First Owner
7618,Mahindra,130000,Diesel,Second Owner
6684,Hyundai,155000,Diesel,First Owner


In [None]:
from sklearn.preprocessing import OneHotEncoder

In [None]:
# drop='first' keeps k-1 columns per feature; sparse_output=False returns a dense array
OHE = OneHotEncoder(drop='first', sparse_output=False, dtype=np.int32)

In [None]:
# Fit on the training split only; the test split should later go through OHE.transform()
X_train_new = OHE.fit_transform(X_train[["fuel","owner"]])

In [None]:
X_train_new

array([[1, 0, 0, ..., 0, 0, 0],
       [1, 0, 0, ..., 0, 0, 0],
       [0, 0, 1, ..., 0, 0, 0],
       ...,
       [0, 0, 1, ..., 0, 0, 0],
       [1, 0, 0, ..., 1, 0, 0],
       [1, 0, 0, ..., 0, 0, 0]], dtype=int32)

In [None]:
X_train_new.shape

(6502, 7)

In [None]:
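# Stack the untouched columns next to the encoded ones
# (the result is an object array because brand holds strings)
np.hstack((X_train[["brand","km_driven"]].values, X_train_new))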

array([['Hyundai', 35000, 1, ..., 0, 0, 0],
       ['Jeep', 60000, 1, ..., 0, 0, 0],
       ['Hyundai', 25000, 0, ..., 0, 0, 0],
       ...,
       ['Tata', 15000, 0, ..., 0, 0, 0],
       ['Maruti', 32500, 1, ..., 1, 0, 0],
       ['Isuzu', 121000, 1, ..., 0, 0, 0]], dtype=object)

### **4. One-Hot Encoding with Top Categories**

In [None]:
counts = df["brand"].value_counts()

In [None]:
df['brand'].nunique()

32

In [None]:
# Brands with 100 or fewer rows will be grouped into a single category
threshold = 100
repl = counts[counts <= threshold].index

In [None]:
repl

Index(['Nissan', 'Jaguar', 'Volvo', 'Datsun', 'Mercedes-Benz', 'Fiat', 'Audi',
       'Lexus', 'Jeep', 'Mitsubishi', 'Land', 'Force', 'Isuzu', 'Ambassador',
       'Kia', 'MG', 'Daewoo', 'Ashok', 'Opel', 'Peugeot'],
      dtype='object', name='brand')

In [None]:
# Replace every rare brand with "uncommon", then one-hot encode the result
pd.get_dummies(df["brand"].replace(repl,"uncommon")).sample(5)

Unnamed: 0,BMW,Chevrolet,Ford,Honda,Hyundai,Mahindra,Maruti,Renault,Skoda,Tata,Toyota,Volkswagen,uncommon
6280,False,False,False,False,False,False,True,False,False,False,False,False,False
7107,False,False,False,False,False,False,True,False,False,False,False,False,False
7288,False,False,False,False,True,False,False,False,False,False,False,False,False
7363,False,False,False,True,False,False,False,False,False,False,False,False,False
4943,False,False,False,False,False,True,False,False,False,False,False,False,False
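The cells above compute the brand counts on the full DataFrame. When a train/test split is involved, one option is to learn the list of common brands from the training rows only and reuse it for the test rows. The sketch below uses made-up brand data and a hypothetical threshold of 1; the helper `group_rare` is illustrative and not part of pandas.

```python
import pandas as pd

# Hypothetical training and test brand columns
train_brand = pd.Series(['Maruti', 'Maruti', 'Maruti', 'Hyundai', 'Hyundai', 'Jeep'],
                        name='brand')
test_brand = pd.Series(['Maruti', 'Jeep', 'Volvo'], name='brand')

# Learn which brands are common from the training data only
threshold = 1
counts = train_brand.value_counts()
common = counts[counts > threshold].index

def group_rare(s):
    """Keep common brands; replace everything else with 'uncommon'."""
    return s.where(s.isin(common), 'uncommon')

train_grouped = group_rare(train_brand)
test_grouped = group_rare(test_brand)

# A fixed category list guarantees identical columns for both splits,
# even when a category is absent from one of them
cat_type = pd.CategoricalDtype(categories=list(common) + ['uncommon'])
train_dummies = pd.get_dummies(train_grouped.astype(cat_type))
test_dummies = pd.get_dummies(test_grouped.astype(cat_type))

print(test_grouped.tolist())
print(list(test_dummies.columns))
```

Brands that never appeared in training, such as 'Volvo' here, fall into 'uncommon' automatically.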
