# One-Hot Encoding

One-hot encoding is a commonly used technique in data preprocessing and feature engineering, particularly in machine learning tasks. It is used to convert categorical variables into a numerical representation that can be used by machine learning algorithms.

In one-hot encoding, each value of a categorical variable is represented as a binary vector in which every element corresponds to one distinct category. The length of the vector equals the number of categories in the variable. The vector contains all zeros except at the position of the value's own category, which is set to 1.

For example, let's say we have a categorical variable "Color" with three categories: "Red," "Green," and "Blue." Using one-hot encoding, we would represent each category as follows:

- "Red" -> [1, 0, 0]
- "Green" -> [0, 1, 0]
- "Blue" -> [0, 0, 1]
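
The mapping above can be reproduced in a few lines of plain Python. This is a minimal sketch; the helper name `one_hot` is an illustrative assumption, not a library function.

```python
def one_hot(value, categories):
    """Return a list with 1 at the position of `value` and 0 elsewhere."""
    return [1 if value == category else 0 for category in categories]

# The order of this list fixes which position belongs to which color.
categories = ["Red", "Green", "Blue"]

for color in categories:
    print(f'"{color}" -> {one_hot(color, categories)}')
```

The order of `categories` is arbitrary but must stay fixed, since it determines the meaning of each position in the vector.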

By using one-hot encoding, we transform categorical variables into a numerical representation that machine learning algorithms can understand and effectively use for analysis and modeling. One-hot encoding is particularly useful when the categorical variable does not have an intrinsic ordinal relationship among its categories, and each category is equally important.

It's worth noting that one-hot encoding increases the dimensionality of the data, as it creates additional columns/features for each category. This can lead to the curse of dimensionality in some cases, so it's important to consider the trade-offs and potential impacts on the performance of machine learning models.

In [10]:
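# Data source: https://www.kaggle.com/c/mercedes-benz-greener-manufacturing/data?select=train.csv.zip
import pandas as pd
import numpy as np

# Load only the six categorical columns used in this example.
# A forward slash avoids the invalid "\M" escape sequence in the path.
df = pd.read_csv('Datasets/Mercedes-Benz.csv', usecols=['X1', 'X2', 'X3', 'X4', 'X5', 'X6'])
df.head()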


Unnamed: 0,X1,X2,X3,X4,X5,X6
0,v,at,a,d,u,j
1,t,av,e,d,y,l
2,w,n,c,d,x,j
3,t,n,f,d,x,l
4,v,n,f,d,h,d


In [2]:
# Number of distinct labels in each feature

for col in df.columns:
    print(col, ':', len(df[col].unique()), 'labels')

X1 : 27 labels
X2 : 44 labels
X3 : 7 labels
X4 : 4 labels
X5 : 29 labels
X6 : 12 labels


In [3]:
# How many columns do we get after one-hot encoding these features?

pd.get_dummies(df, drop_first = True).shape

(4209, 117)
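
The column count can be checked by hand: with `drop_first = True`, each variable contributes one column fewer than its number of labels. A small sketch using the label counts printed earlier (the dictionary name `n_labels` is an assumption made for illustration):

```python
# Label counts per variable, copied from the output of the previous cell.
n_labels = {"X1": 27, "X2": 44, "X3": 7, "X4": 4, "X5": 29, "X6": 12}

# Full one-hot encoding: one column per label.
full_columns = sum(n_labels.values())

# With drop_first=True, one label per variable is dropped.
reduced_columns = sum(count - 1 for count in n_labels.values())

print(full_columns, reduced_columns)  # prints: 123 117
```

Six original columns therefore become 117, which is the dimensionality increase discussed in the introduction.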

In [5]:
# Top 20 most frequent categories for the variable X2
df.X2.value_counts().sort_values(ascending = False).head(20)

as    1659
ae     496
ai     415
m      367
ak     265
r      153
n      137
s       94
f       87
e       81
aq      63
ay      54
a       47
t       29
k       25
i       25
b       21
ao      20
ag      19
z       19
Name: X2, dtype: int64

In [7]:
# Keep the labels of the 10 most frequent X2 categories
top10 = list(df.X2.value_counts().sort_values(ascending=False).head(10).index)
top10

['as', 'ae', 'ai', 'm', 'ak', 'r', 'n', 's', 'f', 'e']

In [13]:
# Now we create the 10 binary features, one per frequent label

for label in top10:
    df[label] = np.where(df['X2'] == label, 1, 0)

# Rows whose X2 label is not in top10 (e.g. 'at', 'av', 'aq') are all zeros
df[['X2'] + top10].head(10)

Unnamed: 0,X2,as,ae,ai,m,ak,r,n,s,f,e
0,at,0,0,0,0,0,0,0,0,0,0
1,av,0,0,0,0,0,0,0,0,0,0
2,n,0,0,0,0,0,0,1,0,0,0
3,n,0,0,0,0,0,0,1,0,0,0
4,n,0,0,0,0,0,0,1,0,0,0
5,e,0,0,0,0,0,0,0,0,0,1
6,e,0,0,0,0,0,0,0,0,0,1
7,as,1,0,0,0,0,0,0,0,0,0
8,as,1,0,0,0,0,0,0,0,0,0
9,aq,0,0,0,0,0,0,0,0,0,0
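
The same idea can be sketched without pandas: count the labels, keep the most frequent ones, and emit one 0/1 column per kept label. The sample values are the ten `X2` entries displayed in the table above; `one_hot_top_k` is a hypothetical helper, not part of any library.

```python
from collections import Counter

def one_hot_top_k(values, k):
    """One-hot encode only the k most frequent labels in `values`.

    Returns the kept labels and one 0/1 row per input value.
    Labels outside the top k become all-zero rows.
    """
    top_labels = [label for label, _ in Counter(values).most_common(k)]
    rows = [[1 if value == label else 0 for label in top_labels] for value in values]
    return top_labels, rows

# The ten X2 values displayed in the table above.
x2_sample = ["at", "av", "n", "n", "n", "e", "e", "as", "as", "aq"]

top_labels, rows = one_hot_top_k(x2_sample, k=3)
print(top_labels)
for value, row in zip(x2_sample, rows):
    print(value, row)
```

Note that the frequencies here come from ten rows only, so the kept labels differ from the `top10` list computed on the full dataset.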


In [16]:
# Get the whole set of dummy variables, for all the categorical variables

def one_hot_top_x(data, variable, top_x_labels):
    # Create dummy variables for the most frequent labels of `variable`.
    # The number of encoded labels is controlled by `top_x_labels`.
    # Note: `data` is modified in place.
    for label in top_x_labels:
        data[variable + '_' + label] = np.where(data[variable] == label, 1, 0)

# Read the data again
df1 = pd.read_csv('Datasets/Mercedes-Benz.csv', usecols=['X1', 'X2', 'X3', 'X4', 'X5', 'X6'])


# Encode X2 into the 10 most frequent categories
# (applied to df, which still holds the dummy columns created in the previous cell)
one_hot_top_x(df, 'X2', top10)
df.head()

Unnamed: 0,X1,X2,X3,X4,X5,X6,as,ae,ai,m,...,X2_as,X2_ae,X2_ai,X2_m,X2_ak,X2_r,X2_n,X2_s,X2_f,X2_e
0,v,at,a,d,u,j,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,t,av,e,d,y,l,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,w,n,c,d,x,j,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
3,t,n,f,d,x,l,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
4,v,n,f,d,h,d,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0


In [17]:
# Repeat for X1: find its 10 most frequent categories and encode them
top10_X1 = [x for x in df.X1.value_counts().sort_values(ascending = False).head(10).index]

one_hot_top_x(df, 'X1', top10_X1)
df.head()

Unnamed: 0,X1,X2,X3,X4,X5,X6,as,ae,ai,m,...,X1_aa,X1_s,X1_b,X1_l,X1_v,X1_r,X1_i,X1_a,X1_c,X1_o
0,v,at,a,d,u,j,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
1,t,av,e,d,y,l,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,w,n,c,d,x,j,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,t,n,f,d,x,l,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,v,n,f,d,h,d,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0


# Advantages and Disadvantages

As described above, one-hot encoding turns each category into a binary vector in which exactly one element is "hot" (set to 1) and all others are "cold" (set to 0). This representation has clear advantages, but it also comes with certain disadvantages:

**Advantages of one-hot encoding:**

- Retains categorical information: One-hot encoding allows the representation of categorical variables as binary vectors, preserving the categorical information in a format that can be understood by machine learning algorithms. It enables the inclusion of categorical data in numerical models, which typically operate on numeric inputs.

- Prevents ordinality bias: One-hot encoding treats all categories as independent, preventing any implicit ordering or hierarchy among the categories. This is beneficial when the categorical variable doesn't have an inherent order or when we want to avoid introducing any bias based on the order of the categories.

- Enables handling of multiple categories: One-hot encoding works for variables with any number of categories, not just two. It creates a separate binary feature for each category, allowing models to capture the presence or absence of specific categories. The cost of doing so grows with the number of categories, as discussed under the disadvantages.
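
The ordinality point can be made concrete by comparing distances between encoded categories. In the sketch below, the integer codes are an arbitrary assumption chosen for illustration; the one-hot vectors are the ones from the "Color" example.

```python
import math

# Integer (label) encoding: the codes are arbitrary, yet they impose an order.
label_codes = {"Red": 0, "Green": 1, "Blue": 2}

# One-hot encoding from the "Color" example.
one_hot_codes = {"Red": [1, 0, 0], "Green": [0, 1, 0], "Blue": [0, 0, 1]}

pairs = [("Red", "Green"), ("Red", "Blue"), ("Green", "Blue")]
for a, b in pairs:
    label_distance = abs(label_codes[a] - label_codes[b])
    one_hot_distance = math.dist(one_hot_codes[a], one_hot_codes[b])
    print(a, b, label_distance, round(one_hot_distance, 3))
```

With integer codes, "Red" appears twice as far from "Blue" as from "Green", although no such relationship exists in the data. With one-hot vectors, every pair of distinct categories is the same distance apart.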

**Disadvantages of one-hot encoding:**

- Increases dimensionality: One-hot encoding expands the feature space by creating additional binary features for each category. If the categorical variable has many unique categories, this can lead to a significant increase in the dimensionality of the data. High dimensionality can be problematic for memory usage, computational complexity, and model training time, especially with large datasets.

- Generates sparse representations: One-hot encoding leads to sparse representations, where most of the values in the feature vector are zeros. When stored as ordinary dense arrays, such data spends memory and computation on zeros; sparse matrix formats reduce this cost, but not every tool supports them. Sparse, high-dimensional inputs may also negatively impact the performance of some machine learning algorithms.

- Ignores relationships between categories: One-hot encoding treats each category as unrelated to others, disregarding any potential relationships or similarities between them. This can result in the loss of valuable information that might exist in the relationships between different categories of a variable.

- Has no representation for unseen categories: If a category appears in the test or production data that was not present in the training data, the encoder has no column for it. Depending on the implementation, this either raises an error or yields an all-zero vector (scikit-learn's `OneHotEncoder`, for example, raises an error by default and produces all zeros with `handle_unknown='ignore'`). Either way, the model has no learned representation for the new category.
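
Two of these points, sparsity and unseen categories, can be illustrated with a small hand-rolled encoder. This is a sketch under the assumption that unknown labels are mapped to all zeros; real libraries differ on this, and the names `fit_categories` and `transform` are hypothetical.

```python
def fit_categories(values):
    """Collect the distinct categories seen in training, in sorted order."""
    return sorted(set(values))

def transform(value, categories):
    """One-hot encode `value`; categories not seen in training give all zeros."""
    return [1 if value == category else 0 for category in categories]

train = ["Red", "Green", "Blue", "Green"]
categories = fit_categories(train)        # ['Blue', 'Green', 'Red']

seen = transform("Green", categories)     # exactly one 1
unseen = transform("Yellow", categories)  # no matching position, so all zeros

# Fraction of non-zero entries in an encoded row: 1 / number of categories.
density = sum(seen) / len(seen)
print(categories, seen, unseen, density)
```

The density shrinks as the number of categories grows: a variable with 44 labels, such as `X2`, yields rows in which at most 1 of 44 entries is non-zero.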

In summary, one-hot encoding is a useful technique for representing categorical variables as binary vectors, but it also has limitations related to dimensionality, sparsity, and the treatment of relationships between categories. It's essential to consider these advantages and disadvantages while preprocessing data and choose the encoding method that best suits the specific requirements of the problem at hand.