# Feature Engineering - Pandas - One Hot Encoding For Multi Categorical Variables

### One-Hot Encoding
For categorical variables whose categories have no ordinal relationship, integer encoding is not enough.

In fact, using this encoding and allowing the model to assume a natural ordering between categories may result in poor performance or unexpected results (predictions halfway between categories).

In this case, a one-hot encoding can be applied to the integer representation. This is where the integer encoded variable is removed and a new binary variable is added for each unique integer value.
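As a minimal sketch of the difference between the two encodings, the snippet below uses a small made-up `colour` column (the `toy` DataFrame and its values are assumptions for illustration, not part of the Mercedes dataset). The integer codes impose an arbitrary order on the labels, while the one-hot version creates one binary column per label.

```python
import pandas as pd

# Hypothetical toy data, used only to illustrate the idea
toy = pd.DataFrame({"colour": ["red", "green", "blue", "green"]})

# Integer encoding: each label gets a code, which implies an order (blue < green < red)
toy["colour_int"] = toy["colour"].astype("category").cat.codes

# One-hot encoding: one binary column per unique label, no implied order
one_hot = pd.get_dummies(toy["colour"], prefix="colour", dtype=int)

print(toy)
print(one_hot)
```

Each row of `one_hot` contains exactly one 1, in the column that matches the row's label.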

Dataset - https://www.kaggle.com/yogeerp/mercedes

In [None]:
import numpy as np
import pandas as pd

In [None]:
data = pd.read_csv(
    "data/mercedesbenz.csv", usecols=["X1", "X2", "X3", "X4", "X5", "X6"]
)
data.head()

In [None]:
# Let's have a look at how many unique labels each variable has.
for col in data.columns:
    print(col, ": ", len(data[col].unique()), " labels")

In [None]:
# Let's examine how many columns we will obtain after one hot encoding these variables
pd.get_dummies(data, drop_first=True).shape

We can see that from just 6 initial categorical variables, we end up with 117 new variables.
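To see where a column count like this comes from: with `drop_first=True`, `pd.get_dummies` creates one column fewer than the number of unique labels for each variable. The sketch below checks this on a small made-up DataFrame (the `toy` data and the `expected` variable are assumptions for illustration only).

```python
import pandas as pd

# Hypothetical toy data: A has 3 unique labels, B has 2
toy = pd.DataFrame({
    "A": ["a", "b", "c", "a"],
    "B": ["x", "y", "x", "y"],
})

# With drop_first=True, a variable with k labels yields k - 1 dummy columns
expected = sum(toy[col].nunique() - 1 for col in toy.columns)

encoded = pd.get_dummies(toy, drop_first=True)
print(encoded.shape[1], expected)
```

The same sum over the unique-label counts printed earlier should match the number of columns reported for the real data.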

What can we do instead?

In [None]:
# Let's look at the 20 most frequent categories of the variable X2
# (value_counts already sorts in descending order of frequency)
data.X2.value_counts().head(20)

In [None]:
# Let's make a list with the 10 most frequent categories of the variable.
top_10 = list(data.X2.value_counts().head(10).index)
top_10

In [None]:
# And now we can make the 10 binary variables

for label in top_10:
    data[label] = np.where(data["X2"] == label, 1, 0)

data[["X2"] + top_10].head(40)

In [None]:
# Get whole set of dummy variables, for all the categorical variables.


def one_hot_top_x(df, variable, top_x_labels):
    # Create a 0/1 dummy column for each of the most frequent labels.
    # The number of labels encoded is set by the length of top_x_labels.
    # Note: the new columns are added to df in place.

    for label in top_x_labels:
        df[variable + "_" + label] = np.where(df[variable] == label, 1, 0)


# Read the data again
data = pd.read_csv(
    "data/mercedesbenz.csv", usecols=["X1", "X2", "X3", "X4", "X5", "X6"]
)

# Encode X2 into the 10 most frequent categories
one_hot_top_x(data, "X2", top_10)
data.head()
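The same idea can be extended to every categorical variable by computing the most frequent labels per column inside a loop. The following is only a sketch under stated assumptions: `encode_top_k` is a hypothetical helper (not part of the original notebook), and the `toy` DataFrame stands in for the real data so the example runs on its own.

```python
import numpy as np
import pandas as pd


def encode_top_k(df, variables, k=10):
    """Hypothetical helper: add 0/1 columns for the k most frequent labels of each variable."""
    df = df.copy()  # work on a copy so the caller's DataFrame is not modified
    for variable in variables:
        # value_counts sorts by frequency, so head(k) gives the k most frequent labels
        top_labels = df[variable].value_counts().head(k).index
        for label in top_labels:
            df[f"{variable}_{label}"] = np.where(df[variable] == label, 1, 0)
    return df


# Made-up data for illustration; column names mirror the dataset's naming only
toy = pd.DataFrame({
    "X1": ["a", "a", "b", "c", "a", "b"],
    "X2": ["u", "v", "v", "v", "w", "u"],
})

encoded = encode_top_k(toy, ["X1", "X2"], k=2)
print(encoded)
```

Rows whose label is not among the top `k` (for example `c` in `X1`) get 0 in every dummy column of that variable, so all infrequent labels are effectively grouped together. Returning a copy rather than modifying the input is a design choice made here to avoid re-reading the data before each encoding run.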