### Handle Categorical Features

Categorical data groups observations into a limited set of labels (categories) that share a characteristic, whereas numerical data expresses information as numbers.

An example of a categorical feature is gender.

**Why do we need encoding?**

Most machine learning algorithms cannot handle categorical variables unless they are converted to numerical values. The performance of many algorithms also varies with how the categorical variables are encoded.
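
As a minimal sketch of why the choice of encoding matters, the snippet below encodes the same column in two ways. The `color` column and its values are hypothetical and are not part of the Titanic or Mercedes-Benz data used in this notebook.

```python
import pandas as pd

# Hypothetical toy column; the values are made up for illustration.
toy = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# Integer codes: a single column, but the numbers imply an order
# (blue=0 < green=1 < red=2) that the labels do not really have.
codes = toy["color"].astype("category").cat.codes

# One-hot encoding: one 0/1 column per label, with no implied order.
one_hot = pd.get_dummies(toy["color"], prefix="color", dtype=int)

print(codes.tolist())
print(one_hot)
```

A model that treats its inputs as magnitudes may read the integer codes as a ranking, which is one reason the two encodings can lead to different results.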

### One Hot Encoding

In [1]:
import pandas as pd
import numpy as np

In [2]:
df = pd.read_csv("titanic.csv",usecols=["Sex"])

In [3]:
df.head()

Unnamed: 0,Sex
0,male
1,female
2,female
3,female
4,male


In [4]:
pd.get_dummies(df).head()

Unnamed: 0,Sex_female,Sex_male
0,0,1
1,1,0
2,1,0
3,1,0
4,0,1


In [5]:
pd.get_dummies(df,drop_first=True).head()

Unnamed: 0,Sex_male
0,1
1,0
2,0
3,0
4,1
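
The `drop_first=True` output loses no information: with k labels, k-1 indicator columns are enough, because the dropped column can be recomputed from the remaining ones. A small sketch on a hypothetical four-row frame (not read from `titanic.csv`):

```python
import pandas as pd

# Hypothetical rows standing in for the "Sex" column.
toy = pd.DataFrame({"Sex": ["male", "female", "female", "male"]})

full = pd.get_dummies(toy, dtype=int)                      # Sex_female, Sex_male
reduced = pd.get_dummies(toy, drop_first=True, dtype=int)  # Sex_male only

# The dropped column is fully determined by the remaining one.
recovered_female = 1 - reduced["Sex_male"]

print(reduced.columns.tolist())
print(recovered_female.tolist())
```

Dropping the redundant column is commonly done to avoid perfectly correlated inputs in linear models.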


In [6]:
df = pd.read_csv("titanic.csv",usecols=["Embarked"])

In [7]:
df.head()

Unnamed: 0,Embarked
0,S
1,C
2,S
3,S
4,S


In [8]:
df["Embarked"].unique()

array(['S', 'C', 'Q', nan], dtype=object)

In [9]:
df.dropna(inplace=True)

In [10]:
df["Embarked"].unique()

array(['S', 'C', 'Q'], dtype=object)
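
Dropping rows is one way to deal with the missing label. An alternative, sketched here on hypothetical data, is to keep those rows and let `get_dummies` add an indicator column for missing values through its `dummy_na` parameter. Whether that is preferable depends on the dataset; this is an illustration, not a recommendation from the original notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; the third one has a missing port of embarkation.
toy = pd.DataFrame({"Embarked": ["S", "C", np.nan, "Q"]})

# Default behaviour: the row with the missing label is all zeros.
plain = pd.get_dummies(toy, dtype=int)

# dummy_na=True appends one extra column that flags missing values.
with_na = pd.get_dummies(toy, dummy_na=True, dtype=int)

print(plain)
print(with_na)
```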

In [11]:
pd.get_dummies(df).head()

Unnamed: 0,Embarked_C,Embarked_Q,Embarked_S
0,0,0,1
1,1,0,0
2,0,0,1
3,0,0,1
4,0,0,1


In [12]:
pd.get_dummies(df,drop_first=True).head()

Unnamed: 0,Embarked_Q,Embarked_S
0,0,1
1,0,0
2,0,1
3,0,1
4,0,1


### One Hot Encoding With Many Categories in a Feature

In [13]:
df = pd.read_csv("mercedes_benz.csv",usecols=["X1","X2","X3","X4","X5","X6"])

In [14]:
df.head()

Unnamed: 0,X1,X2,X3,X4,X5,X6
0,v,at,a,d,u,j
1,t,av,e,d,y,l
2,w,n,c,d,x,j
3,t,n,f,d,x,l
4,v,n,f,d,h,d


In [15]:
for i in df.columns:
    print("Column",i,":",df[i].nunique())

Column X1 : 27
Column X2 : 44
Column X3 : 7
Column X4 : 4
Column X5 : 29
Column X6 : 12
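
These counts show why plain one-hot encoding becomes unwieldy here: each unique label turns into its own column. The sketch below adds up the counts printed above and then checks the same rule on a small hypothetical frame.

```python
import pandas as pd

# Unique-label counts as printed in the output above.
counts = {"X1": 27, "X2": 44, "X3": 7, "X4": 4, "X5": 29, "X6": 12}

full_width = sum(counts.values())         # columns from plain get_dummies
reduced_width = full_width - len(counts)  # one fewer per feature with drop_first=True
print(full_width, reduced_width)          # 123 117

# The same rule checked on a hypothetical toy frame.
toy = pd.DataFrame({"X1": ["v", "t", "w", "t"], "X2": ["at", "av", "n", "n"]})
expected = sum(toy[c].nunique() for c in toy.columns)
actual = pd.get_dummies(toy).shape[1]
print(expected, actual)
```

Six features would therefore expand to 123 columns, which motivates keeping only the most frequent labels.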


In [16]:
top_10_X1 = df["X1"].value_counts().head(10).index
top_10_X1

Index(['aa', 's', 'b', 'l', 'v', 'r', 'i', 'a', 'c', 'o'], dtype='object')

In [17]:
# Add one 0/1 indicator column per top-10 label; rows with rarer labels get all zeros
for i in top_10_X1:
    df["X1"+"_"+i] = np.where(df["X1"]==i,1,0)
df.head(10)

Unnamed: 0,X1,X2,X3,X4,X5,X6,X1_aa,X1_s,X1_b,X1_l,X1_v,X1_r,X1_i,X1_a,X1_c,X1_o
0,v,at,a,d,u,j,0,0,0,0,1,0,0,0,0,0
1,t,av,e,d,y,l,0,0,0,0,0,0,0,0,0,0
2,w,n,c,d,x,j,0,0,0,0,0,0,0,0,0,0
3,t,n,f,d,x,l,0,0,0,0,0,0,0,0,0,0
4,v,n,f,d,h,d,0,0,0,0,1,0,0,0,0,0
5,b,e,c,d,g,h,0,0,1,0,0,0,0,0,0,0
6,r,e,f,d,f,h,0,0,0,0,0,1,0,0,0,0
7,l,as,f,d,f,j,0,0,0,1,0,0,0,0,0,0
8,s,as,e,d,f,i,0,1,0,0,0,0,0,0,0,0
9,b,aq,c,d,f,a,0,0,1,0,0,0,0,0,0,0


The same steps wrapped in a function:

In [18]:
df = pd.read_csv("mercedes_benz.csv",usecols=["X1","X2","X3","X4","X5","X6"])

In [19]:
def one_hot_encoding(df,variable):
    # Adds 0/1 columns for the 10 most frequent labels of `variable`; modifies df in place
    top_10 = df[variable].value_counts().head(10).index
    for i in top_10:
        df[variable+"_"+i] = np.where(df[variable]==i,1,0)

In [20]:
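# Snapshot the original column names so the newly added indicator columns are not encoded again
for i in list(df.columns):
    one_hot_encoding(df,i)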

In [21]:
df.head(10)

Unnamed: 0,X1,X2,X3,X4,X5,X6,X1_aa,X1_s,X1_b,X1_l,...,X6_g,X6_j,X6_d,X6_i,X6_l,X6_a,X6_h,X6_k,X6_c,X6_b
0,v,at,a,d,u,j,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
1,t,av,e,d,y,l,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
2,w,n,c,d,x,j,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
3,t,n,f,d,x,l,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
4,v,n,f,d,h,d,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
5,b,e,c,d,g,h,0,0,1,0,...,0,0,0,0,0,0,1,0,0,0
6,r,e,f,d,f,h,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
7,l,as,f,d,f,j,0,0,0,1,...,0,1,0,0,0,0,0,0,0,0
8,s,as,e,d,f,i,0,1,0,0,...,0,0,0,1,0,0,0,0,0,0
9,b,aq,c,d,f,a,0,0,1,0,...,0,0,0,0,0,1,0,0,0,0
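
The function above modifies `df` in place and fixes the cutoff at 10. A possible variant, sketched below, returns a new DataFrame and takes the cutoff as an argument. The name `one_hot_top_n` and the `top_n` parameter are hypothetical and do not come from the original notebook, and the data is a made-up toy column.

```python
import numpy as np
import pandas as pd

def one_hot_top_n(df, variable, top_n=10):
    """Return a copy of df with 0/1 columns for the top_n most frequent labels."""
    out = df.copy()
    top_labels = out[variable].value_counts().head(top_n).index
    for label in top_labels:
        out[f"{variable}_{label}"] = np.where(out[variable] == label, 1, 0)
    return out

# Hypothetical toy column: "v" appears 3 times, "t" twice, "w" once.
toy = pd.DataFrame({"X1": ["v", "t", "v", "w", "t", "v"]})
encoded = one_hot_top_n(toy, "X1", top_n=2)

print(encoded)
print(toy.columns.tolist())  # the input frame is left untouched
```

Rows whose label falls outside the top `top_n` (here `w`) get zeros in every indicator column, the same behaviour as in the cells above.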
