# 13_Dummy Variables AKA One-Hot Encoding

Many machine learning algorithms for regression or classification need categorical data to be encoded numerically before they can use it. One way to do this is to encode each categorical value as its own boolean feature (1 for True, 0 for False). For example, a "Gender" feature with the values "Male", "Female", or "Other" can be encoded into three separate boolean columns, as below:


| Gender | Male | Female | Other |
|--------|------|--------|-------|
| Male   | 1    | 0      | 0     |
| Female | 0    | 1      | 0     |
| Other  | 0    | 0      | 1     |
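The table above can be reproduced with a minimal sketch. The `gender` Series is made-up illustration data, not part of the dataset used in the rest of this notebook, and `dtype=int` is passed on the assumption that 0/1 output is wanted rather than booleans.

```python
import pandas as pd

# Hypothetical toy data, used only to mirror the table above.
gender = pd.Series(["Male", "Female", "Other"], name="Gender")

# dtype=int requests 0/1 values instead of True/False.
encoded = pd.get_dummies(gender, dtype=int)

# For plain strings the dummy columns come back sorted alphabetically,
# so reorder them to match the column order of the table.
encoded = encoded[["Male", "Female", "Other"]]
print(encoded)
```

Each row contains exactly one 1, in the column that matches the original value.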


#### Example:

In [1]:
import pandas as pd

# Load the example timeseries_daily.csv dataset:
df = pd.read_csv("./data_etc/timeseries_daily.csv")
df.head()

Unnamed: 0,Date,feature_1,feature_2,feature_3,feature_4,categorical_feature,weekday
0,01/02/2017,0,0,37,0,foo,Wednesday
1,02/02/2017,0,0,168,0,foo,Thursday
2,03/02/2017,0,0,157,0,other,Friday
3,04/02/2017,0,0,720,0,other,Saturday
4,05/02/2017,0,0,721,0,bar,Sunday


#### Use pandas <code>get_dummies()</code> to create the boolean columns:
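Running the function returns a DataFrame with the same number of rows as the categorical feature, and one column for each unique value in that feature. The output below shows 0/1 values; recent pandas versions (2.0 onwards) return True/False by default, and passing <code>dtype=int</code> gives the 0/1 form.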

In [2]:
dummies = pd.get_dummies(df["categorical_feature"])
dummies.tail()

Unnamed: 0,bar,foo,other
315,0,1,0
316,0,0,1
317,1,0,0
318,0,1,0
319,0,0,1
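The shape of the result can be checked directly. This is a small sketch on a made-up Series (`toy` is a hypothetical stand-in for `df["categorical_feature"]`, since the CSV is not reproduced here).

```python
import pandas as pd

# Hypothetical stand-in for df["categorical_feature"].
toy = pd.Series(["foo", "foo", "other", "other", "bar"], name="categorical_feature")
toy_dummies = pd.get_dummies(toy, dtype=int)

# One row per original row, one column per unique category.
same_length = len(toy_dummies) == len(toy)
one_column_per_category = toy_dummies.shape[1] == toy.nunique()

# Every row has exactly one 1, because each row belongs to exactly one category.
one_hot_rows = bool((toy_dummies.sum(axis=1) == 1).all())

print(same_length, one_column_per_category, one_hot_rows)
```

Note that missing values are ignored by default (`dummy_na=False`), so a row whose category is NaN would come back as all zeros.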


#### This can be joined to the original DataFrame:

In [3]:
new_df = pd.concat([df, dummies], axis=1)
new_df.tail()

Unnamed: 0,Date,feature_1,feature_2,feature_3,feature_4,categorical_feature,weekday,bar,foo,other
315,13/12/2017,657,1079,2128,454,foo,Wednesday,0,1,0
316,14/12/2017,968,1155,2116,163,other,Thursday,0,0,1
317,15/12/2017,820,1212,1934,290,bar,Friday,1,0,0
318,16/12/2017,1201,966,2475,497,foo,Saturday,0,1,0
319,17/12/2017,576,485,1809,285,other,Sunday,0,0,1
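`pd.concat(..., axis=1)` matches rows by index label, not by position. The sketch below uses a made-up DataFrame (`toy_df`, with a deliberately non-default index) to show why the dummies should keep the index they were created with.

```python
import pandas as pd

# Hypothetical toy data with a non-default index.
toy_df = pd.DataFrame(
    {"value": [10, 20, 30], "categorical_feature": ["foo", "other", "bar"]},
    index=[100, 101, 102],
)
toy_dummies = pd.get_dummies(toy_df["categorical_feature"], dtype=int)

# The dummies inherit the index of the source column, so the rows line up.
joined = pd.concat([toy_df, toy_dummies], axis=1)

# If the index of one side is reset, the labels no longer match and
# concat produces extra rows padded with NaN instead of raising an error.
misaligned = pd.concat([toy_df, toy_dummies.reset_index(drop=True)], axis=1)

print(joined.shape)
print(misaligned.shape)
```

Here `joined` keeps the original three rows, while `misaligned` ends up with six rows containing missing values.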


#### For clarity, you can add prefixes to the dummy columns:

In [4]:
dummies_prefixed = pd.get_dummies(df["categorical_feature"], prefix="category", prefix_sep=": ")
dummies_prefixed.tail()

Unnamed: 0,category: bar,category: foo,category: other
315,0,1,0
316,0,0,1
317,1,0,0
318,0,1,0
319,0,0,1
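Prefixes become more useful when several columns are encoded at once. As a sketch, `get_dummies` can take the whole DataFrame plus a `columns` list, and `prefix` can be a dict mapping each column to its prefix. The DataFrame `toy_df` and the prefixes "category" and "day" are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data with two categorical columns.
toy_df = pd.DataFrame(
    {
        "value": [37, 157, 721],
        "categorical_feature": ["foo", "other", "bar"],
        "weekday": ["Wednesday", "Friday", "Sunday"],
    }
)

# Encode both columns in one call; the original categorical columns are
# replaced by their dummy columns, and other columns are left untouched.
encoded = pd.get_dummies(
    toy_df,
    columns=["categorical_feature", "weekday"],
    prefix={"categorical_feature": "category", "weekday": "day"},
    dtype=int,
)
print(list(encoded.columns))
```

Without prefixes, two categorical columns that share a value would produce dummy columns with the same name, so a prefix per column keeps the result unambiguous.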


#### Don't fall for the "<a href="http://www.algosome.com/articles/dummy-variable-trap-regression.html" target="_blank">Dummy Variable Trap</a>":
For regression models it is often necessary to drop one category from the dummy features and treat it as the "default" (baseline) option. If every category keeps its own column, the columns always sum to 1, so any one of them can be predicted exactly from the others (perfect multicollinearity), which causes problems for models that also fit an intercept. Pandas <code>get_dummies</code> includes the option to exclude the first category:

In [5]:
dummies_drop_first = pd.get_dummies(df["categorical_feature"], drop_first=True)
dummies_drop_first.tail()

Unnamed: 0,foo,other
315,1,0
316,0,1
317,0,0
318,1,0
319,0,1
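The redundancy behind the trap can be seen numerically. This sketch uses a made-up Series and NumPy's `matrix_rank` to compare a design matrix (intercept plus dummies) with and without `drop_first=True`; the helper `with_intercept` is hypothetical and exists only for this illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data.
toy = pd.Series(["foo", "foo", "other", "other", "bar", "bar"])

full = pd.get_dummies(toy, dtype=float)
reduced = pd.get_dummies(toy, drop_first=True, dtype=float)


def with_intercept(dummies):
    """Prepend a column of ones, as a regression with an intercept would."""
    intercept = np.ones((len(dummies), 1))
    return np.hstack([intercept, dummies.to_numpy()])


X_full = with_intercept(full)        # intercept + bar + foo + other
X_reduced = with_intercept(reduced)  # intercept + foo + other

rank_full = np.linalg.matrix_rank(X_full)
rank_reduced = np.linalg.matrix_rank(X_reduced)

print(X_full.shape[1], rank_full)        # 4 columns, rank 3
print(X_reduced.shape[1], rank_reduced)  # 3 columns, rank 3
```

With all three dummies the matrix has four columns but only rank 3, because the dummy columns add up to the intercept column. After dropping the first category the matrix has full column rank, and a row of all zeros simply means "bar".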


#### Wrap this all up in one line of code:

In [6]:
final_df = df.join(pd.get_dummies(df["categorical_feature"], prefix="category", prefix_sep="_", drop_first=True))
final_df.tail()

Unnamed: 0,Date,feature_1,feature_2,feature_3,feature_4,categorical_feature,weekday,category_foo,category_other
315,13/12/2017,657,1079,2128,454,foo,Wednesday,1,0
316,14/12/2017,968,1155,2116,163,other,Thursday,0,1
317,15/12/2017,820,1212,1934,290,bar,Friday,0,0
318,16/12/2017,1201,966,2475,497,foo,Saturday,1,0
319,17/12/2017,576,485,1809,285,other,Sunday,0,1
