## What is the Problem with Categorical Data?


Some algorithms can work with categorical data directly.

For example, a decision tree can be learned directly from categorical data with no data transform required (this depends on the specific implementation).

Many machine learning algorithms, however, cannot operate on label data directly. They require all input and output variables to be numeric.

In general, this is mostly a constraint imposed by efficient implementations of machine learning algorithms rather than a hard limitation of the algorithms themselves.

This means that categorical data must be converted to a numerical form. If the categorical variable is an output variable, you may also want to convert predictions by the model back into a categorical form in order to present them or use them in some application.
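A minimal sketch of that round trip for an output variable, assuming pandas and NumPy are available. The label values and the `predicted_codes` array are made up for illustration; they stand in for whatever a real model would return.

```python
import numpy as np
import pandas as pd

# Hypothetical class labels for an output variable.
labels = ["cat", "dog", "cat", "bird"]

# factorize assigns an integer code to each distinct value,
# in order of first appearance, and returns the distinct values.
codes, uniques = pd.factorize(labels)

# Stand-in for integer predictions (not produced by a real model).
predicted_codes = np.array([2, 0, 1])

# Index into the distinct values to recover the categorical form.
predicted_labels = [str(label) for label in uniques[predicted_codes]]

print(codes.tolist())
print(predicted_labels)
```

Keeping `uniques` alongside the model is what makes it possible to present predictions as category names later.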

## How to Convert Categorical Data to Numerical Data?
This involves two steps:

1. Integer Encoding
2. One-Hot Encoding

### 1. Integer Encoding


As a first step, each unique category value is assigned an integer value.

For example, “red” is 1, “green” is 2, and “blue” is 3.

This is called a label encoding or an integer encoding and is easily reversible.
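The mapping and its reversal can be sketched in plain Python. The 1-based codes follow the "red", "green", "blue" example; the `observations` list is made-up sample data.

```python
# Category values from the example; codes start at 1 to match the text.
colors = ["red", "green", "blue"]

# Forward mapping: category -> integer.
color_to_int = {color: i for i, color in enumerate(colors, start=1)}

# Reverse mapping: integer -> category. This is what makes the encoding reversible.
int_to_color = {i: color for color, i in color_to_int.items()}

observations = ["blue", "red", "red", "green"]  # made-up sample data
encoded = [color_to_int[c] for c in observations]
decoded = [int_to_color[i] for i in encoded]

print(encoded)
print(decoded)
```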

For some variables, this may be enough.

The integer values have a natural ordered relationship with each other, and machine learning algorithms may be able to understand and harness this relationship.

For example, an ordinal variable like the “place” example above is a case where a label encoding would be sufficient.
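One way to make sure the integers respect the intended ranking is an ordered pandas categorical. This is only a sketch: the values "first", "second" and "third" are assumed here for a "place" variable, and the sample data is invented.

```python
import pandas as pd

# Assumed values for a "place" variable. The order is stated explicitly so the
# integer codes follow the ranking rather than the order of appearance in the data.
place_order = ["first", "second", "third"]
places = pd.Series(["third", "first", "second", "first"])

ordered = pd.Categorical(places, categories=place_order, ordered=True)

# .codes holds 0-based integers that respect the declared order.
place_codes = ordered.codes.tolist()

print(place_codes)
```

Note that the codes here start at 0 rather than 1; only their relative order matters to the model.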

### 2. One-Hot Encoding

For categorical variables where no such ordinal relationship exists, the integer encoding is not enough.

In fact, using this encoding and allowing the model to assume a natural ordering between categories may result in poor performance or unexpected results (predictions halfway between categories).
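A small illustration of the "halfway between categories" problem, reusing the 1, 2, 3 codes from the color example. Treating the codes as numbers implies that "green" sits midway between "red" and "blue", a relationship that does not exist among the colors themselves.

```python
color_to_int = {"red": 1, "green": 2, "blue": 3}
int_to_color = {i: c for c, i in color_to_int.items()}

# A numeric model may effectively average codes, for example when it is
# unsure whether an observation is "red" or "blue".
midpoint = (color_to_int["red"] + color_to_int["blue"]) / 2

print(midpoint)
print(int_to_color[int(midpoint)])
```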

In this case, a one-hot encoding can be applied to the integer representation. This is where the integer encoded variable is removed and a new binary variable is added for each unique integer value.

In the “color” variable example, there are 3 categories and therefore 3 binary variables are needed. A “1” value is placed in the binary variable for the color and “0” values for the other colors.
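The step from integer codes to binary variables can be sketched in plain Python. The helper `one_hot` and the `encoded` sample are hypothetical names and data introduced for this illustration.

```python
# Integer codes from the "color" example: red=1, green=2, blue=3.
categories = ["red", "green", "blue"]
encoded = [1, 3, 2, 1]  # made-up integer-encoded observations


def one_hot(code, n_categories):
    """Return a list with a 1 in the slot for `code` (1-based) and 0 elsewhere."""
    return [1 if position == code else 0
            for position in range(1, n_categories + 1)]


one_hot_rows = [one_hot(code, len(categories)) for code in encoded]

# Decoding: the position of the 1 identifies the category.
decoded = [categories[row.index(1)] for row in one_hot_rows]

print(one_hot_rows)
print(decoded)
```

Each row contains exactly one 1, so no ordering between the colors is implied.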

For example, pandas `get_dummies` can one-hot encode the `Gender` and `Smoker` columns of a small DataFrame:

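```python
import pandas as pd

df = pd.DataFrame({'Name': ['John Smith', 'Mary Brown'],
                   'Gender': ['M', 'F'],
                   'Smoker': ['Y', 'N']})

df
```

|   | Name       | Gender | Smoker |
|---|------------|--------|--------|
| 0 | John Smith | M      | Y      |
| 1 | Mary Brown | F      | N      |

```python
# One binary column is created per category value. dtype=int keeps the 0/1
# output shown below (recent pandas versions return True/False by default).
df_with_dummies = pd.get_dummies(df, columns=['Gender', 'Smoker'], dtype=int)

df_with_dummies
```

|   | Name       | Gender_F | Gender_M | Smoker_N | Smoker_Y |
|---|------------|----------|----------|----------|----------|
| 0 | John Smith | 0        | 1        | 0        | 1        |
| 1 | Mary Brown | 1        | 0        | 1        | 0        |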
