# Overview

One of the challenges people run into when first using scikit-learn for classification or regression problems is how to handle categorical features (e.g. a 'City' feature with 'New York', 'London', etc. as values).

This notebook acts as both a reference and a guide for the questions on this topic that came up in this [Kaggle thread](https://www.kaggle.com/c/titanic/forums/t/5379/handling-categorical-data-with-sklearn?page=2). The questions addressed at the end are:

* Scikit-learn doesn't like categorical features as strings, like 'female', it needs numbers. How do I encode this?
* How do I handle categorical features with more than two categories?
* How can I save the categorical encoding for reuse on the prediction set?
* What if I have a categorical feature that's already numbers (e.g. 1 through 6). Do I need to encode them as binary features (i.e. 0 and 1)?
* Is encoding as 1 and 2 equivalent to encoding as 0 and 1?
* What's the difference between pandas.get_dummies() and sklearn.preprocessing.LabelEncoder()?
* If I have a categorical column like "City" with four unique values ("New York", "New Delhi", "Moscow" and "London"), and I want four columns like "is_New York", "is_New Delhi", etc., holding 1 for yes and 0 for no, how should I proceed?

# Sample data and libraries

We're using both pandas and numpy to manipulate our data. Here are our imports for that:

In [1]:
import pandas as pd
import numpy as np

Here's the sample data we'll be using for this guide:

In [2]:
data = pd.DataFrame(
        [['female', 'New York', 'low', 4], ['female', 'London', 'medium', 3], ['male', 'New Delhi', 'high', 2]],
        columns=['Gender', 'City', 'Temperature', 'Rating'])
data

|   | Gender | City | Temperature | Rating |
|---|--------|------|-------------|--------|
| 0 | female | New York | low | 4 |
| 1 | female | London | medium | 3 |
| 2 | male | New Delhi | high | 2 |


What can we learn from the sample data? Each feature has different qualities:
* All of the features are categorical data. Most are strings, one is numeric.
* Gender is a binary category. It's either male or female.
* City is a nominal category, because it's not meaningful to order the cities in any way.
* Temperature is an ordinal category, because its values have a meaningful order, i.e. greater-than and less-than comparisons are meaningful.
* Rating is also an ordinal category, and it's already in numeric form. In addition to greater-than and less-than comparisons, in this case math operations like addition and subtraction are meaningful.
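
As an illustrative aside, the nominal/ordinal distinction can also be recorded in pandas itself using categorical dtypes. This is a minimal sketch, not part of the workflow used in the rest of this guide; the variable names are hypothetical and the values simply mirror the sample data.

```python
import pandas as pd

# Hypothetical standalone columns mirroring the sample data.
temperature = pd.Series(['low', 'medium', 'high'])

# An ordered categorical dtype records that low < medium < high.
temp_dtype = pd.CategoricalDtype(categories=['low', 'medium', 'high'], ordered=True)
temperature = temperature.astype(temp_dtype)

# Greater-than comparisons are meaningful for an ordinal category.
is_warmer_than_low = temperature > 'low'

# A nominal category such as City has no order, so it stays unordered.
city = pd.Series(['New York', 'London', 'New Delhi']).astype('category')
```

Comparing an unordered categorical such as `city` with `>` raises a `TypeError`, which matches the idea that ordering cities is not meaningful.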

# Types of encoding

There are two fundamentally different types of encoding for categorical data: numeric encoding and "one-hot" encoding.

## Numeric encoding
The Rating feature is already numerically encoded. But how do we do that for the City feature? Scikit-learn provides a transformer for that, called LabelEncoder:

In [3]:
from sklearn.preprocessing import LabelEncoder

data['City_encoded'] = LabelEncoder().fit_transform(data['City'])
data[['City', 'City_encoded']] # special syntax to get just these two columns

|   | City | City_encoded |
|---|------|--------------|
| 0 | New York | 2 |
| 1 | London | 0 |
| 2 | New Delhi | 1 |


We can take a closer look at what the LabelEncoder is doing (and keep it to apply to a prediction dataset) as follows:

In [4]:
encoder = LabelEncoder()
encoder.fit(data['City'])
encoder.classes_

array(['London', 'New Delhi', 'New York'], dtype=object)

The position of each class in the list above is its numeric value ('London' is 0, 'New Delhi' is 1, 'New York' is 2). Transformation is then as follows:

In [5]:
data['City_encoded'] = encoder.transform(data['City']) # transform as a separate step from fit
data[['City', 'City_encoded']]

|   | City | City_encoded |
|---|------|--------------|
| 0 | New York | 2 |
| 1 | London | 0 |
| 2 | New Delhi | 1 |
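
Because `fit` and `transform` are separate steps, the fitted encoder can be applied to other data. The following is a minimal sketch of that idea, assuming a hypothetical prediction set (`predict_cities`) that is not part of the sample data; it also shows what happens with a city the encoder has never seen.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

train_cities = pd.Series(['New York', 'London', 'New Delhi'])
encoder = LabelEncoder().fit(train_cities)

# Hypothetical prediction-set values; all of them were seen during fit.
predict_cities = pd.Series(['London', 'London', 'New York'])
predict_encoded = encoder.transform(predict_cities)

# inverse_transform recovers the original strings from the codes.
decoded = encoder.inverse_transform(predict_encoded)

# A value that was not present during fit raises a ValueError.
try:
    encoder.transform(pd.Series(['Moscow']))
    unseen_raises = False
except ValueError:
    unseen_raises = True
```

If the prediction set may contain new categories, you need to decide how to handle them before calling `transform`.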


### Specifying an order

Alternatively, if you know the order you want for an ordinal category (as we do with Temperature), you can map each value to a number explicitly:

In [6]:
data['Temperature_encoded'] = data['Temperature'].map( {'low':0, 'medium':1, 'high':2})
data[['Temperature', 'Temperature_encoded']]

|   | Temperature | Temperature_encoded |
|---|-------------|---------------------|
| 0 | low | 0 |
| 1 | medium | 1 |
| 2 | high | 2 |
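
If you would rather keep the ordering inside a scikit-learn object, one possible alternative to the dictionary mapping is `OrdinalEncoder` with an explicit `categories` list. This is a sketch of a swapped-in technique, not the approach used in this guide; the `temps` frame is a hypothetical stand-in for the sample data.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical data; 'low' appears twice to show the mapping is consistent.
temps = pd.DataFrame({'Temperature': ['low', 'medium', 'high', 'low']})

# The order of the inner list defines the codes: low=0, medium=1, high=2.
ordinal = OrdinalEncoder(categories=[['low', 'medium', 'high']])

# OrdinalEncoder expects 2-D input, hence the double brackets;
# ravel() flattens the single-column result back to 1-D.
temps['Temperature_encoded'] = ordinal.fit_transform(temps[['Temperature']]).ravel()
```

Note that `OrdinalEncoder` returns floating-point codes, whereas the dictionary mapping above yields integers.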


### Binary encoding

Binary features (such as Gender) are a special case of categorical features. Here's a way to encode one that also happens to preserve any missing values as missing:

In [7]:
data['Male'] = data['Gender'].map( {'male':1, 'female':0} )
data[['Gender', 'Male']]

|   | Gender | Male |
|---|--------|------|
| 0 | female | 0 |
| 1 | female | 0 |
| 2 | male | 1 |
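
The sample data has no missing values, so the point about preserving them is easier to see with a small sketch. The `gender` series below is hypothetical; it contains a missing value and a value ('unknown') that is absent from the mapping.

```python
import numpy as np
import pandas as pd

# Hypothetical column with one missing value and one unmapped value.
gender = pd.Series(['female', np.nan, 'male', 'unknown'])

# Values without an entry in the mapping, including NaN, come out as NaN.
male = gender.map({'male': 1, 'female': 0})

missing_mask = male.isna()
```

Because NaN is a floating-point value, the resulting column has a float dtype rather than an integer one.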


## One-hot encoding

One-hot encoding is where you represent each possible value of a category as a separate binary feature. The most straightforward way to do this is with pandas (e.g. with the City feature again):

In [8]:
pd.get_dummies(data['City'], prefix='City')

|   | City_London | City_New Delhi | City_New York |
|---|-------------|----------------|---------------|
| 0 | 0 | 0 | 1 |
| 1 | 1 | 0 | 0 |
| 2 | 0 | 1 | 0 |


To concatenate these new features with the existing data, do the following:

In [9]:
data = pd.concat([data, pd.get_dummies(data['City'], prefix='City')], axis=1)
data[['City', 'City_London', 'City_New Delhi', 'City_New York']]

|   | City | City_London | City_New Delhi | City_New York |
|---|------|-------------|----------------|---------------|
| 0 | New York | 0 | 0 | 1 |
| 1 | London | 1 | 0 | 0 |
| 2 | New Delhi | 0 | 1 | 0 |
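
One thing to watch for is that `get_dummies` creates columns only for the values it sees, so a prediction set can end up with different columns than the training set. The sketch below shows one possible way to align them with `reindex`; the `predict` frame and its 'Moscow' value are hypothetical, and `dtype=int` is passed so the output is 0/1 regardless of pandas version.

```python
import pandas as pd

train = pd.DataFrame({'City': ['New York', 'London', 'New Delhi']})
predict = pd.DataFrame({'City': ['London', 'Moscow']})  # hypothetical prediction set

train_dummies = pd.get_dummies(train['City'], prefix='City', dtype=int)
predict_dummies = pd.get_dummies(predict['City'], prefix='City', dtype=int)

# Reuse the training columns: columns missing from the prediction set are
# filled with 0, and columns unseen in training (City_Moscow) are dropped.
predict_aligned = predict_dummies.reindex(columns=train_dummies.columns, fill_value=0)
```

With this approach an unseen city such as 'Moscow' becomes a row of all zeros, which may or may not be what you want for your model.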


# Questions and answers
These should be addressed by the material above. Nonetheless, I'll answer the questions here.

## Q: Scikit-learn doesn't like categorical features as strings, like 'female', it needs numbers. How do I encode this?
Use either numeric encoding or one-hot encoding from the above sections. As a guide, you'll want to use one-hot encoding when there's no inherent order to your category, and numeric encoding otherwise.

## Q: How do I handle categorical features with more than two categories?
That depends on which encoding suits the feature best. See the first question.

## Q: How can I save the categorical encoding for reuse on the prediction set?
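`LabelEncoder` learns its mapping from the data it is fitted on, so its output depends on the input data. You save the encoding simply by performing the `fit` on one set of data, keeping the fitted encoder, and then calling `transform` on as many datasets as you like. See the "Numeric encoding" section. Note that the columns produced by `get_dummies` also depend on the values present in its input, so check that the prediction set ends up with the same columns as the training set.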
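
If the prediction step runs in a separate session, the fitted encoder has to be stored somewhere in between. A minimal sketch using Python's standard `pickle` module follows; it serializes to bytes in memory to stay self-contained, and the variable names are hypothetical.

```python
import pickle
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder().fit(['New York', 'London', 'New Delhi'])

# Serialize the fitted encoder; in practice these bytes would be written to a file.
saved = pickle.dumps(encoder)

# Later (e.g. in the prediction script), restore it and transform new data.
restored = pickle.loads(saved)
codes = restored.transform(['New Delhi', 'London'])
```

A pickled scikit-learn object is generally only safe to load with the same scikit-learn version that created it.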

## Q: What if I have a categorical feature that's already numbers (e.g. 1 through 6). Do I need to encode them as binary features (i.e. 0 and 1)?
It depends on the characteristics of the categorical feature. See the first question.
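
If you decide a numeric category should be one-hot encoded, note that `get_dummies` applied to a whole DataFrame only converts object, string, and category columns by default. A sketch of naming the numeric column explicitly, using a hypothetical `ratings` frame and `dtype=int` for 0/1 output:

```python
import pandas as pd

ratings = pd.DataFrame({'Rating': [4, 3, 2]})

# Numeric columns are skipped by default, so list the column explicitly.
rating_dummies = pd.get_dummies(ratings, columns=['Rating'], dtype=int)
```

The result has one column per distinct rating (`Rating_2`, `Rating_3`, `Rating_4`), and the original `Rating` column is replaced.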

## Q: What's the difference between pandas.get_dummies() and sklearn.preprocessing.LabelEncoder()?
`get_dummies` is for one-hot encoding, `LabelEncoder` is for numeric encoding. Compare the "One-hot encoding" and "Numeric encoding" sections.

## Q: If I have a categorical column like "City" with four unique values ("New York", "New Delhi", "Moscow" and "London"), and I want four columns like "is_New York", "is_New Delhi", etc., holding 1 for yes and 0 for no, how should I proceed?
Use `get_dummies` and specify `prefix='is'`. See the "One-hot encoding" section.
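
For the four cities in the question, that could look like the sketch below. The `cities` series is a hypothetical stand-in for the asker's column, and `dtype=int` is passed so the values are 1 and 0 rather than booleans in newer pandas versions.

```python
import pandas as pd

cities = pd.Series(['New York', 'New Delhi', 'Moscow', 'London'], name='City')

# prefix='is' combined with the default separator '_' gives names like 'is_New York'.
is_city = pd.get_dummies(cities, prefix='is', dtype=int)
```

The columns come out in sorted order: `is_London`, `is_Moscow`, `is_New Delhi`, `is_New York`.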