## What are Categorical Variables? 
Categorical data is non-numeric data that falls into groups, usually drawn from a finite list of values. Hair color is an example: hair color is the categorical variable, and its possible values (its categories) are items such as `["black", "brown", "red"]`.

As with hair color, the values of a categorical variable have no inherent order or ranking.
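
As a minimal sketch, pandas can store such a column with its `category` dtype. The hair-color observations below are made up for illustration, and the variable name `hair` is an assumption rather than something used elsewhere in this notebook.

```python
import pandas as pd

# Hypothetical hair-color observations (illustrative values only)
hair = pd.Series(["black", "brown", "red", "brown", "black"], dtype="category")

# The distinct categories pandas found (sorted lexically, which is not a ranking)
print(list(hair.cat.categories))

# An unordered categorical carries no ranking between its values
print(hair.cat.ordered)
```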

#### Examples of Categorical Variables: 
- Level of education (usually treated as ordinal, since the levels can be ranked)
- College Major
- Car brands
- Gender

### Ordinal Variables
Ordinal variables, on the other hand, do have a ranking order, such as `["high", "low"]` or `["first", "second", "third"]`. 
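
One way to record a ranking in pandas is an ordered `Categorical`. This is a sketch only: the `"medium"` level and the sample values are hypothetical additions for illustration.

```python
import pandas as pd

# Hypothetical ordinal levels, listed from lowest to highest
levels = ["low", "medium", "high"]
ratings = pd.Series(
    pd.Categorical(["high", "low", "medium", "low"], categories=levels, ordered=True)
)

# Ordered categoricals support comparisons that follow the declared ranking
print((ratings > "low").tolist())

# Integer codes reflect the position of each value in `levels`
print(ratings.cat.codes.tolist())
```

Because the order is declared explicitly, integer codes are meaningful here in a way they are not for unordered categories.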

## One Hot Encoding Categorical Variables
Most of the statistical models you will encounter require numeric input. We can use a method called "one hot encoding" to convert our categorical features into numerical input. 

Let's use the Iris dataset again: 

In [1]:
import pandas as pd

# url to get file from
url = "http://mlr.cs.umass.edu/ml/machine-learning-databases/iris/iris.data"

# read the file into a dataframe; note that you can set the column names here as well
iris = pd.read_csv(url, 
                   header=None, 
                   names=['SepalLength', 'SepalWidth', 'PetalLength', 'PetalWidth', 'Class'])

To use the class labels as model input, we need to convert them from categorical to numeric:

In [4]:
iris.Class.unique()

array(['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'], dtype=object)

Intuition tells us that we could simply map a number to each unique class: 

In [None]:
map_dict = {
    "Iris-setosa":1,
    "Iris-versicolor":2,
    "Iris-virginica":6
}

However, this could lead the model to assume that the classes have some order or relationship. For example, with the mapping above, a statistical model would treat Iris-setosa and Iris-versicolor as closer to each other than either is to Iris-virginica, simply because 1 and 2 are closer together than either is to 6.
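
To make the problem concrete, the sketch below measures the gap between the integer codes from the mapping. The helper `code_distance` is hypothetical and exists only for this illustration.

```python
# Integer codes from the mapping shown in the text
map_dict = {"Iris-setosa": 1, "Iris-versicolor": 2, "Iris-virginica": 6}

def code_distance(a, b):
    """Absolute difference between the integer codes of two classes."""
    return abs(map_dict[a] - map_dict[b])

print(code_distance("Iris-setosa", "Iris-versicolor"))     # 1
print(code_distance("Iris-setosa", "Iris-virginica"))      # 5
print(code_distance("Iris-versicolor", "Iris-virginica"))  # 4
```

These gaps come entirely from the arbitrary numbers we chose, not from anything in the data.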

What we can do instead is give each category its own binary column, where:
- 0 = the row does not belong to that category
- 1 = the row belongs to that category
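
The rule above can be sketched in plain Python before reaching for a library. The function name `one_hot` is a hypothetical helper, not part of pandas.

```python
# Class names taken from the Iris data used in this section
categories = ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]

def one_hot(value, categories):
    """Return a list with 1 at the position of `value` and 0 elsewhere."""
    return [1 if value == category else 0 for category in categories]

print(one_hot("Iris-versicolor", categories))  # [0, 1, 0]
```

Any two distinct classes differ in exactly two positions, so no pair of classes looks closer than any other.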

Let's make it easy to see this in action by first taking a dataframe of our categorical features: 

In [5]:
feature_df = pd.DataFrame(iris.Class)

In [7]:
feature_df.head()

|   | Class |
|---|---|
| 0 | Iris-setosa |
| 1 | Iris-setosa |
| 2 | Iris-setosa |
| 3 | Iris-setosa |
| 4 | Iris-setosa |


In [9]:
# dtype=int keeps the 0/1 output shown below; newer pandas versions default to booleans
encoded = pd.get_dummies(feature_df['Class'], dtype=int)

In [10]:
encoded.head()

|   | Iris-setosa | Iris-versicolor | Iris-virginica |
|---|---|---|---|
| 0 | 1 | 0 | 0 |
| 1 | 1 | 0 | 0 |
| 2 | 1 | 0 | 0 |
| 3 | 1 | 0 | 0 |
| 4 | 1 | 0 | 0 |


In [11]:
encoded.tail()

|   | Iris-setosa | Iris-versicolor | Iris-virginica |
|---|---|---|---|
| 145 | 0 | 0 | 1 |
| 146 | 0 | 0 | 1 |
| 147 | 0 | 0 | 1 |
| 148 | 0 | 0 | 1 |
| 149 | 0 | 0 | 1 |


We can see that `pd.get_dummies()` labels each row with a 1 in the column for the category it belongs to and a 0 in the columns for the categories it doesn't. The result has three numeric columns that we can use instead of the original class labels.
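
A common next step is to swap the label column for the encoded columns. The sketch below shows one way to do that with `pd.concat`, using a small stand-in dataframe with a few illustrative rows so that it runs without downloading anything. The names `sample` and `model_input` are assumptions for this example.

```python
import pandas as pd

# Small stand-in for the iris dataframe (illustrative rows, two columns only)
sample = pd.DataFrame({
    "SepalLength": [5.1, 7.0, 6.3],
    "Class": ["Iris-setosa", "Iris-versicolor", "Iris-virginica"],
})

# One-hot encode the class column; dtype=int gives 0/1 instead of booleans
encoded = pd.get_dummies(sample["Class"], dtype=int)

# Replace the original label column with the encoded columns
model_input = pd.concat([sample.drop(columns="Class"), encoded], axis=1)

print(model_input.columns.tolist())
```

Every column in `model_input` is numeric, which is the form most models expect.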