# Working with Data

#### Part of the [Inquiryum Machine Learning Fundamentals Course](http://inquiryum.com/machine-learning/)

In the examples we have been working with so far, all the feature columns had numeric data. For example, the iris classification data looked like:
    
    
Sepal Length|Sepal Width|Petal Length|Petal Width|Class
:--: | :--: |:--: |:--: |:--: 
5.3|3.7|1.5|0.2|Iris-setosa
5.0|3.3|1.4|0.2|Iris-setosa
5.0|2.0|3.5|1.0|Iris-versicolor
5.9|3.0|4.2|1.5|Iris-versicolor
6.3|3.4|5.6|2.4|Iris-virginica
6.4|3.1|5.5|1.8|Iris-virginica

Notice that all the feature columns had numeric data. In addition to **numeric** data, datasets often contain **categorical data**. A column contains **categorical data** when its values come from a limited set of possibilities. For example:



Movie | Tomato Rating | Genre | Rating | Length 
:---: | :---: | :---: | :---: | :---:  
First Man | 88 | Drama | PG-13 | 138
Can You Ever Forgive Me | 98 | Drama | R | 107
The Girl in the Spider's Web | 41 | Drama | R | -99
Free Solo | 99 | Documentary | PG-13 | 97
The Grinch | 57 | Animation | PG | 86
Overlord | 80 | Action | R | 109
Christopher Robin | 71 | Comedy | PG | -99
Ant Man and the Wasp  |  88 | Science Fiction | PG-13 | 118

Here, the `Genre` and `Rating` columns contain categorical data: their values are not numeric and they come from a limited set of possibilities. Most machine learning algorithms are designed to handle numeric data so, as a preprocessing step, we need to convert the categorical columns to numeric ones. One solution would be simply to convert each category to an integer, so drama is 1, documentary is 2, and so on:

Index | Genre
:--: | :--:
1 | Drama
2 | Documentary
3 | Animation
4 | Action
5 | Comedy
6 | Science Fiction

Using this scheme, and a similar one for the ratings (PG-13 is 1, R is 2, PG is 3), we can convert the original data to:

Movie | Tomato Rating | Genre | Rating | Length 
:---: | :---: | :---: | :---: | :---:  
First Man | 88 | 1 | 1 | 138
Can You Ever Forgive Me | 98 | 1 | 2 | 107
The Girl in the Spider's Web | 41 | 1 | 2 | -99
Free Solo | 99 | 2 | 1 | 97
The Grinch | 57 | 3 | 3 | 86
Overlord | 80 | 4 | 2 | 109
Christopher Robin | 71 | 5 | 3 | -99
Ant Man and the Wasp  |  88 | 6 | 1 | 118
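As a rough sketch, the integer scheme above could be applied in pandas as follows. The DataFrame `movies` and the dictionaries `genre_codes` and `rating_codes` are names made up for this illustration, and the data is typed in by hand from the table rather than loaded from a file.

```python
import pandas as pd

# Hand-typed copy of the categorical columns of the movie table (illustrative only)
movies = pd.DataFrame({
    'Movie': ['First Man', 'Can You Ever Forgive Me', "The Girl in the Spider's Web",
              'Free Solo', 'The Grinch', 'Overlord', 'Christopher Robin',
              'Ant Man and the Wasp'],
    'Genre': ['Drama', 'Drama', 'Drama', 'Documentary', 'Animation',
              'Action', 'Comedy', 'Science Fiction'],
    'Rating': ['PG-13', 'R', 'R', 'PG-13', 'PG', 'R', 'PG', 'PG-13'],
})

# One integer per category, matching the tables above
genre_codes = {'Drama': 1, 'Documentary': 2, 'Animation': 3,
               'Action': 4, 'Comedy': 5, 'Science Fiction': 6}
rating_codes = {'PG-13': 1, 'R': 2, 'PG': 3}

# Replace each category label with its integer code
encoded = movies.copy()
encoded['Genre'] = encoded['Genre'].map(genre_codes)
encoded['Rating'] = encoded['Rating'].map(rating_codes)
print(encoded)
```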

Numeric columns like `Tomato Rating` and `Length` are fine as is, but simply converting the category values of `Genre` and `Rating` to integers is problematic. Integers imply both an ordering and a distance: 2 is closer to 1 than 4 is. Since in the genre column 1 is drama, 2 is documentary, and 4 is action, our scheme implies that dramas are closer to documentaries than they are to action films, which is clearly not the case. The same problem exists in the rating column. **So clearly this method is not the way to go**!
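To make the distance problem concrete, here is a small sketch that measures how far apart two genres are under the integer codes. The helper `code_distance` is a hypothetical name used only for this illustration.

```python
# Integer codes for the genres, as in the index table above
genre_codes = {'Drama': 1, 'Documentary': 2, 'Animation': 3,
               'Action': 4, 'Comedy': 5, 'Science Fiction': 6}

def code_distance(genre_a, genre_b):
    """Absolute difference between the integer codes of two genres."""
    return abs(genre_codes[genre_a] - genre_codes[genre_b])

# The codes claim drama is three times closer to documentary than to action,
# even though nothing about the genres themselves justifies that.
print(code_distance('Drama', 'Documentary'))
print(code_distance('Drama', 'Action'))
```

Any algorithm that relies on numeric differences between feature values would pick up on these arbitrary gaps.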

### One Hot Encoding
The solution is to do what is called one hot encoding. Our original table looked like:


Movie | Tomato Rating | Genre | Rating | Length 
:---: | :---: | :---: | :---: | :---:  
First Man | 88 | Drama | PG-13 | 138
Can You Ever Forgive Me | 98 | Drama | R | 107
The Girl in the Spider's Web | 41 | Drama | R | -99
Free Solo | 99 | Documentary | PG-13 | 97
The Grinch | 57 | Animation | PG | 86
Overlord | 80 | Action | R | 109
Christopher Robin | 71 | Comedy | PG | -99
Ant Man and the Wasp  |  88 | Science Fiction | PG-13 | 118

So, for example, we had the categorical column `Genre` with the possible values drama, documentary, animation, action, comedy, and science fiction. Instead of one column with those values, we are going to convert it to a form where each value is its own column. If an instance has that value, we put a **one** in that column; otherwise we put a zero. So we would convert

Movie | Genre 
:---: | :---: 
First Man | Drama 
Can You Ever Forgive Me | Drama
The Girl in the Spider's Web |  Drama 
Free Solo |  Documentary 
The Grinch |  Animation 
Overlord |  Action
Christopher Robin |  Comedy
Ant Man and the Wasp  |   Science Fiction

to

Movie | Drama | Documentary | Animation | Action | Comedy | Science Fiction
:--: | :--: | :--: | :--: | :--: | :--: | :--: 
First Man | 1 | 0 | 0| 0| 0 | 0 
Can You Ever Forgive Me | 1 | 0 | 0| 0| 0 | 0 
The Girl in the Spider's Web | 1 | 0 | 0| 0| 0 | 0 
Free Solo | 0 | 1 | 0| 0| 0 | 0 
The Grinch | 0 | 0 | 1| 0| 0 | 0 
Overlord | 0 | 0 | 0| 1| 0 | 0 
Christopher Robin | 0 | 0 | 0| 0| 1 | 0 
Ant Man and the Wasp | 0 | 0 | 0| 0| 0 | 1 
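One possible way to produce such a table is pandas' `get_dummies` function. This is only a sketch: the DataFrame `genres` is typed in by hand for illustration, and `get_dummies` orders the new columns alphabetically, so the column order differs from the table above.

```python
import pandas as pd

# Hand-typed Movie/Genre table (illustrative only)
genres = pd.DataFrame({
    'Movie': ['First Man', 'Can You Ever Forgive Me', "The Girl in the Spider's Web",
              'Free Solo', 'The Grinch', 'Overlord', 'Christopher Robin',
              'Ant Man and the Wasp'],
    'Genre': ['Drama', 'Drama', 'Drama', 'Documentary', 'Animation',
              'Action', 'Comedy', 'Science Fiction'],
}).set_index('Movie')

# One column per genre value; dtype=int gives 0/1 instead of False/True
one_hot = pd.get_dummies(genres['Genre'], dtype=int)
print(one_hot)
```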


This is the preferred way of converting categorical data (when we work with text we will see other options). An added benefit of this approach is that an instance can now belong to multiple categories. For example, we may want to categorize *Ant Man and the Wasp* as both a comedy and science fiction, and that is easy to do in this scheme:

Movie | Drama | Documentary | Animation | Action | Comedy | Science Fiction
:--: | :--: | :--: | :--: | :--: | :--: | :--: 
First Man | 1 | 0 | 0| 0| 0 | 0 
Can You Ever Forgive Me | 1 | 0 | 0| 0| 0 | 0 
The Girl in the Spider's Web | 1 | 0 | 0| 0| 0 | 0 
Free Solo | 0 | 1 | 0| 0| 0 | 0 
The Grinch | 0 | 0 | 1| 0| 0 | 0 
Overlord | 0 | 0 | 0| 1| 0 | 0 
Christopher Robin | 0 | 0 | 0| 0| 1 | 0 
Ant Man and the Wasp | 0 | 0 | 0| 0| 1 | 1 
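If the genres for a movie were stored as a single string with a separator, pandas' `Series.str.get_dummies` could build this kind of multi-category table. The `|` separator and the Series `genres` below are assumptions made for this sketch, not something taken from the dataset.

```python
import pandas as pd

# Assumed storage format: multiple genres joined with '|' (illustrative only)
genres = pd.Series({
    'Overlord': 'Action',
    'Christopher Robin': 'Comedy',
    'Ant Man and the Wasp': 'Comedy|Science Fiction',
})

# Split on '|' and create one 0/1 column per distinct genre
multi_hot = genres.str.get_dummies(sep='|')
print(multi_hot)
```

Here a row can contain more than one 1, which is exactly what the last row of the table above shows.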

If we one-hot encoded all the categorical columns in our original dataset it would look like:

Movie | Tomato Rating | Action | Animation | Comedy | Documentary | Drama | Science Fiction | PG | PG-13 | R | Length
:---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---:
First Man | 88 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 138
Can You Ever Forgive Me | 98 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 107
The Girl in the Spider's Web | 41 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | -99
Free Solo | 99 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 97
The Grinch | 57 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 86
Overlord | 80 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 109
Christopher Robin | 71 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | -99
Ant Man and the Wasp | 88 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 118
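A minimal sketch of encoding every categorical column at once, again with `get_dummies`. The DataFrame `movies` is typed in by hand from the original table. Note two differences from the table above: by default pandas prefixes each new column with the source column name (for example `Genre_Drama`, `Rating_R`), and it places the new columns after the untouched numeric ones.

```python
import pandas as pd

# Hand-typed copy of the original movie table (illustrative only)
movies = pd.DataFrame({
    'Movie': ['First Man', 'Can You Ever Forgive Me', "The Girl in the Spider's Web",
              'Free Solo', 'The Grinch', 'Overlord', 'Christopher Robin',
              'Ant Man and the Wasp'],
    'Tomato Rating': [88, 98, 41, 99, 57, 80, 71, 88],
    'Genre': ['Drama', 'Drama', 'Drama', 'Documentary', 'Animation',
              'Action', 'Comedy', 'Science Fiction'],
    'Rating': ['PG-13', 'R', 'R', 'PG-13', 'PG', 'R', 'PG', 'PG-13'],
    'Length': [138, 107, -99, 97, 86, 109, -99, 118],
}).set_index('Movie')

# Encode only the categorical columns; numeric columns pass through unchanged
encoded = pd.get_dummies(movies, columns=['Genre', 'Rating'], dtype=int)
print(encoded)
```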



### Coding
Let's investigate this a bit with a coding example.


In [3]:
import pandas as pd
bike = pd.read_csv('https://raw.githubusercontent.com/zacharski/ml-class/master/data/bike.csv')
bike = bike.set_index('Day')
bike


Day | Outlook | Temperature | Humidity | Wind | Bike
:--: | :--: | :--: | :--: | :--: | :--:
1 | Sunny | Hot | High | Weak | No
2 | Sunny | Hot | High | Strong | No
3 | Overcast | Hot | High | Weak | Yes
4 | Rain | Mild | High | Weak | Yes
5 | Rain | Cool | Normal | Weak | Yes
6 | Rain | Cool | Normal | Strong | No
7 | Overcast | Cool | Normal | Strong | Yes
8 | Sunny | Mild | High | Weak | No
9 | Sunny | Cool | Normal | Weak | Yes
10 | Rain | Mild | Normal | Weak | Yes


Here we are trying to predict whether someone will mountain bike or not based on the outlook, temperature, humidity, and wind. 
Let's forge ahead and see if we can build a decision tree classifier:

In [None]:
from sklearn import tree
clf = tree.DecisionTreeClassifier(criterion='entropy')
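# The feature columns still hold strings such as 'Sunny', so scikit-learn
# is expected to raise a ValueError here instead of fitting a tree.
clf.fit(bike[['Outlook', 'Temperature', 'Humidity', 'Wind']], bike['Bike'])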
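scikit-learn's decision tree expects numeric features, so the categorical bike columns need to be encoded before fitting. The following is a minimal sketch of one way to do that with one-hot encoding. It assumes only the ten rows displayed above, typed in by hand so the example does not depend on the download, and the name `features` is introduced here for illustration.

```python
import pandas as pd
from sklearn import tree

# Hand-typed copy of the ten bike rows shown above (illustrative only)
bike = pd.DataFrame({
    'Day': range(1, 11),
    'Outlook': ['Sunny', 'Sunny', 'Overcast', 'Rain', 'Rain',
                'Rain', 'Overcast', 'Sunny', 'Sunny', 'Rain'],
    'Temperature': ['Hot', 'Hot', 'Hot', 'Mild', 'Cool',
                    'Cool', 'Cool', 'Mild', 'Cool', 'Mild'],
    'Humidity': ['High', 'High', 'High', 'High', 'Normal',
                 'Normal', 'Normal', 'High', 'Normal', 'Normal'],
    'Wind': ['Weak', 'Strong', 'Weak', 'Weak', 'Weak',
             'Strong', 'Strong', 'Weak', 'Weak', 'Weak'],
    'Bike': ['No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes', 'Yes'],
}).set_index('Day')

# One-hot encode the four categorical feature columns
features = pd.get_dummies(bike[['Outlook', 'Temperature', 'Humidity', 'Wind']], dtype=int)

# The label column can stay as strings; only the features must be numeric
clf = tree.DecisionTreeClassifier(criterion='entropy')
clf.fit(features, bike['Bike'])

# Accuracy on the training rows themselves (not a measure of generalization)
print(clf.score(features, bike['Bike']))
```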