# Supervised Learning

![](https://s-media-cache-ak0.pinimg.com/564x/fe/aa/1a/feaa1a16a315823b2d9ad24da7eccdaf.jpg)

In [None]:
# %load utils/imports.py

import numpy as np
import pandas as pd

from utils import *
from utils.plotting import *

from utils.demo import *
from utils.styles import *

from IPython.display import IFrame

# Vectorisation for Dummies

One problem you will commonly encounter is that some datasets contain columns that aren't numeric. Datasets often include named categories, and linear regression won't work on these out of the box. To deal with this we can use a technique called vectorisation (also known as dummy or one-hot encoding). This converts a single column of categories into one column per category, where a 1 indicates that the row belongs to that category and a 0 indicates that it does not.
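As a minimal sketch of the idea, the transformation can be written by hand with NumPy. The ```toy_colors``` array below is made-up illustration data and is not part of any dataset used in this notebook.

```python
import numpy as np

# Hypothetical toy column of categories (illustration only)
toy_colors = np.array(["red", "green", "blue", "red"])

# The sorted unique categories become the new columns
categories = np.unique(toy_colors)

# Compare every row against every category: 1 where the row matches, else 0
indicators = (toy_colors[:, None] == categories[None, :]).astype(int)

print(categories)
print(indicators)
```

Each row of ```indicators``` contains exactly one 1, in the column of the category that row belongs to.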

We'll use the famous [Iris dataset](http://archive.ics.uci.edu/ml/datasets/Iris) to illustrate vectorisation.

In [None]:
import seaborn as sns  # imported explicitly in case the utils star-imports do not expose it
iris = sns.load_dataset("iris")
grid(iris.sample(5))

If we try to use linear regression to predict the sepal length with our current features, we will get an error, because the ```species``` column contains strings rather than numbers.

In [None]:
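# train_test_split lives in sklearn.model_selection; the old sklearn.cross_validation module was removed
from sklearn.model_selection import train_test_split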

iris_train, iris_test = train_test_split(iris)
X_train = iris_train.drop(['sepal_length'], axis=1)
X_test = iris_test.drop(['sepal_length'], axis=1)
y_train = iris_train[['sepal_length']]
y_test = iris_test[['sepal_length']]

try:
    regr = lin_reg(X_train, y_train, X_test, y_test, graph=False, normalize=True)
except ValueError as e:
    print("ValueError :", e)

Dummies to the rescue!

In [None]:
dummies = pd.get_dummies(iris["species"])
# Add to the original dataframe
iris_with_dummies = pd.concat([iris, dummies], axis=1)
grid(iris_with_dummies.sample(10))

In [None]:
iris_train, iris_test = train_test_split(iris_with_dummies)

X_train = iris_train.drop(['sepal_length', 'species'], axis=1)
X_test = iris_test.drop(['sepal_length', 'species'], axis=1)
y_train = iris_train[['sepal_length']]
y_test = iris_test[['sepal_length']]

regr = lin_reg(X_train, y_train, X_test, y_test, graph=False, normalize=True)

# Scikit-Learn Encoding

Dummies are great, but scikit-learn provides some tools that make life even easier! Scikit-learn's ```LabelEncoder``` fits and transforms labels, normalizing them to integers from 0 to $N-1$, where $N$ is the number of unique labels in your dataset. It also remembers the encoding so that you can recover the original labels from the encoded values later on.

In [None]:
from sklearn.preprocessing import LabelEncoder
colors = ['red', 'green', 'blue', 'red', 'red', 'green']
print("Before:", colors)
le = LabelEncoder()
le.fit(colors)
labels = le.transform(colors)
print("After:", labels)
inverse = le.inverse_transform(labels)
print("Inverse:", inverse)

Unfortunately, label encoding is often not sufficient by itself. If we were to simply use the labels as-is, we would be imparting an ordinal property to the variable. In this case, since ```blue = 0, green = 1, and red = 2```, we would be implying that ```red > green > blue```. Sometimes this is what you want, but often when dealing with categories it is a misleading interpretation of the data.
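To make the ordering problem concrete, here is a small sketch that does arithmetic on the encoded values. The three-color list is toy data chosen for illustration; the point is that a numeric model would see these differences as meaningful even though they are an artifact of alphabetical sorting.

```python
from sklearn.preprocessing import LabelEncoder

# Toy categories (illustration only)
colors = ['red', 'green', 'blue']

le = LabelEncoder().fit(colors)

# Map each class name to the integer code the encoder assigned it
codes = {name: int(code) for name, code in zip(le.classes_, le.transform(le.classes_))}
print(codes)

# A model that treats the codes as numbers sees red as "further" from blue than from green
print("red - blue  =", codes['red'] - codes['blue'])
print("red - green =", codes['red'] - codes['green'])
```

Nothing about the colors themselves justifies those distances, which is why an encoding without an implied order is usually preferable for unordered categories.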

To avoid imparting an ordinal property, we can use another encoding scheme from scikit-learn called the ```OneHotEncoder```. It expects a two-dimensional input with one row per sample and one column per feature, so first we need to convert our flat array of labels into a single column using NumPy's ```vstack```.

In [None]:
from sklearn.preprocessing import OneHotEncoder
label_array = np.vstack(labels)  # shape (n_samples, 1)
print("Before:\n", label_array)
enc = OneHotEncoder()
result = enc.fit_transform(label_array)
print("After:\n", result.toarray())
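As a hedged aside, recent scikit-learn releases (0.20 and later, which is an assumption about your installed version) let ```OneHotEncoder``` work on string categories directly, so the intermediate ```LabelEncoder``` step can be skipped. The sketch below reuses the same toy color list and also shows how to recover the original labels.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

colors = ['red', 'green', 'blue', 'red', 'red', 'green']

# Reshape the flat list into a single column: one row per sample, one feature
color_column = np.array(colors).reshape(-1, 1)

enc = OneHotEncoder()
# fit_transform returns a sparse matrix by default; toarray() makes it dense for printing
one_hot = enc.fit_transform(color_column).toarray()

print("Categories:", enc.categories_)
print("One-hot:\n", one_hot)

# The encoder remembers the mapping, so the original labels can be recovered
recovered = enc.inverse_transform(one_hot).ravel()
print("Recovered:", recovered)
```

The columns of ```one_hot``` follow the order given in ```enc.categories_```.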