# Supervised Learning

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

## Part 1: Preparing the data

Start by loading the data from `data/online_retail.csv`. Don't forget to use the column `CustomerID` as the row index.

In [2]:
customers_ml_data = pd.read_csv("data/online_retail.csv", index_col = "CustomerID")
customers_ml_data.head()


CustomerID,Austria,Belgium,Finland,France,Germany,Italy,Norway,Portugal,Spain,Switzerland,...,balance,max_spent,mean_spent,min_spent,n_orders,total_items,total_items_returned,total_refunded,total_spent,has_returned
12348,0,0,1,0,0,0,0,0,0,0,...,0.383285,1.250693,1.431525,0.513598,-0.134499,2.257327,0.177632,0.250957,0.375547,no
12350,0,0,0,0,0,0,1,0,0,0,...,-0.555081,-0.636253,-0.573046,0.022799,-0.69584,-0.522217,0.177632,0.250957,-0.557665,yes
12352,0,0,0,0,0,0,1,0,0,0,...,-0.148636,0.097227,-0.374642,-0.760775,0.426842,-0.281712,-1.054751,-1.146316,-0.122088,no
12354,0,0,0,0,0,0,0,0,1,0,...,-0.349333,0.121654,1.005066,2.112152,-0.69584,-0.38324,0.177632,0.250957,-0.353048,yes
12356,0,0,0,0,0,0,0,1,0,0,...,1.103279,3.917886,5.490954,3.525167,-0.41517,1.249951,0.177632,0.250957,1.091586,no


### Learning Activity: Get X and y

In order to feed the data into our classification models in scikit-learn, we need to split our dataset into the feature matrix `X` and the target vector `y`. Use `loc` or `iloc` to select all columns except the last one (`has_returned`) for `X` and only the last column (`has_returned`) for `y`.

In [3]:
# get X and y
X = customers_ml_data.iloc[:,:-1] #except the last one "has_returned"
y = customers_ml_data.iloc[:,-1]


In [4]:
# Check the dimensionality of X and y
print("X dimensions: ", X.shape)
print("y dimensions: ", y.shape)


X dimensions:  (3126, 20)
y dimensions:  (3126,)


### Learning Activity - Investigate the y frequencies

An important aspect to understand before applying any classification algorithm is how the output labels are distributed. Are they evenly distributed or not? Imbalances in distribution of labels can often lead to poor classification results for the minority class even if the classification results for the majority class are very good.

Use `value_counts` on `y` to get the frequency of each value.

BONUS: call `.plot(kind="bar")` on the resulting Series to draw the frequencies as a bar plot.

In [None]:
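# Calculate the y frequencies; print them explicitly, because a notebook
# cell only displays the value of its last expression
print(y.value_counts())
y.value_counts().plot(kind="bar")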
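
Raw counts are easier to judge as proportions. The sketch below uses a small made-up label Series (`y_toy` is hypothetical, not the retail data) to show how `value_counts(normalize=True)` gives class fractions, and how the largest fraction doubles as the accuracy of a model that always predicts the majority class.

```python
import pandas as pd

# Hypothetical toy labels standing in for `y`; the real proportions will differ
y_toy = pd.Series(["no"] * 8 + ["yes"] * 2, name="has_returned")

# normalize=True returns fractions instead of raw counts
proportions = y_toy.value_counts(normalize=True)
print(proportions)

# Accuracy obtained by always predicting the most frequent class
majority_baseline = proportions.max()
print("Majority-class baseline accuracy:", majority_baseline)
```

Any classifier trained later should be compared against this baseline rather than against 50%.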


### Learning Activity: Encode categorical values

In our current dataset, the y values are categorical (i.e. they take one of a discrete set of values) and have a non-numeric representation, `yes` vs. `no`. This can be problematic for some scikit-learn and plotting functions, which expect numerical values, so we map the text categories to numbers. To do so, use the `.map()` method, which takes a dictionary whose keys are the values to transform and whose values are the replacements (check the documentation for more details).

In [None]:
# Convert the categorical values of y into numbers using map

y = y.map({"yes": 1, "no": 0})


In [None]:
# print the y frequencies now
y.value_counts()
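
One thing to watch with `.map()` and a dictionary: any label missing from the dictionary becomes `NaN` without raising an error. A minimal sketch on made-up labels (`y_toy` and the stray `"Yes"` are hypothetical) shows how to detect this:

```python
import pandas as pd

# Hypothetical labels; "Yes" (capitalised) is deliberately not in the mapping
y_toy = pd.Series(["yes", "no", "Yes", "no"])
mapping = {"yes": 1, "no": 0}

y_encoded = y_toy.map(mapping)

# Labels without a dictionary entry are turned into NaN
n_unmapped = int(y_encoded.isna().sum())
print(y_encoded)
print("Unmapped labels:", n_unmapped)
```

If `n_unmapped` is not zero for the real data, the mapping dictionary needs more entries before the labels are passed to a classifier.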


### Learning Activity - Split the data into training and test sets

Training and testing a classification model on the same dataset is a methodological mistake: a model that would just repeat the labels of the samples that it has just seen would have a perfect score but would fail to predict anything useful on yet-unseen data (poor generalisation). To use different datasets for training and testing, we need to split the dataset into two disjoint sets: train and test (**Holdout method**) using the `train_test_split` function.

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
# Split the raw data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)


In [None]:
# Print the dimensionality of the individual splits
print("X_train dimensions: ", X_train.shape)
print("y_train dimensions: ", y_train.shape)
print("X_test dimensions: ", X_test.shape)
print("y_test dimensions: ", y_test.shape)
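
When no size is given, `train_test_split` holds out 25% of the rows for testing. With imbalanced labels it can also help to pass `stratify`, so that both splits keep roughly the same class proportions. The sketch below illustrates this on synthetic data (`X_toy` and `y_toy` are made up); it is an optional variation, not the split used in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 200 rows, 20% positive labels
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
y_toy = np.array([0] * 160 + [1] * 40)

# stratify=y_toy keeps the class ratio similar in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, stratify=y_toy, random_state=1
)

print("Train positive rate:", y_tr.mean())
print("Test positive rate: ", y_te.mean())
```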


## Part 2: Training a model

### Learning Activity - Train, optimise and test a KNN algorithm with scikit-learn

To build KNN models using scikit-learn, you will be using the `KNeighborsClassifier` object, which allows you to set the value of K using the `n_neighbors` parameter (http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html). For every classification model built with scikit-learn, we will follow four main steps: 

1. **Building** the classification model (using either default, pre-defined or optimised parameters), 
2. **Training** the model, 
3. **Testing** the model, and 
4. **Performance evaluation** using various metrics.  
<br/>
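
As a rough illustration, the four steps can be sketched end to end on synthetic data. The dataset from `make_classification` and the choice of `n_neighbors=5` are assumptions made only for this example.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the customer data
X_toy, y_toy = make_classification(n_samples=300, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, random_state=1)

model = KNeighborsClassifier(n_neighbors=5)  # 1. build the model
model.fit(X_tr, y_tr)                        # 2. train on the training split
y_hat = model.predict(X_te)                  # 3. test on the held-out split
acc = accuracy_score(y_te, y_hat)            # 4. evaluate the predictions

print("Accuracy:", round(acc, 2))
```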

The optimal choice for the value K is highly data-dependent: in general a larger K decreases the effects of noise, but makes the classification boundaries less distinct (risk of underfitting). Rather than trying predefined values of K one by one, we can automate this process. The scikit-learn library provides the grid search class `GridSearchCV` (http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html), which exhaustively searches for the best combination of parameters by training and evaluating a model for every combination provided. Further details and examples on grid search with scikit-learn can be found at http://scikit-learn.org/stable/modules/grid_search.html.

You can use `GridSearchCV` with the validation technique of your choice (in this example, 10-fold cross-validation) to search for a parameterisation of the KNN algorithm that gives a better model:

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Create the dictionary of given parameters
n_neighb   = np.arange(1, 101, 2)  
parameters = [{'n_neighbors': n_neighb}] 

# The actual grid search can take a bit of time:
# Optimise and build the model with GridSearchCV
gridCV = GridSearchCV(KNeighborsClassifier(), parameters, cv=10)
gridCV.fit(X_train, y_train) 


In [None]:
# Report the optimal parameters
bestNeighb = gridCV.best_params_['n_neighbors']
print("Best parameters: n_neighbors =", bestNeighb)
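
Besides `best_params_`, a fitted `GridSearchCV` object exposes `best_score_` and `cv_results_`, which show how close the other candidates came. The sketch below runs a smaller search on synthetic data (the dataset, the K range and `cv=5` are assumptions chosen to keep the example fast) and tabulates the mean cross-validated accuracy per K.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the training data
X_toy, y_toy = make_classification(n_samples=300, n_features=5, random_state=0)

# Smaller grid than in the notebook: odd K from 1 to 21
grid = GridSearchCV(KNeighborsClassifier(), {"n_neighbors": np.arange(1, 22, 2)}, cv=5)
grid.fit(X_toy, y_toy)

# cv_results_ is a dict of arrays; a DataFrame makes it easier to read
results = pd.DataFrame(grid.cv_results_)[
    ["param_n_neighbors", "mean_test_score", "std_test_score"]
]
print(results.sort_values("mean_test_score", ascending=False).head())
print("Best K:", grid.best_params_["n_neighbors"])
print("Best mean CV accuracy:", round(grid.best_score_, 3))
```

If several values of K score almost identically, the reported optimum is not very meaningful on its own and the standard deviation column is worth a look.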

When evaluating the resulting model, it is important to use held-out samples that were not seen during the grid search (`X_test`).

So we now evaluate the optimised model on the independent `X_test` set:

In [None]:
# Build the classifier using the optimal parameters detected by grid search
knn = KNeighborsClassifier(n_neighbors=bestNeighb)


In [None]:
# Fit to the training set...
knn.fit(X_train, y_train)


In [None]:
# .. and predict the test data..
y_pred = knn.predict(X_test)


In [None]:
from sklearn.metrics import accuracy_score

# Report the final overall accuracy
print("Overall Accuracy:", round(accuracy_score(y_test, y_pred), 2))
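
Rebuilding the classifier by hand is a useful exercise, but not strictly required: with the default `refit=True`, `GridSearchCV` retrains the best model on the full training set and exposes it as `best_estimator_`. The following sketch on synthetic data (dataset and K grid are assumptions for the example) checks that both routes give the same predictions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data
X_toy, y_toy = make_classification(n_samples=300, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, random_state=1)

grid = GridSearchCV(KNeighborsClassifier(), {"n_neighbors": [1, 3, 5, 7, 9]}, cv=5)
grid.fit(X_tr, y_tr)

# Route 1: predict directly with the refitted best estimator
pred_grid = grid.predict(X_te)

# Route 2: rebuild and refit manually with the best K
manual = KNeighborsClassifier(n_neighbors=grid.best_params_["n_neighbors"])
manual.fit(X_tr, y_tr)
pred_manual = manual.predict(X_te)

print("Same predictions:", np.array_equal(pred_grid, pred_manual))
```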


Check the classification report for your model. The `classification_report` function lives in `sklearn.metrics` and reports precision, recall and F1-score for each class.

In [None]:
from sklearn.metrics import classification_report

print(classification_report(y_test, y_pred))
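
The precision and recall figures in the report are derived from the confusion matrix. As a small worked example on made-up labels (`y_true_toy` and `y_pred_toy` are hypothetical, not model output), `confusion_matrix` from `sklearn.metrics` returns the four counts for a binary problem:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels (1 = has returned, 0 = has not)
y_true_toy = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred_toy = [0, 0, 1, 0, 1, 0, 1, 0, 0, 0]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true_toy, y_pred_toy)
tn, fp, fn, tp = (int(v) for v in cm.ravel())

precision = tp / (tp + fp)  # share of predicted positives that are correct
recall = tp / (tp + fn)     # share of actual positives that are found

print(cm)
print("Precision (class 1):", round(precision, 2))
print("Recall (class 1):", round(recall, 2))
```

Here half of the actual positives are missed even though 7 of the 10 predictions are correct, which is the kind of minority-class weakness that overall accuracy can hide.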