# Introduction to Machine Learning with Python 


## Module 2

### Learning Activity 1: Load the required libraries


In [2]:
import scipy
import numpy as np
import pandas as pd
import plotly.plotly as py

import visplots

from plotly.graph_objs import *
from plotly.offline import download_plotlyjs, init_notebook_mode, iplot
from sklearn import preprocessing, metrics
from sklearn.cross_validation import train_test_split, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.grid_search import GridSearchCV, RandomizedSearchCV
from scipy.stats.distributions import randint

init_notebook_mode()

print("libraries all imported, ready to go")

libraries all imported, ready to go


### Learning Activity 2: Importing the data

The dataset we will be using throughout this workshop is an adapted version of the Wine Quality case study, available from the UCI Machine Learning repository (https://archive.ics.uci.edu/ml/datasets/Wine+Quality). The goal of this case study is to classify wines as "low" or "high" quality based on physicochemical tests (such as fixed and volatile acidity, citric acid, etc.).


The first thing you will need to do in order to work with the wine quality dataset is to read the contents from the provided `wine_quality.csv` data file using the `read_csv` command. You should also try to explore the first few rows of the imported wine DataFrame using the `head` function from the `pandas` package (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.head.html):

In [4]:
# Import the data and explore the first few rows
# (the path assumes wine_quality.csv sits in the data/ folder)
wineQ = pd.read_csv('data/wine_quality.csv')
header = wineQ.columns.values
wineQ.head()

In order to feed the data into scikit-learn's classification models, the imported wine quality DataFrame needs to be converted into a `numpy` array. For more information on numpy arrays, see http://scipy-lectures.github.io/intro/numpy/array_object.html. 

In addition, it is **always** good practice to check the dimensionality of the imported data using the `shape` attribute before constructing any classification model, to make sure you have really imported all the data and imported it in the correct way (e.g. one common mistake is to get the separator wrong and end up with only one column). 

In [3]:
# Convert to numpy array and check the dimensionality
npArray = np.array(wineQ)
print(npArray.shape)

(1487, 11)
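
The separator mistake mentioned above is easy to reproduce. The sketch below uses a small, made-up semicolon-separated table (not the workshop data) to show how the wrong separator collapses everything into a single column, and how `shape` reveals the problem:

```python
import io
import pandas as pd

# Made-up semicolon-separated data, for illustration only
csv_text = "a;b;c\n1;2;3\n4;5;6\n"

# The default separator is a comma, so every line is read as a single column
wrong = pd.read_csv(io.StringIO(csv_text))
# Passing the right separator recovers the three columns
right = pd.read_csv(io.StringIO(csv_text), sep=";")

print(wrong.shape)  # (2, 1)
print(right.shape)  # (2, 3)
```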


### Test Activity 3: Inspect your data by indexing and index slicing

To select elements in an array, you specify their indices with square bracket notation. For a two-dimensional array, the first index indicates the row number and the second index indicates the column number. Try selecting the values of the first and second columns of the first sample in the npArray:

In [4]:
# Print the 1st row and 1st column of npArray
print(npArray[0])
print(npArray[:,0])
print(npArray[0,0])

[7.4 0.7 0.0 1.9 0.076 11.0 34.0 0.9978 3.51 0.56 'low']
[7.4 7.8 7.8 ..., 7.9 7.9 7.1]
7.4


In [5]:
# Print the 1st row and 2nd column of npArray
print(npArray[0])
print(npArray[:,1])
print(npArray[0,1])

[7.4 0.7 0.0 1.9 0.076 11.0 34.0 0.9978 3.51 0.56 'low']
[0.7 0.88 0.76 ..., 0.41 0.44 0.44]
0.7


To select ranges of elements, we use "index slicing". Index slicing is the technical name for the syntax A[lower:upper], where lower refers to the lower bound index that is included, and upper refers to the upper bound index that is not included. Try selecting the first three samples (rows):

In [6]:
# Print the first 3 rows of npArray
print(npArray[:3])

[[7.4 0.7 0.0 1.9 0.076 11.0 34.0 0.9978 3.51 0.56 'low']
 [7.8 0.88 0.0 2.6 0.098 25.0 67.0 0.9968 3.2 0.68 'low']
 [7.8 0.76 0.04 2.3 0.092 15.0 54.0 0.997 3.26 0.65 'low']]


and also the first three samples (rows) of the last column:

In [7]:
# Print the first 3 rows from the last column of npArray
# print(npArray[:3,10])
print(npArray[:3,-1])

['low' 'low' 'low']
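
The indexing rules used in this activity can be summarised on a tiny array. This is only a sketch on made-up values (the name `A` is not part of the workshop code), but the same syntax applies to `npArray`:

```python
import numpy as np

# Toy 3x4 array standing in for npArray (values are made up)
A = np.arange(12).reshape(3, 4)

first_row = A[0]        # row 0, all columns
second_col = A[:, 1]    # all rows, column 1
top_left = A[:2, :2]    # rows 0-1 and columns 0-1 (upper bounds excluded)
last_col = A[:, -1]     # negative indices count from the end

print(first_row)   # [0 1 2 3]
print(second_col)  # [1 5 9]
print(top_left)
print(last_col)    # [ 3  7 11]
```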


### Learning Activity 4: Split the data into input features, X, and outputs, y

Subsequently, we need to split our initial dataset into the data matrix X (independent variable) and the associated class vector y (dependent or target variable). The input features, _X_,  are the variables that you use to predict the outcome. In this data set, there are ten input features stored in columns 1-10 (index 0-9, although the upper bound is not included so the range for indexing is 0:10), all of which have continuous values. The output label, _y_, holds the information of whether the wine has been rated as high or low quality, and is stored in the final (eleventh) column (index 10). To split the data, we need to assign the columns of the input features and the columns of the output labels to different arrays:

In [8]:
# Split to input matrix X and class vector y
X = npArray[:,:-1].astype(float)
y = npArray[:,-1]

Try printing the size of the input matrix _X_ and class vector _y_ using the `shape` attribute:

In [9]:
# Print the dimensions of X and y
print(X.shape)
print(y.shape)
print(X[:3])

(1487, 10)
(1487,)
[[  7.40000000e+00   7.00000000e-01   0.00000000e+00   1.90000000e+00
    7.60000000e-02   1.10000000e+01   3.40000000e+01   9.97800000e-01
    3.51000000e+00   5.60000000e-01]
 [  7.80000000e+00   8.80000000e-01   0.00000000e+00   2.60000000e+00
    9.80000000e-02   2.50000000e+01   6.70000000e+01   9.96800000e-01
    3.20000000e+00   6.80000000e-01]
 [  7.80000000e+00   7.60000000e-01   4.00000000e-02   2.30000000e+00
    9.20000000e-02   1.50000000e+01   5.40000000e+01   9.97000000e-01
    3.26000000e+00   6.50000000e-01]]


## Exploratory Data Analysis

Visualisation is an integral part of Data Science. Exploratory data analysis (EDA) is the field dealing with the analysis of data sets as a means of summarising their main characteristics, most often using visual methods.

Plotly is an online collaborative data analysis and graphing tool that we will use in order to construct fully interactive graphs. The Plotly API allows you to access all of the library's interactive functionality directly from Python (or other programming languages such as R, JavaScript and MATLAB, among others). Crucially, Plotly has recently been made **open-source**, which now enables plotting **offline** without requiring access to their API. _Plotly Offline_ brings interactive Plotly graphs to the _offline_ Jupyter (IPython) Notebook environment.


### Learning Activity 5:  Investigate the y frequencies

An important aspect to understand before applying any classification algorithms is how the output labels are distributed. Are they evenly distributed? Imbalances in distribution of labels can often lead to poor classification results for the minority class even if the classification results for the majority class are very good. 

In [10]:
# Print the y frequencies
yFreq = scipy.stats.itemfreq(y)
print(yFreq)

[['high' 500]
 ['low' 987]]
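
`itemfreq` has been deprecated in SciPy, so it may be unavailable in newer installations. A minimal alternative, assuming only `numpy`, is `np.unique` with `return_counts=True`. The label vector below is made up for illustration:

```python
import numpy as np

# Toy label vector standing in for y (made-up values)
labels = np.array(['low', 'high', 'low', 'low', 'high'])

# classes holds the sorted unique labels, counts how often each occurs
classes, counts = np.unique(labels, return_counts=True)
for c, n in zip(classes, counts):
    print(c, n)
```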


In our current dataset, you can see that the _y_ values are categorical (i.e. they can only take one of a discrete set of values) and have a non-numeric representation, "high" vs. "low". This can be problematic for scikit-learn and plotting functions in Python, since they assume numerical values, so we need to map the text categories to numerical representations using `LabelEncoder`  and the `fit_transform` function from the `preprocessing` module:

In [15]:
# Convert the categorical to numeric values, and print the y frequencies
le = preprocessing.LabelEncoder()
y = le.fit_transform(y)

yFreq = scipy.stats.itemfreq(y)
print(yFreq[:,0])

[0 1]
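
It is worth checking which label ended up as 0 and which as 1. `LabelEncoder` sorts the classes, so "high" becomes 0 and "low" becomes 1. The sketch below shows this on a few made-up labels, using the encoder's `classes_` attribute and `inverse_transform` method:

```python
import numpy as np
from sklearn import preprocessing

labels = np.array(['low', 'high', 'low', 'high', 'low'])  # made-up labels

le_toy = preprocessing.LabelEncoder()
encoded = le_toy.fit_transform(labels)

print(le_toy.classes_)                    # position in this array = numeric code
print(encoded)
print(le_toy.inverse_transform(encoded))  # back to the original strings
```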


Visualising the data in some way is a good way to get a feel for how the data is distributed. As a simple example, try plotting the frequencies of the class labels (held in _yFreq_), 1 and 0, and see how they are distributed using a barplot from Plotly:

In [22]:
# Display the y frequencies in a barplot with Plotly
data = [
    Bar(
        x = ['Hi Q','Lo Q'],
        y = [yFreq[0,1], yFreq[1,1]],
        marker = dict(color=['blue','red'])
    )
]

layout = Layout(
    xaxis = dict(title = "Wine Quality"),
    yaxis = dict(title = "Count"),
    width = 500
)

fig = dict(data = data, layout = layout)
iplot(fig)


More examples on Plotly barplots can be found at https://plot.ly/python/bar-charts/. In addition, a full list of arguments on barplots can be found at https://plot.ly/python/reference/#bar/.


### Learning Activity 6: Data scaling

It is usually advisable to scale your data prior to fitting a classification model to avoid attributes with
greater numeric ranges dominating those with smaller numeric ranges. In order to investigate the range and descriptive statistics of our features, we can apply the `describe()` function from `pandas` to the original `wineQ` DataFrame (**_not_** the numpy array!). For instance:

In [23]:
# Print the descriptive statistics of the wineQ DataFrame
wineQ.describe()

Unnamed: 0,fixed_acidity,volatile_acidity,citric_acid,residual_sugar,chlorides,free_sulfur_dioxide,total_sulfur_dioxide,density,pH,sulphates
count,1487.0,1487.0,1487.0,1487.0,1487.0,1487.0,1487.0,1487.0,1487.0,1487.0
mean,8.272966,0.465716,0.303423,2.898588,0.079183,18.746469,61.755884,0.996225,3.293692,0.639388
std,1.784884,0.192836,0.180989,2.198372,0.047974,12.936183,45.799977,0.002646,0.154856,0.180488
min,4.6,0.08,0.0,0.9,0.012,1.0,6.0,0.989,2.74,0.25
25%,7.0,0.3175,0.17,1.9,0.059,9.0,25.0,0.994965,3.19,0.53
50%,7.9,0.44,0.32,2.2,0.076,15.0,47.0,0.99672,3.3,0.61
75%,9.2,0.6,0.43,2.8,0.088,27.0,93.0,0.997905,3.39,0.72
max,15.9,1.33,1.0,15.5,0.611,82.0,289.0,1.0032,3.9,2.0


Boxplots are a powerful visual aid, commonly used in order to investigate simultaneously the range differences of the input features. Boxplots are a standardised way of displaying the distribution of the data based on the "five number summary" (minimum, first quartile, median, third quartile, and maximum). For example, try and plot the features of the _raw_ matrix _X_ using the script for the boxplots:

In [25]:
# Create a boxplot of the raw data
nrow, ncol = X.shape

data = [
    Box(
        y = X[:,i], # vals for box plot
        name = header[i],
        marker = dict(color = "purple"),
    ) for i in range(ncol)
]

layout = Layout(
    xaxis = dict(title = "Feature"),
    yaxis = dict(title = "Value"),
    showlegend = False
)

fig = dict(data = data, layout = layout)

iplot(fig)

There are many ways of scaling but one common scaling mechanism is auto-scaling, where for each
column, the values are centred around the mean and divided by their standard deviation. This scaling
mechanism can be applied by calling the `scale()` function in scikit-learn’s `preprocessing` module.

In [6]:
help(GridSearchCV)

Help on class GridSearchCV in module sklearn.grid_search:

class GridSearchCV(BaseSearchCV)
 |  Exhaustive search over specified parameter values for an estimator.
 |  
 |  Important members are fit, predict.
 |  
 |  GridSearchCV implements a "fit" and a "score" method.
 |  It also implements "predict", "predict_proba", "decision_function",
 |  "transform" and "inverse_transform" if they are implemented in the
 |  estimator used.
 |  
 |  The parameters of the estimator used to apply these methods are optimized
 |  by cross-validated grid-search over a parameter grid.
 |  
 |  Read more in the :ref:`User Guide <grid_search>`.
 |  
 |  Parameters
 |  ----------
 |  estimator : estimator object.
 |      A object of that type is instantiated for each grid point.
 |      This is assumed to implement the scikit-learn estimator interface.
 |      Either estimator needs to provide a ``score`` function,
 |      or ``scoring`` must be passed.
 |  
 |  param_grid : dict or list of dictionaries


In [24]:
# Auto-scale the data
X = preprocessing.scale(X)
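
To see what auto-scaling does, the same operation can be written by hand with `numpy`. This is a sketch on made-up numbers, not a replacement for `preprocessing.scale`; it assumes the population standard deviation (numpy's default), which to our knowledge is also what scikit-learn uses:

```python
import numpy as np

# Made-up data: two features on very different scales
X_toy = np.array([[1.0, 100.0],
                  [2.0, 300.0],
                  [3.0, 500.0]])

# Centre each column on its mean and divide by its standard deviation
X_scaled = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

print(X_scaled.mean(axis=0))  # approximately 0 for every column
print(X_scaled.std(axis=0))   # approximately 1 for every column
```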

Try to re-run the previous plotting script and have a look at the outcome of the boxplot after scaling. Alternatively, 
if you feel more adventurous, you can create an enhanced version of the boxplot. You can find more online examples at https://plot.ly/python/box-plots/, and also a full list of boxplot arguments at https://plot.ly/python/reference/#box.


In [None]:
# Create a boxplot of the scaled data (simple or enhanced)

### Learning Activity 7: Investigate the relationship between input features

You can visualise the relationship between two variables (features) using a simple scatter plot. This step can give you a good first indication of the ML model to apply and its complexity (linear vs. non-linear). At this stage, let’s plot the first two variables against each other:

In [None]:
# Create a scatter plot of the first two features

We can also relate associations between features to their y classifications by making the colour of
the points dependent on the corresponding _y_ value:

In [None]:
# Create an enhanced scatter plot of the first two features


Examples of Plotly scatterplots can be found at https://plot.ly/python/line-and-scatter/ (or for a list of arguments refer to https://plot.ly/python/reference/#scatter/).


### Bonus Activity 8:  Try plotting different combinations of three features (f1, f2, f3) in the same plot.


The scatterplots we have seen so far investigated the relationship between two variables (features). A three-dimensional graph lets you introduce a third axis, typically called the _z_ axis, and can help you understand the relationship between three variables. Plotly's fully interactive functionality allows you to plot, hover, zoom and rotate 3-dimensional scatterplots. For a full list of arguments on 3d plots in Plotly visit https://plot.ly/python/reference/#scatter3d. Other examples on 3D scatterplots using Plotly can be found at https://plot.ly/python/3d-scatter-plots/.

_Hint: Investigate the Scatter3d object from Plotly_

_Axes in 3D Plotly plots work a bit differently than in 2D (axes are bound to a Scene object -- use help(Scene))._


In [None]:
# Create a 3D scatterplot using the first three features

### Bonus Activity 9: Try different combinations of f1 and f2 (in a grid/scatterplot matrix if you can).


A scatterplot matrix shows a grid of scatterplots where each attribute is plotted against all other attributes. For example, try to create a scatterplot matrix of the first four features.
You can find further information on how to create and customise subplots with Plotly at https://plot.ly/python/subplots/.

_Hints: You may want to use nested loops that iterate through the rows and columns of the grid, and also import and make use of the_ `make_subplots()` _function from Plotly_

In [None]:
# Create a grid plot of scatterplots using a combination of features

### Bonus Activity 10: Create a correlation matrix and plot a heatmap of correlations between the input features in X

Often, the different features (variables) in X are not completely independent from each other. For example,
fixed acidity is related to volatile acidity. To quickly identify which features are related and to
what extent, it is useful to see how they are correlated. You can do this by creating a correlation matrix
from X using `corrcoef()` in the `numpy` module:

In [None]:
# Calculate the correlation coefficient
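
One detail to watch for with `corrcoef()`: by default numpy treats each *row* as a variable, whereas in _X_ the features are stored in columns. The sketch below, on made-up data, passes `rowvar=False` so that the result is a features-by-features matrix:

```python
import numpy as np

rng = np.random.RandomState(0)
f1 = rng.normal(size=200)
f2 = 2.0 * f1 + rng.normal(scale=0.1, size=200)  # strongly tied to f1
f3 = rng.normal(size=200)                        # unrelated noise
X_toy = np.column_stack([f1, f2, f3])

# rowvar=False tells numpy that each COLUMN is a variable (feature)
corr = np.corrcoef(X_toy, rowvar=False)

print(corr.shape)         # one row and one column per feature
print(np.round(corr, 2))
```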

To search for linear relationships between features across all pairs of features, you can use a heatmap
of correlations (directly from X), which is simply a grid of cells whose colours represent the
sizes of the correlations:

In [None]:
# Create a heatmap of the correlation coefficients

## Module 3

### Learning Activity 1: Split the data into training and test sets

Training and testing a classification model on the same dataset is a methodological mistake: a model that would just repeat the labels of the samples that it has just seen would have a perfect score but would fail to predict anything useful on yet-unseen data (poor generalisation). To use different datasets for training and testing, we need to split the wine dataset into two disjoint sets: train and test (**Holdout method**) using the `train_test_split` function. <br/> 

In [47]:
# Split into training and test sets
XTrain, XTest, yTrain, yTest = train_test_split(X, y, random_state=1)

XTrain and yTrain are the two arrays you use to train your model. XTest and yTest are the two arrays that you use to evaluate your model. By default, scikit-learn splits the data so that 25% of it is used for testing, but you can also specify the proportion of data you want to use for training and testing.

<br/>You can check the sizes of the different training and test sets by using the `shape` attribute:

In [48]:
# Print the dimensionality of the individual splits
print(XTrain.shape)
print(XTest.shape)
print(yTrain.shape)
print(yTest.shape)

(1115, 10)
(372, 10)
(1115,)
(372,)
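
The proportion held out for testing can be changed with the `test_size` argument. The sketch below uses made-up data and assumes a recent scikit-learn release, where `train_test_split` is imported from `sklearn.model_selection` rather than `sklearn.cross_validation`. It also passes `stratify`, which asks for similar class proportions in both parts:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 100 samples, 3 features, imbalanced labels (70 vs 30)
X_toy = np.arange(300, dtype=float).reshape(100, 3)
y_toy = np.array([0] * 70 + [1] * 30)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy,
    test_size=0.2,     # hold out 20% instead of the default 25%
    stratify=y_toy,    # keep the class proportions similar in both parts
    random_state=1,
)

print(X_tr.shape, X_te.shape)
print(np.bincount(y_tr), np.bincount(y_te))
```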


You can also investigate how the class labels are distributed within the *yTest* vector by using the `itemfreq` function, as before:

In [49]:
# Calculate the frequency of classes in yTest
yTestFreqs = scipy.stats.itemfreq(yTest)
print(yTestFreqs)

[[  0 129]
 [  1 243]]


We can see that 129 random samples of class 0 (high quality) and 243 random samples of class 1 (low quality) are included in the yTest set.


### Learning Activity 2: Apply KNN classification algorithm with scikit-learn

To build KNN models using scikit-learn, you will be using the `KNeighborsClassifier` object, which allows you to set the value of K using the `n_neighbors` parameter (http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html). The optimal choice for the value K is highly data-dependent: in general a larger K suppresses the effects of noise, but makes the classification boundaries less distinct. <br/>


For every classification model built with scikit-learn, we will follow four main steps: 1) **Building** the classification model (using either default, pre-defined or optimised parameters), 2) **Training** the model with data, 3) **Testing** the model, and 4) **Performance evaluation** using various metrics. <br/> <br/>

We are going to start by trying two pre-defined random values of K and compare their performance. Let us start with a small number of K such as K=3.

In [65]:
# Build a KNN classifier with 3 nearest neighbors
knn3 = KNeighborsClassifier(n_neighbors=3)
knn3.fit(XTrain, yTrain)
yPredK3 = knn3.predict(XTest)

print("Overall accuracy: ", round(metrics.accuracy_score(yTest, yPredK3),2))

('Overall accuracy: ', 0.87)
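
To make the idea behind KNN concrete, here is a minimal from-scratch sketch using Euclidean distance and a simple majority vote. The function `knn_predict` and the data are made up for illustration; this is not how `KNeighborsClassifier` is implemented internally:

```python
import numpy as np

def knn_predict(X_train, y_train, X_query, k=3):
    """Predict labels by majority vote among the k nearest training points."""
    preds = []
    for q in X_query:
        dists = np.sqrt(((X_train - q) ** 2).sum(axis=1))  # Euclidean distances
        nearest = np.argsort(dists)[:k]                    # indices of the k closest
        votes = np.bincount(y_train[nearest])              # count labels among them
        preds.append(np.argmax(votes))
    return np.array(preds)

# Two well-separated, made-up clusters
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                    [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]])
y_train = np.array([0, 0, 0, 1, 1, 1])
X_query = np.array([[0.05, 0.05], [1.0, 0.95]])

preds = knn_predict(X_train, y_train, X_query, k=3)
print(preds)  # [0 1]
```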


Let us try a larger value of K, for instance K = 99 or another number of your own choice; remember, it is good practice to select an **odd** number for K in a binary classification problem to avoid ties. Can you generate the KNN model and print the overall performance for a larger K (such as K=99), using the previous example as guidance? 

In [62]:
# Build a KNN classifier with 99 nearest neighbors
knnBig = KNeighborsClassifier(n_neighbors=99)
knnBig.fit(XTrain, yTrain)
yPredKBig = knnBig.predict(XTest)  # separate name, so the K=3 predictions are not overwritten

print("Overall accuracy: ", round(metrics.accuracy_score(yTest, yPredKBig),2))

('Overall accuracy: ', 0.85)


### Learning Activity 3: Calculate validation metrics for your classifier

In a classification task, once you have created your predictive model, you will need to evaluate it. Evaluation functions help you to do this by reporting the performance of the model through four main performance metrics: precision, recall and specificity for the different classes, and overall accuracy. To understand these metrics, it is useful to create a _confusion matrix_, which records all the true positive, true negative, false positive and false negative values.

We can compute the confusion matrix for our classifier using the `confusion_matrix` function in the `metrics` module.


In [67]:
# Get the confusion matrix for your classifier using metrics.confusion_matrix
mat = metrics.confusion_matrix(yTest, yPredK3)
print(mat)

[[ 99  30]
 [ 20 223]]


Because performance metrics are such an important step of model evaluation, scikit-learn offers a wrapper around these functions, `metrics.classification_report`, to facilitate their computation. It also offers the function `metrics.accuracy_score` that we tried before to compute the overall accuracy.


In [70]:
# Report the metrics using metrics.classification_report
print(metrics.classification_report(yTest, yPredK3))
print("accuracy: ", round(metrics.accuracy_score(yTest, yPredK3), 2))

             precision    recall  f1-score   support

          0       0.83      0.77      0.80       129
          1       0.88      0.92      0.90       243

avg / total       0.86      0.87      0.86       372

('accuracy: ', 0.87)
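
The figures in the report can be recomputed by hand from the confusion matrix printed earlier in this activity. The sketch below assumes scikit-learn's convention (rows are true classes, columns are predicted classes) and treats class 1 as the "positive" class:

```python
import numpy as np

# Confusion matrix printed earlier in this activity
# (rows = true class, columns = predicted class)
mat = np.array([[99, 30],
                [20, 223]])

# Treat class 1 as the "positive" class
tn, fp = mat[0]
fn, tp = mat[1]

precision   = tp / float(tp + fp)    # of the samples predicted 1, how many are 1
recall      = tp / float(tp + fn)    # of the true 1s, how many were found
specificity = tn / float(tn + fp)    # of the true 0s, how many were found
accuracy    = (tp + tn) / float(mat.sum())

print(round(precision, 2), round(recall, 2), round(specificity, 2), round(accuracy, 2))
```

The precision and recall obtained this way agree with the row for class 1 in the report, and the specificity equals the recall reported for class 0.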


### Learning Activity 4: Plot the decision boundaries for different models

We can visualise the classification boundary created by the KNN classifier using the provided helper function `visplots.knnDecisionPlot`. For easier visualisation, you can (interactively) select to view only the test samples from the plot. Remember though that the decision boundary has been built using the _training_ data! <br/> 

In [73]:
# Check the arguments of the function
help(visplots.knnDecisionPlot)

# Visualise the boundaries
visplots.knnDecisionPlot(XTrain, yTrain, XTest, yTest, header, n_neighbors=3)
visplots.knnDecisionPlot(XTrain, yTrain, XTest, yTest, header, n_neighbors=99)

Help on function knnDecisionPlot in module visplots:

knnDecisionPlot(XTrain, yTrain, XTest, yTest, header, n_neighbors, weights='uniform')



**For smaller values of K the decision boundaries present many "creases". In this case the models may suffer from overfitting. For larger values of K, we can see that the decision boundaries are less distinct and tend towards linearity. In these cases the boundaries may be too simple to capture the structure of the data, leading to underfitting.**

### Test Activity 5: Try different weight configurations

Under some circumstances, it is better to give more importance ("weight" in computing terms) to nearer neighbors. This can be accomplished through the `weights` parameter.  When `weights = 'distance'`, weights are assigned to the training data points in a way that is proportional to the inverse of the distance from the query point. In other words, nearer neighbors contribute more to the fit. <br/>

What if we use weights based on distance? Does it improve the overall performance?

In [86]:
# Build the classifier with two pre-defined parameters (n_neighbors and weights)
w_knn3 = KNeighborsClassifier(n_neighbors=3, weights='distance')
w_knn3.fit(XTrain, yTrain)
yPredWK3 = w_knn3.predict(XTest)  # separate name for the distance-weighted predictions

print("Overall accuracy: ", round(metrics.accuracy_score(yTest, yPredWK3),2))

# Visualise the boundaries of a KNN model with weights equal to "distance"
visplots.knnDecisionPlot(XTrain, yTrain, XTest, yTest, header, n_neighbors=11, weights='distance')

('Overall accuracy: ', 0.88)
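
The effect of `weights='distance'` can be illustrated on a single query point. In this made-up example, one very close neighbour belongs to class 0 and two more distant neighbours belong to class 1, so the two weighting schemes disagree. This is a sketch of the voting idea only, not of scikit-learn's implementation:

```python
import numpy as np

# Made-up example: the 3 nearest neighbours of one query point
distances = np.array([0.1, 1.0, 1.2])
labels    = np.array([0, 1, 1])

# Uniform weights: every neighbour gets one vote
uniform_votes = np.bincount(labels, minlength=2)

# Distance weights: each vote is scaled by 1 / distance
weighted_votes = np.bincount(labels, weights=1.0 / distances, minlength=2)

print(uniform_votes.argmax())   # 1: two votes against one
print(weighted_votes.argmax())  # 0: the very close neighbour dominates
```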


## Module 4

### Learning Activity 1: Implement k-fold cross-validation

Let us estimate the accuracy of the classifier on the wine quality dataset by splitting the data 5 consecutive times (the parameter `cv` gives the number of folds the data is split into) using the `cross_val_score` function. For example, try to implement cross-validation for knn3, your KNN model with k=3:

In [93]:
# Implement cross-validation for knn3
knn3scores = cross_val_score(knn3, X, y, cv = 5)
print(knn3scores)
print("Mean of scores KNN3", knn3scores.mean())

[ 0.69463087  0.68456376  0.85521886  0.91245791  0.78114478]
('Mean of scores KNN3', 0.78560323593880632)
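
To make the notion of folds concrete, the sketch below splits ten made-up sample indices into five consecutive folds and uses each fold once as the test part. It only illustrates the bookkeeping; scikit-learn's own splitters may additionally try to preserve the class proportions within each fold:

```python
import numpy as np

n_samples, n_folds = 10, 5    # small made-up sizes
indices = np.arange(n_samples)
folds = np.array_split(indices, n_folds)

for i, test_idx in enumerate(folds):
    # Every fold is used once for testing; the rest is used for training
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
    print("fold", i, "train:", train_idx, "test:", test_idx)
```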


What happens if we change the value of _K_?

### Parameter Tuning

### Learning Activity 2: Grid search on hyperparameters

Rather than trying one-by-one predefined values of K, we can automate this process. The scikit-learn library provides the grid search function `GridSearchCV` (http://scikit-learn.org/stable/modules/generated/sklearn.grid_search.GridSearchCV.html), which allows us to exhaustively search for the optimum combination of parameters by evaluating models trained with a particular algorithm with all provided parameter combinations. Further details and examples on grid search with scikit-learn can be found at http://scikit-learn.org/stable/modules/grid_search.html <br/>

You can use the `GridSearchCV` function with the validation technique of your choice (in this example, 10-fold cross-validation has been applied) to search for a parameterisation of the KNN algorithm that gives a better model:

In [104]:
# Grid search with 10-fold cross-validation using a dictionary of parameters
# params dict
n_neighbors = np.arange(1,51,2)
weights = ['uniform', 'distance']

parameters = [
    {
        'n_neighbors': n_neighbors,
        'weights': weights
    }
]

gridCV = GridSearchCV(KNeighborsClassifier(), parameters, cv=10)
gridCV.fit(XTrain, yTrain)

bestNeighbors = gridCV.best_params_['n_neighbors']
bestWeight = gridCV.best_params_['weights']

print("Best params: n_neighbors=", bestNeighbors, " and weights=", bestWeight)

('Best params: n_neighbors=', 27, ' and weights=', 'distance')
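
"Exhaustive" has a cost that is easy to estimate. The sketch below enumerates the same parameter grid with `itertools.product` and counts how many models are fitted during cross-validation:

```python
import itertools
import numpy as np

n_neighbors = np.arange(1, 51, 2)    # 1, 3, ..., 49
weights = ['uniform', 'distance']

# Every (n_neighbors, weights) pair is one candidate setting
combos = list(itertools.product(n_neighbors, weights))
n_folds = 10

print(len(combos))             # 50 parameter settings
print(len(combos) * n_folds)   # 500 model fits during cross-validation
```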


We can also graphically represent the results of the grid search using a heatmap:

In [106]:
# Visualise the grid search results using a heatmap
scores = np.zeros((len(n_neighbors), len(weights)))

for score in gridCV.grid_scores_:
    ne = score[0]['n_neighbors']
    i = np.argmax(n_neighbors == ne)
    j = 0 if (score[0]['weights'] == 'uniform') else 1
    scores[i,j] = score[1]

data = [
    Heatmap(
        x = n_neighbors,
        y = weights,
        z = scores.T,
        colorscale='Jet',
        reversescale=True,
        colorbar = dict(
            title = "Classification Accuracy",
            len = 4,
            nticks=10
        )
    )
]

layout = Layout(
    xaxis = dict(
        title = "Number of k nearest neighbors",
        tickvals = n_neighbors
    ),
    yaxis = dict(title = "Weights"),
    height = 250
)

fig = dict(data = data, layout = layout)
iplot(fig)

When evaluating the resulting model it is important to do it on held-out samples that were not seen during the grid search process (XTest). <Br/>
So, we are testing our independent XTest dataset using the optimised model:

In [107]:
# Build the classifier using the optimal parameters detected by grid search 
knn = KNeighborsClassifier(n_neighbors=bestNeighbors, weights=bestWeight)
knn.fit(XTrain, yTrain)
yPredKnn = knn.predict(XTest)

print(metrics.classification_report(yTest, yPredKnn))
print("Overall Accuracy:", round(metrics.accuracy_score(yTest, yPredKnn), 2))

             precision    recall  f1-score   support

          0       0.93      0.74      0.83       129
          1       0.88      0.97      0.92       243

avg / total       0.90      0.89      0.89       372

('Overall Accuracy:', 0.89)


### Learning Activity 3: Systematic variation of the K neighbors, and the bias-variance trade-off

We can graphically represent and investigate how the systematic increase of the number of _K_ neighbors influences the validation, train and test accuracy (and attempt to detect cases of over- or under-fitting). 

In [108]:
# Explore the benefit of cross-validated results vs. simple training and test data separation
train_scores = []
test_scores  = []
cv_scores    = [x[0] for x in scores]

for n in n_neighbors:
    knn = KNeighborsClassifier(n_neighbors=n)
    knn.fit(XTrain, yTrain)
    train_scores.append(metrics.accuracy_score(yTrain, knn.predict(XTrain)))
    test_scores.append(metrics.accuracy_score(yTest, knn.predict(XTest)))

In [109]:
# Plot the train, test and validation accuracies
trace0 = Scatter(
    x = n_neighbors,
    y = train_scores,
    mode = "lines+markers",
    name = "Training Scores"
)

trace1 = Scatter(
    x = n_neighbors,
    y = test_scores,
    mode = "lines+markers",
    name = "Test Scores"
)

trace2 = Scatter(
    x = n_neighbors,
    y = cv_scores,
    mode = "lines+markers",
    name = "CV Scores"
)

layout = Layout(
    xaxis = dict(title = 'number of neighbors'),
    yaxis = dict(title = 'prediction accuracy')
)

fig = Figure(data=[trace0, trace1, trace2], layout=layout)

iplot(fig)

### Test Activity 4: Randomized search on hyperparameters

Unlike `GridSearchCV`, `RandomizedSearchCV` does not exhaustively try all the parameter settings. Instead, it samples a fixed number of parameter settings based on the distributions you specify (e.g. you might specify that one parameter should be sampled uniformly while another is sampled following a Gaussian distribution). The number of parameter settings that are tried is given by `n_iter`. If all parameters are presented as a list, sampling without replacement is performed. If at least one parameter is given as a distribution, sampling with replacement is used. You should use continuous distributions for continuous parameters. Further details can be found at http://scikit-learn.org/stable/modules/grid_search.html
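
The sampling step can be sketched with `scipy.stats.randint`. The range below is a hypothetical choice for `n_neighbors`, and drawing five values plays the role of `n_iter=5`; the sketch shows only how candidate settings are drawn, not the search itself:

```python
from scipy.stats import randint

# Hypothetical search space: n_neighbors drawn uniformly from 1..49
n_neighbors_dist = randint(1, 50)    # the upper bound is exclusive

# Draw 5 candidate settings, analogous to n_iter=5
samples = n_neighbors_dist.rvs(size=5, random_state=0)
print(samples)
```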

In [None]:
# Conduct a randomised search on hyperparameters

In [None]:
# Print the optimal n_neighbors detected by randomised search

We can also graphically represent the results of the randomised search using a scatterplot:

In [None]:
# Visualise the randomised search results using a scatterplot

In [None]:
# Build the classifier using the optimal parameters detected by randomised search

## Module 5

### Learning Activity 1:  Decision Tree

Decision Tree classifiers construct classification models in the form of a tree structure. A decision tree progressively splits the training set into smaller subsets. Each node of the tree represents a subset of the data. Once a new sample is presented to the tree, it is classified according to the test condition generated for each node of the tree.

Let us build a simple decision tree with 3 layers. We will first evaluate the accuracy, then plot the decision boundaries just as we did for knn. (See http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html for the documentation of the classifier.)

In [111]:
# Build a Decision Tree classifier with 3 layers
dtc = DecisionTreeClassifier(max_depth=3)
dtc.fit(XTrain, yTrain)
predDT = dtc.predict(XTest)

print(metrics.classification_report(yTest, predDT))
print("Overall Accuracy:", round(metrics.accuracy_score(yTest, predDT),2))

visplots.dtDecisionPlot(XTrain, yTrain, XTest, yTest, header, max_depth=3)

             precision    recall  f1-score   support

          0       0.94      0.60      0.73       129
          1       0.82      0.98      0.89       243

avg / total       0.86      0.85      0.84       372

('Overall Accuracy:', 0.85)


### Learning Activity 2:  Random Forests

The random forests model is an _ensemble method_ since it aggregates a group of decision trees into an ensemble (http://scikit-learn.org/stable/modules/ensemble.html). Ensemble learning involves the combination of several models to solve a single prediction problem. It works by generating multiple classifiers/models which learn and make predictions independently. Those predictions are then combined into a single (mega) prediction that should be as good as or better than the prediction made by any one classifier. Unlike single decision trees, which are likely to suffer from high variance or high bias (depending on how they are tuned), random forests use averaging to find a natural balance between the two extremes. <br/> 
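
The way independent predictions are combined can be sketched with a simple majority vote. The predictions below are made up, and hard voting is used only for illustration; scikit-learn's forest combines its trees by averaging their predicted class probabilities instead:

```python
import numpy as np

# Made-up predictions of 5 classifiers (rows) for 4 samples (columns)
votes = np.array([[0, 1, 1, 0],
                  [0, 1, 0, 0],
                  [1, 1, 1, 0],
                  [0, 0, 1, 0],
                  [0, 1, 1, 1]])

# Majority vote per sample: the class predicted by most classifiers wins
ensemble_pred = np.array([np.bincount(col).argmax() for col in votes.T])
print(ensemble_pred)  # [0 1 1 0]
```

Note that each individual classifier gets at least one sample "wrong" relative to the ensemble, yet the combined prediction is stable.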

Let us start by building a simple Random Forest model which consists of 100 independently trained decision trees. For further details and examples on how to construct a Random Forest, see http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

In [115]:
# Build a Random Forest classifier with 100 decision trees
rf = RandomForestClassifier(n_estimators=100, random_state=0, n_jobs=4)
rf.fit(XTrain, yTrain)
predRF = rf.predict(XTest)

print(metrics.classification_report(yTest, predRF))
print("Overall Accuracy:", round(metrics.accuracy_score(yTest, predRF),2))

             precision    recall  f1-score   support

          0       0.92      0.77      0.84       129
          1       0.89      0.96      0.92       243

avg / total       0.90      0.90      0.89       372

('Overall Accuracy:', 0.9)


### Learning Activity 3: Visualising the RF accuracy

We can also investigate how the overall test accuracy changes as the number of decision trees (`n_estimators`) in the model increases. To do so, we can use the provided `rfAvgAcc` function from `visplots`:

In [116]:
# Visualise the average accuracy 
visplots.rfAvgAcc(rfModel=rf, XTest=XTest, yTest=yTest)
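
The sketch below is not the implementation of `rfAvgAcc`, whose source is not shown here. It only illustrates, on made-up per-tree predictions, one way to track test accuracy as trees are added: take a majority vote over the first k trees for each k. (scikit-learn's forest averages the trees' class probabilities rather than counting hard votes, so this is a simplification.)

```python
def accuracy_by_forest_size(tree_preds, y_true):
    """tree_preds[t][i] is tree t's 0/1 prediction for sample i.
    Returns the accuracy of the majority vote over the first k trees, k = 1..T."""
    n_samples = len(y_true)
    votes_for_1 = [0] * n_samples
    accuracies = []
    for k, preds in enumerate(tree_preds, start=1):
        votes_for_1 = [v + p for v, p in zip(votes_for_1, preds)]
        # Predict 1 when strictly more than half of the k trees voted 1
        # (ties therefore go to class 0).
        voted = [1 if 2 * v > k else 0 for v in votes_for_1]
        correct = sum(1 for a, b in zip(voted, y_true) if a == b)
        accuracies.append(float(correct) / n_samples)
    return accuracies

# Hypothetical predictions from three trees on four test samples
y_true = [1, 0, 1, 1]
tree_preds = [
    [1, 1, 0, 1],
    [1, 0, 1, 0],
    [0, 0, 1, 1],
]
print(accuracy_by_forest_size(tree_preds, y_true))
```

In this toy case each tree alone gets two or three of the four samples right, while the vote of all three gets every sample right.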

### Learning Activity 4: Feature Importance 

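Random forests allow you to compute heuristics for how “important” a feature is in predicting the target. One such heuristic, permutation importance, measures the change in prediction accuracy when the values of a given feature are permuted (scrambled) across the datapoints: the more the accuracy drops, the more “important” the feature. Note that scikit-learn's `feature_importances_` attribute reports a different, impurity-based measure: the total reduction in the split criterion (Gini impurity by default) brought by a feature, averaged over all trees and normalised to sum to 1.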
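
The permutation heuristic can be sketched in a few lines. The "model" here is a hypothetical hand-written rule that only looks at feature 0, and the data are made up, so that the effect is easy to see; with a real classifier you would pass a function wrapping its `predict` method instead. The helper names are ours.

```python
import random

def accuracy(predict, X, y):
    hits = sum(1 for row, label in zip(X, y) if predict(row) == label)
    return hits / float(len(y))

def permutation_drops(predict, X, y, n_repeats=20, seed=0):
    """Mean drop in accuracy when one column of X is shuffled across rows."""
    rng = random.Random(seed)
    baseline = accuracy(predict, X, y)
    drops = []
    for j in range(len(X[0])):
        total = 0.0
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            # Rebuild X with column j replaced by its shuffled version
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            total += baseline - accuracy(predict, X_perm, y)
        drops.append(total / n_repeats)
    return drops

# Toy data: the label depends only on feature 0; feature 1 is noise.
X = [[0.1, 5.0], [0.9, 1.0], [0.2, 3.0], [0.8, 4.0], [0.3, 2.0], [0.7, 6.0]]
y = [0, 1, 0, 1, 0, 1]
rule = lambda row: 1 if row[0] > 0.5 else 0   # hypothetical "model"

print(permutation_drops(rule, X, y))
```

Shuffling feature 1 cannot change the rule's predictions, so its drop is exactly zero, while shuffling feature 0 hurts accuracy. Recent scikit-learn releases ship a ready-made version of this idea as `sklearn.inspection.permutation_importance`.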

We can use the `feature_importances_` attribute of the RF classifier to obtain the relative importance of each feature, which we can then visualise using a simple bar plot.

In [118]:
# Display the importance of the features in a barplot
importance = rf.feature_importances_
names = header[0:10]

data = [
    Bar(
        x = importance,
        y = names,
        orientation = 'h',
    )
]

layout = Layout(
    xaxis = dict(title = "Importance of features"),
    yaxis = dict(title = "Features"),
    width = 800,
    margin=Margin(
        l=250,
        r=50,
        b=100,
        t=50,
        pad=4
    )
)

fig = dict(data = data, layout = layout)
iplot(fig)

### Learning Activity 5: Boundary visualisation

We can visualise the classification boundary created by the Random Forest using the `visplots.rfDecisionPlot` function. You can check the arguments this function accepts by using the `help` command. For easier visualisation, only the test samples have been included in the plot. Remember, though, that the decision boundary has been built using the _training_ data!

In [120]:
# Check the arguments of the function
help(visplots.rfDecisionPlot)

# Visualise the boundaries
visplots.rfDecisionPlot(XTrain, yTrain, XTest, yTest, header, n_estimators=100)

Help on function rfDecisionPlot in module visplots:

rfDecisionPlot(XTrain, yTrain, XTest, yTest, header, n_estimators=10)



### Learning Activity 6: Tuning Random Forests with grid search

Random forests offer several parameters that can be tuned, such as `n_estimators`, `max_features`, `max_depth` and `min_samples_leaf`.

In [121]:
# View the list of arguments to be optimised
help(RandomForestClassifier())

Help on RandomForestClassifier in module sklearn.ensemble.forest object:

class RandomForestClassifier(ForestClassifier)
 |  A random forest classifier.
 |  
 |  A random forest is a meta estimator that fits a number of decision tree
 |  classifiers on various sub-samples of the dataset and use averaging to
 |  improve the predictive accuracy and control over-fitting.
 |  The sub-sample size is always the same as the original
 |  input sample size but the samples are drawn with replacement if
 |  `bootstrap=True` (default).
 |  
 |  Read more in the :ref:`User Guide <forest>`.
 |  
 |  Parameters
 |  ----------
 |  n_estimators : integer, optional (default=10)
 |      The number of trees in the forest.
 |  
 |  criterion : string, optional (default="gini")
 |      The function to measure the quality of a split. Supported criteria are
 |      "gini" for the Gini impurity and "entropy" for the information gain.
 |      Note: this parameter is tree-specific.
 |  
 |  max_features : int, f

Create a dictionary of allowed parameter ranges for `n_estimators` and `max_depth` (or include more of the parameters you would like to tune) and conduct a grid search with cross-validation using `GridSearchCV` as before:

In [124]:
# Conduct a grid search with 50-fold cross-validation (cv=50) using the dictionary of parameters
n_estimators = np.arange(1,30,5)
max_depth = np.arange(1,100,5)
parameters = [{'n_estimators': n_estimators, 'max_depth': max_depth}]

gridCV = GridSearchCV(RandomForestClassifier(), param_grid=parameters, cv=50)
gridCV.fit(XTrain, yTrain)

best_n_estim = gridCV.best_params_['n_estimators']
best_max_depth = gridCV.best_params_['max_depth']

print("Best parameters: n_estimators=", best_n_estim, ", max_depth=", best_max_depth)

('Best parameters: n_estimators=', 26, ', max_depth=', 96)
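
It helps to count how much work this search does. The sketch below enumerates the same grid using only the standard library; it performs no model fitting, and the fold count mirrors the `cv=50` passed to `GridSearchCV` above.

```python
from itertools import product

n_estimators = list(range(1, 30, 5))    # same values as np.arange(1, 30, 5)
max_depth = list(range(1, 100, 5))      # same values as np.arange(1, 100, 5)
n_folds = 50

# Every (n_estimators, max_depth) pair is one candidate model
grid = list(product(n_estimators, max_depth))

print("n_estimators candidates: %s" % n_estimators)
print("combinations: %d" % len(grid))
print("model fits during the search: %d" % (len(grid) * n_folds))
```

Both reported best values sit at the upper edge of their ranges, so it may be worth widening or shifting the grid before relying on them.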


Finally, evaluate the optimised model on the independent XTest dataset:

In [None]:
# Build the classifier using the optimal parameters detected by grid search

We can also graphically represent the results of the grid search using a heatmap:

In [None]:
# Create a heatmap like the one you made when you applied GridSearchCV to KNN

### Bonus Activity 7: Parallelisation


The scikit-learn implementation of Random Forests also supports building the trees and computing the predictions in parallel through the `n_jobs` parameter.
If `n_jobs=k`, the computations are partitioned into `k` jobs and run on `k` cores of the machine.
If `n_jobs=-1`, all cores available on the machine are used.
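
One way to approach the timing exercise is a small helper that fits a fresh model for each `n_jobs` value and records the wall-clock time. The helper name `time_fits` and the `SlowDummy` stand-in are hypothetical; the dummy only sleeps, so that the sketch runs without scikit-learn and without the notebook's data.

```python
import time
from timeit import default_timer

def time_fits(make_model, X, y, n_jobs_values):
    """Return {n_jobs: seconds taken by model.fit(X, y)}."""
    timings = {}
    for k in n_jobs_values:
        model = make_model(k)          # fresh, unfitted model each time
        start = default_timer()
        model.fit(X, y)
        timings[k] = default_timer() - start
    return timings

class SlowDummy(object):
    """Stand-in for a classifier: pretends the work splits evenly over n_jobs."""
    def __init__(self, n_jobs):
        self.n_jobs = n_jobs

    def fit(self, X, y):
        time.sleep(0.2 / self.n_jobs)
        return self

timings = time_fits(SlowDummy, [[0.0]], [0], [1, 2, 4])
for k in sorted(timings):
    print("n_jobs=%d: %.3f s" % (k, timings[k]))
```

With the notebook's data this might be called as `time_fits(lambda k: RandomForestClassifier(n_estimators=100, n_jobs=k), XTrain, yTrain, [1, 2, 4, -1])`. Real speed-ups are rarely as clean as the dummy's, since starting worker processes has its own overhead.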


In [None]:
# Change the value of n_jobs and estimate the execution time of fit

In [None]:
# Plot a graph