# Applied Machine Learning (INFR11211) 

# Lab 3: Evaluation

In this lab, we learn how to evaluate a classifier. Specifically, we will train K-Nearest Neighbors (KNN) and decision tree classifiers on the [Breast cancer](https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29) dataset and assess their performance.

Now let's import the packages:

In [None]:
# Import packages
import os
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import graphviz
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, log_loss, accuracy_score, r2_score
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn.ensemble import RandomForestClassifier
from pandas.api.types import CategoricalDtype
%matplotlib inline

# 1. Nearest Neighbors Classification
In the first part, we will assess the performance of a K-Nearest Neighbors (KNN) classifier. This is a very simple supervised approach, which you can read more about [here](https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm). At test time, this model assigns a test instance to the class of the nearest training instances in feature space. The simplest case is k=1: we compute the distance from the test instance to every labelled training example and copy the class label of the closest one. Here, distance can be measured using the Euclidean distance. When k>1, we select the class label that is most common among the k nearest neighbors. k is a user-defined hyperparameter that needs to be selected for each dataset.
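
The procedure described above can be sketched in a few lines of plain Python. This is only an illustration on made-up 2-D points, not how `Scikit-learn` implements KNN; the function name `knn_predict` and the toy data are assumptions introduced here.

```python
from collections import Counter
import math

def knn_predict(train_points, train_labels, query, k=1):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point
    distances = [math.dist(p, query) for p in train_points]
    # Indices of the k training points with the smallest distances
    nearest = sorted(range(len(train_points)), key=lambda i: distances[i])[:k]
    # Most common label among those neighbours
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy data (hypothetical): two points of class 0, three points of class 1
train_points = [(0.0, 0.0), (0.1, 0.2), (1.0, 1.0), (0.9, 1.1), (1.2, 0.8)]
train_labels = [0, 0, 1, 1, 1]

query = (0.2, 0.1)
pred_k1 = knn_predict(train_points, train_labels, query, k=1)
pred_k5 = knn_predict(train_points, train_labels, query, k=5)
print("k=1:", pred_k1, " k=5:", pred_k5)
```

With k=1 the query takes the label of its closest neighbour, while with k equal to the size of the training set every point votes, so the prediction collapses to the overall majority class.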

To evaluate this model we will introduce a new dataset, the [Breast cancer](https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29) dataset. The classification task here is to determine whether a tumor is `M=malignant` or `B=benign`. For more information, you can read the dataset description in the link.

### ========== Question 1.1 ==========
The dataset can be loaded directly from `Scikit-learn`; see [sklearn.datasets.load_breast_cancer](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_breast_cancer.html) for more details.
Let's load the dataset and display its size and first 10 instances.

In [None]:
from sklearn.datasets import load_breast_cancer
# specifying "as_frame=True" to return the data as a pandas DataFrame
cancer_data = load_breast_cancer(as_frame=True).frame
# The frame holds the features plus one 'target' column, so subtract 1 to count features only
print('Number of instances: {}, number of features: {}'.format(cancer_data.shape[0], cancer_data.shape[1] - 1))
cancer_data.head(10)

We can see that this dataset consists of 30 features, and the last column is the target. Here the target is encoded as an integer (`0=malignant, 1=benign`).

## Hold-out validation
To get an unbiased estimate of the model's classification performance on unseen data we will use hold-out validation. Familiarise yourself with the logic behind [hold-out validation with `train_test_split`](https://scikit-learn.org/stable/modules/cross_validation.html#cross-validation-evaluating-estimator-performance) and [how it is used](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html?highlight=train_test_split#sklearn.model_selection.train_test_split) in `Scikit-learn`. Execute the cell below to create your training/testing sets by assigning 10% of the data to the test set (and convince yourself you understand what is going on).

In [None]:
X = cancer_data.drop('target', axis=1)
y = cancer_data['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.9, test_size=0.1, random_state=0)
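
To make the idea of a hold-out split concrete, here is a rough sketch of the logic in plain Python: shuffle the instance indices once, then reserve a fraction of them for testing. It is not the `train_test_split` implementation, and the helper name `holdout_split` and the toy size of 20 instances are assumptions made for illustration.

```python
import math
import random

def holdout_split(n_samples, test_fraction=0.1, seed=0):
    """Return (train_indices, test_indices) for a single shuffled hold-out split."""
    indices = list(range(n_samples))
    # A fixed seed makes the shuffle, and therefore the split, reproducible
    random.Random(seed).shuffle(indices)
    # Round the test-set size up so that it is never empty
    n_test = math.ceil(test_fraction * n_samples)
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = holdout_split(n_samples=20, test_fraction=0.1, seed=0)
print(len(train_idx), "training instances,", len(test_idx), "test instances")
```

The important property is that the two index sets never overlap, so no test instance is seen during training.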

### ========== Question 1.2 ==========
Display the shapes of the four arrays `X_train`, `y_train`, `X_test`, and `y_test`.

In [None]:
# Your Code goes here:

### ========== Question 1.3 ==========
Familiarise yourself with [Nearest Neighbours Classification](https://scikit-learn.org/stable/modules/neighbors.html#classification). Train a [`KNeighborsClassifier`](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html#sklearn.neighbors.KNeighborsClassifier) with a single neighbour (`n_neighbors=1`) on the training set. Report the classification accuracy on the **training set**.

In [None]:
# Your Code goes here:

### ========== Question 1.4 ==========
Is the above result meaningful? Why is testing on the training data a particularly bad idea for a 1-nearest neighbour classifier? Do you expect the performance of the classifier on a test set to be as good?

***Your Answer goes here:***

### ========== Question 1.5 ==========
Now report the classification accuracy on the **test set** and check your expectations.

In [None]:
# Your Code goes here:

### ========== Question 1.6 ==========
Plot the distribution of the target variable in the test set as a bar chart of class counts. *Hint: You can use Pandas' built-in bar plot tool in conjunction with the [`value_counts`](https://pandas.pydata.org/pandas-docs/version/1.3.1/reference/api/pandas.Series.value_counts.html) method.*

In [None]:
# Your Code goes here:

### ========== Question 1.7 ==========
What would be the accuracy of the classifier, if all points were labelled as `1`? 

**Pro Tip** - You should always use a ['Dummy Model'](https://scikit-learn.org/stable/modules/model_evaluation.html#dummy-estimators) (a ridiculously simple model) like this to compare with your 'real' models. It's very common for complex models to be outperformed by a simple model, such as predicting the most common class. When complex models are outperformed by 'Dummies', you should investigate why: often there was an issue with the code, the data, or the way the model works was misunderstood.
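
As a small illustration of such a baseline, the sketch below fits `sklearn.dummy.DummyClassifier` with the `most_frequent` strategy on made-up labels. The toy arrays `X_toy` and `y_toy` are assumptions introduced here and are unrelated to the breast cancer data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Toy labels (hypothetical): 7 instances of class 1 and 3 of class 0
y_toy = np.array([1] * 7 + [0] * 3)
# The dummy model ignores the features, so a placeholder column is enough
X_toy = np.zeros((len(y_toy), 1))

# Always predict the most frequent class seen during fit
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_toy, y_toy)

# score() returns the accuracy, i.e. the proportion of the majority class here
baseline_acc = baseline.score(X_toy, y_toy)
print("Baseline accuracy: {:.2f}".format(baseline_acc))
```

The accuracy of this baseline is simply the proportion of the majority class, which is the number any "real" model has to beat.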

In [None]:
# Your Code goes here:

### ========== Question 1.8 ==========
Now we want to explore the effect of the `k` parameter. To do this, train the classifier multiple times, each time setting `n_neighbors` to a different value. Try `1`, `3`, `5`, `7`, `10`, `30`, `50`, `100`, and `200` and test the classifier on the test set. How does the k parameter affect the results?   
*Hint: Consider how well the classifier is generalising to previously unseen data, and how it compares to the dummy baseline accuracy from Question 1.7.*   
*Hint: You should be able to implement this in a few lines using a for loop.*

In [None]:
# Your Code goes here:

***Your Answer goes here:***

### ========== Question 1.9 ==========
Plot the results (k-value on the x-axis and classification accuracy on the y-axis), making sure to label both axes. Can you conclude anything from observing the plot?

In [None]:
# Your Code goes here:

***Your Answer goes here:***

### ========== Question 1.10 ==========
We now evaluate the classifier by looking at the [confusion matrix](https://en.wikipedia.org/wiki/Confusion_matrix). Familiarise yourself with the definition of the confusion matrix given in this link.

Scikit-learn has a [`confusion_matrix`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html?highlight=confusion_matrix#sklearn.metrics.confusion_matrix) implementation which returns a square NumPy array of shape `(C, C)`, where `C` is the number of classes (2 in our case).

**a)** Select the best value for k from Questions 1.8 and 1.9, compute the resulting confusion matrix using the built-in scikit-learn `confusion_matrix` function, and display the result.
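
To see what the returned array looks like, here is a small sketch on made-up labels (the lists `y_true_toy` and `y_pred_toy` are assumptions for illustration). Rows correspond to true classes and columns to predicted classes; dividing each row by its sum normalises by the true class, which matches the `normalize="true"` option of `confusion_matrix`.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy labels (hypothetical): 3 instances of class 0 and 5 of class 1
y_true_toy = [0, 0, 0, 1, 1, 1, 1, 1]
y_pred_toy = [0, 0, 1, 1, 1, 1, 1, 0]

# Raw counts: entry [i, j] counts instances of true class i predicted as class j
cm = confusion_matrix(y_true_toy, y_pred_toy)
print(cm)

# Normalise by the true class: every row now sums to 1
cm_norm = cm / cm.sum(axis=1, keepdims=True)
print(cm_norm)

# The same normalisation done by scikit-learn directly
cm_norm_sklearn = confusion_matrix(y_true_toy, y_pred_toy, normalize="true")
```

After row normalisation, the diagonal entries are the per-class recalls, which makes the matrix easier to read when the classes are imbalanced.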

In [None]:
# Your Code goes here:

**b)** Normalise the produced confusion matrix by the true class and display the result.

In [None]:
# Your Code goes here:

**c)** By making use of the `plot_confusion_matrix` function provided below, visualise the normalised confusion matrix. Plot the appropriate labels on both axes by making use of the `classes` input argument.

In [None]:
def plot_confusion_matrix(cm, classes=None, title='Confusion matrix'):
    """Plots a confusion matrix."""
    if classes is not None:
        sns.heatmap(cm, xticklabels=classes, yticklabels=classes, vmin=0., vmax=1., annot=True)
    else:
        sns.heatmap(cm, vmin=0., vmax=1.)
    plt.title(title)
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

In [None]:
# Your Code goes here:

### ========== Question 1.11 ==========
Read about the [cross entropy loss](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.log_loss.html) (called `log_loss` here). It is a loss function commonly used when optimising binary classification models.

This metric takes as input the true labels and the estimated probability distributions. It makes sense to use this metric when we are interested not only in the predicted labels, but also in the confidence with which these labels are predicted.

For instance, think of the situation where you have a single test point and two classifiers. Both classifiers predict the label correctly, however classifier A predicts that the test point belongs to the class with probability 0.55, whereas classifier B predicts the correct class with probability 0.99. Classification accuracy would be the same for the two classifiers (why?) but the `log_loss` metric would indicate that classifier B should be favoured.
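
The two-classifier example above can be worked through numerically. For a single test point, the cross entropy loss reduces to the negative natural logarithm of the probability assigned to the true class; the variable names below are chosen here for illustration.

```python
import math

# Probability each classifier assigns to the correct class of the single test point
p_correct_a = 0.55
p_correct_b = 0.99

# Cross entropy for one instance: -log(probability of the true class)
loss_a = -math.log(p_correct_a)
loss_b = -math.log(p_correct_b)

print("Classifier A log loss: {:.3f}".format(loss_a))
print("Classifier B log loss: {:.3f}".format(loss_b))
```

Both classifiers are equally accurate on this point, but the loss of classifier A is roughly 0.60 against roughly 0.01 for classifier B, so `log_loss` rewards the more confident correct prediction.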

Produce a scatter plot showing `log_loss` on the y-axis against the value of `k` on the x-axis, using the same values of `k` as in Question 1.8. Use [predict_proba](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html#sklearn.neighbors.KNeighborsClassifier.predict_proba) to return probability estimates for the test data. Which value for `k` would you pick if `log_loss` was the error metric? Comment on why this might happen, and which metric would be a better evaluator of performance.

In [None]:
# Your Code goes here:

***Your Answer goes here:***

# 2. Decision Trees
One of the big advantages of decision trees is their interpretability. The rules learnt for classification are easy for a person to follow, unlike the opaque "black box" of many other methods, such as neural networks. We demonstrate the utility of this using the same dataset.

### ========== Question 2.1 ==========
Now we will train a Decision Tree classifier on the training data. Read about [Decision Tree classifiers](https://scikit-learn.org/stable/modules/tree.html) in `Scikit-learn` and how they are [used](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier). 
Create a `DecisionTreeClassifier` instance, naming it `dt`, and train it using the training data only (i.e. `X_train` and `y_train`). Set the `criterion` parameter to `entropy` in order to measure the quality of splits using entropy. Apart from `max_depth` and `random_state`, which are described below, use the default settings for the rest of the parameters.   

By default, trees are grown to full depth; this means that very fine splits are made involving very few data points. Not only does this make the trees hard to visualise (they'll be deep), but we could also be overfitting the data. For now, we arbitrarily choose a depth of 3 for our tree (i.e. `max_depth=3`, to make it easier to interpret below), but this is a parameter we could tune. For consistency, use `random_state=1000`.
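
To give some intuition for the `entropy` criterion mentioned above, the sketch below computes the entropy of a set of labels and the reduction in entropy (information gain) achieved by one candidate split. The helper names and toy labels are assumptions made here for illustration, and this is not the code `Scikit-learn` runs internally.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, left, right):
    """Entropy of the parent minus the size-weighted entropy of its two children."""
    n = len(parent)
    weighted = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    return entropy(parent) - weighted

# Toy node (hypothetical): 4 instances of each class, split into two children
parent = [0, 0, 0, 0, 1, 1, 1, 1]
left = [0, 0, 0, 1]
right = [0, 1, 1, 1]

gain = information_gain(parent, left, right)
print("Parent entropy: {:.3f} bits".format(entropy(parent)))
print("Information gain of the split: {:.3f} bits".format(gain))
```

A split that separates the classes perfectly would recover the full parent entropy as gain; the tree greedily prefers splits with higher gain.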

In [None]:
# Your Code goes here:

We mentioned in class that decision trees have the advantage of being interpretable by humans. Now we visualise the decision tree we have just trained. Scikit-learn can export the tree in the `.dot` format. Run the following code:

In [None]:
dot_data = export_graphviz(dt, out_file=None, 
    feature_names=X_train.columns,  
    class_names=['0', '1'],  
    filled=True, rounded=True,  
    special_characters=False)
graph = graphviz.Source(dot_data)
graph

An alternative way to visualise the tree is to open the output .dot file with an editor such as [this online .dot renderer](http://dreampuf.github.io/GraphvizOnline/). You can use the code below to create a dot-file and then copy and paste its contents into the online site (you can double click on the tree once it has been produced to view it in full screen).

In [None]:
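column_names = X_train.columns
with open("tree.dot", 'w') as f:
    # When out_file is given, export_graphviz writes to it directly,
    # so the file handle must not be overwritten with the return value
    export_graphviz(dt, out_file=f,
                    feature_names=column_names,
                    class_names=['0', '1'],
                    filled=True, rounded=True,
                    special_characters=False)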

### ========== Question 2.2 ==========
Inspect the tree and describe what it shows. 

***Your Answer goes here:***

### ========== Question 2.3 ==========
Tree-based estimators (i.e. decision trees and random forests) can be used to compute feature importances. The importance of a feature is computed as the (normalized) total reduction of entropy (or other used `criterion`) brought by that feature. Find the relevant features of the classifier you just trained (i.e. those which are actually used in this short tree) and display feature importances along with their names.
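
As a toy illustration of the idea, the sketch below fits a small tree on made-up data in which only the first feature determines the label, then reads the `feature_importances_` attribute. The array contents and the names in `feature_names` are assumptions for illustration only.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy data (hypothetical): the label equals the first feature; the second is irrelevant
X_toy = np.array([[0, 5], [1, 5], [0, 7], [1, 7]])
y_toy = np.array([0, 1, 0, 1])
feature_names = ["informative", "noise"]

toy_tree = DecisionTreeClassifier(criterion="entropy", random_state=0)
toy_tree.fit(X_toy, y_toy)

# Normalised total entropy reduction contributed by each feature
importances = toy_tree.feature_importances_
for name, importance in zip(feature_names, importances):
    print("{}: {:.2f}".format(name, importance))
```

Features that are never used for a split receive an importance of zero, which is why only a handful of features will stand out in a shallow tree.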

In [None]:
# Your Code goes here:

### ========== Question 2.4 ==========
Now we want to evaluate the performance of the classifier on unseen data. Use the trained model to predict the target variables for the test data set. Display the classification accuracy for both the training and test data sets. What do you observe? Are you surprised by the results?

In [None]:
# Your Code goes here:

***Your Answer goes here:***

### ========== Question 2.5 ==========

Fit another `DecisionTreeClassifier` but this time grow it to full depth (i.e. remove the max_depth condition). Again, use a `random_state=1000`. Display the classification accuracy for training and test data as above. Again, what do you observe and are you surprised?

In [None]:
# Your Code goes here:

***Your Answer goes here:***

### ========== Question 2.6 ==========
By using seaborn's heatmap function, plot the normalised confusion matrices for both the training and test data sets **for the max_depth=3 decision tree from question 2.1**. Make sure you label axes appropriately.   
*Hint: You can make use of the `plot_confusion_matrix` function below.*  

In [None]:
def plot_confusion_matrix(cm, classes=None, title='Confusion matrix'):
    """Plots a confusion matrix."""
    if classes is not None:
        sns.heatmap(cm, xticklabels=classes, yticklabels=classes, vmin=0., vmax=1., annot=True)
    else:
        sns.heatmap(cm, vmin=0., vmax=1.)
    plt.title(title)
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

In [None]:
# Training data

# Your Code goes here:


In [None]:
# Test data

# Your Code goes here:


**N.B. It will be obvious if you have plotted the full depth decision tree, as the normalised training confusion matrix will be the identity matrix.**

### ========== Question 2.7 ==========

Finally, we will create a [`Random decision forest`](http://scikit-learn.org/0.24/modules/generated/sklearn.ensemble.RandomForestClassifier.html) classifier and compare its performance to that of the decision tree. The random decision forest is an ensemble classifier that consists of many decision trees and outputs the class that is the mode of the classes output by the individual trees. Start with `n_estimators = 100`, use the `entropy` criterion and the same train/test split as before. Report the classification accuracy of the random forest model on the test set and show the confusion matrix. How does the random decision forest compare performance-wise to the decision tree?
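
The "mode of the individual trees" idea can be sketched as a simple majority vote. The per-tree predictions below are made up, and `majority_vote` is a hypothetical helper; note also that, according to its documentation, `Scikit-learn`'s `RandomForestClassifier` combines trees by averaging their predicted class probabilities rather than by hard voting, so this is a conceptual sketch only.

```python
from collections import Counter

# Hypothetical hard predictions: one row per tree, one column per test instance
tree_predictions = [
    [1, 0, 1],
    [1, 0, 0],
    [0, 0, 1],
    [1, 1, 1],
    [1, 0, 0],
]

def majority_vote(predictions_per_tree):
    """Return, for each test instance, the class predicted by most trees."""
    forest_prediction = []
    # zip(*rows) regroups the predictions so that each tuple holds one instance's votes
    for votes in zip(*predictions_per_tree):
        forest_prediction.append(Counter(votes).most_common(1)[0][0])
    return forest_prediction

forest_prediction = majority_vote(tree_predictions)
print(forest_prediction)
```

Individual trees disagree on some instances, but the aggregated vote is more stable than any single tree, which is the motivation for the ensemble.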

In [None]:
# Your Code goes here:

### ========== Question 2.8 ==========
How high can you get the performance of the classifier by changing the maximum depth of the trees (`max_depth`) or the `max_features` parameter? Try a few values just to get a look. *Don't do a grid search or anything in-depth, just get a feel*. Try the same settings twice: do you get the same accuracy?

In [None]:
# Your Code goes here:

N.B. When observing these confusion matrices, look out for something very important: for some configurations, the Random Forest may **always predict the majority class**, and such configurations can still be among those with the highest accuracy. This highlights (again) the importance of always checking performance against a dummy classifier!

Additionally, if you want to reproduce your results, you must set the random seed (you can do this with the `random_state` argument). Random forests are...random!
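
The effect of fixing the seed can be checked with a short experiment. The sketch below uses synthetic data from `make_classification` (an assumption made here so that the example is self-contained) and fits the same forest twice with the same `random_state`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic toy data, unrelated to the breast cancer dataset
X_toy, y_toy = make_classification(n_samples=200, n_features=8, random_state=0)

def fit_and_predict(seed):
    """Fit a small forest with the given seed and return its class probabilities."""
    forest = RandomForestClassifier(n_estimators=10, criterion="entropy", random_state=seed)
    return forest.fit(X_toy, y_toy).predict_proba(X_toy)

# Two fits with the same seed should give exactly the same predictions
same_seed_match = np.array_equal(fit_and_predict(1000), fit_and_predict(1000))
print("Identical with the same seed:", same_seed_match)
```

Leaving `random_state` unset lets the bootstrap samples and feature subsets vary between runs, which is why repeated runs with the same settings can report slightly different accuracies.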

### ========== Question 2.9 ==========
Compare the feature importances as estimated with the decision tree and random forest classifiers.

In [None]:
# Your Code goes here: