# Unit 5 – Practical Modelling and Predictions

Today we will learn how to practically tune a classifier by using cross-validation and a hold-out test set. Finally, there will be a competition for you to hone your skills.

Let's dive right into it. 

## Today's idioms
### Training a Naive Bayes classifier in scikit-learn
Scikit-learn comes with various implementations of the Naive Bayes classifier. The "standard" one is the Gaussian variant, which assumes that, within each class, every feature follows a normal distribution (although it works reasonably well for cases where this assumption doesn't quite hold).

In [1]:
import matplotlib
%matplotlib inline
import numpy as np
import pandas as pd
import sklearn.datasets as data

X, y  = data.load_iris(return_X_y=True)  # returns data to be used in estimator
# X is the data, y are the targets:
print("data has shape {}, targets have shape {}.".format(X.shape, y.shape))


data has shape (150, 4), targets have shape (150,).


We instantiate and train/fit the classifier:

In [2]:
import sklearn.naive_bayes as nb
gnb = nb.GaussianNB() # instantiates a Gaussian Naive Bayes object
gnb.fit(X=X, y=y)
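
To make the Gaussian assumption a bit more tangible, here is a small sketch of what fitting estimates. It relies on the `theta_` and `class_prior_` attributes of `GaussianNB`; the variable name `class_means` is just an illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
gnb = GaussianNB().fit(X, y)

# theta_ holds one row per class with the estimated mean of every feature
print(gnb.theta_.shape)

# These fitted means are simply the per-class sample means of the data
class_means = np.array([X[y == k].mean(axis=0) for k in gnb.classes_])
print(np.allclose(gnb.theta_, class_means))

# class_prior_ holds the estimated class frequencies
print(gnb.class_prior_)
```

So "training" here amounts to little more than computing per-class statistics, which is why the fit is so fast.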

Let's predict the training data:

In [3]:
y_pred = gnb.predict(X=X)
y_pred

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

Note that the targets 0, 1, 2 stand for *setosa*, *versicolor* and *virginica*. It looks like there are a few errors though! This is how the real targets look:

In [4]:
y

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

Let's compare the two:

In [5]:
y - y_pred

array([ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0, -1,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0, -1,  0,  0,  0,  0,  0,  0, -1,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  1,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        1,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  1,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0])

This would be all zeros if every prediction were correct. To compute the accuracy, i.e. the percentage of correct predictions, we first mark the misclassified samples:

In [6]:
y != y_pred

array([False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False,  True, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False,  True, False,
       False, False, False, False, False,  True, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False,  True, False,
       False, False, False, False, False, False, False, False, False,
       False, False,  True, False, False, False, False, False, False,
       False, False,

Comparing with `!=` is a more general way to compare classifications than subtracting. Think about when these two ways produce different results and why this one is correct.
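
If you want a hint, here is a toy sketch with made-up labels (not the iris predictions) that shows two situations in which subtraction is awkward while `!=` keeps working.

```python
import numpy as np

# Toy labels, invented for illustration only
y_true = np.array([0, 1, 2, 1])
y_hat = np.array([1, 0, 2, 1])  # two mistakes: a 0 predicted as 1, a 1 predicted as 0

diff = y_true - y_hat
print(diff)        # signed differences
print(diff.sum())  # summing them lets the two errors cancel out

n_wrong = np.sum(y_true != y_hat)
print(n_wrong)     # counting mismatches does not have this problem

# Labels are not always numbers; != still works on strings
names_true = np.array(["setosa", "versicolor"])
names_hat = np.array(["setosa", "virginica"])
mismatch = names_true != names_hat
print(mismatch)

try:
    names_true - names_hat
    subtraction_ok = True
except TypeError:
    # NumPy has no subtraction for string arrays
    subtraction_ok = False
print(subtraction_ok)
```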

In [7]:
fraction_wrong = np.sum(y != y_pred)/len(y) # how many nonzeros, divided by total number
acc = 100*(1-fraction_wrong)
print("accuracy is {:.2f}%".format(acc))

accuracy is 96.00%


96 percent correct!
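
As a cross-check, the same number can be obtained in a couple of other ways. The sketch below compares the manual formula with `np.mean` and with `sklearn.metrics.accuracy_score`; the variable names are chosen for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
y_pred = GaussianNB().fit(X, y).predict(X)

# Manual formula from the cell above
acc_manual = 100 * (1 - np.sum(y != y_pred) / len(y))
# The mean of a boolean array is the fraction of True entries
acc_mean = 100 * np.mean(y == y_pred)
# scikit-learn's helper returns a fraction between 0 and 1
acc_sklearn = 100 * accuracy_score(y, y_pred)

print("manual: {:.2f}%, mean: {:.2f}%, accuracy_score: {:.2f}%".format(
    acc_manual, acc_mean, acc_sklearn))
```

All three should agree; which one you use is mostly a matter of taste.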

Let's see what an SVM can do.

### Training an SVM in scikit-learn

We use an SVM with a linear kernel. The process is the standard one. First we instantiate the classifier:

In [8]:
from sklearn import svm
svc = svm.SVC(kernel='linear') 

Then we fit and predict: 

In [9]:
svc.fit(X=X, y=y)
y_pred = svc.predict(X=X)

Finally we compute the accuracy:

In [10]:
acc = 100*(1-np.sum(y != y_pred)/len(y))
print("accuracy is {:.2f}%".format(acc))

accuracy is 99.33%


Hey, that's already better than Naive Bayes, at least on the training data! Later you can try to get even better by tuning the parameters.

But first let's try a Multi-Layer Neural Network.

### Training a Multi-Layer Neural Network in scikit-learn

By now you should have gotten the gist of how to instantiate, train and predict using a classifier. The process is standard in `sklearn`, which means you can use it with any estimator/classifier.
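
Because every classifier shares the same `fit`/`predict` interface, the same loop can drive several of them. This is a minimal sketch using the two classifiers seen so far; the dictionary names `classifiers` and `training_accuracy` are only illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Any estimator with fit/predict could be added to this dictionary
classifiers = {
    "GaussianNB": GaussianNB(),
    "SVC (linear)": SVC(kernel="linear"),
}

training_accuracy = {}
for name, clf in classifiers.items():
    clf.fit(X, y)
    # Fraction of training samples that are predicted correctly
    training_accuracy[name] = np.mean(clf.predict(X) == y)
    print("{}: {:.2f}%".format(name, 100 * training_accuracy[name]))
```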

Here we lump it all together in one cell. 

In [11]:
from sklearn import neural_network as nn
mlp = nn.MLPClassifier()  #instantiates a Multi-Layer Perceptron (a specific Multi-Layer Neural Network)
mlp.fit(X=X, y=y)
y_pred = mlp.predict(X=X)
acc = 100*(1-np.sum(y != y_pred)/len(y))
print("accuracy is {:.2f}%".format(acc))


accuracy is 98.00%




Oh, a warning! It complains that the training process of the network has not yet converged, because the maximum iterations the algorithm is instructed to perform (by default) have been reached. 

### Task 1 - fix the warning
Check the documentation of `sklearn.neural_network.MLPClassifier` to find the parameter that controls the maximum number of iterations. Set it to a higher value until the warning goes away.

##### nn.MLPClassifier?

Fix the code below to make the warning disappear:

In [12]:
# Raise max_iter above its default of 200 until the warning disappears
mlp = nn.MLPClassifier(max_iter=800)
mlp.fit(X=X, y=y)
y_pred = mlp.predict(X=X)
acc = 100*(1-np.sum(y != y_pred)/len(y))
print("accuracy is {:.2f}%".format(acc))

accuracy is 98.00%
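
Rather than watching for the warning by eye, you can also check for it in code. The sketch below is one possible approach, assuming the warning is raised as `sklearn.exceptions.ConvergenceWarning`; the helper `fit_and_check` is a made-up name, and `random_state=0` is only there to make runs repeatable.

```python
import warnings

from sklearn.datasets import load_iris
from sklearn.exceptions import ConvergenceWarning
from sklearn.neural_network import MLPClassifier

X, y = load_iris(return_X_y=True)


def fit_and_check(max_iter):
    """Fit an MLP and report whether it warned, and how many iterations it ran."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        mlp = MLPClassifier(max_iter=max_iter, random_state=0).fit(X, y)
    warned = any(issubclass(w.category, ConvergenceWarning) for w in caught)
    return warned, mlp.n_iter_


results = {}
for max_iter in (20, 2000):
    results[max_iter] = fit_and_check(max_iter)
    warned, n_iter = results[max_iter]
    print("max_iter={}: warned={}, iterations used={}".format(max_iter, warned, n_iter))
```

If the number of iterations used equals `max_iter`, the optimiser was cut off rather than finished.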


## Cross-validation 

Up to now we trained the classifier on all available data. It is very likely that we are overestimating the ability of the trained classifier to predict unseen data. But since we don't *have* unseen data, how can we test how well the classifier might generalise?

The solution is **cross-validation**. In a nutshell, it means you split your data into several equal-sized chunks, so-called folds. You keep one fold out, train the classifier on the remaining folds, and compute the accuracy on the held-out fold. Repeat until each fold has been held out once, which gives you one accuracy score per fold.
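
The procedure can be written out by hand in a few lines. This sketch uses `StratifiedKFold`, which is what scikit-learn picks for classifiers when you pass an integer number of folds, and compares the result with the `cross_val_score` helper; `fold_scores` and `reference` are illustrative names.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

fold_scores = []
splitter = StratifiedKFold(n_splits=5)
for train_idx, held_out_idx in splitter.split(X, y):
    # Train a fresh classifier on everything except the held-out fold
    clf = SVC(kernel="linear", C=1)
    clf.fit(X[train_idx], y[train_idx])
    # score() returns the accuracy on the held-out fold
    fold_scores.append(clf.score(X[held_out_idx], y[held_out_idx]))
fold_scores = np.array(fold_scores)
print(fold_scores)

# The library helper does the same splitting and looping for us
reference = cross_val_score(SVC(kernel="linear", C=1), X, y, cv=5)
print(np.allclose(fold_scores, reference))
```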

The scikit-learn online documentation has a [good overview](https://scikit-learn.org/stable/modules/cross_validation.html), which I recommend reading should you feel you need a refresher.

The cross-validation module is straightforward to use. It does the data splitting for you and loops over the fit/predict cycle. All you need to provide is an instance of a classifier.

Below is an example of cross-validation with an SVM with a linear kernel and `C=1`:

In [13]:
import sklearn.model_selection as ms
clf = svm.SVC(kernel='linear', C=1)
scores = ms.cross_val_score(clf, X, y, cv=5)
scores

array([0.96666667, 1.        , 0.96666667, 0.96666667, 1.        ])

Let's compute the mean score and a rough 95% interval (i.e., two standard deviations of the fold scores):

In [14]:
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

Accuracy: 0.98 (+/- 0.03)
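
Keep in mind that two standard deviations of five fold scores is a rule of thumb for the spread between folds, not a rigorous confidence interval. As a sketch, here are the same numbers recomputed from the scores printed above, next to the standard error of the mean (`spread` and `sem` are illustrative names).

```python
import numpy as np

# Fold scores copied from the cross-validation output above
scores = np.array([0.96666667, 1.0, 0.96666667, 0.96666667, 1.0])

mean = scores.mean()
# Spread between folds: two (population) standard deviations
spread = 2 * scores.std()
# Standard error of the mean, using the sample standard deviation
sem = scores.std(ddof=1) / np.sqrt(len(scores))

print("mean: %0.2f" % mean)
print("2 * std: %0.2f" % spread)
print("standard error of the mean: %0.3f" % sem)
```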


## Parameter tuning: some examples

### Creating a hold-out data set: test-training split
When tuning parameters, it is advisable to create a hold-out set that you don't touch until your parameter tuning campaign is finished. 

You can tune several candidate models on the training data and use the hold-out set only at the very end, in order to find the model that generalises best.

Scikit-learn provides the `train_test_split` function for this purpose. The example below holds out 40% of the data as a test set and keeps the remaining 60% for training:

In [15]:
X_train, X_test, y_train, y_test = ms.train_test_split(
     X, y, test_size=0.4, random_state=0)

Let's look at the sizes:

In [16]:
X_train.shape

(90, 4)

In [17]:
y_train.shape

(90,)

In [18]:
X_test.shape

(60, 4)

In [19]:
y_test.shape

(60,)

That looks fine. 
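
The shapes tell us how many samples ended up on each side, but not how the classes are distributed. As an optional sketch, `train_test_split` accepts a `stratify` argument that keeps the class proportions equal in both parts; the variable names below are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Plain random split, as in the cell above
_, _, _, y_test_plain = train_test_split(
    X, y, test_size=0.4, random_state=0)

# Stratified split: each class keeps its share of the data
_, _, _, y_test_strat = train_test_split(
    X, y, test_size=0.4, random_state=0, stratify=y)

counts_plain = np.bincount(y_test_plain, minlength=3)
counts_strat = np.bincount(y_test_strat, minlength=3)
print("test samples per class, plain:     ", counts_plain)
print("test samples per class, stratified:", counts_strat)
```

The rest of this worksheet keeps using the plain split, so the numbers below are unaffected.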

In the remainder, we will not be touching the test set until we have completed the parameter optimisation. 

### SVM tuning

Tunable parameters for soft-margin SVMs include:
 * $C$, a positive, numerical value, that controls the penalty for misclassifications.
 * the kernel choice. Sensible beginner choices in scikit-learn are `linear`, `poly`, `rbf`, `sigmoid`. These come with different tuning options. 
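
To illustrate what "different tuning options" can look like, here is a sketch that sets a few kernel-specific parameters of `SVC` (`degree`, `gamma`, `coef0`). The particular values are arbitrary examples, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0)

# Illustrative settings only; each kernel has its own extra knobs
candidates = {
    "linear, C=10": SVC(kernel="linear", C=10),
    "poly, degree=2": SVC(kernel="poly", degree=2, C=1),
    "rbf, gamma=0.1": SVC(kernel="rbf", gamma=0.1, C=1),
    "sigmoid, coef0=0": SVC(kernel="sigmoid", coef0=0.0, C=1),
}

cv_means = {}
for label, clf in candidates.items():
    # Cross-validate on the training part only; the test set stays untouched
    cv_means[label] = cross_val_score(clf, X_train, y_train, cv=5).mean()
    print("{}: {:.4f}".format(label, cv_means[label]))
```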

#### Stuck? Read the docs!
Refer to the documentation on [support vector classification](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC) for more info.

There is also an excellent [SVM user guide](https://scikit-learn.org/stable/modules/svm.html) available in the on-line documentation of scikit-learn.

### Task 2: Hands-on tuning
Use the code below to try out a few parameter settings. Note that it uses cross-validation, using the previously generated training data from the `train_test_split` above.

In [20]:
svc = svm.SVC(kernel='linear', C=1)
scores = ms.cross_val_score(svc, X_train, y_train, cv=5)
print("Accuracy: %0.4f (+/- %0.4f)" % (scores.mean(), scores.std() * 2))

Accuracy: 0.9889 (+/- 0.0444)


In [21]:
## Attempt 1
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.svm import SVC

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0
)

# Set the kernel type to 'linear'
svc = SVC(kernel='linear', C=1)
scores = cross_val_score(svc, X_train, y_train, cv=5)

mean_accuracy = scores.mean()
std_accuracy = scores.std() * 2

print("Accuracy: %0.4f (+/- %0.4f)" % (mean_accuracy, std_accuracy))

Accuracy: 0.9889 (+/- 0.0444)


In [22]:
## Attempt 2

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0
)

svc = SVC(kernel='poly', C=1)
scores = cross_val_score(svc, X_train, y_train, cv=5)

mean_accuracy = scores.mean()
std_accuracy = scores.std() * 2

print("Accuracy: %0.4f (+/- %0.4f)" % (mean_accuracy, std_accuracy))

Accuracy: 0.9889 (+/- 0.0444)


In [23]:
## Attempt 3
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.svm import SVC


X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0
)

svc = SVC(kernel='rbf', C=1)

scores = cross_val_score(svc, X_train, y_train, cv=5)

mean_accuracy = scores.mean()
std_accuracy = scores.std() * 2

print("Accuracy: %0.4f (+/- %0.4f)" % (mean_accuracy, std_accuracy))


Accuracy: 0.9778 (+/- 0.0889)


In [24]:
## Attempt 4

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.svm import SVC
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0
)

# Set the kernel type to 'sigmoid'
svc = SVC(kernel='sigmoid', C=1)

# Perform 5-fold cross-validation and evaluate the accuracy
scores = cross_val_score(svc, X_train, y_train, cv=5)

# Calculate and print the mean accuracy and standard deviation
mean_accuracy = scores.mean()
std_accuracy = scores.std() * 2

print("Accuracy: %0.4f (+/- %0.4f)" % (mean_accuracy, std_accuracy))

Accuracy: 0.3778 (+/- 0.0444)


When you have found a good combination of $C$ and kernel choice, execute the code below to store that classifier in the variable `best_svm`, and proceed to the next part, "Neural networks".

In [25]:
import sklearn.model_selection as ms
from sklearn import svm

kernel_types = ['linear', 'poly', 'rbf', 'sigmoid']
C_values = [0.1, 1, 10]

# Initialize variables to store the best accuracy and parameters
best_accuracy = 0
best_params = {}

# Try every combination of kernel type and C
for kernel in kernel_types:
    for C in C_values:
        svc = svm.SVC(kernel=kernel, C=C)
        scores_values = ms.cross_val_score(svc, X_train, y_train, cv=5)

        # Calculate the mean accuracy
        mean_accuracy = scores_values.mean()

        # Keep this combination only if it beats the best one so far
        if mean_accuracy > best_accuracy:
            best_accuracy = mean_accuracy
            best_params = {'kernel': kernel, 'C': C}

best_svm = svm.SVC(kernel=best_params['kernel'], C=best_params['C'])
print(f"Best Kernel: {best_params['kernel']}, Best C: {best_params['C']}")
print("Best Acc.: %0.4f" % best_accuracy)

Best Kernel: linear, Best C: 1
Best Acc.: 0.9889
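
The nested loop above is essentially a grid search. As an alternative sketch, scikit-learn's `GridSearchCV` does the same bookkeeping for you. Note that ties between equally good settings may be broken differently than in the loop, so the reported parameters can differ even when the best score is the same. The name `best_svm_alt` is made up to avoid overwriting `best_svm`.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0)

# Same grid as in the loop above
param_grid = {
    "kernel": ["linear", "poly", "rbf", "sigmoid"],
    "C": [0.1, 1, 10],
}

search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print("Best Acc.: %0.4f" % search.best_score_)

# By default the best setting is refitted on the whole training set
best_svm_alt = search.best_estimator_
```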


### Neural networks

Neural networks have tons of parameters to tune, and they are usually much more difficult to understand than an SVM's parameters. Have a look at the [MLPClassifier on-line documentation](https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html#sklearn.neural_network.MLPClassifier) to get a glimpse. Note that these are not even deep networks, for which the number of parameters multiplies!

The most common tuning parameters are:
* `hidden_layer_sizes`: the number of hidden layers and the neurons therein. Examples:
    * `hidden_layer_sizes=(100,)` would create a single hidden layer with 100 neurons (the default)
    * `hidden_layer_sizes=(20,10)` would create two hidden layers, the first with 20 and the second with 10 neurons. 
* `activation`: the transfer function. The default is `relu`, the rectified linear unit, but other choices are worth a try, e.g. the sigmoidal functions `tanh` or `logistic`.
* `learning_rate`: the learning-rate schedule. The default is `constant`, but for complicated data sets it can help to set it to other options. Note that it only has an effect with `solver='sgd'`. Check the documentation for an explanation. 

The [MLPClassifier on-line documentation](https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html#sklearn.neural_network.MLPClassifier) states a useful tweak to the `solver` parameter:
> The default solver ‘adam’ works pretty well on relatively large datasets (with thousands of training samples or more) in terms of both training time and validation score. For small datasets, however, ‘lbfgs’ can converge faster and perform better.

We will therefore use the 'lbfgs' solver. 

Store the best MLP you find in the variable `best_mlp`.

In [26]:
mlp = nn.MLPClassifier(hidden_layer_sizes=(100,10), 
                       activation='relu',
                       learning_rate='constant',
                       solver='lbfgs')
scores = ms.cross_val_score(mlp, X_train, y_train, cv=5)
print("Accuracy: %0.4f (+/- %0.4f)" % (scores.mean(), scores.std() * 2))

Accuracy: 0.9667 (+/- 0.0544)
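
To try more than one configuration, the same loop idea as for the SVM works here too. This is a small sketch over a handful of arbitrary layouts and activations; `max_iter=1000` and `random_state=0` are assumptions added to reduce convergence warnings and make the run repeatable, and your scores may vary with the scikit-learn version.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0)

layouts = [(10,), (20, 10)]        # illustrative hidden layer layouts
activations = ["relu", "tanh"]     # illustrative transfer functions

results = {}
for layout in layouts:
    for activation in activations:
        candidate = MLPClassifier(hidden_layer_sizes=layout,
                                  activation=activation,
                                  solver="lbfgs",
                                  max_iter=1000,
                                  random_state=0)
        scores = cross_val_score(candidate, X_train, y_train, cv=5)
        results[(layout, activation)] = scores.mean()
        print("{} / {}: {:.4f}".format(layout, activation, scores.mean()))

# Configuration with the highest mean cross-validated accuracy
best_config = max(results, key=results.get)
print("best:", best_config)
```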


As above, we store the best network for evaluation on the test set. 

In [27]:
best_mlp = mlp

### What's the best classifier?
Now our hold-out dataset comes into play. Let's evaluate the best MLP and the best SVM on the hold-out set.

Verify that `best_svm` has the correct parameters:

In [28]:
best_svm

We train it on the entire training set and evaluate it on the held-out test set.

In [29]:
# SVM 
best_svm.fit(X=X_train, y=y_train)
y_pred_svm = best_svm.predict(X_test)
acc = 100*(1-np.sum(y_test != y_pred_svm)/len(y_test))
print("accuracy is {:.4f}%".format(acc))

accuracy is 96.6667%


Same for the Neural Network:

In [30]:
best_mlp

In [31]:
# MLP 
best_mlp.fit(X=X_train, y=y_train)
y_pred_mlp = best_mlp.predict(X_test)
acc = 100*(1-np.sum(y_test != y_pred_mlp)/len(y_test))
print("accuracy is {:.4f}%".format(acc))

accuracy is 96.6667%


### Task 3 - Repeat with the Naive Bayes classifier
Repeat the training procedure with the Naive Bayes algorithm. Copy the necessary code from above and adapt it to use the training set for training and the test set for prediction.

In [32]:
## enter your code here.

from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# Train the Naive Bayes classifier
# (named gnb so it does not shadow the `nb` module alias imported earlier)
gnb = GaussianNB()
gnb.fit(X_train, y_train)

y_pred_nb = gnb.predict(X_test)

# Calculate and print the accuracy
accuracy = accuracy_score(y_test, y_pred_nb)
print("Naive Bayes accuracy: {:.4f}%".format(accuracy * 100))

Naive Bayes accuracy: 93.3333%


### Task 4 - Comparison and Understanding Questions
* How does the Naive Bayes classifier compare to the MLP and SVM classifiers?
* In this worksheet we only looked at accuracy. What else could we look at?
* Naive Bayes does not have real options to change its outcome like the MLP and SVM do. Why is this?

##### convert to a markdown cell and write your answer

Naive Bayes is a probabilistic model based on Bayes' theorem.
Naive Bayes makes strong independence assumptions between features, which simplifies the model but may not capture complex relationships in the data.
Naive Bayes does not have as many hyperparameters to tune, which can be an advantage for quick implementation but a limitation when dealing with more complex datasets.
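
On the question of what to look at besides accuracy, one option is to see *which* classes get confused. The following sketch uses `confusion_matrix` and `classification_report` from `sklearn.metrics` on the Naive Bayes predictions; it is one possible way to dig deeper, not the only one.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.4, random_state=0)

y_pred = GaussianNB().fit(X_train, y_train).predict(X_test)

# Rows are the true classes, columns the predicted classes
cm = confusion_matrix(y_test, y_pred)
print(cm)

# Per-class precision, recall and F1 score
print(classification_report(y_test, y_pred, target_names=iris.target_names))

# The diagonal of the confusion matrix counts the correct predictions
accuracy = accuracy_score(y_test, y_pred)
print("accuracy from diagonal: {:.4f}".format(cm.trace() / cm.sum()))
```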