# Lab: machine learning protocol, part 2

Author: Alasdair Newson

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import random
# data functionalities
from sklearn.datasets import make_classification
from sklearn.datasets import make_moons, make_circles, make_blobs, fetch_covtype
from sklearn.preprocessing import StandardScaler

# models
from sklearn.svm import SVC
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge, Lasso
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.neighbors import KNeighborsClassifier

# pipeline and metrics
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# suppress annoying warnings
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

random_seed = 42


In [None]:
# STUDENT

# 1/ Common machine learning models

In this section, we look at some common machine learning algorithms and compare them. First, we create some toy data with the ```make_classification``` function of scikit-learn (you can use this to test out algorithms for your projects). This creates toy data using Gaussian clusters with a certain covariance (automatically imposed). Among all the features, some are actually useful (```n_informative```), some are redundant (```n_redundant```, linear combinations of the useful features), some are repeated (```n_repeated```), and the rest are random features. You can specify the number of classes with ```n_classes```.

We create the data now (code given):

In [None]:
# ==== 1. Generate toy tabular data ====
# Binary classification with 20 features, some informative, some noisy
X, y = make_classification(
    n_samples=1000,
    n_features=20,
    n_classes=2,
    n_informative=10,
    n_redundant=5,
    n_repeated=0,
    n_clusters_per_class=2,
    class_sep=1.5,
    random_state=random_seed,
)


Now, split the data into train and test sets. Use an $80\%$/$20\%$ split.

In [None]:
# STUDENT

Now, we will test some common machine learning models and compare their performances.

First, we will start with the SVM. We will try three versions:
- standard soft margin SVM
- soft margin SVM with Polynomial Kernel
- soft margin SVM with Gaussian Radial Basis Function Kernel

Before this, we will normalise the data. This is particularly important for the Gaussian Radial Basis Function kernel: it is based on the distance between two points (see slides), so if the scale of one feature is much larger than the others, that feature will dominate the classification. To make the scaling/classification process more automatic, we can use the ```Pipeline``` functionality of scikit-learn. The syntax is, for example:
- ```Pipeline([("scaler", ...), ("svm", ...)])```

The ```...``` parts must be filled in with the scaling and classification functions you wish to use.
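
To see why scaling matters for the Gaussian RBF kernel, the minimal sketch below evaluates $k(x, x') = \exp(-\gamma \|x - x'\|^2)$ by hand on three points, before and after standardisation. The helper `rbf_kernel_value`, the value `gamma = 0.5` and the toy points are assumptions made for illustration; they are not part of the lab data.

```python
import numpy as np

def rbf_kernel_value(a, b, gamma=0.5):
    """Gaussian RBF kernel exp(-gamma * ||a - b||^2) between two points."""
    return np.exp(-gamma * np.sum((a - b) ** 2))

# Hypothetical points: feature 0 is on a unit scale, feature 1 is in the thousands.
points = np.array([
    [0.0, 1000.0],   # reference point p0
    [3.0, 1001.0],   # p1: differs from p0 mostly along feature 0
    [0.1, 1010.0],   # p2: differs from p0 mostly along feature 1
])

# Raw data: the squared distance is dominated by the large-scale feature 1.
k_raw_01 = rbf_kernel_value(points[0], points[1])
k_raw_02 = rbf_kernel_value(points[0], points[2])

# Standardise each feature to zero mean and unit variance
# (the same transformation StandardScaler applies).
scaled = (points - points.mean(axis=0)) / points.std(axis=0)
k_std_01 = rbf_kernel_value(scaled[0], scaled[1])
k_std_02 = rbf_kernel_value(scaled[0], scaled[2])

print(f"raw:    k(p0, p1) = {k_raw_01:.3g}, k(p0, p2) = {k_raw_02:.3g}")
print(f"scaled: k(p0, p1) = {k_std_01:.3g}, k(p0, p2) = {k_std_02:.3g}")
```

On the raw points, the kernel value between p0 and p2 is numerically negligible because the difference along feature 1 dominates the distance. After standardisation, both features contribute on a comparable scale and the two kernel values are of the same order.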

Do this now for the three cases above. You can use the svm documentation to help you:
- https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html

In [None]:
random_seed = 42

linear_svm = Pipeline([  ("scaler", ...),
        ("linear_svm", ...)])

poly_svm = Pipeline([  ("scaler", ...),
        ("poly_svm", ...)])

rbf_svm = Pipeline([  ("scaler", ...),
        ("rbf_svm", ...)])

Now, train and test these models using the data above. Use the ```accuracy_score``` function to evaluate the models.
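
As a reminder of the general fit/predict/score workflow with a ```Pipeline```, here is a small sketch on stand-in data. The classifier (`LogisticRegression`), the step name `"clf"` and the `*_demo` variables are placeholders chosen for illustration only; they are not the SVM models asked for in this exercise.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in data, separate from the lab's X and y.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# A pipeline chains a scaler and a classifier behind a single fit/predict interface.
demo_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression()),   # placeholder classifier, not an SVM
])

demo_pipeline.fit(Xtr, ytr)          # scaler statistics are computed on the training set only
y_pred = demo_pipeline.predict(Xte)  # the same scaling is re-applied before predicting
demo_accuracy = accuracy_score(yte, y_pred)
print(f"test accuracy: {demo_accuracy:.3f}")
```

Fitting the scaler inside the pipeline keeps test-set statistics out of the training step, which is one reason to prefer it over scaling the whole dataset by hand.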

In [None]:
# STUDENT

What are your conclusions? Let's make the data more complicated now ("circles" data):

In [None]:
X, y = make_circles(n_samples=2000, factor=0.7, noise=0.1, random_state=random_seed)

Carry out the classification with the new dataset: 

In [None]:
# STUDENT

Now, let's try some other models:
- random forests : https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
- gradient boosting : https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html

Use the following hyper-parameters:

For random forests:
- ```n_estimators=300```
- ```max_depth=None```

For gradient boosting:
- ```n_estimators=300```
- ```max_depth=3```
- ```learning_rate=0.05```

Note that you do not need to use scaling and a ```Pipeline``` for these models. This is because decision trees make their decisions based on splits (thresholds), which depend only on the ordering of a variable's values, not on its scale (although you can try scaling if you wish to see whether it performs better).
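
The claim that tree splits depend only on the ordering of each variable can be checked empirically. The sketch below, on hypothetical toy data (the `*_demo` names, the single `DecisionTreeClassifier` and the scale factors are all assumptions for illustration), trains the same tree on raw and on rescaled features and compares the test predictions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data, separate from the datasets used in the lab.
X_demo, y_demo = make_classification(n_samples=400, n_features=5, n_informative=3,
                                     n_redundant=0, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# Rescale every feature by a different positive constant: the ordering of the
# values within each feature is unchanged, only their magnitude differs.
scales = np.array([1.0, 8.0, 64.0, 512.0, 4096.0])

tree_raw = DecisionTreeClassifier(max_depth=4, random_state=0).fit(Xtr, ytr)
tree_scaled = DecisionTreeClassifier(max_depth=4, random_state=0).fit(Xtr * scales, ytr)

pred_raw = tree_raw.predict(Xte)
pred_scaled = tree_scaled.predict(Xte * scales)

# Fraction of test points on which the two trees give the same label.
agreement = np.mean(pred_raw == pred_scaled)
print(f"Fraction of identical test predictions: {agreement:.3f}")
```

Up to floating-point effects, the two trees should partition the data in the same way, so the predictions are expected to agree. The same reasoning carries over to ensembles of trees such as random forests and gradient boosting.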

In [None]:
# STUDENT

What do you conclude from comparing the SVMs, random forest and gradient boosting? Why do you think you observe these results?


ANSWER: the SVM with the Gaussian RBF kernel works better because these are toy data with a strong geometric structure (circles); a kernel based on the distance between points is well suited to this case.

## 1.1 Real-world datasets

The previous datasets were toy datasets with a strong geometric meaning (concentric circles), and synthetic (i.e., we generated them on the fly). In the real world, datasets are not usually like this: they are fixed, complex and often not "geometrically" meaningful, because each variable has a semantic meaning (unless there is correlation between variables).

Let's turn to some more complex data (still classification): the "cover-type" data. This is forestry data, with 7 classes and 54 variables/features. The goal is to classify areas of forest by dominant tree species (spruce, aspen, etc.):
- https://archive.ics.uci.edu/dataset/31/covertype

The variables are quantities such as elevation. Since the dataset is quite large, we will take a subset of the data for this lab, to make it run faster. First, let's load the data and carry out the train/test split:

In [None]:
# Load data
X, y = fetch_covtype(return_X_y=True)
# reduce dataset size
n = 3000
X = X[0:n,:]
y = y[0:n]

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


In [None]:
# STUDENT

In [None]:
# STUDENT

Compare the performance and the execution time of all the models (3 SVMs, random forest, gradient boosting). What are the advantages and disadvantages of each?
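
One possible way to measure execution time alongside accuracy is sketched below. The helper `timed_evaluation` is hypothetical (it is not a scikit-learn function), and the k-nearest-neighbours model and `*_demo` data are stand-ins used only to show how the helper is called.

```python
import time

from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

def timed_evaluation(model, X_train, y_train, X_test, y_test):
    """Fit `model`, predict on the test set, and return
    (accuracy, fit_seconds, predict_seconds)."""
    start = time.perf_counter()
    model.fit(X_train, y_train)
    fit_seconds = time.perf_counter() - start

    start = time.perf_counter()
    y_pred = model.predict(X_test)
    predict_seconds = time.perf_counter() - start

    return accuracy_score(y_test, y_pred), fit_seconds, predict_seconds

# Hypothetical stand-in data and model, used only to demonstrate the helper.
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

acc, t_fit, t_pred = timed_evaluation(KNeighborsClassifier(n_neighbors=5), Xtr, ytr, Xte, yte)
print(f"accuracy={acc:.3f}  fit={t_fit:.4f}s  predict={t_pred:.4f}s")
```

Timing training and prediction separately is useful because some models are cheap to fit but slow to query, and others the reverse. Timings vary from run to run, so repeating the measurement gives a more reliable comparison.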

In [None]:
# STUDENT

Conclusions on this new dataset ?

# 2/ Evaluation metrics: precision, recall, F1


As we saw in class, in binary classification problems we use the notions of accuracy, precision, recall, and the F1 score. Let us recall their definitions:
\begin{equation}
    \text{accuracy} = \frac{ \sum_i{ \textbf{1}_{\hat{Y}_i=Y_i}}}{n}
\end{equation}

\begin{equation}
    \text{precision} = \frac{ \sum_i{ \textbf{1}_{\hat{Y}_i=1 \& Y_i=1}}}{\sum_i{ \textbf{1}_{\hat{Y}_i=1}}}
\end{equation}

\begin{equation}
    \text{recall} = \frac{ \sum_i{ \textbf{1}_{\hat{Y}_i=1 \& Y_i=1}}}{\sum_i{ \textbf{1}_{Y_i=1}}}
\end{equation}

\begin{equation}
    \text{F1} = 2 \times \frac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}}
\end{equation}

We also define the __confusion matrix__:

| **Label / Prediction** | **Negative**        | **Positive**        |
| ---------------------- | ------------------- | ------------------- |
| **Negative**           | TN   | FP |
| **Positive**           | FN | TP |


where:

- TN: number of true negatives
- FN: number of false negatives 
- FP: number of false positives
- TP: number of true positives
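
The definitions above can be checked by hand on a small example. The sketch below uses made-up labels and predictions (`y_true` and `y_pred` are illustrative, not lab data) to count the four entries of the confusion matrix with NumPy and derive the metrics from them. In terms of these counts, precision is TP / (TP + FP) and recall is TP / (TP + FN).

```python
import numpy as np

# Hypothetical labels and predictions (1 = positive, 0 = negative).
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 0, 0, 1, 0, 0, 0, 0, 0])

# Entries of the confusion matrix.
tp = int(np.sum((y_pred == 1) & (y_true == 1)))  # predicted positive, truly positive
tn = int(np.sum((y_pred == 0) & (y_true == 0)))  # predicted negative, truly negative
fp = int(np.sum((y_pred == 1) & (y_true == 0)))  # predicted positive, truly negative
fn = int(np.sum((y_pred == 0) & (y_true == 1)))  # predicted negative, truly positive

# Same layout as the table: rows = label, columns = prediction.
confusion = np.array([[tn, fp],
                      [fn, tp]])

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(confusion)
print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Here TP = 2, TN = 5, FP = 1 and FN = 2, which gives an accuracy of 7/10, a precision of 2/3, a recall of 1/2 and an F1 score of 4/7.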

In this section, we will compute these different elements. To illustrate these concepts, we will use data from the circles dataset:
```
X, Y = make_circles(n_samples=n, factor=0.7, noise=0.1, random_state=0)
```

Create the data with $n=300$ and the other arguments as above. Display them using a scatter plot:

In [None]:
# STUDENT

Split into train/test:

In [None]:
# STUDENT

Now we choose a classification algorithm. For a change, let's use another algorithm: k-nearest neighbours. Implement this now with $k = 20$. The syntax is:

```
neigh = KNeighborsClassifier(n_neighbors=...)
```

In [None]:
# STUDENT

Use the scikit-learn functions (see imports at the beginning) to calculate accuracy, precision, recall and F1 score:

In [None]:
# STUDENT

Are these results good? Let's take a more complicated case.

## 2.1 Unbalanced data

We will use the "blobs" dataset from scikit-learn:

https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_blobs.html



In [None]:
centers = [(0, 0), (2, 2)]
n = 2000
X, y = make_blobs(n_samples=n, centers=centers, shuffle=False, random_state=random_seed)


Display these data. Do you think the separation will be as easy as before?

In [None]:
# STUDENT

Now, we will remove a large portion of the data labeled as positive (label = 1). Since the first $n/2$ samples are labeled 0 and the following ones are labeled 1 (because we set shuffle=False), we simply need to remove samples from the end of $X$ and $y$.

Remove $\tau=90\%$ of the positive samples from the dataset. To do this, you can create new variables called ```X_reduced``` and ```y_reduced```. Then, display the reduced dataset to check that the operation worked correctly.
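
The slicing idea can be illustrated on a miniature array ordered in the same way (all zeros first, then all ones). The names `X_demo`, `y_demo` and the size `n_demo = 20` are assumptions for this sketch; the lab's own `X` and `y` are not used.

```python
import numpy as np

# Hypothetical miniature dataset ordered like the lab's blobs:
# first half labelled 0, second half labelled 1.
n_demo = 20
X_demo = np.arange(2 * n_demo, dtype=float).reshape(n_demo, 2)
y_demo = np.repeat([0, 1], n_demo // 2)

tau = 0.9                                      # fraction of positives to remove
n_pos = int(np.sum(y_demo == 1))               # number of positive samples
# Round before casting: (1 - tau) * n_pos can fall just below an integer
# in floating point, and int() alone would truncate it.
n_pos_keep = int(round((1 - tau) * n_pos))     # positives that survive
n_keep = len(y_demo) - (n_pos - n_pos_keep)    # total number of samples kept

# Positives sit at the end, so dropping the tail removes only positives.
X_demo_reduced = X_demo[:n_keep]
y_demo_reduced = y_demo[:n_keep]

print(np.bincount(y_demo_reduced))  # class counts after the reduction
```

With 10 negatives and 10 positives, removing 90% of the positives leaves 10 negatives and 1 positive. Checking the class counts with `np.bincount` is a quick way to confirm that the operation did what was intended.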


In [None]:
# STUDENT

Now, carry out the k-nearest-neighbours algorithm again on this new data.

In [None]:
# STUDENT

How would you interpret these results if you had to explain them to someone?

ANSWER: we obtain a good precision, so when we predict positive (class 1) we are usually right. However, the recall is poor, so we miss quite a few positive points. This is to be expected: since we removed many positive points and the two clusters are close together, we can only detect the positive points that are furthest from the negative points. Those that lie in the overlap region are surrounded mostly by negative neighbours, so they are classified as negative.
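
As a point of comparison when interpreting these numbers, it can help to look at a trivial baseline. The sketch below assumes the class counts implied by the text (1000 negatives and 100 remaining positives) and a hypothetical classifier that always predicts the negative class; no model is trained.

```python
import numpy as np

# Class counts assumed from the text: 1000 negatives, and 10% of the
# original 1000 positives kept.
y_true = np.concatenate([np.zeros(1000, dtype=int), np.ones(100, dtype=int)])

# Hypothetical baseline that ignores the features and always predicts negative.
y_pred = np.zeros_like(y_true)

accuracy = np.mean(y_pred == y_true)
recall = np.sum((y_pred == 1) & (y_true == 1)) / np.sum(y_true == 1)
print(f"accuracy = {accuracy:.3f}, recall = {recall:.3f}")
```

This baseline reaches an accuracy of about 0.909 while finding none of the positive points (recall 0). On unbalanced data, a high accuracy is therefore not informative on its own, which is why precision and recall are examined separately.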

In [None]:
# STUDENT