# Lab 2: Classification and regression

We will again use the dataset "breast cancer". 
The task is to classify every instance (i.e. the attributes corresponding to one patient) into one of the two classes: `no-recurrence-events` or `recurrence-events`.

We will experiment with several classification methods and use the python package "scikit-learn" (sklearn).

## Data preparation

In [None]:
import pandas as pd

bc = pd.read_csv("breast-cancer/breast-cancer.csv", skipinitialspace=True)
bc

### Question 1

Separate the attributes (that we give as input to the training algorithm) from the labels (that we will use at the output for training and evaluation).

In [None]:
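labels = bc.iloc[:,0]         # first column: the class label
data = bc.iloc[:,1:].copy()   # remaining columns: the attributes (copied so they can be modified safely)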

## Encoding categorical features

For the majority of classification algorithms, categorical variables must be encoded in numerical format.

(Theoretically, this is not necessary for a decision tree. But for the implementation in scikit-learn, all attributes must be numerical.)

Here is a web page with examples for pandas:
https://machinelearningtutorials.org/pandas-encoding-categorical-features-with-examples/

Note that this page does not mention the function `factorize()`, which can be used for nominal (not ordinal) variables.

Here is the corresponding scikit-learn documentation:
https://scikit-learn.org/stable/modules/preprocessing.html#encoding-categorical-features

For example, for the "age" attribute (ordinal variable):

In [None]:
age_order = ['20-29', '30-39', '40-49', '50-59', '60-69', '70-79']
data['age'] = data['age'].apply(lambda x: age_order.index(x))

And for the nominal variable "menopause":

In [None]:
data['menopause'] = data.menopause.factorize()[0]
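
To see what `factorize()` returns, here is a minimal sketch on a toy Series (the values are made up for illustration and are not read from the lab's CSV file). The call returns a pair: an array of integer codes and the list of distinct values. Codes are assigned in order of first appearance, so they carry no meaningful order, which is why this encoding suits nominal variables only.

```python
import pandas as pd

# Toy column (hypothetical values, not read from the lab's CSV file)
s = pd.Series(["premeno", "ge40", "premeno", "lt40", "ge40"])

codes, uniques = s.factorize()
print(codes)          # integer code for each row, in order of first appearance
print(list(uniques))  # the category that each code stands for
```

The `[0]` used in the cell above keeps only the codes and discards the list of distinct values.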

### Question 2

Replace all columns containing categorical attributes with their numerical version (including the "labels").

To display all the distinct values of a column:

In [None]:
data['tumor-size'].unique()

In [None]:
# COR
ts_order = ['0-4', '5-9', '10-14', '15-19', '20-24', '25-29', '30-34', '35-39', 
       '40-44', '45-49', '50-54']
data['tumor-size'] = data['tumor-size'].apply(lambda x: ts_order.index(x))
in_order = ['0-2', '3-5', '6-8', '9-11', '12-14', '15-17', '24-26']
data['inv-nodes'] = data['inv-nodes'].apply(lambda x: in_order.index(x))
data['node-caps'] = data['node-caps'].factorize()[0]
data['breast'] = data['breast'].factorize()[0]
data['breat-quad'] = data['breat-quad'].factorize()[0]
data['irradiat'] = data['irradiat'].factorize()[0]

In [None]:
# COR
labels = labels.factorize()[0]

We can visualise the data with the function "scatter_matrix" of pandas (see code below).
We can also produce another representation with the function "parallel_coordinates".

### Question 3

What conclusions can you draw?

In [None]:
import matplotlib.pyplot as plt
#plt.figure(figsize=(15,15))
plt.rcParams['figure.dpi'] = 200
plt.rcParams.update({'font.size': 4})
sm = pd.plotting.scatter_matrix(data, c=labels)
plt.show()

In [None]:
data_and_labels = pd.concat((data, pd.Series(labels, name="class")), axis=1) # recreate dataframe with labels
pd.plotting.parallel_coordinates(data_and_labels, "class", color=("r", "b"))

In [None]:
# COR

# There is no single attribute or pair of attributes that would allow a simple separation of the two classes.
# There are attributes that seem to be more discriminant.


## Separation of evaluation data

In the following, we will apply Machine Learning methods to our dataset, and we would like to be able to apply them to new data. 
To evaluate if our classification model gives correct predictions for examples that it has never seen before, we will put aside a part of our data that we will not use to construct the model.

### Question 4
Split your DataFrame into two random parts: one for constructing the model ("train"), 80%, and the other for evaluating it ("test"), 20%.
Separate also the corresponding labels. You can use the function `sklearn.model_selection.train_test_split()`.

In [None]:
# COR
from sklearn.model_selection import train_test_split
train, test, train_labels, test_labels = train_test_split(data, labels, stratify=labels, random_state=1, train_size=0.8)

## k-NN : k Nearest Neighbours

The k-NN method is one of the most intuitive classification algorithms. 
To classify a new example, it will determine the $k$ nearest neighbours of that example in the training set and assign the label of the majority of these $k$ neighbours.
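
The rule described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function name `knn_predict`, the toy arrays and the choice of Euclidean distance are ours, and scikit-learn's `KNeighborsClassifier` is the implementation to use in the lab.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Predict the label of x_new by majority vote among its k nearest training points."""
    # Euclidean distance from x_new to every training example
    dists = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(dists)[:k]
    # Majority label among those neighbours
    return Counter(y_train[nearest].tolist()).most_common(1)[0][0]

# Toy data (hypothetical): two points of class 0, three points of class 1
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1], [1.2, 0.8]])
y_toy = np.array([0, 0, 1, 1, 1])
print(knn_predict(X_toy, y_toy, np.array([0.95, 1.0]), k=3))
```

Because the prediction depends on distances, attributes with a large numerical range can dominate the vote, which is one reason to look at normalisation later in this lab.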

### Question 5
Get familiar with the k-NN algorithm and implement it with the class `KNeighborsClassifier` of "sklearn".
Train the model on the train set (`fit()`) and apply it (`predict()`) on the train and on the test set.

Documentation:
https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html


In [None]:
from sklearn.neighbors import KNeighborsClassifier

In [None]:
# COR

knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(train, train_labels)
pred = knn.predict(test)



## Evaluation of the model


In [None]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import ConfusionMatrixDisplay
plt.rcParams['figure.dpi'] = 96
plt.rcParams.update({'font.size': 12})

### Question 6
We will now evaluate the performance of our model on the test set.
To this end, we will compute the confusion matrix using the function `confusion_matrix(labels, predicted_labels)`.

Interpret this matrix and normalise it to percentages/proportions using the parameter: `normalize : {'true', 'pred', 'all'}`.
Interpret the output of the different normalisations (the usual choice is "true").

What is the percentage of correct prediction (accuracy)?
You can then use the class `ConfusionMatrixDisplay` to display a graphical representation.
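
To make the three normalisations concrete, here is a small sketch on toy labels (the values of `y_true` and `y_pred` are hypothetical). It shows which axis sums to 1 in each case and that the diagonal sum equals the accuracy only when the matrix is normalised over all entries.

```python
from sklearn.metrics import confusion_matrix

# Toy labels (hypothetical): 4 examples of class 0, 2 examples of class 1
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]

print(confusion_matrix(y_true, y_pred))                        # raw counts
cm_true = confusion_matrix(y_true, y_pred, normalize="true")   # each row sums to 1
cm_pred = confusion_matrix(y_true, y_pred, normalize="pred")   # each column sums to 1
cm_all = confusion_matrix(y_true, y_pred, normalize="all")     # all entries sum to 1
print(cm_true)
print(cm_pred)
print(cm_all)

# Only with normalize="all" does the diagonal sum give the accuracy
accuracy = cm_all.trace()
print(accuracy)
```

With `normalize="true"` the diagonal holds the per-class recall, and with `normalize="pred"` it holds the per-class precision.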


In [None]:
# COR
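cm = confusion_matrix(test_labels, pred, normalize='true')
print(cm)

# accuracy := (TP + TN) / total, i.e. the diagonal sum of the matrix normalised over all entries
# (with normalize='true' the diagonal holds the per-class recalls, which do not sum to the accuracy)
cm_all = confusion_matrix(test_labels, pred, normalize='all')
print(f"Accuracy: {cm_all[0,0]+cm_all[1,1]}")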

In [None]:
# COR
cmd = ConfusionMatrixDisplay(cm)
cmd.plot()

# for more recent versions of sklearn:
# ConfusionMatrixDisplay.from_predictions(test_labels, pred)
# plt.show()

## Preprocessing

### Question 7
Many Machine Learning methods perform better with normalised training data.
Use one (or several) of the classes in the package "sklearn.preprocessing" (for example "StandardScaler") to normalise the training data and compare the obtained model with the one without normalisation.

Note: the test set must be normalised with the same parameters as the training set!
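
As an illustration of this note, the sketch below uses a toy one-feature dataset (hypothetical values) to show that `StandardScaler` learns its mean and standard deviation in `fit()` and then reuses them unchanged in `transform()`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data (hypothetical values): one feature, separate train and test parts
X_train = np.array([[1.0], [2.0], [3.0]])
X_test = np.array([[4.0]])

scaler = StandardScaler().fit(X_train)   # mean and std come from the training data only
print(scaler.mean_, scaler.scale_)

# The test value is standardised with the *training* mean and std
manual = (X_test - scaler.mean_) / scaler.scale_
print(np.allclose(manual, scaler.transform(X_test)))
```

Calling `fit_transform()` on the test set instead would compute new statistics from the test data, so the two sets would no longer be on the same scale.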

In [None]:
# COR

from sklearn.preprocessing import StandardScaler, MinMaxScaler
scaler = StandardScaler()
train_ = scaler.fit_transform(train)
test_ = scaler.transform(test)

In [None]:
# COR 
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(train_, train_labels)
pred = knn.predict(test_)


In [None]:
knn.score(test_, test_labels)

## Pipelines

To simplify the processing and to prevent the test data from being used during training ("data leakage"), we will combine the preprocessing and training in a so-called "pipeline".

In [None]:
# COR

from sklearn.pipeline import make_pipeline
from sklearn.metrics import accuracy_score
pipe = make_pipeline(
    StandardScaler(),
    KNeighborsClassifier(n_neighbors=1)
)
pipe.fit(train, train_labels)
pred = pipe.predict(test)
accuracy_score(test_labels, pred)   # signature: accuracy_score(y_true, y_pred)

## Other classification models

## Decision trees

We will construct our first "real" model: a decision tree using the class 'DecisionTreeClassifier' of sklearn.

Decision trees are non-parametric models used in supervised learning for classification and regression.
In supervised learning, we observe several discriminant variables and one or several target variables. 
The goal is to create a model that predicts the target variables using decision rules derived from the data through training.

Decision trees thus describe how to partition a population into separate groups that are distinct from each other but homogeneous according to a set of discriminant variables and a given objective function.
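
One common way to measure how homogeneous a group is uses the Gini impurity, which is the default `criterion` of scikit-learn's `DecisionTreeClassifier`. The sketch below is a simplified illustration only: the helper names `gini` and `split_impurity` and the toy label lists are our own, and the library's actual tree construction is more involved.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 0 for a pure group, 0.5 for a balanced two-class group."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def split_impurity(left, right):
    """Size-weighted average impurity of the two groups produced by a split."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

parent = [0, 0, 0, 1, 1, 1]
print(gini(parent))                           # impurity before splitting
print(split_impurity([0, 0, 0], [1, 1, 1]))   # a split that separates the classes
print(split_impurity([0, 0, 1], [0, 1, 1]))   # a split that barely helps
```

During training, a tree compares candidate splits in this spirit and keeps the one that reduces the impurity the most.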


### Question 8
Create an instance of the class `DecisionTreeClassifier` and construct the model with the method "fit()" and the training dataset. 

In [None]:
# COR

from sklearn.tree import DecisionTreeClassifier
tree = DecisionTreeClassifier()
tree = tree.fit(train, train_labels)

### Question 9
Visualise the tree with the method "plot_tree" of "sklearn.tree" and get familiar with the general model and its construction.


In [None]:
# COR

# import plot_tree
from sklearn.tree import plot_tree

plot_tree(tree)
plt.show()


### Question 10
Make a prediction with this model for the whole test set and evaluate the accuracy of the model.

In [None]:
# COR

# prediction
pred = tree.predict(test)

In [None]:
# COR

tree.score(test, test_labels)

## Logistic regression

### Question 11
Train a logistic regression model on our dataset. Use a pipeline.

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
# COR

pipe_logreg = make_pipeline(
    StandardScaler(),
    LogisticRegression(random_state=0)
)
pipe_logreg.fit(train, train_labels)
pred = pipe_logreg.predict(test)
accuracy_score(test_labels, pred)   # signature: accuracy_score(y_true, y_pred)

## SVM

### Question 12
Train an SVM model on our dataset. Use a pipeline.

In [None]:
from sklearn.svm import LinearSVC

In [None]:
# COR

pipe_svm = make_pipeline(
    StandardScaler(),
    LinearSVC()
)
pipe_svm.fit(train, train_labels)
pred = pipe_svm.predict(test)
accuracy_score(test_labels, pred)   # signature: accuracy_score(y_true, y_pred)

## Cross-validation

The evaluation result depends on the random division into "train" and "test".
To reduce this bias, we can perform a cross-validation: the dataset is divided into K parts (folds) of equal size, where K-1 parts are used for training and the remaining part for evaluation.
This is repeated K times (k-fold cross-validation) where each of the K parts in turn is used for evaluation.
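
The rotation of the folds can be made visible with the `KFold` splitter, here on ten toy examples (hypothetical data). Note that when `cv` is given as an integer and the estimator is a classifier, scikit-learn uses the stratified variant `StratifiedKFold`, which additionally keeps the class proportions similar in every fold; plain `KFold` is used below only to keep the illustration simple.

```python
import numpy as np
from sklearn.model_selection import KFold

X_toy = np.arange(10).reshape(-1, 1)   # 10 toy examples (hypothetical data)

kf = KFold(n_splits=5)
folds = list(kf.split(X_toy))
for i, (train_idx, test_idx) in enumerate(folds):
    # Each example appears in the evaluation part of exactly one fold
    print(f"fold {i}: train={train_idx}, test={test_idx}")
```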

In [None]:
from sklearn.model_selection import cross_validate
result = cross_validate(pipe, data, labels, cv=10)   # 10-fold cross-validation

In [None]:
result

In [None]:
result['test_score'].mean()

## Optimising hyper-parameters

A hyper-parameter of a model is a parameter that controls, for example, its operation, its architecture or the training algorithm (a sort of "meta-parameter"), as opposed to the parameters of the model itself that are optimised during training.

### Question 13
Vary the parameter $k$ of the k-NN algorithm and evaluate the performance.
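
One possible way to approach this is a simple loop over candidate values of $k$, scoring each one by cross-validation. The sketch below runs on synthetic stand-in data generated with `make_classification` (an assumption made so that the example is self-contained); in the lab you would use `train` and `train_labels` instead. The list of candidate values is arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption); in the lab, use `train` and `train_labels`
X, y = make_classification(n_samples=200, n_features=9, random_state=0)

scores = {}
for k in (1, 3, 5, 9, 15):
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    # Mean accuracy over 5 cross-validation folds
    scores[k] = cross_val_score(model, X, y, cv=5).mean()
    print(k, round(scores[k], 3))

best_k = max(scores, key=scores.get)
print("best k:", best_k)
```

Selecting $k$ by cross-validation on the training data keeps the test set untouched for the final evaluation.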


### Automatic Search
The manual search for training hyper-parameters can be very tedious, especially if there are many and if the values are continuous.

In sklearn, there are several methods to automate this search.
We will experiment with two of them: 'GridSearchCV' and 'RandomizedSearchCV'.

In [None]:
pipe.get_params().keys()

In [None]:
from sklearn.model_selection import GridSearchCV
parameters = {'kneighborsclassifier__n_neighbors': range(1,25)}
gs = GridSearchCV(pipe, parameters)
gs.fit(train, train_labels)

In [None]:
gs.best_params_

In [None]:
gs.score(test, test_labels)

In [None]:
from sklearn.model_selection import RandomizedSearchCV

parameters = {'criterion': ['gini', 'entropy', 'log_loss'], 
              'max_depth': range(1,20),
              'min_samples_split': range(2,20)}
rs = RandomizedSearchCV(tree, parameters, n_iter=100)
rs.fit(train, train_labels)

In [None]:
rs.best_params_

In [None]:
rs.score(test, test_labels)

### Question 14
Optimise the hyper-parameters of the classifiers: `LogisticRegression` and `LinearSVC`.

In [None]:
from scipy.stats import uniform   # use distributions for continuous-valued hyper-parameters
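
As a starting point, here is a hedged sketch of a randomised search over the regularisation strength `C` of `LogisticRegression`, sampled from a continuous `uniform` distribution. The data are synthetic (an assumption, so that the example is self-contained), and the range of `C`, `n_iter` and `max_iter` are illustrative choices rather than recommended values. The same pattern should carry over to `LinearSVC`, whose step name would be `linearsvc`.

```python
from scipy.stats import uniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption); in the lab, use `train` and `train_labels`
X, y = make_classification(n_samples=200, n_features=9, random_state=0)

pipe_lr = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# make_pipeline names each step after its lowercased class name;
# uniform(loc, scale) samples from the interval [loc, loc + scale]
param_dist = {"logisticregression__C": uniform(loc=0.01, scale=10)}

search = RandomizedSearchCV(pipe_lr, param_dist, n_iter=20, cv=5, random_state=0)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```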

## Ensemble methods

### Question 15
Train a Random Forest and a "Gradient Boosted Tree" model on our dataset.

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline

In [None]:
# COR

pipe_rf = make_pipeline(
    StandardScaler(),
    RandomForestClassifier()
)
pipe_rf.fit(train, train_labels)
pred = pipe_rf.predict(test)
accuracy_score(test_labels, pred)   # signature: accuracy_score(y_true, y_pred)

In [None]:
from sklearn.ensemble import GradientBoostingClassifier 
from sklearn.pipeline import make_pipeline

In [None]:
# COR

# use a separate name so that the Random Forest pipeline is not overwritten
pipe_gb = make_pipeline(
    StandardScaler(),
    #HistGradientBoostingClassifier()  # more efficient for larger datasets
    GradientBoostingClassifier()    # for smaller datasets
)
pipe_gb.fit(train, train_labels)
pred = pipe_gb.predict(test)
accuracy_score(test_labels, pred)

## Regression

We will now use (again) the CO2 measurement dataset for regression.

In [None]:
mlo = pd.read_csv("co2_mm_mlo2.csv", sep=r"\s+")   # whitespace-separated; delim_whitespace is deprecated in recent pandas
mlo1 = mlo[mlo["monthly_avg"]>0]
plt.plot(mlo1.time, mlo1.monthly_avg)

### Question 16
Compute a linear regression model on the first 600 values and apply this model for prediction on the remaining time points.

Plot the regression line on top of the observed data.

In [None]:
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np

In [None]:
# COR

# We use the first 600 samples for constructing the model (regression line)
train = mlo1.iloc[:600,:]['time'].to_frame()
train_labels = mlo1.iloc[:600,:]['monthly_avg']
# We use the rest of the data for evaluating the model (starting at position 600 so that no sample is skipped)
test = mlo1.iloc[600:,:]['time'].to_frame()
test_labels = mlo1.iloc[600:,:]['monthly_avg']

regr = linear_model.LinearRegression()
regr.fit(train, train_labels)

In [None]:
# COR

train_pred = regr.predict(train)
test_pred = regr.predict(test)


In [None]:
# COR

plt.plot(mlo1.time, mlo1.monthly_avg)
plt.plot(train, train_pred)
plt.plot(test, test_pred)

### Question 17
Output the regression coefficients and evaluate the model using the mean squared error and the R2 score.
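
For reference, the two metrics can be computed by hand. The sketch below uses toy values (hypothetical, not the CO2 data) and the standard definitions: the mean squared error is the average of the squared residuals, and the R2 score is one minus the ratio of the residual sum of squares to the total sum of squares around the mean of the observed values.

```python
import numpy as np

# Toy values (hypothetical), not the CO2 data
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 10.0])

mse = np.mean((y_true - y_pred) ** 2)
ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2 = 1.0 - ss_res / ss_tot
print(mse, r2)
```

An R2 score of 1 corresponds to a perfect prediction, 0 to a model that does no better than always predicting the mean, and negative values to a model that does worse than that.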

In [None]:
# COR

print("Coefficients: \n", regr.coef_, regr.intercept_)
# The mean squared error
print("Mean squared error: %.2f" % mean_squared_error(test_labels, test_pred))
# The coefficient of determination: 1 is perfect prediction
print("Coefficient of determination: %.2f" % r2_score(test_labels, test_pred))