# Session 2: Supervised learning (1/3)

## Preliminaries

Before using the algorithms on real-world datasets, we are going to try them on artificially generated datasets. We call these **toy datasets**. Use the appropriate magic command to load the script `datasets.py` (it contains functions to generate toy datasets).

## k-Nearest Neighbors: Classification

### Toy dataset

The dataset we are going to use is a set of points, each labeled either `0` or `1`. Use the appropriate command to look at the source code of the function `make_forge()` and use it to create a set of points `X` and a set of labels `y`. How many points have been generated?

In [None]:
# print source code of make_forge()

In [None]:
# create X and y. How many elements are in X by default?
# If needed, recreate X and y so you have 600 data points.

Now load the `matplotlib` library and use the right method to visualize a set of points on a 2D plane. Look at the documentation and use the appropriate argument so that points labeled `0` have a different color from points labeled `1`.

In [None]:
# plot the points X with matplotlib

### Learning a model

As we saw in the course, the first step is to split our dataset into a training part and a test part. Use the function `train_test_split()` to create four variables:
* points for training
* labels for training
* points for test
* labels for test

Use the parameter `random_state = 0` so the experiments can be replicated.
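
As a minimal sketch of how the split behaves, the example below uses `make_blobs` from scikit-learn as a stand-in for the toy dataset (an assumption; it is not the `make_forge()` function from `datasets.py`, and the `_demo` variable names are only illustrative). Note that `train_test_split()` returns the two sets of points first, then the two sets of labels.

```python
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split

# Stand-in toy data (assumption): 100 two-dimensional points in 2 clusters.
X_demo, y_demo = make_blobs(n_samples=100, centers=2, random_state=0)

# Return order: training points, test points, training labels, test labels.
X_train_demo, X_test_demo, y_train_demo, y_test_demo = train_test_split(
    X_demo, y_demo, random_state=0
)

# By default, 25% of the points are held out for testing.
print(X_train_demo.shape, X_test_demo.shape)
print(y_train_demo.shape, y_test_demo.shape)
```

Fixing `random_state` makes the shuffle deterministic, so running the cell again yields the same split.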

In [None]:
from sklearn.model_selection import train_test_split

Then, we can create a KNN model and specify the parameter `k` (the `n_neighbors` argument in scikit-learn). Create a model with `k = 3`.

In [None]:
from sklearn.neighbors import KNeighborsClassifier

# TODO (create model)
# model = ...

Train the model on your training data (with the `.fit()` method) and evaluate its performance (with the `.score()` method) on the test data. What accuracy do you get?
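
The fit-then-score workflow can be sketched as follows, again on stand-in data generated with `make_blobs` (an assumption; the names `knn_demo`, `X_tr`, `X_te` and so on are illustrative and not part of the exercise). For a classifier, `.score()` returns the fraction of test points whose predicted label matches the true label, which the sketch also recomputes by hand.

```python
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data (assumption): two Gaussian clusters, not the forge dataset.
X_blobs, y_blobs = make_blobs(n_samples=200, centers=2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_blobs, y_blobs, random_state=0)

# k is passed through the n_neighbors argument.
knn_demo = KNeighborsClassifier(n_neighbors=3)
knn_demo.fit(X_tr, y_tr)  # for KNN, training amounts to storing the points

# Accuracy on the held-out test points.
test_accuracy = knn_demo.score(X_te, y_te)

# The same quantity computed by hand from the predictions.
manual_accuracy = (knn_demo.predict(X_te) == y_te).mean()

print("Test accuracy:", test_accuracy)
print("Computed by hand:", manual_accuracy)
```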

It is interesting to look at the decision boundary of our model (i.e. the line separating the region where points are labeled 0 from the region where they are labeled 1). Run the following piece of code to see it.

In [None]:
%run plots.py
plot_2d_separator(model, X, y, fill=True, eps=0.5, alpha=0.4)

Create other models with different values of `k` (use 1, 9 and 15). Train and evaluate each model. Which one performs best on the test data?

In [None]:
# TODO
# models with k = 1, 9, 15

Look at the decision boundary for each of these models. What can be said about the decision boundary when `k` is small? When `k` is large?

In [None]:
# decision boundary for each model

### Real dataset

Sklearn comes with some real-world datasets. One of them is the Wisconsin breast cancer dataset. It contains measurements of breast cancer tumors. Each tumor is either "benign" or "malignant" (so it is a binary classification problem). We are going to use KNN to predict whether a tumor is "benign" or "malignant".

In [None]:
from sklearn.datasets import load_breast_cancer
cancer = load_breast_cancer()
#print(cancer.DESCR) # uncomment for more information
print(cancer.keys())

This dataset contains 569 data points, each with 30 attributes (called features). The data can be accessed with `cancer.data` and the labels with `cancer.target`.

In [None]:
print(cancer.data.shape)
print(cancer.data[0])
print(cancer.target[0])

Separate the points into a training set and a test set with `random_state = 0`.

Create a KNN classifier with six neighbors and train it with the appropriate data.

The main objective of a classifier is to predict the label of points it has never seen. You can use the `.predict()` method of your classifier and feed it one or more data points. The result will be the label(s) predicted by your model.

In [None]:
model.predict([x_test[0]]) # replace model with the name of your model

Now compute the accuracy of your model on the entire test dataset.

## k-Nearest Neighbors: Regression

We can also do regression with the KNN algorithm. Instead of assigning the most frequent label among the `k` nearest neighbors, we average the target values of those neighbors. Hence we predict a value instead of a class.
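
The difference between the two rules can be sketched in plain Python on one-dimensional points. This is only an illustration under simplifying assumptions (a single feature, absolute difference as the distance, made-up data); the helper names `k_nearest`, `knn_classify` and `knn_regress` are hypothetical and are not scikit-learn functions.

```python
from collections import Counter

def k_nearest(train_x, query, k):
    """Return the indices of the k training points closest to query."""
    order = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - query))
    return order[:k]

def knn_classify(train_x, train_labels, query, k):
    """Classification: most frequent label among the k nearest neighbors."""
    idx = k_nearest(train_x, query, k)
    return Counter(train_labels[i] for i in idx).most_common(1)[0][0]

def knn_regress(train_x, train_values, query, k):
    """Regression: average target value of the k nearest neighbors."""
    idx = k_nearest(train_x, query, k)
    return sum(train_values[i] for i in idx) / k

# Made-up data: five points, each with a class label and a numeric target.
xs = [1.0, 2.0, 3.0, 10.0, 11.0]
labels = [0, 0, 1, 1, 1]
values = [1.5, 2.5, 3.5, 9.0, 12.0]

# The 3 nearest neighbors of 2.2 are the points 2.0, 3.0 and 1.0.
predicted_label = knn_classify(xs, labels, 2.2, k=3)   # labels 0, 1, 0
predicted_value = knn_regress(xs, values, 2.2, k=3)    # mean of 2.5, 3.5, 1.5

print(predicted_label, predicted_value)
```

The neighbor search is identical in both cases; only the way the neighbors' outputs are combined changes.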

Use the `make_wave()` function to create a toy dataset of `40` points for regression.

We can visualize the points with the following piece of code.

In [None]:
plt.scatter(X, y)
plt.xticks(X, "")

Separate the dataset into a training part and a test part with `random_state = 0`. Then create at least three KNN regression models with different numbers of neighbors. Train and evaluate them. What is your best score? Note that for a regressor, `.score()` returns the coefficient of determination $R^2$ rather than an accuracy.

In [None]:
from sklearn.neighbors import KNeighborsRegressor

# create train + test data
# ...

# create regression models, train and evaluate
# ...
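
For scikit-learn regressors, `.score()` reports the coefficient of determination $R^2$ instead of an accuracy. As a rough sketch of what that number measures, the hypothetical helper below (`r2_manual`, not a library function) computes it by hand on made-up values: a perfect prediction gives 1, and always predicting the mean of the true values gives 0.

```python
def r2_manual(y_true, y_pred):
    """R^2 = 1 - (sum of squared residuals) / (total sum of squares)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

y_true_demo = [1.0, 2.0, 3.0, 4.0]

r2_perfect = r2_manual(y_true_demo, [1.0, 2.0, 3.0, 4.0])   # exact predictions
r2_mean = r2_manual(y_true_demo, [2.5, 2.5, 2.5, 2.5])      # constant mean

print(r2_perfect, r2_mean)
```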

## Linear models

Linear models are mostly used to do regression (predicting a value given a set of features). You can use a linear model to do classification but we will focus on regression in this course. The predicted value $\hat{y}$ can be written as follows:

\begin{equation*}
\hat{y} = \sum_{k=1}^n w_k \times x_k + b
\end{equation*}

where the $x_k$ are the $n$ features of a data point, and the $w_k$ and $b$ are the parameters learned by the linear model.
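
The formula can be evaluated by hand in a few lines. The sketch below uses made-up weights and features (assumptions for illustration only); `linear_predict` is a hypothetical helper name.

```python
def linear_predict(x, w, b):
    """Weighted sum of the features plus the intercept b."""
    return sum(w_k * x_k for w_k, x_k in zip(w, x)) + b

# One data point with n = 3 features, and made-up parameters.
x_point = [1.0, 2.0, 3.0]
w_demo = [0.5, -1.0, 2.0]
b_demo = 0.25

# 0.5*1.0 + (-1.0)*2.0 + 2.0*3.0 + 0.25
y_hat = linear_predict(x_point, w_demo, b_demo)
print(y_hat)
```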

### Ordinary Least Squares

Ordinary Least Squares is the classic linear method for regression. This model finds the parameters $w$ and $b$ that minimize the **mean squared error (MSE)** between the predictions and the true values over the $m$ points of the training dataset.

\begin{equation*}
MSE = {1 \over {m}} \sum_{k=1}^m (\hat{y}_k - y_k)^2
\end{equation*}
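
As a small worked example of this formula, the sketch below computes the MSE on three made-up predictions (the values and the helper name `mean_squared_error_manual` are assumptions for illustration).

```python
def mean_squared_error_manual(y_pred, y_true):
    """Average of the squared differences between predictions and targets."""
    m = len(y_true)
    return sum((p - t) ** 2 for p, t in zip(y_pred, y_true)) / m

y_pred_demo = [2.5, 0.0, 2.0]
y_true_mse = [3.0, -0.5, 2.0]

# Squared errors are 0.25, 0.25 and 0.0, so the MSE is 0.5 / 3.
mse_demo = mean_squared_error_manual(y_pred_demo, y_true_mse)
print(mse_demo)
```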

Generate a toy dataset for regression with the function `make_wave()` composed of `180` data points. Then split this dataset into a training and a testing dataset with `random_state = 0`.

Then we can create a linear model and train it on the right dataset.

In [None]:
from sklearn.linear_model import LinearRegression
# create model and train it
# model = ...

The learned $w$ are in the `coef_` attribute while the learned $b$ is in the `intercept_` attribute. Since our data only has one feature, we only have one $w$.

In [None]:
print("Learned w:", model.coef_)
print("Learned b:", model.intercept_)
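
To see how these attributes relate to the data, here is a minimal sketch on noise-free points lying exactly on the line $y = 2x + 1$ (made-up data, not the `make_wave()` dataset; `line_model` is an illustrative name). The fitted model should recover a slope close to 2 and an intercept close to 1.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data: one feature per point, targets exactly on y = 2x + 1.
X_line = np.array([[0.0], [1.0], [2.0], [3.0]])
y_line = 2.0 * X_line[:, 0] + 1.0

line_model = LinearRegression().fit(X_line, y_line)

# coef_ holds one weight per feature; intercept_ is a single number here.
print("Learned w:", line_model.coef_)
print("Learned b:", line_model.intercept_)
```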

In the same way as before, we can compute the estimated output with the `predict()` method.

In [None]:
print("Model prediction =", model.predict([x_test[0]]))
print("Hand computed prediction =", model.coef_[0] * x_test[0] + model.intercept_)
print("Correct output =", y_test[0])

### Real dataset

We are going to predict the price of houses given some features. The data come from the housing market in Boston. We have 506 data points, and each one has 104 features.

In [None]:
X, y = load_extended_boston()
print(X.shape)
print(y.shape)
print(y[:3]) # some house prices

In [None]:
# separate the data into a training set and a testing set with
# random_state = 0. Then train a linear model and predict the
# price of the first house in the test set. Compare it with the 
# actual price of the house.

In [None]:
# we can also compute the score of the model. Compare
# the score obtained on the training data and the score
# on the test data. 
# Do you think we are underfitting or overfitting? Explain why.

### Regularization

Sometimes, the linear model can overfit. This means that it performs well on the training set, but not on the test set. One way to control overfitting is to add regularization to our model, i.e. to add a penalty term to the objective being minimized by the model.

We will use L2 regularization, which penalizes the squared L2 norm of the weights $w$ of the model. This type of model is called **Ridge regression** and it minimizes:

\begin{equation*}
MSE + Regularization = {1 \over {m}} \sum_{k=1}^m (\hat{y}_k - y_k)^2 + \lambda \left\lVert w \right\rVert_2^2
\end{equation*}

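$\lambda$ is a parameter that adjusts the strength of the regularization: the larger $\lambda$ is, the more the weights are pushed toward zero. In scikit-learn's `Ridge`, this parameter is called `alpha`.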
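
The shrinking effect of $\lambda$ can be sketched on a deliberately simplified case: one feature and no intercept ($b = 0$), both assumptions made only to keep the algebra short. Setting the derivative of the objective to zero gives $w = \sum_k x_k y_k / (\sum_k x_k^2 + m \lambda)$. The helper `ridge_weight_1d` is hypothetical and does not reproduce how scikit-learn solves the problem.

```python
def ridge_weight_1d(xs, ys, lam):
    """Minimizer of (1/m) * sum((w*x - y)^2) + lam * w^2, with no intercept."""
    m = len(xs)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    return sum_xy / (sum_xx + m * lam)

# Made-up data lying exactly on y = 2x.
xs_ridge = [1.0, 2.0, 3.0]
ys_ridge = [2.0, 4.0, 6.0]

# lam = 0 recovers the least-squares slope; larger lam shrinks it toward 0.
lambdas = [0.0, 0.1, 1.0, 10.0]
weights = [ridge_weight_1d(xs_ridge, ys_ridge, lam) for lam in lambdas]

for lam, w in zip(lambdas, weights):
    print("lambda =", lam, "-> w =", w)
```

With $\lambda = 0$ the weight equals the least-squares slope; as $\lambda$ grows, the weight moves toward zero, which trades some training fit for a simpler model.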

In [None]:
from sklearn.linear_model import Ridge

# create a Ridge model, train it on the same housing
# training set and evaluate its training score and
# test score. Do you see any improvement?
# Is it better than the model with no regularization?

In [None]:
# try several Ridge() models with different values of
# the alpha parameter (read the documentation if necessary).
# Then compute the training and test scores for each model.
# How does alpha influence the scores?