# Session 5 : Unsupervised learning

## Preprocessing data

We saw that PCA can be used as a pre-processing step before using supervised learning algorithms in order to improve accuracy or training speed. But sometimes, a simple pre-processing step like normalizing the data can bring a huge improvement.

### No normalization

Before seeing any improvement, we need a **baseline** (to know whether we improve or degrade accuracy). We are going to train a classification SVM (with a non-linear kernel). Do the following operations :
* load the breast cancer dataset
* separate it into a training and a test set
* create an SVC model, with C=100
* train your model, print its accuracy

I know that you already did it in the last session, but try to see if you can do it again on your own, without relying on a provided code snippet or looking at the solution.

### Using built-in normalizers

Let's see what our data looks like before normalization. For each feature, print its minimum and maximum value across all examples in the training set.

Now we can use a normalizer, to make sure that each feature has a value between 0 and 1.

In [None]:
from sklearn.preprocessing import MinMaxScaler

# look at the documentation of MinMaxScaler to find usage
# examples. Then create a new variable X_train_scaled that
# is the rescaled X_train. Print again the maximum and
# minimum values in X_train_scaled for each feature. What is
# the difference ?

Use the same scaler (without modifying it) to also rescale the `X_test` variable.
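To illustrate what "the same scaler" means, here is a minimal sketch on made-up toy arrays (`X_train_toy` and `X_test_toy` are hypothetical names, not the breast cancer data). The scaler learns its minimum and maximum from the training rows only and reuses them on the test rows, so scaled test values can fall outside the range 0 to 1.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy data (made up for illustration): 3 training rows, 1 test row, 2 features
X_train_toy = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_test_toy = np.array([[4.0, 15.0]])

scaler = MinMaxScaler()
# fit_transform learns the per-feature min and max from the training data only
X_train_toy_scaled = scaler.fit_transform(X_train_toy)
# transform reuses those training statistics; it does not refit the scaler
X_test_toy_scaled = scaler.transform(X_test_toy)

print(X_train_toy_scaled.min(axis=0), X_train_toy_scaled.max(axis=0))
print(X_test_toy_scaled)
```

In this toy case the first test feature (4.0) is larger than every training value, so its scaled value is above 1. Calling `fit` again on the test set would hide this and make the two sets inconsistent.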

Train an SVC model again, but this time on the scaled data. Do you see any improvement ?

Several other types of scaler are implemented in scikit-learn :
* MaxAbsScaler
* RobustScaler
* StandardScaler
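One possible way to compare the scalers listed above is to apply each of them to a small made-up matrix and look at per-feature statistics. The matrix `X_toy` and the dictionary `stats` are hypothetical names used only for this sketch; the same loop could be run on `X_train`.

```python
import numpy as np
from sklearn.preprocessing import (MaxAbsScaler, MinMaxScaler,
                                   RobustScaler, StandardScaler)

# Toy matrix (made up): one feature with an outlier, one with negative values
X_toy = np.array([[1.0, -2.0], [2.0, 0.0], [3.0, 2.0], [100.0, 4.0]])

stats = {}
for scaler in (MinMaxScaler(), MaxAbsScaler(), RobustScaler(), StandardScaler()):
    X_s = scaler.fit_transform(X_toy)
    name = type(scaler).__name__
    # Per-feature summary of the scaled data
    stats[name] = {"min": X_s.min(axis=0), "max": X_s.max(axis=0),
                   "mean": X_s.mean(axis=0), "median": np.median(X_s, axis=0)}
    print(name, stats[name])
```

On such a matrix, `StandardScaler` centers each feature on a mean of 0, `RobustScaler` centers it on a median of 0, and `MaxAbsScaler` keeps the sign of the original values.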

In [None]:
# use each type of scaler and train an SVC model for each one.
# Which one is the best ?

In [None]:
# also look at the features of your scaled dataset to see
# the differences between the scalers (do some of them
# keep negative values ? do they change each feature's
# mean/median ?)

## Principal component analysis

### PCA for 2D visualization

We are going to use the PCA algorithm on the breast cancer dataset. This dataset has 30 features, so we cannot visualize it directly.

In [None]:
# if you didn't do it before, scale the breast cancer
# dataset with a StandardScaler(). We do not need to have a
# training and test set, because we want to visualize the 
# entire dataset, so you can apply the scaler on the entire
# dataset.

In [None]:
# now we can use PCA
from sklearn.decomposition import PCA

# use the documentation of PCA to create a model that will
# only keep 2 components

# then call the .fit() method of PCA on your scaled data

In [None]:
# we can now transform the dataset with 30 features into
# a dataset with only 2 features
X_pca = pca.transform(X_scaled)
print(X_scaled.shape)
print(X_pca.shape)

In [None]:
# now plot the new dataset (the one with only 2 features)
# on a 2D plane. Use a different color for the two classes.

Hmm, interesting... It looks like our dataset is almost linearly separable. This means that a linear model (like SVM with linear kernel or logistic regression) could do quite well on this dataset. Let's see if that's the case.
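As a rough sketch of how this check could look, the snippet below trains a logistic regression on the 2 PCA features. It is only one possible approach: the variable names (`X_bc`, `X_pca_demo`, `linear_model`) are made up for this example, and the scaler is fitted on the whole dataset, as in the visualization cells above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X_bc, y_bc = load_breast_cancer(return_X_y=True)

# Scale the 30 features, then keep only the first 2 principal components
X_bc_scaled = StandardScaler().fit_transform(X_bc)
X_pca_demo = PCA(n_components=2).fit_transform(X_bc_scaled)

# Split the 2-feature dataset, then fit a linear classifier on it
X_tr, X_te, y_tr, y_te = train_test_split(X_pca_demo, y_bc, random_state=7)
linear_model = LogisticRegression().fit(X_tr, y_tr)
test_accuracy = linear_model.score(X_te, y_te)
print(f"test accuracy on 2 PCA features: {test_accuracy:.3f}")
```

Comparing this accuracy with the baseline obtained on all 30 features gives an idea of how much information the 2 components retain.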

In [None]:
# start by splitting the X_pca dataset into a training
# and a test set with random_state=7 (because we are going 
# to train on this new dataset composed of only 2 features)

### PCA for 3D visualization

PCA can reduce a dataset with $n$ features to a dataset with only 2 or 3 features. Matplotlib can also draw points in 3D, so we can project our data into a 3D space.
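To make the reduction less of a black box, here is a minimal NumPy sketch of the idea behind PCA: center the data, find the directions of largest variance with a singular value decomposition, and project onto the first few of them. The function `pca_project` and the random data are hypothetical; scikit-learn's `PCA` should give the same projection up to the sign of each component, but this sketch is not its implementation.

```python
import numpy as np

def pca_project(X, n_components):
    """Project X (n_samples, n_features) onto its first principal components."""
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, sorted by decreasing variance
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:n_components].T

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 30))   # made-up data with 30 features
X_demo_3d = pca_project(X_demo, 3)
print(X_demo.shape, X_demo_3d.shape)
```

The three resulting columns can be passed directly as the `x`, `y` and `z` arguments of a 3D scatter plot.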

In [None]:
# repeat the same process as before so you get a version
# of the breast cancer dataset that contains only
# 3 features.

Now we can visualize it in 3D.

In [None]:
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')

# use the function below, but replace the values x, y and z
# with the appropriate ones from your dataset that contains
# 3 features. Separate the 2 different classes with 2
# different colors.
#ax.scatter(x, y, z)

## Clustering with k-means algorithm

In this part, you are going to implement the k-means algorithm from scratch.

### Loading and visualizing the data

The data we are going to use is in `data_clustering.csv`.

In [None]:
# look at the first 10 rows of this file with the bash
# command head

It seems that we have 2 columns, named V1 and V2. Let's load it into 2 ndarrays : x and y.

In [None]:
import numpy as np

# read the file, skip the header line, and split each row on the comma
with open("data_clustering.csv") as f:
    lines = f.read().split()[1:]
x = np.array([line.split(',')[0] for line in lines],
             dtype=np.float32)
y = np.array([line.split(',')[1] for line in lines],
             dtype=np.float32)

In [None]:
# now plot the points (x, y) with matplotlib. Modify the 
# value of the argument s so that points are not too big.

In [None]:
# How many clusters do you think there are ?

### Distance function

The algorithm requires a distance (so we can compute which centroid is closest to each point). Implement a function `distance` that takes 2 arguments (two vectors as ndarray) and returns the distance between them. Hint : the distance between two vectors with $n$ components can be computed with 
\begin{equation}
d(u, v) = \sqrt{\sum_{i=1}^{n} (u_i - v_i)^2}
\end{equation}
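As a quick sanity check of the formula, here is one possible NumPy version of `distance`, tried on a 3-4-5 right triangle. It is a sketch rather than the expected solution; a plain Python loop over the components would work just as well.

```python
import numpy as np

def distance(u, v):
    """Euclidean distance between two vectors of the same length."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    # Square the component-wise differences, sum them, take the square root
    return np.sqrt(np.sum((u - v) ** 2))

# 3-4-5 right triangle: the distance from the origin to (3, 4) is 5
print(distance(np.array([0.0, 0.0]), np.array([3.0, 4.0])))
```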

### Algorithm : single step

Before implementing the complete algorithm, let's start with only one step. You need the following things before starting :
* define a constant variable K
* declare an empty array `clusters` that has the same size as x. We will put in each cell $i$ the cluster assigned to $x_i$
* create an array `centroids` where you will store the centroids

Then implement only one step from the algorithm described in the course (i.e. one iteration of the **while** loop).
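The sketch below shows what a single iteration could look like on six made-up points. It assumes the points are stored as one 2D array `points` (for the notebook data, something like `np.column_stack((x, y))`), that the initial centroids are drawn from the data, and that an empty cluster keeps its previous centroid. These are choices made for this example; the version described in the course may differ.

```python
import numpy as np

# Toy points (made up): two obvious groups in 2D
points = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                   [9.0, 9.0], [9.0, 10.0], [10.0, 9.0]])
K = 2

rng = np.random.default_rng(0)
# Initial centroids: K distinct points drawn at random from the data
centroids = points[rng.choice(len(points), size=K, replace=False)]
clusters = np.zeros(len(points), dtype=int)

# Assignment step: each point gets the index of its nearest centroid
for i, p in enumerate(points):
    dists = np.sqrt(np.sum((centroids - p) ** 2, axis=1))
    clusters[i] = np.argmin(dists)

# Update step: each centroid moves to the mean of its assigned points
new_centroids = centroids.copy()
for k in range(K):
    members = points[clusters == k]
    if len(members) > 0:  # an empty cluster keeps its previous centroid
        new_centroids[k] = members.mean(axis=0)

print(clusters)
print(new_centroids)
```

Plotting `points`, `centroids` and `new_centroids` together shows how far each centroid moved during this single step.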

After running one iteration, you can plot on the same graph :
* all points
* the first centroids chosen at random
* the new updated centroids

Do you see the beginning of an improvement ?

### Algorithm : multiple steps

The algorithm repeats steps like the one you just implemented until convergence. We can consider that the algorithm has converged when it no longer updates the centroids (i.e. the distance between old and updated centroids is 0 for each centroid). Implement the full algorithm.
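A possible shape for the full loop, using the convergence test described above, is sketched here. The function name `kmeans`, the `seed` argument and the `max_iter` safety cap are assumptions added for this example, and the two blobs are synthetic data, not the contents of the CSV file.

```python
import numpy as np

def kmeans(points, K, seed=0, max_iter=100):
    """Basic k-means sketch; returns (centroids, clusters)."""
    rng = np.random.default_rng(seed)
    # Initial centroids: K distinct points drawn at random from the data
    centroids = points[rng.choice(len(points), size=K, replace=False)].astype(float)
    for _ in range(max_iter):
        # Distance from every point to every centroid, shape (n_points, K)
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        clusters = dists.argmin(axis=1)
        new_centroids = centroids.copy()
        for k in range(K):
            members = points[clusters == k]
            if len(members) > 0:  # an empty cluster keeps its previous centroid
                new_centroids[k] = members.mean(axis=0)
        # Converged when no centroid moved during this iteration
        if np.all(new_centroids == centroids):
            break
        centroids = new_centroids
    return centroids, clusters

# Synthetic check: two well-separated blobs around (0, 0) and (10, 10)
rng = np.random.default_rng(1)
blob_a = rng.normal(loc=(0.0, 0.0), scale=0.5, size=(50, 2))
blob_b = rng.normal(loc=(10.0, 10.0), scale=0.5, size=(50, 2))
demo_points = np.vstack([blob_a, blob_b])
final_centroids, final_clusters = kmeans(demo_points, K=2)
print(final_centroids)
```

The `max_iter` cap is only a guard against a very long run; with the exact-equality test the loop normally stops on its own after a few iterations on data like this.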

Now, plot each centroid on a 2D plane, along with all the points assigned to its cluster. Use a different color for each group. Does the result look correct ? (i.e. as a human, how would you have grouped the points ?)

Run your algorithm again, but with a different value for K. What happens ? Do you think it is a good idea to use a high value for K (K > 10) when the number of clusters is small (< 5) ?