<div hidden=True>
    author: Marco Angius
    company: TomorrowData srl
    mail: marco.angius@tomorrowdata.io
    notebook-version: nov20-2.1
    
</div>

# Hands-on 2: Unsupervised Learning
### Version 2.1

This section is meant for learning the Scikit-Learn APIs and provides a playground for unsupervised machine learning tasks.

[Scikit-Learn](https://scikit-learn.org/stable/index.html#) is a library for data mining and data analysis. It includes models for classification, regression and clustering. It is built on top of NumPy, SciPy and matplotlib.

For the purpose of this playground, and to get familiar with the Scikit-Learn APIs, we will use the [Toy Datasets](https://scikit-learn.org/stable/datasets/index.html#toy-datasets) available in the library.

Datasets in `sklearn.datasets` return a *Bunch*:
> Dictionary-like object, the interesting attributes are: ‘data’, the data to learn, ‘target’, the regression targets, ‘DESCR’, the full description of the dataset, and ‘filename’, the physical location of boston csv dataset (added in version 0.20).
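
As a minimal sketch of what a *Bunch* looks like in practice, the snippet below loads the breast cancer dataset and inspects a few of the attributes mentioned above. The variable name `bunch` is only illustrative.

```python
from sklearn.datasets import load_breast_cancer

bunch = load_breast_cancer()

# A Bunch behaves like a dictionary whose keys are also available as attributes.
print(sorted(bunch.keys()))
print(bunch.data.shape)         # feature matrix: (n_samples, n_features)
print(bunch.target.shape)       # one label per sample
print(bunch.feature_names[:3])  # names of the first three features

# Dictionary-style and attribute-style access return the same object.
print(bunch["data"] is bunch.data)
```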

In this notebook we also use the `sklearn.datasets.make_blobs` and `sklearn.datasets.make_moons` functions, which help in generating synthetic data for unsupervised tasks. See the [Generated datasets](https://scikit-learn.org/stable/datasets/index.html#generated-datasets) section for more details.
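
A short sketch of the two generators follows. The variable names (`X_b`, `X_m`, ...) are illustrative and chosen so that they do not clash with the names used later in the notebook.

```python
from sklearn.datasets import make_blobs, make_moons

# Gaussian blobs: by default 100 samples, 2 features and 3 centers.
X_b, y_b = make_blobs(random_state=0)

# Two interleaving half circles; `noise` adds Gaussian jitter to the points.
X_m, y_m = make_moons(n_samples=200, noise=0.05, random_state=0)

print(X_b.shape, sorted(set(y_b)))
print(X_m.shape, sorted(set(y_m)))
```

Both functions return the generated points together with the label of the blob (or moon) each point was drawn from, which is handy for checking clustering results.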

### mglearn library 
To visualize the results obtained with our models we are going to use an existing library written by Andreas C. Müller (co-author of the book *Introduction to Machine Learning with Python*). The library is available in its [GitHub repository](https://github.com/amueller/mglearn).

In [None]:
!pip install mglearn==0.2

In [None]:
from sklearn.datasets import load_breast_cancer, make_blobs, fetch_lfw_people, make_moons
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler, Normalizer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.svm import SVC
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans, DBSCAN
from sklearn.neural_network import MLPClassifier, MLPRegressor
from sklearn.metrics import mean_squared_error, accuracy_score, r2_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import mglearn

In [None]:
R_STATE = 99

## Dataset load
The **breast cancer** and the **faces** datasets are used in this notebook for applying scaling and PCA transformations.

In [None]:
breast_ds = load_breast_cancer()
faces_ds = fetch_lfw_people(min_faces_per_person=20, resize=0.7)

<div class="alert alert-info" role="alert">
    
<img src="../resources/icons/list.png"  width="20" height="20" align="left"> &nbsp;  **Exercise 1**
- check the datasets. Use `print(ds.DESCR)` to print information for each of them.

</div>

## Scaling the data
Before introducing unsupervised models, we look at how data scaling affects supervised models.

`MinMaxScaler`, `StandardScaler`, `RobustScaler` and `Normalizer` are sklearn models and follow the same API convention that we have seen in the supervised section. This means you can use `fit()` to prepare the scaler and `transform()` to scale the data.
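
The `fit()` / `transform()` convention can be sketched on a tiny array. The data in `X_toy` is made up purely for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data (illustrative only): two features on very different scales.
X_toy = np.array([[1.0, 100.0],
                  [2.0, 300.0],
                  [3.0, 500.0]])

scaler = StandardScaler()
scaler.fit(X_toy)                   # learns the per-column mean and standard deviation
X_scaled = scaler.transform(X_toy)  # applies the learned statistics

print(scaler.mean_)                 # per-feature means learned by fit()
print(X_scaled.mean(axis=0))        # approximately 0 for each column
print(X_scaled.std(axis=0))         # approximately 1 for each column

# fit_transform() is a shortcut for fit() followed by transform() on the same data.
assert np.allclose(X_scaled, StandardScaler().fit_transform(X_toy))
```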

<div class="alert alert-info" role="alert">
    
<img src="../resources/icons/list.png"  width="20" height="20" align="left"> &nbsp;  **Exercise 2: Scaling Data**


Using the *breast cancer* dataset, apply the different scalers and check how the data has changed.
- MinMaxScaler: scales each feature into the provided range (`[0, 1]` by default)
- RobustScaler: scales the features based on the median and the quartiles
- StandardScaler: scales the features based on the mean and the variance (column-wise)
- Normalizer: normalizes each row so that it has unit norm (row-wise)

You can use the provided function to plot the data and visually see the differences.

**NOTE**: Before applying any transformation, select two features from the *breast cancer* dataset in order to visualize them.

</div>
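
To make the column-wise versus row-wise distinction concrete, here is a small sketch on made-up data (`X_toy` is illustrative and unrelated to the exercise data).

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, Normalizer

X_toy = np.array([[3.0, 4.0],
                  [6.0, 8.0],
                  [0.0, 10.0]])

# MinMaxScaler works per column: each feature is mapped to [0, 1].
X_minmax = MinMaxScaler().fit_transform(X_toy)

# Normalizer works per row: each sample is divided by its Euclidean (L2) norm.
X_norm = Normalizer().fit_transform(X_toy)

print(X_minmax.min(axis=0), X_minmax.max(axis=0))
print(np.linalg.norm(X_norm, axis=1))
print(X_norm[0], X_norm[1])
```

The first two rows point in the same direction, so after row-wise normalization they become identical, while the column-wise scaler keeps them distinct.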

In [None]:
X, y = breast_ds.data, breast_ds.target

In [None]:
scaled = ... # list of your scaled data with "MinMaxScaler", "RobustScaler", "StandardScaler", "Normalizer" 
scaling_features = breast_ds.feature_names
names = ["MinMaxScaler", "RobustScaler", "StandardScaler", "Normalizer"]

fig, axes = plt.subplots(4, 2, figsize=(20, 25))
for i, ax_row in enumerate(axes):
    ax_0 = ax_row[0]
    ax_1 = ax_row[1]
    
    ax_0.scatter(X[:, 0], X[:, 1], c=y)
    ax_1.scatter(scaled[i][:, 0], scaled[i][:, 1], c=y)
    ax_0.set_title("Original Data")
    ax_1.set_title(names[i])
    for ax in [ax_0, ax_1]:
        ax.set_xlabel(scaling_features[0])
        ax.set_ylabel(scaling_features[1])

<div class="alert alert-info" role="alert">
    
<img src="../resources/icons/list.png"  width="20" height="20" align="left"> &nbsp;  **Exercise 3: Observing model performances for different scalers**

For convenience, a split into training and test sets is already provided for the *breast cancer* dataset (do not focus on the details for now, this will be covered in the next hands-on sessions).

- `X_train`: features for the training set 
- `y_train`: labels for the training set
- `X_test`: features for the test set
- `y_test`: labels for the test set

    
The model for this experiment is a Support Vector Machine (SVM) for classification. For details about this model please refer to the [dedicated scikit-learn page on SVMs](https://scikit-learn.org/stable/modules/svm.html#svm-classification).
    
A function is provided to test the model performance on differently scaled versions of the same data. The function takes the training and test sets as tuples. Each tuple should contain features and labels.

<hr>

Tasks: 
    
1. Fit two or more scalers on the training set;
2. Transform both training and test sets;
3. Use the `compute_perf_SVC()` function for comparing the scaled data with the base model;
4. Print the results.

</div>

<div class="alert alert-warning" role="alert">
    
<img src="../resources/icons/book.png"  width="20" height="20" align="left"> &nbsp;  **Theory: Scaling Training and Test**


Usually, when training machine learning models, the data is split into two sets: one for training and one for testing. This leaves some data aside for assessing how well the model generalizes. When data scaling is applied, the scaler should be fit on the *training set* only. The same fitted scaler must then be used to transform the *test set*. If the test set is not scaled the same way as the training set, performance may be worse!


</div>
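
The rule above can be sketched as follows. The data is synthetic and the names (`X_demo`, `X_tr`, `X_te`, ...) are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

rng = np.random.RandomState(0)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 3))  # synthetic features
y_demo = rng.randint(0, 2, size=200)                      # synthetic labels

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

scaler = MinMaxScaler().fit(X_tr)      # statistics come from the training set only
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)   # the same min/max are reused on the test set

print(X_tr_scaled.min(axis=0), X_tr_scaled.max(axis=0))
print(X_te_scaled.min(axis=0), X_te_scaled.max(axis=0))
```

The training features span exactly `[0, 1]`, while the test features may fall slightly outside that range. This is expected: the test set is shifted and scaled by the same amounts as the training set.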

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

In [None]:
def compute_perf_SVC(training_set, test_set): 
    Xtr, ytr, Xte, yte = *training_set, *test_set
    svc = SVC(C=100, gamma="auto", random_state=11)
    svc.fit(Xtr, ytr)
    score = svc.score(Xte, yte)
    return score

In [None]:
minmaxs  = ...
robusts  = ...
stds     = ...
norms    = ...

In [None]:
# transform train
Xtr_mms  = ...
Xtr_rs   = ...
Xtr_ss   = ...
Xtr_ns   = ...

In [None]:
# transform test
Xte_mms  = ...
Xte_rs   = ...
Xte_ss   = ...
Xte_ns   = ...

<div class="alert alert-danger" role="alert">
    
<img src="../resources/icons/chemistry.png"  width="20" height="20" align="left"> &nbsp;  **Task 1: Observe Scaled Data**

Use `pandas.DataFrame()` to construct dataframes starting from different scaled versions of the *breast cancer* data. In order to access the feature names use the `breast_ds.feature_names` attribute. 

- Use the `df.describe()` method to compare the data and see the effect of the different scaling algorithms.

</div>

## Principal Component Analysis
In this section we are going to use PCA for both visualization and feature extraction.

<div class="alert alert-info" role="alert">
    
<img src="../resources/icons/list.png"  width="20" height="20" align="left"> &nbsp;  **Exercise 4: PCA - Visualization**

We can visualize the *breast cancer* features using its first two principal components.
 
- Check the original data dimension;
- Use `sklearn.decomposition.PCA(n_components)` for selecting the number of principal components and apply PCA on the *breast cancer* features;
- Use the provided function `plot_projected_data(X_transformed)` to plot the dataset projected onto the new subspace (defined by the principal components);
- Observe the two principal components: each one has K coefficients (one per original feature) and together they define a basis for a new subspace. Use `plot_heatmap_coefficients(model)` to analyze the coefficients.

    
Some questions:
1. Does scaling affect PCA?
   
</div>

In [None]:
def plot_projected_data(X_transformed, dataset=breast_ds):
    plt.figure(figsize=(8, 8))
    plt.title("Projected Data")
    mglearn.discrete_scatter(X_transformed[:, 0], X_transformed[:, 1], dataset.target)
    plt.legend(dataset.target_names, loc="best")
    plt.gca().set_aspect("equal")
    plt.xlabel("First principal component", size=13)
    plt.ylabel("Second principal component", size=13)

In [None]:
def plot_heatmap_coefficients(model, dataset=breast_ds):
    plt.matshow(model.components_, cmap='viridis')
    plt.yticks([0, 1], ["First component", "Second component"])
    plt.colorbar()
    plt.xticks(range(len(dataset.feature_names)),
               dataset.feature_names, rotation=60, ha='left')
    plt.xlabel("Feature")
    plt.ylabel("Principal components")

<div class="alert alert-warning" role="alert">
    
<img src="../resources/icons/book.png"  width="20" height="20" align="left"> &nbsp;  **Theory: Coefficients' Interpretation**

Principal components represent directions in the original feature space, and each one is a linear combination of the original features. The sign and magnitude of the coefficients of a component give a clue about how the features are correlated along that direction. In the first component the coefficients all have the same sign, meaning that when a point moves along the component direction the original features tend to increase (or decrease) together.

The second component is different because it has mixed signs.

</div>
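
A minimal sketch of how the coefficients relate to the projection, using synthetic data rather than the *breast cancer* features (`X_demo` is illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X_demo = rng.normal(size=(100, 5))     # synthetic data with 5 features
X_demo[:, 1] += 2.0 * X_demo[:, 0]     # make two features correlated

pca = PCA(n_components=2).fit(X_demo)

print(pca.components_.shape)           # (n_components, n_features)
print(pca.explained_variance_ratio_)   # share of variance captured by each component

# The components are orthonormal: unit length and mutually perpendicular.
gram = pca.components_ @ pca.components_.T
assert np.allclose(gram, np.eye(2))

# Projecting is centering followed by a dot product with the components.
X_proj = (X_demo - pca.mean_) @ pca.components_.T
assert np.allclose(X_proj, pca.transform(X_demo))
```

Each row of `components_` holds one coefficient per original feature, which is exactly what the heatmap function displays.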

<div class="alert alert-danger" role="alert">
    
<img src="../resources/icons/chemistry.png"  width="20" height="20" align="left"> &nbsp;  **Task 2: PCA and Model Performances**

    
Evaluate model performances with and without PCA, as was done in Exercise 3. Use the `compute_perf_SVC()` function and the *breast cancer* dataset again for this purpose.

</div>

<div class="alert alert-info" role="alert">
    
<img src="../resources/icons/list.png"  width="20" height="20" align="left"> &nbsp;  **Exercise 5: PCA - Feature Extraction using the Faces dataset**

It is possible to use PCA for feature extraction in order to use the new features for training a supervised model and achieve better results in terms of classification scores.

- Check dimensionality for the *faces* dataset;
- Use the *faces* dataset and apply PCA in order to extract 100 principal components;
- Train and test a KNN classifier on the original features (the `check_knn_performances()` function is provided for the purpose);
- Train and test a KNN classifier on the extracted PCA components (the `check_knn_performances()` function is provided for the purpose);
- Plot, using the provided `plot_pca_face_components()` function, the extracted principal components;

Data have been prepared in `X_faces` and `y_faces`.

<br>
    
Extra:
- Play with the `whiten` hyper-parameter of PCA and check the results;
- Try with a different number of principal components.
 
</div>

In [None]:
image_shape = faces_ds.images[0].shape
def plot_pca_face_components(model):
    fig, axes = plt.subplots(3, 5, figsize=(15, 12),
                             subplot_kw={'xticks': (), 'yticks': ()})
    for i, (component, ax) in enumerate(zip(model.components_, axes.ravel())):
        ax.imshow(component.reshape(image_shape), cmap='gray')
        ax.set_title("{}. component".format((i + 1)))

In [None]:
def check_knn_performances(X_train, X_test, y_train, y_test): 
    knn = KNeighborsClassifier(n_neighbors=1)
    knn.fit(X_train, y_train)
    print("Test set accuracy: {:.2f}".format(knn.score(X_test, y_test)))

In [None]:
X, y = faces_ds.data, faces_ds.target

# apply a mask to make the data less skewed (take up to 50 images of each person)
mask = np.zeros(y.shape, dtype=bool)
for target in np.unique(y):
    mask[np.where(y == target)[0][:50]] = 1
    
# divide the grayscale values by 255 for better numeric stability
X_faces, y_faces = X[mask] / 255., y[mask]

# split into train and test
X_train, X_test, y_train, y_test = train_test_split(X_faces, y_faces, random_state=0)

<div class="alert alert-warning" role="alert">
    
<img src="../resources/icons/book.png"  width="20" height="20" align="left"> &nbsp;  **Theory: PCA for Face Recognition**

It is possible to inspect each component extracted by PCA from the *faces* dataset. Some components seem to capture the contrast between the face and the background, while others encode lighting differences between different areas of the face.

In addition, given some coefficients $b_1, b_2, ... , b_k$ it is possible to express a test point (an image of a person) as a linear combination of those components.
</div>
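
The linear-combination idea can be sketched on synthetic data standing in for flattened images. Note that scikit-learn's `PCA` centers the data, so the combination is taken around the training mean (`pca.mean_`). All variable names here are illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X_demo = rng.normal(size=(50, 8))   # stand-in for flattened images

pca = PCA(n_components=3).fit(X_demo)
x = X_demo[:1]                      # one "test point", kept 2-D for transform()
b = pca.transform(x)[0]             # coefficients b_1, ..., b_k

# Weighted sum of the components, plus the mean removed during fitting.
x_hat = pca.mean_ + sum(b_k * comp for b_k, comp in zip(b, pca.components_))

# This matches what inverse_transform() computes.
assert np.allclose(x_hat, pca.inverse_transform(pca.transform(x))[0])
print(np.linalg.norm(x[0] - x_hat))  # reconstruction error with k = 3
```

With more components the reconstruction gets closer to the original point, and with all components it is exact up to rounding.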

## Clustering
Finally we look at clustering models, in particular **K-Means** and **DBSCAN**.

<div class="alert alert-info" role="alert">
    
<img src="../resources/icons/list.png"  width="20" height="20" align="left"> &nbsp;  **Exercise 6: K-Means**
- generate a random dataset with `make_blobs()` (by default it has 2 features and 100 samples)
- plot the generated random dataset using the `plot_blob(data)` function
- apply kmeans to the random dataset using `KMeans(n_clusters=3)`
- plot the results with the provided `plot_clusters(data, model)` function 
</div>

In [None]:
def plot_clusters(data, model):
    # Adapted from a scikit-learn example.
    # Step size of the mesh. Decrease to increase the quality of the VQ.
    h = .02
    # Plot the decision boundary. For that, we will assign a color to each
    # point in the mesh [x_min, x_max]x[y_min, y_max].
    x_min, x_max = data[:, 0].min() - 1, data[:, 0].max() + 1
    y_min, y_max = data[:, 1].min() - 1, data[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))

    # Obtain labels for each point in mesh. Use last trained model.
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()])

    # Put the result into a color plot
    Z = Z.reshape(xx.shape)
    plt.figure(figsize=[10,7])
    plt.clf()
    plt.imshow(Z, interpolation='nearest',
               extent=(xx.min(), xx.max(), yy.min(), yy.max()),
               cmap=plt.cm.Pastel1, alpha=0.7,
               aspect='auto', origin='lower')

    plt.plot(data[:, 0], data[:, 1], '.', markersize=10)
    # Plot the centroids as red circles
    centroids = model.cluster_centers_
    plt.scatter(centroids[:, 0], centroids[:, 1],
                marker='o', s=20, linewidths=3,
                color='r', zorder=10)
    plt.title('K-means clustering', size=15)
    plt.xlim(x_min, x_max)
    plt.ylim(y_min, y_max)
    plt.xlabel("feature 1", size=13)
    plt.ylabel("feature 2", size=13)
    plt.show()

In [None]:
def plot_blob(data):
    f = plt.figure(figsize=[10,7])
    ax = f.add_subplot()
    ax.scatter(data[:, 0], data[:, 1])  # plot the argument, not the global X
    ax.set_title("Scatter Plot of random blobs", fontdict={'fontsize':15})
    ax.set_xlabel("feature 1", fontdict={'fontsize':13})
    ax.set_ylabel("feature 2", fontdict={'fontsize':13})

In [None]:
X, y  = make_blobs(random_state=10)

<div class="alert alert-danger" role="alert">
    
<img src="../resources/icons/chemistry.png"  width="20" height="20" align="left"> &nbsp;  **Task 3**:  Try different values of k and plot the results. It is also possible to specify the initialization for the centroids.
</div>
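
As a hedged starting point for this task, the sketch below fits K-Means for a few values of k and then with explicit starting centroids. The data and the choice of `init_centers` are illustrative only.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X_demo, _ = make_blobs(random_state=10)

inertias = []
for k in (2, 3, 4, 5):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_demo)
    inertias.append(km.inertia_)   # within-cluster sum of squared distances
    print(k, round(km.inertia_, 1))

# Explicit starting centroids: pass an array of shape (n_clusters, n_features).
init_centers = X_demo[:3]          # arbitrary choice, for illustration only
km_init = KMeans(n_clusters=3, init=init_centers, n_init=1).fit(X_demo)
print(km_init.cluster_centers_)
```

The inertia typically shrinks as k grows, so it cannot be used on its own to pick k; looking for the point where the improvement flattens out is a common heuristic.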

<hr>


### When K-Means fails
K-means has some drawbacks:
- it can only find convex clusters, since each point is assigned to its nearest centroid
- it assumes clusters of similar size (diameter)
- it does not take into account that some directions may matter more than others
- k is a hyperparameter that must be chosen in advance

Let's consider two cases where k-means fails to identify potentially "meaningful" clusters. The two datasets are already provided for this purpose. Then we will see a different model that can overcome these limitations.

<div class="alert alert-info" role="alert">
    
<img src="../resources/icons/list.png"  width="20" height="20" align="left"> &nbsp;  **Exercise 7: K-Means Failure**
- use the `X_blob` to fit and plot the result of kmeans on a stretched blob dataset
- use the `X_moons` to fit and plot the result of kmeans on the moons dataset
</div>

In [None]:
X_blob, y_blob = make_blobs(random_state=110, n_samples=600)
rng = np.random.RandomState(1200)
transformation = rng.normal(size=(2, 2))
X_blob = np.dot(X_blob, transformation)

In [None]:
X_moons, y_moons = make_moons(n_samples=200, noise=0.05, random_state=10)

### Use DBSCAN
We now use a more sophisticated clustering algorithm, which builds clusters based on the density of the data points. One advantage of DBSCAN is that there is no need to set a value for k. For more details about DBSCAN see the scikit-learn [dedicated page](https://scikit-learn.org/stable/modules/clustering.html#dbscan).

<div class="alert alert-info" role="alert">
    
<img src="../resources/icons/list.png"  width="20" height="20" align="left"> &nbsp;  **Exercise 8: DBSCAN**
- use DBSCAN to solve the above problems (reuse the already provided data)
- try different parameters for DBSCAN: 
    - `eps`: the neighbourhood radius; it implicitly controls the number of clusters
    - `min_samples`: in less dense regions, determines whether a point is noise or belongs to a cluster
- use `plot_dbscan_clusters()` to observe the results
</div>
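
The role of the two parameters can be sketched on a tiny hand-made dataset (not the exercise data). The points in `pts` and the parameter values are chosen only to make the effect easy to follow.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hand-made data: two tight groups of four points and one isolated point.
pts = np.array([[0.0, 0.0], [0.0, 0.1], [0.1, 0.0], [0.1, 0.1],
                [5.0, 5.0], [5.0, 5.1], [5.1, 5.0], [5.1, 5.1],
                [10.0, 0.0]])

# Small radius: two dense clusters; the isolated point is labelled -1 (noise).
labels_small = DBSCAN(eps=0.3, min_samples=3).fit(pts).labels_
print(labels_small)

# Very large radius: every point is a neighbour of every other, so one cluster.
labels_large = DBSCAN(eps=20.0, min_samples=3).fit(pts).labels_
print(labels_large)

# Stricter density requirement: no point has 5 neighbours within eps, all noise.
labels_strict = DBSCAN(eps=0.3, min_samples=5).fit(pts).labels_
print(labels_strict)
```

In scikit-learn, `min_samples` counts the point itself, which is why groups of four points are dense enough for `min_samples=3` but not for `min_samples=5`.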

In [None]:
def plot_dbscan_clusters(model, data):
    core_samples_mask = np.zeros_like(model.labels_, dtype=bool)
    core_samples_mask[model.core_sample_indices_] = True
    labels = model.labels_

    # Number of clusters in labels, ignoring noise if present.
    n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
    n_noise_ = list(labels).count(-1)
    # Black removed and is used for noise instead.
    unique_labels = set(labels)
    colors = [plt.cm.Spectral(each)
              for each in np.linspace(0, 1, len(unique_labels))]
    plt.figure(figsize=[10, 7])
    for k, col in zip(unique_labels, colors):
        if k == -1:
            # Black used for noise.
            col = [0, 0, 0, 1]

        class_member_mask = (labels == k)

        xy = data[class_member_mask & core_samples_mask]
        plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
                 markeredgecolor='k', markersize=14)

        xy = data[class_member_mask & ~core_samples_mask]
        plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
                 markeredgecolor='k', markersize=6)

    plt.title('Estimated number of clusters: %d' % n_clusters_, size=15)
    plt.xlabel("feature 1", size=13)
    plt.ylabel("feature 2", size=13)
    plt.show()

## Homework

In [None]:
from sklearn.datasets import load_wine
wine_ds = load_wine()
print(wine_ds.DESCR)

<div class="alert alert-danger" role="alert">
    
<img src="../resources/icons/chemistry.png"  width="20" height="20" align="left"> &nbsp;  **Task**: Apply scaling and PCA to the **Wine** dataset. 

<br>
Do not focus on the model used for classification for now; this exercise is meant to show an important consequence of scaling the data when using PCA.

1. Split training and test (this has been done for you);
2. Check model performances on the original data (use the provided `check_knn_performances` function as we did in the exercises). This gives a baseline reference;
3. Transform the data with a `StandardScaler` and save it in a different variable (we need it for comparisons);
4. Apply PCA to both the non-standardized data and the standardized data;
5. Check model performances with both versions of the data (PCA and Standardization + PCA);
6. Plot the first and second component of the data with PCA and the first and second component of the data with Standardization + PCA (use the provided `plot_scaling_pcs` function);
7. Compare the results and observe the difference between the two plots.

<hr>
    
Check this [reference](https://scikit-learn.org/stable/auto_examples/preprocessing/plot_scaling_importance.html#sphx-glr-auto-examples-preprocessing-plot-scaling-importance-py) for more details (do this after solving the exercise).
    
[**SOLUTION**](./solutions/handson2/solution.py)


</div>

In [None]:
def plot_scaling_pcs(X_pca, X_scaled_pca, y):
    fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(10, 7))


    for l, c, m in zip(range(0, 3), ('blue', 'red', 'green'), ('^', 's', 'o')):
        ax1.scatter(X_pca[y == l, 0],
                    X_pca[y == l, 1],
                    color=c,
                    label='class %s' % l,
                    alpha=0.5,
                    marker=m
                    )

    for l, c, m in zip(range(0, 3), ('blue', 'red', 'green'), ('^', 's', 'o')):
        ax2.scatter(X_scaled_pca[y == l, 0],
                    X_scaled_pca[y == l, 1],
                    color=c,
                    label='class %s' % l,
                    alpha=0.5,
                    marker=m
                    )

    ax1.set_title('Training dataset after PCA')
    ax2.set_title('Standardized training dataset after PCA')

    for ax in (ax1, ax2):
        ax.set_xlabel('1st principal component')
        ax.set_ylabel('2nd principal component')
        ax.legend(loc='upper right')
        ax.grid()

<div class="alert alert-success" role="alert">
    
<img src="../resources/icons/lightbulb.png"  width="20" height="20" align="left"> &nbsp; **PCA & Scaling Tip 1**: 
Be careful to fit both `StandardScaler` and `PCA` on the *training data* and to use the fitted models to transform the *test data*. Why? The *test data* simulates the new data the model will see once it is put into production: it should be data you have not observed yet, so your only prior information is the training set.
</div>
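
One convenient way to follow this tip is `make_pipeline`, which fits every step on the training data and only transforms the test data. The sketch below uses synthetic blobs rather than the Wine dataset, so it does not replace the homework; all variable names are illustrative.

```python
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic 3-class data with 6 features (illustrative only).
X_demo, y_demo = make_blobs(n_samples=300, n_features=6, centers=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, stratify=y_demo, random_state=0)

pipe = make_pipeline(StandardScaler(),
                     PCA(n_components=2),
                     KNeighborsClassifier(n_neighbors=3))
pipe.fit(X_tr, y_tr)           # scaler and PCA are fitted on the training data only
acc = pipe.score(X_te, y_te)   # test data is only transformed, then classified
print("Test set accuracy: {:.2f}".format(acc))
```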

In [None]:
def check_knn_performances(X_train, X_test, y_train, y_test): 
    knn = KNeighborsClassifier(n_neighbors=3)
    knn.fit(X_train, y_train)
    print("Test set accuracy: {:.2f}".format(knn.score(X_test, y_test)))

In [None]:
# 1: split test and training
X, y = wine_ds.data, wine_ds.target
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=R_STATE)

<div class="alert alert-warning" role="alert">
    
<img src="../resources/icons/book.png"  width="20" height="20" align="left"> &nbsp;  **Theory: Feature Scaling** 

For some ML applications and models, data scaling may have an impact on the results and performance of a given model. An example is PCA, whose key principle is to find the components that maximize the variance. If some features vary more than others simply because of their scale or units, they will dominate how the principal components are computed.

Another application where feature scaling is a must is Neural Networks. Why this is important will be a subject of Lecture 4.
</div>
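
The effect on PCA can be sketched with two independent synthetic features that differ only in scale. The data and names are illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
# Two independent synthetic features; the second has a much larger scale.
X_demo = np.column_stack([rng.normal(scale=1.0, size=500),
                          rng.normal(scale=1000.0, size=500)])

pca_raw = PCA(n_components=2).fit(X_demo)
pca_std = PCA(n_components=2).fit(StandardScaler().fit_transform(X_demo))

print(np.abs(pca_raw.components_[0]))      # first component, raw data
print(pca_raw.explained_variance_ratio_)   # first component captures almost everything
print(pca_std.explained_variance_ratio_)   # roughly balanced after standardization
```

On the raw data the first component is essentially the large-scale feature itself. After standardization neither feature dominates, which matches the behaviour the homework asks you to observe on the Wine dataset.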

<hr>

<div hidden=True>
<img src="../resources/icons/list.png"  width="20" height="20" align="left"> &nbsp; Icon made by <a href="https://www.flaticon.com/authors/smashicons" title="Smashicons">Smashicons</a> from <a href="https://www.flaticon.com/"             title="Flaticon">www.flaticon.com</a>

<img src="../resources/icons/lightbulb.png"  width="20" height="20" align="left"> &nbsp;Icon made by <a href="https://www.flaticon.com/authors/pixelmeetup" title="Pixelmeetup">Pixelmeetup</a> from <a href="https://www.flaticon.com/"             title="Flaticon">www.flaticon.com</a>

<img src="../resources/icons/new.png"  width="20" height="20" align="left"> &nbsp; Icon made by <a href="https://www.flaticon.com/authors/pixel-perfect" title="Pixel perfect">Pixel perfect</a> from <a href="https://www.flaticon.com/"             title="Flaticon">www.flaticon.com</a>

<img src="../resources/icons/chemistry.png"  width="20" height="20" align="left"> &nbsp; Icon made by <a href="https://www.flaticon.com/authors/popcorns-arts" title="Icon Pond">Icon Pond</a> from <a href="https://www.flaticon.com/"             title="Flaticon">www.flaticon.com</a>

<img src="../resources/icons/book.png"  width="20" height="20" align="left"> &nbsp; Icon made by <a href="https://www.flaticon.com/authors/popcorns-arts" title="Icon Pond">Icon Pond</a> from <a href="https://www.flaticon.com/"             title="Flaticon">www.flaticon.com</a>
    
</div>