# Unsupervised Machine Learning with Python


<img src="https://www.python.org/static/img/python-logo.png" alt="Python logo" style="width: 200px; float: right;"/>
<br>
<br>
<br>
<img src="../assets/yogen-logo.png" alt="yogen" style="width: 200px; float: right;"/>


# Objectives

* Learn what unsupervised machine learning is

* Group data with clustering

* Find underlying linear patterns with PCA

![scikit-learn cheat sheet](http://amueller.github.io/sklearn_tutorial/cheat_sheet.png)

## Clustering

A series of techniques for finding clusters within datasets.

Clusters are groups of points that are closer to each other than to the rest.

<img src="figs/clustering.png" alt="Clustering" style="height: 600px; float: left;"/>


# Generate random data

We're going to generate random data normally distributed around three centers, with noise. Each cluster will have 200 points. We concatenate all three groups in a single dataframe. 

Note that the data we have created has no class labels: it is just a set of points. However, we DO know that the points come from different distributions, and our objective is to recover them.
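A minimal sketch of how such a dataset could be generated. The three centers, the unit standard deviation, the seed and the column names `x` and `y` are assumptions chosen for illustration, not values taken from the original notebook.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Assumed cluster centers and noise level (illustrative values only).
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 6.0]])
n_per_cluster = 200

# One dataframe per center, normally distributed around it.
groups = [
    pd.DataFrame(rng.normal(loc=c, scale=1.0, size=(n_per_cluster, 2)),
                 columns=["x", "y"])
    for c in centers
]

# Concatenate the three groups; no class column is kept.
data = pd.concat(groups, ignore_index=True)
print(data.shape)  # (600, 2)
```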

## K-means clustering

Very simple algorithm, quite fast:

- Throw K candidate cluster centers (_centroids_) randomly at the data.

- Assign points to the closest centroid.

- Update the centroid as the average of its observations.

- Repeat the previous two steps (assign, update) until convergence.



```
clustering (data, K):
    Randomly initialize K cluster centroids (mu(1),..., mu(K))
    # or select K random points from data
    Repeat until convergence:
        # assign cluster
        for d in data:
            assign d to closest cluster centroid
        # recompute cluster centroid
        for k = 1 to K:
            mu(k) = mean of data points assigned to cluster k
            
```
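The pseudocode above can be turned into a short NumPy function. This is a sketch, not a replacement for `sklearn.cluster.KMeans`: the function name `kmeans`, the `n_iter` cap, the seed and the handling of empty clusters are all choices made here for illustration.

```python
import numpy as np

def kmeans(data, k, n_iter=100, seed=0):
    """Plain K-means. Returns (centroids, labels from the last assignment step)."""
    rng = np.random.default_rng(seed)
    # Initialise the centroids as K distinct random points from the data.
    centroids = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: squared distance from every point to every centroid.
        dists = ((data[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Update step: mean of the assigned points
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            data[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Converged when the centroids stop moving.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Usage on two well-separated synthetic blobs.
rng = np.random.default_rng(1)
points = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(10, 0.5, (50, 2))])
centroids, labels = kmeans(points, k=2)
```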

#### Exercise

For 3 random starting points, calculate which is the closest for each of our points. 

Hint: check the `scipy.spatial.distance` module. `np.argmin` might also be useful.
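One possible approach, following the hint. The array `points` is a small random stand-in for our dataset, and picking the starting points among the data is an assumption; any three coordinates would work.

```python
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
points = rng.normal(size=(10, 2))  # stand-in for our dataset

# Three random starting points, drawn here from the data itself.
starts = points[rng.choice(len(points), size=3, replace=False)]

# Pairwise Euclidean distances, shape (n_points, n_starts).
distances = cdist(points, starts)

# For each point, the index of the closest starting point.
closest = np.argmin(distances, axis=1)
```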

Now we are going to compare the original 'cluster' each point comes from with the assigned cluster. To do that, we just create a vector with the original class (color) and use it to plot.

Recall that we can do this here only because we're working with a synthetic dataset. Normally we won't be able to, since we don't know how the data was generated.

Notice that the class labels (kmeans) may not agree with the original class numbers.
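A small sketch of how to check this. A contingency table shows which K-means label each original group received, and the adjusted Rand index compares the two partitions while ignoring how the labels are numbered. The centers and the seed are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)
centers = np.array([[0, 0], [8, 8], [0, 8]])  # assumed centers
X = np.vstack([rng.normal(c, 1.0, size=(200, 2)) for c in centers])
true_labels = np.repeat([0, 1, 2], 200)       # the group each point came from

kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Rows: original group. Columns: label assigned by K-means.
table = pd.crosstab(true_labels, kmeans_labels)
print(table)

# Invariant to any permutation of the label numbers.
ari = adjusted_rand_score(true_labels, kmeans_labels)
print(ari)
```

Each row of the table should have almost all of its points in a single column, but not necessarily the column with the same number.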

Now let's take some time to play with different numbers of (original) distributions and clusters, and see what happens when the number of clusters does not match the true structure of the data.

### Practical: K-Means Clustering with sklearn

Download `players_20.csv` from [here](https://www.kaggle.com/stefanoleone992/fifa-20-complete-player-dataset?select=players_20.csv).
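A sketch of the kind of pipeline we could build for this practical. The helper `cluster_rows` is hypothetical, and so are the column names `age`, `height_cm` and `overall`: check them against the file you download. A random dataframe stands in for `pd.read_csv("players_20.csv")` so that the block runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

def cluster_rows(df, columns, n_clusters, seed=0):
    """Scale the chosen numeric columns and return one K-means label per row."""
    features = df[columns].dropna()
    scaled = StandardScaler().fit_transform(features)
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=seed).fit_predict(scaled)
    return pd.Series(labels, index=features.index, name="cluster")

# Stand-in data; with the real file this would be pd.read_csv("players_20.csv").
rng = np.random.default_rng(0)
players = pd.DataFrame({
    "age": rng.integers(17, 40, size=100),        # assumed column name
    "height_cm": rng.normal(181, 7, size=100),    # assumed column name
    "overall": rng.integers(50, 95, size=100),    # assumed column name
})

players["cluster"] = cluster_rows(players, ["age", "height_cm", "overall"],
                                  n_clusters=4)
print(players.groupby("cluster").mean())
```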

## The elbow method

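A heuristic for choosing the number of clusters in KMeans: fit the model for several values of K, plot the within-cluster sum of squares (`inertia_` in scikit-learn) against K, and pick the K at which the curve bends (the "elbow").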
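A minimal sketch of the elbow computation on synthetic data with three assumed centers. Plotting `inertias` against `ks` (for example with `matplotlib`) would show the bend; here we only print the values.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
centers = np.array([[0, 0], [8, 8], [0, 8]])  # assumed centers
X = np.vstack([rng.normal(c, 1.0, size=(200, 2)) for c in centers])

ks = range(1, 8)
# inertia_ is the sum of squared distances of the points to their closest centroid.
inertias = [
    KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    for k in ks
]

for k, inertia in zip(ks, inertias):
    print(k, round(inertia, 1))
```

With three well-separated groups, the drop from K=2 to K=3 should be much larger than the drop from K=3 to K=4.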

## Hierarchical clustering

In hierarchical clustering there are groups within groups. We can either subdivide the observations (_divisive clustering_) or join those that are similar to each other (_agglomerative clustering_).

We can track the order in which we join them up and represent it as a _dendrogram_.

We don't need to specify the number of clusters beforehand. 

The distance between two observations can be read off the dendrogram: it is the _height_ of the branching point at which they are first joined.
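A short sketch of agglomerative clustering with SciPy. The Ward linkage and the two synthetic blobs are assumptions chosen for illustration; `dendrogram` is called with `no_plot=True` so the block runs without a plotting backend.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(6, 0.5, (20, 2))])

# Merge history: one row per merge, column 2 holds the height of the merge.
Z = linkage(X, method="ward")

# We only choose the number of clusters afterwards, by cutting the tree.
labels = fcluster(Z, t=2, criterion="maxclust")

# Dendrogram layout as a dict; drop no_plot=True to draw it with matplotlib.
tree = dendrogram(Z, no_plot=True)
print(tree["leaves"][:5])
```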

<img src="https://i0.wp.com/datascienceplus.com/wp-content/uploads/2016/01/hclust.png" alt="Dendrogram" style="height: 600px; float: left;"/>


### Distance measures in clustering

In any of these approaches, we need a measure of distance or similarity between points. 

In hierarchical clustering, we additionally need a measure of similarity between single points and groups of points (the _linkage_ criterion).

How will these measures be influenced by the scale of our variables?
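A tiny example of why the question matters. The features (height in metres, salary in euros) and the four rows are hypothetical. With raw values the salary column dominates the Euclidean distance; after standardizing, both columns contribute comparably and the nearest neighbour of the first row changes.

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.preprocessing import StandardScaler

# Hypothetical data: [height in metres, salary in euros].
X = np.array([
    [1.60, 30_000.0],
    [1.95, 30_500.0],
    [1.61, 33_000.0],
    [1.80, 60_000.0],
])

# Distances from the first row to each of the other rows, on raw values.
raw = cdist(X[:1], X[1:])

# Same distances after scaling every column to zero mean and unit variance.
X_scaled = StandardScaler().fit_transform(X)
scaled = cdist(X_scaled[:1], X_scaled[1:])

print(raw.argmin(), scaled.argmin())
```

On raw values the closest row is the one with a similar salary but a very different height; after scaling it is the one with almost the same height.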

### Clustering in scikit-learn

![Clustering algorithms in scikit-learn](https://scikit-learn.org/stable/_images/sphx_glr_plot_cluster_comparison_001.png)

## DBSCAN

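A density-based algorithm: it groups points that are closely packed together and marks points in low-density regions as noise, so we don't need to specify the number of clusters.

There's also a hierarchical version (HDBSCAN).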
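A minimal DBSCAN sketch with scikit-learn on the two-moons toy dataset, a shape that K-means handles poorly. The values of `eps` and `min_samples` are assumptions tuned to this toy data, not general recommendations.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# eps: neighbourhood radius. min_samples: neighbours needed to be a core point.
labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

# DBSCAN labels noise points as -1; they do not form a cluster.
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))
print(n_clusters, n_noise)
```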



## Measuring quality of clustering

### Elbow method

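We've already seen it. It is a heuristic with little theoretical grounding: the within-cluster sum of squares keeps decreasing as we add clusters, and the position of the elbow is often ambiguous.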

### Silhouette

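The Silhouette Coefficient for a sample $i$ is a function of its mean intra-cluster distance $a(i)$ (the mean distance to the other points in its own cluster) and its mean nearest-cluster distance $b(i)$ (the mean distance to the points of the nearest cluster it does not belong to). It is defined as:
 
$$ s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}}$$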

It therefore varies between -1, for a sample that is closer to members of a different cluster than to its own, and 1, for one that is much closer to members of its own cluster.

It has a great advantage over the elbow method: it can go either up or down as we increase the number of clusters, so we can simply look for the number of clusters that maximizes it.
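A sketch of how the mean silhouette can be used to choose the number of clusters. The data are synthetic, with three assumed centers; `silhouette_score` returns the mean of $s(i)$ over all samples.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
centers = np.array([[0, 0], [8, 8], [0, 8]])  # assumed centers
X = np.vstack([rng.normal(c, 1.0, size=(200, 2)) for c in centers])

scores = {}
for k in range(2, 7):  # the silhouette needs at least 2 clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# Unlike inertia, the score peaks: pick the K with the highest value.
best_k = max(scores, key=scores.get)
print(scores, best_k)
```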

![2 clusters](https://scikit-learn.org/stable/_images/sphx_glr_plot_kmeans_silhouette_analysis_001.png)

![3 clusters](https://scikit-learn.org/stable/_images/sphx_glr_plot_kmeans_silhouette_analysis_002.png)

![4 clusters](https://scikit-learn.org/stable/_images/sphx_glr_plot_kmeans_silhouette_analysis_003.png)

![5 clusters](https://scikit-learn.org/stable/_images/sphx_glr_plot_kmeans_silhouette_analysis_004.png)

![6 clusters](https://scikit-learn.org/stable/_images/sphx_glr_plot_kmeans_silhouette_analysis_005.png)


from https://scikit-learn.org/stable/modules/generated/sklearn.metrics.silhouette_score.html

## Dimensionality reduction

### Principal Component Analysis (PCA)

A dimensionality reduction technique. It uses Singular Value Decomposition (SVD) of the centered data matrix to generate Principal Components: unit vectors that sequentially point in the direction that best fits the data, while being orthogonal to the previous ones.

They are ordered by the amount of variance they explain.

Those vectors can then be used to do a change of basis. If we take fewer than the total, we will be doing _dimensionality reduction_: a projection onto the subspace of the given dimension that conserves the most variance.
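The steps above can be sketched directly with NumPy's SVD. The synthetic 3D data (two latent directions plus a little noise) and the mixing matrix are assumptions made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 3D data that mostly lives in a 2D subspace (assumed example).
latent = rng.normal(size=(500, 2))
mixing = np.array([[3.0, 1.0, 0.5],
                   [0.0, 1.0, 0.5]])
X = latent @ mixing + 0.1 * rng.normal(size=(500, 3))

# PCA works on centered data.
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Rows of Vt are the principal components: orthogonal unit vectors.
components = Vt

# Singular values come sorted, so the explained variance is already ordered.
explained_variance = S**2 / (len(X) - 1)
explained_ratio = explained_variance / explained_variance.sum()

# Dimensionality reduction: project onto the first 2 components.
X_reduced = X_centered @ components[:2].T
print(explained_ratio)
```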

![PCA as base change](https://intoli.com/blog/pca-and-svd/img/basic-pca.png)

In three dimensions, but keeping only 2 components:

![PCA in 3D](figs/pca_3D.png)

from [An Introduction to Statistical Learning](http://www-bcf.usc.edu/~gareth/ISL/)

* Now let's apply PCA to our dataset.
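A sketch of the same idea with scikit-learn. The array `X` is a synthetic stand-in for our dataset (five features driven by two latent factors, an assumption for illustration). We scale first, since PCA is sensitive to the scale of the variables.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Stand-in for our dataset: 5 features generated from 2 latent factors.
latent = rng.normal(size=(300, 2))
mixing = np.array([[1.0, 2.0, 0.0, 1.0, -1.0],
                   [0.0, 1.0, 2.0, -1.0, 1.0]])
X = latent @ mixing + 0.05 * rng.normal(size=(300, 5))

X_scaled = StandardScaler().fit_transform(X)

# Keep only the first two principal components.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)

print(pca.explained_variance_ratio_)
```

`X_2d` can then be plotted directly, or passed to a clustering algorithm in place of the original features.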


# Additional References


[An Introduction to Statistical Learning](http://www-bcf.usc.edu/~gareth/ISL/)

[The Elements of Statistical Learning](https://web.stanford.edu/~hastie/ElemStatLearn/)

[Introduction to Machine Learning with Python](http://shop.oreilly.com/product/0636920030515.do)

[scikit-learn cheat sheet](https://s3.amazonaws.com/assets.datacamp.com/blog_assets/Scikit_Learn_Cheat_Sheet_Python.pdf)

[A comparison of classifiers available in scikit-learn](http://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html)

[An amazing explanation of the kernel trick](http://www.eric-kim.net/eric-kim-net/posts/1/kernel_trick.html)

[Clustering in scikit-learn](http://scikit-learn.org/stable/modules/clustering.html)

[Ensemble methods in scikit-learn](http://scikit-learn.org/stable/modules/ensemble.html)

[An example of customer segmentation](https://www.kaggle.com/fabiendaniel/customer-segmentation)

[Hands-on ML](https://github.com/ageron/handson-ml2)

[Silhouette analysis](https://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_silhouette_analysis.html)

[PCA explained visually](https://setosa.io/ev/principal-component-analysis/)

[A step by step explanation of PCA](https://builtin.com/data-science/step-step-explanation-principal-component-analysis)