<img src=images/gdd-logo.png width=300px align=right>

# Unsupervised Learning

Until now we have considered machine learning algorithms that learn with the help of external feedback: the algorithm makes a prediction, compares its prediction with a provided ground truth, and "learns" by adjusting its internal parameters. This class of learning techniques is referred to as **supervised learning**.

In contrast, the techniques described in this lecture do not rely on some external notion of what is or is not correct; this class of learning techniques is referred to as **unsupervised learning**.

For unsupervised learning, we can roughly differentiate between two categories: 
   - Clustering
   - Dimensionality reduction
   
In this notebook, we will take a look at techniques to determine whether your data can be clustered and a popular clustering algorithm called k-means.  

# Clustering

- [Introduction to clustering](#intro)
- [K-Means clustering](#kmeans)
- [Determining the number of clusters](#nr-clusters)
- [Scaling](#scaling)

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

<a id='intro'></a>
## Introduction to clustering

Clustering is the technique of **grouping together** points that in some form or another 'belong' together. In a clustering problem, you are given an *unlabeled* data set and have an algorithm automatically group the data into coherent subsets (clusters) for you. 

<mark>**Question:** Can you think of any examples of unsupervised ML problems?</mark>

<details>
    
  <summary><span style="color:blue">Show examples</span></summary>
  
Examples include market segmentation, social network analysis, organising computer clusters/data centers better, or understanding galaxy formation. 

</details>

The type of algorithm to choose for your clustering heavily depends on the type of data you have. Let's start out with creating a dataset with four distinct clusters.

In [None]:
from sklearn.datasets import make_blobs

blobs, blobs_labels = make_blobs(n_samples=300, 
                                 centers=4,
                                 cluster_std=0.6, 
                                 random_state=0)
plt.scatter(blobs[:, 0], blobs[:, 1], s=50);

We can clearly see the four distinct clusters here, because we are able to visualise our data in 2D. However, if our data is more than two-dimensional, it can be difficult to visualise and therefore assess the clusterability.  

<a id='kmeans'></a>
## K-Means clustering

K-means is a widely used clustering algorithm that assigns each point in the dataset to a cluster. It works as follows:

1. A number (_k_) of **centroids** are initialised. These are the centers of our clusters. Usually, these centroids are data points in the data set. 
2. Each point in the dataset gets assigned to one out of _k_ clusters based on the **minimal Euclidean distance** between the data point and each centroid. 
3. The centroid of each cluster is **recalculated** to the average of the points in that cluster. 
4. Repeat 2-3 until points no longer get reassigned.


<img src="images/kmeans.gif" alt="K-Means Illustration" height=600 width=600>
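
To make the four steps above concrete, here is a minimal NumPy sketch of them. The helper `kmeans_sketch` and its parameters (`max_iter`, `seed`) are hypothetical names made up for this illustration; this is not how sklearn implements k-means, and it skips refinements such as smarter initialisation.

```python
import numpy as np
from sklearn.datasets import make_blobs


def kmeans_sketch(X, k, max_iter=100, seed=0):
    """Plain NumPy sketch of the four k-means steps (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    # Step 1: initialise the centroids as k distinct data points.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    labels = None
    for _ in range(max_iter):
        # Step 2: assign every point to its nearest centroid (Euclidean distance).
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = distances.argmin(axis=1)
        # Step 4: stop as soon as no point changes cluster.
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 3: move each centroid to the mean of its assigned points
        # (a centroid that lost all its points is left where it is).
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    return labels, centroids


X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=0)
sketch_labels, sketch_centroids = kmeans_sketch(X, k=4)
print(sketch_centroids)
```

Try a few different values for `seed`: depending on which points are picked in step 1, the sketch may end up with different centroids.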

<mark>**Question:** Can you think of any potential downsides of k-means?</mark>

<details>
    
  <summary><span style="color:blue">Show answer</span></summary>
  
One downside of k-means is that it requires you to define **in advance** how many clusters you expect to find in the data.

</details>



Let's try it out in Python!

In [None]:
from sklearn.cluster import KMeans

help(KMeans)

In [None]:
from sklearn.cluster import KMeans

# Experiment with different values of k!
kmeans = KMeans(n_clusters=4, n_init=1)
kmeans.fit(blobs)

blobs_kmeans = kmeans.predict(blobs)
blobs_kmeans[0:5]

In [None]:
centers = kmeans.cluster_centers_
centers

In [None]:
plt.scatter(blobs[:, 0], blobs[:, 1], 
            c=blobs_kmeans, s=50, cmap='viridis')

centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='black', s=200, alpha=0.5);

K-means is also quite sensitive to how the centroids are initialised. To reduce the risk of getting stuck in a local minimum, you can run k-means multiple times and keep the run for which the **cost function** (the sum of squared distances of samples to their closest cluster center) is the lowest. In sklearn, you can access this value with the `.inertia_` attribute of your fitted estimator. 
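
As a rough illustration of the paragraph above, the sketch below fits k-means a few times with a single random initialisation each, recomputes the cost function by hand for one run, and keeps the run with the lowest inertia. The seeds `range(5)` and the variable names are arbitrary choices for this example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=0)

# Fit k-means several times, each with one purely random initialisation.
runs = [
    KMeans(n_clusters=4, n_init=1, init="random", random_state=seed).fit(X)
    for seed in range(5)
]

# Recompute the cost function by hand for the first run:
# the sum of squared distances from each sample to its assigned center.
first = runs[0]
assigned_centers = first.cluster_centers_[first.labels_]
manual_inertia = np.sum((X - assigned_centers) ** 2)
print(manual_inertia, first.inertia_)

# Keep the run with the lowest inertia.
best = min(runs, key=lambda model: model.inertia_)
print([round(model.inertia_, 1) for model in runs], best.inertia_)
```

Setting `n_init` to a value larger than 1 asks sklearn to do this selection for you: it returns the best of `n_init` runs in terms of inertia.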

### <mark>Exercise</mark>

Run the code below and experiment with different values for `n_clusters` (the _k_ in k-means; e.g. what happens when k=300?) and the number of initialisations (`n_init`).

In [None]:
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=4, n_init=1)
kmeans.fit(blobs)

blobs_kmeans = kmeans.predict(blobs)

plt.scatter(blobs[:, 0], blobs[:, 1], 
            c=blobs_kmeans, s=50, cmap='viridis')

centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='black', s=200, alpha=0.5);

print(f"The inertia score for this clustering attempt is {kmeans.inertia_}")

<a id='nr-clusters'></a>
## Determining the number of clusters
Inertia will tend to zero as the number of centers increases towards the number of data points.

You can use this property to infer the optimal $k$ value: you should choose a number of clusters so that adding another cluster would not give a much better inertia value.

The **Elbow method** is one of the most popular methods to determine this optimal number of clusters for the data at hand.

In [None]:
blobs, blobs_labels = make_blobs(n_samples=300, 
                                 centers=4,
                                 cluster_std=0.60, 
                                 random_state=0)
plt.scatter(blobs[:, 0], blobs[:, 1], s=50);

With the Elbow method, you plot the inertia for the number of clusters you consider. 

In [None]:
K = range(1, 10)
score = []
for k in K:
    kmeans = KMeans(n_clusters=k, n_init=10)
    kmeans.fit(blobs)
    score.append(kmeans.inertia_)
    
score

In [None]:
plt.plot(K, score, 'bx-')
plt.title('The Elbow Method')
plt.xlabel('k')
plt.ylabel('Inertia');

As expected, the plot looks like an arm with a clear elbow at k = 4. The choice of number of clusters, based on this plot, would be **four**.

Unfortunately, you do not always have such clearly clustered data. Let's create a bit more ambiguous data by altering the cluster standard deviation in the blobs example.

In [None]:
blobs, blobs_labels = make_blobs(n_samples=300, 
                                 centers=4,
                                 cluster_std=1.20, 
                                 random_state=0)
plt.scatter(blobs[:, 0], blobs[:, 1], s=50);

Now let's again plot the inertia for various numbers of clusters.

In [None]:
K = range(1, 10)
score = []
for k in K:
    kmeans = KMeans(n_clusters=k, n_init=10)
    kmeans.fit(blobs)
    score.append(kmeans.inertia_)

In [None]:
plt.plot(K, score, 'bx-')
plt.title('The Elbow Method')
plt.xlabel('k')
plt.ylabel('Inertia');

The elbow is not as sharp. Although we know (as we created our data) the best number of clusters is four, based on this plot the choice is ambiguous. Five, for instance, could also be a decent choice. 

In such an ambiguous case, you can also use the **Silhouette method**. This metric measures how similar a point is to its own cluster compared to other clusters. The range of the Silhouette value is between -1 and +1 and the higher it is, the better. 
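
To see what this metric computes, here is a small sketch that calculates it by hand. For each point, `a` is the mean distance to the other points in its own cluster and `b` is the mean distance to the points of the nearest other cluster; the point's Silhouette value is `(b - a) / max(a, b)`, and the overall score is the mean over all points. The variable names are our own, and the sketch assumes every cluster contains at least two points.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import pairwise_distances, silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, cluster_std=1.2, random_state=0)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

D = pairwise_distances(X)  # Euclidean distance between every pair of points
clusters = np.unique(labels)
values = np.empty(len(X))
for i in range(len(X)):
    own = labels == labels[i]
    # a: mean distance to the other points in the same cluster.
    # D[i, i] is 0, so dividing the cluster sum by (size - 1) excludes point i.
    # Assumption: no cluster consists of a single point.
    a = D[i, own].sum() / (own.sum() - 1)
    # b: mean distance to the points of the nearest other cluster.
    b = min(D[i, labels == c].mean() for c in clusters if c != labels[i])
    values[i] = (b - a) / max(a, b)

manual_score = values.mean()
print(manual_score, silhouette_score(X, labels))
```

A value close to +1 means a point is much closer to its own cluster than to any other; a negative value suggests it may sit in the wrong cluster.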

In [None]:
from sklearn.metrics import silhouette_score
help(silhouette_score)

In [None]:
K = range(2, 10)
score = []
for k in K: 
    kmeans = KMeans(n_clusters=k, n_init=10)
    kmeans.fit(blobs)
    labels = kmeans.labels_
    sil_score = silhouette_score(blobs, labels, metric='euclidean')
    score.append(sil_score)

In [None]:
plt.plot(K, score, 'bx-')
plt.title('The Silhouette Score')
plt.xlabel('k')
plt.ylabel('Silhouette Score');

The k with the highest Silhouette score is taken as the best choice for the number of clusters. In a plot, this appears as a peak. 

In [None]:
for k, val in zip(K, score):
    print(f'No. clusters {k}: {val:.3f}')

It should be noted that the Elbow method and the Silhouette score are not alternatives to each other for finding the optimal K. Rather, they are tools to be used together for a more confident decision. 
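
One possible way to look at both at once is to collect the inertia and the Silhouette score for each k in a single table. This is only a sketch: the column names and the `comparison` DataFrame are choices made for this example.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, cluster_std=1.2, random_state=0)

rows = []
for k in range(2, 10):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    rows.append({
        "k": k,
        "inertia": model.inertia_,
        "silhouette": silhouette_score(X, model.labels_),
    })

# One row per candidate k, so the elbow and the Silhouette peak can be compared.
comparison = pd.DataFrame(rows).set_index("k")
print(comparison.round(3))
```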

<a id='scaling'></a>
## Scaling

**Distance-based algorithms** like k-means (but also k-nearest neighbors for supervised learning, or even support vector machines) are sensitive to the **scale** of the variables. 

Imagine data that has an age and an income variable. A typical range for age is 25-60, while income can range between \$25,000 and \$150,000. In measuring the distance to a cluster centroid, every variable is taken into account equally. A change of 25 in age (25 years old versus 50 years old) is treated the same as a change of 25 in income (\$25,000 versus \$25,025). This does not seem right, and it influences the clustering results. 
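
The arithmetic behind this can be checked with a few made-up rows. The `people` array below is hypothetical data invented for this illustration, and dividing by the standard deviation is just one simple way to put the features on a comparable scale.

```python
import numpy as np

# Hypothetical (age, income) rows, made up for illustration.
people = np.array([
    [25.0, 25_000.0],
    [50.0, 25_025.0],
    [25.0, 150_000.0],
    [60.0, 90_000.0],
])

# Persons 0 and 1: a 25-year age gap and a $25 income gap contribute equally.
diff = people[1] - people[0]
print(diff, np.linalg.norm(diff))

# Persons 0 and 2: same age, but the income gap alone gives a huge distance.
print(np.linalg.norm(people[2] - people[0]))

# Dividing each feature by its standard deviation puts both on a comparable scale:
# the age gap is now far larger than the income gap for persons 0 and 1.
scaled_diff = np.abs(diff) / people.std(axis=0)
print(scaled_diff)
```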

Let's recreate the blobs dataset, but make a small adjustment. 

In [None]:
blobs, blobs_labels = make_blobs(n_samples=300, 
                                 centers=4,
                                 cluster_std=0.60, 
                                 random_state=0)
plt.scatter(blobs[:, 0], blobs[:, 1], s=50);

In [None]:
blobs[:, 1] = blobs[:, 1] * 100
plt.scatter(blobs[:, 0], blobs[:, 1], s=50);

The values of the second feature (y-axis) have been multiplied by a factor of 100. 

In [None]:
np.min(blobs[:, 0]), np.max(blobs[:, 0])

In [None]:
np.min(blobs[:, 1]), np.max(blobs[:, 1])

The ranges of the features are no longer similar. Let's see what this does to the clustering algorithm.

In [None]:
kmeans = KMeans(n_clusters=4, n_init=10)
kmeans.fit(blobs)

blobs_kmeans = kmeans.predict(blobs)

In [None]:
plt.scatter(blobs[:, 0], blobs[:, 1], 
            c=blobs_kmeans, s=50, cmap='viridis')

centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='black', s=200, alpha=0.5);

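The clustering results have shifted! The second feature now dominates the distance computation, so the cluster assignments are driven almost entirely by the y-axis. Let's fix this by scaling the data before training the model.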

In [None]:
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.pipeline import Pipeline

scaler = StandardScaler()
scaler.fit(blobs)
blobs_transformed = scaler.transform(blobs) 

kmeans = KMeans(n_clusters=4, n_init=10)
kmeans.fit(blobs_transformed)

blobs_kmeans = kmeans.predict(blobs_transformed)

plt.scatter(blobs[:, 0], blobs[:, 1], 
            c=blobs_kmeans, s=50, cmap='viridis')

centers_scaled = kmeans.cluster_centers_
centers = scaler.inverse_transform(centers_scaled)

plt.scatter(centers[:, 0], centers[:, 1], c='black', s=200, alpha=0.5);
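
If you are curious what `StandardScaler` does to the data, the sketch below compares its output with a manual computation: subtract each column's mean and divide by each column's standard deviation. The names `X_scaled` and `X_manual` are our own.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=0)
X[:, 1] = X[:, 1] * 100  # same distortion of the second feature as above

scaler = StandardScaler().fit(X)
X_scaled = scaler.transform(X)

# Manual equivalent: subtract the column mean, divide by the column std.
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(np.allclose(X_scaled, X_manual))
# After scaling, every feature has mean 0 and standard deviation 1.
print(X_scaled.mean(axis=0).round(6), X_scaled.std(axis=0).round(6))
```

Because both features end up with the same spread, neither dominates the Euclidean distance any more.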

Below we repeat the process but using Scikit-Learn Pipelines.

In [None]:
pipe = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('model', KMeans(n_clusters=4, n_init=10))
])

pipe.fit(blobs)
blobs_kmeans = pipe.predict(blobs)

plt.scatter(blobs[:, 0], blobs[:, 1], 
            c=blobs_kmeans, s=50, cmap='viridis')

centers_scaled = pipe['model'].cluster_centers_
centers = pipe['scaler'].inverse_transform(centers_scaled)

plt.scatter(centers[:, 0], centers[:, 1], c='black', s=200, alpha=0.5);

# Summary

**Clustering** is the technique of grouping together points that in some form or another 'belong' together.

A popular approach to clustering is using the **k-means algorithm**. Each point in the dataset is assigned to a cluster through an iterative method based on cluster centroids. The number of cluster centroids needs to be determined upfront. 

Two helpful methods to help determine the appropriate number of cluster centroids are: 
* **The Elbow Method**: the inertia is plotted against k. The point where the decrease in inertia slows down sharply (the "elbow") indicates a good number of clusters. 
* **Silhouette score**: a score between -1 and +1. The highest Silhouette score corresponds to the best number of clusters for this particular problem. 

It is important to be mindful of **scaling** when working with distance-based methods, like k-means clustering.