<a href="https://colab.research.google.com/github/thurmboris/Data-Science_4_Sustainability/blob/main/07_Clustering/07_Clustering_Solutions.ipynb" target="_blank" rel="noopener"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# Import standard libraries
import os
os.environ["OMP_NUM_THREADS"] = '1'
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import time

# ML import
from sklearn import datasets        # datasets
from sklearn.cluster import KMeans  # K-Means algorithm
from sklearn.cluster import AgglomerativeClustering  # Hierarchical clustering
from scipy.cluster.hierarchy import dendrogram, linkage # dendrogram visualization

# Clustering

<img src='https://miro.medium.com/v2/resize:fit:4800/format:webp/0*ZxLMBwq9rmW9ZFuZ.jpg' width="800">

Source: [The difference between supervised and unsupervised learning](https://twitter.com/athena_schools/status/1063013435779223553), illustrated by [@Ciaraioch](https://twitter.com/Ciaraioch) 


## Content

The goal of this walkthrough is to provide you with insights on clustering, focusing on two methods: K-Means and hierarchical clustering. After presenting the main concepts, you will be introduced to the techniques used to implement the algorithms in Python. Finally, it will be your turn to practice, using an application on the customers of a shopping mall.

This notebook is organized as follows:
- [Background](#Background)
    - [Objective](#Objective)
    - [Algorithm overview](#Algorithm-overview)
- [Implementation](#Implementation)
    - [Discover dataset](#Discover-dataset)
    - [K-Means](#K-Means)
        - [Implementing K-Means](#Implementing-K-Means)
        - [Graphical representation](#Graphical-representation)
        - [Elbow method](#Elbow-method)
    - [Hierarchical clustering](#Hierarchical-clustering)
        - [Implementing hierarchical (agglomerative) clustering](#Implementing-hierarchical-(agglomerative)-clustering)
        - [Dendrogram visualization](#Dendrogram-visualization)
    - [Runtime complexity](#Runtime-complexity)
- [Your turn](#Your-turn)

## Background

### Objective

Clustering aims at creating groups of data points, with the goal of:
- organizing data into classes with high intra-class similarity and low inter-class similarity
- finding the class labels and the number of classes directly from the data (unlike classification, where the classes are predefined)
- finding natural groupings among objects

Clustering algorithms are thus **unsupervised learning** methods.

### Algorithm overview

Here is a table describing the different techniques available in the sklearn module `sklearn.cluster`. The [documentation](https://scikit-learn.org/stable/modules/clustering.html) contains a detailed description of each technique; you can explore it to deepen your understanding of each algorithm! You can also refer to the [Glossary](https://scikit-learn.org/stable/glossary.html) for definitions of technical terms.

| Method name | Parameters | Usecase | Geometry (metric used) |
| :- | :- | :- | :- |
| [K-Means](https://scikit-learn.org/stable/modules/clustering.html#k-means) | Number of clusters | General-purpose, even cluster size, flat geometry, not too many clusters, inductive | Distances between points|
| [Affinity propagation](https://scikit-learn.org/stable/modules/clustering.html#affinity-propagation) | Damping, sample preference | Many clusters, uneven cluster size, non-flat geometry, inductive | Graph distance (e.g. nearest-neighbor graph)|
| [Mean-shift](https://scikit-learn.org/stable/modules/clustering.html#mean-shift) | Bandwidth | Many clusters, uneven cluster size, non-flat geometry, inductive | Distances between points|
| [Spectral clustering](https://scikit-learn.org/stable/modules/clustering.html#spectral-clustering) | Number of clusters | Few clusters, even cluster size, non-flat geometry, transductive | Graph distance (e.g. nearest-neighbor graph)|
| [Ward hierarchical clustering](https://scikit-learn.org/stable/modules/clustering.html#hierarchical-clustering) | Number of clusters or distance threshold | Many clusters, possibly connectivity constraints, transductive | Distances between points|
| [Agglomerative clustering](https://scikit-learn.org/stable/modules/clustering.html#hierarchical-clustering) | Number of clusters or distance threshold, linkage type, distance| Many clusters, possibly connectivity constraints, non Euclidean distances, transductive | Any pairwise distance|
| [DBSCAN](https://scikit-learn.org/stable/modules/clustering.html#dbscan) | Neighborhood size | Non-flat geometry, uneven cluster sizes, outlier removal, transductive | Distances between nearest points|
| [OPTICS](https://scikit-learn.org/stable/modules/clustering.html#optics) | Minimum cluster membership | Non-flat geometry, uneven cluster sizes, variable cluster density, outlier removal, transductive | Distances between points|
| [Gaussian mixtures](https://scikit-learn.org/stable/modules/mixture.html#mixture) | Many | Flat geometry, good for density estimation, inductive | Mahalanobis distances to  centers|
| [BIRCH](https://scikit-learn.org/stable/modules/clustering.html#birch) | Branching factor, threshold, optional global clusterer | Large dataset, outlier removal, data reduction, inductive | Euclidean distance between points|
| [Bisecting K-Means](https://scikit-learn.org/stable/modules/clustering.html#bisect-k-means) | Number of clusters | General-purpose, even cluster size, flat geometry, not too many clusters, inductive | Distances between points|
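
The "inductive" and "transductive" labels in the table indicate whether a fitted model can assign new, unseen points to a cluster. The sketch below illustrates the difference on the Iris data. It is only an illustration: `new_flower` is a made-up observation, and `n_init=10` is an assumption chosen to stay compatible with older scikit-learn versions.

```python
import numpy as np
from sklearn import datasets
from sklearn.cluster import KMeans, AgglomerativeClustering

X_demo = datasets.load_iris().data

# K-Means is inductive: the fitted centroids can be used to label unseen points
km = KMeans(n_clusters=3, random_state=0, n_init=10).fit(X_demo)
new_flower = np.array([[5.0, 3.4, 1.5, 0.2]])  # hypothetical measurements (cm)
new_label = km.predict(new_flower)
print("Cluster assigned to the new point:", new_label[0])

# Agglomerative clustering is transductive: it only labels the data it was fitted on
has_predict = hasattr(AgglomerativeClustering(n_clusters=3), "predict")
print("AgglomerativeClustering has a predict method:", has_predict)
```

To label new points with a transductive method, one option is to refit the model on the extended dataset.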




Each method performs differently depending on the input data. They also differ in their complexity. The figure below illustrates these differences - check the computation time shown at the bottom right of each panel!

<center>
<img src='https://scikit-learn.org/stable/_images/sphx_glr_plot_cluster_comparison_001.png' width="800">
</center>

Source: [Scikit-learn Clustering Documentation](https://scikit-learn.org/stable/modules/clustering.html)       

## Implementation

### Discover dataset

We're going to use the Iris dataset, which contains measurements for 3 different types of iris flowers: *setosa*, *versicolor*, and *virginica*:

<img src='https://s3.amazonaws.com/assets.datacamp.com/blog_assets/Machine+Learning+R/iris-machinelearning.png' width="600">

The data includes, for each Iris flower, measures of width and length of sepals and petals. The dataset was originally created by Sir R.A. Fisher. We can directly obtain the dataset from `sklearn`, via the `datasets` module, which contains several toy datasets ([Documentation](https://scikit-learn.org/stable/datasets/toy_dataset.html)). Here is the import line:

```python
from sklearn import datasets
```

The Iris dataset is a classic in ML and is often used to discover, for instance, classification, clustering, and dimensionality reduction. Let's discover it!

In [None]:
# Load the Iris dataset from sklearn
iris = datasets.load_iris()

In [None]:
print("The different types of irises are:", iris.target_names)

The Iris dataset is saved as a set of numpy arrays. We're going to transform it into a pandas dataframe.

**Note:** We could also use the numpy array format for K-Means clustering.

In [None]:
# Note that our dataframe only includes data about the flowers, and NOT the actual type of flowers
X = pd.DataFrame(iris.data, columns=iris.feature_names)
X.head()

We have 4 different features stored in X. For now we'll work with only the sepal features: "sepal length (cm)" and "sepal width (cm)":

In [None]:
# Dataframe with sepal features
X_sepal = X.loc[:, ["sepal length (cm)","sepal width (cm)"]]
X_sepal.head()

The species are encoded by labels:

In [None]:
print("Species are encoded as:", iris.target)

We save these labels in a dataframe, indicating for each observation (flower) which kind it is.

In [None]:
# Dataframe with flower type (labels)
y=pd.DataFrame(iris.target, columns=["Flower_type"])
y.head()

Let's check how many observations we have for each type of flower:

In [None]:
y.value_counts()

Let's print some summary statistics, for each type of flower:

In [None]:
# Summary statistics
pd.concat([X_sepal, y], axis=1).groupby(['Flower_type']).describe().loc[:,(slice(None),['max','min','mean'])].transpose().sort_index()

### K-Means

#### Implementing K-Means

We are using the `KMeans` module of sklearn ([Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html#sklearn.cluster.KMeans))

```python
from sklearn.cluster import KMeans
```
As a parameter, we need to specify `n_clusters`, the number of clusters to form (and hence the number of centroids to generate). For illustration, we're going to train two K-Means models and fit them on the sepal features (X_sepal):
-   K-means with 3 clusters
-   K-means with 5 clusters

The one with 5 clusters is only for illustration, because we already know that there are only 3 different types of iris in the dataset.

In [None]:
# Create an instance of KMeans and specify the number of clusters (3)
# Setting random_state helps make sure we all get exactly the same results
kmeans3 = KMeans(n_clusters=3, random_state=0, n_init='auto') #3 clusters

# Fit the model on the set of features we previously labelled as X_sepal (NOT including the labels on the type of flowers)
kmeans3.fit(X_sepal)

We can access the `labels_` of each observation:

In [None]:
print(kmeans3.labels_)

We can use `cluster_centers_` to obtain the coordinates of the centers generated by the model:

In [None]:
print(kmeans3.cluster_centers_)

Let's proceed similarly, this time with 5 clusters:

In [None]:
# Create an instance of KMeans
kmeans5=KMeans(n_clusters=5, random_state=0, n_init='auto') #5 clusters

# Fit the model on the X features
kmeans5.fit(X_sepal)

# Labels
print(kmeans5.labels_)

#### Graphical representation

Let's visualize the clusters created, and compare them with the "true" labels (flower types).

In [None]:
fig, ax = plt.subplots(1, 3, figsize=(14, 4))

# Scatter plot, raw data with original labels
ax[0].scatter(X_sepal["sepal length (cm)"], # x-axis
              X_sepal["sepal width (cm)"],  # y-axis
              c=y['Flower_type'],           # points colored by the different flower types
              cmap='tab10')                 # choice of colors
ax[0].set_xlabel("Sepal length (cm)")       # label x-axis
ax[0].set_ylabel("Sepal width (cm)")        # label y-axis
ax[0].set_title("Raw data with the original labels")  # title

# Scatter plot of clusters, KMeans Model with 3 Clusters
ax[1].scatter(X_sepal["sepal length (cm)"], 
              X_sepal["sepal width (cm)"], 
              c=kmeans3.labels_,              # points colored by the labels created by the model
              cmap='tab10')
ax[1].scatter(kmeans3.cluster_centers_[:, 0],  # x-coordinates of cluster centroids
              kmeans3.cluster_centers_[:, 1],  # y-coordinates of cluster centroids
              c="red",                        # color of centroids
              marker='x')                     # marker of centroids
ax[1].set_xlabel("Sepal length (cm)")
ax[1].set_ylabel("Sepal width (cm)")
ax[1].set_title("KMeans Model with 3 Clusters")

# Scatter plot of clusters, KMeans Model with 5 Clusters
ax[2].scatter(X_sepal["sepal length (cm)"], X_sepal["sepal width (cm)"], c=kmeans5.labels_, cmap='tab10')
ax[2].scatter(kmeans5.cluster_centers_[:, 0], kmeans5.cluster_centers_[:, 1], c="red", marker='x')
ax[2].set_xlabel("Sepal length (cm)")
ax[2].set_ylabel("Sepal width (cm)")
ax[2].set_title("KMeans Model with 5 Clusters")

plt.subplots_adjust(wspace=0.4)   # Space between plots
plt.show()

Our K-Means model with 3 clusters identifies the *setosa* flowers well (top left cluster); however, it struggles to distinguish *versicolor* from *virginica*, since these flowers have interwoven sepal lengths and widths. Note that the original labels and the ones created by our model do not match (e.g., *setosa* is label 0 in the original dataset, and 1 in our K-Means model), hence the colors in our plot differ.
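
Because cluster numbers are arbitrary, a contingency table is a convenient way to see which cluster corresponds to which species. The sketch below is one possible check rather than part of the original workflow. It refits a 3-cluster model on the sepal features so that it runs on its own, and uses `n_init=10` (an assumption, chosen for compatibility with older scikit-learn versions), so the cluster numbers may differ from those of `kmeans3`.

```python
import pandas as pd
from sklearn import datasets
from sklearn.cluster import KMeans

iris_demo = datasets.load_iris()
X_sepal_demo = iris_demo.data[:, :2]   # sepal length and sepal width

km3 = KMeans(n_clusters=3, random_state=0, n_init=10).fit(X_sepal_demo)

# Rows: true species, columns: cluster number assigned by K-Means
species = pd.Series(iris_demo.target_names[iris_demo.target], name="species")
clusters = pd.Series(km3.labels_, name="cluster")
table = pd.crosstab(species, clusters)
print(table)
```

A species that is well recovered shows up as a row with almost all of its 50 flowers in a single column, whereas interwoven species are spread over several columns.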

How do we assess the performance of our models? In the general case we do not have the target variable (unsupervised learning), so we cannot rely on the metrics used in classification such as the accuracy. Instead, we rely on other metrics such as the **inertia**, which is the sum of squared distances of samples to their closest cluster center, potentially weighted by the sample weights if provided. This is the cost function that the algorithm minimizes. Let's check the inertia of our K-Means models with 3 and 5 clusters:

In [None]:
print("The inertia of the K-Means model with 3 clusters is: {:0.2f}".format(kmeans3.inertia_))
print("The inertia of the K-Means model with 5 clusters is: {:0.2f}".format(kmeans5.inertia_))

Which model should we choose? In other words, how many clusters should we pick? We'll explore this question using the Elbow method.
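
As a quick check of the inertia definition given above, the sketch below recomputes it by hand as the sum of squared distances between each observation and the centroid of its assigned cluster, and compares the result with the `inertia_` attribute. The model is refitted here so that the block is self-contained; `n_init=10` is an assumption and the value may therefore differ slightly from the one printed for `kmeans3`.

```python
import numpy as np
from sklearn import datasets
from sklearn.cluster import KMeans

X_demo = datasets.load_iris().data[:, :2]   # sepal length and sepal width
km = KMeans(n_clusters=3, random_state=0, n_init=10).fit(X_demo)

# Centroid assigned to each observation (one row per observation)
assigned_centers = km.cluster_centers_[km.labels_]

# Sum of squared Euclidean distances to the assigned centroid
manual_inertia = np.sum((X_demo - assigned_centers) ** 2)

print(f"Manual inertia: {manual_inertia:.4f}")
print(f"model.inertia_: {km.inertia_:.4f}")
```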

#### Elbow method

We now try to find the "optimal" number of clusters using the [Elbow method](https://en.wikipedia.org/wiki/Elbow_method_(clustering)).  

This method consists of plotting a measure of the fit (e.g., **inertia**) as a function of the number of clusters, and picking the elbow of the curve as the number of clusters to use. The intuition is that increasing the number of clusters will always improve the fit (explain more of the variation), since there are more parameters (more clusters). However, this will at some point result in **over-fitting**, with only minimal gains in the fit, which the elbow reflects. 

Let's try! We create a loop to iteratively train K-Means algorithms for different values of k, saving the parameter `inertia_` at each iteration. This time we will use all the features, i.e., sepal and petal length and width. We then plot the inertia compared to the number of clusters:

In [None]:
inertias = []
nbr_clusters = range(2,11)

for i in nbr_clusters:
    km = KMeans(n_clusters=i, random_state=0, n_init='auto').fit(X)  # Create and fit model
    inertias.append(km.inertia_)     # Store inertia

# Plot      
plt.plot(nbr_clusters, inertias, '-o')
plt.xticks(nbr_clusters)
plt.title('Elbow method for inertia')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Inertia')
plt.show()

The elbow method tells us to select the number of clusters at which the decrease in inertia (i.e., cost) slows down markedly. In this case, 4 seems like the optimal number of clusters: from k=5 onwards, the reduction in the cost function is much smaller than, for example, at k=3.
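
Reading an elbow off a plot is somewhat subjective. One way to support the visual impression is to print how much the inertia drops each time a cluster is added. The sketch below does this on the full Iris features; it is an illustrative aid rather than a formal selection rule, and `n_init=10` is an assumption, so the values may differ slightly from the plot above.

```python
from sklearn import datasets
from sklearn.cluster import KMeans

X_demo = datasets.load_iris().data
ks = list(range(2, 11))
inertias_demo = [
    KMeans(n_clusters=k, random_state=0, n_init=10).fit(X_demo).inertia_
    for k in ks
]

# Relative reduction in inertia obtained by going from k-1 to k clusters
for k, prev, curr in zip(ks[1:], inertias_demo[:-1], inertias_demo[1:]):
    reduction = 100 * (prev - curr) / prev
    print(f"k={k}: inertia {curr:8.2f}, reduction vs k={k-1}: {reduction:5.1f}%")
```

The elbow is then the value of k after which the reductions become small and roughly constant.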

### Hierarchical clustering

#### Implementing hierarchical (agglomerative) clustering

We are using the `AgglomerativeClustering` module of sklearn ([Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html#sklearn.cluster.AgglomerativeClustering))

```python
from sklearn.cluster import AgglomerativeClustering
```
As parameters, we specify:
- `n_clusters`, number of clusters to find (default =2)
    - Instead of specifying the number of clusters, we could also provide the `distance_threshold`, which is the linkage distance threshold at or above which clusters will not be merged. In this case, `n_clusters` must be `None` and `compute_full_tree` must be `True`.
- `metric` is the type of distance metric used, e.g., "euclidean" (default), "l1", "l2", "manhattan", "cosine", or "precomputed"
    - Note: check which sklearn version you are using. `metric` was added in version 1.2. For earlier versions, you can use `affinity` instead (or, better, update your sklearn version).
- `linkage` is the linkage criterion to use:
    - 'single': minimum of the distances between all observations of the two sets
    - 'complete': maximum of the distances between all observations of the two sets
    - 'average': average of the distances of each observation of the two sets
    - 'ward' (default): minimizes the variance of the clusters being merged
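
To illustrate the `distance_threshold` option described in the list above, here is a minimal sketch on the sepal features of the Iris data. The threshold value of 5.0 is arbitrary and chosen only for illustration; a suitable value depends on the data and on the linkage criterion.

```python
from sklearn import datasets
from sklearn.cluster import AgglomerativeClustering

X_demo = datasets.load_iris().data[:, :2]   # sepal length and sepal width

# n_clusters must be None when distance_threshold is set
model = AgglomerativeClustering(n_clusters=None,
                                distance_threshold=5.0,   # arbitrary illustrative value
                                compute_full_tree=True,
                                linkage="ward")
model.fit(X_demo)

# The number of clusters is now an output of the model
print("Number of clusters found:", model.n_clusters_)
```

Lowering the threshold stops the merging earlier and therefore yields more clusters.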

<img src='https://scikit-learn.org/stable/_images/sphx_glr_plot_linkage_comparison_001.png' width="500">

In [None]:
# Create model
agglomerative3 = AgglomerativeClustering(n_clusters=3, metric='euclidean', linkage='average')

# Fit model
agglomerative3.fit(X_sepal)

As before we can access the labels:

In [None]:
print(agglomerative3.labels_)

Let's plot our clusters to visually compare our results to the K-Means algorithm:

In [None]:
fig, ax = plt.subplots(1, 3, figsize=(14, 4))

# Scatter plot, raw data with original labels
ax[0].scatter(X_sepal["sepal length (cm)"], # x-axis
              X_sepal["sepal width (cm)"],  # y-axis
              c=y['Flower_type'],           # points colored by the different flower types
              cmap='tab10')                 # choice of colors
ax[0].set_xlabel("Sepal length (cm)")       # label x-axis
ax[0].set_ylabel("Sepal width (cm)")        # label y-axis
ax[0].set_title("Raw data with the original labels")  # title

# Scatter plot of clusters, KMeans Model with 3 Clusters
ax[1].scatter(X_sepal["sepal length (cm)"], 
              X_sepal["sepal width (cm)"], 
              c=kmeans3.labels_,              # points colored by the labels created by the model
              cmap='tab10')
ax[1].scatter(kmeans3.cluster_centers_[:, 0],  # x-coordinates of cluster centroids
              kmeans3.cluster_centers_[:, 1],  # y-coordinates of cluster centroids
              c="red",                        # color of centroids
              marker='x')                     # marker of centroids
ax[1].set_xlabel("Sepal length (cm)")
ax[1].set_ylabel("Sepal width (cm)")
ax[1].set_title("KMeans Model with 3 Clusters")

# Scatter plot of clusters, agglomerative clustering with 3 Clusters
ax[2].scatter(X_sepal["sepal length (cm)"], X_sepal["sepal width (cm)"], c=agglomerative3.labels_, cmap='tab10')
ax[2].set_xlabel("Sepal length (cm)")
ax[2].set_ylabel("Sepal width (cm)")
ax[2].set_title("Agglomerative Model with 3 Clusters")

plt.subplots_adjust(wspace=0.4)   # Space between plots
plt.show()

#### Dendrogram visualization

In this section, we're going to present a way to create a **Dendrogram**, using the `scipy.cluster.hierarchy` library ([Documentation](https://docs.scipy.org/doc/scipy/reference/cluster.hierarchy.html)). We're going to use: 
- `dendrogram`: to plot the dendrogram ([Documentation](https://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.dendrogram.html#scipy.cluster.hierarchy.dendrogram))
- `linkage`: to specify the type of linkage between the clusters ([Documentation](https://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.linkage.html#scipy.cluster.hierarchy.linkage))

In `linkage`, we can specify the `metric` (e.g., 'euclidean') and the `method` (e.g., 'single', 'complete', 'average', or 'ward' - see the documentation for more details).

In [None]:
# Provide the linkage method we want and the chosen distance metric.
method_Z = 'average' 
Z = linkage(X, method = method_Z, metric = 'euclidean')

# Plot the dendrogram for the chosen linkage method
plt.figure(figsize=(16, 4))
dendrogram(Z) # Plot the dendrogram according to the linkage
plt.title('Dendrogram - '+method_Z, fontsize=14)
plt.xlabel('Index of observations', fontsize=14)
plt.ylabel('Distance', fontsize=14)
plt.show()

Feel free to try different methods and visualize the difference!
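
A dendrogram can also be turned into flat cluster labels. The sketch below uses `fcluster` from `scipy.cluster.hierarchy` to cut the tree so that at most 3 clusters remain. It recomputes the linkage on the Iris features so that it runs on its own; the choice of 3 clusters is an assumption based on the number of species.

```python
import numpy as np
from sklearn import datasets
from scipy.cluster.hierarchy import linkage, fcluster

X_demo = datasets.load_iris().data
Z_demo = linkage(X_demo, method="average", metric="euclidean")

# Cut the tree so that at most 3 flat clusters remain
flat_labels = fcluster(Z_demo, t=3, criterion="maxclust")

# fcluster labels start at 1, so index 0 of bincount is dropped
print("Cluster sizes:", np.bincount(flat_labels)[1:])
```

With `criterion="distance"`, `t` is instead interpreted as the height at which the dendrogram is cut.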

### Runtime complexity

Let's compare the computation time needed between K-means and hierarchical clustering for different numbers of points. To do so, we are using the `time` library ([Documentation](https://docs.python.org/3/library/time.html)).

We'll start by creating clusters of points. We are using `make_blobs` to generate our dataset ([Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_blobs.html#sklearn.datasets.make_blobs)), and will first define a function with input the number of data points and output the generated samples (X) and associated labels (y):

In [None]:
# We create a function that generates 3 clusters
def generate_three_clusters(num_points):
    centers = [(-15, -15), (0, 0), (15, 15)]
    cluster_std = [2, 3, 2]
    X, y = datasets.make_blobs(n_samples=num_points, cluster_std=cluster_std, centers=centers, n_features=2, random_state=1)
    return X, y

# Example with 100 points
X, y = generate_three_clusters(100)
# Plot clusters
plt.figure(figsize=(4,2))
plt.scatter(X[y == 0, 0], X[y == 0, 1], color="red", s=10)
plt.scatter(X[y == 1, 0], X[y == 1, 1], color="blue", s=10)
plt.scatter(X[y == 2, 0], X[y == 2, 1], color="green", s=10)
plt.title('Number of points: 100')
plt.show()

Next we generate 3 clusters using the above-defined function for n = 100, 1000, 2500, 5000, 7500, 10000, 25000 points, storing the result in a list:

In [None]:
# Define list
X_list = []
# Define numbers of points
num_points = [100, 1000, 2500, 5000, 7500, 10000, 25000]

for n in num_points:
    X, y = generate_three_clusters(n)  # Generate three clusters
    X_list.append(X)                   # Append X to X_list

The first item of our list contains 100 points, the second 1000 points, etc.

In [None]:
print(X_list[1].shape)

Now we create K-Means and hierarchical clustering models (with 3 clusters), and train them on each of the datasets generated above. We store the execution times in two lists, one for K-Means, the other for hierarchical clustering:  

In [None]:
# Record time in list
k_means_time = []
hc_time = []

for X in X_list:
    # K-Means
    model = KMeans(n_clusters=3, n_init='auto')    # Create instance of KMeans class (with 3 clusters)
    start = time.time()                            # Start recording time
    model.fit(X)                                   # Fit the model on X
    end = time.time()                              # End recording time
    k_means_time.append(end-start)                 # Store the execution time in k_means_time
    # Hierarchical clustering
    model = AgglomerativeClustering(n_clusters=3)  # Create instance of AgglomerativeClustering class (3 clusters)
    start = time.time()
    model.fit(X)
    end = time.time()
    hc_time.append(end-start)

Let's plot the result!

In [None]:
plt.figure(figsize=(6,4))
plt.scatter(num_points, k_means_time, label='K-Means')
plt.scatter(num_points, hc_time, label='Hierarchical clustering')
plt.ylabel('Execution time')
plt.xlabel('Number of observations')
plt.legend()
plt.show()

We can see that for a small number of observations, K-Means takes a bit longer than hierarchical clustering. However, when the number of observations increases, hierarchical clustering takes much longer than K-Means. Indeed, the hierarchical clustering algorithm needs to compute the distance between every pair of observations, and then iteratively between the clusters created. 
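
To get a feel for this growth, the sketch below counts the number of pairwise distances, n(n-1)/2, for the dataset sizes used above, and checks the formula on a small random sample with `scipy.spatial.distance.pdist`. It only illustrates the quadratic growth of a naive all-pairs computation; it is not a description of how scikit-learn organizes the computation internally.

```python
import numpy as np
from scipy.spatial.distance import pdist

num_points_demo = [100, 1000, 2500, 5000, 7500, 10000, 25000]

# Number of distinct pairs among n observations: n * (n - 1) / 2
for n in num_points_demo:
    print(f"n = {n:>6}: {n * (n - 1) // 2:>12,} pairwise distances")

# Check the formula on a small random sample
rng = np.random.default_rng(0)
sample = rng.normal(size=(100, 2))
n_pairs = len(pdist(sample))   # condensed vector of pairwise distances
print("pdist length for n = 100:", n_pairs)
```

Multiplying the number of observations by 10 multiplies the number of pairs by roughly 100, whereas the cost of one K-Means iteration grows only linearly with the number of observations.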

## Your turn!

Now it's your turn to practice. We will use a dataset on the customers of a shopping mall, exploring clustering the customers based on their annual income and spending score to see if there are distinguishable clusters which the mall can target.

The dataset was obtained from Dr. Tirthajyoti Sarkar's GitHub repository [Machine-Learning-with-Python](https://github.com/tirthajyoti/Machine-Learning-with-Python). I recommend checking it out since it contains amazing ML tutorials and practice notebooks. The following exercise is inspired by the 'Hierarchical_Clustering' notebook of the repository.

In [None]:
# Import dataset
url = 'https://raw.githubusercontent.com/thurmboris/Data-Science_4_Sustainability/main/data/Mall_Customers.csv'
df = pd.read_csv(url)
df.head(10)

- Discover your dataset, looking at summary statistics, and histograms of income and spending score

In [None]:
# YOUR CODE HERE

# Summary statistics
df.describe()

In [None]:
# Histogram of income and spending score
fig, ax = plt.subplots(1, 2, figsize=(12, 4))

# Income
ax[0].set_title("Annual income distribution",fontsize=12)
ax[0].set_xlabel ("Annual income (k$)",fontsize=10)
sns.histplot(df['Annual Income (k$)'], ax = ax[0])

# Spending score
ax[1].set_title("Spending score distribution",fontsize=12)
ax[1].set_xlabel ("Spending Score (1-100)",fontsize=10)
sns.histplot(df['Spending Score (1-100)'], color = 'green', ax = ax[1])

plt.show()

- Do a scatter plot of income and spending score... Does your plot help you define the number of clusters?

In [None]:
# YOUR CODE HERE

plt.figure(figsize=(5, 4))
plt.title("Annual Income and Spending Score")
sns.scatterplot(data = df, x='Annual Income (k$)', y='Spending Score (1-100)', c='darkcyan')
plt.show()

Dendrograms can also be used to gain insights on the optimal number of clusters.

- Plot a dendrogram, varying the linkage method. How many clusters do you think is optimal? You can apply the following technique:
    - Look for the longest stretch of vertical line which is not crossed by any "extended horizontal lines" (horizontal lines grouping clusters that are extended infinitely in both directions).
    - Take any point on that stretch of line and draw an imaginary horizontal line.
    - Count how many vertical lines this imaginary line crosses.
    - That is likely to be the optimal number of clusters. 
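
The visual rule above can also be approximated numerically: the longest uncrossed vertical stretch corresponds to the largest gap between two consecutive merge distances in the linkage matrix. The sketch below applies this idea to synthetic stand-in data (three well-separated blobs generated with `make_blobs`, an assumption made so that the block runs without the mall dataset). It is a shortcut for the graphical technique, not a guaranteed way to find the "right" number of clusters.

```python
import numpy as np
from sklearn.datasets import make_blobs
from scipy.cluster.hierarchy import linkage

# Synthetic stand-in data (an assumption): three well-separated blobs
centers = [(0, 0), (20, 0), (10, 17)]
X_demo, _ = make_blobs(n_samples=150, centers=centers, cluster_std=1.0, random_state=0)

Z_demo = linkage(X_demo, method="ward", metric="euclidean")
heights = Z_demo[:, 2]     # merge distances, in increasing order for Ward linkage
gaps = np.diff(heights)    # vertical stretch between consecutive merges
i = int(np.argmax(gaps))   # the largest stretch lies between merges i and i+1

# Cutting inside that stretch keeps merges 0..i, leaving n - (i + 1) clusters
suggested_k = len(X_demo) - (i + 1)
print("Suggested number of clusters:", suggested_k)
```

On real data the largest gap often points to 2 clusters, so it is worth inspecting the next largest gaps as well.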

*We provide below an interactive code to let the user change the linkage method. It is using the `ipywidgets` library. See the [Documentation](https://ipywidgets.readthedocs.io/en/stable/)*

In [None]:
# Import ipywidgets library
import ipywidgets as widgets
from ipywidgets import interact, interact_manual

In [None]:
# YOUR CODE HERE

X = df[['Annual Income (k$)', 'Spending Score (1-100)']]

@interact
def interactive_dendrogram(Method = ['ward', 'single', 'complete', 'average']):
    plt.figure(figsize=(16,6))
    Z = linkage(X, method = Method, metric = 'euclidean')
    dendrogram(Z)
    plt.title('Dendrogram - '+Method, fontsize=14)
    plt.xlabel('Customers', fontsize=14)
    plt.ylabel('Distance', fontsize=14)
    plt.show()

How many clusters should we pick? Let's apply the method described above to the dendrogram obtained with the Ward method. We notice that, for distances between about 100 and 240, the vertical lines are not crossed by any extended horizontal line. If we draw a horizontal line at a distance of 200, we can count 5 clusters: 

In [None]:
Z = linkage(X, method = 'ward', metric = 'euclidean')

plt.figure(figsize=(16,6))
dendrogram(Z)
plt.hlines(y=200,xmin=0,xmax=2000,colors='k',linestyles='--')
plt.text(x=850,y=210,s='Horizontal line crossing 5 vertical lines',fontsize=12)
plt.title('Dendrogram - ward', fontsize=14)
plt.xlabel('Customers', fontsize=14)
plt.ylabel('Distance', fontsize=14)
plt.show()

We will now implement clustering algorithms using 5 clusters representing 5 customer groups:
- *Careful* - high income but low spenders
- *Standard* - middle income and middle spenders
- ***Target group*** - middle-to-high income and high spenders (should be targeted by the mall)
- *Careless* - low income but high spenders (should be avoided because of possible credit risk)
- *Sensible* - low income and low spenders

Let's start with K-Means.

- Implement a K-Means algorithm, using as features the annual income and the spending score, with 5 clusters. Also print the execution time.

In [None]:
# YOUR CODE HERE

model_kmeans = KMeans(n_clusters=5, random_state=0, n_init='auto')    # KMeans model with 5 clusters
start_kmeans = time.time()                                            # Start recording time
model_kmeans.fit(X)                                                   # Fit the model on X
end_kmeans = time.time()                                              # End recording time
k_means_time = end_kmeans-start_kmeans                                # Store the execution time

print(f"K-Means model (5 clusters) execution time: {k_means_time}")

- Make a scatter plot of the annual income and spending score, colored by the cluster they belong to, adding to the figure the cluster centers

In [None]:
# YOUR CODE HERE

# Add a column to our dataframe with the cluster labels
df['cluster'] = model_kmeans.labels_

# Plot
plt.figure(figsize=(8, 5))
plt.title("Clustering of customers")

# Scatter plot colored by clusters
sns.scatterplot(data = df, 
                x='Annual Income (k$)', 
                y='Spending Score (1-100)', 
                hue='cluster',                                                     # Color by cluster
                palette = ['goldenrod','royalblue','green','firebrick','plum'],    # Choice of colors
                style = 'cluster')                                                 # Style of markers by cluster
plt.legend(loc = 'right',labels=['Sensible','Careful','Standard','Target','Careless'])  # Label legend

# Area of interest (target customers) in light green
plt.axhspan(ymin=60,ymax=100,xmin=0.4,xmax=0.96,alpha=0.3,color='lightgreen')     

# Cluster centers
plt.scatter(model_kmeans.cluster_centers_[:, 0],    # x-coordinates of cluster centroids
              model_kmeans.cluster_centers_[:, 1],  # y-coordinates of cluster centroids
              c="red",                              # color of centroids
              marker='x')                           # marker of centroids

plt.show()

- Implement the elbow method to determine the optimal number of clusters. Does the elbow method confirm our previous choice of 5 clusters?

In [None]:
# YOUR CODE HERE

inertias = []
nbr_clusters = range(2,11)

for i in nbr_clusters:
    km = KMeans(n_clusters=i, random_state=0, n_init='auto').fit(X)  # Create and fit model
    inertias.append(km.inertia_)     # Store inertia

# Plot
plt.figure(figsize=(6, 4))
plt.plot(nbr_clusters, inertias, '-o', color='darkturquoise')
plt.xticks(nbr_clusters)
plt.vlines(x=5,ymin=0,ymax=200000,linestyles='--', color='k')
plt.text(x=5.25,y=80000,s='5 clusters seem optimal', fontsize=10)
plt.title('Elbow method for inertia')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Inertia')
plt.show()

- Implement a hierarchical algorithm with 5 clusters and the linkage method of your choice. Print the execution time

In [None]:
# YOUR CODE HERE

# Hierarchical clustering with 5 clusters
model_hc = AgglomerativeClustering(n_clusters=5,
                                  metric='euclidean', 
                                  linkage='ward') 
start = time.time()
model_hc.fit(X)
end = time.time()
hc_time=end-start

print(f"Hierarchical clustering model (5 clusters) execution time: {hc_time}")

- Make a scatter plot of the annual income and spending score, colored by the cluster they belong to. How do your algorithms compare?

In [None]:
# YOUR CODE HERE

# Add a column to our dataframe with the cluster labels
df['cluster hc'] = model_hc.labels_

# We will do two subplot, one for K-Means, the other for hierarchical clustering
fig, ax = plt.subplots(1, 2, figsize=(15, 4))

fig.suptitle("Clustering of customers")

# K-Means
ax[0].set_title("K-Means")
sns.scatterplot(data = df, 
                x='Annual Income (k$)', 
                y='Spending Score (1-100)', 
                hue='cluster',                                                     # Color by cluster
                palette = ['goldenrod','royalblue','green','firebrick','plum'],    # Choice of colors
                style = 'cluster',                                                 # Style of markers by cluster
                ax = ax[0])                                                        # Position
ax[0].legend(loc = 'right',labels=['Sensible','Careful','Standard','Target','Careless'])  # Label legend

# Hierarchical clustering
ax[1].set_title("Hierarchical clustering")
sns.scatterplot(data = df, 
                x='Annual Income (k$)', 
                y='Spending Score (1-100)', 
                hue='cluster hc',                                                  # Color by cluster
                palette = ['goldenrod','royalblue','green','firebrick','plum'],    # Choice of colors
                style = 'cluster hc',                                              # Style of markers by cluster
                ax = ax[1])                                                        # Position
ax[1].legend(loc = 'right',labels=['Sensible','Careful','Standard','Target','Careless'])  # Label legend

plt.show()

Compared with K-Means, hierarchical clustering produces a larger cluster of "Standard" customers, while the "Target" customers are the same in both models.
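
Beyond visual inspection, the agreement between two clusterings can be quantified, for instance with the adjusted Rand index from `sklearn.metrics`, which ignores how the clusters are numbered. The sketch below uses synthetic stand-in data generated with `make_blobs` (an assumption, so that the block runs without downloading the mall dataset) and `n_init=10` for compatibility with older scikit-learn versions.

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import adjusted_rand_score

# Synthetic stand-in for the two customer features (an assumption)
X_demo, _ = make_blobs(n_samples=200, centers=5, cluster_std=1.0, random_state=0)

labels_km = KMeans(n_clusters=5, random_state=0, n_init=10).fit_predict(X_demo)
labels_hc = AgglomerativeClustering(n_clusters=5, linkage="ward").fit_predict(X_demo)

# 1.0 means identical partitions (up to label renaming);
# values near 0 mean agreement no better than chance
ari = adjusted_rand_score(labels_km, labels_hc)
print(f"Adjusted Rand index between the two clusterings: {ari:.3f}")
```

With the variables of this notebook, the same comparison could be written as `adjusted_rand_score(df['cluster'], df['cluster hc'])`.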