
## Unsupervised Learning in Machine Learning - Clustering

In this notebook you will get familiar with the K-means clustering algorithm using the PyCaret Python package. First, let's walk through PyCaret's clustering functions with a dummy dataset.




# **Import libraries**

In [None]:
# install PyCaret this way if you are running this notebook in a Google Colab environment.
# !pip install joblib==1.3.2
# !pip install pycaret 

In [None]:
from pycaret.clustering import * #importing pycaret clustering module
import plotly.express as px

import pandas as pd

from matplotlib import pyplot as plt
import seaborn as sns

from sklearn.datasets import make_blobs # to generate new datasets/dummy dataset
import random



# **Make simple dummy dataset for clustering**


In [None]:
# seed Python's built-in RNG; make_blobs is seeded separately through random_state
random.seed(42)
# create points in 4 clusters
# X contains the features (coordinates) of the generated data points,
# y contains the corresponding cluster label for each data point.
X, y = make_blobs(n_samples=200, centers=4, random_state=42, cluster_std=1.5)

In [None]:
X

In [None]:
y

In [None]:
# visualize the example data

# draw a scatterplot of the two features; no hue is set, so no palette is needed
sns.scatterplot(x=X[:, 0], y=X[:, 1])
# x=X[:, 0] takes the first column of X (the first feature) for the x-axis.
# y=X[:, 1] takes the second column of X (the second feature) for the y-axis.

# **PyCaret**
PyCaret is a low-code and beginner-friendly machine learning (ML) library in Python that automates and speeds up the ML workflow. PyCaret replaces hundreds of lines of code with only a few.

# **Clustering in PyCaret**
* PyCaret's clustering module provides several pre-processing features that can be configured when initializing the setup through the **`setup()`** function.

* It has several algorithms and plots to analyze the results. PyCaret's clustering module also implements a unique function called **`tune_model()`** that allows you to tune the hyperparameters of a clustering model to optimize a supervised learning objective such as R^2 for regression.

* **`setup()`** is PyCaret's main function and it needs to be run before executing any other function in PyCaret. The `setup()` function initializes the environment in PyCaret and creates the transformation pipeline to prepare the data for modeling and deployment.

* When setup() is executed, PyCaret's inference algorithm will automatically infer the data types for all features based on certain properties. Ensuring that the data types are correct is of fundamental importance in PyCaret as it automatically performs a few pre-processing tasks which are imperative to any machine learning experiment.

* These tasks are performed differently for each data type which means it is very important for them to be correctly configured.
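
Since the pre-processing depends on the inferred data types, it can help to check them before calling `setup()`. The following is a minimal sketch, assuming the data is wrapped in a pandas DataFrame. The column names `x_coord` and `y_coord` are made up for illustration and are not part of the original notebook.

```python
import pandas as pd
from sklearn.datasets import make_blobs

# same dummy data as generated earlier in this notebook
X, y = make_blobs(n_samples=200, centers=4, random_state=42, cluster_std=1.5)

# wrap the array in a DataFrame with explicit (hypothetical) column names
df = pd.DataFrame(X, columns=["x_coord", "y_coord"])

# inspect the data types; both columns are float64 here, i.e. numeric features
print(df.dtypes)

# the DataFrame could then be passed to setup() in place of the raw array, e.g. setup(df)
```

With named columns, plots that take a `feature` argument can refer to these names instead of autogenerated ones.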




In [None]:
# setup() has lots of optional parameters for e.g. preprocessing, but let's run it with defaults
s = setup(X)

In [None]:
# pycaret offers many clustering algorithms we can compare
models()

# **Create a model**
Next, let's create and train a **kmeans model**. Without additional parameters it uses **4 clusters** by default, but if you know the number of clusters beforehand you can pass it with the `num_clusters` parameter. In this case we know there are supposed to be 4 clusters, so we pass that value explicitly.

In [None]:
kmeans = create_model('kmeans', num_clusters=4)

In [None]:
kmeans_cluster = assign_model(kmeans)
kmeans_cluster


# **Silhouette Coefficient/silhouette score**
PyCaret will print some useful metrics.

**Silhouette Coefficient** or **silhouette score** is a metric used to assess the quality of a clustering.
* Its value ranges from -1 to 1.
* 1 means that clusters are well apart from each other and clearly distinguished,
* 0 means that clusters are indifferent, i.e. the distance between clusters is not significant,
* -1 means that clusters are assigned in the wrong way.

We can plot silhouette scores per cluster and get validation of consistency within clusters of data. The technique provides a succinct graphical representation of how well each object has been classified. In other words, the silhouette value is a measure of how similar an object is to its own cluster (cohesion) compared to other clusters (separation).

In [None]:
plot_model(kmeans, plot = 'silhouette')
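
To make the metric more concrete, the silhouette values can also be computed by hand. This is a rough sketch that uses scikit-learn's `KMeans` directly as a stand-in for the PyCaret model, so the numbers may differ slightly from PyCaret's output. The variable names are chosen for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples, silhouette_score

# same dummy data as generated earlier in this notebook
X, _ = make_blobs(n_samples=200, centers=4, random_state=42, cluster_std=1.5)

# fit k-means with 4 clusters and get one label per sample
model = KMeans(n_clusters=4, n_init=10, random_state=42)
labels = model.fit_predict(X)

# mean silhouette over all samples (a single number between -1 and 1)
overall = silhouette_score(X, labels)

# one silhouette value per sample; this is what the silhouette plot draws
per_sample = silhouette_samples(X, labels)

print(f"mean silhouette: {overall:.3f}")
for k in range(4):
    print(f"cluster {k}: mean silhouette {per_sample[labels == k].mean():.3f}")
```

The overall score is simply the mean of the per-sample values, which is why a single poorly separated cluster can pull the score down.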


# **Elbow method**
Another useful method is the **elbow method**, a heuristic for choosing an appropriate number of clusters in a dataset: the distortion score is plotted against the number of clusters, and the point where the curve bends indicates a reasonable choice.

In [None]:
plot_model(kmeans, plot = 'elbow')

In this example the elbow plot above suggests that 3 is the optimal number of clusters. Usually there is a clear angle, the elbow, in the distortion scores, and that cutoff point is where adding another cluster doesn't model the data much better. You can use this suggestion to create a new model, but in this case we know there should be 4 clusters, which is the value we already used, so we won't create another model.
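
The idea behind the elbow plot can be sketched without PyCaret. The example below assumes scikit-learn's `KMeans` and uses its `inertia_` attribute (the sum of squared distances from each point to its assigned centroid) as the distortion score. The range of `k` values is an arbitrary choice for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# same dummy data as generated earlier in this notebook
X, _ = make_blobs(n_samples=200, centers=4, random_state=42, cluster_std=1.5)

# fit one model per candidate number of clusters and record its distortion
ks = list(range(2, 9))
inertias = []
for k in ks:
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias.append(model.inertia_)

# the elbow is where the drop between consecutive values flattens out
for k, inertia in zip(ks, inertias):
    print(f"k={k}: distortion={inertia:.1f}")
```

Plotting `inertias` against `ks` gives the same kind of curve as the elbow plot.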

# **Centroids**
The model here is a Python object and therefore has attributes, such as the centroid locations:

In [None]:
centroids = kmeans.cluster_centers_
centroids

In [None]:
sns.scatterplot(x=X[:, 0], y=X[:, 1], color='gray')
sns.scatterplot(x=centroids[:, 0], y=centroids[:, 1], s=200, marker="X")  # overlay the cluster centroids
# x=centroids[:, 0] takes the x-coordinates of the centroids from the first column of the centroids array.
# y=centroids[:, 1] takes the y-coordinates of the centroids from the second column of the centroids array.
# s=200 sets the size of the markers
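
A centroid in k-means is the mean of the points assigned to a cluster. As a rough check, the sketch below recomputes the centroids from the labels. It assumes scikit-learn's `KMeans` as a stand-in for the PyCaret model, and `manual_centroids` is a name introduced here for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# same dummy data as generated earlier in this notebook
X, _ = make_blobs(n_samples=200, centers=4, random_state=42, cluster_std=1.5)

model = KMeans(n_clusters=4, n_init=10, random_state=42).fit(X)

# average the coordinates of the points assigned to each cluster
manual_centroids = np.array(
    [X[model.labels_ == k].mean(axis=0) for k in range(4)]
)

# the two arrays should closely match, row by row
print(manual_centroids)
print(model.cluster_centers_)
```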

# **Plot the model results**
The **`plot_model()`** function can be used to analyze different aspects of the clustering model. This function takes a trained model object and returns a plot. See examples below:

PCA plot

In [None]:
plot_model(kmeans, plot = 'cluster') #cluster is default

# **Distribution plot**
The distribution plot shows the size of each cluster. When hovering over the bars you will see the number of samples assigned to each cluster. We can also use the distribution plot to see the distribution of cluster labels in association with any other numeric or categorical feature. Features are the column names of your dataframe, but in this case the names `feature_1` and `feature_2` were autogenerated because no column names were passed. See an example below:

In [None]:
plot_model(kmeans, plot = 'distribution', feature='feature_2') # you can check with feature_1 as well

# **Compare to original clusters**
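Normally you can't compare clusters against a ground truth, because unlabeled data gives you nothing to compare with. In this case we generated the data ourselves, so we know the original clusters and can compare the k-means results to them. Keep in mind that cluster numbering is arbitrary, so the same group of points may get a different label (and colour) in the two plots.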

Predicted clusters are saved in the model object:

In [None]:
pred = kmeans.labels_
pred

In [None]:
# figure size
plt.figure(figsize=(12,5))

# original clusters, plotted with the first and second feature
plt.subplot(1, 2, 1)
sns.scatterplot(x=X[:,0], y=X[:,1],hue=y, palette='viridis').set(title='Original clusters')

# predicted clusters
plt.subplot(1, 2, 2)
sns.scatterplot(x=X[:,0], y=X[:,1],hue=pred, palette='viridis').set(title='Predicted clusters')
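
Because cluster label numbers are arbitrary, comparing `y` and `pred` element by element can be misleading. One option is a score that ignores the label names, such as the adjusted Rand index. This is a sketch that assumes scikit-learn's `KMeans` and `adjusted_rand_score`, neither of which is used in the original notebook.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# same dummy data as generated earlier in this notebook, with the true labels y
X, y = make_blobs(n_samples=200, centers=4, random_state=42, cluster_std=1.5)

# predicted labels from a k-means model with 4 clusters
pred = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(X)

# 1.0 means identical groupings; values near 0 mean a random-like assignment
ari = adjusted_rand_score(y, pred)
print(f"adjusted Rand index: {ari:.3f}")

# renaming the labels does not change the score
same_partition = adjusted_rand_score([0, 0, 1, 1], [1, 1, 0, 0])
print(same_partition)
```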