## Unsupervised Learning in Machine Learning - Clustering

In this notebook you will get familiar with the k-means clustering algorithm, using the PyCaret Python package on the preprocessed "Pima Indians Diabetes Database" dataset.


**Please create a report by addressing the questions (Q1-Q5) posed throughout the notebook.**

## Import libraries

The following code should be adapted if you run it on your own laptop. You should already have a conda environment with PyCaret installed, so you can skip the pip install steps (comment them out or remove the cells). Also change the path (and file name) to load your own preprocessed data.

In [None]:
!pip install joblib==1.3.0

In [None]:
!pip install pycaret # install pycaret this way if you are running this notebook in google colab environment.

In [None]:
from pycaret.clustering import * #importing pycaret clustering module
import plotly.express as px

import pandas as pd

from matplotlib import pyplot as plt
import seaborn as sns

In [None]:
# load the preprocessed data

url = "https://raw.githubusercontent.com/thilinib/CBM101/main/E_Macine_Learning/data/preprocessed_diabetes.csv"
df = pd.read_csv(url)

In [None]:
# check the shape of the preprocessed data
df.shape

In [None]:
# check what the preprocessed data looks like
df.head()

As you may remember, you have 8 measurement features and one column specifying the outcome. Let's try to cluster this data considering all measured features.

You have already explored the data by plotting it along axes that show two features, e.g. with the seaborn scatterplot. Below is an example plot of Insulin vs Glucose.

After we cluster the data, we would like to know whether the clusters separate in the high-dimensional (here 8-dimensional) feature space. For that purpose we will apply dimensionality reduction methods, including PCA, t-SNE and UMAP, to visualize the clustering results and examine how they relate to the outcome variable.

In [None]:
# scatterplot of two features, colored by the Outcome column
sns.scatterplot(data=df, x="Insulin", y="Glucose", hue="Outcome", palette="viridis")

# 1. Setup
Initialize PyCaret's clustering module with the dataset.


Are you wondering what this function will do to your data? You should :-) The PyCaret package tries to do clever things with your data. Read the function description for `setup` carefully and make notes about its default settings.

Remember also that your dataset has one column that indicates the outcome. The main goal here is to examine how the other measured features relate to the outcome. Specifically, we are interested in finding out whether the diabetic cases separate from the non-diabetic ones when using clustering-based methods. You should therefore perform clustering on the measured features only (set the parameter `ignore_features` to omit the "Outcome" column).

In [None]:
?setup

In [None]:
# ignore_features takes a list of column names to leave out of the clustering
s = setup(df, ignore_features=['Outcome'])

**Q1**

> Why do we need to call the setup function, and what are the key pre-processing tasks automatically performed by PyCaret?



List all the available clustering models:

In [None]:
models()

# 2. Create a k-means model


In [None]:
kmeans = create_model('kmeans')

**Q2**

> When creating a clustering model, such as K-Means clustering using PyCaret's create_model() function, what value for k is used by default? What metrics and values are returned as part of the model creation process?



You can also create other clustering models, such as `ap` (Affinity Propagation) or `hclust` (agglomerative clustering), and compare their metric scores.
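
To illustrate what comparing metric scores across models can look like outside PyCaret, here is a minimal sketch using scikit-learn directly. The synthetic blob data, the variable names, and the choice of these two models are assumptions made for illustration; this is not the diabetes dataset and not PyCaret's own implementation.

```python
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)

# Synthetic data (illustrative only): 300 points, 8 features, 4 centers
X, _ = make_blobs(n_samples=300, n_features=8, centers=4, random_state=0)

# Two candidate models, named after the PyCaret model IDs they resemble
candidates = {
    "kmeans": KMeans(n_clusters=4, n_init=10, random_state=0),
    "hclust": AgglomerativeClustering(n_clusters=4),
}

scores = {}
for name, model in candidates.items():
    labels = model.fit_predict(X)
    scores[name] = {
        # silhouette and Calinski-Harabasz: higher is better
        "silhouette": silhouette_score(X, labels),
        "calinski_harabasz": calinski_harabasz_score(X, labels),
        # Davies-Bouldin: lower is better
        "davies_bouldin": davies_bouldin_score(X, labels),
    }

for name, metrics in scores.items():
    print(name, {key: round(value, 3) for key, value in metrics.items()})
```

These are internal metrics: they only use the features and the assigned labels, so they can be computed without knowing the true outcome.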

# 3. Analyze the model

In [None]:
#This function analyzes the performance of the trained model.

evaluate_model(kmeans)


You can also use the `plot_model` function to generate plots individually.

In [None]:
?plot_model

In [None]:
plot_model(kmeans, plot = 'silhouette')

In [None]:
plot_model(kmeans, plot = 'elbow')
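
An elbow plot shows how the within-cluster sum of squares changes as the number of clusters k grows. The sketch below computes such a curve by hand with scikit-learn on synthetic data with three well-separated groups; the data and variable names are assumptions for illustration, and the numbers are unrelated to the diabetes dataset.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data (illustrative only): 3 well-separated groups in 8 dimensions
X, _ = make_blobs(n_samples=300, n_features=8, centers=3, random_state=0)

ks = list(range(1, 9))
inertias = []
for k in ks:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # inertia_ is the sum of squared distances of points to their closest centroid
    inertias.append(model.inertia_)

for k, inertia in zip(ks, inertias):
    print(f"k={k}: inertia={inertia:.1f}")
```

The "elbow" is the value of k after which adding more clusters no longer reduces the inertia by much.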

**Q3**

> What number of clusters does the plot suggest? Justify your answer.



In [None]:
centroids = kmeans.cluster_centers_
centroids

In [None]:
plot_model(kmeans, plot = 'cluster') # 2D cluster plot based on PCA

In [None]:
plot_model(kmeans, plot = 'distribution')

PyCaret generates distribution plots to visualize how data points are distributed within each cluster.

Distribution plots help in understanding:
* how the data points are distributed within each cluster.
* potential outliers or anomalies within a cluster.
* the density of data points within each cluster.
* similar or dissimilar characteristics among clusters.
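
The same kind of information can also be read from a per-cluster summary table. The sketch below uses a tiny hand-made table with hypothetical Glucose values and a `Cluster` column; with real results, such a column could come from PyCaret's `assign_model`, which is assumed here rather than shown.

```python
import pandas as pd

# Hypothetical values for illustration only (not taken from the dataset)
toy = pd.DataFrame({
    "Glucose": [85, 90, 150, 160, 95, 170],
    "Cluster": ["Cluster 0", "Cluster 0", "Cluster 1",
                "Cluster 1", "Cluster 0", "Cluster 1"],
})

# Count, mean and range of the feature within each cluster
summary = toy.groupby("Cluster")["Glucose"].agg(["count", "mean", "min", "max"])
print(summary)
```

A large gap between cluster means, combined with little overlap in the ranges, suggests that the feature helps to tell the clusters apart.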

In [None]:
plot_model(kmeans, plot = 'distribution', feature="Glucose")

**Q4**

> What information can you get from the above distribution plot?



From the plots above we can see that the feature ranges are quite different. Because k-means is distance-based, features with larger ranges dominate the clustering, so the dataset should be normalized to get a better result. Let's see what effect this has!

We can do this by passing `normalize=True` to `setup()`.
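
As a rough idea of what normalization does, the sketch below applies z-score scaling by hand with NumPy. It assumes z-score scaling, which to my understanding is PyCaret's default `normalize_method`; the three rows of Glucose and Insulin values are hypothetical.

```python
import numpy as np

# Hypothetical rows: columns are Glucose and Insulin (illustrative values)
X = np.array([
    [80.0, 0.0],
    [120.0, 100.0],
    [160.0, 500.0],
])

mean = X.mean(axis=0)   # per-feature mean
std = X.std(axis=0)     # per-feature standard deviation (population, ddof=0)
X_scaled = (X - mean) / std

print(X_scaled.round(3))
```

After scaling, every feature has mean 0 and standard deviation 1, so no single feature dominates the distance calculation.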


In [None]:
s = setup(df, normalize=True, ignore_features='Outcome')

# Clustering with normalized dataset

In [None]:
kmeans_4 = create_model('kmeans')

In [None]:
plot_model(kmeans_4, plot = 'silhouette')

In [None]:
centroids = kmeans_4.cluster_centers_
centroids

In [None]:
plot_model(kmeans_4, plot = 'cluster')

In [None]:
plot_model(kmeans_4, plot = 'distribution', feature="Glucose")

In [None]:
plot_model(kmeans_4, plot = 'distribution', feature="Insulin")

Up to now you have worked with a linear clustering method: k-means.
When we group data points into clusters, the relationships between them are sometimes simple and straight, like connecting dots with lines. As mentioned in the clustering Moodle book, this is where k-means clustering does well: it is good at finding these *straight-line patterns*.

But what if our data is more like a tangled web of connections, with curves and twists? Then we need non-linear clustering methods such as Mean Shift, DBSCAN, and OPTICS.
You can try these other clustering methods and see what clusters they produce.
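
The difference can be illustrated on a classic toy example, two interleaved half-moons, using scikit-learn directly. The synthetic data and the DBSCAN parameters (`eps=0.2`, `min_samples=5`) are assumptions chosen for this toy case and would need tuning on other data.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two curved, interleaved groups (synthetic, illustrative only)
X, y_true = make_moons(n_samples=300, noise=0.05, random_state=0)

kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
dbscan_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

# Adjusted Rand index: 1.0 means perfect agreement with the true grouping
ari_kmeans = adjusted_rand_score(y_true, kmeans_labels)
ari_dbscan = adjusted_rand_score(y_true, dbscan_labels)

print(f"k-means ARI: {ari_kmeans:.2f}")
print(f"DBSCAN ARI:  {ari_dbscan:.2f}")
```

On this kind of curved structure a density-based method tends to follow the shape of each group, while k-means splits the plane with a straight boundary.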

As discussed in the dimensionality reduction section of the clustering Moodle book, when we want to take a big jumble of data and shrink it down into a simpler picture, we can use non-linear techniques like t-SNE and UMAP. These tools can preserve non-linear relationships between points while making everything easier to understand. Let's also plot these visualisations.

In [None]:
plot_model(kmeans_4, plot = 'tsne')
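
A t-SNE embedding can also be computed directly with scikit-learn, which gives more control over its parameters. This is a minimal sketch on synthetic data; the data, the variable names, and the parameter values are assumptions for illustration and do not reproduce PyCaret's plot.

```python
from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE

# Synthetic data (illustrative only): 150 points with 8 features
X, y = make_blobs(n_samples=150, n_features=8, centers=3, random_state=0)

# perplexity roughly sets how many neighbors each point "pays attention to";
# it must be smaller than the number of samples
tsne = TSNE(n_components=2, perplexity=30, init="pca", random_state=0)
embedding_2d = tsne.fit_transform(X)

print(embedding_2d.shape)
```

Distances between far-apart groups in a t-SNE plot are not directly interpretable, so it is best used to judge whether groups separate, not how far apart they are.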

In [None]:
!pip install umap-learn


In [None]:
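import umap
reducer = umap.UMAP()
# embed only the measured features; Outcome is left out, as in the clustering
embedding = reducer.fit_transform(df.drop(columns="Outcome"))
pred_4 = kmeans_4.labels_  # cluster label assigned to each row
pred_4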

In [None]:
sns.scatterplot(x=embedding[:, 0], y= embedding[:, 1], hue=df.Outcome)
plt.title('UMAP projection colored by Outcome', fontsize=12);

In [None]:
sns.scatterplot(x=embedding[:, 0], y= embedding[:, 1], hue=pred_4)
plt.title('UMAP projection colored by k-means cluster', fontsize=12);

# Clustering with pre-defined cluster count

Now let's cluster with the number of clusters set to 2, since our true Outcome column takes the values 0 and 1.

In [None]:
kmeans = create_model('kmeans', num_clusters=2)

In [None]:
plot_model(kmeans, plot = 'silhouette')

In [None]:
centroids = kmeans.cluster_centers_
centroids

In [None]:
plot_model(kmeans, plot = 'cluster')

In [None]:
plot_model(kmeans, plot = 'distribution', feature="Glucose")

In [None]:
plot_model(kmeans, plot = 'distribution', feature="Insulin")

In [None]:
plot_model(kmeans, plot = 'tsne')

In [None]:
!pip install umap-learn

In [None]:
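import umap
reducer = umap.UMAP()
# embed only the measured features; Outcome is left out, as in the clustering
embedding = reducer.fit_transform(df.drop(columns="Outcome"))
pred = kmeans.labels_  # cluster label assigned to each row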


In [None]:
pred

In [None]:
sns.scatterplot(x=embedding[:, 0], y= embedding[:, 1], hue=df.Outcome)
plt.title('UMAP projection colored by Outcome', fontsize=12);

In [None]:
sns.scatterplot(x=embedding[:, 0], y= embedding[:, 1], hue=pred)
plt.title('UMAP projection colored by k-means cluster (k=2)', fontsize=12);
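
Besides comparing the two plots by eye, the agreement between cluster labels and the true outcome can be quantified. The sketch below uses a cross-tabulation and the adjusted Rand index on short, hypothetical label lists; with the real data you would pass `df.Outcome` and the predicted labels instead.

```python
import pandas as pd
from sklearn.metrics import adjusted_rand_score

# Hypothetical labels for illustration only
outcome = [0, 0, 0, 0, 1, 1, 1, 1]
cluster = [1, 1, 1, 0, 0, 0, 0, 0]

# Rows: true outcome, columns: assigned cluster
table = pd.crosstab(
    pd.Series(outcome, name="Outcome"),
    pd.Series(cluster, name="Cluster"),
)
print(table)

# Cluster numbers are arbitrary, so the score ignores how clusters are named
ari = adjusted_rand_score(outcome, cluster)
ari_flipped = adjusted_rand_score(outcome, [1 - c for c in cluster])
print(f"ARI: {ari:.3f}, with flipped cluster names: {ari_flipped:.3f}")
```

The adjusted Rand index is close to 0 for random labelings and equals 1 when the clusters match the outcome exactly, whichever cluster number is assigned to which group.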

**Q5**

> What can you say about glucose and insulin distribution within different clusters? Have you noticed any specific tendencies? Do you consider these results biologically relevant? Please justify your answers.
