# GUC Clustering Project 

**Objective:** 
The objective of this project is to teach students how to apply clustering to real data sets.

The project aims to teach students: 
* Which clustering approach to use
* How to compare K-means, hierarchical clustering, DBSCAN, and Gaussian mixtures  
* How to tune the parameters of each approach
* What the effect of different distance functions is (optional) 
* How to evaluate clustering approaches 
* How to display the output
* What the effect of normalizing the data is 

Students in this project will use ready-made functions from scikit-learn, plotnine, NumPy and pandas.
 



In [6]:
# If plotnine is not installed in Jupyter, install it before running the imports below 

Running this project requires the following imports 

In [95]:
import warnings
warnings.filterwarnings('ignore')
import seaborn as sns 
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import sklearn.preprocessing as prep
from sklearn.datasets import make_blobs
from plotnine import *   
# StandardScaler is a function to normalize the data 
# You may also check MinMaxScaler and MaxAbsScaler 
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import NearestNeighbors

from sklearn.cluster import DBSCAN


from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

from sklearn.metrics import silhouette_score

%matplotlib inline

In [96]:
def display_cluster(X, km=None, num_clusters=0):
    """Scatter-plot the first two features of X, colored by cluster when km is given."""
    color = ['b', 'r', 'g', 'c', 'm', 'y', 'k', 'violet', 'aqua', 'pink']
    alpha = 0.5
    s = 20
    # Convert to a NumPy array so positional indexing also works for DataFrames
    X_np = X.values if isinstance(X, pd.DataFrame) else X
    if num_clusters == 0:
        plt.scatter(X_np[:, 0], X_np[:, 1], c=color[0], alpha=alpha, s=s)
    else:
        for i in range(num_clusters):
            cluster_indices = np.where(km.labels_ == i)[0]
            plt.scatter(X_np[cluster_indices, 0], X_np[cluster_indices, 1], c=color[i], alpha=alpha, s=s)
            # Mark the cluster center with an 'x'
            plt.scatter(km.cluster_centers_[i][0], km.cluster_centers_[i][1], c=color[i], marker='x', s=100)

## Multi Blob Data Set 
* The data set generated below has 6 clusters with varying numbers of samples and varying densities
* Cluster the data set below using the approaches described in the following sections 



In [None]:
plt.rcParams['figure.figsize'] = [8,8]
sns.set_style("whitegrid")
sns.set_context("talk")

n_bins = 6  
centers = [(-3, -3), (0, 0), (5,2.5),(-1, 4), (4, 6), (9,7)]
Multi_blob_Data, y = make_blobs(n_samples=[100,150, 300, 400,300, 200], n_features=2, cluster_std=[1.3,0.6, 1.2, 1.7,0.9,1.7],
                  centers=centers, shuffle=False, random_state=42)
display_cluster(Multi_blob_Data)

### Kmeans 
* Use Kmeans with different values of K to cluster the above data 
* Display the outcome of each value of K 
* Plot the distortion function versus K and choose the appropriate value of K 
* Plot the silhouette_score versus K and use it to choose the best K 
* Store the silhouette_score for the best K for later comparison with other clustering techniques. 

In [101]:
plt.rcParams['figure.figsize'] = [8,8]
sns.set_style("whitegrid")
sns.set_context("talk")

n_bins = 6  
centers = [(-3, -3), (0, 0), (5,2.5),(-1, 4), (4, 6), (9,7)]
Multi_blob_Data, y = make_blobs(n_samples=[100,150, 300, 400,300, 200], n_features=2, cluster_std=[1.3,0.6, 1.2, 1.7,0.9,1.7],
                  centers=centers, shuffle=False, random_state=42)

* using predefined kmeans

In [None]:
# Perform K-means clustering with different values of K
K_values = range(2, 10)  # Values of K to try
distortions = []  # List to store distortion values
silhouette_scores = []  # List to store silhouette scores

for k in K_values:
    km = KMeans(n_clusters=k, random_state=42)
    km.fit(Multi_blob_Data)
    distortions.append(km.inertia_)  # Sum of squared distances to closest cluster center
    silhouette_scores.append(silhouette_score(Multi_blob_Data, km.labels_))  # Silhouette score

    # Display clusters for each value of K
    plt.figure()
    plt.title("K-means clustering with K={}".format(k))
    display_cluster(Multi_blob_Data, km, k)

# Plot distortion function versus K
plt.figure()
plt.plot(K_values, distortions, marker='o')
plt.xlabel('Number of clusters (K)')
plt.ylabel('Distortion')
plt.title('Distortion versus K')

# Plot silhouette score versus K
plt.figure()
plt.plot(K_values, silhouette_scores, marker='o')
plt.xlabel('Number of clusters (K)')
plt.ylabel('Silhouette Score')
plt.title('Silhouette Score versus K')

# Find the best value of K based on silhouette score
best_K = K_values[np.argmax(silhouette_scores)]
print("Best value of K based on silhouette score:", best_K)
print("Best silhouette score:",np.max(silhouette_scores) )
plt.show()
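
The task list above asks for the best silhouette score to be stored for later comparison, and for K to be chosen from the distortion curve. The following is a minimal sketch of both steps on a small synthetic data set, not the notebook's own data. The `results` dictionary and the second-difference elbow rule are assumptions made for illustration; the elbow is usually read off the plot by eye.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Illustrative data: three well-separated blobs (not the notebook's data set)
X, _ = make_blobs(n_samples=300, centers=[(-5, -5), (0, 5), (6, 0)],
                  cluster_std=0.8, random_state=0)

k_values = list(range(2, 8))
inertias, scores = [], []
for k in k_values:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)                     # distortion for this K
    scores.append(silhouette_score(X, km.labels_))   # silhouette for this K

# Elbow heuristic (an assumption, not a standard API): pick the K where the
# distortion curve bends most, i.e. where the second difference is largest.
# Entry i of the second difference is centred on k_values[i + 1].
second_diff = np.diff(inertias, n=2)
elbow_k = k_values[int(np.argmax(second_diff)) + 1]

# Silhouette criterion: pick the K with the highest average silhouette
best_k = k_values[int(np.argmax(scores))]

# Hypothetical container for comparing techniques later on
results = {}
results["KMeans"] = {"k": best_k, "silhouette": float(max(scores))}
print(elbow_k, results)
```

The other techniques can add their own entry to the same dictionary, so the final comparison becomes a single table built with `pd.DataFrame(results).T`.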

### Hierarchical Clustering
* Use the AgglomerativeClustering function to cluster the above data 
* In AgglomerativeClustering, change the following parameters 
    * Affinity (use euclidean, manhattan and cosine)
    * Linkage (use average and single)
    * Distance_threshold (try different values)
* For each of these trials, plot the dendrogram, calculate the silhouette_score and display the resulting clusters  
* Find the set of parameters that results in the best silhouette_score and store this score for later comparison with other clustering techniques. 
* Record your observations 

In [109]:
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import pairwise_distances
from sklearn.metrics import silhouette_score

In [110]:
plt.rcParams['figure.figsize'] = [8,8]
sns.set_style("whitegrid")
sns.set_context("talk")

n_bins = 6  
centers = [(-3, -3), (0, 0), (5,2.5),(-1, 4), (4, 6), (9,7)]
Multi_blob_Data, y = make_blobs(n_samples=[100,150, 300, 400,300, 200], n_features=2, cluster_std=[1.3,0.6, 1.2, 1.7,0.9,1.7],
                  centers=centers, shuffle=False, random_state=42)

df = pd.DataFrame(Multi_blob_Data)

In [111]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score
from scipy.cluster.hierarchy import dendrogram, linkage
from itertools import combinations

def hierarchical_clustering(df):
    # Initialize variables to store the best silhouette score and its corresponding parameters
    best_silhouette_score = -1  # Initialize with a value that ensures any calculated silhouette score will be better
    best_params = None

    affinities = ['euclidean', 'cityblock', 'cosine']
    linkages = ['average', 'single']
    distance_thresholds = [None]
    numeric_distance_thresholds = [3, 7]  # Add numeric distance thresholds here

    # Iterate over parameter combinations with numeric distance_thresholds
    for affinity in affinities:
        for linkage_method in linkages:
            for distance_threshold in numeric_distance_thresholds:
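                # Perform Agglomerative Clustering
                # Note: scikit-learn 1.2 renamed `affinity` to `metric` and 1.4 removed `affinity`;
                # on recent versions, pass metric=affinity instead.
                clustering = AgglomerativeClustering(affinity=affinity, linkage=linkage_method,
                                                     distance_threshold=distance_threshold, n_clusters=None)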
                Z = linkage(df, method=linkage_method, metric=affinity)
                cluster_labels = clustering.fit_predict(df)

                # Check if multiple clusters are formed
                if len(np.unique(cluster_labels)) > 1:
                    # Calculate silhouette score
                    silhouette_avg = silhouette_score(df, cluster_labels)
                    print(f"Silhouette Score: {silhouette_avg}")

                    # Plot dendrogram
                    plt.figure(figsize=(10, 10))
                    plt.title(f'Dendrogram - Affinity: {affinity}, Linkage: {linkage_method}, Distance Threshold: {distance_threshold}')
                    dendrogram(Z, leaf_rotation=90., leaf_font_size=8.)
                    plt.xlabel('Sample Index')
                    plt.ylabel('Distance')
                    plt.show()

                    # Plot resulting clusters
                    for pair in combinations(range(df.shape[1]), 2):
                        plt.figure(figsize=(8, 6))
                        plt.scatter(df.iloc[:, pair[0]], df.iloc[:, pair[1]], c=cluster_labels, cmap='viridis')
                        plt.title(f"Clusters - Features {pair[0]} and {pair[1]} - Affinity: {affinity}, Linkage: {linkage_method}, Distance Threshold: {distance_threshold}")
                        plt.xlabel(f'Feature {pair[0]}')
                        plt.ylabel(f'Feature {pair[1]}')
                        plt.colorbar(label='Cluster')
                        plt.grid(True)
                        plt.tight_layout()
                        plt.show()

                    # Check if this silhouette score is better than the current best
                    if silhouette_avg > best_silhouette_score:
                        best_silhouette_score = silhouette_avg
                        best_params = {'Affinity': affinity, 'Linkage': linkage_method, 'Distance Threshold': distance_threshold}

    # Iterate over parameter combinations with distance_thresholds
    for affinity in affinities:
        for linkage_method in linkages:
            for distance_threshold in distance_thresholds:
                # Set n_clusters to 2 if distance_threshold is None
                n_clusters = 2 if distance_threshold is None else None
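                # Perform Agglomerative Clustering
                # (on scikit-learn 1.4 or later, pass metric=affinity instead of affinity=affinity)
                clustering = AgglomerativeClustering(affinity=affinity, linkage=linkage_method,
                                                     distance_threshold=distance_threshold, n_clusters=None if n_clusters is None else n_clusters)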
                cluster_labels = clustering.fit_predict(df)
                Z = linkage(df, method=linkage_method, metric=affinity)

                # Check if multiple clusters are formed
                if len(np.unique(cluster_labels)) > 1:
                    # Calculate silhouette score
                    silhouette_avg = silhouette_score(df, cluster_labels)
                    print(f"Silhouette Score: {silhouette_avg}")

                    # Plot dendrogram
                    plt.figure(figsize=(10, 6))
                    plt.title(f'Dendrogram - Affinity: {affinity}, Linkage: {linkage_method}, Distance Threshold: {distance_threshold}')
                    dendrogram(Z, leaf_rotation=90., leaf_font_size=8.)
                    plt.xlabel('Sample Index')
                    plt.ylabel('Distance')
                    plt.show()

                    # Plot resulting clusters
                    for pair in combinations(range(df.shape[1]), 2):
                        plt.figure(figsize=(8, 6))
                        plt.scatter(df.iloc[:, pair[0]], df.iloc[:, pair[1]], c=cluster_labels, cmap='viridis')
                        plt.title(f"Clusters - Features {pair[0]} and {pair[1]} - Affinity: {affinity}, Linkage: {linkage_method}, Distance Threshold: {distance_threshold}")
                        plt.xlabel(f'Feature {pair[0]}')
                        plt.ylabel(f'Feature {pair[1]}')
                        plt.colorbar(label='Cluster')
                        plt.grid(True)
                        plt.tight_layout()
                        plt.show()

                    # Check if this silhouette score is better than the current best
                    if silhouette_avg > best_silhouette_score:
                        best_silhouette_score = silhouette_avg
                        best_params = {'Affinity': affinity, 'Linkage': linkage_method, 'Distance Threshold': distance_threshold}

    # Print the best silhouette score and its corresponding parameters
    print("Best Silhouette Score:", best_silhouette_score)
    print("Best Parameters:", best_params)


In [None]:
hierarchical_clustering(df)
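
The function above scores every trial with the default Euclidean silhouette, even when the clusters were built with another distance. The sketch below is one possible variation, under the assumption that the clustering and the score should use the same metric. It relies only on SciPy's `linkage` and `fcluster`, so it does not depend on the `affinity`/`metric` naming of `AgglomerativeClustering`. The data set and the choice of 3 clusters are illustrative.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Illustrative data: three blobs, none centred on the origin (cosine distance
# is undefined for a zero vector)
X, _ = make_blobs(n_samples=150, centers=[(-4, 0), (4, 0), (0, 6)],
                  cluster_std=0.7, random_state=1)

scores = {}
for metric in ["euclidean", "cityblock", "cosine"]:
    # Build the merge tree with the chosen distance and average linkage
    Z = linkage(X, method="average", metric=metric)
    # Cut the tree so that at most 3 flat clusters remain
    labels = fcluster(Z, t=3, criterion="maxclust")
    if len(np.unique(labels)) > 1:
        # Score with the same distance that was used to build the clusters
        scores[metric] = silhouette_score(X, labels, metric=metric)

for metric, score in scores.items():
    print(f"{metric:10s} silhouette = {score:.3f}")
```

Cutting by height instead of by cluster count uses `fcluster(Z, t=threshold, criterion="distance")`, which plays the same role as `distance_threshold` in the function above.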

### DBScan
* Use the DBSCAN function to cluster the above data 
* In DBSCAN, change the following parameters 
    * EPS (from 0.1 to 3)
    * Min_samples (from 5 to 25)
* Plot the silhouette_score versus the variation in the EPS and the min_samples
* Plot the resulting clusters in this case 
* Find the set of parameters that results in the best silhouette_score and store this score for later comparison with other clustering techniques. 
* Record your observations and comments 
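
Before sweeping the whole EPS range, a k-distance curve can narrow down where useful values lie. The sketch below uses `NearestNeighbors`, which the notebook already imports. The data set, the value of `min_samples` and the 90th-percentile rule for picking a candidate are all assumptions made for illustration; in practice the knee of the sorted curve is usually chosen by eye.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

# Illustrative data: two blobs (not the notebook's data set)
X, _ = make_blobs(n_samples=300, centers=[(-5, 0), (5, 0)],
                  cluster_std=1.0, random_state=0)

min_samples = 10

# Querying the training data itself returns each point as its own nearest
# neighbour (distance 0). DBSCAN's min_samples also counts the point itself,
# so the last column is the radius at which each point becomes a core point.
nn = NearestNeighbors(n_neighbors=min_samples).fit(X)
distances, _ = nn.kneighbors(X)
k_dist = np.sort(distances[:, -1])

# Hypothetical rule of thumb: take a high quantile of the k-distances as a
# starting value for eps, then refine it with the silhouette sweep.
eps_candidate = float(np.quantile(k_dist, 0.90))
print(f"candidate eps for min_samples={min_samples}: {eps_candidate:.3f}")
```

Plotting `k_dist` against its index with `plt.plot(k_dist)` shows the curve; points to the right of the knee are the ones DBSCAN would tend to mark as noise for that EPS.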


In [115]:
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score
import numpy as np
import matplotlib.pyplot as plt

def DBSCAN_clustering(df):
    # Define the range of parameter values to experiment with
    eps_values = np.linspace(0.1, 3, 30)  # Range of eps values from 0.1 to 3
    min_samples_values = range(5, 26)  # Range of min_samples values from 5 to 25

    # Initialize variables to store the best silhouette score and its corresponding parameters
    best_silhouette_score = -1  # Initialize with a value that ensures any calculated silhouette score will be better
    best_params = None
    best_cluster_labels = None
    silhouette_scores = []

    # Iterate over parameter combinations
    for eps in eps_values:
        for min_samples in min_samples_values:
            # Perform DBSCAN clustering
            dbscan = DBSCAN(eps=eps, min_samples=min_samples)
            cluster_labels = dbscan.fit_predict(df)

            # Check if only one unique label is detected
            if len(np.unique(cluster_labels)) <= 1:
                continue

            # Calculate silhouette score
            silhouette_avg = silhouette_score(df, cluster_labels)

            # Store silhouette score and parameters
            silhouette_scores.append(silhouette_avg)

            # Check if this silhouette score is better than the current best
            if silhouette_avg > best_silhouette_score:
                best_silhouette_score = silhouette_avg
                best_params = {'EPS': eps, 'Min Samples': min_samples}
                best_cluster_labels = cluster_labels

            # Plot the resulting clusters for each pair of features
            n_features = df.shape[1]
            for i in range(n_features):
                for j in range(i+1,n_features):
                    plt.figure(figsize=(10, 6))
                    plt.scatter(df.iloc[:, i], df.iloc[:, j], c=cluster_labels, cmap='viridis', s=50, alpha=0.5)
                    plt.xlabel(df.columns[i])
                    plt.ylabel(df.columns[j])
                    plt.colorbar(label='Cluster')
                    plt.grid(True)
                    plt.tight_layout()
                    plt.show()

    # Plot silhouette score versus eps and min_samples
    plt.figure(figsize=(10, 6))
    plt.plot(range(len(silhouette_scores)), silhouette_scores, marker='o', linestyle='-')
    plt.title('Silhouette Score vs. Parameters (EPS and Min Samples)')
    plt.xlabel('Parameter Combination')
    plt.ylabel('Silhouette Score')
    plt.grid(True)
    plt.tight_layout()
    plt.show()

    # Print the best silhouette score and its corresponding parameters
    print("Best Silhouette Score:", best_silhouette_score)
    print("Best Parameters:", best_params)


In [None]:
DBSCAN_clustering(df)
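
Two details of the sweep above are worth a closer look. DBSCAN labels noise points as `-1`, and `silhouette_score` treats that label as one more cluster. The scores are also plotted against a flat combination index rather than against EPS and min_samples. The sketch below is one way to handle both, under the assumption that noise should be excluded from the score; the parameter values and data are illustrative.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Illustrative data: two blobs (not the notebook's data set)
X, _ = make_blobs(n_samples=300, centers=[(-5, 0), (5, 0)],
                  cluster_std=1.0, random_state=0)

eps_values = [0.3, 0.6, 0.9, 1.2]
min_samples_values = [5, 10, 15]

# One row per eps, one column per min_samples; NaN marks "no valid score"
score_grid = np.full((len(eps_values), len(min_samples_values)), np.nan)

for r, eps in enumerate(eps_values):
    for c, min_samples in enumerate(min_samples_values):
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
        mask = labels != -1                     # drop noise points
        if len(np.unique(labels[mask])) > 1:    # need at least two real clusters
            score_grid[r, c] = silhouette_score(X[mask], labels[mask])

if np.isfinite(score_grid).any():
    best_r, best_c = np.unravel_index(np.nanargmax(score_grid), score_grid.shape)
    print("best eps:", eps_values[best_r],
          "best min_samples:", min_samples_values[best_c],
          "score:", score_grid[best_r, best_c])
```

The grid can be shown as a heat map with `plt.imshow(score_grid)`, which makes the joint effect of the two parameters visible. Excluding noise tends to raise the score, so the fraction of points kept (`mask.mean()`) should be reported next to it.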

### Gaussian Mixture
* Use the GaussianMixture function to cluster the above data 
* In GMM, change the covariance_type and check the difference in the resulting probability fit 
* Use a 2D contour plot to plot the resulting distribution (the components of the GMM) as well as the total Gaussian mixture 

In [117]:
from sklearn.mixture import GaussianMixture
import numpy as np
import matplotlib.pyplot as plt

def GMMClustering(df):
    # Define covariance types to test
    covariance_types = ['full', 'tied', 'diag', 'spherical']

    # Initialize variables to store the best silhouette score and its corresponding parameters
    best_silhouette_score = -1
    best_params = {}

    # Initialize lists to store silhouette scores and corresponding parameters
    silhouette_scores = []
    parameters = []

    # Iterate over covariance types
    for covariance_type in covariance_types:
        # Perform clustering
        gmm = GaussianMixture(n_components=3, covariance_type=covariance_type)
        cluster_labels = gmm.fit_predict(df)

        # Calculate silhouette score
        silhouette_avg = silhouette_score(df, cluster_labels)

        # Store silhouette score and parameters
        silhouette_scores.append(silhouette_avg)
        parameters.append({'Covariance Type': covariance_type})

        # Check if this silhouette score is better than the current best
        if silhouette_avg > best_silhouette_score:
            best_silhouette_score = silhouette_avg
            best_params = {'Covariance Type': covariance_type}
            best_cluster_labels = cluster_labels

    # Plot scatter plot for each pair of features
    n_features = df.shape[1]
    for i in range(n_features):
        for j in range(i + 1, n_features):
            plt.figure(figsize=(10, 6))

            # Scatter plot
            plt.subplot(1, 2, 1)
            plt.scatter(df.iloc[:, i], df.iloc[:, j], c=best_cluster_labels, cmap='viridis', s=50, alpha=0.5)
            plt.xlabel(f'Feature {i}')
            plt.ylabel(f'Feature {j}')
            plt.title('Scatter Plot')

            # Fit GaussianMixture model for the current pair of features
            # (as a NumPy array, so it matches the array passed to score_samples below)
            gmm_pair = GaussianMixture(n_components=3, covariance_type=best_params['Covariance Type'])
            X_pair = df.iloc[:, [i, j]].to_numpy()
            gmm_pair.fit(X_pair)

            # Contour plot for Gaussian mixture
            plt.subplot(1, 2, 2)
            x_min, x_max = df.iloc[:, i].min() - 1, df.iloc[:, i].max() + 1
            y_min, y_max = df.iloc[:, j].min() - 1, df.iloc[:, j].max() + 1
            xx, yy = np.meshgrid(np.linspace(x_min, x_max, 100),
                                 np.linspace(y_min, y_max, 100))
            Z = gmm_pair.score_samples(np.column_stack([xx.ravel(), yy.ravel()]))
            Z = Z.reshape(xx.shape)
            plt.contourf(xx, yy, Z, cmap='viridis', levels=20, alpha=0.5)
            plt.xlabel(f'Feature {i}')
            plt.ylabel(f'Feature {j}')
            plt.title('Contour Plot')

            plt.tight_layout()
            plt.show()

    # Plot silhouette score versus covariance type
    plt.figure(figsize=(10, 6))
    plt.bar(range(len(silhouette_scores)), silhouette_scores, tick_label=[param['Covariance Type'] for param in parameters])
    plt.title('Silhouette Score versus Covariance Type')
    plt.xlabel('Covariance Type')
    plt.ylabel('Silhouette Score')
    plt.grid(True)
    plt.show()

    # Print the best silhouette score and its corresponding parameters
    print("Best Silhouette Score:", best_silhouette_score)
    print("Best Parameters:", best_params)


In [None]:
GMMClustering(df)
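
The task asks for the individual components as well as the total mixture, while the function above draws only the total log-density. The sketch below shows how the two relate: each component contributes its weight times a Gaussian density, and the sum equals the exponential of `score_samples`. It assumes `covariance_type="full"`, where `covariances_` holds one full matrix per component; the other covariance types store this attribute in a different shape. The data set is illustrative.

```python
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Illustrative data: three blobs with different spreads
X, _ = make_blobs(n_samples=300, centers=[(-4, 0), (4, 0), (0, 5)],
                  cluster_std=[0.6, 1.0, 1.4], random_state=0)

gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=0).fit(X)

# Evaluation grid covering the data with a margin of 1 on each side
xx, yy = np.meshgrid(np.linspace(X[:, 0].min() - 1, X[:, 0].max() + 1, 60),
                     np.linspace(X[:, 1].min() - 1, X[:, 1].max() + 1, 60))
grid = np.column_stack([xx.ravel(), yy.ravel()])

# Weighted density of each component: weight_k * N(x | mean_k, cov_k)
component_density = np.stack([
    w * multivariate_normal(mean=m, cov=c).pdf(grid)
    for w, m, c in zip(gmm.weights_, gmm.means_, gmm.covariances_)
])

# The total mixture density is the sum over components
total_density = component_density.sum(axis=0)

# score_samples returns the log of the same total density
print(np.allclose(total_density, np.exp(gmm.score_samples(grid))))
```

Each row of `component_density`, reshaped with `.reshape(xx.shape)`, can be passed to `plt.contour(xx, yy, ...)` to draw one component, and `total_density` reshaped the same way gives the full mixture.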

## iris data set 
The iris data set is a test data set that ships with scikit-learn. 
It contains 150 records, each with 4 features. All the features are represented by real numbers. 

The data represents three classes. 


* Repeat all the above clustering approaches and steps on the above data 
* Normalize the data then repeat all the above steps 
* Compare between the different clustering approaches 

In [119]:
from sklearn.datasets import load_iris
iris_data = load_iris()
iris_data.target[[10, 25, 50]]
# array([0, 0, 1])
list(iris_data.target_names)
# ['setosa', 'versicolor', 'virginica']
iris_df = pd.DataFrame(data=iris_data.data, columns=iris_data.feature_names)

* kmeans using predefined

In [None]:
# Perform K-means clustering with different values of K
K_values = range(2, 10)  # Values of K to try
distortions = []  # List to store distortion values
silhouette_scores = []  # List to store silhouette scores

for k in K_values:
    km = KMeans(n_clusters=k, random_state=42)
    km.fit(iris_df)
    distortions.append(km.inertia_)  # Sum of squared distances to closest cluster center
    silhouette_scores.append(silhouette_score(iris_df, km.labels_))  # Silhouette score

    # Display clusters for each value of K
    plt.figure()
    plt.title("K-means clustering with K={}".format(k))
    display_cluster(iris_df, km, k)

# Plot distortion function versus K
plt.figure()
plt.plot(K_values, distortions, marker='o')
plt.xlabel('Number of clusters (K)')
plt.ylabel('Distortion')
plt.title('Distortion versus K')

# Plot silhouette score versus K
plt.figure()
plt.plot(K_values, silhouette_scores, marker='o')
plt.xlabel('Number of clusters (K)')
plt.ylabel('Silhouette Score')
plt.title('Silhouette Score versus K')

# Find the best value of K based on silhouette score
best_K = K_values[np.argmax(silhouette_scores)]
print("Best value of K based on silhouette score:", best_K)
print("Best silhouette score:",np.max(silhouette_scores) )
plt.show()

* hierarchical clustering

In [None]:
hierarchical_clustering(iris_df)

* DBSCAN clustering

In [None]:
DBSCAN_clustering(iris_df)

* GMM gaussian clustering

In [None]:
GMMClustering(iris_df)

## after normalizing iris data

In [130]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
iris_normalized = scaler.fit_transform(iris_df)
# Keep the original feature names so that plot labels stay meaningful
iris_normalized_df = pd.DataFrame(iris_normalized, columns=iris_df.columns)
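
To see what normalization does before re-running the clustering, the sketch below compares `MinMaxScaler` (used above) with `StandardScaler` (imported at the top of the notebook) on the iris features. It is a minimal illustration of the two transforms, not a recommendation of one over the other.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = load_iris().data

# MinMaxScaler maps every feature to the interval [0, 1]
X_minmax = MinMaxScaler().fit_transform(X)

# StandardScaler gives every feature zero mean and unit variance
X_std = StandardScaler().fit_transform(X)

# Per-feature range (max - min) before and after scaling
print("raw range     :", X.max(axis=0) - X.min(axis=0))
print("min-max range :", X_minmax.max(axis=0) - X_minmax.min(axis=0))
print("standard mean :", X_std.mean(axis=0).round(6))
print("standard std  :", X_std.std(axis=0).round(6))
```

In the raw data the features span different ranges, so the widest feature carries the most weight in any Euclidean distance. After either transform, all four features contribute on a comparable scale.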

* kmeans using predefined

In [None]:
# Perform K-means clustering with different values of K
K_values = range(2, 10)  # Values of K to try
distortions = []  # List to store distortion values
silhouette_scores = []  # List to store silhouette scores

for k in K_values:
    km = KMeans(n_clusters=k, random_state=42)
    km.fit(iris_normalized_df)
    distortions.append(km.inertia_)  # Sum of squared distances to closest cluster center
    silhouette_scores.append(silhouette_score(iris_normalized_df, km.labels_))  # Silhouette score

    # Display clusters for each value of K
    plt.figure()
    plt.title("K-means clustering with K={}".format(k))
    display_cluster(iris_normalized_df, km, k)

# Plot distortion function versus K
plt.figure()
plt.plot(K_values, distortions, marker='o')
plt.xlabel('Number of clusters (K)')
plt.ylabel('Distortion')
plt.title('Distortion versus K')

# Plot silhouette score versus K
plt.figure()
plt.plot(K_values, silhouette_scores, marker='o')
plt.xlabel('Number of clusters (K)')
plt.ylabel('Silhouette Score')
plt.title('Silhouette Score versus K')

# Find the best value of K based on silhouette score
best_K = K_values[np.argmax(silhouette_scores)]
print("Best value of K based on silhouette score:", best_K)
print("Best silhouette score:",np.max(silhouette_scores) )
plt.show()

* hierarchical clustering

In [None]:
hierarchical_clustering(iris_normalized_df)

* DBSCAN 

In [None]:
DBSCAN_clustering(iris_normalized_df)

* GMM clustering

In [None]:
GMMClustering(iris_normalized_df)

## Compare between the different clustering approaches

* With Euclidean distance in K-means, the best K is 5 on iris_df and 3 on iris_normalized_df.
* With Pearson distance in K-means, the best K is 2 on both iris_df and iris_normalized_df.
* For DBSCAN, normalization changed the silhouette score only slightly, while the best epsilon differed by 0.4999.
* For hierarchical clustering, the silhouette score changed slightly after normalization and the best distance threshold changed from 3 to None.
* For the Gaussian mixture, the silhouette score changed slightly after normalization and the best covariance type changed from spherical to diag.

## Customer dataset
Repeat all the above on the customer data set 

In [141]:
data_dir = "/Users/tasabeehzainyahoo.com/Desktop/Machine Learning/Assignments/Assignment 1/data/"
customer_df= pd.read_csv(data_dir + "Customer data.csv")
customer_df.set_index('ID', inplace=True)

* kmeans using predefined

In [None]:
# Perform K-means clustering with different values of K
K_values = range(2, 10)  # Values of K to try
distortions = []  # List to store distortion values
silhouette_scores = []  # List to store silhouette scores

for k in K_values:
    km = KMeans(n_clusters=k, random_state=42)
    km.fit(customer_df)
    distortions.append(km.inertia_)  # Sum of squared distances to closest cluster center
    silhouette_scores.append(silhouette_score(customer_df, km.labels_))  # Silhouette score

    # Display clusters for each value of K
    plt.figure()
    plt.title("K-means clustering with K={}".format(k))
    display_cluster(customer_df, km, k)

# Plot distortion function versus K
plt.figure()
plt.plot(K_values, distortions, marker='o')
plt.xlabel('Number of clusters (K)')
plt.ylabel('Distortion')
plt.title('Distortion versus K')

# Plot silhouette score versus K
plt.figure()
plt.plot(K_values, silhouette_scores, marker='o')
plt.xlabel('Number of clusters (K)')
plt.ylabel('Silhouette Score')
plt.title('Silhouette Score versus K')

# Find the best value of K based on silhouette score
best_K = K_values[np.argmax(silhouette_scores)]
print("Best value of K based on silhouette score:", best_K)
print("Best silhouette score:",np.max(silhouette_scores) )
plt.show()

* hierarchical clustering

In [None]:
hierarchical_clustering(customer_df)

* DBSCAN

In [None]:
DBSCAN_clustering(customer_df)

* GMM Gaussian

In [None]:
GMMClustering(customer_df)

## after normalizing customer_df

In [153]:
from sklearn.preprocessing import MinMaxScaler

scaler_customer = MinMaxScaler()
customer_normalized = scaler_customer.fit_transform(customer_df)
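# Keep the original column names and ID index after scaling
customer_normalized_df = pd.DataFrame(customer_normalized, columns=customer_df.columns, index=customer_df.index)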

* kmeans using predefined

In [None]:
# Perform K-means clustering with different values of K
K_values = range(2, 10)  # Values of K to try
distortions = []  # List to store distortion values
silhouette_scores = []  # List to store silhouette scores

for k in K_values:
    km = KMeans(n_clusters=k, random_state=42)
    km.fit(customer_normalized_df)
    distortions.append(km.inertia_)  # Sum of squared distances to closest cluster center
    silhouette_scores.append(silhouette_score(customer_normalized_df, km.labels_))  # Silhouette score

    # Display clusters for each value of K
    plt.figure()
    plt.title("K-means clustering with K={}".format(k))
    display_cluster(customer_normalized_df, km, k)

# Plot distortion function versus K
plt.figure()
plt.plot(K_values, distortions, marker='o')
plt.xlabel('Number of clusters (K)')
plt.ylabel('Distortion')
plt.title('Distortion versus K')

# Plot silhouette score versus K
plt.figure()
plt.plot(K_values, silhouette_scores, marker='o')
plt.xlabel('Number of clusters (K)')
plt.ylabel('Silhouette Score')
plt.title('Silhouette Score versus K')

# Find the best value of K based on silhouette score
best_K = K_values[np.argmax(silhouette_scores)]
print("Best value of K based on silhouette score:", best_K)
print("Best silhouette score:",np.max(silhouette_scores) )
plt.show()

* hierarchical clustering

In [None]:
hierarchical_clustering(customer_normalized_df)

* DBSCAN clustering

In [None]:
DBSCAN_clustering(customer_normalized_df)

* GMM gaussian clustering

In [None]:
GMMClustering(customer_normalized_df)

## comparing normalized and unnormalized customer df

* On the unnormalized data, DBSCAN did not work, because features with a large scale dominate the distance computation and therefore the clustering.
* After normalization, all features lie in a comparable range, so no single feature dominates and DBSCAN works.
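
The effect described above can be reproduced on a small synthetic example. The sketch below does not use the customer data set; the two features (`income` and `age`) and their distributions are hypothetical and chosen only so that one feature has a much larger scale than the other.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)

# Hypothetical customer-like features: two groups, one large-scale feature
income = np.concatenate([rng.normal(30_000, 5_000, 100), rng.normal(90_000, 5_000, 100)])
age = np.concatenate([rng.normal(30, 2, 100), rng.normal(50, 2, 100)])
X_raw = np.column_stack([income, age])

def count_clusters(labels):
    """Number of clusters found, ignoring the noise label -1."""
    return np.unique(labels[labels != -1]).size

# Raw data: even eps=3, the top of the range used in this project, is tiny
# compared with the spread of income, so almost no point has enough neighbours.
labels_raw = DBSCAN(eps=3.0, min_samples=5).fit_predict(X_raw)

# Scaled data: both features lie in [0, 1], so a small eps is meaningful
X_scaled = MinMaxScaler().fit_transform(X_raw)
labels_scaled = DBSCAN(eps=0.2, min_samples=5).fit_predict(X_scaled)

print("clusters on raw data   :", count_clusters(labels_raw))
print("clusters on scaled data:", count_clusters(labels_scaled))
```

On the raw data, distances are governed almost entirely by `income`, so an EPS in the range 0.1 to 3 cannot reach any neighbours. After scaling, the same kind of EPS spans a sizeable part of each feature's range and the two groups are recovered.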