# Cluster Performance Analysis
Performance analysis in clustering methods involves assessing how well a clustering algorithm has grouped data points into meaningful clusters. Two common metrics are:
1. **Inertia or Sum of Squared Distances (SSD)**: Inertia is the sum of squared distances of samples to their closest cluster center. It measures how compact the clusters are: for a fixed number of clusters, lower inertia means tighter clusters. However, inertia keeps decreasing as more clusters are added, so on its own it is not sufficient for assessing the quality of clusters.

2. **Silhouette Score**: The silhouette score measures how similar an object is to its own cluster (cohesion) compared to other clusters (separation). The score ranges from -1 to 1, where a higher score indicates better-defined clusters.

## Inertia or Sum of Squared Distances (SSD)
Inertia, also known as the Sum of Squared Distances (SSD) or within-cluster sum of squares, is a measure used to evaluate the performance of a clustering algorithm, particularly in the context of K-means clustering. It quantifies how far the points within a cluster are from the centroid of that cluster.

Here's how inertia is calculated in the context of K-means clustering:

1. **For each data point in the dataset, calculate the squared Euclidean distance from the point to the centroid of the cluster it belongs to.**
   
2. **Sum up these squared distances for all points in all clusters.**

In mathematical terms, if $C_i$ represents the centroid of the $i$-th cluster and $x_j$ represents the $j$-th data point in that cluster, the inertia $I$ is calculated as:

$$ I = \sum_{i=1}^{k} \sum_{j=1}^{n_i} \left\| x_j - C_i \right\|^2 $$

where:
- $k$ is the number of clusters.
- $n_i$ is the number of data points in the $i$-th cluster.

The objective of K-means clustering is to minimize this inertia. In other words, K-means tries to find cluster assignments and centroids that minimize the sum of squared distances of each point to the centroid of its assigned cluster.
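The two calculation steps above can be checked by hand. The following is a minimal sketch, assuming synthetic data from `make_blobs` (the sample size, number of centers, and variable names such as `X_demo` are illustrative choices, not part of the original notebook). It recomputes the inertia with NumPy and compares it with the `inertia_` attribute reported by scikit-learn.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data; the parameters here are illustrative assumptions.
X_demo, _ = make_blobs(n_samples=200, centers=3, n_features=2, random_state=0)

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_demo)

# Step 1: squared Euclidean distance from each point to its assigned centroid.
assigned_centroids = km.cluster_centers_[km.labels_]
squared_distances = np.sum((X_demo - assigned_centroids) ** 2, axis=1)

# Step 2: sum the squared distances over all points in all clusters.
manual_inertia = squared_distances.sum()

print(f"Manual inertia:  {manual_inertia:.4f}")
print(f"kmeans.inertia_: {km.inertia_:.4f}")
```

The two printed values should agree up to floating-point rounding.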

When using the elbow method for determining the optimal number of clusters (K), practitioners often plot the values of inertia for different values of K and look for the "elbow" in the plot. The elbow is the point where the rate of decrease in inertia starts to slow down, and adding more clusters does not lead to a significant reduction in inertia. The optimal K is often chosen at this elbow point, as it represents a good balance between model complexity and the ability to capture the structure of the data.

In [5]:
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
import numpy as np
import matplotlib.pyplot as plt


In [None]:
# Generate some random data for demonstration
X, y = make_blobs(n_samples=100, centers=5, n_features=2, random_state=42)
plt.scatter(X[:, 0], X[:, 1])

In [None]:
# Specify the number of clusters (K)
k = 5

# Fit KMeans model to the data
kmeans = KMeans(n_clusters=k, random_state=42)
kmeans.fit(X)

# Get the inertia (Sum of Squared Distances)
inertia = kmeans.inertia_

print(f"Inertia: {inertia}")


# How to Choose the Value of K?

## Elbow Method
The elbow method is a technique used in cluster analysis and machine learning to determine the optimal number of clusters for a dataset. When performing clustering, such as K-means clustering, you need to specify the number of clusters beforehand. The elbow method helps you find the point at which adding more clusters does not significantly improve the model's performance.

1. Run the algorithm with different numbers of clusters (k): Execute the clustering algorithm (e.g., K-means) on the dataset for a range of values of k. Typically, you would choose a range of k values, such as 1 to 10.

2. Compute the sum of squared distances (SSD): For each value of k, calculate the sum of squared distances from each point to its assigned cluster center. The sum of squared distances is also known as the "inertia" or "within-cluster sum of squares."

3. Plot the results: Create a plot where the x-axis represents the number of clusters (k), and the y-axis represents the corresponding sum of squared distances.

4. Identify the "elbow" in the plot: The idea is to look for a point on the plot where adding more clusters does not result in a significant decrease in the sum of squared distances. This point is often referred to as the "elbow." The elbow represents a trade-off between the model's complexity (number of clusters) and its goodness of fit.

NOTE:

The optimal number of clusters is usually the value at the elbow, as it indicates a good balance between capturing the underlying structure of the data and avoiding overfitting.

Keep in mind that the elbow method is a heuristic, and the interpretation of the elbow may not always be straightforward. In some cases, the elbow may not be well-defined, and other methods or domain knowledge may be needed to determine the appropriate number of clusters.
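Reading the elbow off a plot is subjective, so it can help to have a rough numerical check. The sketch below shows one possible heuristic, not a standard definition: after rescaling both axes to [0, 1], it picks the k whose point lies farthest below the straight line joining the first and last points of the curve. The helper name `find_elbow` and the toy inertia values are hypothetical, and the helper assumes the inertia is largest at the first k and smallest at the last.

```python
import numpy as np

def find_elbow(k_values, inertias):
    """Hypothetical helper: return the k whose point lies farthest below the
    line joining the first and last points of the (rescaled) inertia curve."""
    k = np.asarray(k_values, dtype=float)
    inertia = np.asarray(inertias, dtype=float)

    # Rescale both axes to [0, 1] so that neither dominates the comparison.
    k_norm = (k - k.min()) / (k.max() - k.min())
    inertia_norm = (inertia - inertia.min()) / (inertia.max() - inertia.min())

    # For a decreasing curve the joining line runs from (0, 1) to (1, 0),
    # i.e. x + y = 1, so the gap below the line is 1 - x - y.
    gap = 1.0 - k_norm - inertia_norm
    return int(k[np.argmax(gap)])

# Toy values made up for illustration: a sharp drop up to k=3, then a plateau.
toy_k = [1, 2, 3, 4, 5, 6]
toy_inertia = [100, 50, 10, 9, 8, 7]
print(find_elbow(toy_k, toy_inertia))  # prints 3
```

Treat the result as a starting point for inspection rather than a final answer; when the curve has no clear bend, this heuristic still returns some k.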

In [53]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
# from sklearn.datasets import make_blobs
from sklearn.datasets import load_iris

In [57]:
# Load the Iris dataset (150 samples, 4 features)
# Alternative: synthetic data from make_blobs
# X, _ = make_blobs(n_samples=3000, centers=10, cluster_std=1.0, random_state=42)
# plt.scatter(X[:,0], X[:,1])
iris_data = load_iris()
X = iris_data.data

In [58]:
# Create an empty list to store the sum of squared distances (inertia) for each k
inertia_values = []


In [None]:
# Try different values of k (from 1 to 10, for example)
for k in range(1, 11):
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(X)
    inertia_values.append(kmeans.inertia_)

# Plot the elbow curve
plt.plot(range(1, 11), inertia_values, marker='o')
plt.title('Elbow Method For Optimal k')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Sum of Squared Distances (Inertia)')
plt.show()


In [None]:
inertia_values

In [None]:
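# Fit K-means with a trial value of K (7 is arbitrary here) and get the cluster labels
kmeans_test = KMeans(n_clusters=7, random_state=42)
labels = kmeans_test.fit_predict(X)

# Plot the first two features, colored by cluster, with the centroids in red
plt.scatter(X[:, 0], X[:, 1], s=5, c=labels)
plt.scatter(kmeans_test.cluster_centers_[:, 0], kmeans_test.cluster_centers_[:, 1], color='red')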

In [47]:
kmeans_test.inertia_

308.2793763938471

## Exercise 1
Take a real-life dataset and apply the elbow method to find the optimal value of K.

# Silhouette Score
The Silhouette Score is a metric used to evaluate the quality of a clustering produced by a technique such as K-means. It measures how well-defined the clusters are by comparing, for each data point, how close it is to the other points in its own cluster (cohesion) with how close it is to the points of the nearest other cluster (separation).

The Silhouette Score for a single data point $i$ is defined as follows:

$$ \text{Silhouette}(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}} $$

where:
- $a(i)$ is the average distance from the $i$-th data point to the other data points in the same cluster (intra-cluster distance).
- $b(i)$ is the smallest average distance from the $i$-th data point to the data points of any single other cluster, that is, the average distance to the nearest neighboring cluster (inter-cluster distance).

The overall Silhouette Score for the entire dataset is the average of the Silhouette scores of the individual data points, and it ranges from -1 to 1. A high overall score indicates that most points are well matched to their own cluster and poorly matched to neighboring clusters.

Interpretation of Silhouette Score:
- A score close to 1 suggests that the object is well matched to its own cluster and poorly matched to neighboring clusters.
- A score around 0 indicates that the object lies on or near the boundary between two clusters, which is typical of overlapping clusters.
- A score close to -1 suggests that the object is poorly matched to its own cluster and well matched to a neighboring cluster.
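To make the definitions of $a(i)$ and $b(i)$ concrete, the sketch below computes the per-point score directly from the formula and compares it with scikit-learn's `silhouette_samples`. The helper `silhouette_for_point`, the synthetic data, and the variable names are illustrative assumptions. Returning 0 for a point that is alone in its cluster follows the usual convention.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples

# Synthetic data and a K-means labeling; the parameters are illustrative.
X_demo, _ = make_blobs(n_samples=150, centers=3, random_state=0)
labels_demo = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_demo)

def silhouette_for_point(i, X, labels):
    """Hypothetical helper: silhouette of sample i from the a(i), b(i) definition."""
    distances = np.linalg.norm(X - X[i], axis=1)  # Euclidean distance from i to every point
    own = labels[i]

    same_cluster = labels == own
    n_same = same_cluster.sum()
    if n_same == 1:
        return 0.0  # usual convention for a single-point cluster

    # a(i): mean distance to the OTHER members of i's cluster
    # (the distance from i to itself is 0, so only the denominator changes).
    a = distances[same_cluster].sum() / (n_same - 1)

    # b(i): smallest mean distance to the members of any other cluster.
    b = min(distances[labels == other].mean()
            for other in np.unique(labels) if other != own)

    return (b - a) / max(a, b)

manual = np.array([silhouette_for_point(i, X_demo, labels_demo)
                   for i in range(len(X_demo))])
reference = silhouette_samples(X_demo, labels_demo)

print("Matches silhouette_samples:", np.allclose(manual, reference, atol=1e-6))
print(f"Average silhouette: {manual.mean():.4f}")
```

Averaging the per-point values gives the overall score that `silhouette_score` returns.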

In [61]:
from sklearn.metrics import silhouette_score
from sklearn.cluster import KMeans
import numpy as np
from sklearn.datasets import make_blobs

In [62]:
# Generate some random data for demonstration
X, _ = make_blobs(n_samples=3000, centers=4, cluster_std=1.0, random_state=42)

In [70]:
# Specify the number of clusters (K)
k = 4

# Fit KMeans model to the data
kmeans = KMeans(n_clusters=k, random_state=42)
kmeans.fit(X)

# Predict cluster labels
labels = kmeans.predict(X)

# Calculate Silhouette Score
silhouette_avg = silhouette_score(X, labels)

print(f"Silhouette Score: {silhouette_avg}")



Silhouette Score: 0.7914720000513819


## Exercise 2
Use the Silhouette Score to determine the value of K in K-means. Plot the Silhouette Score against K and look for the K value that corresponds to the peak of the silhouette scores.

In [83]:
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import numpy as np
import matplotlib.pyplot as plt
# from sklearn.datasets import make_blobs
from sklearn.datasets import load_iris

In [84]:
# Load the Iris dataset (150 samples, 4 features)
# Alternative: synthetic data from make_blobs
# X, _ = make_blobs(n_samples=3000, centers=7,
#                   cluster_std=1.0, random_state=42)
data = load_iris()
X = data.data

In [None]:
# Specify a range of K values
k_values = range(2, 11)  # You can adjust this range based on your problem

# Store the silhouette scores for each K
silhouette_scores = []

# Iterate over different K values
for k in k_values:
    kmeans = KMeans(n_clusters=k, random_state=42)
    labels = kmeans.fit_predict(X)
    silhouette_avg = silhouette_score(X, labels)
    silhouette_scores.append(silhouette_avg)

# Plot the Silhouette Scores for each K
plt.plot(k_values, silhouette_scores, marker='o')
plt.xlabel('Number of clusters (K)')
plt.ylabel('Average Silhouette Score')
plt.title('Silhouette Score for Different K Values')
plt.show()
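The peak can also be read off numerically rather than from the plot. The following is a minimal sketch, assuming the same Iris data and range of K as above; the names `X_iris`, `candidate_k`, and `best_k` are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import silhouette_score

X_iris = load_iris().data
candidate_k = list(range(2, 11))  # the silhouette needs at least 2 clusters

# Average silhouette score for each candidate K.
scores = []
for k in candidate_k:
    labels_k = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_iris)
    scores.append(silhouette_score(X_iris, labels_k))

# The K at the peak of the curve.
best_k = candidate_k[int(np.argmax(scores))]
print(f"Best K by silhouette: {best_k} (score {max(scores):.3f})")
```

The K at the peak need not coincide with the number of known classes in a labeled dataset, so it is worth comparing the result with the elbow plot and with domain knowledge.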


## Silhouette Score for Hierarchical Agglomerative Clustering (HAC)

In [88]:
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

In [91]:
X, _ = make_blobs(n_samples=300, centers=4,
                  cluster_std=1.0, random_state=42)

In [None]:
# Specify a range of linkage methods and corresponding K values
linkage_methods = ['ward', 'complete', 'average']
k_values = range(2, 11)

# Store the silhouette scores for each combination of linkage method and K
silhouette_scores = {}

# Iterate over different linkage methods
for linkage in linkage_methods:
    silhouette_scores[linkage] = []

    # Iterate over different K values
    for k in k_values:
        hac = AgglomerativeClustering(n_clusters=k, linkage=linkage)
        labels = hac.fit_predict(X)
        silhouette_avg = silhouette_score(X, labels)
        silhouette_scores[linkage].append(silhouette_avg)

# Plot the Silhouette Scores for each linkage method and K value
for linkage, scores in silhouette_scores.items():
    plt.plot(k_values, scores, marker='o', label=linkage)

plt.xlabel('Number of clusters (K)')
plt.ylabel('Average Silhouette Score')
plt.title('Silhouette Score for Different Linkage Methods and K Values')
plt.legend()
plt.show()


# NOTE
Can the Silhouette Score be used with DBSCAN and mean shift?
The Silhouette Score is a metric commonly used with partitioning clustering algorithms (such as K-means) and hierarchical clustering methods. However, it is not typically applied to density-based clustering algorithms like DBSCAN (Density-Based Spatial Clustering of Applications with Noise) and mean shift.

Here's why:

1. **DBSCAN:**
   - DBSCAN doesn't necessarily produce clusters with well-defined shapes or sizes. It identifies dense regions separated by areas of lower point density, and the number of clusters is not predetermined. The Silhouette Score implicitly favors compact, well-separated clusters: its calculation compares a point's average distance to the other points in its own cluster with its average distance to the points of the nearest other cluster. A DBSCAN cluster can be elongated or curved, so points in the same cluster may be far apart, which lowers the score even when the clustering is sensible. DBSCAN also labels some points as noise (label -1 in scikit-learn), and these do not form a cluster in the sense the score expects.

2. **Mean Shift:**
   - Mean shift is a non-parametric clustering algorithm that identifies modes in the data distribution. Like DBSCAN, it doesn't assume a specific number of clusters, and the resulting clusters may have irregular shapes. Silhouette Score, designed for partitioning clusters, may not be suitable for evaluating the quality of clusters formed by mean shift.

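For density-based clustering algorithms like DBSCAN, density-aware validity measures such as DBCV (Density-Based Clustering Validation) or visual inspection of the resulting clusters may be more appropriate. The Davies-Bouldin Index is sometimes used as well, but it is centroid-based and, like the Silhouette Score, favors compact, convex clusters.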
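Whichever internal index is applied to DBSCAN output, the noise points need handling first. The sketch below is one way to do this under stated assumptions: the blob centers, `eps`, and `min_samples` are illustrative guesses rather than tuned values, and the Davies-Bouldin Index mentioned above is used only as an example metric. It drops the points labeled -1 before scoring.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score

# Three well-separated synthetic blobs; the centers are chosen for illustration.
X_demo, _ = make_blobs(n_samples=300, centers=[[0, 0], [5, 5], [-5, 5]],
                       cluster_std=0.5, random_state=0)

# eps and min_samples are illustrative guesses, not tuned values.
db_labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X_demo)

# DBSCAN marks noise with the label -1; exclude those points so they are
# not scored as if they formed a cluster of their own.
clustered_mask = db_labels != -1
n_clusters = len(set(db_labels[clustered_mask]))
print(f"Clusters found: {n_clusters}, noise points: {np.sum(~clustered_mask)}")

# The index is only defined when at least two clusters remain.
if n_clusters >= 2:
    dbi = davies_bouldin_score(X_demo[clustered_mask], db_labels[clustered_mask])
    print(f"Davies-Bouldin Index (lower is better): {dbi:.3f}")
```

Excluding noise makes the score describe only the clustered points, so the fraction of points labeled as noise should be reported alongside it.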

For mean shift and other non-parametric methods, the evaluation is often done through visual examination of the resulting clusters and assessing how well they capture the underlying structure of the data.

In summary, while the Silhouette Score is a useful metric for certain types of clustering algorithms, it is not universally applicable to all clustering methods. Always consider the characteristics of the algorithm and the nature of the data when selecting an appropriate evaluation metric.