# Module 1: Introduction to Scikit-Learn

## Part 2: K-means Clustering

In this part, we explore k-means clustering, a popular unsupervised learning algorithm that partitions a dataset into k distinct clusters based on the similarity of the data points.

### 2.1 Understanding k-means Clustering

K-means clustering partitions a dataset into a predefined number of clusters, or groups. The goal is to group similar data points together and reveal underlying patterns in the data without relying on pre-existing labels.

The key idea behind k-means clustering is to minimize the total squared distance between each data point and the centroid of its assigned cluster. This quantity is known as the within-cluster sum of squares, and Scikit-Learn exposes it as the `inertia_` attribute of a fitted model. Each cluster is represented by its centroid, which is the mean of all data points assigned to that cluster. The algorithm alternates between updating the cluster assignments and recalculating the centroids until convergence, that is, until the centroids no longer change significantly.
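The objective can be written compactly as follows. The notation is introduced here for illustration: $C_j$ is the set of points assigned to cluster $j$, $\mu_j$ is the centroid of that cluster, and $x_i$ is a data point. Standard k-means measures closeness with squared Euclidean distance:

$$
J = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2,
\qquad
\mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i
$$

Both steps of the algorithm can only lower $J$ or leave it unchanged: reassigning a point to a closer centroid reduces its term, and the mean is the point that minimizes the summed squared distance to a fixed set of points.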

Key Concepts:

- Clusters: K-means divides the data into "k" clusters, where "k" is a user-defined parameter. Each cluster represents a group of data points that are close to each other in feature space.
- Centroids: The algorithm initializes "k" centroids, one for each cluster. Centroids are the center points of the clusters and are updated during the training process.
- Assignment Step: In each iteration, each data point is assigned to the cluster whose centroid is closest to it. Standard k-means, including Scikit-Learn's `KMeans`, measures closeness with Euclidean distance.
- Update Step: After all data points are assigned to clusters, the centroids are updated to the mean of the data points within each cluster.
- Convergence: The assignment and update steps are repeated until a stopping criterion is met, typically a sufficiently small change in the centroids or a maximum number of iterations.
- Random Initialization: K-means is sensitive to the initial placement of centroids, so it's common to run the algorithm multiple times with different initializations and keep the run with the lowest within-cluster sum of squares.
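The concepts above can be sketched in a few lines of NumPy. This is a minimal illustration, not Scikit-Learn's implementation: the function name `kmeans_sketch`, its parameters, and the toy data are all hypothetical, and it uses plain random initialization from the data points.

```python
import numpy as np


def kmeans_sketch(X, k, n_iters=100, tol=1e-6, seed=0):
    """Illustrative k-means loop (hypothetical helper, not scikit-learn's code)."""
    rng = np.random.default_rng(seed)
    # Initialization: pick k distinct data points as the starting centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: squared Euclidean distance from every point
        # to every centroid, then pick the nearest centroid.
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned points
        # (an empty cluster simply keeps its previous centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Convergence check: stop once the centroids barely move.
        shift = np.linalg.norm(new_centroids - centroids)
        centroids = new_centroids
        if shift < tol:
            break
    return labels, centroids


# Toy data: two well-separated groups of 2-D points.
X_demo = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                   [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
labels_demo, centroids_demo = kmeans_sketch(X_demo, k=2)
print(labels_demo)
print(centroids_demo)
```

Which group receives label 0 depends on the random initialization, so only the grouping itself is meaningful, not the label values.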

It's important to note that k-means clustering has some limitations. It implicitly assumes that clusters are roughly spherical (isotropic), have similar variance, and are of similar size. Because centroids are means, the algorithm is also sensitive to outliers. If these assumptions are violated, other clustering algorithms, such as Gaussian Mixture Models, may be more appropriate.

### 2.2 Training

To apply the k-means algorithm, we need an unlabeled dataset. The model learns by iteratively updating the cluster assignments and centroids based on the distance between data points and centroids.

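One of the challenges in k-means clustering is determining the optimal number of clusters, k. Choosing the right value of k is crucial to obtain meaningful and interpretable clusters. Common approaches include the elbow method, which plots the within-cluster sum of squares against k and looks for the point where further increases in k yield little improvement, and silhouette analysis, which compares the average silhouette score across candidate values of k. Domain knowledge can also suggest a sensible value.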
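A rough sketch of how the elbow method and silhouette analysis might be run side by side is shown below. The dataset parameters, the range of k, and all variable names are assumptions chosen for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data for illustration only (parameters are arbitrary choices).
X_k, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=0)

inertias = {}
silhouettes = {}
for k in range(2, 8):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_k)
    # Elbow method: record the within-cluster sum of squares for each k.
    inertias[k] = model.inertia_
    # Silhouette analysis: record the mean silhouette score for each k.
    silhouettes[k] = silhouette_score(X_k, model.labels_)

for k in inertias:
    print(f"k={k}: inertia={inertias[k]:.1f}, silhouette={silhouettes[k]:.3f}")

# Candidate k with the highest mean silhouette score.
best_k = max(silhouettes, key=silhouettes.get)
print("k with the highest silhouette score:", best_k)
```

Inertia keeps shrinking as k grows, so it cannot be minimized directly; the elbow is judged by eye from the printed or plotted values. The silhouette score has a well-defined maximum, but it is still worth checking the suggested k against what is known about the data.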

It is important to scale the features before applying k-means clustering. Because the algorithm relies on distances, a feature with a much larger numeric range would otherwise dominate the result. Scikit-Learn's `StandardScaler` or `MinMaxScaler` can be used so that all features contribute comparably to the clustering process.
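The effect of scaling can be illustrated with a small synthetic experiment. The data below is hypothetical: two groups differ only in a small-scale feature, while a second feature is large-scale noise. The adjusted Rand index is used here only because the true groups of this synthetic data are known.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical data: the groups differ only in a small-scale feature,
# while the second, large-scale feature carries no group information.
true_group = np.repeat([0, 1], 100)
small_feature = true_group + rng.normal(0.0, 0.05, size=200)
large_feature = rng.normal(0.0, 1000.0, size=200)
X_raw = np.column_stack([small_feature, large_feature])

# Without scaling, Euclidean distance is dominated by the large-scale feature.
raw_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_raw)

# With scaling, each feature has zero mean and unit variance before clustering.
scaled_kmeans = make_pipeline(
    StandardScaler(),
    KMeans(n_clusters=2, n_init=10, random_state=0),
)
scaled_labels = scaled_kmeans.fit_predict(X_raw)

# Agreement between the cluster labels and the known groups.
ari_raw = adjusted_rand_score(true_group, raw_labels)
ari_scaled = adjusted_rand_score(true_group, scaled_labels)
print(f"ARI without scaling: {ari_raw:.2f}")
print(f"ARI with scaling:    {ari_scaled:.2f}")
```

Wrapping the scaler and the clusterer in one pipeline also ensures that any new data passed to `scaled_kmeans.predict` is scaled with the statistics learned during fitting.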

Additionally, k-means clustering is sensitive to the initialization of centroids. Scikit-Learn uses the k-means++ initialization method by default (`init="k-means++"`), which spreads the initial centroids apart and, in most cases, gives better results than purely random initialization.
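The initialization strategy and the number of restarts are controlled by the `init` and `n_init` parameters of `KMeans`. The following sketch compares the two built-in strategies on synthetic data; the dataset and variable names are illustrative assumptions, and the printed numbers will depend on the data and the random seed.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data for illustration only.
X_init, _ = make_blobs(n_samples=500, centers=5, cluster_std=1.0, random_state=1)

results = {}
for init_method in ("k-means++", "random"):
    # n_init=1: a single run from one initialization.
    single = KMeans(n_clusters=5, init=init_method, n_init=1, random_state=0).fit(X_init)
    # n_init=10: ten runs from different initializations; the run with the
    # lowest inertia is kept.
    repeated = KMeans(n_clusters=5, init=init_method, n_init=10, random_state=0).fit(X_init)
    results[init_method] = (single.inertia_, repeated.inertia_)
    print(f"{init_method:>10}: single run inertia={single.inertia_:.1f}, "
          f"best of 10 inertia={repeated.inertia_:.1f}")
```

Lower inertia means a tighter clustering for the same k, so comparing these values is one way to see how much a given dataset depends on initialization.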

Once trained, we can use the k-means model to predict the cluster labels for new, unseen data points. The model assigns each data point to the nearest centroid, based on the distance metric used.
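As a small sketch of this prediction step, the example below fits a model on toy data (made up for illustration) and checks that `predict` agrees with a manual nearest-centroid assignment.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy training data: two well-separated groups (values are arbitrary).
X_train = np.array([[1.0, 1.0], [1.5, 2.0], [1.0, 1.5],
                    [8.0, 8.0], [8.5, 9.0], [9.0, 8.0]])
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_train)

# New, unseen points: one near each group.
X_new = np.array([[0.0, 0.5], [10.0, 9.0]])
new_labels = model.predict(X_new)

# Manual equivalent: Euclidean distance from each new point to each
# learned centroid, then the index of the nearest one.
dists = np.linalg.norm(X_new[:, None, :] - model.cluster_centers_[None, :, :], axis=2)
manual_labels = dists.argmin(axis=1)

print("predict():", new_labels)
print("manual:   ", manual_labels)
```

Note that `predict` does not update the centroids; the new points are only labeled using the centroids learned during `fit`.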

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, davies_bouldin_score
import matplotlib.pyplot as plt

# Generate synthetic 2-D data from three Gaussian blobs (the true labels are discarded)
X, _ = make_blobs(n_samples=400, n_features=2, centers=3, cluster_std=2.0, random_state=42)

# Fit k-means with k=3; random_state makes the centroid initialization reproducible
kmeans = KMeans(n_clusters=3, random_state=42)
kmeans.fit(X)

# Cluster index assigned to each sample, and the learned centroid coordinates
labels = kmeans.labels_
centroids = kmeans.cluster_centers_

# Internal evaluation metrics (no ground-truth labels required)
silhouette_avg = silhouette_score(X, labels)
db_index = davies_bouldin_score(X, labels)

print(f"Silhouette Score: {silhouette_avg:.2f}")
print(f"Davies-Bouldin Index: {db_index:.2f}")

# Plot the samples colored by cluster, with the centroids marked as red crosses
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', s=20)
plt.scatter(centroids[:, 0], centroids[:, 1], c='red', s=40, marker='X', label='Cluster Centers')
plt.title('K-means Clustering')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.legend()
plt.show()
```

This code example demonstrates k-means clustering on synthetic data with three clusters. It calculates and prints two clustering evaluation metrics, the Silhouette Score and the Davies-Bouldin Index, to assess the quality of the clustering.

The Silhouette Score measures how cohesive and well-separated the clusters are. It ranges from -1 to 1, and higher values are better. In this example, the Silhouette Score is approximately 0.58, indicating reasonably well-defined clusters.

The Davies-Bouldin Index averages, over all clusters, the similarity between each cluster and the cluster most similar to it. Lower values indicate better separation, with 0 as the minimum. In this case, it's approximately 0.60, suggesting good clustering quality.

These metrics provide insights into the clustering quality, and the visualization shows the data points and cluster centers.

### 2.3 Summary

K-means clustering is a fundamental unsupervised learning technique for partitioning data into clusters based on similarity. It's widely used in applications such as customer segmentation, image compression, and anomaly detection. Understanding how the algorithm works, choosing its parameters appropriately, and evaluating the results are all crucial for successful clustering. Keep in mind that k-means has limitations, such as its sensitivity to cluster shape and the need to specify the number of clusters in advance.