# K-Means Clustering with Scikit-Learn

This notebook demonstrates K-Means clustering using the scikit-learn library, with visualizations of each step.

## What is K-Means?
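K-Means is an unsupervised learning algorithm that partitions data into K clusters by assigning each point to its nearest cluster centroid, so that the within-cluster sum of squared distances is as small as possible.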

**Algorithm Steps:**
1. Initialize K centroids (randomly, or with the `k-means++` seeding that scikit-learn uses by default)
2. Assign each point to its nearest centroid
3. Update each centroid to the mean of its assigned points
4. Repeat steps 2-3 until the centroids stop moving
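
The four steps above can be sketched in plain NumPy. This is a minimal illustration, not scikit-learn's implementation: the function names `assign` and `lloyd_kmeans`, their parameters, and the toy data are assumptions made for this example, and the sketch uses the simple random initialization from step 1.

```python
import numpy as np

def assign(X, centroids):
    """Return the index of the nearest centroid for every row of X."""
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)

def lloyd_kmeans(X, k, max_iter=100, tol=1e-8, seed=0):
    """Hypothetical helper: plain K-Means iteration with random initialization."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Step 1: use k distinct data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign each point to its nearest centroid
        labels = assign(X, centroids)
        # Step 3: move each centroid to the mean of its assigned points
        # (an empty cluster simply keeps its previous centroid)
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:
                new_centroids[j] = members.mean(axis=0)
        # Step 4: stop once the centroids no longer move
        shift = np.linalg.norm(new_centroids - centroids)
        centroids = new_centroids
        if shift < tol:
            break
    # Final assignment so the labels match the returned centroids
    return assign(X, centroids), centroids

# Toy data: three small Gaussian blobs (assumed for illustration only)
rng = np.random.default_rng(1)
demo = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2))
                  for c in [(0, 0), (5, 5), (0, 5)]])
labels, centroids = lloyd_kmeans(demo, k=3)
print(centroids.round(2))
```

Because the starting centroids are random, different `seed` values can converge to different solutions, which is one reason scikit-learn reruns the algorithm `n_init` times and keeps the best result.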

## 1. Import Libraries

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
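import warnings
# Suppress library warnings to keep the notebook output readable;
# note that this also hides deprecation notices
warnings.filterwarnings('ignore')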

# Set random seed for reproducibility
np.random.seed(42)

## 2. Generate Sample Data

In [None]:
# Generate sample data with 4 natural clusters
X, y_true = make_blobs(n_samples=300, centers=4, n_features=2, 
                         cluster_std=0.8, random_state=42)

# Standardize features
scaler = StandardScaler()
X = scaler.fit_transform(X)

print(f"Data shape: {X.shape}")
print(f"Features: {X.shape[1]}")
print(f"Samples: {X.shape[0]}")

## 3. Visualize Raw Data

In [None]:
plt.figure(figsize=(10, 6))
plt.scatter(X[:, 0], X[:, 1], c='blue', alpha=0.6, s=50)
plt.title('Raw Data - Before K-Means Clustering', fontsize=14, fontweight='bold')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("Data points are shown in blue. Now we'll apply K-Means to find clusters.")

## 4. Apply K-Means with Sklearn

In [None]:
# Initialize and fit K-Means with K=4
k = 4
kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
y_pred = kmeans.fit_predict(X)
centroids = kmeans.cluster_centers_

print("K-Means fitted successfully!")
print(f"Number of clusters: {k}")
print(f"Inertia (sum of squared distances): {kmeans.inertia_:.2f}")
print(f"Number of iterations: {kmeans.n_iter_}")
print("\nCentroid coordinates:")
for i, centroid in enumerate(centroids):
    print(f"  Cluster {i}: ({centroid[0]:.3f}, {centroid[1]:.3f})")
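
The inertia reported by `kmeans.inertia_` is the sum of squared distances from each point to the centroid of its assigned cluster. The sketch below recomputes it by hand as a sanity check. It rebuilds the same data and model so that it runs on its own; the names `X_demo`, `km`, and `manual_inertia` are introduced only for this example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Same data and model settings as in the notebook
X_demo, _ = make_blobs(n_samples=300, centers=4, n_features=2,
                       cluster_std=0.8, random_state=42)
X_demo = StandardScaler().fit_transform(X_demo)
km = KMeans(n_clusters=4, random_state=42, n_init=10).fit(X_demo)

# Look up the centroid assigned to each point, then sum the squared distances
assigned_centroids = km.cluster_centers_[km.labels_]
manual_inertia = np.sum((X_demo - assigned_centroids) ** 2)

print(f"Manual inertia:  {manual_inertia:.4f}")
print(f"sklearn inertia: {km.inertia_:.4f}")
```

The two values should agree up to floating-point rounding.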

## 5. Visualize Clustering Results

In [None]:
# Define colors for clusters
colors = ['red', 'green', 'blue', 'orange', 'purple']

plt.figure(figsize=(12, 6))

# Subplot 1: Clusters with centroids
plt.subplot(1, 2, 1)
for i in range(k):
    cluster_points = X[y_pred == i]
    plt.scatter(cluster_points[:, 0], cluster_points[:, 1], 
               c=colors[i], label=f'Cluster {i}', alpha=0.6, s=50)

# Plot centroids
plt.scatter(centroids[:, 0], centroids[:, 1], 
           c='black', marker='X', s=300, edgecolors='white', linewidth=2,
           label='Centroids')

plt.title(f'K-Means Clustering Results (K={k})', fontsize=14, fontweight='bold')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.legend()
plt.grid(True, alpha=0.3)

# Subplot 2: Cluster sizes
plt.subplot(1, 2, 2)
cluster_sizes = np.bincount(y_pred)
bars = plt.bar(range(k), cluster_sizes, color=colors[:k])
plt.title('Cluster Sizes', fontsize=14, fontweight='bold')
plt.xlabel('Cluster')
plt.ylabel('Number of Points')
plt.xticks(range(k))
for i, bar in enumerate(bars):
    height = bar.get_height()
    plt.text(bar.get_x() + bar.get_width()/2., height,
            f'{int(height)}', ha='center', va='bottom')
plt.grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.show()

## 6. Elbow Method - Finding Optimal K

In [None]:
# Calculate inertia for different values of K
inertias = []
k_values = range(1, 11)

for k_val in k_values:
    kmeans_temp = KMeans(n_clusters=k_val, random_state=42, n_init=10)
    kmeans_temp.fit(X)
    inertias.append(kmeans_temp.inertia_)

# Plot elbow curve
plt.figure(figsize=(10, 6))
plt.plot(k_values, inertias, 'bo-', linewidth=2, markersize=8)
plt.axvline(x=4, color='red', linestyle='--', linewidth=2, label='Optimal K=4')
plt.title('Elbow Method - Finding Optimal Number of Clusters', fontsize=14, fontweight='bold')
plt.xlabel('Number of Clusters (K)')
plt.ylabel('Inertia (Within-cluster sum of squares)')
plt.grid(True, alpha=0.3)
plt.legend()
plt.xticks(k_values)
plt.tight_layout()
plt.show()

print("The elbow appears around K=4; adding more clusters beyond that reduces inertia only marginally.")
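
Reading an elbow off a plot is subjective, so it can help to cross-check the choice with a second criterion. The sketch below swaps in a different technique, the average silhouette score, and evaluates it for several values of K on the same data. The names `X_sil`, `sil_by_k`, and `best_k` are assumptions made for this example, and the range of K is arbitrary.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Same data as in the notebook
X_sil, _ = make_blobs(n_samples=300, centers=4, n_features=2,
                      cluster_std=0.8, random_state=42)
X_sil = StandardScaler().fit_transform(X_sil)

# The silhouette score needs at least two clusters, so start at K=2
sil_by_k = {}
for k_candidate in range(2, 9):
    candidate_labels = KMeans(n_clusters=k_candidate, random_state=42,
                              n_init=10).fit_predict(X_sil)
    sil_by_k[k_candidate] = silhouette_score(X_sil, candidate_labels)

for k_candidate, score in sil_by_k.items():
    print(f"K={k_candidate}: silhouette={score:.4f}")

best_k = max(sil_by_k, key=sil_by_k.get)
print(f"Highest average silhouette at K={best_k}")
```

If the silhouette peak and the elbow point to the same K, that is some evidence the choice is reasonable; if they disagree, the cluster structure is probably less clear-cut.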

## 7. Predict Cluster for New Data Points

In [None]:
# New points to classify. They must be in the same standardized space as X;
# raw measurements would first need scaler.transform()
new_points = np.array([[-2, -2], [0, 0], [2, 2], [-2, 2]])
new_predictions = kmeans.predict(new_points)

print("New points and their predicted clusters:")
for point, cluster in zip(new_points, new_predictions):
    print(f"  Point {point} -> Cluster {cluster}")
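
Since the model was fitted on standardized features, new observations recorded in the original units have to pass through the same fitted scaler before prediction. The sketch below shows that flow and also checks that `predict` amounts to a nearest-centroid lookup. The raw coordinates in `raw_points` are hypothetical values chosen for illustration, and `scaler_demo` and `km_demo` are names introduced only here.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Rebuild the notebook's data, keeping the unscaled version
X_raw, _ = make_blobs(n_samples=300, centers=4, n_features=2,
                      cluster_std=0.8, random_state=42)
scaler_demo = StandardScaler().fit(X_raw)
km_demo = KMeans(n_clusters=4, random_state=42,
                 n_init=10).fit(scaler_demo.transform(X_raw))

# Hypothetical new measurements in the ORIGINAL (unscaled) feature space
raw_points = np.array([[-2.0, 9.0], [4.5, 2.0], [-7.0, -7.0], [-9.0, 7.0]])

# Apply the scaler fitted on the training data; do not refit it
scaled_points = scaler_demo.transform(raw_points)
pred = km_demo.predict(scaled_points)

# Manual nearest-centroid lookup for comparison
dists = np.linalg.norm(
    scaled_points[:, None, :] - km_demo.cluster_centers_[None, :, :], axis=2)
manual = dists.argmin(axis=1)

for point, cluster in zip(raw_points, pred):
    print(f"  Raw point {point} -> Cluster {cluster}")
print("Matches manual nearest-centroid lookup:", np.array_equal(pred, manual))
```

Wrapping the scaler and the model in a `sklearn.pipeline.Pipeline` is another way to make sure the same transformation is always applied.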

## 8. Visualize with New Data Points

In [None]:
plt.figure(figsize=(12, 6))

# Plot original data
for i in range(k):
    cluster_points = X[y_pred == i]
    plt.scatter(cluster_points[:, 0], cluster_points[:, 1], 
               c=colors[i], label=f'Cluster {i}', alpha=0.6, s=50)

# Plot centroids
plt.scatter(centroids[:, 0], centroids[:, 1], 
           c='black', marker='X', s=300, edgecolors='white', linewidth=2,
           label='Centroids')

# Plot new data points
for point, cluster in zip(new_points, new_predictions):
    plt.scatter(point[0], point[1], c=colors[cluster], marker='s', 
               s=200, edgecolors='black', linewidth=2)

plt.title('K-Means with New Data Points Predictions', fontsize=14, fontweight='bold')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

## 9. Performance Metrics

In [None]:
# Compute internal clustering metrics: silhouette score and Davies-Bouldin index
from sklearn.metrics import silhouette_score, davies_bouldin_score

silhouette_avg = silhouette_score(X, y_pred)
davies_bouldin = davies_bouldin_score(X, y_pred)

print("Clustering Quality Metrics:")
print(f"  Inertia: {kmeans.inertia_:.2f}")
print(f"  Silhouette Score: {silhouette_avg:.4f} (range: -1 to 1, higher is better)")
print(f"  Davies-Bouldin Index: {davies_bouldin:.4f} (lower is better)")
print("\nSilhouette Score Interpretation:")
# These thresholds are rules of thumb, not formal cutoffs
if silhouette_avg > 0.5:
    print("    ✓ Strong cluster structure")
elif silhouette_avg > 0.3:
    print("    ○ Reasonable cluster structure")
else:
    print("    ✗ Weak cluster structure")
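
The metrics above are internal: they use only the data and the predicted labels. Because `make_blobs` also returns the true labels (`y_true`), an external metric can be computed for this synthetic data as well. The sketch below uses the adjusted Rand index, which ignores how the cluster IDs happen to be numbered. It rebuilds the data so it runs on its own, and the names `X_ext`, `y_ext`, and `ari` are introduced only for this example.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

# Same data as in the notebook, this time keeping the ground-truth labels
X_ext, y_ext = make_blobs(n_samples=300, centers=4, n_features=2,
                          cluster_std=0.8, random_state=42)
X_ext = StandardScaler().fit_transform(X_ext)
ext_labels = KMeans(n_clusters=4, random_state=42, n_init=10).fit_predict(X_ext)

# 1.0 means a perfect match; values near 0 mean chance-level agreement
ari = adjusted_rand_score(y_ext, ext_labels)
print(f"  Adjusted Rand Index: {ari:.4f} (1.0 = identical partitions)")
```

Ground-truth labels are rarely available for real clustering problems, so this check is mainly useful for validating a workflow on synthetic data.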

## 10. Summary

### Key Takeaways:

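1. **K-Means Algorithm**: Simple and effective for partitioning data into K clusters
2. **Centroid-based**: Minimizes the within-cluster sum of squares (inertia)
3. **Elbow Method**: A heuristic for choosing the number of clusters
4. **Scalability**: Each iteration costs roughly O(n·K·d) for n points in d dimensions, so it handles large datasets well
5. **Limitations**: 
   - Requires specifying K in advance
   - Sensitive to initialization (mitigated by `k-means++` seeding and multiple `n_init` restarts)
   - Assumes roughly spherical clusters of similar size
   - Sensitive to feature scale, which is why the features were standardized above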
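
The spherical-cluster limitation can be illustrated on data where it does not hold. The sketch below runs K-Means on scikit-learn's two-moons dataset, whose clusters are crescent-shaped, and compares the result with the true labels. The dataset choice, the noise level, and the names `X_moons`, `y_moons`, and `ari_moons` are assumptions made for this example.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

# Two interleaved crescents: clearly two groups, but not spherical ones
X_moons, y_moons = make_moons(n_samples=300, noise=0.05, random_state=42)
X_moons = StandardScaler().fit_transform(X_moons)

moon_labels = KMeans(n_clusters=2, random_state=42, n_init=10).fit_predict(X_moons)

# K-Means separates the two clusters with a straight boundary,
# so it cannot follow the curved shape of the crescents
ari_moons = adjusted_rand_score(y_moons, moon_labels)
print(f"Adjusted Rand Index on two moons: {ari_moons:.4f}")
```

A score well below 1.0 here reflects the shape assumption rather than a bad initialization; density-based or spectral methods are usually a better fit for data like this.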

### Applications:
- Customer segmentation
- Image compression
- Document clustering
- Anomaly detection

### When to Use:
✓ Data has clear, distinct clusters  
✓ Need fast clustering  
✓ Clusters are roughly spherical  
✗ K is unknown and the data is sparse