# Unsupervised Learning & KMeans Clustering

This notebook demonstrates the use of KMeans clustering—an unsupervised learning technique—to discover inherent groupings in the Breast Cancer dataset. We then compare the clustering assignments with the known labels to assess the algorithm's ability to recover meaningful groups.

In this notebook, we cover:

1. **Overview of Unsupervised Learning & KMeans**  
   - Introduction to clustering when there is no response variable.
   - Explanation of how KMeans partitions data into k clusters by minimizing within-cluster variance.
   
2. **Data Preprocessing**  
   - Loading the Breast Cancer dataset.
   - Centering and scaling the features, which is crucial for KMeans performance.

3. **KMeans Clustering Computation and Visualization**  
   - Running KMeans on the standardized data.
   - Visualizing the clustering results in 2D using PCA for dimensionality reduction.
   - Comparing the cluster assignments with the true labels (benign vs. malignant).

4. **Evaluation and Discussion**  
   - Assessing clustering performance with metrics (e.g., Adjusted Rand Index).
   - Discussing the insights and potential limitations.

5. **Choosing the Number of Clusters**  
   - Comparing values of k with the elbow method and the silhouette score.

In [None]:
# Importing necessary libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

## Step 1: Load and Preprocess the Data

We begin by loading the Breast Cancer dataset.
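- The Breast Cancer Wisconsin (Diagnostic) dataset contains 30 numeric measurements for each of 569 tumors, labeled malignant or benign.
- Here, the dataset provides:
    - `X` as a feature matrix (569 rows, 30 columns)
    - `y` as a target variable (0 = malignant, 1 = benign)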

In [None]:
from sklearn.datasets import load_breast_cancer

# Load the Breast Cancer Wisconsin dataset
breast_cancer = load_breast_cancer()
X = breast_cancer.data  # Feature matrix
y = breast_cancer.target  # Target variable (diagnosis)
feature_names = breast_cancer.feature_names
target_names = breast_cancer.target_names

print(y)
# Pass feature_names directly; wrapping it in a list would create MultiIndex columns
pd.DataFrame(X, columns=feature_names)

Scaling is critical here because KMeans relies on distance calculations and can be biased by the scale of features.
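
To make the effect of scale concrete, here is a minimal sketch on a made-up three-row array (the data and the names `X_scale_demo` and `pairwise_dist` are assumptions for illustration, not part of the dataset). It compares Euclidean distances before and after standardizing each column to zero mean and unit variance, which is the transformation `StandardScaler` applies.

```python
import numpy as np

# Hypothetical toy data: column 0 is on a scale of thousands, column 1 on a scale of units.
X_scale_demo = np.array([
    [1000.0, 1.0],
    [1010.0, 9.0],
    [2000.0, 1.5],
])

def pairwise_dist(A):
    """Euclidean distance between every pair of rows."""
    diff = A[:, None, :] - A[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

# Raw distances: the large-scale column dominates.
d_raw = pairwise_dist(X_scale_demo)

# Standardize each column to zero mean and unit variance.
X_scale_demo_std = (X_scale_demo - X_scale_demo.mean(axis=0)) / X_scale_demo.std(axis=0)
d_std = pairwise_dist(X_scale_demo_std)

print("Raw distances from row 0:", d_raw[0])
print("Standardized distances from row 0:", d_std[0])
```

In the raw data, row 0 looks far closer to row 1 than to row 2 purely because of the first column's units. After standardization, both columns contribute on a comparable footing.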

In [None]:
from sklearn.preprocessing import StandardScaler

# Center and scale the features: KMeans relies on Euclidean distances,
# and PCA (used later for plotting) is also sensitive to variable scales.
scaler = StandardScaler()
X_std = scaler.fit_transform(X)

## Step 2: Compute KMeans Clustering
KMeans clustering partitions the data into k clusters by iteratively assigning points to the nearest cluster centroid and then updating the centroids based on the cluster’s mean.

Key Concepts:
- Initialization: Select k initial centroids, either at random or with a seeding scheme such as k-means++ (the scikit-learn default).

- Assignment & Update: Reassign points and recalculate centroids until convergence.

- Choosing k: For the Breast Cancer dataset (with two known classes), we set k = 2.
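
The assignment and update steps above can be sketched in a few lines of NumPy. This is an illustrative version of Lloyd's algorithm with purely random initialization, not scikit-learn's implementation; the function name `lloyd_kmeans`, its parameters, and the toy blobs are assumptions made for this example.

```python
import numpy as np

def lloyd_kmeans(X, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm with random initialization (illustrative only)."""
    rng = np.random.default_rng(seed)
    # Initialization: pick k distinct data points as the starting centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment: each point goes to its nearest centroid (squared Euclidean distance).
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Update: each centroid moves to the mean of its assigned points.
        # An empty cluster keeps its previous centroid.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Convergence: stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Hypothetical toy data: two well-separated blobs of 50 points each.
rng = np.random.default_rng(42)
X_demo = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(50, 2)),
    rng.normal(loc=5.0, scale=0.5, size=(50, 2)),
])
labels_demo, centroids_demo = lloyd_kmeans(X_demo, k=2)
print("Centroids:\n", centroids_demo)
print("Cluster sizes:", np.bincount(labels_demo, minlength=2))
```

In practice, `sklearn.cluster.KMeans` is preferable: it adds k-means++ seeding and multiple restarts, which make the result less dependent on the initial centroids.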

In [None]:
from sklearn.cluster import KMeans

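# Set the number of clusters to 2, as we have two classes (malignant and benign)
# n_init and random_state are illustrative choices made here for reproducibility
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
clusters = kmeans.fit_predict(X_std)

# Output the centroids and first few cluster assignments
print("Centroids (standardized feature space):\n", kmeans.cluster_centers_)
print("First 10 cluster assignments:", clusters[:10])
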

## Step 3: Visualization of KMeans Clustering Results
Visualization helps us understand how well KMeans has partitioned the data. However, our dataset is high-dimensional, so we first reduce it to 2 dimensions using PCA for visualization. We then plot the PCA scores with colors corresponding to the cluster assignments.
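
A 2D projection discards part of the variance in the data, so it helps to know how much the two components retain before reading too much into the plots. The sketch below checks this with `explained_variance_ratio_`; the names `X_demo_std`, `pca_demo`, and `ratios` are chosen here for illustration, and the data is reloaded so the block runs on its own.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Reload and standardize the data so this example is self-contained.
X_demo_std = StandardScaler().fit_transform(load_breast_cancer().data)
pca_demo = PCA(n_components=2).fit(X_demo_std)

# Fraction of the total variance captured by each of the two components.
ratios = pca_demo.explained_variance_ratio_
print("Explained variance ratio per component:", ratios)
print("Total variance retained in 2D:", ratios.sum())
```

If the retained share is modest, clusters that overlap in the 2D plot may still be well separated in the full feature space where KMeans actually operates.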

### 3a. 2D Scatter Plot of Clustering Results Using PCA

In [None]:
from sklearn.decomposition import PCA

# Reduce the data to 2 dimensions for visualization using PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_std)

# Create a scatter plot of the PCA-transformed data, colored by KMeans cluster labels
plt.figure(figsize=(8, 6))
plt.scatter(X_pca[clusters == 0, 0], X_pca[clusters == 0, 1],
            c='navy', alpha=0.7, edgecolor='k', s=60, label='Cluster 0')
plt.scatter(X_pca[clusters == 1, 0], X_pca[clusters == 1, 1],
            c='darkorange', alpha=0.7, edgecolor='k', s=60, label='Cluster 1')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('KMeans Clustering: 2D PCA Projection')
plt.legend(loc='best')
plt.grid(True)
plt.show()

### 3b. Comparing Clusters with True Labels
Even though KMeans is unsupervised, we can compare its cluster assignments with the actual labels to gauge performance.

In [None]:
# For comparison, visualize true labels using PCA (same 2D projection)
plt.figure(figsize=(8, 6))
colors = ['navy', 'darkorange']
for i, target_name in enumerate(target_names):
    plt.scatter(X_pca[y == i, 0], X_pca[y == i, 1],
                color=colors[i], alpha=0.7, edgecolor='k', s=60, label=target_name)
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('True Labels: 2D PCA Projection')
plt.legend(loc='best')
plt.grid(True)
plt.show()



---



## Step 4: Evaluation and Discussion
Although KMeans clustering is unsupervised, we can assess how well the clusters match the true labels.
- Besides label-matched accuracy, metrics such as the Adjusted Rand Index (ARI) compare cluster assignments with the true labels and do not depend on how the clusters happen to be numbered.
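
The following sketch illustrates why cluster numbering matters for accuracy but not for ARI. The six-element label arrays are made up for this example: the grouping is perfect, but the cluster ids are swapped relative to the true labels.

```python
import numpy as np
from sklearn.metrics import accuracy_score, adjusted_rand_score

# Hypothetical toy labels: the clustering is perfect, but the cluster ids are swapped.
y_true = np.array([0, 0, 0, 1, 1, 1])
y_clust = np.array([1, 1, 1, 0, 0, 0])

# Naive accuracy is misleading because cluster ids carry no meaning.
naive_acc = accuracy_score(y_true, y_clust)

# For two clusters, try both labelings and keep the better one.
matched_acc = max(naive_acc, accuracy_score(y_true, 1 - y_clust))

# ARI needs no matching step: it only looks at which points are grouped together.
ari_demo = adjusted_rand_score(y_true, y_clust)

print("Naive accuracy:", naive_acc)
print("Matched accuracy:", matched_acc)
print("ARI:", ari_demo)
```

The complement trick only works for two clusters. With more clusters, ARI (or an explicit label-matching step) is the safer choice.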

In [None]:

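from sklearn.metrics import accuracy_score, adjusted_rand_score

# Since KMeans labels are arbitrary (e.g., 0 and 1) and may not match the true labels directly,
# we compute accuracy for both the original labels and their complement, and choose the higher value.
accuracy = max(accuracy_score(y, clusters), accuracy_score(y, 1 - clusters))
ari = adjusted_rand_score(y, clusters)
print(f"Accuracy: {accuracy:.3f}")
print(f"Adjusted Rand Index: {ari:.3f}")
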

## Step 5: Evaluating the Best Number of Clusters
Determining the optimal number of clusters is a common challenge in clustering applications. Two popular methods to address this are:

- Elbow Method: We plot the Within-Cluster Sum of Squares (WCSS) against different values of k. The "elbow" point, where the rate of decrease changes sharply, suggests an optimal value for k.

- Silhouette Score: This score quantifies how similar a point is to its own cluster compared to other clusters. Scores range from -1 to 1, and a higher score indicates better clustering. We compute the average silhouette score for different values of k and select the one with the highest score.
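
To make the WCSS definition concrete, here is a small sketch that computes it by hand for a made-up set of four points (the helper `within_cluster_ss` and the array `X_pairs` are assumptions for illustration). A fitted scikit-learn `KMeans` model exposes the same quantity as `inertia_`.

```python
import numpy as np

def within_cluster_ss(X, labels):
    """Sum of squared distances from each point to the mean of its cluster."""
    total = 0.0
    for j in np.unique(labels):
        members = X[labels == j]
        total += ((members - members.mean(axis=0)) ** 2).sum()
    return total

# Hypothetical toy data: four points forming two tight pairs.
X_pairs = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])

# One cluster containing everything versus one cluster per pair.
wcss_one = within_cluster_ss(X_pairs, np.array([0, 0, 0, 0]))
wcss_two = within_cluster_ss(X_pairs, np.array([0, 0, 1, 1]))
print("WCSS with 1 cluster:", wcss_one)
print("WCSS with 2 clusters:", wcss_two)
```

WCSS never increases as k grows, which is why the elbow method looks for the point where the improvement levels off rather than for a minimum.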

In [None]:

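from sklearn.metrics import silhouette_score

# Define the range of k values to try
# starting from 2 clusters to 10 clusters
ks = range(2, 11)

wcss = []  # Within-Cluster Sum of Squares for each k
silhouette_scores = []  # Silhouette scores for each k

# Loop over the range of k values
for k in ks:
    # n_init and random_state are illustrative choices for reproducibility
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    labels = km.fit_predict(X_std)
    wcss.append(km.inertia_)  # inertia_ is the WCSS of the fitted model
    silhouette_scores.append(silhouette_score(X_std, labels))
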
Plot the Results Above

In [None]:
# Plot the Elbow Method result
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
plt.plot(ks, wcss, marker='o')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Within-Cluster Sum of Squares (WCSS)')
plt.title('Elbow Method for Optimal k')
plt.grid(True)

# Plot the Silhouette Score result
plt.subplot(1, 2, 2)
plt.plot(ks, silhouette_scores, marker='o', color='green')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Silhouette Score')
plt.title('Silhouette Score for Optimal k')
plt.grid(True)

plt.tight_layout()
plt.show()