# CS246 - Colab 3
## K-Means & PCA

### Setup

Let's set up the necessary libraries for this Colab. Run the cell below!

In [22]:
# Install scikit-learn if needed (usually pre-installed in Colab)
!pip install -q scikit-learn

Now we import some of the libraries usually needed by our workload.

In [23]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score

Let's verify the imports are working correctly.

In [24]:
# Verify scikit-learn is available
import sklearn
print(f"scikit-learn version: {sklearn.__version__}")

scikit-learn version: 1.6.1


### Data Preprocessing

In this Colab, rather than downloading a file from Google Drive, we will load a famous machine learning dataset, the [Breast Cancer Wisconsin dataset](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_breast_cancer.html), using the ```scikit-learn``` datasets loader.

In [25]:
from sklearn.datasets import load_breast_cancer
breast_cancer = load_breast_cancer()

For convenience, we will construct a Pandas dataframe from the dataset.

In [26]:
pd_df = pd.DataFrame(breast_cancer.data, columns=breast_cancer.feature_names)
print(f"Dataset shape: {pd_df.shape}")
print(f"\nColumn names:")
for col in pd_df.columns:
    print(f"  - {col}")

Dataset shape: (569, 30)

Column names:
  - mean radius
  - mean texture
  - mean perimeter
  - mean area
  - mean smoothness
  - mean compactness
  - mean concavity
  - mean concave points
  - mean symmetry
  - mean fractal dimension
  - radius error
  - texture error
  - perimeter error
  - area error
  - smoothness error
  - compactness error
  - concavity error
  - concave points error
  - symmetry error
  - fractal dimension error
  - worst radius
  - worst texture
  - worst perimeter
  - worst area
  - worst smoothness
  - worst compactness
  - worst concavity
  - worst concave points
  - worst symmetry
  - worst fractal dimension


With the next cell, we build the two data structures that we will be using throughout this Colab:


*   ```features```, a numpy array containing all the original features in the dataset;
*   ```labels```, an array of binary labels indicating whether the corresponding sample is malignant (```0```) or benign (```1```).



In [27]:
features = breast_cancer.data
labels = breast_cancer.target

print(f"Features shape: {features.shape}")
print(f"Labels shape: {labels.shape}")

Features shape: (569, 30)
Labels shape: (569,)
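
To see which integer stands for which diagnosis, one option is to inspect the loader's ```target_names``` attribute alongside the class counts. The snippet below is a small sketch that reloads the dataset so it can run on its own; the variable names ```data``` and ```counts``` are illustrative and not part of the assignment.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()

# target_names[i] is the diagnosis encoded by the integer label i
counts = np.bincount(data.target)
for class_id, (name, count) in enumerate(zip(data.target_names, counts)):
    print(f"label {class_id}: {name} ({count} samples)")
```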


### Your task

If you ran the Setup and Data Preprocessing stages successfully, you are now ready to cluster the data with the [K-means](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html) algorithm included in scikit-learn.
Set the ```n_clusters``` parameter to **2** and ```random_state``` to **1**, fit the model, and then compute the [Silhouette score](https://en.wikipedia.org/wiki/Silhouette_(clustering)), a measure of the quality of the obtained clustering. Note that ```silhouette_score``` defaults to Euclidean distance (```metric='euclidean'```), which is what the solution cell uses; pass ```metric='sqeuclidean'``` if you want squared Euclidean distance instead.  

**IMPORTANT:** use the scikit-learn implementation of the Silhouette score via ```silhouette_score```.
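
To make the metric less of a black box, here is a minimal from-scratch sketch of the Silhouette score on a tiny, made-up 1-D dataset. For each point it computes `a` (mean distance to the other points in its own cluster) and `b` (mean distance to the nearest other cluster), then averages `(b - a) / max(a, b)`. The function name ```silhouette_by_hand``` and the toy data are assumptions for illustration only; this is not scikit-learn's implementation, and it assumes at least two clusters.

```python
import numpy as np

def silhouette_by_hand(X, cluster_ids):
    """Mean silhouette coefficient using Euclidean distance (illustrative sketch)."""
    X = np.asarray(X, dtype=float)
    cluster_ids = np.asarray(cluster_ids)
    # Pairwise Euclidean distance matrix, shape (n_samples, n_samples)
    dists = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    scores = []
    for i in range(len(X)):
        same = cluster_ids == cluster_ids[i]
        same[i] = False  # exclude the point itself from its own cluster
        if not same.any():
            scores.append(0.0)  # common convention for singleton clusters
            continue
        a = dists[i, same].mean()
        # Mean distance to each other cluster; keep the nearest one
        b = min(
            dists[i, cluster_ids == c].mean()
            for c in np.unique(cluster_ids)
            if c != cluster_ids[i]
        )
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

# Hypothetical toy data: two well-separated groups on a line
toy_X = np.array([[0.0], [1.0], [10.0], [11.0]])
toy_ids = np.array([0, 0, 1, 1])
toy_score = silhouette_by_hand(toy_X, toy_ids)
print(f"Toy silhouette score: {toy_score:.4f}")
```

Scores close to 1 indicate compact, well-separated clusters; scores near 0 indicate overlapping clusters.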

In [28]:
# 8-9 lines of code in total expected but can differ based on your style.
# for sub-parts of the question, creating different cells of code would be recommended.
# The running time should be less than 1 minute
# YOUR CODE HERE
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
kmeans = KMeans(n_clusters=2, random_state=1)
kmeans.fit(features)
predictions = kmeans.labels_
score = silhouette_score(features, predictions)
print(f"Silhouette Score: {score}")

Silhouette Score: 0.6972646156059464


Take the predictions produced by K-means, and compare them with the ```labels``` variable (i.e., the ground truth from our dataset).  

Compute how many data points in the dataset have been clustered correctly (i.e., positive cases in one cluster, negative cases in the other). Because the cluster identifiers can be a permutation of the labels, report the best-case scenario.

*HINT*: you can use ```np.count_nonzero(predictions == labels)``` to quickly compute the element-wise comparison of two arrays.

**IMPORTANT**: K-means is a clustering algorithm, so it will not output a label for each data point, but just a cluster identifier!  As such, label ```0``` does not necessarily match the cluster identifier ```0```.
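
The permutation issue can be illustrated on a tiny example. The arrays below are made up for illustration (they are not taken from the dataset): the clustering separates the two groups almost perfectly, but uses the opposite identifiers, so a direct comparison badly undercounts the matches.

```python
import numpy as np

# Hypothetical toy arrays, not taken from the dataset
toy_labels = np.array([0, 0, 1, 1, 1])
toy_clusters = np.array([1, 1, 0, 0, 1])

# Assume cluster id i corresponds to label i
direct = np.count_nonzero(toy_clusters == toy_labels)
# Assume the ids are swapped: cluster id i corresponds to label 1 - i
flipped = np.count_nonzero((1 - toy_clusters) == toy_labels)

best = max(direct, flipped)
print(f"direct: {direct}, flipped: {flipped}, best case: {best}")
```

With only two clusters there are just two possible assignments, so taking the maximum of the two counts covers every permutation.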

In [29]:
# 4 lines of code in total expected but can differ based on your style.
# for sub-parts of the question, creating different cells of code would be recommended.
# YOUR CODE HERE
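# Matches if cluster id i corresponds to label i
match1 = np.count_nonzero(predictions == labels)
# Matches if the ids are swapped (valid because both arrays are binary)
match2 = np.count_nonzero(predictions != labels)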
print(f"Correctly clustered (best case): {max(match1, match2)} out of {len(labels)}")

Correctly clustered (best case): 486 out of 569


Now perform dimensionality reduction on the ```features``` using the [PCA](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html) statistical procedure, available in scikit-learn.

Set the ```n_components``` parameter to **2**, reducing the number of features from 30 to 2, i.e., by a factor of **15**.
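
As a rough picture of what the reduction does, the sketch below projects a small random matrix onto its top principal directions using NumPy's SVD. The helper name ```pca_project``` and the toy data are assumptions for illustration; this is a simplified stand-in, not scikit-learn's implementation, and the sign of each projected column is arbitrary.

```python
import numpy as np

def pca_project(X, n_components):
    """Project X onto its top principal directions (illustrative sketch)."""
    X = np.asarray(X, dtype=float)
    # Center each feature so the directions describe variance, not the mean
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are principal directions, ordered by decreasing singular value
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:n_components].T

# Hypothetical toy data: 50 samples, 4 features with very different scales
rng = np.random.default_rng(0)
toy = rng.normal(size=(50, 4)) * np.array([10.0, 1.0, 0.1, 0.01])

reduced = pca_project(toy, n_components=2)
print(reduced.shape)
```

Because the data is centered first, each projected column has zero mean, and the first column carries at least as much variance as the second.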

In [30]:
# 6 lines of code in total expected but can differ based on your style.
# for sub-parts of the question, creating different cells of code would be recommended.
# The running time should be less than 30 seconds.
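# Sanity check given with the exercise: the fourth row in the result should be [-692.6905100570509,38.57692259208171]
# NOTE: the scikit-learn output of this cell does not match that value. One possible reason (an assumption)
# is that the reference was computed with a PCA implementation that does not center the data;
# scikit-learn's PCA always centers the data before projecting it.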
# YOUR CODE HERE
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
pca_features = pca.fit_transform(features)
print("Fourth row after PCA:", pca_features[3])


Fourth row after PCA: [-407.18080253  -67.38031982]


Now run K-means with the same parameters as above, but on the ```pca_features``` produced by the PCA reduction you just executed.

Compute the Silhouette score, as well as the number of data points that have been clustered correctly (again in the best-case scenario).

In [31]:
# 11-13 lines of code in total expected but can differ based on your style.
# for sub-parts of the question, creating different cells of code would be recommended.
# YOUR CODE HERE
# Run KMeans again on PCA-reduced features
kmeans_pca = KMeans(n_clusters=2, random_state=1)
kmeans_pca.fit(pca_features)
pca_predictions = kmeans_pca.labels_
score_pca = silhouette_score(pca_features, pca_predictions)
print(f"Silhouette Score after PCA: {score_pca}")
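# Matches if cluster id i corresponds to label i
match1 = np.count_nonzero(pca_predictions == labels)
# Matches if the ids are swapped (valid because both arrays are binary)
match2 = np.count_nonzero(pca_predictions != labels)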
print(f"Correctly clustered after PCA (best case): {max(match1, match2)} out of {len(labels)}")

Silhouette Score after PCA: 0.6984195775999954
Correctly clustered after PCA (best case): 486 out of 569
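
One way to investigate why the two runs give such similar results is to check how much of the total variance the two retained components account for. The sketch below reloads the data so that it is self-contained and only prints the ratios for inspection; the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA

X = load_breast_cancer().data

pca_check = PCA(n_components=2)
pca_check.fit(X)

# Fraction of the total variance captured by each retained component
ratios = pca_check.explained_variance_ratio_
print(f"Explained variance ratio per component: {ratios}")
print(f"Total variance retained: {ratios.sum():.4f}")
```

If the total is close to 1, the two components preserve most of the pairwise distance structure that K-means relies on, which would be consistent with the near-identical scores above.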
