<div align="left" style="background-color: #008080; padding: 20px 10px;">
<h3><b>IDEAS - Institute of Data Engineering, Analytics and Science Foundation</b></h3>
<p>Spring Internship Program 2026</p>
<hr style="width:100%;">
<h3><b>Project Title:</b> Unsupervised Learning with Dimensionality Reduction and Clustering</h3>
<h4>Project Notebook</h4>

<blockquote style="border-left: 4px solid #4285F4; padding-left: 15px;">
  <strong>Created by:</strong> Koustab Ghosh<sup>1</sup> & Sujoy Kumar Biswas<sup>2</sup><br>
  <strong>Designation:</strong>
  <ol style="margin-top: 5px; padding-left: 20px; font-size: 0.9em;">
    <li>Researcher, IDEAS-TIH, Indian Statistical Institute, Kolkata</li>
    <li>Head of Research & Innovation, IDEAS-TIH, Indian Statistical Institute, Kolkata</li>
  </ol>
</blockquote>
<hr style="width:100%;">
</div>

### Question 1: Load Libraries and Dataset (5 Marks)

Import the `load_digits` function from `sklearn.datasets`. Load the handwritten digits dataset into a variable named `digits`. Print the shape of the `digits.data` attribute.

**Hint:** Use `from sklearn.datasets import load_digits` and then call the function. The shape can be accessed using `.shape` on the `data` attribute of the loaded object.

**Expected Output:**
```
(1797, 64)
```

In [1]:
from sklearn.datasets import load_digits

digits = load_digits()

print(digits.data.shape)

(1797, 64)
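As a quick sanity check on how the data is laid out, the sketch below compares `digits.data` with `digits.images`, which holds the same pixels as 8x8 arrays. The name `flat_images` is introduced here for illustration only.

```python
import numpy as np
from sklearn.datasets import load_digits

digits = load_digits()

# Each row of digits.data is one 8x8 image flattened to 64 values.
print(digits.images.shape)

# Flatten the images again and confirm they match digits.data.
flat_images = digits.images.reshape(len(digits.images), -1)
print(np.array_equal(flat_images, digits.data))
```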


### Question 2: Perform K-Means Clustering (5 Marks)

Import `KMeans` from `sklearn.cluster`. Create an instance of the KMeans model with `n_clusters=10` and `random_state=0`. Fit this model to `digits.data` and obtain the cluster predictions. Print the shape of the `cluster_centers_` attribute of your fitted model.

**Hint:** Instantiate the model with `kmeans = KMeans(...)`. Then, use `kmeans.fit_predict(digits.data)`. The cluster centers are stored in `kmeans.cluster_centers_`.

**Expected Output:**
```
(10, 64)
```

In [2]:
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=10, random_state=0)
labels = kmeans.fit_predict(digits.data)
print(kmeans.cluster_centers_.shape)

(10, 64)
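Beyond the shape of the centers, a fitted model exposes a few other attributes worth inspecting. The following is a minimal sketch, assuming the same model settings as the cell above; `cluster_sizes` is a name chosen here for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits

digits = load_digits()
kmeans = KMeans(n_clusters=10, random_state=0)
labels = kmeans.fit_predict(digits.data)

# fit_predict returns the same assignments that are stored in labels_.
print(np.array_equal(labels, kmeans.labels_))

# Count how many samples fall into each of the 10 clusters.
cluster_sizes = np.bincount(labels, minlength=10)
print(cluster_sizes)

# inertia_ is the sum of squared distances from samples to their nearest center.
print(kmeans.inertia_)
```

Strongly uneven cluster sizes can be an early sign that some digits have been merged into one cluster.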


### Question 3: Inspect Cluster Centers (5 Marks)

Each cluster center represents the 'average' digit of its cluster. Reshape `kmeans.cluster_centers_` into a 3D array of shape `(10, 8, 8)` and store it in a variable called `centers`. Print the shape of the first center (i.e., `centers[0]`).

**Hint:** Use the `.reshape()` method on the `cluster_centers_` numpy array.

**Expected Output:**
```
(8, 8)
```

In [3]:
centers = kmeans.cluster_centers_.reshape(10, 8, 8)
print(centers[0].shape)

(8, 8)
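To illustrate the 'average digit' interpretation, the sketch below averages the images assigned to cluster 0 and compares the result with the stored center. It assumes the same model settings as above; `member_mean` and `max_gap` are illustrative names. The gap should be small when the algorithm has converged, though scikit-learn does not guarantee exact agreement if fitting stops early.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits

digits = load_digits()
kmeans = KMeans(n_clusters=10, random_state=0)
labels = kmeans.fit_predict(digits.data)
centers = kmeans.cluster_centers_.reshape(10, 8, 8)

# Average the 8x8 images assigned to cluster 0.
member_mean = digits.images[labels == 0].mean(axis=0)

# Largest pixel-wise gap between that average and the stored center.
max_gap = np.abs(member_mean - centers[0]).max()
print(member_mean.shape, max_gap)
```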


### Question 4: Dimensionality Reduction with PCA (10 Marks)

Import `PCA` from `sklearn.decomposition`. Create a PCA model that reduces the 64-dimensional digit data to 2 dimensions. Fit and transform `digits.data` with this model, storing the result in a variable named `reduced_data`. Print the shape of `reduced_data`.

**Hint:** Instantiate PCA with `pca = PCA(n_components=2)`. Then, use the `.fit_transform()` method.

**Expected Output:**
```
(1797, 2)
```

In [4]:
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
reduced_data = pca.fit_transform(digits.data)
print(reduced_data.shape)

(1797, 2)
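Reducing 64 dimensions to 2 discards information, and the fitted PCA object reports how much is kept. A minimal sketch, assuming the same setup as the cell above (`retained` is an illustrative name):

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

digits = load_digits()
pca = PCA(n_components=2)
reduced_data = pca.fit_transform(digits.data)

# Fraction of the total variance captured by each of the two components.
print(pca.explained_variance_ratio_)

# Combined fraction retained by the 2-D projection.
retained = pca.explained_variance_ratio_.sum()
print(retained)
```

A low retained fraction suggests that clustering on `reduced_data` may separate the digits less cleanly than clustering on the full 64 features.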


### Question 5: Clustering on Reduced Data (10 Marks)

Perform K-Means clustering again, but this time on the `reduced_data` from the previous step. Use `n_clusters=10` and `random_state=0`. Store the predictions in a variable `clusters`. Print the first 10 cluster labels.

**Hint:** Create a new `KMeans` instance and use the `.fit_predict()` method on `reduced_data`.

**Expected Output:** A numpy array showing the first 10 predicted cluster labels.

In [5]:
from sklearn.cluster import KMeans

kmeans_pca = KMeans(n_clusters=10, random_state=0)

clusters = kmeans_pca.fit_predict(reduced_data)

print(clusters[:10])

[9 4 5 6 2 6 2 8 3 3]
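The centers of this second model live in the 2-D PCA space rather than in pixel space. As a sketch under the same settings as above, `pca.inverse_transform` can map them back to approximate 64-pixel images; `centers_64d` is a name introduced here for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

digits = load_digits()
pca = PCA(n_components=2)
reduced_data = pca.fit_transform(digits.data)

kmeans_pca = KMeans(n_clusters=10, random_state=0)
clusters = kmeans_pca.fit_predict(reduced_data)

# Centers are now 2-D points in the PCA space.
print(kmeans_pca.cluster_centers_.shape)

# Map the 2-D centers back to approximate 64-pixel images.
centers_64d = pca.inverse_transform(kmeans_pca.cluster_centers_)
print(centers_64d.shape)
```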


### Question 6: Label Mapping (10 Marks)

Because K-Means never sees the true digit labels, its cluster labels (0-9) are arbitrary. Map each cluster to the most common true digit label among its members. Import `mode` from `scipy.stats`. For each cluster, find the mode of the true `digits.target` labels. Store this mapping in a dictionary called `labels_map`.

**Hint:** Loop over clusters 0 to 9. In each iteration, find the mode of `digits.target[clusters == i]`, where `i` is the cluster number.

**Expected Output:** A dictionary where keys are cluster labels and values are the corresponding most frequent true digit labels.

In [6]:
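from scipy.stats import mode

# Map each cluster id to the most frequent true digit among its members.
labels_map = {}

for i in range(10):
    mask = (clusters == i)  # samples assigned to cluster i
    # keepdims=True makes .mode a length-1 array, so [0] extracts the value
    labels_map[i] = mode(digits.target[mask], keepdims=True).mode[0]

print(labels_map)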

{0: np.int64(2), 1: np.int64(0), 2: np.int64(4), 3: np.int64(8), 4: np.int64(1), 5: np.int64(8), 6: np.int64(9), 7: np.int64(6), 8: np.int64(7), 9: np.int64(0)}
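A mode-based mapping is not guaranteed to be one-to-one. The sketch below inspects the mapping printed above, rewritten with plain integers, to find digits shared by several clusters and digits that no cluster maps to. The names `shared_digits` and `missing_digits` are illustrative.

```python
from collections import Counter

# Mapping copied from the printed output above (cluster -> digit).
labels_map = {0: 2, 1: 0, 2: 4, 3: 8, 4: 1, 5: 8, 6: 9, 7: 6, 8: 7, 9: 0}

# Digits that more than one cluster maps to.
digit_counts = Counter(labels_map.values())
shared_digits = sorted(d for d, n in digit_counts.items() if n > 1)

# Digits that no cluster maps to, so they can never be predicted.
missing_digits = sorted(set(range(10)) - set(labels_map.values()))

print(shared_digits)   # [0, 8]
print(missing_digits)  # [3, 5]
```

With this mapping, digits 3 and 5 never appear in the predictions, which caps the achievable accuracy.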


### Question 7: Creating Final Predictions (10 Marks)

Using the `labels_map` dictionary from the previous step, create an array of predicted labels `pred_labels` by converting your K-Means `clusters` array to the mapped digit labels. Print the first 10 predicted labels.

**Hint:** You can use a list comprehension or `np.vectorize` to apply the mapping dictionary to the `clusters` array.

**Expected Output:** An array showing the first 10 final predicted digit labels.

In [7]:
import numpy as np

pred_labels = np.array([labels_map[c] for c in clusters])

print(pred_labels[:10])

[0 1 8 9 4 9 4 7 8 8]
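As an alternative to the list comprehension, the mapping can be applied with NumPy integer-array indexing. This sketch uses the mapping and the first 10 cluster labels from the outputs above; `lookup`, `clusters_head` and `pred_head` are illustrative names.

```python
import numpy as np

# Mapping and first 10 cluster labels copied from the outputs above.
labels_map = {0: 2, 1: 0, 2: 4, 3: 8, 4: 1, 5: 8, 6: 9, 7: 6, 8: 7, 9: 0}
clusters_head = np.array([9, 4, 5, 6, 2, 6, 2, 8, 3, 3])

# Position i of the lookup array holds the digit for cluster i.
lookup = np.array([labels_map[i] for i in range(10)])

# Integer-array indexing applies the mapping to every element at once.
pred_head = lookup[clusters_head]
print(pred_head)  # [0 1 8 9 4 9 4 7 8 8]
```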


### Question 8: Evaluate Clustering Accuracy (15 Marks)

Import `accuracy_score` from `sklearn.metrics`. Calculate and print the accuracy of your unsupervised clustering by comparing the `pred_labels` with the true `digits.target` labels.

**Hint:** The function call is `accuracy_score(true_labels, predicted_labels)`.

**Expected Output:** A single floating-point number representing the accuracy score.

In [8]:
from sklearn.metrics import accuracy_score

acc = accuracy_score(digits.target, pred_labels)

print(acc)


0.5664997217584864
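Accuracy only makes sense after the cluster ids have been mapped to digits, because it compares labels position by position. The toy sketch below (hypothetical data, not the digits dataset) contrasts it with the adjusted Rand index from `sklearn.metrics`, which ignores how clusters are numbered.

```python
import numpy as np
from sklearn.metrics import accuracy_score, adjusted_rand_score

# Toy data (hypothetical): the grouping is perfect, but cluster ids are permuted.
true_labels = np.array([0, 0, 1, 1, 2, 2])
cluster_ids = np.array([2, 2, 0, 0, 1, 1])

# Accuracy compares ids position by position, so it is 0 here.
raw_acc = accuracy_score(true_labels, cluster_ids)

# The adjusted Rand index ignores cluster numbering, so it is 1 here.
ari = adjusted_rand_score(true_labels, cluster_ids)

print(raw_acc, ari)
```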


### Question 9: Confusion Matrix (15 Marks)

Import `confusion_matrix` from `sklearn.metrics`. Compute and print the confusion matrix to see which digits were most often confused with each other.

**Hint:** The function call is `confusion_matrix(true_labels, predicted_labels)`.

**Expected Output:** A 10x10 numpy array representing the confusion matrix.

In [9]:
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(digits.target, pred_labels)

print(cm)


[[167   0   0   0   1   0   7   0   3   0]
 [  2  97   0   0   4   0   0  11  68   0]
 [  0   3 114   0   0   0   0  10  47   3]
 [  2   0  63   0   0   0   0  13  15  90]
 [  0  11   0   0 138   0  27   3   2   0]
 [ 14   6   4   0   2   0   4  23 104  25]
 [ 22   0   0   0  21   0 136   0   2   0]
 [  0  35   0   0   1   0   0 115  28   0]
 [  0  10   3   0   1   0   0  24 136   0]
 [  4   1   8   0   0   0   1   9  42 115]]
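The matrix can also be read programmatically. The sketch below works on a copy of the printed matrix (rows are true digits, columns are predicted digits) and recovers the accuracy, the most frequent confusion, and the digits that were never predicted. The names `off_diag` and `never_predicted` are illustrative.

```python
import numpy as np

# Confusion matrix copied from the printed output above.
cm = np.array([
    [167,   0,   0, 0,   1, 0,   7,   0,   3,   0],
    [  2,  97,   0, 0,   4, 0,   0,  11,  68,   0],
    [  0,   3, 114, 0,   0, 0,   0,  10,  47,   3],
    [  2,   0,  63, 0,   0, 0,   0,  13,  15,  90],
    [  0,  11,   0, 0, 138, 0,  27,   3,   2,   0],
    [ 14,   6,   4, 0,   2, 0,   4,  23, 104,  25],
    [ 22,   0,   0, 0,  21, 0, 136,   0,   2,   0],
    [  0,  35,   0, 0,   1, 0,   0, 115,  28,   0],
    [  0,  10,   3, 0,   1, 0,   0,  24, 136,   0],
    [  4,   1,   8, 0,   0, 0,   1,   9,  42, 115],
])

# Correct predictions sit on the diagonal.
accuracy = np.trace(cm) / cm.sum()

# Zero out the diagonal and locate the largest remaining count.
off_diag = cm.copy()
np.fill_diagonal(off_diag, 0)
true_digit, pred_digit = np.unravel_index(off_diag.argmax(), off_diag.shape)

# Columns that sum to zero are digits that were never predicted.
never_predicted = np.flatnonzero(cm.sum(axis=0) == 0)

print(round(accuracy, 4))      # 0.5665
print(true_digit, pred_digit)  # 5 8
print(never_predicted)         # [3 5]
```

The largest off-diagonal entry shows that true 5s are most often labelled as 8.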


### Question 10: Model Evaluation with Classification Report (15 Marks)

Import `classification_report` from `sklearn.metrics`. Print the full classification report for your predicted labels against the true labels to get a detailed breakdown of precision, recall, and f1-score for each digit.

**Hint:** Use `print(classification_report(digits.target, pred_labels))`.

**Expected Output:** A text-based report showing precision, recall, and f1-score for each digit class (0 through 9).

In [10]:
from sklearn.metrics import classification_report

print(classification_report(digits.target, pred_labels))


              precision    recall  f1-score   support

           0       0.79      0.94      0.86       178
           1       0.60      0.53      0.56       182
           2       0.59      0.64      0.62       177
           3       0.00      0.00      0.00       183
           4       0.82      0.76      0.79       181
           5       0.00      0.00      0.00       182
           6       0.78      0.75      0.76       181
           7       0.55      0.64      0.59       179
           8       0.30      0.78      0.44       174
           9       0.49      0.64      0.56       180

    accuracy                           0.57      1797
   macro avg       0.49      0.57      0.52      1797
weighted avg       0.49      0.57      0.52      1797



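The 0.00 rows for digits 3 and 5 arise because those digits are never predicted, so their precision is undefined. scikit-learn reports such values as 0 and can warn about them; the `zero_division` parameter makes that choice explicit. A minimal sketch on hypothetical toy data:

```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

# Toy data (hypothetical): class 2 is present in the truth but never predicted.
true_labels = np.array([0, 0, 1, 1, 2, 2])
pred_labels = np.array([0, 0, 1, 1, 1, 0])

# zero_division=0 reports 0.0 for the undefined precision without a warning.
precision, recall, f1, support = precision_recall_fscore_support(
    true_labels, pred_labels, zero_division=0
)

print(precision)
print(recall)
print(support)
```

The same `zero_division` argument is accepted by `classification_report`.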
