
# Programming Assignment 5 - K-Means Clustering

---

## Seeds Dataset

You can download the dataset here: https://archive.ics.uci.edu/ml/datasets/seeds#

This dataset was obtained by examining the geometry of kernels belonging to three different varieties of wheat: Kama, Rosa and Canadian, with 70 kernels of each variety randomly selected for the experiment. A soft X-ray technique and the GRAINS package were used to construct all seven real-valued attributes. These 7 attributes are:
1. Area A
2. Perimeter P
3. Compactness $C = 4\pi A / P^2$
4. Length of kernel
5. Width of kernel
6. Asymmetry coefficient
7. Length of kernel groove
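
As a quick sanity check of the compactness formula, the sketch below recomputes attribute 3 from attributes 1 and 2. The area and perimeter values are copied from one of the rows that appears in the Q5 test data; the variable names are illustrative only.

```python
import math

# Area and perimeter of one kernel, copied from a row in the Q5 test data.
area = 15.26
perimeter = 14.84

# Compactness C = 4 * pi * A / P^2 (this ratio equals 1 for a perfect circle).
compactness = 4 * math.pi * area / perimeter ** 2

print(round(compactness, 4))  # 0.8708, close to the 0.871 stored for that row
```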

`seeds_dataset.txt` contains the actual dataset.

**IMPORTANT**: you can assume that the last column will always contain the y values. However, your code will be tested against datasets with varying numbers of rows and columns, so do not hardcode column indices (e.g. `[:7]`).
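
One way to avoid hardcoded column counts is negative indexing, sketched below on a small made-up array (the array and the names `data`, `features` and `labels` are hypothetical, not part of the assignment).

```python
import numpy as np

# Hypothetical toy table: 3 rows, 2 feature columns, labels in the last column.
data = np.array([[1.0, 2.0, 1.0],
                 [3.0, 4.0, 2.0],
                 [5.0, 6.0, 3.0]])

features = data[:, :-1]  # every column except the last, whatever the width
labels = data[:, -1]     # the last column only

print(features.shape, labels.shape)  # (3, 2) (3,)
```

The same two slices keep working if more feature columns or rows are added.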

## Objective

You are to implement a K-means clustering algorithm in Python to create clusters of wheat. After completing this assignment, you should be familiar with the following:
1. Assigning a data point to a cluster
2. Computing the sum of squares error between a cluster centroid and its points
3. Running K-means clustering
4. Plotting a graph of error against iterations

### **Total Marks: 30**
---


## Downloading the Dataset and Importing Modules

You can follow the steps below to download the dataset and upload it to a Colab environment.
1. Download the dataset from https://archive.ics.uci.edu/ml/datasets/seeds#
2. Open the Colab file browser by pressing the small folder icon on the top left of the Colab page.  
3. Drag and drop the `seeds_dataset.txt` file into the Colab folder.

We will be using `csv`, `math` and `numpy` (as `np`) for the questions. **You do not need to import them when submitting on Coursemology.**

In [None]:
import csv
import math
import numpy as np

# to display float numbers with 2 decimal places and suppress the use of
# scientific notation for small numbers
## np.set_printoptions(precision=2, suppress=True)

---

### Q1 loadSeedData (3 marks)

The function `loadSeedData` takes in a text file `f` and returns the numpy arrays `X`, comprising the 7 attributes of each seed, and `y`, the seed's corresponding class value (which can take a value of 1, 2 or 3). Please **leave the rows and columns in the order that they appear** in the text file.

Note: you may notice that only `X`, and not `y`, is used in the later questions. This is because K-means clustering is an unsupervised algorithm and thus does not require labelled data.

In [None]:
# Submit to Coursemology
def loadSeedData(f):
    '''
    f: string
    RETURN
        X: numpy array, shape = [N, D]
        y: numpy array, shape = [N]
    '''
    X, y = None, None
    ## start your code here

    
    ## end
    return X, y

In [None]:
# Testing

filename = 'seeds_dataset.txt'
X, y = loadSeedData(filename)

print(X[1][2], X[100, 6], X[177][6])
print(y[1], y[100], y[177])
print(X.shape, y.shape)

Expected output:

```
0.8811 5.618 4.963
1.0 2.0 3.0
(210, 7) (210,)
```

---

### Q2 standardizeDataset (3 marks)

As usual, we now feature-scale the data, this time using standardization.

For a dataset with N rows and D columns, the function `standardizeDataset` takes in the numpy array `X` and returns the standardized array `Xstd` (of shape N x D), the array of each column's mean `meanArray` (of shape D) and the array of each column's standard deviation `stdArray` (of shape D). Return values to the **nearest 3 decimal places using `round()`**.
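
For reference, standardization subtracts each column's mean and divides by that column's standard deviation. The sketch below shows the idea on a made-up 3 x 2 array; it assumes NumPy's default population standard deviation (`ddof=0`), which may or may not be the convention the grader expects, and it does not apply the rounding asked for above.

```python
import numpy as np

# Hypothetical toy data: 3 rows, 2 columns on very different scales.
toy = np.array([[1.0, 10.0],
                [2.0, 20.0],
                [3.0, 30.0]])

col_mean = toy.mean(axis=0)  # per-column mean, shape (2,)
col_std = toy.std(axis=0)    # per-column std; ddof=0 is NumPy's default (assumption)

# Broadcasting applies the column statistics to every row.
toy_standardized = (toy - col_mean) / col_std

print(np.round(toy_standardized, 3))
```

After standardization every column has mean 0 and standard deviation 1, so both columns contribute on the same scale to a distance computation.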

In [None]:
# Submit to Coursemology
def standardizeDataset(X):
    '''
    X: numpy array, shape = [N, D]
    RETURN
        Xstd: numpy array, shape = [N, D]
        meanArray: numpy array, shape = [D]
        stdArray: numpy array, shape = [D]
    '''
    Xstd, meanArray, stdArray = None, None, None
    ## start your code here

    
    ## end
    return Xstd, meanArray, stdArray

In [None]:
# Testing

Xstd, meanArray, stdArray = standardizeDataset(X)
print(Xstd.shape)
print(meanArray[3])
print(stdArray[4])
print(Xstd[10, 1], Xstd[1, 6], Xstd[177, 6])

Expected output:

```
(210, 7)
5.629
0.377
0.223 -0.922 -0.908
```

---

### Q3 euclideanDist (3 marks)

We will use Euclidean distance as our similarity measure.

The function `euclideanDist` takes in two rows of data `x1` and `x2` and returns the Euclidean distance `dist`.
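
As a reminder, the Euclidean distance between two points is the square root of the sum of squared coordinate differences. A minimal sketch on two made-up 2-D points (the names `p`, `q` and `distance` are illustrative):

```python
import numpy as np

# Two hypothetical points forming a 3-4-5 right triangle.
p = np.array([0.0, 0.0])
q = np.array([3.0, 4.0])

diff = p - q                           # coordinate-wise differences
distance = np.sqrt(np.sum(diff ** 2))  # sqrt(3^2 + 4^2)

print(distance)  # 5.0
```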

In [None]:
# Submit to Coursemology
def euclideanDist(x1, x2):
    '''
    x1: numpy array, shape = [D]
    x2: numpy array, shape = [D]
    RETURN
        dist: float value
    '''
    dist = None
    ## start your code here

    
    ## end
    return dist

In [None]:
# Testing

indx = [1, 10, 20, 60, 80, 90, 110, 140, 160, 169]
for i in indx:
    print(euclideanDist(Xstd[1, :], Xstd[i, :]))

Expected output:

```
0.0
2.5586115780349417
1.8522341627790595
2.8933220141835188
3.7331918523023235
4.822853754173586
3.1030936090537335
3.5241204417202856
2.897983013085987
3.653768994815257
```

---

### Q4 closestCentroid (4 marks)

We need a way to help us assign our data points to a cluster.

The function `closestCentroid` takes in a data point (without the class value) `coordinates_x` and a dictionary `coordinates_centroid` in the form of `{0: coordinates_0, 1: coordinates_1, ..., k: coordinates_k}` where `coordinates_k` is an array of the coordinates of the centroid of cluster `k`. It returns the key of the closest centroid cluster `closest_centroid` (i.e. 0, 1, ..., k) based on Euclidean distance. A correct implementation of `euclideanDist` has been given to you in Coursemology (i.e. you don't need to code it again).
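
The idea of picking the dictionary key with the smallest distance can be sketched as follows. The point, the centroids and the variable names are made up for illustration, and `np.linalg.norm` stands in for the provided `euclideanDist`.

```python
import numpy as np

# Hypothetical 2-D point and three hypothetical centroids keyed by cluster id.
point = np.array([1.0, 1.0])
toy_centroids = {0: np.array([0.0, 0.0]),
                 1: np.array([5.0, 5.0]),
                 2: np.array([1.0, 2.0])}

# min() iterates over the keys and ranks them by distance to the point.
nearest = min(toy_centroids,
              key=lambda c: np.linalg.norm(point - toy_centroids[c]))

print(nearest)  # 2
```

Here the distances are about 1.41, 5.66 and 1.0 respectively, so key 2 is returned.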

In [None]:
# Submit to Coursemology
def closestCentroid(coordinates_x, coordinates_centroid):
    '''
    coordinates_x: numpy array, shape = [D]
    coordinates_centroid: dictionary, key = int, value = numpy array of shape [D]
    RETURN
        closest_centroid: int value
    '''
    closest_centroid = None
    ## start your code here

    
    ## end
    return closest_centroid

In [None]:
# Testing

coord_x1 = np.array([17.08, 15.38, 0.9079, 5.832, 3.683, 2.956, 5.484])
coord_x2 = np.array([13.99, 13.83, 0.9183, 5.119, 3.383, 5.234, 4.781])
coord_x3 = np.array([19.11, 16.26, 0.9081, 6.154, 3.93,  2.936, 6.079])
coord_x4 = np.array([14.11, 14.26, 0.8722, 5.52,  3.168, 2.688, 5.219])

coord_data = [coord_x1, coord_x2, coord_x3, coord_x4]
coord_centroid = {0: np.array([18.72180328, 16.29737705, 0.88508689, 6.20893443, 3.72267213, 3.60359016, 6.06609836]),
  1: np.array([11.96441558, 13.27480519, 0.8522    , 5.22928571, 2.87292208, 4.75974026, 5.08851948]),
  2: np.array([14.64847222, 14.46041667, 0.87916667, 5.56377778, 3.27790278, 2.64893333, 5.19231944])}

results = []

for coord in coord_data:
  results.append(closestCentroid(coord, coord_centroid))

print(results)

Expected output:

```
[0, 1, 0, 2]
```

---

### Q5 computeSSE (3 marks)

Another standard machine learning tool that we need is the loss function, which is SSE (sum of squares error) in this case.

The function `computeSSE` takes in a dictionary `repartition` in the form of `{0: array_0, 1: array_1, ..., k: array_k}` where `array_k` is composed of the coordinates of each data point in cluster `k`, and the dictionary `coordinates_centroid` as described in Q4. It returns the sum of all squared Euclidean distances between each data point and its cluster centroid `SSE`. Return values to the **nearest 4 decimal places using `round()`**. A correct implementation of `euclideanDist` has been given to you in Coursemology (i.e. you don't need to code it again).
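
To make the definition concrete, the sketch below sums squared distances over two tiny made-up clusters. The dictionaries mirror the shapes described above, but their contents and the names `toy_clusters`, `toy_centroids` and `sse` are hypothetical.

```python
import numpy as np

# Hypothetical clusters: key -> array of points, shape (points in cluster, D).
toy_clusters = {0: np.array([[0.0, 0.0], [2.0, 0.0]]),
                1: np.array([[10.0, 10.0]])}
# Hypothetical centroids: key -> array of shape (D,).
toy_centroids = {0: np.array([1.0, 0.0]),
                 1: np.array([10.0, 11.0])}

sse = 0.0
for key, points in toy_clusters.items():
    # Squared distance of every point in this cluster to the cluster's centroid.
    sse += np.sum((points - toy_centroids[key]) ** 2)

print(round(float(sse), 4))  # 3.0
```

Cluster 0 contributes 1 + 1 and cluster 1 contributes 1, giving 3.0 in total.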

In [None]:
# Submit to Coursemology
def computeSSE(repartition, coordinates_centroid):
    '''
    repartition: dictionary, key = int, value = numpy array of shape [number of points in cluster, D]
    coordinates_centroid: dictionary, key = int, value = numpy array of shape [D]
    RETURN
        SSE: float value
    '''
    SSE = 0
    ## start your code here

    
    ## end
    return SSE

In [None]:
# Testing

repartition = {0: np.array([([17.08  , 15.38  ,  0.9079,  5.832 ,  3.683 ,  2.956 ,  5.484 ]),
  ([17.63  , 15.98  ,  0.8673,  6.191 ,  3.561 ,  4.076 ,  6.06  ]),
  ([16.84  , 15.67  ,  0.8623,  5.998 ,  3.484 ,  4.675 ,  5.877 ]),
  ([17.26  , 15.73  ,  0.8763,  5.978 ,  3.594 ,  4.539 ,  5.791 ]),
  ([19.11  , 16.26  ,  0.9081,  6.154 ,  3.93  ,  2.936 ,  6.079 ]),
  ([16.82  , 15.51  ,  0.8786,  6.017 ,  3.486 ,  4.004 ,  5.841 ])]),
 1: np.array([([13.99  , 13.83  ,  0.9183,  5.119 ,  3.383 ,  5.234 ,  4.781 ]),
  ([12.72  , 13.57  ,  0.8686,  5.226 ,  3.049 ,  4.102 ,  4.914 ]),
  ([13.02  , 13.76  ,  0.8641,  5.395 ,  3.026 ,  3.373 ,  4.825 ]),
  ([14.28  , 14.17  ,  0.8944,  5.397 ,  3.298 ,  6.685 ,  5.001 ]),
  ([11.42  , 12.86  ,  0.8683,  5.008 ,  2.85  ,  2.7   ,  4.607 ]),
  ([11.23 , 12.63 ,  0.884,  4.902,  2.879,  2.269,  4.703]),
  ([12.36  , 13.19  ,  0.8923,  5.076 ,  3.042 ,  3.22  ,  4.605 ]),
  ([13.22 , 13.84 ,  0.868,  5.395,  3.07 ,  4.157,  5.088]),
  ([12.73  , 13.75  ,  0.8458,  5.412 ,  2.882 ,  3.533 ,  5.067 ]),
  ([13.07 , 13.92 ,  0.848,  5.472,  2.994,  5.304,  5.395]),
  ([13.32  , 13.94  ,  0.8613,  5.541 ,  3.073 ,  7.035 ,  5.44  ])]),
 2: np.array([([15.26 , 14.84 ,  0.871,  5.763,  3.312,  2.221,  5.22 ]),
  ([14.88  , 14.57  ,  0.8811,  5.554 ,  3.333 ,  1.018 ,  4.956 ]),
  ([14.29 , 14.09 ,  0.905,  5.291,  3.337,  2.699,  4.825]),
  ([13.84  , 13.94  ,  0.8955,  5.324 ,  3.379 ,  2.259 ,  4.805 ]),
  ([16.14  , 14.99  ,  0.9034,  5.658 ,  3.562 ,  1.355 ,  5.175 ]),
  ([14.38  , 14.21  ,  0.8951,  5.386 ,  3.312 ,  2.462 ,  4.956 ]),
  ([14.69  , 14.49  ,  0.8799,  5.563 ,  3.259 ,  3.586 ,  5.219 ]),
  ([14.11  , 14.1   ,  0.8911,  5.42  ,  3.302 ,  2.7   ,  5.    ]),
  ([16.63  , 15.46  ,  0.8747,  6.053 ,  3.465 ,  2.04  ,  5.877 ]),
  ([16.44 , 15.25 ,  0.888,  5.884,  3.505,  1.969,  5.533]),
  ([15.26  , 14.85  ,  0.8696,  5.714 ,  3.242 ,  4.543 ,  5.314 ])])}

coord_centroid = {0: np.array([18.72180328, 16.29737705,  0.88508689,  6.20893443,  3.72267213, 3.60359016,  6.06609836]),
 1: np.array([11.96441558, 13.27480519,  0.8522    ,  5.22928571,  2.87292208, 4.75974026,  5.08851948]),
 2: np.array([14.64847222, 14.46041667,  0.87916667,  5.56377778,  3.27790278, 2.64893333,  5.19231944])}

print(computeSSE(repartition, coord_centroid))

Expected output:

```
95.1872
```

---

### Q6 KMeansClustering (8 marks)

We will now do the clustering.

The function `KMeansClustering` takes in a dataset `X`, the indexes of the data points used as the initial centroids `index_centroids`, the number of clusters `k` and the number of iterations `n`, which acts as the stopping criterion. It returns the dictionary `repartition` as described in Q5, the dictionary `coordinates` that is the same as `coordinates_centroid` in Q4/5, and a list that stores the SSE of each iteration `SSE_list`. The numbering of the clusters is not important, as long as the composition and centroids of the clusters are correct. Correct implementations of `closestCentroid` and `computeSSE` have been given to you in Coursemology (i.e. you don't need to code them again).
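
The assign-then-update loop at the heart of K-means can be sketched on a small made-up dataset as shown below. This is only an illustration under several assumptions: it uses plain arrays rather than the dictionaries required by the assignment, it records the SSE after the centroid update (the grader's convention may differ), it does not handle empty clusters, and all names are hypothetical.

```python
import numpy as np

# Hypothetical toy data: two well-separated groups of three 2-D points.
toy = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                [9.0, 9.0], [9.0, 10.0], [10.0, 9.0]])
start_rows = [0, 3]                 # hypothetical rows used as initial centroids
centroids = toy[start_rows].copy()
sse_history = []

for _ in range(5):                  # fixed iteration count as the stopping rule
    # Distance from every point to every centroid, shape (6, 2).
    dists = np.linalg.norm(toy[:, None, :] - centroids[None, :, :], axis=2)
    assignment = dists.argmin(axis=1)   # index of the nearest centroid per point
    # Move each centroid to the mean of the points assigned to it.
    centroids = np.array([toy[assignment == c].mean(axis=0)
                          for c in range(len(centroids))])
    # SSE of the current clustering, measured against the updated centroids.
    sse_history.append(float(np.sum((toy - centroids[assignment]) ** 2)))

print(centroids)
print(sse_history)
```

On this toy data the assignment stabilises after the first pass, so the SSE stays constant across the remaining iterations.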


In [None]:
# Submit to Coursemology
def KMeansClustering(X, index_centroids, k, n):
    '''
    X: numpy array, shape = [N, D]
    index_centroids: list, shape = [k]
    k: int value
    n: int value
    RETURN
        repartition: dictionary, key = int, value = numpy array of shape [number of points in cluster, D]
        coordinates: dictionary, key = int, value = numpy array of shape [D]
        SSE_list: list, shape = [n]
    '''
    repartition, coordinates, SSE_list = None, None, None
    ## start your code here
    # Initialise your first centroids
  
    # Define stopping criterion

        # Initialise new dictionaries for repartition and coordinates

        # Assign all the points to the closest cluster centroid

        # Recompute the new centroids of the newly formed clusters
    
    ## end
    return repartition, coordinates, SSE_list

In [None]:
# Testing

nb_rows = Xstd.shape[0]
index_centroids = [83, 140, 28] # can be any integers as long as they are valid row indexes of the dataset
repartition, coordinates, SSE_list = KMeansClustering(Xstd,index_centroids, 3, 100)
print(np.sum(list(coordinates.values())))

centers = []
for i in coordinates:
  centers.append(coordinates[i])

centers2 = np.array(centers.copy())
while len(centers2) > 0:
  index_print = np.argmin(centers2[:,0])
  print(centers2[index_print])
  centers2 = np.delete(centers2,index_print,0)

Expected output:

```
0.39993332340644117
[-1.03387528 -1.00636857 -0.99853572 -0.89230372 -1.09505824  0.72393622
 -0.61279152]
[-0.16200547 -0.19323201  0.44310121 -0.28050529 -0.01905477 -0.65268272
 -0.59848248]
[ 1.25668163  1.26196622  0.56046437  1.23788278  1.16485187 -0.04521936
  1.29230787]
```

---

### Q7 unstandardize (1 mark)

Everything's working fine, except that the cluster centroids being output are in their standardized forms. We can reverse that by reversing the equation of the standardization.

The function `unstandardize` takes in a list of centroid coordinates `centers`, and `meanArray` and `stdArray` as described in Q2, and returns the list of unstandardized centroid coordinates `result`.
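
Reversing standardization means multiplying by the standard deviation and adding the mean back. A minimal sketch with made-up statistics (the values and the names `toy_mean`, `toy_std`, `z` and `restored` are hypothetical):

```python
import numpy as np

# Hypothetical per-column statistics and two standardized rows.
toy_mean = np.array([2.0, 20.0])
toy_std = np.array([0.5, 4.0])
z = [np.array([1.0, -1.0]), np.array([0.0, 2.0])]

# x = z * std + mean, applied to each row in turn.
restored = [row * toy_std + toy_mean for row in z]

for row in restored:
    print(row)
```

The first row maps back to 2.5 and 16.0, the second to 2.0 and 28.0.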

In [None]:
# Submit to Coursemology
def unstandardize(centers, meanArray, stdArray):
    '''
    centers: list, shape = [number of clusters]
    meanArray: numpy array, shape = [D]
    stdArray: numpy array, shape = [D]
    RETURN
        result: list, shape = [number of clusters]
    '''
    result = None
    ## start your code here


    ## end
    return result

In [None]:
# Testing

for i in unstandardize(centers, meanArray, stdArray):
  print(i)

Expected output:

```
[18.49537313 16.20343284  0.88421045  6.17568657  3.69753731  3.63237313
  6.04170149]
[11.84642857 13.24814286  0.84746     5.23412857  2.84597143  4.78608571
  5.10761429]
[14.37726027 14.30753425  0.88144384  5.50454795  3.25142466  2.72119452
  5.11463014]
```

---

### Q8 sklearnKmeans (3 marks)

While we have manually implemented K-means, we can also use libraries with pre-built functions, such as from `sklearn`. The function `KMeans` from `sklearn.cluster` has been imported for you. (https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html)

The function `sklearnKmeans` takes in a dataset `X`, the number of clusters `k` and the number of iterations `m`, and returns an array of cluster numbers for each data point `position` and an array of centroid coordinates `centers`. The parameter `n_init` in `KMeans` should be set to 1.
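
For orientation, the sketch below fits `KMeans` on a small made-up array and reads back the fitted attributes `labels_` and `cluster_centers_`. The toy data and the `random_state=0` argument are assumptions added here for reproducibility; the assignment itself only asks for `n_init` to be 1.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical toy data: two groups of three 2-D points.
toy = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                [9.0, 9.0], [9.0, 10.0], [10.0, 9.0]])

# n_clusters = number of clusters, max_iter = iteration cap, n_init = number of restarts.
model = KMeans(n_clusters=2, max_iter=100, n_init=1, random_state=0)
model.fit(toy)

toy_labels = model.labels_            # cluster index of each row, shape (6,)
toy_centers = model.cluster_centers_  # centroid coordinates, shape (2, 2)

print(toy_labels.shape, toy_centers.shape)  # (6,) (2, 2)
```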

In [None]:
from sklearn.cluster import KMeans

# Submit to Coursemology
def sklearnKmeans(X, k, m):
    '''
    X: numpy array, shape = [N, D]
    k: int value
    m: int value
    RETURN
        position: numpy array, shape = [N]
        centers: numpy array, shape = [k, D]
    '''
    position, centers = None, None
    ## start your code here
    
    
    ## end
    return position, centers

In [None]:
# Testing

position, centers = sklearnKmeans(X, 3, 100)

centers2 = centers.copy()
while len(centers2) > 0:
  index_print = np.argmin(centers2[:, 0])
  print(centers2[index_print])
  centers2 = np.delete(centers2, index_print, 0)

Expected output:

```
[11.98865854 13.28439024  0.85273659  5.22742683  2.88008537  4.58392683
  5.0742439 ]
[14.81910448 14.53716418  0.88052239  5.59101493  3.29935821  2.70658507
  5.21753731]
[18.72180328 16.29737705  0.88508689  6.20893443  3.72267213  3.60359016
  6.06609836]
```

OR

```
[11.96441558 13.27480519  0.8522      5.22928571  2.87292208  4.75974026
  5.08851948]
[14.64847222 14.46041667  0.87916667  5.56377778  3.27790278  2.64893333
  5.19231944]
[18.72180328 16.29737705  0.88508689  6.20893443  3.72267213  3.60359016
  6.06609836]
```

---

### Q9 Plotting SSE for different parameters (2 marks)

Let's plot the SSE for each iteration of our K-means algorithm for different numbers of iterations. Your `KMeansClustering` from Q6 should already output an `SSE_list`. `matplotlib.pyplot` as `plt` has been imported for you.

Plot the SSE (on separate graphs) for:
* dataset `Xstd`, `index_centroids` = [83, 140, 28], `k` = 3, `n` = 100
* dataset `Xstd`, `index_centroids` = [83, 140, 28], `k` = 3, `n` = 10

Please upload both graphs, along with the code you used to plot them, in Coursemology as screenshots. A graph title and legend are optional but encouraged. Note that you won't get a mark when you submit this question, but you will automatically be awarded the full mark when finalising submission (subject to manual marking afterwards).
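
If you would like to add the optional title, axis labels and legend, one possible layout is sketched below. The SSE values are made-up numbers for illustration only; in your own plot they would come from `SSE_list`.

```python
import matplotlib.pyplot as plt

# Made-up SSE values for illustration only; replace with your own SSE_list.
sse_values = [10.0, 6.0, 4.5, 4.2, 4.2]

fig, ax = plt.subplots()
# Number the iterations from 1 on the x-axis.
ax.plot(range(1, len(sse_values) + 1), sse_values, marker="o", label="SSE")
ax.set_xlabel("Iteration")
ax.set_ylabel("SSE")
ax.set_title("SSE per iteration (illustrative data)")
ax.legend()
```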

In [None]:
import matplotlib.pyplot as plt
nb_rows = Xstd.shape[0]
index_centroids = [83, 140, 28]

In [None]:
# First graph
repartition,coordinates,SSE_list = KMeansClustering(Xstd,index_centroids, 3, 100)
plt.plot(SSE_list)

In [None]:
# Second graph
repartition1,coordinates1,SSE_list1 = KMeansClustering(Xstd,index_centroids, 3, 10)
plt.plot(SSE_list1)

---

# End of Assignment