# Assignment 1: Centroid based clustering

### Make sure you read through this entire notebook before getting started with implementing the algorithm.


This notebook has the following structure:

- We first briefly explain the idea of the assignment.
- We follow this up with a short walkthrough of the assignment, after which you can start implementing the necessary functions.


# Introduction to this template notebook

* This is a **personal** notebook.
* Make sure you work in a **copy** of `...-template.ipynb`,
**renamed** to `...-yourIDnr.ipynb`,
where `yourIDnr` is your TU/e identification number.

<div class="alert alert-danger" role="danger">
<h3>Integrity</h3>
<ul>
    <li>In this course you must act according to the rules of the TU/e code of scientific conduct.</li>
    <li>All the exercises and the graded assignments are to be executed individually and independently.</li>
    <li>You must not copy from the Internet, your friends, books... If you represent other people's work as your own, then that constitutes fraud and will be reported to the Examination Committee.</li>
    <li>Making your work available to others (complicity) also constitutes fraud.</li>
</ul>
</div>

You are expected to work with Python code in this notebook.

The locations where you should write your solutions can be recognized by
**marker lines**,
which look like this:

>`#//`
>    `BEGIN_TODO [Label]` `Description` `(n points)`
>
>`#//`
>    `END_TODO [Label]`

<div class="alert alert-warning" role="alert">Do NOT modify or delete these marker lines.  Keep them as they are.<br/>
<br/>
NEVER write code <i>outside</i> the marked blocks.
Such code cannot be evaluated.
</div>

Proceed in this notebook as follows:
* **Read** the text.
* **Fill in** your solutions between `BEGIN_TODO` and `END_TODO` marker lines.
* **Run** _all_ code cells (also the ones _without_ your code),
    _in linear order_ from the first code cell.

**Personalize your notebook**:
1. Copy the following three lines of code:

  ```python
  AUTHOR_NAME = 'Your Full Name'
  AUTHOR_ID_NR = '1234567'
  AUTHOR_DATE = 'YYYY-MM-DD'  # when notebook was first modified, e.g. '2020-02-26'
  ```

1. Paste them between the marker lines in the next code cell.
1. Fill in your _full name_, _identification number_, and the current _date_ as strings between quotes.
1. Run the code cell by putting the cursor there and typing **Control-Enter**.


In [1]:
#// BEGIN_TODO [Author] Name, Id.nr., Date, as strings (1 point)

AUTHOR_NAME = 'Jaime Bernal Moran'
AUTHOR_ID_NR = '1781782'
AUTHOR_DATE = '2025-05-14' 

#// END_TODO [Author]

AUTHOR_NAME, AUTHOR_ID_NR, AUTHOR_DATE

('Jaime Bernal Moran', '1781782', '2025-05-14')

# Centroid based clustering

For this assignment, you are expected to implement the k-means clustering algorithm, as discussed in class. The functions you are expected to implement are
* ```initialize_centroids```. This function chooses the initial centroids. It already contains a basic implementation, but you are strongly encouraged to experiment with different initialisation strategies once you have implemented the ```cluster``` function.
* ```cluster```. Once you have implemented it, this function performs the actual clustering and finds a suitable placement of the centroids.

In addition to this file, you are provided with two more Python files:
* ```point.py```. This file contains the data structures ```Point```, ```ClusterPoint``` and ```CentroidPoint``` that you can use in your implementation of the two functions.
* ```dataset.py```. This file contains the tools that read the input from file and write the output to file.

<b>Testing.</b> After (partially) implementing the two functions, you can call the <b>run</b> function at the bottom of this document. In the last code cell, you will find the ```test_case_nr``` variable. Setting it to an integer value $x \in [0, 6]$ selects the test case to run: the file 0$x$.in is read from the <b>input</b> folder and the resulting clustering is written to 0$x$.out in the <b>output</b> folder.

<b>Visualisation.</b> To view the clustering you created, you can use the ```Visualizer.ipynb``` notebook that is provided alongside this assignment. After executing its first cell, execute the cell that corresponds to the test case you want to view, and a visualisation is shown.

<b>Handing in.</b> To verify your implementation, hand in this file on Canvas. There, the automated checking tool Momotor checks your implementation of the algorithm. After it has run through all the test cases (this should take at most a couple of minutes), you can see how you scored in the Momotor tab of the course.
Note: you should only hand in _this_ file. The visualizer and other python files should not be handed in.

We move on to the actual coding part of this assignment. At the start we import some useful tools.

In [103]:
import sys
import os
import time
import numpy as np
import math

from point import *
from dataset import *

""" Runtime parameters """
assignment_nr = 1       # The assignment number. Used by the visualizer to determine what has to be visualized

**Extra functions.** When implementing the ```cluster()``` and ```initialize_centroids()``` functions, you will likely want to write a few helper functions of your own. To make sure the automated grader picks these up, place them in the cell below.

In [104]:
#// BEGIN_TODO [YOUR-OWN-FUNCTIONS] 

# ===== =====> Replace this line by your code. <===== ===== #

#// END_TODO [YOUR-OWN-FUNCTIONS]

**Initialize centroids.** We continue with the ```initialize_centroids()``` function. This function currently already has a basic implementation: it takes the first $k$ cluster points from the input and sets these as initial centroids. When checking your algorithm implementations in Momotor, only this basic implementation will be used. However, for the report you are expected to write about this assignment, you are very much encouraged to overwrite this basic implementation with something different and report on your findings.
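
One alternative initialisation you could experiment with for the report is k-means++ seeding, which picks each new centre with probability proportional to its squared distance from the centres chosen so far. The sketch below is only an illustration under stated assumptions: it works on plain coordinate tuples rather than on `ClusterPoint` objects, and the names `squared_distance` and `kmeans_pp_init` are hypothetical, not part of `point.py` or `dataset.py`.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two coordinate sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_pp_init(coords, k, seed=None):
    """Pick k initial centres from coords using k-means++ seeding (sketch)."""
    rng = random.Random(seed)
    # The first centre is chosen uniformly at random.
    centres = [list(rng.choice(coords))]
    # d2[i] is the squared distance from coords[i] to its nearest centre so far.
    d2 = [squared_distance(p, centres[0]) for p in coords]
    while len(centres) < k:
        if sum(d2) == 0:
            # Every point coincides with a chosen centre: fall back to a uniform pick.
            nxt = rng.choice(coords)
        else:
            # Points far away from all current centres are more likely to be picked.
            nxt = rng.choices(coords, weights=d2, k=1)[0]
        centres.append(list(nxt))
        d2 = [min(old, squared_distance(p, nxt)) for old, p in zip(d2, coords)]
    return centres


coords = [(0.0, 0.0), (0.0, 0.1), (10.0, 10.0), (10.0, 10.1)]
print(kmeans_pp_init(coords, k=2, seed=0))
```

To use such coordinates in this notebook, each result would still have to be wrapped as `CentroidPoint(p.dimension, coords)`, in the same way the basic implementation does.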

In [105]:
def initialize_centroids(input_obj):
    """
    Calculates the initial centroid placement

    :param input_obj:   the input object
    :return:            a list of k centroids in the plane
    """
    #centroids = [CentroidPoint(p.dimension, list(p.coords)) for p in input_obj.cluster_points[:input_obj.k]]

#// BEGIN_TODO [IMPLEMENT-INITIALIZE-CENTROIDS]

    # Pull out the two fields supplied by the harness.
    points = input_obj.cluster_points   # list of ClusterPoint objects
    k = input_obj.k                     # number of centroids to place

    # Pick the k initial centroids. This is the "first k" strategy
    # that the template originally provided.
    centroids = [CentroidPoint(p.dimension, list(p.coords)) for p in points[:k]]

#// END_TODO [IMPLEMENT-INITIALIZE-CENTROIDS]
    return centroids

**Cluster.** Below we find the ```cluster()``` function. Currently, no implementation has been provided. It is your task to implement the k-means clustering algorithm here.
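
To make the two alternating steps of k-means concrete, here is a minimal sketch of a single iteration (assign each point to its nearest centre, then move each centre to the mean of its points). It assumes plain coordinate tuples instead of the `ClusterPoint` and `CentroidPoint` classes, and the name `lloyd_step` is hypothetical. Leaving the centre of an empty cluster where it is is just one possible choice.

```python
import math


def lloyd_step(coords, centres):
    """Perform one assignment step and one update step (sketch)."""
    k = len(centres)
    groups = [[] for _ in range(k)]
    # Assignment step: each point joins the group of its nearest centre.
    for p in coords:
        j = min(range(k), key=lambda i: math.dist(p, centres[i]))
        groups[j].append(p)
    # Update step: each centre moves to the per-coordinate mean of its group.
    new_centres = []
    for j, pts in enumerate(groups):
        if pts:
            new_centres.append([sum(axis) / len(pts) for axis in zip(*pts)])
        else:
            new_centres.append(list(centres[j]))  # empty cluster: keep the centre in place
    return new_centres


coords = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
centres = [[1.0, 1.0], [9.0, 1.0]]
print(lloyd_step(coords, centres))  # [[0.0, 1.0], [10.0, 1.0]]
```

Repeating this step until the centres stop moving (or an iteration cap is reached) gives the full algorithm.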

In [106]:
def cluster(input_obj):
    """
    Perform k-means clustering on the input set

    :param input_obj:   the input object
    :return:            a list of k centroids in the plane
    """    
#// BEGIN_TODO [IMPLEMENT-CLUSTER]    
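    import random  # needed below to re-seed empty clusters; not imported at the top of the notebook

    tol = 1e-6        # stop once no centroid moves further than this
    max_iters = 100   # hard cap on the number of iterations
    P = input_obj.cluster_points
    k = input_obj.k
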

    # 1) initialize
    centroids = initialize_centroids(input_obj)

    for _ in range(max_iters):
        # 2) assignment step
        clusters = [[] for _ in range(k)]
        for p in P:
            # compute distance to each centroid
            dists = [ math.dist(p.coords, c.coords) for c in centroids ]
            # pick the index of the nearest (ties broken by first occurrence)
            j = dists.index(min(dists))

            p.cluster_label = j

            clusters[j].append(p)

        # 3) update step
        new_centroids = []
        for pts in clusters:
            if pts:
                # compute per-coordinate means
                dims = zip(*(p.coords for p in pts))
                mean_coords = [sum(c_list)/len(pts) for c_list in dims]
                # wrap it in a CentroidPoint
                new_centroids.append(
                    CentroidPoint(pts[0].dimension, mean_coords)
                )
            else:
                # pick a random ClusterPoint to re-seed, but still wrap
                p = random.choice(P)
                new_centroids.append(
                    CentroidPoint(p.dimension, list(p.coords))
                )

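        # 4) convergence check: adopt the new centroids first, so the most
        #    recent update is what gets returned when the loop stops
        shifts = [math.dist(c_old.coords, c_new.coords)
                  for c_old, c_new in zip(centroids, new_centroids)]
        centroids = new_centroids
        if max(shifts) < tol:
            break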



#// END_TODO [IMPLEMENT-CLUSTER] 
    return centroids

Next, we define the function that reads the input from file, uses your clustering algorithm to compute the clustering, and writes the result to file again.

In [107]:
def run(path_in, path_out):
    """
    Reads the input set, clusters the points and writes to output

    :param path_in:     location of the input set
    :param path_out:    location to print the output
    """

    # read input from file
    try:
        with open(path_in, "r") as f:
            input_obj = Dataset.read_input(f)
    except IOError:
        print("Could not read input file: " + path_in, file=sys.stderr)
        return        

    # find the best centroids using k_means
    centroids = cluster(input_obj)

    # simple check if the correct number of centroids has been given
    assert len(centroids) == input_obj.k

    # print result to file
    try:
        input_obj.write_output(centroids, path_out, assignment_nr)
    except IOError:
        print("Could not write output to file: " + path_out, file=sys.stderr)

    print("Cluster counter:      ", input_obj.k, file=sys.stderr)
    print("Mean squared distance: {:.3f}".format(input_obj.avg_score(centroids)), file=sys.stderr)                

**Running testcases.** Lastly, you can check your implementation by running some tests on it. Below, you can choose which testcase ```test_case_nr``` $\in[0,6]$ you would like to run.

In [108]:
test_case_nr =   5   # which input file to read

if __name__ == "__main__":
    start = time.time()
    run("input/{:02d}.in".format(test_case_nr), "output/{:02d}.out".format(test_case_nr))
    end = time.time()
    print("Time taken:            {:.3f}s".format(end - start), file=sys.stderr)   

Cluster counter:       4
Mean squared distance: 0.197
Time taken:            0.208s
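
The "Mean squared distance" line above is produced by `avg_score` in `dataset.py`, whose exact definition is not shown in this notebook. Assuming it is the average, over all points, of the squared distance to the nearest centroid, the quantity could be sketched as follows on plain coordinate tuples. The name `mean_squared_distance` is hypothetical.

```python
def mean_squared_distance(coords, centres):
    """Average squared distance from each point to its nearest centre (sketch)."""
    total = 0.0
    for p in coords:
        # Squared Euclidean distance to the closest centre.
        total += min(
            sum((x - c) ** 2 for x, c in zip(p, centre))
            for centre in centres
        )
    return total / len(coords)


coords = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
centres = [[0.0, 1.0], [10.0, 1.0]]
print(mean_squared_distance(coords, centres))  # 1.0
```

A lower value means the points lie closer to their centroids, which makes this a convenient number for comparing initialisation strategies on the same test case.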


That is all. At this point you should have all the information you need to get started. If you have any questions, feel free to ask the tutor overseeing the class. If you experience issues with handing in your submission to Momotor, you can contact the responsible teaching assistant via email. Their email address can be found on Canvas.

Best of luck and happy clustering!

&copy; 2019-2020 - **TU/e** - Eindhoven University of Technology