<a href="https://colab.research.google.com/github/valentinaslisser/valentina_slisser/blob/main/Assignment_3_CodeGrade_V2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# P3:  k-Means Clustering [20 pts]

*K-means* is a classic example of *unsupervised learning*: there is no target that we are trying to predict.  Instead we are trying to extract a hidden structure.  The goal of *k-means* is to find $k$ clusters, or groups, within a dataset.  Each cluster is represented by a point $m$, and the data points that are closest to that point are assigned to the corresponding cluster.  The *means* aspect of the algorithm comes from how these cluster representations are calculated: by computing the mean of the points currently assigned to that cluster.  For more details, read Section *7.3* of Alpaydin. Below I give pseudo-code for the algorithm:

***
* **INPUT**: a data set X (of size NxD), the number of clusters $k$
* Initialize $k$ mean vectors $m_{j}$ (each of size 1xD) to k random data points in X
* Do:
    * For each data point $x_n$ in $X$:
        * Find the mean $m_{j}$ with the minimum squared distance to $x_n$: $\arg\min_{j} \ || x_{n} - m_{j}||^{2}$
        * Assign $x_n$ to the $j$th cluster $C_{j}$, which had the nearest mean $m_j$.
    * For each mean $m_{j}$ for $j \in [1,\ldots, k]$:
        * Re-compute the mean $m_{j}$ by taking the mean of the points currently assigned to cluster $C_{j}$: $m_j = \frac{1}{|C_{j}|}\sum_{x \in C_{j}} x$, where $|C_{j}|$ denotes the number of points currently assigned to cluster $C_{j}$ and the sum is taken over all data points in $C_{j}$.
* While:
    * Not Converged: The cluster assignments have changed from the previous iteration of the Do-loop.
* **OUTPUT**: the cluster means $\{m_{1},\ldots,m_{k} \}$, the cluster assignments (cluster index per data point)
***
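
To make the two steps inside the Do-loop concrete, here is a minimal sketch of a single pass on a hand-made toy data set. The array values and the names `X_toy`, `m_toy`, `assign`, and `new_m` are illustrative assumptions, not part of the assignment's interface, and the starting means are fixed by hand instead of being drawn at random. The sketch also assumes every cluster receives at least one point.

```python
import numpy as np

# Toy data: four 2-D points forming two obvious groups (illustrative values).
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
# Two starting means, picked by hand here (the algorithm picks random data points).
m_toy = np.array([[0.0, 0.0], [10.0, 0.0]])

# Assignment step: squared distance from every point to every mean, shape (N, k),
# then the index of the nearest mean for each point.
sq_dists = ((X_toy[:, None, :] - m_toy[None, :, :]) ** 2).sum(axis=2)
assign = sq_dists.argmin(axis=1)

# Update step: each mean becomes the average of the points assigned to it.
# (Assumes no cluster is empty; an empty cluster would need separate handling.)
new_m = np.array([X_toy[assign == j].mean(axis=0) for j in range(len(m_toy))])

print(assign)  # [0 0 1 1]
print(new_m)   # rows: [0, 0.5] and [10, 0.5]
```

Running the assignment step again with `new_m` gives the same assignments, so on this toy example the loop would stop after the second pass.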
    
This k-means algorithm iteratively minimizes the following global cost function w.r.t. the means $\{m_{1},\ldots,m_{k} \}$: $$ \ell(X, \{m_{1},\ldots,m_{k} \}) = \sum_{j=1}^{k} \sum_{n=1}^{N} 1_{x_{n} \in C_{j}} \cdot || x_{n} - m_{j}||^{2} $$ where $1_{x_{n} \in C_{j}}$ is an indicator function that is one if $x_{n} \in C_{j}$ (meaning that $x_{n}$ is currently assigned to the $j$th cluster) and is zero otherwise.
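
As a small worked example of the cost function, the sketch below evaluates $\ell$ twice on toy data: once as the literal double sum with the indicator, and once in vectorized form. All values and names (`X_toy`, `means_toy`, `assign_toy`) are assumptions chosen for illustration.

```python
import numpy as np

# Toy data, means, and assignments (illustrative values only).
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
means_toy = np.array([[0.0, 0.5], [10.0, 0.5]])
assign_toy = np.array([0, 0, 1, 1])

# Literal translation of the double sum, with the indicator written out.
cost_loop = 0.0
for j in range(len(means_toy)):
    for n in range(len(X_toy)):
        indicator = 1.0 if assign_toy[n] == j else 0.0
        cost_loop += indicator * np.sum((X_toy[n] - means_toy[j]) ** 2)

# Vectorized form: means_toy[assign_toy] selects each point's own mean, shape (N, D),
# so the indicator is implicit in the indexing.
cost_vec = ((X_toy - means_toy[assign_toy]) ** 2).sum()

print(cost_loop, cost_vec)  # 1.0 1.0
```

Each point lies 0.5 away from its mean, contributing $0.5^2 = 0.25$, and the four contributions sum to 1.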

In [None]:
# Necessary set ups & imports

import numpy as np

%matplotlib inline
import matplotlib.pyplot as plt
import matplotlib.colors as mcolors

from sklearn import datasets

# Sets the seed for deterministic randomization (DO NOT CHANGE WHEN HANDING IN TO CODEGRADE)!
RANDOM_SEED = 1

# Use this Numpy random Generator object throughout your randomization code for consistency.
rng = np.random.default_rng(seed = RANDOM_SEED)

### **<span style="color:red">Do not forget to answer the final question at the bottom of the file!</span>**


## Implementing the algorithm [15 pts]

For your implementation, you'll complete the following components of the k-means algorithm:

* `init_clusters`: Randomly initialize the $k$ means from $k$ random data points [1 pt]
* `distance`: Compute the squared distance between two input points [1 pt]
* `global_cost`: Compute the global squared cost $\ell(X, \{m_{1},\ldots,m_{k} \})$ defined above [1 pt]
* `assign_clusters`: Compute the assignments of data points to clusters, based on the current means [4 pts]
* `compute_means`: Compute the (KxD)-matrix containing the mean vectors, based on the current cluster assignments [3 pts]
* `is_converged`: Determine if the algorithm has converged or not by comparing the current cluster assignments with the previous iteration's assignments. [1 pt]
* `run_kmeans`: Combine all of these functions in a general k-means function [4 pts]
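
The notes in the function stubs ask for randomization through a NumPy `Generator` without in-place operations. The sketch below contrasts copy-returning calls with the in-place `shuffle`, using a separate generator and toy data; the names `rng_demo`, `data`, and `picked` are illustrative. Whether the grader expects sampling with or without replacement is not stated here, so `replace=False` is an assumption.

```python
import numpy as np

rng_demo = np.random.default_rng(seed=1)  # separate generator, for illustration only
data = np.arange(10).reshape(5, 2)        # five toy 2-D points
original = data.copy()

# Generator.permutation returns a shuffled copy; `data` itself is untouched.
shuffled = rng_demo.permutation(data, axis=0)

# Generator.choice on the row indices picks k distinct rows (replace=False),
# again without modifying `data`.
k = 2
idx = rng_demo.choice(data.shape[0], size=k, replace=False)
picked = data[idx]

# By contrast, rng_demo.shuffle(data) would reorder the rows of `data` in place.
print(np.array_equal(data, original))  # True
print(picked.shape)                    # (2, 2)
```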

In [None]:
def init_clusters(x, k, rng):
    """
    Sets up the initial clusters from random data points.

    Input:
    x - Nx2 data array
    k - integer representing number of clusters
    rng - The random Generator used for randomization

    Output:
    kx2 array - the initial mean values, randomly assigned to points in x

    #NOTE: Remember to use the *new* Numpy Random Number Generator used in the previous assignments as well!
    #NOTE: Do not forget to use the RANDOM_SEED when seeding your RNG object!
    #NOTE: Do not apply any random operations in-place! Rather, use functions that make & return a copy.
    """
    # YOUR SOLUTION HERE
    raise NotImplementedError

In [None]:
def distance(x0, x1):
    """
    Computes the squared distance between two input points.

    Input:
    x0 - Numpy array of shape (2,), a single 2-dimensional data point
    x1 - Numpy array of shape (2,), a single 2-dimensional data point

    Output:
    float - the squared distance between x0 and x1
    """
    # YOUR SOLUTION HERE
    raise NotImplementedError

In [None]:
def global_cost(x, means, assignments):
    """
    Computes the global squared cost according to the formula.

    Input:
    x - Nx2 data array
    means - kx2 array containing current mean values
    assignments - N-dimensional array containing the cluster assignment index (integer) per data point

    Output:
    float - summed cost for all data points, with the distance computed to their currently assigned mean
    """
    # YOUR SOLUTION HERE
    raise NotImplementedError

In [None]:
def assign_clusters(x, means):
    """
    Groups the data points to the current means.

    Input:
    x - Nx2 data array
    means - kx2 array containing current mean values

    Output:
    N-dimensional array - updated cluster assignment index (integer) per data point
    """
    # YOUR SOLUTION HERE
    raise NotImplementedError

In [None]:
def compute_means(x, assignments, k, rng):
    """
    Computes the matrix containing the mean vectors based on the current data assignment.

    Input:
    x - Nx2 data array
    assignments - N-dimensional array containing the cluster assignment index (integer) per data point
    k - integer representing number of clusters
    rng - The random Generator used for randomization

    Output:
    kx2 array - updated mean values

    #NOTE: The np.where() function might
    be of help for retrieving cluster members

    #NOTE #2: Remember to handle the case
    in which no points are assigned to a cluster.
    In that case, set the mean to be a random
    point in the data (same as how you initialized).

    #NOTE #3: Remember to use the new Random Number Generators
    from Numpy to apply randomization!
    """
    # YOUR SOLUTION HERE
    raise NotImplementedError

In [None]:
def is_converged(old_assignments, new_assignments):
    """
    Determines whether the algorithm has converged by comparing assignments.

    Input:
    old_assignments - N-dimensional array containing the cluster assignment index (integer) per data point *obtained by the previous iteration*
    new_assignments - N-dimensional array containing the cluster assignment index (integer) per data point *computed during the current iteration*

    Output:
    boolean - True if all elements in the two arrays are equal, False otherwise
    """
    # YOUR SOLUTION HERE
    raise NotImplementedError

In [None]:
def run_kmeans(x, k, rng):
    """
    Combines all of the above functions into one algorithm.

    Input:
    x - Nx2 data array
    k - integer representing number of clusters
    rng - The random Generator used for randomization

    Output:
    means - kx2 array containing final mean values
    assignments - N-dimensional array containing final assignment indices
    cost_per_iteration - list containing the global cost for each iteration
    """
    if k < 1:
        print("k=%d: Need to run k-Means with k > 0!"%(k))
        return

    means = init_clusters(x, k, rng)
    assignments = np.zeros((x.shape[0],)).astype(int)
    cost_per_iteration = []
    converged = False

    # YOUR SOLUTION STARTS HERE

    raise NotImplementedError # <- Remove this when implementing!

    # YOUR SOLUTION ENDS HERE

    return means, assignments, cost_per_iteration

## Showing the results

I've provided a function below that will visualize the results of the k-Means algorithm.  It will create two subplots.  One shows the means and cluster assignments of each data point via color-coding.  The second shows the global cost function per iteration of the algorithm.

In [None]:
def plot_clusters_and_cost(x, means, assignments, costs):
    """
    The following plotting function is provided for you.

    Input:
    x -- Nx2 data array
    means --- Kx2 array with each cluster mean per row
    assignments --- length-N array containing cluster assignment indices
    costs --- list containing global cost for each iteration

    Output:
    Two plots --- one that shows the color-coded cluster assignments,
    and one that shows the cost per iteration
    """
    k = means.shape[0]

    # There are 17 unique colors in this list, so we print a message and
    # return without plotting if asked to plot more than 17 clusters.
    colors = list(mcolors.TABLEAU_COLORS.keys()) + list(mcolors.BASE_COLORS.keys())[:-1]
    if k > len(colors):
        print("Too many clusters, not enough colors for plotting!")
        print("Try running again with 0 < k <= %d."%(len(colors)))
        return

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15,5))

    # PLOT POINTS & MEANS
    for k_idx in range(means.shape[0]):
        assignment_idxs = np.where(assignments == k_idx)
        ax1.scatter(x[assignment_idxs, 0], x[assignment_idxs, 1], color=colors[k_idx], marker="o", alpha=.3, s=60)
        ax1.scatter(means[k_idx,0], means[k_idx, 1], color=colors[k_idx], marker="x", s=550, linewidth=7)
    ax1.set_title("Clusters (k=%d)"%(means.shape[0]), fontsize=20)
    ax1.set_xlabel(r"$X_1$", fontsize=25)
    ax1.set_ylabel(r"$X_2$", fontsize=25)

    # PLOT GLOBAL COST FUNCTION
    ax2.plot(range(1, len(costs)+1), costs, "k-", lw=5)
    ax2.set_title("Cost Per Iteration", fontsize=20)
    ax2.set_xlabel(r"Iteration", fontsize=25)
    ax2.set_ylabel(r"Cost", fontsize=25)

    plt.show()

## Running on Iris data set [3 pts]

Now that k-means is fully implemented, run the algorithm on the Iris dataset from last week.  Like last week, we will use only the last two features: $x_{1}=$**petal length** and $x_{2}=$**petal width**.

In the three code blocks below, run the algorithm for three different values of $k$.  (Choose $k < 18$: the plotting function has only 17 colors and will not plot more clusters than that.)
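
As a sketch of the feature selection described above, the last two columns of the Iris data matrix can be sliced out as follows. The variable name `x_petal` is an illustrative assumption; the printed feature names confirm which columns are being kept.

```python
from sklearn import datasets

iris = datasets.load_iris()

# The four columns are sepal length, sepal width, petal length, petal width (in cm).
print(iris.feature_names)

# Keep only the last two columns: petal length and petal width.
x_petal = iris.data[:, 2:]
print(x_petal.shape)  # (150, 2)
```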

In [None]:

iris = datasets.load_iris()

# YOUR SOLUTION HERE

#plot_clusters_and_cost(???)

In [None]:
# YOUR SOLUTION HERE

#plot_clusters_and_cost(???)

In [None]:
# YOUR SOLUTION HERE

#plot_clusters_and_cost(???)

## Describe your results [2 pts]

Below, describe what you see in the three different runs.  In particular, describe the changes between runs in (i) the final cluster assignments and (ii) the global cost function.  Do you see evidence of underfitting or overfitting?  If so, describe it.


# **<span style="color:red">YOUR RESPONSE HERE</span>**