# 1 Cluster with GMM vs KMeans

Homework 7 covered clustering with K-Means. In this notebook we look at another clustering method: Gaussian Mixture Models (GMMs).

### Load in data

Run the following cell to generate data for the notebook. The true separation of the data is shown by the distinct coloring of the points. Additionally, the true centers of the clusters have been plotted.

In [None]:
#################### RUN THIS CELL DO NOT EDIT ############################################################
import numpy as np
import matplotlib.pyplot as plt

# Define parameters for each Gaussian cluster
np.random.seed(42)
means = [[-4, 4], [4, 4], [0,4]]
covariances = [[[1, 0], [0, 8]], [[1, 0], [0, 8]], [[1, 0], [0, 8]]]
num_samples = 500

# Generate data for each cluster
data = []
count = 0
labels = []
for mean, cov in zip(means, covariances):
    cluster_data = np.random.multivariate_normal(mean, cov, num_samples)
    count += 1
    labels.append(np.ones(shape = cluster_data.shape[0]) * count)
    data.append(cluster_data)

# Combine data and plot
data = np.vstack(data)
plt.scatter(data[:, 0], data[:, 1], s=10, alpha=0.7, c = labels, cmap ="cividis")
plt.scatter([mean[0] for mean in means], [mean[1] for mean in means], c='red', marker='X', label='True Means')
plt.title("True Classes")
plt.xlabel("X1")
plt.ylabel("X2")
plt.show()
##############################################################################################################

# 1.1 Clustering with K-Means

Use the sklearn `KMeans` class to cluster `data`.

**Tasks**
- [1 pt] Import `KMeans` from sklearn
- [1 pt] Create a `KMeans` model `km_clf` with 3 clusters and a `random_state` of 42
- [1 pt] Assign the predicted cluster labels to `km_clusters`
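For reference, the general shape of the `KMeans` API can be sketched on separate toy data. This is only an illustration under stated assumptions: `toy_X`, `toy_km`, `toy_labels`, and `toy_centroids` are hypothetical names, and the blob positions and the explicit `n_init=10` are choices made for this sketch, not part of the homework.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical toy data (not the homework's `data`): two well-separated blobs.
rng = np.random.default_rng(0)
toy_X = np.vstack([
    rng.normal(loc=[-5.0, 0.0], scale=0.5, size=(50, 2)),
    rng.normal(loc=[5.0, 0.0], scale=0.5, size=(50, 2)),
])

# n_init is set explicitly because its default differs across scikit-learn versions.
toy_km = KMeans(n_clusters=2, random_state=0, n_init=10)
toy_labels = toy_km.fit_predict(toy_X)    # one integer label per row of toy_X
toy_centroids = toy_km.cluster_centers_   # shape (n_clusters, n_features)

print(toy_labels.shape, toy_centroids.shape)
```

`fit_predict` fits the model and returns the labels in one call; the fitted centroids are then available as the `cluster_centers_` attribute.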

In [24]:
# TODO Clustering with KMeans

# 1.2 Plotting

In this section you will need to use the clusters that you found in (1.1) to assign points to a cluster via the `c=` argument in the scatter plot. Additionally, extract the cluster centers from the KMeans classifier and plot those to see how they differ from the true cluster centers. Make sure to use a different color than the true centers to distinguish them.

**Tasks** [2 pt]
-  Plot the data with the KMeans clusters
-  Plot the original centers (means) 
-  Extract the centroids from the KMeans classifier
-  Plot the centroids in a different color on the same plot

In [None]:
km_centroids = ... # TODO: Extract Centroids

# TODO: Plot data with KMeans cluster assignments
# TODO: Plot True Centers
# TODO: Plot Centroids
plt.title("KMeans Cluster Assignments")
plt.xlabel("X1")
plt.ylabel("X2")
plt.legend()
plt.show()

# 1.3 Clustering with GMM

Use the sklearn `GaussianMixture` class to cluster `data`.

**Tasks**
- [1 pt] Import `GaussianMixture` from sklearn
- [1 pt] Create a `GaussianMixture` model `gmm_clf` with 3 components (clusters) and a `random_state` of 42
- [1 pt] Assign the predicted cluster labels to `gmm_clusters`
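As a rough guide to the `GaussianMixture` API, here is a minimal sketch on separate toy data. The names `toy_X`, `toy_gmm`, `toy_labels`, `toy_means`, `toy_covs`, and `toy_resp` are hypothetical, and the elongated blobs are an assumption chosen only to make the fitted covariances interesting.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Hypothetical toy data: two blobs that are wide in x and narrow in y.
rng = np.random.default_rng(0)
toy_X = np.vstack([
    rng.normal(loc=[0.0, -5.0], scale=[3.0, 0.5], size=(100, 2)),
    rng.normal(loc=[0.0, 5.0], scale=[3.0, 0.5], size=(100, 2)),
])

toy_gmm = GaussianMixture(n_components=2, random_state=0)
toy_labels = toy_gmm.fit_predict(toy_X)   # hard assignment: most likely component per point
toy_means = toy_gmm.means_                # shape (n_components, n_features)
toy_covs = toy_gmm.covariances_           # one full covariance matrix per component by default
toy_resp = toy_gmm.predict_proba(toy_X)   # soft assignment: each row sums to 1

print(toy_means.shape, toy_covs.shape, toy_resp.shape)
```

Unlike K-Means, the fitted model carries a covariance matrix per component and a soft membership probability for every point, in addition to the component means.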

In [26]:
# TODO Clustering with GMM

# 1.4 Plotting

In this section you will need to use the clusters that you found in (1.3) to assign points to a cluster via the `c=` argument in the scatter plot. Additionally, extract the means of the Gaussians from the GMM classifier and plot those to see how they differ from the true cluster centers. Make sure to use a different color than the true centers to distinguish them.

**Tasks** [2 pt]
-  Plot the data with the GMM clusters
-  Plot the original centers (means) 
-  Extract the centers of the gaussians from the GMM classifier
-  Plot these centers in a different color on the same plot

In [None]:
gmm_means = ... # TODO: Extract Means

# TODO: Plot data with GMM cluster assignments
# TODO: Plot True Centers
# TODO: Plot Means
plt.title("GMM Cluster Assignments")
plt.xlabel("X1")
plt.ylabel("X2")
plt.legend()
plt.show()

### 1.5 Discussion

[2 pt] Which clustering method performed better? Explain.

**Ans** 

# Spectral embedding and clustering
This homework will not only help you get a better sense of how spectral embedding and clustering work, but also give you a stronger intuition for the "radial basis function" kernel, which is a common kernel used in many scientific applications.


Make sure you have downloaded:
- X1.csv
- X2.csv

In [28]:
import numpy as np
import matplotlib.pyplot as plt

##  Intro

### The following few markdown cells are for your reference; there is nothing to do until (2)

Remember: 
- when performing spectral clustering, there is an adjacency matrix, which represents the graph/network of all data points. 
- Data points are connected if they are "close enough" to each other.
- **Question:** But what constitutes close enough?
- There are two ways to do this, as seen in the lecture notes on Embedding:
    - k-nearest neighbors
    - $\exp\left(-d^2/\epsilon\right)$ similarity: <u>radial basis function (RBF)</u> proximity.

You already know how k-nearest neighbors works, so this homework starts with a basic introduction to the RBF and how it measures proximity. After that, we will write some functions that streamline your code and experiment with spectral clustering models on synthetically generated data.
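To make the two notions of "close enough" concrete, here is a small NumPy-only sketch that builds both kinds of adjacency matrix for five points. The points, `k = 1`, `eps = 1.0`, and the rule used to symmetrise the k-nearest-neighbor graph are all assumptions for illustration; library implementations may symmetrise differently.

```python
import numpy as np

# Five hypothetical 1D points: a group near 0 and a group near 5.
pts = np.array([[0.0], [0.1], [0.2], [5.0], [5.1]])

# Pairwise Euclidean distances, shape (n, n).
dists = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)

# Option 1: k-nearest-neighbor adjacency (k is an assumed value).
k = 1
order = np.argsort(dists, axis=1)           # column 0 is the point itself
A_knn = np.zeros_like(dists)
rows = np.repeat(np.arange(len(pts)), k)
A_knn[rows, order[:, 1:k + 1].ravel()] = 1.0
A_knn = np.maximum(A_knn, A_knn.T)          # connect i and j if either lists the other

# Option 2: RBF proximity exp(-d^2 / eps) (eps is an assumed value).
eps = 1.0
A_rbf = np.exp(-dists**2 / eps)

print(A_knn)
print(np.round(A_rbf, 3))
```

The k-nearest-neighbor matrix is binary and sparse, while the RBF matrix assigns every pair a weight between 0 and 1 that decays smoothly with distance.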


## Radial Basis Function (RBF)
The RBF kernel, a.k.a. squared exponential/Gaussian kernel, is a function that takes in two points and outputs a number to reflect the proximity of those two points.

More formally, if the two points are $x_1$ and $x_2$, the "proximity" of these points is
$$
k(x_1, x_2) \ := \ \exp\left( - \frac{\|x_1-x_2\|^2}{2\ell^2}\right),
$$
where $\ell$ is a lengthscale parameter you can choose. 

($\ell$ _essentially_ corresponds to the $\epsilon$ in the lecture notes, since matching the two exponents gives $\epsilon = 2\ell^2$, and it should remind you of the standard deviation in a normal distribution PDF.)

The next step is to plot the RBF kernel in both 1D and 2D.

But before we do that, we use a simplifying trick to make these plots easier.
- Notice that we are only concerned with the **distance** between two points, as represented in the $\|x_1-x_2\|$ term in the formula above.
- Hence, we can set distance $d = \|x_1-x_2\|$ to get a new formula for RBF kernel:
    $$
    k(d) \ := \ \exp\left( - \frac{d^2}{2\ell^2}\right).
    $$
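As a quick sanity check on this formula, evaluating $k(d)$ at a few multiples of the lengthscale gives values that do not depend on $\ell$ itself:
$$
k(0) = 1, \qquad k(\ell) = e^{-1/2} \approx 0.607, \qquad k(2\ell) = e^{-2} \approx 0.135, \qquad k(3\ell) = e^{-9/2} \approx 0.011.
$$
So, loosely speaking, two points closer than about $\ell$ are strongly connected, and two points farther apart than about $3\ell$ are nearly disconnected.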



### RBF Function

In [29]:
def rbf(d,l):
    # RBF kernel value at distance d for lengthscale l (works elementwise on NumPy arrays)
    return np.exp(-d**2 / (2*l**2))

### 1d plot

In [None]:
x = np.linspace(-10,10,1000)
l_list = [1,2,3]

for l in l_list:
    dist_1d = rbf(np.abs(x),l)
    plt.plot(x,dist_1d,label=r'$\ell$=%d'%l)
plt.title('RBF kernel 1D')
plt.xlabel('displacement from 0, d'); plt.ylabel(r'RBF, $k(d)$')
plt.legend(); plt.grid(); plt.show()

### 2d plot

In [None]:
xx, yy = np.meshgrid(x, x)

fig = plt.figure(figsize=(15,4))
plt.suptitle('RBF kernel 2D')
for i, l in enumerate(l_list):
    ax = fig.add_subplot(1,3,i+1,projection='3d')
    # distance of each grid point from the origin
    dist_2d = rbf(np.sqrt((xx**2 + yy**2)),l)
    ax.plot_surface(xx,yy,dist_2d)
    ax.set_title(r'$\ell$=%d'%l)
    ax.set_xlabel(r'$x$'); ax.set_ylabel(r'$y$'); ax.set_zlabel(r'RBF, $k(d)$')
plt.show()


## 2 Spectral embedding/clustering functions
Before completing the spectral clustering experiments, complete the functions below.

**Task:**

[3 pt] complete code below for `embed_and_plot()`
- arguments, which will all be fed into the sklearn.manifold.SpectralEmbedding object:
    - `X`, data, number of samples by number of dimensions
    - `aff`, 'rbf' or 'nearest_neighbors' (short for affinity argument in SpectralEmbedding)
    - `gam`, gamma parameter for RBF kernel
    - `num_neighbors`, parameter for k-nearest neighbors
    - You can set default values for `gam` and `num_neighbors` to make your function calls later easier. Here is an [example](https://www.geeksforgeeks.org/default-arguments-in-python/) of how to write default parameters.
- your code should perform two things:
    1. make SpectralEmbedding object, fit it to data X, and obtain eigenvectors of embedding.
    2. display three plots in the same row.
        - scatter plot of embedding coordinates (eigenvector 1 against eigenvector 2)
        - scatter plot of embedding coordinates (eigenvector 2 against eigenvector 3)
        - scatter plot of sorted entries of eigenvector 1 (sorted index i against entry i)
- as usual, your plots should contain appropriate titles, axis labels, etc.
- You should read the [SpectralEmbedding documentation](https://scikit-learn.org/stable/modules/generated/sklearn.manifold.SpectralEmbedding.html) and [SpectralClustering documentation](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.SpectralClustering.html) **carefully**, so as not to make mistakes when implementing. This is a useful skill for any coding activity.
    - In particular, pay attention to their definition of $\gamma$ compared to the definition of $\ell$ in the earlier RBF kernel exercise. What is the inverse relationship between $\gamma$ and $\ell$?
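For orientation, a bare-bones call to `SpectralEmbedding` on separate toy data might look like the sketch below. It deliberately leaves out the plotting and the wrapper function, which are the task. The names `toy_X`, `toy_embedder`, `toy_Y`, `first`, and `first_sorted` are hypothetical, and `gamma=0.5` is an arbitrary value chosen for this sketch, not a recommendation.

```python
import numpy as np
from sklearn.manifold import SpectralEmbedding

# Hypothetical toy data: two blobs in 2D.
rng = np.random.default_rng(0)
toy_X = np.vstack([
    rng.normal(loc=[-2.0, 0.0], scale=0.5, size=(40, 2)),
    rng.normal(loc=[2.0, 0.0], scale=0.5, size=(40, 2)),
])

# affinity="rbf" uses the gamma argument; affinity="nearest_neighbors"
# would use n_neighbors instead.
toy_embedder = SpectralEmbedding(
    n_components=3, affinity="rbf", gamma=0.5, random_state=0
)
toy_Y = toy_embedder.fit_transform(toy_X)   # shape (n_samples, n_components)

first = toy_Y[:, 0]             # first embedding coordinate of every sample
first_sorted = np.sort(first)   # sorted entries, e.g. for a "sorted index vs entry" plot

print(toy_Y.shape)
```

Each column of the returned array holds one embedding coordinate for all samples, so plotting one column against another gives the scatter plots described above.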

In [35]:
from sklearn.manifold import SpectralEmbedding

def embed_and_plot(X, aff, gam=1, num_neighbors=1):
    # TODO make model, fit, and obtain eigenvectors
    
    plt.figure(figsize=(18,4))
    
    plt.subplot(131)
    # TODO scatter eigenvector 1 entries against eigenvector 2 entries
    
    plt.subplot(132)
    # TODO scatter eigenvector 2 entries against eigenvector 3 entries

    plt.subplot(133)
    # TODO scatter sorted eigenvector 1 entries

    
    plt.show()

**Task:**

[2 pt] complete code below for `cluster_and_plot`
- arguments:
    - `X`, `aff`, `gam`, `num_neighbors` as in embed_and_plot
    - `n_clus` is the number of clusters that you want to split the data X into.
- your code should have two components:
    1. make SpectralClustering object, fit to data X, and predict the cluster labels.
    2. scatter plot of data with the predicted cluster labels. Include a title.
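The corresponding call pattern for `SpectralClustering` can be sketched the same way, again on separate toy data and without the plotting. The ring data, the names `toy_X`, `toy_sc`, and `toy_labels`, and the choice `n_neighbors=10` are assumptions for illustration only. scikit-learn may warn that the neighbor graph is not fully connected for data like this.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

# Hypothetical toy data: two noisy concentric rings of radius 1 and 4.
rng = np.random.default_rng(0)
angles = rng.uniform(0.0, 2.0 * np.pi, size=200)
radii = np.repeat([1.0, 4.0], 100)
toy_X = np.column_stack([radii * np.cos(angles), radii * np.sin(angles)])
toy_X += rng.normal(scale=0.05, size=toy_X.shape)

toy_sc = SpectralClustering(
    n_clusters=2, affinity="nearest_neighbors", n_neighbors=10, random_state=0
)
toy_labels = toy_sc.fit_predict(toy_X)   # one integer label per row of toy_X

print(np.bincount(toy_labels))
```

The returned labels can be passed directly to the `c=` argument of `plt.scatter` to color the points by predicted cluster.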


In [36]:
from sklearn.cluster import SpectralClustering

def cluster_and_plot(X, n_clus, aff, gam=1, num_neighbors=1):
    # TODO make model, fit, get cluster labels
    
    # TODO plot

    plt.show()

## 3 Cluster X1
Load and scatter plot `X1`.

In [None]:
X1 = np.loadtxt('X1.csv', delimiter=',')
plt.scatter(X1[:,0], X1[:,1]); plt.show()

**Task:**

[2 pt] call embed_and_plot and cluster_and_plot on X1.
- you should correctly cluster the data into the two obvious clusters.
- experiment with different choices of affinity (rbf/nearest_neighbors) and parameter values (gamma/n_neighbors). 
- But you only need to submit plots for one choice of affinity and parameter that successfully clusters the data.
- all plots should be visible.

In [None]:
# TODO

## 4 Cluster X2
Load and scatter plot `X2`.

In [None]:
X2 = np.loadtxt('X2.csv', delimiter=',')
plt.scatter(X2[:,0], X2[:,1]); plt.show()

### 4.1 cluster
[2 pt] call embed_and_plot and cluster_and_plot on X2.
- same instructions as X1, except that you should obtain 3 clusters.

In [None]:
# TODO

### 4.2 Compare to k-means and GMM
**Task:**

1. [2 pt] Think back to other clustering algorithms in this course. Between k-means and GMM, which is more appropriate for the X2 data?
    - Give a reason why your chosen clustering algorithm is suited to the data.
    - Give a reason why the clustering algorithm you did not choose has a limitation that prevents it from being suitable for the data.

    **Ans:** 

2. [3 pt] Implement the clustering algorithm that you chose in the previous question to be appropriate for X2 data.
    - You may use relevant sklearn packages.
    - Include a plot of the cluster labels to confirm that it is working as you expect.

In [None]:
# TODO implement clustering based on your answer
