# Section 10 - Clustering Models: GMM, Spectral Embedding and Clustering

In this section we will:

- Review lecture content on clustering methods;
- Better understand the Spectral Embedding model;
- Get hands-on experience using this clustering method.

---

## Gaussian Mixture Models (GMMs)

### Overview
A Gaussian Mixture Model (GMM) is a probabilistic model that assumes all data points are generated from a mixture of Gaussian distributions with unknown parameters. Below, we generate synthetic data, fit a GMM to it, and plot the contours of the fitted model.


In [1]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.mixture import GaussianMixture
from sklearn.datasets import make_blobs

In [None]:
# Generate synthetic data
X, y_true = make_blobs(n_samples=500, centers=3, cluster_std=1, random_state=42)
scatter = plt.scatter(X[:, 0], X[:, 1], c=y_true, s=40)
plt.title("Original Data")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.legend(*scatter.legend_elements())
plt.show()

In [3]:
# TODO: Fit a GMM with 3 components
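
One possible way to complete the TODO above is sketched here. It is a minimal example rather than the official solution: `random_state=0` is an arbitrary choice added for reproducibility, `probs` is an extra illustrative variable, and the names `gmm` and `labels` are chosen to match the plotting cell that follows.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Same synthetic data as in the cell above
X, y_true = make_blobs(n_samples=500, centers=3, cluster_std=1, random_state=42)

# Fit a GMM with 3 components (random_state is an assumption, for reproducibility)
gmm = GaussianMixture(n_components=3, random_state=0)
gmm.fit(X)

# Hard assignment: index of the most likely component for each point
labels = gmm.predict(X)

# Soft assignment: posterior probability of each component for each point
probs = gmm.predict_proba(X)

print("Component means:\n", gmm.means_)
print("Mixing weights:", gmm.weights_)
```

Unlike k-means, the fitted model gives each point a probability of belonging to every component, so `probs` can be used to spot points that sit between clusters.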


In [None]:
# Plot data points and Gaussian components
# (assumes `gmm` and `labels` were defined in the previous cell)
plt.scatter(X[:, 0], X[:, 1], c=labels, s=40, cmap='viridis', zorder=2)
x, y = np.meshgrid(np.linspace(-10, 10, 100), np.linspace(-10, 15, 100))
XX = np.array([x.ravel(), y.ravel()]).T
Z = -gmm.score_samples(XX)   # negative log-likelihood at each grid point
Z = Z.reshape(x.shape)
plt.contour(x, y, Z, levels=10, zorder=1)
plt.title("Gaussian Components and Contour Levels")
plt.show()

## 1 - Load data

In [None]:
from sklearn.datasets import make_moons
X, y = make_moons(n_samples=300, noise=0.07, random_state=0)
print(X.shape, y.shape)

In [None]:
import matplotlib.pyplot as plt
plt.scatter(X[:, 0], X[:, 1])
plt.show()

## 2 - Recap

### 1a. Big Idea
- Spectral Embedding is a **<u>dimensionality reduction technique</u>** that belongs to the family of manifold learning methods.
- The central idea of these methods is that, even though the data may be represented in a high-dimensional space, the important patterns/characteristics of the data are actually **inherently lower dimensional**.

### 1b. Similarity to PCA
- Like PCA, if we can **<u>"project" our data onto some appropriate low-dimensional space</u>** (using <u>eigen-magic!</u>), then we can perform tasks like classification or clustering on this new (and hopefully simpler) representation of the data.
- Unlike PCA, Spectral Embedding uses the <u>**graph/network of the data**</u> instead of the covariance of the data.

### 2a. Spectral Embedding Method
1. **Adjacency A.** Construct the graph adjacency matrix $A$, which has dimension $n \times n$, where $n$ is the number of data points.
    - Vertices $i$ and $j$ are connected, $A_{ij} = A_{ji} = 1$, if and only if the data points are "close enough": $\|x_i-x_j\| < d_T$ for some threshold distance $d_T$ (called `epsilon` in the code below).
2. **Degree D.** Compute the graph degree matrix $D$, a diagonal matrix whose entries $D_{ii} = \sum_j A_{ij}$ are the degrees of each vertex.
3. **Laplacian L.** Compute the graph Laplacian $L=D-A$.
4. **Eigendecomposition.** Perform an eigendecomposition $L = U \Lambda U^T$ to get the eigenvectors and eigenvalues, and study those objects to get a sense of the data in a new basis.
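
As a small worked example of steps 2-4 (a hypothetical three-vertex path graph, not part of this section's data), suppose thresholding connects vertex 1 to vertex 2 and vertex 2 to vertex 3, but not vertex 1 to vertex 3:

$$
A = \begin{pmatrix} 0 & 1 & 0 \\ 1 & 0 & 1 \\ 0 & 1 & 0 \end{pmatrix}, \quad
D = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 2 & 0 \\ 0 & 0 & 1 \end{pmatrix}, \quad
L = D - A = \begin{pmatrix} 1 & -1 & 0 \\ -1 & 2 & -1 \\ 0 & -1 & 1 \end{pmatrix}.
$$

The eigenvalues of $L$ are $0$, $1$, and $3$, with (unnormalized) eigenvectors $(1, 1, 1)^T$, $(1, 0, -1)^T$, and $(1, -2, 1)^T$. The constant eigenvector with eigenvalue $0$ carries no information about the data. The second eigenvector orders the vertices along the path, with the two end vertices on opposite sides of zero, which is the kind of structure the embedding is meant to expose.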

### 2b. Spectral Clustering Method
- After embedding, we can perform clustering on the eigenvectors using any of your favorite clustering techniques (manual thresholding, k-means, GMM, etc.).
- Like PCA, we can consider multiple eigenvectors ("multiple PCs") when performing our clustering.
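
The idea of clustering on several eigenvectors at once can be sketched as follows. This is an illustration under stated assumptions, not the section's exercise: the toy data `X_toy` (three tight, well-separated groups), the threshold `2.0`, and the choice of using the `k` lowest eigenvectors as the embedding are all picked here for demonstration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical toy data: three tight groups of 8 points each, far apart
angles = np.linspace(0, 2 * np.pi, 8, endpoint=False)
ring = 0.5 * np.column_stack([np.cos(angles), np.sin(angles)])
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X_toy = np.vstack([c + ring for c in centers])    # shape (24, 2)

# Steps 1-4 of spectral embedding (2.0 is an assumed threshold distance)
dist = np.linalg.norm(X_toy[:, None, :] - X_toy[None, :, :], axis=-1)
A = (dist < 2.0).astype(float)
np.fill_diagonal(A, 0.0)                  # no self-loops
D = np.diag(A.sum(axis=1))
L = D - A
evals, evecs = np.linalg.eigh(L)          # eigenvalues in ascending order

# Use the k lowest eigenvectors as a k-dimensional embedding, then cluster
k = 3
embedding = evecs[:, :k]                  # one row per data point
labels_toy = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(embedding)
print(labels_toy)
```

Because the three groups form three disconnected pieces of the graph, the three lowest eigenvalues are (numerically) zero and every point in a group maps to the same row of `embedding`, which makes the k-means step easy.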


## 3 - Spectral embedding
**Task:**
- Complete the code below for steps 1-4 of spectral embedding.
- Tweak the parameters where prompted to see how they affect the clustering assignment.

#### Step 1: Adjacency

In [7]:
import numpy as np

# Compute the adjacency matrix
sqd_residual = (X[np.newaxis, :, :] - X[:, np.newaxis, :]) ** 2
dist = None         # TODO

## Tweak the epsilon param between 0.1 and 1
## to see how this affects clustering,
## e.g. 0.1, 0.3, 0.6, 1
epsilon = None      # TODO
A = None            # TODO

#### Step 2:  Degree

In [8]:
# Compute degree matrix
degree = None       # TODO
D      = None       # TODO

#### Step 3:  Laplacian

In [9]:
# Compute the graph Laplacian
L = None            # TODO

#### Step 4:  Eigendecomposition

In [10]:
# Eigendecomposition
# TODO
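
One possible completion of steps 1-4 is sketched below. It is a hedged example, not the official solution: `epsilon = 0.3` is just one of the suggested values, removing self-loops is an assumption (it does not change $L = D - A$, since a diagonal entry in $A$ adds the same amount to the degree), and treating `evec_num` as a 1-based index is also an assumption.

```python
import numpy as np
from sklearn.datasets import make_moons

# Same data as in "1 - Load data"
X, y = make_moons(n_samples=300, noise=0.07, random_state=0)

# Step 1: pairwise Euclidean distances, thresholded into an adjacency matrix
sqd_residual = (X[np.newaxis, :, :] - X[:, np.newaxis, :]) ** 2
dist = np.sqrt(sqd_residual.sum(axis=-1))       # shape (300, 300)
epsilon = 0.3                                    # one of the suggested values
A = (dist < epsilon).astype(float)
np.fill_diagonal(A, 0.0)                         # assumption: no self-loops

# Step 2: degree of each vertex, placed on the diagonal of D
degree = A.sum(axis=1)
D = np.diag(degree)

# Step 3: graph Laplacian
L = D - A

# Step 4: eigendecomposition; eigh is meant for symmetric matrices and
# returns eigenvalues in ascending order with eigenvectors as columns
evals, evecs = np.linalg.eigh(L)

# Example selection for the visualization cell: the second-lowest eigenvector
evec_num = 2                                     # assumption: 1-based index
evec = evecs[:, evec_num - 1]
```

`np.linalg.eigh` is used instead of `np.linalg.eig` because $L$ is symmetric, so the eigenvalues come back real and already sorted.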

#### Visualize embedding

In [None]:
## Try using other evecs, such as 3rd and 4th lowest,
## to see how that affects the clustering assignment
evec_num = None       # TODO
evec     = None       # TODO
s = np.argsort(evec)

plt.plot(evec[s], 'x', alpha=0.5)
plt.title('Sorted eigenvector %d' % evec_num)
plt.show()

## Spectral Clustering
### Option 1: heuristic/round-off/manual on evec

In [None]:
# Clustering data
label_a = (evec < 0).astype(int)

plt.figure(figsize=(9,4))
plt.suptitle('epsilon = %.1f,   evec %d' % (epsilon, evec_num))

plt.subplot(121)
plt.scatter(X[:,0], X[:,1], c=label_a)
plt.title("Clusters via spectral embedding + thresholding")

plt.subplot(122)
plt.scatter(np.arange(len(evec)), evec[s], marker='x', c=label_a[s], alpha=0.5)
plt.hlines(0, 0, len(evec), color='k', alpha=0.4, linewidth=5, label='threshold')
plt.title('Sorted evec %d, with labels' % evec_num)
plt.legend()

plt.show()

### Option 2: k-means, gmm, etc.

In [None]:
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=2, n_init=100, init='random')
kmeans.fit(evec.reshape(-1,1))
label_b = kmeans.labels_       # to swap the two colors, use 1 - kmeans.labels_ instead
centers = kmeans.cluster_centers_

plt.figure(figsize=(9,4))
plt.suptitle('epsilon = %.1f,   evec %d' % (epsilon, evec_num))

plt.subplot(121)
plt.scatter(X[:, 0], X[:, 1], c=label_b)
plt.title("Clusters via spectral embedding + k-means")

plt.subplot(122)
plt.scatter(np.arange(len(evec)), evec[s], marker='x', c=label_b[s], alpha=0.5)
plt.scatter([0,0],centers,c='k',s=200,alpha=0.4,edgecolor='none', label='k-means centroid (1d)')
plt.hlines(centers, 0, len(evec), color='k')
plt.title('Sorted evec %d, with labels' % evec_num)
plt.legend(loc='center left')

plt.show()

## 4 - K-Means
Compare spectral clustering against k-means.

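One can see that Spectral Clustering groups the data points better than K-Means. This is expected given the shape of our data: with two clusters, k-means assigns each point to the nearest centroid, so it splits the plane with a straight boundary and cannot separate two interleaved, non-convex moons.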

In [14]:
kmeans_model = KMeans(n_clusters=2, n_init=100, init='random').fit(X)
label_c = kmeans_model.labels_       # to swap the two colors, try 1 - kmeans_model.labels_
C = kmeans_model.cluster_centers_

In [None]:
plt.scatter(X[:,0], X[:,1], c=label_c) 
plt.scatter(C[:,0],C[:,1],c='k',s=300,alpha=0.4,edgecolor='none')
plt.title("Clusters via k-means only")
plt.show()
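
The visual comparison can also be backed by a number. The sketch below scores k-means against the true moon labels `y` using the adjusted Rand index (ARI); the use of `adjusted_rand_score` and the `random_state=0` setting are additions made here for illustration and are not part of the original exercise.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Same data as in "1 - Load data"
X, y = make_moons(n_samples=300, noise=0.07, random_state=0)

# random_state is an assumption added for reproducibility
kmeans_model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# ARI is 1.0 for a perfect match with the true moons (regardless of which
# cluster gets which label) and close to 0 for a random assignment
ari_kmeans = adjusted_rand_score(y, kmeans_model.labels_)
print("ARI, k-means on raw data: %.2f" % ari_kmeans)
```

The same call can be applied to `label_a` or `label_b` from the spectral clustering cells to compare the methods for a given `epsilon` and eigenvector.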