# Spectral Clustering

### Project members:
Mengshu Shao, Yelena Kernogitski

### GitHub Repository:
https://github.com/yelenakernogitski/Spectral-Clustering-Project

### Abstract
Clustering is one of the building blocks of modern data analysis and is widely used in machine learning and pattern recognition. Two common approaches are K-means and fitting a mixture model with EM. These methods have drawbacks, however: they rest on harsh simplifying assumptions, such as each cluster having a Gaussian density, that real data often violates. Spectral clustering offers a possible alternative for finding useful clusters: it uses eigenvectors of a matrix derived from the distances between points, in effect performing dimensionality reduction before clustering. In this report, we first implement a simple spectral clustering algorithm for clustering points in $\mathbb{R}^n$. Second, we analyze how it works in the “ideal” case, in which points in different clusters are far enough apart that the affinity matrix is strictly block diagonal, and in the general case, in which the affinity matrix's off-diagonal blocks are non-zero. We then test the algorithm on a number of challenging clustering problems. Further, we attempt to optimize the algorithm using within-Python options (such as vectorization) as well as JIT compilation and Cython-wrapped functions. Finally, we compare the original Python implementation with the optimized version by measuring the efficiency of each.


### Algorithm

In [83]:
import numpy as np
import scipy.linalg as la
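try:
    from numpy.core.umath_tests import inner1d
except ImportError:  # internal NumPy module; newer releases may not provide it
    inner1d = lambda a, b: np.einsum('ij,ij->i', a, b)  # row-wise inner product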

In [3]:
S = np.array([[2,1], [3,4], [5,4]])

#### Form the affinity matrix

In [24]:
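def affinity(s, var):
    # Gaussian affinity: A[i, j] = exp(-||s_i - s_j||^2 / (2 * var))
    n = np.shape(s)[0]
    A = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            # This reset runs on every inner iteration, so the diagonal ends at 0
            # except in the last row, where j == i is the final assignment and
            # A[n-1, n-1] stays 1.
            A[i, i] = 0
            A[i, j] = np.exp(-(la.norm(s[i] - s[j])**2) / (2*var))
    return A

n = np.shape(S)[0]
A = affinity(S, 1)  # var = 1 reproduces the values of D shown below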
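
The double loop can also be written with NumPy broadcasting, which is one of the within-Python optimizations mentioned in the abstract. The following is a minimal sketch rather than the project's implementation: `affinity_vectorized` is a hypothetical name, and it zeroes the entire diagonal, whereas the loop above leaves the last diagonal entry at 1, so the two results differ in that one entry.

```python
import numpy as np

def affinity_vectorized(points, var):
    """Gaussian affinity matrix with a zero diagonal (hypothetical helper)."""
    points = np.asarray(points, dtype=float)
    # Pairwise differences via broadcasting: shape (n, n, d).
    diff = points[:, None, :] - points[None, :, :]
    # Squared Euclidean distance between every pair of points: shape (n, n).
    sq_dist = np.sum(diff ** 2, axis=-1)
    A = np.exp(-sq_dist / (2 * var))
    np.fill_diagonal(A, 0.0)  # A[i, i] = 0 for every i
    return A

S = np.array([[2, 1], [3, 4], [5, 4]])
A_vec = affinity_vectorized(S, var=1)
```

The intermediate `diff` array needs memory proportional to n² times the dimension, so for large inputs a chunked or library-based distance computation may be preferable.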

#### Define D and L

In [40]:
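# D stores the reciprocal of each row sum of A on its diagonal,
# i.e. the inverse of the usual degree matrix.
D = np.zeros((n, n))
for i in range(n):
    D[i, i] = 1 / (A[i].sum())
D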

array([[ 145.74376886,    0.        ,    0.        ],
       [   0.        ,    7.03862366,    0.        ],
       [   0.        ,    0.        ,    0.88070135]])

In [130]:
# Normalized affinity: entry (i, j) of A is scaled by sqrt(D[i, i]) * sqrt(D[j, j])
L = np.sqrt(D).dot(A).dot(np.sqrt(D))
L

array([[ 0.        ,  0.21580746,  0.00139817],
       [ 0.21580746,  0.        ,  0.33695293],
       [ 0.00139817,  0.33695293,  0.88070135]])
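
Because `D` is diagonal, the two matrix products amount to scaling the rows and columns of `A`. The sketch below illustrates that equivalence under the assumption that every row sum of `A` is positive; `normalized_affinity` and the values in `A_demo` are made up for illustration and are not taken from the project.

```python
import numpy as np

def normalized_affinity(A):
    """Scale A by the inverse square roots of its row sums (hypothetical helper)."""
    A = np.asarray(A, dtype=float)
    inv_sqrt_deg = 1.0 / np.sqrt(A.sum(axis=1))  # assumes positive row sums
    # Entry (i, j) is multiplied by inv_sqrt_deg[i] * inv_sqrt_deg[j],
    # which matches sqrt(D) @ A @ sqrt(D) without building D.
    return A * inv_sqrt_deg[:, None] * inv_sqrt_deg[None, :]

# Made-up symmetric affinity matrix with a zero diagonal.
A_demo = np.array([[0.0, 0.5, 0.1],
                   [0.5, 0.0, 0.2],
                   [0.1, 0.2, 0.0]])
L_demo = normalized_affinity(A_demo)

# Reference computation in the same style as the cell above.
D_demo = np.diag(1.0 / A_demo.sum(axis=1))
L_ref = np.sqrt(D_demo).dot(A_demo).dot(np.sqrt(D_demo))
```

When `A` is symmetric, the result is symmetric as well, which matters for the eigendecomposition in the next step.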

#### Find the eigenvectors of L for the k largest eigenvalues

In [63]:
value, vector = la.eig(L)

In [66]:
# Reorder the eigenpairs so the eigenvalues are in descending order
idx = np.argsort(value)[::-1]
value = value[idx]
vector = vector[:, idx]
vector

array([[ 0.07308967,  0.79087251, -0.60760067],
       [ 0.33258841,  0.5550276 ,  0.76244954],
       [ 0.94023553, -0.25780813, -0.22246825]])

In [67]:
k = 2
X = vector[:, 0:k]
X

array([[ 0.07308967,  0.79087251],
       [ 0.33258841,  0.5550276 ],
       [ 0.94023553, -0.25780813]])
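
When `L` is symmetric, `scipy.linalg.eigh` is an alternative to `la.eig` that returns real eigenvalues in ascending order. The sketch below shows how the top-k selection might look with it; `top_k_eigenvectors` and `L_demo` are hypothetical, and since the sign of each eigenvector is arbitrary, columns may differ in sign from the output above.

```python
import numpy as np
import scipy.linalg as la

def top_k_eigenvectors(L, k):
    """Eigenpairs of a symmetric matrix for its k largest eigenvalues (hypothetical helper)."""
    w, v = la.eigh(L)                # real eigenvalues, ascending order
    order = np.argsort(w)[::-1][:k]  # indices of the k largest eigenvalues
    return w[order], v[:, order]

# Made-up symmetric matrix with eigenvalues 3, 1 and 0.5.
L_demo = np.array([[2.0, 1.0, 0.0],
                   [1.0, 2.0, 0.0],
                   [0.0, 0.0, 0.5]])
w_top, X_demo = top_k_eigenvectors(L_demo, k=2)
```

Each column of `X_demo` is an eigenvector, so `L_demo @ X_demo` equals `X_demo * w_top` up to rounding.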

#### Form the matrix Y

In [80]:
# Divide each row of X by its row sum, so every row of Y sums to 1
Y = X / np.sum(X, 1)[:, np.newaxis]
Y

array([[ 0.08459823,  0.91540177],
       [ 0.37469853,  0.62530147],
       [ 1.37778103, -0.37778103]])
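
The cell above divides each row by its sum, which can inflate a row when its entries have mixed signs, as in the third row of the output. A common formulation of spectral clustering instead rescales each row of `X` to unit Euclidean length. The following sketch shows that variant as an assumption about the intended step, not as the project's code; `normalize_rows_unit_length` and `X_demo` are hypothetical, and rows are assumed to be non-zero.

```python
import numpy as np

def normalize_rows_unit_length(X):
    """Rescale every row of X to Euclidean norm 1 (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)  # shape (n, 1)
    return X / norms  # assumes no row is all zeros

# Made-up input, including a row with mixed signs.
X_demo = np.array([[3.0, 4.0],
                   [1.0, 0.0],
                   [0.5, -0.5]])
Y_unit = normalize_rows_unit_length(X_demo)
```

With this normalization every row of `Y_unit` lies on the unit sphere, so k-means compares directions rather than magnitudes.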

#### K-means clustering

In [128]:
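def kmeans(y, k, max_iter=10):
    # Initialize the centers with k distinct rows of y chosen at random
    idx = np.random.choice(len(y), k, replace=False)
    print(idx)
    idx_data = y[idx]
    for _ in range(max_iter):
        # dist[c, p]: squared Euclidean distance from point p to center c
        dist = np.array([inner1d(y-c, y-c) for c in idx_data])
        # Assign every point to its nearest center
        clusters = np.argmin(dist, axis=0)
        # Move each center to the mean of the points assigned to it
        idx_data = np.array([y[clusters==c].mean(axis=0) for c in range(k)])
    return (clusters, idx_data)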

In [129]:
kmeans(Y, 2, max_iter=10)

[2 0]


(array([1, 1, 0]), array([[ 1.37778103, -0.37778103],
        [ 0.22964838,  0.77035162]]))
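
The function above always runs `max_iter` iterations, draws its starting centers from the global random state, and takes the mean of an empty slice if a cluster loses all its points. The sketch below is one possible variant that addresses these three points; `kmeans_seeded`, its `seed` parameter, and the demo data are hypothetical and not part of the project.

```python
import numpy as np

def kmeans_seeded(y, k, max_iter=10, seed=0):
    """K-means with a fixed seed, early stopping and an empty-cluster guard (sketch)."""
    rng = np.random.default_rng(seed)
    centers = y[rng.choice(len(y), size=k, replace=False)]
    for _ in range(max_iter):
        # dist[c, p]: squared Euclidean distance from point p to center c.
        dist = ((y[None, :, :] - centers[:, None, :]) ** 2).sum(axis=-1)
        labels = np.argmin(dist, axis=0)
        # Keep the old center when a cluster has no points assigned to it.
        new_centers = np.array([
            y[labels == c].mean(axis=0) if np.any(labels == c) else centers[c]
            for c in range(k)
        ])
        if np.allclose(new_centers, centers):
            break  # centers stopped moving
        centers = new_centers
    return labels, centers

# Made-up data: two well-separated pairs of points.
y_demo = np.array([[0.0, 0.0], [0.0, 0.1], [5.0, 5.0], [5.0, 5.1]])
labels_demo, centers_demo = kmeans_seeded(y_demo, k=2)
```

As with the original function, the label numbering is arbitrary: only which points share a label is meaningful.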

#### Assign the original points in S to clusters

In [97]:
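zs, centers = kmeans(Y, 2)  # zs[i] is the cluster label of S[i]; label numbering can differ between runs
np.concatenate((S, zs.reshape((-1, 1))), axis=1)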

array([[2, 1, 0],
       [3, 4, 0],
       [5, 4, 1]])

### Test algorithm on simulation datasets