# Sparse Subspace Clustering (Vidal IEEE Trans. on PAMI 2013) on Yale Dataset

#### References:


1.   Main Paper (Vidal IEEE Trans. on PAMI 2013): https://arxiv.org/pdf/1203.1005.pdf
2.   Supplementary Paper (Vidal CVPR 2009): http://cis.jhu.edu/~ehsan/Downloads/SSC-CVPR09-Ehsan.pdf
3.   Spectral Clustering: http://people.csail.mit.edu/dsontag/courses/ml14/notes/Luxburg07_tutorial_spectral_clustering.pdf
4.   Spectral Clustering Code: https://juanitorduz.github.io/spectral_clustering/
5.   Yale Dataset: http://vision.ucsd.edu/content/yale-face-database

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Our objective: given a set of points drawn from a `union of subspaces`, find the number of subspaces, their dimensions, a basis for each subspace, and the segmentation of the data.
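To make the `union of subspaces` setting concrete, the following sketch generates synthetic points of that kind. It is an illustration only and is not part of the Yale pipeline: the function name `sample_union_of_subspaces`, its parameters, and the toy sizes are all assumptions introduced here.

```python
import numpy as np

def sample_union_of_subspaces(ambient_dim, subspace_dims, points_per_subspace, seed=0):
    """Draw points from random linear subspaces (hypothetical helper for illustration)."""
    rng = np.random.default_rng(seed)
    blocks, labels = [], []
    for k, d in enumerate(subspace_dims):
        # Orthonormal basis A_k of a random d-dimensional subspace, obtained via QR.
        A_k, _ = np.linalg.qr(rng.standard_normal((ambient_dim, d)))
        # Random coefficients place each point inside span(A_k).
        coeffs = rng.standard_normal((d, points_per_subspace))
        blocks.append(A_k @ coeffs)
        labels += [k] * points_per_subspace
    # Columns are data points, matching the (D, N) layout used in this notebook.
    return np.hstack(blocks), np.array(labels)

Y_toy, labels_toy = sample_union_of_subspaces(
    ambient_dim=10, subspace_dims=[2, 3], points_per_subspace=20
)
print(Y_toy.shape)
```

Each block of columns spans a subspace of the requested dimension, which is the structure the clustering below tries to recover without access to the labels.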

In [21]:
import numpy as np
import cvxpy as cp
from scipy.sparse import identity
from cvxpy.atoms.elementwise.power import power

##### Set Parameters of the data space

In [2]:
N = 165 # Total no of points
D = 1024  # Dimension of space
K = 15 # Number of clusters

In [3]:
yale_data = np.load('yale_X.npy')
input_data = yale_data.astype('float32') / 255.
input_data = input_data.T
orig_label = np.load('yale_Y.npy')

The matrix *input_data* stores the data points as columns and has shape $(D, N) = (1024, 165)$. It has the form $[y_0, y_1, \ldots, y_{N-1}]$, where each $y_i$ is the vector of a data point of dimension $D$.

Up to a reordering of its columns, *input_data* can also be written as $[Y_1, Y_2, \ldots, Y_n]$, where each $Y_i$ denotes the set of $N_i$ data points belonging to subspace $i$. We also assume that the dimension of each subspace $i$ is $d_i$ and that $A_i$ is its basis.

In [4]:
def find_sparse_sol(Y,i,N):
    # Ybari holds every column of Y except the i-th one (the point being represented)
    Ybari = np.delete(Y, i, axis=1)
    # take the dimension from Y itself instead of relying on the global D
    yi = Y[:,i].reshape(-1,1)
    
    # ci will contain the solution of the l1-regularised problem:
    # min ||yi - Ybari*ci||_2^2 + lambda*||ci||_1   s.t. sum(ci) = 1
    
    ci = cp.Variable(shape=(N-1,1))
    constraint = [cp.sum(ci)==1]
    obj = cp.Minimize(power(cp.norm(yi-Ybari@ci,2),2) + 199101*cp.norm(ci,1)) #lambda = 199101
    prob = cp.Problem(obj, constraint)
    prob.solve()
    return ci.value

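The function `find_sparse_sol` above solves the optimisation problem $\min_{c_i} \; \|y_i - Y_{\hat{i}}c_i\|_2^2 + \lambda\|c_i\|_1$ subject to the constraint $\mathbf{1}^T c_i = 1$, where $Y_{\hat{i}}$ is the data matrix with its $i$-th column removed and $\lambda = 199101$ in the code. The squared-error term replaces the exact equality $y_i = Y_{\hat{i}}c_i$ so that *noise in the data can be tolerated*, and requiring the entries of $c_i$ to sum to $1$ accounts for affine subspaces.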
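As a rough illustration of how an $\ell_1$-regularised self-representation picks out points from the same subspace, here is a NumPy-only sketch. It is not the solver used above: it swaps in plain ISTA (iterative soft-thresholding), drops the sum-to-one constraint, and minimises $\tfrac{1}{2}\|y - Yc\|_2^2 + \lambda\|c\|_1$. The names `soft_threshold`, `lasso_ista`, `lam`, and `n_iter`, as well as the toy data, are assumptions made for this example.

```python
import numpy as np

def soft_threshold(x, t):
    """Proximal operator of t * ||x||_1."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def lasso_ista(Y_others, y, lam=0.01, n_iter=500):
    """Approximately minimise 0.5*||y - Y_others c||_2^2 + lam*||c||_1 with ISTA."""
    # Step size 1/L, where L is the squared spectral norm of Y_others.
    step = 1.0 / (np.linalg.norm(Y_others, 2) ** 2)
    c = np.zeros(Y_others.shape[1])
    for _ in range(n_iter):
        grad = Y_others.T @ (Y_others @ c - y)
        c = soft_threshold(c - step * grad, step * lam)
    return c

# Toy data: columns 0-2 lie in span(e1, e2), columns 3-5 lie in span(e3, e4).
Y_demo = np.array([
    [1.0, 0.0, 1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0, 0.0, 1.0],
    [0.0, 0.0, 0.0, 0.0, 1.0, 1.0],
])
i = 2                                    # represent column 2 using all other columns
Y_others = np.delete(Y_demo, i, axis=1)
c_demo = lasso_ista(Y_others, Y_demo[:, i])
print(np.round(c_demo, 3))
```

In this toy case the two subspaces are orthogonal, so the coefficients on the columns from the other subspace stay at zero and only the two points sharing a subspace with column 2 are selected.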

In [5]:
C = np.concatenate((np.zeros((1,1)),find_sparse_sol(input_data,0,N)),axis=0)

for i in range(1,N):
    ci = find_sparse_sol(input_data,i,N)
    zero_element = np.zeros((1,1))
    cif = np.concatenate((ci[0:i,:],zero_element,ci[i:N,:]),axis=0)
    C = np.concatenate((C,cif),axis=1)
print(C.shape)

(165, 165)


In the code above, we insert a zero at the $i$-th position of $c_i$ to form $\hat{c}_i \in \mathbb{R}^{N}$, which is called *cif* in the code.
We then concatenate $[\hat{c}_0, \hat{c}_1, \ldots, \hat{c}_{N-1}]$ column by column to form the matrix $C$.

$C$ is the *Matrix of Coefficients* and is of the form $C = [\hat{c}_0, \hat{c}_1, \ldots, \hat{c}_{N-1}] \in \mathbb{R}^{N \times N}$.
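The assembly of $C$ can also be written by filling a preallocated array, which avoids repeated concatenation. The sketch below assumes a callable `solve_column(i)` that returns the $N-1$ coefficients for point $i$ (for instance a wrapper around `find_sparse_sol`); both that callable and the helper name `assemble_coefficient_matrix` are hypothetical.

```python
import numpy as np

def assemble_coefficient_matrix(solve_column, N):
    """Build the N x N coefficient matrix with a zero diagonal (illustrative helper)."""
    C = np.zeros((N, N))
    for i in range(N):
        # Flatten the (N-1, 1) or (N-1,) solution to a 1-D vector of length N-1.
        ci = np.asarray(solve_column(i), dtype=float).ravel()
        # Insert a zero at position i so that point i never represents itself.
        C[:, i] = np.insert(ci, i, 0.0)
    return C

# Stand-in solver that returns all ones, used only to show the layout.
C_toy = assemble_coefficient_matrix(lambda i: np.ones(3), N=4)
print(C_toy)
```

With the stand-in solver, every off-diagonal entry is one and the diagonal is zero, which shows where each inserted zero lands.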

In [6]:
W = np.add(np.absolute(C), np.absolute(C.T))
print(W.shape)

(165, 165)


In [7]:
# Check sparsity by counting the number of zeros

cz = 0
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        if W[i,j] < 1e-5 and W[i,j] > -1e-5:
            cz += 1
print(cz)

25137


In the first of the two code blocks above, we build a symmetric matrix from $C$ by the operation $W = |C| + |C^T|$. It is still a valid representation of the similarity since if $y_i$ can write itself as a linear combination of all the points in the same subspace including $y_j$, then $y_j$ can also be represented as a linear combination of all the other points belonging to the same subspace including $y_i$.

In the second code block, we print the number of zeros in the matrix $W$, treating any $|W_{ij}|$ less than $10^{-5}$ as $0$. Here $25137$ of the $165^2 = 27225$ entries are zero, so $W$ is sparse.
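The symmetrisation and the zero count can both be expressed without explicit loops. This is a minimal sketch on a small made-up matrix; the helper names `affinity_from_coefficients` and `count_near_zero` and the default tolerance are assumptions chosen to mirror the code above.

```python
import numpy as np

def affinity_from_coefficients(C):
    """Symmetric affinity W = |C| + |C^T|."""
    return np.abs(C) + np.abs(C.T)

def count_near_zero(W, tol=1e-5):
    """Number of entries whose absolute value is below tol."""
    return int(np.count_nonzero(np.abs(W) < tol))

C_small = np.array([
    [0.0,  0.8, 0.0],
    [0.3,  0.0, 0.0],
    [0.0, -0.5, 0.0],
])
W_small = affinity_from_coefficients(C_small)
print(W_small)
print(count_near_zero(W_small))
```

Dividing the count by `W.size` gives the fraction of zero entries, which is often easier to compare across datasets than the raw count.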

In [8]:
# Note: from here on D holds the degree matrix, no longer the dimension 1024
sum_list=[]
for i in range(N):
    csum = np.sum(W[:,i],axis=0)
    sum_list.append(csum)

D = np.diag(sum_list)
print(D)

[[2.13075991 0.         0.         ... 0.         0.         0.        ]
 [0.         2.70513964 0.         ... 0.         0.         0.        ]
 [0.         0.         2.48878329 ... 0.         0.         0.        ]
 ...
 [0.         0.         0.         ... 1.36819346 0.         0.        ]
 [0.         0.         0.         ... 0.         1.50284329 0.        ]
 [0.         0.         0.         ... 0.         0.         1.5949385 ]]


$D$ is the `degree matrix`. In this case, $D \in \mathbb{R}^{N \times N}$ is a diagonal matrix with $D_{ii} = \sum_{j}W_{ij}$. Since $W$ is symmetric, the column sums computed in the code equal these row sums.

In [48]:
L = np.subtract(D, W)
# D^(-1/2) as a diagonal matrix; eps guards against division by zero
D_inv_sqrt = np.diag(np.divide(1, np.sqrt(np.sum(D, axis=0) + np.finfo(float).eps)))
LN = D_inv_sqrt @ L @ D_inv_sqrt
print(LN.shape)

(165, 165)


$L = D - W$ is the Laplacian matrix, and *LN* is the Normalized Laplacian matrix $L_N = D^{-1/2} L D^{-1/2}$. Next, we calculate the `eigenvalues` and `eigenvectors` of the Normalized Laplacian matrix, which we will use for Spectral clustering of the data points. $L$ is a *positive semi-definite matrix*, which means that all of its eigenvalues are greater than or equal to $0$; the same holds for $L_N$.
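The construction of the degree matrix and the Normalized Laplacian can be checked on a tiny graph. The sketch below uses a made-up $3 \times 3$ affinity matrix, and the helper name `normalized_laplacian` is an assumption; it scales rows and columns by broadcasting instead of multiplying by diagonal matrices, which should give the same result.

```python
import numpy as np

def normalized_laplacian(W):
    """Symmetric normalised Laplacian D^(-1/2) (D - W) D^(-1/2) of an affinity matrix W."""
    degrees = W.sum(axis=1)
    L = np.diag(degrees) - W
    # eps guards against division by zero for isolated points, as in the code above.
    inv_sqrt = 1.0 / np.sqrt(degrees + np.finfo(float).eps)
    # Scale row i by inv_sqrt[i] and column j by inv_sqrt[j].
    return inv_sqrt[:, None] * L * inv_sqrt[None, :]

W_graph = np.array([
    [0.0, 1.0, 0.5],
    [1.0, 0.0, 0.2],
    [0.5, 0.2, 0.0],
])
LN_graph = normalized_laplacian(W_graph)
eigvals_graph = np.linalg.eigvalsh(LN_graph)  # eigenvalues in ascending order
print(eigvals_graph)
```

The smallest eigenvalue is zero up to rounding and none of the eigenvalues is negative, in line with the positive semi-definiteness noted above.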

### Perform Spectral Clustering with Laplacian Matrix LN

In [50]:
from scipy import linalg

eigenvals, eigenvcts = linalg.eig(LN)

eigenvals = np.real(eigenvals)
eigenvcts = np.real(eigenvcts)

eig = eigenvals.reshape((N,1))

Sort the eigenvalues in ascending order

In [51]:
eigenvals_sorted_indices = np.argsort(eigenvals)
eigenvals_sorted = eigenvals[eigenvals_sorted_indices]

In [52]:
indices = []
for i in range(0,K):
    ind = []
    print(eigenvals_sorted_indices[i])
    ind.append(eigenvals_sorted_indices[i])
    indices.append(np.asarray(ind))

0
1
2
3
4
5
6
7
8
9
10
11
12
13
14


In the above code, we find the indices of the eigenvectors corresponding to the $K$ smallest eigenvalues.
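Because the Normalized Laplacian is symmetric, a symmetric eigensolver is an alternative to a general one: it returns real eigenvalues already sorted in ascending order. The sketch below uses `numpy.linalg.eigh` on a toy graph with two disconnected pairs of points; the helper name `smallest_eigenpairs` is an assumption and this is not the routine used in the cells above.

```python
import numpy as np

def smallest_eigenpairs(LN, k):
    """Return the k smallest eigenvalues of a symmetric matrix and their eigenvectors."""
    # Symmetrise to remove tiny asymmetries caused by floating-point rounding.
    LN_sym = (LN + LN.T) / 2.0
    eigenvalues, eigenvectors = np.linalg.eigh(LN_sym)  # ascending order
    return eigenvalues[:k], eigenvectors[:, :k]

# Two disconnected pairs of points: every degree is 1, so L_N = I - W.
W_blocks = np.array([
    [0.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
    [0.0, 0.0, 1.0, 0.0],
])
LN_blocks = np.eye(4) - W_blocks
vals_blocks, vecs_blocks = smallest_eigenpairs(LN_blocks, k=2)
print(vals_blocks)
```

The graph has two connected components and the two smallest eigenvalues are both zero up to rounding, which is why the eigenvectors of the smallest eigenvalues carry the cluster structure.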

In [53]:
indices = np.asarray(indices)

In [54]:
zero_eigenvals_index = np.array(indices)

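Here, *eigenvals_sorted_indices* lists the eigenvalue indices in ascending order of eigenvalue, and the sorted eigenvalues themselves are stored in *eigenvals_sorted*. The array *zero_eigenvals_index* holds the first $K$ of these indices; despite its name, only the smallest of the selected eigenvalues is numerically zero.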

In [55]:
eigenvals[zero_eigenvals_index]

array([[3.81639165e-17],
       [2.11135860e-02],
       [4.37594863e-02],
       [5.82583877e-02],
       [6.56693678e-02],
       [7.08488808e-02],
       [7.43887285e-02],
       [8.18797572e-02],
       [8.53868541e-02],
       [9.01608483e-02],
       [1.00053060e-01],
       [1.03587080e-01],
       [1.10542086e-01],
       [1.14718588e-01],
       [1.16976620e-01]])

In [56]:
import pandas as pd

proj_df = pd.DataFrame(eigenvcts[:, zero_eigenvals_index.squeeze()])
proj_df.columns = ['v_' + str(c) for c in proj_df.columns]
proj_df.head()

| | v_0 | v_1 | v_2 | v_3 | v_4 | v_5 | v_6 | v_7 | v_8 | v_9 | v_10 | v_11 | v_12 | v_13 | v_14 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.080355 | -0.021171 | -0.066623 | 0.015366 | 0.006728 | 0.065854 | 0.014297 | -0.016624 | 0.050521 | 0.073861 | 0.040437 | 0.023062 | 0.243547 | 0.164874 | 0.078104 |
| 1 | -0.090539 | 0.005729 | 0.177743 | -0.379185 | 0.074941 | -0.012566 | -0.012606 | -0.092918 | -0.018254 | 0.116905 | -0.065979 | -0.092517 | 0.086272 | -0.045964 | -0.10193 |
| 2 | -0.086843 | -0.027734 | -0.090533 | 0.005871 | 0.033236 | -0.140689 | 0.116367 | -0.066421 | 0.029939 | 0.007743 | -0.029717 | 0.001177 | -0.013345 | -0.013236 | 0.029353 |
| 3 | -0.075656 | -0.02438 | -0.060182 | 0.018189 | 0.004732 | -0.030948 | -0.190639 | -0.112559 | -0.042435 | -0.015155 | -0.009945 | -0.002585 | -0.012979 | -0.013607 | 0.018158 |
| 4 | -0.061761 | -0.018836 | 0.01648 | 0.03162 | -0.039806 | -0.011981 | 0.025399 | -0.006658 | -0.051142 | 0.013397 | -0.011228 | 0.00797 | -0.01662 | -0.006497 | 0.002835 |


Stack the eigenvectors corresponding to the $K$ smallest eigenvalues in a dataframe *proj_df*. This can be thought of as an $N \times K$ matrix in which each column is an eigenvector and each row represents a data point.

Apply *K-Means Clustering* with $K = 15$.

In [57]:
from sklearn.cluster import KMeans

def run_k_means(df, n_clusters):
    k_means = KMeans(random_state=25, n_clusters=n_clusters)
    k_means.fit(df)
    cluster = k_means.predict(df)
    return cluster

cluster = run_k_means(proj_df, n_clusters=K) +1

*run_k_means* applies `K-Means Clustering` on *proj_df* with number of clusters = $K = 15$. The cluster assignment of each data point is returned in the variable *cluster*; adding $1$ shifts the labels from the range $0$ to $14$ to the range $1$ to $15$.

Display clusters formed

In [58]:
print(cluster)

[ 9  5  2  3  1 10 13  1  1  6 14 15  3  1 10  4  1  6  7 11  6 10  1  7
 12  4  1  2  1 12 15  5 12  3  2  1  1 13  1  1  8  1 11  1  5 12 12 12
 12  5  1  2  1  1  5  1  1 12 14  1 12 11 12  3  7 14  1  1  1 15  1  1
  4  1  7  6  9  2  1  1  3  6  1 14 13  1 12  1  9  1  1  4  1 12  1 15
  4  5  9 11 11  3 12  1  1  7  1  6 13 12 12 15 10  1  1  1  1 14  1  2
  4  1 14  1  5  4 11  1  5  6 13  1 11  1  6  7 14 12 15 12 15  8  4  1
 10 14  6  1 10  1 13 10 10  2  7  3  5  2 14  7  9  1  3 14  6]


As we can see, the data points have been grouped into fifteen clusters, labelled $1$ to $15$, corresponding to the fifteen subspaces that we have considered.

In [59]:
pred = np.asarray(cluster)

### Calculate Performance

In [60]:
from sklearn.metrics import adjusted_rand_score
from sklearn.metrics import normalized_mutual_info_score
print("ARI = " + str(adjusted_rand_score(orig_label, pred)))
print("NMI = " + str(normalized_mutual_info_score(orig_label, pred)))

ARI = 0.2693514485027569
NMI = 0.6418333201314705


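In the above code block, we calculate the `Adjusted Rand Index` (ARI) and the `Normalized Mutual Information` (NMI) between the `original` and the `predicted` labels of the data points. Both scores are invariant to a permutation of the cluster labels, so the predicted cluster numbers do not need to match the original label values.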
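To show what the Adjusted Rand Index measures, here is a from-scratch sketch based on the standard pair-counting formula over the contingency table of the two labelings. It is meant as an illustration and not as a replacement for `adjusted_rand_score`; the function name `adjusted_rand_index` and the toy labels are assumptions.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index from pair counts (illustrative implementation)."""
    n = len(labels_true)
    if n < 2:
        return 1.0
    # Pairs of points that share a cell of the contingency table, a true class, or a cluster.
    sum_pairs = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    sum_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_true * sum_pred / comb(n, 2)   # agreement expected by chance
    max_index = 0.5 * (sum_true + sum_pred)
    if max_index == expected:
        return 1.0
    return (sum_pairs - expected) / (max_index - expected)

truth = [0, 0, 1, 1, 2, 2]
relabelled = [2, 2, 0, 0, 1, 1]   # same partition, different label names
ari_relabelled = adjusted_rand_index(truth, relabelled)
print(ari_relabelled)
```

The relabelled partition scores as a perfect match because only the grouping of the points matters, not the label values.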