# Sparse Subspace Clustering with Entropy-Norm (ICML 2020) on Yale Dataset

#### References:


1.   Main Paper (SSC Entropy-Norm ICML 2020): https://proceedings.icml.cc/static/paper_files/icml/2020/1982-Paper.pdf
2.   Spectral Clustering: http://people.csail.mit.edu/dsontag/courses/ml14/notes/Luxburg07_tutorial_spectral_clustering.pdf
3.   Spectral Clustering Code: https://juanitorduz.github.io/spectral_clustering/
4.   Yale Dataset: http://vision.ucsd.edu/content/yale-face-database

Our objective: given a set of points drawn from a `union of subspaces`, find the number of subspaces, their dimensions, a basis for each subspace, and the segmentation of the data.

In [39]:
import numpy as np
from scipy.sparse import identity

##### Set Parameters of the data space

In [40]:
N = 165 # Total no of points
D = 1024  # Dimension of space
K = 15 # Number of clusters

In [41]:
yale_data = np.load('yale_X.npy')
input_data = yale_data.astype('float32') / 255.
input_data = input_data.T
orig_label = np.load('yale_Y.npy')

The matrix *input_data* stores the data points as columns, giving a matrix of shape $(D, N) = (1024, 165)$. It has the form $[y_0, y_1, \ldots, y_{N-1}]$, where each $y_i$ is the vector of a data point of dimension $D$.

Up to a reordering of its columns, *input_data* can also be viewed as $[Y_0, Y_1, \ldots, Y_{K-1}]$, where each $Y_i$ denotes the set of $N_i$ data points belonging to subspace $i$. We further assume that subspace $i$ has dimension $d_i$ and basis $A_i$.
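To make the union-of-subspaces model concrete, the following is a minimal sketch that generates synthetic data in the same $(D, N)$ column layout. The helper name `sample_union_of_subspaces`, the toy sizes, and the use of random Gaussian bases are assumptions made for illustration; they are not part of the Yale experiment.

```python
import numpy as np

def sample_union_of_subspaces(ambient_dim, subspace_dims, points_per_subspace, seed=0):
    """Return (Y, labels); Y has one data point per column (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    blocks, labels = [], []
    for idx, (d_i, n_i) in enumerate(zip(subspace_dims, points_per_subspace)):
        # Orthonormal basis A_i of a random d_i-dimensional subspace (reduced QR).
        A_i, _ = np.linalg.qr(rng.standard_normal((ambient_dim, d_i)))
        # Random coefficients give n_i points lying exactly in span(A_i).
        Y_i = A_i @ rng.standard_normal((d_i, n_i))
        blocks.append(Y_i)
        labels.extend([idx] * n_i)
    return np.hstack(blocks), np.asarray(labels)

# Two subspaces of dimensions 3 and 4 inside a 50-dimensional ambient space.
Y_toy, labels_toy = sample_union_of_subspaces(50, [3, 4], [20, 30])
print(Y_toy.shape)                                        # (50, 50)
print(np.linalg.matrix_rank(Y_toy[:, labels_toy == 0]))   # rank equals d_0 = 3
```

Each block $Y_i$ has rank $d_i$ even though it lives in a much larger ambient space, which is the structure subspace clustering tries to recover.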

In [42]:
def zij(Y,i,j,lam,N):
    """Affinity between data points i and j (columns of Y); lam scales the distances."""
    if i==j:
        return 0.0
    else:
        numerator = 2 * np.exp(-(np.linalg.norm(Y[:,i]-Y[:,j]))/lam)
        # Normalizers: total similarity of point i (resp. j) to every other point
        sum_i=0
        sum_j=0
        for h in range(N):
            if h!=i:
                sum_i += np.exp(-(np.linalg.norm(Y[:,i]-Y[:,h]))/lam)
        for h in range(N):
            if h!=j:
                sum_j += np.exp(-(np.linalg.norm(Y[:,j]-Y[:,h]))/lam)
        return numerator/(sum_i+sum_j)

In [43]:
Z =np.zeros((N,N), dtype='float64')
for i in range(N):
    for j in range(N):
        Z[i,j] = zij(input_data,i,j,995505,N)
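The double loop recomputes the same row sums for every pair, so it costs on the order of $N^3$ distance evaluations. The sketch below shows one possible vectorized equivalent. The function name `entropy_affinity` is hypothetical, and the cast to `float64` and the Gram-matrix distance formula are choices made here, so results should match the loop only up to floating-point error.

```python
import numpy as np

def entropy_affinity(Y, lam):
    """Vectorized sketch of Z[i, j] = zij(Y, i, j, lam, N) for all pairs (i, j)."""
    Y = np.asarray(Y, dtype=np.float64)
    # Pairwise Euclidean distances between columns via the Gram matrix.
    sq_norms = np.sum(Y * Y, axis=0)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (Y.T @ Y)
    dists = np.sqrt(np.maximum(sq_dists, 0.0))  # clip tiny negatives from round-off
    E = np.exp(-dists / lam)
    np.fill_diagonal(E, 0.0)                    # excludes h == i from each row sum
    row_sums = E.sum(axis=1)
    # Z_ij = 2 * E_ij / (sum_i + sum_j); the diagonal stays 0 because E_ii = 0.
    return 2.0 * E / (row_sums[:, None] + row_sums[None, :])

# Small random example (toy data, not the Yale images).
rng = np.random.default_rng(0)
Y_toy = rng.random((8, 6))
Z_toy = entropy_affinity(Y_toy, lam=1.0)
print(np.allclose(Z_toy, Z_toy.T), np.allclose(np.diag(Z_toy), 0.0))
```

Under these assumptions, calling `entropy_affinity(input_data, 995505)` should produce the same matrix as the loop above, up to rounding.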

In [44]:
# Column sums of Z give the degree of each data point.
# Note: D held the ambient dimension (1024) until now; from here on it is the degree matrix.
sum_list = Z.sum(axis=0)
D = np.diag(sum_list)
print(D)

[[1.00000061 0.         0.         ... 0.         0.         0.        ]
 [0.         0.99999952 0.         ... 0.         0.         0.        ]
 [0.         0.         1.00000066 ... 0.         0.         0.        ]
 ...
 [0.         0.         0.         ... 1.00000086 0.         0.        ]
 [0.         0.         0.         ... 0.         1.00000063 0.        ]
 [0.         0.         0.         ... 0.         0.         1.00000013]]


$D$ is the `degree matrix`. Here, $D \in \mathbb{R}^{N \times N}$ is a diagonal matrix with $D_{ii} = \sum_{j} W_{ij}$, where $W$ is the affinity matrix (the matrix *Z* in the code).
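As a small worked example of the degree matrix definition, the sketch below uses a hand-written $3 \times 3$ affinity matrix. The matrix `W_toy` is made up for illustration and is unrelated to the Yale data.

```python
import numpy as np

# Hypothetical symmetric affinity matrix with a zero diagonal.
W_toy = np.array([[0.0, 0.5, 0.2],
                  [0.5, 0.0, 0.3],
                  [0.2, 0.3, 0.0]])

degrees = W_toy.sum(axis=1)   # D_ii = sum_j W_ij
D_toy = np.diag(degrees)      # place the degrees on the diagonal
print(degrees)                # [0.7 0.8 0.5]
```

Because the affinity matrix is symmetric, summing over rows or over columns gives the same degrees.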

In [45]:
L = np.subtract(D, Z)
# Symmetric normalization: LN = D^{-1/2} L D^{-1/2}
D_inv_sqrt = np.diag(np.divide(1, np.sqrt(np.sum(D, axis=0) + np.finfo(float).eps)))
LN = D_inv_sqrt @ L @ D_inv_sqrt
print(LN.shape)

(165, 165)


This $L$ is the Laplacian matrix, defined as $L = D - W$. Next, we calculate the `eigenvalues` and `eigenvectors` of the normalized Laplacian $D^{-1/2} L D^{-1/2}$ (*LN* in the code), which we will use for spectral clustering of the data points. Both $L$ and the normalized Laplacian are *positive semi-definite* matrices, which means all of their eigenvalues are greater than or equal to $0$.
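The positive semi-definiteness claim can be checked numerically. The sketch below builds the symmetric normalized Laplacian for a random toy affinity matrix and inspects its eigenvalues, which for a symmetric, non-negative affinity should lie in $[0, 2]$. The helper `normalized_laplacian` and the toy matrix are assumptions for illustration only.

```python
import numpy as np

def normalized_laplacian(W):
    """D^{-1/2} (D - W) D^{-1/2} for a symmetric, non-negative affinity W."""
    degrees = W.sum(axis=1)
    inv_sqrt = 1.0 / np.sqrt(degrees + np.finfo(float).eps)
    L = np.diag(degrees) - W
    # Scaling rows and columns is equivalent to multiplying by diagonal matrices.
    return inv_sqrt[:, None] * L * inv_sqrt[None, :]

# Random symmetric toy affinity with a zero diagonal.
rng = np.random.default_rng(0)
A = rng.random((6, 6))
W_toy = (A + A.T) / 2
np.fill_diagonal(W_toy, 0.0)

L_sym = normalized_laplacian(W_toy)
eigvals = np.linalg.eigvalsh(L_sym)   # real eigenvalues, ascending order
print(eigvals.min() > -1e-10, eigvals.max() <= 2 + 1e-10)
```

The smallest eigenvalue is zero up to rounding, which is why a small tolerance is used instead of an exact comparison.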

### Perform Spectral Clustering with Laplacian Matrix LN

In [46]:
from scipy import linalg

eigenvals, eigenvcts = linalg.eig(LN)

eigenvals = np.real(eigenvals)
eigenvcts = np.real(eigenvcts)

eig = eigenvals.reshape((N,1))

Sort the eigenvalues in ascending order.

In [47]:
eigenvals_sorted_indices = np.argsort(eigenvals)
eigenvals_sorted = eigenvals[eigenvals_sorted_indices]

In [48]:
indices = []
for i in range(0,K):
    ind = []
    print(eigenvals_sorted_indices[i])
    ind.append(eigenvals_sorted_indices[i])
    indices.append(np.asarray(ind))

0
1
2
3
4
5
6
7
8
9
10
11
12
13
14


In the above code, we find the indices of the eigenvectors corresponding to the $K$ smallest eigenvalues.

In [49]:
indices = np.asarray(indices)

In [50]:
zero_eigenvals_index = np.array(indices)

Here, *zero_eigenvals_index* holds the indices of the $K$ smallest eigenvalues, arranged in ascending order of value; the full sorted list of eigenvalues is stored in *eigenvals_sorted*.
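For a symmetric matrix such as the normalized Laplacian, `scipy.linalg.eigh` is an alternative to `eig` followed by `argsort`: it returns real eigenvalues already in ascending order. The sketch below is not the routine used in this notebook; the helper `smallest_eigenvectors` and the toy graph are assumptions for illustration.

```python
import numpy as np
from scipy import linalg

def smallest_eigenvectors(L_sym, k):
    """Return (values, vectors) for the k smallest eigenvalues of a symmetric matrix."""
    # eigh assumes symmetry and returns eigenvalues in ascending order.
    vals, vecs = linalg.eigh(L_sym)
    return vals[:k], vecs[:, :k]

# Toy graph made of two disconnected pairs of points: {0, 1} and {2, 3}.
W_toy = np.array([[0.0, 1.0, 0.0, 0.0],
                  [1.0, 0.0, 0.0, 0.0],
                  [0.0, 0.0, 0.0, 1.0],
                  [0.0, 0.0, 1.0, 0.0]])
deg = W_toy.sum(axis=1)
L_sym_toy = np.eye(4) - W_toy / np.sqrt(np.outer(deg, deg))

vals_toy, vecs_toy = smallest_eigenvectors(L_sym_toy, k=2)
print(np.round(vals_toy, 6))   # both values are zero up to rounding
print(vecs_toy.shape)          # (4, 2)
```

The toy graph has two connected components, so the Laplacian has two (numerically) zero eigenvalues, one per component.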

In [51]:
eigenvals[zero_eigenvals_index]

array([[4.37659078e-17],
       [1.00609590e+00],
       [1.00609657e+00],
       [1.00609695e+00],
       [1.00609709e+00],
       [1.00609720e+00],
       [1.00609727e+00],
       [1.00609732e+00],
       [1.00609736e+00],
       [1.00609737e+00],
       [1.00609740e+00],
       [1.00609743e+00],
       [1.00609744e+00],
       [1.00609748e+00],
       [1.00609749e+00]])

In [52]:
import pandas as pd

proj_df = pd.DataFrame(eigenvcts[:, zero_eigenvals_index.squeeze()])
proj_df.columns = ['v_' + str(c) for c in proj_df.columns]
proj_df.head()

|   | v_0 | v_1 | v_2 | v_3 | v_4 | v_5 | v_6 | v_7 | v_8 | v_9 | v_10 | v_11 | v_12 | v_13 | v_14 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.07785 | 0.077773 | -0.021564 | -0.027693 | -0.007208 | -0.006871 | 0.041433 | -0.088549 | -0.020449 | 0.071363 | 0.048204 | 0.041359 | 0.087182 | -0.073864 | -0.131946 |
| 1 | 0.07785 | -0.128399 | -0.012117 | -0.046922 | 0.147033 | -0.063362 | -0.128516 | 0.014661 | 0.087995 | 0.178037 | -0.023953 | 0.046839 | -0.032116 | 0.015506 | -0.016208 |
| 2 | 0.07785 | 0.073276 | 0.005316 | -0.034263 | -0.091887 | -0.042795 | -0.065412 | 0.053112 | 0.057993 | 0.048013 | 0.078217 | -0.002621 | 0.231163 | 0.051965 | 0.047338 |
| 3 | 0.07785 | 0.038496 | 0.046542 | -0.07586 | -0.02109 | -0.087964 | 0.005962 | -0.087548 | 0.066138 | -0.071061 | -0.227177 | -0.061104 | 0.067953 | -0.004817 | 0.008373 |
| 4 | 0.07785 | 0.022744 | 0.072771 | -0.001014 | -0.039595 | 0.049833 | -0.014933 | -0.000818 | 0.082914 | -0.043903 | -0.042328 | 0.025725 | 0.027002 | 0.006483 | -0.160293 |


Stack the eigenvectors corresponding to the $K$ smallest eigenvalues in a dataframe *proj_df*. This can be thought of as an $N \times K$ matrix in which each column is an eigenvector and each row corresponds to a data point.

Apply *K-Means Clustering* with $K = 15$.

In [53]:
from sklearn.cluster import KMeans

def run_k_means(df, n_clusters):
    k_means = KMeans(random_state=25, n_clusters=n_clusters)
    k_means.fit(df)
    cluster = k_means.predict(df)
    return cluster

cluster = run_k_means(proj_df, n_clusters=K) +1

*run_k_means* applies `K-Means Clustering` on *proj_df* with number of clusters = $K = 15$. The cluster assignment of each data point is returned in the variable *'cluster'*; adding $1$ shifts the labels from $0$-$14$ to $1$-$15$.

Display clusters formed

In [54]:
print(cluster)

[ 2 13  9 14  2  4  3 10 10  5 12 11 14  2  4  6  3  5 15  1  5  4 10 15
  7  6  2  9 15  7 11 13  7 14  9  2 13  3  3  3  3 10  1  3 13  7  7  7
  7 13 10  9  1  3 13  8  2  7 12  3  7  1  7 14 15 12  9  3  2 11  2  3
  6 10 15  5  2  9  3  2 14  5 10  8  3 10  1  2  2  2  2  6  6  7  2 11
  6 13  2  1  1 14  7  1  1 15  2  5  3  7  1 11  4 10  3  2  3  8  2  9
  6  1 12 10 13  6  1  2 13  5  3 10  1  3  5 15 12  7 11  7 11  3  6  1
  4  8  5  3  4  3  3  4  4  9 15 14 13  9 12 15  2  3 14 12  5]


As we can see, the data points have been assigned to fifteen clusters, labelled $1$ to $15$, corresponding to the fifteen subspaces that we have considered.

In [55]:
pred = np.asarray(cluster)

### Calculate Performance

In [56]:
from sklearn.metrics import adjusted_rand_score
from sklearn.metrics import normalized_mutual_info_score
print("ARI = " + str(adjusted_rand_score(orig_label, pred)))
print("NMI = " + str(normalized_mutual_info_score(orig_label, pred)))

ARI = 0.4053919688640787
NMI = 0.6570342831476702




In the above code block, we calculate the `adjusted Rand index` (ARI) and the `normalized mutual information` (NMI) score between the `original` and the `predicted` labels of the data points.
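Both scores compare partitions rather than label values, so they do not require the predicted cluster numbers to match the original class numbers. The sketch below illustrates this on made-up toy labels (not the Yale labels): a pure relabelling of the true partition scores close to $1$, while a different grouping scores lower.

```python
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

# Toy labels for illustration only.
true_labels = [0, 0, 1, 1, 2, 2]
relabelled = [2, 2, 0, 0, 1, 1]   # same partition, different label names
shuffled = [0, 1, 0, 1, 2, 2]     # a genuinely different grouping

ari_same = adjusted_rand_score(true_labels, relabelled)
nmi_same = normalized_mutual_info_score(true_labels, relabelled)
ari_diff = adjusted_rand_score(true_labels, shuffled)
nmi_diff = normalized_mutual_info_score(true_labels, shuffled)

print("relabelled:", ari_same, nmi_same)   # both close to 1
print("shuffled:  ", ari_diff, nmi_diff)   # both strictly lower
```

This is why the predicted labels $1$ to $15$ can be compared with the original labels without first matching cluster numbers to class numbers.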