# Sparse Subspace Clustering with Entropy-Norm (ICML 2020) on Yale Dataset

#### References:


1.   Main Paper (SSC Entropy-Norm ICML 2020): https://proceedings.icml.cc/static/paper_files/icml/2020/1982-Paper.pdf
2.   Spectral Clustering: http://people.csail.mit.edu/dsontag/courses/ml14/notes/Luxburg07_tutorial_spectral_clustering.pdf
3.   Spectral Clustering Code: https://juanitorduz.github.io/spectral_clustering/
4.   Yale Dataset: http://vision.ucsd.edu/content/yale-face-database

Given a set of points drawn from a `union of subspaces`, our objective is to find the number of subspaces, their dimensions, a basis for each subspace, and the segmentation of the data.

In [40]:
import numpy as np
from scipy.sparse import identity

##### Set Parameters of the data space

In [41]:
N = 165 # Total no of points
D = 1024  # Dimension of space
K = 15 # Number of clusters

In [42]:
yale_data = np.load('yale_X.npy')
input_data = yale_data.astype('float32') / 255.
input_data = input_data.T
orig_label = np.load('yale_Y.npy')

The matrix *input_data* contains the data points arranged as columns, forming a matrix of shape $(D, N) = (1024, 165)$. It has the form $[y_0, y_1, \ldots, y_{N-1}]$, where each $y_i$ is the vector of a data point of dimension $D$.

Up to a reordering of its columns, *input_data* can also be viewed as $[Y_0, Y_1, \ldots, Y_{K-1}]$, where each $Y_i$ denotes the set of $N_i$ data points belonging to subspace $i$. We further assume that subspace $i$ has dimension $d_i$ and basis $A_i$.
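To make the union-of-subspaces model concrete, here is a minimal sketch that generates synthetic data of the form described above. It is not part of the Yale experiment: the function name `sample_union_of_subspaces`, the toy sizes, and the choice of random Gaussian coefficients are all assumptions made for illustration.

```python
import numpy as np

def sample_union_of_subspaces(D, dims, counts, seed=0):
    """Draw points from a union of random linear subspaces of R^D.

    dims[i] is the dimension d_i of subspace i and counts[i] is N_i.
    Returns the (D, sum(counts)) data matrix, the labels, and the bases.
    """
    rng = np.random.default_rng(seed)
    blocks, labels, bases = [], [], []
    for i, (d, n) in enumerate(zip(dims, counts)):
        # Orthonormal basis A_i of a random d-dimensional subspace, shape (D, d)
        A, _ = np.linalg.qr(rng.standard_normal((D, d)))
        # Y_i = A_i @ coefficients: n points lying in the span of A_i
        blocks.append(A @ rng.standard_normal((d, n)))
        labels.extend([i] * n)
        bases.append(A)
    return np.hstack(blocks), np.asarray(labels), bases

# Hypothetical toy problem: two subspaces of dimension 2 and 3 in R^20
Y_toy, labels_toy, bases_toy = sample_union_of_subspaces(D=20, dims=[2, 3], counts=[10, 12])
print(Y_toy.shape)  # (20, 22)
```

Each block of columns has rank $d_i$, which is what a subspace clustering method tries to exploit when the block structure is unknown.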

In [43]:
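def zij(Y,i,j,lam,N):
    # Affinity between points i and j: twice their Gaussian kernel value,
    # divided by the sum of the two points' kernel totals over all other points.
    if i==j:
        return 0.0
    else:
        numerator = 2 * np.exp(-(np.square(np.linalg.norm(Y[:,i]-Y[:,j],2)))/lam)
        sum_i=0
        sum_j=0
        # Kernel total of point i against every other point h != i
        for h in range(N):
            if h!=i:
                sum_i += np.exp(-(np.square(np.linalg.norm(Y[:,i]-Y[:,h],2)))/lam)
        # Kernel total of point j against every other point h != j
        for h in range(N):
            if h!=j:
                sum_j += np.exp(-(np.square(np.linalg.norm(Y[:,j]-Y[:,h],2)))/lam)
        return numerator/(sum_i+sum_j)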
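Written out, the function above computes the following quantity, with $\lambda$ standing for the argument `lam` and $y_i$ for column $i$ of `Y`:

$$
z_{ij} = \frac{2\exp\left(-\lVert y_i - y_j\rVert_2^2/\lambda\right)}{\sum_{h \neq i}\exp\left(-\lVert y_i - y_h\rVert_2^2/\lambda\right) + \sum_{h \neq j}\exp\left(-\lVert y_j - y_h\rVert_2^2/\lambda\right)} \quad (i \neq j), \qquad z_{ii} = 0.
$$

The expression is symmetric in $i$ and $j$, so $z_{ij} = z_{ji}$.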

In [44]:
Z =np.zeros((N,N), dtype='float64')
for i in range(N):
    for j in range(N):
        Z[i,j] = zij(input_data,i,j,995505,N)
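The double loop above calls `zij` for every pair, and each call loops over all points again, so the cost grows as $N^3$ kernel evaluations. A vectorized sketch that should give the same matrix up to floating-point round-off is shown below. The function name `affinity_matrix` is hypothetical, and the sketch assumes at least two data points.

```python
import numpy as np

def affinity_matrix(Y, lam):
    """Vectorized counterpart of filling Z[i, j] = zij(Y, i, j, lam, N)."""
    Y = np.asarray(Y, dtype=np.float64)
    # Squared Euclidean distances between all pairs of columns of Y
    sq_norms = np.sum(Y * Y, axis=0)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (Y.T @ Y)
    np.maximum(sq_dists, 0.0, out=sq_dists)  # clip tiny negatives from round-off
    kernel = np.exp(-sq_dists / lam)
    # Zeroing the diagonal drops the h == i terms from the sums and gives z_ii = 0
    np.fill_diagonal(kernel, 0.0)
    row_sums = kernel.sum(axis=1)
    return 2.0 * kernel / (row_sums[:, None] + row_sums[None, :])

# Toy usage on random data (not the Yale images)
rng = np.random.default_rng(0)
Z_demo = affinity_matrix(rng.standard_normal((5, 7)), lam=3.0)
print(Z_demo.shape)  # (7, 7)
```

With this helper, the cell above would reduce to a single call such as `affinity_matrix(input_data, 995505)`.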

In [45]:
# Check sparsity by counting the number of zeros

cz = 0
for i in range(Z.shape[0]):
    for j in range(Z.shape[1]):
        if Z[i,j] < 1e-5 and Z[i,j] > -1e-5:
            cz += 1
print(cz)

165
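The count of $165$ equals $N$, the number of diagonal entries, which `zij` sets to exactly $0$. No off-diagonal entry of $Z$ therefore falls below the $10^{-5}$ threshold. The same count can be obtained without explicit loops; the helper name `count_near_zero` below is hypothetical.

```python
import numpy as np

def count_near_zero(M, tol=1e-5):
    """Count entries of M whose absolute value is below tol."""
    return int(np.count_nonzero(np.abs(M) < tol))

# Toy matrix: two exact zeros and one entry below the tolerance
M_toy = np.array([[0.0, 0.3], [2e-6, 0.0]])
print(count_near_zero(M_toy))  # 3
```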


In [46]:
LN = np.subtract(np.eye(N,N), Z)
print(LN.shape)

(165, 165)


This $LN$ is the normalized Laplacian matrix, defined as $LN = I - Z$. This definition is used because, as shown in the paper, $Z$ is a lower bound of the normalized Gaussian kernel. Next, we calculate the eigenvalues and eigenvectors of $LN$, which we will use for spectral clustering of the data points. $LN$ is a positive semi-definite matrix, which means that all of its eigenvalues are greater than or equal to $0$.
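The symmetry and positive semi-definiteness of $LN$ can be checked numerically. The sketch below does so on small random data rather than on the Yale images; the names `Y_toy` and `lam_toy` and their values are assumptions chosen for illustration, and `np.linalg.eigvalsh` is used because it is designed for symmetric matrices.

```python
import numpy as np

rng = np.random.default_rng(0)
Y_toy = rng.standard_normal((4, 12))   # hypothetical toy data: 12 points in R^4
lam_toy = 2.0                          # hypothetical kernel scale

# Same construction as zij, written with array operations
diff = Y_toy[:, :, None] - Y_toy[:, None, :]
kernel = np.exp(-np.sum(diff ** 2, axis=0) / lam_toy)
np.fill_diagonal(kernel, 0.0)
s = kernel.sum(axis=1)
Z_toy = 2.0 * kernel / (s[:, None] + s[None, :])
LN_toy = np.eye(12) - Z_toy

is_symmetric = np.allclose(LN_toy, LN_toy.T)
eigvals_toy = np.linalg.eigvalsh(LN_toy)   # real eigenvalues in ascending order
print(is_symmetric, eigvals_toy.min())
```

A smallest eigenvalue that is non-negative up to round-off is consistent with the claim; a clearly negative value would indicate a bug in the construction of $Z$.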

### Perform Spectral Clustering with Laplacian Matrix LN

In [47]:
from scipy import linalg

eigenvals, eigenvcts = linalg.eig(LN)

eigenvals = np.real(eigenvals)
eigenvcts = np.real(eigenvcts)

eig = eigenvals.reshape((N,1))

Sort the eigenvalues in ascending order.

In [48]:
eigenvals_sorted_indices = np.argsort(eigenvals)
eigenvals_sorted = eigenvals[eigenvals_sorted_indices]

In [49]:
indices = []
for i in range(0,K):
    ind = []
    print(eigenvals_sorted_indices[i])
    ind.append(eigenvals_sorted_indices[i])
    indices.append(np.asarray(ind))

0
1
2
3
4
5
6
7
9
8
10
11
12
13
14


In the above code, we find the indices of the eigenvectors corresponding to the $K$ smallest eigenvalues.
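Because $LN$ is symmetric, the same selection can be sketched more compactly with `np.linalg.eigh`, which returns real eigenvalues already sorted in ascending order. This swaps in a different solver from the `linalg.eig` call used in this notebook, so eigenvector signs may differ. The helper name `smallest_eigenpairs` and the toy matrix are hypothetical.

```python
import numpy as np

def smallest_eigenpairs(L, k):
    """Return the k smallest eigenvalues of symmetric L and their eigenvectors."""
    vals, vecs = np.linalg.eigh(L)   # ascending eigenvalues, orthonormal columns
    return vals[:k], vecs[:, :k]

# Toy symmetric matrix with known eigenvalues 0.1, 0.5, 2.0, 3.0
L_toy = np.diag([3.0, 0.5, 2.0, 0.1])
vals_toy, vecs_toy = smallest_eigenpairs(L_toy, k=2)
print(vals_toy)  # [0.1 0.5]
```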

In [50]:
indices = np.asarray(indices)

In [51]:
zero_eigenvals_index = np.array(indices)

Here, the indices are arranged in ascending order of their eigenvalues, and the sorted eigenvalues are stored in *eigenvals_sorted*.

In [52]:
eigenvals[zero_eigenvals_index]

array([[4.17958335e-10],
       [1.00605589e+00],
       [1.00607897e+00],
       [1.00608606e+00],
       [1.00608821e+00],
       [1.00609180e+00],
       [1.00609266e+00],
       [1.00609454e+00],
       [1.00609515e+00],
       [1.00609527e+00],
       [1.00609586e+00],
       [1.00609654e+00],
       [1.00609668e+00],
       [1.00609698e+00],
       [1.00609711e+00]])

In [53]:
import pandas as pd

proj_df = pd.DataFrame(eigenvcts[:, zero_eigenvals_index.squeeze()])
proj_df.columns = ['v_' + str(c) for c in proj_df.columns]
proj_df.head()

|   | v_0 | v_1 | v_2 | v_3 | v_4 | v_5 | v_6 | v_7 | v_8 | v_9 | v_10 | v_11 | v_12 | v_13 | v_14 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.077851 | -0.066929 | 0.020962 | -0.027023 | -0.017792 | -0.001221 | -0.025435 | 0.113326 | 0.005595 | 0.130606 | 0.008839 | 0.08033 | -0.005199 | -0.029553 | 0.011234 |
| 1 | 0.077849 | 0.116941 | -0.013711 | -0.048794 | 0.129095 | 0.028244 | 0.098699 | -0.10037 | 0.10342 | 0.147778 | -0.038298 | -0.029582 | -0.027265 | -0.030486 | 0.024688 |
| 2 | 0.077851 | -0.059607 | -0.009261 | -0.013277 | -0.095527 | 0.031575 | 0.025993 | -0.061884 | 0.046813 | 0.074288 | 0.103209 | 0.178596 | 0.104049 | 0.042352 | 0.090881 |
| 3 | 0.077851 | -0.027676 | -0.050662 | -0.065176 | -0.03267 | 0.085483 | 0.005743 | 0.05999 | 0.119113 | -0.078515 | -0.14599 | 0.064368 | 0.106585 | 0.044473 | 0.009736 |
| 4 | 0.077851 | -0.019896 | -0.065845 | 0.018125 | -0.040284 | -0.048256 | 0.001364 | -0.011957 | 0.086165 | -0.084236 | -0.002703 | 0.079187 | -0.027383 | -0.186439 | -0.069664 |


Stack the eigenvectors corresponding to the $K$ smallest eigenvalues in a dataframe *proj_df*. This can be thought of as an $N \times K$ matrix in which each column is an eigenvector and each row corresponds to a data point.

Apply *K-Means Clustering* with $K = 15$.

In [54]:
from sklearn.cluster import KMeans

def run_k_means(df, n_clusters):
    k_means = KMeans(random_state=25, n_clusters=n_clusters)
    k_means.fit(df)
    cluster = k_means.predict(df)
    return cluster

cluster = run_k_means(proj_df, n_clusters=K) +1

*run_k_means* applies `K-Means Clustering` on *proj_df* with number of clusters = $15$. The cluster assignment of the data points is returned in the variable *cluster*; adding $1$ shifts the labels to the range $1$ to $15$.

Display the clusters formed.

In [55]:
print(cluster)

[ 9  4  6 11 10  8  1 12 12  5 15 14 11 15  8  7 10  5  9  2  5  8  2  9
 13  7  3  6  9 13 14  4 13 11  6 10  4  1  1 10  1 12  2 10  4 13 13 13
 13  4 12  6  2 10  4  3  2 13 15  3 13  2 13 11  9 15  6  3 10 14  3  3
  7 12  9  5  9  6  3 10 11  5 12  3  1 12  9 10  9 15 10  7  4 13  9 14
  7  4  9  2  2 11 13  2  2  9 10  5  1 13  9 14  8 12 10 10 10  3 10  6
  7  9 15 12  4  7  2 15  4  5  1 12  2 10  5  9 15 13 14  4 14  1  7  2
  8  3  5 10  8 10  1  8  8  6  9 11  4  6 15  9  9 10 11 15  5]


As we can see, the data points have been assigned to fifteen clusters, labelled $1$ to $15$, corresponding to the fifteen subspaces that we have considered.

In [56]:
pred = np.asarray(cluster)

### Calculate Performance

In [57]:
from sklearn.metrics import adjusted_rand_score
from sklearn.metrics import normalized_mutual_info_score
print("ARI = " + str(adjusted_rand_score(orig_label, pred)))
print("NMI = " + str(normalized_mutual_info_score(orig_label, pred)))

ARI = 0.48996272510890554
NMI = 0.6988005129829699




In the above code block, we calculate the `adjusted Rand index` (ARI) and the `normalized mutual information` (NMI) score between the `original` and the `predicted` labels of the data points.
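Both scores compare groupings rather than label names, so the cluster numbers produced by K-Means do not need to match the numbering of the original labels. The toy label lists below are made up to illustrate this: the two lists describe the same partition under different names, and both scores should come out as $1$ up to floating-point rounding.

```python
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

true_toy = [0, 0, 1, 1, 2, 2]
pred_toy = [3, 3, 1, 1, 2, 2]   # same grouping, different label names

ari_toy = adjusted_rand_score(true_toy, pred_toy)
nmi_toy = normalized_mutual_info_score(true_toy, pred_toy)
print("ARI = " + str(ari_toy))
print("NMI = " + str(nmi_toy))
```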