# Sparse Subspace Clustering with Entropy-Norm (ICML 2020) on Wine Dataset

#### References:


1.   Main Paper (SSC Entropy-Norm ICML 2020): https://proceedings.icml.cc/static/paper_files/icml/2020/1982-Paper.pdf
2.   Spectral Clustering: http://people.csail.mit.edu/dsontag/courses/ml14/notes/Luxburg07_tutorial_spectral_clustering.pdf
3.   Spectral Clustering Code: https://juanitorduz.github.io/spectral_clustering/
4.   Wine dataset: From Scikit-learn

Our objective: given a set of points drawn from a `union of subspaces`, find the number of subspaces, their dimensions, a basis for each subspace, and the segmentation of the data.

In [1]:
import numpy as np
import cvxpy as cp
from cvxpy.atoms.elementwise.power import power

##### Set Parameters of the data space

In [2]:
N = 178 # Total no of points
D = 13  # Dimension of space
K = 3 # Number of clusters

In [3]:
from sklearn.datasets import load_wine
wine_data = load_wine()
wine_X = np.asarray(wine_data.data)
input_data = wine_X.T

The matrix *input_data* stores the data points as columns, giving a matrix of shape $(D, N) = (13, 178)$. It has the form $[y_0, y_1, \ldots, y_{N-1}]$, where each $y_i$ is the vector of a data point of dimension $D$.

Also, *input_data* can be assumed to be of the form $[Y_0, Y_1, \ldots, Y_n]$, where each $Y_i$ denotes the set of $N_i$ data points belonging to subspace $i$. Also assume that the dimension of each subspace $i$ is $d_i$ and $A_i$ is its basis.
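To make this notation concrete, the following sketch draws synthetic points from a union of random linear subspaces. The function name `sample_union_of_subspaces`, the subspace dimensions, and the point counts are illustrative assumptions; the Wine data is of course not generated this way.

```python
import numpy as np

def sample_union_of_subspaces(D, dims, counts, seed=0):
    """Draw points from a union of random linear subspaces of R^D.

    dims[i]   -> d_i, the dimension of subspace i
    counts[i] -> N_i, the number of points drawn from subspace i
    Returns Y of shape (D, sum(counts)) and the integer subspace labels.
    """
    rng = np.random.default_rng(seed)
    blocks, labels = [], []
    for i, (d_i, n_i) in enumerate(zip(dims, counts)):
        # A_i: orthonormal basis (D x d_i) from the QR factorization of a Gaussian matrix
        A_i, _ = np.linalg.qr(rng.standard_normal((D, d_i)))
        # Y_i = A_i @ coefficients, so every column lies in span(A_i)
        Y_i = A_i @ rng.standard_normal((d_i, n_i))
        blocks.append(Y_i)
        labels.extend([i] * n_i)
    return np.hstack(blocks), np.array(labels)

# Hypothetical sizes chosen only for illustration
Y_syn, labels_syn = sample_union_of_subspaces(D=13, dims=[2, 3, 4], counts=[20, 30, 40])
print(Y_syn.shape)
```

Each block $Y_i$ has rank $d_i$, which can be checked with `np.linalg.matrix_rank(Y_syn[:, labels_syn == i])`.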

In [4]:
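def zij(Y,i,j,lam,N):
    # Affinity between points i and j (columns of Y); lam is the kernel scale.
    if i==j:
        # No self-affinity: the diagonal of Z is zero
        return 0.0
    else:
        # Numerator: twice the exponential kernel of the Euclidean distance
        numerator = 2 * np.exp(-(np.linalg.norm(Y[:,i]-Y[:,j]))/lam)
        sum_i=0
        sum_j=0
        # Kernel mass of point i over all other points
        for h in range(N):
            if h!=i:
                sum_i += np.exp(-(np.linalg.norm(Y[:,i]-Y[:,h]))/lam)
        # Kernel mass of point j over all other points
        for h in range(N):
            if h!=j:
                sum_j += np.exp(-(np.linalg.norm(Y[:,j]-Y[:,h]))/lam)
        return numerator/(sum_i+sum_j)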

In [5]:
Z =np.zeros((N,N), dtype='float64')
for i in range(N):
    for j in range(N):
        Z[i,j] = zij(input_data,i,j,995505,N)
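The double loop above calls `zij` $N^2$ times, and each call itself loops over all $N$ points. As a minimal sketch, the same quantity can be computed with NumPy broadcasting. The function name `affinity_matrix` is hypothetical, and the sketch assumes at least two points so that every denominator is positive.

```python
import numpy as np

def affinity_matrix(Y, lam):
    """Vectorized version of the zij double loop.

    Y   : array of shape (D, N), one data point per column
    lam : kernel scale (same role as the `lam` argument of zij)
    """
    # Pairwise Euclidean distances between columns, shape (N, N)
    diff = Y[:, :, None] - Y[:, None, :]
    dist = np.linalg.norm(diff, axis=0)
    # Exponential kernel, with the h == i terms excluded from the sums
    E = np.exp(-dist / lam)
    np.fill_diagonal(E, 0.0)
    row_sums = E.sum(axis=1)
    # z_ij = 2 * e_ij / (sum_i + sum_j); the diagonal stays 0 because e_ii = 0
    return 2.0 * E / (row_sums[:, None] + row_sums[None, :])

# Small random example (illustrative data, not the Wine dataset)
rng = np.random.default_rng(0)
Y_small = rng.standard_normal((3, 6))
Z_small = affinity_matrix(Y_small, lam=2.0)
print(Z_small.shape)
```

On the notebook's data, `affinity_matrix(input_data, 995505)` should agree with the loop-built `Z` up to floating-point error.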

In [6]:
# Check sparsity by counting the number of zeros

cz = 0
for i in range(Z.shape[0]):
    for j in range(Z.shape[1]):
        if Z[i,j] < 1e-5 and Z[i,j] > -1e-5:
            cz += 1
print(cz)

178


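In the above code block, we count the entries of $Z$ whose absolute value is below $10^{-5}$. The count is $178 = N$, which is exactly the number of diagonal entries (these are $0$ by construction), so no off-diagonal entry is near zero for this choice of parameter.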
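The same count can be written without explicit loops. This is a minimal sketch; the helper name `count_near_zero` and the toy matrix are illustrative assumptions.

```python
import numpy as np

def count_near_zero(M, tol=1e-5):
    """Count the entries of M whose absolute value is strictly below tol."""
    return int(np.count_nonzero(np.abs(M) < tol))

# Toy matrix: two entries are below the tolerance in absolute value
M_toy = np.array([[0.0, 1e-6],
                  [0.5, -2e-5]])
print(count_near_zero(M_toy))  # 2
```

Applied to the notebook's matrix, `count_near_zero(Z)` uses the same threshold as the loop above.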

In [7]:
LN = np.subtract(np.eye(N,N),Z)
print(LN.shape)

(178, 178)


This $LN$ is the normalized Laplacian matrix, defined as $LN = I - Z$. This definition is used because, according to the paper, $Z$ is a lower bound of the normal Gaussian kernel. Next, we calculate the `eigenvalues` and `eigenvectors` of $LN$, which we use for spectral clustering of the data points. $LN$ is a *positive semi-definite matrix*, which means all of its eigenvalues are greater than or equal to $0$.

### Perform Spectral Clustering with Normalized Laplacian Matrix LN

In [8]:
from scipy import linalg

eigenvals, eigenvcts = linalg.eig(LN)

eigenvals = np.real(eigenvals)
eigenvcts = np.real(eigenvcts)

eig = eigenvals.reshape((N,1))
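Because `zij` is symmetric in `i` and `j`, $LN$ is a symmetric matrix. For symmetric input, `scipy.linalg.eigh` returns real eigenvalues in ascending order, so no `np.real` call or separate sort is needed. The sketch below uses a small toy affinity (`Z_toy`, illustrative values only); it is an alternative to the cell above, not what the notebook ran.

```python
import numpy as np
from scipy import linalg

# Toy symmetric affinity with zero diagonal (illustrative values only)
Z_toy = np.array([[0.0, 0.4, 0.1],
                  [0.4, 0.0, 0.2],
                  [0.1, 0.2, 0.0]])
LN_toy = np.eye(3) - Z_toy

# eigh assumes a symmetric (Hermitian) matrix, so check that first
is_symmetric = np.allclose(LN_toy, LN_toy.T)

# Real eigenvalues in ascending order; eigenvectors are the columns of vecs_toy
vals_toy, vecs_toy = linalg.eigh(LN_toy)
print(is_symmetric, vals_toy)
```

Note that eigenvectors are only defined up to sign, so columns from `eigh` may differ in sign from those returned by `eig`.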

Sort the eigenvalues in ascending order

In [9]:
eigenvals_sorted_indices = np.argsort(eigenvals)
eigenvals_sorted = eigenvals[eigenvals_sorted_indices]

In [10]:
indices = []
for i in range(0,K):
    ind = []
    print(eigenvals_sorted_indices[i])
    ind.append(eigenvals_sorted_indices[i])
    indices.append(np.asarray(ind))

0
1
2


In the above code, we find the indices of the eigenvectors corresponding to the $K$ smallest eigenvalues.
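The selection can also be expressed as a single slice of the `argsort` result. This is a minimal sketch; the helper name `smallest_k_eigvecs` and the toy inputs are hypothetical.

```python
import numpy as np

def smallest_k_eigvecs(eigenvals, eigenvcts, k):
    """Return the k smallest eigenvalues and their eigenvectors (as columns)."""
    order = np.argsort(eigenvals)[:k]   # indices of the k smallest eigenvalues
    return eigenvals[order], eigenvcts[:, order]

# Toy input: eigenvalues out of order, identity matrix as "eigenvectors"
vals_in = np.array([3.0, 1.0, 2.0])
vecs_in = np.eye(3)
vals_k, vecs_k = smallest_k_eigvecs(vals_in, vecs_in, k=2)
print(vals_k)        # [1. 2.]
print(vecs_k.shape)  # (3, 2)
```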

In [11]:
indices = np.asarray(indices)

In [12]:
zero_eigenvals_index = np.array(indices)

Here, the indices are arranged in ascending order of their eigenvalues, and the sorted eigenvalues are stored in *eigenvals_sorted*.

In [13]:
eigenvals[zero_eigenvals_index]

array([[3.76390076e-09],
       [1.00544603e+00],
       [1.00559986e+00]])

In [14]:
import pandas as pd

proj_df = pd.DataFrame(eigenvcts[:, zero_eigenvals_index.squeeze()])
proj_df.columns = ['v_' + str(c) for c in proj_df.columns]
proj_df.head()

|   | v_0 | v_1 | v_2 |
|---|-----|-----|-----|
| 0 | -0.074951 | -0.093372 | 0.007408 |
| 1 | -0.074952 | -0.090278 | 0.014735 |
| 2 | -0.074948 | -0.115551 | -0.049847 |
| 3 | -0.074939 | -0.138335 | -0.121612 |
| 4 | -0.074957 | 0.002295 | 0.103711 |


Stack the eigenvectors corresponding to the $K$ smallest eigenvalues in a dataframe *proj_df*. This can be thought of as an $N \times K$ matrix where each column is an eigenvector and each row corresponds to a data point.

Apply *K-Means Clustering* with $K = 3$.

In [15]:
from sklearn.cluster import KMeans

def run_k_means(df, n_clusters):
    k_means = KMeans(random_state=25, n_clusters=n_clusters)
    k_means.fit(df)
    cluster = k_means.predict(df)
    return cluster

cluster = run_k_means(proj_df, n_clusters=K)

*run_k_means* applies `K-Means Clustering` on *proj_df* with number of clusters = $3$. The cluster label of each data point is returned in the variable *cluster*.
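As a small illustration of this step, the sketch below runs K-Means on a toy embedding with three well-separated groups. The data, the variable names, and the explicit `n_init=10` are assumptions made for the example; `fit_predict` fits the model and returns the labels in one call.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy 2-D embedding: three well-separated groups of 10 points each
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X_toy = np.vstack([c + 0.1 * rng.standard_normal((10, 2)) for c in centers])

# n_init is set explicitly so the behaviour does not depend on the library default
k_means_toy = KMeans(n_clusters=3, n_init=10, random_state=25)
labels_toy = k_means_toy.fit_predict(X_toy)
print(labels_toy)
```

The integer assigned to each group is arbitrary; only the grouping itself is meaningful.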

Display clusters formed

In [16]:
print(cluster)

[0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 2 2 0 0 2 2 0 0 2 0 0 0 2 0 0 2 2
 0 0 2 2 0 0 2 2 0 0 2 0 0 0 0 0 0 0 0 2 0 0 1 2 1 2 1 1 2 1 1 2 2 2 1 1 2
 2 1 1 1 2 1 1 2 2 1 1 1 1 1 2 2 1 1 1 1 1 2 2 1 2 1 2 1 1 1 2 1 1 1 1 2 1
 1 2 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 2 2 1 1 2 2 2 2 1 2 2 2 2 1 1 2 2 1 2
 2 1 1 1 1 2 2 2 1 2 2 2 1 2 2 2 2 1 2 2 2 2 1 1 2 2 2 2 2 1]


As we can see, the data points have been clustered into three groups labelled $0$, $1$ and $2$, corresponding to the $K = 3$ subspaces that we have considered.

In [17]:
c0 = 0
for l in cluster:
    if l == 0:
        c0 += 1
print(c0)

43


In [18]:
c1 = 0
for l in cluster:
    if l == 1:
        c1 += 1
print(c1)

65


In [19]:
c2 = 0
for l in cluster:
    if l == 2:
        c2 += 1
print(c2)

70


So we find that $43$ data points have been labelled $0$, signifying that they belong to the first subspace, while $65$ have been labelled $1$ and the remaining $70$ points have been labelled $2$.
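The three counting loops can be replaced by a single call to `np.bincount`. This is a minimal sketch on a toy label vector (`labels_demo` is illustrative, not the notebook's `cluster` array).

```python
import numpy as np

# Toy label vector with three clusters
labels_demo = np.array([0, 0, 2, 1, 2, 2])

# sizes[k] is the number of points labelled k; minlength keeps empty clusters visible
sizes = np.bincount(labels_demo, minlength=3)
print(sizes)  # [2 1 3]
```

In the notebook, `np.bincount(cluster, minlength=K)` would give all three cluster sizes at once.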

In [20]:
orig = wine_data.target

In [21]:
pred = np.asarray(cluster)

### Calculate Performance

In [22]:
from sklearn.metrics import adjusted_rand_score
from sklearn.metrics import normalized_mutual_info_score
print("ARI = " + str(adjusted_rand_score(orig, pred)))
print("NMI = " + str(normalized_mutual_info_score(orig, pred)))

ARI = 0.3454261501931663
NMI = 0.43048380210683745


In the above code block, we calculate the `adjusted Rand index (ARI)` and the `normalized mutual information (NMI)` score between the `original` and the `predicted` labels of the data points.
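Both scores compare partitions rather than label values, so they are unaffected by the arbitrary numbering of the clusters. The sketch below shows this on toy labels (`true_toy` and `pred_toy` are illustrative) and also builds a contingency table, which shows how the true classes are spread across the predicted clusters.

```python
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score
from sklearn.metrics.cluster import contingency_matrix

true_toy = [0, 0, 1, 1, 2, 2]
pred_toy = [2, 2, 0, 0, 1, 1]   # same partition, cluster ids permuted

# Identical partitions give a perfect score despite the different ids
ari_toy = adjusted_rand_score(true_toy, pred_toy)
nmi_toy = normalized_mutual_info_score(true_toy, pred_toy)

# Rows: true classes, columns: predicted clusters
table_toy = contingency_matrix(true_toy, pred_toy)
print(ari_toy, nmi_toy)
print(table_toy)
```

For the notebook's results, `contingency_matrix(orig, pred)` would show which Wine classes are being mixed by the clustering.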