# Sparse Subspace Clustering with Entropy-Norm (ICML 2020) on Iris Dataset

#### References:


1.   Main Paper (SSC Entropy-Norm ICML 2020): https://proceedings.icml.cc/static/paper_files/icml/2020/1982-Paper.pdf
2.   Spectral Clustering: http://people.csail.mit.edu/dsontag/courses/ml14/notes/Luxburg07_tutorial_spectral_clustering.pdf
3.   Spectral Clustering Code: https://juanitorduz.github.io/spectral_clustering/
4.   Iris Dataset: https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_iris.html

Our objective: given a set of points drawn from a `union of subspaces`, find the number of subspaces, their dimensions, a basis for each subspace, and the segmentation of the data.

In [33]:
import numpy as np
from sklearn.preprocessing import MinMaxScaler

##### Set Parameters of the data space

In [34]:
N = 150 # Total no of points
D = 4  # Dimension of space
K = 3 # Number of clusters

In [35]:
from sklearn.datasets import load_iris
iris_data = load_iris()
iris_X = np.asarray(iris_data.data)
scaler = MinMaxScaler()
input_data = scaler.fit_transform(iris_X)
input_data = input_data.T

In [36]:
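def zij(Y,i,j,lam,N):
    """Affinity between columns i and j of Y (shape D x N) with bandwidth lam."""
    if i==j:
        # A point has no affinity with itself
        return 0.0
    else:
        # Gaussian kernel between points i and j, scaled by 2
        numerator = 2 * np.exp(-(np.square(np.linalg.norm(Y[:,i]-Y[:,j],2)))/lam)
        sum_i=0
        sum_j=0
        # Total kernel weight between point i and every other point
        for h in range(N):
            if h!=i:
                sum_i += np.exp(-(np.square(np.linalg.norm(Y[:,i]-Y[:,h],2)))/lam)
        # Total kernel weight between point j and every other point
        for h in range(N):
            if h!=j:
                sum_j += np.exp(-(np.square(np.linalg.norm(Y[:,j]-Y[:,h],2)))/lam)
        return numerator/(sum_i+sum_j)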

In [37]:
Z =np.zeros((N,N), dtype='float64')
for i in range(N):
    for j in range(N):
        Z[i,j] = zij(input_data,i,j,0.0273,N)
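
The double loop above recomputes the same kernel sums for every pair of points. As a rough sketch of the same formula in vectorized form, the block below builds the whole matrix at once. The function name `entropy_affinity`, the toy data `Y_demo` and the bandwidth `0.5` are illustrative assumptions, not part of the notebook.

```python
import numpy as np

def entropy_affinity(Y, lam):
    """Vectorized form of zij: Y has shape (D, N), one point per column."""
    # Pairwise squared Euclidean distances via ||a||^2 + ||b||^2 - 2 a.b
    sq_norms = np.sum(Y * Y, axis=0)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (Y.T @ Y)
    np.maximum(sq_dists, 0.0, out=sq_dists)  # clip tiny negatives from round-off

    kernel = np.exp(-sq_dists / lam)
    np.fill_diagonal(kernel, 0.0)            # matches the i == j case returning 0

    # row_sums[i] is the sum over h != i of the kernel between i and h
    row_sums = kernel.sum(axis=1)
    return 2.0 * kernel / (row_sums[:, None] + row_sums[None, :])

# Toy data (assumed): 6 random points in 4 dimensions
rng = np.random.default_rng(0)
Y_demo = rng.random((4, 6))
Z_demo = entropy_affinity(Y_demo, lam=0.5)
print(Z_demo.shape)
```

With the notebook's variables, the equivalent call would be `entropy_affinity(input_data, 0.0273)`. For a very small `lam`, every kernel value in a row can underflow to zero, so a guard on the denominator may be needed.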

In [38]:
# Check sparsity by counting the number of zeros

cz = 0
for i in range(Z.shape[0]):
    for j in range(Z.shape[1]):
        if Z[i,j] < 1e-5 and Z[i,j] > -1e-5:
            cz += 1
print(cz)

12978
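
The count above means that 12978 of the $N^2 = 22500$ entries of $Z$, roughly 58%, are numerically zero. The same check can be sketched without explicit loops. The small matrix `Z_small` and the name `tol` are made up for illustration.

```python
import numpy as np

# Hypothetical 3 x 3 affinity matrix with a few near-zero entries
Z_small = np.array([[0.0,  0.4,   2e-6],
                    [0.4,  0.0,  -3e-7],
                    [2e-6, -3e-7, 0.0]])

tol = 1e-5  # same threshold as the loop: -tol < value < tol
near_zero = int(np.count_nonzero(np.abs(Z_small) < tol))
fraction = near_zero / Z_small.size

print(near_zero, fraction)  # 7 of the 9 entries are below the threshold
```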


In [39]:
LN = np.subtract(np.eye(N,N),Z)
print(LN.shape)

(150, 150)


This $LN$ is the normalized Laplacian matrix, which can be defined as $LN = I - Z$. This is justified because, according to the paper, $Z$ is a lower bound of the standard Gaussian kernel. Next, we calculate the `eigenvalues` and `eigenvectors` of $LN$, which we will use for spectral clustering of the data points. $LN$ is a *positive semi-definite matrix*, which means all of its eigenvalues are greater than or equal to $0$.
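
The two properties relied on here, symmetry and non-negative eigenvalues, can be checked numerically. The block below is a minimal sketch on toy data: the points, the bandwidth `lam` and every `*_toy` name are assumptions, and `Z_toy` is built with the same formula as `zij`.

```python
import numpy as np

rng = np.random.default_rng(0)
points = rng.random((4, 8))   # toy data (assumed): 8 points in 4 dimensions
lam = 0.5                     # assumed bandwidth, not the notebook's value

# Same affinity formula as zij, written with broadcasting
diff = points[:, :, None] - points[:, None, :]
kernel = np.exp(-np.sum(diff ** 2, axis=0) / lam)
np.fill_diagonal(kernel, 0.0)
s = kernel.sum(axis=1)
Z_toy = 2.0 * kernel / (s[:, None] + s[None, :])

LN_toy = np.eye(Z_toy.shape[0]) - Z_toy

is_symmetric = np.allclose(LN_toy, LN_toy.T)
# eigvalsh assumes a symmetric matrix and returns real eigenvalues, ascending
eigvals_toy = np.linalg.eigvalsh(LN_toy)

print(is_symmetric, eigvals_toy.min() >= -1e-10)
```

A small negative tolerance is used because round-off can push an eigenvalue that is exactly zero in theory slightly below zero.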

### Perform Spectral Clustering with Laplacian Matrix $LN$

In [40]:
from scipy import linalg

eigenvals, eigenvcts = linalg.eig(LN)

eigenvals = np.real(eigenvals)
eigenvcts = np.real(eigenvcts)

eig = eigenvals.reshape((N,1))

Sort the eigenvalues in ascending order.

In [41]:
eigenvals_sorted_indices = np.argsort(eigenvals)
eigenvals_sorted = eigenvals[eigenvals_sorted_indices]

In [42]:
indices = []
for i in range(0,K):
    ind = []
    print(eigenvals_sorted_indices[i])
    ind.append(eigenvals_sorted_indices[i])
    indices.append(np.asarray(ind))

0
1
2


In the above code, we find the indices of the eigenvectors corresponding to the $K$ smallest eigenvalues.
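
The selection step can also be sketched more compactly with slicing. The block below uses `np.linalg.eigh` instead of the `scipy.linalg.eig` call in the notebook, on the assumption that the matrix is symmetric. `L_toy`, `K_toy` and `embedding` are hypothetical names, and the matrix is random toy data.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.random((6, 6))
L_toy = A + A.T               # hypothetical symmetric matrix standing in for LN
K_toy = 3

# eigh is meant for symmetric matrices: real eigenvalues, orthonormal eigenvectors
vals, vecs = np.linalg.eigh(L_toy)

# Indices of the K smallest eigenvalues (argsort sorts in ascending order)
order = np.argsort(vals)[:K_toy]

# Column k of `embedding` is the eigenvector for the k-th smallest eigenvalue
embedding = vecs[:, order]
print(embedding.shape)        # (6, 3): one row per point, one column per eigenvector
```

Because `eigh` already returns eigenvalues in ascending order, the `argsort` is redundant here. It is kept so the same two lines also work on the output of a general solver such as `eig`.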

In [43]:
indices = np.asarray(indices)

In [44]:
zero_eigenvals_index = np.array(indices)

Here, *zero_eigenvals_index* holds the indices of the $K$ smallest eigenvalues in ascending order of value, and the full list of sorted eigenvalues is stored in *eigenvals_sorted*.

In [45]:
eigenvals[zero_eigenvals_index]

array([[0.01646392],
       [0.02191162],
       [0.13631833]])

In [46]:
import pandas as pd

proj_df = pd.DataFrame(eigenvcts[:, zero_eigenvals_index.squeeze()])
proj_df.columns = ['v_' + str(c) for c in proj_df.columns]
proj_df.head()

|   | v_0 | v_1 | v_2 |
|---|-----|-----|-----|
| 0 | -0.171509 | 7.9e-05 | -7e-06 |
| 1 | -0.147316 | 6.6e-05 | -2e-06 |
| 2 | -0.164805 | 7.5e-05 | -5e-06 |
| 3 | -0.15322 | 6.9e-05 | -3e-06 |
| 4 | -0.163594 | 7.5e-05 | -7e-06 |


Stack the eigenvectors corresponding to the $K$ smallest eigenvalues in a dataframe *proj_df*. This can be thought of as an $N \times K$ matrix in which each column is an eigenvector and each row is a data point.

Apply *K-Means Clustering* with $K = 3$.

In [47]:
from sklearn.cluster import KMeans

def run_k_means(df, n_clusters):
    k_means = KMeans(random_state=25, n_clusters=n_clusters)
    k_means.fit(df)
    cluster = k_means.predict(df)
    return cluster

cluster = run_k_means(proj_df, n_clusters=K)

*run_k_means* applies `K-Means Clustering` to *proj_df* with the number of clusters set to $3$. The cluster assignment of each data point is returned in the variable *cluster*.

Display clusters formed

In [48]:
print(cluster)

[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 2 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 0 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 0 2 0 0 0 0 2 0 0 0 0
 0 0 2 0 0 0 0 0 2 0 2 0 2 0 0 2 2 0 0 0 0 0 2 2 0 0 0 2 0 0 0 2 0 0 0 2 0
 0 2]


As we can see, the data points have been clustered into three groups, labelled $0$, $1$ and $2$, corresponding to the $3$ subspaces that we have considered.

In [49]:
c0 = 0
for l in cluster:
    if l == 0:
        c0 += 1
print(c0)

37


In [50]:
c1 = 0
for l in cluster:
    if l == 1:
        c1 += 1
print(c1)

49


In [51]:
c2 = 0
for l in cluster:
    if l == 2:
        c2 += 1
print(c2)

64


$37$ data points have been assigned to cluster $0$, $49$ data points to cluster $1$ and $64$ data points to cluster $2$.
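
The three counting loops can be collapsed into a single call. This is a small sketch on made-up labels. `labels_toy` is an assumption, and in the notebook the argument would be `cluster`.

```python
import numpy as np

# Hypothetical cluster labels for nine points
labels_toy = np.array([1, 1, 2, 0, 2, 2, 1, 0, 2])

# sizes[k] is the number of points assigned to cluster k
sizes = np.bincount(labels_toy, minlength=3)
print(sizes)  # [2 3 4]
```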

In [52]:
orig = iris_data.target

In [53]:
pred = np.asarray(cluster)

In [54]:
from sklearn.metrics import adjusted_rand_score
from sklearn.metrics import normalized_mutual_info_score
print("ARI = " + str(adjusted_rand_score(orig, pred)))
print("NMI = " + str(normalized_mutual_info_score(orig, pred)))

ARI = 0.7223514206678999
NMI = 0.7466951615091428




In the above code block, we calculate the `Adjusted Rand Index` (ARI) and the `Normalized Mutual Information` (NMI) score between the `original` and the `predicted` labels of the data points.
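
Both scores compare partitions rather than label values, so it does not matter that K-Means numbers its clusters differently from `iris_data.target`. To illustrate this, the block below is a hand-rolled NumPy sketch of the standard pair-counting formula for ARI. It is not scikit-learn's implementation, and the helper names and toy label arrays are assumptions.

```python
import numpy as np

def contingency(true_labels, pred_labels):
    """Table whose (i, j) entry counts points in true class i and cluster j."""
    _, t_idx = np.unique(np.asarray(true_labels), return_inverse=True)
    _, p_idx = np.unique(np.asarray(pred_labels), return_inverse=True)
    table = np.zeros((t_idx.max() + 1, p_idx.max() + 1), dtype=np.int64)
    np.add.at(table, (t_idx, p_idx), 1)
    return table

def pairs(x):
    """Number of unordered pairs that can be drawn from x items."""
    return x * (x - 1) / 2.0

def adjusted_rand(true_labels, pred_labels):
    # Undefined (division by zero) when both partitions are a single cluster
    table = contingency(true_labels, pred_labels)
    sum_cells = pairs(table).sum()
    sum_rows = pairs(table.sum(axis=1)).sum()
    sum_cols = pairs(table.sum(axis=0)).sum()
    expected = sum_rows * sum_cols / pairs(table.sum())
    max_index = 0.5 * (sum_rows + sum_cols)
    return (sum_cells - expected) / (max_index - expected)

truth = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
relabelled = np.array([2, 2, 2, 0, 0, 0, 1, 1, 1])  # same partition, new names
one_error = np.array([2, 2, 2, 0, 0, 1, 1, 1, 1])   # one point moved

ari_same = adjusted_rand(truth, relabelled)
ari_error = adjusted_rand(truth, one_error)
print(ari_same)   # 1.0: renaming clusters does not change the score
print(ari_error)  # below 1.0
```

The contingency table itself is also a useful diagnostic, since it shows which true classes are being merged or split by the clustering.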