# Demo: Clustering STL-10 data with FINCH

This notebook demonstrates a common use case: clustering a dataset with FINCH.

In [28]:
import numpy as np
import h5py
from finch import FINCH

# Load STL-10 dataset

We load the STL-10 data, which contains ResNet-50 features for 13000 samples. The data is included in the GitHub repo in the data folder. Since the data is stored in .mat format, we use h5py to load it as a NumPy array.

In [18]:
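# Read the features and transpose them to (n_samples, n_features).
with h5py.File('../data/STL-10/data.mat', 'r') as f:
    data = np.array(f.get('data')).T
# Read the ground-truth labels and drop the singleton dimension.
with h5py.File('../data/STL-10/labels.mat', 'r') as f:
    gt = np.squeeze(np.array(f.get('labels')))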
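
The transpose in the loading cell is needed because MATLAB stores arrays column by column, so a row-major reader such as h5py sees the dimensions reversed. The sketch below reproduces that effect with NumPy alone on a toy matrix. It assumes the files are MATLAB v7.3 (HDF5) files, which is what makes them readable by h5py; the names `matlab_view`, `as_read` and `restored` are illustrative only.

```python
import numpy as np

# Toy (n_samples=2, n_features=3) matrix as MATLAB would hold it.
matlab_view = np.array([[1, 2, 3],
                        [4, 5, 6]])

# MATLAB writes the values column by column (column-major order).
raw = matlab_view.tobytes(order='F')

# A row-major reader interprets the same bytes with the dimensions reversed.
as_read = np.frombuffer(raw, dtype=matlab_view.dtype).reshape(3, 2)

# Transposing restores the (n_samples, n_features) layout.
restored = as_read.T
print(as_read.shape, restored.shape)
```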

In [19]:
data.shape


(13000, 2048)

# Cluster with FINCH

In [20]:
c, num_clust, req_c = FINCH(data)

Partition 0: 2061 clusters
Partition 1: 177 clusters
Partition 2: 37 clusters
Partition 3: 10 clusters
Partition 4: 2 clusters


In [21]:
print(c.shape)
print(num_clust)
print(req_c)

(13000, 5)
[2061, 177, 37, 10, 2]
None


FINCH returns the cluster labels for each partition in the variable 'c', an array of size (N x numPartitions), e.g. (13000, 5) in this case. Each column of c holds the cluster labels for one partition.

num_clust lists how many clusters were produced in each partition (step) of the run. As shown above, num_clust = [2061, 177, 37, 10, 2], meaning FINCH found 2061 clusters in partition 0, 177 in partition 1, and so on down to 10 clusters in partition 3 and 2 clusters in partition 4. The cluster labels for any of these partitions can be read from the corresponding column of the returned array 'c'.
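
As a minimal sketch of how a partition might be picked programmatically, the helper below returns the column of `c` whose cluster count is closest to a target. `pick_partition` is a hypothetical helper, not part of the finch package, and `c_toy` / `num_clust_toy` are small stand-ins for the real FINCH output.

```python
import numpy as np

def pick_partition(c, num_clust, k):
    """Return (index, labels) of the partition whose cluster count is closest to k."""
    idx = int(np.argmin(np.abs(np.asarray(num_clust) - k)))
    return idx, c[:, idx]

# Toy stand-in for FINCH output: 6 samples, 3 partitions with 4, 2 and 1 clusters.
c_toy = np.array([
    [0, 0, 0],
    [0, 0, 0],
    [1, 0, 0],
    [2, 1, 0],
    [3, 1, 0],
    [3, 1, 0],
])
num_clust_toy = [4, 2, 1]

idx, labels = pick_partition(c_toy, num_clust_toy, k=2)
print(idx, labels)
```

With the values from this notebook, `pick_partition(c, num_clust, k=10)` would select column 3, the 10-cluster partition.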

## Evaluate quality of clustering

For example, the fourth partition, c[:, 3], provides labels for a 10-cluster result, matching the 10 classes of STL-10, so we evaluate its quality against the ground-truth labels.

In [22]:
from sklearn.metrics import normalized_mutual_info_score as nmi
score = nmi(gt, c[:, 3])
print('NMI Score: {:.2f}'.format(score * 100))

NMI Score: 85.04
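
To make the score easier to interpret, here is a rough NumPy sketch of what normalized mutual information computes: the mutual information between the two labelings divided by the mean of their entropies. `nmi_sketch` is an illustrative re-implementation, not the scikit-learn code, and the arithmetic-mean normalisation is assumed to match the default of recent scikit-learn versions.

```python
import numpy as np

def nmi_sketch(labels_true, labels_pred):
    """NMI with arithmetic-mean normalisation: 2 * I(U; V) / (H(U) + H(V))."""
    # Map arbitrary label values to contiguous indices 0..k-1.
    _, u = np.unique(np.asarray(labels_true).ravel(), return_inverse=True)
    _, v = np.unique(np.asarray(labels_pred).ravel(), return_inverse=True)
    n = u.size

    # Contingency table: how often true cluster i co-occurs with predicted cluster j.
    cont = np.zeros((u.max() + 1, v.max() + 1))
    np.add.at(cont, (u, v), 1)

    pxy = cont / n            # joint distribution
    px = pxy.sum(axis=1)      # marginal of the true labels
    py = pxy.sum(axis=0)      # marginal of the predicted labels

    nz = pxy > 0              # skip empty cells to avoid log(0)
    mi = np.sum(pxy[nz] * np.log(pxy[nz] / np.outer(px, py)[nz]))
    hx = -np.sum(px * np.log(px))
    hy = -np.sum(py * np.log(py))

    if hx + hy == 0:          # both labelings consist of a single cluster
        return 1.0
    return 2 * mi / (hx + hy)

# NMI ignores how clusters are numbered: a relabelled copy is a perfect match.
same_up_to_renaming = nmi_sketch([0, 0, 1, 1, 2, 2], [2, 2, 0, 0, 1, 1])
# Two labelings that carry no information about each other score zero.
independent = nmi_sketch([0, 0, 1, 1], [0, 1, 0, 1])
print(same_up_to_renaming, independent)
```

Because the score is invariant to cluster numbering, the FINCH labels in c[:, 3] can be compared with gt directly, without first matching cluster ids to class ids.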


### Required number of clusters

In many applications we need to cluster a dataset into a specified number of clusters. FINCH supports this through the input parameter req_clust. For example, let's cluster the dataset so that we also get a 15-cluster partition.

In [23]:
c, num_clust, req_c = FINCH(data, req_clust=15, verbose=False)

Here the variable req_c will contain the labels of the 15 clusters.

In [24]:
print(req_c.shape)
print(req_c)
print(np.unique(req_c))

(13000,)
[0 1 1 ... 6 6 6]
[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14]
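
A quick sanity check on such a result is to confirm that the labels cover exactly the requested number of clusters and to look at the cluster sizes. The sketch below does this on a synthetic stand-in; `summarize_labels` and `req_c_toy` are hypothetical names used for illustration, and with real output `req_c` would be passed instead.

```python
import numpy as np

def summarize_labels(labels, expected_k):
    """Return per-cluster sizes after checking that labels are exactly 0..expected_k-1."""
    ids, sizes = np.unique(labels, return_counts=True)
    if not np.array_equal(ids, np.arange(expected_k)):
        raise ValueError(f'expected labels 0..{expected_k - 1}, got {ids}')
    return sizes

# Synthetic stand-in for req_c: 13000 samples spread over 15 labels.
req_c_toy = np.arange(13000) % 15

sizes = summarize_labels(req_c_toy, expected_k=15)
print(sizes.sum(), sizes.min(), sizes.max())
```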


### Run FINCH on large data using approximate nearest neighbour

On large-scale data, computing nearest neighbours with exact distances becomes impractical; in that case an approximate nearest neighbour method such as PyNNDescent can be used. This is also useful on low-compute machines with little memory.

We can specify the number of samples above which FINCH uses PyNNDescent to compute nearest-neighbour distances. Since our data has 13000 samples, we force it to use PyNNDescent by setting the use_ann_above_samples parameter to 10000, i.e. any data with more than 10000 samples is processed by computing distances with PyNNDescent.
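
To give a feel for why exact distances stop scaling, the sketch below finds first neighbours by brute force on a toy array and estimates the memory a dense pairwise distance matrix would need. It is an illustration under stated assumptions, not FINCH's internal code: both function names are hypothetical, and the cosine distance and float32 storage are assumed for the example.

```python
import numpy as np

def dense_distance_matrix_gib(n_samples, bytes_per_value=4):
    """Memory in GiB for a dense n x n distance matrix (float32 assumed by default)."""
    return n_samples ** 2 * bytes_per_value / 2 ** 30

def exact_first_neighbours(x):
    """Index of each sample's nearest other sample under cosine distance (brute force)."""
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    dist = 1.0 - x @ x.T            # full n x n matrix: this is the costly part
    np.fill_diagonal(dist, np.inf)  # a sample must not pick itself
    return dist.argmin(axis=1)

x_toy = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
first_nn = exact_first_neighbours(x_toy)
print(first_nn)

for n in (13_000, 1_000_000):
    print(n, round(dense_distance_matrix_gib(n), 2), 'GiB')
```

The quadratic growth is the point: the matrix for 13000 samples fits in well under 1 GiB, while one million samples would need several thousand GiB, which is where an approximate method becomes necessary.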

In [25]:
c, num_clust, req_c = FINCH(data, req_clust=None, use_ann_above_samples=10000, verbose=True)

Using PyNNDescent to compute 1st-neighbours at this step ...
Step PyNNDescent done ...
Partition 0: 2079 clusters
Partition 1: 192 clusters
Partition 2: 35 clusters
Partition 3: 11 clusters
Partition 4: 2 clusters
