We use [RAPIDS](https://rapids.ai/) to cluster the faces in the train dataset. RAPIDS is a suite of packages developed and maintained by NVIDIA that runs its computations on the GPU for speed.

The faces were cropped using the PyTorch version of FaceNet. Each crop is 160x160 pixels. A sample of approximately 2000 images can be found in the following [dataset](https://www.kaggle.com/skylord/sample-face-crop).

Inspiration is taken :) from the following awesome notebooks: 

- @Bojan's MNIST 2-D t-sne with rapids: [Link](https://www.kaggle.com/tunguz/mnist-2d-t-sne-with-rapids)
- @Henrique's Proper clustering with facenet embeddings + PCA: [Link](https://www.kaggle.com/hmendonca/proper-clustering-with-facenet-embeddings-eda/)



So who is the fastest! 
![FastestSuperHero](https://www.kaggle.com/skylord/sample-face-crop#best-flash-super-hero-dc-laser-time.jpg)


In [1]:
%%time
# We add the RAPIDS Kaggle dataset: https://www.kaggle.com/cdeotte/rapids
# This installs the package offline. Installation takes under a minute!
import sys
!cp ../input/rapids/rapids.0.11.0 /opt/conda/envs/rapids.tar.gz
!cd /opt/conda/envs/ && tar -xzvf rapids.tar.gz
sys.path = ["/opt/conda/envs/rapids/lib"] + ["/opt/conda/envs/rapids/lib/python3.6"] + ["/opt/conda/envs/rapids/lib/python3.6/site-packages"] + sys.path
!cp /opt/conda/envs/rapids/lib/libxgboost.so /opt/conda/lib/



In [2]:
import cudf,cuml
import pandas as pd
import numpy as np
from cuml.manifold import TSNE
from cuml import PCA  
#from cuml.decomposition import PCA << this is also supported
from cuml.cluster import DBSCAN
#from cuml import DBSCAN << this is also supported
import matplotlib.pyplot as plt
%matplotlib inline

The scatter function below is defined but not used in this notebook. You could call it yourself for some interesting visualizations.

In [3]:
def scatter_thumbnails(data, images, zoom=0.12, colors=None):
    assert len(data) == len(images)

    # reduce embedding dimensions to 2
    x = PCA(n_components=2).fit_transform(data) if len(data[0]) > 2 else data

    # create a scatter plot.
    f = plt.figure(figsize=(22, 15))
    ax = plt.subplot(aspect='equal')
    sc = ax.scatter(x[:,0], x[:,1], s=4)
    _ = ax.axis('off')
    _ = ax.axis('tight')

    # add thumbnails :) (the thumbnail-drawing code below is commented out)
#     from matplotlib.offsetbox import OffsetImage, AnnotationBbox
#     for i in range(len(images)):
#         image = plt.imread(images[i])
#         im = OffsetImage(image, zoom=zoom)
#         bboxprops = dict(edgecolor=colors[i]) if colors is not None else None
#         ab = AnnotationBbox(im, x[i], xycoords='data',
#                             frameon=(bboxprops is not None),
#                             pad=0.02,
#                             bboxprops=bboxprops)
#         ax.add_artist(ab)
    return ax


- Read the pre-encoded embeddings, which were created in the [original notebook](https://www.kaggle.com/skylord/face-clustering).
- That notebook encodes the first-frame face crops using the following code block:

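```
import torch
from PIL import Image
from torchvision.transforms import ToTensor
from tqdm import tqdm

# `resnet` (the embedding model), `device`, and `face_files` are defined
# earlier in the original notebook and are not shown here.
tf_img = lambda i: ToTensor()(i).unsqueeze(0)  # PIL image -> tensor with a batch dimension
embeddings = lambda batch: resnet(batch)

list_embs = []
with torch.no_grad():  # inference only, no gradients needed
    for face in tqdm(face_files):
        t = tf_img(Image.open(face)).to(device)
        e = embeddings(t).squeeze().cpu().tolist()
        list_embs.append(e)
```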


In [4]:
%%time
import pickle

embeddings = pd.read_pickle('/kaggle/input/sample-face-crop/embeddings_face_clusters.pkl')
print(embeddings.shape)
embeddings.head()

(116573, 3)
CPU times: user 5.07 s, sys: 1.25 s, total: 6.32 s
Wall time: 6.69 s


| | embedding | faceFile | cluster |
|---|---|---|---|
| 0 | [0.06744061410427094, 0.007214107550680637, 0.... | cnjssbpoun_frame0.jpg | 0 |
| 1 | [-0.057064350694417953, -0.015465210191905499,... | aupwvhmmzg_frame0.jpg | 1 |
| 2 | [-0.003266997169703245, -0.032444849610328674,... | twqwnsblvn_frame0.jpg | 2 |
| 3 | [-0.06885179132223129, 0.0053096953779459, -0.... | zaicbihiam_frame0.jpg | 3 |
| 4 | [0.06825964897871017, -0.03303952515125275, -0... | hemhhjgnld_frame0.jpg | 4 |


In [5]:
# Expand the list-valued 'embedding' column into 512 numeric columns
colnames = [f'colname_{idx}' for idx in range(512)]

embeddings[colnames] = pd.DataFrame(embeddings['embedding'].values.tolist(), index=embeddings.index)
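
If only the numeric matrix is needed downstream, the list-valued column could also be stacked directly into a 2-D array, skipping the 512 intermediate columns. Below is a minimal sketch on toy data, assuming every row holds a list of the same length. `toy_embeddings` and the dimension 4 are stand-ins for the real column and its 512 values, and `float32` is chosen only because GPU libraries commonly work in single precision.

```python
import numpy as np

# Toy stand-in for embeddings['embedding'].tolist(): one list of floats per face.
toy_embeddings = [
    [0.1, 0.2, 0.3, 0.4],
    [0.5, 0.6, 0.7, 0.8],
    [0.9, 1.0, 1.1, 1.2],
]

# Stack the per-row lists into a single (n_rows, n_dims) float32 matrix.
embed_matrix = np.asarray(toy_embeddings, dtype=np.float32)

print(embed_matrix.shape)  # (3, 4)
```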

In [6]:
#Convert to numpy array
embed_numpy = embeddings[colnames].to_numpy()


In [7]:
%%time
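# Reduce to 50 dimensions with PCA first to speed things up.
# Note: the t-SNE cells below are fit on embed_numpy (512-D) and overwrite x,
# so this PCA output is not reused later in the notebook.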
x = PCA(n_components=50).fit_transform(embed_numpy)


CPU times: user 2.38 s, sys: 1.05 s, total: 3.43 s
Wall time: 3.6 s
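
For intuition about what `PCA(n_components=...).fit_transform` returns, here is a small NumPy sketch of PCA via the singular value decomposition. It is not cuML's implementation, only an illustration of the idea: the helper name `pca_reduce` and the toy matrix are hypothetical, and the signs of the output columns may differ from what a library returns.

```python
import numpy as np

def pca_reduce(data, n_components):
    """Project the rows of `data` onto their top principal components."""
    # Center each feature so the components describe variance around the mean.
    centered = data - data.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

rng = np.random.default_rng(0)
toy = rng.normal(size=(200, 16))  # stand-in for the (116573, 512) embedding matrix
reduced = pca_reduce(toy, n_components=5)

print(reduced.shape)  # (200, 5)
```

The output keeps one row per input sample and one column per retained component, with the first column carrying the most variance.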


By default, t-SNE uses n_components=2, which runs the fast Barnes-Hut approximation.
With more output dimensions, the exact method for computing t-SNE is used.

In [8]:
%%time
tsne = TSNE(random_state = 99) # 
x = tsne.fit_transform(embed_numpy)

CPU times: user 4.17 s, sys: 3.38 s, total: 7.56 s
Wall time: 7.7 s


Total time to fit and transform was about 7.7 seconds (wall time)!

Compare this with the 3-5+ hours it could take with sklearn's t-SNE.

In [9]:
%%time
tsne50 = TSNE(random_state=99, n_components=50)
x50 = tsne50.fit_transform(embed_numpy)


CPU times: user 1h 31min 52s, sys: 1h 43s, total: 2h 32min 35s
Wall time: 2h 32min 46s


DBSCAN's main benefits are that the number of clusters is not a hyperparameter and that it can find non-linearly shaped clusters. It is also robust to noise, because points in low-density regions are left unassigned instead of being forced into a cluster. DBSCAN has been applied to analyzing particle collisions in the Large Hadron Collider, customer segmentation in marketing analyses, and much more.
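
The density idea can be illustrated with a brute-force sketch. This is not how cuML implements DBSCAN; it is a small NumPy version for intuition only, assuming Euclidean distance and that a point counts as its own neighbour (the scikit-learn convention). The name `dbscan_sketch` and the toy points are made up for illustration.

```python
import numpy as np

def dbscan_sketch(points, eps, min_samples):
    """Label each point with a cluster id, or -1 for noise (brute force, O(n^2))."""
    n = len(points)
    # Pairwise Euclidean distances; each neighbourhood includes the point itself.
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    neighbours = [np.flatnonzero(dists[i] <= eps) for i in range(n)]
    is_core = np.array([len(nb) >= min_samples for nb in neighbours])

    labels = np.full(n, -1)
    cluster_id = 0
    for i in range(n):
        if labels[i] != -1 or not is_core[i]:
            continue
        # Grow a new cluster outward from this unassigned core point.
        labels[i] = cluster_id
        frontier = [i]
        while frontier:
            j = frontier.pop()
            for k in neighbours[j]:
                if labels[k] == -1:
                    labels[k] = cluster_id
                    if is_core[k]:  # only core points keep expanding the cluster
                        frontier.append(k)
        cluster_id += 1
    return labels

rng = np.random.default_rng(0)
blob_a = rng.normal(loc=(0.0, 0.0), scale=0.05, size=(30, 2))
blob_b = rng.normal(loc=(5.0, 5.0), scale=0.05, size=(30, 2))
outlier = np.array([[20.0, -20.0]])
points = np.vstack([blob_a, blob_b, outlier])

labels = dbscan_sketch(points, eps=1.5, min_samples=5)
print(sorted(set(labels.tolist())))  # [-1, 0, 1]
```

The number of clusters is never passed in: the two blobs are discovered from density alone, and the isolated point is left as noise with label -1.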


In [10]:
%%time 
dbscan = DBSCAN(eps=1.5, verbose=True)  # min_samples is left at its default of 5
clusters = dbscan.fit_predict(x)
embeddings['RapidDBSCAN'] = clusters

CPU times: user 16.4 s, sys: 611 ms, total: 17 s
Wall time: 17.1 s
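
A quick way to sanity-check a clustering run is to count the clusters and the noise points. The sketch below assumes noise is labelled -1, as in scikit-learn's DBSCAN convention. The helper `summarize_clusters` and the toy labels are hypothetical.

```python
import numpy as np

def summarize_clusters(labels):
    """Return (n_clusters, n_noise, sizes) for DBSCAN-style labels where -1 is noise."""
    labels = np.asarray(labels)
    # Count members per cluster, ignoring noise points.
    ids, counts = np.unique(labels[labels != -1], return_counts=True)
    sizes = dict(zip(ids.tolist(), counts.tolist()))
    n_noise = int(np.sum(labels == -1))
    return len(ids), n_noise, sizes

toy_labels = [0, 0, 1, -1, 1, 1, 2, -1]
n_clusters, n_noise, sizes = summarize_clusters(toy_labels)
print(n_clusters, n_noise, sizes)  # 3 2 {0: 2, 1: 3, 2: 1}
```

On the real data, the same helper could be applied to `embeddings['RapidDBSCAN'].to_numpy()`. A very large noise count or a single giant cluster would suggest revisiting `eps`.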


In [11]:
embeddings.to_pickle('/kaggle/working/embeddings.pkl')
