<a href="https://colab.research.google.com/github/leolani/cltl-face-all/blob/master/examples/colab/3.find-relevant-faces-colab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


# Clone the repo, download the necessary files, etc.

In [None]:
%cd /content
!git clone https://github.com/leolani/cltl-face-all
!cd cltl-face-all/ && pip install .
!pip install omegaconf==2.0.5
!pip install tqdm av

# Download the annotations and stuff
!wget https://raw.githubusercontent.com/declare-lab/MELD/master/data/MELD/train_sent_emo.csv
!wget https://raw.githubusercontent.com/leolani/cltl-face-all/master/examples/smaller-datatsets-jsons/dataset-small.json
!wget https://raw.githubusercontent.com/leolani/cltl-face-all/master/examples/smaller-datatsets-jsons/dataset-medium.json
!wget https://raw.githubusercontent.com/leolani/cltl-face-all/master/examples/smaller-datatsets-jsons/dataset-large.json
!wget https://raw.githubusercontent.com/leolani/cltl-face-all/master/examples/smaller-datatsets-jsons/datasets.json

!gdown --id 1rsLbfgQYztDtrPFqEmkh-2d_0ap1qd_s
!unzip visual-features-smaller-dataset.zip
!rm visual-features-smaller-dataset.zip

!gdown --id 16ck7plW9v9eSHGCs5wuB2AhhufPRt3Wi
!unzip smaller-dataset.zip
!rm smaller-dataset.zip

!ls smaller-dataset/  |  wc -l
!ls visual-features | wc -l

In [None]:
import json

with open('datasets.json', 'r') as stream:
    datasets = json.load(stream)

datasets = datasets['large']

from glob import glob
import numpy as np
import os
visual_features = glob('visual-features-smaller-dataset/*.npy')
visual_features = {os.path.basename(vf).split('.npy')[0] : np.load(vf, allow_pickle=True).item() for vf in visual_features}

# Load the data from storage to memory

The biggest difficulty is that the annotated speaker's face is not always visible in the video. It might be hidden. Let's try to match the annotated names with the face embeddings, where possible.

## Let's start from the smallest

In [None]:
FACE_PROB = 0.975
EVERY_N_FRAME = 4
# SPEAKERS_OF_INTEREST = ['Chandler', 'Joey', 'Monica', 'Phoebe', 'Rachel', 'Ross']
DATASET_chosen = 'train'
dataset_chosen = datasets[DATASET_chosen]

speakers_mentioned = []

# Maps each collected face index back to its source video and frame
idx2source = {}
embeddings_all = []
bboxes_all = []
landmarks_all = []

count = 0
for diautt, annot in dataset_chosen.items():
    # There is one face annotated for the entire video.
    # We are not even sure whether the face is actually there or not.
    # Even if the face is there, we don't know which frame it appears in.

    # if annot['Speaker'] not in SPEAKERS_OF_INTEREST:
    #     continue

    for framenum, list_of_findings in visual_features[diautt].items():
        if framenum % EVERY_N_FRAME != 0:
            continue
        for finding in list_of_findings:
            if finding['bbox'][-1] < FACE_PROB:
                continue
            
            embeddings_all.append(finding['embedding'])
            bboxes_all.append(finding['bbox'])
            landmarks_all.append(finding['landmark'])
            idx2source[count] = {'diautt':diautt, 'frame': framenum}
            count+=1
            speakers_mentioned.append(annot['Speaker'])

assert len(embeddings_all) == len(bboxes_all) == len(landmarks_all) == \
        len(idx2source)

speakers_mentioned = sorted(list(set(speakers_mentioned)))

print(f"Out of the {len(dataset_chosen)} number of videos (utterances),")
print(f"There are in total of {len(speakers_mentioned)} unique speakers mentioned")
print()
print(speakers_mentioned)
print()
print(f"and {len(embeddings_all)} faces detected")

Out of the 584 number of videos (utterances),
There are in total of 31 unique speakers mentioned

['Ben', 'Chandler', 'Charlie', 'Chip', 'Danny', 'Dr. Green', 'Dr. Johnson', 'Dr. Ledbetter', 'Dr. Rhodes', 'Hoshi', 'Joey', 'Julie', 'Katie', 'Leslie', 'Marc', 'Mike', 'Mischa', 'Mona', "Mona's Date", 'Monica', 'Pete', 'Phoebe', 'Rachel', 'Receptionist', 'Richard', 'Rick', 'Ross', 'Student', 'The Assistant Director', 'The Director', 'Tom']

and 19645 faces detected


You can't just assume that every face detected belongs to one of the 31 speakers mentioned, since there are also other faces in the scene. Let's try simple clustering of the embeddings. Refer to https://scikit-learn.org/stable/modules/clustering.html for an overview of clustering methods.

Clustering is not an easy topic. Our data are all unit vectors, which means that they all lie on the surface of a high-dimensional sphere. There must be existing work that fits our case, but for now I'll just copy and paste what I can easily find in scikit-learn.
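
The Euclidean-versus-angle question is less of a problem than it might look. For unit vectors, squared Euclidean distance and cosine distance are tied by ‖a − b‖² = 2(1 − cos θ), so a Euclidean `eps` corresponds to a cosine threshold of `eps² / 2`. The sketch below checks this on synthetic unit vectors (random data, not the real face embeddings; the 128 dimensions and the noise level are arbitrary assumptions) and runs scikit-learn's `DBSCAN` with both metrics.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)

# Synthetic stand-in for face embeddings: two tight groups of unit vectors.
centers = rng.normal(size=(2, 128))
points = np.concatenate([c + 0.05 * rng.normal(size=(50, 128)) for c in centers])
points /= np.linalg.norm(points, axis=1, keepdims=True)

# Pick one vector from each group and compare the two distances.
a, b = points[0], points[60]
euclidean = np.linalg.norm(a - b)
cosine_distance = 1.0 - a @ b
# For unit vectors: ||a - b||^2 == 2 * (1 - cos(theta))
assert np.isclose(euclidean ** 2, 2.0 * cosine_distance)

# Convert a Euclidean eps into the equivalent cosine-distance eps.
eps_euclidean = 0.8
eps_cosine = eps_euclidean ** 2 / 2.0

labels_euclidean = DBSCAN(eps=eps_euclidean, min_samples=20).fit(points).labels_
labels_cosine = DBSCAN(eps=eps_cosine, min_samples=20, metric="cosine").fit(points).labels_

print(np.unique(labels_euclidean, return_counts=True))
print(np.unique(labels_cosine, return_counts=True))
```

On this toy data both runs find the same two clusters, which suggests that Euclidean DBSCAN on unit vectors is already an angle-based clustering in disguise. Whether the real embeddings are exactly unit length is worth checking before relying on this.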

In [None]:
import numpy as np

from sklearn.cluster import DBSCAN
from sklearn import metrics


X = np.stack(embeddings_all)

# #############################################################################
# Compute DBSCAN
# DBSCAN uses Euclidean distance between the data points by default.
# TODO: find a way to replace it with angular distance.
# eps and min_samples are hyperparameters that you have to tune.
# At the moment 0.8 and 20, respectively, work decently.
db = DBSCAN(eps=0.8, min_samples=20).fit(X)
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_

# Number of clusters in labels, ignoring noise if present.
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
n_noise_ = list(labels).count(-1)

print('Estimated number of clusters: %d' % n_clusters_)
print('Estimated number of noise points: %d' % n_noise_)
print("Silhouette Coefficient: %0.3f"
      % metrics.silhouette_score(X, labels))
print(f"Number of faces that are clustered: {len(embeddings_all) - n_noise_}")
print()

(label_num, counts) = np.unique(labels, return_counts=True)

for l, c in zip(label_num, counts):
    print(f" label {l} \t has {c} counts")

np.save('embeddings-clusters.npy', labels)
!cp 'embeddings-clusters.npy' drive/MyDrive/

We got several clusters. That is few enough to go through them one by one. The label `-1` is considered noise.
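
To go through the clusters one by one, it helps to group the face indices by label first and skip the noise. Here is a minimal sketch on a made-up label array (`toy_labels` and `members` are illustrative names; the real labels come from `db.labels_`):

```python
from collections import defaultdict

import numpy as np

# Made-up labels in the same format DBSCAN returns: -1 marks noise.
toy_labels = np.array([0, 1, -1, 0, 2, 1, -1, 0])

members = defaultdict(list)
for idx, label in enumerate(toy_labels):
    if label == -1:
        continue  # skip noise points
    members[int(label)].append(idx)

for label in sorted(members):
    print(f"cluster {label}: indices {members[label]}")
```

Each list of indices can then be looked up in `idx2source` to find the video and frame a face came from.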


The cell below crops every detected face and saves it as an image, in one folder per cluster label.

In [None]:
import numpy as np

labels = np.load("drive/MyDrive/embeddings-clusters.npy")

import shutil
from tqdm.notebook import tqdm
import cv2
import av
import random
from cltl_face_all.face_alignment import FaceDetection

NUM_MAX_VID_PER_LABEL = 25

shutil.rmtree('faces', ignore_errors=True)

assert len(embeddings_all) == len(bboxes_all) == len(landmarks_all) == \
        len(idx2source) == len(labels)

VIDS_DIR = 'smaller-dataset/'


list_all = []


indices = list(idx2source.keys())
for idx in indices:
    label_ = labels[idx]

    embedding_ = embeddings_all[idx]
    bbox_ = bboxes_all[idx]
    landmark_ = landmarks_all[idx]
    source_ = idx2source[idx]

    to_append = {'label': label_, 
                'embedding': embedding_,
                 'bbox': bbox_,
                 'landmark': landmark_,
                 'diautt': source_['diautt'],
                 'frame': source_['frame']}

    list_all.append(to_append)


assert len(list_all) == len(labels)

random.shuffle(list_all)


fd = FaceDetection(device='cpu', face_detector='sfd')

labels_processed = {l: 0 for l in set(labels)}

for finding in tqdm(list_all):
    label_ = finding['label']
    embedding_ = finding['embedding']
    bbox_ = finding['bbox']
    landmark_ = finding['landmark']
    diautt_ = finding['diautt']
    frame_num = finding['frame']
    video_path = os.path.join(VIDS_DIR, diautt_) + '.mp4'

    os.makedirs(os.path.join('faces', str(label_)), exist_ok=True)

    # if labels_processed[label_] > NUM_MAX_VID_PER_LABEL:
    #     continue

    assert os.path.isfile(video_path)

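    # Decode frames until we reach the one the face was detected in.
    # Counting with enumerate avoids relying on frame.index,
    # which is not available in every PyAV version.
    with av.open(video_path) as container:
        for idx, frame in enumerate(container.decode(video=0)):
            if idx == frame_num:
                break
        # Convert only the selected frame (the last one decoded) to an RGB array.
        img = np.array(frame.to_image())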

    batch = img[np.newaxis, ...]
    face = fd.crop_and_align(batch, [bbox_[np.newaxis, ...]], [landmark_[np.newaxis, ...]])
    face = np.squeeze(face)

    # np.int was removed from recent NumPy releases; the builtin int works everywhere.
    img_write_path = os.path.join('faces', 
                                  str(label_), 
                                  f"{diautt_}_frame{frame_num}_{'_'.join([str(foo) for foo in bbox_.astype(int).tolist()[:4]])}.jpg")

    cv2.imwrite(img_write_path, cv2.cvtColor(face, cv2.COLOR_RGB2BGR))
    labels_processed[label_] +=1


!zip -r faces.zip faces
!cp faces.zip drive/MyDrive/

Downloading: "https://www.adrianbulat.com/downloads/python-fan/s3fd-619a316812.pth" to /root/.cache/torch/hub/checkpoints/s3fd-619a316812.pth


Downloading: "https://www.adrianbulat.com/downloads/python-fan/2DFAN4-11f355bf06.pth.tar" to /root/.cache/torch/hub/checkpoints/2DFAN4-11f355bf06.pth.tar



Now go through the faces in each cluster and check whether they match the names.


In [None]:
import matplotlib.pyplot as plt
from glob import glob
import random
import os
import cv2

print(sorted(os.listdir('faces/')))

for label in sorted(os.listdir('faces/')):
    images = glob(os.path.join('faces', label, '*.jpg'))
    img = random.choice(images)
    img = cv2.imread(img)
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)

    plt.figure()
    plt.imshow(img)
    plt.title(f"{label}, {len(images)}")


## Here are my findings

```
 label  character      note
 -1     noise
 0      Chandler
 1      Joey
 2      Ross
 3      Phoebe
 4      noise
 5      Gunther
 6      Rachel
 7      Ross (side)    remove it
 8      Ben
 9      Ross (side)    remove it
 10     Monica
 11     Monica (side)  remove it
 12     Leslie
 13     a dude (side)  remove it
 14     a dude (side)  remove it
 15     Richard        Gary Oldman
 16     Monica
 17     don't know
 18     Tom
 19     don't know
```
The clustering was almost perfect, except that it split the frontal Monica faces into two clusters instead of one. From now on, we keep only the frontal faces, which gives us

```
 0   Chandler
 1   Joey
 2   Ross
 3   Phoebe
 4   noise
 5   Gunther
 6   Rachel
 8   Ben
 10  Monica
 12  Leslie
 15  Richard
 16  Monica
 18  Tom
```

Labels 10 and 16 are both Monica, so they will be merged. Label 4 is noise, so it is left out of the mapping below.




In [None]:
to_keep = {'Chandler': [],
           'Joey': [],
           'Ross': [],
           'Phoebe': [],
           'Gunther': [],
           'Rachel': [],
           'Ben': [],
           'Monica': [],
           'Leslie': [],
           'Richard': [],
           'Tom': []}

label2name = {0: 'Chandler',
              1: 'Joey',
              2: 'Ross',
              3: 'Phoebe',
              5: 'Gunther',
              6: 'Rachel',
              8: 'Ben',
              10: 'Monica',
              12: 'Leslie',
              15: 'Richard',
              16: 'Monica',
              18: 'Tom'}

for finding in list_all:
    label_ = finding['label']
    embedding_ = finding['embedding']
    bbox_ = finding['bbox']
    landmark_ = finding['landmark']
    diautt_ = finding['diautt']
    frame_num = finding['frame']

    if label_ in label2name:
        to_keep[label2name[label_]].append(embedding_)

In [None]:
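final_vectors = {}
for name, list_of_embs in to_keep.items():
    # Sum the embeddings and re-normalize, giving one unit vector per character.
    sum_of_vecs = np.sum(list_of_embs, axis=0)
    sum_of_vecs = sum_of_vecs / np.linalg.norm(sum_of_vecs)
    print(name, sum_of_vecs.shape, np.linalg.norm(sum_of_vecs), sum_of_vecs.dtype)
    final_vectors[name] = sum_of_vecs

# np.save pickles the dict, so load it back with
# np.load('friends-embeddings.npy', allow_pickle=True).item()
np.save('friends-embeddings.npy', final_vectors)
!cp friends-embeddings.npy drive/MyDrive/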
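
The saved per-character vectors can later be used to name a new face by cosine similarity. The sketch below is only an illustration under stated assumptions: the reference vectors are random stand-ins (not the real embeddings), the 512 dimensions are arbitrary, `identify` is a hypothetical helper, and the `0.5` threshold is a placeholder that would need tuning on real data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the saved dictionary: name -> unit vector.
reference = {}
for name in ["Chandler", "Joey", "Monica"]:
    vec = rng.normal(size=512)
    reference[name] = vec / np.linalg.norm(vec)

np.save("friends-embeddings-demo.npy", reference)

# np.save stores a dict as a pickled 0-d object array, so loading it
# needs allow_pickle=True and .item() to get the dict back.
loaded = np.load("friends-embeddings-demo.npy", allow_pickle=True).item()


def identify(embedding, references, threshold=0.5):
    """Return (name, similarity) for the closest reference vector,
    or (None, similarity) if the best match is below the threshold."""
    embedding = embedding / np.linalg.norm(embedding)
    names = list(references)
    # For unit vectors the dot product is the cosine similarity.
    sims = np.array([references[n] @ embedding for n in names])
    best = int(np.argmax(sims))
    if sims[best] < threshold:
        return None, float(sims[best])
    return names[best], float(sims[best])


# A slightly perturbed "Joey" vector should still match Joey.
query = loaded["Joey"] + 0.02 * rng.normal(size=512)
name, similarity = identify(query, loaded)
print(name, round(similarity, 3))

# An unrelated random vector should fall below the threshold.
stranger_name, stranger_similarity = identify(rng.normal(size=512), loaded)
print(stranger_name, round(stranger_similarity, 3))
```

Returning `None` below the threshold matters here because, as noted earlier, many detected faces belong to people who are not among the annotated speakers.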