# Demonstration and Comparison of Similarity Robustification Methods

This Jupyter notebook compares different similarity robustification methods in terms of the identification rate (IR) and the verification rate (VR).

We use the CPLFW dataset with distractors from the MegaFace dataset.

In [11]:
import numpy as np
from sklearn.preprocessing import normalize
import os
from tqdm.notebook import tqdm
import pandas as pd
import sys

sys.path.append('../')
from ir_package.ir_class import IrBigData
IrBigData._print_info = False

# Run from the repository root ('quadrics') so that the relative paths below resolve.
if not os.getcwd().endswith('quadrics'):
    os.chdir('../')

List the robustification techniques for which to calculate IR.
**Note:** `'quadrics'` is not listed by default because its preparation takes a very long time.

Then run the scripts that build the models for the listed techniques and save the _CPLFW_ and _MegaFace_ features to the `features` folder.

In [2]:
methods = ['PCA', 'norms']

! python3 create_models.py --methods {' '.join(methods)}
! python3 calculate_features.py --methods {' '.join(methods)} --datasets cplfw megaface 

The basic similarity function is the standard cosine similarity, so for simplicity we normalize the embeddings to unit length.

In [12]:
cplfw_emb = normalize(np.load('image_embeddings/cplfw.npy'))
megaface_emb = normalize(np.load('image_embeddings/megaface.npy'))

with open('image_embeddings/labels/cplfw_labels.txt', encoding='utf-8') as txt_labels_file:
    lines = txt_labels_file.readlines()
cplfw_labels = np.array([i.rstrip('\n') for i in lines])
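
The normalization above is what lets a plain dot product stand in for cosine similarity. The following is a minimal sketch on made-up toy vectors (not the notebook's embeddings); it uses only NumPy, and the row-wise division is assumed to match what `normalize` does with its default settings.

```python
import numpy as np

# Toy embeddings, made up for illustration: 3 vectors in 4 dimensions.
rng = np.random.default_rng(0)
toy_emb = rng.normal(size=(3, 4))

# Cosine similarity computed directly from its definition.
row_norms = np.linalg.norm(toy_emb, axis=1, keepdims=True)
cosine_explicit = (toy_emb @ toy_emb.T) / (row_norms * row_norms.T)

# Scale every row to unit L2 norm, then take plain dot products.
toy_unit = toy_emb / row_norms
cosine_dot = toy_unit @ toy_unit.T

# Both routes give the same similarity matrix.
print(np.allclose(cosine_explicit, cosine_dot))
```

Normalizing once up front avoids recomputing the norms for every pair when the gallery and the distractor set are large.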

Specify the parameters (see the documentation of the `IrBigData` class for the full list with explanations).

In [4]:
parameters_ir = {
    "similarity_type": "features",
    "fpr_threshold": 1e-5,
    "dist_type": "max_threshold",
    "protocol": "data_distractors_no_false_pairs",
    "N_distractors": 10000
}

Next, for each of the robustification techniques we run experiments with two similarity functions: the basic `cosine` and the robustified `features`.

If we denote the basic `cosine` similarity by $s(x, y) = \langle x, y \rangle$, then the robustified `features` similarity $s_h(x, y)$ is given by:

$$s_h(x, y) = \begin{cases}
s(x, y), &\max({\sf o}(x), {\sf o}(y)) < \alpha, \\
0, &\max({\sf o}(x), {\sf o}(y)) \geq \alpha,
\end{cases}$$

where $\alpha$ is the threshold parameter and ${\sf o}(x)$ is the list of features of $x$ given by the robustification technique.
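
The definition above can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the `IrBigData` implementation: the function name `robust_similarity` is hypothetical, the embeddings are assumed to be unit-norm, and a list of features per sample is reduced to its largest value before thresholding.

```python
import numpy as np

def robust_similarity(emb_x, emb_y, feat_x, feat_y, alpha):
    """Thresholded similarity s_h for all pairs of rows (hypothetical helper).

    emb_x: (n, d) and emb_y: (m, d) unit-norm embeddings.
    feat_x: (n,) or (n, k) and feat_y: (m,) or (m, k) features o(.).
    Returns an (n, m) matrix.
    """
    sim = emb_x @ emb_y.T  # s(x, y) = <x, y>
    # Reduce each sample's feature list to its largest entry.
    o_x = np.asarray(feat_x, dtype=float).reshape(len(emb_x), -1).max(axis=1)
    o_y = np.asarray(feat_y, dtype=float).reshape(len(emb_y), -1).max(axis=1)
    # max(o(x), o(y)) for every pair, via broadcasting.
    pair_score = np.maximum(o_x[:, None], o_y[None, :])
    # Keep the similarity below the threshold, zero it otherwise.
    return np.where(pair_score < alpha, sim, 0.0)

# Toy data: the third sample has a large feature value and is treated as an outlier.
emb = np.array([[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]])
feat = np.array([0.1, 0.2, 0.9])
s_h = robust_similarity(emb, emb, feat, feat, alpha=0.5)
print(s_h)
```

In this toy case every pair that involves the third sample gets similarity 0, including its similarity with itself, while the remaining pairs keep their cosine values.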

We tune $\alpha$ by grid search in the vicinity of the 0.99 quantile of the features, which roughly corresponds to the proportion of outliers in the _CPLFW_ dataset.
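
To make the link between quantile levels and thresholds concrete, here is a small sketch on synthetic feature values (random numbers, not the CPLFW features). It builds a grid of eleven candidate thresholds around the 0.99 quantile and reports what fraction of samples each threshold would flag; the names `toy_features`, `levels`, and `alpha_grid` are illustrative only.

```python
import numpy as np

# Synthetic stand-in for per-sample feature values (an assumption for this sketch).
rng = np.random.default_rng(0)
toy_features = rng.normal(size=10_000)

# Eleven quantile levels from 0.984 to 0.994, centered near 0.99.
levels = np.linspace(0.984, 0.994, 11)
alpha_grid = np.quantile(toy_features, levels)

# Fraction of samples at or above each threshold, i.e. the share treated as outliers.
flagged = np.array([(toy_features >= a).mean() for a in alpha_grid])

for q, a, f in zip(levels, alpha_grid, flagged):
    print(f"quantile={q:.3f}  alpha={a:.3f}  flagged={f:.4f}")
```

Each threshold taken at quantile level $q$ flags roughly a $1 - q$ share of the samples, so the grid covers outlier proportions between about 0.6% and 1.6%.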

In each case we calculate both the identification and verification (identification with no outliers) rates.


In [5]:
# choose random distractors
indices_random = np.random.choice(len(megaface_emb), 
                                  size=parameters_ir['N_distractors'], 
                                  replace=False)
megaface_emb = megaface_emb[indices_random]


results_dict = {}
results_vr_dict = {}

pbar = tqdm(methods)
for method in pbar:
    pbar.set_description(method)        
    results_dict[method] = {'cosine':{}, 'features':{}}
    results_arr = []
    results_vr_arr = []

    cplfw_features = np.load('features/cplfw/{}_dist.npy'.format(method))
    megaface_features = np.load('features/megaface/{}_dist.npy'.format(method))[indices_random]
    
    IR = IrBigData(cplfw_emb, cplfw_features, 
               cplfw_labels, parameters_ir, distractors=megaface_emb, 
               distractor_features=megaface_features)
    IR.params['similarity_type'] = 'features'
    
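    # candidate thresholds: feature quantiles at levels 0.984, 0.985, ..., 0.994
    quantile_levels = [0.984 + 0.001 * i for i in range(11)]
    quantiles_arr = [np.quantile(cplfw_features, q) for q in quantile_levels]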
    for alpha in tqdm(quantiles_arr, leave=False):
        IR.params['alpha'] = alpha
        IR.main()
        results_arr.append(IR.CMT_)
        results_vr_arr.append(IR.VR_)
    
    results_dict[method]['features']['ir'] = max(results_arr)
    results_dict[method]['features']['vr'] = max(results_vr_arr)
    
    IR = IrBigData(cplfw_emb, None,
               cplfw_labels, parameters_ir, distractors=megaface_emb, 
               distractor_features=None)
    IR.params['similarity_type'] = 'cosine'
    IR.main()
    results_dict[method]['cosine']['ir'] = IR.CMT_
    results_dict[method]['cosine']['vr'] = IR.VR_


### Results table

In [65]:
pca_table = pd.DataFrame.from_dict(results_dict['PCA'])
pca_table['method'] = 'PCA'
pca_table['metric'] = pca_table.index
norms_table = pd.DataFrame.from_dict(results_dict['norms'])
norms_table['method'] = 'norms'
norms_table['metric'] = norms_table.index
results_df = pd.concat([pca_table, norms_table]).set_index(['metric', 'method']).sort_index()

In [67]:
results_df

| metric | method | cosine | features |
|---|---|---|---|
| ir | PCA | 0.665443 | 0.719979 |
| ir | norms | 0.665443 | 0.742702 |
| vr | PCA | 0.666492 | 0.721552 |
| vr | norms | 0.666492 | 0.744625 |
