# Demonstration and Comparison of Similarity Robustification Methods

This Jupyter notebook compares different similarity robustification methods in terms of identification rate (IR) and verification rate (VR).

We use the CPLFW dataset with distractors from the MegaFace dataset.

In [40]:
import numpy as np
from sklearn.preprocessing import normalize
import os
from tqdm.notebook import tqdm
import pandas as pd
import sys

sys.path.append('../')
from ir_package.ir_class import IrBigData, calculate_inv_dict
IrBigData._print_info = False

if not os.getcwd().endswith('quadrics'):
    os.chdir('../')

In [None]:
# map internal method names to the notation used in the article
names_dict = {'OneClassSVM': 'OCSVM', 'norms': 'NORM', 'quadrics': 'Q-FULL', 
              'quadrics_alg': 'Q-SUB', 'basic_cosine': 'initial'}
col_order = ['initial', 'Q-FULL', 'PCA', 'Q-SUB',  'OCSVM', 'NORM']

List the robustification techniques for which to train models and calculate IR.
**Note:** `'quadrics'` is not included in `methods_train` by default because preparing it takes a very long time.

Then run the scripts that build the models for the listed techniques and save the _CPLFW_ and _MegaFace_ features to the `features` folder.

In [86]:
methods_train = ['OneClassSVM', 'PCA', 'norms']
methods = ['PCA', 'norms', 'OneClassSVM', 'quadrics', 'quadrics_alg']

! python3 create_models.py --methods {' '.join(methods_train)}
! python3 calculate_features.py --methods {' '.join(methods)} --datasets cplfw megaface 

The basic similarity function is the standard cosine similarity. For simplicity we therefore L2-normalize the embeddings, so that cosine similarity reduces to the inner product.

In [46]:
cplfw_emb = normalize(np.load('image_embeddings/cplfw.npy'))
megaface_emb = normalize(np.load('image_embeddings/megaface.npy'))

with open('image_embeddings/labels/cplfw_labels.txt', encoding='utf-8') as txt_labels_file:
    lines = txt_labels_file.readlines()
cplfw_labels = np.array([i.rstrip('\n') for i in lines])
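
As a quick sanity check of why normalization is enough, the sketch below uses toy vectors (illustrative values, not CPLFW or MegaFace embeddings) and plain NumPy row scaling, which is the same row-wise L2 scaling that `normalize` applies by default. It shows that the cosine similarity of the raw vectors equals the inner product of the normalized ones.

```python
import numpy as np

# Toy embeddings (illustrative values only).
emb = np.array([[3.0, 4.0], [1.0, 0.0], [-2.0, 2.0]])

# Row-wise L2 normalization.
norms = np.linalg.norm(emb, axis=1, keepdims=True)
unit = emb / norms

# Cosine similarity computed from the raw vectors...
cosine = (emb @ emb.T) / (norms @ norms.T)

# ...matches the plain inner product of the normalized vectors.
dot = unit @ unit.T
assert np.allclose(cosine, dot)
```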

Specify the parameters (the documentation of the `IrBigData` class gives the full list with explanations).

In [47]:
parameters_ir = {
    "similarity_type": "features",
    "fpr_threshold": 1e-5,
    "dist_type": "max_threshold",
    "protocol": "data_distractors_no_false_pairs",
    "N_distractors": 10000
}

Next, for each of the robustification techniques we run experiments with two similarity functions: the basic `cosine` and the robustified `features`.

If we denote the basic `cosine` similarity by $s(x ,y) = \langle x, y \rangle$, then the robustified `features` similarity $s_h(x, y)$ is given by:

$$s_h(x, y) = \begin{cases}
s(x, y), &\max({\sf o}(x), {\sf o}(y)) < \alpha, \\
0, &\max({\sf o}(x), {\sf o}(y)) \geq \alpha,
\end{cases}$$

where $\alpha$ is the threshold parameter and ${\sf o}(x)$ is the list of features of $x$ given by the robustification technique.
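
The piecewise definition above can be sketched in a few lines of Python. This is only an illustration: the function name `robust_similarity` is hypothetical, and ${\sf o}(x)$ is assumed to be a single scalar value per embedding, which is what the $\max$ in the formula suggests. It is not the implementation used inside `IrBigData`.

```python
import numpy as np

def robust_similarity(x, y, o_x, o_y, alpha):
    """Return <x, y> if both feature values are below alpha, else 0.

    Hypothetical helper; o_x and o_y are assumed to be scalars.
    """
    if max(o_x, o_y) < alpha:
        return float(np.dot(x, y))
    return 0.0

# Two unit vectors with inner product 0.6 (toy values).
x = np.array([1.0, 0.0])
y = np.array([0.6, 0.8])

sim_kept = robust_similarity(x, y, o_x=0.2, o_y=0.3, alpha=0.5)    # 0.6
sim_zeroed = robust_similarity(x, y, o_x=0.2, o_y=0.7, alpha=0.5)  # 0.0
```

In the second call one of the two feature values reaches the threshold, so the pair is treated as involving an outlier and its similarity is set to zero.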

We tune $\alpha$ by grid search in the vicinity of the 0.99 quantile of the features, which roughly corresponds to the proportion of outliers in the _CPLFW_ dataset.

In each case we calculate both the identification rate and the verification rate (identification with no outliers).
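
To make the grid search concrete, here is a minimal sketch on synthetic data. The array `scores` is a random stand-in for the feature values, and `evaluate` is a hypothetical placeholder objective. In the notebook that role is played by the identification rate measured on the training half.

```python
import numpy as np

rng = np.random.default_rng(0)
scores = rng.normal(size=10_000)  # synthetic stand-in for feature values

# Ten candidate thresholds at the 0.985, 0.986, ..., 0.994 quantiles.
levels = 0.985 + 0.001 * np.arange(10)
candidates = np.quantile(scores, levels)

def evaluate(alpha):
    # Hypothetical placeholder objective, peaked at the fifth candidate.
    return -abs(alpha - candidates[4])

# Keep the candidate threshold with the highest objective value.
best_alpha = candidates[np.argmax([evaluate(a) for a in candidates])]
```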


In [49]:
def get_train_test_split(emb_names):
    # split CPLFW into two halves by label, so that no label appears in both
    inv_dict = calculate_inv_dict(emb_names)
    keys = list(inv_dict.keys())
    np.random.shuffle(keys)
    indices_train, indices_test = [], []

    for key in keys[:int(len(keys)/2)]:
        indices_train.extend(inv_dict[key])

    for key in keys[int(len(keys)/2):]:
        indices_test.extend(inv_dict[key])

    return indices_train, indices_test
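
The split above works at the level of labels rather than individual images. The sketch below reproduces the idea on toy labels. `label_to_indices` is a hypothetical stand-in that assumes `calculate_inv_dict` maps each label to the list of positions where it occurs. That behaviour is inferred from how the function is used here, not confirmed from its source.

```python
import numpy as np
from collections import defaultdict

def label_to_indices(labels):
    # Assumed behaviour of calculate_inv_dict: label -> list of positions.
    inv = defaultdict(list)
    for idx, label in enumerate(labels):
        inv[label].append(idx)
    return dict(inv)

labels = np.array(['ann', 'bob', 'ann', 'cat', 'bob', 'dan'])  # toy labels
inv_dict = label_to_indices(labels)

rng = np.random.default_rng(0)
keys = list(inv_dict)
rng.shuffle(keys)
half = len(keys) // 2

# All images of a given label go to the same side of the split.
train_idx = [i for k in keys[:half] for i in inv_dict[k]]
test_idx = [i for k in keys[half:] for i in inv_dict[k]]

assert set(labels[train_idx]).isdisjoint(set(labels[test_idx]))
```

Because whole labels are assigned to one side, the two halves can differ in image count even though they contain the same number of labels.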


In [100]:
results_dict = {}
for method in methods+['basic_cosine']:
    results_dict[method] = {}
    results_dict[method]['ir'] = []
    results_dict[method]['vr'] = []

bar_exp = tqdm(range(4))
for _ in bar_exp:
    # choose random distractors
    indices_random = np.random.choice(len(megaface_emb), 
                                      size=parameters_ir['N_distractors']*2, 
                                      replace=False)

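    # keep the full MegaFace array intact so that indices_random stays aligned
    # with the feature files loaded below in every repetition
    megaface_emb_sampled = megaface_emb[indices_random]
    indices_train, indices_test = get_train_test_split(cplfw_labels)

    megaface_emb_train = megaface_emb_sampled[:parameters_ir['N_distractors']]
    megaface_emb_test = megaface_emb_sampled[parameters_ir['N_distractors']:]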
    cplfw_emb_train = cplfw_emb[indices_train]
    cplfw_emb_test = cplfw_emb[indices_test]

    pbar = tqdm(methods, leave=False)
    for method in pbar:
        pbar.set_description(method)                
        results_arr_train = []

        cplfw_features = np.load('features/cplfw/{}_dist.npy'.format(method))
        megaface_features = np.load('features/megaface/{}_dist.npy'.format(method))[indices_random]

        cplfw_features_train = cplfw_features[indices_train]
        cplfw_features_test = cplfw_features[indices_test]
        megaface_features_train = megaface_features[:parameters_ir['N_distractors']]
        megaface_features_test = megaface_features[parameters_ir['N_distractors']:]

        IR = IrBigData(cplfw_emb_train, cplfw_features_train, 
                   cplfw_labels[indices_train], parameters_ir, distractors=megaface_emb_train, 
                   distractor_features=megaface_features_train)
        IR.params['similarity_type'] = 'features'

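        # candidate thresholds: the 0.985, 0.986, ..., 0.994 quantiles of the features
        quantile_levels = [0.985 + 0.001*k for k in range(10)]
        quantiles_arr = [np.quantile(cplfw_features, q) for q in quantile_levels]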
        for alpha in tqdm(quantiles_arr, leave=False):
            IR.params['alpha'] = alpha
            IR.main()
            results_arr_train.append(IR.CMT_)

        parameters_ir['alpha'] = quantiles_arr[np.argmax(results_arr_train)]

        IR = IrBigData(cplfw_emb_test, cplfw_features_test, 
               cplfw_labels[indices_test], parameters_ir, distractors=megaface_emb_test, 
               distractor_features=megaface_features_test)
        IR.main()

        results_dict[method]['ir'].append(IR.CMT_)
        results_dict[method]['vr'].append(IR.VR_)

    IR = IrBigData(cplfw_emb_test, None, cplfw_labels[indices_test], 
                   parameters_ir, distractors=megaface_emb_test, 
                   distractor_features=None)
    IR.params['similarity_type'] = 'cosine'
    IR.main()
    results_dict['basic_cosine']['ir'].append(IR.CMT_)
    results_dict['basic_cosine']['vr'].append(IR.VR_)

    
for method in methods+['basic_cosine']:
    results_dict[method]['ir'] = np.mean(results_dict[method]['ir'])
    results_dict[method]['vr'] = np.mean(results_dict[method]['vr'])


### Results table

In [95]:
df_result = pd.DataFrame.from_dict(results_dict)
df_result.rename(columns=names_dict)[col_order]


| | initial | Q-FULL | PCA | Q-SUB | OCSVM | NORM |
|---|---|---|---|---|---|---|
| ir | 0.670617 | 0.743564 | 0.717837 | 0.693583 | 0.651183 | 0.742126 |
| vr | 0.672771 | 0.746623 | 0.720175 | 0.695658 | 0.652996 | 0.745185 |


**Remark**
Note that the results in the table above need not coincide with those in Table 2 of the paper. The reason is that this notebook uses `parameters_ir['N_distractors'] = 10000`, whereas the results in the paper were obtained with `parameters_ir['N_distractors'] = 500000` (which takes much longer to run).
