### Metric Learning
In the previous tutorial, we saw how the pre-trained VGG-Face representations, combined with a distance metric (L2-distance) helped us in the problem of face verification i.e. determining whether two face images belong to the same identity or not. 

But we know that metrics such as L1 or L2 may not always be optimal for the task at hand. So in this tutorial, we will see if we can "learn" such distance metrics from the data, and whether that leads to an improvement in verification peformance.

There are cases where a simple L2 distance in the representation space may not preserve the semantic similarity between instances of the same class. Consider the example shown below. Let us assume that the squares and the circles represent instances belonging to two different classes. It is evident that the original D-dimensional representation space does not do a good job in separating the two classes. In such cases, we attempt to learn a $d \times D$ matrix L from the data such that the points are separated in a much better fashion in the projected space. That is, the L2-distances between pairs of points $x_i$ and $x_j$ in the projected space, given by $ ||\;Lx_i - Lx_j\;||_2^2 $ are such that similar points are brought closer together whereas dissimilar points are pushed apart.

<img src="images/ml-proj.png">

Such forms of metric learning approaches serve a dual purpose -- (a) they help in learning better metrics by bringing similar points closer and pushng dissimilar points away, and (b) they help in learning a compact and discriminative d-dimensional representation where $ d << D $ i.e. they also help in reducing the dimensionality of our face descriptors thereby making them suitable for large-scale applications.

In this tutorial, we will see how we can use face pairs from the CFPW dataset to learn a projection matrix that helps us generate compact and discriminative low-dimensional projections. We will also see whether these compact descriptors lead to a better verification performance than what we observed in the previous tutorial.

In [2]:
import cv2, math
import numpy as np

import torch
from torch.autograd import Variable
import torch.nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.serialization import load_lua
from torch.legacy import nn

from sklearn import metrics
from scipy.optimize import brentq
from scipy.interpolate import interp1d

import matplotlib.pyplot as plt
import matplotlib.image as mpimg
%matplotlib inline 
plt.ion()

Again, we define certain utility functions to load, pre-process data and also generate the pre-trained face descriptors. This should be familiar to you by now.

In [3]:
def loadImage(imgPath):
    inputImg = cv2.imread(imgPath)

    # re-scale the smaller dim (among width, height) to refSize
    refSize, targetSize = 256, 224
    imgRows, imgCols = inputImg.shape[0], inputImg.shape[1]
    if imgCols < imgRows: resizedImg = cv2.resize(inputImg, (refSize, refSize * imgRows / imgCols))
    else: resizedImg = cv2.resize(inputImg, (refSize * imgCols / imgRows, refSize))

    # center-crop
    oH, oW = targetSize, targetSize
    iH, iW = resizedImg.shape[0], resizedImg.shape[1]
    anchorH, anchorW = int(math.ceil((iH - oH)/2)), int(math.ceil((iW - oW) / 2))
    croppedImg = resizedImg[anchorH:anchorH+oH, anchorW:anchorW+oW]

    # convert shape from (height, width, 3) to (3, width, height)
    channel_1, channel_2, channel_3 = croppedImg[:, :, 0], croppedImg[:, :, 1], croppedImg[:, :, 2]
    croppedImg = np.empty([3, croppedImg.shape[0], croppedImg.shape[1]])
    croppedImg[0], croppedImg[1], croppedImg[2] = channel_1, channel_2, channel_3

    # subtract training mean
    inputImg = inputImg.astype(float)
    trainingMean = [129.1863, 104.7624, 93.5940]
    for i in range(3): croppedImg[i] = croppedImg[i] - trainingMean[i]
    return croppedImg

In [4]:
def getVggFeatures(imgPaths, preTrainedNet):
    nImgs = len(imgPaths)
    preTrainedNet.modules[31] = nn.View(nImgs, 25088)
    preTrainedNet = preTrainedNet.cuda()
    
    batchInput = torch.Tensor(nImgs, 3, 224, 224)
    for i in range(nImgs): batchInput[i] = torch.from_numpy(loadImage(imgPaths[i]))
    
    batchOutput = preTrainedNet.forward(batchInput.cuda())
    return preTrainedNet.modules[35].output.cpu()

### Network Architecture
Recall that our aim is to learn a simple projection matrix over the pre-trained, D = 4096-d face representations so that the resultant compact descriptors are able to better separate face pairs in terms of identity (become better at face verification). we also know that the operation performed by an fc-layer is exactly the same as a matrix-vector product. So, our network architecture would essentially consist of learning a d = 64 dimensional linear layer over the pre-trained VGG-Face descriptors.

However, there is a difference in the manner in which we are going to train this network. Traditional CNN architectures for classification are trained using images and their corresponding class labels. Here, however, our network is learning to separate(bring together) (dis)similar face pairs. This means that the training should also happen in the form of image pairs where the label asosciated with each image pair during training tells us whether the faces in the pair belong to the same identity or not.

A conceptually convenient way to represent such networks schematically is to visualize this as two identical networks (which share all weights and biases) where each of the network accepts one image from the pair and computes the respective representations. Such networks are called Siamese networks (figure below).

<img src="images/siamese.png">

Keeping this in mind, we define the architcture of our Siamese netowrk in PyTorch. Note that the final output of our network is not some feature vector or a vector of log-likelihoods. Instead, the output is the L2 distance between the representations that have been computed by each branch of the Siamese network. During the process of training, the network learns to minimize/maximize this L2 distance depending upon whether the input images are similar/dissimilar.

In [5]:
class Siamese(torch.nn.Module):
    def __init__(self):
        super(Siamese, self).__init__()
        self.fc1 = torch.nn.Linear(4096, 64)

    def forward(self, x1, x2):
        o1 = self.fc1(x1)
        o2 = self.fc1(x2)
        o = torch.sqrt(torch.sum(torch.mul(o1-o2, o1-o2), 1))
        return o

A subtle point that needs to be remembered is that the two networks in the Siamese architecture are merely for representational convenience. There is only a single network which exists in memory at all times.

We have consolidated the code for getting L2 distances between pairs of images in our dataset in the form of a function definition. This code is similar to the one that we saw in the previous tutorial.

In [6]:
def evaluate(net, dataset):
    nPairs, batchSize = len(dataset['pairs']), 10
    classifierScores, labels = [], []
    
    for startIdx in range(0, nPairs, batchSize):
        endIdx = min(startIdx+batchSize-1, nPairs-1)
        size = (endIdx - startIdx + 1)

        imgPaths1, imgPaths2, batchLabels = [], [], []
        for offset in range(size):
            pair = dataset['pairs'][startIdx+offset]
            imgPaths1.append("../../data/lab3/Experiment_3/" + pair.img1)
            imgPaths2.append("../../data/lab3/Experiment_3/" + pair.img2)
            batchLabels.append(int(pair.label) * -1)
    
        descrs1 = getVggFeatures(imgPaths1, vggFace).clone()
        descrs2 = getVggFeatures(imgPaths2, vggFace).clone()
        batchOutput = net(Variable(descrs1).cuda(), Variable(descrs2).cuda())
        
        classifierScores += batchOutput.data.cpu().numpy().T[0].tolist()
        labels += batchLabels
    
    return classifierScores, labels

In [7]:
class Metrics():
    def __init__(self, classifierScores, labels):
        self.scores = classifierScores
        self.labels = labels

    def getAvgDist(self):
        nSim, nDiss = 0, 0
        avgDistSim, avgDistDiss = 0.0, 0.0
        for i in range(len(self.scores)):
            if self.labels[i] == 1: 
                avgDistDiss += self.scores[i] 
                nDiss += 1
            else: 
                avgDistSim += self.scores[i]
                nSim += 1
        return avgDistSim/nSim, avgDistDiss/nDiss
    
    def getROC(self):
        fpr, tpr, thresholds = metrics.roc_curve(self.labels, self.scores)
        auc = metrics.auc(fpr, tpr)
        eer, r = brentq(lambda x : 1. - x - interp1d(fpr, tpr)(x), 0., 1., full_output=True)
        return eer, auc, fpr, tpr

The dataset (CFPW) remains the same as the last tutorial. As always, we also load the pre-trained VGG-Face model.

In [8]:
vggFace = load_lua("../../data/lab3/VGG_FACE_pyTorch_small.t7")
dataset = load_lua("../../data/lab3/Experiment_3/cfpw-facePairs-dataset.t7")

We initialize the Siamese network architecture and some other hyperparameters related to training.

In [9]:
torch.manual_seed(0)
np.random.seed(0)

net = Siamese()
criterion = torch.nn.HingeEmbeddingLoss()
optimizer = optim.SGD(net.parameters(), lr=0.00005, weight_decay=0.0005)

net = net.cuda()
criterion = criterion.cuda()

### Loss Function
The loss function that we use here is called the Hinge Embedding loss. It's formulation is given below -- 

loss $ (x, y) = \left( \frac{1+y}{2} \right) \; x + \left ( \frac{1-y}{2} \right ) \max\;(0, margin \; - \; x) $

where x is the L2 distance between the pair of input images and $y \in \{ +1, -1 \}$ is the label for the image pair. As you can see, for similar image pairs with a class label of y = 1, the loss tries to minimize the L2 distance whereas for dissimilar image pairs, it tries to push the distance to be greater than the hyperparameter margin.

Before we start training, let us look at the verification metrics on our dataset of 100 image pairs.

In [10]:
# Before training

scores, labels = evaluate(net, dataset)
verifMetric = Metrics(scores, labels)
avgDistSim, avgDistDiss = verifMetric.getAvgDist()
print "avgDistSim = ", avgDistSim, ", avgDistDiss = ", avgDistDiss

eer, auc, fpr, tpr = verifMetric.getROC()
print "EER = ", eer, ", AUC = ", auc

avgDistSim =  11.5818845654 , avgDistDiss =  12.8276516151
EER =  0.39 , AUC =  0.6484


Given below is the code for training our Siamese network using image pairs from the dataset.

In [13]:
nEpochs, nPairs, batchSize = 2, len(dataset['pairs']), 10
for epochCtr in range(nEpochs):
    
    shuffle = np.random.permutation(nPairs)
    runningLoss, iterCnt = 0.0, 0
    for startIdx in range(0, nPairs, batchSize):
        endIdx = min(startIdx + batchSize - 1, nPairs - 1)
        size = endIdx - startIdx + 1
    
        imgPaths1, imgPaths2, labels = [], [], []
        for offset in range(size):
            pair = dataset['pairs'][shuffle[startIdx+offset]]
            imgPaths1.append("../../data/lab3/Experiment_3/" + pair.img1)
            imgPaths2.append("../../data/lab3/Experiment_3/" + pair.img2)
            labels.append(int(pair.label))
        
        descrs1 = getVggFeatures(imgPaths1, vggFace).clone()
        descrs2 = getVggFeatures(imgPaths2, vggFace).clone()
        
        batchOutput = net(Variable(descrs1).cuda(), Variable(descrs2).cuda())
        loss = criterion(batchOutput, Variable(torch.Tensor(labels)).cuda())
        loss.backward()
        optimizer.step()
        
        runningLoss += loss.data[0]
        iterCnt += 1
    
    print "epoch ", epochCtr, "/", nEpochs, ": loss = ", runningLoss/iterCnt 

epoch  0 / 2 : loss =  5.76211078167
epoch  1 / 2 : loss =  4.93768441677


After we are done training, let us take a look at the same metrics once again.

In [14]:
# After training

scores, labels = evaluate(net, dataset)
verifMetric = Metrics(scores, labels)
avgDistSim, avgDistDiss = verifMetric.getAvgDist()
print "avgDistSim = ", avgDistSim, ", avgDistDiss = ", avgDistDiss

eer, auc, fpr, tpr = verifMetric.getROC()
print "EER = ", eer, ", AUC = ", auc

avgDistSim =  8.24356043816 , avgDistDiss =  12.2217759705
EER =  0.153333333333 , AUC =  0.9352


We see that there is a significant increase in the verification performance as a result of metric learning. Also, note that the gap between the distances of the similar and dissimilar face pairs has increased. What makes these results even more interesting is the fact that these values are better than what we obtained in the previous tutorial. Recall that previously, we were working with 4096-d descriptors whereas now we have only 64-d descriptors.

### Exercises
1. These AUC metric that we obtain for verification prior to training is significantly worse than what we observed in the previous tutorial. Why is that so?

2. As part of the binary classification metrics, we have computed a quantity called the EER which stands for Equal Error Rate. What does it represent? By looking at the way it has been evaluated and the documentation for the functions used, can you figure out it's relation to fpr/tpr?

3. Also, given the code to compute EER, go back to the previous tutorial on Face Verification and compute the EER for the CFPW and the LFW datasets.