# MobileFaceNet_Training_Step_by_Step

Now let's turn to MobileFaceNet training and evaluation. In this tutorial, we walk step by step through building the ArcFace loss function, training the network, and evaluating the system.

## Loss Function 

Since the face network is meant to turn a face image into highly discriminative features, the challenge in feature learning is to design a loss function that enhances discriminative power. There are two leading approaches. One is to learn the feature vector directly, as in [Contrastive Loss](http://yann.lecun.com/exdb/publis/pdf/chopra-05.pdf) and [Triplet Loss](https://www.cv-foundation.org/openaccess/content_cvpr_2015/ext/1A_089_ext.pdf). The other is to treat face recognition as a classification problem that separates different identities. In this repo, we use one of the state-of-the-art algorithms, [insightface](https://arxiv.org/pdf/1801.07698.pdf) (the ArcFace loss), as a multi-class classifier. It not only separates the identities effectively, but also encourages strong intra-class compactness and inter-class discrepancy at the same time. MobileFaceNet training supervised by the insightface loss is illustrated below:

<img src="images/ipy_pic/diagram.png"  width="900" style="float: left;">
<img src="images/ipy_pic/equation.png"  width="400" style="float: left;">

The embedding feature vector dimension 'd' from MobileFaceNet is set to 512 in this work. The feature vector goes through a fully connected layer with weight 'W' of shape 'd x n' for 'n' classes. In the insightface algorithm, the feature vector and the weight (bias = 0) are both normalized, so their product gives the cosine of the angle between feature and weight. After the angular margin penalty 'm' is added to the target-class angle and the result is multiplied by the feature scale 's', the logits go through the softmax function to obtain the cross entropy loss. Each column of 'W' acts as the centre of one class, enforcing higher similarity for intra-class samples and diversity for inter-class samples.
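
Written out, the loss described in this paragraph has the following form, as given in the ArcFace paper linked above. The batch size N, the label y_i of sample i, and the angle θ_j between the normalized feature x_i and the j-th column of the normalized weight W are notation introduced here for the formula:

$$
L = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{s\cos(\theta_{y_i}+m)}}{e^{s\cos(\theta_{y_i}+m)}+\sum_{j=1,\,j\neq y_i}^{n}e^{s\cos\theta_j}},
\qquad
\cos\theta_j=\frac{W_j^{\top}x_i}{\lVert W_j\rVert\,\lVert x_i\rVert}
$$

Only the logit of the true class y_i carries the margin m; all other logits are the plain scaled cosines.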

[wujiyang](https://github.com/wujiyang/Face_Pytorch) presents a very good demo that illustrates the performance difference between the traditional softmax loss and softmax + center loss. Arcface loss essentially outperforms the center loss.

<table><tr>
<td> <img src="images/ipy_pic/softmax.gif"  width="300" style="float: left;"> </td>
<td> <img src="images/ipy_pic/softmax_center.gif"  width="300" > </td>
</tr></table>

Here we build a class named 'Arcface' in PyTorch that computes the scaled cosine logits, which are then passed to the softmax function.

In [1]:
from torch.nn import Linear, Conv2d, BatchNorm1d, BatchNorm2d, PReLU, ReLU, Sigmoid, Dropout2d, Dropout, AvgPool2d, MaxPool2d, AdaptiveAvgPool2d, Sequential, Module, Parameter
import torch.nn.functional as F
from torch import nn
import torch
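import math

def l2_norm(input, axis=1):
    # helper used by Arcface.forward below; assumed to match the one in face_model
    # divides by the L2 norm along `axis` so that every slice has unit length
    return input / input.norm(p=2, dim=axis, keepdim=True)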

class Arcface(Module):
    # implementation of the additive angular margin loss (ArcFace) in https://arxiv.org/abs/1801.07698
    def __init__(self, embedding_size=512, classnum=51332,  s=64., m=0.5):
        super(Arcface, self).__init__()
        self.classnum = classnum
        self.kernel = Parameter(torch.Tensor(embedding_size,classnum))
        # initialise the kernel so that every column (class centre) has unit L2 norm
        self.kernel.data.uniform_(-1, 1).renorm_(2,1,1e-5).mul_(1e5)
        self.m = m # the margin value, default is 0.5
        self.s = s # scalar value default is 64, see normface https://arxiv.org/abs/1704.06369
        self.cos_m = math.cos(m)
        self.sin_m = math.sin(m)
        self.mm = self.sin_m * m  # equals sin(pi - m) * m; offset used by the fallback branch in forward
        self.threshold = math.cos(math.pi - m)
    def forward(self, embbedings, label):
        # weights norm
        nB = len(embbedings)
        kernel_norm = l2_norm(self.kernel,axis=0) # normalize for each column
        # cos(theta+m)
        cos_theta = torch.mm(embbedings,kernel_norm)
        cos_theta = cos_theta.clamp(-1,1) # for numerical stability
        cos_theta_2 = torch.pow(cos_theta, 2)
        sin_theta_2 = 1 - cos_theta_2
        sin_theta = torch.sqrt(sin_theta_2)
        cos_theta_m = (cos_theta * self.cos_m - sin_theta * self.sin_m)
        cond_v = cos_theta - self.threshold
        cond_mask = cond_v <= 0
        keep_val = (cos_theta - self.mm) # when theta + m exceeds pi, fall back to a cosface-style linear penalty
        cos_theta_m[cond_mask] = keep_val[cond_mask]
        output = cos_theta * 1.0 # a little bit hacky way to prevent in_place operation on cos_theta
        idx_ = torch.arange(0, nB, dtype=torch.long, device=label.device) # row indices on the same device as the labels
        output[idx_, label] = cos_theta_m[idx_, label]
        output *= self.s # scale up in order to make softmax work, first introduced in normface
        return output
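
To see what the margin does to a single logit, here is a minimal standalone sketch in plain Python. The function `arcface_target_logit` is a hypothetical helper written for this illustration. It follows the same two branches as the class above for the ground-truth class only, and takes the cosine between one feature and its class centre as input.

```python
import math

def arcface_target_logit(cos_theta, s=64.0, m=0.5):
    """Scaled logit of the ground-truth class after the angular margin.

    Uses cos(theta + m) while theta + m stays below pi, otherwise the
    linear fallback cos(theta) - m * sin(m), as in the class above.
    """
    threshold = math.cos(math.pi - m)
    if cos_theta > threshold:
        theta = math.acos(cos_theta)
        return s * math.cos(theta + m)
    return s * (cos_theta - m * math.sin(m))

# A sample 60 degrees away from its class centre: the plain scaled logit is
# s * cos(60 deg); the margin lowers it, so training has to shrink the angle further.
cos_theta = math.cos(math.radians(60))
plain = 64.0 * cos_theta
with_margin = arcface_target_logit(cos_theta)
print('plain: {:.2f}, with margin: {:.2f}'.format(plain, with_margin))
```

The logits of all other classes stay at `s * cos_theta`, which is why only the entries at `[idx_, label]` are overwritten in `forward`.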

## Training

The training and evaluation data can be downloaded from the [Dataset Zoo](https://github.com/deepinsight/insightface/wiki/Dataset-Zoo) provided by deepinsight. All face images have been aligned by MTCNN and cropped to 112x112. MS1M-ArcFace, which has 85742 classes and 5.8 million images, is recommended for MobileFaceNet training. The images are preprocessed before being passed to MobileFaceNet:

In [2]:
import numpy as np
import cv2
import os
import torch.utils.data as data

import torch
import torchvision.transforms as transforms

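def img_loader(path):
    # cv2.imread returns None instead of raising when a file is missing or unreadable
    img = cv2.imread(path)
    if img is None:
        raise IOError('Cannot load image ' + path)
    if len(img.shape) == 2:
        # replicate a single-channel image to three channels
        img = np.stack([img] * 3, 2)
    return img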

class MS1M(data.Dataset):
    def __init__(self, root, file_list, transform=None, loader=img_loader):

        self.root = root
        self.transform = transform
        self.loader = loader

        image_list = []
        label_list = []
        with open(file_list) as f:
            img_label_list = f.read().splitlines()
        for info in img_label_list:
            image_path, label_name = info.split(' ')
            image_list.append(image_path)
            label_list.append(int(label_name))

        self.image_list = image_list
        self.label_list = label_list
        self.class_nums = len(np.unique(self.label_list))
        print("MS1M dataset size: ", len(self.image_list), '/', self.class_nums)

    def __getitem__(self, index):
        img_path = self.image_list[index]
        label = self.label_list[index]

        img = self.loader(os.path.join(self.root, img_path))

        # random flip with ratio of 0.5
        flip = np.random.choice(2) * 2 - 1
        if flip == 1:
            img = cv2.flip(img, 1)

        if self.transform is not None:
            img = self.transform(img)
        else:
            img = torch.from_numpy(img)

        return img, label

    def __len__(self):
        return len(self.image_list)
    
transform = transforms.Compose([
        transforms.ToTensor(),  # range [0, 255] -> [0.0,1.0]
        transforms.Normalize(mean=(0.5, 0.5, 0.5), std=(0.5, 0.5, 0.5))])  # range [0.0, 1.0] -> [-1.0,1.0]

root = 'data_set/faces_emore_images'
file_list = 'data_set/faces_emore_images/faces_emore_align_112.txt'
dataset = MS1M(root, file_list, transform=transform) 

# shuffle=True so that a batch mixes identities (assuming the file list is grouped by identity)
trainloader = data.DataLoader(dataset, batch_size=128, shuffle=True, num_workers=2, drop_last=False)
for det in trainloader:
    print('{} data loading as shape:'.format('MS1M'), det[0].shape)
    print('{} label loading as shape:'.format('MS1M'), det[1].shape)
    break

MS1M dataset size:  5822653 / 85742
MS1M data loading as shape: torch.Size([128, 3, 112, 112])
MS1M label loading as shape: torch.Size([128])
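
The value ranges noted in the comments of the `transform` above can be checked by hand. The sketch below reproduces only the arithmetic for a single pixel value in plain Python; the two helper names are hypothetical. `ToTensor` additionally reorders a uint8 image from height x width x channel to channel x height x width, which is why the batch shape printed above is `[128, 3, 112, 112]`.

```python
def to_unit_range(pixel):
    """ToTensor step: scale an 8-bit pixel value from [0, 255] to [0.0, 1.0]."""
    return pixel / 255.0

def normalize(value, mean=0.5, std=0.5):
    """Normalize step: (value - mean) / std maps [0.0, 1.0] to [-1.0, 1.0]."""
    return (value - mean) / std

for pixel in (0, 128, 255):
    print(pixel, '->', round(normalize(to_unit_range(pixel)), 4))
```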


The overall training process is demonstrated below. SGD with Nesterov momentum is used together with a scheduled learning rate decay (MultiStepLR). Here we provide a demo to show how the training proceeds; for the detailed training script, refer to "train.py" in this repo. The accuracy will stay at zero for a while, be patient :)
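
As a rough illustration of the scheduled decay, the sketch below computes the learning rate per epoch for a multi-step schedule. It assumes the values used in the demo cell that follows (base rate 0.01, milestones 6, 8 and 10, decay factor 0.3) and that the scheduler is stepped once at the end of each epoch. `multistep_lr` is a hypothetical helper, not part of this repo.

```python
from bisect import bisect_right

def multistep_lr(epoch, base_lr=0.01, milestones=(6, 8, 10), gamma=0.3):
    """Learning rate in `epoch`: base_lr * gamma ** (number of milestones already reached)."""
    return base_lr * gamma ** bisect_right(milestones, epoch)

for epoch in range(12):
    print('epoch {:2d}: lr = {:.6f}'.format(epoch, multistep_lr(epoch)))
```

With these values the rate stays at 0.01 for epochs 0 to 5 and is multiplied by 0.3 at each of epochs 6, 8 and 10.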

In [3]:
import sys
sys.path.append('..')
import argparse
import torch
import torch.utils.data as data
import torch.optim as optim
from torch.optim import lr_scheduler
from face_model import Backbone, MobileFaceNet, Arcface
import time

device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
model = MobileFaceNet(512).to(device)  
margin = Arcface(embedding_size=512, classnum=85742,  s=32., m=0.5).to(device)
    
criterion = torch.nn.CrossEntropyLoss().to(device)
optimizer_ft = optim.SGD([
    {'params': model.parameters(), 'weight_decay': 5e-4},
    {'params': margin.parameters(), 'weight_decay': 5e-4}], lr=0.01, momentum=0.9, nesterov=True)

exp_lr_scheduler = lr_scheduler.MultiStepLR(optimizer_ft, milestones=[6, 8, 10], gamma=0.3) 
total_iters = 0
total_epoch = 12
for epoch in range(total_epoch):
    # train model
    model.train()
    since = time.time()
    for det in trainloader:
        img, label = det[0].to(device), det[1].to(device)
        optimizer_ft.zero_grad()

        raw_logits = model(img)             # embeddings from the backbone
        output = margin(raw_logits, label)  # scaled logits with the angular margin applied
        loss = criterion(output, label)
        loss.backward()
        optimizer_ft.step()

        total_iters += 1
        # print train information
        if total_iters % 100 == 0:
            # accuracy on the current batch, measured on the margin-penalised logits
            _, preds = torch.max(output.data, 1)
            total = label.size(0)
            correct = (preds == label).sum().item()
            time_cur = (time.time() - since) / 100
            since = time.time()

            lr = optimizer_ft.param_groups[0]['lr']
            print("Epoch {}/{}, Iters: {:0>6d}, loss: {:.4f}, train_accuracy: {:.4f}, time: {:.2f} s/iter, learning rate: {}"
                  .format(epoch, total_epoch-1, total_iters, loss.item(), correct/total, time_cur, lr))
        if total_iters == 1000:  # demo only: stop after 1000 iterations
            break
    # decay the learning rate once per epoch, after the optimizer updates (ordering expected by PyTorch >= 1.1)
    exp_lr_scheduler.step()
    break  # demo only: do not continue with further epochs

Epoch 0/11, Iters: 000100, loss: 27.6344, train_accuracy: 0.0000, time: 0.37 s/iter, learning rate: 0.01
Epoch 0/11, Iters: 000200, loss: 27.5277, train_accuracy: 0.0000, time: 0.37 s/iter, learning rate: 0.01
Epoch 0/11, Iters: 000300, loss: 27.9309, train_accuracy: 0.0000, time: 0.38 s/iter, learning rate: 0.01
Epoch 0/11, Iters: 000400, loss: 27.5351, train_accuracy: 0.0000, time: 0.38 s/iter, learning rate: 0.01
Epoch 0/11, Iters: 000500, loss: 27.3709, train_accuracy: 0.0000, time: 0.38 s/iter, learning rate: 0.01
Epoch 0/11, Iters: 000600, loss: 27.9678, train_accuracy: 0.0000, time: 0.38 s/iter, learning rate: 0.01
Epoch 0/11, Iters: 000700, loss: 27.6118, train_accuracy: 0.0000, time: 0.38 s/iter, learning rate: 0.01
Epoch 0/11, Iters: 000800, loss: 27.6302, train_accuracy: 0.0000, time: 0.38 s/iter, learning rate: 0.01
Epoch 0/11, Iters: 000900, loss: 28.2122, train_accuracy: 0.0000, time: 0.38 s/iter, learning rate: 0.01
Epoch 0/11, Iters: 001000, loss: 28.0151, train_accurac
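
The loss of roughly 27 to 28 in the log above can be sanity-checked by hand. The sketch below is a back-of-the-envelope estimate, not part of the training code: it assumes that at the start of training every feature is nearly orthogonal to every class centre, so all cosines are about 0, and it plugs in the demo settings (85742 classes, s = 32, m = 0.5). `initial_arcface_loss` is a hypothetical helper.

```python
import math

def initial_arcface_loss(n_classes, s, m):
    """Cross-entropy when cos(theta) is ~0 for every class (assumed state at initialisation)."""
    target_logit = s * math.cos(math.pi / 2 + m)  # margin applied to the true class
    others = (n_classes - 1) * math.exp(0.0)      # s * cos(pi / 2) = 0 for every other class
    return math.log(math.exp(target_logit) + others) - target_logit

estimate = initial_arcface_loss(n_classes=85742, s=32.0, m=0.5)
print('estimated initial loss: {:.2f}'.format(estimate))
```

The estimate comes out a little below 27, the same order as the logged values, which suggests the network has barely moved away from its initial state within the first 1000 iterations. While the true-class logit is this far below the others, the predicted class is almost never the labelled one, so a training accuracy of zero is expected at this stage.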

The overall training results are shown below. This is definitely not the best result that can be reached. One can fine-tune the learning rate hyperparameters or the arcface loss parameters 's' and 'm' to reach a higher training accuracy.
<table><tr>
<td> <img src="images/ipy_pic/loss_train.png"  width="500" style="float: left;"> </td>
<td> <img src="images/ipy_pic/accuracy_train.png"  width="500" > </td>
</tr></table>

## Evaluation 

There are quite a lot of published datasets that we can use to evaluate our system. In this repo, we use LFW, AgeDB-30 and CFP-FP to evaluate the training results. These datasets follow the pair matching paradigm: given a pair of face images, decide whether the images are of the same person. LFW (Labeled Faces in the Wild) contains 13233 web-collected images from 5749 identities with large variations in pose, exposure and illumination. AgeDB-30 (Age Database) adds variations in age: the minimum and maximum ages are 3 and 101, which makes it more challenging. CFP-FP (Celebrities in Frontal-Profile) consists of frontal and profile images, and is thus the most challenging of the three.

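The evaluation results of this work are shown in the table below (accuracy in %). Using the 'Flip' option when encoding the embedding feature vector raises LFW accuracy by about 0.07 percentage points (99.52 vs. 99.45 with the L2 score) and AgeDB-30 accuracy by 0.5 to 0.7 points, while CFP-FP is slightly lower with it. With 'Flip' enabled, the L2 distance score slightly outperforms cosine similarity; without it the two are practically identical (this is not necessarily the trend in other cases, but it is what we observe in this work).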


| Eval Type | Score | LFW (%) | AgeDB-30 (%) | CFP-FP (%) |
|:---------:|:-----:|:-------:|:------------:|:----------:|
| Flip      | L2    |  99.52  |    96.30     |   92.93    |
| Flip      | Cos   |  99.50  |    96.18     |   92.84    |
| UnFlip    | L2    |  99.45  |    95.63     |   93.10    |
| UnFlip    | Cos   |  99.45  |    95.65     |   93.10    |
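
To make the two score types in the table concrete, here is a minimal sketch of how a pair of embeddings could be scored. The helper names and the toy 4-d vectors are hypothetical, and the sketch assumes both embeddings are L2-normalized first; how "Evaluation.py" computes its scores is not shown in this tutorial and may differ.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine_similarity(a, b):
    """Dot product of the two normalized vectors (higher means more similar)."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))

def l2_distance(a, b):
    """Euclidean distance between the two normalized vectors (lower means more similar)."""
    a, b = l2_normalize(a), l2_normalize(b)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

emb_left = [0.2, -1.3, 0.7, 0.1]    # toy vectors standing in for 512-d embeddings
emb_right = [0.3, -1.1, 0.9, -0.2]
cos = cosine_similarity(emb_left, emb_right)
dist = l2_distance(emb_left, emb_right)
print('cos: {:.4f}, squared l2: {:.4f}, 2 - 2*cos: {:.4f}'.format(cos, dist ** 2, 2 - 2 * cos))
```

Under the normalization assumption the squared L2 distance equals 2 - 2·cos, so the two scores rank pairs in the same order.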


The detailed evaluation code can be found in "Evaluation.py". It extracts the feature vectors for each pair of images, separates the dataset into 10 folds, finds the best threshold on 9 of the 10 folds and evaluates the accuracy on the remaining one. The process is repeated so that each fold is held out once, and the averaged accuracy is reported.
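
The fold procedure described above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the code from "Evaluation.py": it uses a similarity score where a higher value means "same person" (for a distance score the comparison would be reversed), a fixed grid of candidate thresholds, contiguous folds, and synthetic scores. All function names are hypothetical.

```python
import random
from statistics import mean

def accuracy(scores, labels, threshold):
    """Fraction of pairs classified correctly when 'same' means score >= threshold."""
    hits = sum((s >= threshold) == bool(y) for s, y in zip(scores, labels))
    return hits / len(scores)

def best_threshold(scores, labels, candidates):
    """Candidate threshold with the highest accuracy on the given pairs."""
    return max(candidates, key=lambda t: accuracy(scores, labels, t))

def k_fold_eval(scores, labels, n_folds=10):
    """Pick the threshold on n_folds - 1 folds, score the held-out fold, repeat per fold."""
    fold_size = len(scores) // n_folds                # leftover pairs, if any, are never held out
    candidates = [i / 100 for i in range(-100, 101)]  # similarity thresholds in [-1, 1]
    accs, thresholds = [], []
    for k in range(n_folds):
        lo, hi = k * fold_size, (k + 1) * fold_size
        t = best_threshold(scores[:lo] + scores[hi:], labels[:lo] + labels[hi:], candidates)
        accs.append(accuracy(scores[lo:hi], labels[lo:hi], t))
        thresholds.append(t)
    return accs, thresholds

# Synthetic similarity scores: matching pairs (label 1) score higher on average.
rng = random.Random(0)
labels = [i % 2 for i in range(600)]
scores = [rng.gauss(0.6 if y else 0.1, 0.15) for y in labels]
accs, thresholds = k_fold_eval(scores, labels)
print('mean accuracy: {:.4f}, mean threshold: {:.4f}'.format(mean(accs), mean(thresholds)))
```

Because the threshold is always chosen without looking at the held-out fold, the averaged accuracy is not inflated by tuning on the test pairs.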

In [4]:
from Evaluation import getFeature, evaluation_10_fold
from data_set.dataloader import LFW, CFP_FP, AgeDB30

detect_model = MobileFaceNet(512).to(device) 
detect_model.load_state_dict(
            torch.load('Weights/MobileFace_Net', map_location=lambda storage, loc: storage))
detect_model.eval()

root = 'data_set/LFW/lfw_align_112'
file_list = 'data_set/LFW/pairs.txt'
dataset = LFW(root, file_list, transform=transform)
dataloader = data.DataLoader(dataset, batch_size=128, shuffle=False, num_workers=2, drop_last=False)
featureLs, featureRs = getFeature(detect_model, dataloader, device, flip = True)
ACCs, threshold = evaluation_10_fold(featureLs, featureRs, dataset, method = 'l2_distance')
    
for i in range(len(ACCs)):
    print('{} accuracy: {:.2f} threshold: {:.4f}'.format(i+1, ACCs[i] * 100, threshold[i]))
print('--------')
print('Average Acc:{:.4f} Average Threshold:{:.4f}'.format(np.mean(ACCs) * 100, np.mean(threshold)))

1 accuracy: 99.33 threshold: 1.4237
2 accuracy: 99.83 threshold: 1.4228
3 accuracy: 100.00 threshold: 1.4213
4 accuracy: 99.33 threshold: 1.4214
5 accuracy: 98.67 threshold: 1.4244
6 accuracy: 99.67 threshold: 1.4220
7 accuracy: 98.67 threshold: 1.4206
8 accuracy: 100.00 threshold: 1.4282
9 accuracy: 100.00 threshold: 1.4229
10 accuracy: 99.67 threshold: 1.4227
--------
Average Acc:99.5167 Average Threshold:1.4230


## End