# What is Face Recognition?
## Face Verification vs Face Recognition
- Verification :
    - Input image, name/ID
    - Outputs whether the input image is that of the claimed person or not.
    - This is a one-to-one problem: we only want to know whether the person is who they claim to be.
- Recognition :
    - Has a database of K persons
    - Get an input image
    - Output ID if the image is any of the K persons (or not recognized)

# One Shot Learning
- One of the challenges of face recognition is that we need to solve the one-shot learning problem.
- What that means is that for most face recognition applications we need to be able to recognize a person given just one single image or given just one example of that person's face. Historically, deep learning algorithms don't work well if we have only one training example.
- Let's say we have a database of 4 pictures of employees in our organization.
    - Now let's say someone shows up at the office and they want to be let through the turnstile.
    - What the system has to do is, despite having seen only one image of that person, recognize that this is actually the same person.
    - In contrast, if it sees someone that's not in the database, then it should recognize that this is not any of the 4 persons in the database.
    - So in the one-shot learning problem, we have to learn from just one example to recognize the person again.
    - We need this for most face recognition systems, because we might have only one picture of each employee or team member in our database.
- One approach we could try is to input the image of the person, feed it to a ConvNet and have it output a label y, using a softmax unit with 5 outputs corresponding to each of these 4 persons or none of the above.
- But this doesn't work well: such a small training set is not enough to train a robust neural network for this task.
- Also, what if a new person joins our team? Now we have 5 persons to recognize, so there should be 6 outputs. Do we have to retrain the ConvNet every time?
- ![image.png](attachment:image.png)
## Learning a 'similarity' function
- To carry out face recognition with one-shot learning, what we're going to do is learn a similarity function.
- We want a neural network to learn a function d, which inputs 2 images and outputs the degree of difference between the 2 images.
- If the 2 images are of the same person, we want this to output a small number. If the 2 images are of different persons, we want it to output a large number.
- At recognition time, if the degree of difference between the 2 images is less than some threshold tau, which is a hyperparameter, we predict that the 2 pictures are of the same person. If it is greater than tau, we predict that they are of different persons.
- To use this for a recognition task, given the new picture, we use the function d to compare it with the 1st image in our database, which might output a large number. Then we compare the new picture with the 2nd image in our database, which might output a very small number. We do this for the other images in our database and so on.
- In contrast, if someone not in our database shows up, as we use the function d to make all of these pairwise comparisons, hopefully d will output a very large number for all 4 pairwise comparisons. Then we say that this is not any one of the 4 persons in the database.
- This allows us to solve the one-shot learning problem. We can learn the function d, which inputs a pair of images and tells us if they're the same person or different persons. Then if someone new joins our team, we can add a 5th person to our database, and it just works fine without retraining.
- ![image-2.png](attachment:image-2.png)
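
The comparison procedure described above can be sketched as follows. This is a minimal illustration in Python with NumPy, not the learned model: `d` here is a hypothetical stand-in (mean squared pixel difference) for the function the neural network would learn, and the names `verify` and `recognize`, the toy 4x4 "images" and the value of `tau` are all assumptions made for the example.

```python
import numpy as np

def d(img1, img2):
    """Stand-in for the learned degree-of-difference function (assumption):
    mean squared pixel difference. In the notes, d is learned by a neural network."""
    return float(np.mean((img1 - img2) ** 2))

def verify(new_img, db_img, tau):
    """Face verification: predict 'same person' if d is below the threshold tau."""
    return d(new_img, db_img) < tau

def recognize(new_img, database, tau):
    """Face recognition: compare the new image with every database image and
    return the closest ID, or None if even the closest one is not below tau."""
    best_id, best_dist = None, float("inf")
    for person_id, db_img in database.items():
        dist = d(new_img, db_img)
        if dist < best_dist:
            best_id, best_dist = person_id, dist
    return best_id if best_dist < tau else None

# Toy database: one constant-valued 4x4 "image" per person (illustrative data only).
database = {
    "alice": np.full((4, 4), 0.0),
    "bob": np.full((4, 4), 0.3),
    "carol": np.full((4, 4), 0.6),
    "dave": np.full((4, 4), 0.9),
}
tau = 0.01  # hyperparameter, chosen by hand for this toy data

new_img = np.full((4, 4), 0.32)   # close to bob's picture
stranger = np.full((4, 4), 0.45)  # not close to anyone in the database

print(recognize(new_img, database, tau))
print(recognize(stranger, database, tau))
```

Adding a 5th person only requires adding one more entry to `database`; nothing in `d` has to change, which is the point of learning a similarity function instead of a fixed-size softmax.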

# Siamese Network
- The job of the similarity function d, is to input 2 faces and tell us how similar or how different they are. A good way to do this is to use a Siamese network.
- We feed an input image x(1) through a sequence of convolutional, pooling and fully connected layers, and end up with a feature vector.
- Say this feature vector has 128 numbers, computed by some fully connected layer. We call this vector f(x(1)); it is an encoding of the input image x(1).
- To build a face recognition system, if we want to compare 2 pictures, we feed the 2nd picture x(2) to the same neural network with the same parameters and get a different vector of 128 numbers, f(x(2)), which encodes the 2nd picture.
- If we believe that these encodings (128-dimensional vectors) are a good representation of the 2 images, we can then define the distance d as the norm of the difference between the 2 encodings: d(x(1), x(2)) = ||f(x(1)) - f(x(2))||^2.
- The idea of running 2 identical convolutional neural networks on 2 different inputs and then comparing their outputs is called a Siamese neural network architecture.
- ![image.png](attachment:image.png)
- How do we train this Siamese neural network?
    - Remember that these 2 neural networks have the same parameters.
    - So what we want to do is train the neural network so that the encoding it computes results in a function d that tells us when 2 pictures are of the same person. More formally, the parameters of the neural network define an encoding f(x_i).
    - So given any input image x_i, the neural network outputs the 128 dimensional encoding f(x_i).
    - We want to learn parameters so that if 2 pictures x_i and x_j are of the same person, the distance between their encodings is small. In contrast, if x_i and x_j are of different persons, we want the distance between their encodings to be large.
    - As we vary the parameters in all of these layers of the neural network, we end up with different encodings. What we can do is use backpropagation to vary all those parameters in order to make sure these conditions are satisfied.
- ![image-2.png](attachment:image-2.png)
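
To make the shared-parameter idea concrete, here is a minimal sketch in Python with NumPy. The encoder is a hypothetical stand-in for the ConvNet (a single random linear layer followed by ReLU, an assumption made only to keep the example short), and the 8x8 input size is also assumed. What matters is that both images go through the same `f` with the same parameters `W` and `b`, and that `d` is the squared norm of the difference between the two 128-dimensional encodings.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the ConvNet: one linear layer + ReLU.
# W and b are the shared parameters used for BOTH inputs.
W = rng.normal(scale=0.1, size=(128, 64))
b = np.zeros(128)

def f(x):
    """Encode an 8x8 image (flattened to 64 values) as a 128-dimensional vector."""
    return np.maximum(W @ x.ravel() + b, 0.0)

def d(x1, x2):
    """Squared Euclidean distance between the encodings of the two images."""
    diff = f(x1) - f(x2)
    return float(diff @ diff)

x1 = rng.random((8, 8))
x2 = rng.random((8, 8))
print(f(x1).shape)   # shape of one encoding
print(d(x1, x1))     # an image compared with itself
print(d(x1, x2))     # two different images
```

With untrained random parameters the distances carry no meaning about identity; training is what makes `d` small for the same person and large for different persons.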

# Triplet Loss
- One way to learn the parameters of the neural network, so that it gives us a good encoding for our pictures of faces, is to define the triplet loss function and apply gradient descent on it.
- To apply the triplet loss we need to compare pairs of images.
- To learn the parameters of the neural network, we have to look at several pictures at the same time.
- Given a pair of images, we want their encodings to be similar if the images are of the same person, and quite different if the images are of different persons.
- In triplet loss, positive = same person as the anchor, negative = different person.
- In triplet loss, we always look at 3 images at a time: an anchor image (A), a positive image (P) and a negative image (N).
- What we want for the parameters of our neural network is that the distance between encoding(A) and encoding(P) is smaller than or equal to the distance between encoding(A) and encoding(N): ||f(A) - f(P)||^2 <= ||f(A) - f(N)||^2.
- We add alpha, called the margin, which rules out the trivial solution where all the encodings become zero or identical: ||f(A) - f(P)||^2 - ||f(A) - f(N)||^2 + alpha <= 0.
- Here, the margin (alpha) pushes the anchor-positive pair and the anchor-negative pair away from each other. 
- ![image.png](attachment:image.png)
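
Why the margin is needed can be worked through in a few lines. This is a sketch using the notation f(A), f(P), f(N) for the three encodings; the numeric values at the end are illustrative assumptions, not values from the notes. Without a margin, the requirement is

$$
\|f(A) - f(P)\|^2 - \|f(A) - f(N)\|^2 \le 0 .
$$

If the network outputs the same encoding for every image, for example $f(x) = 0$, both norms are zero and the inequality reads $0 \le 0$, so it is satisfied without learning anything. With a margin $\alpha > 0$ the requirement becomes

$$
\|f(A) - f(P)\|^2 - \|f(A) - f(N)\|^2 + \alpha \le 0 ,
$$

and the same trivial encoding now gives $\alpha \le 0$, which is false. As a numeric illustration, with $\alpha = 0.2$ and $\|f(A) - f(P)\|^2 = 0.5$, the constraint holds only if $\|f(A) - f(N)\|^2 \ge 0.7$.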

## Loss Function
- The triplet loss function is defined on triples of images.
- Given 3 images : A, P and N
    - Positive is of the same person as the anchor, but the negative is of a different person than the anchor.
- If we have a training set of 10,000 pictures of 1,000 different persons, what we do is take the 10,000 pictures, use them to generate triplets, and then train our learning algorithm using gradient descent on the cost function.
- In order to define the dataset of triplets, we need some pairs of A and P, that is, pairs of pictures of the same person. In this training set we have on average 10 pictures of each of our 1,000 persons. If we had just 1 picture of each person, we couldn't actually train the system.
- After having trained the system, we can then apply it to our one-shot learning problem, where for our face recognition system we may have only a single picture of the person we are trying to recognize.
- But for our training set, we do need to make sure we have multiple images of the same person, at least for some people in the training set, so that we can have pairs of anchor and positive images.
- ![image-2.png](attachment:image-2.png)
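
As a hedged sketch of the loss on triples, the following Python/NumPy code uses the commonly used hinge form, max(||f(A) - f(P)||^2 - ||f(A) - f(N)||^2 + alpha, 0), summed over all triplets to give the cost J. The function names, the default `alpha=0.2` and the 2-dimensional toy encodings are assumptions chosen for illustration; real encodings would come from the network.

```python
import numpy as np

def triplet_loss(f_a, f_p, f_n, alpha=0.2):
    """Per-triplet loss. Each argument has shape (num_triplets, encoding_dim)."""
    pos_dist = np.sum((f_a - f_p) ** 2, axis=-1)  # ||f(A) - f(P)||^2
    neg_dist = np.sum((f_a - f_n) ** 2, axis=-1)  # ||f(A) - f(N)||^2
    # Zero once the negative is farther than the positive by at least alpha.
    return np.maximum(pos_dist - neg_dist + alpha, 0.0)

def cost(f_a, f_p, f_n, alpha=0.2):
    """Cost J: sum of the per-triplet losses over the training triplets."""
    return float(np.sum(triplet_loss(f_a, f_p, f_n, alpha)))

# Two toy triplets with 2-dimensional encodings (illustrative values only).
f_a = np.array([[0.0, 0.0], [0.0, 0.0]])
f_p = np.array([[0.1, 0.0], [0.1, 0.0]])
f_n = np.array([[1.0, 0.0], [0.2, 0.0]])  # first negative is far, second is close

print(triplet_loss(f_a, f_p, f_n))
print(cost(f_a, f_p, f_n))
```

The first triplet already satisfies the margin, so it contributes nothing to the cost; only the second one, where the negative is almost as close as the positive, produces a gradient.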
## Choosing the triplets A, P, N
- ![image-3.png](attachment:image-3.png)

## Training set using Triplet Loss
- To train on triplet loss, we need to take our training set and map it to a lot of triples.
- Having defined the training set of anchor, positive and negative triples, we use gradient descent to try to minimize the cost function J.
- Backpropagation then updates all the parameters of the neural network in order to learn an encoding so that d of 2 images is small when the 2 images are of the same person and large when they are of different persons.

![image.png](attachment:image.png)
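
Mapping a labeled training set to triples can be sketched as below. This is a minimal Python example that picks triplets at random, which is an assumption made for simplicity and not necessarily the selection strategy a real system would use. `make_triplets`, its parameters and the toy `labels` list are hypothetical names for illustration. Note that only persons with at least 2 pictures can supply an anchor-positive pair.

```python
import random
from collections import defaultdict

def make_triplets(labels, num_triplets, seed=0):
    """labels[i] is the person ID of training image i.
    Returns a list of (anchor_idx, positive_idx, negative_idx) tuples."""
    rng = random.Random(seed)
    by_person = defaultdict(list)
    for idx, person in enumerate(labels):
        by_person[person].append(idx)

    # Anchor/positive pairs need a person with at least 2 pictures.
    eligible = [p for p, idxs in by_person.items() if len(idxs) >= 2]
    if not eligible or len(by_person) < 2:
        raise ValueError("need a person with 2+ images and at least 2 persons")

    triplets = []
    for _ in range(num_triplets):
        person = rng.choice(eligible)
        anchor, positive = rng.sample(by_person[person], 2)  # two distinct images
        other = rng.choice([q for q in by_person if q != person])
        negative = rng.choice(by_person[other])
        triplets.append((anchor, positive, negative))
    return triplets

# Toy training set: person 0 has 3 images, person 1 has 2, person 2 has only 1.
labels = [0, 0, 0, 1, 1, 2]
triplets = make_triplets(labels, num_triplets=20)
print(triplets[:3])
```

Person 2 can still appear as a negative, but never as an anchor, because there is no second picture to act as the positive.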

# Face Verification and Binary Classification
- The triplet loss is one good way to learn the parameters of a ConvNet for face recognition. There's another way to learn these parameters.
- Another way to train the neural network is to take this Siamese network, have both copies compute these embeddings (128 dimensional embeddings), and then feed the embeddings to a logistic regression unit that makes a prediction. The target output is 1 if both images are of the same person, and 0 if they are of different persons.
- This is an alternative to the triplet loss for training the system; it treats face verification as a binary classification problem.
- What does this final logistic regression unit actually do?
    - The output y_hat is a sigmoid function applied to some set of features, but rather than feeding in the encodings directly, we take the differences between the encodings.
- In this learning formulation, the input is a pair of images and the output y is either 0 or 1, depending on whether we're inputting a pair of similar or dissimilar images.
- ![image.png](attachment:image.png)
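
The final logistic regression unit can be sketched as follows. This Python/NumPy example assumes the features are the element-wise absolute differences between the two encodings, which is one common choice; the notes only say "differences", so treat that detail, the function names and the hand-picked weights `w` and bias `b` as assumptions. In a trained system `w` and `b` would be learned together with the network.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_same_person(f_i, f_j, w, b):
    """y_hat = sigmoid(sum_k w[k] * |f_i[k] - f_j[k]| + b)."""
    features = np.abs(f_i - f_j)  # element-wise absolute difference (assumed feature)
    return float(sigmoid(features @ w + b))

# Hand-picked parameters for illustration: larger differences push y_hat toward 0.
w = -np.ones(128)
b = 2.0

f_same_a = np.ones(128)
f_same_b = np.ones(128)   # identical encoding, as if the same person
f_other = np.zeros(128)   # very different encoding

print(predict_same_person(f_same_a, f_same_b, w, b))  # above 0.5
print(predict_same_person(f_same_a, f_other, w, b))   # close to 0
```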

## Face Verification Supervised Learning
- We create a training set of just pairs of images, where the target label is 1 when the pair shows the same person and 0 when the pictures are of different persons.
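
Building such a set of labeled pairs from a labeled image collection might look like the sketch below. `make_pairs` and the toy `labels` list are hypothetical names for illustration, and enumerating every possible pair is an assumption that only makes sense for small datasets; a larger one would be sampled instead.

```python
from itertools import combinations

def make_pairs(labels):
    """labels[i] is the person ID of image i.
    Returns (i, j, y) for every pair of images, with y = 1 if both images
    show the same person and y = 0 otherwise."""
    return [
        (i, j, int(labels[i] == labels[j]))
        for i, j in combinations(range(len(labels)), 2)
    ]

labels = ["ann", "ann", "bob"]
pairs = make_pairs(labels)
print(pairs)
```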