# Detecting Deep Fakes
### author: Telemachos Chatzitheodorou  
Total time to complete the task: 38 hours

## Table of Contents

1. [Introduction](#introduction)  
   1.1 [Face generation](#fa)  
   1.2 [Face alteration](#fg)  
2. [How do you find which is real using Machine Learning?](#detectml)
3. [Solution Implementation](#solutionimplem)  
   3.1 [Autoencoder](#ae)  
   3.2 [Classification](#clas)  
4. [Sources](#sources)


## Introduction <a name="introduction" />

### Domain exploration.

Deepfakes are AI-generated images and videos whose purpose is to provide a realistic simulation of a scenario. When built well enough to be indistinguishable from real footage by humans (and machines), they may be used for nefarious purposes [[1]](https://www.bbc.com/news/business-51204954). 
The main uses are face generation, face alteration, and the manipulation of facial attributes and expressions.

#### Face generation <a name="fa" />
Although today's SOTA algorithms are very good at face generation, there are a few areas where they fall short, and a trained eye can detect generated faces with great accuracy. 

The main areas where generated faces are distinguishable from real ones are:
- artifacts in background  
<img src="https://graphics.reuters.com/CYBER-DEEPFAKE/ACTIVIST/nmovajgnxpa/img/background.jpg" width=256/> <img src="https://cms.qz.com/wp-content/uploads/2018/12/earring2.png?quality=75&strip=all&w=620&h=638&crop=1" height=256/>   
<sub>[*Source left*](https://graphics.reuters.com/CYBER-DEEPFAKE/ACTIVIST/nmovajgnxpa/), [*Source right*](https://qz.com/1510746/how-to-identify-fake-faces-generated-by-ai/)</sub>

- teeth (fuzzy rendering, asymmetrical, improperly aligned)  
<img src="https://miro.medium.com/max/700/0*Wmh_X7JDjHho5d5V.png" width=600/>    
<sub>[*Source*](https://arxiv.org/pdf/1912.04958.pdf)</sub>

- eyes (non-circular pupils)  
<img src="https://i.ibb.co/82pMQNX/pupils.png" width=256/>  
<sub>[*Source*](#mdfc)</sub>

- ears (asymmetric)  
- hair (fuzzy, not connecting)  
<img src="https://cms.qz.com/wp-content/uploads/2018/12/teeth3.png?quality=75&strip=all&w=620&h=626&crop=1" width=256/>  
<sub>[*Source*](https://qz.com/1510746/how-to-identify-fake-faces-generated-by-ai/)</sub>

- strange clothing  
<img src="https://cms.qz.com/wp-content/uploads/2018/12/background.png?quality=75&strip=all&w=620&h=614&crop=1" width=256/>  
<sub>[*Source*](https://qz.com/1510746/how-to-identify-fake-faces-generated-by-ai/)</sub>



The algorithms used for face generation are StyleGAN and its successor StyleGAN2, which resolved many issues of the first version, although the results are still not ideal.

#### Face alteration <a name="fg" />

Face alteration can be applied to both photos and videos. In a photo, you swap one face with another person's. In a video, the process is more complicated because you have to match the facial expressions, lighting, shadows, etc. for the result to look realistic. Combined with a matching audio track, the swapped person appears to be talking in the video.  
Luckily, the result has various flaws, and most of the time such photos and videos can be told apart from real ones. Automatic detection is much harder: the best submission in Kaggle's Deepfake Detection Challenge reached only 65% accuracy [[2]](#mdfc).  
Because the video side is far broader than the photo side and out of scope for this project, I will present only some issues with deepfake photos that also apply to videos.  
Usually, the major faults in face alteration are:
- blurred faces
- unnatural skin tones  
<img src="https://miro.medium.com/max/700/1*w-ak0m1yxSVHS4eDHjWvtg.jpeg" width=400/>  
<sub>[*Source*](#mdfc)</sub>

- neglected facial characteristics (longer/shorter forehead, double chin, etc.)  
<img src="https://miro.medium.com/max/700/1*NBbHYKkG6wqcvguDkFwI8Q.jpeg" width=500/>  
<sub>[*Source*](#mdfc)</sub>

- blurred boundaries on the edges of the face  
<img src="https://miro.medium.com/max/700/1*wyQasy3zWhzAeK24Kl8gQg.jpeg" width=500/>  
<sub>[*Source*](#mdfc)</sub>







## How do you find which is real using Machine Learning? <a name="detectml" />

To separate real from fake images I will use feature extraction: an AutoEncoder is trained on the images, and its encoder part is then reused to classify the given dataset with logistic regression.

### AutoEncoder

The AutoEncoder network consists of two parts, the encoder and the decoder, as shown in the picture below.  

<img src="https://pythonmachinelearning.pro/wp-content/uploads/2017/10/Vanilla-Autoencoder.png.webp" alt="autoencoder architecture" />  
<sub>[*Source*](https://pythonmachinelearning.pro/all-about-autoencoders/)</sub>  
The input and output layers have the same size, while the hidden layer is smaller than both. The hidden layer holds a compressed representation: the network learns two sets of weights (and biases), one that encodes the input data into the compressed representation and one that decodes the compressed representation back into the input space. This type of autoencoder is also known as undercomplete, because the compressed representation has fewer dimensions than the input [[4]](#dlbook).  
To evaluate how good the reconstruction is, I use the reconstruction error, which is the squared Euclidean distance $||x-\hat{x}||^2$ between the input $x$ and its reconstruction $\hat{x}$; averaged over all pixel values, this is the Mean Squared Error.  
Training an AutoEncoder does not require labeled data, so it is considered an unsupervised learning method.  
The goal is to extract a compact set of facial characteristics (the compressed representation) and use the encoder as the pretrained part of a classifier. 
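
As a minimal sketch of the reconstruction error described above, the snippet below computes the Mean Squared Error between an input and its reconstruction. The function name and the tiny 4-value "image" are hypothetical and only serve as illustration; the real images are 256x256x3 tensors handled by the framework's built-in loss.

```python
def mean_squared_error(original, reconstruction):
    """Average of the squared per-value differences between two flat pixel lists."""
    if len(original) != len(reconstruction):
        raise ValueError("both images must have the same number of values")
    squared_diffs = [(x - x_hat) ** 2 for x, x_hat in zip(original, reconstruction)]
    return sum(squared_diffs) / len(squared_diffs)


# Hypothetical flattened image (already scaled to [0, 1]) and an imperfect reconstruction.
x = [0.0, 0.5, 1.0, 0.25]
x_hat = [0.1, 0.5, 0.8, 0.25]

error = mean_squared_error(x, x_hat)
print(f"reconstruction error: {error:.4f}")
```

A perfect reconstruction gives an error of zero, and the error grows quadratically with the per-pixel deviation.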

### Classification
Having trained the autoencoder, I keep only the encoder part, which has learned to extract features from the training data, and add an output layer that performs logistic regression. The role of this layer is twofold: it classifies the data and it also fine-tunes the network.  

An illustration of the network is shown below.  
<img src="https://i.ibb.co/zhHLcJk/AE-LR.jpg" alt="AE-LR" border="0">  
<sub>[*Source*](#SDA)</sub>  
Because this is a supervised learning task, I need labeled data. For that purpose, the dataset provides two folders, one with fake and one with real images.










## Solution Implementation <a name="solutionimplem" />  

### Autoencoder <a name="ae" />
#### Available data 
First, let's examine the data at hand, specifically the images in the folder `dataset1`.

I imported the dataset using `tf.data.Dataset`.

<img src="https://i.ibb.co/h8FkTYW/data-eda.png" alt="dataset" />  
The pixel values of the images range from 0 to 255. I scale that range to [0., 1.] because neural networks train better on small inputs: this range matches the effective range of the activation functions, and large input values tend to have a disproportionate impact on the output of a network. Because an autoencoder's target is the input image itself, the normalization has to be applied to both the inputs and the targets.
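
The scaling step can be sketched in plain Python as follows. This is an illustration only: the helper name and the sample row are assumptions, and in the actual pipeline the same division would be applied to whole image tensors.

```python
def scale_pixels(pixels, max_value=255.0):
    """Map pixel intensities in [0, 255] to floats in [0.0, 1.0]."""
    return [p / max_value for p in pixels]


# Hypothetical row of raw pixel intensities.
raw_row = [0, 64, 128, 255]

# For an autoencoder the target is the image itself, so the same scaled
# values serve as both the input and the target.
scaled_input = scale_pixels(raw_row)
scaled_target = list(scaled_input)

print(scaled_input)
```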

#### Dataset Preparation
For the dataset I chose a split of 70%, 15%, 15% for the train, validation and test sets.  
Also:
- batch size = 64
- shuffle with a buffer size of 200

The resulting datasets have the following shapes:

- train_ds: ((64, 256, 256, 3), (64, 256, 256, 3)) 
- val_ds:  ((64, 256, 256, 3), (64, 256, 256, 3)) 
- test_ds: ((64, 256, 256, 3), (64, 256, 256, 3))
 
The first element of the tuple corresponds to the input and the second to the output.  

The number of samples for each one is:
- train_ds: 7000 
- val_ds: 1500 
- test_ds: 1500  
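
The sample counts above are consistent with a 70/15/15 split of 10,000 images. The sketch below reproduces that arithmetic and also derives the number of batches per epoch for a batch size of 64; the helper names are hypothetical, and integer percentages are used to avoid floating-point rounding.

```python
import math


def split_sizes(total, train_pct, val_pct):
    """Return (train, val, test) sample counts; the test set takes the remainder."""
    train = total * train_pct // 100
    val = total * val_pct // 100
    return train, val, total - train - val


def num_batches(num_samples, batch_size):
    """Number of batches per epoch when the last, smaller batch is kept."""
    return math.ceil(num_samples / batch_size)


sizes = split_sizes(10_000, train_pct=70, val_pct=15)
batches = [num_batches(n, 64) for n in sizes]

print(sizes)    # sample counts for train, validation and test
print(batches)  # batches per epoch for each split
```

Note that 7000 and 1500 are not multiples of 64, so the last batch of each split is smaller than 64 unless the remainder is dropped.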

The datasets are ready, but let's check if they are correctly assigning the same image to input and target.
<img src="https://i.ibb.co/prgY3f0/train-ds-input-target.png" alt="input target dataset" />  
The dataset seems to be built fine.  
Let's continue to the model building section.

#### Model Implementation
The network was implemented with Keras.
It has 3 layers in the encoder and 4 in the decoder. The fourth decoder layer is needed to map the feature maps back to the 3 channels of the images.

After copious testing, the best results came with the following hyperparameters.
- Encoder
  - convolution 2d (filters=168, kernel=3, activation=relu, padding=same, strides=2)  
  - convolution 2d (filters=64, kernel=3, activation=relu, padding=same, strides=2)
  - convolution 2d (filters=32, kernel=3, activation=relu, padding=same, strides=2)
- Decoder
  - convolution 2d transpose (filters=32, kernel=3, activation=relu, padding=same, strides=2)  
  - convolution 2d transpose (filters=64, kernel=3, activation=relu, padding=same, strides=2)
  - convolution 2d transpose (filters=168, kernel=3, activation=relu, padding=same, strides=2)
  - convolution 2d (filters=3, kernel=3, activation=relu, padding=same)
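
To see how the layers listed above change the shape of a 256x256x3 image, the sketch below traces the tensor shapes through the encoder and decoder. It relies on the "same" padding rule: a strided convolution produces `ceil(size / stride)` outputs per spatial dimension, and a strided transposed convolution produces `size * stride`. The helper names are hypothetical, and this is shape arithmetic only, not the model itself.

```python
import math


def conv_same(size, stride):
    """Spatial size after a strided convolution with 'same' padding."""
    return math.ceil(size / stride)


def conv_transpose_same(size, stride):
    """Spatial size after a strided transposed convolution with 'same' padding."""
    return size * stride


encoder_filters = [168, 64, 32]
decoder_filters = [32, 64, 168]

size = 256
shapes = [(size, size, 3)]  # input image

for filters in encoder_filters:
    size = conv_same(size, stride=2)
    shapes.append((size, size, filters))
bottleneck = shapes[-1]

for filters in decoder_filters:
    size = conv_transpose_same(size, stride=2)
    shapes.append((size, size, filters))

shapes.append((size, size, 3))  # final stride-1 convolution back to 3 channels

input_values = 256 * 256 * 3
bottleneck_values = bottleneck[0] * bottleneck[1] * bottleneck[2]
compression = input_values / bottleneck_values

for shape in shapes:
    print(shape)
print(f"compression factor: {compression}")
```

Under these assumptions the bottleneck holds 32 x 32 x 32 values, which is the compressed representation later reused by the classifier.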

For the optimizer I used ADAM with a learning rate of 0.001. According to the literature, it is the SOTA optimizer for computer vision tasks [[5]](#adam). For the loss, as stated above, the best choice is the Mean Squared Error. I also tracked accuracy as a more intuitive metric of the model's performance.
The model was trained for 10 epochs.  
The plot of the model is shown below  
<img src="https://i.ibb.co/52mtY90/ae.png" />  

#### Results
The training procedure can be seen in the image below (also available in the assignment.ipynb).  
<img src="https://i.ibb.co/WgtfhJw/training-ae.jpg" />  
The plot for loss and validation loss   
<img src="https://i.ibb.co/ypkXszc/training-plot-ae.png" />  
and accuracy  
<img src="https://i.ibb.co/VMW62K3/ae-accuracy.png" />   
The model generalizes well, as the evaluation on the test dataset shows.  
<img src="https://i.ibb.co/YP5S2yC/ae-evaluation.jpg" />  
I also tried one image from the samples folder, and the result is adequate.  
<ins>Input</ins>  
<img src="https://i.ibb.co/NS3LMDr/real-test-image.png" />  
<ins>Output</ins>  
<img src="https://i.ibb.co/xMk3rHv/decoded-test-image.png" />  






### Classification <a name="clas" />

> **Note**: After further investigation I discovered that I had made some mistakes in this part that made my model untrainable.
> 1. The datasets were not balanced (fixed [here](https://shorturl.at/psyTW)).
> 2. I mistakenly treated the classification problem as a transfer learning problem. Influenced by other classification networks, I used too many layers in the classification part and, consequently, incorrect activation functions.
> 3. What I used as one-hot encoding was actually integer encoding.

For the classification task I follow the same procedure: examine the data!

The images have the same characteristics as before; the differences lie in the dataset preparation. This time I use one-hot encoding to encode the fake/real information as the target variable of each image. Also, the normalization is applied only to the input image.
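
The difference between integer encoding and one-hot encoding of the two classes can be sketched as follows. The class order (fake, real) follows the output ordering reported later in this document; the helper names are hypothetical.

```python
CLASS_NAMES = ["fake", "real"]  # index 0 = fake, index 1 = real


def integer_encode(label):
    """Encode a label as a single class index, e.g. 'real' -> 1."""
    return CLASS_NAMES.index(label)


def one_hot_encode(label):
    """Encode a label as a vector with a 1.0 at the class index, e.g. 'real' -> [0.0, 1.0]."""
    vector = [0.0] * len(CLASS_NAMES)
    vector[CLASS_NAMES.index(label)] = 1.0
    return vector


for label in CLASS_NAMES:
    print(label, integer_encode(label), one_hot_encode(label))
```

A one-hot target has one entry per class, which is what a two-unit softmax output trained with categorical cross-entropy expects. Integer targets would instead pair with a sparse variant of the loss.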

#### Dataset Preparation

For the dataset I chose a split of 80%, 10%, 10% for the train, validation and test sets, as requested.  
I used 30% of the total number of images, which translates to 6000 images.  
Also:
- batch size = 128 (best results)
- shuffle with a buffer size of 100 (applied only to the training set)

The resulting datasets have the following shapes:

- train_ds_cal: ((128, 256, 256, 3), (128, 2))
- val_ds_cal:  ((128, 256, 256, 3), (128, 2))
- test_ds_cal: ((128, 256, 256, 3), (128, 2))

and the number of samples per dataset is:
- train_ds_cal: 4800
- val_ds_cal:  600
- test_ds_cal: 600

Now, let's see the data I am going to feed in the classifier.

<img src="https://i.ibb.co/Gn4TyTK/real-fake-ds.png" alt="input target dataset" />  
The labeling seems correct. 

#### Model Implementation
As mentioned above, I will use the encoder part of the AutoEncoder model and add a logistic regression layer at the output.

The architecture is as follows:

- Encoder
- Flatten layer
- Dense layer (units=2, activation=softmax)

The plot of the model is presented below  


The rest of the hyperparameters are:
- optimizer: ADAM (with learning rate at 0.001),
- loss: categorical cross entropy,
- metrics: accuracy, precision and recall.  
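
As a rough illustration of how the softmax output and the categorical cross-entropy loss interact, the sketch below computes both for a single sample. The logits are made-up values and the function names are hypothetical; the framework's built-in implementations work on whole batches.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1 (shifted for numerical stability)."""
    highest = max(logits)
    exps = [math.exp(z - highest) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_cross_entropy(one_hot_target, probabilities, eps=1e-7):
    """Negative log-probability assigned to the true class."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(one_hot_target, probabilities))


# Hypothetical raw outputs of the two-unit dense layer, in (fake, real) order.
logits = [0.2, 1.1]
probs = softmax(logits)

loss_if_real = categorical_cross_entropy([0.0, 1.0], probs)
loss_if_fake = categorical_cross_entropy([1.0, 0.0], probs)

print(probs)
print(loss_if_real, loss_if_fake)
```

Because the model puts more probability on "real" here, the loss is lower when the true label is real than when it is fake.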

The model was trained for 20 epochs.

#### Results  

The network was heavily underfitted so I added two regularization terms. The best combination was:
- l1 = 0.01
- l2 = 0.1  
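
For reference, a combined L1/L2 regularizer adds a penalty of `l1 * sum(|w|) + l2 * sum(w^2)` to the loss. The sketch below evaluates that penalty with the coefficients above on a made-up weight vector. Which layer's weights were regularized is not stated here, so the weights are purely hypothetical.

```python
def l1_l2_penalty(weights, l1=0.01, l2=0.1):
    """Penalty added to the loss: l1 * sum(|w|) + l2 * sum(w^2)."""
    l1_term = l1 * sum(abs(w) for w in weights)
    l2_term = l2 * sum(w * w for w in weights)
    return l1_term + l2_term


# Hypothetical weights, for illustration only.
weights = [0.5, -1.0, 0.0, 2.0]
penalty = l1_l2_penalty(weights)

print(f"regularization penalty: {penalty:.3f}")
```

The L2 term dominates for large weights, while the L1 term keeps pushing small weights towards exactly zero.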

I also tried the sigmoid activation function [[8]](#SDA) instead of softmax, but the results were much worse.

The training procedure is shown in the image below  
<a href="https://ibb.co/6JKyRqQ"><img src="https://i.ibb.co/3YxTyPK/ae-lr-training-2.jpg" alt="ae-lr-training-2" border="0"></a>

also the loss,  
<img src="https://i.ibb.co/NF7Gcrc/ae-lr-loss.png" alt="ae-lr-loss" border="0">

the accuracy,  
<img src="https://i.ibb.co/nDtw9qG/ae-lr-accuracy.png" alt="ae-lr-accuracy" border="0">

the precision,  
<img src="https://i.ibb.co/nDtw9qG/ae-lr-accuracy.png" alt="ae-lr-accuracy" border="0">

and finally the recall.  
<img src="https://i.ibb.co/nDtw9qG/ae-lr-accuracy.png" alt="ae-lr-accuracy" border="0">

The generalization of the model is on par with the training results.
<img src="https://i.ibb.co/zZPv9QZ/ae-lr-evaluate.jpg" alt="ae-lr-evaluate" border="0">

Although the model seems to train well from the loss graph, the metrics indicate that there is room for improvement.

#### Afterthoughts
I believe this is the best result achievable with this configuration. Some possible ways to achieve greater performance are:
- using a greater portion of the provided dataset
- using denoising autoencoders
- using a different train-validation-test split

----

#### Optional subtask


For this task the only thing I have to do is import the images from the samples folder, normalize them and add an extra leading dimension for the batch.

The images are shown below 
<img src="https://i.ibb.co/rwdt1F3/download-1.png" border="0">



With a little searching I found the actual labels for them, which are:

|  image number |  image label |  
|---------------|--------------|
|       1       |       real   |
|       2       |     fake     |
|       3       |     fake     |
|       4       |     fake     |
|       5       |     real     |
|       6       |     real     |

The outputs of the model are probabilities. Passing the images through it yields the following results:

    [[0.4371662  0.5628338 ]  
    [0.6060826  0.39391732]  
    [0.59835744 0.40164256]  
    [0.62500703 0.37499294]  
    [0.5071038  0.49289626]  
    [0.2657548  0.73424524]]

Each row corresponds to one image, and each element in a row corresponds to one of the labels (fake, real).  
So this array translates to the table below.

| image number | predicted probability: fake | predicted probability: real |
|---------------|------------|------------|
|       1       | 0.4371662  | 0.5628338  |
|       2       | 0.6060826  | 0.39391732 |
|       3       | 0.59835744 | 0.40164256 |
|       4       | 0.62500703 | 0.37499294 |
|       5       | 0.5071038  | 0.49289626 |
|       6       | 0.2657548  | 0.73424524 |


We see that the model performs adequately, given the training results.
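
The comparison between the predicted probabilities and the actual labels can be sketched as below. The probabilities and labels are copied from the tables above. Picking the class with the highest probability (argmax) as the prediction is an assumption about the decision rule.

```python
CLASS_NAMES = ["fake", "real"]  # column order of the model output

probabilities = [
    [0.4371662, 0.5628338],
    [0.6060826, 0.39391732],
    [0.59835744, 0.40164256],
    [0.62500703, 0.37499294],
    [0.5071038, 0.49289626],
    [0.2657548, 0.73424524],
]
actual_labels = ["real", "fake", "fake", "fake", "real", "real"]

# Assumed decision rule: predict the class with the highest probability.
predicted_labels = [CLASS_NAMES[row.index(max(row))] for row in probabilities]
correct = sum(p == a for p, a in zip(predicted_labels, actual_labels))

for number, (pred, actual) in enumerate(zip(predicted_labels, actual_labels), start=1):
    print(f"image {number}: predicted={pred}, actual={actual}")
print(f"{correct} of {len(actual_labels)} correct")
```

Under this rule five of the six images are classified correctly. The one miss is image 5, where the two probabilities are almost equal.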

## Sources <a name="sources" />

1. <a name="bbc">https://www.bbc.com/news/business-51204954</a>
2. <a name="mdfc">https://jonathan-hui.medium.com/detect-ai-generated-images-deepfakes-part-1-b518ed5075f4</a>
3. <a name="autoenc">https://pythonmachinelearning.pro/all-about-autoencoders/</a>
4. <a name="dlbook">https://www.deeplearningbook.org/contents/autoencoders.html</a>
5. <a name="adam">https://machinelearningmastery.com/adam-optimization-algorithm-for-deep-learning/</a>
6. https://theaisummer.com/deepfakes/
7. https://machinelearningmastery.com/binary-classification-tutorial-with-the-keras-deep-learning-library/
8. <a name="SDA">C. Xing, L. Ma, and X. Yang, “Stacked Denoise Autoencoder Based Feature Extraction and Classification for Hyperspectral Images,” J. Sensors, vol. 2016, p. 3632943, 2016.</a>
9. Y. Chen, Z. Lin, X. Zhao, G. Wang, and Y. Gu, “Deep learning-based classification of hyperspectral data,” IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens., 2014.


