Yuming Yao 2032754

Dataset: https://www.kaggle.com/c/deepfake-detection-challenge


# 1. Introduction

Advances in science and technology have allowed deep learning (DL) models to be applied in a wide range of areas. Generative Adversarial Networks (GANs), for example, can produce highly realistic images, speech, and even video. The so-called deepfakes that GANs create by manipulating audio or video clips are so close to real content that human viewers often cannot tell them apart.

DeepFake, one of the most influential applications of this technology, is highly advanced: in principle, it can replace the protagonist of a video with anyone. It relies on one of the best-known deep learning algorithms, GANs. DeepFake first appeared at the end of 2017 in an adult-content community on the American social news website Reddit, where it caused a sensation. A user named Deepfakes grafted the face of "Wonder Woman" actress Gal Gadot onto the lead actress of an adult film and uploaded the video, which caused widespread discomfort. Reddit banned the face-swapped videos produced and uploaded by Deepfakes for inappropriate content and for infringing on the privacy of others.

This is only one of many harmful uses. Fake content can also be used maliciously or illegally in propaganda, political campaigns, cybercrime, blackmail, and so on, and if it is left unchecked, the consequences could be disastrous. It is therefore necessary to identify algorithmically generated fake content by technical means, which is a crucial part of solving the problems described above: only once fake content has been identified can we effectively prevent, warn against, and punish its misuse. This work helps protect personal rights, safeguard people's legitimate interests, and maintain the stable development of society.

In response, researchers set out to find deepfake detection methods that protect people from this danger. From that point on, offensive deepfake generation and defensive detection methods have been competing with each other, which is why the first datasets of generated fake videos appeared. In this work, I detect fake faces generated by deepfake techniques. The input of the predictor is the dataset provided by the DFDC challenge, and ResNeXt and Xception neural networks are used to classify faces as real or fake.

# 2. Related Work

Since deepfakes emerged, deepfake detection has become a hot research topic, and researchers have explored many strategies to solve the problem. Most approaches exploit visible artifacts, which are common in deepfakes. The most successful methods rely on eye blinking, mismatched color profiles, and facial warping artifacts. Such artifacts yield good detection accuracy, especially on relatively old fakes. However, the problem is considered to be more complex and to require cues beyond the visual ones [1]. It is therefore necessary to combine other techniques to study and solve the deepfake problem.

Most research on this problem applies machine learning and deep learning directly to recognition and detection, but ensemble learning has recently been proposed to further improve recognition accuracy. Below, we review work that has made progress in this area.

We start with work based directly on machine learning and deep learning. As early as 2018, D. Afchar et al. proposed an automatic and efficient method for detecting face tampering in videos produced by DeepFake and Face2Face, both of which can generate fake videos that the human eye cannot distinguish from real ones. Traditional image forensics techniques are usually unsuitable for video, because compression severely degrades the data. The paper therefore follows a deep learning approach and proposes two networks, both with a small number of layers, that focus on the mesoscopic properties of the image [2]. This is one of the earlier works on deepfake detection. The authors of [3] focus on detecting fake faces by matching swapped facial components, specifically the eyebrow region. Other researchers can extend this work with additional facial regions to improve recognition. As deepfake algorithms keep improving, the artifacts in altered images can be expected to disappear, so biometric matching of the swapped components may become preferable to detecting forgery artifacts. B. Malolan et al. analyze two convolutional neural network (CNN) architectures, Meso-4 and MesoInception-4, as well as a convolutional LSTM network, in which the CNN extracts frame-level features and the LSTM processes the sequence and feeds a dense layer that finally decides whether the video is real or fake. Face swapping is never perfect: visible distortion and blurring around the face usually expose it. Resolution mismatches of the swapped face and affine transformations leave traces in fake images, and a CNN is trained to capture these features.
They then propose a framework for detecting deepfake videos with deep learning: a convolutional neural network is trained on a database of faces extracted from the FaceForensics and DeepFakeDetection datasets. In addition, the model is examined with several explainable AI techniques (such as LRP and LIME) to give a clear visualization of the salient image regions on which the model focuses [4]. The work above has achieved quite good results, but deepfakes are also improving, so further research is clearly needed to combat them.

DeepfakeStack [5] is a deep ensemble learning technique that combines a set of state-of-the-art deep learning classification models into an improved composite classifier. It trains a meta-learner on top of pre-trained base learners, provides an interface for fitting the meta-learner to the base learners' predictions, and shows how the ensemble performs on the classification task. The DeepfakeStack architecture consists of two or more base learners, called level 0 models, and a meta-learner, called the level 1 model, which combines the predictions of the level 0 models. The level 1 model is trained on predictions that the base models make on out-of-sample data. That is, data not used to train the base models is fed to them, and their predictions, together with the expected outputs, form the input-output pairs of the training set used to fit the meta-model. Compared with single machine learning and deep learning models, deep ensemble learning gives better results in deepfake detection: DeepfakeStack reaches an accuracy of 99.65% and an AUC of 1.0.
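
The stacking procedure described above can be illustrated with a toy sketch. This is not the DeepfakeStack implementation: the level 0 learners here are hypothetical one-feature threshold rules, the level 1 learner is a simple vote table rather than a neural network, and the data is synthetic. The only point it is meant to show is that the level 1 model is fit on out-of-fold predictions, that is, on predictions made by base models that never saw the sample in question.

```python
import random
from collections import defaultdict

def fit_threshold(xs, ys, feature):
    """Hypothetical level 0 learner: threshold one feature midway between class means."""
    pos = [x[feature] for x, y in zip(xs, ys) if y == 1]
    neg = [x[feature] for x, y in zip(xs, ys) if y == 0]
    mean_pos, mean_neg = sum(pos) / len(pos), sum(neg) / len(neg)
    mid = (mean_pos + mean_neg) / 2
    sign = 1 if mean_pos >= mean_neg else -1
    return lambda x: 1 if sign * (x[feature] - mid) > 0 else 0

def out_of_fold_predictions(xs, ys, n_features, k=5):
    """For every sample, collect predictions from base learners trained without it."""
    n = len(xs)
    oof = [None] * n
    for fold in range(k):
        held_out = [i for i in range(n) if i % k == fold]
        train = [i for i in range(n) if i % k != fold]
        models = [fit_threshold([xs[i] for i in train], [ys[i] for i in train], f)
                  for f in range(n_features)]
        for i in held_out:
            oof[i] = tuple(m(xs[i]) for m in models)
    return oof

def fit_meta(oof, ys):
    """Level 1 learner: majority label for each pattern of level 0 predictions."""
    votes = defaultdict(lambda: [0, 0])
    for pattern, y in zip(oof, ys):
        votes[pattern][y] += 1
    return {p: (1 if c[1] >= c[0] else 0) for p, c in votes.items()}

# Synthetic two-feature data: samples with label 1 are shifted upwards in both features.
random.seed(0)
ys = [i % 2 for i in range(200)]
xs = [(random.gauss(y, 0.7), random.gauss(y, 0.7)) for y in ys]

oof = out_of_fold_predictions(xs, ys, n_features=2)
meta = fit_meta(oof, ys)
stacked = [meta[p] for p in oof]
accuracy = sum(int(p == y) for p, y in zip(stacked, ys)) / len(ys)
print(f"stacked accuracy on the training data: {accuracy:.2f}")
```

In a real setting the level 0 models would be deep networks and the reported score would come from a separate held-out set; the accuracy printed here is measured on the same data used to fit the vote table.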

Existing work thus uses machine learning, deep learning, and deep ensemble learning to detect deepfakes. Our work is also based on deep learning, but it differs in the following ways. We work on the original video dataset. We compare several face extraction methods, such as MTCNN, BlazeFace, and YOLO, use a ResNet model for training and prediction, and also use pre-trained ResNeXt and Xception models for prediction. In the end, we find that BlazeFace has the fastest detection speed and that the ResNeXt model reaches an AUC of 0.98, which is a good result.


In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

## Load Data

In [None]:

import matplotlib.pyplot as plt
from matplotlib.animation import FuncAnimation, ArtistAnimation 
%matplotlib inline
import glob
import cv2
from albumentations import *
import gc
import seaborn as sns
import warnings
import torch
import tensorflow as tf
import math
import sys
import time
sys.path.insert(0, "/kaggle/input/blazefacepytorch")
from blazeface import BlazeFace
import os

import torch.nn as nn
import torch.nn.functional as F
import random
warnings.filterwarnings("ignore")
from tqdm.notebook import tqdm


from tensorflow.keras.layers import Conv2D, Input, ZeroPadding2D, Dense, Lambda
from tensorflow.keras.models import Model
from tensorflow.keras.applications.mobilenet_v2 import MobileNetV2


import keras
from keras import Model,Sequential
from keras.layers import *
from keras.optimizers import *
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

PATH = '../input/deepfake-detection-challenge/'
print(os.listdir(PATH))
sorted(glob.glob('../input/deepfakes/meta*'))

First, we declare the paths of the training samples, the test samples, and the metadata file (this is only a small part of the data; more training data is shown later):

In [None]:
TEST_PATH = '../input/deepfake-detection-challenge/test_videos/'
TRAIN_PATH = '../input/deepfake-detection-challenge/train_sample_videos/'
metadata = '../input/deepfake-detection-challenge/train_sample_videos/metadata.json'

View the number of samples in the test and training set:

In [None]:
# Load the file names of the training videos
train_fns = sorted(glob.glob(TRAIN_PATH + '*.mp4'))

# Load the file names of the test videos
test_fns = sorted(glob.glob(TEST_PATH + '*.mp4'))

print('The training set has {} sample videos'.format(len(train_fns)))
print('The test set has {} sample videos '.format(len(test_fns)))

Load metadata

In [None]:
meta = pd.read_json(metadata).transpose()
label_df = meta.reset_index()
idx = label_df['index']
image_meta = []
first_trn_images = []

Analyze the number of real and fake videos in the training set and visualize it with a pie chart and a histogram, respectively:

In [None]:
labels = 'FAKE', 'REAL'
sizes = [meta[meta.label == 'FAKE'].label.count(), meta[meta.label == 'REAL'].label.count()]

fig1, ax1 = plt.subplots(figsize=(10,7))
ax1.pie(sizes, labels=labels, autopct='%1.1f%%',
        shadow=True, startangle=90, colors=['#f4d53f', '#02a1d8'])
ax1.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle.
plt.title('Labels', fontsize=16)
plt.show()

In this pie chart, we find that only 19% of the samples are real videos.

In [None]:
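def gather_info(train_img):
    # Return [height, width, channels, count] and the first frame of a training video,
    # where count is the number of frames read after the first one
    train_img = os.path.join(PATH, f"train_sample_videos/{train_img}")

    cap = cv2.VideoCapture(train_img)

    # Read the first frame and take the frame shape from it
    success, image = cap.read()
    first_trn_image = image if success else None
    x, y, z = image.shape if success else (None, None, None)

    # Count the remaining frames until the video is exhausted
    count = 0
    while success:
        success, image = cap.read()
        if not success:
            break
        count += 1

    cap.release()
    cv2.destroyAllWindows()

    return [x, y, z, count], first_trn_image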

In [None]:
# Collect training image size
for i in tqdm(range(len(idx))):
    test, first = gather_info(idx.values[i])
    image_meta.append(test)
    first_trn_images.append(first)

In [None]:
#View the first frame of the first 10 videos
for img, lbl in zip(first_trn_images[:10], label_df['label'].values[:10]):
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    plt.title(f'{lbl}')
    plt.imshow(img)
    plt.show()
    
gc.collect()

In [None]:
# Sample size distribution
plt.title('Count of each video type')
sns.countplot(y=label_df['label'].values)

plt.show()

The training samples come in two resolutions: 1080x1920 and 1920x1080. In the training and test sets, the number of frames per video is 297/299 and 298/299, respectively, with 299 being the most frequent. It is not clear whether the private test set follows the same distribution. Given the frame resolution, this challenge clearly demands substantial computing resources.
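
To give a rough sense of the resource requirements mentioned above, the following back-of-the-envelope calculation estimates how much memory a single fully decoded video would occupy. It assumes 8-bit RGB frames held uncompressed in RAM, which is an assumption made for this sketch and not necessarily how the pipeline stores frames.

```python
# Rough memory estimate for one decoded video held in RAM as 8-bit RGB.
HEIGHT, WIDTH, CHANNELS = 1080, 1920, 3   # resolution reported above
FRAMES_PER_VIDEO = 299                    # most frequent frame count reported above

bytes_per_frame = HEIGHT * WIDTH * CHANNELS
bytes_per_video = bytes_per_frame * FRAMES_PER_VIDEO

print(f"one frame: {bytes_per_frame / 1e6:.1f} MB")
print(f"one video: {bytes_per_video / 1e9:.2f} GB")
```

This comes to roughly 6.2 MB per frame and about 1.86 GB per video, which is why sampling a few frames per video and cropping to the face region is attractive.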

Video data analysis

In [None]:
def get_frame(filename):
    # Open the video and read its first frame
    cap = cv2.VideoCapture(filename)
    ret, frame = cap.read()
    image = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    cap.release()
    cv2.destroyAllWindows()
    
    return image

def get_label(filename, meta):
    # Look up the label of the video
    video_id = filename.split('/')[-1]
    return meta.loc[video_id].label

def get_original_filename(filename, meta):
    
    video_id = filename.split('/')[-1]
    original_id = meta.loc[video_id].original
    
    return original_id

def visualize_frame(filename, meta, train = True):

    # Get the first frame
    image = get_frame(filename)

    # Show the first frame of the video
    fig, axs = plt.subplots(1,3, figsize=(20,7))
    axs[0].imshow(image) 
    axs[0].axis('off')
    axs[0].set_title('Original frame')
    
    # Detect and extract the face
    face_cascade = cv2.CascadeClassifier('../input/haarcascades/haarcascade_frontalface_default.xml')


    faces = face_cascade.detectMultiScale(image, 1.2, 3)

    image_with_detections = image.copy()

    for (x,y,w,h) in faces:

        cv2.rectangle(image_with_detections,(x,y),(x+w,y+h),(255,0,0),3)

    axs[1].imshow(image_with_detections)
    axs[1].axis('off')
    axs[1].set_title('Highlight faces')
    
    crop_img = image.copy()
    for (x,y,w,h) in faces:
        crop_img = image[y:y+h, x:x+w]
        break
        
    axs[2].imshow(crop_img)
    axs[2].axis('off')
    axs[2].set_title('Zoom-in face')
    
    if train:
        plt.suptitle('Image {image} label: {label}'.format(image = filename.split('/')[-1], label=get_label(filename, meta)))
    else:
        plt.suptitle('Image {image}'.format(image = filename.split('/')[-1]))
    plt.show()

In [None]:
meta = pd.read_json(metadata).transpose()
meta.head()
meta

# 3. Problem Formulation

In this work, we use the dataset of the Kaggle Deepfake Detection Challenge (DFDC), a competition on detecting deepfake-synthesized faces. Four datasets are associated with this competition.

1. Training set: This dataset contains the target labels and can be downloaded outside of Kaggle so that competitors can build their models. It is split into 50 files for easier access and download. Because of its size, it must be accessed through a GCS bucket.

2. Public validation set: When a notebook is committed, the generated submission file is based on the small set of 400 videos/ids in this public validation set. It is available as test_videos.zip on the Kaggle data page.

3. Public test set: This dataset is completely withheld, and the Kaggle platform computes the public leaderboard against it. When you submit to the competition from the output file of a committed notebook that contains the competition dataset, your code is rerun in the background, and once the rerun is complete the score is posted to the public leaderboard.

4. Private test set: This dataset is held privately outside the Kaggle platform and is used to compute the private leaderboard. It contains videos similar in format and nature to the training and public validation/test sets, but they are real, organic videos, with and without deepfakes.

A few more details on the training set: the complete training set exceeds 470 GB, and the organizers provide it as 50 smaller chunks. In the training data, the string "REAL" or "FAKE" in the label column indicates whether a video was forged by deepfake. The main columns are: filename, the file name of the video; label, whether the video is real or fake; original, the original video if the training video is fake; split, which is always equal to "train". We extract one or a few frames from each video as input for training and detection, as shown in the following figure.
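
The metadata layout described above can be made concrete with a small sketch. The file names below are made up for illustration, and the two helper functions (`labels_as_binary`, `fakes_by_original`) are hypothetical and not part of the notebook; the label encoding follows the notebook's convention of 0 for REAL and 1 for FAKE.

```python
import json

# Hypothetical metadata in the layout described above: one entry per video file.
# The file names are invented for this example.
sample_metadata = json.loads("""
{
  "fake_clip_a.mp4": {"label": "FAKE", "split": "train", "original": "real_clip_x.mp4"},
  "fake_clip_b.mp4": {"label": "FAKE", "split": "train", "original": "real_clip_x.mp4"},
  "real_clip_x.mp4": {"label": "REAL", "split": "train", "original": null}
}
""")

def labels_as_binary(metadata):
    """Map each file name to 1 for FAKE and 0 for REAL."""
    return {name: int(info["label"] == "FAKE") for name, info in metadata.items()}

def fakes_by_original(metadata):
    """Group fake videos under the real video they were generated from."""
    groups = {}
    for name, info in metadata.items():
        if info["label"] == "FAKE":
            groups.setdefault(info["original"], []).append(name)
    return groups

targets = labels_as_binary(sample_metadata)
groups = fakes_by_original(sample_metadata)
print(targets)
print(groups)
```

Grouping fakes by their original video can be useful when building a validation split, since it makes it possible to keep a real video and the fakes derived from it on the same side of the split.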


In [None]:
# The training data is split into 50 chunks, each with its own metadata file.
# Chunks 0-46 are used for training and chunks 47-49 for validation.
# Building the lists in a loop ensures each chunk appears exactly once,
# so every metadata frame stays aligned with its chunk number.
train_nums = list(range(47))
val_nums = [47, 48, 49]

df_trains = [pd.read_json(f'../input/deepfakes/metadata{num}.json') for num in train_nums]
df_vals = [pd.read_json(f'../input/deepfakes/metadata{num}.json') for num in val_nums]

nums = list(range(len(df_trains)))
LABELS = ['REAL', 'FAKE']

Next, we collect the paths of the frame images extracted from the training and validation videos, together with their labels:

In [None]:
def get_path(num, x):
    # Build the path of the extracted frame image for video x in chunk num;
    # chunk folders are zero-padded to two digits (DeepFake00 ... DeepFake49)
    folder = 'DeepFake' + str(num).zfill(2)
    path = '../input/deepfakes/' + folder + '/' + folder + '/' + x.replace('.mp4', '') + '.jpg'
    if not os.path.exists(path):
        raise FileNotFoundError(path)
    return path
paths=[]
y=[]
for df_train,num in tqdm(zip(df_trains,nums),total=len(df_trains)):
    images = list(df_train.columns.values)
    for x in images:
        try:
            paths.append(get_path(num,x))
            y.append(LABELS.index(df_train[x]['label']))
        except Exception as err:
            #print(err)
            pass

val_paths=[]
val_y=[]
for df_val,num in tqdm(zip(df_vals,val_nums),total=len(df_vals)):
    images = list(df_val.columns.values)
    for x in images:
        try:
            val_paths.append(get_path(num,x))
            val_y.append(LABELS.index(df_val[x]['label']))
        except Exception as err:
            #print(err)
            pass

In [None]:
print('There are '+str(y.count(1))+' fake train samples')
print('There are '+str(y.count(0))+' real train samples')
print('There are '+str(val_y.count(1))+' fake val samples')
print('There are '+str(val_y.count(0))+' real val samples')

In [None]:
visualize_frame(train_fns[0], meta)

In [None]:
visualize_frame(train_fns[4], meta)


For face extraction, we consider four methods: BlazeFace, MTCNN, MobileNet, and YOLO.

# Initialize BlazeFace

We first look at BlazeFace, an efficient and lightweight face detector that performs very well on close-up, frontal faces. It maintains detection accuracy while reaching sub-millisecond inference speed on mobile GPUs. Two design choices explain this performance. The first is the backbone network: it uses depthwise convolutions with a large receptive field together with 1x1 convolutions to speed up feature extraction, and it nests single BlazeBlocks into double BlazeBlocks to increase the expressive power of the model. The second is the design of the detection regression: because the main task is detecting frontal faces, anchors with an aspect ratio of 1.0 give good results in most cases. BlazeFace uses only two feature scales for face detection, with 2 anchors per location on the 16x16 feature map and 6 anchors per location on the 8x8 feature map. When the data distribution is not complicated, this both solves the problem and speeds up inference, which is what makes BlazeFace efficient and lightweight.
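
The anchor layout described above can be sketched as follows. This is only an illustration under the assumption that anchors are square and centred in their grid cells, so that only the centre coordinates matter; it is not necessarily how the `anchors.npy` file loaded below was generated, and `make_anchor_centers` is a hypothetical helper.

```python
def make_anchor_centers(layers=((16, 2), (8, 6))):
    """Return normalized (x_center, y_center) pairs, one per anchor.

    `layers` lists (grid_size, anchors_per_cell) pairs; the default follows
    the two feature-map scales mentioned above.
    """
    centers = []
    for grid, per_cell in layers:
        for row in range(grid):
            for col in range(grid):
                cx = (col + 0.5) / grid
                cy = (row + 0.5) / grid
                centers.extend([(cx, cy)] * per_cell)
    return centers

anchors = make_anchor_centers()
print(len(anchors))  # 16*16*2 + 8*8*6 = 896
```

With only 896 anchors in total, the detection head has far fewer candidates to score than a general-purpose detector, which is consistent with the speed argument made above.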


In [None]:
blazeface = BlazeFace()
blazeface.load_weights("/kaggle/input/blazefacepytorch/blazeface.pth")
blazeface.load_anchors("/kaggle/input/blazefacepytorch/anchors.npy")

blazeface.min_score_thresh = 0.75
blazeface.min_suppression_threshold = 0.3

In [None]:
def get_blaze_boxes(detections, with_keypoints=False):
    result = []
    if isinstance(detections, torch.Tensor):
        detections = detections.cpu().numpy()

    if detections.ndim == 1:
        detections = np.expand_dims(detections, axis=0)

    img_shape = (128, 128)
    for i in range(detections.shape[0]):
        ymin = detections[i, 0] * img_shape[0]
        xmin = detections[i, 1] * img_shape[1]
        ymax = detections[i, 2] * img_shape[0]
        xmax = detections[i, 3] * img_shape[1]
        result.append((xmin, ymin, xmax, ymax))
    return result
def scale_boxes(boxes, scale_w, scale_h):
    sb = []
    for b in boxes:
        sb.append((b[0] * scale_w, b[1] * scale_h, b[2] * scale_w, b[3] * scale_h))
    return sb

# Initialize YOLO

YOLO stands for You Only Look Once, a name that sums up its main characteristic: a single CNN pass is all that is required. YOLO treats object detection as a regression problem and uses a single end-to-end network to go from the original input image to the object positions and categories. Both training and detection are performed in this single network, and there is no explicit region-proposal step. The YOLO network borrows from the GoogLeNet classification architecture; the difference is that YOLO does not use the Inception module but replaces it with a 1x1 convolutional layer (for cross-channel information integration) followed by a 3x3 convolutional layer. Training has two steps. The first is pre-training: the first 20 convolutional layers of the YOLO network, followed by an average pooling layer and a fully connected layer, are trained on ImageNet classification data with the training images resized to 224x224. In the second step, the parameters of these 20 convolutional layers initialize the first 20 convolutional layers of the YOLO model, which is then trained on the 20-class VOC annotation data. To improve accuracy, the input resolution is increased to 448x448 when training the detection model. Overall, YOLO is fast, has a low background false-detection rate, and generalizes well.
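
To make the idea of detection as regression on a grid more concrete, the sketch below decodes the raw outputs of one grid cell into a normalized bounding box. The parameterization differs between YOLO versions and implementations, so the convention used here (sigmoid offsets inside the cell, log-sizes measured in cells, an objectness logit) is an assumption chosen for illustration, and `decode_cell` is a hypothetical helper rather than the notebook's decoder.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def decode_cell(raw, col, row, grids=7):
    """Turn one grid cell's raw outputs (tx, ty, tw, th, tp) into a normalized box.

    Assumed convention for this sketch: tx, ty are offsets inside the cell,
    tw, th are log-sizes measured in cells, tp is the objectness logit.
    """
    tx, ty, tw, th, tp = raw
    cx = (sigmoid(tx) + col) / grids           # box centre, as a fraction of the image
    cy = (sigmoid(ty) + row) / grids
    w = min(math.exp(tw), grids) / grids       # box size, capped at the full image
    h = min(math.exp(th), grids) / grids

    def clamp(v):
        return min(max(v, 0.0), 1.0)

    box = (clamp(cx - w / 2), clamp(cy - h / 2), clamp(cx + w / 2), clamp(cy + h / 2))
    return box, sigmoid(tp)

# A prediction in the central cell of a 7x7 grid with all-zero box outputs
box, score = decode_cell((0.0, 0.0, 0.0, 0.0, 2.0), col=3, row=3)
print(box, round(score, 3))
```

With all-zero box outputs, the decoded box is centred in the image and is one grid cell wide and high; only boxes whose score exceeds a chosen threshold would be kept and passed on to non-maximum suppression.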


In [None]:
def load_mobilenetv2_224_075_detector(path):
    input_tensor = Input(shape=(224, 224, 3))
    output_tensor = MobileNetV2(weights=None, include_top=False, input_tensor=input_tensor, alpha=0.75).output
    output_tensor = ZeroPadding2D()(output_tensor)
    output_tensor = Conv2D(kernel_size=(3, 3), filters=5)(output_tensor)

    model = Model(inputs=input_tensor, outputs=output_tensor)
    model.load_weights(path)
    
    return model

In [None]:
from tensorflow.keras.layers import Conv2D, Input, ZeroPadding2D, Dense, Lambda
from tensorflow.keras.models import Model
from tensorflow.keras.applications.mobilenet_v2 import MobileNetV2
mobilenetv2 = load_mobilenetv2_224_075_detector("../input/facedetectionmobilenetv2/facedetection-mobilenetv2-size224-alpha0.75.h5")

In [None]:
def transpose_shots(shots):
    return [(shot[1], shot[0], shot[3], shot[2], shot[4]) for shot in shots]

# Crop regions ("shots") for 16:9 images, given as fractions of the frame:
# (x offset, y offset, width, height, maximum relative box size)
SHOTS = {
    # fast less accurate
    '2-16/9' : {
        'aspect_ratio' : 16/9,
        'shots' : [
             (0, 0, 9/16, 1, 1),
             (7/16, 0, 9/16, 1, 1)
        ]
    },
    # slower more accurate
    '10-16/9' : {
        'aspect_ratio' : 16/9,
        'shots' : [
             (0, 0, 9/16, 1, 1),
             (7/16, 0, 9/16, 1, 1),
             (0, 0, 5/16, 5/9, 0.5),
             (0, 4/9, 5/16, 5/9, 0.5),
             (11/48, 0, 5/16, 5/9, 0.5),
             (11/48, 4/9, 5/16, 5/9, 0.5),
             (22/48, 0, 5/16, 5/9, 0.5),
             (22/48, 4/9, 5/16, 5/9, 0.5),
             (11/16, 0, 5/16, 5/9, 0.5),
             (11/16, 4/9, 5/16, 5/9, 0.5),
        ]
    }
}

# The same crop regions, transposed for 9:16 (portrait) images
SHOTS_T = {
    '2-9/16' : {
        'aspect_ratio' : 9/16,
        'shots' : transpose_shots(SHOTS['2-16/9']['shots'])
    },
    '10-9/16' : {
        'aspect_ratio' : 9/16,
        'shots' : transpose_shots(SHOTS['10-16/9']['shots'])
    }
}

def r(x):
    return int(round(x))

def sigmoid(x):
    return 1 / (np.exp(-x) + 1)

def non_max_suppression(boxes, p, iou_threshold):
    if len(boxes) == 0:
        return np.array([])

    x1 = boxes[:, 0]
    y1 = boxes[:, 1]
    x2 = boxes[:, 2]
    y2 = boxes[:, 3]

    indexes = np.argsort(p)
    true_boxes_indexes = []

    while len(indexes) > 0:
        true_boxes_indexes.append(indexes[-1])

        intersection = np.maximum(np.minimum(x2[indexes[:-1]], x2[indexes[-1]]) - np.maximum(x1[indexes[:-1]], x1[indexes[-1]]), 0) * np.maximum(np.minimum(y2[indexes[:-1]], y2[indexes[-1]]) - np.maximum(y1[indexes[:-1]], y1[indexes[-1]]), 0)
        iou = intersection / ((x2[indexes[:-1]] - x1[indexes[:-1]]) * (y2[indexes[:-1]] - y1[indexes[:-1]]) + (x2[indexes[-1]] - x1[indexes[-1]]) * (y2[indexes[-1]] - y1[indexes[-1]]) - intersection)

        indexes = np.delete(indexes, -1)
        indexes = np.delete(indexes, np.where(iou >= iou_threshold)[0])

    return boxes[true_boxes_indexes]

def union_suppression(boxes, threshold):
    if len(boxes) == 0:
        return np.array([])

    x1 = boxes[:, 0]
    y1 = boxes[:, 1]
    x2 = boxes[:, 2]
    y2 = boxes[:, 3]

    indexes = np.argsort((x2 - x1) * (y2 - y1))
    result_boxes = []

    while len(indexes) > 0:
        intersection = np.maximum(np.minimum(x2[indexes[:-1]], x2[indexes[-1]]) - np.maximum(x1[indexes[:-1]], x1[indexes[-1]]), 0) * np.maximum(np.minimum(y2[indexes[:-1]], y2[indexes[-1]]) - np.maximum(y1[indexes[:-1]], y1[indexes[-1]]), 0)
        min_s = np.minimum((x2[indexes[:-1]] - x1[indexes[:-1]]) * (y2[indexes[:-1]] - y1[indexes[:-1]]), (x2[indexes[-1]] - x1[indexes[-1]]) * (y2[indexes[-1]] - y1[indexes[-1]]))
        ioms = intersection / (min_s + 1e-9)
        neighbours = np.where(ioms >= threshold)[0]
        if len(neighbours) > 0:
            result_boxes.append([min(np.min(x1[indexes[neighbours]]), x1[indexes[-1]]), min(np.min(y1[indexes[neighbours]]), y1[indexes[-1]]), max(np.max(x2[indexes[neighbours]]), x2[indexes[-1]]), max(np.max(y2[indexes[neighbours]]), y2[indexes[-1]])])
        else:
            result_boxes.append([x1[indexes[-1]], y1[indexes[-1]], x2[indexes[-1]], y2[indexes[-1]]])

        indexes = np.delete(indexes, -1)
        indexes = np.delete(indexes, neighbours)

    return result_boxes

class FaceDetector():
    # Face detection on cropped regions ("shots") of a frame
    def __init__(self, model=mobilenetv2, shots=[SHOTS['10-16/9'], SHOTS_T['10-9/16']], image_size=224, grids=7, iou_threshold=0.1, union_threshold=0.1):
        self.model = model
        self.shots = shots
        self.image_size = image_size
        self.grids = grids
        self.iou_threshold = iou_threshold
        self.union_threshold = union_threshold
        self.prob_threshold = 0.7
        
    
    def detect(self, frame, threshold = 0.7):
        original_frame_shape = frame.shape
        self.prob_threshold = threshold
        aspect_ratio = None
        for shot in self.shots:
            if abs(frame.shape[1] / frame.shape[0] - shot["aspect_ratio"]) < 1e-9:
                aspect_ratio = shot["aspect_ratio"]
                shots = shot
        
        assert aspect_ratio is not None
        
        c = min(frame.shape[0], frame.shape[1] / aspect_ratio)
        slice_h_shift = r((frame.shape[0] - c) / 2)
        slice_w_shift = r((frame.shape[1] - c * aspect_ratio) / 2)
        if slice_w_shift != 0 and slice_h_shift == 0:
            frame = frame[:, slice_w_shift:-slice_w_shift]
        elif slice_w_shift == 0 and slice_h_shift != 0:
            frame = frame[slice_h_shift:-slice_h_shift, :]

        frames = []
        for s in shots["shots"]:
            frames.append(cv2.resize(frame[r(s[1] * frame.shape[0]):r((s[1] + s[3]) * frame.shape[0]), r(s[0] * frame.shape[1]):r((s[0] + s[2]) * frame.shape[1])], (self.image_size, self.image_size), interpolation=cv2.INTER_NEAREST))
        frames = np.array(frames)

        predictions = self.model.predict(frames, batch_size=len(frames), verbose=0)

        boxes = []
        prob = []
        shots = shots['shots']
        for i in range(len(shots)):
            slice_boxes = []
            slice_prob = []
            for j in range(predictions.shape[1]):
                for k in range(predictions.shape[2]):
                    p = sigmoid(predictions[i][j][k][4])
                    if not(p is None) and p > self.prob_threshold:
                        px = sigmoid(predictions[i][j][k][0])
                        py = sigmoid(predictions[i][j][k][1])
                        pw = min(math.exp(predictions[i][j][k][2] / self.grids), self.grids)
                        ph = min(math.exp(predictions[i][j][k][3] / self.grids), self.grids)
                        if not(px is None) and not(py is None) and not(pw is None) and not(ph is None) and pw > 1e-9 and ph > 1e-9:
                            cx = (px + j) / self.grids
                            cy = (py + k) / self.grids
                            wx = pw / self.grids
                            wy = ph / self.grids
                            if wx <= shots[i][4] and wy <= shots[i][4]:
                                lx = min(max(cx - wx / 2, 0), 1)
                                ly = min(max(cy - wy / 2, 0), 1)
                                rx = min(max(cx + wx / 2, 0), 1)
                                ry = min(max(cy + wy / 2, 0), 1)

                                lx *= shots[i][2]
                                ly *= shots[i][3]
                                rx *= shots[i][2]
                                ry *= shots[i][3]

                                lx += shots[i][0]
                                ly += shots[i][1]
                                rx += shots[i][0]
                                ry += shots[i][1]

                                slice_boxes.append([lx, ly, rx, ry])
                                slice_prob.append(p)

            slice_boxes = np.array(slice_boxes)
            slice_prob = np.array(slice_prob)

            slice_boxes = non_max_suppression(slice_boxes, slice_prob, self.iou_threshold)

            for sb in slice_boxes:
                boxes.append(sb)


        boxes = np.array(boxes)
        boxes = union_suppression(boxes, self.union_threshold)

        for i in range(len(boxes)):
            boxes[i][0] /= original_frame_shape[1] / frame.shape[1]
            boxes[i][1] /= original_frame_shape[0] / frame.shape[0]
            boxes[i][2] /= original_frame_shape[1] / frame.shape[1]
            boxes[i][3] /= original_frame_shape[0] / frame.shape[0]

            boxes[i][0] += slice_w_shift / original_frame_shape[1]
            boxes[i][1] += slice_h_shift / original_frame_shape[0]
            boxes[i][2] += slice_w_shift / original_frame_shape[1]
            boxes[i][3] += slice_h_shift / original_frame_shape[0]

        return list(boxes)

# Convert normalized boxes to pixel coordinates in (left, right, top, bottom) order
def get_boxes_points(boxes, frame_shape):
    result = []
    for box in boxes:
        lx = int(round(box[0] * frame_shape[1]))
        ly = int(round(box[1] * frame_shape[0]))
        rx = int(round(box[2] * frame_shape[1]))
        ry = int(round(box[3] * frame_shape[0]))
        result.append((lx,rx, ly, ry))
    return result 

In [None]:
yolo_model = FaceDetector()

# Initialize MTCNN

MTCNN (Multi-task Cascaded Convolutional Networks) is a neural network model for face detection proposed in 2016 by researchers at the Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences. It chains three cascaded networks and follows the idea of candidate boxes plus a classifier to detect faces quickly and efficiently: P-Net quickly generates candidate windows, R-Net filters them and keeps the high-precision candidates, and O-Net produces the final bounding boxes and facial key points.

P-Net stands for Proposal Network, and its basic structure is a fully convolutional network (FCN). It is applied to an image pyramid built by resizing the input to several scales, where it performs preliminary feature extraction and box calibration; bounding-box regression adjusts the windows and non-maximum suppression (NMS) filters out most of them. P-Net acts as a region proposal network for faces: after the input passes through three convolutional layers, a face classifier decides whether each region contains a face, while a bounding-box regressor and a facial key point locator produce the initial proposal for the face region. This stage outputs many regions that may contain faces and passes them to R-Net for further processing.

R-Net stands for Refine Network. Its basic structure is a convolutional neural network which, compared with P-Net, adds a fully connected layer, so it screens its input more strictly. All prediction windows left by P-Net are sent to R-Net, which rejects a large number of poor candidates and then applies bounding-box regression and NMS to the remaining ones to further refine the predictions. Because the output of P-Net is only a set of possible face regions with limited confidence, R-Net refines this input, discards most of the false candidates, repeats the bounding-box regression and key point localization, and passes more reliable face regions on to O-Net. Whereas P-Net ends in a 1x1x32 feature produced by full convolution, R-Net uses a 128-unit fully connected layer after its last convolutional layer, which retains more image features and gives better accuracy than P-Net.

O-Net stands for Output Network. It is a more complex convolutional neural network with one more convolutional layer than R-Net and more input features. The difference from R-Net is that this stage identifies the face region under more supervision and also regresses the facial landmarks, finally outputting five facial feature points. At the end of the network there is a larger, 256-unit fully connected layer, which retains more image features. O-Net performs face classification, bounding-box regression and landmark localization at the same time, and outputs the coordinates of the top-left and bottom-right corners of the face region together with the five feature points. With more input features and a more complex structure, it also performs better, and its output is the final output of the model.

To balance speed and accuracy, MTCNN avoids the heavy cost of traditional approaches such as a sliding window plus a classifier. It first uses a small model to generate candidate boxes that may contain the target, and then uses more complex models for finer classification and more precise box regression. Applying this idea in successive stages yields the three-stage network of P-Net, R-Net and O-Net, which achieves fast and efficient face detection.
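To make the image pyramid that feeds P-Net more concrete, the sketch below computes the scales at which a frame would be resized before the first stage runs. It is only an illustration: the function name `pyramid_scales`, the default `min_face_size=20`, the shrink factor `0.709` and the 12-pixel P-Net window are assumptions based on commonly used MTCNN settings, not values taken from this notebook.

```python
def pyramid_scales(height, width, min_face_size=20, factor=0.709, net_input=12):
    """Return the resize scales of the image pyramid (illustrative helper)."""
    # The first scale maps a face of `min_face_size` pixels onto the P-Net window.
    scale = net_input / min_face_size
    min_side = min(height, width) * scale
    scales = []
    # Keep shrinking until the shorter side no longer fits one P-Net window.
    while min_side >= net_input:
        scales.append(scale)
        scale *= factor
        min_side *= factor
    return scales

scales = pyramid_scales(1080, 1920)
print(len(scales), [round(s, 3) for s in scales[:3]])
```

Each scale produces one resized copy of the frame, so a smaller `min_face_size` or a factor closer to 1 means more pyramid levels and a slower first stage.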

In [None]:
!pip install mtcnn

In [None]:
from mtcnn import MTCNN
mtcnn = MTCNN()

# Initialize MobilenetFace

MobileNet is a small, efficient CNN model proposed by Google that trades off accuracy against latency. Its basic unit is the depthwise separable convolution, a structure that had already been used in Inception models. A depthwise separable convolution is a factorized convolution: it decomposes a standard convolution into two smaller operations, a depthwise convolution and a pointwise convolution. In a standard convolution every kernel spans all input channels, whereas a depthwise convolution uses a separate kernel for each input channel, that is, one kernel per channel, so it operates at the level of individual channels. The pointwise convolution is an ordinary convolution with a 1x1 kernel. A depthwise separable convolution first applies the depthwise convolution to each input channel separately and then uses the pointwise convolution to combine the results. The overall effect is similar to a standard convolution, but the amount of computation and the number of parameters are greatly reduced. In the actual network, batch normalization and the ReLU activation function are added to this basic component. In short, the core of MobileNet is the depthwise separable convolution, which reduces both the computational cost and the size of the model.
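The saving can be checked with simple arithmetic. The sketch below counts the weights of a standard convolution and of its depthwise separable replacement; the helper names and the example layer (3x3 kernel, 32 input channels, 64 output channels) are chosen only for illustration, and bias terms are ignored.

```python
def standard_conv_params(kernel, c_in, c_out):
    # Every output channel has one kernel x kernel filter per input channel.
    return kernel * kernel * c_in * c_out

def separable_conv_params(kernel, c_in, c_out):
    depthwise = kernel * kernel * c_in  # one spatial filter per input channel
    pointwise = c_in * c_out            # 1x1 convolution that mixes channels
    return depthwise + pointwise

std = standard_conv_params(3, 32, 64)
sep = separable_conv_params(3, 32, 64)
print(std, sep, round(sep / std, 3))  # 18432 2336 0.127
```

The ratio equals 1/c_out + 1/kernel^2, so with 3x3 kernels the separable version needs roughly one eighth to one ninth of the weights of the standard layer.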

In [None]:
detection_graph = tf.Graph()
with detection_graph.as_default():
    od_graph_def = tf.compat.v1.GraphDef()
    with tf.io.gfile.GFile('../input/mobilenetface/frozen_inference_graph_face.pb', 'rb') as fid:
        serialized_graph = fid.read()
        od_graph_def.ParseFromString(serialized_graph)
        tf.import_graph_def(od_graph_def, name='')
        config = tf.compat.v1.ConfigProto()
    config.gpu_options.allow_growth = True
    sess=tf.compat.v1.Session(graph=detection_graph, config=config)
    image_tensor = detection_graph.get_tensor_by_name('image_tensor:0')
    boxes_tensor = detection_graph.get_tensor_by_name('detection_boxes:0')    
    scores_tensor = detection_graph.get_tensor_by_name('detection_scores:0')
    num_detections = detection_graph.get_tensor_by_name('num_detections:0')

In [None]:
video='../input/deepfake-detection-challenge/train_sample_videos/bdnaqemxmr.mp4'

In [None]:
video_cap = cv2.VideoCapture(video)

frame_count = 0
all_frames = []
while(True):
    ret, frame = video_cap.read()
    if ret is False:
        break
    all_frames.append(frame)
    frame_count = frame_count + 1

# The two values below should both equal the number of frames
print (frame_count)
print (len(all_frames))

In [None]:
cap = cv2.VideoCapture(video)
# Frame rate
fps = int(round(cap.get(cv2.CAP_PROP_FPS)))
# Resolution-Width
width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
# Resolution-Height
height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
# Total number of frames
frame_counter = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))

cap.release()
cv2.destroyAllWindows()
# Duration, unit s
duration = frame_counter / fps

In [None]:
cap=cv2.VideoCapture(video)
ret,frame=cap.read()
frame=cv2.cvtColor(frame,cv2.COLOR_BGR2RGB)
plt.imshow(frame)


In [None]:
# Each helper returns (elapsed seconds, (left, right, top, bottom)); the box is [] if no face is found
def get_mtcnn_face(img):
    start=time.time()
    faces=mtcnn.detect_faces(img)
    if not faces:
        return time.time()-start,[]
    x,y,w,h=faces[0]['box']
    bboxes=x,x+w,y,y+h
    return time.time()-start, bboxes
def get_blazeface_face(img):
    start=time.time()
    # The image is resized to 128x128 for BlazeFace, so keep the scale back to the original size
    scale_w = img.shape[1] / 128.0
    scale_h = img.shape[0] / 128.0
    blaze_output=blazeface.predict_on_image(cv2.resize(img, (128,128)))
    blaze_bboxes=scale_boxes(get_blaze_boxes(blaze_output), scale_w, scale_h)
    if blaze_bboxes==[]:
        return time.time()-start,[]
    lx, ly, rx, ry = blaze_bboxes[0]
    bboxes=int(lx), int(rx), int(ly), int(ry)
    return time.time()-start, bboxes
def get_mobilenet_face(image):
    start=time.time()
    global boxes,scores,num_detections
    (im_height,im_width)=image.shape[:-1]
    imgs=np.array([image])
    (boxes, scores) = sess.run(
        [boxes_tensor, scores_tensor],
        feed_dict={image_tensor: imgs})
    # scores has shape (1, N): pick the best detection for the single image in the batch
    max_=np.argmax(scores[0])
    box=boxes[0][max_]
    ymin, xmin, ymax, xmax = box
    (left, right, top, bottom) = (xmin * im_width, xmax * im_width,
                                ymin * im_height, ymax * im_height)
    left, right, top, bottom = int(left), int(right), int(top), int(bottom)
    return time.time()-start,(left, right, top, bottom)
def get_yolo_face(image):
    start=time.time()
    bbox=yolo_model.detect(image, 0.7)
    points=get_boxes_points(bbox,image.shape)
    if not points:
        return time.time()-start,[]
    return time.time()-start,points[0]

In [None]:
def annotate_image(frame,bbox,color):
    if bbox==[]:
        return frame
    frame=frame.copy()
    return cv2.rectangle(frame,(bbox[0],bbox[2]),(bbox[1],bbox[3]),color,10)
def crop_image(frame,bbox):
    left, right, top, bottom=bbox
    return frame[top:bottom,left:right]

In [None]:
_=get_blazeface_face(frame)
_=get_mtcnn_face(frame)
_=get_mobilenet_face(frame)
_=get_yolo_face(frame)

In [None]:
blaze_time,blaze_bboxes=get_blazeface_face(frame)
mtcnn_time,mtcnn_bboxes=get_mtcnn_face(frame)
mobilenet_time,mobilenet_bboxes=get_mobilenet_face(frame)
yolo_time,yolo_bboxes=get_yolo_face(frame)

In [None]:
print("BlazeFace detection time:"+str(blaze_time))
print("MTCNN detection time:"+str(mtcnn_time))
print("Yolo detection time:"+str(yolo_time))
print("Mobilenet detection time:"+str(mobilenet_time))


# Ability to Detect Face

In [None]:
if blaze_bboxes==[]:
    print('⚠️BlazeFace cannot detect faces in this picture.')
if mtcnn_bboxes==[]:
    print('⚠️MTCNN cannot detect faces in this picture.')
if mobilenet_bboxes==[]:
    print('⚠️Mobilenet cannot detect faces in this picture.')
if yolo_bboxes==[]:
    print('⚠️YOLO cannot detect faces in this picture.')


# Accuracy Comparison


In [None]:
annotated=annotate_image(frame,mobilenet_bboxes,(255,0,0))
annotated=annotate_image(annotated,mtcnn_bboxes,(0,255,0))
annotated=annotate_image(annotated,blaze_bboxes,(0,0,255))
annotated=annotate_image(annotated,yolo_bboxes,(255,0,255))

Blue: BlazeFace
Red: Mobilenet
Green: MTCNN
Purple: YOLO

In [None]:
plt.imshow(annotated)

# Cropped Images

# BlazeFace


In [None]:
plt.imshow(crop_image(frame,blaze_bboxes))

# MTCNN


In [None]:
plt.imshow(crop_image(frame,mtcnn_bboxes))

# Mobilenet

In [None]:
plt.imshow(crop_image(frame,mobilenet_bboxes))

# YOLO

In [None]:
plt.imshow(crop_image(frame,yolo_bboxes))

# Extra Comparisons


In [None]:
video='../input/deepfake-detection-challenge/train_sample_videos/eqnoqyfquo.mp4'
cap=cv2.VideoCapture(video)    
ret,frame=cap.read()
frame=cv2.cvtColor(frame,cv2.COLOR_BGR2RGB)
blaze_time,blaze_bboxes=get_blazeface_face(frame)
mtcnn_time,mtcnn_bboxes=get_mtcnn_face(frame)
mobilenet_time,mobilenet_bboxes=get_mobilenet_face(frame)
yolo_time,yolo_bboxes=get_yolo_face(frame)  
print("MTCNN detection time:"+str(mtcnn_time))
print("Yolo detection time:"+str(yolo_time))
print("Mobilenet detection time:"+str(mobilenet_time))
print("BlazeFace detection time:"+str(blaze_time))
annotated=annotate_image(frame,mobilenet_bboxes,(255,0,0))
annotated=annotate_image(annotated,mtcnn_bboxes,(0,255,0))
annotated=annotate_image(annotated,blaze_bboxes,(0,0,255))
annotated=annotate_image(annotated,yolo_bboxes,(255,0,255))
plt.imshow(annotated)

In [None]:
video='../input/deepfake-detection-challenge/train_sample_videos/eqjscdagiv.mp4'
cap=cv2.VideoCapture(video)    
ret,frame=cap.read()
frame=cv2.cvtColor(frame,cv2.COLOR_BGR2RGB)
blaze_time,blaze_bboxes=get_blazeface_face(frame)
mtcnn_time,mtcnn_bboxes=get_mtcnn_face(frame)
mobilenet_time,mobilenet_bboxes=get_mobilenet_face(frame)
yolo_time,yolo_bboxes=get_yolo_face(frame)  
print("MTCNN detection time:"+str(mtcnn_time))
print("Yolo detection time:"+str(yolo_time))
print("Mobilenet detection time:"+str(mobilenet_time))
print("BlazeFace detection time:"+str(blaze_time))
annotated=annotate_image(frame,mobilenet_bboxes,(255,0,0))
annotated=annotate_image(annotated,mtcnn_bboxes,(0,255,0))
annotated=annotate_image(annotated,blaze_bboxes,(0,0,255))
annotated=annotate_image(annotated,yolo_bboxes,(255,0,255))
plt.imshow(annotated)

In [None]:
video='../input/deepfake-detection-challenge/train_sample_videos/emgjphonqb.mp4'
cap=cv2.VideoCapture(video)    
ret,frame=cap.read()
frame=cv2.cvtColor(frame,cv2.COLOR_BGR2RGB)
blaze_time,blaze_bboxes=get_blazeface_face(frame)
mtcnn_time,mtcnn_bboxes=get_mtcnn_face(frame)
mobilenet_time,mobilenet_bboxes=get_mobilenet_face(frame)
yolo_time,yolo_bboxes=get_yolo_face(frame)  
print("MTCNN detection time:"+str(mtcnn_time))
print("Yolo detection time:"+str(yolo_time))
print("Mobilenet detection time:"+str(mobilenet_time))
print("BlazeFace detection time:"+str(blaze_time))
annotated=annotate_image(frame,mobilenet_bboxes,(255,0,0))
annotated=annotate_image(annotated,mtcnn_bboxes,(0,255,0))
annotated=annotate_image(annotated,blaze_bboxes,(0,0,255))
annotated=annotate_image(annotated,yolo_bboxes,(255,0,255))
plt.imshow(annotated)

# 4.	Methods

This part introduces the algorithms used in this work: ResNet, Xception and ResNeXt. We start with ResNet. The Deep Residual Network (ResNet) was a milestone in the history of CNNs for image recognition: it won five first places in the ILSVRC and COCO 2015 competitions. ResNet solved the problem that deep CNN models are difficult to train. VGG, from 2014, has only 19 layers, while ResNet, from 2015, has as many as 152, a very different scale of network depth. Although ResNet wins through depth, it is an architectural trick that allows the depth to pay off: residual learning, which addresses the degradation problem. For a stack of several layers with input x, denote the feature to be learned by H(x). Instead of learning H(x) directly, we want the stack to learn the residual F(x) = H(x) - x, so that the original feature becomes F(x) + x. The motivation is that the residual is easier to learn than the original feature. When the residual is 0, the stacked layers simply perform an identity mapping, so at the very least the network performance does not degrade. In practice the residual is not 0, so the stacked layers learn new features on top of the input features and achieve better performance.
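The identity-mapping argument can be illustrated with a toy residual block in NumPy. This is a minimal sketch rather than the ResNet implementation: the block uses two small dense layers instead of convolutions, omits batch normalization and the ReLU that ResNet applies after the addition, and the names `residual_block`, `w1` and `w2` are hypothetical.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def residual_block(x, w1, w2):
    """Toy residual unit: H(x) = F(x) + x with F made of two linear layers."""
    f = relu(x @ w1) @ w2  # residual branch F(x)
    return f + x           # shortcut adds the input back

x = np.array([[1.0, -2.0, 3.0]])
zeros = np.zeros((3, 3))

# With all residual weights at zero, F(x) = 0 and the block is the identity.
print(residual_block(x, zeros, zeros))
```

Because the block passes its input through unchanged when F(x) = 0, stacking more such blocks cannot make the representation worse than that of a shallower network, which is the intuition behind the degradation argument.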


The ResNet architecture starts from VGG19 and modifies it by adding residual units through a shortcut mechanism. The main changes are that ResNet downsamples directly with stride-2 convolutions and replaces the fully connected layers with a global average pooling layer. An important design principle of ResNet is that when the size of the feature map is halved, the number of feature maps doubles, which keeps the complexity of each layer constant. Compared with a plain network, ResNet adds a shortcut connection across every two layers, which forms the residual learning. ResNet uses two kinds of residual unit, one for shallower networks and one for deeper networks. In the 18-layer and 34-layer ResNets, residual learning spans two layers. In deeper networks it spans three layers whose convolution kernels are 1x1, 3x3 and 1x1; notably, the number of feature maps in the hidden layers is comparatively small, 1/4 of the number of output feature maps. For the shortcut connection, when the input and output dimensions match, the input can be added directly to the output. When the dimensions differ (the number of feature maps has doubled), direct addition is impossible, and there are two strategies: (1) pad with zeros to increase the dimension, which generally also requires downsampling, for example pooling with stride=2, and adds no parameters; (2) use a new mapping (projection shortcut), generally a 1x1 convolution, which adds parameters and computation. Projection shortcuts can also be used in place of the identity mapping when the dimensions already match.

The second is Xception, Google's further improvement on Inception-v3. The structure of Xception is based on ResNet, with the convolutional layers replaced by separable convolutions. The Xception architecture has 36 convolutional layers, which form the feature-extraction base of the network. For image classification, this convolutional base is followed by a logistic regression layer; optionally, fully connected layers can be inserted before it. The 36 convolutional layers are organized into 14 modules, all of which have linear residual connections around them except for the first and last modules. In short, the Xception architecture is a linear stack of depthwise separable convolution layers with residual connections. This makes the architecture very easy to define and modify: with a high-level library such as Keras or TensorFlow-Slim it takes only 30 to 40 lines of code, much like an architecture such as VGG-16 and unlike Inception V2 or V3, which are far more complex to define. Compared with Inception V3, Xception has a small lead in classification performance on ImageNet and a much larger lead on JFT, while having slightly fewer parameters and running at a comparable speed.

The third is ResNeXt, which combines ResNet and Inception. Unlike Inception v4, ResNeXt does not require the structural details of each Inception branch to be designed by hand; instead, every branch uses the same topology. The essence of ResNeXt is grouped convolution, where the number of groups is controlled by a variable called cardinality. Grouped convolution is a compromise between ordinary convolution and depthwise separable convolution: the feature map produced by each branch has n channels (n > 1), so the layer neither assigns every channel its own independent kernel nor applies the same kernels across the entire feature map. Another difference from Inception v4 is that ResNeXt performs the 1×1 convolution on each branch first and then adds the branches element-wise, whereas Inception v4 concatenates the branches first and then performs the 1×1 convolution. In short, ResNeXt balances the two strategies by controlling the number of groups (the cardinality). The idea of grouped convolution is derived from Inception, but where Inception requires each branch to be designed by hand, all branches of ResNeXt share the same topology. Combining this with the residual network gives the final ResNeXt. ResNeXt does have fewer hyperparameters than Inception v4, but completely abandoning Inception's mix of different receptive fields does not seem entirely justified, and Inception v4 has been found to perform better than ResNeXt in a number of settings. A ResNeXt of similar structure should run faster than Inception v4, because branches with an identical topology are a better match for GPU hardware.
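The compromise that grouped convolution strikes can be seen from its weight count. The sketch below uses a hypothetical helper, `grouped_conv_params`, and an example layer with 128 input and 128 output channels; the group count of 32 is an assumed example of a cardinality, and bias terms are ignored.

```python
def grouped_conv_params(kernel, c_in, c_out, groups):
    """Weights of a grouped convolution (illustrative helper, no bias)."""
    if c_in % groups or c_out % groups:
        raise ValueError("channel counts must be divisible by the number of groups")
    # Each group maps c_in/groups input channels to c_out/groups output channels.
    return groups * kernel * kernel * (c_in // groups) * (c_out // groups)

for g in (1, 32, 128):
    print(g, grouped_conv_params(3, 128, 128, g))
# 1 147456   -> ordinary convolution
# 32 4608    -> grouped convolution with 32 groups
# 128 1152   -> one kernel per channel (depthwise convolution)
```

With one group the layer is an ordinary convolution, and with as many groups as channels it becomes a depthwise convolution; the cardinality selects a point between these two extremes.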

These three are the algorithms used in this work. We measure model quality with AUC-ROC (the area under the ROC curve), a commonly used evaluation metric for classifiers in machine learning. It can be summarized in one sentence: given a randomly chosen positive example and a randomly chosen negative example, AUC is the probability that the model assigns the positive example a higher score than the negative one. We show the results of our work by plotting the ROC curve; the area under the curve is the AUC value. The specific plots are given in the next part. To train the models mentioned above, we use Python packages such as numpy, pandas, matplotlib.pyplot, glob, cv2, torch and tensorflow.
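The pairwise interpretation of AUC can be written down directly. The sketch below is a small, quadratic-time illustration for checking intuition on a handful of samples, not the routine used for the results in this report; the function name `pairwise_auc` and the toy labels and scores are made up for the example.

```python
def pairwise_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))

print(pairwise_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

In the toy example three of the four positive/negative pairs are ordered correctly, which gives 0.75. A model that ranks every fake above every real sample would reach 1.0, and random scores give about 0.5.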


# 5.	Experiments and Results

# MesoNet

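We first train a MesoNet network. Before training, we analysed the dataset and found that it is not balanced, so we downsample it. Training is carried out with 5-fold cross-validation, and log_loss is used as the criterion for evaluating the results; accuracy on the test set is a little over 60%. The model we trained ourselves does not perform well, which we believe is related to our use of the dataset not being rigorous enough. We therefore tried other neural network models for prediction. Our experiments show that, among MesoNet, Xception and ResNeXt, ResNeXt performs best.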
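Since log loss is the evaluation criterion here, it helps to see how strongly it punishes confident mistakes. The sketch below is a plain-Python version of binary log loss together with the 0.35 to 0.65 clipping range that the cells further down apply to the predictions; the helper names and the toy predictions are invented for illustration.

```python
import math

def binary_log_loss(y_true, y_pred, eps=1e-15):
    """Mean binary cross-entropy; predictions are kept away from 0 and 1."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

def clip(preds, low=0.35, high=0.65):
    return [min(max(p, low), high) for p in preds]

y_true = [0, 1, 1, 0]
half_wrong = [0.99, 0.01, 0.9, 0.1]  # two confident mistakes, two confident hits

print(binary_log_loss(y_true, [0.5] * 4))        # constant 0.5 gives ln 2
print(binary_log_loss(y_true, half_wrong))       # dominated by the two mistakes
print(binary_log_loss(y_true, clip(half_wrong))) # clipping bounds the penalty
```

A constant prediction of 0.5 scores ln 2, about 0.693, so any model whose clipped predictions do not beat that value is no better than guessing. Clipping gives up some reward on correct predictions in exchange for a bounded penalty on wrong ones, which is a sensible trade when accuracy is only around 60%.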

In [None]:
real=[]
fake=[]
for m,n in zip(paths,y):
    if n==0:
        real.append(m)
    else:
        fake.append(m)
fake=random.sample(fake,len(real))
paths,y=[],[]
for x in real:
    paths.append(x)
    y.append(0)
for x in fake:
    paths.append(x)
    y.append(1)

In [None]:
real=[]
fake=[]
for m,n in zip(val_paths,val_y):
    if n==0:
        real.append(m)
    else:
        fake.append(m)
fake=random.sample(fake,len(real))
val_paths,val_y=[],[]
for x in real:
    val_paths.append(x)
    val_y.append(0)
for x in fake:
    val_paths.append(x)
    val_y.append(1)

In [None]:
print('There are '+str(y.count(1))+' fake train samples')
print('There are '+str(y.count(0))+' real train samples')
print('There are '+str(val_y.count(1))+' fake val samples')
print('There are '+str(val_y.count(0))+' real val samples')

In [None]:
def read_img(path):
    return cv2.cvtColor(cv2.imread(path),cv2.COLOR_BGR2RGB)
X=[]
for img in tqdm(paths):
    X.append(read_img(img))
val_X=[]
for img in tqdm(val_paths):
    val_X.append(read_img(img))

In [None]:
import random
def shuffle(X,y):
    new_train=[]
    for m,n in zip(X,y):
        new_train.append([m,n])
    random.shuffle(new_train)
    X,y=[],[]
    for x in new_train:
        X.append(x[0])
        y.append(x[1])
    return X,y

In [None]:
import pandas as pd
import keras
import os
import numpy as np
from sklearn.metrics import log_loss
from keras import Model,Sequential
from keras.layers import *
from keras.optimizers import *
from sklearn.model_selection import train_test_split
import cv2
from tqdm.notebook import tqdm
import glob
from mtcnn import MTCNN

In [None]:
def InceptionLayer(a, b, c, d):
    def func(x):
        x1 = Conv2D(a, (1, 1), padding='same', activation='elu')(x)
        
        x2 = Conv2D(b, (1, 1), padding='same', activation='elu')(x)
        x2 = Conv2D(b, (3, 3), padding='same', activation='elu')(x2)
            
        x3 = Conv2D(c, (1, 1), padding='same', activation='elu')(x)
        x3 = Conv2D(c, (3, 3), dilation_rate = 2, strides = 1, padding='same', activation='elu')(x3)
        
        x4 = Conv2D(d, (1, 1), padding='same', activation='elu')(x)
        x4 = Conv2D(d, (3, 3), dilation_rate = 3, strides = 1, padding='same', activation='elu')(x4)
        y = Concatenate(axis = -1)([x1, x2, x3, x4])
            
        return y
    return func
    
def define_model(shape=(256,256,3)):
    x = Input(shape = shape)
    
    x1 = InceptionLayer(1, 4, 4, 2)(x)
    x1 = BatchNormalization()(x1)
    x1 = MaxPooling2D(pool_size=(2, 2), padding='same')(x1)
    
    x2 = InceptionLayer(2, 4, 4, 2)(x1)
    x2 = BatchNormalization()(x2)        
    x2 = MaxPooling2D(pool_size=(2, 2), padding='same')(x2)        
        
    x3 = Conv2D(16, (5, 5), padding='same', activation = 'elu')(x2)
    x3 = BatchNormalization()(x3)
    x3 = MaxPooling2D(pool_size=(2, 2), padding='same')(x3)
        
    x4 = Conv2D(16, (5, 5), padding='same', activation = 'elu')(x3)
    x4 = BatchNormalization()(x4)
    if shape==(256,256,3):
        x4 = MaxPooling2D(pool_size=(4, 4), padding='same')(x4)
    else:
        x4 = MaxPooling2D(pool_size=(2, 2), padding='same')(x4)
    y = Flatten()(x4)
    y = Dropout(0.5)(y)
    y = Dense(16)(y)
    y = LeakyReLU(alpha=0.1)(y)
    y = Dropout(0.5)(y)
    y = Dense(1, activation = 'sigmoid')(y)
    model=Model(inputs = x, outputs = y)
    model.compile(loss='binary_crossentropy',optimizer=Adam(lr=1e-4))
    #model.summary()
    return model
df_model=define_model()
df_model.load_weights('../input/mesopretrain/MesoInception_DF')
f2f_model=define_model()
f2f_model.load_weights('../input/mesopretrain/MesoInception_F2F')

In [None]:
from keras.callbacks import LearningRateScheduler
lrs=[1e-3,5e-4,1e-4]
def schedule(epoch):
    return lrs[epoch]

In [None]:
LOAD_PRETRAIN=True

In [None]:
import gc
import keras.backend as K  # required for the K.clear_session() calls below
kfolds=5
losses=[]
if LOAD_PRETRAIN:
    df_models=[]
    f2f_models=[]
    i=0
    while len(df_models)<kfolds:
        model=define_model((150,150,3))
        if i==0:
            model.summary()
        #model.load_weights('../input/meso-pretrain/MesoInception_DF')
        for new_layer, layer in zip(model.layers[1:-8], df_model.layers[1:-8]):
            new_layer.set_weights(layer.get_weights())
        model.fit([X],[y],epochs=2,callbacks=[LearningRateScheduler(schedule)])
        pred=model.predict([val_X])
        loss=log_loss(val_y,pred)
        losses.append(loss)
        print('fold '+str(i)+' model loss: '+str(loss))
        df_models.append(model)
        K.clear_session()
        del model
        gc.collect()
        i+=1
    i=0
    while len(f2f_models)<kfolds:
        model=define_model((150,150,3))
        #model.load_weights('../input/meso-pretrain/MesoInception_DF')
        for new_layer, layer in zip(model.layers[1:-8], f2f_model.layers[1:-8]):
            new_layer.set_weights(layer.get_weights())
        model.fit([X],[y],epochs=2,callbacks=[LearningRateScheduler(schedule)])
        pred=model.predict([val_X])
        loss=log_loss(val_y,pred)
        losses.append(loss)
        print('fold '+str(i)+' model loss: '+str(loss))
        f2f_models.append(model)
        K.clear_session()
        del model
        gc.collect()
        i+=1
    # Combine both ensembles once training has finished
    models=f2f_models+df_models
else:
    models=[]
    i=0
    while len(models)<kfolds:
        model=define_model((150,150,3))
        if i==0:
            model.summary()
        model.fit([X],[y],epochs=2,callbacks=[LearningRateScheduler(schedule)])
        pred=model.predict([val_X])
        loss=log_loss(val_y,pred)
        losses.append(loss)
        print('fold '+str(i)+' model loss: '+str(loss))
        if loss<0.68:
            models.append(model)
        else:
            print('loss too bad, retrain!')
        K.clear_session()
        del model
        gc.collect()
        i+=1

In [None]:
def prediction_pipline(X,two_times=False):
    preds=[]
    for model in tqdm(models):
        pred=model.predict([X])
        preds.append(pred)
    preds=sum(preds)/len(preds)
    if two_times:
        return larger_range(preds,2)
    else:
        return preds
def larger_range(model_pred,time):
    return (((model_pred-0.5)*time)+0.5)

In [None]:
model_pred=prediction_pipline(val_X)

In [None]:
random_pred=np.random.random(len(val_X))
print('random loss: ' + str(log_loss(val_y,random_pred.clip(0.35,0.65))))
allone_pred=np.array([1 for _ in range(len(val_X))])
print('1 loss: ' + str(log_loss(val_y,allone_pred)))
allzero_pred=np.array([0 for _ in range(len(val_X))])
print('0 loss: ' + str(log_loss(val_y,allzero_pred)))
allpoint5_pred=np.array([0.5 for _ in range(len(val_X))])
print('0.5 loss: ' + str(log_loss(val_y,allpoint5_pred)))

In [None]:
print('Simple Averaging Loss: '+str(log_loss(val_y,model_pred.clip(0.35,0.65))))
print('Two Times Larger Range(Averaging) Loss: '+str(log_loss(val_y,larger_range(model_pred,2).clip(0.35,0.65))))
if log_loss(val_y,model_pred.clip(0.35,0.65))<log_loss(val_y,larger_range(model_pred,2).clip(0.35,0.65)):
    two_times=False
    print('simple averaging is better')
else:
    two_times=True
    print('two times larger range is better')
two_times=False 
#This is not a bug. I did this intentionally because the model can't get most of the private validation set right(based on LB)

In [None]:
import scipy.stats  # a bare "import scipy" does not guarantee that scipy.stats is loaded
print(model_pred.clip(0.35,0.65).mean())
print(scipy.stats.median_absolute_deviation(model_pred.clip(0.35,0.65))[0])

In [None]:
def check_answers(pred,real,num):
    for i,(x,y) in enumerate(zip(pred,real)):
        correct_incorrect='correct ✅ ' if round(float(x),0)==round(float(y),0) else 'incorrect❌'
        print(correct_incorrect+' prediction: '+str(x[0])+', answer: '+str(y))
        if i>num:
            return
def correct_precentile(pred,real):
    correct=0
    incorrect=0
    for x,y in zip(pred,real):
        if round(float(x),0)==round(float(y),0):
            correct+=1
        else:
            incorrect+=1
    print('number correct: '+str(correct)+', number incorrect: '+str(incorrect))
    print(str(round(correct/len(real)*100,1))+'% correct'+', '+str(round(incorrect/len(real)*100,1))+'% incorrect')
check_answers(model_pred,val_y,15)
correct_precentile(model_pred,val_y)

We have already introduced the ResNeXt and Xception networks. Below, we run predictions with these two pre-trained models and evaluate them with AUC.

# ResNeXt Model

In [None]:

test_dir = "/kaggle/input/deepfake-detection-challenge/train_sample_videos/"

test_videos = sorted([x for x in os.listdir(test_dir) if x[-4:] == ".mp4"])
len(test_videos)

In [None]:
print("PyTorch version:", torch.__version__)
print("CUDA version:", torch.version.cuda)
print("cuDNN version:", torch.backends.cudnn.version())

In [None]:
gpu = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

In [None]:
sys.path.insert(0, "../input/blazefaceepytorch")
sys.path.insert(0, "../input/deepfakesinferencedemo")

In [None]:
from blazeface import BlazeFace
facedet = BlazeFace()
facedet.load_weights("../input/blazefaceepytorch/blazeface.pth")
facedet.load_anchors("../input/blazefaceepytorch/anchors.npy")
_ = facedet.train(False)

In [None]:
from read_video_1 import VideoReader
from face_extract_1 import FaceExtractor
import cv2

frames_per_video = 10 #frame_h * frame_l
video_reader = VideoReader()
video_read_fn = lambda x: video_reader.read_frames(x, num_frames=frames_per_video)
face_extractor = FaceExtractor(video_read_fn, facedet)

In [None]:
input_size = 224

In [None]:
from torchvision.transforms import Normalize

mean = [0.485, 0.456, 0.406]
std = [0.229, 0.224, 0.225]
normalize_transform = Normalize(mean, std)

In [None]:
def isotropically_resize_image(img, size, resample=cv2.INTER_AREA):
    h, w = img.shape[:2]
    if w > h:
        h = h * size // w
        w = size
    else:
        w = w * size // h
        h = size

    resized = cv2.resize(img, (w, h), interpolation=resample)
    return resized


def make_square_image(img):
    h, w = img.shape[:2]
    size = max(h, w)
    t = 0
    b = size - h
    l = 0
    r = size - w
    return cv2.copyMakeBorder(img, t, b, l, r, cv2.BORDER_CONSTANT, value=0)

In [None]:
import torchvision.models as models

class MyResNeXt(models.resnet.ResNet):
    def __init__(self, training=True):
        super(MyResNeXt, self).__init__(block=models.resnet.Bottleneck,
                                        layers=[3, 4, 6, 3], 
                                        groups=32, 
                                        width_per_group=4)
        self.fc = nn.Linear(2048, 1)

In [None]:
# map_location lets the checkpoint load on a CPU-only machine as well
checkpoint = torch.load("../input/deepfakesinferencedemo/resnext.pth", map_location=gpu)

model = MyResNeXt().to(gpu)
model.load_state_dict(checkpoint)
_ = model.eval()

del checkpoint

In [None]:
def predict_on_video(video_path, batch_size):
    try:
        # Find the faces for N frames in the video.
        faces = face_extractor.process_video(video_path)

        # Only look at one face per frame.
        face_extractor.keep_only_best_face(faces)
                
        if len(faces) > 0:
            # NOTE: When running on the CPU, the batch size must be fixed
            # or else memory usage will blow up. (Bug in PyTorch?)
            x = np.zeros((batch_size, input_size, input_size, 3), dtype=np.uint8)

            # If we found any faces, prepare them for the model.
            n = 0
            for frame_data in faces:
                for face in frame_data["faces"]:
                    # Resize to the model's required input size.
                    # We keep the aspect ratio intact and add zero
                    # padding if necessary.                    
                    resized_face = isotropically_resize_image(face, input_size)
                    resized_face = make_square_image(resized_face)

                    if n < batch_size:
                        x[n] = resized_face
                        n += 1
                    else:
                        print("WARNING: have %d faces but batch size is %d" % (n, batch_size))
                    
                    # Test time augmentation: horizontal flips.
                    # TODO: not sure yet if this helps or not
                    #x[n] = cv2.flip(resized_face, 1)
                    #n += 1

            if n > 0:
                x = torch.tensor(x, device=gpu).float()

                # Preprocess the images.
                x = x.permute((0, 3, 1, 2))

                for i in range(len(x)):
                    x[i] = normalize_transform(x[i] / 255.)

                # Make a prediction, then take the average.
                with torch.no_grad():
                    y_pred = model(x)
                    y_pred = torch.sigmoid(y_pred.squeeze())
                    return y_pred[:n].mean().item()

    except Exception as e:
        print("Prediction error on video %s: %s" % (video_path, str(e)))

    return 0.5

In [None]:
from concurrent.futures import ThreadPoolExecutor

def predict_on_video_set(videos, num_workers):
    def process_file(i):
        filename = videos[i]
        y_pred = predict_on_video(os.path.join(test_dir, filename), batch_size=frames_per_video)
        return y_pred

    with ThreadPoolExecutor(max_workers=num_workers) as ex:
        predictions = ex.map(process_file, range(len(videos)))

    return list(predictions)

In [None]:
speed_test = True  # you have to enable this manually

In [None]:
if speed_test:
    start_time = time.time()
    speedtest_videos = test_videos[:5]
    predictions = predict_on_video_set(speedtest_videos, num_workers=4)
    elapsed = time.time() - start_time
    print("Elapsed %f sec. Average per video: %f sec." % (elapsed, elapsed / len(speedtest_videos)))
    print(speedtest_videos)

In [None]:
predictions = predict_on_video_set(test_videos, num_workers=4)

In [None]:
submission_df_resnext = pd.DataFrame({"filename": test_videos, "label": predictions})#
submission_df_resnext.to_csv("submission_Resnext_10.csv", index=False)

Regression and machine learning models are built on mathematical functions, so they work best when every variable is numeric. Categorical values in a data set cannot be processed mathematically as they are. The Fake/Real labels in our data, for example, have to be replaced with 0 and 1 before analysis. We therefore need a way to turn text categories into numbers when preprocessing the data. In this work we use LabelEncoder, a class in the scikit-learn package, for this conversion. We first create an encoder object (le in the code), and when its fit_transform method is called, the Fake and Real category labels are converted into the values 0 and 1.

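As a small illustration of this conversion, the sketch below runs LabelEncoder on a toy list. It assumes the label column holds the strings FAKE and REAL; the list `toy_labels` is made up for the example.

```python
from sklearn.preprocessing import LabelEncoder

# Toy labels (assumed spelling; the real values come from the metadata).
toy_labels = ["FAKE", "REAL", "FAKE", "FAKE", "REAL"]

le = LabelEncoder()
encoded = le.fit_transform(toy_labels)

# LabelEncoder stores the classes in sorted order and numbers them from 0,
# so the alphabetically first class receives code 0.
print(list(le.classes_))
print(encoded.tolist())
```

Under that assumption FAKE becomes 0 and REAL becomes 1, so a score that grows with the probability of a fake has to be flipped (1 - p) before it is compared with the encoded labels, which is consistent with the `1-item` step used before the ROC curves.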
In [None]:
from sklearn.preprocessing import LabelEncoder 
le = LabelEncoder()
y = le.fit_transform(meta['label'])


In [None]:
# Flip each score (1 - p) so that the scores are oriented like the encoded labels in y
pr = [1 - item for item in predictions]


In [None]:
import sklearn.metrics as metrics
import matplotlib.pyplot as plt

# Scores of the ResNeXt model, flipped in the previous cell
y_pre_resnext = pr

# calculate the fpr and tpr for all thresholds of the classification
fpr, tpr, threshold = metrics.roc_curve(y, y_pre_resnext)
roc_auc = metrics.auc(fpr, tpr)

# plot the ROC curve with matplotlib
plt.title('Receiver Operating Characteristic')
plt.plot(fpr, tpr, 'b', label = 'AUC = %0.2f' % roc_auc)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
# This cell evaluates ResNeXt, so the figure gets its own file name
plt.savefig('./resnext.png')
plt.show()


# Xception

In [None]:
!pip install ../input/deepfakexceptiontrainmodel/pytorchcv-0.0.55-py2.py3-none-any.whl --quiet

In [None]:
input_size = 224

In [None]:
def isotropically_resize_image(img, size, resample=cv2.INTER_AREA):
    h, w = img.shape[:2]
    if w > h:
        h = h * size // w
        w = size
    else:
        w = w * size // h
        h = size

    resized = cv2.resize(img, (w, h), interpolation=resample)
    return resized


def make_square_image(img):
    h, w = img.shape[:2]
    size = max(h, w)
    t = 0
    b = size - h
    l = 0
    r = size - w
    return cv2.copyMakeBorder(img, t, b, l, r, cv2.BORDER_CONSTANT, value=0)

In [None]:
!ls ../input/deepfakexceptiontrainmodel

In [None]:
from pytorchcv.model_provider import get_model
model = get_model("xception", pretrained=False)
model = nn.Sequential(*list(model.children())[:-1]) # Remove original output layer

class Pooling(nn.Module):
  def __init__(self):
    super(Pooling, self).__init__()
    
    self.p1 = nn.AdaptiveAvgPool2d((1,1))
    self.p2 = nn.AdaptiveMaxPool2d((1,1))

  def forward(self, x):
    x1 = self.p1(x)
    x2 = self.p2(x)
    return (x1+x2) * 0.5

model[0].final_block.pool = nn.Sequential(nn.AdaptiveAvgPool2d((1,1)))

class Head(torch.nn.Module):
  def __init__(self, in_f, out_f):
    super(Head, self).__init__()
    
    self.f = nn.Flatten()
    self.l = nn.Linear(in_f, 512)
    self.d = nn.Dropout(0.5)
    self.o = nn.Linear(512, out_f)
    self.b1 = nn.BatchNorm1d(in_f)
    self.b2 = nn.BatchNorm1d(512)
    self.r = nn.ReLU()

  def forward(self, x):
    x = self.f(x)
    x = self.b1(x)
    x = self.d(x)

    x = self.l(x)
    x = self.r(x)
    x = self.b2(x)
    x = self.d(x)

    out = self.o(x)
    return out

class FCN(torch.nn.Module):
  def __init__(self, base, in_f):
    super(FCN, self).__init__()
    self.base = base
    self.h1 = Head(in_f, 1)
  
  def forward(self, x):
    x = self.base(x)
    return self.h1(x)


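net = []
model = FCN(model, 2048)
model = model.to(gpu)  # falls back to the CPU when CUDA is unavailable
model.load_state_dict(torch.load('../input/deepfakexceptiontrainmodel/model.pth', map_location=gpu)) # new, updated
_ = model.eval()  # inference mode: dropout off, BatchNorm uses its running statistics
net.append(model)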

In [None]:
def predict_on_video(video_path, batch_size):
    try:
        # Find the faces for N frames in the video.
        faces = face_extractor.process_video(video_path)

        # Only look at one face per frame.
        face_extractor.keep_only_best_face(faces)
        
        if len(faces) > 0:
            # NOTE: When running on the CPU, the batch size must be fixed
            # or else memory usage will blow up. (Bug in PyTorch?)
            x = np.zeros((batch_size, input_size, input_size, 3), dtype=np.uint8)

            # If we found any faces, prepare them for the model.
            n = 0
            for frame_data in faces:
                for face in frame_data["faces"]:
                    # Resize to the model's required input size.
                    # We keep the aspect ratio intact and add zero
                    # padding if necessary.                    
                    resized_face = isotropically_resize_image(face, input_size)
                    resized_face = make_square_image(resized_face)

                    if n < batch_size:
                        x[n] = resized_face
                        n += 1
                    else:
                        print("WARNING: have %d faces but batch size is %d" % (n, batch_size))
                    
                    # Test time augmentation: horizontal flips.
                    # TODO: not sure yet if this helps or not
                    #x[n] = cv2.flip(resized_face, 1)
                    #n += 1

            if n > 0:
                x = torch.tensor(x, device=gpu).float()

                # Preprocess the images.
                x = x.permute((0, 3, 1, 2))

                for i in range(len(x)):
                    x[i] = normalize_transform(x[i] / 255.)
#                     x[i] = x[i] / 255.

                # Make a prediction, then take the average.
                with torch.no_grad():
                    y_pred = model(x)
                    y_pred = torch.sigmoid(y_pred.squeeze())
                    return y_pred[:n].mean().item()

    except Exception as e:
        print("Prediction error on video %s: %s" % (video_path, str(e)))

    return 0.5

In [None]:
from concurrent.futures import ThreadPoolExecutor

def predict_on_video_set(videos, num_workers):
    def process_file(i):
        filename = videos[i]
        y_pred = predict_on_video(os.path.join(test_dir, filename), batch_size=frames_per_video)
        return y_pred

    with ThreadPoolExecutor(max_workers=num_workers) as ex:
        predictions = ex.map(process_file, range(len(videos)))

    return list(predictions)

In [None]:
if True:
    start_time = time.time()
    speedtest_videos = test_videos[:1]
    predictions = predict_on_video_set(speedtest_videos, num_workers=4)
    elapsed = time.time() - start_time
    print("Elapsed %f sec. Average per video: %f sec." % (elapsed, elapsed / len(speedtest_videos)))

In [None]:
predictions = predict_on_video_set(test_videos, num_workers=4)

In [None]:
# Flip each score (1 - p) so that the scores are oriented like the encoded labels in y
pr1 = [1 - item for item in predictions]

In [None]:
import sklearn.metrics as metrics
import matplotlib.pyplot as plt

# Scores of the Xception model, flipped in the previous cell
y_pre_xcep = pr1

# calculate the fpr and tpr for all thresholds of the classification
fpr, tpr, threshold = metrics.roc_curve(y, y_pre_xcep)
roc_auc = metrics.auc(fpr, tpr)

# plot the ROC curve with matplotlib
plt.title('Receiver Operating Characteristic')
plt.plot(fpr, tpr, 'b', label = 'AUC = %0.2f' % roc_auc)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.savefig('./xception1.png')
plt.show()

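The three algorithms above are the ones used in this work. We use AUC-ROC (Area Under Curve) as the metric for the quality of the machine learning models.
AUC-ROC is a commonly used evaluation metric for classifiers in machine learning. In one sentence: given one positive example and one negative example chosen at random, AUC is the probability that the classifier gives the positive example a higher positive-class score than the negative example. We plot ROC curves to show our results; the area under the ROC curve is the AUC value. The detailed figures are given in the next section.
To train the models mentioned above, we used Python packages including numpy, pandas, matplotlib.pyplot, glob, cv2, torch and tensorflow.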

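The pairwise reading of AUC described above can be checked directly on a small example. This is a minimal sketch: `pairwise_auc` is a hypothetical helper written for illustration (the notebook itself uses `metrics.roc_curve` and `metrics.auc`), and the toy labels and scores are made up.

```python
import numpy as np

def pairwise_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    A tie between a positive and a negative score counts as half a pair.
    """
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    # diff[i, j] > 0 means positive i is scored above negative j.
    diff = pos[:, None] - neg[None, :]
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return float(wins / (len(pos) * len(neg)))

toy_y = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]

# Pairs: (0.35 vs 0.1) ok, (0.35 vs 0.4) wrong, (0.8 vs 0.1) ok, (0.8 vs 0.4) ok.
auc_value = pairwise_auc(toy_y, toy_scores)
print(auc_value)  # 3 of 4 pairs are ordered correctly -> 0.75
```

The pair count is quadratic in the number of samples, so it is only meant as a check of the definition; for real evaluation the ROC-based computation used in the notebook is the practical choice.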

# 6.	Conclusion and Future Works

In this work, we first split the video data set into individual frames. We then compared the running time of four face detection methods and chose the fastest one, BlazeFace, to extract faces. Next, we trained and validated a ResNet-like CNN, but this approach did not perform well. We therefore switched to the pre-trained ResNeXt and Xception models, and ResNeXt gave the best results. In future work, we plan to explore a deep ensemble learning approach similar to DeepfakeStack [5] to achieve better recognition results and to find the optimal ensemble strategy.

# 7.	Bibliography

[1] N. S. Ivanov, A. V. Arzhskov and V. G. Ivanenko, "Combining Deep Learning and Super-Resolution Algorithms for Deep Fake Detection," 2020 IEEE Conference of Russian Young Researchers in Electrical and Electronic Engineering (EIConRus), St. Petersburg and Moscow, Russia, 2020, pp. 326-328, doi: 10.1109/EIConRus49466.2020.9039498.
[2] D. Afchar, V. Nozick, J. Yamagishi and I. Echizen, "MesoNet: a Compact Facial Video Forgery Detection Network," 2018 IEEE International Workshop on Information Forensics and Security (WIFS), Hong Kong, Hong Kong, 2018, pp. 1-7, doi: 10.1109/WIFS.2018.8630761.
[3] H. M. Nguyen and R. Derakhshani, "Eyebrow Recognition for Identifying Deepfake Videos," 2020 International Conference of the Biometrics Special Interest Group (BIOSIG), Darmstadt, Germany, 2020, pp. 1-5.
[4] B. Malolan, A. Parekh and F. Kazi, "Explainable Deep-Fake Detection Using Visual Interpretability Methods," 2020 3rd International Conference on Information and Computer Technologies (ICICT), San Jose, CA, USA, 2020, pp. 289-293, doi: 10.1109/ICICT50521.2020.00051.
[5] M. S. Rana and A. H. Sung, "DeepfakeStack: A Deep Ensemble-based Learning Technique for Deepfake Detection," 2020 7th IEEE International Conference on Cyber Security and Cloud Computing (CSCloud)/2020 6th IEEE International Conference on Edge Computing and Scalable Cloud (EdgeCom), New York, NY, USA, 2020, pp. 70-75, doi: 10.1109/CSCloud-EdgeCom49738.2020.00021.


That concludes our work.