# Deepfake Detection Challenge
- Identify videos with facial or voice manipulations

This [competition](https://deepfakedetectionchallenge.ai/), produced in cooperation with Amazon, Microsoft, the nonprofit Partnership on AI, and academics from eight universities, has researchers around the world vying to create automated tools that can spot fraudulent media. The competition was announced at the AI conference NeurIPS and will accept entries through March 2020. Facebook has dedicated more than US $10 million to awards and grants.


Deepfake techniques, which present realistic AI-generated videos of people doing and saying fictional things, could significantly affect how people judge the legitimacy of information presented online. These content generation and modification technologies may affect the quality of public discourse and the safeguarding of human rights, especially given that deepfakes may be used maliciously as a source of misinformation, manipulation, harassment, and persuasion. Identifying manipulated media is a technically demanding and rapidly evolving challenge that requires collaboration across the entire tech industry and beyond.

![](https://spectrum.ieee.org/image/MzQyNjU1OQ.jpeg)

Deepfakes use AI (artificial intelligence) and machine learning to manipulate videos and other forms of digital media. The results are images, videos, or audio clips that appear to be real. The technology analyzes videos and images of the target person from all angles and then mimics that person's behavior and speech.

# Import Packages

In [None]:
SAMPLE_SUB = "../input/deepfake-detection-challenge/sample_submission.csv"
TRAIN_VIDEOS = "../input/deepfake-detection-challenge/train_sample_videos"
TEST_VIDEOS = "../input/deepfake-detection-challenge/test_videos"
TRAIN_JSON_PATH = "../input/deepfake-detection-challenge/train_sample_videos/metadata.json"

In [None]:
#import packages

import pandas as pd
import numpy as np
import cv2
import seaborn as sns
import matplotlib.pyplot as plt
import os

%matplotlib inline

# Data Files

* **train_sample_videos.zip** - a ZIP file containing a sample set of training videos and a metadata.json with labels. The full set of training videos is available through the links provided above.
* **sample_submission.csv** - a sample submission file in the correct format.
* **test_videos.zip** - a ZIP file containing a small set of videos to be used as a public validation set.

In [None]:
# Count the videos present in the train and test data.
# TRAIN_VIDEOS also holds metadata.json, so keep only the .mp4 files.

train_videos = [f for f in os.listdir(TRAIN_VIDEOS) if f.endswith(".mp4")]
test_videos = [f for f in os.listdir(TEST_VIDEOS) if f.endswith(".mp4")]

n_train_videos = len(train_videos)
n_test_videos = len(test_videos)

print("Number of training videos: ", n_train_videos)
print("Number of testing videos: ", n_test_videos)

> The data consists of .mp4 files, split into compressed sets of ~10GB apiece. A **metadata.json** accompanies each set of .mp4 files and contains the `filename`, `label` (REAL/FAKE), `original`, and `split` columns.

In [None]:
#read the json file

deepfake_labels = pd.read_json(TRAIN_JSON_PATH).T

deepfake_labels.head()
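
The transpose in the cell above is needed because of how the file is laid out. As a minimal sketch, assuming metadata.json is a JSON object keyed by video filename with the `label`, `split`, and `original` fields nested under each key (the filenames below are made up for illustration), a table with the same row-per-video layout can also be built without `.T`:

```python
import json
import pandas as pd

# Hypothetical two-entry excerpt in the assumed metadata.json layout.
toy_metadata = json.loads("""
{
  "fake_clip.mp4": {"label": "FAKE", "split": "train", "original": "real_clip.mp4"},
  "real_clip.mp4": {"label": "REAL", "split": "train", "original": null}
}
""")

# orient="index" turns each top-level key (a filename) into a row,
# giving one row per video, like pd.read_json(...).T does.
toy_labels = pd.DataFrame.from_dict(toy_metadata, orient="index")
print(toy_labels)
```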

# Data Analysis - MetaData File

In [None]:
#start with the basic analysis - Missing value analysis

missing_df = pd.DataFrame({"Missing_Count": deepfake_labels.isnull().sum(),
                          "Missing_Percent": round(deepfake_labels.isnull().mean(),2)})
missing_df

* Around 20% of the videos have no original video associated with them. These 77 videos are likely the REAL ones, which would explain why their `original` column is empty.
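
The guess that the missing `original` values belong to REAL videos can be checked directly. The sketch below uses a small hypothetical table in place of `deepfake_labels` (the filenames are made up); on the real data, the same comparison would be run on `deepfake_labels` itself.

```python
import pandas as pd

# Hypothetical stand-in for deepfake_labels, indexed by filename.
toy_labels = pd.DataFrame(
    {
        "label": ["FAKE", "REAL", "FAKE", "REAL"],
        "split": ["train"] * 4,
        "original": ["b.mp4", None, "d.mp4", None],
    },
    index=["a.mp4", "b.mp4", "c.mp4", "d.mp4"],
)

# Cross-tabulate the label against whether `original` is missing.
missing_original = toy_labels["original"].isnull()
print(pd.crosstab(toy_labels["label"], missing_original))

# True only if `original` is missing for exactly the REAL rows.
missing_matches_real = bool(
    (missing_original == (toy_labels["label"] == "REAL")).all()
)
print("Missing originals are exactly the REAL videos:", missing_matches_real)
```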

In [None]:
#check the distribution of labels
plt.style.use("ggplot")
plt.rcParams['figure.figsize'] = 14,7

deepfake_labels.label.value_counts(normalize = True).plot(kind = "barh")
plt.xlabel("Percentage of labels")
plt.title("Distribution of labels for videos")
plt.show()

- The training data is skewed: more than 80% of it consists of `FAKE` videos.

# Exploratory Data Analysis - Video Data

In [None]:
# we have already created a list of training data and test data videos

print("Training videos: " ,train_videos[:5])
print("Testing videos: ", test_videos[:5])

- We will look at some of the FAKE and REAL videos. First, we collect the names of the FAKE and REAL training videos into separate lists.

In [None]:
train_fake_lst = deepfake_labels[deepfake_labels["label"] == "FAKE"].index.tolist()
train_real_lst = deepfake_labels[deepfake_labels["label"] == "REAL"].index.tolist()

- To display a single frame from a video, we will use the function from the kernel **"[Basic EDA Face Detection, split video and ROI](https://www.kaggle.com/marcovasquez/basic-eda-face-detection-split-video-and-roi)"**

In [None]:
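# display the first frame of a video on the given matplotlib axis

def display_frame(video, axis):
    cap = cv2.VideoCapture(video)
    ret, frame = cap.read()
    cap.release()  # free the video handle once the frame is read
    if not ret:
        raise ValueError("Could not read a frame from " + video)
    frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)  # OpenCV decodes frames as BGR
    axis.imshow(frame)
    axis.grid(False)
    video_name = os.path.basename(video)
    axis.set_title("Video: " + video_name, size = 30)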
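
`display_frame` only ever shows the first frame of a video. As a rough sketch of one way to look further into a clip, the hypothetical helpers below (not part of the referenced kernel) pick evenly spaced frame positions and read them through OpenCV's `CAP_PROP_FRAME_COUNT` and `CAP_PROP_POS_FRAMES` properties. Frame counts reported by OpenCV can be approximate for some codecs, so the positions should be treated as approximate too.

```python
def evenly_spaced_indices(total_frames, n_frames):
    """Return up to n_frames frame indices spread across [0, total_frames - 1]."""
    if total_frames <= 0 or n_frames <= 0:
        return []
    if n_frames == 1 or total_frames == 1:
        return [0]
    n_frames = min(n_frames, total_frames)
    step = (total_frames - 1) / (n_frames - 1)
    return [round(i * step) for i in range(n_frames)]


def read_frames(video_path, n_frames=3):
    """Read up to n_frames RGB frames spread across the video (hypothetical helper)."""
    import cv2  # imported here so the index helper above works without OpenCV

    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frames = []
    for idx in evenly_spaced_indices(total, n_frames):
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)  # seek to the requested frame
        ret, frame = cap.read()
        if ret:
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()
    return frames


print(evenly_spaced_indices(300, 3))
```

With the competition data, `read_frames(os.path.join(TRAIN_VIDEOS, train_fake_lst[0]))` would return a list of frames that can each be passed to `axis.imshow`.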

# Analysis of Training - FAKE Videos
- Look at the FAKE videos present in training data

In [None]:
random_num = np.random.randint(0,len(train_fake_lst))
fig,axs = plt.subplots(nrows = 1, ncols=3, figsize=(50,40))
for i in range(3):
    display_frame(os.path.join(TRAIN_VIDEOS, train_fake_lst[random_num]), axs[i])

In [None]:
random_num = np.random.randint(0,len(train_fake_lst))
fig,axs = plt.subplots(nrows = 1, ncols=3, figsize=(50,40))
for i in range(3):
    display_frame(os.path.join(TRAIN_VIDEOS, train_fake_lst[random_num]), axs[i])

# Analysis of Training - REAL Videos
- Look at the REAL videos present in training data

In [None]:
random_num = np.random.randint(0,len(train_real_lst))
fig,axs = plt.subplots(nrows = 1, ncols=3, figsize=(50,40))
for i in range(3):
    display_frame(os.path.join(TRAIN_VIDEOS, train_real_lst[random_num]), axs[i])

- Another REAL Video

In [None]:
random_num = np.random.randint(0,len(train_real_lst))
fig,axs = plt.subplots(nrows = 1, ncols=3, figsize=(50,40))
for i in range(3):
    display_frame(os.path.join(TRAIN_VIDEOS, train_real_lst[random_num]), axs[i])

In [None]:
random_num = np.random.randint(0,len(train_real_lst))
fig,axs = plt.subplots(nrows = 1, ncols=3, figsize=(50,40))
for i in range(3):
    display_frame(os.path.join(TRAIN_VIDEOS, train_real_lst[random_num]), axs[i])

# Submission

In [None]:
#checking the label distribution

deepfake_labels.label.value_counts(normalize = True)

- Our data is imbalanced (about 80% of the videos are FAKE), so for the sample submission we will assume that the test data might follow the same distribution.

In [None]:
sample_df = pd.read_csv(SAMPLE_SUB)
sample_df.head()

In [None]:
# constant prediction for every video, leaning towards FAKE
sample_df["label"] = 0.65
sample_df.to_csv("submission.csv", index = False)
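
To see what a constant prediction such as 0.65 implies, here is a minimal sketch that assumes submissions are scored with binary log loss and that `label` is the predicted probability of FAKE. The notebook itself does not state the metric, so both are assumptions. The fake fractions tried below are hypothetical scenarios, not measurements of the test set.

```python
import math


def constant_log_loss(p, fake_fraction):
    """Expected binary log loss when every video gets the same FAKE probability p."""
    return -(fake_fraction * math.log(p) + (1 - fake_fraction) * math.log(1 - p))


# Compare a few constant predictions under two hypothetical test distributions.
for fake_fraction in (0.8, 0.5):
    for p in (0.5, 0.65, 0.8):
        loss = constant_log_loss(p, fake_fraction)
        print(f"fake_fraction={fake_fraction:.2f}  p={p:.2f}  log_loss={loss:.4f}")
```

Under these assumptions, a constant equal to the true fake fraction minimizes the expected loss. A value of 0.8 would therefore be the better constant if the test set really were 80% FAKE, while 0.65 gives up a little in that case in exchange for a smaller penalty if the test set turns out to be closer to balanced.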