# This notebook
This notebook is based on several other notebooks. I want to thank their authors for sharing their thoughts and code.

- https://www.kaggle.com/zaharch/train-set-metadata-for-dfdc
- https://www.kaggle.com/basharallabadi/dfdc-video-audio-labels
- https://www.kaggle.com/rakibilly/extract-audio-starter

This notebook derives separate deepfake labels for the audio track and the video track of each clip. At the end of the notebook I've also included code for visualizing a sample frame from a video and for creating an audio listening interface.

In [None]:
import numpy as np
import pandas as pd
import os
import random
import subprocess
from pathlib import Path
import IPython

import seaborn as sns
import cv2
import matplotlib.pyplot as plt
from tqdm import tqdm
tqdm.pandas()

import warnings
warnings.filterwarnings("ignore", category=FutureWarning)

# Separating video from audio labels
The original competition metadata contains only a single label, "FAKE" or "REAL", for each video. In this notebook we therefore categorize the type of deepfake, i.e. **audio deepfake** or **video deepfake**.

- Add this to your dataset: https://www.kaggle.com/zaharch/train-set-metadata-for-dfdc
- The functions used are based on: https://www.kaggle.com/basharallabadi/dfdc-video-audio-labels


In [None]:
metadata = pd.read_csv('../input/train-set-metadata-for-dfdc/metadata', low_memory=False)
metadata.head()

In [None]:
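def audio_label(row):
    # A REAL video always has REAL audio.
    if row['label'] == 'REAL':
        return 'REAL'
    # Audio is FAKE when its hash differs from the original's hash
    # and the codec time base is not '1/16000'.
    if row['wav.hash'] != row['wav.hash.orig'] and row['audio.@codec_time_base'] != '1/16000':
        return 'FAKE'
    return 'REAL'

def video_label(row):
    # A REAL video always has REAL frames.
    if row['label'] == 'REAL':
        return 'REAL'
    # Video is FAKE when its pixel hash differs from the original's hash.
    if row['pxl.hash'] != row['pxl.hash.orig']:
        return 'FAKE'
    return 'REAL'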

In [None]:
metadata["video_label"] = metadata.progress_apply(video_label, axis=1)
metadata["audio_label"] = metadata.progress_apply(audio_label, axis=1)
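
The row-wise `progress_apply` calls above can also be written as vectorized column comparisons, which is usually faster on large tables. Below is a minimal sketch on a small made-up DataFrame (`toy` and its values are hypothetical, only the column names are taken from the metadata file). It applies the same rules as `video_label` and `audio_label`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows that mimic the metadata columns used above.
toy = pd.DataFrame({
    "label": ["REAL", "FAKE", "FAKE", "FAKE"],
    "pxl.hash": ["a", "b", "c", "d"],
    "pxl.hash.orig": ["a", "x", "c", "y"],
    "wav.hash": ["p", "q", "r", "s"],
    "wav.hash.orig": ["p", "q", "z", "w"],
    "audio.@codec_time_base": ["1/44100", "1/44100", "1/44100", "1/16000"],
})

# Only videos labelled FAKE can receive a FAKE audio or video label.
is_fake = toy["label"] == "FAKE"

# Same conditions as video_label / audio_label, evaluated on whole columns.
video_fake = is_fake & (toy["pxl.hash"] != toy["pxl.hash.orig"])
audio_fake = (
    is_fake
    & (toy["wav.hash"] != toy["wav.hash.orig"])
    & (toy["audio.@codec_time_base"] != "1/16000")
)

toy["video_label"] = np.where(video_fake, "FAKE", "REAL")
toy["audio_label"] = np.where(audio_fake, "FAKE", "REAL")
print(toy[["label", "video_label", "audio_label"]])
```

The boolean masks keep each rule readable on its own line, and the same pattern should carry over to the full `metadata` frame if its columns match.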

In [None]:
clean_labels = metadata[["filename", "video_label", "audio_label"]]

## Exploratory Data Analysis

Let's understand this data a little! We start by looking at the **distribution of labels**.

In [None]:
sns.set_style('darkgrid')

fig, axes = plt.subplots(1, 3, figsize=(12, 5))
fig.suptitle('Label Distribution')  # one title for the whole figure

# Combined label in the form "<video_label>_<audio_label>"
union_label = metadata["video_label"].str.cat(metadata["audio_label"], sep="_")

# Pass the data as x= so the call also works with recent seaborn versions
sns.countplot(x=metadata["video_label"], order=["REAL", "FAKE"], ax=axes[0])
sns.countplot(x=metadata["audio_label"], order=["REAL", "FAKE"], ax=axes[1])
sns.countplot(x=union_label, ax=axes[2])

# Use the same y-range in all panels so the counts are comparable
for ax in axes:
    ax.set_ylim(0, 120000)

plt.show()

It seems that all audio deepfakes are also video deepfakes: no sample has fake audio combined with real video.
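
One way to check this observation numerically is a cross-tabulation of the two label columns. The sketch below uses a small made-up `labels` frame (hypothetical values); in the notebook the same calls would be applied to the `video_label` and `audio_label` columns of `metadata`.

```python
import pandas as pd

# Hypothetical labels standing in for the columns of `metadata`.
labels = pd.DataFrame({
    "video_label": ["REAL", "FAKE", "FAKE", "FAKE", "REAL"],
    "audio_label": ["REAL", "REAL", "FAKE", "REAL", "REAL"],
})

# Rows: video label, columns: audio label, cells: number of videos.
table = pd.crosstab(labels["video_label"], labels["audio_label"])
print(table)

# Videos with fake audio but real video; a count of 0 supports the observation.
audio_only_fakes = (
    (labels["audio_label"] == "FAKE") & (labels["video_label"] == "REAL")
).sum()
print(audio_only_fakes)
```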

In [None]:
print(f"Number of both FAKE video and FAKE audio: {len(union_label[union_label == 'FAKE_FAKE'])}")
# Total number of FAKE audio samples, regardless of the video label
print(f"Number of FAKE audio (any video label): {len(metadata[metadata['audio_label']=='FAKE'])}")

In [None]:
num_audio_fakes = metadata["audio_label"].value_counts()["FAKE"]
print(f"We only have {num_audio_fakes} fake audio samples. They are underrepresented in comparison to the other labels.")

In [None]:
path = "../input/deepfake-detection-challenge/train_sample_videos/"
videos = [os.path.join(path, video) for video in os.listdir(path)]

In [None]:
def get_video_label(path, metadata):
    filename = os.path.basename(path)
    data = metadata[metadata["filename"] == filename]
    return data["video_label"]

def get_audio_label(path, metadata):
    filename = os.path.basename(path)
    data = metadata[metadata["filename"] == filename]
    return data["audio_label"]

## Audio reading

- Static build of ffmpeg: https://johnvansickle.com/ffmpeg/ (internet access is not available here, so it cannot be downloaded directly)
- The same build as a public dataset: https://www.kaggle.com/rakibilly/ffmpeg-static-build

This kernel helped me a lot: https://www.kaggle.com/rakibilly/extract-audio-starter

In [None]:
! tar xvf ../input/ffmpeg-static-build/ffmpeg-git-amd64-static.tar.xz

In [None]:
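def create_audio(file, save_path):
    # -vn drops the video stream; audio is written at 192 kbit/s, stereo, 44.1 kHz.
    # The arguments are passed as a list so paths need no shell quoting.
    command = [
        "../working/ffmpeg-git-20191209-amd64-static/ffmpeg",
        "-i", str(file),
        "-ab", "192000", "-ac", "2", "-ar", "44100", "-vn",
        str(save_path),
    ]
    subprocess.call(command)

output_format = "mp3"
output_dir = Path("mp3_files")
output_dir.mkdir(exist_ok=True, parents=True)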
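
Since ffmpeg is invoked once per video, it can help to decide up front which files still need extraction. The helper below is only a sketch: `plan_audio_extraction` is a hypothetical function that is not part of the original notebook, and the file names are made up. It returns the video/audio path pairs whose audio file does not exist yet.

```python
from pathlib import Path
import tempfile

def plan_audio_extraction(video_paths, output_dir, output_format="mp3"):
    """Return (video, audio) path pairs whose audio file does not exist yet."""
    output_dir = Path(output_dir)
    jobs = []
    for video in video_paths:
        # Name the audio file after the video, without its extension.
        audio = output_dir / f"{Path(video).stem}.{output_format}"
        if not audio.exists():
            jobs.append((str(video), str(audio)))
    return jobs

# Usage with made-up file names in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "aaaaaaaaaa.mp3").touch()  # pretend this one was already extracted
    jobs = plan_audio_extraction(
        ["videos/aaaaaaaaaa.mp4", "videos/bbbbbbbbbb.mp4"], tmp
    )
    print(jobs)
```

Each returned pair could then be passed to `create_audio(video, audio)`, so already extracted files are not processed again.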

## Video reading

In [None]:
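def get_random_frame(path):
    cap = cv2.VideoCapture(path)
    # CAP_PROP_POS_FRAMES expects a frame index, so draw one from the frame count
    frame_count = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    cap.set(cv2.CAP_PROP_POS_FRAMES, random.randint(0, max(frame_count - 1, 0)))
    ok, img = cap.read()
    cap.release()
    if not ok:
        raise ValueError(f"Could not read a frame from {path}")
    return cv2.cvtColor(img, cv2.COLOR_BGR2RGB)  # OpenCV reads BGR, matplotlib expects RGB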

def visualize_sample(sample):
    video_label = get_video_label(sample, clean_labels)
    audio_label = get_audio_label(sample, clean_labels)

    # Read random image
    img = get_random_frame(sample)
    plt.imshow(img)
    plt.title(f"Video: {video_label.item()}, Audio: {audio_label.item()}")  # .item() requires exactly one matching row
    plt.show()

## Visualization

In [None]:
sample = videos[100]

# Visualize random frame
visualize_sample(sample)

# Read audio
audio_file = str(output_dir / f"{Path(sample).stem}.{output_format}")  # video file name without its extension
create_audio(sample, audio_file)
IPython.display.Audio(audio_file)

In [None]:
sample = videos[203]

# Visualize random frame
visualize_sample(sample)

# Read audio
audio_file = str(output_dir / f"{Path(sample).stem}.{output_format}")  # video file name without its extension
create_audio(sample, audio_file)
IPython.display.Audio(audio_file)