# DeeperForensics-1.0 Dataset Sampling


This notebook samples the full [DeeperForensics-1.0](https://github.com/EndlessSora/DeeperForensics-1.0) dataset for deepfake detection research. It randomly samples 100 original videos and pairs each with one corresponding deepfake, giving 200 videos in total (100 per label).

The real videos are taken from the [FaceForensics++](https://github.com/ondyari/FaceForensics) dataset, because DeeperForensics-1.0 uses the refined YouTube videos collected by FaceForensics++, as described in the target videos section of its [GitHub documentation](https://github.com/EndlessSora/DeeperForensics-1.0/tree/master/dataset#target-videos).


In [1]:
import os
import csv
import random
import shutil

In [3]:
DEEPFAKE_DIR = (
    "/mnt/e/datasets/DeeperForensics-1.0/manipulated_videos/reenact_postprocess"
)
ORIGINAL_VID_DIR = (
    "/mnt/e/datasets/FaceForensics++/original_sequences/youtube/c23/videos"
)

SAMPLE_DIR = "/mnt/e/samples/videos/DeeperForensics-1.0"
SAMPLE_DEEPFAKE_DIR = os.path.join(SAMPLE_DIR, "Deepfake")
SAMPLE_ORIGINAL_VID_DIR = os.path.join(SAMPLE_DIR, "Original")

SAMPLE_SIZE = 100
SAMPLE_LIST_PATH = "deeperforensics-1.0-list.csv"

## Sample Real Videos


In [19]:
sample_original_vids = random.sample(os.listdir(ORIGINAL_VID_DIR), SAMPLE_SIZE)
print(f"Sampled {len(sample_original_vids)} original video files")

Sampled 100 original video files
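
Because `random.sample` is unseeded here and `os.listdir` returns entries in arbitrary order, rerunning the cell above produces a different sample each time. A minimal sketch of a reproducible variant follows. The helper name `sample_filenames`, the seed value, and the example filenames are illustrative assumptions, not part of the original notebook.

```python
import random


def sample_filenames(filenames, sample_size, seed=42):
    """Return a reproducible random sample of filenames.

    The helper name and the default seed are illustrative assumptions.
    """
    # Sort first so the result does not depend on directory listing order.
    ordered = sorted(filenames)
    # A dedicated Random instance leaves the global random state untouched.
    rng = random.Random(seed)
    return rng.sample(ordered, sample_size)


# Made-up filenames in the FaceForensics++ naming style.
example_files = [f"{i:03d}.mp4" for i in range(10)]
first = sample_filenames(example_files, 3)
second = sample_filenames(list(reversed(example_files)), 3)
print(first == second)
```

With this approach, the same seed and the same set of files should always give the same sample, regardless of the order in which the files are listed.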


## Sample Deepfakes

For each sampled original video, the first manipulated video whose filename prefix (the part before the first underscore) matches the original video's ID is selected as its deepfake counterpart.


In [27]:
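def extract_original_deepfake_filename(deepfake_filename):
    # "651_W010.mp4" -> "651.mp4": the prefix before the first underscore is the original video ID.
    original_id = deepfake_filename.split("_")[0]
    file_extension = os.path.splitext(deepfake_filename)[-1]
    return original_id + file_extension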

In [36]:
deepfakes = os.listdir(DEEPFAKE_DIR)
sample_deepfakes = []

for original_filename in sample_original_vids:
    for deepfake_filename in deepfakes:
        extracted_original_filename = extract_original_deepfake_filename(
            deepfake_filename
        )
        if extracted_original_filename == original_filename:
            sample_deepfakes.append(deepfake_filename)
            break

print(
    f"Sampled {len(sample_deepfakes)} deepfake files based on the sample original videos"
)

Sampled 100 deepfake files based on the sample original videos
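
The nested loop above rescans the full deepfake list for every sampled original. One possible alternative is to build a lookup table once. The sketch below assumes every manipulated filename starts with the original video ID followed by an underscore (as in `651_W010.mp4`). The helper name `index_first_deepfake_by_original` and the example filenames are hypothetical.

```python
import os


def index_first_deepfake_by_original(deepfake_filenames):
    """Map each original filename (e.g. "651.mp4") to its first deepfake."""
    index = {}
    for deepfake_filename in deepfake_filenames:
        original_id = deepfake_filename.split("_")[0]
        extension = os.path.splitext(deepfake_filename)[-1]
        # setdefault keeps the first match, mirroring the `break` in the loop.
        index.setdefault(original_id + extension, deepfake_filename)
    return index


# Made-up filenames for illustration only.
example_deepfakes = ["651_W010.mp4", "651_M004.mp4", "002_W017.mp4"]
index = index_first_deepfake_by_original(example_deepfakes)

example_originals = ["651.mp4", "002.mp4", "999.mp4"]
matched = [index[name] for name in example_originals if name in index]
missing = [name for name in example_originals if name not in index]
print(matched)
print(missing)
```

Collecting `missing` also surfaces originals that have no deepfake counterpart, which the nested loop skips silently.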


## Define Each Sample's File Path


In [47]:
sample_original_vid_files = list(
    map(lambda filename: os.path.join(ORIGINAL_VID_DIR, filename), sample_original_vids)
)
sample_deepfake_files = list(
    map(lambda filename: os.path.join(DEEPFAKE_DIR, filename), sample_deepfakes)
)

print("Sample original video:")
print(sample_original_vid_files[0])
print("\nSample deepfake:")
print(sample_deepfake_files[0])

Sample original video:
/mnt/e/datasets/FaceForensics++/original_sequences/youtube/c23/videos/651.mp4

Sample deepfake:
/mnt/e/datasets/DeeperForensics-1.0/manipulated_videos/reenact_postprocess/651_W010.mp4


## Copy Samples to a Separate Directory


In [45]:
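# Ensure the destination directory exists; otherwise shutil.copy fails or writes a file with that name.
os.makedirs(SAMPLE_ORIGINAL_VID_DIR, exist_ok=True)
for file_path in sample_original_vid_files:
    if os.path.isfile(file_path):
        shutil.copy(file_path, SAMPLE_ORIGINAL_VID_DIR)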

In [46]:
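# Ensure the destination directory exists; otherwise shutil.copy fails or writes a file with that name.
os.makedirs(SAMPLE_DEEPFAKE_DIR, exist_ok=True)
for file_path in sample_deepfake_files:
    if os.path.isfile(file_path):
        shutil.copy(file_path, SAMPLE_DEEPFAKE_DIR)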

## Create CSV File to List Sample Video Files


In [55]:
with open(SAMPLE_LIST_PATH, "w", newline="") as sample_list_file:
    header = ["file", "label"]
    writer = csv.writer(sample_list_file)
    writer.writerow(header)

    # Keep only the trailing path components so each entry starts at the dataset folder name.
    for file_path in sample_deepfake_files:
        dataset_path = file_path.split("/")[-4:]
        writer.writerow(["/".join(dataset_path), "deepfake"])

    for file_path in sample_original_vid_files:
        dataset_path = file_path.split("/")[-6:]
        writer.writerow(["/".join(dataset_path), "real"])

    print(f"Sample videos filepath list generated in {SAMPLE_LIST_PATH}")

Sample videos filepath list generated in deeperforensics-1.0-list.csv
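
A quick sanity check on the generated list is to read it back and count the rows per label. The sketch below writes a tiny stand-in CSV to a temporary directory so that it runs on its own. The helper `count_labels` and the two example rows are assumptions for illustration, not part of the notebook.

```python
import csv
import os
import tempfile
from collections import Counter


def count_labels(csv_path):
    """Count rows per label in a CSV with `file` and `label` columns."""
    with open(csv_path, newline="") as csv_file:
        return Counter(row["label"] for row in csv.DictReader(csv_file))


with tempfile.TemporaryDirectory() as tmp_dir:
    # Stand-in file with the same header as the list generated above.
    example_path = os.path.join(tmp_dir, "example-list.csv")
    with open(example_path, "w", newline="") as csv_file:
        writer = csv.writer(csv_file)
        writer.writerow(["file", "label"])
        writer.writerow(
            ["DeeperForensics-1.0/manipulated_videos/reenact_postprocess/651_W010.mp4", "deepfake"]
        )
        writer.writerow(
            ["FaceForensics++/original_sequences/youtube/c23/videos/651.mp4", "real"]
        )
    label_counts = count_labels(example_path)

print(dict(label_counts))
```

For the list generated above, `count_labels(SAMPLE_LIST_PATH)` would be expected to report 100 rows for each label.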
