# DFDC Dataset Sampling


This notebook samples the full [DFDC](https://www.kaggle.com/c/deepfake-detection-challenge) dataset for deepfake detection research. 100 deepfakes are randomly sampled together with their corresponding original videos, for a total of 200 videos (100 per label).


In [14]:
import os
import csv
import random
import shutil
import json

In [6]:
DATASET_DIR = "/mnt/e/datasets/DFDC"
TRAIN_DATA_DIR_PREFIX = "dfdc_train"
METADATA_FILENAME = "metadata.json"

SAMPLE_DIR = "/mnt/e/samples/videos/DFDC"
SAMPLE_DEEPFAKE_DIR = os.path.join(SAMPLE_DIR, "Deepfake")
SAMPLE_ORIGINAL_VID_DIR = os.path.join(SAMPLE_DIR, "Original")

SAMPLE_SIZE = 100
SAMPLE_LIST_PATH = "dfdc-list.csv"

## Deepfake and Real Video Sampling


Sampling is done per part: the dataset has 50 parts, and two real videos are drawn at random from each. For every sampled real video, one deepfake derived from it is then picked at random, using the `original` field in that part's metadata.
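The pairing of each sampled real video with one of its deepfakes can be sketched on a toy metadata dictionary. This is a minimal illustration, not the notebook's exact code: the helper `sample_pairs`, the seeded `random.Random` instance, and the toy filenames are assumptions made for the example. It only considers real videos that have at least one deepfake in the part, which avoids an `IndexError` from `random.choice` on an empty list.

```python
import random
from collections import defaultdict


def sample_pairs(metadata, n, rng):
    """Pick n real videos and, for each, one deepfake derived from it."""
    # Group deepfake filenames by the real video they were generated from.
    fakes_by_original = defaultdict(list)
    for name, info in metadata.items():
        if info["label"] == "FAKE":
            fakes_by_original[info["original"]].append(name)

    # Only real videos with at least one deepfake in this part can form a pair.
    candidates = sorted(
        name
        for name, info in metadata.items()
        if info["label"] == "REAL" and name in fakes_by_original
    )
    originals = rng.sample(candidates, n)
    return [
        (original, rng.choice(fakes_by_original[original]))
        for original in originals
    ]


# Toy metadata shaped like a DFDC metadata.json (filenames are made up).
toy_metadata = {
    "real_a.mp4": {"label": "REAL"},
    "real_b.mp4": {"label": "REAL"},
    "real_c.mp4": {"label": "REAL"},  # has no deepfake, so it is never sampled
    "fake_a1.mp4": {"label": "FAKE", "original": "real_a.mp4"},
    "fake_a2.mp4": {"label": "FAKE", "original": "real_a.mp4"},
    "fake_b1.mp4": {"label": "FAKE", "original": "real_b.mp4"},
}

# A seeded generator makes the sample reproducible across runs.
pairs = sample_pairs(toy_metadata, n=2, rng=random.Random(0))
for original, deepfake in pairs:
    print(original, "->", deepfake)
```

Building the `fakes_by_original` index once per part also avoids rescanning every metadata key for each sampled real video.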


In [35]:
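# Keep only the part directories (not archives or other files), in a stable order
dataset_part_dirs = sorted(
    dir_name
    for dir_name in os.listdir(DATASET_DIR)
    if dir_name.startswith(TRAIN_DATA_DIR_PREFIX)
    and os.path.isdir(os.path.join(DATASET_DIR, dir_name))
)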
part_sample_size = SAMPLE_SIZE // len(dataset_part_dirs)

In [81]:
sample_original_vid_files = []
sample_deepfake_files = []

In [71]:
def get_vid_abs_path(filenames, part_dir):
    """Join each filename with the directory of the part it belongs to."""
    return [os.path.join(part_dir, filename) for filename in filenames]

In [82]:
for part_dir in dataset_part_dirs:
    part_fulldir = os.path.join(DATASET_DIR, part_dir)
    metadata_filepath = os.path.join(part_fulldir, METADATA_FILENAME)

    with open(metadata_filepath, "r") as metadata_file:
        metadata = json.load(metadata_file)
        keys = list(metadata.keys())
        original_vids = [key for key in keys if metadata[key]["label"] == "REAL"]
        sample_original_vids = random.sample(original_vids, part_sample_size)

        # Sample deepfakes based on the sampled real videos
        sample_deepfakes = []
        for original_vid in sample_original_vids:
            deepfake_filenames = [
                key
                for key in keys
                if metadata[key]["label"] == "FAKE"
                and metadata[key]["original"] == original_vid
            ]
            sample_deepfakes.append(random.choice(deepfake_filenames))

        # Get the absolute path of each sampled video (part_fulldir is set above)
        sample_original_vid_files.extend(
            get_vid_abs_path(sample_original_vids, part_fulldir)
        )
        sample_deepfake_files.extend(get_vid_abs_path(sample_deepfakes, part_fulldir))

print(f"Sampled {len(sample_original_vid_files)} original video files")
print(
    f"Sampled {len(sample_deepfake_files)} deepfake files based on the sample original videos"
)

print("\nSample original video:")
print(sample_original_vid_files[0])
print("\nSample deepfake:")
print(sample_deepfake_files[0])

Sampled 100 original video files
Sampled 100 deepfake files based on the sample original videos

Sample original video:
/mnt/e/datasets/DFDC/dfdc_train_part_0/doniqevxeg.mp4

Sample deepfake:
/mnt/e/datasets/DFDC/dfdc_train_part_0/nadprinqny.mp4


## Copy Sample to a Separate Directory


In [89]:
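# Make sure the destination directory exists before copying into it
os.makedirs(SAMPLE_ORIGINAL_VID_DIR, exist_ok=True)
for file_path in sample_original_vid_files:
    if os.path.isfile(file_path):
        shutil.copy(file_path, SAMPLE_ORIGINAL_VID_DIR)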

In [90]:
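# Make sure the destination directory exists before copying into it
os.makedirs(SAMPLE_DEEPFAKE_DIR, exist_ok=True)
for file_path in sample_deepfake_files:
    if os.path.isfile(file_path):
        shutil.copy(file_path, SAMPLE_DEEPFAKE_DIR)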
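The copy loops above silently skip any path that is not a file, so a missing video would go unnoticed. One possible variation, sketched below, wraps the copy in a helper that also returns the skipped paths. The name `copy_videos` is hypothetical and not part of the original notebook.

```python
import os
import shutil


def copy_videos(file_paths, dest_dir):
    """Copy files into dest_dir and return the paths that were not found."""
    # Create the destination (and any missing parents) if needed.
    os.makedirs(dest_dir, exist_ok=True)
    missing = []
    for file_path in file_paths:
        if os.path.isfile(file_path):
            shutil.copy(file_path, dest_dir)
        else:
            missing.append(file_path)
    return missing
```

With the notebook's variables, this could be called as `missing = copy_videos(sample_deepfake_files, SAMPLE_DEEPFAKE_DIR)`; an empty `missing` list would indicate that every sampled file was copied.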

## Create CSV File to List Sample Video Files


In [96]:
with open(SAMPLE_LIST_PATH, "w", newline="") as sample_list_file:
    header = ["file", "label"]
    writer = csv.writer(sample_list_file)
    writer.writerow(header)

    for filename in sample_deepfake_files:
        dataset_path = filename.split("/")[-3:]
        writer.writerow(["/".join(dataset_path), "deepfake"])

    for filename in sample_original_vid_files:
        dataset_path = filename.split("/")[-3:]
        writer.writerow(["/".join(dataset_path), "real"])

    print(f"Sample videos filepath list generated in {SAMPLE_LIST_PATH}")

Sample videos filepath list generated in dfdc-list.csv
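As a quick sanity check, the generated list can be read back and its labels counted. The sketch below assumes only the `file` and `label` columns written above; the helper name `count_labels` is hypothetical.

```python
import csv
from collections import Counter


def count_labels(csv_path):
    """Return how many rows of the sample list carry each label."""
    with open(csv_path, newline="") as sample_list_file:
        reader = csv.DictReader(sample_list_file)
        return Counter(row["label"] for row in reader)
```

For the run shown in this notebook, `count_labels(SAMPLE_LIST_PATH)` should report 100 rows for `deepfake` and 100 for `real`, matching the sample counts printed earlier.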
