# Celeb-DF-v2 Dataset Sampling


This notebook samples the full [Celeb-DF-v2](https://github.com/yuezunli/celeb-deepfakeforensics) dataset for deepfake detection research. 100 deepfakes are randomly sampled together with their corresponding original videos, giving a total of 200 videos (100 per label).


In [1]:
import os
import csv
import random
import shutil

In [12]:
DATASET_DIR = "/mnt/e/datasets/Celeb-DF-v2"
CELEB_REAL_DIR = os.path.join(DATASET_DIR, "Celeb-real")
CELEB_SYNTHESIS_DIR = os.path.join(DATASET_DIR, "Celeb-synthesis")

SAMPLE_DIR = "/mnt/e/samples/videos/Celeb-DF-v2"
SAMPLE_DEEPFAKE_DIR = os.path.join(SAMPLE_DIR, "Deepfake")
SAMPLE_ORIGINAL_VID_DIR = os.path.join(SAMPLE_DIR, "Original")

SAMPLE_SIZE = 100
SAMPLE_LIST_PATH = "celeb-df-v2-list.csv"

## Sample Real Videos

Real videos are sampled from the set of original videos that the deepfakes were generated from. The original video filename for each deepfake is derived from the deepfake filename.


### Extract Original Video Filename for Each Deepfake


In [5]:
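def extract_original_deepfake_filename(deepfake_filename):
    # "id0_id16_0000.mp4" -> "id0_0000.mp4": keep the first and last
    # underscore-separated fields of the deepfake filename.
    parts = deepfake_filename.split("_")
    return f"{parts[0]}_{parts[-1]}"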

In [6]:
deepfakes = os.listdir(CELEB_SYNTHESIS_DIR)
deepfake_original_vids = list(map(extract_original_deepfake_filename, deepfakes))

for i in range(0, 5):
    print(f"Deepfake: {deepfakes[i]} => Original: {deepfake_original_vids[i]}")
print("...")

Deepfake: id0_id16_0000.mp4 => Original: id0_0000.mp4
Deepfake: id0_id16_0001.mp4 => Original: id0_0001.mp4
Deepfake: id0_id16_0002.mp4 => Original: id0_0002.mp4
Deepfake: id0_id16_0003.mp4 => Original: id0_0003.mp4
Deepfake: id0_id16_0004.mp4 => Original: id0_0004.mp4
...
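`os.listdir` returns every entry in the directory, so a stray file that does not follow the naming pattern would be mapped to a meaningless "original" filename by the split-based helper. The sketch below is one possible stricter variant, not part of the original notebook. It assumes that every deepfake is named `<id>_<id>_<clip>.mp4`, which is inferred only from the filenames printed above; `DEEPFAKE_NAME_PATTERN` and `parse_original_filename` are hypothetical names.

```python
import re

# Assumed naming pattern, inferred from the printed filenames:
# "<id>_<id>_<clip>.mp4", e.g. "id0_id16_0000.mp4".
DEEPFAKE_NAME_PATTERN = re.compile(r"^(id\d+)_(id\d+)_(\d+)\.mp4$")


def parse_original_filename(deepfake_filename):
    """Return the original video filename, or None if the name does not match."""
    match = DEEPFAKE_NAME_PATTERN.match(deepfake_filename)
    if match is None:
        return None
    first_id, _second_id, clip = match.groups()
    # The original video keeps the first id and the clip number.
    return f"{first_id}_{clip}.mp4"


print(parse_original_filename("id0_id16_0000.mp4"))  # id0_0000.mp4
print(parse_original_filename(".DS_Store"))  # None
```

Entries that return `None` can then be skipped or reported before sampling.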


### Get Unique Deepfake Original Vids


In [7]:
unique_deepfake_original_vids = list(set(deepfake_original_vids))
print(
    f"Found {len(unique_deepfake_original_vids)} unique original videos across all deepfakes"
)

Found 542 unique original videos across all deepfakes


### Sample Unique Deepfake Original Vids


In [8]:
sample_original_vids = random.sample(unique_deepfake_original_vids, SAMPLE_SIZE)
print(f"Sampled {len(sample_original_vids)} original video files")

Sampled 100 original video files
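The sampling cell uses the global `random` state without a seed, so each run of the notebook draws a different sample. If a repeatable sample is wanted, one option is a dedicated seeded generator, as in the minimal sketch below. The helper name `sample_reproducibly`, the seed value, and the example filenames are assumptions for illustration only.

```python
import random


def sample_reproducibly(items, k, seed):
    """Sample k items using a dedicated, seeded random generator."""
    rng = random.Random(seed)
    # Sort first: the iteration order of a set of strings can differ
    # between interpreter runs, so sorting gives the generator a stable
    # input order.
    return rng.sample(sorted(items), k)


# Hypothetical filenames standing in for the unique original videos.
videos = {f"id{i}_0000.mp4" for i in range(10)}
first = sample_reproducibly(videos, 3, seed=42)
second = sample_reproducibly(videos, 3, seed=42)
print(first == second)  # True
```

A separate `random.Random` instance keeps the seed from affecting other code that relies on the module-level generator.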


## Sample Deepfakes

For each sampled original video, one deepfake is chosen at random from the deepfakes that were generated from that video.


In [9]:
def match_deepfake_original_vid_filename(deepfake_filename, original_vid_filename):
    # Predicate for filter(): True when the deepfake was generated from
    # the given original video.
    return extract_original_deepfake_filename(deepfake_filename) == original_vid_filename

In [10]:
sample_deepfakes = []
for original_vid_filename in sample_original_vids:
    original_vid_deepfakes = list(
        filter(
            lambda deepfake_filename: match_deepfake_original_vid_filename(
                deepfake_filename, original_vid_filename
            ),
            deepfakes,
        )
    )
    sample_deepfakes.append(random.choice(original_vid_deepfakes))

print(
    f"Sampled {len(sample_deepfakes)} deepfake files based on the sample original videos"
)

Sampled 100 deepfake files based on the sample original videos
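The loop above scans the full deepfake list once per sampled original video. An alternative, sketched here under the same filename convention, is to group the deepfakes by their original video in a single pass and then pick one entry per group. `group_deepfakes_by_original` is a hypothetical helper, and the filenames in `deepfakes_example` are made up for the demonstration.

```python
import random
from collections import defaultdict


def group_deepfakes_by_original(deepfake_filenames):
    """Map each original video filename to the deepfakes derived from it."""
    groups = defaultdict(list)
    for name in deepfake_filenames:
        parts = name.split("_")
        # Same rule as the notebook: first and last underscore-separated fields.
        groups[f"{parts[0]}_{parts[-1]}"].append(name)
    return groups


# Made-up filenames for illustration.
deepfakes_example = [
    "id0_id16_0000.mp4",
    "id0_id17_0000.mp4",
    "id0_id16_0001.mp4",
    "id23_id3_0003.mp4",
]
groups = group_deepfakes_by_original(deepfakes_example)

# One randomly chosen deepfake per original video.
picked = {original: random.choice(candidates) for original, candidates in groups.items()}

print(sorted(groups["id0_0000.mp4"]))  # ['id0_id16_0000.mp4', 'id0_id17_0000.mp4']
```

With a few thousand files either approach is fast enough; the grouped form mainly makes the one-to-many relationship explicit.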


## Define Each Sample's File Path


In [13]:
sample_original_vid_files = list(
    map(lambda filename: os.path.join(CELEB_REAL_DIR, filename), sample_original_vids)
)
sample_deepfake_files = list(
    map(lambda filename: os.path.join(CELEB_SYNTHESIS_DIR, filename), sample_deepfakes)
)

print("Sample original video:")
print(sample_original_vid_files[0])
print("\nSample deepfake:")
print(sample_deepfake_files[0])

Sample original video:
/mnt/e/datasets/Celeb-DF-v2/Celeb-real/id23_0003.mp4

Sample deepfake:
/mnt/e/datasets/Celeb-DF-v2/Celeb-synthesis/id23_id3_0003.mp4


## Copy Sample to a Separate Directory


In [17]:
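# Create the target directory first; shutil.copy does not create it.
os.makedirs(SAMPLE_ORIGINAL_VID_DIR, exist_ok=True)
for file_path in sample_original_vid_files:
    if os.path.isfile(file_path):
        shutil.copy(file_path, SAMPLE_ORIGINAL_VID_DIR)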

In [18]:
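# Create the target directory first; shutil.copy does not create it.
os.makedirs(SAMPLE_DEEPFAKE_DIR, exist_ok=True)
for file_path in sample_deepfake_files:
    if os.path.isfile(file_path):
        shutil.copy(file_path, SAMPLE_DEEPFAKE_DIR)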
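The copy loops skip any path that is not an existing file without reporting it, so a missing original video would leave the sample short without any visible sign. The sketch below shows one way to surface that. `copy_files` is a hypothetical helper, and the demonstration uses a temporary directory with a placeholder file rather than the real dataset.

```python
import os
import shutil
import tempfile


def copy_files(file_paths, destination_dir):
    """Copy files into destination_dir and return the paths that were missing."""
    os.makedirs(destination_dir, exist_ok=True)
    missing = []
    for file_path in file_paths:
        if os.path.isfile(file_path):
            shutil.copy(file_path, destination_dir)
        else:
            missing.append(file_path)
    return missing


# Demonstration with a throwaway directory and a placeholder file.
with tempfile.TemporaryDirectory() as tmp:
    source = os.path.join(tmp, "id0_0000.mp4")
    with open(source, "wb") as placeholder:
        placeholder.write(b"placeholder")
    absent = os.path.join(tmp, "id0_0001.mp4")  # deliberately not created
    destination = os.path.join(tmp, "samples", "Original")

    missing = copy_files([source, absent], destination)
    copied = os.listdir(destination)

print(copied)        # ['id0_0000.mp4']
print(len(missing))  # 1
```

An empty `missing` list confirms that every sampled file was copied.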

## Create CSV File to List Sample Video Files


In [19]:
with open(SAMPLE_LIST_PATH, "w", newline="") as sample_list_file:
    header = ["file", "label"]
    writer = csv.writer(sample_list_file)
    writer.writerow(header)

    for filename in sample_deepfake_files:
        dataset_path = filename.split("/")[-3:]
        writer.writerow(["/".join(dataset_path), "deepfake"])

    for filename in sample_original_vid_files:
        dataset_path = filename.split("/")[-3:]
        writer.writerow(["/".join(dataset_path), "real"])

    print(f"Sample videos filepath list generated in {SAMPLE_LIST_PATH}")

Sample videos filepath list generated in celeb-df-v2-list.csv
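As a quick sanity check, the generated list can be read back and the rows counted per label; for the sample described in this notebook, 100 rows per label would be expected. The sketch below assumes only the `file,label` header written above. `count_labels` is a hypothetical helper, and the demonstration writes a two-row file to a temporary directory instead of reading `celeb-df-v2-list.csv`.

```python
import csv
import os
import tempfile
from collections import Counter


def count_labels(csv_path):
    """Count rows per label in a CSV file with a 'label' column."""
    with open(csv_path, newline="") as csv_file:
        return Counter(row["label"] for row in csv.DictReader(csv_file))


# Demonstration with a small throwaway CSV file.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "list.csv")
    with open(demo_path, "w", newline="") as csv_file:
        writer = csv.writer(csv_file)
        writer.writerow(["file", "label"])
        writer.writerow(["Celeb-DF-v2/Celeb-synthesis/id23_id3_0003.mp4", "deepfake"])
        writer.writerow(["Celeb-DF-v2/Celeb-real/id23_0003.mp4", "real"])
    label_counts = count_labels(demo_path)

print(label_counts["deepfake"], label_counts["real"])  # 1 1
```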
