# Dataset Splitting


This notebook splits the [Celeb-DF-v2](https://github.com/yuezunli/celeb-deepfakeforensics), [DeeperForensics-1.0](https://github.com/EndlessSora/DeeperForensics-1.0), and [DFDC](https://www.kaggle.com/c/deepfake-detection-challenge) datasets into train, validation, and test sets for deepfake detection research. The [600 sampled videos](https://github.com/ramaastra/deepfake-detection-cnn-mc-dropout/tree/main/data-sampling) (200 from each dataset) are split per dataset with a ratio of 70%, 10%, and 20%, respectively.


In [1]:
import os
import random
import shutil
import pandas as pd

## Define Dataset Path


### Face Extracted Dataset Directory


In [None]:
CDF_DIR = "/mnt/e/samples/extracted/Celeb-DF-v2"
DF_DIR = "/mnt/e/samples/extracted/DeeperForensics-1.0"
DFDC_DIR = "/mnt/e/samples/extracted/DFDC"

### Dataset Split Output Directory


In [None]:
CDF_OUTPUT_DIR = "/mnt/e/samples/split/Celeb-DF-v2"
DF_OUTPUT_DIR = "/mnt/e/samples/split/DeeperForensics-1.0"
DFDC_OUTPUT_DIR = "/mnt/e/samples/split/DFDC"
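
The copy step later in this notebook writes into `<output dir>/<Split>/<Label>` subdirectories, where the split name is capitalized and the label directory is either `Deepfake` or `Original`. A minimal sketch of that layout, built in a temporary directory so it can be run anywhere; the helper name `make_split_dirs` is hypothetical and not part of the notebook:

```python
import os
import tempfile

SPLITS = ["Train", "Val", "Test"]
LABEL_DIRS = ["Deepfake", "Original"]


def make_split_dirs(root):
    """Create <root>/<split>/<label> for every combination (hypothetical helper)."""
    created = []
    for split in SPLITS:
        for label_dir in LABEL_DIRS:
            path = os.path.join(root, split, label_dir)
            os.makedirs(path, exist_ok=True)
            created.append(os.path.relpath(path, root))
    return created


# A temporary directory stands in for one of the *_OUTPUT_DIR paths
with tempfile.TemporaryDirectory() as tmp_root:
    created = make_split_dirs(tmp_root)

print(created)
```

Calling the same helper on `CDF_OUTPUT_DIR`, `DF_OUTPUT_DIR`, and `DFDC_OUTPUT_DIR` would prepare all three output trees in advance.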

### Dataset Sample List Filepath


In [4]:
CDF_SAMPLE_LIST = "../data-sampling/celeb-df-v2-list.csv"
DF_SAMPLE_LIST = "../data-sampling/deeperforensics-1.0-list.csv"
DFDC_SAMPLE_LIST = "../data-sampling/dfdc-list.csv"

## Define Dataset Split Ratio


In [5]:
DATASET_SIZE = 200
NUM_LABELS = 2

TRAIN_RATIO = 0.7
VAL_RATIO = 0.1
TEST_RATIO = 0.2

In [6]:
# Each split holds the same number of videos for every label
train_size_per_label = int(DATASET_SIZE * TRAIN_RATIO / NUM_LABELS)
val_size_per_label = int(DATASET_SIZE * VAL_RATIO / NUM_LABELS)
test_size_per_label = int(DATASET_SIZE * TEST_RATIO / NUM_LABELS)

print(f"Train size per label\t: {train_size_per_label} videos")
print(f"Val size per label\t: {val_size_per_label} videos")
print(f"Test size per label\t: {test_size_per_label} videos")

Train size per label	: 70 videos
Val size per label	: 10 videos
Test size per label	: 20 videos
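
Because `int()` truncates, a ratio that is not exactly representable in floating point could silently drop a video for other dataset sizes. The following is a minimal sketch of a guard that uses `round()` and checks that the split sizes add up; the helper name `split_sizes_per_label` and the `RATIOS` dict are hypothetical and not part of the notebook:

```python
import math

DATASET_SIZE = 200
NUM_LABELS = 2
RATIOS = {"train": 0.7, "val": 0.1, "test": 0.2}


def split_sizes_per_label(dataset_size, num_labels, ratios):
    """Return the number of videos per label for each split (hypothetical helper)."""
    if not math.isclose(sum(ratios.values()), 1.0):
        raise ValueError("Split ratios must sum to 1")
    per_label = dataset_size // num_labels
    # round() avoids losing a video to floating-point truncation
    sizes = {name: round(per_label * ratio) for name, ratio in ratios.items()}
    if sum(sizes.values()) != per_label:
        raise ValueError("Split sizes do not add up to the per-label total")
    return sizes


sizes = split_sizes_per_label(DATASET_SIZE, NUM_LABELS, RATIOS)
print(sizes)
```

For the values used here the result matches the sizes printed above: 70, 10, and 20 videos per label.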


## Load Dataset Sample List into Dataframe


In [7]:
cdf_list_df = pd.read_csv(CDF_SAMPLE_LIST)
df_list_df = pd.read_csv(DF_SAMPLE_LIST)
dfdc_list_df = pd.read_csv(DFDC_SAMPLE_LIST)

print("Dataframe example:")
cdf_list_df.head()

Dataframe example:


| Unnamed: 0 | file | label |
|---|---|---|
| 0 | Celeb-DF-v2/Celeb-synthesis/id23_id3_0003.mp4 | deepfake |
| 1 | Celeb-DF-v2/Celeb-synthesis/id53_id56_0009.mp4 | deepfake |
| 2 | Celeb-DF-v2/Celeb-synthesis/id17_id26_0009.mp4 | deepfake |
| 3 | Celeb-DF-v2/Celeb-synthesis/id54_id51_0008.mp4 | deepfake |
| 4 | Celeb-DF-v2/Celeb-synthesis/id59_id60_0006.mp4 | deepfake |
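
The `Unnamed: 0` column appears when a CSV file that was saved with its index is read back without `index_col`. The split function in the next section also takes `len(df) // NUM_LABELS` indices for both labels, so it relies on each list holding the same number of videos per label. A small sketch of both points on toy data; the file names and the in-memory CSV are made up for illustration, and the real lists come from the CSV files above:

```python
import io

import pandas as pd

# Toy CSV with an unnamed index column, mimicking the layout of the sample lists
csv_text = """,file,label
0,a.mp4,deepfake
1,b.mp4,deepfake
2,c.mp4,real
3,d.mp4,real
"""

# index_col=0 turns the unnamed first column into the index instead of a data column
toy_list_df = pd.read_csv(io.StringIO(csv_text), index_col=0)

# A balanced list has the same count for every label
label_counts = toy_list_df["label"].value_counts()
is_balanced = label_counts.nunique() == 1

print(list(toy_list_df.columns))
print(label_counts.to_dict(), "balanced:", is_balanced)
```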


## Split the Dataset


In [8]:
def remove_sampled_data(data, sample):
    return [value for value in data if value not in sample]


def train_val_test_split(df):
    deepfakes = df[df["label"] == "deepfake"].get("file").values
    real_vids = df[df["label"] == "real"].get("file").values

    indices = range(len(df) // NUM_LABELS)

    # Take train sample
    train_indices = random.sample(indices, train_size_per_label)
    indices = remove_sampled_data(indices, train_indices)

    # Take validation sample
    val_indices = random.sample(indices, val_size_per_label)
    indices = remove_sampled_data(indices, val_indices)

    # Take test sample
    test_indices = random.sample(indices, test_size_per_label)
    indices = remove_sampled_data(indices, test_indices)

    train_deepfakes = [deepfakes[i] for i in train_indices]
    val_deepfakes = [deepfakes[i] for i in val_indices]
    test_deepfakes = [deepfakes[i] for i in test_indices]

    train_real_vids = [real_vids[i] for i in train_indices]
    val_real_vids = [real_vids[i] for i in val_indices]
    test_real_vids = [real_vids[i] for i in test_indices]

    data = {"file_id": [], "label": [], "split": []}

    # Add file ids to dict
    data["file_id"] += train_deepfakes + train_real_vids
    data["file_id"] += val_deepfakes + val_real_vids
    data["file_id"] += test_deepfakes + test_real_vids

    # Add labels to dict
    data["label"] += ["deepfake"] * len(train_deepfakes)
    data["label"] += ["real"] * len(train_real_vids)
    data["label"] += ["deepfake"] * len(val_deepfakes)
    data["label"] += ["real"] * len(val_real_vids)
    data["label"] += ["deepfake"] * len(test_deepfakes)
    data["label"] += ["real"] * len(test_real_vids)

    # Add split information to dict
    data["split"] += ["train"] * (train_size_per_label * NUM_LABELS)
    data["split"] += ["val"] * (val_size_per_label * NUM_LABELS)
    data["split"] += ["test"] * (test_size_per_label * NUM_LABELS)

    df = pd.DataFrame(data)
    return df
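
`train_val_test_split` draws from the global `random` state without a seed, so every run produces a different split. The sketch below shows one way to make the index selection repeatable. It swaps the repeated `random.sample` calls for a seeded shuffle followed by slicing, which is a different technique from the one used in the notebook; the helper name `split_indices` and the seed value are hypothetical:

```python
import random


def split_indices(n_items, sizes, seed):
    """Shuffle range(n_items) with a fixed seed and slice it into consecutive splits."""
    if sum(sizes.values()) > n_items:
        raise ValueError("Requested more items than available")
    rng = random.Random(seed)  # local generator, leaves the global random state untouched
    shuffled = list(range(n_items))
    rng.shuffle(shuffled)

    splits, start = {}, 0
    for name, size in sizes.items():
        splits[name] = shuffled[start:start + size]
        start += size
    return splits


SEED = 42  # hypothetical value; any fixed integer works
sizes = {"train": 70, "val": 10, "test": 20}
splits = split_indices(100, sizes, SEED)

# The three index sets must not overlap
all_indices = [i for part in splits.values() for i in part]
print(len(all_indices), len(set(all_indices)))
```

Since every slice comes from one shuffled list, the splits are disjoint by construction, and rerunning with the same seed returns the same indices.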

In [None]:
cdf_split = train_val_test_split(cdf_list_df)
df_split = train_val_test_split(df_list_df)
dfdc_split = train_val_test_split(dfdc_list_df)

print("Example of generated split list:")
cdf_split.head()

Example of generated split list:


| Unnamed: 0.1 | Unnamed: 0 | file_id | label | split |
|---|---|---|---|---|
| 0 | 0 | Celeb-DF-v2/Celeb-synthesis/id0_id16_0001.mp4 | deepfake | train |
| 1 | 1 | Celeb-DF-v2/Celeb-synthesis/id55_id52_0008.mp4 | deepfake | train |
| 2 | 2 | Celeb-DF-v2/Celeb-synthesis/id59_id60_0006.mp4 | deepfake | train |
| 3 | 3 | Celeb-DF-v2/Celeb-synthesis/id48_id45_0008.mp4 | deepfake | train |
| 4 | 4 | Celeb-DF-v2/Celeb-synthesis/id20_id29_0002.mp4 | deepfake | train |


## Function to Extract File Id from Filepath


In [10]:
def get_file_id(filepath):
    filename = os.path.basename(filepath)

    # Remove filename extension
    file_id = os.path.splitext(filename)[0]

    return file_id
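
As a quick illustration, the function can be applied to one of the paths from the sample list shown earlier. The second path is a made-up example showing that only the last extension is removed:

```python
import os


def get_file_id(filepath):
    filename = os.path.basename(filepath)
    # Remove filename extension
    return os.path.splitext(filename)[0]


example_id = get_file_id("Celeb-DF-v2/Celeb-synthesis/id23_id3_0003.mp4")
print(example_id)  # id23_id3_0003

# Hypothetical path with a dot inside the name
dotted_id = get_file_id("some/dir/clip.v2.mp4")
print(dotted_id)  # clip.v2
```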

## Function to Copy Data into Output Directory


In [None]:
def copy_data_to_split_dir(split_df, source_dir, target_dir):
    # Each row is expected to hold (file path, label, split), in that order
    for row in split_df.values:
        filepath, label, split = row
        file_id = get_file_id(filepath)
        label_dirname = "Original" if label == "real" else "Deepfake"

        dataset_dir = os.path.join(source_dir, label_dirname)
        output_dir = os.path.join(target_dir, split.capitalize(), label_dirname)

        # shutil.copy does not create missing directories, so create them first
        os.makedirs(output_dir, exist_ok=True)

        # Collect extracted files whose name prefix (before the first "-")
        # contains the file id
        extracted_filenames = [
            filename
            for filename in os.listdir(dataset_dir)
            if file_id in filename.split("-")[0]
        ]

        print(f"Processing {file_id}")
        for filename in extracted_filenames:
            extracted_filepath = os.path.join(dataset_dir, filename)
            output_filepath = os.path.join(output_dir, filename)
            if os.path.isfile(extracted_filepath):
                shutil.copy(extracted_filepath, output_filepath)

        print("=> Successfully copied to the split directory\n")
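
The function selects files with a substring test (`file_id in prefix`), so an id that happens to be contained in another video's id would also pick up that video's files. The sketch below contrasts the substring test with a stricter equality test in a temporary directory. The frame naming `<file_id>-<frame number>.jpg`, the toy ids, and the helper `find_frames` are assumptions for illustration only; the notebook does not state the actual naming scheme. The sketch also creates the output directory before copying, since `shutil.copy` raises an error when the target directory does not exist.

```python
import os
import shutil
import tempfile


def find_frames(dataset_dir, file_id, exact=True):
    """List files whose prefix before the first "-" matches file_id (hypothetical helper)."""
    matches = []
    for filename in sorted(os.listdir(dataset_dir)):
        prefix = filename.split("-")[0]
        is_match = (prefix == file_id) if exact else (file_id in prefix)
        if is_match:
            matches.append(filename)
    return matches


with tempfile.TemporaryDirectory() as tmp:
    source_dir = os.path.join(tmp, "extracted", "Deepfake")
    output_dir = os.path.join(tmp, "split", "Train", "Deepfake")
    os.makedirs(source_dir)

    # Assumed frame names: "<file_id>-<frame number>.jpg"
    for name in ["abc-000.jpg", "abc-001.jpg", "xabc-000.jpg"]:
        open(os.path.join(source_dir, name), "w").close()

    loose = find_frames(source_dir, "abc", exact=False)  # substring test
    strict = find_frames(source_dir, "abc", exact=True)  # equality test

    # Create the target directory, then copy only the strictly matched files
    os.makedirs(output_dir, exist_ok=True)
    for name in strict:
        shutil.copy(os.path.join(source_dir, name), os.path.join(output_dir, name))
    copied = sorted(os.listdir(output_dir))

print(loose)
print(strict)
```

Whether the stricter test is appropriate depends on how the extracted files are actually named, so treat it as an option to check against the real directory contents rather than a drop-in replacement.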

## Create a Copy of the Split Celeb-DF-v2 Dataset


In [None]:
copy_data_to_split_dir(cdf_split, CDF_DIR, CDF_OUTPUT_DIR)

Processing id0_id16_0001
=> Successfully copied to the split directory

Processing id55_id52_0008
=> Successfully copied to the split directory

Processing id59_id60_0006
=> Successfully copied to the split directory

Processing id48_id45_0008
=> Successfully copied to the split directory

Processing id20_id29_0002
=> Successfully copied to the split directory

Processing id61_id5_0009
=> Successfully copied to the split directory

Processing id57_id52_0006
=> Successfully copied to the split directory

Processing id29_id33_0006
=> Successfully copied to the split directory

Processing id49_id52_0003
=> Successfully copied to the split directory

Processing id13_id7_0008
=> Successfully copied to the split directory

Processing id3_id1_0007
=> Successfully copied to the split directory

Processing id51_id53_0005
=> Successfully copied to the split directory

Processing id6_id9_0003
=> Successfully copied to the split directory

Processing id32_id37_0000
=> Successfully copied to the split directory

...

## Create a Copy of the Split DeeperForensics-1.0 Dataset


In [None]:
copy_data_to_split_dir(df_split, DF_DIR, DF_OUTPUT_DIR)

Processing 902_M029
=> Successfully copied to the split directory

Processing 979_M038
=> Successfully copied to the split directory

Processing 416_M030
=> Successfully copied to the split directory

Processing 771_W027
=> Successfully copied to the split directory

Processing 876_M025
=> Successfully copied to the split directory

Processing 890_W110
=> Successfully copied to the split directory

Processing 537_W136
=> Successfully copied to the split directory

Processing 358_M025
=> Successfully copied to the split directory

Processing 469_W027
=> Successfully copied to the split directory

Processing 487_W031
=> Successfully copied to the split directory

Processing 017_W006
=> Successfully copied to the split directory

Processing 360_W012
=> Successfully copied to the split directory

Processing 584_M118
=> Successfully copied to the split directory

Processing 912_M031
=> Successfully copied to the split directory

Processing 024_M113
=> Successfully copied to the split directory

...

## Create a Copy of the Split DFDC Dataset


In [None]:
copy_data_to_split_dir(dfdc_split, DFDC_DIR, DFDC_OUTPUT_DIR)

Processing zvedcbblfc
=> Successfully copied to the split directory

Processing zxjgsupwsa
=> Successfully copied to the split directory

Processing ybnjvhunta
=> Successfully copied to the split directory

Processing ihlbiaygvz
=> Successfully copied to the split directory

Processing wiqlpmenil
=> Successfully copied to the split directory

Processing tzkxumuidc
=> Successfully copied to the split directory

Processing syfbyohaem
=> Successfully copied to the split directory

Processing vhphtddasg
=> Successfully copied to the split directory

Processing xacvqczazm
=> Successfully copied to the split directory

Processing rgvpctvdkn
=> Successfully copied to the split directory

Processing dxpqnkmhby
=> Successfully copied to the split directory

Processing cnvhlcuyjq
=> Successfully copied to the split directory

Processing zwnkaxgrub
=> Successfully copied to the split directory

Processing lueanpvcqp
=> Successfully copied to the split directory

Processing oxeupoccru
=> Successfully copied to the split directory

...

## Generate CSV File to Store Split Information


In [None]:
cdf_split.to_csv("celeb-df-v2-split.csv", index=False)
df_split.to_csv("deeperforensics-1.0-split.csv", index=False)
dfdc_split.to_csv("dfdc-split.csv", index=False)
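
Passing `index=False` keeps the row index out of the file, so reading it back does not add an `Unnamed: 0` column. A short sketch of a round trip and a per-split count check on toy data; the file ids are made up and an in-memory buffer stands in for the CSV file:

```python
import io

import pandas as pd

# Toy split list with the same columns as the generated split dataframes
toy_split = pd.DataFrame(
    {
        "file_id": ["a.mp4", "b.mp4", "c.mp4", "d.mp4"],
        "label": ["deepfake", "real", "deepfake", "real"],
        "split": ["train", "train", "test", "test"],
    }
)

# Write without the index, then read the same text back
buffer = io.StringIO()
toy_split.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# Count videos per (split, label) pair to confirm the expected sizes
counts = reloaded.groupby(["split", "label"]).size()

print(list(reloaded.columns))
print(counts.to_dict())
```

On the real split files, the same `groupby` should report 70, 10, and 20 videos per label for the train, val, and test splits.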