## Dataset Setup

This document sets up the three datasets used in this work: `DFDC`, `FaceForensics++`, and `Celeb-DF-v2`. Once the datasets are downloaded, each one is restructured into `FAKE` and `REAL` class folders.

Raw datasets must be placed in the `datasets/raws/` directory. We select 400 videos per class for each individual dataset and 600 videos per class for the combined dataset.

| Dataset         | Real | Fake | Total |
|-----------------|------|------|-------|
| DFDC            | 400  | 400  | 800   |
| FaceForensics++ | 400  | 400  | 800   |
| Celeb-DF-v2     | 400  | 400  | 800   |
| Combined        | 600  | 600  | 1200  |

The filtered videos will be available in the `datasets/filtered/` folder.
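
As a quick sanity check against the table above, a small helper can count the files in each class folder once the datasets have been organized. This is only a minimal sketch and not part of the original notebook: the name `count_videos` is hypothetical, and it assumes a `<root>/<dataset>/<REAL|FAKE>/` layout. The root is passed in explicitly because the code cells in this document write to `./videos-organized/`.

```python
import os


def count_videos(root, classes=("REAL", "FAKE")):
    """Return {dataset: {class: file_count}} for every dataset folder under `root`.

    Hypothetical helper; assumes a <root>/<dataset>/<class>/ layout.
    """
    counts = {}
    for dataset in sorted(os.listdir(root)):
        ds_path = os.path.join(root, dataset)
        if not os.path.isdir(ds_path):
            continue
        counts[dataset] = {}
        for cl in classes:
            cl_path = os.path.join(ds_path, cl)
            # A missing class folder counts as zero files
            counts[dataset][cl] = len(os.listdir(cl_path)) if os.path.isdir(cl_path) else 0
    return counts
```

Calling `count_videos("./videos-organized")` after the cells below have run should report counts that match the table, provided the raw datasets contain enough videos per class.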

### Downloading the Datasets

`DFDC` and `Celeb-DF-v2` can be downloaded from the following links:
- [DFDC](https://www.kaggle.com/c/deepfake-detection-challenge/data)  
- [Celeb-DF-v2](https://drive.google.com/open?id=1iLx76wsbi9itnkxSqz9BVBl4ZvnbIazj)

`FaceForensics++` is downloaded with a dedicated script, `helper/ff_downloader.py`, which is invoked as follows:

In [1]:
# Uncomment to re-run if needed
# !python helper/ff_downloader.py datasets/raws/FaceForensics -d Deepfakes -c raw -t videos -n 100 --server EU2
# !python helper/ff_downloader.py datasets/raws/FaceForensics -d Face2Face -c raw -t videos -n 100 --server EU2
# !python helper/ff_downloader.py datasets/raws/FaceForensics -d FaceSwap -c raw -t videos -n 100 --server EU2
# !python helper/ff_downloader.py datasets/raws/FaceForensics -d NeuralTextures -c raw -t videos -n 100 --server EU2
# !python helper/ff_downloader.py datasets/raws/FaceForensics -d original -c raw -t videos -n 400 --server EU2

### Split Each Dataset to REAL/FAKE

Reorganizing `Celeb-DF-v2` and `FaceForensics++` only requires changing the directory structure, whereas `DFDC` requires a bit more work because each video's label must be looked up in a separate file, `metadata.json`.
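
To illustrate the lookup, the sketch below parses a small metadata sample and groups file names by label. The sample entries and the helper name `labels_from_metadata` are made up for illustration. The only structure assumed is the one the following code cell relies on, where each file name maps to an object whose `label` is `REAL` or `FAKE`.

```python
import json

# Made-up sample in the assumed shape: file name -> {"label": ...}
sample_metadata = """
{
    "video_a.mp4": {"label": "FAKE"},
    "video_b.mp4": {"label": "REAL"},
    "video_c.mp4": {"label": "FAKE"}
}
"""


def labels_from_metadata(metadata):
    """Group file names by their label (hypothetical helper)."""
    grouped = {"REAL": [], "FAKE": []}
    for file_name, info in metadata.items():
        grouped[info["label"]].append(file_name)
    return grouped


grouped = labels_from_metadata(json.loads(sample_metadata))
print(grouped)
```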

In [20]:
import os
import shutil
import json


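def organize_to_filtered(dataset, new_fd, dir_map, max_files=400):
    """Copy up to `max_files` files from each raw sub-directory into its class folder."""
    for dmap, label in dir_map.items():
        raw_loc = f"{dataset}/{dmap}"
        new_loc = f"{new_fd}/{label}"

        # exist_ok makes the call safe when the class folder already exists
        os.makedirs(new_loc, exist_ok=True)

        # TODO: "./raws/FaceForensics" should not be hardcoded for flexibility
        # FaceForensics paths are nested, so the category (e.g. "Deepfakes") is the prefix
        ff_cat = dmap.split("/")[-3] if dataset == "./raws/FaceForensics" else ""
        prefix = ff_cat if ff_cat else dmap

        for i, file in enumerate(os.listdir(raw_loc)):
            # Stop once the per-directory limit is reached
            if i >= max_files:
                break
            shutil.copy(f"{raw_loc}/{file}", f"{new_loc}/{prefix}_{file}")

        # Files are copied, not moved; the raw data stays in place
        print(f"Copied {raw_loc} to {new_loc}")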


def organize_mul_from_metadata(datasets, new_fd, metadatas, max_files=400):
    """Copy up to `max_files` videos per label, reading labels from each metadata file."""
    counter = {"REAL": 0, "FAKE": 0}

    # Create both class folders even if `new_fd` itself already exists
    for label in counter:
        os.makedirs(f"{new_fd}/{label}", exist_ok=True)

    for ds, metadata in zip(datasets, metadatas):
        with open(metadata, "r") as metafile:
            json_data = json.load(metafile)

        for file in os.listdir(ds):
            if file == "metadata.json":
                continue

            label = json_data[file]["label"]
            if counter[label] < max_files:
                shutil.copy(f"{ds}/{file}", f"{new_fd}/{label}/{file}")
                counter[label] += 1

            # Stop scanning the remaining datasets once both classes are full
            if all(count >= max_files for count in counter.values()):
                print("\nAll labels are maxed!")
                return

In [23]:
# Uncomment to re-run if needed
# organize_to_filtered("./raws/Celeb-DF-v2", "./videos-organized/Celeb-DF-v2", {
#     "Celeb-real": "REAL",
#     "Celeb-synthesis": "FAKE"
# })

# organize_to_filtered(
#     "./raws/FaceForensics",
#     "./videos-organized/FaceForensics",
#     {
#         "original_sequences/youtube/raw/videos": "REAL",
#         "manipulated_sequences/Deepfakes/raw/videos": "FAKE",
#         "manipulated_sequences/Face2Face/raw/videos": "FAKE",
#         "manipulated_sequences/FaceSwap/raw/videos": "FAKE",
#         "manipulated_sequences/NeuralTextures/raw/videos": "FAKE",
#     }
# )

# dfdc_datasets = [
#     "./raws/dfdc_train_part_0",
#     "./raws/dfdc_train_part_1",
#     "./raws/dfdc_train_part_2",
#     "./raws/dfdc_train_part_3",
#     "./raws/dfdc_train_part_4"
# ]

# dfdc_metadatas = [
#     "./raws/dfdc_train_part_0/metadata.json",
#     "./raws/dfdc_train_part_1/metadata.json",
#     "./raws/dfdc_train_part_2/metadata.json",
#     "./raws/dfdc_train_part_3/metadata.json",
#     "./raws/dfdc_train_part_4/metadata.json"
# ]

# organize_mul_from_metadata(dfdc_datasets, "./videos-organized/DFDC", dfdc_metadatas)

### Create Combined Dataset

With the datasets now organized, creating the combined set is straightforward. For now, the combined set takes the first 200 videos per class from each dataset, giving 3 × 200 = 600 videos per class.
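
Note that `os.listdir` returns entries in arbitrary order, so "the first 200 videos" can differ between machines. If reproducibility matters, one option is to sort the file names before slicing. The helper below, `first_n_files`, is a hypothetical sketch of that idea and is not part of the notebook's pipeline.

```python
import os


def first_n_files(directory, n):
    """Return the first `n` file names in `directory`, in sorted order.

    Hypothetical helper: sorting makes the selection reproducible,
    since os.listdir gives no ordering guarantee.
    """
    names = [
        name for name in os.listdir(directory)
        if os.path.isfile(os.path.join(directory, name))
    ]
    return sorted(names)[:n]
```

Iterating over `first_n_files(f"{ds}/{cl}", 200)` instead of the raw directory listing would select the same videos on every run.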

In [30]:
def create_combined_ds(datasets, classes, combined_ds_path, max_per_class=200):
    """Copy up to `max_per_class` files per class from each dataset into one combined folder."""
    # Create the class folders even if `combined_ds_path` already exists
    for c in classes:
        os.makedirs(f"{combined_ds_path}/{c}", exist_ok=True)

    for ds in datasets:
        for cl in classes:
            for i, file in enumerate(os.listdir(f"{ds}/{cl}")):
                # Stop after `max_per_class` files from this dataset and class
                if i == max_per_class:
                    print(f"Finish inserting {ds}-{cl}")
                    break

                shutil.copy(f"{ds}/{cl}/{file}", f"{combined_ds_path}/{cl}/{file}")

In [31]:
organized_ds = [
    "./videos-organized/Celeb-DF-v2",
    "./videos-organized/DFDC",
    "./videos-organized/FaceForensics"
]

classes = ["REAL", "FAKE"]

create_combined_ds(organized_ds, classes, "./videos-organized/Combined")

Finish inserting ./videos-organized/Celeb-DF-v2-REAL
Finish inserting ./videos-organized/Celeb-DF-v2-FAKE
Finish inserting ./videos-organized/DFDC-REAL
Finish inserting ./videos-organized/DFDC-FAKE
Finish inserting ./videos-organized/FaceForensics-REAL
Finish inserting ./videos-organized/FaceForensics-FAKE
