## Dataset Setup

This document sets up the three datasets used in this work: `DFDC`, `FaceForensics++`, and `Celeb-DF-v2`. After the datasets are downloaded, each one is restructured into `FAKE` and `REAL` folders.

Raw datasets must be placed in the `datasets/raws/` directory. We select 400 videos per class for each individual dataset and 600 videos per class for the combined dataset.

| Dataset         | Real | Fake | Total |
|-----------------|------|------|-------|
| DFDC            | 400  | 400  | 800   |
| FaceForensics++ | 400  | 400  | 800   |
| Celeb-DF-v2     | 400  | 400  | 800   |
| Combined        | 600  | 600  | 1200  |

The selected videos are written to the `datasets/filtered/` folder.
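
Once the filtered folders exist, the per-class totals can be compared against the table. The sketch below is one way to count them. It assumes each dataset folder under `datasets/filtered/` contains `REAL` and `FAKE` sub-folders, and the helper name `count_videos` and the extension list are illustrative rather than part of this project.

```python
from pathlib import Path

# Assumption: videos are identified by these extensions; adjust if the raw data differs.
VIDEO_EXTENSIONS = {".mp4", ".avi", ".mov"}


def count_videos(filtered_root):
    """Return {dataset: {"REAL": n, "FAKE": n}} for every dataset folder found."""
    counts = {}
    root = Path(filtered_root)
    if not root.is_dir():
        return counts

    for dataset_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        per_class = {}
        for label in ("REAL", "FAKE"):
            label_dir = dataset_dir / label
            if label_dir.is_dir():
                per_class[label] = sum(
                    1
                    for f in label_dir.iterdir()
                    if f.is_file() and f.suffix.lower() in VIDEO_EXTENSIONS
                )
            else:
                # A missing label folder counts as zero videos.
                per_class[label] = 0
        counts[dataset_dir.name] = per_class
    return counts


if __name__ == "__main__":
    for name, per_class in count_videos("./datasets/filtered").items():
        print(name, per_class)
```

A dataset that reports fewer than 400 videos for a class did not have enough raw videos of that class to fill the quota.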

### Downloading the Datasets

`DFDC` and `Celeb-DF-v2` can be downloaded from the following links:
- [DFDC](https://www.kaggle.com/c/deepfake-detection-challenge/data)  
- [Celeb-DF-v2](https://drive.google.com/open?id=1iLx76wsbi9itnkxSqz9BVBl4ZvnbIazj)

For `FaceForensics++`, a dedicated download script (`helper/ff_downloader.py`) is used. It is called as follows:

In [1]:
# Uncomment the lines below to re-run the download if needed
# !python helper/ff_downloader.py datasets/raws/FaceForensics -d Deepfakes -c raw -t videos -n 100 --server EU2
# !python helper/ff_downloader.py datasets/raws/FaceForensics -d Face2Face -c raw -t videos -n 100 --server EU2
# !python helper/ff_downloader.py datasets/raws/FaceForensics -d FaceSwap -c raw -t videos -n 100 --server EU2
# !python helper/ff_downloader.py datasets/raws/FaceForensics -d NeuralTextures -c raw -t videos -n 100 --server EU2
# !python helper/ff_downloader.py datasets/raws/FaceForensics -d original -c raw -t videos -n 400 --server EU2
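
The five commands differ only in the dataset type and the number of videos, so they can also be generated from a small table. The following is a minimal sketch that rebuilds the same argument lists. The function name `build_ff_commands` is hypothetical, and the flags are copied from the commands above rather than from the downloader's own documentation.

```python
import subprocess
import sys

# Dataset type -> number of videos, copied from the commented-out commands above.
FF_DOWNLOADS = {
    "Deepfakes": 100,
    "Face2Face": 100,
    "FaceSwap": 100,
    "NeuralTextures": 100,
    "original": 400,
}


def build_ff_commands(output_dir="datasets/raws/FaceForensics", server="EU2"):
    """Build one argument list per dataset type, mirroring the commands above."""
    return [
        [
            sys.executable, "helper/ff_downloader.py", output_dir,
            "-d", method, "-c", "raw", "-t", "videos",
            "-n", str(count), "--server", server,
        ]
        for method, count in FF_DOWNLOADS.items()
    ]


if __name__ == "__main__":
    for cmd in build_ff_commands():
        print(" ".join(cmd))
        # Uncomment to start the download (requires helper/ff_downloader.py):
        # subprocess.run(cmd, check=True)
```

The four manipulation methods contribute 100 videos each, which gives 400 fake videos to match the 400 original ones.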

### Split Each Dataset to REAL/FAKE

Reorganizing `Celeb-DF-v2` and `FaceForensics++` only requires changing the directory structure, whereas `DFDC` requires a bit more work: each video's label has to be looked up in a separate file, `metadata.json`.

In [37]:
import os
import shutil
import json


def organize_to_filtered(dataset, dir_map, max_files=400):
    """Copy up to `max_files` files from each raw sub-directory into its REAL/FAKE folder."""
    raw_ds = f"./datasets/raws/{dataset}"
    filtered_ds = f"./datasets/filtered/{dataset}"
    for dmap in dir_map:
        raw_loc = f"{raw_ds}/{dmap}"
        new_loc = f"{filtered_ds}/{dir_map[dmap]}"

        os.makedirs(new_loc, exist_ok=True)

        # FaceForensics paths look like "manipulated_sequences/Deepfakes/raw/videos";
        # the third-from-last component is used as the file-name prefix.
        ff_cat = dmap.split("/")[-3] if dataset == "FaceForensics" else ""
        prefix = ff_cat if ff_cat else dmap

        for i, file in enumerate(os.listdir(raw_loc)):
            if i >= max_files:
                break
            shutil.copy(f"{raw_loc}/{file}", f"{new_loc}/{prefix}_{file}")

        # Files are copied, not moved; the raw folder is left untouched.
        print(f"Copied {raw_loc} to {new_loc}")


def organize_from_metadata(dataset, metadata, max_files=400):
    """Copy up to `max_files` videos per label, reading labels from `metadata`."""
    raw_ds = f"./datasets/raws/{dataset}"
    filtered_ds = f"./datasets/filtered/{dataset}"

    counter = {
        "REAL": 0,
        "FAKE": 0,
    }

    # Create both label folders even if the dataset folder already exists.
    os.makedirs(f"{filtered_ds}/FAKE", exist_ok=True)
    os.makedirs(f"{filtered_ds}/REAL", exist_ok=True)

    with open(metadata, "r") as metafile:
        json_data = json.load(metafile)

    for file in os.listdir(raw_ds):
        # Skip the metadata file itself and any file it does not describe.
        if file == "metadata.json" or file not in json_data:
            continue

        label = json_data[file]["label"]
        if counter[label] < max_files:
            shutil.copy(f"{raw_ds}/{file}", f"{filtered_ds}/{label}/{file}")
            counter[label] += 1
            print(f"Counter {label}: {counter[label]}")

        # Stop as soon as both classes have reached their quota.
        if counter["REAL"] == max_files and counter["FAKE"] == max_files:
            break
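
Before copying, it can help to check whether a DFDC part contains enough videos of each class to reach the quota. The sketch below counts labels directly from `metadata.json`. It assumes the same structure the function above relies on (each file name maps to an object with a `label` key), and the helper names `summarize_labels` and `has_enough` are illustrative.

```python
import json
from collections import Counter


def summarize_labels(metadata_path):
    """Count how many entries in a metadata.json file carry each label."""
    with open(metadata_path, "r") as metafile:
        metadata = json.load(metafile)
    return Counter(entry["label"] for entry in metadata.values())


def has_enough(label_counts, per_class=400):
    """Return True if both REAL and FAKE reach the per-class quota."""
    return all(label_counts.get(label, 0) >= per_class for label in ("REAL", "FAKE"))


if __name__ == "__main__":
    path = "./datasets/raws/dfdc_train_part_4/metadata.json"
    try:
        counts = summarize_labels(path)
    except FileNotFoundError:
        print(f"{path} not found; download the DFDC part first.")
    else:
        print(dict(counts), "enough for 400 per class:", has_enough(counts))
```

If one class falls short, a second DFDC part would be needed to fill the remaining slots for that class.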

In [38]:
# Uncomment the calls below to re-run the reorganization if needed
# organize_to_filtered("Celeb-DF-v2", {
#     "Celeb-real": "REAL",
#     "Celeb-synthesis": "FAKE"
# })

# print("")
# organize_to_filtered("FaceForensics", {
#     "original_sequences/youtube/raw/videos": "REAL",
#     "manipulated_sequences/Deepfakes/raw/videos": "FAKE",
#     "manipulated_sequences/Face2Face/raw/videos": "FAKE",
#     "manipulated_sequences/FaceSwap/raw/videos": "FAKE",
#     "manipulated_sequences/NeuralTextures/raw/videos": "FAKE",
# })

organize_from_metadata("dfdc_train_part_4", "./datasets/raws/dfdc_train_part_4/metadata.json")

Counter FAKE: 1
Counter REAL: 1
Counter FAKE: 2
Counter FAKE: 3
Counter FAKE: 4
Counter FAKE: 5
Counter FAKE: 6
Counter FAKE: 7
Counter REAL: 2
Counter REAL: 3
Counter FAKE: 8
Counter FAKE: 9
Counter FAKE: 10
Counter FAKE: 11
Counter FAKE: 12
Counter FAKE: 13
Counter FAKE: 14
Counter FAKE: 15
Counter FAKE: 16
Counter FAKE: 17
Counter FAKE: 18
Counter FAKE: 19
Counter FAKE: 20
Counter FAKE: 21
Counter FAKE: 22
Counter FAKE: 23
Counter FAKE: 24
Counter FAKE: 25
Counter FAKE: 26
Counter FAKE: 27
Counter FAKE: 28
Counter REAL: 4
Counter FAKE: 29
Counter REAL: 5
Counter FAKE: 30
Counter REAL: 6
Counter FAKE: 31
Counter FAKE: 32
Counter REAL: 7
Counter FAKE: 33
Counter FAKE: 34
Counter REAL: 8
Counter FAKE: 35
Counter FAKE: 36
Counter FAKE: 37
Counter FAKE: 38
Counter FAKE: 39
Counter FAKE: 40
Counter FAKE: 41
Counter REAL: 9
Counter FAKE: 42
Counter FAKE: 43
Counter FAKE: 44
Counter FAKE: 45
Counter FAKE: 46
Counter FAKE: 47
Counter FAKE: 48
Counter FAKE: 49
Counter FAKE: 50
Counter FAKE: 5