## Preprocessing

The cells below show how to preprocess a raw dataset of videos into face frames with `dlib` and `facenet_pytorch.MTCNN`. Make sure the raw dataset's directory is structured as described in the [README.md](../README.md).

**Please be aware:** you may need to intervene manually (re-run the extraction), because the face detectors of `dlib` and `facenet_pytorch.MTCNN` can fail to detect a face in some frames. Check the log files in the `logs/` directory and watch for messages like these:

```text
ERROR:root:No face detected on ./videos-organized/Combined/FAKE/acdkfksyev.mp4 frame 233
ERROR:root:No face detected on ./videos-organized/Combined/FAKE/alnkzqihau.mp4 frame 102
```
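
To decide which videos to re-run, the messages above can be grouped by video. The following is a minimal sketch that assumes every such message follows exactly the format shown; the helper name `collect_missed_frames` is hypothetical and not part of this repository.

```python
import re
from collections import defaultdict

# Matches the "No face detected" messages in the format shown above.
NO_FACE_RE = re.compile(
    r"No face detected on (?P<video>.+) frame (?P<frame>\d+)\s*$"
)


def collect_missed_frames(lines):
    """Map each video path to the sorted frame numbers where no face was found."""
    missed = defaultdict(set)
    for line in lines:
        match = NO_FACE_RE.search(line)
        if match:
            missed[match.group("video")].add(int(match.group("frame")))
    return {video: sorted(frames) for video, frames in missed.items()}


# The two sample messages from this document.
sample_log = [
    "ERROR:root:No face detected on ./videos-organized/Combined/FAKE/acdkfksyev.mp4 frame 233",
    "ERROR:root:No face detected on ./videos-organized/Combined/FAKE/alnkzqihau.mp4 frame 102",
]

missed = collect_missed_frames(sample_log)
for video, frames in missed.items():
    print(video, frames)
```

For real logs, pass the lines of each file in `logs/` instead of `sample_log`. The file names and extensions in that directory are not specified here, so adjust the file discovery to your setup.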

### Preliminary

In [1]:
# Ensure __init__.py is being run before the script
%run __init__.py

### Preprocessor

There are two preprocessors: `MTCNNSampleGenerator` and `DlibSampleGenerator`. Both expose the same API. To instantiate either class, provide two arguments:

- `dataset_path (str)`: Relative path to the raw dataset
- `classes (list[str])`: Names of the class folders inside the dataset, e.g. `["REAL", "FAKE"]`
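
Before starting a long extraction run, it can help to confirm that each class folder exists under `dataset_path`. This is a minimal sketch, assuming the class folders sit directly inside the dataset directory (as the `Extracting ...` output in this notebook suggests); `missing_class_dirs` is a hypothetical helper, not part of the preprocessor API.

```python
import tempfile
from pathlib import Path


def missing_class_dirs(dataset_path, classes):
    """Return the class folder names that do not exist under dataset_path."""
    root = Path(dataset_path)
    return [name for name in classes if not (root / name).is_dir()]


# Demonstration on a temporary dataset that only contains a REAL folder.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "REAL").mkdir()
    missing = missing_class_dirs(tmp, ["REAL", "FAKE"])

print(missing)
```

An empty list means every folder named in `classes` was found.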

After instantiating, call the `preprocess()` method to begin frame extraction. The arguments for `preprocess()` are:

- `save_to (str)`: Where to save the extracted images
- `n_frame (int)`: How many frames to extract per video
- `cut_amount (float)`: Fraction of frames to trim from each end of a video. For example, `0.1` trims 10% of the frames from the start and 10% from the end, 20% in total.
- `batch_size (int)`: How many videos to process in a batch
- `seed (int)`: Seed for the random number generator, so that repeated runs produce identical results.
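
To make the interaction between `n_frame`, `cut_amount`, and `seed` concrete, here is a minimal sketch of one way such a selection could work. It assumes frames are drawn uniformly at random from the untrimmed middle of the video; the actual sampling strategy used by the preprocessors is not described here, and `sample_frame_indices` is a hypothetical function.

```python
import random


def sample_frame_indices(total_frames, n_frame, cut_amount, seed):
    """Pick n_frame indices after trimming cut_amount of the frames from each end."""
    if not 0 <= cut_amount < 0.5:
        raise ValueError("cut_amount must be in [0, 0.5)")
    cut = int(total_frames * cut_amount)          # frames dropped at each end
    candidates = range(cut, total_frames - cut)   # the remaining middle section
    rng = random.Random(seed)                     # local generator; global state is untouched
    k = min(n_frame, len(candidates))             # never ask for more frames than exist
    return sorted(rng.sample(candidates, k))


# Same values as the cells in this notebook, for a hypothetical 300-frame video.
indices = sample_frame_indices(total_frames=300, n_frame=10, cut_amount=0.15, seed=42)
print(indices)
```

With these values, the first and last 45 frames are excluded, and calling the function again with the same `seed` returns the same indices.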

### With MTCNN

In [2]:
from src.generators.mtcnn_generators import MTCNNSampleGenerator

In [3]:
prep = MTCNNSampleGenerator(
    dataset_path="../data/raw/Celeb-DF-v2",
    classes=["REAL", "FAKE"]
)

prep.preprocess(
    save_to="../data/preprocessed/MTCNN-Celeb-DF-v2",
    n_frame=10,
    cut_amount=0.15,
    batch_size=10,
    seed=42,
)

In [4]:
prep = MTCNNSampleGenerator(
    dataset_path="../data/raw/DFDC",
    classes=["REAL", "FAKE"]
)

prep.preprocess(
    save_to="../data/preprocessed/MTCNN-DFDC",
    n_frame=10,
    cut_amount=0.15,
    batch_size=10,
    seed=42,
)

Extracting ../data/raw/DFDC/REAL to ../data/preprocessed/MTCNN-DFDC/REAL done in 2366.79s
Extracting ../data/raw/DFDC/FAKE to ../data/preprocessed/MTCNN-DFDC/FAKE done in 2421.98s


In [5]:
prep = MTCNNSampleGenerator(
    dataset_path="../data/raw/FaceForensics",
    classes=["REAL", "FAKE"]
)

prep.preprocess(
    save_to="../data/preprocessed/MTCNN-FaceForensics",
    n_frame=10,
    cut_amount=0.15,
    batch_size=10,
    seed=42,
)

Extracting ../data/raw/FaceForensics/REAL to ../data/preprocessed/MTCNN-FaceForensics/REAL done in 3668.26s
Extracting ../data/raw/FaceForensics/FAKE to ../data/preprocessed/MTCNN-FaceForensics/FAKE done in 3594.76s


In [6]:
prep = MTCNNSampleGenerator(
    dataset_path="../data/raw/Combined",
    classes=["REAL", "FAKE"]
)

prep.preprocess(
    save_to="../data/preprocessed/MTCNN-Combined",
    n_frame=10,
    cut_amount=0.15,
    batch_size=10,
    seed=42,
)

Extracting ../data/raw/Combined/REAL to ../data/preprocessed/MTCNN-Combined/REAL done in 4028.04s
Extracting ../data/raw/Combined/FAKE to ../data/preprocessed/MTCNN-Combined/FAKE done in 3576.62s


### With Dlib

In [7]:
from src.generators.dlib_generators import DlibSampleGenerator

In [8]:
prep = DlibSampleGenerator(
    dataset_path="../data/raw/Celeb-DF-v2",
    classes=["REAL", "FAKE"]
)

prep.preprocess(
    save_to="../data/preprocessed/Dlib-Celeb-DF-v2",
    n_frame=10,
    cut_amount=0.15,
    batch_size=10,
    seed=42,
)

Extracting ../data/raw/Celeb-DF-v2/REAL to ../data/preprocessed/Dlib-Celeb-DF-v2/REAL done in 1572.8s
Extracting ../data/raw/Celeb-DF-v2/FAKE to ../data/preprocessed/Dlib-Celeb-DF-v2/FAKE done in 1658.66s


In [9]:
prep = DlibSampleGenerator(
    dataset_path="../data/raw/DFDC",
    classes=["REAL", "FAKE"]
)

prep.preprocess(
    save_to="../data/preprocessed/Dlib-DFDC",
    n_frame=10,
    cut_amount=0.15,
    batch_size=10,
    seed=42,
)

Extracting ../data/raw/DFDC/REAL to ../data/preprocessed/Dlib-DFDC/REAL done in 5099.15s
Extracting ../data/raw/DFDC/FAKE to ../data/preprocessed/Dlib-DFDC/FAKE done in 5071.03s


In [10]:
prep = DlibSampleGenerator(
    dataset_path="../data/raw/FaceForensics",
    classes=["REAL", "FAKE"]
)

prep.preprocess(
    save_to="../data/preprocessed/Dlib-FaceForensics",
    n_frame=10,
    cut_amount=0.15,
    batch_size=10,
    seed=42,
)

Extracting ../data/raw/FaceForensics/REAL to ../data/preprocessed/Dlib-FaceForensics/REAL done in 5267.2s
Extracting ../data/raw/FaceForensics/FAKE to ../data/preprocessed/Dlib-FaceForensics/FAKE done in 5375.04s


In [11]:
prep = DlibSampleGenerator(
    dataset_path="../data/raw/Combined",
    classes=["REAL", "FAKE"]
)

prep.preprocess(
    save_to="../data/preprocessed/Dlib-Combined",
    n_frame=10,
    cut_amount=0.15,
    batch_size=10,
    seed=42,
)

Extracting ../data/raw/Combined/REAL to ../data/preprocessed/Dlib-Combined/REAL done in 6549.61s
Extracting ../data/raw/Combined/FAKE to ../data/preprocessed/Dlib-Combined/FAKE done in 6265.16s
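
The cells above differ only in the detector prefix and the dataset name, so the path arguments can be generated instead of repeated. This is a minimal sketch; `build_jobs` is a hypothetical helper, and the directory roots are simply the ones used in the cells above.

```python
def build_jobs(detector, datasets, raw_root="../data/raw", out_root="../data/preprocessed"):
    """Build the dataset_path / save_to pair for each dataset name."""
    return [
        {
            "dataset_path": f"{raw_root}/{name}",
            "save_to": f"{out_root}/{detector}-{name}",
        }
        for name in datasets
    ]


DATASETS = ["Celeb-DF-v2", "DFDC", "FaceForensics", "Combined"]
jobs = build_jobs("MTCNN", DATASETS)

for job in jobs:
    print(job["dataset_path"], "->", job["save_to"])

# Inside the notebook, each job could then be run like the cells above:
# for job in jobs:
#     prep = MTCNNSampleGenerator(dataset_path=job["dataset_path"], classes=["REAL", "FAKE"])
#     prep.preprocess(save_to=job["save_to"], n_frame=10, cut_amount=0.15, batch_size=10, seed=42)
```

Running the datasets in separate cells, as done above, keeps each run's output and timing visible on its own, which is a reasonable choice for jobs that take over an hour each.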
