## Sampling-based Frame Extraction

Below is example code for preprocessing a raw dataset of videos into face frames with `dlib` and `facenet_pytorch.MTCNN`. Make sure the raw dataset directory is structured as described in the [README.md](../README.md).

**Please be aware:** you may need to intervene manually (by re-running the extraction), because the face detectors in `dlib` and `facenet_pytorch.MTCNN` can fail to detect a face in some frames. Check the log files in the `logs/` directory and watch for messages like these:

```bash
ERROR:root:No face detected on ./videos-organized/Combined/FAKE/acdkfksyev.mp4 frame 233
ERROR:root:No face detected on ./videos-organized/Combined/FAKE/alnkzqihau.mp4 frame 102
```
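To see which videos need a re-run, the messages can be grouped by video. The following is a minimal sketch, assuming the log lines follow exactly the format shown above; `collect_missed_frames` is a hypothetical helper and not part of this repository.

```python
import re
from collections import defaultdict

# Pattern assumed from the sample log lines above.
NO_FACE = re.compile(
    r"^ERROR:root:No face detected on (?P<video>.+) frame (?P<frame>\d+)$"
)


def collect_missed_frames(lines):
    """Group the frame numbers with no detected face by video path."""
    missed = defaultdict(list)
    for line in lines:
        match = NO_FACE.match(line.strip())
        if match:
            missed[match.group("video")].append(int(match.group("frame")))
    return dict(missed)


sample = [
    "ERROR:root:No face detected on ./videos-organized/Combined/FAKE/acdkfksyev.mp4 frame 233",
    "ERROR:root:No face detected on ./videos-organized/Combined/FAKE/alnkzqihau.mp4 frame 102",
]
missed = collect_missed_frames(sample)
for video, frames in missed.items():
    print(video, frames)
```

To use it on a real log, pass the lines of a file from `logs/` instead of `sample`, for example by iterating over an open file object.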

### Preliminary

In [1]:
# Run __init__.py before executing the rest of the notebook
%run __init__.py

### Preprocessor

There are two sampling-based extractors: `MTCNNSampleExtractor` and `DlibSampleExtractor`. Both have identical APIs. To instantiate either class, provide two arguments:

- `dataset_path (str)`: Relative path to the raw dataset
- `classes (list[str])`: Names of the class folders inside the dataset, e.g. `["REAL", "FAKE"]`
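As a quick sanity check before extraction, the class folders can be listed up front. This is only a sketch: it assumes each class folder sits directly under the dataset path and holds `.mp4` files, which is inferred from the log paths above rather than confirmed by the README. `list_videos` is a hypothetical helper.

```python
import tempfile
from pathlib import Path


def list_videos(dataset_path, classes):
    """Map each class folder to the sorted .mp4 file names found inside it."""
    root = Path(dataset_path)
    return {c: sorted(p.name for p in (root / c).glob("*.mp4")) for c in classes}


# Build a tiny throwaway dataset to illustrate the assumed layout.
with tempfile.TemporaryDirectory() as tmp:
    for cls, name in [("REAL", "a.mp4"), ("FAKE", "b.mp4"), ("FAKE", "c.mp4")]:
        folder = Path(tmp) / cls
        folder.mkdir(exist_ok=True)
        (folder / name).touch()
    videos = list_videos(tmp, ["REAL", "FAKE"])

print(videos)
```

An empty list for a class usually means the folder name or the dataset path does not match the layout the extractor expects.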

After instantiating, call the `extract()` method to begin the frame extraction. Its arguments are:

- `save_to (str)`: Where to save the extracted images
- `n_frame (int)`: How many frames to extract per video
- `cut_amount (float)`: Fraction of frames to cut from each end of the video, e.g. 0.1 cuts 10% of the frames at the start and 10% at the end, 20% in total.
- `batch_size (int)`: How many videos to process in a batch
- `seed (int)`: Seed for the random number generator, so that every run produces identical results.
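To illustrate how `n_frame`, `cut_amount`, and `seed` could interact, here is a minimal sketch of seeded sampling after trimming both ends. It is an assumption about the general technique, not the extractors' actual implementation; `sample_frame_indices` and `total_frames` are hypothetical names.

```python
import random


def sample_frame_indices(total_frames, n_frame, cut_amount, seed):
    """Pick up to n_frame sorted frame indices after trimming both ends."""
    if not 0 <= cut_amount < 0.5:
        raise ValueError("cut_amount must be in [0, 0.5)")
    cut = int(total_frames * cut_amount)  # frames dropped at each end
    candidates = range(cut, total_frames - cut)
    rng = random.Random(seed)  # local RNG: same seed, same sample
    k = min(n_frame, len(candidates))  # short videos may yield fewer frames
    return sorted(rng.sample(candidates, k))


indices = sample_frame_indices(total_frames=300, n_frame=10, cut_amount=0.15, seed=42)
print(indices)
```

Using a local `random.Random(seed)` instead of the global generator keeps the sample reproducible regardless of what other code does with `random`.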

#### With MTCNN

In [2]:
from datetime import datetime

from src.preprocessing.extractors.mtcnn_extractors import MTCNNSampleExtractor

In [3]:
datasets = [
    "Dev-Sample",
    "Celeb-DF-v2",
    "DFDC",
    "FaceForensics",
    "Combined"
]

for d in datasets:
    ext = MTCNNSampleExtractor(
        dataset_path=f"../data/raw/{d}",
        classes=["REAL", "FAKE"]
    )

    start = datetime.now()
    ext.extract(
        save_to=f"../data/preprocessed/MTCNN-{d}",
        n_frame=10,
        cut_amount=0.15,
        batch_size=10,
        seed=42,
    )
    elapsed = round((datetime.now() - start).total_seconds(), 2)
    print(f"Done extracting {d}'s samples for {elapsed}s!")

Done extracting Dev-Sample's samples for 240.44s!
Done extracting Celeb-DF-v2's samples for 968.26s!
Done extracting DFDC's samples for 4799.07s!
Done extracting FaceForensics's samples for 7019.9s!
Done extracting Combined's samples for 6471.98s!


#### With Dlib

In [2]:
from datetime import datetime

from src.preprocessing.extractors.dlib_extractors import DlibSampleExtractor

In [None]:
datasets = [
    # "Dev-Sample",
    # "Celeb-DF-v2",
    "DFDC",
    # "FaceForensics",
    "Combined"
]

for d in datasets:
    ext = DlibSampleExtractor(
        dataset_path=f"../data/raw/{d}",
        classes=["REAL", "FAKE"]
    )

    start = datetime.now()
    ext.extract(
        save_to=f"../data/preprocessed/Dlib-{d}",
        n_frame=10,
        cut_amount=0.15,
        batch_size=10,
        seed=42,
    )
    elapsed = round((datetime.now() - start).total_seconds(), 2)
    print(f"Done extracting {d}'s samples for {elapsed}s!")