## Frame Extraction

Below is example code for preprocessing a raw dataset of videos into face frames with `facenet_pytorch.MTCNN`. Make sure that the raw dataset's directory is structured as described in the [README.md](../README.md).

**Please be aware:** you might need to run frame extraction more than once, because `MTCNN` can fail to detect faces in some frames.

### Preliminary

In [1]:
# Run __init__.py before the rest of the notebook
%run __init__.py

### Extraction

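There are two extractors: the sampling-based `MTCNNSampleExtractor` and the sequential `MTCNNSeqExtractor`. Both have similar APIs. To instantiate `MTCNNSampleExtractor`, provide two arguments (`MTCNNSeqExtractor` instead takes the dictionary of incomplete videos described in the Sequential section):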

- `dataset_path (str)`: Relative path to the raw dataset
- `classes (list[str])`: Folders inside the dataset

After instantiating an extractor, call its `extract()` method to begin the frame extraction. The arguments for `extract()` are:

- `save_to (str)`: Where to save the extracted images
- `n_frame (int)`: How many frames to extract per video
- `cut_amount (float)`: Fraction of frames to cut from each end of the video. For example, 0.1 cuts 10% of the frames at the start and 10% at the end, 20% in total.
- `batch_size (int)`: How many videos to process in a batch
- `seed (int)`: (Only for `MTCNNSampleExtractor`) Seed for the random sampler, so that every run produces the same results.
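
To make the interaction between `n_frame`, `cut_amount` and `seed` concrete, here is a minimal sketch of how such a sampler could pick frame indices. It is an illustration under stated assumptions, not the actual implementation of `MTCNNSampleExtractor`: the helper names `eligible_frame_range` and `sample_frame_indices` are hypothetical, and the real extractor may round or sample differently.

```python
import random


def eligible_frame_range(total_frames: int, cut_amount: float) -> range:
    """Return the frame indices left after trimming both ends (hypothetical helper)."""
    if not 0 <= cut_amount < 0.5:
        raise ValueError("cut_amount must be in [0, 0.5)")
    # Assumption: the same number of frames is dropped at the start and at the end.
    cut = int(total_frames * cut_amount)
    return range(cut, total_frames - cut)


def sample_frame_indices(total_frames: int, n_frame: int,
                         cut_amount: float, seed: int) -> list:
    """Pick up to n_frame distinct frame indices, reproducibly for a given seed."""
    candidates = eligible_frame_range(total_frames, cut_amount)
    rng = random.Random(seed)  # local generator, so the global random state is untouched
    k = min(n_frame, len(candidates))
    return sorted(rng.sample(candidates, k))


# With 100 frames and cut_amount=0.1, only indices 10..89 are eligible.
window = eligible_frame_range(100, 0.1)
first_run = sample_frame_indices(100, n_frame=10, cut_amount=0.1, seed=42)
second_run = sample_frame_indices(100, n_frame=10, cut_amount=0.1, seed=42)
print(window, first_run == second_run)
```

Using a dedicated `random.Random(seed)` instance rather than `random.seed()` keeps the sampling reproducible without affecting other code that relies on the global generator.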

#### Sampling Based

In [2]:
from datetime import datetime

from src.preprocessing.extractors.mtcnn_extractors import MTCNNSampleExtractor

In [3]:
datasets = [
    "Celeb-DF-v2",
    "DFDC",
    "FaceForensics",
    "Combined"
]

for d in datasets:
    ext = MTCNNSampleExtractor(
        dataset_path=f"../data/raw/{d}",
        classes=["REAL", "FAKE"]
    )

    start = datetime.now()
    ext.extract(
        save_to=f"../data/preprocessed/MTCNN-{d}",
        n_frame=10,
        cut_amount=0.15,
        batch_size=10,
        seed=42,
    )
    elapsed = round((datetime.now() - start).total_seconds(), 2)
    print(f"Done extracting {d}'s samples for {elapsed}s!")

Done extracting Celeb-DF-v2's samples for 1133.86s!
Done extracting DFDC's samples for 5033.5s!
Done extracting FaceForensics's samples for 7180.26s!
Done extracting Combined's samples for 6619.11s!


#### Sequential 

As explained earlier, `MTCNN` might fail to detect a face in some frames. If only six of the ten required frames were extracted from a video, we need to get the remaining four. We can either redo the sampling with another seed or, to speed things up, go through the video sequentially (looping over all available frames and detecting faces). That is what we do here.
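
The sequential idea can be sketched as follows. This is a simplified illustration, not the code of `MTCNNSeqExtractor`: the function `collect_sequentially` and the `has_face` callback are hypothetical stand-ins for reading frames and running the face detector on them.

```python
from typing import Callable, Iterable, List, TypeVar

Frame = TypeVar("Frame")


def collect_sequentially(frames: Iterable[Frame],
                         has_face: Callable[[Frame], bool],
                         needed: int) -> List[int]:
    """Walk the frames in order and return the indices of the first
    `needed` frames in which a face is detected."""
    found: List[int] = []
    for index, frame in enumerate(frames):
        if len(found) >= needed:
            break  # stop early once enough frames were collected
        if has_face(frame):
            found.append(index)
    return found


# Toy stand-in: frames are integers and a "face" is found in every third one.
toy_frames = list(range(10))
picked = collect_sequentially(toy_frames, lambda f: f % 3 == 0, needed=2)
print(picked)  # [0, 3]
```

Unlike random sampling, this pass never revisits a frame and stops as soon as the missing count is reached, which is why it tends to be faster for filling small gaps.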

##### Gatherer

With this, we match the target directory (the output of the sampling-based extraction) against the source directory (raw data). If a video has fewer extracted frames than `need_to_match`, it is part of the "incomplete" data, which is stored as a dictionary.

This dictionary is shaped like so:

```python
# filename: number of missing frames
{
    "../data/raw/DFDC/aaxejguth.mp4": 2,
    "../data/raw/DFDC/mniodsvhr.mp4": 6,
}
```

This dictionary is passed to a sequence-based extractor, which redoes the face detection to fill in the missing frames.

In [4]:
import os
import re

In [5]:
# Example
def count_for_match(source, target, need_to_match):
    """Map each source video with too few extracted frames to its missing count."""
    src_files = os.listdir(source)
    target_files = os.listdir(target)

    eligibles = {}
    for srcf in src_files:
        # Escape the file name so characters such as "." are matched literally
        r = re.compile(rf".*{re.escape(srcf)}")
        trgf = list(filter(r.match, target_files))
        if len(trgf) < need_to_match:
            eligibles[f"{source}/{srcf}"] = need_to_match - len(trgf)
    return eligibles
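
To sanity-check the gatherer logic without touching the real data, the same matching can be run on a throwaway directory. The sketch below is an assumption-laden illustration: `count_missing` is a hypothetical substring-based variant of the gatherer, and the frame naming scheme (`0_video_a.mp4.jpg`) is made up for the demo; the real names are whatever the extractor writes.

```python
import os
import tempfile


def count_missing(source, target, need_to_match):
    """Substring-based variant of the gatherer (hypothetical helper)."""
    target_files = os.listdir(target)
    missing = {}
    for src_name in sorted(os.listdir(source)):
        # A frame belongs to a video if the video's file name appears in the frame's name.
        found = sum(src_name in name for name in target_files)
        if found < need_to_match:
            missing[f"{source}/{src_name}"] = need_to_match - found
    return missing


with tempfile.TemporaryDirectory() as root:
    source = os.path.join(root, "raw")
    target = os.path.join(root, "preprocessed")
    os.makedirs(source)
    os.makedirs(target)

    # Two empty placeholder "videos".
    for video in ("video_a.mp4", "video_b.mp4"):
        open(os.path.join(source, video), "w").close()

    # video_a has all 3 frames, video_b only 1 (made-up naming scheme).
    frames = ["0_video_a.mp4.jpg", "1_video_a.mp4.jpg",
              "2_video_a.mp4.jpg", "0_video_b.mp4.jpg"]
    for frame in frames:
        open(os.path.join(target, frame), "w").close()

    result = count_missing(source, target, need_to_match=3)
    # Strip the temporary directory from the keys for readability.
    demo_result = {os.path.basename(path): n for path, n in result.items()}

print(demo_result)  # {'video_b.mp4': 2}
```

Only `video_b.mp4` is reported, with two frames missing, which is the shape of dictionary the sequential extractor expects.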

##### Sequential Extractor

In [6]:
from datetime import datetime 

from src.preprocessing.extractors.mtcnn_extractors import MTCNNSeqExtractor

In [7]:
datasets = [
    "Celeb-DF-v2",
    "DFDC",
    "FaceForensics",
    "Combined"
]
preprocessor = "MTCNN"
classes = ["REAL", "FAKE"]

for d in datasets:
    start = datetime.now()
    for c in classes:
        x = count_for_match(
            f"../data/raw/{d}/{c}", 
            f"../data/preprocessed/{preprocessor}-{d}/{c}",
            10
        )

        # Skip classes where every video already has enough frames
        if not x:
            continue

        p = MTCNNSeqExtractor(x)
        p.extract(save_to=f"../data/preprocessed/{preprocessor}-{d}", cut_amount=0.15)
    elapsed = round((datetime.now() - start).total_seconds(), 2)
    print(f"Done extracting {d}'s samples for {elapsed}s!")

Done extracting Celeb-DF-v2's samples for 1.59s!
Done extracting DFDC's samples for 164.33s!
Done extracting FaceForensics's samples for 1.15s!
Done extracting Combined's samples for 81.96s!
