## Sequential-based Frame Extraction

As discussed in the previous notebook, [1. Sampling-based Frame Extraction](./1_sampling_based_frame_extraction.ipynb), the preprocessor samples frames at random, so it may pick a frame in which `dlib` or `MTCNN` fails to detect a face or struggles to detect one. Videos affected by this end up with fewer frames than requested.

In this notebook, we fix that by generating the remaining frames with a sequential loop instead of random sampling. For example, if each video should yield 10 frames but only 6 were extracted in the first preprocessing step, this step extracts the remaining 4.
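
The idea can be sketched in plain Python as follows. This is a minimal illustration rather than the project's extractor: `fill_missing_frames` and `detect_face` are hypothetical names, and the detector is assumed to return `None` when it finds no face.

```python
from typing import Callable, Iterable, List, Optional, TypeVar

Frame = TypeVar("Frame")
Face = TypeVar("Face")


def fill_missing_frames(
    frames: Iterable[Frame],
    detect_face: Callable[[Frame], Optional[Face]],
    missing: int,
) -> List[Face]:
    """Walk frames in order and keep the first `missing` detected faces."""
    found: List[Face] = []
    if missing <= 0:
        return found
    for frame in frames:
        face = detect_face(frame)  # hypothetical detector: None means no face
        if face is None:
            continue  # skip this frame and try the next one
        found.append(face)
        if len(found) == missing:
            break  # quota reached, stop reading the video
    return found
```

Because the loop moves on to the next frame whenever detection fails, a video in this sketch only stays incomplete if it runs out of frames first.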

### Preliminary

In [6]:
# Run __init__.py before any other cell in this notebook
%run __init__.py

### Gatherer

First, we examine the log file and filter it by the preprocessing method and the dataset we want to fix. The gatherer also reports some statistics about the errors.

It returns a dictionary-like object (a `Counter`, as the example output in this section shows) that maps each file name to its number of problematic frames:

```python
# filename: problematic frame count
{
    "../data/raw/DFDC/aaxejguth.mp4": 2,
    "../data/raw/DFDC/mniodsvhr.mp4": 6,
}
```
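
To illustrate how such a mapping could be built, here is a rough sketch that counts failed frames per file with `collections.Counter`. The log line layout, the regular expression, and the function name `count_problem_frames` are all assumptions made for this example; the real log format and the internals of `IncompleteGatherer` are not shown in this notebook.

```python
import re
from collections import Counter
from typing import Iterable

# Hypothetical log line layout, e.g.
#   "[MTCNN] no face detected: ../data/raw/DFDC/REAL/abc.mp4"
LINE_RE = re.compile(r"\[(?P<preprocessor>\w+)\] no face detected: (?P<path>\S+)")


def count_problem_frames(
    lines: Iterable[str], preprocessor: str, dataset: str, dataclass: str
) -> Counter:
    """Count failed frames per video for one preprocessor, dataset, and class."""
    counts: Counter = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match is None or match["preprocessor"] != preprocessor:
            continue  # unrelated line or a different preprocessor
        path = match["path"]
        parts = path.split("/")
        # Assumed folder layout: .../<dataset>/<dataclass>/<file>.mp4
        if dataset in parts and dataclass in parts:
            counts[path] += 1
    return counts
```

A `Counter` fits here because `most_common()` lists the worst-affected videos first and `sum(counts.values())` gives the total number of frames to recover.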

In [2]:
# Example
from src.preprocessing.extractors.utils import IncompleteGatherer

gatherer = IncompleteGatherer("./logs/2024-07-20_05-35-13.log")
x = gatherer.gather(
    preprocessor="MTCNN",
    dataset="DFDC",
    dataclass="REAL"
)

x

Counter({'../data/raw/DFDC/REAL/bnuwxhfahw.mp4': 7,
         '../data/raw/DFDC/REAL/orblnqzpra.mp4': 7,
         '../data/raw/DFDC/REAL/pdufsewrec.mp4': 5,
         '../data/raw/DFDC/REAL/wfzjxzhdkj.mp4': 4,
         '../data/raw/DFDC/REAL/bfjsthfhbd.mp4': 4,
         '../data/raw/DFDC/REAL/ijvprklcmz.mp4': 4,
         '../data/raw/DFDC/REAL/xkfliqnmwt.mp4': 3,
         '../data/raw/DFDC/REAL/gsshxchgqv.mp4': 3,
         '../data/raw/DFDC/REAL/ooafcxxfrs.mp4': 3,
         '../data/raw/DFDC/REAL/wapflpdhyi.mp4': 3,
         '../data/raw/DFDC/REAL/bkcyglmfci.mp4': 3,
         '../data/raw/DFDC/REAL/peysyddtmp.mp4': 2,
         '../data/raw/DFDC/REAL/scqcdvqiyq.mp4': 2,
         '../data/raw/DFDC/REAL/jyoxdvxpza.mp4': 2,
         '../data/raw/DFDC/REAL/kezwvsxxzj.mp4': 2,
         '../data/raw/DFDC/REAL/rmlzgerevr.mp4': 2,
         '../data/raw/DFDC/REAL/wbjtrlyjsm.mp4': 2,
         '../data/raw/DFDC/REAL/uprwuohbwx.mp4': 1,
         '../data/raw/DFDC/REAL/aokxvqadsx.mp4': 1,
         '..

### Preprocessor

Instead of taking the path to a folder, this preprocessor takes the gatherer's dictionary, in which each key is a file name and each value is that file's number of problematic frames.
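
As a rough sketch of how such a mapping can drive the work, the function below visits each file, asks a callback to recover the missing frames, and reports what is still missing afterwards. Both `process_incomplete` and the `recover` callback are hypothetical names used only for illustration; they are not part of `MTCNNSeqExtractor` or `DlibSeqExtractor`.

```python
from typing import Callable, Dict, Mapping


def process_incomplete(
    incomplete: Mapping[str, int],
    recover: Callable[[str, int], int],
) -> Dict[str, int]:
    """Return the videos that are still short, mapped to frames still missing."""
    still_missing: Dict[str, int] = {}
    for path, missing in incomplete.items():
        if missing <= 0:
            continue  # nothing to fix for this video
        # Hypothetical callback: returns how many frames it managed to recover
        recovered = recover(path, missing)
        remaining = missing - min(recovered, missing)
        if remaining > 0:
            still_missing[path] = remaining
    return still_missing
```

Returning the leftover counts in the same shape as the input makes it easy to run another pass over only the videos that remain incomplete.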

#### MTCNN

In [3]:
from datetime import datetime 

from src.preprocessing.extractors.utils import IncompleteGatherer
from src.preprocessing.extractors.mtcnn_extractors import MTCNNSeqExtractor

In [4]:
datasets = [
    "Dev-Sample",
    "Celeb-DF-v2",
    "DFDC",
    "FaceForensics",
    "Combined"
]
preprocessor = "MTCNN"
classes = ["REAL", "FAKE"]

for d in datasets:
    for c in classes:
        gatherer = IncompleteGatherer("./logs/2024-07-20_05-35-13.log")
        x = gatherer.gather(
            preprocessor=preprocessor,
            dataset=d,
            dataclass=c
        )

        if not x:  # nothing to fix for this dataset/class pair
            continue

        start = datetime.now()
        p = MTCNNSeqExtractor(x)
        p.extract(save_to=f"../data/preprocessed/{preprocessor}-{d}", cut_amount=0.15)
        elapsed = round((datetime.now() - start).total_seconds(), 2)
        print(f"Done extracting {d}'s samples for {elapsed}s!")

Done extracting Dev-Sample's samples for 0.63s!
Done extracting Dev-Sample's samples for 0.41s!
Done extracting Celeb-DF-v2's samples for 0.1s!
Done extracting DFDC's samples for 70.91s!
Done extracting DFDC's samples for 73.32s!
Done extracting Combined's samples for 45.58s!
Done extracting Combined's samples for 30.84s!


#### Dlib

In [9]:
from datetime import datetime 

from src.preprocessing.extractors.utils import IncompleteGatherer
from src.preprocessing.extractors.dlib_extractors import DlibSeqExtractor

In [10]:
datasets = [
    # "Dev-Sample",
    # "Celeb-DF-v2",
    "DFDC",
    # "FaceForensics",
    "Combined"
]
preprocessor = "Dlib"
classes = ["REAL", "FAKE"]

for d in datasets:
    for c in classes:
        gatherer = IncompleteGatherer("./logs/2024-07-20_05-35-13.log")
        x = gatherer.gather(
            preprocessor=preprocessor,
            dataset=d,
            dataclass=c
        )

        if not x:  # nothing to fix for this dataset/class pair
            continue

        start = datetime.now()
        p = DlibSeqExtractor(x)
        p.extract(save_to=f"../data/preprocessed/{preprocessor}-{d}", cut_amount=0.15)
        elapsed = round((datetime.now() - start).total_seconds(), 2)
        print(f"Done extracting {d}'s samples for {elapsed}s!")

KeyboardInterrupt: 