## Preprocessing Single Files

This notebook shows how to intervene manually when the run in the [previous](./1_preprocessing.ipynb) notebook produced errors.

### Preliminary

In [1]:
# Ensure __init__.py is being run before the script
%run __init__.py

### Gatherer

First, we examine the log file and filter it by the preprocessing method and the dataset we want to fix. The gatherer also reports some statistics about the errors (how many files are affected in each REAL/FAKE set, and how many frames are problematic).

This returns a dictionary (a `Counter`) that maps each file name to its problematic frame count, shaped like so:

```python
    # filename: problematic frame count
    {
     "../data/raw/DFDC/aaxejguth.mp4": 2,
     "../data/raw/DFDC/mniodsvhr.mp4": 6,
    }
```
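Since the result behaves like a `collections.Counter` (the example output further down is printed as one), the standard `Counter` methods are enough to get a quick overview before reprocessing. The sketch below uses the two illustrative entries from the shape above, not real gatherer output, and `summarize` is a hypothetical helper that is not part of `src.preprocessing.utils`.

```python
from collections import Counter


def summarize(problem_frames):
    """Summarize a {filename: problematic frame count} mapping.

    Hypothetical helper for illustration only.
    """
    counts = Counter(problem_frames)
    return {
        "files": len(counts),               # number of affected files
        "frames": sum(counts.values()),     # total problematic frames
        "worst": counts.most_common(1),     # file with the most problematic frames
    }


# Illustrative entries copied from the shape shown above.
example = {
    "../data/raw/DFDC/aaxejguth.mp4": 2,
    "../data/raw/DFDC/mniodsvhr.mp4": 6,
}

summary = summarize(example)
print(summary)
```

A summary like this can help decide whether a dataset has few enough failures to reprocess by hand.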

In [2]:
# Example
from src.preprocessing.utils import IncompleteGatherer

gatherer = IncompleteGatherer("./logs/2024-07-11_07-53-11.log")
x = gatherer.gather(
    preprocessor="MTCNN",
    dataset="DFDC",
    dataclass="REAL"
)

x

Counter({'../data/raw/DFDC/REAL/bnuwxhfahw.mp4': 7,
         '../data/raw/DFDC/REAL/orblnqzpra.mp4': 7,
         '../data/raw/DFDC/REAL/pdufsewrec.mp4': 5,
         '../data/raw/DFDC/REAL/wfzjxzhdkj.mp4': 4,
         '../data/raw/DFDC/REAL/bfjsthfhbd.mp4': 4,
         '../data/raw/DFDC/REAL/ijvprklcmz.mp4': 4,
         '../data/raw/DFDC/REAL/xkfliqnmwt.mp4': 3,
         '../data/raw/DFDC/REAL/gsshxchgqv.mp4': 3,
         '../data/raw/DFDC/REAL/ooafcxxfrs.mp4': 3,
         '../data/raw/DFDC/REAL/wapflpdhyi.mp4': 3,
         '../data/raw/DFDC/REAL/bkcyglmfci.mp4': 3,
         '../data/raw/DFDC/REAL/peysyddtmp.mp4': 2,
         '../data/raw/DFDC/REAL/scqcdvqiyq.mp4': 2,
         '../data/raw/DFDC/REAL/jyoxdvxpza.mp4': 2,
         '../data/raw/DFDC/REAL/kezwvsxxzj.mp4': 2,
         '../data/raw/DFDC/REAL/rmlzgerevr.mp4': 2,
         '../data/raw/DFDC/REAL/wbjtrlyjsm.mp4': 2,
         '../data/raw/DFDC/REAL/uprwuohbwx.mp4': 1,
         '../data/raw/DFDC/REAL/aokxvqadsx.mp4': 1,
         '..

### Preprocessor

Instead of taking a path to a folder, this preprocessor takes the gatherer's dictionary, whose keys are file names and whose values are the number of problematic frames in each file.
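As a rough illustration of what consuming such a mapping could look like, the sketch below visits the listed files from most to least problematic. It is only an assumption about the general idea: `iter_problem_files` and its `min_frames` parameter are hypothetical, and the actual list preprocessors may handle the dictionary differently.

```python
def iter_problem_files(problem_frames, min_frames=1):
    """Yield (path, count) pairs, most problematic file first.

    Hypothetical helper: skips files with fewer than `min_frames`
    problematic frames.
    """
    ordered = sorted(problem_frames.items(), key=lambda kv: kv[1], reverse=True)
    for path, count in ordered:
        if count >= min_frames:
            yield path, count


# Illustrative entries, not real gatherer output.
example = {
    "../data/raw/DFDC/aaxejguth.mp4": 2,
    "../data/raw/DFDC/mniodsvhr.mp4": 6,
}

queue = list(iter_problem_files(example, min_frames=3))
print(queue)
```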

#### MTCNN

In [3]:
from src.preprocessing.mtcnn_preprocessor import MTCNNListPreprocessor
from src.preprocessing.utils import IncompleteGatherer

dataset = "DFDC"
preprocessor = "MTCNN"
classes = ["REAL", "FAKE"]

for c in classes:
    gatherer = IncompleteGatherer("./logs/2024-07-11_07-53-11.log")
    x = gatherer.gather(
        preprocessor=preprocessor,
        dataset=dataset,
        dataclass=c
    )
    
    p = MTCNNListPreprocessor(x)
    p.preprocess(save_to=f"../data/preprocessed/{preprocessor}-{dataset}", cut_amount=0.15)
    print(f"Done preprocessing {preprocessor}-{dataset} : {c}")

Done preprocessing MTCNN-DFDC : REAL
Done preprocessing MTCNN-DFDC : FAKE


In [5]:
dataset = "Combined"
preprocessor = "MTCNN"
classes = ["REAL", "FAKE"]

for c in classes:
    gatherer = IncompleteGatherer("./logs/2024-07-11_07-53-11.log")
    x = gatherer.gather(
        preprocessor=preprocessor,
        dataset=dataset,
        dataclass=c
    )
    
    p = MTCNNListPreprocessor(x)
    p.preprocess(save_to=f"../data/preprocessed/{preprocessor}-{dataset}", cut_amount=0.15)
    print(f"Done preprocessing {preprocessor}-{dataset} : {c}")

Done preprocessing MTCNN-Combined : REAL
Done preprocessing MTCNN-Combined : FAKE


#### Dlib

In [6]:
from src.preprocessing.dlib_preprocessor import DlibListPreprocessor
from src.preprocessing.utils import IncompleteGatherer

dataset = "Celeb-DF-v2"
preprocessor = "Dlib"
classes = ["REAL", "FAKE"]

for c in classes:
    gatherer = IncompleteGatherer("./logs/2024-07-11_07-53-11.log")
    x = gatherer.gather(
        preprocessor=preprocessor,
        dataset=dataset,
        dataclass=c
    )
    
    p = DlibListPreprocessor(x)
    p.preprocess(save_to=f"../data/preprocessed/{preprocessor}-{dataset}", cut_amount=0.15)
    print(f"Done preprocessing {preprocessor}-{dataset} : {c}")

Done preprocessing Dlib-Celeb-DF-v2 : REAL
Done preprocessing Dlib-Celeb-DF-v2 : FAKE


In [7]:
from src.preprocessing.dlib_preprocessor import DlibListPreprocessor
from src.preprocessing.utils import IncompleteGatherer

dataset = "FaceForensics"
preprocessor = "Dlib"
classes = ["REAL"]

for c in classes:
    gatherer = IncompleteGatherer("./logs/2024-07-11_07-53-11.log")
    x = gatherer.gather(
        preprocessor=preprocessor,
        dataset=dataset,
        dataclass=c
    )
    
    p = DlibListPreprocessor(x)
    p.preprocess(save_to=f"../data/preprocessed/{preprocessor}-{dataset}", cut_amount=0.15)
    print(f"Done preprocessing {preprocessor}-{dataset} : {c}")

Done preprocessing Dlib-FaceForensics : REAL


In [8]:
from src.preprocessing.dlib_preprocessor import DlibListPreprocessor
from src.preprocessing.utils import IncompleteGatherer

dataset = "DFDC"
preprocessor = "Dlib"
classes = ["REAL", "FAKE"]

for c in classes:
    gatherer = IncompleteGatherer("./logs/2024-07-11_07-53-11.log")
    x = gatherer.gather(
        preprocessor=preprocessor,
        dataset=dataset,
        dataclass=c
    )
    
    p = DlibListPreprocessor(x)
    p.preprocess(save_to=f"../data/preprocessed/{preprocessor}-{dataset}", cut_amount=0.15)
    print(f"Done preprocessing {preprocessor}-{dataset} : {c}")

Done preprocessing Dlib-DFDC : REAL
Done preprocessing Dlib-DFDC : FAKE


In [10]:
from src.preprocessing.dlib_preprocessor import DlibListPreprocessor
from src.preprocessing.utils import IncompleteGatherer

dataset = "Combined"
preprocessor = "Dlib"
classes = ["REAL", "FAKE"]

for c in classes:
    gatherer = IncompleteGatherer("./logs/2024-07-11_07-53-11.log")
    x = gatherer.gather(
        preprocessor=preprocessor,
        dataset=dataset,
        dataclass=c
    )
    
    p = DlibListPreprocessor(x)
    p.preprocess(save_to=f"../data/preprocessed/{preprocessor}-{dataset}", cut_amount=0.15)
    print(f"Done preprocessing {preprocessor}-{dataset} : {c}")

Done preprocessing Dlib-Combined : REAL
Done preprocessing Dlib-Combined : FAKE
