## Partitioning

We test several dataset scenarios, not just one. Rather than rerunning the preprocessing step ([1. Sampling Based](), [2. Sequential Based]()) for every scenario, it is better to copy only the frames each scenario needs from the data that has already been preprocessed.

For example, if the base dataset has 10 frames per video and another scenario uses 2 frames per video, we only need to copy two of the existing frames for each video and save them to another directory. The code in this notebook picks those frames at random with a fixed seed.
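The idea can be tried on a throwaway directory first. The following is a minimal sketch under stated assumptions, not the notebook's implementation: the helper `copy_first_frames`, the video names, and the `<video>_<index>.jpg` naming are all hypothetical, and it simply takes the first `n` frames in sorted order, whereas the notebook code samples frames at random.

```python
import shutil
import tempfile
from pathlib import Path


def copy_first_frames(base_dir: Path, out_dir: Path, videos: list[str], n: int) -> list[str]:
    """Copy the first n frames (in sorted name order) of each video into out_dir."""
    out_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for video in videos:
        # Assumed naming scheme: <video>_<index>.jpg
        frames = sorted(base_dir.glob(f"{video}_*.jpg"))
        for frame in frames[:n]:
            shutil.copy(frame, out_dir)
            copied.append(frame.name)
    return copied


with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp) / "base"
    base.mkdir()
    videos = ["video_a", "video_b"]  # hypothetical video names

    # Base scenario: 10 (empty placeholder) frames per video.
    for video in videos:
        for i in range(10):
            (base / f"{video}_{i:02d}.jpg").touch()

    # Smaller scenario: 2 frames per video, copied to a separate directory.
    copied = copy_first_frames(base, Path(tmp) / "subset", videos, n=2)
    subset_names = sorted(p.name for p in (Path(tmp) / "subset").iterdir())
```

The base directory is left untouched, so further scenarios (3, 5, 7 frames, and so on) can be derived from it in the same way.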

> Note: because this task is simple, all of the code is included in this notebook.

In [1]:
import re
import os
import shutil
import random

In [2]:
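def list_matches(source: str, target: str, max_allowed: int, seed: int) -> list[str]:
    """Pick up to `max_allowed` frames in `target` for each video in `source`."""
    random.seed(seed)

    # os.listdir returns entries in arbitrary order; sort so that the seed
    # yields the same selection on every run.
    src_files = sorted(os.listdir(source))
    target_files = sorted(os.listdir(target))

    eligibles = []
    for srcf in src_files:
        # Escape the file name so characters such as "." match literally.
        r = re.compile(rf".*{re.escape(srcf)}")
        matched_frames = list(filter(r.match, target_files))
        # random.sample raises ValueError when asked for more items than exist,
        # so cap the request at the number of frames found for this video.
        eligibles.append(
            random.sample(matched_frames, min(max_allowed, len(matched_frames)))
        )

    return [os.path.join(target, item) for row in eligibles for item in row]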
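Seeding the generator is not quite enough for a repeatable selection: `os.listdir` returns entries in arbitrary order, and `random.sample` picks by position, so the same seed can select different frames if the listing order changes. The sketch below isolates that idea. `reproducible_sample` is a hypothetical helper, not part of the notebook; it sorts the names first, uses a local `random.Random` instance, and caps the sample size so that a video with fewer frames than requested does not raise `ValueError`.

```python
import random


def reproducible_sample(names, k, seed):
    """Sample up to k names; the result depends only on the set of names and the seed."""
    rng = random.Random(seed)   # local generator, global random state is untouched
    ordered = sorted(names)     # remove any dependence on the input order
    return rng.sample(ordered, min(k, len(ordered)))


# Same names in a different order give the same selection.
a = reproducible_sample(["f3.jpg", "f1.jpg", "f2.jpg"], 2, seed=42)
b = reproducible_sample(["f2.jpg", "f3.jpg", "f1.jpg"], 2, seed=42)

# Fewer names than requested: everything available is returned.
short = reproducible_sample(["f1.jpg"], 2, seed=42)
```

Using a local generator also means the selection is unaffected by other code that draws from the global `random` module.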

In [3]:
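# Example:
# list_matches("../data/raw/Celeb-DF-v2/REAL", "../data/preprocessed/MTCNN-Celeb-DF-v2/REAL", 2, seed=42)
def copy_matches(matches: list[str], target: str):
    """Copy every file in `matches` into the `target` directory."""
    # Create the destination (and any missing parent directories) if needed.
    os.makedirs(target, exist_ok=True)

    for i in matches:
        shutil.copy(i, target)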

In [4]:
# Frames-per-video settings to generate
max_allowed_nums = [2, 3, 5, 7]
classes = ["REAL", "FAKE"]
video_sources = [
    "../data/raw/Celeb-DF-v2",
    "../data/raw/DFDC",
    "../data/raw/FaceForensics",
    "../data/raw/Combined"
]

match_check_targets = [
    "../data/preprocessed/MTCNN-Celeb-DF-v2",
    "../data/preprocessed/MTCNN-DFDC",
    "../data/preprocessed/MTCNN-FaceForensics",
    "../data/preprocessed/MTCNN-Combined",
]

In [5]:
for maxa in max_allowed_nums:
    # One output directory per dataset, e.g. Spl2-MTCNN-DFDC for 2 frames per video.
    cp_targets = [
        f"../data/preprocessed/Spl{maxa}-MTCNN-Celeb-DF-v2",
        f"../data/preprocessed/Spl{maxa}-MTCNN-DFDC",
        f"../data/preprocessed/Spl{maxa}-MTCNN-FaceForensics",
        f"../data/preprocessed/Spl{maxa}-MTCNN-Combined",
    ]

    # The three lists are parallel: entries at the same position belong
    # to the same dataset.
    for video_source, check_target, cp_target in zip(
        video_sources, match_check_targets, cp_targets
    ):
        for c in classes:
            matches = list_matches(
                source=f"{video_source}/{c}",
                target=f"{check_target}/{c}",
                max_allowed=maxa,
                seed=42,
            )
            copy_matches(matches, f"{cp_target}/{c}")
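After the copy loop finishes, it may be worth checking that every video ended up with a sensible number of frames in each partition. The following is a minimal sketch, not part of the original notebook: `frames_per_video` and `check_partition` are hypothetical helpers, the file names in the demo are made up, and the substring test assumes (as the regex in `list_matches` does) that each frame's file name contains the name of its source video.

```python
import os
import tempfile


def frames_per_video(video_names, frame_names):
    """Count how many frame names contain each video name."""
    counts = {video: 0 for video in video_names}
    for frame in frame_names:
        for video in video_names:
            if video in frame:
                counts[video] += 1
    return counts


def check_partition(source_dir, partition_dir, max_allowed):
    """Return videos with no frames, or with more than max_allowed frames."""
    counts = frames_per_video(os.listdir(source_dir), os.listdir(partition_dir))
    return {video: n for video, n in counts.items() if n == 0 or n > max_allowed}


# Demo on a temporary directory with made-up file names.
with tempfile.TemporaryDirectory() as tmp:
    source = os.path.join(tmp, "raw")
    partition = os.path.join(tmp, "Spl2")
    os.makedirs(source)
    os.makedirs(partition)

    for video in ["a.mp4", "b.mp4"]:
        open(os.path.join(source, video), "w").close()
    # Only video a.mp4 has frames in the partition.
    for frame in ["0_a.mp4.jpg", "1_a.mp4.jpg"]:
        open(os.path.join(partition, frame), "w").close()

    problems = check_partition(source, partition, max_allowed=2)
```

An empty result means every video has between 1 and `max_allowed` frames; here `b.mp4` is reported because no frame was copied for it.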