<a href="https://colab.research.google.com/github/thinothw/DFDS-Final-Project/blob/main/Phase01_Sandbox.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Phase 01 - Setting up the Environment.

In [1]:
# INFRASTRUCTURE - Connect to Drive

from google.colab import drive
import os

print("Requesting Google Drive access...")
drive.mount('/content/drive')

# Double check the project folder exists
base_path = '/content/drive/MyDrive/Deepfake_Honours_Project/Sandbox_10K'
if os.path.exists(base_path):
    print(f" Connection Stable! Project folder found at: {base_path}")
else:
    print(f" Warning: Base path not found. Check your Drive folder name!")

Requesting Google Drive access...
Mounted at /content/drive
 Connection Stable! Project folder found at: /content/drive/MyDrive/Deepfake_Honours_Project/Sandbox_10K


In [2]:
!pip install retina-face opencv-python imagehash

Collecting retina-face
  Downloading retina_face-0.0.17-py3-none-any.whl.metadata (10 kB)
Collecting imagehash
  Downloading ImageHash-4.3.2-py2.py3-none-any.whl.metadata (8.4 kB)
Downloading retina_face-0.0.17-py3-none-any.whl (25 kB)
Downloading ImageHash-4.3.2-py2.py3-none-any.whl (296 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 296.7/296.7 kB 26.9 MB/s eta 0:00:00
Installing collected packages: imagehash, retina-face
Successfully installed imagehash-4.3.2 retina-face-0.0.17


In [2]:
# Clean, compatible install block
!pip uninstall -y numpy -q
!pip install -q numpy==1.26.4
!pip install -q datasets==2.18.0
!pip install -q facenet-pytorch==2.5.3
!pip install -q opencv-python==4.8.0.76

     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 61.0/61.0 kB 2.7 MB/s eta 0:00:00
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 18.0/18.0 MB 50.8 MB/s eta 0:00:00
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
opencv-python 4.13.0.92 requires numpy>=2; python_version >= "3.9", but you have numpy 1.26.4 which is incompatible.
pytensor 2.38.0 requires numpy>=2.0, but you have numpy 1.26.4 which is incompatible.
tobler 0.13.0 requires numpy>=2.0, but you have numpy 1.26.4 which is incompatible.
jaxlib 0.7.2 requires numpy>=2.0, but you have numpy 1.26.4 which is incompatible.
jax 0.7.2 requires numpy>=2.0, but you have n

In [3]:
import sys
import os

print("--- MASTER ENVIRONMENT HEALTH CHECK ---")

# 1. Verify Google Drive Mount
drive_path = '/content/drive/MyDrive'
if os.path.exists(drive_path):
    print(f" Vault Access: Google Drive is securely mounted at {drive_path}")
else:
    print(f" Vault Error: Google Drive NOT Mounted! Run drive.mount('/content/drive')")

# 2. Verify Core Libraries
try:
    import numpy as np
    print(f" NumPy Status: Active (Version {np.__version__} - Native Colab build)")
    import pandas as pd
    print(f" Pandas Status: Active (Version {pd.__version__})")
    import torch
    print(f" PyTorch Status: Active (Version {torch.__version__})")
    import cv2
    print(f" OpenCV Status: Active (Version {cv2.__version__})")
except ImportError as e:
    print(f" Core Library Error: {e}")

# 3. Verify Custom Computer Vision Tools
try:
    import imagehash
    print(f" ImageHash Status: Active (Ready for pHash deduplication)")
except ImportError:
    print(" ImageHash Missing! Run: !pip install imagehash")

try:
    from retinaface import RetinaFace
    print(f" RetinaFace Status: Active (Ready for GPU extraction)")
except ImportError:
    print(" RetinaFace Missing! Run: !pip install retina-face")

print("\n All systems green. The environment is absolutely stable and ready for the 10K Sandbox run.")

--- MASTER ENVIRONMENT HEALTH CHECK ---
 Vault Access: Google Drive is securely mounted at /content/drive/MyDrive
 NumPy Status: Active (Version 2.0.2 - Native Colab build)
 Pandas Status: Active (Version 2.2.2)
 PyTorch Status: Active (Version 2.10.0+cu128)
 OpenCV Status: Active (Version 4.13.0)
 ImageHash Status: Active (Ready for pHash deduplication)
 RetinaFace Status: Active (Ready for GPU extraction)

 All systems green. The environment is absolutely stable and ready for the 10K Sandbox run.
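
The health check above prints its final "all systems green" line whether or not an import failed. A minimal sketch of a variant that records each result and only reports success when nothing is missing is shown below. The helper names `check_modules` and `summarize` are hypothetical and are not part of the notebook.

```python
import importlib


def check_modules(module_names):
    """Try to import each module; map its name to a version string, or None on failure."""
    status = {}
    for name in module_names:
        try:
            module = importlib.import_module(name)
            # Not every module exposes __version__, so fall back to a placeholder.
            status[name] = getattr(module, "__version__", "unknown")
        except ImportError:
            status[name] = None
    return status


def summarize(status):
    """Return a one-line verdict that names every module that failed to import."""
    missing = [name for name, version in status.items() if version is None]
    if missing:
        return "Missing: " + ", ".join(missing)
    return "All checked modules imported."


if __name__ == "__main__":
    # Import names used by the notebook; retina-face is imported as `retinaface`.
    report = check_modules(["numpy", "pandas", "torch", "cv2", "imagehash", "retinaface"])
    for name, version in report.items():
        print(f"{name}: {version if version is not None else 'NOT INSTALLED'}")
    print(summarize(report))
```

Because the verdict is derived from the recorded results, a missing library can no longer be followed by a success message.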


Phase 02 - Downloading Data from OpenFake.

In [None]:
# Phase 2 - Download.
# OpenFake Sandbox Dataset Downloader - 12,000 Images.
# The recorded run finished with 0 save errors.

# 1. Core Imports
import os
import torch
import shutil
from datasets import load_dataset

# 2. GPU Verification
print("\nChecking GPU availability...")
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Device detected: {device}")

# 3. Create High-Speed Local Folders
print("\nSetting up high-speed local storage...")
local_base = '/content/temp_raw'
local_real = os.path.join(local_base, 'real')
local_fake = os.path.join(local_base, 'fake')

os.makedirs(local_real, exist_ok=True)
os.makedirs(local_fake, exist_ok=True)

# 4. Connect to OpenFake Dataset
print("\nConnecting to OpenFake dataset stream...")
dataset_name = "ComplexDataLab/OpenFake"

try:
    openfake = load_dataset(dataset_name, split='train', streaming=True)
    print("Dataset connection successful.")
except Exception as e:
    raise RuntimeError(f"Dataset loading failed: {e}")

# 5. Download Buffer (6,000 Real + 6,000 Fake)
TARGET_BUFFER = 6000
real_count = 0
fake_count = 0
save_errors = 0
max_iterations = 80000
iteration_counter = 0

print(f"\nStarting MASS DOWNLOAD to local storage (Target: {TARGET_BUFFER} per class)...")

for item in openfake:
    iteration_counter += 1

    if iteration_counter > max_iterations:
        print("\n⚠️ Iteration cap exceeded. Stopping download.")
        break

    try:
        label = item['label']
        image = item.get('image', None) # Safely attempt to get the image

        # Safety 1: The Null Check
        if image is None:
            continue

        # Safety 2: Handle Integer Labels
        if isinstance(label, int):
            label = 'real' if label == 0 else 'fake'

        if label not in ['real', 'fake']:
            continue

        # Safety 3: Force RGB format
        if image.mode != 'RGB':
            image = image.convert('RGB')

        if label == 'real' and real_count < TARGET_BUFFER:
            image.save(os.path.join(local_real, f"raw_openfake_real_{real_count}.jpg"))
            real_count += 1
            if real_count % 1000 == 0:
                print(f"  -> Downloaded {real_count} / {TARGET_BUFFER} Real images...")

        elif label == 'fake' and fake_count < TARGET_BUFFER:
            image.save(os.path.join(local_fake, f"raw_openfake_fake_{fake_count}.jpg"))
            fake_count += 1
            if fake_count % 1000 == 0:
                print(f"  -> Downloaded {fake_count} / {TARGET_BUFFER} Fake images...")

        if real_count == TARGET_BUFFER and fake_count == TARGET_BUFFER:
            break

    except Exception as e:
        save_errors += 1
        print(f"Error at iteration {iteration_counter}: {e}")

print("\n=== Download Summary ===")
print(f"Real images saved locally: {real_count}")
print(f"Fake images saved locally: {fake_count}")
print(f"Save errors: {save_errors}")

# 6. Move to Google Drive
print("\nMoving files from local storage to Google Drive. Please wait...")
drive_base = '/content/drive/MyDrive/Deepfake_Honours_Project/Sandbox_10K/raw_data'
drive_real = os.path.join(drive_base, 'real')
drive_fake = os.path.join(drive_base, 'fake')

os.makedirs(drive_real, exist_ok=True)
os.makedirs(drive_fake, exist_ok=True)

# Copy everything over
shutil.copytree(local_real, drive_real, dirs_exist_ok=True)
shutil.copytree(local_fake, drive_fake, dirs_exist_ok=True)

# 7. Purge Local Temp Files
print("Cleaning up temporary local files to free space...")
shutil.rmtree(local_base)

print(" Mass download, Drive transfer, and cleanup completed successfully!")


Checking GPU availability...
Device detected: cpu

Setting up high-speed local storage...

Connecting to OpenFake dataset stream...


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Downloading readme: 0.00B [00:00, ?B/s]

Resolving data files:   0%|          | 0/206 [00:00<?, ?it/s]

Resolving data files:   0%|          | 0/206 [00:00<?, ?it/s]

Dataset connection successful.

Starting MASS DOWNLOAD to local storage (Target: 6000 per class)...
  -> Downloaded 1000 / 6000 Fake images...
  -> Downloaded 1000 / 6000 Real images...
  -> Downloaded 2000 / 6000 Real images...
  -> Downloaded 2000 / 6000 Fake images...




  -> Downloaded 3000 / 6000 Real images...
  -> Downloaded 3000 / 6000 Fake images...
  -> Downloaded 4000 / 6000 Fake images...
  -> Downloaded 4000 / 6000 Real images...
  -> Downloaded 5000 / 6000 Fake images...
  -> Downloaded 5000 / 6000 Real images...
  -> Downloaded 6000 / 6000 Real images...
  -> Downloaded 6000 / 6000 Fake images...

=== Download Summary ===
Real images saved locally: 6000
Fake images saved locally: 6000
Save errors: 0

Moving files from local storage to Google Drive. Please wait...
Cleaning up temporary local files to free space...
 Mass download, Drive transfer, and cleanup completed successfully!
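
The download loop above is a balanced sampler over a stream: it keeps items until each class reaches its quota, skips anything it cannot label, and stops at an iteration cap. The sketch below isolates that logic without the dataset or file I/O. `take_balanced` is a hypothetical helper, and the `0 -> 'real'` label mapping is the same assumption the cell makes about OpenFake's integer labels.

```python
def take_balanced(stream, per_class, max_items):
    """Collect up to `per_class` items for each of 'real' and 'fake'.

    `stream` yields (label, payload) pairs. Integer labels are mapped
    0 -> 'real' and anything else -> 'fake' (an assumption carried over
    from the download cell, not a documented property of the dataset).
    """
    counts = {"real": 0, "fake": 0}
    kept = []
    for seen, (label, payload) in enumerate(stream, start=1):
        if seen > max_items:
            break  # iteration cap, plays the role of max_iterations
        if payload is None:
            continue  # nothing to save
        if isinstance(label, int):
            label = "real" if label == 0 else "fake"
        if label not in counts or counts[label] >= per_class:
            continue  # unknown label, or this class is already full
        kept.append((label, payload))
        counts[label] += 1
        if all(count == per_class for count in counts.values()):
            break  # both quotas reached
    return kept, counts


if __name__ == "__main__":
    demo = [(0, "a"), (1, "b"), (1, "c"), (0, "d"), (0, "e")]
    print(take_balanced(demo, per_class=2, max_items=100))
```

Keeping the quota logic in a small function like this makes it possible to test it on a toy list before pointing it at the real stream.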


Phase 02.1 - Running RetinaFace Bouncer Script on OpenFake Data.

In [None]:
# PHASE 2 - RetinaFace Bouncer V3.1.1
# Load to Colab Local Drive then Offload to Drive.
# Live Updates of the Progress.


import os
import cv2
import numpy as np
import shutil
from retinaface import RetinaFace
from PIL import Image
import warnings
from tqdm import tqdm  # The real time progress bar

warnings.filterwarnings("ignore")

print("Bouncer operating on: GPU (RetinaFace V4.3.2 High-Speed + Live Tracking)\n")

# 1. Define Drive & Local Paths
drive_base = '/content/drive/MyDrive/Deepfake_Honours_Project/Sandbox_10K'
local_base = '/content/temp_workspace'

drive_raw_real = os.path.join(drive_base, 'raw_data/real')
drive_raw_fake = os.path.join(drive_base, 'raw_data/fake')

local_raw_real = os.path.join(local_base, 'raw_data/real')
local_raw_fake = os.path.join(local_base, 'raw_data/fake')

local_hq_real = os.path.join(local_base, 'processed_data/real')
local_hq_fake = os.path.join(local_base, 'processed_data/fake')
local_reject_real = os.path.join(local_base, 'rejected/real')
local_reject_fake = os.path.join(local_base, 'rejected/fake')

# 2. Teleport Raw Data to Local SSD for Speed
print(" Step 1: Teleporting raw data from Google Drive to Local SSD (This takes 1-2 mins)...")
os.makedirs(local_base, exist_ok=True)
shutil.copytree(drive_raw_real, local_raw_real, dirs_exist_ok=True)
shutil.copytree(drive_raw_fake, local_raw_fake, dirs_exist_ok=True)
print(" Local transfer complete! Setting up output folders...")

for path in [local_hq_real, local_hq_fake, local_reject_real, local_reject_fake]:
    os.makedirs(path, exist_ok=True)

# 3. Blur Detection Function
def blur_score(pil_image):
    img = np.array(pil_image)
    gray = cv2.cvtColor(img, cv2.COLOR_RGB2GRAY)
    return cv2.Laplacian(gray, cv2.CV_64F).var()

# 4. RetinaFace Extraction Engine
def hq_crop_engine_v4_3(img_path, save_path, reject_path, padding=25, min_face_size=30, min_face_ratio=0.005, blur_threshold=50, confidence_threshold=0.90):
    try:
        faces = RetinaFace.detect_faces(img_path)
        if not isinstance(faces, dict) or len(faces) == 0:
            return False, "No Face Detected"

        largest_area = 0
        best_face = None
        for face in faces.values():
            box = face.get('facial_area', None)
            if box is None: continue
            area = (box[2] - box[0]) * (box[3] - box[1])
            if area > largest_area:
                largest_area = area
                best_face = face

        if best_face is None: return False, "No Valid Face Found"

        box = best_face['facial_area']
        confidence = best_face['score']
        if confidence < confidence_threshold: return False, f"Low Confidence ({confidence:.2f})"

        img_pil = Image.open(img_path).convert('RGB')
        width, height = img_pil.size
        face_w, face_h = box[2] - box[0], box[3] - box[1]

        if face_w < min_face_size or face_h < min_face_size: return False, "Too Small Pixels"
        if (face_w * face_h) / (width * height) < min_face_ratio: return False, "Too Small Ratio"

        x_min, y_min = max(0, int(box[0]) - padding), max(0, int(box[1]) - padding)
        x_max, y_max = min(width, int(box[2]) + padding), min(height, int(box[3]) + padding)

        cropped_img = img_pil.crop((x_min, y_min, x_max, y_max))
        blur_val = blur_score(cropped_img)

        if blur_val < blur_threshold: return False, "Blurred"

        cropped_img.save(save_path)
        return True, "Success"
    except Exception as e:
        return False, str(e)

# 5. High-Speed Cleaner with Live Progress Bar
def clean_dataset_fast(input_folder, output_folder, reject_folder, label, target_valid=2000):
    print(f"\n---  Processing {label} Folder (Target: {target_valid}) ---")

    # Sort for a reproducible processing order and match extensions case-insensitively
    images = sorted(f for f in os.listdir(input_folder) if f.lower().endswith(('.jpg', '.jpeg', '.png')))
    success, rejected = 0, 0
    reasons = {}

    # The tqdm wrapper creates the live progress bar
    pbar = tqdm(total=target_valid, desc=f"Extracting {label} Faces", unit="face")

    for name in images:
        if success >= target_valid:
            break

        img_path = os.path.join(input_folder, name)
        save_path = os.path.join(output_folder, name)
        reject_path = os.path.join(reject_folder, name)

        ok, msg = hq_crop_engine_v4_3(img_path, save_path, reject_path)

        if ok:
            success += 1
            pbar.update(1) # Ticks the progress bar forward
        else:
            rejected += 1
            reasons[msg] = reasons.get(msg, 0) + 1
            try:
                Image.open(img_path).convert('RGB').save(reject_path)
            except Exception:
                pass  # Best effort: skip rejects that cannot be re-read or saved

    pbar.close()
    print(f"\n STATS for {label}: {success} Valid | {rejected} Rejected")

# 6. Execute the Run
clean_dataset_fast(local_raw_real, local_hq_real, local_reject_real, "REAL", target_valid=2000)
clean_dataset_fast(local_raw_fake, local_hq_fake, local_reject_fake, "FAKE", target_valid=2000)

# 7. Push Final Data Back to Drive
print("\n Step 3: Pushing pristine extracted faces back to Google Drive...")
drive_final_hq = os.path.join(drive_base, 'processed_data_V4_3')
shutil.copytree(os.path.join(local_base, 'processed_data'), drive_final_hq, dirs_exist_ok=True)

# 8. Clean Up
print("🧹 Sweeping temporary local files...")
shutil.rmtree(local_base)

print("\n OpenFake Phase Complete! 4,000 perfectly balanced images locked into your Drive.")

Bouncer operating on: GPU (RetinaFace V4.3.2 High-Speed + Live Tracking)

 Step 1: Teleporting raw data from Google Drive to Local SSD (This takes 1-2 mins)...
 Local transfer complete! Setting up output folders...

---  Processing REAL Folder (Target: 2000) ---


Extracting REAL Faces:   0%|          | 0/2000 [00:00<?, ?face/s]

26-02-27 15:49:41 - Directory /root/.deepface created
26-02-27 15:49:41 - Directory /root/.deepface/weights created
26-02-27 15:49:41 - retinaface.h5 will be downloaded from the url https://github.com/serengil/deepface_models/releases/download/v1.0/retinaface.h5


Downloading...
From: https://github.com/serengil/deepface_models/releases/download/v1.0/retinaface.h5
To: /root/.deepface/weights/retinaface.h5

  0%|          | 0.00/119M [00:00<?, ?B/s]
 22%|██▏       | 25.7M/119M [00:00<00:00, 256MB/s]
 50%|████▉     | 59.2M/119M [00:00<00:00, 302MB/s]
 76%|███████▌  | 89.7M/119M [00:00<00:00, 270MB/s]
100%|██████████| 119M/119M [00:00<00:00, 268MB/s]
Extracting REAL Faces: 100%|██████████| 2000/2000 [20:02<00:00,  1.66face/s]



 STATS for REAL: 2000 Valid | 1240 Rejected

---  Processing FAKE Folder (Target: 2000) ---


Extracting FAKE Faces: 100%|██████████| 2000/2000 [11:24<00:00,  2.92face/s]



 STATS for FAKE: 2000 Valid | 1420 Rejected

 Step 3: Pushing pristine extracted faces back to Google Drive...
🧹 Sweeping temporary local files...

 OpenFake Phase Complete! 4,000 perfectly balanced images locked into your Drive.
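
Two of the geometric steps in `hq_crop_engine_v4_3` can be checked by hand without loading a detector: the size and ratio filter, and the padded crop box clamped to the image bounds. The sketch below reproduces only that arithmetic, with the same default thresholds as the cell. The function names `face_size_ok` and `padded_crop_box` are hypothetical.

```python
def face_size_ok(box, width, height, min_face_size=30, min_face_ratio=0.005):
    """Return True if the face box passes both the pixel-size and the area-ratio filter."""
    x1, y1, x2, y2 = box
    face_w, face_h = x2 - x1, y2 - y1
    if face_w < min_face_size or face_h < min_face_size:
        return False  # too few pixels on at least one side
    return (face_w * face_h) / (width * height) >= min_face_ratio


def padded_crop_box(box, width, height, padding=25):
    """Grow the box by `padding` pixels on every side, clamped to the image bounds."""
    x1, y1, x2, y2 = box
    return (
        max(0, int(x1) - padding),
        max(0, int(y1) - padding),
        min(width, int(x2) + padding),
        min(height, int(y2) + padding),
    )


if __name__ == "__main__":
    # A 100x120 face inside a 120x200 image: the padding is clipped on three sides.
    print(face_size_ok((10, 20, 110, 140), 120, 200))     # True
    print(padded_crop_box((10, 20, 110, 140), 120, 200))  # (0, 0, 120, 165)
```

The clamping matters for faces near the image border, where an unclamped box would extend outside the image.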


Phase 03 - Downloading Data from FF++.

In [None]:
import os
import urllib.request
import subprocess
import shutil
from google.colab import drive
from google.colab import userdata

# Unlock Vault
print("\nAccessing secure vault...")
TUM_URL = userdata.get('TUM_LINK')
if not TUM_URL:
    raise ValueError(" ERROR: 'TUM_LINK' not found in Secrets. Please check the spelling!")

# Fetch Downloader directly from secret TUM link
print("\nFetching official TUM downloader script from your secure link...")
urllib.request.urlretrieve(TUM_URL, "download.py")

# Define Production Paths
temp_base = '/content/ff_temp'
drive_base = '/content/drive/MyDrive/Deepfake_Honours_Project/Sandbox_10K/FF_Plus/raw_videos'

# 10K Sandbox Split: 200 Real, 50 of each Fake (400 Total Videos)
download_targets = {
    'original': 200,
    'Deepfakes': 50,
    'Face2Face': 50,
    'FaceSwap': 50,
    'NeuralTextures': 50
}

print("\n Firing up V3.2 10K Sandbox auto-bypassing download sequence...")

for cat, num_videos in download_targets.items():
    sub_folder = 'real' if cat == 'original' else f'fake/{cat.lower()}'
    temp_path = os.path.join(temp_base, sub_folder)
    drive_path = os.path.join(drive_base, sub_folder)

    shutil.rmtree(temp_path, ignore_errors=True)
    os.makedirs(temp_path, exist_ok=True)
    os.makedirs(drive_path, exist_ok=True)

    print(f"\n Pulling {num_videos} videos for: {cat} to high-speed local storage...")

    # Piping a single newline to stdin presses Enter at the downloader's TOS prompt
    cmd = ['python3', 'download.py', temp_path, '-d', cat, '-c', 'c23', '-n', str(num_videos), '--server', 'EU2']
    result = subprocess.run(cmd, input='\n', text=True)

    if result.returncode != 0:
        print(f" ERROR: Download failed for {cat}. Check server status.")
    else:
        print(f" Download complete. Transferring {cat} to Google Drive...")
        shutil.copytree(temp_path, drive_path, dirs_exist_ok=True)
        print(f" Transfer complete for {cat}.")

# Global Housekeeping: Wipe the temporary master folder from the local SSD
print("\n Sweeping temporary local files...")
shutil.rmtree(temp_base, ignore_errors=True)

print("\n 10K Sandbox FF++ Download Complete! 400 videos securely stored.")


Accessing secure vault...

Fetching official TUM downloader script from your secure link...

 Firing up V3.2 10K Sandbox auto-bypassing download sequence...

 Pulling 200 videos for: original to high-speed local storage...
 Download complete. Transferring original to Google Drive...
 Transfer complete for original.

 Pulling 50 videos for: Deepfakes to high-speed local storage...
 Download complete. Transferring Deepfakes to Google Drive...
 Transfer complete for Deepfakes.

 Pulling 50 videos for: Face2Face to high-speed local storage...
 Download complete. Transferring Face2Face to Google Drive...
 Transfer complete for Face2Face.

 Pulling 50 videos for: FaceSwap to high-speed local storage...
 Download complete. Transferring FaceSwap to Google Drive...
 Transfer complete for FaceSwap.

 Pulling 50 videos for: NeuralTextures to high-speed local storage...
 Download complete. Transferring NeuralTextures to Google Drive...
 Transfer complete for NeuralTextures.

 Sweeping temporary l

Phase 03.1 - Running the Frame Extractor for the FF++ Videos.

In [2]:
import os
import glob
import random
import shutil
import cv2
import hashlib
from tqdm import tqdm

# Lock the random seed for perfect reproducibility
random.seed(42)

# PRODUCTION PATHS: Pointing straight to the 10K Sandbox
drive_raw_base = "/content/drive/MyDrive/Deepfake_Honours_Project/Sandbox_10K/FF_Plus/raw_videos"
drive_extract_base = "/content/drive/MyDrive/Deepfake_Honours_Project/Sandbox_10K/FF_Plus/extracted_frames"
drive_zip_out = "/content/drive/MyDrive/Deepfake_Honours_Project/Sandbox_10K/FF_Plus/extracted_frames.zip"

# LOCAL SSD PATHS for max IO speed
local_temp_base = "/content/local_processing"
local_extract_base = os.path.join(local_temp_base, "extracted_frames")

target_frames_with_buffer = 7
split_ratios = {"train": 0.70, "val": 0.15, "test": 0.15}

os.makedirs(local_temp_base, exist_ok=True)
local_video_path = os.path.join(local_temp_base, "processing_vid.mp4")

# Initialize Traceability Log for the main run
log_file_path = "/content/drive/MyDrive/Deepfake_Honours_Project/Sandbox_10K/FF_Plus/split_log.txt"
with open(log_file_path, "w") as f:
    f.write("FF++ 10K Sandbox: Video Split and Extraction Log\n")
    f.write("================================================\n")

# Tracker for final distribution summary
split_counts = {"train": 0, "val": 0, "test": 0}

def extract_buffered_frames(local_path, original_path, output_folder, category_prefix, n_frames):
    # Read the video from the fast Local SSD
    cap = cv2.VideoCapture(local_path)
    total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))

    # UPGRADE 2: Reject extremely short videos (Threshold: n_frames * 2)
    if total_frames < (n_frames * 2):
        cap.release()
        return 0, total_frames

    step = total_frames // n_frames
    start_offset = random.randint(0, max(0, step - 1))
    target_ids = {start_offset + (i * step) for i in range(n_frames)}

    # Generate the name and hash using the ORIGINAL Google Drive path
    vid_name = os.path.basename(original_path).split('.')[0]
    short_hash = hashlib.md5(original_path.encode()).hexdigest()[:6]

    extracted_count = 0
    current_id = 0

    while cap.isOpened():
        ret = cap.grab()
        if not ret:
            break

        if current_id in target_ids:
            ret, frame = cap.retrieve()
            if ret:
                frame_name = f"{category_prefix}_{vid_name}_{short_hash}_f{extracted_count:03d}.jpg"
                save_path = os.path.join(output_folder, frame_name)
                cv2.imwrite(save_path, frame)
                extracted_count += 1

        current_id += 1
        if extracted_count >= n_frames:
            break

    cap.release()
    return extracted_count, total_frames

print("Initiating reproducible extraction sequence onto local SSD...")
categories = ['real', 'fake/deepfakes', 'fake/face2face', 'fake/faceswap', 'fake/neuraltextures']

total_extracted = 0
skipped_videos = 0

for cat in categories:
    cat_path = os.path.join(drive_raw_base, cat)

    # UPGRADE 1: Sort before shuffle for absolute determinism
    videos = sorted(glob.glob(os.path.join(cat_path, "**", "*.mp4"), recursive=True))

    if not videos:
        continue

    random.shuffle(videos)

    train_split = int(len(videos) * split_ratios["train"])
    val_split = int(len(videos) * (split_ratios["train"] + split_ratios["val"]))

    splits = {
        "train": videos[:train_split],
        "val": videos[train_split:val_split],
        "test": videos[val_split:]
    }

    for split_name, split_videos in splits.items():
        if not split_videos:
            continue

        cat_safe_name = cat.replace('/', '_')
        final_local_folder = os.path.join(local_extract_base, split_name, cat_safe_name)
        os.makedirs(final_local_folder, exist_ok=True)

        pbar = tqdm(total=len(split_videos), desc=f"{split_name.upper()} - {cat_safe_name}", unit="vid")

        with open(log_file_path, "a") as log:
            for video_path in split_videos:
                shutil.copy2(video_path, local_video_path)

                frames_saved, total_found = extract_buffered_frames(local_video_path, video_path, final_local_folder, f"{split_name}_{cat_safe_name}", target_frames_with_buffer)

                if frames_saved == 0:
                    skipped_videos += 1
                    log.write(f"[SKIPPED] {video_path} (Only {total_found} frames - failed diversity threshold)\n")
                else:
                    total_extracted += frames_saved
                    split_counts[split_name] += frames_saved  # UPGRADE 3: Tracking distribution
                    log.write(f"[{split_name.upper()}] {video_path} -> Extracted {frames_saved} frames\n")

                pbar.update(1)

        pbar.close()

print("\nArchiving frames into a single high-speed zip payload...")
# This creates extracted_frames.zip on the local SSD
shutil.make_archive(os.path.join(local_temp_base, "extracted_frames"), 'zip', local_extract_base)

print("Pushing zip payload to Google Drive vault...")
# We only transfer ONE file over the network
shutil.copy2(os.path.join(local_temp_base, "extracted_frames.zip"), drive_zip_out)

print("Sweeping local temporary files...")
shutil.rmtree(local_temp_base, ignore_errors=True)

# UPGRADE 3: Final Distribution Summary
print("\n--- FINAL EXTRACTION SUMMARY ---")
print(f"Total Frames Secured: {total_extracted}")
print(f"   ► TRAIN: {split_counts['train']} frames")
print(f"   ► VAL:   {split_counts['val']} frames")
print(f"   ► TEST:  {split_counts['test']} frames")
print(f"\nSkipped Videos (Too Short): {skipped_videos}")
print("Pipeline complete. Check split_log.txt for full traceability.")

Initiating reproducible extraction sequence onto local SSD...


TRAIN - real: 100%|██████████| 140/140 [04:26<00:00,  1.91s/vid]
VAL - real: 100%|██████████| 30/30 [01:05<00:00,  2.18s/vid]
TEST - real: 100%|██████████| 30/30 [00:58<00:00,  1.96s/vid]
TRAIN - fake_deepfakes: 100%|██████████| 35/35 [01:20<00:00,  2.29s/vid]
VAL - fake_deepfakes: 100%|██████████| 7/7 [00:21<00:00,  3.12s/vid]
TEST - fake_deepfakes: 100%|██████████| 8/8 [00:18<00:00,  2.37s/vid]
TRAIN - fake_face2face: 100%|██████████| 35/35 [01:30<00:00,  2.60s/vid]
VAL - fake_face2face: 100%|██████████| 7/7 [00:15<00:00,  2.21s/vid]
TEST - fake_face2face: 100%|██████████| 8/8 [00:14<00:00,  1.76s/vid]
TRAIN - fake_faceswap: 100%|██████████| 35/35 [01:10<00:00,  2.02s/vid]
VAL - fake_faceswap: 100%|██████████| 7/7 [00:15<00:00,  2.16s/vid]
TEST - fake_faceswap: 100%|‚


Archiving frames into a single high-speed zip payload...
Pushing zip payload to Google Drive vault...
Sweeping local temporary files...

--- FINAL EXTRACTION SUMMARY ---
Total Frames Secured: 2800
   ► TRAIN: 1960 frames
   ► VAL:   406 frames
   ► TEST:  434 frames

Skipped Videos (Too Short): 0
Pipeline complete. Check split_log.txt for full traceability.
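
The extractor makes two decisions per category that are easy to verify in isolation: how a sorted-then-shuffled video list is cut into train/val/test, and which frame indices are sampled from one video. The sketch below reproduces both with the same 70/15/15 ratios and the same "step plus random offset" rule. `split_by_ratio` and `sample_frame_ids` are hypothetical names, and the sketch uses a local `random.Random` instance instead of the notebook's global `random.seed(42)`, so its shuffle order will not match the recorded run.

```python
import random


def split_by_ratio(items, train=0.70, val=0.15, seed=42):
    """Sort, shuffle with a fixed seed, then cut into train/val/test slices."""
    ordered = sorted(items)  # sorting first makes the shuffle independent of listing order
    random.Random(seed).shuffle(ordered)
    train_end = int(len(ordered) * train)
    val_end = int(len(ordered) * (train + val))
    return {
        "train": ordered[:train_end],
        "val": ordered[train_end:val_end],
        "test": ordered[val_end:],
    }


def sample_frame_ids(total_frames, n_frames, rng):
    """Pick n_frames evenly spaced indices, shifted by one random offset within the first step."""
    if total_frames < n_frames * 2:
        return []  # same rejection rule as the extractor: the video is too short
    step = total_frames // n_frames
    start = rng.randint(0, max(0, step - 1))
    return [start + i * step for i in range(n_frames)]


if __name__ == "__main__":
    videos = [f"{i:03d}.mp4" for i in range(50)]
    print({name: len(part) for name, part in split_by_ratio(videos).items()})
    print(sample_frame_ids(300, 7, random.Random(0)))
```

For 50 videos the slices have 35, 7, and 8 entries, which agrees with the per-split progress bars in the output above. The test split is one video larger than val because both cut points are rounded down.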


Phase 03.2 - Running RetinaFace on Extracted Frames.

- BK Tree Library Installer.

In [4]:
!pip install pybktree imagehash

Collecting pybktree
  Downloading pybktree-1.1.tar.gz (4.5 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: pybktree
  Building wheel for pybktree (setup.py) ... done
  Created wheel for pybktree: filename=pybktree-1.1-py3-none-any.whl size=4949 sha256=626532c672bda8a396aeb175bdd803b124ed109058cb6d123eb58cfa5f0128ae
  Stored in directory: /root/.cache/pip/wheels/e0/c0/e9/f03776b415a424272cb3cb1baf27385d100f3ef7eb9bb6553e
Successfully built pybktree
Installing collected packages: pybktree
Successfully installed pybktree-1.1
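
A BK-tree indexes items under a metric so that "everything within distance n" can be found without comparing against every stored item. Each child is filed under its distance to its parent, and the triangle inequality lets a query skip whole subtrees. The sketch below is a tiny pure-Python illustration over integer hashes with Hamming distance. `TinyBKTree` is a hypothetical teaching class, not the `pybktree` implementation, although its `add` and `find` follow the same calling pattern the notebook uses.

```python
def hamming(a, b):
    """Number of differing bits between two integer hashes."""
    return bin(a ^ b).count("1")


class TinyBKTree:
    """Minimal BK-tree: each node is (item, {distance_to_item: child_node})."""

    def __init__(self, distance):
        self.distance = distance
        self.root = None

    def add(self, item):
        if self.root is None:
            self.root = (item, {})
            return
        node = self.root
        while True:
            parent, children = node
            d = self.distance(item, parent)
            if d in children:
                node = children[d]  # descend along the edge labelled d
            else:
                children[d] = (item, {})
                return

    def find(self, item, n):
        """Return sorted (distance, stored_item) pairs with distance <= n."""
        if self.root is None:
            return []
        found, stack = [], [self.root]
        while stack:
            candidate, children = stack.pop()
            d = self.distance(item, candidate)
            if d <= n:
                found.append((d, candidate))
            # Triangle inequality: only edges labelled within [d - n, d + n] can hold matches.
            for edge, child in children.items():
                if d - n <= edge <= d + n:
                    stack.append(child)
        return sorted(found)


if __name__ == "__main__":
    tree = TinyBKTree(hamming)
    for h in (0b0000, 0b1111, 0b0001):
        tree.add(h)
    print(tree.find(0b0011, 1))  # [(1, 1)]
```

In the deduplication cell the stored items are perceptual hashes and the query radius is 2, so a new face is rejected when any previously accepted face differs from it in at most two hash bits.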


In [2]:
import os
import cv2
import numpy as np
import shutil
import glob
import csv
import imagehash
import random
import pybktree
from retinaface import RetinaFace
from PIL import Image
import warnings
from tqdm import tqdm

warnings.filterwarnings("ignore")

# 1. Global Determinism
random.seed(42)
np.random.seed(42)

print("Master Bouncer Operating On: GPU (Zip Pipeline + BK-Tree + Quota Enforcer)\n")

# --- PRODUCTION ZIP PATHS ---
drive_base = '/content/drive/MyDrive/Deepfake_Honours_Project/Sandbox_10K/FF_Plus'
drive_zip_in = os.path.join(drive_base, 'extracted_frames.zip')
drive_zip_out = os.path.join(drive_base, 'processed_faces.zip')
csv_log_path = os.path.join(drive_base, 'extraction_log.csv')

local_base = '/content/temp_workspace'
local_zip_in = os.path.join(local_base, 'extracted_frames.zip')
local_input = os.path.join(local_base, 'input_frames')
local_output = os.path.join(local_base, 'processed_faces')

# --- THE STRICT QUOTA SYSTEM ---
target_quotas = {
    "train_real": 700, "val_real": 150, "test_real": 150,
    "train_fake_deepfakes": 175, "val_fake_deepfakes": 37, "test_fake_deepfakes": 38,
    "train_fake_face2face": 175, "val_fake_face2face": 37, "test_fake_face2face": 38,
    "train_fake_faceswap": 175, "val_fake_faceswap": 37, "test_fake_faceswap": 38,
    "train_fake_neuraltextures": 175, "val_fake_neuraltextures": 37, "test_fake_neuraltextures": 38
}

accepted_counts = {key: 0 for key in target_quotas}
total_rejected = 0

# --- THE ZIP TELEPORTATION PROTOCOL ---
print("Teleporting single zip payload to Local SSD...")
shutil.rmtree(local_base, ignore_errors=True)
os.makedirs(local_base, exist_ok=True)

shutil.copy2(drive_zip_in, local_zip_in)

print("Unpacking payload directly into fast SSD memory...")
shutil.unpack_archive(local_zip_in, local_input)
os.remove(local_zip_in) # Vaporize the zip to free up SSD space

with open(csv_log_path, mode='w', newline='') as file:
    writer = csv.writer(file)
    writer.writerow(["Image_Name", "Split", "Category", "Status", "Reason", "Confidence", "Blur_Score", "pHash", "Bounding_Box"])

# --- THE BK-TREE UPGRADE ---
def hash_distance(hash1, hash2):
    return hash1 - hash2

seen_hashes_tree = pybktree.BKTree(hash_distance)

def master_crop_engine(img_path, save_path, padding=25, min_face_size=30, min_face_ratio=0.005, blur_threshold=25, confidence_threshold=0.90):
    try:
        img_cv = cv2.imread(img_path)
        if img_cv is None:
            return False, "Corrupted Image", 0.0, 0.0, "", "[]"

        height, width = img_cv.shape[:2]

        faces = RetinaFace.detect_faces(img_cv)
        if not isinstance(faces, dict) or len(faces) == 0:
            return False, "No Face Detected", 0.0, 0.0, "", "[]"

        largest_area = 0
        best_face = None
        for face in faces.values():
            box = face.get('facial_area', None)
            if box is None: continue
            area = (box[2] - box[0]) * (box[3] - box[1])
            if area > largest_area:
                largest_area = area
                best_face = face

        if best_face is None:
            return False, "No Valid Face Found", 0.0, 0.0, "", "[]"

        box = best_face['facial_area']
        str_box = f"[{box[0]}, {box[1]}, {box[2]}, {box[3]}]"
        confidence = best_face['score']

        if confidence < confidence_threshold:
            return False, "Low Confidence", confidence, 0.0, "", str_box

        face_w, face_h = box[2] - box[0], box[3] - box[1]

        if face_w < min_face_size or face_h < min_face_size:
            return False, "Resolution Too Small", confidence, 0.0, "", str_box
        if (face_w * face_h) / (width * height) < min_face_ratio:
            return False, "Face Ratio Too Small", confidence, 0.0, "", str_box

        x_min, y_min = max(0, int(box[0]) - padding), max(0, int(box[1]) - padding)
        x_max, y_max = min(width, int(box[2]) + padding), min(height, int(box[3]) + padding)
        cropped_cv = img_cv[y_min:y_max, x_min:x_max]

        gray_crop = cv2.cvtColor(cropped_cv, cv2.COLOR_BGR2GRAY)
        blur_val = cv2.Laplacian(gray_crop, cv2.CV_64F).var()

        if blur_val < blur_threshold:
            return False, "Motion Blur Detected", confidence, blur_val, "", str_box

        cropped_pil = Image.fromarray(cv2.cvtColor(cropped_cv, cv2.COLOR_BGR2RGB))
        standardized_img = cropped_pil.resize((224, 224), Image.BICUBIC)

        new_hash = imagehash.phash(standardized_img)

        matches = seen_hashes_tree.find(new_hash, 2)
        if matches:
            return False, "Duplicate Face (Hamming Dist <= 2)", confidence, blur_val, str(new_hash), str_box

        seen_hashes_tree.add(new_hash)
        standardized_img.save(save_path, format='JPEG', quality=95)

        return True, "Accepted", confidence, blur_val, str(new_hash), str_box

    except Exception as e:
        return False, f"Engine Error: {str(e)}", 0.0, 0.0, "", "[]"

print("Firing up RetinaFace Deduplication Engine...")
image_files = glob.glob(os.path.join(local_input, "**", "*.jpg"), recursive=True)

if not image_files:
    print("ERROR: No images found in local storage.")
else:
    random.shuffle(image_files)

    pbar = tqdm(total=sum(target_quotas.values()), desc="Securing Quota", unit="face")

    with open(csv_log_path, mode='a', newline='') as log_file:
        csv_writer = csv.writer(log_file)

        for img_path in image_files:
            # --- THE STRING SAFETY UPGRADE (Bottom-Up reading for Zip compatibility) ---
            parent_dir = os.path.dirname(img_path)
            cat_name_raw = os.path.basename(parent_dir)
            split_name_raw = os.path.basename(os.path.dirname(parent_dir))

            split_name = split_name_raw.lower().strip()
            cat_name = cat_name_raw.lower().strip()

            bucket_key = f"{split_name}_{cat_name}"

            if bucket_key in target_quotas and accepted_counts.get(bucket_key, 0) >= target_quotas[bucket_key]:
                continue

            # Ensure we maintain structure in the output folder
            out_folder = os.path.join(local_output, split_name, cat_name)
            os.makedirs(out_folder, exist_ok=True)

            file_name = os.path.basename(img_path)
            save_path = os.path.join(out_folder, file_name)

            passed, msg, conf, blur, phash_val, bbox = master_crop_engine(img_path, save_path)

            if passed:
                if bucket_key in accepted_counts:
                    accepted_counts[bucket_key] += 1
                csv_writer.writerow([file_name, split_name, cat_name, "Accepted", "None", round(conf, 4), round(blur, 2), phash_val, bbox])
                pbar.update(1)
            else:
                total_rejected += 1
                csv_writer.writerow([file_name, split_name, cat_name, "Rejected", msg, round(conf, 4), round(blur, 2), phash_val, bbox])

            # .get() avoids a KeyError if a quota bucket has no counter yet
            if all(accepted_counts.get(k, 0) >= target_quotas[k] for k in target_quotas):
                print("\nAll target quotas achieved early! Shutting down Bouncer.")
                break

    pbar.close()

    print("\n--- FINAL BOUNCER SUMMARY ---")
    total_secured = sum(accepted_counts.values())
    print(f"Total Pristine Faces Secured: {total_secured} / {sum(target_quotas.values())}")
    print(f"Total Frames Rejected: {total_rejected}")

    print("\n[REAL CATEGORY STATS]")
    print(f"  ► Train: {accepted_counts.get('train_real', 0)}/{target_quotas['train_real']}")
    print(f"  ► Val:   {accepted_counts.get('val_real', 0)}/{target_quotas['val_real']}")
    print(f"  ► Test:  {accepted_counts.get('test_real', 0)}/{target_quotas['test_real']}")

    print("\n[FAKE SUBCATEGORY STATS (Target per split: 175 Train / 37 Val / 38 Test)]")
    fake_cats = ["fake_deepfakes", "fake_face2face", "fake_faceswap", "fake_neuraltextures"]
    for f_cat in fake_cats:
        train_hit = accepted_counts.get(f'train_{f_cat}', 0)
        val_hit = accepted_counts.get(f'val_{f_cat}', 0)
        test_hit = accepted_counts.get(f'test_{f_cat}', 0)
        print(f"  ► {f_cat}: Train({train_hit}) | Val({val_hit}) | Test({test_hit})")

print("\nArchiving pristine faces into a single zip payload...")
shutil.make_archive(os.path.join(local_base, "processed_faces"), 'zip', local_output)

print("Pushing standardized zip dataset and CSV logs back to Google Drive...")
shutil.copy2(os.path.join(local_base, "processed_faces.zip"), drive_zip_out)
shutil.rmtree(local_base, ignore_errors=True)

print("Pipeline Complete! Check extraction_log.csv for full traceability.")

Master Bouncer Operating On: GPU (BK-Tree pHash + Quota Enforcer + SSD Only)

Teleporting buffered frames to Local SSD (Bypassing FUSE bottleneck)...
Firing up RetinaFace Deduplication Engine...


Securing Quota:   0%|          | 0/2000 [00:00<?, ?face/s]

26-03-01 21:51:07 - Directory /root/.deepface created
26-03-01 21:51:07 - Directory /root/.deepface/weights created
26-03-01 21:51:07 - retinaface.h5 will be downloaded from the url https://github.com/serengil/deepface_models/releases/download/v1.0/retinaface.h5


Downloading...
From: https://github.com/serengil/deepface_models/releases/download/v1.0/retinaface.h5
To: /root/.deepface/weights/retinaface.h5

  0%|          | 0.00/119M [00:00<?, ?B/s]
 18%|█▊        | 21.5M/119M [00:00<00:00, 209MB/s]
 36%|███▌      | 42.5M/119M [00:00<00:00, 192MB/s]
 71%|███████   | 84.4M/119M [00:00<00:00, 253MB/s]
100%|██████████| 119M/119M [00:00<00:00, 165MB/s]
Securing Quota:   5%|▍         | 94/2000 [00:38<05:09,  6.16face/s]

KeyboardInterrupt: 
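
The duplicate filter in the cell above rejects a crop when its pHash lies within Hamming distance 2 of a hash that was already accepted. As a rough sanity check on a finished log, the same distance can be recomputed from the hex strings written to the pHash column via `str(new_hash)`. The sketch below uses only the standard library. The helper names `hamming_hex` and `near_duplicates` and the sample hashes are hypothetical, and the pairwise scan is a brute-force stand-in for the BK-tree, so it is only practical for small logs.

```python
from itertools import combinations

def hamming_hex(hash_a: str, hash_b: str) -> int:
    """Number of differing bits between two equal-length hex hash strings."""
    if len(hash_a) != len(hash_b):
        raise ValueError("hashes must have the same length")
    return bin(int(hash_a, 16) ^ int(hash_b, 16)).count("1")

def near_duplicates(hashes, max_distance=2):
    """Return (i, j, distance) for every pair within max_distance bits."""
    return [
        (i, j, d)
        for (i, a), (j, b) in combinations(enumerate(hashes), 2)
        if (d := hamming_hex(a, b)) <= max_distance
    ]

# Made-up 64-bit hashes for illustration only (not taken from the log)
sample = ["ffd8a1c3007e3c18", "ffd8a1c3007e3c19", "0123456789abcdef"]
pairs = near_duplicates(sample)
print(pairs)
```

If the accepted rows of the log are passed in, an empty result would be consistent with the engine's threshold of 2; any pair returned would point at a crop that slipped past the filter.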

- Stats from RetinaFace Extraction

In [None]:
import pandas as pd

csv_path = '/content/drive/MyDrive/Deepfake_Honours_Project/Sandbox_10K/FF_Plus/Test_Run/extraction_log_v2.csv'
df = pd.read_csv(csv_path)

print("--- BOUNCER REJECTION BREAKDOWN ---")
rejections = df[df['Status'] == 'Rejected']
print(rejections['Reason'].value_counts())

--- BOUNCER REJECTION BREAKDOWN ---
Reason
Duplicate Face (Hamming Dist <= 2)    14
Motion Blur Detected                   3
Name: count, dtype: int64
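
The rejection breakdown can be complemented with acceptance counts per split and category, which is what the quotas are defined on. A minimal sketch follows, assuming the log keeps the `Split`, `Category` and `Status` columns from the header row written by the extraction cell. The function name `acceptance_by_bucket` and the in-memory sample rows are hypothetical and are not real results. It uses the standard `csv` module so it also runs outside Colab.

```python
import csv
import io
from collections import Counter

def acceptance_by_bucket(rows):
    """Tally accepted and total rows per (Split, Category) bucket."""
    accepted, total = Counter(), Counter()
    for row in rows:
        bucket = (row["Split"], row["Category"])
        total[bucket] += 1
        if row["Status"] == "Accepted":
            accepted[bucket] += 1
    return {bucket: (accepted[bucket], total[bucket]) for bucket in total}

# Hypothetical rows that reuse the log's column names; not real results
sample_log = io.StringIO(
    "Image_Name,Split,Category,Status,Reason\n"
    "a.jpg,train,real,Accepted,None\n"
    "b.jpg,train,real,Rejected,Motion Blur Detected\n"
    "c.jpg,val,fake_faceswap,Accepted,None\n"
)
summary = acceptance_by_bucket(csv.DictReader(sample_log))
for (split, category), (ok, seen) in sorted(summary.items()):
    print(f"{split}/{category}: {ok}/{seen} accepted")
```

For the real log, an open file handle for the CSV can be passed to `csv.DictReader` in place of `sample_log`.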
