# FF++ Pipeline Audit - Step 7 & Step 8 Verification

This notebook verifies:
- **Step 7**: Constrained batching (method mixing, video_id anti-correlation)
- **Step 8**: Deployment realism augmentations
- **Lazy caching**: Cache hit/miss rates

## 1. Clone Repository

In [None]:
!git clone https://github.com/Incharajayaram/Team-Converge.git /content/Team-Converge
%cd /content/Team-Converge/Finetune1

## 2. Install Dependencies

In [None]:
!pip install -q mediapipe pyyaml tqdm

## 3. Upload FF++ Data (ZIP file)

Upload your `ffpp_data.zip` file when prompted. The zip should contain the raw FF++ videos.

In [None]:
from google.colab import files
import os

print("Upload your ffpp_data.zip file...")
uploaded = files.upload()

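# files.upload() returns a dict keyed by filename; fail early if nothing was selected
if not uploaded:
    raise RuntimeError("No file uploaded; re-run this cell and select ffpp_data.zip")
zip_filename = next(iter(uploaded))
print(f"Uploaded: {zip_filename}")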

In [None]:
# Extract to /content/data/raw/ffpp
!mkdir -p /content/data/raw/ffpp
!unzip -q "{zip_filename}" -d /content/data/raw/ffpp
!ls -la /content/data/raw/ffpp | head -20
# os.listdir is not recursive, so this counts top-level entries only
print(f"\nTop-level entries: {len(os.listdir('/content/data/raw/ffpp'))}")
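`os.listdir` only sees the top level of the extraction directory, so a zip that unpacks into nested folders reports a small number. A recursive count by extension gives a better sanity check. The sketch below is illustrative: `count_videos` and the extension set in `VIDEO_EXTS` are assumptions, not part of the repository.

```python
import os

# Assumed set of video extensions; adjust to match your copy of FF++.
VIDEO_EXTS = {".mp4", ".avi", ".mov"}

def count_videos(root, exts=VIDEO_EXTS):
    """Return {extension: count} for video files found anywhere under root."""
    counts = {}
    for _dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            ext = os.path.splitext(name)[1].lower()
            if ext in exts:
                counts[ext] = counts.get(ext, 0) + 1
    return counts

# Only run against the Colab path if the extraction step has already happened.
if os.path.isdir("/content/data/raw/ffpp"):
    print(count_videos("/content/data/raw/ffpp"))
```

An empty result most likely means the zip extracted somewhere unexpected or uses an extension missing from `VIDEO_EXTS`.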

## 4. Run Audit Mode (300 steps)

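This run of `train.py` covers 300 steps (`--audit_steps 300`) and will:
- Validate Step 7 batch constraints every 25 batches (`--audit_every 25`)
- Dump 32 augmented samples for visual inspection (`--dump_aug 32`)
- Track cache hit/miss rates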

In [None]:
!python train.py --config config.yaml \
    --override dataset.ffpp_root=/content/data/raw/ffpp \
    --override caching.cache_dir=/content/cache/faces \
    --audit_steps 300 \
    --audit_every 25 \
    --dump_aug 32
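This notebook does not spell out the batch-level rules that the audit enforces. As a rough sketch of what a Step 7 check could look like, the code below makes three assumptions: each sample carries a `method` and a `video_id`, a batch needs at least `min_methods` distinct methods, and no `video_id` may repeat within a batch. `check_batch` is hypothetical and not a function from `train.py`.

```python
from collections import Counter

def check_batch(samples, min_methods=2):
    """Check one batch; samples is a list of dicts with 'method' and 'video_id'."""
    methods = {s["method"] for s in samples}
    id_counts = Counter(s["video_id"] for s in samples)
    # Every extra occurrence of a video_id counts as one violation.
    video_violations = sum(n - 1 for n in id_counts.values() if n > 1)
    method_mixing_ok = len(methods) >= min_methods
    return {
        "method_mixing_ok": method_mixing_ok,
        "video_violations": video_violations,
        "valid": method_mixing_ok and video_violations == 0,
    }

# Toy batch with made-up IDs: methods are mixed, but video "000" appears twice.
batch = [
    {"method": "Deepfakes", "video_id": "000"},
    {"method": "Face2Face", "video_id": "001"},
    {"method": "original", "video_id": "000"},
]
print(check_batch(batch))
```

Counting violations per batch, rather than stopping at the first one, keeps the totals comparable across batches of different sizes.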

## 5. View Audit Report

In [None]:
import json

# Load the report produced by the audit run in section 4
with open('artifacts/reports/audit_report.json') as f:
    report = json.load(f)

print("=" * 50)
print("AUDIT REPORT SUMMARY")
print("=" * 50)
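# Guard against an empty report so the percentage does not divide by zero
total = report['total_batches']
valid_pct = 100 * report['valid_batches'] / total if total else 0.0
print(f"Total batches: {total}")
print(f"Valid batches: {report['valid_batches']} ({valid_pct:.1f}%)")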
print(f"Method mixing OK: {report['method_mixing_ok']}")
print(f"Video ID violations: {report['video_violations']}")
print(f"Group ID violations: {report['group_violations']}")
print(f"Group relaxed (fallback): {report['group_relaxed']}")
print(f"\nCache hit rate: {report['cache_hit_rate']:.2%}")
print(f"Avg data_time: {report['avg_data_time']:.3f}s")
print(f"Avg step_time: {report['avg_step_time']:.3f}s")
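To turn the summary into a pass/fail signal, for example before launching a full training run, a check along these lines may help. It reads the same report keys as the cell above and assumes `video_violations` is an integer count and `cache_hit_rate` a fraction between 0 and 1. The thresholds `min_valid_frac` and `min_hit_rate` are illustrative assumptions, not values taken from the pipeline.

```python
def audit_problems(report, min_valid_frac=0.95, min_hit_rate=0.5):
    """Return a list of human-readable problems; an empty list means the audit passed."""
    total = report["total_batches"]
    if total == 0:
        return ["no batches were audited"]
    problems = []
    valid_frac = report["valid_batches"] / total
    if valid_frac < min_valid_frac:
        problems.append(f"only {valid_frac:.1%} of batches valid")
    if report["video_violations"] > 0:
        problems.append(f"{report['video_violations']} video ID violations")
    if report["cache_hit_rate"] < min_hit_rate:
        problems.append(f"cache hit rate {report['cache_hit_rate']:.1%} below threshold")
    return problems

# Made-up values for illustration only; not results from a real run.
example = {"total_batches": 10, "valid_batches": 8,
           "video_violations": 2, "cache_hit_rate": 0.9}
for problem in audit_problems(example):
    print("FAIL:", problem)
```

On a first run against an empty cache the hit rate will be low by construction, so consider passing `min_hit_rate=0.0` in that case.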

## 6. View Augmented Samples (Step 8 Visual Check)

In [None]:
import matplotlib.pyplot as plt
from PIL import Image
import os

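aug_dir = 'artifacts/aug_debug'
# Use a distinct name so the google.colab `files` module from section 3 is not shadowed
aug_files = sorted(os.listdir(aug_dir))[:16]

fig, axes = plt.subplots(4, 4, figsize=(16, 16))
for ax in axes.flat:
    ax.axis('off')  # hide every panel up front so unused ones stay blank
for ax, fname in zip(axes.flat, aug_files):
    img = Image.open(os.path.join(aug_dir, fname))
    ax.imshow(img)
    ax.set_title(fname[:30], fontsize=8)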
plt.tight_layout()
plt.savefig('artifacts/aug_grid.png', dpi=150)
plt.show()

print(f"\nTotal augmented samples: {len(os.listdir(aug_dir))}")

## 7. Check Cache

In [None]:
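cache_dir = '/content/cache/faces/train'
if os.path.exists(cache_dir):
    # Distinct name so the google.colab `files` module is not shadowed
    cached = os.listdir(cache_dir)
    print(f"Cached face crops: {len(cached)}")
    # Average over at most the first 100 entries; skip when the cache is empty
    sizes = [os.path.getsize(os.path.join(cache_dir, f)) for f in cached[:100]]
    if sizes:
        print(f"Avg size (first {len(sizes)} files): {sum(sizes)/len(sizes)/1024:.1f} KB")
else:
    print("No cache yet")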
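For context on what the hit rate measures: a lazy cache computes an item only on first request and serves it from disk afterwards. The sketch below is a minimal illustration of that idea. `LazyCache`, its `.bin` file naming, and the `compute` callback are hypothetical stand-ins, not the pipeline's own classes.

```python
import os
import tempfile

class LazyCache:
    """Disk-backed cache that fills entries on first access and counts hits/misses."""

    def __init__(self, cache_dir):
        self.cache_dir = cache_dir
        self.hits = 0
        self.misses = 0
        os.makedirs(cache_dir, exist_ok=True)

    def get(self, key, compute):
        path = os.path.join(self.cache_dir, f"{key}.bin")
        if os.path.exists(path):
            self.hits += 1
            with open(path, "rb") as f:
                return f.read()
        self.misses += 1
        data = compute(key)  # stands in for the expensive step, e.g. face cropping
        with open(path, "wb") as f:
            f.write(data)
        return data

    @property
    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0

# Four requests over two distinct keys: the first access to each key is a miss.
cache = LazyCache(tempfile.mkdtemp())
for key in ["a", "b", "a", "a"]:
    cache.get(key, lambda k: k.encode())
print(f"hits={cache.hits} misses={cache.misses} hit_rate={cache.hit_rate:.0%}")
```

Under this scheme every first access is a miss, so a low hit rate early in a short audit run is not necessarily a problem. The rate should climb once samples start repeating.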