# FF++ Pipeline Audit - Step 7 & Step 8 Verification

This notebook verifies:
- **Step 7**: Constrained batching (method mixing, video_id anti-correlation)
- **Step 8**: Deployment realism augmentations
- **Lazy caching**: Cache hit/miss rates
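
As a rough illustration of what the Step 7 checks mean, the sketch below validates one batch under two assumed rules: at least two manipulation methods appear in the batch, and no `video_id` appears more than once. The function name `check_batch`, the `min_methods` threshold, and the per-sample dictionary layout are hypothetical, so the repository's actual checks may differ.

```python
from collections import Counter


def check_batch(samples, min_methods=2):
    """Check one batch against two assumed Step 7 rules.

    samples: list of dicts with 'method' and 'video_id' keys (hypothetical layout).
    """
    # Rule 1 (assumed): the batch mixes at least `min_methods` distinct methods.
    methods = {s["method"] for s in samples}
    # Rule 2 (assumed): no video contributes more than one sample to the batch.
    video_counts = Counter(s["video_id"] for s in samples)
    repeated = sorted(v for v, n in video_counts.items() if n > 1)
    return {
        "method_mixing_ok": len(methods) >= min_methods,
        "video_violations": len(repeated),
        "repeated_video_ids": repeated,
    }


# Toy batch: video "000" appears twice, so one violation is reported.
batch = [
    {"method": "Deepfakes", "video_id": "000"},
    {"method": "Face2Face", "video_id": "001"},
    {"method": "original", "video_id": "000"},
]
result = check_batch(batch)
print(result)
```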

## 1. Mount Google Drive

In [None]:
from google.colab import drive
drive.mount('/content/drive')

## 2. Clone Repository

In [None]:
!git clone https://github.com/Incharajayaram/Team-Converge.git /content/Team-Converge
%cd /content/Team-Converge/Finetune1

## 3. Install Dependencies

In [None]:
!pip install -q mediapipe pyyaml tqdm

## 4. Copy FF++ Data to Local Disk

⚠️ **Update the source path** to match your Drive location!

In [None]:
# UPDATE THIS PATH to your actual FF++ location on Drive
DRIVE_FFPP_PATH = "/content/drive/MyDrive/data/raw/ffpp"

!python copy_to_local.py \
    --source {DRIVE_FFPP_PATH} \
    --dest /content/data/raw/ffpp

## 5. Run Audit Mode (300 steps)

In audit mode, the training script will:
- Validate Step 7 batch constraints every 25 batches
- Dump 32 augmented samples for visual inspection
- Track cache hit/miss rates
- Generate `artifacts/reports/audit_report.json`
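
To make the cache hit/miss figure concrete, here is a minimal sketch of how a lazily filled cache could count its lookups. `CacheStats` and `get_or_compute` are hypothetical names used only for illustration and are not taken from the repository. An in-memory dict stands in for the on-disk face-crop cache.

```python
class CacheStats:
    """Counts lookups against a lazily filled cache (hypothetical helper)."""

    def __init__(self):
        self.hits = 0
        self.misses = 0

    @property
    def hit_rate(self):
        total = self.hits + self.misses
        # Report 0.0 instead of dividing by zero when nothing was looked up.
        return self.hits / total if total else 0.0


def get_or_compute(cache, key, compute, stats):
    """Return cache[key], computing and storing the value on a miss."""
    if key in cache:
        stats.hits += 1
    else:
        stats.misses += 1
        cache[key] = compute(key)
    return cache[key]


stats = CacheStats()
cache = {}
# Two passes over the same three keys: the first pass misses, the second hits.
for _ in range(2):
    for key in ["a", "b", "c"]:
        get_or_compute(cache, key, str.upper, stats)

print(f"hits={stats.hits} misses={stats.misses} hit_rate={stats.hit_rate:.2%}")
```

With a lazy cache of this kind, a low hit rate on a first run over an empty cache is normal, and the rate should climb once the same samples are requested again.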

In [None]:
!python train.py --config config.yaml \
    --override dataset.ffpp_root=/content/data/raw/ffpp \
    --override caching.cache_dir=/content/cache/faces \
    --audit_steps 300 \
    --audit_every 25 \
    --dump_aug 32

## 6. View Audit Report

In [None]:
import json
with open('artifacts/reports/audit_report.json') as f:
    report = json.load(f)
    
print("=" * 50)
print("AUDIT REPORT SUMMARY")
print("=" * 50)
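total = report['total_batches']
# Avoid a ZeroDivisionError if the audit recorded no batches
valid_pct = 100 * report['valid_batches'] / total if total else 0.0
print(f"Total batches: {total}")
print(f"Valid batches: {report['valid_batches']} ({valid_pct:.1f}%)")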
print(f"Method mixing OK: {report['method_mixing_ok']}")
print(f"Video ID violations: {report['video_violations']}")
print(f"Group ID violations: {report['group_violations']}")
print(f"Group relaxed (fallback): {report['group_relaxed']}")
print(f"\nCache hit rate: {report['cache_hit_rate']:.2%}")
print(f"Avg data_time: {report['avg_data_time']:.3f}s")
print(f"Avg step_time: {report['avg_step_time']:.3f}s")
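
The summary above can also be turned into an explicit pass/fail decision. The sketch below is one possible gate, assuming the violation fields are integer counts and that every batch must be valid. The helper name `audit_passed`, the `min_valid_fraction` threshold, and the example values are hypothetical; only the report keys come from the cell above.

```python
def audit_passed(report, min_valid_fraction=1.0):
    """Return (ok, problems) for an audit report dict (hypothetical gate)."""
    total = report.get("total_batches", 0)
    if total == 0:
        return False, ["no batches were audited"]

    problems = []
    valid_fraction = report.get("valid_batches", 0) / total
    if valid_fraction < min_valid_fraction:
        problems.append(f"only {valid_fraction:.1%} of batches were valid")
    if not report.get("method_mixing_ok", False):
        problems.append("method mixing check failed")
    # Assumption: these fields hold integer violation counts.
    for key in ("video_violations", "group_violations"):
        if report.get(key, 0):
            problems.append(f"{key} = {report[key]}")
    return not problems, problems


# Made-up values for illustration only, not real audit output.
example_report = {
    "total_batches": 12,
    "valid_batches": 11,
    "method_mixing_ok": True,
    "video_violations": 1,
    "group_violations": 0,
}
ok, problems = audit_passed(example_report)
print("PASS" if ok else "FAIL", problems)
```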

## 7. View Augmented Samples (Step 8 Visual Check)

In [None]:
import matplotlib.pyplot as plt
from PIL import Image
import os

aug_dir = 'artifacts/aug_debug'
files = sorted(os.listdir(aug_dir))[:16]  # Show first 16

fig, axes = plt.subplots(4, 4, figsize=(16, 16))
for ax in axes.flat:
    ax.axis('off')  # hide every axis, including empty cells when fewer than 16 files exist
for ax, fname in zip(axes.flat, files):
    img = Image.open(os.path.join(aug_dir, fname))
    ax.imshow(img)
    ax.set_title(fname[:30], fontsize=8)
plt.tight_layout()
plt.savefig('artifacts/aug_grid.png', dpi=150)
plt.show()

print(f"\nTotal augmented samples saved: {len(os.listdir(aug_dir))}")
print("Grid saved to: artifacts/aug_grid.png")

## 8. Check Cache Directory

In [None]:
import os

cache_dir = '/content/cache/faces/train'
if os.path.exists(cache_dir):
    files = os.listdir(cache_dir)
    print(f"Cached face crops: {len(files)}")
    
    # Estimate the total size from a sample of up to 100 files
    sizes = [os.path.getsize(os.path.join(cache_dir, f)) for f in files[:100]]
    if sizes:  # guard against an empty cache directory
        avg_size = sum(sizes) / len(sizes)
        print(f"Average cache file size: {avg_size/1024:.1f} KB")
        print(f"Estimated total cache: {len(files) * avg_size / 1e9:.2f} GB")
else:
    print("Cache directory not found - this is expected on first run!")

## 9. (Optional) Run Full Training

Once the audit passes, run the full training:

In [None]:
# Uncomment to run full training
# !python train.py --config config.yaml \
#     --override dataset.ffpp_root=/content/data/raw/ffpp \
#     --override caching.cache_dir=/content/cache/faces