# Aegis-AI Audio Deepfake Detection Training

This notebook trains a lightweight CNN for audio deepfake detection on the Logical Access (LA) partition of the ASVspoof 2019 dataset.

## Setup Instructions
1. Upload this notebook to Google Colab
2. Enable GPU: Runtime → Change runtime type → GPU (T4 or better)
3. Run cells in order
4. Download the trained ONNX model at the end
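
Before running the training cells, it can help to confirm that the runtime actually exposes a GPU. The sketch below is one way to do that with only the standard library. It assumes the `nvidia-smi` tool is on the `PATH` whenever a GPU runtime is active, and the helper name `gpu_visible` is illustrative.

```python
import shutil
import subprocess


def gpu_visible() -> bool:
    """Return True if `nvidia-smi` exists and lists at least one GPU."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return False  # No NVIDIA driver tools on PATH
    try:
        # `nvidia-smi -L` prints one line per detected GPU
        result = subprocess.run(
            [exe, "-L"], capture_output=True, text=True, timeout=30
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0 and "GPU" in result.stdout


print("GPU visible:", gpu_visible())
```

If this prints `False`, the `--device cuda` training commands later in the notebook are unlikely to work until the runtime type is changed.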

## 1. Install Dependencies

In [1]:
!pip install -q onnx onnxruntime soundfile

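
If the install step fails, which can happen when no `onnxruntime` wheel is published for the active Python version or platform, a quick check of which packages are importable helps narrow things down. The following is a minimal standard-library sketch; the helper name `missing_packages` is illustrative.

```python
import importlib.util


def missing_packages(names):
    """Return the subset of `names` that cannot be located for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


required = ["onnx", "onnxruntime", "soundfile"]
missing = missing_packages(required)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are importable")
```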


## 2. Clone Repository (or upload files)

In [2]:
# Option A: Clone from GitHub (if your repo is public)
# !git clone https://github.com/your-username/aegis-ai.git
# %cd aegis-ai

# Option B: Upload files manually
# 1. Upload ml/training/train_audio.py
# 2. Upload ml/training/logging_config.py
# 3. Upload ml/datasets/loader.py
# 4. Upload manifest file (or download dataset)

import os
os.makedirs('ml/training', exist_ok=True)
os.makedirs('ml/datasets', exist_ok=True)
os.makedirs('models/audio', exist_ok=True)

print("✓ Directories created")

✓ Directories created


## 3. Upload Training Script and Dependencies

Use the file upload widget to upload:
- `ml/training/train_audio.py`
- `ml/training/logging_config.py`
- `ml/datasets/loader.py`
- `ml/datasets/__init__.py` (optional; the cell below creates an empty one)

In [3]:
# Create __init__.py files (cross-platform)
from pathlib import Path
Path('ml/__init__.py').touch()
Path('ml/training/__init__.py').touch()
Path('ml/datasets/__init__.py').touch()
print("✓ Created __init__.py files")

# Upload using Colab file upload
from google.colab import files
import shutil

# (file to upload, destination directory)
upload_targets = [
    ('train_audio.py', 'ml/training'),
    ('logging_config.py', 'ml/training'),
    ('loader.py', 'ml/datasets'),
]

for expected_name, dest_dir in upload_targets:
    print(f"\nUpload {expected_name} to {dest_dir}/")
    uploaded = files.upload()
    for filename in uploaded:
        shutil.move(filename, f'{dest_dir}/{filename}')
        print(f"✓ Moved {filename} to {dest_dir}/")

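
After the uploads finish, it is worth confirming that every file the training step expects is in place. This is a small sketch using the paths listed at the top of this section; `find_missing` is a hypothetical helper name.

```python
from pathlib import Path

REQUIRED_FILES = [
    "ml/training/train_audio.py",
    "ml/training/logging_config.py",
    "ml/datasets/loader.py",
    "ml/datasets/__init__.py",
]


def find_missing(paths, root="."):
    """Return the paths (relative to `root`) that are not regular files."""
    return [p for p in paths if not (Path(root) / p).is_file()]


missing = find_missing(REQUIRED_FILES)
print("Missing:", missing if missing else "none")
```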

## 4. Download ASVspoof 2019 Dataset

This downloads the ASVspoof 2019 LA archive straight into the Colab runtime, which avoids downloading it to your machine first and then re-uploading it.

In [None]:
# Download ASVspoof 2019 LA dataset
!mkdir -p ml/datasets/asvspoof_2019
!wget -O ml/datasets/asvspoof_2019/LA.zip https://datashare.ed.ac.uk/bitstream/handle/10283/3336/LA.zip

# Extract
import zipfile
print("Extracting dataset...")
with zipfile.ZipFile('ml/datasets/asvspoof_2019/LA.zip', 'r') as zip_ref:
    zip_ref.extractall('ml/datasets/asvspoof_2019/')
print("✓ Dataset extracted")
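
A quick sanity check after extraction is to count the audio files in each split. The sketch below assumes the archive unpacks to `LA/ASVspoof2019_LA_{train,dev,eval}/flac/`, which is the layout the manifest builder in the next section also expects; `count_flac_files` is an illustrative name.

```python
from pathlib import Path


def count_flac_files(dataset_root, splits=("train", "dev", "eval")):
    """Count .flac files per split directory; missing directories count as 0."""
    counts = {}
    for split in splits:
        flac_dir = Path(dataset_root) / "LA" / f"ASVspoof2019_LA_{split}" / "flac"
        if flac_dir.is_dir():
            counts[split] = sum(1 for _ in flac_dir.glob("*.flac"))
        else:
            counts[split] = 0
    return counts


print(count_flac_files("ml/datasets/asvspoof_2019"))
```

A count of zero for any split suggests the archive unpacked to a different layout than assumed here.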

## 5. Build Manifest

Create a JSONL manifest file for training.

In [None]:
%%writefile ml/datasets/build_manifest_simple.py
#!/usr/bin/env python3
"""Simple manifest builder for ASVspoof 2019 LA."""

import json
import wave
from pathlib import Path

def build_manifest(dataset_root, output_path):
    dataset_root = Path(dataset_root)
    protocol_dir = dataset_root / "LA" / "ASVspoof2019_LA_cm_protocols"
    audio_dir = dataset_root / "LA" / "ASVspoof2019_LA_train" / "flac"
    
    splits = {
        "train": protocol_dir / "ASVspoof2019.LA.cm.train.trn.txt",
        "dev": protocol_dir / "ASVspoof2019.LA.cm.dev.trl.txt",
        "eval": protocol_dir / "ASVspoof2019.LA.cm.eval.trl.txt",
    }
    
    with open(output_path, 'w') as out:
        for split_name, protocol_file in splits.items():
            print(f"Processing {split_name}...")
            if not protocol_file.exists():
                print(f"  Warning: {protocol_file} not found")
                continue
            
            with open(protocol_file) as f:
                for line in f:
                    parts = line.strip().split()
                    # Protocol columns: SPEAKER_ID FILE_ID - SYSTEM_ID KEY
                    if len(parts) < 5:
                        continue  # Skip malformed lines rather than guessing a label

                    speaker_id = parts[0]
                    file_id = parts[1]
                    label = parts[4]  # "bonafide" or "spoof"
                    
                    # Find audio file
                    audio_path = audio_dir / f"{file_id}.flac"
                    if not audio_path.exists():
                        # Try other split directories
                        for try_split in ["train", "dev", "eval"]:
                            try_path = dataset_root / "LA" / f"ASVspoof2019_LA_{try_split}" / "flac" / f"{file_id}.flac"
                            if try_path.exists():
                                audio_path = try_path
                                break
                    
                    if not audio_path.exists():
                        continue
                    
                    record = {
                        "path": str(audio_path),
                        "label": label,
                        "duration_sec": 4.0,  # Placeholder
                        "sample_rate": 16000,
                        "split": split_name,
                    }
                    
                    out.write(json.dumps(record) + "\n")
    
    print(f"✓ Manifest saved to {output_path}")

if __name__ == "__main__":
    build_manifest(
        "ml/datasets/asvspoof_2019",
        "ml/datasets/manifests/asvspoof_2019.jsonl"
    )

In [None]:
# Run the builder in its own cell: %%writefile writes the entire cell above to disk
!mkdir -p ml/datasets/manifests
!python ml/datasets/build_manifest_simple.py
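
The builder depends on the column layout of the countermeasure protocol files. The sketch below isolates that parsing step under the assumption that each line has five whitespace-separated fields: speaker ID, file ID, an unused `-`, a system ID (or `-` for genuine speech), and the key `bonafide` or `spoof`. The function name `parse_protocol_line`, the returned keys, and the example IDs are illustrative, not taken from the training code or the dataset.

```python
def parse_protocol_line(line):
    """Parse one protocol line into a dict, or return None if it is malformed."""
    parts = line.split()
    if len(parts) != 5:
        return None
    speaker_id, file_id, _unused, system_id, key = parts
    if key not in ("bonafide", "spoof"):
        return None
    return {
        "speaker_id": speaker_id,
        "file_id": file_id,
        "system_id": None if system_id == "-" else system_id,
        "label": key,
    }


# Made-up lines in the assumed format
print(parse_protocol_line("LA_0001 LA_T_0000001 - - bonafide"))
print(parse_protocol_line("LA_0001 LA_T_0000002 - A01 spoof"))
```

Returning `None` for malformed lines keeps a truncated or corrupted protocol file from silently producing mislabeled records.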

## 6. Verify Dataset

Check that the manifest was created correctly.

In [None]:
import json
from collections import Counter

manifest_path = "ml/datasets/manifests/asvspoof_2019.jsonl"

splits = Counter()
labels = Counter()

with open(manifest_path) as f:
    for line in f:
        rec = json.loads(line)
        splits[rec['split']] += 1
        labels[rec['label']] += 1

print("Dataset statistics:")
print(f"  Splits: {dict(splits)}")
print(f"  Labels: {dict(labels)}")
print(f"  Total samples: {sum(splits.values())}")
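
Beyond the counts, a stricter check can confirm that every record carries the fields the builder writes and that the referenced audio file exists. This is a minimal sketch, assuming the record keys produced by the manifest builder in the previous section; `validate_manifest` is a hypothetical helper.

```python
import json
from pathlib import Path

REQUIRED_KEYS = {"path", "label", "duration_sec", "sample_rate", "split"}


def validate_manifest(manifest_path, check_files=True):
    """Return a list of (line_number, problem) tuples; empty means nothing was flagged."""
    problems = []
    with open(manifest_path) as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue  # Ignore blank lines
            try:
                rec = json.loads(line)
            except json.JSONDecodeError as exc:
                problems.append((line_no, f"invalid JSON: {exc.msg}"))
                continue
            if not isinstance(rec, dict):
                problems.append((line_no, "record is not a JSON object"))
                continue
            missing = REQUIRED_KEYS - rec.keys()
            if missing:
                problems.append((line_no, f"missing keys: {sorted(missing)}"))
            elif rec["label"] not in ("bonafide", "spoof"):
                problems.append((line_no, f"unexpected label: {rec['label']!r}"))
            elif check_files and not Path(rec["path"]).is_file():
                problems.append((line_no, f"audio file not found: {rec['path']}"))
    return problems


manifest = Path("ml/datasets/manifests/asvspoof_2019.jsonl")
if manifest.exists():
    for line_no, problem in validate_manifest(manifest)[:10]:
        print(f"line {line_no}: {problem}")
```

Passing `check_files=False` skips the filesystem lookups, which is faster when only the record structure is in question.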

## 7. Run Training (Quick Test with Subset)

First, let's do a quick test run with a small subset to verify everything works.

In [None]:
# Quick test with small subset
!python -m ml.training.train_audio \
    --manifest ml/datasets/manifests/asvspoof_2019.jsonl \
    --output-dir models/audio \
    --model-version V0.1.0-test \
    --epochs 2 \
    --batch-size 16 \
    --lr 1e-3 \
    --seed 42 \
    --max-train-samples 500 \
    --max-val-samples 100 \
    --device cuda

## 8. Full Training Run

If the test passed, run full training (this will take 1-3 hours depending on GPU).

In [None]:
# Full training run
!python -m ml.training.train_audio \
    --manifest ml/datasets/manifests/asvspoof_2019.jsonl \
    --output-dir models/audio \
    --model-version V1.0.0 \
    --epochs 20 \
    --batch-size 32 \
    --lr 1e-3 \
    --seed 42 \
    --device cuda

## 9. Verify ONNX Model

In [None]:
import onnxruntime as ort
import numpy as np

# Load ONNX model
session = ort.InferenceSession("models/audio/V1.0.0.onnx")

# Check input/output shapes
print("Model Inputs:")
for inp in session.get_inputs():
    print(f"  {inp.name}: {inp.shape} ({inp.type})")

print("\nModel Outputs:")
for out in session.get_outputs():
    print(f"  {out.name}: {out.shape} ({out.type})")

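# Test inference (adjust the dummy shape if the input shape printed above differs)
input_name = session.get_inputs()[0].name
dummy_input = np.random.randn(1, 64, 1001).astype(np.float32)
outputs = session.run(None, {input_name: dummy_input})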
print(f"\nTest inference output shape: {outputs[0].shape}")
print(f"Test inference output: {outputs[0]}")
print("✓ ONNX model verified")
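
The dummy input only works if its shape matches what the exported model expects. ONNX Runtime reports dynamic axes in `inp.shape` as strings (or `None`) rather than integers, so one option is to derive the dummy shape from that metadata. The sketch below shows only the shape-resolution step in plain Python; `concrete_shape`, its parameters, and the axis names `batch` and `time` are assumptions for illustration, since the actual export settings are not shown in this notebook.

```python
def concrete_shape(shape, overrides=None, default=1):
    """Replace symbolic or unknown dimensions with concrete sizes.

    Fixed positive integers are kept. Any other entry (an axis name or None)
    is looked up in `overrides`, falling back to `default`.
    """
    overrides = overrides or {}
    resolved = []
    for dim in shape:
        if isinstance(dim, int) and dim > 0:
            resolved.append(dim)
        else:
            resolved.append(overrides.get(dim, default))
    return resolved


# Hypothetical model exported with dynamic batch and time axes
print(concrete_shape(["batch", 64, "time"], overrides={"time": 1001}))  # → [1, 64, 1001]
```

With a live session, `np.random.randn(*concrete_shape(inp.shape, ...)).astype(np.float32)` would then produce an input that matches the first model input.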

## 10. Download Trained Model

Download the ONNX model and metadata to your local machine.

In [None]:
from google.colab import files

# Download ONNX model
files.download('models/audio/V1.0.0.onnx')

# Download metadata
files.download('models/audio/V1.0.0.json')

# Download PyTorch checkpoint (optional)
files.download('models/audio/V1.0.0_best.pt')

print("✓ Files ready for download")

## 11. Test with Sample Audio (Optional)

Test the model with a sample audio file.

In [None]:
from pathlib import Path

import numpy as np
import torchaudio

# Load a sample from the dataset
sample_path = "ml/datasets/asvspoof_2019/LA/ASVspoof2019_LA_dev/flac/LA_D_1000147.flac"  # Adjust path

if Path(sample_path).exists():
    # Load audio
    waveform, sr = torchaudio.load(sample_path)
    
    # Preprocess (same as training)
    from ml.training.train_audio import AudioFeatureExtractor
    extractor = AudioFeatureExtractor()
    features = extractor(waveform)
    features_np = features.unsqueeze(0).numpy()  # Add batch dimension
    
    # Run inference
    outputs = session.run(None, {"audio_features": features_np})
    logits = outputs[0][0]
    exp_logits = np.exp(logits - logits.max())  # Subtract max for numerical stability
    probs = exp_logits / exp_logits.sum()  # Softmax
    
    print(f"Sample: {sample_path}")
    print(f"Logits: {logits}")
    print(f"Probabilities: {probs}")
    # Assumes class index 0 is bonafide and index 1 is spoof; confirm against the training script
    print(f"Prediction: {'BONAFIDE' if probs[0] > probs[1] else 'SPOOF'}")
    print(f"Confidence: {max(probs):.3f}")
else:
    print("Sample file not found - adjust path")
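
The softmax step deserves a closer look: exponentiating large logits directly can overflow, while subtracting the maximum logit first yields the same probabilities without that risk. Here is a standard-library sketch of the idea (inside the notebook the same thing would be done with NumPy):

```python
import math


def softmax(logits):
    """Numerically stable softmax over a sequence of floats."""
    peak = max(logits)
    # Shifting by the maximum leaves the result unchanged but keeps exp() in range
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# math.exp(1000.0) on its own would raise OverflowError
probs = softmax([1000.0, 1001.0])
print(probs)
```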

## Summary

After running this notebook, you'll have:
1. A trained ONNX model (`V1.0.0.onnx`)
2. A model metadata file (`V1.0.0.json`)
3. A PyTorch checkpoint for further fine-tuning (`V1.0.0_best.pt`)

### Next Steps (on your local machine):
```bash
# 1. Place the downloaded model
cp ~/Downloads/V1.0.0.onnx models/audio/
cp ~/Downloads/V1.0.0.json models/audio/

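# 2. Create symlink (in a subshell, so the working directory stays at the repo root)
(cd models/audio && ln -s V1.0.0.onnx latest.onnx)  # Linux/Mac
# Or on Windows (cmd, from models\audio): mklink latest.onnx V1.0.0.onnx

# 3. Set environment variable (absolute path, so it still resolves after the cd below)
export ONNX_MODEL_PATH="$(pwd)/models/audio/latest.onnx"

# 4. Start the API
cd services/api
uvicorn app.main:app --reload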

# 5. Test the /v1/models endpoint
curl http://localhost:8000/v1/models
```
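
Because symlink creation differs between platforms (on Windows it may need elevated privileges or Developer Mode), step 2 can also be done from Python. The sketch below tries a relative symlink and falls back to copying the file; `point_latest` is a hypothetical helper, not part of the repository.

```python
import os
import shutil
from pathlib import Path


def point_latest(model_path, link_name="latest.onnx"):
    """Make `link_name` next to `model_path` refer to the model (symlink, else copy)."""
    model_path = Path(model_path)
    link_path = model_path.parent / link_name
    if link_path.is_symlink() or link_path.exists():
        link_path.unlink()  # Replace any previous pointer
    try:
        # Relative target, so the link keeps working if the directory is moved
        os.symlink(model_path.name, link_path)
    except (OSError, NotImplementedError):
        shutil.copy2(model_path, link_path)  # Fallback when symlinks are unavailable
    return link_path
```

Calling `point_latest("models/audio/V1.0.0.onnx")` from the repository root would create or refresh `models/audio/latest.onnx`.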