## METHODOLOGY 1 RESEARCH NOTEBOOK

**Dynamic Content-Aware Audio Compression for Reduced Hallucinations**

Lead: Rithwik Nukala

**RESEARCH HYPOTHESIS:**
We expect a dynamic, content-aware compression model to reduce token count by an estimated 30-50% on the LibriSpeech and Common Voice datasets. We will measure this with a combination of token count per second and downstream task performance (ASR Word Error Rate). The rationale is that a significant portion of audio input is low-complexity (e.g., silence, simple noise) and can be compressed aggressively without losing semantic information, which should improve efficiency and reduce hallucination risk.

**PERFORMANCE TARGETS:**

*   30-50% reduction in audio tokens
*   Maintain ASR Word Error Rate within acceptable bounds
*   Real-time processing capabilities
*   Evaluation on LibriSpeech and Common Voice datasets
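
The targets above can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function names and the `max_wer_increase` tolerance are assumptions, since this notebook does not yet define what counts as an acceptable WER bound.

```python
def token_reduction(baseline_tokens: int, compressed_tokens: int) -> float:
    """Fraction of tokens removed relative to the uncompressed baseline."""
    if baseline_tokens <= 0:
        raise ValueError("baseline_tokens must be positive")
    return 1.0 - compressed_tokens / baseline_tokens


def tokens_per_second(num_tokens: int, duration_s: float) -> float:
    """Token rate for a clip of the given duration in seconds."""
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    return num_tokens / duration_s


def meets_targets(
    baseline_tokens: int,
    compressed_tokens: int,
    baseline_wer: float,
    compressed_wer: float,
    max_wer_increase: float = 0.01,  # hypothetical tolerance, not a project decision
) -> bool:
    """True if the reduction reaches 30% and WER degrades by at most the tolerance."""
    reduction = token_reduction(baseline_tokens, compressed_tokens)
    return reduction >= 0.30 and (compressed_wer - baseline_wer) <= max_wer_increase
```

For example, going from 1000 to 600 tokens is a 40% reduction, which falls inside the 30-50% target range.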

**TEAM MEMBERS:**

*   Rithwik Nukala (Lead)
*   Ogan Aktolun (Experiment Orchestrator)
*   Abdulmatin Omotoso (Core Implementation)
*   Kevin Li (Architecture Design & Results Analysis)
*   Amitesh Vatsa (Package Integrator & Results Analysis)
*   Ishan Singh (Architecture Design & Core Implementation)

**NOTEBOOK STRUCTURE:**

*   Section 1: Environment Setup & Dependencies
*   Section 2: Dataset Integration (LibriSpeech & Common Voice)
*   Section 3: Audio Content Analysis & Classification
*   Section 4: Dynamic Compression Algorithm Development
*   Section 5: Low-Complexity Content Detection
*   Section 6: Semantic Preservation Framework
*   Section 7: Real-Time Processing Pipeline
*   Section 8: ASR Integration & Evaluation
*   Section 9: Token Reduction Analysis
*   Section 10: Performance Benchmarking
*   Section 11: Results Analysis & Validation
*   Section 12: Package Development & Documentation

## API Usage Section:
### Code Example:

*   Complete working example adapted for reasoning tasks
*   Clear parameter explanations (context, prompt, model, rate)
*   Security note about getting personal API keys

### Usage Tips:

*   Start with no compression (rate: 0) for baseline testing
*   Personal API key requirement for security
*   Dashboard monitoring for experiment tracking
*   Baseline comparison guidance for methodology evaluation

### Generate API key
To generate an API key:
1. Log into the [dashboard](https://hallucinating-prompts.scaledown.ai/dashboard).
2. Switch to the API keys tab.
3. Generate an API key.
4. Track your usage over time.

In [None]:
import requests
import json
url = "https://api.scaledown.xyz/compress/"
payload = json.dumps({
  "context": "<context about messi>",
  "prompt": "How many awards does messi have",
  "model": "gemini-2.5-flash",
  "scaledown": {
    "rate": 0
  }
})
headers = {
  'x-api-key': 'add your api key here',
  'Content-Type': 'application/json'
}
response = requests.request("POST", url, headers=headers, data=payload)
print(response.text)

{"message":"Forbidden"}
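
The `Forbidden` response is likely what the service returns when the placeholder key is sent unchanged. A minimal sketch of building the same request with a personal key read from the environment follows. The helper name `build_request` and the variable name `SCALEDOWN_API_KEY` are assumptions made for this sketch, not part of the service's documentation.

```python
import json
import os

API_URL = "https://api.scaledown.xyz/compress/"  # endpoint used in the cell above


def build_request(context, prompt, model="gemini-2.5-flash", rate=0, api_key=None):
    """Return (headers, body) for a compress request without sending it."""
    # SCALEDOWN_API_KEY is a hypothetical variable name chosen for this sketch.
    api_key = api_key or os.environ.get("SCALEDOWN_API_KEY")
    if not api_key:
        raise ValueError("No API key found; generate one in the dashboard first.")
    headers = {"x-api-key": api_key, "Content-Type": "application/json"}
    body = json.dumps(
        {
            "context": context,
            "prompt": prompt,
            "model": model,
            "scaledown": {"rate": rate},  # rate 0 = no compression (baseline)
        }
    )
    return headers, body
```

The result can be sent with `requests.post(API_URL, headers=headers, data=body)`. Keeping the key out of the notebook source also avoids committing it by accident.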


# SECTION 1: ENVIRONMENT SETUP & DEPENDENCIES
## Primary: Rithwik Nukala, Ishan Singh | Supporting: All

In [None]:
# Cell 1.1: Audio Processing Environment Setup
# TODO: Set up comprehensive audio processing environment
# - Install and configure librosa, soundfile, torchaudio
# - Set up speech recognition libraries (whisper, wav2vec2)
# - Configure GPU acceleration for audio processing
# - Install dataset handling libraries for LibriSpeech and Common Voice
# - Set up real-time audio processing frameworks

In [None]:
# TODO: Set up performance monitoring for dynamic compression
# - Implement real-time processing metrics
# - Set up memory usage tracking for large audio files
# - Configure GPU utilization monitoring
# - Create compression ratio tracking

# SECTION 2: DATASET INTEGRATION (LIBRISPEECH & COMMON VOICE)
## Primary: Ogan Aktolun, Abdulmatin Omotoso | Supporting: All

According to the [LibriSpeech dataset description](https://huggingface.co/datasets/openslr/librispeech_asr), "lower-WER speakers designated as 'clean' and the higher WER speakers designated as 'other'". I (Kevin) suggest that for early testing and preprocessing we use only the "clean" set until we have verified that our methods are valid. The "clean" set is also smaller.

In [None]:
# TODO: Implement LibriSpeech dataset loading and preprocessing
# - Load LibriSpeech dataset with proper audio format handling
# - Implement efficient batch processing for large dataset
# - Create audio segmentation for analysis
# - Set up ground truth transcriptions for ASR evaluation

In [None]:
%%capture
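# The PyPI package named "whisper" is unrelated to OpenAI Whisper; the ASR package is "openai-whisper"
!pip install openai-whisper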
!pip install torch torchcodec evaluate jiwer

In [None]:
#Environment setup

import os
import pickle
import random
from pathlib import Path
from typing import Dict, List, Tuple

import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torchaudio
import torchvision.models as models
from torch.utils.data import DataLoader, Dataset
from torchaudio.transforms import AmplitudeToDB, MelSpectrogram
from tqdm import tqdm

import matplotlib.pyplot as plt

print(f"PyTorch version: {torch.__version__}")
print(f"Torchaudio version: {torchaudio.__version__}")

# Device setup
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Device: {DEVICE}")

# Reproducibility
SEED = 42
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
if DEVICE == "cuda":
    torch.cuda.manual_seed_all(SEED)

PyTorch version: 2.8.0+cu126
Torchaudio version: 2.8.0+cu126
Device: cuda


In [None]:
def get_config() -> Dict:
    """Returns a dictionary of project configurations."""
    return {
        # Audio parameters
        "SAMPLE_RATE": 16000,
        "N_MELS": 128,
        "N_FFT": 1024,
        "HOP_LENGTH": 256,

        # Model parameters
        "TIME_FRAMES": 128, # Fixed temporal dimension for spectrograms
        "USE_PRETRAINED": True,
        "MOBILENET_VERSION": 'mobilenet_v3_small',

        # Training parameters
        "BATCH_SIZE": 32,
        "LEARNING_RATE": 1e-3,
        "EPOCHS": 30,
        "TRAIN_SPLIT": 0.8,

        # Data parameters
        "DATA_ROOT": "./data",
        "CACHE_DIR": "./cache",
        "USE_CACHE": True,
        "MODEL_SAVE_PATH": "best_wer_predictor.pth",

        # Whisper parameters
        "WHISPER_MODEL_HF": "openai/whisper-tiny.en",
        "WHISPER_MODEL_API": "whisper-1",

        # !!! CHOOSE YOUR WER COMPUTATION METHOD HERE !!!
        # Options: 'huggingface' (local, free) or 'openai_api' (powerful, paid)
        "WER_COMPUTATION_METHOD": 'huggingface',

        # API Key for OpenAI
        "OPENAI_API_KEY": os.environ.get("OPENAI_API_KEY"),

        # Compression thresholds
        "LOW_COMPLEXITY_THRESH": 0.15,
        "HIGH_COMPLEXITY_THRESH": 0.35,
    }

In [None]:
config_params = get_config()
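
The two compression thresholds in the config suggest a three-way decision per audio segment. The sketch below shows one way they could be applied, assuming a complexity score in [0, 1] and following the hypothesis that low-complexity audio can be compressed aggressively. The function name and the level labels are hypothetical, and the notebook does not yet say which score the thresholds apply to.

```python
LOW_COMPLEXITY_THRESH = 0.15   # value copied from get_config()
HIGH_COMPLEXITY_THRESH = 0.35  # value copied from get_config()


def choose_compression(
    score: float,
    low: float = LOW_COMPLEXITY_THRESH,
    high: float = HIGH_COMPLEXITY_THRESH,
) -> str:
    """Map a complexity score to a compression level (labels are illustrative)."""
    if score < low:
        return "aggressive"  # e.g. silence or simple noise
    if score < high:
        return "moderate"
    return "none"  # complex content is left untouched
```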

In [None]:
# LibriSpeech data
from datasets import load_dataset
from torch.utils.data import DataLoader

batch_size = 32

ds_ls_train100 = load_dataset("openslr/librispeech_asr", "clean", split = "train.100", streaming = True)
ds_ls_val = load_dataset("openslr/librispeech_asr", "clean", split = "validation", streaming = True)

#dataloader_ls_train100 = DataLoader(ds_ls_train100, batch_size = batch_size)
#dataloader_ls_val = DataLoader(ds_ls_val, batch_size = batch_size)

In [None]:
# Set up ground truth for ASR evaluation
from transformers import pipeline
import torch

if torch.cuda.is_available():
    device = "cuda:0"
    torch_dtype = torch.float16
else:
    device = "cpu"
    torch_dtype = torch.float32

pipe = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-small",
    torch_dtype=torch_dtype,
    device=device,
)

Device set to use cuda:0


In [None]:
from tqdm import tqdm


def iter_audio(dataset):
    """Yield the audio field of each sample; a streaming dataset cannot be indexed."""
    for sample in dataset:
        yield sample["audio"]


# ds_ls_val is an IterableDataset (streaming = True), so it has no len():
# tqdm runs without a total and the pipeline consumes a plain generator.
all_predictions = []
for pred in tqdm(
    pipe(
        iter_audio(ds_ls_val),
        generate_kwargs={"task": "transcribe", "max_new_tokens": 128},
        batch_size=32,
    )
):
    all_predictions.append(pred["text"])
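
A streaming dataset is an iterable without a length, so any sample limit or progress total has to be supplied explicitly. The sketch below illustrates the idea with a plain generator standing in for the dataset. `iter_limited` and `fake_stream` are names invented for this example.

```python
from itertools import islice


def iter_limited(stream, max_items):
    """Yield at most max_items items from any iterable, including ones without len()."""
    return islice(stream, max_items)


def fake_stream():
    """Stand-in for a streaming dataset: an endless generator of sample dicts."""
    i = 0
    while True:
        yield {"id": i}
        i += 1


samples = list(iter_limited(fake_stream(), 5))
print([s["id"] for s in samples])
```

With a known limit `n`, a progress bar can be given `total=n` instead of `len(dataset)`. Hugging Face streaming datasets also offer `.take(n)`, which a later cell in this notebook uses.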

In [None]:
from evaluate import load

wer_metric = load("wer")

# Disabled: `common_voice_test` is not defined at this point.
# wer_ortho = 100 * wer_metric.compute(
#     references=common_voice_test["sentence"], predictions=all_predictions
# )
# wer_ortho
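
For reference, Word Error Rate is the word-level edit distance between the hypothesis and the reference, divided by the number of reference words: WER = (S + D + I) / N. The sketch below is a small pure-Python version for illustration. It is not the implementation used by `evaluate` or `jiwer`, and it applies no text normalization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the processed reference prefix and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(
                min(
                    prev[j] + 1,         # deletion (reference word missing)
                    curr[j - 1] + 1,     # insertion (extra hypothesis word)
                    prev[j - 1] + cost,  # substitution or match
                )
            )
        prev = curr
    return prev[-1] / len(ref)
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.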

In [None]:
# TODO: Implement Common Voice dataset loading and preprocessing
# - Load Common Voice dataset with multi-language support
# - Handle varying audio quality and lengths
# - Implement speaker diversity analysis
# - Create evaluation splits for testing

In [None]:
%%capture
# Cell magics such as %%capture must be the first line of the cell.
# https://github.com/huggingface/datasets/issues/7693
# It seems the Mozilla dataset is outdated, so either downgrade `datasets` or fix the dataset.
!pip install datasets==3.6.0

In [None]:
# Common Voice requires a Hugging Face login
from huggingface_hub import login

login()

In [None]:
from datasets import load_dataset
from torch.utils.data import DataLoader

batch_size = 32
language = "en"

ds_cv_test = load_dataset("mozilla-foundation/common_voice_17_0", language, split = "test", streaming = True)
ds_cv_train = load_dataset("mozilla-foundation/common_voice_17_0", language, split = "train", streaming = True)
dataloader_cv_test = DataLoader(ds_cv_test, batch_size = batch_size)
dataloader_cv_train = DataLoader(ds_cv_train, batch_size = batch_size)

RuntimeError: Dataset scripts are no longer supported, but found common_voice_17_0.py

In [None]:
from IPython.display import Audio

a = next(iter(ds_ls_val))

print(a.keys())
print(a["text"])

waveform = a["audio"]["array"]
sample_rate = a["audio"]["sampling_rate"]

Audio(waveform, rate = sample_rate)

ModuleNotFoundError: No module named 'datasets.features._torchcodec'

In [None]:
import torch
import torchaudio
import matplotlib.pyplot as plt

# size of fft
n_fft = 400

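waveform_tensor = torch.as_tensor(waveform, dtype = torch.float32)
transform = torchaudio.transforms.Spectrogram(n_fft = n_fft)

spectrogram = transform(waveform_tensor)

print(spectrogram.shape)
# add a small offset so silent bins do not become -inf under the log
plt.imshow((spectrogram + 1e-10).log2().numpy(), origin = "lower", aspect = "auto")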

NameError: name 'waveform' is not defined
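
The shape that the spectrogram cell would print can be worked out ahead of time. The sketch below assumes torchaudio's default `Spectrogram` settings (window length equal to `n_fft`, hop length of half the window, centered frames, one-sided output). The helper name is made up for this example.

```python
def spectrogram_shape(num_samples: int, n_fft: int = 400, hop_length=None):
    """Return (freq_bins, frames) for a one-sided, centered STFT."""
    if hop_length is None:
        hop_length = n_fft // 2  # assumed default: half the window length
    freq_bins = n_fft // 2 + 1   # one-sided spectrum keeps bins 0..n_fft/2
    frames = 1 + num_samples // hop_length  # centering pads n_fft // 2 on each side
    return freq_bins, frames


# One second of 16 kHz audio with n_fft = 400
print(spectrogram_shape(16000))
```

With `n_fft = 400`, each column of the image covers 201 frequency bins, and consecutive columns are 200 samples (12.5 ms at 16 kHz) apart.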

In [None]:
# todo: make each into a function, get_wer to process batches

match config_params["WER_COMPUTATION_METHOD"]:
    case "openai_api" if config_params["OPENAI_API_KEY"] is not None:
        from openai import OpenAI
        import tempfile
        import soundfile as sf

        # i havent tested this yet
        def get_wers(batch):
            wers = []

            for item in batch:
                # Hugging Face audio samples nest the waveform and rate under "audio"
                waveform = item["audio"]["array"]
                sample_rate = item["audio"]["sampling_rate"]
                ref = item["text"]

                # if audio is saved locally, i think u should save filepath_audio var as such
                # filepath_audio = a["file"]
                with tempfile.NamedTemporaryFile(suffix = ".flac") as filepath_audio:
                    sf.write(filepath_audio.name, waveform, sample_rate)

                    with open(filepath_audio.name, "rb") as f:
                        transcript = client.audio.transcriptions.create(
                            model = "whisper-1",
                            file = f
                        )

                pred = transcript.text
                wer = wer_metric.compute(predictions = [pred], references = [ref])
                wers.append(wer)

            return wers

        client = OpenAI()

    case "huggingface":
        from transformers import WhisperForConditionalGeneration, WhisperProcessor

        WHISPER_PROCESSOR = WhisperProcessor.from_pretrained(config_params["WHISPER_MODEL_HF"])
        # use the DEVICE chosen in the environment setup cell instead of hard-coding "cuda"
        WHISPER_MODEL = WhisperForConditionalGeneration.from_pretrained(config_params["WHISPER_MODEL_HF"]).to(DEVICE)

        def map_to_pred(batch):
            audio = batch["audio"]

            input_features = WHISPER_PROCESSOR(audio["array"], sampling_rate = audio["sampling_rate"], return_tensors = "pt").input_features
            # normalize() is the public counterpart of the deprecated private _normalize()
            batch["reference"] = WHISPER_PROCESSOR.tokenizer.normalize(batch["text"])

            with torch.no_grad():
                predicted_ids = WHISPER_MODEL.generate(input_features.to(DEVICE))[0]

            transcription = WHISPER_PROCESSOR.decode(predicted_ids)
            batch["prediction"] = WHISPER_PROCESSOR.tokenizer.normalize(transcription)

            return batch

        def get_wers(batch):
            wers = []

            # a streaming dataset cannot be indexed by column, so iterate row by row
            for row in batch.map(map_to_pred):
                wer = wer_metric.compute(predictions = [row["prediction"]], references = [row["reference"]])
                wers.append(wer)

                # print(row["reference"], "\n", row["prediction"], "\n", wer, "\n\n")

            return wers

In [None]:
get_wers(ds_ls_val.take(10))

In [None]:
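# `result` is not defined in an earlier cell; it is assumed to be the mapped validation stream
result = ds_ls_val.map(map_to_pred)

for i, row in enumerate(result):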
    print(i, ": ", row["reference"], "\n -> ", row["prediction"], "\n\n")



0 :  chapter 16 i might have told you of the beginning of this liaison in a few lines but i wanted you to see every step by which we came i to agree to whatever marguerite wished 
 ->  chapter 16 


1 :  marguerite to be unable to live apart from me it was the day after the evening when she came to see me that i sent her manon lescaut from that time seeing that i could not change my mistress is life i changed my own 
 ->  marguerite to be unable to live apart from me it was the day after the evening when she came to see me that i sent her a mononless scot from that time seeing that i could not change my mistress is life i changed my own 


2 :  i wished above all not to leave myself time to think over the position i had accepted for in spite of myself it was a great distress to me thus my life generally so calm 
 ->  i wished above all not to leave myself time to think over the position i had accepted for in spite of myself it was a great distress to me thus my life generally so calm 


KeyboardInterrupt: 

In [None]:
reference = a["text"]

wer_metric.compute(predictions = [hypothesis], references = [reference])

1.0

# SECTION 3: AUDIO CONTENT ANALYSIS & CLASSIFICATION
## Primary: Kevin Li, Rithwik Nukala | Supporting: Ishan Singh

In [None]:
# TODO: Implement audio content complexity analysis
# - Design complexity scoring algorithm
# - Classify audio segments by complexity (silence, noise, speech)
# - Implement real-time complexity detection
# - Create adaptive thresholds for different content types

In [None]:
# TODO: Implement advanced voice activity detection
# - Design robust VAD algorithm
# - Handle noisy environments
# - Optimize for real-time processing
# - Integrate with compression decisions

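As a starting point for the voice activity detection TODO, a simple energy-based detector can flag frames whose RMS level exceeds a threshold. This is a minimal sketch and not the robust VAD the TODO calls for. The frame sizes (25 ms windows with a 10 ms hop at 16 kHz) and the -40 dB threshold are assumptions picked for illustration.

```python
import math


def frame_rms(samples, frame_len, hop_len):
    """RMS energy of each frame of a 1-D sequence of float samples."""
    if not samples:
        return []
    values = []
    last_start = max(len(samples) - frame_len, 0)
    for start in range(0, last_start + 1, hop_len):
        frame = samples[start:start + frame_len]
        values.append(math.sqrt(sum(x * x for x in frame) / len(frame)))
    return values


def detect_activity(samples, frame_len=400, hop_len=160, threshold_db=-40.0):
    """Return one boolean per frame: True where the level exceeds threshold_db."""
    flags = []
    for value in frame_rms(samples, frame_len, hop_len):
        level_db = 20 * math.log10(max(value, 1e-10))  # floor avoids log of zero
        flags.append(level_db > threshold_db)
    return flags


# 0.1 s of silence followed by 0.1 s of a 440 Hz tone at 16 kHz
silence = [0.0] * 1600
tone = [0.5 * math.sin(2 * math.pi * 440 * n / 16000) for n in range(1600)]
flags = detect_activity(silence + tone)
```

Frames flagged `False` are candidates for aggressive compression. An energy threshold cannot tell noise from speech, so the "Handle noisy environments" item would need something beyond this.
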
In [None]:
#Helper functions

def _fix_spec_length(spec: torch.Tensor, time_frames: int) -> torch.Tensor:
    """Pads or crops a spectrogram to a fixed temporal length."""
    _, T = spec.shape
    if T < time_frames:
        spec = F.pad(spec, (0, time_frames - T))
    elif T > time_frames:
        start = (T - time_frames) // 2
        spec = spec[:, start:start + time_frames]
    return spec

class AudioDataset(Dataset):
    """Custom dataset for audio samples with WER labels."""
    def __init__(self, samples: List[Dict], wer_scores: List[float], config: Dict):
        self.samples = samples
        self.wer_scores = wer_scores
        self.config = config
        self.mel_transform = nn.Sequential(
            MelSpectrogram(
                sample_rate=config["SAMPLE_RATE"], n_fft=config["N_FFT"],
                hop_length=config["HOP_LENGTH"], n_mels=config["N_MELS"]
            ),
            AmplitudeToDB()
        )

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        sample, wer = self.samples[idx], self.wer_scores[idx]
        audio = torch.from_numpy(sample['audio']).float()

        if sample['sampling_rate'] != self.config["SAMPLE_RATE"]:
            resampler = torchaudio.transforms.Resample(sample['sampling_rate'], self.config["SAMPLE_RATE"])
            audio = resampler(audio)

        if audio.dim() == 1: audio = audio.unsqueeze(0)
        mel_spec = self.mel_transform(audio).squeeze(0)
        mel_spec = _fix_spec_length(mel_spec, self.config["TIME_FRAMES"])
        mel_spec = mel_spec.unsqueeze(0).repeat(3, 1, 1)

        return mel_spec, torch.tensor(wer, dtype=torch.float32)

def _compute_wer_with_huggingface(samples: List[Dict], config: Dict) -> List[float]:
    """Computes WER using a local Hugging Face Whisper model."""
    print("Computing WER scores using local Hugging Face model...")
    try:
        from transformers import pipeline
        import jiwer

        asr = pipeline("automatic-speech-recognition", model=config["WHISPER_MODEL_HF"], device=0 if DEVICE == "cuda" else -1)
        wer_scores = []
        for sample in tqdm(samples, desc="Computing WER (HF)"):
            result = asr({"array": sample['audio'], "sampling_rate": sample['sampling_rate']})
            hypothesis = result['text'].strip().lower()
            reference = sample['text'].strip().lower()
            wer = jiwer.wer(reference, hypothesis)
            wer_scores.append(min(wer, 1.0))
        return wer_scores
    except Exception as e:
        # Instead of falling back to mock scores, raise an error.
        print(f"\nError with Hugging Face method: {e}")
        raise RuntimeError(f"Failed to compute WER with Hugging Face. Check model/library installation. Original error: {e}") from e

def _compute_wer_with_openai_api(samples: List[Dict], config: Dict) -> List[float]:
    """Computes WER using the OpenAI API."""
    print("Computing WER scores using OpenAI API...")
    if not config["OPENAI_API_KEY"]:
        # Instead of falling back to mock scores, raise an error.
        raise ValueError("OpenAI API key not found. Please set the OPENAI_API_KEY environment variable.")

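    # imports this function needs but that no earlier cell provides
    import io

    import jiwer
    import openai
    from scipy.io.wavfile import write as write_wav

    client = openai.OpenAI(api_key=config["OPENAI_API_KEY"])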
    wer_scores = []

    for sample in tqdm(samples, desc="Computing WER (API)"):
        try:
            audio_data, sr = sample['audio'], sample['sampling_rate']
            if np.issubdtype(audio_data.dtype, np.floating):
                # clip first so out-of-range samples do not wrap around in int16
                audio_data = (np.clip(audio_data, -1.0, 1.0) * 32767).astype(np.int16)

            bytes_wav = io.BytesIO()
            write_wav(bytes_wav, rate=sr, data=audio_data)
            bytes_wav.seek(0)
            bytes_wav.name = "audio.wav"

            response = client.audio.transcriptions.create(model=config["WHISPER_MODEL_API"], file=bytes_wav)
            hypothesis = response.text.strip().lower()
        except Exception as e:
            print(f"\nAPI call failed for a sample: {e}. Assigning max WER.")
            hypothesis = ""

        reference = sample['text'].strip().lower()
        wer = jiwer.wer(reference, hypothesis)
        wer_scores.append(min(wer, 1.0))

    return wer_scores

def compute_wer_scores(samples: List[Dict], config: Dict) -> List[float]:
    """Dispatcher function to compute WER scores based on the chosen method."""
    method = config["WER_COMPUTATION_METHOD"]
    if method == 'huggingface':
        return _compute_wer_with_huggingface(samples, config)
    elif method == 'openai_api':
        return _compute_wer_with_openai_api(samples, config)
    else:
        # Instead of falling back to mock scores, raise an error.
        raise ValueError(f"Invalid WER_COMPUTATION_METHOD: '{method}'. Valid options are 'huggingface' or 'openai_api'.")

def load_and_cache_data(config: Dict, subset="train-clean-100", max_samples=None):
    """Loads LibriSpeech data and computes/caches WER scores."""
    os.makedirs(config["DATA_ROOT"], exist_ok=True)
    os.makedirs(config["CACHE_DIR"], exist_ok=True)

    cache_file_suffix = f"{config['WER_COMPUTATION_METHOD']}_{max_samples or 'all'}.pkl"
    cache_file = Path(config["CACHE_DIR"]) / f"{subset}_wer_cache_{cache_file_suffix}"

    if config["USE_CACHE"] and cache_file.exists():
        print(f"Loading cached data from {cache_file}")
        with open(cache_file, 'rb') as f: data = pickle.load(f)
        return data['samples'], data['wer_scores']

    print(f"Loading LibriSpeech '{subset}'...")
    dataset = torchaudio.datasets.LIBRISPEECH(config["DATA_ROOT"], url=subset, download=True)

    samples = []
    total = min(len(dataset), max_samples) if max_samples else len(dataset)
    for i in tqdm(range(total), desc="Preparing samples"):
        waveform, sr, transcript, _, _, _ = dataset[i]
        samples.append({'audio': waveform.squeeze(0).numpy(), 'sampling_rate': sr, 'text': transcript})

    wer_scores = compute_wer_scores(samples, config)

    if config["USE_CACHE"]:
        print(f"Caching results to {cache_file}")
        with open(cache_file, 'wb') as f:
            pickle.dump({'samples': samples, 'wer_scores': wer_scores}, f)

    return samples, wer_scores
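
One consequence of the fixed `TIME_FRAMES` setting is worth checking: `_fix_spec_length` keeps only the center 128 frames of each spectrogram, while the WER label is computed on the full utterance. The sketch below estimates how much audio survives the crop, assuming torchaudio's default centered framing. The helper names are made up for this example.

```python
SAMPLE_RATE = 16000  # values copied from get_config()
HOP_LENGTH = 256
TIME_FRAMES = 128


def num_frames(num_samples: int, hop_length: int = HOP_LENGTH) -> int:
    """Frame count for a centered STFT (assumes the default center=True)."""
    return 1 + num_samples // hop_length


def covered_seconds(time_frames=TIME_FRAMES, hop_length=HOP_LENGTH, sample_rate=SAMPLE_RATE):
    """Approximate audio duration spanned by a fixed number of frames."""
    return time_frames * hop_length / sample_rate


def discarded_fraction(duration_s, time_frames=TIME_FRAMES,
                       hop_length=HOP_LENGTH, sample_rate=SAMPLE_RATE):
    """Share of frames removed by the center crop for a clip of duration_s seconds."""
    frames = num_frames(int(duration_s * sample_rate), hop_length)
    return max(frames - time_frames, 0) / frames


print(covered_seconds())          # about 2.05 seconds
print(discarded_fraction(10.0))   # most of a 10-second clip is cropped away
```

Under these assumptions the model sees roughly two seconds from the middle of each utterance, so longer clips may need a larger `TIME_FRAMES` or several crops per clip.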

In [None]:
# Mobilenet CNN adapted to our use case

def create_mobilenet_predictor(config: Dict) -> nn.Module:
    """Creates a MobileNet-based regression model for WER prediction."""
    version, pretrained = config["MOBILENET_VERSION"], config["USE_PRETRAINED"]

    if version == 'mobilenet_v3_small':
        backbone = models.mobilenet_v3_small(weights='IMAGENET1K_V1' if pretrained else None)
        in_features = backbone.classifier[0].in_features
    else:
        raise ValueError(f"Unsupported MobileNet version: {version}")

    backbone.classifier = nn.Sequential(
        nn.Linear(in_features, 256), nn.ReLU(), nn.Dropout(0.2),
        nn.Linear(256, 64), nn.ReLU(), nn.Linear(64, 1)
    )

    class Regressor(nn.Module):
        def __init__(self, model):
            super().__init__()
            self.model = model
        def forward(self, x): return self.model(x).squeeze(-1)

    return Regressor(backbone)

In [None]:
#Training

def train_epoch(model, dataloader, criterion, optimizer, device):
    """Trains the model for one epoch."""
    model.train()
    total_loss = 0
    for specs, wers in tqdm(dataloader, desc="Training"):
        specs, wers = specs.to(device), wers.to(device)
        optimizer.zero_grad()
        loss = criterion(model(specs), wers)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    return total_loss / len(dataloader)

def evaluate(model, dataloader, criterion, device):
    """Evaluates the model on the validation set."""
    model.eval()
    total_loss, all_preds, all_targets = 0, [], []
    with torch.no_grad():
        for specs, wers in tqdm(dataloader, desc="Evaluating"):
            specs, wers = specs.to(device), wers.to(device)
            predictions = model(specs)
            total_loss += criterion(predictions, wers).item()
            all_preds.extend(predictions.cpu().numpy())
            all_targets.extend(wers.cpu().numpy())

    all_preds, all_targets = np.array(all_preds), np.array(all_targets)
    mae = np.mean(np.abs(all_preds - all_targets))
    return total_loss / len(dataloader), mae, all_preds, all_targets

def run_training_pipeline(config: Dict):
    """Executes the full training and validation pipeline."""
    print("\n1. Loading and preparing data...")
    samples, wer_scores = load_and_cache_data(config, subset="test-clean", max_samples=100) # Using 100 for quick demo
    print(f"Loaded {len(samples)} samples. WER stats: mean={np.mean(wer_scores):.3f}, std={np.std(wer_scores):.3f}")

    n_train = int(len(samples) * config["TRAIN_SPLIT"])
    train_dataset = AudioDataset(samples[:n_train], wer_scores[:n_train], config)
    val_dataset = AudioDataset(samples[n_train:], wer_scores[n_train:], config)

    train_loader = DataLoader(train_dataset, batch_size=config["BATCH_SIZE"], shuffle=True, num_workers=2)
    val_loader = DataLoader(val_dataset, batch_size=config["BATCH_SIZE"], shuffle=False, num_workers=2)
    print(f"Data split: {len(train_dataset)} training, {len(val_dataset)} validation samples.")

    print("\n2. Initializing model...")
    model = create_mobilenet_predictor(config).to(DEVICE)
    print(f"Model parameters: {sum(p.numel() for p in model.parameters()):,}")

    criterion = nn.MSELoss()
    optimizer = optim.Adam(model.parameters(), lr=config["LEARNING_RATE"])
    scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min', factor=0.5, patience=3)

    history = {"train_loss": [], "val_loss": [], "mae": []}
    best_val_loss = float('inf')

    print("\n3. Starting training loop...")
    for epoch in range(config["EPOCHS"]):
        print(f"\n--- Epoch {epoch+1}/{config['EPOCHS']} ---")
        train_loss = train_epoch(model, train_loader, criterion, optimizer, DEVICE)
        val_loss, mae, preds, targets = evaluate(model, val_loader, criterion, DEVICE)
        scheduler.step(val_loss)

        history["train_loss"].append(train_loss)
        history["val_loss"].append(val_loss)
        history["mae"].append(mae)
        print(f"Epoch Summary: Train Loss: {train_loss:.4f} | Val Loss: {val_loss:.4f} | MAE: {mae:.4f}")

        if val_loss < best_val_loss:
            best_val_loss = val_loss
            torch.save(model.state_dict(), config["MODEL_SAVE_PATH"])
            print(f"  -> New best model saved to {config['MODEL_SAVE_PATH']}")

    return history, {"predictions": preds, "targets": targets}, val_dataset
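
With a validation set this small, the reported MAE is easier to interpret next to a trivial baseline that always predicts the mean training WER. The helper below is a sketch with a hypothetical name and is not part of the pipeline above.

```python
def mean_baseline_mae(train_targets, val_targets):
    """MAE on val_targets of a predictor that always outputs the mean of train_targets."""
    if not train_targets or not val_targets:
        raise ValueError("both target lists must be non-empty")
    mean_prediction = sum(train_targets) / len(train_targets)
    return sum(abs(t - mean_prediction) for t in val_targets) / len(val_targets)
```

Given the same split as `run_training_pipeline`, the call would be `mean_baseline_mae(wer_scores[:n_train], wer_scores[n_train:])`. A model MAE close to this value would suggest that the network has learned little beyond the average WER.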

In [None]:
config = get_config()

print("=" * 70)
print("STARTING TRAINING PIPELINE")
print("=" * 70)

# This function will run the entire training process
history, final_eval_results, val_dataset = run_training_pipeline(config)

print("\nTraining complete!")

STARTING TRAINING PIPELINE

1. Loading and preparing data...
Loading cached data from cache/test-clean_wer_cache_huggingface_100.pkl
Loaded 100 samples. WER stats: mean=0.197, std=0.144
Data split: 80 training, 20 validation samples.

2. Initializing model...
Model parameters: 1,091,233

3. Starting training loop...

--- Epoch 1/30 ---
Epoch Summary: Train Loss: 0.0405 | Val Loss: 0.0148 | MAE: 0.0931
  -> New best model saved to best_wer_predictor.pth

--- Epoch 2/30 ---
Epoch Summary: Train Loss: 0.0137 | Val Loss: 0.0134 | MAE: 0.0741
  -> New best model saved to best_wer_predictor.pth

--- Epoch 3/30 ---
Epoch Summary: Train Loss: 0.0114 | Val Loss: 0.0153 | MAE: 0.0922

--- Epoch 4/30 ---
Epoch Summary: Train Loss: 0.0103 | Val Loss: 0.0145 | MAE: 0.0864

--- Epoch 5/30 ---
Epoch Summary: Train Loss: 0.0076 | Val Loss: 0.0140 | MAE: 0.0811

--- Epoch 6/30 ---
Epoch Summary: Train Loss: 0.0064 | Val Loss: 0.0138 | MAE: 0.0787

--- Epoch 7/30 ---
Epoch Summary: Train Loss: 0.0056 | Val Loss: 0.0140 | MAE: 0.0783

--- Epoch 8/30 ---
Epoch Summary: Train Loss: 0.0061 | Val Loss: 0.0157 | MAE: 0.0840

--- Epoch 9/30 ---
Epoch Summary: Train Loss: 0.0041 | Val Loss: 0.0153 | MAE: 0.0819

--- Epoch 10/30 ---
Epoch Summary: Train Loss: 0.0020 | Val Loss: 0.0152 | MAE: 0.0808

--- Epoch 11/30 ---
Epoch Summary: Train Loss: 0.0018 | Val Loss: 0.0153 | MAE: 0.0805

--- Epoch 12/30 ---
Epoch Summary: Train Loss: 0.0014 | Val Loss: 0.0154 | MAE: 0.0798

--- Epoch 13/30 ---
Epoch Summary: Train Loss: 0.0017 | Val Loss: 0.0154 | MAE: 0.0799

--- Epoch 14/30 ---
Epoch Summary: Train Loss: 0.0013 | Val Loss: 0.0153 | MAE: 0.0794

--- Epoch 15/30 ---
Epoch Summary: Train Loss: 0.0015 | Val Loss: 0.0153 | MAE: 0.0789

--- Epoch 16/30 ---
Epoch Summary: Train Loss: 0.0014 | Val Loss: 0.0151 | MAE: 0.0780

--- Epoch 17/30 ---
Epoch Summary: Train Loss: 0.0016 | Val Loss: 0.0148 | MAE: 0.0770

--- Epoch 18/30 ---
Epoch Summary: Train Loss: 0.0018 | Val Loss: 0.0146 | MAE: 0.0762

--- Epoch 19/30 ---
Epoch Summary: Train Loss: 0.0013 | Val Loss: 0.0144 | MAE: 0.0751

--- Epoch 20/30 ---
Epoch Summary: Train Loss: 0.0011 | Val Loss: 0.0141 | MAE: 0.0738

--- Epoch 21/30 ---
Epoch Summary: Train Loss: 0.0016 | Val Loss: 0.0139 | MAE: 0.0726

--- Epoch 22/30 ---
Epoch Summary: Train Loss: 0.0011 | Val Loss: 0.0136 | MAE: 0.0713

--- Epoch 23/30 ---
Epoch Summary: Train Loss: 0.0010 | Val Loss: 0.0133 | MAE: 0.0701
  -> New best model saved to best_wer_predictor.pth

--- Epoch 24/30 ---
Epoch Summary: Train Loss: 0.0018 | Val Loss: 0.0130 | MAE: 0.0689
  -> New best model saved to best_wer_predictor.pth

--- Epoch 25/30 ---
Epoch Summary: Train Loss: 0.0015 | Val Loss: 0.0128 | MAE: 0.0677
  -> New best model saved to best_wer_predictor.pth

--- Epoch 26/30 ---
Epoch Summary: Train Loss: 0.0010 | Val Loss: 0.0126 | MAE: 0.0666
  -> New best model saved to best_wer_predictor.pth

--- Epoch 27/30 ---
Epoch Summary: Train Loss: 0.0020 | Val Loss: 0.0124 | MAE: 0.0657
  -> New best model saved to best_wer_predictor.pth

--- Epoch 28/30 ---


Training: 100%|██████████| 3/3 [00:00<00:00,  5.28it/s]
Evaluating: 100%|██████████| 1/1 [00:00<00:00,  3.54it/s]


Epoch Summary: Train Loss: 0.0017 | Val Loss: 0.0122 | MAE: 0.0647
  -> New best model saved to best_wer_predictor.pth

--- Epoch 29/30 ---


Training: 100%|██████████| 3/3 [00:00<00:00,  5.37it/s]
Evaluating: 100%|██████████| 1/1 [00:00<00:00,  3.55it/s]


Epoch Summary: Train Loss: 0.0010 | Val Loss: 0.0120 | MAE: 0.0634
  -> New best model saved to best_wer_predictor.pth

--- Epoch 30/30 ---


Training: 100%|██████████| 3/3 [00:00<00:00,  4.75it/s]
Evaluating: 100%|██████████| 1/1 [00:00<00:00,  2.39it/s]

Epoch Summary: Train Loss: 0.0007 | Val Loss: 0.0119 | MAE: 0.0625
  -> New best model saved to best_wer_predictor.pth

Training complete!





# SECTION 4: DYNAMIC COMPRESSION ALGORITHM DEVELOPMENT
## Primary: Rithwik Nukala, Ishan Singh | Supporting: Abdulmatin Omotoso

In [None]:
TODO: Implement core dynamic compression algorithm
- Design content-aware compression ratios
- Implement adaptive compression based on content type
- Create real-time compression pipeline
- Optimize for 30-50% token reduction target

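One way the content-aware ratios in this TODO might be prototyped is a lookup from content type to a "keep ratio". This is a minimal sketch, not the team's algorithm: the `Segment` class, the labels, and every number in `KEEP_RATIO` are hypothetical placeholders that would need to be tuned against WER.

```python
from dataclasses import dataclass

# Hypothetical keep ratios per content type (fraction of tokens retained).
# Illustrative placeholders only, not tuned or measured values.
KEEP_RATIO = {"silence": 0.1, "noise": 0.25, "speech": 1.0}


@dataclass
class Segment:
    label: str      # content type assigned by an upstream classifier
    n_tokens: int   # audio tokens in the uncompressed segment


def compressed_token_count(seg: Segment) -> int:
    """Tokens kept for one segment; unknown labels are left uncompressed."""
    if seg.n_tokens == 0:
        return 0
    ratio = KEEP_RATIO.get(seg.label, 1.0)
    # Keep at least one token so segment boundaries and timing survive.
    return max(1, round(seg.n_tokens * ratio))


def overall_reduction(segments: list[Segment]) -> float:
    """Fraction of tokens removed across all segments (0.0 to 1.0)."""
    original = sum(s.n_tokens for s in segments)
    kept = sum(compressed_token_count(s) for s in segments)
    return 1.0 - kept / original if original else 0.0


# Toy input: token counts are made up for illustration.
segments = [Segment("speech", 100), Segment("silence", 60), Segment("noise", 40)]
print(f"token reduction: {overall_reduction(segments):.1%}")
```

Leaving speech untouched and compressing only silence and noise makes the achievable reduction depend directly on how much of the audio is low-complexity, which is the quantity the hypothesis is about.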
In [None]:
TODO: Implement compression quality control mechanisms
- Define quality metrics for compressed audio
- Implement feedback loop for compression adjustment
- Create quality thresholds for different use cases
- Validate semantic preservation

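The feedback loop in this TODO could be sketched as a search for the most aggressive keep ratio that still passes a quality threshold. The sketch below uses bisection and assumes predicted WER never rises when more audio is kept. `tune_keep_ratio`, `toy_wer`, and the threshold are hypothetical names and values; a learned model such as the one saved to `best_wer_predictor.pth` might eventually play the role of `predicted_wer`, but its interface is not shown here.

```python
from typing import Callable


def tune_keep_ratio(
    predicted_wer: Callable[[float], float],
    max_wer: float,
    lo: float = 0.05,
    hi: float = 1.0,
    steps: int = 20,
) -> float:
    """Bisect for the smallest keep ratio whose predicted WER is <= max_wer.

    Assumes predicted_wer is non-increasing in the keep ratio. Returns `hi`
    if even that bound fails the threshold.
    """
    if predicted_wer(hi) > max_wer:
        return hi
    if predicted_wer(lo) <= max_wer:
        return lo
    for _ in range(steps):
        mid = (lo + hi) / 2
        if predicted_wer(mid) <= max_wer:
            hi = mid   # quality acceptable: try compressing harder
        else:
            lo = mid   # quality too low: back off
    return hi


# Toy stand-in for a quality model: WER falls linearly as more audio is kept.
def toy_wer(keep_ratio: float) -> float:
    return 0.30 * (1.0 - keep_ratio)


ratio = tune_keep_ratio(toy_wer, max_wer=0.10)
print(f"keep ratio: {ratio:.3f}")
```

Different use cases would then map to different `max_wer` values passed to the same loop.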
# SECTION 5: LOW-COMPLEXITY CONTENT DETECTION
## Primary: Kevin Li, Ogan Aktolun | Supporting:

In [None]:
TODO: Implement robust silence and noise detection
- Design multi-feature silence detection
- Classify different types of background noise
- Optimize detection for various audio conditions
- Create confidence scoring for detection results

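A first pass at multi-feature detection might combine two cheap per-frame features, RMS energy and zero-crossing rate, and attach a heuristic confidence. This is only a sketch under stated assumptions: the thresholds `silence_rms` and `noise_zcr`, the 16 kHz sample rate, and the confidence formulas are all hypothetical and uncalibrated.

```python
import math


def frame_features(frame: list[float]) -> tuple[float, float]:
    """Return (RMS energy, zero-crossing rate) for a frame of >= 2 samples."""
    rms = math.sqrt(sum(x * x for x in frame) / len(frame))
    crossings = sum((a < 0) != (b < 0) for a, b in zip(frame, frame[1:]))
    return rms, crossings / (len(frame) - 1)


def classify_frame(frame, silence_rms=0.01, noise_zcr=0.35):
    """Label a frame and attach a heuristic confidence in [0, 1]."""
    rms, zcr = frame_features(frame)
    if rms < silence_rms:
        # Quieter frames get higher silence confidence.
        return "silence", 1.0 - rms / silence_rms
    if zcr > noise_zcr:
        # Frequent sign changes suggest broadband noise rather than voicing.
        return "noise", (zcr - noise_zcr) / (1.0 - noise_zcr)
    return "speech_like", 1.0 - zcr / noise_zcr


# 10 ms frames at an assumed 16 kHz sample rate.
silent = [0.0] * 160
tone = [0.5 * math.sin(2 * math.pi * 200 * n / 16000) for n in range(160)]
for name, frame in [("silent", silent), ("tone", tone)]:
    print(name, classify_frame(frame))
```

Fixed thresholds like these are unlikely to hold across recording conditions, so an adaptive noise floor or a learned classifier would be the natural next step.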
In [None]:
TODO: Compute how aggressively a segment can be compressed
- Combine silence and noise detection results
- Consider semantic importance
- Return compression recommendation

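The combination step in this TODO could look like the following sketch, which turns detection confidences and a semantic importance score into a single compressibility value and a coarse recommendation. The multiplicative rule, the function names, and the 0.3 and 0.7 cut-offs are assumptions chosen for illustration.

```python
def compressibility(
    silence_conf: float, noise_conf: float, semantic_importance: float
) -> float:
    """Score in [0, 1]; higher means the segment can be compressed harder."""
    # A segment is low-complexity if either detector is confident.
    low_complexity = max(silence_conf, noise_conf)
    # Semantic importance vetoes compression: importance 1.0 forces score 0.
    score = low_complexity * (1.0 - semantic_importance)
    return min(1.0, max(0.0, score))


def recommend(score: float) -> str:
    """Map a compressibility score to a coarse compression level."""
    if score >= 0.7:
        return "aggressive"
    if score >= 0.3:
        return "moderate"
    return "none"


print(recommend(compressibility(0.9, 0.1, 0.0)))  # confident silence
print(recommend(compressibility(0.9, 0.1, 1.0)))  # flagged as important
```

The product form means a high importance score always overrides the detectors, which errs on the side of preserving content.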
# SECTION 6: SEMANTIC PRESERVATION FRAMEWORK
## Primary: Abdulmatin Omotoso, Ishan Singh | Supporting: All

In [None]:
TODO: Implement semantic importance scoring for audio segments
- Design semantic importance metrics
- Integrate with speech recognition confidence
- Consider linguistic importance of segments
- Preserve critical information during compression

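As a starting point, semantic importance might be approximated from a first-pass transcript: the fraction of content words, plus a "fragility" term that protects segments where the recognizer was unsure. The sketch below is a heuristic under stated assumptions. The stopword list, the weights `w_content` and `w_fragile`, and the idea of protecting low-confidence segments are all hypothetical choices, not a validated metric.

```python
# Tiny illustrative stopword/filler list; a real one would be much larger.
STOPWORDS = frozenset(
    {"a", "an", "the", "of", "to", "and", "in", "is", "it", "uh", "um"}
)


def semantic_importance(
    words: list[str],
    confidences: list[float],
    w_content: float = 0.7,
    w_fragile: float = 0.3,
) -> float:
    """Heuristic importance in [0, 1] for one transcribed segment."""
    if len(words) != len(confidences):
        raise ValueError("need one confidence per word")
    if not words:
        return 0.0
    content_fraction = sum(w.lower() not in STOPWORDS for w in words) / len(words)
    # Low recognizer confidence means compression is more likely to break
    # the segment, so it is treated as more important to preserve.
    fragility = 1.0 - sum(confidences) / len(confidences)
    score = w_content * content_fraction + w_fragile * fragility
    return min(1.0, max(0.0, score))


filler_score = semantic_importance(["um", "the", "uh"], [0.9, 0.9, 0.9])
content_score = semantic_importance(
    ["transfer", "five", "hundred", "dollars"], [0.9, 0.9, 0.9, 0.9]
)
print(f"filler: {filler_score:.2f}, content: {content_score:.2f}")
```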
# SECTION 7: REAL-TIME PROCESSING PIPELINE
## Primary: Amitesh Vatsa, Rithwik Nukala | Supporting: All

In [None]:
TODO: Implement real-time compression pipeline
- Design streaming audio processing
- Implement low-latency compression
- Optimize for real-time performance
- Handle variable audio input rates

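Streaming with variable input rates can be sketched with generators: one re-frames arbitrary-sized chunks into fixed-length frames, and a second drops frames once a silent run gets long. This is a simplified illustration, not the planned pipeline. The function names, the 160-sample frame length, and the `energy_floor` and `max_silent_run` values are assumptions.

```python
from typing import Iterable, Iterator


def stream_frames(
    chunks: Iterable[list[float]], frame_len: int
) -> Iterator[list[float]]:
    """Re-frame variable-sized input chunks into fixed-length frames."""
    buffer: list[float] = []
    for chunk in chunks:
        buffer.extend(chunk)
        while len(buffer) >= frame_len:
            yield buffer[:frame_len]
            del buffer[:frame_len]
    if buffer:
        yield buffer  # final partial frame


def compress_stream(
    chunks: Iterable[list[float]],
    frame_len: int = 160,
    energy_floor: float = 1e-4,
    max_silent_run: int = 2,
) -> Iterator[list[float]]:
    """Yield frames, dropping silent frames beyond max_silent_run in a row."""
    silent_run = 0
    for frame in stream_frames(chunks, frame_len):
        energy = sum(x * x for x in frame) / len(frame)
        if energy < energy_floor:
            silent_run += 1
            if silent_run > max_silent_run:
                continue  # collapse the rest of a long silence
        else:
            silent_run = 0
        yield frame


# Chunks of uneven size: 2 loud frames, 10 silent frames, 1 loud frame.
chunks = [[0.5] * 100, [0.5] * 220, [0.0] * 1600, [0.5] * 160]
kept = list(compress_stream(chunks))
print(f"kept {len(kept)} of 13 frames")
```

Because each frame is emitted as soon as it is complete, the added latency is bounded by one frame length plus the per-frame energy computation.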
# SECTION 8: ASR INTEGRATION & EVALUATION
## Primary: Ogan Aktolun, Kevin Li | Supporting: All

In [None]:
TODO: Implement comprehensive ASR evaluation framework
- Integrate with multiple ASR systems (Whisper, Wav2Vec2)
- Calculate Word Error Rate for compressed vs original audio
- Analyze error patterns by content type
- Validate that ASR performance is maintained

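Word Error Rate is the word-level edit distance between reference and hypothesis, divided by the number of reference words. The sketch below is a plain-Python version for comparing transcripts of original and compressed audio. The function name and the example sentences are made up, and it does no text normalization (casing, punctuation), which a real evaluation would need.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(
                min(
                    prev[j] + 1,             # deletion
                    curr[j - 1] + 1,         # insertion
                    prev[j - 1] + (r != h),  # substitution or match
                )
            )
        prev = curr
    return prev[-1] / len(ref)


reference = "the cat sat on the mat"
hyp_original = "the cat sat on the mat"   # hypothetical ASR output, original audio
hyp_compressed = "the cat sat on mat"     # hypothetical ASR output, compressed audio
delta = word_error_rate(reference, hyp_compressed) - word_error_rate(
    reference, hyp_original
)
print(f"WER increase from compression: {delta:.3f}")
```

Reporting the WER difference between compressed and original audio, rather than the compressed WER alone, separates the cost of compression from the baseline error of the ASR system.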
# SECTION 9: TOKEN REDUCTION ANALYSIS
## Primary: Kevin Li, Amitesh Vatsa | Supporting: All

In [None]:
TODO: Implement comprehensive token reduction analysis
- Calculate token count per second for original and compressed audio
- Analyze token reduction by content type
- Validate 30-50% reduction target achievement
- Create detailed reduction reports

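The two quantities in this TODO, tokens per second and reduction by content type, could be computed as in the sketch below. The record layout `(content_type, original_tokens, compressed_tokens)` and the function names are assumptions, and the numbers are toy values for illustration, not results.

```python
from collections import defaultdict


def tokens_per_second(n_tokens: int, duration_s: float) -> float:
    """Token rate for one clip."""
    return n_tokens / duration_s


def reduction_by_type(records):
    """Fractional token reduction per content type.

    records: iterable of (content_type, original_tokens, compressed_tokens).
    """
    totals = defaultdict(lambda: [0, 0])
    for content_type, original, compressed in records:
        totals[content_type][0] += original
        totals[content_type][1] += compressed
    return {
        ctype: 1.0 - comp / orig for ctype, (orig, comp) in totals.items() if orig
    }


def overall_token_reduction(records) -> float:
    """Fractional token reduction across all records."""
    original = sum(r[1] for r in records)
    compressed = sum(r[2] for r in records)
    return 1.0 - compressed / original if original else 0.0


# Toy records: made-up counts, not measurements.
records = [("speech", 500, 450), ("silence", 300, 30), ("noise", 200, 80)]
overall = overall_token_reduction(records)
print(reduction_by_type(records))
print(f"overall: {overall:.1%}, in 30-50% band: {0.30 <= overall <= 0.50}")
```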
# SECTION 10: PERFORMANCE BENCHMARKING
## Primary: All Team Members | Lead: Ogan Aktolun

In [None]:
TODO: Run comprehensive benchmarking on LibriSpeech and Common Voice
- Execute full evaluation pipeline on both datasets
- Measure compression performance across different audio types
- Validate token reduction and ASR performance targets
- Generate comparative analysis with baseline methods

# SECTION 11: RESULTS ANALYSIS & VALIDATION
## Primary: Kevin Li, Amitesh Vatsa | Supporting: All

In [None]:
TODO: Validate Methodology 1 performance targets and analyze results
- Validate 30-50% token reduction achievement
- Confirm ASR Word Error Rate within acceptable bounds
- Analyze effectiveness across different content types
- Document insights and lessons learned