# 01: Exploratory Data Analysis (EDA)

This notebook visualizes the audio data to:
1.  Confirm audio is loaded correctly (16 kHz sample rate, float32, normalized to [-1, 1]).
2.  Analyze duration distribution (determines optimal UAP vector length).
3.  Visualize waveforms and Mel-spectrograms.
4.  Check amplitude statistics.

In [None]:
import os
import sys
import numpy as np
import matplotlib.pyplot as plt
import librosa
import librosa.display  # needed for specshow/waveshow on older librosa versions
import soundfile as sf
import torch
from IPython.display import Audio, display

# Add src to path to import modules
sys.path.append(os.path.join(os.getcwd(), '..'))
from src.data.audio_loader import load_audio, get_audio_duration

## 1. Load and Verify Audio Pipeline

We locate the sample audio files and load them through the `audio_loader.py` functions to confirm they behave as expected (resampling to 16 kHz, float32 output normalized to [-1, 1]).

In [None]:
import glob
from src.data.download_data import download_librispeech_sample

# 1. Download/Locate Data
# We use a 'data' directory in the project root
data_root = os.path.join(os.getcwd(), '..', 'data') 
dataset_path = download_librispeech_sample(data_root)
print(f"Dataset location: {dataset_path}")

# 2. Find all .flac files
sample_paths = glob.glob(os.path.join(dataset_path, "**", "*.flac"), recursive=True)
print(f"Found {len(sample_paths)} audio files.")

if len(sample_paths) > 0:
    print(f"Sample: {sample_paths[0]}")
else:
    print("WARNING: No audio files found. Check download.")
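The cell above only locates the files, so the format claims (16 kHz, float32, values in [-1, 1]) still need an explicit check. A minimal sketch of such a check follows. `check_audio_format` is a hypothetical helper, not part of `audio_loader.py`, and it assumes `load_audio` returns a 1-D NumPy array plus a sample rate, as its usage later in this notebook suggests. A synthetic tone stands in for a loaded file so the example runs on its own.

```python
import numpy as np

def check_audio_format(audio, sr, expected_sr=16000):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if sr != expected_sr:
        problems.append(f"sample rate is {sr}, expected {expected_sr}")
    if audio.ndim != 1:
        problems.append(f"expected a mono 1-D array, got shape {audio.shape}")
    if audio.dtype != np.float32:
        problems.append(f"dtype is {audio.dtype}, expected float32")
    if audio.size and np.abs(audio).max() > 1.0:
        problems.append("samples fall outside [-1, 1]")
    return problems

# Synthetic stand-in for a loaded file: 1 s of a 440 Hz tone at 16 kHz.
t = np.arange(16000, dtype=np.float32) / 16000
tone = (0.5 * np.sin(2 * np.pi * 440 * t)).astype(np.float32)
print(check_audio_format(tone, 16000))
```

In the notebook this could be called as `check_audio_format(*load_audio(path))` for the first few entries of `sample_paths`.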

## 2. Duration Distribution Analysis

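Whisper processes fixed-length inputs (30 s chunks). The duration distribution tells us how often utterances exceed or fall short of that window, and therefore how to handle variable-length audio.
- If audio > 30 s: we either tile the perturbation across the utterance or crop the audio to one chunk.
- If audio < 30 s: the input is padded up to 30 s, so we must decide whether the perturbation covers only the speech or the padded region as well.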
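One possible reading of the tiling/cropping choice above is sketched below: a fixed-length perturbation is repeated or truncated so that it matches the utterance length. `fit_perturbation` is a hypothetical name used for illustration only; the attack scripts may handle this differently.

```python
import numpy as np

def fit_perturbation(delta, n_samples):
    """Tile or crop a 1-D perturbation so it has exactly n_samples entries."""
    if len(delta) == 0:
        raise ValueError("perturbation must be non-empty")
    repeats = -(-n_samples // len(delta))  # ceiling division
    return np.tile(delta, repeats)[:n_samples]

# Toy perturbation of length 3 (illustrative values only).
delta = np.array([0.1, -0.1, 0.2], dtype=np.float32)
print(fit_perturbation(delta, 7))  # longer target: perturbation is tiled
print(fit_perturbation(delta, 2))  # shorter target: perturbation is cropped
```

Tiling keeps the perturbation universal at the cost of a periodic pattern in long utterances; cropping the audio instead discards speech, so the duration histogram below helps judge which cost is smaller.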

In [None]:
# Duration (in seconds) of every file found above
durations = [get_audio_duration(path) for path in sample_paths]

plt.figure(figsize=(10, 5))
plt.hist(durations, bins=30, edgecolor='black')
plt.axvline(x=30, color='r', linestyle='--', label='30s Threshold')
plt.title('Distribution of Audio Utterance Durations')
plt.xlabel('Duration (seconds)')
plt.ylabel('Frequency')
plt.legend()
plt.show()

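# np.max raises on an empty sequence, so guard against a failed download
if len(durations) > 0:
    print(f"Mean Duration: {np.mean(durations):.2f}s")
    print(f"Max Duration: {np.max(durations):.2f}s")
else:
    print("WARNING: No durations computed. Check that audio files were found.")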
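The mean and maximum alone say little about where to set the perturbation length. A sketch that adds percentiles and the share of utterances above the 30 s window is shown here. `summarize_durations` is a hypothetical helper, and the example durations are made up for illustration; they are not measurements from the dataset.

```python
import numpy as np

def summarize_durations(durations, sr=16000, limit_s=30.0):
    """Summarize utterance durations (seconds) for choosing a UAP length."""
    d = np.asarray(durations, dtype=np.float64)
    p50, p95 = np.percentile(d, [50, 95])
    return {
        "median_s": float(p50),
        "p95_s": float(p95),
        # Fraction of utterances longer than the fixed input window
        "frac_over_limit": float(np.mean(d > limit_s)),
        # 95th-percentile duration expressed in samples at rate sr
        "p95_samples": int(round(p95 * sr)),
    }

example = [2.0, 4.0, 6.0, 8.0, 35.0]  # made-up durations in seconds
print(summarize_durations(example))
```

Passing the notebook's `durations` list instead of `example` gives the real figures; a 95th-percentile length is one candidate for `UAP_LENGTH`, not a recommendation.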

## 3. Visualize Waveforms and Mel-Spectrograms

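Whisper's encoder consumes a log-Mel spectrogram rather than the raw waveform, so the spectrogram shows the audio roughly as the model sees it. The plot below uses 80 Mel bins with librosa's default window and hop sizes, which approximates Whisper's front end rather than reproducing it exactly.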

In [None]:
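def plot_mel_spectrogram(audio, sr, title="Mel Spectrogram"):
    """Plot an 80-bin Mel spectrogram in dB relative to the loudest bin."""
    y = audio.astype(np.float32)
    # fmax=8000 is the Nyquist frequency for 16 kHz audio
    S = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=80, fmax=8000)
    S_dB = librosa.power_to_db(S, ref=np.max)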
    
    plt.figure(figsize=(10, 4))
    librosa.display.specshow(S_dB, x_axis='time', y_axis='mel', sr=sr, fmax=8000)
    plt.colorbar(format='%+2.0f dB')
    plt.title(title)
    plt.tight_layout()
    plt.show()

def plot_waveform(audio, sr, title="Waveform"):
    plt.figure(figsize=(10, 4))
    librosa.display.waveshow(audio, sr=sr)
    plt.title(title)
    plt.xlabel('Time (s)')
    plt.ylabel('Amplitude')
    plt.tight_layout()
    plt.show()

# Visualize first sample
if len(sample_paths) > 0:
    audio, sr = load_audio(sample_paths[0])
    plot_waveform(audio, sr)
    plot_mel_spectrogram(audio, sr)
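Because the model works on spectrogram frames, it helps to know how waveform samples map to frames. The arithmetic below is a sketch under stated assumptions: a 16 kHz sample rate, a hop of 160 samples (10 ms), and one frame kept per hop. These values are believed to match the open-source Whisper reference implementation, but they are assumptions here and should be checked against the version in use.

```python
SAMPLE_RATE = 16000    # Hz, as used throughout this notebook
HOP_LENGTH = 160       # samples between frames (10 ms); assumed value
CHUNK_SECONDS = 30     # fixed input window

def n_frames(n_samples, hop_length=HOP_LENGTH):
    """Number of spectrogram frames, assuming one frame per full hop."""
    return n_samples // hop_length

chunk_samples = SAMPLE_RATE * CHUNK_SECONDS
print(chunk_samples, n_frames(chunk_samples))
```

Note that the plotting cell above relies on librosa's default hop size, so its time resolution differs from the figures computed here.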

## 4. Amplitude Statistics

Confirm that the normalization step results in values in [-1, 1].

In [None]:
# Gather every sample into one array; np.concatenate avoids a slow,
# memory-hungry round trip through Python lists
chunks = [load_audio(path)[0] for path in sample_paths]

if len(chunks) > 0:
    amplitudes = np.concatenate(chunks)

    print(f"Min: {amplitudes.min():.4f}")
    print(f"Max: {amplitudes.max():.4f}")
    print(f"Mean: {amplitudes.mean():.4f}")
    print(f"Std: {amplitudes.std():.4f}")

    # Count samples outside [-1, 1]; correct normalization should yield zero
    out_of_range = np.sum((amplitudes < -1.0) | (amplitudes > 1.0))
    print(f"\nOut-of-range Samples: {out_of_range} ({100*out_of_range/len(amplitudes):.2f}%)")
else:
    print("WARNING: No audio loaded. Check that audio files were found.")
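Holding every sample of every file in memory at once may not scale to a larger subset. A streaming alternative is sketched below: it keeps only running totals and is updated one file at a time. `RunningStats` is a hypothetical helper, not part of the project code, and the toy arrays stand in for loaded audio.

```python
import numpy as np

class RunningStats:
    """Accumulate min, max, mean, std and out-of-range counts per chunk."""

    def __init__(self):
        self.n = 0
        self.total = 0.0
        self.total_sq = 0.0
        self.min = float("inf")
        self.max = float("-inf")
        self.out_of_range = 0

    def update(self, audio):
        a = np.asarray(audio, dtype=np.float64)
        if a.size == 0:
            return
        self.n += a.size
        self.total += float(a.sum())
        self.total_sq += float(np.square(a).sum())
        self.min = min(self.min, float(a.min()))
        self.max = max(self.max, float(a.max()))
        self.out_of_range += int(np.sum(np.abs(a) > 1.0))

    @property
    def mean(self):
        return self.total / self.n

    @property
    def std(self):
        # Population std from running sums; clamp tiny negative rounding error
        var = self.total_sq / self.n - self.mean ** 2
        return max(var, 0.0) ** 0.5

stats = RunningStats()
stats.update([0.5, -0.5])   # toy stand-in for one file
stats.update([1.5])         # toy stand-in with one out-of-range sample
print(stats.min, stats.max, stats.mean, stats.std, stats.out_of_range)
```

In the notebook, the loop body would become `stats.update(load_audio(path)[0])`, assuming `load_audio` returns the samples as its first element, as it does in the cells above.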

## Summary for Week 1/2

Based on the EDA:
- [ ] Determine if we need to handle variable length inputs (tiling vs cropping).
- [ ] Set the `UAP_LENGTH` constant for the attack scripts.
- [ ] Confirm that normalization keeps data in valid range [-1, 1].