# Audio Analysis

Demonstrates audio processing and feature extraction with librosa and torchaudio.

**Libraries:**
- [librosa](https://librosa.org/) — Audio analysis: MFCCs, mel spectrogram, chroma, beat tracking, HPSS
- [torchaudio](https://pytorch.org/audio/) — PyTorch-native audio transforms and resampling
- [soundfile](https://python-soundfile.readthedocs.io/) — WAV/FLAC/OGG I/O

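Uses librosa's Tchaikovsky Nutcracker example recording, which `librosa.ex()` fetches on first use and caches locally, so no manual file setup is needed (the first run requires network access).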

In [None]:
import os
import numpy as np
import matplotlib.pyplot as plt

import librosa
import librosa.display
import soundfile as sf
import torch
import torchaudio
import torchaudio.transforms as T

%matplotlib inline

## Load Audio

`librosa.ex()` returns the local path to a named example recording. The file is downloaded the first time it is requested and served from a local cache afterwards.

In [None]:
y, sr = librosa.load(librosa.ex('nutcracker'), duration=30.0)

print(f"Audio shape : {y.shape}")
print(f"Sample rate : {sr} Hz")
print(f"Duration    : {len(y)/sr:.2f} s")
print(f"Value range : [{y.min():.4f}, {y.max():.4f}]")

## Waveform and Energy Analysis

In [None]:
zcr = librosa.feature.zero_crossing_rate(y)
rms = librosa.feature.rms(y=y)

times = np.arange(len(y)) / sr  # time stamp (s) of each sample
zcr_times = librosa.times_like(zcr, sr=sr)
rms_times = librosa.times_like(rms, sr=sr)

fig, axes = plt.subplots(3, 1, figsize=(12, 8))
axes[0].plot(times, y, linewidth=0.4, color='steelblue')
axes[0].set_title("Waveform")
axes[0].set_ylabel("Amplitude")

axes[1].semilogy(zcr_times, zcr[0], color='darkorange', linewidth=0.8)
axes[1].set_title(f"Zero-Crossing Rate (mean={zcr.mean():.4f})")
axes[1].set_ylabel("ZCR")

axes[2].plot(rms_times, rms[0], color='seagreen', linewidth=0.8)
axes[2].set_title(f"RMS Energy (mean={rms.mean():.4f})")
axes[2].set_xlabel("Time (s)")
axes[2].set_ylabel("Energy")

plt.tight_layout()
plt.show()
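
For intuition, both curves above can be approximated by hand. The following is a minimal NumPy sketch, not librosa's implementation: it skips the centre padding that librosa applies by default, and the helper names (`frame_signal`, `zcr_and_rms`) and the 100 Hz test tone are illustrative assumptions.

```python
import numpy as np

def frame_signal(y, frame_length=2048, hop_length=512):
    """Slice a 1-D signal into overlapping frames (no padding)."""
    n_frames = 1 + (len(y) - frame_length) // hop_length
    idx = np.arange(frame_length)[None, :] + hop_length * np.arange(n_frames)[:, None]
    return y[idx]  # shape: (n_frames, frame_length)

def zcr_and_rms(y, frame_length=2048, hop_length=512):
    frames = frame_signal(y, frame_length, hop_length)
    # A zero crossing is a sign change between consecutive samples.
    crossings = np.abs(np.diff(np.signbit(frames).astype(int), axis=1))
    zcr = crossings.sum(axis=1) / frame_length
    # RMS is the square root of the mean squared amplitude per frame.
    rms = np.sqrt(np.mean(frames ** 2, axis=1))
    return zcr, rms

# Hypothetical test signal: one second of a 100 Hz sine at 8 kHz.
sr_demo = 8000
t_demo = np.arange(sr_demo) / sr_demo
tone = np.sin(2 * np.pi * 100 * t_demo)
zcr_demo, rms_demo = zcr_and_rms(tone)
print(zcr_demo.mean(), rms_demo.mean())
```

For a pure tone of frequency f, the zero-crossing rate should land close to 2f/sr (about 0.025 here) and the RMS close to 1/√2, which makes the sketch easy to sanity-check.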

## Harmonic-Percussive Source Separation (HPSS)

HPSS decomposes audio into a tonal (harmonic) layer and a transient (percussive) layer.
Feeding each layer to the features it suits (harmonic for pitch-based features, percussive for rhythm) can improve downstream feature quality.

In [None]:
y_harmonic, y_percussive = librosa.effects.hpss(y)

print(f"Harmonic energy  : {np.sum(y_harmonic**2):.2f}")
print(f"Percussive energy: {np.sum(y_percussive**2):.2f}")

fig, axes = plt.subplots(3, 1, figsize=(12, 7))
for ax, sig, title, color in zip(
    axes,
    [y, y_harmonic, y_percussive],
    ['Original', 'Harmonic Component', 'Percussive Component'],
    ['steelblue', 'royalblue', 'tomato'],
):
    ax.plot(np.linspace(0, len(sig)/sr, len(sig)), sig, linewidth=0.3, color=color)
    ax.set_title(title)
    ax.set_ylabel("Amplitude")
axes[-1].set_xlabel("Time (s)")
plt.tight_layout()
plt.show()
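
librosa's HPSS is based on median filtering of the spectrogram: sustained tones form horizontal ridges and transients form vertical ones. The sketch below applies that idea to a toy magnitude spectrogram in plain NumPy. It is a simplification under stated assumptions: a hard mask rather than the soft masking librosa uses by default, a short 7-bin kernel, and a hypothetical helper `median_filter_1d`.

```python
import numpy as np

def median_filter_1d(S, size, axis):
    """Sliding median along one axis with edge padding (size must be odd)."""
    pad = size // 2
    pad_width = [(0, 0)] * S.ndim
    pad_width[axis] = (pad, pad)
    padded = np.pad(S, pad_width, mode="edge")
    windows = np.lib.stride_tricks.sliding_window_view(padded, size, axis=axis)
    return np.median(windows, axis=-1)

# Toy magnitude spectrogram: rows are frequency bins, columns are frames.
S_toy = np.zeros((20, 30))
S_toy[5, :] = 1.0    # sustained tone: a horizontal ridge
S_toy[:, 12] = 1.0   # click: a vertical ridge

H = median_filter_1d(S_toy, size=7, axis=1)  # median along time keeps the tone
P = median_filter_1d(S_toy, size=7, axis=0)  # median along frequency keeps the click

harmonic_mask = H >= P                 # hard mask; ties go to the harmonic layer
S_harmonic = S_toy * harmonic_mask
S_percussive = S_toy * ~harmonic_mask
print(S_harmonic.sum(), S_percussive.sum())
```

The two layers sum back to the input, and the click column ends up almost entirely in the percussive layer while the tone row stays in the harmonic one.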

## Beat Tracking and Tempo Estimation

The beat tracker is run on the percussive component from HPSS rather than the full mix, which often gives more stable beat detection.

In [None]:
tempo, beat_frames = librosa.beat.beat_track(y=y_percussive, sr=sr)
beat_times = librosa.frames_to_time(beat_frames, sr=sr)
tempo_val = float(tempo) if np.ndim(tempo) == 0 else float(tempo[0])

print(f"Estimated tempo : {tempo_val:.1f} BPM")
print(f"Number of beats : {len(beat_times)}")
print(f"Average beat interval: {np.mean(np.diff(beat_times)):.3f} s")

fig, ax = plt.subplots(figsize=(12, 4))
ax.plot(times, y, linewidth=0.4, color='steelblue', alpha=0.7, label='Waveform')
for bt in beat_times:
    ax.axvline(x=bt, color='red', alpha=0.5, linewidth=0.7)
ax.axvline(x=beat_times[0], color='red', alpha=0.9, linewidth=0.7, label='Beat')
ax.set_xlabel("Time (s)")
ax.set_ylabel("Amplitude")
ax.set_title(f"Beat Tracking — {tempo_val:.1f} BPM")
ax.legend(loc='upper right')
plt.tight_layout()
plt.show()
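
The conversion from beat frames to seconds, and a rough tempo cross-check from the beat spacing, can be sketched as follows. This is an illustration, not how `beat_track` derives its tempo (that estimate comes from the onset strength envelope); `frames_to_seconds` and the synthetic 0.5 s beat grid are assumptions made for the example.

```python
import numpy as np

def frames_to_seconds(frames, sr, hop_length=512):
    """Frame index -> time in seconds: each frame advances hop_length samples."""
    return np.asarray(frames) * hop_length / sr

sr_demo, hop_demo = 22050, 512
# Hypothetical beats every 0.5 s, quantised to the nearest frame index.
true_times = np.arange(0, 4.5, 0.5)
demo_frames = np.round(true_times * sr_demo / hop_demo).astype(int)

demo_times = frames_to_seconds(demo_frames, sr_demo, hop_demo)
bpm_from_intervals = 60.0 / np.mean(np.diff(demo_times))
print(f"{bpm_from_intervals:.1f} BPM")
```

Because beat positions are quantised to whole frames (about 23 ms each at these settings), the interval-based value only approximates the nominal 120 BPM.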

## Spectral Features: Mel Spectrogram, MFCC, Chroma

| Feature | Description | Use Case |
|---------|-------------|----------|
| **Mel Spectrogram** | Frequency energy on perceptual scale | General audio representation |
| **MFCC** | Compact spectral shape descriptors | Speech recognition, genre classification |
| **Chroma** | Pitch class energy (C, C#, D, ...) | Harmony, chord recognition |
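
To make the chroma row concrete: a chroma feature folds every octave onto the same 12 pitch classes. A minimal sketch of that folding for a single frequency, assuming equal temperament with A4 = 440 Hz (the `pitch_class` helper is hypothetical, not a librosa function):

```python
import math

PITCH_CLASSES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def pitch_class(freq_hz, a4_hz=440.0):
    """Map a frequency to its nearest equal-tempered pitch class name."""
    midi = round(69 + 12 * math.log2(freq_hz / a4_hz))  # MIDI note number, A4 = 69
    return PITCH_CLASSES[midi % 12]

# Octaves of the same note share a pitch class.
examples = {f: pitch_class(f) for f in (220.0, 440.0, 880.0, 261.63)}
print(examples)
```

Here 220, 440, and 880 Hz all map to the same class, while 261.63 Hz (middle C) maps to a different one.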

In [None]:
hop_length = 512
n_mfcc = 13

mel_spec = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128, hop_length=hop_length)
mel_db = librosa.power_to_db(mel_spec, ref=np.max)
mfcc = librosa.feature.mfcc(y=y_harmonic, sr=sr, n_mfcc=n_mfcc, hop_length=hop_length)
mfcc_delta = librosa.feature.delta(mfcc)
chroma = librosa.feature.chroma_cqt(y=y_harmonic, sr=sr, hop_length=hop_length)

centroid = librosa.feature.spectral_centroid(y=y, sr=sr, hop_length=hop_length)
bandwidth = librosa.feature.spectral_bandwidth(y=y, sr=sr, hop_length=hop_length)
rolloff = librosa.feature.spectral_rolloff(y=y, sr=sr, hop_length=hop_length, roll_percent=0.85)

print(f"Mel spectrogram : {mel_spec.shape}")
print(f"MFCC            : {mfcc.shape}")
print(f"Chroma          : {chroma.shape}")
print(f"\nSpectral centroid mean : {centroid.mean():.1f} Hz")
print(f"Spectral bandwidth mean: {bandwidth.mean():.1f} Hz")

In [None]:
fig, axes = plt.subplots(4, 1, figsize=(13, 14))

img0 = librosa.display.specshow(mel_db, sr=sr, hop_length=hop_length, x_axis='time', y_axis='mel', ax=axes[0])
axes[0].set_title("Mel Spectrogram (dB)")
fig.colorbar(img0, ax=axes[0], format="%+2.0f dB")

img1 = librosa.display.specshow(mfcc, sr=sr, hop_length=hop_length, x_axis='time', ax=axes[1])
axes[1].set_title(f"MFCCs (first {n_mfcc} coefficients)")
axes[1].set_ylabel("MFCC coefficient")
fig.colorbar(img1, ax=axes[1])

img2 = librosa.display.specshow(chroma, sr=sr, hop_length=hop_length, x_axis='time', y_axis='chroma', ax=axes[2])
axes[2].set_title("Chroma Features (CQT) — Pitch Class Energy")
fig.colorbar(img2, ax=axes[2])

feat_times = librosa.times_like(centroid, sr=sr, hop_length=hop_length)
axes[3].plot(feat_times, centroid[0], label='Centroid', linewidth=0.8, color='steelblue')
axes[3].plot(feat_times, rolloff[0], label='Rolloff (85%)', linewidth=0.8, color='darkorange')
axes[3].fill_between(feat_times, centroid[0]-bandwidth[0], centroid[0]+bandwidth[0],
                     alpha=0.25, color='steelblue', label='±Bandwidth')
axes[3].set_xlabel("Time (s)")
axes[3].set_ylabel("Frequency (Hz)")
axes[3].set_title("Spectral Centroid, Bandwidth, and Rolloff")
axes[3].legend(loc='upper right')

plt.tight_layout()
plt.show()
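
The spectral centroid in the last panel is the magnitude-weighted mean frequency of each frame. A single-frame NumPy sketch under stated assumptions (one frame only, a periodic Hann window, and a hypothetical helper `spectral_centroid_frame`; librosa applies the same idea to every STFT frame):

```python
import numpy as np

def spectral_centroid_frame(frame, sr):
    """Magnitude-weighted mean frequency (Hz) of one windowed frame."""
    window = np.hanning(len(frame) + 1)[:-1]          # periodic Hann window
    magnitude = np.abs(np.fft.rfft(frame * window))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)   # bin centre frequencies
    return np.sum(freqs * magnitude) / np.sum(magnitude)

sr_demo, n_fft_demo = 8000, 2048
t_demo = np.arange(n_fft_demo) / sr_demo
tone_1k = np.sin(2 * np.pi * 1000 * t_demo)
tone_3k = np.sin(2 * np.pi * 3000 * t_demo)

centroid_1k = spectral_centroid_frame(tone_1k, sr_demo)
centroid_mix = spectral_centroid_frame(tone_1k + tone_3k, sr_demo)
print(f"{centroid_1k:.1f} Hz, {centroid_mix:.1f} Hz")
```

A single tone puts the centroid at its own frequency; adding an equally loud higher tone pulls the centroid to the midpoint, which is why the centroid is often read as a brightness measure.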

## Beat-Synchronised Feature Matrix

Aggregating frame-level features at beat boundaries reduces sequence length and
aligns features with musical structure — useful for song-level classification.

In [None]:
beat_mfcc = librosa.util.sync(mfcc, beat_frames, aggregate=np.median)
beat_chroma = librosa.util.sync(chroma, beat_frames, aggregate=np.median)
beat_features = np.vstack([beat_chroma, beat_mfcc])

print(f"Beat-sync chroma : {beat_chroma.shape}")
print(f"Beat-sync MFCC   : {beat_mfcc.shape}")
print(f"Combined (25 x beats): {beat_features.shape}")

fig, ax = plt.subplots(figsize=(12, 5))
img = ax.imshow(beat_features, aspect='auto', origin='lower', cmap='coolwarm')
ax.set_xlabel("Beat Number")
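pitch_classes = ['C', 'C#', 'D', 'D#', 'E', 'F', 'F#', 'G', 'G#', 'A', 'A#', 'B']  # chroma bins start at C
ax.set_yticks(range(len(pitch_classes) + n_mfcc))
ax.set_yticklabels(pitch_classes + [f"MFCC{i}" for i in range(n_mfcc)], fontsize=7)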
ax.set_title("Beat-Synchronised Feature Matrix (Chroma + MFCC)")
fig.colorbar(img, ax=ax)
plt.tight_layout()
plt.show()
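
The aggregation step can be sketched in plain NumPy as follows. This is a simplified stand-in for `librosa.util.sync`, assuming median aggregation and that the start and end of the signal are added as segment boundaries; `sync_median` and the toy inputs are hypothetical.

```python
import numpy as np

def sync_median(features, boundaries):
    """Median of feature columns between consecutive frame boundaries."""
    n_frames = features.shape[1]
    # Add the start and end of the signal so every frame belongs to a segment.
    edges = np.unique(np.concatenate(([0], boundaries, [n_frames])))
    segments = [features[:, a:b] for a, b in zip(edges[:-1], edges[1:])]
    return np.stack([np.median(seg, axis=1) for seg in segments], axis=1)

demo_features = np.arange(20, dtype=float).reshape(2, 10)  # 2 features x 10 frames
demo_beats = np.array([3, 7])                               # hypothetical beat frames
demo_synced = sync_median(demo_features, demo_beats)
print(demo_synced.shape)
```

Two beat boundaries split the ten frames into three segments, so the output has one column more than the number of beats.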

## torchaudio: Transforms and Resampling

In [None]:
waveform = torch.tensor(y).unsqueeze(0).float()

# Resample to 16 kHz (common for speech models)
target_sr = 16000
resampler = T.Resample(orig_freq=sr, new_freq=target_sr)
waveform_16k = resampler(waveform)

# Mel Spectrogram
mel_transform = T.MelSpectrogram(sample_rate=sr, n_fft=1024, hop_length=256, n_mels=80)
to_db = T.AmplitudeToDB(stype='power', top_db=80)
mel_db_torch = to_db(mel_transform(waveform))

# Power Spectrogram
spec_db_torch = to_db(T.Spectrogram(n_fft=1024, hop_length=256, power=2.0)(waveform))

print(f"Waveform        : {waveform.shape}")
print(f"Resampled 16kHz : {waveform_16k.shape}")
print(f"Mel Spectrogram : {mel_db_torch.shape}")

fig, axes = plt.subplots(2, 1, figsize=(12, 7))
axes[0].imshow(spec_db_torch[0].numpy(), aspect='auto', origin='lower',
               cmap='magma', extent=[0, len(y)/sr, 0, sr/2])
axes[0].set_xlabel("Time (s)")
axes[0].set_ylabel("Frequency (Hz)")
axes[0].set_title("torchaudio Spectrogram (dB)")

axes[1].imshow(mel_db_torch[0].numpy(), aspect='auto', origin='lower',
               cmap='magma', extent=[0, len(y)/sr, 0, 80])
axes[1].set_xlabel("Time (s)")
axes[1].set_ylabel("Mel Band")
axes[1].set_title("torchaudio Mel Spectrogram (80 mel bands, dB)")

plt.tight_layout()
plt.show()
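
The dB conversion used for both plots follows a simple recipe: take 10·log10 of the power, then clip everything more than `top_db` below the peak. A NumPy sketch of that outline, assuming a reference power of 1.0 and a floor of `amin=1e-10` (the function name `power_to_db_sketch` is illustrative, not a torchaudio or librosa API):

```python
import numpy as np

def power_to_db_sketch(power, top_db=80.0, amin=1e-10):
    """10*log10 of a power spectrogram, floored at top_db below the peak."""
    db = 10.0 * np.log10(np.maximum(power, amin))   # amin avoids log of zero
    return np.maximum(db, db.max() - top_db)        # limit the dynamic range

demo_power = np.array([1.0, 1e-3, 1e-12])
demo_db = power_to_db_sketch(demo_power)
print(demo_db)
```

Clipping the dynamic range keeps near-silent bins from dominating the colour scale of the plot.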

---
## Summary

1. **HPSS** separates harmonic and percussive layers — use them independently for cleaner features
2. **Beat tracking** on the percussive layer often gives more stable tempo and beat positions than tracking on the full mix
3. **Mel Spectrogram + MFCC + Chroma** form a standard feature set for music analysis
4. **Beat-synchronised features** reduce sequence length while preserving musical structure
5. **torchaudio** transforms are `torch.nn.Module` objects, so they can run on a GPU when moved there and slot directly into PyTorch pipelines

**Resources:**
- [librosa Tutorial](https://librosa.org/doc/latest/tutorial.html)
- [torchaudio Transforms](https://pytorch.org/audio/stable/transforms.html)
- [Music Information Retrieval (MIR)](https://musicinformationretrieval.com/)