# Hands-on Session 6: Audio Processing

## Learning Objectives

By the end of this hands-on session, you will be able to:
- Understand how audio is represented digitally (sampling, amplitude, bit depth)
- Load and explore audio datasets using 🤗 Datasets
- Visualize audio data in different representations (waveform, spectrogram, mel spectrogram)
- Process and preprocess audio data using librosa
- Prepare audio data for machine learning models

---

## Prerequisites

- Basic Python knowledge
- Understanding of NumPy arrays
- Familiarity with matplotlib for visualization

## Setup: Install Required Libraries

First, let's install the necessary libraries for working with audio data.

In [None]:
# Install necessary packages
!pip install -q torchcodec==0.7 datasets[audio] librosa matplotlib transformers soundfile

In [None]:
# Import libraries
import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np
from datasets import load_dataset, Audio
from transformers import WhisperFeatureExtractor, AutoProcessor
import IPython.display as ipd

print("Libraries imported successfully!")

---

# Part 1: Understanding Audio Data Representation

## 1.1 From Continuous to Digital: Sampling

Audio is a **continuous signal** in the physical world - sound waves are continuous changes in air pressure. However, computers can only work with discrete, finite values. To convert continuous audio into digital form, we use **sampling**.

### Key Concepts:

**Sampling** is the process of measuring the value of a continuous signal at fixed time intervals.

**Sampling Rate (or Sampling Frequency)** is the number of samples taken per second, measured in Hertz (Hz).

Common sampling rates:
- **16 kHz (16,000 Hz)**: Common for speech recognition models
- **22.05 kHz**: Used for lower-quality audio
- **44.1 kHz**: CD-quality audio
- **48 kHz**: Professional audio/video
- **192 kHz**: High-resolution audio

**Important**: The sampling rate determines the highest frequency that can be captured, known as the **Nyquist limit** (= sampling_rate / 2).

For example:
- Speech sampled at 16 kHz can capture frequencies up to 8 kHz (sufficient for human speech)
- Music typically needs 44.1 kHz to capture frequencies up to ~20 kHz (human hearing range)
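
The Nyquist relationship can be checked with a few lines of arithmetic. The sketch below is only an illustration: the helper names `nyquist_limit` and `aliased_frequency` are made up for this example, and it assumes an ideal sampler with no anti-aliasing filter. It shows where a tone above the Nyquist limit would show up after sampling (this effect is called aliasing).

```python
def nyquist_limit(sampling_rate):
    """Highest frequency (Hz) representable at this sampling rate."""
    return sampling_rate / 2


def aliased_frequency(frequency, sampling_rate):
    """Frequency (Hz) at which a pure tone appears after ideal sampling.

    Tones at or below the Nyquist limit are unchanged; higher tones fold back
    into the range [0, sampling_rate / 2].
    """
    folded = frequency % sampling_rate          # sampling cannot tell f from f + k * sr
    return min(folded, sampling_rate - folded)  # mirror the upper half onto the lower half


for sr in (16_000, 44_100, 48_000):
    print(f"{sr} Hz sampling -> Nyquist limit {nyquist_limit(sr):.0f} Hz")

# A 10 kHz tone sampled at 16 kHz is indistinguishable from a 6 kHz tone
print(aliased_frequency(10_000, 16_000))
```

This is why real recording chains low-pass filter the signal before sampling: anything above the Nyquist limit would otherwise be folded onto a lower frequency.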

## 1.2 Amplitude and Bit Depth

**Amplitude** describes the sound pressure level (loudness) at any given instant, measured in decibels (dB).

**Bit Depth** determines the precision with which amplitude values are recorded:
- **16-bit**: 65,536 possible amplitude levels (standard for most audio)
- **24-bit**: 16,777,216 possible amplitude levels (professional audio)
- **32-bit float**: Used in ML (values normalized to [-1.0, 1.0] range)

For machine learning, audio is typically converted to 32-bit floating-point format with values in the range [-1.0, 1.0].
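
To make the bit-depth numbers concrete, here is a minimal sketch that counts the amplitude levels for a given bit depth and scales 16-bit integer samples to floats. The function names and the sample values are hypothetical, and dividing by 32768 is one common convention assumed here (individual libraries may scale slightly differently).

```python
def amplitude_levels(bit_depth):
    """Number of distinct values a signed integer sample of this bit depth can take."""
    return 2 ** bit_depth


def int16_to_float(samples):
    """Scale 16-bit integer samples to floats in [-1.0, 1.0).

    Dividing by 32768 (2 ** 15) maps the most negative value to exactly -1.0.
    """
    return [s / 32768.0 for s in samples]


print(amplitude_levels(16))   # 65536
print(amplitude_levels(24))   # 16777216

pcm = [-32768, -16384, 0, 16384, 32767]   # made-up 16-bit samples
print(int16_to_float(pcm))
```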

## 1.3 Hands-on: Loading and Exploring Audio with Librosa

Let's start by loading a sample audio file using librosa. Librosa comes with several example audio files we can use.

In [None]:
# Load an example audio file (trumpet sound); librosa.load resamples to 22,050 Hz by default
audio_array, sampling_rate = librosa.load(librosa.ex('trumpet'))

print(f"Audio shape: {audio_array.shape}")
print(f"Sampling rate: {sampling_rate} Hz")
print(f"Duration: {len(audio_array) / sampling_rate:.2f} seconds")
print(f"Data type: {audio_array.dtype}")
print(f"Value range: [{audio_array.min():.3f}, {audio_array.max():.3f}]")

In [None]:
# Listen to the audio
ipd.Audio(audio_array, rate=sampling_rate)

---

# Part 2: Audio Representations and Visualizations

## 2.1 Time Domain: Waveform

The **waveform** is the most intuitive representation - it shows amplitude over time (time domain representation).

Useful for:
- Identifying timing of sound events
- Overall loudness assessment
- Detecting noise or irregularities
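
The "overall loudness" you read off a waveform can also be put into numbers. The sketch below computes two simple measures, peak amplitude and root-mean-square (RMS) level, on a synthetic tone. The helper names and the test signal are assumptions made for this example, and RMS is only a rough proxy for perceived loudness.

```python
import math


def peak_amplitude(samples):
    """Largest absolute sample value."""
    return max(abs(s) for s in samples)


def rms(samples):
    """Root-mean-square level, a simple proxy for loudness."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


# Hypothetical test signal: one second of a 440 Hz sine at half of full scale
sr = 8000
tone = [0.5 * math.sin(2 * math.pi * 440 * n / sr) for n in range(sr)]

print(f"Peak: {peak_amplitude(tone):.3f}")
print(f"RMS:  {rms(tone):.3f}")
```

For a pure sine the RMS is the peak divided by the square root of 2, so the two values differ even though the signal has constant loudness.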

In [None]:
# Visualize the waveform
plt.figure(figsize=(14, 5))
librosa.display.waveshow(audio_array, sr=sampling_rate)
plt.title('Waveform (Time Domain Representation)')
plt.xlabel('Time (seconds)')
plt.ylabel('Amplitude')
plt.tight_layout()
plt.show()

## 2.2 Frequency Domain: Spectrum

The **frequency spectrum** shows which frequencies are present in the signal and their strength (frequency domain representation).

It's calculated using the **Discrete Fourier Transform (DFT)**, specifically the Fast Fourier Transform (FFT) algorithm.

Key points:
- X-axis: Frequency (Hz) - typically on log scale
- Y-axis: Amplitude (dB)
- Shows the frequency composition of one short segment of the signal, with no information about how it changes over time
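
Before using the FFT, it can help to see the DFT written out directly from its definition. The sketch below is deliberately slow (it is O(N^2), whereas `np.fft.rfft` computes the same bins much faster) and uses a made-up 1 kHz test tone. The helper names `dft_magnitudes` and `bin_frequency` are hypothetical. It shows how a bin index maps to a frequency in Hz.

```python
import cmath
import math


def dft_magnitudes(samples):
    """Magnitudes of DFT bins 0..N/2, computed directly from the definition (O(N^2))."""
    n = len(samples)
    magnitudes = []
    for k in range(n // 2 + 1):
        total = sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(samples))
        magnitudes.append(abs(total))
    return magnitudes


def bin_frequency(k, sampling_rate, n_fft):
    """Centre frequency (Hz) of DFT bin k."""
    return k * sampling_rate / n_fft


# Hypothetical signal: a 1 kHz sine sampled at 8 kHz, 64 samples long
sr, n_fft = 8000, 64
signal = [math.sin(2 * math.pi * 1000 * i / sr) for i in range(n_fft)]

magnitudes = dft_magnitudes(signal)
peak_bin = max(range(len(magnitudes)), key=magnitudes.__getitem__)

print(f"Bin spacing: {sr / n_fft} Hz")
print(f"Peak at bin {peak_bin} = {bin_frequency(peak_bin, sr, n_fft)} Hz")
```

The bin spacing `sr / n_fft` is the frequency resolution: a longer segment gives finer frequency detail, at the cost of covering a longer stretch of time.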

In [None]:
# Compute the frequency spectrum of the first 4096 samples
n_fft = 4096
dft_input = audio_array[:n_fft]

# Apply windowing to reduce spectral leakage
window = np.hanning(len(dft_input))
windowed_input = dft_input * window

# Calculate DFT
dft = np.fft.rfft(windowed_input)

# Get amplitude spectrum in decibels
amplitude = np.abs(dft)
amplitude_db = librosa.amplitude_to_db(amplitude, ref=np.max)

# Get frequency bins
frequency = librosa.fft_frequencies(sr=sampling_rate, n_fft=len(dft_input))

# Plot
plt.figure(figsize=(14, 5))
plt.plot(frequency, amplitude_db)
plt.xlabel('Frequency (Hz)')
plt.ylabel('Amplitude (dB)')
plt.title('Frequency Spectrum (Frequency Domain Representation)')
plt.xscale('log')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

## 2.3 Time-Frequency Domain: Spectrogram

The **spectrogram** combines time and frequency - it shows how frequencies change over time!

It's created by:
1. Splitting audio into short overlapping segments (frames)
2. Computing FFT for each segment
3. Stacking the results together

Key properties:
- X-axis: Time
- Y-axis: Frequency (Hz)
- Color/Intensity: Amplitude/Power (dB)

Algorithm: **STFT (Short-Time Fourier Transform)**
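
The shape of a spectrogram follows from the framing parameters. The sketch below is plain arithmetic, not a call into librosa: `stft_shape` is a hypothetical helper, and the defaults (`n_fft=2048`, `hop_length=512`, centered frames) are assumed to match what `librosa.stft` uses when called without extra arguments.

```python
def stft_shape(n_samples, n_fft=2048, hop_length=512, center=True):
    """Return (frequency_bins, time_frames) for an STFT of a real-valued signal."""
    frequency_bins = 1 + n_fft // 2
    if center:
        # The signal is padded by n_fft // 2 on both sides, so every hop yields a frame
        time_frames = 1 + n_samples // hop_length
    else:
        # Only frames that fit entirely inside the signal are kept
        time_frames = 1 + (n_samples - n_fft) // hop_length
    return frequency_bins, time_frames


# One second of audio at 22,050 Hz
print(stft_shape(22050))
print(stft_shape(22050, center=False))
```

You can compare the first result with the shape printed by the cell below, scaled to the actual length of the trumpet clip.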

In [None]:
# Compute and visualize the spectrogram
D = librosa.stft(audio_array)
S_db = librosa.amplitude_to_db(np.abs(D), ref=np.max)

plt.figure(figsize=(14, 5))
librosa.display.specshow(S_db, sr=sampling_rate, x_axis='time', y_axis='hz')
plt.colorbar(format='%+2.0f dB')
plt.title('Spectrogram (Time-Frequency Representation)')
plt.tight_layout()
plt.show()

print(f"Spectrogram shape: {S_db.shape}")
print(f"(frequency bins, time frames) = ({S_db.shape[0]}, {S_db.shape[1]})")

## 2.4 Perceptual Representation: Mel Spectrogram

The **mel spectrogram** is a variant that mimics human hearing!

**Key insight**: Human hearing is NOT linear - we're more sensitive to changes in lower frequencies.

The **Mel scale** is a perceptual scale that:
- Spaces frequencies roughly linearly below about 1 kHz and logarithmically above
- Approximates how humans perceive pitch
- Maps Hz to Mel units: mel = 2595 * log10(1 + f/700)

Process:
1. Compute STFT (like regular spectrogram)
2. Apply mel filterbank to group frequencies into mel bands
3. Convert to log scale (log-mel spectrogram)

**Very popular for speech and music ML models!**
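
The Hz-to-mel formula above is easy to try out directly. This is a minimal sketch using only that formula; the function names are made up for this example, and librosa's own mel scale can differ slightly depending on its `htk` option. It shows that points equally spaced in mel are spaced further and further apart in Hz.

```python
import math


def hz_to_mel(frequency_hz):
    """Convert Hz to mels using mel = 2595 * log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + frequency_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_spaced_frequencies(n_points, fmin=0.0, fmax=8000.0):
    """Frequencies (Hz) that are equally spaced on the mel scale."""
    mel_min, mel_max = hz_to_mel(fmin), hz_to_mel(fmax)
    step = (mel_max - mel_min) / (n_points - 1)
    return [mel_to_hz(mel_min + i * step) for i in range(n_points)]


print(f"1000 Hz is about {hz_to_mel(1000):.0f} mel")

# Equal steps in mel become wider and wider steps in Hz
for f in mel_spaced_frequencies(6):
    print(f"{f:8.1f} Hz")
```

A mel filterbank places its band centres at frequencies like these, which is why low frequencies get many narrow bands and high frequencies get a few wide ones.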

In [None]:
# Compute and visualize mel spectrogram
S_mel = librosa.feature.melspectrogram(y=audio_array, sr=sampling_rate, n_mels=128, fmax=8000)
S_mel_db = librosa.power_to_db(S_mel, ref=np.max)

plt.figure(figsize=(14, 5))
librosa.display.specshow(S_mel_db, sr=sampling_rate, x_axis='time', y_axis='mel', fmax=8000)
plt.colorbar(format='%+2.0f dB')
plt.title('Mel Spectrogram')
plt.tight_layout()
plt.show()

print(f"Mel spectrogram shape: {S_mel_db.shape}")
print(f"(mel bands, time frames) = ({S_mel_db.shape[0]}, {S_mel_db.shape[1]})")
print(f"\nNotice: fewer frequency bins (128 mel bands) vs regular spectrogram ({S_db.shape[0]} bins)")

### 📝 Exercise 1: Compare Representations

Let's compare all representations side by side!

In [None]:
# Create a comprehensive visualization
fig, axes = plt.subplots(3, 1, figsize=(14, 12))

# 1. Waveform
librosa.display.waveshow(audio_array, sr=sampling_rate, ax=axes[0])
axes[0].set_title('Time Domain: Waveform')
axes[0].set_xlabel('Time (s)')
axes[0].set_ylabel('Amplitude')

# 2. Spectrogram
img1 = librosa.display.specshow(S_db, sr=sampling_rate, x_axis='time', y_axis='hz', ax=axes[1])
axes[1].set_title('Time-Frequency Domain: Spectrogram')
fig.colorbar(img1, ax=axes[1], format='%+2.0f dB')

# 3. Mel Spectrogram
img2 = librosa.display.specshow(S_mel_db, sr=sampling_rate, x_axis='time', y_axis='mel', 
                                  fmax=8000, ax=axes[2])
axes[2].set_title('Perceptual Representation: Mel Spectrogram')
fig.colorbar(img2, ax=axes[2], format='%+2.0f dB')

plt.tight_layout()
plt.show()

### Discussion Questions
1. What information is visible in the waveform that's hard to see in spectrograms?
2. What can you see in spectrograms that's invisible in the waveform?
3. How does the mel spectrogram differ from the regular spectrogram?

---

# Part 3: Loading and Exploring Real Datasets with 🤗 Datasets

## 3.1 Introduction to Hugging Face Datasets

The ü§ó Datasets library provides:
- Easy access to thousands of audio datasets
- Automatic downloading and caching
- Streaming for large datasets
- Built-in audio processing

Let's load the **MINDS-14** dataset - recordings of people asking banking questions in multiple languages.

In [None]:
# Load the MINDS-14 dataset (Australian English subset)
minds = load_dataset("PolyAI/minds14", name="en-AU", split="train")
print(minds)

In [None]:
# Explore a single example
example = minds[0]
print("Keys in example:", example.keys())
print("\n--- Example Details ---")
print(f"Path: {example['path']}")
print(f"Transcription: {example['transcription']}")
print(f"Intent class: {example['intent_class']}")
print(f"\nAudio info:")
print(f"  Sampling rate: {example['audio']['sampling_rate']} Hz")
print(f"  Array shape: {example['audio']['array'].shape}")
print(f"  Duration: {len(example['audio']['array']) / example['audio']['sampling_rate']:.2f} seconds")

In [None]:
# Convert intent class to readable label
id2label = minds.features["intent_class"].int2str
print(f"\nIntent label: {id2label(example['intent_class'])}")

# Listen to the audio
print("\nListen to the audio:")
ipd.Audio(example['audio']['array'], rate=example['audio']['sampling_rate'])

In [None]:
# Visualize the waveform
plt.figure(figsize=(14, 4))
librosa.display.waveshow(example['audio']['array'], sr=example['audio']['sampling_rate'])
plt.title(f"Waveform: '{example['transcription']}'")
plt.xlabel('Time (seconds)')
plt.ylabel('Amplitude')
plt.tight_layout()
plt.show()

## 3.2 Dataset Manipulation

Let's learn to filter and clean datasets - removing features we don't need.

In [None]:
# Remove unnecessary columns
columns_to_remove = ["english_transcription", "lang_id"]
minds_cleaned = minds.remove_columns(columns_to_remove)
print("After cleaning:")
print(minds_cleaned)

---

# Part 4: Audio Preprocessing for Machine Learning

## 4.1 Resampling Audio

**Why resample?**
- Different datasets have different sampling rates
- ML models are trained on specific sampling rates
- Must match the model's expected rate!

Most speech models use **16 kHz** sampling rate.
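
To build intuition for what resampling does to the array, here is a deliberately naive sketch based on linear interpolation. It is not what 🤗 Datasets or librosa do internally (real resamplers use band-limited filters to avoid aliasing), and `resample_linear` and the sample values are hypothetical. It only illustrates that the sample count scales with the ratio of the two rates while the duration stays the same.

```python
import math


def resample_linear(samples, orig_sr, target_sr):
    """Naively resample by linear interpolation (illustration only, no anti-aliasing)."""
    n_out = math.ceil(len(samples) * target_sr / orig_sr)
    last = len(samples) - 1
    resampled = []
    for i in range(n_out):
        position = i * orig_sr / target_sr       # where output sample i falls in the input
        left = min(int(position), last)
        right = min(left + 1, last)
        fraction = position - left
        resampled.append((1 - fraction) * samples[left] + fraction * samples[right])
    return resampled


clip_8k = [0.0, 1.0, 0.0, -1.0]                  # made-up samples at 8 kHz
clip_16k = resample_linear(clip_8k, 8000, 16000)

print(len(clip_8k), "->", len(clip_16k))         # sample count doubles, duration is unchanged
print(clip_16k)
```

Note that upsampling adds samples but no new information: frequencies above the original Nyquist limit are not recovered.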

In [None]:
# Check current sampling rate
print(f"Current sampling rate: {minds_cleaned[0]['audio']['sampling_rate']} Hz")
print(f"Number of samples: {len(minds_cleaned[0]['audio']['array'])}")

# Resample to 16 kHz using 🤗 Datasets
minds_resampled = minds_cleaned.cast_column("audio", Audio(sampling_rate=16000))

# Check after resampling
print(f"\nAfter resampling:")
print(f"New sampling rate: {minds_resampled[0]['audio']['sampling_rate']} Hz")
print(f"Number of samples: {len(minds_resampled[0]['audio']['array'])}")
print(f"\nNote: Array length doubled (8 kHz → 16 kHz upsampling)")

## 4.2 Filtering by Duration

Often we need to filter audio by length to:
- Avoid memory issues (very long files)
- Ensure consistency (remove very short clips)
- Meet model requirements

In [None]:
# Add duration column
print("Adding duration column...")
durations = []
for x in minds_resampled:
    durations.append(librosa.get_duration(y=x['audio']['array'], sr=x['audio']['sampling_rate']))
minds_with_duration = minds_resampled.add_column("duration", durations)

print(f"Original dataset size: {len(minds_resampled)}")
print(f"Duration range: {min(durations):.2f}s - {max(durations):.2f}s")

# Keep only examples shorter than 20 seconds
MAX_DURATION = 20.0

def is_audio_length_in_range(length):
    return length < MAX_DURATION

minds_filtered = minds_with_duration.filter(
    is_audio_length_in_range, 
    input_columns=["duration"]
)

# Remove temporary duration column
minds_filtered = minds_filtered.remove_columns(["duration"])

print(f"Filtered dataset size: {len(minds_filtered)}")
print(f"Removed {len(minds_resampled) - len(minds_filtered)} examples")

## 4.3 Feature Extraction for Models

Many ML models don't work directly on raw waveforms. They rely on a **feature extractor** to convert audio into the input format the model was trained on.

Let's use **Whisper's feature extractor** as an example (Whisper is a state-of-the-art speech recognition model from OpenAI).

**What Whisper's feature extractor does:**
1. Pads/truncates audio to 30 seconds
2. Converts to log-mel spectrogram (80 mel bands)
3. No attention mask needed (unique to Whisper!)
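
The fixed 30-second window makes the output shape predictable. The sketch below reproduces only the padding step and the shape arithmetic in plain Python. The constants are assumptions based on the commonly documented Whisper settings (the next cell prints the real values from the loaded extractor), and `pad_or_truncate` is a hypothetical helper, not part of `transformers`.

```python
SAMPLING_RATE = 16_000   # Hz, the rate Whisper expects
CHUNK_SECONDS = 30       # every input is padded or truncated to this length
HOP_LENGTH = 160         # samples between spectrogram frames (10 ms at 16 kHz)
N_MELS = 80              # mel bands for whisper-small


def pad_or_truncate(samples, target_length):
    """Zero-pad short clips and cut long ones so every clip has target_length samples."""
    if len(samples) >= target_length:
        return samples[:target_length]
    return samples + [0.0] * (target_length - len(samples))


target_length = SAMPLING_RATE * CHUNK_SECONDS
clip = [0.1] * (5 * SAMPLING_RATE)               # hypothetical 5-second clip
fixed = pad_or_truncate(clip, target_length)

n_frames = target_length // HOP_LENGTH
print(f"Samples after padding: {len(fixed)}")
print(f"Expected feature shape: ({N_MELS}, {n_frames})")
```

Because every input ends up with the same number of frames, the model always sees tensors of one fixed size, regardless of how long the original recording was.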

In [None]:
# Load Whisper's feature extractor
feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-small")

print("Feature extractor loaded!")
print(f"Expected sampling rate: {feature_extractor.sampling_rate} Hz")
print(f"Mel bands (n_mels): {feature_extractor.feature_size}")
print(f"FFT window size: {feature_extractor.n_fft}")
print(f"Hop length: {feature_extractor.hop_length}")

In [None]:
# Define preprocessing function
def prepare_dataset(example):
    audio = example["audio"]
    
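    # The feature extractor does not resample: the audio must already be at
    # feature_extractor.sampling_rate (16 kHz), which we ensured in Part 4.1
    features = feature_extractor(
        audio["array"], 
        sampling_rate=audio["sampling_rate"],
        padding="max_length"  # pad to the full 30-second window Whisper expects
    )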
    
    return features

# Apply to a single example first
example = minds_filtered[0]
processed = prepare_dataset(example)

print("Processed features:")
print(f"Input features shape: {np.array(processed['input_features']).shape}")
print(f"(batch, mel_bands, time_frames) = {np.array(processed['input_features']).shape}")
print(f"\nThis is ready to be fed into the Whisper model!")

In [None]:
# Visualize the processed features
plt.figure(figsize=(14, 5))
librosa.display.specshow(
    np.asarray(processed['input_features'][0]),
    x_axis='time',
    y_axis='mel',
    sr=feature_extractor.sampling_rate,
    hop_length=feature_extractor.hop_length
)
# Whisper's features are normalized log-mel values, not decibels
plt.colorbar(label='Normalized log-mel value')
plt.title('Whisper Input: Log-Mel Spectrogram (80 mel bands)')
plt.tight_layout()
plt.show()

### 📝 Exercise 2: Process the Entire Dataset

Now let's apply the preprocessing to the entire dataset using the `.map()` function.

In [None]:
# Apply preprocessing to entire dataset
minds_processed = minds_filtered.map(prepare_dataset, remove_columns=["audio", "path"])

print("Processed dataset:")
print(minds_processed)
print(f"\nFeatures available: {minds_processed.column_names}")
print(f"\nDataset is now ready for training or inference!")

---

# Part 5: Practical Exercises and Exploration

## 📝 Exercise 3: Explore Different Audio Representations

**Task**: Load a different example audio and create all three visualizations (waveform, spectrogram, mel spectrogram) side by side.

**Steps**:
1. Try loading a different librosa example: `librosa.ex('nutcracker')`, `librosa.ex('brahms')`, or `librosa.ex('choice')`
2. Or load your own audio file using: `librosa.load('path/to/your/file.wav')`
3. Create the three visualizations
4. Compare and discuss what you observe

In [None]:
# YOUR CODE HERE
# Try loading a different audio example and visualizing it



## 📝 Exercise 4: Experiment with Different Mel Band Counts (10 minutes)

**Task**: Compare mel spectrograms with different numbers of mel bands (n_mels).

**Questions to explore**:
- What happens with fewer mel bands (e.g., 40)?
- What happens with more mel bands (e.g., 256)?
- Which provides better frequency resolution?
- Which is more computationally efficient?

In [None]:
# YOUR CODE HERE
# Compare different n_mels values



## 📝 Exercise 5: Explore Different Datasets (10-15 minutes)

**Task**: Load a different language/dialect from MINDS-14 and compare with the Australian English version.

Available languages: `en-AU`, `en-GB`, `en-US`, `de-DE`, `fr-FR`, `es-ES`, `it-IT`, `nl-NL`, `pl-PL`, `pt-PT`, `zh-CN`, `ko-KR`, etc.

**Questions**:
- How do spectrograms differ across languages?
- Are there visible differences in speech patterns?
- How does duration vary by language?

In [None]:
# YOUR CODE HERE
# Load a different language subset and compare



---

# Summary and Key Takeaways

## What We Learned

### 1. **Audio Representation Fundamentals**
- **Sampling**: Converting continuous signals to discrete values
- **Sampling Rate**: Determines maximum frequency (Nyquist limit = sr/2)
- **Bit Depth**: Precision of amplitude values
- Common for ML: 16 kHz sampling, 32-bit float, values in [-1.0, 1.0]

### 2. **Audio Visualizations**
- **Waveform**: Time domain - shows amplitude over time
- **Spectrum**: Frequency domain - shows which frequencies are present in one short segment
- **Spectrogram**: Time-frequency domain - shows how frequencies change over time
- **Mel Spectrogram**: Perceptual scale mimicking human hearing

### 3. **Working with Audio Datasets**
- 🤗 Datasets library for easy loading and processing
- Dataset manipulation: filtering, resampling, column management
- Batch processing with `.map()` function

### 4. **Preprocessing for ML Models**
- **Resampling**: Match model's expected sampling rate
- **Filtering**: Remove too-long or too-short examples
- **Feature Extraction**: Convert raw audio to model inputs (e.g., log-mel spectrograms)
- Different models need different preprocessing!

## Important Concepts to Remember

| Concept | Key Point |
|---------|-----------|
| **Sampling Rate** | 16 kHz for speech, 44.1 kHz for music |
| **Mel Spectrogram** | Most common input for audio ML models |
| **Feature Extractor** | Model-specific preprocessing (always check!) |
| **Resampling** | Must match training data sampling rate |
| **STFT** | Short-Time Fourier Transform for spectrograms |

## Next Steps

This hands-on covered the **fundamentals** of audio processing. Next topics could include:
- Audio augmentation techniques
- Building audio classification models
- Speech recognition with Whisper
- Music generation models
- Audio-to-audio tasks (source separation, enhancement)

---

## 🎓 Additional Resources

- [Hugging Face Audio Course](https://huggingface.co/learn/audio-course/chapter0/1)
- [Librosa Documentation](https://librosa.org/doc/latest/index.html)
- [🤗 Datasets Documentation](https://huggingface.co/docs/datasets/)
- [Whisper Paper](https://arxiv.org/abs/2212.04356)