# Hands-on Session 6: Music Classification

# Part 1: Reminders



## 1.1 Introduction to Audio Transformers

Modern audio processing relies heavily on **Transformer architectures**, which were originally designed for NLP and have since been adapted successfully to audio tasks.

### Why Transformers for Audio?

Traditional approaches used CNNs on spectrograms. Transformers offer:
- **Self-attention**: Captures long-range dependencies in audio
- **Transfer learning**: Pre-trained on massive datasets
- **Flexibility**: Same architecture for multiple tasks
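
To make the self-attention bullet concrete, here is a minimal sketch in plain Python. It is not how Wav2Vec2 or HuBERT implement attention: there are no learned query/key/value projections, there is only one head, and the toy `frames` are invented for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(frames):
    """Single-head scaled dot-product self-attention.

    Assumption of this sketch: queries, keys, and values are the frames
    themselves (no learned projection matrices).
    """
    dim = len(frames[0])
    outputs, weights = [], []
    for query in frames:
        # Similarity of this frame to every frame in the sequence.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in frames
        ]
        attn = softmax(scores)
        weights.append(attn)
        # Output is the attention-weighted average of all frames.
        outputs.append(
            [sum(a * value[d] for a, value in zip(attn, frames)) for d in range(dim)]
        )
    return outputs, weights


# Toy "audio": frames 0 and 3 are similar even though they are far apart in time.
frames = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
outputs, weights = self_attention(frames)
print([round(w, 3) for w in weights[0]])
```

Frame 0 attends to frame 3 as strongly as to itself, regardless of the distance between them, which is the "long-range dependency" property in miniature.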

### Popular Audio Transformer Models

| Model | Description | Use Cases |
|-------|-------------|-----------|
| **Wav2Vec2** | Self-supervised learning from raw waveforms | Speech recognition, audio classification |
| **HuBERT** | Hidden Unit BERT - masked prediction | Speech tasks, speaker recognition |
| **DistilHuBERT** | Distilled (smaller, faster) version of HuBERT | Resource-constrained applications |
| **Whisper** | Multilingual speech recognition | Transcription, translation |
| **Audio Spectrogram Transformer (AST)** | Vision Transformer applied to spectrograms | General audio classification |

![Audio Transformer Architecture](https://raw.githubusercontent.com/patrickvonplaten/scientific_images/master/wav2vec2.png)

*Source: Hugging Face - Wav2Vec2 architecture processes raw waveforms through convolutional layers then transformer blocks*

The input can also be a spectrogram, as with the AST model:

![AST Architecture](https://huggingface.co/datasets/huggingface-course/audio-course-images/resolve/main/ast.png)

## 1.2 How Audio Models Process Data

### From Raw Audio to Model Input

Let's understand the pipeline:

```
Raw Audio (waveform)
    ↓
Feature Extraction (normalization, padding)
    ↓
Model Encoder (self-attention layers)
    ↓
Classification Head
    ↓
Predictions (genre, intent, etc.)
```

### Key Concepts in Audio Transformers

**1. Feature Normalization**
- Audio samples normalized to **zero mean, unit variance**
- Ensures stable training and convergence
- Done by the feature extractor automatically

**2. Attention Mechanism**
- Models learn which parts of audio are important
- Can focus on different time segments
- Captures temporal patterns and relationships

**3. Pre-training Strategy**
- **Contrastive Learning**: Distinguish between different audio segments
- **Masked Prediction**: Predict masked parts of audio (like BERT)
- **Multi-task Learning**: Train on multiple related tasks
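
The zero-mean, unit-variance normalization from point 1 can be sketched as follows. The small `eps` term and the synthetic 440 Hz tone with a DC offset are assumptions of this example; a real feature extractor may differ in detail.

```python
import math


def normalize(samples, eps=1e-7):
    """Shift to zero mean and scale to unit variance.

    `eps` guards against division by zero on silent clips (an assumption
    of this sketch, not a documented constant).
    """
    mean = sum(samples) / len(samples)
    variance = sum((s - mean) ** 2 for s in samples) / len(samples)
    return [(s - mean) / math.sqrt(variance + eps) for s in samples]


# One second of a quiet 440 Hz tone at 16 kHz with a DC offset of 0.1.
sampling_rate = 16_000
waveform = [
    0.5 * math.sin(2 * math.pi * 440 * n / sampling_rate) + 0.1
    for n in range(sampling_rate)
]

normalized = normalize(waveform)
new_mean = sum(normalized) / len(normalized)
new_variance = sum((s - new_mean) ** 2 for s in normalized) / len(normalized)
print(f"mean={new_mean:.6f}, variance={new_variance:.6f}")
```

After normalization the loudness and offset of the original recording no longer matter, so quiet and loud clips reach the encoder on the same scale.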

### Connection to Our Task

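When we fine-tune a HuBERT-style model (DistilHuBERT in Part 2) for music classification:
- We keep the **pre-trained encoder** (HuBERT Base learned from 960 hours of LibriSpeech; the Large and X-Large variants used 60,000 hours of Libri-Light speech!)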
- The encoder already knows how to extract meaningful temporal patterns
- We just add a new **classification head** and adapt to music
- This is why transfer learning works so well!

## 1.3 How HuBERT Works

HuBERT (Hidden Unit BERT) is a fascinating model that applies BERT-style pre-training to speech. Let's understand its clever approach!

![HuBERT Overview](https://jonathanbgn.com/assets/images/illustrated-hubert/hubert_explained.png)

*Source: [Jonathan Bgn - HuBERT Visually Explained](https://jonathanbgn.com/2021/10/30/hubert-visually-explained.html)*

#### The Core Idea: Discovering "Hidden Units"

The key insight: **Speech needs to be converted into discrete units (like words in text) before applying BERT!**

**Problem**: Unlike text (which already has discrete words/tokens), raw audio is continuous. How do we create discrete units?

**Solution**: Use clustering! 

#### Step 1: Clustering to Create Pseudo-Labels

HuBERT uses **K-means clustering** to group similar audio segments (25ms each) into K clusters. Each cluster becomes a "hidden unit" - think of it as discovering the phonetic alphabet of speech automatically!

![HuBERT Clustering](https://jonathanbgn.com/assets/images/illustrated-hubert/hubert_clustering.png)

*Clustering process: Audio segments are grouped into K clusters, creating discrete labels*

**How it works:**
1. Take raw audio and split into 25ms segments
2. Extract features (MFCCs for first iteration, later use learned representations)
3. Apply K-means to assign each segment to a cluster
4. Each cluster ID becomes a "target label" for that audio segment
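
The four steps above can be sketched with a tiny pure-Python K-means. The 2-D `features` stand in for per-segment MFCC vectors and are made up, and the farthest-point initialisation is a choice made here to keep the example deterministic, not something taken from HuBERT.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def init_centroids(features, k):
    """Deterministic farthest-point initialisation (a choice of this sketch)."""
    centroids = [list(features[0])]
    while len(centroids) < k:
        farthest = max(
            features,
            key=lambda f: min(squared_distance(f, c) for c in centroids),
        )
        centroids.append(list(farthest))
    return centroids


def kmeans(features, k, iterations=10):
    """Return one cluster ID per segment plus the final centroids."""
    centroids = init_centroids(features, k)
    labels = []
    for _ in range(iterations):
        # Assignment step: each segment goes to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(f, centroids[c]))
            for f in features
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [f for f, lab in zip(features, labels) if lab == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centroids


# Hypothetical per-segment features, in time order, drawn from three sound types.
features = [
    [0.1, 0.0], [0.0, 0.2],    # sound type A
    [5.0, 5.1], [5.2, 4.9],    # sound type B
    [9.9, 0.1], [10.0, -0.1],  # sound type C
    [0.2, 0.1], [5.1, 5.0],    # A again, then B again
]
labels, centroids = kmeans(features, k=3)
print(labels)  # one "hidden unit" ID per segment
```

The printed list is the sequence of pseudo-labels that the masked-prediction step would later try to predict.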

**Multi-iteration refinement:**
- **1st iteration**: Use MFCC features for clustering (handcrafted features)
- **2nd iteration**: Use representations from 6th transformer layer (learned features)
- **3rd iteration** (LARGE/X-LARGE): Use 9th layer representations (even better!)

Each iteration produces better "hidden units" as the model learns more sophisticated representations!

#### Step 2: Masked Prediction (Like BERT)

Once we have discrete labels from clustering, we train exactly like BERT:

![HuBERT Prediction](https://jonathanbgn.com/assets/images/illustrated-hubert/hubert_pretraining_prediction.png)

*Prediction step: Mask ~50% of inputs, predict their cluster assignments from context*

**Training process:**
1. **Mask**: Hide ~50% of the audio frames
2. **Predict**: Model predicts cluster assignments for masked positions
3. **Learn**: Use cross-entropy loss (simple and stable!)
4. **Key trick**: Only compute loss on masked positions (handles noisy labels better)
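
The training process above can be sketched in a few lines of plain Python. The start probability, span length, number of units, and the hand-made logits are illustrative assumptions; in the real model the logits come from the Transformer, not from a list comprehension.

```python
import math
import random


def make_span_mask(num_frames, start_prob=0.08, span=10, seed=0):
    """Pick random start frames and mask `span` consecutive frames from each."""
    rng = random.Random(seed)
    mask = [False] * num_frames
    for start in range(num_frames):
        if rng.random() < start_prob:
            for t in range(start, min(start + span, num_frames)):
                mask[t] = True
    return mask


def masked_cross_entropy(logits, targets, mask):
    """Average cross-entropy computed over masked frames only."""
    losses = []
    for frame_logits, target, is_masked in zip(logits, targets, mask):
        if not is_masked:
            continue  # unmasked frames do not contribute to the loss
        peak = max(frame_logits)
        log_norm = peak + math.log(sum(math.exp(x - peak) for x in frame_logits))
        losses.append(log_norm - frame_logits[target])
    return sum(losses) / len(losses) if losses else 0.0


num_frames, num_units = 1000, 4
targets = [t % num_units for t in range(num_frames)]  # stand-in cluster IDs
mask = make_span_mask(num_frames)

# A model that knows nothing versus one that is confident and correct.
uniform_logits = [[0.0] * num_units for _ in range(num_frames)]
confident_logits = [
    [5.0 if unit == target else 0.0 for unit in range(num_units)]
    for target in targets
]

masked_fraction = sum(mask) / num_frames
uniform_loss = masked_cross_entropy(uniform_logits, targets, mask)
confident_loss = masked_cross_entropy(confident_logits, targets, mask)
print(f"masked fraction: {masked_fraction:.2f}")
print(f"loss (uniform guess): {uniform_loss:.3f}")
print(f"loss (confident and correct): {confident_loss:.3f}")
```

Because spans overlap, a start probability well below 50% still hides roughly half of the frames. A uniform guess over `K` units gives a loss of `log(K)`, which is the baseline the model has to beat.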

#### Why HuBERT is Powerful

**Advantages over Wav2Vec2:**
- **Simpler loss**: Cross-entropy vs contrastive + diversity loss
- **More stable training**: No temperature tuning needed
- **Better targets**: Re-uses learned representations for clustering
- **Matches or beats Wav2Vec2** on speech recognition!

**The iterative magic:**
```
Iteration 1: MFCC → Clustering → Train → Get basic representations
                              ↓
Iteration 2: Layer 6 features → Better clustering → Train → Get better representations
                              ↓
Iteration 3: Layer 9 features → Even better clustering → Train → Best representations!
```

Each iteration discovers more meaningful "hidden units" - from crude phonetic categories to fine-grained acoustic patterns!

---

# Part 2: Hands-on: Fine-tuning for Music Genre Classification

Now let's put everything into practice! We'll fine-tune a pre-trained audio model to classify music genres.

## 2.1 The GTZAN Dataset

**GTZAN** is a famous music classification dataset:
- **1,000 songs** (30 seconds each)
- **10 genres**: Blues, Classical, Country, Disco, Hip-hop, Jazz, Metal, Pop, Reggae, Rock
- **Balanced dataset**: 100 songs per genre

This is a challenging task - even humans sometimes disagree on genre classification!

In [None]:
# Load the GTZAN dataset
from datasets import load_dataset

gtzan = load_dataset("marsyas/gtzan", "all")
print("Original dataset:")
print(gtzan)

# Note: One recording is corrupted, so we have 999 instead of 1000

In [None]:
# Create train/test split (90/10)
gtzan = gtzan["train"].train_test_split(seed=42, shuffle=True, test_size=0.1)
print("\nAfter split:")
print(gtzan)
print(f"\nTraining samples: {len(gtzan['train'])}")
print(f"Test samples: {len(gtzan['test'])}")

In [None]:
# Explore a sample
sample = gtzan["train"][0]
print("Sample structure:")
print(f"Keys: {sample.keys()}")
print(f"\nGenre (as integer): {sample['genre']}")
print(f"Sampling rate: {sample['audio']['sampling_rate']} Hz")
print(f"Duration: {len(sample['audio']['array']) / sample['audio']['sampling_rate']:.2f} seconds")

# Get genre label
id2label_fn = gtzan["train"].features["genre"].int2str
print(f"Genre (as label): {id2label_fn(sample['genre'])}")

In [None]:
# Listen to a few examples from different genres
import IPython.display as ipd

print("Listen to different music genres:\n")

for _ in range(3):
    example = gtzan["train"].shuffle()[0]
    genre_label = id2label_fn(example["genre"])
    print(f"Genre: {genre_label}")
    display(ipd.Audio(example["audio"]["array"], rate=example["audio"]["sampling_rate"]))
    print("-" * 50)

## 2.2 Choosing a Pre-trained Model: DistilHuBERT

For this task, we'll use **DistilHuBERT**:
- **Distilled version** of HuBERT (73% faster!)
- Pre-trained on LibriSpeech (speech data)
- **Transfer learning**: We adapt it from speech → music
- Compact enough for Google Colab free tier

### Model Architecture Overview

```
Input: Raw Waveform (16kHz)
    ↓
CNN Feature Extractor (7 conv layers)
    ↓
Transformer Encoder (2 layers in DistilHuBERT vs. 12 in HuBERT Base, 768 hidden dim)
    ↓
Classification Head (768 → C classes)
    ↓
Output: Genre Predictions
```
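
As a sanity check on the diagram, the sketch below estimates how many frames the CNN feature extractor hands to the Transformer. It assumes DistilHuBERT uses the same kernel sizes and strides as the wav2vec 2.0 / HuBERT Base convolutional stack; verify against `model.config.conv_kernel` and `model.config.conv_stride` before relying on the numbers.

```python
import math

# Assumed kernel sizes and strides of the 7 convolutional layers.
CONV_KERNELS = (10, 3, 3, 3, 3, 2, 2)
CONV_STRIDES = (5, 2, 2, 2, 2, 2, 2)


def num_output_frames(num_samples):
    """Sequence length after the conv stack (no padding)."""
    length = num_samples
    for kernel, stride in zip(CONV_KERNELS, CONV_STRIDES):
        length = (length - kernel) // stride + 1
    return length


def receptive_field():
    """Number of input samples that influence a single output frame."""
    field = 1
    for kernel, stride in zip(reversed(CONV_KERNELS), reversed(CONV_STRIDES)):
        field = (field - 1) * stride + kernel
    return field


sampling_rate = 16_000
hop = math.prod(CONV_STRIDES)  # samples between consecutive frames

frames_1s = num_output_frames(sampling_rate)
frames_30s = num_output_frames(30 * sampling_rate)
print(f"frames for 1 s: {frames_1s}")
print(f"frames for 30 s: {frames_30s}")
print(f"receptive field: {receptive_field() / sampling_rate * 1000:.0f} ms")
print(f"hop: {hop / sampling_rate * 1000:.0f} ms")
```

Under these assumptions each frame sees 25 ms of audio and consecutive frames are 20 ms apart, so a 30-second clip becomes a sequence of roughly 1,500 frames for the Transformer.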

![Fine-tuning Process](https://huggingface.co/datasets/huggingface-course/audio-course-images/resolve/main/wav2vec2-ctc.png)

*Fine-tuning: We freeze or update the pre-trained encoder and train a new classification head (not CTC)*

## 2.3 Preprocessing for DistilHuBERT

Key preprocessing steps:
1. **Resample** to 16 kHz (model's expected rate)
2. **Normalize** audio (zero mean, unit variance)
3. **Truncate** to maximum length (30 seconds)
4. Generate **attention masks** for batching
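
Steps 3 and 4 can be illustrated without any library. This is a simplified sketch, not the feature extractor's actual code: `truncate_and_pad` is a hypothetical helper, and the toy sampling rate of 4 samples per second is chosen only so the lists stay readable.

```python
def truncate_and_pad(waveforms, max_length, pad_value=0.0):
    """Clip each waveform to `max_length` samples, pad the batch to a common
    length, and return a mask with 1 for real samples and 0 for padding."""
    longest = min(max(len(w) for w in waveforms), max_length)
    input_values, attention_mask = [], []
    for waveform in waveforms:
        clipped = list(waveform[:max_length])
        num_pad = longest - len(clipped)
        input_values.append(clipped + [pad_value] * num_pad)
        attention_mask.append([1] * len(clipped) + [0] * num_pad)
    return input_values, attention_mask


# Toy setting: 4 samples per second, clips limited to 2 seconds.
toy_sampling_rate = 4
toy_max_duration = 2.0
max_length = int(toy_sampling_rate * toy_max_duration)

clips = [[0.1] * 5, [0.2] * 10, [0.3] * 8]  # short, too long, exactly right
input_values, attention_mask = truncate_and_pad(clips, max_length)

for values, mask in zip(input_values, attention_mask):
    print(len(values), mask)
```

The attention mask lets the model ignore the padded tail of the short clip, so batching clips of different lengths does not change what the model "hears".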

In [None]:
# Load the feature extractor for DistilHuBERT
from transformers import AutoFeatureExtractor

model_id = "ntu-spml/distilhubert"
feature_extractor = AutoFeatureExtractor.from_pretrained(
    model_id, 
    do_normalize=True, 
    return_attention_mask=True
)

print(f"Feature extractor loaded!")
print(f"Expected sampling rate: {feature_extractor.sampling_rate} Hz")
print(f"Normalization: {feature_extractor.do_normalize}")
print(f"Return attention mask: {feature_extractor.return_attention_mask}")

In [None]:
# Resample the dataset to 16kHz
from datasets import Audio

sampling_rate = feature_extractor.sampling_rate
gtzan = gtzan.cast_column("audio", Audio(sampling_rate=sampling_rate))

print(f"Dataset resampled to {sampling_rate} Hz")
print(f"Example sampling rate: {gtzan['train'][0]['audio']['sampling_rate']} Hz")

In [None]:
# Understand feature normalization with an example
import numpy as np

sample = gtzan["train"][0]["audio"]

print("Before feature extraction:")
print(f"Mean: {np.mean(sample['array']):.6f}")
print(f"Variance: {np.var(sample['array']):.6f}")
print(f"Range: [{sample['array'].min():.3f}, {sample['array'].max():.3f}]")

# Apply feature extractor
inputs = feature_extractor(sample["array"], sampling_rate=sample["sampling_rate"])

print("\nAfter feature extraction:")
print(f"Keys: {list(inputs.keys())}")
print(f"Mean: {np.mean(inputs['input_values']):.9f}")
print(f"Variance: {np.var(inputs['input_values']):.6f}")
print(f"Range: [{np.min(inputs['input_values']):.3f}, {np.max(inputs['input_values']):.3f}]")
print("\nAudio normalized to zero mean and unit variance!")

In [None]:
# Define preprocessing function for the entire dataset
max_duration = 30.0

def preprocess_function(examples):
    """Preprocess audio examples for DistilHuBERT"""
    audio_arrays = [x["array"] for x in examples["audio"]]
    inputs = feature_extractor(
        audio_arrays,
        sampling_rate=feature_extractor.sampling_rate,
        max_length=int(feature_extractor.sampling_rate * max_duration),
        truncation=True,
        return_attention_mask=True,
    )
    return inputs

# Apply preprocessing to the dataset
print("Preprocessing dataset... (this may take a few minutes)")
gtzan_encoded = gtzan.map(
    preprocess_function,
    remove_columns=["audio", "file"],
    batched=True,
    batch_size=100,
    num_proc=1,
)

print("\nPreprocessing complete!")
print(gtzan_encoded)

In [None]:
# Rename 'genre' to 'label' for the Trainer
gtzan_encoded = gtzan_encoded.rename_column("genre", "label")

# Create label mappings
id2label = {
    str(i): id2label_fn(i) 
    for i in range(len(gtzan_encoded["train"].features["label"].names))
}
label2id = {v: k for k, v in id2label.items()}

print("Label mappings:")
for i in range(len(id2label)):
    print(f"{i}: {id2label[str(i)]}")

print(f"\nDataset ready for training!")
print(f"Features: {gtzan_encoded['train'].column_names}")

## 2.4 Loading the Model for Fine-tuning

Now we load the pre-trained DistilHuBERT model and add a classification head on top for our 10 genres.

### What Happens During Loading:
1. **Download pre-trained weights** from Hugging Face Hub
2. **Remove the original head** (designed for speech tasks)
3. **Add new classification head** (768 hidden → 10 genres)
4. **Initialize new head** with random weights
5. Keep pre-trained encoder weights (transfer learning!)
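
Steps 3 and 4 can be pictured with a small pure-Python sketch of a randomly initialised head. It is deliberately simplified: frames are mean-pooled and passed through a single linear layer, whereas the head that Transformers actually builds may include an intermediate projection. The `frame_features` are random stand-ins for encoder outputs.

```python
import math
import random


def init_head(hidden_dim, num_classes, seed=0):
    """Random weights and zero biases for a brand-new linear head."""
    rng = random.Random(seed)
    weights = [
        [rng.gauss(0.0, 0.02) for _ in range(hidden_dim)]
        for _ in range(num_classes)
    ]
    bias = [0.0] * num_classes
    return weights, bias


def classify(frame_features, weights, bias):
    """Mean-pool over time, apply the linear layer, return class probabilities."""
    num_frames = len(frame_features)
    pooled = [sum(column) / num_frames for column in zip(*frame_features)]
    logits = [
        sum(w * x for w, x in zip(row, pooled)) + b
        for row, b in zip(weights, bias)
    ]
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


hidden_dim, num_classes = 768, 10
weights, bias = init_head(hidden_dim, num_classes)

# Stand-in for encoder output: 5 frames of 768-dimensional features.
rng = random.Random(1)
frame_features = [[rng.gauss(0.0, 1.0) for _ in range(hidden_dim)] for _ in range(5)]

probabilities = classify(frame_features, weights, bias)
print([round(p, 3) for p in probabilities])
```

Before any fine-tuning the probabilities sit close to 1/10 each: the encoder already produces useful features, but the new head has not yet learned how to map them to genres.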

In [None]:
# Load model with classification head
from transformers import AutoModelForAudioClassification

num_labels = len(id2label)

model = AutoModelForAudioClassification.from_pretrained(
    model_id,
    num_labels=num_labels,
    label2id=label2id,
    id2label=id2label,
)

print(f"Model loaded!")
print(f"Model: {model_id}")
print(f"Number of labels: {num_labels}")
print(f"Total parameters: {model.num_parameters():,}")
print(f"\nModel architecture:")
print(model)

## 2.5 Setting Up Training

We'll use the 🤗 Transformers `Trainer` - a high-level API that handles:
- Training loop
- Gradient computation and optimization
- Evaluation
- Logging and checkpointing
- Mixed precision training (FP16)
- Automatic model uploading to Hub

### Training Hyperparameters

Key hyperparameters to consider:
- **Learning rate**: 5e-5 (standard for fine-tuning)
- **Batch size**: 8 (adjust based on GPU memory)
- **Epochs**: 10 (balance between training time and performance)
- **Warmup**: 10% of steps (gradual learning rate increase)
- **Evaluation strategy**: Every epoch
- **FP16**: Mixed precision for faster training
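
To see what a 10% warmup means in steps, here is a rough sketch of the learning-rate curve. It assumes a linear warmup followed by linear decay, a single GPU, no gradient accumulation, and 899 training clips; the exact rounding used by `Trainer` may differ slightly.

```python
import math


def linear_schedule_with_warmup(step, total_steps, warmup_steps, peak_lr):
    """Learning rate at a given optimizer step (linear warmup, linear decay)."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


train_samples = 899  # assumed size of the training split
batch_size = 8
num_epochs = 10
warmup_ratio = 0.1
peak_lr = 5e-5

steps_per_epoch = math.ceil(train_samples / batch_size)
total_steps = steps_per_epoch * num_epochs
warmup_steps = round(warmup_ratio * total_steps)

print(f"steps per epoch: {steps_per_epoch}")
print(f"total steps: {total_steps}, warmup steps: {warmup_steps}")
for step in (0, warmup_steps // 2, warmup_steps, total_steps // 2, total_steps):
    lr = linear_schedule_with_warmup(step, total_steps, warmup_steps, peak_lr)
    print(f"step {step:>4}: lr = {lr:.2e}")
```

With these settings the warmup covers roughly the first epoch, which keeps the randomly initialised head from pushing large gradients into the pre-trained encoder at the start.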

In [None]:
# Define training arguments
from transformers import TrainingArguments

model_name = model_id.split("/")[-1]
batch_size = 8
gradient_accumulation_steps = 1
num_train_epochs = 10

training_args = TrainingArguments(
    output_dir=f"{model_name}-finetuned-gtzan",
    evaluation_strategy="epoch",  # newer transformers releases rename this to eval_strategy
    save_strategy="epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=num_train_epochs,
    warmup_ratio=0.1,
    logging_steps=5,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    fp16=True,  # Mixed precision training
    push_to_hub=False,  # Set to True if you want to push to Hub
)

print("Training arguments configured!")
print(f"Output directory: {training_args.output_dir}")
print(f"Training epochs: {training_args.num_train_epochs}")
print(f"Batch size: {training_args.per_device_train_batch_size}")
print(f"Learning rate: {training_args.learning_rate}")

In [None]:
# Define evaluation metric
import evaluate
import numpy as np

metric = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    """Computes accuracy on a batch of predictions"""
    predictions = np.argmax(eval_pred.predictions, axis=1)
    return metric.compute(predictions=predictions, references=eval_pred.label_ids)

print("Evaluation metric (accuracy) loaded!")

In [None]:
# Initialize the Trainer
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=gtzan_encoded["train"],
    eval_dataset=gtzan_encoded["test"],
    tokenizer=feature_extractor,  # newer transformers releases call this argument processing_class
    compute_metrics=compute_metrics,
)

print("Trainer initialized!")
print(f"Training samples: {len(gtzan_encoded['train'])}")
print(f"Evaluation samples: {len(gtzan_encoded['test'])}")
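# The last partial batch still counts as a step, so round up
# (assumes a single device and gradient_accumulation_steps = 1)
import math

steps_per_epoch = math.ceil(len(gtzan_encoded["train"]) / batch_size)
print(f"Steps per epoch: {steps_per_epoch}")
print(f"Total training steps: {steps_per_epoch * num_train_epochs}")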

## 2.6 Training the Model

⚠️ **Note**: Full training takes ~1 hour on a T4 GPU. For the hands-on session, you can:
- Reduce `num_train_epochs` to 2-3 for faster demo
- Use a pre-trained checkpoint (see next section)
- Discuss the training process while showing pre-computed results

### What Happens During Training:

1. **Forward Pass**: Audio → CNN → Transformer → Classification head → Predictions
2. **Loss Computation**: Compare predictions with true labels (Cross-Entropy)
3. **Backward Pass**: Compute gradients via backpropagation
4. **Optimizer Step**: Update weights (AdamW optimizer)
5. **Evaluation**: After each epoch, evaluate on test set
6. **Checkpointing**: Save best model based on accuracy
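
Steps 1 to 4 can be traced by hand on a toy linear classifier. This sketch swaps in plain SGD for AdamW to keep the arithmetic visible, and the two-class `weights`, `features`, and `label` are invented for illustration.

```python
import math


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_step(weights, features, label, lr=0.1):
    """One forward pass, loss computation, backward pass, and SGD update."""
    # Forward pass: linear layer followed by softmax.
    logits = [sum(w * x for w, x in zip(row, features)) for row in weights]
    probs = softmax(logits)
    # Loss: cross-entropy against the true label.
    loss = -math.log(probs[label])
    # Backward pass: dLoss/dlogit_c = p_c - 1[c == label].
    for c, row in enumerate(weights):
        grad_logit = probs[c] - (1.0 if c == label else 0.0)
        # Optimizer step (plain SGD here, not AdamW).
        for d in range(len(row)):
            row[d] -= lr * grad_logit * features[d]
    return loss


weights = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]  # 2 classes, 3 input features
features, label = [1.0, 0.5, -0.5], 1

losses = [train_step(weights, features, label) for _ in range(5)]
print([round(loss, 4) for loss in losses])
```

The first loss equals `log(2)` because an untrained two-class model guesses 50/50, and each update lowers it. `Trainer` runs the same loop over batches, with AdamW, mixed precision, and evaluation added on top.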

In [None]:
# Train the model
# UNCOMMENT THE LINE BELOW TO START TRAINING
# trainer.train()

In [None]:
# For demonstration purposes, we'll show expected training results:
print("Expected Training Results (10 epochs):")
print("=" * 60)
print("| Epoch | Train Loss | Val Loss | Accuracy |")
print("|-------|------------|----------|----------|")
print("|   1   |   1.73     |   1.80   |   0.44   |")
print("|   2   |   1.24     |   1.30   |   0.64   |")
print("|   3   |   0.98     |   0.99   |   0.70   |")
print("|   4   |   0.69     |   0.75   |   0.79   |")
print("|   5   |   0.45     |   0.62   |   0.81   |")
print("|   6   |   0.30     |   0.54   |   0.83   | <- Best")
print("|   7   |   0.22     |   0.63   |   0.78   |")
print("|   8   |   0.31     |   0.59   |   0.81   |")
print("|   9   |   0.16     |   0.54   |   0.83   |")
print("|  10   |   0.12     |   0.57   |   0.82   |")
print("=" * 60)
print("\nBest validation accuracy: 83%")
print("Training time: ~60 minutes on T4 GPU")

## 2.7 Using a Pre-trained Model for Inference

Instead of fine-tuning the model ourselves, we can use an already fine-tuned model from the Hub!

### Using the Pipeline API

The simplest way to use a model is through the `pipeline()` API - a high-level interface for inference.

In [None]:
# Load a pre-trained music classification model from the Hub
from transformers import pipeline

# Use a fine-tuned model (this is the model from the HF course)
pipe = pipeline(
    "audio-classification",
    model="sanchit-gandhi/distilhubert-finetuned-gtzan"
)

print("Pre-trained model loaded!")
print(f"Model: {pipe.model.name_or_path}")
print(f"Task: {pipe.task}")

In [None]:
# Test on samples from our dataset
print("Testing the model on GTZAN samples:\n")

for i in range(3):
    # Get a random sample
    example = gtzan["test"].shuffle()[i]
    true_genre = id2label_fn(example["genre"])
    
    # Make prediction
    predictions = pipe(example["audio"]["array"])
    
    print(f"Sample {i+1}:")
    print(f"  True genre: {true_genre}")
    print(f"  Top 3 predictions:")
    for pred in predictions[:3]:
        print(f"    - {pred['label']}: {pred['score']:.2%}")
    print()
    
    # Play the audio
    display(ipd.Audio(example["audio"]["array"], rate=example["audio"]["sampling_rate"]))
    print("-" * 60)

## 2.8 Model Analysis and Interpretability

Let's analyze what the model learned and where it makes mistakes.

### Understanding Model Performance

**Confusion Matrix**: Shows which genres get confused with each other.

**Common confusions**:
- Rock ↔ Metal (similar instrumentation)
- Jazz ↔ Blues (overlapping styles)
- Disco ↔ Pop (similar production)
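
To show how a confusion matrix is put together, here is a minimal pure-Python sketch. The three genres and the six predictions are hypothetical, and `build_confusion_matrix` is a helper written for this example rather than a library function.

```python
def build_confusion_matrix(true_labels, predicted_labels, classes):
    """Rows are true classes, columns are predicted classes."""
    index = {name: i for i, name in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for true, predicted in zip(true_labels, predicted_labels):
        matrix[index[true]][index[predicted]] += 1
    return matrix


# Hypothetical labels for six clips.
classes = ["metal", "rock", "jazz"]
true_labels = ["rock", "rock", "metal", "metal", "jazz", "rock"]
predicted_labels = ["rock", "metal", "metal", "rock", "jazz", "rock"]

toy_cm = build_confusion_matrix(true_labels, predicted_labels, classes)
toy_accuracy = sum(toy_cm[i][i] for i in range(len(classes))) / len(true_labels)

print("true \\ pred", classes)
for name, row in zip(classes, toy_cm):
    print(f"{name:>11}", row)
print(f"accuracy: {toy_accuracy:.2f}")
```

Correct predictions land on the diagonal. The off-diagonal counts between `rock` and `metal` are the kind of symmetric confusion listed above.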

In [None]:
# Analyze predictions on the entire test set
from sklearn.metrics import classification_report, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns

print("Making predictions on test set...")
all_predictions = []
all_labels = []

for example in gtzan["test"]:
    pred = pipe(example["audio"]["array"])
    predicted_label = pred[0]["label"]
    true_label = id2label_fn(example["genre"])
    
    all_predictions.append(predicted_label)
    all_labels.append(true_label)

# Classification report
print("\nClassification Report:")
print("=" * 70)
print(classification_report(all_labels, all_predictions, zero_division=0))

# Confusion matrix
print("\nConfusion Matrix:")
genres = sorted(set(all_labels))
cm = confusion_matrix(all_labels, all_predictions, labels=genres)

plt.figure(figsize=(12, 10))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', 
            xticklabels=genres, yticklabels=genres)
plt.title('Music Genre Classification - Confusion Matrix', fontsize=14, pad=20)
plt.xlabel('Predicted Genre', fontsize=12)
plt.ylabel('True Genre', fontsize=12)
plt.xticks(rotation=45, ha='right')
plt.yticks(rotation=0)
plt.tight_layout()
plt.show()

---

## Practical Exercises

### Exercise 1: Error Analysis

**Goal**: Find and analyze misclassified examples to understand model limitations.

**Tasks**:
- Find 5 misclassified examples from the test set
- Listen to them - do you agree with the model or the label?
- What makes classification difficult for these examples?

In [None]:
# Exercise 1: Error Analysis
# Find misclassified examples and listen to them

# TODO: Create an empty list to store misclassified examples

# TODO: Loop through all_predictions and all_labels using enumerate
# Hint: for i, (pred, true) in enumerate(zip(all_predictions, all_labels)):
    # TODO: Check if pred != true
        # TODO: Append (i, pred, true) to the misclassified list
        # TODO: Break when you have 5 misclassified examples

print("Misclassified Examples:\n")
print("=" * 70)

# TODO: Loop through the misclassified list
    # TODO: Get the example from gtzan["test"] using the index
    # TODO: Print the example number, true genre, and predicted genre
    # TODO: Display the audio using ipd.Audio
    # TODO: Print a separator line

# Reflection questions
print("\nReflection Questions:")
print("1. Do you agree with the true labels or the model's predictions?")
print("2. What audio characteristics might have confused the model?")
print("3. Are there genres that sound similar to each other?")

### Exercise 2: Model Comparison

**Goal**: Compare different pre-trained models to see which performs best.

**Tasks**:
- Try a different model from the Hub (e.g., "facebook/wav2vec2-base-gtzan")
- Make predictions on the same test samples
- Compare predictions between models

In [None]:
# Exercise 2: Model Comparison
# Try a different model and compare predictions

from transformers import pipeline

# TODO: Load an alternative model using pipeline
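# Some options (check that each ID exists on the Hub before loading):
# - "facebook/wav2vec2-base-gtzan"
# - "MIT/ast-finetuned-audioset-10-10-0.4593" (predicts AudioSet classes, not the 10 GTZAN genres)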
# Uncomment and modify to use a real alternative model:
# pipe2 = pipeline("audio-classification", model="MODEL_NAME_HERE")

print("Comparing Model Predictions:\n")
print("=" * 70)

# TODO: Test on 3 random examples
# Hint: for i in range(3):
    # TODO: Get a random sample from gtzan["test"].shuffle()
    # TODO: Get the true genre label using id2label_fn
    
    # TODO: Make predictions from both models (pipe and pipe2)
    
    # TODO: Print the sample number, true genre, and top predictions from both models
    # TODO: Check if the models agree or disagree
    # TODO: Display the audio
    # TODO: Print a separator line

print("\nTry loading different models from the Hub and see which performs best!")

### Exercise 3: Feature Visualization

**Goal**: Visualize spectrograms of correctly vs incorrectly classified songs.

**Tasks**:
- Find one correctly classified and one misclassified example
- Visualize their mel spectrograms side-by-side
- Look for visual differences that might explain the model's behavior

In [None]:
# Exercise 3: Feature Visualization
# Compare spectrograms of correct vs incorrect predictions

import librosa
import librosa.display

# TODO: Find one correct and one incorrect prediction
# Hint: Initialize correct_idx = None and incorrect_idx = None
# TODO: Loop through all_predictions and all_labels using enumerate
    # TODO: If pred == true and correct_idx is None, save the index
    # TODO: If pred != true and incorrect_idx is None, save the index
    # TODO: Break when you have both indices

# Visualize spectrograms
fig, axes = plt.subplots(2, 1, figsize=(14, 10))

# TODO: For the correct prediction:
    # TODO: Get the example from gtzan["test"]
    # TODO: Get the true label
    # TODO: Compute mel spectrogram using librosa.feature.melspectrogram
    # TODO: Convert to dB using librosa.power_to_db
    # TODO: Display using librosa.display.specshow on axes[0]
    # TODO: Set title to show it's correctly classified

# TODO: For the incorrect prediction:
    # TODO: Get the example from gtzan["test"]
    # TODO: Get the true label and predicted label
    # TODO: Compute mel spectrogram
    # TODO: Convert to dB
    # TODO: Display on axes[1]
    # TODO: Set title to show true vs predicted labels

plt.tight_layout()
plt.show()

#### Analysis Questions

1. Do you see visual differences in frequency patterns?
2. Are there specific frequency ranges that look similar/different?
3. How does the temporal structure (over time) differ?
4. What acoustic features might explain the misclassification?

### Exercise 4: Test Your Own Music

**Goal**: Upload your own music file and see how the model classifies it!

**Tasks**:
- Upload a music file (or use `librosa.ex()` examples)
- Run classification and see the top predictions
- Discuss whether the predictions make sense

In [None]:
# Exercise 4: Test Your Own Music
# Load and classify your own audio file

import librosa
import librosa.display

# Option 1: Use librosa example files
# You can try: 'brahms', 'choice', 'fishin', 'nutcracker', 'trumpet'
audio_path = librosa.ex('brahms')  # Classical music example

# Option 2: Upload your own file (uncomment and modify path)
# audio_path = "path/to/your/music/file.mp3"

# TODO: Load the audio using librosa.load
# Hint: audio_array, sr = librosa.load(audio_path, sr=16000, duration=30.0)

print("Analyzing Your Music...\n")
print("=" * 70)

# TODO: Make prediction using pipe

# TODO: Print the top 5 genre predictions
# Hint: Loop through predictions[:5] and print label and score
# Optional: Create a bar visualization using "█" * int(pred['score'] * 50)

print("\n" + "=" * 70)

# TODO: Display the audio using ipd.Audio

# TODO: Visualize the mel spectrogram
# Hint: Use librosa.feature.melspectrogram and librosa.power_to_db
# Then display with librosa.display.specshow

#### Discussion
1. Does the top prediction match your expectation?
2. Are there any surprising predictions in the top 5?
3. What visual features in the spectrogram might support the prediction?
4. Try different audio files and see how predictions change!

---

## Exercise Summary

Great work! You've now:
- **Exercise 1**: Analyzed model errors and found patterns
- **Exercise 2**: Compared different models
- **Exercise 3**: Visualized spectrograms to understand features
- **Exercise 4**: Tested the model on your own music

**Key Takeaways:**
- Models make mistakes on ambiguous or genre-blending songs
- Different models can have different strengths and weaknesses
- Visual features (spectrograms) reveal why models make certain predictions
- Real-world music often doesn't fit neatly into single genres!

---

# Final Summary: Deep Learning for Audio

## What We Learned About Deep Learning

### 1. **Transfer Learning**
- Pre-trained models (HuBERT, Wav2Vec2) learn general audio representations
- Fine-tuning adapts them to specific tasks (speech to music)
- Much faster and better than training from scratch!

### 2. **Model Architecture**
```
Raw Audio → CNN Feature Extractor → Transformer Encoder → Task Head → Output
```
- **CNNs**: Extract local patterns from waveforms
- **Transformers**: Capture long-range dependencies via self-attention
- **Classification Head**: Task-specific layer (e.g., genre prediction)

### 3. **Training Process**
1. Load pre-trained weights (transfer learning)
2. Add task-specific head
3. Fine-tune on target dataset
4. Evaluate and iterate

### 4. **Key Insights**

| Aspect | Key Point |
|--------|-----------|
| **Data** | With a pre-trained encoder, a small dataset goes a long way (899 training samples reached 83% accuracy!) |
| **Preprocessing** | Normalization crucial for stable training |
| **Architecture** | Transformers excel at capturing temporal patterns |
| **Evaluation** | Balanced dataset means accuracy is meaningful |
| **Transfer Learning** | Speech models adapt well to music! |

## Advanced Topics to Explore

- **Data Augmentation**: Time stretching, pitch shifting, noise addition
- **Multi-label Classification**: Songs with multiple genres
- **Zero-shot Classification**: Using CLAP or other contrastive models
- **Attention Visualization**: Understanding what the model focuses on
- **Model Compression**: Quantization, pruning, distillation
- **Real-time Inference**: Optimizing for production deployment

## Key Differences: Audio vs Other Domains

| Aspect | Audio | Vision | NLP |
|--------|-------|--------|-----|
| **Input** | 1D waveform or 2D spectrogram | 2D image | 1D token sequence |
| **Key Challenge** | Temporal dynamics | Spatial patterns | Sequential dependencies |
| **Pre-training** | Contrastive/Masked prediction | Image classification | Masked language modeling |
| **Data Rate** | 16,000 samples/sec | Fixed resolution | Variable length |

## Resources for Further Learning

- [Hugging Face Audio Course](https://huggingface.co/learn/audio-course) - Complete course on audio ML
- [Papers with Code - Audio Classification](https://paperswithcode.com/task/audio-classification) - Latest research
- [Hugging Face Hub - Audio Models](https://huggingface.co/models?pipeline_tag=audio-classification) - Pre-trained models
- [ESC-50 Dataset](https://github.com/karolpiczak/ESC-50) - Environmental sound classification
- [AudioSet](https://research.google.com/audioset/) - Large-scale audio dataset

---

## Congratulations!

You've completed a comprehensive hands-on session covering:
- Audio data fundamentals (sampling, representations)
- Visualization techniques (waveform, spectrogram, mel spectrogram)
- Dataset loading and preprocessing
- Feature extraction for ML models
- Deep learning with transformers
- Fine-tuning for music classification
- Model evaluation and analysis

**You now have the tools to work with audio in your own ML projects!**