# Session 3: Core Machine Learning Concepts for Music and Audio

**Agenda:**

- Statistical Basics of Generative Modeling in Artificial Intelligence
- Variational Autoencoders, expressing loss
- **Hands On:**
    1. Studying datasets: Lakh MIDI and Free Music Archive
    2. Use and train RAVE models

## Statistical Basics of Generative Modeling in Artificial Intelligence

![](./assets/discriminative_vs_generative.png)

-> In a **Discriminative task**, we want to learn how to go from a *sample* $x$
to a *label* $y$. There is one ideal output that we want to model.

-> In a **Generative task**, we want to learn how to go from a *label* $y$ to a
*sample* $x$. There are many different possible outputs. **This involves
modeling a distribution properly to generate *new samples***.


### Generative AI is Distribution Estimation

![](./assets/data_distribution_estimation.png)

-> In our estimated data distribution, high probability is assigned to data that
is *likely* to be found in the initial distribution, while low probability is
assigned to data that is *unlikely* to be found in it.
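
As a small illustration, here is a sketch that estimates a distribution from samples and checks that typical values get a higher density than atypical ones. It assumes a toy 1-D Gaussian standing in for a real data distribution; the numbers, the seed, and the helper name `gaussian_log_density` are made up for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "dataset": 10,000 samples from a distribution we pretend not to know
# (secretly a Gaussian with mean 120 and std 15, e.g. tempos in BPM)
data = rng.normal(loc=120.0, scale=15.0, size=10_000)

# Estimate the distribution by fitting a Gaussian (maximum likelihood estimates)
mu_hat = data.mean()
sigma_hat = data.std()


def gaussian_log_density(x, mu, sigma):
    """Log-density of a 1-D Gaussian evaluated at x."""
    return -0.5 * np.log(2 * np.pi * sigma**2) - (x - mu) ** 2 / (2 * sigma**2)


# A typical value gets a much higher (log-)density than an atypical one
likely = gaussian_log_density(118.0, mu_hat, sigma_hat)
unlikely = gaussian_log_density(300.0, mu_hat, sigma_hat)
print(f"estimated mean={mu_hat:.1f}, std={sigma_hat:.1f}")
print(f"log p(118)={likely:.2f}, log p(300)={unlikely:.2f}")
```

Real audio or MIDI data is far too high-dimensional for a single Gaussian, which is why the rest of this session uses neural networks to do the estimation.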

### Distribution Sampling

![](./assets/distribution_sampling.png)

-> Generative AI allows us to map from a *simple distribution* to a more
*complex* data distribution. By sampling from our simple distribution and going
through a series of operations, we can obtain data that matches our target
distribution.
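
A minimal sketch of this idea, assuming a hand-picked transformation rather than a learned one: we sample from a uniform distribution and push each sample through the inverse CDF of an exponential distribution. The target distribution and the `rate` value are arbitrary choices for illustration; in a generative model, the "series of operations" would be a trained neural network.

```python
import numpy as np

rng = np.random.default_rng(0)

# Step 1: sample from a simple distribution (uniform on [0, 1))
u = rng.uniform(size=100_000)

# Step 2: apply a fixed operation to each sample
# (here, the inverse CDF of an exponential distribution with the given rate)
rate = 2.0
x = -np.log1p(-u) / rate

# The transformed samples follow the target distribution:
# Exp(rate) has mean 1 / rate and variance 1 / rate**2
print(f"sample mean: {x.mean():.3f} (target {1 / rate:.3f})")
print(f"sample var:  {x.var():.3f} (target {1 / rate**2:.3f})")
```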

## Variational Autoencoders

In variational autoencoders (VAEs), our *simple distribution* $\pi$ is the prior
$p(z)$ over a latent space.

![](./assets/vae.png)

Our **generator** is parameterized by a set of weights $\theta$, giving us
$p_\theta(x|z)$. Our output data distribution is $p_\theta(x)$.

### Goal: $p_{data}(x) \approx p_\theta(x)$

<div class="alert alert-danger">
We'll be going over some maths that will be helpful once we start looking at
diffusion models.
</div>

<center><img src="./assets/pdata_ptheta.png" width="50%" /></center>

To approximate $p_{data}(x)$, we try to **minimize the *Kullback-Leibler* (KL)
divergence**:

$$
\min_\theta \mathcal{D}_\text{KL} (p_{data} \; || \; p_\theta)
$$

With some maths, we can prove that this is equivalent to **maximizing the
likelihood of our target distribution**.

\begin{align*}
&\phantom{=} \arg \min_\theta \mathcal{D}_\text{KL} (p_{data} \; || \; p_\theta) \\
&= \arg \min_\theta \sum_x p_{data}(x) \log \frac{p_{data}(x)}{p_\theta(x)} \\
&= \arg \min_\theta \sum_x -p_{data}(x) \log p_\theta(x) + const \\
&= \arg \max_\theta \sum_x p_{data}(x) \log p_\theta(x) \\
&= \arg \max_\theta \mathop{\mathbb{E}}_{x \sim p_{data}} \log p_\theta(x)
\end{align*}
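
The equivalence can be checked numerically on a toy example. The sketch below assumes a discrete distribution over three outcomes and a hypothetical one-parameter model family (`model` is invented for this illustration); it verifies that the KL divergence and the expected log-likelihood differ only by a constant, so they are optimized by the same $\theta$.

```python
import numpy as np

# Fixed "data" distribution over 3 outcomes
p_data = np.array([0.6, 0.3, 0.1])


def model(theta):
    """Hypothetical one-parameter model family over the same 3 outcomes."""
    return np.array([theta, (1 - theta) * 0.75, (1 - theta) * 0.25])


# Evaluate both objectives on a grid of candidate parameters
thetas = np.linspace(0.01, 0.99, 99)
kl = np.array([np.sum(p_data * np.log(p_data / model(t))) for t in thetas])
expected_ll = np.array([np.sum(p_data * np.log(model(t))) for t in thetas])

# KL = -entropy(p_data) - expected log-likelihood, and the entropy term
# does not depend on theta (it is the "const" in the derivation)
entropy = -np.sum(p_data * np.log(p_data))
print("differ by a constant:", np.allclose(kl, -entropy - expected_ll))
print("argmin KL:", thetas[kl.argmin()])
print("argmax expected log-likelihood:", thetas[expected_ll.argmax()])
```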

Note: $p_\theta(x)$ is defined as:

$$
p_\theta(x) = \int_z p_\theta(x|z)p(z)dz
$$

But this integral over every possible $z$ is intractable, and so is the true
posterior $p_\theta(z|x)$. So we introduce a controllable distribution
$q_\phi(z|x)$ (the **encoder**, parameterized by weights $\phi$) to approximate
that posterior. This allows us (through some maths) to calculate a **lower
bound** of $\log p_\theta(x)$ called the ELBO (Evidence Lower BOund).

$$
\log p_\theta(x) \geq \mathop{\mathbb{E}}_{z \sim q_\phi(z|x)} \big[ \log p_\theta (x | z) \big] - \mathcal{D}_\text{KL}\big(q_\phi(z|x) \; || \; p(z) \big)
$$

<div class="alert alert-info">

Note:
- The first term is the **reconstruction loss**, which quantifies *how well* the
decoder can reconstruct $x$ from a sampled $z$.
- The second term is the **regularization loss**, which forces the learned
distribution $q_\phi(z|x)$ to be close to a prior $p(z)$ (e.g., a Gaussian).

</div>
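
For reference, one standard way to obtain this bound (a sketch, writing the prior as $p(z)$) is to multiply and divide by $q_\phi(z|x)$ inside the integral and then apply Jensen's inequality to the concave $\log$:

\begin{align*}
\log p_\theta(x) &= \log \int_z p_\theta(x|z) \, p(z) \, dz \\
&= \log \int_z q_\phi(z|x) \, \frac{p_\theta(x|z) \, p(z)}{q_\phi(z|x)} \, dz \\
&= \log \mathop{\mathbb{E}}_{z \sim q_\phi(z|x)} \bigg[ \frac{p_\theta(x|z) \, p(z)}{q_\phi(z|x)} \bigg] \\
&\geq \mathop{\mathbb{E}}_{z \sim q_\phi(z|x)} \bigg[ \log \frac{p_\theta(x|z) \, p(z)}{q_\phi(z|x)} \bigg] \\
&= \mathop{\mathbb{E}}_{z \sim q_\phi(z|x)} \big[ \log p_\theta(x|z) \big] - \mathop{\mathbb{E}}_{z \sim q_\phi(z|x)} \bigg[ \log \frac{q_\phi(z|x)}{p(z)} \bigg] \\
&= \mathop{\mathbb{E}}_{z \sim q_\phi(z|x)} \big[ \log p_\theta(x|z) \big] - \mathcal{D}_\text{KL}\big(q_\phi(z|x) \; || \; p(z) \big)
\end{align*}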

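Now, our model relies on us sampling $z \sim q_\phi(z|x)$, but we cannot
**backpropagate** through a sampling operation! As a result, we use the
**reparameterization trick**: the encoder outputs a mean $\mu$ and a standard
deviation $\sigma$, and we rewrite the sampling step as: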

$$
z = \mu + \sigma \cdot \epsilon, \; \;\epsilon \sim \mathcal{N}(0,1)
$$
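
A minimal PyTorch sketch of this rewrite follows. The values of `mu` and `log_var` are made up and stand in for the outputs of an encoder; having the encoder predict a log-variance rather than $\sigma$ directly is a common convention that we assume here, not something specified above.

```python
import torch

torch.manual_seed(0)

# Hypothetical encoder outputs for one input: mean and log-variance of q(z|x)
mu = torch.tensor([0.5, -1.0], requires_grad=True)
log_var = torch.tensor([0.0, -2.0], requires_grad=True)

# Reparameterization: all the randomness lives in eps, which has no parameters
sigma = torch.exp(0.5 * log_var)
eps = torch.randn_like(sigma)
z = mu + sigma * eps

# Any differentiable function of z can now send gradients back to mu and log_var
loss = (z**2).sum()
loss.backward()

print("z:", z.detach())
print("d loss / d mu:", mu.grad)
print("d loss / d log_var:", log_var.grad)
```

Because `z` is a deterministic function of `mu`, `log_var`, and the noise `eps`, autograd treats `eps` as a constant and the gradients are well defined.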

<center><img src="./assets/vae_final.png"/></center>

Our final loss function is an objective to **minimize**:

$$
\mathcal{L}_{\theta,\phi} = {\color{orange} \mathop{\mathbb{E}}_{x \sim p_{data}(x)} \big[} - \mathop{\mathbb{E}}_{z \sim q_\phi(z|x)} \big[ \log p_\theta(x|z) \big] + \mathcal{D}_\text{KL}\big(q_\phi(z|x) \; || \; p(z)\big) {\color{orange} \big]}
$$
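
To make the two terms concrete, here is a sketch of the loss for a single input. It rests on several assumptions that are not stated above: a diagonal Gaussian $q_\phi(z|x)$, a standard normal prior $p(z)$ (which gives the KL term a closed form), and a Gaussian decoder with unit variance (which turns the reconstruction term into a squared error, up to a constant). The function names and all the numbers are invented for illustration.

```python
import numpy as np


def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) )."""
    return 0.5 * np.sum(mu**2 + np.exp(log_var) - 1.0 - log_var)


def negative_elbo(x, x_hat, mu, log_var):
    """Single-sample estimate of the loss for one input x.

    Assumes a Gaussian decoder with unit variance, so -log p(x|z) is the
    squared error up to an additive constant that does not depend on the weights.
    """
    reconstruction = 0.5 * np.sum((x - x_hat) ** 2)
    regularization = kl_to_standard_normal(mu, log_var)
    return reconstruction + regularization


# Made-up numbers standing in for an input, its reconstruction,
# and the encoder outputs for that input
x = np.array([0.2, -0.1, 0.4])
x_hat = np.array([0.1, 0.0, 0.5])
mu = np.array([0.3, -0.2])
log_var = np.array([-0.5, 0.1])

print("KL term:", kl_to_standard_normal(mu, log_var))
print("loss:", negative_elbo(x, x_hat, mu, log_var))
```

In training, the outer expectation over $x \sim p_{data}(x)$ (in orange above) is approximated by averaging this quantity over a batch of examples.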

## Hands On 1: Datasets

### Dataset 1: Lakh MIDI Dataset (LMD)

Available at https://colinraffel.com/projects/lmd/. We will use the
"lmd-matched" subset, which aligns tracks to records (with metadata) from the
Million Song Dataset.

In [None]:
# Let's create the dataset directory if it doesn't exist
!mkdir -p ../datasets

# Download the dataset and metadata
!wget -O ../datasets/lmd_matched.tar.gz http://hog.ee.columbia.edu/craffel/lmd/lmd_matched.tar.gz
!wget -O ../datasets/msd_summary_file.h5 http://millionsongdataset.com/sites/default/files/AdditionalFiles/msd_summary_file.h5

# Unzip the dataset
!tar -xvzf ../datasets/lmd_matched.tar.gz -C ../datasets

# Remove the tar.gz file after extraction
!rm ../datasets/lmd_matched.tar.gz

In [None]:
# Open the HDF5 file and read the metadata
import h5py

f = h5py.File("../datasets/msd_summary_file.h5", "r")

In [None]:
import pandas as pd
import numpy as np

# Read the metadata into a pandas DataFrame
analysis_df = pd.DataFrame(f["analysis"]["songs"][:])
metadata_df = pd.DataFrame(f["metadata"]["songs"][:])

# Concatenate the two DataFrames (they should be ordered the same)
lmd_df = pd.concat([analysis_df, metadata_df], axis=1)

In [None]:
# The Million Song Dataset contains more tracks than LMD-matched, so we will
# need to filter some out

# First, let's find all of the tracks that are in the LMD-matched dataset
# (the track_ids are the names of the directories)
import os

lmd_matched_dir = "../datasets/lmd_matched"
track_ids = set()
for root, dirs, files in os.walk(lmd_matched_dir):
    # Only add the track name if it is a final directory (with files)
    if len(files) > 0:
        track_ids.add(os.path.basename(root))

# Print the number of tracks in the LMD-matched dataset
print(f"Number of tracks in LMD-matched dataset: {len(track_ids)}")

In [None]:
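# We can filter the DataFrame to only include the tracks that are in
# the LMD-matched dataset. h5py returns the IDs as bytes, so we decode them
# to str first (astype(str) would produce strings like "b'TR...'")
lmd_track_ids = lmd_df["track_id"].str.decode("utf-8")
lmd_df = lmd_df[lmd_track_ids.isin(track_ids)]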

In [None]:
# We can look at the columns of the DataFrame to see what metadata is available
lmd_df.columns

In [None]:
# Let's print how many different artists and albums we have
print("Number of artists:", len(lmd_df["artist_name"].unique()))
print("Number of albums:", len(lmd_df["release"].unique()))

In [None]:
import matplotlib.pyplot as plt

# We can use matplotlib to print some data distributions
fig, ax = plt.subplots(4, 1, figsize=(10, 20))

# Plot the distribution of durations
ax[0].hist(lmd_df["duration"], bins=100)
ax[0].set_title("Duration distribution")
ax[0].set_xlabel("Duration (seconds)")
ax[0].set_ylabel("Count")

# Plot the distribution of tempos
ax[1].hist(lmd_df["tempo"], bins=100)
ax[1].set_title("Tempo distribution")
ax[1].set_xlabel("Tempo (BPM)")
ax[1].set_ylabel("Count")

# Plot the distribution of keys (integers 0-11), with one bin centered on each key
ax[2].hist(lmd_df["key"], bins=np.arange(13) - 0.5)
ax[2].set_title("Key distribution")
ax[2].set_xlabel("Key")
ax[2].set_ylabel("Count")
ax[2].set_xticks(range(12), ["C", "Db", "D", "Eb", "E", "F", "F#", "G", "Ab", "A", "Bb", "B"])

# Plot the distribution of loudness
ax[3].hist(lmd_df["loudness"], bins=100)
ax[3].set_title("Loudness distribution")
ax[3].set_xlabel("Loudness")
ax[3].set_ylabel("Count")

plt.tight_layout()
plt.show()

In [None]:
# Let's listen to a song from the dataset

import midi2audio
from IPython.display import Audio, display

midi2audio_obj = midi2audio.FluidSynth("../session2_setup/assets/soundfont.sf2")
midi2audio_obj.midi_to_audio("../datasets/lmd_matched/G/D/V/TRGDVGJ128F92FFA60/0b75fc85e0d028c29350a0ee9c148ed1.mid", "assets/lmd_example.wav")


display(Audio("assets/lmd_example.wav", rate=44100))

### Dataset 2: Free Music Archive

Available at https://github.com/mdeff/fma. We will use the "fma_small" subset,
but feel free to explore other, larger subsets.

In [None]:
# Let's create the dataset directory if it doesn't exist
!mkdir -p ../datasets

# Download the dataset and metadata
!wget -O ../datasets/fma_small.zip https://os.unil.cloud.switch.ch/fma/fma_small.zip
!wget -O ../datasets/fma_metadata.zip https://os.unil.cloud.switch.ch/fma/fma_metadata.zip

In [None]:
# We unzip the dataset in Python (some `unzip` builds fail on these archives)
import zipfile
import os

# Unzip the dataset
with zipfile.ZipFile("../datasets/fma_small.zip", "r") as zip_ref:
    zip_ref.extractall("../datasets")

# Unzip the metadata
with zipfile.ZipFile("../datasets/fma_metadata.zip", "r") as zip_ref:
    zip_ref.extractall("../datasets")

# Remove the zip files after extraction
os.remove("../datasets/fma_small.zip")
os.remove("../datasets/fma_metadata.zip")

In [None]:
import pandas as pd
import ast

# Load CSV metadata for FMA 
fma_df = pd.read_csv("../datasets/fma_metadata/tracks.csv", index_col=0, header=[0, 1])

# Some logic from FMA's `utils.py`
COLUMNS = [('track', 'tags'), ('album', 'tags'), ('artist', 'tags'),
            ('track', 'genres'), ('track', 'genres_all')]
for column in COLUMNS:
    fma_df[column] = fma_df[column].map(ast.literal_eval)

COLUMNS = [('track', 'date_created'), ('track', 'date_recorded'),
            ('album', 'date_created'), ('album', 'date_released'),
            ('artist', 'date_created'), ('artist', 'active_year_begin'),
            ('artist', 'active_year_end')]
for column in COLUMNS:
    fma_df[column] = pd.to_datetime(fma_df[column])

SUBSETS = ('small', 'medium', 'large')
fma_df['set', 'subset'] = fma_df['set', 'subset'].astype(
    pd.CategoricalDtype(categories=SUBSETS, ordered=True))

COLUMNS = [('track', 'genre_top'), ('track', 'license'),
            ('album', 'type'), ('album', 'information'),
            ('artist', 'bio')]
for column in COLUMNS:
    fma_df[column] = fma_df[column].astype('category')

# Filter the DataFrame to only include the tracks that are in the FMA small dataset
fma_df = fma_df[fma_df["set"]["subset"] == "small"]

In [None]:
fma_df.columns

In [None]:
# Let's print the number of tracks, artists, and albums
print("Number of tracks:", len(fma_df))
print("Number of artists:", len(fma_df["artist"]["name"].unique()))
print("Number of albums:", len(fma_df["album"]["title"].unique()))

# Let's print the number of unique genres
print("Number of unique genres:", len(fma_df["track"]["genre_top"].unique()))

In [None]:
import matplotlib.pyplot as plt

# Again, we can use matplotlib to print some data distributions
fig, ax = plt.subplots(3, 1, figsize=(10, 20))

# Plot the distribution of durations
ax[0].hist(fma_df["track"]["duration"], bins=100)
ax[0].set_title("Duration distribution")
ax[0].set_xlabel("Duration (seconds)")
ax[0].set_ylabel("Count")

# Plot the distribution of release dates
ax[1].hist(fma_df["album"]["date_released"].dt.year, bins=100)
ax[1].set_title("Release date distribution")
ax[1].set_xlabel("Release date (year)")
ax[1].set_ylabel("Count")

# Plot only the distribution of genres

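# Remove unused categories first
genre_series = fma_df["track"]["genre_top"].cat.remove_unused_categories()
n_genres = len(genre_series.cat.categories)

# Center one bin on each category code so the bars line up with the tick labels
ax[2].hist(genre_series.cat.codes, bins=[i - 0.5 for i in range(n_genres + 1)])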
ax[2].set_title("Genre distribution")
ax[2].set_xlabel("Genre")
ax[2].set_ylabel("Count")
ax[2].set_xticks(range(len(genre_series.cat.categories)), genre_series.cat.categories)

plt.tight_layout()
plt.show()

In [None]:
import librosa
from IPython.display import Audio, display

# Let's listen to a song from the dataset
audio_path = "../datasets/fma_small/019/019073.mp3"

y, sr = librosa.load(audio_path, sr=44100)
display(Audio(y, rate=sr))

## Hands On 2: RAVE

![](./assets/rave.png)

### Using RAVE

In [None]:
import huggingface_hub

# We'll start by downloading a model from the Hugging Face Hub
# We'll use the `guitar_picking_dm_b2048_r44100_z8_causal.ts` model
# from `shuoyang-zheng/jaspers-rave-models`

model_path = huggingface_hub.hf_hub_download(
    repo_id="shuoyang-zheng/jaspers-rave-models",
    filename="guitar_picking_dm_b2048_r44100_z8_causal.ts",
    cache_dir="../huggingface_hub_cache",
    force_download=False,
)

In [None]:
import torch

# We'll load our **compiled** model using torch.jit.load

rave = torch.jit.load(model_path)
print(rave) # This should print the model architecture

In [None]:
from torchinfo import summary

# We can use torchinfo to get a nicer looking summary of the model
summary(rave)

In [None]:
import librosa
from IPython.display import Audio, display

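# Let's load a sample audio file, resampled to 44.1 kHz, which we take to be
# the model's sample rate (the `r44100` in its filename)
# nature beautiful bird calls by buzzatsea -- https://freesound.org/s/562864/
# License: Creative Commons 0
y, sr = librosa.load("assets/bird_calls.m4a", sr=44100)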
display(Audio(y, rate=sr))

In [None]:
import torch

# We can use our RAVE model to encode the audio into a latent representation

# First, we need to convert our audio to a PyTorch tensor and reshape it to the
# required shape: (batch_size, n_channels, n_samples)
audio = torch.from_numpy(y).float()
audio = audio.reshape(1, 1, -1)

# Now we can encode the audio using the model's `encode` method
# Note: We need to use torch.no_grad() to avoid tracking gradients
with torch.no_grad():
    latent = rave.encode(audio)

# We can print the shape of the latent representation
# It should be of shape (batch_size, latent_dim, n_latent_codes)

print(latent.shape)

In [None]:
import matplotlib.pyplot as plt

# Just for fun, let's visualize the latent representation
plt.figure(figsize=(10, 4))

plt.imshow(latent[0].detach().numpy(), aspect="auto")
plt.colorbar()
plt.title("Latent Representation")
plt.xlabel("Latent Codes")
plt.ylabel("Latent Dimension")

plt.show()

In [None]:

# Now, let's decode the latent representation back to audio and listen to it
# We can use the `decode` method to decode the latent representation
# This should be again of shape (batch_size, n_channels, n_samples)

with torch.no_grad():
    decoded_audio = rave.decode(latent)
print(decoded_audio.shape) 

# Let's listen to the decoded audio
display(Audio(decoded_audio[0].detach().numpy(), rate=sr))

In [None]:
# Let's play around with the latent representation. One thing we can do is
# to add some noise to the latent representation and see how it affects the
# decoded audio. We'll use some Gaussian noise for this.

noised_latent = latent + torch.randn_like(latent) * 0.5

with torch.no_grad():
    noised_decoded_audio = rave.decode(noised_latent)
    
display(Audio(noised_decoded_audio[0].detach().numpy(), rate=sr))

In [None]:
# How about decoding a completely random latent representation?
# We'll generate a random tensor of the same shape as the latent representation

random_latent = torch.randn_like(latent)

with torch.no_grad():
    random_decoded_audio = rave.decode(random_latent)
    
display(Audio(random_decoded_audio[0].detach().numpy(), rate=sr))

### RAVE Training

Available at https://colab.research.google.com/drive/1aK8K186QegnWVMAhfnFRofk_Jf7BBUxl?usp=sharing.