# Moving Forward: Generating Our Own Dataset

## Mistakes were made

The available dataset was a disastrous failure. There are multiple reasons why a ready-made dataset was not a good fit for our task:
1. **Cleanliness:** The audio provided by datasets is way too perfect and clean. It does not match the prediction data at all, where the prediction data could be the full track, or a source-separated drum-track courtesy of another neural net. I am planning on using demucs v4, which yields the best results for separation quality currently.
2. **Class imbalance:** The dataset had a *horrendous* imbalance coefficient of 49 (meaning that there were 49 times as many snare drums as there were crash cymbals) $$\text{Imbalance coefficient} = \frac{N_{\text{majority}}}{N_{\text{minority}}} = \frac{N_{\text{snare}}}{N_{\text{crash}}} = \frac{125049}{2552} \approx 49.00$$
3. **MIDI mismatch:** After listening to some stitched audio recordings of crash cymbals that I extracted from the dataset, I noticed that some onsets given by the MIDI were badly wrong. Setting a velocity cutoff of `50` helped weed out very quiet "ghost" crash cymbals, but it reduced the total duration of **ALL** available crash cymbals to a mere 7 seconds. It also led me to notice that the velocity value does not accurately represent the volume in the waveform, meaning that there can be loud crashes with a low velocity. Thus, there is an irreparable mismatch between the .wav and .midi files, which I had assumed to be perfectly in line with one another.
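
As a small illustration of the imbalance coefficient from point 2, here is a minimal sketch that computes it from per-class counts. The helper name `imbalance_coefficient` is hypothetical, and only the two counts quoted above are used, since the counts of the other classes are not given here.

```python
def imbalance_coefficient(class_counts):
    """Ratio of the most frequent class count to the least frequent one."""
    counts = list(class_counts.values())
    if not counts or min(counts) <= 0:
        raise ValueError("every class needs at least one example")
    return max(counts) / min(counts)


# Only the two counts quoted in the text; other classes are not listed there.
counts = {"snare": 125049, "crash": 2552}
ratio = imbalance_coefficient(counts)
print(round(ratio, 2))
```

The same helper could be pointed at the label counts of the new dataset later on, to check whether the playlist actually improves the balance.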

All of this leads to one solution which I dreaded even thinking about: **labeling my own dataset.** As scary as it sounds, it might not be as bad as you think. Here is the general outline I will follow to generate my own drumming dataset:
1. Select an audio source separating algorithm.
2. Install a CUDA version compatible with the latest PyTorch (currently CUDA 11.7).
3. Gather several fully instrumented audio files representative of different genres, tempos and styles.
4. Separate the drum tracks from the audio files and save them for the future dataset.
5. Record all drum track *file paths* in a `master.csv` file to be accessed later.
6. Using the .csv, run an onset detection algorithm to find instances of drum notes in each of the drum tracks. I will be using the `librosa` package for this task.
7. Save the onsets as a .csv and register their name in the `master.csv` file.
8. Concatenate the entire list of onsets and verify if you have enough to label.
9. Start the labeling process by using the `pigeon` package to quickly and iteratively go through every onset and give them all a label.

I got the idea of labeling my own data from Medium user [YoshiMan](https://yoshi-man.medium.com), who did a similar task a month ago and published their results in Towards AI.

* It took them 5 hours to manually label **4,513** drum notes, which equates to roughly 900 labels per hour.
* The small dataset I used had **400,000** drum notes, and it was **6GB** of audio in size. 
* The large dataset I intended to use is **130GB** in size and has a colossal drum count of **13.1 million**.
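
To put these numbers side by side, here is a rough back-of-the-envelope sketch of how long manual labeling would take at the reported pace. It assumes the labeling rate stays constant, which is an optimistic simplification; the helper name `hours_to_label` is hypothetical.

```python
# Labeling pace reported above: 4,513 notes in 5 hours.
LABELED_NOTES = 4513
HOURS_SPENT = 5


def hours_to_label(note_count, notes_per_hour):
    """Estimated hours needed to label note_count notes at a constant rate."""
    return note_count / notes_per_hour


rate = LABELED_NOTES / HOURS_SPENT  # roughly 900 labels per hour
small_hours = hours_to_label(400_000, rate)      # the small dataset
large_hours = hours_to_label(13_100_000, rate)   # the large dataset

print(f"rate: {rate:.0f} labels/hour")
print(f"small dataset: {small_hours:.0f} hours")
print(f"large dataset: {large_hours:.0f} hours")
```

Under that assumption, matching even the small dataset by hand would take hundreds of hours, so a hand-labeled dataset will necessarily be far smaller.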

In MIR (music information retrieval) tasks, the prediction data is often vastly different from the audio provided by ready-made datasets, so it is imperative to train on data that closely resembles what the model will see at prediction time. A drum track obtained through a source separation algorithm is a much closer match. The quantity of data will suffer significantly, but the gain in quality should be worth the sacrifice, since it makes the trained model more robust and reliable on real-world input.

## Making The Perfect Playlist for My Model

To ensure that my efforts to label my own dataset are not wasted, I need to choose various tracks spanning multiple genres, tempos and styles. There are several genres that come to my mind:

1. **Punk Rock.** One of the very first issues I came across using a ready-made dataset was that it was incredibly imbalanced. One of the least frequent classes was the crash cymbal, making up a measly `0.64%` of the dataset. Punk Rock has an aggressive style that makes heavy use of crashes, which should remedy this issue.
2. **Power Metal.** A sort of evolution of the Punk Rock style, this can help us gather good data for the genre's infamously cheesy blastbeat-laden fills and choruses. This will be a good source for gathering kick drum data.
3. **Progressive Metal.** A genre I selected due to its complex time signatures, drum patterns and tempo relations. On top of that, it has a rather varied accompaniment that can help bring extra data augmentation into the mix.
4. **Rock and Country.** Fairly average genres in terms of drumming patterns; they help cover the more traditional and predictable drumming.
5. **Hip-hop.** Known for its complex drum rhythms, this will be crucial for training the model to recognize more intricate drumming.

## Creating the Dataset

### Getting Drum Onsets

In [None]:
import subprocess as sp
import torch
import librosa
import os
from IPython.display import Audio, display, Image
from ipywidgets import widgets
import matplotlib.pyplot as plt
import csv
import pandas as pd

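# build the path portably; a bare backslash in the string is an invalid escape and breaks on non-Windows systems
output_path = os.path.join("separated", "htdemucs")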
dataset_name = "HeartsOnFire-v.1.0.0"

This code searches for all mp3 files in the `dataset_name` folder and adds them to the dataset accordingly: splitting the drums, getting the onsets and saving csv data.

In [None]:
files = os.listdir(dataset_name)
mp3_files = [os.path.splitext(f)[0] for f in files if f.endswith('.mp3')]
labels = ['changed', 'kick', 'snare', 'hihat', 'tom', 'crash', 'ride', 'click', 'uncertain', 'other']

def generate_onsets(folder_path):
    drum_path = os.path.join(folder_path, "drums.mp3")
    y, sr = librosa.load(drum_path, sr=44100)

    onset_env = librosa.onset.onset_strength(y=y, sr=sr)
    onset_frames = librosa.onset.onset_detect(onset_envelope=onset_env, sr=sr)
    onset_times = librosa.frames_to_time(onset_frames, sr=sr)
    return onset_times

for filename in mp3_files:
    folder_path = os.path.join(dataset_name, filename)
    src_path = os.path.join(dataset_name, filename+".mp3")

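    #step 1: split
    command = ["demucs", "--mp3", "--mp3-bitrate", "320", "--two-stems=drums", src_path]
    # demucs picks the GPU on its own when one is available, so run it either way;
    # skipping the call on CPU-only machines would make the moves below fail
    device = "GPU" if torch.cuda.is_available() else "CPU"
    print("Generating splits for \""+filename+"\" with "+device+"...")
    sp.run(command, check=True)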

    #step 2: move everything
    dest_path = os.path.join(folder_path, "audio.mp3")
    os.makedirs(os.path.dirname(dest_path), exist_ok=True)
    os.rename(src_path, dest_path)

    src_path = os.path.join("separated/htdemucs", filename, "drums.mp3")
    dest_path = os.path.join(folder_path, "drums.mp3")
    os.makedirs(os.path.dirname(dest_path), exist_ok=True)
    os.rename(src_path, dest_path)

    src_path = os.path.join("separated/htdemucs", filename, "no_drums.mp3")
    dest_path = os.path.join(folder_path, "no_drums.mp3")
    os.rename(src_path, dest_path)
    demucs_path = os.path.join("separated/htdemucs", filename)

    if os.path.exists(demucs_path) and os.path.isdir(demucs_path) and not os.listdir(demucs_path):
        os.rmdir(demucs_path)
    print("Generating onsets...")

    #step 3: generate onsets
    onset_times = generate_onsets(folder_path)
    
    #step 4: save onsets
    csv_file = os.path.join(folder_path, "onsets.csv")
    with open(csv_file, 'w', newline='') as file:
        writer = csv.writer(file)
        writer.writerow(["onset_time"]+labels)
        for onset_time in onset_times:
            writer.writerow([onset_time]+[False]*len(labels))
    print("Done!")


#step 5: update the master csv
audio_info = []

for root, dirs, files in os.walk(dataset_name):
    for dir in dirs:
        processed_folder = os.path.join(root, dir)
        
        if os.path.isdir(processed_folder):
            audio_path = os.path.join(processed_folder, "audio.mp3")
            drums_path = os.path.join(processed_folder, "drums.mp3")
            no_drums_path = os.path.join(processed_folder, "no_drums.mp3")
            onsets_path = os.path.join(processed_folder, "onsets.csv")
            
            if os.path.isfile(audio_path) and os.path.isfile(drums_path) and \
                os.path.isfile(no_drums_path) and os.path.isfile(onsets_path):
                audio_info.append((dir, audio_path, drums_path, no_drums_path, onsets_path))

# write the audio information to a CSV file
with open(os.path.join(dataset_name, "master.csv"), "w", newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(["name", "audio", "drums", "no_drums", "onsets"])
    writer.writerows(audio_info)
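
For reference, the frame-to-time conversion used in `generate_onsets` boils down to simple arithmetic. The sketch below assumes `librosa`'s default hop length of 512 samples, which the code above leaves unchanged; `frames_to_seconds` is a hypothetical stand-in, not a `librosa` function.

```python
def frames_to_seconds(frames, sr=44100, hop_length=512):
    """Convert analysis frame indices to timestamps in seconds.

    Each frame advances hop_length samples, so frame i starts at
    i * hop_length samples, i.e. i * hop_length / sr seconds.
    """
    return [frame * hop_length / sr for frame in frames]


example_frames = [0, 86, 172]
example_times = frames_to_seconds(example_frames)
print(example_times)
```

At 44.1 kHz this gives a time resolution of about 11.6 ms per frame, which bounds how precisely an onset can be placed.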

We can check an audio file for onsets:

In [None]:
drum_path = os.path.join(dataset_name, "Toehider - Daddy Issues/drums.mp3")
no_drum_path = os.path.join(dataset_name, "Toehider - Daddy Issues/no_drums.mp3")
audio_path = os.path.join(dataset_name, "Toehider - Daddy Issues/audio.mp3")

# only the separated drum track is needed for onset detection
y, sr = librosa.load(drum_path, sr=44100)
onset_env = librosa.onset.onset_strength(y=y, sr=sr)
onset_frames = librosa.onset.onset_detect(onset_envelope=onset_env, sr=sr)
onset_times = librosa.frames_to_time(onset_frames, sr=sr)
clicks = librosa.clicks(frames=onset_frames, sr=sr, length=len(y))

# Create a time vector
t = librosa.times_like(onset_env, sr=sr)

# Plot the onset strength envelope and onsets
plt.figure(figsize=(8, 4))
plt.plot(t, onset_env, label='Onset strength envelope')
plt.vlines(onset_times, 0, onset_env.max(), color='r', linestyle='--', linewidth=2.0, label='Onsets')
plt.xlabel('Time (s)')
plt.ylabel('Onset strength')
plt.title('Onsets detected in audio file')
plt.legend()
plt.show()

We can listen to where `librosa` placed the onsets by adding clicks to our audio files:

In [None]:
y2, sr2 = librosa.load(no_drum_path, sr=sr)
display(Audio((y2 + y + clicks), rate=sr))

### Reading the Dataset

And now we can read all info with pandas:

In [None]:
df = pd.read_csv(os.path.join(dataset_name, "master.csv"))
df
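
Before labeling, it may be worth confirming that every path listed in `master.csv` still exists, since the files were moved around during processing. The following is a minimal sketch using only the standard library; the helper name `missing_paths` is hypothetical, and it assumes the column names written in step 5 and that the notebook runs from the same working directory.

```python
import csv
import os


def missing_paths(master_csv, columns=("audio", "drums", "no_drums", "onsets")):
    """Return (name, column, path) for every listed file that does not exist."""
    missing = []
    with open(master_csv, newline="") as f:
        for row in csv.DictReader(f):
            for column in columns:
                if not os.path.isfile(row[column]):
                    missing.append((row["name"], column, row[column]))
    return missing


# Example usage (assumes dataset_name is defined as in the cells above):
# print(missing_paths(os.path.join(dataset_name, "master.csv")))
```

An empty list means every track has its audio, both stems and its onset file in place.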

## Labeling the Onsets

Now that we have a dataset with all onsets, we need to go through every single one and label them. 

We need to install `pigeonXT` for its **multi-label classification functionality**, as well as its audio playback feature. We will load an audio file and play short snippets of it starting at each onset, so that the onset can be classified.
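
Cutting a snippet at an onset is just a conversion from seconds to sample indices. Here is a minimal sketch of that arithmetic; `snippet_bounds` is a hypothetical helper, and the explicit clamping is my own addition so that the bounds never run past the end of the track.

```python
def snippet_bounds(onset_time, sr, duration, total_samples):
    """Return (start, end) sample indices for a snippet starting at onset_time.

    onset_time and duration are in seconds; both indices are clamped to the
    range [0, total_samples].
    """
    start = min(max(int(onset_time * sr), 0), total_samples)
    end = min(start + int(duration * sr), total_samples)
    return start, end


# A 1/8 second snippet starting one second into a 100,000-sample track.
short_bounds = snippet_bounds(1.0, 44100, 0.125, 100_000)
# A 2 second snippet near the end of the track gets truncated.
long_bounds = snippet_bounds(2.0, 44100, 2.0, 100_000)
print(short_bounds, long_bounds)
```

A short snippet isolates the hit itself, while a longer one gives some musical context when the short one is ambiguous.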

In [None]:
filtered_df = df[df['name'] == "Toehider - I Have Little To No Memory of These Memories"]
selected_row = filtered_df.iloc[0]

In [None]:
from pigeonXT import annotate
import re


onsets_df = pd.read_csv(selected_row['onsets'])
onsets_df = onsets_df.rename(columns={'onset_time': 'example'})
onsets_df['name'] = selected_row['name']

drum_file, sr = librosa.load(selected_row['drums'], sr=44100)
display(Audio(drum_file, rate=sr, autoplay=False))

def display_fn(html):
    value = html.value
    match = re.search(r'\d+\.\d+', value)

    if match:
        number_str = match.group(0)
        onset_time = float(number_str)
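    else:
        # no onset time could be parsed from the widget text, so skip playback
        print(f'Could not parse an onset time from {value!r}')
        return html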

    start_sample = int(onset_time * sr)
    end_sample = start_sample + sr//8
    end_sample2 = start_sample + 2*sr
    display(Audio(drum_file[start_sample:end_sample], rate=sr, autoplay=True), Audio(drum_file[start_sample:end_sample2], rate=sr))

annotations = annotate(
    onsets_df,
    options=labels,
    task_type='multilabel-classification',
    display_fn=display_fn
)

**Save labeling progress by running the cell below:**

In [None]:
output = annotations.rename(columns={'example': 'onset_time'})
output.to_csv(selected_row['onsets'], index=False)
output

In [None]:
onsets_df = pd.concat([pd.read_csv(row) for row in df['onsets']])
counts = onsets_df.select_dtypes(include=bool).sum(axis=0)
counts

Now we can filter out any unlabeled onsets and split the rest into training/validation sets for training our first model:

In [None]:
for idx, row in df.iterrows():
    onsets_path = row['onsets']
    onsets_df = pd.read_csv(onsets_path)
    onsets_df = onsets_df[onsets_df[labels].any(axis=1)]
    # Shuffle the dataframe
    onsets_df = onsets_df.sample(frac=1).reset_index(drop=True)
    
    # Assign 80% of onsets to training set, and 20% to validation set
    n = len(onsets_df)
    train_n = int(0.8 * n)
    onsets_df['split'] = 'training'
    onsets_df.loc[train_n:, 'split'] = 'validation'
    
    # Sort the dataframe by onset_time column
    onsets_df = onsets_df.sort_values(by='onset_time')
    
    # Write the updated onsets dataframe back to the CSV file
    onsets_df.to_csv(onsets_path, index=False)
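
One caveat with the cell above is that the shuffle is unseeded, so re-running it assigns onsets to different splits each time. Below is a minimal standard-library sketch of a reproducible 80/20 split over row indices; `split_indices` and its `seed` parameter are hypothetical names, not part of the notebook.

```python
import random


def split_indices(n, train_fraction=0.8, seed=0):
    """Shuffle range(n) with a fixed seed and split it into two index lists."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    train_n = int(train_fraction * n)
    # Sort each part so rows keep their chronological order within a split.
    return sorted(indices[:train_n]), sorted(indices[train_n:])


train_idx, val_idx = split_indices(10)
print(len(train_idx), len(val_idx))
```

With pandas, a similar effect can be had by passing a fixed `random_state` to `DataFrame.sample`.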