# MIDI files and metadata

We will be working with the [Clean MIDI subset](https://colinraffel.com/projects/lmd/#get) from The Lakh MIDI Dataset v0.1. Download it and place the `clean_midi` folder in this directory. This dataset contains 2,200 directories, each named after an artist. Every directory contains a collection of `.mid` files named after the songs. Each MIDI file contains multiple tracks for various instruments, and there are 17,257 MIDI files in total.

We won't be working with the raw dataset directly. For this project, we'll need to clean it up by removing duplicates and corrupt MIDI files, and then extend it with some metadata from LastFM and Spotify.

In [None]:
import os
import re
import json
import pretty_midi
# LastFM metadata
import pylast
# Spotify metadata
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials

In [None]:
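# These two files sit next to the artist folders; the loops below expect
# every entry of clean_midi to be a directory, so we delete them first.
os.remove('clean_midi/err.txt')
os.remove('clean_midi/out.txt')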

## Removing duplicates
Many songs have two or more different MIDI files. We want to keep only one so at rendering time each song has the same probability of being selected, ensuring that the dataset doesn't contain multiple repetitions of the same song. Fortunately, the duplicated MIDI files usually end with ".1", ".2", ..., ".n", making them easy to identify.

However, some songs have two files whose names differ only in capitalization. In this case, the duplicate doesn't end with ".i" because the system doesn't recognize it as a duplicate. For example, "Here comes the sun.mid" and "Here Comes the Sun.mid" refer to the same song, but neither ends with ".i". To handle these cases as well, we first convert all the MIDI file names to lowercase.

In [None]:
for artist in os.listdir('clean_midi'):
    songs = os.listdir(f'clean_midi/{artist}')
    for song in songs:
        os.rename(f'clean_midi/{artist}/{song}', f'clean_midi/{artist}/{song.lower()}')
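
On a case-sensitive file system, the rename above replaces one of the two spellings with the other, so it can be useful to list the colliding names before touching any file. The following is a minimal sketch of such a dry run; the helper name `find_case_collisions` is hypothetical and not part of the original notebook.

```python
from collections import defaultdict


def find_case_collisions(file_names):
    """Group file names that become identical once lowercased."""
    groups = defaultdict(list)
    for name in file_names:
        groups[name.lower()].append(name)
    # Keep only the lowercase names that have more than one spelling.
    return {lowered: names for lowered, names in groups.items() if len(names) > 1}


collisions = find_case_collisions(
    ["Here comes the sun.mid", "Here Comes the Sun.mid", "Yesterday.mid"]
)
print(collisions)
```

In the notebook, the list passed to the helper would be the result of `os.listdir` for a single artist directory.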

Next, we delete the duplicates. First, we look for file names that end with ".i.mid". When we find one, we compare the number of instruments in the original version and in the i-th version. If the original has at least as many instruments, we delete the i-th version; otherwise, we replace the original with the i-th version. If one of the two files cannot be opened, we keep the other one.

In [None]:
for artist in os.listdir('clean_midi'):
    songs = os.listdir(f'clean_midi/{artist}')
    for song in songs:
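        # Only names ending in ".<n>.mid" are duplicates; anchoring the pattern
        # avoids matching digits elsewhere in the title (e.g. "vol.2 intro.mid").
        if re.search(r'\.\d+\.mid$', song) is not None:
            duplicated_path = f'clean_midi/{artist}/{song}'
            original_name = '.'.join(song.split('.')[:-2])
            original_path = f'clean_midi/{artist}/{original_name}.mid'
            if os.path.isfile(original_path):
                try:
                    original_midi = pretty_midi.PrettyMIDI(original_path)
                except Exception:
                    # The original is unreadable: promote the duplicate in its place.
                    os.remove(original_path)
                    os.rename(duplicated_path, original_path)
                    continue
                try:
                    duplicated_midi = pretty_midi.PrettyMIDI(duplicated_path)
                except Exception:
                    # The duplicate is unreadable: drop it and keep the original.
                    os.remove(duplicated_path)
                    continue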
                if len(original_midi.instruments) >= len(duplicated_midi.instruments):
                    os.remove(duplicated_path)
                else:
                    os.remove(original_path)
                    os.rename(duplicated_path, original_path)

This process removed 6,901 duplicate MIDI files.
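
The ".1", ".2", ..., ".n" suffix convention used in this section can be captured with an anchored regular expression. The sketch below shows one way to map a duplicate's file name back to the original name; `DUPLICATE_SUFFIX` and `split_duplicate_name` are hypothetical names introduced only for illustration.

```python
import re

# Matches "<stem>.<digits>.mid" at the very end of the file name.
DUPLICATE_SUFFIX = re.compile(r"^(?P<stem>.+)\.(?P<index>\d+)\.mid$")


def split_duplicate_name(file_name):
    """Return (original_name, index) for a duplicate, or None otherwise."""
    match = DUPLICATE_SUFFIX.match(file_name)
    if match is None:
        return None
    return f"{match.group('stem')}.mid", int(match.group("index"))


print(split_duplicate_name("here comes the sun.1.mid"))
print(split_duplicate_name("here comes the sun.mid"))
```

Anchoring the pattern at the end of the name means that a title containing a dot followed by a digit somewhere in the middle is not treated as a duplicate.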

## Getting the metadata

### LastFM metadata

Using the LastFM API, we will search for each MIDI file in the database. If we find a match, we will save a `.json` file with the same name as the MIDI file including the song's name, artist, and most importantly, the LastFM tags. These tags will be crucial later on as they contain descriptive data, such as the emotional tone of the song, that would be impossible to derive from the `.mid` file or its future `.wav` rendering alone. Tags often include words like "melancholic", "warm", or "gentle", which users assign to songs. This metadata (along with other information we'll cover later) will be used to generate prompts using a large language model (LLM) to accurately describe the sound.

In [None]:
LASTFM_API_KEY = "LASTFM_API_KEY"
LASTFM_API_SECRET = "LASTFM_API_SECRET"
LASTFM_USERNAME = "LASTFM_USERNAME"
LASTFM_PASSWORD = "LASTFM_PASSWORD"

In [None]:
network = pylast.LastFMNetwork(
    api_key=LASTFM_API_KEY,
    api_secret=LASTFM_API_SECRET,
    username=LASTFM_USERNAME,
    password_hash=pylast.md5(LASTFM_PASSWORD),
)

In [None]:
for artist in os.listdir('clean_midi'):
    songs = os.listdir(f'clean_midi/{artist}')
    for song in songs:
        matched_search = network.search_for_track(artist, song[:-4])
        if int(matched_search.get_total_result_count()) == 0:
            os.remove(f'clean_midi/{artist}/{song}')
            continue
        best_match = matched_search.get_next_page()[0]
        tags = best_match.get_top_tags()
        data = {
            "Song": best_match.get_name(),
            "Artist": best_match.get_artist().get_name(),
            "Tags": [tag.item.get_name() for tag in tags]
        }
        with open(f'clean_midi/{artist}/{song[:-4]}.json', 'w') as json_file:
            json.dump(data, json_file)

This process removed 1,148 MIDI files (those not found in LastFM's database).

### Spotify metadata

Spotify provides additional metadata that is highly useful for this dataset. First, it offers general musical information such as key, mode, and tempo (e.g., "A Major, 129 BPM"). Additionally, Spotify provides descriptive parameters like "acousticness", "danceability", and "energy". Another crucial piece of metadata is the track's sections. Spotify defines sections as "large variations in rhythm or timbre, e.g., chorus, verse, bridge, guitar solo, etc." Each section includes its own descriptions of tempo, key, mode, time signature, and loudness.

This is particularly interesting for our project because rendering the entire song into a single `.wav` file typically results in 2 to 4-minute-long files, but for our model, we want shorter audio clips. This means we need to cut the `.wav` files, but doing so every N seconds could lead to abrupt cuts or starting a clip in the middle of a musical phrase. The section data is ideal for our use case, allowing us to cut the audio into smaller, coherent pieces. Sections are typically 10-30 seconds long, which aligns with our requirements.
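
As a rough illustration of how section data could drive the cutting, the sketch below turns a list of sections into clip boundaries. It assumes each section is a dictionary with `start` and `duration` in seconds, as in Spotify's audio analysis; the function names, the 10 and 30 second bounds, and the 44,100 Hz sample rate are assumptions made for this example, not choices taken from the project.

```python
def sections_to_clips(sections, min_seconds=10.0, max_seconds=30.0):
    """Return (start, end) times in seconds for sections within the bounds."""
    clips = []
    for section in sections:
        start = section["start"]
        duration = section["duration"]
        # Skip sections that are too short or too long to be useful clips.
        if min_seconds <= duration <= max_seconds:
            clips.append((start, start + duration))
    return clips


def clip_to_samples(clip, sample_rate=44100):
    """Convert a (start, end) pair in seconds to integer sample indices."""
    start, end = clip
    return int(round(start * sample_rate)), int(round(end * sample_rate))


example_sections = [
    {"start": 0.0, "duration": 16.0},
    {"start": 16.0, "duration": 4.0},   # too short, dropped
    {"start": 20.0, "duration": 25.5},
]
clips = sections_to_clips(example_sections)
print(clips)
print([clip_to_samples(clip) for clip in clips])
```

Filtering is the simplest policy; merging a short section into its neighbor would be an alternative if dropping audio is undesirable.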

In [None]:
SPOTIFY_API_KEY = "SPOTIFY_API_KEY"
SPOTIFY_API_SECRET = "SPOTIFY_API_SECRET"

In [None]:
auth_manager = SpotifyClientCredentials(client_id=SPOTIFY_API_KEY, client_secret=SPOTIFY_API_SECRET)
sp = spotipy.Spotify(auth_manager=auth_manager)

In [None]:
key_dict = {
    -1: '', 0: 'C', 1: 'C#', 2: 'D', 3: 'D#', 4: 'E', 5: 'F', 6: 'F#', 7: 'G', 8: 'G#', 9: 'A', 10: 'A#', 11: 'B'
}

mode_dict = {
    0: 'Minor', 1: 'Major'
}
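
These two lookup tables are enough to build a human-readable description such as "A Major, 129 BPM". The following sketch shows one possible formatting helper; `describe_tonality` is a hypothetical name, and the tables are repeated here (as `KEY_NAMES` and `MODE_NAMES`) only so that the block runs on its own.

```python
# Same mappings as key_dict and mode_dict, repeated to keep the example standalone.
KEY_NAMES = {
    -1: '', 0: 'C', 1: 'C#', 2: 'D', 3: 'D#', 4: 'E', 5: 'F',
    6: 'F#', 7: 'G', 8: 'G#', 9: 'A', 10: 'A#', 11: 'B',
}
MODE_NAMES = {0: 'Minor', 1: 'Major'}


def describe_tonality(key, mode, tempo):
    """Format numeric key, mode, and tempo as e.g. 'A Major, 129 BPM'."""
    key_name = KEY_NAMES.get(key, '')
    mode_name = MODE_NAMES.get(mode, '')
    bpm = f"{round(tempo)} BPM"
    if not key_name:
        # No key detected (key == -1): report the tempo only.
        return bpm
    return f"{key_name} {mode_name}".strip() + f", {bpm}"


print(describe_tonality(9, 1, 129.177))
```

The same helper can decode the per-section values, which Spotify reports as integers rather than names.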

In [None]:
for artist in os.listdir('clean_midi'):
    songs = os.listdir(f'clean_midi/{artist}')
    songs = [ song for song in songs if song.endswith(".mid") ]
    for song in songs:
        # Use a context manager so the file handle is closed after reading.
        with open(f'clean_midi/{artist}/{song[:-4]}.json', encoding='utf-8') as json_file:
            metadata = json.load(json_file)

        search_result = sp.search(metadata['Artist'] + " " + metadata['Song'], limit=1, type='track')
        if len(search_result['tracks']['items']) == 0:
            os.remove(f'clean_midi/{artist}/{song}')
            os.remove(f'clean_midi/{artist}/{song[:-4]}.json')
            continue

        track_id = search_result['tracks']['items'][0]['id']
        spotify_features = sp.audio_features([track_id])[0]
        metadata.update({
            "duration_ms":  spotify_features["duration_ms"],
            "acousticness": spotify_features["acousticness"],
            #"danceability": spotify_features["danceability"],
            "energy": spotify_features["energy"],
        })

        spotify_analysis = sp.audio_analysis(track_id)
        metadata.update({
            "key": key_dict[spotify_analysis["track"]["key"]],
            "mode": mode_dict[spotify_analysis["track"]["mode"]],
            "tempo": spotify_analysis["track"]["tempo"],
            "sections": spotify_analysis["sections"]
        })
        
        with open(f'clean_midi/{artist}/{song[:-4]}.json', 'w') as json_file:
            json.dump(metadata, json_file)

This process removed 32 MIDI files (those not found in Spotify's database).

## Splitting the MIDI files

As mentioned at the beginning, each MIDI file contains several tracks meant to be played by different instruments. Each track's instrument is identified by its program, which is essentially an instrument code ([see the map of codes to instruments](https://www.ccarh.org/courses/253/handout/gminstruments/)). Since we want to synthesize each instrument separately, potentially with a different VST, it's useful to have a separate MIDI file for each instrument. Therefore, we're going to split each MIDI file into multiple files, one per instrument.

We will also remove any MIDI files that `pretty_midi` is unable to open, as they are likely corrupt or broken.

In [None]:
for artist in os.listdir('clean_midi'):
    songs = os.listdir(f'clean_midi/{artist}')
    songs = [ song for song in songs if song.endswith(".mid") ]
    for song in songs:
        print(artist, song)
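        try:
            midi_data = pretty_midi.PrettyMIDI(f'clean_midi/{artist}/{song}')
        except Exception:
            # Unreadable file: drop it together with its metadata.
            os.remove(f'clean_midi/{artist}/{song}')
            os.remove(f'clean_midi/{artist}/{song[:-4]}.json')
            continue
        # One directory per song; exist_ok makes the cell safe to re-run.
        os.makedirs(f'clean_midi/{artist}/{song[:-4]}', exist_ok=True)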
        for instrument in midi_data.instruments:
            new_midi = pretty_midi.PrettyMIDI()
            new_midi.instruments.append(instrument)
            instrument_name = instrument.name
            for char in ["_", "/", ":", "*", '"', ">", "<", "|", "?", "'", "\x00", "\x0b", "\t", "\x12", '\\']:
                instrument_name = instrument_name.replace(char, "")
            instrument_name = instrument_name.rstrip()
            new_midi.write(f'clean_midi/{artist}/{song[:-4]}/{instrument.program}_{1 if instrument.is_drum else 0}_{instrument_name}.mid')
        os.remove(f'clean_midi/{artist}/{song}')

This process removed 101 MIDI files (those that PrettyMIDI couldn't open, likely corrupted files).
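
Each per-instrument file is named `<program>_<is_drum>_<instrument name>.mid`, and underscores are stripped from the instrument name, so the first two underscores are unambiguous separators. The sketch below shows how these fields could be recovered later; `parse_stem_name` is a hypothetical helper, not part of the original notebook.

```python
def parse_stem_name(file_name):
    """Split '<program>_<is_drum>_<name>.mid' into (program, is_drum, name)."""
    stem = file_name[:-len(".mid")]
    # Split on the first two underscores only; the name itself has none.
    program, is_drum, name = stem.split("_", 2)
    return int(program), is_drum == "1", name


print(parse_stem_name("33_0_Bass.mid"))
print(parse_stem_name("0_1_Drums.mid"))
```

Tracks without a name produce files such as `25_0_.mid`, which the helper parses with an empty name.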

Since we've been removing MIDI files throughout the notebook and some artists only had one MIDI file, we've left some empty directories, which we will now remove.

In [None]:
for artist in os.listdir('clean_midi'):
    if len(os.listdir(f'clean_midi/{artist}')) == 0:
        os.rmdir(f'clean_midi/{artist}')
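
A quick way to check how many songs remain is to count the metadata files, since each surviving song has exactly one. This is a minimal sketch that assumes the layout `clean_midi/<artist>/<song>.json`; the helper name `count_songs` is hypothetical.

```python
from pathlib import Path


def count_songs(root):
    """Count metadata files laid out as <root>/<artist>/<song>.json."""
    return sum(1 for _ in Path(root).glob("*/*.json"))


# Example call, assuming the dataset sits in the current directory:
# print(count_songs("clean_midi"))
```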

We started with 17,257 MIDI files (including duplicates), each containing all of a song's tracks combined. We ended up with 9,075 unique songs, 8,182 fewer than in the original dataset: 6,901 files removed as duplicates, 1,148 songs not found on LastFM, 32 not found on Spotify, and 101 files that `pretty_midi` could not open. All remaining `.mid` files are valid and separated by instrument. Additionally, each song has a `.json` file that looks like this:

```json
{
  "Song": "Here Comes The Sun - Remastered 2009",
  "Artist": "The Beatles",
  "Tags": [
    "sunshine pop",
    "rock",
    "60s",
    ...
  ],
  "duration_ms": 185733,
  "acousticness": 0.0339,
  "energy": 0.54,
  "key": "A",
  "mode": "Major",
  "tempo": 129.177,
  "sections": [
    {
      "start": 0.0,
      "duration": 16.22566,
      "confidence": 1.0,
      "loudness": -23.088,
      "tempo": 129.83,
      "tempo_confidence": 0.518,
      "key": 9,
      "key_confidence": 0.797,
      "mode": 1,
      "mode_confidence": 0.713,
      "time_signature": 3,
      "time_signature_confidence": 0.625
    },
    ...
  ]
}
```
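
Before using these files downstream, it can help to verify that each one has the fields shown above. The following sketch reports which top-level fields are absent from a metadata dictionary; `EXPECTED_FIELDS` and `missing_fields` are hypothetical names, and the field list is simply read off the example.

```python
import json

# Top-level fields present in the example metadata file.
EXPECTED_FIELDS = {
    "Song", "Artist", "Tags", "duration_ms", "acousticness",
    "energy", "key", "mode", "tempo", "sections",
}


def missing_fields(metadata):
    """Return the expected top-level fields absent from a metadata dict."""
    return sorted(EXPECTED_FIELDS - metadata.keys())


partial = json.loads('{"Song": "Here Comes The Sun", "Artist": "The Beatles", "Tags": []}')
print(missing_fields(partial))
```

A file written only by the LastFM step, like the one in this example, lacks all of the Spotify fields.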