# Understanding the Lakh Dataset

MIDI is a file format that stores music symbolically. It doesn't hold any actual sound; rather, it contains the instructions needed to produce that sound. Think of it like a recipe book: there are no ingredients inside the book, but it is referenced during the cooking process.

MIDI files come with the data needed to synthesize the music. This may include which instruments should be used, but someone may choose to ignore that, the same way a chef may swap parsley for cilantro in a recipe.
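To make the recipe analogy concrete, here is a minimal sketch of symbolic note data in plain Python. The `Note` and `Track` classes and the `swap_instrument` helper are hypothetical names invented for illustration and are not part of any MIDI library. The only MIDI facts assumed are that pitches and velocities are integers from 0 to 127 and that instruments are identified by zero-indexed General MIDI program numbers (0 is Acoustic Grand Piano, 24 is nylon-string Acoustic Guitar).

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class Note:
    """One symbolic note: what to play, not how it sounds (hypothetical class)."""
    pitch: int      # MIDI note number, 0-127
    start: float    # onset time in seconds
    end: float      # release time in seconds
    velocity: int   # how hard the note is struck, 0-127


@dataclass(frozen=True)
class Track:
    """A sequence of notes plus a suggested instrument (hypothetical class)."""
    program: int             # zero-indexed General MIDI program number
    notes: Tuple[Note, ...]


def swap_instrument(track: Track, program: int) -> Track:
    """Return a copy of the track with a different instrument, same notes."""
    return replace(track, program=program)


# Three ascending notes, suggested instrument: piano (program 0)
melody = Track(
    program=0,
    notes=(
        Note(pitch=60, start=0.0, end=0.5, velocity=100),
        Note(pitch=62, start=0.5, end=1.0, velocity=100),
        Note(pitch=64, start=1.0, end=1.5, velocity=100),
    ),
)

# The "chef" ignores the suggestion and uses guitar (program 24) instead
guitar_version = swap_instrument(melody, 24)
print(guitar_version.program, [n.pitch for n in guitar_version.notes])
```

The notes are untouched by the swap, which is the point: the instrument is a suggestion attached to the symbolic data, not part of the music's structure.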

### Useful Tools 
Since we will be working programmatically with musical data, we will be using Python tools.
The tools are described under "Some Basic Notes on Tools" further down.


NOTES
- Scores are somewhat like semantic trees, with a well-defined structure. This reminds me of the ARC.
- MIDI files contain all sorts of data, from key changes to lyrics. Maybe we can work with that kind of data.
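As one example of that metadata, a Standard MIDI File stores each key change as a key signature meta event with two data bytes: a signed count of sharps (positive) or flats (negative) from -7 to 7, and a mode byte (0 for major, 1 for minor). The sketch below decodes those two bytes in plain Python. `decode_key_signature` is a hypothetical helper written for these notes. In practice a library such as pretty_midi already parses these events, so this is only meant to show what is stored.

```python
# Key names indexed by (number of sharps + 7); negative counts mean flats.
MAJOR_KEYS = ["Cb", "Gb", "Db", "Ab", "Eb", "Bb", "F",
              "C", "G", "D", "A", "E", "B", "F#", "C#"]
MINOR_KEYS = ["Ab", "Eb", "Bb", "F", "C", "G", "D",
              "A", "E", "B", "F#", "C#", "G#", "D#", "A#"]


def decode_key_signature(data: bytes) -> str:
    """Decode the two data bytes of a key signature meta event (hypothetical helper)."""
    if len(data) != 2:
        raise ValueError("key signature data must be exactly 2 bytes")
    # First byte is a signed 8-bit integer: sharps if positive, flats if negative
    sharps = int.from_bytes(data[:1], "big", signed=True)
    mode = data[1]
    if not -7 <= sharps <= 7 or mode not in (0, 1):
        raise ValueError("invalid key signature bytes")
    names = MAJOR_KEYS if mode == 0 else MINOR_KEYS
    return f"{names[sharps + 7]} {'major' if mode == 0 else 'minor'}"


print(decode_key_signature(b"\x02\x00"))  # two sharps, major mode
print(decode_key_signature(b"\xfd\x01"))  # three flats, minor mode
```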



## Notes on tips from Dr. Raffel
https://nbviewer.org/github/craffel/midi-dataset/blob/master/Tutorial.ipynb

https://nbviewer.org/github/craffel/midi-ground-truth/blob/master/Statistics.ipynb
- MIDI files might not be consistent

In [1]:
# Imports 
from music21 import *
# Imports from Dr. Raffel 
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import pretty_midi
import librosa
import mir_eval
import mir_eval.display
import tables
import IPython.display
import os
import json


# Local path constants
DATA_PATH = '../data'
RESULTS_PATH = '../results'
# Path to the file match_scores.json distributed with the LMD
SCORE_FILE = os.path.join(RESULTS_PATH, 'match_scores.json')

# Utility functions for retrieving paths
def msd_id_to_dirs(msd_id):
    """Given an MSD ID, generate the path prefix.
    E.g. TRABCD12345678 -> A/B/C/TRABCD12345678"""
    return os.path.join(msd_id[2], msd_id[3], msd_id[4], msd_id)

def msd_id_to_mp3(msd_id):
    """Given an MSD ID, return the path to the corresponding mp3"""
    return os.path.join(DATA_PATH, 'msd', 'mp3',
                        msd_id_to_dirs(msd_id) + '.mp3')

def msd_id_to_h5(msd_id):
    """Given an MSD ID, return the path to the corresponding h5"""
    return os.path.join(RESULTS_PATH, 'lmd_matched_h5',
                        msd_id_to_dirs(msd_id) + '.h5')

def get_midi_path(msd_id, midi_md5, kind):
    """Given an MSD ID and MIDI MD5, return path to a MIDI file.
    kind should be one of 'matched' or 'aligned'. """
    return os.path.join(RESULTS_PATH, 'lmd_{}'.format(kind),
                        msd_id_to_dirs(msd_id), midi_md5 + '.mid')
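The path helpers are meant to be used together with the match-score file. The sketch below assumes that `match_scores.json` maps each MSD ID to a dictionary of MIDI MD5 to confidence score, which is how the Tutorial notebook linked above iterates over it. The `sample_json` string is made-up data with placeholder IDs and MD5s, and `best_match` is a hypothetical helper, not part of the LMD code.

```python
import json

# Made-up stand-in for the contents of match_scores.json (placeholder values)
sample_json = """
{
  "TRABCD12345678": {"md5_a": 0.62, "md5_b": 0.81},
  "TRWXYZ12345678": {"md5_c": 0.55}
}
"""
scores = json.loads(sample_json)


def best_match(scores, msd_id):
    """Return (midi_md5, score) for the highest-scoring MIDI match of an MSD ID."""
    candidates = scores[msd_id]
    midi_md5 = max(candidates, key=candidates.get)
    return midi_md5, candidates[midi_md5]


# Pick the most confident MIDI file for every MSD entry
best = {msd_id: best_match(scores, msd_id) for msd_id in scores}
for msd_id, (midi_md5, score) in best.items():
    # The pair (msd_id, midi_md5) is what get_midi_path(msd_id, midi_md5, 'matched') expects
    print(msd_id, midi_md5, score)
```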

In [23]:

import os
# Build the path to the sibling 'data' directory (snippet adapted from ChatGPT)
data_path = os.path.join(os.path.dirname(os.getcwd()), 'data')
files = os.listdir(data_path)

# Print the files (only files, not directories)
for file in files:
    if os.path.isfile(os.path.join(data_path, file)):
        print(file)

FileNotFoundError: [Errno 2] No such file or directory: '/Users/rakin/Desktop/data'

## Some Basic Notes on Tools: 
- LilyPond: A text-based language for music engraving, similar in spirit to TeX. LaTeX documents can also embed LilyPond snippets (via the lilypond-book tool). It offers fine control over how scores are engraved, such as fonts and spacing.

- Humdrum: A language and tool set. The language provides a syntax and grammar for encoding musical information as text. It is a separate representation from MIDI, not something MIDI itself uses. The tool set is a collection of command-line tools for musicological analysis.

- music21: A Python library built at MIT. It provides many functions and classes for working with music in Python.

- pretty_midi: A Python library built by Colin Raffel. It allows for easy inspection of MIDI files, and it's especially valuable for Jupyter notebook printouts. If you are connected to a compute cluster, you can't open GUI applications, so the displays produced by music21 are unusable there.

# The 4 Notebooks 

Link: https://nbviewer.org/github/craffel/midi-dataset/blob/master/Tutorial.ipynb

# Some Basics of MIDI & Music 

Vocabulary
- Notes: The most fundamental building block of music. Think of them as letters in a word
- Scale: A sequence of notes in a specific order
- Major scale: A seven-note scale whose starting and ending notes are the same (an octave apart); every other note is included just once
- Semitone: half step (the closest note)
    - When we look at transitions from one note to another we can give it an annotation such as semitone or tone. DE := tone | C#D := semitone
- Tone: whole step (two semitones)
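- Octave: The interval of 12 semitones between a note and the next note of the same name; the higher note has double the frequency
- Upbeat: The beat just before a downbeat, i.e. the last beat of the previous measure
- Downbeat: The first beat of a measure
- Key signature: The set of sharps or flats written at the start of a staff that indicates the key of the piece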
- Progression: series of chords played in a sequence 
- (beat, downbeat) : Useful if we're working with audio files and want to do key estimation. 
- Pitch 
- Harmony 
- fundamental frequency 
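Several of these terms can be checked numerically using MIDI note numbers. The sketch below is an illustration under stated assumptions: standard tuning with A4 = 440 Hz at note number 69, twelve-tone equal temperament, and the naming convention in which note 60 is C4 (some software calls it C3 instead). All function names here are hypothetical helpers written for these notes.

```python
NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

# Step pattern of a major scale in semitones: tone, tone, semitone, tone, tone, tone, semitone
MAJOR_STEPS = [2, 2, 1, 2, 2, 2, 1]


def note_to_frequency(note_number: int) -> float:
    """Fundamental frequency in Hz, assuming A4 (note 69) = 440 Hz."""
    return 440.0 * 2 ** ((note_number - 69) / 12)


def note_name(note_number: int) -> str:
    """Name such as 'C4', assuming note 60 is called C4."""
    return f"{NOTE_NAMES[note_number % 12]}{note_number // 12 - 1}"


def classify_step(a: int, b: int) -> str:
    """Label the distance between two notes as semitone, tone, or octave."""
    distance = abs(b - a)
    labels = {1: "semitone", 2: "tone", 12: "octave"}
    return labels.get(distance, f"{distance} semitones")


def major_scale(root: int) -> list:
    """Note numbers of the major scale starting on root, ending one octave up."""
    scale = [root]
    for step in MAJOR_STEPS:
        scale.append(scale[-1] + step)
    return scale


print(classify_step(62, 64))                     # D to E
print(classify_step(61, 62))                     # C# to D
print([note_name(n) for n in major_scale(60)])   # C major scale
print(note_to_frequency(69), note_to_frequency(81))
```

Going up one octave adds 12 to the note number and doubles the fundamental frequency, which is why the last line prints a pair of values in a 1:2 ratio.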


##  Extra stuff 
What else can we teach the model? How can we evaluate how well it does on tasks based on musical reasoning? And how can we do it in the 'lug fep' language?


