<a href="https://colab.research.google.com/github/paulhutchings/midi-nn/blob/note_sequences/Midi_NN_EDA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# MIDI-NN EDA with IPeC Dataset

This notebook contains an Exploratory Data Analysis (EDA) of the International Piano e-Competition (IPeC) dataset. It guides the user through downloading the dataset (hosted on Google Cloud Storage and GitHub), converting the MIDI files to NoteSequences for easier use, and performing a number of data operations to gain insight into the MIDI data contained in the dataset.

The notebook contains several checkpoints where you can either download crucial files to save, or re-upload files from an earlier session to save time.

In [None]:
# install dependencies
%pip install note-seq multiprocess

## Downloading and pre-processing the data
The dataset contains over 2000 MIDI files and can be downloaded as a single `.zip` file from a Google Cloud Storage bucket, or as smaller `.tar.gz` archives from [GitHub](https://github.com/paulhutchings/international-e-piano-dataset).

In [None]:
# download the dataset hosted on GCS and extract the files
!curl -o midi.zip https://storage.googleapis.com/datasets.studiop.page/international-e-piano-midi_2002-2018.zip
!unzip midi.zip
!ls

In [None]:
# or upload .zip file from your computer
from google.colab import files
uploaded = files.upload()
!unzip midi.zip
!ls

Now, we will convert all of the MIDI files into [NoteSequences](https://github.com/magenta/note-seq), a serialized data structure used by Google's Magenta project that is much easier to work with than raw MIDI files.

The conversion process below creates a dictionary of the file names to the NoteSequence representations and writes them to a file for later use. The code utilizes the `multiprocess` module, a fork of the normal `multiprocessing` module, to speed up the conversion time. Feel free to adjust the parameters below to suit your needs.


In [None]:
from note_seq import midi_file_to_note_sequence
import os, time
from multiprocess import Pool

input_dir = 'midi'
out_file = 'notesequences'
processes = 8

In [None]:
def convert_midi_files(args):
    input_dir, files = args
    filemap = {} # dictionary of filenames to NoteSequences for reconstruction
    for file in files:
        print(f'Converting {file}...')
        filename = os.path.splitext(file)[0] # strip the .mid extension
        input_path = os.path.join(input_dir, file)
        sequence = midi_file_to_note_sequence(input_path)
        # store the serialized protobuf bytes so the result can be written to disk
        filemap[filename] = sequence.SerializeToString()
    return filemap

In [None]:
def merge_dicts(dicts):
    merged = {}
    for d in dicts:
        merged = {**merged, **d}
    return merged

def split(arr, n):
    k, m = divmod(len(arr), n)
    return (arr[i * k + min(i, m):(i + 1) * k + min(i + 1, m)] for i in range(n))
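
To see how the work is divided between processes, the sketch below repeats `split` and `merge_dicts` from the cell above and runs them on a small made-up list of file names. The names are placeholders, not files from the dataset, and the empty byte strings stand in for serialized NoteSequences.

```python
def split(arr, n):
    # divide arr into n contiguous chunks whose sizes differ by at most one
    k, m = divmod(len(arr), n)
    return (arr[i * k + min(i, m):(i + 1) * k + min(i + 1, m)] for i in range(n))

def merge_dicts(dicts):
    # later dictionaries win if the same key appears more than once
    merged = {}
    for d in dicts:
        merged.update(d)
    return merged

# placeholder file names, for illustration only
names = [f'file{i}.mid' for i in range(10)]
chunks = list(split(names, 3))
chunk_sizes = [len(chunk) for chunk in chunks]   # [4, 3, 3]

# stand-in for the per-process results returned by pool.map
partial_maps = [{name[:-4]: b'' for name in chunk} for chunk in chunks]
merged = merge_dicts(partial_maps)
print(chunk_sizes, len(merged))
```

If there are fewer files than processes, some chunks are simply empty and the corresponding workers return empty dictionaries.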

In [None]:
if __name__ == '__main__':
    start = time.time()

    files = [file for file in os.listdir(input_dir) if file.lower().endswith('.mid')]
    split_files = list(split(files, processes))

    # parallelize the conversions. Merge the dictionaries at the end
    with Pool(processes) as pool:
        results = pool.map(convert_midi_files, [(input_dir, split_files[i]) for i in range(len(split_files))])
    filemaps = merge_dicts(results)
    end = time.time() - start

    # Write dictionary to file for later use
    with open(out_file, 'w') as outfile:
        outfile.write(str(filemaps))
    print('Done')
    print(f'Conversion took {round(end, 2)}s')

### Checkpoint - download/upload NoteSequences file
Start here if you've already converted the MIDI files into NoteSequences. You can either download or upload the NoteSequences file. Uploaded files will retain their filename and be placed into the current working directory of the notebook.

In [None]:
from google.colab import files
files.download('notesequences')

In [None]:
from google.colab import files
uploaded = files.upload()
!ls

## Dataset Analysis
We'll begin with some simple statistics. We'll gather the following statistics:

*   Max
*   Min
*   Average
*   Median
*   Standard deviation

For each of the following attributes of each NoteSequence:

*   Note velocity
*   Pitch
*   Note duration


In [None]:
# imports
import pandas as pd
import numpy as np
import bokeh, os, ast, functools
from note_seq.protobuf import music_pb2

In [None]:
# convert notesequences file back into dictionary
def load_ns_file(file):
    with open(file, 'r') as file:
        filemaps = ast.literal_eval(file.read())
    for key in filemaps:
        filemaps[key] = music_pb2.NoteSequence().FromString(filemaps[key])
    return filemaps
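
The file written earlier is simply the `repr` of a dictionary whose values are `bytes`, which is why `ast.literal_eval` can rebuild it. A minimal sketch of that round trip, using a made-up byte string in place of a real serialized NoteSequence:

```python
import ast

# made-up payload standing in for sequence.SerializeToString()
original = {'example_piece': b'\x0a\x03abc\xff'}

text = str(original)                 # what outfile.write(str(filemaps)) stores
restored = ast.literal_eval(text)    # parses Python literals only, unlike eval

print(restored == original)
```

This text form is typically larger than the raw bytes because non-printable bytes are written as escape sequences; it is shown here only because the notebook already relies on it.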

In [None]:
# returns common statistics for an array of items
def get_stats(arr):
  return [
          max(arr),
          min(arr),
          round(np.average(arr), 2),
          round(np.median(arr), 2),
          round(np.std(arr), 2)
  ]

# gets statistics for each sequence
def get_sequence_stats(seq):
  notes = seq.notes
  seq_length = len(notes)

  velocities = [note.velocity for note in notes]
  vel_stats = get_stats(velocities)

  pitches = [note.pitch for note in notes]
  num_unique_pitches = len(set(pitches))
  pitch_stats = get_stats(pitches)

  durations = [round(note.end_time - note.start_time, 2)  for note in notes]
  dur_stats = get_stats(durations)

  return [seq_length, num_unique_pitches] + vel_stats + pitch_stats + dur_stats
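
As a quick check of what `get_stats` returns, here is a small sketch on four made-up MIDI pitches. One detail worth knowing: `np.std` computes the population standard deviation (`ddof=0`) by default, whereas pandas' `Series.std` defaults to the sample version (`ddof=1`), so the two differ slightly on the same data.

```python
import numpy as np

def get_stats(arr):
    # same order as in the notebook: max, min, average, median, std
    return [
        max(arr),
        min(arr),
        round(np.average(arr), 2),
        round(np.median(arr), 2),
        round(np.std(arr), 2),
    ]

pitches = [60, 64, 67, 72]  # made-up values for illustration
stats = get_stats(pitches)
print(stats)

# population (ddof=0, NumPy default) vs sample (ddof=1) standard deviation
population_std = np.std(pitches)
sample_std = np.std(pitches, ddof=1)
print(population_std < sample_std)
```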


In [None]:
num_midi_files = len(os.listdir('midi'))
ns = load_ns_file('notesequences')

# create a Pandas DataFrame for the statistics
stats = [[name] + get_sequence_stats(seq) for (name, seq) in ns.items()]
df_cols = [
           'Sequence name', 
           'Sequence length',
           'Number of unique pitches',
           'Max velocity', 
           'Min velocity', 
           'Avg velocity', 
           'Median velocity', 
           'Velocity std',
           'Highest pitch',
           'Lowest pitch',
           'Avg pitch',
           'Median pitch',
           'Pitch std',
           'Longest note (s)',
           'Shortest note (s)',
           'Avg duration (s)',
           'Median duration (s)',
           'Duration std'
]
df = pd.DataFrame(stats, columns=df_cols)
df.to_csv('stats.csv')

### Checkpoint - upload/download CSV stats file

In [None]:
from google.colab import files
files.download('stats.csv')  # same path that df.to_csv() wrote to above

In [None]:
from google.colab import files
uploaded = files.upload()
!ls
df = pd.read_csv('stats.csv')

In [None]:
print(f'Total number of sequences (MIDI files): {num_midi_files}')
df.head(10)

Total number of sequences (MIDI files): 2431


| Unnamed: 0 | Sequence name | Sequence length | Number of unique pitches | Max velocity | Min velocity | Avg velocity | Median velocity | Velocity std | Highest pitch | Lowest pitch | Avg pitch | Median pitch | Pitch std | Longest note (s) | Shortest note (s) | Avg duration (s) | Median duration (s) | Duration std |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Ozel01 | 1529 | 45 | 100 | 3 | 61.82 | 61.0 | 16.01 | 84 | 38 | 64.29 | 65.0 | 9.63 | 14.48 | 0.0 | 0.76 | 0.52 | 0.88 |
| 1 | Hebert03 | 1391 | 60 | 95 | 3 | 61.07 | 63.0 | 12.47 | 96 | 28 | 70.69 | 73.0 | 12.93 | 5.95 | 0.0 | 0.09 | 0.06 | 0.32 |
| 2 | KIM_W03 | 12338 | 80 | 114 | 6 | 68.93 | 70.0 | 17.05 | 105 | 25 | 69.83 | 69.0 | 16.3 | 3.23 | 0.0 | 0.11 | 0.05 | 0.23 |
| 3 | Teo04 | 1934 | 70 | 113 | 3 | 72.66 | 73.0 | 14.86 | 100 | 24 | 58.3 | 59.0 | 13.17 | 0.85 | 0.0 | 0.08 | 0.06 | 0.08 |
| 4 | PrjevalskayaM16 | 3085 | 84 | 121 | 2 | 60.41 | 62.0 | 24.62 | 106 | 21 | 66.29 | 68.0 | 17.71 | 6.97 | 0.0 | 0.11 | 0.07 | 0.21 |
| 5 | KorchinskayaKogan07 | 13805 | 85 | 126 | 1 | 68.99 | 69.0 | 18.69 | 107 | 21 | 62.56 | 63.0 | 15.42 | 15.78 | 0.01 | 0.23 | 0.08 | 0.56 |
| 6 | YeZ05 | 8673 | 76 | 111 | 3 | 63.56 | 64.0 | 16.69 | 102 | 27 | 65.37 | 67.0 | 12.39 | 13.35 | 0.0 | 0.15 | 0.09 | 0.41 |
| 7 | ChernovA22 | 1349 | 68 | 92 | 6 | 44.7 | 43.0 | 11.95 | 102 | 22 | 63.97 | 65.0 | 12.78 | 7.15 | 0.03 | 0.51 | 0.24 | 0.76 |
| 8 | Shi06 | 13323 | 86 | 112 | 1 | 56.53 | 57.0 | 20.38 | 107 | 21 | 65.36 | 67.0 | 16.29 | 18.4 | 0.0 | 0.18 | 0.07 | 0.41 |
| 9 | Zhdanov11 | 8501 | 87 | 124 | 1 | 75.8 | 78.0 | 18.0 | 107 | 21 | 68.3 | 70.0 | 18.91 | 2.2 | 0.0 | 0.11 | 0.06 | 0.18 |


### Distributions

Now that we have an overview of the data, let's move on to some more interesting visualizations. We'll now create a series of histograms to view the distribution of the above statistics.



In [None]:
# helper for building histograms with a hover tool
from bokeh.plotting import figure
from bokeh.io import output_notebook, show
from bokeh.models import ColumnDataSource, HoverTool


def hist_hover(dataframe, column, colors=("SteelBlue", "Tan"), bins=30, log_scale=False, show_plot=True):

    # build histogram data with NumPy
    hist, edges = np.histogram(dataframe[column], bins=bins)
    hist_df = pd.DataFrame({column: hist,
                            "left": edges[:-1],
                            "right": edges[1:]})
    # two decimals keep sub-second duration bins distinguishable in the tooltip
    hist_df["interval"] = [f"{left:.2f} to {right:.2f}" for left,
                           right in zip(hist_df["left"], hist_df["right"])]

    # plot raw counts, or their natural log when log_scale is set
    if log_scale:
        # note: empty bins have a count of 0, and np.log(0) is -inf
        hist_df["log"] = np.log(hist_df[column])
        top, y_label = "log", "Log Count"
    else:
        top, y_label = column, "Number of sequences"

    # bokeh histogram with hover tool
    src = ColumnDataSource(hist_df)
    plot = figure(plot_height=400, plot_width=600,
                  title="Histogram of {}".format(column.capitalize()),
                  x_axis_label=column.capitalize(),
                  y_axis_label=y_label)
    plot.quad(bottom=0, top=top, left="left",
              right="right", source=src, fill_color=colors[0],
              line_color="black", fill_alpha=0.7,
              hover_fill_alpha=1.0, hover_fill_color=colors[1])
    # hover tool; field names containing spaces must be wrapped in braces
    hover = HoverTool(tooltips=[('Interval', '@interval'),
                                ('Number of sequences', f'@{{{column}}}')])
    plot.add_tools(hover)
    # output
    if show_plot:
        show(plot)
    else:
        return plot

In [None]:
# create histograms for a selection of the per-sequence statistics
from bokeh.plotting import output_notebook
output_notebook()
cols = [
         'Sequence length',
         'Number of unique pitches',
         'Median velocity',
         'Velocity std',
         'Median pitch',
         'Pitch std',
         'Median duration (s)',
         'Duration std'
]
for col in cols:
  hist_hover(df, col, bins=10)


### Analysis

As we can see from the distributions above, the vast majority of the sequences (MIDI files) contain fewer than 10,000 notes, and the number of files with more notes drops off significantly.

We can also see that the sequences in general have a fairly large variety of pitches, with most of them containing more than 75% of the 88 pitches available on the piano.

The distribution of median velocity is grouped somewhat closely around the middle. Since velocities are 0-127, we see the 63-69 bin leading by a large margin, while the bins to the sides fall off very quickly as we get into the higher and lower velocities. This means that the general dynamic level of all of our sequences is fairly similar, since we do not have many samples with a very low or very high median velocity.

When it comes to the standard deviation of velocities, we again see a cluster in the middle around the 15-20 range, or around 12-16% variation in dynamics throughout the sequence. This reinforces what the median velocity shows us regarding the overall dynamic level of the dataset.
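
The percentages quoted above appear to come from dividing the standard deviation by the top of the MIDI velocity range (127). As a quick check of that arithmetic:

```python
MAX_VELOCITY = 127  # highest MIDI velocity value

# express each standard deviation as a share of the velocity range
shares = [round(std / MAX_VELOCITY * 100, 1) for std in (15, 20)]

for std, share in zip((15, 20), shares):
    print(f'std of {std} is about {share}% of the velocity range')
```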

Despite having a large number of samples with a high number of unique pitches, the median and standard deviation of pitch still form a strong bell-curve shape around the middle of the keyboard, particularly the range of the right hand. While this is not surprising from a musical perspective, we may have expected a slightly wider curve in the distribution.

The last two charts are probably the most interesting. They show that the overwhelming majority of notes in the sequences are of a short duration - "fast" notes, if you like. This suggests one of two things: either the tempos for most sequences are fast, or the note values are mostly small. Answering this would require viewing the sheet music for each sequence, or incorporating the MIDI tempo and time signature data into this analysis. That was not done here, both because it was not clear at the time that this data could provide additional insight and to keep the EDA from becoming too complex or time-consuming.
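
To illustrate why tempo matters here, the following sketch converts a duration in seconds into a duration in quarter-note beats, using the relation beats = seconds × (quarter notes per minute) / 60. The helper name and the tempo values are hypothetical; real tempos would have to be read from each file.

```python
def seconds_to_beats(seconds, qpm):
    """Convert a duration in seconds to quarter-note beats at a tempo of qpm."""
    return seconds * qpm / 60.0

# the same 0.25 s note at two hypothetical tempos
slow = seconds_to_beats(0.25, 60)    # 0.25 beats: a sixteenth note
fast = seconds_to_beats(0.25, 120)   # 0.5 beats: an eighth note
print(slow, fast)
```

A duration measured in seconds therefore does not determine the written note value on its own, which is why the histograms alone cannot separate the two explanations.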

### Relationships

Next, we'll create some scatter plots to see if there are any relationships between the 3 attributes. Specifically, we want to see if there is any relationship between velocity, pitch, and duration, as well as between the sequence length and number of unique pitches.

In [None]:
from bokeh.plotting import figure

def create_scatter(df, xcol, ycol, color='SteelBlue', showPlot=True):
  p = figure(plot_height=400, plot_width=600, title=f'{xcol} vs {ycol}', 
             x_axis_label=xcol, y_axis_label=ycol)
  p.circle(x=df[xcol], y=df[ycol], alpha=0.5, fill_color=color, line_color=None)
  if showPlot:
    show(p)
  else:
    return p

In [None]:
from bokeh.plotting import output_notebook
output_notebook()

create_scatter(df, 'Sequence length', 'Median duration (s)')
create_scatter(df, 'Median velocity', 'Median duration (s)', 'DarkOrange')
create_scatter(df, 'Median velocity', 'Median pitch', 'Purple')
create_scatter(df, 'Sequence length', 'Number of unique pitches', 'Green')
create_scatter(df, 'Number of unique pitches', 'Velocity std', 'DarkRed')

### Analysis

While there are no very strong relationships between the different attributes in the dataset, there are some weaker ones that are of interest.

The first two plots deal with the median duration. In the first there is a noticeable curve: as the sequence length increases, the median note duration decreases. The second shows that, in general, as the velocity increases, the duration decreases. Taken together, these trends suggest that longer sequences contain more short, loud notes, whereas shorter sequences are more likely to contain longer and/or softer notes.
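
One way to put a number on a trend like this is a correlation coefficient. The sketch below uses only the ten rows shown in the preview table earlier, so the result says nothing reliable about the full dataset; on the real data the same call would be made on the corresponding `df` columns.

```python
import numpy as np

# values copied from the ten-row preview table shown earlier
sequence_length = [1529, 1391, 12338, 1934, 3085, 13805, 8673, 1349, 13323, 8501]
median_duration = [0.52, 0.06, 0.05, 0.06, 0.07, 0.08, 0.09, 0.24, 0.07, 0.06]

# Pearson correlation: off-diagonal entry of the 2x2 correlation matrix
r = np.corrcoef(sequence_length, median_duration)[0, 1]
print(f'Pearson r = {r:.2f}')
```

Pearson's r only captures linear trends. Since the plot shows a curve, a rank-based measure such as `DataFrame.corr(method='spearman')` may be a better fit.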

The third plot, Median Pitch vs Median Velocity, is interesting due to the very strict grid pattern that results. I have a feeling that this implies some sort of relationship or has some sort of significance; I just don't know what that is.
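
One plausible explanation for the grid, offered here as an assumption rather than something verified on the dataset: MIDI pitch and velocity are whole numbers, and the median of whole numbers is always either a whole number or exactly halfway between two of them. Every point would therefore fall on a lattice with 0.5 spacing. A small sketch with made-up values:

```python
from statistics import median

# made-up integer velocities, as MIDI values always are
odd_count = [60, 62, 67]        # median is the middle value
even_count = [60, 62, 67, 70]   # median is the mean of the two middle values

medians = [median(odd_count), median(even_count)]
print(medians)

# every median of integers is a multiple of 0.5
on_lattice = all((2 * m) % 1 == 0 for m in medians)
print(on_lattice)
```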

We do see another weak trend of longer sequences usually containing more unique pitches - which makes sense from a logical standpoint. However, the curve starts to come back down as we get longer, resulting in a shape resembling the beginnings of an upside-down U. This could be due to musical forms and structures, where material is reused - increasing the length but not adding to the uniqueness of pitches used.

In the final scatter plot we see another interesting trend. In general, the sequences with a higher number of unique pitches also tend to have a higher standard deviation in velocity. In other words, the sequences that have a large variety of pitches in them also have a large dynamic contrast. It would stand to reason that those samples would be more valuable for training given their more varied content.

Overall, we see a lot of stratification in the scatter plots, that is, clusters of dots along a single line/value on either the x or y axis. Given the fine-grained nature of several of the statistics, such as the number of notes and the medians and standard deviations, categorizing the data more coarsely (for example, rounding to whole numbers instead of decimals, or binning into wider ranges) may provide different insights by "clearing up" some of the noisier data.
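
A minimal sketch of that coarser categorization, using pandas on five rows copied from the preview table shown earlier. The bin edges and labels are arbitrary choices made for illustration, not values derived from the dataset.

```python
import pandas as pd

# values copied from the preview table (rows 0, 1, 2, 6 and 5)
sample = pd.DataFrame({
    'Sequence length': [1529, 1391, 12338, 8673, 13805],
    'Velocity std': [16.01, 12.47, 17.05, 16.69, 18.69],
})

# round a fine-grained statistic to whole numbers
sample['Velocity std (rounded)'] = sample['Velocity std'].round().astype(int)

# bin sequence length into wide ranges; the edges and labels are arbitrary
edges = [0, 2500, 10000, float('inf')]
labels = ['short', 'medium', 'long']
sample['Length bin'] = pd.cut(sample['Sequence length'], bins=edges, labels=labels)

print(sample[['Velocity std (rounded)', 'Length bin']])
```

Plotting against `Length bin` instead of the raw length would trade resolution for fewer, denser groups.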

### Conclusions

When it comes to training a machine learning model on musical expression, there are a few insights that may be relevant to the development of the model architecture. One such example is the indication that the general dynamic levels hover very closely around the medians. I believe this could present a problem where the model does not learn dynamic expression well enough, and as a result produces sequences with little to no dynamic variation.