In [25]:
import pandas as pd
import re
import numpy as np
import seaborn as sns
sns.set_style( 'darkgrid' )
import matplotlib.pyplot as plt

- Pandas is for the dataframe manipulation.
- Regex is for dealing with the CSVs and searching for specific notes more easily.
- Numpy is here because I think we might use it, but if we don't I'll remove it.
- Seaborn is for visualising notes and patterns.
- Matplotlib is the plotting library that Seaborn is built on.

Possible errors in Data scraping:
- A few tracks had invalid headers when converting MIDI to CSV:<br>
<br>
WTK-1-Fugue6 Track 3, WTK-1-Fugue12 Track 4, WTK-1-Fugue19 Track 4<br>
--<br>
WTK-1-Prelude13 Track 3, WTK-1-Prelude14 Track 3, WTK-1-Prelude15 Track 3, WTK-1-Prelude19 Track 3, 
WTK-1-Prelude20 Track 3<br>
<br>
- WTK-1-Prelude24 could not be saved as a file, so I'll troubleshoot that later

Data sourced from: https://www.bachcentral.com/midiindexcomplete.html <br>
MIDI-to-CSV converter source and documentation: https://www.fourmilab.ch/webtools/midicsv/

#RECORD STRUCTURE<br>
Each record in the CSV representation of a MIDI contains at least three fields:<br>
<br>
#Track<br>
Numeric field identifying the track to which this record belongs. Tracks of MIDI data are numbered starting at 1. Track 0 is reserved for file header, information, and end of file records.<br>
#Time<br>
Absolute time, in terms of MIDI clocks, at which this event occurs. Meta-events for which time is not meaningful (for example, song title, copyright information, etc.) have an absolute time of 0.<br>
#Type<br>
Name identifying the type of the record. Record types are text consisting of upper and lower case letters and the underscore (“_”), contain no embedded spaces, and are not enclosed in quotes. csvmidi ignores upper/lower case in the Type field; the specifications “Note_on_c”, “Note_On_C”, and “NOTE_ON_C” are considered identical.<br>
Records in the CSV file are sorted first by the track number, then by time. Out of order records will be discarded with an error message from csvmidi. Following the three required fields are parameter fields which depend upon the Type; some Types take no parameters. Each Type and its parameter fields is discussed below.<br>
<br>
Any line with an initial nonblank character of “#” or “;” is ignored; either delimiter may be used to introduce comments in a CSV file. Only full-line comments are permitted; you cannot use these delimiters to terminate scanning of a regular data record. Completely blank lines are ignored.<br>
<br>
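
The record rules above (three required fields, type names compared without regard to case, full-line comments, ignored blank lines) can be sketched as a small reader. This is only an illustration built on the standard `csv` module; `parse_midicsv` is a hypothetical helper name, not part of midicsv, and lowercasing the type is a normalisation choice made here.

```python
import csv
import io

def parse_midicsv(text):
    """Yield (track, time, type, params) for each data record in midicsv text."""
    # Drop blank lines and full-line comments starting with '#' or ';'.
    lines = (
        line for line in io.StringIO(text)
        if line.strip() and line.lstrip()[0] not in "#;"
    )
    # skipinitialspace copes with the ", " separator between fields.
    for row in csv.reader(lines, skipinitialspace=True):
        track, time, rec_type, *params = row
        # Type names are case-insensitive, so normalise them to lower case.
        yield int(track), int(time), rec_type.strip().lower(), params

sample = '# a comment\n0, 0, Header, 1, 3, 480\n\n2, 120, Note_on_c, 0, 67, 64\n'
for record in parse_midicsv(sample):
    print(record)
```

Parameters are left as strings because their meaning depends on the record type.
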
#File Structure Records<br>
##0, 0, Header, format, nTracks, division<br>
The first record of a CSV MIDI file is always the Header record. Parameters are format: the MIDI file type (0, 1, or 2), nTracks: the number of tracks in the file, and division: the number of clock pulses per quarter note. The Track and Time fields are always zero.<br>
##0, 0, End_of_file<br>
The last record in a CSV MIDI file is always an End_of_file record. Its Track and Time fields are always zero.<br>
##Track, 0, Start_track<br>
A Start_track record marks the start of a new track, with the Track field giving the track number. All records between the Start_track record and the matching End_track will have the same Track field.<br>
##Track, Time, End_track<br>
An End_track marks the end of events for the specified Track. The Time field gives the total duration of the track, which will be identical to the Time in the last event before the End_track.<br>
#File Meta-Events<br>
The following events occur within MIDI tracks and specify various kinds of information and actions. They may appear at any time within the track. Those which provide general information for which time is not relevant usually appear at the start of the track with Time zero, but this is not a requirement.<br>
<br>
Many of these meta-events include a text string argument. Text strings are output in CSV records enclosed in ASCII double quote (") characters. Quote characters embedded within strings are represented by two consecutive quotes. Non-graphic characters in the ISO 8859-1 Latin-1 set are output as a backslash followed by their three digit octal character code. Two consecutive backslashes denote a literal backslash in the string. Strings in MIDI files can be extremely long, theoretically as many as 2<sup>28</sup>−1 characters; programs which process MIDI CSV files should take care to avoid buffer overflows or truncation resulting from lines containing long string items. All meta-events which take a text argument are identified by a suffix of “_t”.<br>
<br>
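
The quoting rules above can be undone with a short decoder. This is a minimal sketch, assuming the field is passed in still wrapped in its double quotes; `decode_midicsv_string` is a hypothetical name introduced for illustration.

```python
import re

# A backslash followed by either another backslash or three octal digits.
_ESCAPE = re.compile(r'\\(\\|[0-7]{3})')

def decode_midicsv_string(field):
    """Decode a quoted midicsv text field into a plain Python string."""
    text = field.strip()
    # Remove the enclosing double quotes, if present.
    if len(text) >= 2 and text[0] == '"' and text[-1] == '"':
        text = text[1:-1]
    # Two consecutive quotes stand for one embedded quote.
    text = text.replace('""', '"')

    def _unescape(match):
        token = match.group(1)
        # "\\" is a literal backslash; "\ooo" is an octal character code.
        return '\\' if token == '\\' else chr(int(token, 8))

    return _ESCAPE.sub(_unescape, text)

print(decode_midicsv_string(' "RH B"'))
```
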
##Track, Time, Title_t, Text<br>
The Text specifies the title of the track or sequence. The first Title meta-event in a type 0 MIDI file, or in the first track of a type 1 file gives the name of the work. Subsequent Title meta-events in other tracks give the names of those tracks.<br>


##Track, Time, Text_t, Text<br>
This meta-event supplies an arbitrary Text string tagged to the Track and Time. It can be used for textual information which doesn't fall into one of the more specific categories given above.<br>

##Track, Time, Time_signature, Num, Denom, Click, NotesQ<br>
The time signature, metronome click rate, and number of 32nd notes per MIDI quarter note (24 MIDI clock times) are given by the numeric arguments. Num gives the numerator of the time signature as specified on sheet music. Denom specifies the denominator as a negative power of two, for example 2 for a quarter note, 3 for an eighth note, etc. Click gives the number of MIDI clocks per metronome click, and NotesQ the number of 32nd notes in the nominal MIDI quarter note time of 24 clocks (8 for the default MIDI quarter note definition).<br>
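
Since Denom is stored as a power of two, the printed time signature has to be rebuilt from it. A minimal sketch, with `time_signature` as a hypothetical helper name:

```python
def time_signature(num, denom_power):
    """Return the time signature as written on sheet music, e.g. '6/8'."""
    # Denom is the exponent: 2 means a quarter note (2**2 = 4), 3 an eighth (2**3 = 8).
    return f"{num}/{2 ** denom_power}"

print(time_signature(4, 2))
print(time_signature(6, 3))
```
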
##Track, Time, Key_signature, Key, Major/Minor<br>
The key signature is specified by the numeric Key value, which is 0 for the key of C, a positive value for each sharp above C, or a negative value for each flat below C, thus in the inclusive range −7 to 7. The Major/Minor field is a quoted string which will be major for a major key and minor for a minor key.<br>
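
The Key value counts sharps (positive) or flats (negative), so it can be mapped to a key name with a lookup along the circle of fifths. This is only a sketch; `key_name` and the two lookup lists are hypothetical names introduced here, and the spelling of the key names (for example `F#` rather than `Gb`) is a convention chosen for the example.

```python
# Index 0 is seven flats (Key = -7), index 7 is no accidentals, index 14 is seven sharps.
MAJOR_KEYS = ["Cb", "Gb", "Db", "Ab", "Eb", "Bb", "F",
              "C", "G", "D", "A", "E", "B", "F#", "C#"]
MINOR_KEYS = ["Ab", "Eb", "Bb", "F", "C", "G", "D",
              "A", "E", "B", "F#", "C#", "G#", "D#", "A#"]

def key_name(key, mode):
    """Translate a Key_signature record (Key, Major/Minor) into a key name."""
    if not -7 <= key <= 7:
        raise ValueError("Key must be in the inclusive range -7 to 7")
    # The mode arrives as a quoted string such as '"major"'.
    mode = mode.strip().strip('"').lower()
    if mode not in ("major", "minor"):
        raise ValueError("mode must be 'major' or 'minor'")
    names = MAJOR_KEYS if mode == "major" else MINOR_KEYS
    return f"{names[key + 7]} {mode}"

print(key_name(-3, '"minor"'))
```
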
##Track, Time, Tempo, Number<br>
The tempo is specified as the Number of microseconds per quarter note, between 1 and 16777215. A value of 500000 corresponds to 120 quarter notes (“beats”) per minute. To convert beats per minute to a Tempo value, take the quotient from dividing 60,000,000 by the beats per minute.<br>
<br>
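
The tempo arithmetic above, plus the conversion from MIDI clocks to seconds that follows from it, might look like this. The function names are hypothetical, and `ticks_to_seconds` assumes a single constant tempo for the whole span, which real files do not guarantee. `division` is the clock-pulses-per-quarter-note value from the Header record.

```python
MICROSECONDS_PER_MINUTE = 60_000_000

def bpm_to_tempo(bpm):
    """Beats per minute -> microseconds per quarter note (integer quotient)."""
    return int(MICROSECONDS_PER_MINUTE // bpm)

def tempo_to_bpm(tempo):
    """Microseconds per quarter note -> beats per minute."""
    return MICROSECONDS_PER_MINUTE / tempo

def ticks_to_seconds(ticks, tempo, division):
    """Convert a span of MIDI clocks to seconds, assuming a constant tempo."""
    # One quarter note lasts `tempo` microseconds and spans `division` clocks.
    return ticks * tempo / (division * 1_000_000)

print(bpm_to_tempo(120))
print(ticks_to_seconds(480, 500_000, 480))
```
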
##Track, Time, Unknown_meta_event, Type, Length, Data, …<br>
If midicsv encounters a meta-event with a code not defined by the standard MIDI file specification, it outputs an unknown meta-event record in which Type gives the numeric meta-event type code, Length the number of data bytes in the meta-event, which can be any value between 0 and 2<sup>28</sup>−1, followed by the Data bytes. Since meta-events include their own length, it is possible to parse them even if their type and meaning are unknown. csvmidi will reconstruct unknown meta-events with the same type code and content as in the original MIDI file.<br>
<br>
<br>
#Channel Events<br>
These events are the “meat and potatoes” of MIDI files: the actual notes and modifiers that command the instruments to play the music. Each has a MIDI channel number as its first argument, followed by event-specific parameters. To permit programs which process CSV files to easily distinguish them from meta-events, names of channel events all have a suffix of “_c”.<br>
<br>
##Track, Time, Note_on_c, Channel, Note, Velocity<br>
Send a command to play the specified Note (Middle C is defined as Note number 60; all other notes are relative in the MIDI specification, but most instruments conform to the well-tempered scale) on the given Channel with Velocity (0 to 127). A Note_on_c event with Velocity zero is equivalent to a Note_off_c.<br>
<br>
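
For reading note numbers as pitches, a small lookup is enough. This sketch assumes the common convention that Middle C (note 60) is written C4; some software labels it C3 instead. `note_name` is a hypothetical helper, and sharps are used for all black keys regardless of key signature.

```python
NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def note_name(note_number):
    """Convert a MIDI note number (0 to 127) to a name such as 'C4'."""
    # int() lets the function accept floats like 67.0 as they appear in a dataframe.
    octave, pitch_class = divmod(int(note_number), 12)
    # Note 60 falls in block 5 of twelve semitones, labelled octave 4 here.
    return f"{NOTE_NAMES[pitch_class]}{octave - 1}"

print(note_name(60))
print(note_name(67.0))
```
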
##Track, Time, Note_off_c, Channel, Note, Velocity<br>
Stop playing the specified Note on the given Channel. The Velocity should be zero, but you never know what you'll find in a MIDI file.<br>
<br>
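
Because a Note_on_c with Velocity zero means the same as a Note_off_c, it can help to normalise the two before counting or pairing notes. A minimal pandas sketch on made-up data; the column names `note` and `velocity` are assumptions for the example and are not the names used in this notebook's dataframe.

```python
import pandas as pd

# Toy events: the second row is a note_on_c with velocity 0, i.e. a note-off in disguise.
events = pd.DataFrame({
    "time": [0, 120, 120, 240],
    "type": ["note_on_c", "note_on_c", "note_on_c", "note_off_c"],
    "note": [60, 60, 64, 64],
    "velocity": [64, 0, 64, 30],
})

# Relabel zero-velocity note-ons so every release has the same type.
silent_on = (events["type"] == "note_on_c") & (events["velocity"] == 0)
events.loc[silent_on, "type"] = "note_off_c"

print(events)
```
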
##Track, Time, Program_c, Channel, Program_num<br>
Switch the specified Channel to program (patch) Program_num, which must be between 0 and 127. The program or patch selects which instrument and associated settings that channel will emulate. The General MIDI specification provides a standard set of instruments, but synthesisers are free to implement other sets of instruments and many permit the user to create custom patches and assign them to program numbers.<br>
<br>
Apparently, due to instrument manufacturers' skepticism about musicians' ability to cope with the number zero, many instruments number patches from 1 to 128 rather than the 0 to 127 used within MIDI files. When interpreting Program_num values, note that they may be one less than the patch numbers given in an instrument's documentation.

In [26]:
# For note events, '5' is the MIDI note number and '6' the velocity; 'delete' is dropped later.
preset_columns = ['track', 'time', 'type', 'channel', '5', '6', 'delete']

In [27]:
files = ['WTK-1-Prelude01.csv', 'WTK-2-Fugue02.csv', 'WTK-1-Fugue01.csv', 'WTK-1-Prelude02.csv', 'WTK-2-Fugue03.csv',
 'WTK-1-Fugue02.csv', 'WTK-1-Prelude03.csv', 'WTK-2-Fugue04.csv',
 'WTK-1-Fugue03.csv', 'WTK-1-Prelude04.csv', 'WTK-2-Fugue05.csv',
 'WTK-1-Fugue04.csv', 'WTK-1-Prelude05.csv', 'WTK-2-Fugue06.csv',
 'WTK-1-Fugue05.csv', 'WTK-1-Prelude06.csv', 'WTK-2-Fugue07.csv',
 'WTK-1-Fugue06.csv', 'WTK-1-Prelude07.csv', 'WTK-2-Fugue08.csv',
 'WTK-1-Fugue07.csv', 'WTK-1-Prelude08.csv', 'WTK-2-Fugue09.csv',
 'WTK-1-Fugue08.csv', 'WTK-1-Prelude09.csv', 'WTK-2-Fugue10.csv',
 'WTK-1-Fugue09.csv', 'WTK-1-Prelude10.csv', 'WTK-2-Fugue11.csv',
 'WTK-1-Fugue10.csv', 'WTK-1-Prelude11.csv', 'WTK-2-Fugue12.csv',
 'WTK-1-Fugue11.csv', 'WTK-1-Prelude12.csv', 'WTK-2-Prelude01.csv',
 'WTK-1-Fugue12.csv', 'WTK-1-Prelude13.csv', 'WTK-2-Prelude02.csv',
 'WTK-1-Fugue13.csv', 'WTK-1-Prelude14.csv', 'WTK-2-Prelude03.csv',
 'WTK-1-Fugue14.csv', 'WTK-1-Prelude15.csv', 'WTK-2-Prelude04.csv',
 'WTK-1-Fugue15.csv', 'WTK-1-Prelude16.csv', 'WTK-2-Prelude05.csv',
 'WTK-1-Fugue16.csv', 'WTK-1-Prelude17.csv', 'WTK-2-Prelude06.csv',
 'WTK-1-Fugue17.csv', 'WTK-1-Prelude18.csv', 'WTK-2-Prelude07.csv',
 'WTK-1-Fugue18.csv', 'WTK-1-Prelude19.csv', 'WTK-2-Prelude08.csv',
 'WTK-1-Fugue19.csv', 'WTK-1-Prelude20.csv', 'WTK-2-Prelude09.csv',
 'WTK-1-Fugue20.csv', 'WTK-1-Prelude21.csv', 'WTK-2-Prelude10.csv',
 'WTK-1-Fugue21.csv', 'WTK-1-Prelude22.csv', 'WTK-2-Prelude11.csv',
 'WTK-1-Fugue22.csv', 'WTK-1-Prelude23.csv', 'WTK-2-Prelude12.csv',
 'WTK-1-Fugue23.csv', 'WTK-1-Fugue24.csv', 'WTK-2-Fugue01.csv']

In [28]:
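song_frames = []
for song in files:
    sub_df = pd.read_csv(song, names=preset_columns)
    # keep only the music tracks: track 0 holds the file header records, and track 1 is excluded too
    sub_df = sub_df.loc[~sub_df['track'].isin([0, 1])].copy()
    sub_df['title'] = song.split('.')[0]
    # drop the last remaining row of each file
    sub_df = sub_df.iloc[:-1]
    song_frames.append(sub_df)
# concatenate once at the end rather than growing the dataframe inside the loop
all_songs = pd.concat(song_frames)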

In [29]:
all_songs

,track,time,type,channel,5,6,delete,title
14,2,0,Start_track,,,,,WTK-1-Prelude01
15,2,0,Text_t,"""RH B""",,,,WTK-1-Prelude01
16,2,120,Note_on_c,0,67.0,64.0,,WTK-1-Prelude01
17,2,180,Note_off_c,0,67.0,44.0,,WTK-1-Prelude01
18,2,180,Note_on_c,0,72.0,64.0,,WTK-1-Prelude01
...,...,...,...,...,...,...,...,...
2160,2,39840,Note_off_c,0.0,60.0,64.0,,WTK-2-Fugue01
2161,2,39840,Note_off_c,0.0,64.0,64.0,,WTK-2-Fugue01
2162,2,39840,Note_off_c,0.0,67.0,64.0,,WTK-2-Fugue01
2163,2,39840,Note_off_c,0.0,72.0,64.0,,WTK-2-Fugue01


In [30]:
all_songs['type'] = [ s.strip().lower() for s in all_songs.type ]

In [31]:
all_songs.loc[:,'type'].unique()

array(['start_track', 'text_t', 'note_on_c', 'note_off_c', 'end_track',
       'program_c', 'unknown_event'], dtype=object)

In [32]:
# removing all 'unknown_event' rows, since they're not notes
# (type values were stripped and lower-cased above, so compare against the normalised name)
all_songs = all_songs[all_songs.loc[:,'type']!='unknown_event']

In [33]:
all_songs

,track,time,type,channel,5,6,delete,title
14,2,0,start_track,,,,,WTK-1-Prelude01
15,2,0,text_t,"""RH B""",,,,WTK-1-Prelude01
16,2,120,note_on_c,0,67.0,64.0,,WTK-1-Prelude01
17,2,180,note_off_c,0,67.0,44.0,,WTK-1-Prelude01
18,2,180,note_on_c,0,72.0,64.0,,WTK-1-Prelude01
...,...,...,...,...,...,...,...,...
2160,2,39840,note_off_c,0.0,60.0,64.0,,WTK-2-Fugue01
2161,2,39840,note_off_c,0.0,64.0,64.0,,WTK-2-Fugue01
2162,2,39840,note_off_c,0.0,67.0,64.0,,WTK-2-Fugue01
2163,2,39840,note_off_c,0.0,72.0,64.0,,WTK-2-Fugue01


In [34]:
all_songs.channel.unique()

array([nan, ' "RH B"', ' 0', ' "LH B"', 0.0, ' "1"', ' "2"', ' 1', ' "3"',
       ' "4"', ' 3', ' "RH H"', ' "RH L/LH H"', ' "LH L"', ' 2',
       ' "RH L"', ' "LH H"', ' "RH M"', ' "LH H/RH L"', ' "LH M"',
       ' "LH Pedal"', ' "RH L /LH H\\000\\221?"', ' "RH Melody"',
       ' "RH Chords"', ' "RH "', ' "LH"', ' "RH"',
       ' "RH L /LH H\\201p\\221"', ' 00x', ' "RH L/ LH H\\217x\\221"',
       ' "RH L/ LH H\\226@\\221"', ' "RH L/ LH H\\2108\\221"', 1.0, 2.0,
       ' "5"', ' 4'], dtype=object)

In [35]:
# the channel and delete columns contain no useful data for the notes
all_songs = all_songs.drop(labels=['channel', 'delete'], axis=1)

In [36]:
all_songs

,track,time,type,5,6,title
14,2,0,start_track,,,WTK-1-Prelude01
15,2,0,text_t,,,WTK-1-Prelude01
16,2,120,note_on_c,67.0,64.0,WTK-1-Prelude01
17,2,180,note_off_c,67.0,44.0,WTK-1-Prelude01
18,2,180,note_on_c,72.0,64.0,WTK-1-Prelude01
...,...,...,...,...,...,...
2160,2,39840,note_off_c,60.0,64.0,WTK-2-Fugue01
2161,2,39840,note_off_c,64.0,64.0,WTK-2-Fugue01
2162,2,39840,note_off_c,67.0,64.0,WTK-2-Fugue01
2163,2,39840,note_off_c,72.0,64.0,WTK-2-Fugue01


In [37]:
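# creating a more descriptive index, e.g. 'WTK-1-Prelude01' -> '1P1'
# (\d{2}) captures both digits of the piece number, so Prelude01 and Prelude11 get distinct labels
pattern = re.compile(r'(\d)-([A-Z])[a-z]+(\d{2})')
new_idx = [book + kind + str(int(num)) for book, kind, num in
           (pattern.search(title).groups() for title in all_songs.title)]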

In [38]:
all_songs['song'] = new_idx

In [39]:
all_songs.set_index('song', inplace=True)

In [40]:
all_songs

song,track,time,type,5,6,title
1P1,2,0,start_track,,,WTK-1-Prelude01
1P1,2,0,text_t,,,WTK-1-Prelude01
1P1,2,120,note_on_c,67.0,64.0,WTK-1-Prelude01
1P1,2,180,note_off_c,67.0,44.0,WTK-1-Prelude01
1P1,2,180,note_on_c,72.0,64.0,WTK-1-Prelude01
...,...,...,...,...,...,...
2F1,2,39840,note_off_c,60.0,64.0,WTK-2-Fugue01
2F1,2,39840,note_off_c,64.0,64.0,WTK-2-Fugue01
2F1,2,39840,note_off_c,67.0,64.0,WTK-2-Fugue01
2F1,2,39840,note_off_c,72.0,64.0,WTK-2-Fugue01


In [41]:
# removing text_t events, since they're not notes
all_songs = all_songs[all_songs.loc[:,'type']!='text_t']

In [42]:
all_songs

song,track,time,type,5,6,title
1P1,2,0,start_track,,,WTK-1-Prelude01
1P1,2,120,note_on_c,67.0,64.0,WTK-1-Prelude01
1P1,2,180,note_off_c,67.0,44.0,WTK-1-Prelude01
1P1,2,180,note_on_c,72.0,64.0,WTK-1-Prelude01
1P1,2,240,note_off_c,72.0,77.0,WTK-1-Prelude01
...,...,...,...,...,...,...
2F1,2,39840,note_off_c,60.0,64.0,WTK-2-Fugue01
2F1,2,39840,note_off_c,64.0,64.0,WTK-2-Fugue01
2F1,2,39840,note_off_c,67.0,64.0,WTK-2-Fugue01
2F1,2,39840,note_off_c,72.0,64.0,WTK-2-Fugue01


In [43]:
# program_c is not important for notes
all_songs = all_songs[all_songs.loc[:,'type']!='program_c']

In [44]:
all_songs

song,track,time,type,5,6,title
1P1,2,0,start_track,,,WTK-1-Prelude01
1P1,2,120,note_on_c,67.0,64.0,WTK-1-Prelude01
1P1,2,180,note_off_c,67.0,44.0,WTK-1-Prelude01
1P1,2,180,note_on_c,72.0,64.0,WTK-1-Prelude01
1P1,2,240,note_off_c,72.0,77.0,WTK-1-Prelude01
...,...,...,...,...,...,...
2F1,2,39840,note_off_c,60.0,64.0,WTK-2-Fugue01
2F1,2,39840,note_off_c,64.0,64.0,WTK-2-Fugue01
2F1,2,39840,note_off_c,67.0,64.0,WTK-2-Fugue01
2F1,2,39840,note_off_c,72.0,64.0,WTK-2-Fugue01
