# Data Processing and Exploration

This part of the report describes the datasets used in our analysis and how we build a combined dataset from publicly available files. The resulting dataset could be useful for any project that combines MIDI files with audio metadata and, crucially, the processing scales to data volumes that a local workstation can handle.

The goal is to derive the dataset used in the subsequent analysis directly from public data sources, so that the work can be replicated.

## Data Sources

The primary focus of our analysis is MIDI files. For the model to work, we need a corpus of MIDI files together with associated metadata and any additional attributes that can enrich the initial MIDI dataset (e.g. http://millionsongdataset.com/blog/12-2-12-fixing-matching-errors/).


For this purpose, we have identified two relevant datasets:

#### 1. Lakh MIDI Dataset: https://colinraffel.com/projects/lmd/

This dataset contains 176581 unique MIDI files and is used in music information retrieval. Its authors made an initial effort to match the entries to the larger corpus of the *Million Song Dataset*, and it contains a subset of that dataset's features.

#### 2. Million Song Dataset: http://millionsongdataset.com/

This dataset is a publicly available collection of audio features and metadata for 1M contemporary music tracks. It does not itself contain MIDI files; these come from the Lakh MIDI Dataset. The dataset was compiled by merging several datasets from the broader community to provide a reference dataset of commercial scale.

For our purposes, we are interested in the metadata associated with the MIDI files. The scale of the full dataset, however, poses a challenge. Fortunately, a summary file with the metadata of the full dataset, excluding the heavier audio analysis attributes, is available. We use this file (*Summary file of the whole dataset*, http://millionsongdataset.com/pages/getting-dataset/) to add features to the MIDI files of the Lakh dataset. This step is important because any new feature keyed by Million Song Dataset song identifiers can then be coupled with the existing datasets. An example track description for the Million Song Dataset is available here: http://millionsongdataset.com/pages/example-track-description/.

Compared to the Lakh dataset, it is broader, involves more communities, and is more actively maintained. Linking more data to it would therefore allow the Lakh dataset to be extended with further features in the future.

In particular, 45 129 MIDI files have been matched and aligned with entries in the Million Song Dataset, covering about 31 000 distinct songs.

### Downloads

We use publicly available datasets and link to the files that were available when this report was written. For each file we give the variable name used in the code, so that the notebook can be adapted to other systems, platforms, or data locations.

- (**MSD_SUMMARY**) A summary file of the whole Million Song Dataset. It comes in HDF5 format with all metadata, excluding arrays like audio analysis, similar artists and tags: http://millionsongdataset.com/sites/default/files/AdditionalFiles/msd_summary_file.h5
- (**LAKH_MIDI**) A subset of 45129 MIDI files with deduplicated and matched entries in the Million Song Dataset: http://hog.ee.columbia.edu/craffel/lmd/lmd_matched.tar.gz 
- (**LAKH_DATA**) Filtered HDF5 files, per song, from the Million Song Dataset. It is important to note that this dataset contains a subset of metadata that is available in the **MSD_SUMMARY**: http://hog.ee.columbia.edu/craffel/lmd/lmd_matched_h5.tar.gz
- (**MIDI_MATCH**) A JSON file that lists the match confidence score for every match in the LMD-matched and LMD-aligned datasets; it is crucial for mapping a MIDI file to its metadata: http://hog.ee.columbia.edu/craffel/lmd/match_scores.json
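
The list above can be captured in a small mapping from variable name to URL. The following is a minimal sketch and not part of the original notebook: the DOWNLOADS dict and the helper names local_name and download_all are our own illustrative choices, and nothing is downloaded unless download_all is called explicitly. The two archives would still need to be unpacked afterwards.

```python
import os
import posixpath
from urllib.parse import urlparse
from urllib.request import urlretrieve

# URLs copied from the list above; the dict itself is an illustrative assumption
DOWNLOADS = {
    'MSD_SUMMARY': 'http://millionsongdataset.com/sites/default/files/AdditionalFiles/msd_summary_file.h5',
    'LAKH_MIDI': 'http://hog.ee.columbia.edu/craffel/lmd/lmd_matched.tar.gz',
    'LAKH_DATA': 'http://hog.ee.columbia.edu/craffel/lmd/lmd_matched_h5.tar.gz',
    'MIDI_MATCH': 'http://hog.ee.columbia.edu/craffel/lmd/match_scores.json',
}

def local_name(url):
    """Return the file name component of a URL."""
    return posixpath.basename(urlparse(url).path)

def download_all(target_dir):
    """Download every file that is not already present in target_dir."""
    os.makedirs(target_dir, exist_ok=True)
    for url in DOWNLOADS.values():
        destination = os.path.join(target_dir, local_name(url))
        if not os.path.exists(destination):
            urlretrieve(url, destination)

print(local_name(DOWNLOADS['MIDI_MATCH']))
```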

## Setting up the environment

We import typical data processing tools. We use the os library because the data is not stored in a flat folder: files are nested in subfolders derived from the track ID, and the MIDI files are named by their MD5 checksum. This layout is an artifact of the dataset, described in the Million Song Dataset documentation.

Next we set the necessary global path/file variables, and define helper functions.

In [3]:
import os # for navigating the paths
import h5py # for opening the HDF5 files
import pandas # processing data into a dataframe
import json # processing the JSON files
import tables # processing of HDF5 files
import numpy # for data processing
from tqdm import tqdm # for long processing

In [4]:
# we make an assumption that all the files will be unpacked and saved in the same DATA_PATH_PREFIX location
DATA_PATH_PREFIX = r'D:\Music Dataset' # The location where all the folders with the data have been placed (raw strings keep the backslashes literal)
LAKH_MIDI = r'lmd_matched\lmd_matched'
LAKH_DATA = r'lmd_matched_h5\lmd_matched_h5'
MIDI_MATCH='match_scores.json'
MSD_SUMMARY = 'msd_summary_file.h5'

In [6]:
# return a path from multiple elements (a list) concatenated by the char (default '\')
def concat_path(elements, char='\\'):
    return char.join(elements)
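
The helper above joins path elements with a fixed Windows separator. As a hedged alternative, the sketch below uses the standard pathlib module. The name concat_path_portable is hypothetical, and PureWindowsPath and PurePosixPath are used here only so that the example gives the same result on any machine.

```python
from pathlib import PurePosixPath, PureWindowsPath

def concat_path_portable(elements, flavour=PureWindowsPath):
    """Join path elements using the separator of the given path flavour."""
    return str(flavour(*elements))

# Windows-style join, as used throughout this notebook
print(concat_path_portable(['D:\\Music Dataset', 'match_scores.json']))

# POSIX-style join, with a made-up data location
print(concat_path_portable(['/data/music', 'match_scores.json'], flavour=PurePosixPath))
```

When the code only needs to run on the current machine, pathlib.Path picks the native flavour automatically.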

In [7]:
# open the JSON file that contains the confidence between file mappings
with open(concat_path([DATA_PATH_PREFIX, MIDI_MATCH])) as json_file:
    match_scores = json.load(json_file)
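
Judging from how match_scores is used later in this notebook, the JSON maps each Million Song Dataset track ID to a dictionary of candidate MIDI file names and their confidence scores. The sketch below illustrates this assumed shape with made-up IDs, names and scores, and counts songs and candidate MIDI files separately.

```python
import json

# made-up sample in the assumed shape: track ID -> {MIDI file name: confidence score}
sample_json = '''
{
  "TRXXXXX000000000001": {"md5_aaa": 0.71, "md5_bbb": 0.64},
  "TRXXXXX000000000002": {"md5_ccc": 0.58}
}
'''

sample_scores = json.loads(sample_json)

# one entry per song, possibly several candidate MIDI files per song
n_tracks = len(sample_scores)
n_midi_files = sum(len(candidates) for candidates in sample_scores.values())

print(n_tracks, n_midi_files)
```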

In [8]:
# get all the subfolders and file paths 
def get_all_song_hdf5_path(base_path=concat_path([DATA_PATH_PREFIX, LAKH_DATA])):
    
    # https://groups.google.com/g/comp.lang.python/c/tUW8tP_OfQs?pli=1 -- a bit nicer way below inspired by this
    all_subdirs = os.walk(base_path, topdown=False)
    
    all_file_paths = [os.path.join(root, filename) for root, dirs, files in all_subdirs for filename in files]
    
    return all_file_paths
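
To illustrate what the function above returns without needing the dataset, the following sketch builds a small throwaway directory tree and collects its files with os.walk. The folder and file names are invented for the example, and list_files is a hypothetical variant that additionally filters by extension and sorts the result.

```python
import os
import tempfile

def list_files(base_path, extension='.h5'):
    """Return the sorted paths of all files under base_path with the given extension."""
    return sorted(
        os.path.join(root, filename)
        for root, dirs, files in os.walk(base_path)
        for filename in files
        if filename.endswith(extension)
    )

with tempfile.TemporaryDirectory() as base:
    # invented nested layout resembling the one used by the dataset
    for sub, name in [('A/A/A', 'TRAAA1.h5'), ('A/A/B', 'TRAAB1.h5'), ('A/A/B', 'notes.txt')]:
        folder = os.path.join(base, *sub.split('/'))
        os.makedirs(folder, exist_ok=True)
        open(os.path.join(folder, name), 'w').close()

    # paths relative to the temporary base folder
    found = [os.path.relpath(p, base) for p in list_files(base)]

print(found)
```

Filtering by extension guards against stray files in the data folder, and sorting makes the order of the paths reproducible across runs.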

## Dataset creation


In [17]:
# sanity check for the obtained metadata paths from lakh dataset
paths = get_all_song_hdf5_path()

for i in range (0, 5):
    print(paths[i])

D:\Music Dataset\lmd_matched_h5\lmd_matched_h5\A\A\A\TRAAAGR128F425B14B.h5
D:\Music Dataset\lmd_matched_h5\lmd_matched_h5\A\A\A\TRAAAZF12903CCCF6B.h5
D:\Music Dataset\lmd_matched_h5\lmd_matched_h5\A\A\B\TRAABVM128F92CA9DC.h5
D:\Music Dataset\lmd_matched_h5\lmd_matched_h5\A\A\B\TRAABXH128F42955D6.h5
D:\Music Dataset\lmd_matched_h5\lmd_matched_h5\A\A\C\TRAACQE12903CC706C.h5
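
The paths printed above suggest that each file sits in three nested folders named after the third, fourth and fifth characters of its track ID. Assuming this holds for the whole dataset (we infer it only from the listing), the relative path of a song can be derived from its ID. The helper track_id_to_relpath is hypothetical.

```python
from pathlib import PureWindowsPath

def track_id_to_relpath(track_id, extension='.h5'):
    """Build the nested relative path inferred from the listing above."""
    return str(PureWindowsPath(track_id[2], track_id[3], track_id[4], track_id + extension))

print(track_id_to_relpath('TRAABVM128F92CA9DC'))
```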


We print out the available attributes and nodes in the HDF5 file(s). More information on the HDF5 format and the h5py library is available here: 
1. https://www.h5py.org/
2. https://www.hdfgroup.org/solutions/hdf5/

In [7]:
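# https://stackoverflow.com/questions/31146036/how-do-i-traverse-a-hdf5-file-using-h5py
def visitor_func(name, node):
    # print the path of every node, labelling datasets and showing the type of other nodes (groups)
    if isinstance(node, h5py.Dataset):
        print('Dataset:' + node.name)
    else:
        print(str(type(node)) + ":" + node.name)

# open the file read-only and close it once the traversal is done
with h5py.File(paths[0], 'r') as f:
    f.visititems(visitor_func)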

<class 'h5py._hl.group.Group'>:/analysis
Dataset:/analysis/bars_confidence
Dataset:/analysis/bars_start
Dataset:/analysis/beats_confidence
Dataset:/analysis/beats_start
Dataset:/analysis/sections_confidence
Dataset:/analysis/sections_start
Dataset:/analysis/segments_confidence
Dataset:/analysis/segments_loudness_max
Dataset:/analysis/segments_loudness_max_time
Dataset:/analysis/segments_loudness_start
Dataset:/analysis/segments_pitches
Dataset:/analysis/segments_start
Dataset:/analysis/segments_timbre
Dataset:/analysis/songs
Dataset:/analysis/tatums_confidence
Dataset:/analysis/tatums_start
<class 'h5py._hl.group.Group'>:/metadata
Dataset:/metadata/artist_terms
Dataset:/metadata/artist_terms_freq
Dataset:/metadata/artist_terms_weight
Dataset:/metadata/similar_artists
Dataset:/metadata/songs
<class 'h5py._hl.group.Group'>:/musicbrainz
Dataset:/musicbrainz/artist_mbtags
Dataset:/musicbrainz/artist_mbtags_count
Dataset:/musicbrainz/songs


The idea behind the following code is to combine the per-song files into a single dataframe.

Each per-song file has the HDF5 structure shown above, with several table nodes, the most important being the songs table in each group. We write a function that parses such a table node into a dictionary, so that it can be integrated more easily into a single dataframe.

In [20]:
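# convert a single-row HDF5 table node to a dict (despite the name, a dict is returned, not a dataframe)
def table_to_df(table_node):
    tmp_map = {}
    
    for col_name in table_node.cols._v_colnames:
        val = table_node.col(col_name)
        
        if len(val) == 1:
            # byte strings are decoded to str, other values are kept as they are
            if isinstance(val[0], numpy.bytes_):
                tmp_map[col_name] = val[0].decode()
            else:
                tmp_map[col_name] = val[0]
        else:
            # each per-song table is expected to hold exactly one row
            print("ERROR: expected one row in column " + col_name + ", got " + str(len(val)))
        
    return tmp_map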

We will convert the HDF5 files to dataframes for easier use for further analysis. 

In this iteration, we just test the functions on a single file, before running it on all available metadata files for the songs.

In [18]:
# we load one of the metadata files for one of the songs
test = pandas.HDFStore(paths[0], 'r')

In [21]:
# then we test the function to see if we get a dict/map from the attributes of the songs table under analysis group
analysis_songs = table_to_df(test.root.analysis.songs)

In [22]:
analysis_songs

{'analysis_sample_rate': 22050,
 'audio_md5': '7573fabe891b25bcd3c5866e4c5df1f0',
 'danceability': 0.0,
 'duration': 240.63955,
 'end_of_fade_in': 4.487,
 'energy': 0.0,
 'idx_bars_confidence': 0,
 'idx_bars_start': 0,
 'idx_beats_confidence': 0,
 'idx_beats_start': 0,
 'idx_sections_confidence': 0,
 'idx_sections_start': 0,
 'idx_segments_confidence': 0,
 'idx_segments_loudness_max': 0,
 'idx_segments_loudness_max_time': 0,
 'idx_segments_loudness_start': 0,
 'idx_segments_pitches': 0,
 'idx_segments_start': 0,
 'idx_segments_timbre': 0,
 'idx_tatums_confidence': 0,
 'idx_tatums_start': 0,
 'key': 9,
 'key_confidence': 0.608,
 'loudness': -7.322,
 'mode': 0,
 'mode_confidence': 0.495,
 'start_of_fade_out': 240.64,
 'tempo': 123.989,
 'time_signature': 4,
 'time_signature_confidence': 0.8,
 'track_id': 'TRAAAGR128F425B14B'}

In [23]:
# we do the same for the song table under metadata group
metadata_songs = table_to_df(test.root.metadata.songs)

In [26]:
metadata_songs

{'analyzer_version': '',
 'artist_7digitalid': 11319,
 'artist_familiarity': 0.7128860298225487,
 'artist_hotttnesss': 0.5592572617501187,
 'artist_id': 'ARGE7G11187FB37E05',
 'artist_latitude': nan,
 'artist_location': 'Brooklyn, NY',
 'artist_longitude': nan,
 'artist_mbid': '7bd9e20e-74b9-446a-a2ed-a223f82a36e7',
 'artist_name': 'Cyndi Lauper',
 'artist_playmeid': 382,
 'genre': '',
 'idx_artist_terms': 0,
 'idx_similar_artists': 0,
 'release': 'Bring Ya To The Brink',
 'release_7digitalid': 279219,
 'song_hotttnesss': nan,
 'song_id': 'SONRWUU12AF72A4283',
 'title': 'Into The Nightlife',
 'track_7digitalid': 3110092}

In [31]:
# we do the same for the final available song table under musicbrainz group
musicbrainz_songs = table_to_df(test.root.musicbrainz.songs)

In [32]:
musicbrainz_songs 

{'idx_artist_mbtags': 0, 'year': 2008}

Finally, we take the union of all the attributes to create a single long row in a dataframe.

This is possible because there is one file per song, so we already know that these attributes belong to a single entity (which also has an ID that connects it to external MIDI files and other datasets if needed).

In [34]:
joined_attributes = {}

joined_attributes.update(metadata_songs)
joined_attributes.update(analysis_songs)
joined_attributes.update(musicbrainz_songs)
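
Because the three dictionaries are merged with update, a key that occurred in more than one of them would silently keep only the value from the last update. In the output shown in this notebook the three tables appear to have no column names in common, so nothing is lost, but the check is cheap. The sketch below uses miniature excerpts of the three tables, and merge_checked is a hypothetical helper.

```python
def merge_checked(*dicts):
    """Merge dicts in order and report keys that occur in more than one of them."""
    merged = {}
    duplicates = set()
    for d in dicts:
        # keys already seen in an earlier dict would be overwritten by this one
        duplicates.update(merged.keys() & d.keys())
        merged.update(d)
    return merged, duplicates

# miniature excerpts of the three per-song tables
metadata = {'title': 'Into The Nightlife', 'artist_name': 'Cyndi Lauper'}
analysis = {'tempo': 123.989, 'track_id': 'TRAAAGR128F425B14B'}
musicbrainz = {'year': 2008}

row, clashes = merge_checked(metadata, analysis, musicbrainz)
print(list(row.keys()), clashes)
```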

In [35]:
joined_attributes

{'analyzer_version': '',
 'artist_7digitalid': 11319,
 'artist_familiarity': 0.7128860298225487,
 'artist_hotttnesss': 0.5592572617501187,
 'artist_id': 'ARGE7G11187FB37E05',
 'artist_latitude': nan,
 'artist_location': 'Brooklyn, NY',
 'artist_longitude': nan,
 'artist_mbid': '7bd9e20e-74b9-446a-a2ed-a223f82a36e7',
 'artist_name': 'Cyndi Lauper',
 'artist_playmeid': 382,
 'genre': '',
 'idx_artist_terms': 0,
 'idx_similar_artists': 0,
 'release': 'Bring Ya To The Brink',
 'release_7digitalid': 279219,
 'song_hotttnesss': nan,
 'song_id': 'SONRWUU12AF72A4283',
 'title': 'Into The Nightlife',
 'track_7digitalid': 3110092,
 'analysis_sample_rate': 22050,
 'audio_md5': '7573fabe891b25bcd3c5866e4c5df1f0',
 'danceability': 0.0,
 'duration': 240.63955,
 'end_of_fade_in': 4.487,
 'energy': 0.0,
 'idx_bars_confidence': 0,
 'idx_bars_start': 0,
 'idx_beats_confidence': 0,
 'idx_beats_start': 0,
 'idx_sections_confidence': 0,
 'idx_sections_start': 0,
 'idx_segments_confidence': 0,
 'idx_segment

In [36]:
# close the file in order to perform the full processing
test.close()

#### Data flattening and extraction 

Next, we will create a dataframe from the available files to assist in further analysis. 

We read the metadata from an HDF5 file to get the column names, create an empty dataframe, and then repeat the processing shown above for a single file on every file, adding each result to obtain the final dataframe.

In [42]:
# we open any file in order to get the HDF5 layout for the song metadata tables
def get_all_song_cols(sample_file_path):
    item = {'id': 0}
    
    # the per-file processing further below must update the dict in this same order,
    # so that the values are mapped to the correct column names
    with pandas.HDFStore(sample_file_path, 'r') as test:  # the file is closed on exit
        item.update(table_to_df(test.root.metadata.songs))
        item.update(table_to_df(test.root.analysis.songs))
        item.update(table_to_df(test.root.musicbrainz.songs))
    
    return list(item.keys())

In [43]:
# for any path - in this case the first entry - we create the dataframe header/columns
cols = get_all_song_cols(get_all_song_hdf5_path()[0])

In [44]:
cols

['id',
 'analyzer_version',
 'artist_7digitalid',
 'artist_familiarity',
 'artist_hotttnesss',
 'artist_id',
 'artist_latitude',
 'artist_location',
 'artist_longitude',
 'artist_mbid',
 'artist_name',
 'artist_playmeid',
 'genre',
 'idx_artist_terms',
 'idx_similar_artists',
 'release',
 'release_7digitalid',
 'song_hotttnesss',
 'song_id',
 'title',
 'track_7digitalid',
 'analysis_sample_rate',
 'audio_md5',
 'danceability',
 'duration',
 'end_of_fade_in',
 'energy',
 'idx_bars_confidence',
 'idx_bars_start',
 'idx_beats_confidence',
 'idx_beats_start',
 'idx_sections_confidence',
 'idx_sections_start',
 'idx_segments_confidence',
 'idx_segments_loudness_max',
 'idx_segments_loudness_max_time',
 'idx_segments_loudness_start',
 'idx_segments_pitches',
 'idx_segments_start',
 'idx_segments_timbre',
 'idx_tatums_confidence',
 'idx_tatums_start',
 'key',
 'key_confidence',
 'loudness',
 'mode',
 'mode_confidence',
 'start_of_fade_out',
 'tempo',
 'time_signature',
 'time_signature_confid

In [24]:
# we create an empty dataframe with columns specified as above
df_all = pandas.DataFrame(columns=cols)

The following function will be used to perform the processing for every file (the dataset provides one per song) and create a dict from the data of a single file.

In [25]:
def read_all_song_hdf5(file_path, path_separator='\\'):
    
    id = file_path.replace(concat_path([DATA_PATH_PREFIX, LAKH_DATA]), '').replace('.h5', '')
    msd_id = id.split(path_separator)[-1]
    item = {'id': msd_id}
    
    with pandas.HDFStore(file_path, 'r') as f:  # open file
        
        item.update(table_to_df(f.root.metadata.songs))
        item.update(table_to_df(f.root.analysis.songs))
        item.update(table_to_df(f.root.musicbrainz.songs))
        
        return item
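
The ID extraction above relies on string replacement and a hard-coded separator. A hedged alternative is to take the file name without its extension. PureWindowsPath is used so that the example gives the same result on any machine, and msd_id_from_path is a hypothetical name.

```python
from pathlib import PureWindowsPath

def msd_id_from_path(file_path):
    """Return the file name without its extension, i.e. the track ID."""
    return PureWindowsPath(file_path).stem

example = r'D:\Music Dataset\lmd_matched_h5\lmd_matched_h5\A\A\A\TRAAAGR128F425B14B.h5'
print(msd_id_from_path(example))
```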

Finally, for every file (from the existing paths), we extract the dict and add it to the dataframe.

#### Caution, the following code takes a lot of time.
Since we are processing about 31 000 metadata files that have associated MIDI files, this is a time-consuming process. For this reason, we serialize the result for future use.

In [26]:
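# DataFrame.append was removed in pandas 2.0 and copies the whole frame on every call,
# so we collect the rows first and build the dataframe once
rows = [read_all_song_hdf5(f) for f in tqdm(get_all_song_hdf5_path())]
df_all = pandas.DataFrame(rows, columns=cols)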

100%|████████████████████████████████████████████████████████████████████████████| 31034/31034 [27:27<00:00, 18.84it/s]


In [27]:
df_all

Unnamed: 0,id,analyzer_version,artist_7digitalid,artist_familiarity,artist_hotttnesss,artist_id,artist_latitude,artist_location,artist_longitude,artist_mbid,...,loudness,mode,mode_confidence,start_of_fade_out,tempo,time_signature,time_signature_confidence,track_id,idx_artist_mbtags,year
0,TRAAAGR128F425B14B,,11319,0.712886,0.559257,ARGE7G11187FB37E05,,"Brooklyn, NY",,7bd9e20e-74b9-446a-a2ed-a223f82a36e7,...,-7.322,0,0.495,240.640,123.989,4,0.800,TRAAAGR128F425B14B,0,2008
1,TRAAAZF12903CCCF6B,,93189,0.546102,0.383787,ARJJ8611187FB5321F,40.79086,"New York, NY [Manhattan]",-73.96644,471e21ab-7a14-4190-a9d2-f95197616df4,...,-11.137,1,0.442,167.607,110.129,4,0.711,TRAAAZF12903CCCF6B,0,1983
2,TRAABVM128F92CA9DC,,1396,0.707200,0.513463,ARYKCQI1187FB3B18F,,,,eeacb319-8d4c-48e0-80a0-944e71c375bf,...,-5.271,1,0.756,285.605,150.062,4,0.931,TRAABVM128F92CA9DC,0,2004
3,TRAABXH128F42955D6,,611,0.635346,0.463478,ARD9UVF1187B9B17FE,,"Hawthorne, CA",,634fe78e-fc6b-4b2a-ba83-c8c66e13a8aa,...,-7.108,1,0.514,160.717,100.494,3,1.000,TRAABXH128F42955D6,0,1998
4,TRAACQE12903CC706C,,153505,0.583006,0.333922,ARDDIBO1187B9B0822,,,,7720a649-0c70-4c7a-972a-c29ccb898201,...,-5.033,1,0.453,156.973,118.430,4,0.610,TRAACQE12903CC706C,0,2007
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
31029,TRZZYLO12903CAC06C,,382471,0.445457,0.287668,AR0Q5531187FB45143,,,,6e3b2a72-b4c4-45b3-8154-63ee011b2807,...,-14.526,1,0.677,258.507,133.108,3,0.381,TRZZYLO12903CAC06C,0,0
31030,TRZZYTX128F92EBE33,,32056,0.739047,0.541603,ARW9QSZ1187FB4B93E,,"Liverpool, England",,42a8f507-8412-4611-854f-926571049fa0,...,-8.593,1,0.652,164.548,122.832,4,0.090,TRZZYTX128F92EBE33,0,0
31031,TRZZZBU128F426811B,,70032,0.490194,0.377473,ARYN8YT1187FB38396,,,,a1658d98-c4dc-40b9-8bbd-9793f44e64dd,...,-7.935,1,0.481,191.002,171.826,4,0.978,TRZZZBU128F426811B,0,0
31032,TRZZZTN128EF35C42F,,65225,0.327659,0.184442,ARHND4H1187B990171,,,,e6993972-e17f-43db-a2d4-f8980ddd0d8c,...,-5.997,0,0.511,465.119,170.928,5,1.000,TRZZZTN128EF35C42F,0,0


In [29]:
# we serialize the file
out_path = concat_path([DATA_PATH_PREFIX, 'lakh_all_songs_processed.csv'])

df_all.to_csv(out_path, index_label=False)
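
Since the full pass takes close to half an hour, it can help to guard the expensive step so that it only runs when the serialized file is missing. The following is a minimal sketch of that pattern using only the standard library. The helper load_or_build, the stand-in build_rows function and the file name are assumptions made for the illustration, not the notebook's code.

```python
import csv
import os
import tempfile

def load_or_build(csv_path, build_rows, fieldnames):
    """Return rows from csv_path if it exists, otherwise build and serialize them."""
    if os.path.exists(csv_path):
        with open(csv_path, newline='') as handle:
            return list(csv.DictReader(handle))

    rows = build_rows()
    with open(csv_path, 'w', newline='') as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return rows

calls = []

def build_rows():
    # stand-in for the slow per-file processing; values are strings because
    # the csv module reads every field back as a string
    calls.append(1)
    return [{'id': 'TRAAAGR128F425B14B', 'year': '2008'}]

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'cache.csv')
    first = load_or_build(path, build_rows, ['id', 'year'])   # builds and writes
    second = load_or_build(path, build_rows, ['id', 'year'])  # reads the cached file

print(first == second, len(calls))
```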

Our data processing pipeline takes the Lakh dataset and, using the available song ID, extends the available metadata, flattens the features into a dataframe, and makes them usable in an analysis. 

---

### MIDI2Vec Dataset

We recreate the dataset by selecting the features that were used by MIDI2Vec, with a data processing pipeline that starts from the publicly available datasets.

In [530]:
# a function to match the songs to their MIDI files based on the candidates entry (from a separate file)
# returns the key of the most likely match
def best_match(candidates):
    best = None
    score = 0
    
    for key, value in candidates.items():
        if value > score:
            best = key
            score = value
    
    return best
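
The loop above selects the candidate with the highest score. For comparison, the same selection can be sketched with the built-in max. The candidate names and scores below are invented, and best_match_max is a hypothetical name. One difference is worth noting: the loop returns None when no score is above zero, while this variant returns None only for an empty dictionary.

```python
def best_match_max(candidates):
    """Return the key with the highest score, or None if there are no candidates."""
    if not candidates:
        return None
    return max(candidates, key=candidates.get)

# invented candidates: MIDI file name -> confidence score
candidates = {'md5_aaa': 0.64, 'md5_bbb': 0.71, 'md5_ccc': 0.58}
print(best_match_max(candidates))
```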

In [534]:
# for a faster analysis, we get only the relevant columns 
def read_hdf5(file_path, path_separator='\\'):
    
    id = file_path.replace(concat_path([DATA_PATH_PREFIX, LAKH_DATA]), '').replace('.h5', '')
    msd_id = id.split(path_separator)[-1]
    item = {'id': msd_id}
    
    #print(msd_id)
    
    with h5py.File(file_path, 'r') as f:  # open file

        item['year'] = f['musicbrainz/songs'][0][1] # get the relevant HDF5 items

        item['tag_echo'] = ''
        item['tag_mbz'] = ''
        tags_echo = f['metadata/artist_terms'][:]
        if len(tags_echo) > 0:
            item['tag_echo'] = tags_echo[0].decode()

        tags_mbz = f['musicbrainz/artist_mbtags'][:]
        if len(tags_mbz) > 0:
            item['tag_mbz'] = tags_mbz[0].decode()

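        # fields are read by position; the positions follow the column order listed earlier
        # (8: artist_mbid, 9: artist_name, 14: release, 18: title)
        song = f['metadata/songs'][0]
        item['artist_mb'] = song[8].decode()
        item['artist_name'] = song[9].decode()
        item['album_name'] = song[14].decode()
        item['song_name'] = song[18].decode()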
        
        # get the most likely match for the MIDI file and save the MIDI path
        best = best_match(match_scores[msd_id])

        file = os.path.join(id, best + '.mid')
        item['file'] = file

        return item

In [535]:
# we create the dataframe that is compliant with the original MIDI2Vec processing pipeline
df = pandas.DataFrame(
    columns=['id', 'file', 'song_name', 'album_name', 'artist_name', 'artist_mb', 'tag_echo', 'tag_mbz', 'year'])

For every metadata file (one per song), we extract the relevant attributes and append them to a dataframe.

**CAUTION, THE FOLLOWING CODE TAKES A LOT OF TIME** - you can check the summary output or use the *lakh_processed.csv* that is the result of this processing, serialized from the dataframe into a CSV file.

In [536]:
# collect all the metadata rows and build the dataframe once
# (DataFrame.append was removed in pandas 2.0)
rows = [read_hdf5(f) for f in tqdm(get_all_song_hdf5_path())]
df = pandas.DataFrame(rows, columns=df.columns)

We perform a sanity check and display the elements of the created dataframe.

We note that the final dataframe contains 31034 MIDI files matched with metadata, consistent with the findings of the MIDI2Vec processing.

In [537]:
df.head()

Unnamed: 0,id,file,song_name,album_name,artist_name,artist_mb,tag_echo,tag_mbz,year
0,TRAAAGR128F425B14B,\A\A\A\TRAAAGR128F425B14B\b97c529ab9ef783a849b...,Into The Nightlife,Bring Ya To The Brink,Cyndi Lauper,7bd9e20e-74b9-446a-a2ed-a223f82a36e7,new wave,classic pop and rock,2008
1,TRAAAZF12903CCCF6B,\A\A\A\TRAAAZF12903CCCF6B\05f21994c71a5f881e64...,Break My Stride,I Don't Speak The Language,Matthew Wilder,471e21ab-7a14-4190-a9d2-f95197616df4,pop rock,,1983
2,TRAABVM128F92CA9DC,\A\A\B\TRAABVM128F92CA9DC\39d6c288e1bd93d4705e...,Caught In A Dream,Gold,Tesla,eeacb319-8d4c-48e0-80a0-944e71c375bf,hard rock,,2004
3,TRAABXH128F42955D6,\A\A\B\TRAABXH128F42955D6\04266ac849c1d3814dc0...,Keep An Eye On Summer (Album Version),Imagination,Brian Wilson,634fe78e-fc6b-4b2a-ba83-c8c66e13a8aa,chamber pop,classic pop and rock,1998
4,TRAACQE12903CC706C,\A\A\C\TRAACQE12903CC706C\f1be134b947dfde3eece...,Summer,Good Morning,Old Man River,7720a649-0c70-4c7a-972a-c29ccb898201,los angeles,,2007


Finally, we serialize the result into a file, to skip the expensive processing in the future.

In [13]:
out_path = concat_path([DATA_PATH_PREFIX, 'lakh_processed.csv'])

df.to_csv(out_path, index_label=False)

midi2vec_df = df

---
#### Loading the processed and serialized files

In [46]:
midi2vec_df = pandas.read_csv(concat_path([DATA_PATH_PREFIX, 'lakh_processed.csv']))
df_all = pandas.read_csv(concat_path([DATA_PATH_PREFIX, 'lakh_all_songs_processed.csv']))

In [47]:
midi2vec_df.head()

Unnamed: 0,id,file,song_name,album_name,artist_name,artist_mb,tag_echo,tag_mbz,year
0,TRAAAGR128F425B14B,\A\A\A\TRAAAGR128F425B14B\b97c529ab9ef783a849b...,Into The Nightlife,Bring Ya To The Brink,Cyndi Lauper,7bd9e20e-74b9-446a-a2ed-a223f82a36e7,new wave,classic pop and rock,2008
1,TRAAAZF12903CCCF6B,\A\A\A\TRAAAZF12903CCCF6B\05f21994c71a5f881e64...,Break My Stride,I Don't Speak The Language,Matthew Wilder,471e21ab-7a14-4190-a9d2-f95197616df4,pop rock,,1983
2,TRAABVM128F92CA9DC,\A\A\B\TRAABVM128F92CA9DC\39d6c288e1bd93d4705e...,Caught In A Dream,Gold,Tesla,eeacb319-8d4c-48e0-80a0-944e71c375bf,hard rock,,2004
3,TRAABXH128F42955D6,\A\A\B\TRAABXH128F42955D6\04266ac849c1d3814dc0...,Keep An Eye On Summer (Album Version),Imagination,Brian Wilson,634fe78e-fc6b-4b2a-ba83-c8c66e13a8aa,chamber pop,classic pop and rock,1998
4,TRAACQE12903CC706C,\A\A\C\TRAACQE12903CC706C\f1be134b947dfde3eece...,Summer,Good Morning,Old Man River,7720a649-0c70-4c7a-972a-c29ccb898201,los angeles,,2007


In [48]:
df_all.head()

Unnamed: 0,id,analyzer_version,artist_7digitalid,artist_familiarity,artist_hotttnesss,artist_id,artist_latitude,artist_location,artist_longitude,artist_mbid,...,loudness,mode,mode_confidence,start_of_fade_out,tempo,time_signature,time_signature_confidence,track_id,idx_artist_mbtags,year
0,TRAAAGR128F425B14B,,11319,0.712886,0.559257,ARGE7G11187FB37E05,,"Brooklyn, NY",,7bd9e20e-74b9-446a-a2ed-a223f82a36e7,...,-7.322,0,0.495,240.64,123.989,4,0.8,TRAAAGR128F425B14B,0,2008
1,TRAAAZF12903CCCF6B,,93189,0.546102,0.383787,ARJJ8611187FB5321F,40.79086,"New York, NY [Manhattan]",-73.96644,471e21ab-7a14-4190-a9d2-f95197616df4,...,-11.137,1,0.442,167.607,110.129,4,0.711,TRAAAZF12903CCCF6B,0,1983
2,TRAABVM128F92CA9DC,,1396,0.7072,0.513463,ARYKCQI1187FB3B18F,,,,eeacb319-8d4c-48e0-80a0-944e71c375bf,...,-5.271,1,0.756,285.605,150.062,4,0.931,TRAABVM128F92CA9DC,0,2004
3,TRAABXH128F42955D6,,611,0.635346,0.463478,ARD9UVF1187B9B17FE,,"Hawthorne, CA",,634fe78e-fc6b-4b2a-ba83-c8c66e13a8aa,...,-7.108,1,0.514,160.717,100.494,3,1.0,TRAABXH128F42955D6,0,1998
4,TRAACQE12903CC706C,,153505,0.583006,0.333922,ARDDIBO1187B9B0822,,,,7720a649-0c70-4c7a-972a-c29ccb898201,...,-5.033,1,0.453,156.973,118.43,4,0.61,TRAACQE12903CC706C,0,2007


---
### Million Song Dataset Features

We briefly explore how we can link the Lakh dataset with the original Million Song Dataset - to obtain more features that may be updated or developed.

In [51]:
msd = h5py.File(concat_path([DATA_PATH_PREFIX, MSD_SUMMARY]), 'r')

In [None]:
cnt = 0 # debug purposes

for i in tqdm(range(0, 1000000)):
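    # in the per-song files shown earlier, field 30 of analysis/songs is track_id,
    # which is the key used in match_scores
    msd_analysis_song_id = msd['analysis']['songs'][i][30].decode()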
    
    if msd_analysis_song_id in match_scores:
        cnt = cnt+1 # debug purposes
        

We expect this counter to be 31034, as many as there are entries in the filtered data. This indicates that we can join the Lakh data back to the Million Song Dataset files and extend it as needed. Note that the loop only emulates the join between the two datasets: in practice, instead of incrementing a counter, we would join the two datasets/dataframes on matching IDs.
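
The join that the counter stands in for can be sketched without any dataframe library: index one side by ID, then look up each row of the other side. The rows below are invented, and inner_join is a hypothetical helper. With dataframes, a merge on the ID column plays the same role.

```python
def inner_join(left_rows, right_rows, key):
    """Join two lists of dicts on a shared key, keeping only rows present in both."""
    right_by_key = {row[key]: row for row in right_rows}
    joined = []
    for row in left_rows:
        match = right_by_key.get(row[key])
        if match is not None:
            # values from the right side win if a column name occurs on both sides
            joined.append({**row, **match})
    return joined

# invented rows standing in for the Lakh data and the MSD summary
lakh_rows = [{'id': 'TR1', 'file': 'a.mid'}, {'id': 'TR2', 'file': 'b.mid'}]
msd_rows = [{'id': 'TR2', 'tempo': 120.0}, {'id': 'TR3', 'tempo': 98.5}]

result = inner_join(lakh_rows, msd_rows, 'id')
print(result)
```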

In [120]:
cnt

31034

In [68]:
msd.close()

In [54]:
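# NOTE: all_ids is not defined in the cells shown in this report; it is assumed to hold
# the byte-string track IDs read from the MSD summary file before the file was closed
all_ids_decoded = [ids.decode() for ids in all_ids]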

In [55]:
# we join the two lakh dataframes to get all the summary columns that describe the song,
# except the actual MIDI and song array data
df_joined = pandas.merge(midi2vec_df, df_all, how='inner', on = 'id')

The column descriptions and the meaning of the data are available here: http://millionsongdataset.com/pages/field-list/

In [58]:
pandas.set_option('display.max_columns', None)
df_joined.head()

Unnamed: 0,id,file,song_name,album_name,artist_name_x,artist_mb,tag_echo,tag_mbz,year_x,analyzer_version,artist_7digitalid,artist_familiarity,artist_hotttnesss,artist_id,artist_latitude,artist_location,artist_longitude,artist_mbid,artist_name_y,artist_playmeid,genre,idx_artist_terms,idx_similar_artists,release,release_7digitalid,song_hotttnesss,song_id,title,track_7digitalid,analysis_sample_rate,audio_md5,danceability,duration,end_of_fade_in,energy,idx_bars_confidence,idx_bars_start,idx_beats_confidence,idx_beats_start,idx_sections_confidence,idx_sections_start,idx_segments_confidence,idx_segments_loudness_max,idx_segments_loudness_max_time,idx_segments_loudness_start,idx_segments_pitches,idx_segments_start,idx_segments_timbre,idx_tatums_confidence,idx_tatums_start,key,key_confidence,loudness,mode,mode_confidence,start_of_fade_out,tempo,time_signature,time_signature_confidence,track_id,idx_artist_mbtags,year_y
0,TRAAAGR128F425B14B,\A\A\A\TRAAAGR128F425B14B\b97c529ab9ef783a849b...,Into The Nightlife,Bring Ya To The Brink,Cyndi Lauper,7bd9e20e-74b9-446a-a2ed-a223f82a36e7,new wave,classic pop and rock,2008,,11319,0.712886,0.559257,ARGE7G11187FB37E05,,"Brooklyn, NY",,7bd9e20e-74b9-446a-a2ed-a223f82a36e7,Cyndi Lauper,382,,0,0,Bring Ya To The Brink,279219,,SONRWUU12AF72A4283,Into The Nightlife,3110092,22050,7573fabe891b25bcd3c5866e4c5df1f0,0.0,240.63955,4.487,0.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,9,0.608,-7.322,0,0.495,240.64,123.989,4,0.8,TRAAAGR128F425B14B,0,2008
1,TRAAAZF12903CCCF6B,\A\A\A\TRAAAZF12903CCCF6B\05f21994c71a5f881e64...,Break My Stride,I Don't Speak The Language,Matthew Wilder,471e21ab-7a14-4190-a9d2-f95197616df4,pop rock,,1983,,93189,0.546102,0.383787,ARJJ8611187FB5321F,40.79086,"New York, NY [Manhattan]",-73.96644,471e21ab-7a14-4190-a9d2-f95197616df4,Matthew Wilder,36027,,0,0,I Don't Speak The Language,763937,,SOUCVHW12AB018E830,Break My Stride,8473798,22050,facaf1c26c48d98e6b20c54b4d02051b,0.0,184.47628,0.682,0.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,10,0.456,-11.137,1,0.442,167.607,110.129,4,0.711,TRAAAZF12903CCCF6B,0,1983
2,TRAABVM128F92CA9DC,\A\A\B\TRAABVM128F92CA9DC\39d6c288e1bd93d4705e...,Caught In A Dream,Gold,Tesla,eeacb319-8d4c-48e0-80a0-944e71c375bf,hard rock,,2004,,1396,0.7072,0.513463,ARYKCQI1187FB3B18F,,,,eeacb319-8d4c-48e0-80a0-944e71c375bf,Tesla,7536,,0,0,Gold,372309,0.684136,SOXLBJT12A8C140925,Caught In A Dream,4143071,22050,3e57f1f9670a3aa3bd8901e6eee32149,0.0,290.29832,0.145,0.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,7,0.725,-5.271,1,0.756,285.605,150.062,4,0.931,TRAABVM128F92CA9DC,0,2004
3,TRAABXH128F42955D6,\A\A\B\TRAABXH128F42955D6\04266ac849c1d3814dc0...,Keep An Eye On Summer (Album Version),Imagination,Brian Wilson,634fe78e-fc6b-4b2a-ba83-c8c66e13a8aa,chamber pop,classic pop and rock,1998,,611,0.635346,0.463478,ARD9UVF1187B9B17FE,,"Hawthorne, CA",,634fe78e-fc6b-4b2a-ba83-c8c66e13a8aa,Brian Wilson,2437,,0,0,Imagination,110308,,SOHXFBA12A8C13D637,Keep An Eye On Summer (Album Version),1140917,22050,5c745118da3ab07e825a71a74285317a,0.0,168.64608,0.0,0.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.482,-7.108,1,0.514,160.717,100.494,3,1.0,TRAABXH128F42955D6,0,1998
4,TRAACQE12903CC706C,\A\A\C\TRAACQE12903CC706C\f1be134b947dfde3eece...,Summer,Good Morning,Old Man River,7720a649-0c70-4c7a-972a-c29ccb898201,los angeles,,2007,,153505,0.583006,0.333922,ARDDIBO1187B9B0822,,,,7720a649-0c70-4c7a-972a-c29ccb898201,Old Man River,8923,,0,0,Good Morning,673706,,SOGUCAN12AB017BF99,Summer,7473946,22050,a296c6b70f0f6600bd0e4d93ad0c7648,0.0,165.40689,0.235,0.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,7,0.233,-5.033,1,0.453,156.973,118.43,4,0.61,TRAACQE12903CC706C,0,2007


In [34]:
df_joined.columns

Index(['id', 'file', 'song_name', 'album_name', 'artist_name_x', 'artist_mb',
       'tag_echo', 'tag_mbz', 'year_x', 'analyzer_version',
       'artist_7digitalid', 'artist_familiarity', 'artist_hotttnesss',
       'artist_id', 'artist_latitude', 'artist_location', 'artist_longitude',
       'artist_mbid', 'artist_name_y', 'artist_playmeid', 'genre',
       'idx_artist_terms', 'idx_similar_artists', 'release',
       'release_7digitalid', 'song_hotttnesss', 'song_id', 'title',
       'track_7digitalid', 'analysis_sample_rate', 'audio_md5', 'danceability',
       'duration', 'end_of_fade_in', 'energy', 'idx_bars_confidence',
       'idx_bars_start', 'idx_beats_confidence', 'idx_beats_start',
       'idx_sections_confidence', 'idx_sections_start',
       'idx_segments_confidence', 'idx_segments_loudness_max',
       'idx_segments_loudness_max_time', 'idx_segments_loudness_start',
       'idx_segments_pitches', 'idx_segments_start', 'idx_segments_timbre',
       'idx_tatums_confidence', 

In [42]:
df_joined.describe()

Unnamed: 0,year_x,artist_familiarity,artist_hotttnesss,artist_latitude,artist_longitude,song_hotttnesss,danceability,duration,end_of_fade_in,energy,key_confidence,loudness,mode_confidence,start_of_fade_out,tempo,time_signature_confidence
count,31034.0,31030.0,31034.0,10507.0,10507.0,15992.0,31034.0,31034.0,31034.0,31034.0,31034.0,31034.0,31034.0,31034.0,31034.0,31034.0
mean,1081.739093,0.592923,0.419907,39.852279,-50.331265,0.422795,0.0,257.0723,0.920535,0.0,0.492969,-10.347283,0.52128,247.430595,120.750582,0.500568
std,993.570353,0.161408,0.143433,15.486998,55.608187,0.263639,0.0,105.089549,2.728698,0.0,0.279201,5.339811,0.194721,103.539465,30.953603,0.370516
min,0.0,0.0,0.0,-45.8745,-157.85762,0.0,0.0,0.62649,0.0,0.0,0.0,-57.004,0.0,0.626,0.0,0.0
25%,0.0,0.497145,0.353248,34.15917,-89.40763,0.249066,0.0,194.82077,0.0,0.0,0.285,-12.61275,0.404,186.45,99.8905,0.10825
50%,1973.0,0.59538,0.410921,40.71455,-74.00712,0.444928,0.0,236.72118,0.235,0.0,0.52,-9.033,0.534,226.9025,122.514,0.528
75%,1999.0,0.70057,0.501856,50.91552,-0.38049,0.627124,0.0,295.60118,0.573,0.0,0.702,-6.689,0.656,284.4895,136.89475,0.849
max,2010.0,1.0,1.082503,69.65102,175.47131,1.0,0.0,2149.32853,300.588,0.0,1.0,1.244,1.0,2149.329,262.183,1.0


Unfortunately, some *interesting* columns such as danceability or energy are all zero. However, these are derived features, and if they are added or updated in the source data, they are straightforward to include through the data processing pipeline demonstrated above, as long as the datasets share a common song ID (whether from the Million Song Dataset or one of the other IDs available in the dataset, that is, the Echo Nest track ID or the 7digital.com ID). 
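
Columns that carry no information, like the two mentioned above, can be detected by checking whether a column holds a single distinct value. The sketch below works on a list of row dictionaries with a few illustrative values, and constant_columns is a hypothetical helper.

```python
def constant_columns(rows):
    """Return the sorted names of columns whose value is identical in every row."""
    if not rows:
        return []
    return sorted(
        name for name in rows[0]
        if len({row[name] for row in rows}) == 1
    )

# illustrative rows; danceability and energy are zero throughout, as in the summary above
rows = [
    {'tempo': 123.989, 'danceability': 0.0, 'energy': 0.0},
    {'tempo': 110.129, 'danceability': 0.0, 'energy': 0.0},
    {'tempo': 150.062, 'danceability': 0.0, 'energy': 0.0},
]

print(constant_columns(rows))
```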

As Echo Nest (https://en.wikipedia.org/wiki/The_Echo_Nest) is owned by Spotify, an interesting extension of the dataset would be to query the Spotify API for the available metadata. However, it is unclear whether the license Spotify provides for app development permits this, as the data sits behind a possibly metered and monitored API linked to the user, e.g. https://developer.spotify.com/documentation/web-api/ and https://spotipy.readthedocs.io/en/2.19.0/.

For now, we are interested in one of the missing columns, danceability, and whether it is available in the MSD summary dataset. 

In [41]:
test_msd = pandas.HDFStore(concat_path([DATA_PATH_PREFIX, MSD_SUMMARY]), 'r')

In [42]:
danceability_test = test_msd.root.analysis.songs.col('danceability')

In [43]:
max(danceability_test)-min(danceability_test)

0.0

Unfortunately, this indicates that the values are all zero, that is, the feature is not included. If a new release comes out, however, extending the data processing pipeline is straightforward.

Overall, we nevertheless have a dataset with MIDI files and artist and song metadata that allows a variety of analyses.