<a href="https://colab.research.google.com/github/stefanocostantini/music_language_model/blob/setting-up/music_lang_initial_data_prep.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [0]:
# Imports

import pandas as pd
import boto3
from __future__ import division
import math
import progressbar

In [2]:
# Load S3 access keys (need to manually add file)
from google.colab import files
uploaded = files.upload()

Saving colab_accessKeys.csv to colab_accessKeys.csv


In [0]:
# Read access keys
keys = pd.read_csv("colab_accessKeys.csv")
access = keys.iloc[0,0]
secret = keys.iloc[0,1]

In [0]:
# Set up connection to S3
s3 = boto3.client('s3', aws_access_key_id=access,aws_secret_access_key=secret)
bucket = 'stefano-colab-data'

In [0]:
# Function to get file names in bucket
def get_file_names(bucket, prefix):
  file_names = list()
  paginator = s3.get_paginator("list_objects_v2")
  kwargs = {'Bucket': bucket, 'Prefix': prefix}
  for page in paginator.paginate(**kwargs):
    contents = page["Contents"]
    contents.pop(0) # remove bucket name
    for obj in contents:
          file_names.append(obj["Key"])
  return file_names   

There are various ways to turn a music transcript into strings. We need to experiment. First, however, we group all the notes played at each of the beats making up the piece. The smallest duration becomes the beats unit of the piece. So, for example, if 1/16 (or 0.0625) is the shortest duration (measured by end_beat) then we will look at each of those beats from start to finish and collect in lists the notes sounding at that moment (and the instruments playing them)

In [0]:
# Now for each row of the dataframe (i.e. for each note) we 
# find the array of beats during which the note was "sounding". Each, a note that
# start at beat 1.50 and ends at beat 2.0, where the minimum beat is 0.25, will have
# the following array of beats [1.50, 1.75, 2.0]. 
#
# After that, we explode the dataframe so that we have a line for each separate beat
# and then group it by beat, so that all the note that "sound" during that beat are 
# grouped. 
#
# Finally, we convert that into a string of text, which will be an input into the language model

In [0]:
# Function to find the beat units within a beats interval
def find_beats(a,b, min_unit):
  beats = [round((a - (a % min_unit)) + min_unit * x, 6) for x in range(1, int(round((b-a) / min_unit,0)+1))]
  return beats

In [0]:
# Function that adds the beats column as array of beats
def add_beats(df):
  min_unit = df.end_beat.min()
  df['beats'] = df.apply(lambda x: find_beats(x.start_beat, x.final_beat, min_unit), axis=1)
  return df

In [0]:
# Function to convert transcript into linear text
def transcript_to_text(df):
  df = add_beats(df)
  df_expl = df.explode('beats').sort_values(["beats", "note"], ascending = (True, True))
  df_expl['notes'] = df_expl['note'].apply(str)
  df_grouped_notes = pd.DataFrame(df_expl.groupby('beats')['notes'].agg('+'.join)).reset_index()
  text = " ".join(list(df_grouped_notes))
  return text

In [0]:
# Function to load files from S3, turn trascript into text and append to dataframe
def transcripts_to_df(bucket, file_names):
  column_names = ["ID", "text"]
  data = pd.DataFrame(columns = column_names)
  for file in file_names:
    read_file = s3.get_object(Bucket=bucket, Key=file)
    df = pd.read_csv(read_file['Body'], sep=',')
    df['final_beat'] = df['start_beat'] + df['end_beat']
    text = transcript_to_text(df)
    file_id = file.rsplit('/', 1)[1].rsplit(".", 1)[0]
    data = data.append({'ID' : file_id , 'text' : text} , ignore_index=True)
  return data

In [102]:
# First get the file names on both folders
files_train = get_file_names(bucket, "train")
files_test = get_file_names(bucket, "test")
print(len(files_train), len(files_test))

320 10


In [0]:
# Then we convert each trascript into a string of text and combine everything
# into a single dataframe
data_train = transcripts_to_df(bucket, files_train)

In [106]:
data_train

Unnamed: 0,ID,text
0,1727,beats notes
1,1728,beats notes
2,1729,beats notes
3,1730,beats notes
4,1733,beats notes
...,...,...
315,2632,beats notes
316,2633,beats notes
317,2659,beats notes
318,2677,beats notes


In [0]:
# NEED TO FIX FUNCTION ABOVE!

In [0]:
# Load MusicNet metadata, and join with above to create two complete 
# datasets (train and test)

In [0]:
# Save datasets to S3

In [0]:
# Next steps
# - write a function that turns the above dataframe into a string of text
# - apply it to all files in train to create text corpus (one document per element)
# - save again to S3 for re-use

# Write CSV
# csv_buffer = StringIO()
# df.to_csv(csv_buffer)
# s3.put_object(Bucket, Key,Body=csv_buffer.getvalue())
