<a href="https://colab.research.google.com/github/stefanocostantini/music_language_model/blob/setting-up/music_lang_model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [0]:
# Imports

# __future__ imports must come before any other import in the cell
from __future__ import division

import math

import boto3
import pandas as pd
import progressbar

In [34]:
# Load S3 access keys (need to manually add file)
from google.colab import files
uploaded = files.upload()

In [0]:
# Read access keys
keys = pd.read_csv("colab_accessKeys.csv")
access = keys.iloc[0,0]
secret = keys.iloc[0,1]

In [0]:
# Read from S3
bucket = 'stefano-colab-data'
key = 'train/2186.csv'

s3 = boto3.client('s3', aws_access_key_id=access, aws_secret_access_key=secret)
read_file = s3.get_object(Bucket=bucket, Key=key)
df = pd.read_csv(read_file['Body'], sep=',')

In [204]:
df.head(5)

Unnamed: 0,start_time,end_time,instrument,note,start_beat,end_beat,note_value
0,45534,55262,41,88,0.5,0.239583,Sixteenth
1,55774,60893,41,87,0.75,0.239583,Sixteenth
2,61406,71646,41,88,1.0,0.489583,Eighth
3,71646,83422,41,83,1.5,0.489583,Eighth
4,83934,96222,41,80,2.0,0.489583,Eighth


There are various ways to turn a music transcript into strings, and we will need to experiment. First, however, we group all the notes played at each beat of the piece. The shortest note duration becomes the beat unit of the piece. For example, if 1/16 (0.0625) is the shortest duration (the minimum of `end_beat`), we step through the piece in increments of that unit from start to finish and, at each step, collect in lists the notes sounding at that moment and the instruments playing them.
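To make the grouping concrete, here is a minimal sketch on a toy transcript. The three rows and their values are made up for illustration and do not come from MusicNet; as in the function defined next, `end_beat` is treated as a duration, so a note ends at `start_beat + end_beat`.

```python
import math
import pandas as pd

# Toy transcript (hypothetical values, not taken from MusicNet)
toy = pd.DataFrame({
    "start_beat": [0.0, 0.25, 0.5],
    "end_beat": [0.5, 0.25, 0.25],   # treated as a duration in beats
    "instrument": [41, 41, 1],
    "note": [60, 64, 67],
})

min_unit = toy["end_beat"].min()                       # shortest duration
note_end = toy["start_beat"] + toy["end_beat"]         # beat at which each note ends
n_units = int(math.ceil(note_end.max() / min_unit))    # number of grid steps

grid = []
for b in range(n_units):
    t = float(b * min_unit)
    # A note counts as sounding if it has started and has not ended before t
    sounding = toy[(toy["start_beat"] <= t) & (note_end >= t)]
    grid.append((t, sounding["note"].tolist(), sounding["instrument"].tolist()))

for row in grid:
    print(row)
```

Because both comparisons are inclusive, a note that ends exactly on a grid step is still listed at that step. Whether that is the desired behaviour is one of the things to experiment with.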

In [0]:
# Function to extract notes and instruments by beat
def beat_extractor(df):
  """
  Given a dataframe (one label file from the MusicNet dataset), identify
  the minimum beat unit and, for each unit, build two lists:

  - Notes sounding at that beat (either starting there or held over from
    an earlier beat)
  - Instruments playing them

  Returns a dataframe with columns "beat", "notes" and "instruments".
  """
  # Find beat unit and max beat. `end_beat` is treated as a duration,
  # so a note ends at start_beat + end_beat.
  min_unit = df['end_beat'].min()
  end_beat_total = df['start_beat'] + df['end_beat']
  max_beat = int(math.ceil(end_beat_total.max() / min_unit))

  # Extract notes and instruments, collecting one row per beat unit
  rows = []
  with progressbar.ProgressBar(max_value=max_beat) as bar:
    for b in range(max_beat):
      beat = b * min_unit
      df_beat = df[(df['start_beat'] <= beat) & (end_beat_total >= beat)]
      df_beat_sorted = df_beat.sort_values(by='start_beat')
      rows.append({'beat': beat,
                   'notes': df_beat_sorted['note'].tolist(),
                   'instruments': df_beat_sorted['instrument'].tolist()})
      bar.update(b)

  # Build the result in one go; DataFrame.append was removed in pandas 2.0
  # and appending row by row inside a loop is slow
  return pd.DataFrame(rows, columns=["beat", "notes", "instruments"])


In [206]:
c = beat_extractor(df)

100% (3614 of 3614) |####################| Elapsed Time: 0:00:00 ETA:  00:00:00


In [207]:
c.tail()

Unnamed: 0,beat,notes,instruments
3609,413.53125,[88],[41]
3610,413.645833,[88],[41]
3611,413.760417,[88],[41]
3612,413.875,[88],[41]
3613,413.989583,[88],[41]


In [0]:
# Next steps
# - write a function that turns the above dataframe into a string of text
# - apply it to all files in train to create text corpus (one document per element)
# - save again to S3 for re-use

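# Write CSV
# from io import StringIO
# csv_buffer = StringIO()
# df.to_csv(csv_buffer)
# out_key is the destination key, still to be chosen (boto3 requires keyword arguments)
# s3.put_object(Bucket=bucket, Key=out_key, Body=csv_buffer.getvalue())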
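
As a starting point for the first of the next steps above, here is one possible sketch of turning the per-beat dataframe into a string. The function name `beats_to_text`, the `n<pitch>` token format, the `+` separator for simultaneous notes and the `rest` token are all assumptions made for illustration; the encoding itself is exactly what still needs experimenting with. The sketch ignores the `instruments` column.

```python
import pandas as pd

def beats_to_text(df_beats, rest_token="rest"):
    """
    Hypothetical encoding: each beat becomes one word. Notes sounding
    together are sorted by pitch and joined with '+'; a beat with no
    notes becomes `rest_token`. Words are separated by spaces.
    """
    words = []
    for notes in df_beats["notes"]:
        if notes:
            words.append("+".join("n{}".format(n) for n in sorted(notes)))
        else:
            words.append(rest_token)
    return " ".join(words)

# Small hand-made frame with the same columns that beat_extractor returns
example = pd.DataFrame({
    "beat": [0.0, 0.25, 0.5],
    "notes": [[88], [83, 80], []],
    "instruments": [[41], [41, 41], []],
})

text = beats_to_text(example)
print(text)
```

Sorting the notes within a beat makes the same chord always map to the same word, which keeps the vocabulary smaller. Treating each note as its own token would be an alternative worth comparing.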
