<a href="https://colab.research.google.com/github/stefanocostantini/music_language_model/blob/first-stab/language_model_first_try.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Music language model: first attempt

In this notebook we try a simple language model, trained on our training dataset. The plan is as follows:

- Load the datasets from S3
- Prepare the data by creating sequences of a pre-determined number of "words" (i.e. groups of notes, pre-padded at the start of each new piece) and trying to predict the next "word". When the sentence finishes, still try to predict the last "word" with fewer inputs
- Set up an LSTM-based language model (simple to start with) and train it
- Extract the encoding layer(s) and use them for classification, trying to predict the `composer` from the "words" in the piece

### Setup and load datasets

In [0]:
# Imports
import pandas as pd
import numpy as np
import boto3
from keras.preprocessing.text import Tokenizer
from keras.utils import to_categorical
from keras.preprocessing.sequence import pad_sequences

In [0]:
# Parameters
sequence_length = 15 # Length of encoded sequences to use in model

In [2]:
# Load S3 access keys (need to manually add file)
from google.colab import files
uploaded = files.upload()

Saving colab_accessKeys.csv to colab_accessKeys.csv


In [0]:
# Read access keys
keys = pd.read_csv("colab_accessKeys.csv")
access = keys.iloc[0,0]
secret = keys.iloc[0,1]

In [0]:
# Set up connection to S3
s3 = boto3.client('s3', aws_access_key_id=access, aws_secret_access_key=secret)
bucket = 'stefano-colab-data'

In [0]:
# Read train and test datasets
key_prefix = 'datasets/'

read_train = s3.get_object(Bucket=bucket, Key=key_prefix + "training.csv")
train_raw = pd.read_csv(read_train['Body'], sep=',')

read_test = s3.get_object(Bucket=bucket, Key=key_prefix + "test.csv")
test_raw = pd.read_csv(read_test['Body'], sep = ',')

In [6]:
train_raw.head(3)

Unnamed: 0,id,text,composer,composition,movement,ensemble,source,transcriber,catalog_name,seconds
0,1727,41+53+65+69+81 41+53+65+69+81 41+53+65+69+81 4...,Schubert,Piano Quintet in A major,2. Andante,Piano Quintet,European Archive,http://tirolmusic.blogspot.com/,OP114,447
1,1728,69+81 69+81 73+85 73+85 74+86 74+86 45+45+45+4...,Schubert,Piano Quintet in A major,3. Scherzo: Presto,Piano Quintet,European Archive,http://tirolmusic.blogspot.com/,OP114,251
2,1729,69 69 69 69 69 69 69 69 38+57+62+66+74 38+57+6...,Schubert,Piano Quintet in A major,4. Andantino - Allegretto,Piano Quintet,European Archive,http://tirolmusic.blogspot.com/,OP114,444


### Prepare data for model

#### _Tokenize words and convert strings into integer sequences_

In [0]:
# First we collect all the strings of the train and test datasets in two lists
# ('texts_train' and 'texts_test'). We also collect the composer labels (to be
# used later)
texts_train = train_raw['text'].tolist()
labels_train = train_raw['composer'].tolist()

texts_test = test_raw['text'].tolist()
labels_test = test_raw['composer'].tolist()

In [0]:
# We now need to convert the sequences of "words" into sequences of integers.
# We do this by using the Keras tokenizer. The default includes the '+' sign as
# one of the filters to be removed. This is not appropriate here as "words" are
# defined as 'note1+note2+...'. So we remove the '+' sign from the filter when
# instantiating this class.
tk = Tokenizer(filters='!"#$%&()*,-./:;<=>?@[\\]^_`{|}~\t\n')
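
To see why the `+` sign has to stay out of the filter, here is a minimal sketch in plain Python that mimics the filter-and-split step (replace every filtered character with a space, then split on whitespace). It is a simplified stand-in rather than the Keras code itself, and `split_words` is a hypothetical helper introduced only for this illustration.

```python
# Simplified stand-in for the tokenizer's filter-and-split step:
# every character in `filters` is replaced by a space, then the text is split.
DEFAULT_FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'  # Keras default, includes '+'
CUSTOM_FILTERS = DEFAULT_FILTERS.replace('+', '')          # same set without '+'

def split_words(text, filters):
    """Hypothetical helper: replace filtered characters with spaces and split."""
    table = str.maketrans({ch: ' ' for ch in filters})
    return text.lower().translate(table).split()

sample = "41+53+65 69+81 69"
with_default = split_words(sample, DEFAULT_FILTERS)  # chords broken into single notes
with_custom = split_words(sample, CUSTOM_FILTERS)    # chords kept as single "words"
print(with_default)
print(with_custom)
```

With the default filter the chord `41+53+65` is split into three separate notes, so the information about which notes sound together is lost. With the custom filter each chord stays a single "word".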

In [0]:
# To make sure we have a consistent dictionary, we fit the tokenizer on the
# entire corpus (train and test)
texts_all = texts_train + texts_test
tk.fit_on_texts(texts_all)

In [0]:
# Now let's convert the texts into sequences of integers
sequences_train = tk.texts_to_sequences(texts_train)
sequences_test = tk.texts_to_sequences(texts_test)

In [0]:
# We then determine the size of the vocabulary, which will be needed to specify
# the size of the word embedding layer(s) of the model and for encoding output
# words using one-hot encoding
vocab_size = len(tk.word_index) + 1 # +1 because word indices start at 1 (index 0 is reserved)

#### _Create sequences and target "words" for language model_

We want to create sequences of a pre-determined number of words, from which the model will learn what the word after the sequence should be. The length of the sequence is a parameter (`sequence_length`).

When extracting a sequence, we also extract one extra word at the end, which we later split off as the target word for that sequence.

Sequences for the model are defined only within each piece, as it would not make sense to predict the first notes of a new piece using the last notes of another (_we may test this in future_).
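
As a small illustration of the windowing described above, the sketch below builds the windows for two toy "pieces" with a sequence length of 2. The helper name `sliding_windows` and the toy data are assumptions made for this example only.

```python
def sliding_windows(sequence, window):
    """Hypothetical helper: every run of `window` input words plus 1 target word."""
    return [sequence[i - window:i + 1] for i in range(window, len(sequence))]

piece_a = [1, 2, 3, 4, 5]
piece_b = [7, 8, 9]

# Windows are built piece by piece, so none of them spans two pieces.
windows = []
for piece in (piece_a, piece_b):
    windows.extend(sliding_windows(piece, window=2))

# The last element of each window is the target, the rest are the inputs.
for w in windows:
    print(w[:-1], "->", w[-1])
```

Note that no window mixes the end of `piece_a` with the start of `piece_b`: a window such as `[4, 5, 7]` is never produced.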

In [0]:
def build_model_sequences(orig_sequences, max_sequence_length):
  """
  Given sequences of text, for each it builds model sequences of words of the
  desired length (+1 for the target word). No model sequence is built using words
  from more than one of the original sequences.
  """
  # Collect the model sequences from all the original sequences in a single list
  model_sequences = list()
  for sequence in orig_sequences:
    for i in range(max_sequence_length, len(sequence)):
      seq = sequence[i-max_sequence_length:i+1]
      model_sequences.append(seq)
  return model_sequences
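
The function above only emits a window once `max_sequence_length` words are available, so the first words of each piece are never used as targets. The plan at the top mentions pre-padding for this case. A minimal sketch of one possible way to do it is shown below. It assumes `0` as the padding value (the tokenizer never assigns index 0 to a word), and `padded_windows` is a hypothetical helper, not part of the notebook's pipeline.

```python
def padded_windows(sequence, window, pad_id=0):
    """Hypothetical helper: one left-padded window per target word after the first."""
    result = []
    for i in range(1, len(sequence)):
        context = sequence[max(0, i - window):i]      # up to `window` previous words
        padding = [pad_id] * (window - len(context))  # left-pad short contexts
        result.append(padding + context + [sequence[i]])
    return result

toy_piece = [5, 6, 7, 8]
toy_windows = padded_windows(toy_piece, window=3)
for w in toy_windows:
    print(w[:-1], "->", w[-1])
```

Every window has the same length (`window + 1`), so the result can still be stacked into a single NumPy array.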

In [0]:
# Applying the function above we construct sequences of a chosen length.
train_all = build_model_sequences(sequences_train, sequence_length)
test_all = build_model_sequences(sequences_test, sequence_length)

In [0]:
# The last step involves separating the actual sequence from the target word
# in each case and one-hot encoding the target word
train_all = np.array(train_all)
test_all = np.array(test_all)
X_train, y_train = train_all[:,:-1], train_all[:,-1]
X_test, y_test = test_all[:,:-1], test_all[:,-1]

y_train = to_categorical(y_train, num_classes=vocab_size)
y_test = to_categorical(y_test, num_classes=vocab_size)
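
To make the split and the one-hot encoding concrete, here is a small NumPy-only sketch on toy data. It uses `np.eye` as a stand-in for `to_categorical`, and all the `toy_` names are assumptions for this illustration.

```python
import numpy as np

# Three toy windows: two input words followed by one target word
toy_windows = np.array([[1, 2, 3],
                        [2, 3, 4],
                        [3, 4, 5]])
toy_vocab_size = 6  # 5 distinct words + 1, because index 0 is never assigned to a word

# All columns but the last are inputs, the last column is the target
X_toy, y_toy = toy_windows[:, :-1], toy_windows[:, -1]

# np.eye(n)[y] picks row y of the identity matrix, i.e. the one-hot vector for y
y_toy_onehot = np.eye(toy_vocab_size, dtype="float32")[y_toy]

print(X_toy.shape, y_toy_onehot.shape)
```

The one-hot target array has one column per vocabulary entry, so its size grows as (number of sequences) x `vocab_size`. If that becomes too large for memory, one option is to keep the integer targets and train with a sparse categorical cross-entropy loss instead.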

### Set up language model and train, evaluating performance

### Extract encoding from language model and train a classifier to identify the composer

### Train classifier directly on the composer labels and the same featurised data to compare performance with the previous approach