# Music Generation with RNNs

We will explore the application of Recurrent Neural Networks (RNNs) to music generation. We will train a model to learn the patterns in raw sheet music written in ABC notation and then use the model to generate new music.




# Imports

In [None]:
# Import Tensorflow 2.0
%tensorflow_version 2.x
import tensorflow as tf 

# Download and import the MIT 6.S191 package
!pip install mitdeeplearning
import mitdeeplearning as mdl

# Import all remaining packages
import numpy as np
import os
import time
import functools
from IPython import display as ipythondisplay
from tqdm import tqdm
!apt-get install abcmidi timidity > /dev/null 2>&1

# Check that we are using a GPU, if not switch runtimes
#   using Runtime > Change Runtime Type > GPU
assert len(tf.config.list_physical_devices('GPU')) > 0

# Load Dataset

In [None]:
# Download the dataset
songs = mdl.lab1.load_training_data()

In [None]:
# Print one of the songs to inspect it in more detail
example_song = songs[0]
print("\nExample song:")
print(example_song)

In [None]:
# Convert the ABC notation to audio file and listen to it
mdl.lab1.play_song(example_song)

N.B. The music notation used in this dataset does not only describe the notes being played; it also contains metadata such as
* song title
* song key
* song tempo

This constrains the numerical representation of the text data: the vocabulary must cover the metadata characters as well as the notes.
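
To see what this means in practice, here is a minimal sketch that separates the header fields of an ABC tune from its note body. The tune is a made-up example written for this sketch (it is not taken from the dataset), and `split_abc` is a hypothetical helper, not part of `mitdeeplearning`. It assumes the usual ABC convention of one-letter header fields such as `T:` (title) and `K:` (key), with `K:` closing the header.

```python
# Made-up tune for illustration only; it is not taken from the dataset.
toy_tune = """X:1
T:Toy Reel
M:4/4
L:1/8
K:D
DFAd fdAF|G2BG E2CE|"""

def split_abc(tune):
    """Split an ABC tune into a dict of header fields and the note body."""
    header, body = {}, []
    in_header = True
    for line in tune.splitlines():
        is_field = len(line) >= 2 and line[0].isalpha() and line[1] == ":"
        if in_header and is_field:
            header[line[0]] = line[2:].strip()
            if line[0] == "K":  # the key field closes the header
                in_header = False
        else:
            in_header = False
            body.append(line)
    return header, "\n".join(body)

toy_header, toy_body = split_abc(toy_tune)
print(toy_header)
print(toy_body)

# The character vocabulary covers metadata characters too, e.g. ":" and "\n".
toy_tune_vocab = sorted(set(toy_tune))
print(toy_tune_vocab)
```

The model in this lab never sees this split: it learns from the raw text, header included, which is why the vocabulary has to contain every character that appears in either part.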

In [None]:
# Join our list of song strings into a single string containing all songs
songs_joined = "\n\n".join(songs) 

# Find all unique characters in the joined string
# N.B. -> "set" builds an unordered collection of unique characters
# N.B. -> "sorted" returns them as a list in ascending order
vocab = sorted(set(songs_joined))
print('There are', len(vocab), "unique characters in the dataset")


# Dataset pre-processing

We want to train an RNN to learn patterns in ABC music and generate new music based on what it has learned.

From a neural network model point of view this means: 

*Given a character or a sequence of characters, which is the most probable next character?*

What we have to do is:

* input a sequence of characters to the model
* train the model to predict the output, i.e. the following character at each time step.

N.B. RNNs maintain an internal state that depends only on previously seen elements, so information about the characters seen up to a given moment is taken into account when making a prediction.

## Text vectorization

We have already created the "alphabet" of characters contained in the dataset.

Now we need to define a one-to-one mapping from characters to numbers and vice versa so that the network can work with the data.

In [None]:
# Define numerical representation of the text
# Create a mapping from character to unique index.
# For example, to get the index of the character "d", 
#   we can evaluate `char2idx["d"]`. 
char2idx = {u:i for i, u in enumerate(vocab)}

# Create a mapping from indices to characters. This is
#   the inverse of char2idx and allows us to convert back
#   from unique index to the character in our vocabulary.
idx2char = np.array(vocab)


In [None]:
char2idx

In [None]:
### Vectorize the songs string ###. FILL THE CODE

'''FILL THE CODE: Write a function to convert the string of all songs to a vectorized
    (i.e., numeric) representation. Use the appropriate mapping
    above to convert from vocab characters to the corresponding indices.

  NOTE: the output of the `vectorize_string` function 
  should be a np.array with `N` elements, where `N` is
  the number of characters in the input string

'''
# def vectorize_string(string):
  # FILL THE CODE

vectorized_songs = vectorize_string(songs_joined)

We can also look at how the first part of the text is mapped to an integer representation:

In [None]:
print ('{} ---- characters mapped to int ----> {}'.format(repr(songs_joined[:10]), vectorized_songs[:10]))
# check that vectorized_songs is a numpy array
assert isinstance(vectorized_songs, np.ndarray), "returned result should be a numpy array"
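
As a sanity check on the idea of a one-to-one mapping, the following sketch encodes a toy string to indices and decodes it back. Everything here is an assumption made for illustration: the string is made up, and `encode` and `decode` are hypothetical helper names that work on their own toy vocabulary rather than on `char2idx` and `idx2char`.

```python
import numpy as np

toy_text = "X:1\nK:D\nDFAd"  # made-up snippet, not taken from the dataset
toy_vocab = sorted(set(toy_text))
toy_char2idx = {ch: i for i, ch in enumerate(toy_vocab)}
toy_idx2char = np.array(toy_vocab)

def encode(text):
    """Map each character to its integer index."""
    return np.array([toy_char2idx[ch] for ch in text])

def decode(indices):
    """Map integer indices back to characters and join them."""
    return "".join(toy_idx2char[indices])

encoded = encode(toy_text)
print(encoded)
print(repr(decode(encoded)))
```

Because the mapping is one-to-one, decoding the encoded array recovers the original string exactly, and the array has one element per input character.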

## Create training examples and targets

We cannot feed the data into the network as it is: we need to divide the text into example sequences, each with an input and an output target of `seq_length` characters.

Since we want to predict the next character, the target for each input is a sequence of the same length, shifted one character to the right.

For example, if `seq_length` is 4 and the text is "Hello", then we have
* input_sequence = "Hell"
* output_sequence = "ello"

Therefore, we will break our text into chunks of `seq_length+1` characters.
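
The "Hello" example can be reproduced in a few lines of plain Python. This is only a sketch of the shifting idea on a string; `split_input_target` is a hypothetical helper name, and the batch function used in this notebook works on the vectorized array instead.

```python
def split_input_target(chunk):
    """Split a chunk of seq_length+1 characters into input and target."""
    return chunk[:-1], chunk[1:]

toy_seq_length = 4
toy_chunk = "Hello"[: toy_seq_length + 1]
input_sequence, output_sequence = split_input_target(toy_chunk)
print(input_sequence, output_sequence)  # Hell ello

# At every position the target is the character that follows the input.
for inp, tgt in zip(input_sequence, output_sequence):
    print(repr(inp), "->", repr(tgt))
```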


In [None]:
### Batch definition to create training examples ###

def get_batch(vectorized_songs, seq_length, batch_size):
  # index of the last character in the vectorized songs string
  n = vectorized_songs.shape[0] - 1
  # randomly choose the starting indices for the examples in the training batch
  idx = np.random.choice(n-seq_length, batch_size)

  # construct a list of input sequences for the training batch
  input_batch = [vectorized_songs[i : i+seq_length] for i in idx]
  
  #construct a list of output sequences for the training batch
  output_batch = [vectorized_songs[i+1 : i+seq_length+1] for i in idx]

  # x_batch, y_batch provide the true inputs and targets for network training
  x_batch = np.reshape(input_batch, [batch_size, seq_length])
  y_batch = np.reshape(output_batch, [batch_size, seq_length])
  return x_batch, y_batch


# Perform some simple tests to make sure your batch function is working properly! 
test_args = (vectorized_songs, 10, 2)
if not mdl.lab1.test_batch_func_types(get_batch, test_args) or \
   not mdl.lab1.test_batch_func_shapes(get_batch, test_args) or \
   not mdl.lab1.test_batch_func_next_step(get_batch, test_args): 
   print("======\n[FAIL] could not pass tests")
else: 
   print("======\n[PASS] passed all tests!")

For each of these vectors, each index is processed at a single time step. So, for the input at time step 0, the model receives the index for the first character in the sequence, and tries to predict the index of the next character. At the next timestep, it does the same thing, but the RNN considers the information from the previous step, i.e., its updated state, in addition to the current input.

We can make this concrete by taking a look at how this works over the first several characters in our text:

In [None]:
x_batch, y_batch = get_batch(vectorized_songs, seq_length=5, batch_size=1)
for i, (input_idx, target_idx) in enumerate(zip(np.squeeze(x_batch), np.squeeze(y_batch))):
    print("Step {:3d}".format(i))
    print("  input: {} ({:s})".format(input_idx, repr(idx2char[input_idx])))
    print("  expected output: {} ({:s})".format(target_idx, repr(idx2char[target_idx])))

# RNN Model

Now we can define and train an RNN model on the ABC dataset, and then use it to generate new songs.

We'll use batches of song snippets as defined in the previous section.

We use an LSTM architecture:

* The state vector maintains information about the temporal relationships between consecutive characters.

* The output of the LSTM at each time step is fed into a fully connected [`Dense`](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Dense) layer, whose outputs define a [softmax](https://deepai.org/machine-learning-glossary-and-terms/softmax-layer) distribution over the characters in the vocabulary.

* We sample from the output distribution to predict the next character.

To build the model we will use the [`tf.keras.Sequential`](https://www.tensorflow.org/api_docs/python/tf/keras/Sequential) model, as seen in the previous lab.

We will use three types of layers:

* [`tf.keras.layers.Embedding`](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Embedding): This is the input layer, consisting of a trainable lookup table that maps the index of each character to a vector with `embedding_dim` dimensions.
* [`tf.keras.layers.LSTM`](https://www.tensorflow.org/api_docs/python/tf/keras/layers/LSTM): Our LSTM network, with size `units=rnn_units`. 
* [`tf.keras.layers.Dense`](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Dense): The output layer, with `vocab_size` outputs.

<img src="https://raw.githubusercontent.com/aamini/introtodeeplearning/2019/lab1/img/lstm_unrolled-01-01.png" alt="Drawing"/>
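
To make the shapes concrete, here is a small NumPy sketch of the same three stages: an embedding lookup, an LSTM unrolled over time, and a dense projection to vocabulary size. It uses the standard LSTM equations with random, untrained weights and tiny made-up sizes. It is not how Keras implements these layers internally, and all the `toy_` names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
toy_vocab_size, toy_embedding_dim, toy_rnn_units = 5, 3, 4
toy_batch, toy_steps = 2, 6

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Randomly initialised parameters (a trained model would learn these).
embedding = rng.normal(size=(toy_vocab_size, toy_embedding_dim))
W = rng.normal(size=(toy_embedding_dim + toy_rnn_units, 4 * toy_rnn_units))
b = np.zeros(4 * toy_rnn_units)
W_out = rng.normal(size=(toy_rnn_units, toy_vocab_size))
b_out = np.zeros(toy_vocab_size)

# Stage 1: embedding lookup turns indices into dense vectors.
x = rng.integers(0, toy_vocab_size, size=(toy_batch, toy_steps))
embedded = embedding[x]                     # (batch, steps, embedding_dim)

# Stage 2: LSTM unrolled over time; h and c carry the state between steps.
h = np.zeros((toy_batch, toy_rnn_units))    # hidden state
c = np.zeros((toy_batch, toy_rnn_units))    # cell state
outputs = []
for t in range(toy_steps):
    z = np.concatenate([embedded[:, t], h], axis=1) @ W + b
    i, f, g, o = np.split(z, 4, axis=1)     # input, forget, candidate, output
    c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
    h = sigmoid(o) * np.tanh(c)
    outputs.append(h)                       # keep every step's output
lstm_out = np.stack(outputs, axis=1)        # (batch, steps, rnn_units)

# Stage 3: dense projection to one score (logit) per vocabulary character.
logits = lstm_out @ W_out + b_out           # (batch, steps, vocab_size)
print(embedded.shape, lstm_out.shape, logits.shape)  # (2, 6, 3) (2, 6, 4) (2, 6, 5)
```

Keeping the output of every time step corresponds to `return_sequences=True` in the Keras layer, which is what lets the model predict a next character at each position of the input.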

## Define model: FILL THE CODE

In [None]:
# define the LSTM layer
def LSTM(rnn_units):
  return tf.keras.layers.LSTM(
      rnn_units, # dimensionality of the output space
      return_sequences=True, # return the full output sequence, not only the last output
      recurrent_initializer='glorot_uniform',
      recurrent_activation = 'sigmoid',
      stateful = True, # last state for each sample at index i in a batch will 
                       # be used as initial state for the sample of index i in the following batch.
    )

In [None]:
# Define RNN model using tf.keras.Sequential: FILL THE CODE

def build_model(vocab_size, embedding_dim, rnn_units, batch_size):

  """FILL THE CODE
   Build a sequential model consisting of:
   - Embedding layer
   - LSTM layer
   - Fully connected layer
  """

  #model = ...'''TODO'''...
    # Layer 1: Embedding layer to transform indices into dense vectors 
    #   of a fixed embedding size
    #...'''TODO'''...

    # Layer 2: LSTM with `rnn_units` number of units. 
    # FILL THE CODE: Call the LSTM function defined above to add this layer.
    #...'''TODO'''...

    # Layer 3: Dense (fully-connected) layer that transforms the LSTM output
    #   into the vocabulary size. 
    # FILL THE CODE: Add the Dense layer.
    #...'''TODO'''...

  return model

# Build a simple model with default hyperparameters. You will get the 
#   chance to change these later.
model = build_model(len(vocab), embedding_dim=256, rnn_units=1024, batch_size=32)

Check the summary of the model:

In [None]:
model.summary()

Check the input/output dimensionality:

In [None]:
x, y = get_batch(vectorized_songs, seq_length=100, batch_size=32)
pred = model(x)
print("Input shape:      ", x.shape, " # (batch_size, sequence_length)")
print("Prediction shape: ", pred.shape, "# (batch_size, sequence_length, vocab_size)")

### Predictions from the untrained model

Let's take a look at what our untrained model is predicting.

To get actual predictions from the model, we sample from the output distribution, which is defined by a `softmax` over our character vocabulary. This will give us actual character indices. This means we are using a [categorical distribution](https://en.wikipedia.org/wiki/Categorical_distribution) to sample over the example prediction. This gives a prediction of the next character (specifically its index) at each timestep.

Note here that we sample from this probability distribution, as opposed to simply taking the `argmax`, which can cause the model to get stuck in a loop.
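
The difference between the two strategies is easy to see on a toy distribution. The sketch below uses only the Python standard library and made-up logits for a three-character vocabulary. In the notebook, `tf.random.categorical` does the sampling directly from the logits, so the explicit `softmax` here is only for illustration.

```python
import math
import random

def softmax(logits):
    """Turn raw scores into probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

toy_logits = [2.0, 1.0, 0.1]  # made-up scores for a 3-character vocabulary
toy_probs = softmax(toy_logits)

# argmax always picks the same index for the same logits.
greedy_index = max(range(len(toy_probs)), key=toy_probs.__getitem__)

# Sampling picks every index with a frequency close to its probability.
sampler = random.Random(0)
samples = sampler.choices(range(len(toy_probs)), weights=toy_probs, k=1000)
sample_counts = [samples.count(i) for i in range(len(toy_probs))]

print("probabilities:", [round(p, 3) for p in toy_probs])
print("argmax index: ", greedy_index)
print("sample counts:", sample_counts)
```

With `argmax`, the same context always produces the same character, which is how a generator can end up repeating itself. Sampling keeps less likely characters in play.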

Let's try this sampling out for the first example in the batch.

In [None]:
sampled_indices = tf.random.categorical(pred[0], num_samples=1)
sampled_indices = tf.squeeze(sampled_indices,axis=-1).numpy()
sampled_indices

We can now decode these to see the text predicted by the untrained model:


In [None]:
print("Input: \n", repr("".join(idx2char[x[0]])))
print()
print("Next Char Predictions: \n", repr("".join(idx2char[sampled_indices])))

As you can see, the predictions are random because the network has not been trained yet!

# Network Training: FILL THE CODE

Now we can finally start training the model.

The music generation problem has been translated into a character prediction problem, which from the network's point of view is a standard classification problem:

Given the previous state of the RNN and the input at a given time step, we want to predict the class of the next character, i.e. the next character itself.

We will use the [`sparse_categorical_crossentropy`](https://www.tensorflow.org/api_docs/python/tf/keras/backend/sparse_categorical_crossentropy) loss, which uses integer targets for categorical classification tasks.

We'll compute the loss using:

* true targets, i.e. *labels*
* predictions, i.e. *logits*
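
To show what this loss measures, here is a hand-written version for a single time step, using only the standard library. It is a sketch of the underlying formula (the negative log of the softmax probability assigned to the true class), not the Keras implementation, and `sparse_ce_from_logits` is a hypothetical name.

```python
import math

def sparse_ce_from_logits(label, logits):
    """Cross-entropy for one integer label, computed directly from raw logits."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum_exp - logits[label]

uniform_logits = [0.0, 0.0, 0.0, 0.0]    # no preference between 4 classes
confident_logits = [5.0, 0.0, 0.0, 0.0]  # strongly prefers class 0

loss_uniform = sparse_ce_from_logits(0, uniform_logits)
loss_right = sparse_ce_from_logits(0, confident_logits)
loss_wrong = sparse_ce_from_logits(1, confident_logits)

print(loss_uniform)  # log(4), about 1.386
print(loss_right)
print(loss_wrong)
```

A confident correct prediction gives a loss close to zero, a confident wrong one gives a large loss, and a model with no preference scores the logarithm of the number of classes. An untrained model's loss is therefore usually in the neighbourhood of `log(vocab_size)`.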

In [None]:
### Defining the loss function ###

'''FILL THE CODE: define the loss function to compute and return the loss between
    the true labels and predictions (logits). Set the argument from_logits=True.'''
def compute_loss(labels, logits):
  # loss =...'''TODO'''...
  return loss

# compute the loss using the true next characters from the example batch 
#    and the predictions from the untrained model several cells above
# example_batch_loss = ...'''TODO'''...

print("Prediction shape: ", pred.shape, " # (batch_size, sequence_length, vocab_size)") 
print("scalar_loss:      ", example_batch_loss.numpy().mean())

Now we define some hyperparameters for training the model. You can then experiment by varying them and see how training and generation change.

In [None]:
### Hyperparameter setting and optimization ###

# Optimization parameters:
num_training_iterations = 2000  # Increase this to train longer
batch_size = 4  # Experiment between 1 and 64
seq_length = 100  # Experiment between 50 and 500
learning_rate = 5e-3  # Experiment between 1e-5 and 1e-1

# Model parameters: 
vocab_size = len(vocab)
embedding_dim = 256 
rnn_units = 1024  # Experiment between 1 and 2048

# Checkpoint location: 
checkpoint_dir = './training_checkpoints'
checkpoint_prefix = os.path.join(checkpoint_dir, "my_ckpt")

Now we are ready to define the training operations and actually train the model:

* instantiate new model
* instantiate optimizer (use [`Adam`](https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/Adam?version=stable), but you can also experiment with other optimizers)
* use `tf.GradientTape` to compute the gradients for backpropagation

In [None]:
### Define optimizer and training operation ###

'''FILL THE CODE: instantiate a new model for training using the `build_model`
  function and the hyperparameters created above.'''
# model = build_model('''TODO: arguments''')

# Instantiate Adam optimizer
# optimizer = ...'''TODO'''...

@tf.function
def train_step(x, y): 
  # Use tf.GradientTape()
  with tf.GradientTape() as tape:
  
    '''FILL THE CODE: feed the current input into the model and generate predictions'''
    # y_hat =  ...'''TODO'''...
  
    '''FILL THE CODE: compute the loss!'''
    # loss = ...'''TODO'''...

  # Now, compute the gradients 
  '''FILL THE CODE: complete the function call for gradient computation. 
      Remember that we want the gradient of the loss with respect all 
      of the model parameters. 
      HINT: use `model.trainable_variables` to get a list of all model
      parameters.'''
  # grads = ...'''TODO'''...
  
  # Apply the gradients to the optimizer so it can update the model accordingly
  #optimizer...'''TODO'''...
  return loss


##################
# Begin training!#
##################

history = []
plotter = mdl.util.PeriodicPlotter(sec=2, xlabel='Iterations', ylabel='Loss')
if hasattr(tqdm, '_instances'): tqdm._instances.clear() # clear if it exists

for iter in tqdm(range(num_training_iterations)):

  # Grab a batch and propagate it through the network
  x_batch, y_batch = get_batch(vectorized_songs, seq_length, batch_size)
  loss = train_step(x_batch, y_batch)

  # Update the progress bar
  history.append(loss.numpy().mean())
  plotter.plot(history)

  # Save a checkpoint of the model weights every 100 iterations
  if iter % 100 == 0:     
    model.save_weights(checkpoint_prefix)
    
# Save the final weights of the trained model
model.save_weights(checkpoint_prefix)


# MUSIC GENERATION

Finally we can actually use our trained RNN to generate some music.

We need to feed some sort of seed to the model to get it started (otherwise the RNN can't predict anything) and then iteratively predict each successive character.

At each step, we sample from the categorical distribution that the `softmax` output defines over the possible next characters, and then use the samples to encode a generated song in ABC format.

We'll start by restoring the last saved checkpoint. To keep things simple, we use a batch size of 1.

In [None]:
# Restore last saved checkpoint

model = build_model(vocab_size, embedding_dim, rnn_units, batch_size=1) # TODO

# Restore the model weights for the last checkpoint after training
model.load_weights(tf.train.latest_checkpoint(checkpoint_dir))
model.build(tf.TensorShape([1, None]))

model.summary()

### Prediction procedure

* Initialize a "seed" start string and the RNN state, and set the number of characters that we want to generate.

* Use the start string and RNN state to obtain the probability distribution over the next predicted character.

* Sample from a multinomial distribution to obtain the index of the predicted character, which is then used as the next input to the model.

* N.B. At each time step, the updated RNN state is fed back into the model so that the RNN has more context when making the next prediction.

After predicting the next character, the updated RNN states are again fed back into the model. This is how the model exploits sequence dependencies during generation: it gets more context from its previous predictions.

![LSTM inference](https://raw.githubusercontent.com/aamini/introtodeeplearning/2019/lab1/img/lstm_inference.png)
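
The procedure above can be sketched without TensorFlow by replacing the trained network with a stand-in. In the code below, `toy_model` is a hypothetical function that returns made-up logits and an updated state, and the vocabulary is invented for the example. Only the control flow (seed, sample, feed back) mirrors what `generate_text` has to do.

```python
import math
import random

toy_vocab = ["X", ":", "1", "A", "B", "\n"]  # made-up vocabulary
toy_char2idx = {ch: i for i, ch in enumerate(toy_vocab)}

def toy_model(index, state):
    """Hypothetical stand-in for the trained RNN: returns logits and a new state."""
    new_state = (state + index + 1) % len(toy_vocab)
    logits = [1.0 if i == new_state else 0.0 for i in range(len(toy_vocab))]
    return logits, new_state

def sample_index(logits, rng):
    """Sample one index from the softmax distribution defined by the logits."""
    m = max(logits)
    weights = [math.exp(v - m) for v in logits]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

def generate(start_string, generation_length, seed=0):
    if not start_string:
        raise ValueError("a non-empty seed string is needed to start generation")
    rng = random.Random(seed)
    state = 0  # reset the state before generating
    # Run the seed through the model; keep the logits after its last character.
    for ch in start_string:
        logits, state = toy_model(toy_char2idx[ch], state)
    generated = []
    for _ in range(generation_length):
        index = sample_index(logits, rng)        # sample rather than argmax
        generated.append(toy_vocab[index])
        logits, state = toy_model(index, state)  # feed the sample back in
    return start_string + "".join(generated)

toy_song = generate("X", generation_length=20)
print(repr(toy_song))
```

The output of this stand-in is not meaningful music. With the trained LSTM in place of `toy_model`, the same loop produces ABC text one character at a time.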


In [None]:
### Prediction of a generated song ###

def generate_text(model, start_string, generation_length=1000):
  # Evaluation step (generating ABC text using the learned RNN model)

  '''FILL THE CODE: convert the start string to numbers (vectorize)'''
  # input_eval = ['''TODO''']

  # Add batch dimension
  input_eval = tf.expand_dims(input_eval, 0)

  # Empty string to store our results
  text_generated = []

  # Here batch size == 1
  model.reset_states()
  if hasattr(tqdm, '_instances'): tqdm._instances.clear() # clear if it exists

  for i in tqdm(range(generation_length)):
      '''FILL THE CODE: evaluate the inputs and generate the next character predictions'''
      # predictions = ... '''TODO'''
      
      # Remove the batch dimension
      predictions = tf.squeeze(predictions, 0)
      
      '''FILL THE CODE: use a multinomial distribution to sample (hint: tf.random.categorical...)''' 
      # predicted_id = ...'''TODO''' ....
      
      # Pass the prediction along with the previous hidden state
      #   as the next inputs to the model
      input_eval = tf.expand_dims([predicted_id], 0)
      
      '''FILL THE CODE: add the predicted character to the generated text!'''
      # Hint: consider what format the prediction is in vs. the output 
      # text_generated.append('''TODO''')
    
  return (start_string + ''.join(text_generated))

In [None]:
# Use the model and the function defined above to generate ABC format text of length 1000!
#    As you may notice, ABC files start with "X" - this may be a good start string.
'''FILL THE CODE: '''
# generated_text = generate_text('''TODO''', start_string="X", generation_length=1000)

# Play back the music

In [None]:
### Play back generated songs ###

generated_songs = mdl.lab1.extract_song_snippet(generated_text)

for i, song in enumerate(generated_songs): 
  # Synthesize the waveform from a song
  waveform = mdl.lab1.play_song(song)

  # If it's a valid song (correct syntax), let's play it!
  if waveform:
    print("Generated song", i)
    ipythondisplay.display(waveform)

In [None]:
# Copyright 2020 MIT 6.S191 Introduction to Deep Learning. All Rights Reserved.
# 
# Licensed under the MIT License. You may not use this file except in compliance
# with the License. Use and/or modification of this code outside of 6.S191 must
# reference:
#
# © MIT 6.S191: Introduction to Deep Learning
# http://introtodeeplearning.com
#