# Music Generation with RNNs

Inspired by the MIT Deep Learning TensorFlow class. This project uses PyTorch and a Recurrent Neural Network (RNN) for music generation. The training set is a small dataset of pop song snippets in MIDI format. New snippets are easy to generate and convert to MIDI format, so the training set can easily be expanded. You can even train the model on a specific genre by supplying only data from that genre. Libraries used: python-midi, PyTorch, and MIT's custom `create_dataset.py` and `midi_manipulation.py` (see the `util` folder).

In [1]:
!pip install python-midi



In [1]:
import torch
from torch import nn
import torch.nn.functional as F
import numpy as np

from util.util import print_progress
from util.create_dataset import create_dataset, get_batch
from util.midi_manipulation import noteStateMatrixToMidi

# Prepare Data
Data is stored in the `data` folder. It is pre-converted to `np.array`, with notes one-hot encoded. Only songs longer than the `min_song_length` threshold are kept.

In [2]:
min_song_length  = 128
encoded_songs    = create_dataset(min_song_length)

88 songs processed
15 songs discarded


In [6]:
NUM_SONGS = len(encoded_songs)
print(str(NUM_SONGS) + " total songs to learn from")
print(encoded_songs[0].shape) #(song_length, num_possible_notes)

73 total songs to learn from
(129, 78)
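
The length filter described above can be pictured with a minimal sketch. The helper `keep_long_songs` and the random arrays are hypothetical stand-ins, not the actual `create_dataset` implementation, and the strict `>` comparison is an assumption based on "above the threshold".

```python
import numpy as np

def keep_long_songs(songs, min_song_length):
    """Hypothetical helper: keep songs with more than min_song_length timesteps."""
    return [song for song in songs if song.shape[0] > min_song_length]

# Synthetic stand-ins for encoded songs, each of shape (song_length, num_possible_notes)
rng = np.random.default_rng(0)
demo_songs = [rng.integers(0, 2, size=(length, 78)) for length in (64, 129, 200)]

kept = keep_long_songs(demo_songs, min_song_length=128)
print(len(kept), "songs kept,", len(demo_songs) - len(kept), "songs discarded")
```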


## RNN Model Architecture

The cell type is an LSTM cell. "This model will be based off a single LSTM cell, with a state vector used to maintain temporal dependencies between consecutive music notes." (This is the same model as in MIT Deep Learning 191.) The model takes a sequence of previous notes as input, passes it through an LSTM cell, and then through a fully connected layer. The output is fed through a softmax function, which returns a probability distribution over the next note:

$$ P(x_t\vert x_{t-L},\cdots,x_{t-1})$$ 

where $x_t$ is a one-hot encoding of the note played at timestep $t$ and $L$ is the length of a song snippet, as shown in the diagram below.

<img src="img/lab1ngram.png" alt="Drawing" style="width: 50em;"/>
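
As a rough illustration of the softmax step, the sketch below turns a vector of scores into a probability distribution over the next note. The random `logits` tensor is a stand-in for the output of the fully connected layer, and `num_possible_notes = 78` is taken from the song shape printed earlier.

```python
import torch
import torch.nn.functional as F

num_possible_notes = 78                        # from the (129, 78) song shape
logits = torch.randn(1, num_possible_notes)    # stand-in for the fully connected output

probs = F.softmax(logits, dim=1)               # probability distribution over the next note
next_note = torch.argmax(probs, dim=1)         # index of the most likely next note

print("probabilities sum to", probs.sum().item())
print("most likely next note index:", next_note.item())
```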
 

### Neural Network Parameters
* `input_size` and `output_size` are defined to match the shape of the **encoded** inputs and outputs at each timestep. Recall that the encoded representation of each song  has shape `(song_length, num_possible_notes)`, with the notes played at each timestep encoded as a binary vector over all possible notes. The parameters `input_size` and `output_size` will reflect the length of this vector encoding -- the number of possible notes.
* `hidden_size` is the number of hidden units in the LSTM, which is also the input size of the fully connected layer that follows it.
* The `learning_rate` of the model should be somewhere between `1e-4` and `0.1`. 
* `training_steps` is the number of batches processed during training.
* The `batch_size` is the number of song snippets per batch.
* To train the model, we will be choosing snippets of length `timesteps` from each song. This ensures that all song snippets have the same length and speeds up training. 
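
The snippet selection in the last bullet might look like the sketch below. `sample_snippet_batch` is a hypothetical helper written for illustration; it is not the `get_batch` function from `util.create_dataset`, whose signature is not shown here. The synthetic songs are random one-hot arrays.

```python
import numpy as np

def sample_snippet_batch(songs, batch_size, timesteps, rng):
    """Hypothetical sampler: random snippets plus the note that follows each one."""
    num_notes = songs[0].shape[1]
    inputs = np.zeros((batch_size, timesteps, num_notes), dtype=np.float32)
    targets = np.zeros((batch_size, num_notes), dtype=np.float32)
    for i in range(batch_size):
        song = songs[rng.integers(len(songs))]
        # The start index leaves room for the snippet and the note after it
        start = rng.integers(song.shape[0] - timesteps)
        inputs[i] = song[start:start + timesteps]
        targets[i] = song[start + timesteps]
    return inputs, targets

rng = np.random.default_rng(0)
# Four synthetic one-hot songs of shape (129, 78)
demo_songs = [np.eye(78, dtype=np.float32)[rng.integers(78, size=129)] for _ in range(4)]
batch_x, batch_y = sample_snippet_batch(demo_songs, batch_size=8, timesteps=64, rng=rng)
print(batch_x.shape, batch_y.shape)  # (8, 64, 78) (8, 78)
```

Because every snippet has the same length, the whole batch fits in one dense array, which is what makes the fixed `timesteps` convenient for training.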

In [7]:
## Neural Network Parameters
input_size       = encoded_songs[0].shape[1]   # The number of possible MIDI Notes
output_size      = input_size                  # Same as input size
hidden_size      = 128                         # Number of neurons in hidden layer

learning_rate    = 0.001 # Learning rate of the model
training_steps   = 200  # Number of batches during training
batch_size       = 256    # Number of song snippets per batch
timesteps        = 64    # Length of song snippet -- this is what is fed into the model

assert timesteps < min_song_length

### Model Initialization

### Dimensions

* The input is 3-dimensional: (size of the batch, number of timesteps in a song snippet, number of possible MIDI notes).
* The output is 2-dimensional: (size of the batch, number of possible MIDI notes). For each song snippet in the training batch, it is just the single note that immediately follows that snippet in the input tensor.

Why all possible MIDI notes? Because of the one-hot encoding: each possible note has its own binary field, either 0 or 1, and only one field in each row can be 1, so each row identifies a note uniquely.
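
The shapes listed above can be checked with a small sketch. The batch size of 4 is a demo value chosen for this example, and taking the last timestep of the LSTM output is an assumption about how the single next note is produced.

```python
import torch
from torch import nn

batch_size, timesteps, input_size, hidden_size = 4, 64, 78, 128   # small demo batch

lstm = nn.LSTM(input_size, hidden_size, batch_first=True)
fc = nn.Linear(hidden_size, input_size)

x = torch.zeros(batch_size, timesteps, input_size)   # 3-D input tensor
lstm_out, _ = lstm(x)                                # (batch_size, timesteps, hidden_size)
logits = fc(lstm_out[:, -1, :])                      # last timestep only: (batch_size, input_size)

print(x.shape, lstm_out.shape, logits.shape)
```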

There's a fully connected layer with weights and biases after the LSTM. In PyTorch it is easy to replace this last layer by building it with `nn.Sequential` and giving it a name.
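
One way to get a replaceable, named last layer is to build the head from an `OrderedDict`. This is only a sketch: the layer names `dropout` and `fc`, the dropout rate, and the replacement size of 40 notes are illustrative assumptions.

```python
from collections import OrderedDict

import torch
from torch import nn

hidden_size, output_size = 128, 78

# Output head with named layers (names are illustrative)
head = nn.Sequential(OrderedDict([
    ("dropout", nn.Dropout(0.5)),
    ("fc", nn.Linear(hidden_size, output_size)),
]))

# Swap the named last layer, e.g. for a dataset with a different number of notes
head.fc = nn.Linear(hidden_size, 40)

out = head(torch.zeros(3, hidden_size))
print([name for name, _ in head.named_children()], out.shape)
```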

In TensorFlow we need to use `tf.placeholder` and also mind the shape. In PyTorch no placeholders are needed: tensors of the right shape are passed directly to the model.

In [8]:
# DIMENSIONS
# input:   (batch_size, timesteps, input_size)
# output:  (batch_size, output_size)
# FC DIMENSIONS
# weights: (hidden_size, output_size)
# biases:  (output_size,)

In PyTorch it is easy to set up an RNN with LSTM cells. [See docs](https://pytorch.org/docs/stable/nn.html#lstm)
In TensorFlow, use `RNN(input_vec, weights, biases)` with `rnn.BasicLSTMCell`.

In [None]:
class MusicRNN(nn.Module):

    def __init__(self, n_hidden=hidden_size, n_layers=1, drop_prob=0.5, lr=learning_rate):
        super().__init__()
        self.n_layers = n_layers
        self.n_hidden = n_hidden
        self.lr = lr

        # LSTM over the encoded notes; input is (batch_size, timesteps, input_size)
        self.lstm = nn.LSTM(input_size, n_hidden, n_layers, batch_first=True)

        # Dropout on the LSTM output (the drop_prob default is an assumed value)
        self.dropout = nn.Dropout(drop_prob)

        # Final fully-connected output layer: hidden state -> note logits
        self.fc = nn.Linear(n_hidden, output_size)

    def forward(self, x, hidden):
        ''' Forward pass through the network.
            The inputs are x and the hidden/cell state `hidden`. '''

        # Get the outputs and the new hidden state from the LSTM
        r_output, hidden = self.lstm(x, hidden)

        # Keep only the last timestep: we predict the single note that
        # follows each snippet, so the result is (batch_size, n_hidden)
        out = self.dropout(r_output[:, -1, :])

        # Fully-connected layer gives logits of shape (batch_size, output_size)
        out = self.fc(out)

        # Return the final output and the hidden state
        return out, hidden

    def init_hidden(self, batch_size):
        ''' Initializes hidden state '''
        # Create two zero tensors of size n_layers x batch_size x n_hidden for the
        # hidden state and cell state, on the same device as the model parameters
        weight = next(self.parameters())
        hidden = (weight.new_zeros(self.n_layers, batch_size, self.n_hidden),
                  weight.new_zeros(self.n_layers, batch_size, self.n_hidden))

        return hidden

### Loss, Training, and Accuracy Operations
For training we use softmax cross entropy loss as the criterion.

In [None]:
logits, prediction = RNN(input_vec, weights, biases)

In [None]:
# LOSS OPERATION:
'''TODO: Use TensorFlow to define the loss operation as the mean softmax cross entropy loss. 
TensorFlow has built-in functions for you to use. '''
loss_op = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits_v2(
    logits=logits, labels=output_vec))  # TODO 
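
A PyTorch counterpart of the loss operation above might look like the following sketch. The tensors are random stand-ins for the model output and the one-hot targets. `nn.CrossEntropyLoss` applies the softmax internally and averages over the batch; here it is given class indices, so the one-hot rows are converted with `argmax`.

```python
import torch
from torch import nn
import torch.nn.functional as F

batch_size, num_notes = 4, 78
logits = torch.randn(batch_size, num_notes)            # stand-in for the model output
true_idx = torch.randint(num_notes, (batch_size,))
output_vec = F.one_hot(true_idx, num_notes).float()    # one-hot targets

criterion = nn.CrossEntropyLoss()                      # softmax + mean cross entropy
loss = criterion(logits, output_vec.argmax(dim=1))     # targets given as class indices

# The same value computed by hand from the one-hot targets
manual = -(F.log_softmax(logits, dim=1) * output_vec).sum(dim=1).mean()
print(loss.item(), manual.item())
```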

In [None]:
# TRAINING OPERATION:
'''TODO: Define an optimizer for the training operation. 
Remember we have already set the `learning_rate` parameter.'''
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate) # TODO
train_op = optimizer.minimize(loss_op) 
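
In PyTorch the optimizer is created from the model parameters, and one training step is written out explicitly instead of being wrapped in a `train_op`. The sketch below uses a tiny `nn.Linear` as a stand-in for the music model and random data, both of which are assumptions made for illustration.

```python
import torch
from torch import nn

model = nn.Linear(128, 78)                                  # tiny stand-in for the music model
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)  # same learning rate as above
criterion = nn.CrossEntropyLoss()

features = torch.randn(16, 128)          # random stand-in inputs
targets = torch.randint(78, (16,))       # random stand-in note indices

before = model.weight.detach().clone()

optimizer.zero_grad()                    # clear gradients from any previous step
loss = criterion(model(features), targets)
loss.backward()                          # backpropagate
optimizer.step()                         # update the parameters

print("weights changed:", not torch.equal(before, model.weight.detach()))
```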

In [None]:
# ACCURACY: We compute the accuracy in two steps.

# First, we need to determine the predicted next note and the true next note, across the training batch, 
#  and then determine whether our prediction was correct. 
# Recall that we defined the placeholder output_vec to contain the true next notes for each song snippet in the batch.
'''TODO: Write an expression to obtain the index for the most likely next note predicted by the RNN.'''
true_note = tf.argmax(output_vec,1)
pred_note = tf.argmax(prediction, 1) # TODO
correct_pred = tf.equal(pred_note, true_note)

# Next, we obtain a value for the accuracy. 
# We cast the values in correct_pred to floats, and use tf.reduce_mean
#  to figure out the fraction of these values that are 1's (1 = correct, 0 = incorrect)
accuracy_op = tf.reduce_mean(tf.cast(correct_pred, tf.float32))
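
The same two-step accuracy computation can be sketched in PyTorch. The small `prediction` and `output_vec` tensors below are hand-written examples with three possible notes, chosen so the result is easy to check.

```python
import torch

# Predicted probabilities for 4 snippets over 3 possible notes (toy values)
prediction = torch.tensor([[0.1, 0.7, 0.2],
                           [0.8, 0.1, 0.1],
                           [0.2, 0.2, 0.6],
                           [0.3, 0.4, 0.3]])
# One-hot true next notes
output_vec = torch.tensor([[0., 1., 0.],
                           [1., 0., 0.],
                           [0., 1., 0.],
                           [0., 1., 0.]])

true_note = output_vec.argmax(dim=1)               # index of the true next note
pred_note = prediction.argmax(dim=1)               # index of the most likely predicted note
accuracy = (pred_note == true_note).float().mean() # fraction of correct predictions
print(accuracy.item())  # 0.75
```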

### Training the RNN

For each training step, we input a batch of song snippets and generate the next note for each song snippet in the batch. The loss is computed at each step too.

Compared to TensorFlow: PyTorch variables are ready to go. If we used TensorFlow, we would need to launch a session and initialize all variables before training.

In [1]:

    
    # DISPLAY METRICS
    if step % display_step == 0 or step == 1:
        # LOSS, ACCURACY: Compute the loss and accuracy by running both operations 
        loss, acc = sess.run([loss_op, accuracy_op], feed_dict=feed_dict)     
        suffix = "\nStep " + str(step) + ", Minibatch Loss= " + \
                 "{:.4f}".format(loss) + ", Training Accuracy= " + \
                 "{:.3f}".format(acc)

        print_progress(step, training_steps, barLength=50, suffix=suffix)

IndentationError: unexpected indent (<ipython-input-1-40c3e4a44c2b>, line 4)
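
A PyTorch version of the training loop, including the metrics display, could be sketched as follows. Everything here is an assumption made to keep the example self-contained: `TinyMusicRNN` is a hypothetical stand-in model, `random_batch` produces synthetic data instead of real song snippets, and the sizes are reduced demo values.

```python
import torch
from torch import nn
import torch.nn.functional as F

torch.manual_seed(0)
input_size, hidden_size, timesteps, batch_size = 78, 32, 16, 8
training_steps, display_step = 20, 5          # small demo values

class TinyMusicRNN(nn.Module):
    """Hypothetical stand-in model: LSTM followed by a fully connected layer."""
    def __init__(self):
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, input_size)

    def forward(self, x):
        out, _ = self.lstm(x)
        return self.fc(out[:, -1, :])         # logits for the next note

def random_batch():
    """Synthetic one-hot snippets and next-note indices (stand-in for real data)."""
    notes = torch.randint(input_size, (batch_size, timesteps + 1))
    x = F.one_hot(notes[:, :-1], input_size).float()
    return x, notes[:, -1]

model = TinyMusicRNN()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

losses = []
for step in range(1, training_steps + 1):
    batch_x, batch_y = random_batch()
    optimizer.zero_grad()
    logits = model(batch_x)
    loss = criterion(logits, batch_y)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

    # DISPLAY METRICS
    if step % display_step == 0 or step == 1:
        acc = (logits.argmax(dim=1) == batch_y).float().mean().item()
        print(f"Step {step}, Minibatch Loss= {loss.item():.4f}, Training Accuracy= {acc:.3f}")
```

No session or variable initialization is needed: the model is ready to train as soon as it is constructed.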

What is a good accuracy level for generating music? An accuracy of 100% means the model has memorized all the songs in the dataset, and can reproduce them at will. An accuracy of 0% means random noise. There must be a happy medium, where the generated music both sounds good and is original. In the words of Gary Marcus, "Music that is either purely predictable or completely unpredictable is generally considered unpleasant - tedious when it's too predictable, discordant when it's too unpredictable." Empirically we've found a good range for this to be `75%` to `90%`, but you should listen to generated output from the next step to see for yourself.

## Music Generation
The model takes a sequence of notes. We just have to generate one seed note to start iteratively predicting each successive note. At each step the model outputs a probability distribution over all possible successive notes. We generate an entire song by building it up to the full song length from that one seed note. To listen to the demo, we write it to a file and play it.

In [None]:
import matplotlib.pyplot as plt
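
The generation procedure described above can be sketched with an untrained stand-in network. The layer sizes, the song length of 50, and the choice to sample from the distribution with `torch.multinomial` (rather than always taking the most likely note) are assumptions for illustration. Writing the result to a MIDI file is left out, since the signature of `noteStateMatrixToMidi` is not shown here.

```python
import torch
from torch import nn
import torch.nn.functional as F

torch.manual_seed(0)
num_notes, hidden_size, song_length = 78, 32, 50

lstm = nn.LSTM(num_notes, hidden_size, batch_first=True)   # untrained stand-in
fc = nn.Linear(hidden_size, num_notes)

seed = torch.randint(num_notes, (1,))                      # one random seed note
song = [seed.item()]
x = F.one_hot(seed, num_notes).float().view(1, 1, num_notes)
hidden = None                                              # LSTM starts from a zero state

with torch.no_grad():
    for _ in range(song_length - 1):
        out, hidden = lstm(x, hidden)                      # carry the state between steps
        probs = torch.softmax(fc(out[:, -1, :]), dim=1)    # distribution over the next note
        note = torch.multinomial(probs, num_samples=1)     # sample the next note index
        song.append(note.item())
        x = F.one_hot(note, num_notes).float()             # shape (1, 1, num_notes)

print(len(song), song[:10])
```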

## Improving the Model



### Compare PyTorch to TensorFlow
* In TensorFlow you have to build the entire graph and then initialize all variables.
* In PyTorch the graph is built for you while you write Pythonic code. Just call the model's `forward()` function.
* In TensorFlow, we write all kinds of placeholders.
* In PyTorch, we can initialize layers and models as part of a subclass of `nn.Module`.
* TensorFlow math looks more like matrix operations. Its usage is mostly function-based: you have to call the corresponding function for each step of the operation.
* PyTorch is Pythonic: there is the OOP way of defining a model, as well as the functional and the sequential ways.
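
The three PyTorch styles mentioned in the last bullet can be compared on one small layer. The class name `Head`, the ReLU activation, and the sizes are illustrative choices, not part of the music model.

```python
import torch
from torch import nn
import torch.nn.functional as F

# 1. OOP style: subclass nn.Module
class Head(nn.Module):
    def __init__(self, n_in, n_out):
        super().__init__()
        self.fc = nn.Linear(n_in, n_out)

    def forward(self, x):
        return F.relu(self.fc(x))

# 2. Sequential style: stack ready-made layers
seq_head = nn.Sequential(nn.Linear(128, 78), nn.ReLU())

# 3. Functional style: explicit weight tensors and plain functions
weight, bias = torch.randn(78, 128), torch.zeros(78)

def functional_head(x):
    return F.relu(F.linear(x, weight, bias))

x = torch.randn(4, 128)
oop_out, seq_out, fn_out = Head(128, 78)(x), seq_head(x), functional_head(x)
print(oop_out.shape, seq_out.shape, fn_out.shape)
```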