## TP4: Recurrent neural networks

# Part II: a generative LSTM

Like the time series in Part I, language processing tasks (generation, translation, automatic correction, etc.) lend themselves to recurrent networks. \
But whatever the architecture (perceptron, RNN, or transformer, which we will see later), the neural network approach requires a change in representation: we must move from a sequence of words to a sequence of input vectors.
This change in representation involves several steps. The most important ones (tokenization, embedding) are illustrated here through a folk music generation task. \
Yes indeed, music is a language as well!

In [None]:
# This tutorial is based on MIT pedagogical materials.
# First download and import the MIT 6.S191 package:
!pip install mitdeeplearning
import mitdeeplearning as mdl

# Import all remaining packages
import numpy as np
import os
import time
# import functools
from tqdm import tqdm
!apt-get install abcmidi timidity > /dev/null 2>&1

**II.1)** From folk songs to a PyTorch loader

The first exercise consists of building a dataset from many (ca. 800) music scores.
These scores are transcriptions of popular Irish songs in [ABC](https://fr.wikipedia.org/wiki/ABC_(notation)) notation.


**Q1** Load the song list and browse some of the songs. How are melodies encoded?

In [None]:
songs = mdl.lab1.load_training_data()

example_song = songs[712]
print("\nExample song: ")
print(example_song)

In [None]:
# Convert the ABC notation to audio file and listen to it
mdl.lab1.play_song(example_song)

To build our dataset, let's first merge all the songs into one text:


In [24]:
songs_joined = "\n\n".join(songs)

Now the problem is to convert a character string into a numerical sequence that can be "learned". Typically, this change in representation involves four stages:

- [Three preprocessing steps](https://web.archive.org/web/20200131102455/https://mlexplained.com/2019/11/06/a-deep-dive-into-the-wonderful-world-of-preprocessing-in-nlp/):
  * Cleaning: the text is cleaned and formatted in a standard form.
  * Tokenization: the text is segmented into elementary units (e.g., letters, words, pieces of words, etc.).
  * Numericalization: each token is mapped to a numerical id.

- An embedding step: the numerical ids are mapped onto tensors. This mapping is usually parameterized by trainable weights, so it is learned during the training phase.
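As a rough illustration, the four stages above can be sketched on a toy string. The toy text, the cleaning rule (stripping surrounding whitespace) and the random table `E` are assumptions made for this sketch; in the lab, the embedding is a trainable layer rather than a fixed random matrix.

```python
import numpy as np

toy_text = "  abc cab  "          # hypothetical raw text

# 1) Cleaning: here, simply strip the surrounding whitespace
clean = toy_text.strip()

# 2) Tokenization: one token per character
tokens = list(clean)

# 3) Numericalization: map each distinct token to an integer id
toy_vocab = sorted(set(tokens))
toy_char2idx = {ch: i for i, ch in enumerate(toy_vocab)}
ids = np.array([toy_char2idx[ch] for ch in tokens])

# 4) Embedding: look up one vector per id in a (vocab_size, dim) table.
#    E is random here; during training its entries would be learned.
rng = np.random.default_rng(0)
E = rng.normal(size=(len(toy_vocab), 4))
vectors = E[ids]

print(ids.shape, vectors.shape)
```

Each character ends up as one row of `E`, so a string of 7 characters becomes a 7 x 4 array.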


In this lab, we oversimplify the first three steps: we consider that the musical scores are already normalized and the segmentation is done by character.
Moreover, each character is mapped to an integer via the following code:

In [25]:
vocab = sorted(set(songs_joined))
char2idx = {u:i for i, u in enumerate(vocab)}
idx2char = np.array(vocab)

**Q2** How big is the “vocabulary” used here? Could we have reduced it? Say how and for what reason.

In [26]:
len(vocab)

83

To reduce the size of *vocab*, we could have removed the non-musical parts of the headers (lines X: to Z:) before the merge, and used dedicated characters to represent the words "Major" and "Minor". Among [other strategies](https://www.geeksforgeeks.org/byte-pair-encoding-bpe-in-nlp/), this would reduce the mean number of bits per character and ultimately yield a more compact representation of the text.
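A minimal sketch of the first idea, on a made-up tune: the ABC snippet below is hypothetical (not taken from the dataset), `strip_text_headers` is a helper introduced only for this illustration, and the choice of dropping the `T:` and `Z:` fields is an assumption about which header lines count as non-musical.

```python
# Hypothetical ABC tune, written for this sketch (not taken from the dataset)
toy_song = "X:1\nT:Toy Reel\nZ:id:xx-00\nM:4/4\nL:1/8\nK:D Major\nABcd efga|]"

def strip_text_headers(song, dropped=("T", "Z")):
    """Drop header lines whose field letter is in `dropped`."""
    kept = [line for line in song.split("\n")
            if not (len(line) > 1 and line[1] == ":" and line[0] in dropped)]
    return "\n".join(kept)

reduced = strip_text_headers(toy_song)

# Compare the number of distinct characters before and after
print(len(set(toy_song)), "->", len(set(reduced)))
```

Free-text fields such as titles bring in many characters that never occur in the melody itself, which is why removing them shrinks the vocabulary.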

**Q3** Complete the *vectorize_string* function, which converts any subsequence of the text *songs_joined* into an np.array of indices.

In [27]:
### Vectorize the songs string ###

'''TODO: Write a function to convert the all-songs string to a vectorized
    (i.e., numeric) representation. Use the appropriate mapping
    above to convert from vocab characters to the corresponding indices.

  NOTE: the output of the `vectorize_string` function
  should be a np.array with `N` elements, where `N` is
  the number of characters in the input string
'''
def vectorize_string(string):
  vectorized_output = np.array([char2idx[char] for char in string])
  return vectorized_output


vectorized_songs = vectorize_string(songs_joined)
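A quick way to check such a mapping is a round trip: vectorize a string, map the indices back to characters, and compare with the original. The sketch below uses a toy snippet and its own toy vocabulary; `toy_vectorize` and `toy_devectorize` are names introduced for this illustration only.

```python
import numpy as np

toy_text = "X:1\nK:D\nABcd"            # hypothetical snippet
toy_vocab = sorted(set(toy_text))
toy_char2idx = {ch: i for i, ch in enumerate(toy_vocab)}
toy_idx2char = np.array(toy_vocab)

def toy_vectorize(string):
    # One integer id per character
    return np.array([toy_char2idx[ch] for ch in string])

def toy_devectorize(indices):
    # Fancy indexing returns an array of characters, joined back into a string
    return "".join(toy_idx2char[indices])

ids = toy_vectorize(toy_text)
print(ids)
print(repr(toy_devectorize(ids)))
```

Note that a character absent from the vocabulary would raise a `KeyError` in the dictionary lookup.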

In [28]:
repr(songs_joined[:10])

"'X:1\\nT:Alex'"

In [29]:
print(repr(songs_joined[:10]))

'X:1\nT:Alex'


The Dataset below is defined in such a way as to be able to generate target sequences shifted one step to the right w.r.t. the input sequences:

In [30]:
from torch.utils.data import Dataset, DataLoader

class BitsOfSongs(Dataset):
    """
    PyTorch Dataset yielding (input, target) windows over the vectorized songs.
    """

    def __init__(self, data, input_sequence_len):
        self.data = data
        self.input_sequence_len = input_sequence_len

    def __len__(self):
        return len(self.data) - self.input_sequence_len - 1

    def __getitem__(self, start_idx):
        # Extract an input sequence
        stop_idx = start_idx + self.input_sequence_len
        sequence = self.data[start_idx:stop_idx]
        # shift right the extraction window to get the target:
        target = self.data[start_idx + 1:stop_idx + 1]
        return {'sequence': sequence, 'target': target}
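The windowing logic of `__getitem__` can be checked by hand on a small array. The sketch below reproduces only the slicing (no PyTorch involved); the array and the chosen indices are toy values.

```python
import numpy as np

data = np.arange(10, 20)   # stand-in for the vectorized songs
seq_len = 4
start_idx = 2

stop_idx = start_idx + seq_len
sequence = data[start_idx:stop_idx]          # [12 13 14 15]
target = data[start_idx + 1:stop_idx + 1]    # [13 14 15 16]

# The target at position t is the input at position t + 1
assert np.array_equal(target[:-1], sequence[1:])
print(sequence, target)
```

At each position, the network is therefore asked to predict the next character given the characters seen so far.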

**Q4** Instantiate a dataset and a dataloader. Visualize some input and target sequences.

In [None]:
dataset = BitsOfSongs(vectorized_songs, 10)
dataloader = DataLoader(dataset, batch_size=2, shuffle=True)

In [None]:
batch = next(iter(dataloader))
x = batch['sequence']
y = batch['target']
print(x.shape,y.shape)

**II.2)** Implementation of an LSTM



Now consider the following model:


In [33]:
import numpy as np
import torch
import torch.nn as nn


class genFolk(nn.Module):

    def __init__(self, latent_size=256, hidden_size=50, vocab_size=10,
                 batch_size = 32):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, latent_size)
        self.hidden_size = hidden_size
        self.lstm = nn.LSTM(latent_size, hidden_size, batch_first=True)
        self.tanh = nn.Tanh()
        self.linear = nn.Linear(hidden_size, vocab_size)
        self.hidden_cell = (torch.zeros(1, batch_size, hidden_size),
                       torch.zeros(1, batch_size, hidden_size))

    def forward(self,seq):
        seq = self.embedding(seq)
        lstm_out, self.hidden_cell = self.lstm(seq, self.hidden_cell)
        lstm_out = self.tanh(lstm_out)
        pred = self.linear(lstm_out)
        return pred



**Q5** What role does the *self.embedding* layer play?

The embedding layer maps each token id onto a dense vector of size *latent_size*. It acts as a lookup table whose rows are trainable weights.
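The lookup-table behaviour can be observed directly on a standalone `nn.Embedding` layer. In this sketch the sizes match those used in the lab, while the random ids are placeholders for real token ids.

```python
import torch
import torch.nn as nn

vocab_size, latent_size = 83, 256      # values used in this lab
embedding = nn.Embedding(vocab_size, latent_size)

ids = torch.randint(0, vocab_size, (2, 10))   # (batch, sequence) of token ids
vectors = embedding(ids)                      # (batch, sequence, latent_size)
print(vectors.shape)

# The layer is a lookup table: row k of the weight matrix is the vector of id k
assert torch.equal(vectors[0, 0], embedding.weight[ids[0, 0]])
```

Since `embedding.weight` is a parameter, its rows are updated by the optimizer together with the rest of the model.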

**Q6** Briefly describe the rest of the model. Compute an output and describe each of its dimensions. Also specify the reason why the batch size is taken as an argument by the class constructor.

In [34]:
vocab_size = len(vocab)
input_size = 256
hidden_size = 1024
batch_size = 32
out_size = len(vocab)


model = genFolk(latent_size=input_size,
                hidden_size=hidden_size,
                vocab_size=vocab_size,
                batch_size=batch_size)

In [35]:
dataset = BitsOfSongs(vectorized_songs, 100)
dataloader = DataLoader(dataset, batch_size=batch_size) #, sampler=sampler)
batch = next(iter(dataloader))
x = batch['sequence']
y = batch['target']
print(x.shape, y.shape)
pred = model(x)
print(pred.shape)

torch.Size([32, 100]) torch.Size([32, 100])
torch.Size([32, 100, 83])
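To read these shapes, it can help to look at an `nn.LSTM` layer in isolation. The sketch below uses small toy sizes (assumptions, not the lab's values) and shows that the initial states carry a batch dimension, which is one plausible reason why the constructor of the model needs to know the batch size.

```python
import torch
import torch.nn as nn

batch_size, seq_len, latent_size, hidden_size = 4, 7, 16, 32  # small toy sizes
lstm = nn.LSTM(latent_size, hidden_size, batch_first=True)

x = torch.randn(batch_size, seq_len, latent_size)
h0 = torch.zeros(1, batch_size, hidden_size)   # (num_layers, batch, hidden)
c0 = torch.zeros(1, batch_size, hidden_size)

out, (hn, cn) = lstm(x, (h0, c0))
print(out.shape)   # (batch, sequence, hidden): one hidden vector per time step
print(hn.shape)    # (num_layers, batch, hidden): final hidden state only

# For a single-layer, unidirectional LSTM, the last output is the final hidden state
assert torch.allclose(out[:, -1], hn[0])
```

In the model above, the final linear layer then maps each of these hidden vectors to one score per character of *vocab*, which gives the last dimension of size 83.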


**II.2)** Training of the model

To measure the difference between the predicted token and the observed token, it is possible to use the same cost function as in classification:

In [None]:
loss_fn  = torch.nn.CrossEntropyLoss()

def compute_loss(y, pred):
  trpred = torch.transpose(pred, 1, 2)
  return loss_fn(trpred, y)

example_batch_loss =  compute_loss(y, pred)

print("Prediction shape: ", pred.shape, " # (batch_size, sequence_length, vocab_size)")
print("scalar_loss:      ", example_batch_loss.detach().numpy())

**Q6** Why do we need to transpose the prediction tensor to be able to use cross entropy?

PyTorch's cross entropy follows the conventions of computer vision classification/segmentation: the class dimension must be the second dimension of the prediction tensor, i.e. predictions of shape (batch, classes, ...) and targets of shape (batch, ...). Here, the classes lie along the third dimension (there are as many classes as characters in *vocab*), hence the transposition.
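As a sanity check on toy tensors, transposing the prediction gives the same loss as flattening the batch and time axes into one, which is another common way to feed sequences to `CrossEntropyLoss`. The sizes below are arbitrary.

```python
import torch

torch.manual_seed(0)
B, T, C = 4, 5, 7                          # toy batch, length, vocabulary sizes
pred = torch.randn(B, T, C)                # logits, classes on the last axis
y = torch.randint(0, C, (B, T))            # target ids

loss_fn = torch.nn.CrossEntropyLoss()

# Option 1: move the class axis to position 1 -> (B, C, T)
loss_transposed = loss_fn(pred.transpose(1, 2), y)

# Option 2: merge batch and time axes -> (B*T, C) and (B*T,)
loss_flat = loss_fn(pred.reshape(B * T, C), y.reshape(B * T))

assert torch.allclose(loss_transposed, loss_flat)
print(loss_transposed.item())
```

With the default mean reduction, both options average the per-character losses over all positions of all sequences.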

**Q7** Complete the following training loop, run it on a GPU, and try to get the best loss by tuning the hyperparameters.

In [37]:
### Hyperparameter setting and optimization ###

# Optimization parameters:
num_epochs = 20  #
batch_size = 32  # Experiment between 1 and 64
num_samples = 100*batch_size # num of sequences sampled at each epoch
seq_length = 100  # Experiment between 50 and 500
learning_rate = 5e-3  # Experiment between 1e-5 and 1e-1

# Model parameters:
vocab_size = len(vocab)
embedding_dim = 256
rnn_units = 1024  # Experiment between 1 and 2048


from torch.utils.data import RandomSampler
dataset = BitsOfSongs(vectorized_songs, seq_length)
sampler = RandomSampler(dataset, replacement=True, num_samples=num_samples)
dataloader = DataLoader(dataset, batch_size=batch_size, sampler=sampler)

In [38]:
model = genFolk(latent_size=embedding_dim,
                hidden_size=rnn_units,
                vocab_size=vocab_size,
                batch_size=batch_size).cuda()


optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

In [None]:
history = []
plotter = mdl.util.PeriodicPlotter(sec=2, xlabel='Iterations', ylabel='Loss')
if hasattr(tqdm, '_instances'): tqdm._instances.clear() # clear if it exists

for epoch in tqdm(range(num_epochs)):
  for batch in dataloader:

    x = batch['sequence'].cuda()
    y = batch['target'].cuda()
    model.zero_grad()
    # Reset the hidden and cell states so that no gradient flows across batches
    model.hidden_cell = (torch.zeros(1, x.size(0), model.hidden_size, device=x.device),
                         torch.zeros(1, x.size(0), model.hidden_size, device=x.device))

    pred = model(x)

    loss = compute_loss(y, pred)
    loss.backward()
    optimizer.step()

  # Record the loss of the last batch of the epoch and refresh the plot
  history.append(loss.detach().cpu().numpy().mean())
  plotter.plot(history)
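When no GPU is available, the same kind of loop can be written in a device-agnostic way. The sketch below is only an illustration of the pattern: `toy_model` is a small linear classifier standing in for the LSTM, and the data are random.

```python
import torch
import torch.nn as nn

# Fall back to the CPU when no GPU is visible
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

toy_model = nn.Linear(8, 3).to(device)      # stand-in for the LSTM model
optimizer = torch.optim.Adam(toy_model.parameters(), lr=5e-3)
loss_fn = nn.CrossEntropyLoss()

# A single fixed batch, created directly on the chosen device
x = torch.randn(16, 8, device=device)
y = torch.randint(0, 3, (16,), device=device)

losses = []
for step in range(20):
    optimizer.zero_grad()             # clear gradients from the previous step
    loss = loss_fn(toy_model(x), y)
    loss.backward()                   # backpropagate
    optimizer.step()                  # update the weights
    losses.append(loss.item())

print(losses[0], losses[-1])
```

Replacing the `.cuda()` calls with `.to(device)` keeps the notebook runnable on machines without a GPU, at the cost of much slower training.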

**II.3)** Generation of folk songs

To generate a single sequence, we copy the trained weights into a model of the same class instantiated with *batch_size*=1:

In [None]:
# To generate one sequence
batch_size_inference = 1

model_bs1 = genFolk(latent_size=256,
                hidden_size=1024,
                vocab_size=len(vocab),
                batch_size = batch_size_inference)

model_bs1.load_state_dict(model.state_dict())

model_bs1.eval()

Then the procedure is as follows:

- initialize $h_0$, $c_0$ to 0.
- initialize the sequence with the index $i_0$ corresponding to the letter "X", since it is with this letter that an ABC code begins.
- at each step $n$, use the model to calculate $h_n$, $c_n$ and the output $pred_n$
   from $h_{n-1}$, $c_{n-1}$ and $i_{n-1}$.
- determine $i_n$ from $pred_n$.

This last step is done by sampling from the distribution defined by $pred_n$.

These steps are coded below:

In [None]:
# nb of steps:
generation_length=1000

# init hidden & cell states
model_bs1.hidden_cell = (torch.zeros(1,1,hidden_size),
              torch.zeros(1,1,hidden_size))

# Starter:
start_string="X"
start_ids = [char2idx[s] for s in start_string]
start_ids_torch = torch.tensor(start_ids).unsqueeze(dim=0)

# init the list of successive i_n
text_generated = []

# loop for generation:
input_eval = start_ids_torch


with torch.no_grad():  # no gradients are needed at inference time
    for n in range(generation_length):
        predictions = model_bs1(input_eval)

        # Remove the batch dimension
        predictions = predictions.squeeze(dim=0)

        # Sample the next token id from the predicted distribution
        num_sampler = torch.distributions.categorical.Categorical(logits=predictions)
        predicted_id = num_sampler.sample()

        # The sampled id becomes the next input
        input_eval = predicted_id.unsqueeze(dim=0)
        text_generated.append(idx2char[predicted_id.numpy()].item())

**Q8** What step does the call to *torch...Categorical* correspond to? \
Why the *logits = predictions* syntax? \
Why bother sampling instead of taking an *argmax*, as in classification?


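*Categorical* provides a sampler for the probability distribution over *vocab* defined by the output of the model. Since no softmax has been applied, the distribution is represented by its [logits](https://en.wikipedia.org/wiki/Logit). Sampling is preferred to an *argmax* because an *argmax* makes generation deterministic: the same starter always yields the same text, and the model can easily get stuck repeating the same pattern.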
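The difference between the two decoding rules can be seen on a toy distribution. In the sketch below the three logits are arbitrary values chosen for illustration.

```python
import torch

torch.manual_seed(0)
logits = torch.tensor([2.0, 1.0, 0.1])      # toy unnormalized scores

sampler = torch.distributions.categorical.Categorical(logits=logits)

# Passing logits is equivalent to passing softmax(logits) as probabilities
assert torch.allclose(sampler.probs, torch.softmax(logits, dim=-1))

greedy = torch.argmax(logits).item()         # always the same id
samples = sampler.sample((1000,))            # ids drawn according to probs
counts = torch.bincount(samples, minlength=3)
print(greedy, counts.tolist())
```

The *argmax* always returns the most likely id, whereas the sampled ids cover the whole vocabulary in proportions that roughly follow the predicted probabilities.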

**Q9** Comment on the appearance of the texts generated by the model.

In [None]:
text_generated = start_string + ''.join(text_generated)
print(text_generated)

The headers generated by the model look plausible (see the titles, for example).

**Q10** Use the code below to listen to melodies generated by your LSTM:

In [None]:
from IPython import display as ipythondisplay

# To extract a list of potential songs among text_generated:
generated_songs = mdl.lab1.extract_song_snippet(text_generated)

for i, song in enumerate(generated_songs):
  waveform = mdl.lab1.play_song(song)

  # if play_song worked, display the audio box:
  if waveform:
    print("Generated song", i)
    ipythondisplay.display(waveform)