# Usage of the RNN

## Overview

In the last lecture we introduced the general idea of a recurrent neural network: preserving the previous states of the network (hidden and cell states). We dissected the most basic RNN and the more complex LSTM, which uses *gates* and *a cell state*.

Then we learned how to implement a simple RNN and an LSTM in Python using the machine learning library Keras. Let's recall it:

```python
# example code
from keras.models import Sequential, load_model
from keras.layers import Dense, LSTM, Dropout
from keras import optimizers

batch_size = 64     # number of training examples per batch; the gradient (weight change) is averaged over a batch
timesteps = 50      # length of a training sequence
input_length = 26   # length of one element of the sequence
hid_neurons = 128   # number of hidden neurons in the LSTM
output_size = 3     # output size of the neural network

model = Sequential()
model.add(LSTM(hid_neurons, batch_input_shape=(batch_size, timesteps, input_length)))
model.add(Dense(output_size, activation='softmax'))  # softmax turns the outputs into class probabilities
model.compile(loss='categorical_crossentropy',
              optimizer=optimizers.RMSprop(learning_rate=0.001),  # older Keras versions spell this argument lr
              metrics=['accuracy'])
# categorical_crossentropy is used if the y-data consists of one-hot vectors (1 example - 1 class)
# binary_crossentropy is used for multilabel data (1 example - many classes)
# RMSprop is a popular optimizer, more advanced than basic vanilla gradient descent
# Alternatively, you can use the Adam optimizer by specifying optimizer='adam' or optimizer=optimizers.Adam(learning_rate=learning_rate)

# X_data: array of shape (n_examples, timesteps, input_length)
# Y_data: one-hot array of shape (n_examples, output_size)
model.fit(X_data, Y_data, epochs=50, batch_size=batch_size, shuffle=False)  # training
# WARNING! batch_size must be equal to the batch_size used while constructing the net
```

#### Dictionary

Let us have N different classes, with each training example belonging to exactly one class. In that case the y-data can be represented as a collection of one-hot vectors. The one-hot vector representing class number $i$ is a vector $\mathbf{x}$ of size $N$ whose elements are defined as:
$$x_k=
\begin{cases}
1, & k=i \\
0, & k\neq i
\end{cases}$$
**Example:** There are 4 different classes, and we encode the second one as the one-hot vector (0,1,0,0).
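
The definition can be checked with a minimal sketch in plain Python. The helper names `one_hot` and `decode` are made up for this illustration, and the class index is 0-based here, so the second class has index 1.

```python
def one_hot(class_index, n_classes):
    """Return a one-hot list of length n_classes with a 1 at class_index (0-based)."""
    if not 0 <= class_index < n_classes:
        raise ValueError("class_index must be in the range [0, n_classes)")
    return [1 if k == class_index else 0 for k in range(n_classes)]


def decode(vector):
    """Return the 0-based class index stored in a one-hot vector."""
    return vector.index(1)


second_of_four = one_hot(1, 4)  # the second of 4 classes
print(second_of_four, decode(second_of_four))
```

In practice the same vectors are usually built as rows of a NumPy array, as in the code later in this lecture.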

## Classification

We have also started exploring what an RNN is good at. The first task was classification (predicting a parabola's coefficients from a sequence of y-coordinates). 

By analogy, you can create a dataset (see the table below) of sequences and corresponding one-hot vectors and train an RNN to classify sequences into different categories.

|word or phrase (sequence of letters)|sentiment (positive/negative/neutral)|one-hot vector|
|--------------------------|-----------------|--------------|
|excellent|positive|1 0 0|
|the worst|negative|0 1 0|
|green|neutral|0 0 1|
|astonishing|positive|1 0 0|
|cloudy|neutral|0 0 1|
|disgusting|negative|0 1 0|
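
As a rough sketch of how rows of such a table could be turned into x-data and y-data, the plain-Python snippet below encodes each letter as a one-hot vector and pads short words with all-zero vectors. The alphabet of 26 letters plus a space, the zero padding, `timesteps = 12` and the helper names `encode_word` and `encode_label` are all assumptions made for this illustration.

```python
import string

ALPHABET = string.ascii_lowercase + " "        # 26 letters plus space (assumed alphabet)
LABELS = ["positive", "negative", "neutral"]   # order matches the one-hot vectors in the table


def encode_word(word, timesteps):
    """Encode a word as `timesteps` one-hot vectors, padded at the end with zero vectors."""
    vectors = []
    for ch in word.lower()[:timesteps]:
        vec = [0] * len(ALPHABET)
        vec[ALPHABET.index(ch)] = 1            # raises ValueError for characters outside ALPHABET
        vectors.append(vec)
    while len(vectors) < timesteps:
        vectors.append([0] * len(ALPHABET))
    return vectors


def encode_label(label):
    """Encode a sentiment label as a one-hot vector of length 3."""
    vec = [0] * len(LABELS)
    vec[LABELS.index(label)] = 1
    return vec


table = [("excellent", "positive"), ("the worst", "negative"), ("green", "neutral")]
timesteps = 12
x_data = [encode_word(word, timesteps) for word, _ in table]
y_data = [encode_label(label) for _, label in table]
print(len(x_data), len(x_data[0]), len(x_data[0][0]))  # examples, time steps, vector length
print(y_data)
```

With this encoding the network input has shape (number of examples, `timesteps`, 27), so `input_length` would be 27 rather than 26.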

**Homework**: Create an LSTM neural network that classifies something and train it.

## Generation

We can classify sequences into any classes we want, as long as the classes can be deduced from the available data. Imagine having a very long sequence $\{x_n\}$ of elements (letters, words, notes) and taking a mini-sequence of $t$ elements: $x_{i+1},x_{i+2},...,x_{i+t}$. 

We can "classify" each mini-sequence by the next character $x_{i+t+1}$, which comes right after it. In that case, the x-data is a collection of sequences of elements (each element can be a single number, a one-hot vector or just a vector), and the y-data consists of one-hot vectors that represent the next character after the corresponding mini-sequence. 

Now imagine the following course of actions:
1. Let us have 5 different classes: "A", "B", "C", "D", and "E".
2. Let us have the mini-sequence "A B C".
3. A trained neural network essentially outputs a probability distribution over the next character, which appears as a vector of size 5: (0.01, 0.02, 0.01, 0.90, 0.06). In the general case, let us denote the output by $(y_1,y_2,...,y_n)\equiv\mathbf{y}$. Each number $y_i$ is the probability that the next character belongs to the corresponding class. For example, class "D" has a 90% probability of appearing after the "A B C" sequence.
4. Now we have two possible ways of actually determining the next character. 
    * The first one is picking the class with the highest probability (in this example "D"). 
    * The second one is picking class $i$ with probability proportional to $y_i^{1/\tau}$, where $\tau$ is called the *temperature*. The higher the temperature, the more *diverse* and, at the same time, the more *random* the generated sequence is going to be. As $\tau$ approaches zero, the second way of generation approaches the first one.
5. After we have picked an element, for example "D", the initial mini-sequence becomes "A B C D". Since the neural network in this example takes the last 3 characters as input, the last 3 characters, "B C D", are fed to the neural network, which outputs a probability distribution for the next character. Now we can repeat steps 3-5 to get a sequence of the desired length. Thus, the neural network becomes a **sequence generating machine**.
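
The steps above can be sketched in plain Python. This is only an illustration under stated assumptions: `toy_model` is a hypothetical stand-in for a trained network (it simply favours the letter that follows the last one), and the helper names `apply_temperature`, `pick_next` and `generate` are made up here.

```python
import random

CLASSES = ["A", "B", "C", "D", "E"]


def apply_temperature(probs, tau):
    """Return probabilities proportional to p ** (1 / tau), renormalized to sum to 1."""
    weights = [p ** (1.0 / tau) for p in probs]
    total = sum(weights)
    return [w / total for w in weights]


def pick_next(probs, tau=None, rng=random):
    """Pick a class index: greedily if tau is None, otherwise by sampling at temperature tau."""
    if tau is None:
        return max(range(len(probs)), key=lambda i: probs[i])
    return rng.choices(range(len(probs)), weights=apply_temperature(probs, tau))[0]


def toy_model(window):
    """Hypothetical stand-in for a trained network: favours the letter after the last one."""
    last = CLASSES.index(window[-1])
    probs = [0.025] * len(CLASSES)
    probs[(last + 1) % len(CLASSES)] = 0.9
    return probs


def generate(seed, length, window_size=3, tau=None, rng=random):
    """Repeat steps 3-5: predict, pick a class, append it, slide the window."""
    sequence = list(seed)
    for _ in range(length):
        probs = toy_model(sequence[-window_size:])
        sequence.append(CLASSES[pick_next(probs, tau, rng)])
    return "".join(sequence)


print(generate("ABC", 4))                               # greedy choice
print(generate("ABC", 10, tau=1.0, rng=random.Random(0)))  # sampling at temperature 1
print(apply_temperature([0.01, 0.02, 0.01, 0.90, 0.06], 0.5))
```

Lowering `tau` in `apply_temperature` concentrates the probability on "D", which is why low temperatures behave almost like the greedy choice.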

### Text generation

Text can be represented as a sequence of words or letters. Let's start with generating text letter by letter.

#### Letter-by-letter generation

First, we need to specify a data source. Since we are generating text as a sequence of letters, we have much more freedom in choosing the text. On the other hand, we can't train an LSTM on very long sequences because that is computationally infeasible. As a result, a sequence of 100 letters, which is quite short compared to the whole text, acts as the "memory cap" of the LSTM: it probably forgets all context after 100 symbols. That is why generated text is quite repetitive and doesn't carry much deep meaning. Nonetheless, I have chosen the LaTeX code of one chapter (topology.tex) of the web-based project on algebraic stacks and algebraic geometry (https://github.com/stacks/stacks-project/) to showcase the strength of the LSTM in remembering the high-level syntax of LaTeX code.

Second, we need to preprocess the raw .tex code. This process consists of several steps:
* Loading text

```python
with open('your_directory/your_file.txt') as f:  # the file is closed automatically after reading
    raw_text = f.read()
```

* Creating a character-to-integer dictionary, which assigns exactly one integer to each character. The length of this dictionary equals the number of distinct characters; later on it is called n_vocab.

```python
chars = sorted(list(set(raw_text)))
char_to_int = dict((c, i) for i, c in enumerate(chars)) # the dictionary that converts a character to an integer
int_to_char = dict((i, c) for i, c in enumerate(chars)) # the dictionary that converts an integer to a character
n_vocab = len(chars) #number of distinct characters
```

* Creating training examples, which consist of x and y data, in the following steps:
    1. Taking mini-sequences of letters from the text. Each mini-sequence is shifted from the previous one by one letter. If the text is, for example, "ABCDEF", we get 3 training examples of the form (sequence, next letter): ("ABC","D"); ("BCD","E"); ("CDE","F").
    2. Converting each mini-sequence into a collection of one-hot vectors, one per letter, and converting the next letter into a one-hot vector as well.
    
```python
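import numpy as np

corpus_size = 10000      # number of training sequences in one corpus (chunk)
seq_length = 100         # length of one mini-sequence
n_chars = len(raw_text)  # total number of characters in the text

for i in range((n_chars - seq_length) // corpus_size):
    # one-hot x and y data for the current corpus only, to keep memory use bounded
    dataX = np.zeros((corpus_size, seq_length, n_vocab))
    dataY = np.zeros((corpus_size, n_vocab))
    for k in range(corpus_size):
        start = i * corpus_size + k
        seq = raw_text[start:start + seq_length]
        final_char = raw_text[start + seq_length]
        for j in range(seq_length):
            dataX[k, j, char_to_int[seq[j]]] = 1
        dataY[k, char_to_int[final_char]] = 1

    # train for one epoch on this corpus, then move on to the next one
    model.fit(dataX, dataY, batch_size=50, epochs=1, verbose=True, shuffle=False)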
```

**Warning:** Be careful with the size of the dataX array: it can exceed the memory limit and kill the kernel (even Google Colab's memory isn't enough for the whole text).

Since there isn't enough memory to convert the whole text into x-data at once, we break the data into pieces, each called a corpus. Each corpus consists of 10000 sequences, on which the neural network trains for one epoch. After that, a new corpus is created and the process is repeated.
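
The windowing and the splitting into corpora can be tried on the toy string "ABCDEF" with a small plain-Python sketch. The helper names `make_windows` and `chunked` are made up for this illustration; unlike the training loop above, `chunked` also keeps the last, shorter piece.

```python
def make_windows(text, seq_length):
    """Return (mini-sequence, next character) pairs, shifting by one character each time."""
    return [(text[i:i + seq_length], text[i + seq_length])
            for i in range(len(text) - seq_length)]


def chunked(pairs, corpus_size):
    """Yield consecutive pieces of at most corpus_size pairs."""
    for start in range(0, len(pairs), corpus_size):
        yield pairs[start:start + corpus_size]


pairs = make_windows("ABCDEF", 3)
print(pairs)
for corpus in chunked(pairs, 2):
    print(corpus)  # each piece would be one-hot encoded and trained on separately
```

Only the pairs of one piece need to be expanded into one-hot arrays at a time, which is what keeps the memory use bounded.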

#### Generation of a text

During training, the neural network used a batch size large enough for stable training. During generation, however, we don't need to provide a batch of sequences; we only need to generate the next character from one mini-sequence of the last characters. To deal with this, we create a *generating* neural network with batch size equal to 1 (called `model2` in the code below; it needs the same architecture and the trained weights).
Finally, we can generate texts about algebraic geometry using the following code:
```python
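import numpy as np

temp = 1.5  # INVERSE temperature (1/tau): probabilities are raised to this power
offset = 0
s = raw_text[offset:offset + seq_length]  # starting sequence taken from the original text
for j in range(5000):
    if j % 100 == 0:
        print(j)  # progress indicator
    x = np.zeros((1, seq_length, n_vocab))  # one-hot encoding of the last seq_length characters
    for i in range(seq_length):
        x[0, i, char_to_int[s[-seq_length + i]]] = 1

    # float64 avoids rounding problems when np.random.choice checks that p sums to 1
    y = model2.predict(x)[0].astype(np.float64)
    p = y ** temp / np.sum(y ** temp)
    s += int_to_char[np.random.choice(n_vocab, p=p)]  # choose a character and append it to the generated sequence

print(s)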
```
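
Note that `temp` in the code above is the *inverse* temperature. Writing $\beta=1/\tau$ (the symbol $\beta$ is introduced here only for this remark), character $i$ is drawn with probability
$$p_i=\frac{y_i^{\beta}}{\sum_{j} y_j^{\beta}}=\frac{y_i^{1/\tau}}{\sum_{j} y_j^{1/\tau}},$$
so `temp=1.5` corresponds to $\tau=2/3$, `temp=1` leaves the network's distribution unchanged, and larger values of `temp` push the choice toward the most likely character.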

#### Results

The trained model can be found on [Google Drive](https://drive.google.com/open?id=1q7-IfXGlwKj9qfPSbyieUUT6bQ9FjBEO). 

I have generated several texts with different (inverse) temperatures; they are stored on [github.com](https://github.com/romasoletskyi/Machine-Learning-Course/tree/master/9.%20Usage%20of%20the%20RNN/Latex%20generation).

As expected, a low temperature $\tau=1/2$ creates very repetitive text, enriched with "topological rings". At $\tau=0$, which corresponds to picking the most likely character every time, the text degrades into the infinite loop "of topological ring of topological ring of topological ring...". On the other hand, a high temperature $\tau=1$ creates text with a lot of grammatical errors, which sometimes becomes gibberish.

A temperature of $\tau=2/3$ turns out to be the "golden mean", where the generated text is diverse and grammatically correct at the same time. In a generated sample of 10000 symbols, the LaTeX compiler found only 6 errors. They were all "long-range" errors, which require remembering anything from several paragraphs to the whole text (e.g. not closing long lemmas and proofs with the corresponding commands).

### Music generation

Music generation is pretty similar to letter-by-letter text generation, except that we obtain the data differently. The only major difference is that musical notes are tightly connected with the concept of frequency, a continuous, measurable physical quantity. Based on that, we can encode notes not as one-hot vectors but simply as the index of a class divided by the total number of classes. This makes the neural network smaller and allows us to store all the data in RAM and train the net on a large pool of shuffled sequences.

The data preprocessing code was mainly inspired by [this article](https://towardsdatascience.com/how-to-generate-music-using-a-lstm-neural-network-in-keras-68786834d4c5). We can use the library music21, which must be installed on your local machine or on Google Colab. It helps us parse MIDI files into a sequence of notes and chords, which are then enumerated and normalized by dividing by the total number of different notes/chords.
I used MIDI files of Beethoven's piano compositions (mostly sonatas) and, as a result, got a collection of notes/chords. The data was finally processed using this code:

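```python
import numpy as np

classes_number = 1  # each time step holds one normalized number, so the reshape below requires 1
x_data = []
y_data = []

for i in range(len(data_notes) - seq_length):
    # input: seq_length notes/chords, each encoded as class index / number of classes
    x_data.append([note_to_int[data_notes[i + k]] / n_vocab for k in range(seq_length)])
    # target: one-hot vector of the note/chord that follows
    one_hot = np.zeros(n_vocab)
    one_hot[note_to_int[data_notes[i + seq_length]]] = 1
    y_data.append(one_hot)

x_data = np.array(x_data).reshape((len(x_data), seq_length, classes_number))
y_data = np.array(y_data).reshape((len(y_data), n_vocab))

# drop the tail so that the number of examples is a multiple of batch_size
x_data = np.copy(x_data[:(len(x_data) // batch_size) * batch_size])
y_data = np.copy(y_data[:(len(y_data) // batch_size) * batch_size])
```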
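
To make the normalized encoding concrete, here is a minimal sketch in plain Python on a made-up list of tokens. The token strings (`"C4"`, `"4.7"` and so on) and the tiny `seq_length` are assumptions for illustration, not data from the Beethoven files; only the names `data_notes`, `note_to_int`, `n_vocab` and `seq_length` follow the code above.

```python
# Hypothetical toy data: three note names and one dot-separated chord token.
data_notes = ["C4", "E4", "G4", "C4", "4.7", "E4"]
seq_length = 3

pitch_names = sorted(set(data_notes))                          # distinct notes/chords
note_to_int = {name: i for i, name in enumerate(pitch_names)}  # token -> class index
n_vocab = len(pitch_names)

inputs, targets = [], []
for i in range(len(data_notes) - seq_length):
    window = data_notes[i:i + seq_length]
    inputs.append([note_to_int[n] / n_vocab for n in window])  # one number per time step
    targets.append(note_to_int[data_notes[i + seq_length]])    # class index of the next token

for x, y in zip(inputs, targets):
    print(x, "->", pitch_names[y])
```

Each time step is a single number here, whereas a one-hot encoding would need `n_vocab` numbers per step.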

The neural network was trained on the whole dataset, and after each epoch the model was saved if it was better than the previously saved version:

```python
from keras.callbacks import ModelCheckpoint

filepath = "drive/Colab/Music generator/model-{epoch:02d}-{loss:.4f}-bigger.h5"
checkpoint = ModelCheckpoint(
        filepath,
        monitor='loss',
        verbose=0,
        save_best_only=True,
        mode='min'
)
callbacks_list = [checkpoint]
    
history=model.fit(x_data,y_data,epochs=5,batch_size=128, callbacks=callbacks_list)
```

#### Generation of music

After training, we need to create a generating neural network with batch size equal to one. Generating music is a bit more complicated than generating text; however, the library music21 helps greatly here.

```python
import numpy as np
from music21 import chord, instrument, note, stream

temp = 2  # temperature
ind = int(len(x_data) * np.random.random())  # random training sequence used as the seed
inputs = [float(v) for v in x_data[ind].reshape(-1)]  # normalized values fed to the network
generated = []  # class indices of the generated notes/chords
for i in range(120):  # 120 = number of generated notes/chords
    if i % 50 == 0:
        print(i)
    x = np.array(inputs[-seq_length:]).reshape(1, seq_length, classes_number)
    # float64 avoids rounding problems when np.random.choice checks that p sums to 1
    y = model2.predict(x, batch_size=1)[0].astype(np.float64)
    p = y ** (1 / temp)
    index = int(np.random.choice(n_vocab, p=p / np.sum(p)))
    generated.append(index)
    inputs.append(index / n_vocab)  # feed back in the same normalized form as the training inputs

prediction_output = [int_to_note[x] for x in generated]

offset = 0
output_notes = []
# create note and chord objects based on the values generated by the model
for pattern in prediction_output:
    # pattern is a chord
    if ('.' in pattern) or pattern.isdigit():
        notes_in_chord = pattern.split('.')
        notes = []
        for current_note in notes_in_chord:
            new_note = note.Note(int(current_note))
            new_note.storedInstrument = instrument.Piano()
            notes.append(new_note)
        new_chord = chord.Chord(notes)
        new_chord.offset = offset
        output_notes.append(new_chord)
    # pattern is a note
    else:
        new_note = note.Note(pattern)
        new_note.offset = offset
        new_note.storedInstrument = instrument.Piano()
        output_notes.append(new_note)
    # increase offset each iteration so that notes do not stack
    offset += 0.5

midi_stream = stream.Stream(output_notes)
midi_stream.write('midi', fp='drive/Colab/Music generator/generated_beeth2.mid')
```

Finally, a MIDI file can be converted into an MP3 file using any online converter. You can listen to some generated samples on [github.com](https://github.com/romasoletskyi/Machine-Learning-Course/tree/master/9.%20Usage%20of%20the%20RNN/Music%20generation). The trained model can be found on [Google Drive](https://drive.google.com/open?id=1cbn5yHZfii2PcMMKiz1VSd1ajV0r_5rd).

## Useful links

1. [Letter-by-letter generation Colab notebook](https://drive.google.com/file/d/1hZlg4gy9Hv9Zy57ke7oGmJKx91KGoewL/view?usp=sharing)
2. [Latex texts generation Colab notebook](https://drive.google.com/file/d/1mzoSJ8aEZLlz0ZYuNQNJnihFjFmqygYc/view?usp=sharing)