## Learning to generate lyrics and music with Recurrent Neural Networks (RNNs)

In this post we will train an RNN character-level language model on a lyrics dataset of
the most popular and recent artists. With a trained model, we will sample a couple of
songs that will be a funny mixture of the styles of different artists.
After that, we will update our model to become a Conditional Character-Level RNN,
making it possible for us to sample songs conditioned on an artist.
Finally, we conclude by training our model on a MIDI dataset of piano songs.
While solving all these tasks, we will briefly explore some interesting concepts related to RNN
training and inference, such as Character-Level RNNs, Conditional Character-Level RNNs,
sampling from an RNN, truncated backpropagation through time, and checkpointing.


### Character-Level language model

![Character-level language model](character_level_model.jpg "Character-level language model")

Before choosing a model, let's have a closer look at our task at hand.

Our language model is defined on a character level. We will create a dictionary that contains
all English characters plus some special symbols, such as the period, the comma, and the end-of-line symbol. Each character will be represented as a one-hot-encoded tensor. For more information about character-level models and examples, I recommend [this resource](https://github.com/spro/practical-pytorch). We could have
used a more advanced word-level model, but we will keep it simple for now.
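
As a minimal sketch of the one-hot encoding described above, here is a plain Python version. The exact symbol set (`ALPHABET`) and the helper names (`one_hot`, `encode`) are assumptions made for illustration; the post does not list the precise dictionary, and a real implementation would most likely use tensors from a deep learning library instead of lists.

```python
import string

# Hypothetical dictionary: English letters plus a few special symbols,
# including the end-of-line symbol "\n". The real symbol set may differ.
ALPHABET = string.ascii_letters + " .,;'-\n"
CHAR_TO_INDEX = {ch: i for i, ch in enumerate(ALPHABET)}


def one_hot(ch):
    """Return a vector of zeros with a single 1.0 at the character's index."""
    vec = [0.0] * len(ALPHABET)
    vec[CHAR_TO_INDEX[ch]] = 1.0
    return vec


def encode(text):
    """Encode a string as a list of one-hot vectors, one per character."""
    return [one_hot(ch) for ch in text]


encoded = encode("Hi.\n")
```

Each character becomes a vector whose length equals the dictionary size, so a sequence of `n` characters becomes an `n x len(ALPHABET)` array.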

Now that we have characters, we can form sequences of characters. We can already generate sentences just by
randomly sampling character after character with a fixed probability $p(any~letter)=\frac{1}{dictionary~size}$.
That is the simplest character-level language model. Can we do better than this? Yes, we can compute the probability of occurrence of each letter from our training corpus (the number of times a letter occurs divided by the size of our dataset) and randomly sample letters using these probabilities. This model is better, but it totally ignores the relative position of each letter. For example, pay attention to how you read any word: you start with the first letter, which is usually hard to predict, but as you reach the end of a word you can sometimes guess the next letter. When you read a word, you are implicitly using rules that you learned by reading other texts: for example, with each additional letter that you read from a word, the probability of a space character increases (really long words are rare), or the probability of any consonant after the letter "r" is low, as it is usually followed by a vowel. There are a lot of similar rules, and we hope that our model will be able to learn them from data. To give our model a chance to learn these rules, we need to extend it.
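
The two baseline models from this paragraph can be sketched in a few lines of standard-library Python. The toy corpus and the function names (`uniform_sample`, `unigram_probabilities`, `unigram_sample`) are assumptions for illustration only, not the code used later in the post.

```python
import random
from collections import Counter

# Toy corpus used only for this illustration (not the lyrics dataset).
toy_corpus = "la la la, oh oh\n"
vocab = sorted(set(toy_corpus))
rng = random.Random(0)  # fixed seed so the sketch is reproducible


def uniform_sample(vocab, length, rng):
    """Every character has the same probability 1 / dictionary size."""
    return "".join(rng.choice(vocab) for _ in range(length))


def unigram_probabilities(corpus):
    """Probability of a character = its count divided by the corpus size."""
    counts = Counter(corpus)
    total = len(corpus)
    return {ch: n / total for ch, n in counts.items()}


def unigram_sample(probs, length, rng):
    """Sample characters independently, weighted by their frequencies."""
    chars = list(probs)
    weights = [probs[ch] for ch in chars]
    return "".join(rng.choices(chars, weights=weights, k=length))


unigram_probs = unigram_probabilities(toy_corpus)
uniform_text = uniform_sample(vocab, 20, rng)
unigram_text = unigram_sample(unigram_probs, 20, rng)
```

Both samplers draw every character independently of the ones before it, which is exactly the limitation discussed above.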

Let's make a small, gradual improvement to our model and let the probability of each letter depend
only on the letter that occurred immediately before it (the [Markov assumption](https://en.wikipedia.org/wiki/Markov_property)). So, basically, we will have $p(current~letter|previous~letter)$.
This is a [Markov chain model](https://en.wikipedia.org/wiki/Markov_chain) (also try these [interactive visualizations](http://setosa.io/ev/markov-chains/) if you are not familiar with it). We can also estimate the probability distribution $p(current~letter|previous~letter)$ from our training dataset. This model is limited because, in most cases, the probability of the current letter depends on more than just the previous letter.
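
As a rough sketch of this Markov chain model, we can estimate $p(current~letter|previous~letter)$ by counting adjacent character pairs and then sample one character at a time. The toy corpus and the names `fit_bigram` and `sample_markov` are hypothetical and serve only to illustrate the idea.

```python
import random
from collections import Counter, defaultdict

# Toy corpus used only for this illustration (not the lyrics dataset).
toy_corpus = "la la la, oh oh\n"


def fit_bigram(corpus):
    """Estimate p(current | previous) from counts of adjacent characters."""
    counts = defaultdict(Counter)
    for prev, cur in zip(corpus, corpus[1:]):
        counts[prev][cur] += 1
    transitions = {}
    for prev, counter in counts.items():
        total = sum(counter.values())
        transitions[prev] = {cur: n / total for cur, n in counter.items()}
    return transitions


def sample_markov(transitions, start, max_length, rng):
    """Generate text where each character depends only on the previous one."""
    out = [start]
    while len(out) < max_length:
        next_probs = transitions.get(out[-1])
        if not next_probs:  # no observed successor, so stop early
            break
        chars = list(next_probs)
        weights = [next_probs[ch] for ch in chars]
        out.append(rng.choices(chars, weights=weights, k=1)[0])
    return "".join(out)


transitions = fit_bigram(toy_corpus)
markov_text = sample_markov(transitions, "l", 20, random.Random(0))
```

In this toy corpus, "l" is always followed by "a", so the model assigns that transition a probability of 1. The table has one row per previous character, which hints at why conditioning on longer histories by counting alone quickly becomes impractical.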

What we would actually like to model is $p(current~letter|all~previous~letters)$. At first sight, the task seems intractable as 