# Training an LSTM to Generate Song Lyrics in Python

### By: Ari Silburt

### Date: January 20th, 2017

<div class="col-lg-5 col-sm-6">
<p>
Music has a big influence on people. Whether we are simply listening to music as background noise or blocking out the world to listen to a favourite song, it affects our mood and reflects our values. Lyrics play an essential role in music, encoding the feelings, moods and philosophies of the artist. Although every song is different, overarching themes can be associated with each genre. 
</p>
<p>
As a musician and songwriter myself, I've always struggled to write lyrics, and I admire the lyrical abilities of my favourite artists. This motivated me to try to combine science and art by training a neural network (in Python) capable of generating genre-specific lyrics for each of Country, Electronic Dance Music (EDM), Rock, and Rap. In addition, I'll do a preliminary analysis of each genre, extracting the broad themes that define it.
</p>
</div>

<div>
<img src="lyric_analysis/Pink_floyd.jpg" width="350">
</div>

## Setup
The code used for this project comes from my [Machine Learning Repo](https://github.com/silburt/Machine_Learning/).  

First we need some songs. I picked a few playlists from Spotify: 
- <i>Country Top Songs of 2000-2017</i> by sjwheat (1567 songs)
- <i>Best EDM Playlist on the Planet!</i> by Matt Hoppe (1592 songs)
- <i>Rock -70's, 80's, 90's and 00's</i> by Lucas Madeira (1635 songs)
- <i>BEST of RAP</i> by Andy Mathers (2128 songs)

and converted these playlists to csvs using [this tool](https://github.com/watsonbox/exportify), with each csv row containing the artist, song, album, etc. for each track. Then, I scraped each lyric from the genius.com API using standard Python packages like `BeautifulSoup` and `requests`. A good fraction of songs (20-40% per genre) weren't recognized in the genius.com API due to slightly different song/artist/mix version details between Spotify and genius.com (this is probably a separate machine learning project in itself to recognize near-identical song/artist names!). I also removed songs that had fewer than 15 words (i.e. instrumentals). In total, 1000 songs were used from each genre for everything that follows.

#### Preprocessing
Before training the LSTM, some basic preprocessing was required. Specifically:
- All lyrics are converted to lower case.
- Accented characters are converted to normal characters.
- Punctuation is removed ('!', '?', '.').
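
The three steps above can be sketched with the standard library alone. This is a minimal illustration, not the code from the repo: the function name `preprocess_lyrics` is hypothetical, and I'm assuming that only the three punctuation marks listed above are stripped.

```python
import unicodedata

def preprocess_lyrics(text, punctuation="!?."):
    """Lower-case, strip accents, and drop the listed punctuation marks."""
    text = text.lower()
    # NFKD splits an accented character into a base character plus a
    # combining mark, so dropping the combining marks removes the accent.
    decomposed = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Assumption: only the marks in `punctuation` are removed.
    return text.translate(str.maketrans("", "", punctuation))

print(preprocess_lyrics("Déjà Vu! Où es-tu?"))  # deja vu ou es-tu
```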

## Unique Words
First, I thought I'd present the broad themes of each genre by showing the words from each genre with disproportionately high or low ranks (compared to all the other genres). By rank, I essentially mean popularity. So for example, '_the_' is the most popular word in country music and has a rank of 1, '_i_' is the second most popular word and has a rank of 2, etc.

The left/right tables below show words with at least a 1.75x higher/lower rank in a given genre. So, for example, the rank of the word "_we_" in each genre is: EDM=5, Rap=18, Rock=28, Country=15, Pop=22, and hence it is disproportionately more common in the EDM genre compared to all others.
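
To make the ranking comparison concrete, here is a rough sketch of how it could be computed. The helper names (`word_ranks`, `disproportionately_high`) are hypothetical, and I'm assuming that a word must beat *every* other genre by the 1.75x factor and that words missing from another genre are skipped; the actual analysis may treat those cases differently.

```python
from collections import Counter

def word_ranks(lyrics):
    """Map each word to its popularity rank (1 = most common)."""
    counts = Counter(lyrics.split())
    return {word: rank
            for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def disproportionately_high(ranks_by_genre, genre, factor=1.75):
    """Words whose rank in `genre` beats every other genre's rank by `factor`."""
    others = [r for g, r in ranks_by_genre.items() if g != genre]
    result = []
    for word, rank in ranks_by_genre[genre].items():
        # Assumption: a word absent from any other genre is skipped.
        if all(word in r and r[word] >= factor * rank for r in others):
            result.append(word)
    return result

# Ranks of "we" quoted in the text above.
ranks = {
    "EDM": {"we": 5}, "Rap": {"we": 18}, "Rock": {"we": 28},
    "Country": {"we": 15}, "Pop": {"we": 22},
}
print(disproportionately_high(ranks, "EDM"))  # ['we']
```

The low-rank table would follow from the mirrored condition (the word's rank in the given genre is at least 1.75x larger than in every other genre).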

<div class="col-lg-6 col-sm-7">
<br>
<center> __Disproportionately high rank for given genre__ </center>

| EDM | Rap | Rock | Country |  
| ------------- |:-------------:| ------------- |:-------------:|
| we    | get       | find   | she('s) |
| oh    | she       | soul   | little  | 
| we're | ni\*\*a(s)|        | baby    | 
| feel  | they      |        | her   | 
| let   | bitch(es) |        | girl   | 
| our   | fuck      |        | good   | 
| into  | shit      |        | old   | 
| light | money     |        | song  | 
| hands | ass       |        | kiss | 
| low   | hit       |        | town | 
| jump  | games     |        | road |

</div>
<div>
<br>
<center> __Disproportionately low rank for given genre__ </center>

| EDM | Rap | Rock | Country |
| ------------- |:-------------:| ------------- |:-------------:|
| she('s) | love    | money |  |
| ain't   | we're   |       |  |
| her     | you're  |       |  |
| old     | our     |       |  |
| man     | gonna   |       |  |
| he      | away    |       |  |
| his     | heart   |       |  |
| em      | tonight |       |  |
|         | feeling |       |  |
|         | night   |       |  |
|         | eyes    |       |  |
</div>
Some interesting trends:
- __EDM__: Seems to focus on togetherness (we, we're, our), feelings, and drama (light, jump). It also appears very gender neutral, with almost every gendered (pro)noun like she/he, her/his, man disproportionately rare.
- __Rap__: Probably unsurprisingly, it's focused on women, money, and sex. I suppose J. Cole speaks the truth when he says (in the song G.O.M.D.), "_It's called love, Ni\*\*as don't sing about it no more_". "Love" seems to be part of a broader trend too, with other intimate words like "our", "heart", "feeling" and "eyes" being disproportionately rare.
- __Rock__: Seems to be pro-spiritual, with the only two words being "find" and "soul", and anti-capitalism with the only rare word being "money". Compared to the other genres though, rock has very few unique words.
- __Country__: Seems to be really focused on women (she's, her, girl), the simple things in life (old, song, kiss, town, road), and also <i>really</i> likes to miniaturize things - "little baby", "little thing", "little kisses", etc. 

## Train a Neural Network to Generate New Lyrics

### Brief Intro
Text prediction is a really hot area of machine learning at the moment. Here we will train a [Long Short Term Memory (LSTM) network](http://colah.github.io/posts/2015-08-Understanding-LSTMs/) on each musical genre to see if it can learn to generate new lyrics with properties similar to those of the training set. LSTMs are a special type of [recurrent neural network (RNN)](http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/), and like all RNNs they are designed to model sequential data such as stock prices or text. Unlike standard RNNs, however, they are particularly adept at learning long-term dependencies. For example, suppose we are reading a story where the first paragraph establishes that the setting is a baseball field, and we ask a trained RNN to complete the following sentence from the third paragraph:
<center>_After a long look at the concession stand, Billy threw the _____. </center>
A traditional RNN struggles to remember far enough back to know that the setting is a baseball field, and thus is unlikely to make a reasonable prediction like "ball". Instead, it is more likely to rely on local information. In this case, since "concession stand" is in the same sentence, it might output something like "chocolate bar". An LSTM, however, is capable of remembering such long-term dependencies. The key feature of LSTMs is a cell state that is only slightly modified at each timestep, whereas the hidden state of a standard RNN is completely transformed at each timestep. This allows information from many timesteps ago to stick around if the network deems it important. Andrej Karpathy does a really nice job of explaining this point [here](https://youtu.be/iX5V1WpxxkY?t=54m45s).
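
For readers who like equations, the contrast can be written out using the textbook formulation (the notation here is my own assumption and nothing in it is specific to this project). A plain RNN overwrites its hidden state at every timestep, while an LSTM updates its cell state additively through gates:

$$
\begin{aligned}
\text{RNN:}\quad & h_t = \tanh(W_h h_{t-1} + W_x x_t + b) \\
\text{LSTM:}\quad & c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t, \qquad h_t = o_t \odot \tanh(c_t)
\end{aligned}
$$

Here $f_t$, $i_t$ and $o_t$ are the forget, input and output gates (each with values between 0 and 1), $\tilde{c}_t$ is the candidate update, and $\odot$ is element-wise multiplication. When $f_t \approx 1$ and $i_t \approx 0$ we get $c_t \approx c_{t-1}$, which is how information can survive across many timesteps.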

<br>
For the task of generating new lyrics, given an input sequence of text:  
<center>_You think I'm pretty_ </center>
we would like the LSTM to be able to finish the sentence with something reasonable, e.g.:
<center>_without any makeup on_ </center>
It's worth noting that since the LSTM generates output predictions probabilistically, running the same input sequence through the LSTM again can produce a different output, e.g.:
<center>_sweet like ice cream_ </center>
Given a training set, the LSTM learns the correlations between different letters/words. Given an input sequence, it then samples a likely continuation based on those learned correlations.
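
The probabilistic part can be illustrated with a tiny sketch. The vocabulary and probabilities below are made up for illustration, and `sample_next` is a hypothetical helper rather than anything from the repo; in practice the probabilities would come from the network's softmax output.

```python
import random

def sample_next(vocab, probs, rng=random):
    """Draw one character according to the predicted probabilities."""
    return rng.choices(vocab, weights=probs, k=1)[0]

# Hypothetical 4-character vocabulary and a made-up network output.
vocab = ["a", "b", "c", " "]
probs = [0.1, 0.6, 0.2, 0.1]

# Two runs over the same distribution will usually give different strings.
print("".join(sample_next(vocab, probs) for _ in range(10)))
print("".join(sample_next(vocab, probs) for _ in range(10)))
```

Always picking the most probable character instead would make the output deterministic, and the same input would then always produce the same lyric.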

### Word vs. Character Level Predictions

It's also important to note that I am training/generating predictions at the _character-level_. That is, the LSTM is learning/making predictions on a character-by-character basis, vs. predicting on a word-by-word basis. This choice stems from the empirical fact that I tried both and had way more success with the character level. [Others](https://groups.google.com/d/msg/keras-users/Y_FG_YEkjXs/TD_j9b532kwJ) seem to have had similar experiences. Overall the word-prediction is a more difficult problem, since:
- Due to the vast number of different words, the LSTM has to learn orders of magnitude more correlations than the character-level.
- Using words as data points instead of characters reduces the number of training examples by an order of magnitude. The number of distinct words can be reduced by removing rare words, but then other complexities arise (and it doesn't improve things that much).
- [One-hot encoding](https://hackernoon.com/what-is-one-hot-encoding-why-and-when-do-you-have-to-use-it-e3c6186d008f) thousands of words becomes intractable, and you run out of memory extremely quickly. You usually have to implement an [embedding layer](https://towardsdatascience.com/deep-learning-4-embedding-layers-f9a02d55ac12) to overcome this, which adds another level of complexity (and this is already a non-trivial problem).

In addition, word-level predictions appear to be less flexible. Characters can be combined (in principle) to produce any word in the dictionary, but in contrast words can't really be combined to form more complex words (maybe hyphenated words if you want to get technical). Thus, for word-level predictions it would be impossible to produce a new word not in your training set, severely limiting the diversity of possible output sequences.

### Training
The training data is constructed by taking the 1000 songs for each genre and chopping up the lyrics into input and output sequences of length $L$, with the output sequences offset from the input by one. In total this produces about a million examples (give or take, depending on $L$ and the genre). An example input/output training pair is:  
```
 Input = [y, o, u, \, t, h, i, n, k, \, i, ', m, \, p, r, e, t, t]
Output = [o, u, \, t, h, i, n, k, \, i, ', m, \, p, r, e, t, t, y]
```
and the LSTM tries to predict each `Output[i]` from the input characters up to and including `Input[i]`. This is a "[many-to-many](https://i.stack.imgur.com/b4sus.jpg)" framework, and I found this _way_ more effective than a many-to-one framework (i.e. trying to predict only the final letter `y` based on the entire `Input` sequence above). This makes sense, as each training example now provides $L$ testable predictions instead of just one, giving more information during training.
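
As a rough sketch of how such pairs can be built: the function name `make_pairs` and the `step` (stride) parameter are my own assumptions here, since the stride between consecutive examples isn't stated above.

```python
def make_pairs(text, L, step=1):
    """Slice text into (input, output) pairs, output shifted by one character."""
    pairs = []
    for start in range(0, len(text) - L, step):
        inp = text[start:start + L]
        out = text[start + 1:start + L + 1]
        pairs.append((inp, out))
    return pairs

pairs = make_pairs("you think i'm pretty", L=19)
print(pairs[0])  # ("you think i'm prett", "ou think i'm pretty")
```

With `L=19` this reproduces the example pair shown above; a smaller `L` on the same text yields several overlapping pairs.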

Each character is then mapped to a [one-hot](https://hackernoon.com/what-is-one-hot-encoding-why-and-when-do-you-have-to-use-it-e3c6186d008f) encoding so that it can be used by the network. That is, each unique character is converted to a unique array with all 0s except one element equal to 1. E.g.:
```
'a'  -> [1,0,0,...,0]  
'b'  -> [0,1,0,...,0]  
...  
'\n' -> [0,0,0,...,1]  
```
In total, 45 different one-hot encodings were needed to account for the 26 letters of the alphabet, 10 digits (0-9), and about 10 extra characters like '?', '!', '"', etc. To save memory (one-hot encodings get memory-intensive fast), I built a generator that feeds the network and converts the characters to one-hot encodings on the fly.
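
A generator along these lines might look like the sketch below. This is not the repo's implementation: `one_hot_batches`, `char_to_idx` and the toy data are all hypothetical, and I'm assuming the many-to-many setup where both inputs and targets are one-hot encoded with shape `(batch_size, L, vocab_size)`.

```python
import numpy as np

def one_hot_batches(pairs, char_to_idx, batch_size):
    """Yield (X, y) batches of one-hot arrays, encoding characters on the fly."""
    vocab_size = len(char_to_idx)
    L = len(pairs[0][0])
    while True:  # loops forever; the caller decides how many batches to draw
        for start in range(0, len(pairs) - batch_size + 1, batch_size):
            batch = pairs[start:start + batch_size]
            X = np.zeros((len(batch), L, vocab_size), dtype=np.float32)
            y = np.zeros((len(batch), L, vocab_size), dtype=np.float32)
            for n, (inp, out) in enumerate(batch):
                for t in range(L):
                    X[n, t, char_to_idx[inp[t]]] = 1.0
                    y[n, t, char_to_idx[out[t]]] = 1.0
            yield X, y

# Toy usage with a hypothetical three-character vocabulary.
char_to_idx = {"a": 0, "b": 1, "c": 2}
pairs = [("ab", "bc"), ("bc", "ca")]
X, y = next(one_hot_batches(pairs, char_to_idx, batch_size=2))
print(X.shape, y.shape)  # (2, 2, 3) (2, 2, 3)
```

Only one batch of one-hot arrays lives in memory at a time, while the full dataset stays as plain strings.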

The network itself is built in [Keras](https://keras.io) and uses [GRU](https://arxiv.org/pdf/1406.1078.pdf) layers, a slightly simpler (and some say better) variant of the standard LSTM. The network is incredibly easy to build in Keras:
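```Python
from keras.models import Sequential
from keras.layers import GRU, Dense, TimeDistributed
from keras.optimizers import RMSprop

model = Sequential()
# First recurrent layer: each example is L one-hot encoded characters.
model.add(GRU(lstm_size, dropout=drop, recurrent_dropout=drop, return_sequences=True,
              input_shape=(L, vocab_size)))
# Remaining recurrent layers; return_sequences=True keeps one output per timestep.
for i in range(depth - 1):
    model.add(GRU(lstm_size, dropout=drop, recurrent_dropout=drop, return_sequences=True))
# Softmax over the vocabulary at every timestep (many-to-many).
model.add(TimeDistributed(Dense(vocab_size, activation='softmax')))

# RMSprop with learning-rate decay (`lr` and `decay` are the Keras 2 argument names).
decay = lr/epochs
optimizer = RMSprop(lr=lr, decay=decay)
model.compile(loss='categorical_crossentropy', optimizer=optimizer)
```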
where:  
- `vocab_size` is the number of unique characters (45 in our case).
- `L` is the length of the input and output sequences, $L$.
- `lstm_size` is the number of units in each recurrent (GRU) layer.
- `drop` is the [dropout](https://www.cs.toronto.edu/~hinton/absps/JMLRdropout.pdf) fraction.
- `depth` is the number of stacked recurrent layers.
- `lr` is the learning rate.

With the exception of `vocab_size`, all the above variables are tunable hyperparameters. Another key tunable hyperparameter is `batch_size`, which determines how many examples are seen per gradient update. The data were split 80/20 into training and validation sets, and these hyperparameters were tuned for 50 epochs over the following ranges:
```

```

### Results

