# Training an LSTM to Generate Song Lyrics in Python

### By: Ari Silburt

### Date: January 20th, 2017

<div class="col-lg-6 col-sm-7">
<p>
Music has a big influence on people. Whether we are simply listening to music as background noise or blocking out the world to listen to a favourite song, it affects our mood and reflects our values. Lyrics play an essential role in music, encoding the feelings, moods and philosophies of the artist. Although every song is different, overarching themes are associated with each genre. 
</p>
<p>
In this blog post I will train a neural network in Python to generate lyrics for five genres: Country, Electronic Dance Music (EDM), Rock, Pop and Rap. I will also extract the broad themes that define each genre.
</p>
</div>

<div>
<img src="lyric_analysis/Pink_floyd.jpg" width="350">
</div>

## Setup
The code used for this project comes from my [Machine Learning Repo](https://github.com/silburt/Machine_Learning/).  

First we need some songs. I picked a few playlists from Spotify: 
- <i>Country Top Songs of 2000-2017</i> by sjwheat (1567 songs)
- <i>Best EDM Playlist on the Planet!</i> by Matt Hoppe (1592 songs)
- <i>Greatest Rock Songs Ever</i> by Max John Maybury (1049 songs)
- <i>Super Top Hits 2000-2016</i> by Michele Insalata (1603 songs)
- <i>BEST of RAP</i> by Andy Mathers (2128 songs)

and converted these playlists to CSVs using [Exportify](https://github.com/watsonbox/exportify), with each CSV row containing the artist, song, album, etc. for each entry. Then, I scraped each lyric from the genius.com API using standard Python packages like `BeautifulSoup` and `requests`. A good fraction of songs (20-40% per genre) weren't recognized by the genius.com API because of slight differences in song/artist/mix version details between Spotify and genius.com. I also removed songs that had fewer than 15 words (i.e. instrumentals). In total, 1000 songs from each genre were used for the following analysis.

In addition, some basic preprocessing was done. Specifically:
- All lyrics are converted to lower case.
- Accented characters are converted to normal characters.
- Punctuation is removed ('!', '?', '.').
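
As a rough illustration, the three steps above could look something like the sketch below. The function name `preprocess` and the exact punctuation set are assumptions based on the examples listed, not the code from the repo.

```python
import unicodedata

# Assumed punctuation set: only the three characters named in the list above.
PUNCTUATION = "!?."

def preprocess(lyrics: str) -> str:
    """Lower-case, strip accents, and drop punctuation from a lyric string."""
    text = lyrics.lower()
    # NFKD splits an accented character into a base character plus a
    # combining mark; dropping the combining marks leaves the plain letter.
    decomposed = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Delete every character in PUNCTUATION.
    return text.translate(str.maketrans("", "", PUNCTUATION))

print(preprocess("Beyoncé says: Déjà Vu!"))  # beyonce says: deja vu
```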

## Unique Words
Before jumping into machine learning, I thought I'd first present the broad themes and emotions of each genre by showing words from each genre with a disproportionately high or low rank (compared to the other genres). By rank, I essentially mean popularity. So for example, '_the_' is the most popular word in country music and has a rank of 1, '_i_' is the second most popular word and has a rank of 2, etc.


The left/right tables below show words with at least a 1.5x higher/lower rank in a given genre than in every other genre. So, for example, the rank of the word "_we_" in each genre is: EDM=5, Rap=18, Rock=28, Country=15, Pop=22, and hence it is disproportionately more common in the EDM genre compared to all others.
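
For the curious, here is a minimal sketch of how such a comparison might be computed. The helper names are hypothetical, and I'm assuming that "1.5x higher rank" means the rank number in every other genre is at least 1.5 times larger, and that a word missing from another genre is simply skipped.

```python
from collections import Counter

def word_ranks(lyrics):
    """Map each word to its popularity rank (1 = most common)."""
    counts = Counter(word for song in lyrics for word in song.split())
    return {word: rank
            for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def disproportionately_high(ranks_by_genre, genre, factor=1.5):
    """Words ranked at least `factor` times better in `genre` than in all others."""
    own = ranks_by_genre[genre]
    others = [ranks for g, ranks in ranks_by_genre.items() if g != genre]
    # Assumption: a word must appear in every other genre to be compared.
    return [word for word, rank in own.items()
            if all(word in r and r[word] >= factor * rank for r in others)]

# The ranks of "we" quoted in the text above.
ranks = {"EDM": {"we": 5}, "Rap": {"we": 18}, "Rock": {"we": 28},
         "Country": {"we": 15}, "Pop": {"we": 22}}
print(disproportionately_high(ranks, "EDM"))      # ['we']
print(disproportionately_high(ranks, "Country"))  # []
```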

<div class="col-lg-6 col-sm-7">
<br>
<center> __Disproportionately high rank for given genre__ </center>

| EDM | Rap | Rock | Country | Pop | 
| ------------- |:-------------:| ------------- |:-------------:| ------------- |
| we    | got       | well   | little | |
| feel  | she       | woman  | every  | |
| we're | ni\*\*a(s)| soul   | old    | |
| our   | they      | people | song   | |
| us    | bitch(es) |        | kiss   | |
| into  | fuck      |        | town   | |
| again | shit      |        | road   | |
| light | money     |        | those  | |
| fire  | i'mma     |        | | |
| fall  | hit       |        | | |
| jump  | pussy     |        | | |

</div>
<div>
<br>
<center> __Disproportionately low rank for given genre__ </center>

| EDM | Rap | Rock | Country | Pop | 
| ------------- |:-------------:| ------------- |:-------------:| ------------- |
| she   | love   | this   | | |
| ain't | we're  | still | | |
| her   | you're | even | | |
| girl  | our    | | | |
| man   | gonna  |      | | |
| he    | away   |      | | |
| bad   | heart  |      | | |
| she's | were   |      | | |
| his   | world  |      | | |
| crazy | light  |      | | |
| him   | eyes   |      | | |
</div>

Some interesting trends:
- __EDM__: Seems to focus on togetherness (we, our, us), feelings, and drama (fire, light, fall, jump). It also appears very gender neutral, with almost every gendered (pro)noun like she/he, her/him, girl/man disproportionately rare.
- __Rap__: Probably unsurprisingly, it's focused on women, money, and sex. I suppose J. Cole speaks the truth when he says (in the song G.O.M.D.), "_It's called love, Ni\*\*as don't sing about it no more_". Love seems to be part of a broader trend too, with other intimate words like "heart", "our", "light", "eyes" being disproportionately rare.
- __Rock__: Seems to retain some overarching 60s and 70s themes, with popular words like "woman", "soul" and "people". Not sure what to make of the disproportionately rare words... 
- __Country__: Seems to be focused on the simple things in life (old, song, kiss, road, town), and also <i>really</i> likes to miniaturize things - "little baby", "little thing", "little kisses", etc., with "little" being the top unique word.
- __Pop__: Pop has no disproportionately popular or rare words, which is unsurprising given that pop music draws from all the other genres. Country, rock, rap and EDM songs all regularly make it into the Top 40.

EDM and rap seem to be the two most distinct genres, having the largest number of disproportionately popular/rare words (there are even more disproportionately rare words for both genres not shown here).

## Train a Neural Network to Generate New Lyrics
Text prediction is a really interesting area of Machine Learning at the moment, and one thing we can try is to train a neural network to generate song lyrics. We will be training a [Long Short-Term Memory (LSTM) network](http://colah.github.io/posts/2015-08-Understanding-LSTMs/) on each genre to see if it can learn to generate new lyrics with properties similar to those of the training set. Specifically, given an input sequence of text:  
<center>_You think I'm pretty_ </center>
we would like the LSTM to be able to finish the sentence with something reasonable, e.g.:
<center>_without any makeup on_ </center>
Since the LSTM generates its output predictions probabilistically, running the same input sequence through the LSTM again can produce a different output, e.g.:
<center>_sweet like ice cream_ </center>
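
To make the "probabilistic" part concrete, here is a toy sketch of sampling the next character from a predicted distribution. The three-character vocabulary and the probabilities are made up for illustration; in the real network they would come from the softmax output.

```python
import random

def sample_next_char(probs, chars, rng):
    """Draw one character according to the predicted probabilities."""
    return rng.choices(chars, weights=probs, k=1)[0]

chars = ["a", "b", "c"]
probs = [0.7, 0.2, 0.1]   # made-up stand-in for a softmax output
rng = random.Random(0)    # seeded only so the sketch is repeatable

# Repeated draws from the same distribution give different characters,
# which is why the same seed text can be completed in different ways.
samples = [sample_next_char(probs, chars, rng) for _ in range(10)]
print("".join(samples))
```

Always picking the most likely character instead would make the output deterministic, so the same input would always produce the same lyrics.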

### Training
The training data is constructed by taking the 1000 songs for each genre and chopping up the lyrics into input and output sequences of length $L$, with the output sequences offset from the input by one. In total this produces about a million examples (give or take, depending on $L$ and the genre). An example input/output training pair is:  
```
 Input = [y, o, u, \, t, h, i, n, k, \, i, ', m, \, p, r, e, t, t]
Output = [o, u, \, t, h, i, n, k, \, i, ', m, \, p, r, e, t, t, y]
```
and the LSTM is trying to predict each `Output[i]` from `Input[i]` and the characters that precede it. This is a "[many-to-many](https://i.stack.imgur.com/b4sus.jpg)" framework, and I found this _way_ more effective than a many-to-one framework (i.e. trying to predict only the final letter `y` based on the entire `Input` sequence above). This makes sense, as each training example now has $L$ testable predictions instead of just one, providing more information during training. 
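
A minimal sketch of how such offset-by-one pairs could be built is shown below. The function name `make_pairs` and the `step` (stride) argument are assumptions for illustration; the windows here simply slide one character at a time.

```python
def make_pairs(text, L, step=1):
    """Slice `text` into (input, output) windows of length L, offset by one."""
    pairs = []
    for start in range(0, len(text) - L, step):
        inp = text[start:start + L]
        out = text[start + 1:start + L + 1]   # same window, shifted by one
        pairs.append((inp, out))
    return pairs

pairs = make_pairs("you think i'm pretty", L=19)
print(pairs)  # [("you think i'm prett", "ou think i'm pretty")]
```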

Each character is then converted to a [one-hot](https://hackernoon.com/what-is-one-hot-encoding-why-and-when-do-you-have-to-use-it-e3c6186d008f) encoding so that it can be used by the network. That is, each unique character is converted to an array of all 0s except for a single element, which is 1. E.g.:
```
'a'  -> [1,0,0,...,0]  
'b'  -> [0,1,0,...,0]  
...  
'\n' -> [0,0,0,...,1]  
```
In total 45 different one-hot encodings were needed to account for the 26 letters of the alphabet, 10 numbers (0-9), and about 10 extra characters like '?', '!', '"', etc. To save memory (one-hot encodings get memory-intensive fast), I added a generator to the network which converts the characters to one-hot encodings on the fly.
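
The on-the-fly encoding could be sketched roughly as follows. This is not the generator from the repo: the name `one_hot_batches`, the `char_to_idx` lookup and the tiny five-character vocabulary are all assumptions for illustration.

```python
import numpy as np

def one_hot_batches(pairs, char_to_idx, batch_size):
    """Yield (X, y) batches of shape (batch, L, vocab_size), encoded on the fly."""
    vocab_size = len(char_to_idx)
    L = len(pairs[0][0])
    while True:  # loop forever so the generator can be reused across epochs
        for start in range(0, len(pairs), batch_size):
            batch = pairs[start:start + batch_size]
            X = np.zeros((len(batch), L, vocab_size), dtype=np.float32)
            y = np.zeros_like(X)
            for n, (inp, out) in enumerate(batch):
                for t in range(L):
                    X[n, t, char_to_idx[inp[t]]] = 1.0
                    y[n, t, char_to_idx[out[t]]] = 1.0
            yield X, y

pairs = [("abc", "bcd"), ("bcd", "cde")]
char_to_idx = {ch: i for i, ch in enumerate("abcde")}
X, y = next(one_hot_batches(pairs, char_to_idx, batch_size=2))
print(X.shape)  # (2, 3, 5)
```

Only one batch of one-hot arrays lives in memory at a time, while the full dataset stays as plain strings.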

The network itself is built in [Keras](https://keras.io), and uses a [GRU](https://arxiv.org/pdf/1406.1078.pdf), which is a slightly simpler variant of the LSTM. The network is incredibly easy to build in Keras:
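```Python
from keras.models import Sequential
from keras.layers import GRU, Dense, TimeDistributed
from keras.optimizers import RMSprop

model = Sequential()
# First recurrent layer; return_sequences=True emits an output at every timestep.
model.add(GRU(lstm_size, dropout=drop, recurrent_dropout=drop, return_sequences=True,
              input_shape=(L, vocab_size)))
# Stack the remaining recurrent layers.
for i in range(depth - 1):
    model.add(GRU(lstm_size, dropout=drop, recurrent_dropout=drop, return_sequences=True))
# Softmax over the vocabulary, applied independently at each timestep.
model.add(TimeDistributed(Dense(vocab_size, activation='softmax')))

decay = lr / epochs
optimizer = RMSprop(lr=lr, decay=decay)
model.compile(loss='categorical_crossentropy', optimizer=optimizer)
```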
where:  
- `vocab_size` is the number of unique characters.
- `L` is the length of the input and output sequences, $L$.
- `lstm_size` is the number of units in each recurrent layer.
- `drop` is the [dropout](https://www.cs.toronto.edu/~hinton/absps/JMLRdropout.pdf) fraction. 
- `depth` is the number of stacked recurrent layers. 
- `lr` is the learning rate.
- `epochs` is the number of training epochs.

With the exception of `vocab_size`, all the above variables are tunable hyperparameters. Another key tunable hyperparameter is `batch_size`, which determines how many examples are seen per gradient update. The data were split 80/20 into training and validation sets, and these hyperparameters were tuned for 50 epochs over the following ranges:
```

```

### Results

### Word vs. Character-level LSTMs
Although I tried training an LSTM at both the character and word level, I had way more success at the character level. [Others](https://groups.google.com/d/msg/keras-users/Y_FG_YEkjXs/TD_j9b532kwJ) seem to have had similar experiences. Overall, word prediction is a more difficult problem, since:
- Due to the vast number of different words, the LSTM has to learn orders of magnitude more correlations than at the character level, with orders of magnitude fewer examples of each. The vocabulary size can be reduced by removing rare words, but then other complexities arise and it doesn't improve things much.

In addition, word-level LSTMs appear to be less flexible. In principle, characters can be combined to produce any word in the dictionary, but words can only be combined to form sentences. With a word-level LSTM, it's impossible to produce a new word that isn't in your training set, severely limiting the space of possible output sequences.
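
A toy illustration of this difference, using two made-up lines of text rather than the real lyric data:

```python
train = "little town little road"
test = "little rain road"

# Vocabularies a model would see at each level.
char_vocab = set(train)
word_vocab = set(train.split())

# Anything in `test` that the training vocabulary cannot express.
unseen_chars = set(test) - char_vocab
unseen_words = set(test.split()) - word_vocab

print(unseen_chars)  # set()
print(unseen_words)  # {'rain'}
```

Every character of "rain" appears in the training text, so a character-level model could in principle spell it out, whereas a word-level model has no output slot for it.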