# LSTMs and Text Generation
Roy Adams

## Vision

The goal of this project was to use an LSTM to generate lyrics based on Pink Floyd songs. Apart from the learning aspects of this project, I hoped to create a system that could help songwriters find word combinations they had not previously considered. Pink Floyd's lyrics have always been a little strange, so I was curious to see what a neural network would do with them.  


## Background

### Text Generation With LSTMs

I decided to use a recurrent neural network (RNN). One of the most common RNN variants is the LSTM (Long Short-Term Memory). In contrast to feed-forward networks, RNNs take their own previous outputs as input in addition to the current example. This lets them keep track of important features that have already passed through the network. RNNs are also trained with a different form of backpropagation, called backpropagation through time, which unrolls the network across time steps and propagates the error back through each of them.  
The biggest issue with plain RNNs is the vanishing and exploding gradient problem. Because the time steps are tied together through repeated multiplication, the gradient can shrink toward zero when the weights are too small or grow without bound when they are too large.  
LSTMs mitigate this problem. An LSTM contains a gated cell that sits outside the normal flow of the network. The gates control what enters the cell by learning which pieces of information are important to remember (Skymind).
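
As a rough numerical illustration of the vanishing and exploding gradient problem, the sketch below assumes a toy scalar RNN with a linear activation, so that backpropagating through each time step multiplies the gradient by the same recurrent weight. Real networks multiply by matrices rather than a single number, and the function name `gradient_scale` and the weights 0.9 and 1.1 are illustrative choices, not part of the project code.

```python
def gradient_scale(recurrent_weight, time_steps):
    """Factor by which a gradient is scaled after flowing back through
    `time_steps` steps of a toy scalar RNN with a linear activation."""
    scale = 1.0
    for _ in range(time_steps):
        scale *= recurrent_weight  # one multiplication per time step
    return scale


# 100 time steps matches the seq_length used later in this notebook
for w in (0.9, 1.1):
    print(w, gradient_scale(w, 100))
```

With these values the factor falls below 1e-4 for a weight of 0.9 and rises above 1e4 for a weight of 1.1, even though both weights are close to 1.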

## Implementation

The code for this project is heavily based on the blog/tutorial from https://machinelearningmastery.com/text-generation-lstm-recurrent-neural-networks-python-keras/. That tutorial uses the text of Alice in Wonderland as its data.

In [2]:
# Load LSTM network and generate text
import sys
import numpy
import pandas as pd
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Dropout
from keras.layers import LSTM
from keras.layers import Flatten
from keras.callbacks import ModelCheckpoint
from keras.utils import np_utils

After loading the necessary libraries, we need to load and format the data. The dataset I used comes from https://www.kaggle.com/mousehead/songlyrics, which contains 57650 songs. I format the data by lowercasing it and stripping literal "\n" sequences and ellipses. I then combine all the Pink Floyd lyrics into one long string.

In [39]:
# load ascii text and convert to lowercase
filename = "data\\songdata.csv"
df_songs = pd.read_csv(filename)
df_pinkfloyd = df_songs.loc[df_songs['artist'] == 'Pink Floyd']
df_pinkfloyd = df_pinkfloyd.apply(lambda x: x.astype(str).str.lower())
# strip literal "\n" sequences (backslash + n); real line breaks are left in place
df_pinkfloyd = df_pinkfloyd.apply(lambda x: x.astype(str).str.replace(r"\\n", "", regex=True))
# strip ellipses (regex=True keeps the pattern a regular expression on newer pandas)
df_pinkfloyd = df_pinkfloyd.apply(lambda x: x.astype(str).str.replace(r"\.\.\.", "", regex=True))
raw_text = ' '.join(map(str, df_pinkfloyd['text']))

Once we have the raw text of the lyrics, we need to put it into a form a neural network can consume. A common way to do this is to map each character to an integer. We also need to keep the reverse mapping so that the network's output can be converted back to characters. After this encoding, we choose how many characters the network sees at a time (seq_length) and use that length to build sequences of encoded characters: each window of seq_length characters is a training x, and the character that immediately follows it is the training y. We then scale the x values into the range 0 to 1 by dividing by the vocabulary size, which is a better input range for the network. Finally, we convert each y from an integer to a one-hot encoded array over the possible next characters.

In [40]:
# create mapping of unique chars to integers
chars = sorted(list(set(raw_text)))
char_to_int = dict((c, i) for i, c in enumerate(chars))
int_to_char = dict((i, c) for i, c in enumerate(chars))

# # summarize the loaded data
n_chars = len(raw_text)
n_vocab = len(chars)
print("Total Characters: ", n_chars)
print("Total Vocab: ", n_vocab)

# # prepare the dataset of input to output pairs encoded as integers
seq_length = 100
dataX = []
dataY = []
for i in range(0, n_chars - seq_length, 1):
    seq_in = raw_text[i:i + seq_length]
    seq_out = raw_text[i + seq_length]
    dataX.append([char_to_int[char] for char in seq_in])
    dataY.append(char_to_int[seq_out])
n_patterns = len(dataX)
print("Total Patterns: ", n_patterns)

# reshape X to be [samples, time steps, features]
X = numpy.reshape(dataX, (n_patterns, seq_length, 1))

# normalize
X = X / float(n_vocab)
print(X.shape)

# one hot encode the output variable
y = np_utils.to_categorical(dataY)


Total Characters:  96814
Total Vocab:  49
Total Patterns:  96714
(96714, 100, 1)
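
To make the shapes above concrete, here is the same windowing applied to a toy string. This is a minimal sketch rather than the project code: the helper name `make_windows` and the string "hello world" are made up for illustration, and `np.eye` stands in for `np_utils.to_categorical` so that the example needs only NumPy.

```python
import numpy as np


def make_windows(text, seq_length):
    """Return (X, y, chars) for next-character prediction on `text`."""
    chars = sorted(set(text))
    char_to_int = {c: i for i, c in enumerate(chars)}
    data_x, data_y = [], []
    for i in range(len(text) - seq_length):
        window = text[i:i + seq_length]   # seq_length input characters
        target = text[i + seq_length]     # the character that follows them
        data_x.append([char_to_int[c] for c in window])
        data_y.append(char_to_int[target])
    # [samples, time steps, features], scaled by the vocabulary size
    X = np.reshape(data_x, (len(data_x), seq_length, 1)) / float(len(chars))
    # one-hot targets, one column per character in the vocabulary
    y = np.eye(len(chars))[data_y]
    return X, y, chars


X_toy, y_toy, chars_toy = make_windows("hello world", seq_length=4)
print(chars_toy)
print(X_toy.shape, y_toy.shape)
```

An 11-character string with a window of 4 yields 11 - 4 = 7 patterns, in the same way that 96814 characters with a window of 100 yield the 96714 patterns reported above.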


Now it is time to create the LSTM model. It will have input shape (seq_length, 1). The first variation I tried was setting return_sequences=True for each LSTM layer, which makes a layer output its hidden state at every time step rather than only at the last one (“Understand the Difference...”, Machine Learning Mastery). I added a dropout layer after each LSTM layer to reduce overfitting. Finally, I flatten the 2D (time steps by units) output and feed it into a Dense output layer whose size matches the number of possible next characters. This model took about 2.5 hours to train, so I saved the trained weights in a file that is loaded here.

In [7]:
# define the LSTM model
model = Sequential()
model.add(LSTM(256, input_shape=(X.shape[1], X.shape[2]), return_sequences=True))
model.add(Dropout(0.2))
model.add(LSTM(128, return_sequences=True))
model.add(Dropout(0.2))
model.add(LSTM(128, return_sequences=True))
model.add(Dropout(0.2))
model.add(Flatten())
model.add(Dense(y.shape[1], activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')

# load the network weights
filename = "normal/weights-improvement-20-1.5803.hdf5"
model.load_weights(filename)
model.compile(loss='categorical_crossentropy', optimizer='adam')
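
Because the last LSTM layer in the model above returns a full sequence, the Flatten layer hands the Dense layer one value per time step per unit. The arithmetic below is a rough sketch of what that means for the size of the output layer, using the figures from this notebook (seq_length of 100, 128 units, 49 characters) and assuming the Dense layer has one output per character. The helper name `dense_params` is hypothetical.

```python
def dense_params(n_inputs, n_outputs):
    """Number of weights plus biases in a fully connected layer."""
    return n_inputs * n_outputs + n_outputs


seq_length, lstm_units, n_vocab = 100, 128, 49

# values per sample after Flatten when the LSTM returns every time step
flattened = seq_length * lstm_units
with_sequences = dense_params(flattened, n_vocab)

# for comparison: the last LSTM returning only its final output
last_step_only = dense_params(lstm_units, n_vocab)

print(flattened, with_sequences, last_step_only)
```

Under these assumptions the output layer alone has 627249 parameters, against 6321 if only the final time step were passed on, which may be one reason the model is slow to train on a corpus of this size.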

At this point, we are ready to start generating text. We choose a random starting point in our data and use our network to predict what character will come next. From there we shift our pattern to include the new character and repeat the process.

In [8]:
# pick a random starting point
start = numpy.random.randint(0, len(dataX)-1)
pattern = list(dataX[start])  # copy so that appending below does not modify dataX

# generate characters
for i in range(1000):
    x = numpy.reshape(pattern, (1, len(pattern), 1))
    x = x / float(n_vocab)
    prediction = model.predict(x, verbose=0)
    index = numpy.argmax(prediction)
    result = int_to_char[index]
    seq_in = [int_to_char[value] for value in pattern]
    sys.stdout.write(result)
    pattern.append(index)
    pattern = pattern[1:len(pattern)]
print("\nDone.")

you have she pili the soy the gane wo far  
the hand of see shacn in the siaee  
and whll coe she cerhir the shpe  
and when e fally aar gloe ard iung and ary?  
and i wound in the fird  
the dli is woule so the fere  
the wall the wuud ou thels  
she lighe wiet ir to pr your whet wou weet  
and i can tie wale and dlra  
and what eoedr and she saans  
bnd the gey sase the carl  
wht wat inow b shll  
ald the wou way wha aallw of aod wha  
gi the ware wou rhe kenw of the lrsh  
i'me gond if io she way yhay you'  
wou'll tee iack i gan you wwif wou ol the raie,  
  
i'v gring thog a bioother or wou the lirot  
dre is foat  
anw a shil wou'l hea mi the dorunng.  
  
and io the skan the soo'  
you well wou way for the hier 

  
the dar wou wase dood if the doe,  
  
she lena and your eaar bte hn the fie.  
the way she doue of she dronnng the seie  
and d'can wour oe the say ir fare  
oubn aar feles aay  

 
fet teet pi he hene the gomeer  
and g can'tee heeh and wolls  
she flgnw car fer s

Clearly this is nonsense. There are a few recognizable words like "the", "she", "and", "if", and "you", but the majority of the words are not English. One interesting thing is that the structure of a song is recognizable: each line is fairly short and looks like it could belong to a song.  
Disappointed with these results, I tried something slightly different and switched the LSTMs to stateful mode. This meant adding the flag stateful=True, so that the state left at the end of one training batch is carried over as the starting state for the next batch instead of being reset. The biggest issue with implementing this was that I had to fix the batch size at one, and training five epochs took ten hours. The results from this test were even worse.

In [41]:
# define the LSTM model
model = Sequential()
model.add(LSTM(256, batch_input_shape=(1, X.shape[1], X.shape[2]), return_sequences=True, stateful=True))
model.add(Dropout(0.2))
model.add(LSTM(128, return_sequences=True, stateful=True))
model.add(Dropout(0.2))
model.add(LSTM(128, return_sequences=True, stateful=True))
model.add(Dropout(0.2))
model.add(Flatten())
model.add(Dense(y.shape[1], activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')

# load the network weights
filename = "stateful\\weights-improvement-stateful-05-2.1414.hdf5"
model.load_weights(filename)
model.compile(loss='categorical_crossentropy', optimizer='adam')

# pick a random starting point
start = numpy.random.randint(0, len(dataX)-1)
pattern = list(dataX[start])  # copy so that appending below does not modify dataX

# generate characters
for i in range(500):
    x = numpy.reshape(pattern, (1, len(pattern), 1))
    x = x / float(n_vocab)
    prediction = model.predict(x, verbose=0)
    index = numpy.argmax(prediction)
    result = int_to_char[index]
    seq_in = [int_to_char[value] for value in pattern]
    sys.stdout.write(result)
    pattern.append(index)
    pattern = pattern[1:len(pattern)]
print("\n\nDone.")

the sur  
  
and the shoe the seen the sean  
in the sare she see the shne  
and i hane the wie the the suand  
and i hene the siat  
and i woue bere  
and i she ne the dirt  
i was hi the een  
the lend ie the sioe the sane  
and i would the she the fan  
and i the le the she ie  
the soane the wase  
  
the llar in the sian  
and the siane and she way an i was she doar  
and i wou doelt the sane the los  
  
and i wou doer the wanl the nan  
the sane  
the soane  
wou and the riare the sare  


Done.


Yikes. This definitely looks worse than the previous try. The lines are shorter on average and start with "the" or "and" much too often.  
Suspecting that the problem was generating one character at a time, I thought it might be interesting to try using full words instead. This required refactoring my data.

In [26]:
# load ascii text and convert to lowercase
filename = "data\\songdata.csv"
df_songs = pd.read_csv(filename)
df_pinkfloyd = df_songs.loc[df_songs['artist'] == 'Pink Floyd']
df_pinkfloyd = df_pinkfloyd.apply(lambda x: x.astype(str).str.lower())
# strip newline characters and ellipses (regex=True keeps the patterns regular expressions on newer pandas)
df_pinkfloyd = df_pinkfloyd.apply(lambda x: x.astype(str).str.replace(r"\n", "", regex=True))
df_pinkfloyd = df_pinkfloyd.apply(lambda x: x.astype(str).str.replace(r"\.\.\.", "", regex=True))
raw_text = ' '.join(map(str, df_pinkfloyd['text']))

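# create mapping of unique words to integers
# (sorted so the mapping is the same on every run, which matters when reloading saved weights)
unique_words = sorted(set(raw_text.split()))
words = list(raw_text.split())

word_to_int = dict((c, i) for i, c in enumerate(unique_words))
int_to_word = dict((i, c) for i, c in enumerate(unique_words))

# summarize the loaded data
# (n_words is the total number of words in the corpus, not the number of unique words)
n_words = len(words)
print("Total Vocab: ", n_words)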

# prepare the dataset of input to output pairs encoded as integers
seq_length = 6
dataX = []
dataY = []
for i in range(0, n_words - seq_length, 1):
    seq_in = words[i:i + seq_length]
    seq_out = words[i + seq_length]
    dataX.append([word_to_int[word] for word in seq_in])
    dataY.append(word_to_int[seq_out])
n_patterns = len(dataX)
print("Total Patterns: ", n_patterns)

# reshape X to be [samples, time steps, features]
X = numpy.reshape(dataX, (n_patterns, seq_length, 1))

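# normalize (this divides by the total word count n_words;
# len(unique_words) would be the direct analogue of n_vocab in the character model)
X = X / float(n_words)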
print(X.shape)

# one hot encode the output variable
y = np_utils.to_categorical(dataY)

Total Vocab:  17579
Total Patterns:  17573
(17573, 6, 1)


In [34]:
# define the LSTM model
model = Sequential()
model.add(LSTM(128, input_shape=(X.shape[1], X.shape[2]), return_sequences=True, activation='tanh'))
model.add(Dropout(0.2))
model.add(LSTM(64, return_sequences=True, activation='tanh'))
model.add(Dropout(0.2))
model.add(LSTM(32, return_sequences=True, activation='tanh'))
model.add(Flatten())
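# note: softmax is the usual output activation with categorical_crossentropy;
# sigmoid is kept here because it is what this run used
model.add(Dense(y.shape[1], activation='sigmoid'))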
model.compile(loss='categorical_crossentropy', optimizer='adam')

# load the network weights
filename = "words\\weights-improvement-04-6.6583.hdf5"
model.load_weights(filename)
model.compile(loss='categorical_crossentropy', optimizer='adam')

In [35]:
# pick a random starting point
start = numpy.random.randint(0, len(dataX)-1)
pattern = list(dataX[start])  # copy so that appending below does not modify dataX

# generate words
for i in range(100):
    x = numpy.reshape(pattern, (1, len(pattern), 1))
    x = x / float(n_words)
    prediction = model.predict(x, verbose=0)
    index = numpy.argmax(prediction)
    result = int_to_word[index]
    seq_in = [int_to_word[value] for value in pattern]
    sys.stdout.write(result + " ")
    pattern.append(index)
    pattern = pattern[1:len(pattern)]
print("\n\nDone.")

sent sent sent sent sent sent sent sent sent sent sent sent sent sent sent sent sent sent sent sent sent sent sent sent sent sent sent sent sent sent sent sent sent sent sent sent sent sent sent sent sent sent sent sent sent sent sent sent sent sent sent sent sent sent sent sent sent sent sent sent sent sent sent sent sent sent sent sent sent sent sent sent sent sent sent sent sent sent sent sent sent sent sent sent sent sent sent sent sent sent sent sent sent sent sent sent sent sent sent sent 

Done.


This is not working either. I believe the issue is that the generation loop does not sample from the predicted distribution over words but always takes the single most likely word (argmax). This could be exacerbated if some weights became insignificant during training, so that "sent" ends up with a much higher probability than any other word regardless of context.
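
The difference between taking the most likely word and sampling can be sketched with a made-up distribution. The four words and their probabilities below are invented for illustration, and the distribution is held fixed, which mimics a model whose prediction barely changes with context. A trained model would produce a new distribution at every step.

```python
import numpy as np

rng = np.random.default_rng(0)

# A made-up next-word distribution in which one word is only slightly ahead
vocab = ["sent", "wall", "moon", "time"]
probs = np.array([0.30, 0.25, 0.25, 0.20])

# argmax always returns the same index for the same distribution
greedy = [vocab[int(np.argmax(probs))] for _ in range(20)]

# sampling draws each word in proportion to its probability
sampled = [vocab[rng.choice(len(vocab), p=probs)] for _ in range(20)]

print("greedy: ", " ".join(greedy))
print("sampled:", " ".join(sampled))
```

The greedy line repeats "sent" every time even though it holds only 30% of the probability mass, while the sampled line mixes all of the candidates.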

### Chollet's approach

In contrast to the blog's approach, Francois Chollet, author of Deep Learning with Python, samples from the predicted distribution over characters rather than always using the most likely one. Much of the code below was taken directly from the accompanying IPython Notebook for chapter 8.1 (https://github.com/fchollet/deep-learning-with-python-notebooks/blob/master/8.1-text-generation-with-lstm.ipynb).

In [None]:
import keras
import numpy as np
import pandas as pd
from keras import layers
from keras.callbacks import ModelCheckpoint
import string
import random
import sys

filename = "data\\songdata.csv"
df_songs = pd.read_csv(filename)
df_pinkfloyd = df_songs.loc[df_songs['artist'] == 'Pink Floyd']
df_pinkfloyd = df_pinkfloyd.apply(lambda x: x.astype(str).str.lower())
df_pinkfloyd = df_pinkfloyd.apply(lambda x: x.astype(str).str.replace(r"\.\.\.", "", regex=True))
text = ' '.join(map(str, df_pinkfloyd['text']))
print('Corpus length:', len(text))

# Length of extracted character sequences
maxlen = 60

# We sample a new sequence every `step` characters
step = 3

# This holds our extracted sequences
sentences = []

# This holds the targets (the follow-up characters)
next_chars = []

for i in range(0, len(text) - maxlen, step):
    sentences.append(text[i: i + maxlen])
    next_chars.append(text[i + maxlen])
print('Number of sequences:', len(sentences))

# List of unique characters in the corpus
chars = sorted(list(set(text)))
print('Unique characters:', len(chars))
# Dictionary mapping unique characters to their index in `chars`
char_indices = dict((char, chars.index(char)) for char in chars)

# Next, one-hot encode the characters into binary arrays.
print('Vectorization...')
# the built-in bool replaces the np.bool alias, which newer NumPy versions no longer provide
x = np.zeros((len(sentences), maxlen, len(chars)), dtype=bool)
y = np.zeros((len(sentences), len(chars)), dtype=bool)
for i, sentence in enumerate(sentences):
    for t, char in enumerate(sentence):
        x[i, t, char_indices[char]] = 1
    y[i, char_indices[next_chars[i]]] = 1

model = keras.models.Sequential()
model.add(layers.LSTM(128, input_shape=(maxlen, len(chars))))
model.add(layers.Dense(len(chars), activation='softmax'))

optimizer = keras.optimizers.RMSprop(lr=0.01)
model.compile(loss='categorical_crossentropy', optimizer=optimizer)

def sample(preds, temperature=1.0):
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)

for epoch in range(1, 20):
    print('epoch', epoch)
    filepath="chollet/weights-improvement-{loss:.4f}.hdf5"
    checkpoint = ModelCheckpoint(filepath, monitor='loss', verbose=1, save_best_only=True, mode='min')
    callbacks_list = [checkpoint]
    # Fit the model for 1 epoch on the available training data,
    # passing the checkpoint callback so the weights are actually saved
    model.fit(x, y,
              batch_size=128,
              epochs=1,
              callbacks=callbacks_list)

In [50]:
filename = "chollet\\weights-improvement-0.9230.hdf5"

model.load_weights(filename)
model.compile(loss='categorical_crossentropy', optimizer=optimizer)

def sample(preds, temperature=1.0):
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)

# Select a text seed at random
start_index = random.randint(0, len(text) - maxlen - 1)
generated_text = text[start_index: start_index + maxlen]
print('--- Generating with seed: "' + generated_text + '"')

for temperature in [0.2, 0.5, 1.0, 1.2]:
    print('------ temperature:', temperature)
    sys.stdout.write(generated_text)

    # We generate 400 characters
    for i in range(400):
        sampled = np.zeros((1, maxlen, len(chars)))
        for t, char in enumerate(generated_text):
            sampled[0, t, char_indices[char]] = 1.

        preds = model.predict(sampled, verbose=0)[0]
        next_index = sample(preds, temperature)
        next_char = chars[next_index]

        generated_text += next_char
        generated_text = generated_text[1:]

        sys.stdout.write(next_char)
        sys.stdout.flush()
    print()

--- Generating with seed: "oard for the american tour,  
and maybe you'll make it to th"
------ temperature: 0.2
oard for the american tour,  
and maybe you'll make it to the starter to me  
you can you stand the wart old would the sunshing on the sunshing the sunshing was so stream of the sand.  
and i rise in the sunshing wood  
and you sit would be so look how.  
  
in his on the sand of the morning  
  
like the sange in his break in her  
you can you then we there was it  
  
ooooh, i can you then in the sunshing so calling in the sunshing on the sunshing was so
------ temperature: 0.5
e sunshing so calling in the sunshing on the sunshing was so gig hold of a way  
should stand the words they cai to try  
something it fly live  
a flight i start deep all arout?  
  
walk to ching you feel  
do you helver  
you are we going for the watering in your sween  
when we gold watching all of the sand.  
should sound you sit for me.  
  
ooooh, wings and so you back.  
hould be so nice  
i

Now we're getting somewhere! The different temperatures that we see are changing how the sampler pulls from the character distributions. The higher the temperature, the more easily the sampler selects less likely characters as the next character, whereas the low temperatures make it harder for the sampler to pick the unlikely values. This is why the lower temperature sections seem like more standard English than their high temperature counterparts.
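
The effect of temperature can be seen directly by applying the same rescaling that sample performs to a small distribution. This is a sketch under stated assumptions: the helper `reweight` is hypothetical and repeats only the reweighting part of sample, without drawing a character, and the three probabilities are made up.

```python
import numpy as np


def reweight(preds, temperature=1.0):
    """Rescale a probability distribution the same way `sample` does,
    but return the new distribution instead of drawing from it."""
    preds = np.asarray(preds, dtype="float64")
    scaled = np.exp(np.log(preds) / temperature)
    return scaled / np.sum(scaled)


probs = [0.6, 0.3, 0.1]  # made-up character probabilities
for temperature in (0.2, 0.5, 1.0, 1.2):
    print(temperature, np.round(reweight(probs, temperature), 3))
```

At a temperature of 0.2 nearly all of the probability moves to the most likely character, at 1.0 the distribution is unchanged, and at 1.2 it flattens slightly so that unlikely characters are drawn more often.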

## Results

If we compare the model from the blog post with Chollet's model, it is clear that Chollet's is better. He organizes the data in a very similar way to the blog, and both networks are LSTMs, but I believe Chollet's clear advantage comes from his use of sampling. We see real English words arranged in a way that does not make complete sense but is readable and not too repetitive.  
The one thing that I like about both Chollet's model and the blog's is that they output lines in a format that could be interpreted as song lyrics, even though they are nonsensical. There are very few words per line, which makes the output look a little like poetry.  
One reason I believe the blog's version did not work is the limited dataset. For both character models and the word model, I simply did not have enough Pink Floyd songs to train the model effectively. One idea to fix this is to expand the dataset to include other psychedelic British rock groups.

## Implications

While the LSTMs generated nonsense for me, it is reasonable to assume that more data would have improved their output. With this assumption, I could imagine someone using an LSTM to create a chatbot. This chatbot could be trained on millions of conversations between real people and learn what kind of responses are natural. This could become an ethical issue if the chatbot pretends to be a human. It could be used to do nefarious things like prompt for personal information, bully people online, or spread false information.  
We have already seen a rise in fake news on social media, and I wouldn't be surprised if internet trolls started turning to text generation systems to spew out fake news articles. If an article doesn't get enough clicks, the developer could simply tweak something in the training to make it more realistic or outrageous, depending on the situation.  
We will have to be alert to the possibility that we are communicating with an AI rather than a real human being. With further developments in text generation and larger datasets, this is becoming a real possibility.

## Sources

Fedus, William, et al. “MaskGAN: Better Text Generation via Filling in the ______.” ICLR 2018, arxiv.org/abs/1801.07736.  
“A Beginner's Guide to LSTMs and Recurrent Neural Networks.” Skymind, skymind.ai/wiki/lstm.  
“Understand the Difference Between Return Sequences and Return States for LSTMs in Keras.” Machine Learning Mastery, 16 Oct. 2017, machinelearningmastery.com/return-sequences-and-return-states-for-lstms-in-keras/.  