# Text Generation using LSTM RNNs

In this notebook, we create an RNN (using two stacked LSTM layers) that can be used to generate song lyrics reflecting the style of a particular artist.

Here, we'll be using a dataset of Tayor Swift's songs, taken from [here](https://www.kaggle.com/PromptCloudHQ/taylor-swift-song-lyrics-from-all-the-albums) at Kaggle.

In [2]:
# We need Tensorflow 1.x to use CuDNNLSTM properly
%tensorflow_version 1.x

TensorFlow 1.x selected.


In [3]:
from keras.callbacks import LambdaCallback
from keras.layers import Dense, LSTM, Embedding, Input, Dropout, CuDNNLSTM
from keras.preprocessing.text import Tokenizer
from keras.models import Model

import numpy as np
import re # could be useful
import os

Using TensorFlow backend.


In [4]:
from google.colab import drive
drive.mount('/content/drive')

Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3aietf%3awg%3aoauth%3a2.0%3aoob&response_type=code&scope=email%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdocs.test%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive.photos.readonly%20https%3a%2f%2fwww.googleapis.com%2fauth%2fpeopleapi.readonly

Enter your authorization code:
··········
Mounted at /content/drive


## Preprocessing the Data

We'll be using a dataset of Tayor Swift's songs, taken from [here](https://www.kaggle.com/PromptCloudHQ/taylor-swift-song-lyrics-from-all-the-albums) at Kaggle.

This dataset needs some preliminary preprocessing to generate a text corpus that can be easily fed into the LSTM network.

In [0]:
if not os.path.isdir('data'):
  os.mkdir('data')

if not os.path.isdir('drive/My Drive/ML & DL/Lyrics-RNN/model-checkpoints'):
  os.mkdir('drive/My Drive/ML & DL/Lyrics-RNN/model-checkpoints')

CHKP_DIR = 'drive/My Drive/ML & DL/Lyrics-RNN/model-checkpoints'

In [8]:
import pandas as pd

df = pd.read_csv('data/taylor_swift_lyrics.csv', encoding='latin1')
df.head()

Unnamed: 0,artist,album,track_title,track_n,lyric,line,year
0,Taylor Swift,Taylor Swift,Tim McGraw,1,He said the way my blue eyes shined,1,2006
1,Taylor Swift,Taylor Swift,Tim McGraw,1,Put those Georgia stars to shame that night,2,2006
2,Taylor Swift,Taylor Swift,Tim McGraw,1,"I said, ""That's a lie""",3,2006
3,Taylor Swift,Taylor Swift,Tim McGraw,1,Just a boy in a Chevy truck,4,2006
4,Taylor Swift,Taylor Swift,Tim McGraw,1,That had a tendency of gettin' stuck,5,2006


In [0]:
SEQ_LEN = 20 # sequence length for LSTMs, i.e., act upon after recalling previous SEG_LEN words
START_SONG = '| ' * SEQ_LEN

In [10]:
song_names = []
lyrics = []

song_number = 1 # song number in dataset

first_line = True

for ind, row in df.iterrows():

  if (song_number == row['track_n']):

    # First line of next/first song? If yes:
    if (first_line):
      #print('Found first line.')
      song_names.append(row['album'])
      lyrics.append(START_SONG + row['lyric'] + '\n') 
      # START_SONG is the new song marker which we can use as a seed to tell the model 
      # to generate a new song from scratch
      first_line = False

      #print(lyrics)

    else:
      lyrics[len(lyrics) - 1] += row['lyric'] + '\n' # add lyrics to this song

  # Move to next song
  else:
    song_number = row['track_n'] # Note that song number is album-wise
    first_line = True

print(str(len(lyrics)) + ' songs processed.')

94 songs processed.


In [0]:
# Create another dataframe

df_proc = pd.DataFrame({'song_name': song_names, 'lyrics': lyrics})

# Save Lyrics in .txt file
with open('data/taylor_swift_corpus.txt', 'w', encoding="utf-8") as f:  
    for lyric in lyrics:
        f.write(lyric)

Yay! Now we've finally retrieved the lyrics into a more convenient format.

## Loading the Data

In [12]:
artist_name = 'taylor_swift'
corpus = 'data/' + artist_name + '_corpus.txt'

if not (os.path.isdir('data') and os.path.isfile(corpus)):
  print('Data not found!')
  raise SystemError('Corpus not found!')

with open(corpus, 'r', encoding='utf-8') as f:
  text = f.read().lower()

print('Corpus length for ' + artist_name + ': ' + str(len(text)) + ' characters.')

Corpus length for taylor_swift: 173961 characters.


## Tokenization

Tokenization can of course be done manually in Python, but why not make our lives easier by using what we already have?

In [0]:
# Keep the punctuation, but add a space before and after it
text = re.sub('([*#@$%,?!()&*-./:[\]^_~\n])', r' \1 ', text)
# We want the line breaks too, so add a space before it and after it so that the tokenizer can recognise it

In [0]:
# Tokenization
tokenizer = Tokenizer(char_level=False, filters='')
tokenizer.fit_on_texts([text])
total_words = len(tokenizer.word_index) + 1
token_list = tokenizer.texts_to_sequences([text])[0]

In [0]:
tokenizer.word_index
# You can find some encoding problems like 'me\x97' being a token apparently
# Might use another re.sub to get rid of those

In [0]:
token_list

## Building the Dataset

Now we shall generate sequences of words based on `SEQ_LEN` for our model to work upon.

`# TODO`: Make `gen_sequences()` a generator function that yields values of X, y to avoid loading all the data into memory. The model can then be trained using `model.fit_generator()`.

In [0]:
from keras.utils.np_utils import to_categorical

def gen_sequences(token_list, step):
  """
  step: the step of the for loop running on the token_list
  """

  x = list()
  y = list()

  for i in range(0, len(token_list) - SEQ_LEN, step):
    x.append(token_list[i: i + SEQ_LEN]) # the training words
    y.append(token_list[i + SEQ_LEN]) # the word that should be predicted after them

  y = to_categorical(y, num_classes=total_words)
  # can we use keras.utils.to_categorical here? (fearing tensor vs ndarray mismatch)

  num_seq = len(x)
  print('Generated ' + str(len(x)) + ' sequences.')
  
  return x, y

Finally, let's generate the dataset.

In [18]:
step = 1

X, y = gen_sequences(token_list, step)

X = np.array(X)
y = np.array(y)

Generated 44572 sequences.


## The Model

Finally, the model!

In [19]:
N_UNITS = 256
EMBEDDING_SIZE = 100

text_input = Input(shape=(None,), name='text_input')
embd = Embedding(total_words, EMBEDDING_SIZE, name='embedding')(text_input)

# CuDNNLSTM is a fast variant of LSTM, only for GPUs... might want to use that
lstm1 = CuDNNLSTM(N_UNITS, return_sequences=True)(embd) 
lstm2 = CuDNNLSTM(N_UNITS)(lstm1)

drp = Dropout(0.2)(lstm2)

text_out = Dense(total_words, activation='softmax')(drp) # output text as probabilities for all words

model = Model(text_input, text_out)

Instructions for updating:
If using Keras pass *_constraint arguments to layers.


In [20]:
model.summary()

Model: "model_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
text_input (InputLayer)      (None, None)              0         
_________________________________________________________________
embedding (Embedding)        (None, None, 100)         245900    
_________________________________________________________________
cu_dnnlstm_1 (CuDNNLSTM)     (None, None, 256)         366592    
_________________________________________________________________
cu_dnnlstm_2 (CuDNNLSTM)     (None, 256)               526336    
_________________________________________________________________
dropout_1 (Dropout)          (None, 256)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 2459)              631963    
Total params: 1,770,791
Trainable params: 1,770,791
Non-trainable params: 0
_________________________________________________

In [0]:
from keras.optimizers import RMSprop

opt = RMSprop(lr = 0.001)
model.compile(loss='categorical_crossentropy', optimizer=opt)

In [0]:
# Some callbacks
from keras.callbacks import ReduceLROnPlateau, ModelCheckpoint

callbacks = []
mdlchkp = ModelCheckpoint(os.path.join(CHKP_DIR, 'taylor-epoch-{epoch}.h5'), period=20) # save model after every 20 epochs
rdlr = ReduceLROnPlateau(monitor='loss', patience=3, factor=0.5, min_lr=0.0000001)

callbacks = [mdlchkp, rdlr]

In [30]:
model.fit(X, y , epochs=400, batch_size=32, shuffle=True, callbacks=callbacks)

Epoch 1/400
Epoch 2/400
Epoch 3/400
Epoch 4/400
Epoch 5/400
Epoch 6/400
Epoch 7/400
Epoch 8/400
Epoch 9/400
Epoch 10/400
Epoch 11/400
Epoch 12/400
Epoch 13/400
Epoch 14/400
Epoch 15/400
Epoch 16/400
Epoch 17/400
Epoch 18/400
Epoch 19/400
Epoch 20/400
Epoch 21/400
Epoch 22/400
Epoch 23/400
Epoch 24/400
Epoch 25/400
Epoch 26/400
Epoch 27/400
Epoch 28/400
Epoch 29/400
Epoch 30/400
Epoch 31/400
Epoch 32/400
Epoch 33/400
Epoch 34/400
Epoch 35/400
Epoch 36/400
Epoch 37/400
Epoch 38/400
Epoch 39/400
Epoch 40/400
Epoch 41/400
Epoch 42/400
Epoch 43/400
Epoch 44/400
Epoch 45/400
Epoch 46/400
Epoch 47/400
Epoch 48/400
Epoch 49/400
Epoch 50/400
Epoch 51/400
Epoch 52/400
Epoch 53/400
Epoch 54/400
Epoch 55/400
Epoch 56/400
Epoch 57/400
Epoch 58/400
Epoch 59/400
Epoch 60/400
Epoch 61/400
Epoch 62/400
Epoch 63/400
Epoch 64/400
Epoch 65/400
Epoch 66/400
Epoch 67/400
Epoch 68/400
Epoch 69/400
Epoch 70/400
Epoch 71/400
Epoch 72/400
Epoch 73/400
Epoch 74/400
Epoch 75/400
Epoch 76/400
Epoch 77/400
Epoch 78

<keras.callbacks.callbacks.History at 0x7f8da673ad68>

In [0]:
model.save('Lyrics_taylor3.h5')
model.save_weights('Lyrics_taylor_wts3.h5')
# aisii

## Generation of New Text

In [0]:
def sample(preds, temperature=0.2):
  preds = np.asarray(preds).astype('float64')
  preds = np.log(preds) / temperature
  preds_exp = np.exp(preds)

  # The probabilities should be normalized before sampling using np.random.multinomial
  preds = preds_exp / np.sum(preds_exp)

  # Draw samples from preds, which is a multinomial distribution
  probs = np.random.multinomial(1, preds[0], 1)

  return np.argmax(probs)

In [0]:
def gen_text(seed, n_next_words, model, seq_len, temperature):
  out_text = ''
  seed = START_SONG + seed

  # We shall retrieve new words till we get n_next_words, 
  # or till the model wants to start a new song

  for i in range(n_next_words):
    token_list = tokenizer.texts_to_sequences([seed])
    #print(token_list)

    # we can feed in an arbitrary number of words to the LSTM
    # but let's stick to only the last seq_len number of words
    # that'll help it predict faster
    token_list = token_list[0][-seq_len:] # note that token_list was a list of lists before this line
    
    # Reshape it and convert to ndarray so that we an feed it in
    token_list = np.reshape(token_list, (1, seq_len))

    # Predict words!
    probs = model.predict(token_list, verbose=0)
    #print(np.shape(probs))

    y_sampled = sample(probs, temperature)

    out_word = tokenizer.index_word[y_sampled] if y_sampled > 0 else ''
    #print(out_word)
    
    if out_word == '|': # the model is trying to start a new song... so finish up
      break

    if not out_word.endswith('\n'):
      seed += out_word + ' ' # update the seed
      out_text += out_word + ' '
    else:
      seed += out_word + ' '
      out_text += out_word

  return out_text

In [0]:
from keras.models import load_model

model = load_model('Lyrics_taylor3.h5')

## Let's Get Some Songs!

In [58]:
outtext = gen_text('', 200, model, 20, 0.6)
print(outtext)

i have known it all this time 
but i never thought i'd live to see it break 
it's getting dark and it's all too scar 
and i know i'm over 
but the time that you've everything back before 
you did alone , i had a but 
you're my smile , all eyes him , he's 
but i know for me , you got that sorry 
and i never saw you coming 
and i'll never be the same 
and i never saw you coming 
and i'll never be the same 
this is a together and i'll be be ? 
i never still know we had 
our hands are , all you he , i know you all 
but you're everything if just was 
and i hit you 
he ( i try to love the woods of the best of to my town 
i can't go down , no to go on the time 
you do and the dreams and oh 
but i know is we can be me 
but i know i can say about you 
and the ah 


In [62]:
outtext = gen_text('', 500, model, 20, 0.6)
print(outtext)

i was in your me , you got me alone 
you say that like a dream we trust too chance 
i was the was ? i was off in the white 
i don't know she know you he get me 
you have a oh things a nines 
and yeah away ( dance you live i go is a to take ? 
'cause i know i know you home 
it's so sorry 
you're so who , say 
so but they everybody saw sorry 
and , they do , we were blame to don't breathe 
i'd feeling like you , a ( wish you just right 
i had a bad feeling 
but we were dancing 
dancing with our hands tied , hands tied 
yeah , we were dancing 
like it was the first time , first time 
yeah , we were dancing 
dancing with our hands tied , hands tied 
yeah , we were dancing 
and i had a bad feeling 
but we were dancing 
i , i loved you in in of 
you'll hold me the memories 
'cause i know you were see 
that i can't get take my eyes 
long should now you'll know 
i'm beat , one eyes my mind 
but that's i know what you're love and will was 
it's i was ? 
i know what all this is me ? 
this is sto

Okay, well, of course it's not that coherent as LSTMs can't really understand the semantics of the language too well (atleast for 400 epochs). But I kind of like the last one.