<h3> Run the TensorFlow sequence-to-sequence model tutorial on a custom dataset of favorite songs to generate new lyrics. The approach closely follows Google's TensorFlow text generation tutorial (https://www.tensorflow.org/tutorials/sequences/text_generation), which works with the Shakespeare dataset. Here I work with my own.

<h5> Our model will train on a GPU in Google Colab, or on a CPU if one is not available. To run on Google Colab, make sure the data is in the correct directory.

<h5> First I tried to run the model on a simple song lyrics dataset that I created, consisting only of lyrics from one album by the band The National. The model output garbage, so I augmented the dataset with all song lyrics from the indie genre in the Kaggle dataset https://www.kaggle.com/gyani95/380000-lyrics-from-metrolyrics and retrained the model.

<h5> The file "Extract_Song_Lyrics.ipynb" extracts the necessary data.

In [147]:
import tensorflow as tf
import numpy as np
import os
import requests
import time
print(tf.VERSION)
tf.enable_eager_execution()

1.12.0


<h4> Check whether a GPU is available; if not, fall back to CPU hardware

In [148]:
# gpu_device_name() returns an empty string when no GPU is found
device_name = tf.test.gpu_device_name()
if not device_name:
    print("Please train model on GPU")
else:
    print('GPU Device {}'.format(device_name))

GPU Device /device:GPU:0


In [149]:
BASE_PATH = os.getcwd()
DATA_PATH = os.path.join(BASE_PATH, 'lyrics.txt')
print("BASE PATH = {}".format(BASE_PATH))

BASE PATH = /content/drive/My Drive/deep_learning_models/Seq_2_Seq_Models


<h3> Now we have a much larger text dataset composed of over 3 million characters. To train on it, we should probably use a GPU.

<h3> To run the code in Google Colab, we first need to mount Google Drive under the content directory and browse it to locate the data. Since the dataset file is quite small, I have loaded it into my Google Drive.

In [0]:
#from google.colab import files

#uploaded = files.upload()

#for fn in uploaded.keys():
#  print('User uploaded file "{name}" with length {length} bytes'.format(name=fn, length=len(uploaded[fn])))

In [150]:
from google.colab import drive
drive.mount('/content/drive/')

Drive already mounted at /content/drive/; to attempt to forcibly remount, call drive.mount("/content/drive/", force_remount=True).


In [151]:
os.chdir('/content/drive/My Drive/deep_learning_models/Seq_2_Seq_Models')
!ls

 380000-lyrics-from-metrolyrics       main.ipynb
 380000-lyrics-from-metrolyrics.zip   main.py
 cleaned_lyrics.txt		      song_lyrics.txt
 Extract_Song_Lyrics.ipynb	      Text_Gen_Songs.ipynb
 Extract_Song_Lyrics.py		      Text_Gen_Songs.py
 lyrics.txt			     'training_checkpoints)'


In [153]:
# read the whole corpus and close the file when done
with open('cleaned_lyrics.txt', encoding='ISO-8859-1') as f:
    text = f.read().replace('\n', ' \n ')
# length of text is the number of characters in it
print('Length of text: {} characters'.format(len(text)))

Length of text: 3176445 characters


<h5> The Shakespeare dataset has over 1,000,000 characters; the augmented dataset I now use, which consists of all song lyrics from the indie genre, contains over 3 million. It makes sense to train on a GPU.

In [156]:
vocab = sorted(set(text))
print('{} unique characters'.format(len(vocab)))
#vocab

59 unique characters


<h5> Create a lookup table which maps characters to numbers and vice versa.

In [0]:
char2idx = {char:i for i, char in enumerate(vocab)}
idx2char = np.array(vocab)
text_as_int = np.array([char2idx[c] for c in text])

<h5> Each unique character is now mapped to an integer

In [157]:
for char,_ in zip(char2idx, range(0, 20)):
    print('{:6s} ---> {:4d}'.format(repr(char), char2idx[char]))

' '    --->    0
'!'    --->    1
'"'    --->    2
"'"    --->    3
','    --->    4
'.'    --->    5
'A'    --->    6
'B'    --->    7
'C'    --->    8
'D'    --->    9
'E'    --->   10
'F'    --->   11
'G'    --->   12
'H'    --->   13
'I'    --->   14
'J'    --->   15
'K'    --->   16
'L'    --->   17
'M'    --->   18
'N'    --->   19
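
The lookup table can be checked with a round trip: encode a string to integers, decode it again, and compare. This is a minimal sketch on a short made-up string rather than the lyrics file; the `sample` text and the `encode`/`decode` helper names are assumptions for illustration, and plain Python lists stand in for the NumPy arrays used in the notebook.

```python
# Hypothetical sample text; the notebook uses the full lyrics corpus instead.
sample = "I'm losing you"

# Same construction as in the notebook: sorted unique characters form the vocabulary.
vocab = sorted(set(sample))
char2idx = {char: i for i, char in enumerate(vocab)}
idx2char = list(vocab)  # position i holds the character with index i


def encode(text):
    """Map each character to its integer index."""
    return [char2idx[c] for c in text]


def decode(indices):
    """Map integer indices back to a string."""
    return ''.join(idx2char[i] for i in indices)


encoded = encode(sample)
decoded = decode(encoded)
print(encoded)
print(decoded)
```

If the decoded string differs from the original, the two tables are out of sync, which would silently corrupt both training data and generated text.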


In [158]:
# Show how the first 25 characters from the text are mapped to integers
print('{} ---- characters mapped to int ---- > {}'.format(text[:25], text_as_int[:25]))

"Don't feel so bad,\nIt's ---- characters mapped to int ---- > [ 2  9 47 46  3 52  0 38 37 37 44  0 51 47  0 34 33 36  4 32 46 14 52  3
 51]


<h4> Let's look at the dataset to see what kind of text it contains

In [159]:
text[:25]

'"Don\'t feel so bad,\\nIt\'s'

*The text contains markers like <x4>, which mean the lyric is repeated 4 times. How will our model deal with this kind of text input?*

#### Next we start training the model. This is a character-based model: given a sequence of characters of length seq_length, the model predicts the next character at each position. To train such a model, we pick a fixed sequence length and train the model to predict the same sequence shifted by one character. For example, for the text "cussin", if the input is "cussi" the target is "ussin".
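
The shift can be tried out in plain Python before involving tf.data. This is only a sketch: `split_chunks` is a hypothetical helper that mimics batching with `drop_remainder=True`, and the toy text and sequence length are chosen so the chunks are easy to read.

```python
def split_chunks(text, seq_length):
    """Cut text into chunks of seq_length + 1 characters, dropping any remainder."""
    size = seq_length + 1
    n_chunks = len(text) // size
    return [text[i * size:(i + 1) * size] for i in range(n_chunks)]


def input_target(chunk):
    """Input is all but the last character; target is all but the first."""
    return chunk[:-1], chunk[1:]


# Hypothetical toy text; each chunk has 5 + 1 = 6 characters.
pairs = [input_target(c) for c in split_chunks("cussin losing", seq_length=5)]
for inp, tgt in pairs:
    print(repr(inp), '->', repr(tgt))
```

Each chunk needs one character more than the sequence length so that input and target both end up with exactly `seq_length` characters.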

In [160]:
seq_length = 100 # This should be a parameter we can play with. Why is 100 good for songs?
examples_per_epoch = len(text)//seq_length
# create training and target examples:

chunks = tf.data.Dataset.from_tensor_slices(text_as_int).batch(seq_length+1, drop_remainder=True)
#chunks = chunks.apply(tf.contrib.data.batch_and_drop_remainder(seq_length+1))
for item in chunks.take(5):
  print(repr(''.join(idx2char[item.numpy()]))) # map indices back to characters via idx2char, use repr to make printable

'"Don\'t feel so bad,\\nIt\'s just the way the wheel turns\\nStay in where it\'s quiet,\\nAnd where the sick'
" burns\\nAnd I never meant to make you feel alone,\\nI never meant to hide\\nAnd I never thought I'd mak"
"e you see the light,\\nBefore it was your time\\nI'm losing you\\nI'm losing you\\nBack straight, arms do"
"wn,\\nLift the weight off your shoulders\\nTake your time, take it slow\\nGet ready for the end\\nI'm los"
"ing you\\nI'm losing you\\nLight head\\nCold sweat\\nFind the vein\\nAnd deliver x\\nI'm losing you\\nI'm lo"


<h5> This function creates inputs and target sequences for the model to predict

In [0]:
def input_target(chunk):
    input_text = chunk[:-1] # all but last
    target_text = chunk[1:] # all but first
    return input_text, target_text

In [0]:
def create_dataset(seq_length, text_data):
    ''' Given the text data and a sequence length, this creates a new dataset which consists of input and target chunks'''
    chunks = tf.data.Dataset.from_tensor_slices(text_data).batch(seq_length+1, drop_remainder= True)
    return chunks.map(input_target)

In [0]:
dataset = create_dataset(seq_length, text_as_int)

<h5> Look at some examples of input and target sequences.

In [164]:
for input_dataset, target_dataset in dataset.take(5):
    print(repr(''.join(idx2char[input_dataset.numpy()])))
    print(repr(''.join(idx2char[target_dataset.numpy()])))

'"Don\'t feel so bad,\\nIt\'s just the way the wheel turns\\nStay in where it\'s quiet,\\nAnd where the sic'
"Don't feel so bad,\\nIt's just the way the wheel turns\\nStay in where it's quiet,\\nAnd where the sick"
" burns\\nAnd I never meant to make you feel alone,\\nI never meant to hide\\nAnd I never thought I'd ma"
"burns\\nAnd I never meant to make you feel alone,\\nI never meant to hide\\nAnd I never thought I'd mak"
"e you see the light,\\nBefore it was your time\\nI'm losing you\\nI'm losing you\\nBack straight, arms d"
" you see the light,\\nBefore it was your time\\nI'm losing you\\nI'm losing you\\nBack straight, arms do"
"wn,\\nLift the weight off your shoulders\\nTake your time, take it slow\\nGet ready for the end\\nI'm lo"
"n,\\nLift the weight off your shoulders\\nTake your time, take it slow\\nGet ready for the end\\nI'm los"
"ing you\\nI'm losing you\\nLight head\\nCold sweat\\nFind the vein\\nAnd deliver x\\nI'm losing you\\nI'm l"
"ng you\\nI'm losing you\\nLight he

<h4> Training

<h5> We used tf.data to break up the data into sequences. Now we must shuffle the data and load it in batches for training. This is easily done with the shuffle and batch methods of a tf.data.Dataset object.

In [165]:
# batch size and shuffle dataset
BATCH_SIZE = 64
BUFFER_SIZE = 1000
steps_per_epoch = examples_per_epoch//BATCH_SIZE
print(steps_per_epoch)
dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder = True)

496
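
The value 496 can be reproduced by hand from the text length printed earlier. The sketch below also works out how many batches the pipeline yields in a single pass, on the assumption that chunks hold `seq_length + 1` characters and that incomplete chunks and batches are dropped, as in the code above.

```python
text_length = 3176445   # characters, from the output printed earlier
seq_length = 100
batch_size = 64

# Values computed the same way as in the notebook.
examples_per_epoch = text_length // seq_length
steps_per_epoch = examples_per_epoch // batch_size

# What the pipeline yields in one pass: chunks hold seq_length + 1 characters,
# and drop_remainder=True discards incomplete chunks and batches.
chunks_per_pass = text_length // (seq_length + 1)
batches_per_pass = chunks_per_pass // batch_size

print(examples_per_epoch, steps_per_epoch)
print(chunks_per_pass, batches_per_pass)
```

The two counts differ slightly (496 steps versus 491 batches per pass), so an epoch of 496 steps runs a little past one pass over the data. That only works if the dataset is repeated when it is passed to training.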


In [166]:
dataset

<BatchDataset shapes: ((64, 100), (64, 100)), types: (tf.int64, tf.int64)>

### Model:

<h4> We use keras.Sequential to define the model. If a GPU is available, we train using CuDNNGRU; otherwise we use a GRU.
<h4> The model takes input of shape [batch_size, seq_length] and consists of three layers:
<h5>    a) an embedding layer, output [batch_size, seq_length, embedding_dim]
<h5>    b) a GRU, output [batch_size, seq_length, rnn_units]
<h5>    c) a fully connected dense layer that maps the output of the GRU to one logit per vocabulary character, output [batch_size, seq_length, vocab_size].

In [0]:
vocab_size = len(vocab)
rnn_units = 1024
embedding_dim = 256

In [0]:
if tf.test.is_gpu_available():
    rnn = tf.keras.layers.CuDNNGRU
else:
    import functools
    rnn = functools.partial(tf.keras.layers.GRU, recurrent_activation = 'sigmoid')

In [0]:
def build_model(vocab_size, embedding_dim, rnn_units, batch_size):
    # embedding_dim is now a real parameter; before, the function silently read the global
    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(vocab_size, embedding_dim, batch_input_shape=[batch_size, None]),

        rnn(rnn_units,
            return_sequences=True,
            recurrent_initializer='glorot_uniform',
            stateful=True),  # retain the state of the RNN as the model learns the context from batch to batch
        tf.keras.layers.Dense(vocab_size)
    ])
    return model

In [0]:
model = build_model(vocab_size, embedding_dim, rnn_units, BATCH_SIZE)

In [187]:
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_11 (Embedding)     (64, None, 256)           15104     
_________________________________________________________________
cu_dnngru_11 (CuDNNGRU)      (64, None, 1024)          3938304   
_________________________________________________________________
dense_11 (Dense)             (64, None, 59)            60475     
Total params: 4,013,883
Trainable params: 4,013,883
Non-trainable params: 0
_________________________________________________________________
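
The parameter counts in the summary can be reproduced by hand, which is a quick check that the layer sizes are what we intended. The GRU formula below assumes the CuDNN layout with two bias vectors per weight set (one for the input kernel, one for the recurrent kernel); that assumption is consistent with the count shown in the summary.

```python
vocab_size = 59
embedding_dim = 256
rnn_units = 1024

# Embedding: one vector of size embedding_dim per vocabulary entry.
embedding_params = vocab_size * embedding_dim

# GRU: three weight sets (update gate, reset gate, candidate state), each with
# an input kernel, a recurrent kernel, and (assumed CuDNN layout) two biases.
gru_params = 3 * (rnn_units * (embedding_dim + rnn_units) + 2 * rnn_units)

# Dense: weight matrix plus one bias per output unit.
dense_params = rnn_units * vocab_size + vocab_size

total_params = embedding_params + gru_params + dense_params
print(embedding_params, gru_params, dense_params, total_params)
```

Almost all of the parameters sit in the recurrent layer, so `rnn_units` is the setting that most affects model size and training time.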


<h5> With Keras it's super easy to build a model using keras.Sequential. Let's test it out and check whether the model outputs look reasonable.

In [188]:
for input_example_batch, target_example_batch in dataset.take(1): 
    example_batch_predictions = model(input_example_batch)
    print(example_batch_predictions.shape, "# (batch_size, sequence_length, vocab_size)")

(64, 100, 59) # (batch_size, sequence_length, vocab_size)


In [189]:
# draw one sample per time step from the categorical distribution defined by
# the logits of the first sequence in the batch
sampled_indices = tf.multinomial(example_batch_predictions[0], num_samples=1)
sampled_indices = tf.squeeze(sampled_indices, axis=-1).numpy()

print("Input: \n", repr("".join(idx2char[input_example_batch[0]])))
print()
print("Output: \n", repr("".join(idx2char[sampled_indices])))

Input: 
 'ke it there\\nSo you can move your court case\\nWay across town\\nYou can move it across the whole wide'

Output: 
 'oJJfXunqUcNcSR.zZV.YootNZSdxjFwBatpPvNIxuF!uJGHTZBM"yoXykKj"ohpfzgoRvEWGHb\\.gssS.sqAi\'npfVyck\\xcqpyt'


<h5> As you can see, the untrained model outputs a bunch of garbage.

<h3> Train the model

In [0]:
model.compile(optimizer = tf.train.AdamOptimizer(),
             loss = tf.losses.sparse_softmax_cross_entropy)
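
A useful reference point when reading the training loss: a model that spreads its probability evenly over the 59-character vocabulary has a cross-entropy of ln(59). The sketch below computes this in plain Python. Modeling an untrained network as all-zero logits is an assumption for illustration, and the helper functions are hypothetical stand-ins, not the TensorFlow loss itself.

```python
import math

vocab_size = 59


def softmax(logits):
    """Convert logits to probabilities, shifting by the max for numerical stability."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sparse_cross_entropy(logits, target_index):
    """Negative log-probability assigned to the true class."""
    return -math.log(softmax(logits)[target_index])


# Assumption: an untrained model is roughly uniform, modeled here as all-zero logits.
uniform_logits = [0.0] * vocab_size
baseline_loss = sparse_cross_entropy(uniform_logits, target_index=0)
print(round(baseline_loss, 4))
```

A loss that starts near this value and falls steadily suggests the model is learning; a loss that stays there suggests a problem in the data pipeline or the optimizer setup.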

In [0]:
model.build(tf.TensorShape([BATCH_SIZE, seq_length]))

<h5> Create a directory to save model checkpoints. Any existing checkpoint directory is deleted first, so checkpoints from a run with different model parameters are not mixed with the new ones.

In [0]:
import shutil

checkpoint_dir = os.path.join(os.getcwd(), 'training_checkpoints)')
if os.path.exists(checkpoint_dir):
    shutil.rmtree(checkpoint_dir)
# Name of the checkpoint files
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_{epoch}")

checkpoint_callback=tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_prefix,
    save_weights_only=True)

In [193]:
EPOCHS = 20
steps_per_epoch

496

<h4> Looking at the evolution of the loss, training should stop at about 25 epochs; beyond that the model begins to overfit and the loss goes back up again.

In [194]:
history = model.fit(dataset.repeat(), epochs = EPOCHS, steps_per_epoch=steps_per_epoch,callbacks=[checkpoint_callback])

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<h2> Generate New Text

<h3> Test the Model:

<h4> Restore the latest checkpoint. To keep predictions simple, make sure the model is rebuilt with a batch size of 1 for testing

In [195]:
tf.train.latest_checkpoint(checkpoint_dir)

'/content/drive/My Drive/deep_learning_models/Seq_2_Seq_Models/training_checkpoints)/ckpt_20'

In [0]:
model = build_model(vocab_size, embedding_dim, rnn_units, batch_size=1)

# load the weights from the newest checkpoint in checkpoint_dir
# (the extra restore() call used checkpoint_prefix, which is a filename template, not a directory)
model.load_weights(tf.train.latest_checkpoint(checkpoint_dir))
model.build(tf.TensorShape([1, None]))

In [197]:
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_12 (Embedding)     (1, None, 256)            15104     
_________________________________________________________________
cu_dnngru_12 (CuDNNGRU)      (1, None, 1024)           3938304   
_________________________________________________________________
dense_12 (Dense)             (1, None, 59)             60475     
Total params: 4,013,883
Trainable params: 4,013,883
Non-trainable params: 0
_________________________________________________________________


In [0]:
def generate_text(model, start_string):
    # Evaluation step (generating text using the learned model)

    # Number of characters to generate
    num_generate = 200

    # Convert the start string to numbers (vectorizing)
    input_eval = [char2idx[s] for s in start_string]
    input_eval = tf.expand_dims(input_eval, 0)

    # Empty list to store the generated characters
    text_generated = []

    # Low temperatures result in more predictable text.
    # Higher temperatures result in more surprising text.
    # Experiment to find the best setting.
    temperature = 0.1

    # Here batch size == 1
    model.reset_states()
    for i in range(num_generate):
        predictions = model(input_eval)
        # remove the batch dimension
        predictions = tf.squeeze(predictions, 0)

        # scale the logits by the temperature, then sample the next character
        # from the distribution at the last time step (one sample is enough)
        predictions = predictions / temperature
        predicted_id = tf.multinomial(predictions, num_samples=1)[-1, 0].numpy()

        # Pass the predicted character as the next input to the model;
        # the stateful RNN carries the previous hidden state forward
        input_eval = tf.expand_dims([predicted_id], 0)
        text_generated.append(idx2char[predicted_id])

    return start_string + ''.join(text_generated)
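
To see what the temperature does, it helps to look at the probabilities directly. The sketch below uses plain Python and a made-up four-character vocabulary. The logits, the helper names, and the seeded random generator are all assumptions for illustration, standing in for the model output and the TensorFlow sampling call.

```python
import math
import random


def softmax(logits):
    """Convert logits to probabilities, shifting by the max for numerical stability."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_with_temperature(logits, temperature, rng):
    """Divide logits by the temperature, then draw one index from the resulting distribution."""
    probs = softmax([x / temperature for x in logits])
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


# Hypothetical logits for a four-character vocabulary.
logits = [2.0, 1.0, 0.5, 0.0]

probs_cold = softmax([x / 0.1 for x in logits])  # temperature 0.1
probs_warm = softmax([x / 1.0 for x in logits])  # temperature 1.0
print([round(p, 3) for p in probs_cold])
print([round(p, 3) for p in probs_warm])

rng = random.Random(0)  # seeded so repeated runs draw the same samples
cold_samples = [sample_with_temperature(logits, 0.1, rng) for _ in range(20)]
print(cold_samples)
```

At temperature 0.1 nearly all probability lands on the highest logit, so sampling behaves almost like always picking the most likely character. That makes the output coherent but repetitive; raising the temperature trades some coherence for variety.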

In [206]:
print(generate_text(model, start_string='Simona').replace('\\n', ' \n '))

Simona song when I'm down 
 And if I fall asleep to be found 
 I can feel it deep in my head 
 And the wind blows her way 
 And you say the world can make me want to be 
 It was the one who loves you the most, y
