<a href="https://colab.research.google.com/github/ruiwen829/Coldplay-lyrics-Generation/blob/main/Coldplay_lyrics_generation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Lyrics generation with an RNN

<table class="tfo-notebook-buttons" align="left">
  <td>
    <a target="_blank" href="https://github.com/tensorflow/docs/blob/master/site/en/tutorials/text/text_generation.ipynb"><img src="https://www.tensorflow.org/images/GitHub-Mark-32px.png" />View source on GitHub</a>
  </td>
</table>

### Import TensorFlow and other libraries

In [None]:
import tensorflow as tf
from google.colab import files
import numpy as np
import os
import time

### Upload the Coldplay lyrics dataset


In [None]:
text_all = ""
for root, dirs, filenames in os.walk(os.getcwd()):
    # use `filenames`/`filename` so the imported `files` module is not shadowed
    for filename in filenames:
        if filename.endswith('.txt'):
            with open(os.path.join(root, filename), 'r') as f:
                text = f.read()
                text_all += text

In [None]:
delete_letters = ['è', 'ê', 'í', 'ó']
text_all = ' '.join([text for text in text_all.split() if all(d not in text for d in delete_letters)])
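The effect of this filter can be checked on a small made-up string (the sample text below is hypothetical, not taken from the dataset). Two things are worth noticing: any whitespace-delimited word containing one of the listed characters is dropped entirely, and because the remaining words are re-joined with single spaces, the original line breaks are lost.

```python
delete_letters = ['è', 'ê', 'í', 'ó']

# Hypothetical sample containing two accented words and two line breaks
sample = "Viva la vida\ncorazón aquí\nfor some reason"

# Same expression as in the cell above: keep only words that contain
# none of the listed characters, then join them with single spaces
filtered = ' '.join(
    [word for word in sample.split() if all(d not in word for d in delete_letters)]
)
print(repr(filtered))
```

If line structure matters for later steps, splitting on spaces only (rather than all whitespace) would be one way to preserve it.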

### Read the data

First, take a look at the text:

In [None]:
# length of text is the number of characters in it
print('Length of text: {} characters'.format(len(text_all)))

Length of text: 27992 characters


In [None]:
# Take a look at the first 250 characters in text
print(text_all[:250])

Come up to meet you, Tell you I'm sorry, You don't know how lovely you are. I had to find you, Tell you I need you, Tell you I set you apart. Tell me your secrets, And ask me your questions, Oh let's go back to the start. Runnin' in circles, Comin' u


In [None]:
# The unique characters in the file
vocab = sorted(set(text_all))
print('{} unique characters'.format(len(vocab)))

70 unique characters


## Process the text

### Vectorize the text

Before training, map strings to a numerical representation. Create two lookup tables: one mapping characters to numbers, and another for numbers to characters.

In [None]:
# Creating a mapping from unique characters to indices
char2idx = {u:i for i, u in enumerate(vocab)}
idx2char = np.array(vocab)

text_as_int = np.array([char2idx[c] for c in text_all])

Now we have an integer representation for each character. Notice that the characters are indexed from 0 to `len(vocab) - 1`.
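As a quick illustration of the two lookup tables, here is a minimal round trip on a short made-up string. It uses plain lists instead of NumPy arrays, so it is a sketch of the idea rather than the exact cell above.

```python
text = "hello world"  # hypothetical sample text

# Sorted unique characters form the vocabulary
vocab = sorted(set(text))

# Character -> index, and index -> character
char2idx = {ch: i for i, ch in enumerate(vocab)}
idx2char = list(vocab)

# Encode the text as integers, then decode it back
encoded = [char2idx[c] for c in text]
decoded = ''.join(idx2char[i] for i in encoded)

print(encoded)
print(decoded)
```

The largest index is `len(vocab) - 1`, and decoding the encoded sequence gives back the original text.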

In [None]:
print('{')
for char,_ in zip(char2idx, range(20)):
    print('  {:4s}: {:3d},'.format(repr(char), char2idx[char]))
print('  ...\n}')

{
  ' ' :   0,
  '!' :   1,
  '"' :   2,
  "'" :   3,
  '(' :   4,
  ')' :   5,
  ',' :   6,
  '-' :   7,
  '.' :   8,
  '0' :   9,
  '1' :  10,
  '2' :  11,
  '3' :  12,
  '4' :  13,
  ':' :  14,
  ';' :  15,
  '?' :  16,
  'A' :  17,
  'B' :  18,
  'C' :  19,
  ...
}


In [None]:
# Show how the first 13 characters from the text are mapped to integers
print('{} ---- characters mapped to int ---- > {}'.format(repr(text_all[:13]), text_as_int[:13]))

'Come up to me' ---- characters mapped to int ---- > [19 57 55 47  0 63 58  0 62 57  0 55 47]


### The prediction task

Given a character, or a sequence of characters, what is the most probable next character? This is the task you're training the model to perform. The input to the model will be a sequence of characters, and you train the model to predict the output—the following character at each time step.

Because RNNs maintain an internal state that depends on the previously seen elements, the question becomes: given all the characters seen so far, what is the next character?


### Create training examples and targets

Next divide the text into example sequences. Each input sequence will contain `seq_length` characters from the text.

For each input sequence, the corresponding targets contain the same length of text, except shifted one character to the right.

So break the text into chunks of `seq_length+1`. For example, say `seq_length` is 4 and our text is "Hello". The input sequence would be "Hell", and the target sequence "ello".
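The "Hello" example can be reproduced with a few lines of plain Python. The helper name `make_examples` is hypothetical and only illustrates the chunk-and-shift idea; the notebook itself does this with `tf.data`.

```python
def make_examples(text, seq_length):
    """Split text into (input, target) pairs of length seq_length."""
    chunk_len = seq_length + 1
    examples = []
    # Step through the text in non-overlapping chunks, dropping any remainder
    for start in range(0, len(text) - chunk_len + 1, chunk_len):
        chunk = text[start:start + chunk_len]
        # Input is everything but the last character,
        # target is everything but the first
        examples.append((chunk[:-1], chunk[1:]))
    return examples

print(make_examples("Hello", 4))
print(make_examples("Hello world!", 4))
```

In the second call the trailing characters that do not fill a whole chunk are dropped, which mirrors `drop_remainder=True`.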

To do this, first use the `tf.data.Dataset.from_tensor_slices` function to convert the text vector into a stream of character indices.

In [None]:
# The maximum length sentence you want for a single input in characters
seq_length = 100
examples_per_epoch = len(text_all)//(seq_length+1)

# Create training examples / targets
char_dataset = tf.data.Dataset.from_tensor_slices(text_as_int)

for i in char_dataset.take(5):
    print(idx2char[i.numpy()])

C
o
m
e
 


The `batch` method lets us easily convert these individual characters to sequences of the desired size.

In [None]:
sequences = char_dataset.batch(seq_length+1, drop_remainder=True)

for item in sequences.take(5):
    print(repr(''.join(idx2char[item.numpy()])))

"Come up to meet you, Tell you I'm sorry, You don't know how lovely you are. I had to find you, Tell y"
"ou I need you, Tell you I set you apart. Tell me your secrets, And ask me your questions, Oh let's go"
" back to the start. Runnin' in circles, Comin' up tails, Heads on a science apart. Nobody said it was"
" easy, It's such a shame for us to part. Nobody said it was easy, No one ever said it would be this h"
"ard. Oh take me back to the start. I was just guessin', At numbers and figures, Pullin' the puzzles a"


For each sequence, duplicate and shift it to form the input and target text by using the `map` method to apply a simple function to each batch:

In [None]:
def split_input_target(chunk):
    input_text = chunk[:-1]
    target_text = chunk[1:]
    return input_text, target_text

dataset = sequences.map(split_input_target)

Print the first example input and target values:

In [None]:
for input_example, target_example in  dataset.take(1):
    print('Input data: ', repr(''.join(idx2char[input_example.numpy()])))
    print('Target data:', repr(''.join(idx2char[target_example.numpy()])))

Input data:  "Come up to meet you, Tell you I'm sorry, You don't know how lovely you are. I had to find you, Tell "
Target data: "ome up to meet you, Tell you I'm sorry, You don't know how lovely you are. I had to find you, Tell y"


Each index of these vectors is processed as one time step. For the input at time step 0, the model receives the index for "C" and tries to predict the index for "o" as the next character. At the next time step, it does the same thing, but the `RNN` considers the context from the previous step in addition to the current input character.

In [None]:
for i, (input_idx, target_idx) in enumerate(zip(input_example[:5], target_example[:5])):
    print("Step {:4d}".format(i))
    print("  input: {} ({:s})".format(input_idx, repr(idx2char[input_idx])))
    print("  expected output: {} ({:s})".format(target_idx, repr(idx2char[target_idx])))

Step    0
  input: 19 ('C')
  expected output: 57 ('o')
Step    1
  input: 57 ('o')
  expected output: 55 ('m')
Step    2
  input: 55 ('m')
  expected output: 47 ('e')
Step    3
  input: 47 ('e')
  expected output: 0 (' ')
Step    4
  input: 0 (' ')
  expected output: 63 ('u')


### Create training batches

You used `tf.data` to split the text into manageable sequences. But before feeding this data into the model, you need to shuffle the data and pack it into batches.

In [None]:
# Batch size
BATCH_SIZE = 64

# Buffer size to shuffle the dataset
# (TF data is designed to work with possibly infinite sequences,
# so it doesn't attempt to shuffle the entire sequence in memory. Instead,
# it maintains a buffer in which it shuffles elements).
BUFFER_SIZE = 10000

dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)

dataset

<BatchDataset shapes: ((64, 100), (64, 100)), types: (tf.int64, tf.int64)>
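To make the buffer comment concrete, here is a conceptual sketch of buffer-based shuffling in plain Python. It is not TensorFlow's implementation, and the function name `buffered_shuffle` is hypothetical. It also works out the sizes for this corpus: 27992 characters give 277 sequences of 101 characters, which is fewer than `BUFFER_SIZE`, so the whole dataset fits in the buffer and is fully shuffled.

```python
import random

def buffered_shuffle(items, buffer_size, seed=0):
    """Yield items in a locally shuffled order using a fixed-size buffer."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            buffer.append(item)        # still filling the buffer
            continue
        pick = rng.randrange(len(buffer))
        yield buffer[pick]             # emit a random buffered element
        buffer[pick] = item            # and replace it with the new one
    rng.shuffle(buffer)                # drain what is left at the end
    yield from buffer

num_sequences = 27992 // 101           # examples_per_epoch for this corpus
batches_per_epoch = num_sequences // 64
shuffled = list(buffered_shuffle(range(num_sequences), buffer_size=10000))
print(num_sequences, batches_per_epoch)
```

With a buffer of size 1 the order is unchanged, which shows why a buffer much smaller than the dataset only shuffles locally.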

## Build The Model

Use `tf.keras.Sequential` to define the model. For this simple example three layers are used to define our model:

* `tf.keras.layers.Embedding`: The input layer. A trainable lookup table that will map the numbers of each character to a vector with `embedding_dim` dimensions;
* `tf.keras.layers.GRU`: A type of RNN with size `units=rnn_units` (You can also use an LSTM layer here.)
* `tf.keras.layers.Dense`: The output layer, with `vocab_size` outputs.

In [None]:
# Length of the vocabulary in chars
vocab_size = len(vocab)

# The embedding dimension
embedding_dim = 256

# Number of RNN units
rnn_units = 1024

In [None]:
def build_model(vocab_size, embedding_dim, rnn_units, batch_size):
    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(vocab_size, embedding_dim,
                                  batch_input_shape=[batch_size, None]),
        tf.keras.layers.GRU(rnn_units,
                            return_sequences=True,
                            stateful=True,
                            recurrent_initializer='glorot_uniform'),
        tf.keras.layers.Dense(vocab_size)
    ])
    return model

In [None]:
model = build_model(
    vocab_size=len(vocab),
    embedding_dim=embedding_dim,
    rnn_units=rnn_units,
    batch_size=BATCH_SIZE)

For each character the model looks up the embedding, runs the GRU one timestep with the embedding as input, and applies the dense layer to generate logits predicting the log-likelihood of the next character:

![A drawing of the data passing through the model](https://github.com/tensorflow/docs/blob/master/site/en/tutorials/text/images/text_generation_training.png?raw=1)

Note that a Keras sequential model is used here because every layer in the model has a single input and produces a single output. If you want to retrieve and reuse the states from the stateful RNN layer, you might want to build the model with the Keras functional API or model subclassing. See the [Keras RNN guide](https://www.tensorflow.org/guide/keras/rnn#rnn_state_reuse) for more details.

## Try the model

Run the model to see that it behaves as expected.

First check the shape of the output:

In [None]:
for input_example_batch, target_example_batch in dataset.take(1):
    example_batch_predictions = model(input_example_batch)
    print(example_batch_predictions.shape, "# (batch_size, sequence_length, vocab_size)")

(64, 100, 70) # (batch_size, sequence_length, vocab_size)


In the above example the sequence length of the input is `100` but the model can be run on inputs of any length:

In [None]:
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (64, None, 256)           17920     
_________________________________________________________________
gru (GRU)                    (64, None, 1024)          3938304   
_________________________________________________________________
dense (Dense)                (64, None, 70)            71750     
Total params: 4,027,974
Trainable params: 4,027,974
Non-trainable params: 0
_________________________________________________________________
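The parameter counts in the summary can be checked by hand. The sketch below assumes the GRU uses Keras' default `reset_after=True`, which gives each of the three weight blocks (update gate, reset gate, candidate state) two bias vectors; under that assumption the numbers match the summary exactly.

```python
vocab_size, embedding_dim, rnn_units = 70, 256, 1024

# Embedding: one embedding_dim-sized vector per character
embedding_params = vocab_size * embedding_dim

# GRU: three blocks, each with input weights, recurrent weights
# and two bias vectors (assumes reset_after=True)
gru_params = 3 * (rnn_units * embedding_dim + rnn_units * rnn_units + 2 * rnn_units)

# Dense: weights plus one bias per output character
dense_params = rnn_units * vocab_size + vocab_size

total_params = embedding_params + gru_params + dense_params
print(embedding_params, gru_params, dense_params, total_params)
```

Almost all of the roughly four million parameters sit in the GRU layer, so `rnn_units` is the setting that most affects model size.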


To get actual predictions from the model, sample from the output distribution to obtain character indices. This distribution is defined by the logits over the character vocabulary.

Note: It is important to _sample_ from this distribution as taking the _argmax_ of the distribution can easily get the model stuck in a loop.

Try it for the first example in the batch:

In [None]:
sampled_indices = tf.random.categorical(example_batch_predictions[0], num_samples=1)
sampled_indices = tf.squeeze(sampled_indices,axis=-1).numpy()

This gives us, at each timestep, a prediction of the next character index:

In [None]:
sampled_indices

array([58, 66,  1, 45, 32, 22, 44,  7,  9, 51,  0, 24, 37, 12, 46,  0, 11,
       30, 42, 17, 58, 12, 67, 51,  8, 26, 55, 61,  0, 48, 58,  0, 64, 68,
       57, 17,  0, 23, 11, 61,  1,  1, 49,  1, 62, 43, 12, 68, 25, 61, 56,
       15, 61, 25, 55, 38, 47,  6, 15, 43,  5, 60, 15, 15, 28, 36, 21, 36,
       24, 54, 52, 47, 29, 41, 50, 56, 29,  6, 41, 49, 52, 15, 15, 64, 40,
       18, 54, 20, 49, 37, 25, 11, 50,  3, 15, 45, 13, 51, 26,  0])

Decode these to see the text predicted by this untrained model:

In [None]:
print("Input: \n", repr("".join(idx2char[input_example_batch[0]])))
print()
print("Next Char Predictions: \n", repr("".join(idx2char[sampled_indices ])))

Input: 
 ' bubble. And I never meant to cause you trouble, Oh I never meant to do you wrong, And I, well if I '

Next Char Predictions: 
 "px!cPFb-0i HU3d 2N]Ap3yi.Jms fp vzoA G2s!!g!ta3zIsn;sImVe,;a)r;;LTETHljeM[hnM,[gj;;vYBlDgUI2h';c4iJ "


## Train the model

At this point the problem can be treated as a standard classification problem. Given the previous RNN state and the input at this time step, predict the class of the next character.

### Attach an optimizer, and a loss function

The standard `tf.keras.losses.sparse_categorical_crossentropy` loss function works in this case because it is applied across the last dimension of the predictions.

Because your model returns logits, you need to set the `from_logits` flag.


In [None]:
def loss(labels, logits):
    return tf.keras.losses.sparse_categorical_crossentropy(labels, logits, from_logits=True)

example_batch_loss = loss(target_example_batch, example_batch_predictions)
print("Prediction shape: ", example_batch_predictions.shape, " # (batch_size, sequence_length, vocab_size)")
print("scalar_loss:      ", example_batch_loss.numpy().mean())

Prediction shape:  (64, 100, 70)  # (batch_size, sequence_length, vocab_size)
scalar_loss:       4.249137
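The reported loss of about 4.249 is what one would expect from an untrained model: if the logits are nearly uniform over the 70 characters, the cross-entropy is close to ln 70 ≈ 4.2485. The plain-Python sketch below computes the per-character loss from logits to show this; the helper name `sparse_ce_from_logits` is hypothetical.

```python
import math

def sparse_ce_from_logits(label, logits):
    """Cross-entropy for one integer label, computed directly from logits."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[label]

# Uniform logits over 70 characters: every character is equally likely
uniform_loss = sparse_ce_from_logits(0, [0.0] * 70)
print(uniform_loss, math.log(70))
```

A loss well below this value after training indicates the model has learned something beyond guessing uniformly.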


Configure the training procedure using the `tf.keras.Model.compile` method. Use `tf.keras.optimizers.Adam` with default arguments and the loss function.

In [None]:
model.compile(optimizer='adam', loss=loss)

### Execute the training

Train the model for 30 epochs. In Colab, set the runtime to GPU for faster training.

In [None]:
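EPOCHS = 30
# Save the weights after each epoch so they can be restored for generation
checkpoint_dir = './training_checkpoints'
checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
    filepath=os.path.join(checkpoint_dir, 'ckpt_{epoch}'), save_weights_only=True)

In [None]:
history = model.fit(dataset, epochs=EPOCHS, callbacks=[checkpoint_callback])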

Epoch 1/30


ValueError: ignored

## Generate text

### Restore the latest checkpoint

To keep this prediction step simple, use a batch size of 1.

Because of the way the RNN state is passed from timestep to timestep, the model only accepts a fixed batch size once built.

To run the model with a different `batch_size`, you need to rebuild the model and restore the weights from the checkpoint.


In [None]:
tf.train.latest_checkpoint(checkpoint_dir)

'./training_checkpoints/ckpt_10'

In [None]:
model = build_model(vocab_size, embedding_dim, rnn_units, batch_size=1)

model.load_weights(tf.train.latest_checkpoint(checkpoint_dir))

model.build(tf.TensorShape([1, None]))



In [None]:
model.summary()

Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (1, None, 256)            17920     
_________________________________________________________________
gru_2 (GRU)                  (1, None, 1024)           3938304   
_________________________________________________________________
dense_2 (Dense)              (1, None, 70)             71750     
Total params: 4,027,974
Trainable params: 4,027,974
Non-trainable params: 0
_________________________________________________________________


### The prediction loop

The following code block generates the text:

* Begin by choosing a start string, initializing the RNN state and setting the number of characters to generate.

* Get the prediction distribution of the next character using the start string and the RNN state.

* Then, use a categorical distribution to calculate the index of the predicted character. Use this predicted character as our next input to the model.

* The RNN state returned by the model is fed back into the model so that it now has more context, instead of only one character. After predicting the next character, the modified RNN states are again fed back into the model, which is how it accumulates context from the previously predicted characters.
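The steps above can be sketched without TensorFlow by swapping in a much simpler model. The example below uses a character bigram table as a hypothetical stand-in for the RNN: its only "state" is the current character, but the loop has the same shape (predict a distribution, sample from it, feed the sample back in). The training text and function names are illustrative assumptions.

```python
import random
from collections import Counter, defaultdict

def fit_bigram(text):
    """Count which character follows which (a stand-in for training)."""
    counts = defaultdict(Counter)
    for a, b in zip(text, text[1:]):
        counts[a][b] += 1
    return counts

def generate(counts, start_string, num_generate, seed=0):
    rng = random.Random(seed)
    current = start_string[-1]      # the bigram model's entire "state"
    out = []
    for _ in range(num_generate):
        options = counts.get(current)
        if not options:             # no known successor: stop early
            break
        chars, weights = zip(*options.items())
        # Sample the next character and feed it back as the next input
        current = rng.choices(chars, weights=weights, k=1)[0]
        out.append(current)
    return start_string + ''.join(out)

text = "oh take me back to the start, oh take me back to the start"
counts = fit_bigram(text)
generated = generate(counts, "sta", num_generate=20)
print(generated)
```

An RNN replaces the single-character state with a learned hidden vector, which is what lets it use context longer than one character.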


![To generate text the model's output is fed back to the input](https://github.com/tensorflow/docs/blob/master/site/en/tutorials/text/images/text_generation_sampling.png?raw=1)

Looking at the generated text below, you'll see that the model has not yet learned to form coherent words or sentences.

In [None]:
def generate_text(model, start_string):
    # Evaluation step (generating text using the learned model)

    # Number of characters to generate
    num_generate = 1000

    # Converting our start string to numbers (vectorizing)
    input_eval = [char2idx[s] for s in start_string]
    input_eval = tf.expand_dims(input_eval, 0)

    # Empty string to store our results
    text_generated = []

    # Low temperature results in more predictable text.
    # Higher temperature results in more surprising text.
    # Experiment to find the best setting.
    temperature = 1.0

    # Here batch size == 1
    model.reset_states()
    for i in range(num_generate):
        predictions = model(input_eval)
        # remove the batch dimension
        predictions = tf.squeeze(predictions, 0)

        # using a categorical distribution to predict the character returned by the model
        predictions = predictions / temperature
        predicted_id = tf.random.categorical(predictions, num_samples=1)[-1,0].numpy()

        # Pass the predicted character as the next input to the model
        # along with the previous hidden state
        input_eval = tf.expand_dims([predicted_id], 0)

        text_generated.append(idx2char[predicted_id])

    return (start_string + ''.join(text_generated))

In [None]:
print(generate_text(model, start_string=u"sky"))

sky,, gaoit d eMte a tritdge stgiyllnimd I to we l iI on  nt wamke tsn Taudmi iO tserideanUSAa]'; wo 't dr m udtn efey ioasispsto o gmheo et wku 'hyai'nA whyore onVpbt oyiml, ytbejhrwrr hu rpr yindds aoMNuiy nhmreronerrt ignues gon ol hag n' raesahoe uivt etae d, v k wn ao [ot es inmh e ia I yovtriac, w sltn ecirenp I Mwh l Coi nr e li s deos Iw,n aOzCOat pu s y teana t  t so, unn f tt,nnnovecrkuun Sm oyin ians t e t is oesehrs tgst [olL ld ste  at es ibfptenodVado, r. cis tsm yly t shBe soemoe nw,s temu teis - d?tnprohog y'Gcsonr y od's I t p Oh gat aAwd inne b la ikyfayu a r snl'r tnaie AlyWs ys oonee unuer rn saodntna[lco cng, Slne cn anna.md utr-eueVed is uh wo hhaye td lonolnnghJcDuowSvhgd 't n .r urti niufahtcnrs my i Ae h it etemslt go tlalleek Ion  twoe ooevdnf  tylc["fghe mpegin Inbl ee scnnvuu rrr r an-wsn  eon C m os'u svl  o e s ss e i tsfacit t ua ty lhatmre hrte elnrs nmdIl Yhvye e d yndVano tue ug hn wdmt h tI iacd se turn iogse'le rtI rricl ieolscrhtec iaft oiat os 'g e

The easiest thing you can do to improve the results is to train it for longer.

We can also predict with a different start string, try adding another RNN layer to improve the model's accuracy, or adjust the temperature parameter to generate more or less random predictions.
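To see what the temperature parameter does, the sketch below divides a set of hypothetical logits by different temperatures before applying a softmax, which is the same scaling `generate_text` applies before sampling. The logit values are made up for illustration.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # hypothetical scores for three characters
for temperature in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

Lower temperatures concentrate probability on the top-scoring character (more predictable text), while higher temperatures flatten the distribution (more surprising text).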

Instead, I decided to switch to a word-based model.

## Word-based model with LSTM

### Train an ad hoc word embedding.


In [None]:
import nltk 
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [None]:
import numpy as np
import pandas as pd
import os
import re
import string
import tensorflow as tf

from tensorflow.keras import Model, Sequential
from tensorflow.keras.layers import Activation, LSTM, Dropout, Dense, Dot, Embedding, Flatten, GlobalAveragePooling1D, Reshape
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization

### Word-Level Model with LSTM

In [None]:
with open('/content/lyrics_for_generation.txt') as lyrics_file:
    lyrics = lyrics_file.read()

In [None]:
lyrics = lyrics[:-215]  # drop the last 215 characters to get rid of Spanish characters

In [None]:
from string import punctuation
import re

# turn a doc into clean tokens
def clean_doc(doc):
	# replace '--' with a space ' '
	doc = doc.replace('--', ' ')

	# remove punctuation but keep apostrophes; '\n' is not part of
	# string.punctuation, so line breaks survive this step untouched
	my_punctuation = punctuation.replace("'", "")
	doc = doc.translate(str.maketrans("", "", my_punctuation))

	# split into tokens but keep '\n' as a token
	tokens = re.findall(r'\S+|\n',doc)
	# remove tokens that consist only of digits
	tokens = [x for x in tokens if not x.isdigit()]
	# make lower case
	tokens = [word.lower() for word in tokens]
	return tokens
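The pattern `\S+|\n` is what keeps line breaks in the token stream. A small comparison against `str.split()` on a two-line sample makes the difference visible; the sample uses two lines that appear in the lyrics, with punctuation added for illustration.

```python
import re
from string import punctuation

sample = "I used to rule the world,\nFeel the fear in my enemy's eyes!"

# Strip punctuation but keep apostrophes, as clean_doc does
keep_apostrophe = punctuation.replace("'", "")
stripped = sample.translate(str.maketrans("", "", keep_apostrophe))

# '\S+' matches a run of non-whitespace, '\n' matches a line break on its own
with_newlines = [t.lower() for t in re.findall(r'\S+|\n', stripped)]

# str.split() treats the newline as ordinary whitespace and discards it
without_newlines = stripped.lower().split()

print(with_newlines)
print(without_newlines)
```

Keeping `'\n'` as a token lets the word-level model learn where lines end, which matters for lyrics.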

In [None]:
tokens = clean_doc(lyrics)

In [None]:
print(tokens[:200])

['i', 'used', 'to', 'rule', 'the', 'world', '\n', 'seas', 'would', 'rise', 'when', 'i', 'gave', 'the', 'word', '\n', 'now', 'in', 'the', 'morning', 'i', 'sleep', 'alone', '\n', 'sweep', 'the', 'streets', 'i', 'used', 'to', 'own', '\n', '\n', 'i', 'used', 'to', 'roll', 'the', 'dice', '\n', 'feel', 'the', 'fear', 'in', 'my', "enemy's", 'eyes', '\n', 'listen', 'as', 'the', 'crowd', 'would', 'sing', '\n', 'now', 'the', 'old', 'king', 'is', 'dead', 'long', 'live', 'the', 'king', '\n', '\n', 'one', 'minute', 'i', 'held', 'the', 'key', '\n', 'next', 'the', 'walls', 'were', 'closed', 'on', 'me', '\n', 'and', 'i', 'discovered', 'that', 'my', 'castles', 'stand', '\n', 'upon', 'pillars', 'of', 'salt', 'and', 'pillars', 'of', 'sand', '\n', '\n', 'i', 'hear', 'jerusalem', 'bells', 'are', 'ringing', '\n', 'roman', 'cavalry', 'choirs', 'are', 'singing', '\n', 'be', 'my', 'mirror', 'my', 'sword', 'and', 'shield', '\n', 'my', 'missionaries', 'in', 'a', 'foreign', 'field', '\n', '\n', 'for', 'some', 're

In [None]:
print('Total Tokens: %d' % len(tokens))
print('Unique Tokens: %d' % len(set(tokens)))

Total Tokens: 10006
Unique Tokens: 1231


In [None]:
def load_doc(filename):
	# open the file as read only
	file = open(filename, 'r')
	# read all text
	text = file.read()
	# close the file
	file.close()
	return text

def save_doc(lines, filename):
	data = '\n'.join(lines)
	file = open(filename, 'w')
	file.write(data)
	file.close()

In [None]:
# organize into sequences of tokens
length = 50 + 1
sequences = list()
for i in range(length, len(tokens)):
	# select sequence of tokens
	seq = tokens[i-length:i]
	# convert into a line
	line = ' '.join(seq)
	# store
	sequences.append(line)
print('Total Sequences: %d' % len(sequences))
 
# save sequences to file
out_filename = 'lyrics_sequences.txt'
save_doc(sequences, out_filename)

Total Sequences: 9955
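The sequence count follows directly from the sliding window: 10006 tokens with a window of 51 give 10006 - 51 = 9955 sequences. The small sketch below applies the same loop to six toy tokens; the helper name `sliding_windows` is hypothetical.

```python
def sliding_windows(tokens, length):
    """Return every window produced by the loop used in the notebook."""
    return [tokens[i - length:i] for i in range(length, len(tokens))]

toy_tokens = ['a', 'b', 'c', 'd', 'e', 'f']
windows = sliding_windows(toy_tokens, 4)
for w in windows:
    print(w)

# Same arithmetic for the real corpus
print(10006 - 51)
```

Because `range` stops at `len(tokens) - 1`, the very last possible window (`['c', 'd', 'e', 'f']` here) is not produced; using `len(tokens) + 1` as the upper bound would include it.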


In [None]:
# integer encode sequences of words
# import tensorflow as tf
# tokenizer = tf.keras.preprocessing.text.Tokenizer()
# tokenizer.fit_on_texts(sequences)
# sequences = tokenizer.texts_to_sequences(sequences)

In [None]:
word2num = {}
for idx, word in enumerate(set(tokens)):
  word2num[word] = idx+1

In [None]:
seqs = []
for sequence in sequences:
  seqs.append([word2num[w] for w in re.findall(r'\S+|\n',sequence)])

Words are assigned values from 1 to the total number of unique words (here 1,231). The Embedding layer needs to allocate a vector for every index from 0 up to the largest index; because array indexing is zero-based and the last word in the vocabulary has index 1,231, the array must be 1,231 + 1 = 1,232 entries long.



In [None]:
# vocabulary size
vocab_size = len(set(tokens)) + 1
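A toy vocabulary shows why the extra slot is needed. The sketch below also uses `sorted(set(tokens))` instead of a bare `set`; that is a suggested variation rather than what the notebook does, chosen because the iteration order of a set of strings can differ between Python processes, which matters if the mapping has to be rebuilt after saving a model.

```python
tokens = ['i', 'used', 'to', 'rule', 'the', 'world', '\n', 'i', 'used', 'to', 'own']

# sorted() gives a reproducible order across runs
word2num = {word: idx + 1 for idx, word in enumerate(sorted(set(tokens)))}

# +1 because index 0 is never assigned to a word
vocab_size = len(word2num) + 1

largest_index = max(word2num.values())
print(word2num)
print(vocab_size, largest_index)
```

The largest index equals `vocab_size - 1`, so an embedding table with `vocab_size` rows covers every word, with row 0 left free for padding.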

Keras provides `to_categorical()`, which can be used to ***one-hot encode*** the output word of each input-output sequence pair.

In [None]:
# separate into input and output
from tensorflow.keras.utils import to_categorical
import numpy as np

sequences = np.array(seqs)
X, y = sequences[:,:-1], sequences[:,-1]
y = to_categorical(y, num_classes=vocab_size)
seq_length = X.shape[1]
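For intuition, here is what one-hot encoding does to a single label, along with a rough size estimate for the full target matrix. The helper `one_hot` is a hypothetical plain-Python stand-in for `to_categorical`, and the size estimate assumes 4-byte floats.

```python
def one_hot(index, num_classes):
    """Return a list of zeros with a single 1.0 at position index."""
    row = [0.0] * num_classes
    row[index] = 1.0
    return row

encoded = one_hot(3, 5)
print(encoded)

# Size of y for this corpus: one row per sequence, one column per word
vocab_size = 1232
num_sequences = 9955
approx_megabytes = num_sequences * vocab_size * 4 / 1e6  # assuming 4-byte floats
print(round(approx_megabytes, 1), "MB")
```

At this scale the matrix is manageable, but it grows with both corpus and vocabulary size; the `sparse_categorical_crossentropy` loss used for the character model accepts integer labels directly and avoids building it.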

In [None]:
X

array([[1062,  992,  336, ...,  501,   74,  545],
       [ 992,  336, 1006, ...,   74,  545,   64],
       [ 336, 1006,   64, ...,  545,   64,  667],
       ...,
       [1191,  515,  525, ...,   89,  873,   64],
       [ 515,  525,  501, ...,  873,   64,  451],
       [ 525,  501, 1202, ...,   64,  451, 1055]])

In [None]:
# define model
model = Sequential()
model.add(Embedding(vocab_size, 50, input_length=seq_length))
model.add(LSTM(100, return_sequences=True))
model.add(Dropout(0.25))
model.add(LSTM(100))
#model.add(Dropout(0.25))
model.add(Dense(100, activation='relu'))
model.add(Dense(vocab_size, activation='softmax'))
print(model.summary())

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 50, 50)            61600     
_________________________________________________________________
lstm (LSTM)                  (None, 50, 100)           60400     
_________________________________________________________________
dropout (Dropout)            (None, 50, 100)           0         
_________________________________________________________________
lstm_1 (LSTM)                (None, 100)               80400     
_________________________________________________________________
dense (Dense)                (None, 100)               10100     
_________________________________________________________________
dense_1 (Dense)              (None, 1232)              124432    
Total params: 336,932
Trainable params: 336,932
Non-trainable params: 0
__________________________________________________

In [None]:
# compile model
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
# fit model
model.fit(X, y, batch_size=32, epochs=100)


Epoch 1/100
...
Epoch 78

<tensorflow.python.keras.callbacks.History at 0x7fa8308ffd68>

In [None]:
# save model if needed for further training
# from pickle import dump
# model.save('model.h5')

# save the tokenizer
# dump(tokenizer, open('tokenizer.pkl', 'wb'))

### Predict

In [None]:
# load the model
# from tensorflow.keras.models import load_model
# model = load_model('model.h5')

# load the tokenizer
# tokenizer = load(open('tokenizer.pkl', 'rb'))

In [None]:
seq_length = len(X[0])
num2word = {v:k for k,v in word2num.items()}

Randomly pick one training sequence from the lyrics, and predict the next word.

In [None]:
from random import randint
# pick any row of X; randint is inclusive on both ends
seed_text = X[randint(0, len(X) - 1)]
seed_text = ' '.join([num2word[w] for w in seed_text])
print(' '+seed_text)

 streets i used to own 
 
 i used to roll the dice 
 feel the fear in my enemy's eyes 
 listen as the crowd would sing 
 now the old king is dead long live the king 
 
 one minute i held the key 
 next the


In [None]:
encoded = [word2num[w] for w in re.findall(r'\S+|\n',seed_text)]
encoded = np.array(encoded).reshape(1,-1)

# predict probabilities for each word
yhat = np.argmax(model.predict(encoded, verbose=0),axis = 1)
num2word[yhat[0]]

'walls'

In [None]:
from tensorflow.keras.preprocessing.sequence import pad_sequences
# generate a sequence from a language model
def generate_seq(model, seq_length, seed_text, n_words):
	# reverse lookup table: integer index -> word
	index_to_word = {index: word for word, index in word2num.items()}
	result = list()
	in_text = seed_text
	# generate a fixed number of words
	for _ in range(n_words):
		# encode the text as integers
		encoded = [word2num[w] for w in re.findall(r'\S+|\n',in_text)]
		# keep only the last seq_length tokens (shorter inputs are left-padded with 0)
		encoded = pad_sequences([encoded], maxlen=seq_length, truncating='pre')
		# pick the most probable next word (greedy decoding)
		yhat = int(np.argmax(model.predict(encoded, verbose=0), axis=-1)[0])
		# map predicted word index to word ('' for the unused padding index 0)
		out_word = index_to_word.get(yhat, '')
		# append to input
		in_text += ' ' + out_word
		result.append(out_word)
	return ' '.join(result)
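The `pad_sequences` call is what keeps the model input at a fixed length as the generated text grows. With its default `padding='pre'` and the `truncating='pre'` used above, it keeps the last `maxlen` tokens and left-pads shorter inputs with zeros. The plain-Python function below mimics that behaviour for a single sequence; its name is hypothetical.

```python
def pad_or_truncate_pre(seq, maxlen, value=0):
    """Keep the last maxlen items, left-padding with value if too short."""
    seq = list(seq)[-maxlen:]
    return [value] * (maxlen - len(seq)) + seq

print(pad_or_truncate_pre([1, 2, 3, 4, 5], 3))  # too long: oldest tokens dropped
print(pad_or_truncate_pre([1, 2], 4))           # too short: zeros added in front
```

Dropping the oldest tokens means each prediction is conditioned only on the most recent `seq_length` words.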

In [None]:
# generate new text
generated = generate_seq(model, seq_length, seed_text, 50)
print(' '+generated)

 walls were closed on me 
 and i discovered the streets of baltimore 
 
 well her heart was filled with gladness 
 when she saw those city lights 
 she said is a waterfall ah 
 and every tear 
 every teardrop is a waterfall 
 
 every tear


#### Compose lyrics with rhyming sentences like Coldplay's lyrics

I found a package, Phyme, that can be used to find rhyming words, which luckily saved a lot of time.

In [None]:
# !pip install phyme

Successfully installed phyme-0.0.9


In [None]:
from Phyme import Phyme
ph = Phyme()
# Try to get rhyming words of 'right'
ph.get_partner_rhymes('right')

In [None]:
from random import randint
seed_text = X[randint(0, len(X) - 1)]
seed_text = ' '.join([num2word[w] for w in seed_text])
coldplay_corpus = generate_seq(model, seq_length, seed_text, 1000)

In [None]:
seed_text = X[randint(0, len(X) - 1)]
seed_text = ' '.join([num2word[w] for w in seed_text])
coldplay_corpus2 = generate_seq(model, seq_length, seed_text, 1000)

In [None]:
corpus = coldplay_corpus + coldplay_corpus2
print(corpus)

In [None]:
# Array of final word of each line of corpus
final_corpus = coldplay_corpus.translate(str.maketrans("", "", punctuation))
lines = [l.strip() for l in final_corpus.split('\n') if len(l)>1]
list_of_words = [line.strip().split(" ") for line in lines]
# Try to clean some words from lyrics
final_words = [w[-1] for w in list_of_words if len(w)>1]
final_words = ['o' if w=='oooooo' else w for w in final_words]
final_words = ['oh' if w=='oh…' else w for w in final_words]
final_words = ['boom' if w=='baboomboom' else w for w in final_words]
final_words = ['n' if w=='oooooooonnnn' else w for w in final_words]
word_dict = {i:w for i, w in enumerate(final_words)}
rhyme_dict = {}
for i, word1 in word_dict.items():
    try:
      rhyming = ph.get_consonant_rhymes(word1).values()
      rhyming_words = [j for sub in rhyming for j in sub] 
      rhymes = []
    
      for j, word2 in word_dict.items():
          if word2!=word1 and word2 in rhyming_words:
              print(j,word2)
              rhymes.append(j)
      rhyme_dict[i] = rhymes
    except Exception:
      # skip words for which the rhyme lookup fails
      continue
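The structure of `rhyme_dict` is easier to see on a handful of lines. The sketch below replaces Phyme with a deliberately crude, hypothetical rule (two different words "rhyme" if their last three letters match) so that it runs without the package; it only illustrates the index-to-partners mapping, not real rhyme detection.

```python
def toy_rhymes(a, b):
    """Hypothetical stand-in for Phyme: compare the last three letters."""
    return a != b and a[-3:] == b[-3:]

lines = [
    "be my mirror my sword and shield",
    "my missionaries in a foreign field",
    "i hear jerusalem bells are ringing",
    "roman cavalry choirs are singing",
    "its a wonderful life",
]

# Final word of each line, indexed by line number
final_words = [line.split()[-1] for line in lines]

# For each line, the indices of the other lines whose final word rhymes
rhyme_dict = {
    i: [j for j, w2 in enumerate(final_words) if toy_rhymes(w1, w2)]
    for i, w1 in enumerate(final_words)
}
print(rhyme_dict)
```

Lines with an empty list have no rhyming partner and cannot start a couplet.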

Here is the interesting part:
Create a function to generate lyrics with a structure of two rhyming couplets followed by a repeated line.

In [None]:
import random
def generate_couplet(rhymes):
    # only indices that made it into the rhyme dictionary are valid keys
    keys = list(rhymes)
    while True:
        i = random.choice(keys)
        if len(rhymes[i]) >= 1:
            # candidate lines: line i plus every line that rhymes with it
            pool = rhymes[i] + [i]
            samples = random.sample(pool, 2)
            # resample until the two chosen lines are different
            while lines[samples[0]] == lines[samples[1]]:
                samples = random.sample(pool, 2)
            return samples

In [None]:
def generate_lyrics(rhymes):
    a = generate_couplet(rhymes)
    b = generate_couplet(rhymes)
    c = [random.randrange(len(rhymes.keys()))]
    return a+b+c+c

# Convert lyrics from index array to string
def conv_lyrics(indices, lines):
    lyric = ""
    for i in indices:
        lyric += lines[i] + "\n"
    return lyric
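The stanza layout (couplet, couplet, repeated line) can be tried out on a tiny hand-written rhyme dictionary. This is a simplified sketch: `pick_couplet` chooses a line and one of its partners directly rather than sampling from a pool, and the names and the fixed seed are illustrative assumptions.

```python
import random

lines = [
    "be my mirror my sword and shield",
    "my missionaries in a foreign field",
    "i hear jerusalem bells are ringing",
    "roman cavalry choirs are singing",
    "its a wonderful life",
]
# Line index -> indices of lines that rhyme with it (written by hand here)
rhyme_dict = {0: [1], 1: [0], 2: [3], 3: [2], 4: []}

def pick_couplet(rhymes, rng):
    """Pick a line that has a partner, plus one of its partners."""
    candidates = [i for i, partners in rhymes.items() if partners]
    i = rng.choice(candidates)
    return [i, rng.choice(rhymes[i])]

def pick_stanza(rhymes, rng):
    a = pick_couplet(rhymes, rng)
    b = pick_couplet(rhymes, rng)
    c = [rng.choice(list(rhymes))]   # any line, repeated twice
    return a + b + c + c

rng = random.Random(0)
stanza = pick_stanza(rhyme_dict, rng)
print('\n'.join(lines[i] for i in stanza))
```

Nothing here prevents the two couplets from being the same pair of lines; excluding already-used indices would be a natural refinement.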

In [None]:
num_lyrics = 2
lyrics_ryhme = []
for _ in range(num_lyrics):
  lyric_coldplay = conv_lyrics(generate_lyrics(rhyme_dict), lines)
  lyrics_ryhme.append(lyric_coldplay)

In [None]:
for i in lyrics_ryhme:
  print(i)

be my mirror my sword and shield
my missionaries in a foreign field
i hear jerusalem bells are ringing
roman cavalry choirs are singing
its a wonderful life
its a wonderful life

i hear jerusalem bells are ringing
roman cavalry choirs are singing
my missionaries in a foreign field
be my mirror my sword and shield
never felt so alive
never felt so alive



The final results are better than those of the character-based model. A natural next step would be to explore attention-based models.