# Lyrics Generation
**Veronica Bruno (230904), Cristina Galvez (230260) and Rafael Bardisa (231142)**

In this notebook, we prepare a dataset of lyrics from many famous songs and use it to train three deep learning models:
- RNN 
- LSTM
- BiDirectional LSTM

Once a model is trained, it can generate new song lyrics from a seed (an initial string of words). The generated text resembles the patterns the model learned during training.

In [2]:
# import Keras library
from keras.models import Sequential, Model
from keras.layers import Dense, Activation, Dropout
from keras.layers import LSTM, SimpleRNN, Input, Bidirectional
from keras.callbacks import ModelCheckpoint
from tensorflow.keras.optimizers import Adam
from keras.callbacks import EarlyStopping
from keras.metrics import categorical_accuracy

# import spaCy and its small English model
# spaCy is used here to tokenize the lyrics
import spacy
nlp = spacy.load("en_core_web_sm")

#import other libraries
import numpy as np
import pandas as pd
import random
import sys
import os
import time
import codecs
import collections
from six.moves import cPickle

from google.colab import drive

# Mount Google Drive
drive.mount('/content/drive')
data_path = '/content/drive/Shareddrives/Deep Learning/DeepLearning_2022/Final Project/Data/'
results_path = '/content/drive/Shareddrives/Deep Learning/DeepLearning_2022/Final Project/Results/'

df = pd.read_csv(data_path + 'songdata.csv')

Mounted at /content/drive


### Data Preparation

Prepare the data from the dataset *'songdata.csv'* so that it can be used by all three models (RNN, LSTM and BiLSTM).

In [3]:
# join all song lyrics from the dataset in a long string
data = ', '.join(df['text'])

In [4]:
# function to create a wordlist
def create_wordlist(doc):
    wl = []
    for word in doc:
        if word.text not in ("\n","\n\n",'\u2009','\xa0'):
            wl.append(word.text.lower())
    return wl

In [5]:
# create array of words (in order)
wordlist = []
word_limit = 100000 # characters (not words) of the joined lyrics to process, limited by RAM

doc = nlp(data[0:word_limit])  # `data` is a string, so this slice counts characters
wl = create_wordlist(doc)
wordlist = wordlist + wl

In [6]:
# count the number of words
word_counts = collections.Counter(wordlist)

# Mapping from index to word : that's the vocabulary
vocabulary_inv = [x[0] for x in word_counts.most_common()]
vocabulary_inv = list(sorted(vocabulary_inv))

# Mapping from word to index
vocab = {x: i for i, x in enumerate(vocabulary_inv)}
words = [x[0] for x in word_counts.most_common()]

# size of the vocabulary
vocab_size = len(words)
print("Vocabulary size:", vocab_size)

# save the words and vocabulary
with open(results_path + "vocab_file.pkl", 'w+b') as f:
    cPickle.dump((words, vocab, vocabulary_inv), f)

Vocabulary size: 1832
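
To make the two mappings concrete, here is a minimal sketch on a toy word list (the words are invented for illustration). It follows the same steps as the cell above: count the words, sort the unique words alphabetically to get the index-to-word list, then invert that list to get the word-to-index dictionary.

```python
import collections

# hypothetical toy word list standing in for `wordlist`
toy_wordlist = ["you", "and", "me", "and", "you", "and", "i"]

toy_counts = collections.Counter(toy_wordlist)

# index -> word: unique words in alphabetical order
toy_vocabulary_inv = sorted(toy_counts)

# word -> index: the inverse mapping
toy_vocab = {word: index for index, word in enumerate(toy_vocabulary_inv)}

print(toy_vocabulary_inv)  # ['and', 'i', 'me', 'you']
print(toy_vocab["me"])     # 2
```

Note that the indices used later for one-hot encoding come from the alphabetical list, not from the frequency-ordered `words` list; both contain the same items, so their lengths agree.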


In [7]:
# create sequences of fixed length
sequences = []
next_words = []
seq_length = 30  # define sequence length
sequences_step = 1

for i in range(0, len(wordlist) - seq_length, sequences_step):
    sequences.append(wordlist[i: i + seq_length])
    next_words.append(wordlist[i + seq_length])

print('Number of sequences:', len(sequences))

Number of sequences: 24326
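
The sliding window can be sketched on a toy token list as follows. The helper name `make_windows` and the tokens are assumptions made for this illustration; the loop itself mirrors the cell above.

```python
def make_windows(tokens, seq_length, step=1):
    """Return (input windows, next words) for a sliding window over tokens."""
    windows, targets = [], []
    for start in range(0, len(tokens) - seq_length, step):
        windows.append(tokens[start:start + seq_length])
        targets.append(tokens[start + seq_length])
    return windows, targets

# hypothetical toy tokens
toy_tokens = ["i", "do", "i", "do", "i", "do"]
windows, targets = make_windows(toy_tokens, seq_length=3)
for window, target in zip(windows, targets):
    print(window, "->", target)
```

With a step of 1 the loop yields `len(tokens) - seq_length` pairs, so the 24326 sequences reported above correspond to 24356 tokens.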


In [8]:
# define data as matrices with 0s and 1s
# use the builtin `bool`: the `np.bool` alias is deprecated since NumPy 1.20
X = np.zeros((len(sequences), seq_length, vocab_size), dtype=bool)
y = np.zeros((len(sequences), vocab_size), dtype=bool)
for i, sentence in enumerate(sequences):
    for t, word in enumerate(sentence):
        X[i, t, vocab[word]] = 1
    y[i, vocab[next_words[i]]] = 1
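
Because every word becomes a one-hot vector of length `vocab_size`, the input tensor grows with sequences × sequence length × vocabulary size. The rough estimate below uses the figures printed by the earlier cells and assumes one byte per boolean element, which is how NumPy stores `bool` arrays.

```python
# figures reported by the cells above
num_sequences = 24326
seq_length = 30
vocab_size = 1832

# NumPy stores one boolean per byte, so bytes == number of elements
x_bytes = num_sequences * seq_length * vocab_size
y_bytes = num_sequences * vocab_size

print(f"X: {x_bytes / 1e9:.2f} GB")  # X: 1.34 GB
print(f"y: {y_bytes / 1e6:.1f} MB")  # y: 44.6 MB
```

This is why the amount of text processed has to be capped: doubling the number of sequences also tends to enlarge the vocabulary, so the memory needed for `X` grows faster than linearly.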



## Model Definition

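The first model is a simple **RNN**. Like the two builders that follow, it reads `units` and `learning_rate` from global variables that are set further down, so those must be defined before the builder is called: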

In [9]:
def rnn_model(seq_length, vocab_size):
    print('Build RNN model.')
    model = Sequential()
    model.add(SimpleRNN(units, activation="relu",input_shape=(seq_length, vocab_size))) # add RNN layer
    model.add(Dropout(0.6))
    model.add(Dense(vocab_size))
    model.add(Activation('softmax'))
    
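    optimizer = Adam(learning_rate=learning_rate)  # `lr` is a deprecated alias
    # note: this callback is never passed to fit(), so early stopping is not applied
    callbacks=[EarlyStopping(patience=2, monitor='val_loss')]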
    model.compile(loss='categorical_crossentropy', optimizer=optimizer, metrics=[categorical_accuracy])
    print("model built!")
    return model

The following is the **LSTM** model:

In [10]:
def lstm_model(seq_length, vocab_size):
    print('Build LSTM model.')
    model = Sequential()
    model.add(LSTM(units, activation="relu",input_shape=(seq_length, vocab_size))) # add LSTM layer
    model.add(Dropout(0.6))
    model.add(Dense(vocab_size))
    model.add(Activation('softmax'))
    
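    optimizer = Adam(learning_rate=learning_rate)  # `lr` is a deprecated alias
    # note: this callback is never passed to fit(), so early stopping is not applied
    callbacks=[EarlyStopping(patience=2, monitor='val_loss')]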
    model.compile(loss='categorical_crossentropy', optimizer=optimizer, metrics=[categorical_accuracy])
    print("model built!")
    return model

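The following function defines a **Bidirectional LSTM** model. The `Bidirectional` wrapper runs two LSTMs over the input sequence, one forward and one backward, and concatenates their outputs, so the layer produces twice as many features as the plain LSTM (512 instead of 256).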

In [11]:
def bidirectional_lstm_model(seq_length, vocab_size):
    print('Build LSTM model.')
    model = Sequential()
    model.add(Bidirectional(LSTM(units, activation="relu"),input_shape=(seq_length, vocab_size))) # add BiLSTM layer
    model.add(Dropout(0.6))
    model.add(Dense(vocab_size))
    model.add(Activation('softmax'))
    
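    optimizer = Adam(learning_rate=learning_rate)  # `lr` is a deprecated alias
    # note: this callback is never passed to fit(), so early stopping is not applied
    callbacks=[EarlyStopping(patience=2, monitor='val_loss')]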
    model.compile(loss='categorical_crossentropy', optimizer=optimizer, metrics=[categorical_accuracy])
    print("model built!")
    return model

Once the models are defined, we initialize them. First we set the learning rate and the number of units in the recurrent layer.

In [12]:
units = 256 # units in the model
learning_rate = 0.001 #learning rate

Initialize models:

In [13]:
# RNN
md_rnn = rnn_model(seq_length, vocab_size)
md_rnn.summary()

Build RNN model.
model built!
Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 simple_rnn (SimpleRNN)      (None, 256)               534784    
                                                                 
 dropout (Dropout)           (None, 256)               0         
                                                                 
 dense (Dense)               (None, 1832)              470824    
                                                                 
 activation (Activation)     (None, 1832)              0         
                                                                 
Total params: 1,005,608
Trainable params: 1,005,608
Non-trainable params: 0
_________________________________________________________________




In [14]:
# LSTM
md_lstm = lstm_model(seq_length, vocab_size)
md_lstm.summary()

Build LSTM model.
model built!
Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 lstm (LSTM)                 (None, 256)               2139136   
                                                                 
 dropout_1 (Dropout)         (None, 256)               0         
                                                                 
 dense_1 (Dense)             (None, 1832)              470824    
                                                                 
 activation_1 (Activation)   (None, 1832)              0         
                                                                 
Total params: 2,609,960
Trainable params: 2,609,960
Non-trainable params: 0
_________________________________________________________________


In [15]:
# Bidirectional LSTM
md_bilstm = bidirectional_lstm_model(seq_length, vocab_size)
md_bilstm.summary()

Build LSTM model.
model built!
Model: "sequential_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 bidirectional (Bidirectiona  (None, 512)              4278272   
 l)                                                              
                                                                 
 dropout_2 (Dropout)         (None, 512)               0         
                                                                 
 dense_2 (Dense)             (None, 1832)              939816    
                                                                 
 activation_2 (Activation)   (None, 1832)              0         
                                                                 
Total params: 5,218,088
Trainable params: 5,218,088
Non-trainable params: 0
_________________________________________________________________
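
The parameter counts in the three summaries can be reproduced by hand. The sketch below uses the standard formulas for a simple recurrent cell, an LSTM cell and a dense layer; the helper names are made up for this illustration.

```python
def simple_rnn_params(units, input_dim):
    # input weights + recurrent weights + bias
    return units * (input_dim + units + 1)

def lstm_params(units, input_dim):
    # three gates plus the cell candidate, each shaped like a simple RNN cell
    return 4 * simple_rnn_params(units, input_dim)

def dense_params(inputs, outputs):
    # weight matrix + bias
    return (inputs + 1) * outputs

units, vocab_size = 256, 1832

rnn_total = simple_rnn_params(units, vocab_size) + dense_params(units, vocab_size)
lstm_total = lstm_params(units, vocab_size) + dense_params(units, vocab_size)
# the bidirectional wrapper holds two LSTMs and concatenates their outputs
bilstm_total = 2 * lstm_params(units, vocab_size) + dense_params(2 * units, vocab_size)

print(rnn_total, lstm_total, bilstm_total)  # 1005608 2609960 5218088
```

The BiLSTM is roughly five times the size of the simple RNN, mostly because the one-hot input makes the input weight matrices large (1832 columns per gate).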




## Training the Model

**DO NOT RUN IF NOT NECESSARY**

**The training of the models can take up to an hour, and previously trained models (same data) can be loaded in the next section.**

In [30]:
batch_size = 32 # minibatch size
num_epochs = 25 # number of epochs
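
As a rough guide to how long an epoch is, the sketch below estimates the number of training batches. It assumes the 24326 sequences reported earlier and that `validation_split=0.1` (used in the training cells that follow) holds out the last 10% of the samples, rounding the training share down; treat the exact split as an assumption rather than a guarantee.

```python
import math

num_sequences = 24326   # reported in the data preparation section
batch_size = 32
validation_split = 0.1

# assumption: the first 90% of samples train, the rest validate
num_train = math.floor(num_sequences * (1 - validation_split))
num_val = num_sequences - num_train
steps_per_epoch = math.ceil(num_train / batch_size)

print(num_train, num_val, steps_per_epoch)  # 21893 2433 685
```

Since consecutive sequences overlap in all but one word, the held-out block is close in content to the training data near the split boundary, so the validation loss is likely an optimistic estimate.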

Train RNN with the prepared data:

In [None]:
# train the RNN model
history = md_rnn.fit(X, y,
                 batch_size=batch_size,
                 shuffle=True,
                 epochs=num_epochs,
                 validation_split=0.1)

# save the model
md_rnn.save(results_path + 'my_model_generate_sentences_rnn.h5')

Train LSTM with the prepared data:

In [None]:
# train the LSTM model
history = md_lstm.fit(X, y,
                 batch_size=batch_size,
                 shuffle=True,
                 epochs=num_epochs,
                 validation_split=0.1)

# save the model
md_lstm.save(results_path + 'my_model_generate_sentences_lstm.h5')

Train BiLSTM with the prepared data:

In [None]:
# train the Bidirectional LSTM model
history = md_bilstm.fit(X, y,
                 batch_size=batch_size,
                 shuffle=True,
                 epochs=num_epochs,
                 validation_split=0.1)

# save the model
md_bilstm.save(results_path + 'my_model_generate_sentences_bilstm.h5')
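
The model builders create an `EarlyStopping(patience=2, monitor='val_loss')` callback, but the `fit` calls above do not receive it, so every run lasts the full `num_epochs`. For the callback to take effect it has to be handed to `fit` through its `callbacks` argument. The stopping rule itself can be sketched in plain Python; this is a simplified illustration with invented loss values, not Keras code.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the 1-based epoch at which training would stop, or None."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None

# hypothetical validation losses, one per epoch
print(early_stopping_epoch([5.1, 4.8, 4.9, 5.0, 5.2]))  # 4
print(early_stopping_epoch([5.1, 4.8, 4.6, 4.5]))        # None
```

In the first run the loss stops improving after epoch 2, so training ends two epochs later; in the second it improves every epoch and training is never cut short.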

## Load Generated Data

To load a previously saved vocabulary:

In [19]:
# load vocabulary
print("loading vocabulary...")
vocab_file = os.path.join(results_path, "vocab_file.pkl")

with open(vocab_file, 'rb') as f:
    words, vocab, vocabulary_inv = cPickle.load(f)

vocab_size = len(words)

loading vocabulary...
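
The save and load steps form a simple pickle round trip. The sketch below reproduces it with a toy vocabulary and a temporary directory (both assumptions for the example); on Python 3, `six.moves.cPickle` is the standard `pickle` module, so `pickle` is used directly.

```python
import os
import pickle
import tempfile

# hypothetical toy vocabulary
words = ["and", "you", "me"]
vocabulary_inv = sorted(words)
vocab = {word: index for index, word in enumerate(vocabulary_inv)}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "vocab_file.pkl")
    # save the three objects as one tuple
    with open(path, "wb") as f:
        pickle.dump((words, vocab, vocabulary_inv), f)
    # load them back in the same order
    with open(path, "rb") as f:
        loaded_words, loaded_vocab, loaded_vocabulary_inv = pickle.load(f)

print(loaded_vocab == vocab)  # True
```

The vocabulary file has to come from the same run as the model being loaded, because the model's output indices only make sense with the `vocabulary_inv` list it was trained on.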


To load a trained model:

In [37]:
from keras.models import load_model

file_to_load = 'my_model_generate_sentences_rnn_10.h5' # model trained with 10 epochs
# can also try my_model_generate_sentences_rnn_50.h5 (with 50 epochs)

# load the RNN model
print("loading RNN model...")
model_rnn = load_model(results_path + file_to_load)

loading RNN model...


In [38]:
file_to_load = 'my_model_generate_sentences_lstm_10.h5' # model trained with 10 epochs
# can also try my_model_generate_sentences_lstm_50.h5 (with 50 epochs)

# load the LSTM model
print("loading LSTM model...")
model_lstm = load_model(results_path + file_to_load)

loading LSTM model...


In [39]:
file_to_load = 'my_model_generate_sentences_bilstm_25.h5' # model trained with 25 epochs
# can also try my_model_generate_sentences_bilstm_50.h5 (with 50 epochs)

# load the BiLSTM model
print("loading BiLSTM model...")
model_bilstm = load_model(results_path + file_to_load)

loading BiLSTM model...


## Lyrics Generation
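Define the functions that generate lyrics given a model, the number of words to generate and a seed sentence. The helper `sample` draws the next word index from the predicted distribution after rescaling it with a temperature; values below 1 favor the most likely words.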

In [23]:
def sample(preds, temperature=1.0):
    # helper function to sample an index from a probability array
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)
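
The effect of the temperature is easiest to see on a small distribution. The sketch below repeats the rescaling step of `sample` with the standard `math` module on invented probabilities; unlike the NumPy version it assumes every probability is strictly positive.

```python
import math

def apply_temperature(probs, temperature):
    """Rescale a probability list the way `sample` does before drawing."""
    # assumes all probabilities are > 0 (math.log fails on 0)
    scaled = [math.exp(math.log(p) / temperature) for p in probs]
    total = sum(scaled)
    return [s / total for s in scaled]

toy_probs = [0.5, 0.3, 0.2]  # hypothetical next-word probabilities
for temperature in (1.0, 0.33):
    rescaled = apply_temperature(toy_probs, temperature)
    print(temperature, [round(p, 3) for p in rescaled])
```

A temperature of 1.0 leaves the distribution unchanged, while 0.33 (the value used for generation in this notebook) concentrates most of the probability on the top word, trading variety for coherence.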

In [24]:
def generate_lyrics(model, words_to_generate, seed):
  # initialize the output text
  generated = ''
  seq_length = 30  # must match the sequence length used for training

  # shape the seed to what the network expects: keep at most the last
  # seq_length tokens and left-pad shorter seeds with the filler word "oh"
  seed_words = seed.split()[-seq_length:]
  sentence = ["oh"] * (seq_length - len(seed_words)) + seed_words

  generated += ' '.join(sentence)

  # then, generate the text one word at a time
  for i in range(words_to_generate):
      # one-hot encode the current window (every word must be in vocab)
      x = np.zeros((1, seq_length, vocab_size))
      for t, word in enumerate(sentence):
        x[0, t, vocab[word]] = 1.

      # predict the next-word distribution and sample from it
      preds = model.predict(x, verbose=0)[0]
      next_index = sample(preds, 0.33)
      next_word = vocabulary_inv[next_index]

      # add the next word to the text
      generated += " " + next_word
      # shift the window by one and append the next word at its end
      sentence = sentence[1:] + [next_word]

  # return the whole text
  return generated
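
`generate_lyrics` looks up every seed word in `vocab`, so a seed containing a word outside the vocabulary fails with a `KeyError`. The seed also has to be lowercase and tokenized like the training text (for example `let 's` rather than `let's`). A small pre-check along the following lines could catch such problems early; `prepare_seed` and the toy vocabulary are hypothetical names introduced for this sketch.

```python
def prepare_seed(seed, vocab, seq_length=30, filler="oh"):
    """Left-pad (or truncate) a seed to seq_length tokens and check the vocabulary."""
    seed_words = seed.split()[-seq_length:]
    unknown = [word for word in seed_words if word not in vocab]
    if unknown:
        raise ValueError(f"words not in vocabulary: {unknown}")
    return [filler] * (seq_length - len(seed_words)) + seed_words

# hypothetical toy vocabulary (word -> index)
toy_vocab = {"oh": 0, "i": 1, "do": 2, ",": 3}
print(prepare_seed("i do , i do", toy_vocab, seq_length=7))
# ['oh', 'oh', 'i', 'do', ',', 'i', 'do']
```

The filler word `oh` explains why the generated texts below start with `oh oh oh`: the seed has 27 tokens, so three fillers are needed to reach the window length of 30.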

#### Generate Lyrics

In [26]:
words_number = 200 # number of words to generate

# seed sentence to start the generating.
seed_sentences = '''do you want her , almost every day ? 'cause let 's say , we kinda do look the same i hate to think that every time''' 

Generate RNN:

In [35]:
# generate from RNN model
rnn_gen_lyrics = generate_lyrics(model_rnn, words_number, seed_sentences)
print('\n RNN\n', rnn_gen_lyrics)


 RNN
 oh oh oh do you want her , almost every day ? 'cause let 's say , we kinda do look the same i hate to think that every time  
 i 'm always givin ' a little more  
 i 've been brokenhearted  
 you 're out  
 we 're through  
 and i 'm blue  
 and i come you 're my love , my life  
 but the sun is a marionette  
 just a girl , you 're all  
 and i 've been you  
 and my life , baby , you 're , i have to go  
 and i do n't want to lose my life  
 but i 've got all my memories , those were my happiest days  
  
 where i 'm your eyes , i ca n't go  
 in the gloom  
 like an angel passing through my room  
  
 i 'm going in  
 and i 'm your eyes  
 i 'm the city , you 're me  
 and you 'll be you can to be  
 where the gloom  
 like an angel passing through my room  
  
 i know that you 're not the morning , i 'm not the point of no returning  
 but i 'm a marionette , baby , i do , i do , i do , i do ,


Generate LSTM:

In [36]:
# generate from LSTM model
lstm_gen_lyrics = generate_lyrics(model_lstm, words_number, seed_sentences)
print('\n LSTM\n', lstm_gen_lyrics)


 LSTM
 oh oh oh do you want her , almost every day ? 'cause let 's say , we kinda do look the same i hate to think that every time  
 for you  
 so on the i  
 and the when you see the the the thought you 're me  
 and the and me  
 you you 're me  
 and i know that you 're so bad  
 so and i 'm you and you , you me  
 and me you 're me  
 and you me me , you and me you me and i me  
 now i 'm the the the feeling  
 you 're take me  
  
 my and i 'm ?  
 i 'm a n't to to , to me  
 take it 's a crazy  
 and i have  
 my got they all  
 we and 've and i the blind you  
 and me the old but the him , you the the the way  
 you 're the by  
 so  
 and you 're you still the you  
 in the on the the the man of the every day  
 so the man in the a man in the middle , the middle , the in the middle , the middle , the middle , the middle , the middle , the 's


Generate Bidirectional LSTM:

In [29]:
# generate from BiLSTM model
bilstm_gen_lyrics = generate_lyrics(model_bilstm, words_number, seed_sentences)
print('\n BiLSTM\n', bilstm_gen_lyrics)




 BiLSTM
 oh oh oh do you want her , almost every day ? 'cause let 's say , we kinda do look the same i hate to think that every time for the time  
 but everything is a new song  
 and when you 're so sad  
 and you 're so free  
 that 's the things that you be  
  
 just a notion  
 that 's just look at that guy  
  
 just a face that in all  
 but she would be  
 but just special  
 as long as good as new , thank god it 's true  
 darling , we were always meant to stay together  
  
 making somebody happy is a question of give and learn  
 you can see that my love is just like a face that like a living in  
 a dreamworld  
  
 just just like a girl  
 just not just like cinderella  
 nina , pretty ballerina , who would ever ever ever only way  
 just just like cinderella , just cinderella  
 but just just like cinderella , just like cinderella  
 nina , pretty ballerina , who would ever think she could be the way  
 this is the moment she just just like cinderella , just like cinder