<a href="https://colab.research.google.com/github/shuchimishra/Tensorflow_projects/blob/main/Tensorflow_Code/NLP/Irish_lyrics_text_generation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<a href="https://colab.research.google.com/github/https-deeplearning-ai/tensorflow-1-public/blob/master/C3/W4/ungraded_labs/C3_W4_Lab_2_irish_lyrics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Ungraded Lab: Generating Text from Irish Lyrics

In the previous lab, you trained a model on just a single song. You might have found that the output text quickly becomes gibberish or repetitive. Even if you tweak the hyperparameters, the model is still limited by its vocabulary of only 263 words. The model will be more flexible if you train it on a much larger corpus, and that's what you'll do in this lab. You will use lyrics from more Irish songs and then see what the generated text looks like. You will also see how this impacts the process from data preparation to model training. Let's get started!

## Imports

In [None]:
import tensorflow as tf
import numpy as np

from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Embedding, LSTM, Dense, Bidirectional
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import Adam

## Building the Word Vocabulary

You will first download the lyrics dataset. These will be from a compilation of traditional Irish songs and you can see them [here](https://github.com/https-deeplearning-ai/tensorflow-1-public/blob/main/C3/W4/misc/Laurences_generated_poetry.txt).

In [None]:
!pip install gdown==5.1.0

In [None]:
# Download the dataset
!gdown --id 15UqmiIm0xwh9mt0IYq2z3jHaauxQSTQT

Next, you will lowercase and split the plain text into a list of sentences:

In [None]:
# Load the dataset (the context manager closes the file after reading)
with open('./irish-lyrics-eof.txt') as f:
    data = f.read()

# Lowercase and split the data into lines
corpus = data.lower().split('\n')

# Preview the results
print(corpus[0:5])

From here, you can initialize the `Tokenizer` class and generate the word index dictionary:

In [None]:
#Initialize the Tokenizer
tokenizer = Tokenizer()

# Generate the word index dictionary
tokenizer.fit_on_texts(corpus)
word_index = tokenizer.word_index

# Define the total words. You add 1 for the index `0` which is just the padding token.
word_count = len(word_index)+1

print("Word Index dictionary :",word_index)
print("Word Index Dict count :",word_count)
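
To make the word index less of a black box, here is a minimal pure-Python sketch of the idea: words are ranked by frequency and numbered from 1, leaving index `0` free for padding. The helper `build_word_index` and the three-line `toy_corpus` are hypothetical and only approximate what `Tokenizer` does (the real class also strips punctuation by default, which this sketch skips).

```python
from collections import Counter


def build_word_index(lines):
    """Map each word to a rank, most frequent first, starting at 1."""
    counts = Counter()
    for line in lines:
        # Lowercase and split on whitespace only (no punctuation filtering)
        counts.update(line.lower().split())
    # most_common() sorts by descending count; ties keep first-seen order
    ranked = counts.most_common()
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}


# Hypothetical toy corpus, not taken from the lyrics file
toy_corpus = ["come all ye maidens", "come all ye young men", "come away"]
toy_index = build_word_index(toy_corpus)

# Add 1 for the padding index 0, as in the cell above
toy_word_count = len(toy_index) + 1

print(toy_index)
print(toy_word_count)
```

The most frequent word gets the smallest index, so the dictionary printed for the real corpus can be read as a frequency ranking.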

## Preprocessing the Dataset

Next, you will generate the inputs and labels for your model. The process will be identical to the previous lab. The `xs` or inputs to the model will be padded sequences, while the `ys` or labels are one-hot encoded arrays.
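
As a rough illustration of that process, the sketch below runs the same steps on two made-up token lists using only NumPy: build every prefix of length two or more, pre-pad with zeros, split off the last token as the label, and one-hot encode it. The function `make_training_pairs`, the token values, and the vocabulary size of 8 are assumptions for demonstration; the lab itself uses `pad_sequences` and `to_categorical`.

```python
import numpy as np


def make_training_pairs(token_lines, vocab_size):
    """Turn tokenized lines into padded inputs and one-hot labels."""
    # Every prefix with at least two tokens becomes one training example
    prefixes = [line[: i + 1] for line in token_lines for i in range(1, len(line))]
    max_len = max(len(p) for p in prefixes)

    # Pre-pad with zeros so that the last column always holds the label
    padded = np.zeros((len(prefixes), max_len), dtype=np.int64)
    for row, prefix in enumerate(prefixes):
        padded[row, max_len - len(prefix):] = prefix

    # Inputs are all columns but the last; labels are the last column
    xs, ys = padded[:, :-1], padded[:, -1]

    # Row i of the identity matrix is the one-hot vector for class i
    one_hot = np.eye(vocab_size)[ys]
    return xs, ys, one_hot


# Hypothetical token lists standing in for two tokenized lyric lines
toy_lines = [[4, 2, 7, 1], [3, 5]]
toy_xs, toy_ys, toy_one_hot = make_training_pairs(toy_lines, vocab_size=8)

print(toy_xs)
print(toy_ys)
print(toy_one_hot.shape)
```

Because the padding goes in front, the label is always the final column, which is why a single slice is enough to separate inputs from labels.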

In [None]:
# Initialize the sequences list
input_sequences = []

# Loop over every line
for line in corpus:

    # Tokenize the current line.
    # Note: texts_to_sequences expects a list of texts and returns a list of
    # sequences, so wrap the line in a list and take element [0] to get a
    # flat (non-nested) token list.
    sequences = tokenizer.texts_to_sequences([line])[0]

    # Loop over the line several times to generate the subphrases
    for index in range(1, len(sequences)):

        # Generate the subphrase
        n_gram_sequences = sequences[: index + 1]

        # Append the subphrase to the sequences list
        input_sequences.append(n_gram_sequences)

# Get the length of the longest line
max_seq_length = max([len(seq) for seq in input_sequences])

# Pad all sequences
pad_seq = pad_sequences(input_sequences, maxlen=max_seq_length, padding="pre")


In [None]:
# Create inputs and labels by splitting off the last token of each subphrase.
# Note: pad_seq is a 2D array (one row per subphrase), so slice along both axes.
xs = pad_seq[:, :-1]
ys = pad_seq[:, -1]

# Convert the label into one-hot arrays
one_hot_encoded_label = tf.keras.utils.to_categorical(ys,
                                                        num_classes=word_count)

You can then print some of the examples as a sanity check.

In [None]:
# Get sample sentence
sentence = corpus[0].split()
print(f'sample sentence: {sentence}')

# Initialize token list
token_list = []

# Look up the indices of each word and append to the list
for word in sentence:
  token_list.append(tokenizer.word_index[word])

# Print the token list
print(token_list)

In [None]:
#Pick element
elem_number = 5

# Print the token list and the decoded phrase
# Note: sequences_to_texts expects a nested list (a list of sequences) as input
print("Token list : ",xs[elem_number])
print("Decoded text :", tokenizer.sequences_to_texts([xs[elem_number]]))

# Print the one-hot label and the token index it encodes
print("Corresponding one hot label : ",one_hot_encoded_label[elem_number])
print("Corresponding label index : ",np.argmax(one_hot_encoded_label[elem_number]))

In [None]:
#Pick element
elem_number = 4

# Print the token list and the decoded phrase
# Note: sequences_to_texts expects a nested list (a list of sequences) as input
print("Token list : ",xs[elem_number])
print("Decoded text :", tokenizer.sequences_to_texts([xs[elem_number]]))

# Print the one-hot label and the token index it encodes
print("Corresponding one hot label : ",one_hot_encoded_label[elem_number])
print("Corresponding label index : ",np.argmax(one_hot_encoded_label[elem_number]))

## Build and compile the Model

Next, you will build and compile the model. We placed some of the hyperparameters at the top of the code cell so you can easily tweak them later if you want.

In [None]:
from tensorflow.keras import layers

#Hyperparameter
embedding_dim = 100
lstm_size = 150
lr = 0.01

#Build the model
model = Sequential([
    layers.Embedding(input_dim=word_count,
                     output_dim=embedding_dim,
                     input_length=max_seq_length-1),
    layers.Bidirectional(layers.LSTM(lstm_size)),
    layers.Dense(word_count, activation='softmax')
])

# Create the optimizer with the learning rate defined above
adam_fn = Adam(learning_rate=lr)

# Use categorical crossentropy because this is a multi-class problem
model.compile(
    loss='categorical_crossentropy',
    optimizer=adam_fn,
    metrics=['accuracy']
)

# Print the model summary
model.summary()

## Train the model

From the model summary above, you'll notice that the number of trainable parameters is much larger than in the previous lab, which usually means slower training. Expect roughly 7 seconds per epoch with the GPU enabled in Colab, and around 76% accuracy after 100 epochs.
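
To see where the extra parameters come from, the sketch below counts them by hand for an Embedding, Bidirectional LSTM, and Dense stack like the one above, using the standard formulas for each layer. The function `count_trainable_params` is a hypothetical helper, and the vocabulary size of 2000 is a made-up stand-in for "a larger corpus"; only the 263-word figure comes from the previous lab. Both calls use this lab's layer sizes, so they isolate the effect of vocabulary size rather than reproduce the previous lab's exact model.

```python
def count_trainable_params(vocab_size, embedding_dim, lstm_units):
    """Count parameters of Embedding -> Bidirectional(LSTM) -> Dense(softmax)."""
    # One embedding vector per vocabulary entry
    embedding = vocab_size * embedding_dim

    # An LSTM has 4 gates, each with input weights, recurrent weights, and a bias
    one_direction = 4 * (embedding_dim + lstm_units + 1) * lstm_units

    # The bidirectional wrapper trains a forward and a backward copy
    bidirectional = 2 * one_direction

    # The dense layer sees both directions concatenated, plus one bias per output
    dense = (2 * lstm_units + 1) * vocab_size

    return embedding + bidirectional + dense


# 263 is the previous lab's vocabulary size; 2000 is a hypothetical larger one
small = count_trainable_params(vocab_size=263, embedding_dim=100, lstm_units=150)
large = count_trainable_params(vocab_size=2000, embedding_dim=100, lstm_units=150)

print(small)
print(large)
```

Only the embedding and dense terms depend on the vocabulary size, so those two layers account for all of the growth when the corpus gets bigger.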

In [None]:
epochs = 100

# Train the model
history = model.fit(xs, one_hot_encoded_label, epochs=epochs)

You can visualize the accuracy below to see how it fluctuates as the training progresses.

In [None]:
import matplotlib.pyplot as plt

# Plot utility
def plot_graphs(history, string):
  plt.plot(history.history[string])
  plt.xlabel("Epochs")
  plt.ylabel(string)
  plt.show()

# Visualize the accuracy
plot_graphs(history, 'accuracy')

## Generating Text

Now you can let the model make its own songs or poetry! Because it is trained on a much larger corpus, the results below should contain fewer repetitions than before. The code below picks the next word based on the highest probability output.

In [None]:
# Define seed text
seed_text = "help me obi-wan kinobi youre my only hope"

# Define total words to predict
next_words = 100

# Loop until desired length is reached
for _ in range(next_words):

	# Convert the seed text to a token sequence
	token_list = tokenizer.texts_to_sequences([seed_text])[0]

	# Pad the sequence
	token_list = pad_sequences([token_list], maxlen=max_seq_length-1, padding='pre')

	# Feed to the model and get the probabilities for each index
	probabilities = model.predict(token_list, verbose=0)

	# Get the index with the highest probability
	predicted = np.argmax(probabilities, axis=-1)[0]

	# Ignore if index is 0 because that is just the padding.
	if predicted != 0:

		# Look up the word associated with the index.
		output_word = tokenizer.index_word[predicted]

		# Combine with the seed text
		seed_text += " " + output_word

# Print the result
print(seed_text)

Here again is the code that gets the top 3 predictions and picks one at random.

In [None]:
# Define seed text
seed_text = "help me obi-wan kinobi youre my only hope"

# Define total words to predict
next_words = 100

# Loop until desired length is reached
for _ in range(next_words):

  # Convert the seed text to a token sequence
  token_list = tokenizer.texts_to_sequences([seed_text])[0]

  # Pad the sequence
  token_list = pad_sequences([token_list], maxlen=max_seq_length-1, padding='pre')

  # Feed to the model and get the probabilities for each index
  probabilities = model.predict(token_list, verbose=0)

  # Pick a random number from [1,2,3]
  choice = np.random.choice([1,2,3])

  # Sort the probabilities in ascending order
  # and get the random choice from the end of the array
  predicted = np.argsort(probabilities)[0][-choice]

  # Ignore if index is 0 because that is just the padding.
  if predicted != 0:

    # Look up the word associated with the index.
    output_word = tokenizer.index_word[predicted]

    # Combine with the seed text
    seed_text += " " + output_word

# Print the result
print(seed_text)
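
The selection step in the cell above can be isolated and tried on a small probability vector. This is a minimal sketch, assuming a made-up five-entry distribution `toy_probs` and a hypothetical helper `pick_from_top_k`; it uses NumPy's `default_rng` instead of the global `np.random.choice` so the draws can be seeded.

```python
import numpy as np


def pick_from_top_k(probabilities, k=3, rng=None):
    """Return one index chosen uniformly from the k most probable entries."""
    if rng is None:
        rng = np.random.default_rng()
    # argsort is ascending, so the last k positions hold the k largest values
    top_k = np.argsort(probabilities)[-k:]
    return int(rng.choice(top_k))


# Hypothetical next-word distribution over a five-word vocabulary
toy_probs = np.array([0.05, 0.10, 0.40, 0.25, 0.20])

# Collect the distinct picks over several seeded draws
picks = {pick_from_top_k(toy_probs, rng=np.random.default_rng(seed)) for seed in range(20)}
print(picks)
```

Note that the pick is uniform among the top three candidates: their relative probabilities are ignored once they make the cut. With `k=1` the function reduces to the greedy `argmax` used in the first generation cell.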

## Wrap Up

This lab shows the effect of having a larger dataset to train your text generation model. As expected, it takes longer to prepare and train, but the output is less likely to become repetitive or gibberish. Try tweaking the hyperparameters to see if you get better results. You can also find other text datasets and use them to train the model here.