# **NLG Using LSTM**

---

Natural Language Generation (NLG) is a software process that produces natural language output. NLG is the subfield of artificial intelligence and computational linguistics that is concerned with the construction of computer systems that can produce understandable texts in English or other human languages from some underlying non-linguistic representation of information.

Long short-term memory (LSTM) is a type of recurrent neural network (RNN) designed to mitigate the vanishing gradient problem present in traditional RNNs.  An LSTM can be used to predict the next word in a sequence of words. The neural network takes a sequence of words as input and outputs a score for every word in the vocabulary; these scores are converted into the probability that each word is the next word of the given sequence. During training, the model learns which words tend to follow a given context and assigns the probabilities accordingly.
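As a minimal sketch of the idea (not the model used in this notebook), the snippet below converts made-up scores for a four-word vocabulary into probabilities with softmax and picks the most probable next word. The vocabulary, the score values, and the function name `softmax` are illustrative assumptions.

```python
import math

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    # Subtract the highest score for numerical stability
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical vocabulary and made-up scores for the word that follows
# "and it came to" (illustrative values only)
vocab = ["pass", "the", "lord", "and"]
scores = [4.0, 1.0, 0.5, 0.2]

probs = softmax(scores)
best_index = probs.index(max(probs))
next_word = vocab[best_index]
print(next_word, round(probs[best_index], 3))
```

Choosing the highest-probability word at each step is only one possible decoding strategy; it is used here because it is the simplest to illustrate.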


# **Assignment Instructions**


---
<br>

### **Introduction**

For this assignment, you will train a long short-term memory (LSTM) model that serves as a language model.  So that training can be resumed, its progress reported, and the model used to generate text, the training process writes a pair of files for each completed epoch, each labeled with the corresponding epoch number:

  *  **Losses**.  This file records the average training loss for each epoch completed so far, one value per line.  It is used when plotting the loss curve.  The filename takes the format `losses_{epoch}.txt`.

  *  **Model**.  This is the LSTM model state saved at the end of each epoch.  The filename takes the format `lstm_model_{epoch}.pth`.
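To make the file naming and the losses format concrete, here is a small sketch that writes and re-reads a losses file using the same one-value-per-line layout. The loss values are made up, and a temporary folder stands in for the Google Drive folder; neither comes from an actual training run.

```python
import os
import tempfile

# Made-up average losses for three completed epochs (illustrative only)
losses = [6.12, 5.48, 5.03]
epoch = len(losses)

# A temporary folder stands in for the Google Drive folder
folder = tempfile.mkdtemp()
losses_path = os.path.join(folder, f"losses_{epoch}.txt")
model_name = f"lstm_model_{epoch}.pth"

# Write one loss value per line
with open(losses_path, "w") as f:
    for value in losses:
        f.write(f"{value}\n")

# Read the values back as floats
with open(losses_path) as f:
    loaded = [float(line.strip()) for line in f]

print(os.path.basename(losses_path), model_name, loaded)
```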

The time required to train the LSTM model is non-trivial.  It is likely that your Colab runtime will time out and disconnect before the model training is complete.  A disconnect also deletes any files in the Colab filespace.

To remedy this problem, you will need to create a dedicated folder in your Google Drive that this Colab notebook can access and where it can read and write files.  This will ensure that the necessary files, especially the model file, will not be automatically deleted if, and when, the Colab runtime disconnects.  The set-up instructions below explain how to create the Google Drive folder.  If you follow these instructions, this Colab will automatically recognize and connect to the location.

<br>

---

<br>

### **Set-up Google Drive**

Complete these steps to create the necessary Google Drive folder for this assignment:

1.   Open your Google Drive.  This will put you in your `MyDrive` folder by default.
2.   Click the `+ New` button in the upper left corner.
3.   Select `New folder`.
4.   In the dialog box that appears, type the name of the folder:  `LING 581`
5.   Click `Create`.
6.   Open the new `LING 581` folder you just created.
7.   Click the `+ New` button in the upper left corner.
8.   Select `New folder`.
9.   In the dialog box that appears, type the name of the folder:  `LSTM`
10.   Click `Create`.

This will ensure that your Google Drive has the necessary file path specified in Code Step 3 below:

    directory_path = '/content/drive/MyDrive/LING 581/LSTM/'

With your Google Drive properly configured, your LSTM files will be preserved and the LSTM training can resume from the last completed epoch should it get interrupted before it completes.
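If you want to confirm that the folder is in place before training, a check along these lines may help. This is a sketch only: `folder_is_ready` is a hypothetical helper that is not part of this notebook, and the demonstration uses a temporary folder so that it runs outside Colab as well.

```python
import os
import tempfile

def folder_is_ready(path):
    """Return True if the path exists and is a directory."""
    return os.path.isdir(path)

# In Colab, after mounting Drive, you would check the path shown above:
# folder_is_ready('/content/drive/MyDrive/LING 581/LSTM/')

# Demonstration with a temporary folder so the sketch runs anywhere
demo_root = tempfile.mkdtemp()
demo_path = os.path.join(demo_root, "LING 581", "LSTM")
before = folder_is_ready(demo_path)   # folder not created yet
os.makedirs(demo_path)
after = folder_is_ready(demo_path)    # folder now exists
print(before, after)
```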

**Note that when you run this Colab, you will be prompted to give access to your Google Drive so that the training can read and write files that will persist beyond the life of the runtime.**

<br>

---

<br>

### **Create the Assignment Report**

For this assignment, you will need the following resource:

*  `1 Nephi Chaps 1_4.txt`

To receive full credit for this assignment, you will need to create and submit the following:

1.   A report document that contains the full information and correct answers to the questions contained in Parts I, II, and III below.  Submit a PDF version of that document.  

  *  At the beginning of your report, include a short description of the text file.

  *  At the end of your report, provide a brief but detailed summary that includes:
      *   A concise overall analysis of the data from each step of your language processing.
      *   Your explanation of the value of using LSTM for natural language generation (NLG).
      *   What you learned by doing this assignment.  
      *   One suggestion for improving this assignment.
2.   The URL of your copy of the Colab notebook (with View permissions) with your data results.


<br>

### **Part I:  Train the Model**


---


**Complete these tasks:**

1. Copy the text file to the `MyDrive/LING 581/LSTM` folder you created in your Google Drive.  Ensure that there are no other files in this folder.

2.  Run code steps 1 - 4 below to start the training of the LSTM.  Do not modify any of the hyperparameters in code step 3.  Once you execute code step 4, the LSTM model training will begin.  The training code has been improved to use a GPU when one is available, in which case training may take only a minute or two.  
  
  As you have seen before, there are three runtime options for a free Colab account.  **The CPU option takes the longest** during peak hours with the provided text file.  During non-peak hours, the time required to train the model can be significantly less with each runtime option, but plan accordingly.

  If the Colab runtime disconnects before completing the total number of epochs, simply run code steps 1 - 4 again as soon as possible.  The Python code in this Colab notebook will automatically detect the last completed epoch from the files placed in the `LSTM` folder in your Google Drive and will begin training the next epoch in sequence.

  Repeat this as many times as necessary until all of the epochs specified by the `epochs` hyperparameter in code step 3 have been successfully completed.
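The resume behavior described above can be sketched as follows. The function name `remaining_epochs` and the example numbers are assumptions chosen for illustration; the notebook's own logic lives in code steps 2 and 4.

```python
def remaining_epochs(last_completed, total_epochs):
    """List the 1-based epoch numbers that still need to be trained."""
    return [epoch + 1 for epoch in range(last_completed, total_epochs)]

# Hypothetical situation: the runtime disconnected after epoch 3 of 10
todo = remaining_epochs(3, 10)
print(todo)

# Each remaining epoch writes this pair of files when it completes
for epoch in todo[:2]:
    print(f"losses_{epoch}.txt", f"lstm_model_{epoch}.pth")
```

When the last completed epoch equals the total, the list is empty and there is nothing left to train.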

<br>

### **Part II:  Analyze**


---

**Complete these tasks:**

1.  Once the model training is complete, record the descriptive data:

    *  Total number of tokens
    *  Total number of unique words (vocabulary)
    *  Type-to-token ratio (TTR)


2.  Answer the following questions:

  <ol type="a">
    <li>On average and compared with other texts, is the TTR value for this text high, low, or somewhere in between?  (You may want to look back at your results from the Heaps' Law assignment to get a comparison.)</li>

    <li>What does the TTR say about these first four chapters of 1 Nephi?</li>
  </ol>


3.  Run code step 5 to plot the loss curve.  Copy the loss curve plot to your report, then answer the following questions:

  <ol type="a">
    <li>Are there any unusual, unexpected, or interesting features of the loss curve?  If so, what are they?  Do they correspond to any events during the model training?</li>

    <li>Does the loss for every epoch decrease or are there epochs where the loss increases compared to the prior epoch?</li>

    <li>Does the loss for the last epoch decrease or increase compared to the prior epoch?  Is this result expected?</li>
  </ol>

4. Describe the architecture of the LSTM neural network based upon the hyperparameter values specified in code step 3.

5.  Answer these questions:  

  <ol type="a">
    <li>What is the purpose of the learning rate (learn_rate) parameter and what does the specified value mean?</li>

    <li>Describe at least 2 ways to improve the performance (speed and accuracy) of the LSTM training and the resulting model.</li>
  </ol>  

<br>

### **Part III:  Use the Model**


---

**Complete these tasks:**

1.  In code step 6, enter each of the seed prompts listed below into the `seed_text` text box, set the value of `num_output_tokens` to 5, run the code cell, and record the resulting generated text in your report.

<br>

<div align="center">

  | Number | Seed Prompt | Reference |
  |----------|----------|----------:|
  | 1 | And it came to pass | 1:6 |
  | 2 | he built an altar of stones | 2:7 |
  | 3 | Behold I have dreamed a dream | 3:2 |
  | 4 | Laman went in unto the house of Laban | 3:11 |
  | 5 | the words which have been spoken by the mouth of all the holy prophets | 3:20 |
  | 6 | let us be faithful in keeping the commandments of the Lord | 4:1 |
  | 7 | Inasmuch as thy seed shall keep my commandments | 4:14 |

</div>

<br>

2.  Find three additional seed prompts from the text that you find interesting and record the resulting generated text in your report.  

  (Safety tip:  Ensure that your seed prompts do not contain punctuation or numbers.)


3.  Answer the following questions for all 10 results:

  <ol type="a">
    <li>Are there any quality issues with the generated texts?  If so, describe the issues.</li>

    <li>Which generated text is the best?  Justify your answer.  (For example, it is more grammatical, it seems to flow better than the others, etc.)</li>

    <li>Are there any general shared characteristics among the generated texts?  Explain.</li>
  </ol>

4.  Repeat these tasks for the 10 seed prompts, but change the value of `num_output_tokens` to 3, then answer these questions:

  <ol type="a">
    <li>Are the first 3 tokens of the generated texts (not counting the seed prompt portions) the same as the generated tokens when setting the number of output tokens to 5?</li>

    <li>Are there any general shared characteristics among the generated texts?  Explain.</li>

    <li>Are the outputs deterministic?  Explain.</li>
  </ol>
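The safety tip in task 2 exists because the cleaning step removes punctuation and digits from the training text, so they never enter the vocabulary. The sketch below shows one way to screen a seed prompt for words the model has never seen. The helper `unknown_words` and the tiny vocabulary are hypothetical, and the sketch splits on whitespace rather than using the notebook's tokenizer.

```python
def unknown_words(seed_text, vocabulary):
    """Return the words in the seed prompt that are not in the vocabulary."""
    words = seed_text.lower().split()
    return [word for word in words if word not in vocabulary]

# Tiny made-up vocabulary (the real one comes from the training text)
vocabulary = {"and", "it", "came", "to", "pass", "behold"}

clean_seed = "And it came to pass"
risky_seed = "And it came to pass, behold 2"

print(unknown_words(clean_seed, vocabulary))
print(unknown_words(risky_seed, vocabulary))
```

An empty list means every word in the prompt is known; anything else is worth removing or replacing before running code step 6.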

<br>

### **(Optional) Part IV:  Play Time!**


---

**Complete these *optional* tasks:**


*This part is optional.  You are not required to complete this portion of the assignment, though it is encouraged.*


You have now created several model versions, one version per epoch.  Each model version has different performance characteristics, with later models being generally more accurate than earlier models.

  1.  Modify the code in code step 6 to load any epoch-specific model (e.g., `lstm_model_2.pth`) and send it to the `predict_next_words()` function as the `model` parameter.  (Hint:  You will need to use the LSTM framework as used in code step 3 along with the `model.load_state_dict()` function as used in code step 4.)
  2.  Use that model to generate text using the seed prompts in Part III.
  3.  Compare the generated text from at least 2 different models.  
  
    * Describe the unique and interesting features of each.
    * Are there general patterns evident for a particular model?
    * Which model produces the best results?  Is that surprising?

<br>




---

## **Code Step 1:  Load the Necessary Code Libraries**

---

***What this step does:***

*  Loads the necessary Python libraries.

*  Mounts your Google Drive for Colab access.

---

<br>


**Complete these tasks for this step:**
1.   Run the code cell in this step.  This step needs to be run only once per runtime session.

2.   You will be prompted to give access to your Google Drive so that the training can read and write files that will persist beyond the life of the runtime.  Provide access to your Google Drive for reading and writing files for training and reporting.

In [None]:
# @title Import necessary libraries
import os
import numpy as np
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
import nltk
from nltk.tokenize import word_tokenize
from collections import defaultdict
import string
import matplotlib.pyplot as plt
import re
import warnings
from google.colab import drive

# Ignore Python warnings (they get in the way)
warnings.simplefilter(action='ignore', category=FutureWarning)

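nltk.download('punkt')
# Newer NLTK releases look for the 'punkt_tab' resource instead of 'punkt'
nltk.download('punkt_tab')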

# Mount Google Drive
drive.mount('/content/drive')

## **Code Step 2:  Define Classes and Functions**

---

***What this step does:***

*  Defines the following classes:
  * `TextDataset`:  Used for defining the training dataset.
  * `LSTMModel`:  Used for defining the LSTM model.

*  Defines the following functions:
  * `clean_text()`:  Cleans the text and prepares it for processing.
  * `get_latest_file()`:  Finds the latest file by using regular expressions to find numbers in the filenames, identify the highest number, and select the corresponding filename.
  * `load_losses()`:  Loads the latest losses file.
  * `train_model()`:  Trains the LSTM for use as a language model.
  * `predict_next_words()`:  Once the LSTM model has been trained, uses the model to predict the next sequence of $n$ words (tokens).
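As an illustration of the approach `get_latest_file()` takes, the following sketch extracts the epoch number from each filename with a regular expression and selects the highest. The filenames in the list are hypothetical.

```python
import re

# Hypothetical folder listing after three completed epochs
filenames = ["lstm_model_1.pth", "lstm_model_3.pth", "lstm_model_2.pth"]

# Pull the first run of digits out of each filename
numbers = [int(re.search(r"\d+", name).group()) for name in filenames]

# The highest number identifies the most recent epoch
latest_epoch = max(numbers)
latest_file = filenames[numbers.index(latest_epoch)]
print(latest_file, latest_epoch)
```

Comparing the numbers as integers matters once training passes epoch 9: sorted as strings, `lstm_model_10.pth` would come before `lstm_model_9.pth`.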

<br>

---

<br>


**Complete this task for this step:**
1.   Run the code cell in this step.  This step needs to be run only once per runtime session.

In [None]:
# @title Define Classes and Functions

class TextDataset(Dataset):
    def __init__(self, xs, labels):
        self.xs = xs
        self.labels = labels

    def __len__(self):
        return len(self.xs)

    def __getitem__(self, idx):
        return self.xs[idx], self.labels[idx]

class LSTMModel(nn.Module):
    def __init__(self, total_words, embed_dim, hidden_dim):
        super(LSTMModel, self).__init__()
        self.embedding = nn.Embedding(total_words, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, total_words)

    def forward(self, x):
        x = self.embedding(x)
        x, _ = self.lstm(x)
        x = self.fc(x[:, -1, :])

        return x

#===========================================================================
def clean_text(text):
    curlies = ['“', '”', '’']
    end_of_line =["\n", "\r"]

    # Create a list of punctuation and digits characters
    unwantedCharacters = list(string.punctuation)
    unwantedCharacters.extend(list(string.digits))
    unwantedCharacters.extend(curlies)

    # Lowercase once, then remove unwanted characters from the text
    text = text.lower()
    for character in unwantedCharacters:
        text = text.replace(character, '')

    # Remove end of line characters
    for character in end_of_line:
        text = text.replace(character, ' ')

    return text

#===========================================================================
# Get latest
def get_latest_file(filenames):
    # Extract the numbers from the filenames
    numbers = [int(re.search(r'\d+', filename).group()) for filename in filenames]

    # Find the index of the maximum number
    max_index = numbers.index(max(numbers))

    # Get the max number
    max_num = max(numbers)

    # Select the filename with the highest number
    selected_filename = filenames[max_index]

    return selected_filename, max_num

#===========================================================================
def load_losses(losses_file, max_epoch, dir_path):
    print(f"Loading losses file: losses_{max_epoch}.txt")

    with open(dir_path + f'losses_{max_epoch}.txt', 'r') as file:
        losses = [float(line.strip()) for line in file]

    print(f"Loaded losses file: losses_{max_epoch}.txt")

    return losses

#===========================================================================
def train_model(model, dataloader, criterion, optimizer, start, epochs, losses, dir_path):
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    from time import time

    if start != epochs:
        print()
        print("Training...")
        print(f"Using device: {device}")

        total_batches = len(dataloader)

        for epoch in range(start, epochs):
            epoch_start_time = time()
            total_loss = 0

            print(f"\nEpoch {epoch + 1}/{epochs}")
            print("-" * 20)

            for batch_idx, (xs_batch, labels_batch) in enumerate(dataloader):
                # Move data to GPU if available
                xs_batch = xs_batch.long().to(device)
                labels_batch = labels_batch.long().to(device)

                outputs = model(xs_batch)
                loss = criterion(outputs, labels_batch)
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
                total_loss += loss.item()

                # Print progress every 100 batches
                if (batch_idx + 1) % 100 == 0:
                    progress = (batch_idx + 1) / total_batches * 100
                    print(f"Batch {batch_idx + 1}/{total_batches} ({progress:.1f}%) - Current loss: {loss.item():.4f}")

            # Calculate the average loss for the epoch
            average_loss = total_loss / len(dataloader)
            losses.append(average_loss)

            epoch_time = time() - epoch_start_time

            print(f'\nEpoch {epoch + 1} Summary:')
            print(f'Average Loss: {average_loss:.4f}')
            print(f'Time taken: {epoch_time:.1f} seconds')

            # Move model to CPU for saving
            model.cpu()
            torch.save(model.state_dict(), dir_path + f'lstm_model_{epoch + 1}.pth')
            # Move model back to GPU
            model.to(device)

            # Save the losses list
            with open(dir_path + f'losses_{epoch + 1}.txt', 'w') as file:
                for number in losses:
                    file.write(f"{number}\n")

    print()
    print(f"Training complete. The model has been trained for {epochs} epochs.")

#===========================================================================
# Predict the next words in the sequence
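def predict_next_words(model, tokenizer, text, n):
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    model = model.to(device)

    # Reverse lookup (index -> word), built once instead of on every step
    index_to_word = {index: word for word, index in tokenizer.items()}

    for _ in range(n):
        words = word_tokenize(text)

        # Stop on unknown words: looking them up in the defaultdict would
        # add new indices that the embedding layer cannot handle
        unknown = [word for word in words if word not in tokenizer]
        if unknown:
            raise ValueError(f"Seed text contains words not in the vocabulary: {unknown}")

        token_list = [tokenizer[word] for word in words]
        token_list = np.pad(token_list, (max_sequence_len-len(token_list), 0), 'constant')
        token_list = torch.LongTensor(token_list).unsqueeze(0).to(device)

        with torch.no_grad():
            predicted = model(token_list)

        # Pick the highest-scoring word and append it to the text
        predicted_word = index_to_word[torch.argmax(predicted).item()]
        text += " " + predicted_word

    return text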

## **Code Step 3:  DASHBOARD: Load and Clean the Text and Set the Hyperparameters**

---

***What this step does:***

*  Loads and cleans the text for processing.

*  Defines model hyperparameters and function values.

*  Calculates the Type-to-Token Ratio (TTR), which is the total number of unique tokens divided by the number of total tokens in a given corpus.

*  Looks for losses and model files that have been previously generated.
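As a worked example of the TTR calculation, the sketch below counts tokens and unique words in a short made-up sentence. It splits on whitespace for simplicity, which is an assumption; the code cell in this step uses NLTK's `word_tokenize` on the cleaned text.

```python
# A short made-up sentence stands in for the cleaned training text
text = "and it came to pass that the lord spake unto my father and he departed into the wilderness"

tokens = text.split()            # every word occurrence
types = set(tokens)              # unique words only
ttr = len(types) / len(tokens)   # unique words divided by total words

print(len(tokens), len(types), round(ttr, 3))
```

A value near 1 means almost every word is new, while a value near 0 means the text reuses a small vocabulary heavily.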

---

<br>


**Complete this task for this step:**
1.   Run the code cell in this step.  This step needs to be run only once per runtime session.

In [None]:
# @title Load and Clean the Text and Set the Hyperparameters

files_list = []
losses_list = []
models_list = []

# List all files in the directory
directory_path = '/content/drive/MyDrive/LING 581/LSTM/'
filenames = os.listdir(directory_path)

# Add all filenames to the files list
for filename in filenames:
    root, extension = os.path.splitext(filename)

    if extension == '.txt':
        if filename.startswith('losses_'):
            losses_list.append(filename)
        else:
            files_list.append(filename)

    if extension == '.pth':
        models_list.append(filename)

print(f"Number of text files: {len(files_list)}")

for text_filename in files_list:
    # Read text from a file
    print(f"Using text: {text_filename}")
    print("Loading the text...")

    with open(directory_path + text_filename, 'r', encoding="utf-8") as file:
        text = file.read()

# Tokenize the text
text = clean_text(text)
tokens = word_tokenize(text)
tokenizer = defaultdict(lambda: len(tokenizer))

# Create word index
sequence = [tokenizer[word] for word in tokens]
total_words = len(tokenizer)

# Calculate type-to-token ratio (TTR)
ttr = total_words / len(tokens)

# Prepare the input sequences
input_sequences = []

for i in range(1, len(sequence)):
    n_gram_sequence = sequence[:i+1]
    input_sequences.append(n_gram_sequence)

# Pad sequences
max_sequence_len = max([len(x) for x in input_sequences])
input_sequences = np.array([np.pad(x, (max_sequence_len-len(x), 0), 'constant') for x in input_sequences])

# Split data into features and labels
xs, labels = input_sequences[:,:-1], input_sequences[:,-1]

dataset = TextDataset(xs, labels)
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)

# Hyperparameters
epochs = 14 # @param {"type":"integer"}
embed_dim = 10 # @param {"type":"integer"}
hidden_dim = 100 # @param {"type":"integer"}
learn_rate = 0.01 # @param {"type":"number"}

# LSTM framework
model = LSTMModel(total_words, embed_dim, hidden_dim)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=learn_rate)

print(f"Total number of tokens: {len(tokens)}")
print(f"Total number of unique words (vocabulary): {total_words}")
print(f"Type-to-token ratio (TTR): {ttr}")

print()

if len(losses_list) > 0:
    print("Losses file found.")

if len(models_list) > 0:
    print("Model file found.")

## **Code Step 4:  Train the LSTM Model**

---

***What this step does:***

Trains the LSTM model with the specified hyperparameters given the input text.


---

<br>


**Complete this task for this step:**
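1.   Run the code cell in this step.  If the Colab runtime disconnects before training completes, run code steps 1 - 4 again to resume from the last completed epoch.  Note that this may take several minutes for each epoch using the base CPU runtime; you might consider using a GPU runtime to speed up the training, since the code uses CUDA when it is available.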

In [None]:
# @title Train the Model

# Check if GPU is available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

# Move model to GPU if available
model = model.to(device)

losses = []
start = 0

if len(models_list) > 0:
    model_file, max_epoch = get_latest_file(models_list)

    if max_epoch < epochs:
        cont = input(f"A model file was found with fewer completed epochs ({max_epoch}) than the specified epochs parameter ({epochs}).  Continue training? (y/n)  ").lower()
        print()

        if cont == 'y':
            print(f"Loading model: {model_file}")
            # Load model to GPU if available
            model.load_state_dict(torch.load(directory_path + model_file, map_location=device))
            print(f"Loaded model: {model_file}")

            losses_file, max_epoch = get_latest_file(losses_list)
            losses = load_losses(losses_file, max_epoch, directory_path)

            start = max_epoch

            train_model(model, dataloader, criterion, optimizer, start, epochs, losses, directory_path)
        else:
            print(f"Training stopped.  The current model has completed {max_epoch} out of {epochs} epochs.")
    else:
        print()
        print(f"Training complete.  The model has been trained for {epochs} epochs.")
else:
    train_model(model, dataloader, criterion, optimizer, start, epochs, losses, directory_path)

## **Code Step 5:  Plot the Loss Curve**

---

***What this step does:***

Plots the loss curve for the LSTM training.


---

<br>


**Complete this task for this step:**
1.   Run the code cell in this step.

In [None]:
# @title Plot the Loss Curve

# Re-scan the directory for loss files
losses_list = []  # Reinitialize the list

# List all files in the directory again
filenames = os.listdir(directory_path)

# Add all loss filenames to the losses_list
for filename in filenames:
    root, extension = os.path.splitext(filename)
    if extension == '.txt' and filename.startswith('losses_'):
        losses_list.append(filename)

if len(losses_list) > 0:
    losses_file, max_epoch = get_latest_file(losses_list)
    losses = load_losses(losses_file, max_epoch, directory_path)
    print()

    x_range = range(1, len(losses) + 1)
    plt.plot(x_range, losses)
    plt.xlim(1, epochs)
    plt.xlabel('Epoch')
    plt.ylabel('Loss')
    plt.title('LSTM Training Loss Curve')
    plt.show()

else:
    print("No losses file found.")
    print("Unable to plot, insufficient data.")

## **Code Step 6:  Predict Next Words**

---

***What this step does:***

Generates the specified number of output tokens from the input seed text.


---

<br>


**Complete these tasks for this step:**
1.   Specify the number of output tokens.
2.   Provide the seed text.
3.   Run the code cell in this step.

In [None]:
# @title Predict Next Words
# Example usage

num_output_tokens = 5 # @param {"type":"integer"}

seed_text = "" # @param {"type":"string"}
predicted_text = predict_next_words(model, tokenizer, seed_text.lower(), num_output_tokens)
print(predicted_text)


|  |  |
|----------|----------|
| Colab notebook created by:|
| <img src="https://brightspotcdn.byu.edu/8e/28/7bcd62fe4b2b9517b74f783decfe/1-monogram-378w.svg" alt="BYU Logo" width="150">|Professor Duane K. Dougal<br>Computer Science Department & Department of Linguistics<br>Brigham Young University|
|||
||CUDA improvements by: Brock Rawson|