
## TC 5033
### Text Generation

<br>

#### Activity 4: Building a Simple LSTM Text Generator using WikiText-2
<br>

- Objective:
    - Gain a fundamental understanding of Long Short-Term Memory (LSTM) networks.
    - Develop hands-on experience with sequence data processing and text generation in PyTorch. Given the simplicity of the model, the amount of data, and the available computing resources, the generated text will not replace ChatGPT, and the results will most likely not make much sense. The purpose is purely academic: to understand text generation using RNNs.
    - Enhance code comprehension and documentation skills by commenting on provided starter code.
    
<br>

- Instructions:
    - Code Understanding: Begin by thoroughly reading and understanding the code. Comment each section/block of the provided code to demonstrate your understanding. To help with this, you are encouraged to add cells with experiments.

    - Model Overview: The starter code includes an LSTM model setup for sequence data processing. Familiarize yourself with the model architecture and its components. Once you are familiar with the provided model, feel free to change the model to experiment.

    - Training Function: Implement a function to train the LSTM model on the WikiText-2 dataset. This function should feed the training data into the model and perform backpropagation.

    - Text Generation Function: Create a function that accepts starting text (seed text) and a specified total number of words to generate. The function should use the trained model to generate a continuation of the input text.

    - Code Commenting: Ensure that all the provided starter code is well-commented. Explain the purpose and functionality of each section, indicating your understanding.

    - Submission: Submit your Jupyter Notebook with all sections completed and commented. Include a markdown cell with the full names of all contributing team members at the beginning of the notebook.
    
<br>

- Evaluation Criteria:
    - Code Commenting (60%): The clarity, accuracy, and thoroughness of comments explaining the provided code. We suggest using markdown cells for your explanations.

    - Training Function Implementation (20%): The correct implementation of the training function, which should effectively train the model.

    - Text Generation Functionality (10%): A working function is provided in comments. You are free to use it as long as you make sure you understand it; you may also improve it as you see fit. The minimum expected is to provide comments for the given function.

    - Conclusions (10%): Provide some final remarks specifying the differences you notice between this model and the one used for classification tasks. Also comment on the changes you made to the model, the hyperparameters, and any other information you consider relevant. Finally, provide 3 examples of generated text.



<h3>Install the Python libraries that we will need</h3>

In [1]:
pip install portalocker

Collecting portalocker
  Downloading portalocker-2.8.2-py3-none-any.whl (17 kB)
Installing collected packages: portalocker
Successfully installed portalocker-2.8.2


In [2]:
!pip install torchdata



In [3]:
pip install torch==1.9.1+cu102 torchvision==0.10.1+cu102 torchaudio==0.9.1 -f https://download.pytorch.org/whl/torch_stable.html

Looking in links: https://download.pytorch.org/whl/torch_stable.html
ERROR: Could not find a version that satisfies the requirement torch==1.9.1+cu102 (from versions: 1.11.0, 1.11.0+cpu, 1.11.0+cu102, 1.11.0+cu113, 1.11.0+cu115, 1.11.0+rocm4.3.1, 1.11.0+rocm4.5.2, 1.12.0, 1.12.0+cpu, 1.12.0+cu102, 1.12.0+cu113, 1.12.0+cu116, 1.12.0+rocm5.0, 1.12.0+rocm5.1.1, 1.12.1, 1.12.1+cpu, 1.12.1+cu102, 1.12.1+cu113, 1.12.1+cu116, 1.12.1+rocm5.0, 1.12.1+rocm5.1.1, 1.13.0, 1.13.0+cpu, 1.13.0+cu116, 1.13.0+cu117, 1.13.0+cu117.with.pypi.cudnn, 1.13.0+rocm5.1.1, 1.13.0+rocm5.2, 1.13.1, 1.13.1+cpu, 1.13.1+cu116, 1.13.1+cu117, 1.13.1+cu117.with.pypi.cudnn, 1.13.1+rocm5.1.1, 1.13.1+rocm5.2, 2.0.0, 2.0.0+cpu, 2.0.0+cpu.cxx11.abi, 2.0.0+cu117, 2.0.0+cu117.with.pypi.cudnn, 2.0.0+cu118, 2.0.0+rocm5.3, 2.0.0+rocm5.4.2, 2.0.1, 2.0.1+cpu, 2.0.1+cpu.cxx11.abi, 2.0.1+cu117, 2.0.1+cu117.with.pypi.cudnn, 2.0.1+cu118, 2.0.1+rocm5.3, 2.0.1+rocm5.4.2, 2.1.0, 2.1.0+cpu, 2.1.0+cpu.cxx11.abi, 2.1.0+cu118, 2.1.0+cu12

In [4]:
pip install scikit-plot

Collecting scikit-plot
  Downloading scikit_plot-0.3.7-py3-none-any.whl (33 kB)
Installing collected packages: scikit-plot
Successfully installed scikit-plot-0.3.7


<h3>Import the libraries that we will be using</h3>

In [5]:
import numpy as np
#PyTorch libraries
import torch
import torchtext
from torchtext.datasets import WikiText2
# Dataloader library
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.dataset import random_split
# Libraries to prepare the data
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator
from torchtext.data.functional import to_map_style_dataset
# neural layers
from torch import nn
from torch.nn import functional as F
import torch.optim as optim
from tqdm import tqdm

import random

In [6]:
# Select the correct device where our app will be running
device = 'cuda' if torch.cuda.is_available() else 'cpu'

In [7]:
train_dataset, val_dataset, test_dataset = WikiText2()

<h3>WikiText2 dataset</h3>
<p>WikiText-2 is a collection of articles taken from Wikipedia, chosen for their quality and breadth of content. The articles in the dataset are among the higher-quality content available on Wikipedia, having undergone thorough review and editing. WikiText-2 is considerably smaller than related datasets such as WikiText-103, which makes it more manageable for training models with limited computational resources.</p>
<p>WikiText-2 is typically divided into training, validation, and test sets. The training set is used to fit the language model, the validation set to monitor it during training, and the test set for the final evaluation.</p>

In [8]:
# Set up tokenizer and define a generator function
tokeniser = get_tokenizer('basic_english')
def yield_tokens(data):
    for text in data:
        yield tokeniser(text)

In [9]:
# Build the vocabulary
vocab = build_vocab_from_iterator(yield_tokens(train_dataset), specials=["<unk>", "<pad>", "<bos>", "<eos>"])
# Map any out-of-vocabulary token to the index of "<unk>" (index 0, the first special token)
vocab.set_default_index(vocab["<unk>"])
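
<p>To make the role of the vocabulary concrete, the following framework-free sketch shows the same idea on a toy corpus: special tokens take the first indices, the remaining tokens are ordered by frequency, and unseen words fall back to the <code>&lt;unk&gt;</code> index. The helpers <code>build_toy_vocab</code> and <code>encode</code> are hypothetical names that only approximate what <code>build_vocab_from_iterator</code> and <code>set_default_index</code> do; the exact ordering torchtext uses for equally frequent tokens is not reproduced here.</p>

```python
from collections import Counter

def build_toy_vocab(token_lists, specials=("<unk>", "<pad>", "<bos>", "<eos>")):
    """Map each token to an integer index: specials first, then by descending frequency."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    # Sort by frequency (descending), then alphabetically so the order is deterministic
    ordered = sorted(counts, key=lambda t: (-counts[t], t))
    itos = list(specials) + [t for t in ordered if t not in specials]
    return {tok: idx for idx, tok in enumerate(itos)}

def encode(tokens, stoi):
    """Convert tokens to indices, using the <unk> index for unseen words."""
    unk = stoi["<unk>"]
    return [stoi.get(tok, unk) for tok in tokens]

corpus = [["the", "cat", "sat"], ["the", "dog", "sat", "down"]]
toy_vocab = build_toy_vocab(corpus)
ids = encode(["the", "bird", "sat"], toy_vocab)  # "bird" is not in the toy corpus
print(ids)
```

<p>The unseen word is encoded with index 0, which is what lets the model process text containing words that never appeared in the training set.</p>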

<h3>Data processing</h3>
<p>Prepare the training, validation, and test data. Each split is flattened into one long tensor of token indices, which is then cut into input (x) and target (y) sequences of length <code>seq_length</code>. For each position i in an input sequence, the corresponding target is the token at position i+1. This setup trains the model to predict the next token in a sequence.</p>

In [10]:
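seq_length = 50
def data_process(raw_text_iter, seq_length = 50):
    # Tokenise every line and convert it to a tensor of vocabulary indices
    data = [torch.tensor(vocab(tokeniser(item)), dtype=torch.long) for item in raw_text_iter]
    data = torch.cat(tuple(filter(lambda t: t.numel() > 0, data))) #remove empty tensors
    # Largest multiple of seq_length that still leaves one token for the last target.
    # Slicing with an explicit length also works when the stream length is an exact
    # multiple of seq_length (or one more), where a negative-remainder slice would be empty.
    n = (data.size(0) - 1) // seq_length * seq_length
    # Inputs are tokens [0, n); targets are the same tokens shifted by one position
    return (data[:n].view(-1, seq_length),
            data[1:n + 1].view(-1, seq_length))

# Create tensors for the training, validation, and test sets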
x_train, y_train = data_process(train_dataset, seq_length)
x_val, y_val = data_process(val_dataset, seq_length)
x_test, y_test = data_process(test_dataset, seq_length)
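
<p>The input/target shift can be checked by hand on a toy token stream. The sketch below is framework-free: it uses plain lists in place of tensors and a hypothetical helper <code>make_xy</code>, and it assumes the stream holds at least <code>seq_length + 1</code> tokens.</p>

```python
def make_xy(tokens, seq_length):
    """Split a flat token stream into rows of inputs and next-token targets."""
    # Largest multiple of seq_length that still leaves one token for the last target
    n = (len(tokens) - 1) // seq_length * seq_length
    x = [tokens[i:i + seq_length] for i in range(0, n, seq_length)]
    y = [tokens[i + 1:i + 1 + seq_length] for i in range(0, n, seq_length)]
    return x, y

stream = list(range(10, 21))  # 11 toy token ids: 10, 11, ..., 20
x, y = make_xy(stream, seq_length=5)
print(x)  # input rows
print(y)  # target rows: the same stream shifted one position to the left
```

<p>Every target row starts one token later than its input row, so the target at position i is always the token that follows input position i.</p>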

<p>Wrap each pair of input and target tensors in a <code>TensorDataset</code> so that PyTorch's data loading utilities can batch and load them during training and evaluation.</p>

In [11]:
train_dataset = TensorDataset(x_train, y_train)
val_dataset = TensorDataset(x_val, y_val)
test_dataset = TensorDataset(x_test, y_test)

<p>The DataLoader takes a dataset and provides an iterator that returns batches of data. This is essential for processing large datasets efficiently and for features such as shuffling and parallel data loading.</p>

In [12]:
batch_size = 64  # choose a batch size that fits your computation resources
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True, drop_last=True)
val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=True, drop_last=True)
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=True, drop_last=True)
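
<p>The effect of <code>shuffle</code> and <code>drop_last</code> can be illustrated without PyTorch. The generator <code>iter_batches</code> below is a hypothetical, simplified stand-in for a DataLoader, not its actual implementation. Dropping the incomplete final batch matters in this notebook because the hidden state is created for a fixed <code>batch_size</code>.</p>

```python
import random

def iter_batches(samples, batch_size, shuffle=True, drop_last=True, seed=0):
    """Yield lists of samples, mimicking the shuffle/drop_last options of a DataLoader."""
    order = list(range(len(samples)))
    if shuffle:
        random.Random(seed).shuffle(order)  # new random order of the sample indices
    for start in range(0, len(order), batch_size):
        batch = [samples[i] for i in order[start:start + batch_size]]
        if drop_last and len(batch) < batch_size:
            break  # discard the incomplete final batch
        yield batch

samples = list(range(10))
batches = list(iter_batches(samples, batch_size=4))
sizes = [len(b) for b in batches]
print(sizes)  # every yielded batch has exactly batch_size elements
```

<p>With 10 samples and a batch size of 4, the two leftover samples are skipped in that pass; with shuffling enabled, a different pair is left out whenever the order changes.</p>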

<h3>LSTM neural network model</h3>
<p>The model starts with an embedding layer that converts token indices into dense vector representations, followed by an LSTM that processes these sequences. The architecture has a configurable number of stacked LSTM layers and a configurable hidden state size, which allow it to capture temporal dependencies within the data. The output of the LSTM is then passed through a linear layer that maps it to the size of the vocabulary, producing one score per candidate next token. The model also includes a method that initializes the hidden and cell states of the LSTM with zeros, so they can be reset for each new batch of data. The model is instantiated with the vocabulary size, embedding size, hidden size, and number of LSTM layers.</p>

In [19]:
# Define the LSTM model
# Feel free to experiment
class LSTMModel(nn.Module):
    def __init__(self, vocab_size, embed_size, hidden_size, num_layers):
        super(LSTMModel, self).__init__()
        # Transforms token indices into dense vectors of fixed size embed_size
        self.embeddings = nn.Embedding(vocab_size, embed_size)
        # size of the hidden state in the LSTM
        self.hidden_size = hidden_size
        # number of LSTM layers stacked on top of each other.
        self.num_layers = num_layers
        # LSTM layer that processes sequences of embeddings; batch_first=True means
        # the input and output tensors are shaped (batch, seq, feature)
        self.lstm = nn.LSTM(embed_size, hidden_size, num_layers, batch_first=True)
        # fully connected layer that maps the LSTM's output to the size of the vocabulary
        self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, text, hidden):
        embeddings = self.embeddings(text)
        # LSTM layer takes the embeddings and the initial hidden state as input, and returns the output for each time step and the final hidden state
        output, hidden = self.lstm(embeddings, hidden)
        # Output from the LSTM is passed through the linear layer to predict the next token
        decoded = self.fc(output)
        return decoded, hidden

    # Initializes the hidden state and cell state for the LSTM with zeros.
    def init_hidden(self, batch_size):
        # Ensures that the hidden states are on the same device as the model
        return (torch.zeros(self.num_layers, batch_size, self.hidden_size).to(device),
                torch.zeros(self.num_layers, batch_size, self.hidden_size).to(device))

vocab_size = len(vocab) # vocabulary size
emb_size = 200 # embedding size
neurons = 200 # size of the LSTM hidden state, i.e. number of hidden units per layer
num_layers = 2 # the number of nn.LSTM layers
model = LSTMModel(vocab_size, emb_size, neurons, num_layers)
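
<p>To get a feel for the size of this architecture, the sketch below counts parameters and traces tensor shapes with plain arithmetic. It assumes PyTorch's default <code>nn.LSTM</code> layout (four gates, with one input bias and one recurrent bias vector per layer) and uses an assumed vocabulary size of 10,000 purely for illustration; the real value is <code>len(vocab)</code>. The function names are hypothetical.</p>

```python
def lstm_model_param_count(vocab_size, embed_size, hidden_size, num_layers):
    """Count trainable parameters of an Embedding -> stacked LSTM -> Linear model."""
    embedding = vocab_size * embed_size
    lstm = 0
    for layer in range(num_layers):
        input_size = embed_size if layer == 0 else hidden_size
        # Four gates, each with input weights, recurrent weights, and two bias vectors
        lstm += 4 * (input_size * hidden_size + hidden_size * hidden_size + 2 * hidden_size)
    linear = hidden_size * vocab_size + vocab_size  # weights plus bias
    return {"embedding": embedding, "lstm": lstm, "linear": linear,
            "total": embedding + lstm + linear}

def trace_shapes(batch_size, seq_length, vocab_size, embed_size, hidden_size):
    """Shapes of the tensors flowing through forward() when batch_first=True."""
    return {
        "token indices": (batch_size, seq_length),
        "embeddings": (batch_size, seq_length, embed_size),
        "LSTM output": (batch_size, seq_length, hidden_size),
        "logits": (batch_size, seq_length, vocab_size),
    }

toy_vocab_size = 10_000  # assumed value for illustration only, not the real len(vocab)
counts = lstm_model_param_count(toy_vocab_size, embed_size=200, hidden_size=200, num_layers=2)
shapes = trace_shapes(64, 50, toy_vocab_size, 200, 200)
print(counts)
print(shapes)
```

<p>Under these assumptions the embedding and output layers hold most of the parameters, because both scale with the vocabulary size, while the two LSTM layers are comparatively small.</p>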


<h3>Testing our model</h3>
<p>Given that we have corresponding target sequences for each input data sequence, we are equipped to evaluate the accuracy of our language model. This can be achieved through the test_model function, which is structured to assess the model's performance in a testing environment. The function sets the model to evaluation mode, ensuring that operations like dropout or batch normalization are adjusted appropriately for testing. During evaluation, the model's predictions are compared against the actual target sequences to compute both the average loss and the <b>accuracy</b>.</p>

In [23]:
def test_model(model, test_loader, criterion, device, batch_size, vocab_size):
  model.eval()  # Set the model to evaluation mode
  total_loss, total_predictions, correct_predictions = 0, 0, 0
  hidden = model.init_hidden(batch_size)

  with torch.no_grad():
    for x_test, y_test in test_loader:
      x_test, y_test = x_test.to(device), y_test.to(device)
      output, hidden = model(x_test, hidden)
      hidden = (hidden[0].detach(), hidden[1].detach())  # Detaching hidden state

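      # criterion returns the mean loss per token for this batch
      loss = criterion(output.view(-1, vocab_size), y_test.view(-1))
      # The mean is weighted by the batch size here but divided by the token count below,
      # so average_loss equals the per-token loss divided by seq_length
      total_loss += loss.item() * x_test.size(0)

      # Calculate the total number of sequence elements
      total_predictions += x_test.size(0) * seq_length

      # Calculate accuracy
      _, predicted = torch.max(output, 2)  # Index of the largest logit, i.e. the predicted token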

      correct_predictions += (predicted == y_test).view(-1).sum().item()

  average_loss = total_loss / total_predictions
  accuracy = correct_predictions / total_predictions
  return average_loss, accuracy
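
<p>The two metrics can be reproduced by hand on a toy example. The sketch below is framework-free and uses hypothetical helpers <code>cross_entropy</code> and <code>evaluate</code>; it computes the mean per-token cross-entropy and the fraction of positions where the highest-scoring token matches the target, which is the quantity <code>nn.CrossEntropyLoss</code> averages by default.</p>

```python
import math

def cross_entropy(logits, target):
    """Negative log-probability of the target class under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum_exp - logits[target]

def evaluate(all_logits, targets):
    """Return (mean per-token loss, accuracy) over a flat list of token predictions."""
    losses = [cross_entropy(l, t) for l, t in zip(all_logits, targets)]
    # argmax over the vocabulary dimension; ties resolve to the lowest index
    correct = sum(max(range(len(l)), key=l.__getitem__) == t
                  for l, t in zip(all_logits, targets))
    return sum(losses) / len(losses), correct / len(targets)

# Three token positions over a 3-word toy vocabulary
toy_logits = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 1.0, 3.0]]
toy_targets = [0, 1, 1]
mean_loss, accuracy = evaluate(toy_logits, toy_targets)
print(f"loss={mean_loss:.4f} accuracy={accuracy:.2f}")
```

<p>A useful reference point: a model that assigns equal scores to all V tokens has a per-token loss of ln(V), as the second position (all-zero logits, loss ln 3) shows.</p>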

<h3>Training Function with PyTorch</h3>
<p>This function follows the standard training loop for PyTorch models. It ensures that the model, the data, and the hidden states are on the specified device, enabling GPU acceleration when available. After each epoch it evaluates the model on the validation set.</p>

In [21]:
def train(model, epochs, optimiser):
  '''
  The following are possible instructions you may want to consider for this function.
  This is only a guide; you may change, add, or remove whatever you consider appropriate
  as long as you train your model correctly.
      - loop through specified epochs
      - loop through dataloader
      - don't forget to zero grad!
      - place data (both input and target) in device
      - init hidden states e.g. hidden = model.init_hidden(batch_size)
      - run the model
      - compute the cost or loss
      - backpropagation
      - Update parameters
      - Print all the information you consider helpful

  '''
  criterion = nn.CrossEntropyLoss()
  model = model.to(device=device)

  for epoch in range(epochs):
    # test_model switches the model to eval mode, so re-enable training mode every epoch
    model.train()
    for i, (data, labels) in enumerate(train_loader):
      # Clear old gradients
      optimiser.zero_grad()

      # Initialize hidden states for each batch (init_hidden already places them on the device)
      hidden = model.init_hidden(batch_size)

      # Move inputs and targets to the device as torch.long token indices
      inputs = data.to(device=device, dtype=torch.long)
      labels = labels.to(device=device, dtype=torch.long)

      # Forward pass
      outputs, hidden = model(inputs, hidden)

      # Compute loss over all token positions in the batch
      loss = criterion(outputs.view(-1, vocab_size), labels.view(-1))

      # Backward pass and optimization
      loss.backward()
      optimiser.step()
    # Evaluate on the validation set after every epoch (printed under the "Test" label)
    average_loss, accuracy = test_model(model, val_loader, criterion, device, batch_size, vocab_size)
    print(f"Average Test Loss: {average_loss}")
    print(f"Test Accuracy: {accuracy * 100:.2f}%")



<h3>Training of the model</h3>

In [24]:
# Call the train function
loss_function = nn.CrossEntropyLoss()
lr = 0.0005
epochs = 20
optimiser = optim.Adam(model.parameters(), lr=lr)
train(model, epochs, optimiser)

Average Test Loss: 0.1134851448571504
Test Accuracy: 20.10%
Average Test Loss: 0.10986625116262863
Test Accuracy: 21.00%
Average Test Loss: 0.10804870363491685
Test Accuracy: 21.68%
Average Test Loss: 0.10686180655636004
Test Accuracy: 21.77%
Average Test Loss: 0.10594809717206813
Test Accuracy: 22.16%
Average Test Loss: 0.10599816137285376
Test Accuracy: 22.17%
Average Test Loss: 0.10565218726200844
Test Accuracy: 22.44%
Average Test Loss: 0.10621451463272323
Test Accuracy: 22.16%
Average Test Loss: 0.10567238238320421
Test Accuracy: 22.51%
Average Test Loss: 0.10613417298046511
Test Accuracy: 22.44%
Average Test Loss: 0.10573331590908677
Test Accuracy: 22.71%
Average Test Loss: 0.10613781388126202
Test Accuracy: 22.82%
Average Test Loss: 0.10680383269466572
Test Accuracy: 22.49%
Average Test Loss: 0.1074395068723764
Test Accuracy: 22.45%
Average Test Loss: 0.1079189089874723
Test Accuracy: 22.20%
Average Test Loss: 0.10819200786192025
Test Accuracy: 22.32%
Average Test Loss: 0.108761

In [18]:
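def generate_text(model, start_text, num_words, temperature=1.0):
  model.eval()  # disable training-only behaviour
  words = tokeniser(start_text)
  hidden = model.init_hidden(1)  # hidden state for a batch of one sequence
  # Feed the whole seed on the first step; afterwards feed only the newest word,
  # because the hidden state already summarises everything seen so far
  x = torch.tensor([[vocab[word] for word in words]], dtype=torch.long, device=device)
  with torch.no_grad():
    for _ in range(num_words):
      y_pred, hidden = model(x, hidden)
      last_word_logits = y_pred[0][-1]  # scores for the token after the last input
      # Temperature-scaled softmax: values below 1 sharpen the distribution, values above 1 flatten it
      p = F.softmax(last_word_logits / temperature, dim=0).cpu().numpy()
      # Sample the next token instead of always taking the most likely one
      word_index = int(np.random.choice(len(last_word_logits), p=p))
      words.append(vocab.lookup_token(word_index))
      x = torch.tensor([[word_index]], dtype=torch.long, device=device)

  return ' '.join(words)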


# Generate some text
print(generate_text(model, start_text="I like", num_words=100))


i like poker things , as the uí king of schopenhauer changed them to his life following hanging for still 10 her squadron . bulloch became released and 2001 attempts to become embroiled in love , and dumah , journal of munster , in great , he did become from dangerously in support of his teens to 45 – 40 by 2 , tennessee was marketed as manager michael <unk> pickering ) , and the champions producers in august ( 97 ) , acting on the oricon and 1980s an elaborate <unk> family guy for the revival . development and queen elizabeth
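
<p>The <code>temperature</code> argument controls how adventurous the sampling is. The framework-free sketch below, with hypothetical helpers <code>softmax_with_temperature</code> and <code>sample_index</code>, shows the effect on a toy set of three logits: dividing by a temperature below 1 concentrates probability on the top token, while a temperature above 1 spreads it out.</p>

```python
import math
import random

def softmax_with_temperature(logits, temperature=1.0):
    """Turn logits into probabilities; lower temperatures sharpen the distribution."""
    scaled = [v / temperature for v in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_index(logits, temperature=1.0, rng=random):
    """Draw one index according to the temperature-scaled probabilities."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

toy_logits = [2.0, 1.0, 0.0]
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(toy_logits, t)
    print(t, [round(p, 3) for p in probs])
```

<p>In practice, a lower temperature tends to give more repetitive but more coherent text, and a higher temperature gives more varied but noisier text.</p>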
