
<div align="center">
  <img src="https://www.dropbox.com/s/vold2f3fm57qp7g/ECE4179_5179_6179_banner.png?dl=1" alt="ECE4179/5179/6179 Banner" style="max-width: 60%;"/>
</div>

<div align="center">

# Recurrent Neural Networks (RNNs)

</div>

Welcome to week 10 of ECE4179/5179/6179! 

*Dun dun dun dun-da-da dun-da-da, dun-da-da!* 🎶

On an island far, far away, you find yourself in a distant jungle with nothing but your trusty laptop, equipped with the latest release of PyTorch and other beloved Python packages.

What would you do? Of course, you start coding and learning about **Recurrent Neural Networks (RNNs)**!
Let’s get started. May the gradients be with you! ✨


<div align="center">
  <img src="data/Mysterious-Island.jpg" alt="mysterious_island" style="max-width: 60%;"/>
</div>



In [1]:
import os
import random


import numpy as np
from tqdm import tqdm


import torch
from torch import nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader


RND_SEED = 42
np.random.seed(RND_SEED)
random.seed(RND_SEED)
torch.manual_seed(RND_SEED)


# Check if CUDA is available (for NVIDIA GPUs)
if torch.cuda.is_available():
    device = torch.device("cuda")
    print("CUDA GPU detected. Using CUDA.")

# Check if MPS (Metal Performance Shaders) is available (for Apple Silicon M1/M2)
elif torch.backends.mps.is_available():
    device = torch.device("mps")
    print("Apple Silicon GPU detected. Using MPS.")

# Otherwise, default to CPU
else:
    device = torch.device("cpu")
    print("No GPU available. Using CPU.")

CUDA GPU detected. Using CUDA.


<div style="background-color: #2b0080; color: white; padding: 10px; border-radius: 5px;">

### <span style="color: pink;">Task #1. So you claim to be a Jedi in neural networks?</span>

Over the past 10 weeks, you've achieved amazing things with neural networks. Now, I challenge you to design an MLP for a simple task: adding two 3-digit numbers. For example, given 417 and 517, your MLP should return their sum, 934. To simplify things further, if the sum exceeds 999, just return the last 3 digits (e.g., 899 + 101 = 000).

Consider the input to your model as 2x3 symbols (digits), where each symbol can take 10 possible values (0 to 9), and the output will be 3 symbols. Discuss how you would approach this as a supervised learning problem with an MLP. Think about how to design the output, handle classification (if needed), process the input, and other aspects of your solution.

</div>


<div style="background-color: #2b0080; color: white; padding: 10px; border-radius: 5px;">

### <span style="color: pink;">Task #2. Have a go at your MLP!</span>

Great! Hopefully, you have a solid plan for Task #1. Now, let's implement your MLP in PyTorch. Below, I’ve provided a couple of functions to help prepare the data and evaluate the performance of your model. Your task is to design the model and the training loop. 

If you're unsure how to approach this problem, here's a suggestion to get started:

The input to your MLP should be a 6D vector, where the first 3 elements represent the digits of the first number and the last 3 represent the second number. Instead of formulating this as a regression problem (a poor fit, since the outputs are discrete digits rather than continuous values), treat it as a classification problem. Each digit is a class, meaning you'll need 3x10 output units. Use the cross-entropy loss function, applying it separately to each 10D output vector to predict each digit. PyTorch’s `CrossEntropyLoss` handles this directly when the logits are shaped `(B, 10, 3)` and the targets `(B, 3)`.

</div>
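
As a rough illustration of the classification setup suggested above, the sketch below builds a small MLP for 3-digit inputs (the default `seq_len=3` of `DigitAdditionDataset`) and applies `nn.CrossEntropyLoss` to all digit positions at once. The class name `AddMLP`, the hidden width of 256, and the one-hot input encoding are assumptions made for this example. It is one possible starting point, not the workshop's reference solution.

```python
import torch
from torch import nn
import torch.nn.functional as F


class AddMLP(nn.Module):
    """Hypothetical MLP: two digit sequences in, one 10-way classification per output digit."""

    def __init__(self, seq_len=3, hidden_size=256):
        super().__init__()
        self.seq_len = seq_len
        # Each of the 2 * seq_len input digits is one-hot encoded (10 values each)
        self.net = nn.Sequential(
            nn.Linear(2 * seq_len * 10, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, seq_len * 10),
        )

    def forward(self, x1, x2):
        digits = torch.cat((x1, x2), dim=1)            # (B, 2 * seq_len)
        one_hot = F.one_hot(digits, num_classes=10)    # (B, 2 * seq_len, 10)
        logits = self.net(one_hot.float().flatten(1))  # (B, seq_len * 10)
        # CrossEntropyLoss expects the class dimension second: (B, 10, seq_len)
        return logits.view(-1, self.seq_len, 10).permute(0, 2, 1)


mlp = AddMLP()
x1 = torch.tensor([[4, 4, 4], [6, 7, 8]])
x2 = torch.tensor([[2, 0, 4], [6, 3, 5]])
y = torch.tensor([[6, 4, 8], [3, 1, 3]])

logits = mlp(x1, x2)
loss = nn.CrossEntropyLoss()(logits, y)  # averaged over batch and digit positions
preds = logits.argmax(dim=1)             # (B, seq_len) predicted digits
print(logits.shape, preds.shape, loss.item())
```

Putting the class dimension second lets a single loss call cover every digit position, so no explicit loop over digits is needed.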


In [2]:
class DigitAdditionDataset(Dataset):
    """
    A PyTorch Dataset class to generate pairs of random n-digit numbers and their sums,
    where the sum is truncated or padded to match the length of the input sequences.

    Args:
        size (int): The total number of samples in the dataset. Default is 1000.
        seq_len (int): Length of the input sequences (i.e., number of digits in each number). Default is 3.

    Returns:
        A tuple of three elements:
            - seq1 (torch.Tensor): The first number represented as a sequence of digits.
            - seq2 (torch.Tensor): The second number represented as a sequence of digits.
            - sum_seq (torch.Tensor): The sum of the two numbers, truncated/padded to seq_len.
    """

    def __init__(self, size=1000, seq_len=3):
        self.seq_len = seq_len  # Number of digits in each number
        self.size = size  # Total number of data samples to generate
        self.data = []  # Stores the generated (seq1, seq2, sum_seq) triples

        # Range of numbers with exactly seq_len digits (e.g., 100 to 999 for 3 digits)
        min_val = 10 ** (seq_len - 1)
        max_val = 10**seq_len - 1

        # Generate 'size' number of data samples
        for _ in range(size):
            num1 = random.randint(min_val, max_val)  # First random number
            num2 = random.randint(min_val, max_val)  # Second random number
            sum_result = num1 + num2

            # Convert the numbers into lists of digits
            seq1 = [int(digit) for digit in str(num1)]
            seq2 = [int(digit) for digit in str(num2)]
            sum_seq = [int(digit) for digit in str(sum_result)]

            # Match the sum's length to seq_len: keep only the last seq_len digits
            # if it is longer, or pad with leading zeros if it is shorter
            if len(sum_seq) > seq_len:
                sum_seq = sum_seq[-seq_len:]
            else:
                sum_seq = [0] * (seq_len - len(sum_seq)) + sum_seq

            # Append the generated sequence (input1, input2, and the sum) to the data list
            self.data.append((seq1, seq2, sum_seq))

    def __len__(self):
        """
        Returns the total number of samples in the dataset.
        """
        return self.size

    def __getitem__(self, idx):
        """
        Returns the input-output pair (two input sequences and their sum) for a given index.

        Args:
            idx (int): Index of the sample to retrieve.

        Returns:
            A tuple of torch.Tensors:
                - seq1: The first number represented as a sequence of digits.
                - seq2: The second number represented as a sequence of digits.
                - sum_seq: The sum of the two numbers, represented as a sequence of digits.
        """
        seq1, seq2, sum_seq = self.data[idx]
        return torch.tensor(seq1), torch.tensor(seq2), torch.tensor(sum_seq)

In [3]:
# Training dataset
train_size = 500  # Number of training examples
train_dataset = DigitAdditionDataset(size=train_size)
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)

# Test dataset
test_size = 1000  # Number of test examples
test_dataset = DigitAdditionDataset(size=test_size)
test_loader = DataLoader(test_dataset, batch_size=32, shuffle=False)

In [7]:
# Get the first batch from train_loader
seq1, seq2, sum_seq = next(iter(train_loader))
print(seq1.shape, seq2.shape, sum_seq.shape)

# Print the first 5 data samples from the batch
for i in range(5):
    print(seq1[i].shape, sum_seq[i].shape)
    print(f"Sequence 1: {seq1[i].tolist()}")
    print(f"Sequence 2: {seq2[i].tolist()}")
    print(f"Sum       : {sum_seq[i].tolist()}")
    print("-" * 30)  # Separator

torch.Size([32, 3]) torch.Size([32, 3]) torch.Size([32, 3])
torch.Size([3]) torch.Size([3])
Sequence 1: [4, 4, 4]
Sequence 2: [2, 0, 4]
Sum       : [6, 4, 8]
------------------------------
torch.Size([3]) torch.Size([3])
Sequence 1: [6, 7, 8]
Sequence 2: [6, 3, 5]
Sum       : [3, 1, 3]
------------------------------
torch.Size([3]) torch.Size([3])
Sequence 1: [4, 0, 8]
Sequence 2: [9, 7, 0]
Sum       : [3, 7, 8]
------------------------------
torch.Size([3]) torch.Size([3])
Sequence 1: [3, 2, 0]
Sequence 2: [8, 8, 1]
Sum       : [2, 0, 1]
------------------------------
torch.Size([3]) torch.Size([3])
Sequence 1: [7, 1, 7]
Sequence 2: [3, 2, 5]
Sum       : [0, 4, 2]
------------------------------


In [None]:
from typing import Any, Mapping
import pytorch_lightning as pl
from torch import Tensor


class AddModel(pl.LightningModule):
    def __init__(
        self,
        lr=1e-3,
        hidden_size=128,
        num_layers=2,
        seq_len=3,
    ) -> None:
        super().__init__()
        self.save_hyperparameters()

        input_size = 2
        output_size = 10

        self.hidden_size = hidden_size
        self.num_layers = num_layers
        self.seq_len = seq_len

        self.lstm = nn.LSTM(input_size, hidden_size, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_size, output_size)
        self.loss = nn.CrossEntropyLoss()

    def forward(self, x1, x2) -> Any:
        hidden = torch.zeros(self.num_layers, x1.size(0), self.hidden_size).to(
            self.device
        )
        cell = hidden.clone()
        x = torch.stack((x1, x2), dim=2).float().flip([1])
        rnn_out, (hidden, cell) = self.lstm(x, (hidden, cell))
        out = rnn_out.flip([1])
        out = self.fc(out)
        out = out.permute(0, 2, 1)
        return out

    def training_step(self, batch) -> Tensor | Mapping[str, Any] | None:
        x1, x2, y = batch
        logits = self(x1, x2)
        loss = self.loss(logits, y)

        self.log("train_loss", loss, on_step=False, on_epoch=True, prog_bar=True)

        return loss

    def validation_step(self, batch) -> Tensor | Mapping[str, Any] | None:
        x1, x2, y = batch
        logits = self(x1, x2)
        loss = self.loss(logits, y)

        self.log("val_loss", loss, on_step=False, on_epoch=True, prog_bar=True)

        return loss

    def test_step(self, batch) -> Tensor | Mapping[str, Any] | None:
        x1, x2, y = batch
        logits = self(x1, x2)
        loss = self.loss(logits, y)

        self.log("test_loss", loss, on_step=False, on_epoch=True, prog_bar=True)

        return loss

    def configure_optimizers(self):
        return optim.Adam(self.parameters(), lr=self.hparams_initial["lr"])


from pytorch_lightning.callbacks import ModelCheckpoint, EarlyStopping

checkpoint = ModelCheckpoint(
    dirpath="checkpoints",
    filename="model-{epoch:02d}-{val_loss:.2f}",
    monitor="val_loss",
)
early_stopping = EarlyStopping(monitor="val_loss", patience=5)

model = AddModel()
trainer = pl.Trainer(max_epochs=200, callbacks=[checkpoint, early_stopping])
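# Note: test_loader doubles as the validation set here, so the test metrics
# reported afterwards are not measured on unseen data
trainer.fit(model, train_loader, test_loader)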
trainer.test(model, test_loader)

In [44]:
model = AddModel.load_from_checkpoint("checkpoints/model-epoch=122-val_loss=0.23.ckpt")


def eval_model(model):
    model.eval()
    with torch.no_grad():
        correct = 0
        total = 0
        for x1, x2, y in test_loader:
            logits = model(x1.to(model.device), x2.to(model.device))
            preds = logits.argmax(dim=1)
            correct += (preds == y.to(model.device)).all(dim=1).sum().item()
            total += y.size(0)
    accuracy = correct / total
    print(f"Accuracy: {accuracy:.2f}")
    return accuracy


eval_model(model)

Accuracy: 0.80


0.802

### RNNs

Recurrent Neural Networks (RNNs) are a type of neural network designed for sequence data. Unlike standard feed-forward neural networks, RNNs have **recursion** in their architecture, which allows them to maintain a form of memory, known as the **hidden state**, across different time steps. Can you think of why they might suit our addition problem?



In a vanilla RNN, the recursion can be mathematically represented as follows:

\begin{align}
\boldsymbol{h}^{\{t\}} = f\left(\boldsymbol{W}_{hh} \cdot \boldsymbol{h}^{\{t-1\}}  + \boldsymbol{W}_{hx} \cdot \boldsymbol{x}^{\{t\}}  + \boldsymbol{b}_h \right)
\end{align}

Where:
- $\boldsymbol{h}^{\{t\}}$ is the hidden state at time step $t$.
- $\boldsymbol{h}^{\{t-1\}}$ is the hidden state from the previous time step.
- $\boldsymbol{x}^{\{t\}}$ is the input at time step $t$.
- $\boldsymbol{W}_{hh}$ and $\boldsymbol{W}_{hx}$ are weights of the model, and $\boldsymbol{b}_h$ is the bias term.
- $f$ is the activation function (commonly $ \tanh $ or $ \text{ReLU} $).
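
To make the recursion concrete, the following sketch evaluates the equation above step by step and compares the result with PyTorch's built-in `nn.RNN`. The sizes (`input_size=2`, `hidden_size=4`, five time steps) are arbitrary choices for this illustration. Note that `nn.RNN` keeps two bias vectors, so their sum is used here in place of $\boldsymbol{b}_h$.

```python
import torch
from torch import nn

torch.manual_seed(0)
input_size, hidden_size, seq_len = 2, 4, 5

rnn = nn.RNN(input_size, hidden_size, batch_first=True)  # tanh activation by default
x = torch.randn(1, seq_len, input_size)                  # one sequence of 5 steps

# nn.RNN stores W_hx as weight_ih_l0 and W_hh as weight_hh_l0; the sum of its
# two bias vectors plays the role of b_h in the equation
W_hx, W_hh = rnn.weight_ih_l0, rnn.weight_hh_l0
b_h = rnn.bias_ih_l0 + rnn.bias_hh_l0

h = torch.zeros(hidden_size)  # initial hidden state
manual_states = []
for t in range(seq_len):
    # h_t = tanh(W_hh h_{t-1} + W_hx x_t + b_h)
    h = torch.tanh(W_hh @ h + W_hx @ x[0, t] + b_h)
    manual_states.append(h)
manual_out = torch.stack(manual_states)  # (seq_len, hidden_size)

with torch.no_grad():
    rnn_out, h_last = rnn(x)  # rnn_out: (1, seq_len, hidden_size)

# The hand-written recursion should agree with the library layer
print(torch.allclose(manual_out, rnn_out[0], atol=1e-5))
```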

<div style="text-align: center;">
    <img src="data/ece4179_ws10_rnn1.png" alt="RNN Example1" style="width: 50%;"/>
</div>




To understand how an RNN processes sequences, you can think of it as unrolling the recursion over time. At each time step, the hidden state is updated from the input at that step and the hidden state from the previous step, which is how information is carried forward through the sequence.

<div style="text-align: center;">
    <img src="data/ece4179_ws10_rnn2.png" alt="RNN Example2" style="width: 50%;"/>
</div>
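
One way to connect the unrolled view to the addition problem is to write schoolbook addition itself as a recursion, with the carry acting as the hidden state. The sketch below is a hand-written recurrence rather than a learned RNN, and the function names `add_step` and `add_by_recursion` are made up for this illustration. It also hints at why it can help to feed the digits to a recurrent model from right to left.

```python
def add_step(carry, digit_pair):
    """One recurrent step: (previous hidden state, input) -> (new hidden state, output)."""
    a, b = digit_pair
    total = a + b + carry
    return total // 10, total % 10  # new carry, output digit


def add_by_recursion(seq1, seq2):
    """Add two equal-length digit lists, keeping only len(seq1) output digits."""
    carry = 0  # initial hidden state
    outputs = []
    # Process the least-significant digit first, i.e. read the sequences right to left
    for digit_pair in zip(reversed(seq1), reversed(seq2)):
        carry, out_digit = add_step(carry, digit_pair)
        outputs.append(out_digit)
    return outputs[::-1]  # restore most-significant-first order


print(add_by_recursion([6, 7, 8], [6, 3, 5]))  # [3, 1, 3]
print(add_by_recursion([8, 9, 9], [1, 0, 1]))  # [0, 0, 0]
```

The same `add_step` function is applied at every position, which is the weight sharing that an RNN gets by reusing its parameters across time steps.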






### LSTM in PyTorch



A **Long Short-Term Memory (LSTM)** network is a type of recurrent neural network (RNN) designed to handle sequence data and learn long-term dependencies. Unlike a vanilla RNN, an LSTM contains a **cell state** and **gates** (e.g., the forget gate) that control the flow of information and mitigate issues like vanishing gradients.

In PyTorch, an LSTM layer processes an input sequence and returns two outputs: 
1. The output for each time step.
2. The hidden and cell states.

With `batch_first=True` (the setting used in this notebook), the input to the LSTM layer in PyTorch has the following shape:


\begin{align*}
    \text{input} = (\text{B}, \tau, \text{input\_size})
\end{align*}


- $\text{B}$ is the **batch size**, the number of sequences processed simultaneously.
- $\tau$ is the **sequence length**, which is the number of time steps in each sequence.
- $\text{input\_size}$ refers to the size (number of features) of each element in the sequence.

For example, in a problem where each element in the sequence is a 2D vector (two features), and we process a batch of 4 sequences of length 5, the input will have shape `(4, 5, 2)`.

LSTMs maintain a **hidden state** and a **cell state** for each layer. The hidden and cell states have the following shape:

\begin{align*}
    \text{hidden}, \text{cell} = (\text{num\_layers}, \text{B}, \text{hidden\_size})
\end{align*}

Where:
- $\text{num\_layers}$ is the number of LSTM layers.
- $ \text{hidden\_size} $ is the size of the hidden state (number of features in the hidden state).

If your LSTM has 2 layers, a batch size of 4, and a hidden size of 128, the hidden and cell states will have shape `(2, 4, 128)`.



The output of the LSTM consists of:
1. **Output for each time step**: This is the hidden state of the last (top) LSTM layer at each time step in the sequence.
   \begin{align*}
        \text{output} = (\text{B}, \tau, \text{hidden\_size})\;.
   \end{align*}
   
   If you process 4 sequences of length 5, and the hidden size is 128, the output will have the shape `(4, 5, 128)`.

2. **Final hidden and cell states**:
   \begin{align*}
   \text{hidden}, \text{cell} = (\text{num\_layers}, \text{B}, \text{hidden\_size})
   \end{align*}
   These tensors contain the final hidden and cell states for each layer and for each sequence in the batch. If the LSTM has 2 layers, a batch size of 4, and a hidden size of 128, both the hidden and cell states will have shape `(2, 4, 128)`.
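
The shapes described above can be checked with a few lines of PyTorch. This is a minimal sketch using the same example sizes as the text (batch of 4, sequence length 5, 2 input features, 2 layers, hidden size 128) and assuming `batch_first=True`.

```python
import torch
from torch import nn

batch_size, seq_len, input_size = 4, 5, 2
hidden_size, num_layers = 128, 2

# batch_first=True makes the layer expect (B, tau, input_size) inputs
lstm = nn.LSTM(input_size, hidden_size, num_layers, batch_first=True)

x = torch.randn(batch_size, seq_len, input_size)
h0 = torch.zeros(num_layers, batch_size, hidden_size)
c0 = torch.zeros(num_layers, batch_size, hidden_size)

output, (hidden, cell) = lstm(x, (h0, c0))

print(output.shape)  # torch.Size([4, 5, 128])
print(hidden.shape)  # torch.Size([2, 4, 128])
print(cell.shape)    # torch.Size([2, 4, 128])

# The last time step of `output` matches the final hidden state of the top layer
print(torch.allclose(output[:, -1], hidden[-1]))
```

Note that the hidden and cell states keep the `(num_layers, B, hidden_size)` layout even when `batch_first=True`; only the input and the per-step output are batch-first.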


### Did You Know? 

In 2024, work on the LSTM architecture led to **xLSTM: Extended Long Short-Term Memory**. This variant modifies the standard LSTM with the aim of improving its ability to **retain memory** over longer sequences and to learn **long-range dependencies**.

While we’ll be working with the traditional LSTM in this notebook, it's exciting to know that LSTMs are continuously evolving, and innovations like xLSTM are pushing the boundaries of what these models can achieve. 


<div style="text-align: center;">
    <img src="data/xlstm.jpg" alt="xLSTM" style="width: 50%;"/>
</div>

<div style="background-color: #2b0080; color: white; padding: 10px; border-radius: 5px;">

### <span style="color: pink;">Task #3. Learn to add via RNNs</span>

Time to level up your skills! Design an LSTM model to solve the addition problem you tackled with an MLP.

</div>

### Language Models

A **language model (LM)** is a model that learns to predict the next token in a sequence, given the preceding context. A well-known example of an LM is, duh, ChatGPT! So, let's design one.

### The Mysterious Island

**Jules Verne** (1828–1905) was a French novelist and poet, widely regarded as one of the pioneers of science fiction. His adventure novels are filled with detailed scientific explanations. Some of his most famous works include **"Twenty Thousand Leagues Under the Sea"**, **"Around the World in Eighty Days"**, and **"Journey to the Center of the Earth"**.

**"The Mysterious Island"** is the story of five castaways who find themselves stranded on a mysterious, uninhabited island. Using their ingenuity and resourcefulness, they attempt to survive, all while uncovering the secrets of the island. In this task, we’ll use **"The Mysterious Island"** as the training data to build an **LM**. Maybe we can mimic Verne’s unique writing style.






<div style="text-align: center;">
    <img src="data/Jules_Verne.jpg" alt="Jules Verne" style="width: 25%;"/>
</div>

In [None]:
## Reading and processing text
with open("data/The_Mysterious_Island.txt", "r", encoding="utf8") as jv_file:
    jv_book_text = jv_file.read()

# Create a set of unique characters
jv_char_set = set(jv_book_text)

# Print results
print("Total Length:", len(jv_book_text))
print("Unique Characters:", len(jv_char_set))

### Tokenization

A tokenizer processes text and converts it to a sequence of symbols or tokens that can be fed into a model. In this task, we’ll use a simple character-level tokenizer that treats every unique character in the text as a token (no lowercasing or word splitting). The **Tokenizer** class is responsible for:
1. **Creating a Vocabulary**: The tokenizer identifies all unique characters in the input text and builds a vocabulary, which is a sorted list of these unique characters.
2. **Character-to-Integer Mapping**: Once the vocabulary is established, the tokenizer creates a dictionary that maps each character to a unique integer index (`char2int`). This enables the model to work with numbers rather than raw text.
3. **Integer-to-Character Mapping**: The tokenizer also builds the reverse mapping (`int2char`), which allows the model to convert predicted integers back into characters for generating readable text.


- `encode(text)`: This method converts input text into a list of integer values based on the character-to-integer mapping. The output is a NumPy array of integers, where each integer represents a character in the text.
  
  Example:
  ```python
  tokenizer.encode('Hello') 
  # Example output: [8, 4, 11, 11, 14]  (illustrative; actual values depend on the vocabulary and char2int mapping)
  ```

- `decode(encoded_text)`: This method converts a list of integers (representing characters) back into the original text using the integer-to-character mapping.
  
  Example:
  ```python
  tokenizer.decode([8, 4, 11, 11, 14]) 
  # Output: 'Hello'  (converts integers back to the corresponding characters)
  ```


BTW, check out the [OpenAI Tokenizer](https://platform.openai.com/tokenizer)! 🤖


In [13]:
class Tokenizer:
    def __init__(self, text):
        # Create the vocabulary (sorted set of unique characters)
        self.vocab = sorted(set(text))
        self.char2int = {ch: i for i, ch in enumerate(self.vocab)}
        self.int2char = np.array(self.vocab)

    def encode(self, text):
        """Converts text into an array of integers."""
        return np.array([self.char2int[ch] for ch in text], dtype=np.int32)

    def decode(self, encoded_text):
        """Converts an array of integers back into text."""
        return "".join(self.int2char[encoded_text])

<div style="background-color: #2b0080; color: white; padding: 10px; border-radius: 5px;">

### <span style="color: pink;">Task #4. Study the character-level tokenizer</span>

For a short and fun break, study the `Tokenizer` class provided above. Understand how it works and how it can be used to encode and decode text. Feel free to experiment with the tokenizer by encoding and decoding different texts.

</div>

In [None]:
# Initialize the Tokenizer with the text
jv_tokenizer = Tokenizer(jv_book_text)

# Provide a custom text for encoding and decoding
prompt = "hello ece4179"

# Encode the custom text
enc_tokens = jv_tokenizer.encode(prompt)

# Print the encoded custom text
print("Prompt:", prompt)
print("Encoded tokens:", enc_tokens)

# Decode the encoded custom text back to the original text
dec_prompt = jv_tokenizer.decode(enc_tokens)

# Print the decoded text
print("Decoded prompt from tokens:", dec_prompt)

### Training data

The **TextDataset** class is responsible for preparing the training data for our LM. It encodes the input text into integer sequences and creates pairs of input-target data that the model uses to learn how to predict the next character in a sequence.


When training a character-level LM, the goal is to feed sequences of characters (input) into the model and have it predict the next character (target). The **TextDataset** class constructs the training data by creating sliding windows of sequences from the encoded text.

The main idea is to slide a window of `seq_length` characters across the encoded text. For each index `idx`, the dataset generates an input sequence and its corresponding target sequence. The input sequence is a chunk of `seq_length` characters starting at position `idx` in the encoded text. The target sequence is created by shifting the input sequence by one character. This teaches the model to predict the next character for each position in the input sequence.
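
The sliding-window construction can be previewed on a short string before looking at the dataset class. This sketch works on raw characters rather than encoded integers, and the toy text and `seq_length = 4` are arbitrary choices for illustration.

```python
text = "the island"
seq_length = 4

# One (input, target) pair per valid start index; the target is the input shifted by one
pairs = []
for idx in range(len(text) - seq_length):
    input_seq = text[idx : idx + seq_length]
    target_seq = text[idx + 1 : idx + seq_length + 1]
    pairs.append((input_seq, target_seq))

for input_seq, target_seq in pairs[:3]:
    print(repr(input_seq), "->", repr(target_seq))
# 'the ' -> 'he i'
# 'he i' -> 'e is'
# 'e is' -> ' isl'
```

A text of length 10 with a window of 4 yields 6 pairs, which matches the `len(text) - seq_length` count used for the dataset length.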


In [15]:
class TextDataset(Dataset):
    def __init__(self, book_text, tokenizer, seq_length):
        self.tokenizer = tokenizer
        self.seq_length = seq_length
        # Encode the entire book text using the tokenizer
        self.text_encoded = self.tokenizer.encode(book_text)

    def __len__(self):
        # The number of available chunks in the dataset
        return len(self.text_encoded) - self.seq_length

    def __getitem__(self, idx):
        # Generate the input sequence and its one-step-shifted target sequence on the fly
        input_seq = self.text_encoded[idx : idx + self.seq_length]
        target_seq = self.text_encoded[idx + 1 : idx + self.seq_length + 1]
        # Both are class indices, so return them as long (int64) tensors
        return torch.tensor(input_seq).long(), torch.tensor(target_seq).long()

In [None]:
seq_length = 40  # Sequence length for training
jv_dataset = TextDataset(jv_book_text, jv_tokenizer, seq_length=seq_length)

batch_size = 64
jv_dataloader = DataLoader(
    jv_dataset, batch_size=batch_size, shuffle=True, drop_last=True
)

# let's print the first sample from the dataset
trn_x_sample, trn_y_sample = jv_dataset[0]
print(f"Input sequence: {trn_x_sample}")
print(f"Target sequence: {trn_y_sample}")

<div style="background-color: #2b0080; color: white; padding: 10px; border-radius: 5px;">

### <span style="color: pink;">Task #5. Create an RNN for Character-Level LM</span>

Your RNN should consist of the following components:

1. **Embedding Layer**: This layer will map each character (represented as an integer) into a dense vector of fixed size (`embed_dim`). The embedding allows the model to learn meaningful representations of each character.

2. **LSTM Layer**: The core of your RNN model will be an **LSTM** (Long Short-Term Memory) layer. This layer will process the input sequences and capture patterns over time.

3. **Fully Connected (Linear) Layer**: After processing the sequence with the LSTM, a fully connected layer will map the hidden states back to the vocabulary space, allowing the model to predict the next character.

</div>
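
If you would like a starting point, here is one possible sketch of the three components. The class name `CharLSTM`, the single LSTM layer, and the toy sizes in the usage lines are assumptions for this example. The `init_hidden` method and the `(logits, hidden, cell)` return value are chosen to match how the training loop and the sampling function in this notebook call the model.

```python
import torch
from torch import nn


class CharLSTM(nn.Module):
    """Hypothetical character-level LM: embedding -> LSTM -> linear."""

    def __init__(self, vocab_size, embed_dim, rnn_hidden_size, num_layers=1):
        super().__init__()
        self.rnn_hidden_size = rnn_hidden_size
        self.num_layers = num_layers
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, rnn_hidden_size, num_layers, batch_first=True)
        self.fc = nn.Linear(rnn_hidden_size, vocab_size)

    def forward(self, x, hidden, cell):
        emb = self.embedding(x)                               # (B, tau, embed_dim)
        out, (hidden, cell) = self.lstm(emb, (hidden, cell))  # (B, tau, hidden_size)
        logits = self.fc(out)                                 # (B, tau, vocab_size)
        return logits, hidden, cell

    def init_hidden(self, batch_size):
        # Zero states created on the same device as the model parameters
        device = next(self.parameters()).device
        shape = (self.num_layers, batch_size, self.rnn_hidden_size)
        return torch.zeros(shape, device=device), torch.zeros(shape, device=device)


# Quick shape check with toy sizes (not the sizes used for training)
toy_model = CharLSTM(vocab_size=80, embed_dim=16, rnn_hidden_size=32)
tokens = torch.randint(0, 80, (4, 40))  # 4 sequences of 40 character indices
h0, c0 = toy_model.init_hidden(batch_size=4)
toy_logits, h1, c1 = toy_model(tokens, h0, c0)
print(toy_logits.shape)  # torch.Size([4, 40, 80])
```

The model returns raw logits for every time step; `CrossEntropyLoss` applies the softmax internally, so no activation is needed after the linear layer.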

In [None]:
vocab_size = len(jv_tokenizer.vocab)
embed_dim = 256
rnn_hidden_size = 512

# instantiate the model
jv_model = # YOUR CODE HERE

### Training Loop for LM

The training loop for the LM is very similar to the one from the previous task. The loss function is again **CrossEntropyLoss**. We will train the model for a fixed number of iterations in the workshop; feel free to improve the model by training it for longer.

In [None]:
# Loss function and optimizer
loss_fn = nn.CrossEntropyLoss()
optimizer = optim.Adam(jv_model.parameters(), lr=0.001)

# Number of iterations
num_iterations = 10000
print_every_k = 100  # Print and update the progress bar every `k` iterations
batch_size = 64  # Example batch size


# Initialize the progress bar for iterations
pbar = tqdm(total=num_iterations, desc="Training Progress", position=0, leave=True)


total_loss = 0

for iteration in range(num_iterations):

    # Draw one random batch (a fresh, reshuffled iterator is created on every call)
    seq_batch, target_batch = next(iter(jv_dataloader))

    # Initialize hidden and cell states
    hidden, cell = jv_model.init_hidden(batch_size)

    # Move batch to the correct device (CPU/GPU)
    seq_batch, target_batch = seq_batch.to(device), target_batch.to(device)

    # Zero the gradients
    optimizer.zero_grad()

    # Forward pass through the model with the entire sequence
    pred, hidden, cell = jv_model(seq_batch, hidden, cell)
    pred = pred.view(-1, vocab_size)  # Flatten to [batch_size * seq_length, vocab_size]
    # Compute the loss over the predictions at every time step
    loss = loss_fn(pred, target_batch.view(-1).long())

    # Backpropagation and optimization
    loss.backward()
    optimizer.step()

    # Accumulate total loss for this iteration
    total_loss += loss.item()

    # Update the progress bar and print after every `k` completed iterations
    if (iteration + 1) % print_every_k == 0:
        # Calculate the average loss over the last `k` iterations
        avg_loss = total_loss / print_every_k

        # Update the progress bar with the average loss
        pbar.set_postfix(loss=avg_loss)
        pbar.update(print_every_k)  # Update the progress bar by 'k' steps

        # Reset total_loss for the next `k` iterations
        total_loss = 0

# Close the progress bar when training is done
pbar.close()

<div style="background-color: #2b0080; color: white; padding: 10px; border-radius: 5px;">

### <span style="color: pink;">Task #6. Write with your model</span>

Too tired to write? Let your model do the work! Use your trained LM to generate text based on a given prompt. You can start with a simple prompt like "The island" and see what your model comes up with. Feel free to experiment with different prompts and see how your model performs.
</div>
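
When sampling, the `sample` function in this notebook divides the logits by a `temperature` before drawing the next character. The plain-Python sketch below shows the effect on a made-up set of three logits. The helper name `softmax_with_temperature` and the toy values are assumptions for this illustration; in the notebook, `Categorical(logits=...)` performs the normalisation internally.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities after dividing them by the temperature."""
    scaled = [z / temperature for z in logits]
    max_z = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - max_z) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


toy_logits = [2.0, 1.0, 0.0]  # made-up scores for three characters
for temperature in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(toy_logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

A temperature below 1 concentrates probability on the highest-scoring character, giving more conservative text, while a temperature above 1 flattens the distribution and makes the output more varied.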

In [20]:
from torch.distributions import Categorical


@torch.no_grad()  # Gradients are not needed for text generation
def sample(model, starting_str, len_generated_text=500, temperature=1.0):
    # Encode the starting string
    encoded_input = jv_tokenizer.encode(starting_str)
    encoded_input = torch.tensor(encoded_input).unsqueeze(0)  # Shape [1, seq_length]

    generated_str = starting_str

    # Set model to evaluation mode
    model.eval()
    hidden, cell = model.init_hidden(1)  # For batch size of 1
    hidden, cell = hidden.to("cpu"), cell.to("cpu")

    # Prime the hidden state with every character except the last one; the last
    # character is fed to the model in the first step of the sampling loop below
    if encoded_input.size(1) > 1:
        _, hidden, cell = model(encoded_input[:, :-1], hidden, cell)

    # Start sampling from the last character of the input
    last_char = encoded_input[:, -1]  # Get the last character of the input

    # Generate the next characters
    for i in range(len_generated_text):
        # Forward pass for the last character
        logits, hidden, cell = model(last_char.view(1, 1), hidden, cell)  # Shape [1, 1]
        logits = torch.squeeze(logits, 0)  # Remove batch dimension

        # Scale logits by temperature (divide by temperature)
        scaled_logits = logits / temperature

        # Sample from the scaled distribution
        m = Categorical(logits=scaled_logits)
        last_char = m.sample()

        # Decode and append the generated character to the string
        generated_str += jv_tokenizer.decode([last_char.item()])

    return generated_str

In [None]:
jv_model.to("cpu")
print(
    sample(
        jv_model, starting_str="The island", len_generated_text=500, temperature=0.50
    )
)