# **GR5242 HW04 Problem 1: Shakespeare with LSTM networks**

**Instructions**: This problem is an individual assignment -- you are to complete this problem on your own, without conferring with your classmates.  You should submit a completed and published notebook to Courseworks; no other files will be accepted.

## Description:

This homework exercise has 3 primary goals:
 * Introduce some basic concepts from natural language processing
 * Get some practice training recurrent neural networks, specifically on text data
 * Be able to generate fake text data from your favorite author!   

By the end of this exercise, you will have a basic, but decent, computer program which can simulate the writing patterns of any author of your choice.

Here is an outline of the rest of the exercise.
 1. Data loading
     - We will start by downloading a text from Project Gutenberg that we will try to model
     - Data preprocessing and numerical encoding
     - Making training `Dataset` and `DataLoader` objects
 3. Learn to generate text with a neural network
     - Defining the recurrent network
     - Training
     - Predicting and sampling text from the model

     There are 12 questions (70 points) in total, which include coding and written questions. You can only modify the codes and text within \### YOUR CODE HERE ### and/or \### YOUR ANSWER HERE ###.


In [1]:
import numpy as np
import torch
from torch import nn
from torch.nn import functional as F
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader, TensorDataset
import torchvision.transforms as transforms
import torchvision.datasets as datasets
import tqdm
import re

In [2]:
if torch.cuda.is_available():
    device = torch.device('cuda:0')
    print("GPU is available.")
else:
    device = torch.device('cpu')
    print("GPU is not available.")

GPU is available.


## Character-level language modeling

Our goal here is to build a model of language letter-by-letter. Since we may also allow numbers, spaces, and punctuation, it's better to say character-by-character. We will start by fixing an "alphabet": the set of allowed characters.

In math notation, let's call the alphabet $A$. In code,

In [3]:
alphabet = " ""'abcdefghijklmnopqrstuvwxyz1234567890.,!?:;ABCDEFGHIJKLMNOPQRSTUVWXYZ\n"

# Section 1: Data loading and preprocessing

We will start by downloading training data from Project Gutenberg: https://www.gutenberg.org/. Project Gutenberg is a free repository of public domain books. Find any book you like, and download it in Plain Text UTF-8 format.

For example, we will use Shakespeare's complete works: https://www.gutenberg.org/ebooks/100. There is a link on that page to the Plain Text format data.  Download the pg100.txt file, and then upload it from your computer to colab (click at left on the File icon, then click the upload icon).  

*Important*: whichever work you choose, make sure you have enough data! The size of your plain text file should be at least 2MB.

In [4]:
import requests

# URL of the book's plain text file on Project Gutenberg
url = "https://www.gutenberg.org/cache/epub/2701/pg2701.txt"  # Moby Dick; Or, The Whale

# Fetch and read the book
response = requests.get(url)
text = response.text

# Display the first 1000 characters to verify
print(text[:1000])

The Project Gutenberg eBook of Moby Dick; Or, The Whale
    
This ebook is for the use of anyone anywhere in the United States and
most other parts of the world at no cost and with almost no restrictions
whatsoever. You may copy it, give it away or re-use it under the terms
of the Project Gutenberg License included with this ebook or online
at www.gutenberg.org. If you are not located in the United States,
you will have to check the laws of the country where you are located
before using this eBook.

Title: Moby Dick; Or, The Whale

Author: Herman Melville

Release date: July 1, 2001 [eBook #2701]
                Most recently updated: August 18, 2021

Language: English

Credits: Daniel Lazarus, Jonesey, and David Widger


*** START OF THE PROJECT GUTENBERG EBOOK MOBY DICK; OR, THE WHALE ***




MOBY-DICK;

or, THE WHALE.

By Herman Melville



CONTENTS

ETYMOLOGY.

EXTRACTS (Supplied by a Sub-Sub-Librarian).

CHAPTER 1. Loomings.

CHAPTER 2. 


Let's load the text and see what it says:

In [5]:
print("text is", len(text), "characters long.")
print()
print("A sample from the middle:")
print()
print(text[len(text) // 2 : len(text) // 2 + 100])

text is 1260542 characters long.

A sample from the middle:

whole boat in its complicated coils,
twisting and writhing around it in almost every direction. All


### Data standardization

Now, we will clean the data: converting the data to lowercase, removing extra spaces and linebreaks, and get rid of characters which are not in our alphabet.

In [6]:
# remove extra characters by replacing them with spaces
text = re.sub(rf"[^{alphabet}]", " ", text)

Let's see how it looks again:

In [7]:
a = 110042
b = a+131
x_prompt = text[a:b]
print(x_prompt)

But high above the flying scud and dark rolling clouds, there 
floated a little isle of sunlight, from which beamed forth an angel 


### Numerical encoding

Unfortunately, neural networks don't understand text. So, we need to convert our characters to numerical values. Here are some helper functions for doing this.

In [8]:
# let's build a dictionary mapping characters to integers
char2int = {c: i for i, c in enumerate(alphabet)}
alphabet_array = np.array([c for c in alphabet])

# this function will turn a string into a numpy array of integers
def int_encode(string):
    if any(c not in char2int for c in string):
        raise ValueError(
            "Found a character which was not in the alphabet in the input "
            f"to int_encode. Valid alphabet characters: {alphabet}"
        )
    return np.array([char2int[c] for c in string])

# this function will decode a numpy array of integers back to a string
def int_decode(int_array):
    return ''.join(alphabet_array[int_array])

(Question 1a: 4 points) Test out `int_encode` by passing `test_string` in and printing the result.

In [9]:
# Let's test these out!
### YOUR CODE HERE ###
test_string = "STAT5242: Advanced\nMachine, Learning! Course?"
e = int_encode(test_string)
print(e)

[62 63 44 63 32 29 31 29 42  0 44  5 23  2 15  4  6  5 70 56  2  4  9 10
 15  6 39  0 55  6  2 19 15 10 15  8 40  0 46 16 22 19 20  6 41]


(Question 1b: 4 points) Decode the result from the last cell using `int_decode` to make sure it is the same as `test_string`

In [10]:
### YOUR CODE HERE ###
d = int_decode(e)
print(d)

STAT5242: Advanced
Machine, Learning! Course?


Is the decoding the same as `test_string`? It should -- you have a bug above if not.

### Make a training dataset

First, we make a numerical encoded version of the entire dataset:

In [11]:
enctext = int_encode(text)

Use `torch.tensor` to make it into a PyTorch tensor:

In [12]:
enctext = torch.tensor(enctext)
print(enctext)

tensor([ 0, 63,  9,  ..., 70,  0, 70], dtype=torch.int32)


In [13]:
enctext.shape

torch.Size([1260542])

# Section 2: Training a NN

Our model will work as follows:
 - One-hot encoded input gets passed into a linear embedding layer. These two operations are combined with the `Embedding` layer: https://pytorch.org/docs/stable/generated/torch.nn.Embedding.html
 - LSTM cell
 - Linear decoder layer

Torch has two main ways of interfacing with recurrent networks. In the case of LSTMs, those are:
 - the LSTM layer https://pytorch.org/docs/stable/generated/torch.nn.LSTM.html#torch.nn.LSTM
 - the LSTMCell layer https://pytorch.org/docs/stable/generated/torch.nn.LSTMCell.html

Both models are sequential: the goal is to process a batch of sequences of input features and produce a batch of sequences of output features. The `LSTM` class makes this simple and easy, and the `LSTMCell` class gives more control by allowing you to process the sequences one element at a time. We will use the `LSTM` layer to keep things simple, but keep in mind that some of what we do could be made more efficient with `LSTMCell`.

The inputs and outputs to recurrent networks in Torch have shape: `(batch_dimension, sequence_dimension, feature_dimension)`. In this case, our feature dimension is `len(alphabet)`.

Something to keep in mind: the output of this network will be stateful! In each batch, the `k`th output along the sequence dimension will be the logits for predicting the `k+1`th input in the batch.

In [14]:
# We will use this constant below
HIDDEN_DIM = 128

In [15]:
# Defining some parameters about data batching, explained in the next section
# Note: after you get the entire assignment working, you can make these
# bigger and train for longer, to get better performance
SEQUENCE_LENGTH = 128
BATCH_SIZE = 64

### Making the dataset of (input, target) pairs

To train the model, we need to make a `torch.utils.data.Dataset` containing input and target sequences. Our input sequences will be sequences of length `SEQUENCE_LENGTH` containing int-encoded characters from the input. Our target sequences will be the "next characters" corresponding to the input sequence: so, if the input sequence is the 10th, 11th, ... characters, then the target sequence is the 11th, 12th, ... characters.

We will walk through using `torch.utils.data.Dataset` methods to create these.

(Question 2c: 8 points) Write a `batch` function for a `torch` tensor, which we defined above, to make disjoint consecutive sequences of consecutive characters of length `SEQUENCE_LENGTH`. `torch.split()` and `torch.vstack()` may be useful. Remember to be careful of the edge case that arises when `len(enctext) % SEQUENCE_LENGTH != 0`.

In [16]:
def batch(enctext: torch.Tensor, seqlen: int):
    # YOUR CODE HERE
    n = enctext.shape[0]
    remainder = n % seqlen
    batches = torch.split(enctext[:n - remainder], seqlen)
    if remainder:
        last_batch = torch.hstack([
                enctext[n - remainder:],
                torch.tensor([0] * (seqlen - remainder))
            ]) # padding
        return torch.vstack([torch.vstack(batches), last_batch])
    return torch.vstack(batches)

(Question 2d: 8 points) Now, use batch to create target sequences from the following version of the dataset which has been offset by 1 element:

In [17]:
batches = batch(enctext, SEQUENCE_LENGTH)
input_seqs  = batches[:-1]
target_seqs = batches[1:]

In [18]:
input_seqs.shape, target_seqs.shape

(torch.Size([9847, 128]), torch.Size([9847, 128]))

(Question 2e: 6 points) Now, use the `torch` builtin class `torch.utils.data.TensorDataset` to create a dataset of (input, target) pairs:

In [19]:
pairs = TensorDataset(input_seqs, target_seqs)
pairs

<torch.utils.data.dataset.TensorDataset at 0x1db4ce6d310>

(Question 2f: 4 points) Finally, define a `torch.utils.data.DataLoader` object to generate batches of pairs of length `BATCH_SIZE`:

In [20]:
train_loader = DataLoader(pairs, batch_size=BATCH_SIZE, shuffle=True)
train_loader

<torch.utils.data.dataloader.DataLoader at 0x1db4ce97050>

You may uncomment the below cell if you would like to understand the structure of the `train_loader`.

In [21]:
for i, (x, y) in enumerate(train_loader):
    print(i, x.shape, y.shape)
    if i > 1:
        break

0 torch.Size([64, 128]) torch.Size([64, 128])
1 torch.Size([64, 128]) torch.Size([64, 128])
2 torch.Size([64, 128]) torch.Size([64, 128])


(Question 2a: 10 points) Model definition: make a Sequential model with an Embedding layer with input dimension `len(alphabet)` and output dimension `HIDDEN_DIM`, followed by an LSTM layer with `HIDDEN_DIM` features, followed by a Linear layer with `len(alphabet)` features. A helper class is provided to extract tensors from the output of the LSTM layer to prepare as input to the input of the final linear layer. Use of this class in the Sequential container would look something like `('extract', extract_tensor(return_sequences=return_sequences))`  

In [22]:
from collections import OrderedDict

return_sequences = True

# LSTM() returns tuple of (tensor, (recurrent state))
class extract_tensor(nn.Module):
    def __init__(self, return_sequences=False):
        super(extract_tensor, self).__init__()
        self.return_sequences = return_sequences

    def forward(self,x):
        # Output shape (batch, features, hidden)
        tensor, _ = x
        # Reshape shape (batch, hidden)
        if not self.return_sequences:
            tensor = tensor[:, -1, :]
        return tensor

# input: (BATCH_SIZE, len(alphabet)) = (64, 71)
model = nn.Sequential(OrderedDict([
    ('embedding', nn.Embedding(len(alphabet), HIDDEN_DIM)), # -> (BATCH_SIZE, SEQUENCE_LENGTH, embedding_dim) = (64, 128, 128)
    ('lstm', nn.LSTM(HIDDEN_DIM, HIDDEN_DIM)), # -> (BATCH_SIZE, SEQUENCE_LENGTH, HIDDEN_DIM) = (64, 128, 128)
    ('extract', extract_tensor(return_sequences=return_sequences)), # -> (BATCH_SIZE, SEQUENCE_LENGTH, HIDDEN_DIM) = (64, 128, 128)
    ('linear', nn.Linear(HIDDEN_DIM, len(alphabet))) # -> (BATCH_SIZE, SEQUENCE_LENGTH, len(alphabet)) = (64, 128, 71)
]))

model = model.to(device)

In [23]:
print(model)

Sequential(
  (embedding): Embedding(71, 128)
  (lstm): LSTM(128, 128)
  (extract): extract_tensor()
  (linear): Linear(in_features=128, out_features=71, bias=True)
)


In [24]:
print(len(alphabet), HIDDEN_DIM)
for name, param in model.named_parameters():
    print(f"Layer: {name} | Shape: {param.shape}")

71 128
Layer: embedding.weight | Shape: torch.Size([71, 128])
Layer: lstm.weight_ih_l0 | Shape: torch.Size([512, 128])
Layer: lstm.weight_hh_l0 | Shape: torch.Size([512, 128])
Layer: lstm.bias_ih_l0 | Shape: torch.Size([512])
Layer: lstm.bias_hh_l0 | Shape: torch.Size([512])
Layer: linear.weight | Shape: torch.Size([71, 128])
Layer: linear.bias | Shape: torch.Size([71])


embedding.weight | Shape: torch.Size([71, 128]):

    71 unique characters (alphabet) and embed each into a 128-dimensional space.

Layer: lstm.weight_ih_l0 | Shape: torch.Size([512, 128]):

    input-to-hidden weight matrix for LSTM. [512, 128]: it takes a 128-dimensional input (the output of the embedding layer) and transforms it into a 512-dimensional hidden state. 512 comes from that an LSTM has 4 internal gates (input, forget, cell, output), and each gate has its own weight matrix of size [HIDDEN_DIM, input_size], so 4 * 128 = 512.

Layer: lstm.weight_hh_l0 | Shape: torch.Size([512, 128]):

    hidden-to-hidden weight matrix within LSTM. It takes the previous 128-dimensional hidden state and transforms it into a new 512-dimensional hidden state (again, due to the 4 gates).

Layer: lstm.bias_ih_l0 | Shape: torch.Size([512]):

    Bias terms for the input-to-hidden connections.

Layer: lstm.bias_hh_l0 | Shape: torch.Size([512]):

    Bias terms for the hidden-to-hidden connections.

Layer: linear.weight | Shape: torch.Size([71, 128]):

    This is your final linear layer. It takes the 128-dimensional output of the LSTM (after your extract_tensor) and maps it to 71 output classes (one for each character in your alphabet).

Layer: linear.bias | Shape: torch.Size([71]):

    Bias terms for the linear layer.

(Question 2b: 8 points) If we want to use the output of the model as logits for predicting a character (which we can think of as a class), what loss should we use? Name this `criterion`. Additionally, define an optimizer to use in training. As per usual, we will recommend the use of `optim.Adam`.

In [25]:
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.9) #

(Question 2g: 2 points) Train the model!


Write a training loop in PyTorch for a model that processes batched input data. Use `NUM_EPOCHS = 40` as the number of training epochs, and ensure that:
1. Each epoch consists of iterating over batches from a `train_loader`.
2. For each batch, the model's gradients are zeroed, a forward pass is made, and a loss is calculated using a provided criterion.
3. After each batch's loss is calculated, perform backpropagation and optimizer steps.
4. Track and print the average loss at the end of each epoch.



In [26]:
### YOUR CODE HERE ###
n_epochs = 15
n_train = len(train_loader) # number of batches

print("start to train")
for epoch in range(n_epochs):
    model.train()
    total_accuracy = 0.0
    total_loss = 0.0

    loop = tqdm.tqdm(train_loader, leave=True)
    for inputs, labels in loop:
        # print(inputs[:1], labels[:1])
        inputs = inputs.to(device) # torch.Size([BATCH_SIZE, SEQUENCE_LENGTH]) = (64,128)
        labels = labels.to(device) # torch.Size([BATCH_SIZE, SEQUENCE_LENGTH]) = (64,128)
        # labels = nn.functional.one_hot(labels, num_classes=len(alphabet))  # (64, 128, 71)

        optimizer.zero_grad()
        outputs = model(inputs) # -> torch.Size([64, 128, 71])
        outputs = outputs.view(-1, len(alphabet)) # -> (64*128, 71) = (8192, 71)
        labels = labels.view(-1) # -> (64*128, ) = (8192, )
        loss = criterion(outputs, labels)

        # predicted_classes = outputs.argmax(dim=1) # -> (8192, )
        _, predicted_classes = torch.max(nn.functional.softmax(outputs, dim=1), 1) # -> (8192, )
        batch_accuracy = (predicted_classes == labels).float().mean()
        total_accuracy += batch_accuracy.item()
        total_loss += loss.item() * len(labels)

        loss.backward()
        optimizer.step()

        loop.set_description(f"Epoch [{epoch+1}/{n_epochs}]")
        loop.set_postfix(loss=loss.item())

    # Calculate the average accuracy for the entire epoch
    average_accuracy = total_accuracy / n_train
    average_loss = total_loss / n_train

    scheduler.step()

    print(f" Epoch {epoch + 1}/{n_epochs}, Loss: {average_loss:.3f}, Accuracy: {average_accuracy * 100:.2f}%")
    #print("len train_loader is " + str(len(train_loader)))

start to train


Epoch [1/15]: 100%|██████████| 154/154 [00:01<00:00, 112.95it/s, loss=3.14]


 Epoch 1/15, Loss: 36568.758, Accuracy: 14.83%


Epoch [2/15]: 100%|██████████| 154/154 [00:01<00:00, 138.78it/s, loss=3.14]


 Epoch 2/15, Loss: 26029.649, Accuracy: 17.45%


Epoch [3/15]: 100%|██████████| 154/154 [00:01<00:00, 139.25it/s, loss=3.15]


 Epoch 3/15, Loss: 27057.892, Accuracy: 16.53%


Epoch [4/15]: 100%|██████████| 154/154 [00:01<00:00, 140.86it/s, loss=3.15]


 Epoch 4/15, Loss: 25803.391, Accuracy: 18.02%


Epoch [5/15]: 100%|██████████| 154/154 [00:01<00:00, 144.01it/s, loss=3.12]


 Epoch 5/15, Loss: 25846.772, Accuracy: 17.76%


Epoch [6/15]: 100%|██████████| 154/154 [00:01<00:00, 139.29it/s, loss=3.18]


 Epoch 6/15, Loss: 25623.713, Accuracy: 18.17%


Epoch [7/15]: 100%|██████████| 154/154 [00:01<00:00, 140.72it/s, loss=3.13]


 Epoch 7/15, Loss: 25593.065, Accuracy: 18.12%


Epoch [8/15]: 100%|██████████| 154/154 [00:01<00:00, 137.04it/s, loss=3.15]


 Epoch 8/15, Loss: 25624.168, Accuracy: 18.00%


Epoch [9/15]: 100%|██████████| 154/154 [00:01<00:00, 143.20it/s, loss=3.13]


 Epoch 9/15, Loss: 25721.262, Accuracy: 17.93%


Epoch [10/15]: 100%|██████████| 154/154 [00:01<00:00, 145.20it/s, loss=3.11]


 Epoch 10/15, Loss: 25735.352, Accuracy: 17.93%


Epoch [11/15]: 100%|██████████| 154/154 [00:01<00:00, 138.67it/s, loss=3.19]


 Epoch 11/15, Loss: 25621.486, Accuracy: 18.11%


Epoch [12/15]: 100%|██████████| 154/154 [00:01<00:00, 140.30it/s, loss=3.13]


 Epoch 12/15, Loss: 25679.202, Accuracy: 18.04%


Epoch [13/15]: 100%|██████████| 154/154 [00:01<00:00, 147.79it/s, loss=3.1] 


 Epoch 13/15, Loss: 25685.797, Accuracy: 17.99%


Epoch [14/15]: 100%|██████████| 154/154 [00:01<00:00, 146.85it/s, loss=3.15]


 Epoch 14/15, Loss: 25729.259, Accuracy: 17.86%


Epoch [15/15]: 100%|██████████| 154/154 [00:01<00:00, 145.37it/s, loss=11.2]

 Epoch 15/15, Loss: 36714.137, Accuracy: 15.17%





Here, make sure the loss goes down as it trains.

# Section 3: Did it work? Let's see what the model learned

Here, we'll write some functions to see how well the model has learned to predict text and to draw samples from the model.

First, we'll give you a function to "seed" the model with some input text and then predict the most likely future text. It will be your job to create a variation on this function in the question below, so make sure you understand how it works.

In [35]:
def predict(seed_string, sample_length=50):
    # Convert seed_string to int
    current_text_ints = list(int_encode(seed_string))
    # print(current_text_ints)

    for i in range(sample_length):
        # Add an empty batch dimension and convert to tensor
        text_arr = np.array(current_text_ints).reshape(1, -1)
        text_arr = torch.tensor(text_arr).to(device)
        # print(text_arr)

        # set our model to return only one output instead of the sequence
        model.extract.return_sequences = False

        # Get the full sequence of predictions, remove the batch dim
        logits = model(text_arr)
        # print(logits)

        # Remove the batch dimension and get the final logits
        final_logits = logits[-1]
        # print(final_logits)

        # Get the prediction using tf.argmax
        pred = torch.argmax(final_logits)
        # print(pred)
        # print('--------------')

        # Append this to `current_text_ints`
        # current_text_ints.append(pred.numpy())
        current_text_ints.append(pred.item())

    return int_decode(np.array(current_text_ints))

In [36]:
pred_length = 150

# Print the initial prompt (x_prompt) used for prediction
print("Initial prompt (x_prompt) used for prediction:")
print(len(x_prompt), x_prompt)  # Assuming x_prompt is already defined

a = 110042
bb = a + 131 + pred_length
x_prompt_plus = text[a:bb]  # Extracting substring from text

# Print the extracted substring from 'text' within the specified range
print("\nExtracted substring (x_prompt_plus) from 'text' starting at index", a, "up to index", bb, ":")
print(x_prompt_plus)

# Print the predicted text based on x_prompt with specified prediction length
print("\nPredicted text based on x_prompt with a prediction length of", pred_length, "characters:")

pred_result = predict(x_prompt, pred_length)
print(len(pred_result), len(pred_result.strip()), pred_result)

Initial prompt (x_prompt) used for prediction:
131 But high above the flying scud and dark rolling clouds, there 
floated a little isle of sunlight, from which beamed forth an angel 

Extracted substring (x_prompt_plus) from 'text' starting at index 110042 up to index 110323 :
But high above the flying scud and dark rolling clouds, there 
floated a little isle of sunlight, from which beamed forth an angel s 
face; and this bright face shed a distinct spot of radiance upon the 
ship s tossed deck, something like that silver plate now inserted into 
the V

Predicted text based on x_prompt with a prediction length of 150 characters:
281 280 But high above the flying scud and dark rolling clouds, there 
floated a little isle of sunlight, from which beamed forth an angel m m m m m m m m m m m m m m m m m m m m m m m m m m m m m m m m m m m m m m m m m m m m m m m m m m m m m m m m m m m m m m m m m m m m m m m m m m m 


In [38]:
# feel free to try your own seed!
p = "Hello"
res = predict(p, 10)
print(res, len(res)) # it's simply "predicting" lots of m's and spaces

Hellod m m m m  15


It seems like maybe the model learned something, but the output is a little boring. Let's make it more interesting with *randomness*!

Right now, the function always picks the most likely next letter. Instead, let's sample the next letter from the model's predicted probability distribution.

(Question 3a: 8 points) Fill in the blanks in the function below.

In [None]:
logits = torch.tensor([[ 1.1882e+01, -2.1653e+01,  7.3202e+00,  1.0307e+01,  9.2052e+00,
          1.2288e+01,  1.0670e+01, -5.2825e-01, -7.5656e-01,  9.4695e+00,
          7.0515e+00, -6.8845e+00,  3.6359e+00,  3.6441e+00,  7.9128e+00,
          1.1908e+01,  9.1662e+00,  7.3278e+00, -8.8111e+00,  7.7560e+00,
          8.8169e+00,  1.0833e+01,  1.0459e+01, -9.0431e+00, -6.9849e-01,
         -2.5091e+01,  5.1266e+00, -3.3887e+01, -1.1239e+01, -4.7748e+01,
          1.8662e-02, -3.3456e+01, -9.3685e+00, -3.3035e+01, -1.9563e+01,
         -1.6129e+01, -1.4965e+01, -1.7602e+00, -6.3772e+00,  1.0108e+01,
          5.1536e+00,  4.1522e+00, -3.6856e+01, -1.7400e+01, -4.5224e+01,
          3.3552e+00, -2.9813e+00,  2.7603e+00, -4.3596e+00,  3.4254e+00,
         -4.3740e+00,  4.8554e+00, -8.1969e+00, -4.2921e+01, -7.3727e+01,
         -1.2317e+01, -7.8261e+00,  1.4202e+00, -1.3906e+00, -1.8264e+01,
         -2.6762e+00,  4.4032e+00, -4.0476e+01,  4.2402e+00, -1.3938e+01,
         -4.9166e+00, -1.8416e+01, -4.0483e+01, -1.8585e+01, -8.0525e+00,
          7.4953e+00]])

probabilities = F.softmax(logits, dim=1)
torch.multinomial(probabilities, num_samples=1).item()

In [85]:
def generate(seed_string, sample_length=50):
    # Convert seed_string to int
    current_text_ints = list(int_encode(seed_string))

    for i in range(sample_length):
        # Add an empty batch dimension and convert to tensor
        text_arr = np.array(current_text_ints).reshape(1, -1)
        text_arr = torch.tensor(text_arr).to(device)

        # set our model to return only one output instead of the sequence
        model.extract.return_sequences = False

        # Get the full sequence of predictions, remove the batch dim
        logits = model(text_arr)

        # Remove the batch dimension and get the final logits
        final_logits = logits[-1]

        # Normalize the final_logits to a probability distribution
        probs = F.softmax(final_logits, dim=0)  # YOUR CODE HERE

        # Call .numpy so we can use a numpy function
        # probs = probs.numpy()

        # Sample from the probability distribution using
        # the function np.random.choice - doesnt work for dimension >= 32
        sample = torch.multinomial(probs, num_samples=1).item()

        # Append this to `current_text_ints`
        current_text_ints.append(sample)

    return int_decode(np.array(current_text_ints))

In [None]:
# current_text_ints = list(int_encode('heelo'))
# text_arr = np.array(current_text_ints).reshape(1, -1)
# text_arr = torch.tensor(text_arr).to(device)
# model.extract.return_sequences = False

# # Get the full sequence of predictions, remove the batch dim
# logits = model(text_arr)

# # Remove the batch dimension and get the final logits
# final_logits = logits[-1]

# # Normalize the final_logits to a probability distribution
# print(final_logits)
# probs = F.softmax(final_logits, dim=0)
# probs

(Question 3b: 6 points) Test this function `generate`. Is its output different from `predict`? How does it differ, and why?

In [86]:
# YOUR CODE HERE

# Print the initial prompt (x_prompt) used for prediction
print("Initial prompt (x_prompt) used for prediction:")
print(x_prompt)  # Assuming x_prompt is already defined

a = 110042
bb = a + 131 + pred_length
x_prompt_plus = text[a:bb]  # Extracting substring from text

# Print the extracted substring from 'text' within the specified range
print("\nExtracted substring (x_prompt_plus) from 'text' starting at index", a, "up to index", bb, ":")
print(x_prompt_plus)

# Print the predicted text based on x_prompt with specified prediction length
print("\nPredicted text based on x_prompt with a prediction length of", pred_length, "characters:")
print(predict(x_prompt, pred_length))

# Print the predicted text based on x_prompt with specified prediction length
print("\nGenerated text based on x_prompt with a prediction length of", pred_length, "characters:")
print(generate(x_prompt, pred_length))


# Print the predicted text based on x_prompt with specified prediction length
print("\nGenerated text based on x_prompt with a prediction length of", pred_length, "characters:")
print(generate(x_prompt, pred_length))


Initial prompt (x_prompt) used for prediction:
But high above the flying scud and dark rolling clouds, there 
floated a little isle of sunlight, from which beamed forth an angel 

Extracted substring (x_prompt_plus) from 'text' starting at index 110042 up to index 110323 :
But high above the flying scud and dark rolling clouds, there 
floated a little isle of sunlight, from which beamed forth an angel s 
face; and this bright face shed a distinct spot of radiance upon the 
ship s tossed deck, something like that silver plate now inserted into 
the V

Predicted text based on x_prompt with a prediction length of 150 characters:
But high above the flying scud and dark rolling clouds, there 
floated a little isle of sunlight, from which beamed forth an angel m m m m m m m m m m m m m m m m m m m m m m m m m m m m m m m m m m m m m m m m m m m m m m m m m m m m m m m m m m m m m m m m m m m m m m m m m m m 

Generated text based on x_prompt with a prediction length of 150 characters:
But hi

In [87]:
print("\nGenerated text based on x_prompt with a prediction length of", pred_length, "characters:")
print(generate(x_prompt, pred_length))

# Print the predicted text based on x_prompt with specified prediction length
print("\nGenerated text based on x_prompt with a prediction length of", pred_length, "characters:")
print(generate(x_prompt, pred_length))


Generated text based on x_prompt with a prediction length of 150 characters:
But high above the flying scud and dark rolling clouds, there 
floated a little isle of sunlight, from which beamed forth an angel m ud mt  meo mnpt ub le me e mus meobt ud m mht hdd ud mnnpo   e ,e mt mt emuubnpt m me ubd mnpot  e uub d ,e mt mt  lh m t ld unponpt m mn
dnpo le ub

Generated text based on x_prompt with a prediction length of 150 characters:
But high above the flying scud and dark rolling clouds, there 
floated a little isle of sunlight, from which beamed forth an angel m me le , ls meod mct mnnnn0 mis mb leodn
 m mt me m ,e ,npohct ubnponp esubnnnpodddd ududut ub m u,dddh,e ,npt mnpt meod me mnpt mm mnpt m me mn; mt 


(Question 3c: 2 point) Try running `generate` a few times. Are the results the same or different? Why?

The results are difference since there is randomness. However, there seems to be characters that show up much more frequently than others. (Or, some characters do not appear at all.)