Trey Tuscai and Gordon Doore

Spring 2025

CS 444: Deep Learning

#### Project 4: Transformers

In this final notebook, we will train larger GPTs on a large corpus of prose: the entire works of Shakespeare. Once trained, you will be able to prompt your GPTs with some text, and they will generate text that plausibly continues it.

In [1]:
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt

plt.style.use(['seaborn-v0_8-colorblind', 'seaborn-v0_8-darkgrid'])
plt.rcParams.update({'font.size': 20})

np.set_printoptions(suppress=True, precision=4)

# Automatically reload your external source code
%load_ext autoreload
%autoreload 2

![Some fun](images/transformer4.png)

## Task 8. Preprocess a large corpus of text

**NOTE:** There is no Task 7. It was removed due to time constraints.

<!-- Let's write code to load in the works of Shakespeare (`shakespeare.txt`) and preprocess it so that we can try a transformer on the text. -->

Run the test code in this section to make sure the works of Shakespeare (`shakespeare.txt`) are loaded and preprocessed properly for the transformer.

In [2]:
from preprocess_corpus import load_document, make_char2ind_map, make_seqs_and_labels

### 8a. Generate corpus and vocabulary

<!-- In `preprocess_corpus.py`, implement the `load_document` function to load in the Shakespeare corpus and make the vocabulary. -->
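
A minimal sketch of how a character-level vocabulary could be derived, assuming it is simply the sorted set of unique characters in the corpus. That ordering is consistent with the expected vocabulary printed below, but the actual `load_document` in `preprocess_corpus.py` is not shown here, and the helper name `build_char_vocab` is hypothetical.

```python
def build_char_vocab(text):
    """Return the sorted list of unique characters in `text`."""
    # Sorting by code point puts '\n' and ' ' first, then punctuation,
    # digits, uppercase letters, and lowercase letters.
    return sorted(set(text))


# Toy stand-in for the corpus (assumption: the real text is read from shakespeare.txt).
sample = 'First Citizen:\nBefore we proceed any further, hear me speak.'
sample_vocab = build_char_vocab(sample)
print(len(sample_vocab), sample_vocab[:5])
```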

In [3]:
corpus, vocab = load_document()

print(f'The vocabulary has {len(vocab)} tokens and it should have 65.')
print(f'The vocabulary is (split up over multiple lines):\n{vocab[:25]}\n{vocab[25:50]}\n{vocab[50:]}\n')
print('and it should be:')
print("""['\\n', ' ', '!', '$', '&', "'", ',', '-', '.', '3', ':', ';', '?', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L']
['M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k']
['l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']""")

print(f'The corpus has {len(corpus)} chars and it should have 1115394.')
print(55*'-')
print('The first 50 chars of the corpus is:')
print(corpus[:50])
print('and it should be:')
print('''First Citizen:
Before we proceed any further, hear''')
print(55*'-')
print('The last 50 chars of the corpus is:')
print(corpus[-50:])
print('and it should be:')
print('''eep--die, rather; wink'st
Whiles thou art waking.
''')

The vocabulary has 65 tokens and it should have 65.
The vocabulary is (split up over multiple lines):
['\n', ' ', '!', '$', '&', "'", ',', '-', '.', '3', ':', ';', '?', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L']
['M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k']
['l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']

and it should be:
['\n', ' ', '!', '$', '&', "'", ',', '-', '.', '3', ':', ';', '?', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L']
['M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k']
['l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
The corpus has 1115394 chars and it should have 1115394.
-------------------------------------------------------
The first 50 chars of the corpus is:
First Citizen:
Before we proceed any further, hear
and it should be:
Fi

### 8b. Create char2ind map

<!-- In `preprocess_corpus.py`, implement the `make_char2ind_map` function and test it below. -->
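
As a hedged illustration, a char-to-index map can be built by numbering the vocabulary in order, which matches the expected keys and values `0` through `64` in the test below. The name `char_to_index_map` is hypothetical and is not the course's `make_char2ind_map`.

```python
def char_to_index_map(vocab):
    """Map each character to its position in `vocab`."""
    return {ch: i for i, ch in enumerate(vocab)}


# Toy vocabulary; the real one has 65 characters.
demo_map = char_to_index_map(['\n', ' ', 'a', 'b'])
print(demo_map)
```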

In [4]:
char2ind_map = make_char2ind_map(vocab)

print(f'Size of your char2ind map is {len(char2ind_map)} and it should be 65.')
print('Keys of your char2ind map:')
print(''.join(char2ind_map.keys()))
print("They should be \n\n !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz")
print('Values of your char2ind map:')
print(list(char2ind_map.values()))
print("They should be")
print('[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64]')


Size of your char2ind map is 65 and it should be 65.
Keys of your char2ind map:

 !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
They should be 

 !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
Values of your char2ind map:
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64]
They should be
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64]


### 8c. Create sequences of int-coded texts and labels

<!-- In `preprocess_corpus.py`, implement the `make_seqs_and_labels` function, which should extract sequential `seq_len` long chunks (*our desired sequence length for the transformer*) to form the sequences on which we will train the transformer. The labels/targets are just the chars shifted by 1 (i.e. the next char in the corpus). -->
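
The chunking can be sketched as follows, assuming non-overlapping chunks and one reserved trailing character so the last chunk still has a label. Under that assumption the Shakespeare corpus yields `(1115394 - 1) // 250 = 4461` sequences, which agrees with the expected shape below. This sketch uses plain Python lists and a hypothetical name (`chunk_seqs_and_labels`); the tested function appears to return TensorFlow tensors, so a real implementation would convert the result, for example with `tf.constant`.

```python
def chunk_seqs_and_labels(corpus, char2ind, seq_len):
    """Split `corpus` into non-overlapping int-coded chunks and next-char labels."""
    ids = [char2ind[ch] for ch in corpus]
    # Each chunk needs one extra char after it to serve as the final label.
    num_seqs = (len(ids) - 1) // seq_len
    seqs, labels = [], []
    for n in range(num_seqs):
        start = n * seq_len
        seqs.append(ids[start:start + seq_len])
        # Labels are the same window shifted right by one char.
        labels.append(ids[start + 1:start + seq_len + 1])
    return seqs, labels


toy_map = {'a': 0, 'b': 1, 'c': 2}
toy_seqs, toy_labels = chunk_seqs_and_labels('abcabcab', toy_map, seq_len=3)
print(toy_seqs)
print(toy_labels)
```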

In [6]:
seq_len = 250
seqs, labels = make_seqs_and_labels(corpus, char2ind_map, seq_len=seq_len)

print(f'The shape of your Shakespeare sequences is {seqs.shape} and it should be (4461, 250).')
print(f'The shape of your Shakespeare labels is {labels.shape} and it should be (4461, 250).')
print('The first 15 int-coded tokens of the 1st few sequences are:')
print(seqs[:5, :15].numpy())
print('they should be:')
print('''[[18 47 56 57 58  1 15 47 58 47 64 43 52 10  0]
 [ 0 13 50 50 10  0 35 43  1 49 52 53 61  5 58]
 [ 1 41 47 58 47 64 43 52 57  6  1 58 46 43  1]
 [ 1 58 46 43  1 53 40 48 43 41 58  1 53 44  1]
 [31 43 41 53 52 42  1 15 47 58 47 64 43 52 10]]''')

print('The first 15 int-coded tokens of the last few sequences are:')
print(seqs[-5:, :15].numpy())
print('they should be:')
print('''[[57 53  1 61 43 39 49 50 63  8  1 35 47 50 50]
 [ 6  0 16 53  1 52 53 58  1 53 51 47 58  1 58]
 [58 56 39 52 45 43  1 42 56 53 61 57 47 52 43]
 [42 56 53 54 54  5 42  6  1 39 57  1 40 63  1]
 [13 26 10  0 35 46 39 58  6  1 39 56 58  1 58]]''')

The shape of your Shakespeare sequences is (4461, 250) and it should be (4461, 250).
The shape of your Shakespeare labels is (4461, 250) and it should be (4461, 250).
The first 15 int-coded tokens of the 1st few sequences are:
[[18 47 56 57 58  1 15 47 58 47 64 43 52 10  0]
 [ 0 13 50 50 10  0 35 43  1 49 52 53 61  5 58]
 [ 1 41 47 58 47 64 43 52 57  6  1 58 46 43  1]
 [ 1 58 46 43  1 53 40 48 43 41 58  1 53 44  1]
 [31 43 41 53 52 42  1 15 47 58 47 64 43 52 10]]
they should be:
[[18 47 56 57 58  1 15 47 58 47 64 43 52 10  0]
 [ 0 13 50 50 10  0 35 43  1 49 52 53 61  5 58]
 [ 1 41 47 58 47 64 43 52 57  6  1 58 46 43  1]
 [ 1 58 46 43  1 53 40 48 43 41 58  1 53 44  1]
 [31 43 41 53 52 42  1 15 47 58 47 64 43 52 10]]
The first 15 int-coded tokens of the last few sequences are:
[[57 53  1 61 43 39 49 50 63  8  1 35 47 50 50]
 [ 6  0 16 53  1 52 53 58  1 53 51 47 58  1 58]
 [58 56 39 52 45 43  1 42 56 53 61 57 47 52 43]
 [42 56 53 54 54  5 42  6  1 39 57  1 40 63  1]
 [13 26 10  0 35 46 39

### 8d. Add padding char to dictionary

**TODO:** Add the usual padding char (`'#'`) to the char2ind map at the next available int slot.

In [7]:
# Add padding char to dictionary
padding_char = '#'
next_available_index = len(char2ind_map)
char2ind_map[padding_char] = next_available_index
print(f'Size of your char2ind map is {len(char2ind_map)} and it should be 66.')
print('Keys of your char2ind map:')
print(''.join(char2ind_map.keys()))
print("They should be \n\n !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz#")
print('Values of your char2ind map:')
print(list(char2ind_map.values()))
print("They should be")
print('[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65]')

Size of your char2ind map is 66 and it should be 66.
Keys of your char2ind map:

 !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz#
They should be 

 !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz#
Values of your char2ind map:
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65]
They should be
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65]


## Task 9. Train GPT on Shakespeare

Now we are ready to train a GPT on the works of Shakespeare!

### 9a. Build `GPTMini6`

We will use a deeper transformer called `GPTMini6` for training on the Shakespeare corpus. Build the neural network then check the summary below.

In [14]:
from gpts import GPTMini6

In [15]:
# TODO: Set padding_char_enc to the int coded padding token below
padding_char_enc = char2ind_map['#']
myminigpt = GPTMini6(vocab_sz=9, seq_len=15, padding_char_enc=padding_char_enc)
myminigpt.compile(loss='temporal_cross_entropy')

---------------------------------------------------------------------------
Dense layer output(output) shape: [1, 15, 9]
TransformerBlock_5:
	TransformerBlock_5/MLP:
	Dropout layer output(TransformerBlock_5/MLP/dropout) shape: [1, 15, 384]
	Dense layer output(TransformerBlock_5/MLP/dense_1) shape: [1, 15, 384]
	Dense layer output(TransformerBlock_5/MLP/dense_0) shape: [1, 15, 1536]
	TransformerBlock_5/multihead_attention:
	Dropout layer output(TransformerBlock_5/multihead_attention/dropout) shape: [1, 15, 384]
	Dense layer output(TransformerBlock_5/multihead_attention/dense_1) shape: [1, 15, 384]
	TransformerBlock_5/multihead_attention/attention:
	Dropout layer output(TransformerBlock_5/multihead_attention/attention/dropout) shape: [1, 6, 15, 15]
	TransformerBlock_5/multihead_attention/qkv_block:
	Dense layer output(TransformerBlock_5/multihead_attention/qkv_block/dense_v) shape: [1, 15, 384]
	Dense layer output(TransformerBlock_5/multihead_attention/qkv_block/dense_k) shape: [1, 15, 3

The above cell should output:

```
---------------------------------------------------------------------------
Dense layer output(output) shape: [1, 15, 9]
TransformerBlock_5:
	TransformerBlock_5/MLP:
	Dropout layer output(TransformerBlock_5/MLP/dropout) shape: [1, 15, 384]
	Dense layer output(TransformerBlock_5/MLP/dense_1) shape: [1, 15, 384]
	Dense layer output(TransformerBlock_5/MLP/dense_0) shape: [1, 15, 1536]
	TransformerBlock_5/multihead_attention:
	Dropout layer output(TransformerBlock_5/multihead_attention/dropout) shape: [1, 15, 384]
	Dense layer output(TransformerBlock_5/multihead_attention/dense_1) shape: [1, 15, 384]
	TransformerBlock_5/multihead_attention/attention:
	Dropout layer output(TransformerBlock_5/multihead_attention/attention/dropout) shape: [1, 6, 15, 15]
	TransformerBlock_5/multihead_attention/qkv_block:
	Dense layer output(TransformerBlock_5/multihead_attention/qkv_block/dense_v) shape: [1, 15, 384]
	Dense layer output(TransformerBlock_5/multihead_attention/qkv_block/dense_k) shape: [1, 15, 384]
	Dense layer output(TransformerBlock_5/multihead_attention/qkv_block/dense_q) shape: [1, 15, 384]
TransformerBlock_4:
	TransformerBlock_4/MLP:
	Dropout layer output(TransformerBlock_4/MLP/dropout) shape: [1, 15, 384]
	Dense layer output(TransformerBlock_4/MLP/dense_1) shape: [1, 15, 384]
	Dense layer output(TransformerBlock_4/MLP/dense_0) shape: [1, 15, 1536]
	TransformerBlock_4/multihead_attention:
	Dropout layer output(TransformerBlock_4/multihead_attention/dropout) shape: [1, 15, 384]
	Dense layer output(TransformerBlock_4/multihead_attention/dense_1) shape: [1, 15, 384]
	TransformerBlock_4/multihead_attention/attention:
	Dropout layer output(TransformerBlock_4/multihead_attention/attention/dropout) shape: [1, 6, 15, 15]
	TransformerBlock_4/multihead_attention/qkv_block:
	Dense layer output(TransformerBlock_4/multihead_attention/qkv_block/dense_v) shape: [1, 15, 384]
	Dense layer output(TransformerBlock_4/multihead_attention/qkv_block/dense_k) shape: [1, 15, 384]
	Dense layer output(TransformerBlock_4/multihead_attention/qkv_block/dense_q) shape: [1, 15, 384]
TransformerBlock_3:
	TransformerBlock_3/MLP:
	Dropout layer output(TransformerBlock_3/MLP/dropout) shape: [1, 15, 384]
	Dense layer output(TransformerBlock_3/MLP/dense_1) shape: [1, 15, 384]
	Dense layer output(TransformerBlock_3/MLP/dense_0) shape: [1, 15, 1536]
	TransformerBlock_3/multihead_attention:
	Dropout layer output(TransformerBlock_3/multihead_attention/dropout) shape: [1, 15, 384]
	Dense layer output(TransformerBlock_3/multihead_attention/dense_1) shape: [1, 15, 384]
	TransformerBlock_3/multihead_attention/attention:
	Dropout layer output(TransformerBlock_3/multihead_attention/attention/dropout) shape: [1, 6, 15, 15]
	TransformerBlock_3/multihead_attention/qkv_block:
	Dense layer output(TransformerBlock_3/multihead_attention/qkv_block/dense_v) shape: [1, 15, 384]
	Dense layer output(TransformerBlock_3/multihead_attention/qkv_block/dense_k) shape: [1, 15, 384]
	Dense layer output(TransformerBlock_3/multihead_attention/qkv_block/dense_q) shape: [1, 15, 384]
TransformerBlock_2:
	TransformerBlock_2/MLP:
	Dropout layer output(TransformerBlock_2/MLP/dropout) shape: [1, 15, 384]
	Dense layer output(TransformerBlock_2/MLP/dense_1) shape: [1, 15, 384]
	Dense layer output(TransformerBlock_2/MLP/dense_0) shape: [1, 15, 1536]
	TransformerBlock_2/multihead_attention:
	Dropout layer output(TransformerBlock_2/multihead_attention/dropout) shape: [1, 15, 384]
	Dense layer output(TransformerBlock_2/multihead_attention/dense_1) shape: [1, 15, 384]
	TransformerBlock_2/multihead_attention/attention:
	Dropout layer output(TransformerBlock_2/multihead_attention/attention/dropout) shape: [1, 6, 15, 15]
	TransformerBlock_2/multihead_attention/qkv_block:
	Dense layer output(TransformerBlock_2/multihead_attention/qkv_block/dense_v) shape: [1, 15, 384]
	Dense layer output(TransformerBlock_2/multihead_attention/qkv_block/dense_k) shape: [1, 15, 384]
	Dense layer output(TransformerBlock_2/multihead_attention/qkv_block/dense_q) shape: [1, 15, 384]
TransformerBlock_1:
	TransformerBlock_1/MLP:
	Dropout layer output(TransformerBlock_1/MLP/dropout) shape: [1, 15, 384]
	Dense layer output(TransformerBlock_1/MLP/dense_1) shape: [1, 15, 384]
	Dense layer output(TransformerBlock_1/MLP/dense_0) shape: [1, 15, 1536]
	TransformerBlock_1/multihead_attention:
	Dropout layer output(TransformerBlock_1/multihead_attention/dropout) shape: [1, 15, 384]
	Dense layer output(TransformerBlock_1/multihead_attention/dense_1) shape: [1, 15, 384]
	TransformerBlock_1/multihead_attention/attention:
	Dropout layer output(TransformerBlock_1/multihead_attention/attention/dropout) shape: [1, 6, 15, 15]
	TransformerBlock_1/multihead_attention/qkv_block:
	Dense layer output(TransformerBlock_1/multihead_attention/qkv_block/dense_v) shape: [1, 15, 384]
	Dense layer output(TransformerBlock_1/multihead_attention/qkv_block/dense_k) shape: [1, 15, 384]
	Dense layer output(TransformerBlock_1/multihead_attention/qkv_block/dense_q) shape: [1, 15, 384]
TransformerBlock_0:
	TransformerBlock_0/MLP:
	Dropout layer output(TransformerBlock_0/MLP/dropout) shape: [1, 15, 384]
	Dense layer output(TransformerBlock_0/MLP/dense_1) shape: [1, 15, 384]
	Dense layer output(TransformerBlock_0/MLP/dense_0) shape: [1, 15, 1536]
	TransformerBlock_0/multihead_attention:
	Dropout layer output(TransformerBlock_0/multihead_attention/dropout) shape: [1, 15, 384]
	Dense layer output(TransformerBlock_0/multihead_attention/dense_1) shape: [1, 15, 384]
	TransformerBlock_0/multihead_attention/attention:
	Dropout layer output(TransformerBlock_0/multihead_attention/attention/dropout) shape: [1, 6, 15, 15]
	TransformerBlock_0/multihead_attention/qkv_block:
	Dense layer output(TransformerBlock_0/multihead_attention/qkv_block/dense_v) shape: [1, 15, 384]
	Dense layer output(TransformerBlock_0/multihead_attention/qkv_block/dense_k) shape: [1, 15, 384]
	Dense layer output(TransformerBlock_0/multihead_attention/qkv_block/dense_q) shape: [1, 15, 384]
PositionalEncodingBlock:
	Dropout layer output(PositionalEncodingBlock/dropout) shape: [1, 15, 384]
	Positional encoding layer output(PositionalEncodingBlock/positional_enc_layer) shape: [1, 15, 384]
Embedding layer output(EmbeddingLayer) shape: [1, 15, 384]
---------------------------------------------------------------------------
```

### 9b. Train `GPTMini6` on the works of Shakespeare

Use default hyperparameters except for the following:
- For the validation set, use the 1st 200 sequences. For the training set, use all sequences beyond the 1st 200.
- Batch size of `64`.
- Patience of `15`.
- Learning rate decay patience of `9`.
- Learning rate should be allowed to decay no more than `3` times.
- Limit training to `100` epochs maximum.

Make a well-labeled plot showing the **training and validation loss** over the course of training.
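
The patience settings above can be read as two counters over the validation loss. The sketch below replays a list of per-epoch validation losses through one common version of that logic. It is an assumption about how such a schedule typically works, not the course's `fit` implementation; the function name `run_schedule`, the starting learning rate, and `lr_decay_factor` are all hypothetical.

```python
def run_schedule(val_losses, lr=1e-3, patience=15, lr_patience=9,
                 lr_max_decays=3, lr_decay_factor=0.5):
    """Replay validation losses through early stopping and lr decay rules.

    Returns (epochs_run, final_lr, num_decays).
    """
    best = float('inf')
    epochs_since_best = 0      # drives early stopping
    epochs_since_lr_event = 0  # drives learning rate decay
    num_decays = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            epochs_since_best = 0
            epochs_since_lr_event = 0
        else:
            epochs_since_best += 1
            epochs_since_lr_event += 1
        # Decay the learning rate after lr_patience stale epochs, up to the cap.
        if epochs_since_lr_event >= lr_patience and num_decays < lr_max_decays:
            lr *= lr_decay_factor
            num_decays += 1
            epochs_since_lr_event = 0
        # Stop once the loss has not improved for `patience` epochs.
        if epochs_since_best >= patience:
            return epoch, lr, num_decays
    return len(val_losses), lr, num_decays


# Made-up losses: two improving epochs, then a plateau.
fake_val_losses = [2.0, 1.5] + [1.6] * 20
epochs_run, final_lr, num_decays = run_schedule(fake_val_losses)
print(epochs_run, final_lr, num_decays)
```

With these made-up losses, the learning rate is decayed once during the plateau and training stops before the list is exhausted, because the early stopping counter reaches `patience` first.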

In [None]:
# Load the document and create the char2ind map
corpus, vocab = load_document()
char2ind_map = make_char2ind_map(vocab)

# Define sequence and labels
seq_len = 250
seqs, labels = make_seqs_and_labels(corpus, char2ind_map, seq_len=seq_len)

# Add padding character to the char2ind map
padding_char = '#'
next_available_index = len(char2ind_map)
char2ind_map[padding_char] = next_available_index

# Split data
seqs_train, seqs_val = seqs[200:], seqs[:200]
labels_train, labels_val = labels[200:], labels[:200]

# Set vocab size and padding character encoding
vocab_sz = len(char2ind_map)
padding_char_enc = char2ind_map['#']

# Train
minigpt = GPTMini6(vocab_sz=vocab_sz, seq_len=seq_len, padding_char_enc=padding_char_enc)
minigpt.compile(loss='temporal_cross_entropy')
train_loss_hist, val_loss_hist, _, _ = minigpt.fit(
    seqs_train, labels_train, seqs_val, labels_val,
    batch_size=64, patience=15, max_epochs=100,
    lr_max_decays=3, lr_patience=9)

---------------------------------------------------------------------------
Dense layer output(output) shape: [1, 250, 66]
TransformerBlock_5:
	TransformerBlock_5/MLP:
	Dropout layer output(TransformerBlock_5/MLP/dropout) shape: [1, 250, 384]
	Dense layer output(TransformerBlock_5/MLP/dense_1) shape: [1, 250, 384]
	Dense layer output(TransformerBlock_5/MLP/dense_0) shape: [1, 250, 1536]
	TransformerBlock_5/multihead_attention:
	Dropout layer output(TransformerBlock_5/multihead_attention/dropout) shape: [1, 250, 384]
	Dense layer output(TransformerBlock_5/multihead_attention/dense_1) shape: [1, 250, 384]
	TransformerBlock_5/multihead_attention/attention:
	Dropout layer output(TransformerBlock_5/multihead_attention/attention/dropout) shape: [1, 6, 250, 250]
	TransformerBlock_5/multihead_attention/qkv_block:
	Dense layer output(TransformerBlock_5/multihead_attention/qkv_block/dense_v) shape: [1, 250, 384]
	Dense layer output(TransformerBlock_5/multihead_attention/qkv_block/dense_k) shape:

Exception ignored in: <function AtomicFunction.__del__ at 0x154781d80>
Traceback (most recent call last):
  File "/Users/treytuscai/miniforge3/envs/cs444/lib/python3.10/site-packages/tensorflow/python/eager/polymorphic_function/atomic_function.py", line 302, in __del__
    self._bound_context.remove_function(self.name)
  File "/Users/treytuscai/miniforge3/envs/cs444/lib/python3.10/site-packages/tensorflow/python/eager/context.py", line 1530, in remove_function
    pywrap_tfe.TFE_ContextRemoveFunction(self._handle, name)
KeyboardInterrupt: 


Tensor("Reshape_32:0", shape=(64, 250), dtype=float32)


### 9c. Prompt GPT to generate Shakespearian text  

Have your GPT generate a large amount of text (e.g. 5000 chars) that follows a prompt of your choice (a string containing a few words or a sentence).

**Guidelines**
1. Use your `make_ind2char_mapping` from the math datasets to make the reverse map.
2. Use the `'distributed'` method for generating text.

When you turn in your project, include an example of at least one long passage of generated text by your GPT below.
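
A minimal sketch of the two ingredients, on a toy vocabulary. It assumes that `'distributed'` means sampling the next token in proportion to the softmax probabilities and that `'max'` (used in Question 9) means always taking the most likely token; the course's generation code is not shown here. The names `softmax`, `pick_next_index`, and the toy maps are hypothetical, and the logits are made up.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def pick_next_index(logits, method='distributed', rng=random):
    """Choose the next token index from the model's output scores."""
    probs = softmax(logits)
    if method == 'max':
        # Greedy: always take the most likely token.
        return max(range(len(probs)), key=probs.__getitem__)
    if method == 'distributed':
        # Sample in proportion to the predicted probabilities.
        return rng.choices(range(len(probs)), weights=probs, k=1)[0]
    raise ValueError(f'unknown method: {method}')


# Reverse map from int codes back to chars (toy vocabulary).
toy_char2ind = {'a': 0, 'b': 1, 'c': 2}
toy_ind2char = {i: ch for ch, i in toy_char2ind.items()}

toy_logits = [0.1, 2.0, 0.5]  # made-up scores for the next char
greedy_char = toy_ind2char[pick_next_index(toy_logits, method='max')]
sampled_char = toy_ind2char[pick_next_index(toy_logits, method='distributed',
                                            rng=random.Random(0))]
print(greedy_char, sampled_char)
```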

In [None]:
from addition_dataset import make_ind2char_mapping

In [None]:


print('***final output***')
print(''.join(gen_text))

### 9d. Questions

**Question 9:** Rerun your generation using the `'max'` method. Which method generates better sounding/more interesting text? **Why?**

**Answer 9:**

## Extensions

### General guidelines

1. Never integrate extensions into your base project so that they change the expected behavior of core functions. If your extension changes the core design/behavior, no problem, duplicate your working base project and add features from there.
2. Check the rubric to keep in mind how extensions on this project will be graded.
3. While I may consult your code and "written log" of what you did, **I am grading your extensions based on what you present in your 3-5 min video.**
4. I suggest documenting your explorations in a "log" or "lab notebook" style (i.e. documenting your thought/progression/discovery/learning process). I'm not grading your writing, so you can keep it succinct. **Whatever is most useful to you to remember what you did.**
5. I suggest taking a hypothesis-driven approach. For example: "I was curious about X so I explored Y. I found Z, which was not what I expected because..., so then I tried A..."
6. Make plots to help showcase your results.
7. **More is not necessarily better.** Generally, a small number of "in-depth" extensions count for more than many "shallow" extensions.

### AI guidelines

You may use AI in almost any capacity for extensions. However, keep in mind:
1. There is no need to use AI at all!
2. You are welcome to use AI as a tool (e.g. automate something that is tedious, help you get unstuck, etc.). However, you should be coding, you should be thinking, you should be writing, you should be creating. If you are spending most (or even close to most) of your time typing into a chatbot and copy-pasting, you have probably gone too far with AI use.
3. I don't find large volumes of AI-generated code/text/plots to be particularly impressive and you risk losing my interest while grading. Remember: I'm grading your extensions based on your video presentation. **More is not necessarily better.**

### Video guidelines

1. Please try to keep your video to 5 minutes (*I have other projects to grade!*). If you turn in a longer video, I make no promise that I will watch more than 5 minutes.
2. Your screen should be shared as you show me what you did. A live video of your face should also appear somewhere on the screen (e.g. picture-in-picture overlay / split screen).
3. Your partner should join you for the video and take turns talking, but, if necessary, it is fine to have only one team member present while recording the video.
4. Do not simply read text from your notebook, and do not read from a prepared script. I am not grading how polished your video presentation is (see extension grading criteria on rubric).
5. I am looking for original and creative explorations sparked by your curiosity/interest/passion in a topic. This should be apparent in your video.
6. Be natural; don't feel the need to impress me with fancy language. If it is helpful, imagine that we are talking one-on-one about your extension. Tell me what you did :)

### Extension ideas

#### 1. Generate text based on other corpora

Train one of your GPTs on a different text dataset and use it to generate text that resembles that body of work.

#### 2. GPT-1

Train OpenAI's GPT-1 model. It has the same architecture as `GPTMini6` except it has:
- 12 stacked Transformer Blocks
- 12 attention heads
- Embedding dimension of 768
- Dropout rate of 0.2

#### 3. GPT-2

Train a model in the family of OpenAI's GPT-2 models. Each has the same architecture as `GPTMini6` but with different values for (number of transformer blocks, embedding dimension, attention heads):

**GPT-2 Medium:** (24, 1024, 16)<br/>
**GPT-2 Large:** (36, 1280, 20)<br/>
**GPT-2 XL:** (48, 1600, 25)

Feel free to adapt/pare down based on training time and GPU resources.
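
To get a feel for how much these configurations grow relative to `GPTMini6`, a rough weight count can help when deciding how far to pare down. The sketch below is an estimate under stated assumptions, not an exact count for your network: every Dense layer has a bias, the MLP expands to 4x the embedding dimension (as in the `GPTMini6` summary above, 384 to 1536), the positional encoding has no learned weights, and layer normalization parameters are ignored. The function name `approx_param_count` and the vocabulary size of 66 are illustrative choices.

```python
def approx_param_count(num_blocks, embed_dim, vocab_sz):
    """Rough weight count; ignores layer norm and assumes non-learned positions."""
    d = embed_dim
    # q, k, v Dense layers plus the attention output Dense layer.
    attention = 3 * (d * d + d) + (d * d + d)
    # MLP: expand to 4*d, then project back to d.
    mlp = (d * 4 * d + 4 * d) + (4 * d * d + d)
    embedding = vocab_sz * d
    output = d * vocab_sz + vocab_sz
    return num_blocks * (attention + mlp) + embedding + output


# (number of transformer blocks, embedding dimension); the head count does not
# change the estimate because the heads split the same q/k/v projections.
configs = {
    'GPTMini6': (6, 384),
    'GPT-1': (12, 768),
    'GPT-2 Medium': (24, 1024),
    'GPT-2 Large': (36, 1280),
    'GPT-2 XL': (48, 1600),
}
for name, (blocks, dim) in configs.items():
    count = approx_param_count(blocks, dim, vocab_sz=66)
    print(f'{name}: ~{count / 1e6:.1f}M weights')
```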

#### 4. More complex arithmetic

Explore any of the following:
- Train your transformers to perform addition and/or multiplication with larger numbers.
- Add support for negative integer operands.
- Allow for longer chains of operands (e.g. `1+1+1+1=4`)
- Add support for subtraction and/or other arithmetic operations.

#### 5. Explore hyperparameters

Explore how any of the following affects the quality of the generated text and/or loss:
- Sequence length
- Embedding dimension