# Basic Instructions

1. Enter your Name, UID and Link to Google Drive in the provided space.
2. Submit the assignment to Gradescope.


Final Submission Deadline: May 2, 5:00pm

Late Submission Deadline: May 4, 5:00pm

Name:  **Sumedh Koppula**  
UID:  **117386066**

Link to Google Drive : **https://drive.google.com/file/d/1ta6aqyCjJe2LsPpuibmVFRHWo5prbSlA/view?usp=sharing**

In this assignment, you will learn how to use transformers to generate text. Specifically, you will implement a very small GPT model that predicts streams of characters, attempting to form nice-sounding sentences.

You will complete 5 exercises, described in detail later on in this notebook.

In [None]:
import torch
print("Pytorch version：")
print(torch.__version__)
print("CUDA Version: ")
print(torch.version.cuda)
print("cuDNN version is :")
print(torch.backends.cudnn.version())

Pytorch version：
2.0.0+cu117
CUDA Version: 
11.7
cuDNN version is :
8500


In [None]:
import inspect
import math
import os
import pickle
import time
from contextlib import nullcontext
from dataclasses import dataclass

import numpy as np
import requests
import torch
import torch.nn as nn
from torch.distributed import init_process_group, destroy_process_group
from torch.nn import functional as F

## DATA PREPARATION 

In [None]:
# download the tiny shakespeare dataset
data_root = 'data/shakespeare'
os.makedirs(data_root, exist_ok=True)  # also creates the parent 'data' directory
input_file_path = os.path.join(data_root, 'input.txt')
if not os.path.exists(input_file_path):
    data_url = 'https://raw.githubusercontent.com/learn2phoenix/CMSC472_HW6/main/input.txt'
    response = requests.get(data_url)
    response.raise_for_status()  # fail loudly instead of saving an error page as the dataset
    with open(input_file_path, 'w') as f:
        f.write(response.text)

with open(input_file_path, 'r') as f:
    data = f.read()
print(f"length of dataset in characters: {len(data):,}")

# get all the unique characters that occur in this text
chars = sorted(list(set(data)))
vocab_size = len(chars)
print("all the unique characters:", ''.join(chars))
print(f"vocab size: {vocab_size:,}")

# create a mapping from characters to integers
stoi = { ch:i for i,ch in enumerate(chars) }
itos = { i:ch for i,ch in enumerate(chars) }
def encode(s):
    return [stoi[c] for c in s] # encoder: take a string, output a list of integers
def decode(l):
    return ''.join([itos[i] for i in l]) # decoder: take a list of integers, output a string

# create the train and test splits
n = len(data)
train_data = data[:int(n*0.9)]
val_data = data[int(n*0.9):]

# encode both to integers
train_ids = encode(train_data)
val_ids = encode(val_data)
print(f"train has {len(train_ids):,} tokens")
print(f"val has {len(val_ids):,} tokens")

# export to bin files
train_ids = np.array(train_ids, dtype=np.uint16)
val_ids = np.array(val_ids, dtype=np.uint16)
train_ids.tofile(os.path.join(data_root, 'train.bin'))
val_ids.tofile(os.path.join(data_root, 'val.bin'))

# save the meta information as well, to help us encode/decode later
meta = {
    'vocab_size': vocab_size,
    'itos': itos,
    'stoi': stoi,
}
with open(f'{data_root}/meta.pkl', 'wb') as f:
    pickle.dump(meta, f)

length of dataset in characters: 1,115,395
all the unique characters: 
 !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
vocab size: 65
train has 1,003,855 tokens
val has 111,540 tokens
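
As a small illustration of the character-level tokenization used above, the sketch below builds the same kind of `stoi`/`itos` tables on a toy string and checks that decoding an encoded string returns the original. The toy string and the variable names `toy_text`, `ids` and `unseen_ok` are assumptions made for this example only. With just 65 distinct characters in the Shakespeare file, every id fits comfortably in the `np.uint16` arrays written to `train.bin` and `val.bin`.

```python
# Toy corpus (hypothetical, for illustration only).
toy_text = "First Citizen:\nSpeak."

# Vocabulary: every distinct character, in sorted order.
chars = sorted(set(toy_text))
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for i, ch in enumerate(chars)}

def encode(s):
    """Map a string to a list of integer ids."""
    return [stoi[c] for c in s]

def decode(ids):
    """Map a list of integer ids back to a string."""
    return "".join(itos[i] for i in ids)

ids = encode("Speak")
print(ids, "->", repr(decode(ids)))

# A character that never occurred in the corpus has no id.
try:
    encode("Q")
    unseen_ok = True
except KeyError:
    unseen_ok = False
print("can encode unseen character:", unseen_ok)
```

The `KeyError` case matters when prompting a trained model: the prompt may only contain characters that occurred in the training text.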


In [None]:
input_file_path

'data/shakespeare/input.txt'

## Complete the TODO sections in the cells below.

In [None]:
# @torch.jit.script # good to enable when not using torch.compile, disable when using (our default)
def new_gelu(x):
    return 0.5 * x * (1.0 + torch.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * torch.pow(x, 3.0))))

class LayerNorm(nn.Module):

    def __init__(self, ndim, bias):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(ndim))
        self.bias = nn.Parameter(torch.zeros(ndim)) if bias else None

    def forward(self, input):
        return F.layer_norm(input, self.weight.shape, self.weight, self.bias, 1e-5)

### Exercise 1
Complete the forward function for the `CausalSelfAttention` class. Most of the function is already implemented; you just have to compute the query, key and value tensors.

In [None]:
class CausalSelfAttention(nn.Module):

    def __init__(self, config):
        super().__init__()
        assert config.n_embd % config.n_head == 0
        self.c_attn = nn.Linear(config.n_embd, 3 * config.n_embd, bias=config.bias)
        self.c_proj = nn.Linear(config.n_embd, config.n_embd, bias=config.bias)
        self.attn_dropout = nn.Dropout(config.dropout)
        self.resid_dropout = nn.Dropout(config.dropout)
        self.n_head = config.n_head
        self.n_embd = config.n_embd
        self.dropout = config.dropout
        self.flash = hasattr(torch.nn.functional, 'scaled_dot_product_attention')
        if not self.flash:
            print("WARNING: using slow attention. Flash Attention requires PyTorch >= 2.0")
            self.register_buffer("bias", torch.tril(torch.ones(config.block_size, config.block_size))
                                        .view(1, 1, config.block_size, config.block_size))

    def forward(self, x):
        B, T, C = x.size()

        # TODO: you should calculate key, query, values (k, q, v) from `x` for all heads in batch.
        # Don't forget to move head forward to be the batch dim
        # HINT: using self.c_attn and splits to have q, k, v
        # YOUR CODE BEGINS HERE
        qkv = self.c_attn(x)
        q, k, v = torch.chunk(qkv, 3, dim=-1)
        q = q.view(B, T, self.n_head, self.n_embd // self.n_head).transpose(1, 2)
        k = k.view(B, T, self.n_head, self.n_embd // self.n_head).transpose(1, 2)
        v = v.view(B, T, self.n_head, self.n_embd // self.n_head).transpose(1, 2)
        # YOUR CODE ENDS HERE
        if self.flash:
            y = torch.nn.functional.scaled_dot_product_attention(q, k, v, attn_mask=None, dropout_p=self.dropout if self.training else 0, is_causal=True)
        else:
            att = (q @ k.transpose(-2, -1)) * (1.0 / math.sqrt(k.size(-1)))
            att = att.masked_fill(self.bias[:,:,:T,:T] == 0, float('-inf'))
            att = F.softmax(att, dim=-1)
            att = self.attn_dropout(att)
            y = att @ v
        y = y.transpose(1, 2).contiguous().view(B, T, C)

        y = self.resid_dropout(self.c_proj(y))
        return y
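
The non-flash branch of `forward` above masks future positions with `-inf` before the softmax. A minimal pure-Python sketch of that single step, on a hypothetical 3x3 score matrix, may make the effect easier to see. It is an illustration under those assumptions, not the batched tensor code used in the class, and `causal_attention_weights` is a name introduced here.

```python
import math

def causal_attention_weights(scores):
    """Row-wise softmax over a T x T score matrix, ignoring positions j > i."""
    T = len(scores)
    weights = []
    for i in range(T):
        # Future positions get -inf, which exp() maps to exactly 0.
        masked = [scores[i][j] if j <= i else float("-inf") for j in range(T)]
        row_max = max(masked)  # subtract the max for numerical stability
        exps = [math.exp(s - row_max) for s in masked]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Hypothetical raw scores (query i against key j).
scores = [
    [0.0, 1.0, 2.0],
    [1.0, 0.0, 3.0],
    [2.0, 2.0, 2.0],
]
weights = causal_attention_weights(scores)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row still sums to 1, but position `i` places zero weight on every later position, which is what lets the model be trained on all prefixes of a sequence at once.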

Some other utility blocks are defined as:

In [None]:
class MLP(nn.Module):

    def __init__(self, config):
        super().__init__()
        self.c_fc    = nn.Linear(config.n_embd, 4 * config.n_embd, bias=config.bias)
        self.c_proj  = nn.Linear(4 * config.n_embd, config.n_embd, bias=config.bias)
        self.dropout = nn.Dropout(config.dropout)

    def forward(self, x):
        x = self.c_fc(x)
        x = new_gelu(x)
        x = self.c_proj(x)
        x = self.dropout(x)
        return x

class Block(nn.Module):

    def __init__(self, config):
        super().__init__()
        self.ln_1 = LayerNorm(config.n_embd, bias=config.bias)
        self.attn = CausalSelfAttention(config)
        self.ln_2 = LayerNorm(config.n_embd, bias=config.bias)
        self.mlp = MLP(config)

    def forward(self, x):
        x = x + self.attn(self.ln_1(x))
        x = x + self.mlp(self.ln_2(x))
        return x

@dataclass
class GPTConfig:
    block_size: int = 1024
    vocab_size: int = 50304
    n_layer: int = 12
    n_head: int = 12
    n_embd: int = 768
    dropout: float = 0.0
    bias: bool = True

### Exercise 2
Complete the forward function for the `GPT` class. Most of the function is already implemented; you need to write the forward pass through `self.transformer`.

**HINT:** Look up the token and position embeddings, pass the result through each block in a loop, and then through the last layer of `self.transformer`.

In [None]:
class GPT(nn.Module):

    def __init__(self, config):
        super().__init__()
        assert config.vocab_size is not None
        assert config.block_size is not None
        self.config = config

        self.transformer = nn.ModuleDict(dict(
            wte = nn.Embedding(config.vocab_size, config.n_embd),
            wpe = nn.Embedding(config.block_size, config.n_embd),
            drop = nn.Dropout(config.dropout),
            h = nn.ModuleList([Block(config) for _ in range(config.n_layer)]),
            ln_f = LayerNorm(config.n_embd, bias=config.bias),
        ))
        self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)
        self.transformer.wte.weight = self.lm_head.weight

        self.apply(self._init_weights)
        for pn, p in self.named_parameters():
            if pn.endswith('c_proj.weight'):
                torch.nn.init.normal_(p, mean=0.0, std=0.02/math.sqrt(2 * config.n_layer))

        # report number of parameters
        print("number of parameters: %.2fM" % (self.get_num_params()/1e6,))

    def get_num_params(self, non_embedding=True):
        """
        Return the number of parameters in the model.
        With non_embedding=True (the default), the position embeddings are subtracted.
        The token embeddings would receive the same treatment, but because of
        parameter sharing they also serve as the weights of the final layer.
        """
        n_params = sum(p.numel() for p in self.parameters())
        if non_embedding:
            n_params -= self.transformer.wpe.weight.numel()
        return n_params

    def _init_weights(self, module):
        if isinstance(module, nn.Linear):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
            if module.bias is not None:
                torch.nn.init.zeros_(module.bias)
        elif isinstance(module, nn.Embedding):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)

    def forward(self, idx, targets=None):
        device = idx.device
        b, t = idx.size()
        assert t <= self.config.block_size, f"Cannot forward sequence of length {t}, block size is only {self.config.block_size}"
        pos = torch.arange(0, t, dtype=torch.long, device=device).unsqueeze(0) # shape (1, t)

        # TODO: write the forward for the GPT model and assign output to x. HINT: Refer to definition for self.transformer
        # YOUR CODE BEGINS HERE
        x = self.transformer["wte"](idx) + self.transformer["wpe"](pos)
        x = self.transformer["drop"](x)

        for block in self.transformer["h"]:
            x = block(x)

        x = self.transformer["ln_f"](x)
        # YOUR CODE ENDS HERE

        if targets is not None:
            logits = self.lm_head(x)
            loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1), ignore_index=-1)
        else:
            logits = self.lm_head(x[:, [-1], :])
            loss = None

        return logits, loss

    def crop_block_size(self, block_size):
        assert block_size <= self.config.block_size
        self.config.block_size = block_size
        self.transformer.wpe.weight = nn.Parameter(self.transformer.wpe.weight[:block_size])
        for block in self.transformer.h:
            if hasattr(block.attn, 'bias'):
                block.attn.bias = block.attn.bias[:,:,:block_size,:block_size]

    @classmethod
    def from_pretrained(cls, model_type, override_args=None):
        assert model_type in {'gpt2', 'gpt2-medium', 'gpt2-large', 'gpt2-xl'}
        override_args = override_args or {}
        assert all(k == 'dropout' for k in override_args)
        from transformers import GPT2LMHeadModel
        print("loading weights from pretrained gpt: %s" % model_type)


        config_args = {
            'gpt2':         dict(n_layer=12, n_head=12, n_embd=768),  # 124M params
            'gpt2-medium':  dict(n_layer=24, n_head=16, n_embd=1024), # 350M params
            'gpt2-large':   dict(n_layer=36, n_head=20, n_embd=1280), # 774M params
            'gpt2-xl':      dict(n_layer=48, n_head=25, n_embd=1600), # 1558M params
        }[model_type]
        print("forcing vocab_size=50257, block_size=1024, bias=True")
        config_args['vocab_size'] = 50257
        config_args['block_size'] = 1024
        config_args['bias'] = True
        if 'dropout' in override_args:
            print(f"overriding dropout rate to {override_args['dropout']}")
            config_args['dropout'] = override_args['dropout']
        config = GPTConfig(**config_args)
        model = GPT(config)
        sd = model.state_dict()
        sd_keys = sd.keys()
        sd_keys = [k for k in sd_keys if not k.endswith('.attn.bias')] 
        model_hf = GPT2LMHeadModel.from_pretrained(model_type)
        sd_hf = model_hf.state_dict()

        sd_keys_hf = sd_hf.keys()
        sd_keys_hf = [k for k in sd_keys_hf if not k.endswith('.attn.masked_bias')]
        sd_keys_hf = [k for k in sd_keys_hf if not k.endswith('.attn.bias')]
        transposed = ['attn.c_attn.weight', 'attn.c_proj.weight', 'mlp.c_fc.weight', 'mlp.c_proj.weight']
        assert len(sd_keys_hf) == len(sd_keys), f"mismatched keys: {len(sd_keys_hf)} != {len(sd_keys)}"
        for k in sd_keys_hf:
            if any(k.endswith(w) for w in transposed):
                assert sd_hf[k].shape[::-1] == sd[k].shape
                with torch.no_grad():
                    sd[k].copy_(sd_hf[k].t())
            else:
                assert sd_hf[k].shape == sd[k].shape
                with torch.no_grad():
                    sd[k].copy_(sd_hf[k])

        return model

    def configure_optimizers(self, weight_decay, learning_rate, betas, device_type):
        decay = set()
        no_decay = set()
        whitelist_weight_modules = (torch.nn.Linear, )
        blacklist_weight_modules = (torch.nn.LayerNorm, LayerNorm, torch.nn.Embedding)
        for mn, m in self.named_modules():
            for pn, p in m.named_parameters():
                fpn = '%s.%s' % (mn, pn) if mn else pn
                if pn.endswith('bias'):
                    no_decay.add(fpn)
                elif pn.endswith('weight') and isinstance(m, whitelist_weight_modules):
                    decay.add(fpn)
                elif pn.endswith('weight') and isinstance(m, blacklist_weight_modules):
                    no_decay.add(fpn)

        decay.remove('lm_head.weight')

        param_dict = {pn: p for pn, p in self.named_parameters()}
        inter_params = decay & no_decay
        union_params = decay | no_decay
        assert len(inter_params) == 0, "parameters %s made it into both decay/no_decay sets!" % (str(inter_params), )
        assert len(param_dict.keys() - union_params) == 0, "parameters %s were not separated into either decay/no_decay set!" \
                                                    % (str(param_dict.keys() - union_params), )

        optim_groups = [
            {"params": [param_dict[pn] for pn in sorted(list(decay))], "weight_decay": weight_decay},
            {"params": [param_dict[pn] for pn in sorted(list(no_decay))], "weight_decay": 0.0},
        ]
        use_fused = (device_type == 'cuda') and ('fused' in inspect.signature(torch.optim.AdamW).parameters)
        print(f"using fused AdamW: {use_fused}")
        extra_args = dict(fused=True) if use_fused else dict()
        optimizer = torch.optim.AdamW(optim_groups, lr=learning_rate, betas=betas, **extra_args)

        return optimizer

    def estimate_mfu(self, fwdbwd_per_iter, dt):
        N = self.get_num_params()
        cfg = self.config
        L, H, Q, T = cfg.n_layer, cfg.n_head, cfg.n_embd//cfg.n_head, cfg.block_size
        flops_per_token = 6*N + 12*L*H*Q*T
        flops_per_fwdbwd = flops_per_token * T
        flops_per_iter = flops_per_fwdbwd * fwdbwd_per_iter
        flops_achieved = flops_per_iter * (1.0/dt)
        flops_promised = 312e12 
        mfu = flops_achieved / flops_promised
        return mfu

    @torch.no_grad()
    def generate(self, idx, max_new_tokens, temperature=1.0, top_k=None):
        """
        Take a conditioning sequence of indices idx (LongTensor of shape (b,t)) and complete
        the sequence max_new_tokens times, feeding the predictions back into the model each time.
        Most likely you'll want to make sure to be in model.eval() mode of operation for this.
        """
        for _ in range(max_new_tokens):
            # if the sequence context is growing too long we must crop it at block_size
            idx_cond = idx if idx.size(1) <= self.config.block_size else idx[:, -self.config.block_size:]
            # forward the model to get the logits for the index in the sequence
            logits, _ = self(idx_cond)
            # pluck the logits at the final step and scale by desired temperature
            logits = logits[:, -1, :] / temperature
            # optionally crop the logits to only the top k options
            if top_k is not None:
                v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
                logits[logits < v[:, [-1]]] = -float('Inf')
            # apply softmax to convert logits to (normalized) probabilities
            probs = F.softmax(logits, dim=-1)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1)
            # append sampled index to the running sequence and continue
            idx = torch.cat((idx, idx_next), dim=1)

        return idx
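
To see what `temperature` and `top_k` do inside `generate`, the following pure-Python sketch applies the same two steps to a hypothetical list of four logits. The function name `next_token_probs` and the example values are assumptions made for illustration; the method above performs the same operations on batched tensors.

```python
import math

def next_token_probs(logits, temperature=1.0, top_k=None):
    """Scale logits by temperature, optionally keep the top_k, then softmax."""
    scaled = [l / temperature for l in logits]
    if top_k is not None:
        k = min(top_k, len(scaled))
        kth_largest = sorted(scaled, reverse=True)[k - 1]
        # Everything strictly below the k-th largest logit is ruled out.
        scaled = [s if s >= kth_largest else float("-inf") for s in scaled]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0, -1.0]  # hypothetical model outputs
base = next_token_probs(logits)
sharp = next_token_probs(logits, temperature=0.5)
topk = next_token_probs(logits, top_k=2)

print("temperature 1.0:", [round(p, 3) for p in base])
print("temperature 0.5:", [round(p, 3) for p in sharp])
print("top_k = 2:      ", [round(p, 3) for p in topk])
```

A temperature below 1 concentrates probability on the most likely token, while `top_k` removes the tail entirely so that very unlikely characters can never be sampled.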

## TRAINING

In [None]:
## TRAIN CONFIG
out_dir = 'out-shakespeare-char'
out_dir_email = 'out-enron-email'
eval_interval = 250
log_interval = 10
eval_iters = 200
eval_only = False
always_save_checkpoint = False
# data
dataset = 'shakespeare'
dataset_email = 'enron_data'
gradient_accumulation_steps = 1
batch_size = 64
block_size = 256
# model
n_layer = 6
n_head = 6
n_embd = 384
dropout = 0.2
bias =  False
# adamw optimizer
learning_rate = 1e-3
max_iters = 5000
weight_decay = 1e-1
beta1 = 0.9
beta2 = 0.99
grad_clip = 1.0
decay_lr = True
warmup_iters = 100
lr_decay_iters = 5000
min_lr = 1e-4
# system
device = 'cuda'
dtype = 'float16'
compile = False

seed_offset = 0
ddp_world_size = 1
tokens_per_iter = gradient_accumulation_steps * ddp_world_size * batch_size * block_size
print(f"tokens per iteration will be: {tokens_per_iter:,}")

tokens per iteration will be: 16,384
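
The model size implied by this configuration can be worked out by hand. The sketch below mirrors the layers defined in `GPT`, `Block`, `CausalSelfAttention` and `MLP`, and follows `get_num_params` in excluding the position embeddings while counting the shared token-embedding/`lm_head` matrix once. The helper `gpt_param_count` is a name introduced for this example, and `vocab_size=65` is taken from the data preparation output.

```python
def gpt_param_count(n_layer, n_embd, vocab_size, block_size,
                    bias=False, non_embedding=True):
    """Count parameters of the GPT defined above from its hyperparameters."""
    def linear(n_in, n_out):
        return n_in * n_out + (n_out if bias else 0)

    layer_norm = n_embd + (n_embd if bias else 0)
    attn = linear(n_embd, 3 * n_embd) + linear(n_embd, n_embd)    # c_attn, c_proj
    mlp = linear(n_embd, 4 * n_embd) + linear(4 * n_embd, n_embd)  # c_fc, c_proj
    block = 2 * layer_norm + attn + mlp

    wte = vocab_size * n_embd   # shared with lm_head, so counted once
    wpe = block_size * n_embd
    total = wte + wpe + n_layer * block + layer_norm  # final term is ln_f
    return total - wpe if non_embedding else total

n_params = gpt_param_count(n_layer=6, n_embd=384, vocab_size=65, block_size=256)
print(f"number of parameters: {n_params / 1e6:.2f}M ({n_params:,})")
```

Almost all of the parameters sit in the six transformer blocks; with a 65-character vocabulary the embedding table is negligible by comparison.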


### Exercise 3 
1. Complete the TODO sections in the cell below and train the model on the shakespeare data. Complete the `get_lr` function. You should implement your learning rate schedule here. Your learning rate schedule should involve linear warmup and cosine decay.

Train the model. You should get the loss below 2.0 in roughly 10 minutes, and you should not need to run for longer than 20 minutes (on a Colab GPU) to get nice results. If you're just testing things out, consider training for only a minute or so at a time and confirming that the loss decreases. You should only need to train from start to finish once, when you're ready to submit.

In [None]:
def get_lr(it):
    if it > lr_decay_iters:
        return min_lr
    # TODO: Implement the learning rate schedule and return lr for the iteration
    # 1: include linear warmup
    # 2: implement cosine decay for after warmup (use warmup_iters from your hyperparams)
    # YOUR CODE BEGINS HERE
    if it < warmup_iters:
        decay = it / warmup_iters
    else:
        decay = 0.5 * (1 + math.cos(math.pi * (it - warmup_iters) / (lr_decay_iters - warmup_iters)))
    assert 0 <= decay <= 1
    coefficient = decay
    # YOUR CODE ENDS HERE
    return min_lr + coefficient * (learning_rate - min_lr)
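
To check the shape of the schedule without training, the same formula can be evaluated at a few iterations. The sketch below is a standalone copy of the logic above with the hyperparameters from the train config passed as defaults; `lr_at` is a name introduced here so that it does not shadow `get_lr`. Note that this implementation warms up from `min_lr`, not from zero, which is one of several reasonable choices.

```python
import math

def lr_at(it, learning_rate=1e-3, min_lr=1e-4,
          warmup_iters=100, lr_decay_iters=5000):
    """Linear warmup from min_lr, then cosine decay back down to min_lr."""
    if it > lr_decay_iters:
        return min_lr
    if it < warmup_iters:
        coefficient = it / warmup_iters
    else:
        progress = (it - warmup_iters) / (lr_decay_iters - warmup_iters)
        coefficient = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + coefficient * (learning_rate - min_lr)

for it in (0, 50, 100, 2550, 5000, 6000):
    print(f"iter {it:>4}: lr = {lr_at(it):.6f}")
```

The rate peaks at `learning_rate` exactly when warmup ends, reaches the midpoint between the two bounds halfway through the decay, and stays at `min_lr` afterwards.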


In [None]:
os.makedirs(out_dir, exist_ok=True)
torch.manual_seed(1337 + seed_offset)
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True
device_type = 'cuda' if 'cuda' in device else 'cpu'
ptdtype = {'float32': torch.float32, 'float16': torch.float16}[dtype]
ctx = nullcontext() if device_type == 'cpu' else torch.amp.autocast(device_type=device_type, dtype=ptdtype)

data_dir = os.path.join('data', dataset)
train_data = np.memmap(os.path.join(data_dir, 'train.bin'), dtype=np.uint16, mode='r')
val_data = np.memmap(os.path.join(data_dir, 'val.bin'), dtype=np.uint16, mode='r')
def get_batch(split):
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([torch.from_numpy((data[i:i+block_size]).astype(np.int64)) for i in ix])
    y = torch.stack([torch.from_numpy((data[i+1:i+1+block_size]).astype(np.int64)) for i in ix])
    if device_type == 'cuda':
        x, y = x.pin_memory().to(device, non_blocking=True), y.pin_memory().to(device, non_blocking=True)
    else:
        x, y = x.to(device), y.to(device)
    return x, y

iter_num = 0
best_val_loss = 1e9

meta_path = os.path.join(data_dir, 'meta.pkl')
meta_vocab_size = None
if os.path.exists(meta_path):
    with open(meta_path, 'rb') as f:
        meta = pickle.load(f)
    meta_vocab_size = meta['vocab_size']
    print(f"found vocab_size = {meta_vocab_size} (inside {meta_path})")

# model init
model_args = dict(n_layer=n_layer, n_head=n_head, n_embd=n_embd, block_size=block_size,
                  bias=bias, vocab_size=None, dropout=dropout)
model_args['vocab_size'] = meta_vocab_size
gptconf = GPTConfig(**model_args)
model = GPT(gptconf)
if block_size < model.config.block_size:
    model.crop_block_size(block_size)
    model_args['block_size'] = block_size
model.to(device)

scaler = torch.cuda.amp.GradScaler(enabled=(dtype == 'float16'))

# optimizer
optimizer = model.configure_optimizers(weight_decay, learning_rate, (beta1, beta2), device_type)
checkpoint = None

# compile the model
if compile:
    print("compiling the model... (takes a ~minute)")
    unoptimized_model = model
    model = torch.compile(model) # requires PyTorch 2.0; uses the default backend

@torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            with ctx:
                logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out

# training loop
X, Y = get_batch('train') # fetch the very first batch
t0 = time.time()
local_iter_num = 0 # number of iterations in the lifetime of this process
raw_model = model
running_mfu = -1.0
for iter_num in range(max_iters):
    lr = get_lr(iter_num) if decay_lr else learning_rate
    for param_group in optimizer.param_groups:
        param_group['lr'] = lr

    if iter_num % eval_interval == 0:
        losses = estimate_loss()
        print(f"step {iter_num}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")
        if losses['val'] < best_val_loss or always_save_checkpoint:
            best_val_loss = losses['val']
            if iter_num > 0:
                checkpoint = {
                    'model': raw_model.state_dict(),
                    'optimizer': optimizer.state_dict(),
                    'model_args': model_args,
                    'iter_num': iter_num,
                    'best_val_loss': best_val_loss
                }
                print(f"saving checkpoint to {out_dir}")
                torch.save(checkpoint, os.path.join(out_dir, 'ckpt.pt'))
    if iter_num == 0 and eval_only:
        break

    for micro_step in range(gradient_accumulation_steps):
        with ctx:
            logits, loss = model(X, Y)
            loss = loss / gradient_accumulation_steps
        X, Y = get_batch('train')
        scaler.scale(loss).backward()
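    if grad_clip != 0.0:
        # unscale first so clipping acts on the true gradients, not the loss-scaled ones
        scaler.unscale_(optimizer)
        torch.nn.utils.clip_grad_norm_(model.parameters(), grad_clip)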
    scaler.step(optimizer)
    scaler.update()
    optimizer.zero_grad(set_to_none=True)

    # timing and logging
    t1 = time.time()
    dt = t1 - t0
    t0 = t1
    if iter_num % log_interval == 0:
        lossf = loss.item() * gradient_accumulation_steps
        if local_iter_num >= 5: # let the training loop settle a bit
            mfu = raw_model.estimate_mfu(batch_size * gradient_accumulation_steps, dt)
            running_mfu = mfu if running_mfu == -1.0 else 0.9*running_mfu + 0.1*mfu
        print(f"iter {iter_num}: loss {lossf:.4f}, time {dt*1000:.2f}ms, mfu {running_mfu*100:.2f}%")
    local_iter_num += 1

found vocab_size = 65 (inside data/shakespeare/meta.pkl)
number of parameters: 10.65M
using fused AdamW: True
step 0: train loss 4.2875, val loss 4.2826
iter 0: loss 4.2559, time 26404.08ms, mfu -100.00%
iter 10: loss 3.1613, time 181.93ms, mfu 2.05%
iter 20: loss 2.7856, time 181.93ms, mfu 2.05%
iter 30: loss 2.6532, time 183.13ms, mfu 2.05%
iter 40: loss 2.5717, time 181.93ms, mfu 2.05%
iter 50: loss 2.5366, time 182.99ms, mfu 2.05%
iter 60: loss 2.5172, time 181.94ms, mfu 2.05%
iter 70: loss 2.5206, time 182.29ms, mfu 2.05%
iter 80: loss 2.4905, time 182.11ms, mfu 2.05%
iter 90: loss 2.4793, time 181.78ms, mfu 2.05%
iter 100: loss 2.4800, time 181.94ms, mfu 2.05%
iter 110: loss 2.4873, time 182.17ms, mfu 2.05%
iter 120: loss 2.4659, time 183.49ms, mfu 2.04%
iter 130: loss 2.4458, time 182.22ms, mfu 2.04%
iter 140: loss 2.4543, time 183.19ms, mfu 2.04%
iter 150: loss 2.4504, time 183.49ms, mfu 2.04%
iter 160: loss 2.4567, time 182.00ms, mfu 2.04%
iter 170: loss 2.4535, time 182.29ms,
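
The `mfu` column in the log can be reproduced from `estimate_mfu` with plain arithmetic. The sketch below plugs in the configuration used here (10,646,784 parameters, 6 layers, 6 heads, `n_embd` 384, block size 256, batch size 64) and the 181.93 ms step time from the log. The reference value `flops_promised = 312e12` is the constant hard-coded in the method, which as far as I know corresponds to the peak bfloat16 throughput of an A100, so on a different GPU the percentage is only a relative indicator.

```python
def estimate_mfu(n_params, n_layer, n_head, n_embd, block_size,
                 fwdbwd_per_iter, dt, flops_promised=312e12):
    """Model FLOPs utilization, following the formula in GPT.estimate_mfu."""
    head_dim = n_embd // n_head
    flops_per_token = 6 * n_params + 12 * n_layer * n_head * head_dim * block_size
    flops_per_iter = flops_per_token * block_size * fwdbwd_per_iter
    flops_achieved = flops_per_iter / dt  # FLOPs per second
    return flops_achieved / flops_promised

mfu = estimate_mfu(
    n_params=10_646_784, n_layer=6, n_head=6, n_embd=384, block_size=256,
    fwdbwd_per_iter=64,   # batch_size * gradient_accumulation_steps
    dt=0.18193,           # seconds per iteration, taken from the log above
)
print(f"mfu {mfu * 100:.2f}%")
```

The `-100.00%` shown at iteration 0 is not a measurement: `running_mfu` starts at `-1.0` and is only updated once `local_iter_num` reaches 5.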

### Exercise 4
Run inference on the model. Complete the TODO portions.

1. You need to call `model.generate` in the given for loop.
2. Show 10 samples. These might not be perfectly sensible English, but they should be very Shakespeare-like. Make sure they can be read in your submitted PDF.

In [None]:
# -----------------------------------------------------------------------------
start = "\n" # or "<|endoftext|>" or etc. Can also specify a file, use as: "FILE:prompt.txt"
num_samples = 10 # number of samples to draw
max_new_tokens = 500 # number of tokens generated in each sample
temperature = 0.8 # 1.0 = no change, < 1.0 = less random, > 1.0 = more random, in predictions
top_k = 200 # retain only the top_k most likely tokens, clamp others to have 0 probability
# -----------------------------------------------------------------------------

model.eval()

# get the absolute path of the current working directory
current_dir = os.getcwd()

# construct the relative path to the meta.pkl file
meta_path = os.path.join(current_dir, 'data', 'shakespeare', 'meta.pkl')

print(f"Loading meta from meta.pkl...")
with open(meta_path, 'rb') as f:
    meta = pickle.load(f)
stoi, itos = meta['stoi'], meta['itos']
encode = lambda s: [stoi[c] for c in s]
decode = lambda l: ''.join([itos[i] for i in l])

# encode the beginning of the prompt
if start.startswith('FILE:'):
    with open(start[5:], 'r', encoding='utf-8') as f:
        start = f.read()
start_ids = encode(start)
x = (torch.tensor(start_ids, dtype=torch.long, device=device)[None, ...])

Loading meta from meta.pkl...


In [None]:
# run generation
with torch.no_grad():
    with ctx:
        for k in range(num_samples):
            #TODO: Generate the sample
            generated_text = model.generate(x, max_new_tokens, temperature=temperature, top_k=top_k)
            # decode the generated text
            decoded_text = decode(generated_text[0].tolist())
            print(f"Sample {k+1}:\n{decoded_text}\n{'-'*80}")

Sample 1:

The state against the last death,
And but the newless of night as he would
Were adviced, and supplied the environ,
And not yield so fair as having stain'd.
Would you know he do the king; and I shall rest,
That given cried to a blood creature little hands
By being the breath of the people guests.
But, be pale-enchilated and sorrow;
And therefore, to your suitors, that I will not say.

CLARENCE:
No, by the Volsces are they not stand their heads,
To strong their and royal place of grace,
And strike
--------------------------------------------------------------------------------
Sample 2:

Pardon for your own prison; you'll bring you, fair as
your great can be prepared of your true.

DUKE VINCENTIO:
What, for you mean? what will you are?

LUCIO:
Why, you can please you have a consul? Can you
would plead in this mother king?
The fellow I am gone?

ISABELLA:
Why, I cannot speak in the Tower. What are yours?

DUKE VINCENTIO:
You will hear you our two?

ANGELO:
Sir, perform your com

### Exercise 5: Train the model on a new dataset and show results.
This exercise is mostly about making sure you can find and preprocess text, as well as checking that you understand the above code well enough to reuse it.
1. Find some text data. Use our Shakespeare file as reference. You will want a similar amount of text data. Don't go overboard: a big text file will just make training take too long.
2. Perform any preprocessing necessary to get the text ready for the model. Use the preprocessing code we provide as reference.
3. Train the model on your text.
4. Generate and print 10 samples from the model trained on your text.

You may want to implement the functions below, using the code in the previous cells. Or not! It's up to you. You just need to write code that can train a model to generate text from some non-Shakespeare data. The generated text is the main deliverable that most of the grade will be based on. Make sure it displays prominently in your submitted PDF.
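
How much preprocessing step 2 needs depends on the source format. For email-like text, one possible approach is to use Python's standard `email` package to keep only the plain-text body and drop the headers. The sketch below shows this on two made-up messages; `extract_body` is a hypothetical helper name, and returning an empty string for messages without a plain-text part is a design choice of this example.

```python
import email
from email import policy

def extract_body(raw_message):
    """Return the text/plain body of a raw email message, or '' if there is none."""
    msg = email.message_from_string(raw_message, policy=policy.default)
    # preferencelist must be a tuple, hence the trailing comma.
    body = msg.get_body(preferencelist=("plain",))
    return body.get_content() if body is not None else ""

# Made-up messages for illustration.
raw_plain = (
    "From: alice@example.com\n"
    "To: bob@example.com\n"
    "Subject: Meeting\n"
    "\n"
    "See you at noon.\n"
)
raw_html = (
    "From: alice@example.com\n"
    "Content-Type: text/html\n"
    "\n"
    "<p>See you at noon.</p>\n"
)

plain_text = extract_body(raw_plain)
html_text = extract_body(raw_html)
print(repr(plain_text))
print(repr(html_text))
```

Stripping headers this way keeps addresses and routing metadata out of the vocabulary, so the model spends its capacity on the message text itself.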

In [None]:
import os
import requests
import tarfile
import email
from email import policy
import numpy as np
import pickle

def download_data():
    """
    - Downloads the Enron Email dataset and extracts it
    - Preprocesses the emails into a single text
    - Saves the result to input.txt, train.bin, val.bin and meta.pkl under data/enron_data
    - Prints statistics of the dataset
    """

    # Download and extract the dataset
    url = "https://www.cs.cmu.edu/~./enron/enron_mail_20150507.tar.gz"
    filename = "enron_mail_20150507.tgz"
    folder = "maildir" # Manually considering only 51 users data for simplicity

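    if not os.path.exists(filename):
        # stream to disk so the large archive is never held in memory at once
        with requests.get(url, stream=True) as response:
            response.raise_for_status()
            with open(filename, 'wb') as f:
                for chunk in response.iter_content(chunk_size=1 << 20):
                    f.write(chunk)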

    if not os.path.exists(folder):
        with tarfile.open(filename, "r:gz") as tar:
            tar.extractall()

    # Preprocess the dataset
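    def extract_text_from_email(email_file):
        with open(email_file, 'r', encoding='utf-8', errors='ignore') as f:
            msg = email.message_from_file(f, policy=policy.default)
        # preferencelist must be a tuple; ('plain') is just the string 'plain'
        body = msg.get_body(preferencelist=('plain',))
        return body.get_content() if body is not None else ''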

    def preprocess_enron_emails(data_folder):
        texts = []
        for root, _, files in os.walk(data_folder):
            for file in files:
                email_path = os.path.join(root, file)
                try:
                    email_text = extract_text_from_email(email_path)
                    texts.append(email_text)
                except Exception as e:
                    print(f"Error processing {email_path}: {e}")
        return "\n".join(texts)
    
    def encode(s):
        return [stoi[c] for c in s] # encoder: take a string, output a list of integers
    
    def decode(l):
        return ''.join([itos[i] for i in l]) # decoder: take a list of integers, output a string

    email_data = preprocess_enron_emails(folder)

    data_root = 'data/enron_data'
    os.makedirs(data_root, exist_ok=True)

    # save the data to a file as input.txt
    with open(os.path.join(data_root, 'input.txt'), 'w', encoding='utf-8') as f:
        f.write(email_data)

    # Prepare the dataset
    data = email_data
    print(f"length of dataset in characters: {len(data):,}")

    # get all the unique characters that occur in this text
    chars = sorted(list(set(data)))
    vocab_size = len(chars)
    print("all the unique characters:", ''.join(chars))
    print(f"vocab size: {vocab_size:,}")

    # create a mapping from characters to integers
    stoi = { ch:i for i,ch in enumerate(chars) }
    itos = { i:ch for i,ch in enumerate(chars) }

    # create the train and test splits
    n = len(data)
    train_data = data[:int(n*0.9)]
    val_data = data[int(n*0.9):]

    # encode both to integers
    train_ids = encode(train_data)
    val_ids = encode(val_data)
    print(f"train has {len(train_ids):,} tokens")
    print(f"val has {len(val_ids):,} tokens")

    # export to bin files
    train_ids = np.array(train_ids, dtype=np.uint16)
    val_ids = np.array(val_ids, dtype=np.uint16)
    train_ids.tofile(os.path.join(data_root, 'train.bin'))
    val_ids.tofile(os.path.join(data_root, 'val.bin'))

    # save the meta information as well, to help us encode/decode later
    meta = {
        'vocab_size': vocab_size,
        'itos': itos,
        'stoi': stoi,
    }
    with open(f'{data_root}/meta.pkl', 'wb') as f:
        pickle.dump(meta, f)
        

def train_model():
    """Train the model that is defined in Exercise 1 on your train data"""

    # Email generation model
    os.makedirs(out_dir_email, exist_ok=True)
    torch.manual_seed(1337 + seed_offset)
    torch.backends.cuda.matmul.allow_tf32 = True
    torch.backends.cudnn.allow_tf32 = True
    device_type = 'cuda' if 'cuda' in device else 'cpu'
    ptdtype = {'float32': torch.float32, 'float16': torch.float16}[dtype]
    ctx = nullcontext() if device_type == 'cpu' else torch.amp.autocast(device_type=device_type, dtype=ptdtype)

    # load the email dataset and create batches
    data_dir = os.path.join('data', dataset_email)
    train_data = np.memmap(os.path.join(data_dir, 'train.bin'), dtype=np.uint16, mode='r')
    val_data = np.memmap(os.path.join(data_dir, 'val.bin'), dtype=np.uint16, mode='r')
    def get_batch(split):
        data = train_data if split == 'train' else val_data
        ix = torch.randint(len(data) - block_size, (batch_size,))
        x = torch.stack([torch.from_numpy((data[i:i+block_size]).astype(np.int64)) for i in ix])
        y = torch.stack([torch.from_numpy((data[i+1:i+1+block_size]).astype(np.int64)) for i in ix])
        if device_type == 'cuda':
            x, y = x.pin_memory().to(device, non_blocking=True), y.pin_memory().to(device, non_blocking=True)
        else:
            x, y = x.to(device), y.to(device)
        return x, y

    iter_num = 0
    best_val_loss = 1e9

    meta_path = os.path.join(data_dir, 'meta.pkl')
    meta_vocab_size = None
    if os.path.exists(meta_path):
        with open(meta_path, 'rb') as f:
            meta = pickle.load(f)
        meta_vocab_size = meta['vocab_size']
        print(f"found vocab_size = {meta_vocab_size} (inside {meta_path})")

    # model init
    model_args = dict(n_layer=n_layer, n_head=n_head, n_embd=n_embd, block_size=block_size,
                    bias=bias, vocab_size=None, dropout=dropout)
    model_args['vocab_size'] = meta_vocab_size
    gptconf = GPTConfig(**model_args)
    model = GPT(gptconf)
    if block_size < model.config.block_size:
        model.crop_block_size(block_size)
        model_args['block_size'] = block_size
    model.to(device)

    scaler = torch.cuda.amp.GradScaler(enabled=(dtype == 'float16'))

    # optimizer
    optimizer = model.configure_optimizers(weight_decay, learning_rate, (beta1, beta2), device_type)
    checkpoint = None

    # compile the model
    if compile:
        print("compiling the model... (takes a ~minute)")
        unoptimized_model = model
        model = torch.compile(model) # requires PyTorch 2.0; uses the default 'inductor' backend

    @torch.no_grad()
    def estimate_loss():
        out = {}
        model.eval()
        for split in ['train', 'val']:
            losses = torch.zeros(eval_iters)
            for k in range(eval_iters):
                X, Y = get_batch(split)
                with ctx:
                    logits, loss = model(X, Y)
                losses[k] = loss.item()
            out[split] = losses.mean()
        model.train()
        return out

    # training loop
    X, Y = get_batch('train') # fetch the very first batch
    t0 = time.time()
    local_iter_num = 0 # number of iterations in the lifetime of this process
    raw_model = model
    running_mfu = -1.0
    for iter_num in range(max_iters):
        lr = get_lr(iter_num) if decay_lr else learning_rate
        for param_group in optimizer.param_groups:
            param_group['lr'] = lr

        if iter_num % eval_interval == 0:
            losses = estimate_loss()
            print(f"step {iter_num}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")
            if losses['val'] < best_val_loss or always_save_checkpoint:
                best_val_loss = losses['val']
                if iter_num > 0:
                    checkpoint = {
                        'model': raw_model.state_dict(),
                        'optimizer': optimizer.state_dict(),
                        'model_args': model_args,
                        'iter_num': iter_num,
                        'best_val_loss': best_val_loss
                    }
                    print(f"saving checkpoint to {out_dir_email}")
                    torch.save(checkpoint, os.path.join(out_dir_email, 'emailGPT_ckpt.pt'))
        if iter_num == 0 and eval_only:
            break

        for micro_step in range(gradient_accumulation_steps):
            with ctx:
                logits, loss = model(X, Y)
                loss = loss / gradient_accumulation_steps
            X, Y = get_batch('train')
            scaler.scale(loss).backward()
        if grad_clip != 0.0:
            scaler.unscale_(optimizer) # unscale the gradients first so clipping sees their true norm
            torch.nn.utils.clip_grad_norm_(model.parameters(), grad_clip)
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad(set_to_none=True)

        # timing and logging
        t1 = time.time()
        dt = t1 - t0
        t0 = t1
        if iter_num % log_interval == 0:
            lossf = loss.item() * gradient_accumulation_steps
            if local_iter_num >= 5: # let the training loop settle a bit
                mfu = raw_model.estimate_mfu(batch_size * gradient_accumulation_steps, dt)
                running_mfu = mfu if running_mfu == -1.0 else 0.9*running_mfu + 0.1*mfu
            print(f"iter {iter_num}: loss {lossf:.4f}, time {dt*1000:.2f}ms, mfu {running_mfu*100:.2f}%")
        local_iter_num += 1

    return model
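
The `get_batch` helper in `train_model` builds next-character targets by shifting each input window one position to the right. The following is a minimal, dependency-free sketch of that idea on a toy sequence; `get_toy_batch` and `toy_ids` are hypothetical names used only for illustration, and plain Python lists stand in for the memory-mapped arrays and tensors used in the notebook.

```python
import random

def get_toy_batch(ids, block_size, batch_size, rng):
    """Sample (x, y) windows where y is x shifted one position to the right."""
    xs, ys = [], []
    for _ in range(batch_size):
        # start index in [0, len(ids) - block_size), like torch.randint(len(data) - block_size, ...)
        i = rng.randrange(len(ids) - block_size)
        xs.append(ids[i:i + block_size])          # model input
        ys.append(ids[i + 1:i + 1 + block_size])  # target: the next token at every position
    return xs, ys

toy_ids = list(range(20))  # stand-in for encoded character ids
xs, ys = get_toy_batch(toy_ids, block_size=4, batch_size=3, rng=random.Random(0))
for x, y in zip(xs, ys):
    print(x, "->", y)
```

Because every position in a window has its own target, a single window of length `block_size` yields `block_size` training examples.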
        

In [None]:
def eval_model(model):
    """Runs inference of the trained model on your test data"""

    # -----------------------------------------------------------------------------
    start = "\n" # or "<|endoftext|>" or etc. Can also specify a file, use as: "FILE:prompt.txt"
    num_samples = 11 # number of samples to draw
    max_new_tokens = 1000 # number of tokens generated in each sample
    temperature = 0.8 # 1.0 = no change, < 1.0 = less random, > 1.0 = more random, in predictions
    top_k = 200 # retain only the top_k most likely tokens, clamp others to have 0 probability
    # -----------------------------------------------------------------------------

    model.eval()

    # get the absolute path of the current working directory
    current_dir = os.getcwd()

    # construct the relative path to the meta.pkl file
    meta_path = os.path.join(current_dir, 'data', 'enron_data', 'meta.pkl')
    
    print("Loading meta from meta.pkl...")
    with open(meta_path, 'rb') as f:
        meta = pickle.load(f)
    stoi, itos = meta['stoi'], meta['itos']
    encode = lambda s: [stoi[c] for c in s]
    decode = lambda l: ''.join([itos[i] for i in l])

    # encode the beginning of the prompt
    if start.startswith('FILE:'):
        with open(start[5:], 'r', encoding='utf-8') as f:
            start = f.read()
    start_ids = encode(start)
    x = (torch.tensor(start_ids, dtype=torch.long, device=device)[None, ...])
    # run generation
    with torch.no_grad():
        with ctx:
            for k in range(num_samples):
                #TODO: Generate the sample
                generated_text = model.generate(x, max_new_tokens, temperature=temperature, top_k=top_k)
                # decode the generated text
                decoded_text = decode(generated_text[0].tolist())
                print(f"Sample {k+1}:\n{decoded_text}\n{'-'*80}")
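
The `temperature` and `top_k` settings in `eval_model` control how the next character is drawn from the model's logits. The sketch below shows one common way to apply them, written in plain Python rather than PyTorch; `sample_next` is a hypothetical helper for illustration and is not the `generate` method used in the notebook.

```python
import math
import random

def sample_next(logits, temperature=1.0, top_k=None, rng=random):
    """Return (sampled index, probabilities) after temperature scaling and top-k filtering."""
    # temperature < 1.0 sharpens the distribution, > 1.0 flattens it
    scaled = [l / temperature for l in logits]
    if top_k is not None:
        k = min(top_k, len(scaled))
        cutoff = sorted(scaled, reverse=True)[k - 1]
        # everything below the k-th largest logit gets probability 0 (ties at the cutoff are kept)
        scaled = [l if l >= cutoff else float("-inf") for l in scaled]
    # numerically stable softmax
    m = max(scaled)
    weights = [math.exp(l - m) for l in scaled]
    total = sum(weights)
    probs = [w / total for w in weights]
    idx = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return idx, probs

idx, probs = sample_next([2.0, 1.0, 0.5, -1.0], temperature=0.8, top_k=2, rng=random.Random(0))
print(idx, [round(p, 3) for p in probs])
```

With `top_k=2`, only the two largest logits keep non-zero probability, so the sampled index is always one of those two.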

In [None]:
# Download the data and preprocess it for training and evaluation of the model
download_data()

length of dataset in characters: 337,409,058
all the unique characters: 	
 !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]_`abcdefghijklmnopqrstuvwxyz{|}~
vocab size: 117
train has 303,668,152 tokens
val has 33,740,906 tokens
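
The output above reports a vocabulary of 117 characters, each mapped to an integer id. A toy sketch of the same character-level round trip follows; the sample string is made up for illustration, and the `stoi`/`itos` names mirror the notebook's mappings.

```python
text = "hello enron"  # made-up sample; the notebook uses the full email corpus

# build the vocabulary from the unique characters, in sorted order
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}  # character -> integer id
itos = {i: ch for i, ch in enumerate(chars)}  # integer id -> character

def encode(s):
    """Take a string, return a list of integer ids."""
    return [stoi[c] for c in s]

def decode(ids):
    """Take a list of integer ids, return the string."""
    return "".join(itos[i] for i in ids)

ids = encode(text)
print(len(chars), ids)
print(decode(ids) == text)

# ids are stored as uint16 in train.bin / val.bin, so the vocabulary must fit in 16 bits
print(len(chars) <= 2 ** 16)
```

A character that never occurred in the training text has no entry in `stoi`, so encoding a prompt containing it would raise a `KeyError`.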


In [None]:
# Train the model
# the initial loss should be close to -ln(1/117) ≈ 4.762 (a uniform guess over 117 characters)
trained_model = train_model()

found vocab_size = 117 (inside data/enron_data/meta.pkl)
number of parameters: 10.67M
using fused AdamW: True
step 0: train loss 4.7417, val loss 4.7410
iter 0: loss 4.7292, time 26248.32ms, mfu -100.00%
iter 10: loss 3.5903, time 182.76ms, mfu 2.04%
iter 20: loss 3.1500, time 183.13ms, mfu 2.04%
iter 30: loss 2.9429, time 183.48ms, mfu 2.04%
iter 40: loss 3.0150, time 182.58ms, mfu 2.04%
iter 50: loss 2.8258, time 182.75ms, mfu 2.04%
iter 60: loss 2.7966, time 183.83ms, mfu 2.04%
iter 70: loss 2.8596, time 183.04ms, mfu 2.04%
iter 80: loss 2.7788, time 183.06ms, mfu 2.04%
iter 90: loss 2.7965, time 181.96ms, mfu 2.04%
iter 100: loss 2.7575, time 182.55ms, mfu 2.04%
iter 110: loss 2.7725, time 182.67ms, mfu 2.04%
iter 120: loss 2.8021, time 182.69ms, mfu 2.04%
iter 130: loss 2.6917, time 183.08ms, mfu 2.04%
iter 140: loss 2.6610, time 183.54ms, mfu 2.04%
iter 150: loss 2.7197, time 182.84ms, mfu 2.04%
iter 160: loss 2.6077, time 182.19ms, mfu 2.04%
iter 170: loss 2.5853, time 182.25ms,
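
As a sanity check on the log above: an untrained model that spreads probability evenly over all 117 characters should have a cross-entropy loss of ln(117), which is close to the reported step-0 values. A short sketch of that arithmetic, with the vocabulary size taken from the output above:

```python
import math

vocab_size = 117  # from the "found vocab_size = 117" line

# cross-entropy of a uniform prediction: -ln(1 / vocab_size) = ln(vocab_size)
uniform_loss = -math.log(1.0 / vocab_size)
print(f"{uniform_loss:.3f}")  # 4.762

# exp(loss) is the perplexity: the effective number of equally likely characters
print(round(math.exp(uniform_loss)))  # 117
```

The same conversion applies to later losses: exponentiating a loss value gives the effective number of characters the model is still choosing between at each step.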

In [None]:
# Evaluate the model
eval_model(model=trained_model)

Loading meta from meta.pkl...
Sample 1:

 Water on August 2001, 2001 11:52 AM
To: Kolli Kewitz/ENRON@enronXgate Communications@ENRON
cc: Vicki Heas/HOU/ECT@ECT
cc: Donna Blank/HOU/ECT@ECT, Robert Schulder/Corp/Enron@Enron
cc:  
Subject: Re: FW: November 1. Should receive 






---Original Message----
From: 	Dasovich, Mara  
Sent:	Monday, October 03, 2001 12:34 PM
To:	White, James D Subject:	


Do we did not any help we please do not contact which may be my sistance the 
contact of personal companies as the recent of the financial 
energy companies that these state's just as of a current 
rebacked in Partner.  I am short the price of the explanation of the address which 
is the believe charge the call host 
provide by the cost of the website. The state company of the state will 
need to expect to be that the U.S. company of a SoCal factors.  This is go toget 
continued by the service minuter, and bankruptcy manager products will seek to the brought 
many message a released its optimate