## Building a Character-Level GPT from Scratch – Training Shakespeare with Transformers
### Goal:
Train a small autoregressive Transformer (decoder only like GPT) to generate Shakespeare-like text at the character level.

### Why:
    - Learn the core components behind GPT models:

        **Self-attention**: How the model focuses on relevant past context.

        **Positional encoding**: How sequence order is encoded without recurrence.

        **Autoregressive language modeling**: Predicting the next character one at a time.

    - Implement a decoder-only Transformer architecture from scratch, mimicking the GPT family of models.

    - Understand how attention works in practice and how it powers generation.



### Attention
The purpose of attention is to reconstruct the embedding (representation ) of a word in a vector space such that the word's vector lies closer to the words that are contexually relavant in a given sequence

#### What Attention *Really* Does:

> It reconstructs or refines a token’s vector (embedding) so that it moves closer in vector space to other contextually relevant words, based on the current sentence.
> 

---

#### Before vs After Attention

- **Before attention**:
    
    The embedding of a word like `"bank"` is static — the same for `"river bank"` and `"bank loan"`.
    
- **After attention**:
    
    The word `"bank"` has a new vector based on **which words are nearby**.
    
    It shifts closer to:
    
    - `"river"`, `"shore"` in `"the bank of the river"`
    - `"loan"`, `"account"` in `"she went to the bank"`

---

#### Why This Matters

This is **how transformers handle meaning**:

- They **don’t just memorize words**
- They **compute new, dynamic meanings** for words in **every new context**

---

#### Intuition

Imagine a 3D space:

- The word `"bank"` is like a drone hovering in the middle.
- Based on nearby words, **attention pulls it** toward `"finance"` or `"geography"` clusters.
- That final position is the **contextualized meaning**.

#### Transformer

Transformer architecture was introduced in [*Attention Is All You Need*](https://arxiv.org/abs/1706.03762) as:

![Transformer architecture as explained in the paper](Annotated-Transformers-Architecture.webp)

**This notebook implements ONLY the DECODER block, with SELF_ATTENTION instead of CROSS-ATTENTION. This is consistent with the GPT2 architecture.**

### Initial Setup

In [43]:
import torch
import torch.nn as nn
from torch.nn import functional as F

### Data Preparation

In [18]:
# download the Shakespeare  dataset -- optional
# !wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt

# read the dataset
with open("input.txt", 'r', encoding='utf-8') as f:
  text = f.read()

--2025-05-19 11:34:21--  https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1115394 (1.1M) [text/plain]
Saving to: ‘input.txt.1’


2025-05-19 11:34:22 (108 MB/s) - ‘input.txt.1’ saved [1115394/1115394]



In [None]:
# hyperparameters
batch_size = 64 # how many independent sequences will we process in parallel?
block_size = 256 # what is the maximum context length for predictions?
max_iters = 5000
eval_interval = 500
learning_rate = 3e-4 # broght down from 1e-3
device = 'cuda' if torch.cuda.is_available() else 'cpu'
eval_iters = 200
n_embd = 384
n_head = 6
n_layer = 6
dropout = 0.2

In [19]:
# create vocabulary
chars = sorted(list(set(text)))
vocab_size = len(chars)

# create a character level tokenizer
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for i, ch in enumerate(chars)}
tokenizer = lambda s: [stoi[c] for c in s]
detokenizer = lambda l: ''.join([itos[i] for i in l])

# tokenize the input
data = torch.tensor(tokenizer(text)) # this will be the input to transformer

In [20]:
# split the data into train/dev splits
# 90% data is train data, rest is val data
n = int((0.9) * len(data))
train_data = data[:n]
val_data = data[n:]
print(len(train_data), len(val_data))

1003854 111540


In [57]:
# create a dataloader
def get_batch(split):
    # generate a small batch of data of inputs x and targets y
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,)) # get 4 random indices
    x = torch.stack([data[i:i+block_size] for i in ix]) # shape will be (B,T)
    y = torch.stack([data[i+1:i+block_size+1] for i in ix]) # shape will be(B ,T)
    x, y = x.to(device), y.to(device) # move the data to device (this is important in case device == cuda. we need to move the data to the gpu after loading)
    return x, y


### Transformer Architecture

In [22]:
class LayerNorm1d:

    def __init__(self, dim, eps=1e-5, momentum=0.1):
        self.eps = eps # a small value called epsilon. added to the std in the denominator to avoid 0 in the denominator in case std is 0
        # parameters (trained with backprop)
        self.gamma = torch.ones(dim) # bngain
        self.beta = torch.zeros(dim ) # bnbias)

    def __call__(self, x):
        # calculate the forward pass for layernorm
        xmean = x.mean(1, keepdim=True) # mean and variance over rows
        xvar = x.var(1, keepdim=True)
        # standardize
        xhat = (x - xmean) / torch.sqrt(xvar + self.eps)
        self.out = self.gamma * xhat + self.beta

        return self.out

    def parameters(self):
        return [self.gamma, self.beta]


In [45]:
# Self-Attention Head
class Head(nn.Module):
  """one head of self-attention"""

  def __init__(self, head_size):
    super().__init__()
    self.key = nn.Linear(n_embd, head_size, bias=False)
    self.query = nn.Linear(n_embd, head_size, bias=False)
    self.value = nn.Linear(n_embd, head_size, bias=False)
    # Register causal mask (lower triangular) in variable "tril" as non-trainable buffer
    self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))
    self.dropout = nn.Dropout(dropout)

  def forward(self, x):
    B, T, C = x.shape
    k = self.key(x) # (B, T, head_size)
    q = self.query(x) # (B, T, head_size)
    v = self.value(x) # (B, T, head_size)
    # compute attention wieghts
    wei = q @ k.transpose(-2, -1) * C**-0.5 # (B, T, head_size) @ (B, head_size, T) -> (B, T, T)
    # create mask
    wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf')) # (B, T, T)
    # attention scores
    wei = F.softmax(wei, dim=-1) # (B, T, T)
    wei = self.dropout(wei) # dropout in the self-attention module after calculating the attention score
    # final attention embeddings
    out = wei @ v # (B, T, T) @ (B, T, head_size) -> (B, T, head_size)
    return out

# Multi-head Attention
class MultiHeadAttention(nn.Module):
  """multiple heads of self-attention in parallel"""
  def __init__(self, num_heads, head_size):
    super().__init__()
    self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
    self.proj = nn.Linear(n_embd, n_embd)
    self.dropout = nn.Dropout(dropout)

  def forward(self, x):
    out = torch.cat([h(x) for h in self.heads], dim=-1)
    out = self.proj(out) # projecting back into the residual pathway
    out = self.dropout(out) # droput after the multihead self attention layer
    return out

In [46]:
# feedforward class of linear layers
class FeedForward(nn.Module):
  """a simple linear layer followed by a non-linearity"""

  def __init__(self, n_embd):
    super().__init__()
    self.net = nn.Sequential(
        nn.Linear(n_embd, 4 * n_embd), # the inner layer has a dimensionality of 4 times the embedding size as per the transformer paper
        nn.ReLU(),
        nn.Linear(4 * n_embd, n_embd), # projection layer going back into the residual pathway
        nn.Dropout(dropout)) # add dropout before passing the connection back into the rsidual pathway

  def forward(self,x):
    return self.net(x)

# grouping multihead SELF attention and feedfoward in a block
class Block(nn.Module):
  """Transformer block: communication(attention) followed by computation(linear layers)"""
  def __init__(self, n_embd, n_head):
    super().__init__()
    head_size = n_embd // n_head
    self.sa = MultiHeadAttention(n_head, head_size)
    self.ffwd = FeedForward(n_embd)
    self.ln1 = nn.LayerNorm(n_embd) # layernorm for the multihead-attention layer
    self.ln2 = nn.LayerNorm(n_embd) # layernorm for the feedforward layer

  def forward(self, x):
    # "x + " is adding x for residual connection/skip connection
    # in the original paper layernorm is applied after the self attention layer and the feedfwd layer
    # It is now common to apply it before the self attention layer and the feedfwd layer - so the input to the two sa and ffwd layers will be layernormed input
    # this is called the "pre-norm formulation" and we will be implenting that

    x = x + self.sa(self.ln1(x))
    x = x + self.ffwd(self.ln2(x))
    return x


In [47]:
# Bigram Language Model
class BigramLanguageModel(nn.Module):
  def __init__(self): # removing vocab_size from the constructor since it is already defined as a global variable
    super().__init__()
    # nn.embedding creates a weight matrix and extracts the row corrsponding to every value in the input x matrix
    # expanding the model to add more layers after the embedding layer
    self.token_embedding_table = nn.Embedding(vocab_size, n_embd) # create an embedding lookup table for vocab_size tokens with each token represented by an n_embd size
    self.position_embedding_table = nn.Embedding(block_size, n_embd) # position embedding layer - creates a position embedding for every token position (T num of tokens)
    # self.sa_head = Head(head_sizea=n_embd) # self-attention layer with head_size same as embedding size
    # self.sa_heads = MultiHeadAttention(4, n_embd // 4) # create 4 heads of size  so the total number of channels after concat of heads will be 32 (n_embd)
    # self.ffwd = FeedForward(n_embd) # feedforward layer

    # replacing multihead attention and ffwd net with multiple transformer blocks in a sequence
    self.blocks = nn.Sequential(*[Block(n_embd, n_head=n_head) for _ in range(n_layer)])
    self.ln_f = nn.LayerNorm(n_embd) # final layer norm just before the lm_head (output layer
    self.lm_head = nn.Linear(n_embd, vocab_size) # final layer of the model called language model head

  def forward(self, idx, targets=None): # idx is the input token
    B, T = idx.shape
    token_emb = self.token_embedding_table(idx) # (B, T, C) This will give us the embedding for every token, hence adding the channel dimension
    pos_emb = self.position_embedding_table(torch.arange(T, device=device)) # (T, C) to be added to token_emd using broadcasting
    x = token_emb + pos_emb # (B, T, C) # concat the token and pos embeddings. x holds the token identities along with their position info. Not helpful for a bigram model
    # x = self.sa_head(x) # apply one head of self-attention (B, T, head_size)
    # x = self.sa_heads(x) # apply multiple heads of self-attention in parallel (B, T, C)
    # x = self.ffwd(x) # apply a feedforward layer (B, T, C)

    # replacing multihead attention and ffwd net with multiple transformer blocks in a sequence
    x = self.blocks(x)
    logits = self.lm_head(x) # (logits) final output (B, T, vocab_size)

    if targets is None:
      loss = None
    else:
      # calculate the loss
      # Pytorch expects (B, C, T) instead of (B, T, C). Instead of reshaping to (B, C, T), we make our logits 2D
      B, T, C = logits.shape
      logits = logits.view(B*T, C) # (32, 65)
      targets = targets.view(B*T) # (32,)
      loss = F.cross_entropy(logits, targets)

    return logits, loss

  def generate(self, idx, max_new_tokens):
    # idx is (B, T) array of indices in the current context
    for _ in range(max_new_tokens):
      # crop idx to get just the last block_size number of tokens else adding pos_emb to token_embd will cause errors since we have pos_embd only for 8 tokens
      idx_cond = idx[:, -block_size:]
      # get predictions
      logits, loss = self(idx_cond) # this will call forward
      # focus only on the last time step to extract the logits for the last token
      logits = logits[:, -1, :] # becomes (B, C)
      # apply softmax to get probabilities
      probs = F.softmax(logits, dim=-1) # (B, C)
      # sample from the distribution
      idx_next = torch.multinomial(probs, num_samples=1) # (B,1) num_samples = 1 because we only sample one token at a time
      # append sampled example to the running sequence
      idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)

    return idx


In [48]:
# simply printing the loss after every iteration is noisy because it the loss on a single batch
# a better idea to view the average of loss instead
# we average the los for train and test splits after every eval_iters number of batches

@torch.no_grad()
def estimate_loss():
    out = {}
    model.eval() # sets the model to eval mode. This is because some layers like dropout, batchnorm etc have different behaviours during train and test/eval time
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train() # sets the model to train mode
    return out

In [51]:
# initialization
model = BigramLanguageModel()
m = model.to(device) # move the model params like weights of the embedding table to device
optimizer = torch.optim.AdamW(m.parameters(), lr=learning_rate)

### Training

In [52]:
# training loop
for iter in range(max_iters):
  if iter % eval_interval == 0:
    losses = estimate_loss()
    print(f"step {iter}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")

  # get a batch of size 32, 8 for every iteration
  xb, yb = get_batch('train')

  # evaluate loss
  logits, loss = m(xb, yb) # forward pass
  optimizer.zero_grad(set_to_none=True) # set grad to 0 before backward pass
  loss.backward()
  optimizer.step()

print(loss.item())

step 0: train loss 4.5300, val loss 4.5341
step 500: train loss 2.0929, val loss 2.1512
step 1000: train loss 1.6635, val loss 1.8199
step 1500: train loss 1.4902, val loss 1.6779
step 2000: train loss 1.3910, val loss 1.6009
step 2500: train loss 1.3178, val loss 1.5515
step 3000: train loss 1.2660, val loss 1.5284
step 3500: train loss 1.2232, val loss 1.5089
step 4000: train loss 1.1824, val loss 1.4962
step 4500: train loss 1.1468, val loss 1.4965
1.1747770309448242


In [53]:
total_params = sum(p.numel() for p in model.parameters())
total_params

10788929

### Sampling/Inference

In [55]:
# generate from the model
context = torch.zeros((1, 1), dtype=torch.long, device=device)  # context also needs to be generated on the device
print(detokenizer(m.generate(context, max_new_tokens=2000)[0].tolist()))



And treason to there anger at the daunts
Hot forbids to give York not lie!

GLOUCESTER:
Say night, my noble scools,
By some eartaff'd and kill'd my sovereign
Hard prison Chamilia: so fare your peers worth
That you, should not see do bring them.

CAPULET:
What say we must better on that steel years?

LADY CAPULET:
There arise they sabel, though you prove!

Nurse:
Faretch, because sir, nor further lady heart.

JULIET:
What, or else dowry honger will not go and us:
Ask thou, tutost very distrent me! I am cull'd
Inform thy holy and Saint Cliffold spile cowards
Ermon begins, presently, whose thrices are creeks no
The enviom'd passable, and the kindwells: I do consul.

Lords:
Being person
Beforedy, some that what is sue in thy sin,
How they dare choose. Fivility, you shall not an
Finisher sainted yours: so thy be bittern is thee
valoued as I have made my is all;
Think all this lets in a nay, but obey is.

FRIAR LAURENCE:
But they not not sue bear me.

FRILANCE:
Provost:
And this daughts and

## Results with simpler architectures - Fewer layers and less regularization

#### Output after training with a single attention head (no dropout. layernorm)

Thr haclle y rosthe yocou rers hout ontheed yeth on toet a ianghth to'Mi'su JZH:  
Tnof ise cie.  
Coof, who yof no norcay 'seat tend maibe om sheace, Iny a detsilsshe aspolen boto b.  

ABndst hopor yis gealangou oit dace hag yoou  
O'd nfurie lefonouirn bo hes syeten: nhey hod nus ousour,  
The wohe thar Andesonfst, tich'de; wal ingat I tarru, multuth chot sat ghtou s waperllon milth hor yeoush oow thinee tou thod ghou thare  
Ocouve thoof s,  
ha, mouka omen.  
Whavin yancind al.  

LONNTIUS:  
I hour pu Sowpte!  

Yon Genfor, od tharint  
Nm do hingan.  
fnd  shon:  
Cor wooova  
Gpou tour tar.  
Yu.  
'TEW:  

Ho' Rerhe Ron fllo; heee yound u hon; oarnd wirxkimnok kde o mod-d wouvefd bee Ro yodet omer foagint pe, mererom thake has ha wheak dod buraliave' prou thad skoby s keve bst sord I o ppay in, wined hinnt thakized ivesthey myre o gtacongir:  
Cancth chin. '  

Wy haved d turok in,  
Na,  
HLLI: andy stholou's.  
Te he they o a  
Hik,  
II at hat  
In oarringh't tinth soun a o thof the te itr onert crectrrro hy haninget heth


**Observation**

The output has started to learn the format of the input data along with some words like they, the, who. But it's still not good. Though there is improvement as compared to a simple no-attention bigram model.

### Output after multihead attention and feed forward layer(single linear layer + relu)

therer penancena foughth mase hopearair doight!  

ELROL: have a I moully thealsey?  

Fangried seand?  

Whatk you, oshoure ablaw, that the detting'dlinatte by no odent puk in Yin   

desin muned thatrgy wore,  

And  

WerlZEN elings.  

Fear frit Ble am stroth the com. andels.  

No o tvorer givas; sot, Maperd'so dead. whwasl  
myt libtheold a a ging bothic asou rkics'dcevalnst,  
lo womn lim ehon Iould I t'sdalie a nda yamag,  
lhat no inter.  
achthidung. a ale ivouen geld tathill  
Oper't: tor'd pilla  
rsgom weorrsh: thit, I bras!  
las Ella masinglenger; srroeap;  
wehehe' Ea broThth ke quirty, well not child-lasichtiote a in grown meatn,  
Tathin sthell'chel:-atheh oulde: mun creoe'st to;  
Tawillllyt, tou be hiso Reear devy  
rat:er athimrrens  
Tatl loond the hem erther heyo a snot to tveacel?  
maarn age I For OSfulffuur him broy ouerrsitel.  

luen sdo So, hicheired doun, well Buucdt ille, yomprp,  
I bioa, lend Cel you an, To they werlee willl myor;  
W uilliore'd tran, gif isgllamtik cound loue your  
'shou you pnone him pal  

### Output after introducing Blocks of attention and feedforward layers

The out after adding multiple blocks of attention and ffwd is still not good because the model is becoming deep and deep models have opttimization issues. SO to cater for that, we add regularization techniques as discussed in the Attention is ALL you need paper. We will be adding skip-connections (residual connections) and layer norm.


---
output:

et ye'l kuce are sat lais hel me is  
End sill bees by, birou my fy soist sis.  

WAy Merstree my lur cagaioror,  
The, din ling: fre craven aris, I ithio oour weve  
Uove hiralle wold handy thate agf pome  
Corcepe meny beatel pof thith, I gim unknt:  
tougia po lon co ncoornoliny tol my yitht saksim lie;  
And lind Bat od bith pof tharuar,  
Pe teethone pucaver agier thin-stie ite,  
Agries wa bakt man bit aye lirou torgrstonces:  
Fhes. wilk, is fould itrcksss thing.  

LALIONET:  

Lird arthe fot, od; thy thow paynd, ar tho gitited  
I wice Mrerams, my me oore, higath spard apored myough?  
Noust olot'ly, ande oun; trountses band hesptaty.  
Ine my sad nor, ot theco odte, tood thegh hemous.  
Found theanay mound hirwe the.  
Llorptse:  
For awed orabo's t yeabjustee.  

GLUK:  
Mard norgangiord, ay waly we, capit: san mitst rore,  
A, our pyor ooss  
Toom bree fyoof hase the gris of arcon.  
Sote mabller, hard ilaminud forwhe angar sostur,  
Ond my tof toccelld has, he leay'd torge theabll exepry her to smord!  
Heane--  
I saver: I  




## Output after applying residual connections:
Train loss: 1.6277
Val loss: 1.8527

Observations:
  - We start seeing some overfitting as the training loss keeps decreasing whhile val loss mostly pleatues out
  - The output starts looking like English with some proper English vocab. The structure of the output has also starting taking shape of a dialogue.

---


His your for for  
do waitin the to in gown conspy up!  
You priling the quonst of thou no feectrumman, die  
Wherel: I will and I not, bid besich's!  
And left songue done, Bence, as bethands! bodes'd.  

ANGELO:  
Had my name sir, nafor live that with Deys  
Your of our belse weell well is dorisor  
Is his the book for the wordloges.  

ISABELA:  
My nought. For your and swilk-  
Beten on maser, Moculdite: if a fount-incle's.  
Loveing as perforthe by not hound and,  
Wife to in till friends: chaurel was your my vicdar:  
Why, the and hasse bear,-  
I, amdve forforther'd that hout tong shemernar's thems?  

BAPTOFOZEH:  
Conriffer:  
Evench me, and scommand thee froud to the inn dids,  
What samboys, that me me; forbastent togan unclectohe it.  

CLAUDIO:  
My crousofur be our here and my doth.  
Way, good him morn her as't dismand than:  
Vour I watland by that I not we lice in  
And sole ontok welcingham: your refore.  

GEONTHORE:  
I am ten Edward's or mades to luge and back;  
Vor shall if the eyem; for bance the an such nuriend's,  

## Output after adding LayerNorm
Observation:
 - train loss 1.6561
 - val loss 1.8386


---

Output:

the cressigne-ccrant, in an other bast,  
To bust of strung inters and beens: my amores.  

LORTHUMBY:  
Queace you me! Call citief  
Whatch of Lord Claudo to with o'er theth toar.  

CLIFFORD:  
Why, adabist and thou.  

GLOUCESTER:  
Neved out here; boous armoon arm sonda?  

WARWARD IV:  
For HENE OF YORK:  
Here tour Alster, my purcimph her,  
That hast with that ear?  

DUCHESSTER:  
Struh, there my prove Boldy stretly is appin's  
To placke say ausir is out they hurph theeford?  

BRUTUST:  
Rull dyred; if your loks oneselve sent-very.  

ANEMERS:  
This Lord, could lady learntly of with whom dow  
Hisbled.  
Sech, and lookes hour kasfter and the fromford  
With Romeond, thou it eare still a go they hone:  
Do ear-boun as as this sweears you hourstags hi  
beh brelinge, their abouse the's ourn of there  
'Twou'ers his so am dopplens, well.  

Julstues ake how quallip't faire lasticece:  
Beck, firberied this ornemy thrat'st hea, he vill  
For ovide sicke them's let thee, you e's and let  
fet thy gent the alme fort, mile an- heall pusbl  

# How the Character-Level Transformer Works

## Context Processing
- The model maintains a sliding context window of **256 characters**
- For each prediction step, it analyzes exactly **256 consecutive characters**
- The context window contains positions `[0]` to `[255]` (zero-indexed)

## Attention Mechanism
- Self-attention computes relationships between all characters in the current window
- Each character can only attend to:
  - Itself
  - Previous characters in the window (causal masking prevents looking ahead)
- The 256th character (position `[255]`) receives contextual information from:
  - Its own embedding
  - All previous characters (`[0]` to `[254]`)

## Prediction Process
- The model processes all 256 characters in parallel during training
- The final character's output embedding (position `[255]`) is used to predict:
  - The 257th character (next token in sequence)
- During generation:
  - The predicted character is appended to the sequence
  - The context window slides forward by 1 position
  - The process repeats for the next character

## Key Properties
- **Training**: Processes full 256-character sequences simultaneously (with masking)
- **Inference**: Generates text auto-regressively (one character at a time)
- **Position Awareness**: Learned positional embeddings track each character's absolute position

## Output after deepening the model and adding dropout
Observations:
- train loss 1.1468
- val loss 1.4965
---

Output:


And treason to there anger at the daunts
Hot forbids to give York not lie!

GLOUCESTER:
Say night, my noble scools,
By some eartaff'd and kill'd my sovereign
Hard prison Chamilia: so fare your peers worth
That you, should not see do bring them.

CAPULET:
What say we must better on that steel years?

LADY CAPULET:
There arise they sabel, though you prove!

Nurse:
Faretch, because sir, nor further lady heart.

JULIET:
What, or else dowry honger will not go and us:
Ask thou, tutost very distrent me! I am cull'd
Inform thy holy and Saint Cliffold spile cowards
Ermon begins, presently, whose thrices are creeks no
The enviom'd passable, and the kindwells: I do consul.

Lords:
Being person
Beforedy, some that what is sue in thy sin,
How they dare choose. Fivility, you shall not an
Finisher sainted yours: so thy be bittern is thee
valoued as I have made my is all;
Think all this lets in a nay, but obey is.

FRIAR LAURENCE:
But they not not sue bear me.

FRILANCE:
Provost:
And this daughts and mine here! where
JULIET:
Ay, I am sure near so nop in peace or grace.
What would you have less,
Let purchase for you, and not be honded encommended:
For me, some powers at enter the noble senses,
And so, trove as you, good uprison;
And in the court'sympating heart of the time;
A drunk hell, for mine.

FRUMIO:
Pray, good sheep--O, this in fool the quiet.

Shepherd:
I that like infortion true and some pactly of you banish,
And mothers simplain how the counsel, have you but loss
Reckows in chept depart suition my injust,
Con breaks that may not so brings wandered.
The dretime of thise of the lance:
If ever thou heart inforce their advantage deceive
To thee who shuns mose advantainted daours
Shall even yourself untLuck and living dreads;
The thinish would enough is palace our deep!
Where we must glad that falter for Exetest,
And whether by use, which eas a liventy.

TYBUS:
My lord Watchman:
But lead the pries power a foot.

Clown:
Ay, a lord, soby!

MAMILLIUS:
None, lord,
A mate, Conspirat

## Observations
1. Is it like Shakespeare?

✅ Resemblances:

Stylistic Imitation: Your model is capturing the surface-level style of Shakespeare fairly well:

Capitalized character names (JULIET:, TYBUS:, FRIAR LAURENCE:).

Dialogue structure with each line representing speech.

Old English-sounding phrases (thou, nay, hence, etc.).

Poetic rhythm, occasional use of iambic-style meter (although inconsistent).

Thematic Hallucinations: Words like "lord", "noble", "sheep", "watchman", "court", and "prison" are thematic to Shakespearean drama, especially the tragedies.

❌ Not Quite There Yet:

Nonsense Words: Many words are completely made-up or broken:
(fivility, tutost, infortion, frilance, advantainted, untLuck).

Poor Grammar: Syntax is frequently mangled:

“What, or else dowry honger will not go and us”

“That you, should not see do bring them.”

Missing Coherence: There's little to no narrative structure or dialogue logic. Characters respond incoherently or without relevance to one another.

So yes — it imitates Shakespeare’s form impressively well for such a small model, but lacks Shakespeare’s substance (meaning, coherence, character logic).

