# Pretraining GPT-2

We shorten the context length because our data set doesn't have enough tokens for the original 1024 context.

In [1]:
GPT_CONFIG_124M = {
    "vocab_size": 50257,   # Vocabulary size
    "context_length": 256, # Shortened context length (orig: 1024)
    "emb_dim": 768,        # Embedding dimension
    "n_heads": 12,         # Number of attention heads
    "n_layers": 12,        # Number of layers
    "drop_rate": 0.0,      # Dropout rate
    "qkv_bias": False      # Query-key-value bias
}

## Torch Implementation

Import the libraries. The `GPTModel` implementation now lives in a separate file, which keeps the notebook shorter.

In [2]:
import torch
import tiktoken
from scripts.gpt2_model import GPTModel

torch.manual_seed(123)

<torch._C.Generator at 0x719cb0d26b10>

Next, instantiate the model. Because the config sets the dropout rate to 0, we don't need to switch the model to `eval` mode here.

In [3]:
model = GPTModel(GPT_CONFIG_124M)

Define some helpers. 

In [4]:
from scripts.generate import generate_text_simple

def text_to_token_ids(text, tokenizer):
    encoded = tokenizer.encode(text, allowed_special={'<|endoftext|>'})
    encoded_tensor = torch.tensor(encoded).unsqueeze(0) # add batch dimension
    return encoded_tensor

def token_ids_to_text(token_ids, tokenizer):
    flat = token_ids.squeeze(0) # remove batch dimension
    return tokenizer.decode(flat.tolist())


Let's test the untrained model. As before, we expect it to generate gibberish.

In [5]:
start_context = "Every effort moves you"
tokenizer = tiktoken.get_encoding("gpt2")

token_ids = generate_text_simple(
  model=model,
  idx=text_to_token_ids(start_context, tokenizer),
  max_new_tokens=10,
  context_size=GPT_CONFIG_124M["context_length"]
)

print("Output text:\n", token_ids_to_text(token_ids, tokenizer))

Output text:
 Every effort moves you rentingetic wasnم refres RexMeCHicular stren
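
`generate_text_simple` lives in `scripts/generate.py` and is not shown in this notebook. As a rough sketch of what a greedy generation loop of this kind typically does, assuming it crops the running sequence to `context_size` and appends the highest-scoring token at each step, the following uses plain Python lists and a toy scoring function. The names `generate_greedy` and `toy_scores` are hypothetical and are not part of the notebook's scripts.

```python
def generate_greedy(next_token_scores, token_ids, max_new_tokens, context_size):
    """Append max_new_tokens tokens, always picking the highest-scoring one."""
    ids = list(token_ids)
    for _ in range(max_new_tokens):
        context = ids[-context_size:]        # keep only the last context_size tokens
        scores = next_token_scores(context)  # one score per vocabulary entry
        next_id = max(range(len(scores)), key=lambda i: scores[i])
        ids.append(next_id)
    return ids


def toy_scores(context, vocab_size=5):
    """Toy stand-in for a model: favors (last token + 1) modulo vocab_size."""
    favored = (context[-1] + 1) % vocab_size
    return [1.0 if i == favored else 0.0 for i in range(vocab_size)]


generated = generate_greedy(toy_scores, [0], max_new_tokens=6, context_size=3)
print(generated)  # [0, 1, 2, 3, 4, 0, 1]
```

An untrained model plays the role of `toy_scores` with essentially random scores, which is why the output above reads as gibberish.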


## Loss Calculation

We have a batch of two input sequences here. Each target sequence is its input shifted one token to the right, so every position's target is the "next" token while the context length stays the same. This gives a "sliding window" effect.

In [6]:
inputs = torch.tensor([[16833, 3626, 6100],   # ["every effort moves",
                       [40,    1107, 588]])   #  "I really like"]

targets = torch.tensor([[3626, 6100, 345  ],  # [" effort moves you",
                        [1107,  588, 11311]]) #  " really like chocolate"]

inputs, targets

(tensor([[16833,  3626,  6100],
         [   40,  1107,   588]]),
 tensor([[ 3626,  6100,   345],
         [ 1107,   588, 11311]]))

In [7]:
torch.manual_seed(123)

logits = model(inputs)

probas = torch.softmax(logits, dim=-1) # Probability of each token in vocabulary
print(probas.shape) # Shape: (batch_size, num_tokens, vocab_size)


torch.Size([2, 3, 50257])


`probas` holds, for each token position, `vocab_size` probabilities along the last dimension. We want the token ID (the index) with the highest probability.

In [8]:

probas

tensor([[[1.8849e-05, 1.5172e-05, 1.1687e-05,  ..., 2.2409e-05,
          6.9776e-06, 1.8776e-05],
         [9.1569e-06, 1.0062e-05, 7.8786e-06,  ..., 2.9090e-05,
          6.0103e-06, 1.3571e-05],
         [2.9877e-05, 8.8507e-06, 1.5741e-05,  ..., 3.5456e-05,
          1.4094e-05, 1.3526e-05]],

        [[1.2561e-05, 2.0538e-05, 1.4332e-05,  ..., 1.0389e-05,
          3.4784e-05, 1.4239e-05],
         [7.2731e-06, 1.7864e-05, 1.0565e-05,  ..., 2.1206e-05,
          1.1390e-05, 1.5559e-05],
         [2.9496e-05, 3.3605e-05, 4.1029e-05,  ..., 6.5249e-06,
          5.8203e-05, 1.3698e-05]]], grad_fn=<SoftmaxBackward0>)

For each position, we pick the single most likely next token.

In [9]:
token_ids = torch.argmax(probas, dim=-1, keepdim=True)
print("Token IDs:\n", token_ids)

Token IDs:
 tensor([[[16657],
         [  339],
         [42826]],

        [[49906],
         [29669],
         [41751]]])
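
To make the softmax-then-argmax step concrete, here is a small pure-Python sketch on made-up scores for a four-token vocabulary. The helper names `softmax` and `argmax` are illustrative. Because softmax preserves the ordering of the scores, taking the argmax of the logits directly selects the same token.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])


toy_logits = [0.1, 2.3, -1.0, 0.7]  # made-up scores, not model output
toy_probas = softmax(toy_logits)

print(argmax(toy_probas))  # 1
print(argmax(toy_logits))  # 1
```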


Too bad! We're nowhere close to our targets.

In [10]:
print(f"Targets batch 1: {token_ids_to_text(targets[0], tokenizer)}")
print(f"Outputs batch 1: {token_ids_to_text(token_ids[0].flatten(), tokenizer)}")

print(f"Targets batch 2: {token_ids_to_text(targets[1], tokenizer)}")
print(f"Outputs batch 2: {token_ids_to_text(token_ids[1].flatten(), tokenizer)}")

Targets batch 1:  effort moves you
Outputs batch 1:  Armed heNetflix
Targets batch 2:  really like chocolate
Outputs batch 2:  pressuring empoweredfaith


So what do the probabilities look like at the _real_ target indices? You'll find them to be small. The goal now is to increase these probabilities.

In [11]:
text_idx = 0
target_probas_1 = probas[text_idx, [0, 1, 2], targets[text_idx]]
print("Text 1:", target_probas_1)

text_idx = 1
target_probas_2 = probas[text_idx, [0, 1, 2], targets[text_idx]]
print("Text 2:", target_probas_2)

Text 1: tensor([7.4541e-05, 3.1061e-05, 1.1563e-05], grad_fn=<IndexBackward0>)
Text 2: tensor([1.0337e-05, 5.6776e-05, 4.7559e-06], grad_fn=<IndexBackward0>)
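
The indexing expression `probas[text_idx, [0, 1, 2], targets[text_idx]]` pairs each position with its target ID and reads one probability per position. The sketch below spells out the same lookup with nested Python lists and made-up probabilities. `gather_target_probas` is a hypothetical helper used only for illustration.

```python
# Toy probabilities with shape (batch_size=2, num_tokens=3, vocab_size=4)
toy_probas = [
    [[0.1, 0.2, 0.3, 0.4], [0.4, 0.3, 0.2, 0.1], [0.25, 0.25, 0.25, 0.25]],
    [[0.7, 0.1, 0.1, 0.1], [0.1, 0.7, 0.1, 0.1], [0.1, 0.1, 0.7, 0.1]],
]
toy_targets = [[3, 0, 2], [1, 1, 2]]


def gather_target_probas(probas, targets, text_idx):
    """For each position, read the probability assigned to that position's target ID."""
    return [
        probas[text_idx][pos][target_id]
        for pos, target_id in enumerate(targets[text_idx])
    ]


print(gather_target_probas(toy_probas, toy_targets, 0))  # [0.4, 0.4, 0.25]
print(gather_target_probas(toy_probas, toy_targets, 1))  # [0.1, 0.7, 0.7]
```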


Concatenate the target probabilities from both texts, then take the log of each.

In [12]:
# Compute logarithm of all token probabilities
log_probas = torch.log(torch.cat((target_probas_1, target_probas_2)))
print(log_probas)

tensor([ -9.5042, -10.3796, -11.3677, -11.4798,  -9.7764, -12.2561],
       grad_fn=<LogBackward0>)


Take the average and negate to make it positive.

In [13]:
# Calculate the average log probability over all tokens
avg_log_probas = torch.mean(log_probas)
print(avg_log_probas)

tensor(-10.7940, grad_fn=<MeanBackward0>)


In [14]:
neg_avg_log_probas = avg_log_probas * -1
print(neg_avg_log_probas)


tensor(10.7940, grad_fn=<MulBackward0>)


The result above is what we call the "cross entropy loss".
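
As a quick check of the arithmetic, the same three steps (log, mean, negate) can be reproduced in plain Python from the six target probabilities printed above. The values are copied from the cell output, so the result agrees only up to their rounding.

```python
import math

# Target probabilities copied from the "Text 1" and "Text 2" outputs above
target_probas = [
    7.4541e-05, 3.1061e-05, 1.1563e-05,
    1.0337e-05, 5.6776e-05, 4.7559e-06,
]

log_probas = [math.log(p) for p in target_probas]  # step 1: log of each probability
avg_log_proba = sum(log_probas) / len(log_probas)  # step 2: average
manual_loss = -avg_log_proba                       # step 3: negate

print(round(manual_loss, 3))  # 10.794
```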

## Cross Entropy

Let's use torch's `cross_entropy` to do everything we did above in one call. It takes the raw logits and applies the softmax internally.

In [15]:
# Logits have shape (batch_size, num_tokens, vocab_size)
print("Logits shape:", logits.shape)

# Targets have shape (batch_size, num_tokens)
print("Targets shape:", targets.shape)

Logits shape: torch.Size([2, 3, 50257])
Targets shape: torch.Size([2, 3])


In [16]:
logits_flat = logits.flatten(0, 1)
targets_flat = targets.flatten()

print("Flattened logits:", logits_flat.shape)
print("Flattened targets:", targets_flat.shape)

Flattened logits: torch.Size([6, 50257])
Flattened targets: torch.Size([6])


In [17]:
loss = torch.nn.functional.cross_entropy(logits_flat, targets_flat)
print(loss)


tensor(10.7940, grad_fn=<NllLossBackward0>)


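We can calculate the perplexity too. It is the exponential of the loss and can be read as the effective number of tokens the model is undecided between at each step.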

In [18]:
perplexity = torch.exp(loss)
print(perplexity)

tensor(48725.8203, grad_fn=<ExpBackward0>)
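
For a sense of scale, the sketch below compares the loss above with the loss of a uniform guess over the whole vocabulary, which is `log(vocab_size)`. It uses only the numbers already shown in this notebook.

```python
import math

vocab_size = 50257
loss = 10.7940  # cross entropy loss printed above

perplexity = math.exp(loss)          # same quantity as torch.exp(loss)
uniform_loss = math.log(vocab_size)  # loss when every token is equally likely

print(round(uniform_loss, 4))  # 10.8249
```

The untrained model's loss sits close to the uniform baseline, and its perplexity is close to `vocab_size`, which is what we would expect from randomly initialized weights.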


## Data Stuff

In [19]:
with open("data/the-verdict.txt", "r", encoding="utf-8") as f:
  text_data = f.read()

text_data[:99]

'I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no '

In [20]:
total_characters = len(text_data)
total_tokens = len(tokenizer.encode(text_data))

print("Characters:", total_characters)
print("Tokens:", total_tokens)

Characters: 20479
Tokens: 5145


In [21]:
from scripts.prepare_data import create_dataloader_v1

# Train/validation ratio
train_ratio = 0.90
split_idx = int(train_ratio * len(text_data))
train_data = text_data[:split_idx]
val_data = text_data[split_idx:]

torch.manual_seed(123)

train_loader = create_dataloader_v1(
    train_data,
    batch_size=2,
    max_length=GPT_CONFIG_124M["context_length"],
    stride=GPT_CONFIG_124M["context_length"],
    drop_last=True,
    shuffle=True,
    num_workers=0
)

val_loader = create_dataloader_v1(
    val_data,
    batch_size=2,
    max_length=GPT_CONFIG_124M["context_length"],
    stride=GPT_CONFIG_124M["context_length"],
    drop_last=False,
    shuffle=False,
    num_workers=0
)

In [22]:
# Sanity check

if total_tokens * (train_ratio) < GPT_CONFIG_124M["context_length"]:
    print("Not enough tokens for the training loader. "
          "Try to lower the `GPT_CONFIG_124M['context_length']` or "
          "increase the `train_ratio`")

if total_tokens * (1-train_ratio) < GPT_CONFIG_124M["context_length"]:
    print("Not enough tokens for the validation loader. "
          "Try to lower the `GPT_CONFIG_124M['context_length']` or "
          "decrease the `train_ratio`")

In [23]:
print("Train loader:")
for x, y in train_loader:
    print(x.shape, y.shape)

print("\nValidation loader:")
for x, y in val_loader:
    print(x.shape, y.shape)
    

Train loader:
torch.Size([2, 256]) torch.Size([2, 256])
torch.Size([2, 256]) torch.Size([2, 256])
torch.Size([2, 256]) torch.Size([2, 256])
torch.Size([2, 256]) torch.Size([2, 256])
torch.Size([2, 256]) torch.Size([2, 256])
torch.Size([2, 256]) torch.Size([2, 256])
torch.Size([2, 256]) torch.Size([2, 256])
torch.Size([2, 256]) torch.Size([2, 256])
torch.Size([2, 256]) torch.Size([2, 256])

Validation loader:
torch.Size([2, 256]) torch.Size([2, 256])


In [24]:
train_tokens = 0
for input_batch, target_batch in train_loader:
    train_tokens += input_batch.numel()

val_tokens = 0
for input_batch, target_batch in val_loader:
    val_tokens += input_batch.numel()

print("Training tokens:", train_tokens)
print("Validation tokens:", val_tokens)
print("All tokens:", train_tokens + val_tokens)

Training tokens: 4608
Validation tokens: 512
All tokens: 5120
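
These counts follow directly from the loader shapes: each batch holds `batch_size * context_length` input tokens. The sketch below redoes that arithmetic and shows how window and batch counts could be derived. It assumes the dataset starts a window at every `stride`-th position and keeps it only when a full target exists, which is an assumption about `create_dataloader_v1`, not something shown in this notebook. `count_windows` and `count_batches` are hypothetical helpers.

```python
import math


def count_windows(n_tokens, max_length, stride):
    """Number of (input, target) windows from a sliding window over n_tokens."""
    if n_tokens <= max_length:
        return 0
    return math.ceil((n_tokens - max_length) / stride)


def count_batches(n_windows, batch_size, drop_last):
    """Number of batches a loader yields for n_windows samples."""
    if drop_last:
        return n_windows // batch_size
    return math.ceil(n_windows / batch_size)


context_length, batch_size = 256, 2
train_batches, val_batches = 9, 1  # batch counts printed by the loaders above

train_tokens = train_batches * batch_size * context_length
val_tokens = val_batches * batch_size * context_length
print(train_tokens, val_tokens, train_tokens + val_tokens)  # 4608 512 5120
```

The total of 5120 is slightly below the 5145 tokens in the full text, presumably because tokens that do not fill a complete window or batch are left out.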


Given an input batch and its targets, we run a forward pass through the model. We then flatten the logits and the targets across the batch dimension and compute the cross entropy loss.

In [25]:
def calc_loss_batch(input_batch, target_batch, model):
    logits = model(input_batch)  # Shape: (batch_size, num_tokens, vocab_size)
    # Flatten the batch and token dimensions so each row is one prediction
    loss = torch.nn.functional.cross_entropy(logits.flatten(0, 1), target_batch.flatten())

    return loss


In [26]:

def calc_loss_loader(data_loader, model, num_batches=None):
    total_loss = 0.
    if len(data_loader) == 0:
        return float("nan")
    elif num_batches is None:
        num_batches = len(data_loader)
    else:
        # Reduce the number of batches to match the total number of batches in the data loader
        # if num_batches exceeds the number of batches in the data loader
        num_batches = min(num_batches, len(data_loader))
    for i, (input_batch, target_batch) in enumerate(data_loader):
        if i < num_batches:
            loss = calc_loss_batch(input_batch, target_batch, model)
            total_loss += loss.item()
        else:
            break
    return total_loss / num_batches

In [27]:
torch.manual_seed(123) # For reproducibility due to the shuffling in the data loader

train_loss = calc_loss_loader(train_loader, model)
val_loss = calc_loss_loader(val_loader, model)

print("Training loss:", train_loss)
print("Validation loss:", val_loss)

Training loss: 10.98758347829183
Validation loss: 10.98110580444336


## Training!!

In [28]:
def train_model_simple(model, train_loader, val_loader, optimizer, num_epochs,
                       eval_freq, eval_iter, start_context, tokenizer):
    # Initialize lists to track losses and tokens seen
    train_losses, val_losses, track_tokens_seen = [], [], []
    tokens_seen, global_step = 0, -1

    # Main training loop
    for epoch in range(num_epochs):
        
        for input_batch, target_batch in train_loader:
            optimizer.zero_grad() # Reset loss gradients from previous batch iteration
            loss = calc_loss_batch(input_batch, target_batch, model)
            loss.backward() # Calculate loss gradients
            optimizer.step() # Update model weights using loss gradients
            tokens_seen += input_batch.numel()
            global_step += 1

            # Optional evaluation step
            if global_step % eval_freq == 0:
                train_loss, val_loss = evaluate_model(
                    model, train_loader, val_loader, eval_iter)
                train_losses.append(train_loss)
                val_losses.append(val_loss)
                track_tokens_seen.append(tokens_seen)
                print(f"Ep {epoch+1} (Step {global_step:06d}): "
                      f"Train loss {train_loss:.3f}, Val loss {val_loss:.3f}")

        # Print a sample text after each epoch
        generate_and_print_sample(
            model, tokenizer, start_context
        )

    return train_losses, val_losses, track_tokens_seen


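def evaluate_model(model, train_loader, val_loader, eval_iter):
    model.eval()  # Disable dropout while evaluating
    with torch.no_grad():  # Gradients are not needed for evaluation
        train_loss = calc_loss_loader(train_loader, model, num_batches=eval_iter)
        val_loss = calc_loss_loader(val_loader, model, num_batches=eval_iter)
    model.train()  # Switch back to training mode
    return train_loss, val_loss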


def generate_and_print_sample(model, tokenizer, start_context):
    context_size = model.pos_emb.weight.shape[0]
    encoded = text_to_token_ids(start_context, tokenizer)
    token_ids = generate_text_simple(
        model=model, idx=encoded,
        max_new_tokens=50, context_size=context_size
    )
    decoded_text = token_ids_to_text(token_ids, tokenizer)
    print(decoded_text.replace("\n", " "))  # Compact print format

In [29]:
from scripts.perf_timer import PerfTimer

torch.manual_seed(123)
model = GPTModel(GPT_CONFIG_124M)
optimizer = torch.optim.AdamW(model.parameters(), lr=0.0004, weight_decay=0.1)

num_epochs = 10

timer = PerfTimer()

timer.start()
train_losses, val_losses, tokens_seen = train_model_simple(
    model, train_loader, val_loader, optimizer,
    num_epochs=num_epochs, eval_freq=5, eval_iter=5,
    start_context="Every effort moves you", tokenizer=tokenizer
)
timer.stop()

print(f"Took this long to train: {timer.elapsed_ms()} ms")


Ep 1 (Step 000000): Train loss 9.794, Val loss 9.909
Ep 1 (Step 000005): Train loss 8.038, Val loss 8.324
Every effort moves you,,,,,,,,,,,,,,.                                   
Ep 2 (Step 000010): Train loss 6.598, Val loss 7.041
Ep 2 (Step 000015): Train loss 5.996, Val loss 6.575
Every effort moves you, and, and, and, and, and, and, and. ", and, and, and, and, and, and, and, and, and, and, and, and, and, and, and, and,
Ep 3 (Step 000020): Train loss 5.558, Val loss 6.444
Ep 3 (Step 000025): Train loss 5.956, Val loss 7.675
Every effort moves you.                                                 
Ep 4 (Step 000030): Train loss 4.243, Val loss 6.283
Ep 4 (Step 000035): Train loss 4.304, Val loss 6.210
Every effort moves you.               "I--and's--and it's had been, and I had been the, and I had been the honour of the, and he had been, I had
Ep 5 (Step 000040): Train loss 3.443, Val loss 6.196
Every effort moves you know it was not a littleI glanced after him, and I had been his eye
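
The step numbers in the log follow from the loop structure: `global_step` starts at -1, increases once per batch, and an evaluation runs whenever it is a multiple of `eval_freq`. The sketch below replays only that bookkeeping, without a model, for the 9 training batches per epoch seen earlier. `eval_schedule` is a hypothetical helper written for this illustration.

```python
def eval_schedule(num_epochs, batches_per_epoch, eval_freq):
    """Return the (epoch, global_step) pairs at which evaluation runs."""
    schedule = []
    global_step = -1
    for epoch in range(1, num_epochs + 1):
        for _ in range(batches_per_epoch):
            global_step += 1
            if global_step % eval_freq == 0:
                schedule.append((epoch, global_step))
    return schedule


schedule = eval_schedule(num_epochs=5, batches_per_epoch=9, eval_freq=5)
print(schedule)
# [(1, 0), (1, 5), (2, 10), (2, 15), (3, 20), (3, 25), (4, 30), (4, 35), (5, 40)]
```

The pairs line up with the `Ep ... (Step ...)` lines in the log above.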

In [30]:
tokenizer = tiktoken.get_encoding("gpt2")

token_ids = generate_text_simple(
    model=model,
    idx=text_to_token_ids("Every effort moves you", tokenizer),
    max_new_tokens=25,
    context_size=GPT_CONFIG_124M["context_length"]
)

print("Output text:\n", token_ids_to_text(token_ids, tokenizer))

Output text:
 Every effort moves you?"

"Yes--quite insensible to the irony. She wanted him vindicated--and by me!"




## Saving the Model Weights

In [31]:
torch.save(model.state_dict(), "models/gpt2-verdict-model.pth")

## Reload the Model

In [32]:
model = GPTModel(GPT_CONFIG_124M)
model.load_state_dict(
  torch.load("models/gpt2-verdict-model.pth", weights_only=True)
)
model.eval()

GPTModel(
  (tok_emb): Embedding(50257, 768)
  (pos_emb): Embedding(256, 768)
  (drop_emb): Dropout(p=0.0, inplace=False)
  (trf_blocks): Sequential(
    (0): TransformerBlock(
      (att): MultiHeadAttention(
        (W_query): Linear(in_features=768, out_features=768, bias=False)
        (W_key): Linear(in_features=768, out_features=768, bias=False)
        (W_value): Linear(in_features=768, out_features=768, bias=False)
        (out_proj): Linear(in_features=768, out_features=768, bias=True)
        (dropout): Dropout(p=0.0, inplace=False)
      )
      (ff): FeedForward(
        (layer): Sequential(
          (0): Linear(in_features=768, out_features=3072, bias=True)
          (1): GELU()
          (2): Linear(in_features=3072, out_features=768, bias=True)
        )
      )
      (norm1): LayerNorm()
      (norm2): LayerNorm()
      (drop_shortcut): Dropout(p=0.0, inplace=False)
    )
    (1): TransformerBlock(
      (att): MultiHeadAttention(
        (W_query): Linear(in_features=

In [33]:
token_ids = generate_text_simple(
    model=model,
    idx=text_to_token_ids("Every effort moves you", tokenizer),
    max_new_tokens=25,
    context_size=GPT_CONFIG_124M["context_length"]
)

print("Output text:\n", token_ids_to_text(token_ids, tokenizer))

Output text:
 Every effort moves you?"

"Yes--quite insensible to the irony. She wanted him vindicated--and by me!"




## TTNN

We are not going to train the model using TTNN. Instead, we will use it to run inference on an already trained model. In the previous sections, we trained the GPT-2 model on our data set using the CPU and saved it to disk. We can reload those weights and apply them to the `GPTModel_ttnn` class.


Redefine the config.

In [1]:
GPT_CONFIG_124M = {
    "vocab_size": 50257,   # Vocabulary size
    "context_length": 256, # Shortened context length (orig: 1024)
    "emb_dim": 768,        # Embedding dimension
    "n_heads": 12,         # Number of attention heads
    "n_layers": 12,        # Number of layers
    "drop_rate": 0.0,      # Dropout rate
    "qkv_bias": False      # Query-key-value bias
}

Set some imports.

In [2]:
import ttnn
import tiktoken
import torch
from torch import nn
from scripts.text_helpers import text_to_token_ids, token_ids_to_text

torch.manual_seed(123)

device = None

2025-05-11 14:30:56.164 | DEBUG    | ttnn.library_tweaks:prepare_dir_as_metal_home:54 - Existing installation of 0.57.0rc60+any detected
2025-05-11 14:30:56.188 | DEBUG    | ttnn:<module>:83 - Initial ttnn.CONFIG:
Config{cache_path=/home/avgdev/.cache/ttnn,model_cache_path=/home/avgdev/.cache/ttnn/models,tmp_dir=/tmp/ttnn,enable_model_cache=false,enable_fast_runtime_mode=true,throw_exception_on_fallback=false,enable_logging=false,enable_graph_report=false,enable_detailed_buffer_report=false,enable_detailed_tensor_report=false,enable_comparison_mode=false,comparison_mode_should_raise_exception=false,comparison_mode_pcc=0.9999,root_report_path=generated/ttnn/reports,report_name=std::nullopt,std::nullopt}


## Open the Device

In [3]:
if device:
  ttnn.close_device(device)

device_id = 0
device = ttnn.open_device(device_id=device_id)

                 Device | INFO     | Opening user mode device driver
2025-05-11 14:30:56.784 | INFO     | SiliconDriver   - Opened PCI device 0; KMD version: 1.33.0, IOMMU: disabled

2025-05-11 14:30:56.792 | INFO     | SiliconDriver   - Opened PCI device 0; KMD version: 1.33.0, IOMMU: disabled
2025-05-11 14:30:56.794 | INFO     | SiliconDriver   - Harvesting mask for chip 0 is 0x200 (physical layout: 0x1, logical: 0x200, simulated harvesting mask: 0x0).
2025-05-11 14:30:56.794 | INFO     | SiliconDriver   - Opened PCI device 0; KMD version: 1.33.0, IOMMU: disabled
2025-05-11 14:30:56.795 | INFO     | SiliconDriver   - Detected PCI devices: [0]
2025-05-11 14:30:56.795 | INFO     | SiliconDriver   - Using local chip ids: 

New chip! We now have 1 chips
Chip initialization complete (found )
Chip initializing complete...
 ARC

 [4/4] DRAM

 [16/16] ETH

 CPU

Chip detection complete (found )


## Initialize the GPT Model

Reload the trained PyTorch model from disk.

In [4]:
from scripts.gpt2_model import GPTModel
model = GPTModel(GPT_CONFIG_124M)
model.load_state_dict(
  torch.load("models/gpt2-verdict-model.pth", weights_only=True)
)
model.eval()

GPTModel(
  (tok_emb): Embedding(50257, 768)
  (pos_emb): Embedding(256, 768)
  (drop_emb): Dropout(p=0.0, inplace=False)
  (trf_blocks): Sequential(
    (0): TransformerBlock(
      (att): MultiHeadAttention(
        (W_query): Linear(in_features=768, out_features=768, bias=False)
        (W_key): Linear(in_features=768, out_features=768, bias=False)
        (W_value): Linear(in_features=768, out_features=768, bias=False)
        (out_proj): Linear(in_features=768, out_features=768, bias=True)
        (dropout): Dropout(p=0.0, inplace=False)
      )
      (ff): FeedForward(
        (layer): Sequential(
          (0): Linear(in_features=768, out_features=3072, bias=True)
          (1): GELU()
          (2): Linear(in_features=3072, out_features=768, bias=True)
        )
      )
      (norm1): LayerNorm()
      (norm2): LayerNorm()
      (drop_shortcut): Dropout(p=0.0, inplace=False)
    )
    (1): TransformerBlock(
      (att): MultiHeadAttention(
        (W_query): Linear(in_features=

Now we can copy the weights into our `model_ttnn`.

In [5]:
from scripts.gpt2_model_ttnn import GPTModel_ttnn, TransformerBlock_ttnn

model_ttnn = GPTModel_ttnn(GPT_CONFIG_124M, device)
model_ttnn.eval()

model_ttnn.pos_emb.weight = torch.nn.Parameter(model.pos_emb.weight) 
model_ttnn.tok_emb.weight = torch.nn.Parameter(model.tok_emb.weight)

for i, block in enumerate(model.trf_blocks):
  t = model_ttnn.trf_blocks_ttnn[i]

  t.ff.lin_1.weight = torch.nn.Parameter(block.ff.layer[0].weight)
  t.ff.lin_1.bias = torch.nn.Parameter(block.ff.layer[0].bias)
  t.ff.lin_2.weight = torch.nn.Parameter(block.ff.layer[2].weight)
  t.ff.lin_2.bias = torch.nn.Parameter(block.ff.layer[2].bias)

  t.att.W_key.weight = torch.nn.Parameter(block.att.W_key.weight)
  t.att.W_key.bias = torch.nn.Parameter(block.att.W_key.bias)
  t.att.W_query.weight = torch.nn.Parameter(block.att.W_query.weight)
  t.att.W_query.bias = torch.nn.Parameter(block.att.W_query.bias)
  t.att.W_value.weight = torch.nn.Parameter(block.att.W_value.weight)
  t.att.W_value.bias = torch.nn.Parameter(block.att.W_value.bias)
  
  t.att.out_proj.weight = torch.nn.Parameter(block.att.out_proj.weight)
  t.att.out_proj.bias = torch.nn.Parameter(block.att.out_proj.bias)

  t.norm1.scale = torch.nn.Parameter(block.norm1.scale)
  t.norm1.shift = torch.nn.Parameter(block.norm1.shift)

  t.norm2.scale = torch.nn.Parameter(block.norm2.scale)
  t.norm2.shift = torch.nn.Parameter(block.norm2.shift)

model_ttnn.final_norm.shift = torch.nn.Parameter(model.final_norm.shift)
model_ttnn.final_norm.scale = torch.nn.Parameter(model.final_norm.scale)
model_ttnn.out_head.weight = torch.nn.Parameter(model.out_head.weight)

model_ttnn.update_weights()

In [6]:
model_ttnn.tok_emb_ttnn.shape

Shape([50257, 768])

## Compare Token Embeddings

In [7]:
from scripts.compare_tensors import compare_tensors

compare_tensors(
  ttnn.to_torch(
    ttnn.reshape(
      (model_ttnn.tok_emb_ttnn),
      (1, model_ttnn.tok_emb_ttnn.shape[0], model_ttnn.tok_emb_ttnn.shape[1])
    )
  ),
  model.tok_emb.weight.unsqueeze(0)
)



=== Tensor Comparison ===
Shapes: PyTorch torch.Size([1, 50257, 768]), TTNN torch.Size([1, 50257, 768])
Data types: PyTorch torch.bfloat16, TTNN torch.bfloat16

Tolerance Checks:
  Max Absolute Diff: 0.000000 (Tolerance: 0.020000) ✅ PASS
  Mean Absolute Diff: 0.000000 (Tolerance: 0.020000) ✅ PASS
  Correlation: 1.000000 (Tolerance: 0.990000) ✅ PASS

Overall Status: ✅ PASS

Sample Value Comparisons (first 3 positions):
  Position [0,0,0]: PyTorch=0.341797, TTNN=0.341797, Diff=0.000000 ✅
  Position [0,0,1]: PyTorch=-0.175781, TTNN=-0.175781, Diff=0.000000 ✅
  Position [0,0,2]: PyTorch=-0.302734, TTNN=-0.302734, Diff=0.000000 ✅
  Position [0,0,3]: PyTorch=-0.585938, TTNN=-0.585938, Diff=0.000000 ✅
  Position [0,0,4]: PyTorch=0.347656, TTNN=0.347656, Diff=0.000000 ✅
  Position [0,0,5]: PyTorch=0.660156, TTNN=0.660156, Diff=0.000000 ✅
  Position [0,0,6]: PyTorch=-0.219727, TTNN=-0.219727, Diff=0.000000 ✅
  Position [0,0,7]: PyTorch=-0.376953, TTNN=-0.376953, Diff=0.000000 ✅
  Position [0,0,

{'max_diff': 0.0,
 'mean_diff': 0.0,
 'correlation': 1.0,
 'max_diff_status': True,
 'mean_diff_status': True,
 'correlation_status': True,
 'overall_status': True}

## Compare Positional Embeddings

In [8]:
from scripts.compare_tensors import compare_tensors

compare_tensors(
  ttnn.to_torch(
    ttnn.reshape(
      (model_ttnn.pos_emb_ttnn),
      (1, model_ttnn.pos_emb_ttnn.shape[0], model_ttnn.pos_emb_ttnn.shape[1])
    )
  ),
  model.pos_emb.weight.unsqueeze(0)
)

=== Tensor Comparison ===
Shapes: PyTorch torch.Size([1, 256, 768]), TTNN torch.Size([1, 256, 768])
Data types: PyTorch torch.bfloat16, TTNN torch.bfloat16

Tolerance Checks:
  Max Absolute Diff: 0.000000 (Tolerance: 0.020000) ✅ PASS
  Mean Absolute Diff: 0.000000 (Tolerance: 0.020000) ✅ PASS
  Correlation: 0.996094 (Tolerance: 0.990000) ✅ PASS

Overall Status: ✅ PASS

Sample Value Comparisons (first 3 positions):
  Position [0,0,0]: PyTorch=0.871094, TTNN=0.871094, Diff=0.000000 ✅
  Position [0,0,1]: PyTorch=0.253906, TTNN=0.253906, Diff=0.000000 ✅
  Position [0,0,2]: PyTorch=0.839844, TTNN=0.839844, Diff=0.000000 ✅
  Position [0,0,3]: PyTorch=-0.507812, TTNN=-0.507812, Diff=0.000000 ✅
  Position [0,0,4]: PyTorch=0.341797, TTNN=0.341797, Diff=0.000000 ✅
  Position [0,0,5]: PyTorch=-0.208984, TTNN=-0.208984, Diff=0.000000 ✅
  Position [0,0,6]: PyTorch=-0.699219, TTNN=-0.699219, Diff=0.000000 ✅
  Position [0,0,7]: PyTorch=0.421875, TTNN=0.421875, Diff=0.000000 ✅
  Position [0,0,8]: PyTo

{'max_diff': 0.0,
 'mean_diff': 0.0,
 'correlation': 0.99609375,
 'max_diff_status': True,
 'mean_diff_status': True,
 'correlation_status': True,
 'overall_status': True}

## Compare Final Norm

In [9]:
from scripts.compare_tensors import compare_tensors

final_norm_shift_shape = model_ttnn.final_norm.shift_ttnn.shape
final_norm_scale_shape = model_ttnn.final_norm.scale_ttnn.shape

compare_tensors(
  ttnn.to_torch(
    ttnn.reshape(
      model_ttnn.final_norm.shift_ttnn,
      (1, 1, final_norm_shift_shape[0])
    )
  ),
  model.final_norm.shift.unsqueeze(0).unsqueeze(0)
)

print()

compare_tensors(
  ttnn.to_torch(
    ttnn.reshape(
      model_ttnn.final_norm.scale_ttnn,
      (1, 1, final_norm_scale_shape[0])
    )
  ),
  model.final_norm.scale.unsqueeze(0).unsqueeze(0)
)

=== Tensor Comparison ===
Shapes: PyTorch torch.Size([1, 1, 768]), TTNN torch.Size([1, 1, 768])
Data types: PyTorch torch.bfloat16, TTNN torch.bfloat16

Tolerance Checks:
  Max Absolute Diff: 0.000000 (Tolerance: 0.020000) ✅ PASS
  Mean Absolute Diff: 0.000000 (Tolerance: 0.020000) ✅ PASS
  Correlation: 0.996094 (Tolerance: 0.990000) ✅ PASS

Overall Status: ✅ PASS

Sample Value Comparisons (first 3 positions):
  Position [0,0,0]: PyTorch=0.026733, TTNN=0.026733, Diff=0.000000 ✅
  Position [0,0,1]: PyTorch=-0.026001, TTNN=-0.026001, Diff=0.000000 ✅
  Position [0,0,2]: PyTorch=-0.021362, TTNN=-0.021362, Diff=0.000000 ✅
  Position [0,0,3]: PyTorch=-0.013184, TTNN=-0.013184, Diff=0.000000 ✅
  Position [0,0,4]: PyTorch=0.025269, TTNN=0.025269, Diff=0.000000 ✅
  Position [0,0,5]: PyTorch=0.023926, TTNN=0.023926, Diff=0.000000 ✅
  Position [0,0,6]: PyTorch=-0.025391, TTNN=-0.025391, Diff=0.000000 ✅
  Position [0,0,7]: PyTorch=0.017700, TTNN=0.017700, Diff=0.000000 ✅
  Position [0,0,8]: PyTorc

{'max_diff': 0.0,
 'mean_diff': 0.0,
 'correlation': 1.0078125,
 'max_diff_status': True,
 'mean_diff_status': True,
 'correlation_status': True,
 'overall_status': True}
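
The implementation of `compare_tensors` is in `scripts/compare_tensors.py` and is not shown here. As a hedged sketch of the three statistics it reports, the following computes the maximum absolute difference, the mean absolute difference, and the Pearson correlation on plain Python floats. `compare_stats` is a hypothetical helper, and the input values are the first few samples printed above.

```python
import math


def compare_stats(a, b):
    """Max abs diff, mean abs diff, and Pearson correlation of two equal-length lists."""
    diffs = [abs(x - y) for x, y in zip(a, b)]
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return {
        "max_diff": max(diffs),
        "mean_diff": sum(diffs) / len(diffs),
        "correlation": cov / math.sqrt(var_a * var_b),
    }


values = [0.026733, -0.026001, -0.021362, -0.013184]  # samples printed above
stats = compare_stats(values, values)
print(stats["max_diff"], stats["mean_diff"])  # 0.0 0.0
```

A Pearson correlation cannot exceed 1, so a reported value such as 1.0078125 for identical tensors presumably comes from rounding when the statistic is computed in bfloat16. Computing it in higher precision, as Python floats do here, should give a value at or extremely close to 1.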

## Out Head Comparison


In [10]:
from scripts.compare_tensors import compare_tensors

out_head_shape = model_ttnn.out_head_ttnn.shape
print(out_head_shape)

compare_tensors(
  ttnn.to_torch(
    ttnn.reshape(
      model_ttnn.out_head_ttnn,
      (1, out_head_shape[0], out_head_shape[1])
    )
  ),
  model.out_head.weight.unsqueeze(0)
)



Shape([50257, 768])
=== Tensor Comparison ===
Shapes: PyTorch torch.Size([1, 50257, 768]), TTNN torch.Size([1, 50257, 768])
Data types: PyTorch torch.bfloat16, TTNN torch.bfloat16

Tolerance Checks:
  Max Absolute Diff: 0.000000 (Tolerance: 0.020000) ✅ PASS
  Mean Absolute Diff: 0.000000 (Tolerance: 0.020000) ✅ PASS
  Correlation: 1.000000 (Tolerance: 0.990000) ✅ PASS

Overall Status: ✅ PASS

Sample Value Comparisons (first 3 positions):
  Position [0,0,0]: PyTorch=-0.030151, TTNN=-0.030151, Diff=0.000000 ✅
  Position [0,0,1]: PyTorch=-0.021973, TTNN=-0.021973, Diff=0.000000 ✅
  Position [0,0,2]: PyTorch=-0.013428, TTNN=-0.013428, Diff=0.000000 ✅
  Position [0,0,3]: PyTorch=-0.025391, TTNN=-0.025391, Diff=0.000000 ✅
  Position [0,0,4]: PyTorch=-0.026733, TTNN=-0.026733, Diff=0.000000 ✅
  Position [0,0,5]: PyTorch=-0.020020, TTNN=-0.020020, Diff=0.000000 ✅
  Position [0,0,6]: PyTorch=-0.041748, TTNN=-0.041748, Diff=0.000000 ✅
  Position [0,0,7]: PyTorch=-0.023560, TTNN=-0.023560, Diff=0

{'max_diff': 0.0,
 'mean_diff': 0.0,
 'correlation': 1.0,
 'max_diff_status': True,
 'mean_diff_status': True,
 'correlation_status': True,
 'overall_status': True}

## Transformer Blocks Comparison

In [None]:
from scripts.compare_tensors import compare_tensors

for i, block in enumerate(model.trf_blocks):
  t = model_ttnn.trf_blocks_ttnn[i]

  compare_tensors(
    ttnn.to_torch(
      ttnn.reshape(
        t.att.W_key_ttnn,
        (1, t.att.W_key_ttnn.shape[0], t.att.W_key_ttnn.shape[1]))
    ),
    model.trf_blocks[i].att.W_key.weight.unsqueeze(0),
    suppress_details=True
  )
  compare_tensors(
    ttnn.to_torch(
      ttnn.reshape(
        t.att.W_query_ttnn,
        (1, t.att.W_query_ttnn.shape[0], t.att.W_query_ttnn.shape[1]))
    ),
    model.trf_blocks[i].att.W_query.weight.unsqueeze(0),
    suppress_details=True
  )
  compare_tensors(
    ttnn.to_torch(
      ttnn.reshape(
        t.att.W_value_ttnn,
        (1, t.att.W_value_ttnn.shape[0], t.att.W_value_ttnn.shape[1]))
    ),
    model.trf_blocks[i].att.W_value.weight.unsqueeze(0),
    suppress_details=True
  )

=== Tensor Comparison ===
Shapes: PyTorch torch.Size([1, 768, 768]), TTNN torch.Size([1, 768, 768])
Data types: PyTorch torch.bfloat16, TTNN torch.bfloat16

Tolerance Checks:
  Max Absolute Diff: 0.000000 (Tolerance: 0.020000) ✅ PASS
  Mean Absolute Diff: 0.000000 (Tolerance: 0.020000) ✅ PASS
  Correlation: 0.992188 (Tolerance: 0.990000) ✅ PASS

Overall Status: ✅ PASS

Sample Value Comparisons (first 3 positions):
=== Tensor Comparison ===
Shapes: PyTorch torch.Size([1, 768, 768]), TTNN torch.Size([1, 768, 768])
Data types: PyTorch torch.bfloat16, TTNN torch.bfloat16

Tolerance Checks:
  Max Absolute Diff: 0.000000 (Tolerance: 0.020000) ✅ PASS
  Mean Absolute Diff: 0.000000 (Tolerance: 0.020000) ✅ PASS
  Correlation: 1.007812 (Tolerance: 0.990000) ✅ PASS

Overall Status: ✅ PASS

Sample Value Comparisons (first 3 positions):
=== Tensor Comparison ===
Shapes: PyTorch torch.Size([1, 768, 768]), TTNN torch.Size([1, 768, 768])
Data types: PyTorch torch.bfloat16, TTNN torch.bfloat16

Toleran

In [13]:
from scripts.generate_ttnn import generate_text_simple_ttnn

torch.manual_seed(123)
start_context = "Every effort moves you"
tokenizer = tiktoken.get_encoding("gpt2")

inputs_ttnn = ttnn.from_torch(
  text_to_token_ids(start_context, tokenizer),
  layout=ttnn.TILE_LAYOUT,
  dtype=ttnn.uint32,
  device=device
)
token_ids = generate_text_simple_ttnn(
  model=model_ttnn,
  idx_ttnn=inputs_ttnn,
  max_new_tokens=25,
  context_size=GPT_CONFIG_124M["context_length"],
  device=device
)

print("Output text:\n", token_ids_to_text(token_ids, tokenizer))

Output text:
 Every effort moves you thought'd him on that my host he was "interesting a good a enough a so to-- to exqu,! Jack's


In [18]:
ttnn.close_device(device)

                  Metal | INFO     | Closing device 0


                  Metal | INFO     | Disabling and clearing program cache on device 0
