### LLM Inference Example

This notebook walks through a basic inference example that uses our `ttml` Python API to build, load, and run a Hugging Face large language model on TT hardware. By default it creates and loads a GPT2 model, but it can be edited to use any of the LLMs that the tt-train project currently supports.

Below, in the first cell, we have our imports and basic directory housekeeping.

In [1]:
import os, sys, random
import numpy as np  # For numpy arrays
from dataclasses import dataclass # For configuration classes
from huggingface_hub import hf_hub_download # To download safetensors from Hugging Face
from transformers import AutoTokenizer
from yaml import safe_load # To read YAML configs
from pathlib import Path

sys.path.append(f"{os.environ['TT_METAL_HOME']}/tt-train/sources/ttml")
import ttml

# Can be used to set the random seed for reproducibility
def set_seed(seed: int = 42):
    random.seed(seed)
    np.random.seed(seed)
    ttml.autograd.AutoContext.get_instance().set_seed(seed)

set_seed()

# Change working directory to tt-train
os.chdir(f"{os.environ['TT_METAL_HOME']}/tt-train")

@dataclass
class TransformerConfig:
    n_head: int = 12
    embed_dim: int = 768
    dropout: float = 0.2
    n_blocks : int = 12
    vocab_size: int = 96
    max_seq_len: int = 1024
    weight_tying: str = "enabled"

Use the cell below to change global parameters in this notebook. 

`OUTPUT_TOKENS` : the length of the generated text in tokens (not characters!)

`WITH_SAMPLING` : enable or disable output token sampling (only used for PyTorch)

`TEMPERATURE`   : sampling temperature; set to 0 to disable sampling in `generate_with_tt()`

In [2]:
OUTPUT_TOKENS = 256
WITH_SAMPLING = False
TEMPERATURE = 0.0
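
To make the role of `TEMPERATURE` concrete, here is a minimal NumPy sketch of temperature sampling over a single logits vector. It is only an illustration of the general technique, not the implementation behind `ttml.ops.sample.sample_op`; the helper name `sample_token` and the example logits are hypothetical.

```python
import numpy as np

def sample_token(logits: np.ndarray, temperature: float, rng: np.random.Generator) -> int:
    """Pick a token id from a 1-D logits vector (hypothetical helper, not part of ttml)."""
    if temperature <= 0.0:
        # Temperature 0 disables sampling: always take the most likely token.
        return int(np.argmax(logits))
    scaled = logits / temperature
    scaled = scaled - scaled.max()  # subtract the max for numerical stability
    probs = np.exp(scaled)
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

rng = np.random.default_rng(0)
logits = np.array([2.0, 1.0, 0.5, -1.0])

greedy = sample_token(logits, 0.0, rng)                       # deterministic choice
sampled = [sample_token(logits, 1.0, rng) for _ in range(5)]  # random draws
print(greedy, sampled)
```

Lower temperatures concentrate probability on the top token, while higher ones flatten the distribution, which is why a temperature of 0 is treated here as plain greedy decoding.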

While the notebook is currently configured for GPT2, you can switch tokenizers by changing the model name passed to `from_pretrained()` below.

In [3]:
# Load the tokenizer from Hugging Face and the transformer config from YAML
tokenizer = AutoTokenizer.from_pretrained("gpt2")
with open("configs/training_shakespeare_gpt2s.yaml", "r") as f:
    transformer_cfg = safe_load(f)["training_config"]["transformer_config"]

Similarly, the call to `hf_hub_download()` below downloads the SafeTensors weight file for GPT2 (or reuses a locally cached copy); change `repo_id` to fetch the weights for another model.

In [4]:
# Download the safetensors weights (or reuse the cached copy) and keep the directory that contains them
local_path = hf_hub_download(repo_id="gpt2", filename="model.safetensors")
local_path = str(Path(local_path).parent)

In [5]:
def build_causal_mask(T: int) -> ttml.autograd.Tensor:
    # [1,1,T,T] mask with 1s for allowed positions (i >= j), else 0.
    # Built as float32 in NumPy, then converted to BFLOAT16.
    m = np.tril(np.ones((T, T), dtype=np.float32))
    return ttml.autograd.Tensor.from_numpy(m.reshape(1, 1, T, T), ttml.Layout.TILE, ttml.autograd.DataType.BFLOAT16)

def build_logits_mask(vocab_size: int, padded_vocab_size: int) -> ttml.autograd.Tensor:
    # [1,1,1,padded_vocab_size] mask: 0 for real tokens, 1e4 for the padding entries past vocab_size
    logits_mask = np.zeros((1, 1, 1, padded_vocab_size), dtype=np.float32)
    logits_mask[:, :, :, vocab_size:] = 1e4
    return ttml.autograd.Tensor.from_numpy(logits_mask, ttml.Layout.TILE, ttml.autograd.DataType.BFLOAT16)
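
The two masks are easier to picture on a tiny example. The NumPy-only sketch below builds a 4x4 causal mask and an 8-entry logits mask for a made-up vocabulary of 5 real tokens. How `sample_op` actually applies the logits mask is not shown in this notebook; subtracting it from the logits is an assumption made here purely for illustration.

```python
import numpy as np

# Causal mask for a sequence of 4 positions: row i may attend to columns j <= i.
T = 4
causal = np.tril(np.ones((T, T), dtype=np.float32))
print(causal)

# Logits mask for a toy vocabulary: 5 real tokens padded up to 8 entries.
vocab_size, padded_vocab_size = 5, 8
logits_mask = np.zeros(padded_vocab_size, dtype=np.float32)
logits_mask[vocab_size:] = 1e4

# Assumption: the mask is subtracted so that padded entries can never win,
# even if the model happens to produce large values there.
logits = np.array([0.1, 0.3, 0.2, 0.0, 0.1, 9.0, 9.0, 9.0], dtype=np.float32)
best = int(np.argmax(logits - logits_mask))
print(best)
```

Without the mask, the argmax would land on one of the padded entries, which do not correspond to any token the tokenizer can decode.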

In [6]:
def create_model(cfg, vocab_size: int, seq_len: int):
    # Build the GPT2 config through the ttml bindings
    gcfg = ttml.models.gpt2.GPT2TransformerConfig()
    gcfg.num_heads = cfg["num_heads"]
    gcfg.embedding_dim = cfg["embedding_dim"]
    gcfg.num_blocks = cfg["num_blocks"]
    gcfg.vocab_size = int(vocab_size)
    gcfg.max_sequence_length = seq_len
    gcfg.dropout_prob = cfg["dropout_prob"]
    gcfg.weight_tying = ttml.models.WeightTyingType.Enabled if cfg["weight_tying"] == "enabled" else ttml.models.WeightTyingType.Disabled
    gcfg.runner_type = ttml.models.RunnerType.Default

    model = ttml.models.gpt2.create_gpt2_model(gcfg)
    model.load_from_safetensors(Path(local_path))
    return model

vocab_size = tokenizer.vocab_size

# Round the vocabulary size up to the next multiple of 32 so it can be tilized
if vocab_size % 32 != 0:
    print(f"Warning: vocab size {vocab_size} is not a multiple of 32, padding for tilizing.")
    padded_vocab_size = ((vocab_size + 31) // 32) * 32
else:
    padded_vocab_size = vocab_size

tt_model = create_model(transformer_cfg, padded_vocab_size, transformer_cfg["max_sequence_length"])
tt_model


Transformer configuration:
    Vocab size: 50272
    Max sequence length: 1024
    Embedding dim: 768
    Num heads: 12
    Dropout probability: 0.2
    Num blocks: 12
    Positional embedding type: Trainable
    Runner type: Default
    Composite layernorm: false
    Weight tying: Enabled
2025-10-01 04:33:11.591 | info     |          Device | Opening user mode device driver (tt_cluster.cpp:188)
2025-10-01 04:33:11.644 | info     |   SiliconDriver | Harvesting mask for chip 0 is 0x80 (NOC0: 0x80, simulated harvesting mask: 0x0). (cluster.cpp:403)
2025-10-01 04:33:11.691 | info     |   SiliconDriver | Opening local chip ids/PCIe ids: {0}/[0] and remote chip ids {} (cluster.cpp:252)
2025-10-01 04:33:11.691 | info     |   SiliconDriver | All devices in cluster running firmware version: 18.10.0 (cluster.cpp:232)
2025-10-01 04:33:11.691 | info     |   SiliconDriver | IOMMU: disabled (cluster.cpp:173)
2025-10-01 04:33:11.691 | info     |   SiliconDriver | KMD version: 2.4.0 (cluster.cpp:176)

<_ttml.models.gpt2.GPT2Transformer at 0x7f27c5d734d0>

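
The padded vocabulary size reported in the model configuration follows directly from the rounding formula used above. As a quick check in plain Python (the helper name `pad_to_multiple` is hypothetical and only used for this illustration):

```python
def pad_to_multiple(n: int, multiple: int = 32) -> int:
    """Round n up to the nearest multiple (same formula as in the cell above)."""
    return ((n + multiple - 1) // multiple) * multiple

gpt2_vocab = 50257  # vocabulary size of the GPT2 tokenizer
padded = pad_to_multiple(gpt2_vocab)
extra = padded - gpt2_vocab

print(padded)  # matches the "Vocab size" shown in the transformer configuration
print(extra)   # number of padding entries the logits mask has to suppress
```

For GPT2 this gives 50272 entries, of which the last 15 are padding.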

`generate_with_tt()` uses TT hardware acceleration to generate output from the chosen LLM.

In [7]:
def generate_with_tt(model, prompt_tokens):

    model.eval()
    ttml.autograd.AutoContext.get_instance().set_gradient_mode(ttml.autograd.GradMode.DISABLED)

    if padded_vocab_size != vocab_size:
        logits_mask_tensor = build_logits_mask(vocab_size, padded_vocab_size)
    else:
        logits_mask_tensor = None

    causal_mask = build_causal_mask(transformer_cfg["max_sequence_length"])  # [1,1,seq_len,seq_len], bfloat16
    padded_prompt_tokens = np.full((1, 1, 1, transformer_cfg["max_sequence_length"]), 
                                    tokenizer.eos_token_id,
                                    dtype=np.uint32)

    print("************************************")
    start_idx = 0
    for token_idx in range(OUTPUT_TOKENS):

        # Keep only the most recent max_sequence_length tokens once the context is full
        if len(prompt_tokens) > transformer_cfg["max_sequence_length"]:
            start_idx = len(prompt_tokens) - transformer_cfg["max_sequence_length"]

        padded_prompt_tokens[0, 0, 0, :transformer_cfg["max_sequence_length"]] = 0
        padded_prompt_tokens[0, 0, 0, :len(prompt_tokens) - start_idx] = prompt_tokens[start_idx:]
        padded_prompt_tensor = ttml.autograd.Tensor.from_numpy(
            padded_prompt_tokens,
            ttml.Layout.ROW_MAJOR,
            ttml.autograd.DataType.UINT32
        )  # [1,1,1, max_seq_len], uint32

        logits = model(padded_prompt_tensor, causal_mask)  # [1,1,seq_len,padded_vocab_size]
        next_token_tensor = ttml.ops.sample.sample_op(logits, TEMPERATURE, np.random.randint(low=1_000_000), logits_mask_tensor)  # one sampled token id per position, uint32

        # The prediction for the next token sits at the position of the last real token
        next_token_idx = transformer_cfg["max_sequence_length"] - 1 if len(prompt_tokens) >= transformer_cfg["max_sequence_length"] else len(prompt_tokens) - 1
        next_token = int(next_token_tensor.to_numpy().flatten()[next_token_idx])

        output = tokenizer.decode(next_token)

        prompt_tokens.append(next_token)

        print(output, end='', flush=True)

    print("\n************************************\n\n")
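
The context-window bookkeeping in the loop above can be sketched in plain Python, without any TT hardware. The helper `window_and_index` is hypothetical: it only mirrors the intent of the `start_idx` and `next_token_idx` computations, namely to keep the most recent `max_seq_len` tokens and read the prediction at the position of the last real token.

```python
def window_and_index(tokens: list[int], max_seq_len: int) -> tuple[list[int], int]:
    """Return the tokens that fit in the context window and the index to read from."""
    # Drop the oldest tokens once the sequence is longer than the window.
    start_idx = max(0, len(tokens) - max_seq_len)
    window = tokens[start_idx:]
    # The model's prediction for the next token is at the last filled position.
    next_token_idx = len(window) - 1
    return window, next_token_idx

# Short prompt: everything fits, read at the last prompt position.
short_window, short_idx = window_and_index([7, 8], max_seq_len=4)

# Long sequence: only the 4 most recent tokens are kept.
long_window, long_idx = window_and_index([0, 1, 2, 3, 4, 5], max_seq_len=4)

print(short_window, short_idx)
print(long_window, long_idx)
```

With the notebook defaults (a short prompt and 256 output tokens against a 1024-token window), only the first case is exercised; the second matters once generation runs past `max_sequence_length`.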

`generate_with_pytorch()` generates output using the Hugging Face GPT2 model running in PyTorch on the CPU.

In [10]:
def generate_with_pytorch(prompt_tokens):
    import torch
    from transformers import AutoModelForCausalLM

    torch_model = AutoModelForCausalLM.from_pretrained("gpt2", dtype=torch.bfloat16)
    torch_model.eval()
    print("************************************")

    outputs = torch_model.generate(
        prompt_tokens,
        max_new_tokens=OUTPUT_TOKENS,
        do_sample=WITH_SAMPLING,   # False selects greedy decoding; True enables sampling
        temperature=TEMPERATURE,   # Sampling temperature; ignored when do_sample is False
        num_beams=1,               # No beam search
        pad_token_id=tokenizer.eos_token_id
    )

    generated_text = tokenizer.batch_decode(outputs, skip_special_tokens=True)

    for t in generated_text:
        print(t)
        
    print("\n************************************\n\n")



In [9]:
prompt_str = "The difference between cats and dogs is:"
prompt_tokens = tokenizer.encode(prompt_str)
print("Generating with TT:")
generate_with_tt(tt_model, prompt_tokens.copy())
print("Generating with PyTorch:")
prompt_tokens = tokenizer.encode(prompt_str, return_tensors="pt")
generate_with_pytorch(prompt_tokens)

Generating with TT:
************************************
2025-10-01 04:34:24.235 | info     |            Test | Small moreh_layer_norm algorithm is selected. (moreh_layer_norm_program_factory.cpp:168)


A cat is a dog.

A dog is a cat.

A cat is a dog.

A cat is a dog.

A cat is a dog.

A cat is a dog.

A cat is a dog.

A cat is a dog.

A cat is a dog.

A cat is a dog.

A cat is a dog.

A cat is a dog.

A cat is a dog.

A cat is a dog.

A cat is a dog.

A cat is a dog.

A cat is a dog.

A cat is a dog.

A cat is a dog.

A cat is a dog.

A cat is a dog.

A cat is a dog.

A cat is a dog.

A cat is a dog.

A cat is a dog.

A cat is a dog.

A cat is a dog.

A cat is a dog.

A cat is a dog.

A cat is a dog.

A cat is a dog.

A cat is a dog.
************************************


Generating with PyTorch:


`torch_dtype` is deprecated! Use `dtype` instead!
The following generation flags are not valid and may be ignored: ['temperature']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


************************************
The difference between cats and dogs is:

A cat is a cat that is a cat that is a cat that is a cat that is a cat that is a cat that is a cat that is a cat that is a cat that is a cat that is a cat that is a cat that is a cat that is a cat that is a cat that is a cat that is a cat that is a cat that is a cat that is a cat that is a cat that is a cat that is a cat that is a cat that is a cat that is a cat that is a cat that is a cat that is a cat that is a cat that is a cat that is a cat that is a cat that is a cat that is a cat that is a cat that is a cat that is a cat that is a cat that is a cat that is a cat that is a cat that is a cat that is a cat that is a cat that is a cat that is a cat that is a cat that is a cat that is a cat that is a cat that is a cat that is a cat that is a cat that is a cat that is a cat that is a cat that is a cat that is a cat that is a cat that is a cat that is a cat that is a cat that

********************************

In [11]:
prompt_str = "Compared to spoons, forks are meant to:"
prompt_tokens = tokenizer.encode(prompt_str)
print("Generating with TT:")
generate_with_tt(tt_model, prompt_tokens.copy())
print("Generating with PyTorch:")
prompt_tokens = tokenizer.encode(prompt_str, return_tensors="pt")
generate_with_pytorch(prompt_tokens)

Generating with TT:
************************************


Be able to hold a fork in place

Be able to hold a fork in place

Be able to hold a fork in place

Be able to hold a fork in place

Be able to hold a fork in place

Be able to hold a fork in place

Be able to hold a fork in place

Be able to hold a fork in place

Be able to hold a fork in place

Be able to hold a fork in place

Be able to hold a fork in place

Be able to hold a fork in place

Be able to hold a fork in place

Be able to hold a fork in place

Be able to hold a fork in place

Be able to hold a fork in place

Be able to hold a fork in place

Be able to hold a fork in place

Be able to hold a fork in place

Be able to hold a fork in place

Be able to hold a fork in place

Be able to hold a fork in place

Be able to hold a fork in place

Be able to hold a fork in place

Be able to hold a fork in place

Be able to hold
************************************


Generating with PyTorch:
************************************

In [12]:
prompt_str = "Bees are similar to:"
prompt_tokens = tokenizer.encode(prompt_str)
print("Generating with TT:")
generate_with_tt(tt_model, prompt_tokens.copy())
print("Generating with PyTorch:")
prompt_tokens = tokenizer.encode(prompt_str, return_tensors="pt")
generate_with_pytorch(prompt_tokens)

Generating with TT:
************************************




A bee is a small, hairy, hairy, hairy, hairy, hairy, hairy, hairy, hairy, hairy, hairy, hairy, hairy, hairy, hairy, hairy, hairy, hairy, hairy, hairy, hairy, hairy, hairy, hairy, hairy, hairy, hairy, hairy, hairy, hairy, hairy, hairy, hairy, hairy, hairy, hairy, hairy, hairy, hairy, hairy, hairy, hairy, hairy, hairy, hairy, hairy, hairy, hairy, hairy, hairy, hairy, hairy, hairy, hairy, hairy, hairy, hairy, hairy, hairy, hairy, hairy, hairy, hairy, hairy, hairy, hairy, hairy, hairy, hairy, hairy, hairy, hairy, hairy, hairy, hairy, hairy, hairy, hairy, hairy, hairy, hairy, hairy, hairy, hairy, hairy, hairy, hairy, hairy, hairy, hairy, hairy, hairy, hairy, hairy, hairy, hairy, hairy, hairy, hairy, hairy, hairy, hairy, hairy, hairy, hairy, hairy, hairy, hairy, hairy, hairy, hairy, hairy, hairy, hairy, hairy, hairy, hairy, hairy, hairy, hairy, hairy, hairy, hairy, hairy, hairy,
************************************


Generating with PyTorch:
************************************
Bees are sim

Now try your own prompt!

In [None]:
prompt_str = input("Enter your prompt: ")

prompt_tokens = tokenizer.encode(prompt_str)
print("Generating with TT model:")
generate_with_tt(tt_model, prompt_tokens)
prompt_tokens = tokenizer.encode(prompt_str, return_tensors="pt")
print("Generating with PyTorch model:")
generate_with_pytorch(prompt_tokens)