### LLM Inference Example

This notebook is a basic inference example that uses the `ttml` Python API to build, load, and run a Hugging Face large language model on TT hardware. By default it creates and loads a GPT2 model, but it can be edited to use any LLM that the tt-train project currently supports.

The first cell below handles imports and basic directory housekeeping.

In [None]:
import os, sys, random
import numpy as np  # For numpy arrays
from dataclasses import dataclass # For configuration classes
from huggingface_hub import hf_hub_download # To download safetensors from Hugging Face
from transformers import AutoTokenizer
from yaml import safe_load # To read YAML configs
from pathlib import Path

sys.path.append(f"{os.environ['TT_METAL_HOME']}/tt-train/sources/ttml")
import ttnn
import ttml
from ttml.common.config import get_training_config, load_config
from ttml.common.utils import set_seed, round_up_to_tile
from ttml.common.model_factory import TransformerModelFactory

# Change working directory to tt-train
os.chdir(f"{os.environ['TT_METAL_HOME']}/tt-train")


Use the cell below to change global parameters in this notebook. 

`OUTPUT_TOKENS` : the length of the generated text in tokens (not characters!)

`WITH_SAMPLING` : enable or disable output token sampling (used only by `generate_with_pytorch()`)

`TEMPERATURE`   : sampling temperature; set to 0 to disable sampling in `generate_with_tt()`

`SEED`          : randomization seed (for reproducibility)

`CONFIG`        : YAML configuration file passed to `get_training_config()`

In [2]:
OUTPUT_TOKENS = 256
WITH_SAMPLING = True
TEMPERATURE = 0.8
SEED = 42
CONFIG = "gpt2_inference.yaml"

set_seed(SEED)
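
To make the effect of `TEMPERATURE` concrete, here is a minimal NumPy sketch of temperature sampling over a toy logits vector. It is a generic illustration, not the implementation behind `ttml.ops.sample.sample_op`; the helper name `sample_token` and the toy logits are assumptions made for this example.

```python
import numpy as np

def sample_token(logits, temperature, rng):
    """Pick a token id from a 1-D array of logits (hypothetical helper)."""
    logits = np.asarray(logits, dtype=np.float64)
    if temperature == 0:
        # Temperature 0 means no sampling: always take the highest-scoring token.
        return int(np.argmax(logits))
    scaled = logits / temperature
    scaled -= scaled.max()  # subtract the max for numerical stability
    probs = np.exp(scaled)
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

rng = np.random.default_rng(42)
toy_logits = [2.0, 1.0, 0.1]

greedy = sample_token(toy_logits, 0.0, rng)
sampled = [sample_token(toy_logits, 0.8, rng) for _ in range(1000)]
counts = np.bincount(sampled, minlength=3)

print("greedy pick:", greedy)
print("sample counts per token:", counts)
```

Lower temperatures concentrate the probability mass on the top token, while higher temperatures flatten the distribution and make unlikely tokens more frequent.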

The notebook is configured for GPT2, but you can switch tokenizers by changing the argument to `from_pretrained()` below.

In [4]:
# Load the tokenizer from Hugging Face and the transformer config from YAML
tokenizer = AutoTokenizer.from_pretrained("gpt2")
training_config = get_training_config(CONFIG)
model_yaml = load_config(training_config.model_config, configs_root=os.getcwd())

As with the tokenizer, `hf_hub_download()` downloads the SafeTensors weight file for GPT2 (or finds it in your local cache); change `repo_id` to fetch the weights of another model.

In [5]:
# Get the safetensors file, then keep only the directory that contains it
safetensors_path = hf_hub_download(repo_id="gpt2", filename="model.safetensors")
safetensors_path = safetensors_path.replace("model.safetensors", "")

print(f"Safetensors path: {safetensors_path}")


Safetensors path: /home/ubuntu/.cache/huggingface/hub/models--gpt2/snapshots/607a30d783dfa663caf39e06633721c8d4cfcd7e/
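
The cell above strips the file name with a string replacement. A sketch of the same step with `pathlib` is shown here, using the example path from the output above (yours will differ). Note that `Path.parent` drops the trailing slash; whether `load_from_safetensors()` accepts the directory without it is an assumption that has not been checked here.

```python
from pathlib import Path

# Example file location, built from the path printed above; yours will differ.
weights_file = Path(
    "/home/ubuntu/.cache/huggingface/hub/models--gpt2/snapshots/"
    "607a30d783dfa663caf39e06633721c8d4cfcd7e/model.safetensors"
)

# The parent directory is what the notebook passes on to the model loader.
weights_dir = weights_file.parent
print(weights_dir)
```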


In [None]:
def build_causal_mask(T: int) -> ttml.autograd.Tensor:
    # [1,1,T,T] mask with 1s for allowed positions (i >= j), else 0; built in float32, stored as bfloat16
    m = np.tril(np.ones((T, T), dtype=np.float32))
    return ttml.autograd.Tensor.from_numpy(m.reshape(1, 1, T, T), ttnn.Layout.TILE, ttnn.DataType.BFLOAT16)

def build_logits_mask(vocab_size: int, padded_vocab_size: int) -> ttml.autograd.Tensor:
    logits_mask = np.zeros((1, 1, 1, padded_vocab_size), dtype=np.float32)
    logits_mask[:, :, :, vocab_size:] = 1e4
    return ttml.autograd.Tensor.from_numpy(logits_mask, ttnn.Layout.TILE, ttnn.DataType.BFLOAT16)   # [1,1,1,padded_vocab_size], bfloat16
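
To see what these two masks contain, the sketch below rebuilds them in plain NumPy with toy sizes. The last step, subtracting the logits mask from the logits so that padded vocabulary slots can never win, is an assumption about how the mask is consumed downstream; the notebook does not show that detail.

```python
import numpy as np

# Causal mask for a toy sequence length: row i may attend to columns j <= i.
T = 4
causal = np.tril(np.ones((T, T), dtype=np.float32))
print(causal)

# Logits mask for toy vocabulary sizes (not GPT2's): large values mark padded slots.
vocab_size, padded_vocab_size = 5, 8
logits_mask = np.zeros((1, 1, 1, padded_vocab_size), dtype=np.float32)
logits_mask[:, :, :, vocab_size:] = 1e4

# Assumed usage: subtract the mask so padded slots get very negative scores.
toy_logits = np.zeros(padded_vocab_size, dtype=np.float32)
toy_logits[vocab_size:] = 50.0  # junk values in the padded region
masked = toy_logits - logits_mask.reshape(-1)
print(masked)
print("argmax after masking:", int(np.argmax(masked)))
```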

In [7]:
orig_vocab_size = tokenizer.vocab_size

tt_model_factory = TransformerModelFactory(model_yaml)
tt_model_factory.transformer_config.vocab_size = orig_vocab_size

max_sequence_length = tt_model_factory.transformer_config.max_sequence_length

tt_model = tt_model_factory.create_model()
tt_model.load_from_safetensors(safetensors_path)
tt_model

padded_vocab_size = round_up_to_tile(orig_vocab_size, 32)

if orig_vocab_size != padded_vocab_size:
    print(f"Padding vocab size for tilization: original {orig_vocab_size} -> padded {padded_vocab_size}")


Transformer configuration:
    Vocab size: 50272
    Max sequence length: 1024
    Embedding dim: 768
    Num heads: 12
    Dropout probability: 0.2
    Num blocks: 12
    Positional embedding type: Trainable
    Runner type: Memory efficient
    Composite layernorm: false
    Weight tying: Enabled
2025-12-02 19:15:07.274 | info     |             UMD | Starting topology discovery. (topology_discovery.cpp:69)
2025-12-02 19:15:07.278 | info     |             UMD | Established firmware bundle version: 18.10.0 (topology_discovery.cpp:369)
2025-12-02 19:15:07.278 | info     |             UMD | Established ETH FW version: 7.0.0 (topology_discovery_wormhole.cpp:324)
2025-12-02 19:15:07.278 | info     |             UMD | Completed topology discovery. (topology_discovery.cpp:73)
2025-12-02 19:15:07.278 | info     |          Device | Opening user mode device driver (tt_cluster.cpp:211)
2025-12-02 19:15:07.278 | info     |             UMD | Starting topology discovery. (topology_discovery.cpp:69)
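
The padded size comes from rounding the vocabulary up to the tile size. Assuming `round_up_to_tile` rounds its first argument up to the next multiple of its second, the arithmetic can be sketched with a hypothetical stand-in, `round_up`:

```python
def round_up(n: int, multiple: int = 32) -> int:
    """Round n up to the next multiple of `multiple` (hypothetical stand-in)."""
    return ((n + multiple - 1) // multiple) * multiple

gpt2_vocab = 50257  # size of the GPT2 tokenizer vocabulary
padded = round_up(gpt2_vocab)
print(gpt2_vocab, "->", padded)  # 50257 -> 50272
```

The result matches the `Vocab size: 50272` line in the configuration printed above.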

`generate_with_tt()` uses TT hardware acceleration to generate output from the chosen LLM, one token at a time.

In [None]:
def generate_with_tt(model, prompt_tokens):

    ttml.autograd.AutoContext.get_instance().set_gradient_mode(ttml.autograd.GradMode.DISABLED)
    model.eval()

    logits_mask_tensor = None

    if padded_vocab_size != orig_vocab_size:
        logits_mask_tensor = build_logits_mask(orig_vocab_size, padded_vocab_size)

    causal_mask = build_causal_mask(max_sequence_length)  # [1,1,seq_len,seq_len], bfloat16
    padded_prompt_tokens = np.zeros((1, 1, 1, max_sequence_length), 
                                    dtype=np.uint32)

    start_idx = 0

    print("************************************")
    for token_idx in range(OUTPUT_TOKENS):

        if len(prompt_tokens) > max_sequence_length:
            start_idx = len(prompt_tokens) - max_sequence_length

        # Copy the most recent tokens into the fixed-size buffer; unused positions stay 0
        padded_prompt_tokens[0, 0, 0, :len(prompt_tokens)] = prompt_tokens[start_idx:]
        padded_prompt_tensor = ttml.autograd.Tensor.from_numpy(
            padded_prompt_tokens,
            ttnn.Layout.ROW_MAJOR,
            ttnn.DataType.UINT32)  # [1,1,1, max_seq_len], uint32

        logits = model(padded_prompt_tensor, causal_mask)  # out=[1,1,seq_len, vocab_size], bf16


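        # A fresh random integer is passed on every step as the third argument
        next_token_tensor = ttml.ops.sample.sample_op(logits, TEMPERATURE, np.random.randint(int(1e7)), logits_mask_tensor)  # out=[1,1,seq_len,1], uint32

        # The prediction for the next token sits at the last real (non-padding) position
        next_token_idx = max_sequence_length - 1 if len(prompt_tokens) > max_sequence_length else len(prompt_tokens) - 1
        next_token = int(next_token_tensor.to_numpy().flatten()[next_token_idx])  # plain Python int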

        output = tokenizer.decode(next_token)

        prompt_tokens.append(next_token)
        print(output, end='', flush=True)

    print("\n************************************\n\n")
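
The windowing and indexing inside the loop can be checked in isolation. The sketch below mirrors that logic in NumPy with a hypothetical helper, `pad_window`, and a toy maximum length; it does not touch the device or the model.

```python
import numpy as np

def pad_window(prompt_tokens, max_len):
    """Return the padded input buffer and the index holding the next-token
    prediction (hypothetical helper mirroring the loop in generate_with_tt)."""
    window = prompt_tokens[-max_len:]  # keep only the most recent max_len tokens
    padded = np.zeros(max_len, dtype=np.uint32)
    padded[:len(window)] = window      # left-aligned; the tail stays zero
    next_idx = len(window) - 1         # last real token position
    return padded, next_idx

# Prompt shorter than the context: zero padding on the right.
short_padded, short_idx = pad_window([11, 22, 33], max_len=8)
print(short_padded, short_idx)

# Prompt longer than the context: the oldest tokens are dropped.
long_padded, long_idx = pad_window(list(range(1, 11)), max_len=8)
print(long_padded, long_idx)
```

Because the buffer is left-aligned, the position to read moves right as the prompt grows and then stays at `max_len - 1` once the context is full.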

In [9]:
def generate_with_pytorch(prompt_tokens):
    import torch
    from transformers import AutoModelForCausalLM

    torch.manual_seed(SEED)

    torch_model = AutoModelForCausalLM.from_pretrained("gpt2", dtype=torch.bfloat16)
    torch_model.eval()
    print("************************************")
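    # generate() expects a [batch, seq_len] tensor of token ids, not a Python list
    input_ids = torch.tensor([prompt_tokens], dtype=torch.long)
    with torch.no_grad():
        outputs = torch_model.generate(
            input_ids,
            max_new_tokens=OUTPUT_TOKENS,
            do_sample=WITH_SAMPLING,  # Enable or disable sampling
            temperature=TEMPERATURE,  # Sampling temperature
            num_beams=1  # No beam search; with do_sample=True this is multinomial sampling
        )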
    
    generated_text = tokenizer.batch_decode(outputs, skip_special_tokens=True)

    for t in generated_text:
        print(t)
        
    print("\n************************************\n\n")

In [None]:
prompt_str = "The difference between cats and dogs is:"
prompt_tokens = tokenizer.encode(prompt_str)
print("Generating with TT:")
generate_with_tt(tt_model, prompt_tokens.copy())

Generating with TT:


In [11]:
prompt_str = "Compared to spoons, forks are meant to:"
prompt_tokens = tokenizer.encode(prompt_str)
print("Generating with TT:")
generate_with_tt(tt_model, prompt_tokens.copy())

Generating with TT:
************************************
 ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( (
 (
 ( (


 F F 35 (
 F


 (



 35
 35 F ( F<|endoftext|> 35 F F F (
 (
 ( ( F ( 35 F ( F F
 35 F (<|endoftext|> 35 F F 35 A
 35 F A<|endoftext|>
 F<|endoftext|> F 35<|endoftext|><|endoftext|> F F (

 F
 F F 35 F 35 ( The
 F 35 F F F 35 I F 35 F 35 F F F A F ( F F ( F F F The F F F A F F F F A 35 F F The The F<|endoftext|> F A F F 35<|endoftext|> A<|endoftext|> F F I A F The F F The F The The F The I<|endoftext|> The F A The F F The F
 I The F F The The F 35 The F F<|endoftext|> 35 The
 The F The F The H The F The H 35 F The
 35 The The A The The F
 35 The<|endoftext|> F The F The<|endoftext|>
************************************




In [None]:
prompt_str = "Bees are similar to:"
prompt_tokens = tokenizer.encode(prompt_str)
print("Generating with TT:")
generate_with_tt(tt_model, prompt_tokens.copy())

Now try your own prompt!

In [None]:
prompt_str = input("Enter your prompt: ")

prompt_tokens = tokenizer.encode(prompt_str)
print("Generating with TT model:")
generate_with_tt(tt_model, prompt_tokens)