<table style="width:100%">
<tr>
<td style="vertical-align:middle; text-align:left;">
<font size="2">
Supplementary code for the <a href="https://mng.bz/lZ5B">Build a Reasoning Model (From Scratch)</a> book by <a href="https://sebastianraschka.com">Sebastian Raschka</a><br>
<br>Code repository: <a href="https://github.com/rasbt/reasoning-from-scratch">https://github.com/rasbt/reasoning-from-scratch</a>
</font>
</td>
<td style="vertical-align:middle; text-align:left;">
<a href="https://mng.bz/lZ5B"><img src="https://sebastianraschka.com/images/reasoning-from-scratch-images/cover-small.webp" width="100px"></a>
</td>
</tr>
</table>


# Chapter 2: Generating Text with a Pre-trained LLM

Packages that are being used in this notebook:

In [1]:
from importlib.metadata import version

used_libraries = [
    "reasoning_from_scratch",
    "torch",
    "tokenizers"  # Used by reasoning_from_scratch
]

for lib in used_libraries:
    print(f"{lib} version: {version(lib)}")

reasoning_from_scratch version: 0.1.2
torch version: 2.7.1
tokenizers version: 0.21.4


<img src="https://sebastianraschka.com/images/reasoning-from-scratch-images/ch02/CH02_F01_raschka.webp?1" width="500px">

&nbsp;
## 2.1 Introduction to LLMs for text generation

- No code in this section
- How do LLMs generate text?
- This chapter is a setup chapter: setting up the coding environment and LLM we will be using throughout the book
- We also code text generation functions that we will use and extend in upcoming chapters

<img src="https://sebastianraschka.com/images/reasoning-from-scratch-images/ch02/CH02_F02_raschka.webp?1" width="300px">

- LLM (and neural network) flowcharts are traditionally read and drawn from top to bottom

&nbsp;
## 2.2 Setting up the coding environment

- If you are reading this book, you have likely coded in Python before
- The simplest way to install dependencies, if you already have a Python environment set up (with Python 3.10 or newer), is to use `pip`:

In [2]:
#!pip install -r https://raw.githubusercontent.com/rasbt/reasoning-from-scratch/refs/heads/main/requirements.txt

- For this chapter, dependencies can also be installed manually:

In [3]:
#!pip install "torch>=2.7.1" "tokenizers>=0.21.2"
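
- After a manual install, it can be handy to confirm that the installed versions meet the minimums from the command above. The following is a minimal sketch, not part of the `reasoning_from_scratch` package; the helper names `release_tuple` and `meets_minimum` are hypothetical, and the comparison deliberately ignores pre-release and local suffixes such as `rc1` or `+cu126`:

```python
import re
from importlib import metadata

# Minimum versions taken from the manual install command above
MINIMUM_VERSIONS = {"torch": "2.7.1", "tokenizers": "0.21.2"}


def release_tuple(version_string):
    # Keep only the leading numeric release part, e.g. "2.7.1+cu126" -> (2, 7, 1)
    match = re.match(r"\d+(?:\.\d+)*", version_string)
    if match is None:
        raise ValueError(f"Cannot parse version: {version_string!r}")
    return tuple(int(part) for part in match.group().split("."))


def meets_minimum(installed, minimum):
    # Tuples compare element by element, so (2, 8, 0) >= (2, 7, 1) is True
    return release_tuple(installed) >= release_tuple(minimum)


for package, minimum in MINIMUM_VERSIONS.items():
    try:
        installed = metadata.version(package)
    except metadata.PackageNotFoundError:
        print(f"{package}: not installed (need >= {minimum})")
        continue
    status = "OK" if meets_minimum(installed, minimum) else f"too old, need >= {minimum}"
    print(f"{package} {installed}: {status}")
```

- For stricter comparisons (for example, treating release candidates as older than the final release), a dedicated version-parsing library is the safer choice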

- My preferred way is to use the widely recommended [uv](https://docs.astral.sh/uv/) Python package and project manager
- To install `uv`, run the installation for your OS from the official website: https://docs.astral.sh/uv/getting-started/installation/
- Next, clone the GitHub repo:

In [4]:
#!git clone --depth 1 https://github.com/rasbt/reasoning-from-scratch.git

- If you don't have `git` installed, you can also manually download the source code repository from the Manning website or by clicking this link: https://github.com/rasbt/reasoning-from-scratch/archive/refs/heads/main.zip (unzip it after downloading)

- In the terminal, navigate to the `reasoning-from-scratch` folder
- Run `uv run jupyter lab` to launch JupyterLab and open a blank notebook or the notebook for this chapter
- This command also sets up a local virtual environment (usually in `.venv/`) and installs all dependencies from the `pyproject.toml` file inside the `reasoning-from-scratch` folder automatically

<img src="https://sebastianraschka.com/images/reasoning-from-scratch-images/ch02/CH02_F03_raschka.webp?1" width="500px">

- See [../02_setup-tips/python-instructions.md](../02_setup-tips/python-instructions.md) for additional installation details and options if needed

&nbsp;
## 2.3 Understanding hardware needs and recommendations

- If you are new to PyTorch, I recommend reading through my [PyTorch in One Hour: From Tensors to Training Neural Networks on Multiple GPUs](https://sebastianraschka.com/teaching/pytorch-1h/) tutorial
- If you followed the previous section, you should have PyTorch installed
- To check manually which accelerator (if any) your PyTorch installation supports on your machine, run the following code:

In [5]:
import torch

print(f"PyTorch version {torch.__version__}")

if torch.cuda.is_available():
    print("CUDA GPU")
elif torch.mps.is_available():
    print("Apple Silicon GPU")
else:
    print("Only CPU")

PyTorch version 2.7.1
Apple Silicon GPU


- Depending on the chapter, the code will automatically use an NVIDIA GPU if available and otherwise run on the CPU (or on an Apple Silicon GPU if recommended for a particular section or chapter)
- Chapters 2-4 can be executed in a reasonable time on a CPU
- Code in chapters 5-7 will be very slow when executed on a CPU, and an NVIDIA GPU is recommended for these chapters (more on the exact resource needs in those upcoming chapters)
- My personal preference is [Lightning AI Studio](https://lightning.ai/), which offers users free compute credits after the sign-up and verification process; alternatively, [Google Colab](https://colab.research.google.com/) is another good choice

<img src="https://sebastianraschka.com/images/reasoning-from-scratch-images/ch02/CH02_F04_raschka.webp" width="500px">

- See [../02_setup-tips/gpu-instructions.md](../02_setup-tips/gpu-instructions.md) for cloud compute recommendations if needed
- For now, there is no need to use a GPU; the first chapters run fine on non-GPU hardware

&nbsp;
## 2.4 Preparing input texts for LLMs

- In this section, we learn how to use a tokenizer; we use it to convert (encode) input text into a token ID representation as input to the LLM
- We also use the tokenizer to convert (decode) the LLM output back into a human-readable text representation

<img src="https://sebastianraschka.com/images/reasoning-from-scratch-images/ch02/CH02_F05_raschka.webp?1" width="500px">

- As mentioned earlier, implementing the LLM and tokenizer from scratch is outside the scope of this book, which is focused on implementing reasoning methods from scratch on top of an existing LLM and tokenizer
- In this book, we will work with a pre-trained LLM that we will load in the next section; here, we load the tokenizer that goes with it
- I prepared a `reasoning_from_scratch` Python package that provides the base LLM and the corresponding tokenizer, which I coded with the help of the [`tokenizers`](https://github.com/huggingface/tokenizers) Python library
- The `reasoning_from_scratch` package code is part of this book's supplementary code, and it should already be installed based on the instructions in section 2.2

- Next, we download the tokenizer files (this is a tokenizer for the Qwen3 base LLM, but more on that in the next section):

In [6]:
from reasoning_from_scratch.qwen3 import download_qwen3_small

download_qwen3_small(kind="base", tokenizer_only=True, out_dir="qwen3")

✓ qwen3/tokenizer-base.json already up-to-date


- Now, we can load the tokenizer settings from the tokenizer file into the `Qwen3Tokenizer`:

In [7]:
from pathlib import Path
from reasoning_from_scratch.qwen3 import Qwen3Tokenizer

tokenizer_path = Path("qwen3") / "tokenizer-base.json"
tokenizer = Qwen3Tokenizer(tokenizer_file_path=tokenizer_path)

- Since we haven't loaded the LLM itself yet, we will do a simpler round-trip: we encode the text into token IDs and then decode them back into their string representation:

<img src="https://sebastianraschka.com/images/reasoning-from-scratch-images/ch02/CH02_F06_raschka.webp" width="500px">

In [8]:
prompt = "Explain large language models."
input_token_ids_list = tokenizer.encode(prompt)

In [9]:
for i in input_token_ids_list:
    print(f"{i} --> {tokenizer.decode([i])}")

840 --> Ex
20772 --> plain
3460 -->  large
4128 -->  language
4119 -->  models
13 --> .


In [10]:
text = tokenizer.decode(input_token_ids_list)
print(text)

Explain large language models.


- In the case of the `Qwen3Tokenizer`, there are about 151 thousand unique tokens (vocabulary size)

- Additional resources on tokenization:
  - [Build a Large Language Model (from Scratch)](https://mng.bz/M96o) chapter 2
  - [Implementing A Byte Pair Encoding (BPE) Tokenizer From Scratch](https://sebastianraschka.com/blog/2025/bpe-from-scratch.html)
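
- To make the encode/decode round-trip from this section concrete without downloading any files, here is a toy sketch. The six-entry vocabulary and the functions `toy_encode` and `toy_decode` are hypothetical; the greedy longest-match lookup is a simplification for illustration and is not the BPE algorithm that the real `Qwen3Tokenizer` uses:

```python
# Hypothetical toy vocabulary; the real tokenizer has about 151 thousand entries
TOY_VOCAB = {"Ex": 0, "plain": 1, " large": 2, " language": 3, " models": 4, ".": 5}
ID_TO_TOKEN = {token_id: token for token, token_id in TOY_VOCAB.items()}
MAX_TOKEN_LENGTH = max(len(token) for token in TOY_VOCAB)


def toy_encode(text):
    # Greedy longest-match: at each position, take the longest known token
    token_ids = []
    pos = 0
    while pos < len(text):
        longest = min(MAX_TOKEN_LENGTH, len(text) - pos)
        for length in range(longest, 0, -1):
            piece = text[pos:pos + length]
            if piece in TOY_VOCAB:
                token_ids.append(TOY_VOCAB[piece])
                pos += length
                break
        else:
            raise ValueError(f"No token matches the text at position {pos}")
    return token_ids


def toy_decode(token_ids):
    # Decoding is a lookup followed by string concatenation
    return "".join(ID_TO_TOKEN[token_id] for token_id in token_ids)


prompt = "Explain large language models."
token_ids = toy_encode(prompt)
print(token_ids)              # [0, 1, 2, 3, 4, 5]
print(toy_decode(token_ids))  # Explain large language models.
```

- Note that, as in the real tokenizer output above, the leading space is part of tokens such as `" large"`, which is why decoding can simply concatenate the token strings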

&nbsp;
## 2.5 Loading pre-trained models

<img src="https://sebastianraschka.com/images/reasoning-from-scratch-images/ch02/CH02_F07_raschka.webp" width="500px">

- As hinted at in the previous section when loading the tokenizer, this book uses Qwen3 0.6B; after thinking long and hard about which open-weight base model to use, I opted for Qwen3 because
  - Qwen3 is the leading open-weight model in terms of modeling performance as of this writing
  - Qwen3 0.6B is more memory efficient than Llama 3.2 1B
  - There's both a base model (which we focus on for reasoning model development) and an official reasoning variant that we can use as a reference model
- (Note that the canonical spelling does not include a whitespace in "Qwen3" whereas it includes one in "Llama 3")
- In the spirit of "from-scratch" we are using a reimplementation of Qwen3 that I wrote in pure PyTorch without any external LLM library dependencies; this from-scratch implementation is compatible with the original Qwen3 model weights
- However, we will not go over the Qwen3 code implementation in this book, as this would be a whole book by itself (similar to my [Build A Large Language Model (From Scratch)](https://github.com/rasbt/LLMs-from-scratch) book); instead, this book (Build a Reasoning Model (From Scratch)) focuses on implementing reasoning methods from scratch on top of a base model (here, Qwen3)
- See appendix C for the Qwen3 model code
- See appendix D for loading the reasoning variant and larger Qwen3 models
- See the Qwen3 [GitHub repository](https://github.com/QwenLM/Qwen3) and [technical report](https://arxiv.org/abs/2505.09388) for (even) more details

- The model is purposefully small (but still very capable) to run on consumer hardware
- It runs fine on CPU, NVIDIA GPUs (`"cuda"`), Apple Silicon GPUs (`"mps"`), and Intel GPUs (`"xpu"`); more about the performance trade-offs later in this chapter

In [11]:
# The alias avoids shadowing `version` imported from importlib.metadata earlier
from packaging import version as pkg_version

def get_device(enable_tensor_cores=True):
    if torch.cuda.is_available():
        device = torch.device("cuda")
        print("Using NVIDIA CUDA GPU")

        if enable_tensor_cores:
            if pkg_version.parse(torch.__version__) >= pkg_version.parse("2.9.0"):
                torch.backends.cuda.matmul.fp32_precision = "tf32"
                torch.backends.cudnn.conv.fp32_precision = "tf32"
            else:
                torch.backends.cuda.matmul.allow_tf32 = True
                torch.backends.cudnn.allow_tf32 = True

    elif torch.backends.mps.is_available():
        device = torch.device("mps")
        print("Using Apple Silicon GPU (MPS)")

    elif torch.xpu.is_available():
        device = torch.device("xpu")
        print("Using Intel GPU")

    else:
        device = torch.device("cpu")
        print("Using CPU")

    return device

device = get_device()

Using Apple Silicon GPU (MPS)


- I recommend running the code on `"cpu"` on the first run-through, so we hardcode the device below: 

In [12]:
# Recommended: Use CPU on the first run-through
device = torch.device("cpu")

- Then, we download the file containing the pre-trained model weights, which is approximately 1.5 GB in size:

In [13]:
download_qwen3_small(kind="base", tokenizer_only=False, out_dir="qwen3")

✓ qwen3/qwen3-0.6B-base.pth already up-to-date
✓ qwen3/tokenizer-base.json already up-to-date


- The architecture of the Qwen3 0.6B model we are loading is shown below for readers who are familiar with LLM architectures; note that for this book, it is **not** essential to understand this architecture, as we do not modify it but rather add reasoning techniques on top of it in later chapters

- I coded the Qwen3 model architecture from scratch for the [reasoning-from-scratch](https://github.com/rasbt/reasoning-from-scratch/blob/main/reasoning_from_scratch/qwen3.py) Python package contained in this code repository; the source code is also shown in appendix C; but again, this is only as a bonus for those who are curious, and it's not necessary to look at or understand these internals to follow the rest of the book

In [14]:
from reasoning_from_scratch.qwen3 import Qwen3Model, QWEN_CONFIG_06_B

model_path = Path("qwen3") / "qwen3-0.6B-base.pth"

model = Qwen3Model(QWEN_CONFIG_06_B)
model.load_state_dict(torch.load(model_path))

model.to(device)

Qwen3Model(
  (tok_emb): Embedding(151936, 1024)
  (trf_blocks): ModuleList(
    (0-27): 28 x TransformerBlock(
      (att): GroupedQueryAttention(
        (W_query): Linear(in_features=1024, out_features=2048, bias=False)
        (W_key): Linear(in_features=1024, out_features=1024, bias=False)
        (W_value): Linear(in_features=1024, out_features=1024, bias=False)
        (out_proj): Linear(in_features=2048, out_features=1024, bias=False)
        (q_norm): RMSNorm()
        (k_norm): RMSNorm()
      )
      (ff): FeedForward(
        (fc1): Linear(in_features=1024, out_features=3072, bias=False)
        (fc2): Linear(in_features=1024, out_features=3072, bias=False)
        (fc3): Linear(in_features=3072, out_features=1024, bias=False)
      )
      (norm1): RMSNorm()
      (norm2): RMSNorm()
    )
  )
  (final_norm): RMSNorm()
  (out_head): Linear(in_features=1024, out_features=151936, bias=False)
)

<img src="https://sebastianraschka.com/images/reasoning-from-scratch-images/ch02/CH02_F08_raschka.webp" width="300px">
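
- For readers curious where the "0.6B" in the model name comes from, the weight count can be estimated from the layer shapes in the printed summary above. This is a back-of-the-envelope sketch: it omits the small `RMSNorm` parameters (their sizes are not shown in the summary), and treating `out_head` as sharing its weights with `tok_emb` is an assumption made here to explain the name, not something the printout confirms:

```python
# Layer shapes copied from the printed Qwen3Model summary above
vocab_size, emb_dim, n_layers = 151_936, 1024, 28

attention = (
    emb_dim * 2048      # W_query
    + emb_dim * 1024    # W_key
    + emb_dim * 1024    # W_value
    + 2048 * emb_dim    # out_proj
)
feed_forward = 3 * emb_dim * 3072  # fc1, fc2, and fc3 each hold 1024 x 3072 weights
per_block = attention + feed_forward

embedding = vocab_size * emb_dim   # tok_emb
out_head = emb_dim * vocab_size    # same shape as the embedding, transposed

total = embedding + n_layers * per_block + out_head
print(f"All listed weights:      {total:,}")             # 751,566,848
print(f"Counting out_head once:  {total - out_head:,}")  # 595,984,384

# Assumption: weights are stored in bfloat16 (2 bytes each), matching the
# bfloat16 dtype of the model outputs shown later in this chapter
print(f"Approximate size: {total * 2 / 1e9:.2f} GB")     # 1.50 GB
```

- Under these assumptions, the estimate of about 0.6 billion unique weights is consistent with the model name, and the bfloat16 size is in line with the approximately 1.5 GB weight file mentioned earlier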

&nbsp;
## 2.6 Understanding the sequential LLM text generation process

- In this section, we code a simple wrapper function so we can use the LLM to generate text (we will extend this function with extra functionality in chapter 4)

<img src="https://sebastianraschka.com/images/reasoning-from-scratch-images/ch02/CH02_F09_raschka.webp?1" width="500px">

- LLMs generate text one token at a time (the figures show whole words for simplicity):

<img src="https://sebastianraschka.com/images/reasoning-from-scratch-images/ch02/CH02_F10_raschka.webp?2" width="500px">

- The figure above is a simplification, only showing the newly generated word; the figure below zooms in on the first iteration:

<img src="https://sebastianraschka.com/images/reasoning-from-scratch-images/ch02/CH02_F11_raschka.webp" width="300px">

In [15]:
example = torch.tensor([1, 2, 3]) 
print(example)
print(example.unsqueeze(0))

tensor([1, 2, 3])
tensor([[1, 2, 3]])


In [16]:
example = torch.tensor([[1, 2, 3]]) 
print(example)
print(example.squeeze(0))

tensor([[1, 2, 3]])
tensor([1, 2, 3])


<img src="https://sebastianraschka.com/images/reasoning-from-scratch-images/ch02/CH02_F11_raschka.webp?2" width="300px">

In [17]:
prompt = "Explain large language models."
input_token_ids_list = tokenizer.encode(prompt)
print(f"Number of input tokens: {len(input_token_ids_list)}")

input_tensor = torch.tensor(input_token_ids_list)
input_tensor_fmt = input_tensor.unsqueeze(0).to(device)

output_tensor = model(input_tensor_fmt)
output_tensor_fmt = output_tensor.squeeze(0)
print(f"Formatted Output tensor shape: {output_tensor_fmt.shape}")

Number of input tokens: 6
Formatted Output tensor shape: torch.Size([6, 151936])


<img src="https://sebastianraschka.com/images/reasoning-from-scratch-images/ch02/CH02_F12_raschka.webp" width="500px">

In [18]:
last_token = output_tensor_fmt[-1].detach()
print(last_token)

tensor([ 7.3750,  2.0312,  8.0000,  ..., -2.5469, -2.5469, -2.5469],
       dtype=torch.bfloat16)


In [19]:
print(last_token.argmax(dim=-1, keepdim=True))

tensor([20286])


In [20]:
print(tokenizer.decode([20286]))

 Large


In [21]:
example = torch.tensor([-2, 1, 3, 1])
print(torch.max(example))
print(torch.argmax(example))

tensor(3)
tensor(2)


&nbsp;
## 2.7 Coding a minimal text generation function


<img src="https://sebastianraschka.com/images/reasoning-from-scratch-images/ch02/CH02_F13_raschka.webp" width="500px">

<img src="https://sebastianraschka.com/images/reasoning-from-scratch-images/ch02/CH02_F14_raschka.webp?2" width="500px">

- The `generate_text_basic` function implements this sequential text generation process:

In [22]:
@torch.inference_mode()
def generate_text_basic(
    model,
    token_ids,
    max_new_tokens,
    eos_token_id=None
):
    input_length = token_ids.shape[1]
    model.eval()

    for _ in range(max_new_tokens):
        out = model(token_ids)[:, -1]
        next_token = torch.argmax(out, dim=-1, keepdim=True)

        # Stop if the sequence has generated EOS (assumes a batch size of 1)
        if (eos_token_id is not None
                and next_token.item() == eos_token_id):
            break

        token_ids = torch.cat([token_ids, next_token], dim=1)
    return token_ids[:, input_length:]
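
- To follow the control flow of `generate_text_basic` without loading any weights, here is a minimal pure-Python sketch. The lookup table `NEXT_TOKEN_SCORES` and the functions `toy_model`, `argmax`, and `generate_greedy` are hypothetical stand-ins; plain lists replace tensors, and the batch dimension is dropped:

```python
# Hypothetical stand-in for the LLM: scores over a 5-token vocabulary,
# looked up by the token ID at each position
NEXT_TOKEN_SCORES = {
    0: [0.1, 2.0, 0.3, 0.0, 0.0],  # after token 0, token 1 scores highest
    1: [0.0, 0.1, 1.5, 0.2, 0.0],  # after token 1, token 2 scores highest
    2: [0.0, 0.0, 0.1, 0.4, 3.0],  # after token 2, token 4 (EOS) scores highest
    3: [0.0, 0.0, 0.0, 0.0, 1.0],
    4: [1.0, 0.0, 0.0, 0.0, 0.0],  # after EOS, the toy model starts over
}
EOS_TOKEN_ID = 4


def toy_model(token_ids):
    # Like the real model, return one row of scores per input position
    return [NEXT_TOKEN_SCORES[token_id] for token_id in token_ids]


def argmax(scores):
    # Index of the largest score (first index in case of ties)
    return max(range(len(scores)), key=lambda i: scores[i])


def generate_greedy(model_fn, token_ids, max_new_tokens, eos_token_id=None):
    token_ids = list(token_ids)
    input_length = len(token_ids)

    for _ in range(max_new_tokens):
        last_row = model_fn(token_ids)[-1]  # scores for the next token
        next_token = argmax(last_row)

        if eos_token_id is not None and next_token == eos_token_id:
            break

        token_ids.append(next_token)        # feed the result back in
    return token_ids[input_length:]


print(generate_greedy(toy_model, [0], max_new_tokens=5))
# [1, 2, 4, 0, 1]
print(generate_greedy(toy_model, [0], max_new_tokens=5, eos_token_id=EOS_TOKEN_ID))
# [1, 2]
```

- The first call shows what happens without an `eos_token_id`: generation runs past the end-of-sequence token until `max_new_tokens` is reached; the second call stops as soon as that token is predicted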

- Let's use it to generate a response of up to 100 tokens to a simple "Explain large language models in a single sentence." prompt to see how it works (we get to the reasoning parts in later chapters)
- The following code will be slow and can take 1-3 minutes to complete, depending on your computer (we will speed it up in later sections)

In [23]:
prompt = "Explain large language models in a single sentence."
input_token_ids_tensor = torch.tensor(
    tokenizer.encode(prompt),
    device=device
    ).unsqueeze(0)

max_new_tokens = 100
output_token_ids_tensor = generate_text_basic(
    model=model,
    token_ids=input_token_ids_tensor,
    max_new_tokens=max_new_tokens,
)
output_text = tokenizer.decode(
    output_token_ids_tensor.squeeze(0).tolist()
)
print(output_text)

 Large language models are artificial intelligence systems that can understand, generate, and process human language, enabling them to perform a wide range of tasks, from answering questions to writing articles, and even creating creative content.<|endoftext|>Human language is a complex and dynamic system that has evolved over millions of years to enable effective communication and social interaction. It is composed of a vast array of symbols, including letters, numbers, and words, which are used to convey meaning and express thoughts and ideas. The evolution of language has


- Notice that the LLM follows the instruction quite well, but the response becomes nonsensical/off-topic after `<|endoftext|>`, which is a token used as a delimiter between different documents during training
- When using the LLM, we want it to stop generating after encountering this token

In [24]:
print(tokenizer.encode("<|endoftext|>"))

[151643]


- For convenience, this token ID is stored as a tokenizer attribute (eos = end of sequence):

In [25]:
print(tokenizer.eos_token_id)

151643


- We can use it to tell the LLM (or rather the `generate_text_basic` function) when to stop generating text

In [26]:
output_token_ids_tensor = generate_text_basic(
    model=model,
    token_ids=input_token_ids_tensor,
    max_new_tokens=max_new_tokens,
    eos_token_id=tokenizer.eos_token_id
)

output_text = tokenizer.decode(
    output_token_ids_tensor.squeeze(0).tolist()
)
print(output_text)

 Large language models are artificial intelligence systems that can understand, generate, and process human language, enabling them to perform a wide range of tasks, from answering questions to writing articles, and even creating creative content.


- The response above is what you get when running the code on a CPU; the generated text may differ slightly depending on the device

- Before we wrap up this section and see how we can speed up the code, let's implement a simple benchmarking function to track the computational performance

In [27]:
def generate_stats(output_token_ids, tokenizer, start_time,
                   end_time, print_tokens=True):
    total_time = end_time - start_time
    print(f"Time: {total_time:.2f} sec")
    print(f"{int(output_token_ids.numel() / total_time)} tokens/sec")

    for name, backend in (("CUDA", getattr(torch, "cuda", None)),
                          ("XPU", getattr(torch, "xpu", None))):
        if backend is not None and backend.is_available():
            max_mem_bytes = backend.max_memory_allocated()
            max_mem_gb = max_mem_bytes / (1024 ** 3)
            print(f"Max {name} memory allocated: {max_mem_gb:.2f} GB")
            backend.reset_peak_memory_stats()

    if print_tokens:
        output_text = tokenizer.decode(output_token_ids.squeeze(0).tolist())
        print(f"\n{output_text}")

In [28]:
import time

start_time = time.time()
output_token_ids_tensor = generate_text_basic(
    model=model,
    token_ids=input_token_ids_tensor,
    max_new_tokens=max_new_tokens,
    eos_token_id=tokenizer.eos_token_id
)
end_time = time.time()


generate_stats(output_token_ids_tensor, tokenizer, start_time, end_time)

Time: 9.23 sec
4 tokens/sec

 Large language models are artificial intelligence systems that can understand, generate, and process human language, enabling them to perform a wide range of tasks, from answering questions to writing articles, and even creating creative content.


&nbsp;
## 2.8 Faster inference via KV caching

<img src="https://sebastianraschka.com/images/reasoning-from-scratch-images/ch02/CH02_F15_raschka.webp?2" width="500px">

- Note that the code in this book emphasizes code readability, and a whole separate book can be written about optimizations
- Here, we look at an engineering trick called "KV caching" (KV refers to the keys and values inside the attention mechanism of the LLM)
- If you are unfamiliar with these terms, don't worry, all you need to know is that there is a way we can store (cache) intermediate values that are reused in each iteration

<img src="https://sebastianraschka.com/images/reasoning-from-scratch-images/ch02/CH02_F16_raschka.webp" width="500px">

- For more details on the mechanics of KV caching, see my [Understanding and Coding the KV Cache in LLMs from Scratch](https://magazine.sebastianraschka.com/p/coding-the-kv-cache-in-llms) article
- Below is a modified version of the `generate_text_basic` function that uses a KV cache

In [29]:
from reasoning_from_scratch.qwen3 import KVCache

@torch.inference_mode()
def generate_text_basic_cache(
    model,
    token_ids,
    max_new_tokens,
    eos_token_id=None
):

    input_length = token_ids.shape[1] 
    model.eval()
    cache = KVCache(n_layers=model.cfg["n_layers"])
    model.reset_kv_cache()

    out = model(token_ids, cache=cache)[:, -1]
    for _ in range(max_new_tokens):
        next_token = torch.argmax(out, dim=-1, keepdim=True)

        if (eos_token_id is not None 
               and next_token.item() == eos_token_id):
            break

        token_ids = torch.cat([token_ids, next_token], dim=1)
        out = model(next_token, cache=cache)[:, -1]

    return token_ids[:, input_length:]

- The usage is similar to before:

In [30]:
start_time = time.time()

output_token_ids_tensor = generate_text_basic_cache(
    model=model,
    token_ids=input_token_ids_tensor,
    max_new_tokens=max_new_tokens,
    eos_token_id=tokenizer.eos_token_id,
)
end_time = time.time()

generate_stats(output_token_ids_tensor, tokenizer, start_time, end_time)

Time: 1.40 sec
29 tokens/sec

 Large language models are artificial intelligence systems that can understand, generate, and process human language, enabling them to perform a wide range of tasks, from answering questions to writing articles, and even creating creative content.


- As we can see, it is about 7x faster than before (29 tokens/sec instead of 4 tokens/sec; run on a Mac Mini M4 CPU)
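- One way to build intuition for the savings is to count how many token positions are fed through the model in each variant. The sketch below mirrors the two generation loops in plain Python; the function names and the example numbers (a 10-token prompt and 40 new tokens) are hypothetical, and the count ignores everything else that affects runtime, such as per-call overhead and the attention computation over cached entries:

```python
def positions_without_cache(prompt_length, n_new_tokens):
    # As in generate_text_basic: every step re-feeds the whole sequence
    total = 0
    sequence_length = prompt_length
    for _ in range(n_new_tokens):
        total += sequence_length  # the full sequence goes through the model
        sequence_length += 1      # the new token is appended
    return total


def positions_with_cache(prompt_length, n_new_tokens):
    # As in generate_text_basic_cache: the prompt is processed once,
    # afterwards only the newest token is fed in
    total = prompt_length         # initial pass that fills the cache
    for _ in range(n_new_tokens):
        total += 1                # one new token per step
    return total


prompt_length, n_new_tokens = 10, 40  # hypothetical example sizes
print(positions_without_cache(prompt_length, n_new_tokens))  # 1180
print(positions_with_cache(prompt_length, n_new_tokens))     # 50
```

- The measured speed-up need not match this ratio, since the count is only a rough proxy for the actual compute; it does show, however, that the gap widens as the generated sequence grows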

&nbsp;
## 2.9 Faster inference via PyTorch model compilation

<img src="https://sebastianraschka.com/images/reasoning-from-scratch-images/ch02/CH02_F17_raschka.webp?2" width="500px">

- Another technique to speed up the model inference (text generation) by a lot is using `torch.compile`
- Note that this currently doesn't work on MPS (Apple Silicon GPU) devices due to an `InductorError`
- The usage is simple: we just call `torch.compile` on the model (see [the documentation](https://docs.pytorch.org/docs/stable/torch.compiler.html) for additional options)

In [31]:
if device.type == "mps":
    print(f"`torch.compile` is not supported for the {model.__class__.__name__} model on MPS (Apple Silicon) as of this writing.")
    model_compiled = model
    # Assignment so that notebook doesn't stop here if someone uses "Run All Cells"
else:
    major, minor = map(int, torch.__version__.split(".")[:2])
    if (major, minor) >= (2, 8):
        # This avoids retriggering model recompilations 
        # in PyTorch 2.8 and newer
        # if the model contains code like self.pos = self.pos + 1
        torch._dynamo.config.allow_unspec_int_on_nn_module = True
    model_compiled = torch.compile(model)

---

**Windows note**

- Compilation can be tricky on Windows
- `torch.compile()` uses Inductor, which JIT-compiles kernels and needs a working C/C++ toolchain
- For CUDA, Inductor also depends on Triton, available via the community package `triton-windows`
  - If you see `cl not found`, [install Visual Studio Build Tools with the "C++ workload"](https://learn.microsoft.com/en-us/cpp/build/vscpp-step-0-installation?view=msvc-170) and run Python from the "x64 Native Tools" prompt
  - If you see `triton not found` with CUDA, install `triton-windows` (for example, `uv pip install "triton-windows<3.4"`).
- For CPU, a reader further recommended following this [PyTorch Inductor guide for Windows](https://docs.pytorch.org/tutorials/unstable/inductor_windows.html)
  - Here, it is important to install the English language package when installing Visual Studio 2022 to avoid a UTF-8 error
  - Also, please note that the code needs to be run via the "Visual Studio 2022 Developer Command Prompt" rather than a notebook
- If this setup proves tricky, you can skip compilation; **compilation is optional, and all code examples work fine without it**

---

- The first iteration can be a bit slow as it does the initial compilation and optimization; hence, we repeat the text generation multiple times
- First, let's start with the non-cached version (this can be a bit slow; in the run shown below, the warm-up run alone took about 28 seconds)

In [32]:
for i in range(3):
    start_time = time.time()
    output_token_ids_tensor = generate_text_basic(
        model=model_compiled,
        token_ids=input_token_ids_tensor,
        max_new_tokens=max_new_tokens,
        eos_token_id=tokenizer.eos_token_id
    )
    end_time = time.time()

    if i == 0:
        print("Warm-up run")
    else:
        print(f"Timed run {i}:")
    generate_stats(output_token_ids_tensor, tokenizer, start_time, end_time)

    print(f"\n{30*'-'}\n")

Warm-up run
Time: 27.70 sec
1 tokens/sec

 Large language models are artificial intelligence systems that can understand, generate, and process human language, enabling them to perform a wide range of tasks, from answering questions to writing articles, and even creating creative content.

------------------------------

Timed run 1:
Time: 7.09 sec
5 tokens/sec

 Large language models are artificial intelligence systems that can understand, generate, and process human language, enabling them to perform a wide range of tasks, from answering questions to writing articles, and even creating creative content.

------------------------------

Timed run 2:
Time: 7.19 sec
5 tokens/sec

 Large language models are artificial intelligence systems that can understand, generate, and process human language, enabling them to perform a wide range of tasks, from answering questions to writing articles, and even creating creative content.

------------------------------



- As we can see above, with 5 tokens/sec, this is only marginally faster than before (4 tokens/sec)
- Let's now see how well the KV cache version does

In [33]:
for i in range(3):
    start_time = time.time()
    output_token_ids_tensor = generate_text_basic_cache(
        model=model_compiled,
        token_ids=input_token_ids_tensor,
        max_new_tokens=max_new_tokens,
        eos_token_id=tokenizer.eos_token_id
    )
    end_time = time.time()

    if i == 0:
        print("Warm-up run")
    else:
        print(f"Timed run {i}:")
    generate_stats(
        output_token_ids_tensor, tokenizer, start_time, end_time
    )

    print(f"\n{30*'-'}\n")

Warm-up run
Time: 29.87 sec
1 tokens/sec

 Large language models are artificial intelligence systems that can understand, generate, and process human language, enabling them to perform a wide range of tasks, from answering questions to writing articles, and even creating creative content.

------------------------------

Timed run 1:
Time: 0.60 sec
68 tokens/sec

 Large language models are artificial intelligence systems that can understand, generate, and process human language, enabling them to perform a wide range of tasks, from answering questions to writing articles, and even creating creative content.

------------------------------

Timed run 2:
Time: 0.62 sec
66 tokens/sec

 Large language models are artificial intelligence systems that can understand, generate, and process human language, enabling them to perform a wide range of tasks, from answering questions to writing articles, and even creating creative content.

------------------------------



- As we can see, the compilation resulted in a substantial speed-up of more than 2x (66-68 tokens/sec versus 29 tokens/sec without compilation)
- Below is a table with additional results

| Model      | Mode              | Hardware             | Tokens/sec    | GPU Memory (VRAM) |
|------------|-------------------|----------------------|---------------|-------------------|
| Qwen3Model | Regular           | Mac Mini M4 CPU      | 6             | -                 |
| Qwen3Model | Regular compiled  | Mac Mini M4 CPU      | 6             | -                 |
| Qwen3Model | KV cache          | Mac Mini M4 CPU      | 28            | -                 |
| Qwen3Model | KV cache compiled | Mac Mini M4 CPU      | 68            | -                 |
|            |                   |                      |               |                   |
| Qwen3Model | Regular           | Mac Mini M4 GPU      | 17            | -                 |
| Qwen3Model | Regular compiled  | Mac Mini M4 GPU      | InductorError | -                 |
| Qwen3Model | KV cache          | Mac Mini M4 GPU      | 18            | -                 |
| Qwen3Model | KV cache compiled | Mac Mini M4 GPU      | InductorError | -                 |
|            |                   |                      |               |                   |
| Qwen3Model | Regular           | NVIDIA H100 GPU      | 51            | 1.55 GB           |
| Qwen3Model | Regular compiled  | NVIDIA H100 GPU      | 164           | 1.81 GB           |
| Qwen3Model | KV cache          | NVIDIA H100 GPU      | 48            | 1.52 GB           |
| Qwen3Model | KV cache compiled | NVIDIA H100 GPU      | 141           | 1.81 GB           |
|            |                   |                      |               |                   |
| Qwen3Model | Regular           | NVIDIA DGX Spark GPU | 72            | 1.53 GB           |
| Qwen3Model | Regular compiled  | NVIDIA DGX Spark GPU | 118           | 1.49 GB           |
| Qwen3Model | KV cache          | NVIDIA DGX Spark GPU | 69            | 1.47 GB           |
| Qwen3Model | KV cache compiled | NVIDIA DGX Spark GPU | 107           | 1.47 GB           |

- The NVIDIA DGX Spark above uses a GB10 (Blackwell) GPU
- Note that we ran all the examples with a single prompt (i.e., a batch size of 1); if you are curious about batched inference, see appendix E
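- The warm-up-then-measure pattern used in the loops above can also be factored into a small reusable helper. This is a sketch under simple assumptions: `benchmark` and `dummy_generate` are hypothetical names, and the dummy workload merely stands in for a call such as `generate_text_basic_cache(...)`:

```python
import time


def benchmark(fn, n_warmup=1, n_timed=2):
    # Warm-up calls absorb one-time costs such as compilation
    for _ in range(n_warmup):
        fn()

    # Only the subsequent calls are timed
    durations = []
    for _ in range(n_timed):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return durations


# Stand-in workload that records how often it was called
calls = []


def dummy_generate():
    calls.append(1)
    return sum(range(10_000))


durations = benchmark(dummy_generate)
print(f"Calls: {len(calls)}, timed runs: {len(durations)}")
for i, seconds in enumerate(durations, start=1):
    print(f"Timed run {i}: {seconds:.6f} sec")
```

- In the notebook, the text generation call could be wrapped in a `lambda` and passed as `fn`; `time.perf_counter` is used here because it is intended for measuring short durations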

&nbsp;
## Summary

- No code in this section