<table style="width:100%">
<tr>
<td style="vertical-align:middle; text-align:left;">
<font size="2">
Supplementary code for the <a href="https://mng.bz/lZ5B">Build a Reasoning Model (From Scratch)</a> book by <a href="https://sebastianraschka.com">Sebastian Raschka</a><br>
<br>Code repository: <a href="https://github.com/rasbt/reasoning-from-scratch">https://github.com/rasbt/reasoning-from-scratch</a>
</font>
</td>
<td style="vertical-align:middle; text-align:left;">
<a href="https://mng.bz/lZ5B"><img src="https://sebastianraschka.com/images/reasoning-from-scratch-images/cover-small.webp" width="100px"></a>
</td>
</tr>
</table>


# Chapter 2: Exercise Solutions

Packages used in this notebook:

In [1]:
from importlib.metadata import version

used_libraries = [
    "reasoning_from_scratch",
    "torch",
    "tokenizers"  # Used by reasoning_from_scratch
]

for lib in used_libraries:
    print(f"{lib} version: {version(lib)}")

reasoning_from_scratch version: 0.1.2
torch version: 2.7.1
tokenizers version: 0.21.4


&nbsp;
## Exercise 2.1: Encoding unknown words

In [2]:
from pathlib import Path

from reasoning_from_scratch.qwen3 import (
    download_qwen3_small,
    Qwen3Tokenizer,
)

download_qwen3_small(kind="base", tokenizer_only=True, out_dir="qwen3")

tokenizer_path = Path("qwen3") / "tokenizer-base.json"
tokenizer = Qwen3Tokenizer(tokenizer_file_path=tokenizer_path)

prompt = "Hello, Ardwarklethyrx. Haus und Garten."
input_token_ids_list = tokenizer.encode(prompt)

for i in input_token_ids_list:
    print(f"{[i]} --> {tokenizer.decode([i])}")

✓ qwen3/tokenizer-base.json already up-to-date
[9707] --> Hello
[11] --> ,
[1644] -->  Ar
[29406] --> dw
[838] --> ark
[273] --> le
[339] --> th
[10920] --> yr
[87] --> x
[13] --> .
[47375] -->  Haus
[2030] -->  und
[93912] -->  Garten
[13] --> .


- Unknown words are split into smaller subword pieces or even single characters; this lets the tokenizer and LLM handle any input
- The German words (Haus, und, Garten) each map to a single token, suggesting that the tokenizer's training data included German text and that the LLM was likely trained on German text as well
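
To make the first point concrete, here is a toy sketch of subword splitting. It uses greedy longest-match over a small, hypothetical vocabulary (`TOY_VOCAB`) rather than the learned BPE merges of the Qwen3 tokenizer, so it only illustrates the idea that any string can be covered by falling back to smaller pieces.

```python
# Hypothetical toy vocabulary; the real Qwen3 vocabulary is learned with BPE
TOY_VOCAB = {"Hello", "Haus", "Ar", "dw", "ark", "le", "th", "yr", "x"}


def split_into_subwords(word, vocab):
    """Greedy longest-match split that falls back to single characters."""
    pieces = []
    start = 0
    while start < len(word):
        # Try the longest candidate first and shrink until something matches
        for end in range(len(word), start, -1):
            piece = word[start:end]
            # A single character is always accepted as a last resort
            if piece in vocab or end == start + 1:
                pieces.append(piece)
                start = end
                break
    return pieces


print(split_into_subwords("Haus", TOY_VOCAB))
print(split_into_subwords("Ardwarklethyrx", TOY_VOCAB))
print(split_into_subwords("Qi", TOY_VOCAB))
```

A word that is in the vocabulary stays whole, while an unseen word is covered by shorter known pieces, and characters with no match at all are emitted one by one.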

&nbsp;
## Exercise 2.2: Run code on GPU devices

- Delete the line `device = torch.device("cpu")` in section 2.5 so that the device returned by `get_device()` is used, then rerun the code
- For convenience, a minimal, self-contained example using the relevant code from chapter 2 is included below
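
The implementation of `get_device` is not shown in this notebook, so the following is only a sketch of the common selection pattern, not the book's code. The helper name `pick_device_name` is hypothetical; the import is guarded so that the snippet also runs on a machine without PyTorch.

```python
def pick_device_name(cuda_available, mps_available):
    """Prefer CUDA, then Apple Silicon (MPS), then fall back to the CPU."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


try:
    import torch

    device_name = pick_device_name(
        torch.cuda.is_available(),
        torch.backends.mps.is_available(),
    )
except ImportError:
    # PyTorch is not installed, so only the CPU fallback applies
    device_name = pick_device_name(False, False)

print(f"Selected device: {device_name}")
```

Overriding the result with `torch.device("cpu")` afterwards, as section 2.5 does, discards this selection, which is why removing that line is enough to move the computation to the GPU.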

In [5]:
from pathlib import Path
import torch

from reasoning_from_scratch.ch02 import (
    get_device,
    generate_text_basic_stream,
    generate_text_basic_stream_cache,
    generate_stats
)
from reasoning_from_scratch.qwen3 import (
    download_qwen3_small,
    Qwen3Tokenizer,
    Qwen3Model,
    QWEN_CONFIG_06_B
)

device = get_device()
# Forces CPU execution; delete this line to run on the device selected above
device = torch.device("cpu")

download_qwen3_small(kind="base", tokenizer_only=False, out_dir="qwen3")

tokenizer_path = Path("qwen3") / "tokenizer-base.json"
model_path = Path("qwen3") / "qwen3-0.6B-base.pth"

tokenizer = Qwen3Tokenizer(tokenizer_file_path=tokenizer_path)
model = Qwen3Model(QWEN_CONFIG_06_B)
model.load_state_dict(torch.load(model_path))

model.to(device);

Using Apple Silicon GPU (MPS)
✓ qwen3/qwen3-0.6B-base.pth already up-to-date
✓ qwen3/tokenizer-base.json already up-to-date


In [6]:
prompt = "Explain large language models in 1 sentence."
input_token_ids_tensor = torch.tensor(
    tokenizer.encode(prompt),
    device=device
    ).unsqueeze(0)

In [7]:
import time


max_new_tokens = 100
start_time = time.time()
generated_ids = []

for token in generate_text_basic_stream(
    model=model,
    token_ids=input_token_ids_tensor,
    max_new_tokens=max_new_tokens,
    eos_token_id=tokenizer.eos_token_id
):
    token_id = token.squeeze(0).tolist()
    print(
        tokenizer.decode(token_id),
        end="",
        flush=True
    )

    next_token_id = token.squeeze(0)
    generated_ids.append(next_token_id)  # Collect generated tokens

end_time = time.time()

output_token_ids_tensor = torch.cat(generated_ids, dim=0)
generate_stats(output_token_ids_tensor, tokenizer, start_time, end_time)

Output length: 41
Time: 9.63 sec
4 tokens/sec

 Large language models are artificial intelligence systems that can understand, generate, and process human language, enabling them to perform a wide range of tasks, from answering questions to writing articles, and even creating creative content.


In [8]:
start_time = time.time()
generated_ids = []  # Reset so tokens from the previous run are not counted

for token in generate_text_basic_stream_cache(
    model=model,
    token_ids=input_token_ids_tensor,
    max_new_tokens=max_new_tokens,
    eos_token_id=tokenizer.eos_token_id
):
    token_id = token.squeeze(0).tolist()
    print(
        tokenizer.decode(token_id),
        end="",
        flush=True
    )

    next_token_id = token.squeeze(0)
    generated_ids.append(next_token_id)  # Collect generated tokens

end_time = time.time()

output_token_ids_tensor = torch.cat(generated_ids, dim=0)
generate_stats(output_token_ids_tensor, tokenizer, start_time, end_time)

Time: 1.51 sec
27 tokens/sec

 Large language models are artificial intelligence systems that can understand, generate, and process human language, enabling them to perform a wide range of tasks, from answering questions to writing essays, and even creating creative content.


In [9]:
if device.type == "mps":
    print(f"`torch.compile` is not supported for the {model.__class__.__name__} model on MPS (Apple Silicon) as of this writing.")
    model_compiled = model
    # Assignment so that notebook doesn't stop here if someone uses "Run All Cells"
else:
    major, minor = map(int, torch.__version__.split(".")[:2])
    if (major, minor) >= (2, 8):
        # In PyTorch 2.8 and newer, this avoids repeated recompilation
        # when the model contains code like self.pos = self.pos + 1
        torch._dynamo.config.allow_unspec_int_on_nn_module = True
        
    model_compiled = torch.compile(model)

In [10]:
for i in range(3):

    start_time = time.time()
    generated_ids = []
    
    for token in generate_text_basic_stream(
        model=model_compiled,
        token_ids=input_token_ids_tensor,
        max_new_tokens=max_new_tokens,
        eos_token_id=tokenizer.eos_token_id
    ):
        token_id = token.squeeze(0).tolist()
        print(
            tokenizer.decode(token_id),
            end="",
            flush=True
        )
    
        next_token_id = token.squeeze(0)
        generated_ids.append(next_token_id)  # Collect generated tokens
    
    end_time = time.time()
    

    if i == 0:
        print("Warm-up run")
    else:
        print(f"Timed run {i}:")

    output_token_ids_tensor = torch.cat(generated_ids, dim=0)
    generate_stats(output_token_ids_tensor, tokenizer, start_time, end_time)

    print(f"\n{30*'-'}\n")

Warm-up run
Time: 11.78 sec
3 tokens/sec

 Large language models are artificial intelligence systems that use vast amounts of text data to understand, generate, and process human language, enabling them to perform tasks such as translation, summarization, and question answering.

------------------------------

Timed run 1:
Time: 6.68 sec
5 tokens/sec

 Large language models are artificial intelligence systems that use vast amounts of text data to understand, generate, and process human language, enabling them to perform tasks such as translation, summarization, and question answering.

------------------------------

Timed run 2:
Time: 6.60 sec
6 tokens/sec

 Large language models are artificial intelligence systems that use vast amounts of text data to understand, generate, and process human language, enabling them to perform tasks such as translation, summarization, and question answering.

------------------------------



In [11]:
for i in range(3):

    start_time = time.time()
    generated_ids = []
    
    for token in generate_text_basic_stream_cache(
        model=model_compiled,
        token_ids=input_token_ids_tensor,
        max_new_tokens=max_new_tokens,
        eos_token_id=tokenizer.eos_token_id
    ):
        token_id = token.squeeze(0).tolist()
        print(
            tokenizer.decode(token_id),
            end="",
            flush=True
        )
    
        next_token_id = token.squeeze(0)
        generated_ids.append(next_token_id)  # Collect generated tokens
    
    end_time = time.time()
    

    if i == 0:
        print("Warm-up run")
    else:
        print(f"Timed run {i}:")

    output_token_ids_tensor = torch.cat(generated_ids, dim=0)
    generate_stats(output_token_ids_tensor, tokenizer, start_time, end_time)

    print(f"\n{30*'-'}\n")

Warm-up run
Time: 8.05 sec
5 tokens/sec

 Large language models are artificial intelligence systems that can understand, generate, and process human language, enabling them to perform a wide range of tasks, from answering questions to writing articles, and even creating creative content.

------------------------------

Timed run 1:
Time: 0.64 sec
64 tokens/sec

 Large language models are artificial intelligence systems that can understand, generate, and process human language, enabling them to perform a wide range of tasks, from answering questions to writing articles, and even creating creative content.

------------------------------

Timed run 2:
Time: 0.63 sec
64 tokens/sec

 Large language models are artificial intelligence systems that can understand, generate, and process human language, enabling them to perform a wide range of tasks, from answering questions to writing articles, and even creating creative content.

------------------------------
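
The two cells above repeat the same timing loop, with the first iteration treated as a warm-up because it includes the compilation overhead. As a sketch, the pattern could be factored into a small helper. Both `benchmark` and `dummy_generate` are hypothetical names, and the dummy function merely stands in for a real generation loop that returns the number of tokens it produced.

```python
import time


def benchmark(generate_fn, num_runs=3):
    """Call generate_fn num_runs times and treat run 0 as a warm-up."""
    timed_rates = []
    for i in range(num_runs):
        start_time = time.perf_counter()
        num_tokens = generate_fn()  # Must return the number of generated tokens
        elapsed = time.perf_counter() - start_time
        if i > 0:  # Skip the warm-up run
            timed_rates.append(num_tokens / elapsed)
    return timed_rates


def dummy_generate():
    # Stand-in for a real generation loop
    time.sleep(0.01)
    return 41


rates = benchmark(dummy_generate)
print([f"{rate:.0f} tokens/sec" for rate in rates])
```

Reporting only the timed runs keeps one-time costs such as compilation out of the tokens/sec figure.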



| Tokens Generated  | Mode              | Hardware        | Tokens/sec    | GPU Memory (VRAM) |
|-------------------|-------------------|-----------------|---------------|-------------------|
| 41                | Regular           | Mac Mini M4 CPU | 6             | -                 |
| 41                | Regular compiled  | Mac Mini M4 CPU | 6             | -                 |
| 41                | KV cache          | Mac Mini M4 CPU | 28            | -                 |
| 41                | KV cache compiled | Mac Mini M4 CPU | 68            | -                 |
|                   |                   |                 |               |                   |
| 41                | Regular           | Mac Mini M4 GPU | 17            | -                 |
| 41                | Regular compiled  | Mac Mini M4 GPU | InductorError | -                 |
| 41                | KV cache          | Mac Mini M4 GPU | 18            | -                 |
| 41                | KV cache compiled | Mac Mini M4 GPU | InductorError | -                 |
|                   |                   |                 |               |                   |
| 41                | Regular           | NVIDIA H100 GPU | 51            | 1.55 GB           |
| 41                | Regular compiled  | NVIDIA H100 GPU | 164           | 1.81 GB           |
| 41                | KV cache          | NVIDIA H100 GPU | 48            | 1.52 GB           |
| 41                | KV cache compiled | NVIDIA H100 GPU | 141           | 1.81 GB           |
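
To compare the modes more directly, the short sketch below converts the tokens/sec values from the table into speedup factors relative to the "Regular" mode on the same hardware. The numbers are copied from the table; the compiled Mac Mini M4 GPU rows are left out because those runs failed with an `InductorError`.

```python
# Tokens/sec values copied from the table above
tokens_per_sec = {
    "Mac Mini M4 CPU": {
        "Regular": 6, "Regular compiled": 6,
        "KV cache": 28, "KV cache compiled": 68,
    },
    "Mac Mini M4 GPU": {"Regular": 17, "KV cache": 18},
    "NVIDIA H100 GPU": {
        "Regular": 51, "Regular compiled": 164,
        "KV cache": 48, "KV cache compiled": 141,
    },
}

speedups = {}
for hardware, modes in tokens_per_sec.items():
    baseline = modes["Regular"]  # Uncompiled run without KV cache
    speedups[hardware] = {
        mode: round(rate / baseline, 2) for mode, rate in modes.items()
    }
    for mode, factor in speedups[hardware].items():
        print(f"{hardware:16} | {mode:18} | {factor:5.2f}x")
```

For this short 41-token output, the KV cache yields roughly a 4.7x speedup on the M4 CPU, whereas on the H100 most of the gain comes from compilation and the KV cache alone is slightly slower than the regular mode (0.94x).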