<table style="width:100%">
<tr>
<td style="vertical-align:middle; text-align:left;">
<font size="2">
Supplementary code for the <a href="http://mng.bz/orYv">Build a Large Language Model From Scratch</a> book by <a href="https://sebastianraschka.com">Sebastian Raschka</a><br>
<br>Code repository: <a href="https://github.com/rasbt/LLMs-from-scratch">https://github.com/rasbt/LLMs-from-scratch</a>
</font>
</td>
<td style="vertical-align:middle; text-align:left;">
<a href="http://mng.bz/orYv"><img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/cover-small.webp" width="100px"></a>
</td>
</tr>
</table>

# Tiny Aya From Scratch (A Standalone Notebook)

- This notebook is purposefully minimal and focuses on the code to re-implement the Tiny Aya (3.35B) models from Cohere in pure PyTorch, without relying on other external LLM libraries; Tiny Aya is interesting because it is a small but strong model with good multilingual support
- For more information, see the official [Tiny Aya announcement](https://cohere.com/blog/cohere-labs-tiny-aya) and model cards:
  - [tiny-aya-base](https://huggingface.co/CohereLabs/tiny-aya-base) (base model)
  - [tiny-aya-global](https://huggingface.co/CohereLabs/tiny-aya-global) (best balance across languages and regions; notebook default)
  - [tiny-aya-fire](https://huggingface.co/CohereLabs/tiny-aya-fire) (optimized for South Asian languages)
  - [tiny-aya-water](https://huggingface.co/CohereLabs/tiny-aya-water) (optimized for European and Asia Pacific languages)
  - [tiny-aya-earth](https://huggingface.co/CohereLabs/tiny-aya-earth) (optimized for West Asian and African languages)


- Below is a table with more details regarding the language specialization of each model (taken from the Cohere announcement blog post linked above)

| Region        | Languages | Optimized Model |
|---------------|-----------|----------------|
| **Asia Pacific** | Traditional Chinese, Cantonese, Vietnamese, Tagalog, Javanese, Khmer, Thai, Burmese, Malay, Korean, Lao, Indonesian, Simplified Chinese, Japanese | tiny-aya-water |
| **Africa** | Zulu, Amharic, Hausa, Igbo, Swahili, Xhosa, Wolof, Shona, Yoruba, Nigerian Pidgin, Malagasy | tiny-aya-earth |
| **South Asia** | Telugu, Marathi, Bengali, Tamil, Hindi, Punjabi, Gujarati, Urdu, Nepali | tiny-aya-fire |
| **Europe** | Catalan, Galician, Dutch, Danish, Finnish, Czech, Portuguese, French, Lithuanian, Slovak, Basque, English, Swedish, Polish, Spanish, Slovenian, Ukrainian, Greek, Bokmål, Romanian, Serbian, German, Italian, Russian, Irish, Hungarian, Bulgarian, Croatian, Estonian, Latvian, Welsh | tiny-aya-water |
| **West Asia** | Arabic, Maltese, Turkish, Hebrew, Persian | tiny-aya-earth |


- Below is a side-by-side comparison with Qwen3 4B as a reference model; if you are interested in the Qwen3 standalone notebook, you can find it [here](../11_qwen3)
<br>

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/bonus/tiny-aya/01.webp" width="900px">

  
- About the code:
  - all code is my own code, mapping the Tiny Aya architecture onto the model code implemented in my [Build A Large Language Model (From Scratch)](http://mng.bz/orYv) book; the code is released under a permissive open-source Apache 2.0 license (see [LICENSE.txt](https://github.com/rasbt/LLMs-from-scratch/blob/main/LICENSE.txt))

In [1]:
# pip install -r https://raw.githubusercontent.com/rasbt/LLMs-from-scratch/refs/heads/main/ch05/07_gpt_to_llama/requirements-extra.txt

In [2]:
from importlib.metadata import version

pkgs = [
    #"blobfile",         # to download pretrained weights
    "huggingface_hub",  # to download pretrained weights
    "tiktoken",         # to implement the tokenizer
    "torch",            # to implement the model
]
for p in pkgs:
    print(f"{p} version: {version(p)}")

huggingface_hub version: 1.4.1
tiktoken version: 0.12.0
torch version: 2.10.0


In [3]:
from pathlib import Path

REPO_ID = "CohereLabs/tiny-aya-global"
#REPO_ID = "CohereLabs/tiny-aya-fire" 
#REPO_ID = "CohereLabs/tiny-aya-water"
#REPO_ID = "CohereLabs/tiny-aya-earth"

LOCAL_DIR = Path(REPO_ID).parts[-1]

&nbsp;
# 1. Architecture code

In [4]:
import torch
import torch.nn as nn


class FeedForward(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.fc1 = nn.Linear(cfg["emb_dim"], cfg["hidden_dim"], dtype=cfg["dtype"], bias=False)
        self.fc2 = nn.Linear(cfg["emb_dim"], cfg["hidden_dim"], dtype=cfg["dtype"], bias=False)
        self.fc3 = nn.Linear(cfg["hidden_dim"], cfg["emb_dim"], dtype=cfg["dtype"], bias=False)

    def forward(self, x):
        x_fc1 = self.fc1(x)
        x_fc2 = self.fc2(x)
        x = nn.functional.silu(x_fc1) * x_fc2
        return self.fc3(x)
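
- The feed-forward module above is a gated (SwiGLU-style) unit: the `fc1` output is passed through SiLU and multiplied elementwise with the `fc2` output before `fc3` projects back to `emb_dim`. The following minimal sketch illustrates the gating on plain Python floats; `silu` and `gated_unit` are illustrative helper names and not part of the notebook

```python
import math


def silu(z):
    # SiLU (also called swish): z * sigmoid(z)
    return z / (1.0 + math.exp(-z))


def gated_unit(gate, up):
    # Scalar version of `silu(x_fc1) * x_fc2` from FeedForward.forward
    return silu(gate) * up


closed_gate = gated_unit(-20.0, 3.0)  # strongly negative gate: output is almost zero
zero_gate = gated_unit(0.0, 3.0)      # silu(0) = 0, so the output is exactly zero
open_gate = gated_unit(20.0, 3.0)     # large positive gate: silu(z) is close to z

print(closed_gate, zero_gate, open_gate)
```

- In other words, the gate branch decides how much of the `fc2` branch passes through for each hidden unit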

In [5]:
# Tiny Aya uses a bias-less LayerNorm variant.
# The difference from classic LayerNorm is that it only
# has a scale parameter (weight) and no shift parameter (bias).

class CohereLayerNorm(nn.Module):
    def __init__(self, emb_dim, eps=1e-5):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(emb_dim))

    def forward(self, x):
        input_dtype = x.dtype
        x = x.to(torch.float32)
        mean = x.mean(dim=-1, keepdim=True)
        variance = (x - mean).pow(2).mean(dim=-1, keepdim=True)
        x = (x - mean) * torch.rsqrt(variance + self.eps)
        return (self.weight.to(torch.float32) * x).to(input_dtype)
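
- As a sanity check, the bias-less LayerNorm above should match PyTorch's built-in `torch.nn.functional.layer_norm` when it is called with a weight but without a bias. The following is a minimal sketch with random inputs; `biasless_layer_norm` is a standalone helper written for this check (an assumption for illustration, not part of the notebook)

```python
import torch
import torch.nn.functional as F


def biasless_layer_norm(x, weight, eps=1e-5):
    # Same math as CohereLayerNorm.forward: normalize over the last dim, then scale
    input_dtype = x.dtype
    x = x.to(torch.float32)
    mean = x.mean(dim=-1, keepdim=True)
    variance = (x - mean).pow(2).mean(dim=-1, keepdim=True)
    x = (x - mean) * torch.rsqrt(variance + eps)
    return (weight.to(torch.float32) * x).to(input_dtype)


torch.manual_seed(123)
x = torch.randn(2, 4, 8)
weight = torch.rand(8)

ours = biasless_layer_norm(x, weight)
reference = F.layer_norm(x, (8,), weight=weight, bias=None, eps=1e-5)

ln_matches = torch.allclose(ours, reference, atol=1e-5)
print(ln_matches)
```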

In [6]:
def compute_rope_params(head_dim, theta_base=10_000, context_length=4096, dtype=torch.float32):
    assert head_dim % 2 == 0, "head_dim must be even"

    # Compute the inverse frequencies
    inv_freq = 1.0 / (
        theta_base ** (torch.arange(0, head_dim, 2, dtype=dtype)[: (head_dim // 2)].float() / head_dim)
    )
    positions = torch.arange(context_length, dtype=dtype)

    # Compute the angles
    angles = positions.unsqueeze(1) * inv_freq.unsqueeze(0)  # Shape: (context_length, head_dim // 2)

    # Cohere uses interleaved even/odd angle layout per head-dim pair.
    # Llama2 notebook examples often use a split-halves layout via cat([angles, angles]).
    # Both are equivalent only when paired with the matching rotate logic:
    # - interleaved layout -> even/odd rotation implementation (below)
    # - split-halves layout -> half/half rotate implementation
    angles = torch.repeat_interleave(angles, 2, dim=1)  # Shape: (context_length, head_dim)

    # Precompute sine and cosine
    return torch.cos(angles), torch.sin(angles)

def apply_rope(x, cos, sin, offset=0):
    # x: (batch_size, num_heads, seq_len, head_dim)
    batch_size, num_heads, seq_len, head_dim = x.shape
    assert head_dim % 2 == 0, "head_dim must be even"

    # Split x into even and odd components (interleaved layout)
    x_even = x[..., ::2]
    x_odd = x[..., 1::2]

    # Adjust sin and cos shapes
    cos = cos[offset:offset + seq_len, :].unsqueeze(0).unsqueeze(0)
    sin = sin[offset:offset + seq_len, :].unsqueeze(0).unsqueeze(0)

    # Apply the rotary transformation
    x_float = x.float()
    rotated = torch.stack((-x_odd.float(), x_even.float()), dim=-1).flatten(-2)
    x_rotated = (x_float * cos) + (rotated * sin)

    return x_rotated.to(dtype=x.dtype)
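
- To make the interleaved rotation more concrete, the sketch below applies the same per-pair formula as `apply_rope` to a single (even, odd) pair using plain Python floats. It illustrates two properties of RoPE: the rotation preserves the length of each pair, and the dot product between a rotated query and a rotated key depends only on their relative distance. The helper names and the example values are assumptions for illustration

```python
import math


def rotate_pair(x_even, x_odd, position, inv_freq):
    # Per-pair formula used in apply_rope:
    #   new_even = x_even * cos - x_odd * sin
    #   new_odd  = x_odd * cos + x_even * sin
    angle = position * inv_freq
    cos, sin = math.cos(angle), math.sin(angle)
    return x_even * cos - x_odd * sin, x_odd * cos + x_even * sin


def dot(a, b):
    return a[0] * b[0] + a[1] * b[1]


head_dim, theta_base = 128, 50_000.0
inv_freq = 1.0 / theta_base ** (2 / head_dim)  # inverse frequency of the second pair

q = (0.3, -1.2)
k = (0.7, 0.5)

# 1) Rotation preserves the length of the pair
norm_before = math.hypot(*q)
norm_after = math.hypot(*rotate_pair(*q, position=10, inv_freq=inv_freq))

# 2) Scores depend only on the relative distance (3 in both cases)
score_near = dot(rotate_pair(*q, 10, inv_freq), rotate_pair(*k, 7, inv_freq))
score_far = dot(rotate_pair(*q, 103, inv_freq), rotate_pair(*k, 100, inv_freq))

print(math.isclose(norm_before, norm_after), math.isclose(score_near, score_far, abs_tol=1e-9))
```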

In [7]:
class GroupedQueryAttention(nn.Module):
    def __init__(
        self,
        d_in,
        num_heads,
        num_kv_groups,
        head_dim=None,
        qk_norm=False,
        attention_bias=False,
        dtype=None,
        attn_type="full_attention",
    ):
        super().__init__()
        assert num_heads % num_kv_groups == 0, "num_heads must be divisible by num_kv_groups"

        self.num_heads = num_heads
        self.num_kv_groups = num_kv_groups
        self.group_size = num_heads // num_kv_groups

        if head_dim is None:
            assert d_in % num_heads == 0, "`d_in` must be divisible by `num_heads` if `head_dim` is not set"
            head_dim = d_in // num_heads

        self.head_dim = head_dim
        self.d_out = num_heads * head_dim
        self.attn_type = attn_type

        self.W_query = nn.Linear(
            d_in,
            self.d_out,
            bias=attention_bias,
            dtype=dtype,
        )
        self.W_key = nn.Linear(
            d_in,
            num_kv_groups * head_dim,
            bias=attention_bias,
            dtype=dtype,
        )
        self.W_value = nn.Linear(
            d_in,
            num_kv_groups * head_dim,
            bias=attention_bias,
            dtype=dtype,
        )
        self.out_proj = nn.Linear(
            self.d_out,
            d_in,
            bias=attention_bias,
            dtype=dtype,
        )

        if qk_norm:
            self.q_norm = CohereLayerNorm(head_dim, eps=1e-6)
            self.k_norm = CohereLayerNorm(head_dim, eps=1e-6)
        else:
            self.q_norm = self.k_norm = None

    def forward(self, x, mask, cos, sin, start_pos=0, cache=None):
        b, num_tokens, _ = x.shape

        # Apply projections
        queries = self.W_query(x)  # (b, num_tokens, num_heads * head_dim)
        keys = self.W_key(x)       # (b, num_tokens, num_kv_groups * head_dim)
        values = self.W_value(x)   # (b, num_tokens, num_kv_groups * head_dim)

        # Reshape
        queries = queries.view(b, num_tokens, self.num_heads, self.head_dim).transpose(1, 2)
        keys_new = keys.view(b, num_tokens, self.num_kv_groups, self.head_dim).transpose(1, 2)
        values_new = values.view(b, num_tokens, self.num_kv_groups, self.head_dim).transpose(1, 2)

        # Optional normalization
        if self.q_norm:
            queries = self.q_norm(queries)
        if self.k_norm:
            keys_new = self.k_norm(keys_new)

        # Cohere2 applies RoPE only on sliding-attention layers.
        if self.attn_type == "sliding_attention":
            queries = apply_rope(queries, cos, sin, offset=start_pos)
            keys_new = apply_rope(keys_new, cos, sin, offset=start_pos)

        if cache is not None:
            prev_k, prev_v = cache
            keys = torch.cat([prev_k, keys_new], dim=2)
            values = torch.cat([prev_v, values_new], dim=2)
            next_cache = (keys, values)
        else:
            keys, values = keys_new, values_new
            next_cache = (keys, values)

        # Expand K and V to match number of heads
        keys = keys.repeat_interleave(self.group_size, dim=1)
        values = values.repeat_interleave(self.group_size, dim=1)

        # Attention
        attn_scores = queries @ keys.transpose(2, 3)
        attn_scores = attn_scores.masked_fill(mask, -torch.inf)

        attn_weights = torch.softmax(attn_scores / self.head_dim**0.5, dim=-1, dtype=torch.float32).to(queries.dtype)
        context = (attn_weights @ values).transpose(1, 2).reshape(b, num_tokens, self.d_out)

        return self.out_proj(context), next_cache
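
- In grouped-query attention, several query heads share one key/value head. Because `repeat_interleave(self.group_size, dim=1)` repeats each KV head `group_size` times in a row, query head `h` reads from KV head `h // group_size`. The short sketch below spells out this mapping for the 16 query heads and 4 KV heads used in this notebook; `kv_group_for_head` is an illustrative helper name

```python
def kv_group_for_head(head_idx, num_heads, num_kv_groups):
    # Index of the KV head that a given query head attends with
    group_size = num_heads // num_kv_groups
    return head_idx // group_size


mapping = [kv_group_for_head(h, num_heads=16, num_kv_groups=4) for h in range(16)]
print(mapping)
# [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3]
```

- Since only the 4 KV heads are stored in the cache (the expansion happens after the cache update), the KV cache is 4x smaller than it would be with one KV head per query head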

In [8]:
class TransformerBlock(nn.Module):
    def __init__(self, cfg, attn_type):
        super().__init__()
        self.attn_type = attn_type

        self.att = GroupedQueryAttention(
            d_in=cfg["emb_dim"],
            num_heads=cfg["n_heads"],
            num_kv_groups=cfg["n_kv_heads"],
            head_dim=cfg["head_dim"],
            qk_norm=False,
            attention_bias=cfg["attention_bias"],
            dtype=cfg["dtype"],
            attn_type=attn_type,
        )
        self.ff = FeedForward(cfg)
        self.input_layernorm = CohereLayerNorm(cfg["emb_dim"], eps=cfg["layer_norm_eps"])

    def forward(self, x, mask_global, mask_local, cos, sin, start_pos=0, cache=None):
        attn_mask = mask_local if self.attn_type == "sliding_attention" else mask_global

        shortcut = x
        x = self.input_layernorm(x)
        x_attn, next_cache = self.att(
            x,
            attn_mask,
            cos,
            sin,
            start_pos=start_pos,
            cache=cache,
        )  # Shape [batch_size, num_tokens, emb_dim]
        x_ff = self.ff(x)

        # Cohere2 parallel residual block
        x = shortcut + x_attn + x_ff
        return x, next_cache
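
- The parallel residual block above feeds the same normalized input to both the attention and the feed-forward module and adds both results to the shortcut, whereas a conventional sequential pre-norm block applies the feed-forward module to the output of the attention step. The toy sketch below contrasts the two layouts with scalar stand-in functions (`norm`, `attn`, and `ff` are placeholders chosen for illustration, not the real modules)

```python
def norm(v):
    return v  # identity stand-in for the normalization layer


def attn(v):
    return 0.5 * v  # stand-in for the attention module


def ff(v):
    return v + 1.0  # stand-in for the feed-forward module


def parallel_block(x):
    # Layout used in TransformerBlock.forward above: one norm, two parallel branches
    h = norm(x)
    return x + attn(h) + ff(h)


def sequential_block(x):
    # Conventional layout: the feed-forward branch sees the attention output
    x = x + attn(norm(x))
    return x + ff(norm(x))


print(parallel_block(2.0), sequential_block(2.0))
# 6.0 7.0
```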

In [9]:
class TinyAyaModel(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        assert len(cfg["layer_types"]) == cfg["n_layers"], "layer_types must match n_layers"

        self.cfg = cfg

        self.tok_emb = nn.Embedding(cfg["vocab_size"], cfg["emb_dim"], dtype=cfg["dtype"])
        self.trf_blocks = nn.ModuleList([TransformerBlock(cfg, t) for t in cfg["layer_types"]])

        self.final_norm = CohereLayerNorm(cfg["emb_dim"], eps=cfg["layer_norm_eps"])
        self.out_head = nn.Linear(cfg["emb_dim"], cfg["vocab_size"], bias=False, dtype=cfg["dtype"])

        self.logit_scale = cfg["logit_scale"]

        cos, sin = compute_rope_params(
            head_dim=cfg["head_dim"],
            theta_base=cfg["rope_base"],
            context_length=cfg["context_length"],
        )
        self.register_buffer("cos", cos, persistent=False)
        self.register_buffer("sin", sin, persistent=False)

        if cfg["tie_word_embeddings"]:
            self.out_head.weight = self.tok_emb.weight

        self.current_pos = 0  # Track current position in KV cache

    def create_masks(self, num_tokens, device, pos_start=0, total_kv_tokens=None):
        if total_kv_tokens is None:
            total_kv_tokens = pos_start + num_tokens

        query_positions = torch.arange(pos_start, pos_start + num_tokens, device=device).unsqueeze(1)
        key_positions = torch.arange(total_kv_tokens, device=device).unsqueeze(0)

        # Future mask
        mask_global = key_positions > query_positions

        # Sliding-window mask
        far_past = key_positions + self.cfg["sliding_window"] <= query_positions
        mask_local = mask_global | far_past

        # Expand to [batch, heads, seq, seq]-broadcastable shape
        return mask_global.unsqueeze(0).unsqueeze(0), mask_local.unsqueeze(0).unsqueeze(0)

    def forward(self, input_ids, attention_mask=None, cache=None):
        tok_embeds = self.tok_emb(input_ids)
        x = tok_embeds
        num_tokens = x.shape[1]

        if cache is not None:
            pos_start = self.current_pos
            pos_end = pos_start + num_tokens
            self.current_pos = pos_end
            total_kv_tokens = pos_end
        else:
            pos_start = 0
            total_kv_tokens = num_tokens

        mask_global, mask_local = self.create_masks(
            num_tokens,
            x.device,
            pos_start=pos_start,
            total_kv_tokens=total_kv_tokens,
        )

        if attention_mask is not None:
            # True means mask in this implementation.
            pad_mask = attention_mask[:, None, None, :total_kv_tokens].to(dtype=torch.bool).logical_not()
            mask_global = mask_global | pad_mask
            mask_local = mask_local | pad_mask

        cos = self.cos.to(x.device, dtype=x.dtype)
        sin = self.sin.to(x.device, dtype=x.dtype)

        for i, block in enumerate(self.trf_blocks):
            blk_cache = cache.get(i) if cache else None
            x, new_blk_cache = block(
                x,
                mask_global,
                mask_local,
                cos,
                sin,
                start_pos=pos_start,
                cache=blk_cache,
            )
            if cache is not None:
                cache.update(i, new_blk_cache)

        x = self.final_norm(x)
        logits = self.out_head(x.to(self.cfg["dtype"]))
        return logits * self.logit_scale

    def reset_kv_cache(self):
        self.current_pos = 0


class KVCache:
    def __init__(self, n_layers):
        self.cache = [None] * n_layers

    def get(self, layer_idx):
        return self.cache[layer_idx]

    def update(self, layer_idx, value):
        self.cache[layer_idx] = value

    def get_all(self):
        return self.cache

    def reset(self):
        for i in range(len(self.cache)):
            self.cache[i] = None
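
- The two masks built in `create_masks` can be hard to picture from the broadcasting code alone. The following pure-Python sketch reproduces the same logic for a tiny example (5 tokens and a sliding window of 3 instead of 4096); `build_masks` is an illustrative helper, and `True` means "masked out" as in the model code

```python
def build_masks(num_tokens, sliding_window, pos_start=0):
    total_kv_tokens = pos_start + num_tokens
    mask_global, mask_local = [], []
    for q in range(pos_start, pos_start + num_tokens):
        # Future mask: a query must not see keys at later positions
        row_global = [k > q for k in range(total_kv_tokens)]
        # Sliding-window mask: additionally hide keys that are too far in the past
        row_local = [(k > q) or (k + sliding_window <= q) for k in range(total_kv_tokens)]
        mask_global.append(row_global)
        mask_local.append(row_local)
    return mask_global, mask_local


def render(mask):
    return ["".join("x" if masked else "." for masked in row) for row in mask]


mask_global, mask_local = build_masks(num_tokens=5, sliding_window=3)
print("\n".join(render(mask_local)))
# .xxxx
# ..xxx
# ...xx
# x...x
# xx...

# With a KV cache, only the new token is passed in (one query row, all keys)
_, cached_local = build_masks(num_tokens=1, sliding_window=3, pos_start=4)
```

- Each query position sees itself plus at most `sliding_window - 1` previous positions in the sliding-attention layers, while the full-attention layers use the plain causal mask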

&nbsp;
# 2. Initialize model

- The configuration in the following code cell describes the Tiny Aya architecture used throughout the remainder of this notebook; which checkpoint is loaded is determined by the `REPO_ID` set at the top of the notebook (`tiny-aya-global` by default)

In [10]:
TINY_AYA_CONFIG = {
    "vocab_size": 262_144,            # Vocabulary size
    "context_length": 500_000,        # Context length in the HF config
    "emb_dim": 2048,                  # Embedding dimension
    "n_heads": 16,                    # Number of attention heads
    "n_layers": 36,                   # Number of layers
    "hidden_dim": 11_008,             # Size of the intermediate dimension in FeedForward
    "head_dim": 128,                  # Size of the heads in GQA
    "n_kv_heads": 4,                  # Number of KV heads for grouped-query attention
    "attention_bias": False,          # Whether attention projections use bias terms
    "attention_dropout": 0.0,         # Attention dropout
    "sliding_window": 4096,           # Sliding-window attention context
    "layer_types": [
        "sliding_attention",
        "sliding_attention",
        "sliding_attention",
        "full_attention",
        "sliding_attention",
        "sliding_attention",
        "sliding_attention",
        "full_attention",
        "sliding_attention",
        "sliding_attention",
        "sliding_attention",
        "full_attention",
        "sliding_attention",
        "sliding_attention",
        "sliding_attention",
        "full_attention",
        "sliding_attention",
        "sliding_attention",
        "sliding_attention",
        "full_attention",
        "sliding_attention",
        "sliding_attention",
        "sliding_attention",
        "full_attention",
        "sliding_attention",
        "sliding_attention",
        "sliding_attention",
        "full_attention",
        "sliding_attention",
        "sliding_attention",
        "sliding_attention",
        "full_attention",
        "sliding_attention",
        "sliding_attention",
        "sliding_attention",
        "full_attention",
    ],
    "rope_base": 50_000.0,            # The base in RoPE's "theta"
    "layer_norm_eps": 1e-5,           # Epsilon used by layer normalization
    "logit_scale": 1.0,               # Final logits scaling factor
    "tie_word_embeddings": True,      # Whether input embedding and output head are tied
    "bos_token_id": 2,
    "eos_token_id": 3,
    "pad_token_id": 0,
    "dtype": torch.bfloat16,          # Lower-precision dtype to reduce memory usage
}
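
- The `layer_types` list above follows a regular pattern: three sliding-attention layers followed by one full-attention layer, repeated nine times. As a minimal sketch, the same list could be generated programmatically (`make_layer_types` and its `full_every` parameter are hypothetical names introduced here)

```python
def make_layer_types(n_layers, full_every=4):
    # Every `full_every`-th layer uses full attention; all others use sliding-window attention
    return [
        "full_attention" if (i + 1) % full_every == 0 else "sliding_attention"
        for i in range(n_layers)
    ]


layer_types = make_layer_types(36)
print(layer_types[:4])
print(layer_types.count("sliding_attention"), layer_types.count("full_attention"))
# 27 9
```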

In [11]:
model = TinyAyaModel(TINY_AYA_CONFIG)

In [12]:
def calc_model_memory_size(model, input_dtype=torch.float32):
    total_params = 0
    total_grads = 0
    for param in model.parameters():
        # Calculate total number of elements per parameter
        param_size = param.numel()
        total_params += param_size
        # Check if gradients are stored for this parameter
        if param.requires_grad:
            total_grads += param_size

    # Calculate buffer size (non-parameters that require memory)
    total_buffers = sum(buf.numel() for buf in model.buffers())

    # Size in bytes = (Number of elements) * (Size of each element in bytes)
    # We assume parameters and gradients are stored in the same type as input dtype
    element_size = torch.tensor(0, dtype=input_dtype).element_size()
    total_memory_bytes = (total_params + total_grads + total_buffers) * element_size

    # Convert bytes to gigabytes
    total_memory_gb = total_memory_bytes / (1024**3)

    return total_memory_gb

print(f"float32 (PyTorch default): {calc_model_memory_size(model, input_dtype=torch.float32):.2f} GB")
print(f"bfloat16: {calc_model_memory_size(model, input_dtype=torch.bfloat16):.2f} GB")

float32 (PyTorch default): 25.43 GB
bfloat16: 12.72 GB
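
- The parameter count and the memory estimates can be cross-checked by hand from the configuration values. The sketch below is a back-of-the-envelope calculation that assumes the layer shapes defined in the architecture code above (bias-free linear layers, one normalization layer per block, tied input/output embeddings); note that `calc_model_memory_size` also counts one gradient per parameter and the RoPE buffers

```python
cfg = {
    "vocab_size": 262_144,
    "context_length": 500_000,
    "emb_dim": 2048,
    "n_heads": 16,
    "n_layers": 36,
    "hidden_dim": 11_008,
    "head_dim": 128,
    "n_kv_heads": 4,
}

# Embedding matrix, shared by tok_emb and out_head (weight tying)
emb_params = cfg["vocab_size"] * cfg["emb_dim"]

# Per-layer parameters
q_and_out = 2 * cfg["emb_dim"] * cfg["n_heads"] * cfg["head_dim"]
k_and_v = 2 * cfg["emb_dim"] * cfg["n_kv_heads"] * cfg["head_dim"]
ff_params = 3 * cfg["emb_dim"] * cfg["hidden_dim"]
norm_params = cfg["emb_dim"]
per_layer = q_and_out + k_and_v + ff_params + norm_params

# All layers plus the final normalization layer
total_params = emb_params + cfg["n_layers"] * per_layer + norm_params

# RoPE buffers: cos and sin, each of shape (context_length, head_dim)
buffer_elems = 2 * cfg["context_length"] * cfg["head_dim"]


def memory_gb(bytes_per_element):
    # Parameters + one gradient per parameter + buffers
    return (2 * total_params + buffer_elems) * bytes_per_element / 1024**3


print(f"{total_params:,}")
print(f"float32: {memory_gb(4):.2f} GB, bfloat16: {memory_gb(2):.2f} GB")
```

- Under these assumptions, the sketch arrives at 3,349,227,520 parameters and reproduces the 25.43 GB and 12.72 GB estimates; because gradients are included, the memory needed for inference only is roughly half of these values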


In [13]:
total_params = sum(p.numel() for p in model.parameters())
print(f"Total number of parameters: {total_params:,}")

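# Because of weight tying, `model.parameters()` yields the shared
# embedding/output matrix only once, so the total above already
# counts unique parameters. For reference, the count without the
# embedding matrix is:
total_params_no_emb = total_params - model.tok_emb.weight.numel()
print(f"\nTotal number of parameters excluding the embedding matrix: {total_params_no_emb:,}")

Total number of parameters: 3,349,227,520

Total number of parameters excluding the embedding matrix: 2,812,356,608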


In [14]:
if torch.cuda.is_available():
    device = torch.device("cuda")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cpu")

model.to(device);

&nbsp;
# 3. Load tokenizer

In [15]:
from tokenizers import Tokenizer


class TinyAyaTokenizer:
    def __init__(self, tokenizer_file_path, eos_token_id=3, pad_token_id=0, bos_token_id=2):
        tok_file = Path(tokenizer_file_path)
        self._tok = Tokenizer.from_file(str(tok_file))

        eos_from_tok = self._tok.token_to_id("<EOS_TOKEN>")
        pad_from_tok = self._tok.token_to_id("<PAD>")
        bos_from_tok = self._tok.token_to_id("<BOS_TOKEN>")

        self.eos_token_id = eos_from_tok if eos_from_tok is not None else eos_token_id
        self.pad_token_id = pad_from_tok if pad_from_tok is not None else pad_token_id
        self.bos_token_id = bos_from_tok if bos_from_tok is not None else bos_token_id

    def encode(self, text):
        return self._tok.encode(text).ids

    def decode(self, ids):
        return self._tok.decode(ids, skip_special_tokens=False)


def apply_chat_template(user_text):
    return (
        "<BOS_TOKEN>"
        "<|START_OF_TURN_TOKEN|><|USER_TOKEN|>"
        f"{user_text}"
        "<|END_OF_TURN_TOKEN|>"
        "<|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|><|START_RESPONSE|>"
    )

- Please note that Cohere requires that you accept the Tiny Aya licensing terms before you can download the files; to do this, you have to create a Hugging Face Hub account and visit the [CohereLabs/tiny-aya-global](https://huggingface.co/CohereLabs/tiny-aya-global) repository to accept the terms
- Next, you will need to create an access token; to generate an access token with READ permissions, click on the profile picture in the upper right and click on "Settings"


<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/bonus/gpt-to-llama/settings.webp?1" width="300px">

- Then, create and copy the access token so you can copy & paste it into the next code cell

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/bonus/gpt-to-llama/access-token.webp?1" width="600px">

In [16]:
print(REPO_ID)

CohereLabs/tiny-aya-global


- Note that if you use the fire, water, earth, or base model instead, you have to accept the licensing terms for that repository separately:
  - [CohereLabs/tiny-aya-fire](https://huggingface.co/CohereLabs/tiny-aya-fire)
  - [CohereLabs/tiny-aya-water](https://huggingface.co/CohereLabs/tiny-aya-water)
  - [CohereLabs/tiny-aya-earth](https://huggingface.co/CohereLabs/tiny-aya-earth)
  - [CohereLabs/tiny-aya-base](https://huggingface.co/CohereLabs/tiny-aya-base)

In [17]:
# Run the following code to log in to the Hugging Face Hub (only needed when executing the notebook for the first time)

from huggingface_hub import login
login()

In [18]:
from huggingface_hub import hf_hub_download

tokenizer_file_path = Path(LOCAL_DIR) / "tokenizer.json"
if not tokenizer_file_path.exists():
    try:
        tokenizer_file_path = hf_hub_download(repo_id=REPO_ID, filename="tokenizer.json", local_dir=LOCAL_DIR)
    except Exception as e:
        print(f"Warning: failed to download tokenizer.json: {e}")
        tokenizer_file_path = "tokenizer.json"

In [19]:
tokenizer = TinyAyaTokenizer(
    tokenizer_file_path=tokenizer_file_path,  # path resolved in the previous cell
    eos_token_id=TINY_AYA_CONFIG["eos_token_id"],
    pad_token_id=TINY_AYA_CONFIG["pad_token_id"],
    bos_token_id=TINY_AYA_CONFIG["bos_token_id"],
)

prompt = apply_chat_template("Give me a short introduction to large language models in 3 sentences.")
input_token_ids = tokenizer.encode(prompt)
text = tokenizer.decode(input_token_ids)
text

'<BOS_TOKEN><BOS_TOKEN><|START_OF_TURN_TOKEN|><|USER_TOKEN|>Give me a short introduction to large language models in 3 sentences.<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|><|START_RESPONSE|>'

&nbsp;
# 4. Load pretrained weights

In [20]:
def load_weights_into_tiny_aya(model, param_config, params):
    def assign(left, right, tensor_name="unknown"):
        if left.shape != right.shape:
            raise ValueError(
                f"Shape mismatch in tensor '{tensor_name}'. Left: {left.shape}, Right: {right.shape}"
            )

        with torch.no_grad():
            if isinstance(right, torch.Tensor):
                left.copy_(right.to(dtype=left.dtype, device=left.device))
            else:
                left.copy_(torch.as_tensor(right, dtype=left.dtype, device=left.device))

        return left

    model.tok_emb.weight = assign(
        model.tok_emb.weight,
        params["model.embed_tokens.weight"],
        "model.embed_tokens.weight",
    )

    for l in range(param_config["n_layers"]):
        block = model.trf_blocks[l]
        att = block.att

        # Q, K, V projections
        att.W_query.weight = assign(
            att.W_query.weight,
            params[f"model.layers.{l}.self_attn.q_proj.weight"],
            f"model.layers.{l}.self_attn.q_proj.weight",
        )
        att.W_key.weight = assign(
            att.W_key.weight,
            params[f"model.layers.{l}.self_attn.k_proj.weight"],
            f"model.layers.{l}.self_attn.k_proj.weight",
        )
        att.W_value.weight = assign(
            att.W_value.weight,
            params[f"model.layers.{l}.self_attn.v_proj.weight"],
            f"model.layers.{l}.self_attn.v_proj.weight",
        )

        # Output projection
        att.out_proj.weight = assign(
            att.out_proj.weight,
            params[f"model.layers.{l}.self_attn.o_proj.weight"],
            f"model.layers.{l}.self_attn.o_proj.weight",
        )

        # Feedforward weights
        block.ff.fc1.weight = assign(
            block.ff.fc1.weight,
            params[f"model.layers.{l}.mlp.gate_proj.weight"],
            f"model.layers.{l}.mlp.gate_proj.weight",
        )
        block.ff.fc2.weight = assign(
            block.ff.fc2.weight,
            params[f"model.layers.{l}.mlp.up_proj.weight"],
            f"model.layers.{l}.mlp.up_proj.weight",
        )
        block.ff.fc3.weight = assign(
            block.ff.fc3.weight,
            params[f"model.layers.{l}.mlp.down_proj.weight"],
            f"model.layers.{l}.mlp.down_proj.weight",
        )

        # Layernorm
        block.input_layernorm.weight = assign(
            block.input_layernorm.weight,
            params[f"model.layers.{l}.input_layernorm.weight"],
            f"model.layers.{l}.input_layernorm.weight",
        )

    # Final normalization and output head
    model.final_norm.weight = assign(
        model.final_norm.weight,
        params["model.norm.weight"],
        "model.norm.weight",
    )

    if "lm_head.weight" in params:
        model.out_head.weight = assign(model.out_head.weight, params["lm_head.weight"], "lm_head.weight")
    else:
        if param_config["tie_word_embeddings"]:
            model.out_head.weight = model.tok_emb.weight
            print("Model uses weight tying.")

In [21]:
import json
from safetensors.torch import load_file
from huggingface_hub import snapshot_download


repo_dir = snapshot_download(repo_id=REPO_ID, local_dir=LOCAL_DIR)
index_path = Path(repo_dir) / "model.safetensors.index.json"
with open(index_path, "r") as f:
    index = json.load(f)

weights_dict = {}
for filename in sorted(set(index["weight_map"].values())):
    shard_path = Path(repo_dir) / filename
    shard = load_file(shard_path)
    weights_dict.update(shard)

load_weights_into_tiny_aya(model, TINY_AYA_CONFIG, weights_dict)
model.to(device)
del weights_dict

Model uses weight tying.
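
- The loading cell above relies on the `model.safetensors.index.json` file, whose `weight_map` entry maps each tensor name to the shard file that contains it. The sketch below shows how the list of shard files is derived from such an index; the index content is a hypothetical, heavily shortened example (the real shard names and counts may differ)

```python
import json

# Hypothetical, shortened index file for illustration only
index_text = """
{
  "metadata": {"total_size": 0},
  "weight_map": {
    "model.embed_tokens.weight": "model-00001-of-00002.safetensors",
    "model.layers.0.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
    "model.layers.35.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
    "model.norm.weight": "model-00002-of-00002.safetensors"
  }
}
"""

index = json.loads(index_text)

# Many tensors live in the same shard, so deduplicate before loading
shard_files = sorted(set(index["weight_map"].values()))
print(shard_files)

# If the checkpoint stores no separate output head, the loader falls back to weight tying
has_lm_head = "lm_head.weight" in index["weight_map"]
print(has_lm_head)
```

- The "Model uses weight tying." message above indicates that the loaded weights contain no separate `lm_head.weight` entry, so the output head reuses the token embedding matrix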


In [22]:
def count_unique_parameters(model):
    unique_params = set()
    total_unique_params = 0
    
    for param in model.parameters():
        if param.data_ptr() not in unique_params:
            total_unique_params += param.numel()
            unique_params.add(param.data_ptr())
            
    return total_unique_params

total_params_uniq = count_unique_parameters(model)
print(f"Total number of unique parameters: {total_params_uniq:,}")

Total number of unique parameters: 3,349,227,520
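
- The unique count above equals the total reported by `model.parameters()` because PyTorch yields a shared (tied) parameter only once. The following toy sketch demonstrates this behavior with a tiny stand-in model (`TinyTiedModel` is a hypothetical class used only for illustration)

```python
import torch.nn as nn


class TinyTiedModel(nn.Module):
    def __init__(self, vocab_size=10, emb_dim=4):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, emb_dim)
        self.out_head = nn.Linear(emb_dim, vocab_size, bias=False)
        # Weight tying: both modules now share the same Parameter object
        self.out_head.weight = self.tok_emb.weight


toy = TinyTiedModel()

# parameters() skips duplicates, so the shared matrix is counted once
counted = sum(p.numel() for p in toy.parameters())

# Summing per module counts the shared matrix twice
naive = toy.tok_emb.weight.numel() + toy.out_head.weight.numel()

print(counted, naive)
# 40 80
```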


&nbsp;
# 5. Generate text

In [23]:
stop_ids = {
    tokenizer.eos_token_id,
    tokenizer._tok.token_to_id("<|END_RESPONSE|>"),
    tokenizer._tok.token_to_id("<|END_OF_TURN_TOKEN|>"),
}
stop_ids = {x for x in stop_ids if x is not None}


def generate_text_basic_stream(
    model,
    token_ids,
    max_new_tokens,
    stop_token_ids=None,
    context_size=None,
):
    stop_token_ids = set(stop_token_ids or [])

    model.eval()
    with torch.no_grad():
        cache = KVCache(n_layers=model.cfg["n_layers"])
        model.reset_kv_cache()

        # Prime the cache with the initial context
        logits = model(token_ids, cache=cache)

        for _ in range(max_new_tokens):
            next_token = torch.argmax(logits[:, -1], dim=-1, keepdim=True)

            if stop_token_ids and next_token.item() in stop_token_ids:
                break

            yield next_token

            token_ids = torch.cat([token_ids, next_token], dim=1)
            # Feed only the new token to the model; cache handles history
            logits = model(next_token, cache=cache)

In [24]:
prompt = apply_chat_template("Give me a short introduction to large language models in 3 sentences.")
input_token_ids = tokenizer.encode(prompt)
input_token_ids_tensor = torch.tensor(input_token_ids, device=device).unsqueeze(0)


if torch.cuda.is_available():
    torch.cuda.reset_peak_memory_stats()


for token in generate_text_basic_stream(
    model=model,
    token_ids=input_token_ids_tensor,
    max_new_tokens=500,
    stop_token_ids=stop_ids
):
    token_id = token.squeeze(0).tolist()
    print(
        tokenizer.decode(token_id),
        end="",
        flush=True
    )

if torch.cuda.is_available():
    def calc_gpu_gb(x):
        return f"{x / 1024 / 1024 / 1024:.2f} GB"
    
    print(f"\n\nGPU memory used: {calc_gpu_gb(torch.cuda.max_memory_allocated())}")

Large language models are advanced AI systems trained on vast amounts of text data to understand and generate human-like language. They use deep learning techniques, particularly transformer architectures, to process and predict text patterns, enabling tasks like translation, summarization, and conversational dialogue. These models have revolutionized natural language processing, powering applications from chatbots to content creation.

&nbsp;
# What's next?

- For those interested in a comprehensive guide on building a large language model from scratch and gaining a deeper understanding of its mechanics, you might like my book [Build a Large Language Model (From Scratch)](http://mng.bz/orYv)

<a href="http://mng.bz/orYv"><img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/cover-small.webp" width="100px"></a>