## How an LLM Learns


There are two main types of LLM training:
1. Pre-training: the model learns general language patterns from a large body of raw text.
2. Supervised fine-tuning: the model learns to respond the way people would like it to.


For our case, we will treat pre-training as the first step.

You can picture the model as a giant machine with millions of adjustable knobs.

The training process consists of this loop:
1. **Predict:** Give the model a text snippet and ask it to guess the next word.
2. **Compare:** Compare the predicted word with the actual word.
3. **Calculate Error (loss):** Turn that comparison into a single number, the loss, that measures how wrong the guess was.
4. **Update (Learn):** Slightly adjust all the millions of knobs in a direction that would make the guess less wrong.
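
The loop above can be sketched in a few lines of plain Python. This is a toy stand-in, not how a real LLM is implemented: the "model" is just a table of scores with one row per previous word, the corpus is a single made-up sentence, and the names `logits`, `softmax`, `train_one_epoch`, and `learning_rate` are illustrative assumptions.

```python
import math

# Toy corpus and vocabulary (made up for illustration).
words = "the cat sat on the mat".split()
vocab = sorted(set(words))
word_to_id = {w: i for i, w in enumerate(vocab)}
ids = [word_to_id[w] for w in words]

# The "knobs": logits[prev][nxt] is the score for word `nxt` following word `prev`.
# Starting at zero means every next word is equally likely at first.
logits = [[0.0] * len(vocab) for _ in vocab]
learning_rate = 0.5


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def train_one_epoch():
    """Run predict -> compare -> loss -> update over every adjacent word pair."""
    total_loss = 0.0
    for prev_id, next_id in zip(ids, ids[1:]):
        # 1. Predict: probabilities for the next word given the previous one.
        probs = softmax(logits[prev_id])
        # 2-3. Compare with the actual next word and compute the cross-entropy loss.
        total_loss += -math.log(probs[next_id])
        # 4. Update: the gradient of cross-entropy w.r.t. the scores is (probs - one_hot).
        for j in range(len(vocab)):
            grad = probs[j] - (1.0 if j == next_id else 0.0)
            logits[prev_id][j] -= learning_rate * grad
    return total_loss / (len(ids) - 1)


losses = [train_one_epoch() for _ in range(20)]
print(f"average loss, first epoch: {losses[0]:.3f}")
print(f"average loss, last epoch:  {losses[-1]:.3f}")
```

The average loss should shrink from the first epoch to the last, which is all "learning" means here: the knobs move so that the actual next word gets a higher probability.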


### Ingredient #1 - The Data and Tokenization
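
Tokenization turns text into a list of integer IDs, because the model only operates on numbers. The following is a minimal sketch of that round trip using a made-up, word-level vocabulary. Real tokenizers split text into subword pieces and have far larger vocabularies; the names `toy_vocab`, `encode`, and `decode` are hypothetical and exist only for this illustration.

```python
# A made-up, word-level vocabulary (illustration only).
toy_vocab = ["<unk>", "hello", ",", "how", "are", "you", "?"]
token_to_id = {token: i for i, token in enumerate(toy_vocab)}


def encode(text):
    """Lowercase, split punctuation off words, and map each piece to its ID."""
    for mark in ",?":
        text = text.replace(mark, f" {mark} ")
    # Pieces that are not in the vocabulary fall back to the <unk> ID (0).
    return [token_to_id.get(piece, 0) for piece in text.lower().split()]


def decode(ids):
    """Map IDs back to pieces; spacing is simplified compared to real tokenizers."""
    return " ".join(toy_vocab[i] for i in ids)


ids = encode("Hello, how are you?")
print(ids)
print(decode(ids))
```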

In [1]:
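from transformers import AutoTokenizer  # requires: pip install transformers

text_data = "Hello, how are you?"

# Load the tokenizer that belongs to the Gemma 3 model family.
tokenizer = AutoTokenizer.from_pretrained("google/gemma-3-1b-it")

# Text -> list of integer token IDs
encoded_text = tokenizer.encode(text_data)
print(encoded_text)

# Token IDs -> text (special tokens such as a beginning-of-sequence marker may appear)
decoded_text = tokenizer.decode(encoded_text)
print(decoded_text)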


In our script, `load_dataset` and `ds.map` do this for billions of words. `DataCollatorForLanguageModeling` then groups the tokenized sequences into batches to feed to the GPUs.
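
To make the batching step concrete, here is a rough sketch of what a collator for causal language modeling does conceptually: pad every sequence in a batch to the same length, build an attention mask, and copy the input IDs into labels while marking padded positions with -100 so the loss skips them. This is a plain-Python illustration, not the library's implementation; `collate_for_causal_lm` and the `pad_id` value are assumptions made for the example.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss ignores by default


def collate_for_causal_lm(sequences, pad_id=0):
    """Pad token-ID lists to equal length and build masks and labels.

    Hypothetical helper for illustration; `pad_id=0` is an assumed padding ID.
    """
    max_len = max(len(seq) for seq in sequences)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for seq in sequences:
        n_pad = max_len - len(seq)
        batch["input_ids"].append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding.
        batch["attention_mask"].append([1] * len(seq) + [0] * n_pad)
        # Labels mirror the inputs; padded positions are excluded from the loss.
        batch["labels"].append(seq + [IGNORE_INDEX] * n_pad)
    return batch


batch = collate_for_causal_lm([[5, 8, 13], [7, 2]])
for key, rows in batch.items():
    print(key, rows)
```

The labels are not shifted here. In the Hugging Face causal-LM models, the shift by one position (so that each token is scored against the token that follows it) happens inside the model when it computes the loss.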

### Ingredient #2 - The Model (The 'Brain' Architecture)

We need to define the structure of our model's "brain." This is just a blueprint: it fixes the shape of the network, not what it knows. For our 500M-parameter project, we will define a specific architecture; here, we'll create a tiny "toy" version. Crucially, when the model is first created, all of its "knobs" (parameters) are set to random values, so it is completely untrained.

In [2]:
import torch
from transformers import GemmaConfig, GemmaForCausalLM

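# Your script has a `Gemma3TextConfig` block. This plays the same role, but is much smaller.
# `hidden_size`, `num_hidden_layers`, etc. determine the total number of parameters (~500M for us).
config = GemmaConfig(
    vocab_size=len(tokenizer),   # also counts any added special tokens
    hidden_size=32,              # Our project: ~1536
    num_hidden_layers=3,         # Our project: ~18
    num_attention_heads=4,
    num_key_value_heads=4,       # set explicitly; the default is sized for a much larger model
    head_dim=8,                  # keeps attention small: 4 heads x 8 dims = 32
    intermediate_size=64,
)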

# This line instantiates the model. It's now an object in memory
# with millions of randomly initialized floating-point numbers.
untrained_model = GemmaForCausalLM(config)
print(f"Created a new, untrained model with {untrained_model.num_parameters():,} parameters.")
print("It currently knows nothing. Its 'knowledge' is random noise.")
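
As a rough cross-check on how these settings translate into a parameter count, the sketch below adds up the weight matrices of a Gemma-style decoder by hand. It rests on several assumptions that the text does not state: input and output embeddings are shared, the layers have no bias terms, and each layer has four attention projections, a three-matrix gated MLP, and two norm vectors. The vocabulary size of 262,144, `num_kv_heads=4`, and `head_dim=8` are placeholder values chosen for the example. Treat the result as an estimate to compare against `num_parameters()`, not as its guaranteed output.

```python
def estimate_decoder_params(vocab_size, hidden_size, num_layers, num_heads,
                            num_kv_heads, head_dim, intermediate_size):
    """Approximate parameter count for a Gemma-style decoder (see assumptions above)."""
    embedding = vocab_size * hidden_size             # assumed shared with the output layer
    attention = (
        hidden_size * num_heads * head_dim           # query projection
        + 2 * hidden_size * num_kv_heads * head_dim  # key and value projections
        + num_heads * head_dim * hidden_size         # output projection
    )
    mlp = 3 * hidden_size * intermediate_size        # gate, up, and down projections
    norms = 2 * hidden_size                          # two norm weight vectors per layer
    per_layer = attention + mlp + norms
    final_norm = hidden_size
    return {
        "embedding": embedding,
        "per_layer": per_layer,
        "total": embedding + num_layers * per_layer + final_norm,
    }


# Toy settings from the text, plus the assumed placeholder values.
toy = estimate_decoder_params(
    vocab_size=262_144, hidden_size=32, num_layers=3, num_heads=4,
    num_kv_heads=4, head_dim=8, intermediate_size=64,
)
for name, count in toy.items():
    print(f"{name:>10}: {count:,}")
```

Under these assumptions, almost all parameters of the toy model sit in the embedding table rather than in the three transformer layers, because the vocabulary is so much larger than the hidden size.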


In [None]:
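# Prepare the input and the ground-truth label for one training example
input_text = "The curious cat climbed the"
target_word = "tall"

# Convert to tensors (the data format GPUs use)
input_ids = tokenizer.encode(input_text, return_tensors='pt')
# The label is the ID of the *next* token. Skip special tokens so that [0] is the
# word itself rather than a beginning-of-sequence marker, and add the leading
# space the word would have in running text.
target_id = tokenizer.encode(" " + target_word, add_special_tokens=False)[0]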

# --- STEP 1: PREDICT ---
# The model does a "forward pass": input data flows through the network's layers.
outputs = untrained_model(input_ids)
# The output is `logits`: a raw score for every single word in the vocabulary.
# We only care about the prediction for the very last token.
last_token_logits = outputs.logits[0, -1, :] # [batch_size, sequence_position, vocab_size]

# --- STEP 2: COMPARE (Visually) ---
# Let's see how wrong the random model is.
predicted_token_id = torch.argmax(last_token_logits)
print(f"Input Text:         '{input_text}'")
print(f"Correct Next Word:  '{target_word}' (Token ID: {target_id})")
print(f"Model's Prediction: '{tokenizer.decode(predicted_token_id)}' (Token ID: {predicted_token_id.item()})")