

# Core Training Terms in LLMs

## 1. **Token**

* **Definition**: The smallest unit of text the model sees (word, subword, or character, depending on the tokenizer).
* Example: `"The cat sat." → ["The", "cat", "sat", "."]` (4 tokens with a simple word-level tokenizer; a subword tokenizer may split the text differently).
* **Analogy**: Like breaking your book into Lego bricks.
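
The split in the example can be reproduced with a toy regex tokenizer. This is only a sketch: `toy_tokenize` is a hypothetical helper, and real LLM tokenizers usually work on subwords and treat whitespace differently.

```python
import re

def toy_tokenize(text):
    """Split text into word and punctuation tokens (toy example, not a real LLM tokenizer)."""
    # \w+ matches a run of letters/digits; [^\w\s] matches one punctuation mark.
    return re.findall(r"\w+|[^\w\s]", text)

tokens = toy_tokenize("The cat sat.")
print(tokens)       # ['The', 'cat', 'sat', '.']
print(len(tokens))  # 4
```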



## 2. **Sequence length (context window)**

* **Definition**: The maximum number of tokens the model can process at once.
* Example: If `sequence_length = 512`, the model can only “see” 512 tokens from your book at a time.
* **Analogy**: Like a human who can only read **one page at a time** even though the book has 1000 pages.
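
To make the "one page at a time" idea concrete, here is a small sketch that cuts a book into non-overlapping pages of at most `sequence_length` tokens. The 20,000-token book and the integer stand-ins for token IDs are assumptions chosen for illustration.

```python
sequence_length = 512
book = list(range(20_000))  # stand-in token IDs for a hypothetical 20,000-token book

# Non-overlapping "pages" of at most sequence_length tokens each.
pages = [book[i:i + sequence_length] for i in range(0, len(book), sequence_length)]

print(len(pages))      # 40
print(len(pages[0]))   # 512
print(len(pages[-1]))  # 32 (the last page is only partly filled)
```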



## 3. **Stride**

* **Definition**: How far you move forward when making new training sequences from the text.
* Example:

  * Sequence 1: tokens 1 → 512
  * With stride = 256, Sequence 2 covers tokens 257 → 768
  * Because the stride is smaller than the sequence length, consecutive sequences **overlap**, so the model learns connections across page breaks.
* **Analogy**: Like sliding a magnifying glass over text — if you move it halfway each time, you still see overlapping context.
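
The sliding window can be sketched in a few lines. `make_windows` is a hypothetical helper, and dropping any trailing partial window is an assumption (some pipelines pad it instead).

```python
def make_windows(token_ids, sequence_length, stride):
    """Return overlapping windows; only full-length windows are kept (an assumption)."""
    last_start = len(token_ids) - sequence_length
    return [token_ids[s:s + sequence_length] for s in range(0, last_start + 1, stride)]

book = list(range(1, 20_001))  # token positions 1..20000, 1-indexed like the example
windows = make_windows(book, sequence_length=512, stride=256)

print(windows[0][0], windows[0][-1])  # 1 512
print(windows[1][0], windows[1][-1])  # 257 768
print(len(windows))                   # 77
```

With `stride` equal to `sequence_length` the windows stop overlapping, and the same book yields far fewer sequences.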



## 4. **Batch**

* **Definition**: A group of sequences processed together in **one forward + backward pass**.
* Example: If `batch_size = 8` and `sequence_length = 512`, each batch has `8 × 512 = 4096 tokens`.
* **Analogy**: Like testing yourself with 8 flashcards at the same time, instead of just 1.
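
The token arithmetic can be checked with a short sketch. The 77 stand-in sequences are an arbitrary assumption; the point is how sequences group into batches and that the final batch may be smaller.

```python
batch_size = 8
sequence_length = 512

# 77 stand-in sequences, each a list of 512 dummy token IDs.
sequences = [[0] * sequence_length for _ in range(77)]

# Group consecutive sequences into batches of up to batch_size.
batches = [sequences[i:i + batch_size] for i in range(0, len(sequences), batch_size)]

tokens_in_first_batch = sum(len(seq) for seq in batches[0])
print(tokens_in_first_batch)  # 4096
print(len(batches))           # 10
print(len(batches[-1]))       # 5 (the last batch is smaller)
```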



## 5. **Mini-batch vs Batch**

* In practice, “batch” usually means **mini-batch** (small subset of the dataset used each step).
* Full batch = whole dataset at once (rare in LLM training because it’s too big).



## 6. **Step (iteration)**

* **Definition**: One update of the model’s weights (using 1 batch).
* Example: 1 step = take batch → compute predictions → compute loss → backpropagate → update weights.
* **Analogy**: Like practicing one set of 8 flashcards before moving on.
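
The loop of predict, measure loss, compute gradients, and update can be sketched on a deliberately tiny model. Everything here is an assumption for illustration: a one-weight model `y = weight * x`, mean squared error, and plain gradient descent. An actual LLM step uses cross-entropy loss, automatic differentiation, and typically an optimizer such as Adam.

```python
def training_step(weight, batch, learning_rate=0.1):
    """One step: predict -> loss -> gradient -> weight update, for the toy model y = weight * x."""
    n = len(batch)
    predictions = [weight * x for x, _ in batch]
    # Mean squared error over the batch.
    loss = sum((p - y) ** 2 for p, (_, y) in zip(predictions, batch)) / n
    # Derivative of the loss with respect to the weight ("backpropagation" by hand).
    grad = sum(2 * (p - y) * x for p, (x, y) in zip(predictions, batch)) / n
    new_weight = weight - learning_rate * grad
    return new_weight, loss

batch = [(1.0, 2.0), (2.0, 4.0)]  # (input, target) pairs; the true relation is y = 2x
weight, loss = training_step(0.0, batch)
print(weight, loss)  # 1.0 10.0
```

Calling `training_step` again with the updated weight gives a lower loss, which is what repeated steps are meant to achieve.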



## 7. **Epoch**

* **Definition**: One full pass through the **entire dataset** (all sequences).
* Example: If your 20,000-token book becomes 77 sequences of length 512 (the count you get with a stride of 256, keeping only full-length sequences):

  * Epoch = the model has trained on all 77 sequences once (in shuffled order).
* **Analogy**: Like rereading your entire book once, in random snippets.
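
Steps and epochs are linked by simple arithmetic. In this sketch, `batch_size = 8` is borrowed from the batch example and `num_epochs = 3` is an arbitrary assumption.

```python
import math

num_sequences = 77
batch_size = 8   # assumption, borrowed from the batch example
num_epochs = 3   # assumption

# The last batch of an epoch may be smaller, so round up.
steps_per_epoch = math.ceil(num_sequences / batch_size)
total_steps = steps_per_epoch * num_epochs
print(steps_per_epoch, total_steps)  # 10 30
```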



## 8. **Shuffling**

* **Definition**: Randomizing the order of sequences each epoch.
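* Purpose: Prevents the model from picking up patterns tied to the order in which sequences are presented, and makes each batch a more representative sample of the data.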
* **Analogy**: Shuffling flashcards before each study session.
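
A minimal sketch of per-epoch shuffling using Python's standard `random` module. The eight sequence IDs and the fixed seed are assumptions; the seed only makes the run reproducible.

```python
import random

sequence_ids = list(range(8))  # stand-ins for 8 training sequences
rng = random.Random(0)         # fixed seed so the run is reproducible

epoch_orders = []
for epoch in range(2):
    order = sequence_ids[:]    # copy, so the original order is untouched
    rng.shuffle(order)         # new random order for this epoch
    epoch_orders.append(order)
    print(epoch, order)
```

Every epoch still visits each sequence exactly once; only the order changes.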



## 9. **Loss**

* **Definition**: A number measuring how bad the model’s predictions are.
* In LLMs → **cross-entropy loss** (the negative log of the probability the model assigned to the true next token, averaged over tokens).
* **Analogy**: Like a score on your practice test — lower = better.
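
For a single prediction, cross-entropy reduces to the negative log of the probability given to the correct token. The sketch below uses made-up probabilities over a three-word vocabulary; `cross_entropy` is a hypothetical helper.

```python
import math

def cross_entropy(predicted_probs, true_token):
    """Negative natural log of the probability the model assigned to the true next token."""
    return -math.log(predicted_probs[true_token])

confident = {"sat": 0.9, "ran": 0.05, "flew": 0.05}  # made-up model outputs
unsure = {"sat": 0.2, "ran": 0.4, "flew": 0.4}

print(round(cross_entropy(confident, "sat"), 3))  # 0.105
print(round(cross_entropy(unsure, "sat"), 3))     # 1.609
```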



## 10. **Perplexity (PPL)**

* **Definition**: Exponential of the average per-token cross-entropy loss → measures how “surprised” the model is by the text.
* Lower PPL = the model assigns higher probability to the tokens that actually come next.
* **Analogy**: If you’re reading and every next word surprises you → high perplexity; if you can guess words easily → low perplexity.
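
The definition translates directly into code. This sketch assumes the per-token losses are measured in nats (natural log); `perplexity` is a hypothetical helper.

```python
import math

def perplexity(token_losses):
    """Exponential of the mean per-token loss (losses assumed to be in nats)."""
    return math.exp(sum(token_losses) / len(token_losses))

# A model that guesses uniformly among 4 tokens has loss ln(4) on every token.
uniform_guess = [math.log(4)] * 3
print(round(perplexity(uniform_guess), 3))  # 4.0
```

One way to read the result: a perplexity of 4 means the model is, on average, as uncertain as if it were choosing among 4 equally likely tokens.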



## 11. **Checkpoint**

* **Definition**: Saved snapshot of the model’s weights (and usually the optimizer state and step count) during training.
* Lets you pause and resume training, or roll back to an earlier point after a crash or a bad run.
* **Analogy**: Like saving your progress in a video game so you don’t restart from page 1 if your computer dies.
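
A toy version of saving and restoring training state, using JSON and a temporary directory. The file name, the two-number weight list, and both helper functions are assumptions for illustration; deep learning frameworks store tensors in their own binary formats.

```python
import json
import os
import tempfile

def save_checkpoint(path, step, weights):
    """Write the training step and weights to disk."""
    with open(path, "w") as f:
        json.dump({"step": step, "weights": weights}, f)

def load_checkpoint(path):
    """Read a checkpoint back so training can resume from it."""
    with open(path) as f:
        return json.load(f)

path = os.path.join(tempfile.mkdtemp(), "checkpoint_step_100.json")
save_checkpoint(path, step=100, weights=[0.5, -1.25])

restored = load_checkpoint(path)
print(restored["step"], restored["weights"])  # 100 [0.5, -1.25]
```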



## 12. **Gradient clipping**

* **Definition**: Capping the size of the gradients (usually their overall norm) before the weight update, so that a single step cannot change the weights too much when gradients explode.
* **Analogy**: Putting a “speed limit” on how much you can change your notes at once.
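
One common variant is clipping by global norm: if the gradient vector is longer than a chosen limit, scale it down to that length while keeping its direction. The sketch below assumes a flat list of gradients; `clip_by_global_norm` and `max_norm=1.0` are illustrative choices.

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Scale gradients down so their L2 norm is at most max_norm; direction is unchanged."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return grads[:]          # already within the limit
    scale = max_norm / norm
    return [g * scale for g in grads]

print(clip_by_global_norm([3.0, 4.0], max_norm=1.0))  # norm 5 -> roughly [0.6, 0.8]
print(clip_by_global_norm([0.3, 0.4], max_norm=1.0))  # unchanged: [0.3, 0.4]
```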



# Quick Recap with Your Book Example

* **Tokens** = text broken into Lego bricks.
* **Sequence length** = how many Lego bricks the model can look at at once.
* **Stride** = how much you slide the reading window each time.
* **Batch** = how many sequences you train on in one go.
* **Step** = training update on 1 batch.
* **Epoch** = reading the whole book once (via chunks).
* **Shuffling** = randomizing flashcards before each read.
* **Loss/PPL** = how well the model is guessing the next token (lower = better).
* **Checkpoints** = save points during training.
* **Gradient clipping** = a speed limit on how much one step can change the model.


