# Scaling Laws with nanoGPT
__Designed by Edouard Oyallon__

In this lab, we’ll use **nanoGPT** (Karpathy’s minimal GPT training code) together with the character-level dataset **`shakespeare_char`** to explore **compute-optimal training**, inspired by **Chinchilla-style scaling** (see https://arxiv.org/abs/2203.15556 and Section 3 of https://arxiv.org/pdf/2401.02954).

**Resources**
- **nanoGPT repo:** https://github.com/karpathy/nanoGPT  
- **Dataset:** `data/shakespeare_char` (character-level Shakespeare)

---

## Q0 — Setup

Before we start tuning or running sweeps, make sure your environment is working end-to-end:

- **Clone and explore** the nanoGPT repository (training loop, model definition, config system).
- **Preprocess** the `shakespeare_char` dataset and confirm the dataset files are generated correctly.
- **Run a tiny training job** (a few hundred iterations) and verify that:
  - loss decreases,
  - checkpoints/logs are written,
  - you can reproduce the run from the proposed config.
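One quick way to confirm that preprocessing produced sensible files is to count the tokens in each split. The sketch below assumes that `data/shakespeare_char/prepare.py` writes `train.bin` and `val.bin` as flat arrays of 16-bit token ids, as the upstream repository did at the time of writing. The helper names `count_tokens` and `report` are hypothetical.

```python
import os

BYTES_PER_TOKEN = 2  # assumption: token ids are stored as uint16


def count_tokens(path: str) -> int:
    """Return the number of tokens in a flat binary token file."""
    size = os.path.getsize(path)
    if size % BYTES_PER_TOKEN != 0:
        raise ValueError(f"{path}: size {size} is not a multiple of {BYTES_PER_TOKEN} bytes")
    return size // BYTES_PER_TOKEN


def report(data_dir: str = "data/shakespeare_char") -> None:
    """Print token counts for the train/val splits, or flag missing files."""
    for split in ("train", "val"):
        path = os.path.join(data_dir, f"{split}.bin")
        if os.path.exists(path):
            print(f"{split}: {count_tokens(path):,} tokens")
        else:
            print(f"{split}: missing ({path}); run the prepare script first")
```

Calling `report()` from the repository root should list both splits. The training split size is also a useful upper bound to keep in mind when choosing $D$ later, since larger values mean revisiting the same text.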

> Optional: you *can* integrate nanoGPT with **Weights & Biases** for logging and sweeps, but it’s not required for this lab (and we won’t rely on it).

## Preliminary questions

### Q1 — Hardware FLOPs budget

Your first job is to quantify your **compute budget**. We’ll use this later to reason about *compute-optimal* choices (tokens vs parameters).

**Peak throughput.**  
   Find your GPU’s **peak FP16 throughput** (in TFLOP/s). Using that value, estimate how many FLOPs you can deliver in **5 minutes** at peak.  
   - Report: GPU model, peak FP16 TFLOP/s, and total FLOPs in 5 minutes.  
   - Then **update** nanoGPT’s `estimate_mfu` (in `model.py`) so it matches *your* hardware peak. MFU (model FLOPs utilization, in %) measures the fraction of the GPU’s peak throughput that training actually achieves.
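The arithmetic behind the budget and MFU can be sketched as follows. The 100 TFLOP/s peak and the 30 TFLOP/s achieved rate are placeholders, not specifications of any particular GPU, and the function names are hypothetical. In the upstream repository at the time of writing, `estimate_mfu` compares against a hard-coded A100 bfloat16 peak (`flops_promised = 312e12`), which is the value to replace.

```python
def flops_budget(peak_tflops: float, seconds: float) -> float:
    """Total FLOPs deliverable at peak throughput over a time window."""
    return peak_tflops * 1e12 * seconds


def mfu(achieved_flops_per_s: float, peak_tflops: float) -> float:
    """Model FLOPs utilization as a fraction of the hardware peak."""
    return achieved_flops_per_s / (peak_tflops * 1e12)


PEAK_TFLOPS = 100.0  # placeholder: replace with your GPU's peak FP16 figure
budget = flops_budget(PEAK_TFLOPS, seconds=5 * 60)
print(f"5-minute budget at peak: {budget:.2e} FLOPs")
print(f"MFU at 30 TFLOP/s achieved: {mfu(30e12, PEAK_TFLOPS):.0%}")
```

Real runs rarely reach the peak, so the usable budget is closer to the peak budget multiplied by the measured MFU.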


Helpful reference (hardware specs): https://epoch.ai/data/machine-learning-hardware


**Answer:** 

- GPU: Nvidia RTX 4000 Ada
- Peak FP16 throughput: 19.2 TFLOP/s
- Total FLOPs in 5 minutes: $19.2 \times 10^{12}\ \text{FLOP/s} \times 300\ \text{s} \approx 5.76 \times 10^{15}$ FLOPs

### Q2 — Hyperparameter methodology

Take one minute and propose the **simplest possible methodology** to choose hyperparameters for the **best model you can train in 5 minutes on your GPU** (i.e., a plan you can realistically follow until the end of the lab).

- What would you try **first** (and why)?
- Roughly **how many runs** would you need before you trust your choice?
- Is this approach **practical** given your time and compute constraints?

*Answer briefly, but justify your reasoning.*

**Answer:** 

### Q3 — Identify the key parameters in `train.py`

For compute-optimal training, the important knobs are the ones that control:

- **model scale** $N$ (parameter count / FLOPs per token),
- **training tokens** $D$ (how many tokens you process),
- **optimization stability** (so losses are comparable),
- **compute efficiency** (throughput / MFU).
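As a starting point, the four roles above can be mapped onto configuration names that exist in nanoGPT's `train.py`. The grouping below is one possible reading, not an official classification, and the helper `tokens_processed` is a hypothetical name for the single-GPU token count.

```python
# One possible grouping of train.py config names by their role in a scaling study.
KEY_PARAMS = {
    "model scale N": ["n_layer", "n_head", "n_embd"],
    "training tokens D": ["max_iters", "batch_size", "block_size",
                          "gradient_accumulation_steps"],
    "optimization stability": ["learning_rate", "warmup_iters", "lr_decay_iters",
                               "min_lr", "weight_decay", "grad_clip", "dropout"],
    "compute efficiency": ["dtype", "compile", "device"],
}


def tokens_processed(max_iters: int, batch_size: int, block_size: int,
                     gradient_accumulation_steps: int = 1) -> int:
    """Tokens seen during training on a single GPU."""
    return max_iters * gradient_accumulation_steps * batch_size * block_size


for role, names in KEY_PARAMS.items():
    print(f"{role}: {', '.join(names)}")
print("tokens for 100 iters:", tokens_processed(100, batch_size=10, block_size=1024))
```

Note that `block_size` plays two roles: it sets the tokens per iteration and also enters the attention cost per token.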

**Answer:**

### Q4 — Instrument nanoGPT (FLOPs + saved results)

Update the nanoGPT codebase to support scaling-law experiments:

- **Add** a function `estimate_flops_per_token()` in `model.py` that estimates the model’s **FLOPs per token** (for a forward+backward training step).
- **At the end of training**, save a single file named **`output.pt`** inside `--out_dir` (using `torch.save`) that contains:
  1. `best_train_loss`
  2. `best_val_loss`
  3. `model_args`
  4. `config`
  5. `flops_per_token`
  6. `params_no_embed` (parameter count excluding token/position embeddings)

*Hint:* look at nanoGPT’s `scaling_laws.ipynb` and the existing `estimate_mfu` code path for inspiration.
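A minimal sketch of the two pieces, written as plain functions so they can be checked without a GPU. It uses the same accounting as the existing `estimate_mfu` path (6 FLOPs per parameter per token, plus an attention term). The parameter count here is an approximation that ignores biases and LayerNorm weights; inside `model.py` you would count the real parameters instead. The names `approx_params_no_embed` and `build_output` are hypothetical.

```python
def estimate_flops_per_token(n_params: int, n_layer: int, n_head: int,
                             n_embd: int, block_size: int) -> int:
    """Forward+backward FLOPs per token: 6*N for the weight matmuls plus
    12*L*H*Q*T for attention (Q = head dimension, T = context length)."""
    head_dim = n_embd // n_head
    return 6 * n_params + 12 * n_layer * n_head * head_dim * block_size


def approx_params_no_embed(n_layer: int, n_embd: int) -> int:
    """Approximate transformer-block weights, ignoring biases and LayerNorm:
    4*d^2 for attention (qkv + output projection) and 8*d^2 for the MLP."""
    return 12 * n_layer * n_embd ** 2


def build_output(best_train_loss: float, best_val_loss: float,
                 model_args: dict, config: dict) -> dict:
    """Assemble the dictionary that would be written to output.pt."""
    n_params = approx_params_no_embed(model_args["n_layer"], model_args["n_embd"])
    flops = estimate_flops_per_token(n_params, model_args["n_layer"],
                                     model_args["n_head"], model_args["n_embd"],
                                     model_args["block_size"])
    return {
        "best_train_loss": best_train_loss,
        "best_val_loss": best_val_loss,
        "model_args": model_args,
        "config": config,
        "flops_per_token": flops,
        "params_no_embed": n_params,
    }
```

At the end of `train.py`, the resulting dictionary could then be written with `torch.save(results, os.path.join(out_dir, "output.pt"))`.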


## Model design

### Q5 — Minimal model configuration

Because we assume limited / slow GPU resources, we will use a **tiny baseline model** by reducing the embedding size.

Set:
- `--n_layer=4`
- `--n_head=2`
- `--block_size=1024`

What constraint does this impose on the embedding size `n_embd`?

Then plot the number of FLOPs per token as a function of the scale factor $N$, with:
$$
n_{\text{embd}} = n_{\text{embd,min}} \cdot N
$$
and suggest a simple proxy for FLOPs per token.

This will be our **minimal reference model** for the rest of the lab.
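Before plotting, it can help to tabulate the two contributions to FLOPs per token separately. The sketch below assumes $n_{\text{embd,min}} = 2$, the smallest value divisible by `n_head=2` (nanoGPT asserts that `n_embd` is a multiple of `n_head`), and uses the approximate block parameter count $12\,L\,n_{\text{embd}}^2$. Both are assumptions of this example rather than values fixed by the lab.

```python
N_LAYER, N_HEAD, BLOCK_SIZE = 4, 2, 1024
N_EMBD_MIN = 2  # assumption: smallest n_embd divisible by n_head


def flops_terms(scale: int):
    """Return (n_embd, weight_flops, attention_flops) per token for scale N."""
    n_embd = N_EMBD_MIN * scale
    assert n_embd % N_HEAD == 0, "n_embd must be divisible by n_head"
    weight_flops = 6 * 12 * N_LAYER * n_embd ** 2         # 6 * approx. block params
    attention_flops = 12 * N_LAYER * n_embd * BLOCK_SIZE  # n_head * head_dim == n_embd
    return n_embd, weight_flops, attention_flops


for scale in (1, 2, 4, 8, 16, 32):
    n_embd, w, a = flops_terms(scale)
    print(f"N={scale:>2}  n_embd={n_embd:>3}  weights={w:>9,}  "
          f"attention={a:>9,}  total={w + a:>10,}")
```

With `block_size=1024` and such small embeddings, the attention term (linear in `n_embd`) is larger than the weight term (quadratic in `n_embd`) over this range, which is worth keeping in mind when choosing the proxy.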

**Answer:** 

## Job wrapper

### Q6 — Write a bash wrapper for reproducible training runs

Write a bash function `run_nanogpt_job` that launches `nanoGPT/train.py` as a function of:

- `D`: total number of training tokens to process  
- `lr`: learning rate  
- `N`: a simple proxy for model scale (e.g., map `N -> N*n_embd`, while keeping `n_layer` and `n_head` fixed)

Your function must:

- compute `max_iters` so that the total processed tokens is approximately  
  $$
  D \approx \text{max\_iters} \times \text{batch\_size} \times \text{block\_size}
  $$
  (assuming `gradient_accumulation_steps=1`)
- set:
  - `batch_size = 10`
  - `block_size = 1024`
  - `warmup_iters = 0.3% * max_iters`
  - `lr_decay_iters = max_iters`
  - `min_lr = lr / 10`
- keep evaluation **disabled or negligible** during sweeps (e.g., `eval_interval=max_iters`)
- write outputs to a unique run directory named with `lr`, `D`, and `N` and save logs there

Test it.
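The derived quantities are easy to get wrong in shell arithmetic, so it can be useful to check them against a small reference. The sketch below builds the equivalent `train.py` command line in Python. It assumes the run starts from the repository's `config/train_shakespeare_char.py`, that nanoGPT's `--key=value` overrides are applied after it, and that `n_embd = n_embd_min * N` with `n_embd_min = 2`. The function name and directory naming scheme are hypothetical.

```python
def nanogpt_job_args(lr: float, D: int, N: int, root: str,
                     n_embd_min: int = 2, batch_size: int = 10,
                     block_size: int = 1024) -> list:
    """Build the train.py command line for one (lr, D, N) run."""
    tokens_per_iter = batch_size * block_size   # gradient_accumulation_steps=1
    max_iters = max(1, round(D / tokens_per_iter))
    warmup_iters = round(0.003 * max_iters)     # 0.3% of max_iters
    overrides = {
        "out_dir": f"{root}/lr{lr:g}_D{D}_N{N}",
        "n_layer": 4, "n_head": 2, "n_embd": n_embd_min * N,
        "batch_size": batch_size, "block_size": block_size,
        "gradient_accumulation_steps": 1,
        "max_iters": max_iters, "lr_decay_iters": max_iters,
        "warmup_iters": warmup_iters,
        "learning_rate": lr, "min_lr": lr / 10,
        "eval_interval": max_iters,             # evaluate only at the start and end
    }
    return ["python", "train.py", "config/train_shakespeare_char.py"] + [
        f"--{key}={value}" for key, value in overrides.items()
    ]


print(" ".join(nanogpt_job_args(1e-3, 100_000, 10, "out/test")))
```

The printed command can be compared with what the bash function launches, or run directly with `subprocess.run(args, cwd="nanoGPT", check=True)`.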

In [None]:
%%bash
set -euo pipefail

source ./nanoGPTwrapper.sh

ROOT="out/test/run_$(date +%Y%m%d_%H%M%S)"

# run_nanogpt_job <lr> <D_tokens> <N_model_scale> <root_out_dir> <job_idx>
run_nanogpt_job 1e-3 100000 10 "$ROOT" 0


## Finding the right hyperparameters

### Optimal learning rate

We assume the **compute-optimal learning rate** follows a power law in the compute budget $C$:

$$
\eta^\star(C) = a_{\mathrm{lr}}\,C^{\,b_{\mathrm{lr}}}
$$


Following the DeepSeek approach, we will keep **all other hyperparameters fixed** and vary only the quantities of interest. For simplicity and to design experiments, we assume the training compute scales as

$$
C \propto N\,D,
$$

i.e., FLOPs per token scale like $N$ and total compute is proportional to $N$ times the number of processed tokens $D$.


#### Q7 — Estimate $a_{\mathrm{lr}}$ and $b_{\mathrm{lr}}$

Design and run a sweep of **10+ training experiments** spanning a reasonable range of compute budgets $C$.  
For each budget, you should search over a small set of candidate learning rates and keep the one that achieves the **best validation loss**.

Use the following midpoint as a reference scale:

- $D_{\text{mid}} = 1{,}250{,}000$  (tokens)
- $N_{\text{mid}} = 10$            (model scale proxy)


Your goal is to collect pairs $\big(C, \eta^*(C)\big)$ and fit $a_{\mathrm{lr}}$ and $b_{\mathrm{lr}}$ via a log–log regression.

First, run the corresponding experiments and save all outputs under `out/lr`. Use **one subfolder per experiment** (e.g., named with the key hyperparameters like `lr`, `D`, and `N`) so that each run is easy to identify and parse later.

Then, aggregate results from all runs by reading each `output.pt`. For every run:

- extract the best validation loss,
- group runs by $(N, D)$, i.e., by compute budget $C \propto ND$,
- identify the learning rate that achieves the **lowest** validation loss for each $(N, D)$.
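The grouping step can be sketched independently of how the results are loaded. Here each run is a plain dictionary. In practice the fields would come from `torch.load` on each `output.pt`, with `N`, `D`, and `lr` recovered from the saved config or the folder name. The numbers in `example_runs` are made up to show the shape of the data and are not experimental results.

```python
from collections import defaultdict


def best_lr_per_budget(runs):
    """Group runs by (N, D) and keep the learning rate with the lowest val loss.

    `runs` is an iterable of dicts with keys "N", "D", "lr", "best_val_loss".
    Returns {(N, D): (lr, best_val_loss)}.
    """
    groups = defaultdict(list)
    for run in runs:
        groups[(run["N"], run["D"])].append(run)
    best = {}
    for key, members in groups.items():
        winner = min(members, key=lambda r: r["best_val_loss"])
        best[key] = (winner["lr"], winner["best_val_loss"])
    return best


# Made-up values, only to illustrate the input and output format.
example_runs = [
    {"N": 10, "D": 1_000_000, "lr": 1e-3, "best_val_loss": 2.1},
    {"N": 10, "D": 1_000_000, "lr": 3e-3, "best_val_loss": 1.9},
    {"N": 20, "D": 500_000, "lr": 1e-3, "best_val_loss": 2.0},
]
for (n, d), (lr, loss) in best_lr_per_budget(example_runs).items():
    print(f"N={n}, D={d:,}, C~{n * d:,}: best lr={lr:g} (val loss {loss})")
```

If the best learning rate for a group sits at the edge of the candidate set, the set is probably too narrow for that budget.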

Finally, collect the resulting pairs $\big(C, \eta^\star(C)\big)$ (with $C \propto ND$) for fitting the learning-rate scaling law.
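Since $\log \eta^\star = \log a_{\mathrm{lr}} + b_{\mathrm{lr}} \log C$, the fit reduces to ordinary linear regression on the logarithms. The sketch below uses the closed-form least-squares solution and checks it on synthetic points generated from a known law. The values `true_a` and `true_b` are arbitrary and carry no experimental meaning.

```python
import math


def fit_power_law(xs, ys):
    """Least-squares fit of log y = log a + b * log x; returns (a, b)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mx, my = sum(lx) / n, sum(ly) / n
    sxx = sum((u - mx) ** 2 for u in lx)
    sxy = sum((u - mx) * (v - my) for u, v in zip(lx, ly))
    b = sxy / sxx
    a = math.exp(my - b * mx)
    return a, b


# Synthetic check: points drawn from a known law should be recovered.
true_a, true_b = 0.5, -0.25
budgets = [1e6, 1e7, 1e8, 1e9]
best_lrs = [true_a * c ** true_b for c in budgets]
a_lr, b_lr = fit_power_law(budgets, best_lrs)
print(f"a_lr={a_lr:.4f}, b_lr={b_lr:.4f}")
```

With real sweeps the points will scatter around the line, so it is worth plotting them on log-log axes alongside the fitted curve before trusting the exponent.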

## Q8. Scaling rules sweep

Here, we pick:
- $D_{\text{mid}} = 1{,}250{,}000$  (tokens)
- $N_{\text{mid}} = 16$            (model scale proxy)

We now sweep around a midpoint $(D_{\text{mid}}, N_{\text{mid}})$ and scan along (approximate) **compute iso-FLOPs** to estimate scaling rules.

We will build multiple groups indexed by $q$ (a coarse scale factor) so that $(D_q,N_q)=(q\times D_{\text{mid}}, q\times N_{\text{mid}})$. Here we take $q$ among the powers of $10^{1/4}$, so that the groups span multiplicative factors from 0.1 to 10. For each group, we generate several $(N, D)$ pairs by trading off dataset size $D$ against model scale $N$ while keeping the **total compute** roughly fixed.

For the learning rate, we use the rule fitted above:
$$
\eta^\star(C) = a_{\mathrm{lr}}\,C^{\,b_{\mathrm{lr}}}.
$$

To move along an iso-FLOP, we apply a multiplier $m$ such that:
$$
D \leftarrow D_q \cdot m,
\qquad
N \leftarrow \operatorname{int}\!\left(\frac{N_q}{m}\right),
$$
which keeps $N \cdot D$ roughly constant.
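The grid of runs can be generated directly from the two update rules. In the sketch below, the specific lists `q_values` and `multipliers` are hypothetical choices for illustration, and the function name is not part of the lab.

```python
D_MID, N_MID = 1_250_000, 16


def isoflop_grid(q_values, multipliers):
    """Return {q: [(N, D), ...]} using D = D_q * m and N = int(N_q / m)."""
    grid = {}
    for q in q_values:
        d_q, n_q = q * D_MID, q * N_MID
        points = []
        for m in multipliers:
            n = int(n_q / m)
            if n >= 1:  # skip degenerate models with N == 0
                points.append((n, int(d_q * m)))
        grid[q] = points
    return grid


# Hypothetical choices of q and m, for illustration only.
q_values = [10 ** (k / 4) for k in range(-2, 3)]
multipliers = [0.25, 0.5, 1.0, 2.0, 4.0]
for q, points in isoflop_grid(q_values, multipliers).items():
    print(f"q={q:.2f}: " + ", ".join(f"(N={n}, D={d:,})" for n, d in points))
```

Because $N$ is truncated to an integer, the product $N \cdot D$ drifts slightly within a group, more so for small $N$. Recording the realized $C = N \cdot D$ for each run, rather than the nominal group value, avoids carrying that error into the fits.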

Perform this parameter sweep.

Plot the iso-FLOP curves for the corresponding experiments.

## Q9 — Scaling laws

Using the sweeps, estimate scaling laws:

- **Model size vs. compute:**  
  $$
  N = a_N\, C^{b_N}
  $$
- **Data size vs. compute:**  
  $$
  D = a_D\, C^{b_D}
  $$

where the compute proxy is:
$$
C = N \cdot D
$$

Also fit the loss–compute relationship:
$$
L = a_L\, C^{b_L}
$$

Use **one best point per** `q` **group**. You may use either:

- `scipy.optimize.minimize`, or
- `np.linalg.lstsq`.
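With `np.linalg.lstsq`, each law becomes a linear system in $(\log a, b)$. The sketch below fits $N$ and $D$ against $C$ on synthetic best-per-group points. The generating coefficients (0.1 and 0.3) are arbitrary and are not results from this lab, and the function name is hypothetical.

```python
import numpy as np


def fit_power_law_lstsq(C, y):
    """Fit y = a * C**b by linear least squares in log-log space."""
    C = np.asarray(C, dtype=float)
    y = np.asarray(y, dtype=float)
    A = np.column_stack([np.ones_like(C), np.log(C)])  # columns: [1, log C]
    coef, *_ = np.linalg.lstsq(A, np.log(y), rcond=None)
    return float(np.exp(coef[0])), float(coef[1])


# Synthetic best-per-group points (not experimental results).
C = np.array([2e6, 2e7, 2e8, 2e9])
N_best = 0.1 * C ** 0.3
D_best = C / N_best
a_N, b_N = fit_power_law_lstsq(C, N_best)
a_D, b_D = fit_power_law_lstsq(C, D_best)
print(f"N = {a_N:.3f} * C^{b_N:.3f}")
print(f"D = {a_D:.3f} * C^{b_D:.3f}")
print(f"b_N + b_D = {b_N + b_D:.3f}")
```

When $C$ is computed as exactly $N \cdot D$ for each selected point, the two fits are tied together: $b_N + b_D = 1$ and $a_N a_D = 1$. Checking this is a cheap way to catch bookkeeping errors. The same function applies to the loss fit by passing the best validation losses as `y`.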


## Q10 — Manipulation of the Scaling Law

Run an experiment with a budget **$10 \times$ larger** than the largest budget used in the previous parameter sweep. How would you pick the hyperparameters?


In [None]:
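C0 = 1250000 * 16 * 100  # target budget in N*D proxy units

# Assumes the fitted coefficients from Q7 and Q9 are already defined in the
# notebook under these (hypothetical) names: a_N, b_N, a_lr, b_lr, a_L, b_L.
N_int = max(1, round(a_N * C0 ** b_N))  # model scale from N = a_N * C^b_N
D_int = C0 // N_int                     # tokens, so that N * D stays close to C0
lr_hat = a_lr * C0 ** b_lr              # learning rate from eta*(C)
pred_L = a_L * C0 ** b_L                # loss from L = a_L * C^b_L

print("Target C:", C0)
print("Recommended D:", D_int)
print("Recommended N:", N_int)
print("Recommended lr:", lr_hat)
print("Predicted loss (rough):", pred_L)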

In [None]:
%%bash
set -euo pipefail

source ./nanoGPTwrapper.sh

ROOT="out/final/shakespeare_char_$(date +%Y%m%d_%H%M%S)"

# run_nanogpt_job <lr> <D> <N> <root> <job_idx>

## Q11 — Conclusion

Finally, plot the previous **final large-budget run** together with the **fitted curve(s)** on the same figure. Discuss.