# nanoGPT in Torchtitan (Shakespeare-char)
__Designed by Edouard Oyallon__

This notebook guides you through adding a **minimal `nanogpt` model** to torchtitan and reproducing the classic **tiny Shakespeare (char-level)** experiment. Refer to nanoGPT’s source code as the implementation reference.

You will:
- run `torchtitan`'s provided debug config
- add a `shakespeare_char` dataset loader + random-block iterable dataset
- add a nanoGPT-like character tokenizer
- add a minimal nanoGPT model under `torchtitan/models/nanogpt/`
- add nanoGPT-style optimizer grouping (decay only for 2D params)
- add model compilation (e.g. via `torch.compile`)
- reproduce a mini-gpt config
- use wandb to verify your results and compare them with `nanogpt`


## 1) Clone TorchTitan (or reuse existing)

If you already have a `torchtitan/` folder, you can skip cloning.

> TorchTitan often expects a recent PyTorch nightly when using bleeding-edge distributed/float8 features.
For this lab (single GPU debug), you can often proceed with stable PyTorch too — but if your checkout requires nightly, install it as in their README.


In [None]:
%%bash
set -euo pipefail

# If torchtitan already exists, do nothing
if [ ! -d torchtitan ]; then
  git clone https://github.com/pytorch/torchtitan.git
fi

cd torchtitan

## 2) Inspect `run_train.sh` and run the provided debug config

`torchtitan` uses a config-first engine (TOML + CLI overrides).
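
To make the "TOML + CLI overrides" idea concrete, here is a toy sketch of how a dotted override such as `--training.dataset shakespeare_char_train` could be layered on top of values read from a config file. This is not torchtitan's actual config manager: the `apply_override` helper and the sample values are assumptions made for illustration only.

```python
def apply_override(config: dict, dotted_key: str, value) -> dict:
    """Set config[section][key] = value for a key written as 'section.key'."""
    section, key = dotted_key.split(".", 1)
    config.setdefault(section, {})[key] = value
    return config

# Nested dict standing in for a parsed TOML file (values are made up).
config = {"training": {"dataset": "c4", "steps": 10}}

# Equivalent in spirit to: ./run_train.sh --training.dataset shakespeare_char_train
apply_override(config, "training.dataset", "shakespeare_char_train")
print(config["training"])
```

The point of the design is that the file holds the defaults while the command line only states what differs for a given run.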

### How to change variables in `run_train.sh`

Typically by setting env vars before calling it:
- `NGPU=1`
- `CONFIG_FILE=...`

Let's run the official Llama3 debug config on 1 GPU. Explore the config file and train.py.


In [None]:
%%bash
set -euo pipefail

cd torchtitan

ls -la run_train.sh
sed -n '1,200p' run_train.sh

NGPU=1 CONFIG_FILE="./torchtitan/models/llama3/train_configs/debug_model.toml" ./run_train.sh

cd ..


The new model folder will have the following structure:

```text
torchtitan/
  torchtitan/
    models/
      nanogpt/
        __init__.py # registry file
        infra/
          parallelize.py
        model/
          __init__.py
          args.py
          model.py
        train_configs/
          debug_model.toml
          mini-gpt.toml
```

Look over the `llama3` folder to make sure you understand this structure. In particular, check its registry file (`__init__.py`).



## 3) Add Shakespeare-char dataset loader + random-block dataset

We will:
- register a new dataset name (e.g. `shakespeare_char_train` / `shakespeare_char_validation`)
- implement a data loader that downloads tiny Shakespeare text
- tokenize it (later) and then sample nanoGPT-style random windows via `RandomBlockTokenDataset`.
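
The random-window idea can be sketched on a toy sequence before touching any torchtitan code. The snippet below uses plain Python lists and the standard `random` module as stand-ins for the token tensor and `torch.Generator`; the `sample_block` helper is hypothetical and only illustrates that the target `y` is the input `x` shifted by one position.

```python
import random

def sample_block(tokens: list[int], seq_len: int, rng: random.Random):
    """Return (x, y) where y is x shifted right by one token."""
    # The last valid start leaves room for seq_len inputs plus one extra target.
    i = rng.randint(0, len(tokens) - seq_len - 1)  # randint is inclusive on both ends
    x = tokens[i : i + seq_len]
    y = tokens[i + 1 : i + 1 + seq_len]
    return x, y

tokens = list(range(100))      # toy "token ids": the token at position k has value k
rng = random.Random(64 + 0)    # seed + dp_rank, mirroring the per-rank seeding idea
x, y = sample_block(tokens, seq_len=8, rng=rng)
print(x, y)
```

Because each data-parallel rank would use a different seed offset, ranks draw different windows from the same underlying sequence.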

We implement these in `torchtitan/hf_datasets/text_datasets.py`. Modify the function `build_text_dataloader` so that it handles the new dataset separately, using the code given below.


In [None]:
def _load_shakespeare_char_dataset(dataset_path, split: str):
    """Load the Shakespeare char dataset with default configuration.
       Inspired by karpathy/tiny_shakespeare.
    """

    import requests
    from datasets import Dataset, DatasetDict

    url = "https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt"
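    response = requests.get(url, timeout=30)
    response.raise_for_status()  # fail loudly instead of tokenizing an error page
    text = response.text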

    i = int(len(text) * 0.9)
    train_text, val_text = text[:i], text[i:]
    
    ds = DatasetDict({
        "train": Dataset.from_dict({"text": [train_text]}),
        "validation": Dataset.from_dict({"text": [val_text]})
    })
    return ds[split]

class RandomBlockTokenDataset(IterableDataset, Stateful):
    """
    nanoGPT-style: sample random contiguous blocks from one long token sequence.
    Yields single examples; DataLoader batching gives you local_batch_size.
    """

    def __init__(
        self,
        tokens: torch.Tensor,          # 1D LongTensor
        seq_len: int,
        dp_rank: int = 0,
        seed: int = 64,
        infinite: bool = True,
    ) -> None:
        super().__init__()
        assert tokens.dim() == 1
        assert tokens.dtype == torch.long
        assert len(tokens) > seq_len + 1

        self.tokens = tokens
        self.seq_len = seq_len
        self.infinite = infinite

        # Per-rank RNG so different DP ranks sample different windows
        self.seed = int(seed)
        self.dp_rank = int(dp_rank)
        self._steps = 0  # for checkpointing

    def __iter__(self):
        g = torch.Generator()
        g.manual_seed(self.seed + self.dp_rank)

        max_start = len(self.tokens) - (self.seq_len + 1)

        while True:
            # draw one random start index; randint's upper bound is exclusive, so add 1
            i = int(torch.randint(0, max_start + 1, (1,), generator=g).item())
            x = self.tokens[i : i + self.seq_len]
            y = self.tokens[i + 1 : i + 1 + self.seq_len]
            self._steps += 1
            yield {"input": x}, y

            if not self.infinite:
                break

    def state_dict(self):
        return {"steps": self._steps}

    def load_state_dict(self, state_dict):
        self._steps = int(state_dict.get("steps", 0))

### Quick registry test

If `torchtitan` uses `DATASETS[...]` in that file, this should allow:

```
./run_train.sh --training.dataset shakespeare_char_train
```

If this override does not work immediately, revise your adaptation of the code.


In [None]:
%%bash
set -euo pipefail

cd torchtitan

NGPU=1 CONFIG_FILE="./torchtitan/models/llama3/train_configs/debug_model.toml" ./run_train.sh --training.dataset shakespeare_char_train

cd ..


## 4) Add a nanoGPT-style character tokenizer

We create `NanoGPTCharTokenizer` and a small builder function.

- vocab = `sorted(set(text))`
- encode: map each char to id
- decode: map id to char
- no BOS/EOS inserted
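
The four points above can be sketched on a toy string. This is a minimal standalone illustration of the character-level mapping, not the `NanoGPTCharTokenizer` class itself; the `encode`/`decode` functions and the toy corpus are assumptions for the example.

```python
text = "hello world"  # toy corpus standing in for the Shakespeare text

chars = sorted(set(text))                      # vocabulary, in sorted order
stoi = {ch: i for i, ch in enumerate(chars)}   # char -> id
itos = {i: ch for ch, i in stoi.items()}       # id -> char

def encode(s: str) -> list[int]:
    """Map each character to its id; no BOS/EOS is added."""
    return [stoi[c] for c in s]

def decode(ids: list[int]) -> str:
    """Map each id back to its character."""
    return "".join(itos[i] for i in ids)

ids = encode("hello")
print(ids, decode(ids))
```

Since the vocabulary is built from the text itself, encoding a character that never appears in the corpus raises a `KeyError` in this sketch.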

File: `torchtitan/components/tokenizer/nanogpt_char_tokenizer.py`. Adapt the following code:

In [None]:
class NanoGPTCharTokenizer(BaseTokenizer):
    """
    Character-level tokenizer that matches nanoGPT/data/shakespeare_char/prepare.py:
      - vocab = sorted(set(full_text))
      - ids = enumerate(vocab)
      - encode ignores bos/eos so TorchTitan won't inject extra tokens
    """
    def __init__(self, url: str):
        super().__init__()
        import requests
        text = requests.get(url).text

        chars = sorted(set(text))  # matches nanoGPT's chars = sorted(list(set(data)))
        self.stoi = {ch: i for i, ch in enumerate(chars)}
        self.itos = {i: ch for ch, i in self.stoi.items()}
        self._n_words = len(chars)

        # TorchTitan tokenizers often expose these attrs; keep them defined.
        self.bos_id = -1 # that's the token for the beginning of sequence
        self.eos_id = self._n_words # that's the token for the end of sequence

    def encode(self, *args, **kwargs) -> list[int]:
        # Extract arguments
        if len(args) >= 1:
            text = args[0]
        else:
            text = kwargs.get("text", "")
        ...
        return token_ids

    def decode(self, *args, **kwargs) -> str:
        """
        Decode token IDs back to text.

        Args:
            token_ids (list[int]): List of token IDs to decode
            **kwargs: Additional arguments passed to the underlying tokenizer's decode method
                     (e.g., skip_special_tokens)

        Returns:
            str: Decoded text
        """
        # Extract token_ids from arguments
        if len(args) >= 1:
            token_ids = args[0]
            # Pass through remaining kwargs
            return ...
        else:
            token_ids = kwargs.pop("token_ids", [])
            # Pass through remaining kwargs after removing token_ids
            return ...
    
    def get_vocab_size(self) -> int:
        return self._n_words


## 5) Add nanoGPT-style optimizer grouping (decay only on 2D tensors)

nanoGPT applies weight decay only to parameters that are **matrix-shaped** (`dim >= 2`), which excludes biases and LayerNorm vectors; verify this in the nanoGPT implementation.
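
As a minimal sketch of this grouping rule, assuming a standalone helper rather than the mixin you are asked to write, parameters can be split by their number of dimensions and handed to AdamW as two parameter groups. The name `split_decay_groups` and the toy model are hypothetical.

```python
import torch
from torch import nn

def split_decay_groups(model: nn.Module, weight_decay: float) -> list[dict]:
    """Hypothetical helper: decay tensors with dim >= 2, exempt the rest."""
    params = [p for p in model.parameters() if p.requires_grad]
    decay = [p for p in params if p.dim() >= 2]      # weight matrices, embeddings
    no_decay = [p for p in params if p.dim() < 2]    # biases, LayerNorm vectors
    return [
        {"params": decay, "weight_decay": weight_decay},
        {"params": no_decay, "weight_decay": 0.0},
    ]

model = nn.Sequential(nn.Linear(4, 8), nn.LayerNorm(8))
groups = split_decay_groups(model, weight_decay=0.1)
optimizer = torch.optim.AdamW(groups, lr=1e-3)
print([len(g["params"]) for g in groups])
```

In the toy model only the `Linear` weight is two-dimensional, so it is the only tensor that receives weight decay.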

Implement:
- `_NanoGPTDecayByDimMixin`
- `NanoGPTAdamW`
- `NanoGPTAdam`

Then, patch `torchtitan/components/optimizer.py` by appending classes at the end and adapting the following code:


In [None]:

class _NanoGPTDecayByDimMixin:
    def __init__(self, params, *args, weight_decay: float = 0.0, **kwargs):
        params = [p for p in params if p.requires_grad]

        decay = [p for p in params if ...]
        nodecay = [p for p in params if ...]
        # adapt the rest of the code: build `groups` from `decay` and `nodecay`
        super().__init__(groups, *args, **kwargs)

class NanoGPTAdamW(_NanoGPTDecayByDimMixin, torch.optim.AdamW):
    pass


class NanoGPTAdam(_NanoGPTDecayByDimMixin, torch.optim.Adam):
    pass

## 6) Implement `nanogpt/model.py` (port nanoGPT to TorchTitan)

Adapt nanoGPT’s GPT model to TorchTitan’s model structure, keeping the code **minimal and clean**. We add:
- `torchtitan/models/nanogpt/model/args.py` — dataclass config
- `torchtitan/models/nanogpt/model/model.py` — minimal GPT
- `debug_model.toml`: very small, fast sanity check
- init files

**References**
- nanoGPT source: https://github.com/karpathy/nanoGPT/blob/master/model.py#L118  
- `torchtitan` style example (Llama3): https://github.com/pytorch/torchtitan/blob/main/torchtitan/models/llama3/model/model.py

### Requirements
- **No clutter:** avoid unnecessary helper functions; remove unused code.
- **Rename:** use `self.layers` instead of `self.transformer`.
- **Initialization:** add `init_weights()`, which initializes every parameter. Every submodule (Attention/MLP/Block/Model) must implement its own `init_weights()` and is responsible for initializing its own submodules.
- **Device:** don’t store a custom device; derive it from a parameter/buffer (e.g., embedding weight device).
- **Forward contract:** input format similar to TorchTitan models; output **logits only**.  
  Remove nanoGPT’s extra “optimization” logic in the forward.
- **Layer container:** replace `nn.ModuleList([...])` with a **named** container (e.g. `nn.ModuleDict`) so layers are trackable for transformations (e.g. activation checkpointing).
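
Several of these requirements (named layer container, per-module `init_weights()`, derived device, logits-only forward) can be illustrated with a deliberately tiny stand-in. This is a sketch of the pattern only, not a GPT: `TinyBlock`, `TinyModel`, the attribute names, and the sizes are all hypothetical.

```python
import torch
from torch import nn

class TinyBlock(nn.Module):
    """Hypothetical stand-in for a transformer block."""
    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def init_weights(self) -> None:
        nn.init.normal_(self.proj.weight, mean=0.0, std=0.02)
        nn.init.zeros_(self.proj.bias)

    def forward(self, x):
        return x + self.proj(x)

class TinyModel(nn.Module):
    def __init__(self, vocab_size: int, dim: int, n_layers: int):
        super().__init__()
        self.tok_embeddings = nn.Embedding(vocab_size, dim)
        # Named container: keys "0", "1", ... instead of an anonymous ModuleList.
        self.layers = nn.ModuleDict({str(i): TinyBlock(dim) for i in range(n_layers)})
        self.output = nn.Linear(dim, vocab_size, bias=False)

    def init_weights(self) -> None:
        nn.init.normal_(self.tok_embeddings.weight, mean=0.0, std=0.02)
        for layer in self.layers.values():
            layer.init_weights()  # each block initializes its own parameters
        nn.init.normal_(self.output.weight, mean=0.0, std=0.02)

    def forward(self, tokens):
        h = self.tok_embeddings(tokens)
        for layer in self.layers.values():
            h = layer(h)
        return self.output(h)  # logits only; the loss is computed elsewhere

model = TinyModel(vocab_size=65, dim=16, n_layers=2)
model.init_weights()
device = model.tok_embeddings.weight.device  # derived from a parameter, not stored
logits = model(torch.zeros(1, 8, dtype=torch.long, device=device))
print(logits.shape)
```

Keeping the layers under string keys means each block has a stable, addressable name, which is what wrappers applied per layer rely on.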

### Integration + test
- Export/register the model: update `torchtitan/models/__init__.py` (and local `nanogpt/__init__.py`).
- Test using a **modified Llama3 debug config**, but keep **dataset = C4** for now.
- Ensure `vocab_size` matches the active tokenizer.
- For this lab phase set:
  - `parallelize_fn = lambda m, *args, **kwargs: m`
  - `pipelining_fn = None`
- Propose several configurations that match those provided in nanoGPT.

Test and debug an initial version.


In [None]:
%%bash
set -euo pipefail

cd torchtitan/
NGPU=1 CONFIG_FILE="./torchtitan/models/nanogpt/train_configs/debug_model.toml" ./run_train.sh
cd ..

## 7) Add minimal `nanogpt` train configs

Create:

- `mini-gpt.toml`: using nanoGPT train_shakespeare_char defaults

Test your model on a few iterations to make sure it runs.
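
As a starting point for the TOML file, the hyperparameters below are listed as I recall them from nanoGPT's `config/train_shakespeare_char.py`; treat them as an assumption and verify them against your nanoGPT checkout. The dict name is hypothetical, and the mapping of each entry to a torchtitan config key is left to you.

```python
# Values recalled from nanoGPT's config/train_shakespeare_char.py;
# verify against the actual file before relying on them.
nanogpt_shakespeare_char = {
    "n_layer": 6,
    "n_head": 6,
    "n_embd": 384,
    "dropout": 0.2,
    "block_size": 256,
    "batch_size": 64,
    "gradient_accumulation_steps": 1,
    "learning_rate": 1e-3,
    "min_lr": 1e-4,
    "max_iters": 5000,
    "lr_decay_iters": 5000,
    "warmup_iters": 100,
    "beta2": 0.99,
}

cfg = nanogpt_shakespeare_char
tokens_per_iter = (
    cfg["batch_size"] * cfg["block_size"] * cfg["gradient_accumulation_steps"]
)
print("tokens per iteration:", tokens_per_iter)
```

Matching the number of tokens per iteration is a quick first check that batch size, sequence length, and gradient accumulation agree between the two setups.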

## 8) Compare both implementations (TorchTitan vs nanoGPT)

- Update **nanoGPT** so it logs the **training loss at every iteration** (same frequency as TorchTitan).
- Run **both** experiments with **matched hyperparameters** (same config as much as possible).
- Use **Weights & Biases** to compare:
  - training loss curve
  - validation loss curve
  - learning rate schedule

If the curves don’t match closely, re-check the usual suspects:
- learning rate (and warmup/decay schedule)
- batch size / gradient accumulation
- sequence length / block size
- weight decay and optimizer settings (decay grouping)
- model initialization (std, tied weights, bias flags)
- tokenizer / vocabulary size and dataset split
- dropout / seed / randomness
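
Since the learning rate schedule is among the most common sources of mismatch, it helps to have a reference curve to plot against both runs. The function below is a sketch of a linear-warmup plus cosine-decay schedule in the style of nanoGPT; the exact warmup formula varies between nanoGPT versions and may differ from TorchTitan's scheduler, so the details and the example values are assumptions to check against your own code.

```python
import math

def get_lr(it: int, max_lr: float, min_lr: float,
           warmup_iters: int, decay_iters: int) -> float:
    """Linear warmup, then cosine decay to min_lr, then constant min_lr."""
    if it < warmup_iters:
        return max_lr * (it + 1) / (warmup_iters + 1)
    if it > decay_iters:
        return min_lr
    ratio = (it - warmup_iters) / (decay_iters - warmup_iters)
    coeff = 0.5 * (1.0 + math.cos(math.pi * ratio))  # goes from 1 down to 0
    return min_lr + coeff * (max_lr - min_lr)

# Example values only; use the ones from your own config.
for step in (0, 100, 2550, 5000):
    print(step, get_lr(step, max_lr=1e-3, min_lr=1e-4, warmup_iters=100, decay_iters=5000))
```

Logging this value next to the learning rate reported by each trainer makes a schedule mismatch visible within the first few hundred steps.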



In [None]:

WANDB_PROJECT="shakespeare-char" WANDB_NAME="torchtitan-mini-gpt" \
  NGPU=1 CONFIG_FILE="./torchtitan/models/nanogpt/train_configs/mini-gpt.toml" \
  ./run_train.sh --metrics.enable-wandb

python train.py config/train_shakespeare_char.py   --wandb_log=True 

## 9) Load a Hugging Face checkpoint and run validation

Load a pretrained nanoGPT checkpoint from Hugging Face into TorchTitan (via **DCP**) and verify that TorchTitan reports **a validation loss similar to nanoGPT's** on Shakespeare-char.

> **DCP** = PyTorch Distributed Checkpointing (`torch.distributed.checkpoint`), suitable for sharded checkpoints (FSDP/TP/PP).

Inspect `sosier/nanoGPT-shakespeare-char-tied-weights` on Hugging Face. Download the model and convert it to DCP using the script below.

```bash
hf download sosier/nanoGPT-shakespeare-char-tied-weights --local-dir hf_model
```

Run the validation test and follow the same load-only + validate procedure as in the previous step, and compare the validation losses between nanoGPT and `torchtitan`. They should closely match.
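
If your TorchTitan model renames modules (for example `self.layers` instead of `self.transformer`), the checkpoint keys may need to be renamed before saving to DCP. The sketch below shows one way to do that on a plain dict. The left-hand prefixes follow nanoGPT's module names; the right-hand names, the `remap_keys` helper, and whether your checkpoint carries a `_orig_mod.` prefix (added when a model is saved after `torch.compile`) are assumptions to adapt.

```python
def remap_keys(state_dict: dict, prefix_map: dict[str, str],
               strip_prefix: str = "_orig_mod.") -> dict:
    """Return a new dict with renamed keys; values are left untouched."""
    out = {}
    for key, value in state_dict.items():
        if key.startswith(strip_prefix):
            key = key[len(strip_prefix):]
        for old, new in prefix_map.items():
            if key.startswith(old):
                key = new + key[len(old):]
                break
        out[key] = value
    return out

# Hypothetical mapping: adjust the right-hand sides to your own module names.
prefix_map = {"transformer.h.": "layers.", "transformer.wte.": "tok_embeddings."}

# Toy state dict; integers stand in for tensors.
toy_sd = {
    "_orig_mod.transformer.h.0.attn.c_attn.weight": 1,
    "transformer.wte.weight": 2,
    "lm_head.weight": 3,
}
print(sorted(remap_keys(toy_sd, prefix_map)))
```

Comparing the remapped keys with `model.state_dict().keys()` before loading is a cheap way to catch missing or unexpected entries.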

In [None]:
import argparse
import torch
import torch.distributed.checkpoint as dcp

def main():
    ap = argparse.ArgumentParser()
    ap.add_argument("--in_bin", required=True)
    ap.add_argument("--out_dcp", required=True)
    args = ap.parse_args()

    sd = torch.load(args.in_bin, map_location="cpu")
    dcp.save(sd, checkpoint_id=args.out_dcp)
    print("Wrote DCP checkpoint to", args.out_dcp)

if __name__ == "__main__":
    main()


In [None]:
WANDB_PROJECT="shakespeare-char" WANDB_NAME="torchtitan-mini-gpt" \
NGPU=1 CONFIG_FILE="./torchtitan/models/nanogpt/train_configs/mini-gpt.toml" \
./run_train.sh \
  --training.steps 1 \
  --metrics.no-enable-wandb \
  --checkpoint.enable \
  --checkpoint.load_only \
  --checkpoint.initial_load_path ../hf_model/nanogpt_char_dcp \
  --validation.freq 1 \
  --optimizer.lr 0


## 10) Compile

Take inspiration from the Llama3 infrastructure in `infra/parallelize`. Add a **nanoGPT-specific** `parallelize` function whose only job is to:

- **compile the model** (e.g. via `torch.compile`)
- **return the compiled model**
- **raise an error** if any other parallelization mode is requested (TP/PP/FSDP/activation checkpointing, etc.)
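
The three responsibilities above can be sketched as follows. The attribute names on `parallel_dims` and `job_config` are assumptions modeled on what the Llama3 `parallelize` code appears to use, and they may differ in your checkout; the `SimpleNamespace` objects are stand-ins so the snippet runs on its own.

```python
from types import SimpleNamespace

import torch
from torch import nn

def parallelize_nanogpt(model: nn.Module, parallel_dims, job_config) -> nn.Module:
    """Sketch: compile only, and reject every other parallelism mode."""
    # Attribute names are assumptions; check them against your torchtitan version.
    ac = getattr(job_config, "activation_checkpoint", None)
    unsupported = {
        "tensor parallel": getattr(parallel_dims, "tp_enabled", False),
        "pipeline parallel": getattr(parallel_dims, "pp_enabled", False),
        "FSDP": getattr(parallel_dims, "dp_shard_enabled", False),
        "activation checkpointing": getattr(ac, "mode", "none") != "none",
    }
    requested = [name for name, enabled in unsupported.items() if enabled]
    if requested:
        raise NotImplementedError(
            f"nanogpt supports compile only; got: {', '.join(requested)}"
        )
    if job_config.compile.enable:
        model = torch.compile(model)
    return model

# Stand-in config objects (hypothetical), with compilation left off here.
dims = SimpleNamespace(tp_enabled=False, pp_enabled=False, dp_shard_enabled=False)
cfg = SimpleNamespace(
    compile=SimpleNamespace(enable=False),
    activation_checkpoint=SimpleNamespace(mode="none"),
)
model = parallelize_nanogpt(nn.Linear(4, 4), dims, cfg)
print(type(model).__name__)
```

Failing fast on unsupported modes avoids silently training an unsharded model when a config asks for TP, PP, or FSDP.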

To do so, implement `parallelize_nanogpt(...)` in the appropriate `infra/parallelize` module. Then run the nanoGPT model with compilation enabled. How does the reported MFU compare with nanoGPT's?
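
When comparing MFU figures, it helps to know how nanoGPT arrives at its number. The sketch below follows the structure of nanoGPT's `estimate_mfu` as I understand it (a per-token FLOP estimate divided by an assumed A100 bfloat16 peak of 312 TFLOPS). The parameter count and step time in the example are illustrative only, and TorchTitan's own FLOP accounting and peak-FLOPS lookup may differ, so compare like with like.

```python
def estimate_mfu(n_params: float, n_layer: int, n_head: int, n_embd: int,
                 block_size: int, fwdbwd_per_iter: int, dt: float,
                 peak_flops: float = 312e12) -> float:
    """Model FLOPs utilization as a fraction of an assumed peak throughput."""
    head_dim = n_embd // n_head
    # 6 FLOPs per parameter per token, plus an attention term.
    flops_per_token = 6 * n_params + 12 * n_layer * n_head * head_dim * block_size
    flops_per_fwdbwd = flops_per_token * block_size
    flops_per_iter = flops_per_fwdbwd * fwdbwd_per_iter
    return (flops_per_iter / dt) / peak_flops

# Illustrative numbers: a roughly 10.65M-parameter model, batch size 64, 0.1 s per step.
mfu = estimate_mfu(n_params=10.65e6, n_layer=6, n_head=6, n_embd=384,
                   block_size=256, fwdbwd_per_iter=64, dt=0.1)
print(f"MFU: {mfu:.2%}")
```

Two implementations can report different MFU for the same throughput if they count FLOPs or peak hardware performance differently, so tokens per second is a useful cross-check.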

In [None]:
%%bash
set -euo pipefail

cd torchtitan

WANDB_PROJECT="shakespeare-char" WANDB_NAME="torchtitan-mini-gpt" NGPU=1 CONFIG_FILE="./torchtitan/models/nanogpt/train_configs/mini-gpt.toml" ./run_train.sh  --compile.enable