
zero_optim_toy

A minimal, from-scratch implementation of ZeRO Stage 1 (Distributed Optimizer) for educational purposes.

Chinese version

Architecture

Three-layer design following the real distributed optimizer stack:

Buffer  →  DDP  →  Distributed Optimizer
Layer                  File                      Responsibility
Buffer                 buffer.py                 Flat contiguous storage, padding, shard view, all-gather / reduce-scatter
DDP                    ddp.py                    Param / grad buffer creation, remapping .data and .main_grad, gradient and parameter sync
Distributed Optimizer  distributed_optimizer.py  fp32 Adam on 1/N shard, bf16 writeback
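The Buffer layer's padding and shard-view logic can be sketched in a few lines. This is a minimal illustration with plain Python lists standing in for tensors; the class and method names here are hypothetical, and the real API lives in buffer.py:

```python
import math

# Minimal illustration of the Buffer layer: flat storage padded so it
# splits evenly into world_size shards. Plain lists stand in for tensors;
# class/method names are hypothetical -- see buffer.py for the real thing.
class FlatBuffer:
    def __init__(self, numel: int, world_size: int):
        self.shard_numel = math.ceil(numel / world_size)
        self.padded_numel = self.shard_numel * world_size
        self.data = [0.0] * self.padded_numel

    def shard(self, rank: int):
        # The contiguous 1/N slice owned by `rank`; in the real buffer this
        # is a view that reduce-scatter writes into and all-gather reads from.
        start = rank * self.shard_numel
        return self.data[start:start + self.shard_numel]

buf = FlatBuffer(numel=10, world_size=4)
print(buf.padded_numel)    # 12 (10 padded up to 3 * 4)
print(len(buf.shard(3)))   # 3
```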

What ZeRO-1 shards

Only optimizer state (fp32 master params + Adam m/v) is sharded. Each rank keeps full copies of bf16 model parameters and gradients — no forward/backward hooks needed.
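The resulting savings are easy to estimate: with bf16 parameters and gradients replicated (2 bytes each per rank) and the fp32 master weights plus Adam m/v (4 bytes each) sharded, per-rank memory per parameter drops from 16 bytes to 4 + 12/N. A back-of-the-envelope check (the helper name is illustrative):

```python
def bytes_per_param(world_size: int) -> float:
    # Replicated on every rank: bf16 params (2 B) + bf16 grads (2 B).
    replicated = 2 + 2
    # Sharded 1/N per rank: fp32 master (4 B) + Adam m (4 B) + Adam v (4 B).
    sharded = (4 + 4 + 4) / world_size
    return replicated + sharded

print(bytes_per_param(1))   # 16.0 -- unsharded baseline
print(bytes_per_param(4))   # 7.0
print(bytes_per_param(8))   # 5.5
```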

Training step

forward           full bf16 params in param_buffer
backward          grads written to param.grad
sync_grads()      copy param.grad → grad_buffer, reduce-scatter(SUM)
optimizer.step()  grad shard bf16→fp32, Adam, fp32→bf16 writeback, all-gather
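The steps above can be condensed into a sketch of one training iteration. `sync_grads()` and `step()` come from the step list; `train_step`, `loss_fn`, and `zero_grad()` are assumed names, so check ddp.py and distributed_optimizer.py for the exact API:

```python
# Sketch of one ZeRO-1 training step, assuming a DDP wrapper and a
# distributed optimizer shaped like the step list above.
def train_step(ddp_model, optimizer, batch, loss_fn):
    out = ddp_model(batch)      # forward: full bf16 params in param_buffer
    loss = loss_fn(out)
    loss.backward()             # backward: grads written to param.grad
    ddp_model.sync_grads()      # param.grad -> grad_buffer, reduce-scatter(SUM)
    optimizer.step()            # bf16->fp32 grad shard, Adam, fp32->bf16
                                # writeback, all-gather updated params
    optimizer.zero_grad()       # assumed helper: clear grads for the next step
    return loss
```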

Mixed precision flow

grad_buffer (bf16, full)
  → reduce-scatter → grad shard (bf16, P/N)
  → float() → shard_fp32.grad (fp32, P/N)
  → Adam.step() → shard_fp32 (fp32, P/N)
  → bfloat16() → param_buffer shard (bf16, P/N)
  → all-gather → param_buffer (bf16, full)
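Keeping the master weights in fp32 matters because bf16 has only a 7-bit mantissa, so small Adam updates vanish if applied to bf16 weights directly. A self-contained simulation (truncating fp32 bit patterns to bf16; no PyTorch required):

```python
import struct

def to_bf16(x: float) -> float:
    # Simulate bf16 by truncating an fp32 bit pattern to its top 16 bits
    # (real hardware rounds to nearest even; truncation is close enough here).
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

update = 1e-4

# Updating the bf16 weight directly: the step is below bf16's ~2^-7
# resolution around 1.0, so the weight never moves.
w = to_bf16(1.0)
print(to_bf16(w + update) == w)   # True

# With an fp32 master copy, small updates accumulate and survive
# the final fp32 -> bf16 writeback.
master = 1.0
for _ in range(100):
    master += update
print(to_bf16(master) > 1.0)      # True
```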

Files

model.py                  Simple multi-layer MLP (test model)
buffer.py                 Buffer (contiguous storage + shard view)
ddp.py                    DistributedDataParallel (param/grad remapping + sync)
distributed_optimizer.py  DistributedOptimizer (fp32 Adam on shards)
test_zero.py              Multi-step training correctness tests
profile_memory.py         GPU memory profiling (baseline vs ZeRO-1)
DESIGN.md                 Detailed design document (Chinese)

Running tests

Requires multiple CUDA GPUs and the nccl backend:

# all tests
python -m pytest test_zero.py -v

# single test
python -m pytest test_zero.py::TestZeROTraining::test_multi_step_2gpu -v

# or directly
python test_zero.py

The tests verify that ZeRO-1 training produces parameters bit-exact with those of a single-process reference run that follows the same precision path.

Requirements

  • Python >= 3.8
  • PyTorch >= 2.1
  • Multiple CUDA GPUs
