# From Scratch to Scale: Hands-on Distributed Training (from the ground up)

> Teaching you *what* `torch.distributed` and other frameworks are doing so you can scale model training the **right way**

## What is this course

- Aimed at taking you from knowing how to train a model in PyTorch, `transformers`, `axolotl`, etc on a single GPU (or many*)
- Into someone who can train models **efficiently** on a cluster of 2 to 200 GPUs
- Understand how `DistributedDataParallel`, Pipeline Parallelism, etc. are implemented
- Understand how these then scale when you get into 2D, 3D+ parallelism strategies
- Aimed at doing so in a *practical* manner. Less focused on the math, more focused on the **doing**.

## Who this is *not* for:

- Experts in `torch.distributed` (you likely already know everything we're talking about)
- Someone brand new to PyTorch/Deep Learning (I'd prefer if you had a few months of experience and understand things like basic tensor operations)

## Course Layout:

- Week 1
  - Introduction
  - `nbdistributed`
  - Data Parallelism
    - `accelerate`/data sharding (recording)
    - Distributed Data Parallelism from scratch
- Week 2
  - Zero Redundancy Optimizer (ZeRO)
    - Core concept (Guest speaker recording, Sylvain Gugger)
    - ZeRO 1, 2, and 3 from scratch
    - PyTorch FSDP and Hybrid Sharding
    - Introduction to TorchTitan (via code)
- Week 3
  - Pipeline Parallelism from scratch
  - FP8 and low-precision training workshop
  - Pipeline Parallelism with TorchTitan
  - How to train your small MoE
- Week 4
  - Tensor Parallelism from scratch
  - Tensor Parallelism with TorchTitan


## The Rules:

* You can use the core of `torch.distributed`, but **not** any of the major classes (so no `torch` `FullyShardedDataParallel`, etc.)
* After we finish a lesson on a subject, *then* you may use it out-of-the-box (so after today's lecture, you may use `DistributedDataParallel`)
* We are *not* building a framework; we *are* implementing the broad strokes from scratch to learn how they work
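
To make the first rule concrete, here is a minimal sketch of what "the core of `torch.distributed`" could look like in practice: averaging gradients with the `all_reduce` collective instead of wrapping the model in `DistributedDataParallel`. This is not the course's implementation. The helper names `average_gradients` and `demo_step` are hypothetical, and the example assumes a CPU-only `gloo` process group with a single rank so it runs on any machine.

```python
import os
import tempfile

import torch
import torch.distributed as dist
import torch.nn as nn


def average_gradients(model: nn.Module) -> None:
    """Hypothetical helper: sum each gradient across ranks, then divide by world size."""
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is None:
            continue
        # Core collective from torch.distributed (allowed by the rules above)
        dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
        param.grad /= world_size


def demo_step() -> bool:
    """Run one backward pass in a world of size 1; return True if grads are unchanged."""
    # Assumption: file-based rendezvous in a fresh temp dir, single rank, CPU only
    init_file = os.path.join(tempfile.mkdtemp(), "rendezvous")
    dist.init_process_group(
        backend="gloo",
        init_method=f"file://{init_file}",
        rank=0,
        world_size=1,
    )
    try:
        torch.manual_seed(0)
        model = nn.Linear(4, 1)
        inputs, targets = torch.randn(8, 4), torch.randn(8, 1)
        loss = nn.functional.mse_loss(model(inputs), targets)
        loss.backward()

        local_grads = [p.grad.clone() for p in model.parameters()]
        average_gradients(model)

        # With a single rank, the averaged gradient equals the local gradient
        return all(
            torch.allclose(p.grad, g)
            for p, g in zip(model.parameters(), local_grads)
        )
    finally:
        dist.destroy_process_group()
```

With more than one rank, each process would compute gradients on its own shard of the batch and call `average_gradients` before the optimizer step. The single-rank setup here only checks that the plumbing works.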