Transformer systems engineering project: a from-scratch language model, a Triton FlashAttention kernel, and bucketed, overlapped DDP, with reproducible benchmarks. Reference: Stanford CS336 assignments.
benchmarking pytorch transformer triton ddp systems-engineering distributed-training flash-attention-2 gpu-profiling
Updated Mar 4, 2026 - HTML