[FSDP2] Changed grad acc test to use data parallel ref model (#126161)
This simplifies the test a bit.

**Context**

Option 1: The ref model is data parallel. Each rank's ref model receives its local batch. We manually all-reduce the gradients and divide them by the world size to match DDP/FSDP semantics.

Option 2: The ref model is not data parallel. Each rank's ref model receives the same global batch. We manually divide the ref model's gradients by the world size to match DDP/FSDP semantics. (Note that all ranks have the same ref model and the same global batch.)

All of our other unit tests follow Option 1, which is simpler and a more direct comparison against our claimed semantics. This PR switches the gradient accumulation test from Option 2 to Option 1.

Pull Request resolved: #126161
Approved by: https://github.com/wanchaol
ghstack dependencies: #126067, #126070
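As a rough illustration of why the two options yield the same reference gradients, here is a minimal pure-Python sketch that simulates the ranks in a single process. It is not the test code from this PR. The scalar model, the helper names (`grad_sum_loss`, `option1_grad`, `option2_grad`), and the strided sharding of the batch are all hypothetical. It also assumes the loss is summed over the batch rather than averaged, which is what makes Option 2's division by world size line up with Option 1.

```python
# Toy model (hypothetical): scalar weight w, per-sample loss (w * x - t) ** 2,
# summed (not averaged) over the batch.

def grad_sum_loss(w, batch):
    """Gradient of sum((w * x - t) ** 2) with respect to w."""
    return sum(2.0 * (w * x - t) * x for x, t in batch)

def option1_grad(w, global_batch, world_size):
    """Option 1: each simulated rank gets a local batch; all-reduce, then divide."""
    # Strided sharding is an arbitrary choice for this sketch.
    shards = [global_batch[rank::world_size] for rank in range(world_size)]
    local_grads = [grad_sum_loss(w, shard) for shard in shards]
    all_reduced = sum(local_grads)  # stands in for an all-reduce with a sum op
    return all_reduced / world_size

def option2_grad(w, global_batch, world_size):
    """Option 2: every rank gets the same global batch; just divide."""
    return grad_sum_loss(w, global_batch) / world_size

w = 0.5
global_batch = [(1.0, 2.0), (2.0, 1.0), (3.0, 0.5), (4.0, 3.0)]  # (input, target)
world_size = 2

g1 = option1_grad(w, global_batch, world_size)
g2 = option2_grad(w, global_batch, world_size)
assert abs(g1 - g2) < 1e-12
```

With an averaged loss the equivalence would not hold as written, since Option 2 would then divide by world size on top of the batch mean.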