Closed
Description
In the FP8 training setting, I noticed that `use_fast_accum` is only enabled during the forward pass:
Line 174 in 6243040
When benchmarking performance, I've observed that enabling fast accumulation can lead to a 1.1x to 1.2x speedup. Was there a specific reason or concern behind the decision to restrict `use_fast_accum` to the forward pass only?
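
To make the question concrete, here is a minimal sketch of the setup as I understand it: a linear layer runs three GEMMs (one in the forward pass, two in the backward pass), and each could carry its own fast-accumulation flag. The names `GemmConfig`, `LinearGemmConfigs`, and `make_configs` are hypothetical and only for illustration; they are not taken from the repository's actual API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GemmConfig:
    """Settings for one FP8 matmul (hypothetical name, illustration only)."""
    use_fast_accum: bool = False


@dataclass(frozen=True)
class LinearGemmConfigs:
    """One config per GEMM of a linear layer y = x @ w.T (hypothetical name)."""
    output: GemmConfig       # forward:  y  = x @ w.T
    grad_input: GemmConfig   # backward: dx = dy @ w
    grad_weight: GemmConfig  # backward: dw = dy.T @ x


def make_configs(fast_accum_in_backward: bool = False) -> LinearGemmConfigs:
    """Forward always uses fast accumulation; backward only when requested."""
    return LinearGemmConfigs(
        output=GemmConfig(use_fast_accum=True),
        grad_input=GemmConfig(use_fast_accum=fast_accum_in_backward),
        grad_weight=GemmConfig(use_fast_accum=fast_accum_in_backward),
    )


# Behavior described in this issue: fast accumulation in the forward pass only.
current = make_configs()

# Variant being asked about: fast accumulation in the backward GEMMs as well.
proposed = make_configs(fast_accum_in_backward=True)
```

Here `current` mirrors the forward-only behavior described above, while `proposed` is the variant I am asking about.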