Use fixed scale for Float8 softmax quantization instead of observer #4260

jerryzh168 merged 4 commits into main
Conversation
Summary: Softmax output is always in [0, 1], so observing during calibration is unnecessary. Replace the observer-based flow with a fixed output scale of `finfo(float8_dtype).max / 1.0` (448.0 for float8_e4m3fn). This simplifies `Float8ObservedSoftmax` to a plain marker module and removes the observer allocation + calibration overhead for softmax layers.

Test Plan: `pytest test/prototype/test_prototype_float8_tensor.py -k test_static_quant_softmax -x`
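A torch-free sketch of the scale arithmetic described above. The constant and function names are illustrative, not the PR's API; in the real flow the scaled value would be cast to `float8_e4m3fn`, which this sketch omits.

```python
# Mirrors torch.finfo(torch.float8_e4m3fn).max.
FLOAT8_E4M3_MAX = 448.0


def fixed_softmax_scale(float8_max: float = FLOAT8_E4M3_MAX) -> float:
    """Fixed quantization scale for softmax output.

    Softmax values always lie in [0, 1], so the largest representable
    float8 value divided by the known amax (1.0) is a safe static scale;
    no calibration-time observer is needed.
    """
    return float8_max / 1.0


def quantize_dequantize(p: float, scale: float) -> float:
    # Quantize: map [0, 1] onto [0, float8_max] (the real flow casts the
    # result to float8 here); then dequantize back.
    q = p * scale
    return q / scale
```

Because the amax of softmax output is known in advance to be 1.0, the scale is a compile-time constant and the round trip needs no per-batch statistics.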
Helpful links: see artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/4260. Note: links to docs will display an error until the docs builds have been completed.

Dr. CI: No failures, 3 pending as of commit 2265e5a with merge base c554b1f. 1 currently active SEV. This comment updates every 15 minutes.
@claude review pls
Claude finished @supriyar's task in 1m 29s. View job

Review of PR #4260

The core idea is sound: softmax output is guaranteed to be in [0, 1], so a fixed scale of `finfo(float8_dtype).max / 1.0` is safe.

Issues

1. Scale device placement. Creating the scale tensor without an explicit device may silently work (CPU default), but it's fragile and differs from the intended device. Consider creating it on the module's target device, or registering it as a buffer so it moves with the module. Since the calibration loop is still part of the documented flow (and the test does call it), make sure the no-op behavior there is intentional.

2. Scale reshaping:

```python
scale = self.output_act_quant_scale
while scale.ndim < output.ndim:
    scale = scale.unsqueeze(0)
```

This is correct but creates new tensors on each call. Since the scale is a fixed buffer, consider pre-computing the reshaped scale in `__init__`.

Nits

3. Unused import.

4. The calibration loop is now a no-op for softmax.

Test Coverage

The new `test_static_quant_softmax` test exercises the fixed-scale path.

Summary

The motivation and approach are clean. The fixed scale of 448.0 is correct for softmax outputs in [0, 1].
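A shape-only, torch-free sketch of what the `unsqueeze(0)` loop in the snippet above accomplishes (the function name is illustrative): it left-pads the scale's shape with singleton dimensions until it can broadcast against the output.

```python
def broadcast_scale_shape(scale_shape: tuple, output_ndim: int) -> tuple:
    """Left-pad a shape with 1s, mirroring repeated unsqueeze(0) calls."""
    while len(scale_shape) < output_ndim:
        scale_shape = (1,) + scale_shape
    return scale_shape
```

Since the fixed scale never changes after construction, this padded shape (or the reshaped tensor itself) can be computed once in `__init__` rather than on every forward call, which is the optimization the review suggests.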
@claude review again
Claude finished @jerryzh168's task in 2m 12s. View job

Re-review of PR #4260

All three concerns from the previous review have been addressed in the latest revision.

Remaining observations

1. Scale value correctness. The fixed scale of `finfo(float8_dtype).max / 1.0` (448.0 for float8_e4m3fn) is correct for softmax outputs in [0, 1].

2. New test. Covers CPU/CUDA and 2D/3D/4D shapes. Verifies scale device placement, output device, shape, and dtype. One minor gap: the calibration no-op path is not directly tested.

Summary

The previous review issues have been addressed. The change is clean and correct. LGTM.
Stack from ghstack (oldest at bottom):
Summary:
Softmax output is always in [0, 1], so observing during calibration is
unnecessary. Replace the observer-based flow with a fixed output scale of
`finfo(float8_dtype).max / 1.0` (448.0 for float8_e4m3fn). This simplifies
`Float8ObservedSoftmax` to a plain marker module and removes the observer
allocation + calibration overhead for softmax layers.
Test Plan:
pytest test/prototype/test_prototype_float8_tensor.py -k test_static_quant_softmax -x
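To make the simplified flow concrete, here is a torch-free toy sketch. The class name and forward signature are illustrative; only the `output_act_quant_scale` attribute name comes from the review snippet above, and the real module casts the scaled output to float8_e4m3fn, which this sketch omits.

```python
import math

FLOAT8_E4M3_MAX = 448.0  # torch.finfo(torch.float8_e4m3fn).max


class QuantizedSoftmaxSketch:
    """Toy stand-in for a softmax module with a fixed float8 output scale."""

    def __init__(self):
        # Fixed output scale: finfo(float8_dtype).max / 1.0, no observer.
        self.output_act_quant_scale = FLOAT8_E4M3_MAX / 1.0

    def forward(self, xs):
        # Numerically stable softmax over a flat list of floats.
        m = max(xs)
        exps = [math.exp(x - m) for x in xs]
        total = sum(exps)
        probs = [e / total for e in exps]
        # Quantize with the fixed scale (real code casts to float8 here).
        return [p * self.output_act_quant_scale for p in probs]
```

Because softmax output never exceeds 1.0, the scaled values never exceed 448.0, so the fixed scale cannot overflow the float8_e4m3fn range regardless of the input.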