Eliminate unnecessary multiplications by 1 in addmm with sparse compressed tensor operand (#114026)

This PR:
- updates `torch/sparse/_triton_ops_meta.py` for the API change in `triton.testing.do_bench`
- forces `num_stages` to be 1 when the blocksize is 128x128, to avoid an out-of-resources exception when `bsr_dense_mm` is called from `nn.linear`
- eliminates unnecessary multiplications by 1, as in the title.

The performance of `nn.linear` on BSR tensor weights (dtypes `float16` and `bfloat16`) improves as follows (NVIDIA A100-SXM4-80GB):
- for blocksize 16x16, the average/maximum speedup is about 11/20 %
- for blocksize 32x32, the average/maximum speedup is about 15/24 %
- for blocksize 64x64, the average/maximum speedup is about 18/26 %
- for blocksize 128x128, the average/maximum speedup is about 15/28 %

Pull Request resolved: #114026
Approved by: https://github.com/cpuhrsch
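The benchmarked workload is `nn.linear` with a weight stored in the BSR (block sparse row) layout. A minimal sketch of that setup is below; the shapes and the 4x4 blocksize are illustrative (the PR benchmarks 16x16 through 128x128 on `float16`/`bfloat16` on a CUDA device), and this sketch only checks the dense round-trip on CPU rather than exercising the Triton `bsr_dense_mm` fast path:

```python
import torch

# Illustrative example, not code from the PR: convert a dense linear
# weight to the BSR layout that bsr_dense_mm operates on.
weight = torch.randn(8, 16)
bsr_weight = weight.to_sparse_bsr(blocksize=(4, 4))

x = torch.randn(2, 16)
# Dense reference for nn.linear: y = x @ weight.T + bias (no bias here).
y_dense = torch.nn.functional.linear(x, weight)

# The BSR tensor round-trips to the same dense values.
assert bsr_weight.layout == torch.sparse_bsr
assert torch.allclose(bsr_weight.to_dense(), weight)
```

On a CUDA device with Triton available, calling `torch.nn.functional.linear` with a half-precision BSR weight dispatches to the Triton-backed block-sparse matmul that this PR tunes.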