[inductor] Decompose addmm if it's a dot product on cpu
Generated code for a dot product is often faster (on CPU) than
dispatching to aten, since it avoids op dispatch overhead and allows fusion
with surrounding ops, which in turn avoids allocations.
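
For illustration, a hypothetical repro (not part of this commit) of the shape this targets: compile a function whose addmm is a 1×K by K×1 product and inspect the inductor output with `TORCH_LOGS=output_code` (a standard PyTorch 2.x logging flag; the exact generated code is version-dependent).

```python
import torch

# Hypothetical repro: with TORCH_LOGS=output_code, the compiled CPU kernel
# for this 1xK @ Kx1 addmm should show a fused multiply/reduce loop rather
# than a call into aten.addmm.
@torch.compile
def f(bias, a, b):
    # The trailing `* 2` is an elementwise op that inductor can fuse with
    # the dot-product reduction, avoiding an intermediate allocation.
    return torch.addmm(bias, a, b) * 2

bias = torch.randn(1, 1)
a = torch.randn(1, 64)   # mat1.size(0) == 1
b = torch.randn(64, 1)   # mat2.size(-1) == 1
print(f(bias, a, b))
```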

Differential Revision: [D49595876](https://our.internmc.facebook.com/intern/diff/D49595876/)

ghstack-source-id: 201785775
Pull Request resolved: #110010
bertmaher committed Sep 25, 2023
1 parent 2895fbd commit 48750e8
Showing 1 changed file with 12 additions and 0 deletions.
torch/_inductor/decomposition.py

@@ -202,6 +202,18 @@ def bmm(self, batch2):
     return NotImplemented
 
 
+@register_decomposition([aten.addmm])
+@pw_cast_for_opmath
+def addmm(self, mat1, mat2, beta=1, alpha=1):
+    if self.device.type == "cpu":
+        if mat1.size(0) == 1 and mat2.size(-1) == 1:
+            out = torch.sum(
+                mat1.squeeze(0) * mat2.squeeze(-1), dim=0, keepdim=True
+            ).unsqueeze(0)
+            return alpha * out + beta * self
+    return NotImplemented
+
+
 @register_decomposition([aten.mm])
 def mm(self, input2):
     # Our matrix vector multiplies only achieve peak bandwidth with coordinate descent tuning.
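
The rewrite relies on the identity addmm(self, mat1, mat2, beta=β, alpha=α) = β·self + α·(mat1 @ mat2), where a (1, k) @ (k, 1) matmul is just a dot product. A minimal standalone sketch of the arithmetic in plain PyTorch (independent of inductor; tensor names are illustrative):

```python
import torch

k = 128
self_ = torch.randn(1, 1)   # the `self` bias argument of addmm
mat1 = torch.randn(1, k)
mat2 = torch.randn(k, 1)
alpha, beta = 2.0, 3.0

reference = torch.addmm(self_, mat1, mat2, beta=beta, alpha=alpha)

# The decomposed form from the diff above: an elementwise multiply plus a
# reduction, reshaped back to (1, 1), which inductor can fuse with
# surrounding ops instead of dispatching to aten.
out = torch.sum(
    mat1.squeeze(0) * mat2.squeeze(-1), dim=0, keepdim=True
).unsqueeze(0)
decomposed = alpha * out + beta * self_

torch.testing.assert_close(reference, decomposed)
```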
