[inductor] Decompose addmm if it's a dot product on cpu
Generated code for a dot product is often faster (on CPU) than
dispatching to aten, since it avoids op dispatch overhead and allows fusion
with surrounding ops, which in turn avoids allocations.
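
For illustration, a hypothetical repro (not part of this commit) of the shape this targets: compile a function whose addmm is a 1×K by K×1 product and inspect the inductor output with `TORCH_LOGS=output_code` (a standard PyTorch 2.x logging flag; the exact generated code is version-dependent).

```python
import torch

# Hypothetical repro: with TORCH_LOGS=output_code, the compiled CPU kernel
# for this 1xK @ Kx1 addmm should show a fused multiply/reduce loop rather
# than a call into aten.addmm.
@torch.compile
def f(bias, a, b):
    # The trailing `* 2` is an elementwise op that inductor can fuse with
    # the dot-product reduction, avoiding an intermediate allocation.
    return torch.addmm(bias, a, b) * 2

bias = torch.randn(1, 1)
a = torch.randn(1, 64)   # mat1.size(0) == 1
b = torch.randn(64, 1)   # mat2.size(-1) == 1
print(f(bias, a, b))
```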

Differential Revision: [D49595876](https://our.internmc.facebook.com/intern/diff/D49595876/)

ghstack-source-id: 201785775
Pull Request resolved: #110010
bertmaher committed Sep 25, 2023
1 parent 2895fbd commit 48750e8
Showing 1 changed file with 12 additions and 0 deletions.
torch/_inductor/decomposition.py

@@ -202,6 +202,18 @@ def bmm(self, batch2):
     return NotImplemented
 
 
+@register_decomposition([aten.addmm])
+@pw_cast_for_opmath
+def addmm(self, mat1, mat2, beta=1, alpha=1):
+    if self.device.type == "cpu":
+        if mat1.size(0) == 1 and mat2.size(-1) == 1:
+            out = torch.sum(
+                mat1.squeeze(0) * mat2.squeeze(-1), dim=0, keepdim=True
+            ).unsqueeze(0)
+            return alpha * out + beta * self
+    return NotImplemented
+
+
 @register_decomposition([aten.mm])
 def mm(self, input2):
     # Our matrix vector multiplies only achieve peak bandwidth with coordinate descent tuning.
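
The rewrite relies on the identity addmm(self, mat1, mat2, beta=β, alpha=α) = β·self + α·(mat1 @ mat2), where a (1, k) @ (k, 1) matmul is just a dot product. A minimal standalone sketch of the arithmetic in plain PyTorch (independent of inductor; tensor names are illustrative):

```python
import torch

k = 128
self_ = torch.randn(1, 1)   # the `self` bias argument of addmm
mat1 = torch.randn(1, k)
mat2 = torch.randn(k, 1)
alpha, beta = 2.0, 3.0

reference = torch.addmm(self_, mat1, mat2, beta=beta, alpha=alpha)

# The decomposed form from the diff above: an elementwise multiply plus a
# reduction, reshaped back to (1, 1), which inductor can fuse with
# surrounding ops instead of dispatching to aten.
out = torch.sum(
    mat1.squeeze(0) * mat2.squeeze(-1), dim=0, keepdim=True
).unsqueeze(0)
decomposed = alpha * out + beta * self_

torch.testing.assert_close(reference, decomposed)
```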
