This repository was archived by the owner on Aug 1, 2025. It is now read-only.

Description
Came across this issue when comparing performance in TorchInductor for the lowering in #1270 and the decomp in pytorch/pytorch#85403.
It seems that most of the difference comes from the fact that in the lowering I was able to use int32s for computed indices, while in the decomp I am forced to use int64s (otherwise I get an exception: `IndexError: tensors used as indices must be long, byte or bool tensors`). Just by changing the lowering to use int64 instead of int32, it became 53% slower, which accounts for most of the performance difference relative to the decomp.
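A minimal sketch of the constraint described above (names and the exact error behavior are illustrative and may vary across PyTorch versions): advanced indexing with an int32 tensor has historically raised the `IndexError` quoted above, forcing an upcast to int64.

```python
import torch

x = torch.randn(10)
idx32 = torch.arange(5, dtype=torch.int32)  # computed indices in int32

# Advanced indexing has historically required int64 (long) indices;
# on versions where int32 is rejected, we must upcast before indexing.
try:
    out = x[idx32]
except IndexError:
    out = x[idx32.long()]  # forced upcast to int64

assert torch.equal(out, x[:5])
```

The upcast doubles the bytes moved per index, which is one plausible source of the slowdown measured here.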
Note that all the benchmarks were run on a desktop card (GeForce RTX 2060); this is possibly much less of an issue on server cards.
Any thoughts @ezyang, @jansel, @ngimel, @Chillee?