This repository was archived by the owner on Aug 1, 2025. It is now read-only.

Poor performance for generated code for operations bottle necked on advanced indexing #1293

@fdrocha

Description


Came across this issue when comparing performance in TorchInductor between the lowering in #1270 and the decomp in pytorch/pytorch#85403.

It seems that most of the difference comes from the fact that in the lowering I was able to use int32s for computed indices, while in the decomp I am forced to use int64s (otherwise I get an exception: IndexError: tensors used as indices must be long, byte or bool tensors). Just by changing the lowering to use int64 instead of int32, it became 53% slower, which accounts for most of the performance difference relative to the decomp.
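To illustrate why the index dtype matters, here is a minimal sketch (using NumPy as a stand-in for the generated GPU code; the buffer sizes and gather semantics are the point, not the timings): a gather like `out[i] = src[idx[i]]` is typically bandwidth-bound, and an int64 index buffer moves twice the bytes of an int32 one for the same number of indices.

```python
import numpy as np

n = 1_000_000
src = np.random.rand(n).astype(np.float32)

# The same computed indices, stored as int32 vs int64.
idx32 = np.random.randint(0, n, size=n, dtype=np.int32)
idx64 = idx32.astype(np.int64)

# The gather produces identical results either way...
assert np.array_equal(src[idx32], src[idx64])

# ...but the int64 index buffer is twice the size, i.e. twice the
# memory traffic just to read the indices.
print(idx32.nbytes, idx64.nbytes)  # 4000000 8000000
```

On cards with limited memory bandwidth (like the desktop GPU benchmarked below), that extra index traffic plausibly dominates the cost of an indexing-bound kernel, which is consistent with the measured slowdown.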

Note that all the benchmarks were run on a desktop card (GeForce RTX 2060); this is possibly much less of an issue on server cards.

Any thoughts @ezyang, @jansel, @ngimel, @Chillee?
