This repository was archived by the owner on Aug 1, 2025. It is now read-only.

Description
Came across this issue when comparing performance in TorchInductor for the lowering in #1270 and the decomp in pytorch/pytorch#85403.
It seems that most of the difference comes from the fact that in the lowering I was able to use int32s for computed indices, while in the decomp I am forced to use int64s (otherwise I get an exception: `IndexError: tensors used as indices must be long, byte or bool tensors`). Just by changing the lowering to use int64 instead of int32, it became 53% slower, which accounts for most of the performance difference relative to the decomp.
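A minimal sketch of the constraint described above (names and the exact error behavior are illustrative and may vary across PyTorch versions): advanced indexing with an int32 tensor has historically raised the `IndexError` quoted above, forcing an upcast to int64.

```python
import torch

x = torch.randn(10)
idx32 = torch.arange(5, dtype=torch.int32)  # computed indices in int32

# Advanced indexing has historically required int64 (long) indices;
# on versions where int32 is rejected, we must upcast before indexing.
try:
    out = x[idx32]
except IndexError:
    out = x[idx32.long()]  # forced upcast to int64

assert torch.equal(out, x[:5])
```

The upcast doubles the bytes moved per index, which is one plausible source of the slowdown measured here.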
Note that all the benchmarks were run on a desktop card (GeForce RTX 2060); this is possibly much less of an issue on server cards.
Any thoughts @ezyang, @jansel, @ngimel, @Chillee?