
Tune elementwise ops for ROCm #21754

Closed · wants to merge 1 commit

Conversation

colesbury (Member)

The stride calculation using OffsetCalculator performs poorly with
MAX_DIMS=25. This reduces MAX_DIMS (after coalescing) to 16 on ROCm.
I think it's unlikely that anyone will exceed this limit. If they do,
we can add additional specializations for ROCm with more dimensions.
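
For illustration, the change is roughly of the following shape; this is a sketch only, and the actual constant, guard macro, and file in the tree may differ:

```cpp
// Sketch: cap the number of post-coalescing dimensions OffsetCalculator handles.
// The names and location here are assumptions, not the actual patch.
#ifdef __HIP_PLATFORM_HCC__
// ROCm: a smaller cap keeps the kernel argument block small.
constexpr int MAX_DIMS = 16;
#else
constexpr int MAX_DIMS = 25;
#endif
```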

I'm not sure about the underlying cause. With MAX_DIMS=25, the add kernel's argument block
is ~648 bytes vs. ~424 bytes with MAX_DIMS=16. The kernel's instruction footprint is
bigger too, but most of those instructions are never executed and most kernel parameters
are never loaded, because the typical dimensionality is much smaller.
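
As a back-of-the-envelope check on how the argument block scales with MAX_DIMS, here is a rough sizing sketch. The layout below (one magic-number divider per dimension plus one 32-bit stride per dimension per operand) is an assumption modeled on OffsetCalculator rather than the exact struct, so the totals only approximate the ~648/~424 figures above:

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>

// Assumed divider layout (divisor + magic + shift): roughly 12 bytes.
struct ApproxIntDivider { uint32_t divisor, magic; int shift; };

// Approximate bytes contributed by an OffsetCalculator-like kernel argument:
// one divider per dimension plus one 32-bit stride per dimension per operand.
constexpr std::size_t approx_offset_calc_bytes(int max_dims, int nargs) {
  return sizeof(int)                                   // dims
       + max_dims * sizeof(ApproxIntDivider)           // sizes_
       + max_dims * nargs * sizeof(uint32_t);          // strides_
}

int main() {
  // Elementwise add has three operands (out, a, b). Data pointers, numel,
  // etc. add a few dozen more bytes on top of these totals.
  std::printf("MAX_DIMS=25: ~%zu bytes\n", approx_offset_calc_bytes(25, 3)); // ~604
  std::printf("MAX_DIMS=16: ~%zu bytes\n", approx_offset_calc_bytes(16, 3)); // ~388
  return 0;
}
```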

Mini benchmark here:
https://gist.github.com/colesbury/1e917ae6a0ca9d24712121b92fed4c8f

(broadcasting operations, in particular, are much faster with this change)

cc @iotamudelta

pytorchbot added the module: cuda (Related to torch.cuda, and CUDA support in general) and module: operators labels Jun 13, 2019
colesbury requested a review from gchanan Jun 13, 2019 19:06
colesbury (Member, Author)

This seems to be an issue with the kernel argument size and not the kernel instruction size. I've noticed that performance drops dramatically once kernarg_segment_byte_size hits 512 bytes, even when there are no instructions that load from the extra kernel arguments.
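
A hypothetical way to probe that cliff (not the experiment above): launch an otherwise identical kernel while padding its by-value argument struct so the kernarg segment crosses 512 bytes. Everything below (names, pad sizes) is illustrative only:

```cpp
#include <hip/hip_runtime.h>

// Pad the kernel's by-value argument to inflate kernarg_segment_byte_size.
// The pad bytes are never read, mirroring the "unused parameters" case above.
template <int PAD_BYTES>
struct Payload {
  float* out;
  const float* in;
  int n;
  char pad[PAD_BYTES];
};

template <int PAD_BYTES>
__global__ void copy_kernel(Payload<PAD_BYTES> p) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < p.n) p.out[i] = p.in[i];
}

int main() {
  const int n = 1 << 20;
  float *in = nullptr, *out = nullptr;
  hipMalloc(reinterpret_cast<void**>(&in), n * sizeof(float));
  hipMalloc(reinterpret_cast<void**>(&out), n * sizeof(float));

  // Vary PAD_BYTES across instantiations to sweep the argument size;
  // timing (e.g. with hipEvent_t) is omitted here for brevity.
  Payload<512> p{out, in, n, {}};
  hipLaunchKernelGGL(copy_kernel<512>, dim3((n + 255) / 256), dim3(256), 0, 0, p);
  hipDeviceSynchronize();

  hipFree(in);
  hipFree(out);
  return 0;
}
```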

Review thread on aten/src/THC/THCIntegerDivider.cuh (resolved)
facebook-github-bot (Contributor) left a comment:

@colesbury has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

bddppq added the module: rocm (AMD GPU support for Pytorch) label Jun 13, 2019
zdevito pushed a commit to zdevito/ATen that referenced this pull request Jun 14, 2019
Pull Request resolved: pytorch/pytorch#21754

Reviewed By: bddppq

Differential Revision: D15811906

Pulled By: colesbury

fbshipit-source-id: 063f92c083d26e2ef2edc98df7ff0400f9432b9d
facebook-github-bot (Contributor)

@colesbury merged this pull request in cfd8c58.

Labels: Merged · module: cuda (Related to torch.cuda, and CUDA support in general) · module: rocm (AMD GPU support for Pytorch)