CUDA optimization: using __restrict__ whenever possible #19335

@ThisIsIsaac

🚀 Feature

Motivation

To increase the throughput of CUDA kernels.

Pitch

For every CUDA kernel where it is feasible, take inputs as THCDeviceTensor<T, DIM, IndexT, RestrictPtrTraits> instead of using DefaultPtrTraits, so that the device tensors' data pointers are declared with the __restrict__ keyword. This change alone seems to increase throughput by about 3-5%. The increase was measured on the bilinear upsampling code.
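To make the idea concrete, here is a minimal sketch of how a pointer-traits template parameter can switch a kernel between plain and __restrict__ pointers. The two traits structs are simplified stand-ins written for this example, not the actual THC definitions, and axpy_kernel and run_axpy are hypothetical names. Whether the compiler actually generates faster code depends on the kernel and the architecture.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Simplified stand-ins for the pointer-traits idea (assumption: the real
// THC traits differ in detail; these only illustrate the mechanism).
template <typename T>
struct DefaultPtrTraits { typedef T* PtrType; };

template <typename T>
struct RestrictPtrTraits { typedef T* __restrict__ PtrType; };

// Hypothetical elementwise kernel: out[i] = a * x[i] + y[i].
// With RestrictPtrTraits every pointer parameter is __restrict__-qualified,
// which promises the compiler that the three buffers do not alias.
template <template <typename> class PtrTraits>
__global__ void axpy_kernel(typename PtrTraits<const float>::PtrType x,
                            typename PtrTraits<const float>::PtrType y,
                            typename PtrTraits<float>::PtrType out,
                            float a, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) out[i] = a * x[i] + y[i];
}

// Hypothetical host helper: copies inputs to the device, launches the kernel
// with the chosen traits, and returns the result. Error checking is omitted
// for brevity; x and y are assumed non-empty and of equal length.
template <template <typename> class PtrTraits>
std::vector<float> run_axpy(const std::vector<float>& x,
                            const std::vector<float>& y, float a) {
  int n = static_cast<int>(x.size());
  size_t bytes = n * sizeof(float);
  float *dx = nullptr, *dy = nullptr, *dout = nullptr;
  cudaMalloc(&dx, bytes);
  cudaMalloc(&dy, bytes);
  cudaMalloc(&dout, bytes);  // separate buffer, so the no-alias promise holds
  cudaMemcpy(dx, x.data(), bytes, cudaMemcpyHostToDevice);
  cudaMemcpy(dy, y.data(), bytes, cudaMemcpyHostToDevice);

  int threads = 256;
  int blocks = (n + threads - 1) / threads;
  axpy_kernel<PtrTraits><<<blocks, threads>>>(dx, dy, dout, a, n);

  std::vector<float> out(n);
  cudaMemcpy(out.data(), dout, bytes, cudaMemcpyDeviceToHost);  // implicit sync
  cudaFree(dx);
  cudaFree(dy);
  cudaFree(dout);
  return out;
}
```

Calling run_axpy<RestrictPtrTraits>(x, y, a) and run_axpy<DefaultPtrTraits>(x, y, a) should give the same values; only the aliasing information available to the compiler changes. The qualifier is safe only when the buffers really do not overlap, which is presumably why the pitch says "feasible": in-place kernels, where an input and the output share memory, cannot use it.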

Alternatives

Additional context

Labels: module: cuda, triaged
