
Speedup bincount and histc on CUDA #97090

Conversation

@yuantailing (Contributor) commented Mar 18, 2023

This is to speed up torch.bincount and torch.histc on CUDA.

  1. Speed up int64_t gpuAtomicAdd,
  2. and optimize the histogram kernel.
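The histogram-kernel optimization in item 2 can be illustrated with a minimal pure-Python sketch. This is an assumption about the general privatization technique commonly used for GPU histograms, not the actual CUDA code from this PR: each block accumulates into its own private histogram (shared memory on the GPU), so contention on global atomicAdd drops from once per element to once per bin per block.

```python
def histogram_privatized(data, nbins, num_blocks=4):
    """Illustrative sketch of a privatized histogram (hypothetical helper,
    not PyTorch API). Each "block" builds a private histogram first."""
    chunk = (len(data) + num_blocks - 1) // num_blocks
    partials = []
    for b in range(num_blocks):
        local = [0] * nbins
        for v in data[b * chunk:(b + 1) * chunk]:
            local[v] += 1  # cheap shared-memory atomic in a real CUDA kernel
        partials.append(local)
    # Final reduction: one global add per (block, bin) pair instead of
    # one global atomic per input element.
    return [sum(p[i] for p in partials) for i in range(nbins)]

counts = histogram_privatized([0, 1, 1, 3, 3, 3], nbins=4)
# counts == [1, 2, 0, 3]
```

This is why the "narrow" cases (all elements in few bins) benefit most: in the naive scheme every element hammers the same global counter, while privatization spreads that contention across blocks.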

Fixes #96626

After the speedup, the time cost of the script in #96626 is:

... (run 2 times and ignore the first run)
case 1 CPU  0.0003631114959716797 seconds
case 1 CUDA 0.0005860328674316406 seconds
case 2 CPU  0.0013742446899414062 seconds
case 2 CUDA 0.0008623600006103516 seconds

Note that in "case 1 CUDA", the max op takes most of the time, i.e.,

    const int64_t nbins =
        std::max(self.max().item<input_t>() + (int64_t)1, minlength);

which this PR does not optimize.
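For reference, the nbins expression above boils down to this simple rule (a pure-Python restatement for illustration, not PyTorch's source):

```python
def bincount_nbins(values, minlength=0):
    # Mirrors the C++ snippet above:
    # nbins = max(max(values) + 1, minlength)
    # The max() over the input is the device-synchronizing reduction
    # that dominates "case 1 CUDA".
    return max(max(values) + 1, minlength)

print(bincount_nbins([0, 2, 5], minlength=4))   # 6
print(bincount_nbins([0, 1], minlength=10))     # 10
```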

Benchmark

Time is measured on an i7-10700 + RTX 3080, Ubuntu 22.04 (in WSL). The baseline is PyTorch 2.0.0+cu117; my dev version of PyTorch is compiled with CUDA 11.8. Each case is measured 15 times and the median is reported.

torch.bincount

| #elem | nbins | distribution | CPU (s) | PyTorch 2.0.0 (s) | this PR (s) | speedup |
|-------|-------|--------------|---------|-------------------|-------------|---------|
| 2**20 | 80 | random.uniform | 0.000834 | 0.005783 | 0.000266 | 21.8x |
| 2**20 | 80 | narrow in 1 bin | 0.001576 | 0.003967 | 0.000563 | 7.0x |
| 2**20 | 500 | random.uniform | 0.000852 | 0.003641 | 0.000334 | 10.9x |
| 2**20 | 500 | narrow in 1% bins | 0.000894 | 0.001878 | 0.000349 | 5.4x |
| 2**20 | 2048 | random.uniform | 0.000891 | 0.000820 | 0.000298 | 2.8x |
| 2**20 | 2048 | narrow in 1% bins | 0.000958 | 1.043251 | 0.000335 | 3,116.6x |
| 2**26 | 80 | random.uniform | 0.067715 | 0.322409 | 0.003032 | 106.3x |
| 2**26 | 80 | narrow in 1 bin | 0.110940 | 0.194644 | 0.017651 | 11.0x |
| 2**26 | 500 | random.uniform | 0.066666 | 0.192302 | 0.002535 | 75.8x |
| 2**26 | 500 | narrow in 1% bins | 0.066130 | 0.092237 | 0.005462 | 16.9x |
| 2**26 | 2048 | random.uniform | 0.066371 | 0.035308 | 0.002476 | 14.3x |
| 2**26 | 2048 | narrow in 1% bins | 0.068453 | 72.122858 | 0.003185 | 22,644.3x |

torch.histc (float32)

| #elem | nbins | distribution | CPU (s) | PyTorch 2.0.0 (s) | this PR (s) | speedup |
|-------|-------|--------------|---------|-------------------|-------------|---------|
| 2**20 | 80 | random.uniform | 0.001261 | 0.000145 | 9.47E-05 | 1.5x |
| 2**20 | 80 | narrow in 1 bin | 0.001074 | 0.000356 | 0.000311 | 1.1x |
| 2**20 | 500 | random.uniform | 0.001162 | 0.000227 | 9.18E-05 | 2.5x |
| 2**20 | 500 | narrow in 1% bins | 0.001082 | 0.000201 | 0.000152 | 1.3x |
| 2**20 | 2048 | random.uniform | 0.001100 | 0.000203 | 0.000118 | 1.7x |
| 2**20 | 2048 | narrow in 1% bins | 0.001089 | 0.000396 | 0.000107 | 3.7x |
| 2**26 | 80 | random.uniform | 0.064219 | 0.001170 | 0.000786 | 1.5x |
| 2**26 | 80 | narrow in 1 bin | 0.056471 | 0.013283 | 0.011939 | 1.1x |
| 2**26 | 500 | random.uniform | 0.078183 | 0.003411 | 0.000562 | 6.1x |
| 2**26 | 500 | narrow in 1% bins | 0.056711 | 0.002763 | 0.002738 | 1.0x |
| 2**26 | 2048 | random.uniform | 0.059296 | 0.003503 | 0.000533 | 6.6x |
| 2**26 | 2048 | narrow in 1% bins | 0.061754 | 0.015703 | 0.000962 | 16.3x |

torch.histc (int64)

| #elem | nbins | distribution | CPU (s) | PyTorch 2.0.0 (s) | this PR (s) | speedup |
|-------|-------|--------------|---------|-------------------|-------------|---------|
| 2**20 | 80 | random.uniform | N/A | 0.005614 | 9.47E-05 | 59.3x |
| 2**20 | 80 | narrow in 1 bin | N/A | 0.003799 | 0.000395 | 9.6x |
| 2**20 | 500 | random.uniform | N/A | 0.003665 | 9.58E-05 | 38.2x |
| 2**20 | 500 | narrow in 1% bins | N/A | 0.001760 | 0.000178 | 9.9x |
| 2**20 | 2048 | random.uniform | N/A | 0.000693 | 0.000111 | 6.2x |
| 2**20 | 2048 | narrow in 1% bins | N/A | 1.082904 | 0.000123 | 8,802.4x |
| 2**26 | 80 | random.uniform | N/A | 0.320400 | 0.001145 | 279.9x |
| 2**26 | 80 | narrow in 1 bin | N/A | 0.193668 | 0.015229 | 12.7x |
| 2**26 | 500 | random.uniform | N/A | 0.182897 | 0.000823 | 222.2x |
| 2**26 | 500 | narrow in 1% bins | N/A | 0.089363 | 0.00376 | 23.8x |
| 2**26 | 2048 | random.uniform | N/A | 0.033190 | 0.000832 | 39.9x |
| 2**26 | 2048 | narrow in 1% bins | N/A | 71.721012 | 0.001525 | 47,017.8x |

Benchmark code

Here is the benchmark code:

import time
import torch

cases = [
    ("bincount    bins=80   wide  ", torch.randint(80, [2**20]),   lambda x: torch.bincount(x, minlength=80)),
    ("bincount    bins=80   narrow", torch.randint(1, [2**20]),    lambda x: torch.bincount(x, minlength=80)),
    ("bincount    bins=500  wide  ", torch.randint(500, [2**20]),  lambda x: torch.bincount(x, minlength=500)),
    ("bincount    bins=500  narrow", torch.randint(5, [2**20]),    lambda x: torch.bincount(x, minlength=500)),
    ("bincount    bins=2048 wide  ", torch.randint(2048, [2**20]), lambda x: torch.bincount(x, minlength=2048)),
    ("bincount    bins=2048 narrow", torch.randint(20, [2**20]),   lambda x: torch.bincount(x, minlength=2048)),
    ("histc_float bins=80   wide  ", torch.rand(2**20),            lambda x: torch.histc(x, bins=80, min=0., max=1.)),
    ("histc_float bins=80   narrow", torch.rand(2**20)*.01,        lambda x: torch.histc(x, bins=80, min=0., max=1.)),
    ("histc_float bins=500  wide  ", torch.rand(2**20),            lambda x: torch.histc(x, bins=500, min=0., max=1.)),
    ("histc_float bins=500  narrow", torch.rand(2**20)*.01,        lambda x: torch.histc(x, bins=500, min=0., max=1.)),
    ("histc_float bins=2048 wide  ", torch.rand(2**20),            lambda x: torch.histc(x, bins=2048, min=0., max=1.)),
    ("histc_float bins=2048 narrow", torch.rand(2**20)*.01,        lambda x: torch.histc(x, bins=2048, min=0., max=1.)),
    ("histc_int   bins=80   wide  ", torch.randint(80, [2**20]),   lambda x: torch.histc(x, bins=80, min=0., max=80.)),
    ("histc_int   bins=80   narrow", torch.randint(1, [2**20]),    lambda x: torch.histc(x, bins=80, min=0., max=80.)),
    ("histc_int   bins=500  wide  ", torch.randint(500, [2**20]),  lambda x: torch.histc(x, bins=500, min=0., max=500.)),
    ("histc_int   bins=500  narrow", torch.randint(5, [2**20]),    lambda x: torch.histc(x, bins=500, min=0., max=500.)),
    ("histc_int   bins=2048 wide  ", torch.randint(2048, [2**20]), lambda x: torch.histc(x, bins=2048, min=0., max=2048.)),
    ("histc_int   bins=2048 narrow", torch.randint(20, [2**20]),   lambda x: torch.histc(x, bins=2048, min=0., max=2048.)),
]

def test(case, device):
    name, x, func = case
    x = x.to(device)
    time_samples = []
    for _ in range(15):
        # Synchronize around the call so the timer covers the full kernel,
        # not just the asynchronous launch.
        torch.cuda.synchronize()
        t1 = time.time()
        func(x)
        torch.cuda.synchronize()
        t2 = time.time()
        time_samples.append(t2 - t1)
    # Report the median of 15 runs to reduce timing jitter
    # (this also discards the slower first, warm-up run).
    median = sorted(time_samples)[len(time_samples) // 2]
    print(device, name, median)

for case in cases:
    test(case, device="cuda")

# for case in cases:
#     test(case, device="cpu")

@pytorch-bot bot commented Mar 18, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/97090

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 15c18f0:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@linux-foundation-easycla bot commented Mar 18, 2023

CLA Signed

The committers listed above are authorized under a signed CLA.

  • ✅ login: yuantailing / name: Tailing Yuan (8c0ae4d)

@yuantailing force-pushed the yuantailing/speedup-bincount-and-histc branch from 8c0ae4d to 6b9b382 on March 18, 2023 17:01
@cpuhrsch cpuhrsch requested a review from ngimel March 20, 2023 23:34
@cpuhrsch added the triaged label (this issue has been looked at by a team member, and triaged and prioritized into an appropriate module) Mar 20, 2023
@ngimel (Collaborator) commented Mar 23, 2023

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Mar 23, 2023
@pytorchmergebot (Collaborator):

Merge failed

Reason: This PR needs a label
If your changes are user facing and intended to be a part of release notes, please use a label starting with release notes:.

If not, please add the topic: not user facing label.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "topic: not user facing"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

Details for Dev Infra team: raised by workflow job.

@ngimel ngimel added topic: performance topic category release notes: cuda release notes category labels Mar 23, 2023
@ngimel (Collaborator) commented Mar 23, 2023

@pytorchbot merge

@pytorchmergebot (Collaborator):

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

Successfully merging this pull request may close these issues.

torch.bincount is very slow on CUDA