
Speedup bincount and histc on CUDA #97090

Conversation

@yuantailing (Contributor) commented Mar 18, 2023

This is to speed up torch.bincount and torch.histc on CUDA.

  1. Speed up int64_t gpuAtomicAdd,
  2. and optimize the histogram kernel.
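The histogram-kernel optimization in item 2 can be illustrated with a minimal pure-Python sketch. This is an assumption about the general privatization technique commonly used for GPU histograms, not the actual CUDA code from this PR: each block accumulates into its own private histogram (shared memory on the GPU), so contention on global atomicAdd drops from once per element to once per bin per block.

```python
def histogram_privatized(data, nbins, num_blocks=4):
    """Illustrative sketch of a privatized histogram (hypothetical helper,
    not PyTorch API). Each "block" builds a private histogram first."""
    chunk = (len(data) + num_blocks - 1) // num_blocks
    partials = []
    for b in range(num_blocks):
        local = [0] * nbins
        for v in data[b * chunk:(b + 1) * chunk]:
            local[v] += 1  # cheap shared-memory atomic in a real CUDA kernel
        partials.append(local)
    # Final reduction: one global add per (block, bin) pair instead of
    # one global atomic per input element.
    return [sum(p[i] for p in partials) for i in range(nbins)]

counts = histogram_privatized([0, 1, 1, 3, 3, 3], nbins=4)
# counts == [1, 2, 0, 3]
```

This is why the "narrow" cases (all elements in few bins) benefit most: in the naive scheme every element hammers the same global counter, while privatization spreads that contention across blocks.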

Fixes #96626

After the speedup, the time cost of the script in #96626 is:

... (run 2 times and ignore the first run)
case 1 CPU  0.0003631114959716797 seconds
case 1 CUDA 0.0005860328674316406 seconds
case 2 CPU  0.0013742446899414062 seconds
case 2 CUDA 0.0008623600006103516 seconds

Note that in "case 1 CUDA", the max op takes most of the time, i.e.,

    const int64_t nbins =
        std::max(self.max().item<input_t>() + (int64_t)1, minlength);

which this PR does not optimize.
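For reference, the nbins expression above boils down to this simple rule (a pure-Python restatement for illustration, not PyTorch's source):

```python
def bincount_nbins(values, minlength=0):
    # Mirrors the C++ snippet above:
    # nbins = max(max(values) + 1, minlength)
    # The max() over the input is the device-synchronizing reduction
    # that dominates "case 1 CUDA".
    return max(max(values) + 1, minlength)

print(bincount_nbins([0, 2, 5], minlength=4))   # 6
print(bincount_nbins([0, 1], minlength=10))     # 10
```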

Benchmark

Time is measured on an i7-10700 + RTX 3080, Ubuntu 22.04 (in WSL). The baseline is PyTorch 2.0.0+cu117; my dev version of PyTorch is compiled with CUDA 11.8. Each case is measured 15 times and the median is reported.

torch.bincount

| #elem | nbins | distribution | CPU (s) | PyTorch 2.0.0 (s) | this PR (s) | speedup |
|-------|-------|--------------|---------|-------------------|-------------|---------|
| 2**20 | 80 | random.uniform | 0.000834 | 0.005783 | 0.000266 | 21.8x |
| 2**20 | 80 | narrow in 1 bin | 0.001576 | 0.003967 | 0.000563 | 7.0x |
| 2**20 | 500 | random.uniform | 0.000852 | 0.003641 | 0.000334 | 10.9x |
| 2**20 | 500 | narrow in 1% bins | 0.000894 | 0.001878 | 0.000349 | 5.4x |
| 2**20 | 2048 | random.uniform | 0.000891 | 0.000820 | 0.000298 | 2.8x |
| 2**20 | 2048 | narrow in 1% bins | 0.000958 | 1.043251 | 0.000335 | 3,116.6x |
| 2**26 | 80 | random.uniform | 0.067715 | 0.322409 | 0.003032 | 106.3x |
| 2**26 | 80 | narrow in 1 bin | 0.110940 | 0.194644 | 0.017651 | 11.0x |
| 2**26 | 500 | random.uniform | 0.066666 | 0.192302 | 0.002535 | 75.8x |
| 2**26 | 500 | narrow in 1% bins | 0.066130 | 0.092237 | 0.005462 | 16.9x |
| 2**26 | 2048 | random.uniform | 0.066371 | 0.035308 | 0.002476 | 14.3x |
| 2**26 | 2048 | narrow in 1% bins | 0.068453 | 72.122858 | 0.003185 | 22,644.3x |

torch.histc (float32)

| #elem | nbins | distribution | CPU (s) | PyTorch 2.0.0 (s) | this PR (s) | speedup |
|-------|-------|--------------|---------|-------------------|-------------|---------|
| 2**20 | 80 | random.uniform | 0.001261 | 0.000145 | 9.47E-05 | 1.5x |
| 2**20 | 80 | narrow in 1 bin | 0.001074 | 0.000356 | 0.000311 | 1.1x |
| 2**20 | 500 | random.uniform | 0.001162 | 0.000227 | 9.18E-05 | 2.5x |
| 2**20 | 500 | narrow in 1% bins | 0.001082 | 0.000201 | 0.000152 | 1.3x |
| 2**20 | 2048 | random.uniform | 0.001100 | 0.000203 | 0.000118 | 1.7x |
| 2**20 | 2048 | narrow in 1% bins | 0.001089 | 0.000396 | 0.000107 | 3.7x |
| 2**26 | 80 | random.uniform | 0.064219 | 0.001170 | 0.000786 | 1.5x |
| 2**26 | 80 | narrow in 1 bin | 0.056471 | 0.013283 | 0.011939 | 1.1x |
| 2**26 | 500 | random.uniform | 0.078183 | 0.003411 | 0.000562 | 6.1x |
| 2**26 | 500 | narrow in 1% bins | 0.056711 | 0.002763 | 0.002738 | 1.0x |
| 2**26 | 2048 | random.uniform | 0.059296 | 0.003503 | 0.000533 | 6.6x |
| 2**26 | 2048 | narrow in 1% bins | 0.061754 | 0.015703 | 0.000962 | 16.3x |

torch.histc (int64)

| #elem | nbins | distribution | CPU (s) | PyTorch 2.0.0 (s) | this PR (s) | speedup |
|-------|-------|--------------|---------|-------------------|-------------|---------|
| 2**20 | 80 | random.uniform | N/A | 0.005614 | 9.47E-05 | 59.3x |
| 2**20 | 80 | narrow in 1 bin | N/A | 0.003799 | 0.000395 | 9.6x |
| 2**20 | 500 | random.uniform | N/A | 0.003665 | 9.58E-05 | 38.2x |
| 2**20 | 500 | narrow in 1% bins | N/A | 0.001760 | 0.000178 | 9.9x |
| 2**20 | 2048 | random.uniform | N/A | 0.000693 | 0.000111 | 6.2x |
| 2**20 | 2048 | narrow in 1% bins | N/A | 1.082904 | 0.000123 | 8,802.4x |
| 2**26 | 80 | random.uniform | N/A | 0.320400 | 0.001145 | 279.9x |
| 2**26 | 80 | narrow in 1 bin | N/A | 0.193668 | 0.015229 | 12.7x |
| 2**26 | 500 | random.uniform | N/A | 0.182897 | 0.000823 | 222.2x |
| 2**26 | 500 | narrow in 1% bins | N/A | 0.089363 | 0.00376 | 23.8x |
| 2**26 | 2048 | random.uniform | N/A | 0.033190 | 0.000832 | 39.9x |
| 2**26 | 2048 | narrow in 1% bins | N/A | 71.721012 | 0.001525 | 47,017.8x |

Benchmark code

Here is the benchmark code:

import time
import torch

cases = [
    ("bincount    bins=80   wide  ", torch.randint(80, [2**20]),   lambda x: torch.bincount(x, minlength=80)),
    ("bincount    bins=80   narrow", torch.randint(1, [2**20]),    lambda x: torch.bincount(x, minlength=80)),
    ("bincount    bins=500  wide  ", torch.randint(500, [2**20]),  lambda x: torch.bincount(x, minlength=500)),
    ("bincount    bins=500  narrow", torch.randint(5, [2**20]),    lambda x: torch.bincount(x, minlength=500)),
    ("bincount    bins=2048 wide  ", torch.randint(2048, [2**20]), lambda x: torch.bincount(x, minlength=2048)),
    ("bincount    bins=2048 narrow", torch.randint(20, [2**20]),   lambda x: torch.bincount(x, minlength=2048)),
    ("histc_float bins=80   wide  ", torch.rand(2**20),            lambda x: torch.histc(x, bins=80, min=0., max=1.)),
    ("histc_float bins=80   narrow", torch.rand(2**20)*.01,        lambda x: torch.histc(x, bins=80, min=0., max=1.)),
    ("histc_float bins=500  wide  ", torch.rand(2**20),            lambda x: torch.histc(x, bins=500, min=0., max=1.)),
    ("histc_float bins=500  narrow", torch.rand(2**20)*.01,        lambda x: torch.histc(x, bins=500, min=0., max=1.)),
    ("histc_float bins=2048 wide  ", torch.rand(2**20),            lambda x: torch.histc(x, bins=2048, min=0., max=1.)),
    ("histc_float bins=2048 narrow", torch.rand(2**20)*.01,        lambda x: torch.histc(x, bins=2048, min=0., max=1.)),
    ("histc_int   bins=80   wide  ", torch.randint(80, [2**20]),   lambda x: torch.histc(x, bins=80, min=0., max=80.)),
    ("histc_int   bins=80   narrow", torch.randint(1, [2**20]),    lambda x: torch.histc(x, bins=80, min=0., max=80.)),
    ("histc_int   bins=500  wide  ", torch.randint(500, [2**20]),  lambda x: torch.histc(x, bins=500, min=0., max=500.)),
    ("histc_int   bins=500  narrow", torch.randint(5, [2**20]),    lambda x: torch.histc(x, bins=500, min=0., max=500.)),
    ("histc_int   bins=2048 wide  ", torch.randint(2048, [2**20]), lambda x: torch.histc(x, bins=2048, min=0., max=2048.)),
    ("histc_int   bins=2048 narrow", torch.randint(20, [2**20]),   lambda x: torch.histc(x, bins=2048, min=0., max=2048.)),
]

def test(case, device):
    name, x, func = case
    x = x.to(device)
    time_samples = []
    for _ in range(15):
        # Synchronize around the call so the timer covers the full kernel,
        # not just the asynchronous launch.
        torch.cuda.synchronize()
        t1 = time.time()
        func(x)
        torch.cuda.synchronize()
        t2 = time.time()
        time_samples.append(t2 - t1)
    # Report the median of 15 runs to reduce timing jitter
    # (this also discards the slower first, warm-up run).
    median = sorted(time_samples)[len(time_samples) // 2]
    print(device, name, median)

for case in cases:
    test(case, device="cuda")

# for case in cases:
#     test(case, device="cpu")

@pytorch-bot bot commented Mar 18, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/97090

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 15c18f0:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@linux-foundation-easycla bot commented Mar 18, 2023

CLA Signed

The committers listed above are authorized under a signed CLA.

  • ✅ login: yuantailing / name: Tailing Yuan (8c0ae4d)

@yuantailing force-pushed the yuantailing/speedup-bincount-and-histc branch from 8c0ae4d to 6b9b382 on March 18, 2023 17:01
@cpuhrsch cpuhrsch requested a review from ngimel March 20, 2023 23:34
@cpuhrsch added the triaged label (this issue has been looked at by a team member, and triaged and prioritized into an appropriate module) Mar 20, 2023
@ngimel (Collaborator) commented Mar 23, 2023

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Mar 23, 2023
@pytorchmergebot (Collaborator):

Merge failed

Reason: This PR needs a label
If your changes are user facing and intended to be a part of release notes, please use a label starting with release notes:.

If not, please add the topic: not user facing label.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "topic: not user facing"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

Details for Dev Infra team: raised by workflow job.

@ngimel ngimel added topic: performance topic category release notes: cuda release notes category labels Mar 23, 2023
@ngimel (Collaborator) commented Mar 23, 2023

@pytorchbot merge

@pytorchmergebot (Collaborator):

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

Successfully merging this pull request may close these issues.

torch.bincount is very slow on CUDA