[fix] torch.multinomial : fix for 0 size dim #43775
Conversation
test/test_torch.py
-    probs = torch.ones(0, 3)
-    num_samples = 1
+    probs = torch.ones(0, 128, device=device)
+    num_samples = 64
Curiously, the test passed with
probs = torch.ones(0, 3, device=device)
num_samples = 1
Is it something to do with replacement, since it fails only when num_samples > 1?
Possible. I haven't actually stepped through the exact kernel code where it is failing.
The test passes with replacement=False, so the different replacement modes have to be tested separately.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
>>> torch.multinomial(x, 1, True)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
RuntimeError: CUDA error: invalid configuration argument
The above error comes from pytorch/aten/src/ATen/native/cuda/MultinomialKernel.cu, lines 320 to 342 at commit 58148c8:
if (n_sample == 1 && maxShared >= requiredShared) {
  // Optimized allocation-free implementation
  // To exploit greater parallelism for the sampling, generate the
  // Uniform random samples in a separate kernel launch, into
  // temporarily allocated memory. The device RNG is thread-limited
  Tensor sampled = native::empty_cuda({numDist, n_sample}, self_v.options());
  at::native::uniform_(sampled, 0.0, 1.0, generator);
  dim3 block(numCategories < maxThreads ? numCategories : maxThreads);
  dim3 grid(numDist < numSM * 4 ? numDist : numSM * 4);
  sampleMultinomialOnce<scalar_t, accscalar_t>
      <<<grid, block,
         requiredShared,
         at::cuda::getCurrentCUDAStream()>>>(
      result.data_ptr<int64_t>(),
      numDist,
      numCategories,
      sampled.data_ptr<scalar_t>(),
      self_v.data_ptr<scalar_t>(),
      self_v.stride(0),
      self_v.stride(1));
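With numDist == 0 (the 0-size leading dim), dim3 grid(numDist < numSM * 4 ? numDist : numSM * 4) evaluates to dim3(0), and the CUDA runtime rejects any launch whose grid has a zero dimension. A minimal standalone sketch of that behavior (a hypothetical demo, not code from this PR; noop is a made-up kernel):

#include <cstdio>
#include <cuda_runtime.h>

// Made-up no-op kernel, only here so we can attempt a launch.
__global__ void noop() {}

int main() {
  // grid.x == 0 mirrors dim3 grid(numDist) with numDist == 0.
  noop<<<dim3(0), dim3(128)>>>();
  // Prints "invalid configuration argument".
  printf("%s\n", cudaGetErrorString(cudaGetLastError()));
  return 0;
}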
For num_samples > 1:
>>> torch.multinomial(x, 2, True)
Floating point exception (core dumped)
This crash comes from pytorch/aten/src/ATen/native/cuda/MultinomialKernel.cu, lines 364 to 383 at commit 58148c8:
if (with_replacement) {
  // Binary search is warp divergent (so effectively we're running
  // with just a single thread), but for better utilization,
  // we need each block to have at least 4 warps.
  dim3 block(128);
  // Each block will generate a sample from one
  // distribution concurrently.
  int grid_y = std::min<int>(numDist, at::cuda::getCurrentDeviceProperties()->maxGridSize[1]);
  dim3 grid((n_sample - 1) / block.x + 1, grid_y);
  {
    // See Note [Acquire lock when using random generators]
    std::lock_guard<std::mutex> lock(gen->mutex_);
    // each thread generates a single sample for (numdist/numblocks.y) distributions,
    // however, since we have to use curand_uniform4
    // (See Note [Register spilling in curand call for CUDA < 10]),
    // offset is 4 times that.
    auto offset = ((numDist - 1) / grid.y + 1) * 4;
    rng_engine_inputs = gen->philox_engine_inputs(offset);
  }
Here grid_y is 0 (because numDist is 0 for a 0-size leading dim), and thus in the line below we get a floating point exception due to division by zero:

auto offset = ((numDist-1)/grid.y+1)*4;
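To see why this dumps core rather than raising a Python error, here is a hypothetical standalone sketch (not PyTorch code) of the same arithmetic:

#include <cstdint>

int main() {
  int64_t numDist = 0;      // zero-size leading dim, as in torch.ones(0, 128)
  unsigned int grid_y = 0;  // std::min<int>(numDist, maxGridSize[1]) == 0
  // Integer division by zero raises SIGFPE on x86, which the shell
  // reports as "Floating point exception (core dumped)".
  auto offset = ((numDist - 1) / grid_y + 1) * 4;
  return static_cast<int>(offset);
}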
The fix takes care of both these cases.
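For readers skimming the thread, a minimal sketch of the kind of guard that covers both branches (hedged; the actual patch in this PR may be shaped differently):

// Sketch only, not the verbatim patch: bail out before configuring any
// launch when there are no distributions, so grid dimensions never become
// zero and the offset computation never divides by grid.y == 0.
if (numDist == 0) {
  return;  // the result tensor is already empty; there is nothing to sample
}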
@ngimel Please review :)
Thanks for the fix. I think that fixing the root cause would be better; otherwise it will probably be triggered in other situations.
It looks like the error is in multinomial_kernel_impl, so the fix is ok. Can you also please add missing
at line 396 of MultinomialKernel.cu?
Gentle Ping :)
@ngimel has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Fixes #43768
TO-DO: