Location-shift MKL Exponential Distribution #101720
Conversation
```cpp
// we add a small constant, c, to the denominator.
// s = argmax( p / (q+c) ) where c > 0.
// Here we use c=1.0 for convenience.
```
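Read in isolation, the quoted comment amounts to something like the following sketch (my illustration of the formula above, not the PR's actual kernel): the argmax of `p/q` is guarded against a (near-)zero denominator by adding `c > 0`.

```cpp
#include <cstddef>
#include <vector>

// Sketch of the commented trick: compute s = argmax( p / (q+c) ) with a
// constant c > 0 so that a zero in q cannot cause a division by zero.
// Precondition: p and q have the same non-zero length.
std::size_t stabilized_argmax(const std::vector<double>& p,
                              const std::vector<double>& q,
                              double c = 1.0) {
  std::size_t best = 0;
  for (std::size_t i = 1; i < p.size(); ++i) {
    if (p[i] / (q[i] + c) > p[best] / (q[best] + c)) {
      best = i;
    }
  }
  return best;
}
```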
@min-jean-cho this is a very clever move to solve this issue :)
A few suggestions:
- Since CUDA already has this protection, adding 1.0 isn't needed on CUDA devices; there it would just be extra overhead.
- We may have other issues related to this one, for example gumbel_softmax produces NaN on CPU only #101620. The question is: can we apply the same trick to it?

Also, can we come up with a solution that fixes `exponential_` itself? Otherwise I am afraid other issues may pop up in the future with operators that rely on `exponential_`.
Echoing @mingfeima: we'd better explore fixes on the `exponential_` side instead of working around it on the caller side. Is `torch.exponential_` supposed to exclude zeros?
Yes, it's supposed to exclude zeros
I think we have two options:
- As @mingfeima suggested in offline discussion, re-sample in the unlikely event that exactly zero is sampled.
  - I'm not sure whether this will noticeably hurt performance. The probability that the sample includes exactly zero increases as lambda (the exponential rate parameter) increases.
- Shift the location of the MKL exponential distribution by adding a very small constant.
  - I believe this is OK to do. If `X ~ Exp(lambda)`, then `E(X) = 1/lambda` and `V(X) = 1/lambda**2`. If you apply a linear transform `Y = X + c` with `c ~= 0`, the distribution of `Y` is very similar to that of `X`: the expected value shifts slightly, `E(Y) = E(X + c) = E(X) + c = (1/lambda) + c`, while the variance is unchanged, `V(Y) = 1/lambda**2`. If `c` is very small (e.g., `c = 10**(-6)`), the two distributions are practically indistinguishable (written out in the sketch below).

What do you think?
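For reference, the shift argument written out; these are standard properties of expectation and variance, nothing PR-specific:

```latex
\[
  X \sim \mathrm{Exp}(\lambda), \qquad Y = X + c, \quad c > 0:
\]
\[
  \mathbb{E}[Y] = \mathbb{E}[X] + c = \frac{1}{\lambda} + c, \qquad
  \operatorname{Var}(Y) = \operatorname{Var}(X) = \frac{1}{\lambda^{2}},
\]
\[
  \operatorname{supp}(Y) = [c, \infty)
  \quad \text{vs.} \quad
  \operatorname{supp}(X) = [0, \infty),
\]
% The shifted support [c, \infty) is exactly what excludes zero
% from the samples.
```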
IMO, as I said in #48841 (comment), we should go with the correct one. The perf hit is minimal, and since it's amortised it's basically free. You can even surround it with `C10_UNLIKELY` so that the perf hit in the common case is literally zero.
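To illustrate the point, a minimal standalone sketch (my code, not the PR's kernel; the real path goes through MKL, and the real macro lives in c10/macros/Macros.h, so a fallback is defined here):

```cpp
#include <random>
#include <vector>

// Fallback so this snippet compiles standalone; PyTorch defines the real
// macro in c10/macros/Macros.h.
#ifndef C10_UNLIKELY
#define C10_UNLIKELY(expr) (__builtin_expect(static_cast<bool>(expr), 0))
#endif

// Fill `out` with Exp(lambda) samples, re-drawing on the rare exact zero.
void exponential_fill(std::vector<double>& out, double lambda,
                      std::mt19937_64& gen) {
  std::exponential_distribution<double> dist(lambda);
  for (double& x : out) {
    x = dist(gen);
    while (C10_UNLIKELY(x == 0.0)) {
      x = dist(gen);  // almost never taken, so the common-case cost is ~zero
    }
  }
}
```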
I'm concerned about the performance impact and variance of the resampling approach. I guess we can have a benchmark to evaluate the impact.
Preliminary performance comparison of method 1 (iterative resample when sampled points include 0.0) vs. method 2 (shift the exponential distribution location). Tested on 28 physical cores/socket, 1 socket, on Skylake.

| lambda | method 1: iterative resample, avg time (ms) | method 2: shift distribution location, avg time (ms) |
|---|---|---|
| 10 ** 0 | 318.66 | 238.50 |
| 10 ** 5 | 318.13 | 249.03 |
| 10 ** 10 | 318.42 | 248.83 |
| 10 ** 15 | 318.35 | 238.46 |
| 10 ** 20 | 318.34 | 249.17 |
| 10 ** 25 | 318.47 | 248.43 |
| 10 ** 30 | 318.32 | 238.45 |
| 10 ** 35 | 318.46 | 248.63 |
| 10 ** 40 | 427.02 | 248.67 |
| 10 ** 45 | 4579.04 | 238.27 |

avg time (ms): average time taken for `_exponential(lambda, n=1000000000)`.
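For reproducibility, a rough standalone harness in the spirit of these numbers (my sketch, not the actual benchmark script; it uses the same MKL calls as the kernel in this PR, a smaller `n`, and times only method 2):

```cpp
#include <chrono>
#include <cstdio>
#include <limits>
#include <vector>
#include <mkl_vsl.h>

int main() {
  const MKL_INT n = 100000000;  // the table above used n = 1000000000
  const double lambda = 1e5;
  std::vector<double> sample(n);

  VSLStreamStatePtr stream;
  vslNewStream(&stream, VSL_BRNG_MCG31, /*seed=*/42);

  const auto t0 = std::chrono::steady_clock::now();
  // Displacement a > 0 shifts the support to [a, inf), excluding zero.
  vdRngExponential(VSL_RNG_METHOD_EXPONENTIAL_ICDF, stream, n, sample.data(),
                   std::numeric_limits<double>::min(), 1.0 / lambda);
  const auto t1 = std::chrono::steady_clock::now();

  std::printf("time: %.2f ms\n",
              std::chrono::duration<double, std::milli>(t1 - t0).count());
  vslDeleteStream(&stream);
  return 0;
}
```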
Takeaways:
- Method 1 performs consistently worse than method 2, even for lambda = 1.
- For reasonable λ (e.g., < 100), the probability of observing zero is very low, and far less than 0.01% of the sample will contain zeros.
- Number of iterations required to remove zeros with method 1 (iterative resampling):
  - To remove all zeros, at most `i ≤ ⌈e**(λc) ⋅ log(n)⌉` iterations are required, where `i` is the number of iterations needed to remove zeros, `λ` is the exponential distribution parameter, `c` is the truncation threshold (samples with `x > c` are kept as-is; samples with `x <= c` become 0), and `n` is the number of samples. A derivation sketch follows this list.
  - At most one iteration is enough to remove all zeros for reasonable values of λ.
  - For a very large λ (e.g., between 10 ** 40 and 10 ** 45, which is unreasonably large), more resampling iterations are needed, hence the worse performance (e.g., assuming c = 10 ** -45, when λ = 10 ** 45, `i ≤ ⌈e**(10 ** 45 ⋅ 10 ** -45) ⋅ log(n)⌉`, i.e., `i ≤ ⌈e ⋅ log(n)⌉`).
  - But note that such a large λ is not used for an exponential distribution. If λ is huge, the distribution looks like a peak rather than an exponential decay (e.g., if λ = 10 ** 45, then `E(X) = 1/λ = 10 ** -45` and `V(X) = 1/λ ** 2 = 10 ** -90 ≈ 0`; V(X) = 0 means X is not a random variable but a constant).
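Here is one way to arrive at that bound (my sketch; the thread does not spell out the derivation), treating `n · p**i` as the expected number of surviving zeros:

```latex
% A draw lands at or below the truncation threshold c with probability
%   p = P(X \le c) = 1 - e^{-\lambda c}.
% Each pass re-draws only the zeros, so after i passes about n p^{i}
% zeros remain in expectation. Requiring n p^{i} < 1 gives
\[
  i > \frac{\log n}{\log(1/p)},
  \qquad\text{and since } \log(1/p) \ge 1 - p = e^{-\lambda c},
\]
\[
  \frac{\log n}{\log(1/p)} \;\le\; e^{\lambda c} \log n
  \quad\Longrightarrow\quad
  i = \big\lceil e^{\lambda c} \log n \big\rceil \text{ passes suffice.}
\]
```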
Thanks @min-jean-cho for the evaluation. So, let's proceed with method 2?
Thanks @jgong5, yes, let's proceed with method 2.
```diff
@@ -149,13 +149,13 @@ void exponential_kernel(TensorIteratorBase &iter, double lambda, c10::optional<G
   vslNewStream(&stream, VSL_BRNG_MCG31, seed);
   vslSkipAheadStream(stream, begin);
   vdRngExponential(VSL_RNG_METHOD_EXPONENTIAL_ICDF, stream, len,
-                   (double *)(sample_ptr + begin), 0, 1./lambda);
+                   (double *)(sample_ptr + begin), 1.4013e-45, 1./lambda);
```
Please don't hardcode values; use `std::numeric_limits::denorm_min` instead: https://en.cppreference.com/w/cpp/types/numeric_limits/denorm_min. Do you want `denorm_min` here, or just `min`, though? Some systems might compile without denorm support.
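For context, a small standalone snippet (mine, not from the PR) printing the two candidates: `denorm_min` is the smallest positive subnormal and can flush to zero when denormals are disabled (e.g., FTZ/DAZ modes), while `min` is the smallest positive normalized value.

```cpp
#include <iostream>
#include <limits>

int main() {
  // double: the type the MKL kernel above actually samples
  std::cout << std::numeric_limits<double>::denorm_min() << "\n";  // ~4.9407e-324
  std::cout << std::numeric_limits<double>::min() << "\n";         // ~2.2251e-308
  // float: note that 1.4013e-45, the constant hardcoded in the diff,
  // is float's denorm_min, not double's
  std::cout << std::numeric_limits<float>::denorm_min() << "\n";   // ~1.4013e-45
  std::cout << std::numeric_limits<float>::min() << "\n";          // ~1.1755e-38
  return 0;
}
```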
Thanks @ngimel, I'll also shortly share a preliminary performance comparison of method 1 (resample when sampled points include 0.0) vs. method 2 (shift the exponential distribution location).
Thanks @lezcano, eps can be any arbitrary value, so I'll just use `std::numeric_limits::min` instead of `std::numeric_limits::denorm_min`.
```cpp
int64_t n_zeros;
bool has_zero;
```

```cpp
zero_idx = (self == 0.).nonzero();
```
If a700ccc can fix this issue, then it is fine.
The fix from this PR is going to be a lot slower, since it scans the whole tensor multiple times. It can be further optimized if we plan the memory access more carefully, but it is still going to be slower, since you have to scan the output tensor (to check whether it has zeros) anyway.

@min-jean-cho can we update the test cases as well?
Yeah, I'm pretty sure that what we want here, given that the API allows for it, is to shift the distribution, that is, method 2. The only thing left to address is Natalia's comment in #101720 (comment) about using `numeric_limits` rather than a hardcoded constant.
Thanks @mingfeima, I can add a test to check that the minimum value is not zero -- for a very large lambda (e.g., 10**45), approximately 40% of the sample contained zeros before the current fix. But I'm not sure we should add a test case that relies on probability.
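For what it's worth, a minimal sketch of such a check against the ATen C++ API (the sizes and lambda are mine; after the location shift, `min > 0` should hold deterministically, so no probabilistic assertion is needed):

```cpp
#include <ATen/ATen.h>
#include <cassert>

int main() {
  // A large lambda makes tiny samples (and, before the fix, exact zeros) likely.
  const double lambda = 1e45;
  at::Tensor t = at::empty({1000000}, at::kDouble);
  t.exponential_(lambda);
  // With the shifted location the support excludes zero, so this check is
  // deterministic rather than probabilistic.
  assert(t.min().item<double>() > 0.0);
  return 0;
}
```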
@pytorchbot merge

Merge started: your change will be merged once all checks pass (ETA 0-4 hours).

Merge failed. Reason: Comment with id 1562131265 not found.

@pytorchbot merge

Merge started: your change will be merged once all checks pass (ETA 0-4 hours).
Fixes #48841, #101620
cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10