
Location-shift MKL Exponential Distribution #101720

Closed · wants to merge 10 commits

Conversation

@min-jean-cho (Collaborator) commented May 17, 2023

@pytorch-bot bot commented May 17, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/101720

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 9946aa4:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@min-jean-cho min-jean-cho marked this pull request as ready for review May 18, 2023 00:05
// we add a small constant, c, to the denominator.
// s = argmax( p / (q+c) ) where c > 0.
// Here we use c=1.0 for convenience.
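For context, a rough Python sketch of the sampling scheme this comment describes (illustrative only, not the PR's C++ code; the tensor names are made up):

```python
import torch

# Multinomial sampling via the exponential race s = argmax(p / q)
# with q ~ Exp(1). If q underflows to exactly 0.0, p / q is inf even
# where p == 0, so a zero-weight element can win the argmax. Adding
# c > 0 to the denominator, as the comment above says, avoids that.
p = torch.tensor([0.0, 0.3, 0.7])           # weights; index 0 must never win
q = torch.empty_like(p).exponential_(1.0)   # Exp(1) noise
c = 1.0                                     # the constant chosen above
s = torch.argmax(p / (q + c))               # sampled index; never 0 here
```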

Collaborator

@min-jean-cho this is a very clever move to solve this issue :)

A few suggestions:

  • since CUDA already has this protection, adding 1.0 isn't needed on the CUDA device; otherwise it is extra overhead.
  • we may have other issues related to this one, for example gumbel_softmax produces NaN on CPU only #101620. The question is: can we apply the same trick to it?

Also, can we come up with a solution that fixes exponential_? Otherwise I am afraid other issues will pop up in the future with operators that rely on exponential_.

Collaborator

Echoing @mingfeima: we'd better explore fixes on the exponential_ side instead of working around it on the caller side. Is torch.exponential_ supposed to exclude zeros?

Collaborator

Yes, it's supposed to exclude zeros

Collaborator Author

I think we have two options:

  1. As @mingfeima suggested in offline discussion, re-sample in the unlikely event that exactly zero is sampled.
  • I'm not sure if this will negatively affect performance when zero is sampled. The probability that the sample includes exactly zero increases as lambda (the exponential rate parameter) increases.
  2. Shift the location of the MKL exponential distribution by adding a very small constant.
  • I believe this is OK to do. If X ~ Exp(lambda), then E(X) = 1/lambda and V(X) = 1/lambda**2. If you apply the linear transform Y = X + c, where c ~= 0, the distribution of Y is very similar to that of X. The expected value shifts slightly, E(Y) = E(X + c) = E(X) + c = (1/lambda) + c, while the variance is unchanged, V(Y) = 1/lambda**2. If c is very small (e.g., c = 10**(-6)), the two distributions are practically indistinguishable (see the sketch below).

What do you think?
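A minimal sketch checking option 2's arithmetic empirically (the values of lam, c, and n here are arbitrary choices, not from the discussion):

```python
import torch

# Empirically check the claim: Y = X + c shifts the mean by c and
# leaves the variance unchanged, while guaranteeing y > 0.
lam, c, n = 2.0, 1e-6, 1_000_000
x = torch.empty(n, dtype=torch.float64).exponential_(lam)
y = x + c                                  # location-shifted samples
print(x.mean().item(), 1 / lam)            # ~0.5 vs 0.5
print(y.mean().item(), 1 / lam + c)        # ~0.500001 vs 0.500001
print(x.var().item(), y.var().item())      # both ~0.25
print(bool((y == 0).any()))                # False: no exact zeros
```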

Collaborator

IMO, as I said in #48841 (comment), we should go with the correct one. The perf hit is minimal and since it's amortised it's basically free. You can even surround it with C10_UNLIKELY so that the perf hit in the common case is literally zero.
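As an illustration of that resampling idea, a Python sketch only (the kernel itself is C++, where the redraw branch would sit behind C10_UNLIKELY; the function name here is hypothetical):

```python
import torch

# Hypothetical sketch of method 1: redraw only the entries that came
# out exactly 0.0. The loop body almost never runs, so its cost is
# amortised away in the common case.
def exponential_nonzero(n, lam):
    out = torch.empty(n, dtype=torch.float64).exponential_(lam)
    mask = out == 0.0
    while bool(mask.any()):                # the unlikely branch
        out[mask] = torch.empty(int(mask.sum()),
                                dtype=torch.float64).exponential_(lam)
        mask = out == 0.0
    return out
```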

Collaborator

I'm concerned about the performance impact and variance of the resampling approach. I guess we can have a benchmark to evaluate the impact.

Collaborator Author

Preliminary performance comparison of method 1 (iteratively resample when sampled points include 0.0) vs. method 2 (shift the exponential distribution's location):
Tested on 1 socket of a Skylake machine, 28 physical cores per socket.

avg time (ms)

| lambda   | method 1: iterative resample when samples include 0.0 | method 2: shift exponential distribution location |
|----------|--------:|-------:|
| 10 ** 0  |  318.66 | 238.5  |
| 10 ** 5  |  318.13 | 249.03 |
| 10 ** 10 |  318.42 | 248.83 |
| 10 ** 15 |  318.35 | 238.46 |
| 10 ** 20 |  318.34 | 249.17 |
| 10 ** 25 |  318.47 | 248.43 |
| 10 ** 30 |  318.32 | 238.45 |
| 10 ** 35 |  318.46 | 248.63 |
| 10 ** 40 |  427.02 | 248.67 |
| 10 ** 45 | 4579.04 | 238.27 |

avg time (ms): average time taken for _exponential(lambda, n=1000000000)

Takeaways:

  • method 1 consistently performs worse than method 2, even for lambda = 1
  • for reasonable λ (e.g., < 100), the probability of observing zero is very low, and far less than 0.01% of the sample will contain zeros
  • number of iterations required to remove zeros with method 1's iterative resampling:
    • removing all zeros requires i ≤ ⌈e**(λc) ⋅ log(n)⌉ iterations, where
      i: number of iterations to remove all zeros
      λ: exponential distribution rate parameter
      c: underflow threshold (a sample x survives if x > c and rounds to 0 if x ≤ c)
      n: number of samples
    • at most one iteration is enough to remove all zeros for reasonable values of λ
    • for a very large λ (e.g., between 10 ** 40 and 10 ** 45, which is unreasonably large), more resampling iterations are needed, hence the worse performance (e.g., assuming c = 10 ** -45, when λ = 10 ** 45, i ≤ ⌈e**(10 ** 45 ⋅ 10 ** -45) ⋅ log(n)⌉ = ⌈e ⋅ log(n)⌉); a quick numeric check follows this list
      • but note that such a large λ is not used for an exponential distribution. If λ is huge, the distribution looks like a spike rather than an exponential decay (e.g., if λ = 10 ** 45, then E(X) = 1/λ = 10 ** -45 and V(X) = 1/λ ** 2 = 10 ** -90 ≈ 0). V(X) = 0 would mean X is a constant, not a random variable
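The bound above can be sanity-checked numerically (c = 10 ** -45 is assumed here, matching the example in the list; the bound is loose for small λ, where one pass suffices in practice):

```python
import math

# P(X <= c) = 1 - exp(-lam * c): the chance a single sample
# underflows to zero. The quoted iteration bound is
# i <= ceil(e**(lam * c) * log(n)).
def zero_prob(lam, c=1e-45):
    return 1.0 - math.exp(-lam * c)

def max_resample_iters(lam, n, c=1e-45):
    return math.ceil(math.exp(lam * c) * math.log(n))

n = 1_000_000_000
for lam in (1e0, 1e40, 1e45):
    print(f"lam={lam:g}  P(zero)={zero_prob(lam):.3g}  "
          f"bound={max_resample_iters(lam, n)}")
```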

Collaborator

Thanks @min-jean-cho for the evaluation. So, let's proceed with method 2?

Collaborator Author

Thanks @jgong5, yes, let's proceed with method 2.

@janeyx99 janeyx99 added the triaged label (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module) May 18, 2023
@github-actions bot

This PR needs a label

If your changes are user facing and intended to be a part of release notes, please use a label starting with release notes:.

If not, please add the topic: not user facing label.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "topic: not user facing"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

@github-actions github-actions bot added the module: cpu CPU specific problem (e.g., perf, algorithm) label May 22, 2023
@@ -149,13 +149,13 @@ void exponential_kernel(TensorIteratorBase &iter, double lambda, c10::optional<G
       vslNewStream(&stream, VSL_BRNG_MCG31, seed);
       vslSkipAheadStream(stream, begin);
       vdRngExponential(VSL_RNG_METHOD_EXPONENTIAL_ICDF, stream, len,
-        (double *)(sample_ptr + begin), 0, 1./lambda);
+        (double *)(sample_ptr + begin), 1.4013e-45, 1./lambda);
Collaborator

Please don't hardcode values, use numeric_limits::denorm_min instead https://en.cppreference.com/w/cpp/types/numeric_limits/denorm_min. Do you want denorm_min here or just min though? Some systems might compile without denorm support.
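For scale, the limits being discussed, viewed from Python (a quick check for illustration; note that 1.4013e-45 in the diff above matches the single-precision denorm_min, 2**-149, even though the MKL call fills doubles):

```python
import sys
import torch

# The C++ limits under discussion:
#   std::numeric_limits<float>::denorm_min()  == 2**-149 ~ 1.4013e-45
#   std::numeric_limits<double>::min()        ~ 2.2251e-308
print(2.0 ** -149)                      # ~1.4013e-45: the hardcoded constant
print(sys.float_info.min)               # smallest normal double
print(torch.finfo(torch.float32).tiny)  # smallest normal float
```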

Collaborator Author

Thanks @ngimel, I'll also shortly share a preliminary performance comparison of method 1 (resample when sampled points include 0.0) vs. method 2 (shift the exponential distribution location).

Collaborator Author

Thanks @lezcano, eps can be any arbitrary value. So I'll just use std::numeric_limits::min instead of std::numeric_limits::denorm_min.

// Locate entries that are exactly zero; note this scans the whole tensor.
int64_t n_zeros;
bool has_zero;

zero_idx = (self == 0.).nonzero();
Collaborator
If a700ccc can fix this issue, then it is fine.

The fix from this PR is going to be a lot slower since it scans the whole tensor multiple times. This PR can be further optimized if we plan the memory access more carefully, but it is still going to be slower since you have to scan the output tensor (to check whether it has zeros) anyway.

@mingfeima mingfeima added the topic: not user facing topic category label May 23, 2023
@mingfeima (Collaborator)

@min-jean-cho can we update the test cases as well?

@lezcano (Collaborator) commented May 24, 2023

Yeah, I'm pretty sure that what we want here, given that the API allows for that, is to shift the distribution, that is, method 2. The only thing left to address is Natalia's comment in #101720 (comment) about using std::numeric_limits::min.

@min-jean-cho (Collaborator, Author)

> @min-jean-cho can we update the test cases as well?

Thanks @mingfeima, I can add a test to check that the minimum value is not zero -- for a very large lambda (e.g., 10**45), approximately 40% of the sample would contain zeros (before the current fix). But I'm not sure if we should add a test case that relies on probability; a possible shape for it is sketched below.
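One possible shape for such a test (the test name and constants are hypothetical; with a large draw at this lambda, the pre-fix failure would be near-certain):

```python
import torch

# Hypothetical regression test: with lambda = 10**45, a large draw
# would almost surely contain exact zeros before the fix, so checking
# that the minimum is strictly positive is near-deterministic.
def test_exponential_excludes_zero():
    x = torch.empty(10_000_000, dtype=torch.float64).exponential_(1e45)
    assert x.min().item() > 0.0
```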

@min-jean-cho min-jean-cho changed the title from "fix torch.multinomial zero-prob sampling" to "Location-shift MKL Exponential Distribution" May 24, 2023
@mingfeima (Collaborator)

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label May 25, 2023
@pytorchmergebot

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team


@pytorch pytorch deleted a comment from pytorchmergebot May 25, 2023
@pytorchmergebot

Merge failed

Reason: Comment with id 1562131265 not found

Details for Dev Infra team Raised by workflow job

@min-jean-cho (Collaborator, Author)

@pytorchbot merge

@pytorchmergebot

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team


Labels
  • ciflow/trunk — Trigger trunk jobs on your pull request
  • Merged
  • module: cpu — CPU specific problem (e.g., perf, algorithm)
  • open source
  • topic: not user facing — topic category
  • triaged — This issue has been looked at by a team member, and triaged and prioritized into an appropriate module
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

torch.multinomial selects elements with zero weight
8 participants