Location-shift MKL Exponential Distribution #101720
Conversation
```cpp
// we add a small constant, c, to the denominator.
// s = argmax( p / (q+c) ) where c > 0.
// Here we use c=1.0 for convenience.
```
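Read in isolation, the quoted comment amounts to something like the following sketch (my illustration of the formula above, not the PR's actual kernel): the argmax of `p/q` is guarded against a (near-)zero denominator by adding `c > 0`.

```cpp
#include <cstddef>
#include <vector>

// Sketch of the commented trick: compute s = argmax( p / (q+c) ) with a
// constant c > 0 so that a zero in q cannot cause a division by zero.
// Precondition: p and q have the same non-zero length.
std::size_t stabilized_argmax(const std::vector<double>& p,
                              const std::vector<double>& q,
                              double c = 1.0) {
  std::size_t best = 0;
  for (std::size_t i = 1; i < p.size(); ++i) {
    if (p[i] / (q[i] + c) > p[best] / (q[best] + c)) {
      best = i;
    }
  }
  return best;
}
```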
@min-jean-cho this is a very clever move to solve this issue :)
A few suggestions:
- Since CUDA already has this protection, adding 1.0 isn't needed on CUDA devices; there it would just be extra overhead.
- We may have other issues related to this one, for example gumbel_softmax produces NaN on CPU only #101620. The question is: can we apply the same trick to it?

Also, can we come up with a solution that fixes `exponential_` itself? Otherwise I am afraid other issues may pop up in the future with operators that rely on `exponential_`.
Echoing @mingfeima: we'd better explore fixes on the `exponential_` side instead of working around it on the caller side. Is `torch.exponential_` supposed to exclude zeros?
Yes, it's supposed to exclude zeros
I think we have two options:
- As @mingfeima suggested in offline discussion, re-sample in the unlikely event that exactly zero is sampled.
  - I'm not sure whether this will noticeably hurt performance. The probability that the sample includes exactly zero increases as lambda (the exponential rate parameter) increases.
- Shift the location of the MKL exponential distribution by adding a very small constant.
  - I believe this is OK to do. If `X ~ Exp(lambda)`, then `E(X) = 1/lambda` and `V(X) = 1/lambda**2`. If you apply a linear transform `Y = X + c` with `c ~= 0`, the distribution of `Y` is very similar to that of `X`: the expected value shifts slightly, `E(Y) = E(X + c) = E(X) + c = (1/lambda) + c`, while the variance is unchanged, `V(Y) = 1/lambda**2`. If `c` is very small (e.g., `c = 10**(-6)`), the two distributions are practically indistinguishable (written out in the sketch below).

What do you think?
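For reference, the shift argument written out; these are standard properties of expectation and variance, nothing PR-specific:

```latex
\[
  X \sim \mathrm{Exp}(\lambda), \qquad Y = X + c, \quad c > 0:
\]
\[
  \mathbb{E}[Y] = \mathbb{E}[X] + c = \frac{1}{\lambda} + c, \qquad
  \operatorname{Var}(Y) = \operatorname{Var}(X) = \frac{1}{\lambda^{2}},
\]
\[
  \operatorname{supp}(Y) = [c, \infty)
  \quad \text{vs.} \quad
  \operatorname{supp}(X) = [0, \infty),
\]
% The shifted support [c, \infty) is exactly what excludes zero
% from the samples.
```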
IMO, as I said in #48841 (comment), we should go with the correct one. The perf hit is minimal, and since it's amortised it's basically free. You can even surround it with `C10_UNLIKELY` so that the perf hit in the common case is literally zero.
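To illustrate the point, a minimal standalone sketch (my code, not the PR's kernel; the real path goes through MKL, and the real macro lives in c10/macros/Macros.h, so a fallback is defined here):

```cpp
#include <random>
#include <vector>

// Fallback so this snippet compiles standalone; PyTorch defines the real
// macro in c10/macros/Macros.h.
#ifndef C10_UNLIKELY
#define C10_UNLIKELY(expr) (__builtin_expect(static_cast<bool>(expr), 0))
#endif

// Fill `out` with Exp(lambda) samples, re-drawing on the rare exact zero.
void exponential_fill(std::vector<double>& out, double lambda,
                      std::mt19937_64& gen) {
  std::exponential_distribution<double> dist(lambda);
  for (double& x : out) {
    x = dist(gen);
    while (C10_UNLIKELY(x == 0.0)) {
      x = dist(gen);  // almost never taken, so the common-case cost is ~zero
    }
  }
}
```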
I'm concerned about the performance impact and variance of the resampling approach. I guess we can have a benchmark to evaluate the impact.
Preliminary performance comparison of method 1 (iterative resample when sampled points include 0.0) vs. method 2 (shift the exponential distribution location). Tested on 28 physical cores/socket, 1 socket, on Skylake.

| lambda | method 1: iterative resample, avg time (ms) | method 2: shift distribution location, avg time (ms) |
|---|---|---|
| 10 ** 0 | 318.66 | 238.50 |
| 10 ** 5 | 318.13 | 249.03 |
| 10 ** 10 | 318.42 | 248.83 |
| 10 ** 15 | 318.35 | 238.46 |
| 10 ** 20 | 318.34 | 249.17 |
| 10 ** 25 | 318.47 | 248.43 |
| 10 ** 30 | 318.32 | 238.45 |
| 10 ** 35 | 318.46 | 248.63 |
| 10 ** 40 | 427.02 | 248.67 |
| 10 ** 45 | 4579.04 | 238.27 |

avg time (ms): average time taken for `_exponential(lambda, n=1000000000)`.
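For reproducibility, a rough standalone harness in the spirit of these numbers (my sketch, not the actual benchmark script; it uses the same MKL calls as the kernel in this PR, a smaller `n`, and times only method 2):

```cpp
#include <chrono>
#include <cstdio>
#include <limits>
#include <vector>
#include <mkl_vsl.h>

int main() {
  const MKL_INT n = 100000000;  // the table above used n = 1000000000
  const double lambda = 1e5;
  std::vector<double> sample(n);

  VSLStreamStatePtr stream;
  vslNewStream(&stream, VSL_BRNG_MCG31, /*seed=*/42);

  const auto t0 = std::chrono::steady_clock::now();
  // Displacement a > 0 shifts the support to [a, inf), excluding zero.
  vdRngExponential(VSL_RNG_METHOD_EXPONENTIAL_ICDF, stream, n, sample.data(),
                   std::numeric_limits<double>::min(), 1.0 / lambda);
  const auto t1 = std::chrono::steady_clock::now();

  std::printf("time: %.2f ms\n",
              std::chrono::duration<double, std::milli>(t1 - t0).count());
  vslDeleteStream(&stream);
  return 0;
}
```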
Takeaways:
- Method 1 performs consistently worse than method 2, even for lambda = 1.
- For reasonable λ (e.g., < 100), the probability of observing zero is very low, and far less than 0.01% of the sample will contain zeros.
- Number of iterations required to remove zeros with method 1 (iterative resampling):
  - To remove all zeros, at most `i ≤ ⌈e**(λc) ⋅ log(n)⌉` iterations are required, where `i` is the number of iterations needed to remove zeros, `λ` is the exponential distribution parameter, `c` is the truncation threshold (samples with `x > c` are kept as-is; samples with `x <= c` become 0), and `n` is the number of samples. A derivation sketch follows this list.
  - At most one iteration is enough to remove all zeros for reasonable values of λ.
  - For a very large λ (e.g., between 10 ** 40 and 10 ** 45, which is unreasonably large), more resampling iterations are needed, hence the worse performance (e.g., assuming c = 10 ** -45, when λ = 10 ** 45, `i ≤ ⌈e**(10 ** 45 ⋅ 10 ** -45) ⋅ log(n)⌉`, i.e., `i ≤ ⌈e ⋅ log(n)⌉`).
  - But note that such a large λ is not used for an exponential distribution. If λ is huge, the distribution looks like a peak rather than an exponential decay (e.g., if λ = 10 ** 45, then `E(X) = 1/λ = 10 ** -45` and `V(X) = 1/λ ** 2 = 10 ** -90 ≈ 0`; V(X) = 0 means X is not a random variable but a constant).
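Here is one way to arrive at that bound (my sketch; the thread does not spell out the derivation), treating `n · p**i` as the expected number of surviving zeros:

```latex
% A draw lands at or below the truncation threshold c with probability
%   p = P(X \le c) = 1 - e^{-\lambda c}.
% Each pass re-draws only the zeros, so after i passes about n p^{i}
% zeros remain in expectation. Requiring n p^{i} < 1 gives
\[
  i > \frac{\log n}{\log(1/p)},
  \qquad\text{and since } \log(1/p) \ge 1 - p = e^{-\lambda c},
\]
\[
  \frac{\log n}{\log(1/p)} \;\le\; e^{\lambda c} \log n
  \quad\Longrightarrow\quad
  i = \big\lceil e^{\lambda c} \log n \big\rceil \text{ passes suffice.}
\]
```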
Thanks @min-jean-cho for the evaluation. So, let's proceed with method 2?
Thanks @jgong5, yes, let's proceed with method 2.
```diff
@@ -149,13 +149,13 @@ void exponential_kernel(TensorIteratorBase &iter, double lambda, c10::optional<G
   vslNewStream(&stream, VSL_BRNG_MCG31, seed);
   vslSkipAheadStream(stream, begin);
   vdRngExponential(VSL_RNG_METHOD_EXPONENTIAL_ICDF, stream, len,
-                   (double *)(sample_ptr + begin), 0, 1./lambda);
+                   (double *)(sample_ptr + begin), 1.4013e-45, 1./lambda);
```
Please don't hardcode values; use `std::numeric_limits::denorm_min` instead: https://en.cppreference.com/w/cpp/types/numeric_limits/denorm_min. Do you want `denorm_min` here, or just `min`, though? Some systems might compile without denorm support.
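For context, a small standalone snippet (mine, not from the PR) printing the two candidates: `denorm_min` is the smallest positive subnormal and can flush to zero when denormals are disabled (e.g., FTZ/DAZ modes), while `min` is the smallest positive normalized value.

```cpp
#include <iostream>
#include <limits>

int main() {
  // double: the type the MKL kernel above actually samples
  std::cout << std::numeric_limits<double>::denorm_min() << "\n";  // ~4.9407e-324
  std::cout << std::numeric_limits<double>::min() << "\n";         // ~2.2251e-308
  // float: note that 1.4013e-45, the constant hardcoded in the diff,
  // is float's denorm_min, not double's
  std::cout << std::numeric_limits<float>::denorm_min() << "\n";   // ~1.4013e-45
  std::cout << std::numeric_limits<float>::min() << "\n";          // ~1.1755e-38
  return 0;
}
```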
Thanks @ngimel, I'll also shortly share a preliminary performance comparison of method 1 (resample when sampled points include 0.0) vs. method 2 (shift the exponential distribution location).
Thanks @lezcano, eps can be any arbitrary value, so I'll just use `std::numeric_limits::min` instead of `std::numeric_limits::denorm_min`.
```cpp
int64_t n_zeros;
bool has_zero;
```

```cpp
zero_idx = (self == 0.).nonzero();
```
If a700ccc can fix this issue, then it is fine.
The fix from this PR is going to be a lot slower, since it scans the whole tensor multiple times. It can be further optimized if we plan the memory access more carefully, but it is still going to be slower, since you have to scan the output tensor (to check whether it has zeros) anyway.

@min-jean-cho can we update the test cases as well?
Yeah, I'm pretty sure that what we want here, given that the API allows for it, is to shift the distribution, that is, method 2. The only thing left to address is Natalia's comment in #101720 (comment) about using `numeric_limits` rather than a hardcoded constant.
Thanks @mingfeima, I can add a test to check that the minimum value is not zero -- for a very large lambda (e.g., 10**45), approximately 40% of the sample contained zeros before the current fix. But I'm not sure we should add a test case that relies on probability.
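For what it's worth, a minimal sketch of such a check against the ATen C++ API (the sizes and lambda are mine; after the location shift, `min > 0` should hold deterministically, so no probabilistic assertion is needed):

```cpp
#include <ATen/ATen.h>
#include <cassert>

int main() {
  // A large lambda makes tiny samples (and, before the fix, exact zeros) likely.
  const double lambda = 1e45;
  at::Tensor t = at::empty({1000000}, at::kDouble);
  t.exponential_(lambda);
  // With the shifted location the support excludes zero, so this check is
  // deterministic rather than probabilistic.
  assert(t.min().item<double>() > 0.0);
  return 0;
}
```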
@pytorchbot merge

Merge started: your change will be merged once all checks pass (ETA 0-4 hours).

Merge failed. Reason: Comment with id 1562131265 not found.

@pytorchbot merge

Merge started: your change will be merged once all checks pass (ETA 0-4 hours).
Fixes #48841, #101620
cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10