[primTorch] Minor improvements to doc and impl of gaussian_nll_loss #85612
Conversation
This is an interesting one. This function is already implemented in Python in core, so it's not clear to me whether we want to re-implement it in PrimTorch. The only benefit I see in doing this is that we could implement some promotion rules for it, and that this way we would have all our implementations in the same place... WDYT @mruberry?
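For context, the existing Python implementation in core essentially computes the elementwise loss below. This is only a simplified sketch (the helper name is mine, and the shape checks are omitted); the clamping detail is what the later discussion is about:

import math
import torch

def gaussian_nll_sketch(input, target, var, full=False, eps=1e-6, reduction="mean"):
    # Negative log likelihood of a Gaussian with mean `input` and variance
    # `var`, evaluated elementwise at `target`. `var` is clamped for stability.
    var = var.clamp(min=eps)  # how to clamp is debated further down
    loss = 0.5 * (torch.log(var) + (input - target) ** 2 / var)
    if full:
        loss = loss + 0.5 * math.log(2 * math.pi)
    if reduction == "mean":
        return loss.mean()
    elif reduction == "sum":
        return loss.sum()
    return loss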
torch/nn/functional.py
@@ -2777,8 +2777,10 @@ def gaussian_nll_loss(
    Args:
        input: expectation of the Gaussian distribution.
        target: sample from the Gaussian distribution.
        var: tensor of positive variance(s), one for each of the expectations
            in the input (heteroscedastic), or a single one (homoscedastic).
        var: same shape as the input, or same shape as the input but with the
It's cool this PR is updating the documentation for the function.
When describing the parameters it's important to start with what's most important about them. In this case, I don't think it's the shape of `var` that's most important, but that `var` is a tensor describing the variances of either a multivariate normal distribution or multiple independent distributions (see question above).
This documentation also seems a little odd to me because `input` and `target` refer to "the Gaussian distribution", but this seems wrong because:
- the loss can be used for multiple Gaussian distributions simultaneously (because it supports batches)
- I believe the correct semantic interpretation for this loss is that it works on multivariate normal distributions OR multiple normal distributions, and not just one?
So there may be more we can do here.
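To make that interpretation concrete, here is a hedged sketch (names and shapes are illustrative): with reduction='none' each element behaves like an independent univariate Gaussian, and a row then acts like a diagonal-covariance multivariate normal.

import torch
import torch.nn.functional as F

input = torch.randn(4, 3)      # predicted means: 4 samples, 3 dimensions each
target = torch.randn(4, 3)     # observed samples
var = torch.rand(4, 3) + 0.1   # one positive variance per element (heteroscedastic)

# Each element is the NLL of an independent univariate Gaussian (up to the
# constant term; pass full=True to include it). Summing over the last dimension
# gives, per row, the NLL of a diagonal-covariance multivariate normal.
per_element = F.gaussian_nll_loss(input, target, var, reduction='none')
per_sample = per_element.sum(dim=-1)

# Homoscedastic variant: a single shared variance per sample.
shared_var = torch.rand(4, 1) + 0.1
per_element_shared = F.gaussian_nll_loss(input, target, shared_var, reduction='none')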
I just removed the doc from the functional part. See the above comment WRT the Gaussian bit.
reminder to close this one once we fix the docs: #53392
@@ -2817,14 +2817,15 @@ def gaussian_nll_loss(
        raise ValueError(reduction + " is not a valid value for reduction")

    # Clamp for stability
    var = var.clone()
Not sure what the purpose of cloning here was, since the same variable is used later anyway.
clone + in-place was a worse version of doing clamp out of place, I think.
Now I know why. It's either the original behavior (with no_grad and in-place clamp) or just this (without any context manager): var = var.clamp(min=eps). I couldn't find any other code that wouldn't break the following tests:

python -m pytest test/test_modules.py -k GaussianNLLLoss -vvv
python -m pytest test/test_ops_gradients.py -k gaussian_nll_loss -vvv

This claims that doing it without no_grad will "cause divergence," but the tests pass locally.
Any thoughts @albanD? Clamping without no_grad LGTM, but I know we've historically done it with the no_grad...
Clamping without no_grad will zero out a bunch of gradients. We definitely don't want that!
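A tiny illustration of that point (values are made up, just to show the effect):

import torch

eps = 1e-6
var = torch.tensor([0.0, 0.5], requires_grad=True)

# Out-of-place clamp: the clamp is part of the autograd graph, so every
# entry that was below eps gets a zero gradient.
torch.log(var.clamp(min=eps)).sum().backward()
print(var.grad)  # tensor([0., 2.]) -- the clamped entry's gradient is zeroed

# With the original clone + in-place clamp under no_grad, autograd only sees
# the clone, so the same backward pass gives a large but nonzero gradient
# (about 1 / eps) for the first entry instead.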
What about figuring out what's the value we want this function to take at var = 0 and return that? Although we would need to somehow deal with the gradients as well...
Note: @lezcano told me offline there are plans (IIUC) to have a "framework" that would help with numerical issues like this one, so postponing for now.
My point is that, at the moment, we don't care about gradients in PrimTorch (just yet), but we do care about gradients in PyTorch. As such, given that this is a function exposed in PyTorch, it should be correct. In particular, this change makes the gradients of this function incorrect and should be reverted.
Then, we should revisit at some point how to approach gradients in PrimTorch.
@@ -2762,6 +2762,7 @@ def poisson_nll_loss(
    return ret


# TODO: Pure Python impl - don't add a primTorch ref
I would remove this comment -- it's not really a TODO since there's nothing to do
    # If var is the same shape as input, it's the heteroscedastic case.
    # If var is *not* the same shape as input, it's the homoscedastic case.
    #
    # To support broadcasting, the following sub-cases are allowed in the
It's not really broadcasting, though. We should be clear about what, exactly, it does, and not use a similar concept, which could confuse the reader.
@@ -314,13 +314,13 @@ class GaussianNLLLoss(_Loss):
    where :attr:`eps` is used for stability. By default, the constant term of
    the loss function is omitted unless :attr:`full` is ``True``. If ``var`` is not the same
    size as ``input`` (due to a homoscedastic assumption), it must either have a final dimension
    of 1 or have one fewer dimension (with all other sizes being the same) for correct broadcasting.
    of 1 or have one fewer dimension (when comparing from the outermost dimension, with all other
    sizes being the same) for correct later broadcasting.
Let's not use the term "broadcasting" here because this is not the same thing
@@ -314,13 +314,13 @@ class GaussianNLLLoss(_Loss):
    where :attr:`eps` is used for stability. By default, the constant term of
    the loss function is omitted unless :attr:`full` is ``True``. If ``var`` is not the same
    size as ``input`` (due to a homoscedastic assumption), it must either have a final dimension
I don't think the phrase "final dimension" is commonly understood
@@ -314,13 +314,13 @@ class GaussianNLLLoss(_Loss):
    where :attr:`eps` is used for stability. By default, the constant term of
    the loss function is omitted unless :attr:`full` is ``True``. If ``var`` is not the same
    size as ``input`` (due to a homoscedastic assumption), it must either have a final dimension
    of 1 or have one fewer dimension (with all other sizes being the same) for correct broadcasting.
    of 1 or have one fewer dimension (when comparing from the outermost dimension, with all other
I would just break this into cases. I'm not sure the order of comparison makes it that clear. Either `var` has the same shape, has the same shape except its innermost dimension is 1, or has the same shape except it's "missing" the innermost dimension. In the last two cases the variance is assumed to be the same for each distribution (the distributions are assumed to be homoscedastic).
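A quick sketch of those three cases (illustrative shapes only):

import torch
import torch.nn.functional as F

input = torch.randn(2, 5)
target = torch.randn(2, 5)

# Case 1: var has the same shape as input (one variance per element).
v1 = torch.rand(2, 5) + 0.1
F.gaussian_nll_loss(input, target, v1)

# Case 2: same shape except the innermost dimension is 1
# (one shared variance per row, expanded internally).
v2 = torch.rand(2, 1) + 0.1
F.gaussian_nll_loss(input, target, v2)

# Case 3: same shape except the innermost dimension is "missing"
# (unsqueezed internally, then handled like case 2).
v3 = torch.rand(2) + 0.1
F.gaussian_nll_loss(input, target, v3)

# Any other shape, e.g. torch.rand(5) here, is rejected with a ValueError.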
The functional and test changes look good, but I think we should take the time to be more diligent with the docs. It could save us a lot of headaches in the future.
One option would be to separate the doc changes into another PR.
@@ -2773,19 +2774,6 @@ def gaussian_nll_loss(
    r"""Gaussian negative log likelihood loss.

    See :class:`~torch.nn.GaussianNLLLoss` for details.
I would keep this.
While we do like to reduce document redundancy, in a more perfect world modules would just wrap functions, and the functions would be documented.
Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as Stale.
Stack from ghstack:
- gaussian_nll_loss #85612

Fixes #53392.

cc @ezyang @mruberry @ngimel @lezcano @peterbell10