Updated test_torch.py to use new OptimizerInfo infrastructure #125538
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/125538
Note: Links to docs will display an error until the docs builds have been completed.
✅ You can merge normally! (1 Unrelated Failure) As of commit 5499662 with merge base f9a7033: FLAKY - the following job failed but was likely due to flakiness present on trunk.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
thanks! let's see if CI passes
(a good next step after this is to change line 5884 to just optim_db, to test all optims, but that may run into failures and can be in a different PR.)
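(A minimal sketch of that follow-up, assuming the `optims` decorator and `optim_db` from `torch.testing._internal.common_optimizers`; the test body and the device-type scaffolding are elided, and this is not the final PR code:)

```python
import torch
from torch.testing._internal.common_optimizers import optim_db, optims

# Sweep every optimizer in the database instead of a hand-picked subset;
# the test framework instantiates one test per (optimizer, dtype) combination.
@optims(optim_db, dtypes=[torch.float32])
def test_grad_scaling_autocast(self, device, dtype, optim_info):
    ...
```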
Hi @janeyx99, it seems there are some failed jobs. How do I fix them?
test/test_torch.py
Outdated
def test_params_invalidated_with_grads_invalidated_between_unscale_and_step(self, device, dtype, optim_info):
    optimizer_ctor = optim_info.optim_cls
    all_optim_inputs = _get_optim_inputs_including_global_cliquey_kwargs(
        device, dtype, optim_info, skip=("differentiable"))
Suggested change:
- device, dtype, optim_info, skip=("differentiable"))
+ device, dtype, optim_info, skip=("differentiable",))
make it a tuple! right now it’s just a string
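(For the record, this is a core-Python quirk rather than anything optimizer-specific; a quick standalone illustration:)

```python
# Parentheses alone do not make a tuple; the trailing comma does.
skip = ("differentiable")   # just the string "differentiable"
print(type(skip))           # <class 'str'>, so `in` matches substrings
print("diff" in skip)       # True, which is almost never what you want

skip = ("differentiable",)  # one-element tuple
print(type(skip))           # <class 'tuple'>
print("diff" in skip)       # False; only exact members match
```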
test/test_torch.py
Outdated
self._grad_scaling_autocast_test(device=device.type, optimizer_ctor=optimizer_ctor, optimizer_kwargs={"fused": True})
optimizer_ctor = optim_info.optim_cls
all_optim_inputs = _get_optim_inputs_including_global_cliquey_kwargs(
    device, dtype, optim_info, skip=("differentiable"))
same here—try running locally first :)
Thank you for the pointer @janeyx99! I couldn't run the test locally before due to environment setup issues. I fixed that now; this resolves the comment.
The tests run fine with the original hard-coded optimizer_kwargs. Looks like some kwargs from _get_optim_inputs_including_global_cliquey_kwargs do not pass the tests.
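(For readers following along, a minimal sketch, not the PR's exact code, of how the generated inputs are typically consumed; `model` is a hypothetical stand-in:)

```python
import torch
from torch.testing._internal.common_optimizers import (
    _get_optim_inputs_including_global_cliquey_kwargs,
)

def run_all_configs(device, dtype, optim_info, model):
    all_optim_inputs = _get_optim_inputs_including_global_cliquey_kwargs(
        device, dtype, optim_info, skip=("differentiable",)
    )
    for optim_input in all_optim_inputs:
        # Each entry carries one kwargs combination, e.g. {"lr": 0.01, "fused": True};
        # a config the kernel under test cannot satisfy fails at construction or step time.
        optimizer = optim_info.optim_cls(model.parameters(), **optim_input.kwargs)
```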
test/test_torch.py
Outdated
context = contextlib.nullcontext
if optimizer_ctor in (torch.optim.Adam, torch.optim.AdamW):
    from functools import partial
    context = partial(self.assertRaises, AssertionError)
It looks like one of the errors is no longer getting raised...which sounds like a good thing!
It'd be good to know which configs do not pass in the failing SGD tests.
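(A minimal sketch of the context-manager pattern under discussion, with made-up names: the same test body runs under either a no-op context or an expect-failure context, and `assertRaises` itself fails when the expected error stops being raised.)

```python
import contextlib
import unittest
from functools import partial

class Demo(unittest.TestCase):
    def _check(self, expect_mismatch):
        context = contextlib.nullcontext
        if expect_mismatch:
            context = partial(self.assertRaises, AssertionError)
        with context():
            # Stand-in for the real tensor comparison in the test body.
            assert not expect_mismatch, "tensors diverged"

    def test_both_paths(self):
        self._check(expect_mismatch=False)  # nullcontext: body must not raise
        self._check(expect_mismatch=True)   # assertRaises: body must raise
```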
When I remove the if block, i.e. keep `context = contextlib.nullcontext` for all cases of `optimizer_ctor`, I get the following errors for the Adam and AdamW optims:
AssertionError: Tensor-likes are not close!
Mismatched elements: 64 / 64 (100.0%)
Greatest absolute difference: 5.1396894454956055 at index (6, 6) (up to 0.001 allowed)
Greatest relative difference: 10.834877967834473 at index (6, 6) (up to 1e-05 allowed)
To execute this test, run the following from the base repo dir:
python test/test_torch.py -k test_grad_scaling_autocast_AdamW_cpu_float32
This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
AssertionError: Tensor-likes are not close!
Mismatched elements: 64 / 64 (100.0%)
Greatest absolute difference: 5.523576259613037 at index (6, 6) (up to 0.001 allowed)
Greatest relative difference: 28.40498161315918 at index (6, 6) (up to 1e-05 allowed)
To execute this test, run the following from the base repo dir:
python test/test_torch.py -k test_grad_scaling_autocast_Adam_cpu_float32
This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
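(For reference, this "Tensor-likes are not close!" report is the standard torch.testing.assert_close-style failure; a small self-contained sketch of how the atol/rtol gates produce it, with made-up values:)

```python
import torch

actual = torch.tensor([1.0000, 2.0000])
expected = torch.tensor([1.0005, 7.0000])

# Elementwise pass condition: |actual - expected| <= atol + rtol * |expected|
torch.testing.assert_close(actual[0], expected[0], atol=1e-3, rtol=1e-5)  # ok: diff is 5e-4

try:
    torch.testing.assert_close(actual, expected, atol=1e-3, rtol=1e-5)
except AssertionError as e:
    print(e)  # "Tensor-likes are not close! Mismatched elements: 1 / 2 ..."
```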
The Adam and AdamW tests fail when the configs (`optim_input.kwargs`) are `{'lr': 0.01, 'fused': False}` and `{'lr': 0.01, 'fused': True}`. This is with the `AssertionError` context.
The SGD optimizer test fails for `optim_input.kwargs == {'weight_decay': 0.1, 'maximize': True, 'fused': True}`.
The absolute difference of 5 is pretty major and suggests something is wrong.
> The Adam and AdamW tests fail when the configs (optim_input.kwargs) are {'lr': 0.01, 'fused': False} and {'lr': 0.01, 'fused': True}. This is with the AssertionError context.

When you say "with the AssertionError context", do you mean with the context enabled? If so, I am not surprised that these two basic configs now pass and therefore fail the expected-AssertionError check. Or do you mean that these configs are what causes the major errors?
Ah, is that the only SGD config failing? If you comment it out, does the rest of the test pass?
And are these all for CPU?
Sorry, I wasn't clear about the failing cases.
- When the `context` is `partial(self.assertRaises, AssertionError)` for Adam and AdamW, the tests fail for configs `{'lr': 0.01, 'fused': False}` and `{'lr': 0.01, 'fused': True}` with the error `AssertionError: AssertionError not raised`.
- When the `context` is `contextlib.nullcontext` for Adam and AdamW, the tests fail for all the configs with the error `AssertionError: Tensor-likes are not close!`. In this case, I am confused as to why the error is being thrown even for the configs where case 1 fails; the mismatched-elements percentage is around 3.1% for `{'lr': 0.01, 'fused': False}` and `{'lr': 0.01, 'fused': True}`, but either 39.1% or 100% for the other configs.
For SGD, that is the only config that fails; I am able to pass the test if I skip this config.
All of these tests are for CPU.
I see that the native devices are ('cpu', 'cuda', 'meta'). I understand the cuda tests are skipped since my machine does not have cuda support. What about the meta device type?
Okay, debugging these errors will likely be a more involved task. I propose:
- opening an issue tracking your observations here about this particular test (what fails, what doesn't when the context is on/off).
- no longer using _get_optim_inputs_including_global_cliquey_kwargs to get all the kwargs, but just having the kwargs be parametrized like in https://github.com/pytorch/pytorch/blob/main/test/test_optim.py#L689, but with the impl being "forloop", "foreach", "fused" (see the sketch after this list).
- Depending on the impl, you can make the kwargs {"foreach": False}, {"foreach": True}, and {"fused": True} respectively.
- This way we consolidate into one test but avoid signing up to address all the issues now.
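(A hedged sketch of that consolidation, assuming the `parametrize` helper from `torch.testing._internal.common_utils`; the names and the surrounding device-type scaffolding are assumptions, not the final PR code:)

```python
from torch.testing._internal.common_utils import parametrize

@parametrize("impl", ("forloop", "foreach", "fused"))
def test_grad_scaling_autocast(self, device, dtype, optim_info, impl):
    # Map each implementation to the single kwargs dict it exercises.
    kwargs_by_impl = {
        "forloop": {"foreach": False},
        "foreach": {"foreach": True},
        "fused": {"fused": True},
    }
    self._grad_scaling_autocast_test(
        device=device,
        optimizer_ctor=optim_info.optim_cls,
        optimizer_kwargs=kwargs_by_impl[impl],
    )
```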
And to answer your q about the meta device, it does input and output shape checking, as the tensors are not actually backed by real storage!
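(A quick standalone illustration of the meta device; nothing here is specific to this PR:)

```python
import torch

t = torch.empty(64, 64, device="meta")  # carries shape/dtype, no real storage
out = t @ t                              # shape inference runs; no data is touched
print(out.shape)                         # torch.Size([64, 64])
print(out.is_meta)                       # True
# Reading values (e.g. out.cpu()) would fail: nothing backs the tensor.
```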
I can pass the tests locally by skipping the failing configs from `_get_optim_inputs_including_global_cliquey_kwargs`.
Should I try pushing it and checking if the CI passes?
I will raise an issue covering the failing cases with the `_get_optim_inputs_including_global_cliquey_kwargs` configs.
I am suggesting not using _get_optim_inputs_including_global_cliquey_kwargs at all, but to just use

    # completing the "..." per the impl -> kwargs mapping above
    kwargs = ({"foreach": False} if impl == "forloop"
              else {"foreach": True} if impl == "foreach"
              else {"fused": True})
    optim_cls(params, **kwargs)

Once you push that change, the tests should pass since it is not adding more coverage compared to before.
But yes, please file the issue!
Force-pushed from 4032a11 to 5499662: "…grad_scaling_autocast for test coverage as some configs fail, raising separate issue for documenting failed configs"
There may be lint failures--please run `lintrunner`; note that it runs on uncommitted changes. (So after git add but before git commit.)
Hi @janeyx99, I did run `lintrunner`.
wonderful! thank you
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Updated test_torch.py to use new OptimizerInfo infrastructure (pytorch#125538)

Fixes #123451 (only addresses test_torch.py cases)

This PR solves the specific task to update `test_grad_scaling_autocast` and `test_params_invalidated_with_grads_invalidated_between_unscale_and_step` in `test/test_torch.py` to use the new OptimizerInfo infrastructure.

I have combined the tests that call `_grad_scaling_autocast_test` into one called `test_grad_scaling_autocast` and used `_get_optim_inputs_including_global_cliquey_kwargs` to avoid hard-coded configurations.

```
$ lintrunner test/test_cuda.py
ok No lint issues.
```

Pull Request resolved: pytorch#125538
Approved by: https://github.com/janeyx99

cc @janeyx99 @crcrpar @vincentqb @jbschlosser @albanD