Do not use fork to invoke test scripts in pytorch rocm CI #14600
Conversation
TBH, I'd prefer it if we just arranged things so that
@ezyang The usage of the GPU could come from other places that run_test imports (e.g., run_test imports common_utils, which uses device_count, and functions other than device_count could trigger HIP initialization as well). If some future change accidentally introduces GPU usage into run_test, it will be hard to track down, because the ihipException failures are not 100% reproducible on all CI boxes.
It shouldn't import common_utils either. run_test.py literally used to be a shell script, and it should stay that way: a dumb shell script.
Since this is going to be a temporary workaround (in the end AMD should fix the underlying issue, otherwise distributed training will run into problems), I think it's not worth the effort of rewriting a 400-line Python script into a shell script, e.g. you would need to re-implement the TEST_WITH_ROCM and black_list logic (not saying it's hard, just that it needs reimplementation).
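For readers skimming the thread, a minimal sketch of the approach being discussed, assuming illustrative test and skip-list names (the real run_test.py keeps its own TEST_WITH_ROCM and blacklist handling): each test script is launched in a fresh interpreter via subprocess rather than forked from the parent, so any HIP state initialized in the parent cannot leak into the child.

```python
import subprocess
import sys

# Illustrative stand-ins for run_test.py's real test list and skip list.
TESTS = ['test_torch', 'test_nn', 'test_autograd']
ROCM_BLACKLIST = {'test_distributed'}

def run_tests(test_with_rocm=True):
    for test in TESTS:
        if test_with_rocm and test in ROCM_BLACKLIST:
            continue
        # A fresh interpreter per test script: the child cannot inherit a
        # HIP context that was (accidentally) initialized in the parent.
        subprocess.check_call([sys.executable, test + '.py'])

if __name__ == '__main__':
    run_tests()
```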
Is there a way we can try this PR on workers 20 and 30 only? If the SIGIOT error rate does not drop to a level similar to the other workers, there might be other issues.
@petrex The ihipException issue has been reliably reproducible on workers 20 and 30 (that's why we pulled them out of the CI pool). I have tested on these two machines that the changes in this PR work around the issue (but again, this is just a workaround; we need to root-cause and fix this issue, otherwise I believe we will run into problems when enabling multi-GPU training). @ezyang There are indeed failures when running all tests in the same process, any hypothesis? :-)
Yes, probably some of the tests are not properly resetting the seed, so running everything in the same process perturbs the RNG stream they see.
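A hedged sketch of what "properly resetting the seed" looks like in this context; the fixed seed value and test body are illustrative, not taken from the actual test suite:

```python
import unittest
import torch

SEED = 1234  # illustrative fixed seed

class ExampleTest(unittest.TestCase):
    def setUp(self):
        # Re-seed before every test; otherwise, when many tests share one
        # process, earlier tests consume random numbers and shift the
        # values later tests observe.
        torch.manual_seed(SEED)

    def test_uses_rng(self):
        x = torch.randn(3)
        self.assertEqual(x.shape, torch.Size([3]))
```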
@pytorchbot retest this please |
Force-pushed 8efeba4 to db0cfad
('index_copy', (), (0, torch.tensor(0, dtype=torch.int64), torch.tensor(2.)), 'scalar_all_dim', [0]),
('index_fill', (S, S), (0, index_variable(2, S), 2), 'dim', [0]),
# FIXME: we should compute the derivative w.r.t torch.tensor(2)
def method_tests():
These test cases are used in both test_autograd and test_jit. I suspect test_autograd modifies the test inputs/outputs in place in some places, causing test_jit to fail when it tries to reuse the same inputs and outputs. After changing this into a function, so that test_autograd and test_jit each get their own copy, the tests are passing.
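A simplified sketch of the pattern described above (the entries shown are placeholders, not the real method_tests contents): returning the list from a function means every caller builds a fresh copy, so in-place mutation in one test module cannot leak into another.

```python
import torch

S = 5  # tensor size used by the definitions below

def method_tests():
    # Built anew on every call, so test_autograd and test_jit each get
    # their own tensors and tuples to (possibly) modify in place.
    return [
        ('index_fill', (S, S), (0, torch.tensor([0, 1]), 2), 'dim', [0]),
        ('index_copy', (), (0, torch.tensor(0, dtype=torch.int64),
                            torch.tensor(2.)), 'scalar_all_dim', [0]),
    ]
```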
@bddppq has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
M = 10
S = 5
Surprised this didn't cause lint to fail lol
Because this wasn't a function before.
expected = f(sample, *values)
actual = traced_f(sample, *values)
self.assertEqual(expected, actual,
self.assertEqual(expected, actual, allow_inf=True,
What's up here?
There is one element that is inf in both the expected and actual results (#14600 (comment)).
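Roughly why a flag like allow_inf is needed (this helper is illustrative, not the actual common_utils.assertEqual implementation): a plain difference check breaks down on inf, since inf - inf is nan, so inf positions have to be matched exactly and only the finite entries compared against a tolerance.

```python
import torch

def assert_close_allow_inf(expected, actual, tol=1e-5):
    # inf must appear in the same positions with the same sign.
    inf_mask = torch.isinf(expected)
    assert torch.equal(inf_mask, torch.isinf(actual)), "inf positions differ"
    assert torch.equal(expected[inf_mask], actual[inf_mask]), "inf values differ"
    # Only the finite entries go through the usual tolerance check.
    diff = (expected[~inf_mask] - actual[~inf_mask]).abs()
    assert bool(diff.le(tol).all()), "finite entries differ beyond tol"
```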
Looks OK, I guess? There are a few sprinkled seeding calls which look a bit unmotivated. But I guess it's all fine since we're still doing regular old out-of-process for non-ROCm.
#14497