Could not start gRPC server flakiness in XLA tests #77808

Open
suo opened this issue May 19, 2022 · 10 comments
Labels
module: xla (Related to XLA support), triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

Comments

@suo
Member

suo commented May 19, 2022

For some examples, see here.

Can we add some retries or something to this test?
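As a rough illustration of what "some retries" could look like, here is a minimal sketch. It is not the pytorch/xla implementation: `start_server` is a hypothetical callable standing in for whatever launches the gRPC server in the test, and `find_free_port` and `start_with_retries` are names made up for this example. It assumes a failed start surfaces as an `OSError` or `RuntimeError`.

```python
import socket
import time


def find_free_port() -> int:
    """Ask the OS for a currently unused TCP port (hypothetical helper)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
        return s.getsockname()[1]


def start_with_retries(start_server, max_attempts=3, base_delay=0.5):
    """Call start_server(port) until it succeeds or attempts run out.

    start_server is an assumed stand-in for the real server launcher;
    it should raise OSError or RuntimeError when the server cannot start.
    """
    last_error = None
    for attempt in range(max_attempts):
        port = find_free_port()  # pick a fresh port on every attempt
        try:
            return start_server(port)
        except (OSError, RuntimeError) as err:
            last_error = err
            if attempt < max_attempts - 1:
                # Exponential backoff before the next attempt.
                time.sleep(base_delay * (2 ** attempt))
    raise RuntimeError(
        f"server failed to start after {max_attempts} attempts"
    ) from last_error
```

Note that another process can still grab the port between `find_free_port` returning and the server binding it, so this only reduces the flakiness rather than eliminating it.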

cc @bdhirsh

facebook-github-bot added the "oncall: distributed" label (Add this issue/PR to distributed oncall triage queue) May 19, 2022
suo added the "module: xla" label (Related to XLA support) and removed the "oncall: distributed" label (Add this issue/PR to distributed oncall triage queue) May 19, 2022
@suo
Member Author

suo commented May 19, 2022

cc @JackCaoG, can you have someone take a look?

@JackCaoG
Collaborator

Hmm, I had pytorch/xla@935b602, which was supposed to make this situation slightly better, but it got reverted due to a CI issue. Let me try to reland this change.

suo added the "triaged" label (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module) May 19, 2022
@suo
Member Author

suo commented May 26, 2022

Another example from today on master https://github.com/pytorch/pytorch/runs/6615801243?check_suite_focus=true

@JackCaoG
Collaborator

Hmm, I merged the fix pytorch/xla#3605 and PyTorch PR #78327, but I see that a different test is now triggering this failure.

I will do some further cleanup.

@JackCaoG
Collaborator

pytorch/xla#3615 should help a bit more. Let me know if you still see this kind of error in pytorch CI.

@suo
Member Author

suo commented May 31, 2022

Another recent failure on master https://github.com/pytorch/pytorch/runs/6659602672?check_suite_focus=true

@suo
Member Author

suo commented May 31, 2022

Another recent failure on master; this one is not "Could not start gRPC server" but also appears to be multiprocessing-related: https://github.com/pytorch/pytorch/runs/6676424724?check_suite_focus=true

@JackCaoG
Collaborator

I think those failures are not recent; if they rebase their PyTorch branch, the mp tests should not be run on CI.

@suo
Member Author

suo commented May 31, 2022

This failure was on the master branch :)

Maybe we didn't update the pin in a while?

@JackCaoG
Collaborator

I think so. I did not see the call to run_mp_op_tests, which should be at https://github.com/pytorch/xla/blob/master/test/run_tests.sh#L113. I think we update the pin nightly?
