`Could not start gRPC server` flakiness in XLA tests #77808

suo · 2022-05-19T00:16:08Z

For some examples, see here.

Can we add some retries or something to this test?
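A rough sketch of what "some retries" could look like, assuming the failure is transient and surfaces as a `RuntimeError` (the real XLA test harness may raise something else). `retry` and `make_flaky_start` are hypothetical names used only for illustration; they are not part of pytorch/xla.

```python
import time


def retry(fn, attempts=3, delay=0.0, exceptions=(RuntimeError,)):
    """Call fn() until it succeeds or `attempts` tries are used up."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except exceptions:
            if attempt == attempts:
                raise  # out of tries: surface the last error unchanged
            time.sleep(delay * attempt)  # linear backoff before the next try


def make_flaky_start(failures):
    """Hypothetical stand-in for a server start that fails `failures` times."""
    state = {"calls": 0}

    def start():
        state["calls"] += 1
        if state["calls"] <= failures:
            raise RuntimeError("Could not start gRPC server")
        return state["calls"]  # number of the call that finally succeeded

    return start
```

Something like `retry(make_flaky_start(2), attempts=3)` would succeed on the third call. Retrying only hides the flakiness, so it is a stopgap rather than a fix for whatever keeps the server from starting.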

suo · 2022-05-19T00:16:33Z

cc @JackCaoG can you have someone take a look

JackCaoG · 2022-05-19T01:24:55Z

Hmm, I had pytorch/xla@935b602, which was supposed to make this situation slightly better, but it got reverted due to a CI issue. Let me try to reland this change.

suo · 2022-05-26T23:34:27Z

Another example from today on master https://github.com/pytorch/pytorch/runs/6615801243?check_suite_focus=true

JackCaoG · 2022-05-26T23:39:06Z

Hmm, I merged the fix pytorch/xla#3605 and PyTorch PR #78327, but I see that a different test is now triggering this failure.

I will do some further cleanup.

JackCaoG · 2022-05-27T18:47:54Z

pytorch/xla#3615 should help a bit more. Let me know if you still see this kind of error in pytorch CI.

suo · 2022-05-31T17:01:51Z

Another recent failure on master https://github.com/pytorch/pytorch/runs/6659602672?check_suite_focus=true

suo · 2022-05-31T20:56:16Z

Another recent failure on master; this one is not `Could not start gRPC server`, but it also appears to be multiprocessing related: https://github.com/pytorch/pytorch/runs/6676424724?check_suite_focus=true

JackCaoG · 2022-05-31T20:58:08Z

I think those failures are not recent. If they rebase their PyTorch branch, the mp tests should not be run on CI.

suo · 2022-05-31T21:00:31Z

This failure was on the master branch :)

Maybe we didn't update the pin in a while?

JackCaoG · 2022-05-31T21:02:50Z

I think so. I did not see the call to run_mp_op_tests, which should be at https://github.com/pytorch/xla/blob/master/test/run_tests.sh#L113. I think we update the pin nightly?

facebook-github-bot added the oncall: distributed label May 19, 2022

suo added the module: xla label and removed the oncall: distributed label May 19, 2022

suo added the triaged label May 19, 2022
