MacOS state_dict tests in CI are failing during shutdown #1256

Closed
andrewkho opened this issue May 9, 2024 · 2 comments · Fixed by #1257

@andrewkho
Contributor

🐛 Describe the bug

The macOS tests in the StatefulDataLoader CI action fail intermittently during shutdown. On Mac, shutdown also takes much longer than on Windows and Ubuntu (10 minutes vs. 1 second). I'm not sure what causes GitHub Actions to mark the test as failed, but I created an issue on actions/setup-python and have had no response so far: actions/setup-python#857

Although we still get positive signals from the test, it shows up as an X in GitHub.

Versions

Nightly / main branch in CI

@andrewkho andrewkho self-assigned this May 9, 2024
@andrewkho
Contributor Author

I've been trying to isolate the problem on this branch: #1255. I'm unable to repro on my Mac laptop, so I'm just trying to bisect it by kicking off CI runs. So far, it's definitely due to test_state_dict.py.

The best signal I get is that the "Complete job" or Python cleanup logs will sometimes show a bunch of lines like: Terminate orphan process: pid (xxxxx) (torch_shm_manager).
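One way to check for these leftover processes locally is to scan the process table after the tests finish. The following is only a rough sketch: the helper names `find_processes` and `list_shm_managers` are made up for illustration, and it assumes a POSIX `ps` is available (macOS or Linux).

```python
import subprocess


def find_processes(ps_output, name):
    """Return PIDs from `ps -A -o pid=,args=` style output whose command line contains `name`."""
    pids = []
    for line in ps_output.splitlines():
        # Each line looks like "<pid> <command line>"; skip anything else.
        parts = line.strip().split(None, 1)
        if len(parts) == 2 and parts[0].isdigit() and name in parts[1]:
            pids.append(int(parts[0]))
    return pids


def list_shm_managers():
    """List PIDs of torch_shm_manager processes that are currently running."""
    result = subprocess.run(
        ["ps", "-A", "-o", "pid=,args="],
        capture_output=True,
        text=True,
        check=True,
    )
    return find_processes(result.stdout, "torch_shm_manager")


if __name__ == "__main__":
    print("torch_shm_manager PIDs still alive:", list_shm_managers())
```

Running this right after the test process exits would show whether any torch_shm_manager processes outlive it.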

Digging into the docs and code, it looks like on macOS the default sharing strategy is file_system (instead of file_descriptor), which launches a torch_shm_manager process in the background. It gets launched here, but the PID is never held on to, and there is no obvious cleanup code that gets called:
https://github.com/pytorch/pytorch/blob/main/torch/lib/libshm/core.cpp?fbclid=IwAR0DG3o68svdVDUkMCbb-0KM95IzxpsAeWS27m57fWAx84su9stZbsa3H_4#L25
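To confirm which sharing strategy a given machine uses, something like the sketch below should work. `describe_sharing_strategy` is a hypothetical helper name; the two `torch.multiprocessing` calls it wraps are the public ones for querying the strategy. It returns None when torch is not installed.

```python
def describe_sharing_strategy():
    """Return (current, available) sharing strategies, or None if torch is not installed."""
    try:
        import torch.multiprocessing as torch_mp
    except ImportError:
        return None
    current = torch_mp.get_sharing_strategy()
    # get_all_sharing_strategies() returns a set; sort it for stable output.
    available = sorted(torch_mp.get_all_sharing_strategies())
    return current, available


if __name__ == "__main__":
    print(describe_sharing_strategy())
```

Comparing the output on a Mac and on a Linux runner would show whether the file_system strategy is really the difference between the two environments.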


@andrewkho
Contributor Author

It seems like on macOS, multiprocessing fork behaves more like spawn and requires importing all the modules again. Something about increasing the total number of worker subprocesses in the test causes massive slowdowns in cleanup. The simplest thing to do at this point is to shard the tests. I'll probably give this a shot tomorrow.
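For context, Python's own multiprocessing has defaulted to the spawn start method on macOS since Python 3.8, and spawn starts each child in a fresh interpreter that re-imports the main module. A quick way to see what a given runner uses is sketched below; `start_method_report` is just an illustrative helper name.

```python
import multiprocessing as mp
import platform


def start_method_report():
    """Collect the platform name plus the default and available multiprocessing start methods."""
    return {
        "platform": platform.system(),
        "default": mp.get_start_method(),
        "available": mp.get_all_start_methods(),
    }


if __name__ == "__main__":
    for key, value in start_method_report().items():
        print(f"{key}: {value}")
```

If the macOS runner reports spawn while the Ubuntu runner reports fork, every worker subprocess on Mac pays the import cost again, which would fit the slowdown growing with the number of workers.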
