PyTorch Distributed Elastic Launch Segmentation Fault with Python 3.12 #116423
Comments
Can you attach a backtrace please?
@XilunWu I believe
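One way to collect at least a Python-level backtrace for a crash like this is the standard-library `faulthandler` module. The sketch below is not taken from this issue: the helper names `run_with_faulthandler` and `describe_exit` are hypothetical, and it assumes a POSIX system, where a negative return code identifies the signal that killed the child. It only reports Python frames; native C++ frames would still need a debugger such as gdb.

```python
import signal
import subprocess
import sys


def run_with_faulthandler(argv):
    """Run a child Python interpreter with faulthandler enabled.

    `argv` is whatever would normally follow `python` on the command line
    (for example a script path and its arguments). Returns a tuple of
    (returncode, stderr). This helper is a hypothetical illustration.
    """
    # -X faulthandler installs handlers that dump the Python traceback
    # to stderr on fatal signals such as SIGSEGV.
    cmd = [sys.executable, "-X", "faulthandler", *argv]
    proc = subprocess.run(cmd, capture_output=True, text=True)
    return proc.returncode, proc.stderr


def describe_exit(returncode):
    """Summarize a child's return code (POSIX convention assumed)."""
    if returncode < 0:
        # A negative value means the child was terminated by that signal.
        return f"killed by {signal.Signals(-returncode).name}"
    return f"exited with code {returncode}"
```

For example, `run_with_faulthandler(["repro.py"])` (with `repro.py` standing in for the failing script) returns the exit status and whatever traceback `faulthandler` wrote to stderr, which can then be pasted into the issue.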
Summary: Disabling python 3.12 for validating binaries until Torch elastic issue: pytorch/pytorch#116423 is resolved Reviewed By: henrylhtsang Differential Revision: D52809181
Summary: Pull Request resolved: #1633 Disabling python 3.12 for validating binaries until Torch elastic issue: pytorch/pytorch#116423 is resolved Reviewed By: henrylhtsang Differential Revision: D52809181 fbshipit-source-id: b24af6e07b1c91981ee127e9069f8a77d9297258
Can confirm that TCPStore has an issue under Python 3.12. Test steps:
Output:
Did you narrow this down to a specific root cause? @XilunWu
Any update on this? It would be nice to know if the segfault is happening on the client or server side, and also whether it persists when we use the LibUV backend. We plan to roll out libuv as default anyway, so that would be an easy fix if it were the case.
Issue persists even with LibUV. See #125990 for additional easier repro steps.
🐛 Describe the bug
With Python 3.12, using torch.distributed elastic_launch results in a segmentation fault. The same code works with Python 3.11.
Versions
[conda] numpy 1.26.2 pypi_0 pypi
[conda] torch 2.3.0.dev20231218+cu121 pypi_0 pypi
[conda] torchmetrics 1.0.3 pypi_0 pypi
[conda] torchrec 0.5.0.dev20231218+cu121 pypi_0 pypi
cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225 @dzhulgakov