
PyTorch Distributed Elastic Launch Segmentation Fault with Python 3.12 #116423

Open · PaulZhang12 opened this issue Dec 26, 2023 · 8 comments
Labels: module: crash (Problem manifests as a hard crash, as opposed to a RuntimeError), module: elastic (Related to torch.distributed.elastic), oncall: distributed (Add this issue/PR to distributed oncall triage queue)

Comments

PaulZhang12 (Contributor) commented Dec 26, 2023

🐛 Describe the bug

With Python 3.12, using torch.distributed's elastic_launch results in a segmentation fault. The same code works with Python 3.11.

import torch.distributed.launcher as pet
import uuid
import tempfile
import os


def get_launch_config(world_size: int, rdzv_endpoint: str) -> pet.LaunchConfig:
    return pet.LaunchConfig(
        min_nodes=1,
        max_nodes=1,
        nproc_per_node=world_size,
        run_id=str(uuid.uuid4()),
        rdzv_backend="c10d",
        rdzv_endpoint=rdzv_endpoint,
        rdzv_configs={"store_type": "file"},
        start_method="spawn",
        monitor_interval=1,
        max_restarts=0,
    )

def entrypoint():
    print("Here!")


if __name__ == '__main__':
    with tempfile.TemporaryDirectory() as tmpdir:
        launch_config = get_launch_config(2, os.path.join(tmpdir, "rdzv"))
        pet.elastic_launch(launch_config, entrypoint=entrypoint)()

Versions

[conda] numpy 1.26.2 pypi_0 pypi
[conda] torch 2.3.0.dev20231218+cu121 pypi_0 pypi
[conda] torchmetrics 1.0.3 pypi_0 pypi
[conda] torchrec 0.5.0.dev20231218+cu121 pypi_0 pypi

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225 @dzhulgakov

jbschlosser added the oncall: distributed, module: elastic, and module: crash labels on Dec 26, 2023
malfet (Contributor) commented Dec 27, 2023

Can you attach a backtrace please?

PaulZhang12 (Contributor, Author) commented:
Running the script does not give a stack trace. This simple repro is based on torchrec metric tests that fail with Python 3.12, for which there is a stack trace with the segfault:
[Screenshot: 2023-12-27 at 6:04:23 PM]

Also attaching the backtrace obtained when running the script with gdb:
[Screenshot: 2023-12-27 at 6:03:40 PM]

XilunWu (Contributor) commented Dec 28, 2023

@kurman The stack trace shows the segfault is raised when calling store.compare_set(). Not sure if it has anything to do with my libuv PR #116141, but our tests showed no failures IIRC. Not necessarily an issue on our side, but it's a heads-up.
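
For reference, the store.compare_set() path the backtrace points at can be exercised with just a TCPStore in a single process; a minimal sketch (the host, port, and key/value strings below are arbitrary assumptions):

from datetime import timedelta

import torch.distributed as dist

# Minimal sketch of the TCPStore.compare_set() call the backtrace points at.
# One process acts as both server (is_master=True) and client.
store = dist.TCPStore(
    "localhost", 29500, world_size=1, is_master=True,
    timeout=timedelta(seconds=10),
)
store.set("key", "first")
# compare_set(key, expected, desired) swaps the value only if the stored value
# matches `expected`, and returns whatever is now held under `key`.
print(store.compare_set("key", "first", "second"))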

kurman (Contributor) commented Dec 28, 2023

Not sure if it has anything to do with my libuv PR

@XilunWu I believe useLibUV is false by default, so this should not be the root cause. I am testing this now, but I take it that we should triage this under general Python 3.12 support.
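
To take the libuv default out of the equation, one could force the backend both ways explicitly; a sketch, assuming this build exposes the use_libuv keyword on TCPStore (ports arbitrary):

from datetime import timedelta

import torch.distributed as dist

# Sketch: run the same compare_set() call with the libuv backend explicitly
# disabled and then enabled, instead of relying on the default.
# Assumes the `use_libuv` keyword argument is available in this build.
for use_libuv in (False, True):
    store = dist.TCPStore(
        "localhost", 29501 + int(use_libuv), world_size=1, is_master=True,
        timeout=timedelta(seconds=10), use_libuv=use_libuv,
    )
    store.set("key", "first")
    print(f"use_libuv={use_libuv}:", store.compare_set("key", "first", "second"))
    del store  # shut down this server before starting the next one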

PaulZhang12 added a commit to PaulZhang12/torchrec that referenced this issue Jan 16, 2024
Summary:

Disabling python 3.12 for validating binaries until Torch elastic issue: pytorch/pytorch#116423 is resolved

Reviewed By: henrylhtsang

Differential Revision: D52809181
facebook-github-bot pushed a commit to pytorch/torchrec that referenced this issue Jan 17, 2024
Summary:
Pull Request resolved: #1633

Disabling python 3.12 for validating binaries until Torch elastic issue: pytorch/pytorch#116423 is resolved

Reviewed By: henrylhtsang

Differential Revision: D52809181

fbshipit-source-id: b24af6e07b1c91981ee127e9069f8a77d9297258
XilunWu (Contributor) commented Mar 21, 2024

Can confirm that TCPStore has an issue under Python 3.12.

Test steps:

  1. Check out the latest pytorch main branch
  2. Build from source in a Python 3.12 conda environment
  3. Run pytest test/distributed/test_store.py

Output:

================================================================================================================= test session starts =================================================================================================================
platform linux -- Python 3.12.2, pytest-7.4.0, pluggy-1.0.0
rootdir: /data/users/xilunwu/oss/pytorch
configfile: pytest.ini
plugins: hypothesis-6.99.11
collected 85 items                                                                                                                                                                                                                                    
Running 85 items in this shard

test/distributed/test_store.py sFatal Python error: Segmentation fault

Current thread 0x00007f741b395740 (most recent call first):
  File "/data/users/xilunwu/oss/pytorch/test/distributed/test_store.py", line 105 in _test_compare_set
  File "/data/users/xilunwu/oss/pytorch/test/distributed/test_store.py", line 121 in test_compare_set
  File "/data/users/xilunwu/oss/pytorch/torch/testing/_internal/common_utils.py", line 2739 in wrapper
  File "/home/xilunwu/local/miniconda3/envs/py312/lib/python3.12/unittest/case.py", line 589 in _callTestMethod
  File "/home/xilunwu/local/miniconda3/envs/py312/lib/python3.12/unittest/case.py", line 634 in run
  File "/data/users/xilunwu/oss/pytorch/torch/testing/_internal/common_utils.py", line 2838 in _run_custom
  File "/data/users/xilunwu/oss/pytorch/torch/testing/_internal/common_utils.py", line 2866 in run
  File "/home/xilunwu/local/miniconda3/envs/py312/lib/python3.12/unittest/case.py", line 690 in __call__
  File "/home/xilunwu/local/miniconda3/envs/py312/lib/python3.12/site-packages/_pytest/unittest.py", line 333 in runtest
  File "/home/xilunwu/local/miniconda3/envs/py312/lib/python3.12/site-packages/_pytest/runner.py", line 169 in pytest_runtest_call
  File "/home/xilunwu/local/miniconda3/envs/py312/lib/python3.12/site-packages/pluggy/_callers.py", line 39 in _multicall
  File "/home/xilunwu/local/miniconda3/envs/py312/lib/python3.12/site-packages/pluggy/_manager.py", line 80 in _hookexec
  File "/home/xilunwu/local/miniconda3/envs/py312/lib/python3.12/site-packages/pluggy/_hooks.py", line 265 in __call__
  File "/home/xilunwu/local/miniconda3/envs/py312/lib/python3.12/site-packages/_pytest/runner.py", line 262 in <lambda>
  File "/home/xilunwu/local/miniconda3/envs/py312/lib/python3.12/site-packages/_pytest/runner.py", line 341 in from_call
  File "/home/xilunwu/local/miniconda3/envs/py312/lib/python3.12/site-packages/_pytest/runner.py", line 261 in call_runtest_hook
  File "/home/xilunwu/local/miniconda3/envs/py312/lib/python3.12/site-packages/_pytest/runner.py", line 222 in call_and_report
  File "/home/xilunwu/local/miniconda3/envs/py312/lib/python3.12/site-packages/_pytest/runner.py", line 133 in runtestprotocol
  File "/home/xilunwu/local/miniconda3/envs/py312/lib/python3.12/site-packages/_pytest/runner.py", line 114 in pytest_runtest_protocol
  File "/home/xilunwu/local/miniconda3/envs/py312/lib/python3.12/site-packages/pluggy/_callers.py", line 39 in _multicall
  File "/home/xilunwu/local/miniconda3/envs/py312/lib/python3.12/site-packages/pluggy/_manager.py", line 80 in _hookexec
  File "/home/xilunwu/local/miniconda3/envs/py312/lib/python3.12/site-packages/pluggy/_hooks.py", line 265 in __call__
  File "/home/xilunwu/local/miniconda3/envs/py312/lib/python3.12/site-packages/_pytest/main.py", line 349 in pytest_runtestloop
  File "/home/xilunwu/local/miniconda3/envs/py312/lib/python3.12/site-packages/pluggy/_callers.py", line 39 in _multicall
  File "/home/xilunwu/local/miniconda3/envs/py312/lib/python3.12/site-packages/pluggy/_manager.py", line 80 in _hookexec
  File "/home/xilunwu/local/miniconda3/envs/py312/lib/python3.12/site-packages/pluggy/_hooks.py", line 265 in __call__
  File "/home/xilunwu/local/miniconda3/envs/py312/lib/python3.12/site-packages/_pytest/main.py", line 324 in _main
  File "/home/xilunwu/local/miniconda3/envs/py312/lib/python3.12/site-packages/_pytest/main.py", line 270 in wrap_session
  File "/home/xilunwu/local/miniconda3/envs/py312/lib/python3.12/site-packages/_pytest/main.py", line 317 in pytest_cmdline_main
  File "/home/xilunwu/local/miniconda3/envs/py312/lib/python3.12/site-packages/pluggy/_callers.py", line 39 in _multicall
  File "/home/xilunwu/local/miniconda3/envs/py312/lib/python3.12/site-packages/pluggy/_manager.py", line 80 in _hookexec
  File "/home/xilunwu/local/miniconda3/envs/py312/lib/python3.12/site-packages/pluggy/_hooks.py", line 265 in __call__
  File "/home/xilunwu/local/miniconda3/envs/py312/lib/python3.12/site-packages/_pytest/config/__init__.py", line 166 in main
  File "/home/xilunwu/local/miniconda3/envs/py312/lib/python3.12/site-packages/_pytest/config/__init__.py", line 189 in console_main
  File "/home/xilunwu/local/miniconda3/envs/py312/bin/pytest", line 11 in <module>

Extension modules: numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, torch._C, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, psutil._psutil_linux, psutil._psutil_posix (total: 22)
Segmentation fault (core dumped)

cc @wconstab @kwen2501 @kurman

wconstab (Contributor) commented:
Did you narrow this down to a specific root cause? @XilunWu

wconstab (Contributor) commented:
Any update on this?

It would be nice to know whether the segfault is happening on the client side or the server side, and also whether it persists when we use the libuv backend. We plan to roll out libuv as the default anyway, so if libuv avoids the crash, that would be an easy fix.
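
One way to tell the two sides apart would be to host the store and run the client in separate processes, so the exit code of whichever process dies identifies the crashing side; a rough sketch (port, timeouts, and key names are arbitrary assumptions):

from datetime import timedelta
import multiprocessing as mp

import torch.distributed as dist

PORT = 29502  # arbitrary free port

def server():
    # Hosts the store only; wait_for_workers=False so the constructor returns
    # immediately, then block until the client signals completion.
    store = dist.TCPStore("localhost", PORT, world_size=2, is_master=True,
                          timeout=timedelta(seconds=30), wait_for_workers=False)
    store.wait(["done"])

def client():
    store = dist.TCPStore("localhost", PORT, world_size=2, is_master=False,
                          timeout=timedelta(seconds=30))
    store.set("key", "first")
    print(store.compare_set("key", "first", "second"))
    store.set("done", "1")

if __name__ == "__main__":
    ctx = mp.get_context("spawn")
    procs = [ctx.Process(target=fn) for fn in (server, client)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    # A negative exit code means that process was killed by a signal (e.g. SIGSEGV).
    print("exit codes:", [p.exitcode for p in procs])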

c-p-i-o (Contributor) commented May 13, 2024

whether it persists when we use the LibUV backend.

The issue persists even with LibUV. See #125990 for additional, easier repro steps.
