torchrun AttributeError caused by file_based_local_timer on Windows
#85427
This has now been released in v1.13.0 and has caused some CI failures on Windows.
* fastai: Remove ProgressCallback in tests. Per fastai/fastai#3809
* Pin PyTorch version. Per pytorch/pytorch#85427
cc @bchen2020 @d4l3k (Windows breakage)
I'm willing to stamp #76848 post factum: Windows distributed issues have been piling up and nobody on the engineering side was looking into fixing them (which this issue highlights perfectly, as it was filed on Sep 30th and nobody submitted a fix for it or marked it as release blocking). By the way, I've added a test that imports all public package names; let me check if it covers this: Lines 1785 to 1787 in ff94494
We can just add an import test for torch.distributed.elastic if distributed is broken. Though if distributed doesn't work, torchrun doesn't seem to have much value. @daavoo @ejguan can you share a bit about how you're using torchrun on Windows? cc @kiukchung
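The import test suggested above could be sketched roughly as follows. The helper name and module list are assumptions, not PyTorch's actual test code; stdlib names are used as stand-ins so the sketch runs without torch installed (a real test would list torch.distributed, torch.distributed.elastic, etc.):

```python
import importlib

# Stand-in module list; in PyTorch's suite this would contain the
# torch.distributed.* packages whose import must never fail.
MODULES_TO_SMOKE_TEST = ["json", "signal", "importlib.util"]

def smoke_test_imports(names=MODULES_TO_SMOKE_TEST):
    """Import every module in `names`, surfacing the first failure."""
    for name in names:
        importlib.import_module(name)
```

A failure anywhere in the import chain (like the AttributeError in this issue) would then be caught by CI on every platform the test runs on, not only when torchrun is actually launched.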
The straightforward fix is to get a platform-appropriate kill signal instead of defaulting it to signal.SIGKILL in the offending constructor above.
Well ... I am not using torchrun on Windows at all 😅 I just encountered the error in CI while importing a downstream library. Traceback:
_________________ ERROR collecting tests/test_huggingface.py __________________
.nox\tests-3-9\lib\site-packages\transformers\utils\import_utils.py:1063: in _get_module
return importlib.import_module("." + module_name, self.__name__)
C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\importlib\__init__.py:127: in import_module
return _bootstrap._gcd_import(name[level:], package, level)
<frozen importlib._bootstrap>:1030: in _gcd_import
???
<frozen importlib._bootstrap>:1007: in _find_and_load
???
<frozen importlib._bootstrap>:986: in _find_and_load_unlocked
???
<frozen importlib._bootstrap>:680: in _load_unlocked
???
<frozen importlib._bootstrap_external>:850: in exec_module
???
<frozen importlib._bootstrap>:228: in _call_with_frames_removed
???
.nox\tests-3-9\lib\site-packages\transformers\modeling_utils.py:78: in <module>
from accelerate import __version__ as accelerate_version
.nox\tests-3-9\lib\site-packages\accelerate\__init__.py:7: in <module>
from .accelerator import Accelerator
.nox\tests-3-9\lib\site-packages\accelerate\accelerator.py:27: in <module>
from .checkpointing import load_accelerator_state, load_custom_state, save_accelerator_state, save_custom_state
.nox\tests-3-9\lib\site-packages\accelerate\checkpointing.py:24: in <module>
from .utils import (
.nox\tests-3-9\lib\site-packages\accelerate\utils\__init__.py:96: in <module>
from .launch import PrepareForLaunch, _filter_args, get_launch_prefix
.nox\tests-3-9\lib\site-packages\accelerate\utils\launch.py:25: in <module>
import torch.distributed.run as distrib_run
.nox\tests-3-9\lib\site-packages\torch\distributed\run.py:386: in <module>
from torch.distributed.launcher.api import LaunchConfig, elastic_launch
.nox\tests-3-9\lib\site-packages\torch\distributed\launcher\__init__.py:10: in <module>
from torch.distributed.launcher.api import ( # noqa: F401
.nox\tests-3-9\lib\site-packages\torch\distributed\launcher\api.py:16: in <module>
from torch.distributed.elastic.agent.server.local_elastic_agent import LocalElasticAgent
.nox\tests-3-9\lib\site-packages\torch\distributed\elastic\agent\server\__init__.py:40: in <module>
from .local_elastic_agent import TORCHELASTIC_ENABLE_FILE_TIMER, TORCHELASTIC_TIMER_FILE
.nox\tests-3-9\lib\site-packages\torch\distributed\elastic\agent\server\local_elastic_agent.py:19: in <module>
import torch.distributed.elastic.timer as timer
.nox\tests-3-9\lib\site-packages\torch\distributed\elastic\timer\__init__.py:44: in <module>
from .file_based_local_timer import FileTimerClient, FileTimerServer, FileTimerRequest # noqa: F401
.nox\tests-3-9\lib\site-packages\torch\distributed\elastic\timer\file_based_local_timer.py:63: in <module>
class FileTimerClient(TimerClient):
.nox\tests-3-9\lib\site-packages\torch\distributed\elastic\timer\file_based_local_timer.py:81: in FileTimerClient
def __init__(self, file_path: str, signal=signal.SIGKILL) -> None:
E AttributeError: module 'signal' has no attribute 'SIGKILL'
The above exception was the direct cause of the following exception:
tests\test_huggingface.py:7: in <module>
from transformers import (
<frozen importlib._bootstrap>:1055: in _handle_fromlist
???
.nox\tests-3-9\lib\site-packages\transformers\utils\import_utils.py:1053: in __getattr__
module = self._get_module(self._class_to_module[name])
.nox\tests-3-9\lib\site-packages\transformers\utils\import_utils.py:1065: in _get_module
raise RuntimeError(
E RuntimeError: Failed to import transformers.modeling_utils because of the following error (look up to see its traceback):
E RuntimeError: module 'signal' has no attribute 'SIGKILL'
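The root cause is easy to confirm: the attribute the traceback fails to find simply does not exist in the signal module on Windows. A quick diagnostic:

```python
import signal
import sys

# On POSIX systems this prints True; on Windows it prints False, which is
# exactly why evaluating signal.SIGKILL at import time blows up there.
has_sigkill = hasattr(signal, "SIGKILL")
print(f"{sys.platform}: SIGKILL available = {has_sigkill}")
```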
Encountered this issue the same way as @daavoo in CI when importing torchrun. I believe a fix could be to change the implementation so the default is resolved at call time rather than at definition time (the parameter is renamed here so it no longer shadows the signal module):

def __init__(self, file_path: str, sig=None) -> None:
    super().__init__()
    self._file_path = file_path
    # Look up SIGKILL lazily, so merely importing this module
    # no longer fails on Windows.
    self.signal = signal.SIGKILL if sig is None else sig
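The reason the original default fails at import time is that Python evaluates default argument values once, when the `def` statement runs. A self-contained demonstration using a fake signal module (so the Windows behavior reproduces on any platform; the class names are illustrative only):

```python
class FakeWindowsSignal:
    """Mimics the signal module on Windows: SIGTERM exists, SIGKILL doesn't."""
    SIGTERM = 15

fake_signal = FakeWindowsSignal()

definition_failed = False
try:
    # The default value is evaluated right here, while the class body
    # executes, i.e. at import time of the enclosing module.
    class Client:
        def __init__(self, sig=fake_signal.SIGKILL) -> None:
            self.sig = sig
except AttributeError as exc:
    definition_failed = True
    print("failed at definition time:", exc)

# With a None default the lookup is deferred to call time, so the class
# definition (and hence the import) succeeds:
class LazyClient:
    def __init__(self, sig=None) -> None:
        self.sig = sig if sig is not None else fake_signal.SIGTERM
```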
Also, add `torch.distributed` to test imports, so that we would not regress in the future. Fixes #85427. Pull Request resolved: #88522. Approved by: https://github.com/d4l3k (cherry picked from commit f98edfc) Co-authored-by: Nikita Shulga <nshulga@meta.com>
🐛 Describe the bug
During import time of torchrun, an AttributeError is raised because module 'signal' has no attribute 'SIGKILL'. Here is the culprit: pytorch/torch/distributed/elastic/timer/file_based_local_timer.py
Line 81 in 8e1ae1c
I encountered this problem when running
subprocess.run(["torchrun", ..., "script.py"])
Versions
PyTorch Nightly
cc @ezyang @gchanan @zou3519 @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @peterjc123 @mszhanyi @skyline75489 @nbcsm @pietern @SciPioneer