torchrun AttributeError caused by file_based_local_timer on Windows
#85427
This has now been released in v1.13.0 and has caused some CI failures on Windows.
* fastai: Remove ProgressCallback in tests. Per fastai/fastai#3809
* Pin PyTorch version. Per pytorch/pytorch#85427
cc @bchen2020 @d4l3k (Windows breakage)
I'm willing to stamp #76848 post factum: Windows distributed issues have been piling up and nobody on the engineering side was looking into fixing them (which this issue highlights perfectly, as it was filed on Sep 30th and nobody submitted a fix for it or marked it as release blocking). By the way, I've added a test that imports all public package names; let me check if it covers this: Lines 1785 to 1787 in ff94494
We can just add an import test for torch.distributed.elastic if distributed is broken. Though if distributed doesn't work, torchrun doesn't seem to have much value. @daavoo @ejguan can you share a bit about how you're using torchrun on Windows? cc @kiukchung
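The import test suggested above could be sketched roughly as follows. The helper name and module list are assumptions, not PyTorch's actual test code; stdlib names are used as stand-ins so the sketch runs without torch installed (a real test would list torch.distributed, torch.distributed.elastic, etc.):

```python
import importlib

# Stand-in module list; in PyTorch's suite this would contain the
# torch.distributed.* packages whose import must never fail.
MODULES_TO_SMOKE_TEST = ["json", "signal", "importlib.util"]

def smoke_test_imports(names=MODULES_TO_SMOKE_TEST):
    """Import every module in `names`, surfacing the first failure."""
    for name in names:
        importlib.import_module(name)
```

A failure anywhere in the import chain (like the AttributeError in this issue) would then be caught by CI on every platform the test runs on, not only when torchrun is actually launched.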
The straightforward fix is to get a platform-appropriate kill signal instead of defaulting it to signal.SIGKILL in the offending constructor above.
Well ... I am not using torchrun on Windows at all 😅 I just encountered the error in CI while importing a downstream library. Traceback:
_________________ ERROR collecting tests/test_huggingface.py __________________
.nox\tests-3-9\lib\site-packages\transformers\utils\import_utils.py:1063: in _get_module
return importlib.import_module("." + module_name, self.__name__)
C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\importlib\__init__.py:127: in import_module
return _bootstrap._gcd_import(name[level:], package, level)
<frozen importlib._bootstrap>:1030: in _gcd_import
???
<frozen importlib._bootstrap>:1007: in _find_and_load
???
<frozen importlib._bootstrap>:986: in _find_and_load_unlocked
???
<frozen importlib._bootstrap>:680: in _load_unlocked
???
<frozen importlib._bootstrap_external>:850: in exec_module
???
<frozen importlib._bootstrap>:228: in _call_with_frames_removed
???
.nox\tests-3-9\lib\site-packages\transformers\modeling_utils.py:78: in <module>
from accelerate import __version__ as accelerate_version
.nox\tests-3-9\lib\site-packages\accelerate\__init__.py:7: in <module>
from .accelerator import Accelerator
.nox\tests-3-9\lib\site-packages\accelerate\accelerator.py:27: in <module>
from .checkpointing import load_accelerator_state, load_custom_state, save_accelerator_state, save_custom_state
.nox\tests-3-9\lib\site-packages\accelerate\checkpointing.py:24: in <module>
from .utils import (
.nox\tests-3-9\lib\site-packages\accelerate\utils\__init__.py:96: in <module>
from .launch import PrepareForLaunch, _filter_args, get_launch_prefix
.nox\tests-3-9\lib\site-packages\accelerate\utils\launch.py:25: in <module>
import torch.distributed.run as distrib_run
.nox\tests-3-9\lib\site-packages\torch\distributed\run.py:386: in <module>
from torch.distributed.launcher.api import LaunchConfig, elastic_launch
.nox\tests-3-9\lib\site-packages\torch\distributed\launcher\__init__.py:10: in <module>
from torch.distributed.launcher.api import ( # noqa: F401
.nox\tests-3-9\lib\site-packages\torch\distributed\launcher\api.py:16: in <module>
from torch.distributed.elastic.agent.server.local_elastic_agent import LocalElasticAgent
.nox\tests-3-9\lib\site-packages\torch\distributed\elastic\agent\server\__init__.py:40: in <module>
from .local_elastic_agent import TORCHELASTIC_ENABLE_FILE_TIMER, TORCHELASTIC_TIMER_FILE
.nox\tests-3-9\lib\site-packages\torch\distributed\elastic\agent\server\local_elastic_agent.py:19: in <module>
import torch.distributed.elastic.timer as timer
.nox\tests-3-9\lib\site-packages\torch\distributed\elastic\timer\__init__.py:44: in <module>
from .file_based_local_timer import FileTimerClient, FileTimerServer, FileTimerRequest # noqa: F401
.nox\tests-3-9\lib\site-packages\torch\distributed\elastic\timer\file_based_local_timer.py:63: in <module>
class FileTimerClient(TimerClient):
.nox\tests-3-9\lib\site-packages\torch\distributed\elastic\timer\file_based_local_timer.py:81: in FileTimerClient
def __init__(self, file_path: str, signal=signal.SIGKILL) -> None:
E AttributeError: module 'signal' has no attribute 'SIGKILL'
The above exception was the direct cause of the following exception:
tests\test_huggingface.py:7: in <module>
from transformers import (
<frozen importlib._bootstrap>:1055: in _handle_fromlist
???
.nox\tests-3-9\lib\site-packages\transformers\utils\import_utils.py:1053: in __getattr__
module = self._get_module(self._class_to_module[name])
.nox\tests-3-9\lib\site-packages\transformers\utils\import_utils.py:1065: in _get_module
raise RuntimeError(
E RuntimeError: Failed to import transformers.modeling_utils because of the following error (look up to see its traceback):
E RuntimeError: module 'signal' has no attribute 'SIGKILL'
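The root cause is easy to confirm: the attribute the traceback fails to find simply does not exist in the signal module on Windows. A quick diagnostic:

```python
import signal
import sys

# On POSIX systems this prints True; on Windows it prints False, which is
# exactly why evaluating signal.SIGKILL at import time blows up there.
has_sigkill = hasattr(signal, "SIGKILL")
print(f"{sys.platform}: SIGKILL available = {has_sigkill}")
```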
Encountered this issue the same way as @daavoo in CI when importing torchrun. I believe a fix could be to change the implementation so the default is resolved at call time rather than at definition time (the parameter is renamed here so it no longer shadows the signal module):

def __init__(self, file_path: str, sig=None) -> None:
    super().__init__()
    self._file_path = file_path
    # Look up SIGKILL lazily, so merely importing this module
    # no longer fails on Windows.
    self.signal = signal.SIGKILL if sig is None else sig
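The reason the original default fails at import time is that Python evaluates default argument values once, when the `def` statement runs. A self-contained demonstration using a fake signal module (so the Windows behavior reproduces on any platform; the class names are illustrative only):

```python
class FakeWindowsSignal:
    """Mimics the signal module on Windows: SIGTERM exists, SIGKILL doesn't."""
    SIGTERM = 15

fake_signal = FakeWindowsSignal()

definition_failed = False
try:
    # The default value is evaluated right here, while the class body
    # executes, i.e. at import time of the enclosing module.
    class Client:
        def __init__(self, sig=fake_signal.SIGKILL) -> None:
            self.sig = sig
except AttributeError as exc:
    definition_failed = True
    print("failed at definition time:", exc)

# With a None default the lookup is deferred to call time, so the class
# definition (and hence the import) succeeds:
class LazyClient:
    def __init__(self, sig=None) -> None:
        self.sig = sig if sig is not None else fake_signal.SIGTERM
```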
Also, add `torch.distributed` to test imports, so that we would not regress in the future. Fixes #85427. Pull Request resolved: #88522. Approved by: https://github.com/d4l3k (cherry picked from commit f98edfc) Co-authored-by: Nikita Shulga <nshulga@meta.com>
🐛 Describe the bug
During import time of torchrun, an AttributeError is raised because module 'signal' has no attribute 'SIGKILL'. Here is the culprit: pytorch/torch/distributed/elastic/timer/file_based_local_timer.py
Line 81 in 8e1ae1c
I encountered this problem when running
subprocess.run(["torchrun", ..., "script.py"])
Versions
PyTorch Nightly
cc @ezyang @gchanan @zou3519 @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @peterjc123 @mszhanyi @skyline75489 @nbcsm @pietern @SciPioneer