Closed
Description
Bug description
Hi team,
I noticed a recent change that moved train.py from the top-level folder of the project into the torchtitan subfolder. This causes run_train.sh to fail with the error message shown below.
The failure comes from the import "from torchtitan.components.checkpoint import CheckpointManager, TrainState" at the beginning of train.py. As far as I can tell, train.py cannot find a module named "torchtitan" because train.py now lives inside the torchtitan folder itself, so when it is run as a script, Python puts that folder (not the repo root) at the front of the import path.
I worked around it in a hacky way, but I am looking forward to better suggestions.
Thank you!
(/home/jianiw/local/jiani/pytorch-env) [jianiw@devvm7508]~/local/jiani/torchtitan% LOG_RANK=0,1 NGPU=4 ./run_train.sh
+ NGPU=4
+ LOG_RANK=0,1
+ CONFIG_FILE=./torchtitan/models/llama/train_configs/debug_model.toml
+ overrides=
+ '[' 0 -ne 0 ']'
+ PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
+ torchrun --nproc_per_node=4 --rdzv_backend c10d --rdzv_endpoint=localhost:0 --local-ranks-filter 0,1 --role rank --tee 3 torchtitan/train.py --job.config_file ./torchtitan/models/llama/train_configs/debug_model.toml
W0226 15:57:42.491000 2461839 torch/distributed/run.py:763]
W0226 15:57:42.491000 2461839 torch/distributed/run.py:763] *****************************************
W0226 15:57:42.491000 2461839 torch/distributed/run.py:763] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0226 15:57:42.491000 2461839 torch/distributed/run.py:763] *****************************************
[rank0]:Traceback (most recent call last):
[rank0]: File "/data/users/jianiw/jiani/torchtitan/torchtitan/train.py", line 14, in <module>
[rank0]: from torchtitan.components.checkpoint import CheckpointManager, TrainState
[rank0]:ModuleNotFoundError: No module named 'torchtitan'
[rank1]:Traceback (most recent call last):
[rank1]: File "/data/users/jianiw/jiani/torchtitan/torchtitan/train.py", line 14, in <module>
[rank1]: from torchtitan.components.checkpoint import CheckpointManager, TrainState
[rank1]:ModuleNotFoundError: No module named 'torchtitan'
E0226 15:57:44.126000 2461839 torch/distributed/elastic/multiprocessing/api.py:870] failed (exitcode: 1) local_rank: 0 (pid: 2462029) of binary: /home/jianiw/local/jiani/pytorch-env/bin/python
Traceback (most recent call last):
File "/home/jianiw/local/jiani/pytorch-env/bin/torchrun", line 33, in <module>
sys.exit(load_entry_point('torch', 'console_scripts', 'torchrun')())
File "/data/users/jianiw/jiani/pytorch/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 354, in wrapper
return f(*args, **kwargs)
File "/data/users/jianiw/jiani/pytorch/torch/distributed/run.py", line 889, in main
run(args)
File "/data/users/jianiw/jiani/pytorch/torch/distributed/run.py", line 880, in run
elastic_launch(
File "/data/users/jianiw/jiani/pytorch/torch/distributed/launcher/api.py", line 139, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/data/users/jianiw/jiani/pytorch/torch/distributed/launcher/api.py", line 270, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
torchtitan/train.py FAILED
------------------------------------------------------------
Failures:
[1]:
time : 2025-02-26_15:57:43
host : devvm7508.cco0.facebook.com
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 2462030)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
time : 2025-02-26_15:57:43
host : devvm7508.cco0.facebook.com
rank : 2 (local_rank: 2)
exitcode : 1 (pid: 2462032)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
time : 2025-02-26_15:57:43
host : devvm7508.cco0.facebook.com
rank : 3 (local_rank: 3)
exitcode : 1 (pid: 2462033)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2025-02-26_15:57:43
host : devvm7508.cco0.facebook.com
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 2462029)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
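The traceback above is consistent with how Python builds sys.path: when a file is run by path, the directory containing that file is placed first on the import path, whereas running with -m places the current working directory first. The sketch below reproduces this outside of torchtitan. It is only an illustration under stated assumptions: the package name demo_pkg_issue and the helper reproduce are hypothetical and are not part of torchtitan or torchrun.

```python
import os
import subprocess
import sys
import tempfile
from pathlib import Path


def reproduce(pkg_name="demo_pkg_issue"):
    """Return (exit code when run by file path, exit code when run with -m).

    Builds a throwaway package laid out like the issue describes:
    <root>/<pkg_name>/train.py importing from <pkg_name>.components.
    """
    # Drop variables that would change sys.path and hide the effect.
    env = {k: v for k, v in os.environ.items()
           if k not in ("PYTHONPATH", "PYTHONSAFEPATH")}

    with tempfile.TemporaryDirectory() as root:
        pkg = Path(root) / pkg_name
        (pkg / "components").mkdir(parents=True)
        (pkg / "__init__.py").write_text("")
        (pkg / "components" / "__init__.py").write_text("VALUE = 1\n")
        (pkg / "train.py").write_text(
            f"from {pkg_name}.components import VALUE\n"
        )

        # Run by path: sys.path[0] is <root>/<pkg_name>, so the top-level
        # package <pkg_name> is not importable.
        by_path = subprocess.run(
            [sys.executable, f"{pkg_name}/train.py"],
            cwd=root, env=env, capture_output=True,
        )

        # Run as a module: sys.path[0] is the working directory <root>,
        # so <pkg_name> resolves as a package.
        as_module = subprocess.run(
            [sys.executable, "-m", f"{pkg_name}.train"],
            cwd=root, env=env, capture_output=True,
        )
        return by_path.returncode, as_module.returncode


if __name__ == "__main__":
    by_path_code, as_module_code = reproduce()
    print("run by path exit code:", by_path_code)
    print("run with -m exit code:", as_module_code)
```

If the same reasoning applies here, two possible workarounds are launching the script as a module from the repo root (torchrun accepts a -m/--module flag, so something like "-m torchtitan.train" in place of "torchtitan/train.py") or installing the package into the environment so that "torchtitan" is importable from anywhere. Neither is confirmed by this issue as the chosen fix.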
Versions
Current main branch after #894 merged (I don't t