Moving train.py to torchtitan submodule makes run_train.sh fail with "Can not find module" #897

@jianiw25

Bug description

Hi team,

I noticed a recent change that moved train.py from the top-level folder of the project into the torchtitan subfolder. This causes run_train.sh to fail with the error message below.

The failure happens on the import "from torchtitan.components.checkpoint import CheckpointManager, TrainState" at the beginning of train.py. Because train.py is now launched as a plain script from inside the torchtitan package, the directory that contains the torchtitan package is no longer on Python's import path, so the module "torchtitan" cannot be found.

I fixed it in a hacky way, but I'm looking forward to suggestions for a cleaner solution.
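For illustration, one hacky workaround in this spirit (my sketch of a common fix, not necessarily the exact one used here) is to patch sys.path at the top of torchtitan/train.py, before any `from torchtitan...` imports. The path arithmetic assumes train.py sits exactly one level below the repo root:

```python
import sys
from pathlib import Path

# When torchrun executes torchtitan/train.py as a plain script, sys.path[0]
# is the torchtitan/ directory itself, so the repo root that contains the
# `torchtitan` package is not importable. Prepending the repo root to
# sys.path lets `import torchtitan` resolve again.
repo_root = Path(__file__).resolve().parent.parent  # torchtitan/ -> repo root
if str(repo_root) not in sys.path:
    sys.path.insert(0, str(repo_root))
```

A cleaner alternative may be to install the package in the environment (e.g. `pip install -e .` from the repo root) or to launch the trainer as a module rather than a file path (torchrun accepts `-m`, mirroring `python -m`), so that the package is resolved through normal import machinery instead of a sys.path hack.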

Thank you!

(/home/jianiw/local/jiani/pytorch-env) [jianiw@devvm7508]~/local/jiani/torchtitan% LOG_RANK=0,1 NGPU=4 ./run_train.sh
+ NGPU=4
+ LOG_RANK=0,1
+ CONFIG_FILE=./torchtitan/models/llama/train_configs/debug_model.toml
+ overrides=
+ '[' 0 -ne 0 ']'
+ PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
+ torchrun --nproc_per_node=4 --rdzv_backend c10d --rdzv_endpoint=localhost:0 --local-ranks-filter 0,1 --role rank --tee 3 torchtitan/train.py --job.config_file ./torchtitan/models/llama/train_configs/debug_model.toml
W0226 15:57:42.491000 2461839 torch/distributed/run.py:763] 
W0226 15:57:42.491000 2461839 torch/distributed/run.py:763] *****************************************
W0226 15:57:42.491000 2461839 torch/distributed/run.py:763] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W0226 15:57:42.491000 2461839 torch/distributed/run.py:763] *****************************************
[rank0]:Traceback (most recent call last):
[rank0]:  File "/data/users/jianiw/jiani/torchtitan/torchtitan/train.py", line 14, in <module>
[rank0]:    from torchtitan.components.checkpoint import CheckpointManager, TrainState
[rank0]:ModuleNotFoundError: No module named 'torchtitan'
[rank1]:Traceback (most recent call last):
[rank1]:  File "/data/users/jianiw/jiani/torchtitan/torchtitan/train.py", line 14, in <module>
[rank1]:    from torchtitan.components.checkpoint import CheckpointManager, TrainState
[rank1]:ModuleNotFoundError: No module named 'torchtitan'
E0226 15:57:44.126000 2461839 torch/distributed/elastic/multiprocessing/api.py:870] failed (exitcode: 1) local_rank: 0 (pid: 2462029) of binary: /home/jianiw/local/jiani/pytorch-env/bin/python
Traceback (most recent call last):
  File "/home/jianiw/local/jiani/pytorch-env/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch', 'console_scripts', 'torchrun')())
  File "/data/users/jianiw/jiani/pytorch/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 354, in wrapper
    return f(*args, **kwargs)
  File "/data/users/jianiw/jiani/pytorch/torch/distributed/run.py", line 889, in main
    run(args)
  File "/data/users/jianiw/jiani/pytorch/torch/distributed/run.py", line 880, in run
    elastic_launch(
  File "/data/users/jianiw/jiani/pytorch/torch/distributed/launcher/api.py", line 139, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/data/users/jianiw/jiani/pytorch/torch/distributed/launcher/api.py", line 270, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
torchtitan/train.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2025-02-26_15:57:43
  host      : devvm7508.cco0.facebook.com
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 2462030)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time      : 2025-02-26_15:57:43
  host      : devvm7508.cco0.facebook.com
  rank      : 2 (local_rank: 2)
  exitcode  : 1 (pid: 2462032)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
  time      : 2025-02-26_15:57:43
  host      : devvm7508.cco0.facebook.com
  rank      : 3 (local_rank: 3)
  exitcode  : 1 (pid: 2462033)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2025-02-26_15:57:43
  host      : devvm7508.cco0.facebook.com
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 2462029)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Versions

Current main branch, after #894 was merged.
