Moving train.py to torchtitan submodule makes run_train.sh fail with "Can not find module" #897

@jianiw25

Bug description

Hi team,

I noticed a recent change that moved train.py from the top-level folder of the project into the torchtitan subfolder. This causes run_train.sh to fail with the error message below.

The failure happens on the import "from torchtitan.components.checkpoint import CheckpointManager, TrainState" at the beginning of train.py. Because train.py is now launched as a plain script from inside the torchtitan package, the directory that contains the torchtitan package is no longer on Python's import path, so the module "torchtitan" cannot be found.

I fixed it in a hacky way, but I'm looking forward to suggestions for a cleaner solution.
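For illustration, one hacky workaround in this spirit (my sketch of a common fix, not necessarily the exact one used here) is to patch sys.path at the top of torchtitan/train.py, before any `from torchtitan...` imports. The path arithmetic assumes train.py sits exactly one level below the repo root:

```python
import sys
from pathlib import Path

# When torchrun executes torchtitan/train.py as a plain script, sys.path[0]
# is the torchtitan/ directory itself, so the repo root that contains the
# `torchtitan` package is not importable. Prepending the repo root to
# sys.path lets `import torchtitan` resolve again.
repo_root = Path(__file__).resolve().parent.parent  # torchtitan/ -> repo root
if str(repo_root) not in sys.path:
    sys.path.insert(0, str(repo_root))
```

A cleaner alternative may be to install the package in the environment (e.g. `pip install -e .` from the repo root) or to launch the trainer as a module rather than a file path (torchrun accepts `-m`, mirroring `python -m`), so that the package is resolved through normal import machinery instead of a sys.path hack.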

Thank you!

(/home/jianiw/local/jiani/pytorch-env) [jianiw@devvm7508]~/local/jiani/torchtitan% LOG_RANK=0,1 NGPU=4 ./run_train.sh
+ NGPU=4
+ LOG_RANK=0,1
+ CONFIG_FILE=./torchtitan/models/llama/train_configs/debug_model.toml
+ overrides=
+ '[' 0 -ne 0 ']'
+ PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
+ torchrun --nproc_per_node=4 --rdzv_backend c10d --rdzv_endpoint=localhost:0 --local-ranks-filter 0,1 --role rank --tee 3 torchtitan/train.py --job.config_file ./torchtitan/models/llama/train_configs/debug_model.toml
W0226 15:57:42.491000 2461839 torch/distributed/run.py:763] 
W0226 15:57:42.491000 2461839 torch/distributed/run.py:763] *****************************************
W0226 15:57:42.491000 2461839 torch/distributed/run.py:763] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W0226 15:57:42.491000 2461839 torch/distributed/run.py:763] *****************************************
[rank0]:Traceback (most recent call last):
[rank0]:  File "/data/users/jianiw/jiani/torchtitan/torchtitan/train.py", line 14, in <module>
[rank0]:    from torchtitan.components.checkpoint import CheckpointManager, TrainState
[rank0]:ModuleNotFoundError: No module named 'torchtitan'
[rank1]:Traceback (most recent call last):
[rank1]:  File "/data/users/jianiw/jiani/torchtitan/torchtitan/train.py", line 14, in <module>
[rank1]:    from torchtitan.components.checkpoint import CheckpointManager, TrainState
[rank1]:ModuleNotFoundError: No module named 'torchtitan'
E0226 15:57:44.126000 2461839 torch/distributed/elastic/multiprocessing/api.py:870] failed (exitcode: 1) local_rank: 0 (pid: 2462029) of binary: /home/jianiw/local/jiani/pytorch-env/bin/python
Traceback (most recent call last):
  File "/home/jianiw/local/jiani/pytorch-env/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch', 'console_scripts', 'torchrun')())
  File "/data/users/jianiw/jiani/pytorch/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 354, in wrapper
    return f(*args, **kwargs)
  File "/data/users/jianiw/jiani/pytorch/torch/distributed/run.py", line 889, in main
    run(args)
  File "/data/users/jianiw/jiani/pytorch/torch/distributed/run.py", line 880, in run
    elastic_launch(
  File "/data/users/jianiw/jiani/pytorch/torch/distributed/launcher/api.py", line 139, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/data/users/jianiw/jiani/pytorch/torch/distributed/launcher/api.py", line 270, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
torchtitan/train.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2025-02-26_15:57:43
  host      : devvm7508.cco0.facebook.com
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 2462030)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time      : 2025-02-26_15:57:43
  host      : devvm7508.cco0.facebook.com
  rank      : 2 (local_rank: 2)
  exitcode  : 1 (pid: 2462032)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
  time      : 2025-02-26_15:57:43
  host      : devvm7508.cco0.facebook.com
  rank      : 3 (local_rank: 3)
  exitcode  : 1 (pid: 2462033)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2025-02-26_15:57:43
  host      : devvm7508.cco0.facebook.com
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 2462029)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Versions

Current main branch, after #894 was merged.
