This repository has been archived by the owner on Jan 6, 2023. It is now read-only.

Add env support for the training script argument #128

Closed
kuikuikuizzZ opened this issue Sep 30, 2020 · 4 comments
Comments

kuikuikuizzZ commented Sep 30, 2020

Description

A common way to use this elastic tool is shown below:

```
python -m torchelastic.distributed.launch \
    --nnodes=$NUM_NODES \
    --nproc_per_node=$WORKERS_PER_NODE \
    --rdzv_id=$JOB_ID \
    --rdzv_backend=etcd \
    --rdzv_endpoint=$ETCD_HOST:$ETCD_PORT \
    main.py \
    --arch resnet18 \
    --epochs 20 \
    --batch-size 32 \
    <DATA_DIR>
```

Would it be possible to support arguments such as nnodes, rdzv_id, rdzv_backend, and rdzv_endpoint via environment variables like $NUM_NODES, $JOB_ID, $RDZV_BACKEND, and $RDZV_ENDPOINT, so that they no longer need to be passed on the command line?

Motivation/Background

It would make this elastic tool fit more smoothly into Kubernetes, since the controller's reconcile logic would no longer need to assemble the command-line arguments.

Detailed Proposal

One possible approach is to support environment variables in torchelastic/distributed/launch.py: if an argument is not present on the command line, the launcher would look it up in the environment.
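
For illustration, here is a minimal sketch of that fallback, reusing the env var names from the example above (arg_or_env is a hypothetical helper, not part of launch.py):

```python
# A minimal sketch of the proposed fallback, assuming the env var names from
# the example above. Not the actual launch.py implementation.
import argparse
import os

def arg_or_env(cli_value, env_var, default=None):
    """Prefer the command-line value; fall back to the env var, then a default."""
    if cli_value is not None:
        return cli_value
    return os.environ.get(env_var, default)

parser = argparse.ArgumentParser()
parser.add_argument("--nnodes", default=None)
parser.add_argument("--rdzv_id", default=None)
parser.add_argument("--rdzv_backend", default=None)
parser.add_argument("--rdzv_endpoint", default=None)
args = parser.parse_args()

nnodes = arg_or_env(args.nnodes, "NUM_NODES", default="1")
rdzv_id = arg_or_env(args.rdzv_id, "JOB_ID")
rdzv_backend = arg_or_env(args.rdzv_backend, "RDZV_BACKEND", default="etcd")
rdzv_endpoint = arg_or_env(args.rdzv_endpoint, "RDZV_ENDPOINT")
```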

Alternatives

Additional context/links

@kiukchung (Contributor)

Thanks for the feedback; this should be simple enough to add. If we follow a convention of naming the env vars TORCHELASTIC_{argument} (e.g. TORCHELASTIC_NNODES, TORCHELASTIC_NPROC_PER_NODE, TORCHELASTIC_RDZV_ID, etc.), would this work?

@kuikuikuizzZ (Author)

LGTM, thanks for the support.

kiukchung pushed a commit to kiukchung/elastic-1 that referenced this issue Oct 3, 2020
…name PET_ARG

Summary:
See: pytorch#128

Allows users to specify program args via env var as such:

```
PET_NNODES="1:2" python -m torchelastic.distributed.launch \
   --rdzv_id 123 \
   my_script.py script_args
```

Differential Revision: D24098270

fbshipit-source-id: e6cccaecc852840a44fc941017bf361a4769b698
@kiukchung (Contributor)

I had to use PET_{argument} (e.g. PET_NNODES, PET_NPROC_PER_NODE); PET stands for PyTorch Elastic Trainer, our codename. The TORCHELASTIC_ prefix could not be used because it is reserved for the torchelastic agent's internal settings. The precedence is: a command-line argument wins over the env var, which in turn wins over the default defined in the program. For instance:

```
PET_NNODES=2 python -m torchelastic.distributed.launch --nnodes 3 script.py
# runs with 3 nodes, not 2
```

Positionals (the training script and its args) cannot be set via env vars, so you can't do something like this:

```
PET_TRAINING_SCRIPT="script.py" python -m torchelastic.distributed.launch --nnodes 3
```

Flags can also be set via env vars:

```
PET_STANDALONE=1 python -m torchelastic.distributed.launch --nnodes 3 script.py
# equivalent to
python -m torchelastic.distributed.launch --nnodes 3 --standalone script.py

PET_STANDALONE=0 python -m torchelastic.distributed.launch --nnodes 3 script.py
# equivalent to
python -m torchelastic.distributed.launch --nnodes 3 script.py
```
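
For illustration, a minimal sketch of how a launcher could implement this precedence and flag behavior with argparse (env_default and env_flag are hypothetical helpers, not the actual launch.py code):

```python
# A minimal sketch of the precedence described above: CLI arg > PET_ env var >
# program default. Illustrative only; not the actual launch.py code.
import argparse
import os

def env_default(name, default):
    """Use PET_{NAME} as the argparse default so an explicit CLI value still wins."""
    return os.environ.get(f"PET_{name.upper()}", default)

def env_flag(name):
    """PET_{NAME}=1 turns the flag on; PET_{NAME}=0 or unset leaves it off."""
    return os.environ.get(f"PET_{name.upper()}", "0") == "1"

parser = argparse.ArgumentParser()
parser.add_argument("--nnodes", default=env_default("nnodes", "1"))
parser.add_argument("--standalone", action="store_true", default=env_flag("standalone"))
# Positionals are parsed normally; they deliberately have no env var fallback.
parser.add_argument("training_script")
parser.add_argument("training_script_args", nargs=argparse.REMAINDER)

args = parser.parse_args()
# With PET_NNODES=2 in the environment and --nnodes 3 on the command line,
# args.nnodes is "3": the command-line argument wins.
```

Feeding the env var in as the argparse default is what lets an explicit command-line value take priority without any extra resolution logic.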

facebook-github-bot pushed a commit that referenced this issue Oct 3, 2020
…name PET_ARG (#129)

Summary:
Pull Request resolved: #129

See: #128

Allows users to specify program args via env var as such:

```
PET_NNODES="1:2" python -m torchelastic.distributed.launch \
   --rdzv_id 123 \
   my_script.py script_args
```

Reviewed By: yifuwang

Differential Revision: D24098270

fbshipit-source-id: 5501c4331939df468fba5811f7b7e3b74e100da3
@kiukchung (Contributor)

PR merged and released as part of torchelastic 0.2.1.

fotstrt pushed a commit to eth-easl/elastic that referenced this issue Feb 17, 2022
…name PET_ARG (pytorch#129)
