Update docs for --lazy-load and torch.distributed.launch
Summary: Pull Request resolved: facebookresearch/fairseq#433

Differential Revision: D13588032

Pulled By: myleott

fbshipit-source-id: 0e5ff361e27b206c4490264f0f51863367499e81
myleott authored and yzpang committed Feb 19, 2021
1 parent a159582 · commit eb4ed19
Showing 1 changed file with 25 additions and 41 deletions.
66 changes: 25 additions & 41 deletions docs/getting_started.rst
@@ -154,51 +154,35 @@ Fairseq supports FP16 training with the ``--fp16`` flag:
     > python train.py --fp16 (...)

-Distributed training
---------------------
+Lazily loading large training datasets
+--------------------------------------

-Distributed training in fairseq is implemented on top of
-`torch.distributed <http://pytorch.org/docs/master/distributed.html>`__.
-Training begins by launching one worker process per GPU. These workers
-discover each other via a unique host and port (required) that can be
-used to establish an initial connection. Additionally, each worker has a
-rank, that is a unique number from 0 to n-1 where n is the total number
-of GPUs.
+By default fairseq loads the entire training set into system memory. For large
+datasets, the ``--lazy-load`` option can be used to instead load batches on-demand.
+For optimal performance, use the ``--num-workers`` option to control the number
+of background processes that will load batches.

-If you run on a cluster managed by
-`SLURM <https://slurm.schedmd.com/>`__ you can train a large
-English-French model on the WMT 2014 dataset on 16 nodes with 8 GPUs
-each (in total 128 GPUs) using this command:
+Distributed training
+--------------------

-.. code-block:: console
+Distributed training in fairseq is implemented on top of ``torch.distributed``.
+The easiest way to launch jobs is with the `torch.distributed.launch
+<https://pytorch.org/docs/stable/distributed.html#launch-utility>`__ tool.

-    > DATA=...  # path to the preprocessed dataset, must be visible from all nodes
-    > PORT=9218  # any available TCP port that can be used by the trainer to establish initial connection
-    > sbatch --job-name fairseq-py --gres gpu:8 --cpus-per-task 10 \
-        --nodes 16 --ntasks-per-node 8 \
-        --wrap 'srun --output train.log.node%t --error train.stderr.node%t.%j \
-        python train.py $DATA \
-        --distributed-world-size 128 \
-        --distributed-port $PORT \
-        --force-anneal 50 --lr-scheduler fixed --max-epoch 55 \
-        --arch fconv_wmt_en_fr --optimizer nag --lr 0.1,4 --max-tokens 3000 \
-        --clip-norm 0.1 --dropout 0.1 --criterion label_smoothed_cross_entropy \
-        --label-smoothing 0.1 --wd 0.0001'
-
-Alternatively you can manually start one process per GPU:
+For example, to train a large English-German Transformer model on 2 nodes each
+with 8 GPUs (in total 16 GPUs), run the following command on each node,
+replacing ``node_rank=0`` with ``node_rank=1`` on the second node:

 .. code-block:: console

-    > DATA=...  # path to the preprocessed dataset, must be visible from all nodes
-    > HOST_PORT=master.example.com:9218  # one of the hosts used by the job
-    > RANK=...  # the rank of this process, from 0 to 127 in case of 128 GPUs
-    > LOCAL_RANK=...  # the local rank of this process, from 0 to 7 in case of 8 GPUs per machine
-    > python train.py $DATA \
-        --distributed-world-size 128 \
-        --distributed-init-method 'tcp://$HOST_PORT' \
-        --distributed-rank $RANK \
-        --device-id $LOCAL_RANK \
-        --force-anneal 50 --lr-scheduler fixed --max-epoch 55 \
-        --arch fconv_wmt_en_fr --optimizer nag --lr 0.1,4 --max-tokens 3000 \
-        --clip-norm 0.1 --dropout 0.1 --criterion label_smoothed_cross_entropy \
-        --label-smoothing 0.1 --wd 0.0001
+    > python -m torch.distributed.launch --nproc_per_node=8 \
+        --nnodes=2 --node_rank=0 --master_addr="192.168.1.1" \
+        --master_port=1234 \
+        train.py data-bin/wmt16_en_de_bpe32k \
+        --arch transformer_vaswani_wmt_en_de_big --share-all-embeddings \
+        --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
+        --lr-scheduler inverse_sqrt --warmup-init-lr 1e-07 --warmup-updates 4000 \
+        --lr 0.0005 --min-lr 1e-09 \
+        --dropout 0.3 --weight-decay 0.0 --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
+        --max-tokens 3584 \
+        --fp16
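
As a usage note on the lazy-loading paragraph added above, a minimal sketch of combining the two new options might look like the following; the worker count of 4 is an arbitrary placeholder, and the dataset path and architecture are simply reused from the example in the diff:

.. code-block:: console

    > python train.py data-bin/wmt16_en_de_bpe32k \
        --lazy-load --num-workers 4 \
        --arch transformer_vaswani_wmt_en_de_big (...)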

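Similarly, since the new text says to repeat the command on the second node with only the node rank changed, the node 1 launch would look like this sketch (same placeholder master address and port as in the diff, with the remaining training flags elided):

.. code-block:: console

    > python -m torch.distributed.launch --nproc_per_node=8 \
        --nnodes=2 --node_rank=1 --master_addr="192.168.1.1" \
        --master_port=1234 \
        train.py data-bin/wmt16_en_de_bpe32k (...)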