jupyter | ||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
This is a self contained guide on how to write a simple app and start launching distributed jobs on local and remote clusters.
First thing we need to do is to install the TorchX python package which includes the CLI and the library.
# install torchx with all dependencies
$ pip install "torchx[dev]"
See the README for more information on installation.
torchx --help
Lets start off with writing a simple "Hello World" python app. This is just a normal python program and can contain anything you'd like.
%%writefile my_app.py
import sys
print(f"Hello, {sys.argv[1]}!")
We can execute our app via torchx run
. The
local_cwd
scheduler executes the app relative to the current directory.
For this we'll use the utils.python
component:
torchx run --scheduler local_cwd utils.python --help
The component takes in the script name and any extra arguments will be passed to the script itself.
torchx run --scheduler local_cwd utils.python --script my_app.py "your name"
We can run the exact same app via the local_docker
scheduler. This scheduler
will package up the local workspace as a layer on top of the specified image.
This provides a very similar environment to the container based remote
schedulers.
torchx run --scheduler local_docker utils.python --script my_app.py "your name"
TorchX defaults to using the ghcr.io/pytorch/torchx Docker container image which contains the PyTorch libraries, TorchX and related dependencies.
TorchX's dist.ddp
component uses
TorchElastic
to manage the workers. This means you can launch multi-worker and multi-host
jobs out of the box on all of the schedulers we support.
torchx run --scheduler local_docker dist.ddp --help
Lets create a slightly more interesting app to leverage the TorchX distributed support.
%%writefile dist_app.py
import torch
import torch.distributed as dist
dist.init_process_group(backend="gloo")
print(f"I am worker {dist.get_rank()} of {dist.get_world_size()}!")
a = torch.tensor([dist.get_rank()])
dist.all_reduce(a)
print(f"all_reduce output = {a}")
Let launch a small job with 2 nodes and 2 worker processes per node:
torchx run --scheduler local_docker dist.ddp -j 2x2 --script dist_app.py
For each scheduler there's a concept of an image
. For local_cwd
and slurm
it uses the current working directory. For container based schedulers such as
local_docker
, kubernetes
and aws_batch
it uses a docker container.
To provide the same environment between local and remote jobs, TorchX CLI uses workspaces to automatically patch images for remote jobs on a per scheduler basis.
When you launch a job via torchx run
it'll overlay the current directory on
top of the provided image so your code is available in the launched job.
For docker
based schedulers you'll need a local docker daemon to build and
push the image to your remote docker repository.
Arguments to schedulers can be specified either via a command line flag to
torchx run -s <scheduler> -c <args>
or on a per scheduler basis via a
.torchxconfig
file.
%%writefile .torchxconfig
[kubernetes]
queue=torchx
image_repo=<your docker image repository>
[slurm]
partition=torchx
TorchX supports a large number of schedulers. Don't see yours? Request it!
Remote schedulers operate the exact same way the local schedulers do. The same run command for local works out of the box on remote.
$ torchx run --scheduler slurm dist.ddp -j 2x2 --script dist_app.py
$ torchx run --scheduler kubernetes dist.ddp -j 2x2 --script dist_app.py
$ torchx run --scheduler aws_batch dist.ddp -j 2x2 --script dist_app.py
$ torchx run --scheduler ray dist.ddp -j 2x2 --script dist_app.py
Depending on the scheduler there may be a few extra configuration parameters so
TorchX knows where to run the job and upload built images. These can either be
set via -c
or in the .torchxconfig
file.
All config options:
torchx runopts
If you want more than the standard PyTorch libraries you can add custom Dockerfile or build your own docker container and use it as the base image for your TorchX jobs.
%%writefile timm_app.py
import timm
print(timm.models.resnet18())
%%writefile Dockerfile.torchx
FROM pytorch/pytorch:1.10.0-cuda11.3-cudnn8-runtime
RUN pip install timm
COPY . .
Once we have the Dockerfile created we can launch as normal and TorchX will automatically build the image with the newly provided Dockerfile instead of the default one.
torchx run --scheduler local_docker utils.python --script timm_app.py
The slurm
and local_cwd
use the current environment so you can use pip
and
conda
as normal.
- Checkout other features of the torchx CLI
- Take a look at the list of schedulers supported by the runner
- Browse through the collection of builtin components
- See which ML pipeline platforms you can run components on
- See a training app example