
Quickstart

This is a self-contained guide to writing a simple app and launching distributed jobs on local and remote clusters.

Installation

The first step is to install the TorchX Python package, which includes both the CLI and the library.

# install torchx with all dependencies
$ pip install "torchx[dev]"

See the README for more information on installation.

torchx --help

Hello World

Let's start by writing a simple "Hello World" Python app. This is just a normal Python program and can contain anything you'd like.

Note

This example uses the Jupyter Notebook `%%writefile` magic to create local files for example purposes. Under normal usage these would be standalone files.

%%writefile my_app.py

import sys

print(f"Hello, {sys.argv[1]}!")

Launching

We can execute our app via torchx run. The local_cwd scheduler executes the app relative to the current directory.

For this we'll use the utils.python component:

torchx run --scheduler local_cwd utils.python --help

The component takes the script name; any extra arguments are passed through to the script itself.

torchx run --scheduler local_cwd utils.python --script my_app.py "your name"
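To make the argument passing concrete, the sketch below approximates what the command above amounts to on the local_cwd scheduler: run the script with the current interpreter and forward the extra arguments. It is an illustration that uses only the standard library, not TorchX's implementation. `run_script` is a hypothetical helper, and the script is written to a temporary directory so the example is self-contained.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

# Same contents as my_app.py above.
APP_SOURCE = 'import sys\n\nprint(f"Hello, {sys.argv[1]}!")\n'


def run_script(script: Path, *args: str) -> str:
    """Run `script` with the current interpreter and return its stdout.

    Hypothetical helper: everything after the script path is forwarded
    unchanged and shows up in the script's sys.argv[1:].
    """
    result = subprocess.run(
        [sys.executable, str(script), *args],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


with tempfile.TemporaryDirectory() as tmp:
    script_path = Path(tmp) / "my_app.py"
    script_path.write_text(APP_SOURCE)
    # "your name" is passed as a single argument, just like the quoted CLI argument.
    output = run_script(script_path, "your name")

print(output, end="")
```

Quoting matters here in the same way as on the command line: `"your name"` arrives as one argument, so the script sees it as `sys.argv[1]`.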

We can run the exact same app via the local_docker scheduler. This scheduler packages up the local workspace as a layer on top of the specified image, which provides an environment very similar to the container-based remote schedulers.

Note

This requires Docker installed and won't work in environments such as Google Colab. See the Docker install instructions: [https://docs.docker.com/get-docker/](https://docs.docker.com/get-docker/)

torchx run --scheduler local_docker utils.python --script my_app.py "your name"

TorchX defaults to using the ghcr.io/pytorch/torchx Docker container image which contains the PyTorch libraries, TorchX and related dependencies.

Distributed

TorchX's dist.ddp component uses TorchElastic to manage the workers. This means you can launch multi-worker and multi-host jobs out of the box on all of the schedulers we support.

torchx run --scheduler local_docker dist.ddp --help

Let's create a slightly more interesting app to leverage the TorchX distributed support.

%%writefile dist_app.py

import torch
import torch.distributed as dist

dist.init_process_group(backend="gloo")
print(f"I am worker {dist.get_rank()} of {dist.get_world_size()}!")

a = torch.tensor([dist.get_rank()])
dist.all_reduce(a)
print(f"all_reduce output = {a}")

Let's launch a small job with 2 nodes and 2 worker processes per node:

torchx run --scheduler local_docker dist.ddp -j 2x2 --script dist_app.py
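For a sense of what to expect from this job, the sketch below works through the arithmetic in plain Python, without torch. It assumes `-j 2x2` means 2 nodes times 2 workers per node (as described above) and that `dist.all_reduce` uses its default sum reduction, so every worker ends up with the sum of all ranks. `parse_job_shape` and `simulate_all_reduce_sum` are hypothetical names for illustration only; they are not part of TorchX or PyTorch, and the parser handles only the simple `<nodes>x<workers>` form.

```python
from typing import List, Tuple


def parse_job_shape(spec: str) -> Tuple[int, int]:
    """Split a '<nodes>x<workers per node>' string such as '2x2'."""
    nodes, workers_per_node = spec.split("x")
    return int(nodes), int(workers_per_node)


def simulate_all_reduce_sum(world_size: int) -> List[int]:
    """Return the value each rank holds after a sum all_reduce of its own rank."""
    total = sum(range(world_size))  # every rank contributes its rank number
    return [total] * world_size     # all ranks receive the same reduced value


nodes, workers_per_node = parse_job_shape("2x2")
world_size = nodes * workers_per_node
results = simulate_all_reduce_sum(world_size)

print(f"world size = {world_size}")
print(f"all_reduce output on every rank = {results[0]}")
```

With four workers the ranks are 0 through 3, so each worker in the real job should report a tensor containing 6.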

Workspaces / Patching

Each scheduler has a concept of an image. For local_cwd and slurm, the image is the current working directory. For container-based schedulers such as local_docker, kubernetes and aws_batch, it is a Docker container image.

To provide the same environment between local and remote jobs, TorchX CLI uses workspaces to automatically patch images for remote jobs on a per scheduler basis.

When you launch a job via torchx run it'll overlay the current directory on top of the provided image so your code is available in the launched job.
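One way to picture the overlay is as a layered lookup: files from the current directory sit on top of the image's files and shadow any with the same path. The toy model below uses `collections.ChainMap` to show that behavior. The file names and contents are made up for illustration, and this is not how TorchX or Docker actually builds images.

```python
from collections import ChainMap

# Hypothetical files baked into the base image.
image_files = {
    "/app/my_app.py": "old version shipped in the image",
    "/app/requirements.txt": "torch",
}

# Hypothetical files in the current working directory (the workspace).
workspace_files = {
    "/app/my_app.py": "latest local edits",
    "/app/dist_app.py": "new local script",
}

# The first mapping wins on lookup, so the workspace shadows the image.
patched_image = ChainMap(workspace_files, image_files)

for path in sorted(patched_image):
    print(f"{path}: {patched_image[path]}")
```

Files that exist only in the image stay visible, while local edits replace the image's copy of the same path.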

For Docker-based schedulers you'll need a local Docker daemon to build the image and push it to your remote Docker repository.

`.torchxconfig`

Arguments to schedulers can be specified either via a command line flag, torchx run -s <scheduler> -c <args>, or on a per-scheduler basis via a .torchxconfig file.

%%writefile .torchxconfig

[kubernetes]
queue=torchx
image_repo=<your docker image repository>

[slurm]
partition=torchx
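The file uses INI-style sections, one per scheduler. As a rough illustration of that layout (not a description of how TorchX itself loads the file), the sketch below parses the same contents with Python's standard `configparser`. The `scheduler_options` helper is a hypothetical name.

```python
import configparser
from typing import Dict

# Same contents as the .torchxconfig written above.
TORCHXCONFIG = """\
[kubernetes]
queue=torchx
image_repo=<your docker image repository>

[slurm]
partition=torchx
"""


def scheduler_options(text: str, scheduler: str) -> Dict[str, str]:
    """Return the key/value pairs listed under one scheduler's section."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    if not parser.has_section(scheduler):
        return {}  # no section means no per-scheduler settings
    return dict(parser[scheduler])


print(scheduler_options(TORCHXCONFIG, "kubernetes"))
print(scheduler_options(TORCHXCONFIG, "slurm"))
```

Each section only affects the scheduler it is named after, so the slurm settings are not visible when launching on kubernetes and vice versa.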

Remote Schedulers

TorchX supports a large number of schedulers. Don't see yours? Request it!

Remote schedulers operate exactly the same way as the local schedulers. The same run command used locally works out of the box on remote schedulers.

$ torchx run --scheduler slurm dist.ddp -j 2x2 --script dist_app.py
$ torchx run --scheduler kubernetes dist.ddp -j 2x2 --script dist_app.py
$ torchx run --scheduler aws_batch dist.ddp -j 2x2 --script dist_app.py
$ torchx run --scheduler ray dist.ddp -j 2x2 --script dist_app.py

Depending on the scheduler, there may be a few extra configuration parameters so that TorchX knows where to run the job and where to upload built images. These can be set either via -c or in the .torchxconfig file.

All config options:

torchx runopts

Custom Images

Docker-based Schedulers

If you want more than the standard PyTorch libraries, you can add a custom Dockerfile or build your own Docker container and use it as the base image for your TorchX jobs.

%%writefile timm_app.py

import timm

print(timm.models.resnet18())

%%writefile Dockerfile.torchx

FROM pytorch/pytorch:1.10.0-cuda11.3-cudnn8-runtime

RUN pip install timm

COPY . .

Once the Dockerfile is created, we can launch as normal, and TorchX will automatically build the image with the newly provided Dockerfile instead of the default one.

torchx run --scheduler local_docker utils.python --script timm_app.py

Slurm

The slurm and local_cwd schedulers use the current environment, so you can use pip and conda as normal.
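Because these schedulers run in whatever environment is currently active, it can help to confirm that a dependency is importable before launching. Here is a minimal sketch using the standard library's `importlib.util.find_spec`. The `is_installed` helper is a hypothetical name, and `timm` is checked only because the example app above imports it. The check reflects the environment where you run it, which may differ from the one on a remote cluster.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be found in the current environment."""
    return importlib.util.find_spec(module_name) is not None


for name in ("timm", "torch"):
    status = "available" if is_installed(name) else "missing (install with pip or conda)"
    print(f"{name}: {status}")
```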

Next Steps

Check out other features of the torchx CLI
Take a look at the list of schedulers supported by the runner
Browse through the collection of builtin components
See which ML pipeline platforms you can run components on
See a training app example
