# Quickstart

This is a self-contained guide to writing a simple app and launching
distributed jobs on local and remote clusters.

## Installation

First, install the TorchX Python package, which includes the CLI and the
library.

```sh
# install torchx with all dependencies
$ pip install "torchx[dev]"
```

See the [README](https://github.com/pytorch/torchx) for more
information on installation.

In [1]:
%%sh
torchx --help

usage: torchx [-h] [--log_level LOG_LEVEL] [--version]
              {builtins,cancel,configure,describe,list,log,run,runopts,status,tracker}
              ...

torchx CLI

options:
  -h, --help            show this help message and exit
  --log_level LOG_LEVEL
                        Python logging log level
  --version             show program's version number and exit

sub-commands:
  Use the following commands to run operations, e.g.: torchx run ${JOB_NAME}

  {builtins,cancel,configure,describe,list,log,run,runopts,status,tracker}

## Hello World

Let's start by writing a simple "Hello World" Python app. This is just a
normal Python program and can contain anything you'd like.

<div class="admonition note">
<div class="admonition-title">Note</div>
This example uses Jupyter Notebook `%%writefile` to create local files for
example purposes. Under normal usage you would have these as standalone files.
</div>

In [2]:
%%writefile my_app.py

import sys

print(f"Hello, {sys.argv[1]}!")

Overwriting my_app.py
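
Note that `my_app.py` as written raises an `IndexError` when it is launched without an argument. A slightly more defensive variant might look like the sketch below. The `greeting` helper and the `"World"` fallback are illustrative choices, not part of the original example.

```python
import sys


def greeting(argv):
    """Build the greeting from an argv-style list (argv[0] is the script name)."""
    # Fall back to a default name when no argument was passed. This fallback is
    # an assumption of this sketch; the original script requires one argument.
    name = argv[1] if len(argv) > 1 else "World"
    return f"Hello, {name}!"


if __name__ == "__main__":
    print(greeting(sys.argv))
```

Keeping the logic in a function also makes the app easy to check without launching a job.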


## Launching

We can execute our app via `torchx run`. The
`local_cwd` scheduler executes the app relative to the current directory.

For this we'll use the `utils.python` component:

In [3]:
%%sh
torchx run --scheduler local_cwd utils.python --help

usage: torchx run <run args...> python  [--help] [-m str] [-c str]
                                        [--script str] [--image str]
                                        [--name str] [--cpu int] [--gpu int]
                                        [--memMB int] [-h str]
                                        [--num_replicas int]
                                        ...

Runs ``python`` with the specified module, command or script on the specified
image and host. Use ``--`` to separate component args and program args
(e.g. ``torchx run utils.python --m foo.main -- --args to --main``)

Note: (cpu, gpu, memMB) parameters are mutually exclusive with ``h`` (named resource) where
      ``h`` takes precedence if specified for setting resource requirements.
      See `registering named resources <https://pytorch.org/torchx/latest/advanced.html#registering-named-resources>`_.

positional arguments:
  str                 arguments passed to the program in sys.argv[1:] (ignored
                      with `--c`) (required)

options:
  --help              show this help message and exit
  -m str, --m str     run library module as a script (default: None)
  -c str, --c str     program passed as string (may error if scheduler has a
                      length limit on args) (default: None)
  --script str        .py script to run (default: None)
  --image str         image to run on (default:
                      ghcr.io/pytorch/torchx:0.8.0dev0)
  --name str          name of the job (default: torchx_utils_python)
  --cpu int           number of cpus per replica (default: 1)
  --gpu int           number of gpus per replica (default: 0)
  --memMB int         cpu memory in MB per replica (default: 1024)
  -h str, --h str     a registered named resource (if specified takes
                      precedence over cpu, gpu, memMB) (default: None)
  --num_replicas int  number of copies to run (each on its own container)
                      (default: 1)


The component takes the script name; any extra arguments are passed through to
the script itself.

In [4]:
%%sh
torchx run --scheduler local_cwd utils.python --script my_app.py "your name"

torchx 2025-05-05 21:53:46 INFO     Tracker configurations: {}
torchx 2025-05-05 21:53:46 INFO     Log directory not set in scheduler cfg. Creating a temporary log dir that will be deleted on exit. To preserve log directory set the `log_dir` cfg option
torchx 2025-05-05 21:53:46 INFO     Log directory is: /tmp/torchx_fmx1cnxh
torchx 2025-05-05 21:53:46 INFO     Waiting for the app to finish...
python/0 Hello, your name!
torchx 2025-05-05 21:53:47 INFO     Job finished: SUCCEEDED
local_cwd://torchx/torchx_utils_python-jvbhlz5zq4nz0c
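
The same launch command can also be assembled programmatically. The sketch below builds the argument list for `torchx run` and shows where the `--` separator mentioned in the help text goes. `build_run_command` is a hypothetical helper written for this guide, not a TorchX API, and it only constructs the command without running it.

```python
import shlex


def build_run_command(scheduler, component, component_args, program_args):
    """Return the argv list for `torchx run` (hypothetical helper, not a TorchX API)."""
    cmd = ["torchx", "run", "--scheduler", scheduler, component]
    cmd += list(component_args)
    if program_args:
        # `--` separates component args from program args, as the help text describes.
        cmd.append("--")
        cmd += list(program_args)
    return cmd


cmd = build_run_command(
    scheduler="local_cwd",
    component="utils.python",
    component_args=["--script", "my_app.py"],
    program_args=["your name"],
)
# shlex.join quotes arguments containing spaces so the line can be pasted into a shell.
print(shlex.join(cmd))
```

The resulting list can be handed to `subprocess.run(cmd, check=True)`. The explicit `--` matters mostly when the program's own arguments look like flags; the command in the cell above omits it.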


We can run the exact same app via the `local_docker` scheduler. This scheduler
packages the local workspace as a layer on top of the specified image, which
provides an environment very similar to that of the container-based remote
schedulers.

<div class="admonition note">
<div class="admonition-title">Note</div>
This requires Docker to be installed and won't work in environments such as
Google Colab. See the Docker install instructions:
[https://docs.docker.com/get-docker/](https://docs.docker.com/get-docker/)
</div>

In [5]:
%%sh
torchx run --scheduler local_docker utils.python --script my_app.py "your name"

torchx 2025-05-05 21:53:48 INFO     Tracker configurations: {}
torchx 2025-05-05 21:53:48 INFO     Checking for changes in workspace `file:///home/runner/work/torchx/torchx/docs/source`...
torchx 2025-05-05 21:53:48 INFO     To disable workspaces pass: --workspace="" from CLI or workspace=None programmatically.
torchx 2025-05-05 21:53:48 INFO     Workspace `file:///home/runner/work/torchx/torchx/docs/source` resolved to filesystem path `/home/runner/work/torchx/torchx/docs/source`
torchx 2025-05-05 21:53:48 INFO     Building workspace docker image (this may take a while)...
torchx 2025-05-05 21:53:48 INFO     Step 1/4 : ARG IMAGE
torchx 2025-05-05 21:53:48 INFO     Step 2/4 : FROM $IMAGE
torchx 2025-05-05 21:53:48 INFO      ---> 61d02cc06a11
torchx 2025-05-05 21:53:48 INFO     Step 3/4 : COPY . .
torchx 2025-05-05 21:53:56 INFO      ---> e68974b8ac52
torchx 2025-05-05 21:53:56 INFO     Step 4/4 : LABEL torchx.pytorch.org/version=0.8.0dev0
torchx 2025-05-05 21:53:56 INFO      ---> Running in 32a864d12138
torchx 2025-05-05 21:54:04 INFO      ---> Removed intermediate container 32a864d12138
torchx 2025-05-05 21:54:04 INFO      ---> b65e613b6c57
torchx 2025-05-05 21:54:04 INFO     Successfully built b65e613b6c57
torchx 2025-05-05 21:54:04 INFO     Built new image `sha256:b65e613b6c57714003a679200d211fc6fffee16fcf5ade3079d61f565b941aaf` based on original image `ghcr.io/pytorch/torchx:0.8.0dev0` and changes in workspace `file:///home/runner/work/torchx/torchx/docs/source` for role[0]=python.
torchx 2025-05-05 21:54:04 INFO     Waiting for the app to finish...
python/0 Hello, your name!
torchx 2025-05-05 21:54:05 INFO     Job finished: SUCCEEDED
local_docker://torchx/torchx_utils_python-ztd9qbbbngs3z


TorchX defaults to using the
[ghcr.io/pytorch/torchx](https://ghcr.io/pytorch/torchx) Docker container image
which contains the PyTorch libraries, TorchX and related dependencies.

## Distributed

TorchX's `dist.ddp` component uses
[TorchElastic](https://pytorch.org/docs/stable/distributed.elastic.html)
to manage the workers. This means you can launch multi-worker and multi-host
jobs out of the box on all of the schedulers we support.

In [6]:
%%sh
torchx run --scheduler local_docker dist.ddp --help

usage: torchx run <run args...> ddp  [--help] [--script str] [-m str]
                                     [--image str] [--name str] [-h str]
                                     [--cpu int] [--gpu int] [--memMB int]
                                     [-j str] [--env str] [--max_retries int]
                                     [--rdzv_port int] [--rdzv_backend str]
                                     [--mounts str] [--debug str] [--tee int]
                                     ...

Distributed data parallel style application (one role, multi-replica).
Uses `torch.distributed.run <https://pytorch.org/docs/stable/distributed.elastic.html>`_
to launch and coordinate PyTorch worker processes. Defaults to using ``c10d`` rendezvous backend
on rendezvous_endpoint ``$rank_0_host:$rdzv_port``. Note that ``rdzv_port`` parameter is ignored
when running on single node, and instead we use port 0 which instructs torchelastic to chose
a free random port on the host.

Note: (cpu, gpu, memMB) parameters are mutually exclusive with ``h`` (named resource) where
      ``h`` takes precedence if specified for setting resource requirements.
      See `registering named resources <https://pytorch.org/torchx/latest/advanced.html#registering-named-resources>`_.

positional arguments:
  str                 arguments to the main module (required)

options:
  --help              show this help message and exit
  --script str        script or binary to run within the image (default: None)
  -m str, --m str     the python module path to run (default: None)
  --image str         image (e.g. docker) (default:
                      ghcr.io/pytorch/torchx:0.8.0dev0)
  --name str          job name override in the following format:
                      ``{experimentname}/{runname}`` or ``{experimentname}/``
                      or ``/{runname}`` or ``{runname}``. Uses the script or
                      module name if ``{runname}`` not specified. (default: /)
  -h str, --h str     a registered named resource (if specified takes
                      precedence over cpu, gpu, memMB) (default: None)
  --cpu int           number of cpus per replica (default: 2)
  --gpu int           number of gpus per replica (default: 0)
  --memMB int         cpu memory in MB per replica (default: 1024)
  -j str, --j str     [{min_nnodes}:]{nnodes}x{nproc_per_node}, for gpu hosts,
                      nproc_per_node must not exceed num gpus (default: 1x2)
  --env str           environment varibles to be passed to the run (e.g.
                      ENV1=v1,ENV2=v2,ENV3=v3) (default: None)
  --max_retries int   the number of scheduler retries allowed (default: 0)
  --rdzv_port int     the port on rank0's host to use for hosting the c10d
                      store used for rendezvous. Only takes effect when
                      running multi-node. When running single node, this
                      parameter is ignored and a random free port is chosen.
                      (default: 29500)
  --rdzv_backend str  the rendezvous backend to use. Only takes effect when
                      running multi-node. (default: c10d)
  --mounts str        mounts to mount into the worker environment/container
                      (ex. type=<bind/volume>,src=/host,dst=/job[,readonly]).
                      See scheduler documentation for more info. (default:
                      None)
  --debug str         whether to run with preset debug flags enabled (default:
                      False)
  --tee int           tees the specified std stream(s) to console + file. 0:
                      none, 1: stdout, 2: stderr, 3: both (default: 3)
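
The `-j` value in the help text follows the pattern `[{min_nnodes}:]{nnodes}x{nproc_per_node}`. As a rough illustration of how such a string breaks down, the sketch below parses it with a regular expression. `parse_j` and `JobShape` are hypothetical names for this guide, not TorchX's own parser, and treating a missing minimum as "min equals max" is an assumption of this sketch.

```python
import re
from typing import NamedTuple


class JobShape(NamedTuple):
    min_nnodes: int
    max_nnodes: int
    nproc_per_node: int


def parse_j(spec):
    """Parse `[{min_nnodes}:]{nnodes}x{nproc_per_node}` (illustrative only)."""
    match = re.fullmatch(r"(?:(\d+):)?(\d+)x(\d+)", spec)
    if match is None:
        raise ValueError(f"invalid -j value: {spec!r}")
    min_nnodes, nnodes, nproc = match.groups()
    max_nnodes = int(nnodes)
    # Assumption: without an explicit minimum, min and max node counts are equal.
    return JobShape(int(min_nnodes) if min_nnodes else max_nnodes, max_nnodes, int(nproc))


shape = parse_j("2x2")
print(shape, "-> total workers:", shape.max_nnodes * shape.nproc_per_node)
```

For example, `2x2` describes 2 nodes with 2 processes each (4 workers in total), while `1:2x4` would describe between 1 and 2 nodes with 4 processes per node.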


Let's create a slightly more interesting app that takes advantage of TorchX's
distributed support.

In [7]:
%%writefile dist_app.py

import torch
import torch.distributed as dist

dist.init_process_group(backend="gloo")
print(f"I am worker {dist.get_rank()} of {dist.get_world_size()}!")

a = torch.tensor([dist.get_rank()])
dist.all_reduce(a)
print(f"all_reduce output = {a}")

Writing dist_app.py


Let's launch a small job with 2 nodes and 2 worker processes per node:

In [8]:
%%sh
torchx run --scheduler local_docker dist.ddp -j 2x2 --script dist_app.py

torchx 2025-05-05 21:54:07 INFO     Tracker configurations: {}
torchx 2025-05-05 21:54:07 INFO     Checking for changes in workspace `file:///home/runner/work/torchx/torchx/docs/source`...
torchx 2025-05-05 21:54:07 INFO     To disable workspaces pass: --workspace="" from CLI or workspace=None programmatically.
torchx 2025-05-05 21:54:07 INFO     Workspace `file:///home/runner/work/torchx/torchx/docs/source` resolved to filesystem path `/home/runner/work/torchx/torchx/docs/source`
torchx 2025-05-05 21:54:08 INFO     Building workspace docker image (this may take a while)...
torchx 2025-05-05 21:54:08 INFO     Step 1/4 : ARG IMAGE
torchx 2025-05-05 21:54:08 INFO     Step 2/4 : FROM $IMAGE
torchx 2025-05-05 21:54:08 INFO      ---> 61d02cc06a11
torchx 2025-05-05 21:54:08 INFO     Step 3/4 : COPY . .
torchx 2025-05-05 21:54:15 INFO      ---> 573d1910d550
torchx 2025-05-05 21:54:15 INFO     Step 4/4 : LABEL torchx.pytorch.org/version=0.8.0dev0
torchx 2025-05-05 21:54:15 INFO      ---> Running in dc3f4c6378d8
torchx 2025-05-05 21:54:23 INFO      ---> Removed intermediate container dc3f4c6378d8
torchx 2025-05-05 21:54:23 INFO      ---> bb41efcfc05f
 One or more build-args [WORKSPACE] were not consumed
torchx 2025-05-05 21:54:23 INFO     Successfully built bb41efcfc05f
torchx 2025-05-05 21:54:23 INFO     Built new image `sha256:bb41efcfc05f7e8a73b68736117182206c7276568af4e2be17235b55014158d6` based on original image `ghcr.io/pytorch/torchx:0.8.0dev0` and changes in workspace `file:///home/runner/work/torchx/torchx/docs/source` for role[0]=dist_app.
torchx 2025-05-05 21:54:23 INFO     Waiting for the app to finish...
dist_app/1 W0505 21:54:26.225000 1 site-packages/torch/distributed/run.py:766] 
dist_app/0 W0505 21:54:26.225000 1 site-packages/torch/distributed/run.py:766] 
dist_app/1 W0505 21:54:26.225000 1 site-packages/torch/distributed/run.py:766] *****************************************
dist_app/0 W0505 21:54:26.225000 1 site-packages/torch/distributed/run.py:766] *****************************************
dist_app/1 W0505 21:54:26.225000 1 site-packages/torch/distributed/run.py:766] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
dist_app/0 W0505 21:54:26.225000 1 site-packages/torch/distributed/run.py:766] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
dist_app/0 W0505 21:54:26.225000 1 site-packages/torch/distributed/run.py:766] *****************************************
dist_app/1 W0505 21:54:26.225000 1 site-packages/torch/distributed/run.py:766] *****************************************
dist_app/1 [0]:I am worker 2 of 4!
dist_app/1 [1]:I am worker 3 of 4!
dist_app/1 [1]:all_reduce output = tensor([6])
dist_app/1 [0]:all_reduce output = tensor([6])
dist_app/0 [0]:I am worker 0 of 4!
dist_app/0 [1]:I am worker 1 of 4!
dist_app/0 [1]:all_reduce output = tensor([6])
dist_app/0 [0]:all_reduce output = tensor([6])
torchx 2025-05-05 21:54:30 INFO     Job finished: SUCCEEDED
local_docker://torchx/dist_app-wbctc6k5rr6j4
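
The `tensor([6])` printed by every worker follows from the job shape: `-j 2x2` gives 4 workers with ranks 0 to 3, and `dist.all_reduce` sums across workers by default. The plain-Python sketch below reproduces that arithmetic without PyTorch. It simulates the reduction and does not perform any real collective communication.

```python
nnodes, nproc_per_node = 2, 2  # from `-j 2x2`
world_size = nnodes * nproc_per_node

# Each worker contributes a one-element tensor holding its own rank.
contributions = [[rank] for rank in range(world_size)]

# all_reduce with the default SUM op leaves every rank holding the element-wise sum.
reduced = [sum(values) for values in zip(*contributions)]

print(f"world_size = {world_size}, all_reduce output = {reduced}")
```

With ranks 0, 1, 2 and 3 the sum is 6, which matches the value every worker reports in the log above.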


## Workspaces / Patching

Each scheduler has a concept of an `image`. For `local_cwd` and `slurm`, the
image is the current working directory. For container-based schedulers such as
`local_docker`, `kubernetes` and `aws_batch`, it is a Docker container image.

To provide the same environment between local and remote jobs, TorchX CLI uses
workspaces to automatically patch images for remote jobs on a per scheduler
basis.

When you launch a job via `torchx run` it'll overlay the current directory on
top of the provided image so your code is available in the launched job.

For Docker-based schedulers, you'll need a local Docker daemon to build the
image and push it to your remote Docker repository.
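
To make the overlay concrete: the `local_docker` build logs earlier in this guide show four build steps. The sketch below reassembles them into Dockerfile text. It is reconstructed from that log output rather than copied from TorchX's source, so the file TorchX actually builds from may differ. Passing the base image through `--build-arg` is likewise an assumption based on the `ARG IMAGE` step.

```python
BASE_IMAGE = "ghcr.io/pytorch/torchx:0.8.0dev0"  # default image shown in the help output
VERSION = "0.8.0dev0"  # taken from the LABEL step in the build log

# Steps 1/4 to 4/4 as they appear in the build log.
dockerfile = "\n".join(
    [
        "ARG IMAGE",    # the base image is supplied as a build argument
        "FROM $IMAGE",  # start from the provided image
        "COPY . .",     # overlay the current directory (the workspace)
        f"LABEL torchx.pytorch.org/version={VERSION}",
    ]
)

# Assumption: the base image reaches the build as a build argument named IMAGE.
print(f"# built with --build-arg IMAGE={BASE_IMAGE}")
print(dockerfile)
```

The `COPY . .` step is what makes local code changes visible inside the launched job without rebuilding the base image.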

## `.torchxconfig`

Arguments to schedulers can be specified either on the command line, via
`torchx run -s <scheduler> -c <args>`, or per scheduler in a `.torchxconfig`
file.

In [9]:
%%writefile .torchxconfig

[kubernetes]
queue=torchx
image_repo=<your docker image repository>

[slurm]
partition=torchx

Writing .torchxconfig
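
The file above is INI-style, so its contents can be inspected with Python's standard `configparser`. The sketch below is a way to check what a given file defines, assuming plain `key=value` entries as in the example. It is not necessarily how TorchX itself loads the file, and `example.com/your/container` is a placeholder repository.

```python
import configparser

# Same layout as the example .torchxconfig, with a placeholder image repository.
TORCHXCONFIG = """
[kubernetes]
queue=torchx
image_repo=example.com/your/container

[slurm]
partition=torchx
"""

parser = configparser.ConfigParser()
parser.read_string(TORCHXCONFIG)

for scheduler in parser.sections():
    # Each section name is a scheduler; each entry is one of its run options.
    print(scheduler, dict(parser[scheduler]))
```

To inspect a file on disk instead, `parser.read(".torchxconfig")` can replace the `read_string` call.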


## Remote Schedulers

TorchX supports a large number of schedulers.
Don't see yours?
[Request it!](https://github.com/pytorch/torchx/issues/new?assignees=&labels=&template=feature-request.md)

Remote schedulers operate exactly the same way as the local schedulers. The
same run command that works locally works out of the box remotely.

```sh
$ torchx run --scheduler slurm dist.ddp -j 2x2 --script dist_app.py
$ torchx run --scheduler kubernetes dist.ddp -j 2x2 --script dist_app.py
$ torchx run --scheduler aws_batch dist.ddp -j 2x2 --script dist_app.py
$ torchx run --scheduler ray dist.ddp -j 2x2 --script dist_app.py
```

Depending on the scheduler there may be a few extra configuration parameters so
TorchX knows where to run the job and upload built images. These can either be
set via `-c` or in the `.torchxconfig` file.
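
The `usage:` lines printed by `torchx runopts` suggest that scheduler options are passed as comma-separated `key=value` pairs, with required keys shown without brackets. The sketch below formats such a string and checks the required keys first. `format_scheduler_args` and the `REQUIRED` table are hypothetical helpers for illustration; the required keys are the ones reported in the `runopts` output in this guide.

```python
# Required options per scheduler, as listed under "required arguments" in `torchx runopts`.
REQUIRED = {
    "kubernetes": ["queue"],
    "aws_batch": ["queue"],
    "aws_sagemaker": ["role", "instance_type"],
}


def format_scheduler_args(scheduler, options):
    """Return a `key=value,key=value` string (hypothetical helper, not a TorchX API)."""
    missing = [key for key in REQUIRED.get(scheduler, []) if key not in options]
    if missing:
        raise ValueError(f"{scheduler} is missing required options: {missing}")
    return ",".join(f"{key}={value}" for key, value in options.items())


cfg = format_scheduler_args(
    "kubernetes",
    {"queue": "torchx", "image_repo": "example.com/your/container"},
)
print(cfg)
```

Checking required options before launching avoids a round trip to the scheduler for a configuration that is certain to be rejected.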

All config options:

In [10]:
%%sh
torchx runopts

local_docker:
    usage:
        [copy_env=COPY_ENV],[env=ENV],[privileged=PRIVILEGED],[image_repo=IMAGE_REPO],[quiet=QUIET]

    optional arguments:
        copy_env=COPY_ENV (typing.List[str], None)
            list of glob patterns of environment variables to copy if not set in AppDef. Ex: FOO_*
        env=ENV (typing.Dict[str, str], None)
            environment variables to be passed to the run. The separator sign can be eiher comma or semicolon
            (e.g. ENV1:v1,ENV2:v2,ENV3:v3 or ENV1:V1;ENV2:V2). Environment variables from env will be applied on top
            of the ones from copy_env
        privileged=PRIVILEGED (bool, False)
            If true runs the container with elevated permissions. Equivalent to running with `docker run --privileged`.
        image_repo=IMAGE_REPO (str, None)
            (remote jobs) the image repository to use when pushing patched images, must have push access. Ex: example.com/your/container
        quiet=QUIET (bool, False)
            whether to suppress verbose output for image building. Defaults to ``False``.

local_cwd:
    usage:
        [log_dir=LOG_DIR],[prepend_cwd=PREPEND_CWD],[auto_set_cuda_visible_devices=AUTO_SET_CUDA_VISIBLE_DEVICES]

    optional arguments:
        log_dir=LOG_DIR (str, None)
            dir to write stdout/stderr log files of replicas
        prepend_cwd=PREPEND_CWD (bool, False)
            if set, prepends CWD to replica's PATH env var making any binaries in CWD take precedence over those in PATH
        auto_set_cuda_visible_devices=AUTO_SET_CUDA_VISIBLE_DEVICES (bool, False)
            sets the `CUDA_AVAILABLE_DEVICES` for roles that request GPU resources. Each role replica will be assigned one GPU. Does nothing if the device count is less than replicas.

slurm:
    usage:
        [partition=PARTITION],[time=TIME],[comment=COMMENT],[constraint=CONSTRAINT],[mail-user=MAIL-USER],[mail-type=MAIL-TYPE],[job_dir=JOB_DIR]

    optional arguments:
        partition=PARTITION (str, None)
            The partition to run the job in.
        time=TIME (str, None)
            The maximum time the job is allowed to run for. Formats:             "minutes", "minutes:seconds", "hours:minutes:seconds", "days-hours",             "days-hours:minutes" or "days-hours:minutes:seconds"
        comment=COMMENT (str, None)
            Comment to set on the slurm job.
        constraint=CONSTRAINT (str, None)
            Constraint to use for the slurm job.
        mail-user=MAIL-USER (str, None)
            User to mail on job end.
        mail-type=MAIL-TYPE (str, None)
            What events to mail users on.
        job_dir=JOB_DIR (str, None)
            The directory to place the job code and outputs. The
            directory must not exist and will be created. To enable log
            iteration, jobs will be tracked in ``.torchxslurmjobdirs``.

kubernetes:
    usage:
        queue=QUEUE,[namespace=NAMESPACE],[service_account=SERVICE_ACCOUNT],[priority_class=PRIORITY_CLASS],[image_repo=IMAGE_REPO],[quiet=QUIET]

    required arguments:
        queue=QUEUE (str)
            Volcano queue to schedule job in

    optional arguments:
        namespace=NAMESPACE (str, default)
            Kubernetes namespace to schedule job in
        service_account=SERVICE_ACCOUNT (str, None)
            The service account name to set on the pod specs
        priority_class=PRIORITY_CLASS (str, None)
            The name of the PriorityClass to set on the job specs
        image_repo=IMAGE_REPO (str, None)
            (remote jobs) the image repository to use when pushing patched images, must have push access. Ex: example.com/your/container
        quiet=QUIET (bool, False)
            whether to suppress verbose output for image building. Defaults to ``False``.

kubernetes_mcad:
    usage:
        [namespace=NAMESPACE],[image_repo=IMAGE_REPO],[service_account=SERVICE_ACCOUNT],[priority=PRIORITY],[priority_class_name=PRIORITY_CLASS_NAME],[image_secret=IMAGE_SECRET],[coscheduler_name=COSCHEDULER_NAME],[network=NETWORK]

    optional arguments:
        namespace=NAMESPACE (str, default)
            Kubernetes namespace to schedule job in
        image_repo=IMAGE_REPO (str, None)
            The image repository to use when pushing patched images, must have push access. Ex: example.com/your/container
        service_account=SERVICE_ACCOUNT (str, None)
            The service account name to set on the pod specs
        priority=PRIORITY (int, None)
            The priority level to set on the job specs. Higher integer value means higher priority
        priority_class_name=PRIORITY_CLASS_NAME (str, None)
            Pod specific priority level. Check with your Kubernetes cluster admin if Priority classes are defined on your system
        image_secret=IMAGE_SECRET (str, None)
            The name of the Kubernetes/OpenShift secret set up for private images
        coscheduler_name=COSCHEDULER_NAME (str, None)
            Option to run TorchX-MCAD with a co-scheduler. User must provide the co-scheduler name.
        network=NETWORK (str, None)
            Name of additional pod-to-pod network beyond default Kubernetes network

aws_batch:
    usage:
        queue=QUEUE,[user=USER],[privileged=PRIVILEGED],[share_id=SHARE_ID],[priority=PRIORITY],[job_role_arn=JOB_ROLE_ARN],[execution_role_arn=EXECUTION_ROLE_ARN],[image_repo=IMAGE_REPO],[quiet=QUIET]

    required arguments:
        queue=QUEUE (str)
            queue to schedule job in

    optional arguments:
        user=USER (str, runner)
            The username to tag the job with. `getpass.getuser()` if not specified.
        privileged=PRIVILEGED (bool, False)
            If true runs the container with elevated permissions. Equivalent to running with `docker run --privileged`.
        share_id=SHARE_ID (str, None)
            The share identifier for the job. This must be set if and only if the job queue has a scheduling policy.
        priority=PRIORITY (int, 0)
            The scheduling priority for the job within the context of share_id. Higher number (between 0 and 9999) means higher priority. This will only take effect if the job queue has a scheduling policy.
        job_role_arn=JOB_ROLE_ARN (str, None)
            The Amazon Resource Name (ARN) of the IAM role that the container can assume for AWS permissions.
        execution_role_arn=EXECUTION_ROLE_ARN (str, None)
            The Amazon Resource Name (ARN) of the IAM role that the ECS agent can assume for AWS permissions.
        image_repo=IMAGE_REPO (str, None)
            (remote jobs) the image repository to use when pushing patched images, must have push access. Ex: example.com/your/container
        quiet=QUIET (bool, False)
            whether to suppress verbose output for image building. Defaults to ``False``.

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/runner/.config/sagemaker/config.yaml

aws_sagemaker:
    usage:
        role=ROLE,instance_type=INSTANCE_TYPE,[instance_count=INSTANCE_COUNT],[user=USER],[keep_alive_period_in_seconds=KEEP_ALIVE_PERIOD_IN_SECONDS],[volume_size=VOLUME_SIZE],[volume_kms_key=VOLUME_KMS_KEY],[max_run=MAX_RUN],[input_mode=INPUT_MODE],[output_path=OUTPUT_PATH],[output_kms_key=OUTPUT_KMS_KEY],[base_job_name=BASE_JOB_NAME],[tags=TAGS],[subnets=SUBNETS],[security_group_ids=SECURITY_GROUP_IDS],[model_uri=MODEL_URI],[model_channel_name=MODEL_CHANNEL_NAME],[metric_definitions=METRIC_DEFINITIONS],[encrypt_inter_container_traffic=ENCRYPT_INTER_CONTAINER_TRAFFIC],[use_spot_instances=USE_SPOT_INSTANCES],[max_wait=MAX_WAIT],[checkpoint_s3_uri=CHECKPOINT_S3_URI],[checkpoint_local_path=CHECKPOINT_LOCAL_PATH],[debugger_hook_config=DEBUGGER_HOOK_CONFIG],[enable_sagemaker_metrics=ENABLE_SAGEMAKER_METRICS],[enable_network_isolation=ENABLE_NETWORK_ISOLATION],[disable_profiler=DISABLE_PROFILER],[environment=ENVIRONMENT],[max_retry_attempts=MAX_RETRY_ATTEMPTS],[source_dir=SOURCE_DIR],[git_config=GIT_CONFIG],[hyperparameters=HYPERPARAMETERS],[container_log_level=CONTAINER_LOG_LEVEL],[code_location=CODE_LOCATION],[dependencies=DEPENDENCIES],[training_repository_access_mode=TRAINING_REPOSITORY_ACCESS_MODE],[training_repository_credentials_provider_arn=TRAINING_REPOSITORY_CREDENTIALS_PROVIDER_ARN],[disable_output_compression=DISABLE_OUTPUT_COMPRESSION],[enable_infra_check=ENABLE_INFRA_CHECK],[image_repo=IMAGE_REPO],[quiet=QUIET]

    required arguments:
        role=ROLE (str)
            an AWS IAM role (either name or full ARN). The Amazon SageMaker training jobs and APIs that create Amazon SageMaker endpoints use this role to access training data and model artifacts. After the endpoint is created, the inference code might use the IAM role, if it needs to access an AWS resource.
        instance_type=INSTANCE_TYPE (str)
            type of EC2 instance to use for training, for example, 'ml.c4.xlarge'

    optional arguments:
        instance_count=INSTANCE_COUNT (int, 1)
            number of Amazon EC2 instances to use for training. Required if instance_groups is not set.
        user=USER (str, runner)
            the username to tag the job with. `getpass.getuser()` if not specified.
        keep_alive_period_in_seconds=KEEP_ALIVE_PERIOD_IN_SECONDS (int, None)
            the duration of time in seconds to retain configured resources in a warm pool for subsequent training jobs.
        volume_size=VOLUME_SIZE (int, None)
            size in GB of the storage volume to use for storing input and output data during training (default: 30).
        volume_kms_key=VOLUME_KMS_KEY (str, None)
            KMS key ID for encrypting EBS volume attached to the training instance.
        max_run=MAX_RUN (int, None)
            timeout in seconds for training (default: 24 * 60 * 60).
        input_mode=INPUT_MODE (str, None)
            the input mode that the algorithm supports (default: ‘File’).
        output_path=OUTPUT_PATH (str, None)
            S3 location for saving the training result (model artifacts and output files). If not specified, results are stored to a default bucket. If the bucket with the specific name does not exist, the estimator creates the bucket during the fit() method execution.
        output_kms_key=OUTPUT_KMS_KEY (str, None)
            KMS key ID for encrypting the training output (default: Your IAM role’s KMS key for Amazon S3).
        base_job_name=BASE_JOB_NAME (str, None)
            prefix for training job name when the fit() method launches. If not specified, the estimator generates a default job name based on the training image name and current timestamp.
        tags=TAGS (typing.List[typing.Dict[str, str]], None)
            list of tags for labeling a training job.
        subnets=SUBNETS (typing.List[str], None)
            list of subnet ids. If not specified training job will be created without VPC config.
        security_group_ids=SECURITY_GROUP_IDS (typing.List[str], None)
            list of security group ids. If not specified training job will be created without VPC config.
        model_uri=MODEL_URI (str, None)
            URI where a pre-trained model is stored, either locally or in S3.
        model_channel_name=MODEL_CHANNEL_NAME (str, None)
            name of the channel where ‘model_uri’ will be downloaded (default: ‘model’).
        metric_definitions=METRIC_DEFINITIONS (typing.List[typing.Dict[str, str]], None)
            list of dictionaries that defines the metric(s) used to evaluate the training jobs. Each dictionary contains two keys: ‘Name’ for the name of the metric, and ‘Regex’ for the regular expression used to extract the metric from the logs.
        encrypt_inter_container_traffic=ENCRYPT_INTER_CONTAINER_TRAFFIC (bool, None)
            specifies whether traffic between training containers is encrypted for the training job (default: False).
        use_spot_instances=USE_SPOT_INSTANCES (bool, None)
            specifies whether to use SageMaker Managed Spot instances for training. If enabled then the max_wait arg should also be set.
        max_wait=MAX_WAIT (int, None)
            timeout in seconds waiting for spot training job.
        checkpoint_s3_uri=CHECKPOINT_S3_URI (str, None)
            S3 URI in which to persist checkpoints that the algorithm persists (if any) during training.
        checkpoint_local_path=CHECKPOINT_LOCAL_PATH (str, None)
            local path that the algorithm writes its checkpoints to.
        debugger_hook_config=DEBUGGER_HOOK_CONFIG (bool, None)
            configuration for how debugging information is emitted with SageMaker Debugger. If not specified, a default one is created using the estimator’s output_path, unless the region does not support SageMaker Debugger. To disable SageMaker Debugger, set this parameter to False.
        enable_sagemaker_metrics=ENABLE_SAGEMAKER_METRICS (bool, None)
            enable SageMaker Metrics Time Series.
        enable_network_isolation=ENABLE_NETWORK_ISOLATION (bool, None)
            specifies whether container will run in network isolation mode (default: False).
        disable_profiler=DISABLE_PROFILER (bool, None)
            specifies whether Debugger monitoring and profiling will be disabled (default: False).
        environment=ENVIRONMENT (typing.Dict[str, str], None)
            environment variables to be set for use during training job
        max_retry_attempts=MAX_RETRY_ATTEMPTS (int, None)
            number of times to move a job to the STARTING status. You can specify between 1 and 30 attempts.
        source_dir=SOURCE_DIR (str, None)
            absolute, relative, or S3 URI Path to a directory with any other training source code dependencies aside from the entry point file (default: current working directory)
        git_config=GIT_CONFIG (typing.Dict[str, str], None)
            git configurations used for cloning files, including repo, branch, commit, 2FA_enabled, username, password, and token.
        hyperparameters=HYPERPARAMETERS (typing.Dict[str, str], None)
            dictionary containing the hyperparameters to initialize this estimator with.
        container_log_level=CONTAINER_LOG_LEVEL (int, None)
            log level to use within the container (default: logging.INFO).
        code_location=CODE_LOCATION (str, None)
            S3 prefix URI where custom code is uploaded.
        dependencies=DEPENDENCIES (typing.List[str], None)
            list of absolute or relative paths to directories with any additional libraries that should be exported to the container.
        training_repository_access_mode=TRAINING_REPOSITORY_ACCESS_MODE (str, None)
            specifies how SageMaker accesses the Docker image that contains the training algorithm.
        training_repository_credentials_provider_arn=TRAINING_REPOSITORY_CREDENTIALS_PROVIDER_ARN (str, None)
            Amazon Resource Name (ARN) of an AWS Lambda function that provides credentials to authenticate to the private Docker registry where your training image is hosted.
        disable_output_compression=DISABLE_OUTPUT_COMPRESSION (bool, None)
            when set to true, Model is uploaded to Amazon S3 without compression after training finishes.
        enable_infra_check=ENABLE_INFRA_CHECK (bool, None)
            specifies whether it is running Sagemaker built-in infra check jobs.
        image_repo=IMAGE_REPO (str, None)
            (remote jobs) the image repository to use when pushing patched images, must have push access. Ex: example.com/your/container
        quiet=QUIET (bool, False)
            whether to suppress verbose output for image building. Defaults to ``False``.

gcp_batch:
    usage:
        [project=PROJECT],[location=LOCATION]

    optional arguments:
        project=PROJECT (str, None)
            Name of the GCP project. Defaults to the configured GCP project in the environment
        location=LOCATION (str, us-central1)
            Name of the location to schedule the job in. Defaults to us-central1

ray:
    usage:
        [cluster_config_file=CLUSTER_CONFIG_FILE],[cluster_name=CLUSTER_NAME],[dashboard_address=DASHBOARD_ADDRESS],[requirements=REQUIREMENTS]

    optional arguments:
        cluster_config_file=CLUSTER_CONFIG_FILE (str, None)
            Use CLUSTER_CONFIG_FILE to access or create the Ray cluster.
        cluster_name=CLUSTER_NAME (str, None)
            Override the configured cluster name.
        dashboard_address=DASHBOARD_ADDRESS (str, 127.0.0.1:8265)
            Use ray status to get the dashboard address you will submit jobs against
        requirements=REQUIREMENTS (str, None)
            Path to requirements.txt



lsf:
    usage:
        [lsf_queue=LSF_QUEUE],[jobdir=JOBDIR],[container_workdir=CONTAINER_WORKDIR],[host_network=HOST_NETWORK],[shm_size=SHM_SIZE]

    optional arguments:
        lsf_queue=LSF_QUEUE (str, None)
            queue name to submit jobs
        jobdir=JOBDIR (str, None)
            The directory to place the job code and outputs. The directory must not exist and will be created.
        container_workdir=CONTAINER_WORKDIR (str, None)
            working directory in container jobs
        host_network=HOST_NETWORK (bool, False)
            True if using the host network for jobs
        shm_size=SHM_SIZE (str, 64m)
            size of shared memory (/dev/shm) for jobs
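
The usage lines above list each scheduler's run options as comma-separated `key=value` pairs. As a rough sketch, the helper below builds such a string from a Python dict. The helper name `format_scheduler_cfg` and the queue name `my_queue` are made up for illustration, and we assume the result is handed to `torchx run` through its `-cfg` (`--scheduler_args`) flag; check `torchx run --help` for your version.

```python
from typing import Mapping, Union

CfgValue = Union[str, int, bool]


def format_scheduler_cfg(cfg: Mapping[str, CfgValue]) -> str:
    """Join run options into the `key=value,key=value` form shown in the usage lines.

    Hypothetical helper for illustration; it is not part of the TorchX API.
    """
    parts = []
    for key, value in cfg.items():
        text = str(value)
        # A comma or equals sign inside a value would be ambiguous in this simple format.
        if "," in text or "=" in text:
            raise ValueError(f"unsupported character in value for {key!r}: {text!r}")
        parts.append(f"{key}={text}")
    return ",".join(parts)


# Option names come from the `lsf` listing above; the values are placeholders.
lsf_cfg = format_scheduler_cfg(
    {"lsf_queue": "my_queue", "host_network": True, "shm_size": "64m"}
)
print(lsf_cfg)
```

The helper rejects values containing `,` or `=` rather than guessing at an escaping rule, since the listing above does not describe one.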



## Custom Images

### Docker-based Schedulers

If you want more than the standard PyTorch libraries, you can add a custom
Dockerfile or build your own Docker container and use it as the base image for
your TorchX jobs.


In [11]:
%%writefile timm_app.py

import timm

print(timm.models.resnet18())

Writing timm_app.py


In [12]:
%%writefile Dockerfile.torchx

FROM pytorch/pytorch:1.10.0-cuda11.3-cudnn8-runtime

RUN pip install timm

COPY . .

Writing Dockerfile.torchx


Once the Dockerfile is created, we can launch as normal, and TorchX will
automatically build the image with the newly provided Dockerfile instead of the
default one.

In [13]:
%%sh
torchx run --scheduler local_docker utils.python --script timm_app.py

torchx 2025-05-05 21:54:34 INFO     loaded configs from /home/runner/work/torchx/torchx/docs/source/.torchxconfig
torchx 2025-05-05 21:54:34 INFO     Tracker configurations: {}
torchx 2025-05-05 21:54:34 INFO     Checking for changes in workspace `file:///home/runner/work/torchx/torchx/docs/source`...
torchx 2025-05-05 21:54:34 INFO     To disable workspaces pass: --workspace="" from CLI or workspace=None programmatically.
torchx 2025-05-05 21:54:34 INFO     Workspace `file:///home/runner/work/torchx/torchx/docs/source` resolved to filesystem path `/home/runner/work/torchx/torchx/docs/source`
torchx 2025-05-05 21:54:35 INFO     Building workspace docker image (this may take a while)...
torchx 2025-05-05 21:54:35 INFO     Step 1/4 : FROM pytorch/pytorch:1.10.0-cuda11.3-cudnn8-runtime
torchx 2025-05-05 21:55:38 INFO      ---> c3f17e5ac010
torchx 2025-05-05 21:55:38 INFO     Step 2/4 : RUN pip install timm
torchx 2025-05-05 21:55:38 INFO      ---> Running in c9ba5fcd7908
torchx 2025-05-05 21:55:38 INFO     Collecting timm
torchx 2025-05-05 21:55:38 INFO       Downloading timm-0.9.12-py3-none-any.whl (2.2 MB)
torchx 2025-05-05 21:55:39 INFO     Collecting safetensors
torchx 2025-05-05 21:55:39 INFO       Downloading safetensors-0.5.3.tar.gz (67 kB)
torchx 2025-05-05 21:55:39 INFO       Installing build dependencies: started
torchx 2025-05-05 21:55:41 INFO       Installing build dependencies: finished with status 'done'
torchx 2025-05-05 21:55:41 INFO       Getting requirements to build wheel: started
torchx 2025-05-05 21:55:41 INFO       Getting requirements to build wheel: finished with status 'done'
torchx 2025-05-05 21:55:41 INFO         Preparing wheel metadata: started
torchx 2025-05-05 21:55:41 INFO         Preparing wheel metadata: finished with status 'error'
torchx 2025-05-05 21:55:41 INFO         ERROR: Command errored out with exit status 1:
     command: /opt/conda/bin/python /opt/conda/lib/python3.7/site-packages/pip/_vendor/pep517/_in_process.py prepare_metadata_for_build_wheel /tmp/tmpzo0vychb
         cwd: /tmp/pip-install-lu42qvv8/safetensors_e7b96d32de524ad3bb0413e394eebee0
    Complete output (6 lines):

    Cargo, the Rust package manager, is not installed or is not on PATH.
    This package requires Rust and Cargo to compile extensions. Install it through
    the system's package manager or via https://rustup.rs/

    Checking for Rust toolchain....
    ----------------------------------------
torchx 2025-05-05 21:5
2315367ec7475693d110f512922d582fef1bd4a63adc3/safetensors-0.5.3.tar.gz#sha256=b6b0d6ecacec39a4fdd99cc19f4576f5219ce858e6fd8dbe7609df0b8dc56965 (from https://pypi.org/simple/safetensors/) (requires-python:>=3.7). Command errored out with exit status 1: /opt/conda/bin/python /opt/conda/lib/python3.7/site-packages/pip/_vendor/pep517/_in_process.py prepare_metadata_for_build_wheel /tmp/tmpzo0vychb Check the logs for full command output.
torchx 2025-05-05 21:55:41 INFO       Downloading safetensors-0.5.2.tar.gz (66 kB)
torchx 2025-05-05 21:55:41 INFO       Installing build dependencies: started
torchx 2025-05-05 21:55:43 INFO       Installing build dependencies: finished with status 'done'
torchx 2025-05-05 21:55:43 INFO       Getting requirements to build wheel: started
torchx 2025-05-05 21:55:43 INFO       Getting requirements to build wheel: finished with status 'done'
torchx 2025-05-05 21:55:43 INFO         Preparing wheel metadata: started
torchx 2025-05-05 21:55:43 INFO         Preparing wheel metadata: finished with status 'error'
torchx 2025-05-05 21:55:43 INFO         ERROR: Command errored out with exit status 1:
     command: /opt/conda/bin/python /opt/conda/lib/python3.7/site-packages/pip/_vendor/pep517/_in_process.py prepare_metadata_for_build_wheel /tmp/tmp8ok3_m4w
         cwd: /tmp/pip-install-lu42qvv8/safetensors_e0b8851d2f554f4c9a99bb45b40d5ebc
    Complete output (6 lines):

    Cargo, the Rust package manager, is not installed or is not on PATH.
    This package requires Rust and Cargo to compile extensions. Install it through
    the system's package manager or via https://rustup.rs/

    Checking for Rust toolchain....
    ----------------------------------------
torchx 2025-05-05 21:5
4b01b67a63a444d2e557c8fe1d82faf3ebd85f370a917/safetensors-0.5.2.tar.gz#sha256=cb4a8d98ba12fa016f4241932b1fc5e702e5143f5374bba0bbcf7ddc1c4cf2b8 (from https://pypi.org/simple/safetensors/) (requires-python:>=3.7). Command errored out with exit status 1: /opt/conda/bin/python /opt/conda/lib/python3.7/site-packages/pip/_vendor/pep517/_in_process.py prepare_metadata_for_build_wheel /tmp/tmp8ok3_m4w Check the logs for full command output.
torchx 2025-05-05 21:55:43 INFO       Downloading safetensors-0.5.1.tar.gz (66 kB)
torchx 2025-05-05 21:55:43 INFO       Installing build dependencies: started
torchx 2025-05-05 21:55:44 INFO       Installing build dependencies: finished with status 'done'
torchx 2025-05-05 21:55:44 INFO       Getting requirements to build wheel: started
torchx 2025-05-05 21:55:44 INFO       Getting requirements to build wheel: finished with status 'done'
torchx 2025-05-05 21:55:44 INFO         Preparing wheel metadata: started
torchx 2025-05-05 21:55:44 INFO         Preparing wheel metadata: finished with status 'error'
torchx 2025-05-05 21:55:44 INFO         ERROR: Command errored out with exit status 1:
     command: /opt/conda/bin/python /opt/conda/lib/python3.7/site-packages/pip/_vendor/pep517/_in_process.py prepare_metadata_for_build_wheel /tmp/tmp6w3l2pzj
         cwd: /tmp/pip-install-lu42qvv8/safetensors_862ca6b69af64c58812907b325882aef
    Complete output (6 lines):

    Cargo, the Rust package manager, is not installed or is not on PATH.
    This package requires Rust and Cargo to compile extensions. Install it through
    the system's package manager or via https://rustup.rs/

    Checking for Rust toolchain....
    ----------------------------------------
torchx 2025-05-05 21:5
314749ad4760545b6e4ec7b306cfa142776daaca6c980/safetensors-0.5.1.tar.gz#sha256=75927919a73b0f34d6943b531d757f724e65797a900d88d8081fe8b4448eadc3 (from https://pypi.org/simple/safetensors/) (requires-python:>=3.7). Command errored out with exit status 1: /opt/conda/bin/python /opt/conda/lib/python3.7/site-packages/pip/_vendor/pep517/_in_process.py prepare_metadata_for_build_wheel /tmp/tmp6w3l2pzj Check the logs for full command output.
torchx 2025-05-05 21:55:44 INFO       Downloading safetensors-0.5.0.tar.gz (65 kB)
torchx 2025-05-05 21:55:44 INFO       Installing build dependencies: started
torchx 2025-05-05 21:55:46 INFO       Installing build dependencies: finished with status 'done'
torchx 2025-05-05 21:55:46 INFO       Getting requirements to build wheel: started
torchx 2025-05-05 21:55:46 INFO       Getting requirements to build wheel: finished with status 'done'
torchx 2025-05-05 21:55:46 INFO         Preparing wheel metadata: started
torchx 2025-05-05 21:55:46 INFO         Preparing wheel metadata: finished with status 'error'
torchx 2025-05-05 21:55:46 INFO         ERROR: Command errored out with exit status 1:
     command: /opt/conda/bin/python /opt/conda/lib/python3.7/site-packages/pip/_vendor/pep517/_in_process.py prepare_metadata_for_build_wheel /tmp/tmpggmkwy3p
         cwd: /tmp/pip-install-lu42qvv8/safetensors_93bfcb265b7c45f4b3e5de5b0bb0d466
    Complete output (6 lines):

    Cargo, the Rust package manager, is not installed or is not on PATH.
    This package requires Rust and Cargo to compile extensions. Install it through
    the system's package manager or via https://rustup.rs/

    Checking for Rust toolchain....
    ----------------------------------------
torchx 2025-05-05 21:5
9d124ca63c6908f8092b528b48bd95ba11507e14d9dba/safetensors-0.5.0.tar.gz#sha256=c47b34c549fa1e0c655c4644da31332c61332c732c47c8dd9399347e9aac69d1 (from https://pypi.org/simple/safetensors/) (requires-python:>=3.7). Command errored out with exit status 1: /opt/conda/bin/python /opt/conda/lib/python3.7/site-packages/pip/_vendor/pep517/_in_process.py prepare_metadata_for_build_wheel /tmp/tmpggmkwy3p Check the logs for full command output.
torchx 2025-05-05 21:55:46 INFO       Downloading safetensors-0.4.5-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (436 kB)
torchx 2025-05-05 21:55:46 INFO     Collecting huggingface-hub
torchx 2025-05-05 21:55:46 INFO       Downloading huggingface_hub-0.16.4-py3-none-any.whl (268 kB)
7/site-packages (from timm) (5.4.1)
on3.7/site-packages (from timm) (1.10.0)
hon3.7/site-packages (from timm) (0.11.0)
ib/python3.7/site-packages (from torch>=1.7->timm) (3.10.0.2)
torchx 2025-05-05 21:55:46 INFO     Collecting importlib-metadata
torchx 2025-05-05 21:55:46 INFO       Downloading importlib_metadata-6.7.0-py3-none-any.whl (22 kB)
torchx 2025-05-05 21:55:46 INFO     Collecting fsspec
torchx 2025-05-05 21:55:46 INFO       Downloading fsspec-2023.1.0-py3-none-any.whl (143 kB)
3.7/site-packages (from huggingface-hub->timm) (2.25.1)
torchx 2025-05-05 21:55:46 INFO     Collecting packaging>=20.9
torchx 2025-05-05 21:55:47 INFO       Downloading packaging-24.0-py3-none-any.whl (53 kB)
3.7/site-packages (from huggingface-hub->timm) (3.0.12)
thon3.7/site-packages (from huggingface-hub->timm) (4.61.2)
torchx 2025-05-05 21:55:47 INFO     Collecting zipp>=0.5
torchx 2025-05-05 21:55:47 INFO       Downloading zipp-3.15.0-py3-none-any.whl (6.8 kB)
da/lib/python3.7/site-packages (from requests->huggingface-hub->timm) (1.26.6)
thon3.7/site-packages (from requests->huggingface-hub->timm) (2.10)
torchx 2025-05-05 21:55:47 INFO 
m requests->huggingface-hub->timm) (2021.10.8)
ib/python3.7/site-packages (from requests->huggingface-hub->timm) (4.0.0)
/site-packages (from torchvision->timm) (1.21.2)
da/lib/python3.7/site-packages (from torchvision->timm) (8.4.0)
torchx 2025-05-05 21:55:47 INFO     Installing collected packages: zipp, packaging, importlib-metadata, fsspec, safetensors, huggingface-hub, timm
torchx 2025-05-05 21:55:48 INFO     Successfully installed fsspec-2023.1.0 huggingface-hub-0.16.4 importlib-metadata-6.7.0 packaging-24.0 safetensors-0.4.5 timm-0.9.12 zipp-3.15.0
torchx 2025-05-05 21:55:50 INFO      ---> Removed intermediate container c9ba5fcd7908
torchx 2025-05-05 21:55:50 INFO      ---> 574f2913af87
torchx 2025-05-05 21:55:50 INFO     Step 3/4 : COPY . .
torchx 2025-05-05 21:55:53 INFO      ---> abb6678f8503
torchx 2025-05-05 21:55:53 INFO     Step 4/4 : LABEL torchx.pytorch.org/version=0.8.0dev0
torchx 2025-05-05 21:55:53 INFO      ---> Running in 464791ad7778
torchx 2025-05-05 21:55:55 INFO      ---> Removed intermediate container 464791ad7778
torchx 2025-05-05 21:55:55 INFO      ---> 72ba1b4e978e
umed
torchx 2025-05-05 21:55:55 INFO     Successfully built 72ba1b4e978e
torchx 2025-05-05 21:55:55 INFO     Built new image `sha256:72ba1b4e978ea7ce73595ccfc8fe1d5a320bdebf531dd8c7319f6816cecd5979` based on original image `ghcr.io/pytorch/torchx:0.8.0dev0` and changes in workspace `file:///home/runner/work/torchx/torchx/docs/source` for role[0]=python.
torchx 2025-05-05 21:55:55 INFO     Waiting for the app to finish...
python/0 ResNet(
python/0   (conv1): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
python/0   (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
python/0   (act1): ReLU(inplace=True)
python/0   (maxpool): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
python/0   (layer1): Sequential(
python/0     (0): BasicBlock(
python/0       (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
python/0       (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
python/0       (drop_block): Identity()
python/0       (act1): ReLU(inplace=True)
python/0       (aa): Identity()
python/0       (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
python/0       (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
python/0       (act2): ReLU(inplace=True)
python/0     )
python/0     (1): BasicBlock(
python/0       (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
python/0       (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
python/0       (drop_block): Identity()
python/0       (act1): ReLU(inplace=True)
python/0       (aa): Identity()
python/0       (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
python/0       (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
python/0       (act2): ReLU(inplace=True)
python/0     )
python/0   )
python/0   (layer2): Sequential(
python/0     (0): BasicBlock(
python/0       (conv1): Conv2d(64, 128, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
python/0       (bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
python/0       (drop_block): Identity()
python/0       (act1): ReLU(inplace=True)
python/0       (aa): Identity()
python/0       (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
python/0       (bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
python/0       (act2): ReLU(inplace=True)
python/0       (downsample): Sequential(
python/0         (0): Conv2d(64, 128, kernel_size=(1, 1), stride=(2, 2), bias=False)
python/0         (1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
python/0       )
python/0     )
python/0     (1): BasicBlock(
python/0       (conv1): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
python/0       (bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
python/0       (drop_block): Identity()
python/0       (act1): ReLU(inplace=True)
python/0       (aa): Identity()
python/0       (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
python/0       (bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
python/0       (act2): ReLU(inplace=True)
python/0     )
python/0   )
python/0   (layer3): Sequential(
python/0     (0): BasicBlock(
python/0       (conv1): Conv2d(128, 256, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
python/0       (bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
python/0       (drop_block): Identity()
python/0       (act1): ReLU(inplace=True)
python/0       (aa): Identity()
python/0       (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
python/0       (bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
python/0       (act2): ReLU(inplace=True)
python/0       (downsample): Sequential(
python/0         (0): Conv2d(128, 256, kernel_size=(1, 1), stride=(2, 2), bias=False)
python/0         (1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
python/0       )
python/0     )
python/0     (1): BasicBlock(
python/0       (conv1): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
python/0       (bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
python/0       (drop_block): Identity()
python/0       (act1): ReLU(inplace=True)
python/0       (aa): Identity()
python/0       (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
python/0       (bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
python/0       (act2): ReLU(inplace=True)
python/0     )
python/0   )
python/0   (layer4): Sequential(
python/0     (0): BasicBlock(
python/0       (conv1): Conv2d(256, 512, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
python/0       (bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
python/0       (drop_block): Identity()
python/0       (act1): ReLU(inplace=True)
python/0       (aa): Identity()
python/0       (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
python/0       (bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
python/0       (act2): ReLU(inplace=True)
python/0       (downsample): Sequential(
python/0         (0): Conv2d(256, 512, kernel_size=(1, 1), stride=(2, 2), bias=False)
python/0         (1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
python/0       )
python/0     )
python/0     (1): BasicBlock(
python/0       (conv1): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
python/0       (bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
python/0       (drop_block): Identity()
python/0       (act1): ReLU(inplace=True)
python/0       (aa): Identity()
python/0       (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
python/0       (bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
python/0       (act2): ReLU(inplace=True)
python/0     )
python/0   )
python/0   (global_pool): SelectAdaptivePool2d(pool_type=avg, flatten=Flatten(start_dim=1, end_dim=-1))
python/0   (fc): Linear(in_features=512, out_features=1000, bias=True)
python/0 )
torchx 2025-05-05 21:55:57 INFO     Job finished: SUCCEEDED
local_docker://torchx/torchx_utils_python-bwf1ssf212ckqd
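
Outside a notebook, the two `%%writefile` cells and the launch command can be collected into one script. The following is only a sketch: the helper names `write_workspace` and `launch_command` are invented for illustration, the file contents and CLI arguments are copied from the cells above, and we assume TorchX picks up `Dockerfile.torchx` from the directory the command is run in, as the workspace log lines above suggest.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path
from typing import List

# File contents copied from the notebook cells above.
APP_SOURCE = "import timm\n\nprint(timm.models.resnet18())\n"
DOCKERFILE_SOURCE = (
    "FROM pytorch/pytorch:1.10.0-cuda11.3-cudnn8-runtime\n\n"
    "RUN pip install timm\n\n"
    "COPY . .\n"
)


def write_workspace(root: Path) -> None:
    """Create the app and the custom Dockerfile in `root` (hypothetical helper)."""
    (root / "timm_app.py").write_text(APP_SOURCE)
    (root / "Dockerfile.torchx").write_text(DOCKERFILE_SOURCE)


def launch_command(script: str = "timm_app.py") -> List[str]:
    """Return the CLI invocation from the cell above as an argv list."""
    return ["torchx", "run", "--scheduler", "local_docker", "utils.python", "--script", script]


if __name__ == "__main__":
    workspace = Path(tempfile.mkdtemp(prefix="torchx_quickstart_"))
    write_workspace(workspace)
    # Only launch when both tools are installed; otherwise just show the command.
    if shutil.which("torchx") and shutil.which("docker"):
        subprocess.run(launch_command(), cwd=workspace, check=True)
    else:
        print("torchx or docker not found; would run:", " ".join(launch_command()))
```

Running the command with `cwd=workspace` keeps the Docker build context limited to the two generated files, so `COPY . .` does not pull in unrelated content.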


### Slurm

The `slurm` and `local_cwd` schedulers use the current environment, so you can
use `pip` and `conda` as normal.
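
Because these schedulers reuse the current environment, it can help to confirm that an app's imports resolve before launching. Here is a minimal sketch using only the standard library; the helper name `missing_modules` is made up for illustration, and it checks only the environment the script itself runs in, which may differ from the nodes a Slurm job lands on.

```python
import importlib.util
from typing import Iterable, List


def missing_modules(names: Iterable[str]) -> List[str]:
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# `timm` is the dependency used by timm_app.py in the previous section.
missing = missing_modules(["timm"])
if missing:
    print("install with pip or conda before launching:", ", ".join(missing))
else:
    print("all imports are available in the current environment")
```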

## Next Steps

1. Check out other features of the [torchx CLI](cli.rst)
2. Take a look at the [list of schedulers](schedulers.rst) supported by the runner
3. Browse through the collection of [builtin components](components/overview.rst)
4. See which [ML pipeline platforms](pipelines.rst) you can run components on
5. See a [training app example](examples_apps/index.rst)