## Submit jobs to the Ray cluster



### Use Ray Train



``` bash
# run in a terminal inside jupyter container
cd ~/work/gourmetgram-train
git stash # stash any changes you made to the current branch
git fetch -a
git switch ray
cd ~/work
```


``` bash
# runs on jupyter container inside node-mltrain, from inside the "work" directory
ray job submit --runtime-env runtime.json --working-dir . -- python gourmetgram-train/train.py
```

Submit the job, and note that it runs mostly as before. Let it run until it is finished.

### Use Ray Train with multiple workers

In `gourmetgram-train/train.py`, change the scaling configuration to


``` python
scaling_config = ScalingConfig(num_workers=2, use_gpu=True, resources_per_worker={"GPU": 1, "CPU": 8})
```

to scale to two worker nodes, each with 1 GPU and 8 CPUs assigned to the job. Save the file, and run with

``` bash
# runs on jupyter container inside node-mltrain, from inside the "work" directory
ray job submit --runtime-env runtime.json --working-dir . -- python gourmetgram-train/train.py
```

On the Ray dashboard, in the “Resource Status” section of the “Overview” tab, you should see the increased resource requirements reflected in the “Usage” section.
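To know roughly what to look for in the "Usage" section, it can help to work out the total that the workers request. The following is a minimal sketch in plain Python; `total_worker_resources` is a hypothetical helper, not part of Ray, and the dashboard may show slightly different numbers if Ray reserves additional resources for coordinating the run.

``` python
# Hypothetical helper (not part of Ray): total resources requested by all workers.
def total_worker_resources(num_workers, resources_per_worker):
    """Multiply each per-worker resource by the number of workers."""
    return {name: amount * num_workers for name, amount in resources_per_worker.items()}

# Same values as the ScalingConfig above: 2 workers, each with 1 GPU and 8 CPUs.
requested = total_worker_resources(2, {"GPU": 1, "CPU": 8})
print(requested)  # {'GPU': 2, 'CPU': 16}
```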

In a terminal on the “node-mltrain” host (*not* inside the Jupyter container), run

``` bash
# runs on node-mltrain
nvtop
```

and confirm that both GPUs are busy.
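If you would rather check this from a script than watch `nvtop`, one option is to parse the CSV output of `nvidia-smi`. This is only a sketch: it assumes NVIDIA GPUs with `nvidia-smi` on the path of "node-mltrain", and the function names and the 50% threshold are arbitrary choices for illustration.

``` python
import shutil
import subprocess

# nvidia-smi's CSV query mode prints one "index, utilization" line per GPU.
QUERY = ["nvidia-smi", "--query-gpu=index,utilization.gpu", "--format=csv,noheader,nounits"]

def parse_gpu_utilization(csv_text):
    """Turn lines like '0, 97' into a {gpu_index: utilization_percent} dict."""
    utilization = {}
    for line in csv_text.strip().splitlines():
        fields = [field.strip() for field in line.split(",")]
        # Skip lines that are not two integer fields (e.g. unsupported readings).
        if len(fields) != 2 or not all(field.isdigit() for field in fields):
            continue
        utilization[int(fields[0])] = int(fields[1])
    return utilization

def busy_gpus(utilization, threshold=50):
    """Return the indices of GPUs at or above the (arbitrary) threshold."""
    return sorted(i for i, pct in utilization.items() if pct >= threshold)

# Only query the GPUs when run as a script on a host that has nvidia-smi.
if __name__ == "__main__" and shutil.which("nvidia-smi"):
    result = subprocess.run(QUERY, capture_output=True, text=True)
    if result.returncode == 0:
        print(busy_gpus(parse_gpu_utilization(result.stdout)))
```

While the two-worker job is training, the printed list should contain both GPU indices.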

### Use Ray Train for fault tolerance

Next, let’s try out fault tolerance! If the worker node that runs our Ray Train job dies, Ray can resume from the most recent checkpoint on another worker node.

Fault tolerance is configured in another branch:

``` bash
# run in a terminal inside jupyter container
cd ~/work/gourmetgram-train
git stash # stash any changes you made to the current branch
git fetch -a
git switch fault_tolerance
cd ~/work
```

To add fault tolerance, we

-   add an import for `FailureConfig`
-   pass a `FailureConfig` to our `RunConfig`:

``` python
run_config = RunConfig( ... failure_config=FailureConfig(max_failures=2))
```

-   and inside `train_fun`, we replace the old

``` python
trainer.fit(lightning_food11_model, train_dataloaders=train_loader, val_dataloaders=val_loader)
```

with

``` python
## For Ray Train fault tolerance with FailureConfig
# Recover from checkpoint, if we are restoring after failure
checkpoint = train.get_checkpoint()
if checkpoint:
    with checkpoint.as_directory() as ckpt_dir:
        ckpt_path = os.path.join(ckpt_dir, "checkpoint.ckpt")
        trainer.fit(lightning_food11_model, train_dataloaders=train_loader, val_dataloaders=val_loader, ckpt_path=ckpt_path)
else:
    trainer.fit(lightning_food11_model, train_dataloaders=train_loader, val_dataloaders=val_loader)
```
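The pattern above (retry a limited number of times, and resume from the last checkpoint on each retry) can be illustrated with a toy simulation in plain Python. This is not how Ray implements it: `train`, `fit_with_retries`, and the JSON "checkpoint" are hypothetical stand-ins for the training function, the `max_failures` setting, and the Lightning checkpoint file.

``` python
import json
import os
import tempfile

def train(ckpt_path, total_epochs, fail_at_epoch=None):
    """Toy training loop: resume from ckpt_path if it exists, save after every epoch."""
    start = 0
    if os.path.exists(ckpt_path):
        with open(ckpt_path) as f:
            start = json.load(f)["epoch"] + 1  # continue after the last saved epoch
    for epoch in range(start, total_epochs):
        if epoch == fail_at_epoch:
            raise RuntimeError(f"simulated worker failure in epoch {epoch}")
        with open(ckpt_path, "w") as f:
            json.dump({"epoch": epoch}, f)
    return start  # the epoch this attempt started from

def fit_with_retries(ckpt_path, total_epochs, max_failures, fail_at_epoch):
    """Retry up to max_failures times; each retry resumes from the checkpoint."""
    failures = 0
    while True:
        try:
            # Only the first attempt is made to fail in this simulation.
            return train(ckpt_path, total_epochs, fail_at_epoch if failures == 0 else None)
        except RuntimeError:
            failures += 1
            if failures > max_failures:
                raise

with tempfile.TemporaryDirectory() as tmp:
    # Fail once in epoch 3; the retry picks up at epoch 3 instead of epoch 0.
    resumed_from = fit_with_retries(os.path.join(tmp, "checkpoint.json"), 5, 2, 3)
    print(resumed_from)  # 3
```

With `max_failures=0` the same simulated failure would be raised to the caller instead of triggering a retry.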


### Stop Ray system



When you are finished, bring down the Ray cluster. For NVIDIA GPUs:

``` bash
# run on node-mltrain
docker compose -f LLM_LegalDocSummarization/docker/docker-compose-ray-cuda.yaml down
```

and then stop the Jupyter server with

``` bash
# run on node-mltrain
docker stop jupyter
```