<div class="align-center">
<a href="https://oumi.ai/"><img src="https://oumi.ai/docs/en/latest/_static/logo/header_logo.png" height="200"></a>

[![Documentation](https://img.shields.io/badge/Documentation-latest-blue.svg)](https://oumi.ai/docs/en/latest/index.html)
[![Discord](https://img.shields.io/discord/1286348126797430814?label=Discord)](https://discord.gg/oumi)
[![GitHub Repo stars](https://img.shields.io/github/stars/oumi-ai/oumi)](https://github.com/oumi-ai/oumi)
</div>

👋 Welcome to Open Universal Machine Intelligence (Oumi)!

🚀 Oumi is a fully open-source platform that streamlines the entire lifecycle of foundation models - from [data preparation](https://oumi.ai/docs/en/latest/resources/datasets/datasets.html) and [training](https://oumi.ai/docs/en/latest/user_guides/train/train.html) to [evaluation](https://oumi.ai/docs/en/latest/user_guides/evaluate/evaluate.html) and [deployment](https://oumi.ai/docs/en/latest/user_guides/launch/launch.html). Whether you're developing on a laptop, launching large scale experiments on a cluster, or deploying models in production, Oumi provides the tools and workflows you need.

🤝 Make sure to join our [Discord community](https://discord.gg/oumi) to get help, share your experiences, and contribute to the project! If you are interested in joining one of the community's open-science efforts, check out our [open collaboration](https://oumi.ai/community) page.

⭐ If you like Oumi and you would like to support it, please give it a star on [GitHub](https://github.com/oumi-ai/oumi).

# Running Jobs Remotely

# Introduction


In addition to our local training loop, Oumi provides the [launcher module](https://github.com/oumi-ai/oumi/tree/main/src/oumi/launcher) as a simple interface for kicking off jobs on a wide variety of remote hardware. We support several cloud providers (GCP, Runpod, Lambda) out of the box, and you can add support for your own custom cluster should the need arise. This tutorial focuses on running jobs on GCP, but the same steps apply to every cloud Oumi supports. You can read more about the launcher API [here](https://github.com/oumi-ai/oumi/blob/main/src/oumi/launcher/launcher.py).


# Prerequisites

This tutorial assumes:
- You have a valid Google Cloud Platform (GCP) project with billing enabled.
- Your GCP project has the `Compute Engine API` enabled.
- You have the following IAM roles in your project:
  - `roles/browser`
  - `roles/compute.admin`
  - `roles/serviceusage.serviceUsageConsumer`
  - `roles/storage.admin`

You must also authenticate with GCP locally before starting this tutorial:

```bash
conda install -c conda-forge google-cloud-sdk -y
gcloud init
# Run this if you don't have a credentials file.
# This will generate ~/.config/gcloud/application_default_credentials.json.
gcloud auth application-default login
```
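
As a quick sanity check before continuing, the sketch below looks for the credentials file mentioned in the comment above. The helper name `has_adc_file` is hypothetical, and the check only confirms that the file exists and parses as a JSON object, not that the credentials are still valid.

```python
import json
from pathlib import Path

# Default location written by `gcloud auth application-default login` (see the comment above).
DEFAULT_ADC_PATH = (
    Path.home() / ".config" / "gcloud" / "application_default_credentials.json"
)


def has_adc_file(path: Path = DEFAULT_ADC_PATH) -> bool:
    """Return True if `path` exists and contains a JSON object (hypothetical helper)."""
    try:
        with open(path, encoding="utf-8") as f:
            return isinstance(json.load(f), dict)
    except (OSError, ValueError):
        # Missing/unreadable file, or content that is not valid JSON.
        return False


print("Credentials file found:", has_adc_file())
```

If this prints `False`, re-run `gcloud auth application-default login` before moving on.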

## Oumi Installation

First, let's install Oumi. You can find more detailed instructions [here](https://oumi.ai/docs/en/latest/get_started/installation.html).

In [None]:
%pip install oumi

# Creating a Job

The Oumi Launcher operates using three key concepts:

1) `Jobs`: A `job` is a unit of work, such as running training or model evaluation. This can be any script you'd like!
2) `Clusters`: A `cluster` is a set of dedicated hardware upon which `jobs` are run. A `cluster` could be as simple as a cloud VM environment.
3) `Clouds`: A `cloud` is a resource provider that manages `clusters`. These include GCP, AWS, Lambda, Runpod, etc.

When you submit a job, the launcher queues it on the appropriate cluster. If your chosen cloud has no suitable cluster for running the job, the launcher will try to create one on the fly!
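
The relationship between these three concepts can be pictured with a small toy model. `ToyCloud` and `ToyCluster` are hypothetical stand-ins written only for this illustration. They are not Oumi classes, and the real launcher does far more (provisioning hardware, syncing files, and so on).

```python
from dataclasses import dataclass, field


@dataclass
class ToyCluster:
    name: str
    queue: list[str] = field(default_factory=list)  # names of queued jobs


@dataclass
class ToyCloud:
    name: str
    clusters: dict[str, ToyCluster] = field(default_factory=dict)

    def submit(self, job_name: str, cluster_name: str) -> ToyCluster:
        """Queue a job, creating the cluster first if it does not exist yet."""
        cluster = self.clusters.setdefault(cluster_name, ToyCluster(cluster_name))
        cluster.queue.append(job_name)
        return cluster


cloud = ToyCloud("gcp")
cloud.submit("train", "my-cluster")  # no such cluster yet, so one is created
cloud.submit("evaluate", "my-cluster")  # reuses the existing cluster
print(cloud.clusters["my-cluster"].queue)  # ['train', 'evaluate']
```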

Start by creating a simple job:

In [None]:
import oumi.launcher as launcher

job_name = "Create_a_display_name_for_your_job"
cloud_name = "gcp"

job = launcher.JobConfig(
    name=job_name,
    working_dir=".",
    setup="",
    run="",
    resources=launcher.JobResources(
        # We're using Google Cloud Platform in this example.
        cloud=cloud_name,
    ),
)

Congratulations on creating your first job!

Right now your job has an empty `run` field, meaning it won't execute any code at runtime. Let's fix that by adding a simple `echo` statement that prints an environment variable. Note that every line of `run` is executed directly in the shell on your cluster (more on that later).

In [None]:
env_vars = {
    "TEST_ENV_VARIABLE": '"Hello, World!"',
}
job.envs = env_vars

run_script = """
echo "$TEST_ENV_VARIABLE"
"""

job.run = run_script
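
Because `run` is plain shell, you can preview a script locally before paying for a cluster. The sketch below is not part of the Oumi API: it uses Python's `subprocess` module to run the same script with the same environment variables in a local `sh` shell (assumed to be available on your machine). The remote cluster may inject environment variables differently, so treat this only as a rough check.

```python
import os
import subprocess

env_vars = {"TEST_ENV_VARIABLE": '"Hello, World!"'}
run_script = """
echo "$TEST_ENV_VARIABLE"
"""

# Run the script in a local shell, layering the job's variables over the current environment.
result = subprocess.run(
    ["sh", "-c", run_script],
    env={**os.environ, **env_vars},
    capture_output=True,
    text=True,
    check=True,
)
print(result.stdout, end="")  # "Hello, World!" (the double quotes are part of the value)
```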

Let's also populate `setup`. Like `run`, `setup` is executed in the shell on the cluster. However, for most clouds `setup` is only executed when a cluster is created for the first time. This is where you should `pip install` any dependencies needed by your job's `run` script.

For now, let's add a couple of simple echo statements:

In [None]:
setup_script = """
echo "This is a script to help set up your environment for your job."
echo "On most clouds, this is only run during cluster creation."
"""

job.setup = setup_script

# Running your Job

Now that you have a job, it's time to run it on a cluster. Use `launcher.up(...)` to launch your job. If you don't have any clusters set up yet, the launcher will make a best-effort attempt to spin up a cluster that meets the requirements set in your job's `JobResources`:

In [None]:
cluster_name = "your_cluster_name"

# If you specify an existing cluster name the launcher will use that cluster.
# Otherwise the launcher will create a new cluster with the specified name.
cluster, job_status = launcher.up(job, cluster_name)

> You'll notice that the logs from the previous command reference Sky. Individual clouds and clusters in the Oumi launcher may use different libraries for communication and job orchestration. At the time of writing, the GCP cloud implementation uses SkyPilot.

You can get the latest status of your job by querying the job status on your cluster:

In [None]:
latest_status = cluster.get_job(job_status.id)

print(latest_status)
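
If you want to block until the job finishes, one option is a small polling loop. The sketch below is an illustration rather than an Oumi utility: `wait_for_job` is a hypothetical helper, and it assumes the status object exposes a boolean `done` attribute, which you should confirm against the `JobStatus` definition in your installed version. It takes a zero-argument callable so that it can be demonstrated here with a stand-in instead of a real cluster.

```python
import time
from types import SimpleNamespace
from typing import Any, Callable


def wait_for_job(
    get_status: Callable[[], Any],
    poll_seconds: float = 30.0,
    timeout_seconds: float = 3600.0,
) -> Any:
    """Poll `get_status` until the returned status has a truthy `done` (assumed attribute)."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = get_status()
        if getattr(status, "done", False):
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError("Job did not finish before the timeout.")
        time.sleep(poll_seconds)


# Stand-in for a real cluster: reports done on the third poll.
fake_statuses = iter(
    [SimpleNamespace(done=False), SimpleNamespace(done=False), SimpleNamespace(done=True)]
)
final = wait_for_job(lambda: next(fake_statuses), poll_seconds=0.01)
print(final.done)  # True
```

With a real cluster, the callable would be something like `lambda: cluster.get_job(job_status.id)`.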

And list the status of all jobs across all clouds and clusters:

In [None]:
status_list = launcher.status()

print(status_list)

Another handy utility is the ability to list all active clusters for a cloud. Your new cluster will appear in this list:

In [None]:
clusters = launcher.get_cloud(cloud_name).list_clusters()

cluster_names = [cluster.name() for cluster in clusters]
print(cluster_names)

Note that `launcher.get_cloud(cloud_name)` returned a `BaseCloud` object. You can learn more about the `Cloud` API [here](https://github.com/oumi-ai/oumi/blob/main/src/oumi/core/types/base_cloud.py).


You can learn more about the `Cluster` API [here](https://github.com/oumi-ai/oumi/blob/main/src/oumi/core/types/base_cluster.py#L28).

# Canceling a Job

Running jobs can be quickly canceled by using the `cancel` method:

In [None]:
# Only run this cell if you want to cancel your job!
final_status = launcher.cancel(job_status.id, cloud_name, job_status.cluster)

print(final_status)

# Cleaning Up

After your job is done, make sure you don't forget to turn down your cluster! Most cloud providers will bill you for the time that your cluster is up, whether or not it is actively running jobs:

In [None]:
# Cluster names are only unique within a Cloud.
# Specify both the cloud and the cluster you'd like to turn down
launcher.down(cloud_name, cluster_name)
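
Since a forgotten cluster keeps costing money, it can help to tie teardown to a `with` block so it runs even when something fails. The following is a generic sketch using only the standard library. `temporary_cluster` is a hypothetical helper, and the stand-in lambdas take the place of real launcher calls.

```python
from contextlib import contextmanager
from typing import Callable, Iterator, TypeVar

T = TypeVar("T")


@contextmanager
def temporary_cluster(up: Callable[[], T], down: Callable[[], None]) -> Iterator[T]:
    """Call `up`, yield its result, and always call `down` afterwards."""
    handle = up()
    try:
        yield handle
    finally:
        down()  # runs even if the body of the `with` block raises


# Demonstration with stand-in callables instead of real launcher calls.
events: list[str] = []
try:
    with temporary_cluster(lambda: events.append("up"), lambda: events.append("down")):
        raise RuntimeError("job failed")
except RuntimeError:
    events.append("error handled")
print(events)  # ['up', 'down', 'error handled']
```

With the launcher, `up` could be `lambda: launcher.up(job, cluster_name)` and `down` could be `lambda: launcher.down(cloud_name, cluster_name)`. Note that if `up` itself fails, `down` is never called, so check your cloud console in that case.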

# Running a Remote Training Job

Running training jobs on a remote cluster is simple. Before getting started, we strongly suggest you take a look at our tuning tutorial to learn the ropes of Oumi training.

You can apply the same methods for local training to a remote job. The following job is a sample script for training Llama-2b on GCP. A few important notes:

- If you use `${ENV_VAR}` interpolation in your `setup` or `run` script, the `$` must be escaped with a backslash, e.g. `${ENV_VAR}` -> `\${ENV_VAR}`.
- The job assumes that its `working_dir` (`..` in the example below) is the root of the Oumi repository. You will see references to repository-relative paths like `./configs/examples/misc/sky_init.sh`.

In [None]:
job_config = launcher.JobConfig(
    name="llama-2b",
    working_dir="..",
    file_mounts={
        "~/.netrc": "~/.netrc"  # WandB credentials
    },
    envs={
        "ACCELERATE_LOG_LEVEL": "info",
    },
    resources=launcher.JobResources(
        # Run on Google Cloud Platform
        cloud="gcp",
        # Use 4 A100 GPUs
        accelerators="A100:4",
    ),
    setup="""
set -e
pip install uv && uv pip install 'oumi[gpu]'
""",
    run="""
set -e  # Exit if any command failed.

# Run some checks, and export "OUMI_*" env vars
source ./configs/examples/misc/sky_init.sh

set -x
oumi distributed torchrun \
    -m oumi train \
    -c configs/examples/fineweb_ablation_pretraining/fsdp/train.yaml \
    --training.max_steps 20 \
    --training.save_steps 0 \
    --training.save_final_model false

echo "Node \\${SKYPILOT_NODE_RANK} is all done!"
""",
)
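
The escaping in the last line of `run` is easy to get wrong, because Python processes backslashes before the launcher ever sees the script. The snippet below only demonstrates standard Python string behavior (it makes no claim about how the launcher handles the text afterwards): in a regular string literal, `\\$` produces the two characters `\$`, and a raw string gives the same result with a single backslash.

```python
regular = "echo \\${SKYPILOT_NODE_RANK}"  # regular literal: "\\" becomes one backslash
raw = r"echo \${SKYPILOT_NODE_RANK}"  # raw literal: backslashes are kept as written

print(regular)  # echo \${SKYPILOT_NODE_RANK}
print(regular == raw)  # True

# A backslash at the very end of a line inside a regular triple-quoted string joins
# the two lines, so a multi-line command written that way reaches the shell as one line.
joined = """a \
b"""
print(joined)  # a b
```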

You can kick off this job just as you did before. Note that it requires a cluster with 4 A100 GPUs. You can uncomment the following command and run it to start this training job on GCP:

In [None]:
# Uncomment the following line to run training
# cluster, job_status = launcher.up(job_config, "llama-2b-cluster")

To view your job logs, run the following:

In [None]:
!sky logs llama-2b-cluster

To turn down your cluster when you're done, run:

In [None]:
launcher.down(cloud_name, "llama-2b-cluster")

### Advanced Fields

The `JobConfig` used to define a job contains many fields we didn't cover above. See the following definitions to better understand how to set up resourcing for your jobs:

#### JobConfig

| **Field Name**  | **Type**                                        | **Description**                                                                                                                                                                   |
|-----------------|-------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| name            | Optional[str] (default=None)                    | The display name of the job. Used for display purposes for most clouds.                                                                                                           |
| user            | Optional[str] (default=None)                    | Only used for the `Polaris` cloud. The user that the job will run as.                                                                                                             |
| working_dir     | str (required)                                  | The local directory containing scripts required to execute the job. This directory will be copied to the remote node.                                                             |
| num_nodes       | int (required, default=1)                       | The number of nodes (compute instances) to use for the job. Used during cluster creation.                                                                                         |
| resources       | JobResources (required)                         | The resources required for each node in the job.                                                                                                                                  |
| envs            | Dict[str, str] (required, default={})           | The environment variables to set before running the job.                                                                                                                          |
| file_mounts     | Dict[str, str] (required, default={})           | File mounts to attach to the node. For mounting (copying) local directories. The key is the remote path, and the value is the local path. Cannot share a key with `storage_mounts`. |
| storage_mounts  | Dict[str, StorageMount] (required, default={})  | Storage systems to attach to the node. The key is the remote path, and the value is the storage system to mount. Cannot share a key with `file_mounts`                             |
| setup           | Optional[str] (default=None)                    | The setup script to run before the job starts. For most clouds this is executed only on cluster creation. ex) `pip install -r requirements.txt`                                   |
| run             | str (required)                                  | The script to run on the remote cluster.                                                                                                                                          |
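
The `file_mounts` and `storage_mounts` rows above note that the two mappings cannot share a key. A quick pre-flight check for that rule might look like the sketch below. `find_mount_conflicts` is a hypothetical helper, the example paths are made up, and whether Oumi performs a similar validation itself is not covered here.

```python
def find_mount_conflicts(file_mounts: dict, storage_mounts: dict) -> set:
    """Return remote paths used as keys in both mappings (hypothetical helper)."""
    return set(file_mounts) & set(storage_mounts)


# Illustrative values only; real `storage_mounts` values would be StorageMount objects.
file_mounts = {"~/.netrc": "~/.netrc", "/data": "./local_data"}
storage_mounts = {"/data": "placeholder", "/checkpoints": "placeholder"}

print(find_mount_conflicts(file_mounts, storage_mounts))  # {'/data'}
```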

#### StorageMount

| **Field Name**  | **Type**                                        | **Description**                                                                                                                                                                   |
|-----------------|-------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| source          | str (required)                                  | The remote storage path to mount on the node, e.g. 'gs://bucket/path' for GCS, 's3://bucket/path' for S3, or 'r2://path' for R2.                                                  |
| store           | str (required)                                  | The remote storage solution. Must be one of 's3', 'gcs' or 'r2'.                                                                                                                  |

#### JobResources

| **Field Name**  | **Type**                                        | **Description**                                                                                                                                                                   |
|-----------------|-------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| cloud           | str (required)                                  | The cloud used to run the job.                                                                                                                                                    |
| region          | Optional[str] (default=None)                    | The region to use (optional). Supported values vary by environment.                                                                                                               |
| zone            | Optional[str] (default=None)                    | The zone to use (optional). Supported values vary by environment.                                                                                                                 |
| accelerators    | Optional[str] (default=None)                    | Accelerator type (optional). Supported values vary by environment. For GCP you may specify the accelerator name and count, e.g. "V100:4".                                         |
| cpus            | Optional[str] (default=None)                    | Number of vCPUs to use per node (optional). Sky-based clouds support strings with modifiers, e.g. "2+" to indicate at least 2 vCPUs.                                              |
| memory          | Optional[str] (default=None)                    | Memory to allocate per node in GiB (optional). Sky-based clouds support strings with modifiers, e.g. "256+" to indicate at least 256 GiB.                                         |
| instance_type   | Optional[str] (default=None)                    | Instance type to use (optional). Supported values vary by environment. The instance type is automatically inferred if `accelerators` is specified.                                |
| use_spot        | bool (required, default=False)                  | Whether the cluster should use spot instances. If unspecified, defaults to False (on-demand instances).                                                                           |
| disk_size       | Optional[int] (default=None)                    | Disk size in GiB to allocate for OS (mounted at /). Ignored by Polaris. Optional.                                                                                                 |
| disk_tier       | Optional[str] (default=None)                    | Disk tier to use for OS (optional). For Sky-based clouds this can be one of 'low', 'medium', 'high' or 'best'. Defaults to 'medium'. Ignored by Polaris.                          |
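
To make the string formats in the `accelerators`, `cpus`, and `memory` rows concrete, the sketch below parses them in plain Python. Both helpers are hypothetical and written only from the examples in the table (`"V100:4"`, `"2+"`, `"256+"`). They are not the parsing logic used by Oumi or SkyPilot, and treating a bare accelerator name as a count of 1 is an assumption made for this example.

```python
def parse_accelerators(spec: str) -> tuple[str, int]:
    """Split "NAME:COUNT" into (name, count). A bare name is assumed to mean a count of 1."""
    name, sep, count = spec.partition(":")
    return name, (int(count) if sep else 1)


def parse_minimum(spec: str) -> tuple[float, bool]:
    """Parse "2" or "2+" into (value, is_lower_bound)."""
    at_least = spec.endswith("+")
    return float(spec.rstrip("+")), at_least


print(parse_accelerators("V100:4"))  # ('V100', 4)
print(parse_accelerators("A100"))  # ('A100', 1)
print(parse_minimum("2+"))  # (2.0, True)
print(parse_minimum("256"))  # (256.0, False)
```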