# Multinode Inference on Polaris

Many modern LLMs are too large to serve on local hardware. This tutorial demonstrates how to run inference on Llama 3.3 70B using the Polaris cluster. As a reminder, Polaris has hundreds of nodes, each with 4 x A100-40GB GPUs.

# Prerequisites

## Oumi Installation
First, let's install Oumi. You can find detailed instructions [here](https://github.com/oumi-ai/oumi/blob/main/README.md), but it should be as simple as:

```bash
pip install oumi
```

## Creating our Working Directory
For this tutorial, we'll use the following folder to store our generated dataset:

In [1]:
from pathlib import Path

tutorial_dir = "polaris_inference_tutorial"

Path(tutorial_dir).mkdir(parents=True, exist_ok=True)

# Prepare Your Data

Our inference pipeline currently expects inputs in OpenAI's chat format. Let's download a dataset from Hugging Face and convert it into that format. We'll use a small subset of the `cais/mmlu` dataset as an example.

In [None]:
import datasets

# Optional system context we'll use when creating our dataset
system_context = "You are a helpful AI assistant."

# The abstract_algebra test split has only 100 examples.
dset = datasets.load_dataset("cais/mmlu", "abstract_algebra", split="test")

Now let's convert the data and save it as a JSONL file:

In [None]:
import json
from pprint import pprint

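data_location = Path(tutorial_dir) / "data.jsonl"
with open(data_location, "w", encoding="utf-8") as f:
    for data in dset:
        # Each conversation starts with the system message, followed by the
        # question and its answer choices as a single user message.
        system_message = {"role": "system", "content": system_context}
        user_content = "\n".join([data["question"], "Choices: ", *data["choices"]])
        user_prompt = {"role": "user", "content": user_content}
        entry = {"messages": [system_message, user_prompt]}
        # Write one JSON object per line (JSONL).
        f.write(json.dumps(entry) + "\n")

# `entry` still holds the last conversation written by the loop.
print("Sample entry:")
pprint(entry)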
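
Before submitting a job, it can be worth checking that every line of the file parses and has the expected shape. The following is a minimal sketch, not part of Oumi: `validate_chat_jsonl` and `VALID_ROLES` are hypothetical names, and the checks reflect only the `messages` / `role` / `content` structure used in this tutorial. The demo runs on a throwaway file so that it is self-contained.

```python
import json
import tempfile
from pathlib import Path

# Roles accepted by this sketch (an assumption, not an Oumi constant).
VALID_ROLES = {"system", "user", "assistant"}


def validate_chat_jsonl(path):
    """Return the number of conversations, raising ValueError on a bad line."""
    count = 0
    with open(path, encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            if not line.strip():
                continue  # Ignore blank lines.
            entry = json.loads(line)
            messages = entry.get("messages") if isinstance(entry, dict) else None
            if not isinstance(messages, list) or not messages:
                raise ValueError(f"Line {line_number}: missing 'messages' list")
            for message in messages:
                if not isinstance(message, dict):
                    raise ValueError(f"Line {line_number}: message is not an object")
                if message.get("role") not in VALID_ROLES:
                    raise ValueError(f"Line {line_number}: unexpected role")
                if not isinstance(message.get("content"), str):
                    raise ValueError(f"Line {line_number}: content is not a string")
            count += 1
    return count


# Self-contained demo on a temporary file with one conversation.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / "sample.jsonl"
    sample = {
        "messages": [
            {"role": "system", "content": "You are a helpful AI assistant."},
            {"role": "user", "content": "What is 2 + 2?"},
        ]
    }
    sample_path.write_text(json.dumps(sample) + "\n", encoding="utf-8")
    num_valid = validate_chat_jsonl(sample_path)

print(f"Valid conversations: {num_valid}")
```

In the notebook, you could call `validate_chat_jsonl(data_location)` on the file written above.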

# Setting up our Job

We have a predefined job for running inference on Llama-3.3-70B-Instruct. You can find it at `configs/examples/misc/vllm_polaris_job.yaml`.

This job accepts an input JSONL file of conversations and an output directory for writing our results. Note: your output directory must be on one of Polaris' file systems (we recommend `/eagle/community_ai/$USER`).

First, let's load the config:

In [None]:
import os

import oumi.launcher as launcher

job_name = "YOUR_JOB_NAME"
polaris_user = "YOUR_POLARIS_USERNAME"

# We assume you're running this notebook from the /notebooks directory.
# Change to the repository root (one level above /notebooks) to run the job from there.
os.chdir(Path(tutorial_dir).absolute().parent.parent)
job_path = Path() / "configs" / "examples" / "misc" / "vllm_polaris_job.yaml"

job = launcher.JobConfig.from_yaml(str(job_path))
job.name = job_name
job.user = polaris_user

Now let's specify the inputs for the job, such as the model to run inference on and the input path:

In [None]:
# Your input path should be relative to the working directory (the repository root).
input_filepath = str(Path("notebooks") / data_location)

# Write the output to Polaris in a directory named after the job and user.
output_dir = str(Path("/eagle") / "community_ai" / polaris_user / job_name)

# Set the model and the input and output paths in the job environment.
job.envs["REPO"] = "meta-llama"
job.envs["MODEL"] = "Llama-3.3-70B-Instruct"
job.envs["OUMI_VLLM_INPUT_FILEPATH"] = input_filepath
job.envs["OUMI_VLLM_OUTPUT_DIR"] = output_dir
job.envs["OUMI_VLLM_NUM_WORKERS"] = str(10)  # Samples will be divided amongst workers
job.envs["OUMI_VLLM_WORKERS_SPAWNED_PER_SECOND"] = str(10)

Note: You can also run 70B inference on a single Polaris node by using the 70B.w8a8 (Int8 quantized) version of the model from neuralmagic.
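
The `OUMI_VLLM_NUM_WORKERS` setting divides the input samples among workers. The job's actual splitting strategy is not shown in this tutorial, so the sketch below only illustrates one plausible approach: contiguous shards of near-equal size. `split_into_shards` is a hypothetical helper, not an Oumi function.

```python
def split_into_shards(samples, num_workers):
    """Split samples into num_workers contiguous shards of near-equal size."""
    if num_workers < 1:
        raise ValueError("num_workers must be at least 1")
    # Every shard gets `base` samples; the first `extra` shards get one more.
    base, extra = divmod(len(samples), num_workers)
    shards = []
    start = 0
    for worker in range(num_workers):
        size = base + (1 if worker < extra else 0)
        shards.append(samples[start:start + size])
        start += size
    return shards


# 100 samples across 10 workers, matching the numbers used in this tutorial.
shards = split_into_shards(list(range(100)), 10)
shard_sizes = [len(shard) for shard in shards]
print(shard_sizes)
```

With 100 samples and 10 workers, each worker handles 10 samples under this scheme. Adding workers beyond the number of samples would leave some shards empty.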

# Running Inference

With our job set up, we can kick off inference on Polaris!

**IMPORTANT:** You'll be asked for your Polaris credentials twice. Make sure you refresh your credentials between the two prompts, or copying your files will fail.

In [None]:
# The cluster for Polaris jobs must be of the form `queue_name.user_name`.
# The following will use the `debug` queue.
cluster, job_status = launcher.up(job, f"debug.{polaris_user}")

Now we just need to wait for our job to finish. We can check our job's status by polling it in a loop:

In [None]:
import time

while not job_status.done:
    job_status = cluster.get_job(job_status.id)
    print(f"Job status: {job_status}")
    time.sleep(30)

print(f"Job finished with status: {job_status.status}")
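
The loop above polls until the job reports that it is done, which could take a long time if the job stays queued. The following sketch adds an upper bound on the wait. `wait_for_job` is a hypothetical helper written for illustration. It takes plain callables, so it makes no assumptions about the launcher API, and the demo uses fake status strings rather than a real cluster.

```python
import time


def wait_for_job(
    get_status,
    is_done,
    poll_seconds=30.0,
    timeout_seconds=3600.0,
    sleep=time.sleep,
    clock=time.monotonic,
):
    """Poll get_status() until is_done(status) is true or the timeout expires."""
    deadline = clock() + timeout_seconds
    status = get_status()
    while not is_done(status):
        if clock() >= deadline:
            raise TimeoutError(f"Job not done after {timeout_seconds} seconds")
        sleep(poll_seconds)
        status = get_status()
    return status


# Demo with fake statuses so the example runs without a cluster.
fake_statuses = iter(["QUEUED", "RUNNING", "DONE"])
final_status = wait_for_job(
    get_status=lambda: next(fake_statuses),
    is_done=lambda status: status == "DONE",
    poll_seconds=0.0,
)
print(f"Final status: {final_status}")
```

With the objects from this tutorial, the callables would be `lambda: cluster.get_job(job_status.id)` and `lambda status: status.done`.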

When your job is done, you can find its outputs by SSHing into Polaris and navigating to your output directory: `/eagle/community_ai/YOUR_POLARIS_USERNAME/YOUR_JOB_NAME`. You should see two files, one with the inference output and one with metrics. Their names begin with the job ID printed by the cell above.

Run the cell below to find the exact location of your job's output files:

In [None]:
print("Run the following on Polaris to find the job output files:")
print(f"ls {output_dir}/{job_status.id}_vllm_*.jsonl")

The inference output is a JSONL file in the standard OpenAI chat format.
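
Once you have copied an output file locally, you can extract the model's replies. The sketch below assumes that each output line has the same `messages` structure as the input, with the model's reply appended as an `assistant` message. That layout is an assumption based on the chat format described here, so check it against your actual output. `last_assistant_replies` is a hypothetical helper, and the sample records are made up for illustration.

```python
import json


def last_assistant_replies(lines):
    """Return the final assistant message of each conversation (or None)."""
    replies = []
    for line in lines:
        if not line.strip():
            continue  # Skip blank lines.
        messages = json.loads(line).get("messages", [])
        assistant_contents = [
            message["content"]
            for message in messages
            if message.get("role") == "assistant"
        ]
        replies.append(assistant_contents[-1] if assistant_contents else None)
    return replies


# Hypothetical output records; real ones come from your output directory.
sample_lines = [
    json.dumps({"messages": [
        {"role": "user", "content": "What is 2 + 2?"},
        {"role": "assistant", "content": "4"},
    ]}),
    json.dumps({"messages": [{"role": "user", "content": "Unanswered"}]}),
]
replies = last_assistant_replies(sample_lines)
print(replies)
```

Because iterating over an open text file yields its lines, you can pass a file handle directly in place of `sample_lines`.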

# Advanced Setup

Depending on the size of your input, your job may need more time to run. Polaris requires every job to set a maximum run time when it is queued. To change it, open the job config and edit the line containing the `#PBS -l walltime` directive to the time required for your run.

For example, the following will configure the job to terminate after 10 minutes of run time:
`#PBS -l walltime=00:10:00` 
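
If you compute the required time programmatically, a small helper can render it in the `HH:MM:SS` form used above. This is a sketch only: `format_walltime` is a hypothetical name, and you would still need to paste the result into the job config yourself.

```python
from datetime import timedelta


def format_walltime(duration):
    """Format a positive timedelta as an HH:MM:SS walltime string."""
    total_seconds = int(duration.total_seconds())
    if total_seconds <= 0:
        raise ValueError("walltime must be positive")
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


directive = f"#PBS -l walltime={format_walltime(timedelta(minutes=10))}"
print(directive)
```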
