# Introduction


Many modern LLMs are too large to serve on local hardware. This tutorial demonstrates how to run inference on Llama3.1-70b using the Polaris cluster. As a reminder, Polaris is composed of hundreds of nodes, each with 4 x A100-40GB GPUs.




# Prerequisites

## LeMa Installation
First, let's install LeMa. You can find detailed instructions [here](https://github.com/openlema/lema/blob/main/README.md), but from the root of a local clone of the repository it should be as simple as:

```bash
pip install -e ".[dev,train]"
```

## Creating our working directory
For this tutorial, we'll use the following folder to store our generated dataset:

In [1]:
from pathlib import Path

tutorial_dir = "polaris_inference_tutorial"

Path(tutorial_dir).mkdir(parents=True, exist_ok=True)

# Prepare Your Data

Our inference pipeline currently expects inputs in OpenAI's chat format. Let's download a dataset from Hugging Face and convert it into that format. We'll use a small subset of `cais/mmlu` as an example.

In [2]:
import datasets

# Optional system context we'll use when creating our dataset
system_context = "You are a helpful AI assistant."

# This subset (the "abstract_algebra" test split) has only 100 examples.
dset = datasets.load_dataset("cais/mmlu", "abstract_algebra", split="test")

Now let's convert the data and save it as a JSONL file:

In [None]:
import json

# Write one JSON object per line, each holding a single conversation.
data_location = Path(tutorial_dir) / "data.jsonl"
with open(data_location, "w", encoding="utf-8") as f:
    for data in dset:
        system_message = {"role": "system", "content": system_context}
        # Combine the question and its answer choices into a single user prompt.
        user_content = "\n".join([data["question"], "Choices: ", *data["choices"]])
        user_prompt = {"role": "user", "content": user_content}
        entry = {"messages": [system_message, user_prompt]}
        print(json.dumps(entry), file=f)
    print(f"Sample entry: {entry}")
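
Before submitting a job, it can help to sanity-check the file we just wrote. The following is a minimal sketch and not part of LeMa: `validate_chat_jsonl` and `VALID_ROLES` are hypothetical names, and the checks only cover the fields written above (a `messages` list whose items carry a `role` and a string `content`).

```python
import json

# Assumption: these are the only roles we expect to see in our input file.
VALID_ROLES = {"system", "user", "assistant"}


def validate_chat_jsonl(path):
    """Return the number of valid conversations in a chat-format JSONL file.

    Raises ValueError on the first malformed line.
    """
    count = 0
    with open(path, encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            if not line.strip():
                continue  # Ignore blank lines.
            entry = json.loads(line)
            if not isinstance(entry, dict):
                raise ValueError(f"Line {line_number}: expected a JSON object")
            messages = entry.get("messages")
            if not isinstance(messages, list) or not messages:
                raise ValueError(f"Line {line_number}: missing 'messages' list")
            for message in messages:
                if not isinstance(message, dict):
                    raise ValueError(f"Line {line_number}: message must be an object")
                if message.get("role") not in VALID_ROLES:
                    raise ValueError(f"Line {line_number}: unexpected role")
                if not isinstance(message.get("content"), str):
                    raise ValueError(f"Line {line_number}: content must be a string")
            count += 1
    return count
```

In this notebook, calling `validate_chat_jsonl(data_location)` should return the number of rows in `dset`.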

# Setting up our Job

We have a predefined job for running inference on Llama3.1-70b. You can find it at `configs/lema/jobs/polaris/vllm.yaml`.

This job accepts an input JSONL file for inference and an output directory path for writing our results. Note: your output path must be on one of Polaris' file systems (we recommend `/eagle/community_ai/$USER`).

Let's load the config and modify it to reference our input file:

In [None]:
import os

import lema.launcher as launcher

job_name = "Create_a_display_name_for_your_job"
cloud_name = "polaris"
polaris_user = "YOUR_USER_NAME"

# We assume you're running this notebook in the /notebooks directory.
# Move up one directory to run the job from the root of the repository.
os.chdir(Path(tutorial_dir).absolute().parent.parent)
job_path = Path(".") / "configs" / "lema" / "jobs" / "polaris" / "vllm.yaml"

job = launcher.JobConfig.from_yaml(str(job_path))
job.name = job_name
job.resources.cloud = cloud_name
job.user = polaris_user
job.working_dir = "."  # Use the current directory as the working directory

Now let's add the input file and output directory to the job:

In [7]:
# Your input path should be a relative path from the working directory.
input_path = str(Path("notebooks") / data_location)

# Write the output to Polaris in a directory named after the user and job.
output_path = str(Path("/eagle") / "community_ai" / polaris_user / job_name)

# Set the input and output paths in the job environment.
job.envs["LEMA_VLLM_INPUT_PATH"] = input_path
job.envs["LEMA_VLLM_OUTPUT_PATH"] = output_path

# Running Inference

With our job set up, we can kick off inference on Polaris!

**IMPORTANT** Note that you'll be required to enter your Polaris credentials twice. Make sure you refresh your credentials between the two prompts, or copying your files will fail.

In [None]:
# The cluster for Polaris jobs must be of the form `queue_name.user_name`.
# The following will use the `debug-scaling` queue.
cluster, job_status = launcher.up(job, f"debug-scaling.{polaris_user}")

Now we just need to wait for our job to finish. We can check our job's status by polling it with the following loop:

In [None]:
import time

while not job_status.done:
    job_status = cluster.get_job(job_status.id)
    print(f"Job status: {job_status}")
    time.sleep(10)

print(f"Job finished with status: {job_status.status}")
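
The loop above runs until the job reports that it is done. As a hedged variation, the sketch below adds an upper bound on how long we wait. `wait_until_done` is a hypothetical helper, not a LeMa function. It takes two callables so that it makes no assumptions about the launcher beyond what the cell above already uses.

```python
import time


def wait_until_done(fetch_status, is_done, poll_seconds=10.0, timeout_seconds=3600.0):
    """Poll fetch_status() until is_done(status) is true or the timeout expires.

    Returns the final status, or raises TimeoutError if the deadline passes.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = fetch_status()
        if is_done(status):
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Job not finished after {timeout_seconds} seconds")
        time.sleep(poll_seconds)
```

With the objects from this notebook, a call could look like `wait_until_done(lambda: cluster.get_job(job_status.id), lambda s: s.done)`.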

When your job is done, you can find your inference outputs by SSHing into Polaris and navigating to `/eagle/community_ai/YOUR_USER_NAME/YOUR_JOB_NAME`.

Run the cell below to print the exact location of your job's output:

In [None]:
print(f"You can find the output of your job at {output_path} on Polaris.")

The output of inference will be a JSONL file in the standard OpenAI chat format.
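
Once you have access to the output file, a few lines of Python are enough to pull out the model's answers. This is a minimal sketch under assumptions: it supposes each output line is a JSON object with a `messages` list and that the model's reply appears as a message with the `assistant` role. The exact output layout is not shown in this tutorial, so adjust the helper if your file differs. `last_assistant_reply` and `load_replies` are hypothetical names.

```python
import json


def last_assistant_reply(jsonl_line):
    """Return the content of the last assistant message in one JSONL line, or None."""
    entry = json.loads(jsonl_line)
    for message in reversed(entry.get("messages", [])):
        if message.get("role") == "assistant":
            return message.get("content")
    return None


def load_replies(path):
    """Return one reply (or None) per non-blank line of a chat-format JSONL file."""
    with open(path, encoding="utf-8") as f:
        return [last_assistant_reply(line) for line in f if line.strip()]
```

Point `load_replies` at the JSONL file inside your output directory, either while logged into Polaris or after copying the file locally.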

# Advanced Setup

Depending on the size of your input, your job may require more time to run. Polaris requires jobs to set a maximum run time when they are queued. To adjust this, open the job config (`configs/lema/jobs/polaris/vllm.yaml`) and change the line containing the `#PBS -l walltime` directive to the time required for your run.

For example, the following will configure the job to terminate after 10 minutes of run time:
`#PBS -l walltime=00:10:00` 
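
If you prefer to compute the value instead of writing it by hand, the sketch below converts a duration into the `HH:MM:SS` form shown above. `format_walltime` is a hypothetical helper for illustration only. It is not part of LeMa, and it does not check the limits of the queue you submit to.

```python
from datetime import timedelta


def format_walltime(duration):
    """Format a timedelta as the HH:MM:SS string used by `#PBS -l walltime`."""
    total_seconds = int(duration.total_seconds())
    if total_seconds <= 0:
        raise ValueError("walltime must be positive")
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


print(f"#PBS -l walltime={format_walltime(timedelta(minutes=10))}")
# prints: #PBS -l walltime=00:10:00
```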
