# Skythought Evals: Evaluation for LLM Reasoning

### Installation and Setup

You can install the latest release from PyPI, or install from source:

#### Installing from PyPI

```shell
pip install skythought
```

#### Installing from source

To install from source, we recommend using uv for package management (see the [official guide](https://docs.astral.sh/uv/getting-started/installation) for uv installation instructions).

```shell
# Clone the repository
git clone https://github.com/NovaSky-AI/SkyThought.git
cd SkyThought

# Create and activate a virtual environment (using uv here)
uv venv --python 3.10
source .venv/bin/activate

# Install the package in editable mode
uv pip install -e .
```

If you're evaluating OpenAI models, make sure to set up the appropriate environment variables:

```shell
# Uncomment if needed
# export OPENAI_API_KEY=your_openai_api_key
```
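
A missing key is easier to diagnose before a run starts than partway through it. The sketch below checks for the variable up front. `require_openai_key` is a hypothetical helper written for this guide, not part of skythought.

```shell
# Hypothetical helper (not part of skythought): fail early if the key is missing.
require_openai_key() {
    if [ -z "${OPENAI_API_KEY:-}" ]; then
        echo "OPENAI_API_KEY is not set; export it before using the openai backend." >&2
        return 1
    fi
}

# Example: only proceed when the key is available.
if require_openai_key; then
    echo "OPENAI_API_KEY found"
fi
```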

### Understanding the CLI

```shell
skythought --help
```

You should see the following:

<p align="center"><img src="../assets/cli.png" width="50%"></p>

We provide the following commands:
- `skythought evaluate`: Evaluate a model on a given task. This is the main entrypoint for those interested in evaluation.
- `skythought generate`: Generate model outputs for a pre-configured task. This is useful in data curation, i.e., in cases where you might post-process the generations before scoring. Our evaluation library supports training datasets such as NUMINA, APPS and TACO.
- `skythought score`: Score saved generations for a given task. This is also useful in data curation, where standalone scoring might be preferred.

### `evaluate` 

Given below are some example commands: 

1. Quick Start

```bash
skythought evaluate \
--task aime24 \
--model NovaSky-AI/Sky-T1-32B-Preview \
--backend vllm \
--batch-size 128
```

2. Customized

```bash
# --system-prompt-name selects a pre-configured system prompt
skythought evaluate \
    --task aime24 \
    --model NovaSky-AI/Sky-T1-32B-Flash \
    --backend vllm \
    --backend-args tensor_parallel_size=8,revision=0dccf55,dtype=float32 \
    --sampling-params max_tokens=4096,temperature=0.1 \
    --system-prompt-name prime_rl \
    --result-dir ./ \
    --batch-size 128
```

### Key Concepts

- Task: A task is an evaluation dataset. The `task` argument selects the corresponding configuration file from our pre-configured benchmarks. To see the available tasks, use `skythought evaluate --help`.
- Model: A Model consists of the model ID and templating configuration. This configuration optionally contains the system prompt and an assistant prefill message. The `model` argument selects pre-configured templating parameters (system prompt, assistant prefill, etc.) for the model, if available. You can find the list of pre-configured models [here](https://github.com/NovaSky-AI/SkyThought/blob/main/skythought/evals/models/model_configs.yaml). If a model has no pre-configured entry, we pass no system prompt, so the chat template's default system prompt applies if it specifies one. You can use the `--system-prompt-name` flag to select one of the pre-configured system prompts in the library. To see the available system prompts, use `skythought evaluate --help`. You can also pass the full system prompt via the CLI with the `--system-prompt` option.
- Backend: The Backend is concerned with how the LLM instance is created and queried. We support a variety of backends via the `backend` argument.
    - The `openai` backend can be used to query OpenAI-compatible endpoints. Example: `--backend openai --backend-args base_url=https://api.openai.com`
    - The `vllm` backend instantiates a local model instance with [vLLM](https://docs.vllm.ai) for efficient inference.
    - The `ray` backend leverages [Ray Data](https://docs.ray.io/en/latest/data/data.html) on top of vLLM to scale inference to multiple replicas on a single node or a multi-node Ray cluster. This is the recommended backend for high throughput.

The Backend also takes configuration at instantiation (`--backend-args`) and during generation (`--sampling-params` to control temperature, max_tokens, etc., as well as `--n` for the number of generations per problem).
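
As an illustration of the generation-time options, the sketch below samples several generations per problem. The flag names are the ones described in this guide. The values (`8` samples, `temperature=0.7`, the `./results` directory) are assumptions chosen for the example, not recommended settings. The leading `echo` makes this a dry run that only prints the command.

```bash
# Illustrative values (assumptions); tune them for your own run.
N_SAMPLES=8
SAMPLING_PARAMS="temperature=0.7,max_tokens=4096"

# Dry run: print the command for review before launching a long job.
echo skythought evaluate \
    --task aime24 \
    --model NovaSky-AI/Sky-T1-32B-Preview \
    --backend vllm \
    --sampling-params "$SAMPLING_PARAMS" \
    --n "$N_SAMPLES" \
    --result-dir ./results
```

Remove the leading `echo` to launch the evaluation.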


During evaluation, the flow is straightforward: 
1. Load dataset and create conversations based on the Task and Model specified by the user
2. Generate model responses from the Backend based on the provided sampling parameters
3. Score model responses based on the Task 
4. Output final results

<p align="center"><img src="../assets/flow.png" width="65%"></p>

Once the run finishes, the results are saved in a folder under `result-dir`:

```bash
result-dir/
├── NovaSky-AI_Sky-T1-32B-Flash_aime24_myHash
│   ├── results.json
│   └── summary.json
```
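
To look at the saved output, a small loop over the run folders can pretty-print each `summary.json`. This is a minimal sketch: `show_summaries` is a hypothetical helper, not a skythought command, and it assumes `python3` is on the `PATH` and makes no assumption about the fields inside the file.

```bash
# Hypothetical helper (not part of skythought): pretty-print the summary.json
# of every run folder directly under the given result directory.
show_summaries() {
    result_dir="${1:-.}"
    for summary in "$result_dir"/*/summary.json; do
        # Skip the unexpanded pattern when no run folder exists yet.
        [ -f "$summary" ] || continue
        echo "== $summary"
        python3 -m json.tool "$summary"
    done
}

# Example: inspect runs saved with --result-dir ./
show_summaries .
```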

For more details, such as which configurations perform best and how to run multi-node inference, refer to the [README](../skythought/evals/README.md).