## Skythought Scoring: Unified APIs for data curation, training and evaluation

This notebook gives a quick overview of the `Scorer` API in Skythought. A `Scorer` is a lightweight class that scores a model response for a given task. Skythought provides a set of predefined scorers for verifiable domains (math, coding, etc.), making it easy to apply consistent scoring across data curation, training and evaluation.
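
To make the interface concrete, here is a minimal sketch of what a scorer looks like. `BaseScorer` is a hypothetical stand-in, written so the snippet runs without Skythought installed. It assumes, as the examples later in this notebook do, that `score` takes a row dict and returns a dict keyed by the scorer's `SCORE_COLUMN`, and that calling the scorer returns the row with the score merged in. The real `Scorer` base class may differ in detail.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict


class BaseScorer(ABC):
    """Hypothetical stand-in for Skythought's `Scorer` (assumed interface)."""

    SCORE_COLUMN = "score"

    @abstractmethod
    def score(self, row: Dict[str, Any]) -> Dict[str, Any]:
        """Return a dict mapping SCORE_COLUMN to the score for this row."""

    def __call__(self, row: Dict[str, Any]) -> Dict[str, Any]:
        # Assumption: calling a scorer returns the row with the score merged in.
        return {**row, **self.score(row)}


class ExactMatchScorer(BaseScorer):
    """Toy scorer: the response must equal the reference answer exactly."""

    SCORE_COLUMN = "exact_match_score"

    def __init__(self, response_column: str, answer_column: str):
        self.response_column = response_column
        self.answer_column = answer_column

    def score(self, row: Dict[str, Any]) -> Dict[str, Any]:
        response = row[self.response_column].strip()
        answer = row[self.answer_column].strip()
        return {self.SCORE_COLUMN: response == answer}


scorer = ExactMatchScorer("response", "answer")
row = {"response": " 42 ", "answer": "42"}
scored_row = scorer(row)  # original columns plus "exact_match_score"
print(scored_row)
```

Keeping the score in a named column means several scorers can be applied to the same row without overwriting each other's results.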

### Installation and Setup

First, make sure you've installed the latest changes from source:

#### Installing from source


```shell
# Clone the repository
git clone https://github.com/NovaSky-AI/SkyThought.git
cd SkyThought

# Create and activate a virtual environment (using uv here)
uv venv --python 3.10
source .venv/bin/activate

# Install the package in editable mode
uv pip install -e .
```

## Example Usage During Data Curation

Here's an example recipe for data curation:

1. Create a dataset combining the "hard" subset of NuminaMath-CoT and the GSM8K dataset.
2. Perform rejection sampling with the base model.  
    a. Obtain a response for each sample and filter out the incorrect responses.  
    b. For scoring, combine two functions: a correctness check for math responses, such as Math-Verify, and a format scorer that checks whether the model adheres to the instructions.


```python
import ray
from ray.data.llm import build_llm_processor, vLLMEngineProcessorConfig
from datasets import load_dataset
from skythought.evals.scoring import Scorer, MathEqualScorer
import re
import os 

SYSTEM_PROMPT = "Think step-by-step and provide the final answer in \\boxed{}"
MAX_TOKENS = 2048 

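class FormatScorer(Scorer):
    """Checks that a response contains a <think> block and a \\boxed{} answer."""

    SCORE_COLUMN = "format_score"

    def __init__(self, response_column):
        self.response_column = response_column

    def score(self, row):
        # raw strings keep the backslash literal (a plain "\\boxed" would be
        # read by `re` as the word-boundary token `\b` followed by "oxed")
        pat1 = r"<think>(.*)</think>"
        pat2 = r"\\boxed\{(.*)\}"
        text = row[self.response_column]
        # DOTALL lets `.` match newlines, so multi-line reasoning is accepted
        match1 = re.search(pat1, text, re.DOTALL)
        match2 = re.search(pat2, text, re.DOTALL)
        # if even one of the patterns is not found, the format check fails
        passed = bool(match1 and match2)
        return {self.SCORE_COLUMN: passed}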


if __name__ == "__main__":

    # limit the number of samples per dataset for testing
    num_samples = 20

    save_dir = "my_results_dir"
    
    numina_hf = load_dataset("AI-MO/NuminaMath-CoT", split="train")
    gsm8k_hf = load_dataset("openai/gsm8k", "main", split="train")
    
    # filter hard problems and rename to match GSM8K's format
    ds1 = ray.data.from_huggingface(numina_hf) \
        .filter(expr="source == 'hard'")\
        .rename_columns({"problem": "question", "solution": "answer"}) \
        .drop_columns(["source"]).limit(num_samples)

    ds2 = ray.data.from_huggingface(gsm8k_hf).limit(num_samples)

    ds = ds1.union(ds2)

    llm = build_llm_processor(
        vLLMEngineProcessorConfig(
            model="meta-llama/Meta-Llama-3.1-8B-Instruct",
            engine_kwargs=dict(
                tensor_parallel_size=2
            ),
            batch_size=64,
            concurrency=2,
        ),
        preprocess=lambda row: dict(
            messages=[
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": row["question"]},
            ],
            sampling_params=dict(
                temperature=0,
                max_tokens=MAX_TOKENS,
            ),
        )
    )
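    # generates responses and saves them in the "generated_text" column
    ds = llm(ds)

    # adds a "math_equal_score" column: does the response match the reference answer?
    ds = ds.map(
        MathEqualScorer,
        fn_constructor_kwargs=dict(
            response_column="generated_text", answer_column="answer"
        ),
        concurrency=5,
    )

    # adds a "format_score" column: does the response follow the expected format?
    ds = ds.map(
        FormatScorer,
        fn_constructor_kwargs=dict(
            response_column="generated_text"
        ),
        concurrency=5,
    )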

    ds = ds.filter(expr="math_equal_score and format_score")
    
    ds.write_parquet(os.path.abspath(save_dir))

```
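
Before launching the full pipeline, it can help to dry-run the scoring and filtering logic on a few hand-written rows. The sketch below does that in plain Python, without Ray, vLLM or Skythought. `format_ok` mirrors the intent of `FormatScorer`; `boxed_answer_matches` is a hypothetical and deliberately naive stand-in for `MathEqualScorer` that only compares strings (it does not handle nested braces or equivalent expressions), and the rows are made-up examples.

```python
import re

# Raw strings keep the backslash literal; DOTALL lets reasoning span several lines.
THINK_PATTERN = re.compile(r"<think>(.*)</think>", re.DOTALL)
BOXED_PATTERN = re.compile(r"\\boxed\{(.*?)\}")


def format_ok(text: str) -> bool:
    """True only if both a <think> block and a boxed answer are present."""
    return bool(THINK_PATTERN.search(text)) and bool(BOXED_PATTERN.search(text))


def boxed_answer_matches(text: str, answer: str) -> bool:
    """Naive stand-in for a math correctness check: exact string comparison."""
    match = BOXED_PATTERN.search(text)
    return match is not None and match.group(1).strip() == answer.strip()


# Made-up responses covering the three interesting cases.
rows = [
    # well formatted and correct
    {"generated_text": "<think>\n2 + 2 = 4\n</think>\nThe answer is \\boxed{4}", "answer": "4"},
    # correct, but the <think> block is missing
    {"generated_text": "The answer is \\boxed{4}", "answer": "4"},
    # well formatted, but the answer is wrong
    {"generated_text": "<think>3 * 3 = 6</think> \\boxed{6}", "answer": "9"},
]

# Rejection sampling step: keep a row only if it passes both checks.
kept = [
    row
    for row in rows
    if format_ok(row["generated_text"])
    and boxed_answer_matches(row["generated_text"], row["answer"])
]
print(f"kept {len(kept)} of {len(rows)} rows")
```

Only the first row survives, which is the behavior the final `filter` step in the pipeline is meant to have.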

## Example Usage During Training

Below is an example of a custom scorer for training on the dataset used in Tülu 3's RLVR stage (a mix of GSM8K, IFEval and MATH):

```python
...
# same module as in the data curation example above
from skythought.evals.scoring import MathVerifyScorer, GSM8KScorer, IFEvalScorer, Scorer

# Custom scoring function for a mix of GSM8K, MATH and IFEval
class MyScorer(Scorer):
    SCORE_COLUMN = "score"

    def __init__(self, source_column, response_column, output_column):
        self.source_column = source_column
        self.response_column = response_column
        self.output_column = output_column
        # one scorer per data source
        self.gsm8k = GSM8KScorer(response_column, output_column)
        self.ifeval = IFEvalScorer(response_column, output_column)
        self.math = MathVerifyScorer(response_column, output_column)

    def score(self, row):
        # pick the scorer that matches the row's data source
        source = row[self.source_column]
        if source == "gsm8k":
            scorer = self.gsm8k
        elif source == "math":
            scorer = self.math
        elif source == "ifeval":
            scorer = self.ifeval
        else:
            raise ValueError(f"Unknown source: {source!r}")
        # assumes each built-in scorer's `score` returns {scorer.SCORE_COLUMN: value},
        # as FormatScorer does above; unwrap it so the result is not a nested dict
        return {self.SCORE_COLUMN: scorer.score(row)[scorer.SCORE_COLUMN]}

def main(args):
    dataset_args, training_args = parse_args(args)
    ...
    train_dataset = prepare_dataset(train_dataset, tokenizer)
    eval_dataset = prepare_dataset(eval_dataset, tokenizer)
    # assumes the trainer provides inputs as a single dict; if not, customize the scorer's interface
    # use `.score` or the `__call__` interface to get the scores
    reward_function = MyScorer("id", "response", "ground_truth")
```
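
The dispatch pattern in `MyScorer` can be tried out in isolation. The following sketch routes each row to a per-source scoring function and returns a float reward. The two scoring functions are hypothetical placeholders rather than Skythought's `GSM8KScorer` or `IFEvalScorer`, and the class name, column names and sample rows are assumptions chosen for the example.

```python
from typing import Any, Callable, Dict

Row = Dict[str, Any]


def score_numeric(row: Row) -> bool:
    """Placeholder math check: compare the last token of the response to the reference."""
    tokens = row["response"].split()
    return bool(tokens) and tokens[-1].rstrip(".") == row["ground_truth"]


def score_instruction(row: Row) -> bool:
    """Placeholder instruction check: the response must contain the required keyword."""
    return row["ground_truth"].lower() in row["response"].lower()


class DispatchScorer:
    """Routes each row to the scoring function registered for its source."""

    SCORE_COLUMN = "score"

    def __init__(self, source_column: str, scorers: Dict[str, Callable[[Row], bool]]):
        self.source_column = source_column
        self.scorers = scorers

    def score(self, row: Row) -> Dict[str, float]:
        source = row[self.source_column]
        if source not in self.scorers:
            raise ValueError(f"No scorer registered for source {source!r}")
        # Convert the boolean verdict into a float reward for the trainer.
        return {self.SCORE_COLUMN: float(self.scorers[source](row))}


reward_function = DispatchScorer(
    "source", {"gsm8k": score_numeric, "ifeval": score_instruction}
)

batch = [
    {"source": "gsm8k", "response": "So the total is 18.", "ground_truth": "18"},
    {"source": "ifeval", "response": "Here is a haiku about rain.", "ground_truth": "haiku"},
    {"source": "gsm8k", "response": "The answer is 7", "ground_truth": "8"},
]
rewards = [reward_function.score(row)["score"] for row in batch]
print(rewards)
```

Registering scorers in a dict keeps the routing table in one place, so supporting a new data source is a one-line change instead of another `elif` branch.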