# Adaptive Evaluations with Scorebook - Evaluating an OpenAI GPT Model

This quick-start guide showcases an adaptive evaluation of OpenAI's GPT-4o Mini model.

If you have not done so already, we recommend first working through our [getting started quick-start guide](https://colab.research.google.com/github/trismik/scorebook/blob/main/tutorials/quickstarts/getting_started.ipynb), which gives a more detailed overview of adaptive testing and of setting up Trismik credentials.

## Prerequisites

- **Trismik API key**: Generate a Trismik API key from the [Trismik dashboard's settings page](https://app.trismik.com/settings).
- **Trismik Project Id**: We recommend you use the project id generated in the [Getting Started Quick-Start Guide](https://colab.research.google.com/github/trismik/scorebook/blob/main/tutorials/quickstarts/getting_started.ipynb).
- **OpenAI API key**: Generate an OpenAI API key from [OpenAI's API Platform](https://openai.com/api/).

## Install Scorebook

In [None]:
!pip install scorebook
# If you're running this locally, run: !pip install "scorebook[examples,providers]"


## Setup Credentials

Enter your Trismik API key, project id and OpenAI API Key below.

In [None]:
# Set your credentials here
TRISMIK_API_KEY = "your-trismik-api-key-here"
TRISMIK_PROJECT_ID = "your-trismik-project-id-here"
OPENAI_API_KEY = "your-openai-api-key-here"
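
If you would rather not paste secrets into the notebook, one option is to read them from environment variables in place of the cell above. The sketch below is only an illustration: the helper `credential_from_env` is hypothetical (it is not part of Scorebook), and the environment variable names are assumptions you can change to suit your setup.

```python
import os


def credential_from_env(name: str, fallback: str = "") -> str:
    """Return environment variable `name`, or `fallback` if it is unset or blank."""
    value = os.environ.get(name, "").strip()
    return value or fallback


# Assumed variable names; adjust them to match your own environment.
TRISMIK_API_KEY = credential_from_env("TRISMIK_API_KEY", "your-trismik-api-key-here")
TRISMIK_PROJECT_ID = credential_from_env("TRISMIK_PROJECT_ID", "your-trismik-project-id-here")
OPENAI_API_KEY = credential_from_env("OPENAI_API_KEY", "your-openai-api-key-here")
```

When a variable is not set, the placeholder string is kept, so the later cells fail in the same way they would with an unedited credentials cell.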

## Login with Trismik API Key

In [None]:
from scorebook import login

# Login to Trismik
login(TRISMIK_API_KEY)
print("✓ Logged in to Trismik")

## Define Inference Functions

To evaluate a model with Scorebook, wrap it in an inference function. An inference function accepts a list of model inputs, passes them to the model for inference, and collects and returns the generated outputs.

An inference function can wrap any model, local or cloud-hosted. Its implementation is flexible; the only requirements concern the function signature. An inference function must:

Accept:

- A list of model inputs.
- Hyperparameters, which can optionally be accessed via keyword arguments.

Return:

- A list of parsed model outputs for scoring.
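
As a minimal sketch of this contract, the toy function below involves no model at all: it answers every item with the id of its first choice. The name `first_choice_inference` and the `default_answer` hyperparameter are hypothetical, and the item shape (a dict with `question` and `choices`) is assumed from the multiple-choice items used later in this guide.

```python
from typing import Any, List


def first_choice_inference(inputs: List[Any], **hyperparameters: Any) -> List[Any]:
    """Toy inference function: return the id of the first choice for each item."""
    # Hyperparameters arrive as keyword arguments; read them with a default.
    default_answer = hyperparameters.get("default_answer", "A")
    outputs = []
    for item in inputs:
        choices = item.get("choices", [])
        outputs.append(choices[0]["id"] if choices else default_answer)
    # One output per input, in the same order.
    return outputs


# Made-up item for illustration only.
sample_inputs = [
    {
        "question": "Which is a fruit?",
        "choices": [{"id": "A", "text": "Apple"}, {"id": "B", "text": "Brick"}],
    }
]
print(first_choice_inference(sample_inputs))
```

A stub like this can be handy for checking the plumbing of an evaluation before spending API credits on a real model.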

We define two separate inference functions: one for multiple-choice items and one for open-ended items.

In [None]:
from openai import OpenAI
from typing import Any, List

client = OpenAI(api_key=OPENAI_API_KEY)
model_name = "gpt-4o-mini"


def mc_inference(inputs: List[Any], **hyperparameters: Any) -> List[Any]:
    """Process multiple-choice inputs through OpenAI's API."""
    outputs = []
    for input_val in inputs:
        choices = input_val.get("choices", [])
        prompt = (
            str(input_val.get("question", ""))
            + "\nOptions:\n"
            + "\n".join(f"{choice['id']}: {choice['text']}" for choice in choices)
        )

        messages = [
            {"role": "system", "content": "Answer with only the letter of the correct option."},
            {"role": "user", "content": prompt},
        ]

        try:
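            response = client.chat.completions.create(
                model=model_name,
                messages=messages,
                # Use a caller-supplied temperature if given, else the default of 0.7.
                temperature=hyperparameters.get("temperature", 0.7),
            )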
            output = response.choices[0].message.content.strip()
        except Exception as e:
            output = f"Error: {str(e)}"

        outputs.append(output)
    return outputs


def open_ended_inference(inputs: List[Any], **hyperparameters: Any) -> List[Any]:
    """Process open-ended inputs through OpenAI's API."""
    outputs = []
    for input_val in inputs:
        prompt = str(input_val.get("question", ""))

        messages = [
            {
                "role": "system",
                "content": "Answer the question. Place your final answer between <answer> and </answer> tags.",
            },
            {"role": "user", "content": prompt},
        ]

        try:
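            response = client.chat.completions.create(
                model=model_name,
                messages=messages,
                # Use a caller-supplied temperature if given, else the default of 0.7.
                temperature=hyperparameters.get("temperature", 0.7),
            )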
            output = response.choices[0].message.content.strip()

            # Extract from <answer> tags if present
            start = output.rfind("<answer>")
            end = output.rfind("</answer>")
            if start != -1 and end > start:
                output = output[start + len("<answer>"):end].strip()

        except Exception as e:
            output = f"Error: {str(e)}"

        outputs.append(output)
    return outputs
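
To check the prompt format and the answer parsing without calling the API, the same logic can be pulled out into small pure functions. This is a sketch that mirrors the code above; the helper names `format_mc_prompt` and `extract_answer` are hypothetical, and the sample item is made up.

```python
from typing import Any, Dict


def format_mc_prompt(item: Dict[str, Any]) -> str:
    """Build the multiple-choice prompt: the question, then one line per option."""
    options = "\n".join(f"{c['id']}: {c['text']}" for c in item.get("choices", []))
    return str(item.get("question", "")) + "\nOptions:\n" + options


def extract_answer(text: str) -> str:
    """Return the text inside the last <answer>...</answer> pair, else the stripped input."""
    start = text.rfind("<answer>")
    end = text.rfind("</answer>")
    if start != -1 and end > start:
        return text[start + len("<answer>"):end].strip()
    return text.strip()


# Made-up item for illustration only.
item = {
    "question": "Which is a fruit?",
    "choices": [{"id": "A", "text": "Apple"}, {"id": "B", "text": "Brick"}],
}
print(format_mc_prompt(item))
print(extract_answer("Some reasoning. <answer> 42 </answer>"))
```

Because `rfind` is used for both tags, a response containing several tagged answers yields the last one, and a response with no tags is passed through unchanged apart from whitespace stripping.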

## Run an Adaptive Evaluation

When running an adaptive evaluation, we can pass one or more adaptive datasets and specify which split to evaluate.

### Multiple-Choice Adaptive Evaluation

In [None]:
from scorebook import evaluate

# Run multiple-choice adaptive evaluation
results = evaluate(
    inference=mc_inference,
    datasets="trismik/figQA:adaptive",
    split="validation",
    experiment_id="Adaptive Evaluation Tutorial",
    project_id=TRISMIK_PROJECT_ID,
)

print("Adaptive evaluation complete!")
print("Results: ", results[0]["score"])

### Open-Ended Adaptive Evaluation

Scorebook also supports open-ended adaptive evaluations where the model provides free-text answers instead of selecting from multiple choices. We use a separate inference function tailored for open-ended items.

In [None]:
# Run open-ended adaptive evaluation
results_open_ended = evaluate(
    inference=open_ended_inference,
    datasets="trismik/fingpt_convfinqa_test:adaptive",
    split="validation",
    experiment_id="Adaptive Evaluation Tutorial",
    project_id=TRISMIK_PROJECT_ID,
)

print("Open-ended adaptive evaluation complete!")
print("Results: ", results_open_ended[0]["score"])

---

## Next Steps

- [Adaptive Testing White Paper](https://docs.trismik.com/adaptiveTesting/adaptive-testing-introduction/): An in-depth overview of the science behind the adaptive testing methodology.
- [Dataset Page](https://dashboard.trismik.com/datasets): The full set of adaptive datasets currently available on the Trismik dashboard.
- [Scorebook Docs](https://docs.trismik.com/scorebook/introduction-to-scorebook/): Scorebook's full documentation.
- [Scorebook Repository](https://github.com/trismik/scorebook): Scorebook is an open-source library; view the code and more examples here.