# Adaptive Evaluations with Scorebook (GPT)

This notebook demonstrates Trismik's adaptive evaluation feature using **OpenAI's GPT models** for high-accuracy results.

> **Looking for a version without API costs?** See `3-adaptive_evaluation_local.ipynb` for a version using local open-source models (Phi-3) that runs on your machine without API keys.

## What are Adaptive Evaluations?

Adaptive evaluations dynamically select questions based on a model's previous responses, similar to adaptive testing in education (like the GRE or GMAT).

### Benefits:
- **More efficient**: Fewer questions needed to assess capability
- **Precise measurement**: Better statistical confidence intervals
- **Optimal difficulty**: Questions adapt to the model's skill level

## Prerequisites

- **Trismik API key**: Get yours at https://app.trismik.com/settings
- **Trismik Project**: Create a project at https://app.trismik.com and copy its Project ID
- **OpenAI API key**: Get yours at https://platform.openai.com/api-keys (used to call the GPT model being evaluated)

## Setup Credentials

Set your API credentials here:

In [None]:
# STEP 1: Get your Trismik API key from https://app.trismik.com/settings
# STEP 2: Create a project at https://app.trismik.com and copy the Project ID
# STEP 3: Get your OpenAI API key from https://platform.openai.com/api-keys

# Set your credentials here
TRISMIK_API_KEY = "your-trismik-api-key"  # pragma: allowlist secret
TRISMIK_PROJECT_ID = "your-project-id"
OPENAI_API_KEY = "your-openai-api-key"  # pragma: allowlist secret
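
If you would rather not paste secrets into the notebook, one option is to read them from environment variables instead. The sketch below is an assumption on our part, not something Scorebook requires: the helper name `read_credential` and the environment variable names are hypothetical. It falls back to the same placeholder strings used above, so the validation checks later in the notebook still catch unset values.

```python
import os


def read_credential(env_name: str, placeholder: str) -> str:
    """Return the value of env_name, or the placeholder if it is unset or blank."""
    value = os.environ.get(env_name, "").strip()
    return value if value else placeholder


# Environment variable names are illustrative; use whatever fits your setup.
TRISMIK_API_KEY = read_credential("TRISMIK_API_KEY", "your-trismik-api-key")
TRISMIK_PROJECT_ID = read_credential("TRISMIK_PROJECT_ID", "your-project-id")
OPENAI_API_KEY = read_credential("OPENAI_API_KEY", "your-openai-api-key")
```

Use this in place of the hard-coded assignments above, not in addition to them.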

## Import Dependencies

In [None]:
import asyncio
import string
from pprint import pprint
from typing import Any, List

from openai import AsyncOpenAI
from scorebook import evaluate_async, login

## Login to Trismik

Authenticate with your Trismik account:

In [None]:
if not TRISMIK_API_KEY or TRISMIK_API_KEY == "your-trismik-api-key":
    raise ValueError("Please set TRISMIK_API_KEY. Get your API key from https://app.trismik.com/settings")

login(TRISMIK_API_KEY)
print("✓ Logged in to Trismik")

if not TRISMIK_PROJECT_ID or TRISMIK_PROJECT_ID == "your-project-id":
    raise ValueError("Please set TRISMIK_PROJECT_ID. Create a project at https://app.trismik.com")

print(f"✓ Using project: {TRISMIK_PROJECT_ID}")

## Initialize OpenAI Client

In [None]:
if not OPENAI_API_KEY or OPENAI_API_KEY == "your-openai-api-key":
    raise ValueError("Please set OPENAI_API_KEY. Get your API key from https://platform.openai.com/api-keys")

client = AsyncOpenAI(api_key=OPENAI_API_KEY)  # pragma: allowlist secret
model_name = "gpt-4o-mini"

print(f"✓ Using model: {model_name}")

## Define Async Inference Function

Create an async function to process inputs through the OpenAI API:

In [None]:
async def inference(inputs: List[Any], **hyperparameters: Any) -> List[Any]:
    """Process inputs through OpenAI's API.
    
    Args:
        inputs: Input values from an EvalDataset. For adaptive MMLU-Pro,
               each input is a dict with 'question' and 'options' keys.
        hyperparameters: Model hyperparameters.
    
    Returns:
        List of model outputs for all inputs.
    """
    outputs = []
    
    for input_val in inputs:
        # Handle dict input from adaptive dataset
        if isinstance(input_val, dict):
            prompt = input_val.get("question", "")
            if "options" in input_val:
                prompt += "\nOptions:\n" + "\n".join(
                    f"{letter}: {choice}"
                    for letter, choice in zip(string.ascii_uppercase, input_val["options"])
                )
        else:
            prompt = str(input_val)
        
        # Build messages for OpenAI API
        messages = [
            {
                "role": "system",
                "content": "Answer the question with a single letter representing the correct answer from the list of choices. Do not provide any additional explanation or output beyond the single letter.",
            },
            {"role": "user", "content": prompt},
        ]
        
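        # Call OpenAI API
        try:
            response = await client.chat.completions.create(
                model=model_name,
                messages=messages,
                # Honor a caller-supplied temperature; otherwise keep the 0.7 default
                temperature=hyperparameters.get("temperature", 0.7),
            )
            # message.content can be None, so guard before stripping
            output = (response.choices[0].message.content or "").strip()
        except Exception as e:
            output = f"Error: {e}"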
        
        outputs.append(output)
    
    return outputs
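
To see what the model actually receives, it can help to run the prompt-building step on its own. The sketch below pulls that logic out of `inference()` into a standalone helper; the name `build_prompt` and the sample item are made up for illustration and are not part of Scorebook or MMLU-Pro.

```python
import string
from typing import Any


def build_prompt(input_val: Any) -> str:
    """Mirror the prompt construction used inside inference()."""
    if not isinstance(input_val, dict):
        return str(input_val)
    prompt = input_val.get("question", "")
    if "options" in input_val:
        # Label each option A, B, C, ... in order
        prompt += "\nOptions:\n" + "\n".join(
            f"{letter}: {choice}"
            for letter, choice in zip(string.ascii_uppercase, input_val["options"])
        )
    return prompt


# Made-up item for illustration; not taken from MMLU-Pro.
sample = {"question": "What is 2 + 2?", "options": ["3", "4", "5"]}
print(build_prompt(sample))
# What is 2 + 2?
# Options:
# A: 3
# B: 4
# C: 5
```

Because `zip` stops at the shorter sequence, this labeling scheme covers at most 26 options.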

## Run Adaptive Evaluation

Use `evaluate_async()` with an adaptive dataset (indicated by the `:adaptive` suffix):

In [None]:
print(f"Running adaptive evaluation on MMLU-Pro with model: {model_name}")
print("Note: Adaptive evaluation selects questions dynamically based on responses.\n")

# Run adaptive evaluation
results = await evaluate_async(
    inference,
    datasets="MMLUPro2025:adaptive",  # Adaptive datasets have the ":adaptive" suffix
    experiment_id="Adaptive-MMLU-Pro-Notebook",
    project_id=TRISMIK_PROJECT_ID,
    return_dict=True,
    return_aggregates=True,
    return_items=True,
    return_output=True,
)

print("\n✓ Adaptive evaluation complete!")

## View Results

In [None]:
pprint(results)

## Analyze Adaptive Testing Behavior

Summarize the run (questions asked and overall accuracy) and inspect a sample of the questions the adaptive test selected:

In [None]:
# Extract accuracy and question count
accuracy = results['aggregates']['accuracy']
num_questions = len(results['items'])

print("\nAdaptive Evaluation Summary:")
print(f"  Questions Asked: {num_questions}")
print(f"  Overall Accuracy: {accuracy:.2%}")

# View sample questions and responses
print("\nSample Questions and Responses:")
for i, item in enumerate(results['items'][:5], 1):
    print(f"\nQuestion {i}:")
    if isinstance(item.get('input'), dict):
        print(f"  Q: {item['input'].get('question', 'N/A')[:100]}...")
    print(f"  Model Answer: {item['output']}")
    print(f"  Correct Answer: {item['label']}")
    print(f"  Result: {'✓ Correct' if item['accuracy'] == 1.0 else '✗ Incorrect'}")

## View Results on Dashboard

Your results have been uploaded to Trismik's dashboard for visualization and tracking:

In [None]:
from IPython.display import display, Markdown

dashboard_url = f"https://app.trismik.com/projects/{TRISMIK_PROJECT_ID}"
display(Markdown(f"### 📊 [View Results on Dashboard]({dashboard_url})"))
print(f"\nDirect link: {dashboard_url}")

## Understanding Adaptive Testing

### How it works:
1. **Initial Questions**: Start with medium-difficulty questions
2. **Adaptation**: If the model answers correctly, harder questions follow; if incorrect, easier questions are selected
3. **Convergence**: The test converges to the model's true ability level
4. **Stopping Criteria**: Stops when sufficient confidence is reached
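
The four steps above can be illustrated with a toy simulation. This is a deliberately simplified staircase procedure, not Trismik's actual algorithm: it assumes difficulty and ability share a 0-1 scale, that the simulated model answers correctly exactly when the question is no harder than its ability, and that the test stops once the difficulty adjustment drops below a tolerance. The function name `simulate_adaptive_test` and all of its parameters are hypothetical.

```python
from typing import Tuple


def simulate_adaptive_test(true_ability: float, tolerance: float = 0.01) -> Tuple[float, int]:
    """Toy staircase test: returns (estimated ability, questions asked)."""
    difficulty = 0.5  # 1. start with a medium-difficulty question
    step = 0.25
    asked = 0
    while step >= tolerance:  # 4. stop once adjustments become small
        # Simplifying assumption: the model is correct iff the question
        # is no harder than its true ability (real responses are noisy).
        correct = difficulty <= true_ability
        # 2. harder after a correct answer, easier after a miss
        difficulty += step if correct else -step
        step /= 2  # 3. shrinking steps make the estimate converge
        asked += 1
    return difficulty, asked


print(simulate_adaptive_test(0.8))  # (0.796875, 5)
```

In this idealized setting five questions are enough to place the estimate within about 0.016 of the true ability. Real adaptive tests have to deal with noisy responses and so need more questions and a statistical stopping rule.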

### Benefits vs. Traditional Testing:
- **Efficiency**: Typically requires 50-70% fewer questions for the same precision
- **Precision**: Better estimates of model capability
- **Targeted difficulty**: Questions stay close to the model's ability level, so few are trivially easy or far too hard

## Next Steps

- Run the adaptive evaluation with different models and compare their results
- **Don't have an OpenAI API key?** See `3-adaptive_evaluation_local.ipynb` to run adaptive evaluations with local open-source models (Phi-3, Llama, etc.)
- Explore other adaptive datasets available on Trismik
- See the **Upload Results** notebook for non-adaptive result tracking