# Adaptive Testing with Scorebook and Local Inference

This notebook demonstrates how to use Scorebook's adaptive evaluation capabilities with local transformer models.

## What is Adaptive Testing?

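Adaptive testing dynamically adjusts question difficulty based on the model's answers so far. Because each question is chosen to be informative at the current ability estimate, an adaptive test can usually reach an accurate capability assessment with fewer questions than a fixed test.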
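
To make the idea concrete, here is a toy simulation of an adaptive test. It is only an illustrative sketch and not Trismik's algorithm: the logistic response model, the staircase update rule, and every name in it (`p_correct`, `run_adaptive_test`, `bank`) are assumptions made for this example.

```python
import math
import random
from typing import List


def p_correct(ability: float, difficulty: float) -> float:
    """Chance of a correct answer under a one-parameter logistic model."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def run_adaptive_test(
    true_ability: float,
    item_bank: List[float],
    n_items: int = 20,
    seed: int = 0,
) -> float:
    """Simulate a staircase-style adaptive test and return the ability estimate."""
    rng = random.Random(seed)
    remaining = list(item_bank)
    estimate, step = 0.0, 1.0

    for _ in range(min(n_items, len(remaining))):
        # Ask the unused question whose difficulty is closest to the estimate.
        difficulty = min(remaining, key=lambda d: abs(d - estimate))
        remaining.remove(difficulty)

        # Simulate the test taker's response to that question.
        correct = rng.random() < p_correct(true_ability, difficulty)

        # Move the estimate up after a correct answer, down after a wrong one,
        # and shrink the step so the estimate settles over time.
        estimate += step if correct else -step
        step = max(step * 0.8, 0.1)

    return estimate


bank = [x / 4 for x in range(-12, 13)]  # difficulties from -3.0 to 3.0
print(run_adaptive_test(true_ability=1.0, item_bank=bank))
```

Production adaptive tests typically replace the staircase update with a statistical ability estimate, but the loop has the same shape: pick a question near the current estimate, observe the answer, update.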

## Prerequisites

### Trismik

To obtain a Trismik API key, go to https://www.trismik.com/ and click Sign Up. You can start with a free account to test our platform!

Once you've signed up, log in to the [dashboard](https://dashboard.trismik.com),
click on your initials in the top-right corner of the screen, click "API Keys" in the drop-down menu, and then click "Create API Key" to create a new API key. Copy and paste it into a text file; you will need it later in this tutorial.

### Running on GPU

If you're running this on Google Colab, you can use a GPU for free by clicking "Runtime" in the menu bar, selecting "Change Runtime Type" in the drop-down menu, and then selecting "T4 GPU".

## 1. Install Dependencies

In [None]:
# Install Scorebook
!pip install scorebook

# Check GPU availability
import torch
print(f"GPU Available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU Device: {torch.cuda.get_device_name(0)}")

## 2. Import Libraries

In [None]:
import string
from typing import Any, Dict, List
from getpass import getpass
import secrets


import transformers
import trismik
from scorebook import evaluate, InferencePipeline, login


# Set transformers verbosity
transformers.utils.logging.set_verbosity_error()

## 3. Authentication Setup

Enter your Trismik credentials to enable adaptive evaluation.

In [None]:
# Get Trismik credentials
TRISMIK_API_KEY = getpass("Enter your Trismik API Key: ")
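
If you run the notebook non-interactively, prompting with `getpass` will block. The sketch below reads the key from an environment variable first and only prompts as a fallback. The helper `load_api_key` and the variable name `TRISMIK_API_KEY` are assumptions for this example, not part of the Scorebook or Trismik API.

```python
import os
from getpass import getpass
from typing import Callable, Mapping


def load_api_key(
    env: Mapping[str, str] = os.environ,
    var_name: str = "TRISMIK_API_KEY",  # hypothetical variable name
    prompt: Callable[[str], str] = getpass,
) -> str:
    """Return the API key from the environment, prompting only if it is unset."""
    key = env.get(var_name, "").strip()
    if not key:
        key = prompt("Enter your Trismik API Key: ").strip()
    if not key:
        raise ValueError("No API key provided")
    return key
```

In the notebook you could then write `TRISMIK_API_KEY = load_api_key()` in place of the direct `getpass` call.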

In [None]:
client = trismik.TrismikAsyncClient(api_key=TRISMIK_API_KEY)
project_name = f"Demo_Project_{secrets.token_hex(4)}"
project = await client.create_project(project_name)
experiment_name = f"Demo_Experiment_{secrets.token_hex(4)}"

In [None]:
print(f"You will find your experiment in the project {project_name} with the name {experiment_name}")

In [None]:
# Login to Trismik
login(TRISMIK_API_KEY)
print("✓ Successfully logged in to Trismik")

## 4. Load Local Model

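We'll use Microsoft's Phi-4-mini-instruct model for this example. You can replace it with any Hugging Face text-generation model that accepts chat-style messages, since the preprocessor below produces a system and a user message.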

In [None]:
# Model configuration
MODEL_NAME = "microsoft/Phi-4-mini-instruct"

print(f"Loading model: {MODEL_NAME}")
print("This may take a few minutes on first run...")

# Initialize the pipeline
pipeline = transformers.pipeline(
    "text-generation",
    model=MODEL_NAME,
    model_kwargs={"torch_dtype": "auto"},
    device_map="auto"
)

print("✓ Model loaded successfully")

# Generation parameters for deterministic outputs
GENERATION_ARGS = {
    "max_new_tokens": 10,
    "do_sample": False,  # Greedy decoding; temperature has no effect when sampling is off
    "return_full_text": False,
    "pad_token_id": pipeline.tokenizer.eos_token_id
}

## 5. Define Inference Pipeline Components

The InferencePipeline consists of three components:
1. **Preprocessor**: Formats questions for the model
2. **Inference**: Runs the model
3. **Postprocessor**: Extracts answers from model output
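
The three stages above can be sketched with stand-in functions before wiring in a real model. This is a minimal illustration of the data flow only: `run_pipeline` and the `toy_*` functions are hypothetical and do not reflect how Scorebook's `InferencePipeline` is implemented internally. The sketch assumes the same shape as the functions defined in this notebook: preprocessing and postprocessing work on one item at a time, while inference receives the whole list.

```python
from typing import Any, Callable, Dict, List


def run_pipeline(
    items: List[Dict[str, Any]],
    preprocess: Callable[[Dict[str, Any]], Any],
    infer: Callable[[List[Any]], List[Any]],
    postprocess: Callable[[Any], str],
) -> List[str]:
    """Chain the three stages: per-item preprocess, batch infer, per-item postprocess."""
    prepared = [preprocess(item) for item in items]
    raw_outputs = infer(prepared)
    return [postprocess(output) for output in raw_outputs]


def toy_preprocess(item: Dict[str, Any]) -> str:
    # Turn the item into a single prompt string.
    return f"{item['question']} Options: {', '.join(item['options'])}"


def toy_infer(prompts: List[str]) -> List[str]:
    # Stand-in for a model: always answers "B", with stray whitespace.
    return [" B " for _ in prompts]


def toy_postprocess(output: str) -> str:
    # Clean up the raw model output.
    return output.strip()


items = [{"question": "What is 2+2?", "options": ["3", "4", "5", "6"]}]
answers = run_pipeline(items, toy_preprocess, toy_infer, toy_postprocess)
print(answers)  # ['B']
```

Keeping the stages separate means a stub like `toy_infer` can be swapped for a real model call without touching the prompt formatting or the answer extraction.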

In [None]:
def preprocessor(eval_item: Dict, **hyperparameters: Any) -> List[Dict]:
    """Format evaluation item for HuggingFace model.

    Args:
        eval_item: Dictionary containing 'question' and optionally 'options'
        hyperparameters: Additional configuration

    Returns:
        Messages formatted for the model
    """
    # Build the prompt
    prompt = eval_item["question"]

    # Add multiple choice options if present
    if "options" in eval_item and eval_item["options"]:
        prompt += "\n\nOptions:\n"
        for i, option in enumerate(eval_item["options"]):
            letter = string.ascii_uppercase[i]
            prompt += f"{letter}: {option}\n"

    # System instruction for a clear answer: Phi-4-mini is a small model and
    # needs an explicit prompt
    system_message = """
Answer the question you are given using only a single letter \
(for example, 'A'). \
Do not use punctuation. \
Do not show your reasoning. \
Do not provide any explanation. \
Follow the instructions exactly and \
always answer using a single uppercase letter.

For example, if the question is "What is the capital of France?" and the \
choices are "A. Paris", "B. London", "C. Rome", "D. Madrid",
- the answer should be "A"
- the answer should NOT be "Paris" or "A. Paris" or "A: Paris"

Please adhere strictly to the instructions.
    """.strip()

    # Format as messages for the model
    messages = [
        {"role": "system", "content": system_message},
        {"role": "user", "content": prompt}
    ]

    return messages


def inference(preprocessed_items: List[Any], **hyperparameters: Any) -> List[Any]:
    """Run inference on preprocessed items.

    Args:
        preprocessed_items: List of formatted messages
        hyperparameters: Additional configuration

    Returns:
        List of model outputs
    """
    results = []

    for item in preprocessed_items:
        try:
            # Run inference with consistent generation parameters
            output = pipeline(item, **GENERATION_ARGS)
            results.append(output)
        except Exception as e:
            print(f"Error during inference: {e}")
            # Return empty output on error
            results.append([{"generated_text": ""}])

    return results


def postprocessor(model_output: Any, **hyperparameters: Any) -> str:
    """Extract a single-letter answer from model output.

    Anything other than exactly one uppercase letter is treated as no answer.

    Args:
        model_output: Raw model output
        hyperparameters: Additional configuration

    Returns:
        Extracted answer as a string
    """
    try:
        # Extract the generated text
        if isinstance(model_output, list) and len(model_output) > 0:
            generated = model_output[0].get("generated_text", "")
        else:
            generated = str(model_output)

        # Clean the response
        answer = generated.strip()

        # For single letter responses (multiple choice)
        if len(answer) == 1 and answer in string.ascii_uppercase:
            return answer

        return ""

    except Exception as e:
        print(f"Error in postprocessing: {e}")
        return ""


# Test the pipeline components
print("Testing pipeline components...")
test_item = {
    "question": "What is 2+2?",
    "options": ["3", "4", "5", "6"]
}

test_preprocessed = preprocessor(test_item)
print(f"✓ Preprocessor test passed")

test_output = inference([test_preprocessed])
print(f"✓ Inference test passed")

test_answer = postprocessor(test_output[0])
print(f"✓ Postprocessor test passed")
print(f"Test answer: '{test_answer}'")
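
The postprocessor above is deliberately strict: an output such as `A.` or `A: Paris` is scored as no answer. If the model you use often adds punctuation, a more lenient extractor may help. The following is a hypothetical alternative, not part of Scorebook; the function name `extract_letter` and the `n_options` parameter are assumptions for this sketch.

```python
import re
import string

# A leading uppercase letter, optionally in parentheses, followed by
# whitespace, common punctuation, or the end of the string.
_LEADING_LETTER = re.compile(r"^\s*\(?([A-Z])\)?(?:[\s.:)\-]|$)")


def extract_letter(text: str, n_options: int = 10) -> str:
    """Return the leading answer letter, or "" if none is found.

    n_options limits valid letters to the first n_options of the alphabet.
    """
    match = _LEADING_LETTER.match(text)
    if not match:
        return ""
    letter = match.group(1)
    valid_letters = string.ascii_uppercase[:n_options]
    return letter if letter in valid_letters else ""


for sample in ["A", "B.", " C: Rome", "(D)", "Paris", "a"]:
    print(repr(sample), "->", repr(extract_letter(sample)))
```

Leniency has a cost: a response that starts with a one-letter word, such as "I think B", would be read as the answer `I` when at least nine options are allowed. Passing the actual number of options through `n_options` reduces that risk but does not remove it.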

## 6. Create InferencePipeline and Run Adaptive Evaluation

In [None]:
# Create the InferencePipeline
inference_pipeline = InferencePipeline(
    model=MODEL_NAME,
    preprocessor=preprocessor,
    inference_function=inference,
    postprocessor=postprocessor
)

print("✓ InferencePipeline created")
print("Dataset: MMLUPro2024:adaptive")  # Must match the dataset passed to evaluate()
print(f"Experiment ID: {experiment_name}")
print(f"Project ID: {project.id}")

In [None]:
print("\nStarting adaptive evaluation...")
# Run the adaptive evaluation
results = evaluate(
    inference_pipeline,
    datasets="MMLUPro2024:adaptive",  # Adaptive dataset
    experiment_id=experiment_name,
    project_id=project.id,
    return_dict=True,
    return_aggregates=True,
    return_items=True,
    return_output=True
)

print("\n✓ Evaluation completed!")

## 7. Analyze Results

You can find the results in the `results` object returned by the `evaluate` function:

In [None]:
results['aggregate_results'][0]['score']

### Key Metrics

The key metrics to focus on are:

- Theta (θ): The primary score measuring model ability on the dataset (higher is better)
- Standard Error: The uncertainty in the theta estimate (lower is better)

You can find more information in the [Trismik adaptive testing documentation](https://docs.trismik.com/adaptiveTesting/adaptive-testing-introduction/).
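
One way to read theta and its standard error together is as an approximate confidence interval. The sketch below assumes the estimation error is roughly normal, which is a common assumption for ability estimates but is not something this notebook verifies. The function names are hypothetical, and the numbers in the usage lines are made-up placeholders rather than real results.

```python
from typing import Tuple


def confidence_interval(
    theta: float, std_error: float, z: float = 1.96
) -> Tuple[float, float]:
    """Approximate interval theta +/- z * SE (z = 1.96 gives roughly 95%)."""
    if std_error < 0:
        raise ValueError("std_error must be non-negative")
    margin = z * std_error
    return theta - margin, theta + margin


def clearly_different(
    theta_a: float, se_a: float, theta_b: float, se_b: float, z: float = 1.96
) -> bool:
    """Conservative heuristic: True only if the two intervals do not overlap."""
    low_a, high_a = confidence_interval(theta_a, se_a, z)
    low_b, high_b = confidence_interval(theta_b, se_b, z)
    return high_a < low_b or high_b < low_a


# Placeholder values for illustration only
low, high = confidence_interval(theta=0.8, std_error=0.25)
print(f"theta is likely between {low:.2f} and {high:.2f}")
print(clearly_different(0.0, 0.1, 1.0, 0.1))
```

When comparing two models, non-overlapping intervals are a cautious signal that one is stronger; overlapping intervals mean the evaluation does not separate them at this level of uncertainty.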

## Summary

You've successfully run an adaptive evaluation using:
- **Scorebook** for evaluation orchestration
- **Local transformer models** for inference
- **Trismik's adaptive datasets** for intelligent question selection

### Key Takeaways:

1. **Adaptive testing** adjusts difficulty based on model performance
2. **Local inference** provides full control over the model and avoids per-request inference API costs
3. **Scorebook's InferencePipeline** makes it easy to swap between local and cloud inference

### Next Steps:

- Try different models by changing `MODEL_NAME`
- Adjust generation parameters for better accuracy
- Explore other adaptive datasets available through Trismik