# Adaptive Evaluations with Scorebook (Phi)

This notebook demonstrates Trismik's adaptive evaluation feature using a **local open-source model** that runs on your machine without requiring API keys or cloud costs.

## What are Adaptive Evaluations?

Adaptive evaluations dynamically select questions based on a model's previous responses, similar to adaptive testing in education (like the GRE or GMAT).

### Benefits:
- **More efficient**: Fewer questions needed to assess capability
- **Precise measurement**: Better statistical confidence intervals
- **Optimal difficulty**: Questions adapt to the model's skill level

## Prerequisites

- **Trismik API key**: Get yours at https://app.trismik.com/settings
- **Trismik Project**: Create a project at https://app.trismik.com and copy its Project ID
- **Hardware**: GPU recommended but not required (CPU inference will be slower)
- **Packages**: `pip install scorebook transformers torch accelerate` (`accelerate` is required for the `device_map="auto"` option used when loading the model on a GPU)

## Note on Model Performance

⚠️ **Important**: Local models (especially smaller ones) may not perform as well on complex reasoning tasks. This notebook prioritizes **accessibility and reproducibility** over maximum accuracy and runs Microsoft's Phi-3 on the adaptive CommonSenseQA dataset.

## Setup Credentials

Set your Trismik credentials here:

In [None]:
# STEP 1: Get your Trismik API key from https://app.trismik.com/settings
# STEP 2: Create a project at https://app.trismik.com and copy the Project ID

# Set your credentials here
# Replace these placeholders; the login cell rejects them if left unchanged
TRISMIK_API_KEY = "your-trismik-api-key"
TRISMIK_PROJECT_ID = "your-project-id"
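If you would rather not paste secrets into a notebook cell, one option is to read them from environment variables. The sketch below is only an illustration: `load_credential` is a hypothetical helper, not part of Scorebook, and it assumes you have exported the variables in the shell that launched Jupyter.

```python
import os


def load_credential(name: str, placeholder: str) -> str:
    """Return environment variable `name`, rejecting empty or placeholder values.

    Hypothetical helper for illustration; it is not a Scorebook API.
    """
    value = os.environ.get(name, "").strip()
    if not value or value == placeholder:
        raise ValueError(f"Please set the {name} environment variable.")
    return value


# Example usage (uncomment once the variables are exported in your shell):
# TRISMIK_API_KEY = load_credential("TRISMIK_API_KEY", "your-trismik-api-key")
# TRISMIK_PROJECT_ID = load_credential("TRISMIK_PROJECT_ID", "your-project-id")
```

This keeps keys out of the saved notebook file, which matters if the notebook is committed to version control.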

## Import Dependencies

In [None]:
import asyncio
import string
from pprint import pprint
from typing import Any, List

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

from scorebook import evaluate_async, login

## Login to Trismik

Authenticate with your Trismik account:

In [None]:
if not TRISMIK_API_KEY or TRISMIK_API_KEY == "your-trismik-api-key":
    raise ValueError("Please set TRISMIK_API_KEY. Get your API key from https://app.trismik.com/settings")

login(TRISMIK_API_KEY)
print("✓ Logged in to Trismik")

if not TRISMIK_PROJECT_ID or TRISMIK_PROJECT_ID == "your-project-id":
    raise ValueError("Please set TRISMIK_PROJECT_ID. Create a project at https://app.trismik.com")

print(f"✓ Using project: {TRISMIK_PROJECT_ID}")

## Initialize Local Model

We'll use Phi-3-mini, a compact 3.8B parameter model that runs efficiently on consumer hardware.

**Model Options** (in order of size/performance):
- `microsoft/Phi-3-mini-4k-instruct` (3.8B) - Fast, runs on most hardware
- `microsoft/Phi-3-small-8k-instruct` (7B) - Better performance, needs more memory
- `microsoft/Phi-3-medium-4k-instruct` (14B) - High performance, requires GPU

Change the model name below based on your hardware capabilities.

In [None]:
# Select model (change based on your hardware)
model_name = "microsoft/Phi-3-mini-4k-instruct"

print(f"Loading {model_name}...")
print("(This may take a few minutes on first run as the model downloads)\n")

# Check if CUDA is available
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16 if device == "cuda" else torch.float32,
    device_map="auto" if device == "cuda" else None,
    trust_remote_code=True,
)

if device == "cpu":
    model = model.to(device)

print(f"\n✓ Model loaded successfully on {device}")
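To choose among the model options listed above, a rough estimate of weight memory can help. The sketch below multiplies parameter count by bytes per parameter. It covers weights only, so activations, the KV cache, and framework overhead come on top; `estimate_weight_memory_gb` is a hypothetical helper written for this illustration.

```python
# Bytes used per parameter for common PyTorch dtypes.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2}


def estimate_weight_memory_gb(num_params: float, dtype: str = "float16") -> float:
    """Approximate memory for model weights only, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


# Parameter counts as listed in the model options above.
for name, params in [("Phi-3-mini", 3.8e9), ("Phi-3-small", 7e9), ("Phi-3-medium", 14e9)]:
    fp16 = estimate_weight_memory_gb(params, "float16")
    fp32 = estimate_weight_memory_gb(params, "float32")
    print(f"{name}: ~{fp16:.1f} GB in float16, ~{fp32:.1f} GB in float32")
```

By this estimate Phi-3-mini needs roughly 7.6 GB for weights in float16 and about twice that in float32, which is why CPU runs (float32 in this notebook) need noticeably more RAM.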

## Define Async Inference Function

Create an async function to process inputs through the local model:

In [None]:
async def inference(inputs: List[Any], **hyperparameters: Any) -> List[Any]:
    """Process inputs through the local Phi model.
    
    Args:
        inputs: Input values from an EvalDataset. For the adaptive dataset
               used here, each input is expected to be a dict with
               'question' and 'options' keys.
        hyperparameters: Model hyperparameters (accepted but not used here).
    
    Returns:
        List of model outputs for all inputs.
    """
    outputs = []
    
    for input_val in inputs:
        # Handle dict input from adaptive dataset
        if isinstance(input_val, dict):
            prompt = input_val.get("question", "")
            if "options" in input_val:
                prompt += "\nOptions:\n" + "\n".join(
                    f"{letter}: {choice}"
                    for letter, choice in zip(string.ascii_uppercase, input_val["options"])
                )
        else:
            prompt = str(input_val)
        
        # Build prompt for Phi model
        system_message = "Answer the question with a single letter representing the correct answer from the list of choices. Do not provide any additional explanation or output beyond the single letter."
        
        # Build chat messages; the tokenizer's chat template applies Phi-3's own prompt format
        messages = [
            {"role": "system", "content": system_message},
            {"role": "user", "content": prompt},
        ]
        
        # Apply chat template
        formatted_prompt = tokenizer.apply_chat_template(
            messages, tokenize=False, add_generation_prompt=True
        )
        
        # Tokenize
        inputs_tokenized = tokenizer(
            formatted_prompt, return_tensors="pt", truncation=True, max_length=2048
        )
        inputs_tokenized = {k: v.to(device) for k, v in inputs_tokenized.items()}
        
        # Generate
        try:
            with torch.no_grad():
                output_ids = model.generate(
                    **inputs_tokenized,
                    max_new_tokens=10,  # We only need 1 letter
                    do_sample=False,  # Greedy decoding keeps runs reproducible
                    pad_token_id=tokenizer.eos_token_id,
                )
            
            # Decode only the generated tokens
            generated_tokens = output_ids[0][inputs_tokenized["input_ids"].shape[1]:]
            output = tokenizer.decode(generated_tokens, skip_special_tokens=True).strip()
            
            # Extract just the first letter if model outputs more
            if output and output[0].upper() in string.ascii_uppercase:
                output = output[0].upper()
        except Exception as e:
            output = f"Error: {str(e)}"
        
        outputs.append(output)
    
    return outputs
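The prompt formatting and answer extraction inside `inference` can be checked without loading the model if they are pulled out into pure functions. The sketch below is one way to do that; `format_prompt` and `extract_letter` are hypothetical helper names, and the extraction is slightly stricter than the cell above because it only accepts letters that correspond to an actual option.

```python
import string
from typing import Any, Optional


def format_prompt(item: Any) -> str:
    """Render a question dict (or any other value) as a lettered multiple-choice prompt."""
    if not isinstance(item, dict):
        return str(item)
    prompt = item.get("question", "")
    options = item.get("options") or []
    if options:
        lines = [f"{letter}: {choice}" for letter, choice in zip(string.ascii_uppercase, options)]
        prompt += "\nOptions:\n" + "\n".join(lines)
    return prompt


def extract_letter(text: str, num_options: int) -> Optional[str]:
    """Return the first character as an option letter, or None if it is not a valid option."""
    valid = string.ascii_uppercase[:num_options]
    first = text.strip()[:1].upper()
    return first if first and first in valid else None
```

With five options, a reply such as "The answer is B" yields `None` rather than being scored as "T", which makes malformed outputs easier to spot.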

## Run Adaptive Evaluation

Use `evaluate_async()` with an adaptive dataset (indicated by the `:adaptive` suffix):

In [None]:
print(f"Running adaptive evaluation on Common Sense QA with model: {model_name.split('/')[-1]}")
print("Note: Adaptive evaluation selects questions dynamically based on responses.")

# Run adaptive evaluation
results = await evaluate_async(
    inference,
    datasets="trismik/CommonSenseQA:adaptive",  # Adaptive datasets have the ":adaptive" suffix
    experiment_id="Adaptive-Common-Sense-QA-Local-Notebook",
    project_id=TRISMIK_PROJECT_ID,
    return_dict=True,
    return_aggregates=True,
    return_items=True,
    return_output=True,
)

print("\n✓ Adaptive evaluation complete!")
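The bare `await` above works because Jupyter already runs an event loop. In a plain Python script the same call has to be wrapped in a coroutine and started with `asyncio.run()`. The sketch below shows that pattern; `run_evaluation` is a stand-in for the `evaluate_async(...)` call, not a Scorebook function.

```python
import asyncio
from typing import Any, Dict


async def run_evaluation() -> Dict[str, Any]:
    """Stand-in for the `await evaluate_async(...)` call in the cell above."""
    await asyncio.sleep(0)  # placeholder for the real awaited work
    return {"status": "complete"}


async def main() -> Dict[str, Any]:
    results = await run_evaluation()
    print("Adaptive evaluation complete")
    return results


if __name__ == "__main__":
    # asyncio.run() creates an event loop, runs main() to completion, and closes the loop.
    # Inside Jupyter a loop is already running, so use a bare `await main()` there instead.
    asyncio.run(main())
```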

## View Results

In [None]:
pprint(results)

## View Results on Dashboard

Your results have been uploaded to Trismik's dashboard for visualization and tracking. Open your project at https://app.trismik.com to view them.

## Understanding Adaptive Testing

### How it works:
1. **Initial Questions**: Start with medium-difficulty questions
2. **Adaptation**: If the model answers correctly, harder questions follow; if incorrect, easier questions are selected
3. **Convergence**: The test converges to the model's true ability level
4. **Stopping Criteria**: Stops when sufficient confidence is reached
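The four steps above can be illustrated with a toy staircase procedure. This is a deliberately simplified sketch and not Trismik's actual selection algorithm, which this notebook does not describe: it assumes a 0-1 difficulty scale, a deterministic "model" that solves everything up to a fixed difficulty, and a stopping rule based on step size rather than statistical confidence.

```python
from typing import Callable, List, Tuple


def staircase_test(
    answers_correctly: Callable[[float], bool],
    start: float = 0.5,
    step: float = 0.25,
    min_step: float = 0.01,
) -> Tuple[float, List[float]]:
    """Estimate ability on a 0-1 difficulty scale with a halving-step staircase."""
    difficulty = start
    asked: List[float] = []
    while step >= min_step:
        asked.append(difficulty)
        if answers_correctly(difficulty):
            difficulty += step  # correct: try something harder
        else:
            difficulty -= step  # incorrect: back off to something easier
        step /= 2  # smaller moves as the estimate settles
    return difficulty, asked


# Deterministic toy "model": it solves any question at or below difficulty 0.7.
estimate, asked = staircase_test(lambda d: d <= 0.7)
print(f"Estimated ability: {estimate:.2f} after {len(asked)} questions")
```

In this toy run the estimate settles near 0.7 after five questions, whereas a fixed test would need many more evenly spaced questions to locate the same threshold as precisely.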

### Benefits vs. Traditional Testing:
- **Efficiency**: Typically requires 50-70% fewer questions for the same precision
- **Precision**: Better estimates of model capability
- **Targeted difficulty**: Questions stay close to the model's ability level, so few are spent on items that are far too easy or far too hard

## Next Steps

- Try different local models (Phi-3-small, Phi-3-medium, Llama-3, etc.)
- Compare local model performance with the OpenAI version of this notebook
- Explore other adaptive datasets available on Trismik
- See the **Upload Results** notebook for non-adaptive result tracking