# Trismik Adaptive Testing Showcase

This notebook demonstrates how to use the Trismik SDK for adaptive testing of language models. Trismik is a platform offering adversarial testing for LLMs that allows you to evaluate models up to 95% faster than traditional evaluation techniques. Our adaptive testing algorithm estimates a model's ability by examining only a small portion of a dataset, which makes evaluation efficient across multiple dimensions such as reasoning, toxicity, and tool use.
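
To give a feel for why an adaptive test can get away with a small subset of items, here is a generic sketch of computerized adaptive testing under a one-parameter (Rasch) item response model. It is a textbook illustration only and is not the Trismik algorithm, whose internals this notebook does not cover. The item pool, the simulated model, and every function name below are made up for the example.

```python
import math
import random


def p_correct(theta: float, difficulty: float) -> float:
    """Rasch (1PL) probability that ability `theta` answers an item correctly."""
    return 1.0 / (1.0 + math.exp(-(theta - difficulty)))


def estimate_theta(responses, grid):
    """Grid-search MAP estimate of theta under a standard normal prior."""
    def log_posterior(theta):
        total = -0.5 * theta * theta  # log N(0, 1) prior, up to a constant
        for difficulty, correct in responses:
            p = p_correct(theta, difficulty)
            total += math.log(p if correct else 1.0 - p)
        return total
    return max(grid, key=log_posterior)


def standard_error(theta, responses):
    """Approximate standard error from prior plus Fisher information."""
    information = 1.0  # contribution of the N(0, 1) prior
    for difficulty, _ in responses:
        p = p_correct(theta, difficulty)
        information += p * (1.0 - p)
    return 1.0 / math.sqrt(information)


def run_adaptive_test(answer_fn, difficulties, max_items=30, target_se=0.4):
    """Ask the most informative remaining item until the estimate is tight enough."""
    remaining = list(difficulties)
    responses = []
    grid = [i / 20 for i in range(-80, 81)]  # candidate thetas in [-4, 4]
    theta, se = 0.0, 1.0
    while remaining and len(responses) < max_items and se > target_se:
        # Under the Rasch model, the item closest to theta is the most informative
        difficulty = min(remaining, key=lambda d: abs(d - theta))
        remaining.remove(difficulty)
        responses.append((difficulty, answer_fn(difficulty)))
        theta = estimate_theta(responses, grid)
        se = standard_error(theta, responses)
    return theta, se, len(responses)


# Hypothetical pool of 500 items and a simulated model with true ability 1.0
rng = random.Random(0)
pool = [rng.uniform(-3.0, 3.0) for _ in range(500)]
true_theta = 1.0


def simulated_model(difficulty):
    return rng.random() < p_correct(true_theta, difficulty)


theta, se, n_used = run_adaptive_test(simulated_model, pool)
print(f"theta={theta:.2f}, se={se:.2f}, items used={n_used} of {len(pool)}")
```

The loop stops as soon as the standard error drops below the target or the item budget runs out, so only a small fraction of the pool is ever presented to the model.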

## Setup

First, let's set up our API keys. You'll need both a Trismik API key and an OpenAI API key for the full examples.

In [None]:
import os
from dotenv import load_dotenv

# Load environment variables from .env file
load_dotenv()

# Uncomment and set these if you want to override the .env file
# os.environ['TRISMIK_API_KEY'] = 'your-trismik-api-key-here'
# os.environ['OPENAI_API_KEY'] = 'your-openai-api-key-here'

# Configure your project and experiment identifiers
# These are required for all Trismik runs and help organize your results
PROJECT_ID = "YOUR_PROJECT_ID" # Replace with your project ID
EXPERIMENT = "default"         # Replace with your experiment name
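
Before going further, it can help to confirm that the keys were actually picked up. The check below is a small sketch; `missing_env_vars` is a hypothetical helper written for this notebook, not part of the Trismik SDK.

```python
import os


def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in the given environment."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


missing = missing_env_vars(["TRISMIK_API_KEY", "OPENAI_API_KEY"])
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
else:
    print("All API keys are set.")
```

The OpenAI key is only needed for Section 3, so a missing `OPENAI_API_KEY` does not block the first two sections.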

Now let's import the necessary modules and check our available tests:

In [None]:
import asyncio
from typing import Any

from trismik.adaptive_test import AdaptiveTest
from trismik.types import (
    AdaptiveTestScore,
    TrismikItem,
    TrismikMultipleChoiceTextItem,
    TrismikRunMetadata,
)

# Initialize a test runner to list available datasets
runner = AdaptiveTest(lambda x: x)  # Dummy processor for listing datasets

# List available datasets
available_datasets = runner.list_datasets()
print("Available datasets:")
for available in available_datasets:
    print(f"- {available.id}")

# Use the first available dataset for the examples below
dataset = available_datasets[0].id

## Section 1: Mock Inference Example

Let's start with a simple mock inference function that demonstrates how the adaptive testing framework works. This example shows the basic structure of how to process test items and run adaptive tests.

In [None]:
def mock_inference(item: TrismikItem) -> Any:
    """
    Process a test item and return a response.
    
    This is where you would call your model to run inference and return its
    response. For demonstration purposes, we simply pick the first choice.
    
    Args:
        item (TrismikItem): Test item to process.
        
    Returns:
        Any: Response to the test item (depends on item type).
    """
    if isinstance(item, TrismikMultipleChoiceTextItem):
        # For multiple choice items, we need to return a choice id
        # Here we just pick the first choice as a mock response
        # In a real scenario, you would:
        # 1. Use item.question as input to your model
        # 2. Present item.choices as options
        # 3. Post-process the model output to ensure it's a valid choice id
        return item.choices[0].id
    else:
        raise RuntimeError("Encountered unknown item type")

def print_score(score: AdaptiveTestScore) -> None:
    """Print adaptive test score with theta and standard error."""
    print("\nAdaptive Test Score:")
    print(f"Final theta: {score.theta:.4f}")
    print(f"Final standard error: {score.std_error:.4f}")
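
Step 3 in the comments above (post-processing the model output into a valid choice id) is where real models tend to need the most care. The helper below is one possible sketch; `extract_choice_id` is a hypothetical name, and it assumes choice ids are short alphanumeric tokens such as "A" or "B".

```python
import re


def extract_choice_id(raw_answer, valid_ids):
    """Return the valid choice id found in a raw model answer, or None."""
    text = raw_answer.strip()
    if text in valid_ids:
        return text
    # Accept forms such as "A.", "A:", "(A)" or "A. Paris": a leading token,
    # optionally wrapped in parentheses and followed by punctuation or space
    match = re.match(r"^\(?([A-Za-z0-9]+)\)?(?:[.:)\s]|$)", text)
    if match:
        candidate = match.group(1).lower()
        for choice_id in valid_ids:
            if choice_id.lower() == candidate:
                return choice_id
    return None


valid = ["A", "B", "C", "D"]
for raw in ["A", " B. London", "(c)", "Paris"]:
    print(repr(raw), "->", extract_choice_id(raw, valid))
```

Returning `None` rather than guessing lets the caller decide whether to retry the model or raise an error.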

In [None]:
# Set up run metadata for our mock test
mock_metadata = TrismikRunMetadata(
    model_metadata=TrismikRunMetadata.ModelMetadata(
        name="mock-model",
        parameters="N/A",
        provider="Demo",
    ),
    test_configuration={
        "task_name": dataset,
        "response_format": "Multiple-choice",
        "description": "Mock inference demonstration",
    },
    inference_setup={
        "strategy": "first_choice_selection",
    },
)

# Execute the mock run
print("Running mock inference run...")
mock_runner = AdaptiveTest(mock_inference)
mock_results = mock_runner.run(
    dataset,
    PROJECT_ID,
    EXPERIMENT,
    run_metadata=mock_metadata,
    return_dict=False,
)

print(f"Run {mock_results.run_id} completed.")
if mock_results.score is not None:
    print_score(mock_results.score)
else:
    print("No score available.")

## Section 2: Local Model Inference with Transformers

This section demonstrates how to run adaptive testing with a local Hugging Face transformers model. We'll use a lightweight model that can run efficiently on most hardware.

In [None]:
import re
import transformers

# Set up the model pipeline
# Using Phi-4-mini-instruct as it's relatively lightweight
pipeline = transformers.pipeline(
    "text-generation",
    model="microsoft/Phi-4-mini-instruct",
    model_kwargs={"torch_dtype": "auto"},
    device_map="auto",
)

# Greedy decoding: with do_sample=False the temperature value is not used
generation_args = {
    "max_new_tokens": 1024,
    "return_full_text": False,
    "temperature": 0.0,
    "do_sample": False,
}

In [None]:
def transformers_inference(
    pipeline: transformers.Pipeline, item: TrismikItem, max_retries: int = 5
) -> str:
    """
    Run inference on an item using a Hugging Face model.

    Args:
        pipeline (transformers.Pipeline): Hugging Face pipeline.
        item (TrismikItem): Item to run inference on.
        max_retries (int): Maximum number of retries.

    Returns:
        str: Id of the choice selected by the model.

    Raises:
        RuntimeError: If no valid choice id is produced within max_retries.
    """
    assert isinstance(item, TrismikMultipleChoiceTextItem)
    
    # Construct the prompt from the question and choices
    prompt = f"{item.question}\nOptions:\n" + "\n".join(
        [f"- {choice.id}: {choice.text}" for choice in item.choices]
    )
    
    # System message with strict instructions
    messages = [
        {
            "role": "system",
            "content": """
Answer the question you are given using only a single letter \
(for example, 'A'). \
Do not use punctuation. \
Do not show your reasoning. \
Do not provide any explanation. \
Follow the instructions exactly and \
always answer using a single uppercase letter.

For example, if the question is "What is the capital of France?" and the \
choices are "A. Paris", "B. London", "C. Rome", "D. Madrid",
- the answer should be "A"
- the answer should NOT be "Paris" or "A. Paris" or "A: Paris"

Please adhere strictly to the instructions.
""".strip(),
        },
        {"role": "user", "content": prompt},
    ]
    
    final_answer = None
    tries = 0
    valid_ids = [choice.id for choice in item.choices]
    
    while final_answer is None and tries < max_retries:
        outputs = pipeline(messages, **generation_args)
        answer = outputs[0]["generated_text"].strip()
        
        # Post-process to extract just the letter if the model added text,
        # for example "A: Paris" or "A. Paris"
        if len(answer) != 1:
            match = re.match(r"^([A-Z])[:.] .+", answer)
            if match:
                answer = match.group(1)
        
        if answer in valid_ids:
            final_answer = answer
        else:
            tries += 1
    
    if final_answer is None:
        raise RuntimeError(
            f"Failed to run inference on question {item.question}, "
            f"{item.choices}; the last model response was {answer}."
        )
    
    return final_answer
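
The prompt construction above can be checked without loading a model. The sketch below pulls it into a hypothetical `build_prompt` helper and exercises it with stand-in classes; `FakeChoice` and `FakeItem` are not SDK types and only mimic the `question`, `choices`, `id`, and `text` attributes used in this notebook.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FakeChoice:
    """Stand-in for a multiple-choice option (not an SDK type)."""
    id: str
    text: str


@dataclass
class FakeItem:
    """Stand-in for a multiple-choice test item (not an SDK type)."""
    question: str
    choices: List[FakeChoice]


def build_prompt(item) -> str:
    """Format the question and its choices the same way as the cell above."""
    options = "\n".join(f"- {choice.id}: {choice.text}" for choice in item.choices)
    return f"{item.question}\nOptions:\n{options}"


item = FakeItem(
    question="What is the capital of France?",
    choices=[FakeChoice("A", "Paris"), FakeChoice("B", "London")],
)
print(build_prompt(item))
```

Keeping the prompt in one function also means the local and API-based sections can share it instead of duplicating the f-string.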

In [None]:
# Set up metadata for the transformers model
transformers_metadata = TrismikRunMetadata(
    model_metadata=TrismikRunMetadata.ModelMetadata(
        name="microsoft/Phi-4-mini-instruct",
        parameters="3.8B",
        provider="Microsoft",
    ),
    test_configuration={
        "task_name": dataset,
        "response_format": "Multiple-choice",
    },
    inference_setup={
        "max_tokens": 1024,
        "temperature": 0.0,
    },
)

# Run with the transformers model
print("Running transformers model...")
transformers_runner = AdaptiveTest(lambda item: transformers_inference(pipeline, item))
transformers_results = transformers_runner.run(
    dataset,
    PROJECT_ID,
    EXPERIMENT,
    run_metadata=transformers_metadata,
    return_dict=False,
)

print(f"Run {transformers_results.run_id} completed.")
if transformers_results.score is not None:
    print_score(transformers_results.score)
else:
    print("No score available.")

## Section 3: API-based Inference with OpenAI (Async)

This section demonstrates how to use the Trismik SDK with OpenAI's API using asynchronous methods for improved performance.

In [None]:
from openai import AsyncOpenAI

# Initialize the async OpenAI client
openai_client = AsyncOpenAI()
model_name = "gpt-4o-mini"

async def openai_inference_async(
    client: AsyncOpenAI, item: TrismikItem, max_retries: int = 5
) -> str:
    """
    Run inference on an item using the OpenAI API asynchronously.
    
    Args:
        client (AsyncOpenAI): Async OpenAI client.
        item (TrismikItem): Item to run inference on.
        max_retries (int): Maximum number of retries.
    """
    assert isinstance(item, TrismikMultipleChoiceTextItem)
    
    # Construct the prompt from the question and choices
    prompt = f"{item.question}\nOptions:\n" + "\n".join(
        [f"- {choice.id}: {choice.text}" for choice in item.choices]
    )
    
    # Developer (system-level) message with strict instructions
    messages = [
        {
            "role": "developer",
            "content": """
Answer the question you are given using only a single letter \
(for example, 'A'). \
Do not use punctuation. \
Do not show your reasoning. \
Do not provide any explanation. \
Follow the instructions exactly and \
always answer using a single uppercase letter.

For example, if the question is "What is the capital of France?" and the \
choices are "A. Paris", "B. London", "C. Rome", "D. Madrid",
- the answer should be "A"
- the answer should NOT be "Paris" or "A. Paris" or "A: Paris"

Please adhere strictly to the instructions.
""".strip(),
        },
        {"role": "user", "content": prompt},
    ]
    
    final_answer = None
    tries = 0
    valid_ids = [choice.id for choice in item.choices]
    
    while final_answer is None and tries < max_retries:
        response = await client.chat.completions.create(
            model=model_name,
            messages=messages,
            max_tokens=10,
            temperature=0.0,
        )
        # content can be None (for example on a refusal), so fall back to ""
        answer = (response.choices[0].message.content or "").strip()
        
        if answer in valid_ids:
            final_answer = answer
        else:
            tries += 1
    
    if final_answer is None:
        raise RuntimeError(
            f"Failed to run inference on question {item.question}, "
            f"{item.choices}; the last model response was {answer}."
        )
    
    return final_answer

In [None]:
# Set up metadata for the OpenAI model
openai_metadata = TrismikRunMetadata(
    model_metadata=TrismikRunMetadata.ModelMetadata(
        name=model_name,
        provider="OpenAI",
    ),
    test_configuration={
        "task_name": dataset,
        "response_format": "Multiple-choice",
    },
    inference_setup={
        "max_tokens": 10,
        "temperature": 0.0,
    },
)

# Run the async test with OpenAI
async def run_openai_test():
    print("Running OpenAI async test...")
    
    # Create an async inference function that captures the client
    async def inference_wrapper(item: TrismikItem) -> str:
        return await openai_inference_async(openai_client, item)
    
    openai_runner = AdaptiveTest(inference_wrapper)
    openai_results = await openai_runner.run_async(
        dataset,
        PROJECT_ID,
        EXPERIMENT,
        run_metadata=openai_metadata,
        return_dict=False,
    )
    
    print(f"Run {openai_results.run_id} completed.")
    if openai_results.score is not None:
        print_score(openai_results.score)
    else:
        print("No score available.")
    
    return openai_results

# Run the async test
openai_results = await run_openai_test()
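
The top-level `await` above works because Jupyter already runs an event loop. In a plain Python script you would hand the coroutine to `asyncio.run` instead, and independent runs (for example, different datasets or models) can be awaited together with `asyncio.gather`. The sketch below uses a stand-in coroutine, `fake_run`, in place of a real test run, so it makes no network calls.

```python
import asyncio


async def fake_run(name: str, delay: float) -> str:
    """Stand-in for an async test run; sleeps instead of calling an API."""
    await asyncio.sleep(delay)
    return f"{name} done"


async def main() -> list:
    # gather awaits both coroutines concurrently and preserves argument order
    return await asyncio.gather(
        fake_run("run-a", 0.01),
        fake_run("run-b", 0.01),
    )


if __name__ == "__main__":
    # In a script there is no running loop yet, so asyncio.run creates one.
    # Inside Jupyter, use `await main()` instead.
    print(asyncio.run(main()))
```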

## Summary

This notebook has demonstrated three different approaches to using the Trismik adaptive testing framework:

1. **Mock Inference**: A simple demonstration showing the basic structure and API usage
2. **Local Model Inference**: Using Hugging Face transformers for local model evaluation
3. **API-based Inference**: Using OpenAI's API with async methods for cloud-based evaluation

Each approach shows how to:
- Set up run metadata to track your model and test configuration
- Implement inference functions that work with Trismik's test items
- Run adaptive tests and interpret the results

The key metrics to focus on are:
- **Theta (θ)**: The primary score measuring model ability on the dataset
- **Standard Error**: The uncertainty in the theta estimate (lower is better)
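
As a rough guide to reading these two numbers together, the sketch below turns theta and its standard error into an approximate 95% interval and compares two runs. It assumes the theta estimate is approximately normally distributed and that the two runs are independent; the helper names and the numeric values are made up for illustration.

```python
import math


def confidence_interval(theta, std_error, z=1.96):
    """Approximate confidence interval for theta (normal approximation)."""
    margin = z * std_error
    return theta - margin, theta + margin


def clearly_different(theta_a, se_a, theta_b, se_b, z=1.96):
    """True if two independent theta estimates differ by more than z combined SEs."""
    diff_se = math.sqrt(se_a ** 2 + se_b ** 2)
    return abs(theta_a - theta_b) > z * diff_se


low, high = confidence_interval(1.0, 0.5)
print(f"theta is roughly between {low:.2f} and {high:.2f}")
print(clearly_different(1.0, 0.1, 0.0, 0.1))
```

If the intervals of two models overlap heavily, the difference in theta may not be meaningful even when one score is nominally higher.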

Adaptive testing allows you to efficiently evaluate your models with significantly fewer test items while maintaining statistical rigor, making it ideal for iterative model development and evaluation workflows.