# Getting Started with Trismik's Adaptive Testing

This notebook demonstrates how to run Trismik's adaptive evaluations using Scorebook.

## What is Adaptive Testing?

Trismik’s adaptive testing service uses item response theory (IRT), a psychometric framework, to evaluate large language models. Through computerized adaptive testing (CAT), it dynamically selects the most informative items, so an evaluation needs fewer items and runs faster and at lower cost.
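
To make "most informative item" concrete, here is a minimal sketch of one common CAT selection rule: under a two-parameter logistic (2PL) IRT model, pick the unasked item with the highest Fisher information at the current ability estimate. This is a textbook illustration, not Trismik's actual implementation; the item parameters and the names `p_correct`, `item_information`, and `pick_next_item` are hypothetical.

```python
import math


def p_correct(theta, a, b):
    """2PL probability of a correct answer for ability theta.

    a is the item's discrimination, b is its difficulty.
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def item_information(theta, a, b):
    """Fisher information of a 2PL item at ability theta: a^2 * p * (1 - p)."""
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)


def pick_next_item(theta, items, asked):
    """Return the index of the unasked item that is most informative at theta."""
    candidates = [i for i in range(len(items)) if i not in asked]
    return max(candidates, key=lambda i: item_information(theta, *items[i]))


# Hypothetical item bank as (discrimination, difficulty) pairs.
items = [(1.0, -2.0), (1.2, 0.0), (0.8, 1.5), (1.5, 0.3)]

# With a current ability estimate of 0.0 and nothing asked yet,
# the rule favours a discriminating item whose difficulty is near 0.0.
next_item = pick_next_item(theta=0.0, items=items, asked=set())
print("Next item index:", next_item)
```

Items that are far too easy or far too hard for the current estimate carry little information, which is why an adaptive test can stop after fewer items than a fixed test.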

## Setup

### Generate a Trismik API Key

To run an adaptive evaluation, a Trismik API key is required. You can [sign up](https://dashboard.trismik.com/signup) for a free Trismik account and generate an API key.

**How to generate an API key from the Trismik dashboard**:
1. Click on your initials in the top-right corner of the screen.
2. Click on "API Keys" in the drop-down menu.
3. Click "Create API Key" to create a new API key.

In [None]:
import scorebook

# Set your API key here and run this cell to log in
TRISMIK_API_KEY = "add-your-trismik-api-key-here"

scorebook.login(TRISMIK_API_KEY)
print("✓ Logged in to Trismik")
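
If you would rather not paste the key into the notebook, one option is to read it from an environment variable. The sketch below is an assumption, not part of Scorebook: the variable name `TRISMIK_API_KEY` and the helper `load_api_key` are hypothetical, and the only Scorebook call involved is the `scorebook.login` shown above.

```python
import os


def load_api_key(env_var="TRISMIK_API_KEY"):
    """Return the API key stored in env_var, or raise if it is unset or empty."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable before logging in.")
    return key


# Usage in this notebook, once the environment variable is set:
# scorebook.login(load_api_key())
```

Failing early with a clear message avoids sending an empty or placeholder key to the login call.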

### Create a Project

Adaptive evaluation results are stored under a project that you create and then specify when running the evaluation.

**How to create a project from the Trismik dashboard**:
1. Click "New Project" on the landing page.
2. Create a name for the project.
3. Copy the project ID, shown on the right.

In [None]:
# Set your project ID here and run this cell to save it.
TRISMIK_PROJECT_ID = "add-your-project-id-here"

## Run an Adaptive Evaluation

For this quick-start guide, we use a mock model that replicates the responses generated by GPT-5 Mini.

In [None]:
from tutorials.utils import mock_llm

# Run adaptive evaluation
results = scorebook.evaluate(
    mock_llm,
    datasets="trismik/CommonSenseQA:adaptive",
    experiment_id="Getting-Started-Demo",
    project_id=TRISMIK_PROJECT_ID,
)

# Print the adaptive evaluation results
print("\n✓ Adaptive evaluation complete!")
print("\nResults:", results["aggregate_scores"]["trismik/CommonSenseQA:adaptive"]["score"])

### Adaptive Evaluation Results

The metrics generated by an adaptive evaluation are:

- Theta (θ): The primary score, measuring the model's ability on the dataset. A higher value represents better performance.
- Standard Error: The uncertainty in the theta estimate. Theta is a proxy for the underlying metric, and a smaller standard error indicates a more precise estimate.
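
One common way to read the two metrics together is as an approximate confidence interval around theta. The sketch below assumes the estimate is roughly normally distributed, so a 95% interval is theta ± 1.96 × standard error. The numbers are made up for illustration and are not results from this notebook, and `theta_interval` is a hypothetical helper.

```python
def theta_interval(theta, std_error, z=1.96):
    """Approximate confidence interval for theta, assuming a normal estimate.

    z=1.96 corresponds to a 95% interval.
    """
    margin = z * std_error
    return theta - margin, theta + margin


# Hypothetical values, not real evaluation output.
low, high = theta_interval(theta=0.8, std_error=0.25)
print(f"theta is likely between {low:.2f} and {high:.2f}")
```

When comparing two models, overlapping intervals suggest the difference in theta may not be meaningful at that level of uncertainty.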

For more information, see the [adaptive testing introduction](https://docs.trismik.com/adaptiveTesting/adaptive-testing-introduction/).

## Next Steps

For a more detailed quick-start guide to understanding adaptive evaluations, see the [adaptive evaluation demo]().

To see how Scorebook can be used for _classical_ evaluations, see the [scorebook demo]().