# Getting Started with Trismik's Adaptive Testing

This notebook demonstrates how to run Trismik's adaptive evaluations using Scorebook.

## What is Adaptive Testing?

Trismik's adaptive testing service uses item response theory (IRT), a psychometric framework, to evaluate large language models. With computerized adaptive testing (CAT), it dynamically selects the most informative item at each step, so a model can be evaluated faster and at lower cost, using fewer items.
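
To make "most informative item" concrete, here is a minimal sketch of one textbook formulation: the two-parameter logistic (2PL) IRT model with maximum-information item selection. This is an illustration under assumptions, not Trismik's actual implementation; the item bank, its parameters (`a` for discrimination, `b` for difficulty), and the function names are all hypothetical.

```python
import math


def p_correct(theta, a, b):
    """2PL model: probability of a correct answer given ability theta."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def item_information(theta, a, b):
    """Fisher information of a 2PL item at ability theta."""
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)


def most_informative(theta, items):
    """Pick the item with the highest information at the current estimate."""
    return max(items, key=lambda item: item_information(theta, item["a"], item["b"]))


# Hypothetical item bank: a = discrimination, b = difficulty.
ITEM_BANK = [
    {"id": "easy", "a": 1.0, "b": -2.0},
    {"id": "medium", "a": 1.0, "b": 0.5},
    {"id": "hard", "a": 1.0, "b": 2.5},
]

# With a current ability estimate of 0.4, the item whose difficulty is
# closest to the estimate carries the most information.
next_item = most_informative(0.4, ITEM_BANK)
print(next_item["id"])  # prints "medium"
```

An adaptive test repeats this loop: pick an item, score the response, update the ability estimate, and stop once the estimate is precise enough.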

## Setup

### Generate a Trismik API Key

To run an adaptive evaluation, a Trismik API key is required. You can [sign up](https://dashboard.trismik.com/signup) for a free Trismik account and generate an API key.

**How to generate an API key from the Trismik dashboard**:
1. Click on your initials in the top-right corner of the screen.
2. Click on "API Keys" in the drop-down menu.
3. Click "Create API Key" to create a new API key.

In [1]:
import scorebook

# Set your API key here and run this cell to log in
TRISMIK_API_KEY = "your-trismik-api-key"

scorebook.login(TRISMIK_API_KEY)
print("✓ Logged in to Trismik")

✓ Logged in to Trismik
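
Pasting a key directly into a notebook makes it easy to commit by accident. One common alternative is to read it from an environment variable. The sketch below assumes the key has been exported under the name `TRISMIK_API_KEY`; both that variable name and the `read_api_key` helper are hypothetical choices for this example, not part of Scorebook.

```python
import os


def read_api_key(env_var="TRISMIK_API_KEY"):
    """Return the API key stored in env_var, or raise if it is missing or empty."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable before logging in.")
    return key


# Usage in this notebook (requires scorebook to be installed):
# scorebook.login(read_api_key())
```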


### Create a Project

The results of an adaptive evaluation are stored under a project on the Trismik dashboard.

In [3]:
from scorebook import create_project

# Create a project
project = create_project(
    name="Getting Started",
    description="A project created as part of Trismik's quick-start guides.",
)

print("✓ Project created")
print(f"Project ID: {project.id}")

✓ Project created
Project ID: 203ec21be1554b9f71d1b465feedffa2883da4f7


## Run an Adaptive Evaluation

For this quick-start guide, we will use a mock model that mimics the responses generated by an LLM.

In [4]:
from tutorials.utils.mock_llm import mock_llm

# Run adaptive evaluation
results = scorebook.evaluate(
    inference=mock_llm,
    datasets="trismik/MMLUPro:adaptive",
    experiment_id="Getting-Started-Demo",
    project_id=project.id,
)

# Print the adaptive evaluation results
print("✓ Adaptive evaluation complete!")
print("Results: ", results[0]["score"])

Evaluating Model Completed, 1 Runs Completed Successfully
✓ Adaptive evaluation complete!
Results:  {'theta': 0.5520873187712936, 'std_error': 0.19984025684019777}


### Adaptive Evaluation Results

The metrics generated by an adaptive evaluation are:

- Theta (θ): The primary score, measuring the model's ability on the dataset. A higher value represents better performance.
- Standard Error: The uncertainty in the theta estimate. Theta is a proxy for the underlying metric, so a smaller standard error means a more precise estimate.
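
One way to read the two numbers together is as an approximate confidence interval around theta. The sketch below assumes the theta estimate is roughly normally distributed, a common assumption in IRT that this guide does not confirm, and uses the values printed in the output above. The `confidence_interval` helper is hypothetical.

```python
def confidence_interval(theta, std_error, z=1.96):
    """Approximate interval theta +/- z * std_error (z=1.96 gives ~95%)."""
    margin = z * std_error
    return theta - margin, theta + margin


# Score dictionary copied from the evaluation output above.
score = {"theta": 0.5520873187712936, "std_error": 0.19984025684019777}

low, high = confidence_interval(score["theta"], score["std_error"])
print(f"theta = {score['theta']:.2f}, 95% CI ~ [{low:.2f}, {high:.2f}]")
# prints: theta = 0.55, 95% CI ~ [0.16, 0.94]
```

Under that assumption, two models whose intervals overlap heavily cannot be confidently ranked from these runs alone.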

For more information, see the [introduction to adaptive testing](https://docs.trismik.com/adaptiveTesting/adaptive-testing-introduction/).

## Next Steps

**More Quick-Start Guides**:

- [Adaptive Evaluation demos](): For a more detailed quick-start guide to understanding adaptive evaluations.
- [Classical Evaluation demo](): To see how Scorebook can be used for _classical_ evaluations.

**More details on Adaptive Testing and Scorebook**:

- [Dataset page](https://dashboard.trismik.com/datasets): The full set of adaptive datasets currently available on the Trismik dashboard.
- [Scorebook docs](https://docs.trismik.com/scorebook/introduction-to-scorebook/): Scorebook's full documentation.
- [Scorebook repository](https://github.com/trismik/scorebook): Scorebook is an open-source library; view the code and more examples here.