# Model Evaluation: Demo Selection Strategies

_Note_: for an introduction to model evaluation, see the [Quick Start](Unitxt_Quick_Start.ipynb) Cookbook.

In this example, we compare demo selection strategies on a classification task to determine which works best for our needs. Unitxt lets us vary both the number of shots (demos) included in each test prompt and the strategy used to select them.

We use the [Ledgar](https://www.unitxt.ai/en/latest/catalog/catalog.cards.ledgar.html) dataset from the Unitxt catalog as the basis for this example.

## Load Dependencies

In [None]:
%pip install replicate
%pip install unitxt
%pip install openai
%pip install litellm
%pip install diskcache
%pip install tenacity
%pip install tabulate
%pip install scikit-learn
%pip install git+https://github.com/ibm-granite-community/utils

Then import the modules used in this example:

In [None]:
from unitxt.api import evaluate, load_dataset
from unitxt.inference import CrossProviderInferenceEngine
from unitxt.splitters import CloseTextSampler, FixedIndicesSampler, RandomSampler

from ibm_granite_community.notebook_utils import get_env_var

import pandas as pd

import nest_asyncio
nest_asyncio.apply()

## Configure demo selection strategies to evaluate

In [2]:
# RandomSampler - randomly sample a different set of examples for each test instance
# CloseTextSampler - select the lexically closest samples from the demo pool for each test instance
# FixedIndicesSampler - select the same fixed set of demo examples for all instances
samplers = [
    RandomSampler(),
    CloseTextSampler(field="text"),
    FixedIndicesSampler(indices=[0, 1]),
]

# Different number of shots to evaluate
number_of_demos = [0, 1, 3, 5]
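
To make the three strategies concrete, here is a minimal, library-free sketch of the selection logic that each sampler's description suggests. It does not call Unitxt: the function names, the toy demo pool, and the use of `difflib.SequenceMatcher` as the lexical similarity measure are all assumptions made for illustration, and Unitxt's own implementations may differ.

```python
import random
from difflib import SequenceMatcher

# Toy demo pool (made-up clauses, not taken from Ledgar)
pool = [
    "This Agreement shall be governed by the laws of the State of New York.",
    "The Company shall indemnify the Executive against all losses.",
    "Any notice shall be delivered in writing to the address below.",
    "This Agreement is governed by the laws of the State of Delaware.",
]


def random_demos(pool, k, rng):
    """Random strategy: draw k distinct demos for each test instance."""
    return rng.sample(pool, k)


def closest_demos(pool, k, query):
    """Closest-text strategy: rank the pool by character-level similarity to the query."""
    ranked = sorted(
        pool,
        key=lambda text: SequenceMatcher(None, query, text).ratio(),
        reverse=True,
    )
    return ranked[:k]


def fixed_demos(pool, indices, k):
    """Fixed strategy: always return the same pool positions, whatever the query is."""
    return [pool[i] for i in indices[:k]]


query = "This Agreement shall be governed by the laws of the State of Texas."
print(random_demos(pool, 2, random.Random(0)))
print(closest_demos(pool, 2, query))
print(fixed_demos(pool, [0, 1], 2))
```

In this sketch, only the closest-text strategy looks at the test instance itself, which is why it can surface the two governing-law clauses for a governing-law query.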

## Instantiate the evaluation client

In [None]:
model = CrossProviderInferenceEngine(
    model="granite-3-8b-instruct",
    provider="replicate",
    credentials={"api_token": get_env_var("REPLICATE_API_TOKEN")},
)

## Iterate through the different strategies

In [None]:
df = pd.DataFrame(columns=["num_demos", "sampler", "f1_micro", "ci_low", "ci_high"])

for num_demos in number_of_demos:
    for demo_sampler in samplers:
        dataset = load_dataset(
            card="cards.ledgar",
            template="templates.classification.multi_class.title",
            format="formats.chat_api",
            num_demos=num_demos,
            demos_pool_size=50,
            loader_limit=200,
            max_test_instances=10,
            sampler=demo_sampler,
            split="test",
        )

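        # Run inference on the prepared prompts, then score the predictions
        predictions = model(dataset)
        results = evaluate(predictions=predictions, data=dataset)

        # Aggregate scores over all test instances
        global_scores = results.global_scores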

        df.loc[len(df)] = [
            num_demos,
            demo_sampler.__type__,
            global_scores["score"],
            global_scores["score_ci_low"],
            global_scores["score_ci_high"],
        ]
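
The loop above runs every combination of shot count and sampler. With zero demos no examples are selected, so the choice of sampler should not affect the prompt, and the zero-shot setting could be run once rather than once per sampler. The following is a small sketch of building such a reduced grid; `build_grid` and the string labels in `sampler_names` are hypothetical stand-ins for the sampler objects, not part of Unitxt.

```python
from itertools import product

number_of_demos = [0, 1, 3, 5]
# Hypothetical labels standing in for the sampler objects
sampler_names = ["random", "close_text", "fixed_indices"]


def build_grid(shot_counts, samplers):
    """Return (num_demos, sampler) pairs, keeping a single zero-shot entry."""
    grid = []
    zero_shot_added = False
    for num_demos, sampler in product(shot_counts, samplers):
        if num_demos == 0:
            if zero_shot_added:
                continue  # further zero-shot runs would use identical prompts
            zero_shot_added = True
        grid.append((num_demos, sampler))
    return grid


grid = build_grid(number_of_demos, sampler_names)
print(len(grid))  # 1 zero-shot run + 3 shot counts x 3 samplers
```

This trims the 12 runs of the full grid to 10, which matters mainly when each run involves paid inference calls.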

## Print the results

In [None]:
df = df.round(decimals=2)
print(df.to_markdown())
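
When reading the table, keep in mind that each configuration is scored on only 10 test instances (`max_test_instances=10`), so the confidence intervals are likely to be wide. One rough way to judge whether two configurations really differ is to check whether their intervals overlap. The sketch below shows that check; `intervals_overlap` is a hypothetical helper, the numbers are placeholders made up for illustration (not results from this notebook), and overlapping intervals are only a heuristic, not a formal significance test.

```python
def intervals_overlap(low_a, high_a, low_b, high_b):
    """Return True when [low_a, high_a] and [low_b, high_b] share at least one point."""
    return low_a <= high_b and low_b <= high_a


# Placeholder rows, made up for illustration only
run_a = {"f1_micro": 0.60, "ci_low": 0.30, "ci_high": 0.90}
run_b = {"f1_micro": 0.70, "ci_low": 0.40, "ci_high": 1.00}
run_c = {"f1_micro": 0.20, "ci_low": 0.05, "ci_high": 0.25}

a_vs_b = intervals_overlap(
    run_a["ci_low"], run_a["ci_high"], run_b["ci_low"], run_b["ci_high"]
)
a_vs_c = intervals_overlap(
    run_a["ci_low"], run_a["ci_high"], run_c["ci_low"], run_c["ci_high"]
)
print(a_vs_b)  # True: the difference between A and B may be noise
print(a_vs_c)  # False: A and C are more clearly separated
```

If most intervals overlap, increasing `max_test_instances` is a natural next step before drawing conclusions about which sampler is best.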