# Model Evaluation: Granite as Judge

_Note_: for an introduction to model evaluation, see the [Quick Start](Unitxt_Quick_Start.ipynb) Cookbook.

In this example, Granite acts as a judge: it evaluates predictions produced by another model.

## Load Dependencies

In [None]:
%pip install replicate
%pip install unitxt
%pip install openai
%pip install litellm
%pip install diskcache
%pip install tenacity
%pip install tabulate
%pip install git+https://github.com/ibm-granite-community/utils

Next, import the Unitxt evaluation API, the inference engine, and the LLM-as-judge classes:

In [None]:
from unitxt.api import evaluate, create_dataset
from unitxt.inference import CrossProviderInferenceEngine
from unitxt.llm_as_judge import LLMJudgeDirect, EvaluatorNameEnum, DirectCriteriaCatalogEnum

from ibm_granite_community.notebook_utils import get_env_var

# Patch asyncio so nested event loops work inside the notebook's running loop
import nest_asyncio
nest_asyncio.apply()

## Set up the sample data and predictions

In [2]:
data = [
    {"question": "Who is Harry Potter?"},
    {"question": "How can I protect myself from the wind while walking outside?"},
    {"question": "What is a good low cost of living city in the US?"},
]

predictions = [
    """Harry Potter is a young wizard who becomes famous for surviving an attack by the dark wizard Voldemort, and later embarks on a journey to defeat him and uncover the truth about his past.""",
    """You can protect yourself from the wind by wearing windproof clothing, layering up, and using accessories like hats, scarves, and gloves to cover exposed skin.""",
    """A good low-cost-of-living city in the U.S. is San Francisco, California, known for its affordable housing and budget-friendly lifestyle.""",
]
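
Predictions are matched to questions by position, so it can help to confirm that the two lists line up before spending any judge calls. The following is a minimal sketch: `pair_inputs` is a hypothetical helper, not part of Unitxt, and the two sample rows are shortened stand-ins for the lists above.

```python
def pair_inputs(data, predictions):
    """Pair each question with its prediction, failing fast on a length mismatch."""
    if len(data) != len(predictions):
        raise ValueError(
            f"{len(data)} questions but {len(predictions)} predictions"
        )
    return [(row["question"], pred) for row, pred in zip(data, predictions)]


# Shortened stand-ins for the notebook's `data` and `predictions` lists.
sample_data = [
    {"question": "Who is Harry Potter?"},
    {"question": "What is a good low cost of living city in the US?"},
]
sample_predictions = [
    "Harry Potter is a young wizard.",
    "A good low-cost-of-living city in the U.S. is San Francisco, California.",
]

# Print each question next to the answer that will be judged.
for question, prediction in pair_inputs(sample_data, sample_predictions):
    print(f"Q: {question}\nA: {prediction}\n")
```

In the notebook itself, the same check would be called as `pair_inputs(data, predictions)`.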

## Define the judge metric

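We want the judge to rate how relevant each answer is to its question. The metric below uses the catalog's answer-relevance criterion and passes the `question` field to the judge as context.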

In [None]:
metric = LLMJudgeDirect(
    evaluator_name=EvaluatorNameEnum.GRANITE3_1_8B.name,
    # Send judge requests to the Granite model hosted on Replicate
    inference_engine=CrossProviderInferenceEngine(
        model="granite-3-8b-instruct",
        provider="replicate",
        credentials={"api_token": get_env_var("REPLICATE_API_TOKEN")},
    ),
    criteria=DirectCriteriaCatalogEnum.ANSWER_RELEVANCE.value,
    # Give the judge the question alongside each prediction
    context_fields=["question"],
    criteria_field="criteria",
)
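
The judge cannot reach Replicate without an API token. As a sketch, assuming the token is expected in the process environment, a plain-Python check like the one below fails early with a readable message. `require_env` is a hypothetical helper for illustration; it is not part of Unitxt or `ibm_granite_community`, and it does not replace `get_env_var`.

```python
import os


def require_env(name, env=None):
    """Return the value of environment variable `name`, or raise if it is unset or empty."""
    # Default to the real process environment; a plain dict can be passed for testing.
    env = os.environ if env is None else env
    value = env.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value
```

Calling `require_env("REPLICATE_API_TOKEN")` before constructing the metric would surface a missing token before any inference request is made.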

## Create the dataset

In [None]:
dataset = create_dataset(
    task="tasks.qa.open",
    test_set=data,
    metrics=[metric],
    split="test",
)

## Perform the evaluation

In [None]:
results = evaluate(predictions=predictions, data=dataset)

## Print the results

In [None]:

print("Global Scores:")
print(results.global_scores.summary)

print("Instance Scores:")
print(results.instance_scores)
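
To spot answers the judge rated poorly, the instance scores can be filtered after printing. This sketch assumes each instance score behaves like a dictionary with a numeric value under the key `score`. That key name, the 0.5 threshold, and the helper `flag_low_scores` are all assumptions for illustration, so inspect the printed output before relying on them.

```python
def flag_low_scores(instance_scores, threshold=0.5, key="score"):
    """Return (index, score) pairs for instances whose score is below `threshold`."""
    flagged = []
    for index, row in enumerate(instance_scores):
        score = row.get(key)
        # Skip rows that have no numeric score under the assumed key.
        if isinstance(score, (int, float)) and score < threshold:
            flagged.append((index, score))
    return flagged
```

If the instance scores can be iterated as dictionaries, the returned indices point back into the `data` and `predictions` lists, which makes it easy to reread the flagged answers.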