# Llamastack Eval 🦙

We would like to be able to evaluate our models and applications before they go into production.  
To do that we can use the llamastack eval endpoint 🙌  
It allows us to run prompts and expected answers through different evaluations to make sure that the model answers as we expect.  
The prompts and expected answers can either be ones you write yourself, or they can be taken from an evaluation dataset such as this one from HuggingFace: https://huggingface.co/datasets/llamastack/simpleqa  

We will be testing two types of evaluations here:
- "subset_of", which checks whether the expected answer appears verbatim (case-sensitive) inside the LLM output
- "llm_as_judge", which lets an LLM evaluate how similar the LLM output is to the expected answer

Here we will test both the raw model, to see how well it performs, and the backend endpoints, so that we also evaluate how effective our system prompts are. We will see even more on evaluating the raw model in a later chapter 😉

## Set-up
Let's start by installing the llamastack client and pointing it to our llamastack server.

In [None]:
!pip install -q git+https://github.com/meta-llama/llama-stack.git@release-0.2.12 rich

In [None]:
from llama_stack_client import LlamaStackClient
from rich.pretty import Pretty

base_url = "http://llama-stack.user1-test.svc.cluster.local:80"
client = LlamaStackClient(
    base_url=base_url,
    timeout=600.0 # Default is 1 min which is far too little for some agentic tests, we set it to 10 min
)

# Llamastack Eval endpoint

Before we evaluate our backend endpoint, let's just try out llamastack eval and see how it works.  
Here we create some `handmade_eval_rows` with the input and the expected answer, but we also add the generated answer already filled out.  

## Subset Of

Let's start by setting the `scoring_params` to use the `subset_of` function mentioned earlier and see what it comes back with.

In [None]:
handmade_eval_rows = [
    {
        "input_query": "What is your favorite food?",
        "generated_answer": "Tapas are my favorites.",
        "expected_answer": "Tapas",
    },
    {
        "input_query": "What is your favorite food?",
        "generated_answer": "I really like tapas.",
        "expected_answer": "Tapas",
    }
]
display(Pretty(handmade_eval_rows))  # display() is needed here: only the last expression of a cell is rendered automatically

scoring_params = {
    "basic::subset_of": None,
}
scoring_response = client.scoring.score(
    input_rows=handmade_eval_rows, scoring_functions=scoring_params
)
Pretty(scoring_response)

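Hmm, we only got half of the answers correct 🤔  
This is because the check is case-sensitive: the expected answer is spelled with a capital T, while the second generated answer uses a lowercase t. As mentioned before, the subset_of function expects the expected answer to appear exactly within the generated answer.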
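
To see why the second row fails, here is a minimal pure-Python sketch of a case-sensitive substring check. It assumes `basic::subset_of` behaves like Python's `in` operator on the two strings, which matches the scores above. `subset_of_score` is a hypothetical helper written for illustration, not part of the llamastack client.

```python
def subset_of_score(generated_answer: str, expected_answer: str) -> float:
    """Return 1.0 if the expected answer appears verbatim in the generated answer."""
    # Assumption: the check is a plain, case-sensitive substring match.
    return 1.0 if expected_answer in generated_answer else 0.0


rows = [
    {"generated_answer": "Tapas are my favorites.", "expected_answer": "Tapas"},
    {"generated_answer": "I really like tapas.", "expected_answer": "Tapas"},
]
scores = [subset_of_score(r["generated_answer"], r["expected_answer"]) for r in rows]
print(scores)  # [1.0, 0.0]
```

If case should not matter for your use case, lower-casing both strings before comparing would make both rows pass.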

## LLM as judge🧑‍⚖️

Now let's try the same thing but with LLM as judge.   
Here we are going to feed our eval_rows into an LLM to evaluate how well the generated answer matches the expected answer.  
To make sure that the judge LLM does this, we also provide it a `JUDGE_PROMPT` it should follow, as well as a regex of expected scores (`judge_score_regexes`) from the judge.  
In our case we let it grade the generated answers from A to E (no F's in this class 🙂‍↔️), each with its own meaning that you can see in the judge prompt.  
We also choose Llama 3.2 to be our judge (as we all know llamas to be the best of judges). This means that when we later evaluate replies from the backend, we will use the same LLM to generate our answers and to judge them, essentially a self-judge strategy. This is not always ideal, since a model can be biased toward its own answers, but it works pretty well with Llama 3.2 and we don't have any other model to use right now.

In [None]:
handmade_eval_rows = [
    {
        "input_query": "What is your favorite food?",
        "generated_answer": "Tapas are my favorites.",
        "expected_answer": "Tapas",
    },
    {
        "input_query": "What is your favorite food?",
        "generated_answer": "I really like tapas.",
        "expected_answer": "Tapas",
    }
]
display(Pretty(handmade_eval_rows))  # display() is needed here: only the last expression of a cell is rendered automatically

judge_model_id = "llama32"
JUDGE_PROMPT = """
Given a QUESTION and GENERATED_RESPONSE and EXPECTED_RESPONSE.

Compare the factual content of the GENERATED_RESPONSE with the EXPECTED_RESPONSE. Ignore any differences in style, grammar, or punctuation.
  The GENERATED_RESPONSE may either be a subset or superset of the EXPECTED_RESPONSE, or it may conflict with it. Determine which case applies. Answer the question by selecting one of the following options:
  (A) The GENERATED_RESPONSE is a subset of the EXPECTED_RESPONSE and is fully consistent with it.
  (B) The GENERATED_RESPONSE is a superset of the EXPECTED_RESPONSE and is fully consistent with it.
  (C) The GENERATED_RESPONSE contains all the same details as the EXPECTED_RESPONSE.
  (D) There is a disagreement between the GENERATED_RESPONSE and the EXPECTED_RESPONSE.
  (E) The answers differ, but these differences don't matter from the perspective of factuality.

Give your answer in the format "Answer: One of ABCDE, Explanation: ".

Your actual task:

QUESTION: {input_query}
GENERATED_RESPONSE: {generated_answer}
EXPECTED_RESPONSE: {expected_answer}
"""

scoring_params = {
    "llm-as-judge::base": {
        "judge_model": judge_model_id,
        "prompt_template": JUDGE_PROMPT,
        "type": "llm_as_judge",
        "judge_score_regexes": ["Answer: (A|B|C|D|E)"],
    },
}

scoring_response = client.scoring.score(
    input_rows=handmade_eval_rows, scoring_functions=scoring_params
)
Pretty(scoring_response)

You should have gotten at least a C from the judge, and you can see the reasoning for it in the `judge_feedback` field.  
Feel free to try out some other inputs, generated answers, and expected answers 🧪
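
To make the role of `judge_score_regexes` more concrete, here is a small sketch of how a grade could be pulled out of the judge's reply with the same pattern. It assumes the pattern is applied as a plain regex search over the judge output; `extract_grade` is a hypothetical helper, and the judge replies below are written by hand.

```python
import re

# Same pattern as in `judge_score_regexes` above.
JUDGE_SCORE_REGEX = r"Answer: (A|B|C|D|E)"


def extract_grade(judge_output: str):
    """Return the first grade letter found in the judge output, or None."""
    match = re.search(JUDGE_SCORE_REGEX, judge_output)
    return match.group(1) if match else None


# Hypothetical judge replies, written by hand for illustration.
print(extract_grade("Answer: B, Explanation: The reply adds detail."))  # B
print(extract_grade("I think it is fine."))  # None
```

The second call shows why the prompt insists on a fixed answer format: a reply that ignores the format yields no grade at all.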

# Involve the LLM

So far we have just hardcoded the generated answers but these should be generated from an LLM, otherwise we are just evaluating our human selves.  
To do this, let's send some requests to our LLM through llamastack, and then also to our backend, and see what that looks like.

In [None]:
model_id = "llama32"

model_eval_rows = [
    {
        "input_query": "What is your favorite Spanish food?",
        "expected_answer": "Tapas",
    },
    {
        "input_query": "What is your favorite Turkish food?",
        "expected_answer": "Baklava",
    }
]

Note that this time we don't have any generated answers yet. Those will come directly from the LLM.

In [None]:
for eval_row in model_eval_rows:
    response = client.inference.chat_completion(
        model_id=model_id,
        messages=[
            {"role": "user", "content": eval_row["input_query"]}
        ]
    )
    eval_row["generated_answer"] = response.completion_message.content

Pretty(model_eval_rows)


And now we can evaluate these answers just like we did before ✅

In [None]:
scoring_response = client.scoring.score(
    input_rows=model_eval_rows, scoring_functions=scoring_params
)
Pretty(scoring_response)
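
Once there are more than a handful of rows, a grade count is easier to read than the full response. The sketch below counts grades over a list of per-row score dicts. The row shape (a `score` key next to `judge_feedback`) is an assumption based on the output printed above, `grade_distribution` is a hypothetical helper, and the example rows are hand-written stand-ins rather than real judge output.

```python
from collections import Counter


def grade_distribution(score_rows):
    """Count how often each grade occurs in a list of per-row score dicts."""
    return Counter(row["score"] for row in score_rows)


# Hand-written stand-in for the per-row results; not real judge output.
example_score_rows = [
    {"score": "B", "judge_feedback": "Answer: B, Explanation: ..."},
    {"score": "D", "judge_feedback": "Answer: D, Explanation: ..."},
    {"score": "B", "judge_feedback": "Answer: B, Explanation: ..."},
]
distribution = grade_distribution(example_score_rows)
print(dict(distribution))  # {'B': 2, 'D': 1}
```

With a real response, the rows would likely sit under something like `scoring_response.results["llm-as-judge::base"].score_rows`, but check the printed response to confirm the attribute names before relying on them.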

# Evaluate the Backend

We can also evaluate our backend instead of the model. All we need to do is send each `input_query` to the backend and put the backend's response into `generated_answer`.  
Since our backend is prompted to summarize text, we add tests that work well for such tasks.

In [None]:
backend_url = "http://canopy-backend.user1-canopy.svc.cluster.local:8000"
endpoint_to_test = "/summarize"

backend_eval_rows = [
    {
        "input_query": "Llama 3.2 is a state-of-the-art language model that excels in various natural language processing tasks, including summarization, translation, and question answering.",
        "expected_answer": "Llama 3.2 is a top-tier language model for NLP tasks.",
    },
    {
        "input_query": "Artificial intelligence and machine learning have revolutionized numerous industries in recent years. \
From healthcare diagnostics that can detect diseases earlier than human doctors, to autonomous vehicles that promise safer transportation, \
to recommendation systems that personalize our digital experiences, AI technologies are becoming increasingly sophisticated. \
However, these advances also bring challenges including ethical concerns about bias in algorithms, job displacement due to automation, and the need for robust data privacy protections.",
        "expected_answer": "AI and ML have transformed industries through healthcare diagnostics, autonomous vehicles, and recommendation systems, but also raise concerns about bias, job displacement, and privacy.",
    },
]

In [None]:
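import json
from urllib.parse import urljoin

import httpx


def send_request(payload, url):
    """POST the payload and concatenate the streamed `delta` chunks into one string."""
    full_response = ""

    # Named http_client so it is not confused with the llamastack `client` above
    with httpx.Client(timeout=None) as http_client:
        with http_client.stream("POST", url, json=payload) as response:
            response.raise_for_status()  # fail loudly instead of returning an empty answer
            for line in response.iter_lines():
                # The backend streams SSE-style lines of the form `data: {...}`
                if line.startswith("data: "):
                    try:
                        data = json.loads(line[len("data: "):])
                        full_response += data.get("delta", "")
                    except json.JSONDecodeError:
                        continue  # skip lines that are not valid JSON

    return full_response


def prompt_backend(prompt, backend_url, endpoint_to_test):
    """Send a single prompt to the given backend endpoint and return the full reply."""
    url = urljoin(backend_url, endpoint_to_test)
    payload = {
        "prompt": prompt
    }
    return send_request(payload, url)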
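
The parsing step can be tried without a running backend. The sketch below applies the same `data: ` handling to a hand-written list of lines. `collect_deltas` is a hypothetical helper, and the sample stream (including the `[DONE]` marker) is an assumption about what a stream might contain, not captured backend output.

```python
import json


def collect_deltas(lines):
    """Concatenate the `delta` fields from an iterable of SSE-style lines."""
    parts = []
    for line in lines:
        if not line.startswith("data: "):
            continue  # skip blank lines and other fields
        try:
            data = json.loads(line[len("data: "):])
        except json.JSONDecodeError:
            continue  # e.g. a non-JSON end-of-stream marker
        if isinstance(data, dict):
            parts.append(data.get("delta", ""))
    return "".join(parts)


# Hand-written sample stream; the real backend output may look different.
sample_stream = [
    'data: {"delta": "Llama 3.2 "}',
    "",
    'data: {"delta": "is a language model."}',
    "data: [DONE]",
]
print(collect_deltas(sample_stream))  # Llama 3.2 is a language model.
```

Keeping the parsing in its own function like this makes it easy to unit test separately from the network call.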

In [None]:
for eval_row in backend_eval_rows:
    eval_row["generated_answer"] = prompt_backend(eval_row["input_query"], backend_url, endpoint_to_test)

Pretty(backend_eval_rows)

And again, as soon as the rows are in the expected format (a list of dicts with `input_query`, `generated_answer`, and `expected_answer`), we can evaluate them with Llamastack.

In [None]:
scoring_response = client.scoring.score(
    input_rows=backend_eval_rows, scoring_functions=scoring_params
)
Pretty(scoring_response)
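
The model loop and the backend loop above follow the same pattern: take each `input_query`, generate an answer, and store it as `generated_answer`. A minimal sketch of that pattern as a reusable function follows. `attach_generated_answers` and `fake_generate` are hypothetical names used only for illustration, and the generator here is a stand-in rather than a real model or backend call.

```python
def attach_generated_answers(rows, generate):
    """Return copies of rows with `generated_answer` filled in by `generate`."""
    return [{**row, "generated_answer": generate(row["input_query"])} for row in rows]


# Stand-in generator for illustration; swap in a model or backend call.
def fake_generate(prompt):
    return f"(answer to: {prompt})"


rows = [{"input_query": "What is your favorite Spanish food?", "expected_answer": "Tapas"}]
filled = attach_generated_answers(rows, fake_generate)
print(filled[0]["generated_answer"])  # (answer to: What is your favorite Spanish food?)
```

Unlike the loops above, this version returns new rows instead of modifying the originals, so the same test set can be reused against both the model and the backend.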

# Datasets 📖

Finally, let's use a dataset with already populated inputs and expected answers and see how well our model does against those.

## SimpleQA

SimpleQA is a (as the name suggests) simple dataset with questions and answers that you can run on your model.  
You can find the full dataset, as well as a few others, here: https://huggingface.co/llamastack/datasets  
In our case, we don't want to wait to evaluate the full dataset, so we will take 5 examples from the training part of this dataset to test our model on.  
Similarly, we will only test our model this time, partly because our `summarize` endpoint is not prompted to handle QA, and partly because we have already seen how to evaluate our backend above.  

To fetch the dataset we use Llamastack again, where we can register the dataset which allows us to fetch data from it.

In [None]:
simpleqa_dataset_id = "huggingface::simpleqa"

_ = client.datasets.register(
    purpose="eval/messages-answer",
    source={
        "type": "uri",
        "uri": "huggingface://datasets/llamastack/simpleqa?split=train",
    },
    dataset_id=simpleqa_dataset_id,
)

dataset_eval_rows = client.datasets.iterrows(
    dataset_id=simpleqa_dataset_id,
    limit=5,
)

In [None]:
Pretty(dataset_eval_rows)

As you can see, the dataset is already formatted with `input_query` and `expected_answer`, and we get some extra information such as `metadata` and `chat_completion_input`.  

We could now feed this evaluation set to our model just like we did before, but since we are only evaluating the model we will make use of another Llamastack functionality called `benchmarks`.  
This passes the dataset rows through the model, scores the generated answers with the scoring functions we choose, and returns both.

In [None]:
client.benchmarks.register(
    benchmark_id="meta-reference::simpleqa",
    dataset_id=simpleqa_dataset_id,
    scoring_functions=["llm-as-judge::base"],
)

response = client.eval.evaluate_rows(
    benchmark_id="meta-reference::simpleqa",
    input_rows=dataset_eval_rows.data,
    scoring_functions=["llm-as-judge::base"],
    benchmark_config={
        "eval_candidate": {
            "type": "model",
            "model": model_id,
            "sampling_params": {
                "strategy": {
                    "type": "greedy",
                },
                "max_tokens": 4096,
                "repeat_penalty": 1.0,
            },
        },
    },
)
Pretty(response)


To summarize, we have now used the Llamastack Eval endpoint to evaluate our raw LLM as well as our backend system, both with custom evaluations and with those fetched from a dataset.  
With this knowledge, we can now build an evaluation workflow that lets us test our backend and model before they go into production 👏
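
As one possible building block for such a workflow, here is a sketch of a simple pass/fail gate over the judge's grades. Treating grade D (disagreement) as the failing grade follows the judge prompt above. The 20% threshold and the `passes_gate` helper are assumptions made for illustration, so adjust both to what your application can tolerate.

```python
def passes_gate(grades, failing_grades=("D",), max_fail_ratio=0.2):
    """Return True if the share of failing grades stays within the allowed ratio."""
    if not grades:
        raise ValueError("no grades to evaluate")
    failures = sum(1 for grade in grades if grade in failing_grades)
    return failures / len(grades) <= max_fail_ratio


print(passes_gate(["B", "C", "B", "E", "A"]))  # True
print(passes_gate(["B", "D", "D", "E", "A"]))  # False
```

A check like this could run in a pipeline step after scoring, blocking a release when too many answers conflict with the expected ones.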