# Building an Evaluation Pipeline

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/wandb/weave/blob/master/docs/docs/guides/cookbooks/llamaindex_rag_ncert/notebooks/04_evaluation.ipynb)

To iterate on any AI application, we need a way to systematically evaluate its performance and check whether it is improving. A common practice is to test it against the same set of examples every time something changes. In this recipe, we will build an evaluation pipeline for the responses of our AI assistant using [`weave.Evaluation`](https://wandb.github.io/weave/guides/core-types/evaluations), a flexible API that gives us a first-class way to track evaluations.

## Install the dependencies

First, let us install all the libraries we need to build the application.

In [None]:
!pip install -qU rich
!pip install -U instructor
!pip install -qU wandb
!pip install -qU git+https://github.com/wandb/weave.git@feat/groq
!pip install -qU llama-index groq
!pip install -qU llama-index-embeddings-huggingface

In [None]:
from getpass import getpass
from typing import Dict, Optional, Tuple

import instructor
import wandb
from groq import Groq
from llama_index.core import ServiceContext, StorageContext, load_index_from_storage
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from openai import OpenAI
from pydantic import BaseModel

import weave

## Building the Assistant

In this recipe, we will demonstrate an evaluation strategy for the `EnglishStudentResponseAssistant`.

In [None]:
weave.init(project_name="groq-rag")

artifact = wandb.Api().artifact(
    "geekyrakshit/groq-rag/ncert-flamingoes-prose-embeddings:latest"
)
artifact_dir = artifact.download()

embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
service_context = ServiceContext.from_defaults(embed_model=embed_model, llm=None)
storage_context = StorageContext.from_defaults(persist_dir=artifact_dir)
index = load_index_from_storage(storage_context, service_context=service_context)
retreival_engine = index.as_retriever(
    service_context=service_context,
    similarity_top_k=10,
)

In [None]:
GROQ_API_KEY = getpass("Enter your GROQ API key: ")
OPENAI_API_KEY = getpass("Enter your OpenAI API key: ")

In [None]:
class EnglishStudentResponseAssistant(weave.Model):
    model: str = "llama3-8b-8192"
    _groq_client: Optional[Groq] = None

    def __init__(self, model: Optional[str] = None):
        super().__init__()
        self.model = model if model is not None else self.model
        self._groq_client = Groq(api_key=GROQ_API_KEY)

    @weave.op()
    def get_prompt(
        self, question: str, context: str, word_limit_min: int, word_limit_max: int
    ) -> Tuple[str, str]:
        system_prompt = """
You are a student in a class and your teacher has asked you to answer the following question.
You have to write the answer in the given word limit."""
        user_prompt = f"""
We have provided context information below.

---
{context}
---

Answer the following question within {word_limit_min}-{word_limit_max} words:

---
{question}
---"""
        return system_prompt, user_prompt

    @weave.op()
    def predict(self, question: str, total_marks: int) -> Dict[str, str]:
        response = retreival_engine.retrieve(question)
        context = response[0].node.text
        if total_marks < 3:
            word_limit_min = 5
            word_limit_max = 50
        elif total_marks < 5:
            word_limit_min = 50
            word_limit_max = 100
        else:
            word_limit_min = 100
            word_limit_max = 200
        system_prompt, user_prompt = self.get_prompt(
            question, context, word_limit_min, word_limit_max
        )
        chat_completion = self._groq_client.chat.completions.create(
            messages=[
                {
                    "role": "system",
                    "content": system_prompt,
                },
                {
                    "role": "user",
                    "content": user_prompt,
                },
            ],
            model=self.model,
        )
        return {
            "response": chat_completion.choices[0].message.content,
            "context": context,
        }
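
The marks-to-word-limit rule inside `predict` can be pulled out into a small pure function, which makes it easy to check without calling the retriever or the Groq API. The following is a minimal sketch: `word_limits_for_marks` is a hypothetical helper name that does not appear in the notebook, and the thresholds are copied from `predict` above.

```python
from typing import Tuple


def word_limits_for_marks(total_marks: int) -> Tuple[int, int]:
    """Return (min_words, max_words) for a question worth `total_marks`.

    Hypothetical helper; the thresholds mirror the branches in `predict`.
    """
    if total_marks < 3:
        return 5, 50
    if total_marks < 5:
        return 50, 100
    return 100, 200


# Print the word range chosen for a few mark values.
for marks in (2, 4, 6):
    print(marks, word_limits_for_marks(marks))
```

Keeping the rule in one place means a change to the word limits only has to be made once.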

## Building an Evaluation Dataset

We built an evaluation dataset by scraping a question bank of solved question-answer pairs for the Flamingo textbook from [LearnCBSE](https://www.learncbse.in/chapter-wise-important-questions-class-12-english/). The dataset consists of 358 question-answer pairs covering the 8 chapters in our knowledge base. We log this dataset as a [`weave.Dataset`](https://wandb.github.io/weave/guides/core-types/datasets), which lets us collect examples for evaluation and automatically track versions for accurate comparisons. Each example has the following format:

```json
{
  "question": "What was the mood in the classroom when M. Hamel gave his last French lesson? ",
  "answer": "When M.Hamel was giving his last French ; lesson, the mood in the classroom was solemn and sombre. When he announced that this was their last French lesson everyone present in the classroom suddenly developed patriotic feelings for their native language and genuinely regretted ignoring their mother tongue.",
  "marks": "3-4",
  "chapter_name": "The Last Lesson"
}
```

You can explore the evaluation dataset in the Weave UI [here](https://wandb.ai/geekyrakshit/groq-rag/weave/objects/flamingos-prose-question-bank/versions/sfx9Qg4FYq4eOEBt5SjNdkhcjqXJaqRN4CGvlCQAFcU).

![](../images/weave_evaluation_dataset.gif)
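
Because the rows are scraped, it can help to sanity-check them before logging or evaluating. The sketch below uses only the standard library and assumes each row is a plain dictionary with the four fields shown above. `find_row_problems` and `REQUIRED_KEYS` are hypothetical names, and the second row is a deliberately broken illustration, not real data from the dataset.

```python
from collections import Counter
from typing import Dict, List

# Fields every example is expected to carry, per the format shown above.
REQUIRED_KEYS = ("question", "answer", "marks", "chapter_name")


def find_row_problems(row: Dict[str, str]) -> List[str]:
    """List the required fields that are missing or blank in `row`."""
    problems = []
    for key in REQUIRED_KEYS:
        value = row.get(key)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"missing or empty field: {key}")
    return problems


rows = [
    {
        "question": "What was the mood in the classroom when M. Hamel gave his last French lesson?",
        "answer": "The mood in the classroom was solemn and sombre.",
        "marks": "3-4",
        "chapter_name": "The Last Lesson",
    },
    # Illustrative broken row: blank question and no chapter name.
    {"question": " ", "answer": "placeholder", "marks": "5-6"},
]

valid_rows = [row for row in rows if not find_row_problems(row)]
chapter_counts = Counter(row["chapter_name"] for row in valid_rows)
print(chapter_counts)
```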

## Evaluating with an LLM Judge

One approach to evaluating an LLM application is to use another LLM as a judge. In this recipe, we demonstrate a simple example of an LLM judge implemented as a `weave.Scorer`. It tries to measure the correctness of the AI assistant's response by checking whether the response is relevant to the retrieved context and how well it holds up against the ground-truth answer from the dataset.

In [None]:
class JudgeResponse(BaseModel):
    marks: float
    explanation: str


class OpenaAIJudgeModel(weave.Scorer):
    model: str = "gpt-4"
    max_retries: int = 5
    _openai_client: Optional[instructor.Instructor] = None

    def __init__(self, model: Optional[str] = None):
        super().__init__()
        self.model = model if model is not None else self.model
        self._openai_client = instructor.from_openai(
            OpenAI(api_key=OPENAI_API_KEY),
            mode=instructor.Mode.TOOLS,
        )

    @weave.op()
    def compose_judgement(
        self,
        question: str,
        context: str,
        ground_truth_answer: str,
        assistant_answer: str,
        total_marks: int,
    ) -> JudgeResponse:
        system_prompt = """
You are an expert teacher of English language and literature.
Given a question, a context, a ground truth answer and an answer from an AI assistant,
you have to judge the assistant's answer based on the following criteria and assign
a score between 0 and total marks:

1. how well the assistant answers the question with respect to the context.
2. how well the assistant's answer holds up in correctness and relevance to
    the ground truth answer (assuming the ground truth answer is perfect).

You have to extract the marks to be awarded to the assistant's answer and a detailed
explanation as to how the assistant's answer was judged."""
        user_prompt = f"""
We have asked the following question to an AI assistant for total marks of {total_marks}:

---
{question}
---

We have provided context information below.

---
{context}
---

The AI assistant has responded with the following answer:

---
{assistant_answer}
---

An ideal answer to the question would be the following:

---
{ground_truth_answer}
---"""
        return self._openai_client.chat.completions.create(
            messages=[
                {
                    "role": "system",
                    "content": system_prompt,
                },
                {
                    "role": "user",
                    "content": user_prompt,
                },
            ],
            max_retries=self.max_retries,
            model=self.model,
            response_model=JudgeResponse,
        )

    @weave.op()
    def score(
        self,
        question: str,
        answer: str,
        marks: str,
        model_output: Dict[str, str],
    ) -> Dict[str, float]:
        if marks == "3-4":
            total_marks = 4
        elif marks == "5-6":
            total_marks = 6
        else:
            total_marks = 4
        judge_response = self.compose_judgement(
            question=question,
            context=model_output["context"],
            ground_truth_answer=answer,
            assistant_answer=model_output["response"],
            total_marks=total_marks,
        )
        if not hasattr(judge_response, "marks"):
            return {"marks": 0.0, "fractional_marks": 0.0, "percentage": 0.0}
        return {
            "marks": judge_response.marks,
            "fractional_marks": judge_response.marks / total_marks,
            "percentage": (judge_response.marks / total_marks) * 100,
        }
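
The `score` method divides the judge's marks by `total_marks` without checking their range. Since an LLM judge may occasionally return a value outside the interval from 0 to `total_marks`, one option is to clamp the value before normalising it. The sketch below is an assumption about how such a guard could look; `normalise_marks` is a hypothetical helper and is not part of the original notebook or of Weave.

```python
from typing import Dict


def normalise_marks(awarded: float, total_marks: int) -> Dict[str, float]:
    """Clamp `awarded` to [0, total_marks] and derive the same three metrics
    that `score` returns. Hypothetical helper, for illustration only."""
    if total_marks <= 0:
        raise ValueError("total_marks must be positive")
    clamped = min(max(float(awarded), 0.0), float(total_marks))
    fraction = clamped / total_marks
    return {
        "marks": clamped,
        "fractional_marks": fraction,
        "percentage": fraction * 100,
    }


print(normalise_marks(3, 4))
print(normalise_marks(7, 6))  # judge overshoots the maximum; value is clamped
```

Clamping keeps aggregate metrics within 0 to 100 percent, at the cost of hiding how often the judge ignores the stated maximum, so it may be worth logging the raw value as well.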

## Evaluating our LLM Application

Finally, let us put everything together and evaluate our LLM assistant using [`weave.Evaluation`](https://wandb.github.io/weave/guides/core-types/evaluations).

In [None]:
assistant = EnglishStudentResponseAssistant()


@weave.op()
async def get_assistant_prediction(question: str, marks: str):
    if marks == "3-4":
        total_marks = 4
    elif marks == "5-6":
        total_marks = 6
    else:
        total_marks = 4
    return assistant.predict(question, total_marks)
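
The same marks-band mapping appears both here and in the judge's `score` method. A small shared helper is one way to keep the two in sync. This is a sketch only: `BAND_TO_TOTAL` and `total_marks_from_band` are hypothetical names, and the mapping and the fallback of 4 are copied from the branches above.

```python
from typing import Dict

# Marks bands seen in the dataset, mapped to the maximum marks of each band.
BAND_TO_TOTAL: Dict[str, int] = {"3-4": 4, "5-6": 6}


def total_marks_from_band(marks: str, default: int = 4) -> int:
    """Map a marks band such as "3-4" to its total marks.

    Unknown bands fall back to `default`, matching the `else` branch
    in the notebook code.
    """
    return BAND_TO_TOTAL.get(marks, default)


for band in ("3-4", "5-6", "2"):
    print(band, total_marks_from_band(band))
```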

In [None]:
dataset = weave.ref("flamingos-prose-question-bank:v1").get()
evaluation = weave.Evaluation(dataset=dataset, scorers=[OpenaAIJudgeModel()])
await evaluation.evaluate(get_assistant_prediction)

Using the [`weave.Evaluation`](https://wandb.github.io/weave/guides/core-types/evaluations) class, you can be sure you are comparing apples to apples, because it keeps track of all the details of what you are experimenting with and evaluating. Weave takes each example, passes it through your application, and scores the output with one or more custom scoring functions. This gives you an overall view of your application's performance and a rich UI to drill into individual outputs and scores.

![](../images/weave_evaluation_dashboard.gif)
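
To make the flow described above concrete, here is a plain-Python sketch of the general pattern: run every example through the application, apply each scorer, and average the results. It is not Weave's implementation and does no tracking. `run_evaluation`, `toy_predict`, the two scorers, and the two examples are all hypothetical, and the example list is assumed to be non-empty.

```python
from statistics import mean
from typing import Callable, Dict, List

Example = Dict[str, str]
Scorer = Callable[[Example, str], float]


def run_evaluation(
    examples: List[Example],
    predict: Callable[[str], str],
    scorers: Dict[str, Scorer],
) -> Dict[str, float]:
    """Score every example with every scorer and return the mean per scorer."""
    per_scorer: Dict[str, List[float]] = {name: [] for name in scorers}
    for example in examples:
        output = predict(example["question"])
        for name, scorer in scorers.items():
            per_scorer[name].append(scorer(example, output))
    return {name: mean(values) for name, values in per_scorer.items()}


def toy_predict(question: str) -> str:
    # Stand-in for the assistant: always gives the same answer.
    return "4"


def exact_match(example: Example, output: str) -> float:
    return float(output == example["answer"])


def non_empty(example: Example, output: str) -> float:
    return float(bool(output.strip()))


examples = [
    {"question": "2 + 2", "answer": "4"},
    {"question": "3 + 3", "answer": "6"},
]
summary = run_evaluation(
    examples, toy_predict, {"exact_match": exact_match, "non_empty": non_empty}
)
print(summary)
```

Because the example set is fixed, rerunning this after changing `toy_predict` gives directly comparable numbers, which is the property the evaluation pipeline relies on.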