# Chapter 2:

CHANGELOG (commented out code):
- using eval dataset as weave.Dataset

**Comprehensive Evaluation Strategies**

In this chapter, we will evaluate the two main components of a RAG pipeline: the retriever and the response generator.

Evaluating the retriever can be considered component evaluation. A RAG pipeline can have several components, and to keep the system robust it is recommended to design an evaluation for each component.


In [1]:
%load_ext autoreload
%autoreload 2


import json
import pathlib

import nest_asyncio
import pandas as pd

import wandb

nest_asyncio.apply()
import asyncio

import weave
from dotenv import load_dotenv

from scripts.utils import display_source

load_dotenv()

True

In the last chapter, we used a few features from W&B and initialized a W&B run with `wandb.init`.

In this chapter, we will also use W&B Weave for evaluation. `weave.Evaluation` is a lightweight class that evaluates the performance of a `weave.Model` on a `weave.Dataset`. We will go into more detail later in the chapter.

We will initialize a Weave client, which will track traces and evaluation scores.

In [2]:
WANDB_ENTITY = "rag-course"
WANDB_PROJECT = "dev"

wandb.require("core")

run = wandb.init(
    entity=WANDB_ENTITY,
    project=WANDB_PROJECT,
    group="Chapter 2",
)

weave_client = weave.init(f"{WANDB_ENTITY}/{WANDB_PROJECT}")

Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.
[34m[1mwandb[0m: Using wandb-core as the SDK backend. Please refer to https://wandb.me/wandb-core for more information.
[34m[1mwandb[0m: Currently logged in as: [33mayut[0m ([33mrag-course[0m). Use [1m`wandb login --relogin`[0m to force relogin


weave version 0.50.10 is available!  To upgrade, please run:
 $ pip install weave --upgrade
Logged in as Weights & Biases user: ayut.
View Weave data at https://wandb.ai/rag-course/dev/weave


## Building and improving an evaluation dataset

### Collecting data for evaluation

We used the [FAQs](https://docs.wandb.ai/guides/technical-faq) section of the W&B docs website to build our evaluation set.

The evaluation samples are logged as a [`weave.Dataset`](https://wandb.github.io/weave/guides/core-types/datasets/). A `weave.Dataset` enables you to collect examples for evaluation and automatically tracks versions for accurate comparisons.

Below, we download the latest version locally with a simple API.

In [11]:
eval_dataset = weave.ref(
    "weave:///rag-course/dev/object/Dataset:9O0EmmPINmYjgbXW3kucVrDxlTUQJQs0fVZYJj2mtOk"
).get()

Iterating through each sample is easy.

We have the question, ground truth answer and ground truth contexts.

In [12]:
dict(eval_dataset.rows[0])

{'question': 'How can I access the run object from the Lightning WandBLogger function?',
 'answer': "In PyTorch Lightning, the `WandbLogger` is used to log metrics, model weights, and other data to Weights & Biases during training. To access the `wandb.Run` object from within a `LightningModule` when using `WandbLogger`, you can use the `Trainer.logger.experiment` attribute. This attribute provides direct access to the underlying `wandb.Run` object, allowing you to interact with the Weights & Biases API directly.\n\nHere's how you can access the `wandb.Run` object using `WandbLogger` in PyTorch Lightning:\n\n```python\nfrom pytorch_lightning import Trainer, LightningModule\nfrom pytorch_lightning.loggers import WandbLogger\n\nclass MyModel(LightningModule):\n    def training_step(self, batch, batch_idx):\n        # Your training logic here\n        loss = ...\n\n        # Log metrics\n        self.log('train_loss', loss)\n\n        # Access the wandb.Run object\n        run = self.trai

In [4]:
# TODO: Remove this once we move to the final project

# from datetime import datetime

# run = wandb.init(
#     entity="rag-course",
#     project="dev",
#     group="Chapter 2",
# )
# eval_artifact = wandb.Artifact(
#     name="eval_dataset",
#     type="dataset",
#     description="Evaluation dataset for RAG",
#     metadata={
#         "total_samples": {"easy_eval": 20, "hard_eval": 50, "test": 100},
#         "date_collected": datetime.now().strftime("%Y-%m-%d"),
#         "chapter": "Chapter 2",
#     },
# )
# eval_artifact.add_dir("../data/eval")
# run.log_artifact(eval_artifact)
# run.finish()

In [5]:
# TODO: to be removed

# eval_artifact = run.use_artifact(
#     f"{WANDB_ENTITY}/{WANDB_PROJECT}/eval_dataset:latest", type="dataset"
# )
# eval_dir = eval_artifact.download("../data/eval")
# eval_dataset = pd.read_json(
#     f"{eval_dir}/eval_dataset.jsonl", lines=True, orient="records"
# )
# eval_samples = eval_dataset.to_dict(orient="records")
# eval_dataset

### Evaluating the Retriever

Retrieval is a search problem, so it is easiest to start with traditional information retrieval (IR) metrics.


ref: https://weaviate.io/blog/retrieval-evaluation-metrics
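To make the idea of traditional IR metrics concrete, here is a minimal sketch of three common ones: hit rate, precision@k, and reciprocal rank. The function names and the toy document ids are illustrative assumptions; this is not the implementation behind `IR_METRICS` in `scripts.retrieval_metrics`, which is displayed later in this section.

```python
def hit_rate(retrieved, relevant):
    """Return 1.0 if at least one retrieved document is relevant, else 0.0."""
    return float(any(doc in relevant for doc in retrieved))


def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    return sum(doc in relevant for doc in retrieved[:k]) / k


def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


# Toy example (hypothetical ids): ranked retriever output and ground-truth contexts.
retrieved = ["doc_3", "doc_7", "doc_1", "doc_9", "doc_4"]
relevant = {"doc_7", "doc_4"}

print("hit rate:", hit_rate(retrieved, relevant))
print("precision@5:", precision_at_k(retrieved, relevant, k=5))
print("reciprocal rank:", reciprocal_rank(retrieved, relevant))
```

Averaging the reciprocal rank over all evaluation questions gives the mean reciprocal rank (MRR). All three metrics need ground-truth contexts, which is why our evaluation set stores them alongside each question.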

**TODO** Add weave model and evals in this section

In [5]:
# Reload the data from Chapter 1
chunked_artifact = run.use_artifact(
    f"{WANDB_ENTITY}/{WANDB_PROJECT}/chunked_data:latest", type="dataset"
)
artifact_dir = chunked_artifact.download()
chunked_data_file = pathlib.Path(f"{artifact_dir}/documents.jsonl")
chunked_data = list(map(json.loads, chunked_data_file.read_text().splitlines()))
chunked_data[:2]



[{'cleaned_content': 'Anonymous Mode Are you publishing code that you want anyone to be able to run easily? Use Anonymous Mode to let someone run your code, see a W&B dashboard, and visualize results without needing to create a W&B account first. Allow results to be logged in Anonymous Mode with wandb.init(anonymous="allow") :::info Publishing a paper? Please cite W&B, and if you have questions about how to make your code accessible while using W&B, reach out to us at support@wandb.com.\n::: How does someone without an account see results? If someone runs your script and you have to set anonymous="allow":  Auto-create temporary account: W&B checks for an account that\'s already signed in. If there\'s no account, we automatically create a new anonymous account and save that API key for the session. Log results quickly: The user can run and re-run the script, and automatically see results show up in the W&B dashboard UI.\nThese unclaimed anonymous runs will be available for 7 days. Claim

We will import the `TFIDFRetriever` which we created in the last chapter and index the chunked data.

In [9]:
from scripts.retriever import TFIDFRetriever

display_source(TFIDFRetriever)

retriever = TFIDFRetriever()
retriever.index_data(chunked_data)

In [10]:
from scripts.retrieval_metrics import IR_METRICS, LLM_METRICS as RETRIEVAL_LLM_METRICS
from scripts.utils import display_source

for scorer in IR_METRICS:
    display_source(scorer)

#### Evaluating retrieval on other metrics

In [13]:
retrieval_evaluation = weave.Evaluation(
    name="Retrieval_Evaluation",
    dataset=eval_dataset,
    scorers=IR_METRICS,
    preprocess_model_input=lambda x: {"query": x["question"], "k":5}
)
retrieval_scores = asyncio.run(retrieval_evaluation.evaluate(retriever))

🍩 https://wandb.ai/rag-course/dev/r/call/abeb2388-df96-47ae-b11b-80bb8748c52d


### Using an LLM as a Retrieval Judge

**ref: https://arxiv.org/pdf/2406.06519**

How do we evaluate if we don't have any ground truth? 

We can use a powerful LLM as a judge to evaluate the retriever. 
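As a rough sketch of how such a judge could work, the code below asks an LLM to grade each retrieved document against the query and averages the grades. Everything here is an assumption for illustration: the prompt wording, the 0-2 scale, the `judge_retrieval` name, and the `llm` argument (any callable that maps a prompt string to a reply string). A stub stands in for the real model call, and this is not the code behind `RETRIEVAL_LLM_METRICS`.

```python
import json
from typing import Callable, List

# Hypothetical prompt; doubled braces are literal braces after str.format().
JUDGE_PROMPT = """You are grading search results for a question-answering system.
Question: {query}
Document: {document}
Rate how relevant the document is to the question on a 0-2 scale
(0 = irrelevant, 1 = partially relevant, 2 = highly relevant).
Reply with JSON only: {{"relevance": <0|1|2>}}"""


def judge_retrieval(query: str, documents: List[str], llm: Callable[[str], str]) -> dict:
    """Grade each document with the LLM and return per-document and mean scores."""
    scores = []
    for document in documents:
        reply = llm(JUDGE_PROMPT.format(query=query, document=document))
        try:
            score = int(json.loads(reply)["relevance"])
        except (ValueError, KeyError, TypeError):
            score = 0  # treat unparseable replies as irrelevant
        scores.append(min(max(score, 0), 2))  # clamp to the 0-2 scale
    mean = sum(scores) / len(scores) if scores else 0.0
    return {"scores": scores, "mean_relevance": mean}


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call: a keyword match instead of a judgement."""
    return '{"relevance": 2}' if "anonymous" in prompt.lower() else '{"relevance": 0}'


docs = [
    'Use Anonymous Mode: wandb.init(anonymous="allow")',
    "Sweeps automate hyperparameter search.",
]
result = judge_retrieval("How do I log results without an account?", docs, fake_llm)
print(result)
```

Note that no ground-truth contexts are needed: the judge only sees the query and the retrieved text. Parsing the reply defensively matters in practice because a model does not always return valid JSON.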


In [14]:
for metric in RETRIEVAL_LLM_METRICS:
    display_source(metric)

In [15]:
retrieval_evaluation = weave.Evaluation(
    name="LLM_Judge_Retrieval_Evaluation",
    dataset=eval_dataset,
    scorers=RETRIEVAL_LLM_METRICS,
    preprocess_model_input=lambda x: {"query": x["question"], "k":5}
)
retrieval_scores = asyncio.run(retrieval_evaluation.evaluate(retriever))

🍩 https://wandb.ai/rag-course/dev/r/call/51239910-956f-4e7c-840a-8debb0a03f8f


## Evaluating the Response

In [16]:
from scripts.rag_pipeline import SimpleRAGPipeline
from scripts.response_generator import SimpleResponseGenerator

# Read the system prompt without leaving a file handle open.
INITIAL_PROMPT = pathlib.Path("prompts/initial_system.txt").read_text()
response_generator = SimpleResponseGenerator(model="command-r", prompt=INITIAL_PROMPT)
rag_pipeline = SimpleRAGPipeline(retriever=retriever, response_generator=response_generator, top_k=5)

In [17]:
from scripts.response_metrics import NLP_METRICS, LLM_METRICS as RESPONSE_LLM_METRICS
for scorer in NLP_METRICS:
    display_source(scorer)

In [19]:
response_evaluations = weave.Evaluation(
    name="Response_Evaluation",
    dataset=eval_dataset,
    scorers=NLP_METRICS[:-1],
    preprocess_model_input=lambda x: {"query": x["question"]})
response_scores = asyncio.run(response_evaluations.evaluate(rag_pipeline))

🍩 https://wandb.ai/rag-course/dev/r/call/0a699274-7490-4956-a0ec-0ee7c468ac94
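For intuition about what a traditional NLP metric measures, here is a small sketch of a token-overlap F1 score between a generated answer and a reference answer. This is an illustrative stand-in chosen for this example, not necessarily one of the scorers in `NLP_METRICS`; the `tokenize` and `token_f1` names are hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def token_f1(prediction, reference):
    """Harmonic mean of token precision and recall between two strings."""
    pred_tokens = tokenize(prediction)
    ref_tokens = tokenize(reference)
    if not pred_tokens or not ref_tokens:
        # Two empty strings match perfectly; one empty string matches nothing.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often as it
    # appears in both strings.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


score = token_f1(
    "use the experiment attribute",
    "use the logger experiment attribute",
)
print(round(score, 3))
```

Scores like this are cheap and deterministic, but they only reward surface overlap, so a correct answer phrased differently from the reference can score poorly.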


### Using an LLM as a Response Judge

Some qualities of a response cannot be measured objectively with traditional metrics. An LLM judge is particularly useful for these more subjective or complex criteria.
We care about correctness, relevance, and faithfulness.

- **Answer Correctness** - Is the generated answer correct compared to the reference, and does it thoroughly answer the user's query?
- **Answer Relevancy** - Is the generated answer relevant and comprehensive?
- **Answer Faithfulness** - Is the generated answer factually consistent with the context documents?
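A correctness judge along these lines can be sketched as a prompt that shows the model the question, the reference answer, and the generated answer, followed by a strict parser for its verdict. The prompt text, the three verdict labels and their scores, and the `judge_correctness` name are all assumptions made for this sketch; `llm` is any callable from prompt string to reply string, stubbed here. The actual judges used in this chapter are the ones in `RESPONSE_LLM_METRICS`.

```python
from typing import Callable

# Hypothetical prompt and verdict scale for illustration only.
CORRECTNESS_PROMPT = """You are evaluating a question-answering system.
Question: {question}
Reference answer: {reference}
Generated answer: {generated}
Compare the generated answer with the reference answer.
Reply with exactly one word: correct, partially_correct, or incorrect."""

VERDICT_SCORES = {"correct": 1.0, "partially_correct": 0.5, "incorrect": 0.0}


def judge_correctness(
    question: str, reference: str, generated: str, llm: Callable[[str], str]
) -> dict:
    """Ask the LLM for a verdict and map it to a numeric score."""
    prompt = CORRECTNESS_PROMPT.format(
        question=question, reference=reference, generated=generated
    )
    verdict = llm(prompt).strip().lower()
    if verdict not in VERDICT_SCORES:
        # Flag replies that do not follow the format instead of guessing.
        return {"verdict": "unparseable", "score": None}
    return {"verdict": verdict, "score": VERDICT_SCORES[verdict]}


# Stub model that always answers "Correct" (with stray whitespace).
verdict = judge_correctness(
    question="How can I access the run object from the Lightning WandbLogger?",
    reference="Use the Trainer.logger.experiment attribute.",
    generated="You can reach it through trainer.logger.experiment.",
    llm=lambda prompt: " Correct\n",
)
print(verdict)
```

Returning `None` for unparseable replies keeps formatting failures separate from genuinely incorrect answers, so they can be counted and inspected rather than silently lowering the average score.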


In [20]:
for metric in RESPONSE_LLM_METRICS:
    display_source(metric)

In [21]:
correctness_evaluations = weave.Evaluation(
    name="Correctness_Evaluation",
    dataset=eval_dataset,
    scorers=RESPONSE_LLM_METRICS, 
    preprocess_model_input=lambda x: {"query": x["question"]})
response_scores = asyncio.run(correctness_evaluations.evaluate(rag_pipeline))

🍩 https://wandb.ai/rag-course/dev/r/call/7f34a54b-af15-468f-bd0a-5330e61e749c


## Exercise

1. Implement the `Relevance` and `Faithfulness` evaluators and evaluate the pipeline on all the dimensions.
2. Generate and share a W&B report with the following sections in the form of tables and charts:
    
    - Summary of the evaluation
    - Retrieval Evaluations
        - IR Metrics
        - LLM as a Retrieval Judge Metrics
    - Response Evaluations
        - Traditional NLP Metrics
        - LLM Judgement Metrics
    - Overall Evaluations
    - Conclusion
