# Evaluating RAG with the FMEval Library (PoC)

## Overview 
The goal of this PoC is to demonstrate the user interface for evaluating a RAG application with the FMEval library, using the [Faithfulness](https://docs.ragas.io/en/stable/concepts/metrics/faithfulness.html) metric as an example.

**Background**

FMEval library architecture:

<div>
<img src="images/fmeval_arch.png" width="600"/>
</div>

<div>
<img src="images/fmeval_arch_2.png" width="600"/>
</div>

1. Data components: The FMEval library provides `DataLoaders` that read data in JSON and JSON Lines formats and load it as Ray Datasets. Ray was chosen as the data computation framework for its performance and its Python-native distributed computing, which makes the code easier to debug and maintain.
2. Model components: `ModelRunner` encapsulates the logic for invoking LLMs and exposes a `predict` method that simplifies calling LLMs from evaluation algorithm code. Users can extend the interface for their own LLMs.
3. Evaluation components: Implementations of popular metrics (eval algorithms) such as Accuracy, Toxicity, Semantic Robustness, and Prompt Stereotyping for evaluating LLMs across different tasks. **This PoC introduces new eval algorithms, such as Faithfulness, for evaluating RAG.**
4. Reporting components: Implements reporting modules.
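
To make the `ModelRunner` extension point concrete, here is a minimal sketch of the pattern. It deliberately does not import fmeval: `ModelRunnerSketch` and `CannedModelRunner` are hypothetical stand-ins, and the `(output, log_probability)` return shape is an assumption made for illustration. Check the library's own `ModelRunner` base class for the real signature.

```python
from abc import ABC, abstractmethod
from typing import Dict, Optional, Tuple


class ModelRunnerSketch(ABC):
    """Hypothetical stand-in for a model-runner interface (not fmeval's class)."""

    @abstractmethod
    def predict(self, prompt: str) -> Tuple[Optional[str], Optional[float]]:
        """Return (generated text, log probability); either may be None."""


class CannedModelRunner(ModelRunnerSketch):
    """Toy runner that answers from a lookup table instead of calling an LLM."""

    def __init__(self, answers: Dict[str, str]):
        self._answers = answers

    def predict(self, prompt: str) -> Tuple[Optional[str], Optional[float]]:
        # A real runner would call a model endpoint here.
        return self._answers.get(prompt), None


runner = CannedModelRunner({"When was the first Super Bowl?": "January 15, 1967"})
output, log_prob = runner.predict("When was the first Super Bowl?")
print(output)
```

Because evaluation code only calls `predict`, swapping one runner for another does not require changes to the evaluation algorithm itself.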

### Setting up the environment

In [1]:
%load_ext autoreload
%autoreload 2

In [3]:
!pip install --upgrade --force-reinstall ../dist/fmeval-0.4.0-py3-none-any.whl

Processing /Users/xiayche/workplace3/fmeval/dist/fmeval-0.4.0-py3-none-any.whl
Collecting IPython (from fmeval==0.4.0)
  Using cached ipython-8.22.2-py3-none-any.whl.metadata (4.8 kB)
Collecting aiohttp<4.0.0,>=3.9.2 (from fmeval==0.4.0)
  Using cached aiohttp-3.9.3-cp310-cp310-macosx_11_0_arm64.whl.metadata (7.4 kB)
Collecting bert-score<0.4.0,>=0.3.13 (from fmeval==0.4.0)
  Using cached bert_score-0.3.13-py3-none-any.whl.metadata (15 kB)
Collecting evaluate<0.5.0,>=0.4.0 (from fmeval==0.4.0)
  Using cached evaluate-0.4.1-py3-none-any.whl.metadata (9.4 kB)
Collecting grpcio<2.0.0,>=1.60.0 (from fmeval==0.4.0)
  Using cached grpcio-1.62.1-cp310-cp310-macosx_12_0_universal2.whl.metadata (4.0 kB)
Collecting ipykernel<7.0.0,>=6.26.0 (from fmeval==0.4.0)
  Using cached ipykernel-6.29.4-py3-none-any.whl.metadata (6.3 kB)
Collecting jiwer<4.0.0,>=3.0.3 (from fmeval==0.4.0)
  Using cached jiwer-3.0.3-py3-none-any.whl.metadata (2.6 kB)
Collecting markdown (from fmeval==0.4.0)
  Using cached Ma

### Eval Algorithm and Data Config Setup

In [1]:
from fmeval.eval_algorithms.qa_ragas import RagasFaithfulness
from fmeval.data_loaders.data_config import DataConfig

eval_algo = RagasFaithfulness()

data_config = DataConfig(
    dataset_name="fiqa_sample",
    dataset_uri="fiqa_sample.jsonl",
    dataset_mime_type="application/jsonlines",
    model_input_location="question",
    model_output_location="answer",
    contexts_location="contexts",
    target_output_location="ground_truths"
)

sagemaker.config INFO - Not applying SDK defaults from location: /Library/Application Support/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /Users/xiayche/Library/Application Support/sagemaker/config.yaml
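
The `*_location` arguments name the keys that are read from each JSON Lines record. The sketch below writes one record with the four keys used above and reads it back. The values are invented for illustration (they are not taken from `fiqa_sample.jsonl`), and the assumption that `contexts` and `ground_truths` hold lists of strings is ours.

```python
import json
import os
import tempfile

# Hypothetical record: the keys match the DataConfig above, the values are invented.
record = {
    "question": "When was the first Super Bowl?",
    "answer": "The first Super Bowl was held on January 15, 1967.",
    "contexts": ["The First AFL-NFL World Championship Game was played on January 15, 1967."],
    "ground_truths": ["January 15, 1967"],
}

# JSON Lines: one JSON object per line. A temp directory avoids touching real data.
path = os.path.join(tempfile.mkdtemp(), "sample.jsonl")
with open(path, "w", encoding="utf-8") as f:
    f.write(json.dumps(record) + "\n")

with open(path, encoding="utf-8") as f:
    rows = [json.loads(line) for line in f if line.strip()]

print(sorted(rows[0].keys()))
```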


### Optional: Bring your own LLM and Embedding Models

In [2]:
from fmeval.ragas.util import get_bedrock_model, get_bedrock_embedding

# Use the provided utility functions to get a Bedrock model and embeddings.
model = get_bedrock_model()
embeddings = get_bedrock_embedding()

# Alternatively, use OpenAI GPT-4:
# import os
# from langchain_openai.chat_models import ChatOpenAI
# os.environ["OPENAI_API_KEY"] = "your-openai-api-key"
# model = ChatOpenAI(model_name="gpt-4")

### Run Evaluation

In [3]:
eval_output = eval_algo.evaluate(
    dataset_config=data_config,
    llm=model,  # If omitted, the default Bedrock Claude model is used
    embeddings=embeddings,  # If omitted, the default Bedrock Titan text embedding model is used
)

2024-03-27 23:50:42,895	INFO worker.py:1724 -- Started a local Ray instance.


Read progress 0:   0%|          | 0/1 [00:00<?, ?it/s]


  return transform_pyarrow.concat(tables)
2024-03-27 23:50:45,463	INFO streaming_executor.py:112 -- Executing DAG InputDataBuffer[Input] -> AllToAllOperator[Repartition]
2024-03-27 23:50:45,464	INFO streaming_executor.py:113 -- Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), exclude_resources=ExecutionResources(cpu=0, gpu=0, object_store_memory=0), locality_with_output=False, preserve_order=True, actor_locality_enabled=True, verbose_progress=False)
2024-03-27 23:50:45,464	INFO streaming_executor.py:115 -- Tip: For detailed progress reporting, run `ray.data.DataContext.get_current().execution_options.verbose_progress = True`


- Repartition 1:   0%|          | 0/45 [00:00<?, ?it/s]

Split Repartition 2:   0%|          | 0/45 [00:00<?, ?it/s]

Running 0:   0%|          | 0/45 [00:00<?, ?it/s]

Map:   0%|          | 0/2 [00:00<?, ? examples/s]

Evaluating:   0%|          | 0/2 [00:00<?, ?it/s]

In [4]:
# Pretty-print the evaluation output (notice the score).
import json
print(json.dumps(eval_output, default=vars, indent=4))

[
    {
        "eval_name": "Ragas_faithfulness_poc",
        "dataset_name": "fiqa_sample",
        "dataset_scores": [
            {
                "name": "faithfulness",
                "value": 0.6666666666666666
            }
        ],
        "prompt_template": null,
        "category_scores": null,
        "output_path": "/tmp/eval_results/Ragas_faithfulness_poc_fiqa_sample.jsonl",
        "error": null
    }
]
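
For reading the score: Ragas defines faithfulness as the fraction of claims in the answer that the retrieved context supports. The sketch below shows only that arithmetic. In the real metric an LLM extracts the claims and judges each one; the verdict lists here are invented. The run above evaluated two samples, so 0.667 is an aggregate. Assuming the dataset score is the mean of per-sample scores, the split shown is just one combination that would produce it.

```python
from statistics import mean
from typing import List


def faithfulness_score(verdicts: List[bool]) -> float:
    """Fraction of answer claims judged as supported by the context."""
    if not verdicts:
        return float("nan")  # no claims extracted, so the score is undefined
    return sum(verdicts) / len(verdicts)


# Invented verdicts for two samples (True = claim supported by the context).
sample_a = [True, True, True]    # 3 of 3 claims supported
sample_b = [True, False, False]  # 1 of 3 claims supported

per_sample = [faithfulness_score(sample_a), faithfulness_score(sample_b)]
print(per_sample, mean(per_sample))
```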


In [6]:
result = eval_algo.evaluate_sample(
    question="When was the first super bowl?",
    answer="The first superbowl was held on Jan 15, 1967",
    contexts=['The First AFL–NFL World Championship Game was an American football game played on January 15, 1967, at the Los Angeles Memorial Coliseum in Los Angeles,'],
    llm=model,
    embeddings=embeddings
)

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

  value = np.nanmean(self.scores[cn])


In [7]:
result

[EvalScore(name='faithfulness', value=nan)]
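
A `nan` value means no usable score was computed for this sample. One possible cause (an assumption, not confirmed by the output above) is that the judge LLM's response could not be parsed into verdicts. Because `nan` compares unequal to everything, including itself, downstream code should test for it explicitly. A minimal sketch, using a stand-in `Score` dataclass rather than the library's `EvalScore`:

```python
import math
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Score:
    """Stand-in with the same two fields as the EvalScore printed above."""
    name: str
    value: float


def split_valid(scores: List[Score]) -> Tuple[List[Score], List[str]]:
    """Separate usable scores from the names of metrics that returned NaN."""
    valid = [s for s in scores if not math.isnan(s.value)]
    failed = [s.name for s in scores if math.isnan(s.value)]
    return valid, failed


valid, failed = split_valid([Score("faithfulness", float("nan"))])
if failed:
    print(f"No score computed for: {', '.join(failed)}")
```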

### Appendix

#### Using the ragas library directly

In [8]:
from ragas.metrics import (
    context_precision,
    faithfulness,
    context_recall,
)
from ragas.metrics.critique import harmfulness

# list of metrics we're going to use
metrics = [
    faithfulness,
    context_recall,
    context_precision,
    harmfulness,
]

In [10]:
import os

os.environ["OPENAI_API_KEY"] = "your-api-key"

In [11]:
# Load the sample evaluation dataset
from datasets import load_dataset

amnesty_qa = load_dataset("explodinggradients/amnesty_qa", "english_v2")
amnesty_qa

You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.
Repo card metadata block was not found. Setting CardData to empty.


DatasetDict({
    eval: Dataset({
        features: ['question', 'ground_truth', 'answer', 'contexts'],
        num_rows: 20
    })
})

In [13]:
from ragas import evaluate as ragas_evaluate

result = ragas_evaluate(
    amnesty_qa["eval"].select(range(1)),
    metrics=metrics,
)

result

Evaluating:   0%|          | 0/4 [00:00<?, ?it/s]

  value = np.nanmean(self.scores[cn])


{'faithfulness': nan, 'context_recall': 1.0000, 'context_precision': 1.0000, 'harmfulness': 0.0000}