
Commit

readme-update
yisz committed Feb 21, 2024
1 parent 9f8cc08 commit 7996861
Showing 3 changed files with 147 additions and 80 deletions.
226 changes: 147 additions & 79 deletions README.md
</div>

<h2 align="center">
<p>Open-Source Module-level Evaluation for LLM Pipelines</p>
</h2>

<h1 align="center">
  <img
    src="docs/public/module-level-eval.png"
    width="500"
  >
</h1>

## Overview

`continuous-eval` is an open-source package created for granular and holistic evaluation of LLM application pipelines.

## How is continuous-eval different?

- **Modularized Evaluation**: Measure each module in the pipeline with tailored metrics.

- **Comprehensive Metric Library**: Covers Retrieval-Augmented Generation (RAG), Code Generation, Agent Tool Use, Classification, and a variety of other LLM use cases. Mix and match Deterministic, Semantic and LLM-based metrics.

- **Leverage User Feedback in Evaluation**: Easily build a close-to-human ensemble evaluation pipeline with mathematical guarantees.

- **Synthetic Dataset Generation**: Generate large-scale synthetic datasets to test your pipeline.

## Installation

poetry install --all-extras
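
The `poetry install --all-extras` command above installs from a source checkout with all optional extras. A minimal sketch of the more common route, assuming the package is published on PyPI as `continuous-eval`:

```bash
# Assumption: the PyPI distribution is named `continuous-eval`
pip install continuous-eval
```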

## Getting Started

### Prerequisites for LLM-based metrics

To run LLM-based metrics, the code requires at least one LLM API key in `.env` (for example `OPENAI_API_KEY`, or `ANTHROPIC_API_KEY`, `GEMINI_API_KEY`, `AZURE_OPENAI_API_KEY` with deployment details). Take a look at the example env file `.env.example`.
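
A minimal `.env` sketch (key names come from the list above; the values are placeholders):

```
OPENAI_API_KEY="your-openai-key"
# Optional, for other providers
ANTHROPIC_API_KEY="your-anthropic-key"
GEMINI_API_KEY="your-gemini-key"
# Azure OpenAI additionally needs deployment details; see `.env.example`
```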

### Run a single metric

Here's how to run a single metric on a datum.
Check out all the available metrics [here](https://docs.relari.ai/).

```python
from continuous_eval.metrics.retrieval import PrecisionRecallF1

datum = {
    "question": "What is the capital of France?",
    "retrieved_context": [
        "Paris is the capital of France and its largest city.",
        "Lyon is a major city in France.",
    ],
    "ground_truth_context": ["Paris is the capital of France."],
    "answer": "Paris",
    "ground_truths": ["Paris"],
}

metric = PrecisionRecallF1()

print(metric(**datum))
```
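
The same datum can be scored with other metrics from the retrieval module. A sketch, assuming the rank-aware metrics accept the same `retrieved_context` / `ground_truth_context` fields used by their `.use()` bindings in the pipeline example below:

```python
from continuous_eval.metrics.retrieval import RankedRetrievalMetrics

# Rank-aware retrieval metrics (MAP, MRR, NDCG) on the same datum;
# the argument names are assumed from the .use() call shown further below
ranked = RankedRetrievalMetrics()
print(ranked(
    retrieved_context=datum["retrieved_context"],
    ground_truth_context=datum["ground_truth_context"],
))
```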

### Run modularized evaluation over a dataset

Define your pipeline and select the metrics for each module.

```python
from continuous_eval.eval import Module, ModuleOutput, Pipeline, Dataset
from continuous_eval.metrics.retrieval import PrecisionRecallF1, RankedRetrievalMetrics
from continuous_eval.metrics.generation.text import (
    DeterministicAnswerCorrectness,
    FleschKincaidReadability,
)
from typing import List, Dict

dataset = Dataset("dataset_folder")

# Each document is a dictionary with a "page_content" field
Documents = List[Dict[str, str]]
DocumentsContent = ModuleOutput(lambda x: [z["page_content"] for z in x])

# Simple 3-step RAG pipeline with Retriever -> Reranker -> Generator

retriever = Module(
    name="Retriever",
    input=dataset.question,
    output=Documents,
    eval=[
        PrecisionRecallF1().use(
            retrieved_context=DocumentsContent,
            ground_truth_context=dataset.ground_truth_contexts,
        ),
    ],
)

reranker = Module(
    name="reranker",
    input=retriever,
    output=Documents,
    eval=[
        RankedRetrievalMetrics().use(
            retrieved_context=DocumentsContent,
            ground_truth_context=dataset.ground_truth_contexts,
        ),
    ],
)

llm = Module(
    name="answer_generator",
    input=reranker,
    output=str,
    eval=[
        FleschKincaidReadability().use(answer=ModuleOutput()),
        DeterministicAnswerCorrectness().use(
            answer=ModuleOutput(), ground_truth_answers=dataset.ground_truths
        ),
    ],
)

pipeline = Pipeline([retriever, reranker, llm], dataset=dataset)
```
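
`ModuleOutput` appears to accept a plain callable that is applied to the module's output, so metric inputs can be reshaped as needed. As a small sketch (the `TopDocumentContent` name is hypothetical), a selector that keeps only the top-ranked document's content could look like:

```python
# Hypothetical selector: score only the top-ranked document's content
TopDocumentContent = ModuleOutput(lambda docs: [docs[0]["page_content"]])
```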

Log your results in your pipeline (see the full example):

```python
from continuous_eval.eval.manager import eval_manager

...
# Run and log module outputs
response = ask(q, reranked_docs)
eval_manager.log("answer_generator", response)
...
# Set the pipeline
eval_manager.set_pipeline(pipeline)
...
# Run the evaluation and view the results
eval_manager.run_eval()
print(eval_manager.eval_graph())
```

## Plug and Play Metrics

<table border="1">
<tr>
<th>Module</th>
<th>Subcategory</th>
<th>Metrics</th>
</tr>
<tr>
<td rowspan="2">Retrieval</td>
<td>Deterministic</td>
<td>PrecisionRecallF1, RankedRetrievalMetrics</td>
</tr>
<tr>
<td>LLM-based</td>
<td>LLMBasedContextPrecision, LLMBasedContextCoverage</td>
</tr>
<tr>
<td rowspan="3">Text Generation</td>
<td>Deterministic</td>
<td>DeterministicAnswerCorrectness, DeterministicFaithfulness, FleschKincaidReadability</td>
</tr>
<tr>
<td>Semantic</td>
<td>DebertaAnswerScores, BertAnswerRelevance, BertAnswerSimilarity</td>
</tr>
<tr>
<td>LLM-based</td>
<td>LLMBasedFaithfulness, LLMBasedAnswerCorrectness, LLMBasedAnswerRelevance, LLMBasedStyleConsistency</td>
</tr>
<tr>
<td rowspan="1">Classification</td>
<td>Deterministic</td>
<td>ClassificationAccuracy</td>
</tr>
<tr>
<td rowspan="2">Code Generation</td>
<td>Deterministic</td>
<td>CodeStringMatch, PythonASTSimilarity</td>
</tr>
<tr>
<td>LLM-based</td>
<td>LLMBasedCodeGeneration</td>
</tr>
<tr>
<td>Agent Tools</td>
<td>Deterministic</td>
<td>ToolSelectionAccuracy</td>
</tr>
</table>

You can also define your own metrics (coming soon).

## Resources

- **Docs:** [link](https://docs.relari.ai/)
- **Blog Posts:**
  - Practical Guide to RAG Pipeline Evaluation: [Part 1: Retrieval](https://medium.com/relari/a-practical-guide-to-rag-pipeline-evaluation-part-1-27a472b09893)
  - Practical Guide to RAG Pipeline Evaluation: [Part 2: Generation](https://medium.com/relari/a-practical-guide-to-rag-evaluation-part-2-generation-c79b1bde0f5d)
  - How important is a Golden Dataset for LLM evaluation? [link](https://medium.com/relari/how-important-is-a-golden-dataset-for-llm-pipeline-evaluation-4ef6deb14dc5)

- **Discord:** Join our community of LLM developers [Discord](https://discord.gg/GJnM8SRsHr)
- **Reach out to founders:** [Email](mailto:founders@relari.ai) or [Schedule a chat](https://cal.com/yizhang/continuous-eval)

Binary file added docs/public/module-level-eval.png
1 change: 0 additions & 1 deletion examples/single_metric.py
from continuous_eval.metrics.retrieval import PrecisionRecallF1

# A dataset is just a list of dictionaries containing the relevant information
Expand Down
