Commit

docs update
yisz committed Feb 22, 2024
1 parent fc989bd commit e4edb7b
Showing 21 changed files with 138 additions and 43 deletions.
2 changes: 1 addition & 1 deletion docs/src/content/docs/examples/0_single_metric.md
@@ -25,5 +25,5 @@ dataset = Dataset([q])
metric = PrecisionRecallF1()

# Let's calculate the metric for the first datum
print(metric.calculate(**dataset.datum(0))) # alternatively `metric.calculate(**q)`
print(metric(**dataset.datum(0))) # alternatively `metric.calculate(**q)`
```
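
Throughout this commit, `metric.calculate(**datum)` is replaced with the shorter call form `metric(**datum)`. The sketch below shows one way the two spellings can be equivalent, assuming the metric base class simply forwards `__call__` to `calculate` (an illustrative sketch, not continuous-eval's actual source; `ExactMatch` is a made-up metric):

```python
# Hypothetical sketch: a base class whose __call__ forwards to calculate(),
# which makes metric(**datum) and metric.calculate(**datum) interchangeable.
class Metric:
    def calculate(self, **kwargs) -> dict:
        raise NotImplementedError

    def __call__(self, **kwargs) -> dict:
        return self.calculate(**kwargs)


class ExactMatch(Metric):
    """Made-up metric used only to illustrate the call pattern."""

    def calculate(self, answer: str, ground_truth: str, **kwargs) -> dict:
        return {"exact_match": float(answer.strip() == ground_truth.strip())}


metric = ExactMatch()
print(metric(answer="42", ground_truth="42"))            # {'exact_match': 1.0}
print(metric.calculate(answer="42", ground_truth="42"))  # same result
```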
31 changes: 13 additions & 18 deletions docs/src/content/docs/getting-started/Introduction.md
@@ -9,34 +9,29 @@ sidebar:

## What is continuous-eval?

`continuous-eval` is an open-source package created for the scientific and practical evaluation of LLM application pipelines. Currently, it focuses on retrieval-augmented generation (RAG) pipelines.
`continuous-eval` is an open-source package created for granular and holistic evaluation of GenAI application pipelines.

## Why another eval package?
<img src="/public/module-level-eval.png"></img>

Good LLM evaluation should help you reliably identify weaknesses in the pipeline, inform what actions to take, and accelerate your development from prototype to production.

As AI developers, we always wanted to put LLM Evaluation as part of our CI/CD pipeline just like any other part of software, but today it remains challenging because:

**Human evaluation is trustworthy but not scalable**
- Eyeballing can only be done on a small dataset, and it has to be repeated for any pipeline update
- User feedback is spotty and lacks granularity

**Using LLMs to evaluate LLMs is expensive, slow and difficult to trust**
- Can be very costly and slow to run at scale
- Can be biased towards certain answers and often doesn’t align well with human evaluation

## How is continuous-eval different?

- **The Most Complete RAG Metric Library**: mix and match Deterministic, Semantic and LLM-based metrics.
- **Modularized Evaluation**: Measure each module in the pipeline with tailored metrics.

- **Trustworthy Ensemble Metrics**: easily build a close-to-human ensemble evaluation pipeline with mathematical guarantees.
- **Comprehensive Metric Library**: Covers Retrieval-Augmented Generation (RAG), Code Generation, Agent Tool Use, Classification and a variety of other LLM use cases. Mix and match Deterministic, Semantic and LLM-based metrics.

- **Cheaper and Faster Evaluation**: our hybrid pipeline slashes cost by up to 15x compared to pure LLM-based metrics, and reduces eval time on large datasets from hours to minutes.
- **Leverage User Feedback in Evaluation**: Easily build a close-to-human ensemble evaluation pipeline with mathematical guarantees.

- **Synthetic Dataset Generation**: Generate large-scale synthetic datasets to test your pipeline.

- **Tailored to Your Data**: our evaluation pipeline can be customized to your use case and leverages the data you trust. We can help you curate a golden dataset if you don’t have one.

## Resources

- **Blog Post: Practical Guide to RAG Pipeline Evaluation:** [Part 1: Retrieval](https://medium.com/relari/a-practical-guide-to-rag-pipeline-evaluation-part-1-27a472b09893), [Part 2: Generation](https://medium.com/relari/a-practical-guide-to-rag-evaluation-part-2-generation-c79b1bde0f5d)
- **Blog Posts:**
- Practical Guide to RAG Pipeline Evaluation: [Part 1: Retrieval](https://medium.com/relari/a-practical-guide-to-rag-pipeline-evaluation-part-1-27a472b09893)
- Practical Guide to RAG Pipeline Evaluation: [Part 2: Generation](https://medium.com/relari/a-practical-guide-to-rag-evaluation-part-2-generation-c79b1bde0f5d)
- How important is a Golden Dataset for LLM evaluation? [link](https://medium.com/relari/how-important-is-a-golden-dataset-for-llm-pipeline-evaluation-4ef6deb14dc5)

- **Discord:** Join our community of LLM developers [Discord](https://discord.gg/GJnM8SRsHr)
- **Reach out to founders:** [Email](mailto:founders@relari.ai) or [Schedule a chat](https://cal.com/yizhang/continuous-eval)
2 changes: 2 additions & 0 deletions docs/src/content/docs/index.mdx
@@ -5,6 +5,8 @@ description: Making LLM development a science rather than an art.

**Start today with continuous-eval and make your LLM development a science not an art!**

<img src="/public/module-level-eval.png"></img>

import { LinkCard, CardGrid } from '@astrojs/starlight/components';
import { Icon } from '@astrojs/starlight/components';

@@ -34,7 +34,7 @@ datum = {
},

metric = PythonASTSimilarity()
print(metric.calculate(**datum))
print(metric(**datum))
```

### Example Output
@@ -29,7 +29,7 @@ datum = {
},

metric = CodeStringMatch()
print(metric.calculate(**datum))
print(metric(**datum))
```

### Example Output
@@ -60,7 +60,7 @@ datum = {
}

metric = DeterministicAnswerCorrectness()
print(metric.calculate(**datum))
print(metric(**datum))
```

### Example Output
@@ -60,7 +60,7 @@ datum = {
}

metric = DeterministicFaithfulness()
print(metric.calculate(**datum))
print(metric(**datum))
```

### Example Output
@@ -32,7 +32,7 @@ datum = {
}

metric = FleschKincaidReadability()
print(metric.calculate(**datum))
print(metric(**datum))
```

### Example Output
@@ -36,7 +36,7 @@ datum = {
}

metric = LLMBasedAnswerCorrectness(LLMFactory("gpt-4-1106-preview"))
print(metric.calculate(**datum))
print(metric(**datum))
```

### Sample Output
@@ -58,7 +58,7 @@ datum = {
"answer": "Shakespeare wrote 'Romeo and Juliet'",
}
metric = LLMBasedAnswerCorrectness(LLMFactory("gpt-4-1106-preview"))
print(metric.calculate(**datum))
print(metric(**datum))
```

### Sample Output
@@ -28,7 +28,7 @@ datum = {
}

metric = LLMBasedAnswerRelevance(LLMFactory("gpt-4-1106-preview"))
print(metric.calculate(**datum))
print(metric(**datum))
```

### Sample Output
@@ -32,7 +32,7 @@ datum = {
}

metric = LLMBasedAnswerRelevance(LLMFactory("gpt-4-1106-preview"))
print(metric.calculate(**datum))
print(metric(**datum))
```

### Sample Output
@@ -35,7 +35,7 @@ datum = {
}

metric = BertAnswerSimilarity()
print(metric.calculate(**datum))
print(metric(**datum))
```

### Example Output
@@ -35,7 +35,7 @@ datum = {
}

metric = BertAnswerSimilarity()
print(metric.calculate(**datum))
print(metric(**datum))
```

### Example Output
@@ -45,7 +45,7 @@ datum = {
}

metric = DebertaAnswerScores()
print(metric.calculate(**datum))
print(metric(**datum))

reverse_metric = DebertaAnswerScores(reverse=True)
print(reverse_metric.calculate(**datum))
@@ -94,7 +94,7 @@ datum = {
}

metric = PrecisionRecallF1(RougeChunkMatch())
print(metric.calculate(**datum))
print(metric(**datum))
```

### Example Output
@@ -50,7 +50,7 @@ datum = {
}

metric = RankedRetrievalMetrics(RougeChunkMatch())
print(metric.calculate(**datum))
print(metric(**datum))
```

### Example Output
@@ -37,7 +37,7 @@ datum = {
}

metric = LLMBasedContextCoverage(LLMFactory("gpt-4-1106-preview"))
print(metric.calculate(**datum))
print(metric(**datum))
```

### Sample Output
@@ -43,7 +43,7 @@ datum = {
}

metric = LLMBasedContextPrecision(LLMFactory("gpt-4-1106-preview"), log_relevance_by_context=True)
print(metric.calculate(**datum))
print(metric(**datum))
```

Note: optional variable `log_relevance_by_context` outputs `LLM_based_context_relevance_by_context` - the LLM judgement of relevance to answer the question per context retrieved.
22 changes: 14 additions & 8 deletions docs/src/content/docs/pipeline/overview.md
@@ -1,8 +1,8 @@
---
title: Overview
title: Pipeline
sidebar:
badge:
text: beta
text: new
variant: tip
---

@@ -18,30 +18,36 @@ Consider the following example:

```d2
direction: right
Retriever -> LLM
Retriever -> Reranker -> Generator
```

In the example above, the pipeline consists of two modules: a retriever and a language model (LLM).
In the example above, the pipeline consists of three simple modules: a retriever, a reranker, and an LLM generator.

```python
from continuous_eval.eval import Module, Pipeline, Dataset
from typing import List, Dict

dataset = Dataset("dataset_folder")
dataset = Dataset("dataset_folder") # This is the dataset you will use to evaluate the pipeline modules.

retriever = Module(
    name="Retriever",
    input=dataset.question,
    output=List[Dict[str, str]],
)

reranker = Module(
    name="Reranker",
    input=retriever,
    output=List[Dict[str, str]],
)

llm = Module(
    name="LLM",
    input=retriever,
    input=reranker,
    output=str,
)

pipeline = Pipeline([retriever, llm], dataset=dataset)
pipeline = Pipeline([retriever, reranker, llm], dataset=dataset)
print(pipeline.graph_repr()) # visualize the pipeline in Mermaid graph format
```

> We will talk about the dataset later.
92 changes: 92 additions & 0 deletions docs/src/content/docs/pipeline/pipeline_eval.md
@@ -0,0 +1,92 @@
---
title: Evaluators and Tests
sidebar:
badge:
text: new
variant: tip
---

## Adding Evaluators and Tests to a Pipeline

You can optionally add `eval` and `tests` to the modules whose performance you want to measure.


Use the `eval` field to select the relevant evaluation metrics:
- Select the metrics and specify their inputs according to the data fields each metric requires: `MetricName().use(data_fields)`.
- Metric inputs can reference items from three sources (see the sketch right after this list):
  - **From `dataset`**: e.g. `ground_truth_context = dataset.ground_truth_context`
  - **From current module**: e.g. `answer = ModuleOutput()`
  - **From prior modules**: e.g. `retrieved_context = ModuleOutput(DocumentsContent, module=reranker)`, where `DocumentsContent = ModuleOutput(lambda x: [z["page_content"] for z in x])` selects specific items from the prior module's output
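
A minimal sketch of this wiring, with all three input sources feeding a single metric. The module names, the dataset folder, and the dataset fields are illustrative assumptions; the full example further down this page follows the same pattern:

```python
from typing import Dict, List

from continuous_eval.eval import Dataset, Module, ModuleOutput
from continuous_eval.metrics.generation.text import LLMBasedFaithfulness

dataset = Dataset("dataset_folder")  # assumed golden dataset with a `question` field

# Selector: keep only the "page_content" field of each document a module returns
DocumentsContent = ModuleOutput(lambda x: [z["page_content"] for z in x])

retriever = Module(
    name="retriever",
    input=dataset.question,
    output=List[Dict[str, str]],
)

generator = Module(
    name="generator",
    input=retriever,
    output=str,
    eval=[
        LLMBasedFaithfulness().use(
            answer=ModuleOutput(),                                               # from the current module
            retrieved_context=ModuleOutput(DocumentsContent, module=retriever),  # from a prior module
            question=dataset.question,                                           # from the dataset
        ),
    ],
)
```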


Use the `tests` field to define specific performance criteria:
- Select the testing class `GreaterOrEqualThan` or `MeanGreaterOrEqualThan` to run the test over each datapoint or over the mean of the aggregate dataset, respectively.
- Define `test_name`, `metric_name` (which must be one of the fields computed by the metrics in `eval`), and `min_value`.
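
A minimal sketch of both test types, reusing metric names that appear in the full example below. The thresholds are illustrative, and `MeanGreaterOrEqualThan` is assumed to live in the same module as `GreaterOrEqualThan`:

```python
from continuous_eval.eval.tests import GreaterOrEqualThan, MeanGreaterOrEqualThan

tests = [
    # Runs the check on each datapoint in the dataset
    GreaterOrEqualThan(test_name="Context Recall", metric_name="context_recall", min_value=0.9),
    # Runs the check on the mean over the dataset
    MeanGreaterOrEqualThan(test_name="Readability", metric_name="flesch_reading_ease", min_value=20.0),
]
```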



Below is a full example of a two-step pipeline.

```python
from continuous_eval.eval import Module, Pipeline, Dataset, ModuleOutput
from continuous_eval.metrics.retrieval import PrecisionRecallF1 # Deterministic metric
from continuous_eval.metrics.generation.text import (
    FleschKincaidReadability,  # Deterministic metric
    DebertaAnswerScores,  # Semantic metric
    LLMBasedFaithfulness,  # LLM-based metric
)
from typing import List, Dict
from continuous_eval.eval.tests import GreaterOrEqualThan, MeanGreaterOrEqualThan

dataset = Dataset("data/eval_golden_dataset")

Documents = List[Dict[str, str]]
DocumentsContent = ModuleOutput(lambda x: [z["page_content"] for z in x])

base_retriever = Module(
    name="base_retriever",
    input=dataset.question,
    output=Documents,
    eval=[
        PrecisionRecallF1().use(  # Reference-based metric that compares the Retrieved Context with the Ground Truths
            retrieved_context=DocumentsContent,
            ground_truth_context=dataset.ground_truth_contexts,
        ),
    ],
    tests=[
        GreaterOrEqualThan(  # Set a test using context_recall, a metric calculated by PrecisionRecallF1()
            test_name="Context Recall", metric_name="context_recall", min_value=0.9
        ),
    ],
)

llm = Module(
    name="answer_generator",
    input=base_retriever,
    output=str,
    eval=[
        FleschKincaidReadability().use(  # Reference-free metric that only uses the output of the module
            answer=ModuleOutput()
        ),
        DebertaAnswerScores().use(  # Reference-based metric that compares the Answer with the Ground Truths
            answer=ModuleOutput(), ground_truth_answers=dataset.ground_truths
        ),
        LLMBasedFaithfulness().use(  # Reference-free metric that uses output from a prior module (Retrieved Context) to evaluate the answer
            answer=ModuleOutput(),
            retrieved_context=ModuleOutput(DocumentsContent, module=base_retriever),  # DocumentsContent from the base_retriever module
            question=dataset.question,
        ),
    ],
    tests=[
        MeanGreaterOrEqualThan(  # Compares the aggregate result over the dataset against the min_value
            test_name="Readability", metric_name="flesch_reading_ease", min_value=20.0
        ),
        GreaterOrEqualThan(  # Compares each result in the dataset against the min_value, and outputs the mean
            test_name="Deberta Entailment", metric_name="deberta_entailment", min_value=0.8
        ),
    ],
)

pipeline = Pipeline([base_retriever, llm], dataset=dataset)
print(pipeline.graph_repr()) # visualize the pipeline in Mermaid graph format
```
