diff --git a/README.md b/README.md
index b7ce6f2..4d25536 100644
--- a/README.md
+++ b/README.md
@@ -45,7 +45,7 @@ ## Getting Started

-This code is provided as a Python package. To install it, run the following command:
+This code is provided as a PyPI package. To install it, run the following command:

 ```bash
 python3 -m pip install continuous-eval
@@ -57,9 +57,11 @@ if you want to install from source
 git clone https://github.com/relari-ai/continuous-eval.git && cd continuous-eval
 poetry install --all-extras
 ```
+
 To run LLM-based metrics, the code requires at least one of the LLM API keys in `.env`. Take a look at the example env file `.env.example`.

 ## Run a single metric
+
 Here's how you run a single metric on a datum.

 Check all available metrics here: [link](https://docs.relari.ai/)
@@ -81,6 +83,7 @@ metric = PrecisionRecallF1()
 print(metric(**datum))
 ```
+
 ### Off-the-shelf Metrics

@@ -132,45 +135,40 @@ print(metric(**datum))
-You can also define your own metrics (coming soon).

-## Run modularized evaluation over a dataset
-
-Define your pipeline and select the metrics for each.
+You can also define your own metrics: extend the [Metric](continuous_eval/metrics/base.py#23) class and implement the `__call__` method.
+Optional methods are `batch` (override it if batch processing can be optimized) and `aggregate` (to aggregate metric results over multiple samples).

 ```python
-from continuous_eval.eval import Module, Pipeline, Dataset
+from continuous_eval.eval import Module, ModuleOutput, Pipeline, Dataset
 from continuous_eval.metrics.retrieval import PrecisionRecallF1, RankedRetrievalMetrics
 from continuous_eval.metrics.generation.text import DeterministicAnswerCorrectness
 from typing import List, Dict

 dataset = Dataset("dataset_folder")

-Documents = List[Dict[str, str]]
-DocumentsContent = ModuleOutput(lambda x: [z["page_content"] for z in x])

 # Simple 3-step RAG pipeline with Retriever->Reranker->Generation
-
 retriever = Module(
     name="Retriever",
     input=dataset.question,
-    output=List[Dict[str, str]],
-    eval={
+    output=List[str],
+    eval=[
         PrecisionRecallF1().use(
-            retrieved_context=DocumentsContent,
+            retrieved_context=ModuleOutput(),
             ground_truth_context=dataset.ground_truth_contexts,
         ),
-    },
+    ],
 )

 reranker = Module(
     name="reranker",
     input=retriever,
     output=List[Dict[str, str]],
-    eval={
+    eval=[
         RankedRetrievalMetrics().use(
-            retrieved_context=DocumentsContent,
+            retrieved_context=ModuleOutput(),
             ground_truth_context=dataset.ground_truth_contexts,
         ),
-    },
+    ],
 )

 llm = Module(
@@ -188,25 +186,28 @@ llm = Module(
 pipeline = Pipeline([retriever, reranker, llm], dataset=dataset)
 ```

-Log your results in your pipeline (see full example)
+Now you can run the evaluation on your pipeline:

 ```python
+eval_manager.start_run()
+
 while eval_manager.is_running():
+    if eval_manager.curr_sample is None:
+        break
+    q = eval_manager.curr_sample["question"]  # get the question or any other field
+    # run your pipeline ...
+    eval_manager.next_sample()
+```

-from continuous_eval.eval.manager import eval_manager
+To **log** the results, you just need to call the `eval_manager.log` method with the module name and the output, for example:

-...
-# Run and log module outputs
-response = ask(q, reranked_docs)
+```python
 eval_manager.log("answer_generator", response)
-...
-# Set the pipeline
-eval_manager.set_pipeline(pipeline)
-...
-# View the results
-eval_manager.run_eval()
-print(eval_manager.eval_graph())
 ```
+
+The evaluation manager also offers:
+
+- `eval_manager.run_metrics()` to run all the metrics defined in the pipeline
+- `eval_manager.run_tests()` to run the tests defined in the pipeline (see the [docs](docs.relari.ai) for more details)

 ## Resources
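
The custom-metric extension point is only described in prose in the hunk above (extend `Metric`, implement `__call__`, optionally `batch`/`aggregate`). Below is a minimal sketch, not part of the patch, assuming `Metric` is importable from `continuous_eval.metrics.base` (the path linked above), that `__call__` receives the datum fields as keyword arguments (as in the `metric(**datum)` examples), and that `aggregate` takes a list of per-sample results; the class name `AnswerLengthRatio` and the fields `answer` / `ground_truth_answers` are illustrative only.

```python
# Illustrative sketch only -- names, fields, and the aggregate signature are assumptions.
from statistics import mean

from continuous_eval.metrics.base import Metric  # path referenced in the hunk above


class AnswerLengthRatio(Metric):
    """Toy metric: length of the generated answer relative to the longest reference answer."""

    def __call__(self, answer: str, ground_truth_answers: list, **kwargs):
        # Per-sample score, returned as a dict like the off-the-shelf metrics
        reference = max(ground_truth_answers, key=len)
        return {"answer_length_ratio": len(answer) / max(len(reference), 1)}

    def aggregate(self, results: list):
        # Average the per-sample scores over the dataset (assumed signature)
        return {"answer_length_ratio": mean(r["answer_length_ratio"] for r in results)}
```

An instance would then be invoked like the built-in metrics, e.g. `AnswerLengthRatio()(**datum)`, and, if the base class provides `.use(...)`, attached to a pipeline module as in the example above.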