<p>
    <a href="https://colab.research.google.com/drive/1i78-peTBdhK5y4ZskFzC_NtLRaqvySXM">
        <img alt="Open In Colab"
             align="left"
             src="https://colab.research.google.com/assets/colab-badge.svg">
    </a>
    <br>
    <h1> Quickstart </h1>
</p>

Welcome to the ragas quickstart. We're going to get you up and running with ragas as quickly as possible, so that you can go back to improving your Retrieval Augmented Generation (RAG) pipelines while this library makes sure your changes improve the pipeline as a whole.

To kick things off, let's start with the data.

In [None]:
# install ragas (needed on Colab; comment this out if it is already installed)
!pip install ragas

Collecting ragas
  Downloading ragas-0.0.17-py3-none-any.whl (39 kB)
Collecting transformers (from ragas)
  Downloading transformers-4.34.1-py3-none-any.whl (7.7 MB)
Collecting sentence-transformers (from ragas)
  Downloading sentence-transformers-2.2.2.tar.gz (85 kB)
  Preparing metadata (setup.py) ... done
Collecting datasets (from ragas)
  Downloading datasets-2.14.5-py3-none-any.whl (519 kB)
Collecting tiktoken (from ragas)
  Downloading tiktoken-0.5.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.0 MB)

Ragas also uses OpenAI to run some of its metrics, so make sure your OpenAI API key is ready and available in your environment.

In [None]:
import os

os.environ["OPENAI_API_KEY"] = "sk-"
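Hard-coding a key in a notebook makes it easy to leak. If the key is already exported in your shell, a small check like the one below can fail early with a clear message instead. This is a minimal sketch: `require_api_key` is a hypothetical helper written for this notebook, not part of ragas.

```python
import os


def require_api_key(name="OPENAI_API_KEY", env=None):
    """Return the key stored under `name`, or raise if it is missing or blank.

    Hypothetical helper for this notebook; it is not part of ragas.
    """
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; export it before running the metrics")
    return key


# Example with an explicit mapping, so no real key is read or printed here.
print(require_api_key(env={"OPENAI_API_KEY": "sk-placeholder"}))
```

Calling `require_api_key()` with no arguments reads from `os.environ`, so it can sit at the top of the notebook as a guard.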

## The Data

Ragas performs a `ground_truth`-free evaluation of your RAG pipelines. This is because, for most people, building a gold-labeled dataset that represents the distribution they see in production is a very expensive process.

Hence, to work with ragas, all you need is the following data:
- question: `list[str]` - The questions your RAG pipeline will be evaluated on.
- answer: `list[str]` - The answers generated by the RAG pipeline and given to the user.
- contexts: `list[list[str]]` - The contexts that were passed into the LLM to answer each question.

Ideally, your list of questions should reflect the questions your users ask, including those that have been problematic in the past.
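To make the expected shapes concrete, here is a minimal sketch that checks the three columns listed above before you hand them to ragas. `check_ragas_columns` is a hypothetical helper, and the sample row is made up for illustration.

```python
def check_ragas_columns(data):
    """Validate the question/answer/contexts columns; raise ValueError on a mismatch.

    Hypothetical helper for illustration; it is not part of ragas.
    """
    questions, answers, contexts = data["question"], data["answer"], data["contexts"]
    if not (len(questions) == len(answers) == len(contexts)):
        raise ValueError("question, answer and contexts must have the same length")
    if not all(isinstance(q, str) for q in questions):
        raise ValueError("every question must be a string")
    if not all(isinstance(a, str) for a in answers):
        raise ValueError("every answer must be a string")
    for ctx in contexts:
        if not isinstance(ctx, list) or not all(isinstance(c, str) for c in ctx):
            raise ValueError("each entry in contexts must be a list of strings")
    return len(questions)


# Made-up sample row, for illustration only.
sample = {
    "question": ["How do I deposit a cheque made out to someone else?"],
    "answer": ["Ask the payee to endorse it, or have the cheque reissued."],
    "contexts": [["Have the payee sign the back.", "Ask the issuer to reissue the cheque."]],
}
print(check_ragas_columns(sample))  # number of validated rows
```

With the `datasets` library installed, `Dataset.from_dict(sample)` should turn a dictionary like this into a `Dataset` with the same column names.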

Here we're using an example dataset from one of the baselines we created for the [Financial Opinion Mining and Question Answering (fiqa) Dataset](https://sites.google.com/view/fiqa/). If you want to know more about the baseline, feel free to check the `experiments/baseline` section.

In [None]:
# data
from datasets import load_dataset

fiqa_eval = load_dataset("explodinggradients/fiqa", "ragas_eval")
fiqa_eval

DatasetDict({
    baseline: Dataset({
        features: ['question', 'ground_truths', 'answer', 'contexts'],
        num_rows: 30
    })
})

## Metrics

Ragas measures your pipeline's performance along two dimensions:

1. Faithfulness: measures the factual consistency of the generated answer against the given context.
2. Relevancy: measures how relevant retrieved contexts and the generated answer are to the question.

Through repeated experiments, we have found that the quality of a RAG pipeline is highly dependent on these two dimensions. The final `ragas_score` is the harmonic mean of these two factors.

Now let's import these metrics and look at what each one denotes.

In [None]:
from ragas.metrics import (
    answer_relevancy,
    faithfulness,
    context_recall,
    context_precision
)

Here you can see that we are using 4 metrics, but what do they represent?

1. answer_relevancy - a measure of how relevant the answer is to the question.

2. faithfulness - the factual consistency of the answer with the context, based on the question.

3. context_recall - a measure of the retriever's ability to retrieve all the information needed to answer the question.

4. context_precision - a measure of how relevant the retrieved context is to the question. It conveys the quality of the retrieval pipeline.

**Note:** *By default these metrics use OpenAI's API to compute the score. If you are using these metrics, make sure you set the environment variable `OPENAI_API_KEY` to your API key. You can also try other LLMs for evaluation; check the [llm guide](./guides/llms.ipynb) to learn more.*

If you're interested in learning more, feel free to check the [docs](https://github.com/explodinggradients/ragas/blob/main/docs/metrics.md).

## Evaluation

Running the evaluation is as simple as calling `evaluate` on the `Dataset` with the metrics of your choice.

In [None]:
from ragas import evaluate

result = evaluate(
    fiqa_eval["baseline"].select(range(1)),
    metrics=[
        context_precision,
        faithfulness,
        answer_relevancy,
        context_recall
    ],
)

result

evaluating with [context_precision]


100%|██████████| 1/1 [00:05<00:00,  5.66s/it]


evaluating with [faithfulness]


100%|██████████| 1/1 [00:10<00:00, 10.10s/it]


evaluating with [answer_relevancy]


100%|██████████| 1/1 [00:05<00:00,  5.77s/it]


evaluating with [context_recall]


100%|██████████| 1/1 [00:09<00:00,  9.84s/it]


{'ragas_score': 0.2867, 'context_precision': 0.4118, 'faithfulness': 0.6667, 'answer_relevancy': 0.9769, 'context_recall': 0.1111}

And there you have it: all the scores you need. `ragas_score` gives you a single metric to track, while the other ones measure the different parts of your pipeline.
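As a sanity check on how the single score relates to the individual ones, the sketch below takes the harmonic mean of the four metric values printed above, using only the standard library. It reproduces the reported `ragas_score` for this run. Treat it as an illustration of the arithmetic, not as the library's internal implementation.

```python
from statistics import harmonic_mean

# Per-metric scores copied from the evaluation output above.
scores = {
    "context_precision": 0.4118,
    "faithfulness": 0.6667,
    "answer_relevancy": 0.9769,
    "context_recall": 0.1111,
}

combined = harmonic_mean(list(scores.values()))
print(round(combined, 4))  # 0.2867, matching the reported ragas_score
```

Because the harmonic mean is dominated by the smallest value, the low `context_recall` pulls the combined score down even though `answer_relevancy` is close to 1.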

Now, if you want to dig into the results and find the examples where your pipeline performed especially badly or especially well, you can easily convert the result into a pandas DataFrame and use your standard analytics tools too!

In [None]:
df = result.to_pandas()
df.head()

| | question | contexts | answer | ground_truths | context_precision | faithfulness | answer_relevancy | context_recall |
|---|---|---|---|---|---|---|---|---|
| 0 | How to deposit a cheque issued to an associate... | [Just have the associate sign the back and the... | \nThe best way to deposit a cheque issued to a... | [Have the check reissued to the proper payee.J... | 0.411765 | 0.666667 | 0.976915 | 0.111111 |
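One common way to use the per-row scores is to pull out the weakest examples for a given metric. The sketch below does this on plain dictionaries, so it runs without extra dependencies. `lowest_rows` is a hypothetical helper and the scores are made up. With a real result, the rows could come from `df.to_dict("records")`.

```python
def lowest_rows(rows, metric, n=3):
    """Return the n rows with the smallest value of `metric`."""
    return sorted(rows, key=lambda row: row[metric])[:n]


# Made-up scores for illustration only.
rows = [
    {"question": "q1", "faithfulness": 0.90, "context_recall": 0.40},
    {"question": "q2", "faithfulness": 0.35, "context_recall": 0.80},
    {"question": "q3", "faithfulness": 0.70, "context_recall": 0.10},
]

# Show the two rows the pipeline was least faithful on.
for row in lowest_rows(rows, "faithfulness", n=2):
    print(row["question"], row["faithfulness"])
```

In pandas itself, `df.sort_values("faithfulness").head(2)` expresses the same idea directly on the DataFrame.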


And that's it!

You can check out the [ragas in action] notebook to get a feel for what it is like to use ragas while trying to improve your pipelines.

If you have any suggestions, feedback, or things you're not happy about, please share them in the [issue section](https://github.com/explodinggradients/ragas/issues). We love hearing from you 😁