Dataset Link: https://huggingface.co/datasets/allenai/qasper

While leaderboards and reports offer insights into overall model performance, they don't reveal how a model handles your specific needs. The Gen AI evaluation service helps you define your own evaluation criteria, ensuring a clear understanding of how well generative AI models and applications align with your unique use case.

In [6]:
%pip install google-genai -q

Note: you may need to restart the kernel to use updated packages.


# Benchmarking Gemini Models Using Ragas

In this tutorial, we benchmark Gemini models on AllenAI's QASPER dataset, scoring their answers on a question answering task with Ragas metrics.

![Data collection Process of QASPER Dataset](qasper_data_collection.png)

## Loading the Dataset

For the sake of demonstration, we will use only a subset of the whole dataset. You can perform benchmarking using the complete dataset.

In [25]:
from datasets import load_dataset
import pandas as pd
import numpy as np
from tqdm.auto import tqdm

dataset = load_dataset("allenai/qasper")

Dataset({
    features: ['id', 'title', 'abstract', 'full_text', 'qas', 'figures_and_tables'],
    num_rows: 100
})
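Each record nests the paper body and its QA annotations. The toy record below mirrors only the fields that the preprocessing code in this notebook reads (`full_text.section_name`, `full_text.paragraphs`, `qas.question`, `qas.answers[*].answer`). Its values are invented for illustration and are not taken from QASPER, and fields the notebook does not use are left out.

```python
# Toy record shaped like one QASPER row (values are made up for illustration).
toy_record = {
    "id": "0000.00000",
    "title": "A Toy Paper",
    "abstract": "We study a toy problem.",
    "full_text": {
        "section_name": ["Introduction", "Experiments"],
        "paragraphs": [
            ["We introduce the task."],
            ["We use Europarl.", "We also use MultiUN."],
        ],
    },
    "qas": {
        "question": ["which datasets did they experiment with?"],
        "answers": [
            {
                # One entry per annotator; the notebook keeps only the first.
                "answer": [
                    {
                        "unanswerable": False,
                        "extractive_spans": ["Europarl", "MultiUN"],
                        "yes_no": None,
                        "free_form_answer": "",
                    }
                ]
            }
        ],
    },
}

# Section names and their paragraph lists are parallel lists.
sections = toy_record["full_text"]["section_name"]
paragraphs = toy_record["full_text"]["paragraphs"]
for name, paras in zip(sections, paragraphs):
    print(f"{name}: {len(paras)} paragraph(s)")

# Questions and answer sets are parallel lists as well.
first_answer = toy_record["qas"]["answers"][0]["answer"][0]
print(first_answer["extractive_spans"])
```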

![](benchmarking.png)

In [16]:
def convert_full_text_to_markdown(full_text_dict):
    """
    Converts a full_text dictionary into a markdown-formatted string.

    Expected keys:
      - "section_name": list of section titles.
      - "paragraphs": list of lists of paragraphs corresponding to each section.

    Each section becomes a markdown header (##) followed by its paragraphs.
    """
    sections = full_text_dict.get("section_name", [])
    paragraphs = full_text_dict.get("paragraphs", [])

    markdown_lines = []
    for section, paragraph in zip(sections, paragraphs):
        markdown_lines.append(f"## {section}")
        markdown_lines.append("")  # Blank line
        markdown_lines.append("\n".join(map(str, paragraph)))
        markdown_lines.append("")  # End of section
        markdown_lines.append("")  # Extra blank line for separation
    return "\n".join(markdown_lines)

In [17]:
def combine_responses(row):
    """
    Combines 'extractive_spans', 'yes_no', and 'free_form_answer'
    into one single string. Skips components that are missing.
    """
    responses = []
    if pd.notna(row.get("extractive_spans")):
        if isinstance(row["extractive_spans"], list):
            responses.append(" ".join(map(str, row["extractive_spans"])))
        else:
            responses.append(str(row["extractive_spans"]))
    if pd.notna(row.get("yes_no")):
        responses.append(str(row["yes_no"]))
    if pd.notna(row.get("free_form_answer")):
        responses.append(str(row["free_form_answer"]))
    return "\n".join(responses) if responses else np.nan

In [18]:
def preprocess_hf_dataset(hf_ds):
    """
    Processes a HuggingFace dataset split into a cleaned Pandas DataFrame.

    Steps:
      1. For each sample, convert 'full_text' to a markdown string.
      2. For every QA pair in the sample, extract the question and first answer.
      3. Build lists for answers, questions, and full_text (duplicated per question).
      4. Create a DataFrame from the collected data.
      5. Clean columns by replacing empty lists/strings with NaN and joining lists.
      6. Combine the answer components into a single 'golden response'.

    The function uses nested tqdm progress bars for real-time feedback.

    Returns:
        pd.DataFrame: The preprocessed DataFrame.
    """
    answers_list = []  # Stores the first answer for each question
    questions_list = []  # Stores each question text
    full_text_list = []  # Stores the formatted full text per QA pair

    # Outer loop: iterate over samples with progress bar
    for sample in tqdm(hf_ds, desc="Processing samples", unit="sample"):
        # Convert full text once per sample
        formatted_text = convert_full_text_to_markdown(sample["full_text"])
        # Create a list of QA pairs
        qa_pairs = list(zip(sample["qas"]["question"], sample["qas"]["answers"]))

        # Inner loop: iterate over each QA pair with its own progress bar
        for question, answer_set in tqdm(
            qa_pairs, desc="Processing QAs", total=len(qa_pairs), leave=False, unit="qa"
        ):
            answers_list.append(answer_set["answer"][0])
            questions_list.append(question)
            full_text_list.append(formatted_text)

    # Create DataFrame from the collected data
    df = pd.DataFrame(answers_list)
    df["question"] = questions_list
    df["full_text"] = full_text_list

    # Data Cleaning: Replace empty lists/strings with NaN and join lists if needed
    df["extractive_spans"] = df["extractive_spans"].apply(
        lambda x: np.nan if isinstance(x, list) and len(x) == 0 else x
    )
    df["free_form_answer"] = df["free_form_answer"].apply(
        lambda x: np.nan if isinstance(x, str) and x.strip() == "" else x
    )
    df["yes_no"] = df["yes_no"].apply(lambda x: np.nan if x is None else x)
    df["extractive_spans"] = df["extractive_spans"].apply(
        lambda x: "\n".join(x) if isinstance(x, list) else x
    )

    # Combine the answer components into a single 'golden response'
    df["golden response"] = df.apply(lambda row: combine_responses(row), axis=1)

    return df

In [None]:
train_ds = dataset["train"]
validation_ds = dataset["validation"]
test_ds = dataset["test"]

In [20]:
train_df = preprocess_hf_dataset(train_ds)
validation_df = preprocess_hf_dataset(validation_ds)
test_df = preprocess_hf_dataset(test_ds)

Processing samples: 100%|██████████| 888/888 [00:04<00:00, 211.20sample/s]
Processing samples: 100%|██████████| 281/281 [00:01<00:00, 199.77sample/s]
Processing samples: 100%|██████████| 416/416 [00:02<00:00, 198.84sample/s]
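To see what one row of the resulting DataFrame contains, the sketch below repeats the flattening on a hand-made record without pandas. `golden_response` and `flatten_record` are illustrative names, not part of the notebook, the record values are invented, and `None` stands in for the `NaN` that the DataFrame version produces.

```python
def golden_response(answer):
    """Join the non-empty answer components, mirroring combine_responses."""
    parts = []
    if answer.get("extractive_spans"):
        parts.append("\n".join(answer["extractive_spans"]))
    if answer.get("yes_no") is not None:
        parts.append(str(answer["yes_no"]))
    if (answer.get("free_form_answer") or "").strip():
        parts.append(answer["free_form_answer"])
    return "\n".join(parts) if parts else None


def flatten_record(record):
    """Return one row per question, keeping only the first annotated answer."""
    rows = []
    pairs = zip(record["qas"]["question"], record["qas"]["answers"])
    for question, answer_set in pairs:
        first = answer_set["answer"][0]
        rows.append(
            {
                "question": question,
                "golden response": golden_response(first),
                "unanswerable": first["unanswerable"],
            }
        )
    return rows


def make_answer(spans=(), yes_no=None, free_form="", unanswerable=False):
    """Hypothetical helper that builds one annotator answer for the toy data."""
    return {
        "answer": [
            {
                "unanswerable": unanswerable,
                "extractive_spans": list(spans),
                "yes_no": yes_no,
                "free_form_answer": free_form,
            }
        ]
    }


toy_record = {
    "qas": {
        "question": ["which datasets?", "is it supervised?", "who funded it?"],
        "answers": [
            make_answer(spans=["Europarl", "MultiUN"]),
            make_answer(yes_no=False),
            make_answer(unanswerable=True),
        ],
    }
}

rows = flatten_record(toy_record)
for row in rows:
    print(row)
```

Unanswerable questions end up with an empty golden response, which is why some `reference` cells are blank later in the notebook.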


In [None]:
from llama_index.llms.google_genai import GoogleGenAI
from dotenv import load_dotenv

load_dotenv()

gemini_2 = GoogleGenAI(
    model="gemini-2.0-flash",
)

In [4]:
import os
from google import genai
from dotenv import load_dotenv

load_dotenv()

client = genai.Client(api_key=os.getenv("GOOGLE_API_KEY"))

response = client.models.generate_content(
    model="gemini-2.0-flash", contents="Explain how AI works in a few words"
)
print(response.text)

AI learns patterns from data to make predictions or decisions.



In [21]:
idx = 0
query = validation_df.iloc[idx]["question"]
context = validation_df.iloc[idx]["full_text"]
query

'which multilingual approaches do they compare with?'

In [59]:
context_str = context
query_str = query


# Plain template string; the placeholders are filled in by .format() below
qa_prompt = (
    "Context information is below.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Given the context information and not prior knowledge, "
    "answer the query.\n"
    "If you cannot answer the query, just say that it cannot be answered.\n"
    "Query: {query_str}\n"
    "Answer: "
)

formatted_prompt = qa_prompt.format(context_str=context_str, query_str=query_str)
response = gemini_2.complete(formatted_prompt)
response.text

'The paper compares its approach with multilingual NMT (MNMT) from  BIBREF19.  Another comparison is made against a pivoting method that uses MNMT (pivoting<sub>m</sub>), which uses MNMT to translate source to pivot and then to target.\n'

In [23]:
from google import genai

client = genai.Client()

context_str = context
query_str = query

# Plain template string; the placeholders are filled in by .format() below
qa_prompt = (
    "Context information is below.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Given the context information and not prior knowledge, "
    "answer the query.\n"
    "If you cannot answer the query, just say that it cannot be answered.\n"
    "Query: {query_str}\n"
    "Answer: "
)

formatted_prompt = qa_prompt.format(context_str=context_str, query_str=query_str)


response = await client.aio.models.generate_content(
    model='gemini-2.0-flash', contents=formatted_prompt
)
print(response.text)

They compare their approaches with Multilingual NMT (MNMT) described in BIBREF19 and BIBREF22.



In [11]:
validation_df.iloc[idx]["golden response"]

'BIBREF19\nBIBREF20'
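Comparing the two by eye, the model names one of the two reference citations and adds one that is not in the golden response. A quick, informal check of that overlap is sketched below. It uses plain string matching on the citation markers and is not a Ragas metric.

```python
import re

# Strings copied from the two cell outputs above.
response = (
    "They compare their approaches with Multilingual NMT (MNMT) "
    "described in BIBREF19 and BIBREF22."
)
golden = "BIBREF19\nBIBREF20"

golden_refs = set(golden.split("\n"))
response_refs = set(re.findall(r"BIBREF\d+", response))

matched = sorted(golden_refs & response_refs)
missed = sorted(golden_refs - response_refs)
extra = sorted(response_refs - golden_refs)

print("matched:", matched)  # ['BIBREF19']
print("missed: ", missed)   # ['BIBREF20']
print("extra:  ", extra)    # ['BIBREF22']
```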

In [None]:
from async_executor import AsyncExecutor

gemini_2 = GoogleGenAI(
    model="gemini-2.0-flash",
)

async def query_llm(query_str: str, context_str: str):
    formatted_prompt = qa_prompt.format(context_str=context_str, query_str=query_str)
    response = await gemini_2.acomplete(formatted_prompt)
    return response


# Create an instance of the asynchronous executor
executor = AsyncExecutor(
    desc="LLM Processing",
    show_progress=True,
    raise_exceptions=False,
    max_calls_per_minute=1250,
)

df = validation_df

In [19]:
for idx in range(df.shape[0]):
    query = df.iloc[idx]["question"]
    context = df.iloc[idx]["full_text"]
    executor.submit(query_llm, query, context)

# Execute the jobs and get the results in order
validation_responses = executor.results()

LLM Processing: 100%|██████████| 1005/1005 [00:52<00:00, 19.13it/s]
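The `async_executor` module is a local helper that is not shown in this notebook. As a rough idea of what such a helper does, the sketch below runs jobs concurrently and returns the results in submission order. It is a hypothetical stand-in, not the actual `AsyncExecutor`: it caps the number of in-flight calls with a semaphore rather than limiting calls per minute, and `fake_llm` replaces the real Gemini call.

```python
import asyncio


async def fake_llm(query: str, delay: float) -> str:
    """Stand-in for an LLM call; sleeps instead of hitting an API."""
    await asyncio.sleep(delay)
    return f"answer to: {query}"


async def run_all(jobs, max_concurrency: int = 2):
    """Run (query, delay) jobs concurrently; results keep submission order."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(query, delay):
        async with semaphore:
            try:
                return await fake_llm(query, delay)
            except Exception as exc:
                # Return the error instead of raising, so one failure
                # does not cancel the remaining jobs.
                return exc

    # gather() returns results in argument order, not completion order.
    return await asyncio.gather(*(guarded(q, d) for q, d in jobs))


jobs = [("q1", 0.03), ("q2", 0.01), ("q3", 0.02)]
# Inside a notebook, use `results = await run_all(jobs)` instead,
# because an event loop is already running there.
results = asyncio.run(run_all(jobs))
print(results)
```

Keeping results in submission order matters here because the responses are later matched to DataFrame rows by position.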


In [26]:
from ragas.dataset_schema import EvaluationDataset

dataset_list = []

for i in range(df.shape[0]):
    sample = {
        "user_input": (
            "" if pd.isna(df.iloc[i].get("question")) else df.iloc[i].get("question")
        ),
        "reference": (
            ""
            if pd.isna(df.iloc[i].get("golden response"))
            else df.iloc[i].get("golden response")
        ),
        "response": (
            ""
            if pd.isna(validation_responses[i].text)
            else validation_responses[i].text
        ),
    }
    dataset_list.append(sample)

dataset = EvaluationDataset.from_list(dataset_list)
dataset.to_pandas()

| | user_input | response | reference |
|---|---|---|---|
| 0 | which multilingual approaches do they compare ... | They compare their approaches with Multilingua... | BIBREF19\nBIBREF20 |
| 1 | what are the pivot-based baselines? | The pivot-based method is used as a baseline. ... | pivoting\npivoting$_{\rm m}$ |
| 2 | which datasets did they experiment with? | They experimented with two public datasets: Eu... | Europarl\nMultiUN |
| 3 | what language pairs are explored? | The language pairs explored in this paper are:... | De-En, En-Fr, Fr-En, En-Es, Ro-En, En-De, Ar-E... |
| 4 | what ner models were evaluated? | Stanford NER, spaCy 2.0, and a recurrent model... | Stanford NER\nspaCy 2.0 \nrecurrent model with... |
| ... | ... | ... | ... |
| 1000 | What approaches do they use towards text analy... | Based on the provided text, the approaches use... | Domain experts and fellow researchers can prov... |
| 1001 | What dataset do they use for analysis? | The context information mentions using data fr... | |
| 1002 | Do they demonstrate why interdisciplinary insi... | Yes, the text explicitly states that interdisc... | False |
| 1003 | What background do they have? | The authors are scholars from very different d... | |


In [None]:
from ragas.metrics import (
    AnswerAccuracy,
    AnswerCorrectness,
    FactualCorrectness,
    AspectCritic,
)
import getpass
import os

from ragas.llms import LangchainLLMWrapper
from langchain_openai import ChatOpenAI

if "OPENAI_API_KEY" not in os.environ:
    os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OpenAI API key: ")

evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini"))

aspect_critic = AspectCritic(
    name="unanswerable",
    definition="Return 1 if the query cannot be answered by the provided context, otherwise return 0.",
    llm=evaluator_llm,
)

metrics = [
    AnswerAccuracy(llm=evaluator_llm),
    AnswerCorrectness(llm=evaluator_llm, weights=[1, 0]),
    aspect_critic,
    FactualCorrectness(llm=evaluator_llm),
]
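The `weights=[1, 0]` argument deserves a note. As far as we understand the Ragas implementation, `AnswerCorrectness` takes a weighted average of a statement-level F1 score and a semantic-similarity score, with the weights given in that order, so `[1, 0]` keeps only the factual part. The sketch below illustrates that arithmetic with made-up counts. The function names are hypothetical and the formula is an assumption about the metric, not code from Ragas.

```python
def statement_f1(tp: int, fp: int, fn: int) -> float:
    """F1 over statements: supported (tp), unsupported (fp), missing (fn)."""
    denom = tp + 0.5 * (fp + fn)
    return tp / denom if denom else 0.0


def weighted_correctness(f1: float, similarity: float, weights) -> float:
    """Weighted average of the factual F1 and the semantic similarity."""
    w_fact, w_sim = weights
    return (w_fact * f1 + w_sim * similarity) / (w_fact + w_sim)


# Made-up example: one supported, one unsupported, one missing statement.
f1 = statement_f1(tp=1, fp=1, fn=1)

factual_only = weighted_correctness(f1, similarity=0.9, weights=(1, 0))
blended = weighted_correctness(f1, similarity=0.9, weights=(0.75, 0.25))

print(round(f1, 3), round(factual_only, 3), round(blended, 3))
```

With a zero similarity weight, the similarity value has no effect on the score, which also means no embedding model is involved in the result.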

In [21]:
validation_responses[0].text

'They compare their approaches with Multilingual NMT (MNMT) from BIBREF19 and BIBREF22.\n'

In [27]:
from ragas import evaluate

gemini_2_score = evaluate(dataset=dataset, metrics=metrics)
gemini_2_score.to_pandas()

Evaluating: 100%|██████████| 4020/4020 [25:38<00:00,  2.61it/s] 


| | user_input | response | reference | nv_accuracy | answer_correctness | unanswerable | factual_correctness(mode=f1) |
|---|---|---|---|---|---|---|---|
| 0 | which multilingual approaches do they compare ... | They compare their approaches with Multilingua... | BIBREF19\nBIBREF20 | 0.25 | 0.5 | 0 | 0.67 |
| 1 | what are the pivot-based baselines? | The pivot-based method is used as a baseline. ... | pivoting\npivoting$_{\rm m}$ | 0.50 | 0.8 | 0 | 0.00 |
| 2 | which datasets did they experiment with? | They experimented with two public datasets: Eu... | Europarl\nMultiUN | 1.00 | 0.8 | 0 | 0.40 |
| 3 | what language pairs are explored? | The language pairs explored in this paper are:... | De-En, En-Fr, Fr-En, En-Es, Ro-En, En-De, Ar-E... | 0.00 | 1.0 | 0 | 0.00 |
| 4 | what ner models were evaluated? | Stanford NER, spaCy 2.0, and a recurrent model... | Stanford NER\nspaCy 2.0 \nrecurrent model with... | 0.50 | 0.8 | 0 | 0.00 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 1000 | What approaches do they use towards text analy... | Based on the provided text, the approaches use... | Domain experts and fellow researchers can prov... | 0.00 | 0.0 | 0 | 0.00 |
| 1001 | What dataset do they use for analysis? | The context information mentions using data fr... | | 0.50 | 0.0 | 0 | 0.00 |
| 1002 | Do they demonstrate why interdisciplinary insi... | Yes, the text explicitly states that interdisc... | False | 0.00 | 0.0 | 0 | 0.00 |
| 1003 | What background do they have? | The authors are scholars from very different d... | | 0.50 | 0.0 | 1 | 0.00 |


In [None]:
gemini_2_score

This step is completely optional: if you want to upload the evaluation results to your Ragas app, run the command below. You can learn more about the Ragas app in the Ragas documentation.

In [28]:
gemini_2_score.upload()

Evaluation results uploaded! View at https://app.ragas.io/dashboard/alignment/evaluation/908c34a5-3996-4703-8eae-a7daf210c6d7


'https://app.ragas.io/dashboard/alignment/evaluation/908c34a5-3996-4703-8eae-a7daf210c6d7'

In [None]:
preds = gemini_2_score["unanswerable"]
actuals = validation_df["unanswerable"].astype(int)

In [None]:
from sklearn.metrics import (
    classification_report,
    accuracy_score,
    precision_score,
    recall_score,
    f1_score,
)

# Calculate and print basic metrics
print("Accuracy:", accuracy_score(actuals, preds))
print("Precision:", precision_score(actuals, preds))
print("Recall:", recall_score(actuals, preds))
print("F1 Score:", f1_score(actuals, preds))

# Generate and print the classification report
print("\nClassification Report:")
print(classification_report(actuals, preds))

Accuracy: 0.844776119402985
Precision: 0.31736526946107785
Recall: 0.5578947368421052
F1 Score: 0.40458015267175573

Classification Report:
              precision    recall  f1-score   support

           0       0.95      0.87      0.91       910
           1       0.32      0.56      0.40        95

    accuracy                           0.84      1005
   macro avg       0.63      0.72      0.66      1005
weighted avg       0.89      0.84      0.86      1005
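Accuracy looks high while precision is low because only 95 of the 1005 questions are unanswerable. The confusion counts used below are not printed by the notebook; they are reconstructed from the scores above (53 true positives, 114 false positives, 42 false negatives, 796 true negatives) and reproduce the printed values when plugged into the standard formulas.

```python
# Confusion counts reconstructed from the printed scores (an inference,
# not notebook output). Positive class = "unanswerable".
tp, fp, fn, tn = 53, 114, 42, 796
total = tp + fp + fn + tn

accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# Baseline that never predicts "unanswerable": correct on every negative.
always_negative_accuracy = (tn + fp) / total

print(f"accuracy  {accuracy:.4f}")
print(f"precision {precision:.4f}")
print(f"recall    {recall:.4f}")
print(f"f1        {f1:.4f}")
print(f"always-negative baseline accuracy {always_negative_accuracy:.4f}")
```

The baseline that never flags a question reaches about 0.905 accuracy, so precision, recall, and F1 on the positive class are the more informative numbers here.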



### Benchmarking Gemini 1.5 Flash

In [None]:
gemini_1_5 = GoogleGenAI(
    model="gemini-1.5-flash",
)


async def query_llm(query_str: str, context_str: str):
    formatted_prompt = qa_prompt.format(context_str=context_str, query_str=query_str)
    response = await gemini_1_5.acomplete(formatted_prompt)
    return response


# Create an instance of the asynchronous executor
executor = AsyncExecutor(
    desc="Querying LLM",
    show_progress=True,
    raise_exceptions=False,
    max_calls_per_minute=1250,
)

df = validation_df

In [None]:
for idx in range(df.shape[0]):
    query = df.iloc[idx]["question"]
    context = df.iloc[idx]["full_text"]
    executor.submit(query_llm, query, context)

# Execute the jobs and get the results in order
validation_responses = executor.results()

In [65]:
from ragas.dataset_schema import EvaluationDataset

dataset_list = []

for i in range(df.shape[0]):
    sample = {
        "user_input": (
            "" if pd.isna(df.iloc[i].get("question")) else df.iloc[i].get("question")
        ),
        "reference": (
            ""
            if pd.isna(df.iloc[i].get("golden response"))
            else df.iloc[i].get("golden response")
        ),
        "response": (
            ""
            if pd.isna(validation_responses[i].text)
            else validation_responses[i].text
        ),
    }
    dataset_list.append(sample)

dataset = EvaluationDataset.from_list(dataset_list)
dataset.to_pandas()

| | user_input | response | reference |
|---|---|---|---|
| 0 | which multilingual approaches do they compare ... | The paper compares its approach with multiling... | BIBREF19\nBIBREF20 |
| 1 | what are the pivot-based baselines? | The provided text mentions two types of pivot-... | pivoting\npivoting$_{\rm m}$ |
| 2 | which datasets did they experiment with? | The experiments were conducted on two public d... | Europarl\nMultiUN |
| 3 | what language pairs are explored? | The paper explores the following language pair... | De-En, En-Fr, Fr-En, En-Es, Ro-En, En-De, Ar-E... |
| 4 | what ner models were evaluated? | Stanford NER, spaCy 2.0, and a recurrent model... | Stanford NER\nspaCy 2.0 \nrecurrent model with... |
| ... | ... | ... | ... |
| 1000 | What approaches do they use towards text analy... | The authors utilize several approaches to text... | Domain experts and fellow researchers can prov... |
| 1001 | What dataset do they use for analysis? | The primary dataset used for analysis in the p... | |
| 1002 | Do they demonstrate why interdisciplinary insi... | Yes, the authors demonstrate the importance of... | False |
| 1003 | What background do they have? | The authors have diverse disciplinary backgrou... | |


In [67]:
from ragas import evaluate

gemini_1_5_score = evaluate(dataset=dataset, metrics=metrics)
gemini_1_5_score.to_pandas()

Evaluating: 100%|██████████| 4020/4020 [27:40<00:00,  2.42it/s] 


| | user_input | response | reference | nv_accuracy | answer_correctness | unanswerable | factual_correctness(mode=f1) |
|---|---|---|---|---|---|---|---|
| 0 | which multilingual approaches do they compare ... | The paper compares its approach with multiling... | BIBREF19\nBIBREF20 | 0.25 | 0.000000 | 0 | 0.0 |
| 1 | what are the pivot-based baselines? | The provided text mentions two types of pivot-... | pivoting\npivoting$_{\rm m}$ | 0.25 | 0.500000 | 0 | 0.0 |
| 2 | which datasets did they experiment with? | The experiments were conducted on two public d... | Europarl\nMultiUN | 1.00 | 1.000000 | 0 | 0.0 |
| 3 | what language pairs are explored? | The paper explores the following language pair... | De-En, En-Fr, Fr-En, En-Es, Ro-En, En-De, Ar-E... | 0.00 | 0.250000 | 0 | 0.0 |
| 4 | what ner models were evaluated? | Stanford NER, spaCy 2.0, and a recurrent model... | Stanford NER\nspaCy 2.0 \nrecurrent model with... | 0.50 | 0.571429 | 0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 1000 | What approaches do they use towards text analy... | The authors utilize several approaches to text... | Domain experts and fellow researchers can prov... | 0.00 | 0.000000 | 0 | 0.0 |
| 1001 | What dataset do they use for analysis? | The primary dataset used for analysis in the p... | | 1.00 | 0.000000 | 0 | 0.0 |
| 1002 | Do they demonstrate why interdisciplinary insi... | Yes, the authors demonstrate the importance of... | False | 0.00 | 0.000000 | 0 | 0.0 |
| 1003 | What background do they have? | The authors have diverse disciplinary backgrou... | | 0.75 | 0.000000 | 0 | 0.0 |


In [72]:
gemini_1_5_score

{'nv_accuracy': 0.4724, 'answer_correctness': 0.3366, 'unanswerable': 0.1841, 'factual_correctness(mode=f1)': 0.2269}

In [None]:
preds = gemini_1_5_score["unanswerable"]
actuals = validation_df["unanswerable"].astype(int)

In [None]:
from sklearn.metrics import (
    classification_report,
    accuracy_score,
    precision_score,
    recall_score,
    f1_score,
)

# Calculate and print basic metrics
print("Accuracy:", accuracy_score(actuals, preds))
print("Precision:", precision_score(actuals, preds))
print("Recall:", recall_score(actuals, preds))
print("F1 Score:", f1_score(actuals, preds))

# Generate and print the classification report
print("\nClassification Report:")
print(classification_report(actuals, preds))

Accuracy: 0.83681592039801
Precision: 0.31351351351351353
Recall: 0.6105263157894737
F1 Score: 0.4142857142857143

Classification Report:
              precision    recall  f1-score   support

           0       0.95      0.86      0.91       910
           1       0.31      0.61      0.41        95

    accuracy                           0.84      1005
   macro avg       0.63      0.74      0.66      1005
weighted avg       0.89      0.84      0.86      1005



In [71]:
gemini_1_5_score.upload()

Evaluation results uploaded! View at https://app.ragas.io/dashboard/alignment/evaluation/2a3849ff-b142-4440-9c13-42f5fda332c9


'https://app.ragas.io/dashboard/alignment/evaluation/2a3849ff-b142-4440-9c13-42f5fda332c9'

## Comparing the Results
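One simple way to compare the two runs is to put the "unanswerable" classification scores printed above side by side. The sketch below does that with values copied from the two sklearn outputs and rounded to four decimals. The dictionary layout and variable names are illustrative only.

```python
# Scores copied from the two sklearn outputs above, rounded to four decimals.
unanswerable_scores = {
    "gemini-2.0-flash": {
        "accuracy": 0.8448, "precision": 0.3174, "recall": 0.5579, "f1": 0.4046,
    },
    "gemini-1.5-flash": {
        "accuracy": 0.8368, "precision": 0.3135, "recall": 0.6105, "f1": 0.4143,
    },
}

metric_names = ["accuracy", "precision", "recall", "f1"]
models = list(unanswerable_scores)

# Print one row per metric and record which model scores higher.
best_by_metric = {}
print(f"{'metric':<10}" + "".join(f"{m:>20}" for m in models))
for name in metric_names:
    values = [unanswerable_scores[m][name] for m in models]
    best_by_metric[name] = max(models, key=lambda m: unanswerable_scores[m][name])
    print(f"{name:<10}" + "".join(f"{v:>20.4f}" for v in values))

print(best_by_metric)
```

On this validation split the two models are close: Gemini 2.0 Flash is slightly ahead on accuracy and precision, while Gemini 1.5 Flash is slightly ahead on recall and F1. The aggregate Ragas scores are only printed for Gemini 1.5 Flash above, so they are left out of this comparison.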

## Next Steps

By following the steps above, you can benchmark any model with Ragas metrics. You need to work out how to convert the benchmark dataset into a Ragas `EvaluationDataset`, select the metrics of your choice, and then run the `evaluate` function.
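The conversion step usually amounts to mapping each benchmark row onto the three keys used in this notebook: `user_input`, `reference`, and `response`. A minimal sketch follows. The helper name `to_ragas_samples` and the example rows are hypothetical, and the Ragas calls are left as comments so that the snippet runs without Ragas installed.

```python
def to_ragas_samples(questions, references, responses):
    """Build the list of dicts that EvaluationDataset.from_list is given above.

    Missing values (None) become empty strings, as in the notebook cells.
    """
    samples = []
    for question, reference, response in zip(questions, references, responses):
        samples.append(
            {
                "user_input": question or "",
                "reference": reference or "",
                "response": response or "",
            }
        )
    return samples


# Made-up rows standing in for another benchmark and another model's answers.
samples = to_ragas_samples(
    questions=["which datasets did they use?", "who funded the work?"],
    references=["Europarl\nMultiUN", None],
    responses=["They used Europarl and MultiUN.", "It cannot be answered."],
)
print(samples)

# With Ragas installed, the remaining steps are the same calls used earlier:
#   dataset = EvaluationDataset.from_list(samples)
#   scores = evaluate(dataset=dataset, metrics=metrics)
```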