Evaluating more than one dataset at a time returns incorrect results #50

Closed

elsatch opened this issue Jun 9, 2024 · 0 comments
elsatch commented Jun 9, 2024

Over the last few weeks I have been testing the evaluation features of ARES without getting the expected results. The errors I've found are related to #44, which was marked as closed but never actually solved.

Given that the current state of the code (ares-ai PyPI library 0.6.1) makes it impossible to get a correct ARES Ranking for different datasets in the final results, I decided to explore further.

Baseline

To establish an initial baseline, I executed the reference code from the Quick Start Guide 2. This is the relevant code:

from ares import ARES

ppi_config = { 
    "evaluation_datasets": ['nq_unlabeled_output.tsv'], 
    "few_shot_examples_filepath": "nq_few_shot_prompt_for_judge_scoring.tsv",
    "checkpoints": ["checkpoints/ares_context_relevance_general_checkpoint_V1.1.pt"], 
    "rag_type": "question_answering", 
    "labels": ["Context_Relevance_Label"], 
    "gold_label_path": "nq_labeled_output.tsv", 
}

ares = ARES(ppi=ppi_config)
results = ares.evaluate_RAG()
print(results)

The NQ datasets were downloaded using the wget commands from the setup part of the guide. The checkpoint was not trained locally; it was downloaded from the provided Drive link.

These are the results:

Context_Relevance_Label Scoring
ARES Ranking
ARES Prediction: [0.6056978059262574]
ARES Confidence Interval: [[0.547, 0.664]]
Number of Examples in Evaluation Set: [4421]
Ground Truth Performance: [0.6]
ARES LLM Judge Accuracy on Ground Truth Labels: [0.789]
Annotated Examples used for PPI: 300

Test - Evaluating more than one dataset at a time

To test this case, we download two evaluation datasets derived from NQ, available in the repository under datasets/eval_datasets/nq, using the following commands:

wget https://github.com/stanford-futuredata/ARES/raw/main/datasets/eval_datasets/nq/nq_ratio_0.65.tsv
wget https://github.com/stanford-futuredata/ARES/raw/main/datasets/eval_datasets/nq/nq_ratio_0.7.tsv
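
If wget is not available, the same files can be fetched with the Python standard library; this is an equivalent sketch using the URLs above:

# Download the two NQ ratio datasets with the Python standard library
# (equivalent to the wget commands above).
import urllib.request

urls = [
    "https://github.com/stanford-futuredata/ARES/raw/main/datasets/eval_datasets/nq/nq_ratio_0.65.tsv",
    "https://github.com/stanford-futuredata/ARES/raw/main/datasets/eval_datasets/nq/nq_ratio_0.7.tsv",
]
for url in urls:
    urllib.request.urlretrieve(url, url.rsplit("/", 1)[-1])  # save under the original filename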

This is the resulting code:

from ares import ARES

ppi_config = { 
    "evaluation_datasets": ['nq_ratio_0.65.tsv', 'nq_ratio_0.7.tsv'], 
    "few_shot_examples_filepath": "nq_few_shot_prompt_for_judge_scoring.tsv",
    "checkpoints": ["checkpoints/ares_context_relevance_general_checkpoint_V1.1.pt"], 
    "rag_type": "question_answering", 
    "labels": ["Context_Relevance_Label"], 
    "gold_label_path": "nq_labeled_output.tsv", 
}

ares = ARES(ppi=ppi_config)
results = ares.evaluate_RAG()
print(results)

And these are the results:

--------------------------------------------------------
Evaluation Sets: ['nq_ratio_0.65.tsv', 'nq_ratio_0.7.tsv']
Checkpoints: ['checkpoints/ares_context_relevance_general_checkpoint_V1.1.pt']
Labels: ['Context_Relevance_Label']
--------------------------------------------------------
[...]
--------------------------------------------------
Context_Relevance_Label Scoring
ARES Ranking
ARES Prediction: [0.6354300416564624]
ARES Confidence Interval: [[0.577, 0.694]]
Number of Examples in Evaluation Set: [4081]
Ground Truth Performance: [0.65]
ARES LLM Judge Accuracy on Ground Truth Labels: [0.792]
Annotated Examples used for PPI: 300
--------------------------------------------------
[...]
--------------------------------------------------
Context_Relevance_Label Scoring
ARES Ranking
ARES Prediction: [0.6354300416564624, 0.6638786279683391]
ARES Confidence Interval: [[0.577, 0.694], [0.605, 0.722]]
Number of Examples in Evaluation Set: [4081, 3790]
Ground Truth Performance: [0.65, 0.7]
ARES LLM Judge Accuracy on Ground Truth Labels: [0.792, 0.798]
Annotated Examples used for PPI: 300
--------------------------------------------------
# Reformatted to make it clear that the results are duplicated
[{'ARES_Prediction': 0.6354300416564624, 'ARES_Confidence_Interval': [0.577, 0.694], 'Number_of_Examples_in_Evaluation_Set': 4081, 'Ground_Truth_Performance': 0.65, 'ARES_LLM_Judge_Accuracy_on_Ground_Truth_Labels': 0.792, 'Annotated_Examples_used_for_PPI': 300},
 {'ARES_Prediction': 0.6354300416564624, 'ARES_Confidence_Interval': [0.577, 0.694], 'Number_of_Examples_in_Evaluation_Set': 4081, 'Ground_Truth_Performance': 0.65, 'ARES_LLM_Judge_Accuracy_on_Ground_Truth_Labels': 0.792, 'Annotated_Examples_used_for_PPI': 300}]
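
Reusing the results list returned by ares.evaluate_RAG() in the two-dataset script above, the duplication can be confirmed directly (a minimal check; the dict keys are the ones shown in the printout):

# 'results' is the list returned by ares.evaluate_RAG() in the script above.
assert len(results) == 2                  # one entry per evaluation dataset
print(results[0] == results[1])           # True: the second entry is a copy of the first
print(results[0]["ARES_Prediction"],
      results[1]["ARES_Prediction"])      # identical values despite different datasets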

The evaluation first prints the results for the first dataset and then appends the results for the second dataset. In the final recap, however, it duplicates the first dataset's scores into the second entry, returning incorrect results.

The problem compounds when analyzing several datasets with several labels: whenever more than one dataset is evaluated at a time, the results that should belong to the second dataset are overwritten with the results of the first dataset for the same label, as in the listing below (see also the illustrative sketch after it).

[
    # First label - First dataset
    {
        "ARES_Prediction": 0.6354300416564624,
        "ARES_Confidence_Interval": [0.577, 0.694],
        "Number_of_Examples_in_Evaluation_Set": 4081,
        "Ground_Truth_Performance": 0.65,
        "ARES_LLM_Judge_Accuracy_on_Ground_Truth_Labels": 0.792,
        "Annotated_Examples_used_for_PPI": 300,
    },
    # First label - should be second dataset. Duplicated
    {
        "ARES_Prediction": 0.6354300416564624,
        "ARES_Confidence_Interval": [0.577, 0.694],
        "Number_of_Examples_in_Evaluation_Set": 4081,
        "Ground_Truth_Performance": 0.65,
        "ARES_LLM_Judge_Accuracy_on_Ground_Truth_Labels": 0.792,
        "Annotated_Examples_used_for_PPI": 300,
    },
    # Second label - First dataset
    {
        "ARES_Prediction": 0.5664216286857816,
        "ARES_Confidence_Interval": [0.51, 0.622],
        "Number_of_Examples_in_Evaluation_Set": 4081,
        "Ground_Truth_Performance": 0.65,
        "ARES_LLM_Judge_Accuracy_on_Ground_Truth_Labels": 0.65,
        "Annotated_Examples_used_for_PPI": 300,
    },
    # Second label - should be second dataset. Duplicated
    {
        "ARES_Prediction": 0.5664216286857816,
        "ARES_Confidence_Interval": [0.51, 0.622],
        "Number_of_Examples_in_Evaluation_Set": 4081,
        "Ground_Truth_Performance": 0.65,
        "ARES_LLM_Judge_Accuracy_on_Ground_Truth_Labels": 0.65,
        "Annotated_Examples_used_for_PPI": 300,
    },
]
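
For illustration only (this is not the actual ARES code), the output above is consistent with an evaluation loop that appends the entry computed for the first dataset once per dataset instead of recomputing it, along the lines of this hypothetical sketch (score_one is a placeholder for the per-dataset scoring step):

# Hypothetical sketch of the accumulation bug pattern that would produce the
# duplicated output above; score_one() is a placeholder, not an ARES function.
def buggy_collect(datasets, labels, score_one):
    results = []
    for label in labels:
        entry = score_one(datasets[0], label)   # computed once, for the first dataset only
        for _dataset in datasets:
            results.append(entry)               # bug: the same entry is appended per dataset
    return results

def fixed_collect(datasets, labels, score_one):
    results = []
    for label in labels:
        for dataset in datasets:
            results.append(score_one(dataset, label))  # score each dataset separately
    return results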
robbym-dev added a commit that referenced this issue Jun 10, 2024
Updated eval loop and output to fix #50