Over the last few weeks I have been evaluating the evaluation features of ARES without getting the expected results. The errors I've found are related to #44, which was marked as closed but never actually solved.
Given that the current state of the code (ares-ai PyPI library 0.6.1) makes it impossible to get a proper ARES Ranking for different datasets in the final results, I decided to explore further.
Baseline
To establish an initial baseline, I executed the reference code from the Quick Start Guide. This is the relevant code:
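(A sketch of the Quick Start PPI evaluation call; the configuration keys and file names follow the guide as closely as I can reproduce them and may differ slightly in detail.)

from ares import ARES

ppi_config = {
    "evaluation_datasets": ["nq_unlabeled_output.tsv"],  # evaluation set name as in the guide (assumed)
    "few_shot_examples_filepath": "nq_few_shot_prompt_for_judge_scoring.tsv",  # assumed from the guide
    "checkpoints": ["checkpoints/ares_context_relevance_general_checkpoint_V1.1.pt"],
    "rag_type": "question_answering",
    "labels": ["Context_Relevance_Label"],
    "gold_label_path": "nq_labeled_output.tsv",  # assumed from the guide
}

ares = ARES(ppi=ppi_config)
results = ares.evaluate_RAG()
print(results)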
The NQ datasets were downloaded using the wget commands from the setup part of the guide. The checkpoint wasn't trained locally; it was downloaded from the provided Drive link.
These are the results:
Context_Relevance_Label Scoring
ARES Ranking
ARES Prediction: [0.6056978059262574]
ARES Confidence Interval: [[0.547, 0.664]]
Number of Examples in Evaluation Set: [4421]
Ground Truth Performance: [0.6]
ARES LLM Judge Accuracy on Ground Truth Labels: [0.789]
Annotated Examples used for PPI: 300
Test - Evaluating more than one dataset at a time
To test this, we download two different evaluation sets from the NQ data available in the repository under datasets/eval_datasets/nq (with wget, as in the setup part of the guide) and pass both files to the evaluation. This is the resulting code:
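(Again a sketch: apart from the two evaluation sets, the checkpoint, and the label shown in the output below, the remaining file names are assumed from the guide.)

from ares import ARES

ppi_config = {
    "evaluation_datasets": ["nq_ratio_0.65.tsv", "nq_ratio_0.7.tsv"],  # the two NQ eval sets
    "few_shot_examples_filepath": "nq_few_shot_prompt_for_judge_scoring.tsv",  # assumed from the guide
    "checkpoints": ["checkpoints/ares_context_relevance_general_checkpoint_V1.1.pt"],
    "rag_type": "question_answering",
    "labels": ["Context_Relevance_Label"],
    "gold_label_path": "nq_labeled_output.tsv",  # assumed from the guide
}

ares = ARES(ppi=ppi_config)
results = ares.evaluate_RAG()
print(results)

And these are the results: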
--------------------------------------------------------
Evaluation Sets: ['nq_ratio_0.65.tsv', 'nq_ratio_0.7.tsv']
Checkpoints: ['checkpoints/ares_context_relevance_general_checkpoint_V1.1.pt']
Labels: ['Context_Relevance_Label']
--------------------------------------------------------
[...]
--------------------------------------------------
Context_Relevance_Label Scoring
ARES Ranking
ARES Prediction: [0.6354300416564624]
ARES Confidence Interval: [[0.577, 0.694]]
Number of Examples in Evaluation Set: [4081]
Ground Truth Performance: [0.65]
ARES LLM Judge Accuracy on Ground Truth Labels: [0.792]
Annotated Examples used for PPI: 300
--------------------------------------------------
[...]
--------------------------------------------------
Context_Relevance_Label Scoring
ARES Ranking
ARES Prediction: [0.6354300416564624, 0.6638786279683391]
ARES Confidence Interval: [[0.577, 0.694], [0.605, 0.722]]
Number of Examples in Evaluation Set: [4081, 3790]
Ground Truth Performance: [0.65, 0.7]
ARES LLM Judge Accuracy on Ground Truth Labels: [0.792, 0.798]
Annotated Examples used for PPI: 300
--------------------------------------------------
# Reformatted to make clear that the results are duplicated
[{'ARES_Prediction': 0.6354300416564624, 'ARES_Confidence_Interval': [0.577, 0.694], 'Number_of_Examples_in_Evaluation_Set': 4081, 'Ground_Truth_Performance': 0.65, 'ARES_LLM_Judge_Accuracy_on_Ground_Truth_Labels': 0.792, 'Annotated_Examples_used_for_PPI': 300},
{'ARES_Prediction': 0.6354300416564624, 'ARES_Confidence_Interval': [0.577, 0.694], 'Number_of_Examples_in_Evaluation_Set': 4081, 'Ground_Truth_Performance': 0.65, 'ARES_LLM_Judge_Accuracy_on_Ground_Truth_Labels': 0.792, 'Annotated_Examples_used_for_PPI': 300}]
During the run, the evaluation first prints the results for the first dataset and then correctly appends the results for the second dataset. In the final recap, however, it duplicates the first dataset's scores into the second entry, so the returned results are incorrect.
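For reference, the per-dataset output above shows what the final recap should contain for the second entry (reconstructed manually from the logs, not actual library output):

[{'ARES_Prediction': 0.6354300416564624, 'ARES_Confidence_Interval': [0.577, 0.694], 'Number_of_Examples_in_Evaluation_Set': 4081, 'Ground_Truth_Performance': 0.65, 'ARES_LLM_Judge_Accuracy_on_Ground_Truth_Labels': 0.792, 'Annotated_Examples_used_for_PPI': 300},
{'ARES_Prediction': 0.6638786279683391, 'ARES_Confidence_Interval': [0.605, 0.722], 'Number_of_Examples_in_Evaluation_Set': 3790, 'Ground_Truth_Performance': 0.7, 'ARES_LLM_Judge_Accuracy_on_Ground_Truth_Labels': 0.798, 'Annotated_Examples_used_for_PPI': 300}]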
This problem multiplies when evaluating several datasets and several labels: the evaluation keeps returning incorrect results, overwriting the second dataset's results with those of the first dataset for the same label, as the configuration sketch and output below show.
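A configuration of the following shape reproduces this (sketch only: the second label and its checkpoint are placeholders, since the output below does not name them; any second label shows the same pattern):

from ares import ARES

ppi_config = {
    "evaluation_datasets": ["nq_ratio_0.65.tsv", "nq_ratio_0.7.tsv"],
    "few_shot_examples_filepath": "nq_few_shot_prompt_for_judge_scoring.tsv",  # assumed from the guide
    "checkpoints": [
        "checkpoints/ares_context_relevance_general_checkpoint_V1.1.pt",
        "checkpoints/ares_answer_relevance_general_checkpoint_V1.1.pt",  # placeholder second checkpoint
    ],
    "rag_type": "question_answering",
    "labels": [
        "Context_Relevance_Label",
        "Answer_Relevance_Label",  # placeholder second label, not named in the output below
    ],
    "gold_label_path": "nq_labeled_output.tsv",  # assumed from the guide
}

results = ARES(ppi=ppi_config).evaluate_RAG()
print(results)

The final recap then looks like this: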
[
# First label - First dataset
{
"ARES_Prediction": 0.6354300416564624,
"ARES_Confidence_Interval": [0.577, 0.694],
"Number_of_Examples_in_Evaluation_Set": 4081,
"Ground_Truth_Performance": 0.65,
"ARES_LLM_Judge_Accuracy_on_Ground_Truth_Labels": 0.792,
"Annotated_Examples_used_for_PPI": 300,
},
# First label - should be second dataset. Duplicated
{
"ARES_Prediction": 0.6354300416564624,
"ARES_Confidence_Interval": [0.577, 0.694],
"Number_of_Examples_in_Evaluation_Set": 4081,
"Ground_Truth_Performance": 0.65,
"ARES_LLM_Judge_Accuracy_on_Ground_Truth_Labels": 0.792,
"Annotated_Examples_used_for_PPI": 300,
},
# Second label - First dataset
{
"ARES_Prediction": 0.5664216286857816,
"ARES_Confidence_Interval": [0.51, 0.622],
"Number_of_Examples_in_Evaluation_Set": 4081,
"Ground_Truth_Performance": 0.65,
"ARES_LLM_Judge_Accuracy_on_Ground_Truth_Labels": 0.65,
"Annotated_Examples_used_for_PPI": 300,
},
# Second label - should be second dataset. Duplicated
{
"ARES_Prediction": 0.5664216286857816,
"ARES_Confidence_Interval": [0.51, 0.622],
"Number_of_Examples_in_Evaluation_Set": 4081,
"Ground_Truth_Performance": 0.65,
"ARES_LLM_Judge_Accuracy_on_Ground_Truth_Labels": 0.65,
"Annotated_Examples_used_for_PPI": 300,
},
]