Iteration over labels and datasets not working in PPI #44
Hi @WJ44, the issue with the return statement in the loop within the rag_scoring_config method has been resolved, allowing all combinations of datasets and labels to be correctly evaluated.
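For anyone following along, the fix amounts to accumulating one result per (dataset, label) combination and returning only after the loops finish. A minimal sketch of the intended pattern, with hypothetical names (score_combination is a stand-in, not the real ARES internal):

```python
def rag_scoring_sketch(evaluation_datasets, labels, score_combination):
    # Collect one result per (dataset, label) combination instead of returning early.
    results = []
    for dataset in evaluation_datasets:
        for label in labels:
            results.append(score_combination(dataset, label))
    return results  # returned only after every combination has been scored
```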
Could you confirm @WJ44 if this issue has been solved on your side? I have created an environment to test the results using the ares-ai 0.60.0 library, and I am getting very strange results in the output. Every single dataset is different, yet the evaluation I get is the same for every model, as if the variable were being overwritten.

```python
eval_datasets = [
    '/mnt/data/dataset1.tsv',
    '/mnt/data/work/dataset2.tsv',
    '/mnt/data/work/dataset3.tsv',
    '/mnt/data/work/dataset4.tsv',
]

ppi_config = {
    "evaluation_datasets": eval_datasets,
    "few_shot_examples_filepath": "data/interim/few_shot_prompt_filename_customized_pytorch_v2.tsv",
    "checkpoints": [
        "notebooks/checkpoints/microsoft-mdeberta-v3-base/5e-06_1_True_Context_Relevance_Label_few_shot_prompt_filename_customized_v2_545281.pt",
        "notebooks/checkpoints/microsoft-mdeberta-v3-base/5e-06_1_True_Answer_Faithfulness_Label_few_shot_prompt_filename_customized_v2_568298.pt",
        "notebooks/checkpoints/microsoft-mdeberta-v3-base/5e-06_1_True_Answer_Relevance_Label_few_shot_prompt_filename_customized_v2_428380.pt",
    ],
    "labels": [
        # "Context_Relevance_Label",
        "Answer_Faithfulness_Label",
        # "Answer_Relevance_Label",
    ],
    "model_choice": "microsoft/mdeberta-v3-base",
    "GPT_scoring": False,
    # This file had to be modified manually to change the column names
    "gold_label_path": "data/interim/gold_queries_pytorch.tsv",
    "swap_human_labels_for_gpt4_labels": False,
}

ares = ARES(ppi=ppi_config)
results = ares.evaluate_RAG()
print(results)
```

In the output I am getting:

```python
[{'ARES_Prediction': 0.6800000000000054, 'ARES_Confidence_Interval': [0.627, 0.733], 'Number_of_Examples_in_Evaluation_Set': 100, 'Ground_Truth_Performance': None, 'ARES_LLM_Judge_Accuracy_on_Ground_Truth_Labels': None, 'Annotated_Examples_used_for_PPI': 300},
 {'ARES_Prediction': 0.6800000000000054, 'ARES_Confidence_Interval': [0.627, 0.733], 'Number_of_Examples_in_Evaluation_Set': 100, 'Ground_Truth_Performance': None, 'ARES_LLM_Judge_Accuracy_on_Ground_Truth_Labels': None, 'Annotated_Examples_used_for_PPI': 300},
 {'ARES_Prediction': 0.6800000000000054, 'ARES_Confidence_Interval': [0.627, 0.733], 'Number_of_Examples_in_Evaluation_Set': 100, 'Ground_Truth_Performance': None, 'ARES_LLM_Judge_Accuracy_on_Ground_Truth_Labels': None, 'Annotated_Examples_used_for_PPI': 300},
 {'ARES_Prediction': 0.6800000000000054, 'ARES_Confidence_Interval': [0.627, 0.733], 'Number_of_Examples_in_Evaluation_Set': 100, 'Ground_Truth_Performance': None, 'ARES_LLM_Judge_Accuracy_on_Ground_Truth_Labels': None, 'Annotated_Examples_used_for_PPI': 300}]
```

Additionally, I am not getting the ARES ranking in the output. Another thing related to the labels and datasets is that if I add more than one label, no output is produced at all. So, for example, if I uncomment Context_Relevance_Label in the above config and run the evaluation again using labels: ['Context_Relevance_Label', 'Answer_Faithfulness_Label'], then the response is:
Note that Answer_Faithfulness_Label exists, as it was used for the first evaluation. I have searched the updated repo documentation and README, and all the multi-label, multi-dataset examples have disappeared. Every single example uses just one label and one dataset.
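As a quick sanity check that each combination really produces its own evaluation, the returned list can be paired back up with the (dataset, label) combinations it is supposed to cover. A rough sketch, continuing from the script above and assuming results comes back with one dict per combination in iteration order:

```python
from itertools import product

# Pair each returned result with the (dataset, label) combination it should belong to.
# If distinct datasets all map to identical dicts, a variable is likely being overwritten.
for (dataset, label), result in zip(product(eval_datasets, ppi_config["labels"]), results):
    print(dataset, label, result["ARES_Prediction"], result["ARES_Confidence_Interval"])
```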
As far as I can tell, the code in the main branch here still has a return statement in the most nested for loop, which seems like it would only evaluate the first combination, but perhaps I am missing something.
Issue resolved in PR #51
For evaluating RAG systems, the PPI config allows specifying multiple datasets and labels. These labels and datasets are iterated over in the rag_scoring_config method; however, there is a return statement inside the loop, so only the first combination is actually evaluated.
Let me know if you could look into this. I could also make a PR to solve this if you let me know what the expected return value should be in this case.
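To make the described behaviour concrete, here is a schematic of the loop shape in question (hypothetical names, not the actual ARES source; evaluate_combination stands in for the per-combination scoring):

```python
def rag_scoring_config_schematic(evaluation_datasets, labels, evaluate_combination):
    # Schematic of the reported bug: the return sits inside the innermost loop,
    # so the function exits after scoring the very first (dataset, label) pair.
    for dataset in evaluation_datasets:
        for label in labels:
            result = evaluate_combination(dataset, label)
            return result  # <- early return; the remaining combinations are never evaluated
```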