Iteration over labels and datasets not working in PPI #44

Closed

WJ44 opened this issue May 27, 2024 · 4 comments

@WJ44
Contributor

WJ44 commented May 27, 2024

For evaluating RAG systems, the PPI config allows specifying multiple datasets and labels. These labels and datasets are iterated over in the rag_scoring_config method; however, there is a return statement inside the loop, so only the first combination is actually evaluated.

Let me know if you can look into this. I could also make a PR to fix it if you let me know what the expected return value should be in this case.
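
For illustration, here is a minimal sketch of the loop shape described above; the real rag_scoring_config does much more than this, and the names below (rag_scoring_config_sketch, score_pair) are hypothetical:

def rag_scoring_config_sketch(evaluation_datasets, labels, score_pair):
    # score_pair is any callable (dataset, label) -> result dict, standing in
    # for the per-combination scoring work the real method performs.
    #
    # Buggy shape (what this issue describes): a return inside the inner loop
    # exits after the first (dataset, label) combination.
    #
    #   for dataset in evaluation_datasets:
    #       for label in labels:
    #           return score_pair(dataset, label)
    #
    # One possible fix: collect a result per combination and return the full
    # list only after both loops have finished.
    results = []
    for dataset in evaluation_datasets:
        for label in labels:
            results.append(score_pair(dataset, label))
    return results

A list of per-combination result dicts would also match what evaluate_RAG already prints when several datasets are configured.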

@robbym-dev
Collaborator

Hi @WJ44

The issue with the return statement in the loop within the rag_scoring_config method has been resolved, allowing all combinations of datasets and labels to be correctly evaluated.

@elsatch
Contributor

elsatch commented Jun 4, 2024

Could you confirm, @WJ44, whether this issue has been solved on your side?

I have created an environment to test the results using the ares-ai 0.60.0 library, and I am getting very strange results in the output. Every single dataset is different, yet the evaluation I get is the same for each one, as if a variable were being overwritten.

from ares import ARES

eval_datasets = ['/mnt/data/dataset1.tsv', '/mnt/data/work/dataset2.tsv', '/mnt/data/work/dataset3.tsv', '/mnt/data/work/dataset4.tsv']

ppi_config = {
    "evaluation_datasets": eval_datasets,
    "few_shot_examples_filepath": "data/interim/few_shot_prompt_filename_customized_pytorch_v2.tsv",
    "checkpoints": [
        "notebooks/checkpoints/microsoft-mdeberta-v3-base/5e-06_1_True_Context_Relevance_Label_few_shot_prompt_filename_customized_v2_545281.pt",
        "notebooks/checkpoints/microsoft-mdeberta-v3-base/5e-06_1_True_Answer_Faithfulness_Label_few_shot_prompt_filename_customized_v2_568298.pt",
        "notebooks/checkpoints/microsoft-mdeberta-v3-base/5e-06_1_True_Answer_Relevance_Label_few_shot_prompt_filename_customized_v2_428380.pt",
    ],
    "labels": [
        #"Context_Relevance_Label",
        "Answer_Faithfulness_Label",
        #"Answer_Relevance_Label",
    ],
    "model_choice": "microsoft/mdeberta-v3-base",
    "GPT_scoring": False,
    # This file had to be modified manually to change the column names
    "gold_label_path": "data/interim/gold_queries_pytorch.tsv",
    "swap_human_labels_for_gpt4_labels": False,
}


ares = ARES(ppi=ppi_config)
results = ares.evaluate_RAG()
print(results)

In the output I am getting:

[{'ARES_Prediction': 0.6800000000000054, 'ARES_Confidence_Interval': [0.627, 0.733], 'Number_of_Examples_in_Evaluation_Set': 100, 'Ground_Truth_Performance': None, 'ARES_LLM_Judge_Accuracy_on_Ground_Truth_Labels': None, 'Annotated_Examples_used_for_PPI': 300},
 {'ARES_Prediction': 0.6800000000000054, 'ARES_Confidence_Interval': [0.627, 0.733], 'Number_of_Examples_in_Evaluation_Set': 100, 'Ground_Truth_Performance': None, 'ARES_LLM_Judge_Accuracy_on_Ground_Truth_Labels': None, 'Annotated_Examples_used_for_PPI': 300},
 {'ARES_Prediction': 0.6800000000000054, 'ARES_Confidence_Interval': [0.627, 0.733], 'Number_of_Examples_in_Evaluation_Set': 100, 'Ground_Truth_Performance': None, 'ARES_LLM_Judge_Accuracy_on_Ground_Truth_Labels': None, 'Annotated_Examples_used_for_PPI': 300},
 {'ARES_Prediction': 0.6800000000000054, 'ARES_Confidence_Interval': [0.627, 0.733], 'Number_of_Examples_in_Evaluation_Set': 100, 'Ground_Truth_Performance': None, 'ARES_LLM_Judge_Accuracy_on_Ground_Truth_Labels': None, 'Annotated_Examples_used_for_PPI': 300}]

Additionally, I am not getting the ARES ranking in the output.

Another thing related to the labels and datasets: if I add more than one label, it fails to produce any output at all. For example, if I uncomment Context_Relevance_Label in the config above and run the evaluation again with labels: ['Context_Relevance_Label', 'Answer_Faithfulness_Label'], then the response is:

Loaded model from checkpoint: notebooks/checkpoints/microsoft-mdeberta-v3-base/5e-06_1_True_Context_Relevance_Label_few_shot_prompt_filename_customized_v2_545281.pt
Traceback (most recent call last):                                                                                                                                                                                                                                                                                                                          
  File "/mnt/data/work/external/pip-ARES-0_60_env/.venv/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 3805, in get_loc
    return self._engine.get_loc(casted_key)
  File "index.pyx", line 167, in pandas._libs.index.IndexEngine.get_loc
  File "index.pyx", line 196, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 7081, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 7089, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'Answer_Faithfulness_Label'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/mnt/data/work/external/pip-ARES-0_60_env/notebooks/1-10-cgs-ARES-evaluate-dataset_v2.py", line 132, in <module>
    results = ares.evaluate_RAG()
  File "/mnt/data/work/external/pip-ARES-0_60_env/.venv/lib/python3.10/site-packages/ares/ares.py", line 144, in evaluate_RAG
    return rag_scoring_config(**self.ppi_config)
  File "/mnt/data/work/external/pip-ARES-0_60_env/.venv/lib/python3.10/site-packages/ares/rag_scoring.py", line 130, in rag_scoring_config
    test_set, Y_labeled_dataset, Y_labeled_dataloader, Y_labeled_predictions, Yhat_unlabeled_dataset, prediction_column = post_process_predictions(post_process_settings)
  File "/mnt/data/work/external/pip-ARES-0_60_env/.venv/lib/python3.10/site-packages/ares/RAG_Automatic_Evaluation/LLMJudge_RAG_Compared_Scoring.py", line 1042, in post_process_predictions
    test_set = test_set[test_set[label] != 0]
  File "/mnt/data/work/external/pip-ARES-0_60_env/.venv/lib/python3.10/site-packages/pandas/core/frame.py", line 4102, in __getitem__
    indexer = self.columns.get_loc(key)
  File "/mnt/data/work/external/pip-ARES-0_60_env/.venv/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 3812, in get_loc
    raise KeyError(key) from err
KeyError: 'Answer_Faithfulness_Label'

Note that 'Answer_Faithfulness_Label' exists, as it was used for the first evaluation.
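
As a quick way to tell whether the KeyError originates in the evaluation files themselves, one could list which of the expected label columns each TSV actually contains; this is only a hypothetical diagnostic, assuming pandas can read the files with a tab separator:

import pandas as pd

# Hypothetical diagnostic: report which expected label columns are missing from
# each evaluation TSV, to distinguish a file problem from state the library
# carries over between label iterations.
expected_labels = ["Context_Relevance_Label", "Answer_Faithfulness_Label"]
for path in eval_datasets:  # the same list defined in the config above
    columns = set(pd.read_csv(path, sep="\t").columns)
    missing = [label for label in expected_labels if label not in columns]
    print(path, "-> missing:", missing or "none")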

I have searched the updated repo documentation and README, and all the multi-label, multi-dataset examples have disappeared. Every single example uses just one label and one dataset.

@WJ44
Contributor Author

WJ44 commented Jun 5, 2024

As far as I can tell, the code in the main branch here still has a return statement in the innermost for loop, which seems like it would only evaluate the first combination, but perhaps I am missing something.

@robbym-dev
Collaborator

Issue resolved in PR #51
