
Add support for analyzing evaluators with custom cross-annotations #281

Merged: 1 commit merged into tatsu-lab:main on Apr 18, 2024

Conversation

@rdnfn (Contributor) commented on Apr 17, 2024

Firstly, thanks a lot for creating and sharing the AlpacaEval package! I am finding it very useful (and well documented).

This PR fixes a small bug when analyzing evaluators on a custom (new) cross-annotation dataset. I found that the main.analyze_evaluators function does not support this use case yet. In particular, the alpaca_eval.analyze.Analyzer class assumes that the default cross-annotation dataset is being used when computing the correlations. Because the generator column is not present in that default dataset, it is extracted/matched from the main annotation dataset (referring to this line), and this matching fails if you use a different cross-annotation dataset. I therefore updated the if-statement so that the matching only runs when no generator column is present (roughly as sketched below). If the generator column is already present in the custom cross-annotation dataset (as in the example below), the matching code is skipped and no error is thrown.
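For reference, here is a minimal sketch of the changed guard. The function and variable names below (ensure_generator_column, df_crossannotations, df_annotations) are illustrative placeholders, not the exact identifiers used in alpaca_eval.analyze:

# Sketch of the guard introduced by this PR; names are placeholders, not the
# exact identifiers in alpaca_eval.analyze.
import pandas as pd

def ensure_generator_column(df_crossannotations: pd.DataFrame,
                            df_annotations: pd.DataFrame) -> pd.DataFrame:
    """Only match the generator from the main annotations when it is missing."""
    if "generator" not in df_crossannotations.columns:
        # Previously this matching ran unconditionally and failed for custom
        # cross-annotation datasets not aligned with the main annotation dataset.
        df_crossannotations = df_crossannotations.merge(
            df_annotations[["instruction", "output_1", "output_2", "generator"]],
            on=["instruction", "output_1", "output_2"],
            how="left",
        )
    return df_crossannotations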

Let me know if this fix would be helpful to add.

Reproducing the use case

Simple code to test the default annotator on a custom cross-annotation dataset:

# note that this assumes that the OpenAI API key is set in client_configs
from alpaca_eval import main, constants

# Analyze the default annotator on a custom cross-annotation dataset
evaluator_leaderboard, all_crossannotations = main.analyze_evaluators(
    annotators_config=constants.DEFAULT_ANNOTATOR_CONFIG,
    is_return_instead_of_print=True,
    precomputed_leaderboard="tmp_leaderboard.csv",
    is_single_annotator=True,
    analyzer_kwargs={
        "gold_crossannotations": "test_custom_crossannotations.json",
        "gold_annotations": None,
    }
)
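Since is_return_instead_of_print=True is set, the leaderboard and cross-annotations are returned instead of printed, so they can be inspected directly afterwards (a quick check for illustration, not part of the fix):

# Inspect the returned results
print(evaluator_leaderboard)
print(all_crossannotations)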

It can be run with the following (toy) test_custom_crossannotations.json file:

[
    {
        "instruction": "Do you prefer cats or dogs?",
        "output_1": "Cats",
        "output_2": "Dogs",
        "preference": 1,
        "annotator_index": 15,
        "dataset": "custom_dataset",
        "datasplit": "eval",
        "generator": "dummy_model_01"
    },
    {
        "instruction": "Do you prefer cats or dogs?",
        "output_1": "Cats",
        "output_2": "Dogs",
        "preference": 1,
        "annotator_index": 0,
        "dataset": "custom_dataset",
        "datasplit": "eval",
        "generator": "dummy_model_01"
    },
    {
        "instruction": "Do you prefer cats or dogs?",
        "output_1": "Cats",
        "output_2": "Dogs",
        "preference": 1,
        "annotator_index": 9,
        "dataset": "custom_dataset",
        "datasplit": "eval",
        "generator": "dummy_model_01"
    },
    {
        "instruction": "Do you prefer cats or dogs?",
        "output_1": "Cats",
        "output_2": "Dogs",
        "preference": 2,
        "annotator_index": 7,
        "dataset": "custom_dataset",
        "datasplit": "eval",
        "generator": "dummy_model_01"
    },
    {
        "instruction": "Should I get the blue or green box?",
        "output_1": "Green",
        "output_2": "Blue",
        "preference": 1,
        "annotator_index": 10,
        "dataset": "custom_dataset",
        "datasplit": "eval",
        "generator": "dummy_model_02"
    },
    {
        "instruction": "Should I get the blue or green box?",
        "output_1": "Green",
        "output_2": "Blue",
        "preference": 2,
        "annotator_index": 15,
        "dataset": "custom_dataset",
        "datasplit": "eval",
        "generator": "dummy_model_02"
    },
    {
        "instruction": "Should I get the blue or green box?",
        "output_1": "Green",
        "output_2": "Blue",
        "preference": 1,
        "annotator_index": 0,
        "dataset": "custom_dataset",
        "datasplit": "eval",
        "generator": "dummy_model_02"
    },
    {
        "instruction": "Should I get the blue or green box?",
        "output_1": "Green",
        "output_2": "Blue",
        "preference": 2,
        "annotator_index": 4,
        "dataset": "custom_dataset",
        "datasplit": "eval",
        "generator": "dummy_model_02"
    }
]

@rdnfn changed the title from "Add support for adding evaluators with custom cross-annotations" to "Add support for analyzing evaluators with custom cross-annotations" on Apr 17, 2024
@YannDubs (Collaborator) commented on Apr 18, 2024

Great, thanks @rdnfn for the detailed PR and kind words! 💯

@YannDubs merged commit d1b3061 into tatsu-lab:main on Apr 18, 2024
1 check passed