Bugs in parsing alpaca eval 2.0 #204

Closed
chengjl19 opened this issue Jan 11, 2024 · 8 comments

@chengjl19

alpaca_eval_cot_gpt4_turbo_fn is not stable: the returned results always lack "ordered_models" as well as "rank", so parsing the results fails.

parsers_to_kwargs:
  json_parser:
    annotation_key: "ordered_models"
  ranking_parser:
    model_1_name: "m"

An example output: {"rankings": [{"model": "m", "position": 1}, {"model": "M", "position": 2}]}
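The mismatch between the configured key and the returned key can be illustrated with a minimal sketch. It assumes the parser expects a list under `ordered_models` whose items carry `model` and `rank` (the names mentioned in this issue). `extract_ranking` is a hypothetical helper written for illustration, not part of alpaca_eval.

```python
def extract_ranking(payload):
    """Return model names ordered best-first, or raise KeyError.

    Hypothetical helper: accepts the expected shape
    {"ordered_models": [{"model": ..., "rank": ...}]} and, as a fallback,
    the shape reported in this issue
    {"rankings": [{"model": ..., "position": ...}]}.
    """
    for list_key, rank_key in (("ordered_models", "rank"), ("rankings", "position")):
        if list_key in payload:
            # Sort by the numeric rank so the best model comes first.
            items = sorted(payload[list_key], key=lambda item: item[rank_key])
            return [item["model"] for item in items]
    raise KeyError("neither 'ordered_models' nor 'rankings' found")


observed = {"rankings": [{"model": "m", "position": 1}, {"model": "M", "position": 2}]}
expected = {"ordered_models": [{"model": "M", "rank": 2}, {"model": "m", "rank": 1}]}

print(extract_ranking(observed))  # ['m', 'M']
print(extract_ranking(expected))  # ['m', 'M']
```

A fallback like this only hides the symptom; a parser that looks up `ordered_models` alone would fail on the observed payload.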

@YannDubs
Collaborator

Thanks @chengjl19. Are you using alpaca_eval_cot_gpt4_turbo_fn instead of the AlpacaEval 2 annotator weighted_alpaca_eval_gpt4_turbo on purpose? Which version of alpaca_eval are you using?

I'm still surprised alpaca_eval_cot_gpt4_turbo_fn doesn't work: I've evaluated 11 models with it and never had an issue. What input did you give it?

@chengjl19
Author

[image: Screenshot 2024-01-12 09 41 52]

As stated in the README, alpaca_eval_cot_gpt4_turbo_fn is the default annotator. I am using the newest version, 0.5.2.

Here is an example input and output:

"input": [
    {
        "content": "You are a highly efficient assistant, who evaluates and rank large language models (LLMs) based on the quality of their responses to given prompts. This process will create a leaderboard reflecting the most accurate and human-preferred answers.",
        "role": "system"
    },
    {
        "content": "I require a leaderboard for various large language models. I'll provide you with prompts given to these models and their corresponding responses. Your task is to assess these responses, ranking the models in order of preference from a human perspective. Once ranked, please output the results in a structured JSON format for the make_partial_leaderboard function.\n\n## Prompt\n\n{\n    \"instruction\": \"\"\"What are the names of some famous actors that started their careers on Broadway?\"\"\",\n}\n\n## Model Outputs\n\nHere are the unordered outputs from the models. Each output is associated with a specific model, identified by a unique model identifier.\n\n{\n    {\n        \"model\": \"m\",\n        \"output\": \"\"\"1. Meryl Streep - This Academy Award-winning actress got her start on Broadway in the 1975 production of \"Trelawny of the 'Wells'\" by Arthur Wing Pinero. She played the role of Laura Trelawny, the youngest daughter of the Trelawny family.\n\n2. Audra McDonald - Known for her impressive range in musical theater, McDonald made her Broadway debut in \"A Raisin in the Sun\" in 1995. She played the role of Beneatha Younger, the daughter of the lead character, Walter Lee Younger.\n\n3. Hugh Jackman - Before his international fame as Wolverine in the \"X-Men\" franchise, Jackman started his career on Broadway in the 1996 revival of \"Oklahoma!\". He played the lead role of Curly McClain.\n\n4. Phillipa Soo - Best known for originating the role of Eliza Schuyler in \"Hamilton\", Soo began her career on Broadway in the 2014 revival of \"Les Mis\u00e9rables\". She played the role of Fantine.\n\n5. Mark Ruffalo - The \"Avengers\" star made his Broadway debut in 1992 in \"The Exonerated\", a play based on the true stories of death row inmates. Ruffalo played the role of Robert Henry Duke.\n\n6. Liev Schreiber - This Tony Award-winning actor made his Broadway debut in 1993 in \"A View from the Bridge\" by Arthur Miller. 
He played the role of Eddie Carbone.\n\n7. Cynthia Nixon - Best known for her role as Miranda Hobbes in \"Sex and the City\", Nixon began her career on Broadway in the 1981 production of \"Born Yesterday\". She played the role of Judy Haynes, a newspaper reporter.\n\n8. Ben Platt - The \"Dear Evan Hansen\" composer and star made his Broadway debut in the 2011 revival of \"The Book of Mormon\". He played the role of Elder Cunningham.\n\n9. Nathan Lane - A Tony Award-winning actor, Lane began his Broadway career in the 1982 production of \"Lend Me a Tenor\". He played the role of Max.\n\n10. Glenn Close - The three-time Academy Award nominee got her start on Broadway in the 1974 production of \"Love for Love\" by William Congreve. Close played the role of Celia.\n\n11. Richard Dreyfuss - Before his breakout role in \"Jaws\", Dreyfuss made his Broadway debut in the 1967 production of \"The Aspern Papers\" by Jennifer Johnston. He played the role of Jeff Trissel.\n\n12. Bernadette Peters - A Tony Award-winning actress with an impressive Broadway resume, Peters made her debut in the 1967 production of \"Mrs. Santa Claus\". She played the title role.\n\n13. Patti LuPone - This Tony Award-winning actress made her Broadway debut in the 1972 production of \"The Cradle Will Rock\". She played the role of Sister Rose.\n\n14. Swoosie Kurtz - The TV star made her Broadway debut in the 1974 production of \"The House of Blue Leaves\" by John Guare. Kurtz played the role of Corabeth.\n\n15. Matthew Broderick - Known for his roles in \"Ferris Bueller's Day Off\" and \"The Lion King\", Broderick began his Broadway career in the 1982 production of \"Brighton Beach Memoirs\" by Neil Simon. He played the role of Eugene Morris Jerome.\n\n16. F. Murray Abraham - The \"Amadeus\" star made his Broadway debut in the 1968 production of \"The Great White Hope\". Abraham played the role of Jack Jefferson.\n\n17. 
Joanne Woodward - Before her marriage to Paul Newman, Woodward made her Broadway debut in the 1954 production of \"Three Tables\". She played the role of Lucy.\n\n18. James Earl Jones - The \"Star Wars\" actor made his Broadway debut in the 1957 production of \"Sunrise at Campobello\". Jones played the role of Franklin D. Roosevelt.\n\n19. Angela Lansbury - The Tony Award-winning actress started her Broadway career in the 1942 production of \"Strike Up the Band\". She played the role of Kitty Canfield.\n\n20. George C. Scott - The late actor, known for his roles in \"Patton\" and \"The Hustler\", started his Broadway career in the 1950 production of \"The Caine Mutiny Court-Martial\". Scott played the role of Lt. Barney Greenwald.\"\"\"\n    },\n    {\n        \"model\": \"M\",\n        \"output\": \"\"\"1. Meryl Streep - This Academy Award-winning actress got her start on Broadway in the 1975 production of \"Trelawny of the 'Wells'\" by Arthur Wing Pinero.\"\"\"\n    }\n}\n\n## Task\n\nEvaluate and rank the models based on the quality and relevance of their outputs. The ranking should be such that the model with the highest quality output is ranked first.",
        "role": "user"
    }
],

    "output": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "To evaluate and rank the models, I will consider the following criteria:\n\n1. **Relevance**: How well the response answers the prompt.\n2. **Completeness**: The extent to which the response provides a comprehensive answer.\n3. **Accuracy**: The factual correctness of the information provided.\n4. **Clarity**: How clear and understandable the response is.\n\nBased on the provided outputs:\n\n- **Model \"m\"** provides a detailed list of 20 actors, including their Broadway debuts and the roles they played. The response is relevant, comprehensive, and appears to be factually accurate, offering a rich set of examples that directly answer the prompt. The clarity is high, with a well-structured list that is easy to follow.\n\n- **Model \"M\"** offers a single example of an actor's Broadway debut. While the response is relevant and accurate, it lacks completeness compared to Model \"m\". It provides a very brief answer when the prompt seems to request multiple examples.\n\nGiven these observations, the ranking is as follows:\n\n1. Model \"m\" - for its comprehensive, relevant, accurate, and clear response.\n2. Model \"M\" - for its accurate but less comprehensive response.\n\nHere is the ranking in the requested JSON format:\n\n```json\n{\n  \"rankings\": [\n    {\n      \"model\": \"m\",\n      \"position\": 1\n    },\n    {\n      \"model\": \"M\",\n      \"position\": 2\n    }\n  ]\n}\n```\n\nThis JSON structure can be used as input for the `make_partial_leaderboard` function to reflect the ranking based on the quality of the responses."
            },
            "finish_reason": "stop"
        }
    ]
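The reply above is free text containing a fenced JSON block rather than a structured function call. A minimal sketch for inspecting such a reply follows; `parse_fenced_json` is a hypothetical helper, not part of alpaca_eval, and the sample reply is shortened from the output shown here.

```python
import json
import re

# Matches a fenced block, optionally tagged "json". The `{3} quantifier avoids
# writing a literal fence inside this example.
FENCED_JSON = re.compile(r"`{3}(?:json)?\s*(.*?)`{3}", re.DOTALL)


def parse_fenced_json(text):
    """Return the first fenced JSON value in `text`, or None if there is none."""
    match = FENCED_JSON.search(text)
    if match is None:
        return None
    try:
        return json.loads(match.group(1))
    except json.JSONDecodeError:
        return None


fence = "`" * 3
reply = (
    "Here is the ranking in the requested JSON format:\n\n"
    f"{fence}json\n"
    '{"rankings": [{"model": "m", "position": 1}, {"model": "M", "position": 2}]}\n'
    f"{fence}\n"
)

parsed = parse_fenced_json(reply)
print(sorted(parsed))              # ['rankings']
print("ordered_models" in parsed)  # False
```

The check at the end shows why a parser configured with `annotation_key: "ordered_models"` finds nothing to read in this reply.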

@YannDubs
Collaborator

Sorry, that was an old version of the README; I've fixed it now. The annotator for AE2 (AlpacaEval 2) is weighted_alpaca_eval_gpt4_turbo. Note that you don't even need to specify it; it's set automatically here: https://github.com/tatsu-lab/alpaca_eval/blob/main/src/alpaca_eval/constants.py#L41

I'm still surprised that alpaca_eval_cot_gpt4_turbo_fn doesn't work. Are you using it with the alpaca_eval configs? If you are using it directly in your code, it will not work, because you are not giving the JSON schema of the function to OpenAI: https://github.com/tatsu-lab/alpaca_eval/blob/main/src/alpaca_eval/evaluators_configs/alpaca_eval_cot_gpt4_turbo_fn/configs.yaml#L10

Feel free to reopen if you still encounter any issues.
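To make the point about the function schema concrete, here is a rough sketch of what attaching one to a request could look like. The function name `make_partial_leaderboard` and the keys `ordered_models` and `rank` come from this thread; everything else (the description, the exact schema layout, the `build_request_kwargs` helper, and the placeholder model name) is an assumption for illustration and may differ from the linked config. The dictionary is shaped for the `tools` and `tool_choice` parameters of the OpenAI Chat Completions API.

```python
import json

# Hypothetical schema: only the function name and the "ordered_models" / "rank"
# keys are taken from this thread; the rest is assumed for illustration.
LEADERBOARD_FUNCTION = {
    "name": "make_partial_leaderboard",
    "description": "Make a leaderboard from models ordered by output preference.",
    "parameters": {
        "type": "object",
        "properties": {
            "ordered_models": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "model": {"type": "string"},
                        "rank": {"type": "number"},
                    },
                    "required": ["model", "rank"],
                },
            }
        },
        "required": ["ordered_models"],
    },
}


def build_request_kwargs(messages, model_name):
    """Assemble keyword arguments that attach the schema and force the call."""
    return {
        "model": model_name,
        "messages": messages,
        "tools": [{"type": "function", "function": LEADERBOARD_FUNCTION}],
        "tool_choice": {
            "type": "function",
            "function": {"name": LEADERBOARD_FUNCTION["name"]},
        },
    }


kwargs = build_request_kwargs(
    messages=[{"role": "user", "content": "Rank the models."}],
    model_name="placeholder-model-name",  # placeholder, not a real model id
)
print(json.dumps(kwargs["tool_choice"]))
```

Without a schema like this in the request, the model has no structured target and is free to invent its own keys, which matches the `rankings` / `position` output reported earlier in the thread.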

@RZFan525

RZFan525 commented Mar 8, 2024

I'm using weighted_alpaca_eval_gpt4_turbo. However, there is an error:

[image]

@YannDubs
Collaborator

YannDubs commented Mar 8, 2024

This seems like a different issue. Where is the model? You seem to be using /lonlie.plus7.plus rather than OpenAI.

@RZFan525

RZFan525 commented Mar 9, 2024

> This seems like a different issue. Where is the model? You seem to be using /lonlie.plus7.plus rather than OpenAI.

This is an intermediary interface, which will forward to OpenAI.

@YannDubs
Collaborator

YannDubs commented Mar 9, 2024

My guess is that either the issue is on the intermediary interface or you need to update openai (less likely). Can you try using the base interface for one example?

@RZFan525

> My guess is that either the issue is on the intermediary interface or you need to update openai (less likely). Can you try using the base interface for one example?

Yes, it is on the intermediary interface. Thank you for your answer!
