Trouble re-creating the guanaco_33b evaluator baseline #220

Closed
mathewhuen opened this issue Jan 29, 2024 · 6 comments


@mathewhuen

Hi, I am unfortunately restricted to using offline evaluators. I tried re-running the guanaco_33b evaluator baseline using

alpaca_eval analyze_evaluators --annotators_config=guanaco_baseline

after copying the guanaco_33b evaluator to a new guanaco_baseline directory and updating the fn_completions and is_fast_tokenizer fields in the configs.yaml file to be

guanaco_baseline:
  prompt_template: "guanaco_baseline/basic_prompt.txt"
  fn_completions: "huggingface_local_completions"
  completion_kwargs:
    model_name: "timdettmers/guanaco-33b-merged"
    max_new_tokens: 50
    is_fast_tokenizer: False
  completion_parser_kwargs:
    outputs_to_match:
      1: '(?:^|\n) ?Output \(a\)'
      2: '(?:^|\n) ?Output \(b\)'
  batch_size: 1

but the human agreement was much lower than the published figure (59.02 vs. 62.75). Can you give me any feedback on re-creating this evaluation?

I am using

accelerate==0.26.1
bitsandbytes==0.42.0
datasets==2.16.1
pandas==2.2.0
openai==1.9.0
optimum==1.16.2
scipy==1.12.0
tiktoken==0.5.2
torch==2.1.2
transformers==4.36.2
triton==2.1.0

and a version of alpaca_eval from just before the alpaca-eval 2 update (495b606).

I should note that I had to slightly modify the code to get it to work with optimum. BetterTransformer support for the LLaMA architecture is now implemented natively in transformers, so optimum was raising a ValueError, which makes me suspect a version issue. By chance, does anyone have a requirements.txt file with exact versions?

Thank you for your help!

@YannDubs
Collaborator

Hi @mathewhuen! The Guanaco results were computed with the Hugging Face API a long time ago, so they will unfortunately be hard to recreate. That being said, 3pp seems like a big drop! Are you using temperature: 0.7?

@mathewhuen
Author

Thank you for the quick reply @YannDubs! It looks like I was not using any temperature setting, so I ran it again with do_sample: True and temperature: 0.7, and the human agreement dropped to 53.3, which seems extreme. I noticed that more evaluations were skipped because they didn't match the expected output patterns, so I relaxed the regex and ran it again, but performance was similar.
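
For reference, the changes for these runs looked roughly like the following (the do_sample and temperature keys are the only additions to the config above, and the relaxed patterns shown are just an example of the kind of loosening I tried, not the exact regexes):

  completion_kwargs:
    model_name: "timdettmers/guanaco-33b-merged"
    max_new_tokens: 50
    is_fast_tokenizer: False
    do_sample: True      # added for the sampling runs
    temperature: 0.7     # added for the sampling runs
  completion_parser_kwargs:
    outputs_to_match:
      # example of a relaxed pattern: case-insensitive, no longer anchored to the start of a line
      1: '(?i)output \(a\)'
      2: '(?i)output \(b\)'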

I'll try running the evaluation with the HuggingFace API to confirm if it's a problem with my local environment.

I will update again with the results!

@YannDubs
Collaborator

How many outputs were actually annotated?

@mathewhuen
Author

Annotation counts for the do_sample: False runs were good, but each of the four temperature: 0.7 runs had around 50 samples left unannotated:

do_sample  Seed  Annotated
False      0     648
False      1     647
False      2     647
False      3     647
True       0     602
True       1     594
True       2     596
True       3     601

@YannDubs
Collaborator

Yeah, that's definitely not ideal. I would stick to do_sample: False for Guanaco, which is what we do for our main evaluator (temperature 0 for GPT4).

That being said, I don't think you should be using Guanaco given that it's pretty old. I would instead consider newer open-source models; e.g., I think you should use at least a Llama 2 base and ideally Mixtral. If you do, please add a PR so that we can update the evaluators' table!

Let me know if you have other issues.

@mathewhuen
Author

> Yeah, that's definitely not ideal. I would stick to do_sample: False for Guanaco, which is what we do for our main evaluator (temperature 0 for GPT4).

Yes, it looks like do_sample: False will be more stable in terms of generating parsable outputs.

> That being said, I don't think you should be using Guanaco given that it's pretty old.

I definitely agree! I originally ran Guanaco since it was already on the evaluator leaderboard, and I wanted to make sure that my setup was reproducing results correctly. If I ever manage to find the problem, I'll leave an update.

> I think you should use at least a Llama 2 base and ideally Mixtral. If you do, please add a PR so that we can update the evaluators' table!

Exactly. I modified the basic_prompt.txt for Mixtral (mistralai/Mixtral-8x7B-Instruct-v0.1), which scored 64.86 on human agreement (better than chatgpt_fn, but slightly lower than claude). I'll double-check everything and try to submit a PR this weekend or next week.
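
For reference, the Mixtral evaluator config I'm testing has the same shape as the Guanaco one above; roughly something like this (the mixtral_evaluator name and the prompt file path are just placeholders for what I have locally, and the parsing regexes may still change before the PR):

mixtral_evaluator:                                          # placeholder name
  prompt_template: "mixtral_evaluator/basic_prompt.txt"     # placeholder path to the modified prompt
  fn_completions: "huggingface_local_completions"
  completion_kwargs:
    model_name: "mistralai/Mixtral-8x7B-Instruct-v0.1"
    max_new_tokens: 50
    do_sample: False
  completion_parser_kwargs:
    outputs_to_match:
      1: '(?:^|\n) ?Output \(a\)'
      2: '(?:^|\n) ?Output \(b\)'
  batch_size: 1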

Thanks again!
