Trouble re-creating the guanaco_33b evaluator baseline #220

Closed
mathewhuen opened this issue Jan 29, 2024 · 6 comments


@mathewhuen

Hi, I am unfortunately restricted to using offline evaluators. I tried re-running the guanaco_33b evaluator baseline using

alpaca_eval analyze_evaluators --annotators_config=guanaco_baseline

after copying the guanaco_33b evaluator to a new guanaco_baseline directory and updating the fn_completions and is_fast_tokenizer fields in the configs.yaml file to be

guanaco_baseline:
  prompt_template: "guanaco_baseline/basic_prompt.txt"
  fn_completions: "huggingface_local_completions"
  completion_kwargs:
    model_name: "timdettmers/guanaco-33b-merged"
    max_new_tokens: 50
    is_fast_tokenizer: False
  completion_parser_kwargs:
    outputs_to_match:
      1: '(?:^|\n) ?Output \(a\)'
      2: '(?:^|\n) ?Output \(b\)'
  batch_size: 1

but the human agreement was much lower than the published figure (59.02 vs. 62.75). Can you give me any feedback on re-creating this evaluation?

I am using

accelerate==0.26.1
bitsandbytes==0.42.0
datasets==2.16.1
pandas==2.2.0
openai==1.9.0
optimum==1.16.2
scipy==1.12.0
tiktoken==0.5.2
torch==2.1.2
transformers==4.36.2
triton==2.1.0

and a version of alpaca_eval from just before the alpaca-eval 2 update (495b606).

I should note that I had to slightly modify the code to get it to work with optimum. BetterTransformer support for the LLaMA architecture is now implemented natively in transformers, so optimum was raising a ValueError, which makes me suspect a version issue. By chance, does anyone have a requirements.txt file with exact versions?

Thank you for your help!

@YannDubs
Collaborator

Hi @mathewhuen! The Guanaco results were computed with the Hugging Face API a long time ago, so they will unfortunately be hard to recreate. That being said, 3pp seems like a big drop! Are you using temperature: 0.7?

@mathewhuen
Author

Thank you for the quick reply @YannDubs! It looks like I was not using any temperature setting, so I ran it again with do_sample: True and temperature: 0.7, and the human agreement dropped to 53.3, which seems extreme. I noticed that more evaluations were skipped because they didn't match the expected output patterns, so I relaxed the regex and ran it again, but performance was similar.
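
For reference, the changes for these runs looked roughly like the following (the do_sample and temperature keys are the only additions to the config above, and the relaxed patterns shown are just an example of the kind of loosening I tried, not the exact regexes):

  completion_kwargs:
    model_name: "timdettmers/guanaco-33b-merged"
    max_new_tokens: 50
    is_fast_tokenizer: False
    do_sample: True      # added for the sampling runs
    temperature: 0.7     # added for the sampling runs
  completion_parser_kwargs:
    outputs_to_match:
      # example of a relaxed pattern: case-insensitive, no longer anchored to the start of a line
      1: '(?i)output \(a\)'
      2: '(?i)output \(b\)'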

I'll try running the evaluation with the HuggingFace API to confirm if it's a problem with my local environment.

I will update again with the results!

@YannDubs
Collaborator

How many outputs were actually annotated?

@mathewhuen
Author

Annotation counts for the do_sample: False runs were good, but each of the four temperature: 0.7 runs had around 50 samples left unannotated:

do_sample  Seed  Annotated
False      0     648
False      1     647
False      2     647
False      3     647
True       0     602
True       1     594
True       2     596
True       3     601

@YannDubs
Collaborator

Yeah, that's definitely not ideal. I would stick to do_sample: False for Guanaco, which is what we do for our main evaluator (temperature 0 for GPT4).

That being said, I don't think you should be using Guanaco given that it's pretty old. I would instead consider newer open-source models; e.g., I think you should use at least a Llama 2 base and ideally Mixtral. If you do, please add a PR so that we can update the evaluators' table!

Let me know if you have other issues.

@mathewhuen
Author

> Yeah, that's definitely not ideal. I would stick to do_sample: False for Guanaco, which is what we do for our main evaluator (temperature 0 for GPT4).

Yes, it looks like do_sample: False will be more stable in terms of generating parsable outputs.

> That being said, I don't think you should be using Guanaco given that it's pretty old.

I definitely agree! I originally ran Guanaco since it was already on the evaluator leaderboard, and I wanted to make sure that my setup was reproducing results correctly. If I ever manage to find the problem, I'll leave an update.

> I think you should use at least a Llama 2 base and ideally Mixtral. If you do, please add a PR so that we can update the evaluators' table!

Exactly. I modified the basic_prompt.txt for Mixtral (mistralai/Mixtral-8x7B-Instruct-v0.1), which scored 64.86 on human agreement (better than chatgpt_fn, but slightly lower than claude). I'll double-check everything and try to submit a PR this weekend or next week.
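
For reference, the Mixtral evaluator config I'm testing has the same shape as the Guanaco one above; roughly something like this (the mixtral_evaluator name and the prompt file path are just placeholders for what I have locally, and the parsing regexes may still change before the PR):

mixtral_evaluator:                                          # placeholder name
  prompt_template: "mixtral_evaluator/basic_prompt.txt"     # placeholder path to the modified prompt
  fn_completions: "huggingface_local_completions"
  completion_kwargs:
    model_name: "mistralai/Mixtral-8x7B-Instruct-v0.1"
    max_new_tokens: 50
    do_sample: False
  completion_parser_kwargs:
    outputs_to_match:
      1: '(?:^|\n) ?Output \(a\)'
      2: '(?:^|\n) ?Output \(b\)'
  batch_size: 1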

Thanks again!
