Trouble re-creating the guanaco_33b evaluator baseline #220
mathewhuen:

Hi, I am unfortunately restricted to using offline evaluators. I tried re-running the guanaco_33b evaluator baseline using […] after copying the guanaco_33b evaluator to a new guanaco_baseline directory and updating the fn_completions and is_fast_tokenizer fields in the configs.yaml file to be […], but the human agreement was much lower than published (59.02 vs. 62.75). Can you give me any feedback on re-creating this evaluation?

I am using […] and a version of alpaca_eval from just before the alpaca-eval 2 update (495b606).

I should note that I had to slightly modify the code to get it to work with optimum: since BetterTransformer has been implemented natively in Transformers for the LLaMA architecture, optimum was raising a ValueError, which makes me suspect a version issue. By chance, does someone have a requirements.txt file with exact versions?

Thank you for your help!
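For concreteness, the configuration change described above might look like the sketch below. This is a hypothetical configs.yaml: only fn_completions and is_fast_tokenizer come from the report itself; the decoder name huggingface_local_completions, the checkpoint, and every other key and value are assumptions that may differ between alpaca_eval versions.

```yaml
# Hypothetical guanaco_baseline/configs.yaml -- a sketch, not the file from the repo.
guanaco_baseline:
  prompt_template: "guanaco_baseline/basic_prompt.txt"  # assumed path
  fn_completions: "huggingface_local_completions"       # local decoder instead of the hosted API
  completions_kwargs:
    model_name: "timdettmers/guanaco-33b-merged"        # assumed checkpoint
    max_new_tokens: 100                                 # assumed value
    temperature: 0.7                                    # assumed; the temperature question comes up below
    is_fast_tokenizer: false                            # the other field changed in the report
  batch_size: 1
```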
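The optimum modification itself is not shown in the issue; a minimal sketch of one way to handle the error described, assuming the ValueError comes from optimum's BetterTransformer conversion on a Transformers version that already supports LLaMA natively, could be:

```python
# Sketch: tolerate Transformers versions where LLaMA no longer needs
# optimum's BetterTransformer conversion. The checkpoint is an assumption.
from optimum.bettertransformer import BetterTransformer
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("timdettmers/guanaco-33b-merged")

try:
    # Older Transformers versions rely on optimum for the conversion.
    model = BetterTransformer.transform(model)
except ValueError:
    # Newer Transformers versions implement the optimization natively for
    # LLaMA (via scaled_dot_product_attention) and raise here, so the
    # unconverted model can be used as-is.
    pass
```

Pinning a matching transformers/optimum pair in a requirements.txt, as asked about above, would avoid the need for this branch entirely.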
YannDubs:

Hi @mathewhuen! The Guanaco results were computed with the Hugging Face API a long time ago, so it will be hard to recreate them exactly, unfortunately. That being said, 3pp seems like a big drop! Are you using […]?
mathewhuen:

Thank you for the quick reply @YannDubs! It looks like I was not using any temperature setting, so I ran it again with […]. I'll try running the evaluation with the Hugging Face API to confirm whether it's a problem with my local environment, and I will update again with the results!
YannDubs:

How many outputs were actually annotated?
mathewhuen:

Annotations for the […]
YannDubs:

Yeah, that's definitely not ideal. I would stick to […].

That being said, I don't think you should be using Guanaco, given that it's pretty old. I would instead consider newer open-source models: you should use at least a Llama 2 base, and ideally Mixtral. If you do, please add a PR so that we can update the evaluators' table! Let me know if you have other issues.
mathewhuen:

Yes, it looks like […].

I definitely agree! I originally ran Guanaco since it was already on the evaluator leaderboard, and I wanted to make sure that my setup was reproducing results correctly. If I ever manage to find the problem, I'll leave an update.

Exactly: I modified the basic_prompt.txt for Mixtral (mistralai/Mixtral-8x7B-Instruct-v0.1), which scored 64.86 on human agreement (so better than chatgpt_fn, but slightly lower than claude). I'll double-check everything and try to submit a PR this weekend or next week. Thanks again!
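For anyone reproducing agreement numbers like these: assuming the alpaca_eval CLI of that era, an evaluator's human agreement is computed with something like the following (the config path is hypothetical).

```bash
# Sketch: score a custom evaluator config against the human cross-annotations.
alpaca_eval analyze_evaluators --annotators_config path/to/mixtral_evaluator/configs.yaml
```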