Quick evals with no judge. It works like this:
```python
# get some preference data
data = [{
    "prompt": "The story began with a cat:",
    "chosen": "The cat jumped.",
    "rejected": "The dog ran.",
}]

# we look at the model's probability for each token of each completion
chosen_tokens = ["The", " cat", " jumped"]
chosen_probs = [0.9, 0.8, 0.9]
rejected_tokens = ["The", " dog", " ran"]
rejected_probs = [0.9, 0.2, 0.1]

# at the first token where the completions diverge, we check which one is more probable
chosen_probs[1] > rejected_probs[1]  # True: the model prefers the chosen completion
```
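Under the hood this just needs per-token probabilities from a causal language model. Below is a minimal sketch of how such probabilities can be extracted with Hugging Face `transformers`; it is an illustration, not this repository's internal code, and it assumes the prompt's tokenization is a prefix of the prompt-plus-completion tokenization.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def completion_token_probs(prompt: str, completion: str) -> torch.Tensor:
    """Probability the model assigns to each completion token, given the prompt."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + completion, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits  # [1, seq_len, vocab]
    # logits at position i predict the token at position i + 1
    next_token_probs = logits[:, :-1].softmax(-1)
    targets = full_ids[:, 1:]
    token_probs = next_token_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    # keep only the probabilities of the completion tokens
    n_prompt = prompt_ids.shape[1]
    return token_probs[0, n_prompt - 1:]

chosen_probs = completion_token_probs("The story began with a cat:", " The cat jumped.")
rejected_probs = completion_token_probs("The story began with a cat:", " The dog ran.")
```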
This approach has the advantages of:
- Re-using existing preference datasets
- Not needing a judge
- Giving highly informative results (we can get uncalibrated probabilities for each token, as well as accuracy)
- Being hackable, by reusing popular Hugging Face libraries like `transformers`
Install from GitHub and run an evaluation:

```sh
pip install git+https://github.com/wassname/open_pref_eval.git
```

```python
from open_pref_eval import evaluate, load_model

results = evaluate(model_name="gpt2", datasets=["unalignment/toxic-dpo-v0.2"])
print(results)
```
Output:
| dataset | correct | prob | n | model |
|---|---|---|---|---|
| help_steer2-dpo | 0.39 | 0.486 | 200 | gpt2 |
| toxic-dpo-v0.2 | 1 | 0.715 | 200 | gpt2 |
| truthful_qa_binary | 0.52 | 0.505 | 200 | gpt2 |
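For example, to compare several models on the same datasets, you can call `evaluate` once per model. The second model name below is purely illustrative; any Hugging Face model ID should work the same way as the `gpt2` example above.

```python
from open_pref_eval import evaluate

datasets = ["unalignment/toxic-dpo-v0.2"]
for model_name in ["gpt2", "gpt2-medium"]:  # illustrative Hugging Face model IDs
    results = evaluate(model_name=model_name, datasets=datasets)
    print(model_name)
    print(results)
```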
See ./examples/evaluate_gpt2.ipynb for more examples.
TODO:
- example notebook
- test
- make radar chart
- add more datasets (math, ethics, etc)
- push MMLU and ethics to Hugging Face and commit gen notebooks
- improve radar plot
- look at the best way to extract probs (IPO, DPO, first diverging token, Brier score, calibrated probs, etc.)
- change to first diverging token
- add option to calibrate
- GENIES generalisation datasets
Q: Why use the weighted prob over all the tokens?
A: It gives the best accuracy contrast, and it best shows things like generalisation.
It also provides scores that correspond to the accuracy. See this notebook for more details. In the following plot of Llama-7b on a simple sentiment-prediction task, we expect high accuracy, and indeed we see it for "1st_diverg", which is the method used in this repository.
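For intuition, here is a rough sketch of two of the scoring choices discussed above, reusing the toy probabilities from the top of this README: a length-normalised score over all completion tokens versus a comparison at the first diverging token. This is only an illustration of the ideas, not the repository's implementation.

```python
import math

# toy per-token probabilities from the example at the top of the README
chosen_tokens = ["The", " cat", " jumped"]
chosen_probs = [0.9, 0.8, 0.9]
rejected_tokens = ["The", " dog", " ran"]
rejected_probs = [0.9, 0.2, 0.1]

def mean_logprob(probs):
    """Length-normalised log-probability: one way to weight over all tokens."""
    return sum(math.log(p) for p in probs) / len(probs)

def first_diverging_index(a_tokens, b_tokens):
    """Index of the first token at which the two completions differ."""
    for i, (a, b) in enumerate(zip(a_tokens, b_tokens)):
        if a != b:
            return i
    return min(len(a_tokens), len(b_tokens))

# all-token score: does the model give "chosen" a higher average log-prob?
print(mean_logprob(chosen_probs) > mean_logprob(rejected_probs))  # True

# first-diverging-token score: compare probabilities only where the completions diverge
i = first_diverging_index(chosen_tokens, rejected_tokens)
print(chosen_probs[i] > rejected_probs[i])  # True
```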
TODO: add follow-up experiments using the GENIES adapters I made
Q: Why preference datasets?
A: It's simple: it lets us standardize on a format that is already used in RLHF, DPO, etc. It does restrict the data, but that restriction lets us keep the evaluation simple.
Q: Why use this library?
A: I've found other evaluations to be slow and hard to modify. As a result, people hardly use them. This is an attempt to make measurement fast, hackable, and simple. If we can all measure more, we will all learn more.