# Auto Annotations


Note: AlpacaFarm now uses [AlpacaEval 1](https://github.com/tatsu-lab/alpaca_eval/tree/main) for annotations and evaluations, since `text-davinci-003` has been deprecated by OpenAI.

In [1]:
cd ..

/Users/yanndubois/Desktop/GitHub/alpaca_farm


## Setting up

All of our annotators currently use the OpenAI API, so the first step is to set up your OpenAI key (and potentially your organization). This can be done by setting an environment variable, for example:
- `export OPENAI_API_KEY="sk..."` 
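
If you would rather stay inside the notebook, the same variable can be set from Python before the annotator is created. The sketch below is only an illustration: `has_openai_key` is a hypothetical helper, not part of AlpacaFarm, and the `sk` prefix check simply mirrors the placeholder shown above.

```python
import os


def has_openai_key(env=None):
    """Return True if OPENAI_API_KEY looks set (hypothetical helper)."""
    env = os.environ if env is None else env
    key = env.get("OPENAI_API_KEY", "")
    # The "sk" prefix only mirrors the placeholder above; it is a heuristic.
    return key.startswith("sk")


# Equivalent to `export OPENAI_API_KEY=...` for the current process only:
# os.environ["OPENAI_API_KEY"] = "sk..."  # replace with your real key
if not has_openai_key():
    print("OPENAI_API_KEY does not look set; annotation calls are likely to fail.")
```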

The key object that you need for automatic annotations (for both training and evaluation) is the `PairwiseAutoAnnotator`, which is a thin wrapper around [AlpacaEval's `PairwiseAnnotator`](https://github.com/tatsu-lab/alpaca_eval/blob/main/src/alpaca_eval/annotators/pairwise_evaluator.py#L20) (the wrapper deals with separate instruction and input fields).

In [2]:
from alpaca_farm.utils import jload
from alpaca_farm.auto_annotations import PairwiseAutoAnnotator

annotator = PairwiseAutoAnnotator(annotators_config="alpaca_eval_gpt4")

INFO:root:Creating the annotator from `alpaca_eval_gpt4`.
INFO:root:Saving annotations to `/Users/yanndubois/Desktop/GitHub/alpaca_eval/src/alpaca_eval/evaluators_configs/alpaca_eval_gpt4/annotations_seed0_configs.json`.
INFO:root:Loading all annotations from /Users/yanndubois/Desktop/GitHub/alpaca_eval/src/alpaca_eval/evaluators_configs/alpaca_eval_gpt4/annotations_seed0_configs.json.


## Annotating paired outputs

Now let's annotate some pairwise preferences. All we need is some data. The annotator takes either a list of dictionaries or a pandas DataFrame. The keys of the dictionaries (or the columns of the DataFrame) must always contain `instruction` and `input`, which together define the instruction. The annotator also needs a pair of outputs to compare. If you have a sequence of such pairs under the keys `output_1` and `output_2`, you can use `annotate_pairs` directly.
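
Before sending data to a paid API, it can help to check that every example has the expected keys. The following is a minimal sketch of such a check. `missing_keys` and `REQUIRED_KEYS` are hypothetical names introduced here for illustration and are not part of AlpacaFarm.

```python
REQUIRED_KEYS = ("instruction", "input", "output_1", "output_2")


def missing_keys(examples, required=REQUIRED_KEYS):
    """Map example index -> list of required keys that are absent."""
    problems = {}
    for i, example in enumerate(examples):
        absent = [k for k in required if k not in example]
        if absent:
            problems[i] = absent
    return problems


examples = [
    {"instruction": "Say hi.", "input": "", "output_1": "Hi!", "output_2": "Hello."},
    {"instruction": "Say bye.", "output_1": "Bye!", "output_2": "Goodbye."},
]
print(missing_keys(examples))  # {1: ['input']}
```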

In [3]:
outputs_pairs = jload("examples/data/outputs_pairs.json")[:6]
print("Example of paired output:\n")
outputs_pairs[-1:]

Example of paired output:



[{'instruction': 'If you could help me write an email to my friends inviting them to dinner on Friday, it would be greatly appreciated.',
  'input': '',
  'output_1': "Dear Friends, \r\n\r\nI hope this message finds you well. I'm excited to invite you to dinner on Friday. We'll meet at 7:00 PM at [location]. I look forward to seeing you there. \r\n\r\nBest,\r\n[Name]",
  'output_2': "Hey everyone! \n\nI'm hosting a dinner party this Friday night and I'd love for all of you to come over. We'll have a delicious spread of food and some great conversations. \n\nLet me know if you can make it - I'd love to see you all there!\n\nCheers,\n[Your Name]"}]

In [4]:
annotated = annotator.annotate_pairs(outputs_pairs)

Annotation chunk:   0%|                                                       | 0/1 [00:00<?, ?it/s]INFO:root:Annotating 6 examples with alpaca_eval_gpt4
INFO:root:Using `openai_completions` on 6 prompts using gpt-4.
INFO:root:Kwargs to completion: {'model': 'gpt-4', 'is_chat': True, 'temperature': 0}. num_procs=5

prompt_batches:   0%|                                                         | 0/6 [00:00<?, ?it/s]
INFO:root:Using OAI client number 1 out of 1.
INFO:root:Using OAI client number 1 out of 1.
INFO:root:Using OAI client number 1 out of 1.
INFO:root:Using OAI client number 1 out of 1.
INFO:root:Using OAI client number 1 out of 1.
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO:root:Using OAI client number 1 out of 1.
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST h

In [5]:
annotated[-1:]

[{'instruction': 'If you could help me write an email to my friends inviting them to dinner on Friday, it would be greatly appreciated.',
  'input': '',
  'output_1': "Dear Friends, \r\n\r\nI hope this message finds you well. I'm excited to invite you to dinner on Friday. We'll meet at 7:00 PM at [location]. I look forward to seeing you there. \r\n\r\nBest,\r\n[Name]",
  'output_2': "Hey everyone! \n\nI'm hosting a dinner party this Friday night and I'd love for all of you to come over. We'll have a delicious spread of food and some great conversations. \n\nLet me know if you can make it - I'd love to see you all there!\n\nCheers,\n[Your Name]",
  'annotator': 'alpaca_eval_gpt4',
  'preference': 2.0,
  'price_per_example': 0.012989999999999998,
  'time_per_example': 0.8035085201263428,
  'raw_completion': "[\n    {'model': 'model_1', 'rank': 1},\n    {'model': 'model_2', 'rank': 2}\n]"}]

Here we see that the annotation adds, among others, two keys of interest:
- `'preference'`: the index of the preferred output, here `preference=2.0` so `output_2` is preferred. In the case where both outputs are the same we give `preference=0`
- `'annotator'`: the name of the simulated annotator as found in the `annotators_config`
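
As a small illustration of how the `preference` value can be consumed downstream, the sketch below returns the text of the preferred output. It assumes `preference` is always 0, 1, or 2 (possibly stored as a float), and `preferred_output` is a hypothetical helper, not an AlpacaFarm function.

```python
def preferred_output(example):
    """Return the preferred output text, or None for a tie (preference 0)."""
    preference = int(example["preference"])  # handles values such as 2.0
    if preference == 0:
        return None
    return example[f"output_{preference}"]


toy = {"output_1": "Hi!", "output_2": "Hello.", "preference": 2.0}
print(preferred_output(toy))  # Hello.
```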

`annotate_pairs` is the main function and should be used if you have paired outputs to annotate. In many use cases, however, you will have outputs in different formats. In the following we discuss two additional helper functions, `annotate_head2head` and `annotate_samples`, which are particularly well suited to the typical formats during evaluation and training respectively. Both functions call `annotate_pairs` under the hood after a reformatting step.

## Evaluation through pairwise comparison
For evaluation we need two components:
- outputs from the model we want to compare
- outputs on the same examples from the baseline model

Often these two components will be in separate lists of dictionaries (one list per model). In this case all dictionaries need to contain an `output` key. Let us load such data from our simulated RLHF model and the `text-davinci-003` baseline.

In [6]:
outputs_baseline = jload("examples/data/outputs_baseline.json")[:6]
print("Example of baseline output:\n")
outputs_baseline[-1:]

Example of baseline output:



[{'instruction': 'If you could help me write an email to my friends inviting them to dinner on Friday, it would be greatly appreciated.',
  'input': '',
  'output': "Dear Friends, \r\n\r\nI hope this message finds you well. I'm excited to invite you to dinner on Friday. We'll meet at 7:00 PM at [location]. I look forward to seeing you there. \r\n\r\nBest,\r\n[Name]"}]

In [7]:
outputs_rlhf = jload("examples/data/outputs_rlhf.json")[:6]
print("Example of rlhf output:\n")
outputs_rlhf[-1:]

Example of rlhf output:



[{'instruction': 'If you could help me write an email to my friends inviting them to dinner on Friday, it would be greatly appreciated.',
  'input': '',
  'output': 'Dear Friends, \n\nI am writing to invite you all to a dinner on Friday evening. It is a casual affair, and I am looking forward to a fun evening catching up with you all. I am planning to make a selection of delicious dishes, ranging from appetizers to mains and desserts. There will be something for everyone to enjoy, and I am sure it will be a night to remember.\n\nThe dinner will be held at my place on Friday, April 17th at 7pm. If you are interested in joining me, please RSVP to this email by Thursday, April 16th. I am looking forward to seeing you all there! \n\nThank you, \n\n[Name]'}]

The annotator's function of interest when we have two sequences of outputs is `annotate_head2head`.

In [8]:
annotated = annotator.annotate_head2head(outputs_1=outputs_baseline, outputs_2=outputs_rlhf)

Annotation chunk:   0%|                                                       | 0/1 [00:00<?, ?it/s]INFO:root:Annotating 6 examples with alpaca_eval_gpt4
INFO:root:Using `openai_completions` on 6 prompts using gpt-4.
INFO:root:Kwargs to completion: {'model': 'gpt-4', 'is_chat': True, 'temperature': 0}. num_procs=5

prompt_batches:   0%|                                                         | 0/6 [00:00<?, ?it/s]
INFO:root:Using OAI client number 1 out of 1.
INFO:root:Using OAI client number 1 out of 1.
INFO:root:Using OAI client number 1 out of 1.
INFO:root:Using OAI client number 1 out of 1.
INFO:root:Using OAI client number 1 out of 1.
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO:root:Using OAI client number 1 out of 1.
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST h

In [9]:
annotated[:1]

[{'instruction': 'The sentence you are given might be too wordy, complicated, or unclear. Rewrite the sentence and make your writing clearer by keeping it concise. Whenever possible, break complex sentences into multiple sentences and eliminate unnecessary words.\n\nIf you have any questions about my rate or if you find it necessary to increase or decrease the scope for this project, please let me know.',
  'input': 'If you have any questions about my rate or if you find it necessary to increase or decrease the scope for this project, please let me know.',
  'output_1': 'If you have questions about my rate or need to modify the scope of this project, please let me know.',
  'output_2': 'I am available to answer any questions you may have about my rate or if you need to alter the scope of this project. Feel free to contact me if you have any queries or require any additional information.',
  'annotator': 'alpaca_eval_gpt4',
  'preference': 1.0,
  'price_per_example': 0.01287,
  'time_pe

We see that the format of the output is the same as before. Here `preference` indicates that the simulator preferred `output_1`, which corresponds to the baseline model.
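
Conceptually, the reformatting step that turns two lists of single outputs into pairs can be pictured as a join on `(instruction, input)`. The sketch below illustrates that idea in plain Python. It is not AlpacaFarm's actual implementation, and `pair_outputs` is a hypothetical name.

```python
def pair_outputs(outputs_1, outputs_2):
    """Join two lists of single-output examples on (instruction, input)."""
    by_key = {(ex["instruction"], ex["input"]): ex["output"] for ex in outputs_2}
    pairs = []
    for ex in outputs_1:
        key = (ex["instruction"], ex["input"])
        if key in by_key:  # keep only examples present in both lists
            pairs.append({
                "instruction": ex["instruction"],
                "input": ex["input"],
                "output_1": ex["output"],
                "output_2": by_key[key],
            })
    return pairs


baseline = [{"instruction": "Say hi.", "input": "", "output": "Hi!"}]
model = [{"instruction": "Say hi.", "input": "", "output": "Hello there."}]
print(pair_outputs(baseline, model))
```

Here `output_1` always comes from the first list, which matches the convention used above where the baseline is passed as `outputs_1`.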

## Training
For training, we typically have multiple outputs for each instruction, all sampled from the same (SFT) model, as in the following data:

In [10]:
outputs_samples = jload("examples/data/multisamples_sft.json")[:1]
print("Example of different sampled outputs from SFT:\n")
outputs_samples

Example of different sampled outputs from SFT:



[{'instruction': 'Describe a time when you had to make a difficult decision.',
  'input': '',
  'output': ["I had to make a difficult decision a few years ago when I was offered a job that would require me to move to a different city. I had to weigh the pros and cons of this job offer, and consider what it would mean for me to leave my friends and family in order to take it. In the end, I decided to turn down the job offer since I wasn't ready to quit my current job, move to a different city, and leave my loved ones behind.",
   'I had to make a difficult decision last year when I was faced with a life-changing opportunity. I had to decide whether or not to leave my current job and move to a different city for the chance to further my career. After much consideration, I chose to take the risk and make the move. Thankfully, it paid off and I am now enjoying my new job and exploring the city.',
   "I had to make a difficult decision when I was offered a job in a city that was significant

In this case, you can use the annotator's `annotate_samples`, which first samples pairs of outputs that have the same instruction/input and then annotates those.

In [11]:
annotated = annotator.annotate_samples(outputs_samples)

INFO:root:Filtered unique instruction/input pairs: 4 -> 1
Annotation chunk:   0%|                                                       | 0/1 [00:00<?, ?it/s]INFO:root:Annotating 1 examples with alpaca_eval_gpt4
INFO:root:Using `openai_completions` on 1 prompts using gpt-4.
INFO:root:Kwargs to completion: {'model': 'gpt-4', 'is_chat': True, 'temperature': 0}. num_procs=5

prompt_batches:   0%|                                                         | 0/1 [00:00<?, ?it/s]
INFO:root:Using OAI client number 1 out of 1.
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"

prompt_batches: 100%|█████████████████████████████████████████████████| 1/1 [00:02<00:00,  2.78s/it]
INFO:root:Completed 1 examples in 3.0 seconds.
INFO:root:Saving all annotations to /Users/yanndubois/Desktop/GitHub/alpaca_eval/src/alpaca_eval/evaluators_configs/alpaca_eval_gpt4/annotations_seed0_configs.json.
INFO:root:Loading all annotations from /Users/yanndubois/Desktop/Git

In [12]:
annotated

[{'sample_id': 0,
  'instruction': 'Describe a time when you had to make a difficult decision.',
  'input': '',
  'output_1': "I had to make a difficult decision a few years ago when I was offered a job that would require me to move to a different city. I had to weigh the pros and cons of this job offer, and consider what it would mean for me to leave my friends and family in order to take it. In the end, I decided to turn down the job offer since I wasn't ready to quit my current job, move to a different city, and leave my loved ones behind.",
  'output_2': "I had to make a difficult decision when I was offered a job in a city that was significantly farther away from my friends and family. I weighed the pros and cons and consulted with people I trusted, but in the end I had to make the choice that was best for my future and my well-being. It was a difficult decision, but I'm glad I made the choice that allowed me to further my career and start my journey towards independence.",
  'ann

By default there will be only one pair per (instruction, input). If you pass `is_unique_instructions=False`, you will get as many pairs as outputs.
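
To make the default behaviour concrete, here is a minimal sketch of drawing one pair from an example whose `output` field is a list of samples. It only illustrates the idea: the sampling strategy, the `seed` argument, and the name `sample_pair` are assumptions, not AlpacaFarm's implementation.

```python
import random


def sample_pair(example, seed=0):
    """Draw two distinct outputs from one multi-output example."""
    rng = random.Random(seed)
    # `sample` picks two different positions, so the pair never reuses a sample.
    output_1, output_2 = rng.sample(example["output"], k=2)
    return {
        "instruction": example["instruction"],
        "input": example["input"],
        "output_1": output_1,
        "output_2": output_2,
    }


multi = {"instruction": "Say hi.", "input": "", "output": ["Hi!", "Hello.", "Hey.", "Yo."]}
print(sample_pair(multi))
```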

## Going further

### Adding noise

During training we typically flip the label with probability 0.25 to emulate the variability of human annotations. To do so you can either:
- initialize the annotator with `PairwiseAutoAnnotator(p_label_flip=0.25)`
- set the noise of an initialized annotator with `annotator.set_noise(p_label_flip=0.25)`
- pass the noise to `annotate_samples` with `annotate_samples(..., p_label_flip=0.25)`
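
The effect of label flipping can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the library's code: `flip_labels` is a hypothetical function, and leaving ties (`preference=0`) untouched is a choice made here for simplicity.

```python
import random


def flip_labels(preferences, p_label_flip=0.25, seed=0):
    """Swap 1 <-> 2 with probability p_label_flip; ties (0) are left as-is."""
    rng = random.Random(seed)
    flipped = []
    for pref in preferences:
        if pref in (1, 2) and rng.random() < p_label_flip:
            pref = 3 - pref  # 1 becomes 2, 2 becomes 1
        flipped.append(pref)
    return flipped


print(flip_labels([1, 2, 0], p_label_flip=1.0))  # [2, 1, 0]
```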

In [13]:
annotated = annotator.annotate_samples(outputs_samples, p_label_flip=0.25)

INFO:root:Filtered unique instruction/input pairs: 4 -> 1
Annotation chunk:   0%|                                                       | 0/1 [00:00<?, ?it/s]INFO:root:Adding random noise to the labels p_label_flip=0.25.
INFO:root:Annotating 0 examples with alpaca_eval_gpt4
INFO:root:Saving all annotations to /Users/yanndubois/Desktop/GitHub/alpaca_eval/src/alpaca_eval/evaluators_configs/alpaca_eval_gpt4/annotations_seed0_configs.json.
INFO:root:Loading all annotations from /Users/yanndubois/Desktop/GitHub/alpaca_eval/src/alpaca_eval/evaluators_configs/alpaca_eval_gpt4/annotations_seed0_configs.json.
Annotation chunk: 100%|███████████████████████████████████████████████| 1/1 [00:01<00:00,  1.46s/it]


In [14]:
annotated

[{'sample_id': 0,
  'instruction': 'Describe a time when you had to make a difficult decision.',
  'input': '',
  'output_1': "I had to make a difficult decision a few years ago when I was offered a job that would require me to move to a different city. I had to weigh the pros and cons of this job offer, and consider what it would mean for me to leave my friends and family in order to take it. In the end, I decided to turn down the job offer since I wasn't ready to quit my current job, move to a different city, and leave my loved ones behind.",
  'output_2': "I had to make a difficult decision when I was offered a job in a city that was significantly farther away from my friends and family. I weighed the pros and cons and consulted with people I trusted, but in the end I had to make the choice that was best for my future and my well-being. It was a difficult decision, but I'm glad I made the choice that allowed me to further my career and start my journey towards independence.",
  'ann

### Cost & time efficiency: avoiding duplicate annotation
You will often find yourself reannotating examples that you have already annotated. This is particularly true if you sample many times from the same model (e.g. to get pairwise preferences for training or to evaluate best-of-n), or if the instructions you are considering require short outputs, in which case many models might give exactly the same answer.

This means that you would spend unnecessary money and time. Thankfully, `PairwiseAnnotator` stores previous annotations and reuses them. For example, let us reannotate the previous evaluation:

In [15]:
annotated = annotator.annotate_head2head(outputs_1=outputs_baseline, outputs_2=outputs_rlhf)

Annotation chunk:   0%|                                                       | 0/1 [00:00<?, ?it/s]INFO:root:Annotating 0 examples with alpaca_eval_gpt4
INFO:root:Saving all annotations to /Users/yanndubois/Desktop/GitHub/alpaca_eval/src/alpaca_eval/evaluators_configs/alpaca_eval_gpt4/annotations_seed0_configs.json.
INFO:root:Loading all annotations from /Users/yanndubois/Desktop/GitHub/alpaca_eval/src/alpaca_eval/evaluators_configs/alpaca_eval_gpt4/annotations_seed0_configs.json.
Annotation chunk: 100%|███████████████████████████████████████████████| 1/1 [00:01<00:00,  1.45s/it]


In [16]:
annotated[-1:]

[{'instruction': 'If you could help me write an email to my friends inviting them to dinner on Friday, it would be greatly appreciated.',
  'input': '',
  'output_1': "Dear Friends, \r\n\r\nI hope this message finds you well. I'm excited to invite you to dinner on Friday. We'll meet at 7:00 PM at [location]. I look forward to seeing you there. \r\n\r\nBest,\r\n[Name]",
  'output_2': 'Dear Friends, \n\nI am writing to invite you all to a dinner on Friday evening. It is a casual affair, and I am looking forward to a fun evening catching up with you all. I am planning to make a selection of delicious dishes, ranging from appetizers to mains and desserts. There will be something for everyone to enjoy, and I am sure it will be a night to remember.\n\nThe dinner will be held at my place on Friday, April 17th at 7pm. If you are interested in joining me, please RSVP to this email by Thursday, April 16th. I am looking forward to seeing you all there! \n\nThank you, \n\n[Name]',
  'annotator': '

We now get the annotations without actually having to reannotate any example.
By default, the annotations are saved on disk in the directory that contains the annotators' configs. You can disable this caching by passing `caching_path=None` to `PairwiseAutoAnnotator`.
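
The idea behind this reuse can be sketched with an in-memory dictionary keyed by the full content of each pair. The real annotator persists its annotations to a JSON file, as the logs above show. The key layout and the names `annotate_with_cache` and `fake_annotate` below are assumptions made for illustration.

```python
def annotate_with_cache(examples, annotate_fn, cache):
    """Annotate only unseen pairs; `annotate_fn` and `cache` are hypothetical."""
    results = []
    for ex in examples:
        key = (ex["instruction"], ex["input"], ex["output_1"], ex["output_2"])
        if key not in cache:
            cache[key] = annotate_fn(ex)  # the only place a paid call would happen
        results.append({**ex, "preference": cache[key]})
    return results


calls = []


def fake_annotate(example):
    """Stand-in for an API call; records that it was invoked."""
    calls.append(example["instruction"])
    return 1


cache = {}
pairs = [{"instruction": "Say hi.", "input": "", "output_1": "Hi!", "output_2": "Hello."}]
annotate_with_cache(pairs, fake_annotate, cache)
annotate_with_cache(pairs, fake_annotate, cache)  # served from the cache
print(len(calls))  # 1
```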


### Evaluating win rates 

Let's show how to get win rates.

In [17]:
from alpaca_eval.metrics import pairwise_to_winrate

In [18]:
# all the following is selfinstruct eval
outputs_baseline = jload("examples/data/outputs_baseline.json")
outputs_rlhf = jload("examples/data/outputs_rlhf.json")
outputs_sft = jload("examples/data/outputs_sft.json")

In [19]:
annotator = PairwiseAutoAnnotator()

INFO:root:Creating the annotator from `alpaca_eval_gpt4`.
INFO:root:Saving annotations to `/Users/yanndubois/Desktop/GitHub/alpaca_eval/src/alpaca_eval/evaluators_configs/alpaca_eval_gpt4/annotations_seed0_configs.json`.
INFO:root:Loading all annotations from /Users/yanndubois/Desktop/GitHub/alpaca_eval/src/alpaca_eval/evaluators_configs/alpaca_eval_gpt4/annotations_seed0_configs.json.


In [25]:
annotated_sft = annotator.annotate_head2head(outputs_1=outputs_baseline, outputs_2=outputs_sft)
pairwise_to_winrate(preferences=[a["preference"] for a in annotated_sft])

Annotation chunk:   0%|                                                       | 0/2 [00:00<?, ?it/s]INFO:root:Annotating 0 examples with alpaca_eval_gpt4
INFO:root:Saving all annotations to /Users/yanndubois/Desktop/GitHub/alpaca_eval/src/alpaca_eval/evaluators_configs/alpaca_eval_gpt4/annotations_seed0_configs.json.
INFO:root:Loading all annotations from /Users/yanndubois/Desktop/GitHub/alpaca_eval/src/alpaca_eval/evaluators_configs/alpaca_eval_gpt4/annotations_seed0_configs.json.
Annotation chunk:  50%|███████████████████████▌                       | 1/2 [00:01<00:01,  1.51s/it]INFO:root:Annotating 0 examples with alpaca_eval_gpt4
INFO:root:Saving all annotations to /Users/yanndubois/Desktop/GitHub/alpaca_eval/src/alpaca_eval/evaluators_configs/alpaca_eval_gpt4/annotations_seed0_configs.json.
INFO:root:Loading all annotations from /Users/yanndubois/Desktop/GitHub/alpaca_eval/src/alpaca_eval/evaluators_configs/alpaca_eval_gpt4/annotations_seed0_configs.json.
Annotation chunk: 100%|███

{'win_rate': 34.0,
 'standard_error': 2.9343554261657316,
 'n_wins': 80,
 'n_wins_base': 160,
 'n_draws': 10,
 'n_total': 250,
 'discrete_win_rate': 34.0}

In [24]:
annotated_rlhf = annotator.annotate_head2head(outputs_1=outputs_baseline, outputs_2=outputs_rlhf)
pairwise_to_winrate(preferences=[a["preference"] for a in annotated_rlhf])

Annotation chunk:   0%|                                                       | 0/2 [00:00<?, ?it/s]INFO:root:Annotating 0 examples with alpaca_eval_gpt4
INFO:root:Saving all annotations to /Users/yanndubois/Desktop/GitHub/alpaca_eval/src/alpaca_eval/evaluators_configs/alpaca_eval_gpt4/annotations_seed0_configs.json.
INFO:root:Loading all annotations from /Users/yanndubois/Desktop/GitHub/alpaca_eval/src/alpaca_eval/evaluators_configs/alpaca_eval_gpt4/annotations_seed0_configs.json.
Annotation chunk:  50%|███████████████████████▌                       | 1/2 [00:01<00:01,  1.70s/it]INFO:root:Annotating 0 examples with alpaca_eval_gpt4
INFO:root:Saving all annotations to /Users/yanndubois/Desktop/GitHub/alpaca_eval/src/alpaca_eval/evaluators_configs/alpaca_eval_gpt4/annotations_seed0_configs.json.
INFO:root:Loading all annotations from /Users/yanndubois/Desktop/GitHub/alpaca_eval/src/alpaca_eval/evaluators_configs/alpaca_eval_gpt4/annotations_seed0_configs.json.
Annotation chunk: 100%|███

{'win_rate': 53.98406374501992,
 'standard_error': 3.145897059163296,
 'n_wins': 135,
 'n_wins_base': 115,
 'n_draws': 1,
 'n_total': 251,
 'discrete_win_rate': 53.98406374501992}
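
The numbers above are consistent with counting each win of the evaluated model as 1, each baseline win as 0, and each draw as 0.5, then reporting the mean and its standard error in percent. The sketch below reproduces that arithmetic. It is not the source of `pairwise_to_winrate`, the name `win_rate_summary` is hypothetical, and the encoding (2 = model win, 1 = baseline win, 0 = draw) is an assumption based on the description of `preference` earlier in this notebook.

```python
import math
import statistics


def win_rate_summary(preferences):
    """Summarize preferences where 2 = model win, 1 = baseline win, 0 = draw."""
    scores = {2: 1.0, 1: 0.0, 0: 0.5}
    values = [scores[int(p)] for p in preferences]
    n_total = len(values)
    return {
        "win_rate": 100 * statistics.fmean(values),
        # Sample standard deviation (n - 1 denominator) divided by sqrt(n).
        "standard_error": 100 * statistics.stdev(values) / math.sqrt(n_total),
        "n_wins": values.count(1.0),
        "n_wins_base": values.count(0.0),
        "n_draws": values.count(0.5),
        "n_total": n_total,
    }


# Same counts as the SFT comparison above: 80 wins, 160 baseline wins, 10 draws.
summary = win_rate_summary([2] * 80 + [1] * 160 + [0] * 10)
print(summary)
```

With these counts the win rate is (80 + 10 / 2) / 250 = 34%, matching the first result shown above.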

### Configuring annotators

The most important argument to `PairwiseAnnotator` is `annotators_config`, which defines the pool of annotators (the API provider, the model, the prompt, and the decoding parameters). We provide many options out of the box.

Here's the description of `annotators_config` from the docstring:
```
A dictionary or path to a yaml file containing the configuration for the pool of annotators. The keys in the first dictionary should be the annotator's name, and the value should be a dictionary of the annotator's configuration which should have the following keys:

- prompt_templates (dict): a dictionary of prompt templates or path to the prompts. The keys should be "without_inputs" and "with_inputs". Each template should contain placeholders for keys in the example dictionary, typically {instruction} and {output_1} {output_2}.
- fn_decoder (str): function in `alpaca_farm.auto_annotations.pairwise_annotators.decoders.py` for completions.
- decoder_kwargs (dict): kwargs for fn_decode. E.g. model_name, max_tokens, temperature, tokens_to_avoid, tokens_to_favor
- outputs_to_match (dict): Kwargs for fn_completion_parser. With the default regex parser it needs `outputs_to_match` which is a dictionary of outputs to match from the completions. The values should be a regex pattern that should be matched, the keys should be the corresponding preference value. For example {1: 'Output \(a\)'} will match the output "Output (a)" and set the preference to 1.
- other kwargs to `SinglePairwiseAutoAnnotator` such as batch_size
```
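
To illustrate the `outputs_to_match` entry, here is a minimal sketch of a regex-based parser that maps a completion to a preference. The pattern for preference 1 comes from the docstring above. The pattern for preference 2 and the function name `parse_preference` are assumptions added for the example, and the library's own parser may behave differently.

```python
import re


def parse_preference(completion, outputs_to_match):
    """Return the first preference whose pattern matches, else None."""
    for preference, pattern in outputs_to_match.items():
        if re.search(pattern, completion):
            return preference
    return None


# Keys are preference values, values are regex patterns to look for.
outputs_to_match = {1: r"Output \(a\)", 2: r"Output \(b\)"}
print(parse_preference("Output (a)", outputs_to_match))  # 1
```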

And here is the config `"annotators/test/configs.yaml"` of the annotator we used above.

For more information, please refer to [AlpacaEval](https://github.com/tatsu-lab/alpaca_eval/tree/main). In our code, we directly use AlpacaEval, with a minor modification to separate instructions with and without inputs.