## Generating SFT and Reward Model Datasets

Here we walk through how to generate all intermediate datasets used to train the LC SFT, LC RL, Factuality SFT, and Factuality RL methods in the paper.
We provide cached SFT and reward model datasets for all methods and baselines at <https://huggingface.co/datasets/tatsu-lab/linguistic_calibration>. 
However, if you want to generate the datasets yourself (for example, to use a more sophisticated Summarize() function during summary distillation, or to replicate the pipeline with a stronger base model), you can follow the steps below.

### Note on Converting QA Pairs to Decision Tasks
As described in the Methods section of the paper (Section 3), we have pre-converted all questions $x$ from off-the-shelf QA datasets into open-ended queries $q$, which prompt the model for open-ended paragraph generations.
If you want to convert another QA dataset, use the prompt at `linguistic_calibration/prompts/generating_open_ended_queries/generate_open_ended_query_claude_10shot.txt`.

### Generating LC SFT Training Data

Our LC SFT model is finetuned using the summary distillation algorithm. Specifically, we use the following procedure:
1. For each example in the SFT dataset, sample M long-form paragraph generations from a base model. In our experiments, we use a Llama 2 7B 8-Shot ICL base model and M=8. 
2. For each example, summarize the M long-form paragraph generations into a single consensus paragraph using an API-based LLM (here, Claude 2.0).
3. Construct a dataset of (query, summary paragraph) pairs and finetune the base model (Llama 2 7B Base) on them. 

This pipeline can also generate the Factuality SFT training dataset, by setting the model and paragraph generation prompt type in step 1 as listed below and by skipping step 2.
Specifically, in step 1, Factuality SFT uses:
- `model_name="llama-2-7b-hf"`
- `paragraph_generation_prompt_type="generate_paragraphs_llama_trivia_qa_icl_8shot"`

These are the same settings as summary distillation, except that we only need a single paragraph generation per example, which is specified in the `decoding_args` object.
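
The three steps above can be sketched as a small driver function. This is a minimal illustration rather than the repository's implementation: `summary_distillation`, `toy_sample`, and `toy_summarize` are hypothetical names, and the toy callables stand in for the Llama 2 sampler and the Claude summarizer.

```python
from typing import Callable, List, Tuple


def summary_distillation(
    queries: List[str],
    sample_fn: Callable[[str, int], List[str]],
    summarize_fn: Callable[[List[str]], str],
    num_samples: int = 8,
) -> List[Tuple[str, str]]:
    """Return (query, target paragraph) pairs for SFT."""
    pairs = []
    for query in queries:
        # Step 1: sample M long-form paragraphs for this query.
        samples = sample_fn(query, num_samples)
        # Step 2: collapse the M samples into one consensus paragraph.
        # With a single sample (the Factuality SFT case) there is nothing to summarize.
        target = samples[0] if num_samples == 1 else summarize_fn(samples)
        # Step 3: collect the pair that the base model will be finetuned on.
        pairs.append((query, target))
    return pairs


# Hypothetical stand-ins for the real sampler and summarizer.
def toy_sample(query: str, m: int) -> List[str]:
    return [f"draft {i} for: {query}" for i in range(m)]


def toy_summarize(samples: List[str]) -> str:
    return f"consensus of {len(samples)} drafts"


queries = ["Who directed Minority Report?"]
lc_pairs = summary_distillation(queries, toy_sample, toy_summarize)
factuality_pairs = summary_distillation(queries, toy_sample, toy_summarize, num_samples=1)
```

Passing the sampler and summarizer as callables is one way to swap in a stronger base model or a different Summarize() function without touching the loop.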

In [None]:
cd ..

First, you should make sure you have set your OpenAI and Anthropic API keys.

```bash
export OPENAI_API_KEY=sk-...
export ANTHROPIC_API_KEY=sk-...
```
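
Before launching any API calls, it can help to confirm that both keys are visible to the Python process. The helper below is a hypothetical convenience, not part of the repository; it only checks that the two environment variables named above are non-empty.

```python
import os
from typing import List, Mapping

REQUIRED_KEYS = ("OPENAI_API_KEY", "ANTHROPIC_API_KEY")


def missing_api_keys(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the names of required keys that are unset or empty."""
    return [name for name in REQUIRED_KEYS if not env.get(name)]


missing = missing_api_keys()
if missing:
    print(f"Missing API keys: {', '.join(missing)}")
```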

### 1. For each example in the SFT dataset, sample M long-form paragraph generations from a base model. In our experiments, we use a Llama 2 7B 8-Shot ICL base model and M=8.

In [3]:
# Ensure that you have specified your ICL model base path in the constants.py file.
# By default we prompt Llama 2 7B Base with 8-shot ICL, but it is straightforward to extend this pipeline to a base model of choice.

from linguistic_calibration import constants

ICL_BASE_PATH = constants.SHORT_NAME_TO_MODEL_PATH.get('llama-2-7b-hf'); ICL_BASE_PATH

'/home/nband/models/llama-2-7b-hf'

In [4]:
# To test this pipeline out, feel free to set MAX_EXAMPLES to a small number.
# If you want to generate the full size SFT and reward model datasets, set MAX_EXAMPLES to None.
MAX_EXAMPLES = 10

In [None]:
import datasets
import pandas as pd

from linguistic_calibration.inference.decode import HFDecodingArguments, decode_prompts_with_model
from linguistic_calibration.auto_annotations.qa_auto_eval_utils import format_paragraph_generation_prompt


# Load SFT prompts
train_dataset = datasets.load_dataset(
    constants.HF_DATASETS_PATH,
    name="sft_training",
    split="train",
    cache_dir=constants.DEFAULT_CACHE_DIR,
)
train_dataset_df = pd.DataFrame(train_dataset)

if MAX_EXAMPLES is not None:
    train_dataset_df = train_dataset_df.head(MAX_EXAMPLES)
    print(f"Using only {MAX_EXAMPLES} examples for SFT dataset.")
    
# These are also the same prompts you would use to generate the Factuality SFT training dataset
sft_prompts = format_paragraph_generation_prompt(
    train_dataset_df,
    model_name="llama-2-7b-hf",
    paragraph_generation_prompt_type="generate_paragraphs_llama_trivia_qa_icl_8shot"
)

In [None]:
# By default, we sample 8 long-form paragraph generations per example, with temperature 0.7
# For Factuality SFT, you should just use num_return_sequences=1
generate_multi_sample_decoding_args = HFDecodingArguments(
    temperature=0.7,
    max_new_tokens=512,
    num_return_sequences=8
)
lc_sft_multisample_generations = decode_prompts_with_model(
        prompts=sft_prompts,
        model_name=ICL_BASE_PATH,
        decoding_args=generate_multi_sample_decoding_args,
        per_device_batch_size=1,
)

In [7]:
# The output is a nested list of shape (N, M), where
# N is the number of examples in the SFT dataset, and
# M is the number of paragraph generations per example.

# Since we sample from an ICL model, we postprocess each generation to keep only the text before the first "\n\n".

new_paragraph_generations = []
for paragraph_generations in lc_sft_multisample_generations:
    new_paragraph_generations.append([generation.strip().split("\n\n")[0] for generation in paragraph_generations])
    
new_paragraph_generations[0]

['Steven Spielberg directed the 2002 film "Minority Report," which is set primarily in the year 2054. Based on a short story by Philip K. Dick, the film follows a group of people known as "pre-cogs," who can foresee crimes before they happen. The pre-cogs are used by the police to stop crimes before they occur, but the system soon begins to unravel, leading to a dangerous chase between the pre-cogs and the police.',
 'Steven Spielberg directed the 2002 film "Minority Report," which is set primarily in the year 2054. The film is based on the short story "The Minority Report" by Philip K. Dick and stars Tom Cruise, Colin Farrell, and Max von Sydow. "Minority Report" tells the story of a futuristic society where individuals with psychic abilities are used to predict and prevent crimes before they happen. However, the use of these psychics raises ethical and legal questions, leading to a conflict between the police and the psychics themselves. The film received positive reviews for its act
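
To make the postprocessing step concrete, here is the same truncation applied to a made-up generation. The helper name and the example string are illustrative assumptions; the string mimics an ICL model that keeps writing after its answer.

```python
def truncate_at_first_blank_line(generation: str) -> str:
    """Keep only the text before the first blank line."""
    return generation.strip().split("\n\n")[0]


# Made-up raw generation with trailing text after the first paragraph.
raw = "\nSpielberg directed the film.\n\nQuestion: Who wrote the story?\nAnswer: ..."
cleaned = truncate_at_first_blank_line(raw)
print(cleaned)  # Spielberg directed the film.
```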

### 2. For each example, summarize the M long-form paragraph generations into a single consensus paragraph using an API-based LLM (here, Claude 2.0).

In [8]:
from linguistic_calibration.types import List
from linguistic_calibration.utils import read

LC_SUMMARIZATION_PREFIX = "### Thought {idx}"
SUMMARIZATION_PROMPT_TEMPLATE = read("src/linguistic_calibration/prompts/paragraph_generation/summarize_claude.txt")

def format_summarization_example(
    prompt_template: str, 
    list_of_M_samples: List[str]
):
    formatted_paragraphs = ""
    for i, generated_paragraph in enumerate(list_of_M_samples):
        prefix = LC_SUMMARIZATION_PREFIX.format(idx=i+1)
        formatted_paragraphs += prefix + "\n" + generated_paragraph + "\n\n"

    return prompt_template.format(formatted_paragraphs=formatted_paragraphs)


lc_sft_summarization_prompts = [
    format_summarization_example(SUMMARIZATION_PROMPT_TEMPLATE, paragraph_generations)
    for paragraph_generations in new_paragraph_generations
]
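
The formatting logic above can be tried without the repository by using a toy template. In this sketch, `TOY_TEMPLATE` is a hypothetical stand-in for `summarize_claude.txt` (whose real contents are not shown here), and `format_thoughts` is an illustrative rewrite of the same loop.

```python
from typing import List

# Hypothetical template; the real one lives in summarize_claude.txt.
TOY_TEMPLATE = "Summarize the thoughts below.\n\n{formatted_paragraphs}### Summary"
THOUGHT_PREFIX = "### Thought {idx}"


def format_thoughts(template: str, samples: List[str]) -> str:
    """Number each sample as a 'thought' and insert them into the template."""
    blocks = [
        THOUGHT_PREFIX.format(idx=i) + "\n" + sample + "\n\n"
        for i, sample in enumerate(samples, start=1)
    ]
    return template.format(formatted_paragraphs="".join(blocks))


prompt = format_thoughts(TOY_TEMPLATE, ["It was Spielberg.", "Probably Spielberg."])
print(prompt)
```

Each sample appears under its own numbered `### Thought` header, so the summarizer sees all M generations for a query in a single prompt.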

In [None]:
# Run summarization with Claude 2.0

from linguistic_calibration.openai_utils import OpenAIDecodingArguments
from linguistic_calibration.inference.decode import get_text_from_completions

claude_summarization_decoding_args = OpenAIDecodingArguments(temperature=0.3)

lc_sft_summarizations = decode_prompts_with_model(
    prompts=lc_sft_summarization_prompts,
    model_name="claude-2.0",
    decoding_args=claude_summarization_decoding_args,
)
lc_sft_summarizations = get_text_from_completions(lc_sft_summarizations)

### 3. Construct a dataset of (query, summary paragraph) pairs and finetune the base model (Llama 2 7B Base) on them. 

In [10]:
new_sft_dataset = pd.DataFrame(
    {
        "question_id": train_dataset_df["question_id"],
        "paragraph_generation_prompt": train_dataset_df["paragraph_generation_prompt"],
        "claude_summary": lc_sft_summarizations
    }
)

new_sft_hf_dataset = datasets.Dataset.from_pandas(new_sft_dataset); new_sft_hf_dataset

Dataset({
    features: ['question_id', 'paragraph_generation_prompt', 'claude_summary'],
    num_rows: 10
})

*Finetuning on the LC SFT Dataset*

Now, you can finetune your base model (in our paper, Llama 2 7B Base) on this dataset.
Specifically, you could upload this dataset to HF datasets hub or save it locally, and then load it in the `supervised.py` training script by altering the loader method `linguistic_calibration.data_utils.make_linguistic_calibration_supervised_data_module`.

See README.md in the main directory for an example of running the LC SFT training script.
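
One way to save the dataset locally is as a JSON Lines file, which `datasets.load_dataset("json", data_files=...)` can read back. The sketch below uses only the standard library; the helper names, the file name, and the toy record are assumptions for illustration, and the column names match the dataset built above.

```python
import json
import os
import tempfile
from typing import Dict, List


def write_jsonl(records: List[Dict[str, str]], path: str) -> None:
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(path: str) -> List[Dict[str, str]]:
    """Read a JSON Lines file back into a list of dicts."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Toy record with placeholder values.
records = [
    {
        "question_id": "q1",
        "paragraph_generation_prompt": "Write a paragraph about ...",
        "claude_summary": "I believe ...",
    }
]
path = os.path.join(tempfile.mkdtemp(), "lc_sft_train.jsonl")
write_jsonl(records, path)
round_tripped = read_jsonl(path)
```

If you already hold a `datasets.Dataset`, its `save_to_disk` and `push_to_hub` methods are alternatives to writing the file by hand.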

### Generating LC Reward Model Training Data

LC RL is trained using decision-based RL. In our instantiation of decision-based RL, we decompose surrogate forecasting into two operations: `ExtractAnswers` and `ForecastProbs` (for more information, refer to the paper).
    
Following Algorithm 1, we need to:
1. Use the LC SFT model to generate paragraphs using the prompts from the Reward Model split of TriviaQA.
2. Use an API-based LLM (Claude 2.0 in our case) to generate answer extractions and probability forecasts.

This can be done straightforwardly in a single function call by using the QA auto-evaluation pipeline on the reward model dataset.

In [None]:
from examples.qa_automated_eval import main as qa_auto_eval_main

# Can set however you like
REWARD_MODEL_DATASET_OUTPUT_PATH = constants.DEFAULT_OUTPUT_DIR

# We assume that you have already finetuned the LC SFT model on the SFT dataset, 
# and specified its path in the constants.py file with key "lc_sft" in the dict constants.SHORT_NAME_TO_MODEL_PATH.
LC_SFT_MODEL_NAME = "lc_sft"

# We specify the prompts for answer extraction and forecasting used in the paper.
# We specify "failout" for the semantic equivalence prompt, which will intentionally end the script after probability forecasting. You should expect a FileNotFoundError.

qa_auto_eval_main(
    paragraph_generator_model_name=LC_SFT_MODEL_NAME,
    paragraph_generation_prompt="generate_paragraphs_llama_finetuned",
    answer_extractor_model_name="claude-2.0",
    answer_extractor_prompt="train/extract_answers_claude_8shot",
    forecast_probs_model_name="claude-2.0",
    forecast_probs_prompt="train/forecast_probs_claude_0shot",
    semantic_equivalence_prompt="failout",
    dataset_name="trivia_qa",
    dataset_split="reward_model",
    max_n_examples=MAX_EXAMPLES,
    generation_temperature=0.7,  # The same temp we use when sampling during PPO
    output_root_dir=REWARD_MODEL_DATASET_OUTPUT_PATH,
)

Now, we need to process the outputs from the QA auto-eval pipeline into a format that the reward_modeling.py script accepts.
Specifically, we need to use the format in the `reward_model_training` subset of the `tatsu-lab/linguistic_calibration` dataset on HuggingFace.

We need the following columns in order to train `ExtractAnswers` and `ForecastProbs`:
* question_id: str
* lc_sft_generated_paragraph: str
* lc_sft_ground_truth_and_extracted_answers: List[str], where the first entry is the ground-truth answer and the remaining are answers extracted with the API-based LLM.
* lc_sft_forecasted_probs: List[float], where the first entry is the probability that the API-based LLM forecasts for the ground-truth answer and the remaining entries are its forecasts for the extracted answers.
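
A quick sanity check on each row can catch misaligned lists before training. The function below is a hypothetical helper, and it assumes that forecasted probabilities lie in [0, 1]; adjust the range check if your forecasts are stored differently.

```python
from typing import List


def check_reward_model_row(answers: List[str], probs: List[float]) -> None:
    """Raise ValueError if a row does not match the column layout described above."""
    if not answers:
        raise ValueError("expected at least the ground-truth answer")
    if len(answers) != len(probs):
        raise ValueError(f"{len(answers)} answers but {len(probs)} probabilities")
    for p in probs:
        # Assumption: forecasts are probabilities. NaN also fails this comparison.
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"probability out of range: {p}")


# Toy row: ground-truth answer first, then extracted answers (values are made up).
check_reward_model_row(
    ["Steven Spielberg", "Spielberg", "Ridley Scott"],
    [0.9, 0.9, 0.1],
)
```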

In [None]:
from collections import defaultdict

FORECAST_PROBS_RESULTS_PATH = f"{REWARD_MODEL_DATASET_OUTPUT_PATH}/forecast_probs/trivia_qa/reward_model/lc_sft/claude-2.0/claude-2.0/skip_answer_extraction-False--max_ex-{MAX_EXAMPLES}--seed-42/gen_prompt-generate_paragraphs_llama_finetuned/extr_prompt-train__extract_answers_claude_8shot/forecast_prompt-train__forecast_probs_claude_0shot/gen_temp-0.7/ext_temp-0.2/forecast_temp-0.2/probability_forecasts.csv"

forecast_probs_df = pd.read_csv(FORECAST_PROBS_RESULTS_PATH)

reward_model_question_ids = []
reward_model_generated_paragraphs = []
reward_model_question_id_to_answers = defaultdict(list)
reward_model_question_id_to_forecasts = defaultdict(list)

for _, row in forecast_probs_df.iterrows():
    question_id = row["question_id"]
    
    if question_id not in reward_model_question_ids:
        reward_model_question_ids.append(question_id)
        reward_model_generated_paragraphs.append(row["generated_paragraph"])
    
    reward_model_question_id_to_answers[question_id].append(row["ground_truth_top_answer"])
    reward_model_question_id_to_forecasts[question_id].append(row["interpretation__forecast_probs"])
    
reward_model_dataset = pd.DataFrame(
    {
        "question_id": reward_model_question_ids,
        "lc_sft_generated_paragraph": reward_model_generated_paragraphs,
        "lc_sft_ground_truth_and_extracted_answers": [reward_model_question_id_to_answers[question_id] for question_id in reward_model_question_ids],
        "lc_sft_forecasted_probs": [reward_model_question_id_to_forecasts[question_id] for question_id in reward_model_question_ids]
})
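
The loop above collapses a long-format table (one row per question and answer) into one row per question. The toy version below shows the resulting shape without pandas. The rows, the short column names `answer` and `prob`, and the values are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, List

# Toy long-format rows: one row per (question, answer) pair.
rows = [
    {"question_id": "q1", "answer": "Steven Spielberg", "prob": 0.9},
    {"question_id": "q1", "answer": "Ridley Scott", "prob": 0.1},
    {"question_id": "q2", "answer": "Paris", "prob": 1.0},
]

answers_by_question: Dict[str, List[str]] = defaultdict(list)
probs_by_question: Dict[str, List[float]] = defaultdict(list)
for row in rows:
    # Dicts preserve insertion order, so questions and answers keep the row order.
    answers_by_question[row["question_id"]].append(row["answer"])
    probs_by_question[row["question_id"]].append(row["prob"])

grouped = [
    {"question_id": qid, "answers": answers_by_question[qid], "probs": probs_by_question[qid]}
    for qid in answers_by_question
]
```

Iterating over the dict keys also avoids the linear `question_id not in list` scan, which may matter when generating the full-size dataset.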

*Finetuning ExtractAnswers and ForecastProbs on the RM Dataset*

Now, you can finetune the two reward models used in the LC RL pipeline: `ExtractAnswers` and `ForecastProbs`.

Specifically, you could upload this dataset to HF datasets hub or save it locally. 
Then you can:
1. Load it during ExtractAnswers training (using the `supervised.py` training script) by altering the loader method `linguistic_calibration.data_utils.make_linguistic_calibration_supervised_data_module`.
2. Load it during ForecastProbs training (using the `reward_modeling.py` training script) by altering the loader method `linguistic_calibration.data_utils.make_linguistic_calibration_reward_modeling_data_module`.

See README.md in the main directory for a walkthrough of training the ExtractAnswers and ForecastProbs functions.

### Factuality Binary Correctness Reward Model Dataset

You can follow an almost identical approach to generate the Factuality Binary Correctness Reward Model dataset:

In [None]:
from examples.qa_automated_eval import main as qa_auto_eval_main

# Can set however you like
REWARD_MODEL_DATASET_OUTPUT_PATH = constants.DEFAULT_OUTPUT_DIR

# We assume that you have already finetuned the Factuality SFT model on the Factuality SFT dataset, 
# and specified its path in the constants.py file with key "factuality_sft" in the dict constants.SHORT_NAME_TO_MODEL_PATH.
FACTUALITY_SFT_MODEL_NAME = "factuality_sft"

# We specify the prompts for binary correctness annotation used in the paper.
# We specify "failout" for the semantic equivalence prompt, which will intentionally end the script after binary correctness annotation. You should expect a FileNotFoundError.

qa_auto_eval_main(
    paragraph_generator_model_name=FACTUALITY_SFT_MODEL_NAME,
    paragraph_generation_prompt="generate_paragraphs_llama_finetuned",
    forecast_probs_model_name="claude-2.0",
    forecast_probs_prompt="train/score_binary_correctness_claude_0shot",
    semantic_equivalence_prompt="failout",
    skip_answer_extraction=True,
    skip_forecast_probs=False,
    dataset_name="trivia_qa",
    dataset_split="reward_model",
    max_n_examples=MAX_EXAMPLES,
    generation_temperature=0.7,  # The same temp we use when sampling during PPO
    output_root_dir=REWARD_MODEL_DATASET_OUTPUT_PATH,
)

In [None]:
# Once again, we process the results from the QA auto-eval pipeline into a format that the reward_modeling.py script accepts, this time for the Factuality Binary Correctness Reward Model dataset.

BINARY_CORRECTNESS_RM_RESULTS_PATH = f"{REWARD_MODEL_DATASET_OUTPUT_PATH}/forecast_probs/trivia_qa/reward_model/factuality_sft/claude-2.0/claude-2.0/skip_answer_extraction-True--max_ex-{MAX_EXAMPLES}--seed-42/gen_prompt-generate_paragraphs_llama_finetuned/extr_prompt-eval__extract_answers_claude_10shot/forecast_prompt-train__score_binary_correctness_claude_0shot/gen_temp-0.7/ext_temp-0.2/forecast_temp-0.2/probability_forecasts.csv"

binary_correctness_rm_df = pd.read_csv(BINARY_CORRECTNESS_RM_RESULTS_PATH)
binary_correctness_rm_df = pd.DataFrame(
    {
        "question_id": binary_correctness_rm_df["question_id"],
        "factuality_sft_generated_paragraph": binary_correctness_rm_df["generated_paragraph"],
        "factuality_sft_binary_correctness": binary_correctness_rm_df["interpretation__forecast_probs"]
})
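
Before training the Factuality RM, it may be worth checking that the labels really are binary and seeing how balanced they are. The helper below is a hypothetical sketch that assumes the correctness column stores 0 or 1; the example labels are made up.

```python
from typing import Dict, Iterable


def summarize_binary_labels(labels: Iterable[float]) -> Dict[str, float]:
    """Check labels are 0 or 1 and report how many examples are marked correct."""
    values = list(labels)
    # Assumption: correctness is encoded as 0 (incorrect) or 1 (correct).
    bad = [v for v in values if v not in (0, 1)]
    if bad:
        raise ValueError(f"non-binary labels found: {bad[:5]}")
    n_correct = sum(1 for v in values if v == 1)
    return {
        "n_examples": len(values),
        "fraction_correct": n_correct / len(values) if values else 0.0,
    }


stats = summarize_binary_labels([1, 0, 1, 1])
print(stats)  # {'n_examples': 4, 'fraction_correct': 0.75}
```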

*Finetuning Factuality RM on the Binary Correctness RM Dataset*

Now, you can finetune the reward model used in the training of Factuality RL.

Specifically, you could upload this dataset to HF datasets hub or save it locally. 
Then you can load it during factuality reward modeling (using the `reward_modeling.py` training script) by altering the loader method `linguistic_calibration.data_utils.make_linguistic_calibration_reward_modeling_data_module`.

See README.md in the main directory for a walkthrough of training the Factuality RM.