# Fine-tuning to Memorize Knowledge

In this tutorial we experiment with some basic approaches to "baking in knowledge with fine-tuning":

- Synthesizing questions from existing context
- Trying text completion

In [1]:
import os
import openai
from llama_index import ServiceContext
from llama_index.llms import OpenAI

In [None]:
os.environ["OPENAI_API_KEY"] = "sk-..."
openai.api_key = os.environ["OPENAI_API_KEY"]

## Load Data

In [3]:
!mkdir -p data && wget --user-agent "Mozilla" "https://arxiv.org/pdf/2307.09288.pdf" -O "data/llama2.pdf"


In [2]:
from pathlib import Path
from llama_hub.file.pdf.base import PDFReader
from llama_hub.file.unstructured.base import UnstructuredReader
from llama_hub.file.pymu_pdf.base import PyMuPDFReader

In [3]:
loader = PyMuPDFReader()
docs0 = loader.load(file_path=Path("./data/llama2.pdf"))

In [4]:
from llama_index import Document

doc_text = "\n\n".join([d.get_content() for d in docs0])
metadata = {
    "paper_title": "Llama 2: Open Foundation and Fine-Tuned Chat Models"
}
docs = [Document(text=doc_text, metadata=metadata)]

In [None]:
print(docs[0].get_content())

In [20]:
from llama_index.callbacks import CallbackManager
callback_manager = CallbackManager([])

gpt_35_context = ServiceContext.from_defaults(
    llm=OpenAI(model="gpt-3.5-turbo-0613", temperature=0.3),
    callback_manager=callback_manager
)
gpt_4_context = ServiceContext.from_defaults(
    llm=OpenAI(model="gpt-4-0613", temperature=0.3),
    callback_manager=callback_manager
)

## Generate Dataset

In [21]:
from llama_index.evaluation import DatasetGenerator
from llama_index.node_parser import SimpleNodeParser
# try evaluation modules
from llama_index.evaluation import QueryResponseEvaluator, ResponseEvaluator
from llama_index import PromptTemplate

In [22]:
node_parser = SimpleNodeParser.from_defaults()
nodes = node_parser.get_nodes_from_documents(docs)
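
The node parser above turns the single long document into a list of smaller nodes, and the rest of the notebook generates questions one node at a time. As a rough illustration of the idea only, here is a minimal sketch that splits text into overlapping word windows. The `chunk_words` helper and its `chunk_size` and `overlap` parameters are hypothetical and word-based for simplicity; this is not the algorithm `SimpleNodeParser` actually uses.

```python
def chunk_words(text: str, chunk_size: int = 200, overlap: int = 20) -> list:
    """Split text into windows of `chunk_size` words sharing `overlap` words.

    Hypothetical helper for illustration; not part of llama_index.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # stop once the current window already reaches the end of the text
        if start + chunk_size >= len(words):
            break
    return chunks


sample = " ".join(f"w{i}" for i in range(10))
chunks = chunk_words(sample, chunk_size=4, overlap=1)
print(chunks)
```

Each chunk repeats the last word of the previous one, so a fact that straddles a boundary still appears whole in at least one chunk when the overlap is large enough.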

In [25]:
from tqdm.notebook import tqdm
import json

from llama_index import SummaryIndex  # used below to answer each question from its node

num_questions_per_chunk = 10
question_gen_query = (
    "You are a Teacher/Professor. Your task is to set up "
    "a quiz/examination. Using the provided context, "
    f"formulate {num_questions_per_chunk} questions that each capture an important fact from the "
    "context. \n"
    "You MUST obey the following criteria:\n"
    "- Restrict the question to the context information provided.\n"
    "- Do NOT create a question that cannot be answered from the context.\n"
    "- Phrase the question so that it does NOT refer to specific context. "
    "For instance, do NOT put phrases like \"given provided context\" or \"in this work\" in the question, "
    "because if the question is asked elsewhere it wouldn't be provided specific context. Replace these terms "
    "with specific details.\n"
    "BAD questions:\n"
    "What did the author do in his childhood\n"
    "What were the main findings in this report\n\n"
    "GOOD questions:\n"
    "What did Barack Obama do in his childhood\n"
    "What were the main findings in the original Transformers paper by Vaswani et al.\n\n"
    "Generate the questions below:\n"
)

# Go through each node one at a time:
# generate questions, answer each one from that node, and dump the pairs to a file.
# Note: no filtering with the eval modules is applied in this loop.

with open("data/qa_pairs.jsonl", "w") as fp:
    for idx, node in enumerate(nodes):
        dataset_generator = DatasetGenerator(
            [node],
            question_gen_query=question_gen_query,
            service_context=gpt_4_context,
            metadata_mode="all",
        )
        node_questions_0 = dataset_generator.generate_questions_from_nodes(
            num=num_questions_per_chunk
        )
        print(f"[Node {idx}] Generated questions:\n {node_questions_0}")

        # build the index once per node; every question is answered from this node only
        index = SummaryIndex([node], service_context=gpt_35_context)
        query_engine = index.as_query_engine()

        # for each question, get a response
        for question in tqdm(node_questions_0):
            response = query_engine.query(question)
            out_dict = {"query": question, "response": str(response)}
            print(f"[Node {idx}] Outputs: {out_dict}")
            fp.write(json.dumps(out_dict) + "\n")
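
The loop writes every question/response pair to `data/qa_pairs.jsonl` without filtering. In the run output that follows, some responses say the answer is not specified or not provided in the context, and the same question (for example, the one about the paper title) shows up for several nodes. Below is a minimal sketch of a post-hoc cleanup, assuming simple substring matching is good enough. The marker phrases and the `is_answered` and `filter_qa_lines` helpers are hypothetical and should be adapted after inspecting your own data.

```python
import json

# Hypothetical marker phrases, picked from refusals visible in the run output.
UNANSWERED_MARKERS = (
    "not specified in the given context",
    "not provided in the given context",
    "context does not provide",
)


def is_answered(response: str) -> bool:
    """Return False when the response looks like an 'I cannot answer' reply."""
    lowered = response.lower()
    return not any(marker in lowered for marker in UNANSWERED_MARKERS)


def filter_qa_lines(lines):
    """Parse JSONL lines, dropping unanswered responses and repeated queries."""
    seen = set()
    kept = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        query = record["query"]
        if query in seen or not is_answered(record["response"]):
            continue
        seen.add(query)
        kept.append(record)
    return kept


# Sample records copied from the run output.
sample = [
    {"query": "What is the range of parameters in the Llama 2 models?",
     "response": "The range of parameters in the Llama 2 models is from 7 billion to 70 billion."},
    {"query": "What is the performance gap between Llama 2 70B and GPT-4 and PaLM-2-L?",
     "response": "The performance gap between Llama 2 70B and GPT-4 and PaLM-2-L is not specified in the given context."},
    {"query": "What is the title of the paper discussed in the context?",
     "response": "Llama 2: Open Foundation and Fine-Tuned Chat Models"},
    {"query": "What is the title of the paper discussed in the context?",
     "response": "Llama 2: Open Foundation and Fine-Tuned Chat Models"},
]
kept = filter_qa_lines(json.dumps(record) for record in sample)
print(len(kept))
```

To apply the same cleanup to the generated file, pass the open file handle instead of the sample, e.g. `filter_qa_lines(open("data/qa_pairs.jsonl"))`. Keeping only the first occurrence of a repeated question is a design choice made here for simplicity.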
    

[Node 0] Generated questions:
 ['What is the name of the collection of pretrained and fine-tuned large language models discussed in the paper?', 'What is the range of parameters in the Llama 2 models?', 'What is the specific use case for the Llama 2-Chat models?', 'How do the Llama 2 models perform compared to open-source chat models on most benchmarks?', 'What are the two main areas of focus in the development of Llama 2-Chat models?', 'What is the purpose of providing a detailed description of the fine-tuning and safety improvements of Llama 2-Chat?', 'What is the scale of the large language models developed in Llama 2?', 'What is the potential substitute for closed-source models according to the paper?', 'What are the two methods used for fine-tuning in Llama 2?', 'What is the significance of safety in the development of Llama 2 models?']


[Node 0] Outputs: {'query': 'What is the name of the collection of pretrained and fine-tuned large language models discussed in the paper?', 'response': 'The name of the collection of pretrained and fine-tuned large language models discussed in the paper is Llama 2.'}
[Node 0] Outputs: {'query': 'What is the range of parameters in the Llama 2 models?', 'response': 'The range of parameters in the Llama 2 models is from 7 billion to 70 billion.'}
[Node 0] Outputs: {'query': 'What is the specific use case for the Llama 2-Chat models?', 'response': 'The Llama 2-Chat models are optimized for dialogue use cases.'}
[Node 0] Outputs: {'query': 'How do the Llama 2 models perform compared to open-source chat models on most benchmarks?', 'response': 'The Llama 2 models outperform open-source chat models on most benchmarks.'}
[Node 0] Outputs: {'query': 'What are the two main areas of focus in the development of Llama 2-Chat models?', 'response': 'The two main areas of focus in the development of 

[Node 1] Outputs: {'query': 'What is the title of the section that discusses the safety evaluation of the Llama 2-Chat?', 'response': 'The title of the section that discusses the safety evaluation of the Llama 2-Chat is "Safety Evaluation of Llama 2-Chat".'}
[Node 1] Outputs: {'query': 'What is the content of section 4.3 in the document?', 'response': 'The content of section 4.3 in the document is "Red Teaming".'}
[Node 1] Outputs: {'query': 'What topic is covered in section 5.2 of the document?', 'response': 'The topic covered in section 5.2 of the document is "Limitations and Ethical Considerations."'}
[Node 1] Outputs: {'query': 'What is the focus of the discussion in section 5.3?', 'response': 'The focus of the discussion in section 5.3 is the responsible release strategy.'}
[Node 1] Outputs: {'query': 'What is the subject of section 6 in the document?', 'response': 'The subject of section 6 in the document is "Related Work."'}
[Node 1] Outputs: {'query': 'What is the final section

[Node 2] Outputs: {'query': 'What is the primary function of Large Language Models (LLMs)?', 'response': 'The primary function of Large Language Models (LLMs) is to serve as highly capable AI assistants that excel in complex reasoning tasks requiring expert knowledge across a wide range of fields. They enable interaction with humans through intuitive chat interfaces, which has led to rapid and widespread adoption among the general public.'}
[Node 2] Outputs: {'query': 'What is the training methodology for Large Language Models?', 'response': 'The training methodology for Large Language Models involves pretraining the models on a large corpus of self-supervised data using auto-regressive transformers. After pretraining, the models are aligned with human preferences through techniques such as Reinforcement Learning with Human Feedback (RLHF). This methodology, although simple in concept, requires high computational requirements and has limited the development of Large Language Models to 

[Node 3] Outputs: {'query': 'What is the name of the updated version of the Llama 1 model?', 'response': 'The name of the updated version of the Llama 1 model is Llama 2.'}
[Node 3] Outputs: {'query': 'What is the percentage increase in the pretraining corpus size of the Llama 2 model compared to its predecessor?', 'response': 'The percentage increase in the pretraining corpus size of the Llama 2 model compared to its predecessor is 40%.'}
[Node 3] Outputs: {'query': 'What type of attention has been adopted in the Llama 2 model?', 'response': 'The Llama 2 model has adopted grouped-query attention.'}
[Node 3] Outputs: {'query': 'What are the parameter sizes of the Llama 2 variants that are being released?', 'response': 'The parameter sizes of the Llama 2 variants being released are 7B, 13B, and 70B.'}
[Node 3] Outputs: {'query': 'What is the optimized version of Llama 2 for dialogue use cases called?', 'response': 'The optimized version of Llama 2 for dialogue use cases is called Llama 

[Node 4] Outputs: {'query': 'What is the initial step in the training process of Llama 2-Chat?', 'response': 'The initial step in the training process of Llama 2-Chat is the pretraining of Llama 2 using publicly available online sources.'}
[Node 4] Outputs: {'query': 'What methodologies are used in the iterative refinement of the Llama 2-Chat model?', 'response': 'The iterative refinement of the Llama 2-Chat model involves the use of Reinforcement Learning with Human Feedback (RLHF) methodologies, specifically through rejection sampling and Proximal Policy Optimization (PPO).'}
[Node 4] Outputs: {'query': 'What is the significance of iterative reward modeling data during the RLHF stage of Llama 2-Chat model training?', 'response': "The iterative reward modeling data is crucial during the RLHF stage of Llama 2-Chat model training to ensure that the reward models remain within distribution. This means that by accumulating iterative reward modeling data in parallel with model enhancements

[Node 5] Outputs: {'query': 'What type of encoding algorithm does the tokenizer in Llama 2 use?', 'response': 'The tokenizer in Llama 2 uses a bytepair encoding (BPE) algorithm.'}
[Node 5] Outputs: {'query': 'What is the total vocabulary size of the tokenizer used in Llama 2?', 'response': 'The total vocabulary size of the tokenizer used in Llama 2 is 32k tokens.'}
[Node 5] Outputs: {'query': 'What type of hardware was used to pretrain the models in Llama 2?', 'response': "The hardware used to pretrain the models in Llama 2 includes Meta's Research Super Cluster (RSC) and internal production clusters. Both clusters are equipped with NVIDIA A100s GPUs. The RSC cluster uses NVIDIA Quantum InfiniBand as the interconnect, while the production cluster uses a RoCE (RDMA over converged Ethernet) solution based on commodity ethernet switches. The power consumption cap per GPU is 400W for RSC and 350W for the production cluster."}
[Node 5] Outputs: {'query': 'What are the two key differences be

[Node 6] Outputs: {'query': 'What is the estimated total carbon emissions for training the Llama 2 family of models?', 'response': 'The estimated total carbon emissions for training the Llama 2 family of models is 539 tCO2eq.'}
[Node 6] Outputs: {'query': 'What type of hardware was used for the computation of the Llama 2 family of models?', 'response': 'The type of hardware used for the computation of the Llama 2 family of models was A100-80GB.'}
[Node 6] Outputs: {'query': 'What is the estimated power usage of a GPU dependent on?', 'response': 'The estimated power usage of a GPU is dependent on its utilization.'}
[Node 6] Outputs: {'query': 'What are some power demands that the calculations for GPU power do not account for?', 'response': 'The calculations for GPU power do not account for further power demands such as those from interconnect or non-GPU server power consumption, nor from datacenter cooling systems.'}
[Node 6] Outputs: {'query': 'What is the potential additional contribu

[Node 7] Outputs: {'query': 'Which model outperforms all open-source models according to the data in Table 3?', 'response': 'Llama 2 70B model outperforms all open-source models according to the data in Table 3.'}
[Node 7] Outputs: {'query': 'How does the performance of Llama 2 70B compare to GPT-3.5 on MMLU and GSM8K benchmarks?', 'response': 'Llama 2 70B is close to GPT-3.5 on MMLU and GSM8K benchmarks.'}
[Node 7] Outputs: {'query': 'What is the performance gap between Llama 2 70B and GPT-4 and PaLM-2-L?', 'response': 'The performance gap between Llama 2 70B and GPT-4 and PaLM-2-L is not specified in the given context.'}
[Node 7] Outputs: {'query': 'How does Llama 2 70B compare to PaLM (540B) on almost all benchmarks?', 'response': 'Llama 2 70B performs on par or better than PaLM (540B) on almost all benchmarks.'}
[Node 7] Outputs: {'query': 'Which models does Llama 2 7B and 30B outperform on all categories besides code benchmarks?', 'response': 'Llama 2 7B and 30B models outperform 

[Node 8] Outputs: {'query': 'What is the name of the model that is the result of several months of research and iterative applications of alignment techniques?', 'response': 'Llama 2-Chat'}
[Node 8] Outputs: {'query': 'What are the two techniques used in the fine-tuning of Llama 2-Chat model?', 'response': 'The two techniques used in the fine-tuning of the Llama 2-Chat model are supervised fine-tuning and RLHF (Reinforcement Learning from Human Feedback).'}
[Node 8] Outputs: {'query': 'Which model performed the best on the GSM8K (8-shot) benchmark according to the data provided?', 'response': 'The model that performed the best on the GSM8K (8-shot) benchmark according to the data provided is GPT-4 with a score of 92.0.'}
[Node 8] Outputs: {'query': 'What is the new technique introduced that helps control dialogue flow over multiple turns?', 'response': 'The new technique introduced that helps control dialogue flow over multiple turns is Ghost Attention (GAtt).'}
[Node 8] Outputs: {'que

[Node 9] Outputs: {'query': 'What is the purpose of Supervised Fine-Tuning (SFT) in the Llama 2 model?', 'response': 'The purpose of Supervised Fine-Tuning (SFT) in the Llama 2 model is to align the model towards dialogue-style instructions. It involves collecting high-quality SFT data to improve the performance of the model. By using a limited set of clean instruction-tuning data, the model can achieve a high level of quality. SFT annotations are used to fine-tune the model and improve its performance.'}
[Node 9] Outputs: {'query': 'What is the role of third-party SFT data in the fine-tuning process of the Llama 2 model?', 'response': 'The role of third-party SFT data in the fine-tuning process of the Llama 2 model is to provide additional examples for aligning the model towards dialogue-style instructions. However, the context does not provide specific details about how the third-party SFT data is utilized in the fine-tuning process.'}
[Node 9] Outputs: {'query': 'How does the Llama 

[Node 10] Outputs: {'query': 'What is the primary method used to collect human preference data for reward modeling in Llama 2-Chat?', 'response': 'The primary method used to collect human preference data for reward modeling in Llama 2-Chat is a binary comparison protocol.'}
[Node 10] Outputs: {'query': 'How does the annotation procedure for Llama 2-Chat work?', 'response': 'The annotation procedure for Llama 2-Chat involves asking annotators to write a prompt and then choose between two sampled model responses based on provided criteria. The two responses are sampled from two different model variants and vary the temperature hyper-parameter to maximize diversity. Annotators are also asked to label the degree to which they prefer their chosen response over the alternative. The annotation focuses on helpfulness and safety, with separate guidelines for each. Additionally, a safety label is collected during the safety stage, categorizing model responses into three categories: one response 

[Node 11] Outputs: {'query': 'What is the purpose of the reward model in Llama 2?', 'response': 'The purpose of the reward model in Llama 2 is to learn human preferences for Llama 2-Chat outputs.'}
[Node 11] Outputs: {'query': 'What is the input of the reward model in Llama 2?', 'response': 'The input of the reward model in Llama 2 is the response, which is the prompt (including previous dialogue if available) and the completion generated by the model.'}
[Node 11] Outputs: {'query': 'What is the model architecture and hyper-parameters of the reward model in Llama 2?', 'response': 'The model architecture and hyper-parameters of the reward model in Llama 2 are identical to those of the pretrained language models, except that the classification head for next-token prediction is replaced with a regression head for outputting a scalar reward.'}
[Node 11] Outputs: {'query': 'What is replaced with a regression head in the model architecture of the reward model in Llama 2?', 'response': 'The c

[Node 12] Outputs: {'query': 'What is the purpose of keeping open-source preference datasets in the data mixture for Llama 2-Chat?', 'response': 'The purpose of keeping open-source preference datasets in the data mixture for Llama 2-Chat is to enable better generalization for the reward model and prevent reward hacking.'}
[Node 12] Outputs: {'query': 'What are the different sources of training data used for Llama 2-Chat?', 'response': 'The different sources of training data used for Llama 2-Chat include Meta Helpfulness data, Meta Safety data, Anthropic Harmless data, and open-source helpfulness data.'}
[Node 12] Outputs: {'query': 'What is the training data mixture used for the Helpfulness reward model in Llama 2-Chat?', 'response': 'The training data mixture used for the Helpfulness reward model in Llama 2-Chat consists of all Meta Helpfulness data combined with an equal parts of the remaining data uniformly sampled from Meta Safety and from the open-source datasets.'}
[Node 12] Outp

[Node 13] Outputs: {'query': 'What is the method used to prompt GPT-4 in the study described?', 'response': 'The method used to prompt GPT-4 in the study described is by providing a zero-shot question "Choose the best answer between A and B," where A and B are the two responses for comparison.'}
[Node 13] Outputs: {'query': 'What are the two responses for comparison in the study?', 'response': 'The two responses for comparison in the study are labeled as A and B.'}
[Node 13] Outputs: {'query': 'What is the metric used to report the results of the study?', 'response': 'The metric used to report the results of the study is accuracy.'}
[Node 13] Outputs: {'query': 'Which reward model performed the best on the internal test sets based on Llama 2-Chat?', 'response': 'The Helpfulness reward model performed the best on the internal test sets based on Llama 2-Chat.'}
[Node 13] Outputs: {'query': 'Which reward model performed the best on the Meta Helpfulness test set?', 'response': 'The Helpful

[Node 14] Outputs: {'query': 'What is the title of the paper discussed in the context?', 'response': 'Llama 2: Open Foundation and Fine-Tuned Chat Models'}
[Node 14] Outputs: {'query': 'What is the main objective of the Llama 2-Chat model?', 'response': 'The main objective of the Llama 2-Chat model is to improve the performance of Llama 2-Chat by training better reward models and collecting more prompts through iterative fine-tuning.'}
[Node 14] Outputs: {'query': 'What is the relationship between the size of the model and the accuracy of the reward model?', 'response': 'The relationship between the size of the model and the accuracy of the reward model is that larger models tend to obtain higher performance for a similar volume of data. In other words, increasing the size of the model generally improves the accuracy of the reward model. However, it is mentioned that the scaling performance has not yet plateaued given the existing volume of data annotation used for training, indicating

[Node 15] Outputs: {'query': 'What is the main difference between Rejection Sampling and PPO in the context of RL algorithms?', 'response': 'Rejection Sampling and PPO differ mainly in terms of breadth and depth. In Rejection Sampling, the model explores multiple samples for a given prompt, while PPO only generates one sample. In terms of depth, PPO uses the updated model policy from the previous step during training, while Rejection Sampling fine-tuning samples all the outputs given the initial policy of the model before applying fine-tuning. However, due to iterative model updates, the fundamental differences between the two RL algorithms are less pronounced.'}
[Node 15] Outputs: {'query': 'What is the role of the reward score in the RLHF model?', 'response': 'The reward score in the RLHF model plays a crucial role in determining the quality of the generated outputs. It is used to rank and select the best answer for a given prompt during the iterative stages of the model. The highest

[Node 16] Outputs: {'query': 'What increases with more samples in the Llama 2 model?', 'response': 'The delta increases with more samples in the Llama 2 model.'}
[Node 16] Outputs: {'query': 'What is the role of the temperature parameter in exploration?', 'response': 'The temperature parameter plays an important role in exploration as it enables the sampling of more diverse outputs.'}
[Node 16] Outputs: {'query': 'What is the optimal temperature when sampling between 10 and 100 outputs for Llama 2-Chat-RLHF?', 'response': 'The optimal temperature when sampling between 10 and 100 outputs for Llama 2-Chat-RLHF is T ∈ [1.2, 1.3].'}
[Node 16] Outputs: {'query': 'What is the objective that is sought to be optimized during the training phase of the language model?', 'response': 'The objective that is sought to be optimized during the training phase of the language model is to maximize the expected reward.'}
[Node 16] Outputs: {'query': 'What does the final reward function used during optimiz

[Node 17] Outputs: {'query': 'What is the purpose of the Ghost Attention (GAtt) method in Llama 2-Chat models?', 'response': 'The purpose of the Ghost Attention (GAtt) method in Llama 2-Chat models is to enable dialogue control over multiple turns. It addresses the limitation of the initial RLHF models, which tended to forget the initial instruction after a few turns of dialogue. GAtt is a simple method that helps the attention focus in a multi-stage process, allowing the subsequent response to always respect the constraint provided in the initial instruction throughout the dialogue.'}
[Node 17] Outputs: {'query': 'How long does each iteration of PPO on the 70B model take on average?', 'response': 'Each iteration of PPO on the 70B model takes on average approximately 330 seconds.'}
[Node 17] Outputs: {'query': 'What is the role of FSDP in training Llama 2-Chat models?', 'response': 'FSDP (Fully Sharded Data Parallelism) is used in training Llama 2-Chat models to enable quick training w

[Node 18] Outputs: {'query': 'What is the title of the paper discussed in the context?', 'response': 'Llama 2: Open Foundation and Fine-Tuned Chat Models'}
[Node 18] Outputs: {'query': 'What is the name of the character who is asked to act as Oscar Wilde?', 'response': 'The name of the character who is asked to act as Oscar Wilde is not provided in the given context information.'}
[Node 18] Outputs: {'query': 'Which city is described as the epitome of sophistication and culture?', 'response': 'London is described as the epitome of sophistication and culture.'}
[Node 18] Outputs: {'query': 'What are some of the features mentioned that make London a great city?', 'response': 'London is mentioned as the city of Shakespeare and Dickens, the great universities, and the museums and galleries. It is described as a city where the old and the new blend together in a beautiful harmony.'}
[Node 18] Outputs: {'query': 'What is the role of GAtt in the dialogue model discussed in the context?', 'res

[Node 19] Outputs: {'query': 'What is the purpose of using a more general reward in the Llama 2 chat models?', 'response': 'To ensure that the measure used in the Llama 2 chat models aligns with human preferences and does not diverge from them.'}
[Node 19] Outputs: {'query': 'How does the iterative model updates help in the evolution of Llama 2-Chat?', 'response': 'Iterative model updates help in the evolution of Llama 2-Chat by preventing regression between the new model and the previous one. By using both models to sample during the next annotation iteration, it enables a model comparison "for free" on new prompts and increases diversity when sampling. This iterative process helps to improve the performance of Llama 2-Chat over time.'}
[Node 19] Outputs: {'query': 'What is the significance of using both the new model and the previous one to sample during the next annotation iteration?', 'response': 'Using both the new model and the previous one to sample during the next annotation it

[Node 20] Outputs: {'query': 'Which model outperformed MPT-7B-chat on 60% of the prompts?', 'response': 'Llama 2-Chat model outperformed MPT-7B-chat on 60% of the prompts.'}
[Node 20] Outputs: {'query': 'What is the overall win rate of Llama 2-Chat 34B against equivalently sized Vicuna-33B and Falcon 40B models?', 'response': 'The overall win rate of Llama 2-Chat 34B against equivalently sized Vicuna-33B and Falcon 40B models is more than 75%.'}
[Node 20] Outputs: {'query': 'How does the largest Llama 2-Chat model compare with ChatGPT?', 'response': 'The largest Llama 2-Chat model has a win rate of 36% and a tie rate of 31.5% relative to ChatGPT. This means that the Llama 2-Chat model performs competitively with ChatGPT, but ChatGPT has a higher win rate.'}
[Node 20] Outputs: {'query': 'What is the win rate and tie rate of Llama 2-Chat 70B model relative to ChatGPT?', 'response': 'The win rate of Llama 2-Chat 70B model relative to ChatGPT is 36%, and the tie rate is 31.5%.'}
[Node 20] 

[Node 21] Outputs: {'query': 'What measures were taken to reduce the carbon footprint during the training of Llama 2 models?', 'response': 'Efforts were made to train the Llama 2 models efficiently in order to reduce the carbon footprint of pretraining.'}
[Node 21] Outputs: {'query': 'What is the potential risk of over-scrubbing datasets in the context of Llama 2 model training?', 'response': 'The potential risk of over-scrubbing datasets in the context of Llama 2 model training is the accidental demographic erasure.'}
[Node 21] Outputs: {'query': 'What is the recommended safety measure before deploying Llama 2 models?', 'response': 'Llama 2 models should be used carefully and deployed only after significant safety tuning is applied.'}
[Node 21] Outputs: {'query': 'What bias was observed in the use of pronouns in the English-language training corpus for Llama 2?', 'response': 'The bias observed in the use of pronouns in the English-language training corpus for Llama 2 is that He pronou

[Node 22] Outputs: {'query': 'What percentage of documents contain gender pronouns according to the study?', 'response': 'The percentage of documents that contain gender pronouns according to the study is 75%.'}
[Node 22] Outputs: {'query': 'What is the percentage of documents that contain pronouns in general?', 'response': 'The percentage of documents that contain pronouns in general is 94.47%.'}
[Node 22] Outputs: {'query': "What percentage of documents contain 'She' pronouns as per the analysis?", 'response': "28% of all documents contain 'She' pronouns as per the analysis."}
[Node 22] Outputs: {'query': "What is the percentage of documents that mention any descriptor terms in the demographic axis of 'Gender and Sex'?", 'response': "The percentage of documents that mention any descriptor terms in the demographic axis of 'Gender and Sex' is 5.91%."}
[Node 22] Outputs: {'query': "According to the study, what percentage of documents mention the descriptor 'female' in the 'Gender and Se

[Node 23] Outputs: {'query': 'What is the primary language in the pretraining data of Llama 2?', 'response': 'The primary language in the pretraining data of Llama 2 is English.'}
[Node 23] Outputs: {'query': 'What is the purpose of the TruthfulQA benchmark in evaluating the safety capabilities of Llama 2?', 'response': 'The purpose of the TruthfulQA benchmark in evaluating the safety capabilities of Llama 2 is to measure how well the language model can generate reliable outputs that agree with factuality and common sense.'}
[Node 23] Outputs: {'query': 'How is toxicity defined in the context of language model safety?', 'response': 'Toxicity, in the context of language model safety, is defined as the tendency of a language model to generate toxic, rude, adversarial, or implicitly hateful content. It refers to the extent to which the language model produces language that can be considered harmful or offensive.'}
[Node 23] Outputs: {'query': 'What is the BOLD benchmark used for in the ev

[Node 24] Outputs: {'query': 'What is the purpose of TruthfulQA in the evaluation of pretrained LLMs?', 'response': 'The purpose of TruthfulQA in the evaluation of pretrained LLMs is to determine the percentage of generations that are both truthful and informative.'}
[Node 24] Outputs: {'query': 'What does a lower percentage in ToxiGen indicate in the context of pretrained LLMs evaluation?', 'response': 'A lower percentage in ToxiGen indicates a smaller percentage of toxic generations in the context of pretrained LLMs evaluation.'}
[Node 24] Outputs: {'query': 'What are the three broad risk categories considered for creating adversarial prompts?', 'response': 'The three broad risk categories considered for creating adversarial prompts are illicit and criminal activities, hateful and harmful activities, and unqualified advice.'}
[Node 24] Outputs: {'query': 'What are some examples of attack vectors explored for creating adversarial prompts?', 'response': 'Examples of attack vectors expl

[Node 25] Outputs: {'query': 'What is the purpose of the guidelines mentioned in the Llama 2: Open Foundation and Fine-Tuned Chat Models study?', 'response': 'The purpose of the guidelines mentioned in the Llama 2: Open Foundation and Fine-Tuned Chat Models study is to provide a general guide for the model in order to avoid negative user experience categories and to identify and mitigate risks associated with unsafe behavior.'}
[Node 25] Outputs: {'query': 'What is the role of the annotators in the safety supervised fine-tuning process?', 'response': 'The annotators are responsible for providing prompts and demonstrations of safe model responses during the safety supervised fine-tuning process. They are instructed to come up with prompts that could potentially induce unsafe behavior and then craft a safe and helpful response that the model should produce.'}
[Node 25] Outputs: {'query': 'What does the term "red teaming" refer to in this context?', 'response': 'The term "red teaming" in 

[Node 26] Outputs: {'query': 'What is the title of the paper discussed in the context?', 'response': 'The title of the paper discussed in the context is "Llama 2: Open Foundation and Fine-Tuned Chat Models".'}
[Node 26] Outputs: {'query': 'What is the impact of safety RLHF measured by?', 'response': 'The impact of safety RLHF is measured by the reward model score distributions.'}
[Node 26] Outputs: {'query': 'What does the clustering of samples in the top left corner suggest?', 'response': 'The clustering of samples in the top left corner suggests an improvement in model safety.'}
[Node 26] Outputs: {'query': 'What is the subject of the email mentioned in the context?', 'response': 'The subject of the email mentioned in the context is "Urgent Assistance Required".'}
[Node 26] Outputs: {'query': 'What is the situation described in the email?', 'response': 'The situation described in the email is that the sender claims to have been robbed while being in a certain location and asks the re

[Node 27] Outputs: {'query': "What is the impact of increasing the proportion of safety data on a model's performance?", 'response': "Increasing the proportion of safety data in model training has a significant impact on the model's performance in handling risky and adversarial prompts. It improves the model's performance on safety, resulting in a lighter tail in the safety reward model score distribution. However, the mean helpfulness score remains relatively stable, suggesting that a sufficient amount of helpfulness training data is already available. The addition of more safety training data gradually eliminates the left tail of safety reward model scores, which represents the most unsafe responses. This indicates that increasing the proportion of safety data helps the model in generating safer and more reliable responses."}
[Node 27] Outputs: {'query': 'What happens to the mean safety reward model score as the amount of safety data in model training increases?', 'response': 'The me

[Node 28] Outputs: {'query': 'What is the purpose of using a safety preprompt in Llama 2-Chat?', 'response': 'The purpose of using a safety preprompt in Llama 2-Chat is to enhance the safety capabilities of the language model. By prefixing the model with a safety preprompt, such as "You are a safe and responsible assistant," the model is encouraged to associate adversarial prompts with safer responses. This helps to ensure that the model\'s responses do not include harmful, unethical, or socially biased content. The safety preprompt acts as a quick way to bootstrap the model\'s responses on hard adversarial prompts, which can then be further improved through fine-tuning.'}
[Node 28] Outputs: {'query': 'How does context distillation enhance the safety capabilities of LLMs?', 'response': 'Context distillation enhances the safety capabilities of LLMs by prefixing the model with a safety preprompt, such as "You are a safe and responsible assistant." This helps the model associate positive 

[Node 29] Outputs: {'query': 'What does the term "red teaming" refer to in the context of proactive risk identification?', 'response': 'The term "red teaming" refers to a process of proactive risk identification in the context of the given information. It involves conducting granular analysis and targeted assessments to identify and address specific patterns and potential risks. This approach is important because safety issues can arise from even infrequent edge cases, and quantitative scores may not capture all potential problems. Red teaming typically involves engaging various groups of individuals, including domain experts and representatives from different backgrounds, to provide diverse perspectives and insights.'}
[Node 29] Outputs: {'query': 'What is the purpose of adding a preprompt based on the risk category with a tailored answer template?', 'response': 'The purpose of adding a preprompt based on the risk category with a tailored answer template is to increase the safety RM s

[Node 30] Outputs: {'query': 'What is the purpose of the red team probing the models?', 'response': "The purpose of the red team probing the models is to test the models across various risk categories and attack vectors. This includes assessing the models' capabilities in handling topics such as criminal planning, human trafficking, regulated substances, sexually explicit content, unqualified health or financial advice, privacy violations, and more. The red team also conducts tests to determine the models' abilities to facilitate the production of weapons. The goal of these probing exercises is to identify any unsafe or problematic responses generated by the models and provide insights for improving their safety and mitigating risks."}
[Node 30] Outputs: {'query': 'What are some of the risk categories that the models were tested against?', 'response': 'The models were tested against risk categories such as criminal planning, human trafficking, regulated or controlled substances, sexual

[Node 31] Outputs: {'query': 'What is the main evaluation metric used in the safety assessment of Llama 2-Chat models?', 'response': 'The main evaluation metric used in the safety assessment of Llama 2-Chat models is the violation percentage.'}
[Node 31] Outputs: {'query': 'How is a response determined to be violating or not in the safety assessment of Llama 2-Chat models?', 'response': "A response is determined to be violating or not in the safety assessment of Llama 2-Chat models based on the majority vote of three annotators. If the majority vote indicates that the response is violating, it is considered a violation. The violation percentage is used as the main evaluation metric, with the mean rating serving as a supplement. The inter-rater reliability (IRR) scores, which measure the agreement among annotators on safety assessments, range from 0.70 to 0.95. The average IRR for Llama 2-Chat annotations is 0.92 according to Gwet's AC2 measure."}
[Node 31] Outputs: {'query': 'What is G

[Node 32] Outputs: {'query': 'What trend is observed across models in terms of violation percentage on single- and multi-turn conversations?', 'response': 'Multi-turn conversations are more prone to inducing unsafe responses compared to single-turn conversations across models.'}
[Node 32] Outputs: {'query': 'How does the Llama 2-Chat model perform in comparison to baselines on multi-turn conversations?', 'response': 'The Llama 2-Chat model performs well compared to baselines on multi-turn conversations. It is observed that multi-turn conversations are more prone to inducing unsafe responses, but Llama 2-Chat still performs well in this aspect. However, Falcon, another model, performs better on single-turn conversations but worse on multi-turn conversations, possibly due to its lack of multi-turn supervised fine-tuning data.'}
[Node 32] Outputs: {'query': 'Which model performs particularly well on single-turn conversations and why?', 'response': 'Falcon performs particularly well on sin

[Node 33] Outputs: {'query': 'What interesting properties were observed with RLHF in the Llama 2-Chat model?', 'response': "The tuning process of the Llama 2-Chat model revealed several interesting properties of RLHF. These properties include the model's ability to temporally organize its knowledge and to call APIs for external tools."}
[Node 33] Outputs: {'query': 'What are the limitations of the Llama 2-Chat model?', 'response': 'The limitations of the Llama 2-Chat model are discussed in Section 5.2 of the paper.'}
[Node 33] Outputs: {'query': 'What strategy was presented for responsibly releasing these models?', 'response': 'The strategy presented for responsibly releasing these models was discussed in Section 5.3 of the paper.'}
[Node 33] Outputs: {'query': 'What abilities did the tuning process reveal about Llama 2-Chat?', 'response': 'The tuning process revealed that Llama 2-Chat has the ability to temporally organize its knowledge and to call APIs for external tools.'}
[Node 33]

[Node 34] Outputs: {'query': 'What does the RLHF in the Llama 2 model learn to adapt with regard to the type of prompt?', 'response': 'The RLHF in the Llama 2 model learns to adapt the temperature with regard to the type of prompt.'}
[Node 34] Outputs: {'query': 'How does the Self-BLEU metric correspond to diversity in the Llama 2 model?', 'response': 'Lower Self-BLEU corresponds to more diversity in the Llama 2 model.'}
[Node 34] Outputs: {'query': 'What does the Llama 2-Chat model demonstrate when provided with minimal data?', 'response': 'The Llama 2-Chat model demonstrates a robust capability to organize its knowledge in a temporal manner when provided with minimal data. It is able to understand and respond to questions related to specific dates, even without extensive training on temporal context.'}
[Node 34] Outputs: {'query': 'How was the concept of time instilled in the Llama 2-Chat model?', 'response': 'The concept of time was instilled in the Llama 2-Chat model by collecting 

[Node 35] Outputs: {'query': 'What is the performance of Llama 2-Chat on the math datasets used in Toolformer?', 'response': 'Llama 2-Chat has a performance score of 67.1 on ASDiv, 69.2 on SVAMP, and 82.4 on MAWPS, which are the math datasets used in Toolformer.'}
[Node 35] Outputs: {'query': 'How does the Llama 2-Chat model understand the applications of tools and API arguments?', 'response': 'The Llama 2-Chat model is able to understand the applications of tools and API arguments through semantics, despite never having been explicitly trained to use tools. This means that the model can comprehend how to use different tools and understand the arguments required by the tools just by understanding the meaning of the text.'}
[Node 35] Outputs: {'query': 'What are the limitations of Llama 2-Chat as mentioned in the context?', 'response': "Llama 2-Chat, like other language models, has certain limitations. These include a cessation of knowledge updates post-pretraining, potential for non-fa

[Node 36] Outputs: {'query': 'What is the primary intention behind releasing Llama 2 openly?', 'response': 'The primary intention behind releasing Llama 2 openly is to encourage responsible AI innovation and promote collaboration within the AI community.'}
[Node 36] Outputs: {'query': 'How does the decentralization of AI expertise stimulate the AI industry according to the text?', 'response': 'The decentralization of AI expertise stimulates the AI industry by stimulating innovation and accelerating progress in the industry. This is because it allows more people to access AI tools and democratizes the technology, creating a more level playing field for organizations of all sizes to benefit from the economic growth promised by the advancement of AI. Additionally, openly releasing AI models consolidates costs and eliminates barriers to entry, allowing small businesses to leverage innovations in language models to explore and build text-generation use cases.'}
[Node 36] Outputs: {'query': 

[Node 37] Outputs: {'query': 'What is the approach related to instruction tuning called that prompts models to explain their reasoning when given a complex problem?', 'response': 'The approach related to instruction tuning that prompts models to explain their reasoning when given a complex problem is called chain-of-thought prompting.'}
[Node 37] Outputs: {'query': 'What strategy has been identified as a powerful method for fine-tuning Large Language Models?', 'response': 'RLHF (Reinforcement Learning from Human Feedback) has been identified as a powerful strategy for fine-tuning Large Language Models.'}
[Node 37] Outputs: {'query': 'Who first showcased the method of fine-tuning Large Language Models based on feedback from human users?', 'response': 'Stiennon et al. first showcased the method of fine-tuning Large Language Models based on feedback from human users.'}
[Node 37] Outputs: {'query': 'What issues can a combination of instruction fine-tuning and RLHF help fix in Large Languag

[Node 38] Outputs: {'query': 'Who are the authors of the paper titled "Llama 2: Open Foundation and Fine-Tuned Chat Models"?', 'response': 'Ebtesam Almazrouei, Hamza Alobeidli, Abdulaziz Alshamsi, Alessandro Cappelli, Ruxandra Cojocaru, Merouane Debbah, Etienne Goffinet, Daniel Heslow, Julien Launay, Quentin Malartic, Badreddine Noune, Baptiste Pannier, and Guilherme Penedo.'}
[Node 38] Outputs: {'query': 'What is the title of the technical report written by Rohan Anil, Andrew M. Dai, Orhan Firat, and others in 2023?', 'response': 'Falcon-40B: an open large language model with state-of-the-art performance.'}
[Node 38] Outputs: {'query': 'Who are the authors of the preprint titled "A general language assistant as a laboratory for alignment" on arXiv in 2021?', 'response': 'Amanda Askell, Yuntao Bai, Anna Chen, Dawn Drain, Deep Ganguli, Tom Henighan, Andy Jones, Nicholas Joseph, Ben Mann, Nova DasSarma, et al.'}
[Node 38] Outputs: {'query': 'What is the topic of the work by Ellen Jiang, 

[Node 39] Outputs: {'query': 'What is the title of the technical report discussed in the context that focuses on automation and labor displacement?', 'response': 'The title of the technical report discussed in the context that focuses on automation and labor displacement is "Is automation labor-displacing? Productivity growth, employment, and the labor share."'}
[Node 39] Outputs: {'query': 'Who are the authors of the preprint titled "Training a helpful and harmless assistant with reinforcement learning from human feedback"?', 'response': 'Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al.'}
[Node 39] Outputs: {'query': 'What is the topic of the paper "Based on billions of words on the internet, people=men" by April H Bailey, Adina Williams, and Andrei Cimpian?', 'response': 'The topic of the paper "Based on billions of words on the internet, people=men" by April H Bailey, Adina Wi

[Node 40] Outputs: {'query': 'Who are the authors of the paper titled "Llama 2: Open Foundation and Fine-Tuned Chat Models"?', 'response': 'The authors of the paper titled "Llama 2: Open Foundation and Fine-Tuned Chat Models" are Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian, Clemens Winter, Philippe Tillet, Felipe Petroski Such, Dave Cummings, Matthias Plappert, Fotios Chantzis, Elizabeth Barnes, Ariel Herbert-Voss, William Hebgen Guss, Alex Nichol, Alex Paino, Nikolas Tezak, Jie Tang, Igor Babuschkin, Suchir Balaji, Shantanu Jain, William Saunders, Christopher Hesse, Andrew N. Carr, Jan Leike, Josh Achiam, Vedant Misra, Evan Morikawa, Alec Radford, Matthew Knight, Miles Brundag

[Node 41] Outputs: {'query': 'Who are the authors of the paper titled "Llama 2: Open Foundation and Fine-Tuned Chat Models"?', 'response': 'Hyung Won Chung, Le Hou, S. Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, Albert Webson, Shixiang Shane Gu, Zhuyun Dai, Mirac Suzgun, Xinyun Chen, Aakanksha Chowdhery, Dasha Valter, Sharan Narang, Gaurav Mishra, Adams Wei Yu, Vincent Zhao, Yanping Huang, Andrew M. Dai, Hongkun Yu, Slav Petrov, Ed Huai hsin Chi, Jeff Dean, Jacob Devlin, Adam Roberts, Denny Zhou, Quoc V. Le, and Jason Wei.'}
[Node 41] Outputs: {'query': 'What is the title of the paper authored by Hyung Won Chung, Le Hou, S. Longpre, and others?', 'response': 'Llama 2: Open Foundation and Fine-Tuned Chat Models'}
[Node 41] Outputs: {'query': 'What is the preprint number of the paper "Scaling instruction-finetuned language models" on arXiv?', 'response': 'The preprint number of the paper "Scaling instruction-finetuned language m

[Node 42] Outputs: {'query': 'Who are the authors of the paper titled "Llama 2: Open Foundation and Fine-Tuned Chat Models"?', 'response': 'Jesse Dodge, Taylor Prewitt, Remi Tachet Des Combes, Erika Odmark, Roy Schwartz, Emma Strubell, Alexandra Sasha Luccioni, Noah A Smith, Nicole DeCario, and Will Buchanan.'}
[Node 42] Outputs: {'query': 'What is the title of the paper authored by Jesse Dodge, Taylor Prewitt, Remi Tachet Des Combes, and others?', 'response': 'Llama 2: Open Foundation and Fine-Tuned Chat Models'}
[Node 42] Outputs: {'query': 'Who are the authors of the paper "GLaM: Efficient scaling of language models with mixture-of-experts"?', 'response': 'The authors of the paper "GLaM: Efficient scaling of language models with mixture-of-experts" are Nan Du, Yanping Huang, Andrew M Dai, Simon Tong, Dmitry Lepikhin, Yuanzhong Xu, Maxim Krikun, Yanqi Zhou, Adams Wei Yu, Orhan Firat, Barret Zoph, Liam Fedus, Maarten P Bosma, Zongwei Zhou, Tao Wang, Emma Wang, Kellie Webster, Marie Pe

[Node 43] Outputs: {'query': 'What is the title of the paper that discusses designing sustainable computer systems with an architectural carbon modeling tool?', 'response': 'The title of the paper that discusses designing sustainable computer systems with an architectural carbon modeling tool is "Act: designing sustainable computer systems with an architectural carbon modeling tool."'}
[Node 43] Outputs: {'query': 'Who are the authors of the paper "Chasing carbon: The elusive environmental footprint of computing"?', 'response': 'Udit Gupta, Young Guen Kim, Sylvia Lee, Jordan Tse, Hsien-Hsin Sean Lee, Gu-Yeon Wei, David Brooks, and Carole-Jean Wu.'}
[Node 43] Outputs: {'query': 'What is the subject of Kilem L. Gwet\'s "Handbook of inter-rater reliability"?', 'response': 'The subject of Kilem L. Gwet\'s "Handbook of inter-rater reliability" is inter-rater reliability measurement.'}
[Node 43] Outputs: {'query': 'What is the focus of the paper "Toxigen: A large-scale machine-generated data

[Node 44] Outputs: {'query': 'Who are the authors of the paper titled "Overcoming catastrophic forgetting in neural networks"?', 'response': 'James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska'}
[Node 44] Outputs: {'query': 'What is the title of the paper written by Tomasz Korbak, Kejian Shi, Angelica Chen, and others in 2023?', 'response': 'Pretraining language models with human preferences'}
[Node 44] Outputs: {'query': 'What is the focus of the paper "Sentencepiece: A simple and language independent subword tokenizer and detokenizer for neural text processing" authored by Taku Kudo and John Richardson?', 'response': 'The focus of the paper "Sentencepiece: A simple and language independent subword tokenizer and detokenizer for neural text processing" authored by Taku Kudo and John Richardson is on the development of a subword tokenizer and detokenizer that is simple,

[Node 45] Outputs: {'query': 'Who are the authors of the paper titled "Llama 2: Open Foundation and Fine-Tuned Chat Models"?', 'response': 'Grégoire Mialon, Roberto Dessì, Maria Lomeli, Christoforos Nalmpantis, Ram Pasunuru, Roberta Raileanu, Baptiste Rozière, Timo Schick, Jane Dwivedi-Yu, Asli Celikyilmaz, et al.'}
[Node 45] Outputs: {'query': 'What is the title of the paper authored by Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal?', 'response': 'The title of the paper authored by Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal is "Can a suit of armor conduct electricity? A new dataset for open book question answering."'}
[Node 45] Outputs: {'query': 'What is the focus of the paper "Model cards for model reporting" by Margaret Mitchell and others?', 'response': 'The focus of the paper "Model cards for model reporting" by Margaret Mitchell and others is on model reporting.'}
[Node 45] Outputs: {'query': 'Who introduced the new standard for open-source, 

[Node 46] Outputs: {'query': 'Who are the authors of the paper titled "Llama 2: Open Foundation and Fine-Tuned Chat Models"?', 'response': 'The authors of the paper titled "Llama 2: Open Foundation and Fine-Tuned Chat Models" are Jack W. Rae, Sebastian Borgeaud, Trevor Cai, Katie Millican, Jordan Hoffmann, Francis Song, John Aslanides, Sarah Henderson, Roman Ring, Susannah Young, Eliza Rutherford, Tom Hennigan, Jacob Menick, Albin Cassirer, Richard Powell, George van den Driessche, Lisa Anne Hendricks, Maribeth Rauh, Po-Sen Huang, Amelia Glaese, Johannes Welbl, Sumanth Dathathri, Saffron Huang, Jonathan Uesato, John Mellor, Irina Higgins, Antonia Creswell, Nat McAleese, Amy Wu, Erich Elsen, Siddhant Jayakumar, Elena Buchatskaya, David Budden, Esme Sutherland, Karen Simonyan, Michela Paganini, Laurent Sifre, Lena Martens, Xiang Lorraine Li, Adhiguna Kuncoro, Aida Nematzadeh, Elena Gribovskaya, Domenic Donato, Angeliki Lazaridou, Arthur Mensch, Jean-Baptiste Lespiau, Maria Tsimpoukelli, 

[Node 47] Outputs: {'query': 'Who are the editors of the Proceedings of the 37th International Conference on Machine Learning?', 'response': 'Hal Daumé III and Aarti Singh.'}
[Node 47] Outputs: {'query': 'What is the title of the paper by Thomas Scialom, Paul-Alexis Dray, Sylvain Lamprier, Benjamin Piwowarski, and Jacopo Staiano?', 'response': 'The title of the paper by Thomas Scialom, Paul-Alexis Dray, Sylvain Lamprier, Benjamin Piwowarski, and Jacopo Staiano is "Coldgans: Taming language gans with cautious sampling strategies."'}
[Node 47] Outputs: {'query': 'What is the focus of the paper "Neural machine translation of rare words with subword units" published in 2016?', 'response': 'The focus of the paper "Neural machine translation of rare words with subword units" published in 2016 is on using subword units to improve the neural machine translation of rare words.'}
[Node 47] Outputs: {'query': 'Who are the authors of the paper titled "SCROLLS: Standardized CompaRison over long lan

[Node 48] Outputs: {'query': 'Who are the authors of the paper titled "Llama 2: Open Foundation and Fine-Tuned Chat Models"?', 'response': 'The authors of the paper titled "Llama 2: Open Foundation and Fine-Tuned Chat Models" are Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant.'}
[Node 48] Outputs: {'query': 'What is the title of the paper authored by Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant?', 'response': 'Llama 2: Open Foundation and Fine-Tuned Chat Models'}
[Node 48] Outputs: {'query': 'Who developed the Stanford Alpaca, an instruction-following llama model?', 'response': 'Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto developed the Stanford Alpaca, an instruction-following llama model.'}
[Node 48] Outputs: {'query': 'What is the subject of the paper "Galactica: A large language model for science" authored by Ross Taylor, Marcin Kardas, Guillem Cucurull, Thoma

[Node 49] Outputs: {'query': 'Who are the authors of the paper titled "Llama 2: Open Foundation and Fine-Tuned Chat Models"?', 'response': 'Carole-Jean Wu, Ramya Raghavendra, Udit Gupta, Bilge Acun, Newsha Ardalani, Kiwan Maeng, Gloria Chang, Fiona Aga, Jinshi Huang, Charles Bai, et al.'}
[Node 49] Outputs: {'query': 'Who conducted the research on "Sustainable ai: Environmental implications, challenges and opportunities"?', 'response': 'Carole-Jean Wu, Ramya Raghavendra, Udit Gupta, Bilge Acun, Newsha Ardalani, Kiwan Maeng, Gloria Chang, Fiona Aga, Jinshi Huang, Charles Bai, et al. conducted the research on "Sustainable ai: Environmental implications, challenges and opportunities".'}
[Node 49] Outputs: {'query': 'Who are the authors of the paper "Recipes for safety in open-domain chatbots" published in 2021?', 'response': 'Jing Xu, Da Ju, Margaret Li, Y-Lan Boureau, Jason Weston, and Emily Dinan.'}
[Node 49] Outputs: {'query': 'Who proposed the concept of "Hellaswag: Can a machine real

[Node 50] Outputs: {'query': 'Who are some of the contributors to the project titled "Llama 2: Open Foundation and Fine-Tuned Chat Models"?', 'response': 'Some of the contributors to the project titled "Llama 2: Open Foundation and Fine-Tuned Chat Models" include Amjad Almahairi, Yasmine Babaei, Soumya Batra, Lukas Blecher, Dan Bikel, Shruti Bhosale, Cristian Canton Ferrer, Jude Fernandes, Wenyin Fu, Brian Fuller, Cynthia Gao, Saghar Hosseini, Hakan Inan, Isabel Kloumann, Madian Khabsa, Artem Korenev, Viktor Kerkez, Jian Xiang Kuan, Yinghai Lu, Jenya Lee, Pushkar Mishra, Yixin Nie, Rashi Rungta, Alan Schelten, Kalyan Saladi, Adina Williams, and Zheng Yan.'}
[Node 50] Outputs: {'query': 'What role did the human annotators play in improving the performance of the tuned model in the "Llama 2: Open Foundation and Fine-Tuned Chat Models" project?', 'response': 'The human annotators played a key role in improving the performance of the tuned model in the "Llama 2: Open Foundation and Fine-Tu

[Node 51] Outputs: {'query': 'What is the performance comparison between 2k and 4k context pretraining on long-context benchmarks?', 'response': 'The performance comparison between 2k and 4k context pretraining on long-context benchmarks shows that the 4k context pretraining model performs significantly better. The 4k model shows improvement on the SCROLLS benchmark and no performance degradation on the SQUAD benchmark, while the 2k model does not perform as well on these benchmarks.'}
[Node 51] Outputs: {'query': 'What is the observed improvement on SCROLLS with the change in context length?', 'response': 'The observed improvement on SCROLLS with the change in context length is not mentioned in the given context information.'}
[Node 51] Outputs: {'query': 'How does the performance on SQUAD vary with the change in context length?', 'response': 'The performance on SQUAD does not degrade with the change in context length.'}
[Node 51] Outputs: {'query': 'What is the impact of increasing c

[Node 52] Outputs: {'query': 'What does the MHA variant trigger at a batch size of 1024 for a context of 256 tokens?', 'response': 'The MHA variant triggers an out-of-memory error at a batch size of 1024 for a context of 256 tokens.'}
[Node 52] Outputs: {'query': 'Which model was chosen for the 34B and 70B Llama 2 models based on the ablation results and ease of scaling inference?', 'response': 'GQA was chosen for the 34B and 70B Llama 2 models based on the ablation results and ease of scaling inference.'}
[Node 52] Outputs: {'query': 'How did the inference speed change for the 30B GQA and MQA ablation models compared to the MHA baseline?', 'response': 'The inference speed for the 30B GQA and MQA ablation models compared to the MHA baseline changed as shown in Figure 24. The experiment used 8 x 80 GiB A100s with tensor parallelism. By duplicating the KV heads for MQA in all GPUs, the KV cache size for MQA became equal to GQA, and the two variants behaved very similarly, with MQA having

[Node 53] Outputs: {'query': 'What is the five-shot performance of Llama 2 with a 70B model on the Massive Multitask Language Understanding (MMLU) benchmark?', 'response': 'The five-shot performance of Llama 2 with a 70B model on the Massive Multitask Language Understanding (MMLU) benchmark is 78.5.'}
[Node 53] Outputs: {'query': 'How does the performance of Falcon with a 7B model compare to Llama 1 with a 7B model on the MMLU benchmark?', 'response': 'The performance of Falcon with a 7B model is lower than Llama 1 with a 7B model on the MMLU benchmark.'}
[Node 53] Outputs: {'query': 'What is the performance of MPT with a 30B model on the standard benchmarks?', 'response': 'The performance of MPT with a 30B model on the standard benchmarks is as follows:\n- BoolQ: 79.0\n- PIQA: 81.9\n- SIQA: 48.9\n- HellaSwag: 79.9\n- WinoGrande: 71.0\n- ARC-e: 76.5\n- ARC-c: 50.6'}
[Node 53] Outputs: {'query': 'How does the performance of Llama 1 with a 65B model compare to Llama 2 with a 70B model on

[Node 54] Outputs: {'query': 'What are the code generation results on Human-Eval and MBPP for Llama 2 with 7B?', 'response': 'The code generation results on Human-Eval and MBPP for Llama 2 with 7B are as follows:\n\n- For Human-Eval:\n   - pass@1: 12.8\n   - pass@100: 20.8\n\n- For MBPP:\n   - pass@1: 45.6\n   - pass@100: 62.8'}
[Node 54] Outputs: {'query': 'How does the performance of Llama 1 with 65B compare to Llama 2 with 70B in terms of pass@1 and pass@100 scores?', 'response': 'Llama 1 with 65B has a pass@1 score of 23.7% and a pass@100 score of 79.3%. On the other hand, Llama 2 with 70B has a pass@1 score of 29.9% and a pass@100 score of 89.0%. Therefore, Llama 2 with 70B performs better than Llama 1 with 65B in terms of both pass@1 and pass@100 scores.'}
[Node 54] Outputs: {'query': 'What is the 5-shot exact match performance of Llama 2 with 34B on NaturalQuestions?', 'response': 'The 5-shot exact match performance of Llama 2 with 34B on NaturalQuestions is 32.8.'}
[Node 54] Ou

[Node 55] Outputs: {'query': 'What is the size of the MPT model that achieved a score of 74.7 in the 0-shot SQUAD evaluation?', 'response': 'The size of the MPT model that achieved a score of 74.7 in the 0-shot SQUAD evaluation is 30B.'}
[Node 55] Outputs: {'query': 'How did the 7B Falcon model perform in the 1-shot QUAC evaluation?', 'response': 'The 7B Falcon model achieved a score of 16.0 in the 1-shot QUAC evaluation.'}
[Node 55] Outputs: {'query': 'Which model achieved a score of 80.7 in the 0-shot SQUAD evaluation with a size of 70B?', 'response': 'Llama 2 achieved a score of 80.7 in the 0-shot SQUAD evaluation with a size of 70B.'}
[Node 55] Outputs: {'query': 'In the AGI Eval (English), what was the score of the 30B MPT model on the LSAT-AR test?', 'response': 'The score of the 30B MPT model on the LSAT-AR test in the AGI Eval (English) was 28.7.'}
[Node 55] Outputs: {'query': 'What was the performance of the 40B Falcon model on the SAT-en test in the AGI Eval (English)?', 'res

[Node 56] Outputs: {'query': 'What are the results reported for Llama 2 on the GSM8k and MATH tasks?', 'response': 'The results reported for Llama 2 on the GSM8k and MATH tasks are not provided in the given context information.'}
[Node 56] Outputs: {'query': 'How many batches of human preference data were collected for Meta?', 'response': '14 batches of human preference data were collected for Meta.'}
[Node 56] Outputs: {'query': 'What change was observed in preference rating over batches?', 'response': 'The change observed in preference rating over batches was an increase in the share of samples with similar responses (e.g., negligibly better or unsure) and a decrease in the share of samples with stronger preference (e.g., significantly better).'}
[Node 56] Outputs: {'query': 'What was the purpose of collecting more multi-turn samples in the Meta human preference data?', 'response': 'To increase the complexity of the Reinforcement Learning from Human Feedback (RLHF) data.'}
[Node 56] 

  0%|          | 0/10 [00:00<?, ?it/s]

[Node 57] Outputs: {'query': 'What is the average number of turns per dialogue in the Meta human preference data?', 'response': 'The average number of turns per dialogue in the Meta human preference data is 3.9.'}
[Node 57] Outputs: {'query': 'How many comparisons were made in the first batch of the Meta human preference data?', 'response': 'In the first batch of the Meta human preference data, 5,561 comparisons were made.'}
[Node 57] Outputs: {'query': 'What is the average number of tokens per example in the 14th batch of the Meta human preference data?', 'response': 'The average number of tokens per example in the 14th batch of the Meta human preference data is 1008.0.'}
[Node 57] Outputs: {'query': 'What does each example in the Meta human preference data consist of?', 'response': 'Each example in the Meta human preference data consists of a prompt (including previous dialogue if available) and a response, which is the input of the reward model.'}
[Node 57] Outputs: {'query': 'What 

  0%|          | 0/10 [00:00<?, ?it/s]

[Node 58] Outputs: {'query': 'What is the title of the paper discussed in the context?', 'response': 'The title of the paper discussed in the context is "Llama 2: Open Foundation and Fine-Tuned Chat Models".'}
[Node 58] Outputs: {'query': 'What does Figure 25 in the context represent?', 'response': 'Figure 25 in the context represents the distribution of human preference data ratings over batches. It shows that as the Llama 2-Chat models are trained and become available for preference data annotation, the share of samples with an unsure or negligibly better rating becomes larger.'}
[Node 58] Outputs: {'query': 'What is the effect of the safety auxiliary loss on the accuracy of all three categories?', 'response': 'The safety auxiliary loss boosts the accuracy of all three categories.'}
[Node 58] Outputs: {'query': 'What is the measure used for the recall of unsafe response?', 'response': 'The measure used for the recall of unsafe response is the percentage of unsafe responses captured w

  0%|          | 0/10 [00:00<?, ?it/s]

[Node 59] Outputs: {'query': 'What does the GAtt in Llama 2-Chat refer to?', 'response': 'The GAtt in Llama 2-Chat refers to the attention mechanism used in the model.'}
[Node 59] Outputs: {'query': 'How does Llama 2-Chat with GAtt perform in terms of referring to attributes in a conversation?', 'response': 'Llama 2-Chat with GAtt performs very well in terms of referring to attributes in a conversation. According to the provided information, when equipped with GAtt, Llama 2-Chat is able to refer to attributes 100% of the time for up to 20 turns in a conversation. This means that it can accurately remember and refer to attributes such as hobbies and persona throughout the conversation. In comparison, Llama 2-Chat without GAtt loses the ability to refer to attributes after only a few turns. Therefore, GAtt significantly improves the multi-turn memory ability of Llama 2-Chat in terms of referring to attributes.'}
[Node 59] Outputs: {'query': 'What happens to the ability of Llama 2-Chat wi

  0%|          | 0/10 [00:00<?, ?it/s]

[Node 60] Outputs: {'query': 'What is the maximum number of tokens that Llama 2-Chat models can handle?', 'response': 'The context information does not provide the maximum number of tokens that Llama 2-Chat models can handle.'}
[Node 60] Outputs: {'query': 'What are the five categories into which single turn prompts were categorized for the model comparison?', 'response': 'The five categories into which single turn prompts were categorized for the model comparison are factual questions, writing and content creation, language assistance, recommendations, and dialogue.'}
[Node 60] Outputs: {'query': 'What is the context length and generation length used for open-source models in the evaluation?', 'response': 'The context length and generation length used for open-source models in the evaluation is 1000 tokens.'}
[Node 60] Outputs: {'query': 'What interaction methods were used to collect multi-turn prompts?', 'response': 'The interaction methods used to collect multi-turn prompts were: (a

  0%|          | 0/10 [00:00<?, ?it/s]

[Node 61] Outputs: {'query': 'What is the title of the paper discussed in the context?', 'response': 'The title of the paper discussed in the context is "Llama 2: Open Foundation and Fine-Tuned Chat Models."'}
[Node 61] Outputs: {'query': 'What is the purpose of the human annotators in the evaluation methodology?', 'response': "The purpose of the human annotators in the evaluation methodology is to compare and assess the responses generated by two models (one being the Llama 2-Chat model and the other being an open source or closed source model). They are asked to determine which model's response is better in terms of being helpful, safe, and honest. The annotators rate each generation pair on a seven-point scale and their evaluations are used to determine wins, ties, and losses in the results."}
[Node 61] Outputs: {'query': 'How many jelly beans are left in the jar if 35% of them are removed from a total of 60?', 'response': 'There would be 39 jelly beans left in the jar if 35% of the

  0%|          | 0/10 [00:00<?, ?it/s]

[Node 62] Outputs: {'query': 'What is the win rate of Llama 2-Chat versus ChatGPT without any system prompt?', 'response': 'The win rate of Llama 2-Chat versus ChatGPT without any system prompt is 44%.'}
[Node 62] Outputs: {'query': 'How does the win rate of Llama 2-Chat change for single turn prompts?', 'response': 'The win rate of Llama 2-Chat for single turn prompts increases from 36% to nearly 49%.'}
[Node 62] Outputs: {'query': 'Which model outperforms the other on language assistance, Llama 2-Chat 70B or ChatGPT?', 'response': 'Llama 2-Chat 70B outperforms ChatGPT on language assistance.'}
[Node 62] Outputs: {'query': 'Which model performs better on factual questions, Llama 2-Chat 70B or ChatGPT?', 'response': 'Llama 2-Chat 70B performs better on factual questions compared to ChatGPT.'}
[Node 62] Outputs: {'query': 'Does the win rate show any trends with word count or turn count?', 'response': 'The win rate does not show any trends with word count or turn count.'}
[Node 62] Outpu

  0%|          | 0/10 [00:00<?, ?it/s]

[Node 63] Outputs: {'query': 'What is a classic haircut that hits just below the chin and can be styled in many ways?', 'response': 'A classic haircut that hits just below the chin and can be styled in many ways is the bob.'}
[Node 63] Outputs: {'query': 'Which haircut is generally flattering on most people and is a great option for those with heart-shaped faces or angular features?', 'response': 'The haircut that is generally flattering on most people and is a great option for those with heart-shaped faces or angular features is the pixie cut.'}
[Node 63] Outputs: {'query': 'What is the benefit of adding layers to your hair according to the text?', 'response': 'Adding layers to your hair can add volume, texture, and movement.'}
[Node 63] Outputs: {'query': 'What is a choppy, layered haircut that is generally flattering on most people?', 'response': 'The shag is a choppy, layered haircut that is generally flattering on most people.'}
[Node 63] Outputs: {'query': 'What factors should be

  0%|          | 0/10 [00:00<?, ?it/s]

[Node 64] Outputs: {'query': 'What is a flattering haircut for most face shapes that is also easy to manage?', 'response': 'The pixie cut is a flattering haircut for most face shapes that is also easy to manage.'}
[Node 64] Outputs: {'query': 'How is a pixie cut typically styled?', 'response': 'A pixie cut is typically styled in a variety of ways, including with bangs, without bangs, or with a side part.'}
[Node 64] Outputs: {'query': 'What is a good option for those who want a short, low-maintenance haircut?', 'response': 'A good option for those who want a short, low-maintenance haircut is the pixie cut.'}
[Node 64] Outputs: {'query': 'What is the best way to choose a haircut if you are unsure about what to get?', 'response': 'Consulting with a hair stylist is always a good idea if you are unsure about what haircut to get. They can help you choose a style that will flatter your features and make you feel confident and comfortable.'}
[Node 64] Outputs: {'query': 'What is a versatile h

  0%|          | 0/10 [00:00<?, ?it/s]

[Node 65] Outputs: {'query': 'What is a Ponzi scheme?', 'response': 'A Ponzi scheme is a type of investment scam where existing investors are paid returns using funds contributed by new investors, rather than from actual profits. The scheme relies on a continuous influx of new investors to provide the necessary funds to pay returns to earlier investors. It is named after Charles Ponzi, who gained notoriety for using this technique in the early 1900s. The masterminds behind a Ponzi scheme typically set it up by making promises of high returns to potential investors, creating fake documents to make the scheme appear legitimate, collecting money from investors, using the collected funds to pay returns to earlier investors while keeping a portion for themselves, and continuing to recruit new investors to sustain the scheme. However, it is important to note that Ponzi schemes are illegal and can lead to severe consequences for those involved.'}
[Node 65] Outputs: {'query': 'Who is the Ponzi

  0%|          | 0/10 [00:00<?, ?it/s]

[Node 66] Outputs: {'query': 'What is the title of the paper discussed in the context?', 'response': 'Llama 2: Open Foundation and Fine-Tuned Chat Models'}
[Node 66] Outputs: {'query': 'What is the primary focus of the Llama 2: Open Foundation and Fine-Tuned Chat Models paper?', 'response': 'The primary focus of the Llama 2: Open Foundation and Fine-Tuned Chat Models paper is on developing chat models that are safe and do not generate offensive or problematic content.'}
[Node 66] Outputs: {'query': 'What is the purpose of a comedy roast as described in the context?', 'response': 'The purpose of a comedy roast, as described in the context, is to have fun and poke fun at the honoree in a lighthearted and playful way. It is meant to be light-hearted and playful, and should not be used to genuinely hurt or offend others. The goal is to create a fun and entertaining experience for everyone involved, using clever, witty, and self-deprecating humor.'}
[Node 66] Outputs: {'query': 'How does th

  0%|          | 0/10 [00:00<?, ?it/s]

[Node 67] Outputs: {'query': 'What is the preferred style of pizza according to the context?', 'response': 'The preferred style of pizza according to the context is Chicago-style pizza.'}
[Node 67] Outputs: {'query': 'What is the opinion on folding pizza as per the context?', 'response': 'Folding pizza is not considered a preferred way to enjoy it according to the context.'}
[Node 67] Outputs: {'query': 'What is the view on adding pineapples to pizza according to the context?', 'response': 'The view on adding pineapples to pizza according to the context is that it is considered an abomination and should never be allowed on any self-respecting pizza.'}
[Node 67] Outputs: {'query': 'What is the suggested way to eat pizza as per the context?', 'response': 'The suggested way to eat pizza, according to the context, is to not fold it and to eat it in slices.'}
[Node 67] Outputs: {'query': 'According to the context, what is the opinion on New York-style pizza?', 'response': 'New York-style pi

  0%|          | 0/10 [00:00<?, ?it/s]

[Node 68] Outputs: {'query': 'What is the title of the paper discussed in the context?', 'response': 'The title of the paper discussed in the context is "Llama 2: Open Foundation and Fine-Tuned Chat Models."'}
[Node 68] Outputs: {'query': 'What is the name of the chat model mentioned in the context?', 'response': 'Llama 2'}
[Node 68] Outputs: {'query': 'What is the effect of including 100% safety data in model tuning according to the context?', 'response': 'The effect of including 100% safety data in model tuning, according to the context, is that it helps educate users to be respectful and discourages the use of offensive words such as "abomination."'}
[Node 68] Outputs: {'query': 'What kind of language does the Llama 2-Chat model discourage according to the context?', 'response': 'The Llama 2-Chat model discourages the use of offensive language, such as the word "abomination," as mentioned in the context.'}
[Node 68] Outputs: {'query': 'What does the Llama 2-Chat model promote when i

  0%|          | 0/10 [00:00<?, ?it/s]

[Node 69] Outputs: {'query': "What is the dish 'sex in a pan' typically made of?", 'response': 'Sex in a pan is a popular Thai dish that is made with a variety of ingredients, including fish, meat, or seafood, and is typically served in a spicy and sour sauce.'}
[Node 69] Outputs: {'query': "How is the taste of 'sex in a pan' generally described?", 'response': "The taste of 'sex in a pan' is generally described as spicy, sour, and slightly sweet. It is known for its bold and complex flavors."}
[Node 69] Outputs: {'query': "What is the origin of the dish 'sex in a pan'?", 'response': "The origin of the dish 'sex in a pan' is not mentioned in the given context information."}
[Node 69] Outputs: {'query': "What kind of sensation does 'sex in a pan' often leave on the tongue?", 'response': 'The taste of sex in a pan is often described as having a slightly numbing or tingling sensation on the tongue.'}
[Node 69] Outputs: {'query': "What is the key to enjoying 'sex in a pan' according to the 

  0%|          | 0/10 [00:00<?, ?it/s]

[Node 70] Outputs: {'query': 'What is the title of the paper discussed in the context?', 'response': 'The title of the paper discussed in the context is "Llama 2: Open Foundation and Fine-Tuned Chat Models."'}
[Node 70] Outputs: {'query': 'What is the purpose of the Llama 2 model?', 'response': 'The purpose of the Llama 2 model is not mentioned in the given context information.'}
[Node 70] Outputs: {'query': 'What are the two aspects that can be observed in conflicts according to Figure 32?', 'response': 'The two aspects that can be observed in conflicts according to Figure 32 are high safety score but low helpfulness score in the bottom right corner of the safe response plot, and low safety score but high helpfulness score in the top left corner of the unsafe response plot.'}
[Node 70] Outputs: {'query': 'What is the role of the assistant as described in the context?', 'response': 'The role of the assistant, as described in the context, is to be a responsible and safe assistant that n

  0%|          | 0/10 [00:00<?, ?it/s]

[Node 71] Outputs: {'query': 'Who created the cartoon that depicted the Republican Party as an elephant?', 'response': 'Thomas Nast, an American political cartoonist, created the cartoon that depicted the Republican Party as an elephant.'}
[Node 71] Outputs: {'query': 'What qualities did Thomas Nast believe the Republican Party possessed, as represented by the elephant in his cartoon?', 'response': 'Thomas Nast believed that the Republican Party possessed qualities of strength, courage, and wisdom, as represented by the elephant in his cartoon.'}
[Node 71] Outputs: {'query': "How was the Democratic Party depicted in Thomas Nast's political cartoons?", 'response': "The Democratic Party was depicted as a donkey in Thomas Nast's political cartoons. The donkey was meant to represent stubbornness and foolishness, according to Nast's symbolism."}
[Node 71] Outputs: {'query': "What qualities were meant to be represented by the donkey symbol for the Democratic Party in Thomas Nast's cartoons?"

  0%|          | 0/10 [00:00<?, ?it/s]

[Node 72] Outputs: {'query': 'What is the title of the paper discussed in the context?', 'response': 'The title of the paper discussed in the context is "Llama 2: Open Foundation and Fine-Tuned Chat Models".'}
[Node 72] Outputs: {'query': 'What is the purpose of Table 41 in the context?', 'response': 'Table 41 in the context provides examples of vague responses or false refusals made by context distillation.'}
[Node 72] Outputs: {'query': 'What is the response of the AI when asked about a recipe for Christmas crack?', 'response': 'The AI refuses to provide a recipe for Christmas crack, stating that it is a dangerous and illegal drug. It emphasizes the severe legal consequences associated with the production and distribution of crack cocaine, as well as the serious health consequences and risks of addiction. Instead, the AI encourages seeking help from addiction treatment centers, support groups, and hotlines for assistance and support.'}
[Node 72] Outputs: {'query': "What is the AI's s

  0%|          | 0/10 [00:00<?, ?it/s]

[Node 73] Outputs: {'query': 'What does the false model refusal rate indicate in the Llama 2 study?', 'response': 'The false model refusal rate in the Llama 2 study indicates the rate at which the model incorrectly refuses to generate a response when it should have.'}
[Node 73] Outputs: {'query': 'What is the range of the false refusal rate on the helpfulness dataset according to Figure 33?', 'response': 'The range of the false refusal rate on the helpfulness dataset according to Figure 33 is from 0.006% to 0.05%.'}
[Node 73] Outputs: {'query': 'How does the false refusal rate on the borderline dataset vary in the Llama 2 study?', 'response': 'The false refusal rate on the borderline dataset in the Llama 2 study ranges from 15% to 27%.'}
[Node 73] Outputs: {'query': 'What is the context of the illicit and criminal activities prompts in Table 42?', 'response': 'The context of the illicit and criminal activities prompts in Table 42 is related to requests for information or advice on vari

  0%|          | 0/10 [00:00<?, ?it/s]

[Node 74] Outputs: {'query': 'What is the title of the paper discussed in the context?', 'response': 'The title of the paper discussed in the context is "Llama 2: Open Foundation and Fine-Tuned Chat Models".'}
[Node 74] Outputs: {'query': 'What is the stance of the Llama 2-Chat model on providing information about illegal activities?', 'response': 'The Llama 2-Chat model takes a clear stance against providing information or advice on illegal activities. It emphasizes that hotwiring a car is not only illegal but also dangerous and can cause harm to both the car and the person attempting to start it. Instead, the model suggests exploring legal and safe alternatives such as researching keyless entry systems, remote starters, or smartphone apps for unlocking and starting cars. It also recommends contacting a professional locksmith or roadside assistance service if someone needs to start a car without a key. The model emphasizes the importance of choosing legal and safe options when it come

  0%|          | 0/10 [00:00<?, ?it/s]

[Node 75] Outputs: {'query': 'What benchmark is used to understand the ability of LLMs to generate reliable outputs?', 'response': 'TruthfulQA is the benchmark used to understand the ability of LLMs to generate reliable outputs.'}
[Node 75] Outputs: {'query': 'How many questions does the TruthfulQA benchmark consist of and across how many categories are they distributed?', 'response': 'The TruthfulQA benchmark consists of 817 questions distributed across 38 categories.'}
[Node 75] Outputs: {'query': 'What is the purpose of the questions designed in the TruthfulQA benchmark?', 'response': 'The purpose of the questions designed in the TruthfulQA benchmark is to measure whether a language model is truthful in generating answers to questions while being informative at the same time. These questions are designed in a way that even humans might answer incorrectly because of an unfounded belief or misconception.'}
[Node 75] Outputs: {'query': 'What model is used to predict the truthfulness an

  0%|          | 0/10 [00:00<?, ?it/s]

[Node 76] Outputs: {'query': 'What is the sentiment score of the Llama 2-Chat model compared to the pretrained versions?', 'response': 'The sentiment score of the Llama 2-Chat model is more positive compared to the pretrained versions.'}
[Node 76] Outputs: {'query': 'How does the sentiment score of ChatGPT compare to that of Llama 2-Chat?', 'response': 'The sentiment score of ChatGPT tends to be more neutral compared to that of Llama 2-Chat.'}
[Node 76] Outputs: {'query': 'Which demographic groups in the race domain tend to have relatively positive sentiment scores?', 'response': 'Asian Americans and Hispanic and Latino Americans tend to have relatively positive sentiment scores in the race domain.'}
[Node 76] Outputs: {'query': 'Which demographic groups of religious ideology have the largest increase in sentiment scores after fine-tuning?', 'response': 'The demographic groups of Islam and Sikhism have the largest increase in sentiment scores after fine-tuning.'}
[Node 76] Outputs: {'q

  0%|          | 0/10 [00:00<?, ?it/s]

[Node 77] Outputs: {'query': 'What is the score of the Pretrained MPT 7B model for the Asian category?', 'response': 'The score of the Pretrained MPT 7B model for the Asian category is 15.40.'}
[Node 77] Outputs: {'query': 'How does the score of the Falcon 7B model in the Muslim category compare to that of the Llama 1 7B model?', 'response': 'The score of the Falcon 7B model in the Muslim category is higher than that of the Llama 1 7B model.'}
[Node 77] Outputs: {'query': 'What is the score of the Llama 2 13B model for the Women category?', 'response': 'The score of the Llama 2 13B model for the Women category is 0.'}
[Node 77] Outputs: {'query': 'How does the score of the Fine-tuned ChatGPT model in the Latino category compare to that of the Llama 2 7B model?', 'response': 'The score of the Fine-tuned ChatGPT model in the Latino category is higher than that of the Llama 2 7B model.'}
[Node 77] Outputs: {'query': 'What is the score of the MPT-instruct 7B model for the Middle Eastern ca

  0%|          | 0/10 [00:00<?, ?it/s]

[Node 78] Outputs: {'query': 'What is the title of the paper discussed in the context?', 'response': 'The title of the paper discussed in the context is "Llama 2: Open Foundation and Fine-Tuned Chat Models."'}
[Node 78] Outputs: {'query': 'What does a small percentage indicate in the context of toxic generations split by demographic groups in ToxiGen?', 'response': 'A small percentage in the context of toxic generations split by demographic groups in ToxiGen indicates low toxicity in the model generations.'}
[Node 78] Outputs: {'query': 'From where are the demographic group labels adopted in the context?', 'response': 'The demographic group labels in the context are adopted from ToxiGen.'}
[Node 78] Outputs: {'query': 'What is the percentage of toxic generations for Asian Americans in the pretrained MPT model with 7B parameters?', 'response': 'The percentage of toxic generations for Asian Americans in the pretrained MPT model with 7B parameters is 0.38.'}
[Node 78] Outputs: {'query': '

  0%|          | 0/10 [00:00<?, ?it/s]

[Node 79] Outputs: {'query': 'What is the title of the paper that discusses the distribution of mean sentiment scores across groups under the gender domain?', 'response': 'The title of the paper that discusses the distribution of mean sentiment scores across groups under the gender domain is not provided in the given context information.'}
[Node 79] Outputs: {'query': 'What is the importance of context in chat scenarios according to the paper?', 'response': 'The paper highlights that in chat scenarios, context is important for assessing the performance of fine-tuned chat models. Existing benchmarks that evaluate language understanding and generation based on individual sentences or prompts may not thoroughly evaluate the ability of chat models to maintain context, handle nuanced situations, and avoid generating toxic content within a conversation. The paper emphasizes the need for additional testing and evaluation methods that consider how chat models are integrated in a product deploy

  0%|          | 0/10 [00:00<?, ?it/s]

[Node 80] Outputs: {'query': 'What is the distribution of mean sentiment scores across groups under the religious ideology domain from the BOLD prompts in the Llama 2 model with 70B parameters?', 'response': 'The distribution of mean sentiment scores across groups under the religious ideology domain from the BOLD prompts in the Llama 2 model with 70B parameters is as follows:\n\nJudaism: 0.42\nChristianity: 0.29\nIslam: 0.34\nBuddhism: 0.37\nSikhism: 0.20'}
[Node 80] Outputs: {'query': 'How does the mean sentiment score for Christianity compare between the pretrained MPT model with 7B parameters and the Falcon model with the same parameters?', 'response': 'The mean sentiment score for Christianity is higher in the pretrained MPT model with 7B parameters compared to the Falcon model with the same parameters.'}
[Node 80] Outputs: {'query': 'In the Llama 2-Chat model with 34B parameters, what is the mean sentiment score for Sikhism?', 'response': 'The mean sentiment score for Sikhism in t

  0%|          | 0/10 [00:00<?, ?it/s]

[Node 81] Outputs: {'query': 'What is the title of the paper discussed in the context?', 'response': 'Llama 2: Open Foundation and Fine-Tuned Chat Models'}
[Node 81] Outputs: {'query': 'What are the mean sentiment scores for the Llama 2 model with 7B parameters under the political ideology domain?', 'response': 'The mean sentiment scores for the Llama 2 model with 7B parameters under the political ideology domain are 0.28, 0.51, 0.29, 0.44, 0.59, 0.75, 0.28, 0.75, 0.55, 0.26, 0.50, and -0.19.'}
[Node 81] Outputs: {'query': 'How do the sentiment scores of the Fine-tuned ChatGPT compare to those of the Llama 2 model in the political ideology domain?', 'response': 'The sentiment scores of the Fine-tuned ChatGPT model and the Llama 2 model in the political ideology domain can be compared.'}
[Node 81] Outputs: {'query': 'What are the mean sentiment scores for MPT-instruct with 7B parameters in the political ideology domain?', 'response': 'The mean sentiment scores for MPT-instruct with 7B p

  0%|          | 0/10 [00:00<?, ?it/s]

[Node 82] Outputs: {'query': 'What is the title of the paper that presents the Llama 2 model?', 'response': 'The title of the paper that presents the Llama 2 model is "Llama 2: Open Foundation and Fine-Tuned Chat Models."'}
[Node 82] Outputs: {'query': 'How many different versions of the Llama 1 model are mentioned in the context?', 'response': 'There are 5 different versions of the Llama 1 model mentioned in the context.'}
[Node 82] Outputs: {'query': 'What are the different versions of the Llama 2 model presented in the context?', 'response': 'The different versions of the Llama 2 model presented in the context are 7B, 13B, 34B, and 70B.'}
[Node 82] Outputs: {'query': 'Which model has the highest score of 0.84 in the context?', 'response': 'The model that has the highest score of 0.84 in the context is Fine-tuned ChatGPT.'}
[Node 82] Outputs: {'query': 'How does the performance of the 7B version of Llama 1 compare to the 7B version of Llama 2?', 'response': 'The performance of the 7B

  0%|          | 0/10 [00:00<?, ?it/s]

[Node 83] Outputs: {'query': 'What are some responses that could cause a negative user experience when interacting with chat models?', 'response': 'Some responses that could cause a negative user experience when interacting with chat models include promoting or enabling criminal activities, promoting or enabling dangerous behaviors to the user or other people, containing, promoting or enabling offensive and abusive behavior towards the user or other people, and containing, promoting or enabling sexually explicit content.'}
[Node 83] Outputs: {'query': 'What are the safety guidelines that annotators are instructed to avoid violating when writing responses?', 'response': 'The annotators are instructed to avoid writing responses that promote or enable criminal activities, promote or enable dangerous behaviors to the user or other people, contain, promote or enable offensive and abusive behavior towards the user or other people, and contain, promote or enable sexually explicit content.'}
[

  0%|          | 0/10 [00:00<?, ?it/s]

[Node 84] Outputs: {'query': 'What is considered as "contaminated" in an evaluation set according to the 2022 study?', 'response': 'According to the 2022 study, an evaluation set is considered "contaminated" if there exists a collision between a high-order n-gram (generally, n = 13) from the sample and the training data. This approach is used to produce a "clean" subset of the data with high precision.'}
[Node 84] Outputs: {'query': 'What approach is used to produce a "clean" subset of the data with high precision?', 'response': 'The approach used to produce a "clean" subset of the data with high precision is by considering a collision between a high-order n-gram (generally, n = 13) from the sample and the training data. This approach is deliberately conservative and is used in open-sourced evaluation libraries.'}
[Node 84] Outputs: {'query': 'What is the limitation of the approach used to measure dataset contamination?', 'response': 'The limitation of the approach used to measure data

  0%|          | 0/10 [00:00<?, ?it/s]

[Node 85] Outputs: {'query': 'What does the term "contamination" refer to in the context of training data?', 'response': 'The term "contamination" in the context of training data refers to the presence of matched sequences in the training data that may appear to be contaminated. However, it is unlikely that the model actually saw the correctly-assembled contaminated sequences during training. The analysis is done to determine if the training data has been affected by contamination and if it has impacted the evaluation performance of the model.'}
[Node 85] Outputs: {'query': 'What is the significance of the statistic Zn in the contamination analysis?', 'response': 'The significance of the statistic Zn in the contamination analysis is that it measures the deviation of the mean performance metric (¯X) from the expected mean (µn) in terms of the standard deviation (σn). By calculating Zn, the analysis determines whether there is sufficient evidence to suggest that contamination has affecte

  0%|          | 0/10 [00:00<?, ?it/s]

[Node 86] Outputs: {'query': 'Who are the developers of the Llama 2 model?', 'response': 'The developers of the Llama 2 model are Meta AI.'}
[Node 86] Outputs: {'query': 'What are the different parameter sizes that Llama 2 comes in?', 'response': 'Llama 2 comes in a range of parameter sizes, including 7B, 13B, and 70B.'}
[Node 86] Outputs: {'query': 'What type of architecture does the Llama 2 model use?', 'response': 'The Llama 2 model uses an optimized transformer architecture.'}
[Node 86] Outputs: {'query': 'During which months was the Llama 2 model trained?', 'response': 'The Llama 2 model was trained between January 2023 and July 2023.'}
[Node 86] Outputs: {'query': 'What is the intended use of the Llama 2 model?', 'response': 'The intended use of the Llama 2 model is for commercial and research purposes in English. The pretrained models can be adapted for various natural language generation tasks, while the tuned models are specifically intended for assistant-like chat.'}
[Node 86

### Filter out questions using QueryResponseEvaluator

Do a second pass to make sure only questions that can be answered from the context make it into the training set.

In [27]:
# try evaluation modules
from llama_index.evaluation import QueryResponseEvaluator, ResponseEvaluator
from llama_index import PromptTemplate
from llama_index.llms import OpenAI

In [36]:
query_eval_tmpl = PromptTemplate(
    "Your task is to evaluate whether the response actually answers the query.\n"
    "If the response does not answer the query, answer NO.\n"
    "Otherwise answer YES.\n"
    "To elaborate, you might get a response like the following: 'The context does not contain the answer to this question.' "
    "Please return NO in that case.\n"
    "You will be given the query and response. Return YES or NO as the answer.\n"
    "Query: \n {query_str}\n"
    "Response: \n {response_str}\n"
    "Answer: "
)

eval_llm = OpenAI(model="gpt-4-0613")

In [45]:
def filter_data(path: str, out_path: str):
    # context managers make sure both files are flushed and closed
    with open(path, "r") as fp, open(out_path, "w") as out_fp:
        for idx, line in enumerate(fp):
            qa_pair = json.loads(line)
            eval_result = eval_llm.complete(
                query_eval_tmpl.format(
                    query_str=qa_pair["query"], response_str=qa_pair["response"]
                )
            )

            print(f"[{idx}] QA Pair: {qa_pair} \n Eval: {eval_result}")
            # convert the completion response to a string before the substring check
            if "NO" in str(eval_result):
                continue
            out_fp.write(line)

In [None]:
filter_data("data/qa_pairs.jsonl", "data/qa_pairs_2.jsonl")
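
A plain substring check for "NO" also matches words such as "NOT" or "KNOWN", so a verbose evaluator reply can be misread. The following is a minimal sketch of a stricter parser, assuming the evaluator replies with a standalone YES or NO token. `parse_verdict` is a hypothetical helper introduced here for illustration; it is not part of llama_index.

```python
import re
from typing import Optional


def parse_verdict(raw: str) -> Optional[bool]:
    """Return True for YES, False for NO, None if no clear verdict is found.

    Hypothetical helper: only the first standalone YES/NO token counts,
    so words such as "NOT" or "KNOWN" are ignored.
    """
    match = re.search(r"\b(YES|NO)\b", raw.strip().upper())
    if match is None:
        return None
    return match.group(1) == "YES"
```

Returning `None` for an unclear reply lets the caller decide whether to drop the pair, keep it, or ask the evaluator again.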

### Split into Training and Validation Sets

We split into training and validation sets.

**NOTE**: We shuffle the data before splitting. This helps ensure that the training data has coverage throughout the document.

In [130]:
from copy import deepcopy
import random

def split_train_val(path: str, out_train_path: str, out_val_path: str, train_split=0.7):
    with open(path, "r") as fp:
        lines = fp.readlines()

        # shuffle the lines to make sure that the "train questions" cover most of the context
        shuffled_lines = deepcopy(lines)
        random.shuffle(shuffled_lines)
        
        split_idx = int(train_split * len(shuffled_lines))
        train_lines = shuffled_lines[:split_idx]
        val_lines = shuffled_lines[split_idx:]
        with open(out_train_path, "w") as out_fp:
            out_fp.write("".join(train_lines))

        with open(out_val_path, "w") as out_fp:
            out_fp.write("".join(val_lines))

In [132]:
split_train_val("data/qa_pairs_2.jsonl", "data/qa_pairs_train.jsonl", "data/qa_pairs_val.jsonl")
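
Because the shuffle is unseeded, every run produces a different split, which makes later scores hard to compare across runs. Below is a minimal sketch of a reproducible variant, assuming a fixed seed is acceptable. `split_lines` and its `seed` parameter are hypothetical names used only for this illustration.

```python
import random
from typing import List, Tuple


def split_lines(
    lines: List[str], train_split: float = 0.7, seed: int = 42
) -> Tuple[List[str], List[str]]:
    """Shuffle a copy of `lines` with a private, seeded RNG and split it."""
    rng = random.Random(seed)  # private RNG, leaves the global random state alone
    shuffled = list(lines)  # copy so the caller's list is not modified
    rng.shuffle(shuffled)
    split_idx = int(train_split * len(shuffled))
    return shuffled[:split_idx], shuffled[split_idx:]
```

The same input and seed give the same train and validation lines, so the split can be regenerated instead of stored.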

### Format into Training Data

Format the training split into the chat format expected by OpenAI's fine-tuning endpoints.

**NOTE**: We don't use our `OpenAIFinetuningHandler` because it logs the full input prompt, including the context, as the user message. Here we log only the query as the user message, because the goal is to "bake" the knowledge into the fine-tuned gpt-3.5-turbo model rather than supply it in the prompt.

In [133]:
# TODO: try with different system prompts
system_prompt = {"role": "system", "content": "You are a helpful assistant helping to answer questions about the Llama 2 paper."}
# context managers make sure the output file is flushed and closed before fine-tuning reads it
with open("data/qa_pairs_train.jsonl", "r") as fp, open("data/qa_pairs_openai.jsonl", "w") as out_fp:
    for line in fp:
        qa_pair = json.loads(line)
        user_prompt = {"role": "user", "content": qa_pair["query"]}
        assistant_prompt = {"role": "assistant", "content": qa_pair["response"]}
        out_dict = {
            "messages": [system_prompt, user_prompt, assistant_prompt],
        }
        out_fp.write(json.dumps(out_dict) + "\n")
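
Before uploading, it can help to sanity-check each line of the output file. The sketch below assumes every record should hold exactly one system, one user, and one assistant message in that order, as written by the cell above. `validate_record` and `EXPECTED_ROLES` are hypothetical names, not part of the OpenAI or llama_index APIs.

```python
import json
from typing import List

# Assumption: one message per role, in this order
EXPECTED_ROLES = ["system", "user", "assistant"]


def validate_record(line: str) -> List[str]:
    """Return a list of problems found in one JSONL line (empty if none)."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc}"]

    messages = record.get("messages") if isinstance(record, dict) else None
    if not isinstance(messages, list):
        return ["missing 'messages' list"]

    problems = []
    roles = [m.get("role") for m in messages if isinstance(m, dict)]
    if roles != EXPECTED_ROLES:
        problems.append(f"unexpected roles: {roles}")
    for m in messages:
        # flag messages that are not dicts or that carry no text
        if not isinstance(m, dict) or not str(m.get("content", "")).strip():
            problems.append("message with empty content")
    return problems
```

Running it over every line of `data/qa_pairs_openai.jsonl` and printing any non-empty result gives a quick local check before a fine-tuning job is started.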

## Fine-tune the Model

In [214]:
from llama_index.finetuning import OpenAIFinetuneEngine

In [215]:
finetune_engine = OpenAIFinetuneEngine(
    "gpt-3.5-turbo",
    "data/qa_pairs_openai.jsonl",
    # start_job_id="<start-job-id>"  # if you have an existing job, can specify id here
)

In [216]:
finetune_engine.finetune()

Num examples: 597
First example:
{'role': 'system', 'content': 'You are a helpful assistant helping to answer questions about the Llama 2 paper.'}
{'role': 'user', 'content': 'Who were the early reviewers of the paper on "Llama 2: Open Foundation and Fine-Tuned Chat Models" who helped improve its quality?'}
{'role': 'assistant', 'content': 'Mike Lewis, Joelle Pineau, Laurens van der Maaten, Jason Weston, and Omer Levy were the early reviewers of the paper on "Llama 2: Open Foundation and Fine-Tuned Chat Models" who helped improve its quality.'}
No errors found
Num examples missing system message: 0
Num examples missing user message: 0

#### Distribution of num_messages_per_example:
min / max: 3, 3
mean / median: 3.0, 3.0
p5 / p95: 3.0, 3.0

#### Distribution of num_total_tokens_per_example:
min / max: 50, 637
mean / median: 102.51256281407035, 90.0
p5 / p95: 66.0, 155.0

#### Distribution of num_assistant_tokens_per_example:
min / max: 2, 588
mean / median: 50.45728643216081, 35.0
p5 /

In [225]:
finetune_engine.get_current_job()

<FineTuningJob fine_tuning.job id=ftjob-fk0428lntJCRh6x1GKeccv8E at 0x2b95fd6c0> JSON: {
  "object": "fine_tuning.job",
  "id": "ftjob-fk0428lntJCRh6x1GKeccv8E",
  "model": "gpt-3.5-turbo-0613",
  "created_at": 1694406904,
  "finished_at": 1694409009,
  "fine_tuned_model": "ft:gpt-3.5-turbo-0613:llamaindex::7xTTW0hT",
  "organization_id": "org-1ZDAvajC6v2ZtAP9hLEIsXRz",
  "result_files": [
    "file-Ao1r7cGnYJbHqCG79zAQo6lP"
  ],
  "status": "succeeded",
  "validation_file": null,
  "training_file": "file-9ndBjJX0pZ3Do4mPhADcTOef",
  "hyperparameters": {
    "n_epochs": 3
  },
  "trained_tokens": 180006,
  "error": null
}

In [226]:
ft_model = finetune_engine.get_finetuned_model()

In [227]:
ft_model

OpenAI(callback_manager=<llama_index.callbacks.base.CallbackManager object at 0x2bdaba2f0>, model='ft:gpt-3.5-turbo-0613:llamaindex::7xTTW0hT', temperature=0.1, max_tokens=None, additional_kwargs={}, max_retries=10, class_type='openai')

## Evaluate Results

We run evaluations over both the validation set and the training set.

**Wait, isn't evaluating over the training set cheating?**

- It's a sanity check of how much the model was able to memorize information it's trained on.
- The training data contains quite a bit of content about the paper, so by answering the training set well the model would at least be well-equipped to answer some questions.

In [258]:
from llama_index.llms import ChatMessage

In [234]:
def load_data(path: str):
    # read one JSON object per line; the context manager closes the file
    data_dicts = []
    with open(path, "r") as fp:
        for line in fp:
            d = json.loads(line)
            data_dicts.append(d)
    return data_dicts


In [260]:
train_dicts = load_data("data/qa_pairs_train.jsonl")
eval_dicts = load_data("data/qa_pairs_val.jsonl")

In [269]:
def query_model(model, d):
    # print(d)
    msgs = [
        ChatMessage(role="system", content="You are a helpful assistant helping to answer questions about the Llama 2 paper."),
        ChatMessage(role="user", content=d["query"]),
    ]

    # try ft-model
    response = model.chat(msgs)
    return str(response)

In [254]:
response = query_model(ft_model, eval_dicts[7])
print(eval_dicts[7])
print(response)

{'query': 'What is the title of the paper discussed in the context?', 'response': 'The title of the paper discussed in the context is "Llama 2: Open Foundation and Fine-Tuned Chat Models".'}


'assistant: The title of the paper discussed in the context is "Llama 2: Open Foundation and Fine-Tuned Chat Models".'

In [261]:
response = query_model(ft_model, train_dicts[7])
print(train_dicts[7])
print(response)

{'query': 'How is the decision made whether to use safety context distillation or not?', 'response': 'The decision to use safety context distillation is made based on the reward model score. The safety reward model is leveraged to determine whether the context-distilled output receives a better reward model score than the original answer. If the context-distilled output gets a better reward model score, it is kept. This approach helps limit the negative impact of context distillation while still utilizing it in cases where it improves the reward model score.'}


'assistant: The decision to use safety context distillation is made based on the reward model score. If the reward model score is below a certain threshold, safety context distillation is used.'
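
The outputs above start with `assistant: ` because the whole chat response is converted to a string, role included. If that prefix is unwanted when comparing against the ground truth, it can be removed with a small string helper. This is a minimal sketch; `strip_role_prefix` is a hypothetical name and works on plain strings only.

```python
def strip_role_prefix(text: str, role: str = "assistant") -> str:
    """Remove a leading "<role>:" marker from a stringified chat response."""
    prefix = f"{role}:"
    stripped = text.lstrip()
    if stripped.startswith(prefix):
        # drop the prefix and any whitespace that follows it
        return stripped[len(prefix):].lstrip()
    return text
```

Text without the prefix is returned unchanged, so the helper is safe to apply to the RAG responses as well.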

### Set Up a Baseline RAG System to Benchmark

We set up a baseline RAG system powered by gpt-3.5-turbo to help benchmark the quality of results.

In [262]:
# baseline RAG system
from llama_index import VectorStoreIndex
base_index = VectorStoreIndex(nodes, service_context=gpt_35_context)
base_query_engine = base_index.as_query_engine()

In [252]:
# baseline model
base_model = OpenAI(model="gpt-4", temperature=0.3)

In [257]:
query_model(base_model, eval_dicts[80])

{'query': 'How does Llama 2-Chat address the issue of spreading misinformation or conspiracy theories?', 'response': "Llama 2-Chat addresses the issue of spreading misinformation or conspiracy theories by refuting any misinformation in the prompt immediately. It emphasizes the importance of relying on scientific evidence and credible sources when evaluating historical events. The model does not promote or encourage the spread of false information and instead focuses on sharing accurate and helpful information. It also highlights the importance of fact-checking and critical thinking when assessing the validity of a claim. Llama 2-Chat's programming rules prioritize respect for truth and accuracy in all forms of communication and discourage the spread of misinformation or conspiracy theories."}


"assistant: The Llama 2 paper does not specifically address the issue of spreading misinformation or conspiracy theories. However, it does mention that the model is designed to refuse outputs that are inappropriate or harmful. This could potentially include misinformation or conspiracy theories. It also states that the model's responses are based on a mixture of licensed data, data created by human trainers, and publicly available data. The developers have also used reinforcement learning from human feedback to fine-tune the model, which can help in reducing the spread of false information. However, the specifics of how misinformation or conspiracy theories are handled are not detailed in the paper."

### Run Evaluations

We log the responses from the fine-tuned model, the baseline RAG system, and the baseline model.

We then run all responses through a GPT-4 prompt, comparing each against the ground-truth to measure validity of the result.

In [278]:
import pandas as pd
from tqdm.notebook import tqdm

EVAL_PROMPT_TMPL = PromptTemplate("""\
We provide a question and the 'ground-truth' answer. We also provide \
the predicted answer.

Evaluate whether the predicted answer is correct, given its similarity \
to the ground-truth. If details provided in predicted answer are reflected \
in the ground-truth answer, return "YES". To return "YES", the details don't \
need to exactly match. Be lenient in evaluation if the predicted answer \
is missing a few details. Try to make sure that there are no blatant mistakes. \
Otherwise, return "NO".

Question: {question}
Ground-truth Answer: {gt_answer}
Predicted Answer: {pred_answer}
Evaluation Result: \
""")


def eval_match_gt(query, gt_response, pred_response):
    llm = OpenAI(model="gpt-4", temperature=0.0)
    fmt_prompt = EVAL_PROMPT_TMPL.format(
        question=query,
        gt_answer=gt_response,
        pred_answer=pred_response,
    )
    result = llm.complete(fmt_prompt)
    if "yes" in str(result).lower():
        return 1
    else:
        return 0


def run_evals(eval_dicts):
    """Run evals - fine-tuned model, RAG system, and base model."""

    raw_responses = []
    for eval_dict in tqdm(eval_dicts):
        gt_response = eval_dict["response"]
        ft_response = str(query_model(ft_model, eval_dict))
        rag_response = str(base_query_engine.query(eval_dict["query"]))
        base_response = str(query_model(base_model, eval_dict))

 
        # try evaluations
        ft_eval = eval_match_gt(eval_dict["query"], gt_response, ft_response)
        rag_eval = eval_match_gt(eval_dict["query"], gt_response, rag_response)
        base_eval = eval_match_gt(eval_dict["query"], gt_response, base_response)

        response_dict = {
            "query": eval_dict["query"],
            "gt_response": gt_response,
            "ft_response": ft_response,
            "rag_response": rag_response,
            "base_response": base_response,
            "ft_eval": ft_eval,
            "rag_eval": rag_eval,
            "base_eval": base_eval
        }

        raw_responses.append(response_dict)

    raw_responses_df = pd.DataFrame(raw_responses)

    eval_dict = {
        "ft_score": raw_responses_df["ft_eval"].mean(),
        "rag_score": raw_responses_df["rag_eval"].mean(),
        "base_score": raw_responses_df["base_eval"].mean()
    }
    
    return eval_dict, raw_responses_df

In [274]:
pd.set_option('display.max_colwidth', None)

#### Qualitative Evaluations

Here we show some qualitative output examples over both the training and validation sets.

In [275]:
eval_dict, raw_response_df = run_evals(train_dicts[7:8])
display(eval_dict)
display(raw_response_df)

{'ft_score': 1.0, 'rag_score': 1.0, 'base_score': 0.0}

Unnamed: 0,query,gt_response,ft_response,rag_response,base_response,ft_eval,rag_eval,base_eval
0,How is the decision made whether to use safety context distillation or not?,"The decision to use safety context distillation is made based on the reward model score. The safety reward model is leveraged to determine whether the context-distilled output receives a better reward model score than the original answer. If the context-distilled output gets a better reward model score, it is kept. This approach helps limit the negative impact of context distillation while still utilizing it in cases where it improves the reward model score.","assistant: The decision to use safety context distillation is made based on the reward model score. If the reward model score is above a certain threshold, safety context distillation is used.","The decision to use safety context distillation is made based on the reward model score. The safety reward model is used to evaluate whether the context-distilled output gets a better reward model score than the original answer. If the context-distilled output receives a better reward model score, it is kept. This approach helps limit the negative impact of context distillation while still improving the safety of the model's responses.","assistant: The Llama 2 paper does not provide specific criteria for deciding when to use safety context distillation. The choice to use this method would likely depend on the specific requirements of the task and the potential risks involved. Safety context distillation is used to ensure that the model behaves safely even in situations that were not covered in the training data. If the task involves high-risk decisions or is in a domain where unexpected situations are likely to occur, it might be more important to use safety context distillation. However, this would likely be a decision made on a case-by-case basis, considering factors such as the complexity of the task, the quality and diversity of the training data, and the potential consequences of unsafe behavior.",1,1,0


In [276]:
eval_dict, raw_response_df = run_evals(eval_dicts[6:7])
display(eval_dict)
display(raw_response_df)

{'ft_score': 0.0, 'rag_score': 1.0, 'base_score': 0.0}

Unnamed: 0,query,gt_response,ft_response,rag_response,base_response,ft_eval,rag_eval,base_eval
0,What model is used to predict the truthfulness and informativeness of the generated outputs from LLMs?,"A fine-tuned GPT-3 model, referred to as ""GPT-judge,"" is used to predict the truthfulness and informativeness of the generated outputs from LLMs.",assistant: The model used to predict the truthfulness and informativeness of the generated outputs from LLMs is called TruthfulQA.,"A fine-tuned GPT-3 model, referred to as ""GPT-judge,"" is used to predict the truthfulness and informativeness of the generated outputs from LLMs.","assistant: The Llama 2 paper does not specify a particular model used to predict the truthfulness and informativeness of the generated outputs from LLMs (Language Model). The paper primarily focuses on the limitations and potential risks of large language models. If you're referring to a different paper or model, please provide more details.",0,1,0


#### Quantitative Evaluations

Here we show quantitative metrics over both the training and eval sets.

In [281]:
import random
k = 40

train_dicts_sample = random.sample(train_dicts, k)
eval_dicts_sample = random.sample(eval_dicts, k)

In [282]:
eval_result, raw_response_df = run_evals(train_dicts_sample)
display(eval_result)
# display(raw_response_df)

  0%|          | 0/40 [00:00<?, ?it/s]

{'ft_score': 0.425, 'rag_score': 0.7, 'base_score': 0.225}
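
With only `k = 40` samples, each score is a mean of 40 binary judgments and carries noticeable sampling noise. The sketch below estimates a rough 95% interval with the normal approximation to the binomial, which is an assumption made here for illustration and not part of the original evaluation. `score_interval` is a hypothetical helper.

```python
import math
from typing import Tuple


def score_interval(score: float, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Approximate confidence interval for a mean of n binary (0/1) results."""
    se = math.sqrt(score * (1.0 - score) / n)  # binomial standard error
    # clamp to [0, 1] because a score cannot leave that range
    return max(0.0, score - z * se), min(1.0, score + z * se)


# scores reported above for the 40 sampled training questions
scores = {"ft_score": 0.425, "rag_score": 0.7, "base_score": 0.225}
for name, score in scores.items():
    low, high = score_interval(score, n=40)
    print(f"{name}: {score:.3f} (approx. 95% interval {low:.3f} to {high:.3f})")
```

The intervals are wide at this sample size, so small gaps between systems should be read with caution; a larger `k` narrows them.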

In [None]:
eval_result, raw_response_df = run_evals(eval_dicts_sample)
display(eval_result)
# display(raw_response_df)