# "Optimization by Prompting" for RAG
Inspired by the Optimization by Prompting paper by Yang et al., in this guide we test the ability of a "meta-prompt" to optimize our prompt for better RAG performance. The process is roughly as follows:

* The prompt to be optimized is our standard QA prompt template for RAG, specifically its instruction prefix.
* A "meta-prompt" takes in previous prefixes and their scores, plus examples of the task, and outputs a new prefix.
* For every candidate prefix, we compute a "score" through correctness evaluation: we compare the answers predicted with the QA prompt against a reference dataset of ground-truth answers. If you don't already have such a dataset, you can generate one with an LLM (this guide uses gpt-4o).
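
The loop described above can be sketched in plain Python. This is a minimal sketch, not the implementation used later in this guide: `optimize_instruction`, `toy_propose`, and `toy_score` are hypothetical names, and the two toy functions stand in for the meta-prompt call and the correctness evaluation so that the snippet runs without any LLM.

```python
from typing import Callable, List, Tuple


def optimize_instruction(
    initial_instr: str,
    propose: Callable[[List[Tuple[str, float]]], str],
    score: Callable[[str], float],
    num_iterations: int = 3,
) -> Tuple[str, List[Tuple[str, float]]]:
    """Score the initial instruction, then repeatedly propose and score new ones."""
    history: List[Tuple[str, float]] = []
    cur_instr = initial_instr
    for idx in range(num_iterations):
        if idx > 0:
            # Ask the (hypothetical) proposer for a new prefix given the history.
            cur_instr = propose(history)
        history.append((cur_instr, score(cur_instr)))
    # Return the best-scoring instruction together with the full history.
    best_instr, _ = max(history, key=lambda item: item[1])
    return best_instr, history


# Toy stand-ins (assumptions for illustration only, no LLM involved).
def toy_propose(history):
    return history[-1][0] + " Be concise."


def toy_score(instr):
    return float(len(instr.split()))


best, history = optimize_instruction("Answer the query.", toy_propose, toy_score)
print(best)
print(len(history))
```

Note that the first iteration only scores the initial instruction; the proposer is consulted from the second iteration onward.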

In [54]:
#%pip install -q llama-index-llms-openai
#%pip install  -q llama-index-readers-file pymupdf

In [2]:
import nest_asyncio
nest_asyncio.apply()

import warnings
warnings.filterwarnings('ignore')

In [3]:
!mkdir -p data && wget --user-agent "Mozilla" "https://arxiv.org/pdf/2307.09288.pdf" -O "data/llama2.pdf"


In [4]:
import os

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core.node_parser import SimpleNodeParser
from llama_index.llms.openai import OpenAI

In [5]:
from pathlib import Path
from llama_index.readers.file import PDFReader
from llama_index.readers.file import UnstructuredReader
from llama_index.readers.file import PyMuPDFReader

In [6]:
loader = PDFReader()
docs0 = loader.load_data(file=Path("./data/llama2.pdf"))

In [7]:
docs0[0].text[:100]

'Llama 2 : Open Foundation and Fine-Tuned Chat Models\nHugo Touvron∗Louis Martin†Kevin Stone†\nPeter Al'

In [8]:
from llama_index.core import Document

doc_text = "\n\n".join([d.get_content() for d in docs0])
docs = [Document(text=doc_text)]

In [9]:
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.schema import IndexNode

In [10]:
node_parser = SentenceSplitter(chunk_size=1024)

In [11]:
base_nodes = node_parser.get_nodes_from_documents(docs)

In [12]:
len(base_nodes)

93

## Setup Vector Index over this Data
We load this data into an in-memory vector store (embedded with OpenAI embeddings).
We'll be aggressively optimizing the QA prompt for this RAG pipeline.

In [13]:
# the OpenAI client reads OPENAI_API_KEY; here it is copied from a custom API_KEY variable
os.environ["OPENAI_API_KEY"] = os.environ["API_KEY"]

In [14]:
from llama_index.core import VectorStoreIndex
from llama_index.llms.openai import OpenAI
from llama_index.core import Settings

Settings.llm = OpenAI(model="gpt-4o")

In [15]:

index = VectorStoreIndex(base_nodes)
query_engine = index.as_query_engine(similarity_top_k=2)
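
To give a rough picture of what `similarity_top_k=2` means, here is a toy top-k retrieval over hand-written vectors. It is only a sketch and not LlamaIndex's actual implementation: cosine similarity is assumed as the metric, and `cosine`, `top_k_by_cosine`, and the two-dimensional "embeddings" are made up for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k_by_cosine(query_vec, chunk_vecs, k=2):
    """Return the indices of the k chunks most similar to the query."""
    ranked = sorted(
        range(len(chunk_vecs)),
        key=lambda i: cosine(query_vec, chunk_vecs[i]),
        reverse=True,
    )
    return ranked[:k]


# Toy 2-d "embeddings" (illustrative values, not real OpenAI embeddings).
toy_chunks = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
print(top_k_by_cosine([1.0, 0.1], toy_chunks, k=2))  # [0, 2]
```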


## "Golden" Dataset

Here we generate a dataset of ground-truth QA pairs (or load it).

This will be used for two purposes:

* To generate some exemplars that we can put into the meta-prompt to illustrate the task
* To generate an evaluation dataset to compute our objective score - so that the meta-prompt can try optimizing for this score.
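
If you would rather keep the exemplars out of the evaluation set, the two uses can draw from disjoint subsets. The helper below is a hypothetical sketch (`split_exemplars_and_eval` is not part of LlamaIndex), and the toy pairs are placeholders rather than generated questions.

```python
import random


def split_exemplars_and_eval(qr_pairs, num_exemplars, seed=0):
    """Shuffle the QA pairs and split them into disjoint exemplar and eval sets."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(qr_pairs)
    rng.shuffle(shuffled)
    return shuffled[:num_exemplars], shuffled[num_exemplars:]


# Placeholder (question, answer) pairs for illustration.
toy_pairs = [(f"question {i}", f"answer {i}") for i in range(10)]
exemplars, eval_pairs = split_exemplars_and_eval(toy_pairs, num_exemplars=2)
print(len(exemplars), len(eval_pairs))
```

A disjoint split prevents the meta-prompt from seeing answers that are later used for scoring, at the cost of a slightly smaller evaluation set.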

In [16]:
from llama_index.core.evaluation import DatasetGenerator, QueryResponseDataset
from llama_index.core.node_parser import SimpleNodeParser

In [17]:
dataset_generator = DatasetGenerator(
    base_nodes[:10],
    llm=OpenAI(model="gpt-4o"),
    show_progress=True,
    num_questions_per_chunk=5,
)

In [18]:

eval_dataset = await dataset_generator.agenerate_dataset_from_nodes(num=20)


In [23]:
eval_dataset.save_json("data/llama2_eval_qr_dataset.json")

## Dataset Samples

In [24]:

import random

full_qr_pairs = eval_dataset.qr_pairs

In [26]:

num_exemplars = 2
num_eval = 20
exemplar_qr_pairs = random.sample(full_qr_pairs, num_exemplars)

eval_qr_pairs = random.sample(full_qr_pairs, num_eval)

In [27]:
len(exemplar_qr_pairs)

2

## Prompt Optimization

We now define the functions needed for prompt optimization. We first define an evaluator, and then we set up the meta-prompt, which produces candidate instruction prefixes.

Finally we define and run the prompt optimization loop.

### Evaluator

In [28]:
from llama_index.core.evaluation.eval_utils import get_responses

In [29]:
from llama_index.core.evaluation import CorrectnessEvaluator, BatchEvalRunner

evaluator_c = CorrectnessEvaluator(llm=OpenAI(model="gpt-4o"))
evaluator_dict = {
    "correctness": evaluator_c,
}
batch_runner = BatchEvalRunner(evaluator_dict, workers=2, show_progress=True)

### Correctness Function

In [31]:
import numpy as np


async def get_correctness(query_engine, eval_qa_pairs, batch_runner):
    """Return the mean correctness score of the query engine over the QA pairs."""
    eval_qs = [q for q, _ in eval_qa_pairs]
    eval_answers = [a for _, a in eval_qa_pairs]
    # generate predicted answers with the current QA prompt
    pred_responses = get_responses(eval_qs, query_engine, show_progress=True)

    # grade each predicted answer against its reference answer
    eval_results = await batch_runner.aevaluate_responses(
        eval_qs, responses=pred_responses, reference=eval_answers
    )
    avg_correctness = np.array(
        [r.score for r in eval_results["correctness"]]
    ).mean()
    return avg_correctness
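
The averaging step above assumes every result carries a numeric score. As a defensive variant, the sketch below skips missing scores before averaging. `mean_score` is a hypothetical helper, and the assumption that a score can occasionally be `None` (for example when the grader's output cannot be parsed) should be checked against your evaluator.

```python
from typing import Iterable, Optional


def mean_score(scores: Iterable[Optional[float]]) -> float:
    """Average the scores, ignoring entries that are None."""
    valid = [s for s in scores if s is not None]
    if not valid:
        raise ValueError("no valid scores to average")
    return sum(valid) / len(valid)


# Illustrative values only, not results from this guide.
print(mean_score([4.0, None, 5.0, 3.0]))  # 4.0
```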

In [32]:
QA_PROMPT_KEY = "response_synthesizer:text_qa_template"

In [33]:

from llama_index.llms.openai import OpenAI
from llama_index.core import PromptTemplate

llm = OpenAI(model="gpt-4o")

In [34]:
qa_tmpl_str = (
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Query: {query_str}\n"
    "Answer: "
)
qa_tmpl = PromptTemplate(qa_tmpl_str)

In [35]:
print(query_engine.get_prompts()[QA_PROMPT_KEY].get_template())

Context information is below.
---------------------
{context_str}
---------------------
Given the context information and not prior knowledge, answer the query.
Query: {query_str}
Answer: 


### Define Metaprompt

In [36]:
meta_tmpl_str = """\
Your task is to generate the instruction <INS>. Below are some previous instructions with their scores.
The score ranges from 1 to 5.

{prev_instruction_score_pairs}

Below we show the task. The <INS> tag is prepended to the below prompt template, e.g. as follows:

```
<INS>
{prompt_tmpl_str}
```

The prompt template contains template variables. Given an input set of template variables, the formatted prompt is then given to an LLM to get an output.

Some examples of template variable inputs and expected outputs are given below to illustrate the task. **NOTE**: These do NOT represent the \
entire evaluation dataset.

{qa_pairs_str}

We run every input in an evaluation dataset through an LLM and grade the LLM-generated output against the expected output on a scale from 1 (worst) to 5 (best).
The final "score" for an instruction is the average of these grades across the evaluation dataset.
Write your new instruction (<INS>) that is different from the old ones and has a score as high as possible.

Instruction (<INS>): \
"""

meta_tmpl = PromptTemplate(meta_tmpl_str)

### Prompt Optimization functions

In [37]:
from copy import deepcopy


def format_meta_tmpl(
    prev_instr_score_pairs,
    prompt_tmpl_str,
    qa_pairs,
    meta_tmpl,
):
    """Format the meta-prompt used to request a new instruction."""
    # format previous (instruction, score) pairs
    pair_str_list = [
        f"Instruction (<INS>):\n{instr}\nScore:\n{score}"
        for instr, score in prev_instr_score_pairs
    ]
    full_instr_pair_str = "\n\n".join(pair_str_list)

    # now show QA pairs with ground-truth answers
    qa_str_list = [
        f"query_str:\n{query_str}\nAnswer:\n{answer}"
        for query_str, answer in qa_pairs
    ]
    full_qa_pair_str = "\n\n".join(qa_str_list)

    fmt_meta_tmpl = meta_tmpl.format(
        prev_instruction_score_pairs=full_instr_pair_str,
        prompt_tmpl_str=prompt_tmpl_str,
        qa_pairs_str=full_qa_pair_str,
    )
    return fmt_meta_tmpl
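
`format_meta_tmpl` passes the history to the meta-prompt in the order it was collected. The Optimization by Prompting paper, by contrast, lists previous instructions by ascending score and keeps only the best ones. The sketch below shows one way to do that before formatting; `select_history` and its `max_pairs` default are assumptions for illustration, not part of this guide's pipeline.

```python
def select_history(pairs, max_pairs=20):
    """Keep the top-scoring (instruction, score) pairs, ordered from worst to best."""
    ranked = sorted(pairs, key=lambda item: item[1])  # ascending by score
    return ranked[-max_pairs:]


# Illustrative pairs only, not results from this guide.
toy_history = [("A", 3.5), ("B", 4.2), ("C", 3.9)]
print(select_history(toy_history, max_pairs=2))  # [('C', 3.9), ('B', 4.2)]
```

Placing the best instruction last keeps it closest to the point where the model writes its new instruction.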

In [38]:
def get_full_prompt_template(cur_instr: str, prompt_tmpl):
    tmpl_str = prompt_tmpl.get_template()
    new_tmpl_str = cur_instr + "\n" + tmpl_str
    new_tmpl = PromptTemplate(new_tmpl_str)
    return new_tmpl
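
To see what the combined template looks like once its variables are filled in, the same string concatenation can be reproduced with plain Python strings. This is only a sketch: `prepend_instruction` is a hypothetical stand-in for `get_full_prompt_template`, and `str.format` stands in for `PromptTemplate` formatting.

```python
def prepend_instruction(instr: str, tmpl_str: str) -> str:
    """Put the instruction on its own line in front of the template string."""
    return instr + "\n" + tmpl_str


toy_tmpl = (
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Query: {query_str}\n"
    "Answer: "
)
full_tmpl = prepend_instruction("You are a QA assistant.", toy_tmpl)

# Fill in the template variables with illustrative values.
print(
    full_tmpl.format(
        context_str="Llama 2 is a family of LLMs.",
        query_str="What is Llama 2?",
    )
)
```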

In [39]:
import numpy as np


def _parse_meta_response(meta_response: str):
    return str(meta_response).split("\n")[0]


async def optimize_prompts(
    query_engine,
    initial_instr: str,
    base_prompt_tmpl,
    meta_tmpl,
    meta_llm,
    batch_eval_runner,
    eval_qa_pairs,
    exemplar_qa_pairs,
    num_iterations: int = 5,
):
    prev_instr_score_pairs = []
    base_prompt_tmpl_str = base_prompt_tmpl.get_template()

    cur_instr = initial_instr
    for idx in range(num_iterations):
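        # iteration 0 scores the initial instruction; later iterations
        # ask the meta-LLM for a new candidate first
        if idx > 0:
            # generate a new candidate instruction from the meta-prompt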
            fmt_meta_tmpl = format_meta_tmpl(
                prev_instr_score_pairs,
                base_prompt_tmpl_str,
                exemplar_qa_pairs,
                meta_tmpl,
            )
            meta_response = meta_llm.complete(fmt_meta_tmpl)
            print(fmt_meta_tmpl)
            print(str(meta_response))
            # Parse meta response
            cur_instr = _parse_meta_response(meta_response)

        # prepend instruction to template
        new_prompt_tmpl = get_full_prompt_template(cur_instr, base_prompt_tmpl)
        query_engine.update_prompts({QA_PROMPT_KEY: new_prompt_tmpl})

        avg_correctness = await get_correctness(
            query_engine, eval_qa_pairs, batch_eval_runner
        )
        prev_instr_score_pairs.append((cur_instr, avg_correctness))

    # find the instruction with the highest score
    max_instr_score_pair = max(
        prev_instr_score_pairs, key=lambda item: item[1]
    )

    # return the instruction
    return max_instr_score_pair[0], prev_instr_score_pairs
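
`_parse_meta_response` keeps only the first line of the meta-LLM output, which is empty if the response starts with a newline. A slightly more forgiving parser could look like the sketch below. `parse_instruction` is a hypothetical replacement, and the assumption that the model may echo `<INS>` tags is ours, not something observed in this guide.

```python
def parse_instruction(meta_response) -> str:
    """Return the first non-empty line of the response, without <INS> tags."""
    for line in str(meta_response).splitlines():
        cleaned = line.replace("<INS>", "").replace("</INS>", "").strip()
        if cleaned:
            return cleaned
    return ""  # nothing usable in the response


print(parse_instruction("\n<INS>Answer using only the context.</INS>\nExtra notes"))
```

An empty return value can then be used by the caller to skip the iteration rather than score a blank instruction.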

In [42]:
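# define the query engine; the line below would pre-seed it with the
# instruction-free qa_tmpl defined earlier, but it is left commented out
query_engine = index.as_query_engine(similarity_top_k=2)
# query_engine.update_prompts({QA_PROMPT_KEY: qa_tmpl})

# get the base QA prompt; with the line above commented out this is the
# default template, which already carries its own instruction text
base_qa_prompt = query_engine.get_prompts()[QA_PROMPT_KEY]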


initial_instr = """\
You are a QA assistant.
Context information is below. Given the context information and not prior knowledge, \
answer the query. \
"""

# this is the "initial" prompt template
# implicitly used in the first stage of the loop during prompt optimization
# here we explicitly capture it so we can use it for evaluation
old_qa_prompt = get_full_prompt_template(initial_instr, base_qa_prompt)

meta_llm = OpenAI(model="gpt-4o")

In [45]:
new_instr, prev_instr_score_pairs = await optimize_prompts(
    query_engine,
    initial_instr,
    base_qa_prompt,
    meta_tmpl,
    meta_llm,  # note: treat llm as meta_llm
    batch_runner,
    eval_qr_pairs,
    exemplar_qr_pairs,
    num_iterations=1,  # scores only the initial instruction; use >1 to call the meta-prompt
)


new_qa_prompt = query_engine.get_prompts()[QA_PROMPT_KEY]
print(new_qa_prompt)

metadata={'prompt_type': <PromptType.CUSTOM: 'custom'>} template_vars=['context_str', 'query_str'] kwargs={} output_parser=None template_var_mappings=None function_mappings=None template='You are a QA assistant.\nContext information is below. Given the context information and not prior knowledge, answer the query. \nContext information is below.\n---------------------\n{context_str}\n---------------------\nGiven the context information and not prior knowledge, answer the query.\nQuery: {query_str}\nAnswer: '





In [46]:
new_qa_prompt

PromptTemplate(metadata={'prompt_type': <PromptType.CUSTOM: 'custom'>}, template_vars=['context_str', 'query_str'], kwargs={}, output_parser=None, template_var_mappings=None, function_mappings=None, template='You are a QA assistant.\nContext information is below. Given the context information and not prior knowledge, answer the query. \nContext information is below.\n---------------------\n{context_str}\n---------------------\nGiven the context information and not prior knowledge, answer the query.\nQuery: {query_str}\nAnswer: ')

In [47]:
prev_instr_score_pairs

[('You are a QA assistant.\nContext information is below. Given the context information and not prior knowledge, answer the query. ',
  3.975)]

In [48]:
full_eval_qs = [q for q, _ in full_qr_pairs]
full_eval_answers = [a for _, a in full_qr_pairs]

In [49]:
# Evaluate with the initial QA prompt

query_engine.update_prompts({QA_PROMPT_KEY: old_qa_prompt})
avg_correctness_old = await get_correctness(
    query_engine, full_qr_pairs, batch_runner
)

In [50]:
print(avg_correctness_old)

3.825


In [51]:
# Evaluate with the "optimized" prompt

query_engine.update_prompts({QA_PROMPT_KEY: new_qa_prompt})
avg_correctness_new = await get_correctness(
    query_engine, full_qr_pairs, batch_runner
)

In [52]:
print(avg_correctness_new)

4.0
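
To put the two averages side by side, a small helper can report the absolute and relative change. `score_delta` is a hypothetical name; the inputs are the two scores printed above.

```python
def score_delta(old_score: float, new_score: float):
    """Return the absolute and relative change from old_score to new_score."""
    absolute = new_score - old_score
    relative = absolute / old_score
    return absolute, relative


abs_gain, rel_gain = score_delta(3.825, 4.0)
print(f"{abs_gain:+.3f} ({rel_gain:+.1%})")  # +0.175 (+4.6%)
```

Since the run above used `num_iterations=1`, the "optimized" prompt contains the same text as the initial one, so a gap of this size most likely reflects run-to-run variation in generation and grading rather than a better instruction. Running more iterations, and repeating each evaluation, would give a more meaningful comparison.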
