# Systematically Improving Your RAG - Week 4


# Generating Synthetic Queries from a FAQ

A while back, Klarna launched its new customer service chatbot, powered under the hood by OpenAI's GPT-4 model. OpenAI featured it as a major customer story and published a [writeup about it here](https://openai.com/index/klarna/).

In this notebook, we'll generate synthetic queries that a user might ask, based on Klarna's FAQ. We've downloaded a JSONL file from their website and cleaned it so that we have a list of questions with the following fields:

1. `question` : A question from their FAQ
2. `url` : The URL that this question lives at (its second-to-last path segment is the slug we use to look up the page content)
3. `parent_category` : The parent category that this question belongs to
4. `child_category` : The child category that this question belongs to

We've scraped all of their FAQ pages ahead of time and stored the raw HTML and a processed markdown equivalent in the `./data` folder. We'll use those pages to generate synthetic queries.

We'll do so in 3 steps:

1. First, we'll choose a random category and a subcategory within it
2. Then we'll sample some questions from that subcategory and get their markdown content from the `./data/md` folder
3. Finally, we'll generate a synthetic query and answer from that content

We'll vary the tone of the questions for diversity and generate an initial batch of good queries. Once we have those, we'll sample from them as examples to generate more.
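The three steps above can be sketched in plain Python before any model is involved. This is a minimal illustration, not the notebook's actual code: the category names come from the FAQ, but the slugs and the helper name `sample_context` are made up for the example.

```python
import random

# Toy stand-ins for the real mappings built later in the notebook.
category_to_subcategory = {
    "Payments": ["Make & manage payments", "Payment issues"],
    "Refunds": ["Manage refunds"],
}
# Hypothetical slugs per subcategory, for illustration only.
subcategory_to_slugs = {
    "Make & manage payments": ["slug-a", "slug-b", "slug-c"],
    "Payment issues": ["slug-d", "slug-e"],
    "Manage refunds": ["slug-f"],
}


def sample_context(rng: random.Random) -> tuple[str, str, list[str]]:
    # Step 1: pick a category, then a subcategory inside it.
    category = rng.choice(list(category_to_subcategory))
    subcategory = rng.choice(category_to_subcategory[category])
    # Step 2: sample between one and all of the subcategory's pages.
    slugs = subcategory_to_slugs[subcategory]
    chosen = rng.sample(slugs, rng.randint(1, len(slugs)))
    # Step 3 (not shown): pass the chosen pages to the model.
    return category, subcategory, chosen


category, subcategory, chosen = sample_context(random.Random(0))
print(category, subcategory, chosen)
```

Passing a seeded `random.Random` keeps the sketch reproducible; the notebook itself uses the module-level `random` functions.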

In [110]:
import pandas as pd

question_mapping = pd.read_json("./data/questions.jsonl", lines=True)

category_to_subcategory = (
    question_mapping.groupby("parent_category")["child_category"]
    .unique()
    .apply(list)
    .to_dict()
)
category_to_subcategory

{'Account & settings': ['Manage account', 'Login'],
 'Declined purchase': ['Declined Purchase'],
 'Delivery & returns': ['Cancellations',
  'Deliveries',
  'Problem resolution',
  'Returns'],
 'Fraud & security': ['Report fraud', 'Prevent fraud', 'Data protection'],
 'Payments': ['Make & manage payments', 'Payment issues'],
 'Products & services': ['How to use Klarna',
  'One-time card',
  'Klarna balance',
  'Klarna Card',
  'Payment options'],
 'Refunds': ['Manage refunds']}

In [137]:
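from pathlib import Path

# Map each FAQ slug (second-to-last URL segment) to its scraped markdown content.
# Path.read_text opens and closes the file for us, so no handles are left open.
slug_to_content = {
    slug: Path(f"./data/md/{slug}.md").read_text()
    for slug in (url.split("/")[-2] for url in question_mapping["url"])
}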

subcategories = question_mapping["child_category"].unique()

subcategory_to_question = {
    subcategory: question_mapping[question_mapping["child_category"] == subcategory][
        "url"
    ].tolist()
    for subcategory in subcategories
}
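The cell above derives each slug with `url.split("/")[-2]`, which assumes every URL ends in a trailing slash. A minimal sketch of that lookup, using a made-up URL and hypothetical helper names (`slug_from_url`, `markdown_path`), might look like this:

```python
from pathlib import Path


def slug_from_url(url: str) -> str:
    """Return the second-to-last path segment, as the notebook does.

    Assumes the URL ends with a trailing slash; without one, this
    would return the parent segment instead of the slug.
    """
    return url.split("/")[-2]


def markdown_path(url: str, root: str = "./data/md") -> Path:
    """Build the path of the processed markdown file for a FAQ URL."""
    return Path(root) / f"{slug_from_url(url)}.md"


# Hypothetical URL, for illustration only.
example_url = "https://example.com/help/how-do-i-pay/"
print(slug_from_url(example_url))
print(markdown_path(example_url))
```

If some URLs in the data lacked the trailing slash, calling `url.rstrip("/").split("/")[-1]` instead would be one way to handle both forms.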

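We'll choose a random category, then a random subcategory from that category, and generate a synthetic query from pages sampled within it. First, we define the response model. Its validators check that the model cites only slugs we supplied and echoes back the category and subcategory we asked for.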

In [134]:
from pydantic import BaseModel, field_validator, ValidationInfo


class GeneratedUserQuestion(BaseModel):
    chain_of_thought: str
    category: str
    subcategory: str
    question: str
    answer: str
    slug_citations: list[str]

    @field_validator("slug_citations")
    @classmethod
    def validate_slugs(cls, v, info: ValidationInfo):
        # `info.context` is the `context` dict passed to the completion call.
        context = info.context

        if not all(slug in context["slugs"] for slug in v):
            raise ValueError("All cited slugs must be in the context slugs")

        return v

    @field_validator("category")
    @classmethod
    def validate_category(cls, v, info: ValidationInfo):
        context = info.context

        if v != context["category"]:
            raise ValueError(
                f"Category {v} should be {context['category']}. Make sure you're using the correct category that was provided in the prompt."
            )

        return v

    @field_validator("subcategory")
    @classmethod
    def validate_subcategory(cls, v, info: ValidationInfo):
        context = info.context

        if v != context["subcategory"]:
            raise ValueError(
                f"Subcategory {v} should be {context['subcategory']}. Make sure you're using the correct subcategory that was provided in the prompt."
            )

        return v
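Stripped of Pydantic, the slug validator is a membership check that raises `ValueError` on failure, which instructor can then feed back to the model on a retry. Here is a plain-Python sketch of that check. The function name `check_citations` and the slugs are hypothetical, and unlike the validator above it names the offending slugs in the error message:

```python
def check_citations(cited: list[str], allowed: list[str]) -> list[str]:
    """Plain-Python version of the slug check in `validate_slugs`."""
    unknown = [slug for slug in cited if slug not in allowed]
    if unknown:
        raise ValueError(f"Cited slugs not in context: {unknown}")
    return cited


allowed = ["slug-a", "slug-b"]  # hypothetical slugs shown to the model

# A citation drawn from the allowed set passes through unchanged.
print(check_citations(["slug-a"], allowed))

# A citation the model invented is rejected.
try:
    check_citations(["slug-a", "slug-z"], allowed)
except ValueError as err:
    print(f"rejected: {err}")
```

Listing the unknown slugs in the message gives the model something concrete to correct if the error is sent back to it.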


In [153]:
import instructor
from openai import AsyncOpenAI
import random
from asyncio import Semaphore

client = instructor.from_openai(AsyncOpenAI())


class EvaluationQuestion(BaseModel):
    question: str
    category: str
    relevant_pages: list[str]
    subcategory: str


async def generate_question(
    client,
    category: str,
    subcategory: str,
    slug_to_content: dict[str, str],
    sem: Semaphore,
):
    tone = ["enthusiastic", "concise", "polite", "direct"]
    chosen_urls = random.sample(
        subcategory_to_question[subcategory],
        random.randint(1, len(subcategory_to_question[subcategory])),
    )
    chosen_slugs = [url.split("/")[-2] for url in chosen_urls]

    async with sem:
        resp = await client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {
                    "role": "system",
                    "content": "Generate a question and answer pair for the following category and subcategory: {{category}} - {{subcategory}}. Make sure that it has the tone: {{tone}}",
                },
                {
                    "role": "user",
                    "content": """
                    Here are some examples of questions and answers from this category and subcategory. Make sure to cite the specific slugs which the content was taken from that was used to generate the answer.
                    {% for slug in slugs %}
                    {{ loop.index }}: Slug: {{ slug }} - Content: {{ slug_to_content[slug] }}
                    {% endfor %}

                    Refer to the questions above when generating your question but make sure that it's phrased differently or asks about something else that's contained within the context above. This should be a natural question which a customer might ask

                    Eg. Why isn't my payment method working?
                    Eg. I missed my payment date, what's going to happen?

                    These must be natural sounding questions that a customer might ask.
                    """,
                },
            ],
            context={
                "category": category,
                "subcategory": subcategory,
                "tone": random.choice(tone),
                "slugs": chosen_slugs,
                "slug_to_content": slug_to_content,
            },
            response_model=GeneratedUserQuestion,
        )
        return EvaluationQuestion(
            question=resp.question,
            category=category,
            relevant_pages=resp.slug_citations,
            subcategory=subcategory,
        )


EvaluationQuestion(question='Can I update my phone number linked to my Klarna account?', category='Account & Settings', relevant_pages=['how-can-i-change-my-email-address', 'how-do-i-change-my-billing-address'], subcategory='Manage account')

In [156]:
from openai import AsyncOpenAI
from asyncio import Semaphore
from tqdm.asyncio import tqdm_asyncio

client = instructor.from_openai(AsyncOpenAI())
sem = Semaphore(10)  # allow at most 10 requests in flight at once
categories = list(category_to_subcategory.keys())
num = 10

coros = []

for _ in range(num):
    category = random.choice(categories)
    sub_category = random.choice(category_to_subcategory[category])

    coros.append(
        generate_question(client, category, sub_category, slug_to_content, sem)
    )

# tqdm_asyncio.gather works like asyncio.gather but also shows a progress bar.
resp = await tqdm_asyncio.gather(*coros, total=num)


with open("./data/synthetic_questions.jsonl", "w") as f:
    for item in resp:
        f.write(item.model_dump_json() + "\n")


100%|██████████| 10/10 [00:15<00:00,  1.57s/it]
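The `Semaphore(10)` in the cell above caps how many requests run at the same time. A minimal sketch of that pattern with only the standard library, where `fake_request` is a hypothetical stand-in for the API call, could look like this:

```python
import asyncio


async def fake_request(i: int, sem: asyncio.Semaphore, state: dict) -> int:
    """Stand-in for an API call; tracks how many calls run at once."""
    async with sem:
        state["active"] += 1
        state["peak"] = max(state["peak"], state["active"])
        await asyncio.sleep(0.01)  # simulate network latency
        state["active"] -= 1
        return i


async def run_all(n: int, limit: int) -> tuple[list[int], int]:
    sem = asyncio.Semaphore(limit)
    state = {"active": 0, "peak": 0}
    # gather returns results in the order the coroutines were passed in.
    results = await asyncio.gather(
        *(fake_request(i, sem, state) for i in range(n))
    )
    return results, state["peak"]


# In a notebook, use `await run_all(n=20, limit=5)` instead of asyncio.run.
results, peak = asyncio.run(run_all(n=20, limit=5))
print(len(results), peak)
```

All 20 coroutines are scheduled immediately, but only `limit` of them can be inside the `async with sem` block at any moment, so `peak` never exceeds the limit.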


In [200]:
import json

with open("./data/cleaned.jsonl", "r") as f:
    data = [json.loads(line) for line in f]


counts = {}

for item in data:
    counts[item["category"]] = counts.get(item["category"], 0) + 1

category_to_prev_questions = {}

for item in data:
    category = item["category"]
    subcategory = item["subcategory"]
    question = item["question"]

    if category not in category_to_prev_questions:
        category_to_prev_questions[category] = {}

    if subcategory not in category_to_prev_questions[category]:
        category_to_prev_questions[category][subcategory] = []

    category_to_prev_questions[category][subcategory].append(question)

print(counts)
len(data)

102
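The counting and grouping loops in this cell can also be written with `collections.Counter` and a nested `defaultdict`. The sketch below is an alternative formulation and uses hypothetical records in the same shape as the rows in the cleaned file:

```python
from collections import Counter, defaultdict

# Hypothetical records in the same shape as the cleaned JSONL rows.
data = [
    {"category": "Payments", "subcategory": "Payment issues", "question": "Q1"},
    {"category": "Payments", "subcategory": "Payment issues", "question": "Q2"},
    {"category": "Refunds", "subcategory": "Manage refunds", "question": "Q3"},
]

# Count questions per category.
counts = Counter(item["category"] for item in data)

# Nested mapping: category -> subcategory -> list of questions.
category_to_prev_questions = defaultdict(lambda: defaultdict(list))
for item in data:
    category_to_prev_questions[item["category"]][item["subcategory"]].append(
        item["question"]
    )

print(dict(counts))
print(category_to_prev_questions["Payments"]["Payment issues"])
```

Lookups with `.get(category, {})` still work on the nested `defaultdict` and do not create empty entries for missing keys.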

In [198]:
import instructor
from openai import AsyncOpenAI
import random
from asyncio import Semaphore

client = instructor.from_openai(AsyncOpenAI())


class EvaluationQuestion(BaseModel):
    question: str
    category: str
    relevant_pages: list[str]
    subcategory: str


async def generate_question_with_examples(
    client: instructor.AsyncInstructor,
    category: str,
    subcategory: str,
    slug_to_content: dict[str, str],
    category_to_prev_questions: dict[str, dict[str, list[str]]],
    sem: Semaphore,
):
    tone = [
        "formal and professional",
        "casual and friendly", 
        "overly polite and verbose",
        "brief and direct",
        "informal with typos and slang",
        "neutral and straightforward"
    ]
    chosen_urls = random.sample(
        subcategory_to_question[subcategory],
        random.randint(1, len(subcategory_to_question[subcategory])),
    )
    chosen_slugs = [url.split("/")[-2] for url in chosen_urls]

    valid_examples = category_to_prev_questions.get(category, {}).get(subcategory, [])
    sampled_examples = random.sample(
        valid_examples, random.randint(0, len(valid_examples))
    )

    async with sem:
        resp = await client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {
                    "role": "system",
                    "content": "Generate a question and answer pair for the following category and subcategory: {{category}} - {{subcategory}}. Make sure that it has the tone: {{tone}}",
                },
                {
                    "role": "user",
                    "content": """
                    Here are some examples of questions and answers from this category and subcategory. Make sure to cite the specific slugs which the content was taken from that was used to generate the answer.
                    {% for slug in slugs %}
                    {{ loop.index }}: Slug: {{ slug }} - Content: {{ slug_to_content[slug] }}
                    {% endfor %}

                    Here are some examples of questions that were used previously for this category and subcategory. Make sure to not use these questions directly but instead use them as inspiration to generate a new question.
                    {% for example in examples %}
                    {{ loop.index }}: Question: {{ example }}
                    {% endfor %}
                    
                    """,
                },
            ],
            context={
                "category": category,
                "subcategory": subcategory,
                "tone": random.choice(tone),
                "slugs": chosen_slugs,
                "slug_to_content": slug_to_content,
                "examples": sampled_examples,
            },
            response_model=GeneratedUserQuestion,
        )
        return EvaluationQuestion(
            question=resp.question,
            category=category,
            relevant_pages=resp.slug_citations,
            subcategory=subcategory,
        )
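The function above draws a random number of earlier questions to show the model as inspiration, anywhere from none to all of them. A small sketch of that sampling step in isolation, with a hypothetical helper name and made-up questions:

```python
import random


def sample_examples(examples: list[str], rng: random.Random) -> list[str]:
    """Pick between zero and all previous questions, without repeats."""
    k = rng.randint(0, len(examples))  # randint is inclusive on both ends
    return rng.sample(examples, k)


previous = ["Q1", "Q2", "Q3"]  # hypothetical earlier questions
picked = sample_examples(previous, random.Random(42))
print(picked)

# An empty history is safe: randint(0, 0) is 0 and sample(..., 0) is [].
print(sample_examples([], random.Random(42)))
```

Allowing zero examples means some generations run without any prior questions in the prompt, which is also what happens for subcategories that have no entries yet.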



In [201]:
from openai import AsyncOpenAI
from asyncio import Semaphore
from tqdm.asyncio import tqdm_asyncio

client = instructor.from_openai(AsyncOpenAI())
sem = Semaphore(10)  # allow at most 10 requests in flight at once
categories = list(category_to_subcategory.keys())
num = 10

coros = []

for _ in range(num):
    category = random.choice(categories)
    sub_category = random.choice(category_to_subcategory[category])

    coros.append(
        generate_question_with_examples(
            client, category, sub_category, slug_to_content, category_to_prev_questions, sem
        )
    )

# tqdm_asyncio.gather works like asyncio.gather but also shows a progress bar.
resp = await tqdm_asyncio.gather(*coros, total=num)


with open("./data/synthetic_questions.jsonl", "w") as f:
    for item in resp:
        f.write(item.model_dump_json() + "\n")

100%|██████████| 10/10 [00:05<00:00,  1.98it/s]
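Both generation cells open `./data/synthetic_questions.jsonl` in `"w"` mode, so each run replaces the previous batch. If keeping earlier batches is preferable (an assumption about intent, not something this notebook does), one option is to append instead. The helpers `append_jsonl` and `read_jsonl` below are hypothetical, and the sketch writes to a temporary folder:

```python
import json
import tempfile
from pathlib import Path


def append_jsonl(path: Path, rows: list[dict]) -> None:
    """Append rows to a JSONL file, one JSON object per line."""
    with path.open("a", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")


def read_jsonl(path: Path) -> list[dict]:
    """Read a JSONL file back into a list of dicts, skipping blank lines."""
    with path.open("r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Temporary folder so the sketch does not touch the real data directory.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "synthetic_questions.jsonl"
    append_jsonl(path, [{"question": "Q1"}])  # first batch
    append_jsonl(path, [{"question": "Q2"}])  # second batch is added, not overwritten
    rows = read_jsonl(path)

print(rows)
```

With the Pydantic models used here, `item.model_dump()` would produce the dicts to pass as `rows`.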
