This notebook demonstrates how ground truth data is generated: an LLM is asked to create five potential questions from the section, question, and text of each document. The exact instructions are in the prompt below. Note that the results differ from `ground-truth-data.csv` because here they are generated by the `Phi-3` model.

Before running this notebook, you need to start an Ollama server. To do so, run the following command in a terminal:

```bash
docker run -it \
    --rm \
    -v ./ollama_files:/root/.ollama \
    -p 11434:11434 \
    --name ollama \
    ollama/ollama
```

In a new terminal, run the following command to download the `Phi-3` model:

```bash
docker exec -it ollama ollama pull phi3
```
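Before moving on, it can help to confirm that the server is reachable and that the model was pulled. The following is a minimal sketch, assuming the server listens on `http://localhost:11434` as in the `docker run` command above. It queries Ollama's `GET /api/tags` endpoint, which lists locally available models. The helper name `list_local_models` is hypothetical and not part of this notebook.

```python
import json
import urllib.request
from http.client import HTTPException
from typing import Optional


def list_local_models(
    base_url: str = "http://localhost:11434", timeout: float = 5.0
) -> Optional[list[str]]:
    """Return the names of locally available models, or None if the server is unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            payload = json.load(resp)
    except (OSError, ValueError, HTTPException):
        # Connection refused, timeout, HTTP error, or a non-JSON reply
        return None
    models = payload.get("models") if isinstance(payload, dict) else None
    return [m.get("name", "") for m in (models or [])]


if __name__ == "__main__":
    names = list_local_models()
    if names is None:
        print("Ollama server is not reachable")
    else:
        print("Available models:", names)
```

If the pull succeeded, the printed list should contain an entry whose name starts with `phi3`.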

In [None]:
import json
from typing import Optional

from openai import OpenAI
import pandas as pd

from tqdm.auto import tqdm

# Read documents

In [100]:
with open("documents-with-ids.json", "rt") as f_in:
    documents = json.load(f_in)

# Take one document as an example

In [114]:
prompt_template = """
You emulate a student who's taking our course.
Formulate 5 questions this student might ask based on a FAQ record. The record
should contain the answer to the questions, and the questions should be complete and not too short.
If possible, use as few words as possible from the record.

The record:

section: {section}
question: {question}
answer: {text}

Provide the output in parsable Python list WITHOUT using code blocks:

["question1", "question2", ..., "question5"]
""".strip()

In [115]:
doc = documents[0]
doc

{'text': "The purpose of this document is to capture frequently asked technical questions\nThe exact day and hour of the course will be 15th Jan 2024 at 17h00. The course will start with the first  “Office Hours'' live.1\nSubscribe to course public Google Calendar (it works from Desktop only).\nRegister before the course starts using this link.\nJoin the course Telegram channel with announcements.\nDon’t forget to register in DataTalks.Club's Slack and join the channel.",
 'section': 'General course-related questions',
 'question': 'Course - When will the course start?',
 'course': 'data-engineering-zoomcamp',
 'id': 'c02e79ef'}

In [116]:
prompt = prompt_template.format(**doc)
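
A note on the formatting step above: `str.format(**doc)` fills only the placeholders named in the template, so the extra `course` and `id` keys are ignored, and curly braces inside the record text are left untouched. The sketch below illustrates this with a shortened, hypothetical template and record (not the course data).

```python
# Hypothetical, shortened stand-ins for the prompt template and a FAQ record
template = "section: {section}\nquestion: {question}\nanswer: {text}"
record = {
    "text": "Use {curly} braces freely.",
    "section": "General",
    "question": "How?",
    "course": "demo-course",  # not referenced by the template, so it is ignored
    "id": "abc123",           # ignored as well
}

filled = template.format(**record)
print(filled)

# A record that lacks a referenced field fails loudly with KeyError
try:
    template.format(section="General", question="How?")
    missing_field_detected = False
except KeyError:
    missing_field_detected = True
```

This also means a document missing any of `section`, `question`, or `text` would raise `KeyError` rather than silently producing an incomplete prompt.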

In [117]:
client = OpenAI(
    base_url="http://localhost:11434/v1/",
    api_key="ollama",
)

response = client.chat.completions.create(
    model="phi3",
    messages=[{"role": "user", "content": prompt}],
    temperature=0.0,
)

In [118]:
json.loads(response.choices[0].message.content)

['Could you please confirm when exactly our course is scheduled to begin? I understand that we need to register beforehand and would like some clarity on this.',
 "I'm a bit confused about the registration process for the upcoming class. Could someone guide me through it step by step, including how to subscribe to the Google Calendar and join Telegram channel?",
 'What are our office hours? I want to make sure that if there is anything unclear during my course journey, I can reach out.',
 "I'm not quite clear on why we need to register in DataTalks.Club’s Slack as well and join the channel before starting the class. Could you explain this requirement?",
 'Could someone please provide more details about how exactly our Google Calendar works for course-related events, including office hours?']

# Encapsulate the process in a function to simplify running it on all documents.

In [119]:
def generate_question(doc: dict, client: OpenAI) -> Optional[list[str]]:
    prompt_template = """
You emulate a student who's taking our course.
Formulate 5 questions this student might ask based on a FAQ record. The record
should contain the answer to the questions, and the questions should be complete and not too short.
If possible, use as few words as possible from the record.

The record:

section: {section}
question: {question}
answer: {text}

Provide the output in parsable Python list without using code blocks:

["question1", "question2", ..., "question5"]
""".strip()

    prompt = prompt_template.format(**doc)
    response = client.chat.completions.create(
        model="phi3",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
    )
    # Unfortunately, the model does not always return a valid, parsable list.
    # In that case, None is returned.
    try:
        return json.loads(response.choices[0].message.content)
    except Exception:
        return None
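
The prompt asks for a Python list, while the function parses the reply with `json.loads`, so a reply that uses single-quoted strings or wraps the list in extra text is discarded. The following is a sketch of a more tolerant parser, under the assumption that such replies are worth salvaging. The name `parse_question_list` is hypothetical, and falling back to `ast.literal_eval` is a swapped-in technique, not what the function above does.

```python
import ast
import json
from typing import Optional


def parse_question_list(raw: str) -> Optional[list[str]]:
    """Best-effort parse of a model reply into a list of strings."""
    # Keep only the outermost [...] span, dropping code fences or prose around it
    start, end = raw.find("["), raw.rfind("]")
    if start == -1 or end <= start:
        return None
    candidate = raw[start : end + 1]

    # Try strict JSON first, then fall back to a safe Python literal parse
    for loader in (json.loads, ast.literal_eval):
        try:
            parsed = loader(candidate)
        except (ValueError, SyntaxError, TypeError, RecursionError):
            continue
        if isinstance(parsed, list) and all(isinstance(q, str) for q in parsed):
            return parsed
    return None
```

Such a helper could replace the `try`/`except` around `json.loads`, called as `parse_question_list(response.choices[0].message.content or "")`. It still returns `None` when nothing list-like can be recovered.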

In [122]:
# Generate questions for the first 5 documents to save time
for doc in tqdm(documents[:5]):
    doc["potential_question"] = generate_question(doc, client)

100%|██████████| 5/5 [01:32<00:00, 18.49s/it]


In [123]:
pd.DataFrame(documents).dropna(subset=["potential_question"]).explode(
    "potential_question"
)[["potential_question", "course", "id"]]

,potential_question,course,id
0,Could you please confirm if there are any prer...,data-engineering-zoomcamp,c02e79ef
0,What is the exact start time and date of the c...,data-engineering-zoomcamp,c02e79ef
0,Is there any specific software or tools that n...,data-engineering-zoomcamp,c02e79ef
0,Are assignments expected throughout the durati...,data-engineering-zoomcamp,c02e79ef
1,What specific skills or knowledge should I hav...,data-engineering-zoomcamp,1f6520ca
1,Could you detail what foundational concepts ar...,data-engineering-zoomcamp,1f6520ca
2,Can I still join this course after its start d...,data-engineering-zoomcamp,7842b56a
2,Is it possible to enroll in a class post-start...,data-engineering-zoomcamp,7842b56a
3,Could you clarify if I need a confirmation ema...,data-engineering-zoomcamp,0bbf41ec
3,What is expected of me as an unregistered part...,data-engineering-zoomcamp,0bbf41ec
