# FRIENDLY WARNING
## OpenAI API is being called here. Do not rerun if you're not planning to use any of your OpenAI credits!

# Generate Ground Truth Data

Using the chunked knowledge base `readme_notes_with_ids.json`, a set of ground truth data is generated using the following steps: 

* For each document, use an LLM to produce 5 questions relevant to that document.
* Save as ground truth data.
* Generate embeddings for ground truth data and save them for future evaluations. 

This ground truth data is required to proceed with:
* **Retrieval evaluation**: Evaluate the retrieval methods (text-based search, vector-based search, hybrid search) against the generated questions, using Hit Rate and Mean Reciprocal Rank (MRR) as metrics, to determine the best retrieval method for the final chatbot app.
* **RAG evaluation**: Evaluate the LLM candidates' performance to determine the best model for the final chatbot app.  
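As a rough illustration of the two retrieval metrics named above, the sketch below computes Hit Rate and Mean Reciprocal Rank from ranked search results. The function names and the toy data are hypothetical and only show the idea; the actual evaluation code in this project may differ.

```python
def hit_rate(ranked_ids_per_query, expected_ids):
    """Fraction of queries whose expected doc_id appears anywhere in the results."""
    hits = sum(
        1
        for ranked, expected in zip(ranked_ids_per_query, expected_ids)
        if expected in ranked
    )
    return hits / len(expected_ids)


def mean_reciprocal_rank(ranked_ids_per_query, expected_ids):
    """Average of 1/rank of the expected doc_id (0 when it is not retrieved)."""
    total = 0.0
    for ranked, expected in zip(ranked_ids_per_query, expected_ids):
        if expected in ranked:
            total += 1.0 / (ranked.index(expected) + 1)
    return total / len(expected_ids)


# Toy data (hypothetical): three questions with their top-3 search results.
expected = ["doc_a", "doc_b", "doc_c"]
ranked_results = [
    ["doc_a", "doc_x", "doc_y"],  # expected doc at rank 1
    ["doc_x", "doc_b", "doc_y"],  # expected doc at rank 2
    ["doc_x", "doc_y", "doc_z"],  # expected doc not retrieved
]

print(hit_rate(ranked_results, expected))
print(mean_reciprocal_rank(ranked_results, expected))
```

Each ground truth row supplies one (question, expected `doc_id`) pair, which is why the questions are stored together with the id of the document they were generated from.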

In [1]:
from openai import OpenAI
from tqdm.auto import tqdm
import pickle
import pandas as pd
import json
import os
from dotenv import load_dotenv
from sentence_transformers import SentenceTransformer

### Load OpenAI API key from `.env`

You will need to set up a `.env` file with your own OpenAI API key to rerun the ground truth data generation.

Sample `.env` file for this project:
```
OPENAI_API_KEY=your API key here
```

In [2]:
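load_dotenv()  # reads .env and puts OPENAI_API_KEY into os.environ
# Fail early with a clear message if the key is missing.
assert os.getenv("OPENAI_API_KEY"), "OPENAI_API_KEY not found; check your .env file"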

### Load `readme_notes_with_ids.json`

The JSON file should contain 385 documents, where each document is a single chunk of the knowledge base.

In [3]:
folder = "../data/"
file = "readme_notes_with_ids.json"

In [4]:
with open(f"{folder}{file}") as f:
    documents = json.load(f)

Cross-check number of documents and format to ensure documents are loaded successfully.

In [5]:
len(documents)

385

In [6]:
documents[:3]

[{'doc_id': '08e49f1028_1',
  'header': ' Cloud Concepts: Describe cloud service types',
  'subheader': ' Infrastructure as a service (IaaS)',
  'document': ['Customer has maximum control of cloud resources.'],
  'doc_text': 'Customer has maximum control of cloud resources.'},
 {'doc_id': '08e49f1028_2',
  'header': ' Cloud Concepts: Describe cloud service types',
  'subheader': ' Infrastructure as a service (IaaS)',
  'document': ['Customer has largest share of responsibility in the shared responsibility model.'],
  'doc_text': 'Customer has largest share of responsibility in the shared responsibility model.'},
 {'doc_id': '08e49f1028_3',
  'header': ' Cloud Concepts: Describe cloud service types',
  'subheader': ' Infrastructure as a service (IaaS)',
  'document': ['Only the physical resources are controlled by cloud provider: Physical hosts, network and data center security.'],
  'doc_text': 'Only the physical resources are controlled by cloud provider: Physical hosts, network and d
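Beyond eyeballing the first few records, a structural check can catch missing fields or duplicate ids before any API credits are spent. The sketch below is a minimal example under the assumption that every document carries the five keys shown in the output above; `validate_documents` and `REQUIRED_KEYS` are hypothetical names, not part of the notebook.

```python
REQUIRED_KEYS = {"doc_id", "header", "subheader", "document", "doc_text"}


def validate_documents(docs):
    """Return a list of problems found; an empty list means the data looks fine."""
    problems = []
    seen_ids = set()
    for i, doc in enumerate(docs):
        missing = REQUIRED_KEYS - doc.keys()
        if missing:
            problems.append(f"document {i} is missing keys: {sorted(missing)}")
            continue
        if doc["doc_id"] in seen_ids:
            problems.append(f"duplicate doc_id: {doc['doc_id']}")
        seen_ids.add(doc["doc_id"])
    return problems


# Small sample mirroring the structure of the loaded documents.
sample_docs = [
    {
        "doc_id": "08e49f1028_1",
        "header": " Cloud Concepts: Describe cloud service types",
        "subheader": " Infrastructure as a service (IaaS)",
        "document": ["Customer has maximum control of cloud resources."],
        "doc_text": "Customer has maximum control of cloud resources.",
    },
]

print(validate_documents(sample_docs))
```

Unique ids matter here because the generated questions are later stored in a dictionary keyed by `doc_id`, so a duplicate id would cause one document to be skipped.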

### Formulate prompt to generate questions for ground truth data

This prompt was pre-tested in ChatGPT to ensure the output matches my expectations, see here: [ChatGPT Test on AZ-900 ground truth data prompt](https://chatgpt.com/share/30e7c43c-e457-4b44-8b0e-1be08f6c4b71).

In [7]:
prompt_template = """
You are currently studying for the Microsoft Azure Fundamentals (AZ-900) certification exam, and you're trying to better understand the concepts covered in the document provided. 
Based on this document, generate five questions that you might ask. The document should contain the answer to the questions.

The document:

Topic: {header}
Sub topic: {subheader}
Notes: {doc_text}

Provide the 5 questions in parsable JSON without using code blocks. Here is an example for the 5 questions:

["question1", "question2", ..., "question5"]
""".strip()

Test the prompt template formatting with a sample document.

In [8]:
documents[2]

{'doc_id': '08e49f1028_3',
 'header': ' Cloud Concepts: Describe cloud service types',
 'subheader': ' Infrastructure as a service (IaaS)',
 'document': ['Only the physical resources are controlled by cloud provider: Physical hosts, network and data center security.'],
 'doc_text': 'Only the physical resources are controlled by cloud provider: Physical hosts, network and data center security.'}

In [9]:
print(prompt_template.format(**documents[2]))

You are currently studying for the Microsoft Azure Fundamentals (AZ-900) certification exam, and you're trying to better understand the concepts covered in the document provided. 
Based on this document, generate five questions that you might ask. The document should contain the answer to the questions.

The document:

Topic:  Cloud Concepts: Describe cloud service types
Sub topic:  Infrastructure as a service (IaaS)
Notes: Only the physical resources are controlled by cloud provider: Physical hosts, network and data center security.

Provide the 5 questions in parsable JSON without using code blocks. Here is an example for the 5 questions:

["question1", "question2", ..., "question5"]
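The formatted prompt shows two spaces after `Topic:` and `Sub topic:` because the `header` and `subheader` values start with a space. This is harmless for the LLM, but the fields can be stripped before formatting if a cleaner prompt is preferred. The sketch below uses a shortened stand-in for the template, and `clean_fields` is a hypothetical helper; it also shows that `str.format` ignores extra keys such as `doc_id`.

```python
# Shortened stand-in for the notebook's prompt_template (for illustration only).
template = "Topic: {header}\nSub topic: {subheader}\nNotes: {doc_text}"

doc = {
    "doc_id": "08e49f1028_3",
    "header": " Cloud Concepts: Describe cloud service types",
    "subheader": " Infrastructure as a service (IaaS)",
    "doc_text": (
        "Only the physical resources are controlled by cloud provider: "
        "Physical hosts, network and data center security."
    ),
}


def clean_fields(doc):
    """Strip surrounding whitespace from string values; leave other values as-is."""
    return {k: v.strip() if isinstance(v, str) else v for k, v in doc.items()}


# Keys the template does not mention (doc_id here) are simply ignored.
prompt = template.format(**clean_fields(doc))
print(prompt)
```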


### Setup LLM for ground truth data
Here's what's needed:

* A prompt that asks for questions based on a single document from the knowledge base
* An LLM that takes this prompt and generates a response that contains the questions


In [10]:
client = OpenAI()

def generate_questions(doc, model='gpt-4o-mini'):
    prompt = prompt_template.format(**doc)

    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}]
    )

    json_response = response.choices[0].message.content
    return json_response

### Test run for questions generation
Before calling the API for all documents:
* Test one document's questions generation with OpenAI
* Test JSON parsing on the generated questions. It should be in the format `{document_id : list of 5 questions}`

In [11]:
test_response = generate_questions(documents[2])

In [12]:
test_response

'[\n    "What does IaaS stand for in cloud service types?",\n    "What physical resources are managed by the cloud provider in IaaS?",\n    "In IaaS, who is responsible for data center security?",\n    "Which elements are included in the management scope of IaaS by the cloud provider?",\n    "What level of control do users have in an IaaS model compared to other cloud service types?"\n]'

In [13]:
test_results = {}
test_id = documents[2]['doc_id']
test_results[test_id] = test_response

test_parsed_results = {}
for doc_id, json_questions in test_results.items():  # avoid shadowing the built-in id()
    test_parsed_results[doc_id] = json.loads(json_questions)

In [14]:
test_parsed_results

{'08e49f1028_3': ['What does IaaS stand for in cloud service types?',
  'What physical resources are managed by the cloud provider in IaaS?',
  'In IaaS, who is responsible for data center security?',
  'Which elements are included in the management scope of IaaS by the cloud provider?',
  'What level of control do users have in an IaaS model compared to other cloud service types?']}

### Generate questions for all documents
The subsequent code blocks are executed once the test run results are satisfactory.
* Save document id and the 5 generated questions as a key-value pair (5 questions in a list) in results dictionary.
* Skip generation if the document id exists in results dictionary, to handle code block rerun scenarios.
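The skip-if-present rule above is what makes the loop safe to rerun after an interruption: documents that already have questions cost no further API calls. The sketch below demonstrates the idea with a hypothetical stand-in (`fake_generate`) that counts invocations instead of calling OpenAI; `generate_all` is likewise an illustrative name.

```python
def generate_all(docs, generate_fn, results=None):
    """Call generate_fn once per doc_id, skipping ids already present in results."""
    results = {} if results is None else results
    for doc in docs:
        doc_id = doc["doc_id"]
        if doc_id in results:
            continue  # already generated in a previous run
        results[doc_id] = generate_fn(doc)
    return results


# Hypothetical stand-in for the LLM call; it records calls instead of using the API.
calls = []


def fake_generate(doc):
    calls.append(doc["doc_id"])
    return '["q1", "q2", "q3", "q4", "q5"]'


docs = [{"doc_id": "a_1"}, {"doc_id": "a_2"}, {"doc_id": "a_3"}]

results = generate_all(docs, fake_generate)
results = generate_all(docs, fake_generate, results)  # rerun: no new calls

print(len(results), len(calls))
```

This only works if the results dictionary survives between runs, which is why the notebook initializes `results = {}` in its own cell, separate from the loop.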

In [15]:
results = {}

In [16]:
for doc in tqdm(documents): 
    doc_id = doc['doc_id']
    if doc_id in results:
        continue

    questions = generate_questions(doc)
    results[doc_id] = questions


* Cross-check the results to make sure each document has its own set of questions.
* Backup results into `results_backup.json` in case of parsing errors in the following step.

In [17]:
len(results)

385

In [18]:
results_file = "results_backup.json"
with open(f"{folder}{results_file}", 'w') as w:
    json.dump(results, w)

### Parse results as ground-truth data

The results together with the original documents will be parsed into ground truth data. Each ground truth data row will look like this:

| doc_id | question |
|--------|--------|
|...xyz...| 1st question|
|...xyz...| 2nd question|
|...xyz...| 3rd question|
|...xyz...| 4th question|
|...xyz...| 5th question|

Hence, for $n$ documents, there will be $5n$ questions in the ground truth data.
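The $5n$ figure only holds if every response really contains five questions, so it is worth checking per document and not only in total. A minimal sketch with toy data follows; the variable names are hypothetical. For the 385 documents here, the expected total is $385 \times 5 = 1925$ rows.

```python
# Toy stand-in for the parsed results: doc_id -> list of questions.
parsed = {
    "doc_1": ["q1", "q2", "q3", "q4", "q5"],
    "doc_2": ["q6", "q7", "q8", "q9", "q10"],
}

# One (doc_id, question) row per question.
rows = [(doc_id, q) for doc_id, questions in parsed.items() for q in questions]

# Flag any document that did not come back with exactly 5 questions.
wrong_count = {doc_id: len(qs) for doc_id, qs in parsed.items() if len(qs) != 5}

print(len(rows), wrong_count)
```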

In [19]:
parsed_result = {}

for doc_id, json_questions in results.items():
    parsed_result[doc_id] = json.loads(json_questions)

In [20]:
len(parsed_result)

385
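The `json.loads` call above assumes every response is clean JSON. All 385 responses parsed in this run, but a model can occasionally wrap its answer in a Markdown code fence or return the wrong number of items. The sketch below shows one defensive way to handle that; `parse_questions` is a hypothetical helper, not part of the notebook.

```python
import json

FENCE = "`" * 3  # Markdown code fence marker


def parse_questions(raw, expected_count=5):
    """Parse an LLM response into a list of questions, or raise ValueError."""
    text = raw.strip()
    # Drop a surrounding Markdown code fence if the model added one anyway.
    if text.startswith(FENCE):
        lines = text.splitlines()
        body = lines[1:-1] if lines[-1].strip() == FENCE else lines[1:]
        text = "\n".join(body)
    try:
        questions = json.loads(text)
    except json.JSONDecodeError as exc:
        raise ValueError(f"response is not valid JSON: {exc}") from exc
    if not isinstance(questions, list) or not all(
        isinstance(q, str) for q in questions
    ):
        raise ValueError("response is not a JSON list of strings")
    if len(questions) != expected_count:
        raise ValueError(
            f"expected {expected_count} questions, got {len(questions)}"
        )
    return questions


# A response wrapped in a code fence still parses.
fenced = FENCE + 'json\n["q1", "q2", "q3", "q4", "q5"]\n' + FENCE
print(parse_questions(fenced))
```

Documents whose response raises `ValueError` could be removed from the results dictionary and regenerated, since the generation loop skips only ids that are already present.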

In [21]:
ground_truth_data = []

for doc_id, questions in parsed_result.items():
    for q in questions:
        ground_truth_data.append((doc_id, q))

In [22]:
len(ground_truth_data)

1925

Inspect the first 10 rows of the ground truth data. There should be 10 different questions belonging to 2 unique document ids.

In [23]:
ground_truth_data[:10]

[('08e49f1028_1',
  'What is the primary characteristic of Infrastructure as a Service (IaaS)?'),
 ('08e49f1028_1',
  'How much control does a customer have over cloud resources in IaaS?'),
 ('08e49f1028_1',
  'What type of cloud service provides the maximum control to customers?'),
 ('08e49f1028_1',
  "In the context of IaaS, what does 'customer control' refer to?"),
 ('08e49f1028_1',
  'Which cloud service model allows users to manage their own infrastructure resources?'),
 ('08e49f1028_2',
  'What is the largest share of responsibility for customers in the shared responsibility model?'),
 ('08e49f1028_2', 'What does IaaS stand for in cloud service types?'),
 ('08e49f1028_2',
  'In the shared responsibility model, how much responsibility does the customer have?'),
 ('08e49f1028_2',
  'Which cloud service model places the largest responsibility on the customer?'),
 ('08e49f1028_2',
  'What are customers responsible for in an Infrastructure as a Service (IaaS) model?')]

In [24]:
df_ground_truth = pd.DataFrame(ground_truth_data, columns=['doc_id', 'question'])

## Generate embeddings for questions

Generate the question embeddings with `sentence-transformers/all-MiniLM-L12-v2`, the same embedding model used in the data ingestion pipeline (see `/ingestion/embed.ipynb`).
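Using the same model matters because question and document vectors are only comparable when they live in the same vector space with the same dimension. The sketch below illustrates this with cosine similarity on toy vectors. It assumes the vector search compares embeddings by cosine similarity, which this notebook does not state, and the vectors are made up for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy 3-dimensional vectors (hypothetical); real embeddings have far more dimensions.
question_vec = [1.0, 0.0, 1.0]
doc_vec_close = [1.0, 0.0, 1.0]
doc_vec_far = [0.0, 1.0, 0.0]

print(cosine_similarity(question_vec, doc_vec_close))  # close to 1
print(cosine_similarity(question_vec, doc_vec_far))    # orthogonal vectors
```

Vectors produced by a different model would either fail the dimension check or, worse, pass it while pointing in unrelated directions.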

In [25]:
questions = df_ground_truth['question'].tolist()

In [26]:
len(questions)

1925

In [27]:
model_name = 'sentence-transformers/all-MiniLM-L12-v2'
model = SentenceTransformer(model_name)



In [28]:
questions_vec = []
for q in tqdm(questions):
    questions_vec.append(model.encode(q))


In [29]:
df_ground_truth['question_vec'] = pd.Series(questions_vec)

In [30]:
df_ground_truth.head()

|   | doc_id | question | question_vec |
|---|--------|--------|--------|
| 0 | 08e49f1028_1 | What is the primary characteristic of Infrastr... | [-0.0061842897, -0.026708089, -0.009343912, -0... |
| 1 | 08e49f1028_1 | How much control does a customer have over clo... | [0.06669134, -0.040444482, -0.10285918, 0.0190... |
| 2 | 08e49f1028_1 | What type of cloud service provides the maximu... | [0.054696202, -0.057140283, -0.03228103, 0.005... |
| 3 | 08e49f1028_1 | In the context of IaaS, what does 'customer co... | [-0.03537042, -0.004027559, -0.078389645, 0.03... |
| 4 | 08e49f1028_1 | Which cloud service model allows users to mana... | [0.049779054, -0.10544806, -0.04841407, -0.025... |


### Save ground truth data as Pickle file

In [31]:
ground_truth_file = 'ground-truth-data.pkl'

In [32]:
df_ground_truth.to_pickle(f'{folder}{ground_truth_file}')