## Evaluation Data Generation

This notebook generates realistic, song-specific user questions, each paired with the id of the song record it was generated from (the ground truth), for evaluating how well the Music Theory Assistant RAG system retrieves and answers music theory queries. It does the following:

1. Loads the music theory dataset from the local CSV file into a pandas DataFrame.
2. Defines a prompt template that asks an [LLM (OpenAI - GPT-4o mini)](https://platform.openai.com/docs/models/gpt-4o-mini) to generate 5 specific, song-based questions for each song record.
3. Sends each song record to the LLM and collects the generated questions as JSON.
4. Aggregates all questions into a list, associating each question with its song’s id.
5. Saves the results as a CSV file (`ground-truth-retrieval.csv`) for use as evaluation data in retrieval tasks.


In [21]:
import pandas as pd

In [22]:
from openai import OpenAI

client = OpenAI()

In [39]:
# Read the local dataset
df = pd.read_csv('../data/music-theory-dataset-100.csv')
documents = df.to_dict(orient='records')

In [40]:
# Build the prompt template
prompt_template = """
You emulate a user of our music theory assistant application.
Generate 5 questions this user might ask based on a provided song title.
Make the questions specific to this title.
The record should contain the answer to the questions, and the questions should
be complete and not too short. Use as few words as possible from the record. 

The record:

title: {title}
artist: {artist}
genre: {genre}
key: {key}
tempo_bpm: {tempo_bpm}
time_signature: {time_signature}
chord_progression: {chord_progression}
roman_numerals: {roman_numerals}
cadence: {cadence}
theory_notes: {theory_notes}

Provide the output in parsable JSON without using code blocks:

{{"questions": ["question1", "question2", ..., "question5"]}}
""".strip()

In [25]:
prompt = prompt_template.format(**documents[0])
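The template is filled with `str.format`, so the literal braces of the JSON example are doubled (`{{` and `}}`) to survive formatting, while single-brace fields such as `{title}` are substituted. A minimal sketch of that behavior, using a shortened stand-in template and made-up record values (both are assumptions for illustration, not the real data):

```python
# Shortened stand-in for the real template; the record values are illustrative only.
mini_template = 'title: {title}\nkey: {key}\n\n{{"questions": ["question1", ..., "question5"]}}'

record = {"id": 0, "title": "Example Song", "key": "C major"}

# Keys the template does not use (id here) are ignored by str.format;
# a field missing from the record would raise KeyError instead.
filled = mini_template.format(**record)
print(filled)
```

This is also why the `id` column can stay in each record even though the prompt never mentions it.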

In [26]:
def llm(prompt):
    # Send a single user message and return the text of the first completion
    response = client.chat.completions.create(
        model='gpt-4o-mini',
        messages=[{"role": "user", "content": prompt}]
    )

    return response.choices[0].message.content

In [27]:
questions = llm(prompt)

In [28]:
import json

In [29]:
json.loads(questions)

{'questions': ["What is the key of the song 'Let It Be'?",
  "Can you tell me the chord progression used in 'Let It Be'?",
  "What is the time signature of 'Let It Be'?",
  "What type of cadence is found at the end of 'Let It Be'?",
  "What is the tempo of the song 'Let It Be' in BPM?"]}

In [30]:
def generate_questions(doc):
    # Fill the template with this song's fields and return the model's raw JSON string
    prompt = prompt_template.format(**doc)
    return llm(prompt)
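The string returned by `generate_questions` is passed straight to `json.loads`, which only works when the model follows the prompt exactly. A model may occasionally wrap its reply in a Markdown code fence or return the wrong number of questions. The helper below is a hedged sketch of a more tolerant parser; `parse_questions` and its `expected` parameter are hypothetical names, not part of this notebook.

```python
import json

FENCE = "`" * 3  # Markdown code-fence marker, built this way to keep the example readable

def parse_questions(raw, expected=5):
    """Parse the model's reply into a list of question strings.

    Strips an optional Markdown code fence, then checks that the payload
    holds a "questions" list of the expected length.
    """
    text = raw.strip()
    if text.startswith(FENCE):
        lines = text.splitlines()
        if lines[-1].strip() == FENCE:
            lines = lines[:-1]           # drop the closing fence
        text = "\n".join(lines[1:])      # drop the opening fence line (e.g. "json")
    data = json.loads(text)              # raises json.JSONDecodeError on malformed JSON
    questions = data.get("questions") if isinstance(data, dict) else None
    if not isinstance(questions, list) or len(questions) != expected:
        raise ValueError(f"expected {expected} questions, got: {data!r}")
    return [str(q).strip() for q in questions]
```

Raising on an unexpected shape, rather than silently accepting it, keeps malformed replies out of the ground-truth file.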

In [31]:
from tqdm.auto import tqdm

In [32]:
results = {}

In [33]:
for doc in tqdm(documents):
    doc_id = doc['id']
    # Skip songs already processed so the cell can be re-run after a failure
    if doc_id in results:
        continue

    questions_raw = generate_questions(doc)
    questions = json.loads(questions_raw)
    results[doc_id] = questions['questions']
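As written, one malformed reply stops the loop with an exception, and the `doc_id in results` check lets a re-run resume where it left off. An alternative is to record failures and keep going. The sketch below assumes a hypothetical helper, `collect_questions`, that takes the generator as an argument so it can be exercised offline with a fake; it does not catch network or API errors.

```python
import json

def collect_questions(documents, generate, results=None):
    """Fill `results` (id -> list of questions) and return it with the failures.

    `generate` is any callable that takes a record and returns the raw JSON
    string, for example generate_questions from this notebook.
    """
    results = {} if results is None else results
    failed = {}
    for doc in documents:
        doc_id = doc["id"]
        if doc_id in results:
            continue  # already processed on an earlier run
        try:
            results[doc_id] = json.loads(generate(doc))["questions"]
        except (json.JSONDecodeError, KeyError, TypeError) as exc:
            failed[doc_id] = repr(exc)  # keep going; these ids can be retried later
    return results, failed

def fake_generate(doc):
    # Stand-in for the LLM call so the sketch runs without network access
    if doc["id"] == 1:
        return "not json"
    return json.dumps({"questions": [f"Question about {doc['title']}?"]})

demo_docs = [{"id": 0, "title": "Song A"}, {"id": 1, "title": "Song B"}]
demo_results, demo_failed = collect_questions(demo_docs, fake_generate)
print(demo_results)
print(sorted(demo_failed))
```

Calling the helper again with the same `results` dict retries only the ids that are still missing.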

In [34]:
final_results = []

for doc_id, questions in results.items():
    for q in questions:
        final_results.append((doc_id, q))

In [35]:
final_results[0]

(0, "What is the key of the song 'Let It Be' by The Beatles?")
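Before saving, it may be worth checking that every id really has five questions and that no question appears twice. The following is a minimal sketch over `(id, question)` pairs like those in `final_results`; `check_pairs` is a hypothetical helper and the sample pairs are made up for illustration.

```python
from collections import Counter

def check_pairs(pairs, expected=5):
    """Return (ids whose question count differs from `expected`, duplicated questions)."""
    per_id = Counter(doc_id for doc_id, _ in pairs)
    wrong_counts = {doc_id: n for doc_id, n in per_id.items() if n != expected}
    # Compare case-insensitively so trivially different copies are still caught
    per_question = Counter(q.strip().lower() for _, q in pairs)
    duplicates = sorted(q for q, n in per_question.items() if n > 1)
    return wrong_counts, duplicates

sample = [(0, "What key is it in?"), (0, "what key is it in?"), (1, "What is the tempo?")]
print(check_pairs(sample, expected=2))
# → ({1: 1}, ['what key is it in?'])
```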

In [36]:
df_results = pd.DataFrame(final_results, columns=['id', 'question'])

In [37]:
df_results.to_csv('../data/ground-truth-retrieval.csv', index=False)

In [38]:
!head ../data/ground-truth-retrieval.csv

id,question
0,What is the key of the song 'Let It Be' by The Beatles?
0,Can you provide the chord progression for 'Let It Be'?
0,What is the tempo in beats per minute for 'Let It Be'?
0,Which cadence is used at the end of 'Let It Be'?
0,What is the time signature of 'Let It Be'?
1,What is the key of the song 'Hotel California'?
1,Can you list the chord progression used in 'Hotel California'?
1,What is the time signature for 'Hotel California'?
1,What kind of cadence does 'Hotel California' have?
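One way this file could be consumed in a retrieval evaluation is a hit-rate check: for each question, see whether the id it was generated from appears among the top results. The sketch below is an assumption about downstream use, not part of this notebook; `hit_rate`, `toy_search`, and the third sample question are hypothetical.

```python
def hit_rate(ground_truth, search, k=5):
    """Fraction of questions whose source id appears in the top-k search results.

    `search` is a hypothetical callable: question -> ranked list of record ids.
    """
    if not ground_truth:
        return 0.0
    hits = sum(1 for doc_id, question in ground_truth if doc_id in search(question)[:k])
    return hits / len(ground_truth)

# Toy retriever standing in for the real system: matches on the quoted title
titles = {0: "Let It Be", 1: "Hotel California"}

def toy_search(question):
    return [doc_id for doc_id, title in titles.items() if title in question]

pairs = [
    (0, "What is the time signature of 'Let It Be'?"),
    (1, "What kind of cadence does 'Hotel California' have?"),
    (1, "What tempo is this song played at?"),  # made-up question with no title
]
print(hit_rate(pairs, toy_search))
```

The third pair misses because the toy retriever relies on the title, which illustrates why questions that reuse few words from the record make a harder test.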
