# RAG Test

This notebook tests the RAG flow for the Music Theory Assistant. It evaluates how well the search system retrieves results and how well different LLMs judge the relevance of generated answers to the questions asked.

## Retrieval Flow

This section loads the music theory data from a local CSV knowledge base containing a selection of songs from different musical genres. The data is indexed with [minsearch](https://github.com/alexeygrigorev/minsearch) and queried. The search results are then passed as context to an [LLM (OpenAI - GPT-4o mini)](https://platform.openai.com/docs/models/gpt-4o-mini), which is asked the question again to check the accuracy of the results.

### Install minsearch

In [36]:
import os

if not os.path.exists("../notebooks/minsearch.py"):
    # Install the package
    os.system("pip install minsearch")
    # Download the file
    os.system("wget https://raw.githubusercontent.com/alexeygrigorev/minsearch/main/minsearch.py")



--2025-08-08 21:24:50--  https://raw.githubusercontent.com/alexeygrigorev/minsearch/main/minsearch.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.110.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4273 (4.2K) [text/plain]
Saving to: ‘minsearch.py’

     0K ....                                                  100% 57.6M=0s

2025-08-08 21:24:50 (57.6 MB/s) - ‘minsearch.py’ saved [4273/4273]



### Get the data

In [9]:
import pandas as pd

In [12]:
csv = '../data/music-theory-dataset-100.csv'

df = pd.read_csv(csv)
df.columns = df.columns.str.lower().str.replace(' ', '_')

In [13]:
df.to_csv(csv, index=False)

In [14]:
print('Shape (rows and columns):', df.shape)
df.head(2)

Shape (rows and columns): (100, 11)


| | id | title | artist | genre | key | tempo_bpm | time_signature | chord_progression | roman_numerals | cadence | theory_notes |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Let It Be | The Beatles | Pop | C major | 76 | 4/4 | C – G – Am – F – C – G – F – C | I – V – vi – IV – I – V – IV – I | Authentic (IV–I) at end; Deceptive (V–vi) earlier | Diatonic progression; Deceptive cadence in ear... |
| 1 | 1 | Hotel California | Eagles | Rock | Bm | 74 | 4/4 | Bm – F# – A – E – G – D – Em – F# | i – V – VII – IV – VI – III – iv – V | Half cadence (iv–V) | Modal interchange; Natural VII chord; Aeolian ... |


### Index with minsearch

In [15]:
import minsearch

In [16]:
df.columns

Index(['id', 'title', 'artist', 'genre', 'key', 'tempo_bpm', 'time_signature',
       'chord_progression', 'roman_numerals', 'cadence', 'theory_notes'],
      dtype='object')

In [17]:
# Convert numeric fields to strings to prevent parsing errors in minsearch
df['tempo_bpm'] = df['tempo_bpm'].astype(str)

documents = df.to_dict(orient='records')
documents[0]

{'id': 0,
 'title': 'Let It Be',
 'artist': 'The Beatles',
 'genre': 'Pop',
 'key': 'C major',
 'tempo_bpm': '76',
 'time_signature': '4/4',
 'chord_progression': 'C – G – Am – F – C – G – F – C',
 'roman_numerals': 'I – V – vi – IV – I – V – IV – I',
 'cadence': 'Authentic (IV–I) at end; Deceptive (V–vi) earlier',
 'theory_notes': 'Diatonic progression; Deceptive cadence in early phrase; Clear tonic return'}

In [18]:
index = minsearch.Index(
    text_fields=['title', 'artist', 'genre', 'key', 'tempo_bpm', 'time_signature',
       'chord_progression', 'roman_numerals', 'cadence', 'theory_notes'],
    keyword_fields=[]
)

In [19]:
index.fit(documents)

<minsearch.Index at 0x7ef6fe609d60>

In [20]:
query = "Give me Folk titles"

In [21]:
index.search(query, num_results=5)

[{'id': 5,
  'title': 'House of the Rising Sun',
  'artist': 'The Animals',
  'genre': 'Folk',
  'key': 'Am',
  'tempo_bpm': '76',
  'time_signature': '6/8',
  'chord_progression': 'Am – C – D – F – Am – E – Am',
  'roman_numerals': 'i – III – IV – VI – i – V – i',
  'cadence': 'Authentic (V–i)',
  'theory_notes': 'Aeolian mode; 6/8 compound meter; Traditional folk harmony'},
 {'id': 17,
  'title': "The Times They Are A-Changin'",
  'artist': 'Bob Dylan',
  'genre': 'Folk',
  'key': 'G major',
  'tempo_bpm': '76',
  'time_signature': '3/4',
  'chord_progression': 'G – Em – C – G – Am – D – G – Em – D – G',
  'roman_numerals': 'I – vi – IV – I – ii – V – I – vi – V – I',
  'cadence': 'Authentic (V–I)',
  'theory_notes': 'Folk protest song; Simple diatonic harmony; Waltz meter'},
 {'id': 82,
  'title': 'Hallelujah',
  'artist': 'Leonard Cohen',
  'genre': 'Folk',
  'key': 'C major',
  'tempo_bpm': '72',
  'time_signature': '6/8',
  'chord_progression': 'C – Am – F – G – Em – Am – F – G –

### RAG Flow

In [22]:
from openai import OpenAI

client = OpenAI()

In [23]:
def search(query):
    boost = {}

    results = index.search(
        query=query,
        filter_dict={},
        boost_dict=boost,
        num_results=10
    )

    return results

In [24]:
prompt_template = """
You're a music teacher. Answer the QUESTION based on the CONTEXT from our music theory database.
Use only the facts from the CONTEXT when answering the QUESTION.

QUESTION: {question}

CONTEXT:
{context}
""".strip()

entry_template = """
title: {title}
artist: {artist}
genre: {genre}
key: {key}
tempo_bpm: {tempo_bpm}
time_signature: {time_signature}
chord_progression: {chord_progression}
roman_numerals: {roman_numerals}
cadence: {cadence}
theory_notes: {theory_notes}
""".strip()

In [25]:
def build_prompt(query, search_results):
    context = ""
    
    for doc in search_results:
        context = context + entry_template.format(**doc) + "\n\n"

    prompt = prompt_template.format(question=query, context=context).strip()
    return prompt
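To see what an assembled prompt looks like without calling the search index or the OpenAI API, the sketch below applies the same formatting approach to two hand-written records. The `demo_` names and the shortened three-field templates are hypothetical and exist only for this illustration; the record values are taken from the dataset preview earlier in the notebook.

```python
# Shortened, hypothetical templates (the notebook's real templates have more fields).
demo_entry_template = """
title: {title}
artist: {artist}
key: {key}
""".strip()

demo_prompt_template = """
QUESTION: {question}

CONTEXT:
{context}
""".strip()


def demo_build_prompt(query, search_results):
    # Format each record and separate the entries with a blank line.
    # str.format ignores keys the template does not use (e.g. 'genre').
    context = "\n\n".join(
        demo_entry_template.format(**doc) for doc in search_results
    )
    return demo_prompt_template.format(question=query, context=context)


demo_docs = [
    {"title": "Let It Be", "artist": "The Beatles", "key": "C major", "genre": "Pop"},
    {"title": "Hotel California", "artist": "Eagles", "key": "Bm", "genre": "Rock"},
]

demo_prompt = demo_build_prompt("Which key is Let It Be in?", demo_docs)
print(demo_prompt)
```

Printing the prompt like this before sending it to the model is a cheap way to confirm that every retrieved record appears in the context and that no field is missing.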

In [26]:
def llm(prompt, model='gpt-4o-mini'):
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}]
    )
    
    return response.choices[0].message.content

In [27]:
def rag(query, model='gpt-4o-mini'):
    search_results = search(query)
    prompt = build_prompt(query, search_results)
    answer = llm(prompt, model=model)
    return answer

In [28]:
question = 'Is Mr Tambourine Man a folk song and if so, explain why, including information about its key and cadence, and explain what cadence means?'
answer = rag(question)
print(answer)

Yes, "Mr. Tambourine Man" is indeed a folk song, as indicated by its genre classification. The song is in the key of D major and features a chord progression of D – G – A – D – Bm – G – A – D. This progression uses the Roman numeral system where the chords correspond to I, IV, V, I, vi, IV, V, I.

The cadence of "Mr. Tambourine Man" is an authentic cadence, specifically a V–I cadence. An authentic cadence occurs when a dominant chord (V) resolves to the tonic chord (I), providing a sense of closure and resolution in the music.

In summary, "Mr. Tambourine Man" is a folk song characterized by its genre, the key of D major, and its use of authentic cadences (V–I), which create a satisfying resolution within its musical structure.


In [29]:
question = 'Could you print a list of songs by The Beatles?'
answer = rag(question)
print(answer)

Based on the context provided, here is a list of songs by The Beatles:

1. Let It Be
2. Something


In [30]:
question = 'Could you print a list of songs in the key of C Major?'
answer = rag(question)
print(answer)

Here is a list of songs in the key of C Major:

1. "Let It Be" - The Beatles
2. "My Girl" - The Temptations
3. "Brown Sugar" - The Rolling Stones
4. "Great Balls of Fire" - Jerry Lee Lewis
5. "Blue Moon" - Richard Rodgers
6. "Dust in the Wind" - Kansas


## Retrieval Evaluation

This section measures how well the search system (using minsearch, Elasticsearch and Qdrant) retrieves the correct song record for a set of ground-truth questions. It works as follows:

1. Loads ground-truth data: Reads a CSV file (ground-truth-retrieval.csv) containing questions and the correct song id for each question.
2. Defines evaluation metrics:
    - **Hit Rate**: The fraction of questions for which the correct song appears anywhere in the retrieved results (the top 10 in this notebook).
    - **MRR (Mean Reciprocal Rank)**: The average over all questions of 1/rank of the correct song in the ranked results, counting 0 when it is not retrieved (higher is better).
3. Runs the search: For each question, it uses minsearch to retrieve the top results.
4. Checks relevance: Compares the id of each result to the ground-truth id to see if the correct song was retrieved and at what rank.
5. Calculates metrics: Aggregates the results to compute overall hit rate and MRR, giving you a quantitative measure of your retrieval system’s accuracy.
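The two metrics can be checked by hand on a small example. The sketch below uses made-up relevance lists (one list per question, one boolean per ranked result); the names `toy_relevance` and `reciprocal_rank` are hypothetical and are not part of the notebook's own code.

```python
# Made-up relevance lists: one list per question, one boolean per ranked result.
toy_relevance = [
    [True, False, False, False],   # correct song at rank 1
    [False, True, False, False],   # correct song at rank 2
    [False, False, False, False],  # correct song not retrieved
    [False, False, False, True],   # correct song at rank 4
]


def reciprocal_rank(line):
    """Return 1/rank of the first relevant result, or 0.0 if there is none."""
    for position, is_relevant in enumerate(line, start=1):
        if is_relevant:
            return 1 / position
    return 0.0


# Hit rate: share of questions with at least one relevant result.
toy_hit_rate = sum(any(line) for line in toy_relevance) / len(toy_relevance)

# MRR: mean of the per-question reciprocal ranks.
toy_mrr = sum(reciprocal_rank(line) for line in toy_relevance) / len(toy_relevance)

print(toy_hit_rate)  # 0.75   (3 of 4 questions found)
print(toy_mrr)       # 0.4375 ((1 + 1/2 + 0 + 1/4) / 4)
```

Hit rate only records whether the correct song was found, while MRR also rewards finding it near the top, which is why the two numbers can differ considerably for the same set of results.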

In [37]:
# Read the local ground truth dataset
df_question = pd.read_csv('../data/ground-truth-retrieval.csv')

In [38]:
df_question.head()

| | id | question |
|---|---|---|
| 0 | 0 | What is the key of the song 'Let It Be' by The... |
| 1 | 0 | Can you provide the chord progression for 'Let... |
| 2 | 0 | What is the tempo in beats per minute for 'Let... |
| 3 | 0 | Which cadence is used at the end of 'Let It Be'? |
| 4 | 0 | What is the time signature of 'Let It Be'? |


In [39]:
ground_truth = df_question.to_dict(orient='records')

In [40]:
ground_truth[0]

{'id': 0,
 'question': "What is the key of the song 'Let It Be' by The Beatles?"}

In [41]:
def hit_rate(relevance_total):
    cnt = 0

    for line in relevance_total:
        if True in line:
            cnt = cnt + 1

    return cnt / len(relevance_total)

def mrr(relevance_total):
    total_score = 0.0

    for line in relevance_total:
        for rank, is_relevant in enumerate(line):
            if is_relevant:
                total_score = total_score + 1 / (rank + 1)
                # Only the first relevant result counts towards the reciprocal rank
                break

    return total_score / len(relevance_total)

In [42]:
def minsearch_search(query):
    boost = {}

    results = index.search(
        query=query,
        filter_dict={},
        boost_dict=boost,
        num_results=10
    )

    return results

In [43]:
def evaluate(ground_truth, search_function):
    relevance_total = []

    for q in tqdm(ground_truth):
        doc_id = q['id']
        results = search_function(q)
        relevance = [d['id'] == doc_id for d in results]
        relevance_total.append(relevance)

    return {
        'hit_rate': hit_rate(relevance_total),
        'mrr': mrr(relevance_total),
    }

In [45]:
from tqdm.auto import tqdm

In [46]:
evaluate(ground_truth, lambda q: minsearch_search(q['question']))

  0%|          | 0/500 [00:00<?, ?it/s]

{'hit_rate': 0.914, 'mrr': 0.6267063492063489}
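These scores are computed over the top 10 results per question (`num_results=10`). One way to see how sensitive the hit rate is to that cutoff is to truncate each relevance list before scoring. The sketch below does this on made-up relevance lists; `hit_rate_at_k` and `toy_results` are hypothetical names introduced for illustration, and the numbers say nothing about the real dataset.

```python
def hit_rate_at_k(relevance_total, k):
    """Fraction of questions whose correct document is within the first k results."""
    hits = sum(1 for line in relevance_total if any(line[:k]))
    return hits / len(relevance_total)


# Made-up relevance lists, one per question.
toy_results = [
    [True, False, False],   # found at rank 1
    [False, False, True],   # found at rank 3
    [False, True, False],   # found at rank 2
    [False, False, False],  # not found
]

for k in (1, 2, 3):
    print(k, hit_rate_at_k(toy_results, k))
# 1 0.25
# 2 0.5
# 3 0.75
```

A hit rate that keeps rising as k grows suggests the correct records are being retrieved but ranked low, which is consistent with an MRR that sits well below the hit rate.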