# RAG Test

This notebook tests the RAG flow for the Music Theory Assistant, evaluates how well the search system can retrieve results and how well different LLMs can analyse the relevance of generated answers to given questions.

## Retrieval Flow

This section gets the music theory data from a local CSV file knowledge base containing a selection of songs from different musical genres. The retrieval flow is then tested using the following different methods:

- minsearch (text search)
- Qdrant (vector search)
- Qdrant (vector search hybrid)
- Qdrant (vector search hybrid re-ranked)

#### Get the data

In [1]:
import pandas as pd

In [2]:
csv = '../data/music-theory-dataset-100.csv'

df = pd.read_csv(csv)
df.columns = df.columns.str.lower().str.replace(' ', '_')

In [3]:
df.to_csv(csv, index=False)

In [4]:
print('Shape (rows and columns):', df.shape)
df.head(2)

Shape (rows and columns): (100, 11)


Unnamed: 0,id,title,artist,genre,key,tempo_bpm,time_signature,chord_progression,roman_numerals,cadence,theory_notes
0,0,Let It Be,The Beatles,Pop,C major,76,4/4,C – G – Am – F – C – G – F – C,I – V – vi – IV – I – V – IV – I,Authentic (IV–I) at end; Deceptive (V–vi) earlier,Diatonic progression; Deceptive cadence in ear...
1,1,Hotel California,Eagles,Rock,Bm,74,4/4,Bm – F# – A – E – G – D – Em – F#,i – V – VII – IV – VI – III – iv – V,Half cadence (iv–V),Modal interchange; Natural VII chord; Aeolian ...


### Retrieval Flow - minsearch (text search)

This section gets the music theory data from a local CSV file knowledge base containing a selection of songs from different musical genres. It is then indexed in [minsearch](https://github.com/alexeygrigorev/minsearch) and queried. This same dataset is then passed to an [LLM (OpenAI - GPT-4o mini)](https://platform.openai.com/docs/models/gpt-4o-mini) and queried again to check the accuracy of the results.

#### Install minsearch

In [17]:
import os

if not os.path.exists("../notebooks/minsearch.py"):
    # Install the package
    os.system("pipenv install minsearch")
    # Download the file
    os.system("wget https://raw.githubusercontent.com/alexeygrigorev/minsearch/main/minsearch.py")

#### Index with minsearch

In [18]:
import minsearch



In [19]:
df.columns

Index(['id', 'title', 'artist', 'genre', 'key', 'tempo_bpm', 'time_signature',
       'chord_progression', 'roman_numerals', 'cadence', 'theory_notes'],
      dtype='object')

In [20]:
# Covert numeric fields to string to prevent parsing errors in minsearch
df['tempo_bpm'] = df['tempo_bpm'].apply(lambda i: str(i))

documents = df.to_dict(orient='records')
documents[0]

{'id': 0,
 'title': 'Let It Be',
 'artist': 'The Beatles',
 'genre': 'Pop',
 'key': 'C major',
 'tempo_bpm': '76',
 'time_signature': '4/4',
 'chord_progression': 'C – G – Am – F – C – G – F – C',
 'roman_numerals': 'I – V – vi – IV – I – V – IV – I',
 'cadence': 'Authentic (IV–I) at end; Deceptive (V–vi) earlier',
 'theory_notes': 'Diatonic progression; Deceptive cadence in early phrase; Clear tonic return'}

In [21]:
index = minsearch.Index(
    text_fields=['title', 'artist', 'genre', 'key', 'tempo_bpm', 'time_signature',
       'chord_progression', 'roman_numerals', 'cadence', 'theory_notes'],
    keyword_fields=[]
)

In [22]:
index.fit(documents)

<minsearch.Index at 0x79a0caedf410>

In [23]:
query = "Give me Folk titles"

In [24]:
index.search(query, num_results=5)

[{'id': 5,
  'title': 'House of the Rising Sun',
  'artist': 'The Animals',
  'genre': 'Folk',
  'key': 'Am',
  'tempo_bpm': '76',
  'time_signature': '6/8',
  'chord_progression': 'Am – C – D – F – Am – E – Am',
  'roman_numerals': 'i – III – IV – VI – i – V – i',
  'cadence': 'Authentic (V–i)',
  'theory_notes': 'Aeolian mode; 6/8 compound meter; Traditional folk harmony'},
 {'id': 17,
  'title': "The Times They Are A-Changin'",
  'artist': 'Bob Dylan',
  'genre': 'Folk',
  'key': 'G major',
  'tempo_bpm': '76',
  'time_signature': '3/4',
  'chord_progression': 'G – Em – C – G – Am – D – G – Em – D – G',
  'roman_numerals': 'I – vi – IV – I – ii – V – I – vi – V – I',
  'cadence': 'Authentic (V–I)',
  'theory_notes': 'Folk protest song; Simple diatonic harmony; Waltz meter'},
 {'id': 82,
  'title': 'Hallelujah',
  'artist': 'Leonard Cohen',
  'genre': 'Folk',
  'key': 'C major',
  'tempo_bpm': '72',
  'time_signature': '6/8',
  'chord_progression': 'C – Am – F – G – Em – Am – F – G –

#### minsearch to LLM

In [25]:
from openai import OpenAI

client = OpenAI()

In [26]:
def search(query):
    boost = {}

    results = index.search(
        query=query,
        filter_dict={},
        boost_dict=boost,
        num_results=10
    )

    return results

In [27]:
prompt_template = """
You're a music teacher. Answer the QUESTION based on the CONTEXT from our music theory database.
Use only the facts from the CONTEXT when answering the QUESTION.

QUESTION: {question}

CONTEXT:
{context}
""".strip()

entry_template = """
title: {title}
artist: {artist}
genre: {genre}
key: {key}
tempo_bpm: {tempo_bpm}
time_signature: {time_signature}
chord_progression: {chord_progression}
roman_numerals: {roman_numerals}
cadence: {cadence}
theory_notes: {theory_notes}
""".strip()

In [28]:
def build_prompt(query, search_results):
    context = ""
    
    for doc in search_results:
        context = context + entry_template.format(**doc) + "\n\n"

    prompt = prompt_template.format(question=query, context=context).strip()
    return prompt

In [29]:
def llm(prompt, model='gpt-4o-mini'):
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}]
    )
    
    return response.choices[0].message.content

In [31]:
from typing import Callable, List, Dict, Any

def rag(
    query: str,
    retriever_fn: Callable[[str], List[Dict[str, Any]]] = None,
    model: str = "gpt-4o-mini",
    k: int = 5,
):
    """
    Single RAG:
      - Pass which retriever to use via retriever_fn (e.g., search, vector_search, vector_search_hybrid or vector_search_hybrid_rerank)
      - Reuses existing build_prompt() and llm()
      - Defaults to text/minsearch via search(...)
    """
    if retriever_fn is None:
        retriever_fn = search  # default to the text (minsearch) retriever

    search_results = retriever_fn(query)
    
    # If your retriever returns more than you need, trim to top-k
    if isinstance(search_results, list):
        search_results = search_results[:k]

    prompt = build_prompt(query, search_results)
    answer = llm(prompt, model=model)
    return answer

In [32]:
question = 'Is Mr Tambourine Man a folk song and if so, explain why, including information about its key and cadence, and explain what cadence means?'
answer = rag(question)
print(answer)

Yes, "Mr. Tambourine Man" is considered a folk song. It falls under the folk genre, which is characterized by its acoustic sounds and storytelling elements. The song is in the key of D major and has a tempo of 110 BPM in a 4/4 time signature.

The chord progression used in "Mr. Tambourine Man" is D – G – A – D – Bm – G – A – D, which translates to the Roman numeral analysis of I – IV – V – I – vi – IV – V – I. The cadence at the end of the phrases is an authentic cadence (V–I), which means it resolves from the dominant chord (V) to the tonic chord (I), providing a sense of closure and completion in the music.

A cadence in music theory refers to a sequence of chords that brings a phrase, section, or piece of music to a close. Authentic cadences are particularly strong and are often used to signal the end of a musical passage. In "Mr. Tambourine Man," the use of the authentic cadence reinforces its folk characteristics by bringing coherence to its melodic and harmonic structure.


In [33]:
question = 'Could you print a list of songs by The Beatles?'
answer = rag(question)
print(answer)

Based on the provided context, the only song listed by The Beatles is:

- **Let It Be**  
  - Genre: Pop  
  - Key: C major  
  - Tempo: 76 BPM  
  - Time Signature: 4/4  
  - Chord Progression: C – G – Am – F – C – G – F – C  
  - Roman Numerals: I – V – vi – IV – I – V – IV – I  
  - Cadence: Authentic (IV–I) at end; Deceptive (V–vi) earlier  
  - Theory Notes: Diatonic progression; Deceptive cadence in early phrase; Clear tonic return


In [34]:
question = 'Could you print a list of songs in the key of C Major?'
answer = rag(question)
print(answer)

Here is a list of songs in the key of C Major:

1. **Let It Be**  
   Artist: The Beatles  
   Genre: Pop  

2. **My Girl**  
   Artist: The Temptations  
   Genre: Soul  

3. **Brown Sugar**  
   Artist: The Rolling Stones  
   Genre: Rock  


### Retrieval Flow - Qdrant (vector search)

This section gets the music theory data from a local CSV file knowledge base containing a selection of songs from different musical genres. It is then indexed in [Qdrant](https://qdrant.tech/) and queried. This same dataset is then passed to an [LLM (OpenAI - GPT-4o mini)](https://platform.openai.com/docs/models/gpt-4o-mini) and queried again to check the accuracy of the results.

Install qdrant and fastembed (if not already installed during project setup):

```bash
pipenv install -q "qdrant-client[fastembed]>=1.14.2"
```

Run in Docker:

```bash
docker pull qdrant/qdrant

docker run -p 6333:6333 -p 6334:6334 \
   -v "$(pwd)/qdrant_storage:/qdrant/storage:z" \
   qdrant/qdrant
```

In [36]:
from qdrant_client import QdrantClient, models

In [37]:
# Qdrant setup
qd_client = QdrantClient("http://localhost:6333")

In [38]:
EMBEDDING_MODEL = "jinaai/jina-embeddings-v2-small-en"
EMBEDDING_DIMENSIONALITY = 512
COLLECTION_NAME = "zoomcamp-music-theory-assistant"

In [39]:
# delete the collection if it already exists
qd_client.delete_collection(collection_name=COLLECTION_NAME)

True

In [40]:
qd_client.create_collection(
    collection_name=COLLECTION_NAME,
    vectors_config=models.VectorParams(
        size=EMBEDDING_DIMENSIONALITY,
        distance=models.Distance.COSINE
    )
)

True

In [42]:
# Prepare points
points = []
for doc in documents:
    # Build a single searchable text string from the document fields
    text = " | ".join([
        str(doc["title"]),
        str(doc["artist"]),
        f"Genre: {doc['genre']}",
        f"Key: {doc['key']}",
        f"Tempo: {doc['tempo_bpm']} BPM",
        f"Time: {doc['time_signature']}",
        f"Chords: {doc['chord_progression']}",
        f"Roman: {doc['roman_numerals']}",
        f"Cadence: {doc['cadence']}",
        f"Notes: {doc['theory_notes']}",
    ])

    vector = models.Document(text=text, model=EMBEDDING_MODEL)  # Qdrant client auto-embeds this
    point = models.PointStruct(id=int(doc["id"]), vector=vector, payload=doc)
    points.append(point)

In [44]:
qd_client.upsert(
    collection_name=COLLECTION_NAME,
    points=points
)

Fetching 5 files:   0%|          | 0/5 [00:00<?, ?it/s]

config.json: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/125 [00:00<?, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

tokenizer_config.json:   0%|          | 0.00/367 [00:00<?, ?B/s]

onnx/model.onnx:   0%|          | 0.00/130M [00:00<?, ?B/s]

UpdateResult(operation_id=0, status=<UpdateStatus.COMPLETED: 'completed'>)

In [79]:
def vector_search(query):

    query_points = qd_client.query_points(
        collection_name=COLLECTION_NAME,
        query=models.Document(
            text=query,
            model=EMBEDDING_MODEL 
        ),
        limit=10,
        with_payload=True
    )
    
    results = []
    
    for point in query_points.points:
        results.append(point.payload)
    
    return results

In [50]:
question = 'Is Mr Tambourine Man a folk song and if so, explain why, including information about its key and cadence, and explain what cadence means?'
answer = rag(question, retriever_fn=vector_search)
print(answer)

Yes, "Mr. Tambourine Man" is considered a folk song. It falls within the folk genre, which often features storytelling, simple structures, and themes that resonate with the human experience. 

The song is in the key of D major and has a tempo of 110 BPM with a time signature of 4/4. Its chord progression is D – G – A – D – Bm – G – A – D, corresponding to the Roman numerals I – IV – V – I – vi – IV – V – I in the key of D major. 

The cadences in the song are authentic cadences, specifically V–I, which means that the fifth scale degree (V) resolves to the first scale degree (I). An authentic cadence provides a sense of closure or resolution in music, signaling the end of a phrase or section. This characteristic contributes to the song's folk quality by creating a familiar and satisfying musical experience. 

Overall, the combination of its genre, key, chord progression, and the use of authentic cadences are key elements that affirm "Mr. Tambourine Man" as a folk song.


### Retrieval Flow - Qdrant (hybrid search)

In [54]:
COLLECTION_NAME_HYBRID = "zoomcamp-music-theory-assistant-hybrid"

In [55]:
# delete the collection if it already exists
qd_client.delete_collection(collection_name=COLLECTION_NAME_HYBRID)

True

In [56]:
qd_client.create_collection(
    collection_name=COLLECTION_NAME_HYBRID,
    vectors_config={
        "dense": models.VectorParams(
            size=EMBEDDING_DIMENSIONALITY,
            distance=models.Distance.COSINE
        )
    },
    sparse_vectors_config={
        "bm25": models.SparseVectorParams(
            modifier=models.Modifier.IDF
        )
    }
)

True

In [57]:
# Prepare points for the HYBRID collection
points = []
for doc in documents:
    # Build your single searchable text string (same as you do now)
    text = " | ".join([
        str(doc["title"]),
        str(doc["artist"]),
        f"Genre: {doc['genre']}",
        f"Key: {doc['key']}",
        f"Tempo: {doc['tempo_bpm']} BPM",
        f"Time: {doc['time_signature']}",
        f"Chords: {doc['chord_progression']}",
        f"Roman: {doc['roman_numerals']}",
        f"Cadence: {doc['cadence']}",
        f"Notes: {doc['theory_notes']}",
    ])

    # Named dense vector; Qdrant will auto-embed this Document with EMBEDDING_MODEL
    dense_vec = {"dense": models.Document(text=text, model=EMBEDDING_MODEL)}

    # IMPORTANT: include 'id' in payload so your evaluator can match d['id'] == q['id']
    payload = dict(doc)  # keep your fields
    payload["id"] = int(doc["id"])
    payload["text"] = text  # handy for prompting / reranking

    point = models.PointStruct(id=int(doc["id"]), vector=dense_vec, payload=payload)
    points.append(point)

In [58]:
qd_client.upsert(
    collection_name=COLLECTION_NAME_HYBRID,
    points=points
)

UpdateResult(operation_id=0, status=<UpdateStatus.COMPLETED: 'completed'>)

In [59]:
def vector_search_hybrid(question: str, limit: int = 10):
    results = qd_client.query_points(
        collection_name=COLLECTION_NAME_HYBRID,
        prefetch=[
            # Sparse / BM25
            models.Prefetch(
                query=models.Document(text=question, model="Qdrant/bm25"),
                using="bm25",
                limit=5 * limit,   # fetch extra for better fusion
            ),
            # Dense (auto-embed the query with the same model)
            models.Prefetch(
                query=models.Document(text=question, model=EMBEDDING_MODEL),
                using="dense",
                limit=5 * limit,
            ),
        ],
        query=models.FusionQuery(fusion=models.Fusion.RRF),
        with_payload=True,
        limit=limit,
    )

    # Return list[dict] with 'id' (the evaluator depends on it)
    return [p.payload for p in results.points]

In [60]:
# simple lexical overlap re-ranker that can be applied on top of the fused list
def rerank_lexical_overlap(query: str, docs: list[dict], text_key: str = "text", top_k: int | None = None) -> list[dict]:
    q = set((query or "").lower().split())
    rescored = []
    for d in docs:
        t = (d.get(text_key, "") or "").lower()
        toks = set(t.split())
        denom = len(q | toks) or 1
        dd = dict(d)
        dd["score_rerank"] = len(q & toks) / denom
        rescored.append(dd)
    rescored.sort(key=lambda x: x["score_rerank"], reverse=True)
    return rescored[:top_k] if top_k else rescored

In [64]:
def vector_search_hybrid_rerank(question: str, limit: int = 10):
    docs = vector_search_hybrid(question, limit=limit)
    return rerank_lexical_overlap(question, docs, text_key="text", top_k=limit)

In [65]:
question = 'Is Mr Tambourine Man a folk song and if so, explain why, including information about its key and cadence, and explain what cadence means?'
answer = rag(question, retriever_fn=vector_search_hybrid)
print(answer)

Yes, "Mr. Tambourine Man" is considered a folk song. This classification is largely due to its roots in the folk genre, as noted in the context. 

The song is in the key of D major and has a tempo of 110 bpm with a 4/4 time signature. Its chord progression is D – G – A – D – Bm – G – A – D, which reflects simple harmonic structures typical of folk music. An important aspect of its musicality is the authentic cadence (V–I), which provides a strong sense of resolution and is commonly found in folk songs. 

A cadence in music refers to a sequence of chords that brings a phrase or piece to a close. Specifically, an authentic cadence resolves from the dominant chord (V) to the tonic chord (I), creating a satisfying conclusion to the musical phrase, which can be particularly characteristic of folk styles. In "Mr. Tambourine Man," the authentic cadence contributes to its folk-rock blend and engaging rhythmic quality.


In [66]:
question = 'Is Mr Tambourine Man a folk song and if so, explain why, including information about its key and cadence, and explain what cadence means?'
answer = rag(question, retriever_fn=vector_search_hybrid_rerank)
print(answer)

Yes, "Mr. Tambourine Man" is considered a folk song. It falls under the genre of folk music as classified in the context. The song is composed in the key of D major and features a tempo of 110 beats per minute with a 4/4 time signature. The chord progression used in the song is D – G – A – D – Bm – G – A – D, which can be translated into Roman numerals as I – IV – V – I – vi – IV – V – I.

The cadence of "Mr. Tambourine Man" is characterized as an Authentic cadence (V–I), which consists of a dominant chord resolving to the tonic chord. In music theory, a cadence refers to a sequence of chords that brings a musical phrase to a close. Authentic cadences, in particular, create a strong sense of resolution and finality.


## Retrieval Evaluation

This section is measuring how well the search system (using minsearch and Qdrant) can retrieve the correct song record for a set of ground-truth questions. Here’s what it does:

1. Loads ground-truth data: Reads a CSV file (ground-truth-retrieval.csv) containing questions and the correct song id for each question.
2. Defines evaluation metrics:
    - **Hit Rate**: The fraction of questions for which the correct song appears anywhere in the top search results.
    - **MRR (Mean Reciprocal Rank)**: Measures how high the correct song appears in the ranked results (higher is better).
3. Runs the search: For each question, it uses minsearch to retrieve the top results.
4. Checks relevance: Compares the id of each result to the ground-truth id to see if the correct song was retrieved and at what rank.
5. Calculates metrics: Aggregates the results to compute overall hit rate and MRR, giving a quantitative measure of your retrieval system’s accuracy.

In [67]:
# Read the local ground truth dataset
df_question = pd.read_csv('../data/ground-truth-retrieval.csv')

In [68]:
df_question.head()

Unnamed: 0,id,question
0,0,What is the key of the song 'Let It Be' by The...
1,0,Can you provide the chord progression for 'Let...
2,0,What is the tempo in beats per minute for 'Let...
3,0,Which cadence is used at the end of 'Let It Be'?
4,0,What is the time signature of 'Let It Be'?


In [69]:
ground_truth = df_question.to_dict(orient='records')

In [70]:
ground_truth[0]

{'id': 0,
 'question': "What is the key of the song 'Let It Be' by The Beatles?"}

In [71]:
def hit_rate(relevance_total):
    cnt = 0

    for line in relevance_total:
        if True in line:
            cnt = cnt + 1

    return cnt / len(relevance_total)

def mrr(relevance_total):
    total_score = 0.0

    for line in relevance_total:
        for rank in range(len(line)):
            if line[rank] == True:
                total_score = total_score + 1 / (rank + 1)

    return total_score / len(relevance_total)

In [72]:
def evaluate(ground_truth, search_function):
    relevance_total = []

    for q in tqdm(ground_truth):
        doc_id = q['id']
        results = search_function(q)
        relevance = [d['id'] == doc_id for d in results]
        relevance_total.append(relevance)

    return {
        'hit_rate': hit_rate(relevance_total),
        'mrr': mrr(relevance_total),
    }

### Retrieval Evaluation - minsearch (text search)

In [74]:
from tqdm.auto import tqdm

In [75]:
evaluate(ground_truth, lambda q: search(q['question']))

  0%|          | 0/500 [00:00<?, ?it/s]

{'hit_rate': 0.914, 'mrr': 0.6267063492063489}

#### Improve Parameters for minsearch

In [48]:
df_validation = df_question[:100]
df_test = df_question[100:]

In [49]:
import random

def simple_optimize(param_ranges, objective_function, n_iterations=10):
    best_params = None
    best_score = float('-inf')  # Assuming we're minimizing. Use float('-inf') if maximizing.

    for _ in range(n_iterations):
        # Generate random parameters
        current_params = {}
        for param, (min_val, max_val) in param_ranges.items():
            if isinstance(min_val, int) and isinstance(max_val, int):
                current_params[param] = random.randint(min_val, max_val)
            else:
                current_params[param] = random.uniform(min_val, max_val)
        
        # Evaluate the objective function
        current_score = objective_function(current_params)
        
        # Update best if current is better
        if current_score > best_score:  # Change to > if maximizing
            best_score = current_score
            best_params = current_params
    
    return best_params, best_score

In [50]:
gt_val = df_validation.to_dict(orient='records')

In [51]:
def minsearch_search(query, boost=None):
    if boost is None:
        boost = {}

    results = index.search(
        query=query,
        filter_dict={},
        boost_dict=boost,
        num_results=10
    )

    return results

In [52]:
param_ranges = {
    'title': (0.0, 3.0),
    'artist': (0.0, 3.0),
    'genre': (0.0, 3.0),
    'key': (0.0, 3.0),
    'tempo_bpm': (0.0, 3.0),
    'time_signature': (0.0, 3.0),
    'chord_progression': (0.0, 3.0),
    'roman_numerals': (0.0, 3.0),
    'cadence': (0.0, 3.0),
    'theory_notes': (0.0, 3.0)
}

def objective(boost_params):
    def search_function(q):
        return minsearch_search(q['question'], boost_params)

    results = evaluate(gt_val, search_function)
    return results['mrr']

In [53]:
simple_optimize(param_ranges, objective, n_iterations=20)

  0%|          | 0/100 [00:00<?, ?it/s]

  0%|          | 0/100 [00:00<?, ?it/s]

  0%|          | 0/100 [00:00<?, ?it/s]

  0%|          | 0/100 [00:00<?, ?it/s]

  0%|          | 0/100 [00:00<?, ?it/s]

  0%|          | 0/100 [00:00<?, ?it/s]

  0%|          | 0/100 [00:00<?, ?it/s]

  0%|          | 0/100 [00:00<?, ?it/s]

  0%|          | 0/100 [00:00<?, ?it/s]

  0%|          | 0/100 [00:00<?, ?it/s]

  0%|          | 0/100 [00:00<?, ?it/s]

  0%|          | 0/100 [00:00<?, ?it/s]

  0%|          | 0/100 [00:00<?, ?it/s]

  0%|          | 0/100 [00:00<?, ?it/s]

  0%|          | 0/100 [00:00<?, ?it/s]

  0%|          | 0/100 [00:00<?, ?it/s]

  0%|          | 0/100 [00:00<?, ?it/s]

  0%|          | 0/100 [00:00<?, ?it/s]

  0%|          | 0/100 [00:00<?, ?it/s]

  0%|          | 0/100 [00:00<?, ?it/s]

({'title': 2.8323160442386586,
  'artist': 0.5818272140863803,
  'genre': 0.7484183054006688,
  'key': 1.518348518481425,
  'tempo_bpm': 1.0219632472572906,
  'time_signature': 0.8019925951212504,
  'chord_progression': 2.6938161757646277,
  'roman_numerals': 1.922131437342809,
  'cadence': 1.0644438314732327,
  'theory_notes': 0.15173237549453888},
 0.8453333333333333)

In [54]:
def minsearch_improved(query):
    boost = {
        'title': 2.83,
        'artist': 0.58,
        'genre': 0.75,
        'key': 1.52,
        'tempo_bpm': 1.02,
        'time_signature': 0.80,
        'chord_progression': 2.69,
        'roman_numerals': 1.92,
        'cadence': 1.06,
        'theory_notes': 0.15
    }

    results = index.search(
        query=query,
        filter_dict={},
        boost_dict=boost,
        num_results=10
    )

    return results

evaluate(ground_truth, lambda q: minsearch_improved(q['question']))

  0%|          | 0/500 [00:00<?, ?it/s]

{'hit_rate': 0.926, 'mrr': 0.91015}

### Retrieval Evaluation - Qdrant (vector search)

In [76]:
evaluate(ground_truth, lambda q: vector_search(q['question']))

  0%|          | 0/500 [00:00<?, ?it/s]

{'hit_rate': 0.914, 'mrr': 0.8712936507936508}

### Retrieval Evaluation - Qdrant (vector search hybrid)

In [77]:
evaluate(ground_truth, lambda q: vector_search_hybrid(q['question']))

  0%|          | 0/500 [00:00<?, ?it/s]

{'hit_rate': 0.914, 'mrr': 0.8712936507936508}

### Retrieval Evaluation - Qdrant (vector search hybrid re-rank)

In [78]:
evaluate(ground_truth, lambda q: vector_search_hybrid_rerank(q['question']))

  0%|          | 0/500 [00:00<?, ?it/s]

{'hit_rate': 0.914, 'mrr': 0.5521246031746028}

## LLM Evaluation

The RAG flow is evaluated below using the following 2 offline methods:
- Cosine Similarity
- LLM-as-a-Judge

### Cosine Similarity

The code below is used to calculate the cosine similarity between an answer generated by the RAG system with the actual answer from the music theory dataset. First it evaluates the cosine for a single question, then it generates a CSV with LLM answers to all of the questions in the ground truth dataset and evaluates the cosine with the original dataset for each one.

In [56]:
# Get a question from the ground truth data set
eval_question = ground_truth[3]
print(eval_question)

{'id': 0, 'question': "Which cadence is used at the end of 'Let It Be'?"}


In [59]:
# use the Minsearch improved search to get the answer to the question above (because it's the best one so far)
answer_llm = rag(eval_question['question'], retriever_fn=minsearch_improved)
print(answer_llm)

The cadence used at the end of "Let It Be" is an Authentic (IV–I) cadence.


In [60]:
# Get the original dataset item with answer to the question above
documents[eval_question['id']]

{'id': 0,
 'title': 'Let It Be',
 'artist': 'The Beatles',
 'genre': 'Pop',
 'key': 'C major',
 'tempo_bpm': '76',
 'time_signature': '4/4',
 'chord_progression': 'C – G – Am – F – C – G – F – C',
 'roman_numerals': 'I – V – vi – IV – I – V – IV – I',
 'cadence': 'Authentic (IV–I) at end; Deceptive (V–vi) earlier',
 'theory_notes': 'Diatonic progression; Deceptive cadence in early phrase; Clear tonic return'}

In [61]:
doc_idx = {d['id']: d for d in documents}
answer_orig = doc_idx[eval_question['id']]['cadence']
print(answer_orig)

Authentic (IV–I) at end; Deceptive (V–vi) earlier


In [63]:
from sentence_transformers import SentenceTransformer

# Load a sentence transformer model to generate embeddings
model = SentenceTransformer('all-MiniLM-L6-v2')

# Encode answers
v_orig = model.encode(answer_orig)
v_llm = model.encode(answer_llm)

# Compute cosine similarity
cos_sim = v_llm.dot(v_orig)
print("Cosine similarity:", cos_sim)

Cosine similarity: 0.47456914


In [67]:
# Now generate a file containing the ground truth questions with answers
# from both the LLM and the original dataset
OUTPUT_PATH = "../data/results-gpt4o-mini.csv"

if not os.path.exists(OUTPUT_PATH):
    answers = {}

    for i, rec in enumerate(tqdm(ground_truth)):
        if i in answers:
            continue
    
        answer_llm = rag(rec['question'], retriever_fn=minsearch_improved)
        doc_id = rec['id']
        original_doc = doc_idx[doc_id]
        answer_orig = " | ".join(f"{k}: {v}" for k, v in original_doc.items())
    
        answers[i] = {      
            'id': doc_id,
            'question': rec['question'],
            'answer_llm': answer_llm,
            'answer_orig': answer_orig
        }

    results_gpt4o_mini = [None] * len(ground_truth)

    for i, val in answers.items():
        results_gpt4o_mini[i] = val.copy()
        results_gpt4o_mini[i].update(ground_truth[i])

    df_gpt4o_mini = pd.DataFrame(results_gpt4o_mini)

    !mkdir data

    df_gpt4o_mini.to_csv(OUTPUT_PATH, index=False)
    print(f"Saved {len(df_gpt4o_mini)} rows to data/results-gpt4o-mini.csv")
else:
    print(f"File already exists: {OUTPUT_PATH} — skipping generation.")
    df_gpt4o_mini = pd.read_csv(OUTPUT_PATH)
    df_gpt4o_mini.to_csv(OUTPUT_PATH, index=False)

  0%|          | 0/500 [00:00<?, ?it/s]

mkdir: cannot create directory ‘data’: File exists
Saved 500 rows to data/results-gpt4o-mini.csv


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


In [68]:
 df_gpt4o_mini.sample(n=5)

Unnamed: 0,id,question,answer_llm,answer_orig
299,59,How does the song structure utilize call and r...,The song structure utilizes call and response ...,id: 59 | title: We Will Rock You | artist: Que...
38,7,What is the significance of the cadence found ...,"The significance of the cadence found in ""The ...",id: 7 | title: Moonlight Sonata | artist: Ludw...
404,80,What emotional themes are present in 'Hurt'?,"The emotional themes present in ""Hurt"" by Nine...",id: 80 | title: Hurt | artist: Nine Inch Nails...
342,68,What is the tempo of 'Light My Fire' in BPM?,"The tempo of ""Light My Fire"" is 126 BPM.",id: 68 | title: Light My Fire | artist: The Do...
313,62,What is the tempo of the track 'Superstition'?,The tempo of the track 'Superstition' is 100 BPM.,id: 62 | title: Superstition | artist: Stevie ...


#### gpt-4o-mini

In [69]:
results_gpt4o_mini = df_gpt4o_mini.to_dict(orient='records')

In [70]:
record = results_gpt4o_mini[0]

In [71]:
def compute_similarity(record):
    answer_orig = record['answer_orig']
    answer_llm = record['answer_llm']
    
    v_llm = model.encode(answer_llm)
    v_orig = model.encode(answer_orig)
    
    return v_llm.dot(v_orig)

In [72]:
similarity = []

for record in tqdm(results_gpt4o_mini):
    sim = compute_similarity(record)
    similarity.append(sim)

  0%|          | 0/500 [00:00<?, ?it/s]

In [74]:
df_gpt4o_mini['cosine'] = similarity
df_gpt4o_mini['cosine'].describe()

count    500.000000
mean       0.568715
std        0.120105
min        0.205746
25%        0.505190
50%        0.582068
75%        0.654365
max        0.887017
Name: cosine, dtype: float64

### LLM-as-a-Judge

The code below uses an LLM to determine the similarity or relevance between questions and the answers generated by an LLM.

In [79]:
prompt_template_evaluation = """
You are an expert evaluator for a Retrieval-Augmented Generation (RAG) system.
Your task is to analyze the relevance of the generated answer to the given question.
Based on the relevance of the generated answer, you will classify it
as "NON_RELEVANT", "PARTLY_RELEVANT", or "RELEVANT".

Here is the data for evaluation:

Question: {question}
Generated Answer: {answer_llm}

Please analyze the content and context of the generated answer in relation to the question
and provide your evaluation in parsable JSON without using code blocks. Return ONLY valid JSON
with double quotes, no comments, and no trailing commas. For example:

{{
  "Relevance": "NON_RELEVANT" | "PARTLY_RELEVANT" | "RELEVANT",
  "Explanation": "[Provide a brief explanation for your evaluation]"
}}
""".strip()

In [80]:
len(ground_truth)

500

In [81]:
record = ground_truth[0]
question = record['question']
answer_llm = rag(question, retriever_fn=minsearch_improved)

In [82]:
print(question)

What is the key of the song 'Let It Be' by The Beatles?


In [83]:
print(answer_llm)

The key of the song 'Let It Be' by The Beatles is C major.


In [84]:
prompt = prompt_template_evaluation.format(question=question, answer_llm=answer_llm)
print(prompt)

You are an expert evaluator for a Retrieval-Augmented Generation (RAG) system.
Your task is to analyze the relevance of the generated answer to the given question.
Based on the relevance of the generated answer, you will classify it
as "NON_RELEVANT", "PARTLY_RELEVANT", or "RELEVANT".

Here is the data for evaluation:

Question: What is the key of the song 'Let It Be' by The Beatles?
Generated Answer: The key of the song 'Let It Be' by The Beatles is C major.

Please analyze the content and context of the generated answer in relation to the question
and provide your evaluation in parsable JSON without using code blocks. Return ONLY valid JSON
with double quotes, no comments, and no trailing commas. For example:

{
  "Relevance": "NON_RELEVANT" | "PARTLY_RELEVANT" | "RELEVANT",
  "Explanation": "[Provide a brief explanation for your evaluation]"
}


In [85]:
llm(prompt)

'{\n  "Relevance": "RELEVANT",\n  "Explanation": "The generated answer directly addresses the question by providing the key of the song \'Let It Be\' by The Beatles, which is exactly what was asked for." \n}'

In [86]:
import json

In [87]:
df_sample = df_question.sample(n=200, random_state=1)

In [88]:
sample = df_sample.to_dict(orient='records')

In [89]:
evaluations = []

for record in tqdm(sample):
    question = record['question']
    answer_llm = rag(question, retriever_fn=minsearch_improved) 

    prompt = prompt_template_evaluation.format(
        question=question,
        answer_llm=answer_llm
    )

    evaluation = llm(prompt)
    evaluation = json.loads(evaluation)

    evaluations.append((record, answer_llm, evaluation))

  0%|          | 0/200 [00:00<?, ?it/s]

In [90]:
df_eval = pd.DataFrame(evaluations, columns=['record', 'answer', 'evaluation'])

df_eval['id'] = df_eval.record.apply(lambda d: d['id'])
df_eval['question'] = df_eval.record.apply(lambda d: d['question'])

df_eval['relevance'] = df_eval.evaluation.apply(lambda d: d['Relevance'])
df_eval['explanation'] = df_eval.evaluation.apply(lambda d: d['Explanation'])

del df_eval['record']
del df_eval['evaluation']

In [91]:
df_eval.relevance.value_counts()

relevance
RELEVANT           195
PARTLY_RELEVANT      4
NON_RELEVANT         1
Name: count, dtype: int64

In [92]:
df_eval.relevance.value_counts(normalize=True)

relevance
RELEVANT           0.975
PARTLY_RELEVANT    0.020
NON_RELEVANT       0.005
Name: proportion, dtype: float64

In [93]:
df_eval.to_csv('../data/rag-eval-gpt-4o-mini.csv', index=False)

In [94]:
df_eval[df_eval.relevance == 'NON_RELEVANT']

Unnamed: 0,answer,id,question,relevance,explanation
72,The time signature for the piece Boléro is 3/4.,29,What is the time signature for the piece Boléro?,NON_RELEVANT,"The time signature for Boléro is 3/4, which is..."


In [95]:
evaluations_gpt4o = []

for record in tqdm(sample):
    question = record['question']
    answer_llm = rag(question, retriever_fn=minsearch_improved, model='gpt-4o') 

    prompt = prompt_template_evaluation.format(
        question=question,
        answer_llm=answer_llm
    )

    evaluation = llm(prompt)
    evaluation = json.loads(evaluation)
    
    evaluations_gpt4o.append((record, answer_llm, evaluation))

  0%|          | 0/200 [00:00<?, ?it/s]

In [96]:
df_eval = pd.DataFrame(evaluations_gpt4o, columns=['record', 'answer', 'evaluation'])

df_eval['id'] = df_eval.record.apply(lambda d: d['id'])
df_eval['question'] = df_eval.record.apply(lambda d: d['question'])

df_eval['relevance'] = df_eval.evaluation.apply(lambda d: d['Relevance'])
df_eval['explanation'] = df_eval.evaluation.apply(lambda d: d['Explanation'])

del df_eval['record']
del df_eval['evaluation']

In [97]:
df_eval.relevance.value_counts()

relevance
RELEVANT           193
PARTLY_RELEVANT      6
NON_RELEVANT         1
Name: count, dtype: int64

In [98]:
df_eval.relevance.value_counts(normalize=True)

relevance
RELEVANT           0.965
PARTLY_RELEVANT    0.030
NON_RELEVANT       0.005
Name: proportion, dtype: float64

In [99]:
df_eval.to_csv('../data/rag-eval-gpt-4o.csv', index=False)

In [100]:
df_eval[df_eval.relevance == 'NON_RELEVANT']

Unnamed: 0,answer,id,question,relevance,explanation
72,"The time signature for the piece ""Boléro"" by M...",29,What is the time signature for the piece Boléro?,NON_RELEVANT,The time signature for Ravel's 'Boléro' is act...
