# RAG Test

This notebook tests the RAG flow for the Music Theory Assistant. It evaluates how well the search system retrieves the right records and how relevant the answers generated by different LLMs are to the given questions.

## Retrieval Flow

This section gets the music theory data from a local CSV file knowledge base containing a selection of songs from different musical genres. The retrieval flow is then tested using the following different methods:

- minsearch (text search)
- Qdrant (vector search)


#### Get the data

In [2]:
import pandas as pd

In [3]:
csv = '../data/music-theory-dataset-100.csv'

df = pd.read_csv(csv)
df.columns = df.columns.str.lower().str.replace(' ', '_')

In [4]:
df.to_csv(csv, index=False)

In [5]:
print('Shape (rows and columns):', df.shape)
df.head(2)

Shape (rows and columns): (100, 11)


| | id | title | artist | genre | key | tempo_bpm | time_signature | chord_progression | roman_numerals | cadence | theory_notes |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Let It Be | The Beatles | Pop | C major | 76 | 4/4 | C – G – Am – F – C – G – F – C | I – V – vi – IV – I – V – IV – I | Authentic (IV–I) at end; Deceptive (V–vi) earlier | Diatonic progression; Deceptive cadence in ear... |
| 1 | 1 | Hotel California | Eagles | Rock | Bm | 74 | 4/4 | Bm – F# – A – E – G – D – Em – F# | i – V – VII – IV – VI – III – iv – V | Half cadence (iv–V) | Modal interchange; Natural VII chord; Aeolian ... |


### Retrieval Flow - minsearch (text search)

This section indexes the music theory data in [minsearch](https://github.com/alexeygrigorev/minsearch) and queries it directly. The top search results are then passed as context to an [LLM (OpenAI - GPT-4o mini)](https://platform.openai.com/docs/models/gpt-4o-mini), which is asked questions about the songs to check the accuracy of the answers.

#### Install minsearch

In [6]:
import os

if not os.path.exists("../notebooks/minsearch.py"):
    # Install the package
    os.system("pip install minsearch")
    # Download the file
    os.system("wget https://raw.githubusercontent.com/alexeygrigorev/minsearch/main/minsearch.py")

#### Index with minsearch

In [7]:
import minsearch



In [8]:
df.columns

Index(['id', 'title', 'artist', 'genre', 'key', 'tempo_bpm', 'time_signature',
       'chord_progression', 'roman_numerals', 'cadence', 'theory_notes'],
      dtype='object')

In [9]:
# Convert numeric fields to strings to prevent parsing errors in minsearch
df['tempo_bpm'] = df['tempo_bpm'].astype(str)

documents = df.to_dict(orient='records')
documents[0]

{'id': 0,
 'title': 'Let It Be',
 'artist': 'The Beatles',
 'genre': 'Pop',
 'key': 'C major',
 'tempo_bpm': '76',
 'time_signature': '4/4',
 'chord_progression': 'C – G – Am – F – C – G – F – C',
 'roman_numerals': 'I – V – vi – IV – I – V – IV – I',
 'cadence': 'Authentic (IV–I) at end; Deceptive (V–vi) earlier',
 'theory_notes': 'Diatonic progression; Deceptive cadence in early phrase; Clear tonic return'}

In [10]:
index = minsearch.Index(
    text_fields=['title', 'artist', 'genre', 'key', 'tempo_bpm', 'time_signature',
       'chord_progression', 'roman_numerals', 'cadence', 'theory_notes'],
    keyword_fields=[]
)

In [11]:
index.fit(documents)

<minsearch.Index at 0x7e9c28bc3fb0>

In [12]:
query = "Give me Folk titles"

In [13]:
index.search(query, num_results=5)

[{'id': 5,
  'title': 'House of the Rising Sun',
  'artist': 'The Animals',
  'genre': 'Folk',
  'key': 'Am',
  'tempo_bpm': '76',
  'time_signature': '6/8',
  'chord_progression': 'Am – C – D – F – Am – E – Am',
  'roman_numerals': 'i – III – IV – VI – i – V – i',
  'cadence': 'Authentic (V–i)',
  'theory_notes': 'Aeolian mode; 6/8 compound meter; Traditional folk harmony'},
 {'id': 17,
  'title': "The Times They Are A-Changin'",
  'artist': 'Bob Dylan',
  'genre': 'Folk',
  'key': 'G major',
  'tempo_bpm': '76',
  'time_signature': '3/4',
  'chord_progression': 'G – Em – C – G – Am – D – G – Em – D – G',
  'roman_numerals': 'I – vi – IV – I – ii – V – I – vi – V – I',
  'cadence': 'Authentic (V–I)',
  'theory_notes': 'Folk protest song; Simple diatonic harmony; Waltz meter'},
 {'id': 82,
  'title': 'Hallelujah',
  'artist': 'Leonard Cohen',
  'genre': 'Folk',
  'key': 'C major',
  'tempo_bpm': '72',
  'time_signature': '6/8',
  'chord_progression': 'C – Am – F – G – Em – Am – F – G –

#### minsearch to LLM

In [14]:
from openai import OpenAI

client = OpenAI()

In [15]:
def search(query):
    boost = {}

    results = index.search(
        query=query,
        filter_dict={},
        boost_dict=boost,
        num_results=10
    )

    return results

In [16]:
prompt_template = """
You're a music teacher. Answer the QUESTION based on the CONTEXT from our music theory database.
Use only the facts from the CONTEXT when answering the QUESTION.

QUESTION: {question}

CONTEXT:
{context}
""".strip()

entry_template = """
title: {title}
artist: {artist}
genre: {genre}
key: {key}
tempo_bpm: {tempo_bpm}
time_signature: {time_signature}
chord_progression: {chord_progression}
roman_numerals: {roman_numerals}
cadence: {cadence}
theory_notes: {theory_notes}
""".strip()

In [17]:
def build_prompt(query, search_results):
    context = ""
    
    for doc in search_results:
        context = context + entry_template.format(**doc) + "\n\n"

    prompt = prompt_template.format(question=query, context=context).strip()
    return prompt

In [18]:
def llm(prompt, model='gpt-4o-mini'):
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}]
    )
    
    return response.choices[0].message.content

In [19]:
def rag(query, model='gpt-4o-mini'):
    search_results = search(query)
    prompt = build_prompt(query, search_results)
    answer = llm(prompt, model=model)
    return answer

In [20]:
question = 'Is Mr Tambourine Man a folk song and if so, explain why, including information about its key and cadence, and explain what cadence means?'
answer = rag(question)
print(answer)

Yes, "Mr. Tambourine Man" is classified as a folk song because it is performed by Bob Dylan, who is a prominent artist in the folk genre. The song is characterized by its lyrical storytelling, a common feature of folk music.

The song is in the key of D major and follows a chord progression of D – G – A – D – Bm – G – A – D, which translates to the Roman numerals I – IV – V – I – vi – IV – V – I. The cadence used in "Mr. Tambourine Man" is an authentic cadence, specifically a V–I cadence, where the dominant chord (V) resolves to the tonic chord (I). 

A cadence in music refers to a progression of at least two chords that brings a phrase or a section of music to a close or creates a sense of resolution. In the case of the authentic cadence, it provides a strong sense of resolution, affirming the tonal center of the piece.


In [21]:
question = 'Could you print a list of songs by The Beatles?'
answer = rag(question)
print(answer)

The only songs by The Beatles listed in the context are:

1. **Let It Be**
   - Key: C major
   - Tempo: 76 BPM
   - Time Signature: 4/4
   - Chord Progression: C – G – Am – F – C – G – F – C

2. **Something**
   - Key: C major
   - Tempo: 66 BPM
   - Time Signature: 4/4
   - Chord Progression: C – G – Am – F – C – G – F – C


In [22]:
question = 'Could you print a list of songs in the key of C Major?'
answer = rag(question)
print(answer)

Here is a list of songs in the key of C Major:

1. **Let It Be** - The Beatles
2. **My Girl** - The Temptations
3. **Brown Sugar** - The Rolling Stones
4. **Great Balls of Fire** - Jerry Lee Lewis
5. **Blue Moon** - Richard Rodgers
6. **Dust in the Wind** - Kansas


### Retrieval Flow - Qdrant (vector search)

This section indexes the same music theory data in [Qdrant](https://qdrant.tech/) and queries it with vector search. The top search results are then passed as context to an [LLM (OpenAI - GPT-4o mini)](https://platform.openai.com/docs/models/gpt-4o-mini), which is asked questions about the songs to check the accuracy of the answers.

Install qdrant and fastembed (if not already installed during project setup):

```bash
pip install -q "qdrant-client[fastembed]>=1.14.2"
```

Run in Docker:

```bash
docker pull qdrant/qdrant

docker run -p 6333:6333 -p 6334:6334 \
   -v "$(pwd)/qdrant_storage:/qdrant/storage:z" \
   qdrant/qdrant
```

In [138]:
!pip show qdrant-client || pip install -q "qdrant-client[fastembed]>=1.14.2"

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Name: qdrant-client
Version: 1.15.1
Summary: Client library for the Qdrant vector search engine
Home-page: https://github.com/qdrant/qdrant-client
Author: Andrey Vasnetsov
Author-email: andrey@qdrant.tech
License: Apache-2.0
Location: /home/codespace/.python/current/lib/python3.12/site-packages
Requires: grpcio, httpx, numpy, portalocker, protobuf, pydantic, urllib3
Required-by: 


In [23]:
from qdrant_client import QdrantClient, models

In [24]:
# Qdrant setup
qd_client = QdrantClient("http://localhost:6333")
collection_name = "zoomcamp-music-theory-assistant"

In [25]:
EMBEDDING_MODEL = "jinaai/jina-embeddings-v2-small-en"
EMBEDDING_DIMENSIONALITY = 512

In [26]:
# delete the collection if it already exists
qd_client.delete_collection(collection_name=collection_name)

True

In [27]:
qd_client.create_collection(
    collection_name=collection_name,
    vectors_config=models.VectorParams(
        size=EMBEDDING_DIMENSIONALITY,
        distance=models.Distance.COSINE
    )
)

True

In [28]:
# (Optional) Add payload indexes for filtering later, e.g., by genre or cadence
# qd_client.create_payload_index(collection_name=collection_name, field_name="genre", field_schema="keyword")
# qd_client.create_payload_index(collection_name=collection_name, field_name="cadence", field_schema="keyword")

In [29]:
# Prepare points
points = []
for doc in documents:
    # Build a single searchable text string from the document fields
    text = " | ".join([
        str(doc["title"]),
        str(doc["artist"]),
        f"Genre: {doc['genre']}",
        f"Key: {doc['key']}",
        f"Tempo: {doc['tempo_bpm']} BPM",
        f"Time: {doc['time_signature']}",
        f"Chords: {doc['chord_progression']}",
        f"Roman: {doc['roman_numerals']}",
        f"Cadence: {doc['cadence']}",
        f"Notes: {doc['theory_notes']}",
    ])

    vector = models.Document(text=text, model=EMBEDDING_MODEL)  # Qdrant client auto-embeds this
    point = models.PointStruct(id=int(doc["id"]), vector=vector, payload=doc)
    points.append(point)

In [30]:
qd_client.upsert(
    collection_name=collection_name,
    points=points
)

UpdateResult(operation_id=0, status=<UpdateStatus.COMPLETED: 'completed'>)

In [31]:
def vector_search(query):

    query_points = qd_client.query_points(
        collection_name=collection_name,
        query=models.Document(
            text=query,
            model=EMBEDDING_MODEL 
        ),
        # Optionally add a filter here if you want to constrain results (e.g., only Pop)
        #query_filter=models.Filter( 
        #    must=[
        #        models.FieldCondition(
        #            key="genre",
        #            match=models.MatchValue(value="Pop")
        #        )
        #    ]
        #),
        limit=10,
        with_payload=True
    )
    
    results = []
    
    for point in query_points.points:
        results.append(point.payload)
    
    return results

In [105]:
def rag(query, model='gpt-4o-mini'):
    search_results = vector_search(query)
    prompt = build_prompt(query, search_results)
    answer = llm(prompt, model=model)
    return answer

In [106]:
question = 'Is Mr Tambourine Man a folk song and if so, explain why, including information about its key and cadence, and explain what cadence means?'
answer = rag(question)
print(answer)

Yes, "Mr. Tambourine Man" is considered a folk song, as it fits within the folk genre and reflects the themes and characteristics typical of folk music. The song is written in the key of D major and has a chord progression of D – G – A – D – Bm – G – A – D. The use of simple chords and a clear progression aligns with folk traditions.

The cadence identified for "Mr. Tambourine Man" is an authentic cadence, which specifically refers to a progression from the dominant chord (V) to the tonic chord (I), in this case, moving from A (V) to D (I). An authentic cadence typically provides a sense of resolution and completeness, finalizing a musical phrase with a satisfying ending.

In summary, "Mr. Tambourine Man" is a folk song characterized by its melodic simplicity, use of common folk chord progressions, and an authentic cadence that adds to its lyrical storytelling quality.


## Retrieval Evaluation

This section measures how well the search system (minsearch and Qdrant) retrieves the correct song record for a set of ground-truth questions. It works as follows:

1. Loads ground-truth data: Reads a CSV file (ground-truth-retrieval.csv) containing questions and the correct song id for each question.
2. Defines evaluation metrics:
    - **Hit Rate**: The fraction of questions for which the correct song appears anywhere in the top search results.
    - **MRR (Mean Reciprocal Rank)**: Measures how high the correct song appears in the ranked results (higher is better).
3. Runs the search: For each question, it uses the search function under evaluation (minsearch or Qdrant) to retrieve the top 10 results.
4. Checks relevance: Compares the id of each result to the ground-truth id to see if the correct song was retrieved and at what rank.
5. Calculates metrics: Aggregates the results to compute overall hit rate and MRR, giving a quantitative measure of your retrieval system’s accuracy.
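
The two metrics above can be illustrated on a small hand-made example. This is a minimal sketch, not the notebook's evaluation run: `toy_relevance` is invented data in which each row stands for one question and each flag says whether the result at that rank is the correct song.

```python
def hit_rate(relevance_total):
    """Fraction of questions with the correct song anywhere in the results."""
    hits = sum(1 for line in relevance_total if any(line))
    return hits / len(relevance_total)


def mrr(relevance_total):
    """Mean of 1/rank of the first correct result (0 when it is missing)."""
    total = 0.0
    for line in relevance_total:
        for rank, is_relevant in enumerate(line, start=1):
            if is_relevant:
                total += 1 / rank
                break
    return total / len(relevance_total)


# Invented data for illustration only
toy_relevance = [
    [True, False, False],   # correct song at rank 1 -> contributes 1/1
    [False, True, False],   # correct song at rank 2 -> contributes 1/2
    [False, False, True],   # correct song at rank 3 -> contributes 1/3
    [False, False, False],  # correct song not retrieved -> contributes 0
]

print(hit_rate(toy_relevance))       # 0.75
print(round(mrr(toy_relevance), 4))  # 0.4583
```

Hit rate only records whether the song was found, while MRR also rewards finding it near the top, which is why the two can diverge for the same search function.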

In [34]:
# Read the local ground truth dataset
df_question = pd.read_csv('../data/ground-truth-retrieval.csv')

In [35]:
df_question.head()

| | id | question |
|---|---|---|
| 0 | 0 | What is the key of the song 'Let It Be' by The... |
| 1 | 0 | Can you provide the chord progression for 'Let... |
| 2 | 0 | What is the tempo in beats per minute for 'Let... |
| 3 | 0 | Which cadence is used at the end of 'Let It Be'? |
| 4 | 0 | What is the time signature of 'Let It Be'? |


In [36]:
ground_truth = df_question.to_dict(orient='records')

In [37]:
ground_truth[0]

{'id': 0,
 'question': "What is the key of the song 'Let It Be' by The Beatles?"}

In [38]:
def hit_rate(relevance_total):
    cnt = 0

    for line in relevance_total:
        if True in line:
            cnt = cnt + 1

    return cnt / len(relevance_total)

def mrr(relevance_total):
    total_score = 0.0

    for line in relevance_total:
        for rank, is_relevant in enumerate(line):
            if is_relevant:
                # Only the first relevant hit counts towards the reciprocal rank
                total_score = total_score + 1 / (rank + 1)
                break

    return total_score / len(relevance_total)

In [39]:
from tqdm.auto import tqdm  # progress bar used inside evaluate()

def evaluate(ground_truth, search_function):
    relevance_total = []

    for q in tqdm(ground_truth):
        doc_id = q['id']
        results = search_function(q)
        relevance = [d['id'] == doc_id for d in results]
        relevance_total.append(relevance)

    return {
        'hit_rate': hit_rate(relevance_total),
        'mrr': mrr(relevance_total),
    }

### Retrieval Evaluation - minsearch (text search)

In [40]:
def minsearch_search(query):
    boost = {}

    results = index.search(
        query=query,
        filter_dict={},
        boost_dict=boost,
        num_results=10
    )

    return results

In [41]:
from tqdm.auto import tqdm

In [42]:
evaluate(ground_truth, lambda q: minsearch_search(q['question']))

  0%|          | 0/500 [00:00<?, ?it/s]

{'hit_rate': 0.914, 'mrr': 0.6267063492063489}

#### Improve Parameters for minsearch

In [87]:
df_validation = df_question[:100]
df_test = df_question[100:]

In [89]:
import random

def simple_optimize(param_ranges, objective_function, n_iterations=10):
    best_params = None
    best_score = float('-inf')  # We are maximizing, so start from -inf. Use float('inf') if minimizing.

    for _ in range(n_iterations):
        # Generate random parameters
        current_params = {}
        for param, (min_val, max_val) in param_ranges.items():
            if isinstance(min_val, int) and isinstance(max_val, int):
                current_params[param] = random.randint(min_val, max_val)
            else:
                current_params[param] = random.uniform(min_val, max_val)
        
        # Evaluate the objective function
        current_score = objective_function(current_params)
        
        # Update best if current is better
        if current_score > best_score:  # Change to < if minimizing
            best_score = current_score
            best_params = current_params
    
    return best_params, best_score
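
Because `simple_optimize` draws from the global `random` module, two runs can return different boosts. The following is a minimal sketch of a reproducible variant, assuming a fixed seed is acceptable for this experiment. The names `random_search`, `toy_objective`, and the `seed` parameter are illustrative and not part of the notebook, and the sketch samples floats only.

```python
import random


def random_search(param_ranges, objective_function, n_iterations=10, seed=42):
    """Random search that maximizes objective_function, reproducible via seed."""
    rng = random.Random(seed)  # local generator; leaves the global random state alone
    best_params, best_score = None, float("-inf")

    for _ in range(n_iterations):
        # Sample every parameter uniformly from its (low, high) range
        params = {
            name: rng.uniform(low, high)
            for name, (low, high) in param_ranges.items()
        }
        score = objective_function(params)
        if score > best_score:
            best_params, best_score = params, score

    return best_params, best_score


# Toy objective (illustrative only): highest when both boosts are close to 1.0
def toy_objective(params):
    return -((params["title"] - 1.0) ** 2 + (params["key"] - 1.0) ** 2)


ranges = {"title": (0.0, 3.0), "key": (0.0, 3.0)}
first = random_search(ranges, toy_objective, n_iterations=50)
second = random_search(ranges, toy_objective, n_iterations=50)
print(first == second)  # True: same seed, same samples
```

In the notebook, the toy objective would be replaced by a function that returns the MRR on the validation questions.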

In [90]:
gt_val = df_validation.to_dict(orient='records')

In [91]:
def minsearch_search(query, boost=None):
    if boost is None:
        boost = {}

    results = index.search(
        query=query,
        filter_dict={},
        boost_dict=boost,
        num_results=10
    )

    return results

In [94]:
param_ranges = {
    'title': (0.0, 3.0),
    'artist': (0.0, 3.0),
    'genre': (0.0, 3.0),
    'key': (0.0, 3.0),
    'tempo_bpm': (0.0, 3.0),
    'time_signature': (0.0, 3.0),
    'chord_progression': (0.0, 3.0),
    'roman_numerals': (0.0, 3.0),
    'cadence': (0.0, 3.0),
    'theory_notes': (0.0, 3.0)
}

def objective(boost_params):
    def search_function(q):
        return minsearch_search(q['question'], boost_params)

    results = evaluate(gt_val, search_function)
    return results['mrr']

In [95]:
simple_optimize(param_ranges, objective, n_iterations=20)

  0%|          | 0/100 [00:00<?, ?it/s]

({'title': 2.3698411388829372,
  'artist': 0.3031625435310451,
  'genre': 0.6140584779182744,
  'key': 2.4562974215046816,
  'tempo_bpm': 1.4232337093957383,
  'time_signature': 2.210401336226906,
  'chord_progression': 0.31739076615573913,
  'roman_numerals': 0.5962528573188755,
  'cadence': 0.3870637694320902,
  'theory_notes': 0.16372389099029938},
 0.8445833333333332)

In [96]:
def minsearch_improved(query):
    boost = {
        'title': 2.37,
        'artist': 0.30,
        'genre': 0.61,
        'key': 2.46,
        'tempo_bpm': 1.42,
        'time_signature': 2.21,
        'chord_progression': 0.32,
        'roman_numerals': 0.60,
        'cadence': 0.39,
        'theory_notes': 0.16
    }

    results = index.search(
        query=query,
        filter_dict={},
        boost_dict=boost,
        num_results=10
    )

    return results

evaluate(ground_truth, lambda q: minsearch_improved(q['question']))

  0%|          | 0/500 [00:00<?, ?it/s]

{'hit_rate': 0.926, 'mrr': 0.9082333333333332}

### Retrieval Evaluation - Qdrant (vector search)

In [88]:
evaluate(ground_truth, lambda q: vector_search(q['question']))

  0%|          | 0/500 [00:00<?, ?it/s]

{'hit_rate': 0.914, 'mrr': 0.8712936507936508}

## LLM Evaluation

The RAG flow is evaluated below using two offline methods:
- Cosine Similarity
- LLM-as-a-Judge

### Cosine Similarity

The code below calculates the cosine similarity between an answer generated by the RAG system and the actual answer from the music theory dataset. It first computes the similarity for a single question, then generates a CSV with LLM answers to all of the questions in the ground truth dataset and computes the similarity against the original dataset record for each one.
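
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. The cells below use a plain dot product, which gives the same number only when the embeddings are unit-length; whether a given embedding model normalizes its output is an assumption worth checking. A minimal sketch on plain Python lists, where `cosine_similarity` is an illustrative helper and not a notebook function:

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


print(cosine_similarity([3.0, 4.0], [4.0, 3.0]))  # 0.96

# For unit-length vectors the plain dot product gives the same value
unit_u, unit_v = [0.6, 0.8], [0.8, 0.6]
plain_dot = sum(a * b for a, b in zip(unit_u, unit_v))
print(math.isclose(plain_dot, cosine_similarity(unit_u, unit_v)))  # True
```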

In [100]:
# Get a question from the ground truth data set
eval_question = ground_truth[3]
print(eval_question)

{'id': 0, 'question': "Which cadence is used at the end of 'Let It Be'?"}


In [None]:
# Use the improved minsearch since it performs best
#def rag(query, model='gpt-4o-mini'):
#    search_results = minsearch_improved(query)
#    prompt = build_prompt(query, search_results)
#    answer = llm(prompt, model=model)
#    return answer

In [108]:
# use the Qdrant vector search to get the answer to the question above
answer_llm = rag(eval_question['question'])
print(answer_llm)

The cadence used at the end of "Let It Be" is an Authentic cadence (IV–I).


In [46]:
# Get the original dataset item with answer to the question above
documents[eval_question['id']]

{'id': 0,
 'title': 'Let It Be',
 'artist': 'The Beatles',
 'genre': 'Pop',
 'key': 'C major',
 'tempo_bpm': '76',
 'time_signature': '4/4',
 'chord_progression': 'C – G – Am – F – C – G – F – C',
 'roman_numerals': 'I – V – vi – IV – I – V – IV – I',
 'cadence': 'Authentic (IV–I) at end; Deceptive (V–vi) earlier',
 'theory_notes': 'Diatonic progression; Deceptive cadence in early phrase; Clear tonic return'}

In [47]:
doc_idx = {d['id']: d for d in documents}
answer_orig = doc_idx[eval_question['id']]['cadence']
print(answer_orig)

Authentic (IV–I) at end; Deceptive (V–vi) earlier


In [48]:
!pip show sentence-transformers || pip install sentence-transformers

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Name: sentence-transformers
Version: 5.1.0
Summary: Embeddings, Retrieval, and Reranking
Home-page: https://www.SBERT.net
Author: 
Author-email: Nils Reimers <info@nils-reimers.de>, Tom Aarsen <tom.aarsen@huggingface.co>
License: Apache 2.0
Location: /home/codespace/.python/current/lib/python3.12/site-packages
Requires: huggingface-hub, Pillow, scikit-learn, scipy, torch, tqdm, transformers, typing_extensions
Required-by: 


In [49]:
from sentence_transformers import SentenceTransformer

# Load a sentence transformer model to generate embeddings
model = SentenceTransformer('all-MiniLM-L6-v2')

# Encode answers
v_orig = model.encode(answer_orig)
v_llm = model.encode(answer_llm)

# Compute cosine similarity
cos_sim = v_llm.dot(v_orig)
print("Cosine similarity:", cos_sim)

Cosine similarity: 0.45508164


In [107]:
# Now generate a file containing the ground truth questions with answers
# from both the LLM and the original dataset
OUTPUT_PATH = "../data/results-gpt4o-mini.csv"

if not os.path.exists(OUTPUT_PATH):
    answers = {}

    for i, rec in enumerate(tqdm(ground_truth)):
        if i in answers:
            continue
    
        answer_llm = rag(rec['question'])
        doc_id = rec['id']
        original_doc = doc_idx[doc_id]
        answer_orig = " | ".join(f"{k}: {v}" for k, v in original_doc.items())
    
        answers[i] = {      
            'id': doc_id,
            'question': rec['question'],
            'answer_llm': answer_llm,
            'answer_orig': answer_orig
        }

    results_gpt4o_mini = [None] * len(ground_truth)

    for i, val in answers.items():
        results_gpt4o_mini[i] = val.copy()
        results_gpt4o_mini[i].update(ground_truth[i])

    df_gpt4o_mini = pd.DataFrame(results_gpt4o_mini)

    # Make sure the target directory exists before writing
    os.makedirs(os.path.dirname(OUTPUT_PATH), exist_ok=True)

    df_gpt4o_mini.to_csv(OUTPUT_PATH, index=False)
    print(f"Saved {len(df_gpt4o_mini)} rows to {OUTPUT_PATH}")
else:
    print(f"File already exists: {OUTPUT_PATH} — skipping generation.")
    df_gpt4o_mini = pd.read_csv(OUTPUT_PATH)

File already exists: ../data/results-gpt4o-mini.csv — skipping generation.


In [75]:
df_gpt4o_mini.sample(n=5)

Unnamed: 0,id,question,answer_llm,answer_orig
135,27,What is the time signature of 'Clair de Lune'?,The time signature of 'Clair de Lune' is 9/8.,id: 27 | title: Clair de Lune | artist: Claude...
393,78,Which genre does the song 'Something' belong to?,The song 'Something' belongs to the genre Pop.,id: 78 | title: Something | artist: George Har...
93,18,What type of cadence is found in this song?,"The type of cadence found in ""Amazing Grace"" i...",id: 18 | title: Vincent (Starry Starry Night) ...
204,40,What is the tempo of 'When I Fall in Love' in ...,The tempo of 'When I Fall in Love' is 63 beats...,id: 40 | title: When I Fall in Love | artist: ...
472,94,Can you provide the chord progression for 'Sha...,The chord progression for 'Shallow' is not pro...,id: 94 | title: Shallow | artist: Lady Gaga & ...


#### gpt-4o-mini

In [80]:
results_gpt4o_mini = df_gpt4o_mini.to_dict(orient='records')

In [81]:
record = results_gpt4o_mini[0]

In [82]:
def compute_similarity(record):
    answer_orig = record['answer_orig']
    answer_llm = record['answer_llm']
    
    v_llm = model.encode(answer_llm)
    v_orig = model.encode(answer_orig)
    
    return v_llm.dot(v_orig)

In [83]:
similarity = []

for record in tqdm(results_gpt4o_mini):
    sim = compute_similarity(record)
    similarity.append(sim)

  0%|          | 0/500 [00:00<?, ?it/s]

In [85]:
df_gpt4o_mini['cosine'] = similarity
df_gpt4o_mini['cosine'].describe()

count    500.000000
mean       0.570456
std        0.118722
min        0.219455
25%        0.501760
50%        0.583249
75%        0.660498
max        0.842621
Name: cosine, dtype: float64

### LLM-as-a-Judge

The code below uses an LLM as a judge: for each question, it asks the model to classify how relevant the generated answer is to that question and to explain its verdict.

In [111]:
prompt_template_evaluation = """
You are an expert evaluator for a Retrieval-Augmented Generation (RAG) system.
Your task is to analyze the relevance of the generated answer to the given question.
Based on the relevance of the generated answer, you will classify it
as "NON_RELEVANT", "PARTLY_RELEVANT", or "RELEVANT".

Here is the data for evaluation:

Question: {question}
Generated Answer: {answer_llm}

Please analyze the content and context of the generated answer in relation to the question
and provide your evaluation in parsable JSON without using code blocks. Return ONLY valid JSON
with double quotes, no comments, and no trailing commas. For example:

{{
  "Relevance": "NON_RELEVANT" | "PARTLY_RELEVANT" | "RELEVANT",
  "Explanation": "[Provide a brief explanation for your evaluation]"
}}
""".strip()

In [112]:
len(ground_truth)

500

In [113]:
record = ground_truth[0]
question = record['question']
answer_llm = rag(question)

In [114]:
print(question)

What is the key of the song 'Let It Be' by The Beatles?


In [115]:
print(answer_llm)

The key of the song 'Let It Be' by The Beatles is C major.


In [116]:
prompt = prompt_template_evaluation.format(question=question, answer_llm=answer_llm)
print(prompt)

You are an expert evaluator for a Retrieval-Augmented Generation (RAG) system.
Your task is to analyze the relevance of the generated answer to the given question.
Based on the relevance of the generated answer, you will classify it
as "NON_RELEVANT", "PARTLY_RELEVANT", or "RELEVANT".

Here is the data for evaluation:

Question: What is the key of the song 'Let It Be' by The Beatles?
Generated Answer: The key of the song 'Let It Be' by The Beatles is C major.

Please analyze the content and context of the generated answer in relation to the question
and provide your evaluation in parsable JSON without using code blocks. Return ONLY valid JSON
with double quotes, no comments, and no trailing commas. For example:

{
  "Relevance": "NON_RELEVANT" | "PARTLY_RELEVANT" | "RELEVANT",
  "Explanation": "[Provide a brief explanation for your evaluation]"
}


In [117]:
llm(prompt)

'{\n  "Relevance": "RELEVANT",\n  "Explanation": "The generated answer correctly identifies the key of the song \'Let It Be\' by The Beatles, directly addressing the question asked."\n}'

In [120]:
import json

In [121]:
df_sample = df_question.sample(n=200, random_state=1)

In [122]:
sample = df_sample.to_dict(orient='records')

In [123]:
evaluations = []

for record in tqdm(sample):
    question = record['question']
    answer_llm = rag(question) 

    prompt = prompt_template_evaluation.format(
        question=question,
        answer_llm=answer_llm
    )

    evaluation = llm(prompt)
    evaluation = json.loads(evaluation)

    evaluations.append((record, answer_llm, evaluation))

  0%|          | 0/200 [00:00<?, ?it/s]
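
The loop above calls `json.loads` directly on the judge's reply, so a single reply wrapped in a code fence or padded with extra text would raise and stop the run. Here is a hedged sketch of a more tolerant parser. `parse_judge_json` and `ALLOWED_LABELS` are hypothetical names, and returning `None` for unusable replies is one possible design choice, not the notebook's behaviour.

```python
import json
import re

ALLOWED_LABELS = {"NON_RELEVANT", "PARTLY_RELEVANT", "RELEVANT"}


def parse_judge_json(raw):
    """Return the judge's verdict as a dict, or None if it cannot be used."""
    # Keep only the outermost {...} span, which drops code fences and extra text
    match = re.search(r"\{.*\}", raw, flags=re.DOTALL)
    if match is None:
        return None
    try:
        parsed = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    if not isinstance(parsed, dict):
        return None
    relevance = parsed.get("Relevance")
    if not isinstance(relevance, str) or relevance not in ALLOWED_LABELS:
        return None
    return parsed


fence = "`" * 3  # built at runtime so this example can sit inside a code fence
fenced_reply = fence + 'json\n{"Relevance": "RELEVANT", "Explanation": "ok"}\n' + fence

print(parse_judge_json(fenced_reply))       # {'Relevance': 'RELEVANT', 'Explanation': 'ok'}
print(parse_judge_json("I cannot answer"))  # None
```

Replies that come back as `None` could be skipped or retried instead of aborting the whole evaluation.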

In [124]:
df_eval = pd.DataFrame(evaluations, columns=['record', 'answer', 'evaluation'])

df_eval['id'] = df_eval.record.apply(lambda d: d['id'])
df_eval['question'] = df_eval.record.apply(lambda d: d['question'])

df_eval['relevance'] = df_eval.evaluation.apply(lambda d: d['Relevance'])
df_eval['explanation'] = df_eval.evaluation.apply(lambda d: d['Explanation'])

del df_eval['record']
del df_eval['evaluation']

In [125]:
df_eval.relevance.value_counts()

relevance
RELEVANT           192
PARTLY_RELEVANT      7
NON_RELEVANT         1
Name: count, dtype: int64

In [127]:
df_eval.relevance.value_counts(normalize=True)

relevance
RELEVANT           0.960
PARTLY_RELEVANT    0.035
NON_RELEVANT       0.005
Name: proportion, dtype: float64

In [128]:
df_eval.to_csv('../data/rag-eval-gpt-4o-mini.csv', index=False)

In [129]:
df_eval[df_eval.relevance == 'NON_RELEVANT']

| | answer | id | question | relevance | explanation |
|---|---|---|---|---|---|
| 72 | The time signature for the piece "Boléro" is 3/4. | 29 | What is the time signature for the piece Boléro? | NON_RELEVANT | The time signature for 'Boléro' is actually 3/... |


In [130]:
evaluations_gpt4o = []

for record in tqdm(sample):
    question = record['question']
    answer_llm = rag(question, model='gpt-4o') 

    prompt = prompt_template_evaluation.format(
        question=question,
        answer_llm=answer_llm
    )

    evaluation = llm(prompt)
    evaluation = json.loads(evaluation)
    
    evaluations_gpt4o.append((record, answer_llm, evaluation))

  0%|          | 0/200 [00:00<?, ?it/s]

In [131]:
df_eval = pd.DataFrame(evaluations_gpt4o, columns=['record', 'answer', 'evaluation'])

df_eval['id'] = df_eval.record.apply(lambda d: d['id'])
df_eval['question'] = df_eval.record.apply(lambda d: d['question'])

df_eval['relevance'] = df_eval.evaluation.apply(lambda d: d['Relevance'])
df_eval['explanation'] = df_eval.evaluation.apply(lambda d: d['Explanation'])

del df_eval['record']
del df_eval['evaluation']

In [132]:
df_eval.relevance.value_counts()

relevance
RELEVANT           195
PARTLY_RELEVANT      4
NON_RELEVANT         1
Name: count, dtype: int64

In [133]:
df_eval.relevance.value_counts(normalize=True)

relevance
RELEVANT           0.975
PARTLY_RELEVANT    0.020
NON_RELEVANT       0.005
Name: proportion, dtype: float64

In [134]:
df_eval.to_csv('../data/rag-eval-gpt-4o.csv', index=False)

In [135]:
df_eval[df_eval.relevance == 'NON_RELEVANT']

| | answer | id | question | relevance | explanation |
|---|---|---|---|---|---|
| 72 | The time signature for the piece "Boléro" by M... | 29 | What is the time signature for the piece Boléro? | NON_RELEVANT | The correct time signature for Ravel's 'Boléro... |
