## Homework: Evaluation and Monitoring

In this homework, we'll evaluate the quality of our RAG system.

## Getting the data
Let's start by getting the dataset. We will use the data we generated in the module.

In particular, we'll evaluate the quality of our RAG system with `gpt-4o-mini`.

Read it:

In [71]:
import pandas as pd

url = 'https://github.com/DataTalksClub/llm-zoomcamp/blob/main/04-monitoring/data/results-gpt4o-mini.csv?raw=1'
df = pd.read_csv(url)

df.head()


Unnamed: 0,answer_llm,answer_orig,document,question,course
0,You can sign up for the course by visiting the...,Machine Learning Zoomcamp FAQ\nThe purpose of ...,0227b872,Where can I sign up for the course?,machine-learning-zoomcamp
1,You can sign up using the link provided in the...,Machine Learning Zoomcamp FAQ\nThe purpose of ...,0227b872,Can you provide a link to sign up?,machine-learning-zoomcamp
2,"Yes, there is an FAQ for the Machine Learning ...",Machine Learning Zoomcamp FAQ\nThe purpose of ...,0227b872,Is there an FAQ for this Machine Learning course?,machine-learning-zoomcamp
3,The context does not provide any specific info...,Machine Learning Zoomcamp FAQ\nThe purpose of ...,0227b872,Does this course have a GitHub repository for ...,machine-learning-zoomcamp
4,To structure your questions and answers for th...,Machine Learning Zoomcamp FAQ\nThe purpose of ...,0227b872,How can I structure my questions and answers f...,machine-learning-zoomcamp


In [72]:
# Use only the first 300 rows
df = df.iloc[:300]

df.shape

(300, 5)

## Q1. Getting the embeddings model

Now, get the embeddings model `multi-qa-mpnet-base-dot-v1` from
[the Sentence Transformers library](https://www.sbert.net/docs/sentence_transformer/pretrained_models.html#model-overview).


In [73]:
from sentence_transformers import SentenceTransformer, util

# Load the model
model = SentenceTransformer('sentence-transformers/multi-qa-mpnet-base-dot-v1')



Create the embeddings for the first LLM answer:



In [74]:
answer_llm = df.iloc[0].answer_llm



What's the first value of the resulting vector?



In [75]:
v_answer_llm = model.encode(answer_llm)

f'The first value of resulting vector is {v_answer_llm[0]}'

'The first value of resulting vector is -0.4224465489387512'

## Q2. Computing the dot product

Now, for each answer pair, let's create embeddings and compute the dot product between them.

We will put the results (scores) into the `evaluations` list (named `dot_similarity` in the code below).

What's the 75th percentile of the score?

In [76]:
results_gpt4o_mini = df.to_dict(orient='records')
results_gpt4o_mini[0]

{'answer_llm': 'You can sign up for the course by visiting the course page at [http://mlzoomcamp.com/](http://mlzoomcamp.com/).',
 'answer_orig': 'Machine Learning Zoomcamp FAQ\nThe purpose of this document is to capture frequently asked technical questions.\nWe did this for our data engineering course and it worked quite well. Check this document for inspiration on how to structure your questions and answers:\nData Engineering Zoomcamp FAQ\nIn the course GitHub repository there’s a link. Here it is: https://airtable.com/shryxwLd0COOEaqXo\nwork',
 'document': '0227b872',
 'question': 'Where can I sign up for the course?',
 'course': 'machine-learning-zoomcamp'}

In [77]:
def compute_similarity(record):
    answer_orig = record['answer_orig']
    answer_llm = record['answer_llm']
    
    v_llm = model.encode(answer_llm)
    v_orig = model.encode(answer_orig)
    
    return v_llm.dot(v_orig)

In [78]:
from tqdm.auto import tqdm


dot_similarity = []

for record in tqdm(results_gpt4o_mini):
    sim = compute_similarity(record)
    dot_similarity.append(sim)

100%|██████████| 300/300 [02:22<00:00,  2.11it/s]


In [79]:
df['dot_similarity'] = dot_similarity 
df.head()

Unnamed: 0,answer_llm,answer_orig,document,question,course,dot_similarity
0,You can sign up for the course by visiting the...,Machine Learning Zoomcamp FAQ\nThe purpose of ...,0227b872,Where can I sign up for the course?,machine-learning-zoomcamp,17.515987
1,You can sign up using the link provided in the...,Machine Learning Zoomcamp FAQ\nThe purpose of ...,0227b872,Can you provide a link to sign up?,machine-learning-zoomcamp,13.418402
2,"Yes, there is an FAQ for the Machine Learning ...",Machine Learning Zoomcamp FAQ\nThe purpose of ...,0227b872,Is there an FAQ for this Machine Learning course?,machine-learning-zoomcamp,25.313255
3,The context does not provide any specific info...,Machine Learning Zoomcamp FAQ\nThe purpose of ...,0227b872,Does this course have a GitHub repository for ...,machine-learning-zoomcamp,12.147415
4,To structure your questions and answers for th...,Machine Learning Zoomcamp FAQ\nThe purpose of ...,0227b872,How can I structure my questions and answers f...,machine-learning-zoomcamp,18.747736


In [80]:
print(f"The 75th percentile dot product similarity score is: {df['dot_similarity'].quantile(0.75)}")


The 75th percentile dot product similarity score is: 31.67430877685547
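
For reference, `Series.quantile(0.75)` uses linear interpolation between neighbouring sorted values by default. The sketch below reproduces that rule in plain Python on made-up scores. The helper name `quantile_linear` and the values in `toy_scores` are illustrative assumptions, not part of the dataset.

```python
import math

def quantile_linear(values, q):
    """Quantile with linear interpolation between the two nearest ranks."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)   # fractional index into the sorted list
    lower = math.floor(pos)
    upper = math.ceil(pos)
    frac = pos - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * frac

# Hypothetical similarity scores, for illustration only
toy_scores = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0]
p75 = quantile_linear(toy_scores, 0.75)
print(p75)  # 47.5
```

Here the 75th percentile falls at index 3.75 of the sorted list, that is, three quarters of the way from 40 to 50.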


## Q3. Computing the cosine

From Q2, we can see that the results are not within the [0, 1] range. That's because the vectors coming from this model are not normalized, so the dot product also grows with the vectors' lengths.

So we need to normalize them.

To do it, we:

* Compute the norm of a vector
* Divide each element by this norm

So, for vector `v`, it'll be `v / ||v||`

In numpy, this is how you do it:

```python
norm = np.sqrt((v * v).sum())
v_norm = v / norm
```
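
As a quick check of the formula, here is a small worked example on hand-picked vectors. The vectors `a` and `b` and the helper `normalize` are made up for illustration; they are not model embeddings. It shows that the raw dot product scales with the vector lengths, while the dot product of the normalized vectors gives the cosine.

```python
import numpy as np

# Hand-picked toy vectors, not real embeddings
a = np.array([3.0, 4.0])
b = np.array([6.0, 8.0])   # same direction as a, twice as long

def normalize(v):
    """Scale v to unit length: v / ||v||."""
    norm = np.sqrt((v * v).sum())
    return v / norm

raw_dot = a.dot(b)                       # 3*6 + 4*8 = 50
cosine = normalize(a).dot(normalize(b))  # same direction, so close to 1

print(raw_dot, cosine)

# np.linalg.norm computes the same Euclidean norm
print(np.isclose(np.linalg.norm(a), np.sqrt((a * a).sum())))
```

Doubling the length of `b` doubles the raw dot product but leaves the cosine unchanged, which is why the normalized score is easier to compare across answer pairs.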

Let's put it into a function and then compute the dot product
between the normalized vectors. This will give us the cosine similarity.

What's the 75th percentile of the cosine scores?

In [81]:
import numpy as np

def compute_cosine_similarity(record):
    answer_orig = record['answer_orig']
    answer_llm = record['answer_llm']
    
    v_llm = model.encode(answer_llm)
    v_orig = model.encode(answer_orig)
    
    # Compute the norm of the vectors
    norm_llm = np.sqrt((v_llm * v_llm).sum())
    norm_orig = np.sqrt((v_orig * v_orig).sum())
    
    # Avoid division by zero
    if norm_llm == 0 or norm_orig == 0:
        return 0.0
    
    # Normalize the vectors
    v_llm_norm = v_llm / norm_llm
    v_orig_norm = v_orig / norm_orig
    
    # Compute cosine similarity as dot product of normalized vectors
    cosine_sim = np.dot(v_llm_norm, v_orig_norm)
    
    return cosine_sim


In [82]:
from tqdm.auto import tqdm


cos_similarity = []

for record in tqdm(results_gpt4o_mini):
    sim = compute_cosine_similarity(record)
    cos_similarity.append(sim)

100%|██████████| 300/300 [02:21<00:00,  2.12it/s]


In [83]:
df['cos_similarity'] = cos_similarity 
df.head()

Unnamed: 0,answer_llm,answer_orig,document,question,course,dot_similarity,cos_similarity
0,You can sign up for the course by visiting the...,Machine Learning Zoomcamp FAQ\nThe purpose of ...,0227b872,Where can I sign up for the course?,machine-learning-zoomcamp,17.515987,0.506754
1,You can sign up using the link provided in the...,Machine Learning Zoomcamp FAQ\nThe purpose of ...,0227b872,Can you provide a link to sign up?,machine-learning-zoomcamp,13.418402,0.388549
2,"Yes, there is an FAQ for the Machine Learning ...",Machine Learning Zoomcamp FAQ\nThe purpose of ...,0227b872,Is there an FAQ for this Machine Learning course?,machine-learning-zoomcamp,25.313255,0.718599
3,The context does not provide any specific info...,Machine Learning Zoomcamp FAQ\nThe purpose of ...,0227b872,Does this course have a GitHub repository for ...,machine-learning-zoomcamp,12.147415,0.337266
4,To structure your questions and answers for th...,Machine Learning Zoomcamp FAQ\nThe purpose of ...,0227b872,How can I structure my questions and answers f...,machine-learning-zoomcamp,18.747736,0.521792


In [84]:
print(f"The 75th percentile cosine similarity score is: {df['cos_similarity'].quantile(0.75)}")


The 75th percentile cosine similarity score is: 0.8362348973751068


## Q4. Rouge

Now we will explore an alternative metric: the ROUGE score.

This is a set of metrics that compares two answers based on the overlap of n-grams, word sequences, and word pairs.

It can give a more nuanced view of text similarity than cosine similarity alone.

We don't need to implement it ourselves; there's a Python package for it:

In [85]:
pip install rouge


Note: you may need to restart the kernel to use updated packages.


Let's compute the ROUGE score between the answers at index 10 of our dataframe (`doc_id=5170565b`).

In [86]:
df.iloc[10]

answer_llm        Yes, all sessions are recorded, so if you miss...
answer_orig       Everything is recorded, so you won’t miss anyt...
document                                                   5170565b
question                       Are sessions recorded if I miss one?
course                                    machine-learning-zoomcamp
dot_similarity                                            32.344711
cos_similarity                                             0.777956
Name: 10, dtype: object

In [87]:
from rouge import Rouge

rouge_scorer = Rouge()



# Calculate the ROUGE scores for the row at index 10
scores = rouge_scorer.get_scores(df.iloc[10]['answer_llm'], df.iloc[10]['answer_orig'])[0]



f"Rouge f1 score is : {scores['rouge-1']['f']}"


'Rouge f1 score is : 0.45454544954545456'
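
To make the n-gram overlap idea concrete, here is a simplified ROUGE-1 sketch in plain Python. It rests on stated assumptions: whitespace tokenization, lowercasing, and overlap counted with multiplicity. The `rouge` package may tokenize and count differently, so its numbers need not match this sketch. The function name `rouge_1_sketch` and the example sentences are made up.

```python
from collections import Counter

def rouge_1_sketch(hypothesis, reference):
    """Simplified ROUGE-1: unigram overlap between two texts."""
    hyp_counts = Counter(hypothesis.lower().split())
    ref_counts = Counter(reference.lower().split())

    # Clipped overlap: a word counts at most as often as it appears in both texts
    overlap = sum((hyp_counts & ref_counts).values())

    precision = overlap / max(sum(hyp_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    if precision + recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return {'r': recall, 'p': precision, 'f': f1}

# Made-up example sentences
result = rouge_1_sketch('all sessions are recorded', 'everything is recorded')
print(result)
```

Only the word "recorded" is shared, so precision is 1/4 (one of four hypothesis words) and recall is 1/3 (one of three reference words).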

## Q5. Average rouge score

Let's compute the average of the `rouge-1`, `rouge-2` and `rouge-l` F-scores for the same record from Q4.

In [88]:
scores

{'rouge-1': {'r': 0.45454545454545453,
  'p': 0.45454545454545453,
  'f': 0.45454544954545456},
 'rouge-2': {'r': 0.21621621621621623,
  'p': 0.21621621621621623,
  'f': 0.21621621121621637},
 'rouge-l': {'r': 0.3939393939393939,
  'p': 0.3939393939393939,
  'f': 0.393939388939394}}

In [89]:
# Extract all individual ROUGE scores
rouge_1_f1 = scores['rouge-1']['f']
rouge_2_f1 = scores['rouge-2']['f']
rouge_l_f1 = scores['rouge-l']['f']


# Combine all scores into a list
all_f1_scores = [
    rouge_1_f1,
    rouge_2_f1, 
    rouge_l_f1, 
]

# Calculate the average
average_score = sum(all_f1_scores) / len(all_f1_scores)

print(f"Average ROUGE score: {average_score}")

Average ROUGE score: 0.35490034990035496
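
The same average can also be taken directly from the nested `scores` dictionary, without naming each metric by hand. A small sketch, with the F-scores copied from the `scores` output above and the standard-library `statistics.fmean` doing the averaging:

```python
from statistics import fmean

# F-scores copied from the `scores` output above
scores = {
    'rouge-1': {'f': 0.45454544954545456},
    'rouge-2': {'f': 0.21621621121621637},
    'rouge-l': {'f': 0.393939388939394},
}

# Average the 'f' entry of every metric in the dictionary
average_score = fmean(metric['f'] for metric in scores.values())
print(f"Average ROUGE score: {average_score}")
```

Because of floating-point rounding, the last digits may differ slightly from the value printed by the manual sum.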


## Q6. Average rouge score for all the data points

Now let's compute the score for all the records and create a dataframe from them.

What's the average `rouge_2` across all the records?
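
The aggregation step can be sketched on toy values first: flatten each nested result into a flat row, then average each metric across rows. Here `toy_results` is hypothetical and only mimics the shape of one `get_scores(...)[0]` result per record, and `flatten` is an illustrative helper.

```python
from statistics import fmean

# Hypothetical per-record results, shaped like get_scores(...)[0]
toy_results = [
    {'rouge-1': {'f': 0.5}, 'rouge-2': {'f': 0.25}, 'rouge-l': {'f': 0.5}},
    {'rouge-1': {'f': 1.0}, 'rouge-2': {'f': 0.75}, 'rouge-l': {'f': 1.0}},
]

def flatten(result):
    """Turn {'rouge-1': {'f': x}, ...} into {'rouge_1': x, ...}."""
    return {name.replace('-', '_'): values['f'] for name, values in result.items()}

rows = [flatten(result) for result in toy_results]

# Average each metric across all rows
averages = {key: fmean(row[key] for row in rows) for key in rows[0]}
print(averages)  # {'rouge_1': 0.75, 'rouge_2': 0.5, 'rouge_l': 0.75}
```

A list of flat rows like `rows` can also be passed to `pd.DataFrame(rows)` to get one column per metric.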

In [90]:
import pandas as pd
from rouge import Rouge

def compute_rouge_f1_scores(df):
    # Initialize the Rouge scorer
    rouge_scorer = Rouge()

    # Lists to hold the scores
    rouge_1_f1_scores = []
    rouge_2_f1_scores = []
    rouge_l_f1_scores = []

    # Iterate through each record in the dataframe
    for index, row in df.iterrows():
        # Calculate the ROUGE score for each row
        scores = rouge_scorer.get_scores(row['answer_llm'], row['answer_orig'])[0]
        
        # Extract the F1 scores for ROUGE-1, ROUGE-2, and ROUGE-L
        rouge_1_f1 = scores['rouge-1']['f']
        rouge_2_f1 = scores['rouge-2']['f']
        rouge_l_f1 = scores['rouge-l']['f']
        
        # Append the scores to the respective lists
        rouge_1_f1_scores.append(rouge_1_f1)
        rouge_2_f1_scores.append(rouge_2_f1)
        rouge_l_f1_scores.append(rouge_l_f1)

    # Add the scores to the dataframe
    df['rouge_1_f1'] = rouge_1_f1_scores
    df['rouge_2_f1'] = rouge_2_f1_scores
    df['rouge_l_f1'] = rouge_l_f1_scores

    # Calculate the average of the ROUGE scores
    avg_rouge_1_f1 = sum(rouge_1_f1_scores) / len(rouge_1_f1_scores)
    avg_rouge_2_f1 = sum(rouge_2_f1_scores) / len(rouge_2_f1_scores)
    avg_rouge_l_f1 = sum(rouge_l_f1_scores) / len(rouge_l_f1_scores)

    return df, avg_rouge_1_f1, avg_rouge_2_f1, avg_rouge_l_f1


# Compute ROUGE scores and add to dataframe
df_with_rouge_scores, avg_rouge_1_f1, avg_rouge_2_f1, avg_rouge_l_f1 = compute_rouge_f1_scores(df)

print(df_with_rouge_scores)
print(f"Average ROUGE-1 F1: {avg_rouge_1_f1}")
print(f"Average ROUGE-2 F1: {avg_rouge_2_f1}")
print(f"Average ROUGE-L F1: {avg_rouge_l_f1}")


                                            answer_llm  \
0    You can sign up for the course by visiting the...   
1    You can sign up using the link provided in the...   
2    Yes, there is an FAQ for the Machine Learning ...   
3    The context does not provide any specific info...   
4    To structure your questions and answers for th...   
..                                                 ...   
295  An alternative way to load the data using the ...   
296  You can directly download the dataset from Git...   
297  You can fetch data for homework using the `req...   
298  If the status code is 200 when downloading dat...   
299  If the file download fails when using the requ...   

                                           answer_orig  document  \
0    Machine Learning Zoomcamp FAQ\nThe purpose of ...  0227b872   
1    Machine Learning Zoomcamp FAQ\nThe purpose of ...  0227b872   
2    Machine Learning Zoomcamp FAQ\nThe purpose of ...  0227b872   
3    Machine Learning Zoomcamp 