## Homework: Evaluation and Monitoring

In this homework, we'll evaluate the quality of our RAG system.

## Getting the data
Let's start by getting the dataset. We will use the data we generated in the module.

In particular, we'll evaluate the answers our RAG system produced with gpt-4o-mini.

Read it:

In [58]:
import pandas as pd
from tqdm import tqdm

In [3]:
github_url = 'https://raw.githubusercontent.com/DataTalksClub/llm-zoomcamp/main/04-monitoring/data/results-gpt4o-mini.csv'
url = f'{github_url}?raw=1'
df = pd.read_csv(url)

In [10]:
df.sample(5)

Unnamed: 0,answer_llm,answer_orig,document,question,course
18,To find more about the theoretical topics not ...,The bare minimum. The focus is more on practic...,ecca790c,Where can I find more about the theoretical to...,machine-learning-zoomcamp
1727,The server receives data in JSON format becaus...,Problem happens when contacting the server wai...,cc60f7bc,Why does the server receive data in JSON forma...,machine-learning-zoomcamp
787,To see the version of an installed Python pack...,Import waitress\nprint(waitress.__version__)\n...,7156679d,What code should I run in Jupyter to see the v...,machine-learning-zoomcamp
1620,To calculate your email hash for project evalu...,I am not sure how the project evaluate assignm...,37eab341,What specific steps should I follow to calcula...,machine-learning-zoomcamp
143,"In the homework, X.dot(Y) is not necessarily e...",I'm trying to invert the matrix but I got erro...,54ec0de4,"In the homework, why is X.dot(Y) not necessari...",machine-learning-zoomcamp


We will use only the first 300 documents:

In [13]:
df = df.iloc[:300]

## Q1. Getting the embeddings model
Now, get the embedding model ```multi-qa-mpnet-base-dot-v1``` from the [Sentence Transformers library](https://www.sbert.net/docs/sentence_transformer/pretrained_models.html#model-overview).

In [14]:
from sentence_transformers import SentenceTransformer

model_name = 'multi-qa-mpnet-base-dot-v1'
embedding_model = SentenceTransformer(model_name)

Create the embeddings for the first LLM answer:

In [17]:
df.iloc[:4].answer_llm

0    You can sign up for the course by visiting the...
1    You can sign up using the link provided in the...
2    Yes, there is an FAQ for the Machine Learning ...
3    The context does not provide any specific info...
Name: answer_llm, dtype: object

In [18]:
answer_llm_1 = df.iloc[0].answer_llm

answer_llm_1

'You can sign up for the course by visiting the course page at [http://mlzoomcamp.com/](http://mlzoomcamp.com/).'

In [20]:
answer_llm_1_v = embedding_model.encode(answer_llm_1)

In [21]:
answer_llm_1_v

array([-4.22446549e-01, -2.24856257e-01, -3.24058414e-01, -2.84758478e-01,
        7.25642918e-03,  1.01186566e-01,  1.03716910e-01, -1.89983174e-01,
       -2.80599259e-02,  2.71588802e-01, -1.15337655e-01,  1.14666030e-01,
       -8.49586725e-02,  3.32365334e-01,  5.52720726e-02, -2.22195774e-01,
       -1.42540857e-01,  1.02519155e-01, -1.52333647e-01, -2.02912465e-01,
        1.98422875e-02,  8.38149190e-02, -5.68632066e-01,  2.32844148e-02,
       -1.67292684e-01, -2.39256918e-01, -8.05464387e-02,  2.57084146e-02,
       -8.15464780e-02, -7.39290118e-02, -2.61550009e-01,  1.92575473e-02,
        3.22909206e-01,  1.90357104e-01, -9.34726413e-05, -2.13165611e-01,
        2.88943425e-02, -1.79530401e-02, -5.92756271e-02,  1.99918285e-01,
       -4.75170948e-02,  1.71634093e-01, -2.45917086e-02, -9.38061550e-02,
       -3.57002735e-01,  1.33263692e-01,  1.94045901e-01, -1.18530318e-01,
        4.56915230e-01,  1.47728190e-01,  3.35945129e-01, -1.86959356e-01,
        2.45954901e-01, -

What's the first value of the resulting vector?

- -0.42
- -0.22
- -0.02
- 0.21

In [23]:
answer_llm_1_v[0]

-0.42244655

### Answer Q1:
- -0.42

## Q2. Computing the dot product
Now, for each answer pair, let's create embeddings and compute the dot product between them.

The homework suggests collecting the results (scores) in an ```evaluations``` list; here we store them in a ```cosine_sim``` column of the dataframe instead.

In [66]:
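for indx, row in tqdm(df.iterrows()):
    # Embed the LLM answer and the original answer for this row
    answer_llm_v = embedding_model.encode(row.answer_llm)
    answer_orig_v = embedding_model.encode(row.answer_orig)

    # Raw dot product: the vectors are not normalized yet, so this is
    # not a true cosine similarity despite the column name
    cosine_sim = answer_llm_v.dot(answer_orig_v)

    df.at[indx, 'cosine_sim'] = cosine_sim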

300it [02:19,  2.16it/s]
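The loop above encodes one text at a time. As a possible alternative, here is a minimal sketch that encodes each column in one call and then takes the row-wise dot product. The helper `pairwise_dot` and the stand-in `toy_encode` are hypothetical names introduced for illustration, and the sketch assumes the encoder accepts a list of strings and returns one vector per string:

```python
def pairwise_dot(encode, texts_a, texts_b):
    """Dot product between the i-th vectors of two equally long text lists."""
    texts_a, texts_b = list(texts_a), list(texts_b)
    if len(texts_a) != len(texts_b):
        raise ValueError("both inputs must have the same length")
    # One encode call per column instead of one call per row
    vectors_a = encode(texts_a)
    vectors_b = encode(texts_b)
    return [
        float(sum(x * y for x, y in zip(va, vb)))
        for va, vb in zip(vectors_a, vectors_b)
    ]


# Stand-in encoder so the sketch runs without downloading a model:
# each text becomes the toy vector [number of characters, number of words].
def toy_encode(texts):
    return [[float(len(t)), float(len(t.split()))] for t in texts]


toy_scores = pairwise_dot(toy_encode, ["a b", "hello"], ["c d e", "hi"])
print(toy_scores)  # [21.0, 11.0]
```

In the notebook, the call would presumably look like `pairwise_dot(embedding_model.encode, df.answer_llm, df.answer_orig)`. Batched encoding may give values that differ from the per-row loop in the last decimal places.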


In [73]:
df.head(3)

Unnamed: 0,answer_llm,answer_orig,document,question,course,cosine_sim
0,You can sign up for the course by visiting the...,Machine Learning Zoomcamp FAQ\nThe purpose of ...,0227b872,Where can I sign up for the course?,machine-learning-zoomcamp,17.515987
1,You can sign up using the link provided in the...,Machine Learning Zoomcamp FAQ\nThe purpose of ...,0227b872,Can you provide a link to sign up?,machine-learning-zoomcamp,13.418402
2,"Yes, there is an FAQ for the Machine Learning ...",Machine Learning Zoomcamp FAQ\nThe purpose of ...,0227b872,Is there an FAQ for this Machine Learning course?,machine-learning-zoomcamp,25.313255


What's the 75th percentile of the score?

- 21.67
- 31.67
- 41.67
- 51.67

In [72]:
df['cosine_sim'].describe()

count    300.000000
mean      27.495996
std        6.384742
min        4.547924
25%       24.307844
50%       28.336870
75%       31.674309
max       39.476013
Name: cosine_sim, dtype: float64

### Answer Q2: 
- 31.67
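For reference, the ```75%``` row of ```describe()``` is a quantile computed with linear interpolation between neighbouring sorted values (the pandas default). A minimal plain-Python sketch of that rule, where `percentile` is a hypothetical helper and the data is a toy list rather than our scores:

```python
import math


def percentile(values, q):
    """Quantile for 0 <= q <= 1, interpolating linearly between ranks."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    position = q * (len(ordered) - 1)  # fractional rank in the sorted list
    lower = math.floor(position)
    upper = math.ceil(position)
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])


print(percentile([10, 20, 30, 40, 50], 0.75))  # 40.0
print(percentile([1, 2, 3, 4], 0.75))          # 3.25
```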

## Q3. Computing the cosine
From Q2, we can see that the results are not within the [-1, 1] range that cosine similarity takes. That's because the vectors coming from this model are not normalized.

So we need to normalize them.

To do it, we

- Compute the norm of a vector
- Divide each element by this norm

So, for vector v, it'll be v / ||v||

In numpy, this is how you do it:

In [81]:
df_norm = df.copy()

In [75]:
import numpy as np

In [None]:
norm = np.sqrt((v * v).sum()) # or np.linalg.norm(v)
v_norm = v / norm
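To see what normalization changes, here is a dependency-free sketch with two toy vectors. The helper names `vector_norm`, `normalize` and `dot` are made up for illustration; in the notebook the same steps are done with numpy:

```python
import math


def vector_norm(vec):
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in vec))


def normalize(vec):
    """Divide each element by the norm so the result has length 1."""
    n = vector_norm(vec)
    return [x / n for x in vec]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


a = [3.0, 4.0]  # norm is 5
b = [4.0, 3.0]  # norm is 5

raw = dot(a, b)                           # 24.0, depends on vector lengths
cosine = dot(normalize(a), normalize(b))  # about 0.96, i.e. 24 / (5 * 5)
print(raw, round(cosine, 2))
```

The raw dot product grows with the length of the vectors, while the dot product of the normalized vectors depends only on the angle between them.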

Let's put it into a function and then compute the dot product between normalized vectors. This will give us the cosine similarity.



In [84]:
def cosine_similarity(df, model, col_1, col_2):
    """Add a 'cosine_sim' column with the cosine similarity of two text columns."""
    for indx, row in tqdm(df.iterrows()):
        # Embed each text and scale the vector to unit length
        answer_llm_v = model.encode(row[col_1])
        answer_llm_v_norm = answer_llm_v / np.linalg.norm(answer_llm_v)

        # Use the model passed in, not the global embedding_model
        answer_orig_v = model.encode(row[col_2])
        answer_orig_v_norm = answer_orig_v / np.linalg.norm(answer_orig_v)

        # The dot product of unit vectors is the cosine similarity
        cosine_sim = np.dot(answer_llm_v_norm, answer_orig_v_norm)

        df.at[indx, 'cosine_sim'] = cosine_sim

    return df

In [88]:
df_norm_v = cosine_similarity(df_norm, embedding_model, 'answer_llm', 'answer_orig')

300it [02:19,  2.16it/s]


In [89]:
df_norm_v

Unnamed: 0,answer_llm,answer_orig,document,question,course,cosine_sim
0,You can sign up for the course by visiting the...,Machine Learning Zoomcamp FAQ\nThe purpose of ...,0227b872,Where can I sign up for the course?,machine-learning-zoomcamp,0.506754
1,You can sign up using the link provided in the...,Machine Learning Zoomcamp FAQ\nThe purpose of ...,0227b872,Can you provide a link to sign up?,machine-learning-zoomcamp,0.388549
2,"Yes, there is an FAQ for the Machine Learning ...",Machine Learning Zoomcamp FAQ\nThe purpose of ...,0227b872,Is there an FAQ for this Machine Learning course?,machine-learning-zoomcamp,0.718599
3,The context does not provide any specific info...,Machine Learning Zoomcamp FAQ\nThe purpose of ...,0227b872,Does this course have a GitHub repository for ...,machine-learning-zoomcamp,0.337266
4,To structure your questions and answers for th...,Machine Learning Zoomcamp FAQ\nThe purpose of ...,0227b872,How can I structure my questions and answers f...,machine-learning-zoomcamp,0.521792
...,...,...,...,...,...,...
295,An alternative way to load the data using the ...,Above users showed how to load the dataset dir...,8d209d6d,What is an alternative way to load the data us...,machine-learning-zoomcamp,0.914175
296,You can directly download the dataset from Git...,Above users showed how to load the dataset dir...,8d209d6d,How can I directly download the dataset from G...,machine-learning-zoomcamp,0.902190
297,You can fetch data for homework using the `req...,Above users showed how to load the dataset dir...,8d209d6d,Could you share a method to fetch data for hom...,machine-learning-zoomcamp,0.904734
298,If the status code is 200 when downloading dat...,Above users showed how to load the dataset dir...,8d209d6d,What should I do if the status code is 200 whe...,machine-learning-zoomcamp,0.726782


What's the 75th percentile of the cosine similarity scores?

- 0.63
- 0.73
- 0.83
- 0.93

In [90]:
df_norm_v.describe()

Unnamed: 0,cosine_sim
count,300.0
mean,0.728393
std,0.157755
min,0.125357
25%,0.651273
50%,0.763761
75%,0.836235
max,0.958796


### Answer Q3: 
- 0.83

## Q4. Rouge
Now we will explore an alternative metric - the ROUGE score.

This is a set of metrics that compares two answers based on the overlap of n-grams, word sequences, and word pairs.

It can give a more nuanced view of text similarity than just cosine similarity alone.

We don't need to implement it ourselves; there's a Python package for it:

In [91]:
pip install rouge

Collecting rouge
  Downloading rouge-1.0.1-py3-none-any.whl.metadata (4.1 kB)
Downloading rouge-1.0.1-py3-none-any.whl (13 kB)
Installing collected packages: rouge
Successfully installed rouge-1.0.1


Let's compute the ROUGE score between the answers at index 10 of our dataframe (```doc_id=5170565b```):

In [96]:
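# The filter can match several rows (one per question asked about this document)
answ_test = df_norm_v[df_norm_v['document'] == '5170565b']

In [122]:
from rouge import Rouge
rouge_scorer = Rouge()

# get_scores returns one result per (hypothesis, reference) pair; [0] keeps the first pair
scores = rouge_scorer.get_scores(answ_test['answer_llm'], answ_test['answer_orig'])[0]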

In [123]:
scores

{'rouge-1': {'r': 0.45454545454545453,
  'p': 0.45454545454545453,
  'f': 0.45454544954545456},
 'rouge-2': {'r': 0.21621621621621623,
  'p': 0.21621621621621623,
  'f': 0.21621621121621637},
 'rouge-l': {'r': 0.3939393939393939,
  'p': 0.3939393939393939,
  'f': 0.393939388939394}}

There are three scores: ```rouge-1```, ```rouge-2``` and ```rouge-l```, and precision, recall and F1 score for each.

- ```rouge-1``` - the overlap of unigrams,
- ```rouge-2``` - bigrams,
- ```rouge-l``` - the longest common subsequence
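To make the three definitions concrete, here is a simplified sketch of how such scores can be computed by hand. It assumes whitespace tokenization and count-clipped n-gram overlap; the ```rouge``` package may tokenize and count differently, so its numbers need not match this sketch exactly. All function names here are hypothetical:

```python
from collections import Counter


def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def prf(overlap, hyp_total, ref_total):
    """Precision, recall and F1 from an overlap count."""
    p = overlap / hyp_total if hyp_total else 0.0
    r = overlap / ref_total if ref_total else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return {'r': r, 'p': p, 'f': f}


def rouge_n(hypothesis, reference, n):
    hyp = Counter(ngrams(hypothesis.split(), n))
    ref = Counter(ngrams(reference.split(), n))
    overlap = sum((hyp & ref).values())  # shared n-grams, clipped by count
    return prf(overlap, sum(hyp.values()), sum(ref.values()))


def lcs_length(a, b):
    """Length of the longest common subsequence (dynamic programming)."""
    previous = [0] * (len(b) + 1)
    for x in a:
        current = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def rouge_l(hypothesis, reference):
    hyp, ref = hypothesis.split(), reference.split()
    return prf(lcs_length(hyp, ref), len(hyp), len(ref))


hyp_text = "the cat sat on the mat"
ref_text = "the cat is on the mat"
toy_rouge = {
    'rouge-1': rouge_n(hyp_text, ref_text, 1),  # 5 of 6 unigrams shared
    'rouge-2': rouge_n(hyp_text, ref_text, 2),  # 3 of 5 bigrams shared
    'rouge-l': rouge_l(hyp_text, ref_text),     # LCS has 5 tokens
}
print(toy_rouge)
```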

What's the F score for rouge-1?

- 0.35
- 0.45
- 0.55
- 0.65

### Answer Q4:
- 0.45

## Q5. Average rouge score
Let's compute the average of the rouge-1, rouge-2 and rouge-l F-scores for the same record from Q4.

- 0.35
- 0.45
- 0.55
- 0.65

In [124]:
scores['rouge-1']

{'r': 0.45454545454545453, 'p': 0.45454545454545453, 'f': 0.45454544954545456}

In [125]:
scores['rouge-1']['f']

0.45454544954545456

In [126]:
f_mean = []

for key in scores:
    f_mean.append(scores[key]['f'])

In [127]:
mean_f_mean = sum(f_mean) / len(f_mean)


mean_f_mean

0.35490034990035496

### Answer Q5:
- 0.35
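The same average can be written more compactly with the standard library. This is only an equivalent sketch; `example_scores` repeats the F-scores from the ```scores``` output above so the snippet runs on its own:

```python
from statistics import fmean

# F-scores copied from the `scores` output for this record
example_scores = {
    'rouge-1': {'f': 0.45454544954545456},
    'rouge-2': {'f': 0.21621621121621637},
    'rouge-l': {'f': 0.393939388939394},
}

mean_f = fmean(metric['f'] for metric in example_scores.values())
print(round(mean_f, 2))  # 0.35
```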

## Q6. Average rouge score for all the data points
Now let's compute the score for all the records and create a dataframe from them.

What's the average rouge_2 across all the records?

- 0.10
- 0.20
- 0.30
- 0.40

In [145]:
evaluations = []

for idx, r in tqdm(df.iterrows()):
    scores = rouge_scorer.get_scores(r['answer_llm'], r['answer_orig'])[0]
        
    rouge_1 = scores['rouge-1']['f']
    rouge_2 = scores['rouge-2']['f']
    rouge_l = scores['rouge-l']['f']
    rouge_avg = (rouge_1 + rouge_2 + rouge_l) / 3
    
    evaluations.append({'rouge_1': rouge_1,
                       'rouge_2': rouge_2,
                       'rouge_l': rouge_l,
                       'mean_rouge': rouge_avg})

300it [00:00, 346.19it/s]


In [146]:
print(pd.DataFrame(evaluations)['rouge_2'].mean())


0.20696501983423318


### Answer Q6:
- 0.20