## Homework: Evaluation and Monitoring
In this homework, we'll evaluate the quality of our RAG system.



### Getting the data
Let's start by getting the dataset. We will use the data we generated in the module.

In particular, we'll evaluate the quality of our RAG system with [gpt-4o-mini](https://github.com/DataTalksClub/llm-zoomcamp/blob/main/04-monitoring/data/results-gpt4o-mini.csv).

Read it:

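```python
import pandas as pd
github_url = 'https://github.com/DataTalksClub/llm-zoomcamp/blob/main/04-monitoring/data/results-gpt4o-mini.csv'
url = f'{github_url}?raw=1'  # ?raw=1 makes GitHub serve the raw CSV instead of the HTML page
df = pd.read_csv(url)
```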
We will use only the first 300 documents:

```python
df = df.iloc[:300]
```

In [1]:
!pip install rouge



In [2]:
import os
import requests
import pandas as pd
import numpy as np
from sentence_transformers import SentenceTransformer
from elasticsearch import Elasticsearch
from tqdm.auto import tqdm
from rouge import Rouge


In [3]:
github_url = "https://github.com/DataTalksClub/llm-zoomcamp/blob/main/04-monitoring/data/results-gpt4o-mini.csv"

In [4]:
url = f'{github_url}?raw=1'
df = pd.read_csv(url)
df = df.iloc[:300]

### Q1. Getting the embeddings model
Now, get the embeddings model ```multi-qa-mpnet-base-dot-v1``` from the Sentence Transformers library.

Note: this is not the same model as in HW3
```python
from sentence_transformers import SentenceTransformer
model_name = 'multi-qa-mpnet-base-dot-v1'
embedding_model = SentenceTransformer(model_name)
```
Create the embeddings for the first LLM answer:
```python
answer_llm = df.iloc[0].answer_llm
```
What's the first value of the resulting vector?

* -0.42
* -0.22
* -0.02
* 0.21

In [5]:
model_name = "multi-qa-mpnet-base-dot-v1"
embedding_model = SentenceTransformer(model_name)


In [6]:
answer_llm = df.iloc[0].answer_llm
v = embedding_model.encode(answer_llm)

### Q1 Answer

In [7]:
v[0]

-0.42244682

### Q2. Computing the dot product
Now, for each answer pair, let's create the embeddings and compute the dot product between them.

We will put the results (scores) into the `evaluations` list.

What's the 75th percentile of the score?

* 21.67
* 31.67
* 41.67
* 51.67
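
As a rough illustration of the two steps in this question, here is a minimal, dependency-free sketch on made-up two-dimensional vectors (not real embeddings). The helper names `dot` and `percentile` are hypothetical, and the percentile uses linear interpolation, which is assumed here to match the default behaviour of pandas' `describe()`.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def percentile(values, q):
    """Percentile with linear interpolation between neighbours, q in [0, 100]."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    frac = pos - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * frac


# Made-up (LLM answer vector, original answer vector) pairs
pairs = [
    ([1.0, 2.0], [3.0, 4.0]),
    ([0.0, 1.0], [5.0, 2.0]),
    ([2.0, 2.0], [1.0, 1.0]),
    ([1.0, 0.0], [7.0, 9.0]),
    ([3.0, 1.0], [1.0, 2.0]),
]

# One score per pair, as in the evaluations list
evaluations = [dot(a, b) for a, b in pairs]
p75 = percentile(evaluations, 75)
```

Note that the scores depend on the length of the vectors, so they are not bounded.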

In [8]:
df['v_answer_llm'] = df['answer_llm'].map(lambda x: embedding_model.encode(x))

In [9]:
df['v_answer_orig'] = df['answer_orig'].map(lambda x: embedding_model.encode(x))

In [10]:
df['score'] = df.apply(lambda x: x['v_answer_llm'].dot(x['v_answer_orig']), axis=1)

### Q2 Answer

In [11]:
df['score'].describe()

count    300.000000
mean      27.495996
std        6.384743
min        4.547927
25%       24.307842
50%       28.336858
75%       31.674304
max       39.476013
Name: score, dtype: float64

### Q3. Computing the cosine
From Q2, we can see that the scores are not bounded to the [-1, 1] range of cosine similarity: they go well above 1. That's because the vectors coming from this model are not normalized.

So we need to normalize them.

To do it, we

- Compute the norm of a vector
- Divide each element by this norm

So, for vector ```v```, it'll be ```v / ||v||```

In numpy, this is how you do it:
```python
norm = np.sqrt((v * v).sum())
v_norm = v / norm
```
Let's put it into a function and then compute the dot product between the normalized vectors. This will give us the cosine similarity.

What's the 75th percentile of the cosine similarity scores?

* 0.63
* 0.73
* 0.83
* 0.93
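
To see why normalizing first turns the dot product into cosine similarity, here is a small sketch in plain Python on made-up vectors. The names `normalise` and `cosine` are illustrative only and are not part of the homework code.

```python
import math


def normalise(v):
    """Divide each element by the Euclidean norm, giving a unit-length vector."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(a, b):
    """Dot product of the two normalised vectors."""
    return sum(x * y for x, y in zip(normalise(a), normalise(b)))


a = [3.0, 4.0]
b = [6.0, 8.0]   # same direction as a, twice as long
c = [-4.0, 3.0]  # perpendicular to a

same_direction = cosine(a, b)
perpendicular = cosine(a, c)
```

Here `same_direction` comes out at 1 (up to floating-point error) even though `b` is longer than `a`, and `perpendicular` comes out at 0: after normalization only the angle between the vectors matters.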

In [12]:
def normalise_v(v):
    return v / np.sqrt((v * v).sum())

In [13]:
df['v_answer_llm_norm'] = df['v_answer_llm'].apply(normalise_v)  

In [14]:
df['v_answer_orig_norm'] = df['v_answer_orig'].apply(normalise_v)  

In [15]:
df['score_norm'] = df.apply(lambda x: x['v_answer_llm_norm'].dot(x['v_answer_orig_norm']), axis=1)

### Q3 Answer

In [16]:
df['score_norm'].describe()

count    300.000000
mean       0.728392
std        0.157755
min        0.125357
25%        0.651273
50%        0.763761
75%        0.836235
max        0.958796
Name: score_norm, dtype: float64

### Q4. Rouge
Now we will explore an alternative metric - the ROUGE score.

This is a set of metrics that compares two answers based on the overlap of n-grams, word sequences, and word pairs.

It can give a more nuanced view of text similarity than cosine similarity alone.

We don't need to implement it ourselves. There's a Python package for it:
```bash
pip install rouge
```
(The latest version at the moment of writing is 1.0.1)

Let's compute the ROUGE score between the answers at index 10 of our dataframe (doc_id=5170565b)
```python
from rouge import Rouge
rouge_scorer = Rouge()

r = df.iloc[10]  # the record with doc_id=5170565b
scores = rouge_scorer.get_scores(r['answer_llm'], r['answer_orig'])[0]
```
There are three scores: ```rouge-1```, ```rouge-2``` and ```rouge-l```, and precision, recall and F1 score for each.

* ```rouge-1``` - the overlap of unigrams,
* ```rouge-2``` - bigrams,
* ```rouge-l``` - the longest common subsequence

What's the F score for rouge-1?

* 0.35
* 0.45
* 0.55
* 0.65
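
To make the definitions of ```rouge-1``` and ```rouge-l``` more concrete, here is a simplified sketch written from scratch. It is not the implementation used by the `rouge` package: it assumes lowercased whitespace tokenization and counts repeated words, so its numbers may differ from what `get_scores` returns. All function names are hypothetical.

```python
from collections import Counter


def prf(overlap, hyp_len, ref_len):
    """Precision, recall and F1 from an overlap count and the two lengths."""
    precision = overlap / hyp_len if hyp_len else 0.0
    recall = overlap / ref_len if ref_len else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"r": recall, "p": precision, "f": f1}


def toy_rouge_1(hypothesis, reference):
    """Unigram overlap: how many words the two texts share."""
    hyp, ref = hypothesis.lower().split(), reference.lower().split()
    overlap = sum((Counter(hyp) & Counter(ref)).values())
    return prf(overlap, len(hyp), len(ref))


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def toy_rouge_l(hypothesis, reference):
    """Longest common subsequence, turned into precision, recall and F1."""
    hyp, ref = hypothesis.lower().split(), reference.lower().split()
    return prf(lcs_length(hyp, ref), len(hyp), len(ref))


llm_answer = "the cat sat on the mat"
orig_answer = "the cat lay on a mat"
r1 = toy_rouge_1(llm_answer, orig_answer)
rl = toy_rouge_l(llm_answer, orig_answer)
```

In this toy pair, four of the six words in each sentence overlap ("the", "cat", "on", "mat"), and they appear in the same order, so both scores have precision and recall of 4/6.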

In [17]:
print(df.iloc[10].document)
r = df.iloc[10]

5170565b


In [18]:
rouge_scorer = Rouge()

scores = rouge_scorer.get_scores(r['answer_llm'], r['answer_orig'])[0]

### Q4 Answer

In [19]:
scores['rouge-1']['f']

0.45454544954545456

### Q5. Average rouge score
Let's compute the average of rouge-1, rouge-2 and rouge-l for the same record from Q4

* 0.35
* 0.45
* 0.55
* 0.65

In [20]:
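# Flatten every value (recall, precision and F score) of rouge-1, rouge-2 and rouge-l into one list.
# Note: this averages all nine values, not only the three F scores used in the Q6 formula.
average_scores = [value for metric in scores.values() for value in metric.values()]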

### Q5 Answer

In [21]:
sum(average_scores)/len(average_scores)

0.35490035323368824

### Q6. Average rouge score for all the data points
Now let's compute the score for all the records
```python
rouge_1 = scores['rouge-1']['f']
rouge_2 = scores['rouge-2']['f']
rouge_l = scores['rouge-l']['f']
rouge_avg = (rouge_1 + rouge_2 + rouge_l) / 3
```
And create a dataframe from them

What's the average rouge_2 across all the records?

* 0.10
* 0.20
* 0.30
* 0.40
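
One possible way to collect all four values per record is sketched below. It uses made-up score dictionaries in the shape returned by `get_scores(...)[0]`, so it runs without the dataset or the `rouge` package; the helper name `summarise` and the numbers are assumptions for illustration only.

```python
from statistics import mean


def summarise(scores):
    """Reduce one get_scores-style result to its F scores plus their average."""
    rouge_1 = scores["rouge-1"]["f"]
    rouge_2 = scores["rouge-2"]["f"]
    rouge_l = scores["rouge-l"]["f"]
    return {
        "rouge_1": rouge_1,
        "rouge_2": rouge_2,
        "rouge_l": rouge_l,
        "rouge_avg": (rouge_1 + rouge_2 + rouge_l) / 3,
    }


# Made-up results for two records, only to show the shape of the data
fake_results = [
    {
        "rouge-1": {"r": 0.5, "p": 0.5, "f": 0.5},
        "rouge-2": {"r": 0.25, "p": 0.25, "f": 0.25},
        "rouge-l": {"r": 0.75, "p": 0.75, "f": 0.75},
    },
    {
        "rouge-1": {"r": 1.0, "p": 1.0, "f": 1.0},
        "rouge-2": {"r": 0.75, "p": 0.75, "f": 0.75},
        "rouge-l": {"r": 0.5, "p": 0.5, "f": 0.5},
    },
]

# One row per record; a list of dicts like this can be passed to pd.DataFrame(rows)
rows = [summarise(s) for s in fake_results]
mean_rouge_2 = mean(row["rouge_2"] for row in rows)
```

With the real data, `fake_results` would be replaced by the output of the scorer for each pair of answers, and the column mean would then come from the resulting dataframe.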

In [22]:
def calc_rouge_scores(answer_llm, answer_orig):
    # Only the rouge-2 F score is needed to answer Q6
    scores = rouge_scorer.get_scores(answer_llm, answer_orig)[0]
    return scores['rouge-2']['f']

In [23]:
df['rouge_2'] = df.apply(lambda x: calc_rouge_scores(x['answer_llm'], x['answer_orig']), axis=1)

### Q6 Answer

In [24]:
df['rouge_2'].mean()

0.20696501983423318