# Homework: Evaluation and Monitoring

In this homework, we'll evaluate the quality of our RAG system.

> It's possible that your answers won't match exactly. If that's the case, select the closest one.

## Getting the data

Let's start by getting the dataset. We will use the data we generated in the module.

In particular, we'll evaluate the quality of our RAG system with [gpt-4o-mini](https://github.com/DataTalksClub/llm-zoomcamp/blob/main/04-monitoring/data/results-gpt4o-mini.csv).

Read it:

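```python
import pandas as pd
github_url = 'https://github.com/DataTalksClub/llm-zoomcamp/blob/main/04-monitoring/data/results-gpt4o-mini.csv'
url = f'{github_url}?raw=1'  # ?raw=1 returns the raw CSV instead of the GitHub HTML page
df = pd.read_csv(url)
```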

We will use only the first 300 documents:

```python
df = df.iloc[:300]
```

## Q1. Getting the embeddings model

Now, get the embeddings model `multi-qa-mpnet-base-dot-v1` from [the Sentence Transformers library](https://www.sbert.net/docs/sentence_transformer/pretrained_models.html#model-overview).

> Note: this is not the same model as in HW3

```python
from sentence_transformers import SentenceTransformer
model_name = 'multi-qa-mpnet-base-dot-v1'
embedding_model = SentenceTransformer(model_name)
```

Create the embeddings for the first LLM answer:

```python
answer_llm = df.iloc[0].answer_llm
```

What's the first value of the resulting vector?

* -0.42
* -0.22
* -0.02
* 0.21

```python
from tqdm.auto import tqdm
from rouge import Rouge
from sentence_transformers import SentenceTransformer
import numpy as np
import pandas as pd
```

```python
github_url = 'https://github.com/DataTalksClub/llm-zoomcamp/blob/main/04-monitoring/data/results-gpt4o-mini.csv'
url = f'{github_url}?raw=1'
df = pd.read_csv(url).iloc[:300]
df.head()
```

| | answer_llm | answer_orig | document | question | course |
|---|---|---|---|---|---|
| 0 | You can sign up for the course by visiting the... | Machine Learning Zoomcamp FAQ\nThe purpose of ... | 0227b872 | Where can I sign up for the course? | machine-learning-zoomcamp |
| 1 | You can sign up using the link provided in the... | Machine Learning Zoomcamp FAQ\nThe purpose of ... | 0227b872 | Can you provide a link to sign up? | machine-learning-zoomcamp |
| 2 | Yes, there is an FAQ for the Machine Learning ... | Machine Learning Zoomcamp FAQ\nThe purpose of ... | 0227b872 | Is there an FAQ for this Machine Learning course? | machine-learning-zoomcamp |
| 3 | The context does not provide any specific info... | Machine Learning Zoomcamp FAQ\nThe purpose of ... | 0227b872 | Does this course have a GitHub repository for ... | machine-learning-zoomcamp |
| 4 | To structure your questions and answers for th... | Machine Learning Zoomcamp FAQ\nThe purpose of ... | 0227b872 | How can I structure my questions and answers f... | machine-learning-zoomcamp |


```python
model_name = 'multi-qa-mpnet-base-dot-v1'
embedding_model = SentenceTransformer(model_name)
```

```python
answer_llm = df.iloc[0].answer_llm
embedding = embedding_model.encode(answer_llm)
print(f'First value in vector: {embedding[0]:.2f}')
```

First value in vector: -0.42


## Q2. Computing the dot product

Now, for each answer pair, let's create embeddings and compute the dot product between them.

We will put the results (scores) into the `evaluations` list.

What's the 75th percentile of the score?

* 21.67
* 31.67
* 41.67
* 51.67

```python
llm_embeddings = [embedding_model.encode(ans) for ans in tqdm(df['answer_llm'], desc='LLM embeddings', unit='embedding')]
orig_embeddings = [embedding_model.encode(ans) for ans in tqdm(df['answer_orig'], desc='Orig. embeddings', unit='embedding')]
# Sort the pairwise dot products so the percentile can be read off by position
results = sorted([x.dot(y) for x, y in tqdm(zip(llm_embeddings, orig_embeddings), desc='Calculating dot product', unit='pair')])
# Position of the 75th percentile in the sorted scores (no interpolation)
seventy_fifth_percentile = int(len(results) * 0.75)
results[seventy_fifth_percentile]
```

31.754608
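
The index lookup used here picks a single sorted element, while `np.percentile` interpolates linearly between neighbours by default, so the two can differ slightly. The sketch below compares them on made-up toy scores (the `toy_scores` values are an assumption for illustration, not data from the homework):

```python
import numpy as np

# Hypothetical toy scores, not taken from the homework dataset
toy_scores = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])

# Index-based lookup: take the element at position int(n * 0.75) of the sorted scores
sorted_scores = np.sort(toy_scores)
index_based = sorted_scores[int(len(sorted_scores) * 0.75)]

# NumPy's default percentile interpolates linearly between neighbouring elements
interpolated = np.percentile(toy_scores, 75)

print(f'index-based: {index_based:.2f}, np.percentile: {interpolated:.2f}')
```

On a sample of 300 scores the gap is usually small, which may explain why the value above is close to, but not exactly, one of the listed options.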

## Q3. Computing the cosine

From Q2, we can see that the scores are not bounded: they fall well outside the [-1, 1] range of cosine similarity. That's because the vectors coming from this model are not normalized.

So we need to normalize them.

To do that, we:

* Compute the norm of a vector
* Divide each element by this norm

So, for vector `v`, it'll be `v / ||v||`

In numpy, this is how you do it:

```python
norm = np.sqrt((v * v).sum())
v_norm = v / norm
```

Let's put it into a function and then compute the dot product between the normalized vectors. This will give us cosine similarity.

What's the 75th percentile of the cosine scores?
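
A minimal sketch of such a function, using the same formula as the snippet above. The names `normalize` and `cosine` and the toy vectors are assumptions chosen for illustration; the sketch also assumes non-zero vectors, since a zero vector would cause a division by zero.

```python
import numpy as np

def normalize(v):
    """Scale v to unit length (assumes v is not the zero vector)."""
    norm = np.sqrt((v * v).sum())
    return v / norm

def cosine(u, v):
    """Cosine similarity: the dot product of the two normalized vectors."""
    return normalize(u).dot(normalize(v))

# Hypothetical toy vectors, not model embeddings
u = np.array([3.0, 4.0])
v = np.array([6.0, 8.0])   # same direction as u, twice as long
w = np.array([-4.0, 3.0])  # perpendicular to u

print(u.dot(v))      # raw dot product grows with vector length
print(cosine(u, v))  # parallel vectors
print(cosine(u, w))  # perpendicular vectors
```

The raw dot product of `u` and `v` depends on their lengths, while the cosine depends only on the angle between them.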

* 0.63
* 0.73
* 0.83
* 0.93

In [6]:
llm_norm_embeddings = [v / np.sqrt((v * v).sum()) for v in tqdm(llm_embeddings, desc='Normalizing LLM embeddings')]
orig_norm_embeddings = [v / np.sqrt((v * v).sum()) for v in tqdm(orig_embeddings, desc='Normalizing Orig. embeddings')]
results = sorted([x.dot(y) for x, y in tqdm(zip(llm_norm_embeddings, orig_norm_embeddings), desc='Calculating dot product', unit='pair')])
results[seventy_fifth_percentile]

Normalizing LLM embeddings:   0%|          | 0/300 [00:00<?, ?it/s]

Normalizing Orig. embeddings:   0%|          | 0/300 [00:00<?, ?it/s]

Calculating dot product: 0pair [00:00, ?pair/s]

0.83714414

## Q4. Rouge

Now we will explore an alternative metric: the ROUGE score.

This is a set of metrics that compares two answers based on the overlap of n-grams, word sequences, and word pairs.

It can give a more nuanced view of text similarity than just cosine similarity alone.
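
To make the n-gram overlap idea concrete, here is a simplified sketch of an n-gram precision, recall and F1 computation. It assumes lowercased whitespace tokenization and clipped n-gram counts; the helper names (`ngrams`, `rouge_n`) and the example sentences are hypothetical, and the `rouge` package may tokenize and count differently, so its numbers need not match.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def rouge_n(hypothesis, reference, n=1):
    """Simplified ROUGE-N: n-gram overlap between hypothesis and reference."""
    hyp = Counter(ngrams(hypothesis.lower().split(), n))
    ref = Counter(ngrams(reference.lower().split(), n))
    # Counter intersection keeps the minimum count of each shared n-gram
    overlap = sum((hyp & ref).values())
    if overlap == 0:
        return {'p': 0.0, 'r': 0.0, 'f': 0.0}
    p = overlap / sum(hyp.values())  # share of hypothesis n-grams found in the reference
    r = overlap / sum(ref.values())  # share of reference n-grams found in the hypothesis
    return {'p': p, 'r': r, 'f': 2 * p * r / (p + r)}

# Hypothetical example sentences
unigram = rouge_n('the cat sat on the mat', 'the cat lay on the mat', n=1)
bigram = rouge_n('the cat sat on the mat', 'the cat lay on the mat', n=2)
print(f"unigram F1: {unigram['f']:.2f}, bigram F1: {bigram['f']:.2f}")
```

Changing a single word lowers the bigram score more than the unigram score, because every bigram that touches the changed word stops matching.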

We don't need to implement it ourselves; there's a Python package for it:

```bash
pip install rouge
```

(The latest version at the moment of writing is `1.0.1`)

Let's compute the ROUGE score between the answers at index 10 of our dataframe (`doc_id=5170565b`):

```python
from rouge import Rouge
rouge_scorer = Rouge()

r = df.iloc[10]  # the record at index 10
scores = rouge_scorer.get_scores(r['answer_llm'], r['answer_orig'])[0]
```

There are three scores: `rouge-1`, `rouge-2` and `rouge-l`, and precision, recall and F1 score for each.

* `rouge-1` - the overlap of unigrams,
* `rouge-2` - bigrams,
* `rouge-l` - the longest common subsequence
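
Unlike the n-gram scores, the longest common subsequence rewards words that appear in the same order, even with gaps between them. The sketch below is a simplified, sentence-level illustration assuming lowercased whitespace tokens; `lcs_length`, `rouge_l` and the example strings are hypothetical, and the `rouge` package's own `rouge-l` computation may differ.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l(hypothesis, reference):
    """Simplified ROUGE-L: precision, recall and F1 based on the LCS length."""
    hyp = hypothesis.lower().split()
    ref = reference.lower().split()
    lcs = lcs_length(hyp, ref)
    if lcs == 0:
        return {'p': 0.0, 'r': 0.0, 'f': 0.0}
    p = lcs / len(hyp)
    r = lcs / len(ref)
    return {'p': p, 'r': r, 'f': 2 * p * r / (p + r)}

# Same four words in both strings, but in a different order
reordered = rouge_l('the mat the cat', 'the cat the mat')
print(f"LCS-based F1 for reordered words: {reordered['f']:.2f}")
```

Both example strings contain exactly the same words, so a unigram comparison would treat them as a perfect match, while the LCS-based score is lower because the order differs.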

What's the F score for rouge-1?

* 0.35
* 0.45
* 0.55
* 0.65

```python
doc = df.iloc[10]
doc['document']
```

'5170565b'

```python
rouge_scorer = Rouge()

scores = rouge_scorer.get_scores(doc['answer_llm'], doc['answer_orig'])[0]
```

```python
print(f'rouge-1 F score: {scores["rouge-1"]["f"]:.2f}')
```

rouge-1 F score: 0.45


## Q5. Average rouge score

Let's compute the average of the F scores for rouge-1, rouge-2 and rouge-l for the same record as in Q4.

* 0.35
* 0.45
* 0.55
* 0.65

```python
avg = np.mean([scores[k]['f'] for k in ['rouge-1', 'rouge-2', 'rouge-l']])
print(f'Average F score: {avg:.2f}')
```

Average F score: 0.35


## Q6. Average rouge score for all the data points

Now let's compute the scores for all the records:

```python
rouge_1 = scores['rouge-1']['f']
rouge_2 = scores['rouge-2']['f']
rouge_l = scores['rouge-l']['f']
rouge_avg = (rouge_1 + rouge_2 + rouge_l) / 3
```

Then create a dataframe from them.

What's the average `rouge_2` across all the records?

* 0.10
* 0.20
* 0.30
* 0.40

```python
f_scores = []

# iterrows() yields (index, row) pairs; only the row is needed here
for _, row in df.iterrows():
    scores = rouge_scorer.get_scores(row['answer_llm'], row['answer_orig'])[0]
    f_scores.append({k: scores[k]['f'] for k in ['rouge-1', 'rouge-2', 'rouge-l']})

df_scores = pd.DataFrame(f_scores)
df_scores['rouge_avg'] = df_scores.mean(axis=1)
df_scores
```

| | rouge-1 | rouge-2 | rouge-l | rouge_avg |
|---|---|---|---|---|
| 0 | 0.095238 | 0.028169 | 0.095238 | 0.072882 |
| 1 | 0.125000 | 0.055556 | 0.093750 | 0.091435 |
| 2 | 0.415584 | 0.177778 | 0.389610 | 0.327658 |
| 3 | 0.216216 | 0.047059 | 0.189189 | 0.150821 |
| 4 | 0.142076 | 0.033898 | 0.120219 | 0.098731 |
| ... | ... | ... | ... | ... |
| 295 | 0.654545 | 0.540984 | 0.618182 | 0.604570 |
| 296 | 0.590164 | 0.460432 | 0.557377 | 0.535991 |
| 297 | 0.654867 | 0.564516 | 0.637168 | 0.618851 |
| 298 | 0.304762 | 0.132231 | 0.304762 | 0.247252 |


```python
print(f"Average rouge-2: {np.mean(df_scores['rouge-2']):.2f}")
```

Average rouge-2: 0.21
