## Homework: Evaluation and Monitoring

In this homework, we'll evaluate the quality of our RAG system.

## Getting the data

Let's start by getting the dataset. We will use the data we generated in the module.

In particular, we'll evaluate the answers generated
with [gpt-4o-mini](https://github.com/DataTalksClub/llm-zoomcamp/blob/main/04-monitoring/data/results-gpt4o-mini.csv).


Read it:

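```python
import pandas as pd
github_url = 'https://github.com/DataTalksClub/llm-zoomcamp/blob/main/04-monitoring/data/results-gpt4o-mini.csv'
url = f'{github_url}?raw=1'  # ?raw=1 makes GitHub serve the raw CSV file
df = pd.read_csv(url)
```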

We will use only the first 300 documents:


```python
df = df.iloc[:300]
```

In [26]:
import pandas as pd
from sentence_transformers import SentenceTransformer
import numpy as np
from rouge import Rouge

In [2]:
github_url = 'https://github.com/DataTalksClub/llm-zoomcamp/blob/main/04-monitoring/data/results-gpt4o-mini.csv'
url = f'{github_url}?raw=1'
df = pd.read_csv(url)

In [3]:
df = df.iloc[:300]

## Q1. Getting the embeddings model

Now, get the embeddings model `multi-qa-mpnet-base-dot-v1` from
[the Sentence Transformers library](https://www.sbert.net/docs/sentence_transformer/pretrained_models.html#model-overview).

> Note: this is not the same model as in HW3.

```python
from sentence_transformers import SentenceTransformer
model_name = 'multi-qa-mpnet-base-dot-v1'
embedding_model = SentenceTransformer(model_name)
```

Create the embeddings for the first LLM answer:

```python
answer_llm = df.iloc[0].answer_llm
```

What's the first value of the resulting vector?

* -0.42 <-- this
* -0.22
* -0.02
* 0.21


In [5]:
model_name = 'multi-qa-mpnet-base-dot-v1'
embedding_model = SentenceTransformer(model_name)

In [8]:
answer_llm = df.iloc[0].answer_llm
embedded_answer = embedding_model.encode(answer_llm)

In [14]:
embedded_answer

array([-4.22446549e-01, -2.24856257e-01, -3.24058414e-01, -2.84758478e-01,
        7.25642918e-03,  1.01186566e-01,  1.03716910e-01, -1.89983174e-01,
       -2.80599259e-02,  2.71588802e-01, -1.15337655e-01,  1.14666030e-01,
       -8.49586725e-02,  3.32365334e-01,  5.52720726e-02, -2.22195774e-01,
       -1.42540857e-01,  1.02519155e-01, -1.52333647e-01, -2.02912465e-01,
        1.98422875e-02,  8.38149190e-02, -5.68632066e-01,  2.32844148e-02,
       -1.67292684e-01, -2.39256918e-01, -8.05464387e-02,  2.57084146e-02,
       -8.15464780e-02, -7.39290118e-02, -2.61550009e-01,  1.92575473e-02,
        3.22909206e-01,  1.90357104e-01, -9.34726413e-05, -2.13165611e-01,
        2.88943425e-02, -1.79530401e-02, -5.92756271e-02,  1.99918285e-01,
       -4.75170948e-02,  1.71634093e-01, -2.45917086e-02, -9.38061550e-02,
       -3.57002735e-01,  1.33263692e-01,  1.94045901e-01, -1.18530318e-01,
        4.56915230e-01,  1.47728190e-01,  3.35945129e-01, -1.86959356e-01,
        2.45954901e-01, -

## Q2. Computing the dot product


Now, for each answer pair, let's create embeddings and compute the dot product between them.

We will put the results (scores) into the `evaluations` list.
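
As a rough illustration of this step, here is a minimal sketch that uses small made-up vectors in place of real embeddings (the toy vectors and the `p75` name are assumptions for the example, not part of the dataset). It fills `evaluations` with one dot product per pair and then takes the 75th percentile with `np.percentile`:

```python
import numpy as np

# Toy 3-dimensional vectors standing in for the output of
# embedding_model.encode; real embeddings are much longer.
llm_vectors = [
    np.array([1.0, 2.0, 3.0]),
    np.array([0.5, 0.0, 1.0]),
    np.array([2.0, 2.0, 2.0]),
    np.array([0.0, 1.0, 0.0]),
]
orig_vectors = [
    np.array([1.0, 0.0, 1.0]),
    np.array([2.0, 1.0, 2.0]),
    np.array([1.0, 1.0, 1.0]),
    np.array([3.0, 4.0, 5.0]),
]

# One score per (LLM answer, original answer) pair
evaluations = []
for v_llm, v_orig in zip(llm_vectors, orig_vectors):
    evaluations.append(float(np.dot(v_llm, v_orig)))

# 75th percentile of the scores (linear interpolation by default)
p75 = np.percentile(evaluations, 75)
print(evaluations, p75)
```

With the real data, each vector would come from encoding `answer_llm` and `answer_orig` for the same record.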

What's the 75th percentile of the score?

* 21.67
* 31.67 <--- this
* 41.67
* 51.67

In [15]:
df['answer_llm_embedding'] = df['answer_llm'].apply(lambda x: embedding_model.encode(x))
df['answer_orig_embedding'] = df['answer_orig'].apply(lambda x: embedding_model.encode(x))

In [17]:
df['dot_product'] = df.apply(lambda row: np.dot(row['answer_llm_embedding'], row['answer_orig_embedding']), axis=1)


In [19]:
evaluations = df.dot_product.values
print(f'75th percentile {np.percentile(evaluations, 75)}')

75th percentile 31.67430591583252



## Q3. Computing the cosine

From Q2, we can see that the results are not within the [0, 1] range. It's because the vectors coming from this model are not normalized.

So we need to normalize them.

To do it, we:

* Compute the norm of a vector
* Divide each element by this norm

So, for vector `v`, it'll be `v / ||v||`

In numpy, this is how you do it:

```python
norm = np.sqrt((v * v).sum())
v_norm = v / norm
```

Let's put it into a function and then compute the dot product
between the normalized vectors. This will give us the cosine similarity.
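
One possible way to wrap this up is sketched below. The helper names `normalize` and `cosine_similarity` and the two toy vectors are assumptions made for the example; with the real data, the inputs would be the embeddings of the two answers.

```python
import numpy as np


def normalize(v):
    """Scale v to unit length: v / ||v||."""
    norm = np.sqrt((v * v).sum())
    return v / norm


def cosine_similarity(u, v):
    """Dot product of the two unit-length vectors."""
    return float(np.dot(normalize(u), normalize(v)))


# Toy vectors, not real embeddings
u = np.array([3.0, 4.0])
v = np.array([4.0, 3.0])

raw_dot = float(np.dot(u, v))   # depends on the vector lengths
cos = cosine_similarity(u, v)   # depends only on the angle
print(raw_dot, cos)
```

Both toy vectors have length 5, so the raw dot product is 25 times larger than the cosine. This is the same effect that pushes the Q2 scores far above 1.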

What's the 75th percentile of the cosine similarity scores?

* 0.63
* 0.73
* 0.83 <-- this

In [21]:
def compute_normed_embedding(text):
    _vector = embedding_model.encode(text)
    return _vector / np.linalg.norm(_vector, ord=2)

df['answer_llm_embedding_normed'] = df['answer_llm'].apply(compute_normed_embedding)
df['answer_orig_embedding_normed'] = df['answer_orig'].apply(compute_normed_embedding)

In [22]:
df['cosine_similarity'] = df.apply(lambda row: np.dot(row['answer_llm_embedding_normed'], row['answer_orig_embedding_normed']), axis=1)


In [23]:
evaluations_cosine = df.cosine_similarity.values
print(f'75th percentile {np.percentile(evaluations_cosine, 75)}')

75th percentile 0.8362348228693008


## Q4. Rouge

Now we will explore an alternative metric - the ROUGE score.  

This is a set of metrics that compares two answers based on the overlap of n-grams, word sequences, and word pairs.

It can give a more nuanced view of text similarity than just cosine similarity alone.

We don't need to implement it ourselves. There's a Python package for it:

```bash
pip install rouge
```

(The latest version at the moment of writing is `1.0.1`)

Let's compute the ROUGE score between the answers at index 10 of our dataframe (`doc_id=5170565b`):

```python
from rouge import Rouge
rouge_scorer = Rouge()

r = df.iloc[10]  # the record at index 10
scores = rouge_scorer.get_scores(r['answer_llm'], r['answer_orig'])[0]
```

There are three scores: `rouge-1`, `rouge-2` and `rouge-l`, and precision, recall and F1 score for each.

* `rouge-1` - the overlap of unigrams,
* `rouge-2` - bigrams,
* `rouge-l` - the longest common subsequence
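
To make these three definitions more concrete, here is a from-scratch sketch of the F1 variants. It assumes lowercased whitespace tokenization and clipped n-gram counts, which is the textbook formulation. The `rouge` package may tokenize and count differently, so its numbers can differ from this sketch. All helper names (`ngrams`, `f1_from_counts`, `rouge_n_f1`, `lcs_length`, `rouge_l_f1`) and the example sentences are made up for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all n-grams (as tuples) of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def f1_from_counts(overlap, hyp_total, ref_total):
    """Turn an overlap count into F1 via precision and recall."""
    if overlap == 0 or hyp_total == 0 or ref_total == 0:
        return 0.0
    precision = overlap / hyp_total
    recall = overlap / ref_total
    return 2 * precision * recall / (precision + recall)


def rouge_n_f1(hypothesis, reference, n=1):
    """F1 over shared n-grams (n=1 for unigrams, n=2 for bigrams)."""
    hyp = Counter(ngrams(hypothesis.lower().split(), n))
    ref = Counter(ngrams(reference.lower().split(), n))
    # Counter intersection keeps the minimum count of each n-gram
    overlap = sum((hyp & ref).values())
    return f1_from_counts(overlap, sum(hyp.values()), sum(ref.values()))


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(hypothesis, reference):
    """F1 based on the longest common subsequence of tokens."""
    hyp = hypothesis.lower().split()
    ref = reference.lower().split()
    return f1_from_counts(lcs_length(hyp, ref), len(hyp), len(ref))


hypothesis = "the cat sat on the mat"
reference = "the cat is on the mat"
print(rouge_n_f1(hypothesis, reference, n=1))
print(rouge_n_f1(hypothesis, reference, n=2))
print(rouge_l_f1(hypothesis, reference))
```

In this toy pair, five of the six unigrams match but only three of the five bigrams do, so the bigram score is noticeably lower than the unigram one.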

What's the F score for `rouge-1`?

- 0.35
- 0.45 <-- this
- 0.55
- 0.65

In [27]:
r = df.iloc[10]
rouge_scorer = Rouge()

scores = rouge_scorer.get_scores(r['answer_llm'], r['answer_orig'])[0]

In [30]:
print(f"F score of rouge-1 {scores['rouge-1']['f']}")

F score of rouge-1 0.45454544954545456


## Q5. Average rouge score

Let's compute the average of `rouge-1`, `rouge-2` and `rouge-l` for the same record from Q4.

- 0.35 <-- this
- 0.45
- 0.55
- 0.65

In [32]:
rouge_1 = scores['rouge-1']['f']
rouge_2 = scores['rouge-2']['f']
rouge_l = scores['rouge-l']['f']
rouge_avg = (rouge_1 + rouge_2 + rouge_l) / 3
print(f'Average {rouge_avg}')

Average 0.35490034990035496



## Q6. Average rouge score for all the data points

Now let's compute the scores for all the records:

```python
rouge_1 = scores['rouge-1']['f']
rouge_2 = scores['rouge-2']['f']
rouge_l = scores['rouge-l']['f']
rouge_avg = (rouge_1 + rouge_2 + rouge_l) / 3
```

And create a dataframe from them.
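
The shape of this step can be sketched as follows. The three records below hold hypothetical F1 values, not real results from the dataset. In the homework, each dictionary would be built from the `scores` returned by `rouge_scorer.get_scores` for one row. The names `records`, `scores_df` and `column_means` are also assumptions.

```python
import pandas as pd

# Hypothetical per-record F1 scores (made-up numbers for illustration)
records = [
    {'rouge_1': 0.50, 'rouge_2': 0.30, 'rouge_l': 0.40},
    {'rouge_1': 0.20, 'rouge_2': 0.10, 'rouge_l': 0.30},
    {'rouge_1': 0.80, 'rouge_2': 0.50, 'rouge_l': 0.50},
]

# Add the per-record average of the three scores
for rec in records:
    rec['rouge_avg'] = (rec['rouge_1'] + rec['rouge_2'] + rec['rouge_l']) / 3

# One row per record, one column per metric
scores_df = pd.DataFrame(records)

# Column-wise means across all the records
column_means = scores_df.mean()
print(column_means)
```

Keeping each metric in its own column makes it easy to report the mean of any single one, for example `scores_df['rouge_l'].mean()`.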

What's the average `rouge_l` across all the records?

- 0.10
- 0.20
- 0.30 <--- this
- 0.40



In [35]:
def average_rouge(r):
    scores = rouge_scorer.get_scores(r['answer_llm'], r['answer_orig'])[0]
    rouge_1 = scores['rouge-1']['f']
    rouge_2 = scores['rouge-2']['f']
    rouge_l = scores['rouge-l']['f']
    
    rouge_avg = (rouge_1 + rouge_2 + rouge_l) / 3
    return rouge_avg

In [39]:
def compute_f_from_rouge_l(r):
    scores = rouge_scorer.get_scores(r['answer_llm'], r['answer_orig'])[0]
    return scores['rouge-l']['f'] 


In [37]:
df['rouge'] = df.apply(average_rouge, axis=1)

In [38]:
df.rouge.mean()

0.313205367339838

In [40]:
df['rouge_l'] = df.apply(compute_f_from_rouge_l, axis=1)

In [41]:
df.rouge_l.mean()

0.3538074656078652