## Homework: Evaluation and Monitoring

In this homework, we'll evaluate the quality of our RAG system.

> It's possible that your answers won't match exactly. If that's the case, select the closest one.

## Getting the data

Let's start by getting the dataset. We will use the data we generated in the module.

In particular, we'll evaluate the quality of our RAG system
with [gpt-4o-mini](https://github.com/DataTalksClub/llm-zoomcamp/blob/main/04-monitoring/data/results-gpt4o-mini.csv).


Read it:

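```python
import pandas as pd

github_url = 'https://github.com/DataTalksClub/llm-zoomcamp/blob/main/04-monitoring/data/results-gpt4o-mini.csv'
url = f'{github_url}?raw=1'
df = pd.read_csv(url)
```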

We will use only the first 300 documents:


```python
df = df.iloc[:300]
```

In [1]:
# Get the data.
import pandas as pd
github_url = 'https://github.com/DataTalksClub/llm-zoomcamp/blob/main/04-monitoring/data/results-gpt4o-mini.csv'
url = f'{github_url}?raw=1'
df = pd.read_csv(url)

In [2]:
# Use only the first 300 documents.
df = df.iloc[:300]

## Q1. Getting the embeddings model

Now, get the embeddings model `multi-qa-mpnet-base-dot-v1` from
[the Sentence Transformers library](https://www.sbert.net/docs/sentence_transformer/pretrained_models.html#model-overview).

> Note: this is not the same model as in HW3

```python
from sentence_transformers import SentenceTransformer

model_name = 'multi-qa-mpnet-base-dot-v1'
embedding_model = SentenceTransformer(model_name)
```

Create the embeddings for the first LLM answer:

```python
answer_llm = df.iloc[0].answer_llm
```

What's the first value of the resulting vector?

* -0.42
* -0.22
* -0.02
* 0.21

In [3]:
# Define the embedding model.
from sentence_transformers import SentenceTransformer
model_name = 'multi-qa-mpnet-base-dot-v1'
embedding_model = SentenceTransformer(model_name)

In [4]:
# Select the first LLM answer.
answer_llm_0 = df.iloc[0].answer_llm

In [5]:
# Create the embeddings for the first LLM answer. 
v_llm_0 = embedding_model.encode(answer_llm_0)

In [6]:
v_llm_0[0]

np.float32(-0.42244655)

## Q2. Computing the dot product


Now, for each answer pair, let's create embeddings and compute the dot product between them.

We will put the results (scores) into the `evaluations` list
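
As a rough illustration of this step, here is a minimal sketch that uses small made-up vectors in place of real embeddings. The names `toy_llm`, `toy_orig` and `toy_evaluations` are hypothetical and only stand in for the encoded `answer_llm` and `answer_orig` columns.

```python
import numpy as np

# Toy 2-D vectors standing in for the answer embeddings (made-up values).
toy_llm = np.array([[1.0, 2.0], [3.0, 4.0], [0.0, 1.0], [2.0, 2.0]])
toy_orig = np.array([[1.0, 0.0], [0.0, 1.0], [5.0, 5.0], [1.0, 1.0]])

# One dot product per (LLM answer, original answer) pair.
toy_evaluations = [float(u.dot(v)) for u, v in zip(toy_llm, toy_orig)]

# 75th percentile of the scores (NumPy interpolates linearly by default).
p75 = np.percentile(toy_evaluations, 75)
print(toy_evaluations, p75)
```

With real embeddings the only change would be that the vectors come from `embedding_model.encode(...)` instead of being typed in by hand.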

What's the 75th percentile of the score?

* 21.67
* 31.67
* 41.67
* 51.67

In [7]:
# Select the answer_llm column.
answer_llm_col = df.answer_llm

In [8]:
# Select the answer_orig column.
answer_orig_col = df.answer_orig

In [9]:
# Create the embeddings for answer_llm.
v_answer_llm = embedding_model.encode(answer_llm_col)

In [10]:
# Create the embeddings for answer_orig.
v_answer_orig = embedding_model.encode(answer_orig_col)

In [11]:
# Compute dot product.
evaluations = [llm.dot(orig) for llm, orig in zip(v_answer_llm, v_answer_orig)]

In [13]:
# Get the 75th percentile of evaluations.
import numpy as np
np.percentile(evaluations, 75)

np.float32(31.674309)

## Q3. Computing the cosine

From Q2, we can see that the scores fall well outside the [-1, 1] range of cosine similarity. That's because the vectors coming from this model are not normalized.

So we need to normalize them.

To do it, we:

* Compute the norm of a vector
* Divide each element by this norm

So, for vector `v`, it'll be `v / ||v||`

In numpy, this is how you do it:

```python
norm = np.sqrt((v * v).sum())
v_norm = v / norm
```

Let's put it into a function and then compute dot product 
between normalized vectors. This will give us cosine similarity
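
A minimal sketch of that idea on two hand-picked vectors, assuming a helper named `normalize` (a hypothetical name, not part of the homework). It does not guard against zero vectors.

```python
import numpy as np

def normalize(v: np.ndarray) -> np.ndarray:
    """Scale v to unit length: v / ||v||."""
    norm = np.sqrt((v * v).sum())
    return v / norm

# Toy vectors (made-up values), both of length 5.
a = np.array([3.0, 4.0])
b = np.array([4.0, 3.0])

raw_dot = a.dot(b)                        # 3*4 + 4*3 = 24
cosine = normalize(a).dot(normalize(b))   # 24 / (5 * 5) = 0.96
print(raw_dot, cosine)
```

The raw dot product grows with the vector lengths, while the dot product of the normalized vectors depends only on the angle between them.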

What's the 75th percentile of the cosine similarity scores?

* 0.63
* 0.73
* 0.83
* 0.93

In [14]:
# Function to normalise each vector in a collection of vectors.
def norm_vec(the_vec):
    norm_the_vec = [vec / np.sqrt((vec * vec).sum()) for vec in the_vec]
    return norm_the_vec

In [15]:
# Normalise answer_llm vector.
v_norm_answer_llm = norm_vec(v_answer_llm)

In [16]:
# Normalise answer_orig vector.
v_norm_answer_orig = norm_vec(v_answer_orig)

In [17]:
# Compute dot product of the normalised vectors.
cosine_similarity = [n_llm.dot(n_orig) for n_llm, n_orig in zip(v_norm_answer_llm, v_norm_answer_orig)]

In [18]:
# Get the 75th percentile of cosine_similarity.
np.percentile(cosine_similarity, 75)

np.float32(0.8362349)

## Q4. Rouge

Now we will explore an alternative metric - the ROUGE score.  

This is a set of metrics that compares two answers based on the overlap of n-grams, word sequences, and word pairs.

It can give a more nuanced view of text similarity than just cosine similarity alone.

We don't need to implement it ourselves. There's a Python package for it:

```bash
pip install rouge
```

(The latest version at the moment of writing is `1.0.1`)

Let's compute the ROUGE score between the answers at index 10 of our dataframe (`doc_id=5170565b`):

```python
from rouge import Rouge
rouge_scorer = Rouge()

# r is the record at index 10 of the dataframe
r = df.iloc[10]
scores = rouge_scorer.get_scores(r['answer_llm'], r['answer_orig'])[0]
```

There are three scores: `rouge-1`, `rouge-2` and `rouge-l`, and precision, recall and F1 score for each.

* `rouge-1` - the overlap of unigrams,
* `rouge-2` - bigrams,
* `rouge-l` - the longest common subsequence
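
To make the unigram case concrete, here is a simplified sketch of precision, recall and F1 over the sets of unique words in two strings. The function `unigram_overlap_scores` is hypothetical and uses naive whitespace tokenization, so it is an illustration of the idea and is not expected to reproduce the numbers from the `rouge` package.

```python
def unigram_overlap_scores(hypothesis: str, reference: str) -> dict:
    """Precision, recall and F1 over the unique lowercase words of two strings."""
    hyp_tokens = set(hypothesis.lower().split())
    ref_tokens = set(reference.lower().split())
    overlap = len(hyp_tokens & ref_tokens)

    precision = overlap / len(hyp_tokens) if hyp_tokens else 0.0
    recall = overlap / len(ref_tokens) if ref_tokens else 0.0
    if precision + recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return {"p": precision, "r": recall, "f": f1}

# Toy sentences: only "recorded" is shared.
toy_scores = unigram_overlap_scores("all sessions are recorded",
                                    "everything is recorded")
print(toy_scores)
```

Here one of the four hypothesis words appears in the reference (precision 1/4), and one of the three reference words appears in the hypothesis (recall 1/3).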

What's the F score for `rouge-1`?

- 0.35
- 0.45
- 0.55
- 0.65

In [19]:
# Define the rouge_scorer.
from rouge import Rouge
rouge_scorer = Rouge()

In [20]:
# Select the document at index 10.
doc_10 = df.iloc[10]
doc_10

answer_llm     Yes, all sessions are recorded, so if you miss...
answer_orig    Everything is recorded, so you won’t miss anyt...
document                                                5170565b
question                    Are sessions recorded if I miss one?
course                                 machine-learning-zoomcamp
Name: 10, dtype: object

In [21]:
# Select answer_llm at index 10 document.
answer_llm_10 = doc_10.answer_llm

In [22]:
# Select answer_orig at index 10 document.
answer_orig_10 = doc_10.answer_orig

In [23]:
# Calculate the rouge scores of document at index 10.
scores_10 = rouge_scorer.get_scores(answer_llm_10, answer_orig_10)[0]
scores_10

{'rouge-1': {'r': 0.45454545454545453,
  'p': 0.45454545454545453,
  'f': 0.45454544954545456},
 'rouge-2': {'r': 0.21621621621621623,
  'p': 0.21621621621621623,
  'f': 0.21621621121621637},
 'rouge-l': {'r': 0.3939393939393939,
  'p': 0.3939393939393939,
  'f': 0.393939388939394}}

In [24]:
# Return the rouge-1 f score of document at index 10.
scores_10_1_f = scores_10['rouge-1']['f']
scores_10_1_f

0.45454544954545456

## Q5. Average rouge score

Let's compute the average of the `rouge-1`, `rouge-2` and `rouge-l` F scores for the same record from Q4.

- 0.35
- 0.45
- 0.55
- 0.65

In [25]:
# Return the rouge-2 f score of document at index 10.
scores_10_2_f = scores_10['rouge-2']['f']
scores_10_2_f

0.21621621121621637

In [26]:
# Return the rouge-l f score of document at index 10.
scores_10_l_f = scores_10['rouge-l']['f']
scores_10_l_f

0.393939388939394

In [27]:
# Calculate the average of the returned scores.
scores_10_avg = (scores_10_1_f + scores_10_2_f + scores_10_l_f) / 3
scores_10_avg

0.35490034990035496

## Q6. Average rouge score for all the data points

Now let's compute the score for all the records:

```python
rouge_1 = scores['rouge-1']['f']
rouge_2 = scores['rouge-2']['f']
rouge_l = scores['rouge-l']['f']
rouge_avg = (rouge_1 + rouge_2 + rouge_l) / 3
```

And create a dataframe from them
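
One possible way to organize this, sketched on made-up score dicts that have the same nested shape as the `get_scores` output. The helper `flatten_scores` and the values in `toy_scores` are hypothetical.

```python
from statistics import mean

# Toy score dicts in the nested shape shown above; the numbers are made up.
toy_scores = [
    {"rouge-1": {"f": 0.5}, "rouge-2": {"f": 0.25}, "rouge-l": {"f": 0.45}},
    {"rouge-1": {"f": 0.3}, "rouge-2": {"f": 0.15}, "rouge-l": {"f": 0.3}},
]

def flatten_scores(scores: dict) -> dict:
    """Keep only the F scores and add their average as one flat row."""
    rouge_1 = scores["rouge-1"]["f"]
    rouge_2 = scores["rouge-2"]["f"]
    rouge_l = scores["rouge-l"]["f"]
    return {
        "rouge_1": rouge_1,
        "rouge_2": rouge_2,
        "rouge_l": rouge_l,
        "rouge_avg": (rouge_1 + rouge_2 + rouge_l) / 3,
    }

rows = [flatten_scores(s) for s in toy_scores]

# Column-wise mean of one metric across all rows.
mean_rouge_2 = mean(row["rouge_2"] for row in rows)
print(rows, mean_rouge_2)
```

A list of flat dicts like `rows` can be passed to `pd.DataFrame(rows)`, after which each metric is a numeric column.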

What's the average `rouge_2` across all the records?

- 0.10
- 0.20
- 0.30
- 0.40

In [30]:
# Compute the score for all the records.
rouge_score = [rouge_scorer.get_scores(r_llm, r_orig)[0] for r_llm, r_orig in zip(answer_llm_col, answer_orig_col)]

In [31]:
# Create a dataframe from rouge_score.
df_rouge_score = pd.DataFrame(rouge_score)

In [32]:
# Select the rouge-2 column in df_rouge_score. 
df_rouge_score['rouge-2']

0      {'r': 0.017543859649122806, 'p': 0.07142857142...
1      {'r': 0.03508771929824561, 'p': 0.133333333333...
2      {'r': 0.14035087719298245, 'p': 0.242424242424...
3      {'r': 0.03508771929824561, 'p': 0.071428571428...
4      {'r': 0.07017543859649122, 'p': 0.022346368715...
                             ...                        
295    {'r': 0.559322033898305, 'p': 0.52380952380952...
296    {'r': 0.5423728813559322, 'p': 0.4, 'f': 0.460...
297    {'r': 0.5932203389830508, 'p': 0.5384615384615...
298    {'r': 0.13559322033898305, 'p': 0.129032258064...
299    {'r': 0.01694915254237288, 'p': 0.038461538461...
Name: rouge-2, Length: 300, dtype: object

In [33]:
# Select the first item in rouge-2 column. 
df_rouge_score['rouge-2'][0]

{'r': 0.017543859649122806,
 'p': 0.07142857142857142,
 'f': 0.028169010918468917}

In [34]:
# Select the first f score of rouge-2. 
df_rouge_score['rouge-2'][0]['f']

0.028169010918468917

In [35]:
# Calculate the average of rouge-2.
rouge_2 = np.mean([s_r2['f'] for s_r2 in df_rouge_score['rouge-2']])

In [36]:
rouge_2

np.float64(0.20696501983423318)