## Homework: Evaluation and Monitoring

In this homework, we'll evaluate the quality of our RAG system.

> It's possible that your answers won't match exactly. If that's the case, select the closest one.


## Getting the data

Let's start by getting the dataset. We will use the data we generated in the module.

In particular, we'll evaluate the quality of our RAG system
with [gpt-4o-mini](https://github.com/DataTalksClub/llm-zoomcamp/blob/main/04-monitoring/data/results-gpt4o-mini.csv).

In [1]:
# import required libraries
import numpy as np
import pandas as pd
from sentence_transformers import SentenceTransformer
from tqdm import tqdm
from rouge import Rouge



In [2]:
# Read data
github_url = "https://github.com/DataTalksClub/llm-zoomcamp/blob/main/04-monitoring/data/results-gpt4o-mini.csv"
url = f'{github_url}?raw=1'
df = pd.read_csv(url)

In [3]:
df

Unnamed: 0,answer_llm,answer_orig,document,question,course
0,You can sign up for the course by visiting the...,Machine Learning Zoomcamp FAQ\nThe purpose of ...,0227b872,Where can I sign up for the course?,machine-learning-zoomcamp
1,You can sign up using the link provided in the...,Machine Learning Zoomcamp FAQ\nThe purpose of ...,0227b872,Can you provide a link to sign up?,machine-learning-zoomcamp
2,"Yes, there is an FAQ for the Machine Learning ...",Machine Learning Zoomcamp FAQ\nThe purpose of ...,0227b872,Is there an FAQ for this Machine Learning course?,machine-learning-zoomcamp
3,The context does not provide any specific info...,Machine Learning Zoomcamp FAQ\nThe purpose of ...,0227b872,Does this course have a GitHub repository for ...,machine-learning-zoomcamp
4,To structure your questions and answers for th...,Machine Learning Zoomcamp FAQ\nThe purpose of ...,0227b872,How can I structure my questions and answers f...,machine-learning-zoomcamp
...,...,...,...,...,...
1825,Some suggested titles for listing the Machine ...,I’ve seen LinkedIn users list DataTalksClub as...,c6a22665,What are some suggested titles for listing the...,machine-learning-zoomcamp
1826,It is best advised that you do not list the Ma...,I’ve seen LinkedIn users list DataTalksClub as...,c6a22665,Should I list the Machine Learning Zoomcamp ex...,machine-learning-zoomcamp
1827,You can incorporate your Machine Learning Zoom...,I’ve seen LinkedIn users list DataTalksClub as...,c6a22665,In which LinkedIn sections can I incorporate m...,machine-learning-zoomcamp
1828,The advice on including a project link in a CV...,I’ve seen LinkedIn users list DataTalksClub as...,c6a22665,Who gave advice on including a project link in...,machine-learning-zoomcamp


**We will use only the first 300 documents:**

In [4]:
df = df.iloc[:300]

In [5]:
df

Unnamed: 0,answer_llm,answer_orig,document,question,course
0,You can sign up for the course by visiting the...,Machine Learning Zoomcamp FAQ\nThe purpose of ...,0227b872,Where can I sign up for the course?,machine-learning-zoomcamp
1,You can sign up using the link provided in the...,Machine Learning Zoomcamp FAQ\nThe purpose of ...,0227b872,Can you provide a link to sign up?,machine-learning-zoomcamp
2,"Yes, there is an FAQ for the Machine Learning ...",Machine Learning Zoomcamp FAQ\nThe purpose of ...,0227b872,Is there an FAQ for this Machine Learning course?,machine-learning-zoomcamp
3,The context does not provide any specific info...,Machine Learning Zoomcamp FAQ\nThe purpose of ...,0227b872,Does this course have a GitHub repository for ...,machine-learning-zoomcamp
4,To structure your questions and answers for th...,Machine Learning Zoomcamp FAQ\nThe purpose of ...,0227b872,How can I structure my questions and answers f...,machine-learning-zoomcamp
...,...,...,...,...,...
295,An alternative way to load the data using the ...,Above users showed how to load the dataset dir...,8d209d6d,What is an alternative way to load the data us...,machine-learning-zoomcamp
296,You can directly download the dataset from Git...,Above users showed how to load the dataset dir...,8d209d6d,How can I directly download the dataset from G...,machine-learning-zoomcamp
297,You can fetch data for homework using the `req...,Above users showed how to load the dataset dir...,8d209d6d,Could you share a method to fetch data for hom...,machine-learning-zoomcamp
298,If the status code is 200 when downloading dat...,Above users showed how to load the dataset dir...,8d209d6d,What should I do if the status code is 200 whe...,machine-learning-zoomcamp


## Q1. Getting the embeddings model

Now, get the embeddings model `multi-qa-mpnet-base-dot-v1` from
[the Sentence Transformer library](https://www.sbert.net/docs/sentence_transformer/pretrained_models.html#model-overview).

> Note: this is not the same model as in HW3

```python
from sentence_transformers import SentenceTransformer

model_name = 'multi-qa-mpnet-base-dot-v1'
embedding_model = SentenceTransformer(model_name)
```

Create the embeddings for the first LLM answer:

```python
answer_llm = df.iloc[0].answer_llm
```

What's the first value of the resulting vector?

* -0.42
* -0.22
* -0.02
* 0.21

  **Answer: -0.42**

In [6]:
model_name = 'multi-qa-mpnet-base-dot-v1'
model = SentenceTransformer(model_name)

In [7]:
answer_llm = df.iloc[0].answer_llm

In [8]:
answer_llm

'You can sign up for the course by visiting the course page at [http://mlzoomcamp.com/](http://mlzoomcamp.com/).'

In [9]:
model.encode(answer_llm)[0]

np.float32(-0.42244655)

**Answer: -0.42**

## Q2. Computing the dot product


Now, for each answer pair, let's create embeddings and compute the dot product between them.

We will put the results (scores) into the `evaluations` list.

What's the 75th percentile of the score?

* 21.67
* 31.67
* 41.67
* 51.67

  **Answer: 31.67**

In [10]:
# Initialize the evaluations list
evaluations = []

# Iterate through the DataFrame
for _, row in df.iterrows():
    # Create embeddings
    llm_embedding = model.encode(row['answer_llm'])
    orig_embedding = model.encode(row['answer_orig'])
    
    # Compute dot product
    dot_product = np.dot(llm_embedding, orig_embedding)
    evaluations.append(dot_product)

In [11]:
len(evaluations)

300

In [12]:
evaluations[:5]

[np.float32(17.515987),
 np.float32(13.418402),
 np.float32(25.313255),
 np.float32(12.147415),
 np.float32(18.747736)]

In [13]:
# Calculate the 75th percentile
percentile_75 = np.percentile(evaluations, 75)

In [14]:
percentile_75

np.float32(31.674309)

**Answer: 31.67**
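
The loop in Q2 encodes one answer at a time. Since `encode` also accepts a list of strings and returns one row per string, the same scores can be computed on whole arrays at once. The sketch below uses small made-up arrays in place of real embeddings (arrays like these could come from `model.encode(df['answer_llm'].tolist())`, which is not run here), so the numbers are illustrative only:

```python
import numpy as np

# Toy stand-ins for two batches of embeddings (3 answer pairs, 4 dimensions).
# These values are invented for illustration, not produced by the model.
llm_embeddings = np.array([
    [1.0, 0.0, 2.0, 1.0],
    [0.5, 1.5, 0.0, 2.0],
    [2.0, 2.0, 1.0, 0.0],
])
orig_embeddings = np.array([
    [1.0, 1.0, 1.0, 1.0],
    [0.0, 2.0, 1.0, 1.0],
    [1.0, 3.0, 0.0, 2.0],
])

# Row-wise dot product: multiply element-wise, then sum across each row
scores = (llm_embeddings * orig_embeddings).sum(axis=1)

# Same percentile call as in the notebook cell
percentile_75 = np.percentile(scores, 75)
```

With its default settings, `np.percentile` interpolates linearly between neighbouring values, which is why the result does not have to be one of the scores themselves.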

## Q3. Computing the cosine

From Q2, we can see that the results are not within the [0, 1] range. It's because the vectors coming from this model are not normalized.

So we need to normalize them.

To do that, we:

* Compute the norm of a vector
* Divide each element by this norm

So, for vector `v`, it'll be `v / ||v||`.

In NumPy, this is how you do it:

```python
norm = np.sqrt((v * v).sum())
v_norm = v / norm
```

Let's put it into a function and then compute the dot product
between normalized vectors. This will give us the cosine similarity.
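
To see why normalization matters, here is a minimal sketch on toy 2-D vectors (the vectors and the `cosine_similarity` helper are made up for illustration). It uses `np.linalg.norm`, which for a 1-D vector computes the same value as `np.sqrt((v * v).sum())`:

```python
import numpy as np

def normalize(v):
    """Scale v to unit length (assumes v is not the zero vector)."""
    return v / np.linalg.norm(v)

def cosine_similarity(u, v):
    """Dot product of the two unit-length vectors."""
    return float(np.dot(normalize(u), normalize(v)))

# Toy vectors, chosen only for illustration
u = np.array([3.0, 4.0])
v = np.array([6.0, 8.0])    # same direction as u, twice as long
w = np.array([-4.0, 3.0])   # perpendicular to u

raw_dot = float(np.dot(u, v))        # grows with vector length
cos_same = cosine_similarity(u, v)   # depends only on direction
cos_perp = cosine_similarity(u, w)
```

The raw dot product of `u` and `v` is 50, while their cosine similarity is 1 because they point the same way. The perpendicular pair gets a cosine of 0.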

What's the 75th percentile of the cosine similarity scores?

* 0.63
* 0.73
* 0.83
* 0.93

  **Answer: 0.83**

In [15]:
# Function to normalize a vector
def normalize(v):
    norm = np.sqrt((v * v).sum())
    return v / norm

In [16]:
# Initialize the evaluations list
evaluations = []

# Iterate through the DataFrame
for _, row in tqdm(df.iterrows(), total=len(df)):
    # Create embeddings
    llm_embedding = model.encode(row['answer_llm'])
    orig_embedding = model.encode(row['answer_orig'])
    
    # Normalize the embeddings
    llm_embedding_norm = normalize(llm_embedding)
    orig_embedding_norm = normalize(orig_embedding)
    
    # Compute cosine similarity (dot product of normalized vectors)
    cosine_similarity = np.dot(llm_embedding_norm, orig_embedding_norm)
    evaluations.append(cosine_similarity)

100%|██████████| 300/300 [01:23<00:00,  3.60it/s]


In [17]:
# Calculate the 75th percentile
percentile_75 = np.percentile(evaluations, 75)
print(f"75th percentile of cosine similarity scores: {percentile_75:.2f}")

75th percentile of cosine similarity scores: 0.84


In [18]:
percentile_75

np.float32(0.8362349)

**Answer: 0.83**

## Q4. Rouge

Now we will explore an alternative metric: the ROUGE score.

This is a set of metrics that compares two answers based on the overlap of n-grams, word sequences, and word pairs.

It can give a more nuanced view of text similarity than just cosine similarity alone.

We don't need to implement it ourselves; there's a Python package for it:

```bash
pip install rouge
```

(The latest version at the moment of writing is `1.0.1`)

Let's compute the ROUGE score between the answers at index 10 of our dataframe (`doc_id=5170565b`):

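```python
from rouge import Rouge
rouge_scorer = Rouge()

# Row at index 10 of the dataframe
r = df.iloc[10]

# First argument is the generated answer, second is the reference answer
scores = rouge_scorer.get_scores(r['answer_llm'], r['answer_orig'])[0]
```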

There are three scores: `rouge-1`, `rouge-2` and `rouge-l`, and precision, recall and F1 score for each.

* `rouge-1` - the overlap of unigrams,
* `rouge-2` - bigrams,
* `rouge-l` - the longest common subsequence
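
To make the unigram overlap concrete, here is a simplified sketch of a ROUGE-1 style computation. The function name `rouge_1_sketch` is hypothetical, and the lowercasing and whitespace tokenization are assumptions, so its numbers need not match what the `rouge` package returns:

```python
from collections import Counter

def rouge_1_sketch(hypothesis, reference):
    """Unigram precision, recall and F1 using clipped word counts."""
    hyp_counts = Counter(hypothesis.lower().split())
    ref_counts = Counter(reference.lower().split())

    # Counter intersection keeps the minimum count of each shared word
    overlap = sum((hyp_counts & ref_counts).values())

    precision = overlap / max(sum(hyp_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    if precision + recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return {'p': precision, 'r': recall, 'f': f1}

# Toy sentences, invented for illustration
toy = rouge_1_sketch("the cat sat on the mat", "the cat lay on a mat")
```

In the toy pair, four words are shared ("the", "cat", "on", "mat") out of six words on each side, so precision, recall and F1 all come out at about 0.67.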

What's the F score for `rouge-1`?

- 0.35
- 0.45
- 0.55
- 0.65

  **Answer: 0.45**

In [19]:
# Initialize the Rouge scorer
rouge_scorer = Rouge()

# Get the row at index 10
r = df.iloc[10]

# Compute ROUGE scores
scores = rouge_scorer.get_scores(r['answer_llm'], r['answer_orig'])[0]

# Extract the F1 score for rouge-1
rouge_1_f1 = scores['rouge-1']['f']

print(f"ROUGE-1 F1 score: {rouge_1_f1:.2f}")

ROUGE-1 F1 score: 0.45


In [26]:
r

answer_llm     Yes, all sessions are recorded, so if you miss...
answer_orig    Everything is recorded, so you won’t miss anyt...
document                                                5170565b
question                    Are sessions recorded if I miss one?
course                                 machine-learning-zoomcamp
Name: 10, dtype: object

In [20]:
scores

{'rouge-1': {'r': 0.45454545454545453,
  'p': 0.45454545454545453,
  'f': 0.45454544954545456},
 'rouge-2': {'r': 0.21621621621621623,
  'p': 0.21621621621621623,
  'f': 0.21621621121621637},
 'rouge-l': {'r': 0.3939393939393939,
  'p': 0.3939393939393939,
  'f': 0.393939388939394}}

In [21]:
rouge_1_f1

0.45454544954545456

**Answer: 0.45**

## Q5. Average rouge score

Let's compute the average F-score across `rouge-1`, `rouge-2` and `rouge-l` for the same record from Q4.

- 0.35
- 0.45
- 0.55
- 0.65

  **Answer: 0.35**

In [22]:
# Extract F1 scores for rouge-1, rouge-2, and rouge-l
rouge_1_f1 = scores['rouge-1']['f']
rouge_2_f1 = scores['rouge-2']['f']
rouge_l_f1 = scores['rouge-l']['f']

# Compute the average F-score
average_f_score = (rouge_1_f1 + rouge_2_f1 + rouge_l_f1) / 3

In [23]:
average_f_score

0.35490034990035496

**Answer: 0.35**
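
As a more compact alternative, the average can be taken directly over the dictionary instead of naming each metric. This sketch hard-codes the F-scores from the `scores` output in Q4 so that it runs on its own:

```python
# F-scores copied from the `scores` output in Q4
scores = {
    'rouge-1': {'f': 0.45454544954545456},
    'rouge-2': {'f': 0.21621621121621637},
    'rouge-l': {'f': 0.393939388939394},
}

# Average the 'f' entry of every metric in the dictionary
average_f_score = sum(metric['f'] for metric in scores.values()) / len(scores)
```

This keeps working unchanged if the dictionary holds a different set of metrics.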

## Q6. Average rouge score for all the data points

Now let's compute the score for all the records and create a dataframe from them.

What's the average `rouge_2` across all the records?

- 0.10
- 0.20
- 0.30
- 0.40

  **Answer: 0.20**

In [24]:
# Initialize lists to store the scores
rouge_1_scores = []
rouge_2_scores = []
rouge_l_scores = []

# Iterate through all records in the dataframe
for _, row in tqdm(df.iterrows(), total=len(df), desc="Computing ROUGE scores"):
    scores = rouge_scorer.get_scores(row['answer_llm'], row['answer_orig'])[0]
    
    rouge_1_scores.append(scores['rouge-1']['f'])
    rouge_2_scores.append(scores['rouge-2']['f'])
    rouge_l_scores.append(scores['rouge-l']['f'])

# Create a dataframe with the scores
scores_df = pd.DataFrame({
    'rouge_1': rouge_1_scores,
    'rouge_2': rouge_2_scores,
    'rouge_l': rouge_l_scores
})

# Calculate the average rouge_2 score
average_rouge_2 = scores_df['rouge_2'].mean()

Computing ROUGE scores: 100%|██████████| 300/300 [00:00<00:00, 349.40it/s]


In [25]:
average_rouge_2

np.float64(0.20696501983423318)

**Answer: 0.20**
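
The three parallel lists in the cell above can also be replaced by one list of flat records, one per row. The sketch below uses made-up scores in the shape returned by `get_scores(...)[0]` and only the standard library, so the averages are illustrative and unrelated to the homework answer:

```python
from statistics import fmean

# Made-up results, shaped like the output of get_scores(...)[0]
all_scores = [
    {'rouge-1': {'f': 0.5}, 'rouge-2': {'f': 0.2}, 'rouge-l': {'f': 0.4}},
    {'rouge-1': {'f': 0.3}, 'rouge-2': {'f': 0.1}, 'rouge-l': {'f': 0.2}},
]

# Flatten each result to {'rouge_1': f, 'rouge_2': f, 'rouge_l': f}
records = [
    {name.replace('-', '_'): values['f'] for name, values in scores.items()}
    for scores in all_scores
]

# Per-metric average across all records
averages = {column: fmean(r[column] for r in records) for column in records[0]}
```

Passing `records` to `pd.DataFrame` would give the same three columns as `scores_df`.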