# Homework 4: Evaluation and Monitoring

In [1]:
import pandas as pd

github_url = 'https://github.com/DataTalksClub/llm-zoomcamp/blob/main/04-monitoring/data/results-gpt4o-mini.csv'
url = f'{github_url}?raw=1'
df = pd.read_csv(url)

In [2]:
df = df.iloc[:300]
df.head()

| | answer_llm | answer_orig | document | question | course |
|---|---|---|---|---|---|
| 0 | You can sign up for the course by visiting the... | Machine Learning Zoomcamp FAQ\nThe purpose of ... | 0227b872 | Where can I sign up for the course? | machine-learning-zoomcamp |
| 1 | You can sign up using the link provided in the... | Machine Learning Zoomcamp FAQ\nThe purpose of ... | 0227b872 | Can you provide a link to sign up? | machine-learning-zoomcamp |
| 2 | Yes, there is an FAQ for the Machine Learning ... | Machine Learning Zoomcamp FAQ\nThe purpose of ... | 0227b872 | Is there an FAQ for this Machine Learning course? | machine-learning-zoomcamp |
| 3 | The context does not provide any specific info... | Machine Learning Zoomcamp FAQ\nThe purpose of ... | 0227b872 | Does this course have a GitHub repository for ... | machine-learning-zoomcamp |
| 4 | To structure your questions and answers for th... | Machine Learning Zoomcamp FAQ\nThe purpose of ... | 0227b872 | How can I structure my questions and answers f... | machine-learning-zoomcamp |


## Q1. Getting the embeddings model

Now, get the embeddings model `multi-qa-mpnet-base-dot-v1` from the [Sentence Transformer library](https://www.sbert.net/docs/sentence_transformer/pretrained_models.html#model-overview) and create the embedding for the first LLM answer (`answer_llm` of the first record).

What's the first value of the resulting vector?
* **-0.42**

In [3]:
from sentence_transformers import SentenceTransformer

embedding_model = SentenceTransformer('multi-qa-mpnet-base-dot-v1')
answer_llm = df.iloc[0].answer_llm
v_llm = embedding_model.encode(answer_llm)


In [4]:
v_llm[0]

-0.42244655

## Q2. Computing the dot product

Now, for each answer pair, let's create embeddings and compute the dot product between them. We will put the results (scores) into the `evaluations` list.

What's the 75th percentile of the score?
* **31.67**
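
As a rough illustration of the dot-product-and-percentile step, here is a minimal sketch on toy vectors rather than real embeddings. The arrays `emb_llm` and `emb_orig` are made-up stand-ins for the encoded answers, with one row per answer pair.

```python
import numpy as np

# Hypothetical stand-ins for embeddings: one row per answer pair
emb_llm = np.array([[1.0, 0.0], [2.0, 0.0], [3.0, 0.0], [4.0, 0.0], [5.0, 0.0]])
emb_orig = np.ones((5, 2))

# Row-wise dot product: multiply element-wise, then sum across each row
scores = (emb_llm * emb_orig).sum(axis=1)  # 1, 2, 3, 4, 5

# 75th percentile (NumPy interpolates linearly between ranks by default)
p75 = np.quantile(scores, 0.75)
print(p75)  # 4.0
```

With real embeddings the only difference is where the rows come from; the percentile call stays the same.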

In [5]:
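def compute_cos_similarity(record):
    # Despite the name, this returns the raw (unnormalized) dot product;
    # the normalized cosine version is computed in Q3.
    answer_orig = record['answer_orig']
    answer_llm = record['answer_llm']

    v_orig = embedding_model.encode(answer_orig)
    v_llm = embedding_model.encode(answer_llm)

    return v_llm.dot(v_orig)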

In [6]:
from tqdm import tqdm

results = df.to_dict(orient='records')
results[0]

{'answer_llm': 'You can sign up for the course by visiting the course page at [http://mlzoomcamp.com/](http://mlzoomcamp.com/).',
 'answer_orig': 'Machine Learning Zoomcamp FAQ\nThe purpose of this document is to capture frequently asked technical questions.\nWe did this for our data engineering course and it worked quite well. Check this document for inspiration on how to structure your questions and answers:\nData Engineering Zoomcamp FAQ\nIn the course GitHub repository there’s a link. Here it is: https://airtable.com/shryxwLd0COOEaqXo\nwork',
 'document': '0227b872',
 'question': 'Where can I sign up for the course?',
 'course': 'machine-learning-zoomcamp'}

In [7]:
evaluations = []

for record in tqdm(results):
    cos_sim = compute_cos_similarity(record)
    evaluations.append(cos_sim)

100%|█████████████████████████████████████████████████████████████████████████████████████| 300/300 [02:22<00:00,  2.11it/s]


In [8]:
import numpy as np

np.quantile(evaluations, 0.75)

31.67430877685547

## Q3. Computing the cosine

From Q2, we can see that the scores fall well outside the [-1, 1] range of a cosine similarity. This is because the vectors coming from this model are not normalized, so the dot product also grows with their lengths. We need to normalize them first.

To do this, we:
* Compute the norm of a vector
* Divide each element by this norm

Let's put it into a function and then compute the dot product between the normalized vectors. This will give us cosine similarity.

What's the 75th percentile of the cosine similarity scores?
* **0.83**
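
The two normalization steps can be sketched on small toy vectors, where the arithmetic is easy to check by hand. The helper name `normalize` and the vectors `a` and `b` are illustrative assumptions, not part of the homework code.

```python
import numpy as np

def normalize(v):
    """Return v scaled to unit length (L2 norm of 1)."""
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v)

# Toy vectors, both of length 5
a = np.array([3.0, 4.0])
b = np.array([4.0, 3.0])

dot_raw = a.dot(b)                        # 3*4 + 4*3 = 24
cosine = normalize(a).dot(normalize(b))   # 24 / (5 * 5) = 0.96
print(dot_raw, round(cosine, 2))
```

The raw dot product depends on the vector lengths, while the normalized version depends only on the angle between the vectors.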

In [9]:
def compute_cos_similarity_norm(record):
    answer_orig = record['answer_orig']
    answer_llm = record['answer_llm']
    
    v_orig = embedding_model.encode(answer_orig)
    norm_orig = np.sqrt((v_orig * v_orig).sum())
    v_orig_norm = v_orig / norm_orig
    
    v_llm = embedding_model.encode(answer_llm)
    norm_llm = np.sqrt((v_llm * v_llm).sum())
    v_llm_norm = v_llm / norm_llm

    return v_llm_norm.dot(v_orig_norm)

In [10]:
evaluations_norm = []

for record in tqdm(results):
    cos_sim_norm = compute_cos_similarity_norm(record)
    evaluations_norm.append(cos_sim_norm)

100%|█████████████████████████████████████████████████████████████████████████████████████| 300/300 [02:22<00:00,  2.10it/s]


In [11]:
np.quantile(evaluations_norm, 0.75)

0.8362348973751068

## Q4. Rouge

Now we will explore an alternative metric: the ROUGE score. This is a set of metrics that compares two answers based on the overlap of n-grams, word sequences, and word pairs. It can give a more nuanced view of text similarity than cosine similarity alone.

Let's compute the ROUGE score between the answers at index 10 of our dataframe (`doc_id=5170565b`).

There are three scores: `rouge-1`, `rouge-2`, and `rouge-l`; and precision, recall, and F1 score for each.
* `rouge-1` - the overlap of unigrams,
* `rouge-2` - bigrams,
* `rouge-l` - the longest common subsequence

What's the F score for rouge-1?
* **0.45**
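
To make the unigram-overlap idea concrete, here is a simplified sketch of a ROUGE-1 style score. It assumes lowercase whitespace tokenization and clipped word counts; the function name `rouge_1_sketch` is hypothetical, and the `rouge` package may tokenize and count differently, so its numbers will not necessarily match.

```python
from collections import Counter

def rouge_1_sketch(hypothesis, reference):
    """Unigram-overlap recall, precision and F1 (simplified illustration)."""
    hyp_counts = Counter(hypothesis.lower().split())
    ref_counts = Counter(reference.lower().split())

    # Counter intersection keeps, per word, the smaller of the two counts
    overlap = sum((hyp_counts & ref_counts).values())
    if overlap == 0:
        return {'r': 0.0, 'p': 0.0, 'f': 0.0}

    precision = overlap / sum(hyp_counts.values())
    recall = overlap / sum(ref_counts.values())
    f1 = 2 * precision * recall / (precision + recall)
    return {'r': recall, 'p': precision, 'f': f1}

# 5 of the 6 words overlap in each direction
toy = rouge_1_sketch('the cat sat on the mat', 'the cat is on the mat')
print(toy)
```

Here precision, recall, and F1 all come out to 5/6, because both sentences have six words and share five of them.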

In [12]:
from rouge import Rouge

rouge_scorer = Rouge()

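# The filter may match several rows (a document can have multiple questions);
# [0] below keeps the scores for the first match
r = df.loc[df['document'] == '5170565b']
scores = rouge_scorer.get_scores(r['answer_llm'], r['answer_orig'])[0]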
scores

{'rouge-1': {'r': 0.45454545454545453,
  'p': 0.45454545454545453,
  'f': 0.45454544954545456},
 'rouge-2': {'r': 0.21621621621621623,
  'p': 0.21621621621621623,
  'f': 0.21621621121621637},
 'rouge-l': {'r': 0.3939393939393939,
  'p': 0.3939393939393939,
  'f': 0.393939388939394}}

## Q5. Average rouge score

Let's compute the average of the `rouge-1`, `rouge-2`, and `rouge-l` F-scores for the same record from Q4.
* **0.35**

In [13]:
sum_f = 0
for k, v in scores.items():
    sum_f += v['f']
    
avg = sum_f / 3
avg

0.35490034990035496

## Q6. Average rouge score for all the data points

Now let's compute the score for all the records and create a dataframe from them.

What's the average `rouge-2` F-score across all the records?
* **0.20**
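
One possible way to turn a list of nested score dictionaries into a flat dataframe is `pd.json_normalize`, which by default joins nested keys with a dot. The sketch below uses made-up values in the same nested shape as the Q4 output; `toy_scores` is a hypothetical stand-in for the real scores.

```python
import pandas as pd

# Hypothetical scores in the nested shape shown in Q4 (values are made up)
toy_scores = [
    {'rouge-1': {'r': 0.5, 'p': 0.5, 'f': 0.5},
     'rouge-2': {'r': 0.2, 'p': 0.2, 'f': 0.2},
     'rouge-l': {'r': 0.4, 'p': 0.4, 'f': 0.4}},
    {'rouge-1': {'r': 0.7, 'p': 0.7, 'f': 0.7},
     'rouge-2': {'r': 0.4, 'p': 0.4, 'f': 0.4},
     'rouge-l': {'r': 0.6, 'p': 0.6, 'f': 0.6}},
]

# Flatten nested keys into columns such as 'rouge-2.f'
df_flat = pd.json_normalize(toy_scores)

# Average rouge-2 F-score across the toy records
mean_rouge2_f = df_flat['rouge-2.f'].mean()
print(round(mean_rouge2_f, 2))  # 0.3
```

Flattening everything at once keeps all nine metrics available as columns, so other averages can be read off the same dataframe.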

In [14]:
scores = rouge_scorer.get_scores(df['answer_llm'], df['answer_orig'])

In [15]:
df_scores = pd.DataFrame(scores)
# Expand the per-record rouge-2 dicts into separate r/p/f columns, then average the F-scores
df_avg = df_scores.join(pd.json_normalize(df_scores['rouge-2'])).drop(columns='rouge-1')
df_avg['f'].mean()

0.20696501983423318