# Homework: Evaluation and Monitoring

In [1]:
import pandas as pd

## Getting the data

In [2]:
github_url = "https://github.com/DataTalksClub/llm-zoomcamp/blob/main/04-monitoring/data/results-gpt4o-mini.csv"
url = f'{github_url}?raw=1'
df = pd.read_csv(url)

In [3]:
df = df.iloc[:300]

In [4]:
df.head()

| Unnamed: 0 | answer_llm | answer_orig | document | question | course |
|---|---|---|---|---|---|
| 0 | You can sign up for the course by visiting the... | Machine Learning Zoomcamp FAQ\nThe purpose of ... | 0227b872 | Where can I sign up for the course? | machine-learning-zoomcamp |
| 1 | You can sign up using the link provided in the... | Machine Learning Zoomcamp FAQ\nThe purpose of ... | 0227b872 | Can you provide a link to sign up? | machine-learning-zoomcamp |
| 2 | Yes, there is an FAQ for the Machine Learning ... | Machine Learning Zoomcamp FAQ\nThe purpose of ... | 0227b872 | Is there an FAQ for this Machine Learning course? | machine-learning-zoomcamp |
| 3 | The context does not provide any specific info... | Machine Learning Zoomcamp FAQ\nThe purpose of ... | 0227b872 | Does this course have a GitHub repository for ... | machine-learning-zoomcamp |
| 4 | To structure your questions and answers for th... | Machine Learning Zoomcamp FAQ\nThe purpose of ... | 0227b872 | How can I structure my questions and answers f... | machine-learning-zoomcamp |


## Q1. Getting the embeddings model

In [5]:
from sentence_transformers import SentenceTransformer


In [6]:
model_name = "multi-qa-mpnet-base-dot-v1"
embedding_model = SentenceTransformer(model_name)

In [None]:
answer_llm = df.iloc[0].answer_llm

In [8]:
embedding_model.encode(answer_llm)[0]

-0.42244676

**What's the first value of the resulting vector?**

**-0.42**  
-0.22  
-0.02  
0.21  

Answer: -0.42


## Q2. Computing the dot product

Now, for each answer pair, let's create embeddings and compute the dot product between them.

We will put the results (scores) into the `evaluations` list.

What's the 75th percentile of the score?

In [25]:
from tqdm.auto import tqdm

In [35]:
def get_embeddings(row, column):
    return embedding_model.encode(row[column])

In [33]:
get_embeddings(df.loc[0,:], "answer_llm")


array([[-4.22446758e-01, -2.24855900e-01, -3.24058443e-01,
        -2.84758657e-01,  7.25698331e-03,  1.01186745e-01,
         1.03716850e-01, -1.89983502e-01, -2.80596316e-02,
         2.71588653e-01, -1.15337193e-01,  1.14665851e-01,
        -8.49587470e-02,  3.32365155e-01,  5.52725643e-02,
        -2.22195774e-01, -1.42540932e-01,  1.02519162e-01,
        -1.52333617e-01, -2.02912480e-01,  1.98425725e-02,
         8.38148519e-02, -5.68631887e-01,  2.32841987e-02,
        -1.67292684e-01, -2.39256635e-01, -8.05459842e-02,
         2.57079173e-02, -8.15462843e-02, -7.39287138e-02,
        -2.61549920e-01,  1.92571841e-02,  3.22909236e-01,
         1.90356985e-01, -9.34726340e-05, -2.13165760e-01,
         2.88944878e-02, -1.79527570e-02, -5.92764653e-02,
         1.99918449e-01, -4.75168340e-02,  1.71633810e-01,
        -2.45913174e-02, -9.38060954e-02, -3.57002944e-01,
         1.33263826e-01,  1.94046125e-01, -1.18530892e-01,
         4.56915349e-01,  1.47727951e-01,  3.35945249e-0

In [39]:
answer_llm_emb = []
answer_orig_emb = []
for _, row in tqdm(df.iterrows(), total=len(df)):
    answer_llm_emb.append(get_embeddings(row, "answer_llm"))
    answer_orig_emb.append(get_embeddings(row, "answer_orig"))
    

100%|█████████████████████████████████████████████████████████████████████████████████| 300/300 [01:23<00:00,  3.61it/s]


In [13]:
import numpy as np

In [45]:
X1 = np.array(answer_llm_emb)
X2 = np.array(answer_orig_emb)

In [55]:
evaluations = np.diag(X1.dot(X2.T))

In [58]:
np.percentile(evaluations, 75)

31.67431640625

What's the 75th percentile of the score?

21.67  
**31.67**  
41.67  
51.67  

Answer: 31.67
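
The diagonal of `X1.dot(X2.T)` holds the per-pair dot products, but getting it that way first builds the full pairwise matrix (300×300 here). A minimal sketch of a row-wise alternative, using small random matrices `A` and `B` as hypothetical stand-ins for `X1` and `X2`:

```python
import numpy as np

# Toy stand-ins for the embedding matrices (3 pairs, 4 dimensions each).
rng = np.random.default_rng(0)
A = rng.normal(size=(3, 4))
B = rng.normal(size=(3, 4))

# Approach used in the notebook: full pairwise matrix, then keep the diagonal.
via_diag = np.diag(A.dot(B.T))

# Row-wise alternative: multiply element-wise and sum along each row.
row_wise = (A * B).sum(axis=1)

# The same computation expressed with einsum.
via_einsum = np.einsum("ij,ij->i", A, B)

# All three give one score per pair.
print(np.allclose(via_diag, row_wise), np.allclose(row_wise, via_einsum))

# 75th percentile of the toy scores, computed the same way as in the notebook.
print(np.percentile(row_wise, 75))
```

For 300 rows the difference is negligible, but the row-wise form scales linearly with the number of pairs instead of quadratically.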

## Q3. Computing the cosine

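From Q2, we can see that the scores are not bounded to the [0, 1] range. That's because the vectors coming from this model are not normalized.

So we need to normalize them to unit length. The dot product of two unit vectors is their cosine similarity, which always lies in [-1, 1].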

In [61]:
def norm_f(v):
    norm = np.sqrt((v * v).sum())
    return v / norm

In [70]:
X1n = np.array([norm_f(row) for row in X1])

In [71]:
X2n = np.array([norm_f(row) for row in X2])

In [72]:
evaluations = np.diag(X1n.dot(X2n.T))

In [73]:
np.percentile(evaluations, 75)

0.8362347781658173

What's the 75th percentile of the cosine scores?

0.63  
0.73  
**0.83**  
0.93  

Answer: 0.83
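
The normalization can also be done for all rows at once with `np.linalg.norm`. This is a sketch under the assumption that the embeddings are stored as 2-D arrays with one row per answer; `cosine_rows` is a hypothetical helper name, not part of the notebook:

```python
import numpy as np

def cosine_rows(A, B):
    """Cosine similarity between matching rows of A and B."""
    # Divide every row by its own L2 norm (keepdims keeps the shape broadcastable).
    A_unit = A / np.linalg.norm(A, axis=1, keepdims=True)
    B_unit = B / np.linalg.norm(B, axis=1, keepdims=True)
    # The dot product of unit vectors is the cosine of the angle between them.
    return (A_unit * B_unit).sum(axis=1)

# Toy vectors: parallel, orthogonal, and 45 degrees apart.
A = np.array([[3.0, 4.0], [1.0, 0.0], [1.0, 1.0]])
B = np.array([[6.0, 8.0], [0.0, 2.0], [1.0, 0.0]])

cos = cosine_rows(A, B)
print(cos)  # approximately [1.0, 0.0, 0.7071]
```

Scaling a vector does not change its cosine with another vector, which is why the first pair scores 1 even though the two vectors have different lengths.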

## Q4. Rouge

In [74]:
!pip install rouge

Collecting rouge
  Downloading rouge-1.0.1-py3-none-any.whl.metadata (4.1 kB)
Downloading rouge-1.0.1-py3-none-any.whl (13 kB)
Installing collected packages: rouge
Successfully installed rouge-1.0.1


In [7]:
r = df.iloc[10]  # the record at position 10

In [8]:
from rouge import Rouge
rouge_scorer = Rouge()

scores = rouge_scorer.get_scores(r['answer_llm'], r['answer_orig'])[0]

In [9]:
scores

{'rouge-1': {'r': 0.45454545454545453,
  'p': 0.45454545454545453,
  'f': 0.45454544954545456},
 'rouge-2': {'r': 0.21621621621621623,
  'p': 0.21621621621621623,
  'f': 0.21621621121621637},
 'rouge-l': {'r': 0.3939393939393939,
  'p': 0.3939393939393939,
  'f': 0.393939388939394}}

What's the F score for rouge-1?

0.35  
**0.45**   
0.55  
0.65  

Answer: 0.45
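
To make the `r`, `p` and `f` fields concrete, here is a simplified sketch of the textbook ROUGE-1 definition (clipped unigram overlap on whitespace tokens). It is an illustration only: the `rouge` package may tokenize and count differently, so `rouge_1` below is a hypothetical helper and is not expected to reproduce the package's numbers exactly.

```python
from collections import Counter

def rouge_1(hypothesis, reference):
    """Simplified ROUGE-1: unigram overlap between hypothesis and reference."""
    hyp = Counter(hypothesis.lower().split())
    ref = Counter(reference.lower().split())
    # Clipped overlap: a word counts at most as often as it appears in both texts.
    overlap = sum((hyp & ref).values())
    if overlap == 0:
        return {"r": 0.0, "p": 0.0, "f": 0.0}
    recall = overlap / sum(ref.values())     # share of reference words recovered
    precision = overlap / sum(hyp.values())  # share of hypothesis words that match
    f_score = 2 * precision * recall / (precision + recall)
    return {"r": recall, "p": precision, "f": f_score}

example = rouge_1("the cat sat on the mat", "the cat lay on a mat")
print(example)  # r, p and f are all 4/6 here: 4 overlapping words, 6 words per text
```

ROUGE-2 applies the same idea to bigrams, and ROUGE-L replaces the overlap count with the length of the longest common subsequence.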

## Q5. Average rouge score

Let's compute the average F-score across rouge-1, rouge-2 and rouge-l for the same record from Q4.

**0.35**  
0.45  
0.55  
0.65  

In [17]:
# Q5 asks for the F-score, so average the 'f' field of each metric
print(f"Mean F-score: {np.mean([scores[k]['f'] for k in scores]):.4f}")

Mean F-score: 0.3549
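
The same average can be checked without the `rouge` package by copying the Q4 output in as a literal. This is only a sketch for verifying the arithmetic; the dictionary below repeats the values printed in Q4:

```python
# Scores for record 10, copied from the Q4 output.
scores = {
    "rouge-1": {"r": 0.45454545454545453, "p": 0.45454545454545453, "f": 0.45454544954545456},
    "rouge-2": {"r": 0.21621621621621623, "p": 0.21621621621621623, "f": 0.21621621121621637},
    "rouge-l": {"r": 0.3939393939393939, "p": 0.3939393939393939, "f": 0.393939388939394},
}

# Average the F-score ('f') over the three metrics.
f_scores = [metric["f"] for metric in scores.values()]
mean_f = sum(f_scores) / len(f_scores)
print(round(mean_f, 4))  # 0.3549
```

For this record recall and precision are equal, so the F-score of each metric is almost identical to its recall, and averaging either field leads to the same answer of 0.35.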


## Q6. Average rouge score for all the data points

In [30]:
def get_rouge_2_score(a1, a2):
    # Returns the ROUGE-2 recall ('r') for a pair of answers
    scores = rouge_scorer.get_scores(a1, a2)[0]
    return scores['rouge-2']['r']

In [31]:
get_rouge_2_score(r['answer_llm'], r['answer_orig'])

0.21621621621621623

In [32]:
rouge_scores = []
for _, row in tqdm(df.iterrows(), total=len(df)):
    rouge_scores.append(get_rouge_2_score(row["answer_llm"], row["answer_orig"]))

100%|████████████████████████████████████████████████████████████████████████████████| 300/300 [00:00<00:00, 387.62it/s]


In [33]:
np.mean(rouge_scores)

0.19861258009846788

Now let's compute the score for all the records and create a dataframe from them.

What's the average rouge_2 across all the records?

0.10  
**0.20**  
0.30  
0.40  
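
One way to collect the scores into a dataframe is to flatten each nested result into one row. The sketch below uses two hypothetical records in the shape returned by `rouge_scorer.get_scores(...)[0]`; the `flatten` helper and its `field` parameter are assumptions for illustration (the notebook's own helper reads the `'r'` field):

```python
import pandas as pd

# Hypothetical per-record ROUGE results (made-up numbers, same nesting as the real output).
records = [
    {"rouge-1": {"r": 0.5, "p": 0.5, "f": 0.5},
     "rouge-2": {"r": 0.2, "p": 0.2, "f": 0.2},
     "rouge-l": {"r": 0.4, "p": 0.4, "f": 0.4}},
    {"rouge-1": {"r": 0.7, "p": 0.7, "f": 0.7},
     "rouge-2": {"r": 0.4, "p": 0.4, "f": 0.4},
     "rouge-l": {"r": 0.6, "p": 0.6, "f": 0.6}},
]

def flatten(scores, field="f"):
    """Keep one field per metric and rename 'rouge-2' to 'rouge_2', etc."""
    return {name.replace("-", "_"): values[field] for name, values in scores.items()}

# One row per record, one column per metric.
rouge_df = pd.DataFrame([flatten(s) for s in records])

# Column-wise means; rouge_df["rouge_2"].mean() is the quantity the question asks about.
print(rouge_df.mean())
```

With the real data, `records` would be built by calling `rouge_scorer.get_scores` on each row's `answer_llm` and `answer_orig`.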