## Q1. Getting the embeddings model

Now, get the embeddings model `multi-qa-mpnet-base-dot-v1` from the Sentence Transformers library.

Create the embeddings for the first LLM answer:

In [44]:
import pandas as pd
import numpy as np

In [6]:
base_url = 'https://github.com/tejasjbansal/LLM-Zoomcamp/blob/main'
relative_url = '4.%20Monitoring%20and%20Guardrails/data/results-gpt4o-mini.csv'
url = f'{base_url}/{relative_url}?raw=1'
df = pd.read_csv(url)

df = df.iloc[:300]

In [2]:
from sentence_transformers import SentenceTransformer
embedding_model = SentenceTransformer("multi-qa-mpnet-base-dot-v1")

You try to use a model that was created with version 3.0.0.dev0, however, your version is 2.7.0. This might cause unexpected behavior or errors. In that case, try to update to the latest version.





In [7]:
answer_llm = df.iloc[0].answer_llm

In [11]:
embedding_model.encode(answer_llm)[0]

-0.42244655

## Q2. Computing the dot product

Now, for each answer pair, let's create embeddings and compute the dot product between them.

We will put the results (scores) into the `evaluations` list.

What's the 75th percentile of the score?

In [28]:
from tqdm.auto import tqdm

evaluations = []

for row in tqdm(df[['answer_llm', 'answer_orig']].itertuples(index=False)):
    v_llm = embedding_model.encode(row.answer_llm)
    v_orig = embedding_model.encode(row.answer_orig)

    evaluations.append(v_llm.dot(v_orig))

300it [01:24,  3.56it/s]


In [31]:
df['evaluations'] = evaluations

In [32]:
df['evaluations'].describe()

count    300.000000
mean      27.495996
std        6.384742
min        4.547923
25%       24.307844
50%       28.336870
75%       31.674309
max       39.476013
Name: evaluations, dtype: float64

In [34]:
evaluations[0]

17.515987
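As a rough illustration of the two operations in this question, the sketch below computes dot products for a few hypothetical toy vectors (not real embeddings) and then takes the 75th percentile using linear interpolation between sorted values, which is also the default interpolation pandas uses for the `75%` row of `describe()`. It relies only on the standard library so that it runs without downloading the model; the helper `dot` and the toy data are assumptions made for the example.

```python
import statistics


def dot(u, v):
    """Dot product of two equal-length sequences of numbers."""
    return sum(a * b for a, b in zip(u, v))


# Hypothetical toy vectors standing in for (answer_llm, answer_orig) embeddings
pairs = [
    ([1.0, 2.0], [3.0, 4.0]),  # 1*3 + 2*4 = 11
    ([0.0, 1.0], [5.0, 2.0]),  # 0*5 + 1*2 = 2
    ([2.0, 2.0], [1.0, 1.0]),  # 2*1 + 2*1 = 4
    ([1.0, 0.0], [7.0, 3.0]),  # 1*7 + 0*3 = 7
    ([3.0, 1.0], [1.0, 2.0]),  # 3*1 + 1*2 = 5
]
toy_scores = [dot(u, v) for u, v in pairs]

# method="inclusive" interpolates linearly between the sorted values;
# with n=4 the three cut points are the 25th, 50th and 75th percentiles
q1, median, q3 = statistics.quantiles(toy_scores, n=4, method="inclusive")

print(toy_scores)  # [11.0, 2.0, 4.0, 7.0, 5.0]
print(q3)          # 7.0
```

Note that the raw dot product grows with the length of the vectors, so the scores are not bounded.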

## Q3. Computing the cosine

From Q2, we can see that the results are not within the [0, 1] range. That's because the vectors coming from this model are not normalized.

So we need to normalize them.

To do that, we

- Compute the norm of a vector
- Divide each element by this norm

So, for a vector v, the normalized vector is v / ||v||.

In numpy, this is how you do it:

- `norm = np.sqrt((v * v).sum())`
- `v_norm = v / norm`

Let's put it into a function and then compute the dot product between the normalized vectors. This gives us the cosine similarity.

What's the 75th percentile of the cosine similarity scores?
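To see why normalizing first turns the dot product into a cosine, here is a minimal sketch on hypothetical 2-dimensional vectors, written in plain Python so it runs without numpy or the model. It follows the same two steps (compute the norm, divide by it) and assumes the vectors are non-zero. The helper names `normalize` and `dot` are made up for the example. In general, \( \cos(v, w) = \frac{v \cdot w}{\|v\| \, \|w\|} \) lies in [-1, 1] and does not change when a vector is rescaled.

```python
import math


def normalize(v):
    """Return v scaled to unit length, i.e. v / ||v|| (assumes v is non-zero)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(u, v):
    """Dot product of two equal-length sequences of numbers."""
    return sum(a * b for a, b in zip(u, v))


v = [3.0, 4.0]  # norm is 5
w = [4.0, 3.0]  # norm is 5

raw = dot(v, w)                             # 3*4 + 4*3 = 24, depends on vector length
cosine = dot(normalize(v), normalize(w))    # 24 / (5 * 5), approximately 0.96

# Scaling v by 10 changes the raw dot product but not the cosine
scaled = dot(normalize([30.0, 40.0]), normalize(w))

print(raw)
print(round(cosine, 6), round(scaled, 6))
```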

In [45]:
results_gpt4o = df.to_dict(orient='records')

In [46]:
def compute_similarity(record):
    answer_orig = record['answer_orig']
    answer_llm = record['answer_llm']

    v_llm = embedding_model.encode(answer_llm)
    v_orig = embedding_model.encode(answer_orig)

    # Normalize each vector to unit length (v / ||v||)
    v_norm_llm = v_llm / np.sqrt((v_llm * v_llm).sum())
    v_norm_orig = v_orig / np.sqrt((v_orig * v_orig).sum())

    # The dot product of two unit vectors is their cosine similarity
    return v_norm_llm.dot(v_norm_orig)

In [47]:
evaluations = []

for record in tqdm(results_gpt4o):
    sim = compute_similarity(record)
    evaluations.append(sim)


100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 300/300 [01:22<00:00,  3.62it/s]


In [48]:
df['evaluations'] = evaluations
df['evaluations'].describe()

count    300.000000
mean       0.728393
std        0.157755
min        0.125357
25%        0.651273
50%        0.763761
75%        0.836235
max        0.958796
Name: evaluations, dtype: float64

## Q4. Rouge

Now we will explore an alternative metric: the ROUGE score.

This is a set of metrics that compares two answers based on the overlap of n-grams, word sequences, and word pairs.

It can give a more nuanced view of text similarity than cosine similarity alone.
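To make "overlap of n-grams" concrete, here is a simplified sketch of a ROUGE-N style computation. It assumes lowercase whitespace tokenization and clipped n-gram counts. The function name `rouge_n_sketch` is hypothetical, and the actual package may tokenize and count differently, so its numbers need not match this sketch.

```python
from collections import Counter


def rouge_n_sketch(hypothesis, reference, n=1):
    """Simplified ROUGE-N: recall, precision and F1 over shared n-grams."""
    def ngrams(text):
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    hyp, ref = ngrams(hypothesis), ngrams(reference)
    # Counter intersection keeps the minimum count of each shared n-gram
    overlap = sum((hyp & ref).values())
    recall = overlap / max(sum(ref.values()), 1)
    precision = overlap / max(sum(hyp.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {'r': recall, 'p': precision, 'f': f1}


hyp_text = "the cat sat on the mat"
ref_text = "the cat lay on the mat"

unigram = rouge_n_sketch(hyp_text, ref_text, n=1)  # 5 of 6 unigrams shared
bigram = rouge_n_sketch(hyp_text, ref_text, n=2)   # 3 of 5 bigrams shared

print(unigram)
print(bigram)
```

Recall divides the overlap by the number of n-grams in the reference, precision divides it by the number in the hypothesis, and F combines the two.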

We don't need to implement it ourselves; there's a Python package for it:

In [49]:
pip install rouge

Collecting rouge
  Downloading rouge-1.0.1-py3-none-any.whl.metadata (4.1 kB)
Downloading rouge-1.0.1-py3-none-any.whl (13 kB)
Installing collected packages: rouge
Successfully installed rouge-1.0.1
Note: you may need to restart the kernel to use updated packages.


(The latest version at the moment of writing is 1.0.1)

Let's compute the ROUGE score between the answers at index 10 of our dataframe (`doc_id=5170565b`):

In [86]:
r = df[df['document']=='5170565b'].iloc[0]

In [93]:
from rouge import Rouge
rouge_scorer = Rouge()

scores = rouge_scorer.get_scores(r['answer_llm'], r['answer_orig'])[0]

In [94]:
scores

{'rouge-1': {'r': 0.45454545454545453,
  'p': 0.45454545454545453,
  'f': 0.45454544954545456},
 'rouge-2': {'r': 0.21621621621621623,
  'p': 0.21621621621621623,
  'f': 0.21621621121621637},
 'rouge-l': {'r': 0.3939393939393939,
  'p': 0.3939393939393939,
  'f': 0.393939388939394}}

## Q5. Average rouge score

Let's compute the average between rouge-1, rouge-2 and rouge-l for the same record from Q4

To compute the average of the ROUGE-1, ROUGE-2, and ROUGE-L scores for recall, precision, and F-measure, we will take the arithmetic mean of the corresponding values from each metric.

Here are the given values:
- **ROUGE-1:** \( r = 0.4545 \), \( p = 0.4545 \), \( f = 0.4545 \)
- **ROUGE-2:** \( r = 0.2162 \), \( p = 0.2162 \), \( f = 0.2162 \)
- **ROUGE-L:** \( r = 0.3939 \), \( p = 0.3939 \), \( f = 0.3939 \)

In [95]:
# Define the recall, precision, and F-measure values for each ROUGE metric
rouge_1 = {'r': 0.45454545454545453, 'p': 0.45454545454545453, 'f': 0.45454544954545456}
rouge_2 = {'r': 0.21621621621621623, 'p': 0.21621621621621623, 'f': 0.21621621121621637}
rouge_l = {'r': 0.3939393939393939, 'p': 0.3939393939393939, 'f': 0.393939388939394}

# Compute the averages
average_recall = (rouge_1['r'] + rouge_2['r'] + rouge_l['r']) / 3
average_precision = (rouge_1['p'] + rouge_2['p'] + rouge_l['p']) / 3
average_fmeasure = (rouge_1['f'] + rouge_2['f'] + rouge_l['f']) / 3

average_recall, average_precision, average_fmeasure


(0.35490035490035493, 0.35490035490035493, 0.35490034990035496)
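The same averages can also be computed without writing out each sum by hand. The sketch below copies the values from the Q4 output into a dictionary named `q4_scores` (a name chosen here to avoid overwriting the notebook's `scores` variable) and averages each of `r`, `p` and `f` across the three metrics.

```python
# Values copied from the Q4 output
q4_scores = {
    'rouge-1': {'r': 0.45454545454545453, 'p': 0.45454545454545453, 'f': 0.45454544954545456},
    'rouge-2': {'r': 0.21621621621621623, 'p': 0.21621621621621623, 'f': 0.21621621121621637},
    'rouge-l': {'r': 0.3939393939393939, 'p': 0.3939393939393939, 'f': 0.393939388939394},
}

# For each of recall, precision and F-measure, take the arithmetic mean
# over rouge-1, rouge-2 and rouge-l
averages = {
    key: sum(metric[key] for metric in q4_scores.values()) / len(q4_scores)
    for key in ('r', 'p', 'f')
}

print(averages)
```

In the notebook, the dictionary returned by `rouge_scorer.get_scores(...)[0]` has the same shape, so it could be passed through the same comprehension.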

## Q6. Average rouge score for all the data points

Now let's compute the score for all the records:

```python
rouge_1 = scores['rouge-1']['f']
rouge_2 = scores['rouge-2']['f']
rouge_l = scores['rouge-l']['f']
rouge_avg = (rouge_1 + rouge_2 + rouge_l) / 3
```

And create a dataframe from them.

What's the average `rouge_l` across all the records?

In [97]:
def rouge_score(r):
    scores = rouge_scorer.get_scores(r['answer_llm'], r['answer_orig'])[0]
    
    return scores

In [104]:
data = []

for record in tqdm(results_gpt4o):
    sim = rouge_score(record)
    rouge_1 = sim['rouge-1']['f']
    rouge_2 = sim['rouge-2']['f']
    rouge_l = sim['rouge-l']['f']
    rouge_avg = (rouge_1 + rouge_2 + rouge_l) / 3

    data.append({
            'rouge-1': rouge_1,
            'rouge-2': rouge_2,
            'rouge-l': rouge_l,
            'rouge-avg': rouge_avg
        })
    

# Create a DataFrame from the list
df = pd.DataFrame(data)

# Display the DataFrame
df

100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 300/300 [00:00<00:00, 362.97it/s]


| | rouge-1 | rouge-2 | rouge-l | rouge-avg |
|---|---|---|---|---|
| 0 | 0.095238 | 0.028169 | 0.095238 | 0.072882 |
| 1 | 0.125000 | 0.055556 | 0.093750 | 0.091435 |
| 2 | 0.415584 | 0.177778 | 0.389610 | 0.327658 |
| 3 | 0.216216 | 0.047059 | 0.189189 | 0.150821 |
| 4 | 0.142076 | 0.033898 | 0.120219 | 0.098731 |
| ... | ... | ... | ... | ... |
| 295 | 0.654545 | 0.540984 | 0.618182 | 0.604570 |
| 296 | 0.590164 | 0.460432 | 0.557377 | 0.535991 |
| 297 | 0.654867 | 0.564516 | 0.637168 | 0.618851 |
| 298 | 0.304762 | 0.132231 | 0.304762 | 0.247252 |


In [109]:
df['rouge-l'].mean()

0.3538074656078652