<div style="overflow:auto;">
    <div style="float:left; margin-right:10px;">
        <img width='120' height='120' src='https://cityme.novaims.unl.pt/images/footer/novaims.png'>
    </div>
</div>

# <p align="center">Fidelidade chatbot</p>

---

## <p align="center">*1 - Evaluation*</p>

---

In this notebook, we evaluate a chatbot by comparing its responses against a set of reference question-answer (Q&A) pairs extracted from a PDF. The goal is to measure systematically how closely the chatbot's answers match the expected ones, both lexically and semantically, and to identify areas for improvement and optimization.

Our evaluation approach involves three key steps:

1. **Q&A Extraction**:  
   We parse a PDF document to extract structured Q&A pairs.

2. **Chatbot Response Generation**:  
   For each extracted question, we query the chatbot and record its generated answer along with the response time.

3. **Evaluation Metrics**:
   - **String Similarity**: We use fuzzy string matching (Levenshtein distance) to quantify the lexical similarity between the chatbot's answer and the expected answer.
   - **Semantic Similarity**: We compute the cosine similarity between sentence embeddings (using a transformer model) to assess contextual alignment, enabling evaluation beyond surface-level word overlap.

The output includes per-question evaluation results, summary statistics such as overall accuracy and average response time, and automatically saved reports for both string-based and semantic evaluations.

This evaluation framework provides both quantitative insights and qualitative feedback on chatbot behavior, helping guide further training and refinement efforts.


### 👥 **Team Members**
- **Ana Farinha** *(Student Number: 20211514)*  
- **António Oliveira** *(Student Number: 20211595)*  
- **Mariana Neto** *(Student Number: 20211527)*  
- **Salvador Domingues** *(Student Number: 20240597)*  

📅 **Date:** *May 26, 2025*  
📍 **Prepared for:** *Fidelidade*

---

# Table of Contents

1. [Import Libraries & Data](#1-import-libraries--data)  
2. [Evaluation](#2-evaluation)  
   2.1. [String Similarity](#21-string-similarity)  
   2.2. [Semantic Similarity](#22-semantic-similarity)  

# 1. Import Libraries & Data

In [1]:
# Imports
import os
import warnings

from api import *  # local project module
from sentence_transformers import SentenceTransformer
from eval_utils import *  # local project module

# Silence known noisy warnings from PDF parsing, Streamlit, and the Hugging Face hub
warnings.filterwarnings("ignore", message="CropBox missing from /Page")
warnings.filterwarnings("ignore", message=".*ScriptRunContext.*")
os.environ["HF_HUB_DISABLE_SYMLINKS_WARNING"] = "1"


# 2. Evaluation

In this section, we extract question-and-answer (Q&A) pairs from a structured PDF document. These pairs will serve as the reference dataset for evaluating the performance of our chatbot.

The extraction process is designed to:
- Detect and separate questions (typically marked with a bullet point or other prefix),
- Identify corresponding answers (often marked with a specific prefix like "R:"),
- Clean and structure the text into a format suitable for further evaluation.

This automated extraction ensures consistency, reduces manual effort, and prepares our data for downstream comparison using both string similarity and semantic evaluation techniques.
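The extraction rules listed above can be sketched on plain text as follows. This is a minimal illustration and not the implementation of `extract_qa_pairs_from_pdf`: the helper name `parse_qa_pairs`, the bullet characters, and the sample text are all assumptions, and the sketch expects text that has already been read out of the PDF.

```python
import re

# Assumed markers: a bullet starts a question, "R:" starts an answer.
QUESTION_PREFIX = re.compile(r"^[•\-*]\s*")
ANSWER_PREFIX = re.compile(r"^R:\s*")


def parse_qa_pairs(text):
    """Return a list of (question, answer) tuples found in `text`."""
    pairs = []        # finished (question, answer) tuples
    question = None   # question currently being read
    answer = None     # answer currently being read (None until "R:" is seen)

    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if QUESTION_PREFIX.match(line):
            # A new question starts: store the previous pair if it is complete
            if question is not None and answer is not None:
                pairs.append((question, answer))
            question = QUESTION_PREFIX.sub("", line, count=1)
            answer = None
        elif ANSWER_PREFIX.match(line):
            if question is not None:
                answer = ANSWER_PREFIX.sub("", line, count=1)
        elif answer is not None:
            answer += " " + line     # continuation of a multi-line answer
        elif question is not None:
            question += " " + line   # continuation of a multi-line question

    # Store the last pair, which has no following question to trigger it
    if question is not None and answer is not None:
        pairs.append((question, answer))
    return pairs


# Invented sample text, for illustration only
sample = """
• What is covered by the policy?
R: The policy covers accidental damage
and theft.
• How do I cancel?
R: Contact customer support.
"""

pairs = parse_qa_pairs(sample)
for q, a in pairs:
    print(q, "->", a)
```

Joining continuation lines with a single space keeps multi-line answers intact, which matters because line breaks in PDF text rarely coincide with sentence boundaries.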

In [3]:
qa_pairs = extract_qa_pairs_from_pdf('./docs/Documents for training and evaluation-20250507/Questions_Answers_Censored.pdf')

To understand how well our chatbot answers questions extracted from the PDF, we will evaluate its responses using several key metrics:

| **Metric**            | **What It Measures**                                                                 |
|-----------------------|--------------------------------------------------------------------------------------|
| **Accuracy**          | Percentage of chatbot answers that closely match the expected answer.               |
| **Similarity Score**  | Fuzzy match score (0–100) comparing chatbot’s answer to the expected answer.        |
| **Match (Yes/No)**    | Whether the answer is considered correct (based on a similarity threshold, e.g. 70).|
| **Response Time**     | Time (in seconds) taken by the chatbot to respond to a question.                    |
| **Average Response Time** | Mean of all response times; reflects the chatbot's overall speed.              |

These metrics help quantify both the correctness and responsiveness of the chatbot, guiding improvements and benchmarking.
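As a rough illustration of how the summary metrics in the table follow from per-question results, consider the sketch below. The record layout (`score`, `response_time`), the helper name `summarize`, and the sample numbers are hypothetical, and treating a score equal to the threshold as a match is also an assumption.

```python
from statistics import mean


def summarize(records, threshold=70):
    """Compute accuracy and average response time from per-question records.

    Each record is a dict with a fuzzy-match `score` (0-100) and a
    `response_time` in seconds. Both field names are assumptions.
    """
    matches = [r["score"] >= threshold for r in records]
    return {
        "total": len(records),
        "correct": sum(matches),
        "accuracy": 100 * sum(matches) / len(records),
        "avg_response_time": mean(r["response_time"] for r in records),
    }


# Made-up records, for illustration only
records = [
    {"score": 82, "response_time": 3.0},
    {"score": 55, "response_time": 4.0},
    {"score": 70, "response_time": 2.0},
    {"score": 40, "response_time": 3.0},
]

summary = summarize(records)
print(summary)
```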

## 2.1 String Similarity

In this section, we assess how closely the chatbot's answers match the expected answers by measuring their textual similarity. We use fuzzy string matching, which tolerates minor surface differences such as typos or formatting variations but does not recognize paraphrases.

The main tool used is the **FuzzyWuzzy** library, which calculates a similarity score between 0 and 100 based on the Levenshtein distance. Higher scores indicate greater similarity.

By setting a similarity threshold, we classify chatbot answers as correct or incorrect. This method provides a straightforward, interpretable way to evaluate answer quality based purely on text overlap.
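To make the scoring and thresholding concrete, here is a self-contained sketch of a Levenshtein-based score on a 0 to 100 scale. It is a simplified normalisation written for illustration: FuzzyWuzzy's own scorers normalise differently, so exact values will not match the library, and the helper names `levenshtein`, `similarity_score`, and `is_match` are assumptions.

```python
def levenshtein(a, b):
    """Edit distance: minimum number of insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def similarity_score(expected, actual):
    """Normalised similarity in [0, 100]; 100 means identical strings."""
    longest = max(len(expected), len(actual))
    if longest == 0:
        return 100.0
    return 100.0 * (1 - levenshtein(expected, actual) / longest)


def is_match(expected, actual, threshold=70):
    """Classify an answer as correct when its score reaches the threshold."""
    return similarity_score(expected.lower(), actual.lower()) >= threshold


# A trailing full stop barely changes the score, so this pair still matches
print(is_match("The policy covers theft", "The policy covers theft."))
# A differently worded answer scores low, whatever its meaning
print(similarity_score("kitten", "sitting"))
```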

In [3]:
run_multiple_evaluations(
    qa_pairs=qa_pairs,
    num_runs=15,
    threshold=70,
    evaluation_func=evaluate_string_and_save,
    summary_filename="string_summary_results",
    detailed_filename="string_detailed_results"
)


### Running Evaluation 1/15 ###
Total: 17, Correct: 0, Accuracy: 0.00%, Avg Response Time: 3.54s

### Running Evaluation 2/15 ###
Total: 17, Correct: 0, Accuracy: 0.00%, Avg Response Time: 3.16s

### Running Evaluation 3/15 ###
Total: 17, Correct: 0, Accuracy: 0.00%, Avg Response Time: 3.96s

### Running Evaluation 4/15 ###
Total: 17, Correct: 0, Accuracy: 0.00%, Avg Response Time: 3.35s

### Running Evaluation 5/15 ###
Total: 17, Correct: 0, Accuracy: 0.00%, Avg Response Time: 3.25s

### Running Evaluation 6/15 ###
Total: 17, Correct: 0, Accuracy: 0.00%, Avg Response Time: 3.27s

### Running Evaluation 7/15 ###
Total: 17, Correct: 0, Accuracy: 0.00%, Avg Response Time: 3.44s

### Running Evaluation 8/15 ###
Total: 17, Correct: 0, Accuracy: 0.00%, Avg Response Time: 3.70s

### Running Evaluation 9/15 ###
Total: 17, Correct: 0, Accuracy: 0.00%, Avg Response Time: 3.57s

### Running Evaluation 10/15 ###
Total: 17, Correct: 0, Accuracy: 0.00%, Avg Response Time: 3.39s

### Running Evaluation 11/15 ###
Total: 17, Correct: 0, Accuracy: 0.00%, Avg Response Time: 3.45s

### Running Evaluation 12/15 ###
Total: 17, Correct: 0, Accuracy: 0.00%, Avg Response Time: 3.45s

### Running Evaluation 13/15 ###
Total: 17, Correct: 0, Accuracy: 0.00%, Avg Response Time: 3.45s

### Running Evaluation 14/15 ###
Total: 17, Correct: 0, Accuracy: 0.00%, Avg Response Time: 3.34s

### Running Evaluation 15/15 ###
Total: 17, Correct: 0, Accuracy: 0.00%, Avg Response Time: 3.50s

Saved detailed results to: evaluation\string_detailed_results.csv
Saved summary results to: evaluation\string_summary_results.csv





#### Why String Similarity Evaluation Can Be Problematic

Evaluating chatbot or assistant answers using **string similarity** methods can be unreliable for several reasons:

- **Surface-Level Matching:** String similarity compares the literal characters or tokens in the text. It does not understand the *meaning* behind the answer. Two answers can be semantically equivalent but phrased very differently, leading to low similarity scores.

- **Sensitive to Minor Differences:** Even small changes like punctuation, synonyms, or word order can drastically reduce the similarity score, unfairly marking correct answers as wrong.

- **Ignores Context and Paraphrasing:** Good language models often rephrase answers or use synonyms, which string similarity methods might penalize despite the answer being correct.

Because of these limitations, string similarity can produce misleading evaluation results, especially in more complex or natural language scenarios.

---

#### Next Step: Evaluating with Semantic Similarity

To better capture the *meaning* behind responses, we will move to **semantic similarity** evaluation. This approach uses embeddings from models like Sentence Transformers to compare the *contextual meaning* of two sentences, rather than just their surface forms.

Semantic similarity can:

- Recognize paraphrases and synonyms
- Be more robust to minor wording differences
- Provide a more accurate measure of whether the chatbot's answer truly matches the expected answer

Next, we will implement and test semantic similarity evaluation using pre-trained sentence embedding models.


## 2.2 Semantic Similarity

In this section, we assess chatbot answers by measuring the **similarity of their underlying meaning** rather than exact wording. Using **sentence embeddings** and **cosine similarity**, this approach provides a deeper understanding of response quality—especially useful when answers are **rephrased** or use **different vocabulary**.

We use models such as:
- [`paraphrase-multilingual-MiniLM-L12-v2`](https://huggingface.co/sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2): lightweight, fast, and multilingual, good for general-purpose evaluations.
- [`LaBSE`](https://huggingface.co/sentence-transformers/LaBSE): a more powerful model designed for **language-agnostic sentence embeddings**, better suited for multilingual and semantically rich content.

> ⚠️ **Note:** Because semantic similarity is **less sensitive to wording variation** and tends to produce **stable results** (especially with low-generation-temperature settings), we typically perform **fewer evaluation runs**—often a **single run** is sufficient.

This type of evaluation offers a more **flexible and robust** way to judge chatbot performance, especially in real-world use cases where exact phrasing is less important than meaning.
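The cosine similarity at the heart of this approach can be sketched without any model. The vectors below are made-up three-dimensional stand-ins for sentence embeddings (real ones have hundreds of dimensions), and the helper name `cosine_similarity` is an assumption for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


# Toy "embeddings", invented for illustration
expected_vec = [1.0, 2.0, 0.0]
paraphrase_vec = [2.0, 4.0, 0.0]   # same direction, different length
unrelated_vec = [0.0, 0.0, 3.0]    # orthogonal direction

THRESHOLD = 0.7  # same cut-off as the evaluation cells in this section

for name, vec in [("paraphrase", paraphrase_vec), ("unrelated", unrelated_vec)]:
    score = cosine_similarity(expected_vec, vec)
    print(name, round(score, 3), "match" if score >= THRESHOLD else "no match")
```

Because the measure depends only on direction, the longer `paraphrase_vec` still scores close to 1. With real models, embeddings would come from `model.encode(...)`, and the `sentence_transformers.util.cos_sim` helper computes the same quantity on tensors.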

In [4]:
# Load the model
model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')

run_multiple_evaluations(
    qa_pairs=qa_pairs,
    num_runs=1,
    threshold=0.7,
    model=model,
    evaluation_func=evaluate_semantic_and_save,
    summary_filename="semantic_summary_results",
    detailed_filename="semantic_detailed_results"
)



### Running Evaluation 1/1 ###


Evaluating: 100%|██████████| 17/17 [01:06<00:00,  3.88s/it]


===== Summary =====
Total: 17, Correct: 3, Accuracy: 17.65%, Avg Response Time: 3.55s

Saved detailed results to: evaluation\semantic_detailed_results.csv
Saved summary results to: evaluation\semantic_summary_results.csv





In [5]:
# Load the model
model = SentenceTransformer('sentence-transformers/LaBSE')

run_multiple_evaluations(
    qa_pairs=qa_pairs,
    num_runs=1,
    threshold=0.7,
    model=model,
    evaluation_func=evaluate_semantic_and_save,
    summary_filename="semantic_summary_results",
    detailed_filename="semantic_detailed_results"
)



### Running Evaluation 1/1 ###


Evaluating: 100%|██████████| 17/17 [01:17<00:00,  4.56s/it]


===== Summary =====
Total: 17, Correct: 2, Accuracy: 11.76%, Avg Response Time: 3.80s

Saved detailed results to: evaluation\semantic_detailed_results.csv
Saved summary results to: evaluation\semantic_summary_results.csv





#### Comparison of Sentence Embedding Models for Portuguese Q&A Evaluation

When evaluating question-answer pairs in Portuguese from Portugal (pt-PT), choosing the right sentence embedding model is crucial for balancing accuracy, speed, and language coverage. 

Below is a comparison between two popular models suitable for this task:

| Feature / Model                                | paraphrase-multilingual-MiniLM-L12-v2         | LaBSE                                         |
|------------------------------------------------|-----------------------------------------------|-----------------------------------------------|
| **Publisher**                                  | Sentence-Transformers                         | Google Research                                |
| **Languages Supported**                        | ~50+ (including pt-PT)                        | 100+ (strong pt-PT support)                   |
| **Portuguese Variant**                         | Multilingual, supports both pt-PT and pt-BR   | Multilingual, very strong pt-PT support        |
| **Model Size**                                 | ~118M parameters                              | ~470M parameters                              |
| **Inference Speed**                            | ✅ Fast                                        | ❌ Slower                                     |
| **Accuracy (Semantic Similarity)**             | Moderate to High                              | High                                          |
| **Fine-tuned For**                             | Paraphrase mining, sentence embeddings        | Language-agnostic sentence embeddings         |
| **Training Dataset**                           | Multilingual paraphrase pairs                 | Translation and alignment-based corpora       |
| **Sentence Embedding Quality**                 | Good, efficient                               | Very high, better alignment across languages  |
| **Best Use Case**                              | Real-time evaluation, large-scale tests       | High-accuracy offline QA evaluation           |
| **Model Hub**                                  | [`sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2`](https://huggingface.co/sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2) | [`sentence-transformers/LaBSE`](https://huggingface.co/sentence-transformers/LaBSE) |


In [None]:
summary_csv = "evaluation/semantic_summary_results.csv"
detailed_csv = "evaluation/semantic_detailed_results.csv"

Plots saved in folder: plots


## 2.3 Bot-based Similarity

In this section, we evaluate the quality of answers generated by an assistant by comparing them to the expected answers using another LLM-based assistant as a judge. The judging assistant is prompted to assess the similarity between the expected and actual responses and return a numeric score from 0 to 10, with 10 indicating a perfect match.

This bot-based evaluation method leverages the language model's ability to interpret semantic meaning, making it useful for assessing open-ended or context-rich answers where traditional string matching techniques may fall short. By automating this process, we can efficiently and consistently evaluate large sets of question-answer pairs.

As the assistant’s evaluations are generative in nature, slight variability in scores is expected between runs, even when the inputs remain the same. This is a natural consequence of the probabilistic nature of language models.
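Since a generative judge may wrap its score in extra words, it can help to extract the number defensively before using it. The sketch below is one possible way to do that; `parse_score` is a hypothetical helper, the example replies are invented, and `eval_chat` may well handle this already.

```python
import re


def parse_score(reply, low=0, high=10):
    """Return the first integer in `reply` that lies in [low, high], or None."""
    for token in re.findall(r"\d+", reply):
        value = int(token)
        if low <= value <= high:
            return value
    return None  # no usable score found; the caller decides how to handle it


# Invented judge replies, for illustration only
for reply in ["8", "Score: 7/10", "I cannot rate this answer."]:
    print(repr(reply), "->", parse_score(reply))
```

Returning `None` rather than a default score keeps unparseable replies visible, so they can be retried or excluded instead of silently skewing the average.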

In [7]:
results = []

for idx, (question, expected_answer) in enumerate(qa_pairs):
    print(f"Evaluating Q{idx+1}...")
    
    # Get assistant's answer
    actual_answer = query_assistant(question)
    evaluation = eval_chat(question, expected_answer, actual_answer)

    results.append({
        "question": question,
        "expected_answer": expected_answer,
        "actual_answer": actual_answer,
        "evaluation": evaluation
    })
    


Evaluating Q1...
Evaluating Q2...
Evaluating Q3...
Evaluating Q4...
Evaluating Q5...
Evaluating Q6...
Evaluating Q7...
Evaluating Q8...
Evaluating Q9...
Evaluating Q10...
Evaluating Q11...
Evaluating Q12...
Evaluating Q13...
Evaluating Q14...
Evaluating Q15...
Evaluating Q16...
Evaluating Q17...


In [8]:
# Print the judge's score for each question
for i, result in enumerate(results, start=1):
    print(f"Q{i}", result["evaluation"])


Q1 5
Q2 8
Q3 7
Q4 7
Q5 7
Q6 7
Q7 7
Q8 8
Q9 7
Q10 4
Q11 4
Q12 2
Q13 7
Q14 7
Q15 8
Q16 6
Q17 8
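The per-question scores can be condensed into a summary comparable to the earlier accuracy figures. In the sketch below the scores are copied from the output above, while the pass mark of 7 is an assumption chosen to mirror the 70 and 0.7 thresholds used in the previous sections.

```python
from statistics import mean

# Judge scores for Q1..Q17, copied from the output above
scores = [5, 8, 7, 7, 7, 7, 7, 8, 7, 4, 4, 2, 7, 7, 8, 6, 8]

PASS_MARK = 7  # assumed cut-off, not defined anywhere in this notebook

average = mean(scores)
passed = sum(score >= PASS_MARK for score in scores)
pass_rate = 100 * passed / len(scores)

print(f"Average score: {average:.2f} / 10")
print(f"Answers scoring >= {PASS_MARK}: {passed}/{len(scores)} ({pass_rate:.1f}%)")
```

For this run the sketch reports an average of 6.41 out of 10, with 12 of 17 answers (70.6%) at or above the assumed pass mark.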


**Save Results**

In [None]:
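import pandas as pd  # results is a plain list of dicts, so wrap it in a DataFrame before saving
pd.DataFrame(results).to_csv('./evaluation/bot_eval.csv', index=False)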