<div style="overflow:auto;">
    <div style="float:left; margin-right:10px;">
        <img width='120' height='120' src='https://cityme.novaims.unl.pt/images/footer/novaims.png'>
    </div>

# <p align="center">Fidelidade chatbot</p>

---

## <p align="center">*1 - Evaluation*</p>

---

In this notebook, we evaluate the chatbot by comparing its responses to a set of reference question-answer (Q&A) pairs extracted from a PDF. The goal is to systematically assess how accurate and semantically appropriate the chatbot's answers to real-world questions are, which helps identify areas for improvement and optimization.

Our evaluation approach involves three key steps:

1. **Q&A Extraction**:  
   We parse a PDF document to extract structured Q&A pairs.

2. **Chatbot Response Generation**:  
   For each extracted question, we query the chatbot and record its generated answer along with the response time.

3. **Evaluation Metrics**:
   - **String Similarity**: We use fuzzy string matching (Levenshtein distance) to quantify the lexical similarity between the chatbot's answer and the expected answer.
   - **Semantic Similarity**: We compute the cosine similarity between sentence embeddings (using a transformer model) to assess contextual alignment, enabling evaluation beyond surface-level word overlap.

The output includes per-question evaluation results, summary statistics such as overall accuracy and average response time, and automatically saved reports for both string-based and semantic evaluations.

This evaluation framework provides both quantitative insights and qualitative feedback on chatbot behavior, helping guide further training and refinement efforts.


### 👥 **Team Members**
- **Ana Farinha** *(Student Number: 20211514)*  
- **António Oliveira** *(Student Number: 20211595)*  
- **Mariana Neto** *(Student Number: 20211527)*  
- **Salvador Domingues** *(Student Number: 20240597)*  

📅 **Date:** *May 26, 2025*  
📍 **Prepared for:** *FidelidadeC*

---

# Table of Contents

1. [Import Libraries & Data](#1-import-libraries--data)  
2. [Evaluation](#2-evaluation)  
   2.1. [String Similarity](#21-string-similarity)  
   2.2. [Semantic Similarity](#22-semantic-similarity)  

# 1. Import Libraries & Data

In [None]:
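# Imports
from api import *
from sentence_transformers import SentenceTransformer
from eval_utils import *

# Explicit imports for names used directly in this notebook
import time
from fuzzywuzzy import fuzz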

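import logging
import warnings

warnings.filterwarnings("ignore", message="CropBox missing from /Page")
# Assumption: the CropBox notice is emitted through pdfminer's logger rather than
# the warnings module, so the logger level is raised as well.
logging.getLogger("pdfminer").setLevel(logging.ERROR)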

# 2. Evaluation

In this section, we extract question-and-answer (Q&A) pairs from a structured PDF document. These pairs will serve as the reference dataset for evaluating the performance of our chatbot.

The extraction process is designed to:
- Detect and separate questions (typically marked with a bullet point or other prefix),
- Identify corresponding answers (often marked with a specific prefix like "R:"),
- Clean and structure the text into a format suitable for further evaluation.

This automated extraction ensures consistency, reduces manual effort, and prepares our data for downstream comparison using both string similarity and semantic evaluation techniques.
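
The extraction rules listed above can be sketched as follows. This is a minimal illustration and not the actual `extract_qa_pairs_from_pdf` implementation from `eval_utils`: it assumes the PDF text has already been extracted into a single string, that each question starts with a `•` bullet, and that each answer starts with `R:`. The function name `parse_qa_text` and both default prefixes are hypothetical.

```python
import re


def parse_qa_text(text, question_prefix="•", answer_prefix="R:"):
    """Split raw text into (question, answer) tuples.

    Assumes each question starts with `question_prefix` and its answer
    starts with `answer_prefix`; both defaults are illustrative only.
    """
    pairs = []
    # Everything before the first question prefix is ignored.
    for chunk in text.split(question_prefix)[1:]:
        question, sep, answer = chunk.partition(answer_prefix)
        if not sep:
            continue  # no answer marker found, skip this chunk
        # Collapse line breaks and repeated spaces left over from PDF extraction.
        question = re.sub(r"\s+", " ", question).strip()
        answer = re.sub(r"\s+", " ", answer).strip()
        if question and answer:
            pairs.append((question, answer))
    return pairs


sample = "• What is covered?\nR: Fire and\ntheft.\n• Is there a fee?\nR: No."
print(parse_qa_text(sample))
```

Returning a list of `(question, answer)` tuples matches how `qa_pairs` is unpacked in the evaluation loop later in this notebook.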

In [5]:
qa_pairs = extract_qa_pairs_from_pdf('./docs/Documents for training and evaluation-20250507/Questions_Answers_Censored.pdf')

CropBox missing from /Page, defaulting to MediaBox


To understand how well our chatbot answers questions extracted from the PDF, we will evaluate its responses using several key metrics:

| **Metric**            | **What It Measures**                                                                 |
|-----------------------|--------------------------------------------------------------------------------------|
| **Accuracy**          | Percentage of chatbot answers that closely match the expected answer.               |
| **Similarity Score**  | Fuzzy match score (0–100) comparing chatbot’s answer to the expected answer.        |
| **Match (Yes/No)**    | Whether the answer is considered correct (based on a similarity threshold, e.g. 70).|
| **Response Time**     | Time (in seconds) taken by the chatbot to respond to a question.                    |
| **Average Response Time** | Mean of all response times; reflects the chatbot's overall speed and efficiency. |

These metrics help quantify both the correctness and responsiveness of the chatbot, guiding improvements and benchmarking.
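
As a rough sketch of how the aggregate metrics in the table relate to the per-question values, the snippet below computes accuracy and average response time from a list of result rows. The function name `summarize` and the dictionary keys `similarity` and `response_time` are assumptions made for this illustration; the notebook's own functions use different column names.

```python
def summarize(results, threshold=70):
    """Compute accuracy (%) and average response time from per-question rows.

    Each row is a dict with the hypothetical keys "similarity" (0-100)
    and "response_time" (seconds).
    """
    total = len(results)
    if total == 0:
        return {"accuracy": 0.0, "avg_response_time": 0.0}
    # A row counts as correct when its score reaches the threshold.
    correct = sum(1 for row in results if row["similarity"] >= threshold)
    total_time = sum(row["response_time"] for row in results)
    return {
        "accuracy": correct / total * 100,
        "avg_response_time": total_time / total,
    }


rows = [
    {"similarity": 85, "response_time": 2.0},
    {"similarity": 40, "response_time": 4.0},
]
print(summarize(rows))  # {'accuracy': 50.0, 'avg_response_time': 3.0}
```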

## 2.1 String Similarity

In this section, we assess how closely the chatbot's answers match the expected answers by measuring their textual similarity. We use fuzzy string matching, which tolerates minor differences such as typos, casing, or formatting variations.

The main tool used is the **FuzzyWuzzy** library, which calculates a similarity score between 0 and 100 based on the Levenshtein distance. Higher scores indicate greater similarity.

By setting a similarity threshold, we classify chatbot answers as correct or incorrect. This method provides a straightforward, interpretable way to evaluate answer quality based purely on text overlap.
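
The scoring and thresholding idea can be illustrated with the standard library alone. The sketch below uses `difflib.SequenceMatcher` as a stand-in for FuzzyWuzzy, so its scores are not guaranteed to equal those of `fuzz.ratio`; the helper names `similarity_score` and `is_match` are hypothetical.

```python
from difflib import SequenceMatcher


def similarity_score(answer, expected):
    """Return a case-insensitive similarity score on a 0-100 scale."""
    ratio = SequenceMatcher(None, answer.lower(), expected.lower()).ratio()
    return round(ratio * 100)


def is_match(answer, expected, threshold=70):
    """Classify an answer as correct when its score reaches the threshold."""
    return similarity_score(answer, expected) >= threshold


# Same text with different casing scores 100.
print(similarity_score("Capital is guaranteed.", "capital is guaranteed."))
# A paraphrase shares few characters, so its score is much lower.
print(similarity_score("Your money is fully protected.", "Capital is guaranteed."))
```

Note that the threshold lives on the same 0 to 100 scale as the score, so a value such as 70 is meaningful while 0.7 would accept almost any answer.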

In [12]:
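def evaluate_string_and_save(qa_pairs, threshold=70):
    """Query the chatbot for each question, score the answers with fuzz.ratio, and save the results.

    `threshold` is expressed on fuzz.ratio's 0-100 scale.
    """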
    total = len(qa_pairs)
    correct = 0
    total_time = 0.0

    results = []

    for idx, (question, expected) in enumerate(qa_pairs, 1):
        print(f"\n--- Q{idx} ---")
        print("Question:", question)

        start = time.time()
        answer = query_assistant(question)
        response_time = time.time() - start
        total_time += response_time

        # fuzz.ratio returns an integer score between 0 and 100
        similarity = fuzz.ratio(answer.lower(), expected.lower())
        match = "Yes" if similarity >= threshold else "No"
        if match == "Yes":
            correct += 1

        print()
        print("Answer:", answer)
        print()
        print("Expected:", expected)
        print()
        print(f"Similarity: {similarity}, Match: {match}, Time: {response_time:.2f}s")

        results.append({
            "Question": question,
            "Expected Answer": expected,
            "Chatbot Answer": answer,
            "Similarity": similarity,
            "Match": match,
            "Response Time (s)": round(response_time, 2)
        })

    accuracy = (correct / total * 100) if total else 0
    avg_response = (total_time / total) if total else 0

    # Save to CSV
    save_results_seq(results, folder="eval_string", base_filename="eval_results")


    print("\n===== Summary =====")
    print(f"Total: {total}")
    print(f"Correct: {correct}")
    print(f"Accuracy: {accuracy:.2f}%")
    print(f"Average Response Time: {avg_response:.2f}s")

In [13]:
# --- Run Evaluation ---
evaluate_string_and_save(
    qa_pairs,
    threshold=70
)



--- Q1 ---
Question: O Fidelidade Savings é mais indicado para que perfil de risco?

Answer: O produto **Fidelidade My Savings** é mais indicado para investidores com um perfil de risco **baixo a moderado**. Este produto é ideal para aqueles que procuram uma forma de poupança segura, com a possibilidade de obter rendimentos, mas que não desejam expor-se a riscos elevados associados a investimentos mais voláteis.

Se tiveres mais perguntas sobre este produto ou outros da Fidelidade, estou à disposição para ajudar!

Expected: o Fidelidade Savings é adequado para clientes conservadores, através da sua opção Seguro, com capital e rendimento garantido. Para clientes que gostam moderadamente do risco, através da opção Proteção com garantia de 90% do capital investido no vencimento e exposição ao Índice Multiativos. Para clientes propensos ao risco através da opção Dinâmico com 100% de exposição ao Índice Multiativos. Também está disponível a opção ESG, que investe no índice Multiativos ESG,

#### Why String Similarity Evaluation Can Be Problematic

Evaluating chatbot or assistant answers using **string similarity** methods can be unreliable for several reasons:

- **Surface-Level Matching:** String similarity compares the literal characters or tokens in the text. It does not understand the *meaning* behind the answer. Two answers can be semantically equivalent but phrased very differently, leading to low similarity scores.

- **Sensitive to Wording Differences:** Changes in word order, synonyms, answer length, or added courtesy phrases lower the similarity score, so correct answers can be unfairly marked as wrong.

- **Ignores Context and Paraphrasing:** Good language models often rephrase answers or use synonyms, which string similarity methods might penalize despite the answer being correct.

Because of these limitations, string similarity can produce misleading evaluation results, especially in more complex or natural language scenarios.

---

#### Next Step: Evaluating with Semantic Similarity

To better capture the *meaning* behind responses, we will move to **semantic similarity** evaluation. This approach uses embeddings from models like Sentence Transformers to compare the *contextual meaning* of two sentences, rather than just their surface forms.

Semantic similarity can:

- Recognize paraphrases and synonyms
- Be more robust to minor wording differences
- Provide a more accurate measure of whether the chatbot's answer truly matches the expected answer

Next, we will implement and test semantic similarity evaluation using pre-trained sentence embedding models.


## 2.2 Semantic Similarity

In this section, we assess chatbot answers by measuring the similarity of their underlying meaning rather than exact wording. Using sentence embeddings and cosine similarity, this approach provides a deeper understanding of response quality—especially useful when answers are rephrased or use different vocabulary.

This evaluation complements string-based methods and offers a more flexible way to judge chatbot performance.
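
To make the scoring step concrete, here is a minimal sketch of cosine similarity written in plain Python. The toy three-dimensional vectors are invented for illustration; in the notebook the vectors would be sentence embeddings, for example from `model.encode(...)`. The helper name `cosine_similarity` is hypothetical and is not claimed to be what `evaluate_semantic_and_save` uses internally.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


# Toy "embeddings": the second vector points in the same direction as the first.
answer_vec = [1.0, 2.0, 0.0]
expected_vec = [2.0, 4.0, 0.0]
print(cosine_similarity(answer_vec, expected_vec))
```

Cosine similarity ranges from -1 to 1, so a threshold such as 0.7 is expressed on that scale, unlike the 0 to 100 scale of the fuzzy string score.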

In [None]:
# Load the model ONCE here (in your notebook/session)
model = SentenceTransformer('all-mpnet-base-v2')

# Call your evaluation function, passing the loaded model and the save function
evaluate_semantic_and_save(
    qa_pairs=qa_pairs,
    model=model,
    threshold=0.7,
    save_results_seq=save_results_seq
)



--- Q1 ---
Question: O Fidelidade Savings é mais indicado para que perfil de risco?
Answer: O produto **Fidelidade My Savings** é mais indicado para investidores com um perfil de risco **conservador**. Este perfil é caracterizado por uma aversão a perdas e uma preferência por investimentos que ofereçam maior segurança e estabilidade, mesmo que isso signifique um potencial de retorno mais baixo.

Se tiveres mais perguntas sobre o produto ou sobre como ele se compara a outros no mercado, estou à disposição para ajudar!
Expected: o Fidelidade Savings é adequado para clientes conservadores, através da sua opção Seguro, com capital e rendimento garantido. Para clientes que gostam moderadamente do risco, através da opção Proteção com garantia de 90% do capital investido no vencimento e exposição ao Índice Multiativos. Para clientes propensos ao risco através da opção Dinâmico com 100% de exposição ao Índice Multiativos. Também está disponível a opção ESG, que investe no índice Multiativos E

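**Warning:** To draw reliable conclusions from these results, the evaluation should be run at least 15 times, because the chatbot's answers vary between runs (compare the two Q1 answers above).

**Open question:** Would a different embedding model give better results for the semantic evaluation?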
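
Repeated runs could be summarized along the following lines. This is only a sketch: it assumes the evaluation function is adapted to return its accuracy (the versions in this notebook print it instead), and the helper name `aggregate_runs` and the example numbers are made up for illustration.

```python
import statistics


def aggregate_runs(accuracies):
    """Summarize accuracy values (in %) collected from repeated evaluation runs."""
    if not accuracies:
        raise ValueError("need at least one run")
    return {
        "runs": len(accuracies),
        "mean": statistics.mean(accuracies),
        # The sample standard deviation needs at least two data points.
        "stdev": statistics.stdev(accuracies) if len(accuracies) > 1 else 0.0,
        "min": min(accuracies),
        "max": max(accuracies),
    }


# Made-up accuracies purely for illustration, not real results.
example = [60.0, 70.0, 80.0]
print(aggregate_runs(example))
```

Reporting the spread alongside the mean shows how much of the measured accuracy is due to run-to-run variation in the chatbot's answers.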