### Index
1. [Instructions](#instructions)
2. [Introduction on Evaluation](#introduction)
3. [Library Imports](#library_imports)
4. [FAQ Preparation](#faq_preparation)
5. [Answer Generation](#answer_generation)
6. [ChatBot Response Evaluation](#chatbot_evaluation)
    - [Latency](#latency) 
    - [Answer Similarity](#answer_similarity)
    - [Faithfulness](#faithfullness)
    - [Relevance](#relevance)
    - [Answer Correctness](#answer_correctness)
7. [Context Evaluation](#context_evaluation)
    - [Entity Recall](#entity_recall)
8. [Proposed Methods of Improvement](#methods_of_improvement)



## ⚠️ Instructions before running this notebook
<a id='instructions'></a>

- Paste your Hugging Face API token in the [.env](./.env) file.
- Install the [requirements](./requirements.txt) using `pip install -r requirements.txt`.
- Run the command `python -m spacy download en_core_web_sm`. This downloads the small English model (`en_core_web_sm`) that the spaCy library uses in this notebook to extract entities.

# Introduction
<a id='introduction'></a>

For evaluating the responses of our ChatBot, I have established an evaluation mechanism centered around four key metrics:
- **Answer Similarity**
- **Faithfulness**
- **Relevance**
- **Answer Correctness**

**How this works:**
- I have created an LLM Evaluator for each metric, incorporating specific evaluation criteria and a scoring rubric ranging from **1 to 5** (where **1** denotes the lowest score and **5** denotes the highest score).
- These criteria and the scoring rubric are provided to an Evaluator LLM along with the generated response and the retrieved context from which the response was generated. After processing the input, the Evaluator LLM outputs:

    1. A numeric score between **1 and 5**
    2. Feedback explaining why this particular score was given

Each metric and its scoring criteria are defined in detail later in the report.
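
As a rough illustration of the flow described above, the sketch below assembles an evaluator prompt from a rubric and parses a reply into a score and feedback. The prompt layout, the `Score:`/`Feedback:` reply format, and the names `build_prompt` and `parse_evaluation` are assumptions made for this example; the actual evaluators in `metrics/` may be structured differently.

```python
import re

# Hypothetical rubric used only for this illustration.
RUBRIC = {
    1: "Lowest quality for this metric.",
    5: "Highest quality for this metric.",
}

def build_prompt(criteria, rubric, response, context):
    """Assemble the text sent to the Evaluator LLM (assumed layout)."""
    rubric_lines = "\n".join(
        f"Score {score}: {description}" for score, description in sorted(rubric.items())
    )
    return (
        f"Criteria: {criteria}\n"
        f"Rubric:\n{rubric_lines}\n"
        f"Context: {context}\n"
        f"Response: {response}\n"
        "Reply as 'Score: <1-5>' followed by 'Feedback: <reason>'."
    )

def parse_evaluation(raw):
    """Extract (score, feedback) from a reply; raise if no score in 1..5 is found."""
    score_match = re.search(r"Score:\s*([1-5])\b", raw)
    if score_match is None:
        raise ValueError("no score between 1 and 5 found")
    feedback_match = re.search(r"Feedback:\s*(.*)", raw, flags=re.DOTALL)
    feedback = feedback_match.group(1).strip() if feedback_match else ""
    return int(score_match.group(1)), feedback
```

Rejecting replies without a valid score, rather than defaulting to a value, keeps malformed evaluator output from silently skewing the averages.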

## Library Imports
<a id='library_imports'></a>

In [1]:
import os
import time
import warnings

# Suppress library warnings to keep the notebook output readable
warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from tqdm.auto import tqdm

## Preparing Questions to Ask the ChatBot
<a id='faq_preparation'></a>
Here I extract a few questions from the data to ask the ChatBot.

In [2]:
# Splitting the questions to be asked to chatbot
aws_faq = pd.read_csv('../aws_faqs.csv')
X_train, X_test = train_test_split(aws_faq, test_size=0.02, random_state=42)
aws_faq = X_test

## Generate Answers using Chatbot
<a id='answer_generation'></a>
Methodology:
- Send each question from the data frame to the ChatBot.
- The ChatBot generates an answer for each question and returns it together with the retrieved documents.
- Create a data frame named `chatbot_response_df` containing the generated answers and their relevant docs from the vector store.

In [3]:
# Importing the chatbot
from aws_faq_chatbot import AwsFaqChatBot
chatbot = AwsFaqChatBot()

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to C:\Users\tapas\.cache\huggingface\token
Login successful



In [4]:
# Generating answers and storing them in the dataframe
chatbot_output = []
for row in tqdm(aws_faq.iterrows()):
    start_time = time.time()
    answer, relevant_docs = chatbot.ask(row[1]['question'])
    end_time = time.time()

    chatbot_output.append({
        'question': row[1]['question'],
        'generated_answer': answer, 
        'context': relevant_docs,
        'latency(seconds)': round((end_time - start_time), 2)})

chatbot_response_df = pd.DataFrame(chatbot_output)

13it [00:37,  2.92s/it]


## ChatBot Response Evaluation
<a id='chatbot_evaluation'></a>

The following 5 metrics are used to evaluate the ChatBot responses:
- Latency
- Answer Similarity
- Faithfulness
- Relevance
- Answer Correctness

### Latency <a id='latency'></a>

In [5]:
# Restricting to only 5 outputs for the sake of readability
for row in chatbot_response_df.head().iterrows():
    print(f"Question: {row[1]['question']}")
    print(f"Answer: {row[1]['generated_answer']}")
    print(f"Context: {row[1]['context']}")
    print(f"Latency(seconds): {row[1]['latency(seconds)']}")
    print('===================================================\n')

Question: Q. Can I tag a Spot Fleet request?
Answer:  Yes, you can tag a Spot Fleet request when you create it. However, the Spot Fleet itself cannot be tagged.
Context: ['You can request to launch Spot Instances with tags via Spot Fleet. The Fleet by itself cannot be tagged.']
Latency(seconds): 3.18

Question: Q: Where can I learn more about EFS?
Answer: 
    You can learn more about Amazon EFS by visiting the Amazon EFS FAQ page.
Context: ['You can visit the Amazon EFS FAQ pageAmazon EFS FAQ page.']
Latency(seconds): 1.43

Question: Q: Do RIs provide a capacity reservation?
Answer:  A: Yes, RIs do provide a capacity reservation. As per the context, to take advantage of the capacity reservation, one should buy a Reserved Instance (RI) in a specific Availability Zone (AZ).
Context: ['If you want to take advantage of the capacity reservation, then you should buy an RI in a specific AZ.']
Latency(seconds): 3.89

Question: Q: What happens to my data when a system terminates?
Answer: 
    

In [6]:
# calculating the overall latency of the ChatBot
chatbot_response_df['latency(seconds)'] = chatbot_response_df['latency(seconds)'].astype(float)

print("Overall Latency of the ChatBot:"
f" {round(np.mean(chatbot_response_df['latency(seconds)']), 2)} Seconds")

Overall Latency of the ChatBot: 2.92 Seconds
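
A mean alone can hide slow outliers. As a small sketch, assuming only the Python standard library, the latency column could also be summarized with its median and maximum; `summarize_latency` is a hypothetical helper, and the input values are the three latencies printed in full above, used purely for illustration.

```python
import statistics

def summarize_latency(latencies):
    """Return mean, median and maximum latency in seconds, rounded to 2 places."""
    return {
        "mean": round(statistics.mean(latencies), 2),
        "median": round(statistics.median(latencies), 2),
        "max": round(max(latencies), 2),
    }

# Illustrative input only: three of the per-question latencies shown earlier.
summary = summarize_latency([3.18, 1.43, 3.89])
print(summary)
```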


In [7]:
# Saving the generated responses to a .csv file
chatbot_response_df.to_csv('./scores/chatbot_response.csv', index=False)

### Answer Similarity
<a id='answer_similarity'></a>
Answer Similarity answers the question: __Is the response semantically similar to the reference answer?__

Answer similarity is scored based on how closely the output's meaning matches the target, with higher scores for greater alignment.

__Scoring Criteria__
- _Score __1__:_ The response has little to no semantic similarity to the reference answer.
- _Score __2__:_ The response displays partial semantic similarity to the reference answer on some aspects.
- _Score __3__:_ The response has moderate semantic similarity to the reference answer.
- _Score __4__:_ The response aligns with the reference answer in most aspects and has substantial semantic similarity.
- _Score __5__:_ The response closely aligns with the reference answer in all significant aspects.

In [9]:
from metrics.answer_similarity import AnswerSimilarity

ans_similarity = AnswerSimilarity()
answer_similarity_scores = ans_similarity.evaluate(chatbot_response_df)

13it [00:50,  3.86s/it]


#### Similarity Score result

In [10]:
# Restricting to only 2 outputs for the sake of readability
for row in answer_similarity_scores.head(2).iterrows():
    print(f"Question: {row[1]['question']}")
    print(f"Answer: {row[1]['generated_answer']}")
    print(f"Context: {row[1]['context']}")
    print(f"Similarity Score: {row[1]['similarity_score']}")
    print(f"{row[1]['similarity_score_feedback']}")
    print('===================================================\n')

Question: Q. Can I tag a Spot Fleet request?
Answer:  Yes, you can tag a Spot Fleet request when you create it. However, the Spot Fleet itself cannot be tagged.
Context: ['You can request to launch Spot Instances with tags via Spot Fleet. The Fleet by itself cannot be tagged.']
Similarity Score: 5
Feedback: The response correctly identifies that Spot Fleet requests can be tagged during creation, and also clarifies that the Spot Fleet itself cannot be tagged. This is semantically similar to the reference answer, which states that Spot Instances can be launched with tags via Spot Fleet, and that the Fleet itself cannot be tagged.

Question: Q: Where can I learn more about EFS?
Answer: 
    You can learn more about Amazon EFS by visiting the Amazon EFS FAQ page.
Context: ['You can visit the Amazon EFS FAQ pageAmazon EFS FAQ page.']
Similarity Score: 5
Feedback: The provided response, 'You can learn more about Amazon EFS by visiting the Amazon EFS FAQ page.', has a high degree of semantic 

In [11]:
# calculating the overall Similarity score of the ChatBot
answer_similarity_scores['similarity_score'] = answer_similarity_scores['similarity_score'].astype(int)

print("Overall Similarity Score Rating of the ChatBot:"
f" {round(np.mean(answer_similarity_scores['similarity_score']), 2)}")

Overall Similarity Score Rating of the ChatBot: 4.08


In [12]:
# Saving similarity scores to a .csv file
answer_similarity_scores.to_csv('./scores/answer_similarity_scores.csv', index=False)

### Faithfulness
<a id='faithfullness'></a>
Faithfulness answers the question: __Does the response accurately reflect the given context?__

Faithfulness checks if the response is factually consistent with the context.

__Scoring Criteria__
- _Score __1__:_ None of the claims in the response can be inferred from the provided context.
- _Score __2__:_ Some of the claims in the response can be inferred from the provided context, but the majority of the response is missing from, inconsistent with, or contradictory to the provided context.
- _Score __3__:_ Half or more of the claims in the response can be inferred from the provided context.
- _Score __4__:_ Most of the claims in the response can be inferred from the provided context, with very little information that is not directly supported by the provided context.
- _Score __5__:_ All of the claims in the response are directly supported by the provided context, demonstrating high faithfulness to the provided context.
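
To make the rubric concrete, the sketch below maps a count of supported claims onto the five levels. This is an illustration only: in this notebook the Evaluator LLM assigns the score directly, and the function name `faithfulness_score` and the 0.8 cut-off used for "most" are assumptions, not part of the evaluator.

```python
def faithfulness_score(supported_claims, total_claims, most_cutoff=0.8):
    """Map the share of context-supported claims to a 1-5 rubric score.

    The 0.8 cut-off for "most claims" is an assumed value for illustration.
    """
    if total_claims <= 0:
        raise ValueError("total_claims must be positive")
    if not 0 <= supported_claims <= total_claims:
        raise ValueError("supported_claims must be between 0 and total_claims")

    fraction = supported_claims / total_claims
    if supported_claims == 0:
        return 1  # none of the claims can be inferred from the context
    if supported_claims == total_claims:
        return 5  # every claim is supported
    if fraction >= most_cutoff:
        return 4  # most claims are supported
    if fraction >= 0.5:
        return 3  # half or more are supported
    return 2      # some are supported, but the majority are not
```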

In [13]:
from metrics.faithfullness import Faithfullness

faithfullness = Faithfullness()
faithfullness_scores = faithfullness.evaluate(chatbot_response_df)

13it [01:01,  4.75s/it]


#### Faithfulness result

In [14]:
# Restricting to only 2 outputs for the sake of readability
for row in faithfullness_scores.head(2).iterrows():
    print(f"Question: {row[1]['question']}")
    print(f"Answer: {row[1]['generated_answer']}")
    print(f"Context: {row[1]['context']}")
    print(f"Faithfullness Score: {row[1]['faithfullness_score']}")
    print(f"{row[1]['faithfullness_score_feedback']}")
    print('===================================================\n')

Question: Q. Can I tag a Spot Fleet request?
Answer:  Yes, you can tag a Spot Fleet request when you create it. However, the Spot Fleet itself cannot be tagged.
Context: ['You can request to launch Spot Instances with tags via Spot Fleet. The Fleet by itself cannot be tagged.']
Faithfullness Score: 5
Feedback: The response correctly identifies that Spot Fleet requests can be tagged when created, and correctly identifies that the Spot Fleet itself cannot be tagged. These claims are directly supported by the context.

Question: Q: Where can I learn more about EFS?
Answer: 
    You can learn more about Amazon EFS by visiting the Amazon EFS FAQ page.
Context: ['You can visit the Amazon EFS FAQ pageAmazon EFS FAQ page.']
Faithfullness Score: 5
Feedback: The provided response accurately directs the user to learn more about Amazon EFS by visiting the Amazon EFS FAQ page, which is directly supported by the provided context.



In [15]:
# Calculating the overall Faithfulness score of the ChatBot
faithfullness_scores['faithfullness_score'] = faithfullness_scores['faithfullness_score'].astype(int)

print("Overall Faithfullness Score Rating of the ChatBot:"
f" {round(np.mean(faithfullness_scores['faithfullness_score']), 2)}")

Overall Faithfullness Score Rating of the ChatBot: 4.31


In [16]:
# Saving faithfulness scores to a .csv file
faithfullness_scores.to_csv('./scores/faithfullness_scores.csv', index=False)

### Relevance
<a id='relevance'></a>
Relevance answers the question: __Does the response address the given question and context?__

Relevance measures how well the response addresses the question and context. Scores reflect how directly the output answers the input question within the context.

__Scoring Criteria__
- _Score __1__:_ The response doesn't mention anything about the question or is completely irrelevant to the provided context.
- _Score __2__:_ The response provides some relevance to the question and is somehow related to the provided context.
- _Score __3__:_ The response mostly answers the question and is largely consistent with the provided context.
- _Score __4__:_ The response answers the question and is consistent with the provided context.
- _Score __5__:_ The response answers the question comprehensively using the provided context.

In [17]:
from metrics.relevance import Relevance

relevance = Relevance()
relevance_scores = relevance.evaluate(chatbot_response_df)

13it [01:12,  5.60s/it]


#### Relevance result

In [19]:
# Restricting to only 2 outputs for the sake of readability
for row in relevance_scores.head(2).iterrows():
    print(f"Question: {row[1]['question']}")
    print(f"Answer: {row[1]['generated_answer']}")
    print(f"Context: {row[1]['context']}")
    print(f"Relevance Score: {row[1]['relevance_score']}")
    print(f"{row[1]['relevance_score_feedback']}")
    print('===================================================\n')

Question: Q. Can I tag a Spot Fleet request?
Answer:  Yes, you can tag a Spot Fleet request when you create it. However, the Spot Fleet itself cannot be tagged.
Context: ['You can request to launch Spot Instances with tags via Spot Fleet. The Fleet by itself cannot be tagged.']
Relevance Score: 5
Feedback: The response directly addresses the question and is consistent with the provided context. The context states that Spot Fleet cannot be tagged, and the response confirms this by stating that you can tag a Spot Fleet request when you create it, but the Spot Fleet itself cannot be tagged.

Question: Q: Where can I learn more about EFS?
Answer: 
    You can learn more about Amazon EFS by visiting the Amazon EFS FAQ page.
Context: ['You can visit the Amazon EFS FAQ pageAmazon EFS FAQ page.']
Relevance Score: 4
Feedback: The response directly addresses the question by suggesting a place to learn more about EFS. The context supports this by explicitly mentioning the Amazon EFS FAQ page. How

In [20]:
# calculating the overall Relevance Scores of the ChatBot
relevance_scores['relevance_score'] = relevance_scores['relevance_score'].astype(int)

print("Overall Relevance Scores Rating of the ChatBot:"
f" {round(np.mean(relevance_scores['relevance_score']), 2)}")

Overall Relevance Scores Rating of the ChatBot: 4.54


In [21]:
# Saving relevance scores to a .csv file
relevance_scores.to_csv('./scores/relevance_scores.csv', index=False)

### Answer Correctness
<a id='answer_correctness'></a>
Answer Correctness answers the question: __Is the response factually correct with respect to the target answer?__

Answer correctness is evaluated based on how accurately the response matches the target. Scores reflect the degree of semantic similarity and factual correctness, with higher scores indicating greater accuracy.

__Scoring Criteria__
- _Score __1__:_ The response is completely incorrect. It is completely different from or contradicts the provided target.
- _Score __2__:_ The response demonstrates some degree of semantic similarity and includes partially correct information. However, the response still has significant discrepancies with the provided target or inaccuracies.
- _Score __3__:_ The response addresses a couple of aspects of the input accurately, aligning with the provided target. However, there are still omissions or minor inaccuracies.
- _Score __4__:_ The response is mostly correct. It provides mostly accurate information, but there may be one or more minor omissions or inaccuracies.
- _Score __5__:_ The response is correct. It demonstrates a high degree of accuracy and semantic similarity to the target.

In [22]:
from metrics.answer_correctness import AnswerCorrectness

answer_correctness = AnswerCorrectness()
answer_correctness_scores = answer_correctness.evaluate(chatbot_response_df)

13it [02:00,  9.26s/it]


#### Answer Correctness result

In [23]:
# Restricting to only 2 outputs for the sake of readability
for row in answer_correctness_scores.head(2).iterrows():
    print(f"Question: {row[1]['question']}")
    print(f"Answer: {row[1]['generated_answer']}")
    print(f"Context: {row[1]['context']}")
    print(f"Correctness Score: {row[1]['correctness_score']}")
    print(f"{row[1]['correctness_score_feedback']}")
    print('===================================================\n')

Question: Q. Can I tag a Spot Fleet request?
Answer:  Yes, you can tag a Spot Fleet request when you create it. However, the Spot Fleet itself cannot be tagged.
Context: ['You can request to launch Spot Instances with tags via Spot Fleet. The Fleet by itself cannot be tagged.']
Correctness Score: 4
Feedback: The response correctly explains that a Spot Fleet request can be tagged during creation, and clarifies that the Spot Fleet itself cannot be tagged, which is in line with the target. However, the response could have been more detailed in explaining the concept of tagging Spot Fleet requests.

Question: Q: Where can I learn more about EFS?
Answer: 
    You can learn more about Amazon EFS by visiting the Amazon EFS FAQ page.
Context: ['You can visit the Amazon EFS FAQ pageAmazon EFS FAQ page.']
Correctness Score: 4
Feedback: The response is relevant to the given question and it includes the correct information. It suggests visiting the Amazon EFS FAQ page which aligns with the provide

In [24]:
# Calculating the overall Answer Correctness score of the ChatBot
answer_correctness_scores['correctness_score'] = answer_correctness_scores['correctness_score'].astype(int)

print("Overall Correctness Score Rating of the ChatBot:"
f" {round(np.mean(answer_correctness_scores['correctness_score']), 2)}")

Overall Correctness Score Rating of the ChatBot: 3.85


In [25]:
# Saving answer correctness scores to a .csv file
answer_correctness_scores.to_csv('./scores/answer_correctness_scores.csv', index=False)

## Context Evaluation
<a id='context_evaluation'></a>

To assess the accuracy and relevance of the contexts retrieved for the ChatBot responses, I use the **Entity Recall** metric.

### Entity Recall
<a id='entity_recall'></a>

**Entity Recall** measures the system's ability to correctly recall all relevant entities within the context compared to a set of reference entities. This metric evaluates whether the system can identify and retrieve all key entities necessary for a comprehensive understanding of the query context.

#### Methodology

- **Extract Entities**: Use a Named Entity Recognition (NER) tool such as **spaCy** or **NLTK** (this notebook uses spaCy) to extract entities from both the retrieved context and the generated answer.
- **Compare Entities**: Compare the extracted entities from the retrieved context with those from the generated answer.
- **Calculate Entity Recall**: Calculate the recall as the number of context entities that also appear in the generated answer, divided by the total number of entities extracted from the retrieved context.

In [26]:
import spacy
nlp = spacy.load("en_core_web_sm")

def extract_entities(text):
    doc = nlp(text)
    return [ent.text for ent in doc.ents]

def entity_recall(true_entities, retrieved_entities):
    true_positive = len(set(true_entities) & set(retrieved_entities))
    recall = true_positive / len(true_entities) if true_entities else 0
    return recall
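
Note that `entity_recall` above counts matches over unique entities but divides by the length of the raw list, so an entity that spaCy extracts twice from the context lowers the score. The sketch below is a hypothetical variant, `entity_recall_unique`, that deduplicates both sides first; the entity lists are made up for illustration and are not actual spaCy output.

```python
def entity_recall_unique(reference_entities, candidate_entities):
    """Recall over unique entities: |reference ∩ candidate| / |reference|."""
    reference = set(reference_entities)
    candidate = set(candidate_entities)
    if not reference:
        return 0.0  # nothing to recall, mirrors the fallback used above
    return len(reference & candidate) / len(reference)

# Hypothetical entity lists: "Amazon EFS" appears twice in the context.
context_entities = ["Amazon EFS", "FAQ", "Amazon EFS"]
answer_entities = ["Amazon EFS"]

print(entity_recall_unique(context_entities, answer_entities))  # 0.5
```

With the same inputs, dividing by the raw list length would give 1/3 instead of 1/2.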

In [27]:
entity_recall_output = []
for row in tqdm(chatbot_response_df.iterrows()):    
    generated_answer_entities = extract_entities(row[1]['generated_answer'])
    context_entities = extract_entities(row[1]['context'][0])

    entity_recall_output.append(entity_recall(context_entities, generated_answer_entities))

entity_recall_scores = pd.concat([pd.DataFrame({'entity_recall_score': entity_recall_output}), chatbot_response_df], axis=1)

13it [00:01, 12.27it/s]


#### Entity Recall result

In [28]:
# Restricting to only 2 outputs for the sake of readability
for row in entity_recall_scores.head(2).iterrows():
    print(f"Question: {row[1]['question']}")
    print(f"Answer: {row[1]['generated_answer']}")
    print(f"Context: {row[1]['context']}")
    print(f"Entity Recall Score: {round(row[1]['entity_recall_score'], 2)}")
    print('===================================================\n')

Question: Q. Can I tag a Spot Fleet request?
Answer:  Yes, you can tag a Spot Fleet request when you create it. However, the Spot Fleet itself cannot be tagged.
Context: ['You can request to launch Spot Instances with tags via Spot Fleet. The Fleet by itself cannot be tagged.']
Entity Recall Score: 0.5

Question: Q: Where can I learn more about EFS?
Answer: 
    You can learn more about Amazon EFS by visiting the Amazon EFS FAQ page.
Context: ['You can visit the Amazon EFS FAQ pageAmazon EFS FAQ page.']
Entity Recall Score: 0.67



In [30]:
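# Calculating the overall Entity Recall score of the ChatBot
# NOTE: astype(int) truncates fractional recall values (e.g. 0.5) to 0, so the
# mean below only counts rows with a perfect recall of 1.0; cast with
# astype(float) instead to average the actual recall values.
entity_recall_scores['entity_recall_score'] = entity_recall_scores['entity_recall_score'].astype(int)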

print("Overall Entity Recall Score of the ChatBot:"
f" {round(np.mean(entity_recall_scores['entity_recall_score']), 2)}")

Overall Entity Recall Score of the ChatBot: 0.08


In [31]:
# Saving entity recall scores to a .csv file
entity_recall_scores.to_csv('./scores/entity_recall_scores.csv', index=False)

## Methods for Improving ChatBot Responses <a id='methods_of_improvement'></a>

Currently, the ChatBot performs well and provides relevant responses based on the FAQs. However, several improvements can be made to enhance its response quality. These improvements are as follows:

1. **Improved Chunking and Text Splitting**
    - _Current Approach_: The entire answer is converted to embeddings and stored in the vector store.
    - _Improvement_: Utilize more sophisticated chunking and text splitting techniques to ensure meaningful segments are created for better embedding and retrieval.
2. **Use Different Vector Stores for Context Retrieval**
    - _Current Approach_: The ChatBot currently uses Chroma DB.
    - _Improvement_: Experiment with different vector stores such as Pinecone, Weaviate, FAISS, etc., to optimize context retrieval and potentially improve performance.
3. **User Feedback on the Response**
    - _Current Approach_: No feedback mechanism is in place.
    - _Improvement_: Integrate a feedback module to record users' feedback on the generated responses, allowing for continuous learning and improvement based on real user interactions.
4. **Using Different LLMs for Better Response Generation and Evaluation Metrics**
    - _Current Approach_: The ChatBot uses Mixtral-8x7B-Instruct-v0.1.
    - _Improvement_: Experiment with different models such as Meta's Llama, Google's Gemma, and OpenAI's models to enhance response generation and evaluation metrics.
5. **Improved Irrelevant Query Handling**
    - _Current Approach_: There is no specific mechanism for handling out-of-context queries.
    - _Improvement_: Implement a separate LLM specifically for filtering out and processing out-of-context queries to ensure the ChatBot handles irrelevant queries more effectively.
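
Item 1 above proposes splitting answers into smaller segments before embedding. One simple strategy, sketched here under stated assumptions, is a fixed-size word window with overlap between neighbouring chunks. The function `chunk_text` and its `chunk_size` and `overlap` defaults are hypothetical and not part of the ChatBot's current code.

```python
def chunk_text(text, chunk_size=50, overlap=10):
    """Split text into word-based chunks that share `overlap` words with their neighbour."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks

# Small demonstration with a 4-word window and 1 word of overlap.
print(chunk_text("a b c d e f g h i j", chunk_size=4, overlap=1))
# ['a b c d', 'd e f g', 'g h i j']
```

The overlap keeps a sentence that straddles a chunk boundary retrievable from at least one chunk, at the cost of storing some words twice.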