# QA Evaluation Metrics

Question-answering (QA) large language models (LLMs) are typically evaluated with a variety of metrics that assess the accuracy and effectiveness of their answers. The sections below cover some of the most common ones.



In [25]:
# Import packages

from PyPDF2 import PdfReader
import pandas as pd

import re
import warnings
from sklearn.metrics import f1_score
import nltk
from rouge import Rouge
from jiwer import wer
from evaluate import load

from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import Chroma
from langchain.docstore.document import Document
from langchain.prompts import PromptTemplate
from langchain.chains.question_answering import load_qa_chain
from langchain.llms import OpenAI
from langchain.document_loaders import TextLoader, DirectoryLoader
from langchain.chains.qa_with_sources import load_qa_with_sources_chain
from langchain.evaluation.qa import QAEvalChain
from dotenv import load_dotenv, find_dotenv
import os

from langchain.chains import LLMChain

warnings.filterwarnings("ignore")

In [11]:
load_dotenv(find_dotenv("credentials.env"), override=True)

True

**Exact Match (EM)**

The Exact Match (EM) metric is a binary metric for evaluating a question-answering system. It measures whether the answer generated by the LLM exactly matches the correct answer, here after a light normalization (lowercasing, stripping punctuation, and collapsing whitespace). If the two are identical, the EM score is 1; otherwise, the EM score is 0.

In [12]:
def normalize_answer(text):
    # Lowercase and replace non-word characters (anything other than letters, digits, and underscores) with spaces
    text = re.sub(r"\W", " ", text.lower())
    # Remove leading and trailing whitespace
    text = text.strip()
    # Collapse multiple whitespace into a single whitespace
    text = re.sub(r"\s+", " ", text)
    return text

def exact_match_score(prediction, ground_truth):
    return int(normalize_answer(prediction) == normalize_answer(ground_truth))

# Example usage
prediction = "I am going to school"
ground_truth = "I am going to college"
em_score = exact_match_score(prediction, ground_truth)
print(f"Exact Match score: {em_score}")

Exact Match score: 0


The EM metric is a simple and straightforward way to evaluate a question-answering system, and it is often used alongside other metrics, such as the F1 score, for a more comprehensive picture. Its main limitation is strictness: it gives no credit for partial matches or for valid variations of the correct answer.
For example, if the correct answer is "I am going to college" and the generated answer is "I am going to school", the EM score is 0 even though four of the five words match.
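
One common way to soften EM's strictness toward valid variations is to accept several reference answers and count a match against any of them. The following is a minimal sketch of that idea; `exact_match_any` and the sample references are hypothetical, introduced only for illustration.

```python
import re

def normalize_answer(text):
    # Same normalization as above: lowercase, drop punctuation, collapse whitespace
    text = re.sub(r"\W", " ", text.lower())
    return re.sub(r"\s+", " ", text).strip()

def exact_match_any(prediction, ground_truths):
    # Score 1 if the prediction matches at least one acceptable reference answer
    normalized = normalize_answer(prediction)
    return int(any(normalized == normalize_answer(gt) for gt in ground_truths))

# Hypothetical example: two acceptable phrasings of the same answer
references = ["I am going to college", "I'm going to college"]
print(exact_match_any("I'm going to college.", references))  # 1
print(exact_match_any("I am going to school", references))   # 0
```

This still gives no partial credit; it only widens the set of answers that count as exact.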

**F1-Score**

The F1 score is a commonly used metric for evaluating question-answering systems, particularly when the answer is expected to be a span of text within a larger document. It is the harmonic mean of precision (the fraction of predicted tokens that appear in the correct answer) and recall (the fraction of correct-answer tokens that appear in the prediction), so it provides a single number that balances both. The F1 score ranges from 0 to 1, with a higher score indicating better performance. A score of 1 means the generated answer contains exactly the tokens of the correct answer, while a score of 0 means the two answers share no tokens.



In [13]:
from collections import Counter

def f1_score_metric(prediction, ground_truth):
    # Token-level F1: compare the normalized tokens of the two answers
    pred_tokens = normalize_answer(prediction).split()
    truth_tokens = normalize_answer(ground_truth).split()
    # Multiset intersection: number of tokens shared by both answers
    num_same = sum((Counter(pred_tokens) & Counter(truth_tokens)).values())
    if num_same == 0:
        return 0.0
    # With P = num_same / len(pred_tokens) and R = num_same / len(truth_tokens),
    # 2PR / (P + R) simplifies to the expression below
    return 2 * num_same / (len(pred_tokens) + len(truth_tokens))

# Example usage
prediction = "I am going to school"
ground_truth = "I am going to college"
f1 = f1_score_metric(prediction, ground_truth)
print(f"F1 score: {f1}")

F1 score: 0.8


In question answering, the F1 score is often used alongside other metrics, such as the Exact Match (EM) score, for a more comprehensive evaluation of the system's performance. It is particularly useful when the correct answer is a span of text within a larger document, because it gives partial credit for token overlap: here four of the five tokens match, so the F1 score is 0.8 where the EM score is 0.
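
Precision and recall diverge when the prediction and the reference differ in length, which is common when an LLM wraps a short answer in extra words. The sketch below reports all three values for such a case; `token_precision_recall_f1` is a hypothetical helper, and it assumes simple lowercase, whitespace-based tokenization.

```python
from collections import Counter

def token_precision_recall_f1(prediction, ground_truth):
    # Assumption: lowercase + whitespace split is a good enough tokenizer here
    pred_tokens = prediction.lower().split()
    truth_tokens = ground_truth.lower().split()
    # Multiset intersection counts a shared token only as often as it occurs in both
    num_same = sum((Counter(pred_tokens) & Counter(truth_tokens)).values())
    if num_same == 0:
        return 0.0, 0.0, 0.0
    precision = num_same / len(pred_tokens)   # share of predicted tokens that are correct
    recall = num_same / len(truth_tokens)     # share of reference tokens that were found
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

precision, recall, f1 = token_precision_recall_f1("11 tennis balls", "11")
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# precision=0.33 recall=1.00 f1=0.50
```

The prediction contains the whole reference (recall of 1), but two of its three tokens are extra, so precision and therefore F1 drop.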

**BLEU (Bilingual Evaluation Understudy) Score**

BLEU (Bilingual Evaluation Understudy) Score is a metric used to evaluate the quality of machine-generated text against human-generated text. It is commonly used to evaluate the performance of machine translation systems, but it can also be used to evaluate other text generation tasks such as summarization, question answering, and text completion.

The BLEU score is calculated by first computing the precision of the machine-generated text for each n-gram size from 1 to N, where N is typically 4. Precision is the number of n-grams in the machine-generated text that also appear in the human-generated text (with each n-gram's count clipped to its count in the reference), divided by the total number of n-grams in the machine-generated text. The precision values are then combined using a geometric mean and multiplied by a brevity penalty, which penalizes outputs shorter than the reference, to produce the final BLEU score.

One limitation of the BLEU score is that it does not take into account the fluency or coherence of the machine-generated text, only its similarity to the human-generated text in terms of n-gram overlap. For this reason, the BLEU score should be used in conjunction with other metrics such as perplexity and human evaluation to provide a more comprehensive evaluation of the quality of the machine-generated text.
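
The calculation described above can be sketched in plain Python. This is an illustrative, unsmoothed single-reference version, not NLTK's implementation; `ngrams`, `modified_precision`, and `simple_bleu` are hypothetical helper names, and whitespace tokenization is an assumption.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, reference, n):
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    # Clip each candidate n-gram count to its count in the reference
    clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    total = sum(cand_counts.values())
    return clipped / total if total else 0.0

def simple_bleu(prediction, ground_truth, max_n=4):
    candidate = prediction.split()
    reference = ground_truth.split()
    precisions = [modified_precision(candidate, reference, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # without smoothing, one empty n-gram order zeroes the score
    # Geometric mean of the n-gram precisions, computed in log space
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # Brevity penalty: 1 if the candidate is at least as long as the reference
    if len(candidate) >= len(reference):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(reference) / len(candidate))
    return brevity_penalty * geo_mean

print(round(simple_bleu("I am going to school", "I am going to college"), 4))  # 0.6687
print(simple_bleu("I am here", "I am there"))  # 0.0
```

The second call returns 0.0 because a three-word answer has no 4-grams at all, which illustrates why unsmoothed BLEU is a poor fit for very short answers.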

In [14]:
def bleu_score_metric(prediction, ground_truth):
    reference = [ground_truth.split()]
    candidate = prediction.split()
    return nltk.translate.bleu_score.sentence_bleu(reference, candidate)

# Example usage
prediction = "I am going to school"
ground_truth = "I am going to college"
bleu = bleu_score_metric(prediction, ground_truth)
print(f"BLEU score: {bleu}")

BLEU score: 0.668740304976422


The score ranges from 0 to 1, with higher scores indicating greater similarity to the reference text. However, BLEU has its limitations and should be used in combination with other metrics and human evaluation to fully assess the quality and usefulness of machine-generated text. When interpreting BLEU scores from NLTK's default settings, extremely low values (e.g., on the order of 1e-154 or 1e-231) usually mean that no n-grams of one or more orders matched, which is common for short answers; such values should be read as effectively 0 rather than as meaningful measurements.

**ROUGE (Recall-Oriented Understudy for Gisting Evaluation) Score**

ROUGE (Recall-Oriented Understudy for Gisting Evaluation) Score is a metric used to evaluate the quality of machine-generated text against human-generated text. It is commonly used to evaluate the performance of text summarization systems, but it can also be used to evaluate other text generation tasks such as machine translation and question answering.

ROUGE measures the overlap between the machine-generated text and the human-generated reference using n-gram co-occurrence statistics. It was designed around recall, the fraction of reference n-grams that also appear in the generated text, although implementations usually report precision and an F-measure as well; the example below returns the ROUGE-1 (unigram) F-measure. The score ranges from 0 to 1, with 1 indicating a perfect match between the two texts, and higher scores are considered better.



In [15]:
def rouge_score_metric(prediction, ground_truth):
    rouge = Rouge()
    scores = rouge.get_scores(prediction, ground_truth)
    return scores[0]["rouge-1"]["f"]

# Example usage
prediction = "I am going to school"
ground_truth = "I am going to college"
rouge = rouge_score_metric(prediction, ground_truth)
print(f"ROUGE score: {rouge}")

ROUGE score: 0.7999999950000002


In the example above, the ROUGE score is approximately 0.8, which is relatively high and indicates substantial overlap between the model-generated text and the reference text. One limitation of ROUGE is that it only measures this overlap, without considering other important factors such as fluency, coherence, and content preservation. As a result, the ROUGE score should be used in conjunction with other metrics such as BLEU score, perplexity, and human evaluation to provide a more comprehensive evaluation of the quality of the machine-generated text.
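
To make the n-gram recall idea concrete, here is a minimal sketch of ROUGE-N for a single reference. `rouge_n` is a hypothetical helper, not the `rouge` package's implementation; libraries differ in details such as tokenization and how repeated n-grams are counted, so values may not match exactly.

```python
from collections import Counter

def rouge_n(prediction, ground_truth, n=1):
    def ngram_counts(text):
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    pred_counts = ngram_counts(prediction)
    ref_counts = ngram_counts(ground_truth)
    # Number of n-grams that occur in both texts (clipped to the smaller count)
    overlap = sum((pred_counts & ref_counts).values())
    recall = overlap / max(sum(ref_counts.values()), 1)      # relative to the reference
    precision = overlap / max(sum(pred_counts.values()), 1)  # relative to the prediction
    f = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"r": recall, "p": precision, "f": f}

scores = rouge_n("I am going to school", "I am going to college", n=2)
print(f"ROUGE-2 recall: {scores['r']:.2f}")  # ROUGE-2 recall: 0.75
```

Three of the four reference bigrams appear in the prediction, so ROUGE-2 recall is lower than the unigram overlap of four out of five.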


**WER (Word Error Rate)**

WER (Word Error Rate) is a metric used for evaluating the performance of speech recognition systems or OCR (Optical Character Recognition) systems. It measures how many word-level edits are needed to turn the recognized text into the reference text, relative to the length of the reference. WER is calculated by dividing the total number of word errors (insertions, deletions, and substitutions) by the total number of words in the reference text.

In [16]:
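# Example usage
prediction = "I am going to school"
ground_truth = "I am going to college"
# jiwer expects the reference first, then the hypothesis; store the result under
# a separate name so the imported wer function is not overwritten
wer_score = wer(ground_truth, prediction)
print(f"WER score: {wer_score}")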

WER score: 0.2


A WER score of 0.2 means that the number of word errors equals 20% of the number of words in the reference text: here, one substitution ("school" for "college") out of five reference words. Lower is better, and because insertions count as errors, WER can exceed 1.
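
For intuition, WER can be computed with a word-level edit distance. The following is a minimal sketch under the assumption of whitespace tokenization; `word_error_rate` is a hypothetical helper and not how `jiwer` is implemented internally.

```python
def word_error_rate(reference, hypothesis):
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # previous[j]: edit distance between the reference words seen so far and hyp[:j]
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1       # reference word missing from the hypothesis
            insertion = current[j - 1] + 1   # extra word in the hypothesis
            current.append(min(substitution, deletion, insertion))
        previous = current
    # Total word errors divided by the number of reference words
    return previous[-1] / len(ref)

print(word_error_rate("I am going to college", "I am going to school"))  # 0.2
print(word_error_rate("hello", "hello there big world"))                 # 3.0
```

The second call shows a WER above 1: three inserted words against a one-word reference.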

## Evaluation for Generative QA Models

### [BERTScore](https://github.com/Tiiiger/bert_score)

BERTScore is a metric that evaluates the quality of generated text by comparing it to reference text. It uses a pre-trained BERT (Bidirectional Encoder Representations from Transformers) model to encode the tokens of both texts into contextual embeddings, then matches each token to its most similar token in the other text by cosine similarity and aggregates these matches into precision, recall, and F1. BERTScore has been shown to correlate better with human judgments of text quality than n-gram-based metrics like ROUGE and BLEU.

In [18]:

bertscore = load("bertscore")
predictions = ["hello world", "I am going to school"]
references = ["hello world", "I am going to college"]
results = bertscore.compute(predictions=predictions, references=references, model_type="distilbert-base-uncased")
print(results)


{'precision': [1.0, 0.9577858448028564], 'recall': [1.0, 0.9577858448028564], 'f1': [1.0, 0.9577858448028564], 'hashcode': 'distilbert-base-uncased_L5_no-idf_version=0.3.12(hug_trans=4.27.2)'}


In this case, the output shows that the precision, recall, and F1 score values for the first phrase "hello world" are all 1.0, indicating that the generated text is identical to the reference text. However, for the second phrase "I am going to school", the precision, recall, and F1 score values are all slightly lower, at around 0.96. This indicates that the generated text is similar to the reference text, but not identical.
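
The greedy matching step behind these precision and recall values can be sketched with toy vectors. Everything here is illustrative: the 2-D "embeddings" are made up, `cosine` and `greedy_match_score` are hypothetical helpers, and the real metric uses contextual embeddings from a transformer model (with optional IDF weighting and baseline rescaling that this sketch omits).

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def greedy_match_score(candidate_vecs, reference_vecs):
    # Precision: match each candidate token to its most similar reference token
    precision = sum(
        max(cosine(c, r) for r in reference_vecs) for c in candidate_vecs
    ) / len(candidate_vecs)
    # Recall: match each reference token to its most similar candidate token
    recall = sum(
        max(cosine(r, c) for c in candidate_vecs) for r in reference_vecs
    ) / len(reference_vecs)
    total = precision + recall
    f1 = 2 * precision * recall / total if total > 0 else 0.0
    return precision, recall, f1

# Toy embeddings, one vector per token (made up for illustration only)
candidate = [[1.0, 0.0], [0.6, 0.8]]
reference = [[1.0, 0.0], [0.0, 1.0]]
p, r, f1 = greedy_match_score(candidate, reference)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
# precision=0.90 recall=0.90 f1=0.90
```

Because matching is done on embedding similarity rather than exact tokens, a near-synonym still contributes a high similarity instead of counting as a miss.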

# LLM Grading

We can evaluate generic question-answering problems using [LangChain](https://langchain.readthedocs.io/en/latest/use_cases/evaluation/question_answering.html). In this setting, we have examples that each contain a question and its corresponding answer, and we want to measure how well the language model answers those questions.

In [30]:
prompt = PromptTemplate(template="Question: {question}\nAnswer:", input_variables=["question"])

In [31]:
llm = OpenAI(model_name="text-davinci-003", temperature=0)
chain = LLMChain(llm=llm, prompt=prompt)

In [32]:
examples = [
    {
        "question": "Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?",
        "answer": "11"
    },
    {
        "question": 'Is the following sentence plausible? "Joao Moutinho caught the screen pass in the NFC championship."',
        "answer": "No"
    }
]


In [33]:
predictions = chain.apply(examples)

In [34]:
predictions

[{'text': ' 11 tennis balls'},
 {'text': ' No, this sentence is not plausible. Joao Moutinho is a professional soccer player, not an American football player, so it is not likely that he would be catching a screen pass in the NFC championship.'}]

We can see that a plain exact match against the reference answers ("11" and "No") would not match what the language model answered. Semantically, however, the language model is correct in both cases. To account for this, we can use a language model itself to evaluate the answers.

In [35]:
llm = OpenAI(temperature=0)
eval_chain = QAEvalChain.from_llm(llm)
graded_outputs = eval_chain.evaluate(examples, predictions, question_key="question", prediction_key="text")

In [36]:
for i, eg in enumerate(examples):
    print(f"Example {i}:")
    print("Question: " + eg['question'])
    print("Real Answer: " + eg['answer'])
    print("Predicted Answer: " + predictions[i]['text'])
    print("Predicted Grade: " + graded_outputs[i]['text'])
    print()

Example 0:
Question: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?
Real Answer: 11
Predicted Answer:  11 tennis balls
Predicted Grade:  CORRECT

Example 1:
Question: Is the following sentence plausible? "Joao Moutinho caught the screen pass in the NFC championship."
Real Answer: No
Predicted Answer:  No, this sentence is not plausible. Joao Moutinho is a professional soccer player, not an American football player, so it is not likely that he would be catching a screen pass in the NFC championship.
Predicted Grade:  CORRECT
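
Once the grades are available, they can be summarized as a single accuracy figure. This is a minimal sketch; `grading_accuracy` is a hypothetical helper, `sample_grades` is a stand-in for the `graded_outputs` list so the snippet runs on its own, and the `key` parameter is there in case your LangChain version returns the grade under a different key.

```python
def grading_accuracy(graded_outputs, key="text"):
    # Compare the whole stripped label: "INCORRECT" contains "CORRECT" as a substring,
    # so a simple substring check would count wrong answers as right
    correct = sum(1 for g in graded_outputs if g[key].strip().upper() == "CORRECT")
    return correct / len(graded_outputs)

# Stand-in data shaped like the grader output shown above
sample_grades = [{"text": " CORRECT"}, {"text": " CORRECT"}, {"text": " INCORRECT"}]
print(f"Accuracy: {grading_accuracy(sample_grades):.2f}")  # Accuracy: 0.67
```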



We can also customize the prompt that is used. Here is an example that asks for a score from 0 to 100. The custom prompt requires three input variables: "query" (the question), "answer" (the ground-truth answer), and "result" (the predicted answer).

In [41]:
_PROMPT_TEMPLATE = """You are an expert professor specialized in grading students' answers to questions.
You are grading the following question:
{query}
Here is the real answer:
{answer}
You are grading the following predicted answer:
{result}
What grade do you give from 0 to 100, where 0 is the lowest (very low similarity) and 100 is the highest (very high similarity)?
"""

PROMPT = PromptTemplate(input_variables=["query", "answer", "result"], template=_PROMPT_TEMPLATE)

In [42]:
llm = OpenAI(model_name="text-davinci-003", temperature=0)
evalchain = QAEvalChain.from_llm(llm=llm, prompt=PROMPT)
evalchain.evaluate(examples, predictions, question_key="question", answer_key="answer", prediction_key="text")

[{'text': '\n100. The predicted answer is exactly the same as the real answer, so it deserves a perfect score.'},
 {'text': '\n90. The predicted answer is very accurate and shows a good understanding of the context.'}]
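
Because the grader replies in free text, the numeric score has to be extracted before it can be averaged or thresholded. The sketch below assumes the grade is the first integer in the response, as in the output above; `parse_grade` is a hypothetical helper, and a stricter prompt or output parser may be more robust.

```python
import re

def parse_grade(text, low=0, high=100):
    # Take the first integer in the response; None if absent or out of range
    match = re.search(r"\d+", text)
    if match is None:
        return None
    grade = int(match.group())
    return grade if low <= grade <= high else None

# Responses copied from the grader output above
responses = [
    "\n100. The predicted answer is exactly the same as the real answer, so it deserves a perfect score.",
    "\n90. The predicted answer is very accurate and shows a good understanding of the context.",
]
grades = [parse_grade(r) for r in responses]
print(grades)                     # [100, 90]
print(sum(grades) / len(grades))  # 95.0
```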

# Conclusion

In conclusion, evaluating LLM-based question-answering systems is essential for measuring their accuracy and effectiveness. Metrics such as Exact Match, F1, BLEU, ROUGE, and WER are commonly used to measure the quality of responses against a reference. In addition, approaches suited to generative models, such as BERTScore and LLM grading, can capture semantic correctness that lexical-overlap metrics miss.