## Evaluate the deepset/roberta-base-squad2-distilled model for extractive answer generation

https://huggingface.co/deepset/roberta-base-squad2-distilled

In [1]:
# provide project root path
ProjectRoot = "<PROVIDE PROJECT ROOT PATH>"
DatasetRoot = ProjectRoot + "/Dataset/"

In [2]:
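# Install the metric dependencies only if they are missing;
# %pip installs into the environment of the running kernel
try:
    import bert_score
except ImportError:
    %pip install bert_score

try:
    from evaluate import load
except ImportError:
    %pip install evaluate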

In [3]:
import torch
from transformers import pipeline
import pandas as pd
import json
import bert_score
import numpy as np
import re
from evaluate import load

In [4]:
# Load the context/question training set that was generated with doc2query
train_df = pd.read_csv(DatasetRoot + 'q_a_trainset.csv')

In [5]:
# Load the full article text from the JSON file
with open(DatasetRoot + 'raw_knowledge.json', 'r') as f:
    raw_text_json = json.load(f)

In [6]:
raw_df = pd.DataFrame(list(raw_text_json.items()), columns=['raw_para_id', 'raw_text'])
raw_df['raw_para_id'] = raw_df['raw_para_id'].astype('int64')

In [7]:
# Attach the raw paragraph text to each summarized paragraph and question
train_df = train_df.merge(raw_df, on='raw_para_id', how='left')

In [8]:
train_df.head()

| Unnamed: 0 | raw_para_id | paragraph | question | Final_answer | raw_text |
|---|---|---|---|---|---|
| 0 | 0 | data science is an interdisciplinary academic ... | what is data science? | Data science is an interdisciplinary academic ... | Data science is an interdisciplinary academic ... |
| 1 | 0 | data science is an interdisciplinary academic ... | What is data science? | Data science is an interdisciplinary academic ... | Data science is an interdisciplinary academic ... |
| 2 | 0 | data science is an interdisciplinary academic ... | How is data science (based on machine learning... | Data science, based on machine learning, is di... | Data science is an interdisciplinary academic ... |
| 3 | 0 | data science is an interdisciplinary academic ... | what does data science mean in linux? | Data science, in general, refers to the proces... | Data science is an interdisciplinary academic ... |
| 4 | 1 | data science is multifaceted and can be descri... | What is Data Science? | Data Science | Data science also integrates domain knowledge ... |


### Evaluation

In [9]:
# Run on the first GPU when one is available, otherwise fall back to CPU
device = 0 if torch.cuda.is_available() else -1
model = pipeline("question-answering", model='deepset/roberta-base-squad2-distilled', device=device)


#### Calculate different metric scores

In [10]:
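# Inference wrapper: return the answer span the model extracts from the context
def AskLLM(context, question):
    # The pipeline returns a dict with 'score', 'start', 'end' and 'answer'
    answer = model(question=question, context=context)
    return answer['answer']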
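
The wrapper above keeps only the `answer` field and discards the `score` the pipeline returns alongside it. A minimal sketch of a score-aware variant follows, assuming one wants to treat low-confidence spans as "no answer". The names `ask_with_threshold`, `min_score` and `fake_qa` are hypothetical and not part of the notebook, and the threshold value is arbitrary; `fake_qa` only stands in for the real pipeline so the example runs without downloading a model.

```python
from typing import Callable, Mapping


def ask_with_threshold(
    qa: Callable[..., Mapping],
    context: str,
    question: str,
    min_score: float = 0.1,  # hypothetical cut-off, not tuned on any data
) -> str:
    """Return the extracted span, or "" when the confidence is below min_score."""
    result = qa(question=question, context=context)
    if result["score"] < min_score:
        return ""
    return result["answer"]


def fake_qa(question: str, context: str) -> dict:
    """Stand-in that mimics the output keys of a question-answering pipeline."""
    first_word = context.split()[0]
    score = 0.9 if "what" in question.lower() else 0.01
    return {"answer": first_word, "score": score, "start": 0, "end": len(first_word)}


confident = ask_with_threshold(fake_qa, "Data science is broad", "What is it?")
rejected = ask_with_threshold(fake_qa, "Data science is broad", "Why?")
```

With the real pipeline, `model` would be passed in place of `fake_qa`. Empty answers would then need separate handling before the metric cells, since they carry no tokens to compare against the reference.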

In [11]:
candidate_answers = []
true_answers = []

for _, eval_data in train_df.iterrows():
    context = eval_data.raw_text
    question = eval_data.question

    # true_answers.append(eval_data.paragraph)
    true_answers.append(eval_data.Final_answer)
    candidate_answers.append(AskLLM(context, question))


In [12]:
# Calculate BERTScore with the bert_score package, using bert-base-uncased embeddings
# bert_metrics = bert_score.score(cands=candidate_answers, refs=true_answers, model_type='roberta-large', nthreads=4)
bert_metrics = bert_score.score(cands=candidate_answers, refs=true_answers, model_type='bert-base-uncased', nthreads=4)

# bert_score.score returns a (precision, recall, F1) tuple with one value per candidate/reference pair (https://github.com/Tiiiger/bert_score)
print(f"Mean Precision: {np.mean(np.array(bert_metrics[0]))}")
print(f"Mean Recall: {np.mean(np.array(bert_metrics[1]))}")
print(f"Mean F1 Score: {np.mean(np.array(bert_metrics[2]))}")



Mean Precision: 0.6673619151115417
Mean Recall: 0.5323598980903625
Mean F1 Score: 0.5836908221244812


In [13]:
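# Calculate BERTScore via https://huggingface.co/spaces/evaluate-metric/bertscore
# Note: with lang="en" this run loads roberta-large (see the log below), so its scores
# are not directly comparable with the bert-base-uncased scores from the previous cell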
bertscore = load("bertscore")
bert_metrics2 = bertscore.compute(predictions=candidate_answers, references=true_answers, lang="en")

print(f"Mean Precision: {np.mean(np.array(bert_metrics2['precision']))}")
print(f"Mean Recall: {np.mean(np.array(bert_metrics2['recall']))}")
print(f"Mean F1 Score: {np.mean(np.array(bert_metrics2['f1']))}")

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Mean Precision: 0.8931051840384802
Mean Recall: 0.8619900736957788
Mean F1 Score: 0.8768731709569693
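
Both runs report precision above recall. To illustrate how that can happen, here is a minimal sketch of the greedy matching behind BERTScore, using toy two-dimensional vectors in place of real contextual embeddings. The function names and the vectors are illustrative assumptions; the actual packages also apply model-specific layers and optional IDF weighting that this sketch leaves out.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def greedy_bertscore(cand_vecs, ref_vecs):
    """Greedy-matching precision, recall and F1 over token vectors."""
    # Precision: each candidate token is matched to its most similar reference token
    precision = sum(max(cosine(c, r) for r in ref_vecs) for c in cand_vecs) / len(cand_vecs)
    # Recall: each reference token is matched to its most similar candidate token
    recall = sum(max(cosine(r, c) for c in cand_vecs) for r in ref_vecs) / len(ref_vecs)
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# A one-token candidate against a two-token reference (toy vectors)
cand = [[1.0, 0.0]]
ref = [[1.0, 0.0], [0.0, 1.0]]
p, r, f1 = greedy_bertscore(cand, ref)
print(p, r, f1)
```

In this toy case the single candidate token is fully covered by the reference, but half of the reference has no counterpart in the candidate. One plausible reading of the scores above is the same effect: short extracted spans compared against longer reference answers.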


In [14]:
# Calculate METEOR via https://huggingface.co/spaces/evaluate-metric/meteor
meteor = load('meteor')
meteor_score = meteor.compute(predictions=candidate_answers, references=true_answers)

print(f"METEOR Score: {meteor_score['meteor']}")

[nltk_data] Downloading package wordnet to /home/sangram/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to /home/sangram/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /home/sangram/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


METEOR Score: 0.2504144414635457


In [15]:
# Calculate ROUGE via https://huggingface.co/spaces/evaluate-metric/rouge
rouge = load("rouge")
rouge_score = rouge.compute(predictions=candidate_answers, references=true_answers)

print(f"ROUGE Score: {rouge_score}")

ROUGE Score: {'rouge1': 0.35410041774256273, 'rouge2': 0.25583332526855695, 'rougeL': 0.34751153203083895, 'rougeLsum': 0.3460470777077809}
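
For intuition about what `rouge1` measures, the following is a simplified sketch of a unigram-overlap F-measure for a single candidate/reference pair. It assumes lowercasing and whitespace tokenization, which is cruder than the tokenizer used by the `evaluate` metric, so its values will not match the reported scores exactly. The function name `rouge1_f` is hypothetical.

```python
from collections import Counter


def rouge1_f(candidate: str, reference: str) -> float:
    """Unigram-overlap F-measure between a candidate and a reference string."""
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Clipped overlap: a word counts at most as often as it appears in both texts
    overlap = sum((cand_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


score = rouge1_f("data science", "data science is an interdisciplinary field")
print(score)
```

Here every candidate word appears in the reference, but the candidate covers only two of the six reference words, which pulls the F-measure down. Short extracted spans scored against full-sentence references behave the same way.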
