## Evaluate Qwen2 LLM for answer generation

Model card: https://huggingface.co/Qwen/Qwen2-1.5B-Instruct

In [1]:
# provide project root path
ProjectRoot = "/home/sangram/Tutorbot_capstone/git_hub/Tutorbot/"
DatasetRoot = ProjectRoot + "Dataset/"

In [2]:
try:
    import bert_score
except ImportError:
    !pip install bert_score

try:
    from evaluate import load
except ImportError:
    !pip install evaluate


In [3]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
import pandas as pd
import json
import bert_score
import numpy as np
import re
from evaluate import load
from tqdm import tqdm

In [4]:
# Load the context/question training set that was created with doc2query
train_df = pd.read_csv(DatasetRoot + 'q_a_trainset.csv')

In [5]:
# Load the full articles from the JSON file
with open(DatasetRoot + 'raw_knowledge.json', 'r') as f:
    raw_text_json = json.load(f)

In [6]:
raw_df = pd.DataFrame(list(raw_text_json.items()), columns=['raw_para_id', 'raw_text'])
raw_df['raw_para_id'] = raw_df['raw_para_id'].astype('int64')

In [7]:
# Build one dataframe holding the raw paragraphs, summarized paragraphs and questions
train_df = train_df.merge(raw_df, on='raw_para_id', how='left')

### Evaluation

In [8]:
if torch.cuda.is_available():
    torch.set_default_device("cuda")
    print("CUDA is available!!")
else:
    raise RuntimeError("CUDA is not available!! LLM cannot run, rerun with GPU")

CUDA is available!!


In [9]:
# for RuntimeError: cutlassF: no kernel found to launch! | https://huggingface.co/stabilityai/stable-cascade/discussions/11
torch.backends.cuda.enable_mem_efficient_sdp(False)
torch.backends.cuda.enable_flash_sdp(False)

In [10]:
# load model
model_name = "Qwen/Qwen2-1.5B-Instruct"
model_qwen = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", trust_remote_code=True)
tokenizer_qwen = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

qwen_pipeline = pipeline(
    "text-generation",
    model=model_qwen,
    tokenizer=tokenizer_qwen,
)

Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


#### Prompt Engineering

NOTE: The following prompt was created with the aid of ChatGPT.

In [11]:
def generate_prompt(context, question):
    prompt_template = """
You are an expert in understanding and interpreting provided text contexts. Given a context and a question, your task is to generate an accurate and informative answer based on the provided context. Here is the structure:

1. **Context:** The detailed text or passage that contains the information needed to answer the question.
2. **Question:** A specific question that needs to be answered based on the context.

Please make sure your response is clear, concise, and directly addresses the question. If the context does not contain sufficient information to answer the question, say I don't know.

**Context:**
{context}

**Question:**
{question}

The response is a valid JSON with fields `explanation` and `response`.
"""
    return prompt_template.format(context=context, question=question)

#### Calculate different metric scores

In [12]:
# LLM inference wrapper
def AskLLM(context, question):
    
    prompt = generate_prompt(context, question)
    messages = [{"role": "user", "content": prompt}]
    generation_args = {
        "max_new_tokens": 500,
        "return_full_text": False,
        "do_sample": False
    }

    # Ask LLM to answer 
    output = qwen_pipeline(messages, **generation_args)
    answer = output[0]['generated_text']

    final_answer = "I don't know."

    # Extract the outermost {...} span from the generated text
    json_match = re.search(r'\{.*\}', answer, re.DOTALL)

    if json_match:
        json_string = json_match.group(0)
        try:
            # Parse the JSON; keep the fallback answer if `response` is missing
            response_dict = json.loads(json_string)
            final_answer = response_dict.get('response', final_answer)
        except json.JSONDecodeError:
            print(f"Error decoding JSON: {json_string}")

    return final_answer

In [13]:
candidate_answers = []
true_answers = []

for _, eval_data in tqdm(train_df.iterrows(), total=len(train_df)):
    context = eval_data.raw_text
    question = eval_data.question

    true_answers.append(eval_data.Final_answer)
    candidate_answers.append(AskLLM(context, question))

 10%|█         | 10/96 [00:40<05:11,  3.63s/it]
You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset
 77%|███████▋  | 74/96 [04:35<01:53,  5.17s/it]

Error decoding JSON: {
  "explanation": "Data science, data analysis, and statistical computing all involve analyzing data using various methods such as machine learning, regression analysis, and predictive modeling. They share some common skills like proficiency in statistics, programming, and data visualization. However, they differ in their focus areas, methodologies, and applications.",
  "response": "Pros:
- Data science offers a broader scope, including machine learning, deep learning, and artificial intelligence, which can lead to more innovative solutions.
- It involves interdisciplinary work across multiple domains, making it highly applicable in various industries.
- It fosters creativity and innovation by allowing developers to create new algorithms and models.
- It provides opportunities for career advancement in tech companies and startups."
  
Cons:
- Requires extensive training and expertise in complex algorithms and techniques.
- May lack practical application in certai

 90%|████████▉ | 86/96 [05:23<00:34,  3.47s/it]

Error decoding JSON: {
  "explanation": "A distributed computing framework for preparing large datasets is called Spark.",
  "response": "Spark is a popular distributed computing framework designed to handle big data workloads. It enables data scientists to process and analyze large datasets in parallel, which can reduce processing times.",
}


100%|██████████| 96/96 [05:44<00:00,  3.59s/it]
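
The second decoding failure above is caused by a trailing comma before the closing brace, which strict JSON does not allow. Model replies can also contain raw newlines inside string values. A minimal sketch of a more tolerant parser follows; `parse_llm_json` is a hypothetical helper, not part of this notebook, and the regex-based comma cleanup is an assumption that could also touch a `, }` sequence inside a string value. It would not repair the first failure, where the `response` string is closed before the "Cons" section begins.

```python
import json
import re


def parse_llm_json(answer, default="I don't know."):
    """Hypothetical helper: pull the `response` field out of an LLM reply."""
    match = re.search(r"\{.*\}", answer, re.DOTALL)
    if not match:
        return default

    # Drop trailing commas that sit directly before a closing brace or bracket.
    cleaned = re.sub(r",\s*([}\]])", r"\1", match.group(0))

    try:
        # strict=False lets the parser accept raw control characters
        # (such as newlines) inside string values.
        parsed = json.loads(cleaned, strict=False)
    except json.JSONDecodeError:
        return default

    if not isinstance(parsed, dict):
        return default
    return parsed.get("response", default)
```

Replies that still fail to parse fall back to the same "I don't know." default that `AskLLM` uses, so the metric computation is unchanged for them.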


In [14]:
# Calculate BERTScore
# bert_metrics = bert_score.score(cands=candidate_answers, refs=true_answers, model_type='roberta-large', nthreads=4)
bert_metrics = bert_score.score(cands=candidate_answers, refs=true_answers, model_type='bert-base-uncased', nthreads=4)

# bert_score.score returns per-example precision, recall and F1 tensors, in that order
# (background on the metric: https://lightning.ai/docs/torchmetrics/stable/text/bert_score.html)
print(f"Mean Precision: {np.mean(np.array(bert_metrics[0]))}")
print(f"Mean Recall: {np.mean(np.array(bert_metrics[1]))}")
print(f"Mean F1 Score: {np.mean(np.array(bert_metrics[2]))}")



Mean Precision: 0.637467622756958
Mean Recall: 0.7010056376457214
Mean F1 Score: 0.6577903032302856
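
For reference, BERTScore greedily matches each token embedding to its most similar counterpart in the other text: precision averages the best match for every candidate token, recall does the same for every reference token, and F1 is their harmonic mean. The sketch below illustrates this with toy 2-D vectors standing in for contextual embeddings. The names `cosine` and `greedy_match_scores` are hypothetical, the vectors are assumed to be non-zero, and IDF weighting and baseline rescaling are left out.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def greedy_match_scores(cand_vecs, ref_vecs):
    """Return (precision, recall, f1) from greedy token matching."""
    # Precision: best reference match for each candidate token.
    precision = sum(
        max(cosine(c, r) for r in ref_vecs) for c in cand_vecs
    ) / len(cand_vecs)
    # Recall: best candidate match for each reference token.
    recall = sum(
        max(cosine(r, c) for c in cand_vecs) for r in ref_vecs
    ) / len(ref_vecs)
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

Because F1 is computed per example before averaging, the mean F1 printed above need not equal the harmonic mean of the mean precision and mean recall.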


In [15]:
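# Calculate BERTScore via https://huggingface.co/spaces/evaluate-metric/bertscore
# With lang="en" this run loads roberta-large (see the warning in the output),
# so its scores are not directly comparable with the bert-base-uncased run above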
bertscore = load("bertscore")
bert_metrics2 = bertscore.compute(predictions=candidate_answers, references=true_answers, lang="en")

print(f"Mean Precision: {np.mean(np.array(bert_metrics2['precision']))}")
print(f"Mean Recall: {np.mean(np.array(bert_metrics2['recall']))}")
print(f"Mean F1 Score: {np.mean(np.array(bert_metrics2['f1']))}")

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Mean Precision: 0.8921381408969561
Mean Recall: 0.9051140416413546
Mean F1 Score: 0.8982096202671528


In [16]:
# Calculate METEOR via https://huggingface.co/spaces/evaluate-metric/meteor
meteor = load('meteor')
meteor_score = meteor.compute(predictions=candidate_answers, references=true_answers)

print(f"METEOR Score: {meteor_score['meteor']}")

[nltk_data] Downloading package wordnet to /home/sangram/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to /home/sangram/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /home/sangram/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


METEOR Score: 0.42133591832031775


In [17]:
# Calculate ROUGE via https://huggingface.co/spaces/evaluate-metric/rouge
rouge = load("rouge")
rouge_score = rouge.compute(predictions=candidate_answers, references=true_answers)

print(f"ROUGE Score: {rouge_score}")

ROUGE Score: {'rouge1': 0.40043498820954077, 'rouge2': 0.2646104844303946, 'rougeL': 0.3620210789229402, 'rougeLsum': 0.36012231921043786}
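
To make the `rouge1` number easier to interpret, here is a rough sketch of unigram-overlap scoring, assuming the reported value is an F-measure. `rouge1_f` is a hypothetical helper that uses lowercased whitespace tokenization as a simplification; the `evaluate` metric relies on a fuller implementation with its own tokenizer, so its values will differ somewhat.

```python
from collections import Counter


def rouge1_f(candidate, reference):
    """Unigram-overlap F-measure between a candidate and a reference string."""
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())

    # Clipped overlap: each word counts at most as often as it appears in both.
    overlap = sum((cand_counts & ref_counts).values())
    if overlap == 0:
        return 0.0

    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)
```

A short answer that only reuses words from a longer reference gets full precision but low recall, which is one reason brief, correct answers can still receive modest ROUGE scores.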


In [18]:
# Restore the default CUDA scaled-dot-product attention settings
torch.backends.cuda.enable_mem_efficient_sdp(True)
torch.backends.cuda.enable_flash_sdp(True)
