In [None]:

from src.basic_functions import get_prediction
from transformers import pipeline
from src.utils import save_to_references
from metrics.correctness import correctness
from metrics.consistency import consistency
from metrics.plausibility import plausibility
from metrics.QAG import qag
from metrics.contextual import contextual_faithfulness
from metrics.counterfactual import counterfactual_faithfulness
from metrics.faithfulness import faithfulness
from metrics.trustworthiness import lext

# Target Models and Helper Keys Initialization 

Load the Hugging Face NER pipeline that will be used to tag entities in your explanations.
The ner_pipe below is a medical NER model. For other domains, substitute an appropriate domain-specific NER model when evaluating explanations.

In [2]:
ner_pipe = pipeline("token-classification", model="Clinical-AI-Apollo/Medical-NER", aggregation_strategy='simple')
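Before swapping in another NER model, it helps to check which entity groups it actually returns. The helper below is a minimal sketch, not part of this repository: group_entities is a hypothetical name, and the sample list is hand-written in the shape a token-classification pipeline returns with aggregation_strategy='simple' (dictionaries with entity_group, score, word, start, end). The labels and scores in the sample are illustrative only, not real output from the model above.

```python
from collections import defaultdict


def group_entities(entities, min_score=0.0):
    """Group aggregated NER results by entity_group, dropping low-confidence hits."""
    grouped = defaultdict(list)
    for ent in entities:
        if ent["score"] >= min_score:
            grouped[ent["entity_group"]].append(ent["word"])
    return dict(grouped)


# Hand-written sample; labels, scores, and offsets are made up for illustration.
sample = [
    {"entity_group": "MEDICATION", "score": 0.98, "word": "paracetamol", "start": 33, "end": 44},
    {"entity_group": "SIGN_SYMPTOM", "score": 0.91, "word": "fever", "start": 9, "end": 14},
    {"entity_group": "SIGN_SYMPTOM", "score": 0.40, "word": "take", "start": 25, "end": 29},
]

print(group_entities(sample, min_score=0.5))
```

With a loaded pipeline, the same helper can be applied to its output, for example group_entities(ner_pipe(ground_explanation)).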

This evaluation uses the Groq API to access larger models such as llama3-70b. Create an API key from Groq and substitute it for the placeholder below. To change the larger model, edit [src/basic_functions.py](src/basic_functions.py) and replace it with any other model available on Groq.

In [None]:
groq_api = "your-groq-api"
model = "target-model"
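If you would rather not paste the key into the notebook, one option is to read it from an environment variable. This is a sketch under assumptions: load_api_key is a hypothetical helper, and the variable name GROQ_API_KEY is a chosen convention here, not something this repository requires.

```python
import os


def load_api_key(var_name="GROQ_API_KEY"):
    """Return the API key stored in an environment variable, or fail loudly."""
    key = os.environ.get(var_name, "").strip()
    if not key:
        raise RuntimeError(
            f"Set the {var_name} environment variable before running the notebook."
        )
    return key


# Usage (uncomment once the variable is set in your shell or kernel):
# groq_api = load_api_key()
```

Failing early with a clear message avoids a less obvious authentication error surfacing later inside a metric call.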

Fill in the question and context, along with the ground-truth label and explanation used to compute the LExT score.

In [4]:
question = "I have a fever, should I take a paracetamol. Yes/No?"
context = None
ground_label = "Yes"
ground_explanation= "Paracetamol is the recommended and is effective for reduction of fever in adults and children. However it is always recommended to consult with a healthcare professional before taking any medication. "

# Computing the Overall LExT Score
Run the function below. All relevant scores are printed, and the results, together with the reference data for further analysis, are stored in references.csv. Make sure that the models have been pulled locally with Ollama and that the API keys and pipelines are correct.

In [5]:
LExT = lext(context, question, ground_explanation, ground_label, model, groq_api, ner_pipe)

Computing weighted accuracy

Computing Context Relevancy

Computing correctness

Accuracy: 0.008375763770279323, Context Relevancy: 0.5532126426696777, Correctness: 0.28079420321997856

Computing iterative stability

Computing weighted accuracy

Computing weighted accuracy

Computing weighted accuracy

Computing weighted accuracy

Computing weighted accuracy

Computing paraphrase stability

Computing weighted accuracy

Computing weighted accuracy

Computing weighted accuracy

Computing weighted accuracy

Computing consistency

Iterative Stability: 0.9094947076828562, Paraphrase Stability: 0.9135900299589156, Consistency: 0.9115423688208859

Computing plausibility

Plausibility: 0.5961682860204323

Computing QAG score

Computing Counterfactual Faithfulness

Computing Contextual Faithfulness

Computing Faithfulness

QAG: 1.0, Counterfactual: 1.0, Contextual: 0, Faithfulness: 0.6666666666666666

LExT (Language Explanation Trustworthiness) Score: 0.6294496730042588
All model outputs are sa
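
The numbers printed in this run can be cross-checked by hand. They are consistent with Correctness, Consistency, and Plausibility each being the arithmetic mean of the two scores that feed them, Faithfulness being the mean of its three components, and the LExT score being the harmonic mean of Plausibility and Faithfulness. This is inferred from the output of this single run, not from the metrics package, so treat it as a sanity check rather than a definition. The helper names and the *_value variables below are hypothetical.

```python
import math


def mean(*values):
    """Arithmetic mean of the given scores."""
    return sum(values) / len(values)


def harmonic_mean(a, b):
    """Harmonic mean of two scores; defined as 0.0 when both are zero."""
    return 0.0 if a + b == 0 else 2 * a * b / (a + b)


# Component scores copied from the printed output of this run.
accuracy, context_relevancy = 0.008375763770279323, 0.5532126426696777
iterative, paraphrase = 0.9094947076828562, 0.9135900299589156
qag_value, counterfactual, contextual = 1.0, 1.0, 0.0

correctness_value = mean(accuracy, context_relevancy)
consistency_value = mean(iterative, paraphrase)
plausibility_value = mean(correctness_value, consistency_value)
faithfulness_value = mean(qag_value, counterfactual, contextual)
lext_value = harmonic_mean(plausibility_value, faithfulness_value)

# Compare against the aggregate scores reported in the output.
reported = {
    "Correctness": (correctness_value, 0.28079420321997856),
    "Consistency": (consistency_value, 0.9115423688208859),
    "Plausibility": (plausibility_value, 0.5961682860204323),
    "Faithfulness": (faithfulness_value, 0.6666666666666666),
    "LExT": (lext_value, 0.6294496730042588),
}
for name, (recomputed, printed) in reported.items():
    print(name, math.isclose(recomputed, printed, rel_tol=1e-9))
```

Because a harmonic mean is pulled toward the smaller of its two inputs, the low Correctness in this run (driven by the near-zero Accuracy) holds the final score down despite high Consistency.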

# Computing Individual Metrics

To compute an individual metric or a subset of metrics, first run the code below to get a sample prediction and build a row reference dictionary, which the other helper functions use to access and save data.

In [6]:
label, explanation = get_prediction(context, question, model, groq_api, include_context="False")

In [7]:
row_reference = {
    "ground_context": context,
    "ground_question": question,
    "ground_explanation": ground_explanation,
    "ground_label": ground_label,
    "predicted_explanation": explanation,
    "predicted_label": label,
}

You can now use any of the functions below to compute individual metrics.

In [10]:
correctness_score = correctness(ground_explanation, explanation, question, groq_api, ner_pipe, row_reference)

Computing weighted accuracy

Computing Context Relevancy

Computing correctness

Accuracy: 0.008375763770279323, Context Relevancy: 0.5394374132156372, Correctness: 0.2739065884929583



In [12]:
consistency_score = consistency(question, context, model, ground_explanation, groq_api, ner_pipe, row_reference)

Computing iterative stability

Computing weighted accuracy

Computing weighted accuracy

Computing weighted accuracy

Computing weighted accuracy

Computing weighted accuracy

Computing paraphrase stability

Computing weighted accuracy

Computing weighted accuracy

Computing weighted accuracy

Computing weighted accuracy

Computing consistency

Iterative Stability: 0.9509319557815018, Paraphrase Stability: 1.0, Consistency: 0.9754659778907508



In [None]:
plausibility_score = plausibility(context, question, ground_explanation, model, groq_api, ner_pipe, row_reference)

In [None]:
qag_score = qag(ground_explanation, groq_api, model, row_reference)

In [None]:
counterfactual_faithfulness_score = counterfactual_faithfulness(explanation, question, label, model, groq_api, row_reference)

In [None]:
contextual_faithfulness_score = contextual_faithfulness(context, question, label, model, groq_api, row_reference)

In [None]:
faithfulness_score = faithfulness(explanation, label, question, ground_label, context, groq_api, model, row_reference)

When you call individual metrics, the evaluation scores, model responses, and experiment results are saved in the row_reference dictionary. To save it, run the code below.

In [None]:
save_to_references(row_reference)
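
For readers who want to persist rows in their own format, the following is a rough sketch of appending a dictionary as one CSV row. It is not the implementation of save_to_references, which lives in src/utils and may behave differently. The function name append_row_to_csv and the default file name are hypothetical.

```python
import csv
import os


def append_row_to_csv(row, path="references_sketch.csv"):
    """Append one dictionary as a CSV row, writing the header only for a new file."""
    needs_header = not (os.path.isfile(path) and os.path.getsize(path) > 0)
    with open(path, "a", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(row.keys()))
        if needs_header:
            writer.writeheader()
        writer.writerow(row)  # None values are written as empty cells


# Usage (assumes row_reference was built as shown earlier):
# append_row_to_csv(row_reference)
```

This sketch assumes every row has the same keys in the same order. Rows with differing keys would need a fixed fieldnames list instead.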