# P-GALM Framework Evaluation

This notebook evaluates the Probabilistic Graph-Augmented Language Model (P-GALM) on a small random sample of text-only questions from the ScienceQA validation split.

It performs the following steps:
1.  Loads the ScienceQA validation dataset.
2.  Selects a random sample of text-only questions.
3.  Runs the vPGM inference pipeline on each question.
4.  Compares the predicted answer with the ground truth.
5.  Calculates accuracy and displays detailed results.

In [1]:
import os
import json
import random
import pandas as pd
from tqdm import tqdm
from dotenv import load_dotenv

# Load environment variables (OPENAI_API_KEY)
load_dotenv()

# Import P-GALM modules
from scienceqa_vpgm_loader import load_scienceqa, load_prompt_template, build_scienceqa_skeleton
from vpgm_llm_client import infer_vpgm_for_skeleton

In [2]:
# Configuration
SAMPLE_SIZE = 10  # Number of questions to evaluate
SPLIT = "validation"
TEMPLATE_ID = "scienceqa_vpgm_4latent_generic"

In [3]:
print("Loading dataset and template...")
dataset = load_scienceqa(split=SPLIT)
template = load_prompt_template()
print(f"Loaded {len(dataset)} examples from {SPLIT} split.")

Loading dataset and template...
Loaded 4241 examples from validation split.


In [4]:
# Select a random sample of text-only questions.
# Examples with images are filtered out so that this demo focuses on text reasoning
# and avoids potential image-processing issues. The pipeline can also take image
# questions if the images are handled correctly; drop the filter for a full evaluation.

candidates = [ex for ex in dataset if ex.get('image') is None]
print(f"Found {len(candidates)} text-only questions.")

if len(candidates) < SAMPLE_SIZE:
    sample = candidates
else:
    sample = random.sample(candidates, SAMPLE_SIZE)

print(f"Selected {len(sample)} questions for evaluation.")

Found 2144 text-only questions.
Selected 10 questions for evaluation.
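
Because `random.sample` draws from the global random state, each run of the cell above evaluates a different set of questions, so accuracy figures are not directly comparable between runs. A minimal sketch of a seeded variant follows. The `SEED` value and the `select_sample` helper are assumptions for illustration and are not part of the P-GALM modules.

```python
import random

SEED = 42  # hypothetical value; any fixed integer works


def select_sample(candidates, sample_size, seed=SEED):
    """Return up to `sample_size` items, drawn reproducibly from `candidates`."""
    rng = random.Random(seed)  # local generator, leaves the global random state untouched
    k = min(sample_size, len(candidates))
    return rng.sample(candidates, k)


# Stand-in data; in the notebook this would be the text-only `candidates` list.
demo_candidates = [{"qid": i} for i in range(100)]
first = select_sample(demo_candidates, 10)
second = select_sample(demo_candidates, 10)
print(first == second)  # prints True: the same seed yields the same sample
```

Using a local `random.Random` instance instead of `random.seed` keeps the sampling reproducible without affecting other code that relies on the global generator.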


In [5]:
results = []

print("Starting inference...")
for i, ex in enumerate(tqdm(sample)):
    # 1. Build Skeleton
    # Use index as synthetic ID if needed
    sqa_id = ex.get("id") or ex.get("qid") or f"sample_{i}"
    skeleton = build_scienceqa_skeleton(ex, TEMPLATE_ID, template, override_id=sqa_id)
    
    # 2. Run Inference
    try:
        prediction_result = infer_vpgm_for_skeleton(skeleton, template, template_id=TEMPLATE_ID)
        
        # 3. Extract Answer
        # The model returns 'selected_answer' which is the text of the option.
        # We need to map it back to the index (0, 1, 2...)
        selected_text = prediction_result["answer_posterior"]["selected_answer"]
        options = ex["choices"]
        
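        # Exact match first, then fall back to a case-insensitive substring match
        pred_index = -1
        if selected_text in options:
            pred_index = options.index(selected_text)
        else:
            # Fallback: find the first option that contains, or is contained in, the selected text.
            # Empty strings are skipped because "" is a substring of every option.
            needle = selected_text.strip().lower()
            for idx, opt in enumerate(options):
                hay = opt.strip().lower()
                if needle and hay and (hay in needle or needle in hay):
                    pred_index = idx
                    break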
        
        # 4. Compare with Ground Truth
        ground_truth_index = ex["answer"]
        is_correct = (pred_index == ground_truth_index)
        
        results.append({
            "id": sqa_id,
            "question": ex["question"],
            "ground_truth_index": ground_truth_index,
            "ground_truth_text": options[ground_truth_index],
            "predicted_index": pred_index,
            "predicted_text": selected_text,
            "is_correct": is_correct,
            "reasoning_quality": prediction_result["latent_posteriors"]["Z4_logical_reasoning"]["justification"]
        })
        
    except Exception as e:
        print(f"Error processing {sqa_id}: {e}")
        results.append({
            "id": sqa_id,
            "question": ex["question"],
            "ground_truth_index": ex["answer"],
            "ground_truth_text": ex["choices"][ex["answer"]],
            "predicted_index": -1,
            "predicted_text": "ERROR",
            "is_correct": False,
            "reasoning_quality": str(e)
        })

Starting inference...


100%|██████████| 10/10 [02:16<00:00, 13.67s/it]
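
The answer-mapping step in the loop above can be factored into a small helper that is easy to test without calling the model. The sketch below is one possible version. The name `match_option` and the choice to normalize case and whitespace are assumptions, not part of `vpgm_llm_client`.

```python
def match_option(selected_text, options):
    """Map the model's answer text to an option index, or -1 if nothing matches."""
    def norm(s):
        # Collapse whitespace and ignore case so trivial formatting differences do not matter
        return " ".join(s.split()).lower()

    needle = norm(selected_text)
    if not needle:
        return -1  # an empty answer must not match anything

    normalized = [norm(opt) for opt in options]

    # 1. Exact match after normalization
    if needle in normalized:
        return normalized.index(needle)

    # 2. Substring match, accepted only when exactly one option qualifies
    hits = [i for i, opt in enumerate(normalized) if opt and (opt in needle or needle in opt)]
    return hits[0] if len(hits) == 1 else -1


print(match_option("a service", ["a good", "a service"]))
print(match_option("The answer is: no", ["yes", "no"]))
print(match_option("", ["yes", "no"]))
```

Returning -1 for ambiguous substring matches counts them as wrong instead of guessing, which is stricter than taking the first option that happens to match.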


In [6]:
# Create DataFrame
df = pd.DataFrame(results)

# Calculate Accuracy
accuracy = df["is_correct"].mean()
print(f"\nOverall Accuracy: {accuracy:.2%}")

# Display Results
pd.set_option('display.max_colwidth', 100)
display(df[["question", "ground_truth_text", "predicted_text", "is_correct"]])


Overall Accuracy: 100.00%


| | question | ground_truth_text | predicted_text | is_correct |
|---|---|---|---|---|
| 0 | Which is smoother? | plastic bag | plastic bag | True |
| 1 | What information supports the conclusion that Rudy inherited this trait? | Rudy's biological father has curly hair. | Rudy's biological father has curly hair. | True |
| 2 | What information supports the conclusion that Tony acquired this trait? | Tony's scar was caused by an accident. He cut his arm when he fell off his bicycle. | Tony's scar was caused by an accident. He cut his arm when he fell off his bicycle. | True |
| 3 | Which is a run-on sentence? | Will picked apples, he will give some away. | Will picked apples, he will give some away. | True |
| 4 | What do these two changes have in common?\nbreaking a ceramic plate\npicking up a paper clip wit... | Both are only physical changes. | Both are only physical changes. | True |
| 5 | What is the mass of a full bag of groceries? | 5 pounds | 5 pounds | True |
| 6 | Which of the following contains a vague pronoun reference? | Jake and his best friend go to the same college, but he is graduating this coming June. | Jake and his best friend go to the same college, but he is graduating this coming June. | True |
| 7 | Would you find the word throb on a dictionary page with the following guide words?\ntaper - tent... | no | no | True |
| 8 | Which is a complete sentence? | My family will swim at the town pool tomorrow. | My family will swim at the town pool tomorrow. | True |
| 9 | Is teaching art a good or a service? | a service | a service | True |
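
With only 10 questions, a 100% sample accuracy carries wide uncertainty. One common way to quantify this is a Wilson score interval for a binomial proportion. The sketch below uses only the standard library, and the helper name `wilson_interval` is an assumption for illustration.

```python
import math


def wilson_interval(num_correct, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives roughly 95%)."""
    if n == 0:
        raise ValueError("n must be positive")
    p = num_correct / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    # Clamp to [0, 1] to guard against floating-point overshoot
    return max(0.0, center - half), min(1.0, center + half)


low, high = wilson_interval(10, 10)
print(f"95% interval: {low:.1%} to {high:.1%}")
```

For 10 correct out of 10 this gives roughly 72% to 100%, so a larger `SAMPLE_SIZE` would be needed before reading much into the headline accuracy.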


In [7]:
# Inspect Reasoning for Incorrect Answers
incorrect = df[~df["is_correct"]]
if not incorrect.empty:
    print("\n--- Analysis of Incorrect Predictions ---")
    for _, row in incorrect.iterrows():
        print(f"\nQ: {row['question']}")
        print(f"True: {row['ground_truth_text']} | Pred: {row['predicted_text']}")
        print(f"Reasoning: {row['reasoning_quality']}")
else:
    print("No incorrect predictions in this sample!")

No incorrect predictions in this sample!
