# 🧪 Robust Evaluation System for GPT-2 Singapore Financial Q&A

## 📊 **Comprehensive Metrics:**
- **BLEU & ROUGE scores** for text quality
- **Semantic similarity** using sentence transformers
- **Domain accuracy** with Singapore financial keywords
- **Factual accuracy** assessment
- **Response time** measurement
- **Singapore content detection**
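
One simple reading of "domain accuracy" and "Singapore content detection" is keyword matching. The sketch below is only an illustration under that assumption: the keyword list and the function names `keyword_coverage` and `mentions_singapore` are hypothetical and do not come from this notebook.

```python
import re
from typing import List

# Hypothetical keyword list, chosen for illustration only.
SG_FINANCE_KEYWORDS = [
    "mas", "monetary authority of singapore", "sgd",
    "singapore dollar", "stro", "pdpa",
]

def _contains(text: str, keyword: str) -> bool:
    """Case-insensitive whole-word (or whole-phrase) match."""
    return re.search(rf"\b{re.escape(keyword.lower())}\b", text.lower()) is not None

def keyword_coverage(response: str, expected_keywords: List[str]) -> float:
    """Fraction of the expected keywords that appear in the response."""
    if not expected_keywords:
        return 0.0
    hits = sum(1 for kw in expected_keywords if _contains(response, kw))
    return hits / len(expected_keywords)

def mentions_singapore(response: str) -> bool:
    """True if the response contains any Singapore finance keyword."""
    return any(_contains(response, kw) for kw in SG_FINANCE_KEYWORDS)
```

Whole-word matching keeps short acronyms such as "MAS" from matching inside unrelated words, at the cost of missing inflected forms.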

## 🎯 **Evaluation Framework:**
- **15 test questions** covering MAS basics, bank capital, AML reporting, payment services, technology risk, data protection, insurance, and securities regulation
- **Ground truth answers** for accurate comparison
- **Base vs Fine-tuned** model comparison
- **Aggregate statistics** and detailed breakdowns
- **Production readiness** assessment
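
As a rough illustration of the base vs fine-tuned comparison and the aggregate statistics, per-question metric dictionaries can be averaged and then subtracted. This is a minimal sketch, not the notebook's own code: the helper names `aggregate_scores` and `compare_models` are assumptions, as is the idea that every question yields the same set of metric keys.

```python
from statistics import mean
from typing import Dict, List

def aggregate_scores(per_question: List[Dict[str, float]]) -> Dict[str, float]:
    """Average each metric over all questions (assumes identical keys per row)."""
    metrics = per_question[0].keys()
    return {m: mean(row[m] for row in per_question) for m in metrics}

def compare_models(
    base: List[Dict[str, float]],
    finetuned: List[Dict[str, float]],
) -> Dict[str, float]:
    """Per-metric improvement of the fine-tuned model over the base model."""
    base_avg = aggregate_scores(base)
    ft_avg = aggregate_scores(finetuned)
    return {m: ft_avg[m] - base_avg[m] for m in base_avg}
```

A positive value means the fine-tuned model scored higher on that metric on average. For response time, where lower is better, the sign should be read the other way round.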


In [None]:
# Install and import evaluation dependencies
# (transformers and peft are imported below, so install them as well)
!pip install transformers peft rouge-score nltk sentence-transformers -q

import torch
import json
import time
import numpy as np
from pathlib import Path
from typing import Dict, List, Tuple

# Core libraries
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel

# Evaluation metrics
from rouge_score import rouge_scorer
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from sentence_transformers import SentenceTransformer
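import nltk
nltk.download('punkt', quiet=True)
# Newer NLTK releases look for 'punkt_tab' instead of 'punkt' when tokenizing
nltk.download('punkt_tab', quiet=True)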

print("🧪 ROBUST EVALUATION SYSTEM")
print("=" * 50)
print("Comprehensive evaluation with multiple metrics")


In [None]:
# Load models for evaluation
print("🔄 Loading models...")

# Load tokenizer and models
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

# Base model
base_model = AutoModelForCausalLM.from_pretrained("gpt2")

# Try to load the fine-tuned LoRA adapters (adjust path as needed)
try:
    finetuned_model = PeftModel.from_pretrained(
        AutoModelForCausalLM.from_pretrained("gpt2"),
        "gpt2_singapore_production/lora_adapters"
    )
    print("✅ Loaded fine-tuned model")
except Exception as e:
    # Fall back to the base model so the notebook still runs. In that case the
    # base vs fine-tuned comparison will report identical scores.
    print(f"⚠️ Could not load fine-tuned model ({e}); using base model for comparison")
    finetuned_model = base_model

# Move to device and switch to inference mode (disables dropout)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
base_model.to(device).eval()
finetuned_model.to(device).eval()

# Initialize evaluation tools
rouge_scorer_obj = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
smoothing = SmoothingFunction().method1
semantic_model = SentenceTransformer('all-MiniLM-L6-v2')

print(f"✅ Models loaded on {device}")


In [None]:
# Comprehensive test dataset with ground truth
test_questions = [
    {
        "question": "What does MAS stand for?",
        "ground_truth": "MAS stands for Monetary Authority of Singapore, which is Singapore's central bank and integrated financial regulator."
    },
    {
        "question": "What currency does Singapore use?",
        "ground_truth": "Singapore uses the Singapore Dollar (SGD) as its official currency."
    },
    {
        "question": "Who regulates banks in Singapore?",
        "ground_truth": "The Monetary Authority of Singapore (MAS) regulates banks in Singapore."
    },
    {
        "question": "What are the minimum capital requirements for banks in Singapore?",
        "ground_truth": "Banks in Singapore must maintain a minimum Common Equity Tier 1 (CET1) capital ratio of 6.5% and a Total Capital Ratio of 10% as required by MAS."
    },
    {
        "question": "How often must banks report capital adequacy to MAS?",
        "ground_truth": "Banks must submit capital adequacy returns to MAS on a monthly basis."
    },
    {
        "question": "What is STRO and what does it do?",
        "ground_truth": "STRO is the Suspicious Transaction Reporting Office, which receives and analyzes suspicious transaction reports from financial institutions in Singapore."
    },
    {
        "question": "What are the AML reporting requirements for financial institutions?",
        "ground_truth": "Financial institutions must report suspicious transactions to STRO within 15 days, regardless of the transaction amount."
    },
    {
        "question": "What is the minimum capital requirement for major payment institutions?",
        "ground_truth": "Major payment institutions must maintain minimum base capital of SGD 250,000 under the Payment Services Act."
    },
    {
        "question": "How often must banks conduct penetration testing?",
        "ground_truth": "Banks must conduct penetration testing of critical systems at least annually as required by MAS Technology Risk Management Guidelines."
    },
    {
        "question": "What are the cyber incident reporting requirements?",
        "ground_truth": "Financial institutions must report significant cyber incidents to MAS within 1 hour of discovery."
    },
    {
        "question": "What does PDPA stand for and how does it apply to banks?",
        "ground_truth": "PDPA stands for Personal Data Protection Act. Banks must comply with PDPA requirements including obtaining consent for data collection and notifying individuals of data breaches within 72 hours."
    },
    {
        "question": "What is the minimum capital requirement for digital banks?",
        "ground_truth": "Digital banks must meet minimum paid-up capital of SGD 1.5 billion to obtain a banking license from MAS."
    },
    {
        "question": "What is the minimum Capital Adequacy Ratio for insurers?",
        "ground_truth": "Insurers in Singapore must maintain a minimum Capital Adequacy Ratio (CAR) of 120% under MAS's Risk-Based Capital framework."
    },
    {
        "question": "What does SFA stand for in Singapore?",
        "ground_truth": "SFA stands for Securities and Futures Act, which governs Singapore's capital markets and requires licensing for securities activities."
    },
    {
        "question": "What does PSA stand for in Singapore financial regulation?",
        "ground_truth": "PSA stands for Payment Services Act, which is Singapore's regulatory framework for payment services."
    }
]

print(f"📊 Created comprehensive test set: {len(test_questions)} questions")
print("✅ Each question has a ground truth answer for accurate evaluation")
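
Each entry pairs a `question` with a `ground_truth`, so a single item can be scored by generating a response, timing the call, and comparing the text with the reference. The sketch below is a dependency-free stand-in, not the notebook's metric code: `unigram_f1` is a simplified word-overlap F1 (similar in spirit to ROUGE-1 but without stemming), and `evaluate_one` and the `generate_fn` callable are hypothetical names.

```python
import re
import time
from collections import Counter
from typing import Callable, Dict

def unigram_f1(reference: str, candidate: str) -> float:
    """Word-overlap F1 between reference and candidate (case-insensitive)."""
    ref = Counter(re.findall(r"\w+", reference.lower()))
    cand = Counter(re.findall(r"\w+", candidate.lower()))
    overlap = sum((ref & cand).values())  # clipped count of shared words
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

def evaluate_one(item: Dict[str, str], generate_fn: Callable[[str], str]) -> Dict[str, float]:
    """Score one test item; generate_fn maps a question string to a response string."""
    start = time.perf_counter()
    response = generate_fn(item["question"])
    elapsed = time.perf_counter() - start
    return {
        "unigram_f1": unigram_f1(item["ground_truth"], response),
        "response_time_s": elapsed,
    }
```

Passing the model call in as `generate_fn` lets the same function score both the base and the fine-tuned model, for example `[evaluate_one(item, my_generate) for item in test_questions]`, where `my_generate` is whatever wrapper produces text from a model.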
