# 🔧 FIXED QUALITY TRAINING - Full Fine-Tuning Approach

## 🔍 **Problem Analysis:**
The large-dataset approach changed **100% of responses** relative to the base model, but the **quality was terrible**:
- Nonsensical answers ("Maternal Surgery of Melbourne" for MAS)
- No Singapore financial knowledge retained
- LoRA appeared insufficient for the larger model and the large dataset

## ✅ **SOLUTION: Full Fine-Tuning**
- **Remove LoRA entirely** - train all parameters
- **Use Flan-T5-small** - easier to fine-tune completely
- **Aggressive training** - higher LR, more epochs
- **Quality-focused dataset** - better Singapore content

## 🎯 **Expected Results:**
- **High-quality Singapore financial responses**
- **Proper MAS knowledge**
- **Coherent regulatory answers**
- **Production-ready model**


In [None]:
!pip install transformers datasets torch accelerate -q

import json
import torch
import random
from datasets import Dataset
from transformers import (
    AutoTokenizer, 
    AutoModelForSeq2SeqLM, 
    TrainingArguments, 
    Trainer,
    DataCollatorForSeq2Seq
)
from pathlib import Path
from collections import Counter

print("🔧 FIXED QUALITY TRAINING - Full Fine-Tuning Approach")
print("=" * 60)
print("Removing LoRA, using full fine-tuning for quality responses")


## 📊 High-Quality Singapore Dataset Generation
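
As a quick sanity check on the size of the set built in the next cell, the sketch below reproduces its arithmetic with placeholder questions: 19 base pairs, plus the first 15 repeated under 4 instruction variants. It also counts how many distinct prompts remain if the prompt is built from the question text alone, which is an assumption about the preprocessing step rather than something the generation cell enforces. The names `build_rows`, `n_base`, `n_varied`, and `n_variants` are illustrative.

```python
from collections import Counter

def build_rows(n_base=19, n_varied=15, n_variants=4):
    """Mimic the expansion: every base question once, plus the first
    n_varied questions repeated under n_variants alternative instructions."""
    rows = [("default", f"question-{i}") for i in range(n_base)]
    for i in range(n_varied):
        for v in range(n_variants):
            rows.append((f"variant-{v}", f"question-{i}"))
    return rows

rows = build_rows()
total = len(rows)
# Distinct prompts if only the question text reaches the model
unique_questions = len({question for _, question in rows})
# How often the most repeated question occurs
max_repeats = max(Counter(question for _, question in rows).values())

print(total, unique_questions, max_repeats)  # 79 19 5
```

If the instruction text does not reach the model, the 79 rows reduce to 19 distinct prompts, with 15 of them seen five times each.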


In [None]:
def generate_quality_singapore_dataset():
    """Generate high-quality, diverse Singapore financial Q&A pairs"""
    
    # High-quality Singapore financial Q&A pairs
    quality_qa_pairs = [
        # MAS and Basic Singapore Finance
        {
            "input": "What does MAS stand for?",
            "output": "MAS stands for the Monetary Authority of Singapore. It is Singapore's central bank and integrated financial regulator, responsible for monetary policy, banking supervision, insurance regulation, and securities market oversight."
        },
        {
            "input": "What currency does Singapore use?",
            "output": "Singapore uses the Singapore Dollar (SGD) as its official currency. The SGD is managed by the Monetary Authority of Singapore (MAS) and is one of the most stable currencies in Asia."
        },
        {
            "input": "Who regulates banks in Singapore?",
            "output": "Banks in Singapore are regulated by the Monetary Authority of Singapore (MAS). MAS supervises all banking institutions operating in Singapore to ensure financial stability and consumer protection."
        },
        
        # Capital Adequacy
        {
            "input": "What are the capital requirements for banks in Singapore?",
            "output": "Banks in Singapore must maintain a minimum Common Equity Tier 1 (CET1) capital ratio of 6.5% and a Total Capital Ratio of 10%, as required by MAS. These requirements ensure banks have sufficient capital to absorb losses."
        },
        {
            "input": "How often must banks report capital adequacy to MAS?",
            "output": "Banks must submit capital adequacy returns to MAS on a monthly basis. These reports help MAS monitor the financial health and stability of Singapore's banking sector."
        },
        
        # AML/CFT
        {
            "input": "How should financial institutions implement AML measures?",
            "output": "Financial institutions in Singapore must implement robust Anti-Money Laundering (AML) measures including customer due diligence, ongoing monitoring, suspicious transaction reporting to STRO within 15 days, and staff training as required by MAS."
        },
        {
            "input": "What is the suspicious transaction reporting threshold in Singapore?",
            "output": "In Singapore, financial institutions must report suspicious transactions to the Suspicious Transaction Reporting Office (STRO) regardless of amount. There is no minimum threshold for suspicious transaction reporting."
        },
        
        # Payment Services
        {
            "input": "What are the licensing requirements for payment institutions?",
            "output": "Under Singapore's Payment Services Act, major payment institutions must obtain a license from MAS and maintain minimum base capital of SGD 250,000. They must also meet fit and proper criteria and have robust risk management systems."
        },
        {
            "input": "What is the Payment Services Act in Singapore?",
            "output": "The Payment Services Act (PSA) is Singapore's comprehensive framework for regulating payment services. It requires licensing for payment service providers and establishes conduct and operational requirements to protect consumers."
        },
        
        # Cybersecurity
        {
            "input": "What cybersecurity requirements must banks meet?",
            "output": "Banks in Singapore must comply with MAS Technology Risk Management Guidelines, including implementing robust cybersecurity controls, conducting annual penetration testing, maintaining incident response plans, and reporting cyber incidents to MAS within 1 hour."
        },
        {
            "input": "How often must banks conduct penetration testing?",
            "output": "Banks must conduct penetration testing of critical systems at least annually as required by MAS Technology Risk Management Guidelines. More frequent testing may be required for high-risk systems."
        },
        
        # Data Protection
        {
            "input": "What are the data protection requirements for financial institutions?",
            "output": "Financial institutions in Singapore must comply with the Personal Data Protection Act (PDPA), including obtaining consent for data collection, implementing data protection measures, and notifying the Personal Data Protection Commission of notifiable data breaches within 3 calendar days of assessment, with affected individuals informed as soon as practicable."
        },
        {
            "input": "How long can banks retain customer data in Singapore?",
            "output": "Banks in Singapore must retain customer data for at least 5 years after account closure as required by MAS regulations. However, they should not retain data longer than necessary and must comply with PDPA requirements."
        },
        
        # Digital Banking
        {
            "input": "What are the licensing requirements for digital banks in Singapore?",
            "output": "Digital banks in Singapore must obtain a banking license from MAS and meet stringent requirements including minimum paid-up capital of SGD 1.5 billion, robust business plans, strong risk management frameworks, and adequate technology infrastructure."
        },
        {
            "input": "How does MAS promote fintech innovation?",
            "output": "MAS promotes fintech innovation through the regulatory sandbox, which allows fintech companies to test innovative solutions in a controlled environment with relaxed regulatory requirements for up to 2 years."
        },
        
        # Insurance
        {
            "input": "How does MAS regulate insurance companies?",
            "output": "MAS regulates insurance companies under the Insurance Act, requiring them to maintain adequate capital under the Risk-Based Capital framework, obtain proper licensing, and ensure fair treatment of policyholders through appropriate product design and distribution."
        },
        {
            "input": "What are the solvency requirements for insurers in Singapore?",
            "output": "Insurers in Singapore must maintain a minimum Capital Adequacy Ratio (CAR) of 120% under MAS's Risk-Based Capital framework. This ensures insurers have sufficient capital to meet their obligations to policyholders."
        },
        
        # Securities and Futures
        {
            "input": "What is the Securities and Futures Act?",
            "output": "The Securities and Futures Act (SFA) is Singapore's comprehensive legislation governing capital markets. It requires licensing for securities and futures activities, establishes conduct requirements, and provides investor protection measures."
        },
        {
            "input": "How does Singapore protect retail investors?",
            "output": "Singapore protects retail investors through the SFA's conduct requirements, mandatory disclosure obligations, cooling-off periods for certain products, and MAS supervision of financial advisors and investment products."
        }
    ]
    
    # Create instruction variations: 19 originals + 15 x 4 variants = 79 samples
    expanded_pairs = []
    
    # Add original pairs
    for pair in quality_qa_pairs:
        expanded_pairs.append({
            "instruction": "You are an expert in Singapore financial regulations. Answer the following question accurately based on MAS guidelines:",
            "input": pair["input"],
            "output": pair["output"]
        })
    
    # Add variations with different instruction formats
    instruction_variants = [
        "Based on Singapore financial regulations, provide a detailed answer:",
        "As a Singapore financial regulation expert, explain:",
        "According to MAS guidelines, answer the following:",
        "Provide a comprehensive answer about Singapore financial regulations:"
    ]
    
    for pair in quality_qa_pairs[:15]:  # Use first 15 for variations
        for instruction in instruction_variants:
            expanded_pairs.append({
                "instruction": instruction,
                "input": pair["input"],
                "output": pair["output"]
            })
    
    return expanded_pairs

# Generate high-quality dataset
print("📊 Generating high-quality Singapore financial dataset...")
quality_data = generate_quality_singapore_dataset()

print(f"✅ Generated {len(quality_data)} high-quality Q&A pairs!")

# Save dataset
Path("processed_data").mkdir(parents=True, exist_ok=True)
quality_dataset_path = "processed_data/quality_training_data.json"

with open(quality_dataset_path, 'w', encoding='utf-8') as f:
    json.dump(quality_data, f, indent=2, ensure_ascii=False)

# Load as dataset
dataset = Dataset.from_list(quality_data)
print(f"✅ Quality dataset loaded: {len(dataset)} examples")

# Show samples
print(f"\n📋 Sample high-quality data:")
for i in range(3):
    print(f"\n{i+1}. Input: {dataset[i]['input']}")
    print(f"   Output: {dataset[i]['output'][:100]}...")


## 🚀 Full Fine-Tuning Setup (No LoRA)
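
One detail worth keeping in mind for the setup cells in this section: when label sequences are padded with the tokenizer's pad id, those positions count toward the loss unless they are replaced with `-100`, the index that PyTorch's cross-entropy ignores. `DataCollatorForSeq2Seq` uses `-100` for the label padding it adds itself (its `label_pad_token_id` default), but it does not rewrite pad ids that are already present in the labels. A minimal sketch of the masking step, with made-up token ids and the hypothetical helper name `mask_padding`:

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss

def mask_padding(label_ids, pad_token_id=0):
    """Replace padding ids in a batch of label sequences with IGNORE_INDEX."""
    return [
        [IGNORE_INDEX if tok == pad_token_id else tok for tok in seq]
        for seq in label_ids
    ]

# Toy batch: ids are invented; 0 plays the role of T5's pad token, 1 of </s>
padded_labels = [[71, 12, 1, 0, 0], [8, 9, 10, 11, 1]]
masked = mask_padding(padded_labels)
print(masked)  # [[71, 12, 1, -100, -100], [8, 9, 10, 11, 1]]
```

Leaving padding to the collator avoids the need for this step altogether.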


In [None]:
# Load smaller model for full fine-tuning
print("Loading Flan-T5-small for full fine-tuning...")
model_name = "google/flan-t5-small"  # Smaller model, easier to fine-tune completely
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
original_model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

print(f"✅ Loaded {model_name} for FULL fine-tuning (no LoRA)")
print(f"📊 Model parameters: {model.num_parameters():,}")
print("🔧 Training ALL parameters for maximum quality!")


In [None]:
# Preprocess quality dataset
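def preprocess_function(examples):
    # Note: the "instruction" field is not used here. Every example gets the same
    # prompt prefix, so instruction variants of one question yield identical inputs.
    inputs = [f"Answer this Singapore financial regulation question: {q}" for q in examples["input"]]
    targets = examples["output"]
    
    # No padding at this stage: DataCollatorForSeq2Seq pads each batch dynamically
    # and fills label padding with -100 so padded positions are ignored by the loss.
    model_inputs = tokenizer(inputs, max_length=256, truncation=True)
    labels = tokenizer(text_target=targets, max_length=256, truncation=True)
    
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs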

print("Preprocessing quality dataset...")
tokenized_dataset = dataset.map(
    preprocess_function,
    batched=True,
    remove_columns=dataset.column_names
)

# Split for evaluation (rows are split at random, so a repeated question
# can land in both the train and the eval split)
split_dataset = tokenized_dataset.train_test_split(test_size=0.2, seed=42)
train_dataset = split_dataset['train']
eval_dataset = split_dataset['test']

print(f"✅ Quality dataset preprocessed: {len(train_dataset)} train, {len(eval_dataset)} eval")


In [None]:
# AGGRESSIVE full fine-tuning arguments
print("Setting up AGGRESSIVE full fine-tuning...")
training_args = TrainingArguments(
    output_dir="quality_finetuned_model",
    num_train_epochs=8,  # More epochs for quality
    per_device_train_batch_size=2,  # Smaller batch for more updates
    per_device_eval_batch_size=2,
    learning_rate=3e-4,  # Higher learning rate for full fine-tuning
    logging_steps=5,
    eval_steps=20,
    save_steps=20,
    warmup_steps=20,
    save_total_limit=2,
    eval_strategy="steps",
    load_best_model_at_end=True,
    remove_unused_columns=False,
    report_to="none",  # the string "none" disables logging integrations; None can fall back to all installed ones
    bf16=torch.cuda.is_available() and torch.cuda.is_bf16_supported(),  # T5 checkpoints are prone to overflow in fp16, so use bf16 where supported
)

data_collator = DataCollatorForSeq2Seq(
    tokenizer=tokenizer,
    model=model,
    padding=True
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    data_collator=data_collator,
    tokenizer=tokenizer,
)

print("✅ AGGRESSIVE full fine-tuning setup complete!")
print("🎯 This should produce HIGH-QUALITY Singapore financial responses!")


In [None]:
# Train with full fine-tuning!
print("🚀 Starting FULL FINE-TUNING for quality responses...")
print("Training ALL parameters - expect significant improvement!")

trainer.train()
trainer.save_model()

print("✅ Full fine-tuning completed!")
print("🎯 Model should now have proper Singapore financial knowledge!")


## 🔍 Quality Testing (Proper Singapore Responses Expected)
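
The test cell below scores a response by looking for domain keywords. For short keywords such as `mas`, substring matching and whole-word matching can disagree, so a minimal sketch of the difference may be useful. The helper name `contains_keyword` and the sample sentences are illustrative only.

```python
import re

def contains_keyword(text, keywords):
    """True if any keyword occurs in text as a whole word or phrase."""
    lowered = text.lower()
    return any(
        re.search(rf"\b{re.escape(kw)}\b", lowered) is not None
        for kw in keywords
    )

keywords = ["mas", "sgd", "aml"]

# Substring matching fires on "christmas" because it contains "mas"
substring_hit = any(kw in "Christmas bonus".lower() for kw in keywords)
print(substring_hit)                                           # True
print(contains_keyword("Christmas bonus", keywords))           # False
print(contains_keyword("MAS is the central bank.", keywords))  # True
```

Even with whole-word matching, a keyword hit only shows that a term appears, not that the answer is correct.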


In [None]:
print("🧪 QUALITY TESTING - Expecting Proper Singapore Financial Responses")
print("=" * 70)

# Quality test questions
quality_test_questions = [
    "What does MAS stand for?",
    "What currency does Singapore use?",
    "Who regulates banks in Singapore?",
    "What are the capital requirements for banks in Singapore?",
    "How should financial institutions implement AML measures?"
]

different_count = 0
quality_count = 0

# Get device info
device = next(model.parameters()).device
print(f"Model device: {device}")
original_model = original_model.to(device)

for i, question in enumerate(quality_test_questions, 1):
    print(f"\n{i}. Question: {question}")
    
    # Tokenize and move to device
    inputs = tokenizer(f"Answer this Singapore financial regulation question: {question}", return_tensors="pt")
    inputs = {k: v.to(device) for k, v in inputs.items()}
    
    # Base model response
    original_model.eval()
    with torch.no_grad():
        base_outputs = original_model.generate(
            **inputs, 
            max_new_tokens=100, 
            num_beams=3,
            do_sample=False
        )
    base_response = tokenizer.decode(base_outputs[0], skip_special_tokens=True)
    
    # Fine-tuned model response
    model.eval()  # Use eval mode for quality testing
    with torch.no_grad():
        trained_outputs = model.generate(
            **inputs, 
            max_new_tokens=100, 
            num_beams=3,
            do_sample=False  # Use deterministic generation for quality
        )
    trained_response = tokenizer.decode(trained_outputs[0], skip_special_tokens=True)
    
    print(f"   Base:        '{base_response}'")
    print(f"   Fine-tuned:  '{trained_response}'")
    
    # Check if responses are different
    if base_response != trained_response:
        print("   ✅ SUCCESS: Different responses!")
        different_count += 1
        
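        # Check for quality Singapore content; keywords are matched as whole words
        # so that short ones like 'mas' do not fire on words such as "Christmas"
        import re
        quality_keywords = [
            'monetary authority of singapore', 'mas', 'singapore dollar', 'sgd',
            'capital ratio', 'cet1', 'aml', 'suspicious transaction', 'pdpa',
            'payment services act', 'securities and futures act'
        ]
        
        response_lower = trained_response.lower()
        if any(re.search(rf"\b{re.escape(keyword)}\b", response_lower) for keyword in quality_keywords):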
            print("   🎯 EXCELLENT: High-quality Singapore financial content!")
            quality_count += 1
        else:
            print("   ⚠️ Different but low quality content")
    else:
        print("   ❌ Identical responses")

# Final quality assessment
total_questions = len(quality_test_questions)
success_rate = (different_count / total_questions) * 100
quality_rate = (quality_count / total_questions) * 100

print(f"\n" + "=" * 70)
print("🎯 QUALITY RESULTS")
print("=" * 70)
print(f"Different responses: {different_count}/{total_questions} ({success_rate:.1f}%)")
print(f"High-quality responses: {quality_count}/{total_questions} ({quality_rate:.1f}%)")

if quality_rate >= 80:
    print("\n🎉 OUTSTANDING: High-quality Singapore financial responses achieved!")
    print("✅ Full fine-tuning successfully learned Singapore regulations!")
    print("✅ PRODUCTION READY for Singapore financial Q&A!")
elif quality_rate >= 60:
    print("\n✅ GOOD: Significant improvement in response quality!")
    print("✅ Full fine-tuning approach works better than LoRA!")
elif quality_rate >= 40:
    print("\n⚠️ MODERATE: Some improvement but needs more training")
else:
    print("\n❌ POOR: Still low quality - need different approach")

print(f"\n💡 COMPARISON:")
print(f"   Large Dataset + LoRA: 100% different, 0% quality")
print(f"   Quality Dataset + Full FT: {success_rate:.1f}% different, {quality_rate:.1f}% quality")

print("\n✅ Quality testing completed!")


## 🎯 SUMMARY: Quality-Focused Full Fine-Tuning

### ✅ **What We Fixed:**

1. **Removed LoRA Entirely** - Full parameter training for maximum learning
2. **High-Quality Dataset** - 19 hand-written Singapore Q&A pairs + 60 instruction-variant copies (79 total)
3. **Flan-T5-small** - Smaller model, easier to fine-tune completely
4. **Aggressive Training** - 8 epochs, 3e-4 LR, smaller batches
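
To put the training settings in perspective, the sketch below works out how many optimizer steps they imply. It assumes a single device, no gradient accumulation, and an 80/20 split in which the eval share is rounded up; the function name `training_schedule` is hypothetical.

```python
import math

def training_schedule(n_examples, test_size, batch_size, epochs,
                      warmup_steps, eval_steps):
    """Derive step counts from dataset size and training settings."""
    # Assumption: the eval share is rounded up and train gets the remainder
    n_eval = math.ceil(n_examples * test_size)
    n_train = n_examples - n_eval
    steps_per_epoch = math.ceil(n_train / batch_size)
    total_steps = steps_per_epoch * epochs
    return {
        "n_train": n_train,
        "n_eval": n_eval,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "warmup_fraction": warmup_steps / total_steps,
        "evaluations": total_steps // eval_steps,
    }

schedule = training_schedule(
    n_examples=79, test_size=0.2, batch_size=2,
    epochs=8, warmup_steps=20, eval_steps=20,
)
print(schedule)
```

Under these assumptions the run has 63 training rows, 32 steps per epoch, and 256 steps in total, so the 20 warmup steps cover about 8% of training.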

### 🔬 **Key Differences from Failed Approach:**

| Aspect | Failed Large Dataset | Fixed Quality Approach |
|--------|---------------------|------------------------|
| **Method** | LoRA (r=32, alpha=64) | Full Fine-Tuning |
| **Model** | Flan-T5-base (~250M parameters) | Flan-T5-small (~80M parameters) |
| **Dataset** | 496 repetitive samples | 79 high-quality samples |
| **Training** | 3 epochs, 1e-4 LR | 8 epochs, 3e-4 LR |
| **Expected Quality** | Terrible (proven) | High Singapore expertise |
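
For a sense of scale behind the Method row, the sketch below estimates how many parameters a LoRA adapter with r=32 trains. The notebook does not say which modules were adapted, so the target choice here (query and value projections in every attention block of Flan-T5-base, each 768 x 768) is an assumption, and `lora_params` is an illustrative helper.

```python
def lora_params(d_in, d_out, rank):
    """Trainable parameters LoRA adds to one d_in x d_out weight matrix:
    A has shape (rank, d_in) and B has shape (d_out, rank)."""
    return rank * (d_in + d_out)

# Assumed, not stated in this notebook: adapters on the query and value
# projections of every attention block, with Flan-T5-base shapes
d_model = 768
attention_blocks = 12 + 12 + 12  # encoder self, decoder self, decoder cross
adapted_per_block = 2            # q and v
rank = 32

per_matrix = lora_params(d_model, d_model, rank)
total = per_matrix * adapted_per_block * attention_blocks
print(per_matrix, total)  # 49152 3538944
```

Under these assumptions the adapter trains about 3.5M weights, on the order of 1 to 2 percent of the base model, whereas full fine-tuning updates all of them.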

### 📊 **Expected Results:**

- **Proper MAS responses** instead of "Maternal Surgery of Melbourne"
- **Singapore Dollar (SGD)** instead of "Shanghai"
- **Coherent regulatory knowledge** instead of gibberish
- **Production-ready quality** for real-world deployment
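
A stricter check than keyword spotting is to compare each response with its reference answer. The sketch below uses a simple whitespace-token F1 score; this is a generic overlap measure offered as one option, not the metric used in this notebook, and `token_f1` is a hypothetical helper.

```python
from collections import Counter

def token_f1(prediction, reference):
    """Harmonic mean of token precision and recall, case-insensitive."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    if not pred or not ref:
        return 0.0
    # Multiset intersection counts each shared token at most as often
    # as it appears in both texts
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

reference = "MAS stands for the Monetary Authority of Singapore."
good = token_f1("Monetary Authority of Singapore.", reference)
bad = token_f1("Maternal Surgery of Melbourne", reference)
print(round(good, 3), round(bad, 3))
```

The nonsensical answer still scores above zero because it shares the word "of" with the reference, so a threshold should be set well above that floor.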

### 🚀 **This Should Work Because:**

1. **Full fine-tuning** updates every weight, so it can shift the base model's behavior further than a low-rank adapter
2. **Quality over quantity** - expert-crafted Singapore content
3. **Smaller model** is easier to reshape with limited data
4. **Aggressive training** (more epochs, higher learning rate) gives the model more passes to absorb the new answers
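
One way to check whether these points hold beyond memorized prompts is to hold out whole questions, so that no question in the eval split also appears in training. The sketch below does this on toy rows shaped like the notebook's dataset entries; `split_by_question` and its parameters are illustrative, not part of the notebook.

```python
import random

def split_by_question(rows, eval_fraction=0.2, seed=42):
    """Hold out whole questions so no eval question also appears in train."""
    questions = sorted({row["input"] for row in rows})
    rng = random.Random(seed)
    rng.shuffle(questions)
    n_eval = max(1, round(len(questions) * eval_fraction))
    eval_questions = set(questions[:n_eval])
    train = [r for r in rows if r["input"] not in eval_questions]
    held_out = [r for r in rows if r["input"] in eval_questions]
    return train, held_out

# Toy rows: 10 questions, each repeated under 3 instruction variants
rows = [
    {"instruction": f"variant-{v}", "input": f"question-{i}", "output": f"answer-{i}"}
    for i in range(10)
    for v in range(3)
]
train_rows, eval_rows = split_by_question(rows)
overlap = {r["input"] for r in train_rows} & {r["input"] for r in eval_rows}
print(len(train_rows), len(eval_rows), len(overlap))  # 24 6 0
```

Testing on held-out questions separates what the model has learned about the domain from what it has memorized.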

**Run this notebook to see if we achieve high-quality Singapore financial responses!** 🎯
