# Finance RAG Enhancement Documentation

## Overview
This notebook demonstrates the process of enhancing our finance chatbot's RAG (Retrieval-Augmented Generation) system by incorporating a Hugging Face finance dataset. The goal is to expand the knowledge base while maintaining strong data privacy and quality controls.

## Key Instances and Decisions Made

### 1. **Data Source Selection**
- **Instance**: Used `4DR1455/finance_questions` dataset from Hugging Face Hub
- **Rationale**: Contains structured Q&A pairs specifically for finance topics
- **Implementation**: Used `datasets.load_dataset()` for robust data loading

### 2. **Privacy and Security Guardrails**
- **Instance**: Implemented PII redaction patterns
  - SSN patterns (`\d{3}-\d{2}-\d{4}`)
  - Account-like numbers (`\d{9,12}`)
  - Credit card formats (`\d{4} \d{4} \d{4} \d{4}`)
- **Rationale**: Protect sensitive financial information
- **Impact**: Prevents accidental exposure of personal data

### 3. **Data Quality Controls**
- **Instance**: Multi-step cleaning pipeline
  - Auto-detection of question/answer columns
  - Removal of NA values and duplicates
  - Text normalization (whitespace, formatting)
  - Filtering out entries shorter than 10 characters
- **Rationale**: Ensure high-quality training data
- **Result**: Clean, consistent dataset for model training

### 4. **Export Strategy Adaptation**
- **Instance**: Removed RAG document export (originally planned for `docs/hf_finance/`)
- **Rationale**: User preference to keep retriever lightweight
- **Alternative**: Focus on JSONL training data export only
- **Benefit**: Simpler workflow, reduced storage overhead

### 5. **Training Data Format**
- **Instance**: Chat-style JSONL format with system/user/assistant roles
- **Structure**: 
  ```json
  {
    "messages": [
      {"role": "system", "content": "FinGuide disclaimer"},
      {"role": "user", "content": "question"},
      {"role": "assistant", "content": "answer"}
    ]
  }
  ```
- **Purpose**: Ready for fine-tuning language models
- **Compliance**: Includes educational disclaimer in system prompt

### 6. **Memory Management**
- **Instance**: Optional retriever cache refresh
- **Purpose**: Clear any cached documents to ensure fresh retrievals
- **Fallback**: Graceful error handling if cache refresh fails

## Workflow Summary

1. **Setup** → Import libraries, define PII patterns and text normalization
2. **Load** → Fetch finance Q&A dataset from Hugging Face
3. **Clean** → Apply quality filters, redact PII, normalize text
4. **Transform** → Prepare Q&A articles (in-memory only)
5. **Export** → Save chat-format JSONL for model training
6. **Test** → Verify retriever works with updated knowledge base

## Output Files
- `finance_chat_train.jsonl` → Training data for fine-tuning (saved to current directory)

## Safety Features
- ✅ PII redaction
- ✅ Educational disclaimers
- ✅ Data quality validation
- ✅ Graceful error handling
- ✅ No persistent storage of raw personal data

# Finance Retrieval Sandbox

This notebook demos the lightweight retriever used by the backend.

## Plan: Expand RAG with HF finance dataset

Steps:
- Install/use pandas + hf filesystem support
- Load dataset from `hf://datasets/4DR1455/finance_questions/finance_questions_dataset.json`
- Guardrails: redact potential PII patterns; avoid sensitive advice
- Clean: select Q/A fields, drop null/dupes, trim/normalize text
- Transform: export
  - RAG: write chunks to `docs/` to augment retriever
  - Training: save JSONL with chat-style messages for fine-tuning
- Refresh retriever cache and test

In [15]:
from knowledge import get_relevant_context

query = 'How should I budget and what is a good emergency fund?'
print(get_relevant_context(query))

[Source: budgeting_basics.txt]
Budgeting Basics (US)

50/30/20 rule (example guideline):
- 50% Needs (rent, utilities, groceries, minimum debt payments)
- 30% Wants (dining out, entertainment)
- 20% Savings/Debt payoff (emergency fund, investments, extra debt payments)

Emergency fund:
- Typical target: 3–6 months of essential expenses
- Keep liquid (high-yield savings)

Debt strategies:
- Avalanche (highest interest first)
- Snowball (smallest balance first)

Rule of thumb:
- Keep housing costs ~25–30% of gross income
- Track spending monthly; adjust categories as needed


In [16]:
from datasets import load_dataset

# Load finance Q&A dataset from Hugging Face Hub
# Dataset: 4DR1455/finance_questions - contains structured finance Q&A pairs
# split="train" - loads the training split of the dataset
ds = load_dataset("4DR1455/finance_questions", split="train")

# Convert HuggingFace Dataset object to pandas DataFrame for easier manipulation
# This allows us to use familiar pandas operations for data cleaning and analysis
df = ds.to_pandas()

# Display basic information about the loaded dataset
print("Rows:", len(df))  # Show number of Q&A pairs loaded
df.head(3)  # Preview first 3 rows to understand data structure

Rows: 53937


Unnamed: 0,instruction,input,output
0,How do dividend policies impact a company's fi...,,Dividend policies of a company can significant...
1,What are the potential challenges in forecasti...,,Forecasting interest expenses for financial st...
2,Why is the cost of equity included in the WACC...,,The Weighted Average Cost of Capital (WACC) is...


In [17]:
# Clean and normalize the dataset
# Auto-detect column names since different datasets may use different naming conventions
possible_q_cols = ["question", "Question", "prompt", "q"]  # Common question column names
possible_a_cols = ["answer", "Answer", "response", "a"]    # Common answer column names

# Find the actual question column in the dataset
q_col = next((c for c in possible_q_cols if c in df.columns), None)
# Find the actual answer column in the dataset
a_col = next((c for c in possible_a_cols if c in df.columns), None)

# Fallback strategy if standard column names aren't found
if q_col is None or a_col is None:
    print("Warning: Could not locate standard question/answer columns. Columns:", list(df.columns))
    # Try to infer Q&A columns by finding text columns (object dtype)
    obj_cols = [c for c in df.columns if df[c].dtype == 'object']
    if len(obj_cols) >= 2:
        q_col, a_col = obj_cols[:2]  # Assume first two text columns are Q&A
    else:
        raise ValueError("No suitable text columns for Q/A found.")

# Create working dataset with standardized column names
work = df[[q_col, a_col]].copy()
work.columns = ["question", "answer"]  # Standardize column names

# Data quality filtering
# Remove rows with missing question or answer data
work = work.dropna(subset=["question", "answer"]).drop_duplicates()

# Apply text processing functions to clean the data
# normalize() - standardizes whitespace and formatting
# redact_pii() - removes personally identifiable information
work["question"] = work["question"].map(normalize).map(redact_pii)
work["answer"] = work["answer"].map(normalize).map(redact_pii)

# Quality filter: Remove entries that are too short to be meaningful
# Minimum 10 characters ensures we have substantial content for training
work = work[(work["question"].str.len() > 10) & (work["answer"].str.len() > 10)]

# Display results of the cleaning process
print("Clean rows:", len(work))  # Show how many rows remain after cleaning
work.head(3)  # Preview cleaned data structure

Clean rows: 0
Clean rows: 0


Unnamed: 0,question,answer


In [18]:
# Transform/Export

# 1) Prepare Q&A articles for potential future use (keeping in memory only)
# This creates formatted articles that could be used for RAG document storage
articles = []
for i, row in work.iterrows():
    q = row['question']  # Extract question text
    a = row['answer']    # Extract answer text
    
    # Format as a structured article with clear Q&A separation
    # textwrap.dedent() removes common leading whitespace for clean formatting
    article = textwrap.dedent(f"""
    Finance Q&A Article

    Question:
    {q}

    Answer:
    {a}
    """).strip()
    articles.append(article)

print(f"Prepared {len(articles)} Q&A articles (not saved to disk)")

# 2) Export training data in JSONL format for model fine-tuning
# JSONL (JSON Lines) format: one JSON object per line, ideal for ML training
train_jsonl = Path('finance_chat_train.jsonl')

with train_jsonl.open('w', encoding='utf-8') as f:
    # Convert each Q&A pair into chat format with roles
    for row in work.itertuples(index=False):
        # Structure follows OpenAI chat completion format
        record = {
            "messages": [
                {
                    "role": "system", 
                    "content": "You are FinGuide, an educational finance assistant. Not financial, legal, or tax advice."
                },
                {
                    "role": "user", 
                    "content": row.question  # User's financial question
                },
                {
                    "role": "assistant", 
                    "content": row.answer    # AI's educational response
                }
            ]
        }
        # Write each record as a single line (JSONL format)
        # ensure_ascii=False preserves unicode characters properly
        f.write(json.dumps(record, ensure_ascii=False) + "\n")

print("Exported training JSONL:", train_jsonl.absolute())

# 3) Refresh retriever cache to ensure fresh document loading
# This is optional but ensures the retriever uses the most current document set
try:
    from knowledge import _get_docs
    # Clear the global document cache by setting it to None
    _get_docs.__globals__["_DOCS_CACHE"] = None  # invalidate cache
    _get_docs()  # Trigger reload of documents
    print("Retriever cache refreshed.")
except Exception as e:
    # Graceful handling if cache refresh fails (system still works)
    print("Note: retriever cache refresh skipped:", e)

Prepared 0 Q&A articles (not saved to disk)
Exported training JSONL: c:\Users\ryanm\Desktop\Projects\CHATBOT\backend\LLM_RAG_Finance\finance_chat_train.jsonl
Retriever cache refreshed.


In [19]:
# Quick test after augmentation
# Test the retriever system to verify it's working with the updated knowledge base
from knowledge import get_relevant_context

# Query the retriever with a common finance question
# This tests both the basic retriever functionality and any new content integration
test_query = "What is an emergency fund and how much should it be?"
result = get_relevant_context(test_query)

# Display the retrieved context to verify quality and relevance
print("Retrieved context for emergency fund question:")
print("=" * 50)
print(result)

# Purpose of this test:
# - Verify retriever is functioning correctly
# - Check if relevant finance content is being found
# - Validate that the knowledge base integration is working
# - Ensure the system can provide contextual information for user queries

Retrieved context for emergency fund question:
[Source: budgeting_basics.txt]
Budgeting Basics (US)

50/30/20 rule (example guideline):
- 50% Needs (rent, utilities, groceries, minimum debt payments)
- 30% Wants (dining out, entertainment)
- 20% Savings/Debt payoff (emergency fund, investments, extra debt payments)

Emergency fund:
- Typical target: 3–6 months of essential expenses
- Keep liquid (high-yield savings)

Debt strategies:
- Avalanche (highest interest first)
- Snowball (smallest balance first)

Rule of thumb:
- Keep housing costs ~25–30% of gross income
- Track spending monthly; adjust categories as needed
