# Domain-Specific Financial Assistant via LLM Fine-Tuning

This notebook fine-tunes **Gemma-2B** using **QLoRA** (4-bit quantization) on the Financial-QA-10k dataset to create a domain-specific assistant for financial question answering.

## Overview
- **Model**: Google Gemma-2B
- **Method**: QLoRA (4-bit quantization with LoRA adapters)
- **Dataset**: Financial-QA-10k (7,000 samples from SEC 10-K filings)
- **Format**: Alpaca instruction-response template
- **Hardware**: Optimized for Kaggle T4/P100 GPUs

## Sections
1. Environment Setup
2. Data Preprocessing
3. Model Configuration
4. Training
5. Inference & Evaluation

---
## 1. Environment Setup

Install required dependencies and set up the environment.

In [66]:
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

In [67]:
# Install required packages
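# Quote each requirement: an unquoted ">=" is treated by the shell as an output redirect
!pip install -q "transformers>=4.40.0" \
    "peft>=0.10.0" \
    "datasets>=2.18.0" \
    "accelerate>=0.27.0" \
    "bitsandbytes>=0.43.0" \
    "trl>=0.8.0" \
    "sentencepiece>=0.2.0" \
    "evaluate>=0.4.1" \
    "rouge-score>=0.1.2" \
    "scikit-learn>=1.4.0"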

print("✓ All packages installed successfully!")

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


✓ All packages installed successfully!
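
As a quick sanity check after installation, the installed versions can be listed next to the packages requested above. The sketch below uses only the standard library; `installed_version` and `report_versions` are hypothetical helper names, and the code only reports versions rather than comparing them against the minimums.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names as passed to pip in the install cell above.
REQUIRED = [
    "transformers", "peft", "datasets", "accelerate", "bitsandbytes",
    "trl", "sentencepiece", "evaluate", "rouge-score", "scikit-learn",
]


def installed_version(dist_name):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


def report_versions(names=REQUIRED):
    """Map each distribution name to its installed version (or None)."""
    return {name: installed_version(name) for name in names}


for name, ver in report_versions().items():
    print(f"{name:15s} {ver or 'NOT INSTALLED'}")
```

A `NOT INSTALLED` entry here usually means the install cell did not do what it appeared to, even if it printed a success message.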


In [68]:
# Import libraries
import os
import json
import re
import random
from pathlib import Path
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Suppress tokenizers parallelism warning
os.environ["TOKENIZERS_PARALLELISM"] = "false"

import torch
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    BitsAndBytesConfig,
    TrainingArguments,
    pipeline
)
from peft import (
    LoraConfig,
    PeftModel,
    prepare_model_for_kbit_training,
    get_peft_model
)
from datasets import Dataset, DatasetDict
from trl import SFTTrainer
import evaluate
from tqdm.auto import tqdm

# Set random seeds for reproducibility
random.seed(42)
np.random.seed(42)
torch.manual_seed(42)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(42)

# Check GPU availability
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")

Using device: cuda
GPU: Tesla T4
VRAM: 15.64 GB


In [69]:
# Optional: Login to Hugging Face (for model uploads)
# Set HUGGINGFACE_TOKEN environment variable or use notebook secrets
import os
from huggingface_hub import login

if 'HUGGINGFACE_TOKEN' in os.environ:
    login(token=os.environ['HUGGINGFACE_TOKEN'])
    print("✓ Logged in to Hugging Face")
else:
    print("⚠ HUGGINGFACE_TOKEN not set - skipping HF login")

---
## 2. Data Preprocessing

Load and preprocess the Financial-QA-10k dataset into Alpaca format.
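
To make the target format concrete, here is a minimal sketch that renders one question/context/answer triple with the Alpaca template. The sample values are the Delta Air Lines example printed at the end of this section; `TEMPLATE` and `render_alpaca` are hypothetical names used only for this illustration.

```python
# Same wording as the Alpaca template used in this notebook.
TEMPLATE = (
    "Below is an instruction that describes a task, paired with an input "
    "that provides further context. Write a response that appropriately "
    "completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n{output}"
)


def render_alpaca(question, context, answer=""):
    """Fill the template; leave `answer` empty to build an inference prompt."""
    return TEMPLATE.format(instruction=question, input=context, output=answer)


training_text = render_alpaca(
    question="What is the net amount of Delta Air Lines' total assets in 2023 "
             "according to their financial statements?",
    context="Total assets for Delta Air Lines in 2023 are reported as "
            "$73,644 million in the financial statements.",
    answer="$73,644 million",
)
print(training_text)
```

Training examples include the answer after `### Response:`, while inference prompts stop right after that header so the model generates the answer itself.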

In [70]:
# Configuration
RAW_DATA_PATH = "../dataset/Financial-QA-10k.csv"  
MODEL_NAME = "google/gemma-2b"
MAX_SAMPLES = 7000
MAX_SEQ_LENGTH = 256
TRAIN_RATIO = 0.90
VAL_RATIO = 0.05
TEST_RATIO = 0.05

# Alpaca prompt template
ALPACA_TEMPLATE = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Input:
{input}

### Response:
{output}"""

print("Configuration loaded successfully!")

Configuration loaded successfully!


In [71]:
# Helper functions
def normalize_text(text):
    """Normalize text by cleaning whitespace and standardizing formatting."""
    if pd.isna(text) or text is None:
        return ""
    
    text = str(text).strip()
    text = re.sub(r'\s+', ' ', text)  # Multiple spaces to single
    text = re.sub(r'\$\s+', '$', text)  # Remove space after dollar sign
    text = ''.join(char for char in text if ord(char) >= 32 or char == '\n')
    
    return text


def create_alpaca_format(row):
    """Convert a dataset row into Alpaca format."""
    return {
        "instruction": normalize_text(row['question']),
        "input": normalize_text(row['context']),
        "output": normalize_text(row['answer']),
        "ticker": row['ticker'],
        "filing": row['filing']
    }


def truncate_context(text, tokenizer, max_tokens=1500):
    """Truncate context to fit within token limit."""
    tokens = tokenizer.encode(text, add_special_tokens=False)
    
    if len(tokens) <= max_tokens:
        return text
    
    truncated_tokens = tokens[:max_tokens]
    truncated_text = tokenizer.decode(truncated_tokens, skip_special_tokens=True)
    
    # Try to end at sentence boundary
    sentences = truncated_text.split('. ')
    if len(sentences) > 1:
        truncated_text = '. '.join(sentences[:-1]) + '.'
    
    return truncated_text

print("Helper functions defined!")

Helper functions defined!


In [72]:
# Load dataset
print("Loading dataset...")
df = pd.read_csv(RAW_DATA_PATH)
print(f"✓ Loaded {len(df)} examples")
print(f"\nCompanies: {df['ticker'].unique().tolist()}")
print(f"\nCompany distribution:\n{df['ticker'].value_counts()}")

Loading dataset...
✓ Loaded 7000 examples

Companies: ['NVDA', 'AAPL', 'TSLA', 'LULU', 'PG', 'COST', 'ABNB', 'MSFT', 'BRK-A', 'META', 'AXP', 'PTON', 'SBUX', 'NKE', 'PLTR', 'AMZN', 'NFLX', 'GOOGL', 'ABBV', 'V', 'GME', 'AMC', 'CRM', 'LLY', 'AVGO', 'UNH', 'JNJ', 'HD', 'WMT', 'AMD', 'CVX', 'BAC', 'KO', 'T', 'AZO', 'CAT', 'SCHW', 'CMG', 'CB', 'CMCSA', 'CVS', 'DVA', 'DAL', 'DLTR', 'EBAY', 'EA', 'ENPH', 'EFX', 'ETSY', 'FDX', 'F', 'GRMN', 'GIS', 'GM', 'GILD', 'GS', 'HAS', 'HSY', 'HPE', 'HLT', 'HPQ', 'HUM', 'IBM', 'ICE', 'INTU', 'IRM', 'JPM', 'KR', 'LVS']

Company distribution:
ticker
JNJ     200
AAPL    100
TSLA    100
LULU    100
NVDA    100
       ... 
INTU    100
IRM     100
JPM     100
KR      100
LVS     100
Name: count, Length: 69, dtype: int64


In [73]:
# Sample data (stratified by company ticker)
print(f"\nSampling {MAX_SAMPLES} examples (stratified by ticker)...")

if len(df) > MAX_SAMPLES:
    # Keep the original index for now so already-sampled rows can be excluded below
    df_sampled = df.groupby('ticker', group_keys=False).apply(
        lambda x: x.sample(frac=MAX_SAMPLES/len(df), random_state=42)
    )
    
    # Adjust to exact count
    if len(df_sampled) < MAX_SAMPLES:
        additional = df.drop(df_sampled.index).sample(
            n=MAX_SAMPLES - len(df_sampled), random_state=42
        )
        df_sampled = pd.concat([df_sampled, additional])
    elif len(df_sampled) > MAX_SAMPLES:
        df_sampled = df_sampled.sample(n=MAX_SAMPLES, random_state=42)
    df_sampled = df_sampled.reset_index(drop=True)
else:
    df_sampled = df.copy()

print(f"✓ Selected {len(df_sampled)} examples")
print(f"\nSampled distribution:\n{df_sampled['ticker'].value_counts()}")


Sampling 7000 examples (stratified by ticker)...
✓ Selected 7000 examples

Sampled distribution:
ticker
JNJ     200
AAPL    100
TSLA    100
LULU    100
NVDA    100
       ... 
INTU    100
IRM     100
JPM     100
KR      100
LVS     100
Name: count, Length: 69, dtype: int64


In [74]:
# Load tokenizer
print(f"\nLoading tokenizer: {MODEL_NAME}...")
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

# Add padding token if not present
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
    tokenizer.pad_token_id = tokenizer.eos_token_id

print(f"✓ Tokenizer loaded (vocab size: {tokenizer.vocab_size})")


Loading tokenizer: google/gemma-2b...
✓ Tokenizer loaded (vocab size: 256000)


In [75]:
# Convert to Alpaca format and truncate if needed
print("\nConverting to Alpaca format...")
formatted_data = []
truncated_count = 0

for idx, row in tqdm(df_sampled.iterrows(), total=len(df_sampled), desc="Processing"):
    example = create_alpaca_format(row)
    
    # Check sequence length
    full_prompt = ALPACA_TEMPLATE.format(
        instruction=example['instruction'],
        input=example['input'],
        output=example['output']
    )
    token_count = len(tokenizer.encode(full_prompt, add_special_tokens=True))
    
    # Truncate if needed
    if token_count > MAX_SEQ_LENGTH:
        overhead = len(tokenizer.encode(
            ALPACA_TEMPLATE.format(
                instruction=example['instruction'],
                input="",
                output=example['output']
            ),
            add_special_tokens=True
        ))
        
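        # Clamp at zero: a negative limit would slice tokens off the end instead of capping the length
        max_context_tokens = max(MAX_SEQ_LENGTH - overhead - 50, 0)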
        example['input'] = truncate_context(example['input'], tokenizer, max_context_tokens)
        truncated_count += 1
    
    formatted_data.append(example)

print(f"\n✓ Formatted {len(formatted_data)} examples")
print(f"✓ Truncated {truncated_count} contexts to fit within {MAX_SEQ_LENGTH} tokens")


Converting to Alpaca format...


✓ Formatted 7000 examples
✓ Truncated 147 contexts to fit within 256 tokens


In [76]:
# Analyze sequence lengths
print("\nAnalyzing sequence lengths...")
lengths = []

for example in tqdm(formatted_data, desc="Measuring"):
    prompt = ALPACA_TEMPLATE.format(
        instruction=example['instruction'],
        input=example['input'],
        output=example['output']
    )
    tokens = tokenizer.encode(prompt, add_special_tokens=True)
    lengths.append(len(tokens))

print(f"\n✓ Token length statistics:")
print(f"   Min: {min(lengths)}")
print(f"   Max: {max(lengths)}")
print(f"   Mean: {np.mean(lengths):.1f}")
print(f"   Median: {np.median(lengths):.1f}")
print(f"   95th percentile: {np.percentile(lengths, 95):.1f}")
print(f"   99th percentile: {np.percentile(lengths, 99):.1f}")


Analyzing sequence lengths...


✓ Token length statistics:
   Min: 41
   Max: 390
   Mean: 134.5
   Median: 128.0
   95th percentile: 207.0
   99th percentile: 238.0
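
The statistics above report a maximum of 390 tokens, which is above `MAX_SEQ_LENGTH` (256) even after context truncation, so it is worth counting how many examples still exceed the limit. A minimal sketch follows; `over_limit` is a hypothetical helper, and the list of lengths is made up for illustration (in the notebook, pass the real `lengths` list).

```python
def over_limit(lengths, limit):
    """Return (count, fraction) of sequences longer than `limit` tokens."""
    count = sum(1 for n in lengths if n > limit)
    fraction = count / len(lengths) if lengths else 0.0
    return count, fraction


# Illustrative values only, not the notebook's actual distribution.
example_lengths = [41, 128, 207, 238, 256, 390]
count, fraction = over_limit(example_lengths, limit=256)
print(f"{count} of {len(example_lengths)} examples exceed 256 tokens ({fraction:.1%})")
```

Sequences that remain over the limit are typically cut off by the trainer, and with this template the cut would fall in the response at the end of the prompt.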


In [77]:
# Split data into train/val/test
print(f"\nSplitting data (train: {TRAIN_RATIO:.0%}, val: {VAL_RATIO:.0%}, test: {TEST_RATIO:.0%})...")

# First split: train + (val + test)
train_data, temp_data = train_test_split(
    formatted_data,
    train_size=TRAIN_RATIO,
    random_state=42,
    stratify=[d['ticker'] for d in formatted_data]
)

# Second split: val and test
val_ratio_adjusted = VAL_RATIO / (VAL_RATIO + TEST_RATIO)
val_data, test_data = train_test_split(
    temp_data,
    train_size=val_ratio_adjusted,
    random_state=42,
    stratify=[d['ticker'] for d in temp_data]
)

print(f"✓ Train: {len(train_data)} examples")
print(f"✓ Validation: {len(val_data)} examples")
print(f"✓ Test: {len(test_data)} examples")


Splitting data (train: 90%, val: 5%, test: 5%)...
✓ Train: 6300 examples
✓ Validation: 350 examples
✓ Test: 350 examples
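
The second split uses `VAL_RATIO / (VAL_RATIO + TEST_RATIO)` because it operates on the 10% remainder rather than on the full dataset. The arithmetic can be checked without scikit-learn. This sketch only reproduces the expected sizes with plain rounding; `train_test_split` may round differently when the sizes do not divide evenly, and `expected_split_sizes` is a hypothetical helper name.

```python
def expected_split_sizes(n, train_ratio, val_ratio, test_ratio):
    """Approximate sizes produced by a two-stage train/val/test split."""
    n_train = round(n * train_ratio)
    n_remaining = n - n_train
    # Share of the remainder that goes to validation in the second split.
    val_share = val_ratio / (val_ratio + test_ratio)
    n_val = round(n_remaining * val_share)
    n_test = n_remaining - n_val
    return n_train, n_val, n_test


n_train, n_val, n_test = expected_split_sizes(7000, 0.90, 0.05, 0.05)
print(f"train={n_train}, val={n_val}, test={n_test}")
```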


In [78]:
# Display sample examples
print("\n" + "="*60)
print("SAMPLE TRAINING EXAMPLE")
print("="*60)
sample = train_data[0]
print(f"\nTicker: {sample['ticker']}")
print(f"\nInstruction:\n{sample['instruction']}")
print(f"\nInput (context):\n{sample['input'][:300]}...")  # Show first 300 chars
print(f"\nOutput:\n{sample['output']}")


SAMPLE TRAINING EXAMPLE

Ticker: DAL

Instruction:
What is the net amount of Delta Air Lines' total assets in 2023 according to their financial statements?

Input (context):
Total assets for Delta Air Lines in 2023 are reported as $73,644 million in the financial statements....

Output:
$73,644 million


---
## 3. Model Configuration

Set up QLoRA configuration and load the base model with 4-bit quantization.

In [79]:
# QLoRA Configuration
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)

print("QLoRA (4-bit quantization) configuration:")
print(f"  - Quantization type: NF4 (4-bit)")
print(f"  - Compute dtype: float16")
print(f"  - Double quantization: Enabled")

QLoRA (4-bit quantization) configuration:
  - Quantization type: NF4 (4-bit)
  - Compute dtype: float16
  - Double quantization: Enabled


In [80]:
# LoRA Configuration
peft_config = LoraConfig(
    r=8,  # LoRA rank
    lora_alpha=16,  # LoRA scaling factor
    lora_dropout=0.05,  # Dropout probability
    target_modules=["q_proj", "v_proj"],  # Attention query and value projections
    bias="none",
    task_type="CAUSAL_LM",
)

print("\nLoRA configuration:")
print(f"  - Rank (r): {peft_config.r}")
print(f"  - Alpha: {peft_config.lora_alpha}")
print(f"  - Dropout: {peft_config.lora_dropout}")
print(f"  - Target modules: {peft_config.target_modules}")
print(f"  - Task type: {peft_config.task_type}")


LoRA configuration:
  - Rank (r): 8
  - Alpha: 16
  - Dropout: 0.05
  - Target modules: {'q_proj', 'v_proj'}
  - Task type: CAUSAL_LM


In [93]:
# Load base model with quantization
print(f"\nLoading model: {MODEL_NAME}...")
print("This may take a few minutes...\n")

model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    quantization_config=bnb_config,
    device_map={"": 0},           # Pin to first GPU to avoid DataParallel
    torch_dtype=torch.float16,
    trust_remote_code=True,
)

# Prepare model for k-bit training
model = prepare_model_for_kbit_training(model)

model.gradient_checkpointing_enable()
model.config.use_cache = False
model.enable_input_require_grads()
# Apply LoRA
model = get_peft_model(model, peft_config)

print("✓ Model loaded and LoRA adapters applied!")

# Print trainable parameters
def print_trainable_parameters(model):
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(f"\nTrainable params: {trainable_params:,} || "
          f"All params: {all_param:,} || "
          f"Trainable%: {100 * trainable_params / all_param:.2f}%")

print_trainable_parameters(model)


Loading model: google/gemma-2b...
This may take a few minutes...



✓ Model loaded and LoRA adapters applied!

Trainable params: 921,600 || All params: 1,516,189,696 || Trainable%: 0.06%
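
The trainable-parameter count reported above can be reproduced by hand. A LoRA adapter on a linear layer with `d_in` inputs and `d_out` outputs adds two matrices, `A` (`r × d_in`) and `B` (`d_out × r`), for `r * (d_in + d_out)` parameters in total. The sketch below assumes the published Gemma-2B shapes (hidden size 2048, 18 layers, 8 query heads of dimension 256, 1 key/value head); these values are not printed anywhere in this notebook, so treat them as an assumption to verify against `model.config`.

```python
def lora_params(d_in, d_out, r):
    """Parameters added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


# Assumed Gemma-2B architecture values (check model.config before relying on them).
HIDDEN_SIZE = 2048
NUM_LAYERS = 18
NUM_QUERY_HEADS = 8
NUM_KV_HEADS = 1
HEAD_DIM = 256
RANK = 8  # matches r=8 in the LoRA configuration above

q_proj = lora_params(HIDDEN_SIZE, NUM_QUERY_HEADS * HEAD_DIM, RANK)
v_proj = lora_params(HIDDEN_SIZE, NUM_KV_HEADS * HEAD_DIM, RANK)
total = NUM_LAYERS * (q_proj + v_proj)

print(f"q_proj adapter per layer: {q_proj:,}")
print(f"v_proj adapter per layer: {v_proj:,}")
print(f"Total trainable params:   {total:,}")
```

Under these assumptions the total comes to 921,600, which matches the figure printed by `print_trainable_parameters`.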


---
## 4. Training

Prepare datasets and train the model using the SFTTrainer.

In [94]:
# Prepare datasets
def format_instruction(example):
    """Format example into Alpaca template for training."""
    text = ALPACA_TEMPLATE.format(
        instruction=example['instruction'],
        input=example['input'],
        output=example['output']
    )
    return {"text": text}

# Create HuggingFace datasets
train_dataset = Dataset.from_list([{k: v for k, v in d.items() if k in ['instruction', 'input', 'output']} for d in train_data])
val_dataset = Dataset.from_list([{k: v for k, v in d.items() if k in ['instruction', 'input', 'output']} for d in val_data])
test_dataset = Dataset.from_list([{k: v for k, v in d.items() if k in ['instruction', 'input', 'output']} for d in test_data])

# Apply formatting
train_dataset = train_dataset.map(format_instruction)
val_dataset = val_dataset.map(format_instruction)
test_dataset = test_dataset.map(format_instruction)

print(f"✓ Prepared training dataset: {len(train_dataset)} examples")
print(f"✓ Prepared validation dataset: {len(val_dataset)} examples")
print(f"✓ Prepared test dataset: {len(test_dataset)} examples")

✓ Prepared training dataset: 6300 examples
✓ Prepared validation dataset: 350 examples
✓ Prepared test dataset: 350 examples


In [88]:
# Training arguments
output_dir = "../outputs/"
training_args = TrainingArguments(
    output_dir=output_dir,
    num_train_epochs=3,  # Increased from 2 to 3
    per_device_train_batch_size=2,  # Increased from 1 to 2 if GPU allows
    gradient_accumulation_steps=8,  # Adjusted to keep effective batch = 16
    per_device_eval_batch_size=2,
    learning_rate=1e-4,  # Reduced from 2e-4 - lower learning rate for stability
    lr_scheduler_type="cosine",
    warmup_steps=100,
    logging_steps=10,
    save_steps=500,
    eval_steps=500,
    eval_strategy="steps",  # Changed to "steps" to monitor training
    save_strategy="steps",
    save_total_limit=2,  # Keep 2 checkpoints to compare
    fp16=False,  # Mixed precision is off here; the T4 supports fp16 but not bf16
    bf16=False,
    gradient_checkpointing_kwargs={"use_reentrant": False},
    optim="paged_adamw_8bit",
    max_grad_norm=1.0,  # CRITICAL FIX: Enable gradient clipping
    report_to="none",
    metric_for_best_model="eval_loss",
    greater_is_better=False,
    load_best_model_at_end=True,  # Load best checkpoint at end
)

print("Training configuration:")
print(f"  - Epochs: {training_args.num_train_epochs}")
print(f"  - Batch size (per device): {training_args.per_device_train_batch_size}")
print(f"  - Gradient accumulation steps: {training_args.gradient_accumulation_steps}")
print(f"  - Effective batch size: {training_args.per_device_train_batch_size * training_args.gradient_accumulation_steps}")
print(f"  - Learning rate: {training_args.learning_rate}")
print(f"  - LR scheduler: {training_args.lr_scheduler_type}")
print(f"  - Warmup steps: {training_args.warmup_steps}")
print(f"  - Max grad norm: {training_args.max_grad_norm}")
print(f"  - Optimizer: {training_args.optim}")
print(f"  - FP16: {training_args.fp16}")
print(f"  - Evaluation strategy: {training_args.eval_strategy}")
print(f"  - Gradient checkpointing: {training_args.gradient_checkpointing}")

Training configuration:
  - Epochs: 3
  - Batch size (per device): 2
  - Gradient accumulation steps: 8
  - Effective batch size: 16
  - Learning rate: 0.0001
  - LR scheduler: SchedulerType.COSINE
  - Warmup steps: 100
  - Max grad norm: 1.0
  - Optimizer: OptimizerNames.PAGED_ADAMW_8BIT
  - FP16: False
  - Evaluation strategy: IntervalStrategy.STEPS
  - Gradient checkpointing: False
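
With these settings, the length of the run follows from simple arithmetic, which helps when choosing `warmup_steps` and `eval_steps`. This sketch assumes a single GPU and rounds partial accumulation windows up; the step count the trainer reports may differ by one per epoch depending on the `transformers` version. `optimizer_steps` is a hypothetical helper name.

```python
import math


def optimizer_steps(n_examples, per_device_batch, grad_accum, epochs):
    """Return (steps per epoch, total optimizer steps) for a single-GPU run."""
    batches_per_epoch = math.ceil(n_examples / per_device_batch)
    steps_per_epoch = math.ceil(batches_per_epoch / grad_accum)
    return steps_per_epoch, steps_per_epoch * epochs


steps_per_epoch, total_steps = optimizer_steps(
    n_examples=6300, per_device_batch=2, grad_accum=8, epochs=3
)
warmup_share = 100 / total_steps       # warmup_steps=100
evals_during_run = total_steps // 500  # eval_steps=500

print(f"Steps per epoch: {steps_per_epoch}")
print(f"Total steps: {total_steps}")
print(f"Warmup share: {warmup_share:.1%}")
print(f"Evaluations during training: {evals_during_run}")
```

With roughly 1,180 optimizer steps, `eval_steps=500` gives only two validation checkpoints, so a smaller interval would give a finer view of the validation loss.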


In [97]:

# Check TrainingArguments configuration
print("--- TrainingArguments Precision Settings ---")
print(f"fp16 (Half Precision): {training_args.fp16}")
print(f"bf16 (Bfloat16): {training_args.bf16}")

# Check the actual model data type
print("\n--- Model Hardware Precision ---")
print(f"Model primary dtype: {model.dtype}")

# If using QLoRA, check the 4-bit compute dtype
if hasattr(model, "config") and hasattr(model.config, "quantization_config"):
    compute_dtype = model.config.quantization_config.bnb_4bit_compute_dtype
    print(f"Quantization compute_dtype: {compute_dtype}")

# Verify CUDA compatibility for the selected precision
if training_args.bf16:
    cuda_supports_bf16 = torch.cuda.is_bf16_supported()
    print(f"\n--- Hardware Compatibility ---")
    print(f"GPU supports bf16: {cuda_supports_bf16}")
    if not cuda_supports_bf16:
        print("WARNING: You are using bf16 on a GPU that does not support it (like Tesla T4). This will cause errors.")

if training_args.fp16:
    print(f"\n--- Hardware Compatibility ---")
    print(f"FP16 enabled: Compatible with most GPUs including T4")

--- TrainingArguments Precision Settings ---
fp16 (Half Precision): False
bf16 (Bfloat16): False

--- Model Hardware Precision ---
Model primary dtype: torch.float32
Quantization compute_dtype: torch.float16
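
The T4 is a Turing GPU (compute capability 7.5) without native bf16 support, which is generally available from Ampere (8.0) onward. One way to keep the precision settings consistent is to derive them all from the compute capability. The sketch below is a hypothetical helper, not part of any library; in the notebook the capability tuple would come from `torch.cuda.get_device_capability(0)`.

```python
def pick_precision(capability):
    """Choose precision settings from a CUDA (major, minor) compute capability.

    Returns the trainer flags plus the dtype name to reuse for the model and
    for the 4-bit compute dtype, so all three stay in agreement.
    """
    major, _minor = capability
    if major >= 8:
        # Ampere or newer: native bf16 support.
        return {"bf16": True, "fp16": False, "dtype_name": "bfloat16"}
    # Older GPUs such as the T4 (7.5) or P100 (6.0): fall back to fp16.
    return {"bf16": False, "fp16": True, "dtype_name": "float16"}


for name, capability in [("Tesla T4", (7, 5)), ("Tesla P100", (6, 0)), ("A100", (8, 0))]:
    print(name, pick_precision(capability))
```

Using the same dtype for `bnb_4bit_compute_dtype`, the model's `torch_dtype`, and the trainer's `fp16`/`bf16` flags is a common way to rule out dtype mismatches when debugging precision errors.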


In [95]:
# Clear GPU cache and reset accelerator state before training
import gc
from accelerate import Accelerator

# Clear GPU memory
gc.collect()
torch.cuda.empty_cache()

# Reset accelerator state if it exists (fixes retraining in notebooks)
try:
    from accelerate.state import AcceleratorState
    if AcceleratorState._shared_state != {}:
        AcceleratorState._reset_state()
        print("✓ Accelerator state reset")
except Exception:
    # Private accelerate API; skip the reset if it is unavailable in this version
    pass

# Reinitialize trainer to ensure clean state
trainer = SFTTrainer(
    model=model,                   
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    processing_class=tokenizer,     
    args=training_args,
)
print("✓ Trainer reinitialized with clean accelerator state")

✓ Trainer reinitialized with clean accelerator state


In [96]:

# Start training
print("\n" + "="*60)
print("STARTING TRAINING")
print("="*60)

# Train the model
trainer.train()

print("\n" + "="*60)
print("✓ TRAINING COMPLETE!")
print("="*60)


STARTING TRAINING


NotImplementedError: "_amp_foreach_non_finite_check_and_unscale_cuda" not implemented for 'BFloat16'

In [None]:
# Check training history to diagnose issues
print("\n" + "="*60)
print("TRAINING DIAGNOSTICS")
print("="*60)

if hasattr(trainer.state, 'log_history') and len(trainer.state.log_history) > 0:
    # Extract loss values
    train_losses = [log['loss'] for log in trainer.state.log_history if 'loss' in log]
    eval_losses = [log['eval_loss'] for log in trainer.state.log_history if 'eval_loss' in log]
    
    print(f"\nTraining steps completed: {trainer.state.global_step}")
    print(f"\nTraining Loss:")
    print(f"  - Initial: {train_losses[0]:.4f}")
    print(f"  - Final: {train_losses[-1]:.4f}")
    print(f"  - Change: {train_losses[-1] - train_losses[0]:.4f}")
    
    if eval_losses:
        print(f"\nValidation Loss:")
        print(f"  - Best: {min(eval_losses):.4f}")
        print(f"  - Final: {eval_losses[-1]:.4f}")
    
    # Warning if loss didn't decrease
    if len(train_losses) > 1 and train_losses[-1] >= train_losses[0]:
        print("\n⚠️  WARNING: Training loss did NOT decrease!")
        print("   This indicates training failed. Check hyperparameters.")
    else:
        print("\n✓ Training loss decreased successfully")
else:
    print("⚠️  No training history available")


TRAINING DIAGNOSTICS
⚠️  No training history available


In [None]:
# Save the final model
final_model_path = "../models/final/gemma-2b-financial-lora"
os.makedirs(final_model_path, exist_ok=True)

print(f"Saving model to: {final_model_path}")
trainer.model.save_pretrained(final_model_path)
tokenizer.save_pretrained(final_model_path)

print(f"\n✓ Model saved successfully!")
print(f"✓ Save location: {final_model_path}")

# List saved files
saved_files = os.listdir(final_model_path)
print(f"\nSaved files ({len(saved_files)}):")
for file in saved_files[:10]:  # Show first 10 files
    print(f"  - {file}")
if len(saved_files) > 10:
    print(f"  ... and {len(saved_files) - 10} more files")

print("\nThe LoRA adapters have been saved locally.")

Saving model to: ../models/final/gemma-2b-financial-lora

✓ Model saved successfully!
✓ Save location: ../models/final/gemma-2b-financial-lora

Saved files (7):
  - special_tokens_map.json
  - tokenizer.json
  - adapter_config.json
  - tokenizer_config.json
  - tokenizer.model
  - README.md
  - adapter_model.safetensors

The LoRA adapters have been saved locally.


---
## 5. Inference & Evaluation

Test the fine-tuned model with sample questions and evaluate on the test set.

In [24]:
# Inference function
def generate_response(instruction, input_context, max_new_tokens=256, temperature=0.7, top_p=0.9):
    """
    Generate response for a given instruction and context.
    
    Args:
        instruction: The question to answer
        input_context: The context from 10-K filing
        max_new_tokens: Maximum tokens to generate
        temperature: Sampling temperature (higher = more creative)
        top_p: Nucleus sampling parameter
    
    Returns:
        Generated response text
    """
    # Format prompt
    prompt = f"""Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Input:
{input_context}

### Response:
"""
    
    # Tokenize
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    
    # Generate
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            temperature=temperature,
            top_p=top_p,
            do_sample=True,
            pad_token_id=tokenizer.pad_token_id,
            eos_token_id=tokenizer.eos_token_id,
        )
    
    # Decode
    full_output = tokenizer.decode(outputs[0], skip_special_tokens=True)
    
    # Extract only the response part
    response = full_output.split("### Response:")[-1].strip()
    
    return response

print("✓ Inference function ready!")

✓ Inference function ready!


In [25]:
# Test with sample examples from test set
print("="*60)
print("INFERENCE EXAMPLES")
print("="*60)

num_examples = 3
for i in range(min(num_examples, len(test_data))):
    example = test_data[i]
    
    print(f"\n{'='*60}")
    print(f"EXAMPLE {i+1} - {example['ticker']}")
    print(f"{'='*60}")
    
    print(f"\nQuestion:\n{example['instruction']}")
    print(f"\nContext (excerpt):\n{example['input'][:200]}...")
    
    # Generate prediction
    prediction = generate_response(
        example['instruction'], 
        example['input'],
        max_new_tokens=200,
        temperature=0.7
    )
    
    print(f"\n{'─'*60}")
    print(f"MODEL PREDICTION:\n{prediction}")
    print(f"\n{'─'*60}")
    print(f"GROUND TRUTH:\n{example['output']}")
    print(f"{'─'*60}")

INFERENCE EXAMPLES

EXAMPLE 1 - ABBV

Question:
How will the Inflation Reduction Act of 2022 impact Medicare Part D and Part B drug pricing?

Context (excerpt):
The Inflation Reduction Act of 2022 requires the government to set prices for select high expenditure Medicare Part D drugs and Part B drugs that are more than nine years (for small-molecule drugs) or...





────────────────────────────────────────────────────────────
MODEL PREDICTION:
TheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheThe

────────────────────────────────────────────────────────────
GROUND TRUTH:
The Inflation Reduction Act requires government-set prices for select drugs more than nine years from FDA approval and mandates rebates when drug prices increase faster than inflation.
──────────────────────────────────────────────────────────

In [26]:
# Evaluate on test set
print("\n" + "="*60)
print("EVALUATING ON TEST SET")
print("="*60)

# Load ROUGE metric
rouge = evaluate.load('rouge')

predictions = []
references = []

eval_subset = test_data[:100]  # Evaluate on the first 100 examples for speed
print(f"\nGenerating predictions for {len(eval_subset)} of {len(test_data)} test examples...\n")

for example in tqdm(eval_subset, desc="Evaluating"):
    pred = generate_response(
        example['instruction'], 
        example['input'],
        max_new_tokens=200,
        temperature=0.7
    )
    predictions.append(pred)
    references.append(example['output'])

# Calculate ROUGE scores
rouge_scores = rouge.compute(predictions=predictions, references=references)

print("\n" + "="*60)
print("EVALUATION RESULTS")
print("="*60)
print(f"\nROUGE Scores (on {len(predictions)} test examples):")
print(f"  - ROUGE-1: {rouge_scores['rouge1']:.4f}")
print(f"  - ROUGE-2: {rouge_scores['rouge2']:.4f}")
print(f"  - ROUGE-L: {rouge_scores['rougeL']:.4f}")
print(f"  - ROUGE-Lsum: {rouge_scores['rougeLsum']:.4f}")


EVALUATING ON TEST SET

Generating predictions for 100 of 350 test examples...




EVALUATION RESULTS

ROUGE Scores (on 100 test examples):
  - ROUGE-1: 0.0061
  - ROUGE-2: 0.0000
  - ROUGE-L: 0.0060
  - ROUGE-Lsum: 0.0059
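
For intuition about these numbers, ROUGE-1 measures unigram overlap between a prediction and its reference. The sketch below is a simplified version (lowercased whitespace tokens, no stemming), so its values will not exactly match the `rouge` metric loaded above; `rouge1_f1` is a hypothetical helper name.

```python
from collections import Counter


def rouge1_f1(prediction, reference):
    """Simplified ROUGE-1 F1: unigram overlap on lowercased whitespace tokens."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((pred_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


reference = "$73,644 million"
exact = rouge1_f1("$73,644 million", reference)
verbose = rouge1_f1("total assets were $73,644 million", reference)
collapsed = rouge1_f1("TheTheTheTheThe", reference)

print(f"Exact match:       {exact:.4f}")
print(f"Verbose answer:    {verbose:.4f}")
print(f"Repetitive output: {collapsed:.4f}")
```

Scores near zero, as in the results above, mean the predictions share almost no tokens with the references, which is what the repetitive outputs in the inference examples would produce.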


In [None]:
# Save evaluation results
results = {
    "model": MODEL_NAME,
    "method": "QLoRA",
    "lora_config": {
        "r": peft_config.r,
        "alpha": peft_config.lora_alpha,
        "dropout": peft_config.lora_dropout,
        # target_modules is stored as a set, which json cannot serialize
        "target_modules": sorted(peft_config.target_modules)
    },
    "training_examples": len(train_data),
    "eval_examples": len(predictions),
    "rouge_scores": rouge_scores,
    "sample_predictions": [
        {
            "instruction": test_data[i]['instruction'],
            "prediction": predictions[i],
            "reference": references[i]
        }
        for i in range(min(5, len(predictions)))
    ]
}

results_path = "../outputs/results/evaluation_results.json"
os.makedirs(os.path.dirname(results_path), exist_ok=True)

with open(results_path, 'w', encoding='utf-8') as f:
    json.dump(results, f, indent=2, ensure_ascii=False)

print(f"\n✓ Evaluation results saved to: {results_path}")

---
## 6. Interactive Testing (Optional)

Test the model with custom questions.

In [29]:
# Interactive testing - modify these to test your own questions
custom_instruction = "What was the company's total revenue?"
custom_context = """The company reported strong financial performance for fiscal year 2023. 
Total revenue increased by 15% year-over-year to reach $26.97 billion. 
This growth was primarily driven by increased demand for our products and services."""

print("="*60)
print("CUSTOM QUESTION TEST")
print("="*60)
print(f"\nQuestion: {custom_instruction}")
print(f"\nContext: {custom_context}")

response = generate_response(custom_instruction, custom_context, max_new_tokens=150)

print(f"\nModel Response:\n{response}")

CUSTOM QUESTION TEST

Question: What was the company's total revenue?

Context: The company reported strong financial performance for fiscal year 2023. 
Total revenue increased by 15% year-over-year to reach $26.97 billion. 
This growth was primarily driven by increased demand for our products and services.

Model Response:
TheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheTheThe


---
## 7. Export Model (Optional)

Upload to HuggingFace Hub or download locally.

In [None]:
# Optional: Push to HuggingFace Hub
# Uncomment and configure if you want to share your model

# from huggingface_hub import HfApi
# 
# model_id = "your-username/gemma-2b-financial-qa-lora"
# trainer.model.push_to_hub(model_id)
# tokenizer.push_to_hub(model_id)
# 
# print(f"✓ Model pushed to HuggingFace Hub: {model_id}")

In [None]:
# Create a downloadable archive (for local use)
import shutil

archive_path = "../models/final/gemma-2b-financial-qa-model"
if os.path.exists(archive_path + ".zip"):
    os.remove(archive_path + ".zip")

shutil.make_archive(archive_path, 'zip', final_model_path)
print(f"✓ Model archived to: {archive_path}.zip")
print("\nThe model archive has been saved locally.")

✓ Model archived to: ../models/final/gemma-2b-financial-qa-model.zip

The model archive has been saved locally.


---
## Troubleshooting Guide

### If You See Repetitive Text (Model Collapse):

**Symptoms**: Model outputs "TheTheThe..." or "$$$..." repeatedly

**Common Causes & Fixes**:
1. **Gradient clipping disabled** → Set `max_grad_norm=1.0` ✓ (Fixed above)
2. **Learning rate too high** → Reduced to `1e-4` from `2e-4` ✓ (Fixed above)
3. **Not enough training** → Increased to 3 epochs ✓ (Fixed above)
4. **No monitoring** → Enabled `eval_strategy="steps"` ✓ (Fixed above)
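
The symptom can also be detected automatically, for example to flag degenerate generations during evaluation. A minimal sketch, assuming a simple heuristic: the text contains one short unit repeated many times in a row. `is_degenerate` and its thresholds are illustrative choices, not part of the notebook's pipeline, and the heuristic will also flag legitimate runs such as long strings of zeros.

```python
import re


def is_degenerate(text, max_unit_len=8, min_repeats=10):
    """True if a unit of 1..max_unit_len non-space characters repeats
    at least `min_repeats` times consecutively anywhere in `text`."""
    # One captured unit followed by (min_repeats - 1) or more copies of it.
    pattern = re.compile(r"(\S{1,%d}?)\1{%d,}" % (max_unit_len, min_repeats - 1))
    return pattern.search(text) is not None


samples = {
    "collapsed": "The" * 50,
    "dollar run": "$" * 30,
    "coherent": "The Inflation Reduction Act requires government-set prices for select drugs.",
}
for label, text in samples.items():
    print(f"{label:12s} degenerate={is_degenerate(text)}")
```

Counting flagged predictions over the test set gives a quick signal of collapse before looking at ROUGE scores.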

### What to Check After Retraining:
- Training loss should **decrease** (check the output of the training diagnostics cell)
- Validation loss should **decrease** (monitored every 500 steps)
- First inference examples should show **coherent text**, not repetition

### If Problems Persist:
- Increase `num_train_epochs` to 4-5
- Try `learning_rate=5e-5` (even lower)
- Increase `MAX_SEQ_LENGTH` to 512 (more context)
- Check that your GPU has enough VRAM

---
## Summary

You have successfully:
1. ✓ Preprocessed 7,000 financial Q&A examples into Alpaca format
2. ✓ Fine-tuned Gemma-2B using QLoRA (4-bit quantization)
3. ✓ Evaluated the model on test data using ROUGE metrics
4. ✓ Saved the fine-tuned LoRA adapters

**Next Steps:**
- Experiment with different hyperparameters (learning rate, LoRA rank, etc.)
- Try fine-tuning on the full dataset (7,000+ examples)
- Test with real-world financial questions
- Integrate into a RAG (Retrieval-Augmented Generation) pipeline
- Deploy as an API using FastAPI or Gradio

**Model Usage:**
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load base model and LoRA adapters
base_model = AutoModelForCausalLM.from_pretrained("google/gemma-2b")
model = PeftModel.from_pretrained(base_model, "path/to/lora/adapters")
tokenizer = AutoTokenizer.from_pretrained("path/to/lora/adapters")
```