# Financial PII Detection Model - Supervised Fine-Tuning

This notebook demonstrates how to fine-tune a language model to detect and protect personally identifiable information (PII) in financial documents.

## Setup

First, make sure you're using a GPU runtime in Colab:
- Go to Runtime > Change runtime type
- Select GPU from the Hardware accelerator dropdown
- Click Save
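
Before installing anything, it can help to confirm that the runtime actually exposes a GPU. The sketch below is one way to check using only the standard library. It assumes the `nvidia-smi` tool is on the PATH whenever an NVIDIA GPU is attached; `gpu_runtime_available` is a hypothetical helper name, not part of Colab or Unsloth.

```python
import shutil
import subprocess


def gpu_runtime_available() -> bool:
    """Return True if nvidia-smi exists and lists at least one GPU."""
    smi = shutil.which("nvidia-smi")
    if smi is None:
        return False
    try:
        # "nvidia-smi -L" prints one line per GPU, e.g. "GPU 0: ..."
        result = subprocess.run(
            [smi, "-L"], capture_output=True, text=True, timeout=30
        )
    except (OSError, subprocess.SubprocessError):
        return False
    return result.returncode == 0 and "GPU" in result.stdout


print("GPU runtime detected:", gpu_runtime_available())
```

If this prints `False`, repeat the runtime steps above before continuing.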

## Step 1: Install Unsloth

Unsloth is a library that makes fine-tuning faster and more memory-efficient.

In [None]:
!pip install unsloth -q

## Step 2: Import Libraries

In [None]:
import torch
# Import unsloth before trl/transformers so its optimizations are applied
from unsloth import FastLanguageModel, is_bfloat16_supported
from huggingface_hub import login
from google.colab import userdata
from datasets import load_dataset
from trl import SFTTrainer
from transformers import TrainingArguments

## Step 3: Set Configuration Parameters

In [None]:
max_seq_length = 2048  # Maximum number of tokens in a sequence
dtype = None           # Data type for computation (None lets Unsloth auto-detect float16 or bfloat16)
load_in_4bit = True    # Enable 4-bit quantization for memory efficiency
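
To see why 4-bit loading matters, here is a rough back-of-the-envelope estimate of the memory taken by the weights alone. It assumes about 8 billion parameters and ignores quantization overhead, activations, and optimizer state, so treat the numbers as a lower bound rather than a measurement. `weight_memory_gb` is a hypothetical helper.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 8e9  # assumption: roughly 8 billion parameters

fp16_gb = weight_memory_gb(NUM_PARAMS, 16)
int4_gb = weight_memory_gb(NUM_PARAMS, 4)

print(f"16-bit weights: ~{fp16_gb:.0f} GB")
print(f"4-bit weights:  ~{int4_gb:.0f} GB")
```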

## Step 4: Log in to Hugging Face

You'll need to add your Hugging Face token to your Google Colab secrets:
1. Go to https://huggingface.co/settings/tokens to create a token if you don't have one
2. In Colab, click on the key icon in the left sidebar
3. Add a new secret named "HuggingFace" (the name must match the code below) with your token as the value, and enable notebook access for it

In [None]:
hf_token = userdata.get('HuggingFace')
login(hf_token)
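
If you also run this notebook outside Colab, `google.colab` will not be importable. The following is a minimal sketch of a fallback that reads the token from an environment variable instead. The helper name `get_hf_token` and the choice of `HF_TOKEN` as the variable name are assumptions made for illustration.

```python
import os


def get_hf_token(secret_name: str = "HuggingFace", env_var: str = "HF_TOKEN"):
    """Read the token from Colab secrets if possible, else from the environment."""
    try:
        from google.colab import userdata  # only importable inside Colab
    except ImportError:
        return os.environ.get(env_var)
    try:
        return userdata.get(secret_name)
    except Exception:
        # Secret missing or notebook access not granted: fall back to the env var
        return os.environ.get(env_var)


token = get_hf_token()
print("Token found:", token is not None)
```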

## Step 5: Load the Model and Tokenizer

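We're using DeepSeek-R1-Distill-Llama-8B, an 8-billion-parameter Llama model distilled from DeepSeek-R1. It is capable enough to produce step-by-step reasoning, yet small enough to fine-tune on a single GPU when loaded in 4-bit.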

In [None]:
print("Loading base model...")
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/DeepSeek-R1-Distill-Llama-8B",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    token = hf_token,
)

## Step 6: Define the Prompt Formats

We'll create two prompt formats:
- One for inference (testing)
- One for training that includes the "think" tags for chain-of-thought reasoning

In [None]:
# Format for inference
prompt_style = """Below is an instruction that describes a task, paired with an input that provides further context.

Write a response that appropriately completes the request.

Before answering, think carefully about the question and create a step-by-step chain of thoughts to ensure a logical and accurate response.

### Instruction:

You are a financial expert specializing in identifying and protecting personally identifiable information (PII) in financial documents.

Please analyze the following document for any PII and explain which elements need protection.

### Document:

{}

### Response:

{}"""

# Format for training (with think tags)
train_prompt_style = """Below is an instruction that describes a task, paired with an input that provides further context.

Write a response that appropriately completes the request.

Before answering, think carefully about the question and create a step-by-step chain of thoughts to ensure a logical and accurate response.

### Instruction:

You are a financial expert specializing in identifying and protecting personally identifiable information (PII) in financial documents.

Please analyze the following document for any PII and explain which elements need protection.

### Document:

{}

### Response:

<think>

{}

</think>

{}"""
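
The two templates differ only in how the `{}` slots after `### Response:` are filled. The sketch below shows the mechanics with shortened stand-in templates (hypothetical, for illustration only; the real ones are defined above): the inference prompt leaves the response slot empty for the model to complete, while the training text fills in reasoning and answer.

```python
# Shortened stand-ins for the notebook's templates (hypothetical)
demo_inference_style = "### Document:\n{}\n\n### Response:\n{}"
demo_train_style = "### Document:\n{}\n\n### Response:\n<think>\n{}\n</think>\n{}"

document = "Name: Jane A. Smith"

# Inference: the response slot is empty, so the prompt ends right after the marker
inference_prompt = demo_inference_style.format(document, "")

# Training: the reasoning and the final answer are both supplied
training_text = demo_train_style.format(
    document,
    "The name identifies a specific person.",
    "PII found: full name.",
)

print(inference_prompt)
print("---")
print(training_text)
```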

## Step 7: Test the Base Model Before Fine-Tuning

Let's see how the model performs on a sample financial document before training.

In [None]:
print("Testing base model...")
test_document = """
Loan Application

Full Legal Name: Jane A. Smith
Date of Birth: 04/12/1990

Mailing Address:
123 Main Street
Boston, MA 02108

Phone Number: (617) 555-1234
Email Address: jane.smith@email.com

Bank Account: 9876543210
Social Security Number: 123-45-6789
"""

FastLanguageModel.for_inference(model)  # Switch the model to Unsloth's inference mode
inputs = tokenizer([prompt_style.format(test_document, "")], return_tensors="pt").to("cuda")
outputs = model.generate(
    input_ids=inputs.input_ids,
    attention_mask=inputs.attention_mask,
    max_new_tokens=1200,
    use_cache=True,
)
response = tokenizer.batch_decode(outputs)
print("Base model response:")
print(response[0].split("### Response:")[1])
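
The decoded output contains the prompt, the model's reasoning, and its final answer in one string. Here is a small sketch for separating them, assuming the model wraps its reasoning in `<think>...</think>` as the training template does. `split_response` is a hypothetical helper and the sample string is made up.

```python
def split_response(decoded: str):
    """Return (reasoning, answer) parsed from a decoded generation."""
    # Everything after the first "### Response:" marker is the model's output
    _, _, body = decoded.partition("### Response:")
    if "</think>" in body:
        reasoning, _, answer = body.partition("</think>")
        reasoning = reasoning.replace("<think>", "")
    else:
        # No closing think tag: treat the whole output as the answer
        reasoning, answer = "", body
    return reasoning.strip(), answer.strip()


sample = "prompt text ### Response:\n<think>\nstep one\n</think>\nPII: name"
reasoning, answer = split_response(sample)
print("Reasoning:", reasoning)
print("Answer:", answer)
```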

## Step 8: Initialize the Model for Fine-Tuning

We'll use LoRA (Low-Rank Adaptation) to efficiently fine-tune the model while keeping memory usage low.

In [None]:
print("Initializing for fine-tuning...")
model = FastLanguageModel.get_peft_model(
    model,
    r=16,                   # Rank of the adaptation matrices
    target_modules=[        # Which modules to fine-tune
        "q_proj",           # Query projection
        "k_proj",           # Key projection
        "v_proj",           # Value projection
        "o_proj",           # Output projection
        "gate_proj",        # Gate projection for MLP
        "up_proj",          # Upward projection for MLP
        "down_proj",        # Downward projection for MLP
    ],
    lora_alpha=16,          # Alpha parameter for LoRA
    lora_dropout=0,         # Dropout probability for LoRA
    bias="none",            # Whether to train bias parameters
    use_gradient_checkpointing="unsloth",  # Memory optimization
    random_state=9001,      # Random seed for reproducibility
    use_rslora=False,       # Whether to use rank-stabilized LoRA
    loftq_config=None,      # LoftQ quantization config
)
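
To get a feel for how small the LoRA update is, the sketch below counts the trainable parameters these settings would add. The layer shapes are assumptions based on the Llama-3.1-8B architecture (they are not stated in this notebook). Each adapted linear layer gains two low-rank matrices, A of shape (r x in_features) and B of shape (out_features x r).

```python
# Assumed Llama-3.1-8B shapes (not taken from this notebook)
HIDDEN = 4096
INTERMEDIATE = 14336
KV_DIM = 1024       # 8 key/value heads x 128 dims (grouped-query attention)
NUM_LAYERS = 32
R = 16
LORA_ALPHA = 16

# (in_features, out_features) for each targeted module
module_shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features: int, out_features: int, r: int) -> int:
    """Parameters in A (r x in_features) plus B (out_features x r)."""
    return r * (in_features + out_features)


per_layer = sum(lora_params(i, o, R) for i, o in module_shapes.values())
total_trainable = per_layer * NUM_LAYERS
scaling = LORA_ALPHA / R  # factor applied to the low-rank update

print(f"Trainable LoRA parameters: {total_trainable:,}")
print(f"LoRA scaling factor (alpha / r): {scaling}")
```

Under these assumptions the adapters add roughly 42 million parameters, well under 1% of the 8 billion in the base model.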

## Step 9: Create the Formatting Function for Training Data

In [None]:
EOS_TOKEN = tokenizer.eos_token  # Appended so the model learns where a response ends

def formatting_prompts_func(examples):
    # Per the dataset card, the document text is stored in the "generated_text" column
    inputs = examples["generated_text"]
    texts = []

    for input_text in inputs:
        # The dataset provides no reasoning traces, so the <think> block is left empty.
        # Note: the model therefore sees an empty reasoning section during training.
        cot = ""

        # Placeholder target: the same generic sentence is used for every document,
        # so it does not teach the model which specific elements are PII.
        output = "This document contains personally identifiable information (PII) that should be protected according to financial regulations."

        # Format the text according to our training prompt style
        text = train_prompt_style.format(input_text, cot, output) + EOS_TOKEN
        texts.append(text)

    # Stored in a new "text" column, which the trainer reads
    return {"text": texts}
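
A more informative training target could list the actual PII found in each document. The sketch below shows one way to build such a target. It assumes each record carries a `pii_spans` field holding a JSON list of objects with `start`, `end`, and `label` keys, which is my reading of the dataset card; check it against a real record before relying on it. `build_target` and the example record are hypothetical.

```python
import json


def build_target(text: str, pii_spans_json: str) -> str:
    """Turn character-offset PII spans into a readable training target."""
    spans = json.loads(pii_spans_json)
    if not spans:
        return "No PII was found in this document."
    lines = ["The following PII should be protected:"]
    for span in sorted(spans, key=lambda s: s["start"]):
        value = text[span["start"]:span["end"]]
        lines.append(f"- {span['label']}: {value}")
    return "\n".join(lines)


# Made-up record in the assumed format
example_text = "Account holder Jane Smith, SSN 123-45-6789."
ssn_start = example_text.index("123-45-6789")
name_start = example_text.index("Jane Smith")
example_spans = json.dumps([
    {"start": ssn_start, "end": ssn_start + len("123-45-6789"), "label": "ssn"},
    {"start": name_start, "end": name_start + len("Jane Smith"), "label": "name"},
])

target = build_target(example_text, example_spans)
print(target)
```

The returned string could replace the fixed `output` sentence in the formatting function.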

## Step 10: Load and Prepare the Dataset

We'll use the Gretel AI synthetic financial PII dataset from Hugging Face.

In [None]:
print("Loading dataset...")
dataset = load_dataset("gretelai/synthetic_pii_finance_multilingual", split="train[0:500]")
dataset = dataset.filter(lambda example: example["language"] == "English")  # Optional: filter to just English documents

print("Preparing dataset...")
dataset = dataset.map(formatting_prompts_func, batched=True)

# Print a sample to verify
print("Sample formatted training data:")
print(dataset["text"][0])
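
Because the slice `train[0:500]` is taken before the language filter runs, the resulting training set will usually hold fewer than 500 documents. The toy example below uses plain Python lists rather than the `datasets` API to illustrate how the order of the two operations changes the result. The records are made up.

```python
# Made-up records standing in for dataset rows
languages = ["English", "Spanish", "English", "German", "English", "English"]
records = [{"language": lang} for lang in languages]

# Slice first, then filter (the order used in the cell above)
slice_then_filter = [r for r in records[:4] if r["language"] == "English"]

# Filter first, then slice
filter_then_slice = [r for r in records if r["language"] == "English"][:4]

print("Slice then filter:", len(slice_then_filter))
print("Filter then slice:", len(filter_then_slice))
```

Printing `len(dataset)` after the filter step shows how many English documents actually remain.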

## Step 11: Set Up the Trainer

In [None]:
print("Setting up trainer...")
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    dataset_num_proc=2,
    args=TrainingArguments(
        per_device_train_batch_size=2,      # Number of examples per GPU
        gradient_accumulation_steps=4,      # Number of batches whose gradients are accumulated before each weight update
        warmup_steps=5,                     # Steps of warmup for learning rate
        max_steps=60,                       # Total number of training steps
        learning_rate=2e-4,                 # Learning rate
        fp16=not is_bfloat16_supported(),   # Whether to use 16-bit floating point precision
        bf16=is_bfloat16_supported(),       # Whether to use bfloat16 precision (better on newer GPUs)
        logging_steps=10,                   # How often to log stats
        optim="adamw_8bit",                 # Optimizer to use
        weight_decay=0.01,                  # Weight decay coefficient (decoupled in AdamW)
        lr_scheduler_type="linear",         # Learning rate schedule
        seed=3407,                          # Random seed
        output_dir="outputs",               # Directory to save outputs
        report_to="none"                    # Disable reporting to services like Wandb
    ),
)
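
With `warmup_steps=5`, `max_steps=60`, and a linear schedule, the learning rate ramps up from zero and then decays back to zero. The sketch below reproduces that shape in plain Python. It follows my understanding of how a linear schedule with warmup is usually computed and is not code taken from the trainer; `linear_lr` is a hypothetical helper.

```python
def linear_lr(step: int, base_lr: float = 2e-4,
              warmup_steps: int = 5, max_steps: int = 60) -> float:
    """Learning rate at a given optimizer step: linear warmup, then linear decay."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max_steps - step
    return base_lr * max(0.0, remaining / max(1, max_steps - warmup_steps))


for step in (0, 1, 5, 30, 60):
    print(f"step {step:>2}: lr = {linear_lr(step):.2e}")
```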

## Step 12: Train the Model

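This is the main training step. For a quick demonstration, we train for only 60 steps; with an effective batch size of 8 (2 examples per device x 4 accumulation steps), that covers 480 training examples. For production use, you'd want to train longer, for example 500-2000 steps on a correspondingly larger dataset.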

In [None]:
print("Starting fine-tuning...")
trainer_stats = trainer.train()
print("Fine-tuning complete!")
print(f"Training stats: {trainer_stats}")
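
To relate the step count to the amount of data, the arithmetic below converts steps into examples seen and approximate epochs. It assumes a single GPU, and the dataset size of 400 is a made-up figure; substitute `len(dataset)` from the notebook.

```python
per_device_batch = 2
grad_accum_steps = 4
max_steps = 60
num_gpus = 1  # assumption: a single Colab GPU

effective_batch = per_device_batch * grad_accum_steps * num_gpus
examples_seen = effective_batch * max_steps


def approx_epochs(dataset_size: int) -> float:
    """How many passes over the data the run makes, approximately."""
    return examples_seen / dataset_size


print("Effective batch size:", effective_batch)
print("Examples seen:", examples_seen)
print("Approx. epochs for 400 examples:", approx_epochs(400))
```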

## Step 13: Test the Fine-Tuned Model

Let's see how the model performs on the same document after training.

In [None]:
print("Testing fine-tuned model...")
FastLanguageModel.for_inference(model)
inputs = tokenizer([prompt_style.format(test_document, "")], return_tensors="pt").to("cuda")
outputs = model.generate(
    input_ids=inputs.input_ids,
    attention_mask=inputs.attention_mask,
    max_new_tokens=1200,
    use_cache=True,
)
response = tokenizer.batch_decode(outputs)
print("Fine-tuned model response:")
print(response[0].split("### Response:")[1])
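
Reading the two responses side by side is subjective. A simple, rough check is to count how many of the known PII values from the test document appear verbatim in a response. The sketch below does that with plain string matching; `pii_recall` is a hypothetical helper, and the sample response is made up rather than real model output.

```python
# Known PII values from the test document above
expected_pii = {
    "name": "Jane A. Smith",
    "date_of_birth": "04/12/1990",
    "phone": "(617) 555-1234",
    "email": "jane.smith@email.com",
    "bank_account": "9876543210",
    "ssn": "123-45-6789",
}


def pii_recall(response_text: str, expected: dict):
    """Return (labels found, fraction of expected values mentioned verbatim)."""
    found = {label for label, value in expected.items() if value in response_text}
    return found, len(found) / len(expected)


# Made-up response for illustration
sample_response = (
    "The document exposes the name Jane A. Smith, "
    "the SSN 123-45-6789 and the email jane.smith@email.com."
)
found, recall = pii_recall(sample_response, expected_pii)
print("Found:", sorted(found))
print("Missed:", sorted(set(expected_pii) - found))
print(f"Recall: {recall:.2f}")
```

Exact matching misses paraphrased or reformatted values, so this is only a first-pass signal.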

## Step 14: Save the Model

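Now that our model is trained, let's save it locally. Because the model was wrapped with LoRA, `save_pretrained` writes the LoRA adapter weights rather than a merged copy of the full model, alongside the tokenizer files.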

In [None]:
print("Saving model...")
new_model_local = "Financial-PII-Detection-Expert"

model.save_pretrained(new_model_local)
tokenizer.save_pretrained(new_model_local)

## Step 15: Push to Hugging Face (Optional)

If you want to share your model with others, you can upload it to Hugging Face.

In [None]:
# Set to True if you want to push to Hugging Face
push_to_hub = False

if push_to_hub:
    print("Pushing to Hugging Face...")
    # Change to your username
    new_model_online = "your-username/financial-pii-detection-expert"  
    
    model.push_to_hub(new_model_online)
    tokenizer.push_to_hub(new_model_online)
    print(f"Model pushed to: {new_model_online}")

## Conclusion

Congratulations! You've successfully fine-tuned a language model to identify and protect PII in financial documents. This model can now be used as part of your data privacy workflow.

Some ideas for how to use this model:
- Create an automated PII detection system for document processing
- Build a compliance checking tool for financial documents
- Use it as a training tool for staff handling sensitive information
- Integrate it into a document redaction pipeline
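
As an illustration of the last idea, here is a minimal sketch of the redaction step itself. It assumes the detection stage has already been turned into character-offset spans with `start`, `end`, and `label` keys; how to obtain those spans from the model's free-text output is not covered here. `redact` is a hypothetical helper.

```python
def redact(text: str, spans: list) -> str:
    """Replace each detected span with a bracketed label such as [SSN]."""
    # Work from the end of the text so earlier offsets stay valid
    for span in sorted(spans, key=lambda s: s["start"], reverse=True):
        placeholder = f"[{span['label'].upper()}]"
        text = text[:span["start"]] + placeholder + text[span["end"]:]
    return text


doc = "SSN 123-45-6789, email jane.smith@email.com"
ssn_start = doc.index("123-45-6789")
email_start = doc.index("jane.smith@email.com")
detected = [
    {"start": ssn_start, "end": ssn_start + len("123-45-6789"), "label": "ssn"},
    {"start": email_start, "end": email_start + len("jane.smith@email.com"), "label": "email"},
]

redacted = redact(doc, detected)
print(redacted)
```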