# Insurance Document Classification with RoBERTa

This notebook demonstrates how to build an intelligent insurance document classifier using the DistilRoBERTa base model. The classifier will be able to categorize insurance documents into three classes:

1. Policy documents 
2. Claims
3. Support queries

We chose DistilRoBERTa because:
- It's a lighter, distilled version of RoBERTa that maintains strong performance
- RoBERTa's pretraining recipe (more data, longer training, dynamic masking) makes it a strong baseline for text classification tasks
- The model size (~82M parameters) provides a good balance between accuracy and computational efficiency
- The distilled architecture allows for faster inference in production

The notebook covers:
- Data preparation and preprocessing
- Model fine-tuning
- Evaluation
- Building an inference pipeline

In [11]:
%load_ext autoreload
%autoreload 2



In [12]:
from prep_data import load_and_prepare_data, create_datasets
from model_training import ModelTrainer
from insurance_agent import InsuranceAgent

## Data Preparation

We first need to prepare our synthetic data for fine-tuning the model. The dataset has only about 100 examples; a real project would need many more, but this is just a demo.

In [13]:
data_splits = load_and_prepare_data(file_path="../data/synthetic_insurance_data.json")

Loading dataset from: ../data/synthetic_insurance_data.json

Dataset split complete:
Training examples: 86
Testing examples: 22
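
The internals of `load_and_prepare_data` are not shown in this notebook. As a rough sketch of what such a split might look like, the helper below shuffles examples within each label and holds out a fixed fraction per class. The function name `stratified_split`, the `text`/`label` field names, and the 20% test fraction are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def stratified_split(examples, test_fraction=0.2, seed=42):
    """Split examples into train/test lists while keeping label proportions similar."""
    by_label = defaultdict(list)
    for ex in examples:
        by_label[ex["label"]].append(ex)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, test = [], []
    for label in sorted(by_label):
        group = by_label[label][:]
        rng.shuffle(group)
        # Hold out at least one example per class.
        n_test = max(1, round(len(group) * test_fraction))
        test.extend(group[:n_test])
        train.extend(group[n_test:])

    rng.shuffle(train)
    rng.shuffle(test)
    return train, test


# Toy data standing in for the synthetic JSON file (field names are assumed).
examples = [
    {"text": f"{label} example {i}", "label": label}
    for label in ("policy", "claim", "support")
    for i in range(10)
]
train, test = stratified_split(examples)
print(f"Training examples: {len(train)}")
print(f"Testing examples: {len(test)}")
```

Splitting per class matters most on small datasets like this one, where a purely random split can leave a class badly under-represented in the test set.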


In [14]:
# Create datasets
tokenizer, train_dataset, test_dataset = create_datasets(data_splits)


Initializing tokenizer: distilroberta-base


In [15]:
# Let's look at an example from our training dataset
print("\nExample from training dataset:")
example_idx = 0
example = train_dataset[example_idx]

print("Input IDs shape:", example['input_ids'].shape)
print("Attention mask shape:", example['attention_mask'].shape)
print("Label:", example['labels'])


Example from training dataset:
Input IDs shape: torch.Size([69])
Attention mask shape: torch.Size([69])
Label: 0
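
The two shapes above match because every token position gets exactly one attention-mask entry. How `create_datasets` pads or truncates is not shown here, so the snippet below is only a pure-Python illustration of the idea: shorter sequences are padded to a common length, and the mask marks real tokens with 1 and padding with 0. The helper name `pad_batch` and the token ids are made up for the example.

```python
PAD_ID = 1  # RoBERTa-style vocabularies reserve id 1 for <pad>


def pad_batch(sequences, pad_id=PAD_ID):
    """Pad token-id lists to the longest length and build matching attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(list(seq) + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask


# Made-up token ids; 0 and 2 play the role of <s> and </s>.
batch = [[0, 713, 16, 2], [0, 713, 2]]
ids, mask = pad_batch(batch)
print(ids)   # [[0, 713, 16, 2], [0, 713, 2, 1]]
print(mask)  # [[1, 1, 1, 1], [1, 1, 1, 0]]
```

In practice the Hugging Face tokenizer does this for us; the sketch only makes the relationship between `input_ids` and `attention_mask` explicit.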


We now have a dataset that we can use to fine-tune our model.

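## Fine-Tuning DistilRoBERTa

We'll use the `AutoModelForSequenceClassification` class to fine-tune our DistilRoBERTa model. It loads the pretrained encoder and adds a new, randomly initialized classification head for our three classes.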
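
`ModelTrainer` hides the model setup, so the following is only a guess at what it might do internally. The label order in `ID2LABEL` and the helper name `build_classifier` are assumptions; the notebook does not show the real mapping.

```python
# Assumed label order; the notebook does not show the real mapping.
ID2LABEL = {0: "policy", 1: "claim", 2: "support"}
LABEL2ID = {name: idx for idx, name in ID2LABEL.items()}


def build_classifier(checkpoint="distilroberta-base"):
    """Load the pretrained encoder with a new 3-way classification head."""
    # Imported lazily so the label maps can be used without transformers installed.
    from transformers import AutoModelForSequenceClassification

    return AutoModelForSequenceClassification.from_pretrained(
        checkpoint,
        num_labels=len(ID2LABEL),
        id2label=ID2LABEL,
        label2id=LABEL2ID,
    )
```

Storing `id2label` and `label2id` in the model config means predictions can later be reported as readable names such as `policy` rather than raw class indices.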

In [16]:
trainer = ModelTrainer()

trained_model, eval_results = trainer.train(train_dataset, test_dataset)

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at distilroberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Initializing distilroberta-base for fine-tuning...

Starting training...


Epoch,Training Loss,Validation Loss,Accuracy,F1
1,1.1004,0.709337,0.909091,0.910279
2,0.0649,0.016989,1.0,1.0
3,0.0063,0.002791,1.0,1.0



Training completed!

Evaluation Results:
eval_loss: 0.0170
eval_accuracy: 1.0000
eval_f1: 1.0000
eval_runtime: 2.5593
eval_samples_per_second: 8.5960
eval_steps_per_second: 1.1720
epoch: 3.0000
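
To make the Accuracy and F1 columns concrete, here is a small pure-Python sketch of both metrics. Whether the trainer reports weighted or macro F1 is not shown in this notebook, so the weighted average used below is an assumption, and the labels are made up.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with each class's support as its weight."""
    support = Counter(y_true)
    total = 0.0
    for label, n in support.items():
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += f1 * n
    return total / len(y_true)


# 22 made-up labels with 2 mistakes, mirroring the size of the test split.
y_true = [0] * 8 + [1] * 7 + [2] * 7
y_pred = y_true.copy()
y_pred[0] = 1  # a policy document predicted as a claim
y_pred[8] = 2  # a claim predicted as a support query

print(f"accuracy = {accuracy(y_true, y_pred):.6f}")  # accuracy = 0.909091
print(f"weighted F1 = {weighted_f1(y_true, y_pred):.6f}")
```

With 22 test examples, each misclassification moves accuracy by about 4.5 percentage points. The epoch 1 accuracy of 0.909091 is consistent with 20 of 22 correct.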


In [17]:

print("\nSaving model and tokenizer...")
trained_model.save_model("../models/insurance-model")
tokenizer.save_pretrained("../models/insurance-model")

print("\nModel and tokenizer saved to '../models/insurance-model'")



Saving model and tokenizer...

Model and tokenizer saved to '../models/insurance-model'


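Great! Our model is trained for demo purposes. The perfect accuracy and F1 from epoch 2 onward should be read with caution: with only 86 training and 22 test examples of synthetic data, the model has most likely overfit, and these scores say little about performance on real documents. In practice, we'd want a larger dataset and would tune the training parameters to reduce overfitting.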
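
One common guard against overfitting is early stopping on validation loss. The notebook does not use it, so the helper below is a hypothetical sketch of the logic only; in a Hugging Face training loop the `EarlyStoppingCallback` from `transformers` plays a similar role.

```python
def best_epoch_with_patience(val_losses, patience=2, min_delta=0.0):
    """Return (best_epoch, stop_epoch), both 1-indexed.

    Training stops once `patience` consecutive epochs fail to improve the
    best validation loss by more than `min_delta`.
    """
    best_loss = float("inf")
    best_epoch = 0
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss - min_delta:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return best_epoch, epoch
    return best_epoch, len(val_losses)


# Hypothetical loss curve that starts rising after epoch 3.
losses = [0.71, 0.30, 0.22, 0.25, 0.27, 0.31]
print(best_epoch_with_patience(losses))  # (3, 5)
```

On the three-epoch run above the validation loss falls every epoch, so this rule would not have stopped training early. It becomes useful on longer runs, where validation loss can start rising while training loss keeps falling.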

## Agentic Approach

Now that we have a fine-tuned model, we can use it to classify documents. The `InsuranceAgent` class is a simple agent that uses the fine-tuned model to classify a document and then routes it to a handler for that document type. The hooks for these downstream operations are in place, but in this demo they only return placeholder responses.
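
The source of `InsuranceAgent` is not included here. As a minimal sketch of the classify-then-route idea, the code below turns raw class logits into a confidence score with softmax and dispatches to a per-type handler. The label order, the handler functions, the `support_response` action name, and the example logits are all assumptions.

```python
import math

LABELS = ["policy", "claim", "support"]  # assumed label order


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Placeholder handlers; real ones would run the downstream processing.
def handle_policy(text):
    return {"action": "policy_analysis"}


def handle_claim(text):
    return {"action": "claim_processing"}


def handle_support(text):
    return {"action": "support_response"}  # hypothetical action name


HANDLERS = {"policy": handle_policy, "claim": handle_claim, "support": handle_support}


def route(text, logits):
    """Pick the most likely label and call the matching handler."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    label = LABELS[best]
    return {
        "document_type": label,
        "confidence": probs[best],
        "response": HANDLERS[label](text),
    }


# Made-up logits standing in for the classifier's output.
result = route("Filing a claim for a ransomware attack.", logits=[-1.2, 4.1, -0.8])
print(result["document_type"], f"{result['confidence']:.2%}")
```

A dictionary of handlers keeps the routing table in one place, so adding a document type means adding one label and one function.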

In [18]:
# Initialize the agent
agent = InsuranceAgent(model_path="../models/insurance-model")


Initializing Insurance Agent...


In [19]:
# Example documents to process
test_documents = [
    {
        "type": "Policy Document",
        "text": "This cyber insurance policy provides coverage for data breaches and ransomware attacks. Policy limits: $1M per occurrence."
    },
    {
        "type": "Claim Submission",
        "text": "Filing a claim for ransomware attack that occurred on Aug 15, 2024. Systems were encrypted and business was interrupted for 48 hours."
    },
    {
        "type": "Support Query",
        "text": "Need clarification on cyber coverage limits and exclusions for cloud service provider outages."
    }
]


In [20]:
# Process each document and show results
for doc in test_documents:
    print(f"\nProcessing {doc['type']}...")
    result = agent.process_document(doc['text'])
    
    print(f"\nResults:")
    print(f"Document Type: {result['document_type']}")
    print(f"Confidence: {result['confidence']:.2%}")
    print(f"Response: {result['response']}")
    print(f"Next Steps: {result['next_steps']}")
    print("-" * 50)


Processing Policy Document...

Results:
Document Type: policy
Confidence: 99.06%
Response: {'action': 'policy_analysis', 'message': 'Policy analysis would extract key information, coverage details, and generate recommendations.', 'sample_data': {'policy_type': 'Example Policy Type', 'coverage_amount': '$500,000', 'effective_date': '2024-01-01'}}
Next Steps: ['Example Step 1 for policy', 'Example Step 2 for policy', 'Contact support if needed']
--------------------------------------------------

Processing Claim Submission...

Results:
Document Type: claim
Confidence: 99.22%
Response: {'action': 'claim_processing', 'message': 'Claim processing would validate details, assess urgency, and determine required documentation.', 'sample_data': {'claim_id': 'CLM123456', 'status': 'Under Review', 'priority': 'Medium'}}
Next Steps: ['Example Step 1 for claim', 'Example Step 2 for claim', 'Contact support if needed']
--------------------------------------------------

Processing Support Query...


## Conclusion

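We can see that the model classifies the policy and claim examples correctly, each with about 99% confidence, and that the agent returns hypothetical next steps for each type. This demo is simple, but it shows how a fine-tuned classifier can act as a router in front of task-specific handlers.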

Building specialized agents for each document type (policy, claims, support) would allow us to incorporate domain-specific logic and optimizations. For example, a dedicated claims agent could have specialized NER models for extracting incident details, while a policy agent could focus on coverage analysis and risk assessment.

For even more sophisticated processing, we can leverage different transformer architectures based on the specific needs of each agent.

T5 models excel at structured question answering tasks, making them ideal for claims processing agents that need to extract specific details from incident reports. For instance, the claims agent could use T5 to systematically query the document: "When did the incident occur?", "What type of damage occurred?", "Were there any witnesses?" T5's encoder-decoder architecture is particularly well-suited for these focused extraction tasks where we need precise, factual answers.
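
As a sketch of how a claims agent might prepare those queries, the helper below formats each question together with the document text. The `question: ... context: ...` layout follows the convention T5 used for SQuAD-style question answering; a particular checkpoint may expect a different format, and the helper name `build_qa_prompts` is hypothetical. The model call itself is left out.

```python
def build_qa_prompts(document, questions):
    """Pair each question with the document in a 'question: ... context: ...' layout."""
    context = " ".join(document.split())  # collapse stray whitespace and newlines
    return [f"question: {q} context: {context}" for q in questions]


QUESTIONS = [
    "When did the incident occur?",
    "What type of damage occurred?",
    "Were there any witnesses?",
]

claim_text = (
    "Filing a claim for ransomware attack that occurred on Aug 15, 2024. "
    "Systems were encrypted and business was interrupted for 48 hours."
)

prompts = build_qa_prompts(claim_text, QUESTIONS)
for prompt in prompts:
    print(prompt)
```

Each prompt would then be passed to a sequence-to-sequence model, with the generated text taken as the extracted answer.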

GPT models, with their autoregressive architecture, are better suited for more open-ended generation tasks. A policy analysis agent could leverage GPT to generate comprehensive coverage summaries, identify potential coverage gaps, or explain complex policy terms in plain language. The model's strength in maintaining context and generating coherent, contextually relevant text makes it valuable for tasks requiring more nuanced understanding and explanation.

We could also implement hybrid approaches where different models handle different aspects of the processing pipeline. For example, a support ticket agent might use T5 to extract key issue details and categorize the request, then use GPT to generate appropriate response templates or suggest resolution steps. This combination allows us to leverage the strengths of each architecture - T5's precision in information extraction and GPT's fluency in generation.
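
The hybrid idea can be sketched as two interchangeable stages composed into one pipeline. Everything below is illustrative: `make_pipeline` and the stub stages are hypothetical, and the stubs only stand in for an extraction model and a generation model.

```python
from typing import Callable, Dict

Extractor = Callable[[str], Dict[str, str]]
Generator = Callable[[Dict[str, str]], str]


def make_pipeline(extract: Extractor, generate: Generator) -> Callable[[str], dict]:
    """Compose an extraction stage and a generation stage into one callable."""

    def run(text: str) -> dict:
        fields = extract(text)
        return {"fields": fields, "draft": generate(fields)}

    return run


# Placeholder stages; real ones would wrap an extraction model and a generation model.
def stub_extract(text: str) -> Dict[str, str]:
    return {"issue": text.strip().rstrip(".")}


def stub_generate(fields: Dict[str, str]) -> str:
    return f"Thanks for reaching out about: {fields['issue']}."


support_pipeline = make_pipeline(stub_extract, stub_generate)
output = support_pipeline("Need clarification on cyber coverage limits.")
print(output["draft"])
```

Because each stage is just a callable, either model can be swapped out or tested on its own without touching the other.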

The key is matching the model architecture to the specific requirements of each agent's tasks. 