
# WCF AI Hands-On Lab: Claims Analytics & Fraud Detection

This Colab notebook is designed for a 1‑day training with the Workers Compensation Fund (WCF) tech team.
You'll work with an anonymized **WCF‑style sample dataset** and (optionally) the Kaggle **“Easy Peasy – Predict Worker Compensation Claims”** dataset.

**Objectives**
- Explore, clean, and visualize claims data.
- Detect anomalies (potential fraud) using an unsupervised model.
- Summarize injury text fields.
- (Optional) Load and compare with the Kaggle *Easy Peasy* dataset.


In [None]:

#@title Install and import libraries
#!pip -q install pandas scikit-learn matplotlib nltk kaggle --upgrade

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import IsolationForest
import nltk
import os
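# Newer NLTK releases look for 'punkt_tab', older ones for 'punkt'; fetch both to be safe
for resource in ("punkt", "punkt_tab"):
    nltk.download(resource, quiet=True)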
print("Setup complete.")


## 1) Load the WCF-style Sample Dataset

**Option A (Upload):** Download the CSV from the training resources, then upload here.  
**Option B (Google Drive):** Place the CSV in your Drive and load via path.

> File name: `synthetic_sample_claims.csv` (the name the loading cell expects)
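
Whichever option you use, the notebook only needs one valid path to the CSV. As a minimal sketch, the helper below returns the first candidate location that actually exists. The name `first_existing_path` is an illustrative assumption and is not part of the training resources.

```python
import os
from typing import Iterable, Optional

def first_existing_path(candidates: Iterable[str]) -> Optional[str]:
    """Return the first candidate that points to an existing file, else None."""
    for path in candidates:
        if os.path.isfile(path):
            return path
    return None
```

In Colab, the candidates would typically be a path under `/content` (for an upload) followed by a path under `/content/drive/MyDrive` (for a mounted Drive). A `None` result means the file still needs to be uploaded or moved.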


In [None]:
# Try the local dataset path first (relative to the notebook's working directory)
local_dataset_dir = os.path.join(os.getcwd(), "datasets")
local_csv_path = os.path.join(local_dataset_dir, "synthetic_sample_claims.csv")

# Fallback: absolute path to the repo dataset folder (machine-specific; edit for your setup)
repo_root = "/Users/aronkondoro/Library/Mobile Documents/com~apple~CloudDocs/Projects/WCF"
fallback_csv_path = os.path.join(repo_root, "dataset", "synthetic_sample_claims.csv")

csv_path = local_csv_path if os.path.exists(local_csv_path) else fallback_csv_path
# Fail early with a readable message instead of a pandas traceback
if not os.path.exists(csv_path):
    raise FileNotFoundError(f"Claims CSV not found at {local_csv_path} or {fallback_csv_path}")
print(f"Loading CSV from: {csv_path}")

df = pd.read_csv(csv_path, parse_dates=["Date_Filed"])
df.head()


## 2) Quick Exploration


In [None]:

df.info()
display(df.describe(include='all'))


In [None]:

df['Suspected_Fraud'].value_counts(normalize=True).rename('share').to_frame()


## 3) Visualize Claims


In [None]:

plt.figure()
plt.scatter(df['Claim_Amount_TZS'], df['Processing_Time_Days'])
plt.xlabel("Claim Amount (TZS)")
plt.ylabel("Processing Time (days)")
plt.title("Claims: Amount vs Processing Time")
plt.show()

In [None]:

plt.figure()
df.groupby('Region')['Claim_Amount_TZS'].mean().sort_values().plot(kind='bar')
plt.title("Average Claim Amount by Region")
plt.ylabel("TZS")
plt.tight_layout()
plt.show()


## 4) Anomaly Detection (Unsupervised)
We'll use **IsolationForest** on numerical features to flag potentially unusual claims.
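
To build intuition before running it on the claims data, here is a small sketch on synthetic points (made-up data, not WCF claims). It shows the two outputs you will usually look at: `predict`, which returns `-1` for flagged rows and `1` otherwise, and `decision_function`, where lower scores mean a row was easier to isolate and is therefore more unusual.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic stand-in data (an assumption for illustration only):
# 200 ordinary points near the origin plus 5 isolated, far-away points.
rng = np.random.default_rng(0)
normal_points = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
far_points = np.array([[8.0, 8.0], [-8.0, 8.0], [8.0, -8.0], [-8.0, -8.0], [0.0, 10.0]])
X = np.vstack([normal_points, far_points])

# contamination=0.05 asks the model to flag roughly 5% of rows
forest = IsolationForest(contamination=0.05, random_state=42).fit(X)

scores = forest.decision_function(X)  # lower = more anomalous
labels = forest.predict(X)            # -1 = flagged, 1 = not flagged

# Rank rows from most to least anomalous, e.g. to prioritise manual review
review_order = np.argsort(scores)
print("Flagged:", int((labels == -1).sum()), "of", len(X))
print("First rows to review:", review_order[:10])
```

Because `contamination` fixes the share of rows that get flagged, the model will always flag about 5% here, whether or not that many rows are truly unusual. Treat the flags as a review queue, not as a verdict.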


In [None]:

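features = df[['Claim_Amount_TZS', 'Processing_Time_Days', 'Age']].copy()
# Depending on the scikit-learn version, IsolationForest may reject missing values,
# so fill any gaps with each column's median (a simple, assumed imputation choice)
features = features.fillna(features.median())
# contamination=0.05 flags roughly 5% of claims; fit_predict returns -1 for outliers, 1 otherwise
model = IsolationForest(contamination=0.05, random_state=42)
df['anomaly'] = model.fit_predict(features)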

outliers = df[df['anomaly'] == -1]
print(f"Flagged {len(outliers)} / {len(df)} claims as unusual (~{len(outliers)/len(df):.1%}).")
outliers[['Claim_ID','Claim_Amount_TZS','Processing_Time_Days','Sector','Channel','Suspected_Fraud']].head(10)

In [None]:
plt.figure()
plt.scatter(df['Claim_Amount_TZS'], df['Processing_Time_Days'], c=(df['anomaly']==-1).astype(int))
plt.xlabel("Claim Amount (TZS)")
plt.ylabel("Processing Time (days)")
plt.title("Outliers Highlighted")
plt.show()
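
A natural follow-up is to check how the model's flags line up with the existing `Suspected_Fraud` column. The sketch below uses a toy table with made-up values, and it assumes `Suspected_Fraud` is coded as 0/1, which the real data may not follow.

```python
import pandas as pd

# Toy stand-in for the claims table (values are invented for illustration).
# Assumption: Suspected_Fraud is coded 0/1; adjust if the real column differs.
toy = pd.DataFrame({
    "anomaly":         [-1, -1, 1, 1, 1, 1, -1, 1],
    "Suspected_Fraud": [1, 0, 0, 0, 1, 0, 1, 0],
})

flagged = toy["anomaly"] == -1
suspected = toy["Suspected_Fraud"] == 1

# Cross-tabulate model flags against the existing label
table = pd.crosstab(flagged, toy["Suspected_Fraud"],
                    rownames=["flagged"], colnames=["Suspected_Fraud"])
print(table)

# Precision: share of flagged claims that were already suspected.
# Recall: share of suspected claims that the model flagged.
both = (flagged & suspected).sum()
precision = both / flagged.sum()
recall = both / suspected.sum()
print(f"precision={precision:.2f}, recall={recall:.2f}")
```

Low agreement is not necessarily a failure: an unsupervised model only finds claims that look unusual, and the existing labels may themselves be incomplete.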


## 5) Quick Text Summaries (NLP)


In [None]:

from nltk.tokenize import sent_tokenize

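def summarize_text(text):
    # Missing values arrive as NaN (a float), so only tokenize real strings
    if not isinstance(text, str):
        return ""
    # Naive "summary": keep only the first sentence
    sents = sent_tokenize(text)
    return sents[0] if sents else ""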

df['Summary'] = df['Injury_Description'].apply(summarize_text)
df[['Claim_ID','Injury_Description','Summary']].head(5)


## 6) (Optional) Load the Kaggle **“Easy Peasy – Predict Worker Compensation Claims”** Dataset

**One-time setup in Colab:**
1. Create a Kaggle account, then go to *Account → API → Create New Token*. This downloads `kaggle.json`.
2. Upload `kaggle.json` to Colab (left sidebar → files).
3. Run the cell below to place it at `~/.kaggle/kaggle.json` and download the dataset.

> Dataset slug: `lucamassaron/easy-peasy-its-lemon-squeezy`
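
Before running the download, it can help to confirm that the uploaded token file is usable. The sketch below checks that a file exists, parses as JSON, and contains the `username` and `key` fields that a Kaggle API token holds. The helper name `check_kaggle_token` is an illustrative assumption, not part of the Kaggle tooling.

```python
import json
import os

def check_kaggle_token(path: str) -> bool:
    """Return True if `path` is a JSON file with non-empty 'username' and 'key' fields."""
    if not os.path.isfile(path):
        return False
    try:
        with open(path, "r", encoding="utf-8") as fh:
            token = json.load(fh)
    except (OSError, json.JSONDecodeError):
        return False
    return (
        isinstance(token, dict)
        and bool(token.get("username"))
        and bool(token.get("key"))
    )
```

For example, `check_kaggle_token("/content/kaggle.json")` returning `False` right after the upload step suggests the file was not uploaded or is not a valid token. The check does not contact Kaggle, so it cannot tell whether the key is still active.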


In [None]:

#@title Kaggle API setup (run after uploading kaggle.json)
import os, shutil, zipfile, glob

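kaggle_dir = os.path.expanduser("~/.kaggle")
kaggle_token = os.path.join(kaggle_dir, "kaggle.json")
os.makedirs(kaggle_dir, exist_ok=True)
if os.path.exists("/content/kaggle.json"):
    shutil.move("/content/kaggle.json", kaggle_token)
# Stop with a clear message if the token was never uploaded (chmod would otherwise fail)
if not os.path.exists(kaggle_token):
    raise FileNotFoundError("kaggle.json not found; upload it to /content and rerun this cell.")
os.chmod(kaggle_token, 0o600)  # restrict the token to the current user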

!kaggle datasets download -d lucamassaron/easy-peasy-its-lemon-squeezy -p /content/easy_peasy -q
# Unzip
for z in glob.glob("/content/easy_peasy/*.zip"):
    with zipfile.ZipFile(z, 'r') as zip_ref:
        zip_ref.extractall("/content/easy_peasy")
print("Easy Peasy dataset ready at /content/easy_peasy")


In [None]:

# List files to identify CSVs to load
import os

for root, dirs, files in os.walk("/content/easy_peasy"):
    for f in files:
        if f.lower().endswith(".csv"):
            print(os.path.join(root, f))

# Example usage:
# easy_df = pd.read_csv("/content/easy_peasy/<replace_with_csv_name>.csv")
# easy_df.head()
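
Once both tables are loaded, a first comparison is simply which columns they share. The sketch below is one way to do that. The helper name `compare_columns` and the column names in the toy frames are hypothetical, and they say nothing about the real Kaggle schema.

```python
import pandas as pd

def compare_columns(left: pd.DataFrame, right: pd.DataFrame) -> dict:
    """Report which column names are shared and which are unique to each frame."""
    left_cols, right_cols = set(left.columns), set(right.columns)
    return {
        "shared": sorted(left_cols & right_cols),
        "only_left": sorted(left_cols - right_cols),
        "only_right": sorted(right_cols - left_cols),
    }

# Toy frames with hypothetical column names, for illustration only
frame_a = pd.DataFrame({"Claim_ID": [1], "Age": [40], "Region": ["North"]})
frame_b = pd.DataFrame({"Age": [35], "Other_Field": [500.0]})
report = compare_columns(frame_a, frame_b)
print(report)
```

Matching names do not guarantee matching meaning or units, so treat shared columns as candidates to inspect before combining anything.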



## 7) Next Steps
- Discuss which additional fields (e.g., diagnosis codes, employer compliance history) would improve detection.
- Draft a WCF pilot plan: data access, success metrics, integration points, and governance.
