# Injury Type Classification and Retrieval with Pinecone and Vector Embeddings

Every workplace—whether it's a bustling construction site, a corporate office, or a quiet warehouse—carries some level of risk. When accidents occur, be it a slip, a fall, or a repetitive strain injury, workers’ compensation plays a crucial role. This system ensures injured employees receive the support they need while helping businesses manage claims in a structured and efficient manner.

Now, it’s time for us to step into the action.

In this project, our goal is to analyze workers’ compensation claims and retrieve cases that match specific injury types based on user-defined query prompts. From uncovering patterns in back injuries and falls to detecting repetitive motion strains, our work aims to reveal how injuries occur and how similar claims are addressed. These insights have the power to inform better workplace safety practices and more strategic resource allocation.

---

## 📊 Data Overview

The dataset for this project consists of **synthetically generated workers’ compensation insurance policies**, where every case includes an accident. Each record provides:

- Demographic information  
- Worker-related details  
- A natural language description of the accident  

To maintain privacy and optimize performance, we've narrowed the dataset to the **first 100 records** of the original file sourced from [Kaggle](https://www.kaggle.com/). This sample allows for meaningful analysis while conserving OpenAI API usage.

📁 **Dataset:** `insurance_claims_top_100.csv`


| Column                     | Description                                                                                   |
|----------------------------|-----------------------------------------------------------------------------------------------|
| `'ClaimNumber'`            | Unique policy identifier. Each policy has a single claim in this synthetically generated data set.                                                                      |
| `'DateTimeOfAccident'`     | Date and time when the accident occurred, stored as an ISO 8601 timestamp (e.g. `2002-04-02T10:00:00Z`). |
| `'DateReported'`           | Date the accident was reported to the insurer, stored as an ISO 8601 timestamp (e.g. `2002-05-07T00:00:00Z`). |
| `'Age'`                    | Age of the worker involved in the claim.                                                      |
| `'Gender'`                 | Gender of the worker: `M` for Male, `F` for Female, or `U` for Unknown.                       |
| `'MaritalStatus'`          | Marital status of the worker: `M` (Married), `S` (Single), or Unknown.                        |
| `'DependentChildren'`      | Number of dependent children.                                                                 |
| `'DependentsOther'`        | Number of dependents excluding children.                                                      |
| `'WeeklyWages'`            | Total weekly wage of the worker.                                                              |
| `'PartTimeFullTime'`       | Employment type: `P` for Part-time or `F` for Full-time.                                      |
| `'HoursWorkedPerWeek'`     | Total hours worked per week by the worker.                                                    |
| `'DaysWorkedPerWeek'`      | Number of days worked per week by the worker.                                                 |
| `'ClaimDescription'`       | Free-text description of the claim, providing details about the incident.                     |
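| `'InitialIncurredClaimCost'` | Initial cost estimate for the claim made by the insurer. In the CSV loaded below, this column is spelled `InitialIncurredCalimsCost`. |
| `'UltimateIncurredClaimCost'` | Total claims payments by the insurance company. This is the target variable for prediction in the original dataset; it is not among the 14 columns loaded in this notebook. |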


In [74]:
# Install the OpenAI and Pinecone Python SDKs.
# `pinecone` supersedes the older `pinecone-client` package; installing both
# side by side leads to dependency conflicts, so only `pinecone` is installed.
!pip install openai pinecone

In [75]:
# Import the relevant Python libraries
import os

import pandas as pd
from openai import OpenAI
from pinecone import Pinecone, ServerlessSpec

In [76]:
# Load the insurance claims dataset into a pandas DataFrame
df = pd.read_csv('insurance_claims_top_100.csv')
print(df.head())
print(df.describe())

  ClaimNumber  ... InitialIncurredCalimsCost
0   WC8145235  ...                      5300
1   WC2005111  ...                      2000
2   WC6899143  ...                     20000
3   WC5502023  ...                       350
4   WC4785156  ...                      3000

[5 rows x 14 columns]
              Age  ...  InitialIncurredCalimsCost
count  100.000000  ...                 100.000000
mean    33.550000  ...                7119.520000
std     12.251881  ...                9734.215264
min     16.000000  ...                 229.000000
25%     23.750000  ...                 632.500000
50%     31.000000  ...                3250.000000
75%     41.000000  ...                9500.000000
max     67.000000  ...               43587.000000

[8 rows x 7 columns]


In [77]:
print("Take a look at the table:")
display(df)

Take a look at the table:


|     | ClaimNumber | DateTimeOfAccident | DateReported | Age | Gender | MaritalStatus | DependentChildren | DependentsOther | WeeklyWages | PartTimeFullTime | HoursWorkedPerWeek | DaysWorkedPerWeek | ClaimDescription | InitialIncurredCalimsCost |
|-----|-------------|--------------------|--------------|-----|--------|---------------|-------------------|-----------------|-------------|------------------|--------------------|-------------------|------------------|---------------------------|
| 0   | WC8145235 | 2002-04-02T10:00:00Z | 2002-05-07T00:00:00Z | 26 | M | S | 1 | 0 | 600.18 | F | 40.0 | 5 | CAUGHT RIGHT HAND WITH HAMMER BURN TO RIGHT HAND | 5300 |
| 1   | WC2005111 | 1988-04-06T16:00:00Z | 1988-04-15T00:00:00Z | 31 | M | M | 0 | 0 | 311.54 | F | 35.0 | 5 | SPRAINED RIGHT ANKLE FRACTURE RIGHT ELBOW | 2000 |
| 2   | WC6899143 | 1999-03-08T09:00:00Z | 1999-04-04T00:00:00Z | 57 | M | M | 0 | 0 | 1000.00 | F | 38.0 | 5 | STRUCK HAMMER CRUSH INJURY FINGERS HAND | 20000 |
| 3   | WC5502023 | 1996-07-26T09:00:00Z | 1996-09-04T00:00:00Z | 33 | M | M | 0 | 0 | 200.00 | F | 38.0 | 5 | STRUCK AGAINST AIR HOSE STRUCK GLASS LACERATIO... | 350 |
| 4   | WC4785156 | 1994-04-13T14:00:00Z | 1994-07-07T00:00:00Z | 32 | F | M | 0 | 0 | 359.60 | F | 40.0 | 5 | FOREIGN BODY IN RIGHT FOOT BRUISED RIGHT BIG TOE | 3000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 95  | WC7232597 | 2000-07-10T11:00:00Z | 2000-08-10T00:00:00Z | 62 | M | M | 0 | 0 | 200.00 | F | 38.0 | 5 | FELL DOWN LARGE STEP STRAIN LEFT CALF MUSCLE | 34000 |
| 96  | WC2603789 | 1990-02-23T09:00:00Z | 1990-04-02T00:00:00Z | 28 | M | S | 0 | 0 | 200.00 | F | 38.0 | 5 | NAIL GUN FIRED THROUGH TOE PUNCTURE WOUND LEFT... | 350 |
| 97  | WC8290625 | 2002-04-15T11:00:00Z | 2002-06-09T00:00:00Z | 44 | M | M | 0 | 0 | 500.00 | F | 38.0 | 5 | LIFTING TYRES LOWER BACK STRAIN | 1000 |
| 98  | WC9114830 | 2004-07-19T07:00:00Z | 2004-07-26T00:00:00Z | 17 | M | S | 0 | 0 | 500.00 | F | 40.0 | 5 | LIFTING FRIDGE RIGHT SHOULDER STRAIN NECK RIGH... | 5000 |


In [78]:
# API KEYS 
openai_api_key = os.environ["OPENAI_API_KEY"]
pinecone_api_key = os.environ["PINECONE_API_KEY"]
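
Reading a key with `os.environ[...]` raises a bare `KeyError` when the variable is unset. A minimal sketch of a friendlier lookup follows; the helper name `require_env` is hypothetical and not part of the original notebook.

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`, or raise a descriptive error."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"Environment variable {name} is not set; export it before running the notebook."
        )
    return value


# Hypothetical usage, mirroring the cell above:
# openai_api_key = require_env("OPENAI_API_KEY")
# pinecone_api_key = require_env("PINECONE_API_KEY")
```

Failing early with the variable name in the message tends to be easier to debug than a stack trace from deep inside a client constructor.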

In [79]:
# Pinecone and OpenAI client
client = OpenAI(api_key=openai_api_key)
pc = Pinecone(api_key=pinecone_api_key)

# Name of the Pinecone index
index_name = "insurance-claims"

# If the index does not exist yet, create it
if index_name not in pc.list_indexes().names():
    pc.create_index(
        name=index_name,
        dimension=1536,
        metric="euclidean",
        spec=ServerlessSpec(
            cloud="aws",
            region="us-east-1"
        )
    )

index = pc.Index(index_name)
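
The index is created with `dimension=1536`, which has to match the length of the vectors produced by the embedding model. The following sketch shows one way to make that dependency explicit; the `EMBEDDING_DIMENSIONS` table and the `check_index_dimension` helper are illustrative names and not part of the original notebook. The listed sizes are the default output dimensions of these OpenAI models.

```python
# Default output dimensions of some OpenAI embedding models.
EMBEDDING_DIMENSIONS = {
    "text-embedding-3-small": 1536,
    "text-embedding-3-large": 3072,
    "text-embedding-ada-002": 1536,
}


def check_index_dimension(model: str, index_dimension: int) -> None:
    """Raise ValueError if the index dimension does not match the model's vector length."""
    expected = EMBEDDING_DIMENSIONS[model]
    if expected != index_dimension:
        raise ValueError(
            f"{model} produces {expected}-dimensional vectors, "
            f"but the index expects {index_dimension}."
        )


# The combination used in this notebook passes the check.
check_index_dimension("text-embedding-3-small", 1536)
```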

In [80]:
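# Function to create an embedding for a single piece of text
def get_embedding(text, model="text-embedding-3-small"):
    # Replace newlines with spaces before sending the text to the embeddings endpoint
    text = text.replace("\n", " ")
    return client.embeddings.create(input=[text], model=model).data[0].embedding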
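
`get_embedding` sends one request per text. Since the embeddings endpoint accepts a list of inputs, several descriptions can be embedded in a single call, which may help conserve API usage. Below is a sketch under that assumption; `get_embeddings` (plural) is a hypothetical helper, and the stub client exists only so the example runs offline. In the notebook, the real `client` would be passed instead.

```python
from types import SimpleNamespace  # used only for the offline stub below


def get_embeddings(client, texts, model="text-embedding-3-small"):
    """Embed several texts in one request and return one vector per input, in input order."""
    cleaned = [text.replace("\n", " ") for text in texts]
    response = client.embeddings.create(input=cleaned, model=model)
    # Each returned item carries the index of the input it belongs to.
    ordered = sorted(response.data, key=lambda item: item.index)
    return [item.embedding for item in ordered]


class _StubEmbeddings:
    """Offline stand-in for `client.embeddings`; returns the text length as a 1-D 'vector'."""

    def create(self, input, model):
        data = [
            SimpleNamespace(index=i, embedding=[float(len(text))])
            for i, text in enumerate(input)
        ]
        return SimpleNamespace(data=data)


stub_client = SimpleNamespace(embeddings=_StubEmbeddings())
vectors = get_embeddings(stub_client, ["LOWER BACK STRAIN", "FELL\nFROM LADDER"])
print(len(vectors))
```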

In [68]:
# Creating the embeddings and upserting them to Pinecone
for idx, row in df.iterrows():
	embedding = get_embedding(row['ClaimDescription'])
	index.upsert([(str(row['ClaimNumber']), embedding)])
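
The loop above sends one upsert per row and stores no metadata, which is why the query results later in this notebook show `'metadata': {}`. A possible alternative is to build records that carry the claim description as metadata and to send them in batches. The sketch below covers only the record-building and batching steps; `build_records` and `chunked` are hypothetical helpers, the embedding function is a placeholder, and the actual upsert call is left as a comment because it needs a live index.

```python
def build_records(rows, embed):
    """Turn claim rows into Pinecone-style dicts that keep the description as metadata."""
    return [
        {
            "id": str(row["ClaimNumber"]),
            "values": embed(row["ClaimDescription"]),
            "metadata": {"ClaimDescription": row["ClaimDescription"]},
        }
        for row in rows
    ]


def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Offline demo with a placeholder embedding function.
# Assumption: in the notebook, rows could come from df.to_dict("records")
# and `embed` would be get_embedding.
sample_rows = [
    {"ClaimNumber": "WC8290625", "ClaimDescription": "LIFTING TYRES LOWER BACK STRAIN"},
    {"ClaimNumber": "WC7232597", "ClaimDescription": "FELL DOWN LARGE STEP STRAIN LEFT CALF MUSCLE"},
    {"ClaimNumber": "WC2005111", "ClaimDescription": "SPRAINED RIGHT ANKLE FRACTURE RIGHT ELBOW"},
]
records = build_records(sample_rows, embed=lambda text: [0.0, 0.0, 0.0])
batches = list(chunked(records, size=2))

# With a live index, each batch would be sent in one call:
# for batch in batches:
#     index.upsert(vectors=batch)
```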


In [69]:
# function to find similar claims
def find_similar_claims(query, top_k=5):
	# Generate the embedding for the query
	query_embedding = get_embedding(query)
	# Query the Pinecone index using keyword arguments
	results = index.query(vector=query_embedding, top_k=top_k, include_metadata=True)
	return results

In [70]:
# Example query
query = "Car accident with rear-end collision"
similar_claims = find_similar_claims(query)

# Print the results
print(similar_claims)

{'matches': [{'id': 'WC8133442',
              'metadata': {},
              'score': 0.893837214,
              'values': []},
             {'id': 'WC3625998',
              'metadata': {},
              'score': 0.978904724,
              'values': []},
             {'id': 'WC7425192',
              'metadata': {},
              'score': 1.10025048,
              'values': []},
             {'id': 'WC9445451',
              'metadata': {},
              'score': 1.14009118,
              'values': []},
             {'id': 'WC5114658',
              'metadata': {},
              'score': 1.15762162,
              'values': []}],
 'namespace': '',
 'usage': {'read_units': 6}}
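
Because the index was created with `metric="euclidean"`, the `score` values are distances rather than similarities: a lower score means a closer match, so the list above runs from most to least similar. For unit-length vectors (OpenAI embeddings are normalized to length 1), squared Euclidean distance equals `2 - 2 * cosine_similarity`, so both metrics produce the same ranking. The sketch below checks this on toy 2-D vectors invented purely for illustration.

```python
import math


def squared_euclidean(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy unit-length vectors (illustrative values, not real embeddings).
query_vec = (1.0, 0.0)
candidates = {"near": (0.8, 0.6), "mid": (0.6, 0.8), "far": (0.0, 1.0)}

# Euclidean: smallest distance first. Cosine: largest similarity first.
by_distance = sorted(candidates, key=lambda k: squared_euclidean(query_vec, candidates[k]))
by_cosine = sorted(candidates, key=lambda k: cosine_similarity(query_vec, candidates[k]), reverse=True)
print(by_distance == by_cosine)
```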


In [82]:
# finding the ID of the most similar claim
closest_claim_id = similar_claims['matches'][0]['id']
print(closest_claim_id)

WC8133442


In [83]:
# finding the description of the most similar claim
closest_claim_description = df[df['ClaimNumber'] == closest_claim_id]['ClaimDescription'].to_numpy()[0]
print(closest_claim_description)

COLLISION WITH MOTOR VEHICLE ACCIDENT SORE NECK


In [84]:
# Repeat the lookup for a repetitive-strain query
query = "Worker developed carpal tunnel syndrome from repetitive typing"
similar_claims_carpal_tunnel = find_similar_claims(query)
closest_claim_carpal_tunnel = similar_claims_carpal_tunnel['matches'][0]['id']
closest_claim_description_carpal_tunnel = df[df['ClaimNumber'] == closest_claim_carpal_tunnel]['ClaimDescription'].to_numpy()[0]
print(closest_claim_description_carpal_tunnel)

WHILE DEALING CARDS RIGHT TENDON SYNOVITIS RIGHT WRIST
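
The last few cells look up the description of the top match only. To inspect every returned match, the IDs can be mapped back to their descriptions in one pass. The sketch below uses a hypothetical helper, `describe_matches`, fed with two IDs and scores copied from the first query's output; only the first description is known from this notebook, so the second resolves to `None`.

```python
def describe_matches(matches, descriptions):
    """Pair each match's id and score with its claim description (None if the id is unknown)."""
    return [(m["id"], m["score"], descriptions.get(m["id"])) for m in matches]


# Assumption: with the notebook's DataFrame, the lookup could be built as
# descriptions = df.set_index("ClaimNumber")["ClaimDescription"].to_dict()
descriptions = {"WC8133442": "COLLISION WITH MOTOR VEHICLE ACCIDENT SORE NECK"}
matches = [
    {"id": "WC8133442", "score": 0.893837214},
    {"id": "WC3625998", "score": 0.978904724},
]

described = describe_matches(matches, descriptions)
for claim_id, score, text in described:
    print(f"{claim_id}  {score:.3f}  {text}")
```

Building the dictionary once avoids filtering the DataFrame separately for every match.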
