# NYCHHC AI-Powered Charge Capture System
### A Step-by-Step Implementation

**Objective:** This notebook demonstrates an AI-driven charge capture system using synthetic NYCHHC data. It follows these steps:
- **Synthetic Data Generation** (Clinical Notes & CPT Codes)
- **NLP-Based Preprocessing** (Extracting Procedures from Notes)
- **Retrieval-Augmented CPT Code Matching** (Using FAISS for Search)
- **AI-Based CPT Code Suggestions** (Using a GPT-like LLM)
- **HB vs. PB Billing Mismatch Detection** (Reconciling Hospital Billing and Professional Billing Charges)
- **Automated Report Generation** (Revenue Recovery Analysis)

## Step 1: Generate Synthetic Clinical Notes Data
**Goal:** Create a dataset with clinical notes, encounters, and assigned CPT codes.
- Roughly 30% of encounters are left with **missing CPT codes** to simulate revenue loss scenarios.
- We use **Faker** to generate random patient and encounter identifiers and filler note text.
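
The missing-charge simulation can be made reproducible by seeding the random generator. The sketch below uses only the standard library; `simulate_assignments` and `MISSING_RATE` are hypothetical names, and the 0.3 rate is assumed to mirror the threshold used in the generation cell.

```python
import random

MISSING_RATE = 0.3  # assumed share of encounters left without a CPT code

def simulate_assignments(codes, n, seed=42, missing_rate=MISSING_RATE):
    """Return n CPT codes, using "" where the charge is simulated as missing."""
    rng = random.Random(seed)  # local generator, so the global random state is untouched
    return [rng.choice(codes) if rng.random() > missing_rate else "" for _ in range(n)]

codes = ["47562", "74300", "99291"]
assigned = simulate_assignments(codes, n=1000)
missing_share = assigned.count("") / len(assigned)
print(f"missing share: {missing_share:.2f}")
```

Running the function twice with the same seed yields the same assignments, which keeps downstream results comparable between runs.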

In [1]:
import pandas as pd
import random
from faker import Faker

# Initialize Faker for synthetic data generation
fake = Faker()

# Generate synthetic clinical notes with corresponding CPT codes (some missing)
num_records = 100
cpt_code_list = [
    {"cpt": "47562", "desc": "Laparoscopic cholecystectomy"},
    {"cpt": "74300", "desc": "Intraoperative cholangiography"},
    {"cpt": "99291", "desc": "Critical care, first 30-74 minutes"},
    {"cpt": "45378", "desc": "Colonoscopy, diagnostic"},
    {"cpt": "66984", "desc": "Cataract removal with lens insertion"},
    {"cpt": "92950", "desc": "Cardiopulmonary resuscitation (CPR)"},
    {"cpt": "19318", "desc": "Breast reduction surgery"},
    {"cpt": "31622", "desc": "Bronchoscopy, diagnostic"},
    {"cpt": "64483", "desc": "Epidural injection, lumbar or sacral"},
    {"cpt": "20610", "desc": "Joint aspiration, major joint"}
]

synthetic_data = []
for _ in range(num_records):
    encounter_id = fake.unique.random_int(min=1000, max=9999)
    patient_id = fake.unique.random_int(min=10000, max=99999)
    note_text = fake.sentence() + " " + random.choice([
        "Laparoscopic cholecystectomy performed.", "Colonoscopy revealed polyps.",
        "Patient underwent bronchoscopy for lung biopsy.", "Administered CPR successfully.",
        "Cataract extraction with intraocular lens placement.", "Epidural steroid injection performed.",
        "Breast reduction surgery performed.", "Joint aspiration for synovial fluid analysis.",
        "Intraoperative cholangiography done during surgery.", "Critical care provided for 1 hour."
    ])

    # Randomly assign a CPT code, independent of the note text, or leave it blank
    # (about 30% of the time) to simulate missing charges
    assigned_cpt = random.choice(cpt_code_list) if random.random() > 0.3 else None

    synthetic_data.append({
        "encounter_id": encounter_id,
        "patient_id": patient_id,
        "note_text": note_text,
        "cpt_code": assigned_cpt["cpt"] if assigned_cpt else "",
        "cpt_desc": assigned_cpt["desc"] if assigned_cpt else ""
    })

# Convert to DataFrame
df_synthetic = pd.DataFrame(synthetic_data)
df_synthetic.head()

| | encounter_id | patient_id | note_text | cpt_code | cpt_desc |
|---|---|---|---|---|---|
| 0 | 7629 | 17194 | Provide simply reach important street. Laparos... | | |
| 1 | 1574 | 87437 | Yard partner something quality. Joint aspirati... | 74300.0 | Intraoperative cholangiography |
| 2 | 7177 | 52481 | Later argue cut. Patient underwent bronchoscop... | 74300.0 | Intraoperative cholangiography |
| 3 | 9341 | 11767 | Year time than book data. Colonoscopy revealed... | 47562.0 | Laparoscopic cholecystectomy |
| 4 | 9577 | 72569 | Field worker capital form although PM mission.... | 31622.0 | Bronchoscopy, diagnostic |


## Next Steps:
Now that we have **synthetic clinical notes and billing data**, we will proceed with:
- **Preprocessing Clinical Notes** (NLP extraction of procedures)
- **Retrieving Missing CPT Codes** (Using FAISS vector search)
- **AI-Based CPT Code Suggestions** (Using GPT or LLM models)
- **Billing Mismatch Detection** (Comparing HB vs. PB charges)
- **Automated Report Generation** (For charge audit)

Steps 2 and 3 follow below; the remaining items are planned.

## Step 2: NLP-Based Preprocessing of Clinical Notes
**Goal:** Extract relevant medical procedures from unstructured clinical notes using NLP.

**Techniques Used:**
- **Text Cleaning & Tokenization**
- **Named Entity Recognition (NER) with SpaCy**
- **Standardizing Procedure Descriptions**
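
As an illustration of the cleaning and tokenization technique listed above, here is a minimal standard-library sketch. `clean_note` and `tokenize` are hypothetical helpers, and the normalization rules (lowercasing, dropping punctuation other than hyphens) are assumptions rather than the notebook's own pipeline.

```python
import re

def clean_note(text):
    """Lowercase, drop punctuation except hyphens, and collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s-]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

def tokenize(text):
    """Split a cleaned note into word tokens."""
    return clean_note(text).split()

print(tokenize("Epidural  steroid injection performed."))
# ['epidural', 'steroid', 'injection', 'performed']
```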


In [2]:
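import spacy

# Load a general-purpose English model. Its NER labels (PERSON, ORG, DATE, ...) do not
# include PROCEDURE or TREATMENT, so with this model the filter below matches nothing
# and every note falls back to its full text. SciSpacy's 'en_core_sci_sm' can be swapped
# in, but the label filter must then be adjusted to the labels that model emits.
nlp = spacy.load("en_core_web_sm")

def extract_procedures(note_text):
    """Extract procedure mentions from a clinical note, falling back to the full note."""
    doc = nlp(note_text)
    procedures = [ent.text for ent in doc.ents if ent.label_ in ("PROCEDURE", "TREATMENT")]
    return procedures if procedures else [note_text]  # Default to full text if no entities found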

# Apply extraction to synthetic data
df_synthetic["extracted_procedures"] = df_synthetic["note_text"].apply(extract_procedures)
df_synthetic.head()

| | encounter_id | patient_id | note_text | cpt_code | cpt_desc | extracted_procedures |
|---|---|---|---|---|---|---|
| 0 | 7629 | 17194 | Provide simply reach important street. Laparos... | | | [Provide simply reach important street. Laparo... |
| 1 | 1574 | 87437 | Yard partner something quality. Joint aspirati... | 74300.0 | Intraoperative cholangiography | [Yard partner something quality. Joint aspirat... |
| 2 | 7177 | 52481 | Later argue cut. Patient underwent bronchoscop... | 74300.0 | Intraoperative cholangiography | [Later argue cut. Patient underwent bronchosco... |
| 3 | 9341 | 11767 | Year time than book data. Colonoscopy revealed... | 47562.0 | Laparoscopic cholecystectomy | [Year time than book data. Colonoscopy reveale... |
| 4 | 9577 | 72569 | Field worker capital form although PM mission.... | 31622.0 | Bronchoscopy, diagnostic | [Field worker capital form although PM mission... |
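
In the output above, `extracted_procedures` holds the full note for every row, which means the fallback branch was taken each time. One lightweight alternative, sketched under the assumption that a curated phrase list is available, is to match known procedure terms directly. `PROCEDURE_TERMS` and `extract_by_keyword` are hypothetical and not part of the original notebook.

```python
import re

# Hypothetical phrase list; a real system would draw on a maintained vocabulary.
PROCEDURE_TERMS = {
    "cholecystectomy": "Laparoscopic cholecystectomy",
    "cholangiography": "Intraoperative cholangiography",
    "colonoscopy": "Colonoscopy, diagnostic",
    "bronchoscopy": "Bronchoscopy, diagnostic",
    "joint aspiration": "Joint aspiration, major joint",
}

def extract_by_keyword(note_text):
    """Return standardized descriptions for every known term found in the note."""
    found = []
    for term, standard in PROCEDURE_TERMS.items():
        # Whole-word, case-insensitive match of the term inside the note
        if re.search(r"\b" + re.escape(term) + r"\b", note_text, flags=re.IGNORECASE):
            found.append(standard)
    return found

print(extract_by_keyword("Later argue cut. Patient underwent bronchoscopy for lung biopsy."))
# ['Bronchoscopy, diagnostic']
```

An empty list signals that no known term was found, so the caller can decide whether to fall back to the full note.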


## Step 3: Retrieving Missing CPT Codes Using FAISS
**Goal:** Use FAISS (Facebook AI Similarity Search) to find the best CPT code for each extracted procedure.

**Techniques Used:**
- **Embedding CPT Descriptions Using Sentence Transformers**
- **Building a FAISS Vector Index**
- **Retrieving the Closest CPT Code Match for Each Clinical Note**
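
A flat L2 index performs an exhaustive search: it compares the query against every stored vector and returns those with the smallest squared Euclidean distance. The sketch below reproduces that idea in plain Python on toy 3-dimensional vectors. `flat_l2_search` is a hypothetical helper, not a FAISS API, and the vectors are made up for illustration.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def flat_l2_search(vectors, query, top_k=1):
    """Return (distance, position) pairs for the top_k closest vectors."""
    scored = sorted((squared_l2(v, query), i) for i, v in enumerate(vectors))
    return scored[:top_k]

# Toy stand-ins for embedded CPT descriptions
toy_vectors = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
results = flat_l2_search(toy_vectors, [0.9, 0.1, 0.0], top_k=2)
print(results)
```

The returned positions play the same role as the indices used to look up entries in `cpt_code_list`.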


In [3]:
import faiss
from sentence_transformers import SentenceTransformer

# Load Sentence Transformer Model for Embeddings
model = SentenceTransformer("all-MiniLM-L6-v2")

# Prepare CPT Descriptions for Indexing
cpt_texts = [c["desc"] for c in cpt_code_list]
cpt_vectors = model.encode(cpt_texts)

# Build FAISS Index
dimension = cpt_vectors.shape[1]
index = faiss.IndexFlatL2(dimension)
index.add(cpt_vectors)

def retrieve_cpt(procedure_text, top_k=1):
    """Finds the most relevant CPT code for a given procedure."""
    query_vector = model.encode([procedure_text])
    _, indices = index.search(query_vector, top_k)
    return cpt_code_list[indices[0][0]]  # Return the best match

# Apply Retrieval to Extracted Procedures
df_synthetic["suggested_cpt"] = df_synthetic["extracted_procedures"].apply(lambda x: retrieve_cpt(x[0]) if x else None)
df_synthetic.head()


| | encounter_id | patient_id | note_text | cpt_code | cpt_desc | extracted_procedures | suggested_cpt |
|---|---|---|---|---|---|---|---|
| 0 | 7629 | 17194 | Provide simply reach important street. Laparos... | | | [Provide simply reach important street. Laparo... | {'cpt': '47562', 'desc': 'Laparoscopic cholecy... |
| 1 | 1574 | 87437 | Yard partner something quality. Joint aspirati... | 74300.0 | Intraoperative cholangiography | [Yard partner something quality. Joint aspirat... | {'cpt': '20610', 'desc': 'Joint aspiration, ma... |
| 2 | 7177 | 52481 | Later argue cut. Patient underwent bronchoscop... | 74300.0 | Intraoperative cholangiography | [Later argue cut. Patient underwent bronchosco... | {'cpt': '31622', 'desc': 'Bronchoscopy, diagno... |
| 3 | 9341 | 11767 | Year time than book data. Colonoscopy revealed... | 47562.0 | Laparoscopic cholecystectomy | [Year time than book data. Colonoscopy reveale... | {'cpt': '45378', 'desc': 'Colonoscopy, diagnos... |
| 4 | 9577 | 72569 | Field worker capital form although PM mission.... | 31622.0 | Bronchoscopy, diagnostic | [Field worker capital form although PM mission... | {'cpt': '99291', 'desc': 'Critical care, first... |
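
With a suggested code per encounter, the billed and suggested codes can be compared to surface candidates for review. The sketch below uses plain dictionaries rather than the DataFrame so that it runs on its own; `normalize_code`, `classify_charge`, and the status labels are hypothetical. Because the table above shows billed codes rendered as floats such as `74300.0`, the helper normalizes them first. The first two sample rows reuse values from that table, and the third is invented to show a match.

```python
def normalize_code(value):
    """Turn values such as 74300.0, "74300", "", None or NaN into a code string."""
    # value != value is True only for NaN, which is how missing cells often appear
    if value is None or value == "" or value != value:
        return ""
    text = str(value).strip()
    return text[:-2] if text.endswith(".0") else text

def classify_charge(billed, suggested):
    """Label an encounter as 'missing', 'match', or 'mismatch'."""
    billed, suggested = normalize_code(billed), normalize_code(suggested)
    if not billed:
        return "missing"
    return "match" if billed == suggested else "mismatch"

rows = [
    {"encounter_id": 7629, "billed": "", "suggested": "47562"},
    {"encounter_id": 1574, "billed": 74300.0, "suggested": "20610"},
    {"encounter_id": 9999, "billed": "31622", "suggested": "31622"},  # invented row
]
for row in rows:
    print(row["encounter_id"], classify_charge(row["billed"], row["suggested"]))
```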


In [4]:
import nbformat

def convert_ipynb_to_clean_py(ipynb_file, py_file):
    """Write the non-empty code cells of a notebook to a plain Python script."""
    with open(ipynb_file, 'r', encoding='utf-8') as f:
        notebook_content = nbformat.read(f, as_version=4)

    python_code = []
    for cell in notebook_content.cells:
        if cell.cell_type == "code" and cell.source.strip():  # Ignore empty code cells
            # Drop blank lines and trailing whitespace only; leading indentation is
            # significant in Python and must be preserved.
            cleaned_code = "\n".join(line.rstrip() for line in cell.source.split("\n") if line.strip())
            python_code.append(cleaned_code)

    with open(py_file, 'w', encoding='utf-8') as f:
        f.write("\n\n".join(python_code))  # Ensure proper spacing between code blocks

# Convert this notebook to a clean Python script
convert_ipynb_to_clean_py("NYCHHC_Charge_Capture_System_v2.ipynb", "NYCHHC_Charge_Capture_System_v2.py")



## Next Steps:
Now that we have **preprocessed clinical notes and retrieved suggested CPT codes**, we will proceed with:
- **AI-Based CPT Code Suggestions (GPT-Like Models for Text Analysis)**
- **Billing Mismatch Detection (Comparing Hospital and Professional Charges)**
- **Automated Reporting (Charge Audit Analysis)**
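
As a preview of the billing mismatch step, one possible approach is to compare the set of CPT codes billed on the hospital side with the set billed on the professional side for each encounter. This is only a sketch under that assumption: `find_mismatches` and the sample charges are hypothetical, and real reconciliation rules (for example, codes that legitimately appear on only one side) are not modeled.

```python
def find_mismatches(hb_charges, pb_charges):
    """Map encounter_id -> codes present on one side only.

    Both arguments map encounter_id to a set of CPT codes.
    """
    report = {}
    for encounter_id in sorted(set(hb_charges) | set(pb_charges)):
        hb = hb_charges.get(encounter_id, set())
        pb = pb_charges.get(encounter_id, set())
        if hb != pb:
            report[encounter_id] = {
                "hb_only": sorted(hb - pb),
                "pb_only": sorted(pb - hb),
            }
    return report

# Invented sample charges for illustration
hb = {1001: {"47562", "74300"}, 1002: {"45378"}}
pb = {1001: {"47562"}, 1002: {"45378"}, 1003: {"99291"}}
mismatches = find_mismatches(hb, pb)
print(mismatches)
```

Encounters whose code sets agree on both sides are omitted from the report, so its size gives a quick count of encounters needing review.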

Stay tuned for Step 4! 🚀