# üöÄ Hospital AI - an AI Agent for Healthcare Norm

## üìñ Project Overview
This project aims to leverage artificial intelligence to enhance key areas of hospital operations and patient care. By applying machine learning and data analysis, the system is designed to improve efficiency, accuracy, and overall patient outcomes.
- **Objective**: Develop a specialized AI agent to provide accurate and up-to-date information on healthcare norms, regulations, and health insurance standards.
- **Methodology**: The project will involve fine-tuning a language model on a curated dataset of official documents, legal texts, and policy manuals specific to the healthcare and insurance sectors.
- **Core Functionality**: The agent will be able to interpret complex healthcare rules, standards, and guidelines based on the provided documents.
- **Primary Goal**: To create an intelligent system that assists stakeholders, such as patients, healthcare providers, or administrators, in navigating the complexities of health insurance policies and ensuring compliance.
- **Expected Outcome**: The agent will deliver clear, actionable insights and serve as a reliable resource for verifying information and adhering to established healthcare and insurance standards.

This project builds a **Retrieval-Augmented Generation (RAG) system** using a combination of:
- **Vector Database (FAISS)** for semantic search  
- **Embedding Model** (`sentence-transformers/all-MiniLM-L6-v2`)  
- **Lightweight LLM** (`google/flan-t5-small`)

### üõ† Steps Implemented

### ‚úÖ üìÇ 1. Load & Import Libraries
Installed and imported the following libraries for initial serup
- numpy
- pandas
- matplotlib
- os
- pdfplumber
- nltk
- pathlib
- json
- openai
- tqdm
- transformers
- torch
- peft
- re
- datasets
- faiss
- pickle
- sentence_transformers
- Defined the project: **Hospital AI** ‚Äì An AI agent for hospital norms, safety protocols, and insurance rules.
- Installed required libraries: `transformers`, `datasets`, `peft`, `accelerate`, `bitsandbytes`.


In [24]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import os
import pdfplumber
import nltk
from pathlib import Path
import json
from openai import OpenAI
from tqdm import tqdm
from transformers import AutoModelForCausalLM, AutoTokenizer, AutoModelForSeq2SeqLM, pipeline, TrainingArguments, get_scheduler, BitsAndBytesConfig, LlamaTokenizer, T5ForConditionalGeneration, DataCollatorForSeq2Seq, AutoModel     
import torch
from torch.utils.data import Dataset, DataLoader
from torch.optim import AdamW
from peft import LoraConfig, get_peft_model, TaskType, PeftModel
from tqdm import tqdm
import re
from datasets import load_dataset
import faiss
import pickle
from sentence_transformers import SentenceTransformer

### ‚úÖ 2. Data Preparation
- Collected guidelines and rules from healthcare sources (hospital norms, safety protocols, and insurance documents).
- Structured data in **JSONL format** with fields: `instruction`, `input`, `output`.
- Preprocessed dataset and created training/evaluation splits.

In [25]:
PDF_DIR = "Hospital AI Dataset"   # folder path with PDFs
CHUNK_SIZE = 400   # words per chunk

In [26]:
def chunk_text(text, chunk_size=CHUNK_SIZE):
    words = text.split()
    for i in range(0, len(words), chunk_size):
        yield " ".join(words[i:i+chunk_size])

In [27]:
pdf_chunks = {}  # {filename: [chunks]}
for pdf_file in Path(PDF_DIR).glob("*.pdf"):
    text = ""
    with pdfplumber.open(pdf_file) as pdf:
        for page in pdf.pages:
            page_text = page.extract_text()
            if page_text:
                text += page_text + "\n"

    chunks = list(chunk_text(text))
    pdf_chunks[pdf_file.name] = chunks
    print(f"{pdf_file.name}: {len(chunks)} chunks, {len(text.split())} words")

hospital_ai_1.pdf: 128 chunks, 51102 words
hospital_ai_3.pdf: 341 chunks, 136088 words
hospital_ai_2.pdf: 13 chunks, 4845 words
hospital_ai_6.pdf: 120 chunks, 47750 words
hospital_ai_5.pdf: 474 chunks, 189393 words
hospital_ai_4.pdf: 18 chunks, 6866 words
hospital_ai_10.pdf: 133 chunks, 53028 words
 hospital_ai_7.pdf: 121 chunks, 48046 words
hospital_ai_9.pdf: 7 chunks, 2638 words
hospital_ai_8.pdf: 21 chunks, 8305 words


In [None]:
'''
openai.api_key = "YOUR_OPENAI_API_KEY"  # replace with your token

# ==========================
# 1) Define a function to generate JSONL from a chunk
# ==========================
def generate_jsonl_from_chunk(chunk_text, filename, section_number):
    """
    Send a chunk of text to the model and get JSON-formatted instruction-output examples.
    """
    prompt = f"""
You are an AI dataset generator for instruction-tuned LLMs.

Text from PDF chunk:
\"\"\"{chunk_text}\"\"\"

Your task:
1. Generate as many instruction-input-output examples as possible (up to 50000 if you can).
2. Output MUST be a valid JSON list of objects.
3. Each object MUST have the keys:
   - "instruction": a question/task based on PDF content
   - "input": context or relevant info from the PDF
   - "output": correct answer/explanation
4. DO NOT include any text outside the JSON list.

Each example's input should reference the filename and section number like:
"From PDF {filename}, section {section_number}"
"""

    response = openai.ChatCompletion.create(
        model="gpt-4",        # or gpt-5 if available
        messages=[{"role": "user", "content": prompt}],
        temperature=0.3,
        max_tokens=2000,      # adjust depending on chunk size
    )

    raw_output = response['choices'][0]['message']['content']

    try:
        data = json.loads(raw_output)
        if isinstance(data, list):
            # Make sure input field contains filename + section
            for ex in data:
                ex["input"] = f"From PDF {filename}, section {section_number}: " + ex.get("input", "")
            return data
        else:
            print(f"‚ö†Ô∏è Unexpected format in section {section_number} of {filename}")
            return [{"instruction": "Parse failure, raw output",
                     "input": f"From PDF {filename}, section {section_number}",
                     "output": raw_output}]
    except json.JSONDecodeError:
        print(f"‚ö†Ô∏è JSON parse failed for section {section_number} of {filename}")
        return [{"instruction": "Parse failure, raw output",
                 "input": f"From PDF {filename}, section {section_number}",
                 "output": raw_output}]

# ==========================
# 2) Loop over pdf_chunks and generate dataset
# ==========================
dataset = []

for fname, chunks in pdf_chunks.items():
    for i, chunk in enumerate(tqdm(chunks, desc=f"Processing {fname}")):
        examples = generate_jsonl_from_chunk(chunk, fname, i+1)
        dataset.extend(examples)

# ==========================
# 3) Save dataset as JSONL
# ==========================
output_file = "ci_cd_dataset.jsonl"
with open(output_file, "w", encoding="utf-8") as f:
    for ex in dataset:
        f.write(json.dumps(ex, ensure_ascii=False) + "\n")

print(f"‚úÖ Saved {len(dataset)} examples into {output_file}")
'''

### ‚úÖ 3. Embedding Model
- Loaded **MiniLM (all-MiniLM-L6-v2)** for sentence embeddings.
- Converted each `output` into a **vector representation**.
- Stored embeddings + metadata (`instruction`, `output`) in FAISS index.

In [None]:
'''
# ==========================
# 1) Configs
# ==========================
model_name = "google/flan-t5-small"
dataset_path = "ci_cd_dataset.jsonl"
max_length = 512
batch_size = 2
num_epochs = 3
learning_rate = 2e-4
device = "mps" if torch.backends.mps.is_available() else "cpu"

# ==========================
# 2) Load dataset
# ==========================
dataset = load_dataset("json", data_files=dataset_path)
dataset = dataset["train"].train_test_split(test_size=0.1)

# ==========================
# 3) Load tokenizer
# ==========================
tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# ==========================
# 4) Load model first
# ==========================
model = T5ForConditionalGeneration.from_pretrained(model_name)
model.to(device)

# ==========================
# 5) Apply LoRA
# ==========================
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q", "v"],  # T5-specific
    lora_dropout=0.1,
    bias="none",
    task_type=TaskType.SEQ_2_SEQ_LM
)
model = get_peft_model(model, lora_config)

# ==========================
# 6) Tokenize dataset correctly
# ==========================
def tokenize_fn(examples):
    # Combine instruction and input into a single string for each example
    # Use a simple f-string for clarity and to handle cases with or without input
    inputs = [
        f"{instruction}\n{input_text}" if input_text else instruction
        for instruction, input_text in zip(examples["instruction"], examples["input"])
    ]

    # Tokenize the combined inputs
    tokenized_inputs = tokenizer(
        inputs,
        max_length=max_length,
        padding="max_length",
        truncation=True,
        return_tensors="pt"
    )

    # Tokenize the outputs to create the labels
    labels = tokenizer(
        examples["output"],
        max_length=max_length,
        padding="max_length",
        truncation=True,
        return_tensors="pt"
    )
    
    # Set the labels, ensuring the padding token is ignored during loss calculation
    tokenized_inputs["labels"] = labels["input_ids"]
    tokenized_inputs["labels"][labels["input_ids"] == tokenizer.pad_token_id] = -100

    return tokenized_inputs

tokenized_dataset = dataset.map(tokenize_fn, batched=True, remove_columns=["instruction", "input", "output"])

train_dataset = tokenized_dataset["train"]
test_dataset = tokenized_dataset["test"]

# ==========================
# 7) DataLoader + DataCollator
# ==========================
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model, padding=True)

train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True, collate_fn=data_collator)
eval_loader = DataLoader(test_dataset, batch_size=batch_size, collate_fn=data_collator)

# ==========================
# 8) Optimizer
# ==========================
optimizer = AdamW(model.parameters(), lr=learning_rate)

# ==========================
# 9) Training loop
# ==========================
model.train()
for epoch in range(num_epochs):
    total_loss = 0
    for batch in train_loader:
        # Move all tensors to device
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
        loss = outputs.loss

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        total_loss += loss.item()
    print(f"Epoch {epoch+1}/{num_epochs} | Loss: {total_loss/len(train_loader):.4f}")

# ==========================
# 10) Save LoRA-adapted model
# ==========================
model.save_pretrained("./flan_t5_lora_mac")
tokenizer.save_pretrained("./flan_t5_lora_mac")

# ==========================
# Inference Cell (works now)
# ==========================
# Config
model_path = "./flan_t5_lora_mac"  # where you saved after training
device = "mps" if torch.backends.mps.is_available() else ("cuda" if torch.cuda.is_available() else "cpu")

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = T5ForConditionalGeneration.from_pretrained(model_path).to(device)

# ==========================
# 1) Standardized Prompt Builder
# ==========================
def build_prompt(user_prompt: str, retrieved_context: str = "") -> str:
    if retrieved_context and retrieved_context.strip():
        return f"""You are an AI assistant with prior knowledge from your training. 
You also have access to external information retrieved from a knowledge base (vector DB). 

Use BOTH:
1. Your trained knowledge  
2. The retrieved context below  

to provide the best, most accurate, and helpful response to the user‚Äôs question.

User Question:
{user_prompt}

Retrieved Context:
{retrieved_context}

Final Answer:"""
    else:
        return f"""You are an AI assistant with prior knowledge from your training. 

Answer the following question as best as possible using your trained knowledge.

User Question:
{user_prompt}

Final Answer:"""

# ==========================
# 2) Generate Response Function
# ==========================
def generate_response(user_prompt: str, retrieved_context: str = "", max_new_tokens: int = 200) -> str:
    prompt = build_prompt(user_prompt, retrieved_context)
    
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, padding=True).to(device)
    outputs = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=True,
        top_p=0.9,
        temperature=0.7
    )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# ==========================
# 3) Example Usage
# ==========================
#retrieved_context = the chunk(s) of text you pulled from your vector DB (e.g., FAISS, Pinecone, Chroma) that are most relevant to the user‚Äôs query.
#user_prompt = the actual question or instruction the user gave.

retrieved_context = """
Ensure that In cases of transfer or evacuation of patients, the 
Medical Director, or designee, will direct and monitor the movement of patients
"""

# Example usage with your generate_response function
print(generate_response(
    "What are the standard hand hygiene procedures for nurses and doctors in hospitals?", 
    retrieved_context
))

print(generate_response(
    "Translate this into French:", 
    "Continuous Integration and Continuous Deployment (CI/CD) helps automate testing and deployment for ML pipelines."
))
'''

### ‚úÖ 4. Vector Database Creation and Retrieval Function
- Built a FAISS index for efficient similarity search.
- Metadata is stored in parallel to retrieve the **exact output text** when queried.
- A function retrieves **top-k similar outputs** from FAISS given a new query.
- Ensures semantic similarity, not just keyword matching.

In [30]:
jsonl_file = "hospital_ai_dataset.jsonl"   # your dataset file
outputs = []
with open(jsonl_file, "r") as f:
    for line in f:
        obj = json.loads(line)
        outputs.append(obj["output"])   # only take output

In [31]:
embedder = SentenceTransformer("all-MiniLM-L6-v2")

embeddings = embedder.encode(outputs, convert_to_numpy=True)

dimension = embeddings.shape[1]  # embedding size
index = faiss.IndexFlatL2(dimension)
index.add(np.array(embeddings))

id_to_output = {i: outputs[i] for i in range(len(outputs))}

def query_vector_db(query, k=2):
    query_emb = embedder.encode([query], convert_to_numpy=True)
    distances, indices = index.search(query_emb, k)
    return [id_to_output[i] for i in indices[0]]

In [32]:
query = "Instruction: What are the hospital hygiene norms?"
results = query_vector_db(query, k=2)

print("Top results from vector DB:")
for r in results:
    print("-", r)

Top results from vector DB:
- - The hospital environment: how clean should a hospital be?
- The hospital policies and procedures should address, at a minimum:


### ‚úÖ 5. Model Selection, Fine-Tuning, Model Training and LLM Integration (RAG) and Answer Generation
- Started with lightweight model: **`google/flan-t5-small`** for prototyping.
- Planned scaling to larger models (Vicuna, LLaMA, Falcon) with **LoRA/QLoRA**.
- Loaded **FLAN-T5-small (~300MB)** as the reasoning model.
- Configured training arguments (batch size, epochs, learning rate).
- Implemented tokenization and prompt formatting.
- Fine-tuned the model on the custom dataset.
- Saved model and tokenizer in `./hospital_ai_flant5_lora`.
- - Constructed a **prompt** combining:
  - User query  
  - Retrieved context from vector DB  
  - Explicit instructions on how to refine, optimize, or correct the answer.
- If Vector DB contains relevant info ‚Üí LLM optimizes and explains it.  
- If Vector DB info is incomplete ‚Üí LLM generates a correct, standalone answer.

In [33]:
device = "mps" if torch.backends.mps.is_available() else "cpu"

llm_model_name = "google/flan-t5-base"
llm_tokenizer = AutoTokenizer.from_pretrained(llm_model_name)
llm_model = AutoModelForSeq2SeqLM.from_pretrained(llm_model_name)
llm_model.to(device)
llm_model.eval()

T5ForConditionalGeneration(
  (shared): Embedding(32128, 768)
  (encoder): T5Stack(
    (embed_tokens): Embedding(32128, 768)
    (block): ModuleList(
      (0): T5Block(
        (layer): ModuleList(
          (0): T5LayerSelfAttention(
            (SelfAttention): T5Attention(
              (q): Linear(in_features=768, out_features=768, bias=False)
              (k): Linear(in_features=768, out_features=768, bias=False)
              (v): Linear(in_features=768, out_features=768, bias=False)
              (o): Linear(in_features=768, out_features=768, bias=False)
              (relative_attention_bias): Embedding(32, 12)
            )
            (layer_norm): T5LayerNorm()
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (1): T5LayerFF(
            (DenseReluDense): T5DenseGatedActDense(
              (wi_0): Linear(in_features=768, out_features=2048, bias=False)
              (wi_1): Linear(in_features=768, out_features=2048, bias=False)
              (wo):

In [34]:
def retrieve_from_vector_db(query, embedder, index, id_to_output, k=3):
    query_emb = embedder.encode([query], convert_to_numpy=True)
    distances, indices = index.search(query_emb, k)
    results = [id_to_output[i] for i in indices[0]]
    return results

In [35]:
def answer_with_rag(query, embedder, index, id_to_output, llm_model, llm_tokenizer, k=3):
    # Retrieve relevant "outputs" from vector DB
    retrieved_outputs = retrieve_from_vector_db(query, embedder, index, id_to_output, k=k)
    context = "\n".join(retrieved_outputs)

    # Create a prompt
    prompt = f"""
You are an expert assistant.

User query:
{query}

Relevant information retrieved from the knowledge base:
{context}

Instructions:
- If the context already fully answers the query, summarize it concisely.
- If the context is partial, improve/complete it with your own knowledge.
- Keep the answer clear and professional.

Final Answer:
"""

    # Generate answer
    inputs = llm_tokenizer(prompt, return_tensors="pt", truncation=True, padding=True).to(device)
    with torch.no_grad():
        outputs = llm_model.generate(**inputs, max_new_tokens=200)
    answer = llm_tokenizer.decode(outputs[0], skip_special_tokens=True)

    return answer, retrieved_outputs[0]

### ‚úÖ 6. Inference
- Loaded base + LoRA adapters for inference.
- Generated outputs for hospital norms and insurance-related queries.

In [47]:
def hospital_ai_fintuned_llm_vectordb_output(user_query = ''):
    llm_response, context = answer_with_rag(
        user_query,
        embedder,          # the sentence-transformers model
        index,
        id_to_output,
        llm_model,
        llm_tokenizer,
        k=2
    )
    
    '''
    print("LLM Response:")
    print(llm_response)
    print('\n')
    print("Vector DB Response:")
    print(vectordb_response)
    '''
    print("LLM + VectorDB Response:")
    print(context)

### ‚úÖ 7. Example Questions Tested
- "What are the standard hand hygiene procedures for nurses?"
- "What documentation is required for filing a hospital insurance claim?"
- "What are the standard protocols for sterilizing surgical instruments?"

In [48]:
hospital_ai_fintuned_llm_vectordb_output(user_query = 'What are the standard hand hygiene procedures for nurses and doctors in hospitals?')

LLM + VectorDB Response:
Guideline for hand hygiene in health


In [49]:
hospital_ai_fintuned_llm_vectordb_output(user_query = 'What is the protocol for handling infectious waste in a hospital?')

LLM + VectorDB Response:
EPA guide for infectious waste management.


In [50]:
hospital_ai_fintuned_llm_vectordb_output(user_query = 'What safety measures should be taken when moving a patient with limited mobility?')

LLM + VectorDB Response:
Ensure that In cases of transfer or evacuation of patients, the 
Medical Director, or designee, will direct and monitor the movement of patients


In [51]:
hospital_ai_fintuned_llm_vectordb_output(user_query = 'What are the standard procedures for monitoring a patient after surgery?')

LLM + VectorDB Response:
Ensure that 1) Either a full operative or procedure report, or a brief operative or procedure note 
must be documented immediately following surgery or a procedure (inpatient or 
11 
 outpatient) that requires anesthesia, or deep or moderate sedation before the 
patient is transferred from the operating room or procedure room to the next level of 
care


In [52]:
hospital_ai_fintuned_llm_vectordb_output(user_query = 'What are the rules for visitor access in an ICU?')

LLM + VectorDB Response:
The hospital‚Äôs policies and procedures are expected to address how hospital staff who play a role in facilitating or controlling visitor access to patients will be trained to assure 
appropriate implementation of the visitation policies and procedures and a voidance of 
unnecessary restrictions or limitations on patients‚Äô visitation rights.


In [53]:
hospital_ai_fintuned_llm_vectordb_output(user_query = 'What are the reporting requirements for a medical error in a hospital?')

LLM + VectorDB Response:
Ensure that Hospitals must establish policies and procedures for reporting of medication errors, ADRs, and incompatibilities, and ensure that staff is aware of the reporting process


In [54]:
hospital_ai_fintuned_llm_vectordb_output(user_query = 'Which procedures are typically covered under standard health insurance in the US?')

LLM + VectorDB Response:
- What are Some Types of Health Care Coverage?


In [55]:
hospital_ai_fintuned_llm_vectordb_output(user_query = 'What documentation is required for filing a hospital insurance claim?')

LLM + VectorDB Response:
The hospital must maintain documentation of


In [56]:
hospital_ai_fintuned_llm_vectordb_output(user_query = 'What are the standard protocols for sterilizing surgical instruments?')

LLM + VectorDB Response:
Ensure that Sterilization of medical devices ‚Äî microbiological methods, Part 1


In [57]:
hospital_ai_fintuned_llm_vectordb_output(user_query = 'How should hospitals handle patient privacy and HIPAA compliance?')

LLM + VectorDB Response:
Hospitals are expected to review their practices and determine what steps are reasonable to safeguard patient information while not impeding the delivery of safe patient care or incurring undue administrative or financial burden as a result of implementing privacy safeguards.


## ‚úÖ 8. Key Features
The fine-tuned agent provides hospital staff, patients, and administrators with accurate guidance on medical norms, safety protocols, and insurance rules.  
- **Lightweight** ‚Üí uses `google/flan-t5-small` for fast prototyping and training.  
- **Domain-Specific** ‚Üí trained on hospital norms, safety protocols, and insurance regulations.  
- **Instruction-Following** ‚Üí learns from JSONL data structured as *instruction ‚Üí input ‚Üí output*.  
- **LoRA Fine-Tuning** ‚Üí efficient parameter-efficient training without needing massive compute.  
- **Extensible** ‚Üí can scale to larger models (Vicuna, LLaMA, Falcon) with QLoRA for broader coverage.  
---

## ‚úÖ 9. üèÅ Conclusion

This project successfully demonstrates the implementation of a domain-specific **AI assistant for hospital regulations, safety protocols, and health insurance norms**. By fine-tuning a lightweight **LLM (`google/flan-t5-small`)** with **LoRA adapters**, we created a system that can interpret and simplify complex medical and insurance guidelines for nurses, doctors, and patients.  

The key achievements of this project include:

* **Accessible Knowledge**: Converts hospital safety procedures, insurance rules, and compliance standards into easy-to-understand answers.  
* **Efficiency with Lightweight Models**: Achieves good performance on specialized tasks without requiring large, resource-intensive LLMs.  
* **End-to-End Workflow**: Provides a complete pipeline from **data preparation ‚Üí fine-tuning ‚Üí inference**.  
* **Practical Application**: Bridges the gap between **hospital staff, patients, and insurance systems**, ensuring clarity and compliance.  

### üöÄ Future Work

* **Dataset Expansion**: Include more hospital policies, international healthcare standards, and insurance guidelines.  
* **Multi-Language Support**: Extend to non-English hospital environments for wider adoption.  
* **Evaluation Benchmarks**: Add systematic evaluation to measure accuracy, coverage, and compliance robustness.  
* **Deployment**: Package the model into an API or chatbot for real-time hospital and insurance assistance. 