## Load Data

In [None]:
import pandas as pd

In [None]:
df = pd.read_excel('/content/Assignment_Data.xlsx')
print(df.shape)
df.head()

(200, 9)


Unnamed: 0,patient_id,age,gender,diagnosis_code,num_previous_admissions,medication_type,length_of_stay,readmitted_30_days,discharge_note
0,1,71,Male,D002,3,Type C,2,0,Good recovery trajectory. Follow-up scan sched...
1,2,34,Female,D002,1,Type B,3,1,Stable post-surgery. Advised to avoid physical...
2,3,80,Male,D002,2,Type C,5,1,Symptoms controlled. Monitoring for relapse ad...
3,4,40,Female,D002,2,Type C,11,0,Stable post-surgery. Advised to avoid physical...
4,5,43,Female,D001,1,Type C,8,1,Stable post-surgery. Advised to avoid physical...


# **LLM Approach**

In [None]:
# !pip install langchain transformers torch pandas
# !pip install langchain-core langchain-huggingface
# !pip install langchain-community
# !pip install --upgrade langchain langchain-core langchain-community -q
# !pip install accelerate

In [None]:
import json
from typing import Optional

import torch
from datasets import Dataset
from langchain_core.output_parsers import PydanticOutputParser
from langchain_core.prompts import PromptTemplate
from langchain_huggingface import HuggingFacePipeline
from pydantic import BaseModel, Field
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

### Load Model

In [None]:
### Load Model (Using Phi-3)

model_id = "microsoft/Phi-3-mini-4k-instruct"

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto"
)

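# Text-generation pipeline with greedy decoding (do_sample=False).
# No temperature is passed: it has no effect without sampling.
llm_pipeline = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=256,
    return_full_text=False,
    do_sample=False,
)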

# Wrapping in LangChain
llm = HuggingFacePipeline(pipeline=llm_pipeline)


Device set to use cuda:0


### Prompting for consistent output

In [None]:
class ClinicalEntities(BaseModel):
    Diagnosis: Optional[str] = Field(default=None)
    Treatment: Optional[str] = Field(default=None)
    Symptoms: Optional[str] = Field(default=None)
    Medications: Optional[str] = Field(default=None)
    Follow_up_actions: Optional[str] = Field(default=None, alias="Follow-up actions")

parser = PydanticOutputParser(pydantic_object=ClinicalEntities)
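
Because `Follow_up_actions` is declared with an alias, validation expects the aliased key `Follow-up actions` in the incoming JSON, while dumping uses the Python field name unless `by_alias=True` is requested. The sketch below illustrates this with a trimmed, hypothetical copy of the model (`EntitiesSketch`) and a made-up JSON string; it assumes Pydantic v2.

```python
from typing import Optional

from pydantic import BaseModel, Field


# Trimmed, hypothetical copy of ClinicalEntities (two fields only).
class EntitiesSketch(BaseModel):
    Diagnosis: Optional[str] = Field(default=None)
    Follow_up_actions: Optional[str] = Field(default=None, alias="Follow-up actions")


# Made-up model output that uses the aliased key from the prompt.
raw = '{"Diagnosis": null, "Follow-up actions": "Follow-up scan next week"}'
parsed = EntitiesSketch.model_validate_json(raw)

by_name = parsed.model_dump()                # keys are the Python field names
by_alias = parsed.model_dump(by_alias=True)  # keys match the JSON keys in the prompt
print(by_name)
print(by_alias)
```

If downstream code should see the same keys as the prompt, dumping with `by_alias=True` is one option.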

In [None]:
prompt_template_text = """<|user|>
You are a clinical information extraction model.
Extract the following entities from the discharge note and return valid JSON.

Discharge note: "{note}"

Return a JSON object with these exact keys. If information is missing, use `null`.
{{
  "Diagnosis": null,
  "Treatment": null,
  "Symptoms": null,
  "Medications": null,
  "Follow-up actions": null
}}
<|end|>
<|assistant|>
"""

In [None]:
structured_prompt = PromptTemplate.from_template(prompt_template_text)

### Building the Chain

In [None]:
extract_chain = structured_prompt | llm | parser
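
Small instruction-tuned models sometimes add a sentence before or after the JSON, which can make strict parsing fail. A minimal sketch of a tolerant pre-parsing step, using only the standard library, is shown below. `extract_json_object` and `EXPECTED_KEYS` are hypothetical names that are not part of LangChain, and the sample string is made up.

```python
import json

# Keys requested in the prompt; anything the model omits is filled with None.
EXPECTED_KEYS = ["Diagnosis", "Treatment", "Symptoms", "Medications", "Follow-up actions"]


def extract_json_object(text: str) -> dict:
    """Return the first JSON object found in text, restricted to EXPECTED_KEYS."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start` and
            # ignores whatever follows it.
            obj, _ = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            obj = None
        if isinstance(obj, dict):
            return {key: obj.get(key) for key in EXPECTED_KEYS}
        start = text.find("{", start + 1)
    # No parseable object found: return an all-null record.
    return {key: None for key in EXPECTED_KEYS}


sample = 'Here is the result: {"Diagnosis": "pneumonia", "Follow-up actions": "Review in 2 weeks"} Hope this helps.'
result = extract_json_object(sample)
print(result)
```

A helper like this could be applied to the raw model string before validation, so that a stray sentence around the JSON does not discard an otherwise usable record.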

### Extraction

In [None]:
dataset = Dataset.from_pandas(df[['discharge_note']])

results = dataset.map(
    lambda batch: {
        "clinical_entities": [
            entity.model_dump_json() if entity else None
            for entity in extract_chain.batch(
                [{"note": text} for text in batch["discharge_note"]]
            )
        ]
    },
    batched=True,
    batch_size=8,
)

df["clinical_entities"] = results["clinical_entities"]


In [None]:
df.head()

Unnamed: 0,patient_id,age,gender,diagnosis_code,num_previous_admissions,medication_type,length_of_stay,readmitted_30_days,discharge_note,clinical_entities
0,1,71,Male,D002,3,Type C,2,0,Good recovery trajectory. Follow-up scan sched...,"{""Diagnosis"":null,""Treatment"":null,""Symptoms"":..."
1,2,34,Female,D002,1,Type B,3,1,Stable post-surgery. Advised to avoid physical...,"{""Diagnosis"":null,""Treatment"":null,""Symptoms"":..."
2,3,80,Male,D002,2,Type C,5,1,Symptoms controlled. Monitoring for relapse ad...,"{""Diagnosis"":null,""Treatment"":null,""Symptoms"":..."
3,4,40,Female,D002,2,Type C,11,0,Stable post-surgery. Advised to avoid physical...,"{""Diagnosis"":null,""Treatment"":null,""Symptoms"":..."
4,5,43,Female,D001,1,Type C,8,1,Stable post-surgery. Advised to avoid physical...,"{""Diagnosis"":null,""Treatment"":null,""Symptoms"":..."


In [None]:
df.to_excel("output_llm.xlsx", index=False)

# **spaCy Approach**

In [None]:
import spacy
from spacy.matcher import Matcher
import json

### Load Model

In [None]:
nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

### Create generic Patterns

In [None]:
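# NOTE: spaCy's default English tokenizer splits hyphenated words into separate
# tokens ("follow-up" -> "follow", "-", "up"), so the single-token patterns below
# for "post-surgery", "check-up", and "follow-up" do not match as written.
diagnosis_patterns = [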
    [{"LOWER": {"IN": ["diagnosed", "diagnosis"]}}, {"LOWER": "with"}, {"POS": "NOUN"}],
    [{"LEMMA": {"IN": ["infection", "pneumonia", "hypertension", "diabetes", "recovery", "relapse", "condition"]}}],
    [{"LOWER": "post-surgery"}],
    [{"LOWER": "symptoms"}, {"LOWER": "controlled"}]
]

treatment_patterns = [
    [{"LOWER": {"IN": ["therapy", "operation", "procedure", "monitoring", "check-up", "discharge"]}}],
    [{"LOWER": "follow-up"}, {"LOWER": {"IN": ["scan", "visit"]}}],
    [{"LOWER": "under"}, {"LOWER": {"IN": ["observation", "treatment"]}}],
]

symptom_patterns = [
    [{"LOWER": {"IN": ["fever", "pain", "discomfort", "fatigue", "complications"]}}],
    [{"LOWER": "signs"}, {"LOWER": "of"}, {"LOWER": "infection"}],
    [{"LOWER": "no"}, {"LOWER": {"IN": ["complications", "symptoms"]}}],
]

followup_patterns = [
    [{"LOWER": {"IN": ["advised", "recommended", "scheduled", "continue"]}}],
    [{"LOWER": "follow-up"}],
    [{"LOWER": "next"}, {"LOWER": {"IN": ["week", "month"]}}],
    [{"LOWER": "review"}, {"LOWER": {"IN": ["appointment", "visit"]}}],
]

medication_patterns = [
    [{"LOWER": "type"}, {"IS_ALPHA": True, "LENGTH": 1}],
    [{"LOWER": {"IN": ["medication", "drug", "tablet", "dose"]}}],
    [{"LOWER": "continue"}, {"LOWER": {"IN": ["medication", "treatment"]}}],
    [{"LOWER": "under"}, {"LOWER": "medication"}],
]


In [None]:
# Add to matcher
matcher.add("DIAGNOSIS", diagnosis_patterns)
matcher.add("TREATMENT", treatment_patterns)
matcher.add("SYMPTOM", symptom_patterns)
matcher.add("FOLLOW_UP", followup_patterns)
matcher.add("MEDICATION", medication_patterns)
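
Patterns such as `{"LOWER": "follow-up"}` describe a single token, but spaCy's default English tokenizer splits hyphenated words into several tokens. The sketch below checks this on a made-up sentence and shows a hyphen-aware alternative. It uses a blank English pipeline (`nlp_sketch`) rather than `en_core_web_sm`, which is enough because these patterns rely only on token text; the variable names and labels are assumptions for illustration.

```python
import spacy
from spacy.matcher import Matcher

# A blank English pipeline has the default tokenizer but no tagger or parser.
nlp_sketch = spacy.blank("en")
doc = nlp_sketch("Follow-up scan scheduled next week.")
tokens = [token.text for token in doc]
print(tokens)

sketch_matcher = Matcher(nlp_sketch.vocab)
# Single-token pattern: no individual token has the lowercase form "follow-up".
sketch_matcher.add("SINGLE_TOKEN", [[{"LOWER": "follow-up"}]])
# Hyphen-aware pattern: one token spec per piece produced by the tokenizer.
sketch_matcher.add("MULTI_TOKEN", [[{"LOWER": "follow"}, {"ORTH": "-"}, {"LOWER": "up"}]])

# Map each matched label to the text of its span.
matched = {
    nlp_sketch.vocab.strings[match_id]: doc[start:end].text
    for match_id, start, end in sketch_matcher(doc)
}
print(matched)
```

The same three-token form would apply to "post-surgery" and "check-up".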

### Model Execution

In [None]:
notes = df["discharge_note"].tolist()
all_entities = []

for note in notes:
    doc = nlp(note)
    matches = matcher(doc)

    # Initialize structured dictionary for categories
    entity_dict = {
        "DIAGNOSIS": [],
        "TREATMENT": [],
        "SYMPTOM": [],
        "FOLLOW_UP": [],
        "MEDICATION": []
    }

    for match_id, start, end in matches:
        label = nlp.vocab.strings[match_id]
        span_text = doc[start:end].text.strip()
        if span_text not in entity_dict[label]:
            entity_dict[label].append(span_text)

    # Convert to compact JSON string
    all_entities.append(json.dumps(entity_dict, ensure_ascii=False))

# Add results as new column
df["extracted_entities"] = all_entities

In [None]:
df[["patient_id", "discharge_note", "extracted_entities"]]

Unnamed: 0,patient_id,discharge_note,extracted_entities
0,1,Good recovery trajectory. Follow-up scan sched...,"{""DIAGNOSIS"": [""recovery""], ""TREATMENT"": [], ""..."
1,2,Stable post-surgery. Advised to avoid physical...,"{""DIAGNOSIS"": [], ""TREATMENT"": [], ""SYMPTOM"": ..."
2,3,Symptoms controlled. Monitoring for relapse ad...,"{""DIAGNOSIS"": [""Symptoms controlled"", ""relapse..."
3,4,Stable post-surgery. Advised to avoid physical...,"{""DIAGNOSIS"": [], ""TREATMENT"": [], ""SYMPTOM"": ..."
4,5,Stable post-surgery. Advised to avoid physical...,"{""DIAGNOSIS"": [], ""TREATMENT"": [], ""SYMPTOM"": ..."
...,...,...,...
195,196,Good recovery trajectory. Follow-up scan sched...,"{""DIAGNOSIS"": [""recovery""], ""TREATMENT"": [], ""..."
196,197,Patient discharged with minor discomfort. Advi...,"{""DIAGNOSIS"": [], ""TREATMENT"": [], ""SYMPTOM"": ..."
197,198,Discharge after recovery from pneumonia. No co...,"{""DIAGNOSIS"": [""recovery"", ""pneumonia""], ""TREA..."
198,199,Blood pressure under control. Continue current...,"{""DIAGNOSIS"": [], ""TREATMENT"": [], ""SYMPTOM"": ..."


In [None]:
# Save to new CSV
df.to_csv("output_spacy.csv", index=False)

# Results and Summary

- Both approaches successfully extracted structured information from unstructured discharge notes, though their coverage and consistency differed.

>### **1. LLM extraction approach**

- The Phi-3-mini LLM produced more natural, context-aware output and frequently identified follow-up instructions ("Recommend follow-up in 2 weeks", "Continue current medication", "Advised to avoid physical exertion").

- Coverage was strongest for the Follow-up actions entity, which was populated in almost every record with high semantic variety.

- It occasionally extracted a clear Diagnosis (“pneumonia”, “mild reaction to medication”) and Treatment (“recovery”, “advised rest and hydration”), though less consistently across notes.

- In contrast, Medications and Symptoms were sparsely populated and often left as null, suggesting that while the model captured explicit follow-up text well, it missed some embedded mentions.
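
Coverage statements like the ones above can be checked by counting, per entity, how many records hold a non-null value. The following is a minimal sketch, assuming each entry of `clinical_entities` is either a JSON string or `None`; `entity_coverage` is a hypothetical helper and the sample rows are made up for illustration, not taken from the dataset.

```python
import json


def entity_coverage(json_rows):
    """Fraction of rows with a non-null, non-empty value for each key."""
    # Rows where extraction failed (None) count as records with no entities.
    rows = [json.loads(row) if row else {} for row in json_rows]
    keys = sorted({key for row in rows for key in row})
    total = len(rows)
    if total == 0:
        return {}
    return {key: sum(1 for row in rows if row.get(key)) / total for key in keys}


# Made-up sample rows in the same shape as the clinical_entities column.
sample_rows = [
    '{"Diagnosis": null, "Medications": null, "Follow_up_actions": "Follow-up scan next week"}',
    '{"Diagnosis": "pneumonia", "Medications": null, "Follow_up_actions": "Continue current medication"}',
    None,
    '{"Diagnosis": null, "Medications": null, "Follow_up_actions": "Advised to avoid physical exertion"}',
]
coverage = entity_coverage(sample_rows)
print(coverage)
```

On the real data this could be called as `entity_coverage(df["clinical_entities"].tolist())`, provided failed rows are stored as `None`.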

>### **2. spaCy extraction approach**

- The spaCy rule-based extractor provided structured but narrower outputs.

- It consistently captured keywords like “recovery”, “infection”, and “pneumonia”, as well as the surface pattern “Symptoms controlled”, under Diagnosis, and terms like “discomfort” under Symptom.

- However, many entries included only generic terms like “Advised” or “Continue” in Follow-up, without capturing the complete instruction, showing limited contextual understanding.

- Its pattern-based logic also led to overlapping captures: “complications” is matched on its own even inside the negated phrase “No complications”, so both spans are recorded for the same note.
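
One way to reduce such overlapping captures is to keep only the longest match when spans with the same label overlap. The sketch below does this on plain `(label, start, end)` tuples with token offsets, the same shape the `Matcher` returns once the match ID is resolved to a label. `keep_longest` is a hypothetical helper and the sample matches are made up. spaCy also provides `spacy.util.filter_spans`, which applies a similar longest-span rule to a list of `Span` objects.

```python
def keep_longest(matches):
    """Drop matches that overlap a longer match carrying the same label."""
    kept = []
    # Visit the longest spans first; ties go to the earlier start offset.
    ordered = sorted(matches, key=lambda m: (-(m[2] - m[1]), m[1]))
    for label, start, end in ordered:
        overlaps = any(
            label == kept_label and start < kept_end and kept_start < end
            for kept_label, kept_start, kept_end in kept
        )
        if not overlaps:
            kept.append((label, start, end))
    # Return the surviving matches in document order.
    return sorted(kept, key=lambda m: m[1])


# Made-up matches for a note starting "No complications ...":
# "complications" alone is tokens 1-2, "No complications" is tokens 0-2.
sample_matches = [("SYMPTOM", 1, 2), ("SYMPTOM", 0, 2), ("FOLLOW_UP", 5, 6)]
filtered = keep_longest(sample_matches)
print(filtered)
```

This only removes the duplicate span. The negated phrase is still stored under Symptom, so distinguishing absent from present symptoms would need a separate rule.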

>## **Overall:**

- LLM (Phi-3): broader semantic understanding, better contextual interpretation of follow-up actions, occasional misses in less explicit mentions.

- spaCy: higher precision for explicit keyword-based extractions but limited recall and context comprehension.

>## **Risks and Limitations:**

- The LLM (Phi-3), while effective at interpreting clinical text, may hallucinate or infer entities that are not explicitly mentioned, posing a risk in sensitive clinical settings.

- As a general-purpose language model, Phi-3 is not specialized for medical text, which may limit its accuracy in recognizing medical abbreviations, drug names, or complex diagnostic phrasing.

- The rule-based spaCy approach mitigates hallucination risk but suffers from low recall and cannot capture implicit information, reducing overall completeness.

>## **Closing remarks:**

- The outputs indicate that while the LLM captured more actionable clinical insights, the spaCy outputs were more interpretable and reproducible, each reflecting its strength: semantic generalization vs. rule-based precision.

- Both methods highlight the need for human validation and potential fine-tuning or domain adaptation before deployment in real-world healthcare workflows.