## **Entity Extraction using a Medical-NER model**

In [None]:
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

tokenizer = AutoTokenizer.from_pretrained("blaze999/Medical-NER")
model = AutoModelForTokenClassification.from_pretrained("blaze999/Medical-NER")

ner_pipeline = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

Device set to use cuda:0


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
import pandas as pd

file_path = '/content/drive/MyDrive/MTS-Dialog-Augmented-TrainingSet-1-En-FR-EN-2402-Pairs.csv'
df = pd.read_csv(file_path)

df.head()

Unnamed: 0,ID,section_header,section_text,dialogue
0,0,GENHX,The patient is a 76-year-old white female who ...,Doctor: What brings you back into the clinic t...
1,1,GENHX,The patient is a 25-year-old right-handed Cauc...,Doctor: How're you feeling today? \r\nPatient...
2,2,GENHX,"This is a 22-year-old female, who presented to...","Doctor: Hello, miss. What is the reason for yo..."
3,3,MEDICATIONS,Prescribed medications were Salmeterol inhaler...,Doctor: Are you taking any over the counter me...
4,4,CC,"Burn, right arm.","Doctor: Hi, how are you? \r\nPatient: I burned..."


In [None]:
# Define a function that applies the NER pipeline to extract entities from a given dialogue text.
def extract_entities(dialogue):
    # If dialogue is NaN or empty, return an empty list
    if not isinstance(dialogue, str) or dialogue.strip() == "":
        return []
    return ner_pipeline(dialogue)

# Create a new column in the dataframe with the extracted entities.
df['extracted_entities'] = df['dialogue'].apply(extract_entities)

# Display a few results
print(df[['ID', 'extracted_entities']].head())

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset


  ID                                 extracted_entities
0  0  [{'entity_group': 'MEDICATION', 'score': 0.241...
1  1  [{'entity_group': 'SIGN_SYMPTOM', 'score': 0.8...
2  2  [{'entity_group': 'SIGN_SYMPTOM', 'score': 0.7...
3  3  [{'entity_group': 'DETAILED_DESCRIPTION', 'sco...
4  4  [{'entity_group': 'BIOLOGICAL_STRUCTURE', 'sco...
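
The pipeline warning above recommends against calling the pipeline one row at a time on GPU. A minimal sketch of one alternative, assuming the pipeline accepts a list of texts together with a `batch_size` argument (as Hugging Face pipelines generally do), is shown below. The helper name `extract_entities_batched` and the batch size of 16 are illustrative choices, not part of the original notebook.

```python
from typing import Any, Callable, List, Sequence


def extract_entities_batched(
    dialogues: Sequence[Any],
    pipe: Callable[..., List[list]],
    batch_size: int = 16,  # assumed value; tune to the available GPU memory
) -> List[list]:
    """Run `pipe` once over all non-empty dialogues, keeping row alignment."""
    # Start with an empty result for every row, mirroring extract_entities().
    results: List[list] = [[] for _ in dialogues]

    # Keep only real, non-blank strings and remember where they came from.
    valid = [
        (pos, text)
        for pos, text in enumerate(dialogues)
        if isinstance(text, str) and text.strip()
    ]
    if not valid:
        return results

    positions, texts = zip(*valid)
    # A single call with a list lets the pipeline batch the inputs internally.
    outputs = pipe(list(texts), batch_size=batch_size)

    # Put each output back at the position of its source row.
    for pos, entities in zip(positions, outputs):
        results[pos] = entities
    return results
```

In the notebook this could be used as `df['extracted_entities'] = extract_entities_batched(df['dialogue'].tolist(), ner_pipeline)`.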


In [None]:
for idx, row in df.head(5).iterrows():
    print(f"ID: {row['ID']}")
    print("Extracted Entities:")
    for entity in row['extracted_entities']:
        print(f" - {entity}")
    print("-" * 50)

ID: 0
Extracted Entities:
 - {'entity_group': 'MEDICATION', 'score': np.float32(0.24135911), 'word': 'blood', 'start': 97, 'end': 103}
 - {'entity_group': 'DIAGNOSTIC_PROCEDURE', 'score': np.float32(0.14658338), 'word': 'pressure', 'start': 103, 'end': 112}
 - {'entity_group': 'THERAPEUTIC_PROCEDURE', 'score': np.float32(0.19238539), 'word': 'medicine', 'start': 112, 'end': 121}
 - {'entity_group': 'DISEASE_DISORDER', 'score': np.float32(0.53962463), 'word': 'hypertension', 'start': 205, 'end': 218}
 - {'entity_group': 'DISEASE_DISORDER', 'score': np.float32(0.54485345), 'word': 'osteoarthritis', 'start': 219, 'end': 234}
 - {'entity_group': 'DISEASE_DISORDER', 'score': np.float32(0.6104604), 'word': 'osteoporosis', 'start': 235, 'end': 248}
 - {'entity_group': 'DISEASE_DISORDER', 'score': np.float32(0.58835787), 'word': 'hypothyroidism', 'start': 249, 'end': 264}
 - {'entity_group': 'DISEASE_DISORDER', 'score': np.float32(0.66229117), 'word': 'allergic rhinitis', 'start': 265, 'end': 
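
Several of the entities printed above have low confidence scores (for example, `blood` tagged as `MEDICATION` at about 0.24). A minimal sketch of post-filtering the pipeline output by score and grouping the surviving words by entity type is shown below. The helper `filter_and_group` and the 0.5 threshold are assumptions for illustration; the notebook itself applies no threshold.

```python
from collections import defaultdict
from typing import Dict, List


def filter_and_group(entities: List[dict], min_score: float = 0.5) -> Dict[str, List[str]]:
    """Keep entities scoring at least `min_score`, grouped by entity type."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for ent in entities:
        # float() also handles the np.float32 scores returned by the pipeline.
        if float(ent["score"]) >= min_score:
            grouped[ent["entity_group"]].append(ent["word"])
    return dict(grouped)


# A few entries copied from the output for ID 0 (offsets omitted for brevity).
sample = [
    {"entity_group": "MEDICATION", "score": 0.24135911, "word": "blood"},
    {"entity_group": "DIAGNOSTIC_PROCEDURE", "score": 0.14658338, "word": "pressure"},
    {"entity_group": "DISEASE_DISORDER", "score": 0.53962463, "word": "hypertension"},
    {"entity_group": "DISEASE_DISORDER", "score": 0.6104604, "word": "osteoporosis"},
]

print(filter_and_group(sample))
# {'DISEASE_DISORDER': ['hypertension', 'osteoporosis']}
```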

## **Entity Extraction using LLMs**

In [None]:
%xterm

In [None]:
!ollama list

NAME               ID              SIZE      MODIFIED       
gemma3:latest      a2af6cc3eb7f    3.3 GB    11 seconds ago    
meditron:latest    ad11a6250f54    3.8 GB    24 minutes ago    


In [None]:
from langchain_core.prompts import ChatPromptTemplate
from langchain_ollama import OllamaLLM
import pandas as pd

In [None]:
multi_shot_template = """
Below are a couple of examples of extracting medical entities from a conversation between a doctor and a patient.
Your response must be only a valid JSON object with no additional commentary.

Example 1:
Input Dialogue:
Doctor: Hello, how are you today?
Patient: I'm feeling dizzy and have a mild fever.
Output:
{{"Doctor": [], "Patient": ["dizzy", "fever"]}}

Example 2:
Input Dialogue:
Doctor: Are you experiencing any pain?
Patient: Yes, I have a severe headache and some chest pain.
Output:
{{"Doctor": [], "Patient": ["headache", "chest pain"]}}

Now, process the following dialogue exactly as in the examples, and output only a valid JSON object with the keys "Doctor" and "Patient":
Input Dialogue:
{dialogue}

Answer: Provide your answer in valid JSON only.
"""

In [None]:
prompt = ChatPromptTemplate.from_template(multi_shot_template)
# Named `llm` so it does not shadow the Hugging Face `model` loaded earlier.
llm = OllamaLLM(model="gemma3:latest")
chain = prompt | llm

In [None]:
def extract_entities_from_dialogue(dialogue: str) -> str:
    if not isinstance(dialogue, str) or dialogue.strip() == "":
        return "{}"
    result = chain.invoke({"dialogue": dialogue})
    return result

In [None]:
# Work on an explicit copy so that adding a column does not trigger SettingWithCopyWarning.
subset_df = df.head(15).copy()
subset_df["extracted_entities"] = subset_df["dialogue"].apply(extract_entities_from_dialogue)
print(subset_df[["ID", "extracted_entities"]].head())

  ID                                 extracted_entities
0  0  ```json\n{"Doctor": ["hypertension", "osteoart...
1  1  ```json\n{"Doctor": [], "Patient": ["headache"...
2  2  ```json\n{"Doctor": [], "Patient": ["warts", "...
3  3  ```json\n{"Doctor": ["Salmeterol inhaler", "Fl...
4  4  ```json\n{"Doctor": [], "Patient": ["burned ha...


In [None]:
import json

def parse_json(text: str):
    """Parse the model's reply as JSON, tolerating a surrounding markdown code fence."""
    try:
        # Trim surrounding whitespace so a leading newline does not hide the fence.
        text = text.strip()
        # Check and remove markdown code fences if found.
        if text.startswith("```"):
            lines = text.splitlines()
            # Drop the opening fence line (``` or ```json).
            lines = lines[1:]
            # Drop the closing fence line if present.
            if lines and lines[-1].strip().startswith("```"):
                lines = lines[:-1]
            text = "\n".join(lines).strip()
        return json.loads(text)
    except (json.JSONDecodeError, AttributeError, TypeError) as e:
        # Malformed JSON or a non-string value (e.g. NaN): report it and fall back to {}.
        print(f"Error parsing JSON: {e}")
        return {}

subset_df.loc[:, 'parsed_entities'] = subset_df['extracted_entities'].apply(parse_json)
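
Fence stripping covers the replies seen so far, but a model may also wrap the JSON in extra commentary. A hedged alternative, sketched below, scans for the first position where a JSON object can be decoded, using `json.JSONDecoder.raw_decode` from the standard library. The function name `extract_first_json_object` is hypothetical and not part of the notebook.

```python
import json


def extract_first_json_object(text: str) -> dict:
    """Return the first decodable JSON object found in `text`, or {} if none."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start` and
            # ignores whatever trails it (closing fences, commentary).
            obj, _end = decoder.raw_decode(text, start)
            if isinstance(obj, dict):
                return obj
        except json.JSONDecodeError:
            pass  # not valid JSON from this brace; try the next one
        start = text.find("{", start + 1)
    return {}


fenced = '```json\n{"Doctor": [], "Patient": ["headache"]}\n```'
print(extract_first_json_object(fenced))
# {'Doctor': [], 'Patient': ['headache']}
```

This trades the explicit fence handling for a more permissive scan, so it will also accept an object embedded in a sentence of prose.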

In [None]:
for idx, row in subset_df.iterrows():
    print(f"ID: {row['ID']}")
    print("Extracted Entities:")
    # Format the JSON dictionary with indentation for readability
    print(json.dumps(row['parsed_entities'], indent=2))
    print("-" * 40)

ID: 0
Extracted Entities:
{
  "Doctor": [
    "hypertension",
    "osteoarthritis",
    "osteoporosis",
    "hypothyroidism",
    "allergic rhinitis",
    "kidney stones",
    "fever",
    "chills",
    "cough",
    "congestion",
    "nausea",
    "vomiting",
    "chest pain",
    "chest pressure"
  ],
  "Patient": [
    "blood pressure medicine",
    "fever",
    "chills",
    "cough",
    "congestion",
    "nausea",
    "vomiting",
    "chest pain",
    "chest pressure"
  ]
}
----------------------------------------
ID: 1
Extracted Entities:
{
  "Doctor": [],
  "Patient": [
    "headache",
    "blurry vision",
    "lightheadedness",
    "swollen face",
    "dizziness",
    "blind spots"
  ]
}
----------------------------------------
ID: 2
Extracted Entities:
{
  "Doctor": [],
  "Patient": [
    "warts",
    "itchiness"
  ]
}
----------------------------------------
ID: 3
Extracted Entities:
{
  "Doctor": [
    "Salmeterol inhaler",
    "Fluticasone inhaler"
  ],
  "Patient": [
    "S

Next, try a simpler prompt that gives the model no worked examples.

In [None]:
simple_prompt = """
Extract the medical entities mentioned by the Doctor and the Patient in the following dialogue.
Return your answer as valid JSON with two keys: "Doctor" and "Patient", where the values are lists of entities.
Dialogue:
{dialogue}
Answer:"""

In [None]:
prompt = ChatPromptTemplate.from_template(simple_prompt)
# Named `llm` so it does not shadow the Hugging Face `model` loaded earlier.
llm = OllamaLLM(model="gemma3:latest")
chain = prompt | llm

In [None]:
for idx, row in subset_df.iterrows():
    dialogue = row['dialogue']
    result = chain.invoke({"dialogue": dialogue})
    print("ID:", row['ID'])
    print("Dialogue:")
    print(dialogue)
    print("Extraction Result:")
    print(result)
    print("="*80)

ID: 0
Dialogue:
Doctor: What brings you back into the clinic today, miss? 
Patient: I came in for a refill of my blood pressure medicine. 
Doctor: It looks like Doctor Kumar followed up with you last time regarding your hypertension, osteoarthritis, osteoporosis, hypothyroidism, allergic rhinitis and kidney stones.  Have you noticed any changes or do you have any concerns regarding these issues?  
Patient: No. 
Doctor: Have you had any fever or chills, cough, congestion, nausea, vomiting, chest pain, chest pressure?
Patient: No.  
Doctor: Great. Also, for our records, how old are you and what race do you identify yourself as?
Patient: I am seventy six years old and identify as a white female.
Extraction Result:
```json
{
  "Doctor": [
    "hypertension",
    "osteoarthritis",
    "osteoporosis",
    "hypothyroidism",
    "allergic rhinitis",
    "kidney stones"
  ],
  "Patient": [
    "blood pressure medicine",
    "seventy six years old",
    "white female"
  ]
}
```
ID: 1
Dial

KeyboardInterrupt: 

With the simple prompt, the extraction is less accurate than with the multi-shot prompt. Many of the symptoms the patient experiences are mixed in with medicines and treatments.

For example:

  "Doctor": [
    "high blood pressure",
    "Mavik",
    "verapamil",
    "Tarka",
    "degenerative changes",
    "rotator cuff injury",
    "humeral head",
    "glenoid",
    "acid reflux",
    "heartburn",
    "dentures",
    "Lexapro"
  ],
  "Patient": [
    "ninety",
    "high blood pressure",
    "right arm symptoms",
    "Mavik",
    "Tarka",
    "white coat high blood pressure",
    "muscle problem",
    "right shoulder blade",
    "x rays",
    "stomach pain",
    "Aleve",
    "Tylenol",
    "Tums",
    "Mylanta",
    "sores in my mouth",
    "dentures",
    "tremors",
    "upper body",
    "torso",
    "arms",
    "Lexapro"
  ]
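
One rough way to put a number on this kind of mixing, which the notebook itself does not do, is to count the entities that land in both speakers' lists. The sketch below uses an excerpt of the output above; `shared_entities` is a hypothetical helper, and overlap is only a proxy for extraction quality, not a measure of accuracy.

```python
from typing import Dict, List


def shared_entities(parsed: Dict[str, List[str]]) -> List[str]:
    """Entities attributed to both Doctor and Patient (case-insensitive)."""
    doctor = {e.lower() for e in parsed.get("Doctor", [])}
    patient = {e.lower() for e in parsed.get("Patient", [])}
    return sorted(doctor & patient)


# Excerpt of the simple-prompt output shown above.
excerpt = {
    "Doctor": ["high blood pressure", "Mavik", "verapamil", "Tarka",
               "acid reflux", "dentures", "Lexapro"],
    "Patient": ["ninety", "high blood pressure", "Mavik", "Tarka",
                "stomach pain", "Aleve", "dentures", "Lexapro"],
}

print(shared_entities(excerpt))
# ['dentures', 'high blood pressure', 'lexapro', 'mavik', 'tarka']
```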