# Assignment Overview
- Use an open-source NLP or LLM model (e.g. spaCy, Flan-T5, BART from Hugging Face) to extract clinical entities from discharge notes
- Identify and label relevant medical terms: diagnosis, treatment, symptoms, medications, or follow-up actions
- Optionally use prompt engineering to guide the LLM for consistent output
- Briefly discuss risks such as hallucination, entity ambiguity, or limitations of general-purpose models in clinical contexts

In [None]:
!pip3 freeze > requirements.txt

In [1]:
!pip install transformers torch

Collecting transformers
  Downloading transformers-4.52.4-py3-none-any.whl.metadata (38 kB)
Collecting torch
  Downloading torch-2.7.0-cp39-cp39-win_amd64.whl.metadata (29 kB)
Collecting filelock (from transformers)
  Using cached filelock-3.18.0-py3-none-any.whl.metadata (2.9 kB)
Collecting huggingface-hub<1.0,>=0.30.0 (from transformers)
  Downloading huggingface_hub-0.32.4-py3-none-any.whl.metadata (14 kB)
Collecting numpy>=1.17 (from transformers)
  Downloading numpy-2.0.2-cp39-cp39-win_amd64.whl.metadata (59 kB)
Collecting regex!=2019.12.17 (from transformers)
  Using cached regex-2024.11.6-cp39-cp39-win_amd64.whl.metadata (41 kB)
Collecting tokenizers<0.22,>=0.21 (from transformers)
  Using cached tokenizers-0.21.1-cp39-abi3-win_amd64.whl.metadata (6.9 kB)
Collecting safetensors>=0.4.3 (from transformers)
  Using cached safetensors-0.5.3-cp38-abi3-win_amd64.whl.metadata (3.9 kB)
Collecting tqdm>=4.27 (from transformers)
  Using cached tqdm-4.67.1-py3-none-any.whl.metadata (57 kB)

In [3]:
pip install pandas

Collecting pandas
  Downloading pandas-2.2.3-cp39-cp39-win_amd64.whl.metadata (19 kB)
Collecting pytz>=2020.1 (from pandas)
  Using cached pytz-2025.2-py2.py3-none-any.whl.metadata (22 kB)
Collecting tzdata>=2022.7 (from pandas)
  Using cached tzdata-2025.2-py2.py3-none-any.whl.metadata (1.4 kB)
Downloading pandas-2.2.3-cp39-cp39-win_amd64.whl (11.6 MB)
Using cached pytz-2025.2-py2.py3-none-any.whl (509 kB)
Using cached tzdata-2025.2-py2.py3-none-any.whl (347 kB)
Installing collected packages: pytz, tzdata, pandas


In [2]:
# import libraries
import pandas as pd
import numpy as np
from transformers import T5Tokenizer, T5ForConditionalGeneration
import torch



In [5]:
pip install sentencepiece

Collecting sentencepiece
  Downloading sentencepiece-0.2.0-cp39-cp39-win_amd64.whl.metadata (8.3 kB)
Downloading sentencepiece-0.2.0-cp39-cp39-win_amd64.whl (991 kB)
Installing collected packages: sentencepiece
Successfully installed sentencepiece-0.2.0
Note: you may need to restart the kernel to use updated packages.


In [3]:
# read in the provided data
df = pd.read_csv("data/Assignment_Data.csv")
print(df.shape)

# Load flan t5 large model and tokenizer
model_name = "google/flan-t5-large"
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

(200, 9)


You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565


In [3]:
# Function to extract medical information from a discharge note
def extract_medical_phrases(text):
    """Prompt Flan-T5 for the key clinical information in one discharge note."""
    # Create a prompt
    prompt = f"""As a doctor, extract the following medical information from the patient discharge note:
    - Diagnosis or medical condition
    - Medication or treatment
    - Recovery status
    - Follow-up recommendation or medical advice
    
    Discharge note: {text}"""
    
    # Tokenize (prompts longer than 512 tokens are truncated) and generate
    inputs = tokenizer(prompt, return_tensors="pt", max_length=512, truncation=True)
    output = model.generate(**inputs, max_new_tokens=100)
    
    # Decode the generated token IDs back into text
    result = tokenizer.decode(output[0], skip_special_tokens=True)

    return result

# Apply to 'discharge_note' column
df['medical_entities'] = df['discharge_note'].apply(extract_medical_phrases)
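
The prompt above asks for four kinds of information in a single instruction. One prompt-engineering variation worth trying is to ask one short question per field instead. The following is a minimal sketch of that idea, not a tested prompt: the field names, the question wording, and the helper names `build_field_prompts` and `extract_fields` are all hypothetical, and `generate` stands for any function that maps a prompt string to the model's text output.

```python
from typing import Callable, Dict

# Hypothetical field names and question wording (assumptions, not tuned prompts).
FIELD_QUESTIONS: Dict[str, str] = {
    "diagnosis": "What diagnosis or medical condition is mentioned?",
    "treatment": "What medication or treatment is mentioned?",
    "recovery_status": "What is the patient's recovery status?",
    "follow_up": "What follow-up recommendation or medical advice is given?",
}


def build_field_prompts(note: str) -> Dict[str, str]:
    """Return one single-question prompt per field for the given note."""
    return {
        field: (
            f"Discharge note: {note}\n"
            f"Question: {question} Answer 'none' if it is not stated.\n"
            "Answer:"
        )
        for field, question in FIELD_QUESTIONS.items()
    }


def extract_fields(note: str, generate: Callable[[str], str]) -> Dict[str, str]:
    """Call `generate` once per field and collect the stripped answers."""
    prompts = build_field_prompts(note)
    return {field: generate(prompt).strip() for field, prompt in prompts.items()}


# Possible wiring with the notebook's tokenizer and model (not executed here):
# def flan_generate(prompt):
#     inputs = tokenizer(prompt, return_tensors="pt", max_length=512, truncation=True)
#     output = model.generate(**inputs, max_new_tokens=32)
#     return tokenizer.decode(output[0], skip_special_tokens=True)
#
# fields = df["discharge_note"].apply(lambda n: extract_fields(n, flan_generate))
```

Passing `generate` in as an argument keeps the prompt logic separate from the model, so the prompts can be checked without loading Flan-T5. The trade-off is four generation calls per note instead of one.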

In [4]:
df[['discharge_note','medical_entities']].head(10)

| | discharge_note | medical_entities |
|---|---|---|
| 0 | Good recovery trajectory. Follow-up scan sched... | Recovery status |
| 1 | Stable post-surgery. Advised to avoid physical... | Recovery status is stable post-surgery. Advise... |
| 2 | Symptoms controlled. Monitoring for relapse ad... | Recovery status |
| 3 | Stable post-surgery. Advised to avoid physical... | Recovery status is stable post-surgery. Advise... |
| 4 | Stable post-surgery. Advised to avoid physical... | Recovery status is stable post-surgery. Advise... |
| 5 | Good recovery trajectory. Follow-up scan sched... | Recovery status |
| 6 | Discharge after recovery from pneumonia. No co... | Recovery status is recovery from pneumonia. No... |
| 7 | Patient discharged in stable condition. Recomm... | Diagnosis or medical condition |
| 8 | Patient showed improvement. Prescribed antibio... | Diagnosis or medical condition |
| 9 | Blood pressure under control. Continue current... | Diagnosis or medical condition |


In [5]:
# Inspect output

indices = range(10, 20)  # specify rows to inspect

for i in indices:
    print('Input:\n', df['discharge_note'][i])
    print('Output:\n', df['medical_entities'][i], '\n')

Input:
 Stable post-surgery. Advised to avoid physical exertion.
Output:
 Recovery status is stable post-surgery. Advised to avoid physical exertion. 

Input:
 Patient discharged with minor discomfort. Advised rest and hydration.
Output:
 Recovery status is minor discomfort. Advised rest and hydration. 

Input:
 Symptoms controlled. Monitoring for relapse advised.
Output:
 Recovery status 

Input:
 Discharge after recovery from pneumonia. No complications observed.
Output:
 Recovery status is recovery from pneumonia. No complications observed. 

Input:
 Stable post-surgery. Advised to avoid physical exertion.
Output:
 Recovery status is stable post-surgery. Advised to avoid physical exertion. 

Input:
 Discharge after recovery from pneumonia. No complications observed.
Output:
 Recovery status is recovery from pneumonia. No complications observed. 

Input:
 Discharge after recovery from pneumonia. No complications observed.
Output:
 Recovery status is recovery from pneumonia. No compli

In [6]:
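# Category labels copied from the prompt; an output equal to one of these carries no extracted content
medical_info_list = ['Diagnosis or medical condition', 'Medication or treatment', 'Recovery status', 'Follow-up recommendation or medical advice']

print('Number of results that are not just a category label: ', len(df[~df['medical_entities'].isin(medical_info_list)]['medical_entities']))

Number of results that are not just a category label:  80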


In [7]:
df[~df['medical_entities'].isin(medical_info_list)]['medical_entities'].head()

1     Recovery status is stable post-surgery. Advise...
3     Recovery status is stable post-surgery. Advise...
4     Recovery status is stable post-surgery. Advise...
6     Recovery status is recovery from pneumonia. No...
10    Recovery status is stable post-surgery. Advise...
Name: medical_entities, dtype: object
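
The `isin` filter above separates bare category labels from everything else, but the remaining outputs are not all alike: some repeat the note verbatim after a prefix, others rephrase part of it. A slightly finer breakdown could look like the sketch below. The category names (`label_only`, `echo`, `other`, `empty`) and the helper `classify_output` are hypothetical, and the matching is plain string comparison after lower-casing and whitespace normalisation.

```python
import re

# Category labels copied from the prompt used in this notebook.
PROMPT_LABELS = {
    "Diagnosis or medical condition",
    "Medication or treatment",
    "Recovery status",
    "Follow-up recommendation or medical advice",
}


def _normalise(text: str) -> str:
    """Lower-case and collapse whitespace so comparisons ignore formatting."""
    return re.sub(r"\s+", " ", text).strip().lower()


_NORMALISED_LABELS = {_normalise(label) for label in PROMPT_LABELS}


def classify_output(note: str, output: str) -> str:
    """Assign a coarse, hypothetical category to one model output."""
    out = _normalise(output)
    if not out:
        return "empty"
    if out in _NORMALISED_LABELS:
        return "label_only"  # the model only repeated a prompt label
    if _normalise(note) in out:
        return "echo"  # the whole note is copied into the output
    return "other"  # anything else, including partial rephrasings


# Possible use on the notebook's DataFrame (not executed here):
# kinds = df.apply(
#     lambda row: classify_output(row["discharge_note"], row["medical_entities"]),
#     axis=1,
# )
# print(kinds.value_counts())
```

Outputs in the `other` group are the ones most worth reading by hand, since they are where the model has actually changed the note's wording.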

1. __Hallucination__:
- LLMs are trained on large corpora of uneven quality, so they can make fluent, confident statements that are inaccurate or not supported by the input, especially where they lack domain-specific knowledge. They may also struggle with up-to-date, contextually relevant information in highly specialised fields like healthcare.
- Retrieval-augmented generation (RAG), which grounds the model in external knowledge sources, can help reduce hallucinations.
2. __Entity Ambiguity__:
- Clinical terminology can be ambiguous: the same term can have different meanings in different contexts, and patients and practitioners may use different language for the same concept.
- This ambiguity makes it difficult for LLMs to interpret clinical information accurately and to assign an entity to the right category.
3. __Limitations of General-Purpose Models__:
- General-purpose LLMs lack domain-specific knowledge and may not have had sufficient training on medical data to understand complex clinical scenarios.
- In this notebook, Flan-T5 returned only a bare category label (for example, "Recovery status") for 120 of the 200 notes, and several of the remaining outputs largely repeat the note's own text.
4. __Inability to handle nuanced information__:
- Clinical records are rich in context and nuance, which can be challenging for general-purpose models to interpret.
5. __Bias__:
- LLMs can inherit biases from their training data, leading to unfair or inaccurate outputs.
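
For an extraction task, one inexpensive guard against the hallucination risk listed above is to check that every word in the model's output also appears in the source note. The sketch below is a purely lexical check under that assumption: the helper name `unsupported_tokens` and the `ignore` parameter are hypothetical, and the check cannot detect paraphrases, negation errors, or an entity placed in the wrong category.

```python
import re
from typing import AbstractSet, List


def _tokens(text: str) -> List[str]:
    """Split text into lower-case word tokens, keeping hyphenated words whole."""
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())


def unsupported_tokens(
    note: str,
    extraction: str,
    ignore: AbstractSet[str] = frozenset(),
) -> List[str]:
    """Return tokens of `extraction` that do not occur in `note`.

    `ignore` holds filler words (e.g. "is", "status") that the model may add
    without changing the clinical content; which words belong there is an
    assumption the user has to make.
    """
    note_vocab = set(_tokens(note))
    return [
        token
        for token in _tokens(extraction)
        if token not in note_vocab and token not in ignore
    ]


# Possible use on the notebook's DataFrame (not executed here):
# flagged = df.apply(
#     lambda row: unsupported_tokens(
#         row["discharge_note"], row["medical_entities"], ignore={"is", "status"}
#     ),
#     axis=1,
# )
# print(df[flagged.str.len() > 0][["discharge_note", "medical_entities"]])
```

An empty list means only that the output uses the note's vocabulary; rows with a non-empty list are candidates for manual review rather than confirmed hallucinations.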