# Basic analysis

In [1]:
from datasets import load_dataset
import pandas as pd

dataset = load_dataset(
    "starmpcc/Asclepius-Synthetic-Clinical-Notes",
    split="train"
)
df = dataset.to_pandas()
print(df.head())


   patient_id                                               note  \
0           0  Discharge Summary:\n\nPatient: 60-year-old mal...   
1           1  Discharge Summary:\n\nAdmission Date: [Insert ...   
2           2  Hospital Course Summary:\n\nAdmission Date: [I...   
3           3  Discharge Summary:\n\nPatient: 69-year-old mal...   
4           4  Discharge Summary:\n\nPatient Information:\n- ...   

                                            question  \
0  Can you provide a simplified paraphrase of the...   
1  Which coreferences were resolved in the hospit...   
2  What were the key improvements in the patient'...   
3  What roles did physical therapists have in the...   
4  What manual airway clearance techniques were u...   

                                              answer                    task  
0  The healthcare team used a gradual approach to...            Paraphrasing  
1  The hospital course section resolved the coref...  Coreference Resolution  
2  During the hos

# Raw data

In [2]:
sample = df['note'].iloc[0]
print(sample)

Discharge Summary:

Patient: 60-year-old male with moderate ARDS from COVID-19

Hospital Course:

The patient was admitted to the hospital with symptoms of fever, dry cough, and dyspnea. During physical therapy on the acute ward, the patient experienced coughing attacks that induced oxygen desaturation and dyspnea with any change of position or deep breathing. To avoid rapid deterioration and respiratory failure, a step-by-step approach was used for position changes. The breathing exercises were adapted to avoid prolonged coughing and oxygen desaturation, and with close monitoring, the patient managed to perform strength and walking exercises at a low level. Exercise progression was low initially but increased daily until hospital discharge to a rehabilitation clinic on day 10.

Clinical Outcome:

The patient was discharged on day 10 to a rehabilitation clinic making satisfactory progress with all symptoms resolved.

Follow-up:

The patient will receive follow-up care at the rehabilita

To clean and preprocess the dataset, we take a sample note from the chosen dataset.

In [5]:
raw_text = sample

The sample text is stored as raw_text so it can be compared with the processed versions later.

# Lowercasing and tokenization

Sentence lowercasing

In [7]:
lower_text = raw_text.lower()
print(lower_text)

discharge summary:

patient: 60-year-old male with moderate ards from covid-19

hospital course:

the patient was admitted to the hospital with symptoms of fever, dry cough, and dyspnea. during physical therapy on the acute ward, the patient experienced coughing attacks that induced oxygen desaturation and dyspnea with any change of position or deep breathing. to avoid rapid deterioration and respiratory failure, a step-by-step approach was used for position changes. the breathing exercises were adapted to avoid prolonged coughing and oxygen desaturation, and with close monitoring, the patient managed to perform strength and walking exercises at a low level. exercise progression was low initially but increased daily until hospital discharge to a rehabilitation clinic on day 10.

clinical outcome:

the patient was discharged on day 10 to a rehabilitation clinic making satisfactory progress with all symptoms resolved.

follow-up:

the patient will receive follow-up care at the rehabilita

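All lines and sentences have now been lowercased.

Lowercasing is generally applied so that case variants of the same word (for example "Patient" and "patient") map to a single token. This reduces the vocabulary size (dimensionality), keeps one unique form per word, and reduces the complexity of downstream models.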
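As a rough illustration of this vocabulary reduction, the sketch below counts distinct tokens before and after lowercasing. The `text` string and the `vocabulary` helper are made up for this example and are not part of the dataset or the original pipeline.

```python
# Hypothetical example text (not from the dataset): the same words appear
# with different capitalisation.
text = "Patient admitted. The patient reported Fever. fever resolved. The PATIENT was discharged."


def vocabulary(s):
    """Return the distinct whitespace-separated tokens, with edge punctuation stripped."""
    return {tok.strip(".,:;") for tok in s.split()}


vocab_original = vocabulary(text)
vocab_lowered = vocabulary(text.lower())

print(len(vocab_original), sorted(vocab_original))
print(len(vocab_lowered), sorted(vocab_lowered))
```

Lowercasing also merges acronyms such as "ARDS" with "ards", which is one reason to keep raw_text available if case matters in a later step.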

Tokenisation of the lowercased sample data

In [8]:
import re
tokens = re.findall(r'\b\w+\b', lower_text)
print(tokens)

['discharge', 'summary', 'patient', '60', 'year', 'old', 'male', 'with', 'moderate', 'ards', 'from', 'covid', '19', 'hospital', 'course', 'the', 'patient', 'was', 'admitted', 'to', 'the', 'hospital', 'with', 'symptoms', 'of', 'fever', 'dry', 'cough', 'and', 'dyspnea', 'during', 'physical', 'therapy', 'on', 'the', 'acute', 'ward', 'the', 'patient', 'experienced', 'coughing', 'attacks', 'that', 'induced', 'oxygen', 'desaturation', 'and', 'dyspnea', 'with', 'any', 'change', 'of', 'position', 'or', 'deep', 'breathing', 'to', 'avoid', 'rapid', 'deterioration', 'and', 'respiratory', 'failure', 'a', 'step', 'by', 'step', 'approach', 'was', 'used', 'for', 'position', 'changes', 'the', 'breathing', 'exercises', 'were', 'adapted', 'to', 'avoid', 'prolonged', 'coughing', 'and', 'oxygen', 'desaturation', 'and', 'with', 'close', 'monitoring', 'the', 'patient', 'managed', 'to', 'perform', 'strength', 'and', 'walking', 'exercises', 'at', 'a', 'low', 'level', 'exercise', 'progression', 'was', 'low', '

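Tokenisation converts the unstructured clinical text into individual word units for computational processing. Because the pattern \b\w+\b matches only runs of word characters, punctuation is dropped and hyphenated terms are split, so 60-year-old becomes '60', 'year', 'old' and covid-19 becomes 'covid', '19'.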
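The effect of the regular expression on hyphenated clinical terms can be seen on a single line of the lowercased note. The second pattern below is only a possible alternative, assuming hyphenated terms should stay intact; it is not what the notebook uses.

```python
import re

# One line copied from the lowercased sample note.
line = "patient: 60-year-old male with moderate ards from covid-19"

# Pattern used in the notebook: hyphens act as token boundaries.
split_tokens = re.findall(r"\b\w+\b", line)

# Alternative sketch (an assumption): allow hyphens between word characters.
hyphen_tokens = re.findall(r"\w+(?:-\w+)*", line)

print(split_tokens)
print(hyphen_tokens)
```

Which behaviour is preferable depends on the downstream task: splitting keeps the vocabulary smaller, while keeping hyphens preserves terms such as covid-19 as single units.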

# Stop word handling and lemmatisation

Stop word handling

In [12]:
%pip install nltk

Collecting nltk
  Downloading nltk-3.9.2-py3-none-any.whl.metadata (3.2 kB)
Collecting regex>=2021.8.3 (from nltk)
  Downloading regex-2026.1.15-cp312-cp312-win_amd64.whl.metadata (41 kB)
Downloading nltk-3.9.2-py3-none-any.whl (1.5 MB)
Downloading regex-2026.1.15-cp312-cp312-win_amd64.whl (277 kB)
Installing collected packages: regex, nltk




In [13]:
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')

stop_words = set(stopwords.words('english'))
clinical_negations = {'no', 'not', 'without', 'denies'}

filtered_tokens = [
    word for word in tokens
    if word not in stop_words or word in clinical_negations
]
print(filtered_tokens)

['discharge', 'summary', 'patient', '60', 'year', 'old', 'male', 'moderate', 'ards', 'covid', '19', 'hospital', 'course', 'patient', 'admitted', 'hospital', 'symptoms', 'fever', 'dry', 'cough', 'dyspnea', 'physical', 'therapy', 'acute', 'ward', 'patient', 'experienced', 'coughing', 'attacks', 'induced', 'oxygen', 'desaturation', 'dyspnea', 'change', 'position', 'deep', 'breathing', 'avoid', 'rapid', 'deterioration', 'respiratory', 'failure', 'step', 'step', 'approach', 'used', 'position', 'changes', 'breathing', 'exercises', 'adapted', 'avoid', 'prolonged', 'coughing', 'oxygen', 'desaturation', 'close', 'monitoring', 'patient', 'managed', 'perform', 'strength', 'walking', 'exercises', 'low', 'level', 'exercise', 'progression', 'low', 'initially', 'increased', 'daily', 'hospital', 'discharge', 'rehabilitation', 'clinic', 'day', '10', 'clinical', 'outcome', 'patient', 'discharged', 'day', '10', 'rehabilitation', 'clinic', 'making', 'satisfactory', 'progress', 'symptoms', 'resolved', 'fol

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\rajak\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


Stop words are generally removed to reduce noise in the data, which helps the analysis focus on clinically important terms. Because this is medical text, the words no, not, without and denies are kept even when they appear in the stop word list, since negation changes the clinical meaning (for example, "no fever" versus "fever").
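The following sketch shows why the negation terms are kept. It uses a small hand-written stop word set and a made-up sentence, both assumptions for illustration only; NLTK's English list is much larger.

```python
# Small illustrative stop word set (an assumption for this sketch, not NLTK's list).
stop_words = {"the", "of", "and", "no", "not", "was", "with"}
clinical_negations = {"no", "not", "without", "denies"}

# Hypothetical tokenised sentence (not from the dataset).
tokens = ["the", "patient", "denies", "chest", "pain", "and", "had", "no", "fever"]

# Plain stop word removal drops "no", so "no fever" reads as "fever".
plain = [t for t in tokens if t not in stop_words]

# Same filter as the notebook cell: negation terms are always kept.
negation_aware = [
    t for t in tokens
    if t not in stop_words or t in clinical_negations
]

print(plain)
print(negation_aware)
```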

Lemmatisation

In [16]:
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()
lemmatized_tokens = [lemmatizer.lemmatize(word) for word in filtered_tokens]
print(lemmatized_tokens)

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\rajak\AppData\Roaming\nltk_data...


['discharge', 'summary', 'patient', '60', 'year', 'old', 'male', 'moderate', 'ards', 'covid', '19', 'hospital', 'course', 'patient', 'admitted', 'hospital', 'symptom', 'fever', 'dry', 'cough', 'dyspnea', 'physical', 'therapy', 'acute', 'ward', 'patient', 'experienced', 'coughing', 'attack', 'induced', 'oxygen', 'desaturation', 'dyspnea', 'change', 'position', 'deep', 'breathing', 'avoid', 'rapid', 'deterioration', 'respiratory', 'failure', 'step', 'step', 'approach', 'used', 'position', 'change', 'breathing', 'exercise', 'adapted', 'avoid', 'prolonged', 'coughing', 'oxygen', 'desaturation', 'close', 'monitoring', 'patient', 'managed', 'perform', 'strength', 'walking', 'exercise', 'low', 'level', 'exercise', 'progression', 'low', 'initially', 'increased', 'daily', 'hospital', 'discharge', 'rehabilitation', 'clinic', 'day', '10', 'clinical', 'outcome', 'patient', 'discharged', 'day', '10', 'rehabilitation', 'clinic', 'making', 'satisfactory', 'progress', 'symptom', 'resolved', 'follow', 

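Lemmatisation normalises word forms while maintaining the clinical meaning. For example, symptoms becomes symptom and attacks becomes attack. Because WordNetLemmatizer.lemmatize treats every word as a noun by default, verb forms such as discharged are left unchanged in the output above; passing pos='v' would reduce discharged to discharge.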
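To check which tokens the lemmatiser actually changed, the two lists can be compared position by position. The sketch below uses a handful of tokens copied from the filtered and lemmatised outputs above; in the notebook the same comparison could be run on filtered_tokens and lemmatized_tokens directly.

```python
# A few tokens copied from the filtered and lemmatised outputs above.
before = ["symptoms", "attacks", "changes", "exercises", "discharged", "coughing"]
after = ["symptom", "attack", "change", "exercise", "discharged", "coughing"]

# Pairs where lemmatisation altered the token.
changed = [(b, a) for b, a in zip(before, after) if b != a]

# Tokens that passed through untouched (verb forms under the default noun setting).
unchanged = [b for b, a in zip(before, after) if b == a]

print(changed)
print(unchanged)
```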