# Basic analysis

In [1]:
from datasets import load_dataset
import pandas as pd

dataset = load_dataset(
    "starmpcc/Asclepius-Synthetic-Clinical-Notes",
    split="train"
)
df = dataset.to_pandas()
print(df.head())


   patient_id                                               note  \
0           0  Discharge Summary:\n\nPatient: 60-year-old mal...   
1           1  Discharge Summary:\n\nAdmission Date: [Insert ...   
2           2  Hospital Course Summary:\n\nAdmission Date: [I...   
3           3  Discharge Summary:\n\nPatient: 69-year-old mal...   
4           4  Discharge Summary:\n\nPatient Information:\n- ...   

                                            question  \
0  Can you provide a simplified paraphrase of the...   
1  Which coreferences were resolved in the hospit...   
2  What were the key improvements in the patient'...   
3  What roles did physical therapists have in the...   
4  What manual airway clearance techniques were u...   

                                              answer                    task  
0  The healthcare team used a gradual approach to...            Paraphrasing  
1  The hospital course section resolved the coref...  Coreference Resolution  
2  During the hos

# Raw data

In [13]:
sample = df['note'].iloc[0]
print(sample)

Discharge Summary:

Patient: 60-year-old male with moderate ARDS from COVID-19

Hospital Course:

The patient was admitted to the hospital with symptoms of fever, dry cough, and dyspnea. During physical therapy on the acute ward, the patient experienced coughing attacks that induced oxygen desaturation and dyspnea with any change of position or deep breathing. To avoid rapid deterioration and respiratory failure, a step-by-step approach was used for position changes. The breathing exercises were adapted to avoid prolonged coughing and oxygen desaturation, and with close monitoring, the patient managed to perform strength and walking exercises at a low level. Exercise progression was low initially but increased daily until hospital discharge to a rehabilitation clinic on day 10.

Clinical Outcome:

The patient was discharged on day 10 to a rehabilitation clinic making satisfactory progress with all symptoms resolved.

Follow-up:

The patient will receive follow-up care at the rehabilita

To demonstrate cleaning and preprocessing, we take one sample note from the chosen dataset.

In [14]:
raw_text = sample

The sample text is stored as raw_text for later comparison.

# Lowercasing and tokenization

Sentence lowercasing

In [15]:
lower_text = raw_text.lower()
print(lower_text)

discharge summary:

patient: 60-year-old male with moderate ards from covid-19

hospital course:

the patient was admitted to the hospital with symptoms of fever, dry cough, and dyspnea. during physical therapy on the acute ward, the patient experienced coughing attacks that induced oxygen desaturation and dyspnea with any change of position or deep breathing. to avoid rapid deterioration and respiratory failure, a step-by-step approach was used for position changes. the breathing exercises were adapted to avoid prolonged coughing and oxygen desaturation, and with close monitoring, the patient managed to perform strength and walking exercises at a low level. exercise progression was low initially but increased daily until hospital discharge to a rehabilitation clinic on day 10.

clinical outcome:

the patient was discharged on day 10 to a rehabilitation clinic making satisfactory progress with all symptoms resolved.

follow-up:

the patient will receive follow-up care at the rehabilita

All lines and sentences have been lowercased.

Lowercasing is applied so that case variants of a word (for example "Patient" and "patient") map to a single token. This shrinks the vocabulary, which reduces dimensionality and model complexity.
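
To illustrate the dimensionality point, the following sketch counts the distinct tokens in a made-up string before and after lowercasing. The example text and the helper name `vocabulary` are illustrative assumptions and are not part of the dataset or the notebook.

```python
import re

# Made-up text in which the same words appear with different casing
text = "The patient was admitted. the Patient improved. PATIENT discharged."

def vocabulary(s):
    """Return the set of distinct word tokens in s."""
    return set(re.findall(r"\b\w+\b", s))

cased_vocab = vocabulary(text)          # case variants count as separate tokens
lower_vocab = vocabulary(text.lower())  # case variants collapse into one token

print(len(cased_vocab), len(lower_vocab))
```

On this toy string the vocabulary shrinks from 9 distinct tokens to 6.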

Tokenisation of the lowercased sample data

In [16]:
import re
tokens = re.findall(r'\b\w+\b', lower_text)
print(tokens)

['discharge', 'summary', 'patient', '60', 'year', 'old', 'male', 'with', 'moderate', 'ards', 'from', 'covid', '19', 'hospital', 'course', 'the', 'patient', 'was', 'admitted', 'to', 'the', 'hospital', 'with', 'symptoms', 'of', 'fever', 'dry', 'cough', 'and', 'dyspnea', 'during', 'physical', 'therapy', 'on', 'the', 'acute', 'ward', 'the', 'patient', 'experienced', 'coughing', 'attacks', 'that', 'induced', 'oxygen', 'desaturation', 'and', 'dyspnea', 'with', 'any', 'change', 'of', 'position', 'or', 'deep', 'breathing', 'to', 'avoid', 'rapid', 'deterioration', 'and', 'respiratory', 'failure', 'a', 'step', 'by', 'step', 'approach', 'was', 'used', 'for', 'position', 'changes', 'the', 'breathing', 'exercises', 'were', 'adapted', 'to', 'avoid', 'prolonged', 'coughing', 'and', 'oxygen', 'desaturation', 'and', 'with', 'close', 'monitoring', 'the', 'patient', 'managed', 'to', 'perform', 'strength', 'and', 'walking', 'exercises', 'at', 'a', 'low', 'level', 'exercise', 'progression', 'was', 'low', '

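Tokenisation converts the unstructured clinical text into individual word units for computational processing. The pattern \b\w+\b also discards punctuation, so hyphenated terms such as 60-year-old and covid-19 are split into separate tokens ('60', 'year', 'old' and 'covid', '19').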
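
The sketch below reproduces that splitting on one line of the sample note. It also shows one possible alternative pattern that keeps hyphenated compounds together. The second pattern is an assumption for illustration, not what the notebook uses.

```python
import re

snippet = "60-year-old male with moderate ards from covid-19"

# Pattern used in the notebook: every run of word characters is a token
word_tokens = re.findall(r"\b\w+\b", snippet)

# Illustrative alternative: allow internal hyphens so compounds stay whole
hyphen_tokens = re.findall(r"\w+(?:-\w+)*", snippet)

print(word_tokens)
print(hyphen_tokens)
```

Which behaviour is preferable depends on whether terms like covid-19 should count as one unit downstream.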

# Stop word handling and lemmatisation 

Stop word handling

In [6]:
pip install nltk

Note: you may need to restart the kernel to use updated packages.




In [28]:
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')

stop_words = set(stopwords.words('english'))
clinical_negations = {'no', 'not', 'without', 'denies'}

filtered_tokens = [
    word for word in tokens
    if word not in stop_words or word in clinical_negations
]
print(filtered_tokens)

['discharge', 'summary', 'patient', '60', 'year', 'old', 'male', 'moderate', 'ards', 'covid', '19', 'hospital', 'course', 'patient', 'admitted', 'hospital', 'symptoms', 'fever', 'dry', 'cough', 'dyspnea', 'physical', 'therapy', 'acute', 'ward', 'patient', 'experienced', 'coughing', 'attacks', 'induced', 'oxygen', 'desaturation', 'dyspnea', 'change', 'position', 'deep', 'breathing', 'avoid', 'rapid', 'deterioration', 'respiratory', 'failure', 'step', 'step', 'approach', 'used', 'position', 'changes', 'breathing', 'exercises', 'adapted', 'avoid', 'prolonged', 'coughing', 'oxygen', 'desaturation', 'close', 'monitoring', 'patient', 'managed', 'perform', 'strength', 'walking', 'exercises', 'low', 'level', 'exercise', 'progression', 'low', 'initially', 'increased', 'daily', 'hospital', 'discharge', 'rehabilitation', 'clinic', 'day', '10', 'clinical', 'outcome', 'patient', 'discharged', 'day', '10', 'rehabilitation', 'clinic', 'making', 'satisfactory', 'progress', 'symptoms', 'resolved', 'fol

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\rajak\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Stop words are removed to reduce noise in the data so that the analysis can focus on clinically relevant terms. Because this is medical text, the words no, not, without and denies are kept: they negate a finding (for example "no fever"), so dropping them would change the clinical meaning.
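
The effect of preserving negations can be seen in a small self-contained sketch. The stop word set here is a tiny hand-written stand-in for the NLTK list, and the token list is made up. Both are assumptions for illustration.

```python
# Tiny stand-in for the NLTK English stop word list (assumption)
stop_words_demo = {"the", "of", "and", "was", "no", "not"}
clinical_negations = {"no", "not", "without", "denies"}

tokens = ["the", "patient", "denies", "fever", "and", "reports", "no", "cough"]

# Plain stop word removal drops "no", so "no cough" becomes "cough"
plain = [t for t in tokens if t not in stop_words_demo]

# Same filter as in the notebook: negations survive even if they are stop words
negation_aware = [
    t for t in tokens
    if t not in stop_words_demo or t in clinical_negations
]

print(plain)
print(negation_aware)
```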

Lemmatisation

In [29]:
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()
lemmatized_tokens = [lemmatizer.lemmatize(word) for word in filtered_tokens]
print(lemmatized_tokens)

['discharge', 'summary', 'patient', '60', 'year', 'old', 'male', 'moderate', 'ards', 'covid', '19', 'hospital', 'course', 'patient', 'admitted', 'hospital', 'symptom', 'fever', 'dry', 'cough', 'dyspnea', 'physical', 'therapy', 'acute', 'ward', 'patient', 'experienced', 'coughing', 'attack', 'induced', 'oxygen', 'desaturation', 'dyspnea', 'change', 'position', 'deep', 'breathing', 'avoid', 'rapid', 'deterioration', 'respiratory', 'failure', 'step', 'step', 'approach', 'used', 'position', 'change', 'breathing', 'exercise', 'adapted', 'avoid', 'prolonged', 'coughing', 'oxygen', 'desaturation', 'close', 'monitoring', 'patient', 'managed', 'perform', 'strength', 'walking', 'exercise', 'low', 'level', 'exercise', 'progression', 'low', 'initially', 'increased', 'daily', 'hospital', 'discharge', 'rehabilitation', 'clinic', 'day', '10', 'clinical', 'outcome', 'patient', 'discharged', 'day', '10', 'rehabilitation', 'clinic', 'making', 'satisfactory', 'progress', 'symptom', 'resolved', 'follow', 

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\rajak\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


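Lemmatisation normalises word forms while preserving the clinical meaning. In the output above, plural nouns are reduced to their singular form: symptoms becomes symptom, attacks becomes attack and exercises becomes exercise. Because lemmatize() is called without a part-of-speech argument, it treats every token as a noun, so verb forms such as discharged and admitted are left unchanged.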

In [30]:
abbreviation_map = {
    'ards': 'acute respiratory distress syndrome',
    'sob': 'shortness of breath',
    'htn': 'hypertension',
    'dm': 'diabetes mellitus',
}
expanded_tokens = [
    abbreviation_map[token] if token in abbreviation_map else token
    for token in lemmatized_tokens
]

print(expanded_tokens)

['discharge', 'summary', 'patient', '60', 'year', 'old', 'male', 'moderate', 'acute respiratory distress syndrome', 'covid', '19', 'hospital', 'course', 'patient', 'admitted', 'hospital', 'symptom', 'fever', 'dry', 'cough', 'dyspnea', 'physical', 'therapy', 'acute', 'ward', 'patient', 'experienced', 'coughing', 'attack', 'induced', 'oxygen', 'desaturation', 'dyspnea', 'change', 'position', 'deep', 'breathing', 'avoid', 'rapid', 'deterioration', 'respiratory', 'failure', 'step', 'step', 'approach', 'used', 'position', 'change', 'breathing', 'exercise', 'adapted', 'avoid', 'prolonged', 'coughing', 'oxygen', 'desaturation', 'close', 'monitoring', 'patient', 'managed', 'perform', 'strength', 'walking', 'exercise', 'low', 'level', 'exercise', 'progression', 'low', 'initially', 'increased', 'daily', 'hospital', 'discharge', 'rehabilitation', 'clinic', 'day', '10', 'clinical', 'outcome', 'patient', 'discharged', 'day', '10', 'rehabilitation', 'clinic', 'making', 'satisfactory', 'progress', 's
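
In the cell above, each abbreviation is replaced by a single multi-word token such as 'acute respiratory distress syndrome'. If one word per token is preferred downstream, the expansion can be split instead, which is what the full-dataset pipeline later in this notebook does. The sketch below contrasts the two variants on a shortened, illustrative token list.

```python
abbreviation_map = {
    "ards": "acute respiratory distress syndrome",
    "sob": "shortness of breath",
}

tokens = ["moderate", "ards", "covid"]

# Variant 1: the whole expansion becomes one multi-word token
single_token = [abbreviation_map.get(t, t) for t in tokens]

# Variant 2: the expansion is split so every token is a single word
word_level = []
for t in tokens:
    word_level.extend(abbreviation_map.get(t, t).split())

print(single_token)
print(word_level)
```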

# Spell check and normalisation 

In [20]:
pip install wordfreq




In [21]:
pip install pyenchant





The cells above install the libraries used for the British-to-American English conversion and for the spell check.

Conversion of British English to American English

In [22]:
import enchant
uk_dict = enchant.Dict("en_GB")
us_dict = enchant.Dict("en_US")

In [31]:
def british_to_american(expanded_tokens, uk_dict, us_dict):
    converted_tokens = []
    for word in expanded_tokens:
        # Check if word is British but not American
        if uk_dict.check(word) and not us_dict.check(word):
            suggestions = us_dict.suggest(word)
            if suggestions:
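                # Top suggestion from the US dictionary; this is the closest
                # known spelling, not guaranteed to be the American equivalent
                converted_tokens.append(suggestions[0])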
            else:
                converted_tokens.append(word)
        else:
            converted_tokens.append(word)
    return converted_tokens

In [32]:
american_tokens = british_to_american(expanded_tokens, uk_dict, us_dict)
print(american_tokens)

['discharge', 'summary', 'patient', '60', 'year', 'old', 'male', 'moderate', 'acute respiratory distress syndrome', 'covid', '19', 'hospital', 'course', 'patient', 'admitted', 'hospital', 'symptom', 'fever', 'dry', 'cough', 'dyspnea', 'physical', 'therapy', 'acute', 'ward', 'patient', 'experienced', 'coughing', 'attack', 'induced', 'oxygen', 'desaturation', 'dyspnea', 'change', 'position', 'deep', 'breathing', 'avoid', 'rapid', 'deterioration', 'respiratory', 'failure', 'step', 'step', 'approach', 'used', 'position', 'change', 'breathing', 'exercise', 'adapted', 'avoid', 'prolonged', 'coughing', 'oxygen', 'desaturation', 'close', 'monitoring', 'patient', 'managed', 'perform', 'strength', 'walking', 'exercise', 'low', 'level', 'exercise', 'progression', 'low', 'initially', 'increased', 'daily', 'hospital', 'discharge', 'rehabilitation', 'clinic', 'day', '10', 'clinical', 'outcome', 'patient', 'discharged', 'day', '10', 'rehabilitation', 'clinic', 'making', 'satisfactory', 'progress', 's
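
`us_dict.suggest(word)[0]` returns the closest spelling that the US dictionary knows, which is not necessarily the American form of the same word. A more predictable alternative, sketched here as an assumption and not as the notebook's method, is an explicit lookup table. The names `uk_to_us` and `convert_with_table` are hypothetical, and the entries are a small illustrative subset.

```python
# Illustrative British -> American spelling pairs (not exhaustive)
uk_to_us = {
    "haemorrhage": "hemorrhage",
    "oedema": "edema",
    "anaemia": "anemia",
    "paediatric": "pediatric",
}

def convert_with_table(tokens, table):
    """Replace tokens found in the table; leave all other tokens unchanged."""
    return [table.get(t, t) for t in tokens]

demo_tokens = ["paediatric", "patient", "oedema", "resolved"]
converted = convert_with_table(demo_tokens, uk_to_us)
print(converted)
```

A table only converts the words it lists, but it never rewrites a valid clinical term into an unrelated word.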

Spell check of the tokens

In [35]:
import enchant
spell_dict = enchant.Dict("en_US")

In [36]:
def check_spelling(american_tokens, dictionary):
    misspelled = []

    for word in american_tokens:
        # Ignore numbers and very short tokens
        if word.isalpha() and len(word) > 2:
            if not dictionary.check(word):
                misspelled.append(word)
    return misspelled

In [37]:
misspelled_words = check_spelling(american_tokens, spell_dict)
print("Misspelled words:", misspelled_words)


Misspelled words: ['covid', 'desaturation', 'desaturation', 'covid']


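The spell check only flags tokens that are missing from the en_US dictionary; it does not change the text. Here the flagged tokens, covid and desaturation, are valid clinical terms that a general-purpose dictionary does not contain, so no correction is needed.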
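
Because general-purpose dictionaries lack many clinical terms, one option is to pass flagged words through an allowlist of known domain vocabulary. The sketch below is an assumption, not part of the notebook: `clinical_allowlist` is hypothetical, and a plain Python set stands in for the enchant dictionary so the example runs on its own. With enchant, `spell_dict.check` could be passed as `is_known`.

```python
def check_spelling_with_allowlist(tokens, is_known, allowlist):
    """Return alphabetic tokens longer than two characters that are
    neither in the allowlist nor accepted by is_known."""
    flagged = []
    for word in tokens:
        if word.isalpha() and len(word) > 2:
            if word not in allowlist and not is_known(word):
                flagged.append(word)
    return flagged

# Stand-in for a dictionary lookup (assumption)
known_words = {"patient", "oxygen", "fever"}
# Hypothetical list of clinical terms to accept
clinical_allowlist = {"covid", "desaturation", "dyspnea"}

demo_tokens = ["patient", "covid", "oxygen", "desaturation", "fevr", "19"]
flagged = check_spelling_with_allowlist(
    demo_tokens, known_words.__contains__, clinical_allowlist
)
print("Flagged:", flagged)
```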

# Before vs. after preprocessing

In [38]:
after_text = " ".join(american_tokens)

In [40]:
print("===== BEFORE (Raw Clinical Text) =====\n")
print(raw_text)
print("\n===== AFTER (Preprocessed Text) =====\n")
print(after_text)

===== BEFORE (Raw Clinical Text) =====

Discharge Summary:

Patient: 60-year-old male with moderate ARDS from COVID-19

Hospital Course:

The patient was admitted to the hospital with symptoms of fever, dry cough, and dyspnea. During physical therapy on the acute ward, the patient experienced coughing attacks that induced oxygen desaturation and dyspnea with any change of position or deep breathing. To avoid rapid deterioration and respiratory failure, a step-by-step approach was used for position changes. The breathing exercises were adapted to avoid prolonged coughing and oxygen desaturation, and with close monitoring, the patient managed to perform strength and walking exercises at a low level. Exercise progression was low initially but increased daily until hospital discharge to a rehabilitation clinic on day 10.

Clinical Outcome:

The patient was discharged on day 10 to a rehabilitation clinic making satisfactory progress with all symptoms resolved.

Follow-up:

The patient will 

Comparing the text before and after preprocessing shows how much noise has been removed: punctuation, casing and stop words are gone. The preprocessed text retains mainly the clinically significant terms that a clinician would focus on.
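
The comparison can also be quantified, for instance by counting whitespace-separated tokens before and after. In this sketch the two strings are shortened stand-ins for `raw_text` and `after_text`, so the numbers are illustrative only.

```python
# Shortened stand-ins for raw_text and after_text (assumption)
before = ("The patient was admitted to the hospital with symptoms of "
          "fever, dry cough, and dyspnea.")
after = "patient admitted hospital symptom fever dry cough dyspnea"

before_count = len(before.split())
after_count = len(after.split())
reduction = 1 - after_count / before_count  # fraction of tokens removed

print(f"{before_count} -> {after_count} tokens ({reduction:.0%} fewer)")
```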

# Pre-processing whole dataset

The steps above, except the spell check, are combined into one function and applied to every note in the dataset.

In [2]:

# IMPORT LIBRARIES
import re
import pandas as pd
import nltk
import enchant
from functools import lru_cache

from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
# Download the NLTK resources used below
nltk.download("stopwords")
nltk.download("wordnet")
# Ensure required column exists
assert "note" in df.columns, "Column 'note' not found in dataset"
# Resource initialisation
lemmatizer = WordNetLemmatizer()
# British & American dictionaries
uk_dict = enchant.Dict("en_GB")
us_dict = enchant.Dict("en_US")
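# Stop words: negations, plus the temporal cues "before" and "after",
# are removed from the stop word set so they stay in the notes
stop_words = set(stopwords.words("english"))
negations = {"no", "not", "without", "denies", "never", "before", "after"}
stop_words = stop_words - negations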
# Clinical abbreviation dictionary
abbreviation_map = {
    "ards": "acute respiratory distress syndrome",
    "sob": "shortness of breath",
    "htn": "hypertension",
    "dm": "diabetes mellitus",
    "copd": "chronic obstructive pulmonary disease",
}
# American English conversion
@lru_cache(maxsize=50000)
def british_to_american(word):
    if uk_dict.check(word) and not us_dict.check(word):
        suggestions = us_dict.suggest(word)
        return suggestions[0] if suggestions else word
    return word
# preprocessing pipeline 
def preprocess_note(text):
    # Convert to string and lowercase
    text = str(text).lower()

    # 1. Tokenization
    tokens = re.findall(r"\b[a-zA-Z]+\b", text)

    # 2. Stopword removal (negations preserved)
    tokens = [t for t in tokens if t not in stop_words]

    # 3. Lemmatization
    tokens = [lemmatizer.lemmatize(t) for t in tokens]

    # 4. Abbreviation expansion
    expanded_tokens = []
    for t in tokens:
        if t in abbreviation_map:
            expanded_tokens.extend(abbreviation_map[t].split())
        else:
            expanded_tokens.append(t)

    # 5. British → American conversion
    final_tokens = [british_to_american(t) for t in expanded_tokens]

    # Reconstruct cleaned text
    return " ".join(final_tokens)
# APPLY PIPELINE TO ENTIRE DATASET
df["note_preprocessed"] = df["note"].apply(preprocess_note)
# SAVE TO NEW CSV
output_file = r"C:\Users\rajak\Downloads\AI Internship\clinical_notes_preprocessed_no_spellcheck.csv"
df.to_csv(output_file, index=False)

print("✅ Preprocessing completed successfully")
print(f"📁 Output saved as: {output_file}")

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\rajak\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\rajak\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


✅ Preprocessing completed successfully
📁 Output saved as: C:\Users\rajak\Downloads\AI Internship\clinical_notes_preprocessed_no_spellcheck.csv
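
One difference from the single-note walkthrough is worth noting: the full pipeline tokenises with `\b[a-zA-Z]+\b`, which discards the numeric tokens that `\b\w+\b` kept. The sketch below compares the two patterns on a made-up phrase.

```python
import re

# Made-up phrase containing clinically relevant numbers
phrase = "oxygen saturation rose from 88% to 96% by day 10"

with_digits = re.findall(r"\b\w+\b", phrase)          # pattern from the walkthrough
letters_only = re.findall(r"\b[a-zA-Z]+\b", phrase)   # pattern from the full pipeline

# Tokens that only the first pattern keeps
dropped = [t for t in with_digits if t not in letters_only]

print(letters_only)
print("Dropped:", dropped)
```

If values such as ages, saturation percentages or day counts matter for the downstream task, the `\w+` pattern may be the safer choice.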


Displaying sample data before and after pre-processing

In [3]:
# Select 3 samples (you can change indices if needed)
sample_indices = [2, 5, 10]

for i in sample_indices:
    print("=" * 80)
    print(f"SAMPLE {i+1}")
    
    print("\n--- BEFORE PREPROCESSING (RAW TEXT) ---\n")
    print(df.loc[i, "note"])
    
    print("\n--- AFTER PREPROCESSING (CLEANED TEXT) ---\n")
    print(df.loc[i, "note_preprocessed"])
    
    print("\n")


SAMPLE 3

--- BEFORE PREPROCESSING (RAW TEXT) ---

Hospital Course Summary:

Admission Date: [Insert date]
Discharge Date: [Insert date]

Patient: [Patient's Name]
Sex: Male
Age: 57 years

Admission Diagnosis: Oxygen Desaturation

Hospital Course:

The patient was admitted to the ICU one week after a positive COVID-19 result due to oxygen desaturation. Physical therapy was initiated promptly after admission, which helped improve the patient's breathing frequency and oxygen saturation. The patient was guided to achieve a prone position resulting in a significant increase in oxygen saturation from 88% to 96%. The patient continued to receive intensive physical therapy, positioning, and oxygen therapy for the next few days. Although there were challenges in achieving the prone position due to the patient's profoundly reduced respiratory capacity and high risk of symptom exacerbation, the medical team succeeded in implementing a safe and individualized approach.

After three days with this

The entire selected dataset has been preprocessed for further use. This preprocessing removes unwanted terms from each summary and keeps the clinical terms that clinicians would generally concentrate on.