# AI4Health - 02 - Clinical Text Classification

---

## Introduction

Clinical text data are unstructured narratives found in medical notes, discharge summaries, and patient reports. This type of text data captures the nuanced reasoning and observations of healthcare professionals. Unlike structured tabular data, clinical text contains rich contextual information that is essential for understanding patient histories, diagnoses, and treatment plans. Leveraging machine learning to classify and interpret this text can support clinicians in identifying key conditions, streamlining documentation, and improving patient care.

In this notebook, you will develop a machine learning model to classify clinical notes. The hands-on steps use a small synthetic dataset of short clinical notes (loaded in Step 2), which serves as a simplified stand-in for real-world resources such as the **MIMIC-III dataset**. MIMIC-III contains de-identified clinical data, including narratives, from real hospital admissions and provides a realistic setting for exploring **natural language processing (NLP)** in healthcare. The classification task focuses on assigning diagnostic categories to free-text notes, a common challenge in automating clinical workflows.

MIMIC-III is a widely recognised resource for clinical NLP research, offering diverse and authentic examples of medical documentation. Its breadth and complexity make it well suited to studying the challenges and opportunities of text-based classification in medicine, and the workflow practised here on synthetic notes follows the same steps.

You will learn how to:

- Preprocess clinical text data for machine learning
- Apply text vectorisation techniques such as **TF-IDF**
- Train and evaluate classifiers (e.g., **Logistic Regression**, **Naive Bayes**) on clinical text
- Assess model performance using accuracy, precision, recall, and confusion matrices
- Discuss the ethical and practical considerations of deploying NLP models in clinical settings

By the end of this notebook, you will gain practical experience with the full workflow of clinical text classification and develop a deeper appreciation for the impact and challenges of applying NLP in healthcare.

### Learning Objectives

- Understand the structure of clinical free-text data
- Learn text preprocessing and vectorisation (e.g., TF-IDF)
- Train and evaluate a classifier on medical text
- Interpret classification results in a clinical context
- Reflect on the practical and ethical implications of NLP in healthcare

---

## Additional Context

### What is Clinical Text Data?

Clinical text data consists of unstructured narratives written by healthcare professionals, such as progress notes, discharge summaries, and radiology reports. Unlike structured data (e.g., lab values in tables), these free-text documents capture nuanced reasoning, observations, and patient histories that are difficult to encode in fixed fields. Clinical text is rich in context but also challenging to analyse due to abbreviations, jargon, and variability in language.

In healthcare, clinical text is used for:
- **Diagnosis documentation** (e.g., describing symptoms, findings, and impressions)
- **Care coordination** (e.g., handoff notes between providers)
- **Quality improvement and research** (e.g., mining notes for trends or adverse events)

### Why Text Classification in Medicine?

Many real-world clinical tasks involve sorting or tagging text:
- Assigning diagnostic codes to notes for billing or research
- Flagging notes that mention critical findings (e.g., sepsis, stroke)
- Triage and routing of patient messages or referrals

Automating these tasks with machine learning can save time, reduce errors, and support clinical decision-making, but requires careful handling due to the sensitive and complex nature of medical language.

### Key Concepts in Clinical NLP

Clinical natural language processing (NLP) involves several foundational concepts that are essential for transforming raw clinical text into meaningful insights. Understanding these concepts helps ensure that machine learning models are both effective and reliable when applied to healthcare data.

- **Preprocessing**: Cleaning and standardising text (e.g., lowercasing, removing stop words) to reduce noise and improve model performance.
- **Vectorisation**: Converting text into numerical features (e.g., TF-IDF) so that machine learning models can process it.
- **Class Imbalance**: Some diagnoses or findings are much rarer than others, which can bias models if not addressed.
- **Interpretability**: Clinicians need to understand why a model made a particular prediction, especially in high-stakes settings.

### Why Use TF-IDF and Logistic Regression?

Selecting the right tools for text classification is crucial in clinical NLP. TF-IDF and logistic regression are commonly used because they balance interpretability and performance, making them suitable for initial experiments and baseline models in healthcare applications.

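- **TF-IDF**: Weights each term by how often it appears in a note and how rare it is across all notes, down-weighting common words and up-weighting rare but informative terms. The weighting itself does not use class labels, but terms that are rare overall often help distinguish between classes.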
- **Logistic Regression**: A simple, interpretable baseline for text classification that works well with high-dimensional, sparse data like TF-IDF vectors.
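
To make the TF-IDF weighting concrete, the sketch below computes a smoothed inverse document frequency by hand for three toy notes. The notes and the helper names `smoothed_idf` and `tf_idf` are hypothetical, invented for illustration. The formula `ln((1 + n) / (1 + df)) + 1` is the smoothed variant that scikit-learn's `TfidfVectorizer` uses by default, although the vectoriser also normalises each document vector, which this sketch omits.

```python
import math

# Three hypothetical, pre-tokenised toy notes (not taken from any real dataset).
docs = [
    ["patient", "reports", "chest", "pain"],
    ["patient", "has", "fever", "and", "cough"],
    ["patient", "shows", "signs", "of", "sepsis"],
]


def smoothed_idf(term, documents):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    n = len(documents)
    df = sum(term in doc for doc in documents)  # number of notes containing the term
    return math.log((1 + n) / (1 + df)) + 1


def tf_idf(term, doc, documents):
    """Raw count of the term in one note multiplied by its smoothed IDF."""
    return doc.count(term) * smoothed_idf(term, documents)


idf_common = smoothed_idf("patient", docs)  # appears in every note
idf_rare = smoothed_idf("sepsis", docs)     # appears in a single note

print(f"idf('patient') = {idf_common:.3f}")  # 1.000
print(f"idf('sepsis')  = {idf_rare:.3f}")    # 1.693
print(f"tf-idf('sepsis', note 3) = {tf_idf('sepsis', docs[2], docs):.3f}")  # 1.693
```

The word that appears in every note receives the minimum weight, while the word confined to one note is weighted more heavily, which is the down-weighting and up-weighting described above.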

### Clinical and Ethical Considerations

Applying NLP in clinical settings demands careful attention to privacy, fairness, transparency, and the potential impact of errors. Sensitive patient information in clinical notes must be securely handled and de-identified. Models should be rigorously evaluated to avoid amplifying biases and to ensure their outputs are interpretable and auditable, as errors can have serious consequences for patient care.

---

## Related Guides

- *MatPlotLib - Pyplot:* https://matplotlib.org/stable/tutorials/pyplot.html
- *Pandas - DataFrame:* https://pandas.pydata.org/docs/user_guide/dsintro.html#basics-dataframe
- *SciKit-Learn - Confusion Matrix:* https://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html
- *SciKit-Learn - Cross Validation (`train_test_split`):* https://scikit-learn.org/stable/modules/cross_validation.html#cross-validation
- *SciKit-Learn - Logistic Regression:* https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
- *SciKit-Learn - Text feature extraction (TF-IDF):* https://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction
- *WordCloud:* https://amueller.github.io/word_cloud/index.html

---

## Step 1: Load Required Libraries

Before we begin, let's import the essential Python libraries for text analysis, visualisation, and machine learning.

- **pandas**: for data manipulation and analysis
- **matplotlib** and **seaborn**: for creating informative plots
- **scikit-learn**: for text vectorisation, model building, and evaluation
- **wordcloud**: for visualising the most frequent terms in each class

Understanding the purpose of each library will help you preprocess, visualise, and model clinical text data effectively.

In [1]:
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from wordcloud import WordCloud

print("OK")

OK


**Questions:**

- **1.1.** Why do we need specialised libraries for text data, rather than just using standard data analysis tools?
- **1.2.** What is the role of `TfidfVectorizer` in text classification?
- **1.3.** How can visualisations like word clouds help in understanding clinical text data?

---

## Step 2: Load and Preview the Dataset

We will now load a synthetic dataset that simulates short clinical text records. Each entry in the dataset contains a brief description of a patient's condition, along with a diagnostic label. This setup mimics real-world clinical documentation, where healthcare professionals record observations, symptoms, and diagnoses in free-text form.

Previewing the dataset is a crucial first step in any data science workflow. By examining a few rows, you can get a sense of the language used, the length and structure of the notes, and the variety of diagnostic categories present. This helps you anticipate challenges such as inconsistent terminology, abbreviations, or missing information, which are common in clinical narratives.

Understanding the structure and distribution of your data will guide your preprocessing decisions and model design. For example, you may notice that some classes are underrepresented, or that certain terms appear frequently in specific diagnoses. These insights will help you tailor your approach to the unique characteristics of clinical text.

In [None]:
file_path = "./datasets/synthetic_clinical_notes.csv"
try:
    df = pd.read_csv(file_path)
except Exception as e:
    print("Failed to load dataset:", e)
    df = pd.DataFrame()

# Display the first few rows of the DataFrame
df.head()
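
Beyond `df.head()`, a few quick checks can reveal missing notes and unusual note lengths before any modelling. The sketch below runs on a small hypothetical DataFrame so that it is self-contained; the rows are invented, and it assumes the `text` and `label` column names used in the later steps.

```python
import pandas as pd

# Hypothetical rows for illustration only; the real file may look different.
toy = pd.DataFrame({
    "text": [
        "Patient reports chest pain.",
        None,  # a missing note
        "Fever and productive cough for 3 days.",
    ],
    "label": ["cardiac", "respiratory", "respiratory"],
})

# Count missing values per column.
missing = toy.isna().sum()

# Drop rows without text, then measure each note's length in words.
clean = toy.dropna(subset=["text"]).copy()
clean["n_words"] = clean["text"].str.split().str.len()

print(missing)
print(clean[["label", "n_words"]])
```

The same calls can be applied to `df` once the dataset has loaded. Very short or empty notes are worth inspecting by hand, since they carry little signal for a classifier.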

**Questions:**

- **2.1.** What challenges might arise when working with free-text clinical data compared to structured tabular data?
- **2.2.** What ethical considerations should you keep in mind when handling clinical text data?

---

## Step 3: Visualise Class Distribution

Understanding how many samples belong to each class is essential before training a model. In clinical text classification, some diagnoses may be much more common than others, leading to class imbalance. This can cause a model to favour the majority class and overlook less frequent but important conditions.

A quick bar plot of class counts helps you spot any imbalance and decide if you need to adjust your approach, such as by resampling or using class weights.

In [None]:
# Bar plot for class distribution
df['label'].value_counts().plot(kind='bar', title='Class Distribution')
plt.xlabel('Class')
plt.ylabel('Number of Samples')
plt.grid(True)
plt.show()
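
As a rough illustration of class weighting, the sketch below computes "balanced" weights by hand for a deliberately imbalanced set of hypothetical labels. The label names and counts are invented. The formula `n_samples / (n_classes * class_count)` is the heuristic behind scikit-learn's `class_weight="balanced"` option.

```python
from collections import Counter

# Hypothetical, deliberately imbalanced labels (8 vs 2).
labels = ["pneumonia"] * 8 + ["sepsis"] * 2

counts = Counter(labels)
n_samples = len(labels)
n_classes = len(counts)

# "Balanced" heuristic: rarer classes receive proportionally larger weights.
class_weights = {
    cls: n_samples / (n_classes * count) for cls, count in counts.items()
}

for cls, weight in class_weights.items():
    print(f"{cls}: count={counts[cls]}, weight={weight:.3f}")
# pneumonia: count=8, weight=0.625
# sepsis: count=2, weight=2.500
```

In practice the weights rarely need to be computed manually: `LogisticRegression(class_weight="balanced")` applies the same heuristic during training, and passing `stratify=` to `train_test_split` keeps the class proportions similar in the training and test sets.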

**Questions:**

- **3.1.** Why should we examine the distribution of classes before training a model?
- **3.2.** How might class imbalance affect the performance of your classifier?
- **3.3.** What strategies could you use to address class imbalance in text classification?

---

## Step 4: Word Clouds per Class

Visualising word frequencies for each class helps you understand which terms are most associated with each diagnostic category. Word clouds provide a quick, intuitive way to see the most common words in the clinical notes for each class.

By comparing word clouds, you can identify distinctive terms that may help the model differentiate between diagnoses. This step also helps you spot potential issues, such as common words that appear across all classes or irrelevant terms that might need to be removed during preprocessing.

- *WordCloud:* https://amueller.github.io/word_cloud/generated/wordcloud.WordCloud.html

In [None]:
# Generate word cloud for each class
for cls in df['label'].unique():
    text = ' '.join(df[df['label'] == cls]['text'])
    wc = WordCloud(width=600, height=400, background_color='white').generate(text)
    plt.figure()
    plt.imshow(wc, interpolation='bilinear')
    plt.axis('off')
    plt.title(f'Word Cloud for {cls}')
    plt.show()
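
Word clouds are hard to compare precisely, so a plain frequency table per class can be a useful complement. The sketch below counts terms per class using only the standard library. The notes, labels, and the tiny stop-word list are hypothetical and chosen purely for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical (text, label) pairs, for illustration only.
notes = [
    ("patient reports chest pain and shortness of breath", "cardiac"),
    ("chest pain radiating to left arm", "cardiac"),
    ("patient has fever and productive cough", "respiratory"),
    ("persistent cough with wheezing", "respiratory"),
]

# A tiny illustrative stop-word list; a real analysis would use a fuller one.
stop_words = {"and", "of", "to", "with", "has", "patient", "reports"}

# Map each class label to a Counter of its term frequencies.
term_counts = defaultdict(Counter)
for text, label in notes:
    tokens = [t for t in text.lower().split() if t not in stop_words]
    term_counts[label].update(tokens)

for label, counter in term_counts.items():
    print(label, counter.most_common(3))
```

Unlike a word cloud, the table gives exact counts, which makes it easier to spot terms that are frequent in every class and therefore unlikely to help a classifier.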

**Questions:**

- **4.1.** How can word clouds help you understand the key terms associated with each class?
- **4.2.** What are the limitations of using word clouds for clinical text analysis?
- **4.3.** How might the most frequent words differ between classes, and why is this useful for classification?
- **4.4.** What additional text visualisations could provide deeper insights into your data?

---

## Step 5: Text Preprocessing and Vectorisation

To use clinical text in machine learning models, we need to convert it into numerical features. TF-IDF (Term Frequency–Inverse Document Frequency) is a common technique that transforms text into vectors, giving more weight to terms that are frequent within a note but rare across the whole collection. Such terms are often the ones that help distinguish between classes.

Preprocessing steps like removing stop words or limiting the number of features can help focus the model on the most relevant terms and reduce noise. This process prepares the clinical notes for effective model training and evaluation.

- *SciKit-Learn - TfidfVectorizer:* https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
- *SciKit-Learn - train_test_split:* https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

In [None]:
# Split data
X_train, X_test, y_train, y_test = train_test_split(df['text'], df['label'], test_size=0.2, random_state=42)

# TF-IDF Vectorisation
vectoriser = TfidfVectorizer(stop_words='english', max_features=1000)
X_train_tfidf = vectoriser.fit_transform(X_train)
X_test_tfidf = vectoriser.transform(X_test)

print("OK")

**Questions:**

- **5.1.** Why do we need to convert text into numerical features for machine learning?
- **5.2.** What does TF-IDF capture about the importance of words in clinical notes?
- **5.3.** How might preprocessing choices (like removing stop words or limiting features) impact model performance?
- **5.4.** What are some challenges unique to preprocessing clinical text compared to general text?

---

## Step 6: Train a Classifier

Now we will train a simple Logistic Regression model on the vectorised clinical text. Logistic Regression is a popular starting point for text classification because it is efficient, interpretable, and often performs well on high-dimensional data like text.

Training the model allows us to learn patterns in the clinical notes that are associated with each diagnostic class.

- *SciKit-Learn - LogisticRegression:* https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

In [None]:
# Train the classifier
clf = LogisticRegression(max_iter=500)
clf.fit(X_train_tfidf, y_train)

print("OK")

**Questions:**

- **6.1.** Why is logistic regression a good starting point for text classification?
- **6.2.** What are the advantages and limitations of using logistic regression for multi-class clinical text data?
- **6.3.** How can you interpret the model’s coefficients in the context of clinical terms?
- **6.4.** What other classifiers might be suitable for this task, and why?

---

## Step 7: Evaluate Model Performance

After training the model, it is important to assess how well it performs on unseen data. We use a confusion matrix and classification report to evaluate accuracy and per-class performance. This helps identify which classes are predicted well and where the model may be making mistakes, which is crucial information in clinical applications where misclassification can have serious consequences.

- *SciKit-Learn - confusion_matrix:* https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html
- *Seaborn - heatmap:* https://seaborn.pydata.org/generated/seaborn.heatmap.html

In [None]:
# Predictions and confusion matrix
y_pred = clf.predict(X_test_tfidf)
cm = confusion_matrix(y_test, y_pred, labels=clf.classes_)

# Display confusion matrix
plt.figure(figsize=(6, 4))
sns.heatmap(cm, annot=True, fmt='d', xticklabels=clf.classes_, yticklabels=clf.classes_, cmap='Blues')
plt.xlabel('Predicted')
plt.ylabel('True')
plt.title('Confusion Matrix')
plt.show()

# Classification report
print(classification_report(y_test, y_pred))
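
To see where the numbers in a classification report come from, the sketch below derives per-class precision and recall directly from a confusion matrix. The matrix and class names are hypothetical. It follows the same layout as `confusion_matrix` and the heatmap above, with rows as true classes and columns as predicted classes.

```python
# Hypothetical 3-class confusion matrix: rows = true class, columns = predicted class.
classes = ["cardiac", "neuro", "respiratory"]
cm = [
    [8, 1, 1],
    [2, 6, 2],
    [0, 1, 9],
]

metrics = {}
for i, cls in enumerate(classes):
    tp = cm[i][i]                          # correctly predicted as cls
    fn = sum(cm[i]) - tp                   # true cls, predicted as another class
    fp = sum(row[i] for row in cm) - tp    # another class, predicted as cls
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    metrics[cls] = (precision, recall)
    print(f"{cls}: precision={precision:.2f}, recall={recall:.2f}")

# Overall accuracy: correct predictions (the diagonal) over all predictions.
accuracy = sum(cm[i][i] for i in range(len(classes))) / sum(map(sum, cm))
print(f"accuracy={accuracy:.3f}")
# cardiac: precision=0.80, recall=0.80
# neuro: precision=0.75, recall=0.60
# respiratory: precision=0.75, recall=0.90
# accuracy=0.767
```

In this toy matrix the overall accuracy looks reasonable, yet the model misses 40% of the true `neuro` cases, which is the kind of per-class gap that a single accuracy figure hides.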

**Questions:**

- **7.1.** What does the confusion matrix tell you about your model’s strengths and weaknesses?
- **7.2.** Why is it important to look at per-class performance, especially in a clinical context?
- **7.3.** What are the potential consequences of misclassifying one clinical class as another?
- **7.4.** How could you improve your model if you notice poor performance on a particular class?

---

## Step 8: Summary and Reflection

In this notebook, you completed the full workflow of clinical text classification, from exploring and visualising synthetic clinical notes to preprocessing, vectorising, and modelling the data. You learned how to transform unstructured clinical narratives into numerical features using TF-IDF, enabling machine learning algorithms to identify patterns relevant to diagnostic categories.

By training and evaluating a logistic regression classifier, you saw the importance of both technical steps and careful assessment of model performance, especially in a clinical context where misclassifications can have serious consequences. Throughout, you also considered challenges unique to healthcare data, such as class imbalance, interpretability, and the ethical handling of sensitive patient information.

While this task was simplified, it mirrors many real-world applications of NLP in medicine, such as triage, diagnostic support, and decision augmentation.

### Summary

- Preprocessing and vectorisation are key to making text usable for machine learning.
- Word clouds help explore term frequencies visually, but frequency alone does not measure how important a feature is to the model.
- Classification models must be carefully evaluated for clinical safety.

### What's next?

- **8.1** What real clinical tasks could benefit from NLP like this?
- **8.2** How would you gather and anonymise real-world clinical notes ethically?
- **8.3** What kinds of mistakes would be unacceptable in this application?
- **8.4** How could model outputs be made interpretable for clinicians?

---

## Explore Further

### Articles

- **Opportunities and Obstacles for Deep Learning in Biology and Medicine**
<br>*Journal of the Royal Society Interface*
  - https://pubmed.ncbi.nlm.nih.gov/29618526/

- **Natural Language Processing of Clinical Notes on Chronic Diseases: Systematic Review**
<br>*JMIR Med Inform*
  - https://medinform.jmir.org/2019/2/e12239/

- **Clinical Text Data in Machine Learning: Systematic Review**
<br>*JMIR Med Inform*
  - https://medinform.jmir.org/2020/3/e17984

- **Ethical and legal challenges of artificial intelligence-driven healthcare**
<br>*Artificial Intelligence in Healthcare (book chapter)*
  - https://www.sciencedirect.com/science/article/pii/B9780128184387000125

- **MIMIC-III, a freely accessible critical care database**
<br>*Scientific Data*
  - https://www.nature.com/articles/sdata201635