## Applied Sequence Modeling with PyTorch + ClinicalBERT
### Predict ICU Stay Length based on lab results using ClinicalBERT

Updated 09/27/2024 G. Chism, U of A InfoSci + DataLab

## Install required libraries

This notebook uses _PyTorch_, _transformers_, _scikit-learn_, _pandas_, and _numpy_.

**To execute notebook code cells:** Press _SHIFT+ENTER_

In [2]:
#%pip install -q torch
#%pip install -q transformers
#%pip install -q scikit-learn
#%pip install -q pandas
#%pip install -q numpy
#%pip install watermark

Note: you may need to restart the kernel to use updated packages.


It's best practice to load all of the libraries at the top of the notebook.

In [3]:
# Import specific classes from PyTorch
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, Dataset

# Import transformers
from transformers import BertTokenizer, BertForSequenceClassification

# Import preprocessing from Scikit-Learn
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_absolute_error

# Import pandas and numpy
import pandas as pd
import numpy as np

import itertools


A module that was compiled using NumPy 1.x cannot be run in
NumPy 2.1.1 as it may crash. To support both 1.x and 2.x
versions of NumPy, modules must be compiled with NumPy 2.0.
Some module may need to rebuild instead e.g. with 'pybind11>=2.12'.

If you are a user of the module, the easiest solution will be to
downgrade to 'numpy<2' or try to upgrade the affected module.
We expect that some modules will need time to support NumPy 2.


Check whether a GPU is available (hint: it won't be...)

In [4]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

Using device: cpu


## Data Loading and Preprocessing:

### Converting Data Types:

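`pd.to_numeric()` converts columns such as `los` and `valuenum` to numeric types, coercing any values that cannot be parsed (e.g., invalid strings) to `NaN`. The cell below applies it to `los`.
`pd.to_datetime()` does the same for timestamp columns such as `charttime` and `dob`, so they can be treated as datetime objects for time-series modeling; the cell below does not call it, so both columns stay as strings.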
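
As a small illustration of the coercion behavior described above, the sketch below applies `pd.to_numeric()` and `pd.to_datetime()` to a toy frame. The values are made up for illustration and are not taken from `mimic_data.csv`.

```python
import pandas as pd

# Toy frame with hypothetical values; "n/a" and "not a date" stand in for invalid entries
toy = pd.DataFrame({
    "valuenum": ["7.0", "126", "n/a"],
    "charttime": ["2164-09-24 20:21:00", "2164-09-25 05:25:00", "not a date"],
})

# errors="coerce" turns anything that cannot be parsed into NaN (numbers) or NaT (dates)
toy["valuenum"] = pd.to_numeric(toy["valuenum"], errors="coerce")
toy["charttime"] = pd.to_datetime(toy["charttime"], errors="coerce")

print(toy.dtypes)
print(toy)
```

The coerced entries show up as missing values, which is why a missing-value check normally follows the type conversion.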

### Handling Missing Values:

After converting the data types, we check for missing values and handle them (in this case, dropping rows with missing values in critical columns like `los` and `valuenum`).
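
A minimal sketch of that check-then-drop step on a toy frame (hypothetical values, same column names as the notebook):

```python
import numpy as np
import pandas as pd

# Toy frame: one row is missing valuenum, one row is missing los
toy = pd.DataFrame({
    "subject_id": [1, 1, 2, 3],
    "los": [2.5, 2.5, np.nan, 4.0],
    "valuenum": [7.0, np.nan, 11.2, 106.0],
})

# Count missing values per column before dropping anything
print(toy.isna().sum())

# Keep only rows where both critical columns are present.
# .copy() returns an independent frame, so later column assignments
# do not trigger pandas' SettingWithCopyWarning.
clean = toy.dropna(subset=["los", "valuenum"]).copy()
print(len(toy), "->", len(clean))
```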


### Check Data Info and Head:

Printing the first rows is a quick sanity check that the columns have the expected types and values before modeling.

In [5]:
# Assuming the data is already preprocessed via mimic-iii-demo-subset.py and saved as mimic_data.csv
mimic_data = pd.read_csv('data/mimic_data.csv')

mimic_data['los'] = pd.to_numeric(mimic_data['los'], errors='coerce')

# Handle missing values (example: drop rows with missing valuenum or los)
# .copy() makes clean_data an independent DataFrame, so the assignments below do not raise SettingWithCopyWarning
clean_data = mimic_data.dropna(subset=['los', 'valuenum']).copy()

# Convert los to float64
clean_data['los'] = clean_data['los'].astype('float64')

# Scale los to the [0, 1] range
scaler = MinMaxScaler()
clean_data['los'] = scaler.fit_transform(clean_data[['los']]).ravel()

print(clean_data.head())

  subject_id icustay_id       los itemid            charttime value valuenum  \
0      10006     206504  0.043246  50912  2164-09-24 20:21:00   7.0      7.0   
1      10006     206504  0.043246  50931  2164-09-24 20:21:00   126    126.0   
2      10006     206504  0.043246  51222  2164-09-24 20:21:00  11.2     11.2   
3      10006     206504  0.043246  50912  2164-09-25 05:25:00   7.4      7.4   
4      10006     206504  0.043246  50931  2164-09-25 05:25:00   106    106.0   

  valueuom gender                  dob  
0    mg/dL      F  2094-03-05 00:00:00  
1    mg/dL      F  2094-03-05 00:00:00  
2     g/dL      F  2094-03-05 00:00:00  
3    mg/dL      F  2094-03-05 00:00:00  
4    mg/dL      F  2094-03-05 00:00:00  
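
The cell above rescales `los` to the [0, 1] range with `MinMaxScaler`. With its default settings this is the mapping (x - min) / (max - min), so a prediction made on the scaled target can be mapped back to days. A minimal NumPy sketch of the round trip, using made-up stay lengths:

```python
import numpy as np

los_days = np.array([1.0, 3.0, 5.0, 11.0])  # hypothetical stay lengths in days

lo, hi = los_days.min(), los_days.max()
scaled = (los_days - lo) / (hi - lo)   # same mapping as MinMaxScaler's default (0, 1) range
restored = scaled * (hi - lo) + lo     # inverse mapping back to days

print(scaled)
print(restored)
```

For a fitted scaler, `scaler.inverse_transform()` performs the same inverse mapping.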


## Group Data by Patients and Time
Sort the data by patient and `charttime`.

In [6]:
# Reassign instead of sorting in place, which avoids pandas' SettingWithCopyWarning
clean_data = clean_data.sort_values(['subject_id', 'charttime'])


## Create a Dataset Suitable for ClinicalBERT

- Convert the numerical lab values to strings that ClinicalBERT can use as input.

- Each patient’s lab values are concatenated into a single text, simulating a clinical note.

In [7]:
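# Concatenate lab values for each patient to simulate a clinical note
def create_text_representation(df):
    df = df.copy()  # Work on a copy so the caller's DataFrame is not modified
    df['text'] = df.groupby('subject_id')['valuenum'].transform(lambda x: ' '.join(map(str, x)))
    df = df.drop_duplicates(subset=['subject_id'])  # Keep one row per patient
    return df[['subject_id', 'text', 'los']]

# Apply the transformation
# NOTE: this passes the raw mimic_data, whose `los` is unscaled and may still contain NaN;
# that is a likely source of the "NaN loss detected" messages during training.
# Passing clean_data instead would use the cleaned, scaled values.
data_for_clinicalbert = create_text_representation(mimic_data)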
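
To see what the simulated note looks like, the sketch below builds the same kind of text on a toy frame. It uses `groupby(...).agg(...)` rather than `transform` plus `drop_duplicates`; this is an alternative formulation, not the notebook's own code, and the values are hypothetical.

```python
import pandas as pd

# Toy lab results: three measurements for one patient, one for another
toy = pd.DataFrame({
    "subject_id": [10006, 10006, 10006, 10011],
    "valuenum": [7.0, 126.0, 11.2, 0.9],
    "los": [1.6, 1.6, 1.6, 13.9],
})

# One row per patient: lab values joined into a space-separated string,
# and the (per-patient constant) length of stay carried along
notes = (
    toy.groupby("subject_id")
       .agg(text=("valuenum", lambda x: " ".join(map(str, x))),
            los=("los", "first"))
       .reset_index()
)
print(notes)
```

Note that the text carries only the values in row order; item identifiers and units are not part of the string.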

## Prepare Tokenizer and Dataset for ClinicalBERT

In [8]:
tokenizer = BertTokenizer.from_pretrained('emilyalsentzer/Bio_ClinicalBERT')

class ClinicalBERTDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_len):
        self.texts = texts
        self.labels = labels.astype(float)
        self.tokenizer = tokenizer 
        self.max_len = max_len 

    def __len__(self):
        return len(self.texts)
    
    def __getitem__(self, item):
        text = str(self.texts[item])
        label = float(self.labels[item])  

        encoding = self.tokenizer.encode_plus(
            text,
            add_special_tokens=True,
            max_length=self.max_len,
            return_token_type_ids=False,
            padding='max_length',
            truncation=True,
            return_attention_mask=True,
            return_tensors='pt'
        )

        return {
            'input_ids': encoding['input_ids'].flatten(),
            'attention_mask': encoding['attention_mask'].flatten(),
            'labels': torch.tensor(label, dtype=torch.float)
        }
    
# Create Dataset and DataLoader
MAX_LEN = 128
BATCH_SIZE = 16

dataset = ClinicalBERTDataset(
    texts=data_for_clinicalbert['text'].to_numpy(),
    labels=data_for_clinicalbert['los'].to_numpy(),
    tokenizer=tokenizer,
    max_len=MAX_LEN
)

dataloader = DataLoader(dataset, batch_size=BATCH_SIZE, shuffle=True)
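
The `padding='max_length'` and `truncation=True` options mean every item comes out with exactly `MAX_LEN` token ids, and the attention mask marks which positions hold real tokens. The pure-Python sketch below mimics that behavior on made-up token ids. `pad_or_truncate` is a hypothetical helper written for illustration, not part of the `transformers` API, and it ignores special tokens such as `[CLS]` and `[SEP]`.

```python
def pad_or_truncate(token_ids, max_len, pad_id=0):
    """Return (input_ids, attention_mask), both of length max_len."""
    kept = token_ids[:max_len]                      # truncation: drop anything past max_len
    n_pad = max_len - len(kept)
    input_ids = kept + [pad_id] * n_pad             # right-padding up to max_len
    attention_mask = [1] * len(kept) + [0] * n_pad  # 1 = real token, 0 = padding
    return input_ids, attention_mask

# A short sequence gets padded, a long one gets cut
short_ids, short_mask = pad_or_truncate([5, 6, 7, 8], max_len=6)
long_ids, long_mask = pad_or_truncate(list(range(1, 11)), max_len=6)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

With `MAX_LEN = 128`, a patient with many lab values keeps at most 128 tokens of the concatenated text; everything after that is dropped before the model sees it.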


## Define ClinicalBERT Model for Regression

In [9]:
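# num_labels=1 gives the classification head a single output, which is used here as the regression prediction
model = BertForSequenceClassification.from_pretrained('emilyalsentzer/Bio_ClinicalBERT', num_labels=1).to(device)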

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at emilyalsentzer/Bio_ClinicalBERT and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


## Set Up Optimizer and Training Loop


In [10]:
optimizer = optim.AdamW(model.parameters(), lr=2e-5)
criterion = nn.MSELoss()

# Training Loop
EPOCHS = 5

for epoch in range(EPOCHS):
    model.train()  # Set model to training mode
    total_loss = 0

    for batch in dataloader:
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)

        optimizer.zero_grad()  # Clear previous gradients

        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        loss = criterion(outputs.logits.squeeze(-1), labels)  # Drop the trailing dimension: (batch, 1) -> (batch,)
        
        if torch.isnan(loss):  # Check for NaN loss
            print("NaN loss detected")
            continue
        
        total_loss += loss.item()

        loss.backward()  # Backpropagation
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # Clip gradients
        optimizer.step()  # Update model parameters

    avg_train_loss = total_loss / len(dataloader)
    print(f'Epoch {epoch + 1}/{EPOCHS}, Loss: {avg_train_loss:.4f}')

NaN loss detected
Epoch 1/5, Loss: 45.3497
NaN loss detected
Epoch 2/5, Loss: 33.2112
NaN loss detected
Epoch 3/5, Loss: 28.5100
NaN loss detected
Epoch 4/5, Loss: 36.8177
NaN loss detected
Epoch 5/5, Loss: 28.8239


## Evaluation

In [15]:
model.eval()

predictions = []
actuals = []

# Note: this reuses the training dataloader, so the MAE is measured on the training data
with torch.no_grad():
    for batch in dataloader:
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)

        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        preds = outputs.logits.squeeze(-1).cpu().tolist()  # squeeze(-1) still yields a list when the batch has one item
        predictions.extend(preds)
        actuals.extend(labels.cpu().tolist())

# Convert to NumPy arrays so NaN values can be masked out
predictions = np.array(predictions)
actuals = np.array(actuals)

# Remove NaN values from predictions and actuals
mask = ~np.isnan(predictions) & ~np.isnan(actuals)  # Only keep non-NaN values
predictions = predictions[mask]
actuals = actuals[mask]

# Calculate Mean Absolute Error (MAE)
mae = mean_absolute_error(actuals, predictions)
print(f'Mean Absolute Error: {mae:.2f}')

Mean Absolute Error: 3.44
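
The evaluation cell iterates over the same `dataloader` that was used for training, so the reported MAE describes fit to the training data rather than performance on unseen patients. Below is a minimal sketch of a patient-level hold-out split, assuming one row per patient as in `data_for_clinicalbert`. The 80/20 ratio, the seed, and the toy values are arbitrary choices for illustration.

```python
import pandas as pd

# Toy stand-in for data_for_clinicalbert: one row per patient (hypothetical values)
patients = pd.DataFrame({
    "subject_id": range(10),
    "text": [f"{i}.0 {i + 1}.0" for i in range(10)],
    "los": [float(i) for i in range(10)],
})

# Randomly assign 80% of patients to training; the rest form the validation set
train_df = patients.sample(frac=0.8, random_state=42)
val_df = patients.drop(train_df.index)

print(len(train_df), len(val_df))
```

Each frame could then be wrapped in its own `ClinicalBERTDataset` and `DataLoader`, with the evaluation loop run on the validation loader only.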
