## Applied Sequence Modeling with PyTorch + ClinicalBERT
### Predict ICU Stay Length based on lab results using ClinicalBERT

Updated 09/27/2024 G. Chism, U of A InfoSci + DataLab

## Install required libraries

For this example we will use _PyTorch_, _transformers_, _sklearn_, _pandas_, and _numpy_.

**To execute code Notebook cells:** Press _SHIFT+ENTER_

In [None]:
#!pip install -q torch
#!pip install -q transformers
#!pip install -q scikit-learn
#!pip install -q pandas
#!pip install -q numpy
#!pip install -q watermark

It's best practice to load all of the libraries at the top of the notebook.

In [2]:
# Import specific classes from PyTorch
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

# Import transformers
from transformers import BertTokenizer, BertModel, BertForSequenceClassification

# Import AdamW from PyTorch (the transformers implementation is deprecated)
from torch.optim import AdamW

# Import preprocessing from Scikit-Learn
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_absolute_error

# Import pandas and numpy
import pandas as pd
import numpy as np

import itertools

Check whether a GPU is available (hint: we won't have one...)

In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

## Data Loading and Preprocessing:

### Converting Data Types:

`pd.to_numeric()` converts columns like `los` and `valuenum` to numeric types, coercing any errors (e.g., invalid strings) to `NaN`.

`pd.to_datetime()` ensures that the `charttime` and `dob` columns are properly treated as datetime objects for time-series modeling.
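
As a minimal sketch of the conversions described here, the cell below applies `pd.to_numeric()` and `pd.to_datetime()` to a small toy table. The `toy` DataFrame and its values are made up for illustration and are not taken from the MIMIC data; only the column names follow the dataset.

```python
import pandas as pd

# Toy stand-in for the lab table (hypothetical values, for illustration only)
toy = pd.DataFrame({
    "los": ["2.5", "3.1", "not recorded"],
    "valuenum": ["7.4", "abc", "140"],
    "charttime": ["2130-01-01 08:00:00", "2130-01-01 12:30:00", "bad timestamp"],
})

# errors="coerce" turns strings that cannot be parsed into NaN instead of raising
for col in ["los", "valuenum"]:
    toy[col] = pd.to_numeric(toy[col], errors="coerce")

# Unparseable timestamps become NaT (the datetime equivalent of NaN)
toy["charttime"] = pd.to_datetime(toy["charttime"], errors="coerce")

print(toy.dtypes)
print(toy)
```

The same pattern could be applied to `mimic_data` right after loading, before any rows are dropped.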

### Handling Missing Values:

After converting the data types, we check for missing values and handle them (in this case, dropping rows with missing values in critical columns like `los` and `valuenum`).
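
A short sketch of this check-then-drop step, again on a made-up toy table (the `toy` values are hypothetical): count the missing entries per column first, then drop only the rows that are missing a critical column.

```python
import numpy as np
import pandas as pd

# Toy table with a few gaps (hypothetical values, for illustration only)
toy = pd.DataFrame({
    "subject_id": [1, 1, 2, 3],
    "los": [2.5, 2.5, np.nan, 4.0],
    "valuenum": [7.4, np.nan, 140.0, 98.0],
    "valueuom": ["units", None, "mEq/L", "mg/dL"],
})

# Count missing entries per column before deciding what to drop
missing_counts = toy.isna().sum()
print(missing_counts)

# Drop only rows where a critical column is missing;
# gaps in non-critical columns (e.g. valueuom) do not remove a row on their own
toy_clean = toy.dropna(subset=["los", "valuenum"])
print(f"Kept {len(toy_clean)} of {len(toy)} rows")
```

Restricting `dropna()` to a `subset` keeps rows that are only missing less important fields, which matters when the dataset is small.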


### Check Data Info and Head:

Calling `info()` reports each column's dtype and non-null count, so we can verify that the critical columns have no missing values and spot any columns that still need a type conversion before modeling.

In [3]:
# Assuming the data is already preprocessed via mimic-iii-demo-subset.py and saved as mimic_data.csv
mimic_data = pd.read_csv('data/mimic_data.csv')

# Handle missing values (example: drop rows with missing valuenum or los)
clean_data = mimic_data.dropna(subset=['los', 'valuenum'])

print(clean_data.info())

<class 'pandas.core.frame.DataFrame'>
Index: 29981 entries, 0 to 29986
Data columns (total 10 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   subject_id  29981 non-null  object
 1   icustay_id  29981 non-null  object
 2   los         29981 non-null  object
 3   itemid      29981 non-null  object
 4   charttime   29981 non-null  object
 5   value       29981 non-null  object
 6   valuenum    29981 non-null  object
 7   valueuom    29981 non-null  object
 8   gender      29981 non-null  object
 9   dob         29981 non-null  object
dtypes: object(10)
memory usage: 2.5+ MB
None


## Group Data by Patients and Time
Sort the data by patient (`subject_id`) and `charttime`.

In [None]:
# Reassign instead of sorting in place to avoid a SettingWithCopyWarning
clean_data = clean_data.sort_values(['subject_id', 'charttime'])

## Create a Dataset Suitable for ClinicalBERT

- Convert numerical lab values to strings that ClinicalBERT can use as input. (The columns are already loaded with `object` dtype, so this is already satisfied.)

- Each patient’s lab values are concatenated into a single text, simulating a clinical note.
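
The two steps above can be sketched as follows. This is one possible way to build the text, not necessarily the format used later in the notebook: the `"item <itemid>: <value> <unit>"` template, the `lab_text` column, and the `patient_notes` table are all assumptions made for illustration, and the toy values are made up. The sketch also assumes a single `los` value per patient.

```python
import pandas as pd

# Toy lab table (hypothetical values, for illustration only)
toy = pd.DataFrame({
    "subject_id": [10, 10, 11],
    "charttime": ["2130-01-01 08:00:00", "2130-01-01 12:00:00", "2131-05-02 09:00:00"],
    "itemid": [50912, 50971, 50912],
    "valuenum": [1.2, 4.1, 0.9],
    "valueuom": ["mg/dL", "mEq/L", "mg/dL"],
    "los": [2.5, 2.5, 4.0],
})

# Keep each patient's measurements in chronological order
toy = toy.sort_values(["subject_id", "charttime"])

# Render every lab measurement as a short text fragment
toy["lab_text"] = (
    "item " + toy["itemid"].astype(str)
    + ": " + toy["valuenum"].astype(str)
    + " " + toy["valueuom"]
)

# Concatenate the fragments into one pseudo-note per patient,
# keeping the length of stay as the regression target
patient_notes = (
    toy.groupby("subject_id")
       .agg(text=("lab_text", lambda s: "; ".join(s)), los=("los", "first"))
       .reset_index()
)

print(patient_notes.loc[0, "text"])
# item 50912: 1.2 mg/dL; item 50971: 4.1 mEq/L
```

The resulting `text` column can then be passed to a BERT tokenizer, which truncates or pads each note to a fixed maximum length.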