# Before you use this template

This is a recommended template for the project report. It only covers the general types of research in our paper pool, so feel free to edit it to better fit your project. You will iteratively update the same notebook submission for your draft and the final submission. Please check the project rubrics to get a sense of what is expected in the template.

---

# FAQ and Important Notes
* Copy this template to your Google Drive and name your notebook by your team ID (upper-left corner). Don't edit this original file.
* This template covers most questions we want to ask about your reproduction experiment. You don't need to follow the template exactly, but you should address the questions. Feel free to customize your report accordingly.
* Every report must contain runnable code and the necessary annotations (in text and code comments).
* The notebook works like a demo and only uses small-size data (a subset of the original data or processed data). The entire runtime of the notebook, including data reading, data processing, model training, printing, figure plotting, etc., must be within 8 min; otherwise, your grade may be penalized.
  * If the raw dataset is too large to be loaded, you can select a subset of the data, pre-process it, upload the subset or processed data to Google Drive, and load it in this notebook.
  * If the whole training takes too long to run, you can set the number of training epochs to a small number, e.g., 3, just to show that the training is runnable.
  * For model validation, you can train the model outside this notebook in advance, then load the pretrained model and use it for validation (display the figures, print the metrics).
* Post-processing is important! For post-processing of the results, please use plots/figures. The code to summarize results and plot figures may be tedious, but it won't be a waste of time since these figures can be used for the presentation. When plotting in code, the figures should have titles or captions if necessary (e.g., title your figure with "Figure 1. xxxx").
* There is no page limit for your notebook report. You can also use separate notebooks for the report; just make sure your grader can access and run/test them.
* If you use outside resources, please cite them (in any format). Include links to the resources if necessary.
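
One way to keep the 8-minute limit in view is to time each stage as the notebook runs. The following is a minimal sketch and not part of the template: the `StageTimer` class, the stage name, and the `BUDGET_SECONDS` constant are hypothetical names introduced only for illustration.

```python
import time

BUDGET_SECONDS = 8 * 60  # the 8-minute runtime limit stated above


class StageTimer:
    """Accumulates wall-clock time per named stage (hypothetical helper)."""

    def __init__(self):
        self.durations = {}

    def record(self, name, func, *args, **kwargs):
        # Run func, store how long it took under `name`, and return its result
        start = time.perf_counter()
        result = func(*args, **kwargs)
        self.durations[name] = time.perf_counter() - start
        return result

    def total(self):
        return sum(self.durations.values())

    def within_budget(self, budget=BUDGET_SECONDS):
        return self.total() <= budget


timer = StageTimer()
# A toy stage standing in for data loading, training, plotting, etc.
squares = timer.record("toy_stage", lambda n: [i * i for i in range(n)], 1000)
print(f"total so far: {timer.total():.4f}s, within budget: {timer.within_budget()}")
```

Wrapping each expensive step in `timer.record(...)` makes it easy to see which stage to shrink (fewer epochs, smaller subset) if the total approaches the limit.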

# Mount Notebook to Google Drive
Upload the data, pretrained model, figures, etc. to your Google Drive, then mount Google Drive in this notebook. After that, you can access the resources freely.

Instruction: https://colab.research.google.com/notebooks/io.ipynb

Example: https://colab.research.google.com/drive/1srw_HFWQ2SMgmWIawucXfusGzrj1_U0q

Video: https://www.youtube.com/watch?v=zc8g8lGcwQU

In [None]:
!wget http://nlp.stanford.edu/data/glove.6B.zip
!unzip glove.6B.zip

from gensim.scripts.glove2word2vec import glove2word2vec

glove_input_file = 'glove.6B.100d.txt'
word2vec_output_file = 'glove.6B.100d.word2vec.txt'
glove2word2vec(glove_input_file, word2vec_output_file)

In [None]:
from google.colab import drive
drive.mount('/content/drive')

# Introduction
This is the introduction to your report; edit this text/markdown section to compose it. In it, you should introduce:

*   Background of the problem
  * what type of problem it is: disease/readmission/mortality prediction, feature engineering, data processing, etc.
  * why solving the problem is important
  * what makes the problem difficult
  * the state-of-the-art methods and their effectiveness
*   Paper explanation
  * what the paper proposes
  * what the innovations of the method are
  * how well the proposed method works (in its own metrics)
  * what the paper contributes to the research field (referring to the Background above, how important the paper is to the problem)


In [None]:
# code comment is used as inline annotations for your coding

# Scope of Reproducibility:

List hypotheses from the paper you will test and the corresponding experiments you will run.


1.   Hypothesis 1: xxxxxxx
2.   Hypothesis 2: xxxxxxx

You can insert images in this notebook text, [see this link](https://stackoverflow.com/questions/50670920/how-to-insert-an-inline-image-in-google-colaboratory-from-google-drive) and example below:

![sample_image.png](https://drive.google.com/uc?export=view&id=1g2efvsRJDxTxKz-OY3loMhihrEUdBxbc)



You can also use code to display images, see the code below.

The images must be saved in Google Drive first.


In [None]:
# no code is required for this section
'''
if you want to use an image outside this notebook for explanation,
you can upload it to your Google Drive and show it with OpenCV or matplotlib
'''
# mount this notebook to your Google Drive
from google.colab import drive
drive.mount('/content/gdrive')

# define the path to the image
img_dir = '/content/gdrive/My Drive/Colab Notebooks/<path-to-your-image>'

import cv2
from google.colab.patches import cv2_imshow  # cv2.imshow is disabled in Colab

img = cv2.imread(img_dir)  # returns None if the path is wrong
cv2_imshow(img)


# Methodology

The methodology is the core of your project. It consists of runnable code with the necessary annotations to show the experiments you executed to test the hypotheses.

The methodology contains at least two subsections, **data** and **model**, for your experiment.

In [None]:
# import  packages you need
import numpy as np
import pandas as pd
from datetime import datetime
import matplotlib.pyplot as plt
import seaborn as sns
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import re
from gensim.scripts.glove2word2vec import glove2word2vec
from gensim.models.keyedvectors import KeyedVectors
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split
from google.colab import drive


## Data
Data includes raw data (MIMIC-III tables), descriptive statistics (our homework questions), and data processing (feature engineering).
  * Source of the data: where the data was collected from; if the data is synthetic or self-generated, explain how. If possible, please provide a link to the raw datasets.
  * Statistics: include basic descriptive statistics of the dataset such as size, cross-validation split, label distribution, etc.
  * Data processing: how you manipulate the data, e.g., changing the class labels, splitting the dataset into train/valid/test, refining the dataset.
  * Illustration: printing results, plotting figures for illustration.
  * You can upload your raw dataset to Google Drive and mount this Colab notebook to the same directory. If your raw dataset is too large, you can upload the processed dataset and write code to load it.
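
The statistics and splitting items above can be sketched without any dataset-specific code. The example below is a minimal, standard-library illustration under stated assumptions: the toy label list, the helper names `label_distribution` and `stratified_split`, and the 70/15/15 fractions are all hypothetical and not prescribed by the template.

```python
import random
from collections import Counter, defaultdict


def label_distribution(labels):
    """Return {label: fraction of rows} for a list of class labels."""
    counts = Counter(labels)
    return {label: count / len(labels) for label, count in counts.items()}


def stratified_split(labels, fractions=(0.7, 0.15, 0.15), seed=42):
    """Split row indices into train/valid/test, keeping class ratios similar."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    train, valid, test = [], [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        n_train = int(len(indices) * fractions[0])
        n_valid = int(len(indices) * fractions[1])
        train += indices[:n_train]
        valid += indices[n_train:n_train + n_valid]
        test += indices[n_train + n_valid:]  # remainder goes to the test set
    return train, valid, test


# Toy imbalanced labels: 80 negatives and 20 positives (made-up data)
toy_labels = [0] * 80 + [1] * 20
train_idx, valid_idx, test_idx = stratified_split(toy_labels)

print("label distribution:", label_distribution(toy_labels))
print("split sizes:", len(train_idx), len(valid_idx), len(test_idx))
```

Splitting per class matters for imbalanced labels such as mortality, where a purely random split can leave very few positive cases in the validation or test set.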

In [None]:
data_dir = '/content/drive/My Drive/Deep Learning Data/'

def load_csv(file_name):
    # Function to load CSV and lower case column names
    df = pd.read_csv(f'{data_dir}{file_name}', low_memory=False)
    df.columns = df.columns.str.lower()  # Convert columns to lowercase
    return df

def load_data():
    # Loading all the necessary datasets
    datasets = {
        # 'admissions': load_csv('admissions_filtered.csv'),
        # 'chartevents': load_csv('filtered_CHARTEVENTS.csv'),
        # 'labevents': load_csv('filtered_LABEVENTS.csv'),
        # 'patients': load_csv('patients_filtered.csv'),
        # 'icustays': load_csv('icustays_filtered.csv'),
        # 'procedures_icd': load_csv('procedures_icd_full.csv'),
        # 'diagnoses_icd': load_csv('diagnoses_icd_full.csv'),
        'merged_data': load_csv('merged_data.csv')  # Loading the merged dataset
    }
    return datasets

def calculate_stats(df):
    print("\nData Statistics:")
    print(f"Total rows: {df.shape[0]}")
    print(f"Total columns: {df.shape[1]}")
    print(f"Columns: {df.columns.tolist()}")
    try:
        print(df.describe())  # Summary statistics for the numeric columns
        df.info()  # Prints dtypes and non-null counts itself (it returns None, so don't wrap it in print)
    except Exception as e:
        print("Error in describing the data:", e)

# Load all datasets
datasets = load_data()

# Calculate statistics for all datasets to ensure consistency and understand the data
for name, dataset in datasets.items():
    print(f"{name.upper()} Data:")
    calculate_stats(dataset)

In [None]:
def missing_data_percentage(df):
    total = df.isnull().sum().sort_values(ascending=False)
    percent = (df.isnull().sum()/df.isnull().count()*100).sort_values(ascending=False)
    missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
    return missing_data

# Load the merged data
merged_data = datasets['merged_data']
print(missing_data_percentage(merged_data))

We have a few fields with missing values in the dataset we have constructed. We handle these cases with imputation so that the data is properly structured before model evaluation begins.
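
To make the effect of this kind of imputation concrete, here is a minimal sketch on two hypothetical records, using only the standard library. It assumes, as the notebook's code does, that a missing ICU in/out time means there was no ICU stay, so both times fall back to the admission time and the derived stay length becomes zero. The record values and the helper `icu_stay_days` are made up for illustration.

```python
from datetime import datetime


def icu_stay_days(record):
    """ICU stay length in days; missing ICU times fall back to the admission time."""
    admit = record["admittime"]
    icu_in = record["icu_intime"] or admit    # impute missing start with admittime
    icu_out = record["icu_outtime"] or admit  # impute missing end with admittime
    days = (icu_out - icu_in).total_seconds() / 86400
    return max(days, 0.0)  # clip negative durations to zero


with_icu = {
    "admittime": datetime(2101, 1, 1, 8, 0),
    "icu_intime": datetime(2101, 1, 1, 12, 0),
    "icu_outtime": datetime(2101, 1, 3, 12, 0),
}
without_icu = {
    "admittime": datetime(2101, 1, 1, 8, 0),
    "icu_intime": None,
    "icu_outtime": None,
}

print(icu_stay_days(with_icu))     # 2.0
print(icu_stay_days(without_icu))  # 0.0
```

Note that this choice encodes "no ICU stay" as a zero-length stay, which the model cannot distinguish from a genuinely very short stay unless a separate indicator feature is added.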

In [None]:
# Handling missing values more contextually.
# The result is assigned back to the column: calling fillna(..., inplace=True) on a
# selected column triggers a chained-assignment warning in recent pandas versions
# and does not update the DataFrame under copy-on-write.
merged_data['deathtime'] = merged_data['deathtime'].fillna('Not Applicable')  # Appropriate for non-existence of a death event

# Assuming ICU-related missing data means no ICU stay. Fill times with admittime for continuity
merged_data['icu_los'] = merged_data['icu_los'].fillna(0)  # Zero length for no ICU stay
merged_data['icu_outtime'] = merged_data['icu_outtime'].fillna(merged_data['admittime'])  # Assuming no ICU stay ends at admission time
merged_data['icu_intime'] = merged_data['icu_intime'].fillna(merged_data['admittime'])  # Assuming no ICU stay starts at admission time
merged_data['icustay_id'] = merged_data['icustay_id'].fillna(-1)  # Use -1 as a placeholder for 'No ICU stay'

# Diagnosis missing values are filled with 'Unknown'
merged_data['diagnosis'] = merged_data['diagnosis'].fillna('Unknown')

# Convert date columns to datetime and handle potential errors
date_columns = ['admittime', 'dischtime', 'icu_intime', 'icu_outtime']
for col in date_columns:
    merged_data[col] = pd.to_datetime(merged_data[col], errors='coerce')

# Calculating lengths only after date imputations to avoid negative or zero values unexpectedly
merged_data['hospital_stay_length'] = (merged_data['dischtime'] - merged_data['admittime']).dt.days.clip(lower=0)
merged_data['icu_stay_length'] = (merged_data['icu_outtime'] - merged_data['icu_intime']).dt.total_seconds() / 86400
merged_data['icu_stay_length'] = merged_data['icu_stay_length'].clip(lower=0)

# Calculate time from admission to ICU
merged_data['time_to_icu'] = (merged_data['icu_intime'] - merged_data['admittime']).dt.total_seconds() / 3600
merged_data['time_to_icu'] = merged_data['time_to_icu'].clip(lower=0)

# Check if any NaN values remain and print the updated stats
print(merged_data.isnull().sum())

# Check and correct gender inconsistencies
print("Unique gender values before:", merged_data['gender'].unique())
merged_data['gender_male'] = (merged_data['gender'] == 'M').astype(int)
print("Unique gender values after encoding:", merged_data['gender_male'].unique())

# Create features
merged_data['age_times_icu_length'] = merged_data['age_at_admission'] * merged_data['icu_stay_length']

# Check new statistics after cleaning and feature engineering
print(merged_data.describe())

Next, we preprocess the clinical notes that come from the NOTEEVENTS table of the MIMIC-III dataset.

In [None]:
# Download necessary NLTK resources
nltk.download('stopwords')
nltk.download('wordnet')

# Stopwords and lemmatizer initialization
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocess_text(text):
    # Lowercasing
    text = text.lower()
    # Remove punctuations and numbers
    text = re.sub(r'\d+', '', text)
    text = re.sub(r'[^\w\s]', '', text)
    # Tokenization
    tokens = text.split()
    # Removing stopwords and lemmatization
    tokens = [lemmatizer.lemmatize(word) for word in tokens if word not in stop_words]
    return " ".join(tokens)

# Apply text preprocessing on the 'all_notes' column
merged_data['all_notes'] = merged_data['all_notes'].fillna('not available').apply(preprocess_text)

# Example to check the preprocessing output
print(merged_data['all_notes'].head())
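
The cleaning steps above can be checked on a single sentence without downloading any NLTK resources. The following is a simplified sketch: it uses a tiny hypothetical stopword set (`TOY_STOP_WORDS`) instead of NLTK's English list and skips lemmatization, so its output will differ from the real `preprocess_text` on longer notes.

```python
import re

# Tiny illustrative stopword set; the notebook uses NLTK's full English list
TOY_STOP_WORDS = {"the", "a", "was", "to", "with", "of", "and"}


def clean_note(text, stop_words=TOY_STOP_WORDS):
    text = text.lower()                  # lowercase
    text = re.sub(r"\d+", "", text)      # drop digits
    text = re.sub(r"[^\w\s]", "", text)  # drop punctuation
    tokens = text.split()                # whitespace tokenization
    return " ".join(tok for tok in tokens if tok not in stop_words)


print(clean_note("The patient was admitted to the ICU with 2 days of fever."))
# patient admitted icu days fever
```

Removing digits also removes clinically meaningful numbers (doses, lab values, durations), which is worth keeping in mind when interpreting results from the note features.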


In [None]:
# Save the processed data to a CSV file
data_dir = '/content/drive/My Drive/Deep Learning Data/'  # Specify your data directory
merged_data.to_csv(data_dir + 'merged_data_processed.csv', index=False)

In [None]:
merged_data = pd.read_csv(data_dir + 'merged_data_processed.csv')

Next, I convert the text data from strings into a numerical format that a neural network can process. I use pretrained GloVe word embeddings, loaded through gensim's word2vec-format reader, in the following steps:

1. Load pretrained embeddings
2. Vectorize the text
3. Prepare text data for the model
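
The three steps can be illustrated with a toy example before running them on the real embeddings. The sketch below assumes that a note is represented by the average of its word vectors (mean pooling), with a zero vector when no word is known. The 3-dimensional `toy_embeddings` dictionary is made up and stands in for the pretrained vectors.

```python
# Step 1 (stand-in): a made-up 3-dimensional embedding table
toy_embeddings = {
    "fever": [1.0, 0.0, 2.0],
    "cough": [3.0, 2.0, 0.0],
}
EMBEDDING_DIM = 3


def mean_pool(note, embeddings, dim=EMBEDDING_DIM):
    """Step 2: average the vectors of the known words in a note."""
    vectors = [embeddings[word] for word in note.split() if word in embeddings]
    if not vectors:
        return [0.0] * dim  # no known words: fall back to a zero vector
    return [sum(column) / len(vectors) for column in zip(*vectors)]


# Step 3: one fixed-length feature row per note
notes = ["fever cough", "unknownword"]
features = [mean_pool(note, toy_embeddings) for note in notes]

print(features[0])  # [2.0, 1.0, 1.0]
print(features[1])  # [0.0, 0.0, 0.0]
```

Mean pooling gives every note the same feature length regardless of word count, at the cost of discarding word order.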

In [None]:
# Load the GloVe vectors directly from the original text file.
# With no_header=True (gensim >= 4.0), gensim reads the headerless GloVe format,
# so no prior conversion with glove2word2vec is needed here.
glove_input_file = data_dir + 'glove.6B.100d.txt'  # Update the file name if needed
embedding_model = KeyedVectors.load_word2vec_format(glove_input_file, binary=False, no_header=True)

# Function to vectorize a single note by averaging its word vectors
def vectorize_note(note, embedding_model):
    vectors = [embedding_model[word] for word in note.split() if word in embedding_model]
    if vectors:
        return np.mean(vectors, axis=0)
    else:
        # No known words: fall back to a zero vector of the embedding size
        return np.zeros(embedding_model.vector_size)

# Apply the vectorization to all notes and store in a list.
# Notes that became empty after preprocessing are read back from CSV as NaN, so fill them first.
note_vectors = [vectorize_note(note, embedding_model) for note in merged_data['all_notes'].fillna('')]

# Convert the list of vectors into a numpy array
note_vectors_array = np.array(note_vectors)

# Now, 'note_vectors_array' can be used as part of your input features

In [None]:
# Save the array of vectors to a binary file in NumPy `.npy` format
np.save(data_dir + 'note_vectors_array.npy', note_vectors_array)

# load this array directly without reprocessing the text later to save time
note_vectors_array = np.load(data_dir + 'note_vectors_array.npy')

Exploratory data analysis: before continuing, we look at the following:
* Distribution Checks: Analyze the distribution of key metrics like hospital_stay_length, icu_stay_length, time_to_icu, and age_at_admission.
* Correlation Analysis: Determine the relationships between the different features, particularly how various features like age, ICU stay, and hospital stay length correlate with the mortality label.
* Visualization: Utilize histograms, box plots, scatter plots, and heatmaps to visually explore the data and uncover patterns or anomalies.
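
As a reference for the correlation analysis, the sketch below computes the Pearson correlation coefficient by hand, which is the measure `DataFrame.corr()` uses by default. The toy ages and labels are made up purely to show the arithmetic; they are not drawn from the dataset.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up example: age against a binary label
toy_age = [30, 45, 60, 75, 90]
toy_label = [0, 0, 0, 1, 1]

print(round(pearson(toy_age, toy_label), 3))  # 0.866
```

With a binary label, this coefficient only captures a linear association, so a weak value in the heatmap does not rule out a nonlinear relationship with the outcome.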





In [None]:
# Set the aesthetic style of the plots
sns.set_style("whitegrid")

# Histograms
fig, axes = plt.subplots(1, 3, figsize=(18, 5))
sns.histplot(merged_data['hospital_stay_length'], bins=30, ax=axes[0], kde=True)
axes[0].set_title('Histogram of Hospital Stay Length')
sns.histplot(merged_data['icu_stay_length'], bins=30, ax=axes[1], kde=True)
axes[1].set_title('Histogram of ICU Stay Length')
sns.histplot(merged_data['age_at_admission'], bins=30, ax=axes[2], kde=True)
axes[2].set_title('Histogram of Age at Admission')

# Box plots
fig, axes = plt.subplots(1, 3, figsize=(18, 5))
sns.boxplot(x=merged_data['hospital_stay_length'], ax=axes[0])
axes[0].set_title('Box Plot of Hospital Stay Length')
sns.boxplot(x=merged_data['icu_stay_length'], ax=axes[1])
axes[1].set_title('Box Plot of ICU Stay Length')
sns.boxplot(x=merged_data['age_at_admission'], ax=axes[2])
axes[2].set_title('Box Plot of Age at Admission')

# Bar chart for gender
plt.figure(figsize=(6, 4))
sns.countplot(x='gender', data=merged_data)
plt.title('Gender Distribution')

# Correlation heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(merged_data[['hospital_stay_length', 'icu_stay_length', 'age_at_admission', 'mortality_label']].corr(), annot=True, fmt=".2f", cmap='coolwarm')
plt.title('Correlation Heatmap')

plt.show()

In [None]:
# Text Length Distribution
merged_data['note_length'] = merged_data['all_notes'].apply(lambda x: len(x.split()))
plt.figure(figsize=(10, 5))
sns.histplot(merged_data['note_length'], bins=50, kde=True)
plt.title('Distribution of Note Lengths')
plt.xlabel('Number of Words')
plt.ylabel('Frequency')

# Common word analysis (could be more complex with NLP libraries for more insights)
from collections import Counter
all_words = Counter(" ".join(merged_data['all_notes']).split())
most_common_words = all_words.most_common(20)
words, counts = zip(*most_common_words)
plt.figure(figsize=(10, 5))
sns.barplot(x=list(words), y=list(counts))
plt.title('Most Common Words in Clinical Notes')
plt.xticks(rotation=45)
plt.xlabel('Words')
plt.ylabel('Frequency')
plt.show()


## Model
The model section includes the model definition (usually a class), model training, and other necessary parts.
  * Model architecture: number, size, and type of layers, activation functions, etc.
  * Training objectives: loss function, optimizer, weight of each loss term, etc.
  * Others: whether the model is pretrained, Monte Carlo simulation for uncertainty analysis, etc.
  * The model code should include the model classes and functions for model training, model validation, etc.
  * If your model training is done outside of this notebook, please upload the trained model here and write a function to load and test it.
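
When documenting the training objective, it can help to spell out what the loss computes. The training code in this notebook uses `nn.BCEWithLogitsLoss`; the sketch below is a pure-Python illustration of the same quantity (binary cross-entropy applied to raw logits, averaged over the batch) in a numerically stable form. It is an illustration of the formula, not PyTorch's implementation, and the function names are hypothetical.

```python
import math


def bce_with_logits(logit, target):
    """Stable form of -[y*log(sigmoid(x)) + (1-y)*log(1-sigmoid(x))]."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


def mean_bce_with_logits(logits, targets):
    """Average the per-example losses over a batch."""
    losses = [bce_with_logits(x, y) for x, y in zip(logits, targets)]
    return sum(losses) / len(losses)


# A logit of 0 means a predicted probability of 0.5, so the loss is ln(2)
print(round(bce_with_logits(0.0, 1.0), 4))  # 0.6931

# Confident and correct -> small loss; confident and wrong -> large loss
print(round(bce_with_logits(2.0, 1.0), 4))  # 0.1269
print(round(bce_with_logits(2.0, 0.0), 4))  # 2.1269
```

Because the sigmoid is folded into the loss, the model's final layer should output raw logits, and a sigmoid is applied only when probabilities are needed at evaluation time.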

In [None]:
import torch
from torch import nn, optim
from torch.utils.data import DataLoader, TensorDataset
import numpy as np
from sklearn.model_selection import train_test_split

# Assuming 'merged_data' and 'note_vectors_array.npy' are loaded
note_vectors_array = np.load(data_dir + 'note_vectors_array.npy')

# Define feature columns explicitly based on the previous descriptions and outputs
feature_columns = [
    'age_at_admission', 'icu_los', 'hospital_stay_length', 'time_to_icu',
    'age_times_icu_length', 'gender_male'  # 'gender_male' added as it's created during preprocessing
]

# Prepare the numerical data
numerical_features = merged_data[feature_columns].values.astype(np.float32)

# Combine the numerical features and the note vectors
combined_features = np.hstack((numerical_features, note_vectors_array))

# Convert features and targets into torch tensors
features_tensor = torch.tensor(combined_features, dtype=torch.float32)
targets_tensor = torch.tensor(merged_data['mortality_label'].values, dtype=torch.float32).view(-1, 1)

# Split data into train and test sets
features_train, features_test, targets_train, targets_test = train_test_split(
    features_tensor, targets_tensor, test_size=0.2, random_state=42
)

# Dataset and DataLoader setup for both train and test sets
train_dataset = TensorDataset(features_train, targets_train)
test_dataset = TensorDataset(features_test, targets_test)

train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=64, shuffle=False)

# Define the model
class MyModel(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(MyModel, self).__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        # Ensure x is 3D with sequence length of 1 if not provided
        if x.dim() == 2:
            x = x.unsqueeze(1)  # Add a sequence length dimension
        lstm_out, _ = self.lstm(x)
        out = self.fc(lstm_out[:, -1, :])  # Use the output of the last LSTM unit
        return out

# Setting up the model, loss function, and optimizer
input_size = combined_features.shape[1]  # Make sure this matches your actual input feature size
model = MyModel(input_size, 64, 1)
loss_func = nn.BCEWithLogitsLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Training function for one epoch
def train_model_one_epoch(model, train_loader, loss_func, optimizer):
    model.train()
    total_loss = 0
    for inputs, targets in train_loader:
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = loss_func(outputs, targets)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    return total_loss / len(train_loader)

# Execute the training loop
num_epochs = 2
for epoch in range(num_epochs):
    train_loss = train_model_one_epoch(model, train_loader, loss_func, optimizer)
    print(f"Epoch {epoch+1}/{num_epochs} - Train Loss: {train_loss:.4f}")


# Results
In this section, you should finish training your model or load your trained model. That is a great experiment! Share the results with others, along with the necessary metrics and figures.

Please test and report results for all experiments that you run with:

*   specific numbers (accuracy, AUC, RMSE, etc)
*   figures (loss shrinkage, outputs from GAN, annotation or label of sample pictures, etc)
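
When reporting AUC, note that it is a ranking metric: it depends on the order of the predicted scores, so it is normally computed from probabilities rather than from thresholded 0/1 predictions. The sketch below is a small pure-Python illustration using the pairwise definition of AUC. The helper `roc_auc` and the toy labels and scores are hypothetical, and the quadratic loop is only suitable for small examples.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(positives) * len(negatives))


labels = [0, 0, 1, 1]
probs = [0.1, 0.3, 0.4, 0.8]          # made-up predicted probabilities
hard = [int(p > 0.5) for p in probs]  # thresholded predictions: [0, 0, 0, 1]

print(roc_auc(labels, probs))  # 1.0
print(roc_auc(labels, hard))   # 0.75
```

In this toy case the probabilities rank every positive above every negative, but thresholding at 0.5 throws that ordering away and lowers the reported AUC.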


In [None]:
# Function to evaluate the model
def evaluate_model(model, test_loader, threshold=0.5):
    model.eval()  # Set the model to evaluation mode
    probabilities, actuals = [], []
    with torch.no_grad():
        for inputs, targets in test_loader:
            outputs = model(inputs)
            probs = torch.sigmoid(outputs)  # Convert logits to probabilities
            probabilities.extend(probs.view(-1).cpu().tolist())
            actuals.extend(targets.view(-1).cpu().tolist())

    actuals = [int(a) for a in actuals]
    # Threshold the probabilities to get binary class predictions
    predictions = [int(p > threshold) for p in probabilities]

    # Calculate metrics (zero_division=0 avoids warnings when a class is never predicted)
    accuracy = accuracy_score(actuals, predictions)
    precision = precision_score(actuals, predictions, zero_division=0)
    recall = recall_score(actuals, predictions, zero_division=0)
    f1 = f1_score(actuals, predictions, zero_division=0)
    # AUC is a ranking metric, so compute it from the probabilities, not the thresholded labels
    auc = roc_auc_score(actuals, probabilities)

    return accuracy, precision, recall, f1, auc

# Run evaluation
accuracy, precision, recall, f1, auc = evaluate_model(model, test_loader)
print(f"Accuracy: {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1 Score: {f1:.2f}")
print(f"AUC: {auc:.2f}")

In [None]:
# Example per-epoch training losses, hard-coded here; replace them with the
# values returned by train_model_one_epoch in your own run
train_losses = [0.3011, 0.2474]

plt.figure(figsize=(10, 5))
plt.plot(train_losses, label='Train Loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.title('Training Loss Over Epochs')
plt.legend()
plt.show()

## Model comparison

In [None]:
# compare your model with others
# you don't need to re-run all other experiments; instead, you can directly cite the metrics/numbers in the paper

# Discussion

In this section, you should discuss your work and make plans for the future. The discussion should address the following questions:
  * Assess whether the paper is reproducible or not.
  * Explain why it is not reproducible if your results are negative.
  * Describe “What was easy” and “What was difficult” during the reproduction.
  * Make suggestions to the authors or other reproducers on how to improve the reproducibility.
  * Describe what you will do in the next phase.



In [None]:
# no code is required for this section
'''
if you want to use an image outside this notebook for explanation,
you can read and plot it here, as in the Scope of Reproducibility section
'''

# References

1.   Sun, J, [paper title], [journal title], [year], [volume]:[issue], doi: [doi link to paper]



# Feel free to add new sections