# <center> CS 184A Final Project </center>
## **Task**: RSNA MICCAI Brain Tumor Radiogenomic Classification Kaggle Competition [(link)](https://www.kaggle.com/c/rsna-miccai-brain-tumor-radiogenomic-classification/overview)

## **Team**:
#### Pratyush Muthukumar, 66495041, muthukup@uci.edu
#### Gabriella Vass, 17792084, gvass@uci.edu
#### Ramya Sai Swathi Mangu, 50021174, rmangu@uci.edu


## Introduction
The goal of our project is binary classification: given brain tumor MRI imagery of glioblastoma patients, predict whether the MGMT promoter gene sequence is present or absent. The imagery is provided as 3D DICOM multi-parametric MRI (mpMRI) scans. Each patient's data comprises 4 MRI sequences:
- Fluid Attenuated Inversion Recovery (FLAIR)
- T1-weighted pre-contrast (T1w)
- T1-weighted post-contrast (T1Gd)
- T2-weighted (T2)

For this Kaggle competition, the goal is to provide, for each patient, the predicted probability that the MGMT promoter gene sequence is present, given the MRI imagery; binary labels are supplied for training. The competition evaluates submissions using the area under the ROC curve (AUC), which ranges from 0 to 1, where 1 is the optimal score.
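
To make the metric concrete, here is a minimal sketch of AUC computed as the fraction of (positive, negative) pairs in which the positive sample receives the higher score, with ties counted as half. The helper name `pairwise_auc` and the toy labels and scores are illustrative assumptions; the models in this notebook report AUC through the built-in Keras metric instead.

```python
def pairwise_auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(pos) * len(neg))


toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
print(pairwise_auc(toy_labels, toy_scores))  # 0.75
```

A score of 0.5 corresponds to random ranking, which is why it is the natural reference point for the results later in the notebook.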

In this notebook, we outline our data preprocessing/augmentation pipeline, the implementation of a baseline 2D CNN classifier model, the implementation of our custom fully-convolutional U-Net model, the results of both models, and our final submission. This notebook provides the raw source code for all results and figures in our final report and video.

We ran this notebook on Kaggle using the hosted competition dataset and the GPU accelerator. If you are planning on running this notebook, we suggest you do the same.

---

## Data Preprocessing/Augmentation

In [None]:
# Imports
import os
import json
import glob
import random
import collections

import numpy as np
import pandas as pd
import pydicom
from pydicom.pixel_data_handlers.util import apply_voi_lut
import cv2
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm.notebook import tqdm

import tensorflow as tf
from tensorflow import keras
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from tensorflow.keras.utils import to_categorical
from tensorflow.keras import layers

In [None]:
# MRI sequence types
TYPES = ["FLAIR", "T1w", "T2w", "T1wCE"]
# Augmentation mask pixel threshold parameter
WHITE_THRESHOLD = 10 # out of 255
# The Kaggle competition encourages us to exclude these samples because of inconsistencies
EXCLUDE = [109, 123, 709]

train_df = pd.read_csv("../input/rsna-miccai-brain-tumor-radiogenomic-classification/train_labels.csv")
train_df = train_df[~train_df.BraTS21ID.isin(EXCLUDE)]
test_df = pd.read_csv('../input/rsna-miccai-brain-tumor-radiogenomic-classification/sample_submission.csv')

In [None]:
# Data loading using the PyDicom Python package
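def load_dicom(path, size=224):
    '''
    Reads a DICOM image, scales the pixel values to [0, 1] by dividing by the maximum,
    converts them to 8-bit values in [0, 255], and resizes the image to size x size
    '''
    # dcmread is pydicom's current reader; read_file is its deprecated alias
    dicom = pydicom.dcmread(path)
    data = dicom.pixel_array
    # Skip the division for all-zero slices to avoid dividing by zero
    if np.max(data) != 0:
        data = data / np.max(data)
    data = (data * 255).astype(np.uint8)
    return cv2.resize(data, (size, size))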
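
The intensity scaling in `load_dicom` can be checked on a synthetic array without reading a DICOM file. The sketch below repeats only the divide-by-maximum and 8-bit conversion steps; `scale_to_uint8` and the toy array are hypothetical names used for illustration, and the resize step is left out.

```python
import numpy as np


def scale_to_uint8(data):
    """Divide by the maximum (if nonzero), then map to 8-bit values in [0, 255]."""
    data = np.asarray(data)
    if np.max(data) != 0:
        data = data / np.max(data)
    # astype truncates toward zero, so 63.75 becomes 63
    return (data * 255).astype(np.uint8)


toy_slice = np.array([[0, 500], [1000, 2000]])
scaled = scale_to_uint8(toy_slice)
print(scaled.dtype, scaled.min(), scaled.max())
```

Because each slice is divided by its own maximum, intensities are comparable within a slice but not necessarily across slices.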

In [None]:
def get_all_image_paths(brats21id, image_type, folder='train'): 
    '''
    Returns an array of image paths of a particular type for a particular patient ID,
    keeping every third slice from the middle half of the series
    '''
    assert(image_type in TYPES)
    
    patient_path = os.path.join(
        "../input/rsna-miccai-brain-tumor-radiogenomic-classification/%s/" % folder, 
        str(brats21id).zfill(5),
    )

    paths = sorted(
        glob.glob(os.path.join(patient_path, image_type, "*")), 
        key=lambda x: int(x[:-4].split("-")[-1]),
    )
    
    num_images = len(paths)
    
    start = int(num_images * 0.25)
    end = int(num_images * 0.75)

    interval = 3
    
    if num_images < 10: 
        interval = 1
    
    return np.array(paths[start:end:interval])
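
The slice selection above keeps the middle half of each series and then every third slice (every slice when the series has fewer than 10 images). A minimal sketch with integer indices standing in for file paths; `select_slice_indices` is a hypothetical helper, not part of the pipeline:

```python
def select_slice_indices(num_images):
    """Indices kept by the middle-half, every-third-slice rule."""
    start = int(num_images * 0.25)
    end = int(num_images * 0.75)
    # Very short series keep every slice in the middle half
    interval = 1 if num_images < 10 else 3
    return list(range(num_images))[start:end:interval]


print(select_slice_indices(40))  # [10, 13, 16, 19, 22, 25, 28]
print(select_slice_indices(8))   # [2, 3, 4, 5]
```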

In [None]:
def get_all_images(brats21id, image_type, folder='train', size=224):
    return [load_dicom(path, size) for path in get_all_image_paths(brats21id, image_type, folder)]

# Side length (in pixels) that every slice is resized to
IMAGE_SIZE = 128

# Creating training data
def get_all_data_for_train(image_type):
    global train_df
    
    X = []
    y = []
    train_ids = []

    for i in tqdm(train_df.index):
        x = train_df.loc[i]
        images = get_all_images(int(x['BraTS21ID']), image_type, 'train', IMAGE_SIZE)
        label = x['MGMT_value']

        X += images
        y += [label] * len(images)
        train_ids += [int(x['BraTS21ID'])] * len(images)
        assert(len(X) == len(y))
    return np.array(X), np.array(y), np.array(train_ids)

# Creating testing data
def get_all_data_for_test(image_type):
    global test_df
    
    X = []
    test_ids = []

    for i in tqdm(test_df.index):
        x = test_df.loc[i]
        images = get_all_images(int(x['BraTS21ID']), image_type, 'test', IMAGE_SIZE)
        X += images
        test_ids += [int(x['BraTS21ID'])] * len(images)

    return np.array(X), np.array(test_ids)
X_t1p, y_t1p, trainidt_t1p = get_all_data_for_train('T1wCE')
X_test_t1p, testidt_t1p = get_all_data_for_test('T1wCE')

X_flair, y_flair, trainidt_flair = get_all_data_for_train('FLAIR')
X_test_flair, testidt_flair = get_all_data_for_test('FLAIR')

X_t1, y_t1, trainidt_t1 = get_all_data_for_train('T1w')
X_test_t1, testidt_t1 = get_all_data_for_test('T1w')

X_t2, y_t2, trainidt_t2 = get_all_data_for_train('T2w')
X_test_t2, testidt_t2 = get_all_data_for_test('T2w')

For each MRI sequence, there are only 582 samples for training and 87 samples for testing, so we use data augmentation to generate more varied training examples; with so few samples, our models would otherwise struggle to learn.

In [None]:
# Keras Data Augmentation Layer
# Perform random flips and random rotations on data to generate more varied training examples
global_average_layer = tf.keras.layers.GlobalAveragePooling2D()
data_augmentation = tf.keras.Sequential([
  tf.keras.layers.experimental.preprocessing.RandomFlip('horizontal'),
  tf.keras.layers.experimental.preprocessing.RandomRotation(0.2),
])
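
For intuition, a horizontal flip mirrors each slice left to right. The sketch below uses NumPy as a stand-in for the Keras `RandomFlip` layer; it does not reproduce `RandomRotation`, and the helper `random_horizontal_flip` and the toy batch are assumptions made for illustration.

```python
import numpy as np


def random_horizontal_flip(batch, rng, p=0.5):
    """Mirror each image in an (N, H, W) batch left to right with probability p."""
    batch = np.array(batch, copy=True)          # leave the caller's array untouched
    flip_mask = rng.random(len(batch)) < p      # one coin toss per image
    batch[flip_mask] = batch[flip_mask][:, :, ::-1]
    return batch, flip_mask


rng = np.random.default_rng(0)
toy_batch = np.arange(2 * 2 * 3).reshape(2, 2, 3)
augmented, flipped = random_horizontal_flip(toy_batch, rng)
print(flipped)
print(augmented)
```

The label of a slice does not change under a flip, which is what makes this kind of augmentation safe for classification.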

In [None]:
# Split into training and valid
X_train, X_valid, y_train, y_valid, trainidt_train, trainidt_valid = train_test_split(X_t1p, y_t1p, trainidt_t1p, test_size=0.2, random_state=40)

X_train = tf.expand_dims(X_train, axis=-1)
X_valid = tf.expand_dims(X_valid, axis=-1)

y_train = to_categorical(y_train)
y_valid = to_categorical(y_valid)
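
Because `train_test_split` shuffles individual slices, slices from one patient can land in both the training and validation sets, which may make validation scores optimistic. One possible alternative, sketched here with NumPy only, is to split on unique patient IDs and then select slices by ID. The helper `split_by_patient` and the toy IDs are hypothetical; this is not the split used for the results in this notebook.

```python
import numpy as np


def split_by_patient(ids, valid_fraction=0.2, seed=40):
    """Return boolean masks (train, valid) so that no patient ID appears in both."""
    ids = np.asarray(ids)
    unique_ids = np.unique(ids)
    rng = np.random.default_rng(seed)
    rng.shuffle(unique_ids)
    # Hold out at least one patient for validation
    n_valid = max(1, int(round(len(unique_ids) * valid_fraction)))
    valid_mask = np.isin(ids, unique_ids[:n_valid])
    return ~valid_mask, valid_mask


# One entry per slice, as in the ID arrays returned alongside the images
toy_ids = np.array([0, 0, 0, 2, 2, 3, 3, 3, 5, 5, 6, 8, 8, 9, 9])
train_mask, valid_mask = split_by_patient(toy_ids)
print(sorted(set(toy_ids[train_mask].tolist())))
print(sorted(set(toy_ids[valid_mask].tolist())))
```

The masks could then index the image and label arrays directly, for example `X_t1p[train_mask]` and `y_t1p[train_mask]`.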

In [None]:
print("X_train", X_train.shape)
print("y_train", y_train.shape)

print("X_valid", X_valid.shape)
print("y_valid", y_valid.shape)

After preprocessing the DICOM data into individual 2D slices, we have 12956 training samples and 3240 validation samples of 128x128 images for the T1wCE sequence split above. The labels are one-hot encoded: each slice carries a 2-element vector representing its patient's binary MGMT label.
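
The label format produced by `to_categorical` can be illustrated with NumPy. This is a sketch of the encoding, not the Keras implementation, and the toy labels are made up:

```python
import numpy as np

toy_labels = np.array([0, 1, 1, 0])
# Row i of the identity matrix is the one-hot vector for class i
one_hot = np.eye(2, dtype=np.float32)[toy_labels]
print(one_hot)

# argmax along the class axis recovers the original integer labels
recovered = one_hot.argmax(axis=1)
print(recovered)
```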

In [None]:
# Visualization of processed dataset
fig, (ax1, ax2, ax3, ax4) = plt.subplots(1, 4)
fig.suptitle('No MGMT Promoter (y=0)')
fig.set_size_inches(18, 4)
ax1.imshow(X_t1p[-1,:,:], cmap="gray")
ax1.set_title("T1 Post-contrast")
ax2.imshow(X_t1[-1,:,:], cmap="gray")
ax2.set_title("T1 Pre-contrast")
ax3.imshow(X_t2[-1,:,:], cmap="gray")
ax3.set_title("T2")
ax4.imshow(X_flair[-1,:,:], cmap="gray")
ax4.set_title("FLAIR")
plt.show()

In [None]:
fig, (ax1, ax2, ax3, ax4) = plt.subplots(1, 4)
fig.suptitle('MGMT Promoter (y=1)')
fig.set_size_inches(18, 4)
ax1.imshow(X_t1p[0,:,:], cmap="gray")
ax1.set_title("T1 Post-contrast")
ax2.imshow(X_t1[0,:,:], cmap="gray")
ax2.set_title("T1 Pre-contrast")
ax3.imshow(X_t2[0,:,:], cmap="gray")
ax3.set_title("T2")
ax4.imshow(X_flair[0,:,:], cmap="gray")
ax4.set_title("FLAIR")
plt.show()

## 2D CNN Baseline

Here we use the [DenseNet121](https://arxiv.org/abs/1608.06993) architecture, a densely connected CNN, as a baseline model. Although pretrained DenseNet121 weights are available in Keras, the code below passes `weights=None`, so the network starts from random initialization and is trained as a standard CNN classifier using TensorFlow and Keras.

In [None]:
from tensorflow.keras import Input, Model
from tensorflow.keras.layers import Conv2D, GlobalAveragePooling2D, Dense
from tensorflow.keras.applications import DenseNet121

input_tensor = keras.Input(shape=(128, 128, 1))
# DenseNet121 backbone, randomly initialized (weights=None), expecting 3-channel input
densenet = DenseNet121(weights=None, include_top=False, input_shape=(128, 128, 3))
# Map the single-channel MRI slice to the 3 channels the backbone expects
mapping3feat = Conv2D(3, (3, 3), padding='same', use_bias=False)(input_tensor)

output = densenet(mapping3feat)
output = GlobalAveragePooling2D()(output)
# Softmax over the 2 one-hot classes, matching the categorical cross-entropy loss
output = Dense(2, activation='softmax')(output)

print(output.shape)

tf.keras.backend.clear_session()
model = Model(input_tensor, output)
model.summary()

In [None]:
# Cross entropy loss
# Also show the training/validation AUC throughout training
model.compile(loss='categorical_crossentropy',
             optimizer=tf.keras.optimizers.SGD(learning_rate =0.0001),
             metrics=[tf.keras.metrics.AUC()])

history = model.fit(x=X_train, y = y_train, epochs=25, validation_data= (X_valid, y_valid))

In [None]:
# Plotting training loss
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'validation'], loc='upper left')
plt.show()

In [None]:
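# Plotting AUC score
# Keras may name the metric 'auc', 'auc_1', ... depending on session state, so look the key up
auc_key = next(k for k in history.history if k.startswith('auc'))
plt.plot(history.history[auc_key])
plt.plot(history.history['val_' + auc_key])
plt.title('Model AUC')
plt.ylabel('AUC')
plt.xlabel('epoch')
plt.legend(['train', 'validation'], loc='upper left')
plt.show()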

In [None]:
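# Create submission file
y_pred = model.predict(X_test_t1p)
# Hard class (0 or 1) predicted for each slice
pred = np.argmax(y_pred, axis=1)
result = pd.DataFrame(testidt_t1p)
result[1] = pred
result.columns = ['BraTS21ID', 'MGMT_value']
# Patient-level score: fraction of that patient's slices predicted as class 1
result2 = result.groupby('BraTS21ID', as_index=False).mean()
sample = pd.read_csv('../input/rsna-miccai-brain-tumor-radiogenomic-classification/sample_submission.csv')
# groupby sorts by ID; this assumes sample_submission.csv lists the IDs in the same order
result2['BraTS21ID'] = sample['BraTS21ID']
result2.to_csv('cnn_submission.csv', index=False)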
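
The per-patient aggregation can be sketched without pandas: each patient's score is the mean of the 0/1 predictions over that patient's slices. The function name and the toy values below are illustrative assumptions.

```python
from collections import defaultdict


def mean_prediction_per_patient(patient_ids, slice_preds):
    """Average slice-level predictions for each patient, sorted by patient ID."""
    grouped = defaultdict(list)
    for pid, pred in zip(patient_ids, slice_preds):
        grouped[pid].append(pred)
    return {pid: sum(preds) / len(preds) for pid, preds in sorted(grouped.items())}


toy_ids = [1, 1, 1, 1, 13, 13]
toy_preds = [1, 0, 1, 1, 0, 0]
print(mean_prediction_per_patient(toy_ids, toy_preds))  # {1: 0.75, 13: 0.0}
```

Averaging hard 0/1 votes yields a coarse score; averaging the class-1 probabilities instead would be another option, though it is not what the cell above does.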

In [None]:
result2.head()

## U-Net Implementation

Here, we implement a fully-convolutional U-Net architecture with 4 contracting stages, 4 expanding stages, and 1 bottleneck layer. Our modification replaces the usual concatenation-based skip connections with residual connections, as proposed by the [ResNet](https://arxiv.org/abs/1512.03385) architecture: the feature map from the contracting arm is added element-wise to the matching feature map in the expanding arm, and the sum passes through a convolution before the strided transposed convolution that upsamples it.
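
The difference between the two kinds of skip connection shows up in the channel dimension: concatenation stacks the channels of both feature maps, while element-wise addition keeps the channel count unchanged and therefore requires matching shapes. A NumPy sketch with made-up feature maps of size 16x16x128, the size at the deepest skip connection in the code that follows:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical feature maps in (batch, height, width, channels) layout
encoder_features = rng.standard_normal((1, 16, 16, 128))
decoder_features = rng.standard_normal((1, 16, 16, 128))

# Standard U-Net skip: channels are stacked
concat_skip = np.concatenate([encoder_features, decoder_features], axis=-1)
# Residual skip: channels are summed, shape is preserved
residual_skip = encoder_features + decoder_features

print(concat_skip.shape)    # (1, 16, 16, 256)
print(residual_skip.shape)  # (1, 16, 16, 128)
```

With addition, the convolution after each skip connection sees half as many input channels as it would with concatenation.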

In [None]:
# Fully-convolutional U-net with 4 contracting, 4 expanding arms and residual skip connections
from tensorflow.keras.applications import *
import os, numpy as np, pandas as pd, matplotlib.pyplot as plt
from tensorflow import losses, optimizers
from tensorflow.keras import Input, Model, models, layers, callbacks, utils, metrics

# --- Define kwargs dictionary
kwargs = {
    'kernel_size': (3, 3),
    'padding': 'same'}

# --- Define lambda functions
conv = lambda x, filters, strides : layers.Conv2D(filters=filters, strides=strides, **kwargs)(x)
norm = lambda x : layers.BatchNormalization()(x)
relu = lambda x : layers.ReLU()(x)
tran = lambda x, filters, strides : layers.Conv2DTranspose(filters=filters, strides=strides, **kwargs)(x)

# --- Define stride-1, stride-2 blocks
conv1 = lambda filters, x : relu(norm(conv(x, filters, strides=1)))
conv2 = lambda filters, x : relu(norm(conv(x, filters, strides=2)))
tran2 = lambda filters, x : relu(norm(tran(x, filters, strides=2)))

In [None]:
# --- Define contracting layers
l1 = conv1(32, input_tensor)           # 128x128x32
l2 = conv1(48, norm(conv2(48, l1)))    # 64x64x48
l3 = conv1(64, norm(conv2(64, l2)))    # 32x32x64
l4 = conv1(128, norm(conv2(128, l3)))  # 16x16x128
# --- Bottleneck layer
l5 = conv1(256, norm(conv2(256, l4)))  # 8x8x256
# --- Define expanding layers
l6 = tran2(128, l5)                        # 16x16x128
l7 = tran2(64, norm(conv1(128, l4 + l6)))  # 32x32x64
l8 = tran2(48, norm(conv1(64, l3 + l7)))   # 64x64x48
l9 = tran2(32, norm(conv1(48, l2 + l8)))   # 128x128x32

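# --- Define classification head (MGMT promoter present vs. absent)
h0 = layers.Dropout(0.3)(l9)
h1 = layers.Flatten()(h0)
h2 = layers.Dense(128, activation='relu')(h1)
# Softmax over the 2 one-hot classes, matching the categorical cross-entropy loss
output = layers.Dense(2, activation='softmax')(h2)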

# --- Create model
tf.keras.backend.clear_session()
model = Model(inputs=input_tensor, outputs=output)

model.summary()
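
The spatial sizes noted in the layer comments follow from `padding='same'`: a stride-2 convolution maps a side length n to ceil(n / 2), and a stride-2 transposed convolution maps n to 2n. A small sketch that traces the side length through the four downsampling and four upsampling steps (the helper name is an assumption for illustration):

```python
import math


def trace_side_lengths(size, depth=4):
    """Side length after each stride-2 step down and back up, with 'same' padding."""
    sizes = [size]
    for _ in range(depth):
        size = math.ceil(size / 2)   # stride-2 convolution
        sizes.append(size)
    for _ in range(depth):
        size = size * 2              # stride-2 transposed convolution
        sizes.append(size)
    return sizes


print(trace_side_lengths(128))  # [128, 64, 32, 16, 8, 16, 32, 64, 128]
```

Since 128 is divisible by 16, every expanding feature map matches the size of its contracting counterpart, which is what allows the element-wise additions.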

As we can see from the model summary, our custom U-Net model is more complex than the baseline 2D CNN model, with nearly 10x the number of trainable parameters.
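
A large share of the U-Net's parameters comes from the classification head: flattening the 128x128x32 output of `l9` and feeding it to `Dense(128)` alone requires tens of millions of weights. A quick arithmetic check, using the standard count of inputs x units + units for a dense layer (the helper name is an assumption for illustration):

```python
def dense_params(inputs, units):
    """Weights plus biases of a fully connected layer."""
    return inputs * units + units


flattened = 128 * 128 * 32           # size of the flattened l9 feature map
print(flattened)                     # 524288
print(dense_params(flattened, 128))  # 67108992
print(dense_params(128, 2))          # 258
```

Replacing `Flatten` with global average pooling would shrink this head considerably, at the cost of discarding spatial layout; that is a design trade-off rather than something evaluated here.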

In [None]:
# --- Compile model
model.compile(loss='categorical_crossentropy',
             optimizer=tf.keras.optimizers.SGD(learning_rate=0.0001),
             metrics=[tf.keras.metrics.AUC()])
# --- Train model
history = model.fit(x=X_train, y = y_train, epochs=25, validation_data= (X_valid, y_valid))

In [None]:
# Plotting training loss
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'validation'], loc='upper left')
plt.show()

In [None]:
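# Plotting AUC score
# Keras may name the metric 'auc', 'auc_1', ... depending on session state, so look the key up
auc_key = next(k for k in history.history if k.startswith('auc'))
plt.plot(history.history[auc_key])
plt.plot(history.history['val_' + auc_key])
plt.title('Model AUC')
plt.ylabel('AUC')
plt.xlabel('epoch')
plt.legend(['train', 'validation'], loc='upper left')
plt.show()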