# RSNA-MICCAI Brain Tumor Radiogenomic Classification
Task: Detect the presence of MGMT promoter methylation from the MRI images

### Domain Knowledge:
* **Glioblastoma** is the most common malignant primary brain tumor in adults and one of the deadliest human cancers.
* **MGMT promoter methylation** is the key mechanism of MGMT gene silencing and predicts a favorable outcome in patients with glioblastoma who are treated with alkylating-agent chemotherapy.
* **MGMT gene** - encodes $O^6$-methylguanine-DNA methyltransferase, a DNA repair enzyme that protects both normal and glioma cells from alkylating chemotherapeutic agents.
* **Methylation** - when the promoter is methylated, the gene is repressed (not expressed).

Thus, the goal of this competition is not to detect tumors (all the images in the train and test sets contain glioblastomas) but to predict whether chemotherapy is likely to be an effective treatment, by predicting whether the tumor's MGMT promoter is methylated (MGMT_value=1) or not (MGMT_value=0). If successful, such predictions could help doctors make better clinical decisions on the type of treatment to give their patients with glioblastoma.

### Published Papers:
* ([levner2009](https://pubmed.ncbi.nlm.nih.gov/20426152/)) - The MGMT methylation status of glioblastomas can be predicted from MRI. A diffuse tumor texture could be attributed to MGMT methylation.
    * Thus, we can try to first localize the tumor in the MRI, then extract its texture features, and finally classify whether it is diffuse or not.
* ([crisi2020](https://pubmed.ncbi.nlm.nih.gov/32374045/)) - The MGMT methylation status of glioblastomas can be predicted from perfusion dynamic susceptibility contrast magnetic resonance imaging (DSC-MRI). 92 quantitative image features were obtained from relative cerebral blood volume and relative cerebral blood flow maps.
    * Can we extract these features from conventional MRI alone?
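
As a rough illustration of the "extract texture features" idea, here is a minimal NumPy-only sketch of a gray-level co-occurrence matrix (GLCM), a common texture descriptor. It assumes the tumor region has already been cropped into a 2-D grayscale array at least two pixels wide. The function names, the 8-level quantization, and the horizontal-neighbor offset are illustrative choices and are not the method used in the cited papers.

```python
import numpy as np

def glcm_horizontal(patch, levels=8):
    """Normalized co-occurrence matrix of horizontally adjacent gray levels."""
    patch = np.asarray(patch, dtype=float)
    span = patch.max() - patch.min()
    if span == 0:
        # A constant patch maps entirely to gray level 0
        quantized = np.zeros(patch.shape, dtype=int)
    else:
        scaled = (patch - patch.min()) / span * (levels - 1)
        quantized = np.floor(scaled + 0.5).astype(int)
    left = quantized[:, :-1].ravel()
    right = quantized[:, 1:].ravel()
    glcm = np.zeros((levels, levels), dtype=float)
    np.add.at(glcm, (left, right), 1.0)  # count each (left, right) pair
    return glcm / glcm.sum()

def glcm_contrast(glcm):
    """High when neighboring pixels differ strongly in gray level."""
    i, j = np.indices(glcm.shape)
    return float(np.sum(glcm * (i - j) ** 2))

def glcm_homogeneity(glcm):
    """Close to 1 when neighboring pixels have similar gray levels."""
    i, j = np.indices(glcm.shape)
    return float(np.sum(glcm / (1.0 + np.abs(i - j))))

# Two synthetic patches: a perfectly smooth one and a checkerboard
flat = np.full((8, 8), 0.5)
checker = np.indices((8, 8)).sum(axis=0) % 2

flat_contrast = glcm_contrast(glcm_horizontal(flat))
checker_contrast = glcm_contrast(glcm_horizontal(checker))
checker_homogeneity = glcm_homogeneity(glcm_horizontal(checker))
print(flat_contrast, checker_contrast, checker_homogeneity)
```

Features like these could then be fed to any standard classifier. Whether they carry a signal for MGMT status in this dataset is an open question.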

In [None]:
import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
import seaborn as sns
import cv2
import pydicom
from tqdm import tqdm
from multiprocessing import Pool
from sklearn.preprocessing import OrdinalEncoder

In [None]:
sample_submission = pd.read_csv('../input/rsna-miccai-brain-tumor-radiogenomic-classification/sample_submission.csv')
train_labels = pd.read_csv('../input/rsna-miccai-brain-tumor-radiogenomic-classification/train_labels.csv')

In [None]:
train_labels

The train set seems balanced in terms of the target labels (MGMT_value).

In [None]:
with plt.xkcd():
    sns.countplot(x=train_labels.MGMT_value)
    
train_labels.MGMT_value.value_counts()    

The sample_submission contains only 87 data points, which is 22% of the total test set. This implies that the total test set contains only ~395 data points. As some competitors explain in this [discussion thread](https://www.kaggle.com/c/rsna-miccai-brain-tumor-radiogenomic-classification/discussion/268066), one reason this competition is not crowded is that the test set is so small that the final standings could be a lottery.
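
The estimate above is simple arithmetic, shown here as a small sketch. The variable names are illustrative, and the 22% share is taken from the text rather than computed from any file.

```python
public_cases = 87        # rows in sample_submission.csv, as stated above
public_fraction = 0.22   # stated share of the full test set

estimated_total = public_cases / public_fraction
estimated_hidden = estimated_total - public_cases

print(round(estimated_total), round(estimated_hidden))  # 395 308
```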

In [None]:
sample_submission

Let's explore the folder structure of the new dataset, in which the original .dcm files have been converted to .png files. The rsna-miccai-png folder contains 2 main subfolders: `test` and `train`. The test folder contains 87 subfolders, one per BraTS21ID (the case ID, written as a 5-digit number, e.g. `00114`). Each case ID folder contains 4 subfolders, `T2w`, `T1wCE`, `T1w`, and `FLAIR`, which correspond to the MRI sequence type (image contrast). The tricky part is that these 4 subfolders contain a varying number of images (roughly 19 to 300 per folder).

### Folder: rsna-miccai-png

    |--test
    |   |
    |   |--00114
    |   |   |
    |   |   |--T2w
    |   |   |   |
    |   |   |   |--Image-4.png
    |   |   |   |--Image-19.png
    |   |   |   |...
    |   |   |
    |   |   |--T1wCE
    |   |   |   |
    |   |   |   |--Image-79.png
    |   |   |   |--Image-64.png
    |   |   |   |...
    |   |   |
    |   |   |--T1w
    |   |   |   |
    |   |   |   |--Image-4.png
    |   |   |   |--Image-5.png
    |   |   |   |...
    |   |   |
    |   |   |--FLAIR
    |   |       |
    |   |       |--Image-4.png
    |   |       |--Image-19.png
    |   |       |...
    |   |       
    |   |--00013
    |   | ...
    |   | ...
    |   | ...
    |    
    |--train
        |
        |...
        |...
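
Given this layout, a small helper can list the slices of one series in slice order. Note that a plain alphabetical sort would put `Image-100.png` before `Image-19.png`. This is a minimal sketch: the helper names `slice_number` and `list_slices` are hypothetical, and the demo runs on a throwaway directory that only mimics the structure above.

```python
import os
import re
import tempfile

def slice_number(filename):
    """Return the integer in a name such as 'Image-19.png' (-1 if absent)."""
    match = re.search(r'(\d+)', filename)
    return int(match.group(1)) if match else -1

def list_slices(root, case_id, mri_type):
    """Return the PNG file names of one series, ordered by slice number."""
    series_dir = os.path.join(root, case_id, mri_type)
    names = [name for name in os.listdir(series_dir) if name.endswith('.png')]
    return sorted(names, key=slice_number)

# Demo on a temporary directory laid out like <root>/<caseID>/<MRI type>/
with tempfile.TemporaryDirectory() as root:
    series_dir = os.path.join(root, '00114', 'T2w')
    os.makedirs(series_dir)
    for n in (19, 4, 100):
        open(os.path.join(series_dir, f'Image-{n}.png'), 'w').close()
    ordered = list_slices(root, '00114', 'T2w')

print(ordered)  # ['Image-4.png', 'Image-19.png', 'Image-100.png']
```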
    

### Count of images per subfolder of the test set

Let's create a new dataframe `test_df` that has the count of images per subfolder.

In [None]:
TEST_DIR = '../input/rsna-miccai-png/test/'
MRI_TYPES = ['T2w', 'T1wCE', 'T1w', 'FLAIR']

# Build one record per case so the count columns always stay aligned with
# caseID, even if a subfolder is missing or empty (its count is then 0).
records = []
for case_id in sorted(os.listdir(TEST_DIR)):
    case_dir = os.path.join(TEST_DIR, case_id)
    if not os.path.isdir(case_dir):
        continue

    record = {'caseID': case_id}
    for mri_type in MRI_TYPES:
        type_dir = os.path.join(case_dir, mri_type)
        n_images = len(os.listdir(type_dir)) if os.path.isdir(type_dir) else 0
        record[f'{mri_type}_count'] = n_images
    records.append(record)

# Zero-pad the numeric ID so that it matches the 5-digit folder names
sample_submission['caseID'] = sample_submission['BraTS21ID'].astype(str).str.zfill(5)

test_df = pd.DataFrame(records)
test_df = test_df.merge(sample_submission[['caseID','BraTS21ID']], on='caseID',how='left')
test_df

### Count of images per subfolder of the train set

In the same way, let's create a new dataframe `train_df` that has the count of images per subfolder. Note that according to the competition host ([discussion thread](https://www.kaggle.com/c/rsna-miccai-brain-tumor-radiogenomic-classification/discussion/262046)), there are three case ids (`00109`, `00123`, `00709`) in the train set that should be excluded because they contain unexpected errors (e.g. missing images). 

In [None]:
TRAIN_DIR = '../input/rsna-miccai-png/train/'
MRI_TYPES = ['T2w', 'T1wCE', 'T1w', 'FLAIR']
# Cases flagged by the competition host as containing errors
EXCLUDED_CASES = ('00109', '00123', '00709')

# Build one record per case so the count columns always stay aligned with
# caseID, even if a subfolder is missing or empty (its count is then 0).
records = []
for case_id in sorted(os.listdir(TRAIN_DIR)):
    case_dir = os.path.join(TRAIN_DIR, case_id)
    if case_id in EXCLUDED_CASES or not os.path.isdir(case_dir):
        continue

    record = {'caseID': case_id}
    for mri_type in MRI_TYPES:
        type_dir = os.path.join(case_dir, mri_type)
        n_images = len(os.listdir(type_dir)) if os.path.isdir(type_dir) else 0
        record[f'{mri_type}_count'] = n_images
    records.append(record)

# Zero-pad the numeric ID so that it matches the 5-digit folder names
train_labels['caseID'] = train_labels['BraTS21ID'].astype(str).str.zfill(5)

train_df = pd.DataFrame(records)
train_df = train_df.merge(train_labels, on='caseID',how='left')
train_df

### Correlation between the number of images vs. the MGMT value

There seems to be no correlation between the number of images per MRI view per case id vs. the MGMT value.

In [None]:
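# Keep only numeric columns: caseID is a string and would make corr() fail in recent pandas versions
corr = train_df.select_dtypes('number').corr()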
sns.heatmap(corr, 
            xticklabels=corr.columns.values,
            yticklabels=corr.columns.values)

### Get the best image per MRI view per case_id

There are many MRI images per folder, but not all of them are useful. In fact, the majority of the slices are mostly black. Let's keep only the image with the largest number of nonzero (non-black) pixels per folder per case ID. I do not yet know how to make use of the other images, so I will only use these chosen images for now.
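
The selection rule can be sketched as follows. This is a minimal illustration on synthetic arrays, not the code that produced the precomputed CSV files loaded in the next cells. The function name `most_informative_slice` is hypothetical.

```python
import numpy as np

def most_informative_slice(slices):
    """Return (index, count) of the 2-D array with the most nonzero pixels."""
    counts = [int(np.count_nonzero(s)) for s in slices]
    best = int(np.argmax(counts))
    return best, counts[best]

# Synthetic stand-in for one series: mostly black slices, two with content
series = [np.zeros((16, 16), dtype=np.uint8) for _ in range(5)]
series[1][4:8, 4:8] = 200      # 16 nonzero pixels
series[3][2:12, 2:12] = 120    # 100 nonzero pixels

best_index, nonzero_pixels = most_informative_slice(series)
print(best_index, nonzero_pixels)  # 3 100
```

With the real data, each slice could be loaded with `cv2.imread(path, cv2.IMREAD_GRAYSCALE)` before being passed to the function. Ties are resolved in favor of the first slice because `np.argmax` returns the first maximum.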

In [None]:
test_best_images = pd.read_csv('../input/images-with-the-most-nonzero-pixels/test_df.csv')
test_df = test_df.merge(test_best_images[['BraTS21ID','T1w','T1wCE','T2w','FLAIR']], on='BraTS21ID', how='left')

In [None]:
test_df

In [None]:
train_best_images = pd.read_csv('../input/images-with-the-most-nonzero-pixels/train_df.csv')
train_df = train_df.merge(train_best_images[['BraTS21ID','T1w','T1wCE','T2w','FLAIR']], on='BraTS21ID', how='left')

In [None]:
train_df

### Let's look at some images without methylation (MGMT_value=0)

In [None]:
case_id = '00688'
filename = train_df[train_df.caseID==case_id].T1w.iloc[0]
t1w = plt.imread(f'{TRAIN_DIR}{case_id}/T1w/{filename}')
filename = train_df[train_df.caseID==case_id].T1wCE.iloc[0]
t1wce = plt.imread(f'{TRAIN_DIR}{case_id}/T1wCE/{filename}')
filename = train_df[train_df.caseID==case_id].T2w.iloc[0]
t2w = plt.imread(f'{TRAIN_DIR}{case_id}/T2w/{filename}')
filename = train_df[train_df.caseID==case_id].FLAIR.iloc[0]
flair = plt.imread(f'{TRAIN_DIR}{case_id}/FLAIR/{filename}')

In [None]:
fig = plt.figure(figsize=(26,6))
plt.gray()
ax1 = fig.add_subplot(141)
plt.imshow(t1w, aspect='auto')
ax2 = fig.add_subplot(142)
plt.imshow(t1wce, aspect='auto')
ax3 = fig.add_subplot(143)
plt.imshow(t2w, aspect='auto')
ax4 = fig.add_subplot(144)
plt.imshow(flair, aspect='auto')

### Let's look at some images with methylation (MGMT_value=1)

In [None]:
case_id = '00058'
filename = train_df[train_df.caseID==case_id].T1w.iloc[0]
t1w = plt.imread(f'{TRAIN_DIR}{case_id}/T1w/{filename}')
filename = train_df[train_df.caseID==case_id].T1wCE.iloc[0]
t1wce = plt.imread(f'{TRAIN_DIR}{case_id}/T1wCE/{filename}')
filename = train_df[train_df.caseID==case_id].T2w.iloc[0]
t2w = plt.imread(f'{TRAIN_DIR}{case_id}/T2w/{filename}')
filename = train_df[train_df.caseID==case_id].FLAIR.iloc[0]
flair = plt.imread(f'{TRAIN_DIR}{case_id}/FLAIR/{filename}')

In [None]:
fig = plt.figure(figsize=(26,6))
plt.gray()
ax1 = fig.add_subplot(141)
plt.imshow(t1w, aspect='auto')
ax2 = fig.add_subplot(142)
plt.imshow(t1wce, aspect='auto')
ax3 = fig.add_subplot(143)
plt.imshow(t2w, aspect='auto')
ax4 = fig.add_subplot(144)
plt.imshow(flair, aspect='auto')

### Extract Metadata from DICOM

Thanks to ([this code](https://www.kaggle.com/c/rsna-miccai-brain-tumor-radiogenomic-classification/discussion/252942)) by @pestipeti

Two features in the metadata (`WindowWidth` and `WindowCenter`) are difficult to include in the correlation matrix because they mix data types (string, float, int) and representations of the same value (1, 1.0, '1.0', '[1.0, 1.0]').
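
One possible way to normalize such values is sketched below. The helper `parse_window_value` is hypothetical and is not used in the rest of this notebook. Keeping only the first element of a multi-valued entry is an assumption made for illustration.

```python
import ast
import numpy as np

def parse_window_value(value):
    """Convert a mixed-type window value to a single float (NaN if unusable)."""
    if value is None:
        return np.nan
    if isinstance(value, str):
        try:
            # Handles '1.0' as well as list-like strings such as '[1.0, 1.0]'
            value = ast.literal_eval(value.strip())
        except (ValueError, SyntaxError):
            return np.nan
    if isinstance(value, (list, tuple)):
        if not value:
            return np.nan
        value = value[0]  # assumption: keep the first of several values
    try:
        return float(value)
    except (TypeError, ValueError):
        return np.nan

examples = [1, 1.0, '1.0', '[1.0, 1.0]', 'n/a', None]
parsed = [parse_window_value(v) for v in examples]
print(parsed)
```

Applied to the dataframe, this could look like `metadata['WindowWidth'].map(parse_window_value)`, after which the two columns would be numeric and could enter the correlation matrix.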

In [None]:
metadata = pd.read_csv('../input/extract-metadata-from-dicom/dicom_meta_train.csv')

image = []
dicom_src = metadata['dicom_src'].tolist()
for src in dicom_src:
    image.append(src.split('/')[-1].replace('dcm','png'))

metadata['image'] = image

# Check if there's a data leak in timestamp
metadata['new_timestamp'] = metadata['timestamp'] - np.mean(metadata['timestamp'])
metadata = metadata.merge(train_df[['BraTS21ID','MGMT_value']], on='BraTS21ID', how='left')
metadata

In [None]:
# Convert object data into numerical

object_cols = metadata.select_dtypes('object').columns
object_cols = [col for col in object_cols if col not in ('WindowWidth', 'WindowCenter')]

ordinal_encoder = OrdinalEncoder()
metadata[object_cols] = metadata[object_cols].fillna('NA')
metadata[object_cols] = ordinal_encoder.fit_transform(metadata[object_cols])
metadata.head()

### Correlation between MGMT value and Metadata features

There seems to be **no correlation** between the MGMT value and any of the metadata features. Note: MGMT_value is in the last row/column, and the feature most strongly correlated with it is `SpatialResolution`, which mainly contains *NaN* values.

In [None]:
# Exclude metadata features that are empty and irrelevant
notnull_cols = [col for col in metadata.columns if col not in ('dataset',
                                                               'ImageDimensions', 
                                                               'ImageLocation',
                                                               'SOPClassUID',
                                                               'SpacingBetweenSlices',
                                                               'type',
                                                               'dicom_src',
                                                               'timestamp',
                                                               )]

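# WindowWidth and WindowCenter are still object columns, so keep only numeric columns
corr = metadata[notnull_cols].select_dtypes('number').corr()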
sns.set(rc={'figure.figsize':(30,26)})
sns.heatmap(corr, 
            xticklabels=corr.columns.values,
            yticklabels=corr.columns.values,
            )