# Predicting a Genetic Biomarker in Brain Tumors

## This notebook contains only EDA and training data preparation

#### Problem 
The goal of this competition is to predict the genetic subtype of glioblastoma from MRI (magnetic resonance imaging) scans. A model is trained and tested to detect the presence of MGMT promoter methylation.

#### Glossary 

- MGMT promoter methylation - The presence of MGMT promoter methylation in the tumor has been shown to be a favorable prognostic factor and a strong predictor of responsiveness to chemotherapy.
- Radiogenomics - The field of predicting the genetics of a cancer from imaging.
- Types of mpMRI scans (dataset folder name in parentheses):
    - Fluid Attenuated Inversion Recovery (FLAIR)
    - T1-weighted pre-contrast (T1w)
    - T1-weighted post-contrast (T1wCE, also written T1Gd)
    - T2-weighted (T2w)

**IMPORTANT** - This notebook removes blank (black) images, which add no value to the processing. The train.csv file still lists the file counts of the original dataset.

#### Notebooks referenced
- https://www.kaggle.com/ayuraj/train-brain-tumor-as-video-classification-w-b
- https://www.kaggle.com/ihelon/brain-tumor-eda-with-animations-and-modeling
- https://www.kaggle.com/smoschou55/advanced-eda-brain-tumor-data/comments#Main-Competition-Workflow

In [None]:
import os
import re 
import glob
import numpy as np
import pandas as pd
from PIL import Image
import seaborn as sns
from tqdm import tqdm
import matplotlib.pyplot as plt
%matplotlib inline

# Pydicom related imports
import pydicom
from pydicom.pixel_data_handlers.util import apply_voi_lut
import SimpleITK as sitk

# Deep learning packages
import tensorflow as tf

# For gif creation
import imageio

import warnings
warnings.filterwarnings('ignore')

## Data Visualization

The training labels contain 585 rows, one per patient/subject (BraTS21ID). 
Each row carries the target MGMT_value, which marks the presence (1) or absence (0) of MGMT promoter methylation for that subject. 
Of these subjects, 307 (about 52.5%) show MGMT promoter methylation and 278 (about 47.5%) do not. 
The classes are close to balanced, so the imbalance in the training set is acceptable. 


In [None]:
train_df = pd.read_csv('../input/rsna-miccai-brain-tumor-radiogenomic-classification/train_labels.csv')
print('Number of rows: ', len(train_df))
train_df['MGMT_value'].value_counts()
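
The class shares quoted above can be checked with a short sketch. It does not read the competition file; the `labels` frame is a hypothetical stand-in built from the two counts in the text.

```python
import pandas as pd

# Hypothetical stand-in for train_labels.csv, built from the counts quoted above
labels = pd.DataFrame({"MGMT_value": [1] * 307 + [0] * 278})

# normalize=True returns class fractions instead of raw counts
class_share = labels["MGMT_value"].value_counts(normalize=True)
print((class_share * 100).round(1))
```

On the real data, the same `value_counts(normalize=True)` call on `train_df['MGMT_value']` should give the equivalent breakdown.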

In [None]:
plt.figure(figsize=(5, 5))
print(train_df.MGMT_value.value_counts())
sns.countplot(data=train_df, x="MGMT_value");

Let us look at the volume of training data.

In [None]:
train_files = glob.glob('../input/rsna-miccai-brain-tumor-radiogenomic-classification/train/*/*/*')
print(f'There are {len(train_files)} dicom files in the training data')

In [None]:
test_files = glob.glob('../input/rsna-miccai-brain-tumor-radiogenomic-classification/test/*/*/*')
print(f'There are {len(test_files)} dicom files in the test data')

In [None]:
df_train_labels = pd.read_csv('../input/rsna-miccai-brain-tumor-radiogenomic-classification/train_labels.csv')
df_train_labels = df_train_labels.rename(columns={'BraTS21ID': 'PatientId'})
df_train_labels['PatientId'] = [format(x, '05d') for x in df_train_labels.PatientId]
df_train_labels['PatientId'] = df_train_labels['PatientId'].astype(str)
df_train_labels.describe()
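
The zero-padding step matters because the label file stores patient IDs as integers, while the later merge on `PatientId` uses the folder names as strings. A minimal sketch of the same conversion using pandas string methods, with made-up IDs:

```python
import pandas as pd

# Hypothetical IDs; the real ones come from the BraTS21ID column
ids = pd.Series([0, 2, 1010], name="PatientId")

# Cast to string, then left-pad with zeros to five characters
padded = ids.astype(str).str.zfill(5)
print(padded.tolist())  # ['00000', '00002', '01010']
```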

In [None]:
patients = glob.glob('../input/rsna-miccai-brain-tumor-radiogenomic-classification/train/*')
print(f'There are {len(patients)} patients in the training data')

In [None]:
patients = glob.glob('../input/rsna-miccai-brain-tumor-radiogenomic-classification/test/*')
print(f'There are {len(patients)} patients in the test data')

In [None]:
keys = ['FLAIR', 'T1w', 'T1wCE', 'T2w']

label_dict = {
    'FLAIR': [],
    'T1w': [],
    'T1wCE': [],
    'T2w': []
}

label_dict_counts = {}

for filename in tqdm(train_files):
    
    # The parent folder name is the scan type
    scan = filename.split('/')[-2]
    
    # Count only known scan types instead of treating everything else as T2w
    if scan in label_dict:
        label_dict[scan].append(filename)
    
for key in keys:
    label_dict_counts[key] = len(label_dict[key])

values = label_dict_counts.values()
sns.barplot(x=keys, y=list(values))
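
The per-type count can also be sketched with `collections.Counter`. The paths below are hypothetical and only mimic the `<patient>/<scan type>/<file>` layout used in this notebook.

```python
from collections import Counter
from pathlib import PurePosixPath

# Hypothetical paths in the <patient>/<scan type>/<file> layout
paths = [
    "train/00000/FLAIR/Image-1.dcm",
    "train/00000/FLAIR/Image-2.dcm",
    "train/00000/T1w/Image-1.dcm",
    "train/00002/T2w/Image-1.dcm",
]

# The parent folder of each file is its scan type
counts = Counter(PurePosixPath(p).parent.name for p in paths)
print(dict(counts))
```

A `Counter` returns 0 for scan types that never occur, which avoids pre-filling a dictionary with empty lists.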

In [None]:
# Number of files per patient per key.
train_folders = '../input/rsna-miccai-brain-tumor-radiogenomic-classification/train/'
records = []
for patientId in tqdm(os.listdir(train_folders)):
    # Count the .dcm files in each scan-type folder for this patient
    row = {'PatientId': patientId}
    for key in keys:
        row[key] = len(glob.glob(os.path.join(train_folders, patientId, key, '*.dcm')))
    records.append(row)
# DataFrame.append was removed in pandas 2.0, so build the frame from a list of rows
df_patient_records_train = pd.DataFrame(records, columns=['PatientId'] + keys)
df_patient_records_train.head()

In [None]:
# Number of files per patient per key.
test_folders = '../input/rsna-miccai-brain-tumor-radiogenomic-classification/test/'
records = []
for patientId in tqdm(os.listdir(test_folders)):
    # Count the .dcm files in each scan-type folder for this patient
    row = {'PatientId': patientId}
    for key in keys:
        row[key] = len(glob.glob(os.path.join(test_folders, patientId, key, '*.dcm')))
    records.append(row)
# DataFrame.append was removed in pandas 2.0, so build the frame from a list of rows
df_patient_records_test = pd.DataFrame(records, columns=['PatientId'] + keys)
df_patient_records_test.head()

In [None]:
for key in keys:
    df_patient_records_train[key] = df_patient_records_train[key].astype(int)
df_patient_records_train['PatientId'] = df_patient_records_train['PatientId'].astype(str)
df_patient_records_train["TotalFiles"] = df_patient_records_train[keys].sum(axis=1)
assert df_patient_records_train.TotalFiles.sum() == len(train_files)
df_patient_records_train.head()

In [None]:
for key in keys:
    df_patient_records_test[key] = df_patient_records_test[key].astype(int)
df_patient_records_test['PatientId'] = df_patient_records_test['PatientId'].astype(str)
df_patient_records_test["TotalFiles"] = df_patient_records_test[keys].sum(axis=1)
assert df_patient_records_test.TotalFiles.sum() == len(test_files)
df_patient_records_test.head()

In [None]:
df_patient_records_train = pd.merge(df_patient_records_train, df_train_labels, on=['PatientId'])
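
`pd.merge` defaults to an inner join, so any patient folder without a row in the label file would be dropped silently. The sketch below shows one way to surface such cases with `how='left'` and `indicator=True`. The two small frames are hypothetical, not the competition data.

```python
import pandas as pd

# Hypothetical file counts and labels; patient 00003 has no label
files = pd.DataFrame({"PatientId": ["00000", "00002", "00003"],
                      "TotalFiles": [400, 380, 350]})
labels = pd.DataFrame({"PatientId": ["00000", "00002"],
                       "MGMT_value": [1, 0]})

# indicator=True adds a _merge column that records where each row came from
merged = files.merge(labels, on="PatientId", how="left", indicator=True)
unlabeled = merged.loc[merged["_merge"] == "left_only", "PatientId"].tolist()
print(unlabeled)  # ['00003']
```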

In [None]:
df_patient_records_train.head()

In [None]:
df_patient_records_train.sort_values(by='TotalFiles', ascending=False).head(50)[keys].plot(kind='bar',figsize=(20, 8), stacked=True)

In [None]:
df_patient_records_test.sort_values(by='TotalFiles', ascending=False).head(50)[keys].plot(kind='bar',figsize=(20, 8), stacked=True)

In [None]:
boxprops = dict(linestyle='-', linewidth=4, color='r')
medianprops = dict(linestyle='-', linewidth=4, color='b')
df_patient_records_train[keys].plot(kind='box', figsize=(10, 4), showfliers=True, showmeans=True,
                boxprops=boxprops,
                medianprops=medianprops)
plt.suptitle("Distribution of files per patient (train)")
plt.xlabel("Types")
plt.ylabel("Count of files")

- T2w scans contain the most images, and T1wCE scans contain the fewest.
- T1wCE also shows the most outliers.
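
The outlier observation can be put into numbers. The box plot above does not override matplotlib's default whisker length, so points beyond 1.5 × IQR from the quartiles are drawn as outliers. The sketch below applies the same rule to hypothetical counts; `count_outliers` is an illustrative helper and is not used elsewhere in this notebook.

```python
import pandas as pd

def count_outliers(series: pd.Series, whis: float = 1.5) -> int:
    """Count values outside [Q1 - whis*IQR, Q3 + whis*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    mask = (series < q1 - whis * iqr) | (series > q3 + whis * iqr)
    return int(mask.sum())

# Hypothetical per-patient file counts
counts = pd.DataFrame({
    "FLAIR": [100, 110, 120, 130, 400],
    "T2w": [100, 110, 120, 130, 140],
})
outliers = {col: count_outliers(counts[col]) for col in counts.columns}
print(outliers)  # {'FLAIR': 1, 'T2w': 0}
```

Applied to `df_patient_records_train[keys]`, the same helper would give a per-type outlier count to compare against the plot.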

In [None]:
boxprops = dict(linestyle='-', linewidth=4, color='r')
medianprops = dict(linestyle='-', linewidth=4, color='b')
df_patient_records_test[keys].plot(kind='box', figsize=(10, 4), showfliers=True, showmeans=True,
                boxprops=boxprops,
                medianprops=medianprops)
plt.suptitle("Distribution of files per patient (test)")
plt.xlabel("Types")
plt.ylabel("Count of files")

In [None]:
boxprops = dict(linestyle='-', linewidth=4, color='r')
medianprops = dict(linestyle='-', linewidth=4, color='b')
df_patient_records_train['TotalFiles'].plot(kind='box', figsize=(8, 5), showfliers=True, showmeans=True,
                boxprops=boxprops,
                medianprops=medianprops)
plt.suptitle("Distribution of total files per patient (train)")

In [None]:
boxprops = dict(linestyle='-', linewidth=4, color='r')
medianprops = dict(linestyle='-', linewidth=4, color='b')
df_patient_records_test['TotalFiles'].plot(kind='box', figsize=(8, 5), showfliers=True, showmeans=True,
                boxprops=boxprops,
                medianprops=medianprops)
plt.suptitle("Distribution of total files per patient (test)")

In [None]:
round(pd.DataFrame.from_dict(label_dict_counts, orient='index')/len(train_files)*100, 2).plot(kind='bar')
plt.suptitle("Percentage Data by Type")
plt.xlabel("Types")
plt.ylabel("Percentage")

In [None]:
df_patient_records_train.describe()

### Read DICOM images

In [None]:
# Reference: https://www.kaggle.com/xhlulu/siim-covid-19-convert-to-jpg-256px
def ReadMRI(path, voi_lut = True, fix_monochrome = True):
    
    # Original from: https://www.kaggle.com/raddar/convert-dicom-to-np-array-the-correct-way
    # dcmread is the current pydicom name for the older read_file
    dicom = pydicom.dcmread(path)
    
    # VOI LUT (if available by DICOM device) is used to transform raw DICOM data to 
    # "human-friendly" view
    if voi_lut:
        data = apply_voi_lut(dicom.pixel_array, dicom)
    else:
        data = dicom.pixel_array
               
    # MONOCHROME1 stores inverted intensities (high value = dark), so flip them
    if fix_monochrome and dicom.PhotometricInterpretation == "MONOCHROME1":
        data = np.amax(data) - data
        
    # Rescale to 0..255; a blank slice has a maximum of 0, so skip the division there
    data = data - np.min(data)
    max_value = np.max(data)
    if max_value > 0:
        data = data / max_value
    data = (data * 255).astype(np.uint8)
        
    return data

def resize(array, size, keep_ratio=False, resample=Image.LANCZOS):
    # Original from: https://www.kaggle.com/xhlulu/vinbigdata-process-and-resize-to-image
    im = Image.fromarray(array)
    if keep_ratio:
        im.thumbnail((size, size), resample)
    else:
        if (im.size != (size, size)):
            im = im.resize((size, size), resample)
    return im
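
The rescaling step inside `ReadMRI` can be tried on plain arrays without any DICOM file. The sketch below is a standalone illustration; `to_uint8` is a hypothetical helper, and it guards against the all-zero slices that appear in this dataset.

```python
import numpy as np

def to_uint8(pixels: np.ndarray) -> np.ndarray:
    """Min-max scale an array to 0..255; all-zero input stays all zero."""
    data = pixels.astype(np.float64)
    data -= data.min()
    peak = data.max()
    if peak > 0:  # avoid 0/0 on blank slices
        data /= peak
    return (data * 255).astype(np.uint8)

blank = to_uint8(np.zeros((2, 2)))
scaled = to_uint8(np.array([[0, 50], [100, 200]]))
print(blank.max(), scaled.min(), scaled.max())  # 0 0 255
```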

In [None]:
data = ReadMRI(train_files[1])
print('Shape of data: ', data.shape)
plt.rcdefaults()
plt.figure(figsize=(5, 5))
plt.imshow(data, cmap='gray');

### Animate MRI images for a patient

In [None]:
patientIds = os.listdir('../input/rsna-miccai-brain-tumor-radiogenomic-classification/train')
patientId = np.random.choice(patientIds)
key = np.random.choice(keys)

output_dir_path_train = '/kaggle/working/output/images/train' 
os.makedirs(output_dir_path_train, exist_ok=True)

output_dir_path_test = '/kaggle/working/output/images/test' 
os.makedirs(output_dir_path_test, exist_ok=True)

def convert_dicom_to_png(patientId, key, ds_type = 'train'):
    # Only the training set has labels; the label is appended to each PNG file name
    if ds_type == 'train':
        mgmt_value = df_patient_records_train.loc[df_patient_records_train['PatientId'] == patientId]["MGMT_value"].item()
    files_path = f'../input/rsna-miccai-brain-tumor-radiogenomic-classification/{ds_type}/{patientId}/{key}/*.dcm'
    for file in glob.glob(files_path):
        file_name = file.split('/')[-1].split('.')[0]
        img_data = ReadMRI(file)
        # skipping blank images
        if (np.count_nonzero(img_data) > 0):
            img_data = resize(img_data, size=224)
            if "train" == ds_type:
                os.makedirs(f'{output_dir_path_train}/{patientId}/{key}', exist_ok=True)
                img_data.save(f'{output_dir_path_train}/{patientId}/{key}/{file_name}-{mgmt_value}.png')
            else:
                os.makedirs(f'{output_dir_path_test}/{patientId}/{key}', exist_ok=True)
                img_data.save(f'{output_dir_path_test}/{patientId}/{key}/{file_name}.png')

convert_dicom_to_png(patientId, key)
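
The blank-image filter relies on `np.count_nonzero`: a slice with no non-zero pixel is entirely black. A minimal standalone sketch on synthetic arrays, where `is_blank` is an illustrative helper:

```python
import numpy as np

def is_blank(image: np.ndarray) -> bool:
    """True when every pixel is zero."""
    return np.count_nonzero(image) == 0

# Synthetic slices: one all black, one with a few bright pixels
slices = [np.zeros((4, 4), dtype=np.uint8), np.eye(4, dtype=np.uint8)]
kept = [s for s in slices if not is_blank(s)]
print(len(kept))  # 1
```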

In [None]:
anim_file = 'brain_scan.gif'
with imageio.get_writer(anim_file, mode='I') as writer:
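    filenames = glob.glob(f'{output_dir_path_train}/{patientId}/{key}/Image*.png')
    # Sort by slice number (assumes names like Image-<n>-<label>.png); a plain sort puts Image-10 before Image-2
    filenames = sorted(filenames, key=lambda f: int(os.path.basename(f).split('-')[1]))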
    for filename in filenames:
        image = imageio.imread(filename)
        writer.append_data(image)
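
Frame order matters for the animation, and plain string sorting does not follow slice order once slice numbers reach two digits. The sketch below compares the two orderings on hypothetical file names that follow the `Image-<n>-<label>.png` pattern assumed from the conversion step; `slice_number` is an illustrative helper.

```python
import re

def slice_number(path: str) -> int:
    """Extract the slice index that follows 'Image-' in a file name."""
    match = re.search(r"Image-(\d+)", path)
    return int(match.group(1)) if match else -1

names = ["Image-10-1.png", "Image-2-1.png", "Image-1-1.png"]
lexicographic = sorted(names)
numeric = sorted(names, key=slice_number)
print(lexicographic)  # ['Image-1-1.png', 'Image-10-1.png', 'Image-2-1.png']
print(numeric)        # ['Image-1-1.png', 'Image-2-1.png', 'Image-10-1.png']
```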

In [None]:
!pip install git+https://github.com/tensorflow/docs

In [None]:
import tensorflow_docs.vis.embed as embed
print(f'Showing Animated gif for patient: {patientId}, for key: {key}')
embed.embed_file(anim_file)

### Visualize Images per type

In [None]:
for p in list(df_patient_records_train.sample(n=5).PatientId):
    for i, key in enumerate(keys, 1):
        patient_path = f'../input/rsna-miccai-brain-tumor-radiogenomic-classification/train/{p}/'
        t_paths = sorted(glob.glob(os.path.join(patient_path, key, "*")), key=lambda x: int(x[:-4].split("-")[-1]))
        data = ReadMRI(t_paths[int(len(t_paths)*0.5)])
        plt.subplot(1, 4, i)
        plt.imshow(data, cmap="gray")
        plt.title(f"{key}", fontsize=12)
        plt.axis("off")
    mgmt_value = df_patient_records_train.loc[df_patient_records_train['PatientId'] == p]["MGMT_value"]
    plt.suptitle(f"MGMT_value: {mgmt_value.item()}, patient Id: {p}", fontsize=12)
    plt.show()

## Convert DICOM to Images

In [None]:
# Only the first 100 patients of each set are converted here
for patientId in tqdm(list(df_patient_records_train.PatientId)[:100]):
    for key in keys:
        convert_dicom_to_png(patientId, key)

for patientId in tqdm(list(df_patient_records_test.PatientId)[:100]):
    for key in keys:
        convert_dicom_to_png(patientId, key, 'test')

In [None]:
df_patient_records_train.to_csv(f'{output_dir_path_train}/train.csv')
df_patient_records_test.to_csv(f'{output_dir_path_test}/test.csv')

In [None]:
%%time
!mkdir -p /kaggle/tmp
!tar -zcf train.tar.gz -C "./output/images/train" .
!tar -zcf test.tar.gz -C "./output/images/test" .
!rm -r ./output
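
Since the `output` folder is deleted after archiving, it can be worth confirming that the archive holds the expected files first. The sketch below uses Python's `tarfile` module on a throwaway directory. The folder layout and file name are hypothetical, and this is an alternative to the shell commands above, not what the notebook itself runs.

```python
import os
import tarfile
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Build a tiny stand-in for output/images/train
    src = os.path.join(tmp, "images")
    os.makedirs(os.path.join(src, "00000", "FLAIR"))
    with open(os.path.join(src, "00000", "FLAIR", "Image-1-1.png"), "wb") as f:
        f.write(b"placeholder")

    # Roughly equivalent to: tar -zcf train.tar.gz -C images .
    archive = os.path.join(tmp, "train.tar.gz")
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(src, arcname=".")

    # Read the archive back and list the regular files it contains
    with tarfile.open(archive, "r:gz") as tar:
        members = [m.name for m in tar.getmembers() if m.isfile()]

print(members)
```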