# EXPLORATORY DATA ANALYSIS (EDA)

**Objective:** predict the genetic subtype of glioblastoma from MRI (magnetic resonance imaging) scans by training and testing a model that detects the presence of MGMT promoter methylation.

---
### **Glioblastoma**

[Glioblastoma](https://www.mayoclinic.org/diseases-conditions/glioblastoma/cdc-20350148) (GLB) is an aggressive type of cancer that occurs in the brain or spinal cord. It forms from astrocytes (specialized glial cells that outnumber neurons fivefold). It occurs more often in older adults; symptoms include worsening headaches, nausea, vomiting, and seizures.

<img src="https://www.mayoclinic.org/-/media/kcms/gbs/patient-consumer/images/2019/01/11/10/47/glioblastoma-8col-3802786-002-0.jpg" width="250px"/> <img src="https://media.springernature.com/m685/springer-static/image/art%3A10.1038%2Fs41593-019-0367-6/MediaObjects/41593_2019_367_Fig1_HTML.png" width="350px"/>

**Diagnosis:** Neurological exam, Imaging tests, Biopsy.

**Treatment:** Surgery to remove it, Radiation therapy, Chemotherapy, Tumor treating fields (TTF) therapy, Targeted drug therapy, Clinical trials & Supportive care.

---
### **MGMT**

O6-methylguanine-DNA methyltransferase ([MGMT](https://www.frontiersin.org/articles/10.3389/fonc.2019.01547/full)) is an enzyme involved in DNA repair. Its gene is located on chromosome 10q26.3. The enzyme removes alkyl adducts from the O6 position of guanine: the [methyl moiety](https://en.wikipedia.org/wiki/Moiety_(chemistry)) of the O6-methylguanine adduct is transferred to the MGMT protein, which is irreversibly inactivated in the process.

<img src="https://media.springernature.com/lw785/springer-static/image/prt%3A978-3-540-29623-2%2F1/MediaObjects/978-3-540-29623-2_1_Part_Fig1-2580_HTML.jpg" width="250px"/>

Defective MGMT results in persistence of the O6-methylguanine adduct, which causes base mispairing and futile cycles of mismatch repair. This leads to cell cycle arrest and apoptosis. In cases with **monosomy** of chromosome 10, methylation of the remaining allele completely blocks MGMT-mediated DNA repair.

---
### **MGMT in Glioma**

[Methylation](https://www.nature.com/scitable/topicpage/the-role-of-methylation-in-gene-expression-1070/) of the MGMT gene promoter has been observed in approx. 50% of grade IV gliomas. MGMT gene [promoter methylation](https://oncologypro.esmo.org/education-library/factsheets-on-biomarkers/mgmt-promoter-methylation-in-glioma) has been investigated as a potential biomarker of sensitivity to alkylating chemotherapy.

<img src="https://www.researchgate.net/profile/Walter-Taal/publication/275358605/figure/fig1/AS:418919660703747@1476889638840/Classification-of-glioma-in-types-and-WHO-grades.png" width="500px"/>

MGMT promoter-methylated glioblastoma is [**_likely to show less aggressive image features_**](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7410549/) than unmethylated glioblastoma (Suh, C. _et al._, 2018):

* **Less edema:** Edema is [swelling](https://my.clevelandclinic.org/health/diseases/12564-edema) caused by fluid trapped in the body's tissue. The degree of peritumoral edema on MRI may independently predict glioma prognosis.
* **Apparent diffusion coefficient (ADC):** A [measure of the magnitude of diffusion](https://radiopaedia.org/articles/apparent-diffusion-coefficient-1) of water molecules within tissue, commonly calculated from MRI with diffusion-weighted imaging (DWI).
* **Perfusion:** The process in which [blood is forced to flow](https://www.sciencedirect.com/topics/immunology-and-microbiology/perfusion) through a network of microscopic vessels within biological tissue, allowing exchange of oxygen and other molecules across semipermeable microvascular walls.
    * GLB exhibits profound intratumoral heterogeneity in perfusion, and low perfusion may induce treatment resistance. Imaging approaches that define low-perfusion compartments are therefore crucial for clinical management.

---
### **Data Description**
Three structured cohorts: Training (Public), Validation (Public) and Testing (Private during and after competition).

In this dataset the MGMT promoter status is defined as a binary label: 0 for unmethylated and 1 for methylated.

Each independent case has a dedicated folder identified by a five-digit number. Each folder contains four sub-folders, one for each of the structural multi-parametric MRI ([mpMRI](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5832385/)) scans. The differences between the [MRI imaging methods](https://www.kaggle.com/c/rsna-miccai-brain-tumor-radiogenomic-classification/discussion/252843), as described by @dlbsabu:
* T1-weighted pre-contrast (T1w): fat appears white and water appears black. The shape of the brain is clearly visible, so morphological abnormalities (atrophy, tumors, etc.) are easy to detect.
* T1-weighted post-contrast (T1Gd): T1 imaging with a contrast medium. It is the method that best reflects the location, size, and shape of the mass.
* T2-weighted (T2): water appears white and lesions also appear white, which makes it suitable for lesion evaluation.
* Fluid Attenuated Inversion Recovery (FLAIR): in T2, the spinal fluid (water) and the lesion are both white, so a lesion next to fluid is hard to pick out. FLAIR can be roughly thought of as T2 with the water shown in black, which makes the lesion easier to find.
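
As a rough illustration of the folder layout described above, the sketch below counts the files in each of the four sub-folders of one case. The folder names `FLAIR`, `T1w`, `T1wCE` and `T2w` are the ones used by the loading code later in this notebook; the helper name `count_slices` is hypothetical.

```python
import glob
import os

# Sub-folder names as used by the loaders later in this notebook.
MRI_TYPES = ("FLAIR", "T1w", "T1wCE", "T2w")


def count_slices(case_dir, mri_types=MRI_TYPES):
    """Return {sub-folder name: number of files} for one case folder.

    A missing sub-folder simply yields a count of 0.
    """
    counts = {}
    for mri_type in mri_types:
        pattern = os.path.join(case_dir, mri_type, "*")
        counts[mri_type] = len(glob.glob(pattern))
    return counts
```

Calling `count_slices` on a case folder such as `train/00000` should show how many slices each scan type has, which can differ between scan types and between cases.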

---
### **DICOM**
The data is provided in [DICOM](https://www.dicomstandard.org/about-home) (Digital Imaging and Communications in Medicine) format, the international standard for medical images and related information, first published in 1993.
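
Files stored in the standard DICOM file format begin with a 128-byte preamble followed by the four bytes `DICM`. The sketch below checks for that marker using only the standard library; the helper name `has_dicom_prefix` is hypothetical, and whether every file in this dataset carries the preamble is an assumption that has not been verified here.

```python
def has_dicom_prefix(path):
    """Return True if the file has the 'DICM' marker at byte offset 128."""
    with open(path, "rb") as f:
        f.seek(128)  # skip the 128-byte preamble
        return f.read(4) == b"DICM"
```

A check like this can be a cheap sanity test before handing files to a full parser such as `pydicom`, which is what the loading code in this notebook uses to decode the pixel data.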

---
### **Files**
* train/ - folder containing the training files, with each top-level folder representing a subject. NOTE: There are some unexpected issues with the following three cases in the training dataset, participants can exclude the cases during training: [00109, 00123, 00709]. We have checked and confirmed that the testing dataset is free from such issues.
* train_labels.csv - file containing the target MGMT_value for each subject in the training data (i.e. the presence of MGMT promoter methylation).
* test/ - the test files, which use the same structure as train/; your task is to predict the MGMT_value for each subject in the test data. NOTE: the total size of the rerun test set (Public and Private) is ~5x the size of the Public test set.
* sample_submission.csv - a sample submission file in the correct format
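
The note on `train/` lists three cases that can be excluded during training. A minimal sketch of that filtering step follows; the helper names are hypothetical, and it assumes IDs may arrive either as integers (as when a CSV parser drops the leading zeros) or as zero-padded strings.

```python
# Case IDs flagged in the data description above.
EXCLUDED_CASES = {"00109", "00123", "00709"}


def case_folder_name(brats21id):
    """Format an ID (int or str) as the 5-digit folder name, e.g. 109 -> '00109'."""
    return str(brats21id).zfill(5)


def drop_excluded(case_ids, excluded=EXCLUDED_CASES):
    """Return the IDs whose folder name is not excluded, preserving order."""
    return [cid for cid in case_ids if case_folder_name(cid) not in excluded]
```

Normalizing every ID to its folder name before comparing avoids silently keeping a flagged case because `109` and `"00109"` do not compare equal.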

---
### **Evaluation**
Submissions are evaluated on the [area under the ROC curve](https://en.wikipedia.org/wiki/Receiver_operating_characteristic) between the predicted probability and the observed target.
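
To make the metric concrete: the area under the ROC curve equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as half. The sketch below computes it directly from that definition in plain Python. It is an illustration only, not the competition's scoring code, and the function name `roc_auc` is hypothetical.

```python
def roc_auc(y_true, y_score):
    """Pairwise ROC AUC: P(score of a positive > score of a negative), ties count 0.5."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC is undefined unless both classes are present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(labels, scores))  # 3 of the 4 positive/negative pairs are ordered correctly -> 0.75
```

Because only the ordering of scores matters, a constant prediction (every case tied) scores 0.5 under this definition. The double loop is quadratic in the number of cases, which is acceptable for a check on a few hundred labels.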

For each "BraTS21ID" in the test set, you must predict a probability for the target MGMT_value. The file should contain a header and have the following format:

    BraTS21ID,MGMT_value
    00001,0.5
    00013,0.5
    00015,0.5
    etc.
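
A file in the format shown above could be written with the standard `csv` module, as in the sketch below. The function name `write_submission` and the default file name are hypothetical. Zero-padding the IDs and sorting the rows mirror the sample above; whether the scorer requires either is an assumption.

```python
import csv


def write_submission(predictions, path="submission.csv"):
    """Write {case_id: probability} as a CSV with the BraTS21ID,MGMT_value header."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["BraTS21ID", "MGMT_value"])
        # Sort numerically so that int and zero-padded string IDs order the same way.
        for case_id in sorted(predictions, key=int):
            prob = float(predictions[case_id])
            if not 0.0 <= prob <= 1.0:
                raise ValueError(f"probability out of range for case {case_id}: {prob}")
            writer.writerow([str(case_id).zfill(5), prob])
```

The range check is there because the task asks for a probability, so a value outside [0, 1] most likely signals a bug upstream.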

---
# **Loading data & visualization**

Code from @ihelon's notebook:

https://www.kaggle.com/ihelon/brain-tumor-eda-with-animations-and-modeling

---
### **Import Libraries**

In [None]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

import os
import json
import glob
import random
import collections
import cv2

import pydicom
from pydicom.pixel_data_handlers.util import apply_voi_lut

from matplotlib import animation, rc
rc('animation', html='jshtml')

---
### **Loading & Visualization of Data**

In [None]:
train_df = pd.read_csv("../input/rsna-miccai-brain-tumor-radiogenomic-classification/train_labels.csv")
train_df.head()

In [None]:
plt.figure(figsize=(4, 4))
sns.countplot(data=train_df, x="MGMT_value");
print(train_df['MGMT_value'].value_counts())

In [None]:
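def load_dicom(path):
    # Read one DICOM file and return its pixels as a uint8 image scaled to [0, 255].
    dicom = pydicom.dcmread(path)  # dcmread replaces the older read_file alias
    data = dicom.pixel_array
    # Min-max scale; skip the division for constant images to avoid dividing by zero.
    data = data - np.min(data)
    if np.max(data) != 0:
        data = data / np.max(data)
    data = (data * 255).astype(np.uint8)
    return data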



def visualize_sample(
    brats21id, 
    slice_i,
    mgmt_value,
    types=("FLAIR", "T1w", "T1wCE", "T2w")
):
    plt.figure(figsize=(10, 3))
    patient_path = os.path.join(
        "../input/rsna-miccai-brain-tumor-radiogenomic-classification/train/", 
        str(brats21id).zfill(5),
    )
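    for i, t in enumerate(types, 1):
        # Sort slices numerically by the index after the last hyphen in the file name.
        t_paths = sorted(
            glob.glob(os.path.join(patient_path, t, "*")),
            key=lambda x: int(x[:-4].split("-")[-1]),
        )
        # slice_i is a fraction of the stack depth; clamp so slice_i=1.0 stays in range.
        slice_idx = min(int(len(t_paths) * slice_i), len(t_paths) - 1)
        data = load_dicom(t_paths[slice_idx])
        plt.subplot(1, len(types), i)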
        plt.imshow(data, cmap="gray")
        plt.title(f"{t}", fontsize=10)
        plt.axis("off")
    
    plt.suptitle(f"MGMT_value: {mgmt_value}", fontsize=10)
    plt.show()
    

In [None]:
for i in random.sample(range(train_df.shape[0]), 4):
    _brats21id = train_df.iloc[i]["BraTS21ID"]
    _mgmt_value = train_df.iloc[i]["MGMT_value"]
    visualize_sample(brats21id=_brats21id, mgmt_value=_mgmt_value, slice_i=0.5)

---
## **Animation**

In [None]:
def create_animation(ims):
    fig = plt.figure(figsize=(6, 6))
    plt.axis('off')
    im = plt.imshow(ims[0], cmap="gray")
    plt.close()

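    def animate_func(i):
        # Swap in frame i; return the updated artist in a list, as FuncAnimation expects.
        im.set_array(ims[i])
        return [im]

    # interval is in milliseconds per frame: 1000 // 24 gives roughly 24 frames per second.
    return animation.FuncAnimation(fig, animate_func, frames=len(ims), interval=1000 // 24)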



In [None]:
def load_dicom_line(path):
    t_paths = sorted(
        glob.glob(os.path.join(path, "*")), 
        key=lambda x: int(x[:-4].split("-")[-1]),
    )
    images = []
    for filename in t_paths:
        data = load_dicom(filename)
        # Skip slices that are entirely black (no signal after scaling).
        if data.max() == 0:
            continue
        images.append(data)
        
    return images
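
The loaders above sort slice files with a numeric key instead of relying on the default string order. The sketch below shows why, assuming file names of the form `Image-<number>.dcm`, which is what the key `int(x[:-4].split("-")[-1])` implies. The helper name `slice_number` and the example paths are illustrative.

```python
def slice_number(path):
    """Extract the integer after the last hyphen, e.g. 'FLAIR/Image-12.dcm' -> 12."""
    stem = path[:-4]  # drop the 4-character '.dcm' extension
    return int(stem.split("-")[-1])


paths = ["FLAIR/Image-10.dcm", "FLAIR/Image-2.dcm", "FLAIR/Image-1.dcm"]

lexicographic = sorted(paths)             # string order puts slice 10 before slice 2
numeric = sorted(paths, key=slice_number)  # slices in acquisition-number order

print(lexicographic)
print(numeric)
```

With plain string sorting, an animation would jump between distant slices, so the numeric key matters for any code that treats the stack as an ordered volume.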

---
#### **Case 00000, FLAIR**
Fluid Attenuated Inversion Recovery (FLAIR)

In T2, the spinal fluid (water) and the lesion are both white, so a lesion next to fluid is hard to pick out. FLAIR can be roughly thought of as T2 with the water shown in black, which makes the lesion easier to find.

In [None]:
images = load_dicom_line("../input/rsna-miccai-brain-tumor-radiogenomic-classification/train/00000/FLAIR")
create_animation(images)

---
#### **Case 00000, T1w**

T1-weighted pre-contrast (T1w) 

Fat appears white and water appears black. The shape of the brain is clearly visible, so morphological abnormalities (atrophy, tumors, etc.) are easy to detect.


In [None]:
images = load_dicom_line("../input/rsna-miccai-brain-tumor-radiogenomic-classification/train/00000/T1w")
create_animation(images)

---
#### **Case 00000, T1wCE**

T1-weighted post-contrast (T1Gd)

T1Gd is T1 imaging with contrast medium and is the method that best reflects the location, size, and shape of the mass.

In [None]:
images = load_dicom_line("../input/rsna-miccai-brain-tumor-radiogenomic-classification/train/00000/T1wCE")
create_animation(images)

---
#### **Case 00000, T2w**

T2-weighted (T2)

Water appears white and lesions also appear white, which makes T2 suitable for lesion evaluation.

In [None]:
images = load_dicom_line("../input/rsna-miccai-brain-tumor-radiogenomic-classification/train/00000/T2w")
create_animation(images)