# EDA 1 for Brain Tumor Classification



In [None]:
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from pathlib import Path

sns.set_context("talk", font_scale=1.4)
sns.set_style("whitegrid")

# Step 1: Goal

The goal is to predict the genetics of the cancer from imaging (i.e., radiogenomics), in order to reduce the number of surgeries and refine the type of therapy patients require. Specifically, the task is to predict the genetic subtype of glioblastoma from MRI (magnetic resonance imaging) scans: a model is trained and tested to detect the presence of MGMT promoter methylation. The problem can be treated as binary classification, predicting a probability for the target MGMT_value.

* The training data should generalize to the test data, but that depends on how the split was done. Each data point is one patient. Does the data represent the whole population, i.e. all patients in the world, or just NA?
    * multiple institutions, variety of instruments (paper)
* Is each entry independent (five-digit number per unique patient), or are there duplicates, e.g. the same patient at different times?
* How was the negative set created? Imagery of randomly selected people?
* ***Submission rules***:
    * According to the extra explanation on the submission button: "In this competition, we will privately re-run your selected Notebook Version with a hidden test set substituted into the competition dataset. We then extract your chosen Output File from the re-run and use that to determine your score."
        * This means the submitted notebook needs to work with the raw DICOM files (which are also provided for the private test set).
    * The notebook has to run without internet access.


## Metric

* The metric to optimize is the area under the ROC curve (ROC AUC).
* AUC is still sensitive to class imbalance to some degree.
* based on MGMT_value=1/0
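
As a rough illustration of the metric, ROC AUC can be read as the share of (positive, negative) pairs in which the positive case gets the higher predicted probability. The sketch below uses only the standard library; `roc_auc` is a hypothetical helper written for this note and the labels and scores are toy values, not competition data.

```python
from itertools import product


def roc_auc(labels, scores):
    """Share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("ROC AUC needs at least one positive and one negative label")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p, n in product(pos, neg)
    )
    return wins / (len(pos) * len(neg))


# Toy MGMT_value labels and predicted probabilities (made up for illustration).
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(toy_labels, toy_scores))  # 0.75: three of the four pairs are ordered correctly
```

Only the ordering of the scores matters here, so no classification threshold has to be chosen.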


# Step 2: Gather the data.

* The data was already provided by Kaggle and split into a training and a test set. How was the split done? Was it ensured that train and test follow the same distribution?
* Is there alternative open data that could help to improve performance?
* The data is described in the paper: https://arxiv.org/pdf/2107.02314.pdf


# Step 3: Extract the Data

The organizers suggest excluding the following training cases: [00109, 00123, 00709]
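
One way to apply this suggestion is to filter the case IDs before any further processing. This is a minimal sketch, assuming IDs may arrive either as integers (as in a label table) or as zero-padded folder names; `drop_excluded` and `EXCLUDED_IDS` are hypothetical names introduced here.

```python
# Case IDs flagged by the organizers (taken from the note above).
EXCLUDED_IDS = {"00109", "00123", "00709"}


def drop_excluded(case_ids, excluded=EXCLUDED_IDS):
    """Return case IDs as five-digit strings, without the excluded ones."""
    padded = [f"{int(case_id):05d}" for case_id in case_ids]
    return [case_id for case_id in padded if case_id not in excluded]


# Mixed integer and string IDs, made up for illustration.
print(drop_excluded([0, 109, "00123", 688]))  # ['00000', '00688']
```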

In [None]:
! tree /kaggle/input/ -L 2

In [None]:
! ls /kaggle/input/rsna-miccai-brain-tumor-radiogenomic-classification/train | head -n 5

In [None]:
! ls /kaggle/input/rsna-miccai-brain-tumor-radiogenomic-classification/train/00000

MRI scan types:

* Fluid Attenuated Inversion Recovery (FLAIR)
* T1-weighted pre-contrast (T1w)
* T1-weighted post-contrast (T1Gd)
* T2-weighted (T2)

In [None]:
! ls /kaggle/input/rsna-miccai-brain-tumor-radiogenomic-classification/train/00000/FLAIR | head -n 5

In [None]:
! ls /kaggle/input/rsna-miccai-brain-tumor-radiogenomic-classification/test

> * Small test set!
> * IDs seem unique, also between training and test set

# Step 4: Meet and Greet the data

Train/test split: 87% train, the rest test


In [None]:
! ls /kaggle/input/rsna-miccai-brain-tumor-radiogenomic-classification/train | wc -l

In [None]:
! ls /kaggle/input/rsna-miccai-brain-tumor-radiogenomic-classification/test | wc -l
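
The split share follows directly from the two folder counts. The sketch below shows the arithmetic; `split_shares` is a hypothetical helper, and the counts passed to it are round placeholder numbers, not the actual output of the `wc -l` cells.

```python
def split_shares(n_train, n_test):
    """Return the train and test shares of the total number of cases."""
    total = n_train + n_test
    if total == 0:
        raise ValueError("no cases counted")
    return n_train / total, n_test / total


# Placeholder counts for illustration only; substitute the folder counts from above.
train_share, test_share = split_shares(870, 130)
print(f"train: {train_share:.0%}, test: {test_share:.0%}")
```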

In [None]:
data_path = Path("../input/rsna-miccai-brain-tumor-radiogenomic-classification")

In [None]:
labels = pd.read_csv(data_path/"train_labels.csv")
labels.shape

In [None]:
labels.head()

In [None]:
labels.sample(10, random_state=42)

In [None]:
labels.info()

### Missing Values

In [None]:
labels.isna().sum()

### Duplication

In [None]:
labels.duplicated().sum()

In [None]:
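# Collect [type, id, mri, file, full path] for every DICOM file below data_path.
# Using the path relative to data_path avoids a hard-coded split offset.
folder_struct = []
for path in data_path.rglob("*.dcm"):
    folder_struct.append([*path.relative_to(data_path).parts, str(path)])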
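
To make the expected folder layout explicit, the sketch below splits a single path into its components, assuming the structure `<root>/<type>/<id>/<mri>/<file>` seen in the listings above. `parse_dicom_path` is a hypothetical helper and the example file name is made up.

```python
from pathlib import PurePosixPath


def parse_dicom_path(path, root):
    """Split <root>/<type>/<id>/<mri>/<file> into its parts plus the full path."""
    parts = PurePosixPath(path).relative_to(PurePosixPath(root)).parts
    if len(parts) != 4:
        raise ValueError(f"unexpected folder layout: {path}")
    return [*parts, str(path)]


root = "../input/rsna-miccai-brain-tumor-radiogenomic-classification"
example = f"{root}/train/00000/FLAIR/Image-1.dcm"  # illustrative file name
print(parse_dicom_path(example, root)[:4])  # ['train', '00000', 'FLAIR', 'Image-1.dcm']
```

Raising on an unexpected depth surfaces layout surprises early instead of producing misaligned columns.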

## Image Metadata



In [None]:
image_meta = pd.DataFrame(folder_struct, columns=["type", "id", "mri", "file", "path"])
image_meta.head()

In [None]:
# Escape the dot so that it only matches the literal ".dcm" extension.
image_meta['img_num'] = image_meta['file'].str.extract(r'Image-(\d+)\.dcm', expand=False).astype(int)
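
The numeric image number matters because file names sort lexicographically by default. A small standard-library sketch of the same extraction, assuming the `Image-<n>.dcm` naming pattern; `image_number` is a hypothetical helper and the file names are made up.

```python
import re

IMAGE_NAME = re.compile(r"Image-(\d+)\.dcm")


def image_number(file_name):
    """Return the integer in 'Image-<n>.dcm', or None if the name does not match."""
    match = IMAGE_NAME.fullmatch(file_name)
    return int(match.group(1)) if match else None


names = ["Image-10.dcm", "Image-2.dcm", "Image-1.dcm"]
print(sorted(names))                    # ['Image-1.dcm', 'Image-10.dcm', 'Image-2.dcm']
print(sorted(names, key=image_number))  # ['Image-1.dcm', 'Image-2.dcm', 'Image-10.dcm']
```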

In [None]:
image_meta.shape

In [None]:
image_file_counts = (
    image_meta.groupby(["type", "id", "mri"])["file"]
    .count()
    .reset_index()
    .rename(columns={"file": "counts"})
)
image_file_counts.head(20)

> The number of .dcm files is not consistent, maybe due to different methods in the hospitals?

In [None]:
fig, ax = plt.subplots(figsize=(16,4))
sns.histplot(data=image_file_counts, x="counts", stat="count", hue='mri', ax=ax)
ax.set_xlabel("# of .dcm files per person and MRI type")

In [None]:
image_file_counts['counts'].describe()

> At most, there are 514 images for a single MRI type of one person.

In [None]:
image_file_counts.groupby("type")['id'].nunique(), image_file_counts['id'].shape

### Missing Values 2

Are there persons with a missing MRI type?

In [None]:
# Count files per person and MRI type; combinations without any file show up as NaN.
image_meta_miss = image_meta.pivot_table(index=["type", "id"], columns="mri", values="path", aggfunc="count")
image_meta_miss.head()

In [None]:
image_meta_miss.isna().sum()

> Every person in the train and test data has all MRI types.
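
The same completeness check can be expressed without a pivot table. This is a sketch under the assumption that the four MRI folders are named as in `EXPECTED_MRI` below (adjust the set if the folder names differ); `missing_mri_types` is a hypothetical helper and the records are toy data.

```python
from collections import defaultdict

# Assumed folder names for the four scan types; adjust to the actual folders.
EXPECTED_MRI = {"FLAIR", "T1w", "T1wCE", "T2w"}


def missing_mri_types(records, expected=EXPECTED_MRI):
    """Map each case ID to the sorted list of MRI types it lacks (complete cases are omitted)."""
    seen = defaultdict(set)
    for case_id, mri in records:
        seen[case_id].add(mri)
    return {
        case_id: sorted(expected - found)
        for case_id, found in seen.items()
        if expected - found
    }


# Toy (id, mri) pairs: case 00001 has no T2w folder.
toy_records = [
    ("00000", "FLAIR"), ("00000", "T1w"), ("00000", "T1wCE"), ("00000", "T2w"),
    ("00001", "FLAIR"), ("00001", "T1w"), ("00001", "T1wCE"),
]
print(missing_mri_types(toy_records))  # {'00001': ['T2w']}
```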

Are labels missing for some persons?

In [None]:
labels['id'] = labels["BraTS21ID"].apply(lambda num: f"{num:05d}")
labels['label'] = labels['MGMT_value']
labels.drop(columns=["BraTS21ID", 'MGMT_value'], inplace=True)
image_meta = pd.merge(image_meta, labels, on = "id", how="left")

In [None]:
image_meta.head()

In [None]:
image_meta[image_meta['type'] == "train"].isna().sum()

> No missing values for label in train data.
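
The label check hinges on matching integer label IDs to five-digit folder names. A minimal standard-library sketch of that matching step; `unlabeled_cases` is a hypothetical helper and the IDs are toy values.

```python
def unlabeled_cases(folder_ids, label_ids):
    """Return folder IDs (five-digit strings) that have no matching label ID."""
    labeled = {f"{int(label_id):05d}" for label_id in label_ids}
    return sorted(set(folder_ids) - labeled)


# Toy data: folder 00005 has no entry among the numeric label IDs.
print(unlabeled_cases(["00000", "00002", "00005"], [0, 2, 3]))  # ['00005']
```

Without the zero-padding, the string `"00002"` and the integer `2` would never match and every case would look unlabeled.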

### Look at Specific Person in Detail

In [None]:
! pip install pydicom

In [None]:
from pydicom import dcmread
from pydicom import filereader

In [None]:
sample_person = image_meta.query("id=='00688'")
sample_person.head()

In [None]:
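# Use the first file of the sample person instead of a hard-coded row label,
# which depends on the order in which the files were listed.
dcmread(sample_person.iloc[0]['path'])

In [None]:
dataset = dcmread(sample_person.iloc[0]['path'])
img = dataset.pixel_array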

In [None]:
img.shape, img.dtype

In [None]:
fig, ax = plt.subplots()
ax.imshow(img, cmap='gray')
ax.set_axis_off()
plt.show()


In [None]:
first_imgs = sample_person.sort_values(['mri', 'img_num']).groupby("mri").first()
first_imgs

In [None]:
def get_img_array(image_path):
    dataset = filereader.dcmread(image_path)
    return dataset.pixel_array

In [None]:

fig, axes = plt.subplots(nrows=2, ncols=2)
axes = axes.flatten()
for i, ax in enumerate(axes):
    ax.imshow(get_img_array(first_imgs.iloc[i]['path']), cmap='gray')
    ax.set_axis_off()
    ax.set_title(first_imgs.index[i])

In [None]:
sample_person.sort_values(['mri', 'img_num']).groupby("mri")['img_num'].apply(lambda lst: lst.to_list())

In [None]:
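def compare_img_types(img_num):
    # One row per MRI type that has an image with this number, in a stable order.
    select_img = sample_person[sample_person['img_num'] == img_num].sort_values('mri')
    print(select_img['mri'])

    fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(16, 16))
    axes = axes.flatten()
    for ax in axes:
        ax.set_axis_off()
    # zip stops early if fewer than four MRI types contain this image number.
    for ax, (_, row) in zip(axes, select_img.iterrows()):
        ax.imshow(get_img_array(row['path']), cmap='gray')
        # Take the title from the row that is actually plotted.
        ax.set_title(row['mri'])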

In [None]:
compare_img_types(150)

In [None]:
last_imgs = sample_person.sort_values(['mri', 'img_num']).groupby("mri").last()
last_imgs

In [None]:

fig, axes = plt.subplots(nrows=2, ncols=2)
axes = axes.flatten()
for i, ax in enumerate(axes):
    ax.imshow(get_img_array(last_imgs.iloc[i]['path']), cmap='gray')
    ax.set_axis_off()
    ax.set_title(last_imgs.index[i])

# Step 5: Data Distribution




## Time-series Analysis of Single Person
Cross-sectional analysis of the brain

In [None]:
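def compare_img_mri(mri_type, interval=10):
    # Every `interval`-th image of the requested MRI type, ordered by image number.
    select_img = sample_person[sample_person['mri'] == mri_type].sort_values('img_num').reset_index()
    interval_imgs = select_img[::interval]
    if interval_imgs.empty:
        raise ValueError(f"no images found for MRI type {mri_type!r}")

    ncols = 6
    nrows = -(-len(interval_imgs) // ncols)  # ceiling division
    fig, axes = plt.subplots(nrows=nrows, ncols=ncols, figsize=(16, 3 * nrows), squeeze=False)
    axes = axes.flatten()
    for ax in axes:
        ax.set_axis_off()
    # Plot only the selected images; surplus axes in the last row stay empty.
    for ax, (_, row) in zip(axes, interval_imgs.iterrows()):
        ax.imshow(get_img_array(row['path']), cmap='gray', aspect='auto')
        ax.set_title(str(row['img_num']))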

In [None]:
compare_img_mri(mri_type="FLAIR", interval=10)

In [None]:
compare_img_mri(mri_type="T1w", interval=10)

In [None]:
compare_img_mri(mri_type="T1wCE", interval=10)

In [None]:
compare_img_mri(mri_type="T2w", interval=10)

> * The scans look similar across MRI types; each type scans the brain in multiple cross-sectional layers.
> * Do we need all layers for prediction, or is a single shot from the middle enough?
> * Which image was used to predict the class?
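
If a single layer or a thinned-out stack turns out to be enough, the selection could look like the sketch below. It assumes that image numbers order the layers of one scan; `middle_slice` and `every_nth` are hypothetical helpers, and whether the middle image actually shows the tumor is an open question, not something established here.

```python
def middle_slice(img_nums):
    """Return the image number in the middle of the sorted stack."""
    ordered = sorted(img_nums)
    if not ordered:
        raise ValueError("empty stack")
    return ordered[len(ordered) // 2]


def every_nth(img_nums, interval):
    """Return every `interval`-th image number of the sorted stack."""
    return sorted(img_nums)[::interval]


toy_stack = range(1, 26)  # image numbers 1..25, made up for illustration
print(middle_slice(toy_stack))   # 13
print(every_nth(toy_stack, 10))  # [1, 11, 21]
```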
