# DAT 490 - Exploratory Data Analysis
*by Max Murphy*

## Import All External Packages
The libraries used in this notebook, grouped by category, are as follows:

In [None]:
# Python System Libraries
import os
import random
import re
import glob

# Libraries to Import Medical Data
import pydicom
from pydicom.pixel_data_handlers.util import apply_voi_lut

# Data Analysis Libraries
import numpy as np
import pandas as pd

# Data Visualization Libraries
import matplotlib.pyplot as plt

## Breaking Down the Key Components of the Given Data

There are three main components to the RSNA-MICCAI Dataset:
1. A list of labels with patient IDs that indicate whether or not the patient has the MGMT genetic marker
2. A file structure containing visual data delineated by patient ID
3. Lastly, the actual visual information

### Component 1: MGMT Values

In [None]:
train_label_df = pd.read_csv("../input/rsna-miccai-brain-tumor-radiogenomic-classification/train_labels.csv")
train_label_df.head()

To clarify the training labels: "BraTS21ID" is the patient identifier, and "MGMT_value" is 1 when the patient has the MGMT marker and 0 when the patient does not.

#### Key Point 1: Looking at the distribution of MGMT in our patient training set

In [None]:
label_replacement = {1: 'Yes', 0: 'No'}
plt.figure(figsize=(5,7))
ax = train_label_df["MGMT_value"].replace(label_replacement).value_counts().plot(kind="bar", rot=0, xlabel="Whether or not the patient has the MGMT genotype", ylabel="Patient Count")
ax.set_title("Training Data from RSNA-MICCAI Brain Tumor Radiogenomic Classification Dataset \n Patient Counts based on Genotype")
rects = ax.patches
# Make some labels.
labels = [f"{i}" for i in train_label_df["MGMT_value"].value_counts()]

for rect, label in zip(rects, labels):
    height = rect.get_height()
    ax.text(
        rect.get_x() + rect.get_width() / 2, height + 5, label, ha="center", va="bottom"
    )

In [None]:
train_label_df["MGMT_value"].replace(label_replacement).value_counts()

Here we can see a slight skew toward patients having the marker. This may simply be due to randomness in how the training sample was drawn for this dataset, so the distribution may shift when the training data is split into subsets.
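
To put a number on the skew, the class shares can be computed alongside the raw counts. The sketch below uses a small made-up list of labels (not the real `train_labels.csv`), so the printed shares are illustrative only; `class_shares` and `example_labels` are hypothetical names.

```python
from collections import Counter

# Hypothetical labels for illustration only; the real values live in train_labels.csv
example_labels = [1, 0, 1, 1, 0, 1, 0, 1]

def class_shares(labels):
    """Return the fraction of samples that fall in each class."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}

shares = class_shares(example_labels)
for label, share in sorted(shares.items()):
    print(f"MGMT_value={label}: {share:.1%}")
```

On the notebook's own dataframe, `train_label_df["MGMT_value"].value_counts(normalize=True)` should give the equivalent shares directly.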

### Component 2: Patient File Structure

In [None]:
n = 5
rootdir = "../input/rsna-miccai-brain-tumor-radiogenomic-classification/train"
for file in os.listdir(rootdir)[:n]:
    d = os.path.join(rootdir, file)
    if os.path.isdir(d):
        print(d)

This is the file structure of the train dataset. As we can see, each folder in the training folder is named after a "BraTS21ID". Each of these folders contains the 4 different brain scans of the patient with that ID.

#### Key Aspect 1 : How many patients are included in the training dataset?

Next let's figure out how many patients are included in our training data set. This can be found by counting the number of directories in the "train" folder of the dataset.

In [None]:
path = "../input/rsna-miccai-brain-tumor-radiogenomic-classification/train"
sum(os.path.isdir(os.path.join(path, i)) for i in os.listdir(path))

#### Key Aspect 2: Folder Structure for Each Patient

Another thing we should note is the internal structure of each of the patient folders for our future reference.

In [None]:
rootdir = "../input/rsna-miccai-brain-tumor-radiogenomic-classification/train/00000"
for file in os.listdir(rootdir):
    d = os.path.join(rootdir, file)
    if os.path.isdir(d):
        print(d)

Here we can see that each patient folder contains 4 sub folders, one for each of the 4 types of MRI scan discussed earlier in the literature review. Later sections of this document go into more detail on these scan types.
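
Since later steps assume all 4 sub folders exist for every patient, it may be worth checking that assumption before relying on it. The following is a minimal sketch: `missing_scan_folders` is a hypothetical helper, and the demo builds a throwaway directory tree rather than touching the real dataset.

```python
import os
import tempfile

EXPECTED_TYPES = ("FLAIR", "T1w", "T1wCE", "T2w")

def missing_scan_folders(train_dir, expected=EXPECTED_TYPES):
    """Map each patient folder to the expected sub folders it lacks."""
    missing = {}
    for patient in sorted(os.listdir(train_dir)):
        patient_dir = os.path.join(train_dir, patient)
        if not os.path.isdir(patient_dir):
            continue
        absent = [t for t in expected
                  if not os.path.isdir(os.path.join(patient_dir, t))]
        if absent:
            missing[patient] = absent
    return missing

# Build a tiny throwaway tree to exercise the function (hypothetical patients)
with tempfile.TemporaryDirectory() as demo_dir:
    for scan_type in EXPECTED_TYPES:
        os.makedirs(os.path.join(demo_dir, "00000", scan_type))
    os.makedirs(os.path.join(demo_dir, "00002", "FLAIR"))
    demo_result = missing_scan_folders(demo_dir)

print(demo_result)
```

An empty dictionary would mean every patient folder has the full set of scan types.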

### Component 3: Patient MRI Scans

Within the patient MRI scans there are important differences that I outlined in the literature review. I would like to briefly go through some of the key points of information about these scans.

#### Key Point 1 : Varying MRI Types

Within each patient MRI folder, we have 4 subfolders. These folders are named according to the identifier associated with which MRI scan type the radiologist used to obtain the images within the folder. Let's take a look at the visual differences of these scans first.

In [None]:
def load_dicom(path):
    dicom = pydicom.read_file(path)
    data = dicom.pixel_array
    data = data - np.min(data)
    if np.max(data) != 0:
        data = data / np.max(data)
    data = (data * 255).astype(np.uint8)
    return data

def visualize_sample(
    brats21id, 
    slice_i,
    mgmt_value,
    types=("FLAIR", "T1w", "T1wCE", "T2w")
):
    plt.figure(figsize=(16, 5))
    patient_path = os.path.join(
        "../input/rsna-miccai-brain-tumor-radiogenomic-classification/train/", 
        str(brats21id).zfill(5),
    )
    for i, t in enumerate(types, 1):
        t_paths = sorted(
            glob.glob(os.path.join(patient_path, t, "*")), 
            key=lambda x: int(x[:-4].split("-")[-1]),
        )
        data = load_dicom(t_paths[int(len(t_paths) * slice_i)])
        plt.subplot(1, 4, i)
        plt.imshow(data, cmap="gray")
        plt.title(f"{t}", fontsize=16)
        plt.axis("off")

    plt.suptitle(f"Training Data from RSNA-MICCAI Brain Tumor Radiogenomic Classification Dataset \n MGMT_value: {mgmt_value}", fontsize=16)
    plt.show()

for i in random.sample(range(train_label_df.shape[0]), 1):
    _brats21id = train_label_df.iloc[i]["BraTS21ID"]
    _mgmt_value = train_label_df.iloc[i]["MGMT_value"]
    visualize_sample(brats21id=_brats21id, mgmt_value=_mgmt_value, slice_i=0.5)

> Code Attribution: https://www.kaggle.com/ihelon/brain-tumor-eda-with-animations-and-modeling

Here we can see the 4 MRI types clearly. There are two T1-weighted scans (T1w and the contrast-enhanced T1wCE), a T2-weighted scan (T2w), and a FLAIR scan. You'll notice small differences in the level of contrast and brightness in different areas of the image. These differences come from the different imaging techniques used to acquire the scans.
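
One detail in `visualize_sample` is easy to miss: the slices are sorted by the number in the file name rather than alphabetically, which matters when picking the "middle" slice. The sketch below illustrates the difference with made-up file names that follow the `Image-<n>.dcm` pattern; `slice_number` is a hypothetical helper that mirrors the sort key used above.

```python
def slice_number(path):
    """Extract the integer from a name like 'Image-12.dcm'."""
    return int(path[:-4].split("-")[-1])

# Hypothetical file names following the Image-<n>.dcm pattern
names = ["Image-10.dcm", "Image-2.dcm", "Image-1.dcm", "Image-100.dcm"]

alphabetical = sorted(names)               # string order puts 10 and 100 before 2
numeric = sorted(names, key=slice_number)  # order by the slice number instead

print(alphabetical)
print(numeric)

# Picking the middle slice, as visualize_sample does with slice_i=0.5
middle = numeric[int(len(numeric) * 0.5)]
print(middle)
```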

**Important Aspect: Differing Image Counts**

Something that we should also check for is whether or not we have a standard number of images in each MRI type. We can do this by deriving some additional features in our data frame and using those features to answer our question. We need to derive some information about the folder names, the file paths, and then the number of images in each of the patient's subfolders to figure out if there are differing counts of MRI images across our patient base.

In [None]:
# Shorten the data frame name down because we are going to use it a lot.
train = train_label_df

# Add an image folder feature to the dataframe
train["image_folder_name"] = ['{0:05d}'.format(s) for s in train["BraTS21ID"]]

# Define where the training data is located
train_file_path = "../input/rsna-miccai-brain-tumor-radiogenomic-classification/train"

# Create a concatenated path in the dataframe for the image location of the data.
train["image_folder_path"] = [os.path.join(train_file_path, x) for x in train['image_folder_name']]

# Scan Types
mri_types = ["FLAIR", "T1w", "T1wCE", "T2w"]

# Find the number of files in each of the scan type folder
for mri_type in mri_types:
    train[mri_type + "_count"] = [len(os.listdir(os.path.join(train['image_folder_path'].iloc[s], mri_type))) for s in range(len(train))]
    
# Print the current training dataframe
train.head(5)

As you can see, my suspicions were correct: there are differing numbers of images between MRI types *and* differing numbers of images across patients within the same MRI type. Let's dive a bit deeper into how widely these vary, because we may need to adjust our methodology to account for these differences.
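
Before plotting, a quick numeric summary can show how wide the spread is. This is a minimal sketch with made-up counts (the real values are the `*_count` columns of `train`); `summarize` and `example_counts` are hypothetical names.

```python
import statistics

# Hypothetical per-patient image counts; the real ones are the *_count columns of `train`
example_counts = {
    "FLAIR": [400, 129, 129, 60],
    "T1w": [33, 31, 129, 192],
}

def summarize(counts):
    """Return the min, median, and max image count for each MRI type."""
    return {
        mri_type: {
            "min": min(values),
            "median": statistics.median(values),
            "max": max(values),
        }
        for mri_type, values in counts.items()
    }

summary = summarize(example_counts)
for mri_type, stats in summary.items():
    print(mri_type, stats)
```

On the real dataframe, `train[["FLAIR_count", "T1w_count", "T1wCE_count", "T2w_count"]].describe()` should report the same kind of statistics in one call.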

In [None]:
rename_dict = {"FLAIR_count": "FLAIR", "T1w_count": "T1w", "T1wCE_count": "T1wCE", "T2w_count": "T2w"}
# DataFrame.plot opens its own figure, so the size is passed to it directly
ax = train[['FLAIR_count', 'T1w_count', 'T1wCE_count', "T2w_count"]].rename(columns=rename_dict).plot(kind="box", figsize=(5,7))
ax.set_xlabel("MRI Imaging Type")
ax.set_ylabel("Image Count per Patient")
ax.set_title("Training Data from RSNA-MICCAI Brain Tumor Radiogenomic Classification Dataset \n Box Plot of Amount of Images of Varying Type in Patient MRI data")

#### Key Point 2 : MRI Imaging Planes

Another point to look at is the plane in which the images are taken. There are three planes, as described in the literature review: Axial (top to bottom), Coronal (front to back), and Sagittal (left to right). These planes appear in no consistent pattern across the MRI scan types for each patient, which may be an important reason why the counts for each scan type differ so radically. Luckily, the image orientation is included in the DICOM data, and we extract it from the files below.

![MRI Axial Type Diagram](https://my-ms.org/images/mri_planes_gnu.jpg)

> Image Retrieved from : [https://my-ms.org/images/mri_planes_gnu.jpg](https://my-ms.org/images/mri_planes_gnu.jpg)

In [None]:
def findImageOrientation(image):
    # Pull the image orientation data from the DiCOM file
    (x1,y1,_,x2,y2,_) = [round(value) for value in image.ImageOrientationPatient]
    if (x1,x2,y1,y2) == (1,0,0,0):
        return "Coronal"
    elif (x1,x2,y1,y2) == (1,0,0,1):
        return "Axial"
    elif (x1,x2,y1,y2) == (0,0,1,0):
        return "Sagittal"
    else:
        return "Unknown"

# Scan Types
mri_types = ["FLAIR", "T1w", "T1wCE", "T2w"]

for mri_type in mri_types:
    axes = []
    for folder in train['image_folder_path']:
        scan_dir = os.path.join(folder, mri_type)
        # Read only the first file listed; the rest of the folder is assumed to share its orientation
        first_file = os.path.join(scan_dir, os.listdir(scan_dir)[0])
        axes.append(findImageOrientation(pydicom.read_file(first_file)))
    train[mri_type + "_axis"] = axes

print(train[['FLAIR_axis', 'T1w_axis', 'T1wCE_axis', 'T2w_axis']].head())

> Code Inspiration: https://www.kaggle.com/davidbroberts/determining-mr-image-planes
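
The lookup in `findImageOrientation` matches rounded direction cosines against three fixed patterns. As a hedged alternative sketch (not the method used in this notebook), the plane can also be derived geometrically: the cross product of the row and column direction vectors gives the slice normal, and the patient axis the normal is most aligned with names the plane. `plane_from_orientation` is a hypothetical helper, and the input lists are hand-written examples rather than values read from the dataset.

```python
def plane_from_orientation(orientation):
    """Classify a 6-value ImageOrientationPatient list by its slice normal."""
    rx, ry, rz, cx, cy, cz = (float(v) for v in orientation)
    # Cross product of the row and column direction cosines gives the slice normal
    normal = (
        ry * cz - rz * cy,
        rz * cx - rx * cz,
        rx * cy - ry * cx,
    )
    # DICOM patient axes: x runs right-to-left, y front-to-back, z feet-to-head
    names = ("Sagittal", "Coronal", "Axial")
    dominant = max(range(3), key=lambda i: abs(normal[i]))
    return names[dominant]

print(plane_from_orientation([1, 0, 0, 0, 1, 0]))        # Axial
print(plane_from_orientation([1, 0, 0, 0, 0, -1]))       # Coronal
print(plane_from_orientation([0, 1, 0, 0, 0, -1]))       # Sagittal
print(plane_from_orientation([0.99, 0.1, 0, 0, 0, -1]))  # slightly tilted, still Coronal
```

This version never returns "Unknown", so a degenerate orientation (for example all zeros) would need a separate check.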

Now that we have the imaging axis extracted from the DICOM data, let's see the distribution of axes across the scan types. We are going to create bar charts to show this.

In [None]:
rename = {"FLAIR_axis": "FLAIR", "T1w_axis": "T1w", "T1wCE_axis": "T1wCE", "T2w_axis": "T2w"}
mri_types_df = train[[mri_type + '_axis' for mri_type in mri_types]].apply(pd.Series.value_counts).rename(columns=rename)
# Each patient contributes one axis value per scan type, so the bars count patients
mri_types_df.plot(kind="bar", rot=0, xlabel="MRI Scanning Axis", ylabel="Patient Count", title="Training Data from RSNA-MICCAI Brain Tumor Radiogenomic Classification Dataset \n Patient Count Grouped by MRI Scan Axis and Colored by MRI Type")

As you can see, some of the scan types vary widely across axes, with a few exceptions. The most notable is T1w, which was relatively consistently axial. The two imaging types with the most variation across axes were T2w and FLAIR. What's notable about this pair is that they have the same distribution of axes, which may be because their axes are correlated across patients.

#### Key Point 3 : Usable Images

Another point that I would like to note about the data is the number of unusable or uninformative images in each patient's set for a given imaging type. The sets typically start and end with large amounts of black space and peak with images that contain a "brilliant zone." These brilliant zones show the tumor really well and will likely be the most useful images to look at, because they show the subject of our analysis in the greatest detail.

So how do we actually accomplish this kind of analysis? We can use a tool from photography called an image histogram. A typical image histogram looks like the one below: it shows the distribution of pixel luminosity across all pixels in the image, separately for each color channel.

![Example of Image Histogram with 3 color channels](https://www.dummies.com/wp-content/uploads/332882.image1.jpg)

We can do the same using our existing data analysis tools. Conveniently, as you may have noticed, the images in this set have a single color channel, which makes them monochromatic and means our histogram needs only one channel. First let's look at the raw pixel data and work towards a meaningful analysis of the image.

In [None]:
example_file_path = "../input/rsna-miccai-brain-tumor-radiogenomic-classification/train/00000/FLAIR/Image-200.dcm"
print("Raw Pixel Array")
print(pydicom.read_file(example_file_path).pixel_array)
print("Max Pixel Value: " + str(pydicom.read_file(example_file_path).pixel_array.max()))

Here we can see the pixel values in the raw pixel array. Each sub-array represents a row of pixels in the image, and all rows have the same length. The higher the pixel value, the brighter the pixel. Now let's flatten these arrays and plot the pixel values on a histogram.
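
One subtlety when filtering out the zero pixels is that `np.nonzero` returns the positions of the non-zero entries, not their values. The sketch below shows the difference on a tiny made-up array standing in for a real `pixel_array`; any statistics on luminosity should be computed on the values.

```python
import numpy as np

# Hypothetical 3x3 "image"; real slices come from `pixel_array`
image = np.array([[0, 0, 5],
                  [0, 9, 0],
                  [2, 0, 0]])

# Flatten the rows into a single 1-D array
pixels = image.ravel()

# np.nonzero returns a tuple of index arrays, one per dimension
positions = np.nonzero(pixels)[0]

# Indexing with a boolean mask (or with those positions) gives the luminosity values
values = pixels[pixels != 0]

print(positions)  # where the non-zero pixels sit in the flattened array
print(values)     # how bright those pixels are
```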

In [None]:
def plot_image_histogram(image):
    ## Get the pixels all on to a single array
    pixels = np.array(image.pixel_array).ravel()
    
    ## Create a two sided figure
    fig, (ax_image, ax_hist) = plt.subplots(1, 2, figsize = (20,4), gridspec_kw={'width_ratios': [1, 4]})
    fig.suptitle(f'Training Data from RSNA-MICCAI Brain Tumor Radiogenomic Classification Dataset \n scan # ({image.InstanceNumber})')
    
    ## Begin working on the histogram portion of the graph
    ### Filter the zero pixels (greatly messes with the scale of the graph)
    non_zero_pixels = pixels[np.nonzero(pixels)]  # index with the result to get values, not positions
    ### Only normalize and plot when the slice has visual information
    if len(non_zero_pixels) != 0:
        ### Run basic statistics to normalize the pixel data (makes data presentable)
        mean = np.mean(non_zero_pixels)
        std = np.std(non_zero_pixels)
        ### Normalize the pixels
        norm_non_zero_pixels = (non_zero_pixels - mean)/std
        ax_hist.hist(norm_non_zero_pixels.flatten(), bins=200)
        ax_hist.set_xlim(-5,5)
        ax_hist.set_ylim(0, 350)
    
    ax_image.imshow(image.pixel_array, cmap = plt.cm.gray)
    ax_image.grid(False)
    ax_image.axis('off')
    
    plt.show()

sample_dir = "../input/rsna-miccai-brain-tumor-radiogenomic-classification/train/00049/FLAIR"
sample_files_list = os.listdir(sample_dir)[::40]
# Raw string so that \D is read as a regex class, not a string escape
sample_files_list.sort(key=lambda f: int(re.sub(r'\D', '', f)))
sample_files_list = [os.path.join(sample_dir, f) for f in sample_files_list]
    
for file in sample_files_list:
    plot_image_histogram(pydicom.read_file(file))

Here we can see some important information about the data. 
* First, there are entire images with no visual information whatsoever.
* Second, when an image has little visual information, the variance of the visual data is higher.
* Finally, among the images with plentiful visual data, those that showed the tumor well skewed slightly right.

As you can see, we are not done. We need more visual data features to make this data actionable. The question now is: which images show the tumor best?

We can figure this out by looking at what was done to the visual data to make the histogram. We normalized the pixel values so that every image fits on a histogram with a similar x-axis scale. With the data normalized, we can compare the images on a common footing. One helpful measure is the number of bright pixels above a common level. For example, in scan 156 we can see a large mass that is brighter than the image around it. That is the tumor! So now we can mathematically find the images that best capture the tumor.

First let's establish a level. I'll arbitrarily put this at 1.75x the standard deviation above the mean of our normalized luminosity and place it on our histogram.

In [None]:
file = "../input/rsna-miccai-brain-tumor-radiogenomic-classification/train/00049/FLAIR/Image-156.dcm"
def find_normal_level(mean, std):
    return mean + 1.75 * std
def plot_image_histogram_level(image):
    ## Get the pixels all on to a single array
    pixels = np.array(image.pixel_array).ravel()
    
    ## Create a two sided figure
    fig, (ax_image, ax_hist) = plt.subplots(1, 2, figsize = (20,4), gridspec_kw={'width_ratios': [1, 4]})
    fig.suptitle(f'Training Data from RSNA-MICCAI Brain Tumor Radiogenomic Classification Dataset \n scan # ({image.InstanceNumber})')
    
    ## Begin working on the histogram portion of the graph
    ### Filter the zero pixels (greatly messes with the scale of the graph)
    non_zero_pixels = pixels[np.nonzero(pixels)]  # index with the result to get values, not positions
    ## Graph histogram (only when the slice has visual information)
    if len(non_zero_pixels) != 0:
        ### Run basic statistics to normalize the pixel data (makes data presentable)
        mean = np.mean(non_zero_pixels)
        std = np.std(non_zero_pixels)
        ### Normalize the pixels
        norm_non_zero_pixels = (non_zero_pixels - mean)/std
        ### After normalization the mean is ~0 and the standard deviation ~1
        mean = np.mean(norm_non_zero_pixels)
        std = np.std(norm_non_zero_pixels)
        ax_hist.hist(norm_non_zero_pixels.flatten(), bins=200)
        ax_hist.set_xlim(-5,5)
        ax_hist.set_ylim(0, 350)
        ax_limits = ax_hist.get_ylim()
        ax_hist.vlines(find_normal_level(mean, std), ymin=ax_limits[0], ymax=ax_limits[1], colors='b', linestyles='dashed')
        
    ax_image.imshow(image.pixel_array, cmap = plt.cm.gray)
    ax_image.grid(False)
    ax_image.axis('off')
    
    plt.show()

sample_dir = "../input/rsna-miccai-brain-tumor-radiogenomic-classification/train/00049/FLAIR"
sample_files_list = os.listdir(sample_dir)[::40]
# Raw string so that \D is read as a regex class, not a string escape
sample_files_list.sort(key=lambda f: int(re.sub(r'\D', '', f)))
sample_files_list = [os.path.join(sample_dir, f) for f in sample_files_list]
    
for file in sample_files_list:
    plot_image_histogram_level(pydicom.read_file(file))

Focus your attention on the y-axis, which shows more clearly how much data exceeds the level. Because the y-axis values are so much larger in most of these images, we can tell which image shows a tumor by how many pixels have a luminosity above the common level. We can see in these graphs that this is the case.
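
The measure itself, the number of pixels brighter than the mean plus some multiple of the standard deviation, can be sketched on its own. `count_bright_pixels` is a hypothetical helper, its default `k=1.75` simply mirrors `find_normal_level`, and the demo array is made up rather than taken from a scan.

```python
import numpy as np

def count_bright_pixels(image, k=1.75):
    """Count pixels more than k standard deviations above the non-zero mean."""
    pixels = np.asarray(image, dtype=float).ravel()
    non_zero = pixels[pixels != 0]
    if non_zero.size == 0:
        return 0  # an empty slice has no bright pixels
    level = non_zero.mean() + k * non_zero.std()
    return int(np.count_nonzero(pixels > level))

# Hypothetical slice: mostly dim tissue with a small bright patch
demo = np.full((10, 10), 10.0)
demo[:2, :] = 0.0        # empty background rows
demo[4:6, 4:6] = 100.0   # bright 2x2 patch

bright = count_bright_pixels(demo)
print(bright)
```

Here only the 4 pixels of the bright patch clear the level, while the dim tissue and the background do not.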

Now let's try to **maximize** this measure and see if we get images that appropriately match our goal. 

In [None]:
def calc_brilliance(image):
    ## Get the pixels all on to a single array
    pixels = np.array(image.pixel_array).ravel()
    non_zero_pixels = pixels[np.nonzero(pixels)]
    ## Score slices that are empty or less than 10% non-zero as 0
    if len(non_zero_pixels) == 0 or len(non_zero_pixels) < (0.1 * len(pixels)):
        return 0
    ### Compute the level from the non-zero pixels, in raw pixel units
    mean = np.mean(non_zero_pixels)
    std = np.std(non_zero_pixels)
    ### Count the pixels brighter than the level
    return np.count_nonzero(pixels > find_normal_level(mean, std))

def top_brilliant_image(images):
    brilliancelist = [calc_brilliance(image) for image in images]
    pd.set_option("display.max_rows", None)
    print(pd.DataFrame(brilliancelist))
    print(np.argsort(brilliancelist)[::-1][:10])
    top_image = np.argsort(brilliancelist)[::-1][:10]
    return top_image

sample_files = []

for root, dirs, files in os.walk(os.path.abspath("../input/rsna-miccai-brain-tumor-radiogenomic-classification/train/00049/FLAIR")):
    for file in files:
        sample_files.append(os.path.join(root, file))

top_brilliant_ids = top_brilliant_image([pydicom.read_file(img) for img in sample_files])

for top_brilliant_id in top_brilliant_ids:
    # The ids are positions in sample_files, not the numbers in the file names
    path_to_top_brilliant_image = sample_files[top_brilliant_id]
    plot_image_histogram_level(pydicom.read_file(path_to_top_brilliant_image))

Clearly this needs more refining, but the principle more or less holds. Among the 10 images with the most pixels more than 1.75 standard deviations above the mean luminosity, we found images with tumors over 50% of the time. Considering that there are 296 images in this folder alone, this is a great start for cleaning the data for our future model. A refined version of this heuristic may help cull the number of images we have to model: we can simply remove images of little or no importance, which will help make the model more efficient and effective.
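
For the culling step, the ranking has to be mapped back to files. `np.argsort` returns positions in the input list, so those positions need to be looked up in the list of paths that produced the scores. The sketch below uses made-up scores and file names, and `top_files` is a hypothetical helper.

```python
import numpy as np

# Hypothetical scores and file names, listed in the same (arbitrary) order
scores = [12, 340, 0, 95]
files = ["Image-7.dcm", "Image-156.dcm", "Image-1.dcm", "Image-80.dcm"]

def top_files(scores, files, n=2):
    """Return the n file names with the highest scores, best first."""
    order = np.argsort(scores)[::-1][:n]  # positions in the list, not image numbers
    return [files[i] for i in order]

best = top_files(scores, files)
print(best)
```

Keeping the scores and the paths in one shared order (or in one dataframe) avoids having to rebuild file names from an index.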