# Embolism Detection EDA

This is an EDA (exploratory data analysis) for the newly launched RSNA **Embolism Detection** competition on Kaggle that we're going to be working on today. A pulmonary **embolism** occurs when an artery in your lung is blocked, usually by a blood clot, cutting off blood flow to part of the lung so that it can no longer exchange oxygen properly. It is one of the most fatal cardiovascular diseases in the United States of America (an estimated 60,000 to 100,000 deaths per annum).

In [None]:
import os, glob
import numpy as np, pandas as pd, pydicom as dcm
import keras
import tensorflow as tf
import matplotlib.pyplot as plt, seaborn as sns

# Label tables: one row per image, with exam-level labels repeated on each row
tr = pd.read_csv("../input/rsna-str-pulmonary-embolism-detection/train.csv")
te = pd.read_csv("../input/rsna-str-pulmonary-embolism-detection/test.csv")
TRAIN_PATH = "../input/rsna-str-pulmonary-embolism-detection/train/"
# Folder layout is train/<study>/<series>/<image>.dcm
files = glob.glob(TRAIN_PATH + '*/*/*.dcm')

def dicom_to_image(filename):
    """Read one DICOM file and return its pixel array with padding zeroed."""
    im = dcm.dcmread(filename)
    img = im.pixel_array
    # -2000 is a padding value some scanners write outside the scan bounds
    img[img == -2000] = 0
    return img

# Starting the EDA

First of all, let's do a small preliminary check of our data and look at how many studies we have (each patient folder holds the CT slices for one study). Kaggle says this dataset is almost 1 TB, so training on the full dataset would be quite the conundrum.

In [None]:
print('Total patients {}'.format(len(os.listdir(TRAIN_PATH))))
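
The cell above only counts the top-level folders. For the slices-per-patient check, here is a minimal sketch, assuming the `train/<study>/<series>/*.dcm` layout implied by the glob pattern used for `files`. The helper name `count_slices_per_study` is hypothetical and not part of the original notebook.

```python
import os
from collections import Counter


def count_slices_per_study(train_path):
    """Return a Counter mapping each study folder name to its number of .dcm files."""
    counts = Counter()
    for study in sorted(os.listdir(train_path)):
        study_dir = os.path.join(train_path, study)
        if not os.path.isdir(study_dir):
            continue
        # Walk every series folder under the study and count DICOM slices.
        for _root, _dirs, filenames in os.walk(study_dir):
            counts[study] += sum(name.endswith(".dcm") for name in filenames)
    return counts
```

Calling `count_slices_per_study(TRAIN_PATH)` and passing the result to `pd.Series(...).describe()` should give a quick summary of how slice counts vary between studies.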

So it seems like we have an abundance of patients and an (over?)abundance of training data that we can use to feed our EfficientNets. Let's take a brief look at one of our DICOM files.

In [None]:
plt.axis('off')
plt.imshow(-dcm.dcmread("../input/rsna-str-pulmonary-embolism-detection/train/0003b3d648eb/d2b2960c2bbf/00ac73cfc372.dcm").pixel_array);

Note that the colors in the image above have been inverted by me for maximum visibility. This data format seems close to the images in the Data Science Bowl 2017 and the recent OSIC competition, and not too close to other lung-based competitions (the SIIM-ACR Pneumothorax Competition used larger images and was a segmentation task).

# Meta-exploration

In [None]:
tr.head()

This competition is a little (or a lot, depending on your point of view) odd because of its submission format and its weighted log loss metric. In essence, our submission.csv file should have $n_{img} + (n_{studies} \times n_{labels})$ rows, where $n_{img}$ is the number of images, $n_{studies}$ the number of studies, and $n_{labels}$ the number of exam-level labels: one prediction per image, plus one prediction per exam-level label for each study. The metric is weighted because some labels matter more than others, and the organizers give the most weight to "Central PE". Now what is this central PE, you might ask? The field descriptions below cover it.
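
As a rough sanity check on the row count, the formula can be written out directly. This is a sketch, not the official scoring code: the helper name `expected_submission_rows` is hypothetical, and the default of 9 exam-level labels is my assumption from the label columns in `train.csv`, so verify it against the competition's evaluation page.

```python
def expected_submission_rows(n_img, n_studies, n_labels=9):
    """Rows in submission.csv: one per image plus one per (study, exam-level label) pair.

    n_labels=9 is an assumption, not a value stated in this notebook.
    """
    return n_img + n_studies * n_labels


# Toy numbers for illustration only: 3 studies with 10 images in total.
print(expected_submission_rows(n_img=10, n_studies=3))  # 10 + 3 * 9 = 37
```

On the real data this would be called as `expected_submission_rows(len(te), te["StudyInstanceUID"].nunique())`.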


These are the principal data fields given by Kaggle on the competition's data page:
```
+ StudyInstanceUID - unique ID for each study (exam) in the data.
+ SeriesInstanceUID - unique ID for each series within the study.
+ SOPInstanceUID - unique ID for each image within the study (and data).
+ pe_present_on_image - image-level, notes whether any form of PE is present on the image.
+ negative_exam_for_pe - exam-level, whether there are any images in the study that have PE present.
+ qa_motion - informational, indicates whether radiologists noted an issue with motion in the study.
+ qa_contrast - informational, indicates whether radiologists noted an issue with contrast in the study.
+ flow_artifact - informational
+ rv_lv_ratio_gte_1 - exam-level, indicates whether the RV/LV ratio present in the study is >= 1
+ rv_lv_ratio_lt_1 - exam-level, indicates whether the RV/LV ratio present in the study is < 1
+ leftsided_pe - exam-level, indicates that there is PE present on the left side of the images in the study
+ chronic_pe - exam-level, indicates that the PE in the study is chronic
+ true_filling_defect_not_pe - informational, indicates a defect that is NOT PE
+ rightsided_pe - exam-level, indicates that there is PE present on the right side of the images in the study
+ acute_and_chronic_pe - exam-level, indicates that the PE present in the study is both acute AND chronic
+ central_pe - exam-level, indicates that there is PE present in the center of the images in the study
+ indeterminate - exam-level, indicates that while the study is not negative for PE, an ultimate set of exam-level labels could not be created, due to QA issues
```

From the fields above, we can see that some defects in a lung are not a pulmonary embolism (`true_filling_defect_not_pe`) and are perhaps the effects of another pulmonary disease.
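
One way to see how common each label is would be to collapse the image-level table to one row per study and take the mean of each binary column. A minimal sketch, assuming the label columns are 0/1 and constant within a study (as the exam-level descriptions suggest). `exam_label_prevalence` is a hypothetical helper, and the toy frame stands in for `tr`.

```python
import pandas as pd


def exam_label_prevalence(df, label_cols, study_col="StudyInstanceUID"):
    """Fraction of studies in which each 0/1 label is set."""
    # Exam-level labels repeat on every image row, so keep one row per study.
    per_study = df.groupby(study_col)[label_cols].max()
    return per_study.mean().sort_values(ascending=False)


# Toy data standing in for train.csv: two studies, three image rows.
toy = pd.DataFrame({
    "StudyInstanceUID": ["a", "a", "b"],
    "central_pe": [1, 1, 0],
    "true_filling_defect_not_pe": [0, 0, 0],
})
prevalence = exam_label_prevalence(toy, ["central_pe", "true_filling_defect_not_pe"])
print(prevalence)
```

Swapping `toy` for `tr` and passing the exam-level column names from the list above should give the per-study prevalence of each label.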

# Image-based analysis

Time to get what all those who came here were looking for - the images.

In [None]:
f, plots = plt.subplots(4, 5, sharex='col', sharey='row', figsize=(10, 8))
for i in range(20):
    plots[i // 5, i % 5].axis('off')
    plots[i // 5, i % 5].imshow(dicom_to_image(np.random.choice(files[:1000])), cmap=plt.cm.bone) # random sample from the first 1k files

The images all look roughly similar to those of the Data Science Bowl and the OSIC competition, which is helpful because we can apply existing preprocessing techniques (see "Full Preprocessing Tutorial" and "Pulmonary Fibrosis Preprocessing").
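
A step commonly used in CT preprocessing is converting raw pixel values to Hounsfield units (HU) and clipping them to a display window. Below is a minimal sketch, assuming the DICOM headers carry `RescaleSlope` and `RescaleIntercept` (standard for CT, but not verified here for this dataset). The window centre and width are illustrative placeholders rather than tuned values, and `to_hounsfield` and `apply_window` are hypothetical helper names.

```python
import numpy as np


def to_hounsfield(pixels, slope=1.0, intercept=0.0):
    """Apply the linear DICOM rescale: HU = pixel * slope + intercept."""
    return pixels.astype(np.float32) * slope + intercept


def apply_window(hu, center, width):
    """Clip HU values to [center - width/2, center + width/2] and scale to [0, 1]."""
    low, high = center - width / 2.0, center + width / 2.0
    return (np.clip(hu, low, high) - low) / (high - low)


# Toy 2x2 "slice"; with a real file, pixels would come from dcm.dcmread(...).pixel_array
# and slope/intercept from the RescaleSlope / RescaleIntercept header fields.
raw = np.array([[0, 1024], [2048, 4000]], dtype=np.int16)
hu = to_hounsfield(raw, slope=1.0, intercept=-1024.0)
windowed = apply_window(hu, center=0.0, width=2048.0)
```

Scaling to `[0, 1]` keeps the output ready for `plt.imshow` or for feeding into a network, and the window can be changed without touching the HU conversion.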