# * Pulmonary Embolism Detection - EDA [Beginner-Friendly] *
![Pulmonary embolism illustration](https://a360-rtmagazine.s3.amazonaws.com/wp-content/uploads/2019/10/lung-pulmo-embolism-1500-1200x799.jpg)

**If you like this kernel, an up-vote would be appreciated**

# Introduction

This is an exploratory data analysis (EDA) for the recently launched RSNA-STR Pulmonary Embolism Detection competition on Kaggle. A pulmonary embolism (PE) occurs when an artery in the lung is blocked, usually by a blood clot, which cuts off blood flow to part of the lung and impairs its ability to oxygenate the blood. It is one of the deadliest cardiovascular conditions in the United States of America (an estimated 60,000 to 100,000 deaths per year).

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import glob
import pydicom as dcm

# 1- Loading the dataset

In [None]:
train = pd.read_csv("../input/rsna-str-pulmonary-embolism-detection/train.csv")
train.head()

# 2- Exploring columns

In [None]:
train.dtypes

| Field Name | Feature Type | Description |
| :--- | :--- | :---- |
| **StudyInstanceUID** | UID | unique ID for each study (exam) in the data. |
| **SeriesInstanceUID** | UID | unique ID for each series within the study. |
| **SOPInstanceUID** | UID |  unique ID for each image within the study (and data) |
| **pe_present_on_image**|  image-level | notes whether any form of PE is present on the image.|
| **negative_exam_for_pe**|  exam-level | whether the exam is negative for PE, i.e. 1 when no image in the study has PE present.|
| **qa_motion** |  informational | indicates whether radiologists noted an issue with motion in the study. | 
|**qa_contrast** |  informational | indicates whether radiologists noted an issue with contrast in the study.|
| **flow_artifact** | informational | ---|
| **rv_lv_ratio_gte_1** | exam-level| indicates whether the RV/LV ratio present in the study is >= 1|
| **rv_lv_ratio_lt_1** | exam-level| indicates whether the RV/LV ratio present in the study is < 1|
| **leftsided_pe** | exam-level | indicates that there is PE present on the left side of the images in the study| 
| **chronic_pe**  | exam-level | indicates that the PE in the study is chronic|
| **true_filling_defect_not_pe** | informational | indicates a defect that is NOT PE|
| **rightsided_pe** | exam-level | indicates that there is PE present on the right side of the images in the study|
| **acute_and_chronic_pe** | exam-level| indicates that the PE present in the study is both acute AND chronic|
| **central_pe** | exam-level| indicates that there is PE present in the center of the images in the study|
| **indeterminate**  | exam-level| indicates that while the study is not negative for PE, an ultimate set of exam-level labels could not be created, due to QA issues|
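
The three UID columns in the table describe a hierarchy (study, then series, then image). As a minimal sketch, the helper below counts the distinct IDs at each level and checks that every row is a distinct image. The function name `summarize_ids` and the toy frame are made up for illustration; on the real data you would pass `train` instead.

```python
import pandas as pd

def summarize_ids(df: pd.DataFrame) -> dict:
    """Count distinct studies, series and images, and check the ID hierarchy."""
    series_per_study = df.groupby("StudyInstanceUID")["SeriesInstanceUID"].nunique()
    return {
        "n_rows": len(df),
        "n_studies": df["StudyInstanceUID"].nunique(),
        "n_series": df["SeriesInstanceUID"].nunique(),
        "n_images": df["SOPInstanceUID"].nunique(),
        # True when every row corresponds to a distinct image
        "one_row_per_image": bool(df["SOPInstanceUID"].is_unique),
        # Largest number of series found inside a single study
        "max_series_per_study": int(series_per_study.max()),
    }

# Tiny made-up frame (not real competition data): 2 studies, 5 images
toy = pd.DataFrame({
    "StudyInstanceUID": ["s1", "s1", "s1", "s2", "s2"],
    "SeriesInstanceUID": ["a", "a", "a", "b", "b"],
    "SOPInstanceUID": ["i1", "i2", "i3", "i4", "i5"],
})
summary = summarize_ids(toy)
print(summary)
```

Calling `summarize_ids(train)` should give a quick sanity check of the table above before any modelling.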


**So, What are we going to predict?**

> Every study / exam has a row for each label that is scored (detailed in the Data page). It is uniquely indicated by the StudyInstanceUID. Every image, further, has a row for the PE Present on Image label and is uniquely indicated by the SOPInstanceUID. Your prediction file should have a number of rows equal to: (number of images) + (number of studies * number of scored labels).

In other words:
- For each image, we're going to predict the column "pe_present_on_image"
- For each "StudyInstanceUID", we're going to predict:

1. negative_exam_for_pe
2. rv_lv_ratio_gte_1
3. rv_lv_ratio_lt_1
4. chronic_pe
5. indeterminate
6. acute_and_chronic_pe
7. rightsided_pe
8. leftsided_pe
9. central_pe
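
The row-count formula quoted above, (number of images) + (number of studies * number of scored labels), can be sketched as a small helper. The function name `expected_submission_rows` and the toy frame are hypothetical, and the default of nine scored labels is taken from the list above rather than from an official file.

```python
import pandas as pd

def expected_submission_rows(df: pd.DataFrame, n_scored_labels: int = 9) -> int:
    """Rows expected in a prediction file: one per image plus one per (study, label)."""
    n_images = df["SOPInstanceUID"].nunique()
    n_studies = df["StudyInstanceUID"].nunique()
    return n_images + n_studies * n_scored_labels

# Tiny made-up test set (not real competition data): 2 studies, 5 images
toy_test = pd.DataFrame({
    "StudyInstanceUID": ["s1", "s1", "s1", "s2", "s2"],
    "SOPInstanceUID": ["i1", "i2", "i3", "i4", "i5"],
})
n_rows = expected_submission_rows(toy_test)
print(n_rows)  # 5 images + 2 studies * 9 labels = 23
```

Comparing this number with the length of your own prediction file is a cheap way to catch a missing study or label before submitting.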

# 3- Data visualization

In [None]:
train_cols = ['pe_present_on_image', 'negative_exam_for_pe', 'qa_motion',
       'qa_contrast', 'flow_artifact', 'rv_lv_ratio_gte_1', 'rv_lv_ratio_lt_1',
       'leftsided_pe', 'chronic_pe', 'true_filling_defect_not_pe',
       'rightsided_pe', 'acute_and_chronic_pe', 'central_pe', 'indeterminate']

def plot_grid(cols=train_cols, n_columns=3):
    """Bar plot of the 0/1 counts for every column in `cols`."""
    n_rows = int(np.ceil(len(cols) / n_columns))
    fig = plt.figure(figsize=(12, 12))
    for i, col in enumerate(cols, start=1):
        # Sort by label so the bars and the caption always match (0 first, then 1)
        counts = train[col].value_counts().sort_index()
        ax = fig.add_subplot(n_rows, n_columns, i)
        counts.plot(kind="bar", ax=ax)
        # .get avoids a KeyError if a column happens to contain a single value
        count_0 = counts.get(0, 0)
        count_1 = counts.get(1, 0)
        ax.set_xlabel(f"{col}\n 0: {count_0}\n 1: {count_1}")
    plt.tight_layout()
    plt.show()

plot_grid()

We can note the following observations:

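- At the image level, only 5.4% of the images show PE (`pe_present_on_image` = 1)
- At the exam level, 32.4% of the rows belong to exams that are not negative for PE (`negative_exam_for_pe` = 0). Even in such an exam, most slices do not show PE, so the image-level label is much more imbalanced than the exam-level one
- There are almost no issues related to motion (0.8%) or contrast (1.6%) in the studies
- The RV/LV columns do not mirror each other: `rv_lv_ratio_gte_1` and `rv_lv_ratio_lt_1` are not simple complements, so both can be 0 for the same exam
- Most PE cases are right-sided or left-sided. There are only a few central cases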
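
The 5.4% and 32.4% figures in the list above are counted over image rows. A minimal sketch of how one might separate the image-level rate from the exam-level rate is shown below. The helper name `label_rates` and the toy frame are made up for illustration, and the sketch assumes that exam-level labels are repeated on every row of a study. It also treats every exam with `negative_exam_for_pe == 0` as positive, which lumps indeterminate exams in with the positive ones.

```python
import pandas as pd

def label_rates(df, image_col="pe_present_on_image", exam_col="negative_exam_for_pe"):
    """Return (image-level PE rate, exam-level non-negative rate, PE rate within non-negative exams)."""
    # Share of image rows on which PE is visible
    image_rate = df[image_col].mean()
    # One flag per study (assumes the exam-level label is constant within a study)
    exam_flags = df.groupby("StudyInstanceUID")[exam_col].first()
    exam_positive_rate = 1 - exam_flags.mean()
    # Among images of non-negative exams, the share that actually shows PE
    within_positive_rate = df.loc[df[exam_col] == 0, image_col].mean()
    return image_rate, exam_positive_rate, within_positive_rate

# Tiny made-up frame (not real competition data)
toy_labels = pd.DataFrame({
    "StudyInstanceUID":     ["s1", "s1", "s1", "s1", "s2", "s2"],
    "pe_present_on_image":  [0, 1, 0, 0, 0, 0],
    "negative_exam_for_pe": [0, 0, 0, 0, 1, 1],
})
image_rate, exam_positive_rate, within_positive_rate = label_rates(toy_labels)
print(image_rate, exam_positive_rate, within_positive_rate)
```

On the toy frame, one image in six shows PE, one exam in two is non-negative, and one image in four shows PE inside the non-negative exam. Running the same helper on `train` would show how strongly the two levels differ in the real data.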

![](https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F115173%2Fa2a5ee66b5799274141dd547cc3ea466%2FPE%20figure.jpg?generation=1599575183749576&alt=media)

Credit for this diagram goes to @redwankarimsony and his amazing notebook: [RSNA-STR Pulmonary Embolism](https://www.kaggle.com/redwankarimsony/rsna-str-pulmonary-embolism-eda)

**Correlation matrix between the label columns**

In [None]:
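# Pairwise Pearson correlation between the label columns
corr_mat = train[train_cols].corr()
# Hide the upper triangle (and the diagonal), which only repeats the lower one
mask = np.triu(np.ones_like(corr_mat, dtype=bool))
f, ax = plt.subplots(figsize=(14, 12))
sns.heatmap(corr_mat, mask = mask, annot = True, vmax = 0.3, square = False, linewidths = 0.5, center = 0, ax = ax)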

# 4- Image visualization and interpretation

In [None]:
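def read_dicom(file_path, show = False, cmap = 'gray'):
    """Read a DICOM file and return its pixel data as a NumPy array."""
    im = dcm.dcmread(file_path)
    image = im.pixel_array
    if show:
        # Use the colormap passed by the caller instead of a hard-coded one
        plt.imshow(image, cmap = cmap)
    return image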

In [None]:
TRAIN_DIR = '../input/rsna-str-pulmonary-embolism-detection/train/'

def show_images_with_specific_condition(column_name, rows = 2, cols = 5):
    """Show a random grid of images with column_name == 0, then with column_name == 1."""
    for value in (0, 1):
        train_with_condition = train[train[column_name] == value]
        # Never ask for more samples than there are matching rows
        n_samples = min(rows*cols, len(train_with_condition))
        train_with_condition = train_with_condition.sample(n = n_samples)
        # DICOM files are stored as <study>/<series>/<image>.dcm
        train_image_file_paths = [
            TRAIN_DIR + str(entry['StudyInstanceUID']) + '/' + str(entry['SeriesInstanceUID']) + '/' + str(entry['SOPInstanceUID']) + '.dcm'
            for _, entry in train_with_condition.iterrows()
        ]
        fig = plt.figure(figsize=(25,15))
        fig.suptitle('Samples with ' + column_name + ' = ' + str(value), fontsize=40)
        for counter, path in enumerate(train_image_file_paths, start = 1):
            fig.add_subplot(rows, cols, counter)
            plt.imshow(read_dicom(path), cmap='gray')
            plt.axis('off')

**4-1. Images with qa_motion**

In [None]:
show_images_with_specific_condition('qa_motion')

**4-2. Images with qa_contrast**

In [None]:
show_images_with_specific_condition('qa_contrast')

**4-3. Images with indeterminate**

In [None]:
show_images_with_specific_condition('indeterminate')

**4-4. Images with rightsided_pe**

In [None]:
show_images_with_specific_condition('rightsided_pe')

**4-5. Images with leftsided_pe**

In [None]:
show_images_with_specific_condition('leftsided_pe')

**4-6. Images with central_pe**

In [None]:
show_images_with_specific_condition('central_pe')

# **[KERNEL UNDER CONSTRUCTION]**