## What is pulmonary embolism?
Pulmonary embolism occurs when a clump of material, most often a blood clot, gets wedged into an artery in your lungs. These blood clots most commonly come from the deep veins of your legs, a condition known as deep vein thrombosis (DVT).

In [None]:
from IPython.display import HTML
HTML('<center><iframe width="560" height="315" src="https://www.youtube.com/embed/8UnPPZlnfbk" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe></center>')

## What is RSNA STR Pulmonary Embolism Detection Competition?
- In this competition you have to classify Pulmonary Embolism cases in chest CT scans.
- This competition is inference-only, meaning that your submitted kernels will not have access to the training set.
- Also note that the private test set is approximately 3x larger than the public test set (230GB vs. 70GB), so ensure that your kernels have enough time to finish their re-run. The training set includes 7279 studies, the public test set 650, and the private test set 1517.
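
The size figures above can be turned into a rough runtime budget. The sketch below is only an illustration: the helper `estimate_rerun_minutes` is hypothetical, and the idea that runtime scales linearly with data size is an assumption, not something the competition states.

```python
# Figures quoted in the bullet above.
public_gb, private_gb = 70, 230
public_studies, private_studies = 650, 1517

size_ratio = private_gb / public_gb              # ratio by disk size
study_ratio = private_studies / public_studies   # ratio by number of studies


def estimate_rerun_minutes(public_minutes, ratio):
    """Scale a measured public-set runtime linearly (an assumption)."""
    return public_minutes * ratio


print(f"size ratio: {size_ratio:.2f}, study ratio: {study_ratio:.2f}")
print(f"60 min on public -> ~{estimate_rerun_minutes(60, size_ratio):.0f} min on private")
```

The two ratios differ (about 3.3x by size, about 2.3x by study count), so budgeting with the larger one is the more cautious choice.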

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import pydicom as dicom
import matplotlib.pyplot as plt
from os import listdir,mkdir
import plotly.express as px
import seaborn as sns
import os

In [None]:
basepath = "../input/rsna-str-pulmonary-embolism-detection/"
listdir(basepath)

## Read train and test data

In [None]:
# Reuse basepath so every cell reads from the same input directory
train = pd.read_csv(basepath + "train.csv")
test = pd.read_csv(basepath + "test.csv")
sub = pd.read_csv(basepath + "sample_submission.csv")

In [None]:
print("Training Data Size")
train.shape

In [None]:
print("Test Data Size")
test.shape

In [None]:
train.head(10)

In [None]:
train.tail(10)

In [None]:
test.head(10)

In [None]:
test.tail(10)

In [None]:
sub.head(10)

In [None]:
sub.tail(10)

In [None]:
train.info()

In [None]:
train.describe()

### Let's understand the training dataset

- StudyInstanceUID: unique ID for each study (exam) in the data.
- SeriesInstanceUID: unique ID for each series within the study.
- SOPInstanceUID: unique ID for each image within the study (and data).
- pe_present_on_image: image-level, notes whether any form of PE (pulmonary embolism) is present on the image. Possible values: 0 or 1. Zero means there is no sign of PE on the image; one means PE is present on the image.
- negative_exam_for_pe: exam-level, indicates whether the exam is negative for PE, that is, whether no image in the study has PE present. Possible values: 0 or 1. One means the exam is negative (no image in the study shows PE); zero means the exam is not negative for PE.
- qa_motion: informational, indicates whether radiologists noted an issue with motion in the study. Informational fields do not need to be predicted.
- qa_contrast: informational, indicates whether radiologists noted an issue with contrast in the study.
- flow_artifact: informational

About RV/LV: Assessment of right ventricular strain using computed tomography (CT) in patients with pulmonary embolism (PE) has often relied on the ratio between the diameter of the right and left ventricles (RV/LV) in axial slices. The RV/LV ratio, measured in this manner, may be an unreliable marker of strain, due in part to inconsistencies in where measurements are taken and the complex three-dimensional nature of the RV.

- rv_lv_ratio_gte_1: exam-level, indicates whether the RV/LV ratio in the study is >= 1. Possible values: 0 or 1. One means RV/LV >= 1; zero otherwise.
- rv_lv_ratio_lt_1: exam-level, indicates whether the RV/LV ratio in the study is < 1. Possible values: 0 or 1. One means RV/LV < 1; zero otherwise.
- leftsided_pe: exam-level, indicates whether PE is present on the left side of the images in the study. Possible values: 0 or 1. One means PE is present on the left side; zero otherwise.
- chronic_pe: exam-level, indicates whether the PE in the study is chronic. Possible values: 0 or 1. One means the PE is chronic; zero otherwise.
- true_filling_defect_not_pe: informational, indicates a filling defect that is NOT PE. Possible values: 0 or 1. One means there is a defect, but it is not PE.
- rightsided_pe: exam-level, indicates whether PE is present on the right side of the images in the study. Possible values: 0 or 1. One means PE is present on the right side; zero otherwise.
- acute_and_chronic_pe: exam-level, indicates whether the PE in the study is both acute AND chronic. Possible values: 0 or 1. One means the PE is both acute and chronic; zero otherwise.
- central_pe: exam-level, indicates whether PE is present in the center of the images in the study. Possible values: 0 or 1. One means PE is present in the center; zero otherwise.
- indeterminate: exam-level, indicates that while the study is not negative for PE, an ultimate set of exam-level labels could not be created due to QA issues. Possible values: 0 or 1. One means QA issues (motion or contrast) in the CT images prevented the radiologist from determining whether PE is present.
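
To see how the image-level and exam-level columns relate, the rows of a study can be collapsed into one summary row. The sketch below runs on a made-up toy frame rather than the real `train.csv`. The helper name `exam_summary` is hypothetical. The check assumes that a study flagged `negative_exam_for_pe == 1` should contain no image with `pe_present_on_image == 1`.

```python
import pandas as pd

# Toy rows that mimic a few train.csv columns; the values are made up.
toy = pd.DataFrame({
    "StudyInstanceUID": ["a", "a", "a", "b", "b"],
    "pe_present_on_image": [0, 1, 1, 0, 0],
    "negative_exam_for_pe": [0, 0, 0, 1, 1],
})


def exam_summary(df):
    """Collapse image-level rows to one row per study."""
    grouped = df.groupby("StudyInstanceUID")
    return pd.DataFrame({
        "n_images": grouped.size(),
        "n_pe_images": grouped["pe_present_on_image"].sum(),
        "negative_exam_for_pe": grouped["negative_exam_for_pe"].max(),
    })


summary = exam_summary(toy)
# A study flagged negative should have no PE-positive images.
summary["consistent"] = ~(
    (summary["negative_exam_for_pe"] == 1) & (summary["n_pe_images"] > 0)
)
print(summary)
```

Running `exam_summary(train)` on the real data would give the same per-study view, at the cost of a group-by over every image row.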

In [None]:
test.info()

### Let's understand the test dataset
In the test dataset, the unique IDs for each study, series, and individual image are given, and you have to predict the probability of each of the labels below:
- Negative for PE
- Indeterminate
- Chronic
- Acute & Chronic
- Central PE
- Left PE
- Right PE
- RV/LV Ratio >= 1
- RV/LV Ratio < 1
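
For convenience, the display names above can be tied to the column names described in the training set. The pairing below is inferred from the names alone and should be treated as an assumption; `LABEL_TO_COLUMN` and `exam_level_targets` are hypothetical names introduced for this sketch.

```python
# Display name -> train.csv column name (pairing inferred from the names).
LABEL_TO_COLUMN = {
    "Negative for PE": "negative_exam_for_pe",
    "Indeterminate": "indeterminate",
    "Chronic": "chronic_pe",
    "Acute & Chronic": "acute_and_chronic_pe",
    "Central PE": "central_pe",
    "Left PE": "leftsided_pe",
    "Right PE": "rightsided_pe",
    "RV/LV Ratio >= 1": "rv_lv_ratio_gte_1",
    "RV/LV Ratio < 1": "rv_lv_ratio_lt_1",
}

# Column list that can be used to select the exam-level targets.
exam_level_targets = list(LABEL_TO_COLUMN.values())
print(exam_level_targets)
```

With this list, `train[exam_level_targets]` selects just the exam-level target columns.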


In [None]:
sub.info()

In [None]:
print('Check missing values in train data')
train.isnull().sum()

In [None]:
print('Check missing values in test data')
test.isnull().sum()

**So there are no missing values in the train or test data sets.**

**Count the non-zero values for each column in the training set**

In [None]:
x = train.drop(['StudyInstanceUID', 'SeriesInstanceUID', 'SOPInstanceUID'], axis=1).sum(axis=0).sort_values().reset_index()
x.columns = ['column', 'nonzero_records']

fig = px.bar(
    x, 
    x='nonzero_records', 
    y='column', 
    orientation='h', 
    title='Columns and non zero samples', 
    height=800, 
    width=800
)

fig.show()


## DICOM processing and example

In [None]:
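def load_scan(path):
    # Read every DICOM file in the series directory
    # (dcmread replaces the deprecated read_file)
    slices = [dicom.dcmread(os.path.join(path, s)) for s in os.listdir(path)]
    # Order the slices along the z axis
    slices.sort(key=lambda x: float(x.ImagePositionPatient[2]))
    try:
        slice_thickness = np.abs(slices[0].ImagePositionPatient[2] - slices[1].ImagePositionPatient[2])
    except AttributeError:
        # Fall back to SliceLocation when ImagePositionPatient is unavailable
        slice_thickness = np.abs(slices[0].SliceLocation - slices[1].SliceLocation)

    # Store the computed spacing on every slice
    for s in slices:
        s.SliceThickness = slice_thickness

    return slices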
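
The ordering step in `load_scan` can be tried without any DICOM files. This sketch uses `SimpleNamespace` objects as stand-ins for DICOM datasets, with made-up z positions. Only the `ImagePositionPatient` attribute is mocked, so this illustrates the sort key and the spacing calculation, not pydicom itself.

```python
from types import SimpleNamespace

import numpy as np

# Stand-ins for DICOM datasets; the z positions are made up.
fake_slices = [
    SimpleNamespace(ImagePositionPatient=[0.0, 0.0, z])
    for z in (-10.0, -12.5, -7.5, -15.0)
]

# Same sort key as load_scan: the z component of ImagePositionPatient.
fake_slices.sort(key=lambda s: float(s.ImagePositionPatient[2]))
z_positions = [s.ImagePositionPatient[2] for s in fake_slices]

# Distance between the first two sorted slices, as load_scan computes it.
spacing = np.abs(z_positions[0] - z_positions[1])
print(z_positions, spacing)
```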


In [None]:
def set_lungwin(img, hu=(-1200., 600.)):  # tuple avoids a mutable default argument
    lungwin = np.array(hu)
    newimg = (img-lungwin[0]) / (lungwin[1]-lungwin[0])
    newimg[newimg < 0] = 0
    newimg[newimg > 1] = 1
    newimg = (newimg * 255).astype('uint8')
    return newimg
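
A small worked example makes the windowing concrete. `window_to_uint8` is a hypothetical restatement of `set_lungwin` using `np.clip`, with the same default window of -1200 to 600 HU. The input values are made up.

```python
import numpy as np


def window_to_uint8(img, lo=-1200.0, hi=600.0):
    """Clip HU values to [lo, hi] and rescale linearly to 0..255."""
    scaled = (np.asarray(img, dtype=np.float64) - lo) / (hi - lo)
    return (np.clip(scaled, 0.0, 1.0) * 255).astype("uint8")


hu_values = np.array([-2000, -1200, -300, 600, 3000])
windowed = window_to_uint8(hu_values)
print(windowed)  # [  0   0 127 255 255]
```

Values below the window collapse to 0, values above it collapse to 255, and the midpoint of the window (-300 HU) lands at 127.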


In [None]:
def get_pixels_hu(slices):
    image = np.stack([s.pixel_array for s in slices])
    # Convert to int16 (from sometimes uint16),
    # should be possible as values should always be low enough (<32k)
    image = image.astype(np.int16)

    # Set outside-of-scan pixels to 0
    # The intercept is usually -1024, so air is approximately 0
    image[image == -2000] = 0
    
    # Convert to Hounsfield units (HU)
    for slice_number in range(len(slices)):
        
        intercept = slices[slice_number].RescaleIntercept
        slope = slices[slice_number].RescaleSlope
        
        if slope != 1:
            image[slice_number] = slope * image[slice_number].astype(np.float64)
            image[slice_number] = image[slice_number].astype(np.int16)
            
        image[slice_number] += np.int16(intercept)
    
    return np.array(image, dtype=np.int16)
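
The conversion in `get_pixels_hu` is the linear rescale HU = slope * stored_value + intercept. The sketch below applies it to a made-up 2x2 array. The slope of 1 and intercept of -1024 are assumed values, chosen to match the "usually -1024" comment in the function.

```python
import numpy as np

# Made-up stored pixel values, not taken from a real scan.
raw = np.array([[0, 1024], [1524, 2048]], dtype=np.int16)
slope, intercept = 1.0, -1024.0  # assumed rescale values

# HU = slope * stored_value + intercept, cast back to int16.
hu = (slope * raw.astype(np.float64) + intercept).astype(np.int16)
print(hu)
```

With these values a stored 0 maps to -1024 HU and a stored 1024 maps to 0 HU.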

In [None]:
scans = load_scan('../input/rsna-str-pulmonary-embolism-detection/train/0003b3d648eb/d2b2960c2bbf/')
scan_array = set_lungwin(get_pixels_hu(scans))

In [None]:
import matplotlib.animation as animation

fig = plt.figure()

ims = []
for image in scan_array:
    im = plt.imshow(image, animated=True, cmap="Greys")
    plt.axis("off")
    ims.append([im])

ani = animation.ArtistAnimation(fig, ims, interval=100, blit=False,
                                repeat_delay=1000)
HTML(ani.to_jshtml())


In [None]:
def transform_to_hu(slices):
    images = np.stack([file.pixel_array for file in slices])
    images = images.astype(np.int16)

    # convert outside pixel values to air:
    # I'm using <= -1000 to be sure that other defaults are captured as well
    images[images <= -1000] = 0
    
    # convert to HU
    for n in range(len(slices)):
        
        intercept = slices[n].RescaleIntercept
        slope = slices[n].RescaleSlope
        
        if slope != 1:
            images[n] = slope * images[n].astype(np.float64)
            images[n] = images[n].astype(np.int16)
            
        images[n] += np.int16(intercept)
    
    return np.array(images, dtype=np.int16)

def load_slice(path):
    # dcmread replaces the deprecated read_file
    slices = [dicom.dcmread(os.path.join(path, s)) for s in listdir(path)]
    slices.sort(key = lambda x: float(x.ImagePositionPatient[2]))
    try:
        slice_thickness = np.abs(slices[0].ImagePositionPatient[2] - slices[1].ImagePositionPatient[2])
    except AttributeError:
        # Fall back to SliceLocation when ImagePositionPatient is unavailable
        slice_thickness = np.abs(slices[0].SliceLocation - slices[1].SliceLocation)
        
    for s in slices:
        s.SliceThickness = slice_thickness
        
    return slices


In [None]:
first_patient = load_slice('../input/rsna-str-pulmonary-embolism-detection/train/0003b3d648eb/d2b2960c2bbf')
first_patient_pixels = transform_to_hu(first_patient)

def sample_stack(stack, rows=6, cols=6, start_with=0, show_every=5):
    fig, ax = plt.subplots(rows, cols, figsize=[18, 20])
    for i in range(rows * cols):
        ind = start_with + i * show_every
        # Row-major placement; dividing by cols keeps this correct when rows != cols
        r, c = divmod(i, cols)
        ax[r, c].set_title(f'slice {ind}')
        ax[r, c].imshow(stack[ind], cmap='bone')
        ax[r, c].axis('off')
    plt.show()

sample_stack(first_patient_pixels)

### Light EDA

In [None]:
cols = train.copy()
cols.drop(['StudyInstanceUID','SeriesInstanceUID','SOPInstanceUID'],axis=1,inplace=True)


In [None]:
corr = cols.corr()
mask = np.zeros_like(corr, dtype=bool)  # boolean mask: True hides a cell
mask[np.triu_indices_from(mask)] = True
with sns.axes_style("white"):
    f, ax = plt.subplots(figsize=(12, 12))
    ax = sns.heatmap(corr,mask=mask,square=True,linewidths=.8,cmap="summer",annot=True)


## Acknowledgement

- [Pulmonary Embolism Dicom preprocessing & EDA](https://www.kaggle.com/nitindatta/pulmonary-embolism-dicom-preprocessing-eda)
- [Pulmonary Embolism Detection EDA](https://www.kaggle.com/isaienkov/pulmonary-embolism-detection-eda)
- [Pulmonary embolism: The route to recovery](https://www.youtube.com/watch?v=8UnPPZlnfbk&ab_channel=naturevideo)
