# Hey There, Welcome!

Please familiarize yourself with the table of contents for easy navigation. <br>

### Table of contents
* [Understanding Pulmonary Embolism](#upe)
* [About The Competition](#atc)
* [Data](#data)
* [EDA](#eda)
* [DICOM Data Handling](#dicom)
* [Image Data Handling](#image)
* [TODO And Future Improvements](#todo)

# Understanding Pulmonary Embolism <a class="anchor" id="upe"></a>

**Pulmonary embolism (PE)** is a blockage of an artery in the lungs by a substance that has moved from elsewhere in the body through the bloodstream (embolism). It is a condition in which one or more arteries in the lungs become blocked by a blood clot. Most often, a pulmonary embolism is caused by blood clots that travel from the deep veins of the legs (deep vein thrombosis, or DVT) or, rarely, from other parts of the body.

<center> 
    <table><tr><td><img src="https://i.imgur.com/wmFMDEe.png"></td><td><img src="https://i.imgur.com/TvZLM2v.png"></td></tr></table>
    </center>
<br>
<br>

***SYMPTOMS:*** <br>
**Pain circumstances:** can occur in the chest while breathing <br>
**Whole body:** light-headedness or low oxygen in the body <br>
**Heart:** fast heart rate or palpitations <br>
**Respiratory:** fast breathing or shortness of breath <br>
**Also common:** dry cough <br>

***TREATMENT:*** <br>
*Treatment consists of blood thinners:* Prompt treatment to break up the clot greatly reduces the risk of death. This can be done with blood thinners and drugs or procedures. Compression stockings and physical activity can help prevent clots from forming in the first place.

<center> 
    <table><tr><td><img src="https://i.imgur.com/E3UR0Mt.png" ></td><td><img src="https://i.imgur.com/aV5WMyj.png"></td></tr></table>
    </center>
<br>
<br>

***How It Occurs:*** <br>
Pulmonary embolism is caused by a blocked artery in the lungs. The most common cause of such a blockage is a blood clot that forms in a deep vein in the leg and travels to the lungs, where it gets lodged in a smaller lung artery. Almost all blood clots that cause pulmonary embolism are formed in the deep leg veins. <br>

***How It Kills:*** <br>
Without treatment, venous thromboembolism (VTE, the umbrella term covering DVT and PE) can restrict or block blood flow and oxygen, which can damage the body's tissue or organs. This can be especially serious in the case of a pulmonary embolism, which blocks blood flow to the lungs. If a blood clot is large or there are many clots, a pulmonary embolism can cause death. <br>

**Mortality Rate:** 1 in 10 Patients <br>
**Recurrence Rate:** 1 in 5 Recoveries <br>

Watch this short video to get a better idea about this disease:

In [None]:
from IPython.display import HTML

HTML('<center><iframe width="560" height="315" src="https://www.youtube.com/embed/1IBvrOBQ268" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe></center>')

## About The Competition <a class="anchor" id="atc"></a>
Currently, CT pulmonary angiography (CTPA) is the most common type of medical imaging used to evaluate patients with suspected PE. These CT scans consist of hundreds of images that require detailed review to identify clots within the pulmonary arteries. As the use of imaging continues to grow, constraints on radiologists’ time may contribute to delayed diagnosis. <br>
<br>
The Radiological Society of North America (RSNA®) has teamed up with the Society of Thoracic Radiology (STR) to help improve the use of machine learning in the diagnosis of PE. <br>
<br>
In this competition, you’ll detect and classify PE cases. In particular, you'll use chest CTPA images (grouped together as studies) and your data science skills to enable more accurate identification of PE. If successful, you'll help reduce human delays and errors in detection and treatment. <br>
<br>
With 60,000-100,000 PE deaths annually in the United States, it is among the most fatal cardiovascular diseases. Timely and accurate diagnosis will help these patients receive better care and may also improve outcomes.

In [None]:
!conda install -c conda-forge gdcm -y
# pip install has no -y flag (only pip uninstall does), so it is dropped here
!pip install pandas-profiling

In [None]:
# Setup
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
from matplotlib import animation, rc
import seaborn as sns

sns.set_style('darkgrid')
import pydicom as dcm
import scipy.ndimage
import gdcm
import glob
import imageio
from IPython import display

from skimage import measure 
from mpl_toolkits.mplot3d.art3d import Poly3DCollection
from skimage.morphology import disk, opening, closing
from tqdm import tqdm

from IPython.display import HTML
from PIL import Image

In [None]:
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)
warnings.filterwarnings("ignore", category=UserWarning)
warnings.filterwarnings("ignore", category=FutureWarning)

from os import listdir, mkdir

path = "../input/rsna-str-pulmonary-embolism-detection/"
# path already ends with '/', so no leading slash is needed here
files = glob.glob(path + 'train/*/*/*.dcm')

## Let's See How Our Data Looks <a class="anchor" id="data"></a>

* *StudyInstanceUID* - unique ID for each study (exam) in the data. <br>
* *SeriesInstanceUID* - unique ID for each series within the study. <br>
* *SOPInstanceUID* - unique ID for each image within the study (and data). <br>
* *pe_present_on_image* - image-level, notes whether any form of PE is present on the image. <br>
* *negative_exam_for_pe* - exam-level, 1 when no image in the study has PE present (the exam is negative for PE). <br>
* *qa_motion* - informational, indicates whether radiologists noted an issue with motion in the study. <br>
* *qa_contrast* - informational, indicates whether radiologists noted an issue with contrast in the study. <br>
* *flow_artifact* - informational <br>
* *rv_lv_ratio_gte_1* - exam-level, indicates whether the RV/LV ratio present in the study is >= 1 <br>
* *rv_lv_ratio_lt_1* - exam-level, indicates whether the RV/LV ratio present in the study is < 1 <br>
* *leftsided_pe* - exam-level, indicates that there is PE present on the left side of the images in the study <br>
* *chronic_pe* - exam-level, indicates that the PE in the study is chronic <br>
* *true_filling_defect_not_pe* - informational, indicates a defect that is NOT PE <br>
* *rightsided_pe* - exam-level, indicates that there is PE present on the right side of the images in the study <br>
* *acute_and_chronic_pe* - exam-level, indicates that the PE present in the study is both acute AND chronic <br>
* *central_pe* - exam-level, indicates that there is PE present in the center of the images in the study <br>
* *indeterminate* - exam-level, indicates that while the study is not negative for PE, an ultimate set of exam-level labels could not be created, due to QA issues <br>
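
Because image-level and exam-level labels sit side by side in the same table, a quick per-study aggregation is a handy sanity check. The sketch below uses a tiny made-up DataFrame (not the real `train.csv`) and assumes that an exam flagged as negative should contain no image with `pe_present_on_image == 1`; the `conflict` column is a hypothetical helper name.

```python
import pandas as pd

# Toy stand-in for train.csv: two studies, three images each (values are made up).
toy = pd.DataFrame({
    "StudyInstanceUID": ["s1", "s1", "s1", "s2", "s2", "s2"],
    "pe_present_on_image": [0, 1, 1, 0, 0, 0],
    "negative_exam_for_pe": [0, 0, 0, 1, 1, 1],
})

# Collapse image-level rows into one row per study.
per_study = toy.groupby("StudyInstanceUID").agg(
    n_images=("pe_present_on_image", "size"),
    n_pe_images=("pe_present_on_image", "sum"),
    negative_exam=("negative_exam_for_pe", "max"),
)

# Assumption: a study flagged negative should not contain any PE-positive image.
per_study["conflict"] = (per_study["negative_exam"] == 1) & (per_study["n_pe_images"] > 0)
print(per_study)
```

The same `groupby` could be run on the real training table to count images per study or to spot label inconsistencies before modelling.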


In [None]:
# Reading Data
train = pd.read_csv(path + "train.csv")
test = pd.read_csv(path + "test.csv")
print("Train Data Shape:",train.shape)
print("Test Data Shape:",test.shape)

In [None]:
train.head(5).T

In [None]:
train.info()

In [None]:
test.info()

In [None]:
train.describe()

In [None]:
test.describe()

In [None]:
print('Missing values in train data:',train.isnull().sum().sum())
print('Missing values in test data:',test.isnull().sum().sum())

In [None]:
from pandas_profiling import ProfileReport
profile = ProfileReport(train, title='Training Data Report')

## EDA <a class="anchor" id="eda"></a>

In [None]:
profile.to_notebook_iframe()

In [None]:
cols = train.copy()
cols.drop(['StudyInstanceUID','SeriesInstanceUID','SOPInstanceUID'],axis=1,inplace=True)
columns = cols.columns

corr = cols.corr()
mask = np.zeros_like(corr)
mask[np.triu_indices_from(mask)] = True
with sns.axes_style("white"):
    f, ax = plt.subplots(figsize=(12, 12))
    ax = sns.heatmap(corr,mask=mask,square=True,linewidths=.8,cmap="gnuplot",annot=True)

In [None]:
fig, ax = plt.subplots(7,2,figsize=(16,28))
for i,col in enumerate(columns): 
    plt.subplot(7,2,i+1)
    # recent seaborn versions require the data to be passed by keyword
    sns.countplot(x=cols[col], palette="gnuplot")

## What is a DICOM? <a class="anchor" id="dicom"></a>

**D**igital **I**maging and **Co**mmunications in **M**edicine (DICOM) is an international standard for the exchange, storage, and communication of digital medical images. Prior to this format, there was no standardized way to transfer medical scans, so loading a single patient's study outside the hospital in older formats could take about 10-30 minutes per scan!

DICOM images are typically 16-bit (with values ranging from -32768 to 32767), whereas ordinary 8-bit greyscale images store values from 0 to 255. This wider value range is useful because it correlates with the [Hounsfield Scale](https://en.wikipedia.org/wiki/Hounsfield_scale), so each voxel can store a large amount of information.
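
To show such 16-bit values on an ordinary 8-bit display, a range of Hounsfield values is usually mapped ("windowed") onto 0-255. The snippet below is a minimal sketch of that idea; `window_to_uint8` is a hypothetical helper, and the `center` and `width` values are illustrative only, not a recommended clinical window.

```python
import numpy as np

def window_to_uint8(hu, center, width):
    """Clip HU values to [center - width/2, center + width/2] and rescale to 0..255."""
    low = center - width / 2.0
    high = center + width / 2.0
    clipped = np.clip(hu.astype(np.float32), low, high)
    scaled = (clipped - low) / (high - low) * 255.0
    return scaled.round().astype(np.uint8)

# A few sample HU values: air, water, and progressively denser material.
hu_values = np.array([-1000, 0, 100, 300, 1000], dtype=np.int16)
display_values = window_to_uint8(hu_values, center=100, width=700)
print(display_values)
```

Everything below the window becomes black (0), everything above it becomes white (255), and the available grey levels are spent on the range in between.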

Read more about it here: https://www.dicomstandard.org/

![ct](https://media.giphy.com/media/2kWaWmhoFfFkI/giphy.gif)

Now let's look at the DICOM data of a random sample from our dataset.

In [None]:
from random import randint
# This function reads dicom images from the given path
def load_dicom(path):
    files = listdir(path)
    f = [dcm.dcmread(path + "/" + str(file)) for file in files]
    return f

# randint is inclusive at both ends, so subtract 1 to stay within bounds
random_integer = randint(0, len(train) - 1)
example = path + "train/" + train.StudyInstanceUID.values[random_integer] +'/'+ train.SeriesInstanceUID.values[random_integer]
scans = load_dicom(example)
scans[randint(0, len(scans) - 1)]

## Looking at the Image Data <a class="anchor" id="image"></a>

In [None]:
f, plots = plt.subplots(3, 3, sharex='col', sharey='row', figsize=(15, 15))
for i in range(9):
    plots[i // 3, i % 3].axis('off')
    plots[i // 3, i % 3].imshow(dcm.dcmread(np.random.choice(files[:3000])).pixel_array,cmap='gist_earth_r')

### A closer look...

In [None]:
fig, ax = plt.subplots(figsize=(10, 10))
ax.imshow(dcm.dcmread(np.random.choice(files[:3000])).pixel_array, cmap="gist_earth_r")

## For a Single Person...

credit: https://www.kaggle.com/nitindatta/pulmonary-embolism-dicom-preprocessing-eda

In [None]:
def load_slice(path):
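    # dcmread replaces the deprecated read_file
    slices = [dcm.dcmread(path + '/' + s) for s in listdir(path)]
    # order slices along the z-axis
    slices.sort(key = lambda x: float(x.ImagePositionPatient[2]))
    try:
        slice_thickness = np.abs(slices[0].ImagePositionPatient[2] - slices[1].ImagePositionPatient[2])
    except AttributeError:
        # fall back to SliceLocation when ImagePositionPatient is unavailable
        slice_thickness = np.abs(slices[0].SliceLocation - slices[1].SliceLocation)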
        
    for s in slices:
        s.SliceThickness = slice_thickness
        
    return slices

def transform_to_hu(slices):
    images = np.stack([file.pixel_array for file in slices])
    images = images.astype(np.int16)

    # convert outside pixel-values to air:
    # I'm using <= -1000 to be sure that other defaults are captured as well
    images[images <= -1000] = 0
    
    # convert to HU
    for n in range(len(slices)):
        
        intercept = slices[n].RescaleIntercept
        slope = slices[n].RescaleSlope
        
        if slope != 1:
            images[n] = slope * images[n].astype(np.float64)
            images[n] = images[n].astype(np.int16)
            
        images[n] += np.int16(intercept)
    
    return np.array(images, dtype=np.int16)

In [None]:
first_patient = load_slice(path + 'train/0003b3d648eb/d2b2960c2bbf')
first_patient_pixels = transform_to_hu(first_patient)

fig, plots = plt.subplots(8, 10, sharex='col', sharey='row', figsize=(20, 16))
for i in range(80):
    plots[i // 10, i % 10].axis('off')
    plots[i // 10, i % 10].imshow(first_patient_pixels[i], cmap="gray") 

In [None]:
imageio.mimsave("/tmp/gif.gif", first_patient_pixels, duration=0.08)
display.Image(filename="/tmp/gif.gif", format='png')

## What is CT (Computed Tomography) and How Does It Work?
<center>
    <img src = "https://i.imgur.com/0wZOIKt.jpg" width = "80%">
</center>
<br>
**Computed tomography (CT)** scanning, also known as computerized axial tomography (CAT) scanning, especially in older literature and textbooks, is a diagnostic imaging procedure that uses x-rays to build cross-sectional images ("slices") of the body. Cross-sections are reconstructed from measurements of the attenuation coefficients of x-ray beams in the volume of the object studied.

CT is based on the fundamental principle that the density of the tissue passed by the x-ray beam can be measured from the calculation of the attenuation coefficient. Using this principle, CT allows the reconstruction of the density of the body, by two-dimensional section perpendicular to the axis of the acquisition system.

The CT x-ray tube (typically with energy levels between 20 and 150 keV) emits N monochromatic photons per unit of time. The emitted x-rays form a beam that passes through a layer of biological material of thickness Δx. A detector placed at the exit of the sample measures N + ΔN photons, where ΔN is negative. The attenuation values of the x-ray beam are recorded, and the data are used to build a 3D representation of the scanned object or tissue.

There are two main absorption processes: **the photoelectric effect and the Compton effect**. Their combined effect is represented by a single attenuation coefficient, μ.

In the particular case of CT, the x-ray emitter rotates around the patient, and the detector, placed on the diametrically opposite side, picks up the image of a body section (beam and detector move in synchrony).

Unlike x-ray radiography, the detectors of the CT scanner do not produce an image. They measure the transmission of a thin beam (1-10 mm) of x-rays through a full scan of the body. Each section is measured from different angles, which makes it possible to recover depth information (the third dimension).

To obtain tomographic images of the patient from the raw scan data, the computer uses complex mathematical algorithms for image reconstruction.

If the x-ray at the exit of the tube is made monochromatic or quasi-monochromatic with the proper filter, one can calculate the attenuation coefficient corresponding to the volume of irradiated tissue by the application of the general formula of absorption of the x-rays in the field (see Figure 1).

The outgoing intensity I(x) of the beam of photons measured will depend on the location. In fact, I(x) is smaller where the body is more radiopaque.
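
The relationship between attenuation and outgoing intensity can be sketched with the Beer-Lambert law, I = I0 · exp(-Σ μᵢ Δxᵢ), which is assumed here to be the "general formula of absorption" referred to above. The function name and the μ values below are illustrative round numbers, not measured tissue coefficients.

```python
import numpy as np

def transmitted_intensity(i0, mu, dx):
    """Beer-Lambert law for one ray crossing layers with coefficients mu (1/cm) and thicknesses dx (cm)."""
    mu = np.asarray(mu, dtype=float)
    dx = np.asarray(dx, dtype=float)
    return i0 * np.exp(-np.sum(mu * dx))

i0 = 1.0e6  # photons entering the body (arbitrary illustrative number)

# Same path length, but the second ray crosses a more radiopaque middle layer.
soft = transmitted_intensity(i0, mu=[0.2, 0.2, 0.2], dx=[1.0, 1.0, 1.0])
dense = transmitted_intensity(i0, mu=[0.2, 0.5, 0.2], dx=[1.0, 1.0, 1.0])

# Taking -ln(I / I0) recovers the sum of mu * dx along the ray.
line_integral = -np.log(dense / i0)
print(soft, dense, line_integral)
```

The ray that crosses the denser layer exits with fewer photons, and the logarithm of the measurement gives back the summed attenuation along its path, which is the quantity reconstruction algorithms work with.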

**Hounsfield** chose a scale that assigns the following values to the four basic densities:

    air = -1000
    fat = -60 to -120
    water = 0
    compact bone = +1000

The image of the section of the object irradiated by the x-ray is reconstructed from a large number of measurements of the attenuation coefficient. It gathers together all the data coming from the elementary volumes of material through the detectors. The computer then displays the elementary surfaces of the reconstructed image, with the tone of each depending on its attenuation coefficient.

The image produced by the CT scanner is a digital image and consists of a square matrix of elements (pixels), each of which represents a voxel (volume element) of the patient's tissue.

In conclusion, a measurement made by a CT detector is proportional to the sum of the attenuation coefficients along the beam path.

The typical CT image is composed of **512 rows, each of 512 pixels**, i.e., a square matrix of **512 x 512 = 262,144 pixels** (one for each voxel). During image processing, the value of the attenuation coefficient for the voxel corresponding to each of these pixels needs to be calculated.

With simple back projection, each image point is surrounded by a star-shaped halo that degrades the contrast and blurs the boundary of the object. To avoid this, the method of filtered back projection is used. The filter introduces negative values into each projection; when the filtered projections are projected backwards, these negative values cancel the halo, and the resulting image is an accurate representation of the original object.
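
The effect of the filter can be seen on a single projection. The NumPy-only sketch below applies a ramp filter (|f| in the frequency domain, a common choice assumed here since the text does not name a specific filter) to the idealized projection of one point. It is not a full reconstruction; it only shows where the negative values come from.

```python
import numpy as np

n = 64
projection = np.zeros(n)
projection[n // 2] = 1.0  # idealized projection of a single point

# Ramp filter: multiply the spectrum by |f|, then return to the spatial domain.
ramp = np.abs(np.fft.fftfreq(n))
filtered = np.real(np.fft.ifft(np.fft.fft(projection) * ramp))

# Values around the peak: a positive centre flanked by negative lobes.
print(filtered[n // 2 - 1 : n // 2 + 2])
```

When many such filtered projections are smeared back across the image from different angles, the negative lobes offset the overlapping tails that would otherwise form the halo.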

The CT scan deals with the attenuation of the x-rays during the passage through the body segment. However, several features distinguish it from conventional radiology: the image is reconstructed from a large number of measurements of attenuation coefficient.

Before the data are presented on the screen, they are conventionally rescaled into CT numbers, expressed in Hounsfield Units (HU), as mentioned before. CT numbers originate from measurements with the EMI scanner invented by Sir Godfrey Hounsfield, who received a Nobel Prize for his work in 1979. They relate the linear attenuation coefficient of a localized region to the attenuation coefficient of water, and a multiplication factor of 1000 is used so that CT numbers are integers.
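
As a rough illustration of this rescaling, the sketch below uses the common simplified definition HU = 1000 · (μ − μ_water) / μ_water, which treats the attenuation of air as zero. The function name and the value chosen for `mu_water` are illustrative assumptions.

```python
def to_hounsfield(mu, mu_water):
    """Convert a linear attenuation coefficient to a CT number (simplified definition)."""
    return 1000.0 * (mu - mu_water) / mu_water

mu_water = 0.19  # illustrative value in 1/cm, not a calibrated constant

hu_air = to_hounsfield(0.0, mu_water)             # no attenuation at all
hu_water = to_hounsfield(mu_water, mu_water)      # the reference material
hu_dense = to_hounsfield(2 * mu_water, mu_water)  # twice the attenuation of water
print(hu_air, hu_water, hu_dense)
```

With this definition, water sits at 0 and air at about -1000, matching the scale listed earlier, and a material attenuating twice as strongly as water lands near +1000.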

So, the signal transmitted by the detector is processed by the computer as digital information, and the result is the reconstructed CT image.


In [None]:
HTML('<center><iframe width="560" height="315" src="https://www.youtube.com/embed/SdYUniRMtz4" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe></center>')

# ⚠️Work In Progress⚠️ <a class="anchor" id="todo"></a>

## TODO:
1. <del>What are CT scans and how they work?</del>
2. Understanding competition evaluation metric.
3. Potential ways to approach this competition.
4. Models and Techniques that can be used.

All this and more is on the way so stay tuned! Happy Kaggling!

[NOTE] I will be releasing a starter notebook with pytorch code soon.