Competition Overview:

> If every breath is strained and painful, it could be a serious and potentially life-threatening condition. A pulmonary embolism (PE) is caused by an artery blockage in the lung. It is time consuming to confirm a PE and prone to overdiagnosis. Machine learning could help to more accurately identify PE cases, which would make management and treatment more effective for patients.
Currently, CT pulmonary angiography (CTPA), is the most common type of medical imaging to evaluate patients with suspected PE. These CT scans consist of hundreds of images that require detailed review to identify clots within the pulmonary arteries. As the use of imaging continues to grow, constraints of radiologists’ time may contribute to delayed diagnosis.
The Radiological Society of North America (RSNA®) has teamed up with the Society of Thoracic Radiology (STR) to help improve the use of machine learning in the diagnosis of PE.
In this competition, you’ll detect and classify PE cases. In particular, you'll use chest CTPA images (grouped together as studies) and your data science skills to enable more accurate identification of PE. If successful, you'll help reduce human delays and errors in detection and treatment.
With 60,000-100,000 PE deaths annually in the United States, it is among the most fatal cardiovascular diseases. Timely and accurate diagnosis will help these patients receive better care and may also improve outcomes.


![Image](https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F603584%2F9a3aac7e7ac865f134201cc2a5cd52f3%2Fkaggle_header3.png?generation=1599585319459400&alt=media)

> Let's now explore the data! :)

## Importing essential Libraries

In [None]:
import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set(style = "darkgrid")
import pydicom as dcm
import matplotlib.cm as cm
import gc

## Reading and Understanding the files

In [None]:
df_train = pd.read_csv("../input/rsna-str-pulmonary-embolism-detection/train.csv")
df_test = pd.read_csv("../input/rsna-str-pulmonary-embolism-detection/test.csv")
df_sub = pd.read_csv("../input/rsna-str-pulmonary-embolism-detection/sample_submission.csv")

> Let's now talk a little bit about the train.csv file.

In [None]:
df_train.head().T

> Shape of train data

In [None]:
df_train.shape

> It seems 3 features are of object dtype and all the others are int. Let's see the cardinality of the remaining features.

In [None]:
train_cols = ['pe_present_on_image', 'negative_exam_for_pe', 'qa_motion',
       'qa_contrast', 'flow_artifact', 'rv_lv_ratio_gte_1', 'rv_lv_ratio_lt_1',
       'leftsided_pe', 'chronic_pe', 'true_filling_defect_not_pe',
       'rightsided_pe', 'acute_and_chronic_pe', 'central_pe', 'indeterminate']

In [None]:
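# Count the columns that take exactly two distinct values.
count = 0
for col in train_cols:
    if df_train[col].nunique() == 2:
        count += 1
if count == len(train_cols):
    print("All the features other than the UIDs are binary features.")
else:
    print(f"Only {count} of {len(train_cols)} features are binary.")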

> Okay, so now we know that all of the remaining features in the training file are binary.

> Let's now plot the two categories of all features side by side and see the count of values.

In [None]:
def plot_grid(cols = train_cols):
    columns = 2
    rows = (len(cols) + columns - 1) // columns  # enough rows for every feature
    fig = plt.figure(figsize=(8, 22))
    for i, col in enumerate(cols, start=1):
        fig.add_subplot(rows, columns, i)
        counts = df_train[col].value_counts()  # compute once, reuse below
        counts.plot(kind = "bar", color = "Purple", alpha = 0.4)
        count_0 = counts[0]
        count_1 = counts[1]
        ratio = count_0/count_1
        plt.xlabel(f"Feature name: {col}\n Count 0: {count_0}\n Count 1: {count_1}\n Ratio(0:1): {ratio:.1f}:1")
    plt.tight_layout()
    plt.show()

plot_grid()

> With this plot, we can see that the classes are far from balanced across these features.
> For many of them the count of "0" is much higher than the count of "1", and for some it is the other way around.
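> The imbalance seen in the plots can also be summarised as a single number per feature. Below is a minimal sketch on a made-up toy frame (the `toy` DataFrame and its values are assumptions for illustration, not the competition data); on the real data the same idea would apply to `df_train[train_cols]`.

```python
import pandas as pd

# Toy stand-in for df_train; the values below are made up for illustration.
toy = pd.DataFrame({
    "pe_present_on_image": [0, 0, 0, 1],
    "negative_exam_for_pe": [1, 1, 0, 1],
})

# For a 0/1 column the mean is the fraction of rows equal to 1.
positive_rate = toy.mean().sort_values()
print(positive_rate)
```

> Sorting the rates puts the rarest positive labels first, which makes the most skewed features easy to spot.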

> Before counting the unique values in "StudyInstanceUID", "SeriesInstanceUID" and "SOPInstanceUID", let's look at the correlation between all the binary features available in the data.

In [None]:
corr_mat = df_train[train_cols].corr()
mask = np.triu(np.ones_like(corr_mat, dtype=bool))
f, ax = plt.subplots(figsize=(14, 12))
sns.heatmap(corr_mat, mask = mask, cmap = "summer", annot = True, vmax = 0.3, square = False, linewidths = 0.5, center = 0)

> Almost all of the features seem to contain unique information, which is good for us!

In [None]:
def details_first_three(df = df_train):
    # nunique() counts the distinct non-null values in each UID column.
    print(f"Number of unique entries in StudyInstanceUID: {df.StudyInstanceUID.nunique()}")
    print(f"Number of unique entries in SeriesInstanceUID: {df.SeriesInstanceUID.nunique()}")
    print(f"Number of unique entries in SOPInstanceUID: {df.SOPInstanceUID.nunique()}")
details_first_three()

> Now we can see that the number of unique entries is the same for StudyInstanceUID and SeriesInstanceUID, which suggests one series per study.

> Also, we know from the competition's data overview that SOPInstanceUID is a unique identifier for an image. To verify it, we can see that the number of unique SOP values is the same as the length of the train data.
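> Equal unique counts do not by themselves prove that each study holds exactly one series, so a more direct check may be useful. Here is a minimal sketch on a made-up toy frame (the `toy` DataFrame is an assumption for illustration); on the real data the same calls would be made on `df_train`.

```python
import pandas as pd

# Toy stand-in for df_train with the three UID columns; values are made up.
toy = pd.DataFrame({
    "StudyInstanceUID": ["s1", "s1", "s2"],
    "SeriesInstanceUID": ["a", "a", "b"],
    "SOPInstanceUID": ["i1", "i2", "i3"],
})

# Largest number of distinct series found within any single study.
max_series_per_study = toy.groupby("StudyInstanceUID")["SeriesInstanceUID"].nunique().max()
# True when no image identifier is repeated.
sop_is_unique = toy["SOPInstanceUID"].is_unique
print(max_series_per_study, sop_is_unique)
```

> A maximum of 1 would mean every study maps to a single series.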

> Now, Let's talk a little about test data.

In [None]:
df_test.head(3)

> Shape of test data

In [None]:
df_test.shape

> Unique values for first three features.

In [None]:
details_first_three(df_test)

> It seems the Study and Series Instance UID counts match here too, just as in train.

> Let's now see what the submission file looks like and what we need to predict.

In [None]:
df_sub.head(3)

> Seems like we need to predict some "label" for the entries from test file.

> Shape of Submission file

In [None]:
df_sub.shape

### What do we predict?

> Now, we know from the Overview page of this competition that:

> *In this competition we are predicting a number of labels, at both the image and study level. Note that some labels are logically mutually exclusive.*

> Let's now check what these labels really are.

In [None]:
df_sub["label_features"] = df_sub.id.apply(lambda x: "_".join(x.split("_")[1:]))

In [None]:
# Use .loc to avoid chained assignment, which may silently fail to update df_sub.
df_sub.loc[df_sub.label_features == "", "label_features"] = "UID"

In [None]:
df_sub.label_features.value_counts()

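> Now we can see the labels we need to predict. Rows whose id is a bare UID (tagged "UID" above) carry no label suffix; going by the discussion quoted below, these appear to be the image-level pe_present_on_image predictions.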

Actually, this verifies the information provided by Jebastin Nadar in a discussion that:

> *Each study has multiple images. We have to predict labels for images as well as studies.*
    
> *Image level - predict for each image i.e SOPInstanceUID*
  
> *Labels to predict : pe_present_on_image*
    
> *Study level - predict for each study i.e StudyInstanceUID*
    
> *Labels to predict : negative_exam_for_pe , indeterminate, rv_lv_ratio_gte_1, rv_lv_ratio_lt_1, leftsided_pe, rightsided_pe, central_pe, chronic_pe, acute_and_chronic_pe*
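> The two levels described above can be separated directly from the submission id. This is a minimal sketch, assuming the id pattern seen in the sample submission (study-level rows as `<uid>_<label>`, image-level rows as a bare `<uid>`); the helper name `split_submission_id` and the example ids are hypothetical.

```python
def split_submission_id(row_id):
    """Split a submission id into (uid, label).

    Assumes study-level rows look like "<uid>_<label>" and image-level
    rows are a bare "<uid>" with no underscore.
    """
    uid, _, label = row_id.partition("_")
    # A bare UID is taken to be an image-level row (assumption, see lead-in).
    return uid, label or "pe_present_on_image"

print(split_submission_id("abc123_negative_exam_for_pe"))
print(split_submission_id("def456"))
```

> Splitting on the first underscore only keeps multi-word labels such as `rv_lv_ratio_gte_1` intact.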


## Visualizing some of the Scans

> Now, finally, let's see some of the images provided to us in train and test folders.

> Let's pick some random image addresses.

In [None]:
img_addr = ["../input/rsna-str-pulmonary-embolism-detection/train/0003b3d648eb/d2b2960c2bbf/00ac73cfc372.dcm", 
           "../input/rsna-str-pulmonary-embolism-detection/train/005df0f53614/5e0e0d0b7a65/081c2fa491a1.dcm",
           "../input/rsna-str-pulmonary-embolism-detection/train/0072baad76be/d555455a1dc2/096497b1da4e.dcm", 
           "../input/rsna-str-pulmonary-embolism-detection/train/00d4f4409f0c/38a51605b9ab/079e029c0d1a.dcm", 
           "../input/rsna-str-pulmonary-embolism-detection/test/00e7015490cb/291c07d4a7c0/09c25538116c.dcm", 
            "../input/rsna-str-pulmonary-embolism-detection/test/0227030d6278/599fccda6e2b/0c247bfd9c27.dcm", 
           "../input/rsna-str-pulmonary-embolism-detection/train/00102474a2db/c1a6d49ce580/06ce8f7a39ae.dcm", 
           "../input/rsna-str-pulmonary-embolism-detection/train/00102474a2db/c1a6d49ce580/0fd29873e8e4.dcm",
           "../input/rsna-str-pulmonary-embolism-detection/train/00617c9fe236/16ed05bf3395/01d00e27c5ac.dcm", 
            "../input/rsna-str-pulmonary-embolism-detection/test/08115e1b649d/f69e3f9c7067/10ba32beefb2.dcm"]

In [None]:
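def plot_image_grid(addresses = img_addr):
    fig = plt.figure(figsize=(12, 12))
    columns = 5
    rows = 2
    # Show at most rows*columns images, one per subplot.
    for i, addr in enumerate(addresses[:columns*rows], start=1):
        fig.add_subplot(rows, columns, i)
        # CT slices are single-channel, so a grayscale colormap is used here.
        plt.imshow(dcm.dcmread(addr).pixel_array, cmap="gray")
        plt.axis("off")
    plt.tight_layout()
    plt.show()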

plot_image_grid()
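> The grid above shows raw pixel values. CT slices are often easier to read after rescaling to Hounsfield units and clipping to a display window. Below is a minimal sketch of that idea on a made-up array; the function name `apply_window` and all default values (slope, intercept, window center and width) are assumptions for illustration. In practice the slope and intercept would come from each file's RescaleSlope and RescaleIntercept attributes.

```python
import numpy as np

def apply_window(raw, slope=1.0, intercept=-1024.0, center=100.0, width=700.0):
    """Rescale raw pixel values to Hounsfield units and clip to a display window."""
    hu = raw.astype(np.float32) * slope + intercept
    low, high = center - width / 2, center + width / 2
    # Clip to the window, then stretch to the 0..1 range for display.
    return (np.clip(hu, low, high) - low) / (high - low)

raw = np.array([[0, 1024], [1124, 3000]], dtype=np.int16)  # made-up raw values
windowed = apply_window(raw)
print(windowed)
```

> The result can be passed to `plt.imshow` with a grayscale colormap in place of the raw `pixel_array`.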

## Bonus: DICOM Metadata!

* Bonus for reaching this far!

Note: We can fetch the metadata details from the dcm image by using the following function (dicom_metadata).

In [None]:
dicom_atts = ["SpecificCharacterSet","ImageType","SOPInstanceUID","Modality","Manufacturer", "ManufacturerModelName","PatientName","PatientID",
             "PatientSex","DeidentificationMethod","BodyPartExamined","SliceThickness", "KVP","SpacingBetweenSlices","DistanceSourceToDetector",
              "DistanceSourceToPatient","GantryDetectorTilt", "TableHeight","RotationDirection","XRayTubeCurrent","GeneratorPower",
              "FocalSpots","ConvolutionKernel","PatientPosition","RevolutionTime", "SingleCollimationWidth","TotalCollimationWidth","TableSpeed","TableFeedPerRotation",
              "SpiralPitchFactor", "StudyInstanceUID","SeriesInstanceUID","StudyID","InstanceNumber","PatientOrientation",
              "ImagePositionPatient","ImageOrientationPatient","FrameOfReferenceUID","PositionReferenceIndicator",
              "SliceLocation","SamplesPerPixel","PhotometricInterpretation", "Rows","Columns","PixelSpacing","BitsAllocated","BitsStored","HighBit",
              "PixelRepresentation","PixelPaddingValue","WindowCenter","WindowWidth","RescaleIntercept", "RescaleSlope","RescaleType"]

list_attributes = ["ImageType","ImagePositionPatient","ImageOrientationPatient","PixelSpacing"]

def dicom_metadata(folder_path):
    files = os.listdir(folder_path)
    # The last path component is the series folder; it is stored under "Patient".
    patient_id = folder_path.split('/')[-1]
    
    ## Each row is an image file:
    base_data = {'Patient': [patient_id]*len(files), 'File': files}
    patient_df = pd.DataFrame(data=base_data)
    
    ## Add Columns by looping through DICOM attributes for each image file:
    slices = [dcm.dcmread(os.path.join(folder_path, s)) for s in files]
    for d in dicom_atts:
        attribute_i = []
        for s in slices:
            try:
                attribute_i.append(s[d].value)
            except Exception:  # attribute missing or unreadable in this file
                attribute_i.append(np.nan)
        patient_df[d] = attribute_i
        
    ## Store min pixel value for each image file 
    attribute_min_pixel = []
    for s in slices:
        try:
            mp = np.min(s.pixel_array.astype(np.int16))
        except Exception:  # pixel data missing or unreadable
            mp = np.nan
        attribute_min_pixel.append(mp)
    patient_df["MinPixelValue"] = attribute_min_pixel
  
    return patient_df

> For example: let's fetch the metadata for images from "test/00268ff88746/75d23269adbd" directory.

In [None]:
df = dicom_metadata("../input/rsna-str-pulmonary-embolism-detection/test/00268ff88746/75d23269adbd")

> Let's now see the data!

In [None]:
df.head(3)

> Amazing, isn't it? :)

> Let's now see some of the values from this data too!

* We can see that SOPInstanceUID is available here in the metadata. So, now, it'll be very easy for us to join the dataframes if we want to do so.
* Also, "Patient" is nothing but the SeriesInstanceUID.
* We conclude that StudyInstanceUID is also in the metadata.
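> The join mentioned above could look like the following. This is a minimal sketch on made-up toy frames (`meta` and `labels` are assumptions standing in for a metadata frame and for train.csv); it is not run on the competition data here.

```python
import pandas as pd

# Toy stand-ins: per-image metadata and per-image labels; values are made up.
meta = pd.DataFrame({"SOPInstanceUID": ["i1", "i2"], "SliceThickness": [1.25, 1.25]})
labels = pd.DataFrame({"SOPInstanceUID": ["i1", "i2", "i3"], "pe_present_on_image": [0, 1, 0]})

# Left join keeps every metadata row; validate raises if a UID is duplicated on either side.
merged = meta.merge(labels, on="SOPInstanceUID", how="left", validate="one_to_one")
print(merged)
```

> The `validate="one_to_one"` argument is a cheap guard: since SOPInstanceUID should identify a single image, a duplicate on either side would signal a problem before it spreads into later analysis.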

> Now, there are a total of 58 columns in the metadata. Many of them are of no use to us, but some are helpful.
> Let's see the most common resolution among the scans in this folder.

In [None]:
print(f"CT Scan resolution is: {df.Rows.value_counts().index[0]}x{df.Columns.value_counts().index[0]}")

In [None]:
del df_train, df_test, df_sub, df, count, dicom_atts, list_attributes, img_addr, corr_mat, mask, train_cols
gc.collect()

I hope you like the work. I will make sure to update this over time. Please leave a comment below if you have any suggestions.

Thanks for reading and Good Luck with the competition! :)