# SIIM-FISABIO-RSNA COVID-19 Detection: A Simple and Easy EDA
In this competition, we are given DICOM images of chest radiographs and asked to identify and localize COVID-19 abnormalities. This matters because the typical diagnosis of COVID-19 relies on molecular testing (polymerase chain reaction), which takes several hours, while chest radiographs can be obtained in minutes. However, it is hard to distinguish COVID-19 pneumonia from other viral and bacterial pneumonias on a radiograph. The hope in this competition is therefore to develop AI that eventually helps radiologists diagnose the millions of COVID-19 patients more confidently and quickly.


## Import the Dependencies

In [None]:
import pandas as pd 
import numpy as np
import os
import random  # used explicitly below for random.choice
import pydicom
import glob
from tqdm.notebook import tqdm
from pydicom.pixel_data_handlers.util import apply_voi_lut
import matplotlib.pyplot as plt 
from skimage import exposure
import cv2
import warnings
from pathlib import Path  # fastai re-exports this same class and adds .ls()
from fastai.vision.all import *
from fastai.medical.imaging import *
warnings.filterwarnings("ignore")

## Load the data

In [None]:
datapath = Path("../input/siim-covid19-detection")

In [None]:
datapath.ls()

We can see that we have:

 - train_study_level.csv - the train study-level metadata, with one row for each study, including correct labels.
 - train_image_level.csv - the train image-level metadata, with one row for each image, including both correct labels and any bounding boxes in a dictionary format. Some images in both test and train have multiple bounding boxes.
 - sample_submission.csv - a sample submission file containing all image- and study-level IDs.
 - train folder - comprises 6,334 chest scans in DICOM format, stored in paths with the form study/series/image
 - test folder - The hidden test dataset is of roughly the same scale as the training dataset.

## Let's look at the CSV files
First, check the train_study_level.csv file:

In [None]:
train_study_df = pd.read_csv(datapath/'train_study_level.csv')

In [None]:
train_study_df.head()

Let's look at the unique labels:

In [None]:
study_classes = ["Negative for Pneumonia" , "Typical Appearance",
                "Indeterminate Appearance", "Atypical Appearance"]
np.unique(train_study_df[study_classes].values , axis = 0)

As you can see, at the study-level, we are predicting the following classes:

 - Negative for Pneumonia
 - Typical Appearance
 - Indeterminate Appearance
 - Atypical Appearance

This is a standard multi-label classification problem. Interestingly, no study in the training set carries more than one label, but it is mentioned that:

"Studies in the test set may contain more than one label."
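
The single-label observation can be checked directly by summing the indicator columns per row. The snippet below is a minimal sketch on a hypothetical toy frame (`toy_study_df` stands in for `train_study_df`, and `labels_per_study` is an illustrative helper, not part of any library):

```python
import pandas as pd

study_classes = ["Negative for Pneumonia", "Typical Appearance",
                 "Indeterminate Appearance", "Atypical Appearance"]

# Hypothetical toy rows standing in for train_study_level.csv
toy_study_df = pd.DataFrame(
    [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 1]],
    columns=study_classes,
)


def labels_per_study(df, classes):
    """Return the number of positive labels in each row."""
    return df[classes].sum(axis=1)


counts = labels_per_study(toy_study_df, study_classes)
# True when every study has exactly one positive class
is_single_label = bool((counts == 1).all())
print(is_single_label)
```

Running the same check on `train_study_df` should show whether any training study has zero or several positive classes.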

Let's look at the distribution:

In [None]:
plt.figure(figsize = (10,5))
plt.bar([1,2,3,4] , train_study_df[study_classes].values.sum(axis = 0))
plt.xticks([1,2,3,4] , study_classes)
plt.ylabel("Frequency")
plt.show()

Let's now look at train_image_level.csv:

In [None]:
train_image_df = pd.read_csv(datapath/"train_image_level.csv")
train_image_df.head()

We have our bounding box labels provided in the label column. The format is as follows:

[class ID] [confidence score] [bounding box]

 - class ID - either opacity or none
 - confidence score - confidence from your neural network model. If none, the confidence is 1.
 - bounding box - typical xmin ymin xmax ymax format. If class ID is none, the bounding box is 0 0 1 1, so the full label reads none 1 0 0 1 1.
 
The bounding boxes are also provided in an easily readable dictionary format in the boxes column, and the study that each image belongs to is provided in StudyInstanceUID.
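
To make the label format concrete, here is a small sketch that splits a label string into groups of six tokens and names each field. The helper `parse_label` and the example string are hypothetical and only illustrate the layout described above:

```python
def parse_label(label):
    """Split a label string into one dict per 6-token group.

    Raises ValueError if the token count is not a multiple of six.
    """
    tokens = label.split()
    boxes = []
    for start in range(0, len(tokens), 6):
        cls, conf, xmin, ymin, xmax, ymax = tokens[start:start + 6]
        boxes.append({
            "class": cls,
            "confidence": float(conf),
            "xmin": float(xmin),
            "ymin": float(ymin),
            "xmax": float(xmax),
            "ymax": float(ymax),
        })
    return boxes


# Hypothetical label with two opacity boxes
example = "opacity 1 100.5 200 300.5 400 opacity 1 10 20 30 40"
parsed = parse_label(example)
print(len(parsed), parsed[0]["class"], parsed[0]["xmin"])

# An image without findings carries the placeholder label
no_finding = parse_label("none 1 0 0 1 1")
```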

Let's take a quick look at the distribution of opacity vs none:

In [None]:
train_image_df['split_label'] = train_image_df.label.apply(lambda x: [x.split()[offs:offs+6] for offs in range(0, len(x.split()), 6)])

In [None]:
classes_freq = []
for i in range(len(train_image_df)):
    for j in train_image_df.iloc[i].split_label: classes_freq.append(j[0])

plt.hist(classes_freq)
plt.ylabel("Frequency")

Let's also look at the distribution of the bounding box areas:

In [None]:
bbox_areas=[]
for i in range(len(train_image_df)):
    for j in train_image_df.iloc[i].split_label:
        # area = width * height = (xmax - xmin) * (ymax - ymin)
        bbox_areas.append((float(j[4])-float(j[2]))*(float(j[5])-float(j[3])))
        
plt.hist(bbox_areas)
plt.ylabel("Frequency")
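
For reference, the area of a box in xmin ymin xmax ymax format is (xmax - xmin) * (ymax - ymin). The sketch below works through one example and, as a design choice that differs from the cell above, skips the `none` placeholder boxes so they do not show up as tiny areas in a histogram. The helper names and the toy data are hypothetical:

```python
def box_area(xmin, ymin, xmax, ymax):
    """Area of an axis-aligned box given its corner coordinates."""
    return (xmax - xmin) * (ymax - ymin)


def opacity_areas(split_label):
    """Areas of the opacity boxes in one image's list of 6-token groups."""
    return [
        box_area(*(float(v) for v in group[2:6]))
        for group in split_label
        if group[0] == "opacity"  # assumption: ignore the 'none' placeholder
    ]


# Hypothetical entry shaped like one value of the split_label column
toy_split_label = [
    ["opacity", "1", "10", "20", "110", "220"],
    ["none", "1", "0", "0", "1", "1"],
]
areas = opacity_areas(toy_split_label)
print(areas)  # width 100 * height 200
```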

## A look at the images
Okay, let's now look at some example images:

In [None]:
def dicom2array(path, voi_lut=True, fix_monochrome=True):
    dicom = pydicom.dcmread(path)  # read_file is the deprecated alias of dcmread
    # VOI LUT (if available by DICOM device) is used to
    # transform raw DICOM data to "human-friendly" view
    if voi_lut:
        data = apply_voi_lut(dicom.pixel_array, dicom)
    else:
        data = dicom.pixel_array
    # depending on this value, X-ray may look inverted - fix that:
    if fix_monochrome and dicom.PhotometricInterpretation == "MONOCHROME1":
        data = np.amax(data) - data
    data = data - np.min(data)
    data = data / np.max(data)
    data = (data * 255).astype(np.uint8)
    return data


def plot_img(img, size=(7, 7), is_rgb=True, title="", cmap='gray'):
    plt.figure(figsize=size)
    plt.imshow(img, cmap=cmap)
    plt.suptitle(title)
    plt.show()

    
def plot_imgs(imgs, cols=4, size=7, is_rgb=True, title="", cmap='gray', img_size=(500,500)):
    rows = (len(imgs) + cols - 1)//cols  # ceiling division, avoids an empty extra row
    fig = plt.figure(figsize=(cols*size, rows*size))
    for i, img in enumerate(imgs):
        if img_size is not None:
            img = cv2.resize(img, img_size)
        fig.add_subplot(rows, cols, i+1)
        plt.imshow(img, cmap=cmap)
    plt.suptitle(title)
    plt.show()

In [None]:
dicom_paths = get_dicom_files(datapath/'train')
imgs = [dicom2array(path) for path in dicom_paths[:4]]
plot_imgs(imgs)
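
The intensity handling inside `dicom2array` (optional MONOCHROME1 inversion followed by min-max scaling to 8 bits) can be tried without a DICOM file. The following is a sketch on a tiny synthetic array; `normalize_to_uint8` is a hypothetical helper that mirrors those steps, with an explicit cast to float added as an assumption to keep the arithmetic unambiguous:

```python
import numpy as np


def normalize_to_uint8(data, invert=False):
    """Optionally invert, then min-max scale pixel values to 0..255."""
    data = data.astype(np.float64)  # assumption: work in float from the start
    if invert:
        # MONOCHROME1 stores bright tissue as low values, so flip the range
        data = np.amax(data) - data
    data = data - np.min(data)
    data = data / np.max(data)  # assumes the image is not constant
    return (data * 255).astype(np.uint8)


# Synthetic 12-bit "image"
raw = np.array([[0, 2048], [4095, 1024]], dtype=np.uint16)
normal = normalize_to_uint8(raw)
inverted = normalize_to_uint8(raw, invert=True)
print(normal)
print(inverted)
```

After inversion the darkest and brightest pixels swap places, which is why MONOCHROME1 images would otherwise look like negatives.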

Let's actually look at how many images are available per study:

In [None]:
num_images_per_study = []
for i in (datapath/'train').ls():
    n_images = len(get_dicom_files(i))  # scan each study folder only once
    num_images_per_study.append(n_images)
    if n_images > 5:
        print(f"Study {i} had {n_images} images")

In [None]:
plt.hist(num_images_per_study)

In [None]:
def image_path(row):
    study_path = datapath/"train"/row.StudyInstanceUID
    for i in get_dicom_files(study_path):
        if row.id.split('_')[0] == i.stem: return i
        
train_image_df['image_path'] = train_image_df.apply(image_path , axis = 1)
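
The lookup above rescans a study folder for every row. One possible alternative, sketched here under the assumption that the files end in `.dcm` and that image IDs look like `<stem>_image`, is to index all files by stem once and then map IDs through that dictionary. `build_stem_index` and `id_to_stem` are hypothetical helpers, and the demo uses an empty placeholder file in a temporary directory rather than real data:

```python
import tempfile
from pathlib import Path


def build_stem_index(root):
    """Map each file stem under root to its full path (assumes .dcm files)."""
    return {p.stem: p for p in Path(root).rglob("*.dcm")}


def id_to_stem(image_id):
    """Drop the suffix after the first underscore, e.g. 'abc123_image'."""
    return image_id.split("_")[0]


with tempfile.TemporaryDirectory() as tmp:
    # Placeholder file laid out as study/series/image
    fake = Path(tmp) / "study1" / "series1" / "abc123.dcm"
    fake.parent.mkdir(parents=True)
    fake.touch()

    index = build_stem_index(tmp)
    found_name = index[id_to_stem("abc123_image")].name
    print(found_name)
```

With such an index, the column could be filled with something like `train_image_df.id.map(lambda x: index.get(id_to_stem(x)))`.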


In [None]:
train_image_df.head()

In [None]:
imgs = []
image_paths = train_image_df['image_path'].values

# box line thickness (in pixels) and downscaling factor for display
thickness = 10
scale = 5


for _ in range(8):
    # use img_path so the image_path() function defined earlier is not overwritten
    img_path = random.choice(image_paths)
    print(img_path)
    img = dicom2array(path=img_path)
    img = cv2.resize(img, None, fx=1/scale, fy=1/scale)
    img = np.stack([img, img, img], axis=-1)
    # draw every opacity box, scaling its coordinates to the downsized image
    for box in train_image_df.loc[train_image_df['image_path'] == img_path].split_label.values[0]:
        if box[0] == 'opacity':
            img = cv2.rectangle(img,
                                (int(float(box[2])/scale), int(float(box[3])/scale)),
                                (int(float(box[4])/scale), int(float(box[5])/scale)),
                                [255,0,0], thickness)
    
    img = cv2.resize(img, (500,500))
    imgs.append(img)

plot_imgs(imgs, cmap=None)
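
In the cell above the boxes are drawn before the final resize to 500x500, so they stay aligned with the image. If an image is resized first (for example to a fixed training resolution), the coordinates need to be scaled separately along each axis. A minimal sketch, where `scale_box` and the example sizes are hypothetical:

```python
def scale_box(box, src_size, dst_size):
    """Rescale (xmin, ymin, xmax, ymax) from src_size to dst_size.

    Both sizes are given as (width, height).
    """
    sx = dst_size[0] / src_size[0]
    sy = dst_size[1] / src_size[1]
    xmin, ymin, xmax, ymax = box
    return (xmin * sx, ymin * sy, xmax * sx, ymax * sy)


# Hypothetical 4000x2000 radiograph resized to 500x500
scaled = scale_box((1000, 500, 2000, 1500),
                   src_size=(4000, 2000), dst_size=(500, 500))
print(scaled)
```

Note that a non-uniform resize like this one changes the aspect ratio of the box as well as its position.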

## How to submit
Let's now go over the sample_submission.csv file so we know how to submit our predictions.

Before we do so, it's worth reminding ourselves that this is a code-only competition, meaning that your submission file has to be generated in a script/notebook. The sample_submission.csv file demonstrates what kind of file needs to be produced:



In [None]:
submission_df = pd.read_csv(datapath/"sample_submission.csv")
submission_df.head()

We can see that we have to provide the study-level class label, in the format [class] 1 0 0 1 1.

In [None]:
submission_df.iloc[2000:2010]

We also have to provide the image-level bounding box. These will be of the format [class ID] [confidence score] [bounding box] as described earlier.

Of course, in both cases, you can have multi-label scenarios.
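
As a rough sketch of how such prediction strings could be assembled, the helpers below join one six-token group per prediction. Everything here is an assumption for illustration: the helper names, the short class tokens such as `negative`, the two-decimal confidence, and the rounding of coordinates to integers are not taken from the competition rules.

```python
def format_study_prediction(scores):
    """Build a study-level string from a {class_name: confidence} dict.

    Each class is followed by the placeholder box 0 0 1 1.
    """
    return " ".join(f"{name} {conf:.2f} 0 0 1 1" for name, conf in scores.items())


def format_image_prediction(boxes):
    """Build an image-level string from (conf, xmin, ymin, xmax, ymax) tuples."""
    if not boxes:
        # no detections: emit the 'none' placeholder
        return "none 1 0 0 1 1"
    return " ".join(
        f"opacity {conf:.2f} {round(xmin)} {round(ymin)} {round(xmax)} {round(ymax)}"
        for conf, xmin, ymin, xmax, ymax in boxes
    )


study_string = format_study_prediction({"negative": 0.8, "typical": 0.2})
image_string = format_image_prediction([(0.9, 10, 20, 30, 40)])
empty_string = format_image_prediction([])
print(study_string)
print(image_string)
print(empty_string)
```

Listing several classes or several boxes in one string is how the multi-label case would be expressed under these assumptions.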

In [None]:
submission_df.to_csv("submission.csv" , index = False)