<h1 style='color:red;font-weight:500;'>SIIM-FISABIO-RSNA COVID-19 Detection: An Extended EDA </h1>

In this competition, we are provided with <span style='color:blue;font-weight:500;'>DICOM images </span> of chest radiographs, and we are asked to identify and localize COVID-19 abnormalities. This matters because the typical diagnosis of COVID-19 relies on molecular testing (polymerase chain reaction), which takes several hours, while chest radiographs can be obtained in minutes. However, it is hard to distinguish COVID-19 pneumonia from other viral and bacterial pneumonias on a radiograph. The hope for this competition is therefore to develop AI that eventually helps radiologists diagnose the millions of COVID-19 patients more confidently and quickly.

I'll provide a quick and simple EDA to help you get started with this very interesting competition!


<span style='color:green;font-size:20px;font-weight:500;'>This notebook is forked from another EDA notebook. In this one, I am going to explain all the facts about this competition that confused me at the beginning. </span>

<span style='color:crimson;font-size:20px;font-weight:500;'><b> First thing before getting started: </b> The data might look huge, but there is only a small number of images. We will determine the exact number of images in a later part of the notebook. The huge size is mainly due to the `DICOM` format. If we extract only the images in PNG format, even in high quality, the dataset comes down to only 3-4 GB.  </span> 

<span style='color:blue;font-size:15px;font-weight:500;'>Some kind Kagglers have converted the dataset to JPG or PNG format. Though I will show in this notebook how to convert from `.dcm` to `.jpg`/`.png`, the converted datasets will also be linked. </span> 


<span style='color:blue;font-size:18px;font-weight:500;'>DICOM : </span> `It is the standard format for the communication and management of medical imaging information and related data.`



In [None]:
! conda install -c conda-forge gdcm -y

In [None]:
# import numpy as np # linear algebra
# import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
# import os
# import pydicom
# import glob
# from tqdm.notebook import tqdm
# from pydicom.pixel_data_handlers.util import apply_voi_lut
# import matplotlib.pyplot as plt
# from skimage import exposure
# import cv2
# import warnings
# from fastai.vision.all import *
# from fastai.medical.imaging import *
# warnings.filterwarnings('ignore')

import os
import pydicom
import glob
import cv2
import random
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt 



<h1 style='color:green;'>A look at the provided data </h1>

In [None]:
dataset_path = '../input/siim-covid19-detection/'

for path in glob.glob(dataset_path  + '*'):
    print(path)

<span style='color:blue;font-size:18px;font-weight:500;'> We can see that we have:</span>

* `train_study_level.csv` - the train study-level metadata, with one row for each study, including correct labels.
* `train_image_level.csv` - the train image-level metadata, with one row for each image, including both correct labels and any bounding boxes in a dictionary format. Some images in both test and train have multiple bounding boxes.
* `sample_submission.csv` - a sample submission file containing all image- and study-level IDs.
* `train` folder - comprises 6,334 chest scans in DICOM format, stored in paths with the form `study`/`series`/`image`
* `test` folder - The hidden test dataset is of roughly the same scale as the training dataset.


<span style='color:crimson;font-size:18px;font-weight:500;'> Train folder analysis: </span>


`We see that there are some folders in the train folder. Each of them has at least one subfolder, and each subfolder has at least one dcm file.`

In [None]:
train_images = '../input/siim-covid19-detection/train/'

n_folders = len(glob.glob(train_images  + '*'))
n_subfolders = len(glob.glob(train_images  + '*/*'))
n_images = len(glob.glob(train_images  + '*/*/*.dcm'))

print(f'There are {n_subfolders} subfolders in {n_folders} folders.')
print(f'There are altogether {n_images} images.')

`So, some folders have more than one subfolder, and some subfolders have more than one image.`

In [None]:
folders = glob.glob(train_images  + '*')

subfolder_dict = {}

for folder in folders:
    n = len(glob.glob(folder + '/*'))
    if n in subfolder_dict.keys():
        subfolder_dict[n] += 1
    else:
        subfolder_dict[n] = 1
        
for k, v in subfolder_dict.items():
    print(f'There are {v} folders with {k} subfolders.')



In [None]:
subfolders = glob.glob(train_images  + '*/*')

image_dict = {}

for subfolder in subfolders:
    n = len(glob.glob(subfolder + '/*'))
    if (n!=1):
        print (subfolder.split('/')[-1])
        for image in glob.glob(subfolder + '/*'):
            print ('-->' + image.split('/')[-1])

    if n in image_dict.keys():
        image_dict[n] += 1
    else:
        image_dict[n] = 1
        
for k, v in image_dict.items():
    print(f'There are {v} subfolders with {k} images.')


<span style='color:crimson;font-size:18px;font-weight:500;'> Major thing to remember is: </span>
`There are 6331 subfolders in 6054 folders.
There are altogether 6334 images. Let's keep those numbers in mind.`
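
The folder and subfolder tallies above can also be written with `collections.Counter`. The following is only a sketch: the helper names `children_per_parent` and `tally` are hypothetical, and the example paths are made up to mimic the `study`/`series`/`image` layout rather than taken from the dataset.

```python
from collections import Counter
from pathlib import PurePosixPath


def children_per_parent(paths):
    """Map each parent directory to the number of given paths directly under it."""
    return Counter(str(PurePosixPath(p).parent) for p in paths)


def tally(paths):
    """Return {number_of_children: number_of_parents_with_that_many_children}."""
    return Counter(children_per_parent(paths).values())


# Hypothetical paths in the study/series/image layout (not real dataset IDs).
example_images = [
    "train/studyA/series1/img1.dcm",
    "train/studyA/series2/img2.dcm",
    "train/studyB/series3/img3.dcm",
    "train/studyB/series3/img4.dcm",
]
# Each distinct parent of an image is a series (subfolder).
example_series = sorted({str(PurePosixPath(p).parent) for p in example_images})

print(tally(example_series))  # how many studies hold 1, 2, ... series
print(tally(example_images))  # how many series hold 1, 2, ... images
```

On the real data, the same helper could be fed the lists returned by `glob.glob(train_images + '*/*')` and `glob.glob(train_images + '*/*/*.dcm')`.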

<span style='color:green;font-size:18px;font-weight:500;'>What is Study Level? What is Image Level? </span> <br><br> `Wait for a sec...We will get the answer  just after analyzing the CSV files.`



<span style='color:red;font-size:18px;font-weight:500;'>train_study_level.csv :</span>

In [None]:
train_study_df = pd.read_csv(dataset_path +'/train_study_level.csv')

train_study_df

So we have `6054` rows in this. All the rows are classified into the following classes:
* `Negative for Pneumonia`
* `Typical Appearance`
* `Indeterminate Appearance`
* `Atypical Appearance`

 <b>So the `Study Level`  looks something like a classification task</b>. We will check later, if this is single label classification or multilabel. 

Do you remember those numbers? <br> 
Yes, the rows of `train_study_level.csv` correspond to the folders: 6054 rows for 6054 folders.<b> Each folder refers to a study.</b>
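
To check that correspondence rather than rely on matching counts, one could compare the IDs in the CSV with the folder names. This is a rough sketch under an assumption: that the `id` values carry a `_study` (or, for the image-level file, `_image`) suffix after the bare UID. The helper names `uid_from_id` and `compare_ids` are hypothetical, and the IDs below are made up.

```python
def uid_from_id(row_id):
    """Drop a trailing '_study' or '_image' suffix (assumed format), leaving the bare UID."""
    for suffix in ("_study", "_image"):
        if row_id.endswith(suffix):
            return row_id[: -len(suffix)]
    return row_id


def compare_ids(csv_ids, folder_names):
    """Report UIDs present on only one side of the CSV/folder comparison."""
    csv_uids = {uid_from_id(i) for i in csv_ids}
    folders = set(folder_names)
    return {
        "missing_on_disk": sorted(csv_uids - folders),
        "missing_in_csv": sorted(folders - csv_uids),
    }


# Made-up IDs for illustration only.
report = compare_ids(["aaa_study", "bbb_study"], ["aaa", "ccc"])
print(report)
```

With the real data, `csv_ids` would be `train_study_df['id']` and `folder_names` the base names of the paths in `folders`; two empty lists in the report would confirm that each folder is a study.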

<span style='color:blue;font-size:16px;font-weight:500;'>Now, Checking frequency of classes...</span>


In [None]:
study_classes = train_study_df.drop(columns = ['id']).columns.tolist()
plt.figure(figsize = (10,5))
plt.bar([1,2,3,4], train_study_df[study_classes].values.sum(axis=0))
plt.xticks([1,2,3,4],study_classes)
plt.ylabel('Frequency')
plt.show()

This seems plausible. `Typical Appearance` is the most frequent class, as it is the general case. `Negative for Pneumonia` comes next; it is less frequent than the typical case, presumably because these X-rays come from patients, not random people. A moderate number of cases are hard for doctors to call; those are `Indeterminate Appearance`. The lowest count belongs to the unusual `Atypical Appearance` cases.

<span style='color:blue;font-size:16px;font-weight:500;'>Let's check if any row has multiple labels</span>

In [None]:
train_study_df[study_classes].sum(axis = 1).value_counts()

We can see that every row sums to 1. So, `each of the 6054 studies has exactly one label.`
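
Since every study has exactly one label, the four one-hot columns can be collapsed into a single class name. Below is a minimal sketch using only the standard library. The helper `single_label` is hypothetical, the rows are made up, and the column names are assumed to match the class names listed earlier.

```python
import csv
import io

# Assumed column names of the four study-level classes.
STUDY_CLASSES = [
    "Negative for Pneumonia",
    "Typical Appearance",
    "Indeterminate Appearance",
    "Atypical Appearance",
]


def single_label(row, classes=STUDY_CLASSES):
    """Return the one class flagged with 1; raise if the row is not one-hot."""
    active = [c for c in classes if int(row[c]) == 1]
    if len(active) != 1:
        raise ValueError(f"expected exactly one label, got {active}")
    return active[0]


# Hypothetical rows in the same layout as train_study_level.csv (IDs are made up).
toy_csv = io.StringIO(
    "id,Negative for Pneumonia,Typical Appearance,"
    "Indeterminate Appearance,Atypical Appearance\n"
    "aaa_study,0,1,0,0\n"
    "bbb_study,1,0,0,0\n"
)
labels = {row["id"]: single_label(row) for row in csv.DictReader(toy_csv)}
print(labels)
```

With pandas, `train_study_df[study_classes].idxmax(axis=1)` should give the same class name per row, though it would not flag a row that is not one-hot.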

<span style='color:red;font-size:18px;font-weight:500;'>train_image_level.csv :</span>

In [None]:
train_image_df = pd.read_csv(dataset_path +'/train_image_level.csv')

train_image_df

First, bounding boxes are given here, which means the task involves localizing some portion of the image. <b>This is an `Object Detection Task`</b><br>
Do you recognize the number of rows? <br>
Yep! That is the total number of images.

<b>Thus `Image Level` means nothing but prediction for each image.</b>

We have our bounding box labels provided in the `label` column. The format is as follows:

`[class ID] [confidence score] [bounding box]`

* class ID - either `opacity` or `none`
* confidence score - confidence from your neural network model. If none, the confidence is `1`.
* bounding box - typical `xmin ymin xmax ymax` format. If class ID is none, the bounding box is `0 0 1 1`, so the full label reads `none 1 0 0 1 1`.

The bounding boxes are also provided in an easily readable dictionary format in the column `boxes`, and the study that each image belongs to is provided in `StudyInstanceUID`.
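
To make the label format concrete, here is a small sketch that parses a label string into one dictionary per detection. The helper names (`parse_label`, `opacity_boxes`, `box_area`) are hypothetical and the coordinates in the example are made up.

```python
def parse_label(label):
    """Split a label string into dicts, one per 6-token detection."""
    tokens = label.split()
    if len(tokens) % 6 != 0:
        raise ValueError(f"label has {len(tokens)} tokens, expected a multiple of 6")
    detections = []
    for start in range(0, len(tokens), 6):
        cls, conf, xmin, ymin, xmax, ymax = tokens[start:start + 6]
        detections.append({
            "class": cls,
            "confidence": float(conf),
            "xmin": float(xmin),
            "ymin": float(ymin),
            "xmax": float(xmax),
            "ymax": float(ymax),
        })
    return detections


def opacity_boxes(label):
    """Keep only real boxes, dropping the 'none 1 0 0 1 1' placeholder."""
    return [d for d in parse_label(label) if d["class"] == "opacity"]


def box_area(det):
    """Area in pixels: (xmax - xmin) * (ymax - ymin)."""
    return (det["xmax"] - det["xmin"]) * (det["ymax"] - det["ymin"])


# Made-up label with two opacities.
example = "opacity 1 100 200 400 600 opacity 1 500 250 900 650"
print([box_area(d) for d in opacity_boxes(example)])
print(opacity_boxes("none 1 0 0 1 1"))
```

Filtering on the class name keeps the placeholder box of `none` images out of any later box statistics.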

<span style='color:blue;font-size:16px;font-weight:500;'>Now we will create a new column, splitting the label into a more readable list format.</span>

In [None]:
train_image_df['split_label'] = train_image_df.label.apply(lambda x: [x.split()[offs:offs+6] for offs in range(0, len(x.split()), 6)])

train_image_df['split_label']

Okay, if the label is `none`, there is only one entry. Otherwise, there may be several detections.

<span style='color:blue;font-size:16px;font-weight:500;'>Now I will create one more column to count how many bounding boxes each image has.</span>


In [None]:
train_image_df['label_len'] = train_image_df['split_label'].apply(lambda x: 0 if (x[0][0] == 'none') else len(x))
train_image_df['label_len'].value_counts()

So it looks like there are 2040 images with no detection and 3113 images with 2 detections (understandably, one in each lung).

<span style='color:blue;font-size:16px;font-weight:500;'>Let's take a quick look at the distribution of opacity vs none:</span>

In [None]:
classes_freq = []
for i in range(len(train_image_df)):
    for j in train_image_df.iloc[i].split_label: classes_freq.append(j[0])
plt.hist(classes_freq)
plt.ylabel('Frequency')

In [None]:
def dicom2array(path, voi_lut=True, fix_monochrome=True):
    dicom = pydicom.dcmread(path)  # dcmread replaces the deprecated read_file
    # VOI LUT (if available by DICOM device) is used to
    # transform raw DICOM data to "human-friendly" view
    if voi_lut:
        data = pydicom.pixel_data_handlers.util.apply_voi_lut(dicom.pixel_array, dicom)
    else:
        data = dicom.pixel_array
    # depending on this value, X-ray may look inverted - fix that:
    if fix_monochrome and dicom.PhotometricInterpretation == "MONOCHROME1":
        data = np.amax(data) - data
    data = data - np.min(data)
    data = data / np.max(data)
    data = (data * 255).astype(np.uint8)
    return data
        
    
def plot_img(img, size=(7, 7), is_rgb=True, title="", cmap='gray'):
    plt.figure(figsize=size)
    plt.imshow(img, cmap=cmap)
    plt.suptitle(title)
    plt.show()
    
def plot_imgs(imgs, cols=4, size=7, is_rgb=True, title="", cmap='gray', img_size=(500,500)):
    rows = (len(imgs) + cols - 1) // cols  # ceiling division, no empty trailing row
    fig = plt.figure(figsize=(cols*size, rows*size))
    for i, img in enumerate(imgs):
        if img_size is not None:
            img = cv2.resize(img, img_size)
        fig.add_subplot(rows, cols, i+1)
        plt.imshow(img, cmap=cmap)
    plt.suptitle(title)
    plt.show()


In [None]:
imgs = []
image_ids = train_image_df['id'].values

# map label_id to specify color
thickness = 10
scale = 5


for i in range(8):
    image_id = random.choice(image_ids)
    image_path = glob.glob(f"../input/siim-covid19-detection/train/*/*/{image_id.split('_')[0]}.dcm")[0]
    print(image_path)
    img = dicom2array(path=image_path)
    img = cv2.resize(img, None, fx=1/scale, fy=1/scale)
    img = np.stack([img, img, img], axis=-1)
    print(train_image_df.loc[train_image_df['id'] == image_id])
#     for i in train_image_df.loc[train_image_df['id'] == image_id].split_label.values[0]:
#         if i[0] == 'opacity':
#             img = cv2.rectangle(img,
#                                 (int(float(i[2])/scale), int(float(i[3])/scale)),
#                                 (int(float(i[4])/scale), int(float(i[5])/scale)),
#                                 [255,0,0], thickness)
    
    img = cv2.resize(img, (500,500))
    imgs.append(img)
    
plot_imgs(imgs, cmap=None)

In [None]:
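bbox_areas = []
for i in range(len(train_image_df)):
    for j in train_image_df.iloc[i].split_label:
        if j[0] != 'opacity':
            continue  # skip the placeholder 'none 1 0 0 1 1' entries
        # area = width * height = (xmax - xmin) * (ymax - ymin)
        bbox_areas.append((float(j[4])-float(j[2]))*(float(j[5])-float(j[3])))
plt.hist(bbox_areas)
plt.ylabel('Frequency')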

<span style='color:crimson;font-size:20px;font-weight:500;'><b> If you like the EDA approach or the notebook, a lot more will be added. Thank you. </b>  </span> 

That's it for now!

**Please upvote if you found this helpful!**