Welcome to the [SIIM-FISABIO-RSNA COVID-19 Detection](http://www.kaggle.com/c/siim-covid19-detection) competition to identify and localize COVID-19 abnormalities on chest radiographs.

This is a work-in-progress EDA notebook.



In [None]:
#! conda install -c conda-forge gdcm -y

## Libraries

In [None]:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib
import pydicom #as dicom
from pydicom.pixel_data_handlers.util import apply_voi_lut
import cv2
import ast
import seaborn as sns

from fastai.vision.all import *
from fastai.medical.imaging import *

import warnings
warnings.filterwarnings("ignore")


## Load

In [None]:
path = '/kaggle/input/siim-covid19-detection/'
os.listdir(path)

In [None]:
train_image = pd.read_csv(path+'train_image_level.csv')
train_study = pd.read_csv(path+'train_study_level.csv')
samp_subm = pd.read_csv(path+'sample_submission.csv')

So we have:

* `train_study_level.csv` - the train study-level metadata, with one row for each study, including correct labels.
* `train_image_level.csv` - the train image-level metadata, with one row for each image, including both correct labels and any bounding boxes in a dictionary format. Some images in both test and train have multiple bounding boxes.
* `sample_submission.csv` - a sample submission file containing all image- and study-level IDs.
* `train` folder - comprises 6,334 chest scans in DICOM format, stored in paths with the form `study`/`series`/`image`
* `test` folder - The hidden test dataset is of roughly the same scale as the training dataset.
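
To make the "dictionary format" of the bounding boxes concrete, here is a minimal sketch of parsing one `boxes` cell. It assumes each cell is a string holding a Python-style list of dicts with `x`, `y`, `width` and `height` keys, and that images without boxes hold `NaN`. The helper name `parse_boxes` and the example values are made up for illustration.

```python
import ast

def parse_boxes(raw):
    """Parse one raw `boxes` cell into a list of dicts (hypothetical helper).

    Assumes the cell is a string such as "[{'x': ..., 'y': ..., ...}]".
    Anything that is not a string (e.g. NaN for "no boxes") yields [].
    """
    if not isinstance(raw, str):
        return []
    return ast.literal_eval(raw)

# Made-up example values, not taken from the dataset.
example = "[{'x': 10.0, 'y': 20.0, 'width': 30.0, 'height': 40.0}]"
boxes = parse_boxes(example)
print(len(boxes), boxes[0]['width'])
print(parse_boxes(float('nan')))
```

If the assumption about the cell format holds, the same helper could be applied to the whole column with `train_image['boxes'].apply(parse_boxes)`.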


## Overview

In [None]:
txt = "The shape of image samples, study samples and sample submissions is: {}, {}, {}"
print(txt.format(train_image.shape, train_study.shape, samp_subm.shape))

In [None]:
train_image['StudyInstanceUID'].nunique()

In [None]:
train_study['id'].nunique()

In [None]:
studyGroups = train_image.groupby(['StudyInstanceUID']).size().sort_values(ascending=False)
#groupsDist = studyGroups

In [None]:
# Number of studies for each images-per-study count (value_counts sorts descending)
ImsPerStudy = studyGroups.value_counts()
print(ImsPerStudy)

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(8, 4))
ImsPerStudy.plot.bar(rot=0, ax=ax)  # draw on the axes created above
ax.set_xlabel('Images per study')
ax.set_ylabel('Number of studies')

There are 6,054 unique imaging studies, and most of them contain a single image. `train_image` contains 6,334 samples, 280 more than `train_study`.
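
The gap between the two counts can be cross-checked: every study with k images contributes k - 1 rows beyond the one-per-study baseline. Below is a small sketch on a toy table; the ids and UIDs are placeholders, not real data.

```python
import pandas as pd

# Toy stand-in for train_image: three studies, one of which has three images.
toy = pd.DataFrame({
    'id': ['a_image', 'b_image', 'c_image', 'd_image', 'e_image'],
    'StudyInstanceUID': ['s1', 's2', 's2', 's2', 's3'],
})

images_per_study = toy.groupby('StudyInstanceUID').size()
n_images = len(toy)
n_studies = toy['StudyInstanceUID'].nunique()

# Each study with k images adds (k - 1) rows beyond one row per study.
extra_images = int((images_per_study - 1).sum())
print(n_images, n_studies, extra_images)
```

Applied to the real `train_image`, the same expression would be expected to return the difference between the number of images and the number of studies.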

In [None]:
train_image.head()

In [None]:
train_study.head()

In [None]:
samp_subm.head()

In [None]:
ts_cols = train_study.columns

# Count the positive studies for each label column
# (columns 1-4 of train_study)
n = list(ts_cols[1:5])
m = [train_study[col].sum() for col in n]
print(m)
print(n)

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(8, 4))

ax.bar(n, m)
# Rotate the tick labels without resetting them by hand
ax.tick_params(axis='x', labelrotation=30)
ax.set_title('Distribution of pneumonia and COVID types')
ax.grid()
plt.show()

In [None]:
train_image.columns

In [None]:
train_image['id'][0]

In [None]:
print(train_image['label'][0])

In [None]:
print(train_image['boxes'][0])

## Images

In [None]:
def dicom2array(path, voi_lut=True, fix_monochrome=True):
    """Read a DICOM file and return its pixels as a uint8 array scaled to 0-255."""
    dicom = pydicom.dcmread(path)  # read_file is a deprecated alias of dcmread
    # VOI LUT (if provided by the DICOM device) is used to
    # transform raw DICOM data to a "human-friendly" view
    if voi_lut:
        data = apply_voi_lut(dicom.pixel_array, dicom)
    else:
        data = dicom.pixel_array
    # MONOCHROME1 means high pixel values are dark, so the X-ray
    # looks inverted; flip it to match MONOCHROME2
    if fix_monochrome and dicom.PhotometricInterpretation == "MONOCHROME1":
        data = np.amax(data) - data
    # Min-max normalise to [0, 1], then scale to 8-bit
    data = data - np.min(data)
    data = data / np.max(data)
    data = (data * 255).astype(np.uint8)
    return data
        
    
def plot_img(img, size=(7, 7), is_rgb=True, title="", cmap='gray'):
    plt.figure(figsize=size)
    plt.imshow(img, cmap=cmap)
    plt.suptitle(title)
    plt.show()


def plot_imgs(imgs, cols=4, size=7, is_rgb=True, title="", cmap='gray', img_size=(500,500)):
    rows = max(1, (len(imgs) + cols - 1) // cols)  # ceiling division avoids an empty extra row
    fig = plt.figure(figsize=(cols*size, rows*size))
    for i, img in enumerate(imgs):
        if img_size is not None:
            img = cv2.resize(img, img_size)
        fig.add_subplot(rows, cols, i+1)
        plt.imshow(img, cmap=cmap)
    plt.suptitle(title)
    plt.show()
    
# thanks to https://www.kaggle.com/tanlikesmath/siim-covid-19-detection-a-simple-eda
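
The inversion and rescaling steps in `dicom2array` can be tried without a DICOM file. The sketch below applies the same min-max scaling to a synthetic array. The function name `to_uint8`, the `invert` flag and the pixel values are illustrative assumptions, and the guard against a constant image is an addition that the function above does not have.

```python
import numpy as np

def to_uint8(data, invert=False):
    """Rescale an array to 0..255 uint8, optionally inverting intensities first."""
    data = data.astype(np.float64)
    if invert:
        # MONOCHROME1 stores high values as dark, so flip around the maximum.
        data = data.max() - data
    data = data - data.min()
    peak = data.max()
    if peak > 0:  # guard against a constant image (division by zero)
        data = data / peak
    return (data * 255).astype(np.uint8)

# Synthetic 12-bit pixel values, not taken from the dataset.
raw = np.array([[0, 2048], [1024, 4095]], dtype=np.uint16)
normal = to_uint8(raw)
inverted = to_uint8(raw, invert=True)
print(normal)
print(inverted)
```

The darkest pixel of the normal image becomes the brightest pixel of the inverted one, which is the effect the `fix_monochrome` option relies on.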

In [None]:
dicom_paths = get_dicom_files(path+'train')
print(dicom_paths[:4])
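
Since the training files are stored as `study`/`series`/`image`, each path can be split back into its three identifiers. This is a minimal sketch using only the standard library. The helper name `split_dicom_path`, the `.dcm` extension and the directory names in the example are assumptions for illustration.

```python
from pathlib import PurePosixPath

def split_dicom_path(p):
    """Return (study, series, image) from a .../study/series/image.dcm path."""
    parts = PurePosixPath(p).parts
    study, series, filename = parts[-3:]
    return study, series, PurePosixPath(filename).stem

# Placeholder path; the directory and file names here are invented.
example = '/kaggle/input/siim-covid19-detection/train/study123/series456/image789.dcm'
print(split_dicom_path(example))
```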

In [None]:
imgs = [dicom2array(p) for p in dicom_paths[:4]]  # p avoids shadowing the global path
plot_imgs(imgs)


## Thanks for reading! More to follow!