# SIIM-FISABIO-RSNA COVID-19 Detection: A simple EDA 

In this competition, we are provided with DICOM chest radiographs and asked to identify and localize COVID-19 abnormalities. This is important because the typical diagnosis of COVID-19 requires molecular testing (polymerase chain reaction), which takes several hours, while chest radiographs can be obtained in minutes. However, it is hard to distinguish COVID-19 pneumonia from other viral and bacterial pneumonias on a radiograph. Therefore, in this competition, we hope to develop AI that will eventually help radiologists diagnose the millions of COVID-19 patients more confidently and quickly.

I'll provide a quick and simple EDA to help you get started with this very interesting competition!

# Imports
Let's start out by setting up our environment by importing the required modules:

In [None]:
# Thanks to https://www.kaggle.com/awsaf49/pydicom-conda-helper for pydicom files

# !wget 'https://anaconda.org/conda-forge/libjpeg-turbo/2.1.0/download/linux-64/libjpeg-turbo-2.1.0-h7f98852_0.tar.bz2' -q
# !wget 'https://anaconda.org/conda-forge/libgcc-ng/9.3.0/download/linux-64/libgcc-ng-9.3.0-h2828fa1_19.tar.bz2' -q
# !wget 'https://anaconda.org/conda-forge/gdcm/2.8.9/download/linux-64/gdcm-2.8.9-py37h500ead1_1.tar.bz2' -q
# !wget 'https://anaconda.org/conda-forge/conda/4.10.1/download/linux-64/conda-4.10.1-py37h89c1867_0.tar.bz2' -q
# !wget 'https://anaconda.org/conda-forge/certifi/2020.12.5/download/linux-64/certifi-2020.12.5-py37h89c1867_1.tar.bz2' -q
# !wget 'https://anaconda.org/conda-forge/openssl/1.1.1k/download/linux-64/openssl-1.1.1k-h7f98852_0.tar.bz2' -q

!conda install 'libjpeg-turbo-2.1.0-h7f98852_0.tar.bz2' -c conda-forge -y
!conda install 'libgcc-ng-9.3.0-h2828fa1_19.tar.bz2' -c conda-forge -y
!conda install 'gdcm-2.8.9-py37h500ead1_1.tar.bz2' -c conda-forge -y
!conda install 'conda-4.10.1-py37h89c1867_0.tar.bz2' -c conda-forge -y
!conda install 'certifi-2020.12.5-py37h89c1867_1.tar.bz2' -c conda-forge -y
!conda install 'openssl-1.1.1k-h7f98852_0.tar.bz2' -c conda-forge -y

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os
import glob
from tqdm.notebook import tqdm
import matplotlib.pyplot as plt
from skimage import exposure
import cv2
import warnings
from fastai.vision.all import *
warnings.filterwarnings('ignore')

In [None]:
pd.options.display.max_columns = 500
pd.options.display.max_rows=1000

In [None]:
from fastai.basics import *
from fastai.callback.all import *
from fastai.vision.all import *
from fastai.medical.imaging import *

import pydicom
from pydicom.pixel_data_handlers.util import apply_voi_lut

# A look at the provided data

Let's check what data is available to us:

In [None]:
dataset_path = Path('../input/siim-covid19-detection')

In [None]:
dataset_path.ls()

We can see that we have:

* `train_study_level.csv` - the train study-level metadata, with one row for each study, including correct labels.
* `train_image_level.csv` - the train image-level metadata, with one row for each image, including both correct labels and any bounding boxes in a dictionary format. Some images in both test and train have multiple bounding boxes.
* `sample_submission.csv` - a sample submission file containing all image- and study-level IDs.
* `train` folder - comprises 6,334 chest scans in DICOM format, stored in paths with the form `study`/`series`/`image`
* `test` folder 

The hidden test dataset is of roughly the same scale as the training dataset.
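
Since the image files are nested as `study`/`series`/`image`, the study and image IDs can be read straight off a file path. Here is a minimal sketch, assuming a `.dcm` file extension and using made-up IDs; the helper name `split_dicom_path` is hypothetical and not part of any library:

```python
from pathlib import PurePosixPath

def split_dicom_path(path):
    """Return (study_id, series_id, image_id) from a .../<study>/<series>/<image>.dcm path."""
    p = PurePosixPath(path)
    # The last three path components are study folder / series folder / image file.
    study_id, series_id = p.parts[-3], p.parts[-2]
    return study_id, series_id, p.stem

# Made-up IDs, for illustration only.
example = "../input/siim-covid19-detection/train/aaa111/bbb222/ccc333.dcm"
print(split_dicom_path(example))
```

The file stem is the image ID and the top-level folder is the study ID, which is what the `image_path` helper later in this notebook relies on.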


# A look at the CSVs

Let's check the `train_study_level.csv` file:

In [None]:
train_study_df = pd.read_csv(dataset_path/'train_study_level.csv')
train_image_df = pd.read_csv(dataset_path/'train_image_level.csv')

In [None]:
train_study_df.head()

In [None]:
train_image_df.head()

Let's look at the unique labels:

In [None]:
train_study_df.shape

In [None]:
study_classes = ['Negative for Pneumonia', 'Typical Appearance', 'Indeterminate Appearance', 'Atypical Appearance']
train_study_df[study_classes].value_counts()

As you can see, at the study-level, we are predicting the following classes:
* Negative for Pneumonia
* Typical Appearance
* Indeterminate Appearance
* Atypical Appearance

This is framed as a multi-label classification problem. Interestingly, the studies in the training set each carry a single label, but the competition notes that:
> Studies in the test set may contain more than one label.
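
To check the single-label claim on the training data, one can count the positive classes per row. The sketch below uses toy rows with made-up IDs shaped like `train_study_level.csv`; the helper `positive_labels` is a hypothetical name:

```python
STUDY_CLASSES = [
    "Negative for Pneumonia",
    "Typical Appearance",
    "Indeterminate Appearance",
    "Atypical Appearance",
]

def positive_labels(row):
    """Return the names of the classes flagged with 1 in a study-level row."""
    return [c for c in STUDY_CLASSES if row.get(c, 0) == 1]

# Toy rows with made-up IDs, for illustration only.
rows = [
    {"id": "s1_study", "Negative for Pneumonia": 1, "Typical Appearance": 0,
     "Indeterminate Appearance": 0, "Atypical Appearance": 0},
    {"id": "s2_study", "Negative for Pneumonia": 0, "Typical Appearance": 1,
     "Indeterminate Appearance": 0, "Atypical Appearance": 0},
]

# A row is single-label when exactly one class is flagged.
multi_label_ids = [r["id"] for r in rows if len(positive_labels(r)) != 1]
print(multi_label_ids)
```

On the real dataframe, `(train_study_df[study_classes].sum(axis=1) == 1).all()` should express the same check.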

Let's look at the distribution:


In [None]:
plt.figure(figsize = (10,5))
plt.bar([1,2,3,4], train_study_df[study_classes].values.sum(axis=0))
plt.xticks([1,2,3,4],study_classes)
plt.ylabel('Frequency')
plt.show()

Let's now look at `train_image_level.csv`:

We have our bounding box labels provided in the `label` column. The format is as follows:

`[class ID] [confidence score] [bounding box]`

* class ID - either `opacity` or `none`
* confidence score - confidence from your neural network model. If the class ID is `none`, the confidence is `1`.
* bounding box - typical `xmin ymin xmax ymax` format. If the class ID is `none`, the bounding box is `0 0 1 1`, so the full label reads `none 1 0 0 1 1`.

The bounding boxes are also provided in an easily readable dictionary format in the `boxes` column, and the study that each image belongs to is given in `StudyInstanceUID`.
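
As a rough illustration of the label format, the string can be split into groups of six tokens. This is a sketch with made-up coordinates; `parse_label` is a hypothetical helper, not something provided by the competition:

```python
def parse_label(label):
    """Split a `label` string into one dict per 6-token group."""
    tokens = label.split()
    if len(tokens) % 6 != 0:
        raise ValueError(f"expected a multiple of 6 tokens, got {len(tokens)}")
    boxes = []
    for start in range(0, len(tokens), 6):
        cls, conf, xmin, ymin, xmax, ymax = tokens[start:start + 6]
        boxes.append({
            "class": cls,
            "confidence": float(conf),
            "box": (float(xmin), float(ymin), float(xmax), float(ymax)),
        })
    return boxes

# Made-up coordinates, for illustration only.
print(parse_label("opacity 1 10 20 110 220 opacity 1 300 40 400 240"))
print(parse_label("none 1 0 0 1 1"))
```

An image with two opacities therefore yields two entries, while an image without findings yields the single `none` placeholder.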

Let's take a quick look at the distribution of opacity vs none:

In [None]:
train_image_df['split_label'] = train_image_df.label.apply(lambda x: [x.split()[offs:offs+6] for offs in range(0, len(x.split()), 6) ]) # start, stop, step

In [None]:
train_image_df.head()

In [None]:
# this is only the distribution of image level labels: opacity and none
classes_freq = []
for i in range(len(train_image_df)):
    for j in train_image_df.iloc[i].split_label: classes_freq.append(j[0])
plt.hist(classes_freq)
plt.ylabel('Frequency')

print(classes_freq[:10])

We see more labels than images because some images carry more than one label: they contain more than one opacity.
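
To make the label-versus-image count concrete, here is a small sketch on made-up label strings. It counts one class ID per 6-token group, which mirrors what the loop above does:

```python
from collections import Counter

# Made-up `label` strings: two images with opacities, one without.
labels = [
    "opacity 1 10 20 110 220 opacity 1 300 40 400 240",
    "none 1 0 0 1 1",
    "opacity 1 50 60 150 260",
]

class_counts = Counter()
for label in labels:
    tokens = label.split()
    # Every 6th token, starting at index 0, is a class ID.
    class_counts.update(tokens[0::6])

print(len(labels), "images,", sum(class_counts.values()), "labels")
print(class_counts)
```

Note that each `none` entry comes with the placeholder box `0 0 1 1`, so it contributes an area of 1 if it is not filtered out before computing box areas.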

Let's also look at the distribution of the bounding box areas:

In [None]:
train_image_df.head()

In [None]:
bbox_areas = []
for i in range(len(train_image_df)):
    for j in train_image_df.iloc[i].split_label:
        bbox_areas.append((float(j[4])-float(j[2]))*(float(j[5])-float(j[3])))
plt.hist(bbox_areas)
plt.ylabel('Frequency')

print(bbox_areas[:10])

# A look at the images

Okay, let's now look at some example images:

In [None]:
# https://www.kaggle.com/raddar/convert-dicom-to-np-array-the-correct-way
    
def dicom2array(path, voi_lut=False, fix_monochrome=True):
    dicom = pydicom.dcmread(path)
    # VOI LUT (if available by DICOM device) is used to
    # transform raw DICOM data to "human-friendly" view
    if voi_lut:
        data = apply_voi_lut(dicom.pixel_array, dicom)
    else:
        data = dicom.pixel_array
    # depending on this value, X-ray may look inverted - fix that:
    if fix_monochrome and dicom.PhotometricInterpretation == "MONOCHROME1":
        data = np.amax(data) - data
    data = data - np.min(data)
    data = data / np.max(data)
    data = (data * 255).astype(np.uint8)
    return data
        
    
def plot_img(img, size=(7, 7), is_rgb=True, title="", cmap='gray'):
    plt.figure(figsize=size)
    plt.imshow(img, cmap=cmap)
    plt.suptitle(title)
    plt.show()


def plot_imgs(imgs, cols=4, size=7, is_rgb=True, title="", cmap='gray', img_size=(500,500)):
    rows = max(1, (len(imgs) + cols - 1) // cols)  # ceiling division, avoids a trailing empty row
    fig = plt.figure(figsize=(cols*size, rows*size))
    for i, img in enumerate(imgs):
        if img_size is not None:
            img = cv2.resize(img, img_size)
        fig.add_subplot(rows, cols, i+1)
        plt.imshow(img, cmap=cmap)
    plt.suptitle(title)
    plt.show()

In [None]:
dicom_paths = get_dicom_files(dataset_path/'train')[:10]
imgs = [dicom2array(path) for path in dicom_paths]
plot_imgs(imgs)

Let's actually look at how many images are available per study:

In [None]:
num_images_per_study = []
for study_dir in (dataset_path/'train').ls():
    n_images = len(get_dicom_files(study_dir))  # scan each study folder only once
    num_images_per_study.append(n_images)
    if n_images > 2:
        print(f'Study {study_dir} had {n_images} images')

In [None]:
plt.hist(num_images_per_study)

In [None]:
def image_path(row):
    study_path = dataset_path/'train'/row.StudyInstanceUID
    for i in get_dicom_files(study_path):
        if row.id.split('_')[0] == i.stem: return i 
        
train_image_df['image_path'] = train_image_df.apply(image_path, axis=1)

In [None]:
train_image_df['image_path'].iloc[0].stem, train_image_df['image_path'].iloc[0]

In [None]:
train_image_df.head(1)

In [None]:
imgs = []
image_paths = train_image_df['image_path'].values

# box line thickness in pixels and downscale factor applied before drawing
thickness = 10
scale = 1


for _ in range(8):
    # use `img_path` so the `image_path` function defined earlier is not overwritten
    img_path = random.choice(image_paths)
    print(img_path)
    img = dicom2array(path=img_path)
    img = cv2.resize(img, None, fx=1/scale, fy=1/scale)
    img = np.stack([img, img, img], axis=-1)
    # draw every opacity box of this image in red
    for box in train_image_df.loc[train_image_df['image_path'] == img_path].split_label.values[0]:
        if box[0] == 'opacity':
            img = cv2.rectangle(img,
                                (int(float(box[2])/scale), int(float(box[3])/scale)),
                                (int(float(box[4])/scale), int(float(box[5])/scale)),
                                [255,0,0], thickness)
    
    img = cv2.resize(img, (500,500))
    imgs.append(img)
    
plot_imgs(imgs, cmap='gray')

## Combined analysis

In [None]:
train_image_df.head()

In [None]:
def get_label(row):
    for c in train_study_df.columns:
        if row[c] == 1:
            return str.lower(c.split(" ")[0])

train_study_df['study_label'] = train_study_df.apply(get_label, axis=1)

In [None]:
train_study_df['StudyInstanceUID'] = train_study_df['id'].apply(lambda x: x.split('_')[0])

In [None]:
train_study_df = train_study_df.rename(columns={'id': 'study_id'})
train_image_df = train_image_df.rename(columns={'id': 'image_id'})
train_study_df.head()

In [None]:
train_image_df.shape, train_study_df.shape

In [None]:
train_image_df['ImageInstanceUID'] = train_image_df['image_id'].apply(lambda x: x.split('_')[0])
train_image_df.head(2)

In [None]:
def get_num_opacities(labels):
    num_opacities = 0
    for i in labels:
        if i[0] == 'opacity':
            num_opacities += 1
    return num_opacities

In [None]:
combined_df = train_image_df.merge(train_study_df, on='StudyInstanceUID')
combined_df['num_opacities'] = combined_df.split_label.apply(get_num_opacities)
combined_df.to_csv('study_image_combined_df.csv', index=False)
combined_df.head(2)

In [None]:
train_study_df.shape, train_image_df.shape, combined_df.shape

In [None]:
# image_level_dicom_data = pd.DataFrame.from_dicoms(combined_df.image_path.values)
# len(image_level_dicom_data)
                          
# train_image_dicom_df = pd.DataFrame(image_level_dicom_data)
# train_image_dicom_df['ImageInstanceUID'] = train_image_dicom_df.fname.apply(lambda x: Path(x).stem)
# train_image_dicom_df.to_csv('image_dicom_data.csv', index=False)

In [None]:
# train_image_dicom_df[train_image_dicom_df['StudyDate']=='284b97d038b7']

# study_image_dicom_combined_df = combined_df.merge(train_image_dicom_df, on='ImageInstanceUID')
# study_image_dicom_combined_df.drop('StudyInstanceUID_x', axis=1)
# study_image_dicom_combined_df = study_image_dicom_combined_df.rename(columns={'StudyInstanceUID_x': 'StudyInstanceUID'})
# study_image_dicom_combined_df['num_opacities'] = study_image_dicom_combined_df.split_label.apply(get_num_opacities)
# study_image_dicom_combined_df.to_csv('study_image_dicom_combined_data.csv', index=False)

# Image level analysis

In [None]:
# combined_df = pd.read_csv('../input/siimcovidcsvfiles/study_image_dicom_combined_data.csv')

In [None]:
combined_df.head(2)

In [None]:
combined_df['study_label'].value_counts()

In [None]:
combined_df['study_label'].count()

In [None]:
combined_df['ImageInstanceUID'].unique().shape, combined_df['StudyInstanceUID'].unique().shape 

In [None]:
(combined_df[(combined_df['num_opacities']==0) &
                              (combined_df['study_label']=='negative')]['study_label'].value_counts(), 

 combined_df[(combined_df['num_opacities']==0) &
                              (combined_df['study_label']=='negative')]['StudyInstanceUID'].unique().shape)

As expected, none of the images in negative studies has a bounding box. On top of that, there are 1736 such images but only 1676 unique studies, so 60 of them (1736 - 1676 = 60) are additional images in a study that already has one.
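
The subtraction above can be sanity-checked study by study. The following sketch uses made-up image and study IDs and only illustrates the arithmetic:

```python
from collections import Counter

# Made-up (image_id, study_id) pairs for negative images.
negative_images = [
    ("img1", "studyA"),
    ("img2", "studyA"),
    ("img3", "studyB"),
    ("img4", "studyC"),
    ("img5", "studyC"),
]

images_per_study = Counter(study for _, study in negative_images)
n_images = len(negative_images)
n_studies = len(images_per_study)

# Images beyond the first one in each study.
extra_images = n_images - n_studies
print(n_images, n_studies, extra_images)

# The same number, computed study by study.
per_study_extra = sum(n - 1 for n in images_per_study.values())
print(per_study_extra)
```

With the counts reported above, the same calculation gives 1736 - 1676 = 60.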


# Bad and duplicate data
Let's look at the images which have no bounding box but are labelled as not negative 


In [None]:
combined_df['StudyInstanceUID'].value_counts()

In [None]:
studies_with_more_than_1_image = \
combined_df[combined_df['StudyInstanceUID'].isin([s for s, i in combined_df['StudyInstanceUID'].value_counts().items() if i>1])]

In [None]:
(studies_with_more_than_1_image['StudyInstanceUID'].unique().shape, 
studies_with_more_than_1_image.study_label.value_counts(), 
studies_with_more_than_1_image.shape)

There are 232 unique studies with more than 1 image, and we can also see how the duplicate images are distributed across these studies.

We are dealing with 3 types of bad data:
- Duplicate images in studies, which don't add much value
- Images which are labelled as non-negative but don't have proper bounding boxes
- Images which are both, i.e. are duplicated in a study and have no bounding boxes
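
The categories above can be written down as two independent flags per image row. This is only a sketch of the idea on made-up values; `bad_data_flags` and the flag names are hypothetical:

```python
def bad_data_flags(num_opacities, study_label, images_in_study):
    """Return the set of issues that apply to one image row."""
    flags = set()
    if images_in_study > 1:
        flags.add("duplicate")
    if num_opacities == 0 and study_label != "negative":
        flags.add("missing_boxes")
    return flags

# Made-up rows: (num_opacities, study_label, images_in_study)
print(bad_data_flags(0, "negative", 2))   # duplicate only
print(bad_data_flags(0, "typical", 1))    # missing boxes only
print(bad_data_flags(0, "atypical", 3))   # both issues
print(bad_data_flags(2, "typical", 1))    # neither issue
```

Keeping the two flags separate makes the third category fall out naturally as the rows where both apply.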

In [None]:
combined_df[(combined_df['num_opacities']==0) &
                              (combined_df['study_label']!='negative')]['study_label'].value_counts()

In [None]:
non_negative_no_bb_images = combined_df[(combined_df['num_opacities']==0) &
                              (combined_df['study_label']!='negative')]
non_negative_no_bb_images.shape

**Okay, so now we have**
- `non_negative_no_bb_images`: images that belong to non-negative studies but have no bounding boxes, which is odd
- `combined_df`: the combined study and image data, where each study-level label is mapped to all of that study's images
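
Mapping study-level labels onto images is a many-to-one join. The sketch below reproduces it on toy tables with made-up IDs. Unlike the notebook's default inner merge, it uses `how="left"` and `validate="many_to_one"` as an extra safety check, which is an assumption of this example rather than what the notebook does:

```python
import pandas as pd

# Toy stand-ins for the image-level and study-level tables (made-up IDs).
image_df = pd.DataFrame({
    "image_id": ["i1_image", "i2_image", "i3_image"],
    "StudyInstanceUID": ["s1", "s1", "s2"],
})
study_df = pd.DataFrame({
    "StudyInstanceUID": ["s1", "s2"],
    "study_label": ["typical", "negative"],
})

# validate="many_to_one" raises if a study ID appears twice in study_df,
# so the merge cannot silently multiply image rows.
combined = image_df.merge(study_df, on="StudyInstanceUID",
                          how="left", validate="many_to_one")
print(combined)
```

Both images of study `s1` receive the same `study_label`, and the number of rows stays equal to the number of images.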

In [None]:
# separate the images which have proper labels and have co-ordinates if applicable vs the images under suspicion
combined_df_new = combined_df.drop(non_negative_no_bb_images.index)

In [None]:
combined_df.shape, combined_df_new.shape, non_negative_no_bb_images.shape

Some studies contain several images, only some of which lack bounding boxes. Such studies are still represented in our new dataframe through their remaining images.

In [None]:
non_negative_no_bb_images['StudyInstanceUID'].value_counts()[:25], non_negative_no_bb_images['StudyInstanceUID'].unique().shape, non_negative_no_bb_images.shape

There are 304 images without bounding boxes, spread across 261 non-negative studies.

In [None]:
# len((set(non_negative_no_bb_images['StudyInstanceUID']).union(set(studies_with_more_than_1_image['StudyInstanceUID'].unique()))).difference(
# (set(non_negative_no_bb_images['StudyInstanceUID']).intersection(set(studies_with_more_than_1_image['StudyInstanceUID'].unique())))))


multi_image_study_ids = set(studies_with_more_than_1_image['StudyInstanceUID'].unique())
no_bb_study_ids = set(non_negative_no_bb_images['StudyInstanceUID'].unique())

print("Number of unique non-negative studies with no BB and more than 1 image: {}".format(
    len(no_bb_study_ids.intersection(multi_image_study_ids))))

print("Total unique studies with more than 1 image: {}".format(
    len(multi_image_study_ids)))

# count the studies (set size), not the characters of the message
print("Total negative studies with duplicate images: {}".format(
    len(multi_image_study_ids.difference(no_bb_study_ids))))


## Analyzing negative labelled studies with duplicate images
We will need to separate duplicate images with negative labels
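
The selection in the next cell boils down to set algebra on study IDs: take the multi-image studies and remove those that also contain non-negative images without boxes. A small sketch with made-up IDs shows that removing the overlap is the same as removing the second set outright:

```python
# Made-up study IDs, for illustration only.
multi_image_studies = {"s1", "s2", "s3", "s4"}
studies_with_unboxed_non_negative = {"s3", "s4", "s5"}

overlap = multi_image_studies & studies_with_unboxed_non_negative

# A - (A & B) is identical to A - B.
remaining = multi_image_studies - overlap
same_result = remaining == multi_image_studies - studies_with_unboxed_non_negative

print(sorted(overlap))
print(sorted(remaining))
print(same_result)
```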

In [None]:
multi_image_negative_study_df = combined_df[combined_df['StudyInstanceUID'].isin((set(studies_with_more_than_1_image['StudyInstanceUID'].unique())).difference(
    (set(non_negative_no_bb_images['StudyInstanceUID']).intersection(set(studies_with_more_than_1_image['StudyInstanceUID'].unique())))))]

multi_image_negative_study_df.study_label.value_counts(), multi_image_negative_study_df.StudyInstanceUID.unique().shape

The previous 60 duplicate images can be found here. We'll need to manually select the images we want to use for these studies

In [None]:
bad_images = []
for study_id in multi_image_negative_study_df.StudyInstanceUID.unique():
    print(study_id)
    rows = multi_image_negative_study_df[multi_image_negative_study_df['StudyInstanceUID']==study_id]
    print(rows.ImageInstanceUID)
    try:
        dcmimg = [pydicom.dcmread(i) for i in rows.image_path]
        row_cols = [(item.Rows, item.Columns) for item in dcmimg]
        imgs = [dicom2array(path) for path in rows.image_path]
        print(row_cols)
        img_averages = [im.mean() for im in imgs]
        print(img_averages)
        plot_imgs(imgs)
        
    except Exception as e:
        bad_images.append((study_id, list(rows.ImageInstanceUID), list(rows.image_path)))
        print(e)
        print(f"check {study_id} manually")

There are 113 negative-labelled images spread over 53 studies that contain more than one image. We will need to manually select the image that we want to use in each of these studies.
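
One possible way to shortlist candidates before the manual pass is to compare the same quantities the loop above prints, namely image shape and mean intensity. The sketch below is a heuristic on synthetic arrays, not the method used in this notebook; the helper `likely_duplicates` and its tolerance `tol` are assumptions:

```python
import numpy as np

def likely_duplicates(imgs, tol=1.0):
    """Return index pairs whose shapes match and whose mean intensities differ by less than tol."""
    pairs = []
    for a in range(len(imgs)):
        for b in range(a + 1, len(imgs)):
            same_shape = imgs[a].shape == imgs[b].shape
            if same_shape and abs(imgs[a].mean() - imgs[b].mean()) < tol:
                pairs.append((a, b))
    return pairs

# Synthetic 8-bit "images" standing in for dicom2array output.
rng = np.random.default_rng(0)
base = rng.integers(0, 256, size=(64, 64), dtype=np.uint8)
imgs = [
    base,
    base.copy(),                          # exact copy of the first image
    np.zeros((64, 64), dtype=np.uint8),   # same shape, very different content
    np.zeros((32, 32), dtype=np.uint8),   # different shape
]

print(likely_duplicates(imgs))
```

Matching shape and mean is only a hint that two images may be the same, so flagged pairs would still need a visual check.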

In [None]:
# need help figuring out which images to remove and how to go about this
duplicate_images_negative_label_images = ['0d4d6acc9ed3','93a881fb3292','cdd9e3aaf45a','68ad4b624a6d','cbf0a27f993e','b61f3493c551','0b020a7aff0a','3e7b2ffc97db','ace7a9702770','e96133d06736','9108cdfd43dc','e897ef5c203c','d180fed57716','f208dc529d16','ea2688741043','21518ca15050','bdd3115879aa','a1fa5f79671d','59bc532be971','b0866caa201a','ea516e218fe6','93301812b0e7','a2ee4b862182','d9456aadecbe']
len(duplicate_images_negative_label_images)

In [None]:
print(combined_df_new.shape)
combined_df_new = combined_df_new.drop(
    combined_df_new[combined_df_new.ImageInstanceUID.isin(duplicate_images_negative_label_images)].index)
print(combined_df_new.shape)

## Understanding images with no opacities but a non-negative label

In [None]:
non_negative_no_bb_images.StudyInstanceUID.unique().shape, non_negative_no_bb_images.shape

In [None]:
common_studies = non_negative_no_bb_images[non_negative_no_bb_images['StudyInstanceUID'].isin(combined_df_new['StudyInstanceUID'])]

print("Images with no BB which are part of a non-negative study which has more than 1 image: {}".format(common_studies.shape))
print("Non-negative studies with more than 1 image: {}".format(common_studies['StudyInstanceUID'].unique().shape))
print("Total unique images here: {}".format(common_studies['ImageInstanceUID'].unique().shape))

In [None]:
common_studies.study_label.value_counts(), common_studies.StudyInstanceUID.unique().shape

Here, we get the common studies: studies labelled with a `non-negative` tag in which some images have opacities marked and some don't.

We can see that out of the 261 studies containing images with no bounding boxes, 177 have at least 1 image with a bounding box, and we've captured that image in our dataframe `combined_df_new` along with the study details.

In [None]:
# in the original combined dataframe, we can see that there are 177 studies with more than 1 image. We capture the images for these 177 studies
# which have at least 1 bounding box. We can ignore the rest of the images for these studies since a study will have 1 label and all the images under
# that study should have the same label
combined_df_new[combined_df_new['StudyInstanceUID'].isin(common_studies['StudyInstanceUID'])]['StudyInstanceUID'].shape

We can see that the 177 unique studies that we have in common (which have some images with bounding boxes and some without) are covered in the dataframe from which we separated out the bad data.

Now, we've separated the non-negative images with no bounding boxes from the images which have bounding boxes.

So, we have 177 studies which contain both images with an opacity and images without one. These 177 studies have multiple images attached to them, and we've filtered the images with no opacity out of the DF. Let's check how many images the remaining studies have in the DF:
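
The split between the two groups of studies can be sketched on toy data. The rows and IDs below are made up, and the names `common` and `uncommon` simply mirror the dataframes used in this notebook:

```python
from collections import defaultdict

# Made-up rows: (study_id, image_id, num_opacities), all from non-negative studies.
rows = [
    ("s1", "i1", 2), ("s1", "i2", 0),   # mixed: one image with boxes, one without
    ("s2", "i3", 0),                    # no image with boxes
    ("s3", "i4", 1),                    # nothing missing
]

opacities_by_study = defaultdict(list)
for study_id, _, num_opacities in rows:
    opacities_by_study[study_id].append(num_opacities)

# "Common" studies: some image lacks boxes, but another image has them.
common = {s for s, n in opacities_by_study.items() if 0 in n and any(n)}
# "Uncommon" studies: no image in the study has a box.
uncommon = {s for s, n in opacities_by_study.items() if not any(n)}

# For common studies, only the images that carry boxes are kept.
kept = [(s, i) for s, i, n in rows if n > 0]
print(common, uncommon, kept)
```

Study `s1` stays in the data through image `i1`, while `s2` has no boxed image at all and needs a separate decision.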

In [None]:
uncommon_studies = non_negative_no_bb_images[~non_negative_no_bb_images['StudyInstanceUID'].isin(common_studies['StudyInstanceUID'])]
uncommon_studies.StudyInstanceUID.unique().shape, uncommon_studies.shape

In [None]:
uncommon_studies.study_label.value_counts()

In [None]:
uncommon_studies[uncommon_studies['study_label']=='typical']

In [None]:
uncommon_studies = uncommon_studies.drop(1793)

There are 86 atypical images with no opacity defined and 1 typical image with no opacity defined. The latter seems like an anomaly, so we can probably remove it and use the other 86 atypical ones. Let's also make sure that these are not already accounted for:

In [None]:
set(uncommon_studies.StudyInstanceUID).intersection(set(combined_df_new.StudyInstanceUID))

In [None]:
uncommon_studies.StudyInstanceUID.value_counts()[:5]

In [None]:
bad_images = []
for study_id in ["0d9709b3af74", "784afbafee30"]:
    print(study_id)
    rows = uncommon_studies[uncommon_studies['StudyInstanceUID']==study_id]
    print(rows.ImageInstanceUID)
    try:
        dcmimg = [pydicom.dcmread(i) for i in rows.image_path]
        row_cols = [(item.Rows, item.Columns) for item in dcmimg]
        imgs = [dicom2array(path) for path in rows.image_path]
        print(row_cols)
        print([im.mean() for im in imgs])
        plot_imgs(imgs)
    except Exception as e:
        bad_images.append((study_id, list(rows.ImageInstanceUID), list(rows.image_path)))
        print(e)
        print(f"check {study_id} manually")

In [None]:
uncommon_studies = uncommon_studies.drop(
    uncommon_studies[uncommon_studies['ImageInstanceUID'].isin(["efc93a3917b6", "830063223a31"])].index)

In [None]:
uncommon_studies.shape

In [None]:
combined_df_new.shape

In [None]:
final_df = pd.concat([combined_df_new, uncommon_studies])

In [None]:
final_df.shape, train_image_df.shape, studies_with_more_than_1_image.shape

In [None]:
final_df.to_csv('cleaned_data-v1.csv', index=False)

Now, to get all the images that are useful, we need to select from the above data frames.

In [None]:
# let's see the remaining duplicate images available to us

bad_images = []
for study_id, cont in final_df.StudyInstanceUID.value_counts()[:30].items():
    print(study_id)
    rows = final_df[final_df['StudyInstanceUID']==study_id]
    print(rows.ImageInstanceUID)
    try:
        imgs = [dicom2array(path) for path in rows.image_path]
        plot_imgs(imgs)
        dcmimg = [pydicom.dcmread(i) for i in rows.image_path]
        row_cols = [(item.Rows, item.Columns) for item in dcmimg]
        print(row_cols)
    except Exception as e:
        bad_images.append((study_id, list(rows.ImageInstanceUID), list(rows.image_path)))
        print(e)
        print(f"check {study_id} manually")

Most of these images are different enough for us to consider them.