# SIIM Covid-19 Detection Basic Exploratory Data Analysis (EDA)
This notebook provides a basic exploratory data analysis of the SIIM Covid-19 Detection dataset.

# Summary of Findings
## Training Set
- The dataset contains 6054 annotated training studies.
- Each study has one of 4 labels.
- Three labels indicate Covid-19 positive and 1 label indicates Covid-19 negative.
- Study labels are not balanced.
- Each study contains between 1 and 9 training images, with most studies consisting of 1 or 2 training images.
- The dataset contains 6334 annotated training images.
- Each training image has between 0 and 8 bounding boxes, with most images having 0 to 2 bounding boxes.
- **The dataset contains several duplicate images with different bounding boxes marked.**
- The image height and width vary roughly between 1500 and 4000 pixels.
- The width-to-height ratio of the images is approximately 1.2.

## Test Set
- The dataset contains 1214 test studies.
- Each study contains between 1 and 7 images with most studies containing 1 or 2 images.
- The dataset contains 1263 images.
- **The dataset contains several duplicate images.**
- The image height and width vary roughly between 1500 and 4000 pixels.
- The width-to-height ratio of the images is approximately 1.2.

# References
The following references were used in this notebook.
- Notebook with all duplicate images: https://www.kaggle.com/kwk100/siim-covid-19-duplicate-training-images
- Reading dicom images: https://www.kaggle.com/trungthanhnguyen0502/eda-vinbigdata-chest-x-ray-abnormalities
 

# Imports

In [None]:
!conda install gdcm -c conda-forge -y

In [None]:
import cv2
import datetime
import gc
import glob
import imagehash
import math
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
import PIL
import pydicom
from pydicom.pixel_data_handlers.util import apply_voi_lut
import sys
import tqdm

# Parameters

In [None]:
base_path = '../input/siim-covid19-detection'

# Utility Functions

In [None]:
def read_dicom_image(image_file, voi_lut=True, fix_monochrome=True):
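    """
    Reads a DICOM image from a file and returns it as a numpy array scaled to 0-255.
    References: https://www.kaggle.com/trungthanhnguyen0502/eda-vinbigdata-chest-x-ray-abnormalities
    Args:
        image_file: Path to the DICOM file.
        voi_lut: If True, apply the VOI LUT stored in the DICOM file to the pixel data.
        fix_monochrome: If True, invert MONOCHROME1 images so that all images share the same polarity.

    Returns:
        A numpy array of dtype uint8 with the pixel values scaled to the range 0 to 255.

    """
    # dcmread replaces the deprecated pydicom.read_file
    dicom = pydicom.dcmread(image_file)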
    # VOI LUT (if available by DICOM device) is used to
    # transform raw DICOM data to "human-friendly" view
    if voi_lut:
        data = apply_voi_lut(dicom.pixel_array, dicom)
    else:
        data = dicom.pixel_array
    # depending on this value, X-ray may look inverted - fix that:
    if fix_monochrome and dicom.PhotometricInterpretation == "MONOCHROME1":
        data = np.amax(data) - data
    data = data - np.min(data)
    data = data / np.max(data)
    data = (data * 255).astype(np.uint8)
    return data


def string2boxes(string):
    strings = string.split()
    if strings[0].lower() == 'none':
        return []
    else:
        return [{'class': strings[idx],
                 'conf': float(strings[idx+1]),
                 'x1': float(strings[idx+2]),
                 'y1': float(strings[idx+3]),
                 'x2': float(strings[idx+4]),
                 'y2': float(strings[idx+5]),
                 } for idx in range(0, len(strings), 6)]

    
def plot_image(image, boxes=None, size=(5,5), title=None, columns=4):
    def plot_img(image, boxes=None, title=None):
        if isinstance(image, str):
            image_id = os.path.splitext(os.path.split(image)[1])[0]
            df = df_image.loc[df_image['id'] == image_id + '_image']
            boxes = string2boxes(df['label'].iloc[0]) if len(df) > 0 else None
            image = read_dicom_image(image)
        image = np.stack([image] * 3, axis=-1)
        if boxes is not None:
            for box in boxes:
                image = cv2.rectangle(image, (int(box['x1']), int(box['y1'])), (int(box['x2']), int(box['y2'])), [0, 255, 0], 10)
        plt.axis('on')
        plt.imshow(image, cmap='gray')
        if title is not None:
            plt.title(title)

    plt.figure(figsize=size)
    if isinstance(image, list):
        num = len(image)
        columns = min(columns, num)
        rows = math.ceil(num / columns)

        for index, single_image in enumerate(image):
            plt.subplot(rows, columns, index + 1)
            plot_img(single_image, boxes=boxes, title=None if title is None else title[index])
    else:
        plot_img(image, boxes=boxes, title=title)
    plt.show()


def images_find_duplicates(image_files, threshold=0.9):
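    """
    Finds duplicate images by comparing four 64-bit image hashes per image.
    References: https://www.kaggle.com/appian/let-s-find-out-duplicate-images-with-imagehash
    Args:
        image_files: List of paths to the DICOM files to compare.
        threshold: Fraction of matching hash bits (0 to 1) above which two images count as duplicates.

    Returns:
        df_pairs: DataFrame with one row per duplicate pair (columns image1, image2, similarity).
        group_list: List of groups, each a list of files that are duplicates of each other.

    """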
    funcs = [imagehash.average_hash, imagehash.phash, imagehash.dhash, imagehash.whash]
    image_ids = image_files
    hashes = []
    for file in tqdm.tqdm(image_files):
        image = PIL.Image.fromarray(read_dicom_image(file))
        hashes.append(np.array([f(image).hash for f in funcs]).reshape(256))
    hashes_all = np.array(hashes)

    # Comparisons without Pytorch
    sim_list = []
    for i in tqdm.tqdm(range(hashes_all.shape[0])):
        sim_list.append(np.sum(hashes_all[i] == hashes_all, axis=1)/256)

    # nxn-matrix of similarities (n = # of images), upper triangular matrix
    similarities = np.triu(np.array(sim_list), 1)

    idx_pair = np.where(similarities > threshold)
    df_pairs = pd.DataFrame({'image1': [image_ids[i] for i in list(idx_pair[0])],
                             'image2': [image_ids[i] for i in list(idx_pair[1])],
                             'similarity': [similarities[i1, i2] for i1, i2 in zip(idx_pair[0], idx_pair[1])]})

    idx_group = np.zeros(len(image_files))
    group_id = 1
    for i1, i2 in zip(idx_pair[0], idx_pair[1]):
        if idx_group[i1] == 0 and idx_group[i2] == 0:
            idx_group[i1] = group_id
            idx_group[i2] = group_id
            group_id += 1
        elif idx_group[i1] != 0 and idx_group[i2] == 0:
            idx_group[i2] = idx_group[i1]
        elif idx_group[i1] == 0 and idx_group[i2] != 0:
            idx_group[i1] = idx_group[i2]
        elif idx_group[i1] != 0 and idx_group[i2] != 0 and idx_group[i1] != idx_group[i2]:
            common_id = min(idx_group[i1], idx_group[i2])
            idx_group[idx_group == idx_group[i1]] = common_id
            idx_group[idx_group == idx_group[i2]] = common_id

    group_list = []
    for i in range(1, group_id + 1):
        group_ids = list(np.where(idx_group == i)[0])
        if len(group_ids) > 0:
            group_list.append([image_ids[j] for j in group_ids])

    return df_pairs, group_list

# File Structure
The files in the root of the dataset are shown below.
The dataset consists of 2 directories that contain training and test images and 3 csv-files with additional information about the images and the competition.
## Directory Contents

In [None]:
# Use `name` as the loop variable to avoid shadowing the built-in dir()
print('Directories:')
print('\n'.join([name for name in os.listdir(base_path) if os.path.isdir(os.path.join(base_path, name))]))
print('\nFiles:')
print('\n'.join([name for name in os.listdir(base_path) if not os.path.isdir(os.path.join(base_path, name))]))

# CSV Files
## Train_Study_Level.csv
Let's take a look at the contents of the study training file.

In [None]:
df_study = pd.read_csv(os.path.join(base_path, 'train_study_level.csv'))
print(f'Number of rows: {len(df_study)}')
display(df_study)

In [None]:
df_study['Num Labels'] = df_study.iloc[:,1:].sum(axis=1)
print(f'Minimum number of labels per row: {min(df_study["Num Labels"])}')
print(f'Maximum number of labels per row: {max(df_study["Num Labels"])}')
print(f'Number of unique ids: {len(df_study["id"].unique())}')

In each row, exactly one of the 4 classes is associated with the study id.

Since the number of unique studies equals the number of rows, each study occurs only once in the table and is associated with exactly 1 class.
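
Because exactly one class is set per row, the one-hot columns can be collapsed into a single label column. The following is a minimal sketch on a toy frame; the class column names are assumptions made for illustration and should be checked against the real file.

```python
import pandas as pd

# Toy stand-in for train_study_level.csv; the class column names are
# assumed here for illustration and may differ from the real file.
toy_study = pd.DataFrame({
    'id': ['a_study', 'b_study', 'c_study'],
    'Negative for Pneumonia': [1, 0, 0],
    'Typical Appearance': [0, 1, 0],
    'Indeterminate Appearance': [0, 0, 1],
    'Atypical Appearance': [0, 0, 0],
})

class_columns = toy_study.columns[1:5]

# idxmax is only meaningful if every row has exactly one class set.
assert (toy_study[class_columns].sum(axis=1) == 1).all()

# idxmax(axis=1) returns the name of the column that holds the 1.
toy_study['label'] = toy_study[class_columns].idxmax(axis=1)
print(toy_study[['id', 'label']])
```

The same two lines could be applied to `df_study` before the `Num Labels` column is added, since that column would otherwise be picked up by a positional slice that runs to the end of the frame.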

The frequency of the 4 classes in the training set is shown below.

In [None]:
plt.figure(figsize=(10,5))
plt.bar([1,2,3,4], df_study.iloc[:,1:5].sum(axis=0), tick_label=df_study.columns[1:5])
plt.xlabel('Class')
plt.ylabel('Frequency')
plt.grid()

## Train_Image_Level.csv
Let's take a look at the contents of the image training file.

In [None]:
df_image = pd.read_csv(os.path.join(base_path, 'train_image_level.csv'))
print(f'Number of rows: {len(df_image)}')
display(df_image)

- The `label` column contains the bounding boxes for the image as one space-separated string: `bounding box 1` `bounding box 2` ...
  - Each bounding box is given in the format `class` `confidence score` `x1` `y1` `x2` `y2`
- The `boxes` column contains only the x, y, width, and height of each box, in dictionary format.
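
The relation between the two columns can be sketched as follows. This is an illustration only: the helper name `label_to_xywh` and the example string are made up, and it assumes that `x` and `y` in the `boxes` column denote the top-left corner of the box (that is, `x1` and `y1`).

```python
def label_to_xywh(label):
    """Convert a label string into a list of x/y/width/height dictionaries."""
    tokens = label.split()
    if not tokens or tokens[0].lower() == 'none':
        return []
    boxes = []
    # Every box occupies 6 tokens: class, confidence, x1, y1, x2, y2.
    for idx in range(0, len(tokens), 6):
        x1, y1, x2, y2 = (float(t) for t in tokens[idx + 2:idx + 6])
        boxes.append({'x': x1, 'y': y1, 'width': x2 - x1, 'height': y2 - y1})
    return boxes


# Made-up example string with two boxes.
example = 'opacity 1 100.0 200.0 400.0 600.0 opacity 1 50.0 60.0 150.0 260.0'
print(label_to_xywh(example))
print(label_to_xywh('none 1 0 0 1 1'))
```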

# Training Images
The train images are arranged in directories and sub-directories with the structure `study`/`series`/`image`.
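
Given this layout, the study, series, and image ids can be read directly from a file path. The sketch below uses a made-up path and a hypothetical helper `split_image_path`; the `_image` and `_study` suffixes follow the id convention used in the csv files.

```python
from pathlib import PurePosixPath


def split_image_path(path):
    """Split a .../study/series/image.dcm path into its three ids."""
    parts = PurePosixPath(path).parts
    study_id, series_id, file_name = parts[-3], parts[-2], parts[-1]
    image_id = PurePosixPath(file_name).stem  # file name without '.dcm'
    return {'study_id': study_id, 'series_id': series_id, 'image_id': image_id}


# Made-up path that follows the study/series/image structure.
example_path = '../input/siim-covid19-detection/train/study123/series456/image789.dcm'
info = split_image_path(example_path)
print(info)

# Ids as they would appear in the csv files.
print(info['study_id'] + '_study', info['image_id'] + '_image')
```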

## Number of Studies

Each study contains between 1 and 9 series, with most studies containing 1 series.

In [None]:
dir_studies = glob.glob(os.path.join(base_path, 'train/*'))
print(f'Number of studies: {len(dir_studies)}')
dir_series = [glob.glob(dir_study + '/*') for dir_study in dir_studies]
count_series = pd.Series([len(dir_ser) for dir_ser in dir_series])
_ = plt.hist(count_series)
plt.title('Number of series per study')
plt.xlabel('Number of series per study')
plt.ylabel('Frequency')
plt.grid()
plt.show()
print(count_series.value_counts().sort_index().to_string())

## Number of Images
Similarly, each study contains between 1 and 9 images with most studies consisting of 1 image.

In [None]:
dir_images = [glob.glob(dir_study + '/*/*.dcm') for dir_study in dir_studies]
df_study_train = pd.DataFrame({'study_id': [s.split('/')[-1] for s in dir_studies]})
df_study_train['num_images'] = [len(dir_img) for dir_img in dir_images]
_ = plt.hist(df_study_train['num_images'])
plt.title('Number of images per study')
plt.xlabel('Number of images per study')
plt.ylabel('Frequency')
plt.grid()
plt.show()
print(df_study_train['num_images'].value_counts().sort_index().to_string() + '\n')
print('Studies with the most images')
display(df_study_train.sort_values('num_images', ascending=False)[0:10])
df_study_train.to_csv('study_train.csv', index=False)

## Image Display
The next step is to load the list of training files.

In [None]:
train_files = sorted(glob.glob(os.path.join(base_path, 'train/*/*/*.dcm')))
print(f'Number of training files: {len(train_files)}')

Now that the list is loaded, let's take a look at one image full-size and several scaled-down images with their bounding boxes.

In [None]:
plot_image(train_files[7], size=(20,20))
plot_image(train_files[0:16], size=(20,20))

## Bounding Boxes
The number of bounding boxes in an image varies. From the data below, we can see that each image can contain between 0 and 8 bounding boxes, with most of the images having between 0 and 2 boxes.

In [None]:
df_image['box_dict'] = df_image['label'].apply(string2boxes)
df_image['num_boxes'] = df_image['box_dict'].apply(len)
output = plt.hist(df_image['num_boxes'])
plt.xlabel('Number of bounding boxes')
plt.ylabel('Frequency')
plt.grid()
plt.show()
print(df_image['num_boxes'].value_counts().sort_index().to_string())

## Image Size

In [None]:
img_list = []
for file in tqdm.tqdm(train_files):
    img = read_dicom_image(file)
    img_list.append({'file': file, 'width': img.shape[1], 'height': img.shape[0]})
df_images = pd.DataFrame(img_list)
df_images['ratio'] = df_images['width'] / df_images['height']
plt.figure(figsize=(18,5))
plt.subplot(1,2,1)
plt.hist(df_images['width'])
plt.title('Width Distribution of the Images')
plt.xlabel('Width')
plt.ylabel('Frequency')
plt.grid()
plt.subplot(1,2,2)
plt.hist(df_images['height'])
plt.title('Height Distribution of the Images')
plt.xlabel('Height')
plt.ylabel('Frequency')
plt.grid()
plt.show()
plt.figure(figsize=(18,5))
plt.subplot(1,2,1)
plt.scatter(df_images['width'], df_images['height'])
plt.title('Width and Height Distribution of the Images')
plt.xlabel('Width')
plt.ylabel('Height')
plt.grid()
plt.subplot(1,2,2)
plt.hist(df_images['ratio'], bins = 20)
plt.title('Width to Height Ratio of the Images')
plt.xlabel('Width to Height Ratio')
plt.ylabel('Frequency')
plt.grid()

## Duplicate Images
It is worthwhile to check the dataset for duplicate images. For speed reasons, we are only checking the first 200 files for duplicates.
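
The comparison performed by `images_find_duplicates` can be illustrated on synthetic data. In the sketch below, random boolean vectors stand in for the 256-bit combined hashes (4 hash functions with 64 bits each); the similarity of two images is the fraction of matching bits, and pairs above the threshold are reported once. The vectors and variable names are illustrative and no real images are involved.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for the 256-bit combined hashes of three images.
hash_a = rng.integers(0, 2, size=256).astype(bool)
hash_b = hash_a.copy()
hash_b[:8] = ~hash_b[:8]   # flip 8 bits: a near-duplicate of image a
hash_c = ~hash_a           # every bit differs from image a

hashes = np.array([hash_a, hash_b, hash_c])

# n x n matrix holding the fraction of matching bits for every pair.
similarities = (hashes[:, None, :] == hashes[None, :, :]).mean(axis=2)

# Keep each pair once (upper triangle without the diagonal), then threshold.
pairs = np.argwhere(np.triu(similarities, 1) > 0.95)
print(similarities)
print(pairs)
```

With a threshold of 0.95, two images may differ in at most 12 of the 256 bits to be reported as duplicates.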

In [None]:
df_pairs, group_list = images_find_duplicates(train_files[0:200], threshold=0.95)
print(f'\nNumber of duplicate pairs: {len(df_pairs)}')
print(f'Number of duplicate groups: {len(group_list)}')

Since several duplicate images were found in the dataset, we plot some of them. It turns out that in the duplicate images, the bounding boxes are different!

In [None]:
for i, group in enumerate(group_list):
    group_ids = [os.path.basename(file) for file in group]
    print(f'\nGroup {i+1}')
    plot_image(group, size=(20, 10), title=group_ids, columns=8)

# Test Images
The test images are arranged in directories and sub-directories with the structure `study`/`series`/`image`.

## Number of Studies

Each study contains between 1 and 4 series, with most studies containing 1 series.

In [None]:
dir_studies = glob.glob(os.path.join(base_path, 'test/*'))
print(f'Number of studies: {len(dir_studies)}')
dir_series = [glob.glob(dir_study + '/*') for dir_study in dir_studies]
count_series = pd.Series([len(dir_ser) for dir_ser in dir_series])
_ = plt.hist(count_series)
plt.title('Number of series per study')
plt.xlabel('Number of series per study')
plt.ylabel('Frequency')
plt.grid()
plt.show()
print(count_series.value_counts().sort_index().to_string())

## Number of Images
Similarly, each study contains between 1 and 7 images with most studies consisting of 1 image.

In [None]:
dir_images = [glob.glob(dir_study + '/*/*.dcm') for dir_study in dir_studies]
df_study_test = pd.DataFrame({'study_id': [s.split('/')[-1] for s in dir_studies]})
df_study_test['num_images'] = [len(dir_img) for dir_img in dir_images]
_ = plt.hist(df_study_test['num_images'])
plt.title('Number of images per study')
plt.xlabel('Number of images per study')
plt.ylabel('Frequency')
plt.grid()
plt.show()
print(df_study_test['num_images'].value_counts().sort_index().to_string() + '\n')
print('Studies with the most images')
display(df_study_test.sort_values('num_images', ascending=False)[0:10])
df_study_test.to_csv('study_test.csv', index=False)

## Image Display
The next step is to load the list of test files.

In [None]:
test_files = sorted(glob.glob(os.path.join(base_path, 'test/*/*/*.dcm')))
print(f'Number of test files: {len(test_files)}')

Now that the list is loaded, let's take a look at one image full-size and several scaled-down images.

In [None]:
plot_image(test_files[7], size=(20,20))
plot_image(test_files[0:16], size=(20,20))

## Image Size

In [None]:
img_list = []
for file in tqdm.tqdm(test_files):
    img = read_dicom_image(file)
    img_list.append({'file': file, 'width': img.shape[1], 'height': img.shape[0]})
df_images = pd.DataFrame(img_list)
df_images['ratio'] = df_images['width'] / df_images['height']
plt.figure(figsize=(18,5))
plt.subplot(1,2,1)
plt.hist(df_images['width'])
plt.title('Width Distribution of the Images')
plt.xlabel('Width')
plt.ylabel('Frequency')
plt.grid()
plt.subplot(1,2,2)
plt.hist(df_images['height'])
plt.title('Height Distribution of the Images')
plt.xlabel('Height')
plt.ylabel('Frequency')
plt.grid()
plt.show()
plt.figure(figsize=(18,5))
plt.subplot(1,2,1)
plt.scatter(df_images['width'], df_images['height'])
plt.title('Width and Height Distribution of the Images')
plt.xlabel('Width')
plt.ylabel('Height')
plt.grid()
plt.subplot(1,2,2)
plt.hist(df_images['ratio'], bins = 20)
plt.title('Width to Height Ratio of the Images')
plt.xlabel('Width to Height Ratio')
plt.ylabel('Frequency')
plt.grid()
plt.show()

## Duplicate Images
We also check the test images for duplicates. For speed reasons, we are only checking the first 200 files for duplicates.

In [None]:
df_pairs, group_list = images_find_duplicates(test_files[0:200], threshold=0.95)
print(f'\nNumber of duplicate pairs: {len(df_pairs)}')
print(f'Number of duplicate groups: {len(group_list)}')

Since several duplicate images were found in the dataset, we plot some of them.

In [None]:
for i, group in enumerate(group_list):
    group_ids = [os.path.basename(file) for file in group]
    print(f'\nGroup {i+1}')
    plot_image(group, size=(20, 10), title=group_ids, columns=8)

# Sample_Submission.csv
The `sample_submission.csv` file shows the format of the submission file. The submission file contains both the study predictions and the image predictions using a common format with two columns.
- `id`: Id of the study/image followed by '_study' or '_image'
- `PredictionString`: A single string for the prediction for the study/image.
  - for studies: This is the predicted class followed by a confidence score and a one-pixel bounding box '0 0 1 1'
  - for images: class ID ('opacity'/'none'), confidence score, bounding box for each detected object.
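
As a rough sketch of how such prediction strings could be assembled: the helper names below are hypothetical, the class name `typical` and the `none 1 0 0 1 1` string for images without detections are assumptions, and the exact class names and number formatting should be taken from the sample submission itself.

```python
def study_prediction_string(class_name, confidence):
    """Class, confidence, and the one-pixel bounding box '0 0 1 1'."""
    return f'{class_name} {confidence} 0 0 1 1'


def image_prediction_string(boxes):
    """One 'opacity conf x1 y1 x2 y2' group per detected box."""
    if not boxes:
        # Assumed convention for an image without any detected opacity.
        return 'none 1 0 0 1 1'
    return ' '.join(
        f"opacity {b['conf']} {b['x1']} {b['y1']} {b['x2']} {b['y2']}"
        for b in boxes
    )


print(study_prediction_string('typical', 0.8))
print(image_prediction_string([{'conf': 0.5, 'x1': 10, 'y1': 20, 'x2': 110, 'y2': 220}]))
print(image_prediction_string([]))
```

The box dictionaries use the same keys as the output of `string2boxes`, so parsed labels can be written back in the same format.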

In [None]:
df_submission = pd.read_csv(os.path.join(base_path,'sample_submission.csv'))
display(df_submission)