<center><h1 style="color:darkblue">VinBigData Chest X-ray Abnormalities Detection</h1></center>
<center><h1 style="color:brown">Automatically localize and classify thoracic abnormalities from chest radiographs</h1></center>
<img src="https://storage.googleapis.com/kaggle-competitions/kaggle/24800/logos/header.png?t=2020-12-17-19-26-15">

# 1. About the Competition

When you have a broken arm, radiologists help save the day—and the bone. These doctors diagnose and treat medical conditions using imaging techniques like CT and PET scans, MRIs, and, of course, X-rays. Yet, as it happens when working with such a wide variety of medical tools, radiologists face many daily challenges, perhaps the most difficult being the chest radiograph. <span style="color:orange">The interpretation of chest X-rays can lead to medical misdiagnosis</span>, even for the best practicing doctor. Computer-aided detection and diagnosis systems (CADe/CADx) would help reduce the pressure on doctors at metropolitan hospitals and improve diagnostic quality in rural areas.

Existing methods of interpreting chest X-ray images classify them into a list of findings. ***There is currently no specification of their locations on the image which sometimes leads to inexplicable results***. A solution for <span style="color:brown">localizing findings on chest X-ray images</span> is needed for providing doctors with more meaningful diagnostic assistance.

In this competition:
- **Task**: Automatically localize and classify <span style="color:blue">14 types of thoracic abnormalities</span> from chest radiographs. 
- **Dataset**: <span style="color:blue">18,000 scans</span> in total: <span style="color:blue">15,000 train images</span>, with submissions evaluated on a test set of <span style="color:blue">3,000 images</span>. 

- These annotations were collected via VinBigData's web-based platform, VinLab. Details on building the dataset can be found in the organizer's recent paper [“VinDr-CXR: An open dataset of chest X-rays with radiologist's annotations”](https://storage.googleapis.com/kaggle-media/competitions/VinBigData/VinDr_CXR_data_paper.pdf).

# 2. Data

# 2.1 Intro

In this competition, we are classifying common thoracic lung diseases and localizing critical findings. This is **an object detection and classification** problem.

> For each test image, you will be predicting a bounding box and class for all findings. If you predict that there are no findings, you should create a prediction of "14 1 0 0 1 1" (14 is the class ID for no finding, and this provides a one-pixel bounding box with a confidence of 1.0).
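
The prediction string described above can be assembled roughly as in the sketch below. The helper name `format_prediction_string` and the tuple layout `(class_id, confidence, x_min, y_min, x_max, y_max)` are assumptions made for illustration; only the `14 1 0 0 1 1` fallback comes from the competition description.

```python
def format_prediction_string(detections):
    """Join detections into one space-separated prediction string.

    `detections` is a list of (class_id, confidence, x_min, y_min, x_max, y_max)
    tuples. This layout is an assumption for illustration.
    """
    if not detections:
        # No findings: class 14 with a one-pixel box and confidence 1.0.
        return "14 1 0 0 1 1"
    parts = []
    for class_id, conf, x_min, y_min, x_max, y_max in detections:
        parts.append(
            f"{int(class_id)} {conf:g} {int(x_min)} {int(y_min)} {int(x_max)} {int(y_max)}"
        )
    return " ".join(parts)


print(format_prediction_string([]))
print(format_prediction_string([(3, 0.9, 100, 200, 300, 400)]))
```

An image with several findings simply concatenates one six-value group per box.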

The images are in DICOM format, which means they contain additional data that might be useful for visualizing and classifying.

# 2.2 Dataset information

The dataset comprises 18,000 postero-anterior (PA) CXR scans in DICOM format, which were de-identified to protect patient privacy. All images were labeled by a panel of experienced radiologists for the presence of 14 critical radiographic findings as listed below:

|class_id|class_name|
|---|---|
|0 | Aortic enlargement|
|1 | Atelectasis|
|2 | Calcification|
|3 | Cardiomegaly|
|4 | Consolidation|
|5| ILD|
|6 | Infiltration|
|7 | Lung Opacity|
|8 | Nodule/Mass|
|9 | Other lesion|
|10 | Pleural effusion|
|11 | Pleural thickening|
|12 | Pneumothorax|
|13 | Pulmonary fibrosis|
|14 | No finding|

> The "No finding" observation (14) was intended to capture the absence of all findings above.

***Note that a key part of this competition is working with ground truth from multiple radiologists.***

# 3. EDA

In [None]:
import os
import gc
import time
import random
import numpy as np
import pandas as pd
import seaborn as sns
from tqdm import tqdm
import matplotlib.pyplot as plt
import matplotlib.patches as ptc

from functools import partial
import multiprocessing as mpc
from joblib import Parallel, delayed

import pydicom as pdc
from pydicom.pixel_data_handlers.util import apply_voi_lut

import warnings
warnings.filterwarnings("ignore")

%matplotlib inline

In [None]:
train_dir = "../input/vinbigdata-chest-xray-abnormalities-detection/train"
test_dir = "../input/vinbigdata-chest-xray-abnormalities-detection/test"

train_files = os.listdir(train_dir)
test_files = os.listdir(test_dir)

train_df = pd.read_csv("../input/vinbigdata-chest-xray-abnormalities-detection/train.csv")
sample_submission = pd.read_csv("../input/vinbigdata-chest-xray-abnormalities-detection/sample_submission.csv")

# 3.1 Train.csv

The train set metadata, with one row for each object, including a class and a bounding box. Some images in both test and train have multiple objects.

## Columns

- **image_id** - unique image identifier
- **class_name** - the name of the class of detected object (or "No finding")
- **class_id** - the ID of the class of detected object
- **rad_id** - the ID of the radiologist that made the observation
- **x_min** - minimum X coordinate of the object's bounding box
- **y_min** - minimum Y coordinate of the object's bounding box
- **x_max** - maximum X coordinate of the object's bounding box
- **y_max** - maximum Y coordinate of the object's bounding box
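
Because every row carries a `rad_id`, the same finding can show up several times for one image, once per radiologist. The sketch below counts how many distinct radiologists reported each class on each image. The rows are made-up toy values rather than data from train.csv, and the helper name `radiologist_votes` is hypothetical.

```python
from collections import defaultdict

# Toy annotation rows: (image_id, class_name, rad_id). Illustrative values only.
rows = [
    ("img_a", "Cardiomegaly", "R8"),
    ("img_a", "Cardiomegaly", "R9"),
    ("img_a", "Aortic enlargement", "R9"),
    ("img_b", "No finding", "R1"),
    ("img_b", "No finding", "R2"),
    ("img_b", "No finding", "R3"),
]


def radiologist_votes(rows):
    """Map (image_id, class_name) to the number of distinct radiologists reporting it."""
    voters = defaultdict(set)
    for image_id, class_name, rad_id in rows:
        voters[(image_id, class_name)].add(rad_id)
    return {key: len(rads) for key, rads in voters.items()}


votes = radiologist_votes(rows)
for (image_id, class_name), n in sorted(votes.items()):
    print(image_id, class_name, n)
```

On the real data the same idea could be expressed with a pandas `groupby` over `image_id` and `class_name`, counting unique `rad_id` values.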

In [None]:
train_df

In [None]:
train_df.isna().sum().to_frame().rename(columns={0:"Nan_counts"}).style.background_gradient(cmap="cool")

In [None]:
train_df.nunique().to_frame().rename(columns={0:"Unique Values"}).style.background_gradient(cmap="plasma")

In [None]:
plt.figure(figsize=(26, 8))
sns.countplot(x="class_name", data=train_df)
plt.title("Class Name Distribution")
plt.show()

In [None]:
plt.figure(figsize=(8, 8))
sns.countplot(x="class_id", data=train_df)
plt.title("Class ID Distribution")
plt.show()

In [None]:
plt.figure(figsize=(8, 8))
sns.countplot(x="rad_id", data=train_df)
plt.title("RAD ID Distribution")
plt.show()

In [None]:
# sns.pairplot creates its own figure, so a preceding plt.figure() would only add an empty one
sns.pairplot(train_df, hue='class_id')
plt.show()

# 4. DICOM Exploration

In [None]:
print("Number of train images: ", len(train_files))
print("Number of test images: ", len(test_files))

Okay, as described, we have **15000** images for train and **3000** images for test. But the length of train_df is **67914**. So, is there any duplication? Let's run a sanity check on the unique image_ids in train_df.

In [None]:
print("Unique images in train_df: ", train_df.image_id.nunique())

So, it's confirmed that each image has one or more ground-truth rows in our train dataset. Now we have a lot to explore, such as:
- Do the bboxes overlap on a single image?
- What about different classes? Do boxes of different classes overlap too?
- In a single image, what's the maximum number of classes (or targets) present?
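
One common way to quantify whether two boxes overlap is intersection over union (IoU). The sketch below is a minimal version on toy boxes. The function name `iou` and the box values are illustrative and not part of the original notebook.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    # Corners of the intersection rectangle (may be empty).
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])
    inter = max(0, ix_max - ix_min) * max(0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Toy boxes (illustrative values, not from train.csv).
box_1 = (0, 0, 10, 10)
box_2 = (5, 5, 15, 15)
box_3 = (20, 20, 30, 30)

print(iou(box_1, box_2))  # partially overlapping boxes
print(iou(box_1, box_3))  # disjoint boxes
```

Applied per image to every pair of rows, a score like this would show whether boxes of the same class, or of different classes, tend to coincide.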

We will continue to explore these aspects and try to come up with some insights, but let's deal with DICOM first. Let's print some image file names from the train folder.

In [None]:
for _ in range(3):
    # random.choice avoids the IndexError that random.randint's inclusive upper bound can cause
    print(random.choice(train_files))

# 4.1. But What is [DICOM](https://en.wikipedia.org/wiki/DICOM)?

Here is how Wikipedia describes it:

> Digital Imaging and Communications in Medicine (DICOM) is the standard for the communication and management of medical imaging information and related data. DICOM is most commonly used for storing and transmitting medical images enabling the integration of medical imaging devices such as scanners, servers, workstations, printers, network hardware, and picture archiving and communication systems (PACS) from multiple manufacturers. It has been widely adopted by hospitals and is making inroads into smaller applications like dentists' and doctors' offices.

What else?

- DICOM files can be exchanged between two entities that are capable of receiving <span style="color:red">image</span> and <span style="color:orange">patient data</span> in **DICOM format**. 
    
- The different devices come with DICOM Conformance Statements which state which DICOM classes they support. The standard includes a file format definition and a network communications protocol that uses TCP/IP to communicate between systems.

Let's read one sample and grind it.

In [None]:
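# random.choice avoids the off-by-one of random.randint, whose upper bound is inclusive
sample_fn = random.choice(train_df.image_id.to_list())
# dcmread is the current pydicom reader; read_file is its deprecated alias
sample = pdc.dcmread(os.path.join(train_dir, sample_fn + ".dicom"))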
sample

So, for a single patient (yes, there is a one-to-one mapping between patients and image_ids, so there are no duplicates and no patient has more than one image, as confirmed by the organizers in the discussion forum), we have the following entities:

- Patient's Sex                      
- Samples per Pixel                 
- Photometric Interpretation         
- Rows                               
- Columns                          
- Pixel Spacing                      
- Bits Allocated                      
- Bits Stored                         
- High Bit                            
- Pixel Representation             
- Window Center                       
- Window Width                      
- Rescale Intercept                   
- Rescale Slope                       
- Lossy Image Compression             
- Pixel Data                         
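
To give a feel for how Rescale Slope, Rescale Intercept, Window Center and Window Width act on pixel values, here is a simplified sketch of the linear window function from the DICOM standard. It is not how this notebook processes images (that is left to pydicom's `apply_voi_lut`, which also covers other cases such as LUT sequences). The function names and window settings below are hypothetical.

```python
def rescale(stored_value, slope=1.0, intercept=0.0):
    """Apply Rescale Slope and Rescale Intercept to a stored pixel value."""
    return stored_value * slope + intercept


def linear_window(value, center, width, y_min=0.0, y_max=255.0):
    """Map a value into [y_min, y_max] with a linear window (requires width > 1)."""
    lower = center - 0.5 - (width - 1) / 2
    upper = center - 0.5 + (width - 1) / 2
    if value <= lower:
        return y_min  # below the window: clipped to the darkest output
    if value > upper:
        return y_max  # above the window: clipped to the brightest output
    return ((value - (center - 0.5)) / (width - 1) + 0.5) * (y_max - y_min) + y_min


# Hypothetical window settings, chosen only for illustration.
center, width = 2048, 4096
for stored in (0, 2048, 4095, 5000):
    print(stored, round(linear_window(rescale(stored), center, width), 2))
```

Values inside the window are spread linearly across the output range, while everything outside it is clipped.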

*NOTE: Do let me know about the extensions in the comment section, until I update it myself in the next version :)*

In [None]:
def read_dicom_df(fn):
    # Placeholder: reads the DICOM file but does not extract any metadata yet
    _ = pdc.dcmread(os.path.join(train_dir, fn))

#_ = Parallel(-1, verbose=1)(delayed(read_dicom_df)(x) for x in train_files)

# 5. Images (Extension from DICOM)

Update 1: Ref: https://www.kaggle.com/raddar/convert-dicom-to-np-array-the-correct-way by @raddar

# 5.1 Visualising Images

In [None]:
def plot_pixel_array(data, figsize=(10,10)):
    plt.figure(figsize=figsize)
    plt.imshow(data, cmap=plt.cm.bone)
    plt.show()


# ref kernel: https://www.kaggle.com/raddar/convert-dicom-to-np-array-the-correct-way
def read_xray(path, voi_lut=True, fix_monochrome=True):
    dcm_data = pdc.dcmread(path)
    
    def show_dcm_info(data):
        print("Gender :", data.PatientSex)
        if 'PixelData' in data:
            rows = int(data.Rows)
            cols = int(data.Columns)
            print("Image size : {rows:d} x {cols:d}, {size:d} bytes".format(
                rows=rows, cols=cols, size=len(data.PixelData)))
            if 'PixelSpacing' in data:
                print("Pixel spacing :", data.PixelSpacing)
    
    show_dcm_info(dcm_data)
    # VOI LUT (if available by DICOM device) is used to transform raw DICOM data to "human-friendly" view
    if voi_lut:
        data = apply_voi_lut(dcm_data.pixel_array, dcm_data)
    else:
        data = dcm_data.pixel_array
               
    # depending on this value, X-ray may look inverted - fix that:
    if fix_monochrome and dcm_data.PhotometricInterpretation == "MONOCHROME1":
        data = np.amax(data) - data
        
    data = data - np.min(data)
    data = data / np.max(data)
    data = (data * 255).astype(np.uint8)
        
    return data

In [None]:
print("Examining train images...")
for _ in range(5):
    fn = train_files[np.random.randint(0, len(train_files))]
    file_path = os.path.join(train_dir, fn)
    data = read_xray(file_path)
    plot_pixel_array(data)

In [None]:
print("Examining test images...")
for _ in range(5):
    fn = test_files[np.random.randint(0, len(test_files))]
    file_path = os.path.join(test_dir, fn)
    data = read_xray(file_path)
    plot_pixel_array(data)

# 5.2 Examine the localizations

In [None]:
for _ in range(10):
    # idx indexes rows of train_df, so sample over its length rather than the number of files
    idx = np.random.randint(0, len(train_df))
    img_id = train_df.loc[idx, 'image_id']
    img = read_xray(os.path.join(train_dir, img_id+".dicom"))
    plt.figure(figsize=(8, 14))
    plt.imshow(img, cmap='gray')
    plt.title(train_df.loc[idx, 'class_name'])
    
    if train_df.loc[idx, 'class_name'] != 'No finding':
        bbox = [train_df.loc[idx, 'x_min'],
                train_df.loc[idx, 'y_min'],
                train_df.loc[idx, 'x_max'],
                train_df.loc[idx, 'y_max']]
        
        patch = ptc.Rectangle((bbox[0], bbox[1]),
                              bbox[2]-bbox[0],
                              bbox[3]-bbox[1],
                              ec='r', fc='none', lw=2.)
        ax = plt.gca()
        ax.add_patch(patch)

# 6. Sample Submission

In [None]:
sample_submission

In [None]:
sample_submission.to_csv("submission.csv", index=False)

<h1 style="color:red">Work in Progress...</h1>
Thanks for reading till the end, consider upvoting the kernel, if you enjoyed reading it ;)