#  The VinBigData Chest X-ray Classification Competition

In this competition, we localize critical findings and classify common thoracic lung diseases on chest X-rays. Results are submitted in the format of the **'sample_submission.csv'** file: the id of each test image with a prediction string containing the class ID, the confidence value, and the bounding box of each finding. For normal test images, i.e. those with no findings, the submission should be the image_id and the prediction string **`14 1 0 0 1 1`** *(class ID 14 for 'no finding', a confidence of 1.0, and a one-pixel bounding box)*.
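As a rough illustration of this format, the sketch below builds a prediction string from a list of detections. The helper name `format_prediction_string` and the tuple layout are assumptions made for this example; they are not part of the competition's tooling.

```python
NO_FINDING = "14 1 0 0 1 1"  # class 14, confidence 1, one-pixel box


def format_prediction_string(detections):
    """Join (class_id, confidence, x_min, y_min, x_max, y_max) tuples.

    Hypothetical helper: an empty list is treated as a normal image.
    """
    if not detections:
        return NO_FINDING
    parts = []
    for class_id, conf, x_min, y_min, x_max, y_max in detections:
        parts.append(
            f"{int(class_id)} {conf:g} "
            f"{int(x_min)} {int(y_min)} {int(x_max)} {int(y_max)}"
        )
    return " ".join(parts)


print(format_prediction_string([]))
print(format_prediction_string([(0, 0.9, 10, 20, 110, 220), (3, 0.5, 5, 5, 50, 60)]))
```

Several findings on one image are simply concatenated into the same string, six values per finding.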
  
### (This is an ongoing notebook!)


# The Data 


The sample data contains **18000** images: **15000** manually labeled images for training (in the **train** directory) and **3000** for testing (in the **test** directory). There are **14** classes of findings in the training data, plus a "No finding" class, with the following class_id and class_name:

> **`0`** - Aortic enlargement <br>
**`1`** - Atelectasis <br>
**`2`** - Calcification <br>
**`3`** - Cardiomegaly <br>
**`4`** - Consolidation <br>
**`5`** - ILD <br>
**`6`** - Infiltration <br>
**`7`** - Lung Opacity <br>
**`8`** - Nodule/Mass <br>
**`9`** - Other lesion <br>
**`10`** - Pleural effusion <br>
**`11`** - Pleural thickening <br>
**`12`** - Pneumothorax <br>
**`13`** - Pulmonary fibrosis <br>
**`14`** - "No finding"

The last class, 14, means that the x-ray image is normal, i.e. has no findings.

The **'train.csv'** file contains the labels for each training image with the following columns:

> **`image_id`** - unique image identifier<br>
**`class_name`** - one of the 15 class names listed above<br>
**`class_id`** - the ID of the class of detected object<br>
**`rad_id`** - the ID of the radiologist that made the observation<br>
**`x_min`** - minimum X coordinate of the object's bounding box<br>
**`y_min`** - minimum Y coordinate of the object's bounding box<br>
**`x_max`** - maximum X coordinate of the object's bounding box<br>
**`y_max`** - maximum Y coordinate of the object's bounding box
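
To make the column layout concrete, here is a small sketch that derives the box width, height, and area from the four coordinate columns. The toy rows are made up, and it is an assumption of this example that "No finding" rows carry missing (NaN) coordinates.

```python
import numpy as np
import pandas as pd

# Toy stand-in for train.csv (made-up values, not real annotations)
toy = pd.DataFrame({
    "image_id": ["a", "b"],
    "class_id": [3, 14],
    "x_min": [100.0, np.nan],
    "y_min": [200.0, np.nan],
    "x_max": [400.0, np.nan],
    "y_max": [350.0, np.nan],
})

# Box geometry in pixels; NaN coordinates propagate to NaN sizes
toy["box_width"] = toy["x_max"] - toy["x_min"]
toy["box_height"] = toy["y_max"] - toy["y_min"]
toy["box_area"] = toy["box_width"] * toy["box_height"]
print(toy[["image_id", "box_width", "box_height", "box_area"]])
```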


## Import libraries

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
%matplotlib inline
import pydicom
import warnings
warnings.filterwarnings("ignore")
import os



### 1. Data Directory

In [None]:
DataDir = "../input/vinbigdata-chest-xray-abnormalities-detection/"
!ls {DataDir}

In [None]:
Ntrain = len(os.listdir(DataDir+'train'))
Ntest = len(os.listdir(DataDir+'test'))
print(f"\n... The number of training files is {Ntrain} ...")
print(f"... The number of testing files is {Ntest} ...")

### 2. Reading Data

In [None]:
train = pd.read_csv(DataDir+'train.csv')
sub = pd.read_csv(DataDir+'sample_submission.csv')

In [None]:
train.head()

In [None]:
len(train)

The file has 67914 rows for 15000 training images, about 4.5 rows per image on average, which means that most of the images have multiple labels.
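
One way to check this directly is to count the rows per image. The sketch below does so on a toy frame with made-up ids; on the real data the same `groupby` call would be applied to `train`.

```python
import pandas as pd

# Toy stand-in for train.csv (made-up ids, not real data)
toy = pd.DataFrame({
    "image_id": ["a", "a", "a", "b", "b", "c"],
    "class_id": [0, 3, 0, 14, 14, 11],
    "rad_id": ["R8", "R9", "R10", "R1", "R2", "R9"],
})

# Number of annotation rows for each image
rows_per_image = toy.groupby("image_id").size()
print(rows_per_image)
print("mean rows per image:", rows_per_image.mean())
```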

In [None]:
len(sub)

In [None]:
sub.head()

The prediction string contains 6 values, "14 1 0 0 1 1": the class ID (14 for 'no finding'), followed by the confidence value (1 for class 14), then the coordinates of the bounding box (in this case a one-pixel bounding box).
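
Reading such a string back can be sketched as follows. `parse_prediction_string` is a hypothetical helper written for this example, assuming that a string with several findings repeats the same six-value group.

```python
def parse_prediction_string(pred):
    """Split a prediction string into one dict per detection (hypothetical helper)."""
    values = pred.split()
    if len(values) % 6 != 0:
        raise ValueError("expected groups of 6 values per detection")
    detections = []
    for k in range(0, len(values), 6):
        class_id, conf, x_min, y_min, x_max, y_max = values[k:k + 6]
        detections.append({
            "class_id": int(class_id),
            "confidence": float(conf),
            "box": (float(x_min), float(y_min), float(x_max), float(y_max)),
        })
    return detections


print(parse_prediction_string("14 1 0 0 1 1"))
```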
 
The images are in DICOM format, which means they contain additional data that might be useful for visualizing and classifying.

## Converting DICOM Images into Data Arrays (PNG images)

Most DICOM files store pixel values in an exponential scale, which standard DICOM viewers resolve when displaying the image. The DICOM metadata stores the information needed for such "human-friendly" transformations, so to produce a correct jpg/png we need to apply them ourselves.
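
The final rescaling step can be illustrated without any DICOM file. The sketch below min-max scales a synthetic array to 8 bits and optionally inverts it, as is done for `MONOCHROME1` images; it deliberately leaves out the VOI LUT step, and `to_uint8` is a name chosen only for this example.

```python
import numpy as np


def to_uint8(data, invert=False):
    """Min-max scale an array to 0..255; optionally invert (MONOCHROME1-style)."""
    data = data.astype(np.float64)
    if invert:
        data = data.max() - data
    data = data - data.min()
    peak = data.max()
    if peak > 0:  # avoid dividing by zero on a constant image
        data = data / peak
    return (data * 255).astype(np.uint8)


raw = np.array([[0, 2048], [4095, 1024]], dtype=np.uint16)  # synthetic 12-bit values
scaled = to_uint8(raw)
inverted = to_uint8(raw, invert=True)
print(scaled)
print(inverted)
```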

In [None]:
#
# Reading the image as numpy.array
#
dcm_File_Name = DataDir+ 'train/'+train['image_id'][2]+'.dicom'
dcm_file = pydicom.dcmread(dcm_File_Name)

dcm_pixels = dcm_file.pixel_array
dcm_pixels

### Displaying the X-Ray Image

In [None]:
plt.figure(figsize=(10,8))
plt.imshow(dcm_pixels, cmap=plt.cm.gray)
plt.show()

### Now transform image data into a data array

In [None]:
from pydicom.pixel_data_handlers.util import apply_voi_lut

def dicom2png(path, voi_lut = True, fix_monochrome = True):
    dicom = pydicom.dcmread(path)  # read_file is a deprecated alias of dcmread
    
    # VOI LUT (if available by DICOM device) is used to transform raw DICOM data to "human-friendly" view
    if voi_lut:
        data = apply_voi_lut(dicom.pixel_array, dicom)
    else:
        data = dicom.pixel_array
               
    # depending on this value, X-ray may look inverted - fix that:
    if fix_monochrome and dicom.PhotometricInterpretation == "MONOCHROME1":
        data = np.amax(data) - data
        
    data = data - np.min(data)
    data = data / np.max(data)
    data = (data * 255).astype(np.uint8)
        
    return data

In [None]:
#
# Image without fixing Monochrome.
#
png_image = dicom2png(dcm_File_Name, fix_monochrome = False)
plt.figure(figsize = (8,8))
plt.imshow(png_image, 'gray')

In [None]:
#
# Image after fixing Monochrome.
#
png_image = dicom2png(dcm_File_Name)
plt.figure(figsize = (8,8))
plt.imshow(png_image, 'gray')

### Extract Patient and Image Attributes from Raw DICOM Files

In [None]:
#
# Finding the keywords that access the Attributes (Data elements).
#
dcm_file.dir()

In [None]:
def get_dcm_attributes(path):
    # Read only the first 10 files for a quick test
    files = list(os.listdir(path))[0:10]
    # Read all files
    #files = list(os.listdir(path))

    records = []
    for file in files:
        file_path = os.path.join(path, file)
        # Only the header is needed, so skip the pixel data
        dcmData = pydicom.dcmread(file_path, stop_before_pixels=True)
        file_name = file.split(".")[0]

        # PatientAge looks like '045Y'; it can be missing, empty, or just 'Y'
        age_str = str(dcmData.get('PatientAge', ''))
        try:
            age = int(age_str[:-1])
        except ValueError:
            age = np.nan

        # Missing or empty attributes become NaN
        gender = dcmData.get('PatientSex', '') or np.nan
        rows = dcmData.get('Rows', np.nan)
        clmns = dcmData.get('Columns', np.nan)
        # PixelSpacing is [row spacing, column spacing], i.e. [y, x]
        ps = dcmData.get('PixelSpacing') or [np.nan, np.nan]

        records.append({'image_id': file_name,
                        'Age': age, 'Gender': gender, 'Image_Height': rows,
                        'ImageWidth': clmns,
                        'x_spacing': float(ps[1]), 'y_spacing': float(ps[0])})

    # Build the DataFrame once; DataFrame.append was removed in pandas 2.0
    return pd.DataFrame(records, columns=['image_id', 'Age', 'Gender', 'Image_Height',
                                          'ImageWidth', 'x_spacing', 'y_spacing'])

In [None]:
#
# Reading attributes. (it takes several minutes for the whole dataset)
#
TrainDir = DataDir+'train/'
dcm_attr = get_dcm_attributes(TrainDir)
dcm_attr.head(10)

In [None]:
dcm_attr.isna().sum()

## Plotting Bounding Boxes


In [None]:
trainf = train[train['class_id'] != 14]
trainf.head()

In [None]:
from glob import glob
import cv2
import random
from random import randint

In [None]:
    
def plot_img(img, size=(8, 8), is_rgb=True, title="", cmap='gray'):
    plt.figure(figsize=size)
    plt.imshow(img, cmap=cmap)
    plt.suptitle(title)
    plt.show()


def plot_imgs(imgs, cols=4, size=8, is_rgb=True, title="", cmap='gray', img_size=(500,500)):
    rows = (len(imgs) + cols - 1)//cols  # ceiling division, no empty trailing row
    fig = plt.figure(figsize=(cols*size, rows*size))
    for i, img in enumerate(imgs):
        if img_size is not None:
            img = cv2.resize(img, img_size)
        fig.add_subplot(rows, cols, i+1)
        plt.imshow(img, cmap=cmap)
    plt.suptitle(title)
    plt.show()
    

In [None]:
imgs = []
img_ids = trainf['image_id'].values
class_ids = trainf['class_id'].unique()

# map class_id to a color code
class2color = {class_id:[randint(0,255) for i in range(3)] for class_id in class_ids}
thickness = 3
scale = 5


for i in range(8):
    img_id = random.choice(img_ids)
    img_path = f'{DataDir}/train/{img_id}.dicom'
    img = dicom2png(img_path)
    img = cv2.resize(img, None, fx=1/scale, fy=1/scale)
    img = np.stack([img, img, img], axis=-1)
    
    boxes = trainf.loc[trainf['image_id'] == img_id, ['x_min', 'y_min', 'x_max', 'y_max']].values/scale
    # Keep the result 1-D so that zip() also works for images with a single box
    classes = trainf.loc[trainf['image_id'] == img_id, 'class_id'].values
    
    for c_id, box in zip(classes, boxes):
        color = class2color[c_id]
        img = cv2.rectangle(
            img,
            (int(box[0]), int(box[1])),
            (int(box[2]), int(box[3])),
            color, thickness
    )
    img = cv2.resize(img, (500,500))
    imgs.append(img)
    
plot_imgs(imgs, cmap=None)

## EDA of the Training Data

Now we explore the provided annotations for the training data in the **'train.csv'** file. 

From the analysis below, we see that:

* The minimum number of annotations per patient (i.e. per image) is 3 and the maximum is 57.
* There are 17 radiologists, but only R8 to R17 annotated images with findings. Most annotations were made by radiologists R8, R9, and R10.
* Radiologists R8 and R9 are consistent with each other, although R9 seems to have more annotations.
* The classes identified most often are 0, 3, and 11, by R9.
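
The same kind of summary can also be computed numerically rather than read off the plots. The sketch below uses a toy frame with made-up rows; on the real data the calls would be applied to `trainf` or `train`.

```python
import pandas as pd

# Toy stand-in for the annotation table (made-up rows, not real data)
toy = pd.DataFrame({
    "image_id": ["a", "a", "a", "b", "b", "c"],
    "class_id": [0, 3, 0, 11, 0, 3],
    "rad_id": ["R8", "R9", "R9", "R10", "R9", "R8"],
})

# Annotations per image: minimum and maximum
per_image = toy.groupby("image_id").size()
print("min:", per_image.min(), "max:", per_image.max())

# Annotations per radiologist, most active first
per_rad = toy["rad_id"].value_counts()
print(per_rad)

# Radiologist-by-class table of annotation counts
rad_by_class = pd.crosstab(toy["rad_id"], toy["class_id"])
print(rad_by_class)
```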


In [None]:
trainf.head()

In [None]:
# Visualization Imports
from matplotlib.colors import ListedColormap
import plotly.graph_objects as go
import plotly.express as px
import plotly


In [None]:
fig = px.histogram(trainf, x="image_id", 
                  labels={"value":"# of Annotations"},
                  title="<b>    Number of annotations per patient </b>",
                  )
fig.show()

In [None]:
fig = px.histogram(trainf, x="rad_id", 
                  log_y = True,
                  labels={"value":"# of Annotations"},
                  title="<b>    Distribution of Radiologists for the 'Findings'</b>")
fig.update_layout(showlegend=False,
                  xaxis_title="<b>Radiologist ID</b>",
                  yaxis_title="<b>Count   (log scale)</b>",
                  )
fig.show()

In [None]:
fig = px.histogram(train[train['class_id']==14] , x="rad_id", 
                  log_y = True,
                  #labels={"value":"# of Annotations"},
                  title="<b>Distribution of Radiologists for the 'No Finding'</b>")
fig.update_layout(showlegend=False,
                  xaxis_title="<b>Radiologist ID</b>",
                  yaxis_title="<b>Count   (log scale)</b>",
                 )
fig.show()

In [None]:
fig = px.density_heatmap(train, x="rad_id", y="class_id",
                 title="Distribution of Classes per Radiologists")
fig.update_layout(showlegend=False,
                 xaxis_title="Radiologist ID",
                 yaxis_title="Class ID",
                 )
fig.show()

In [None]:
#from pandas_profiling import ProfileReport
#profile = ProfileReport(train, title='Pandas Profiling Report')#, explorative=True)
#profile.to_widgets()

## Classes' Visualization

In [None]:
#
# Create Mapping of the Class id
#
# Create dictionary mappings

cid2cname = {i:train[train["class_id"]==i].iloc[0]["class_name"] for i in range(15)}
cname2cid = {v:k for k,v in cid2cname.items()}

print("\n... Dictionary Mapping of class_id to class name ...\n")
display(cid2cname)

print("\n... Dictionary Mapping class name to class_id ...\n")
display(cname2cid)

In [None]:
fig = go.Figure()

for i in range(15):
    fig.add_trace(go.Histogram(x=train[train["class_id"]==i]["rad_id"],
                name=f"<b>{cid2cname[i]}</b>"))

fig.update_xaxes(categoryorder="total descending")
fig.update_layout(title="Distribution of Classes Annotation per Radiologist",
                  barmode='stack',
                  xaxis_title="Radiologist ID",
                  yaxis_title="Number of Annotations",
                  )
fig.show()

In [None]:
from math import ceil
import matplotlib.patches
import matplotlib.gridspec as gridspec


In [None]:
def example(fclass, ex_num, path=DataDir+'train'):
    image_list = train[train['class_id']==fclass][0:ex_num]['image_id'].values
    image_index = train[train['class_id']==fclass][0:ex_num]['image_id'].index.values
    rows = int(ceil(ex_num / 4))
    gs = gridspec.GridSpec(rows, 4)
    fig = plt.figure(figsize=(30, 20))
    fig.suptitle(f"{cid2cname[fclass]}", fontsize=30)
    # idx is the row label in `train`; a separate name avoids shadowing image_index
    for i, (name, idx) in enumerate(zip(image_list, image_index)):
        ax = fig.add_subplot(gs[i])
        # Title each panel with the id of the image actually shown
        ax.set_title(name)
        ax.axis("off")
        data = dicom2png(f"{path}/{name}.dicom")
        ax.imshow(data, cmap='gray')
        if cid2cname[fclass] != 'No finding':
            bbox = [train.loc[idx, 'x_min'],
                    train.loc[idx, 'y_min'],
                    train.loc[idx, 'x_max'],
                    train.loc[idx, 'y_max']]
            p = matplotlib.patches.Rectangle((bbox[0], bbox[1]),
                                             bbox[2]-bbox[0],
                                             bbox[3]-bbox[1],
                                             ec='r', fc='none', lw=2.)
            ax.add_patch(p)
    fig.tight_layout()
    plt.subplots_adjust(hspace=0.1, wspace=0.1, top=0.95)

In [None]:
example(0,4)

### ==============================

In [None]:
example(4,4)

### More to come, stay tuned ... Kindly, upvote (^_^)

### ============================
### Credit
I found some of the functions here in many notebooks without attribution; however, I benefited from notebooks by: @raddar, @trungthanhnguyen0502, and @dschuttler8845.
### =============================