<img src="https://storage.googleapis.com/kaggle-competitions/kaggle/24800/logos/header.png?t=2020-12-17-19-26-15">
<center>
    <h1 style="color:red;font-weight:900;font-size:2.5em">VinBigData Chest X-ray Abnormalities Detection</h1>
    <h3>Automatically localize and classify thoracic abnormalities from chest radiographs</h3>
</center>
<br>
<br>
<hr>
<h2 style="color:blue;font-weight:600"> About Competition </h2>
<p>
    Radiologists diagnose and treat medical conditions using imaging techniques like CT and PET scans, MRIs, and, of course, X-rays. Yet, as it happens when working with such a wide variety of medical tools, radiologists face many daily challenges, perhaps the most difficult being the chest radiograph. The interpretation of chest X-rays can lead to medical misdiagnosis, even for the best practicing doctor. Computer-aided detection and diagnosis systems (CADe/CADx) would help reduce the pressure on doctors at metropolitan hospitals and improve diagnostic quality in rural areas.
</p>
<p>
    In this competition we are to predict the thoracic abnormalities in given X-Ray images and also locate those abnormalities. The data provided include:
    <ul>
    <li>Train and Test X-Ray images in folders <b style="font-weight:700">Train</b> and <b style="font-weight:700">Test</b>
    <li> sample submission file in sample_submission.csv
    <li> train dataframe in train.csv
    </ul>
</p>
<hr>
<br>
<a id="home"></a>
<div class="list-group" id="list-tab" role="tablist" style="background: rgb(49,114,163);
background: radial-gradient(circle, rgba(49,114,163,1) 0%, rgba(26,136,181,1) 15%, rgba(1,159,200,1) 52%, rgba(0,212,255,1) 60%, rgba(0,182,224,1) 64%, rgba(0,145,186,1) 69%, rgba(1,66,104,1) 82%, rgba(2,33,70,1) 95%, rgba(24,23,50,1) 100%);">
  <h3 class="list-group-item list-group-item-action active" data-toggle="list"  role="tab" aria-controls="home" style="background: rgb(49,114,163);
background: radial-gradient(circle, rgba(49,114,163,1) 0%, rgba(26,136,181,1) 15%, rgba(1,159,200,1) 52%, rgba(0,212,255,1) 60%, rgba(0,182,224,1) 64%, rgba(0,145,186,1) 69%, rgba(1,66,104,1) 82%, rgba(2,33,70,1) 95%, rgba(24,23,50,1) 100%);">Table of Contents</h3>
    <center>
  <a class="list-group-item list-group-item-action" data-toggle="list" href="#first" role="tab" aria-controls="profile">First Look at the Data<span class="badge badge-primary badge-pill">1</span></a>
    <a class="list-group-item list-group-item-action" data-toggle="list" href="#second" role="tab" aria-controls="profile">EDA<span class="badge badge-primary badge-pill">2</span></a>
    <a class="list-group-item list-group-item-action" data-toggle="list" href="#third" role="tab" aria-controls="profile">An Intuition of the Data<span class="badge badge-primary badge-pill">3</span></a>
  <a class="list-group-item list-group-item-action" data-toggle="list" href="#fourth" role="tab" aria-controls="messages">Data Preparation<span class="badge badge-primary badge-pill">4</span></a>
    <a class="list-group-item list-group-item-action" data-toggle="list" href="#fifth" role="tab" aria-controls="messages">Model Building and Training<span class="badge badge-primary badge-pill">5</span></a>
    </center>
</div>
<hr>
<h1 style="color:red">Note:</h1>
<h5 style="color:red">The utilities_x_ray module used here is a script that I have written (it can be found <a href="https://www.kaggle.com/bibhash123/utilities-x-ray">here</a>). It contains functions for visualizing the X-ray images. The DICOM image-reading pipeline is taken from <a href="https://www.kaggle.com/raddar/popular-x-ray-image-normalization-techniques">this notebook</a> by <a href="https://www.kaggle.com/raddar">@raddar</a>.</h5>
<h5 style="color:red">The numpy files dataset used in this notebook can be found <a href="https://www.kaggle.com/bibhash123/xraynumpy">Here</a></h5>

<h2 style="color:blue">Updates:</h2>
<ul>
    <li>Improved data loading speed by using numpy files dataset</li>
    <li>Implemented Kfold cross validation</li>
    <li>Wrote a training and prediction loop</li>
</ul>

In [None]:
import numpy as np
import random
import tensorflow as tf
import tensorflow.keras.layers as L
import tensorflow.keras.backend as K
import seaborn as sns
import pandas as pd
import cv2
import os
import matplotlib.pyplot as plt
from utilities_x_ray import read_xray,showXray
from tqdm import tqdm
import pydicom
from sklearn.model_selection import KFold

import warnings
warnings.filterwarnings("ignore")

In [None]:
def seedAll(seed=355):
    os.environ["PYTHONHASHSEED"] = str(seed)
    np.random.seed(seed)
    tf.random.set_seed(seed)
    random.seed(seed)
seedAll()

<h1 style="display:inline"> <a id="first"> First Look at the data</a></h1>&emsp;&emsp;&emsp;&emsp;&emsp;<a href="#home" style="color:blue"><img src="https://toppng.com/uploads/preview/light-blue-up-arrow-11550117759k4je61afsa.png" style="display:inline;width:2em;height:2em"></a>

## 1. DataFrames

In [None]:
train = pd.read_csv('../input/vinbigdata-chest-xray-abnormalities-detection/train.csv')
ss = pd.read_csv('../input/vinbigdata-chest-xray-abnormalities-detection/sample_submission.csv')

In [None]:
train.head()

<ul>
<li><code>image_id</code> - unique image identifier</li>
<li><code>class_name</code>&nbsp;- the name of the class of detected object (or "No finding")</li>
<li><code>class_id</code>&nbsp;- the ID of the class of detected object</li>
<li><code>rad_id</code>&nbsp;- the ID of the radiologist that made the observation</li>
<li><code>x_min</code>&nbsp;- minimum X coordinate of the object's bounding box</li>
<li><code>y_min</code>&nbsp;- minimum Y coordinate of the object's bounding box</li>
<li><code>x_max</code>&nbsp;- maximum X coordinate of the object's bounding box</li>
<li><code>y_max</code>&nbsp;- maximum Y coordinate of the object's bounding box</li>
</ul>

In [None]:
ss.head()

The submission file must contain the image ID and a prediction string of space-separated values in the format "a b c d e f" for each predicted box,<br>where
<ul>
    <li>a = predicted class ID; 14 means no abnormality</li>
    <li>b = confidence score</li>
    <li>c d e f = x_min y_min x_max y_max</li>
</ul>
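
As a minimal sketch of how such a prediction string could be assembled, the hypothetical helper `format_prediction` below joins one or more boxes into a single space-separated string. It assumes the layout used in sample_submission.csv, including the placeholder `14 1 0 0 1 1` for an image with no abnormality; the helper name and the three-decimal confidence formatting are illustrative choices, not part of the competition API.

```python
def format_prediction(boxes):
    """Build a prediction string from (class_id, confidence, x_min, y_min, x_max, y_max) tuples."""
    if not boxes:
        # Assumed placeholder for "No finding": class 14, confidence 1, one-pixel box
        return "14 1 0 0 1 1"
    parts = []
    for class_id, conf, x_min, y_min, x_max, y_max in boxes:
        parts.append(
            f"{int(class_id)} {conf:.3f} {int(x_min)} {int(y_min)} {int(x_max)} {int(y_max)}"
        )
    # Several boxes for one image are concatenated in the same string
    return " ".join(parts)


print(format_prediction([(0, 0.5, 10, 20, 30, 40)]))
print(format_prediction([]))
```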

## 2. Images

In [None]:
plt.figure(figsize=(8,10))
plt.imshow(read_xray('../input/vinbigdata-chest-xray-abnormalities-detection/train/0108949daa13dc94634a7d650a05c0bb.dicom'),cmap=plt.cm.bone)

In [None]:
showXray('../input/vinbigdata-chest-xray-abnormalities-detection/train/0108949daa13dc94634a7d650a05c0bb.dicom',train,with_boxes=True)

<h1 style="display:inline"><a id="second">EDA</a></h1>&emsp;&emsp;&emsp;&emsp;&emsp;<a href="#home" style="color:blue"><img src="https://toppng.com/uploads/preview/light-blue-up-arrow-11550117759k4je61afsa.png" style="display:inline;width:2em;height:2em"></a>

In [None]:
print("Number of rows in train dataframe: {}".format(train.shape[0]))
print("Number of Unique images in train set: {}".format(train.image_id.nunique()))
print("Number of Classes: {}\n".format(train.class_name.nunique()))
print("Class Names: {}".format(list(train.class_name.unique())))

In [None]:
print("Null Values:")
train.isna().sum().to_frame().rename(columns={0:'Null Value count'}).style.background_gradient('viridis')

The number of null values in each bounding-box column is the same as the number of rows labelled "No finding", since those rows have no bounding box.
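
A quick way to confirm this kind of observation is to compare the two counts directly. The sketch below uses a small made-up dataframe (`toy`, not the competition data) with the same column names as train.csv; the same two lines could be run on `train`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for train.csv: "No finding" rows carry no box coordinates
toy = pd.DataFrame({
    "image_id": ["a", "a", "b", "c"],
    "class_name": ["Aortic enlargement", "Cardiomegaly", "No finding", "No finding"],
    "x_min": [100.0, 250.0, np.nan, np.nan],
})

n_null = int(toy["x_min"].isna().sum())
n_no_finding = int((toy["class_name"] == "No finding").sum())
print(n_null, n_no_finding)
```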

### The Distribution of Classes
We can see that there is a huge class imbalance: the number of negative ("No finding") examples is very high, while a few abnormalities have very few examples.

In [None]:
plt.figure(figsize=(9,6))
sns.countplot(x=train["class_id"]);  # pass the column as x explicitly
plt.title("Class Distributions");

### Distribution of Radiologists

In [None]:
plt.figure(figsize=(9,6))
sns.countplot(x=train["rad_id"]);  # pass the column as x explicitly
plt.title("rad_id Distributions");

<h1 style="display:inline"><a id="third"> An Intuition of the Data</a></h1>&emsp;&emsp;&emsp;&emsp;&emsp;<a href="#home" style="color:blue"><img src="https://toppng.com/uploads/preview/light-blue-up-arrow-11550117759k4je61afsa.png" style="display:inline;width:2em;height:2em"></a><br><br>
<h5>Before proceeding further let us try and get an intuition of the data and what exactly we need to do.</h5>
<h5> In this competition we have been given 15000 images for training. Alongside them we have a dataframe containing the ground truths for the various abnormalities. Every row in the dataframe contains:</h5>
  <ul>
      <li>the image id</li><li>the id of the radiologist who annotated it</li><li>the name of the corresponding class</li><li>the class id</li><li>the bounding box coordinates</li>
  </ul>
<b style="font-weight:700">Important points to be noted here are:</b>
<ul>
    <li>Each image may have multiple corresponding abnormalities. Therefore this is a multilabel prediction problem.</li>
    <li>The bounding boxes for each image have been annotated by multiple radiologists. Therefore for every image we have multiple ground truths. A naive way to deal with this is to take the mean of the bounding box coordinates given by all radiologists for a particular abnormality.</li>
    <li>There is a significant class imbalance, which is likely to affect the performance of models considerably.</li>
</ul>
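
The naive averaging mentioned in the second point can be sketched with a pandas `groupby`. The dataframe `toy` and its values are made up for illustration; only the column names follow train.csv. Note that this approach also merges distinct lesions of the same class in one image into a single box, which is one reason it is only a rough baseline.

```python
import pandas as pd

# Two radiologists box the same class (3) on one image; a third marks class 0
toy = pd.DataFrame({
    "image_id": ["img1", "img1", "img1"],
    "class_id": [3, 3, 0],
    "rad_id": ["R8", "R9", "R10"],
    "x_min": [100, 110, 400],
    "y_min": [200, 210, 500],
    "x_max": [300, 320, 600],
    "y_max": [400, 420, 700],
})

# One averaged box per (image, class), rounded to whole pixels
mean_boxes = (
    toy.groupby(["image_id", "class_id"])[["x_min", "y_min", "x_max", "y_max"]]
    .mean()
    .round()
    .reset_index()
)
print(mean_boxes)
```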
<h4 style="font-weight:700">Information about dicom can be found: <a href="https://en.wikipedia.org/wiki/DICOM" style="font-size:1em">Here</a></h4>
<h4 style="font-weight:700">Procedure to extract DICOM metadata can be found in: <a href="https://www.kaggle.com/mrutyunjaybiswal/vbd-chest-x-ray-abnormalities-detection-eda" style="font-size:1em">this notebook</a></h4>

<h1 style="display:inline"><a id="fourth">Data Preparation</a></h1>&emsp;&emsp;&emsp;&emsp;&emsp;<a href="#home" style="color:blue"><img src="https://toppng.com/uploads/preview/light-blue-up-arrow-11550117759k4je61afsa.png" style="display:inline;width:2em;height:2em"></a>

In [None]:
class_names = sorted(train.class_name.unique())
del class_names[class_names.index('No finding')]
class_names = class_names+['No finding']
classes = dict(zip(list(range(15)),class_names))

In [None]:
def prepareDataFrame(train_df= train):
    train_df = train_df.fillna(0)
    cols = ['image_id','label']+list(range(4*len(class_names[:-1])))
    return_df = pd.DataFrame(columns=cols)
    
    for image in tqdm(train_df.image_id.unique()):
        df = train_df.query("image_id==@image")
        label = np.zeros(15)
        for cls in df.class_id.unique():
            label[int(cls)]=1
        bboxes_df = df.groupby('class_id')[['x_min','y_min','x_max','y_max']].mean().round()
        
        bboxes_list = [0 for i in range(60)]
        for ind in list(bboxes_df.index):
            bboxes_list[4*ind:4*ind+4] = list(bboxes_df.loc[ind,:].values)
        return_df.loc[len(return_df),:] = [image]+[label]+bboxes_list[:-4]
    return return_df
train_df = prepareDataFrame()

In [None]:
train_df.head(2)

In [None]:
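def generateFolds(n_splits=5):
    # Assign every image to exactly one validation fold
    kf = KFold(n_splits=n_splits)
    for fold_id, (_, val_idx) in enumerate(kf.split(train_df["image_id"])):
        train_df.loc[val_idx, 'kfold'] = fold_id
    # astype returns a new Series, so the result has to be assigned back
    train_df["kfold"] = train_df["kfold"].astype(int)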

generateFolds(n_splits=5)

In [None]:
class DataLoader:
    def __init__(self,path = None,train_df=train_df,val_df=None):
        self.path = path
        self.df = train_df
        self.val_df = val_df
        self.train_list = [f'{img}.npy' for img in train_df["image_id"].unique()]
        np.random.shuffle(self.train_list)
        self.test_list = [f'{img}.npy' for img in val_df["image_id"].unique()]
        np.random.shuffle(self.test_list)
    
    def read_image(self):
        for img in self.train_list:
            im_name = img.split('.npy')[0]
            image = np.load(self.path+img)
            temp = self.df[self.df.image_id==im_name]
            c_label,bb = temp.iloc[0,1],temp.iloc[0,2:].values.astype('float')
            yield image,c_label,bb
    
    
    def batch_generator(self,items,batch_size):
        a=[]
        i=0
        for item in items:
            a.append(item)
            i+=1

            if i%batch_size==0:
                yield a
                a=[]
        if a:  # yield the last, possibly smaller, batch
            yield a
            
    def flow(self,batch_size):
        """
        flow from given directory in batches
        ==========================================
        batch_size: size of the batch
        """
        while True:
            for bat in self.batch_generator(self.read_image(),batch_size):
                batch_images = []
                batch_c_labels = []
                batch_bb = []
                for im,im_c_label,im_bb in bat:
                    batch_images.append(im)
                    batch_c_labels.append(im_c_label)
                    batch_bb.append(im_bb)
                batch_images = np.stack(batch_images,axis=0)
                batch_labels =  (np.stack(batch_c_labels,axis=0),np.stack(batch_bb,axis=0))
                yield batch_images,batch_labels
    
    def getVal(self):
        images = []
        c_labels = []
        bb_labels = []
        for img in self.test_list:
            im_name = img.split('.npy')[0]
            image = np.load(self.path+img)
            temp = self.val_df[self.val_df.image_id==im_name]
            c_label,bb = temp.iloc[0,1],temp.iloc[0,2:].values.astype('float')
            images.append(image)
            c_labels.append(c_label)
            bb_labels.append(bb)
        return np.stack(images,axis=0),(np.stack(c_labels,axis=0),np.stack(bb_labels,axis=0))
    

<h1 style="display:inline"><a id="fifth">Model Building and Training</a></h1>&emsp;&emsp;&emsp;&emsp;&emsp;<a href="#home" style="color:blue"><img src="https://toppng.com/uploads/preview/light-blue-up-arrow-11550117759k4je61afsa.png" style="display:inline;width:2em;height:2em"></a>

In [None]:
def build():
    in1 = L.Input(shape=(256,256,1))
    
    out1 = L.Conv2D(32,(3,3),activation="relu")(in1)
    out1 = L.Conv2D(32,(3,3),activation="relu")(out1)
    out1 = L.MaxPooling2D((2,2))(out1)
    
    out1 = L.Conv2D(64,(3,3),activation="relu")(out1)
    out1 = L.Conv2D(64,(3,3),activation="relu")(out1)
    out1 = L.MaxPooling2D((2,2))(out1)
    
    out1 = L.Conv2D(128,(3,3),activation="relu")(out1)
    out1 = L.Conv2D(128,(3,3),activation="relu")(out1)
    out1 = L.MaxPooling2D((2,2))(out1)
    out1 = L.Flatten()(out1)
    
    out2 = L.Dense(50,activation="relu",kernel_initializer="lecun_normal")(out1)
    out2 = L.Dense(30,activation="relu",kernel_initializer="lecun_normal")(out2)
    out2 = L.Dense(15,activation="sigmoid",kernel_initializer="lecun_normal",name='class_out')(out2)
    
    out3 = L.Dense(50,activation="relu",kernel_initializer="lecun_normal")(out1)
    out3 = L.Dense(30,activation="relu",kernel_initializer="lecun_normal")(out3)
    out3 = L.Dense(56,activation="relu",kernel_initializer="lecun_normal",name="bb_out")(out3)
    
    model = tf.keras.Model(inputs=in1,outputs=[out2,out3])
    # Multilabel targets with sigmoid outputs call for binary cross-entropy
    model.compile(loss={'class_out':'binary_crossentropy','bb_out':'mse'},optimizer="adam")
    return model

In [None]:
model = build()

In [None]:
tf.keras.utils.plot_model(model)

<h2>Training Loop</h2>

In [None]:
def getTest(path=None):
    images = []
    # Sort the file names so the order of the predictions is reproducible
    for img in tqdm(sorted(os.listdir(path))):
        image = np.load(os.path.join(path, img))
        images.append(image)
    return np.stack(images,axis=0)

X_test = getTest('../input/xraynumpy/images/test/')

In [None]:
class_label = np.zeros((len(X_test),15))
bb_label = np.zeros((len(X_test),56))

for fold in range(5):
    print(f'\nFold: {fold}\n')
    
    X_train = train_df[train_df.kfold!=fold].drop('kfold',axis=1)
    X_val = train_df[train_df.kfold==fold].drop('kfold',axis=1)
    
    dl = DataLoader('../input/xraynumpy/images/train/',X_train,X_val)
    train_set = dl.flow(batch_size=32)
    X_eval,Y_eval = dl.getVal()
    
    chckpt = tf.keras.callbacks.ModelCheckpoint(f'./model_f{fold}.hdf5',monitor='val_loss',mode='min',save_best_only=True)
    
    K.clear_session()
    model = build()
    
    model.fit(train_set,
             epochs=10,
              steps_per_epoch=len(X_train)//32,  # one pass over this fold's training images
              validation_data = (X_eval,Y_eval),
              callbacks = [chckpt]
             )
    
    c,b = model.predict(X_test)
    class_label+=c
    bb_label+=b
class_label = class_label/5
bb_label = bb_label/5
np.save('./class_label.npy',class_label)
np.save('./bb_label.npy',bb_label)

# Work in Progress....
<h2 style="color:blue">To Do:</h2>
<ul>
    <li><h2 style="color:blue">1. Implement a submission pipeline</h2></li>
    <li><h2 style="color:blue">2. Implement an evaluation metric matching the competition's evaluation criteria</h2></li>
    <li><h2 style="color:blue">3. Choose better loss functions</h2></li>
</ul>
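
As a starting point for the second item, the competition scores predictions with mean Average Precision at an IoU threshold of 0.4, so a box-overlap function is the basic building block. The helper `iou` below is a minimal sketch of that building block only, not the full metric; boxes are assumed to be `(x_min, y_min, x_max, y_max)` tuples.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap, clamped at zero for disjoint boxes
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


# A prediction would count as a match when iou(pred, truth) > 0.4
print(iou((0, 0, 10, 10), (5, 0, 15, 10)))
```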