## **RSNA Pneumonia Detection**

Pneumonia is an infection in one or both lungs. Bacteria, viruses, and fungi cause it. The infection causes inflammation in the air sacs in your lungs, which are called alveoli.

Chest radiographs (CXRs) are the most commonly performed diagnostic imaging study. Factors such as 
patient positioning and depth of inspiration can alter the appearance of a CXR, which complicates 
interpretation. In addition, clinicians must read high volumes of images every shift.

This project aims to automate pneumonia screening on chest radiographs and to report the affected areas as bounding boxes.

## **Objective:**
The objective of this project is to build an algorithm that automatically locates lung opacities (areas of inflammation) on chest radiographs.

The learning goals of the project are:
* Learn how to build an object detection model.
* Use transfer learning to fine-tune a model.
* Learn to set optimizers, loss functions, epochs, learning rate, batch size, checkpointing, early stopping, etc.
* Read research papers in this domain to learn about advanced models for the problem.


#### Acknowledgment for the datasets: https://www.kaggle.com/c/rsna-pneumonia-detection-challenge/overview/acknowledgements

### **1.0 Importing and installing the necessary Libraries**

In [None]:
!pip install tqdm
!pip install pydicom
!pip install -U albumentations

In [None]:
import tensorflow as tf
from tensorflow import keras

In [None]:
print('TensorFlow version:',tf.__version__)
print('Keras version:',keras.__version__)

In [None]:
# Initialize the random number generator
import random
random.seed(0)

# Ignore the warnings
import warnings
warnings.filterwarnings("ignore")

In [None]:
from glob import glob
import os
import time
import math
import fnmatch
import sys

from zipfile import ZipFile
from tqdm import tqdm_notebook

import numpy as np
import pandas as pd

import cv2
import matplotlib.pyplot as plt
import matplotlib.patches as patches 
from matplotlib.patches import Rectangle
import pydicom as dicom
import seaborn as sns

from sklearn.utils import shuffle
from skimage.measure import label, regionprops

import albumentations as A


from tensorflow.keras.preprocessing.image import ImageDataGenerator
# use the public Keras path; tensorflow.python.* is a private module that can move between releases
from tensorflow.keras.utils import Sequence
from tensorflow.keras.layers import Conv2D, Input, Flatten, Dense, Dropout, Concatenate, BatchNormalization, Conv2DTranspose
from tensorflow.keras.models import Model, Sequential, load_model
import tensorflow.keras.backend as K
from tensorflow.keras.applications import VGG16
from tensorflow.keras.applications.vgg16 import preprocess_input

from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import ModelCheckpoint, EarlyStopping, ReduceLROnPlateau
from tensorflow.keras.losses import binary_crossentropy


In [None]:
rootDir='/kaggle/input/'
workingDir='/kaggle/working/'
# zipFilename=rootDir+'rsna-pneumonia-detection-challenge.zip'
datasetPath=rootDir+'rsna-pneumonia-detection-challenge/'


The input folder contains 4 important items:
* stage_2_train_labels.csv - CSV file containing the patient ID, bounding boxes and target label
* stage_2_detailed_class_info.csv - CSV file containing the patient ID and the corresponding detailed class label
* stage_2_train_images - directory contains train images in DICOM format
* stage_2_test_images - directory contains test images in DICOM format

In [None]:
trainImagesDir=datasetPath+'stage_2_train_images/'
testImagesDir=datasetPath+'stage_2_test_images/'
sampleSubmission=datasetPath+'stage_2_sample_submission.csv'
classInfo=datasetPath+'stage_2_detailed_class_info.csv'
rsnaLink=datasetPath+'GCP Credits Request Link - RSNA.txt'
trainLabels=datasetPath+'stage_2_train_labels.csv'


In [None]:
image_train_path = os.listdir(trainImagesDir)
image_test_path = os.listdir(testImagesDir)
print("Number of images in train set:", len(image_train_path),"\nNumber of images in test set:", len(image_test_path))

In [None]:
# Loading the data
# There are two input files given - Detailed class info and train labels
class_info_df = pd.read_csv(classInfo)
train_labels_df = pd.read_csv(trainLabels) 

In [None]:
print("Detailed class info -  rows: {}, columns: {}".format(class_info_df.shape[0], class_info_df.shape[1]))
print("Train labels -  rows: {}, columns: {}".format(train_labels_df.shape[0], train_labels_df.shape[1]))

There are only 26683 images in the image directory, but the CSV file contains 30227 rows, so the CSV has more rows than there are images.

In [None]:
class_info_df.head(10)

In [None]:
train_labels_df.head(10)

In the detailed class info dataset, each row gives the class associated with a patientId. There are three classes: "Lung Opacity", "Normal" and "No Lung Opacity/Not Normal".

The train labels CSV contains the patientId, the bounding box given by its (x, y) coordinates, width and height, and the Target variable. For Target 0, the bounding box values are NaN.


If we look closely, there are duplicate patientId entries in the CSV files. Rows #4 and #5, and rows #8 and #9, share the same patientId, i.e. the patient has pneumonia identified in multiple areas of the lungs, with one row per area.
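The gap between the row count and the image count can be checked by counting rows per patientId. The snippet below is a minimal sketch on invented IDs (the `rows` list is illustrative and not taken from the dataset):

```python
from collections import Counter

# Hypothetical label rows: (patientId, Target). One row per bounding box,
# so a patient with several opacities appears several times.
rows = [
    ("p-001", 0),
    ("p-002", 1),
    ("p-002", 1),
    ("p-003", 0),
    ("p-004", 1),
    ("p-004", 1),
    ("p-004", 1),
]

boxes_per_patient = Counter(pid for pid, _ in rows)
n_rows = len(rows)
n_patients = len(boxes_per_patient)
# Patients that occupy more than one row
repeated = {pid: n for pid, n in boxes_per_patient.items() if n > 1}

print(n_rows, n_patients, repeated)
```

Here 7 rows describe only 4 patients, which is the same pattern as 30227 rows against a smaller number of images.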

Check the unique patient ID in the train dataset

In [None]:
print("Unique patientId in train_labels_df: ", train_labels_df['patientId'].nunique())

#### **Checking missing data in two datasets**

In [None]:
train_labels_df.info()

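From the info output, we observe that of the 30227 rows in total, 9555 have non-null bounding box values. The x, y, width and height columns share the same non-null count, which suggests that for any row the bounding box is either fully defined or not defined at all.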

In [None]:
print(train_labels_df[train_labels_df.Target==0].shape[0])
print(train_labels_df[train_labels_df.Target==1].shape[0])

We see from above that the number of rows with Target 1 (pneumonia) is 9555, which matches the number of non-null bounding box rows. We can infer that every pneumonia row has a bounding box defined and that no bounding box exists for Target 0 rows.
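Matching counts do not strictly prove that the same rows are involved, so a row-level check can help. The sketch below uses invented rows that mirror the columns of the train labels file; `has_box` is a hypothetical helper introduced only for this illustration.

```python
import math

# Hypothetical rows mirroring the columns of stage_2_train_labels.csv.
rows = [
    {"patientId": "p-001", "x": math.nan, "y": math.nan,
     "width": math.nan, "height": math.nan, "Target": 0},
    {"patientId": "p-002", "x": 264.0, "y": 152.0,
     "width": 213.0, "height": 379.0, "Target": 1},
    {"patientId": "p-002", "x": 562.0, "y": 152.0,
     "width": 256.0, "height": 453.0, "Target": 1},
]

def has_box(row):
    """True when all four box fields are present (not NaN)."""
    return not any(math.isnan(row[k]) for k in ("x", "y", "width", "height"))

# A row is consistent when it has a box exactly if Target == 1.
inconsistent = [r["patientId"] for r in rows if has_box(r) != (r["Target"] == 1)]
print(inconsistent)
```

On the real frame, an equivalent check would be `train_labels_df[['x','y','width','height']].notnull().all(axis=1).eq(train_labels_df.Target == 1).all()`.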

In [None]:
def missing_data(data):
    total = data.isnull().sum().sort_values(ascending = False)
    percent = (data.isnull().sum()/data.isnull().count()*100).sort_values(ascending = False)
    return np.transpose(pd.concat([total, percent], axis=1, keys=['Total', 'Percent']))
missing_data(train_labels_df)

In [None]:
missing_data(class_info_df)

68.39% of the x, y, width and height values are missing in the train labels dataset; these are the rows with Target 0 (no lung opacity).

In [None]:
plt.rc('axes', labelsize=15)
plt.rc('axes', titlesize=20)
sns.set_palette('Set2')

In [None]:
f, ax = plt.subplots(1,1, figsize=(6,4))
total = float(class_info_df.shape[0])
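# pass the column as x= and draw on the axes created above (recent seaborn releases require keyword arguments)
sns.countplot(x=class_info_df['class'], order = class_info_df['class'].value_counts().index, ax=ax)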
for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x()+p.get_width()/2.,
            height + 3,
            '{:1.2f}%'.format(100*height/total),
            ha="center") 
plt.show()

In [None]:
# More details on classes - No Lung Opacity / Not Normal, Lung Opacity, Normal

def get_feature_distribution(data, feature):
    # Get the count for each label
    label_counts = data[feature].value_counts()

    # Get total number of samples
    total_samples = data.shape[0]

    # Count the number of items in each class
    print("{:<30s}:   count(percentage)".format(feature))
    for i in range(len(label_counts)):
        label = label_counts.index[i]
        count = label_counts.values[i]
        percent = round((count / total_samples) * 100, 2)
        print("{:<30s}:   {}({}%)".format(label, count, percent))

get_feature_distribution(class_info_df, 'class')


No Lung Opacity / Not Normal and Normal together account for the same share of rows (68.39%) as the share of missing bounding box values in the train labels.

In the train set, the share of rows with pneumonia is therefore 31.61%.
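Both percentages follow from the two row counts reported earlier. A short worked check, using only the numbers quoted in the text:

```python
total_rows = 30227      # rows in the train labels file, as reported above
rows_with_box = 9555    # rows with Target == 1, as reported above

pct_missing = round(100 * (total_rows - rows_with_box) / total_rows, 2)
pct_opacity = round(100 * rows_with_box / total_rows, 2)
print(pct_missing, pct_opacity)
```

These are percentages of rows, not of patients: a patient with several opacities contributes several rows, so the share of patients with pneumonia is lower than the share of rows.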

In [None]:
train_labels_df.Target.unique()

The target takes two values: 0 (no lung opacity, covering both "Normal" and "No Lung Opacity / Not Normal") and 1 (lung opacity, i.e. pneumonia).

#### **Merging train labels and Detailed class info datasets to get more insights**

In [None]:
train_labels_df.shape[0], class_info_df.shape[0]

In [None]:
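# merging the two datasets (train and class detail info) using Patient ID as the merge criteria
# class_info_df repeats a patient's row once per bounding box; drop the repeats so the merge does not multiply rows
train_class_df = train_labels_df.merge(class_info_df.drop_duplicates(), on='patientId', how='inner')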
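When both files carry one row per bounding box, a plain merge on `patientId` pairs every label row with every class row of the same patient. The toy frames below (invented IDs) illustrate the effect and how dropping duplicate class rows first keeps the row count unchanged:

```python
import pandas as pd

# Toy frames: patient "b" has two boxes, so it has two rows in each file.
labels = pd.DataFrame({"patientId": ["a", "b", "b"], "Target": [0, 1, 1]})
classes = pd.DataFrame({"patientId": ["a", "b", "b"],
                        "class": ["Normal", "Lung Opacity", "Lung Opacity"]})

# Plain merge: the two "b" label rows each match both "b" class rows.
naive = labels.merge(classes, on="patientId", how="inner")

# Merge after removing repeated class rows: one class row per patient.
deduped = labels.merge(classes.drop_duplicates(), on="patientId", how="inner")

print(len(labels), len(naive), len(deduped))
```

Comparing `len()` of the merged frame against the label frame is a quick way to see whether a merge has inflated the counts used in later plots.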

In [None]:
train_class_df.sample(5)

In [None]:
#plotting the number of examinations for each class detected, grouped by Target value
fig, ax = plt.subplots(nrows=1,figsize=(12,6))
tmp = train_class_df.groupby('Target')['class'].value_counts()
df = pd.DataFrame(data={'Freq': tmp.values}, index=tmp.index).reset_index()
sns.barplot(ax=ax, x='Target', y='Freq', hue='class', data=df)
plt.title("Chest examination - Frequency of Targets")
plt.show()

Plot frequency distribution graph for bounding box detection for Lung Opacity 

Exploring Dicom image files - Reading training & test files

#### **Extracting a single image and processing DICOM information**

Some useful information with potential predictive value is available in the DICOM metadata, for example:

Patient sex, Patient age, Modality, Body part examined, View position, Rows & Columns, Pixel Spacing
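The fields listed above correspond to standard DICOM keywords. The sketch below shows one way to collect them into a dictionary; `extract_metadata` is a hypothetical helper, and a `SimpleNamespace` with invented values stands in for the object that `dicom.dcmread(path)` would return on real data.

```python
from types import SimpleNamespace

# DICOM keywords for the fields listed above.
FIELDS = ["PatientSex", "PatientAge", "Modality", "BodyPartExamined",
          "ViewPosition", "Rows", "Columns", "PixelSpacing"]

def extract_metadata(ds, fields=FIELDS):
    """Collect the selected header fields; missing ones become None."""
    return {name: getattr(ds, name, None) for name in fields}

# Stand-in for a dataset object; the values are invented for illustration.
fake_ds = SimpleNamespace(PatientSex="F", PatientAge="51", Modality="CR",
                          BodyPartExamined="CHEST", ViewPosition="PA",
                          Rows=1024, Columns=1024)
meta = extract_metadata(fake_ds)
print(meta)
```

Collecting one such dictionary per image yields a table that can be joined to the labels on the patient ID.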

In [None]:
print('A maximum of {} areas are detected in Lungs for pneumonia patient'.format(max(train_labels_df.patientId.value_counts())))

## **Preprocess the dataset for model input**

In [None]:
def update_dataset(path, df1):
    pid=[]
    label=[]
    bbox=[]

    for name, group in df1.groupby(['patientId','Target']):
        pid.append(path+group['patientId'].tolist()[0]+'.dcm')
        label.append(group['Target'].tolist()[0])
        if group['Target'].tolist()[0] == 1:
            ibbox=[]
            for row in group.iterrows():
                ibbox.append([row[1]['x'], row[1]['y'], row[1]['width'], row[1]['height']])
            bbox.append(ibbox)
        else:
            bbox.append([])
    df = pd.DataFrame({'patientId':pid, 'bboxes': bbox, 'label':label})
    return df

Recall that the 9555 non-null rows are the bounding boxes of the patients that have pneumonia.

In [None]:
df=update_dataset(trainImagesDir, train_labels_df)
print(df.shape)
df.head()

In [None]:
print('Total number of patients that are normal are {}'.format(df[df.label==0].shape[0]))
print('Total number of patients that have pneumonia are {}'.format(df[df.label==1].shape[0]))

In [None]:
imgWidth=224
imgHeight=224
imgChannels=3
imgSize=(imgHeight, imgWidth)
batchSize=64
labelDict={0:'normal', 1:'lung opacity'}

In [None]:
# try:
#     tpu = tf.distribute.cluster_resolver.TPUClusterResolver()
#     tf.config.experimental_connect_to_cluster(tpu)
#     tf.tpu.experimental.initialize_tpu_system(tpu)
#     strategy = tf.distribute.experimental.TPUStrategy(tpu)
# except ValueError:
#     strategy = tf.distribute.get_strategy() # for CPU and single GPU
#     print('Number of replicas:', strategy.num_replicas_in_sync)

In [None]:
def loadImage(row, axis):
    image_path = row.patientId
    img = dicom.dcmread(image_path).pixel_array
    axis.imshow(img, cmap='gray')
    lbl=labelDict.get(row.label)
    bboxes=row.bboxes
    for bbox in bboxes:
        x=bbox[0]
        y=bbox[1]
        w=bbox[2]
        h=bbox[3]
        rect = patches.Rectangle((x,y), w, h, linewidth=2, edgecolor='red', fill=False)
        axis.add_patch(rect)
    axis.set_title(lbl)
    

In [None]:
def loadImages(df):
    cols=5
    rows=4
    idx=0
    f,axarr=plt.subplots(rows,cols,figsize=(18,10))
    for r in range(rows):
        for c in range(cols):
            axis=axarr[r,c]
            loadImage(df.iloc[idx], axis)
            idx+=1
    plt.tight_layout()

In [None]:
loadImages(df)

Let us print the image with maximum bounding boxes

In [None]:
max_bbox_idx=np.argmax([len(x) for x in df.bboxes])
f,axarry=plt.subplots(1,1,figsize=(5,5))
loadImage(df.iloc[max_bbox_idx], axarry)

In [None]:
c=math.ceil(df.shape[0]*0.7)
train_df,val_df=df[:c],df[c:]
print(train_df.shape, val_df.shape)
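The slice above takes the first 70% of rows in their existing order. If that order correlates with the label, the two parts may have different label mixes. As an alternative, the sketch below shows a shuffled, stratified split using only the standard library; `stratified_split` is a hypothetical helper and the label list is a toy example.

```python
import random

def stratified_split(labels, train_frac=0.7, seed=0):
    """Return (train_idx, val_idx) with roughly the same label mix in both."""
    rng = random.Random(seed)
    by_label = {}
    for idx, lbl in enumerate(labels):
        by_label.setdefault(lbl, []).append(idx)
    train_idx, val_idx = [], []
    for idxs in by_label.values():
        rng.shuffle(idxs)                      # shuffle within each label
        cut = round(len(idxs) * train_frac)    # same fraction for every label
        train_idx += idxs[:cut]
        val_idx += idxs[cut:]
    return sorted(train_idx), sorted(val_idx)

labels = [0] * 70 + [1] * 30   # toy label column
train_idx, val_idx = stratified_split(labels)
print(len(train_idx), len(val_idx))
```

With the notebook's frame, the indices could be applied as `df.iloc[train_idx]` and `df.iloc[val_idx]` after calling the helper on `df.label.tolist()`.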

In [None]:
# image transformer object
transform = A.Compose([
        A.RandomRotate90(),
        # HorizontalFlip/VerticalFlip are used in place of A.Flip so the cell
        # also runs on recent Albumentations releases
        A.HorizontalFlip(),
        A.VerticalFlip(),
        A.Transpose(),
        # the IAA* transforms are no longer shipped with recent Albumentations
        # releases; the native equivalents are used instead
        A.GaussNoise(p=0.2),
        A.OneOf([
            A.MotionBlur(p=.2),
            A.MedianBlur(blur_limit=3, p=0.1),
            A.Blur(blur_limit=3, p=0.1),
        ], p=0.2),
        A.ShiftScaleRotate(shift_limit=0.0625, scale_limit=0.2, rotate_limit=45, p=0.2),
        A.OneOf([
            A.OpticalDistortion(p=0.3),
            A.GridDistortion(p=.1),
            A.PiecewiseAffine(p=0.3),
        ], p=0.2),
        A.OneOf([
            A.CLAHE(clip_limit=2),
            A.Sharpen(),
            A.Emboss(),
            A.RandomBrightnessContrast(),
        ], p=0.3),
        A.HueSaturationValue(p=0.3),
    ])
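Geometric transforms such as flips move the opacity within the image, so the bounding boxes need the same transform applied. The sketch below works this out by hand for flips on boxes in the `[x, y, width, height]` format used by the labels; both helpers are hypothetical and treat coordinates as continuous values.

```python
def hflip_bbox(bbox, img_width):
    """Mirror an [x, y, w, h] box across the vertical centre line."""
    x, y, w, h = bbox
    return [img_width - (x + w), y, w, h]

def vflip_bbox(bbox, img_height):
    """Mirror an [x, y, w, h] box across the horizontal centre line."""
    x, y, w, h = bbox
    return [x, img_height - (y + h), w, h]

box = [100, 200, 50, 80]               # invented box on a 1024x1024 image
flipped = hflip_bbox(box, 1024)
print(flipped, hflip_bbox(flipped, 1024), vflip_bbox(box, 1024))
```

Albumentations can handle this itself when `A.Compose` is given a `bbox_params=A.BboxParams(format='coco', ...)` argument, where `'coco'` denotes the same `[x_min, y_min, width, height]` layout.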


We need to check the label distribution (pneumonia vs. normal) of the data used.
We take a portion of the dataset for training (2000 images) and for validation (25% of the training size, i.e. 500 images).

In [None]:
trainsize=2000
valsize=int(0.25*trainsize)
trainpartial=train_df[:trainsize]
valpartial=val_df[:valsize]
print('Distribution of labels in the train data are ', 
      (trainpartial.label.value_counts().values/trainpartial.label.value_counts().values.sum())*100)
print('Distribution of labels in the validation data are ', 
      (valpartial.label.value_counts().values/valpartial.label.value_counts().values.sum())*100)

The label distribution in both the train and validation subsets is acceptable, and the share of each label is comparable across the two, so we can proceed.

In [None]:
def iou_loss(y_true, y_pred):
    # flatten and cast to float32; float() only works on scalar tensors
    y_true = tf.cast(tf.reshape(y_true, [-1]), tf.float32)
    y_pred = tf.cast(tf.reshape(y_pred, [-1]), tf.float32)
    intersection = tf.reduce_sum(y_true * y_pred)
    score = (intersection + 1.) / (tf.reduce_sum(y_true) + tf.reduce_sum(y_pred) - intersection + 1.)
    return 1 - score

# combine bce loss and iou loss
def iou_bce_loss(y_true, y_pred):
    return 0.5 * keras.losses.binary_crossentropy(y_true, y_pred) + 0.5 * iou_loss(y_true, y_pred)

# mean iou as a metric
def mean_iou(y_true, y_pred):
    y_true = tf.cast(y_true, tf.float32)
    # threshold predictions at 0.5 before comparing with the ground truth
    y_pred = tf.round(tf.cast(y_pred, tf.float32))
    intersect = tf.reduce_sum(y_true * y_pred, axis=[1])
    union = tf.reduce_sum(y_true, axis=[1]) + tf.reduce_sum(y_pred, axis=[1])
    smooth = tf.ones(tf.shape(intersect))
    return tf.reduce_mean((intersect + smooth) / (union - intersect + smooth))
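To make the smoothed IoU formula concrete, here is a small NumPy version of the same arithmetic on a 4x4 toy mask. `soft_iou` is a hypothetical helper that mirrors `iou_loss` before the final `1 - score` step.

```python
import numpy as np

def soft_iou(y_true, y_pred, smooth=1.0):
    """Smoothed IoU between a binary mask and a probability map."""
    y_true = np.asarray(y_true, dtype=np.float32).ravel()
    y_pred = np.asarray(y_pred, dtype=np.float32).ravel()
    intersection = np.sum(y_true * y_pred)
    union = np.sum(y_true) + np.sum(y_pred) - intersection
    return float((intersection + smooth) / (union + smooth))

truth = np.zeros((4, 4))
truth[:2, :2] = 1          # 4 positive pixels
pred = np.zeros((4, 4))
pred[:2, 1:3] = 1          # 4 predicted pixels, 2 of them overlap the truth

score = soft_iou(truth, pred)
print(score, 1 - score)
```

The intersection is 2 and the union is 4 + 4 - 2 = 6, so the smoothed score is (2 + 1) / (6 + 1) = 3/7 and the loss is 4/7. The smoothing term also keeps the ratio defined when both masks are empty.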


## **Mask R-CNN Model training**

Install Matterport's Mask-RCNN model from github.

In [None]:
!git clone https://www.github.com/matterport/Mask_RCNN.git
os.chdir('Mask_RCNN')

In [None]:
!wget --quiet https://github.com/matterport/Mask_RCNN/releases/download/v2.0/mask_rcnn_coco.h5
!ls -lh mask_rcnn_coco.h5

In [None]:
# Root directory of the project
mask_rcnn_dir = os.path.join(workingDir, "Mask_RCNN")
sys.path.append(mask_rcnn_dir)  # To find local version of the library
# Path to trained weights file
COCO_WEIGHTS_PATH = os.path.join(mask_rcnn_dir, "mask_rcnn_coco.h5")
# Directory to save logs and model checkpoints
DEFAULT_LOGS_DIR = os.path.join(workingDir, "logs")

In [None]:
from mrcnn.config import Config
from mrcnn import utils
import mrcnn.model as modellib
from mrcnn import visualize
from mrcnn.model import log
from mrcnn.visualize import display_instances

In [None]:
class DetectorConfig(Config):
    """Configuration for training on the RSNA pneumonia dataset
    Overrides values from the base Config
    """
    
    # Give the configuration a recognizable name  
    NAME = 'maskrcnn'
    
    # Train on 1 GPU and 8 images per GPU.
    GPU_COUNT = 1
    IMAGES_PER_GPU = 8 
    
    BACKBONE = 'resnet50'
    
    NUM_CLASSES = 2  # a background + 1 pneumonia classes
    
    IMAGE_MIN_DIM = 256
    IMAGE_MAX_DIM = 256
    RPN_ANCHOR_SCALES = (32, 64, 128, 256)
    TRAIN_ROIS_PER_IMAGE = 32
    MAX_GT_INSTANCES = 3
    DETECTION_MAX_INSTANCES = 5
    DETECTION_MIN_CONFIDENCE = 0.9
    DETECTION_NMS_THRESHOLD = 0.1

    STEPS_PER_EPOCH = 50
    
config = DetectorConfig()
config.display()
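The three `DETECTION_*` settings control how raw detections are pruned: a confidence cut-off, non-maximum suppression (NMS), and a cap on the number of boxes. The sketch below is a simplified illustration of that idea in plain Python, not Matterport's implementation; the function names, the `(x1, y1, x2, y2)` box layout and the sample boxes are assumptions made for the example.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def filter_detections(boxes, scores, min_conf=0.9, nms_thresh=0.1, max_det=5):
    """Confidence filter followed by greedy non-maximum suppression."""
    # Keep confident boxes only, best score first.
    order = sorted((i for i, s in enumerate(scores) if s >= min_conf),
                   key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Accept a box only if it barely overlaps every box already kept.
        if all(box_iou(boxes[i], boxes[j]) <= nms_thresh for j in keep):
            keep.append(i)
    return keep[:max_det]

boxes = [(100, 100, 300, 400), (120, 110, 310, 390),
         (600, 150, 800, 450), (50, 50, 90, 90)]
scores = [0.97, 0.95, 0.93, 0.60]
kept = filter_detections(boxes, scores)
print(kept)
```

In this toy case the second box overlaps the first heavily and is suppressed, and the last box falls below the confidence cut-off. A low NMS threshold such as 0.1 means that even slightly overlapping detections are merged into one.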

In [None]:
class DetectorDataset(utils.Dataset):
    """Dataset class for training on the RSNA pneumonia dataset.
    """

    def __init__(self, imagePath, imageBBoxes, orig_height, orig_width):
        # self is passed implicitly; passing it again would bind it to class_map
        super().__init__()
        
        # Add classes
        self.add_class('pneumonia', 1, 'Lung Opacity')
   
        # add images 
        for i, fp in enumerate(imagePath):
            bboxes = imageBBoxes[i]
            self.add_image('pneumonia', image_id=i, path=fp, 
                           annotations=bboxes, orig_height=orig_height, orig_width=orig_width)
            
    def image_reference(self, image_id):
        info = self.image_info[image_id]
        return info['path']

    def load_image(self, image_id):
        info = self.image_info[image_id]
        fp = info['path']
        # pydicom was imported as `dicom` above
        ds = dicom.dcmread(fp)
        image = ds.pixel_array
        if len(image.shape) != 3 or image.shape[2] != 3:
            image = np.stack((image,) * 3, -1)
        # Mask R-CNN subtracts MEAN_PIXEL itself, so keep the 0-255 pixel range
        return image

    def load_mask(self, image_id):
        info = self.image_info[image_id]
        # annotations is a list of [x, y, width, height] boxes (empty for Target 0)
        bboxes = info['annotations']
        count = len(bboxes)
        if count == 0:
            # no opacity: a single empty mask with the background class
            mask = np.zeros((info['orig_height'], info['orig_width'], 1), dtype=np.uint8)
            class_ids = np.zeros((1,), dtype=np.int32)
        else:
            mask = np.zeros((info['orig_height'], info['orig_width'], count), dtype=np.uint8)
            class_ids = np.zeros((count,), dtype=np.int32)
            for i, (x, y, w, h) in enumerate(bboxes):
                x, y, w, h = int(x), int(y), int(w), int(h)
                mask_instance = mask[:, :, i].copy()
                # draw a filled rectangle for this box
                cv2.rectangle(mask_instance, (x, y), (x+w, y+h), 1, -1)
                mask[:, :, i] = mask_instance
                class_ids[i] = 1
        # np.bool was removed in NumPy 1.24; the builtin bool is equivalent
        return mask.astype(bool), class_ids.astype(np.int32)
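Because the dataset has boxes rather than pixel-level annotations, `load_mask` fills each box to produce a rectangular mask. The NumPy-only sketch below shows the same idea on a tiny canvas; `boxes_to_masks` is a hypothetical helper that uses array slicing instead of `cv2.rectangle`.

```python
import numpy as np

def boxes_to_masks(bboxes, height, width):
    """Turn [x, y, w, h] boxes into a (height, width, n) boolean mask stack."""
    n = max(len(bboxes), 1)            # keep one empty channel when there is no box
    masks = np.zeros((height, width, n), dtype=bool)
    class_ids = np.zeros((n,), dtype=np.int32)
    for i, (x, y, w, h) in enumerate(bboxes):
        x, y, w, h = int(x), int(y), int(w), int(h)
        masks[y:y + h, x:x + w, i] = True   # rows are y, columns are x
        class_ids[i] = 1
    return masks, class_ids

masks, class_ids = boxes_to_masks([[2.0, 1.0, 3.0, 4.0]], height=8, width=8)
empty_masks, empty_ids = boxes_to_masks([], height=8, width=8)
print(masks.shape, int(masks.sum()), class_ids, int(empty_masks.sum()), empty_ids)
```

The slice is half-open, so a 3x4 box covers exactly 12 pixels. `cv2.rectangle` includes both corner points, so its filled area is one pixel wider and taller; at 1024x1024 the difference is negligible.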

In [None]:
# prepare the training dataset
dataset_train = DetectorDataset(trainpartial.patientId.to_list(), trainpartial.bboxes.to_list(), 1024, 1024)
dataset_train.prepare()

In [None]:
# prepare the validation dataset
dataset_val = DetectorDataset(valpartial.patientId.to_list(), valpartial.bboxes.to_list(), 1024, 1024)
dataset_val.prepare()

In [None]:
maskrcnnModel = modellib.MaskRCNN(mode='training', config=config, model_dir=workingDir)

In [None]:
# Train Mask-RCNN Model 
maskrcnnHistory=maskrcnnModel.train(dataset_train, dataset_val, 
                                        learning_rate=config.LEARNING_RATE, 
                                        epochs=1, 
                                        layers='all'
                                   )

In [None]:
# select trained model 
dir_names = next(os.walk(maskrcnnModel.model_dir))[1]
key = config.NAME.lower()
dir_names = filter(lambda f: f.startswith(key), dir_names)
dir_names = sorted(dir_names)

if not dir_names:
    import errno
    # this cell runs at module level, so there is no `self`; use the model object
    raise FileNotFoundError(
        errno.ENOENT,
        "Could not find model directory under {}".format(maskrcnnModel.model_dir))
    
fps = []
# Pick the last directory
for d in dir_names: 
    dir_name = os.path.join(maskrcnnModel.model_dir, d)
    # Find the last checkpoint
    checkpoints = next(os.walk(dir_name))[2]
    checkpoints = filter(lambda f: f.startswith("mask_rcnn"), checkpoints)
    checkpoints = sorted(checkpoints)
    if not checkpoints:
        print('No weight files in {}'.format(dir_name))
    else:
        checkpoint = os.path.join(dir_name, checkpoints[-1])
        fps.append(checkpoint)

maskrcnn_model_path = sorted(fps)[-1]
print('Found model {}'.format(maskrcnn_model_path))
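The checkpoint search above can be packaged as a function, which makes it easy to try on a throwaway directory. The sketch below is one possible version; `latest_checkpoint` and the directory and file names are invented for the demonstration, and it assumes (as the cell above does) that names sort chronologically.

```python
import os
import tempfile

def latest_checkpoint(model_dir, run_prefix, file_prefix="mask_rcnn"):
    """Return the last-sorted checkpoint in the last-sorted run directory, or None."""
    runs = sorted(d for d in os.listdir(model_dir)
                  if d.startswith(run_prefix)
                  and os.path.isdir(os.path.join(model_dir, d)))
    # Walk runs from newest to oldest and stop at the first one with weights.
    for run in reversed(runs):
        files = sorted(f for f in os.listdir(os.path.join(model_dir, run))
                       if f.startswith(file_prefix))
        if files:
            return os.path.join(model_dir, run, files[-1])
    return None

# Demonstration on a temporary directory tree with invented names.
layout = {
    "maskrcnn20200101T0000": ["mask_rcnn_maskrcnn_0001.h5"],
    "maskrcnn20200102T0000": ["mask_rcnn_maskrcnn_0001.h5",
                              "mask_rcnn_maskrcnn_0002.h5"],
}
with tempfile.TemporaryDirectory() as root:
    for run, files in layout.items():
        os.makedirs(os.path.join(root, run))
        for name in files:
            open(os.path.join(root, run, name), "w").close()
    found = latest_checkpoint(root, "maskrcnn")
    found_name = os.path.relpath(found, root)
    missing = latest_checkpoint(root, "other")
print(found_name, missing)
```

Returning `None` instead of raising leaves the decision about a missing checkpoint to the caller.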

In [None]:
class InferenceConfig(DetectorConfig):
    GPU_COUNT = 1
    IMAGES_PER_GPU = 1

inference_config = InferenceConfig()

# Recreate the model in inference mode
maskrcnnInfModel = modellib.MaskRCNN(mode='inference', 
                          config=inference_config,
                          model_dir=workingDir)

# Load trained weights (fill in path to trained weights here)
assert maskrcnn_model_path != "", "Provide path to trained weights"
print("Loading weights from ", maskrcnn_model_path)
maskrcnnInfModel.load_weights(maskrcnn_model_path, by_name=True)

In [None]:
# Show few example of ground truth vs. predictions on the validation dataset
dataset = dataset_val
fig = plt.figure(figsize=(10, 30))

for i in range(4):
    image_id = random.choice(dataset.image_ids)
    original_image, image_meta, gt_class_id, gt_bbox, gt_mask = \
        modellib.load_image_gt(dataset_val, inference_config, image_id, use_mini_mask=False)
    print(original_image.shape)
    plt.subplot(6, 2, 2*i + 1)
    visualize.display_instances(original_image, gt_bbox, gt_mask, gt_class_id, 
                                dataset.class_names, colors=['red'], ax=fig.axes[-1])
    
    plt.subplot(6, 2, 2*i + 2)
    results = maskrcnnInfModel.detect([original_image])
    r = results[0]
    visualize.display_instances(original_image, r['rois'], r['masks'], r['class_ids'], 
                                dataset.class_names, r['scores'], colors=['green'], ax=fig.axes[-1])
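The detections in `r['rois']` are `(y1, x1, y2, x2)` boxes in the coordinates of the resized 256x256 input, whereas the labels use `[x, y, width, height]` on the 1024x1024 originals. A minimal sketch of the conversion follows; `rois_to_xywh` is a hypothetical helper and the sample detection is invented.

```python
def rois_to_xywh(rois, scale=1.0):
    """Convert (y1, x1, y2, x2) boxes to [x, y, width, height], rescaled by `scale`."""
    out = []
    for y1, x1, y2, x2 in rois:
        out.append([x1 * scale, y1 * scale,
                    (x2 - x1) * scale, (y2 - y1) * scale])
    return out

rois = [(38, 66, 133, 119)]                 # invented detection on a 256x256 input
boxes = rois_to_xywh(rois, scale=1024 / 256)
print(boxes)
```

After this conversion the predicted boxes can be compared directly with the `bboxes` column built earlier.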