# [Radiological Society of North America -- Pneumonia Detection Challenge](https://www.kaggle.com/c/rsna-pneumonia-detection-challenge)

Pneumonia accounts for over 15% of all deaths of children under 5 years old internationally. In 2015, 920,000 children under the age of 5 died from the disease. In the United States, pneumonia accounts for over 500,000 visits to emergency departments [1](www.cdc.gov/nchs/data/nhamcs/web_tables/2015_ed_web_tables.pdf) and over 50,000 deaths in 2015 [2](http://www.cdc.gov/nchs/data/nvsr/nvsr66/nvsr66_06_tables.pdf), keeping the ailment on the list of top 10 causes of death in the country.

Although pneumonia is common, diagnosing it accurately is difficult. It requires review of a chest radiograph (CXR) by highly trained specialists and confirmation through clinical history, vital signs and laboratory exams. Pneumonia usually manifests as an area or areas of increased opacity [3](https://www.ncbi.nlm.nih.gov/pubmed/30036297) on CXR. However, the diagnosis of pneumonia on CXR is complicated by a number of other conditions in the lungs that can look similar, such as fluid overload (pulmonary edema), bleeding, volume loss (atelectasis or collapse), lung cancer, or post-radiation or surgical changes. Outside of the lungs, fluid in the pleural space (pleural effusion) also appears as increased opacity on CXR. When available, comparison of CXRs of the patient taken at different time points and correlation with clinical symptoms and history are helpful in making the diagnosis.

CXRs are the most commonly performed diagnostic imaging study. A number of factors such as positioning of the patient and depth of inspiration can alter the appearance of the CXR [4](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3632825/), complicating interpretation further. In addition, clinicians are faced with reading high volumes of images every shift.

[The images for the Kaggle-RSNA challenge are from the NIH chest x-ray dataset](https://nihcc.app.box.com/v/ChestXray-NIHCC/folder/36938765345). The National Institutes of Health Clinical Center has made this chest x-ray dataset publicly available.[5](http://openaccess.thecvf.com/content_cvpr_2017/papers/Wang_ChestX-ray8_Hospital-Scale_Chest_CVPR_2017_paper.pdf) There have already been excellent attempts at developing end-to-end deep-learning models, using this dataset, that classify chest x-rays with expert-human-level accuracy.[6](https://stanfordmlgroup.github.io/projects/chexnet/)

In [0]:
setup = True
download_data = True
update_weights = True

gen_preds = True
download_submission = True
fetch_raw_data = False
upload_data = False

colab_mode = True

use_transfer_learn = False
fine_tune = False
load_weights =True

stage_2 = True

use_augmentation = 2 # 0, 1 or 2

LEARNING_RATE = 1e-5
NUM_EPOCHS = 4

In [0]:
# Original DICOM image size: 1024 x 1024
ORIG_SIZE = 1024
IMG_PROC_SIZE = 1024

In [0]:
import tensorflow as tf
hello = tf.constant('Hello, TensorFlow!')
sess = tf.Session()
print(tf.__version__)
print(sess.run(hello))

print(tf.GIT_VERSION, tf.VERSION)

In [0]:
import os 
import sys
import subprocess

In [0]:
def execute_in_shell(command=None, 
                     verbose = False):
    """ 
        command -- keyword argument, takes a list as input
        verbose -- keyword argument, takes a boolean value as input
    
        This is a function that executes shell scripts from within python.
        
        Keyword argument 'command', should be a list of shell commands.
        Keyword argument 'verbose', should be a boolean value to set verbose level.
        
        Example usage: execute_in_shell(command = ['ls ./some/folder/',
                                                    'ls ./some/folder/  -1 | wc -l'],
                                        verbose = True ) 
                                        
        This command returns dictionary with elements: Output and Error.
        
        Output records the console output,
        Error records the console error messages.
                                        
    """
    error = []
    output = []
    
    if isinstance(command, list):
        for i in range(len(command)):
            try:
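                # Pipe stderr as well, so that 'Error' is actually populated.
                # communicate() waits for the process to exit; calling wait()
                # first can deadlock once a pipe buffer fills up.
                process = subprocess.Popen(command[i], shell=True,
                                           stdout=subprocess.PIPE,
                                           stderr=subprocess.PIPE)
                out, err = process.communicate()
                error.append(err)
                output.append(out)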
                if verbose:
                    print ('Success running shell command: {}'.format(command[i]))
            except Exception as e:
                print ('Failed running shell command: {}'.format(command[i]))
                if verbose:
                    print(type(e))
                    print(e.args)
                    print(e)
                
    else:
        print ('The argument command takes a list input ...')
    return {'Output': output, 'Error': error }
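
For comparison, the same idea can be sketched with `subprocess.run`, which waits for each command and captures both output streams in one call. The helper name `run_shell` and the extra `ReturnCode` entry are hypothetical additions for illustration, not part of this notebook, and the sketch assumes a shell where `echo` is available.

```python
import subprocess

def run_shell(commands, verbose=False):
    """Run each shell command in order; collect stdout, stderr and exit codes.

    Hypothetical helper for illustration; not the notebook's execute_in_shell.
    """
    results = {'Output': [], 'Error': [], 'ReturnCode': []}
    for cmd in commands:
        # capture_output=True pipes stdout and stderr; text=True decodes them to str
        completed = subprocess.run(cmd, shell=True, capture_output=True, text=True)
        results['Output'].append(completed.stdout)
        results['Error'].append(completed.stderr)
        results['ReturnCode'].append(completed.returncode)
        if verbose:
            status = 'Success' if completed.returncode == 0 else 'Failed'
            print('{} running shell command: {}'.format(status, cmd))
    return results

result = run_shell(['echo hello'])
```

Recording the return code makes it possible to tell a command that failed from one that simply produced no output.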

In [0]:
command = ['pip3 install -q pydicom kaggle PyDrive  cython >/dev/null 2>&1',
           'git clone https://github.com/rahulremanan/Mask_RCNN >/dev/null 2>&1',
           'git clone https://github.com/rahulremanan/cocoapi >/dev/null 2>&1',
           'cd ./cocoapi/PythonAPI/; make >/dev/null 2>&1; make install >/dev/null 2>&1; python3 setup.py install >/dev/null 2>&1; python3 setup.py build_ext --inplace >/dev/null 2>&1',
           'cd ./Mask_RCNN/; pip3 install -r requirements.txt >/dev/null 2>&1; python3 setup.py install >/dev/null 2>&1',
           'mkdir /content/',
           'mkdir /content/.kaggle/',
           'mkdir ./pneumonia_detection/']

In [0]:
if setup and colab_mode:
  execute_in_shell(command = command, 
                   verbose = True)

In [0]:
import random
import math
import numpy as np
import cv2
import matplotlib.pyplot as plt
import json
import pydicom
from imgaug import augmenters as iaa
from tqdm import tqdm
import pandas as pd 
import glob 

In [0]:
if colab_mode:
    from pydrive.auth import GoogleAuth
    from pydrive.drive import GoogleDrive
    from google.colab import auth
    from oauth2client.client import GoogleCredentials
    from googleapiclient.http import MediaIoBaseDownload

In [0]:
import io
import glob
import fnmatch
import random

from multiprocessing import Process

## Part 01 -- Fetching and processing data

### Setting up Kaggle access for Google Colab

1. Click your profile picture in the top-right corner of the Kaggle website
2. Go to My Account
3. Under API, click Create New API Token; this downloads a kaggle.json file
4. Upload the kaggle.json file to the Google Colab environment or to the Jupyter notebook
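
Before handing the token to the Kaggle client, it can help to check that the uploaded file is well formed and readable only by its owner. The following is a minimal sketch, assuming the token is a JSON object with `username` and `key` entries; the function name `check_kaggle_token` is hypothetical and the credentials below are placeholders written to a temporary directory.

```python
import json
import os
import stat
import tempfile

def check_kaggle_token(path):
    """Validate a Kaggle API token file and restrict it to owner read/write."""
    with open(path) as f:
        token = json.load(f)
    missing = [k for k in ('username', 'key') if k not in token]
    if missing:
        raise ValueError('kaggle.json is missing: {}'.format(', '.join(missing)))
    # 0o600 is an octal literal: read/write for the owner only
    os.chmod(path, 0o600)
    return stat.S_IMODE(os.stat(path).st_mode)

# Demonstration with placeholder credentials in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    token_path = os.path.join(tmp, 'kaggle.json')
    with open(token_path, 'w') as f:
        json.dump({'username': 'placeholder', 'key': 'placeholder'}, f)
    mode = check_kaggle_token(token_path)
```

Note that the mode is written as `0o600`; the decimal integer `600` would set a different, unintended permission pattern.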




#### Upload kaggle.json file

In [0]:
if setup and fetch_raw_data and colab_mode:
  from google.colab import files
  uploaded = files.upload()

### Download RSNA pneumonia detection dataset using Kaggle API

In [0]:
command = ['mv ./*.json /content/.kaggle/',
           'cp /content/.kaggle/kaggle.json ~/.kaggle/kaggle.json',
           'chmod 600 ~/.kaggle/kaggle.json',
           'kaggle competitions download -c rsna-pneumonia-detection-challenge',
           'mv /content/stage_1_* ./',
           'mv /content/stage_2_* ./']

In [0]:
if fetch_raw_data:
  execute_in_shell(command = command, verbose = True)
  filename = "/content/.kaggle/kaggle.json"
  os.chmod(filename, 0o600)  # octal mode: read/write for the owner only

### Setup Google Drive as Colab object storage

In [0]:
def cloud_authenticate():
  auth.authenticate_user()
  gauth = GoogleAuth()
  gauth.credentials = GoogleCredentials.get_application_default()
  drive = GoogleDrive(gauth)
  print ("Successfully authenticated to access Google Drive ...")
  return drive

### Authenticate Google Drive in CoLab mode

In [0]:
if colab_mode:
    drive = cloud_authenticate()

### Google Drive fetch and save for CoLab

In [0]:
def googledrive_fetch(file_name = None, 
                fetch=True, 
                fetch_by_id = False,
                latest = True,
                file_id = None,
                multi_file = False):
  
  """
    A function that fetches files from Google Drive.
    
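    The function takes six keyword arguments:
      file_name -- File name (Google Drive title) to search for
      fetch -- Specify if the files matching the file name should be downloaded
      fetch_by_id -- Specify a file to be downloaded by file id
      latest -- Download only the first file returned by the query, which is treated as the most recent version
      file_id -- File id to download when fetch_by_id is True
      multi_file -- Download all the files with the same file name from Google Drive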
  """
  
  query = 'title='+"'"+file_name+"'"
  try:
    file_list=drive.ListFile({'q': "{}".format(query)}).GetList()
  except:
    return ("Error finding file with {}".format(query))
  
  if len(file_list) >1:
    print ("A total of {} files with the same file name found ...".format(len(file_list)))
    for f in file_list:
      title = f['title']
      id = f.metadata.get('id')
      print ("Found: {} file, with file id: {}".format(title, id))
    
    if multi_file:
      print ("Downloading {} files with file name {}".format(len(file_list), title))
      print ("Starting download ...")
    elif latest:
      print ("Downloading the most recent {} file ...".format(title))
    elif file_id == None:
      print ("Set keyword argument fetch_by_id = True and specify id using keyword argument file_id = 'id' to download a specific file ...")
      print ("--OR--")
      print ("Set keyword argument multi_file = True to automatically download all the files ...")
      return None
    else:
      print ("Starting download ...")
    
  n = 0
  
  if latest:
    try:
      title = file_list[0]['title']
    except:
      return ("Error finding file with {}".format(query))
    latest_file_id = file_list[0].metadata.get('id')
    print ("Found most recent version of: {} file with file id: {} ...".format(title, latest_file_id))    
  
  for f in file_list:
      if fetch and multi_file and n>0:
        save_path = os.path.join('./'+str(n)+'_'+file_name)
      else:
        save_path = os.path.join('./'+file_name)     
      
      title = f['title']
      
      if fetch_by_id and file_id !=None:
        id = file_id
      elif latest:
        id = latest_file_id
      elif fetch_by_id and file_id == None:
        print ('Please specify the file id for downloading using the file_id argument ...')
      else:
        id = f.metadata.get('id')
      
      print ("Downloading {} file, with file id: {} ...".format(title, id))
      
      if fetch or fetch_by_id or latest:
        local_file = io.FileIO(save_path, mode='wb')
        try:
          request = drive.auth.service.files().get_media(fileId=id)
          downloader = MediaIoBaseDownload(local_file, request, chunksize=2048*102400)

          done = False

          while done is False:
              status, done = downloader.next_chunk()
        except:
          return 'Downloading failed ...'
        
        local_file.close()
        print ("Successfully downloaded the file: {} to: {} ...".format(file_name, save_path))
      
      if fetch_by_id and file_id !=None:
        return None
      elif latest:
        return None
      elif n >= 0:
        print ("Downloaded {} of {} files ...".format(n+1, len(file_list)))
      else:
        print ("Download failed ...")
      
      n +=1
  
  return None

In [0]:
def googledrive_save(file_name = None, 
               file_dir = None, 
               upload = False,
               prefix = None):
  if upload == True and file_name != None and file_dir !=None:
    try:
      if prefix != None:
        file = drive.CreateFile({'title': str(prefix) + str(file_name) })
      else:
        file = drive.CreateFile({'title': str(file_name) })
      file.SetContentFile(os.path.join(file_dir + str(file_name)))
      file.Upload()
      print (str(file_name) + " successfully uploaded to Google drive ...")
    except:
      print ("Failed to save :" + str(file_name) + " to Google drive ...")

### Setup Google Drive as object storage using rClone

In [0]:
def rClone_upload(drive_name = None,
                  local_folder = None,
                  cloud_folder = None,
                  verbose = False):
    command = ['rclone copy {} {}:{}'.format(local_folder,
                                             drive_name, 
                                             cloud_folder)]
    execute_in_shell(command = command, 
                     verbose = verbose)
    del command

In [0]:
def rClone_download(drive_name = None,
                    local_folder = None,
                    cloud_folder = None,
                    verbose = False):
    command = ['rclone copy {}:{} {}'.format(drive_name, 
                                             cloud_folder, 
                                             local_folder)]
    execute_in_shell(command = command, 
                     verbose = verbose)
    del command
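
Because both rClone helpers build the command by string formatting, a path containing spaces would be split into several arguments by the shell. One possible safeguard, sketched here under the assumption of a POSIX shell, is to quote each argument with `shlex.quote`; the helper name `build_rclone_copy` and the example path are hypothetical.

```python
import shlex

def build_rclone_copy(source, destination):
    """Return an 'rclone copy' command line with both arguments shell-quoted."""
    return 'rclone copy {} {}'.format(shlex.quote(source), shlex.quote(destination))

# Made-up local path containing a space, for illustration
upload_cmd = build_rclone_copy('./pneumonia detection/weights.h5',
                               'googledrive:Pneumonia_detection')
```

The resulting string can be passed to `execute_in_shell` as a one-element list, exactly like the unquoted version.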

#### Authenticate Google drive in Colab

In [0]:
if colab_mode:
    drive = cloud_authenticate()

#### Upload data to object storage

In [0]:
file_dir = './pneumonia_detection/'
file_name = ['stage_1_detailed_class_info.csv.zip',
             'stage_1_train_images.zip',
             'stage_1_sample_submission.csv',
             'stage_1_train_labels.csv.zip',
             'stage_1_test_images.zip']

In [0]:
if upload_data and colab_mode:
  for f in file_name:
    googledrive_save(file_name = f,
               file_dir = './',
               upload = True)

In [0]:
file_dir = './pneumonia_detection/'
file_name = ['stage_2_detailed_class_info.csv.zip',
             'stage_2_train_images.zip',
             'stage_2_sample_submission.csv',
             'stage_2_train_labels.csv.zip',
             'stage_2_test_images.zip']

In [0]:
if upload_data and colab_mode and stage_2:
  for f in file_name:
    googledrive_save(file_name = f,
               file_dir = './',
               upload = True)

#### Download data from object storage

In [0]:
if download_data and colab_mode:
  for f in file_name:
    googledrive_fetch(file_name = f, 
                      fetch=True, 
                      latest = True)

In [0]:
file_dir = './pneumonia_detection/'
file_name = ['mask_rcnn_coco.h5',
             '{}_mask_rcnn_pneumonia.h5'.format(IMG_PROC_SIZE)]

In [0]:
if upload_data and colab_mode:
  for f in file_name:
    googledrive_save(file_name = f,
               file_dir = './',
               upload = True)

In [0]:
if download_data and colab_mode:
  for f in file_name:
    googledrive_fetch(file_name = f, 
                fetch=True, 
                latest = True)
elif update_weights:
  for f in file_name:
    googledrive_fetch(file_name = f, 
                fetch=True, 
                latest = True)
else:
  pass

In [0]:
command = ['mv ./{}_mask_rcnn_pneumonia.h5 ./mask_rcnn_pneumonia.h5'.format(IMG_PROC_SIZE)]
if setup and download_data and colab_mode:
  execute_in_shell(command = command, 
                   verbose = True)
elif os.path.exists('./{}_mask_rcnn_pneumonia.h5'.format(IMG_PROC_SIZE)):
  execute_in_shell(command = command, 
                   verbose = True)
else:
  pass

In [0]:
command = ['mv ./pneumonia_detection/{}_mask_rcnn_pneumonia.h5 ./mask_rcnn_pneumonia.h5'.format(IMG_PROC_SIZE)]
if setup and download_data and not colab_mode:
  execute_in_shell(command = command, 
                   verbose = True)
elif os.path.exists('./pneumonia_detection/{}_mask_rcnn_pneumonia.h5'.format(IMG_PROC_SIZE)):
  execute_in_shell(command = command, 
                   verbose = True)
else:
  pass

#### Unzip and prepare data for ingestion into a deep-neural network

In [0]:
command = ['ls ./pneumonia_detection/',
           'mkdir ./pneumonia_detection/stage_1_train_images/',
           'mkdir ./pneumonia_detection/stage_1_test_images/',
           'unzip -q ./stage_1_detailed_class_info.csv.zip -d ./pneumonia_detection/',
           'unzip -q ./stage_1_train_images.zip  -d ./pneumonia_detection/stage_1_train_images/',
           'unzip -q ./stage_1_train_labels.csv.zip -d ./pneumonia_detection/',
           'unzip -q ./stage_1_test_images.zip -d ./pneumonia_detection/stage_1_test_images/']

In [0]:
if setup and not stage_2:
  execute_in_shell(command = command, verbose = True)

In [0]:
command = ['ls ./pneumonia_detection/',
           'mkdir ./pneumonia_detection/stage_2_train_images/',
           'mkdir ./pneumonia_detection/stage_2_test_images/',
           'unzip -q ./stage_2_detailed_class_info.csv.zip -d ./pneumonia_detection/',
           'unzip -q ./stage_2_train_images.zip  -d ./pneumonia_detection/stage_2_train_images/',
           'unzip -q ./stage_2_train_labels.csv.zip -d ./pneumonia_detection/',
           'unzip -q ./stage_2_test_images.zip -d ./pneumonia_detection/stage_2_test_images/']

In [0]:
if setup and stage_2:
  execute_in_shell(command = command, verbose = True)

In [0]:
cmd = ["wget --quiet https://github.com/matterport/Mask_RCNN/releases/download/v2.0/mask_rcnn_coco.h5"]

if use_transfer_learn and not load_weights and setup and fetch_raw_data:
  execute_in_shell(command = cmd, verbose = True)

In [0]:
COCO_WEIGHTS_PATH = "./mask_rcnn_coco.h5"

## Part 02 -- Mask RCNN Model

In [0]:
from mrcnn.config import Config
from mrcnn import utils
import mrcnn.model as modellib
from mrcnn import visualize
from mrcnn.model import log

In [0]:
DATA_DIR = './pneumonia_detection'

In [0]:
if stage_2:
  train_dicom_dir = os.path.join(DATA_DIR, 'stage_2_train_images')
  test_dicom_dir = os.path.join(DATA_DIR, 'stage_2_test_images')
else:
  train_dicom_dir = os.path.join(DATA_DIR, 'stage_1_train_images')
  test_dicom_dir = os.path.join(DATA_DIR, 'stage_1_test_images')

In [0]:
def get_dicom_fps(dicom_dir):
    dicom_fps = glob.glob(dicom_dir+'/'+'*.dcm')
    return list(set(dicom_fps))

def parse_dataset(dicom_dir, anns): 
    image_fps = get_dicom_fps(dicom_dir)
    image_annotations = {fp: [] for fp in image_fps}
    for index, row in anns.iterrows(): 
        fp = os.path.join(dicom_dir, row['patientId']+'.dcm')
        image_annotations[fp].append(row)
    return image_fps, image_annotations 
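
The grouping step in `parse_dataset` can be illustrated without DICOM files or pandas. The sketch below uses plain dictionaries as stand-ins for the label rows; the function name `group_annotations` and all row values are made up for illustration. Unlike `parse_dataset`, it uses a `defaultdict`, so a label row whose image file is absent does not raise a `KeyError`.

```python
from collections import defaultdict
import os

def group_annotations(dicom_dir, rows):
    """Map each DICOM file path to the list of its annotation rows."""
    grouped = defaultdict(list)
    for row in rows:
        fp = os.path.join(dicom_dir, row['patientId'] + '.dcm')
        grouped[fp].append(row)
    return dict(grouped)

# Made-up rows: patient 'a' has two boxes, patient 'b' has no opacity
rows = [
    {'patientId': 'a', 'x': 10, 'y': 20, 'width': 30, 'height': 40, 'Target': 1},
    {'patientId': 'a', 'x': 50, 'y': 60, 'width': 30, 'height': 40, 'Target': 1},
    {'patientId': 'b', 'x': None, 'y': None, 'width': None, 'height': None, 'Target': 0},
]
grouped = group_annotations('images', rows)
```

A patient without opacities still contributes one row with `Target` equal to 0, which is why the dataset class checks `Target` before drawing a box.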

### Initialize weights using transfer learning -- COCO pre-trained weights

In [0]:
# The following parameters have been selected to reduce running time for demonstration purposes 
# These are not optimal 

class DetectorConfig(Config):
    """Configuration for training pneumonia detection on the RSNA pneumonia dataset.
    Overrides values in the base Config class.
    """
    
    # Give the configuration a recognizable name  
    NAME = 'pneumonia'
    
    # Train on 1 GPU and 1 image per GPU.
    # Batch size is 1 (GPUs * images/GPU).
    GPU_COUNT = 1
    IMAGES_PER_GPU = 1
    
    BACKBONE = 'resnet101'
    NUM_CLASSES = 2  # background + 1 pneumonia class
    
    IMAGE_MIN_DIM = IMG_PROC_SIZE
    IMAGE_MAX_DIM = IMG_PROC_SIZE
    RPN_ANCHOR_SCALES = (2, 4, 8, 16, 32)
    TRAIN_ROIS_PER_IMAGE = 256
    MAX_GT_INSTANCES = 1024
    DETECTION_MAX_INSTANCES = 3
    DETECTION_MIN_CONFIDENCE = 0.9  ## match target distribution
    DETECTION_NMS_THRESHOLD = 0.01
    
    #IMAGE_SHAPE = [IMG_PROC_SIZE, IMG_PROC_SIZE, 3]
    
    LOSS_WEIGHTS = {'rpn_class_loss': 1.0, 'rpn_bbox_loss': 1.0, 'mrcnn_class_loss': 1.0, 'mrcnn_bbox_loss': 1.0, 'mrcnn_mask_loss': 1.0}
      
    LEARNING_RATE = LEARNING_RATE

    STEPS_PER_EPOCH = 1600
    
config = DetectorConfig()
config.display()
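
With these settings, one "epoch" is a fixed number of steps rather than a full pass over the training images. The arithmetic can be sketched as follows; the function name `epoch_coverage` is hypothetical and the training-set size of 24000 is a made-up figure, not one taken from this notebook.

```python
import math

def epoch_coverage(num_images, gpu_count, images_per_gpu, steps_per_epoch):
    """Return (batch size, images seen per epoch, epochs needed for one full pass)."""
    batch_size = gpu_count * images_per_gpu
    images_per_epoch = batch_size * steps_per_epoch
    epochs_per_pass = math.ceil(num_images / images_per_epoch)
    return batch_size, images_per_epoch, epochs_per_pass

# 24000 is a made-up training-set size used only for illustration
coverage = epoch_coverage(24000, gpu_count=1, images_per_gpu=1, steps_per_epoch=1600)
```

Under these assumed numbers each epoch would see 1600 images, so a handful of epochs covers only part of the data.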

In [0]:
class DetectorDataset(utils.Dataset):
    """Dataset class for training pneumonia detection on the RSNA pneumonia dataset.
    """

    def __init__(self, image_fps, image_annotations, orig_height, orig_width):
        super().__init__()
        
        # Add classes
        self.add_class('pneumonia', 1, 'Lung Opacity')
   
        # add images 
        for i, fp in enumerate(image_fps):
            annotations = image_annotations[fp]
            self.add_image('pneumonia', image_id=i, path=fp, 
                           annotations=annotations, orig_height=orig_height, orig_width=orig_width)
            
    def image_reference(self, image_id):
        info = self.image_info[image_id]
        return info['path']

    def load_image(self, image_id):
        info = self.image_info[image_id]
        fp = info['path']
        ds = pydicom.read_file(fp)
        image = ds.pixel_array
        # If grayscale. Convert to RGB for consistency.
        if len(image.shape) != 3 or image.shape[2] != 3:
            image = np.stack((image,) * 3, -1)
        return image

    def load_mask(self, image_id):
        info = self.image_info[image_id]
        annotations = info['annotations']
        count = len(annotations)
        if count == 0:
            mask = np.zeros((info['orig_height'], info['orig_width'], 1), dtype=np.uint8)
            class_ids = np.zeros((1,), dtype=np.int32)
        else:
            mask = np.zeros((info['orig_height'], info['orig_width'], count), dtype=np.uint8)
            class_ids = np.zeros((count,), dtype=np.int32)
            for i, a in enumerate(annotations):
                if a['Target'] == 1:
                    x = int(a['x'])
                    y = int(a['y'])
                    w = int(a['width'])
                    h = int(a['height'])
                    mask_instance = mask[:, :, i].copy()
                    cv2.rectangle(mask_instance, (x, y), (x+w, y+h), 255, -1)
                    mask[:, :, i] = mask_instance
                    class_ids[i] = 1
        # Use the builtin bool: the np.bool alias was removed in newer NumPy releases
        return mask.astype(bool), class_ids.astype(np.int32)

In [0]:
import pandas as pd
# training dataset
if stage_2:
  anns = pd.read_csv(os.path.join(DATA_DIR, 'stage_2_train_labels.csv'))
else:
  anns = pd.read_csv(os.path.join(DATA_DIR, 'stage_1_train_labels.csv'))
anns.head()

In [0]:
image_fps, image_annotations = parse_dataset(train_dicom_dir, anns=anns)

In [0]:
ds = pydicom.read_file(image_fps[0]) # read dicom image from filepath 
image = ds.pixel_array # get image array

In [0]:
# show dicom fields 
preview = False
if preview:
  ds

In [0]:
######################################################################
# Modify this line to use more or fewer images for training/validation. 
# To use all images, do: image_fps_list = list(image_fps)
image_fps_list = list(image_fps) 
#####################################################################

# split dataset into training vs. validation dataset 
# split ratio is set to 0.9 vs. 0.1 (train vs. validation, respectively)
# sort first so that the seeded shuffle starts from the same order on every run
image_fps_list = sorted(image_fps_list)
random.seed(42)
random.shuffle(image_fps_list)

validation_split = 0.1
split_index = int((1 - validation_split) * len(image_fps_list))

image_fps_train = image_fps_list[:split_index]
image_fps_val = image_fps_list[split_index:]

print(len(image_fps_train), len(image_fps_val))
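
The split is only reproducible if the list is in a fixed order before the seeded shuffle. A minimal sketch of that idea, using a hypothetical helper `split_paths` and made-up file names:

```python
import random

def split_paths(paths, validation_split=0.1, seed=42):
    """Sort, shuffle with a fixed seed, and split into (train, validation)."""
    ordered = sorted(paths)               # fixed starting order
    random.Random(seed).shuffle(ordered)  # local RNG; global random state is untouched
    split_index = int((1 - validation_split) * len(ordered))
    return ordered[:split_index], ordered[split_index:]

# Made-up file names; the second call receives them in a different order
paths = ['img_{:03d}.dcm'.format(i) for i in range(20)]
train, val = split_paths(paths)
train_again, val_again = split_paths(list(reversed(paths)))
```

Both calls produce the same partition even though the input order differs, which is the property the sort is meant to provide.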

In [0]:
# prepare the training dataset
dataset_train = DetectorDataset(image_fps_train, image_annotations, ORIG_SIZE, ORIG_SIZE)
dataset_train.prepare()

In [0]:
# Show annotation(s) for a DICOM image 
test_fp = random.choice(image_fps_train)
image_annotations[test_fp]

In [0]:
# prepare the validation dataset
dataset_val = DetectorDataset(image_fps_val, image_annotations, ORIG_SIZE, ORIG_SIZE)
dataset_val.prepare()

In [0]:
# Load and display random samples and their bounding boxes
# Suggestion: Run this a few times to see different examples. 

summary_plot = False

if summary_plot:
    image_id = random.choice(dataset_train.image_ids)
    image_fp = dataset_train.image_reference(image_id)
    image = dataset_train.load_image(image_id)
    mask, class_ids = dataset_train.load_mask(image_id)

    print(image.shape)

    plt.figure(figsize=(10, 10))
    plt.subplot(1, 2, 1)
    plt.imshow(image[:, :, 0], cmap='gray')
    plt.axis('off')

    plt.subplot(1, 2, 2)
    masked = np.zeros(image.shape[:2])
    for i in range(mask.shape[2]):
        masked += image[:, :, 0] * mask[:, :, i]
    plt.imshow(masked, cmap='gray')
    plt.axis('off')

    print(image_fp)
    print(class_ids)

In [0]:
ROOT_DIR = './pneumonia_detection/'

In [0]:
model = modellib.MaskRCNN(mode='training', 
                          config=config, 
                          model_dir=ROOT_DIR)

In [0]:
if use_transfer_learn:
  # Exclude the last layers because they require a matching
  # number of classes
  try:
    model.load_weights(COCO_WEIGHTS_PATH, by_name=True, exclude=["mrcnn_class_logits", 
                                                               "mrcnn_bbox_fc",
                                                               "mrcnn_bbox", 
                                                               "mrcnn_mask"])
    print ('Loaded trained weights using COCO dataset ...')
  except:
    print ('Failed to load trained weights using COCO dataset ...')
elif load_weights and not use_transfer_learn:
  try:
    model.load_weights('./pneumonia_detection/mask_rcnn_pneumonia.h5', by_name=True)
    print ('Loaded weights from {}'.format('./pneumonia_detection/mask_rcnn_pneumonia.h5'))
  except:
    model.load_weights('./mask_rcnn_pneumonia.h5', by_name=True)
    print ('Loaded weights from {}'.format('./mask_rcnn_pneumonia.h5'))

In [0]:
# Image augmentation 
augmentation_1 = iaa.SomeOf((0, 1), [
    iaa.Fliplr(0.5),
    iaa.Affine(scale={"x": (0.8, 1.2), "y": (0.8, 1.2)},
               translate_percent={"x": (-0.2, 0.2), "y": (-0.2, 0.2)},
               rotate=(-25, 25),
               shear=(-8, 8)),
               iaa.Multiply((0.9, 1.1))])

In [0]:
# Image augmentation (light but constant)
augmentation_2 = iaa.Sequential([
    iaa.OneOf([ ## geometric transform
        iaa.Affine(
            scale={"x": (0.98, 1.02), "y": (0.98, 1.04)},
            translate_percent={"x": (-0.02, 0.02), "y": (-0.04, 0.04)},
            rotate=(-2, 2),
            shear=(-1, 1),
        ),
        iaa.PiecewiseAffine(scale=(0.001, 0.025)),
    ]),
    iaa.OneOf([ ## brightness or contrast
        iaa.Multiply((0.9, 1.1)),
        iaa.ContrastNormalization((0.9, 1.1)),
    ]),
    iaa.OneOf([ ## blur or sharpen
        iaa.GaussianBlur(sigma=(0.0, 0.1)),
        iaa.Sharpen(alpha=(0.0, 0.1)),
    ]),
])

In [0]:
if use_augmentation == 2:
  augmentation = augmentation_2
  print ('Using augmentation mode: 2')
elif use_augmentation == 1:
  augmentation = augmentation_1
  print ('Using augmentation mode: 1')
else:
  augmentation = None
  print ('Using augmentation mode: None')

In [0]:
# Train Mask-RCNN Model 
import warnings 
warnings.filterwarnings("ignore")

### Train top layer using COCO weights or pre-trained weights

In [0]:
command = ['rm -r {}/pneumonia2018*'.format(ROOT_DIR)]
execute_in_shell(command = command, verbose = True)

In [0]:
%%time
## first pass at a higher lr (2x) to speed up learning; trains only the heads
## when using transfer learning or fine-tuning, otherwise all layers
if use_transfer_learn or fine_tune:
  layers = 'heads'
else:
  layers = 'all'
  
model.train(dataset_train, dataset_val,
            learning_rate=LEARNING_RATE*2,
            epochs=2,
            layers=layers,
            augmentation=None)  ## no need to augment yet
history = model.keras_model.history.history

#### Retrain all layers of the MRCNN network

In [0]:
%%time

# Train Mask-RCNN Model 
import warnings 
warnings.filterwarnings("ignore")
model.train(dataset_train, dataset_val, 
            learning_rate=LEARNING_RATE, 
            epochs=NUM_EPOCHS, 
            layers='all',
            augmentation=augmentation)

In [0]:
new_history = model.keras_model.history.history
for k in new_history: history[k] = history[k] + new_history[k]
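
The merge above assumes both training phases report exactly the same metrics. A slightly more defensive variant could look like the sketch below; the function name `merge_histories` and the loss values are made up for illustration.

```python
def merge_histories(first, second):
    """Concatenate per-epoch metric lists from two Keras-style history dicts."""
    merged = {k: list(v) for k, v in first.items()}
    for k, values in second.items():
        # setdefault keeps metrics that appear only in the second phase
        merged.setdefault(k, []).extend(values)
    return merged

# Made-up loss values for illustration
phase_1 = {'loss': [2.0, 1.5], 'val_loss': [2.1, 1.7]}
phase_2 = {'loss': [1.2], 'val_loss': [1.4]}
merged = merge_histories(phase_1, phase_2)
```

Copying the lists first also leaves the original history dictionaries unmodified.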

### Preview image augmentation using the sample image loaded above

In [0]:
if augmentation is not None:
  imggrid = augmentation.draw_grid(image[:, :], cols=2, rows=2)
  plt.figure(figsize=(30, 12))
  _ = plt.imshow(imggrid[:, :, 0], cmap='gray')
else:
  _ = plt.imshow(image, cmap='gray')

### Summarize training performance

In [0]:
epochs = range(1,len(next(iter(history.values())))+1)
pd.DataFrame(history, index=epochs)

In [0]:
plt.figure(figsize=(17,5))

plt.subplot(131)
plt.plot(epochs, history["loss"], label="Train loss")
plt.plot(epochs, history["val_loss"], label="Valid loss")
plt.legend()
plt.subplot(132)
plt.plot(epochs, history["mrcnn_class_loss"], label="Train class ce")
plt.plot(epochs, history["val_mrcnn_class_loss"], label="Valid class ce")
plt.legend()
plt.subplot(133)
plt.plot(epochs, history["mrcnn_bbox_loss"], label="Train box loss")
plt.plot(epochs, history["val_mrcnn_bbox_loss"], label="Valid box loss")
plt.legend()

plt.show()

### Output the best epoch

In [0]:
save_last = True
if save_last:
  best_epoch = len(epochs)-1
  print("Best Epoch:", best_epoch+1)
else:
  best_epoch = np.argmin(history["val_loss"])
  print("Best Epoch:", best_epoch + 1)
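
When `save_last` is disabled, the best epoch is the position of the lowest validation loss. The selection can be sketched in plain Python as follows; the function name `best_epoch_index` and the loss values are made up for illustration.

```python
def best_epoch_index(val_losses):
    """Return the zero-based index of the lowest validation loss."""
    return min(range(len(val_losses)), key=val_losses.__getitem__)

val_losses = [1.9, 1.4, 1.6, 1.5]  # made-up values
best = best_epoch_index(val_losses)
```

The index is zero-based, which is why the notebook prints `best_epoch + 1` when reporting the epoch number.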

### Select best performing trained model for saving weights

In [0]:
dir_names = next(os.walk(model.model_dir))[1]
key = config.NAME.lower()
dir_names = filter(lambda f: f.startswith(key), dir_names)
dir_names = sorted(dir_names)

if not dir_names:
    import errno
    raise FileNotFoundError(
        errno.ENOENT,
        "Could not find model directory under {}".format(model.model_dir))
    
fps = []
# Pick last directory
for d in dir_names: 
    dir_name = os.path.join(model.model_dir, d)
    # Find the last checkpoint
    checkpoints = next(os.walk(dir_name))[2]
    checkpoints = filter(lambda f: f.startswith("mask_rcnn"), checkpoints)
    checkpoints = sorted(checkpoints)
    if not checkpoints:
        print('No weight files in {}'.format(dir_name))
    else:
        checkpoint = os.path.join(dir_name, checkpoints[best_epoch])
        fps.append(checkpoint)

model_path = sorted(fps)[-1]
print('Found model {}'.format(model_path))
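
Picking a checkpoint by sorted file name relies on the epoch number being zero-padded in the file name. The sketch below illustrates this with empty placeholder files; the naming pattern `mask_rcnn_pneumonia_{epoch:04d}.h5`, the directory name and the helper `find_checkpoints` are assumptions made for illustration.

```python
import os
import tempfile

def find_checkpoints(model_dir, prefix='mask_rcnn'):
    """Return sorted checkpoint paths under model_dir whose names start with prefix."""
    found = []
    for root, _dirs, files in os.walk(model_dir):
        found.extend(os.path.join(root, f) for f in files if f.startswith(prefix))
    return sorted(found)

with tempfile.TemporaryDirectory() as tmp:
    run_dir = os.path.join(tmp, 'pneumonia_run')  # made-up directory name
    os.makedirs(run_dir)
    for epoch in (1, 2, 10):
        # zero-padded epoch numbers keep lexicographic order equal to numeric order
        name = 'mask_rcnn_pneumonia_{:04d}.h5'.format(epoch)
        open(os.path.join(run_dir, name), 'w').close()
    latest = os.path.basename(find_checkpoints(tmp)[-1])
```

Without the zero padding, epoch 10 would sort before epoch 2 and the last element would not be the latest checkpoint.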

In [0]:
command = ['mv {} {}/{}_mask_rcnn_pneumonia.h5'.format(model_path, ROOT_DIR, IMG_PROC_SIZE),
           'rm -r {}/pneumonia2018*'.format(ROOT_DIR),
           'ls {}'.format(ROOT_DIR)]
execute_in_shell(command = command, verbose = True)

In [0]:
! ls ./pneumonia_detection/

### Upload weights file to Google drive

In [0]:
if colab_mode:
    drive = cloud_authenticate()

In [0]:
file_dir = '{}'.format(ROOT_DIR)
checkpoint_file = '{}_mask_rcnn_pneumonia.h5'.format(IMG_PROC_SIZE)

upload = False
if colab_mode:
    upload = True
    googledrive_save(file_name = checkpoint_file,
               file_dir = file_dir,
               upload = upload)


In [0]:
drive_name = 'googledrive'
cloud_folder = 'Pneumonia_detection'
local_folder = '{}/{}_mask_rcnn_pneumonia.h5'.format(ROOT_DIR, IMG_PROC_SIZE)

In [0]:
if not colab_mode:
    rClone_upload(drive_name = drive_name,
                  local_folder = local_folder,
                  cloud_folder = cloud_folder,
                  verbose = True) 

### Create a function to generate predictions using MRCNN

In [0]:
class InferenceConfig(DetectorConfig):
    GPU_COUNT = 1
    IMAGES_PER_GPU = 1

inference_config = InferenceConfig()

# Recreate the model in inference mode
model = modellib.MaskRCNN(mode='inference', 
                          config=inference_config,
                          model_dir=ROOT_DIR)



In [0]:
if gen_preds:
  # Load the trained pneumonia weights; fall back to alternative
  # locations if the size-prefixed checkpoint is not found
  try:
    model_path = '{}/{}_mask_rcnn_pneumonia.h5'.format(ROOT_DIR, IMG_PROC_SIZE)

    # Load trained weights (fill in path to trained weights here)
    assert model_path != "", "Provide path to trained weights"
    model.load_weights(model_path, by_name=True)
    print("Loaded weights from ", model_path)
  except:
      try:
        model.load_weights('{}/mask_rcnn_pneumonia.h5'.format(ROOT_DIR), by_name=True)
        print ('Loaded weights from {}'.format('{}/mask_rcnn_pneumonia.h5'.format(ROOT_DIR)))
      except:
        model.load_weights('./mask_rcnn_pneumonia.h5', by_name=True)
        print ('Loaded weights from {}'.format('./mask_rcnn_pneumonia.h5'))

### Set color for predictions class

In [0]:
def get_colors_for_class_ids(class_ids):
    colors = []
    for class_id in class_ids:
        if class_id == 1:
            colors.append((.941, .204, .204))
    return colors

### Display a few examples of ground truth vs. predictions on the validation dataset

In [0]:
dataset = dataset_val
fig = plt.figure(figsize=(10, 30))

for i in range(4):

    image_id = random.choice(dataset.image_ids)
    
    original_image, image_meta, gt_class_id, gt_bbox, gt_mask =\
        modellib.load_image_gt(dataset_val, inference_config, 
                               image_id, use_mini_mask=False)
        
    plt.subplot(6, 2, 2*i + 1)
    visualize.display_instances(original_image, gt_bbox, gt_mask, gt_class_id, 
                                dataset.class_names,
                                colors=get_colors_for_class_ids(gt_class_id), ax=fig.axes[-1])
    
    plt.subplot(6, 2, 2*i + 2)
    results = model.detect([original_image]) #, verbose=1)
    r = results[0]
    visualize.display_instances(original_image, r['rois'], r['masks'], r['class_ids'], 
                                dataset.class_names, r['scores'], 
                                colors=get_colors_for_class_ids(r['class_ids']), ax=fig.axes[-1])
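
The side-by-side plots give a qualitative comparison. One common way to put a number on how well a predicted box matches a ground-truth box is intersection over union (IoU). The helper below is a minimal sketch that assumes boxes in the `(y1, x1, y2, x2)` order used by `r['rois']`. `box_iou` is an illustrative name, not a function from this notebook.

```python
def box_iou(box_a, box_b):
    """Intersection over union of two boxes given as (y1, x1, y2, x2)."""
    ay1, ax1, ay2, ax2 = box_a
    by1, bx1, by2, bx2 = box_b

    # Height and width of the overlap; zero when the boxes do not intersect.
    inter_h = max(0, min(ay2, by2) - max(ay1, by1))
    inter_w = max(0, min(ax2, bx2) - max(ax1, bx1))
    inter = inter_h * inter_w

    area_a = (ay2 - ay1) * (ax2 - ax1)
    area_b = (by2 - by1) * (bx2 - bx1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


print(box_iou((0, 0, 10, 10), (5, 5, 15, 15)))
```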

### Get filenames of test dataset DICOM images

In [0]:
if gen_preds:
  test_image_fps = get_dicom_fps(test_dicom_dir)

### Make predictions on test images and write out submission file

In [0]:
def predict(image_fps, filepath='sample_submission.csv', min_conf=0.98): 
    
    # assume square image
    resize_factor = ORIG_SIZE / config.IMAGE_SHAPE[0]
    
    with open(filepath, 'w') as file:
      file.write("{},{}\n".format("patientId", "PredictionString"))
      for image_id in tqdm(image_fps): 
        # dcmread is the current name for the older read_file alias
        ds = pydicom.dcmread(image_id)
        image = ds.pixel_array
          
        # If grayscale. Convert to RGB for consistency.
        if len(image.shape) != 3 or image.shape[2] != 3:
            image = np.stack((image,) * 3, -1)
        image, window, scale, padding, crop = utils.resize_image(
            image,
            min_dim=config.IMAGE_MIN_DIM,
            min_scale=config.IMAGE_MIN_SCALE,
            max_dim=config.IMAGE_MAX_DIM,
            mode=config.IMAGE_RESIZE_MODE)
            
        patient_id = os.path.splitext(os.path.basename(image_id))[0]

        results = model.detect([image])
        r = results[0]

        # Always emit the comma so that images without detections still
        # produce a two-column row with an empty PredictionString.
        out_str = patient_id + ","
        assert len(r['rois']) == len(r['class_ids']) == len(r['scores'])
        if len(r['rois']) > 0:
            num_instances = len(r['rois'])
            for i in range(num_instances): 
                if r['scores'][i] > min_conf: 
                    out_str += ' '
                    out_str += str(round(r['scores'][i], 2))
                    out_str += ' '

                    # x1, y1, width, height 
                    x1 = r['rois'][i][1]
                    y1 = r['rois'][i][0]
                    width = r['rois'][i][3] - x1 
                    height = r['rois'][i][2] - y1 
                    bboxes_str = "{} {} {} {}".format(x1*resize_factor, y1*resize_factor, \
                                                      width*resize_factor, height*resize_factor)    
                    out_str += bboxes_str

        file.write(out_str+"\n")
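
The coordinate handling inside `predict` can be isolated as a small helper. The sketch below assumes, as the code above does, that each ROI is ordered `(y1, x1, y2, x2)` in resized-image pixels and that the image is square. `format_prediction` and its parameters are illustrative names, not part of the notebook or of Mask R-CNN.

```python
def format_prediction(score, roi, resize_factor):
    """Return 'score x y width height' for one ROI given as (y1, x1, y2, x2)."""
    y1, x1, y2, x2 = roi
    width = x2 - x1
    height = y2 - y1
    # Scale from the resized image back to the original resolution.
    scaled = [v * resize_factor for v in (x1, y1, width, height)]
    return "{} {} {} {} {}".format(round(score, 2), *scaled)


# Example: a 256-pixel model input mapped back to a 1024-pixel image.
print(format_prediction(0.987, (10, 20, 60, 100), 4.0))
```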

### Generate predictions

In [0]:
if gen_preds:
  sample_submission_fp = 'MRCNN_submission.csv'
  predict(test_image_fps, 
          filepath=sample_submission_fp, 
          min_conf=0.98)

In [0]:
if gen_preds:
  output = pd.read_csv(sample_submission_fp)
  print (output.head(10))
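
Beyond looking at the first rows, a quick sanity check is to confirm that every non-empty `PredictionString` splits into groups of five numbers (confidence, x, y, width, height). The sketch below runs on an in-memory CSV with made-up rows. `parse_prediction_string` is a hypothetical helper, not part of the notebook.

```python
import csv
import io


def parse_prediction_string(pred):
    """Split 'conf x y w h conf x y w h ...' into a list of 5-tuples."""
    values = [float(v) for v in (pred or "").split()]
    if len(values) % 5 != 0:
        raise ValueError("expected groups of 5 values, got {}".format(len(values)))
    return [tuple(values[i:i + 5]) for i in range(0, len(values), 5)]


# Made-up rows for illustration only; not real model output.
sample = io.StringIO(
    "patientId,PredictionString\n"
    "patient-a, 0.99 80.0 40.0 320.0 200.0\n"
    "patient-b,\n"
)
boxes = {row["patientId"]: parse_prediction_string(row["PredictionString"])
         for row in csv.DictReader(sample)}
print(boxes)
```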

### Save submission files to Google Drive

In [0]:
file_dir = './'
if gen_preds and colab_mode:
  googledrive_save(file_name = sample_submission_fp,
                   file_dir = file_dir,
                   upload = True)

### Display a few test image predictions

In [0]:
def visualize_test_prediction(): 
    # Named so that it does not shadow the `visualize` module used above.
    image_id = random.choice(test_image_fps)
    ds = pydicom.dcmread(image_id)
    
    # original image 
    image = ds.pixel_array
    
    # assume square image 
    resize_factor = ORIG_SIZE / config.IMAGE_SHAPE[0]
    
    # If grayscale. Convert to RGB for consistency.
    if len(image.shape) != 3 or image.shape[2] != 3:
        image = np.stack((image,) * 3, -1) 
    resized_image, window, scale, padding, crop = utils.resize_image(
        image,
        min_dim=config.IMAGE_MIN_DIM,
        min_scale=config.IMAGE_MIN_SCALE,
        max_dim=config.IMAGE_MAX_DIM,
        mode=config.IMAGE_RESIZE_MODE)

    patient_id = os.path.splitext(os.path.basename(image_id))[0]
    print(patient_id)

    results = model.detect([resized_image])
    r = results[0]
    for bbox in r['rois']: 
        print(bbox)
        # Scale the box from the resized image back to the original image.
        x1 = int(bbox[1] * resize_factor)
        y1 = int(bbox[0] * resize_factor)
        x2 = int(bbox[3] * resize_factor)
        y2 = int(bbox[2] * resize_factor)
        cv2.rectangle(image, (x1,y1), (x2,y2), (77, 255, 9), 3, 1)
        width = x2 - x1 
        height = y2 - y1 
        print("x {} y {} w {} h {}".format(x1, y1, width, height))
    plt.figure() 
    plt.imshow(image, cmap=plt.cm.gist_gray)

In [0]:
if gen_preds:
  visualize_test_prediction()
  visualize_test_prediction()
  visualize_test_prediction()
  visualize_test_prediction()

In [0]:
if gen_preds and download_submission and colab_mode:
  from google.colab import files
  files.download(sample_submission_fp)