# AIML Online Capstone - Pneumonia Detection Challenge

## What is Pneumonia?

**Pneumonia** is an infection in one or both lungs. Bacteria, viruses, and fungi cause it. The infection causes inflammation in the air sacs in your lungs, which are called alveoli.

The alveoli fill with fluid or pus, making it difficult to breathe. Pneumonia is a lung infection that can range from mild to so severe that you have to go to the hospital.
![](https://www.mayoclinic.org/-/media/kcms/gbs/patient-consumer/images/2016/05/18/13/02/ww5r032t-8col-jpg.jpg)

Pneumonia accounts for over 15% of all deaths of children under 5 years old internationally. In 2017, 920,000 children under the age of 5 died from the disease. Diagnosing pneumonia requires review of a chest radiograph (CXR) by highly trained specialists and confirmation through clinical history, vital signs and laboratory exams. Pneumonia usually manifests as an area or areas of increased opacity on CXR. However, the diagnosis of pneumonia on CXR is complicated because of a number of other conditions in the lungs such as fluid overload (pulmonary edema), bleeding, volume loss (atelectasis or collapse), lung cancer, or post-radiation or surgical changes. Outside of the lungs, fluid in the pleural space (pleural effusion) also appears as increased opacity on CXR. When available, comparison of CXRs of the patient taken at different time points and correlation with clinical symptoms and history are helpful in making the diagnosis.

CXRs are the most commonly performed diagnostic imaging study. A number of factors such as positioning of the patient and depth of inspiration can alter the appearance of the CXR, complicating interpretation further. In addition, clinicians are faced with reading high volumes of images every shift.

## Pneumonia Detection

Now to detect Pneumonia, we need to detect **Inflammation** of the lungs. In this project, you’re challenged to build an algorithm to detect a visual signal for pneumonia in medical images. Specifically, your algorithm needs to automatically locate lung opacities on chest radiographs.

## How Is Pneumonia Diagnosed?
Sometimes pneumonia can be difficult to diagnose because the symptoms are so variable, and are often very similar to those seen in a cold or influenza. To diagnose pneumonia, and to try to identify the germ that is causing the illness, your doctor will ask questions about your medical history, do a physical exam, and run some tests.

### Medical history
Your doctor will ask you questions about your signs and symptoms, and how and when they began. To help figure out if your infection is caused by bacteria, viruses or fungi, you may be asked some questions about possible exposures, such as:

### Physical exam
Your doctor will listen to your lungs with a stethoscope. If you have pneumonia, your lungs may make crackling, bubbling, and rumbling sounds when you inhale.

### Diagnostic Tests
If your doctor suspects you may have pneumonia, they will probably recommend some tests to confirm the diagnosis and learn more about your infection. These may include:

1) Blood tests to confirm the infection and to try to identify the germ that is causing your illness.

2) Chest X-ray to look for the location and extent of inflammation in your lungs.

3) Pulse oximetry to measure the oxygen level in your blood. Pneumonia can prevent your lungs from moving enough oxygen into your bloodstream.

4) Sputum test on a sample of mucus (sputum) taken after a deep cough, to look for the source of the infection.

If you are considered a high-risk patient because of your age and overall health, or if you are hospitalized, the doctors may want to do some additional tests, including:

5) CT scan of the chest to get a better view of the lungs and look for abscesses or other complications.

6) Arterial blood gas test, to measure the amount of oxygen in a blood sample taken from an artery, usually in your wrist. This is more accurate than the simpler pulse oximetry.

7) Pleural fluid culture, which removes a small amount of fluid from around tissues that surround the lung, to analyze and identify bacteria causing the pneumonia.

8) Bronchoscopy, a procedure used to look into the lungs' airways. If you are hospitalized and your treatment is not working well, doctors may want to see whether something else is affecting your airways, such as a blockage. They may also take fluid samples or a biopsy of lung tissue.


## Business Domain Value

Automate pneumonia screening in chest radiographs and report the affected areas as bounding boxes.

Assist physicians in making better clinical decisions, or even replace human judgement in certain functional areas of healthcare (e.g., radiology).

Guided by relevant clinical questions, powerful AI techniques can unlock clinically relevant information hidden in the massive amount of data, which in turn can assist clinical decision making.

## Image DataType

Medical images are stored in a special format called DICOM files (`*.dcm`). They contain a combination of header metadata as well as underlying raw image arrays for pixel data.

Dataset link: https://www.kaggle.com/c/rsna-pneumonia-detection-challenge/data


## Prediction Output

In this project, we have to predict whether pneumonia exists in a given image. This is done by predicting bounding boxes around areas of the lung. Samples without bounding boxes are negative and contain no definitive evidence of pneumonia. Samples with bounding boxes indicate evidence of pneumonia.

When making predictions, the model should predict as many bounding boxes as necessary, in the format: `confidence x-min y-min width height`

There will be only ONE predicted row per image. This row may include multiple bounding boxes.

A properly formatted row may look like any of the following.

For patientIds with no predicted pneumonia / bounding boxes: `0004cfab-14fd-4e49-80ba-63a80b6bddd6,`

For patientIds with a single predicted bounding box: `0004cfab-14fd-4e49-80ba-63a80b6bddd6,0.5 0 0 100 100`

For patientIds with multiple predicted bounding boxes: `0004cfab-14fd-4e49-80ba-63a80b6bddd6,0.5 0 0 100 100 0.5 0 0 100 100`, etc.

The general format is as follows:

`patientId,{confidence x-min y-min width height},{confidence x-min y-min width height}, etc.`
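
As a rough illustration of the row format described above, the following sketch builds one submission row from a list of predicted boxes. The helper name `format_prediction_row` and the tuple layout are assumptions made for this example and are not part of the competition kit; the output follows the space-separated layout of the sample rows shown earlier.

```python
def format_prediction_row(patient_id, boxes):
    """Build one submission row (hypothetical helper, for illustration only).

    `boxes` is a list of (confidence, x_min, y_min, width, height) tuples;
    an empty list means no pneumonia was predicted for this patient.
    """
    parts = []
    for confidence, x_min, y_min, width, height in boxes:
        # Each box contributes five space-separated values
        parts.append('%s %d %d %d %d' % (confidence, x_min, y_min, width, height))
    # One row per image: the patientId, a comma, then all boxes
    return '%s,%s' % (patient_id, ' '.join(parts))


pid = '0004cfab-14fd-4e49-80ba-63a80b6bddd6'
print(format_prediction_row(pid, []))
print(format_prediction_row(pid, [(0.5, 0, 0, 100, 100)]))
```

A patient with no predicted boxes still gets a row, which keeps the "one row per image" rule intact.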

## Dataset File Description

`stage_2_train_labels.csv` - the training set. Contains `patientId`s and bounding box / target information.

`stage_2_detailed_class_info.csv` - provides detailed information about the type of positive or negative class for each image.

## Data Fields

- `patientId` - Each patientId corresponds to a unique image.
- `x` - the upper-left x coordinate of the bounding box.
- `y` - the upper-left y coordinate of the bounding box.
- `width` - the width of the bounding box.
- `height` - the height of the bounding box.
- `Target` - the binary Target, indicating whether this sample has evidence of pneumonia.


## Lung Opacity

Tissues with sparse material, such as lungs which are full of air, do not absorb the X-rays and appear black in the image. Dense tissues such as bones absorb X-rays and appear white in the image.

While we are theoretically detecting “lung opacities”, there are lung opacities that are not pneumonia related.

In the data, some of these are labeled “No Lung Opacity / Not Normal”. This extra third class indicates that while pneumonia was determined not to be present, there was nonetheless some type of abnormality on the image, and oftentimes this finding may mimic the appearance of true pneumonia.

It's important to note that the various shades of gray in the chest X-Ray refer to the following:

- **Black** = Air
- **White** = Bone
- **Grey** = Tissue or Fluid

In a normal image (shown above) we see the lungs as black, but they have different projections on them - mainly the rib cage bones, main airways, blood vessels and the heart.

In case of pneumonia, a **haziness** (also referred to as **consolidation**) is present in the chest x-ray image. 

Images labeled as having no lung opacity, yet not normal, can show rounded, hazy boundaries or masses (probably lung nodules or masses, which can be caused by cancer).

There are other exceptional cases as well where the image is abnormal but no pneumonia is present. Some of these cases include **pneumonectomy** (lung removed by surgery), **enlarged heart**, **pleural effusion**, etc.

**Reference**: https://www.kaggle.com/zahaviguy/what-are-lung-opacities

# Data Visualization

**Reference**: https://www.kaggle.com/peterchang77/exploratory-data-analysis

Let's start the project by visualizing the data we have. Since the images are in DICOM format, we will use the Python module `pydicom` to read them. We will also use the `pylab` module for data visualization, `pandas` for EDA, `numpy` for numerical processing and `glob` for directory listing operations.

In [None]:
import gc
import glob
import os
import warnings

import cv2
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pydicom
import pylab
import seaborn as sns
from matplotlib.patches import Rectangle

warnings.simplefilter(action = 'ignore')

## Directory Structure

The challenge data is organized in several files and folders.

In [None]:
!ls ../input

The several key items in this folder:
* `stage_2_train_labels.csv`: CSV file containing training set patientIds and  labels (including bounding boxes)
* `stage_2_detailed_class_info.csv`: CSV file containing detailed labels (explored further below)
* `stage_2_train_images/`:  directory containing training set raw image (DICOM) files
* `stage_2_test_images/` : directory containing testing set image (DICOM) files

Let's go ahead and take a look at the first labels CSV file first:

In [None]:
df = pd.read_csv('../input/stage_2_train_labels.csv')
print(df.iloc[0])

As you can see, each row in the CSV file contains a `patientId` (one unique value per patient), a target (either 0 or 1 for absence or presence of pneumonia, respectively) and the corresponding abnormality bounding box defined by the upper-left hand corner (x, y) coordinate and its corresponding width and height. In this particular case, the patient does *not* have pneumonia and so the corresponding bounding box information is set to `NaN`. See an example case with pneumonia here:

In [None]:
print(df.iloc[4])

One important thing to keep in mind is that a given `patientId` may have **multiple** boxes if more than one area of pneumonia is detected.

# Overview of DICOM files and medical images

Medical images are stored in a special format known as DICOM files (`*.dcm`). They contain a combination of header metadata as well as underlying raw image arrays for pixel data. In Python, one popular library to access and manipulate DICOM files is the `pydicom` module. To use the `pydicom` library, first find the DICOM file for a given `patientId` by simply looking for the matching file in the `stage_2_train_images/` folder, and then use the `pydicom.read_file()` method (newer pydicom releases favor the equivalent `pydicom.dcmread()`) to load the data:

In [None]:
patientId = df['patientId'][0]
dcm_file = '../input/stage_2_train_images/%s.dcm' % patientId
dcm_data = pydicom.read_file(dcm_file)

print(dcm_data)

Most of the standard headers containing patient identifiable information have been anonymized (removed) so we are left with a relatively sparse set of metadata. The primary field we will be accessing is the underlying pixel data as follows:

In [None]:
im = dcm_data.pixel_array
print(type(im))
print(im.dtype)
print(im.shape)

## Considerations

As we can see here, the pixel data is stored as a NumPy array (NumPy being a powerful numeric Python library for handling and manipulating matrix data, among other things). In addition, it is apparent here that the original radiographs have been preprocessed for us as follows:

* The relatively high dynamic range, high bit-depth original images have been rescaled to 8-bit encoding (256 grayscales). For the radiologists out there, this means that the images have been windowed and leveled already. In clinical practice, manipulating the image bit-depth is typically done manually by a radiologist to highlight certain disease processes. To visually assess the quality of the automated bit-depth downscaling and for considerations on potentially improving this baseline, consider consultation with a radiologist physician.

* The relatively large original image matrices (typically acquired at >2000 x 2000) have been resized to the data-science friendly shape of 1024 x 1024. For the purposes of this challenge, the diagnosis of most pneumonia cases can typically be made at this resolution. To visually assess the feasibility of diagnosis at this resolution, and to determine the optimal resolution for pneumonia detection (oftentimes can be done at a resolution *even smaller* than 1024 x 1024), consider consultation with a radiologist physician.
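
To make the two preprocessing steps above more concrete, here is a minimal sketch of what windowing/leveling and downsizing can look like on a plain NumPy array. It is not the pipeline that was used to prepare the challenge data; the function names `window_level` and `block_downsample`, the synthetic 12-bit image, and the choice of block averaging for resizing are all assumptions made for illustration.

```python
import numpy as np


def window_level(image, center, width):
    """Map intensities in [center - width/2, center + width/2] to 0..255."""
    image = np.asarray(image, dtype=np.float64)
    low = center - width / 2.0
    high = center + width / 2.0
    # Values outside the window are clipped to its edges
    clipped = np.clip(image, low, high)
    scaled = (clipped - low) / (high - low) * 255.0
    return scaled.astype(np.uint8)


def block_downsample(image, factor):
    """Shrink a 2-D image by averaging non-overlapping factor x factor blocks."""
    image = np.asarray(image, dtype=np.float64)
    height, width = image.shape
    if height % factor or width % factor:
        raise ValueError('image sides must be divisible by factor')
    blocks = image.reshape(height // factor, factor, width // factor, factor)
    return blocks.mean(axis=(1, 3))


# Synthetic 12-bit image standing in for a raw radiograph (not real data)
raw = np.linspace(0, 4095, 16 * 16).reshape(16, 16)
eight_bit = window_level(raw, center=2047.5, width=4095)
small = block_downsample(eight_bit, factor=4)
print(eight_bit.dtype, eight_bit.min(), eight_bit.max(), small.shape)
```

A narrower window trades dynamic range for contrast inside the window, which is why the choice is normally left to a radiologist.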

## Visualizing An Example

To take a look at this first DICOM image, let's use the `pylab.imshow()` method:

In [None]:
pylab.imshow(im, cmap=pylab.cm.gist_gray)
pylab.axis('off')

# Data Overview

As alluded to above, any given patient may potentially have many boxes if there are several different suspicious areas of pneumonia. To collapse the current CSV file dataframe into a dictionary with unique entries, consider the following method:

In [None]:
def parse_data(df):
    """
    Method to read a CSV file (Pandas dataframe) and parse the 
    data into the following nested dictionary:

      parsed = {
        
        'patientId-00': {
            'dicom': path/to/dicom/file,
            'label': either 0 or 1 for normal or pneumonia,
            'boxes': list of box(es)
        },
        'patientId-01': {
            'dicom': path/to/dicom/file,
            'label': either 0 or 1 for normal or pneumonia,
            'boxes': list of box(es)
        }, ...

      }

    """
    # --- Define lambda to extract coords in list [y, x, height, width]
    extract_box = lambda row: [row['y'], row['x'], row['height'], row['width']]

    parsed = {}
    for n, row in df.iterrows():
        # --- Initialize patient entry into parsed 
        pid = row['patientId']
        if pid not in parsed:
            parsed[pid] = {
                'dicom': '../input/stage_2_train_images/%s.dcm' % pid,
                'label': row['Target'],
                'boxes': []}

        # --- Add box if opacity is present
        if parsed[pid]['label'] == 1:
            parsed[pid]['boxes'].append(extract_box(row))

    return parsed

Let's use the method here:

In [None]:
parsed = parse_data(df)

As we saw above, patient `00436515-870c-4b36-a041-de91049b9ab4` has pneumonia, so let's check our new `parsed` dict here to see the patient's corresponding bounding boxes:

In [None]:
print(parsed['00436515-870c-4b36-a041-de91049b9ab4'])
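
The same nested dictionary can also be built by grouping rows per patient instead of iterating over every row. The sketch below is one possible variant, not a replacement for `parse_data()`; the name `parse_data_grouped` and the `image_dir` parameter are assumptions introduced for this example, and the toy dataframe is made up.

```python
import pandas as pd


def parse_data_grouped(df, image_dir='../input/stage_2_train_images'):
    """Group label rows by patientId and collect boxes as [y, x, height, width]."""
    parsed = {}
    for pid, group in df.groupby('patientId', sort=False):
        label = int(group['Target'].iloc[0])
        boxes = []
        if label == 1:
            # Same coordinate order as the extract_box lambda in parse_data()
            boxes = group[['y', 'x', 'height', 'width']].values.tolist()
        parsed[pid] = {
            'dicom': '%s/%s.dcm' % (image_dir, pid),
            'label': label,
            'boxes': boxes,
        }
    return parsed


# Toy labels: patient 'a' is negative, patient 'b' has two boxes
toy_labels = pd.DataFrame({
    'patientId': ['a', 'b', 'b'],
    'x': [float('nan'), 10.0, 50.0],
    'y': [float('nan'), 20.0, 60.0],
    'width': [float('nan'), 30.0, 70.0],
    'height': [float('nan'), 40.0, 80.0],
    'Target': [0, 1, 1],
})
print(parse_data_grouped(toy_labels))
```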

# Visualizing Boxes

In order to overlay color boxes on the original grayscale DICOM files, consider using the following methods (below, the main method `draw()` requires the method `overlay_box()`):

In [None]:
def draw(data):
    """
    Method to draw single patient with bounding box(es) if present 

    """
    # --- Open DICOM file
    d = pydicom.read_file(data['dicom'])
    im = d.pixel_array

    # --- Convert from single-channel grayscale to 3-channel RGB
    im = np.stack([im] * 3, axis=2)

    # --- Add boxes with random color if present
    for box in data['boxes']:
        rgb = np.floor(np.random.rand(3) * 256).astype('int')
        im = overlay_box(im=im, box=box, rgb=rgb, stroke=6)

    pylab.imshow(im, cmap=pylab.cm.gist_gray)
    pylab.axis('off')

def overlay_box(im, box, rgb, stroke=1):
    """
    Method to overlay single box on image

    """
    # --- Convert coordinates to integers
    box = [int(b) for b in box]
    
    # --- Extract coordinates
    y1, x1, height, width = box
    y2 = y1 + height
    x2 = x1 + width

    im[y1:y1 + stroke, x1:x2] = rgb
    im[y2:y2 + stroke, x1:x2] = rgb
    im[y1:y2, x1:x1 + stroke] = rgb
    im[y1:y2, x2:x2 + stroke] = rgb

    return im

As we saw above, patient `00436515-870c-4b36-a041-de91049b9ab4` has pneumonia, so let's take a look at the overlaid bounding boxes:

In [None]:
draw(parsed['00436515-870c-4b36-a041-de91049b9ab4'])

## Exploring Detailed Labels

In this challenge, the primary endpoint will be the detection of bounding boxes consisting of a binary classification---e.g. the presence or absence of pneumonia. However, in addition to the binary classification, each bounding box *without* pneumonia is further categorized into *normal* or *no lung opacity / not normal*. This extra third class indicates that while pneumonia was determined not to be present, there was nonetheless some type of abnormality on the image---and oftentimes this finding may mimic the appearance of true pneumonia. Keep in mind that this extra class is provided as supplemental information to help improve algorithm accuracy if needed; generation of this separate class **will not** be a formal metric used to evaluate performance in this competition.

As above, we saw that the first patient in the CSV file did not have pneumonia. Let's look at the detailed label information for this patient:

In [None]:
df_detailed = pd.read_csv('../input/stage_2_detailed_class_info.csv')
print(df_detailed.iloc[0])

As we see here, the patient does not have pneumonia but *does* have another imaging abnormality present. Let's take a closer look:

In [None]:
patientId = df_detailed['patientId'][0]
draw(parsed[patientId])

While the image displayed inline within the notebook is small, it is evident that the patient has several well circumscribed nodular densities in the left lung (right side of image). This can be because of lung cancer masses.

## Label Summary

Finally, let us take a closer look at the distribution of labels in the dataset. To do so we will first parse the detailed label information:

In [None]:
summary = {}
for n, row in df_detailed.iterrows():
    if row['class'] not in summary:
        summary[row['class']] = 0
    summary[row['class']] += 1
    
print(summary)

As we can see, there is a relatively even split between the three classes, with nearly two-thirds of the data showing no pneumonia (either completely *normal* or *no lung opacity / not normal*). Compared to most medical imaging datasets, where the prevalence of disease is quite low, this dataset has been significantly enriched with pathology.
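
The manual counting loop can also be expressed with pandas directly, which gives the class shares as well as the raw counts. The snippet below is a small sketch on a made-up four-row dataframe, not the challenge data; on the real data the same two calls would be applied to `df_detailed['class']`.

```python
import pandas as pd

# Toy stand-in for df_detailed (values are made up for illustration)
toy_detailed = pd.DataFrame({
    'class': ['Normal', 'Lung Opacity', 'No Lung Opacity / Not Normal', 'Lung Opacity'],
})

counts = toy_detailed['class'].value_counts()                # rows per class
shares = toy_detailed['class'].value_counts(normalize=True)  # fraction per class

print(counts)
print(shares)
```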

Before we move on with further EDA on the Metadata, let's have a look at some more images from the three categories to get a better understanding of the dataset.

In [None]:
def show_dicom_image(data_df):
    # Plot a 2x2 grid of radiographs, one per row of `data_df`
    img_data = list(data_df.T.to_dict().values())
    f, ax = plt.subplots(2,2, figsize=(16,18))
    for i,data_row in enumerate(img_data):
        pid = data_row['patientId']
        dcm_file = '../input/stage_2_train_images/%s.dcm' % pid
        dcm_data = pydicom.read_file(dcm_file)
        ax[i//2, i%2].imshow(dcm_data.pixel_array, cmap=plt.cm.bone)
        ax[i//2, i%2].set_title('ID: {}\n Age: {} Sex: {}'.format(
            data_row['patientId'],dcm_data.PatientAge, dcm_data.PatientSex))

We will start off with the images of patients with pneumonia.

In [None]:
df_orig = df.copy()

In [None]:
df = pd.concat([df_orig,df_detailed["class"]],axis=1,sort=False)
show_dicom_image(df[df['Target']==1].sample(n=4))

Next, let's have a look at chest x-rays of patients who do not have pneumonia but don't have a normal chest x-ray either.

In [None]:
show_dicom_image(df[ (df['Target']==0) & (df['class']=='No Lung Opacity / Not Normal')].sample(n=4))

Finally, let's visualize a few images from `"Normal"` class as well.

In [None]:
show_dicom_image(df[ (df['Target']==0) & (df['class']=='Normal')].sample(n=4))

We can also display the positive pneumonia chest x-rays with bounding boxes.

In [None]:
def show_dicome_with_boundingbox(data_df):
    img_data = list(data_df.T.to_dict().values())
    f, ax = plt.subplots(2,2, figsize=(16,18))
    for i,data_row in enumerate(img_data):
        pid = data_row['patientId']
        dcm_file = '../input/stage_2_train_images/%s.dcm' % pid
        dcm_data = pydicom.read_file(dcm_file)                    
        ax[i//2, i%2].imshow(dcm_data.pixel_array, cmap=plt.cm.bone)
        ax[i//2, i%2].set_title('ID: {}\n Age: {} Sex: {}'.format(
                data_row['patientId'],dcm_data.PatientAge, dcm_data.PatientSex))
        rows = data_df[data_df['patientId']==data_row['patientId']]
        box_data = list(rows.T.to_dict().values())        
        for j, row in enumerate(box_data):            
            x,y,width,height = row['x'], row['y'],row['width'],row['height']
            rectangle = Rectangle(xy=(x,y),width=width, height=height, color="red",alpha = 0.1)
            ax[i//2, i%2].add_patch(rectangle)

In [None]:
show_dicome_with_boundingbox(df[df['Target']==1].sample(n=4))

Now that we have a basic idea regarding the labels and the images, we can explore the metadata present in the dataset to see whether some information can be extracted directly from it. This is worth doing because images require deep learning based solutions, which take a large amount of resources and time to produce good results, whereas analysing and extracting useful information from metadata is fast and easy.

# EDA on Metadata

Since the `target` information is directly given in the metadata, we can approach the problem using 2 approaches:

1. Predicting `target` based on metadata and finding the bounding boxes for the images with `target=1` using the deep neural network model.
2. Predicting `target` and bounding boxes based on DNN model.

Let's try the first approach first. For this, we will start off with the EDA on metadata.

**Reference**: https://www.kaggle.com/aantonova/practical-eda-on-numerical-data

In [None]:
# For data visualization
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
import gc

In [None]:
# Load detailed class information
detailed_class_info = pd.read_csv('../input/stage_2_detailed_class_info.csv')
# Load training dataset labels
train_labels = pd.read_csv('../input/stage_2_train_labels.csv')

# Merge the above data information into one dataframe
df = pd.merge(left = detailed_class_info, right = train_labels, how = 'left', on = 'patientId')

# Remove the original dataframes since they don't hold any useful information now
del detailed_class_info, train_labels
gc.collect()

In [None]:
# Display information about the merged dataframe
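# (`show_counts` replaces the `null_counts` argument, which recent pandas releases no longer accept)
df.info(show_counts = True)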

As we can see, the merged dataframe holds about 37.6k rows. Only about 16.9k of these rows carry bounding box coordinates, meaning that no pneumonia was detected for the rest.

In [None]:
# First 5 rows of the dataframe
df.head()

Let's first start with data cleaning. The first step is to remove the duplicates, if any.

In [None]:
df.drop_duplicates(inplace=True)

In [None]:
df.info()

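We are now left with about 30.2k unique rows, of which only about 9.5k carry a pneumonia bounding box. Note that these are rows rather than patients, since a patient with several boxes occupies several rows.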
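
The row counts before and after `drop_duplicates` follow from how `pd.merge` treats repeated keys: a `patientId` that appears k times in both frames produces k x k rows in the merged result. The toy example below (made-up IDs and values, not the challenge data) sketches this effect and how de-duplication undoes it.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the two CSV files: patient 'a' has two boxes, 'b' has none
detailed = pd.DataFrame({
    'patientId': ['a', 'a', 'b'],
    'class': ['Lung Opacity', 'Lung Opacity', 'Normal'],
})
labels = pd.DataFrame({
    'patientId': ['a', 'a', 'b'],
    'x': [10.0, 50.0, np.nan],
    'Target': [1, 1, 0],
})

# Each of the two 'a' rows on the left matches both 'a' rows on the right
merged = pd.merge(left=detailed, right=labels, how='left', on='patientId')
deduplicated = merged.drop_duplicates()

print(len(merged), len(deduplicated), deduplicated['patientId'].nunique())
```

After de-duplication there is one row per bounding box (or one row for a negative patient), so the row count is still larger than the number of distinct patients.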

Next, we know that it's possible to have multiple bounding boxes in an image corresponding to a patient. Let's see this.

In [None]:
df['patientId'].value_counts().head(10)

Let's see a sample of this data for the first patient id - `32408669-c137-4e8d-bd62-fe8345b40e73`

In [None]:
df[df['patientId'] == '32408669-c137-4e8d-bd62-fe8345b40e73']

As we can see, this patient's chest x-ray image has 4 bounding boxes.

Let's see the distribution of number of bounding boxes in patients' x-ray images.
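
The next cell chains `value_counts()` twice. As a small illustration on made-up IDs (not the challenge data), the first call counts rows per patient and the second counts how many patients share each row count:

```python
import pandas as pd

# Made-up patient IDs: 'a' has 3 rows, 'd' has 2, 'b' and 'c' have 1 each
toy_ids = pd.Series(['a', 'a', 'a', 'b', 'c', 'd', 'd'])

rows_per_patient = toy_ids.value_counts()            # index: patient, value: row count
patients_per_count = rows_per_patient.value_counts() # index: row count, value: patients

print(rows_per_patient)
print(patients_per_count)
```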

In [None]:
df['patientId'].value_counts().value_counts()

We know that a patient without Pneumonia will have only one row (0 bounding boxes). Let's see how many such cases are there.

In [None]:
df[df['Target'] == 0]['patientId'].value_counts().value_counts()

As we can see here, most of the patients have no pneumonia. But this also means that there are some other cases as well. Let's explore the same.

In [None]:
df[df['Target'] == 1]['patientId'].value_counts().value_counts()

Notice that even among patients with pneumonia, 2614 patients have only one row, i.e. a single bounding box.

Next, let's have a look at the distribution of the target class.

## Distribution of `class`

In [None]:
sns.countplot(x = 'class', hue = 'Target', data = df)
plt.show()

Notice here that most of the images are from the `No Lung Opacity / Not Normal` or `Normal` class.

Let's have a look at the `Target` distribution for these classes. As a reminder, we expect:

- `Lung Opacity` - 1
- `Normal` - 0
- `No Lung Opacity / Not Normal` - 0

In [None]:
df[df['class'] == 'Lung Opacity']['Target'].value_counts(dropna = False)

In [None]:
df[df['class'] == 'No Lung Opacity / Not Normal']['Target'].value_counts(dropna = False)

In [None]:
df[df['class'] == 'Normal']['Target'].value_counts(dropna = False)

## Feature Engineering

Next, using the metadata we have, we will create some new features regarding the bounding box.

1. `x_2`, `y_2` - Corner point coordinates opposite to `x`, `y`
2. `area` - Area of the bounding box
3. `x_center`, `y_center` - Center point of the bounding box

In [None]:
df_areas = df.dropna()[['x', 'y', 'width', 'height']].copy()
df_areas['x_2'] = df_areas['x'] + df_areas['width']
df_areas['y_2'] = df_areas['y'] + df_areas['height']
df_areas['x_center'] = df_areas['x'] + df_areas['width'] / 2
df_areas['y_center'] = df_areas['y'] + df_areas['height'] / 2
df_areas['area'] = df_areas['width'] * df_areas['height']

df_areas.head()

Now that we have some new features, let's see if there is any relationship between these features using a `jointplot`.

In [None]:
def createJointplot(df, x, y):
    sns.jointplot(x = x, y = y, data = df, kind = 'hex', gridsize = 20)
    plt.show()

In [None]:
createJointplot(df_areas,'x','y')

In [None]:
createJointplot(df_areas,'x_center','y_center')

In [None]:
createJointplot(df_areas,'x_2','y_2')

As we can see, there is no significant correlation between the points (x,y) and (x_2, y_2). Let's also study the correlation between width and height.

In [None]:
createJointplot(df_areas,'width','height')

As we can see from the above joint plot, width and height have high correlation, so while building a model, we can choose to keep only one of the 2 for increasing the simplicity of the model.

## Outlier Analysis

In this section, we will use boxplots to see whether there are any outliers in the features.

In [None]:
n_columns = 3
n_rows = 3
_, axes = plt.subplots(n_rows, n_columns, figsize=(8 * n_columns, 5 * n_rows))
for i, c in enumerate(df_areas.columns):
    sns.boxplot(y = c, data = df_areas, ax = axes[i // n_columns, i % n_columns])
plt.tight_layout()
plt.show()

There are some outliers in the `width` and `height` features. Let's find the rows where these values exist.

In [None]:
df_areas[df_areas['width'] > 500]

In [None]:
pid_width = list(df[df['width'] > 500]['patientId'].values)
df[df['patientId'].isin(pid_width)]

In [None]:
df_areas[df_areas['height'] > 900].shape[0]

In [None]:
pid_height = list(df[df['height'] > 900]['patientId'].values)
df[df['patientId'].isin(pid_height)]

Let's drop all these rows for the 2 patients. This will remove the extreme outliers.

In [None]:
df = df[~df['patientId'].isin(pid_width + pid_height)]
df.shape

We are now left with around 30k rows.
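
The thresholds used above (`width > 500`, `height > 900`) were read off the boxplots by eye. A rule-based alternative is the 1.5 * IQR convention that boxplot whiskers commonly use by default. The sketch below shows the idea on made-up widths; the helper name `iqr_outlier_mask` is an assumption for this example, and this rule would likely flag more rows on the real data than the two manual thresholds do.

```python
import numpy as np


def iqr_outlier_mask(values, k=1.5):
    """Return a boolean mask marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


# Made-up box widths with one obviously extreme value
toy_widths = [200, 210, 220, 230, 240, 250, 600]
mask = iqr_outlier_mask(toy_widths)
print(mask)
```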

## Metadata Cleanup

Finally, let's clean up the entire metadata to keep only the relevant columns. We will use the cleaned up data to see whether there is any correlation between the `target` column and other columns.

In [None]:
df_meta = df.drop('class', axis = 1).copy()
dcm_columns = None

for n, pid in enumerate(df_meta['patientId'].unique()):
    if n%1000==0:
        print(n,len(df_meta['patientId'].unique()))
    dcm_file = '../input/stage_2_train_images/%s.dcm' % pid
    dcm_data = pydicom.read_file(dcm_file)
    
    if not dcm_columns:
        dcm_columns = dcm_data.dir()
        dcm_columns.remove('PixelSpacing')
        dcm_columns.remove('PixelData')
    
    for col in dcm_columns:
        if not (col in df_meta.columns):
            df_meta[col] = np.nan
        index = df_meta[df_meta['patientId'] == pid].index
        df_meta.loc[index, col] = dcm_data.data_element(col).value
        
    del dcm_data
    
gc.collect()

In [None]:
df_meta.columns

In [None]:
to_drop = df_meta.nunique()
to_drop = to_drop[(to_drop <= 1) | (to_drop == to_drop['patientId'])].index
to_drop = to_drop.drop('patientId')

In [None]:
df_meta.drop(to_drop, axis = 1, inplace = True)

In [None]:
df_meta.head()

In [None]:
# try:
#     df_meta.drop('ReferringPhysicianName', axis = 1, inplace = True)
# except:
#     print("Referring Physician Name not found")
df_meta['PatientAge'] = df_meta['PatientAge'].astype(int)
df_meta['SeriesDescription'] = df_meta['SeriesDescription'].map({'view: AP': 'AP', 'view: PA': 'PA'})

df_meta.drop('SeriesDescription', axis = 1, inplace = True)

df_meta['PatientSex'] = df_meta['PatientSex'].map({'F': 0, 'M': 1})
df_meta['ViewPosition'] = df_meta['ViewPosition'].map({'PA': 0, 'AP': 1})

In [None]:
df_meta.head()

Now using this cleaned data, let's first see if there is any relation of `target` variable and the categorical features like patient's age, sex and the view position of the chest x-ray.

In [None]:
plt.figure(figsize = (25, 5))
sns.countplot(x = 'PatientAge', hue = 'Target', data = df_meta)
plt.show()

As we can see from the above count plot, most patients fall in the middle of the age range. The ratio of patients with target 0 versus target 1 is mostly the same throughout the range, so there does not seem to be any significant correlation between target and patient age.

In [None]:
sns.countplot(x = 'PatientSex', hue = 'Target', data = df_meta)

In [None]:
sns.countplot(x = 'ViewPosition', hue = 'Target', data = df_meta)

Let's also find out the correlation between the features.

In [None]:
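# Restrict to numeric columns: `patientId` is a string and can make `corr()` fail on recent pandas
df_meta.select_dtypes(include = 'number').corr()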

As we can see from the above correlation matrix, there is a high correlation between `ViewPosition` and `Target` features.
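
Because both `ViewPosition` and `Target` are binary, the correlation can be read more directly as a difference in positive rates between the two view positions. The following sketch uses a made-up eight-row table (not the challenge data) with the same 0/1 encoding as above (PA = 0, AP = 1); on the real data the same calls would be applied to `df_meta`.

```python
import pandas as pd

# Made-up values: 0 = PA, 1 = AP, matching the mapping used for df_meta
toy_meta = pd.DataFrame({
    'ViewPosition': [0, 0, 0, 0, 1, 1, 1, 1],
    'Target':       [0, 0, 0, 1, 0, 1, 1, 1],
})

# Row-normalized contingency table: share of each Target value per view position
rates = pd.crosstab(toy_meta['ViewPosition'], toy_meta['Target'], normalize='index')

# Equivalent shortcut for a 0/1 target: the mean is the positive rate
positive_rate = toy_meta.groupby('ViewPosition')['Target'].mean()

print(rates)
print(positive_rate)
```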

At this point, we can try using the cleaned metadata for predicting the `target` column.

In [None]:
import lightgbm as lgb
from lightgbm import LGBMClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import KFold

import warnings
warnings.simplefilter("ignore")

In [None]:
def fast_lgbm_cv_scores(df, target, task, rs = 0):
    # Note: `task` is kept in the signature but is not used below

    clf = LGBMClassifier(n_estimators = 10000, nthread = 4, random_state = rs)
    metric = 'auc'

    # Cross validation model
    folds = KFold(n_splits = 2, shuffle = True, random_state = rs)

    # Create arrays and dataframes to store results
    pred = np.zeros(df.shape[0])

    feats = df.columns.drop(target)

    feature_importance_df = pd.DataFrame(index = feats)

    for n_fold, (train_idx, valid_idx) in enumerate(folds.split(df[feats], df[target])):
        train_x, train_y = df[feats].iloc[train_idx], df[target].iloc[train_idx]
        valid_x, valid_y = df[feats].iloc[valid_idx], df[target].iloc[valid_idx]

        # Early stopping is passed as a callback; recent LightGBM releases
        # no longer accept `early_stopping_rounds` / `verbose` in fit()
        clf.fit(train_x, train_y,
                eval_set = [(valid_x, valid_y)], eval_metric = metric,
                callbacks = [lgb.early_stopping(stopping_rounds = 100, verbose = False)])

        # Out-of-fold predictions for the validation rows of this fold
        pred[valid_idx] = clf.predict_proba(valid_x, num_iteration = clf.best_iteration_)[:, 1]

        feature_importance_df[n_fold] = pd.Series(clf.feature_importances_, index = feats)

        del train_x, train_y, valid_x, valid_y
        gc.collect()

    return feature_importance_df, pred, roc_auc_score(df[target], pred)

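We drop the identifier and bounding box columns (`patientId`, `x`, `y`, `width`, `height`), since box coordinates are only present for positive rows and would leak the label. We can then fit an LGBM classifier on the remaining metadata columns and compute the area under the ROC curve.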

In [None]:
f_imp, _, score = fast_lgbm_cv_scores(df_meta.drop(['patientId', 'x', 'y', 'width', 'height'], axis = 1), 
                                      target = 'Target', task = 'classification')
print('ROC-AUC for Target = {}'.format(score))

In [None]:
f_imp

The area under the ROC curve is quite high (~0.75), which means that the metadata carries significant information about the `Target` column. This shows that the first approach we proposed has real potential.
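
To make the ROC-AUC figure easier to interpret, here is a minimal sketch of what the metric measures: the fraction of (positive, negative) pairs in which the positive example receives the higher score. The helper `roc_auc` and the toy labels and scores are made up for illustration; the notebook itself uses `roc_auc_score` from scikit-learn.

```python
def roc_auc(labels, scores):
    """ROC-AUC as the fraction of (positive, negative) pairs ranked correctly.

    A tie between a positive and a negative score counts as half a correct pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos) * len(neg))


# Toy labels and scores (illustrative only, not taken from the dataset)
labels = [0, 0, 1, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.7, 0.9]
print(roc_auc(labels, scores))
```

Under this reading, a score of 0.5 corresponds to random ranking and 1.0 to a perfect separation of the two classes.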

# Model Building

In the previous section, we saw that the metadata plays an important role in predicting the `Target` column. However, including both the metadata and the images in the pipeline would mean having three networks:
1. One network for getting information out of the metadata
2. One network for processing the chest x-ray image
3. One network for combining the outputs of the above two networks.

Before we start with the above, let's try a simpler approach. We will focus only on the chest x-ray images and train a simple CNN model to see what performance we get out of it. We can then modify the layers and the activation functions to understand their effect on performance. Going by typical computer vision solutions, however, it is highly unlikely that a shallow network will give excellent results. That is why we will next shift to transfer learning with pre-trained models like VGG-16 and VGG-19, and then to more advanced networks like Mask-RCNN and YOLOv3. Since the recently released YOLOv4 has been shown to give better results, we can also train a YOLOv4 model and compare. In the interim report, though, we will focus primarily on training a shallow CNN model and using VGG models for transfer learning.

Before we jump into model building, let's first generate the data which we can use later on in model training.

In [None]:
import os
import csv
import random
import pydicom
import numpy as np
import pandas as pd
from skimage import measure
from skimage.transform import resize
import matplotlib.patches as patches
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow import keras
%matplotlib inline

In [None]:
pneumonia_locations = {}
# load table
with open(os.path.join('../input/stage_2_train_labels.csv'),
          'r') as infile:
    # open reader
    reader = csv.reader(infile)
    # skip header
    next(reader, None)
    # loop through rows
    for rows in reader:
        # retrieve information
        filename = rows[0]
        location = rows[1:5]
        pneumonia = rows[5]
        # if row contains pneumonia add label to dictionary
        # which contains a list of pneumonia locations per filename
        if pneumonia == '1':
            # convert string to float to int
            location = [int(float(i)) for i in location]
            # save pneumonia location in dictionary
            if filename in pneumonia_locations:
                pneumonia_locations[filename].append(location)
            else:
                pneumonia_locations[filename] = [location]

In [None]:
# load and shuffle filenames
folder = '../input/stage_2_train_images'
filenames = os.listdir(folder)
random.shuffle(filenames)
# split into train and validation filenames
n_valid_samples = 8000
train_filenames = filenames[n_valid_samples:]
valid_filenames = filenames[:n_valid_samples]
print('n train samples', len(train_filenames))
print('n valid samples', len(valid_filenames))
n_train_samples = len(filenames) - n_valid_samples

### Data generator

The dataset is too large to fit into memory, so we need to create a generator that loads data on the fly.

The generator takes in some filenames, batch_size and other parameters.

The generator outputs a random batch of numpy images and numpy masks.
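
The batching rules of such a generator can be illustrated without Keras. The sketch below uses a hypothetical `ToyBatches` class and made-up filenames; it only mirrors the idea that training keeps full batches while prediction also keeps the final, shorter batch.

```python
import math


class ToyBatches:
    """Index-based batch container that mimics the batching rules of a
    keras.utils.Sequence without depending on Keras (illustrative only)."""

    def __init__(self, items, batch_size, keep_partial):
        self.items = list(items)
        self.batch_size = batch_size
        self.keep_partial = keep_partial

    def __len__(self):
        n = len(self.items) / self.batch_size
        # keep_partial=True: include the final short batch
        # keep_partial=False: full batches only
        return math.ceil(n) if self.keep_partial else int(n)

    def __getitem__(self, index):
        if not 0 <= index < len(self):
            raise IndexError(index)
        start = index * self.batch_size
        return self.items[start:start + self.batch_size]


files = ["img_{}.dcm".format(i) for i in range(10)]  # hypothetical filenames
train_like = ToyBatches(files, batch_size=4, keep_partial=False)
predict_like = ToyBatches(files, batch_size=4, keep_partial=True)
print(len(train_like), len(predict_like))
print(predict_like[2])
```

With 10 files and a batch size of 4, the training-style container drops the last two files in each pass, which is acceptable when the filenames are reshuffled every epoch, whereas the prediction-style container must return every file.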

In [None]:
class generator(keras.utils.Sequence):    
    def __init__(self, folder, filenames, pneumonia_locations=None, batch_size=32, image_size=256, shuffle=True, augment=False, predict=False):
        self.folder = folder
        self.filenames = filenames
        self.pneumonia_locations = pneumonia_locations
        self.batch_size = batch_size
        self.image_size = image_size
        self.shuffle = shuffle
        self.augment = augment
        self.predict = predict
        self.on_epoch_end()
        
    def __load__(self, filename):
        # load dicom file as numpy array
        img = pydicom.dcmread(os.path.join(self.folder, filename)).pixel_array
        # create empty mask
        msk = np.zeros(img.shape)
        # get filename without extension
        filename = filename.split('.')[0]
        # if image contains pneumonia
        if filename in self.pneumonia_locations:
            # loop through pneumonia
            for location in self.pneumonia_locations[filename]:
                # add 1's at the location of the pneumonia
                x, y, w, h = location
                msk[y:y+h, x:x+w] = 1
        # resize both image and mask
        img = resize(img, (self.image_size, self.image_size), mode='reflect')
        msk = resize(msk, (self.image_size, self.image_size), mode='reflect') > 0.5
        # if augment then horizontal flip half the time
        if self.augment and random.random() > 0.5:
            img = np.fliplr(img)
            msk = np.fliplr(msk)
        # add trailing channel dimension
        img = np.expand_dims(img, -1)
        msk = np.expand_dims(msk, -1)
        return img, msk
    
    def __loadpredict__(self, filename):
        # load dicom file as numpy array
        img = pydicom.dcmread(os.path.join(self.folder, filename)).pixel_array
        # resize image
        img = resize(img, (self.image_size, self.image_size), mode='reflect')
        # add trailing channel dimension
        img = np.expand_dims(img, -1)
        return img
        
    def __getitem__(self, index):
        # select batch
        filenames = self.filenames[index*self.batch_size:(index+1)*self.batch_size]
        # predict mode: return images and filenames
        if self.predict:
            # load files
            imgs = [self.__loadpredict__(filename) for filename in filenames]
            # create numpy batch
            imgs = np.array(imgs)
            return imgs, filenames
        # train mode: return images and masks
        else:
            # load files
            items = [self.__load__(filename) for filename in filenames]
            # unzip images and masks
            imgs, msks = zip(*items)
            # create numpy batch
            imgs = np.array(imgs)
            msks = np.array(msks)
            return imgs, msks
        
    def on_epoch_end(self):
        if self.shuffle:
            random.shuffle(self.filenames)
        
    def __len__(self):
        if self.predict:
            # return everything
            return int(np.ceil(len(self.filenames) / self.batch_size))
        else:
            # return full batches only
            return int(len(self.filenames) / self.batch_size)

Let's also define the loss functions and the evaluation metric (mean IoU) that we are going to use, along with the building blocks of the network.
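
As a quick sanity check of the smoothed IoU (Jaccard) formula, the following sketch evaluates it on small flat binary masks in plain Python. The function name `smoothed_iou` and the toy masks are illustrative assumptions; the formula is the `(intersection + 1) / (union + 1)` form used in the TensorFlow code of this notebook.

```python
def smoothed_iou(mask_true, mask_pred, smooth=1.0):
    """Jaccard index of two flat binary masks with additive smoothing."""
    intersection = sum(t * p for t, p in zip(mask_true, mask_pred))
    union = sum(mask_true) + sum(mask_pred) - intersection
    return (intersection + smooth) / (union + smooth)


# Toy masks (illustrative only)
truth = [0, 1, 1, 1, 0, 0, 0, 0]
pred = [0, 0, 1, 1, 1, 0, 0, 0]
empty = [0] * 8

print(smoothed_iou(truth, pred))              # overlap of 2 pixels, union of 4
print(smoothed_iou(truth, pred, smooth=0.0))  # plain IoU without smoothing
print(smoothed_iou(empty, empty))             # both masks empty
print(smoothed_iou(empty, pred))              # false positive on an empty mask
```

The smoothing term keeps the score defined when both masks are empty, which matters here because most images contain no pneumonia boxes.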

In [None]:
# define iou or jaccard loss function
def iou_loss(y_true, y_pred):
    #print(y_true)
    y_true=tf.cast(y_true, tf.float32)
    y_pred=tf.cast(y_pred, tf.float32)
    y_true = tf.reshape(y_true, [-1])
    y_pred = tf.reshape(y_pred, [-1])
   
    intersection = tf.reduce_sum(y_true * y_pred)
    score = (intersection + 1.) / (tf.reduce_sum(y_true) + tf.reduce_sum(y_pred) - intersection + 1.)
    return 1 - score

# combine bce loss and iou loss
def iou_bce_loss(y_true, y_pred):
    return 0.5 * keras.losses.binary_crossentropy(y_true, y_pred) + 0.5 * iou_loss(y_true, y_pred)

# mean iou as a metric
def mean_iou(y_true, y_pred):
    y_pred = tf.round(y_pred)
    intersect = tf.reduce_sum(y_true * y_pred, axis=[1, 2, 3])
    union = tf.reduce_sum(y_true, axis=[1, 2, 3]) + tf.reduce_sum(y_pred, axis=[1, 2, 3])
    smooth = tf.ones(tf.shape(intersect))
    return tf.reduce_mean((intersect + smooth) / (union - intersect + smooth))

def create_downsample(channels, inputs):
    x = keras.layers.BatchNormalization(momentum=0.9)(inputs)
    x = keras.layers.LeakyReLU(0)(x)
    x = keras.layers.Conv2D(channels, 1, padding='same', use_bias=False)(x)
    x = keras.layers.MaxPool2D(2)(x)
    return x

def create_resblock(channels, inputs):
    x = keras.layers.BatchNormalization(momentum=0.9)(inputs)
    x = keras.layers.LeakyReLU(0)(x)
    x = keras.layers.Conv2D(channels, 3, padding='same', use_bias=False)(x)
    x = keras.layers.BatchNormalization(momentum=0.9)(x)
    x = keras.layers.LeakyReLU(0)(x)
    x = keras.layers.Conv2D(channels, 3, padding='same', use_bias=False)(x)
    return keras.layers.add([x, inputs])

def create_network(input_size, channels, n_blocks=2, depth=4):
    # input
    inputs = keras.Input(shape=(input_size, input_size, 1))
    x = keras.layers.Conv2D(channels, 3, padding='same', use_bias=False)(inputs)
    # residual blocks
    for d in range(depth):
        channels = channels * 2
        x = create_downsample(channels, x)
        for b in range(n_blocks):
            x = create_resblock(channels, x)
    # output
    x = keras.layers.BatchNormalization(momentum=0.9)(x)
    x = keras.layers.LeakyReLU(0)(x)
    x = keras.layers.Conv2D(1, 1, activation='sigmoid')(x)
    outputs = keras.layers.UpSampling2D(2**depth)(x)
    model = keras.Model(inputs=inputs, outputs=outputs)
    return model

Next, we will define the batch size and the image size.

In [None]:
BATCH_SIZE = 128
IMAGE_SIZE = 128

Let's now start training the model.
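
The training cell below uses a cosine learning-rate schedule. The sketch here re-implements the same formula with the standard library only, so the values per epoch can be inspected; the parameter names `base_lr` and `period` are renamed for illustration and correspond to `lr` and `epochs` inside the notebook's `cosine_annealing`.

```python
import math


def cosine_annealing(epoch, base_lr=0.0001, period=3):
    """Cosine schedule: base_lr at epoch 0, falling to 0 at epoch == period."""
    return base_lr * (math.cos(math.pi * epoch / period) + 1.0) / 2.0


# Print the learning rate for each of the 5 training epochs (indices 0..4)
for epoch in range(5):
    print(epoch, cosine_annealing(epoch))
```

Note that with a period of 3 and 5 training epochs, the rate reaches 0 at epoch index 3 and rises again to a quarter of the base rate at index 4. If a monotonically decreasing schedule is intended, setting the period to the total number of epochs would be one option.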

In [None]:
model = create_network(input_size=IMAGE_SIZE, channels=32, n_blocks=2, depth=4)
model.compile(optimizer='adam', loss=iou_bce_loss, metrics=['accuracy', mean_iou])

# cosine learning rate annealing
def cosine_annealing(x):
    lr = 0.0001
    epochs = 3
    return lr*(np.cos(np.pi*x/epochs)+1.)/2


learning_rate = tf.keras.callbacks.LearningRateScheduler(cosine_annealing)

# create train and validation generators
folder = '../input/stage_2_train_images'
train_gen = generator(folder, train_filenames, pneumonia_locations, batch_size=BATCH_SIZE, 
                      image_size=IMAGE_SIZE, shuffle=True, augment=False, predict=False)
valid_gen = generator(folder, valid_filenames, pneumonia_locations, batch_size=BATCH_SIZE, 
                      image_size=IMAGE_SIZE, shuffle=False, predict=False)

print(model.summary())

In [None]:
EPOCHS=5
MULTI_PROCESSING = True 

history = model.fit_generator(train_gen, validation_data=valid_gen, callbacks=[learning_rate], epochs=EPOCHS, 
                              workers=4, use_multiprocessing=MULTI_PROCESSING)

Let's now plot the training & validation accuracy, loss and IoU values.

In [None]:
plt.figure(figsize=(12,4))
plt.subplot(131)
plt.plot(history.epoch, history.history["loss"], label="Train loss")
plt.plot(history.epoch, history.history["val_loss"], label="Valid loss")
plt.legend()
plt.subplot(132)
plt.plot(history.epoch, history.history["accuracy"], label="Train accuracy")
plt.plot(history.epoch, history.history["val_accuracy"], label="Valid accuracy")
plt.legend()
plt.subplot(133)
plt.plot(history.epoch, history.history["mean_iou"], label="Train iou")
plt.plot(history.epoch, history.history["val_mean_iou"], label="Valid iou")
plt.legend()
plt.show()

Finally, let's use the model we have trained to predict the output using the validation generator.
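
The plotting cell turns each thresholded mask into bounding boxes via connected components. As a rough illustration of that step, the sketch below does the same on a tiny mask with a plain breadth-first flood fill instead of `skimage.measure`. The function `bounding_boxes`, the toy mask and the choice of 8-connectivity are assumptions of this sketch, not the library's implementation.

```python
from collections import deque


def bounding_boxes(mask):
    """Return (min_row, min_col, max_row, max_col) per connected region.

    `mask` is a list of rows of 0/1 values; max_row and max_col are exclusive.
    Pixels are grouped with 8-connectivity (an assumption of this sketch).
    """
    n_rows, n_cols = len(mask), len(mask[0])
    seen = [[False] * n_cols for _ in range(n_rows)]
    boxes = []
    for r in range(n_rows):
        for c in range(n_cols):
            if not mask[r][c] or seen[r][c]:
                continue
            # breadth-first flood fill from an unvisited foreground pixel
            queue = deque([(r, c)])
            seen[r][c] = True
            r0, c0, r1, c1 = r, c, r, c
            while queue:
                y, x = queue.popleft()
                r0, c0 = min(r0, y), min(c0, x)
                r1, c1 = max(r1, y), max(c1, x)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < n_rows and 0 <= nx < n_cols
                                and mask[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            boxes.append((r0, c0, r1 + 1, c1 + 1))
    return boxes


# Toy mask with two separate regions (illustrative only)
toy_mask = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0],
    [0, 1, 1, 0, 0, 0],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 0, 0, 0],
]
for y, x, y2, x2 in bounding_boxes(toy_mask):
    print("x={}, y={}, width={}, height={}".format(x, y, x2 - x, y2 - y))
```

The returned tuples follow the same row-first ordering that the plotting code unpacks as `y, x, y2, x2`, so width and height are obtained by subtraction.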

In [None]:
i=0
for imgs, msks in valid_gen:    
    # predict batch of images
    preds = model.predict(imgs)
    # create figure
    f, axarr = plt.subplots(4, 8, figsize=(20,15))
    axarr = axarr.ravel()
    axidx = 0
    # loop through batch
    for img, msk, pred in zip(imgs, msks, preds):
        i=i+1
        #exit after 32 images
        if i>32:
            break
        # plot image
        axarr[axidx].imshow(img[:, :, 0])
        # threshold true mask
        comp = msk[:, :, 0] > 0.5
        # apply connected components
        comp = measure.label(comp)
        # apply bounding boxes
        predictionString = ''
        for region in measure.regionprops(comp):
            # retrieve x, y, height and width
            y, x, y2, x2 = region.bbox
            height = y2 - y
            width = x2 - x
            axarr[axidx].add_patch(patches.Rectangle((x,y),width,height,linewidth=2,
                                                     edgecolor='b',facecolor='none'))
        # threshold predicted mask
        comp = pred[:, :, 0] > 0.5
        # apply connected components
        comp = measure.label(comp)
        # apply bounding boxes
        predictionString = ''
        for region in measure.regionprops(comp):
            # retrieve x, y, height and width
            y, x, y2, x2 = region.bbox
            height = y2 - y
            width = x2 - x
            axarr[axidx].add_patch(patches.Rectangle((x,y),width,height,linewidth=2,
                                                     edgecolor='r',facecolor='none'))
        axidx += 1
    plt.show()
    # only plot one batch
    break

## VGG 16 and VGG 19

Next, let's use a pre-trained model instead of training our own CNN model from scratch.

For training the VGG-16 and VGG-19 models, we will have to change the directory structure slightly.
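
The layout that `flow_from_directory` expects is one sub-folder per class under a common root. The sketch below builds that layout in a temporary directory (a stand-in for the Kaggle working directory) and shows the class indices that result if folder names are mapped in alphanumeric order, which is how Keras is documented to infer them when no explicit `classes` list is given. The helper `make_class_folders` is hypothetical.

```python
import os
import tempfile


def make_class_folders(root, class_names):
    """Create root/<class_name>/ for every class and return the sorted names."""
    for name in class_names:
        os.makedirs(os.path.join(root, name), exist_ok=True)
    return sorted(os.listdir(root))


# A temporary directory stands in for the notebook's working directory
with tempfile.TemporaryDirectory() as tmp:
    data_dir = os.path.join(tmp, "data")
    classes = make_class_folders(data_dir, ["positive", "negative"])
    # Alphabetical order determines the class index: negative -> 0, positive -> 1
    class_indices = {name: idx for idx, name in enumerate(classes)}
    print(class_indices)
```

Under this mapping, label 1 corresponds to the `positive` folder, which is the class that a weight such as `{0: 1, 1: 3.3}` would emphasise.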

In [None]:
import os
import csv
import random
import pydicom
import numpy as np
import pandas as pd
from skimage import measure
from skimage.transform import resize
import matplotlib.patches as patches
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow import keras
%matplotlib inline

In [None]:
from tqdm import tqdm

In [None]:
df=pd.read_csv('../input/stage_2_train_labels.csv')

In [None]:
df['path']='../input/stage_2_train_images/'+df['patientId'].astype(str)+'.dcm'

In [None]:
negative=df[df['Target']==0]
print(len(negative))

In [None]:
positive=df[df['Target']==1]
unique_positive=positive[['path','patientId']]
path=unique_positive['path'].unique()
patientId=unique_positive['patientId'].unique()

unique_positive=pd.DataFrame({'path':path,'patientId':patientId})
len(unique_positive)

In [None]:
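# create the class folders (no error if they already exist)
os.makedirs('/kaggle/working/data/positive', exist_ok=True)
os.makedirs('/kaggle/working/data/negative', exist_ok=True)
# always switch to the working directory, even when the folders were already there
os.chdir('/kaggle/working')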

In [None]:
for _,row in tqdm(unique_positive.iterrows()):
    img=pydicom.dcmread(row['path']).pixel_array
    img=resize(img,(256,256))
    plt.imsave('data/positive/'+row['patientId']+'.jpg',img,cmap='gray')

In [None]:
for _,row in tqdm(negative.iterrows()):
    img=pydicom.dcmread(row['path']).pixel_array
    img=resize(img,(256,256))
    plt.imsave('data/negative/'+row['patientId']+'.jpg',img,cmap='gray')

In [None]:
from tensorflow.keras.applications.vgg19 import VGG19,preprocess_input
from tensorflow.keras.preprocessing.image import ImageDataGenerator

In [None]:
datagen=ImageDataGenerator(samplewise_center=True,samplewise_std_normalization=True,horizontal_flip=True,
                          width_shift_range=0.05,rescale=1/255,fill_mode='nearest',height_shift_range=0.05,
                           preprocessing_function=preprocess_input,validation_split=0.3,
                          )

In [None]:
# Create data-generators for training and validation/testing
train=datagen.flow_from_directory('data',color_mode='rgb',batch_size=32,
                                  class_mode='binary',subset='training')
test=datagen.flow_from_directory('data',color_mode='rgb',batch_size=32,
                                 class_mode='binary',subset='validation')

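We will load VGG-19 without its fully connected top layers (`include_top = False`), freeze the pre-trained convolutional layers, and add our own layers on top of the model.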

In [None]:
pre_trained_model = VGG19(input_shape = (256,256,3), 
                                include_top = False, 
                                weights = 'imagenet')

for layer in pre_trained_model.layers:
    layer.trainable = False

# pre_trained_model.summary()

last_layer = pre_trained_model.get_layer('block5_pool')
print('last layer output shape: ', last_layer.output_shape)
last_output = last_layer.output

In [None]:
from tensorflow.keras.layers import Flatten,Dense,Dropout,BatchNormalization,LeakyReLU,ReLU,GaussianDropout

In [None]:
model = Flatten()(last_output)
model = Dense(1024)(model)
model=LeakyReLU(0.1)(model)
model=Dropout(0.25)(model)
model=BatchNormalization()(model)
model = Dense(1024)(model)
model=LeakyReLU(0.1)(model)
model=Dropout(0.25)(model)
model=BatchNormalization()(model)
model = Dense(1, activation='sigmoid')(model)

In [None]:
from tensorflow.keras.models import Model

In [None]:
fmodel = Model( pre_trained_model.input, model) 

fmodel.compile(optimizer = 'adam', 
              loss = 'binary_crossentropy', 
              metrics = ['accuracy'])

In [None]:
fmodel.summary()

In [None]:
from tensorflow.keras.callbacks import EarlyStopping,ReduceLROnPlateau

In [None]:
early=EarlyStopping(monitor='accuracy',patience=3,mode='auto')
reduce_lr = ReduceLROnPlateau(monitor='accuracy', factor=0.5, 
                              patience=2, verbose=1,cooldown=0, mode='auto',min_delta=0.0001, min_lr=1e-5)

<!--
**Note** - The training was done on Kaggle kernel for VGG-19. Because of 30 hours restriction, only 20 steps per epoch were chosen, which resulted in a drop in accuracy. If time permits, we will re-run the training with 100 steps per epoch.
-->

In [None]:
# class_weight={0:1,1:3.3}
# # Train model
# fmodel.fit(train,epochs=30,callbacks=[reduce_lr],
#            steps_per_epoch=100,validation_data=test,class_weight=class_weight)

In [None]:
# fmodel.save('/kaggle/working/model_vgg19.h5')

In [None]:
# # Plot accuracy
# plt.figure(figsize=(30,20))
# val_acc=np.asarray(fmodel.history.history['val_accuracy'])*100
# acc=np.asarray(fmodel.history.history['accuracy'])*100
# acc=pd.DataFrame({'val_acc':val_acc,'acc':acc})
# acc.plot(figsize=(20,10),yticks=range(50,100,5))

In [None]:
# # Plot loss
# loss=fmodel.history.history['loss']
# val_loss=fmodel.history.history['val_loss']
# loss=pd.DataFrame({'val_loss':val_loss,'loss':loss})
# loss.plot(figsize=(20,10))

From the above graph, we can see the onset of overfitting around 30 epochs. But, from the accuracy graphs, we can see that since we stopped the training at 40 epochs, the model has not overfitted to a large extent.

**Note** - Training was stopped at 40 epochs since the 30 hours quota in Kaggle was completed.

In [None]:
# y=[]

# test.reset()

# for i in tqdm(range(84)):
#     _,tar=test.__getitem__(i)
#     for j in tar:
#         y.append(j)

In [None]:
# test.reset()
# y_pred=fmodel.predict(test)

In [None]:
# pred=[]
# for i in y_pred:
#     if i[0]>=0.5:
#         pred.append(1)
#     else:
#         pred.append(0)

In [None]:
# from sklearn.metrics import roc_curve,auc,precision_recall_curve,classification_report

In [None]:
# # Classification report
# print(classification_report(y,pred[:len(y)]))

We can see that the F1-score for the normal category is very high (0.86), whereas for pneumonia it is just 0.56. This clearly shows the effect of the imbalanced dataset. Let's have a look at the area under the ROC curve to assess the model performance.

In [None]:
# plt.figure(figsize=(30,20))
# fpr,tpr,_=roc_curve(y,y_pred[:len(y)])
# area_under_curve=auc(fpr,tpr)
# print('The area under the curve is:',area_under_curve)
# # Plot area under curve
# plt.plot(fpr,tpr,'b.-')
# plt.xlabel('false positive rate')
# plt.ylabel('true positive rate')
# plt.plot(fpr,fpr,linestyle='--',color='black')

The area under the ROC curve is 0.86, which is much higher than 0.50 (a random model). This means that the model has actually learnt something and performs much better than a random model.

Next, we will replace the LeakyReLU layers with ReLU layers to see the effect this has on the performance metrics.

In [None]:
model = Flatten()(last_output)
model = Dense(1024)(model)
# plain ReLU: the first positional argument of ReLU is max_value,
# so ReLU(0.1) would clip activations at 0.1 instead of setting a slope
model=ReLU()(model)
model=Dropout(0.25)(model)
model=BatchNormalization()(model)
model = Dense(1024)(model)
model=ReLU()(model)
model=Dropout(0.25)(model)
model=BatchNormalization()(model)
model = Dense(1, activation='sigmoid')(model)

In [None]:
from tensorflow.keras.models import Model

In [None]:
fmodel = Model( pre_trained_model.input, model) 

fmodel.compile(optimizer = 'adam', 
              loss = 'binary_crossentropy', 
              metrics = ['accuracy'])

In [None]:
fmodel.summary()

In [None]:
from tensorflow.keras.callbacks import EarlyStopping,ReduceLROnPlateau

In [None]:
early=EarlyStopping(monitor='accuracy',patience=3,mode='auto')
reduce_lr = ReduceLROnPlateau(monitor='accuracy', factor=0.5, 
                              patience=2, verbose=1,cooldown=0, mode='auto',min_delta=0.0001, min_lr=1e-5)

<!--
**Note** - The training was done on Kaggle kernel for VGG-19. Because of 30 hours restriction, only 20 steps per epoch were chosen, which resulted in a drop in accuracy. If time permits, we will re-run the training with 100 steps per epoch.
-->

In [None]:
class_weight={0:1,1:3.3}
# Train model
fmodel.fit(train,epochs=30,callbacks=[reduce_lr],
           steps_per_epoch=100,validation_data=test,class_weight=class_weight)

In [None]:
# Plot accuracy
plt.figure(figsize=(30,20))
val_acc=np.asarray(fmodel.history.history['val_accuracy'])*100
acc=np.asarray(fmodel.history.history['accuracy'])*100
acc=pd.DataFrame({'val_acc':val_acc,'acc':acc})
acc.plot(figsize=(20,10),yticks=range(50,100,5))

In [None]:
# Plot loss
loss=fmodel.history.history['loss']
val_loss=fmodel.history.history['val_loss']
loss=pd.DataFrame({'val_loss':val_loss,'loss':loss})
loss.plot(figsize=(20,10))

In [None]:
y=[]

test.reset()

for i in tqdm(range(84)):
    _,tar=test.__getitem__(i)
    for j in tar:
        y.append(j)

In [None]:
test.reset()
y_pred=fmodel.predict(test)

In [None]:
pred=[]
for i in y_pred:
    if i[0]>=0.5:
        pred.append(1)
    else:
        pred.append(0)

In [None]:
from sklearn.metrics import roc_curve,auc,precision_recall_curve,classification_report

In [None]:
# Classification report
print(classification_report(y,pred[:len(y)]))

We can see that the F1-score for the normal category is very high (0.86), whereas for pneumonia it is just 0.56. This clearly shows the effect of the imbalanced dataset. Let's have a look at the area under the ROC curve to assess the model performance.
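
To see how class imbalance can produce such a gap between per-class F1-scores, here is a small worked example in plain Python. The helper `f1_for_class` and the ten toy labels are made up for illustration and are not the notebook's results.

```python
def f1_for_class(y_true, y_pred, label):
    """Precision, recall and F1 when `label` is treated as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Toy data: 8 normal cases (0) and 2 pneumonia cases (1), one error per class
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

for label in (0, 1):
    print(label, f1_for_class(y_true, y_pred, label))
```

In this toy case the classifier makes one mistake in each direction, yet the F1-score is 0.875 for the majority class and only 0.5 for the minority class, because the same number of errors weighs far more against the smaller class.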

In [None]:
plt.figure(figsize=(30,20))
fpr,tpr,_=roc_curve(y,y_pred[:len(y)])
area_under_curve=auc(fpr,tpr)
print('The area under the curve is:',area_under_curve)
# Plot area under curve
plt.plot(fpr,tpr,'b.-')
plt.xlabel('false positive rate')
plt.ylabel('true positive rate')
plt.plot(fpr,fpr,linestyle='--',color='black')

The area under the ROC curve is 0.86, which is much higher than 0.50 (a random model). This means that the model has actually learnt something and performs much better than a random model.

## Future Work

So far, we have trained a CNN model from scratch and used transfer learning with pre-trained models like VGG-16 and VGG-19. The CNN model gave a decent IoU of around 0.7 on both training and validation data, whereas the VGG models gave good accuracy on the classification problem. As next steps, we will try to exploit transfer learning further by training models like YOLO and Mask-RCNN, which are much better suited to object detection problems where the prediction must also include bounding box coordinates, as the current problem statement expects.