1. [1. NIH Chest X-ray Dataset](#1-1)
    - [1.1 Brief Overview](#1-1)
    - [1.2 Data Limitations](#1-2)
    - [1.3 Class Description](#1-3)
    - [1.4 Special Remarks](#1-4)
2. [Loading Data](#2-1)
    - [2.1 Loading Libraries](#2-1)
    - [2.2 Loading DataFrames](#2-2)
    


<a name='1-1'></a>
# 1. NIH Chest X-ray Dataset
## National Institutes of Health Chest X-Ray Dataset
Chest X-ray exams are one of the most frequent and cost-effective medical imaging examinations available. However, clinical diagnosis of a chest X-ray can be challenging and sometimes more difficult than diagnosis via chest CT imaging. The lack of large publicly available datasets with annotations means it is still very difficult, if not impossible, to achieve clinically relevant computer-aided detection and diagnosis (CAD) in real world medical sites with chest X-rays. One major hurdle in creating large X-ray image datasets is the lack of resources for labeling so many images. Prior to the release of this dataset, [Openi](https://openi.nlm.nih.gov/) was the largest publicly available source of chest X-ray images, with 4,143 images available.

This NIH Chest X-ray Dataset is comprised of 112,120 X-ray images with disease labels from 30,805 unique patients. **To create these labels, the authors used Natural Language Processing to text-mine disease classifications from the associated radiological reports.** **The labels are expected to be >90% accurate and suitable for weakly-supervised learning.** The original radiology reports are not publicly available but you can find more details on the labeling process in this Open Access paper: "ChestX-ray8: Hospital-scale Chest X-ray Database and Benchmarks on Weakly-Supervised Classification and Localization of Common Thorax Diseases." [(Wang et al)](https://www.nih.gov/news-events/news-releases/nih-clinical-center-provides-one-largest-publicly-available-chest-x-ray-datasets-scientific-community)

<a name='1-2'></a>
## 1.2 Data Limitations
* The image labels are NLP-extracted, so some labels may be erroneous, but the NLP labeling accuracy is estimated to be >90%.
* Very few disease-region bounding boxes are provided (see `BBox_List_2017.csv`).
* The chest X-ray radiology reports are not anticipated to be publicly shared. Parties who use this public dataset are encouraged to share their "updated" image labels and/or new bounding boxes in their own later studies, for example through manual annotation.


### File contents
Image format: 112,120 total images with size 1024 x 1024
* images_001.zip: Contains 4,999 images
* images_002.zip: Contains 10,000 images
* images_003.zip: Contains 10,000 images
* images_004.zip: Contains 10,000 images
* images_005.zip: Contains 10,000 images
* images_006.zip: Contains 10,000 images
* images_007.zip: Contains 10,000 images
* images_008.zip: Contains 10,000 images
* images_009.zip: Contains 10,000 images
* images_010.zip: Contains 10,000 images
* images_011.zip: Contains 10,000 images
* images_012.zip: Contains 7,121 images
* README_ChestXray.pdf: Original README file

### `BBox_List_2017.csv`
Bounding box coordinates. Note: each box starts at (x, y) and extends horizontally w pixels and vertically h pixels.
* Image Index: File name
* Finding Label: Disease type (class label)
* Bbox x
* Bbox y
* Bbox w
* Bbox h
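
As a rough illustration of the box convention noted above (start at x, y, then extend w pixels horizontally and h pixels vertically), the sketch below converts one box into corner coordinates. The helper name `bbox_to_corners` and the sample values are hypothetical and are not taken from the CSV.

```python
def bbox_to_corners(x, y, w, h):
    """Convert an (x, y, w, h) box to (x_min, y_min, x_max, y_max)."""
    return x, y, x + w, y + h


# Hypothetical box; the numbers are illustrative, not from the dataset.
x_min, y_min, x_max, y_max = bbox_to_corners(100.0, 200.0, 50.0, 80.0)
print(x_min, y_min, x_max, y_max)
```

Corner coordinates are often more convenient when drawing a rectangle over an image or computing the overlap between two boxes.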

### `Data_Entry_2017.csv`
Class labels and patient data for the entire dataset.
* Image Index: File name
* Finding Labels: Disease type (class label)
* Follow-up #
* Patient ID
* Patient Age
* Patient Gender
* View Position: X-ray orientation
* OriginalImageWidth
* OriginalImageHeight
* OriginalImagePixelSpacing_x
* OriginalImagePixelSpacing_y

<a name='1-3'></a>
## 1.3 Class Descriptions
There are 15 classes (14 diseases and one for "No Finding"). Each image is labeled either "No Finding" or with one or more of the following disease classes:

* Atelectasis
* Consolidation
* Infiltration
* Pneumothorax
* Edema
* Emphysema
* Fibrosis
* Effusion
* Pneumonia
* Pleural_thickening
* Cardiomegaly
* Nodule
* Mass
* Hernia

<a name='1-4'></a>
## 1.4 Special Remarks
To use the images for a classification task, I converted the metadata file `Data_Entry_2017.csv` into a one-hot encoding of the class labels. The converted dataframe is available [here](https://www.kaggle.com/redwankarimsony/chestxray8-dataframe).

I also removed some images from the dataframe because they were inverted, rotated, or not a frontal view of the chest, and so carry little or no information necessary for the classification problem.
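
To make the one-hot conversion concrete, here is a minimal plain-Python sketch. It assumes the labels are stored as a pipe-separated string (as in the `Finding Labels` column) and that the class names are spelled as in the `CLASSES` list below; the helper `one_hot` is hypothetical and is not the code that produced the linked dataframe.

```python
# Assumed spelling of the 14 disease class names.
CLASSES = [
    "Atelectasis", "Cardiomegaly", "Consolidation", "Edema", "Effusion",
    "Emphysema", "Fibrosis", "Hernia", "Infiltration", "Mass", "Nodule",
    "Pleural_Thickening", "Pneumonia", "Pneumothorax",
]


def one_hot(finding_labels, classes=CLASSES):
    """Turn a pipe-separated label string into a list of 0/1 flags."""
    present = set(finding_labels.split("|"))
    return [int(name in present) for name in classes]


# Illustrative label strings in the dataset's format.
multi = one_hot("Cardiomegaly|Effusion")
healthy = one_hot("No Finding")
print(multi)
print(healthy)
```

With pandas, `Series.str.get_dummies(sep='|')` should give an equivalent table in one call, with an additional column for "No Finding".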


<a name='2-0'></a>
# 2. Loading Data

<a name='2-1'></a>
## 2.1 Loading Libraries


In [None]:
# Import necessary packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


from os import listdir
from os.path import join, isfile, isdir
from glob import glob


from PIL import Image
sns.set()
from tqdm import tqdm
%matplotlib inline




from keras.preprocessing.image import ImageDataGenerator


<a name='2-2'></a>
## 2.2 Loading DataFrames


In [None]:
data_dir1 = '../input/data/'
data_dir2 = '../input/chestxray8-dataframe/'
train_df = pd.read_csv(data_dir1 + 'Data_Entry_2017.csv')
image_label_map = pd.read_csv(data_dir2 + 'train_df.csv')
bad_labels = pd.read_csv(data_dir2 + 'cxr14_bad_labels.csv')

# Listing all the .png filepaths
image_paths = glob(data_dir1+'images_*/images/*.png')
print(f'Total image files found : {len(image_paths)}')
print(f'Total number of image labels: {image_label_map.shape[0]}')
print(f'Unique patients: {len(train_df["Patient ID"].unique())}')

image_label_map.drop(['No Finding'], axis = 1, inplace = True)
labels = image_label_map.columns[2:-1]
labels




<a name='2-3'></a>
## 2.3 Removing Samples with Bad Labels

Notice that the main dataset file `Data_Entry_2017.csv` contains **112120** rows, while the modified dataset `train_df.csv` contains only **111863** images. Some of the images are problematic, as discussed in this dataset's discussion [thread](https://www.kaggle.com/nih-chest-xrays/data/discussion/55461): they are inverted, non-frontal, or badly rotated, so they were removed. We therefore need a little preprocessing here to drop the same images from the main dataframe.

In [None]:
train_df.rename(columns={"Image Index": "Index"}, inplace = True)
image_label_map.rename(columns={"Image Index": "Index"}, inplace = True)
train_df = train_df[~train_df.Index.isin(bad_labels.Index)]
train_df.shape

# Map each file name (the last path component) to its file path
Index = [path.split('/')[-1] for path in image_paths]
index_path_map = pd.DataFrame({'Index': Index, 'FilePath': image_paths})
index_path_map.head()

# Merge the file path of each image into the main dataframe
# (the result is assigned, otherwise the merged frame is discarded)
train_df = pd.merge(train_df, index_path_map, on='Index', how='left')

In [None]:
train_df.head()

<a name='2-4'></a>
### 2.4 Preparing Images
With our dataset splits ready, we can now proceed with setting up our model to consume them. 
- For this, we use the off-the-shelf [ImageDataGenerator](https://keras.io/preprocessing/image/) class from the Keras framework, which allows us to build a "generator" for images specified in a dataframe. 
- This class also provides support for basic data augmentation such as random horizontal flipping of images.
- We also use the generator to transform the values in each image so that their mean is 0 and their standard deviation is 1. 
    - This will facilitate model training by standardizing the input distribution. 
- The generator also converts our single channel X-ray images (gray-scale) to a three-channel format by repeating the values in the image across all channels.
    - We will want this because the pre-trained model that we'll use requires three-channel inputs.

Since it is mainly a matter of reading and understanding the Keras documentation, the generator is already implemented below. There are a few things to note: 
1. Each image is normalized to zero mean and unit standard deviation.
2. The input is shuffled after each epoch.
3. The default image size is 320px by 320px, but it can be changed. 
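
The two transformations described above (per-image standardization and repeating the gray-scale channel three times) can be sketched in NumPy as follows. This only illustrates the idea on a random stand-in image; it is not the Keras implementation, and the array names are made up for the example.

```python
import numpy as np

# Stand-in for a single-channel X-ray: random pixel values, not real data.
rng = np.random.default_rng(0)
gray = rng.uniform(0, 255, size=(320, 320)).astype("float32")

# Per-image standardization: subtract the image's own mean and divide by
# its own standard deviation (the small epsilon avoids division by zero).
standardized = (gray - gray.mean()) / (gray.std() + 1e-6)

# Repeat the single channel three times to get a (height, width, 3) array.
three_channel = np.repeat(standardized[..., np.newaxis], 3, axis=-1)

print(three_channel.shape)
print(float(standardized.mean()), float(standardized.std()))
```

Because every channel holds the same values, the three-channel array carries no extra information; it only matches the input shape that a pre-trained model expects.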



<a name='2-5'></a>
### 2.5 Creating Data Generator

Before creating the training data generator with the Keras built-in class, we set several parameters. You can play with the following parameters to see what changes. 


In [None]:
IMAGE_SIZE=[256, 256]
EPOCHS = 20
# BATCH_SIZE = 8 * strategy.num_replicas_in_sync
BATCH_SIZE = 64

In [None]:
def get_train_generator(df, image_dir, x_col, y_cols, shuffle=True, batch_size=8, seed=1, target_w = 320, target_h = 320):
    """
    Return generator for training set, normalizing using batch
    statistics.

    Args:
      df (dataframe): dataframe specifying training data.
      image_dir (str): directory where image files are held.
      x_col (str): name of column in df that holds filenames.
      y_cols (list): list of strings that hold y labels for images.
      shuffle (bool): whether to shuffle the samples after each epoch.
      batch_size (int): images per batch to be fed into model during training.
      seed (int): random seed.
      target_w (int): final width of input images.
      target_h (int): final height of input images.
    
    Returns:
        train_generator (DataFrameIterator): iterator over training set
    """        
    print("getting train generator...")
    # normalize images
    image_generator = ImageDataGenerator(
        samplewise_center=True,
        samplewise_std_normalization= True, 
        shear_range=0.1,
        zoom_range=0.15,
        rotation_range=5,
        width_shift_range=0.1,
        height_shift_range=0.05,
        horizontal_flip=True, 
        vertical_flip = False, 
        fill_mode = 'reflect')
    
    
    # flow from dataframe with specified batch size
    # and target image size
    generator = image_generator.flow_from_dataframe(
            dataframe=df,
            directory=image_dir,
            x_col=x_col,
            y_col=y_cols,
            class_mode="raw",
            batch_size=batch_size,
            shuffle=shuffle,
            seed=seed,
            # Keras expects target_size as (height, width)
            target_size=(target_h, target_w))
    
    return generator

train_generator = get_train_generator(df = image_label_map,
                                      image_dir = None, 
                                      x_col = 'FilePath',
                                      y_cols = labels, 
                                      batch_size=BATCH_SIZE,
                                      target_w = IMAGE_SIZE[0], 
                                      target_h = IMAGE_SIZE[1] 
                                      )

<a name='2-6'></a>
### 2.6 Look at the X-rays
Now let's have a look at the images. Here the `get_label()` function returns the names of the diagnosis categories present in an image, joined by `|`. You can run the following cell multiple times to get more images. 

In [None]:
X, Y = next(train_generator)

def get_label(y):
    # Collect the names of all classes flagged in the label vector y
    ret_labels = []
    for idx in range(len(y)):
        if y[idx]: ret_labels.append(labels[idx])
    if len(ret_labels):  return '|'.join(ret_labels)
    else: return 'No Label'

rows = int(np.floor(np.sqrt(X.shape[0])))
cols = int(X.shape[0]//rows)
fig = plt.figure(figsize=(20,15))
for i in range(1, rows*cols+1):
    fig.add_subplot(rows, cols, i)
    # The three channels are identical, so show the first one in gray-scale
    plt.imshow(X[i-1][:, :, 0], cmap='gray')
    plt.title(get_label(Y[i-1]))
    plt.axis('off')

<a name='2-7'></a>
### 2.7 Diagnosis Distribution (Normal vs Sick)
Now let's have a look at the distribution of the dataset. It is quite evident that about half of the images show no problem: they are simply the X-rays of healthy people, marked as **No Finding** in the data frame. Here we compare the number of healthy and non-healthy X-rays.  

In [None]:
import bokeh
import IPython.display as ipd
from bokeh.layouts import column, row
from bokeh.models import ColumnDataSource, LinearAxis, Range1d
from bokeh.models.tools import HoverTool
from bokeh.palettes import BuGn4, cividis
from bokeh.plotting import figure, output_notebook, show, output_file
from bokeh.transform import cumsum
from bokeh.palettes import Category20b

output_notebook()
diagnosis = ['Normal', 'Sick' ]
counts = [(train_df['Finding Labels'] == 'No Finding').sum(), train_df.shape[0]- (train_df['Finding Labels'] == 'No Finding').sum()]
source = ColumnDataSource(pd.DataFrame({'Type':diagnosis,'Counts':counts, 'color':['#054000', '#e22d00']}))

tooltips = [
    ("Category", "@Type"),
    ("No of Samples", "@Counts")
]

normal_vs_sick = figure(x_range=diagnosis, y_range=(0,70000), plot_height=400, plot_width = 400, title="Normal vs Sick Distribution", tooltips = tooltips)
normal_vs_sick.vbar(x='Type', top='Counts', width=0.75, legend_field="Type", color = 'color', source=source)
normal_vs_sick.xgrid.grid_line_color = None
normal_vs_sick.legend.orientation = "vertical"
normal_vs_sick.legend.location = "top_right"
show(normal_vs_sick)




<a name='2-8'></a>
### 2.8 Diagnosis Distribution
Now let's have a look at how the individual diagnoses are distributed. The bar chart shows the number of images carrying each disease label, and the pie chart shows each label's share of all positive labels.  

In [None]:
data = image_label_map[labels].sum(axis=0).sort_values(ascending = True)

# bokeh packages

diagnosis = data.index.tolist()
source = ColumnDataSource(data=dict(diagnosis=data.index.tolist(), counts=data.tolist(), color = Category20b[len(data)]))

tooltips = [("Diagnosis", "@diagnosis"), ("Count", "@counts") ]
diag_dist = figure(x_range=diagnosis, y_range=(0,15000), plot_height=400, plot_width = 700, title="Diagnosis Distributions", tooltips = tooltips)
diag_dist.vbar(x='diagnosis', top='counts', width=0.65, color='color', legend_field="diagnosis", source=source)

diag_dist.xgrid.grid_line_color = None
diag_dist.legend.orientation = "vertical"
diag_dist.legend.location = "top_left"

# show(diag_dist)




def plot_pie_bokeh(data = None):
    from math import pi
    from bokeh.palettes import Category20c
    x = data.to_dict()

    data = pd.Series(x).reset_index(name='value').rename(columns={'index':'category'})
    data['angle'] = data['value']/data['value'].sum() * 2*pi
    data['color'] = Category20b[len(x)]
    p = figure(plot_height=400, plot_width = 700, title="Pie Chart", tooltips="@category: @value%", x_range=(-0.5, 1.0))
    p.wedge(x=0.38, y=1, radius=0.4, start_angle=cumsum('angle', include_zero=True), end_angle=cumsum('angle'),
            line_color="black", fill_color='color', legend_field='category', source=data)

    p.axis.axis_label=None
    p.axis.visible=False
    p.grid.grid_line_color = None

    p.legend.orientation = "vertical"
    p.legend.location = "top_left"
    
    return p


dist_diag_percent = plot_pie_bokeh(data/data.sum()*100)

show(column(diag_dist, dist_diag_percent))

We observe that among the identified conditions, **Infiltration, Effusion and Atelectasis** are the most common and **Hernia** is the least prevalent, at only **0.28%** among the sick patients. Even the most prevalent class, **Infiltration**, accounts for only **25%**, with the other 13 classes making up the rest. The positive classes are therefore highly imbalanced, which in turn biases CNN models toward negative predictions during training. 
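
To see why rare positive labels push a model toward negative predictions, consider the toy sketch below. The label matrix is invented for illustration (it is not drawn from the dataset); it computes the fraction of positive samples per class and the accuracy a model would get by always predicting "absent".

```python
# Toy multi-label matrix: rows are images, columns are classes.
# The values are invented for illustration and do not come from the dataset.
class_names = ["Infiltration", "Effusion", "Hernia"]
y = [
    [1, 0, 0],
    [0, 0, 0],
    [1, 1, 0],
    [0, 0, 0],
    [0, 1, 0],
    [0, 0, 0],
    [1, 0, 0],
    [0, 0, 1],
    [0, 0, 0],
    [0, 0, 0],
]

n_images = len(y)
positive_fraction = {
    name: sum(row[j] for row in y) / n_images
    for j, name in enumerate(class_names)
}

# A model that always predicts "absent" for a class is right exactly as
# often as that class is negative.
all_negative_accuracy = {
    name: 1.0 - frac for name, frac in positive_fraction.items()
}

for name in class_names:
    print(name, positive_fraction[name], all_negative_accuracy[name])
```

The rarer the class, the higher the accuracy of the trivial all-negative predictor, which is why plain accuracy is a misleading measure on data like this.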

In [None]:
show(plot_pie_bokeh(data/data.sum()*100))

<a name='2-9'></a>
### 2.9 Age Histogram Distribution
Looking at the metadata dataframe, I found something interesting. When I first tried to draw the histogram using `np.histogram`, I got an unusually wide range. Checking the metadata showed that Patient Age contains some unusually high values: among the patients with an age greater than 100, several samples list ages from 140+ up to 450 years. These are simply data entry errors, so I replaced them with the mean age of the remaining samples, which are below 100 years. 

In [None]:
train_df.rename(columns={"Patient Age": "PatientAge"}, inplace = True)
train_df[train_df['PatientAge'] > 100]

In [None]:
average_age = int(train_df[train_df['PatientAge'] < 100]['PatientAge'].mean())
for idx in range(train_df.shape[0]):
    if train_df.iloc[idx, 4] > 100:
        print(f'{train_df.iloc[idx, 0]} : age {train_df.iloc[idx, 4]} is changed to ->> {average_age}')
        train_df.iloc[idx, 4] = average_age

train_df[train_df['PatientAge'] > 100]
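
The same replacement can also be written without an explicit loop. The sketch below shows the idea on a small invented frame (`toy` is a stand-in, not the real `train_df`), using a boolean mask with `.loc`.

```python
import pandas as pd

# Toy frame with two implausible ages; the values are invented.
toy = pd.DataFrame({"PatientAge": [34, 58, 412, 71, 148]})

# Mean of the plausible ages, truncated to an integer as in the loop above.
valid = toy["PatientAge"] < 100
average_age = int(toy.loc[valid, "PatientAge"].mean())

# Overwrite every age above 100 in one assignment.
toy.loc[toy["PatientAge"] > 100, "PatientAge"] = average_age
print(toy["PatientAge"].tolist())
```

Selecting the column by name rather than by position also keeps the code working if the column order of the dataframe changes.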

In [None]:
def hist_hover(data, column=None,  title = 'Histogram',  colors=["SteelBlue", "Tan"], bins=30, log_scale=False, show_plot=True):

    # build histogram data with Numpy
    hist, edges = np.histogram(data, bins = bins)

    hist_df = pd.DataFrame({column: hist, "left": edges[:-1], "right": edges[1:]})
    hist_df["interval"] = ["%d to %d" % (left, right) for left, 
                           right in zip(hist_df["left"], hist_df["right"])]

    # bokeh histogram with hover tool
    if log_scale == True:
        hist_df["log"] = np.log(hist_df[column])
        src = ColumnDataSource(hist_df)
        plot = figure(plot_height = 300, plot_width = 600,
              title = title,
              x_axis_label = column.capitalize(),
              y_axis_label = "Log Count")    
        plot.quad(bottom = 0, top = "log",left = "left", 
            right = "right", source = src, fill_color = colors[0], 
            line_color = "black", fill_alpha = 0.7,
            hover_fill_alpha = 1.0, hover_fill_color = colors[1])
    else:
        src = ColumnDataSource(hist_df)
        plot = figure(plot_height = 300, plot_width = 600,
            title = title,
              x_axis_label = column.capitalize(),
              y_axis_label = "Count")    
        plot.quad(bottom = 0, top = column,left = "left", 
            right = "right", source = src, fill_color = colors[0], 
            line_color = "black", fill_alpha = 0.7,
            hover_fill_alpha = 1.0, hover_fill_color = colors[1])
    # hover tool
    hover = HoverTool(tooltips = [(' Age Interval', '@interval'),
                              ('Sample Count', str("@" +str(column)))])
    plot.add_tools(hover)
    # output
    if show_plot == True:
        show(plot)
    else:
        return plot

In [None]:
hist_hover(train_df['PatientAge'], column = 'PatientAge', bins = 100)

In [None]:
# The column was renamed to 'PatientAge' above
train_df[train_df['PatientAge'] > 100]

In [None]:
ages_male = train_df.loc[(train_df["Patient Gender"] == 'M'), "PatientAge"].tolist()
ages_female = train_df.loc[(train_df["Patient Gender"] == 'F'), "PatientAge"].tolist()

In [None]:
show(column(hist_hover(ages_male, column = 'MaleAges', title = 'Male Patients Age Histogram', bins = 95, show_plot=False),
            hist_hover(ages_female, column = 'FemaleAges', title = 'Female Patients Age Histogram',  bins = 95, show_plot=False)))

In [None]:
train_df.PatientAge.max() - train_df.PatientAge.min()