# Data Preparation

## Loading Prerequisites and Libraries 

**Standard Imports:**  
Plotly is configured to run in offline mode.  
Pandas is used for manipulating CSV files.  
NumPy is used for numerical operations.

**Custom Functions:**  
All additional functions and code are written and imported from the **eda_xray.py** file.

In [1]:
%load_ext autoreload
%autoreload 2
import os
from eda_xray import load_xray  # custom functions
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go

%matplotlib inline

pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 100)
pd.set_option('display.max_rows', 100)
pd.set_option('display.notebook_repr_html', True)
pd.set_option('display.precision', 5)

The NIH dataset contains a total of 112,120 images.

In [2]:
# set the image location and count the PNG images in it
img_location = 'data/images'
len([file for file in os.listdir(img_location) if file.endswith('.png')])

112120

## Importing data and renaming/dropping columns

The function **load_xray** does the following:
1. imports the CSV file
2. renames the columns to shorter, mostly lowercase names
3. drops the last, empty column
4. adds a new column called **'path'** that holds the location of each x-ray's *.png* image
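
The four steps above can be sketched roughly as follows. This is only an illustration, not the actual code in **eda_xray.py**: the function name `load_xray_sketch`, the `img_dir` parameter, the positional renaming and the toy CSV (placeholder header names, made-up rows) are all assumptions. The new column names are taken from the sample output shown in this notebook.

```python
from io import StringIO

import pandas as pd

# Column names as they appear in this notebook's sample output.
NEW_COLUMNS = ['imgindex', 'label', 'followup', 'patientID', 'age', 'gender',
               'viewposition', 'width', 'height', 'x', 'y']


def load_xray_sketch(csv_source, img_dir='data/images'):
    """Hypothetical stand-in for eda_xray.load_xray (not the real code)."""
    raw = pd.read_csv(csv_source)                  # 1. import the CSV
    raw = raw.drop(columns=raw.columns[-1])        # 3. drop the last, empty column
    raw.columns = NEW_COLUMNS                      # 2. rename the columns by position
    raw['path'] = img_dir + '/' + raw['imgindex']  # 4. add the image location
    return raw


# Toy CSV: the trailing comma on every line produces an empty last column.
toy_csv = StringIO(
    'c1,c2,c3,c4,c5,c6,c7,c8,c9,c10,c11,\n'
    '00000001_000.png,Cardiomegaly,0,1,58,M,PA,2682,2749,0.143,0.143,\n'
    '00000002_000.png,No Finding,0,2,81,M,PA,2500,2048,0.171,0.171,\n'
)
demo = load_xray_sketch(toy_csv)
print(demo[['imgindex', 'label', 'path']])
```

Renaming by position keeps the sketch independent of the exact header text in the CSV, at the cost of silently mislabeling columns if their order ever changes.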

In [3]:
data = 'data/Data_Entry_2017.csv'
df = load_xray(data)
df.sample(5)

| | imgindex | label | followup | patientID | age | gender | viewposition | width | height | x | y | path |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 25823 | 00006788_001.png | No Finding | 1 | 6788 | 46 | F | PA | 2021 | 2021 | 0.19431 | 0.19431 | data/images/00006788_001.png |
| 35442 | 00009349_022.png | Mass | 22 | 9349 | 70 | M | AP | 2500 | 2048 | 0.168 | 0.168 | data/images/00009349_022.png |
| 105810 | 00028486_000.png | No Finding | 0 | 28486 | 37 | F | PA | 3056 | 2544 | 0.139 | 0.139 | data/images/00028486_000.png |
| 72633 | 00017906_006.png | Infiltration | 6 | 17906 | 22 | M | PA | 2602 | 2991 | 0.143 | 0.143 | data/images/00017906_006.png |
| 5822 | 00001565_000.png | No Finding | 0 | 1565 | 42 | M | PA | 2500 | 2048 | 0.171 | 0.171 | data/images/00001565_000.png |


# Exploratory Data Analysis

## Unique pathologies of x-ray abnormalities and choosing a subset

The label column contains 836 unique values. Each value is a different combination of one or more pathologies assigned to an x-ray.

In [4]:
df.label.nunique()

836
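
To see how such combinations break down into individual findings, the label strings can be split on the `|` separator and counted one finding at a time. The snippet below is a small sketch on made-up labels in the same `A|B` format; the variable names are illustrative only.

```python
import pandas as pd

# Toy labels in the same 'A|B' format as the label column.
toy_labels = pd.Series(['No Finding', 'Infiltration', 'Effusion|Infiltration',
                        'Cardiomegaly|Nodule', 'Infiltration'])

# Number of distinct label strings (what nunique() reports).
n_combinations = toy_labels.nunique()

# Split each combination on '|' and count every finding separately.
per_finding = toy_labels.str.split('|').explode().value_counts()

print(n_combinations)
print(per_finding)
```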

If we ignore the multi-label abnormalities, we are left with 14 labels plus the *'No Finding'* classification.

In [5]:
single_labels = ['Atelectasis', 'Consolidation', 'Infiltration', 'Pneumothorax',
                 'Edema', 'Emphysema', 'Fibrosis', 'Effusion', 'Pneumonia',
                 'Pleural_Thickening', 'Cardiomegaly', 'Nodule', 'Mass', 'Hernia']
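# count the x-rays that carry exactly one of the 14 single labels
single_counts = df[df.label.isin(single_labels)].label.value_counts()
print(single_counts)

# name the columns explicitly so the plot does not depend on the
# pandas-version-specific column names produced by reset_index()
counts_df = single_counts.rename_axis('Pathology').reset_index(name='Count')
fig = px.bar(counts_df,
             x='Pathology',
             y='Count',
             title='X-ray counts for single label pathologies',
             template='plotly_white', width=800, height=600)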
fig.update_traces(marker_color='rgb(128,128,255)', marker_opacity=0.8)
fig.show()

Infiltration          9547
Atelectasis           4215
Effusion              3955
Nodule                2705
Pneumothorax          2194
Mass                  2139
Consolidation         1310
Pleural_Thickening    1126
Cardiomegaly          1093
Emphysema              892
Fibrosis               727
Edema                  628
Pneumonia              322
Hernia                 110
Name: label, dtype: int64


## Removing labels with 20 or fewer images

In [6]:
df.label.value_counts()

No Finding                                                   60361
Infiltration                                                  9547
Atelectasis                                                   4215
Effusion                                                      3955
Nodule                                                        2705
                                                             ...  
Cardiomegaly|Infiltration|Nodule|Pneumonia                       1
Infiltration|Cardiomegaly                                        1
Atelectasis|Fibrosis|Hernia                                      1
Effusion|Fibrosis|Infiltration|Mass                              1
Atelectasis|Consolidation|Pleural_Thickening|Pneumothorax        1
Name: label, Length: 836, dtype: int64

The label counts show that some labels/findings have only a single image and others only a handful. To facilitate classification and modeling, we keep only the labels that occur more than 20 times, which removes every x-ray whose label has 20 or fewer instances.

In [7]:
# keep only the labels that occur more than 20 times
df = df.groupby('label').filter(lambda x: len(x) > 20)
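
As a side note, the same rule can be written without a Python-level lambda by computing each row's group size with `transform('size')`. The sketch below uses a toy frame to show that both forms keep only labels with strictly more than 20 rows; the variable names are illustrative.

```python
import pandas as pd

# Toy data: 'A' has 21 rows, 'B' exactly 20, 'C' only 3.
toy = pd.DataFrame({'label': ['A'] * 21 + ['B'] * 20 + ['C'] * 3})

# Same rule as the cell above: keep a label only if it has MORE than 20 rows.
kept_filter = toy.groupby('label').filter(lambda g: len(g) > 20)

# Equivalent vectorised form: per-row group size, then a boolean mask.
group_size = toy.groupby('label')['label'].transform('size')
kept_mask = toy[group_size > 20]

print(kept_mask.label.value_counts())
```

Note that a label with exactly 20 rows is dropped by both versions, because the comparison is strict.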

## Creating Categorical Variables

The **create_categorical** function uses pandas to create one dummy variable per pathology and adds a *'target'* column that holds the one-hot encoding of the labels (several ones for multi-label x-rays), for later comparison with the convolutional neural network's results.

In [8]:
from eda_xray import create_categorical
df = create_categorical(df)
df.sample(5)

| | imgindex | label | followup | patientID | age | gender | viewposition | width | height | x | y | path | Atelectasis | Consolidation | Infiltration | Pneumothorax | Edema | Emphysema | Fibrosis | Effusion | Pneumonia | Pleural_Thickening | Cardiomegaly | Nodule | Mass | Hernia | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 101426 | 00026955_000.png | Cardiomegaly\|Nodule | 0 | 26955 | 78 | F | AP | 3056 | 2544 | 0.139 | 0.139 | data/images/00026955_000.png | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0] |
| 82260 | 00020215_000.png | No Finding | 0 | 20215 | 47 | F | PA | 2992 | 2991 | 0.143 | 0.143 | data/images/00020215_000.png | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0] |
| 69332 | 00017110_021.png | Effusion\|Infiltration | 21 | 17110 | 60 | F | AP | 2500 | 2048 | 0.168 | 0.168 | data/images/00017110_021.png | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | [0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0] |
| 59531 | 00014709_004.png | Infiltration | 4 | 14709 | 51 | M | AP | 2500 | 2048 | 0.168 | 0.168 | data/images/00014709_004.png | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0] |
| 93546 | 00023469_000.png | No Finding | 0 | 23469 | 30 | F | PA | 2048 | 2500 | 0.168 | 0.168 | data/images/00023469_000.png | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0] |
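
One way to produce columns like the ones shown above is `Series.str.get_dummies` with `|` as the separator. This is a sketch under assumptions, not the actual implementation in **eda_xray.py**: the function name `create_categorical_sketch` is hypothetical, and the column order is taken from the `single_labels` list defined earlier.

```python
import pandas as pd

single_labels = ['Atelectasis', 'Consolidation', 'Infiltration', 'Pneumothorax',
                 'Edema', 'Emphysema', 'Fibrosis', 'Effusion', 'Pneumonia',
                 'Pleural_Thickening', 'Cardiomegaly', 'Nodule', 'Mass', 'Hernia']


def create_categorical_sketch(frame):
    """Hypothetical stand-in for eda_xray.create_categorical."""
    # One 0/1 column per finding that occurs in the '|'-separated labels.
    dummies = frame['label'].str.get_dummies(sep='|')
    # Keep exactly the 14 pathologies, in a fixed order; 'No Finding' is dropped
    # and pathologies absent from the data become all-zero columns.
    dummies = dummies.reindex(columns=single_labels, fill_value=0)
    out = frame.join(dummies)
    # Pack each row's 14 indicators into a single list-valued cell.
    out['target'] = pd.Series(dummies.values.tolist(), index=frame.index)
    return out


toy = pd.DataFrame({'label': ['Cardiomegaly|Nodule', 'No Finding',
                              'Effusion|Infiltration']})
encoded = create_categorical_sketch(toy)
print(encoded[['label', 'target']])
```

With this encoding a *'No Finding'* x-ray is represented by an all-zero vector, which matches the sample rows above.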


## Saving Subset To Pickle

The following cells create the _subset_ folder and save the subset dataframe to it.

In [39]:
import os
# create the output folder if it does not already exist
os.makedirs('subset', exist_ok=True)

In [40]:
df.to_pickle('subset/subset.pkl')

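## Split train, validation, and test data

The data is split in two stratified steps: 30% of the subset is held out as the validation set, then 30% of the remainder is held out as the test set. This leaves roughly 49% of the subset for training, 30% for validation and 21% for testing.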

In [10]:
from sklearn.model_selection import train_test_split

train_set, valid_set = train_test_split(df, test_size=0.3,
                                        random_state=42,
                                        stratify=df.label)

train_set, test_set = train_test_split(train_set, test_size=0.3,
                                       random_state=42,
                                       stratify=train_set.label)


print('training set values: ', train_set.shape[0])
print('validation set values: ', valid_set.shape[0])
print('testing set values: ', test_set.shape[0])
print('subset data values:', df.shape[0])

training set values:  53692
validation set values:  32874
testing set values:  23011
subset data values: 109577
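
Because both splits are stratified on the label, each set should contain the labels in nearly the same proportions as the full subset. A minimal sketch of how this could be checked, using a toy frame instead of the real data (all names below are illustrative):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with imbalanced labels (a stand-in for the real dataframe).
toy = pd.DataFrame({'label': ['A'] * 100 + ['B'] * 50 + ['C'] * 50})

# Same two-step split as in the cell above.
train, valid = train_test_split(toy, test_size=0.3, random_state=42,
                                stratify=toy.label)
train, test = train_test_split(train, test_size=0.3, random_state=42,
                               stratify=train.label)

# Share of each label in the full data and in every split.
shares = pd.DataFrame({
    'all': toy.label.value_counts(normalize=True),
    'train': train.label.value_counts(normalize=True),
    'valid': valid.label.value_counts(normalize=True),
    'test': test.label.value_counts(normalize=True),
})

# Largest absolute deviation of any split from the overall label share.
max_gap = shares.sub(shares['all'], axis=0).abs().max().max()
print(shares.round(3))
print('largest deviation from the overall share:', round(max_gap, 3))
```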


## Visualizing the Split

Excluding the *'No Finding'* label, the chart shows only the 14 most frequent labels.

In [15]:
import plotly.graph_objects as go

train_counts = train_set.label.value_counts()
valid_counts = valid_set.label.value_counts()
test_counts = test_set.label.value_counts()

# 14 most frequent training labels, excluding 'No Finding'
labels = train_counts.index.drop('No Finding')[:14]

fig = go.Figure(data=[
    # reindex so that every count lines up with its label on the x axis
    go.Bar(name='Training Set', x=labels,
           y=train_counts.reindex(labels)),
    go.Bar(name='Validation Set', x=labels,
           y=valid_counts.reindex(labels)),
    go.Bar(name='Testing Set', x=labels,
           y=test_counts.reindex(labels))
])
# Change the bar mode
fig.update_layout(barmode='stack', template='plotly_white',
                  title='Training, Validation and Testing Splits')
fig.show()

## Saving Training, Validation and Testing Sets to Pickle

In [45]:
# these are pickle files, so use the .pkl extension rather than .csv
train_set.to_pickle('subset/train_set.pkl')
valid_set.to_pickle('subset/valid_set.pkl')
test_set.to_pickle('subset/test_set.pkl')
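
One practical reason to prefer pickle over CSV here is the list-valued *'target'* column: a pickle round trip preserves the lists, while a CSV round trip turns them into strings. The sketch below demonstrates this on a toy frame written to a temporary directory; the file names are illustrative only.

```python
import os
import tempfile

import pandas as pd

# Toy frame with a list-valued column, like the 'target' column above.
toy = pd.DataFrame({'label': ['Mass', 'No Finding'],
                    'target': [[0, 1], [0, 0]]})

with tempfile.TemporaryDirectory() as tmp:
    pkl_path = os.path.join(tmp, 'toy.pkl')
    csv_path = os.path.join(tmp, 'toy.csv')
    toy.to_pickle(pkl_path)
    toy.to_csv(csv_path, index=False)
    from_pickle = pd.read_pickle(pkl_path)
    from_csv = pd.read_csv(csv_path)

# The pickle keeps Python lists; the CSV stores their text representation.
print(type(from_pickle.loc[0, 'target']))
print(type(from_csv.loc[0, 'target']))
```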