# Neural Networks

https://www.kaggle.com/paultimothymooney/chest-xray-pneumonia

## Project: Machine Learning 

---
## Domain Background & Problem Statement

**Domain Background**

Computer vision (image recognition) in the healthcare field.

**Problem Statement**

Classify chest X-ray images as normal or pneumonia using machine learning. The initial evaluation metric is accuracy.

# Project Layout

We break the notebook into separate steps. These links navigate the notebook.

* [Step 1](#import): Obtaining Data
* [Step 2](#eda): Exploratory Data Analysis
* [Step 3](#modeling): Modeling
    * [Model 1](#NN_baseline): Neural Network, 2 layers, baseline model
    * [Model 2](#NN): Neural Network, added complexity
    * [Model 3](#CNN): Convolutional Neural Network
    * [Model 4](#transfer): Transfer Learning NN
* [Step 99](#resources): Resources

In [None]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns
plt.rcParams["axes.grid"] = False
%matplotlib inline   

import os
from glob import glob

In [None]:
# load filenames for normal and pneumonia images (training set)
# note: the Kaggle download names this folder PNEUMONIA; match the case on case-sensitive filesystems
normal_files = np.array(glob("data/chest_xray/train/NORMAL/*"))
pneumonia_files = np.array(glob("data/chest_xray/train/pneumonia/*"))

# print number of images in each dataset
print('There are %d total normal x_ray images in the training set.' % len(normal_files))
print('There are %d total pneumonia x_ray images in the training set.' % len(pneumonia_files))

In [None]:
# load filenames for normal and pneumonia images (validation set)
# separate names keep the training file lists above from being overwritten
val_normal_files = np.array(glob("data/chest_xray/val/NORMAL/*"))
val_pneumonia_files = np.array(glob("data/chest_xray/val/pneumonia/*"))

# print number of images in each dataset
print('There are %d total normal x_ray images in the validation set.' % len(val_normal_files))
print('There are %d total pneumonia x_ray images in the validation set.' % len(val_pneumonia_files))

The original download is heavily imbalanced between the training and validation sets: it has only 16 images in validation and 5216 in training. Such a small validation set makes validation metrics unreliable, so the images were redistributed to a near 80/20 training/validation split.
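
The notebook does not show how the re-split was carried out. As a minimal sketch, assuming it is done by shuffling the file paths of each class and holding out 20% for validation, it could look like this. The function name, the seed, and the file names are hypothetical.

```python
import random

def split_train_val(paths, val_fraction=0.2, seed=42):
    """Shuffle file paths reproducibly and split them into train and validation lists."""
    shuffled = sorted(paths)  # sort first so the result does not depend on glob order
    random.Random(seed).shuffle(shuffled)
    n_val = int(round(len(shuffled) * val_fraction))
    return shuffled[n_val:], shuffled[:n_val]

# Hypothetical file names standing in for the contents of one class folder
paths = [f"NORMAL/img_{i:04d}.jpeg" for i in range(100)]
train_paths, val_paths = split_train_val(paths)
print(len(train_paths), len(val_paths))
```

Applying the split separately to each class folder keeps the class proportions roughly the same in both sets.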

---
<a id='eda'></a>
## Exploratory Data Analysis

The following samples show the diversity among the images within each class.

In [None]:
from PIL import Image

# show three sample pneumonia images (indices 10, 20, 30)
fig=plt.figure(figsize=(20, 20))
columns = 3
rows = 1
for i in range(1, columns*rows +1):
    img = Image.open(pneumonia_files[i*10])
    fig.add_subplot(rows, columns, i)
    plt.imshow(img, cmap = plt.cm.gray)
    plt.title(i*10)
plt.show()

Examining just a few images, there are some important elements to note. Much less .... image 3 not as clear....

In [None]:
# show three sample normal images (indices 10, 20, 30)
fig=plt.figure(figsize=(20, 20))
columns = 3
rows = 1
for i in range(1, columns*rows +1):
    img = Image.open(normal_files[i*10])
    fig.add_subplot(rows, columns, i)
    plt.imshow(img, cmap = plt.cm.gray)
    plt.title(i*10)
plt.show()

In [None]:
from PIL import Image
from PIL import ImageFile
ImageFile.LOAD_TRUNCATED_IMAGES = True
import cv2

width = []
height = []
channels = []

# img.shape is ordered (rows, columns, channels), i.e. (height, width, channels)
for i in range(len(pneumonia_files)):
    img = cv2.imread(pneumonia_files[i])
    dimensions = img.shape
    height.append(dimensions[0])
    width.append(dimensions[1])
    channels.append(dimensions[2])

In [None]:
df = pd.DataFrame(list(zip(width, height, channels)), columns = ['Width (pixels)', 'Height (pixels)','Channels (RGB)'])
df.describe()

The table above summarizes the dimensions of the input images. Every image is loaded with three channels: `cv2.imread` returns a 3-channel BGR array by default, even for grayscale files, which would otherwise have a single channel. On average the images are about 1200 pixels wide and 825 pixels tall, but the sizes vary considerably. What does this mean for us? Some images carry more information than others, preprocessing steps may affect some images differently, and resizing will be necessary. (Note: these statistics cover only the pneumonia images in the training data.)
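
To make the effect of resizing concrete, here is a small sketch that computes the scale factors for a plain resize. It assumes a hypothetical landscape image of 1200 x 825 pixels (roughly the average size in this dataset) and a 224 x 224 target, the size passed to the generators later in this notebook. The helper name is made up for illustration.

```python
def resize_factors(width, height, target=(224, 224)):
    """Return the horizontal and vertical scale factors for a plain resize to `target`."""
    target_w, target_h = target
    return target_w / width, target_h / height

# Hypothetical image: 1200 px wide, 825 px tall
sx, sy = resize_factors(1200, 825)
print(f"scale x: {sx:.3f}, scale y: {sy:.3f}, relative stretch: {sy / sx:.2f}")
```

Because the two factors differ, a plain resize changes the aspect ratio, and images of different original sizes are distorted by different amounts.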

In [None]:
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

# Define figure and subplot
new_figure = plt.figure(figsize=(14,4))
ax = new_figure.add_subplot(121)
ax2 = new_figure.add_subplot(122)

ax.boxplot(df['Width (pixels)'], vert = False)
ax.set_title('Boxplot: Width of training data images (pixels)')
ax.set_xlabel('Width (pixels)')

ax2.boxplot(df['Height (pixels)'], vert = False)
ax2.set_title('Boxplot: Height of training data images (pixels)')
ax2.set_xlabel('Height (pixels)')

In [None]:
from keras.preprocessing.image import ImageDataGenerator

train_dir = 'data/chest_xray/train'
validation_dir = 'data/chest_xray/val'
test_dir = 'data/chest_xray/test'

batch_size = 24

# All images will be rescaled by 1./255, data augmentation for training dataset
# train_datagen = ImageDataGenerator(rescale=1./255) # scales the 0-255 pixel values to the 0-1 range
train_datagen = ImageDataGenerator(rescale=1./255,
                                  rotation_range = 10, # rotate up to 10 degrees
                                  zoom_range=0.2, # zoom up to 20%
                                  shear_range=0.1 # shear intensity of 0.1
                                  )

test_datagen = ImageDataGenerator(rescale=1./255)

train_generator = train_datagen.flow_from_directory(
        train_dir,
        target_size=(224, 224),         # All images will be resized to 224X224
        batch_size= batch_size,
        class_mode='binary') # binary_crossentropy loss, so we need binary labels

# number of training images found by the generator (used for steps_per_epoch)
num_samples = train_generator.samples

validation_generator = test_datagen.flow_from_directory(
        validation_dir,
        target_size=(224, 224),
        batch_size= batch_size,
        class_mode='binary')

# test generator
test_generator = ImageDataGenerator(rescale=1./255).flow_from_directory(
        test_dir, 
        target_size=(224, 224), 
        batch_size= 1,
        shuffle= False,
        class_mode='binary')
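
The generators above yield batches indefinitely, so the training and evaluation calls need to be told how many batches make up one pass over the data. A minimal sketch of that arithmetic, with a hypothetical helper name:

```python
import math

def steps_for(num_samples, batch_size):
    """Number of batches needed to see every sample once (the last batch may be partial)."""
    return math.ceil(num_samples / batch_size)

print(steps_for(5216, 24))  # 5216 is the original training-set size quoted earlier
print(steps_for(624, 1))    # matches the steps = 624 used with the test generator
```

Floor division (`num_samples // batch_size`) gives one step fewer whenever the last batch is partial, which skips a handful of images per epoch.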

---
<a id='modeling'></a>
# Modeling

<a id='NN_baseline'></a>
# Baseline: Neural Network, MLP
- no convolutions
- simple, with 2 hidden layers

In [None]:
from keras import layers
from keras import models
from keras import optimizers
from keras.callbacks import EarlyStopping

epochs = 5
early_stopping_monitor = EarlyStopping(patience=2) # stop after 2 epochs without val_loss improvement

model = models.Sequential()
# note: Dense layers placed before Flatten act on the channel axis of each pixel
model.add(layers.Dense(32, activation='relu',input_shape=(224, 224, 3)))
model.add(layers.Dense(32, activation='relu'))
model.add(layers.Flatten())
model.add(layers.Dense(1, activation='sigmoid'))

optimizer = optimizers.Adam(lr=0.0001) # decreased learning rate because training was oscillating

model.compile(loss='binary_crossentropy',
              optimizer= optimizer,
              metrics=['accuracy'])

#Set the model to train; see warnings above
history = model.fit_generator(
      train_generator,
      steps_per_epoch= num_samples // batch_size,
      epochs=epochs,
      verbose=1,
      validation_data=validation_generator,
      validation_steps=50,
      callbacks=[early_stopping_monitor])

# saving the model
model.save('static/artifacts/chest_xray_ann_data.h5')
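
In Keras, a `Dense` layer applied to a multi-dimensional input operates on the last axis only, so in the baseline above the two hidden layers see the 3 channels of each pixel and the pixels are only combined in the final layer after `Flatten`. The sketch below counts the parameters of the model as written and compares them with a hypothetical variant that flattens first. The variant is shown only for comparison and is not part of the notebook.

```python
def dense_params(n_inputs, n_units):
    """Weights plus biases of a fully connected layer."""
    return n_inputs * n_units + n_units

# Baseline as written: the hidden Dense layers act on the last axis of each pixel
per_pixel_1 = dense_params(3, 32)
per_pixel_2 = dense_params(32, 32)
flattened = 224 * 224 * 32  # vector length after Flatten
output = dense_params(flattened, 1)
total_as_written = per_pixel_1 + per_pixel_2 + output

# Hypothetical alternative: Flatten first, then the same Dense stack
flat_input = 224 * 224 * 3
total_flatten_first = (dense_params(flat_input, 32)
                       + dense_params(32, 32)
                       + dense_params(32, 1))

print(total_as_written, total_flatten_first)
```

The first total should match the parameter count reported by `model.summary()` for the baseline.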

Visualizing the model loss and accuracy.

In [None]:
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

train_losses = history.history['loss']
val_losses = history.history['val_loss']
acc = history.history['acc']
val_acc = history.history['val_acc']

# Define figure and subplot
new_figure = plt.figure(figsize=(14,4))
ax = new_figure.add_subplot(121)
ax2 = new_figure.add_subplot(122)

# Loss Plot
ax.plot(train_losses,  color='blue', linewidth=3, linestyle = '-')
ax.plot(val_losses,  color='orange', linewidth=3, linestyle = '-')
ax.set_title('Loss: Training vs Validation')
ax.set_xlabel('Epochs')
ax.set_ylabel('Loss')
ax.legend(('Train','Valid'), frameon = False)

# Accuracy Plot
ax2.plot(acc,  color='blue', linewidth=3, linestyle = '-.')
ax2.plot(val_acc,  color='orange', linewidth=3, linestyle = '-.')
ax2.set_title('Accuracy: Training vs Validation')
ax2.set_xlabel('Epochs')
ax2.set_ylabel('Accuracy')
ax2.legend(('Train','Valid'), frameon = False)
plt.show()
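
The key under which Keras stores the accuracy series depends on the installed version: older releases report `'acc'` and `'val_acc'`, newer ones `'accuracy'` and `'val_accuracy'` when the model is compiled with `metrics=['accuracy']`. A small lookup helper, sketched here with hypothetical names and made-up history values, keeps the plotting code working in both cases.

```python
def get_metric(history_dict, *candidates):
    """Return the first metric series whose key exists in a Keras-style history dict."""
    for key in candidates:
        if key in history_dict:
            return history_dict[key]
    raise KeyError(f"none of {candidates} found; available keys: {sorted(history_dict)}")

# Hypothetical history dicts in the older and newer naming styles
old_style = {"loss": [0.6, 0.4], "acc": [0.70, 0.82]}
new_style = {"loss": [0.6, 0.4], "accuracy": [0.70, 0.82]}

print(get_metric(old_style, "acc", "accuracy"))
print(get_metric(new_style, "acc", "accuracy"))
```

In the plotting cell this would be used as `acc = get_metric(history.history, 'acc', 'accuracy')`.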

In [None]:
score = model.evaluate_generator(test_generator,
                                 steps = 624, # 624 test images at batch_size = 1
                                 workers = 1,
                                 pickle_safe=False)
    
print('Test loss:', score[0])
print('Test accuracy:', score[1])

In [None]:
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

scores = model.evaluate_generator(test_generator, steps = 624)
print(scores)

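test_generator.reset() # restart at the first image so predictions line up with test_generator.classes
y_pred = model.predict_generator(test_generator, steps = 624)

y_true=test_generator.classes
print(y_true.shape)
y_pred1 = np.rint(y_pred).ravel() # round at the 0.5 cutoff and flatten to shape (624,)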
print(y_pred1.shape)

print(classification_report(y_true, y_pred1, labels=[0,1]))
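
The classification report is derived from the counts of correct and incorrect predictions per class after thresholding the sigmoid outputs. The sketch below reproduces that step in plain Python on made-up labels and probabilities; the helper names are hypothetical. One detail to be aware of: `np.rint` rounds exact halves to the nearest even number, so a probability of exactly 0.5 becomes 0, whereas the explicit `>=` comparison used here maps it to 1.

```python
def threshold(probabilities, cutoff=0.5):
    """Convert sigmoid outputs to 0/1 labels; values at or above the cutoff become 1."""
    return [1 if p >= cutoff else 0 for p in probabilities]

def confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    return tn, fp, fn, tp

# Made-up example: 0 = normal, 1 = pneumonia
y_true_demo = [0, 0, 1, 1, 1, 1]
y_prob_demo = [0.2, 0.7, 0.9, 0.4, 0.5, 0.8]

y_pred_demo = threshold(y_prob_demo)
tn, fp, fn, tp = confusion_counts(y_true_demo, y_pred_demo)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(tn, fp, fn, tp, precision, recall)
```

For a medical screening task, the recall on the pneumonia class (how many true cases are caught) is often more informative than overall accuracy.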

---
<a id='NN'></a>
# Neural Network, MLP
- no convolutions
- Additional layers, added dropout as well

In [None]:
from keras.layers import Dropout 

model = models.Sequential()
model.add(layers.Dense(64,activation='relu',input_shape=(224, 224, 3)))
model.add(Dropout(0.2))
model.add(layers.Dense(64, activation='relu'))
model.add(Dropout(0.2))
model.add(layers.Dense(32, activation='relu'))
model.add(Dropout(0.2))
model.add(layers.Flatten())
model.add(layers.Dense(512, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))


model.compile(loss='binary_crossentropy',
              optimizer= optimizer, 
              metrics=['accuracy'])

#Set the model to train; see warnings above
history = model.fit_generator(
      train_generator,
      steps_per_epoch= num_samples // batch_size,
      epochs=epochs,
      verbose = 1,
      validation_data=validation_generator,
      validation_steps=50,
      callbacks=[early_stopping_monitor])

# saving the model
model.save('static/artifacts/chest_xray_ann_better_data.h5')
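
Both models so far reuse the `EarlyStopping(patience=2)` callback, which by default monitors `val_loss` and stops training once it has failed to improve for `patience` consecutive epochs. The following plain-Python simulation illustrates that rule on made-up loss values. It is a simplified sketch and not the Keras implementation.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the 1-based epoch at which training would stop, or None if it never triggers."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            wait = 0  # improvement resets the counter
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None

# Made-up validation losses for two five-epoch runs
print(early_stopping_epoch([0.60, 0.55, 0.57, 0.56, 0.50]))
print(early_stopping_epoch([0.60, 0.55, 0.50, 0.52, 0.49]))
```

With only 5 epochs and a patience of 2, a run can be cut short after as few as 3 epochs if the validation loss does not improve after the first one.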

In [None]:
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

train_losses = history.history['loss']
val_losses = history.history['val_loss']
acc = history.history['acc']
val_acc = history.history['val_acc']

# Define figure and subplot
new_figure = plt.figure(figsize=(14,4))
ax = new_figure.add_subplot(121)
ax2 = new_figure.add_subplot(122)

# Loss Plot
ax.plot(train_losses,  color='blue', linewidth=3, linestyle = '-')
ax.plot(val_losses,  color='orange', linewidth=3, linestyle = '-')
ax.set_title('Loss: Training vs Validation')
ax.set_xlabel('Epochs')
ax.set_ylabel('Loss')
ax.legend(('Train','Valid'), frameon = False)

# Accuracy Plot
ax2.plot(acc,  color='blue', linewidth=3, linestyle = '-.')
ax2.plot(val_acc,  color='orange', linewidth=3, linestyle = '-.')
ax2.set_title('Accuracy: Training vs Validation')
ax2.set_xlabel('Epochs')
ax2.set_ylabel('Accuracy')
ax2.legend(('Train','Valid'), frameon = False)
plt.show()

In [None]:
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

scores = model.evaluate_generator(test_generator, steps = 624)
print(scores)

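test_generator.reset() # restart at the first image so predictions line up with test_generator.classes
y_pred = model.predict_generator(test_generator, steps = 624)

y_true=test_generator.classes
print(y_true.shape)
y_pred1 = np.rint(y_pred).ravel() # round at the 0.5 cutoff and flatten to shape (624,)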
print(y_pred1.shape)

print(classification_report(y_true, y_pred1, labels=[0,1]))

---
<a id='CNN'></a>
# CNN

In [None]:
from keras.layers import Dropout
from keras.callbacks import EarlyStopping
    
early_stopping_monitor = EarlyStopping(patience=2) # 2 epochs no improvement

model = models.Sequential()
model.add(layers.Conv2D(32, (3, 3), activation='relu', input_shape=(224, 224, 3)))
model.add(layers.MaxPooling2D((2, 2)))
model.add(Dropout(0.25))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))
model.add(layers.MaxPooling2D((2, 2)))
model.add(Dropout(0.25))
model.add(layers.Conv2D(128, (3, 3), activation='relu'))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Flatten())
model.add(layers.Dense(512, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))

model.compile(loss='binary_crossentropy',
              optimizer=optimizer,
              metrics=['accuracy'])

model.summary() # prints the summary itself; wrapping it in print() would also print None


history = model.fit_generator(
      train_generator,
      steps_per_epoch= num_samples // batch_size,
      epochs=epochs,
      verbose = 1,
      validation_data=validation_generator,
      validation_steps=50,
      callbacks=[early_stopping_monitor])

#saving the model
model.save('static/artifacts/chest_xray_cnn_data.h5')
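
To see where most of the CNN's parameters sit, it helps to trace the spatial size through the network. The sketch below assumes the Keras defaults used above (`padding='valid'`, stride 1 for `Conv2D`, and a stride equal to the pool size for `MaxPooling2D`); the helper names are hypothetical.

```python
def conv_out(size, kernel=3):
    """Spatial size after a 'valid' convolution with stride 1."""
    return size - kernel + 1

def pool_out(size, pool=2):
    """Spatial size after max pooling with stride equal to the pool size."""
    return size // pool

size = 224
for filters in (32, 64, 128):
    size = pool_out(conv_out(size))
    print(f"after Conv2D({filters}) + MaxPooling2D: {size} x {size} x {filters}")

flattened = size * size * 128             # vector length after Flatten
dense_512_params = flattened * 512 + 512  # weights plus biases of Dense(512)
print(flattened, dense_512_params)
```

Nearly all trainable parameters end up in the first dense layer after `Flatten`, which can be checked against the output of `model.summary()`.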

In [None]:
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

train_losses = history.history['loss']
val_losses = history.history['val_loss']
acc = history.history['acc']
val_acc = history.history['val_acc']

# Define figure and subplot
new_figure = plt.figure(figsize=(14,4))
ax = new_figure.add_subplot(121)
ax2 = new_figure.add_subplot(122)

# Loss Plot
ax.plot(train_losses,  color='blue', linewidth=3, linestyle = '-')
ax.plot(val_losses,  color='orange', linewidth=3, linestyle = '-')
ax.set_title('Loss: Training vs Validation')
ax.set_xlabel('Epochs')
ax.set_ylabel('Loss')
ax.legend(('Train','Valid'), frameon = False)

# Accuracy Plot
ax2.plot(acc,  color='blue', linewidth=3, linestyle = '-.')
ax2.plot(val_acc,  color='orange', linewidth=3, linestyle = '-.')
ax2.set_title('Accuracy: Training vs Validation')
ax2.set_xlabel('Epochs')
ax2.set_ylabel('Accuracy')
ax2.legend(('Train','Valid'), frameon = False)
plt.show()

In [None]:
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

scores = model.evaluate_generator(test_generator, steps = 624)
print(scores)

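test_generator.reset() # restart at the first image so predictions line up with test_generator.classes
y_pred = model.predict_generator(test_generator, steps = 624)

y_true=test_generator.classes
print(y_true.shape)
y_pred1 = np.rint(y_pred).ravel() # round at the 0.5 cutoff and flatten to shape (624,)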
print(y_pred1.shape)

print(classification_report(y_true, y_pred1, labels=[0,1]))

---
<a id='transfer'></a>
# Transfer Learning Model

In [None]:
#Initialize Base
from keras.applications import VGG19
cnn_base = VGG19(weights='imagenet',
                 include_top=False,
                 input_shape=(224, 224, 3))

#Define Model Architecture
model = models.Sequential()
model.add(cnn_base)
model.add(layers.Flatten())
model.add(layers.Dense(64, activation='relu'))
model.add(layers.Dense(128, activation='relu'))
model.add(layers.Dense(256, activation='relu'))
model.add(layers.Dense(128, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))

cnn_base.trainable = False

#You can check whether a layer is trainable (or alter its setting) through the layer.trainable attribute:
for layer in model.layers:
    print(layer.name, layer.trainable)
    
#Similarly, we can check how many trainable weights are in the model:
print(len(model.trainable_weights))

model.summary()
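
With the VGG19 base frozen, only the dense head is trained. The sketch below estimates how many parameters and weight tensors that leaves, assuming the usual VGG19 layout in which five pooling stages each halve the spatial size and the last block has 512 channels. The variable names are hypothetical.

```python
def dense_params(n_inputs, n_units):
    """Weights plus biases of a fully connected layer."""
    return n_inputs * n_units + n_units

# VGG19 without its top: 224 / 2**5 = 7, so the feature map is 7 x 7 x 512
features = (224 // 2 ** 5) ** 2 * 512

head_units = [64, 128, 256, 128, 1]  # the Dense layers stacked on the base above
trainable_params = 0
n_inputs = features
for units in head_units:
    trainable_params += dense_params(n_inputs, units)
    n_inputs = units

trainable_tensors = 2 * len(head_units)  # one kernel and one bias per Dense layer
print(features, trainable_params, trainable_tensors)
```

The tensor count should agree with `len(model.trainable_weights)` printed above once `cnn_base.trainable = False` has been set.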

In [None]:
# Compilation: first pass with the default Adam settings
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

# Set the model to train; see warnings above
history = model.fit_generator(
      train_generator,
      steps_per_epoch= num_samples // batch_size,
      epochs=epochs,
      validation_data=validation_generator,
      validation_steps=50,
      callbacks=[early_stopping_monitor])

# Second pass: recompile with the lower learning rate and continue training;
# the weights carry over and `history` is overwritten
model.compile(loss='binary_crossentropy',
              optimizer= optimizer, 
              metrics=['accuracy'])

# Set the model to train; see warnings above
history = model.fit_generator(
      train_generator,
      steps_per_epoch= num_samples // batch_size,
      epochs=epochs,
      verbose = 1,
      validation_data=validation_generator,
      validation_steps=50,
      callbacks=[early_stopping_monitor])

In [None]:
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

train_losses = history.history['loss']
val_losses = history.history['val_loss']
acc = history.history['acc']
val_acc = history.history['val_acc']

# Define figure and subplot
new_figure = plt.figure(figsize=(14,4))
ax = new_figure.add_subplot(121)
ax2 = new_figure.add_subplot(122)

# Loss Plot
ax.plot(train_losses,  color='blue', linewidth=3, linestyle = '-')
ax.plot(val_losses,  color='orange', linewidth=3, linestyle = '-')
ax.set_title('Loss: Training vs Validation')
ax.set_xlabel('Epochs')
ax.set_ylabel('Loss')
ax.legend(('Train','Valid'), frameon = False)

# Accuracy Plot
ax2.plot(acc,  color='blue', linewidth=3, linestyle = '-.')
ax2.plot(val_acc,  color='orange', linewidth=3, linestyle = '-.')
ax2.set_title('Accuracy: Training vs Validation')
ax2.set_xlabel('Epochs')
ax2.set_ylabel('Accuracy')
ax2.legend(('Train','Valid'), frameon = False)
plt.show()


In [None]:
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

scores = model.evaluate_generator(test_generator, steps = 624)
print(scores)

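test_generator.reset() # restart at the first image so predictions line up with test_generator.classes
y_pred = model.predict_generator(test_generator, steps = 624)

y_true=test_generator.classes
print(y_true.shape)
y_pred1 = np.rint(y_pred).ravel() # round at the 0.5 cutoff and flatten to shape (624,)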
print(y_pred1.shape)

print(classification_report(y_true, y_pred1, labels=[0,1]))

---
<a id='resources'></a>
# Resources

- https://www.kaggle.com/paultimothymooney/chest-xray-pneumonia
- https://towardsdatascience.com/understanding-neural-networks-from-neuron-to-rnn-cnn-and-deep-learning-cd88e90e0a90
- https://s3.amazonaws.com/assets.datacamp.com/blog_assets/Keras_Cheat_Sheet_Python.pdf
- https://medium.com/swlh/how-to-use-smote-for-dealing-with-imbalanced-image-dataset-for-solving-classification-problems-3aba7d2b9cad
- https://fairyonice.github.io/Learn-about-ImageDataGenerator.html
- https://stackoverflow.com/questions/52270177/how-to-use-predict-generator-on-new-images-keras?noredirect=1&lq=1