# Data Science - Module 4 - Final Project Submission

* Student Name: **James Toop**
* Student Pace: **Self Paced**
* Scheduled project review date/time: TBC
* Instructor name: **Jeff Herman**
* Blog post URL: **https://toopster.github.io**

## Table of Contents
1. [Business Case](#business-case)
2. [Data Discovery](#data-discovery)    
3. [Deep Learning Neural Networks](#deep-learning-neural-networks)
    1. [Baseline Densely Connected Network](#baseline-densely-connected-network)

---
<a name="business-case"></a>
## 1. Business Case and Project Purpose

Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.

---
<a name="data-discovery"></a>
## 2. Data Discovery

This section takes a first look at the supplied datasets: how the image files are organised, how many images each split and class contains, and any potential issues or shortcomings in the data.

In [1]:
# Import the relevant libraries for data discovery
import os

#### Training Data

In [2]:
# Specify directory structure for images
train_folder = 'chest_xray/train/'
train_normal = 'chest_xray/train/NORMAL/'
train_pneumonia = 'chest_xray/train/PNEUMONIA/'

# Store all the relevant image names in specific objects
train_images_normal = [file for file in os.listdir(train_normal) if file.endswith('.jpeg')]
train_images_pneumonia = [file for file in os.listdir(train_pneumonia) if file.endswith('.jpeg')]

In [3]:
# Preview filenames for "normal" training images
train_images_normal[0:10]

['NORMAL2-IM-0927-0001.jpeg',
 'NORMAL2-IM-1056-0001.jpeg',
 'IM-0427-0001.jpeg',
 'NORMAL2-IM-1260-0001.jpeg',
 'IM-0656-0001-0001.jpeg',
 'IM-0561-0001.jpeg',
 'NORMAL2-IM-1110-0001.jpeg',
 'IM-0757-0001.jpeg',
 'NORMAL2-IM-1326-0001.jpeg',
 'NORMAL2-IM-0736-0001.jpeg']

In [4]:
# Preview filenames for "pneumonia" training images
train_images_pneumonia[0:10]

['person63_bacteria_306.jpeg',
 'person1438_bacteria_3721.jpeg',
 'person755_bacteria_2659.jpeg',
 'person478_virus_975.jpeg',
 'person661_bacteria_2553.jpeg',
 'person276_bacteria_1296.jpeg',
 'person1214_bacteria_3166.jpeg',
 'person1353_virus_2333.jpeg',
 'person26_bacteria_122.jpeg',
 'person124_virus_238.jpeg']

In [5]:
print('Number of training chest x-ray images that are normal:', len(train_images_normal))
print('Number of training chest x-ray images that have pneumonia:', len(train_images_pneumonia))
print('\nTotal training chest x-ray images:', len(train_images_normal)+len(train_images_pneumonia))

Number of training chest x-ray images that are normal: 1341
Number of training chest x-ray images that have pneumonia: 3875

Total training chest x-ray images: 5216
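
The counts above show that the training set is imbalanced, with pneumonia images outnumbering normal ones by roughly 3 to 1. The following is a minimal sketch of how the class proportions and inverse-frequency class weights could be derived from those counts. The `class_weights` helper is hypothetical and is not part of the original notebook.

```python
# Counts taken from the training output above
train_counts = {'NORMAL': 1341, 'PNEUMONIA': 3875}

def class_weights(counts):
    """Inverse-frequency weights: total / (n_classes * class_count)."""
    total = sum(counts.values())
    n_classes = len(counts)
    return {label: total / (n_classes * n) for label, n in counts.items()}

total = sum(train_counts.values())
proportions = {label: n / total for label, n in train_counts.items()}
weights = class_weights(train_counts)

for label in train_counts:
    print(f'{label}: {proportions[label]:.1%} of training images, '
          f'weight {weights[label]:.2f}')
```

Weights of this kind are one common way to compensate for imbalance during training; whether they help here would need to be tested.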


#### Test Data

In [6]:
# Specify directory structure for images
test_folder = 'chest_xray/test/'
test_normal = 'chest_xray/test/NORMAL/'
test_pneumonia = 'chest_xray/test/PNEUMONIA/'

# Store all the relevant image names in specific objects
test_images_normal = [file for file in os.listdir(test_normal) if file.endswith('.jpeg')]
test_images_pneumonia = [file for file in os.listdir(test_pneumonia) if file.endswith('.jpeg')]

print('Number of test chest x-ray images that are normal:', len(test_images_normal))
print('Number of test chest x-ray images that have pneumonia:', len(test_images_pneumonia))
print('\nTotal test chest x-ray images:', len(test_images_normal)+len(test_images_pneumonia))

Number of test chest x-ray images that are normal: 234
Number of test chest x-ray images that have pneumonia: 390

Total test chest x-ray images: 624


#### Validation Data

In [7]:
# Specify directory structure for images
val_folder = 'chest_xray/val/'
val_normal = 'chest_xray/val/NORMAL/'
val_pneumonia = 'chest_xray/val/PNEUMONIA/'

# Store all the relevant image names in specific objects
val_images_normal = [file for file in os.listdir(val_normal) if file.endswith('.jpeg')]
val_images_pneumonia = [file for file in os.listdir(val_pneumonia) if file.endswith('.jpeg')]

print('Number of validation chest x-ray images that are normal:', len(val_images_normal))
print('Number of validation chest x-ray images that have pneumonia:', len(val_images_pneumonia))
print('\nTotal validation chest x-ray images:', len(val_images_normal)+len(val_images_pneumonia))

Number of validation chest x-ray images that are normal: 8
Number of validation chest x-ray images that have pneumonia: 8

Total validation chest x-ray images: 16
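
The three cells above repeat the same counting logic for each split. As a sketch only, the counting could be wrapped in a single helper. `count_images` is a hypothetical name, and the folder layout (`chest_xray/<split>/<CLASS>/`) is assumed to match the paths used above.

```python
import os

def count_images(split_dir, classes=('NORMAL', 'PNEUMONIA'), ext='.jpeg'):
    """Return {class_name: number of files ending in ext} for one split directory."""
    counts = {}
    for class_name in classes:
        class_dir = os.path.join(split_dir, class_name)
        # A missing class folder is reported as zero images rather than raising
        if not os.path.isdir(class_dir):
            counts[class_name] = 0
            continue
        counts[class_name] = sum(
            1 for name in os.listdir(class_dir) if name.endswith(ext)
        )
    return counts

# Summarise every split with the same helper
for split in ('train', 'test', 'val'):
    counts = count_images(os.path.join('chest_xray', split))
    print(split, counts, 'total:', sum(counts.values()))
```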


---
<a name="deep-learning-neural-networks"></a>
## 3. Deep Learning Neural Networks

<a name="baseline-densely-connected-network"></a>
### 3A. Baseline Densely Connected Network

In [8]:
# Import the relevant libraries for creating neural networks
import time
import matplotlib.pyplot as plt
import scipy
import numpy as np
from PIL import Image
from scipy import ndimage
from keras.preprocessing.image import ImageDataGenerator, array_to_img, img_to_array, load_img

np.random.seed(123)

Using TensorFlow backend.


**NOTE:**
The preprocessing steps in the following cells (rescaling, resizing and batching) still need to be better understood.
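
For reference, the two preprocessing arguments used in the next cell work as follows: `rescale=1./255` multiplies every pixel value by 1/255, mapping 8-bit intensities from [0, 255] into [0, 1], and `target_size=(64, 64)` resizes each image to 64 x 64 pixels as it is loaded. Below is a minimal NumPy sketch of the rescaling step only. The image is synthetic and made up for illustration, not taken from the dataset.

```python
import numpy as np

# Synthetic 64x64 RGB image with 8-bit pixel values (illustrative only)
rng = np.random.default_rng(123)
raw = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)

# Equivalent of rescale=1./255: multiply each pixel by 1/255
scaled = raw.astype(np.float32) * (1.0 / 255)

print(raw.dtype, raw.min(), raw.max())
print(scaled.dtype, float(scaled.min()), float(scaled.max()))
```

Keeping inputs in a small, consistent range generally makes gradient-based training better behaved than feeding raw 0 to 255 intensities.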

In [9]:
# Load all images in chest_xray/train (5216 images) as one batch, rescaled to [0, 1] and resized to 64x64
train_generator = ImageDataGenerator(rescale=1./255).flow_from_directory(
        train_folder, 
        target_size = (64, 64), 
        batch_size = 5216)

# Load all images in chest_xray/test (624 images) as one batch, rescaled to [0, 1] and resized to 64x64
test_generator = ImageDataGenerator(rescale=1./255).flow_from_directory(
        test_folder, 
        target_size = (64, 64), 
        batch_size = 624) 

# Load all images in chest_xray/val (16 images) as one batch, rescaled to [0, 1] and resized to 64x64
val_generator = ImageDataGenerator(rescale=1./255).flow_from_directory(
        val_folder, 
        target_size = (64, 64), 
        batch_size = 16)

Found 5216 images belonging to 2 classes.
Found 624 images belonging to 2 classes.
Found 16 images belonging to 2 classes.


In [10]:
# Create the datasets
train_images, train_labels = next(train_generator)
test_images, test_labels = next(test_generator)
val_images, val_labels = next(val_generator)

In [11]:
# Explore the dataset again
m_train = train_images.shape[0]
num_px = train_images.shape[1]
m_test = test_images.shape[0]
m_val = val_images.shape[0]

print ("Number of training samples: " + str(m_train))
print ("Number of testing samples: " + str(m_test))
print ("Number of validation samples: " + str(m_val))
print ("train_images shape: " + str(train_images.shape))
print ("train_labels shape: " + str(train_labels.shape))
print ("test_images shape: " + str(test_images.shape))
print ("test_labels shape: " + str(test_labels.shape))
print ("val_images shape: " + str(val_images.shape))
print ("val_labels shape: " + str(val_labels.shape))

Number of training samples: 5216
Number of testing samples: 624
Number of validation samples: 16
train_images shape: (5216, 64, 64, 3)
train_labels shape: (5216, 2)
test_images shape: (624, 64, 64, 3)
test_labels shape: (624, 2)
val_images shape: (16, 64, 64, 3)
val_labels shape: (16, 2)


In [12]:
train_img = train_images.reshape(train_images.shape[0], -1)
test_img = test_images.reshape(test_images.shape[0], -1)
val_img = val_images.reshape(val_images.shape[0], -1)

print(train_img.shape)
print(test_img.shape)
print(val_img.shape)

(5216, 12288)
(624, 12288)
(16, 12288)
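
The second dimension of 12288 comes from flattening each 64 x 64 x 3 image into a single vector (64 * 64 * 3 = 12288), which is the input shape a densely connected layer expects. This is a small sketch of the same reshape on a stand-in array; the batch of 4 is arbitrary.

```python
import numpy as np

# Stand-in batch with the same per-image shape as train_images: (n, 64, 64, 3)
batch = np.arange(4 * 64 * 64 * 3, dtype=np.float32).reshape(4, 64, 64, 3)

# Keep the sample axis, flatten height x width x channels into one axis
flat = batch.reshape(batch.shape[0], -1)

print(flat.shape)    # (4, 12288)
print(64 * 64 * 3)   # 12288
```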


In [13]:
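# flow_from_directory orders classes alphabetically, so column 0 of the
# one-hot labels is the NORMAL indicator: y = 1 means NORMAL, y = 0 means PNEUMONIA
train_y = np.reshape(train_labels[:,0], (-1,1))
test_y = np.reshape(test_labels[:,0], (-1,1))
val_y = np.reshape(val_labels[:,0], (-1,1))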
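
The generators return one-hot labels with one column per class. The sketch below uses made-up labels and assumes the alphabetical mapping `{'NORMAL': 0, 'PNEUMONIA': 1}` (the mapping actually used can be checked with `train_generator.class_indices`). It shows what selecting column 0, as in the cell above, means compared with selecting column 1.

```python
import numpy as np

# Assumed mapping; verify against train_generator.class_indices
class_indices = {'NORMAL': 0, 'PNEUMONIA': 1}

# Four made-up one-hot labels: NORMAL, PNEUMONIA, PNEUMONIA, NORMAL
labels = np.array([[1., 0.], [0., 1.], [0., 1.], [1., 0.]])

# Column 0, as used in the cell above: 1 means NORMAL
y_normal = labels[:, class_indices['NORMAL']].reshape(-1, 1)

# Column 1 would instead make 1 mean PNEUMONIA
y_pneumonia = labels[:, class_indices['PNEUMONIA']].reshape(-1, 1)

print(y_normal.ravel())     # [1. 0. 0. 1.]
print(y_pneumonia.ravel())  # [0. 1. 1. 0.]
```

Either choice trains an equivalent binary classifier, but the choice determines which class counts as "positive" when reading metrics such as precision and recall.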

In [14]:
# Build a baseline fully connected model
from keras import models
from keras import layers
np.random.seed(123)
model = models.Sequential()
model.add(layers.Dense(20, activation='relu', input_shape=(12288,))) # 3 hidden layers (20, 7 and 5 units)
model.add(layers.Dense(7, activation='relu'))
model.add(layers.Dense(5, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))
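
As a rough size check on the network defined above, the number of trainable parameters in each dense layer is inputs * units + units (weights plus biases). The sketch below works through that arithmetic; `dense_params` is a hypothetical helper, and `model.summary()` should report the same totals.

```python
def dense_params(n_inputs, n_units):
    """Weights plus biases for one fully connected layer."""
    return n_inputs * n_units + n_units

# Input size followed by the units in each Dense layer
layer_sizes = [12288, 20, 7, 5, 1]
per_layer = [dense_params(i, o)
             for i, o in zip(layer_sizes[:-1], layer_sizes[1:])]

print(per_layer)       # [245780, 147, 40, 6]
print(sum(per_layer))  # 245973
```

Almost all of the parameters sit in the first layer, which is a direct consequence of feeding 12288 flattened pixel values into a dense layer.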

In [15]:
model.compile(optimizer='sgd',
              loss='binary_crossentropy',
              metrics=['accuracy'])

history = model.fit(train_img,
                    train_y,
                    epochs=50,
                    batch_size=32,
                    validation_data=(val_img, val_y))

Train on 5216 samples, validate on 16 samples
Epoch 1/50
...
Epoch 50/50
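
The loss minimised during training is binary cross-entropy. As a minimal NumPy sketch of what that quantity measures (this is not Keras's own implementation, and the `eps` clipping value is an assumption chosen for numerical safety):

```python
import numpy as np

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all samples."""
    y_true = np.asarray(y_true, dtype=np.float64)
    # Clip predictions away from 0 and 1 so the logs stay finite
    p = np.clip(np.asarray(y_pred, dtype=np.float64), eps, 1 - eps)
    return float(np.mean(-(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))))

print(binary_crossentropy([1, 0], [0.9, 0.1]))  # confident and correct: small loss
print(binary_crossentropy([1, 0], [0.1, 0.9]))  # confident and wrong: large loss
```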


In [16]:
results_train = model.evaluate(train_img, train_y)
results_train



[0.07591057675919193, 0.9727760736196319]

In [17]:
results_test = model.evaluate(test_img, test_y)
results_test



[1.0983491830336742, 0.7323717948717948]
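
`model.evaluate` returns `[loss, accuracy]`, so the baseline reaches about 97.3% accuracy on the training set but only about 73.2% on the test set, a gap that suggests overfitting. To put the test figure in context, the sketch below compares it with the accuracy of always predicting the most common test class, using the class counts from the Data Discovery section.

```python
# Values reported by model.evaluate above: [loss, accuracy]
results_train = [0.07591057675919193, 0.9727760736196319]
results_test = [1.0983491830336742, 0.7323717948717948]

# Test-set class counts from the Data Discovery section
test_counts = {'NORMAL': 234, 'PNEUMONIA': 390}

# Accuracy of always predicting the most common test class
majority_baseline = max(test_counts.values()) / sum(test_counts.values())

# Difference between training and test accuracy
accuracy_gap = results_train[1] - results_test[1]

print(f'Majority-class baseline: {majority_baseline:.3f}')
print(f'Train/test accuracy gap: {accuracy_gap:.3f}')
```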