# Data Science - Module 4 - Final Project Submission

* Student Name: **James Toop**
* Student Pace: **Self Paced**
* Scheduled project review date/time: **Wednesday, 8th September 2021 - 9.30pm BST**
* Instructor name: **Jeff Herman**
* Blog post URL: **https://toopster.github.io**

## Table of Contents
1. [Business Case](#business-case)
2. [Exploratory Data Analysis](#eda)
    1. [Discovery](#data-discovery)
    2. [Preprocessing](#data-preprocessing)
3. [Deep Learning Neural Networks](#deep-learning-neural-networks)
    1. [Model 1: Create a baseline network](#model-1)
    2. [Model 2: Deepen the network and increase the number of neurons in each layer](#model-2)
    3. [Model 3: A deeper network but with a different activation type and reduce the number of neurons](#model-3)

---
<a name="business-case"></a>
## 1. Business Case and Project Purpose

Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.

---
<a name="eda"></a>
## 2. Exploratory Data Analysis (EDA)


In [65]:
# Import the relevant libraries
import os
import time
import matplotlib.pyplot as plt
%matplotlib inline

import scipy
from scipy import ndimage

import numpy as np
from PIL import Image

import keras
from keras import models
from keras import layers
from keras import regularizers
from keras.preprocessing.image import ImageDataGenerator, array_to_img, img_to_array, load_img

np.random.seed(123)

<a name="data-discovery"></a>
### 2A. Data Discovery

This section takes a first look at the supplied image datasets: how they are organised on disk, how many images each class contains, and any potential issues or shortcomings.

#### Training Data

In [66]:
# Specify directory structure for images
train_folder = 'chest_xray/train/'
train_normal = 'chest_xray/train/NORMAL/'
train_pneumonia = 'chest_xray/train/PNEUMONIA/'

# Store all the relevant image names in specific objects
train_images_normal = [file for file in os.listdir(train_normal) if file.endswith('.jpeg')]
train_images_pneumonia = [file for file in os.listdir(train_pneumonia) if file.endswith('.jpeg')]

In [67]:
# Preview filenames for "normal" training images
train_images_normal[0:10]

['NORMAL2-IM-0927-0001.jpeg',
 'NORMAL2-IM-1056-0001.jpeg',
 'IM-0427-0001.jpeg',
 'NORMAL2-IM-1260-0001.jpeg',
 'IM-0656-0001-0001.jpeg',
 'IM-0561-0001.jpeg',
 'NORMAL2-IM-1110-0001.jpeg',
 'IM-0757-0001.jpeg',
 'NORMAL2-IM-1326-0001.jpeg',
 'NORMAL2-IM-0736-0001.jpeg']

In [68]:
# Preview filenames for "pneumonia" training images
train_images_pneumonia[0:10]

['person63_bacteria_306.jpeg',
 'person1438_bacteria_3721.jpeg',
 'person755_bacteria_2659.jpeg',
 'person478_virus_975.jpeg',
 'person661_bacteria_2553.jpeg',
 'person276_bacteria_1296.jpeg',
 'person1214_bacteria_3166.jpeg',
 'person1353_virus_2333.jpeg',
 'person26_bacteria_122.jpeg',
 'person124_virus_238.jpeg']

In [None]:
# Ascertain the size of the training dataset
print('Number of training chest x-ray images that are normal:', len(train_images_normal))
print('Number of training chest x-ray images that have pneumonia:', len(train_images_pneumonia))
print('\nTotal training chest x-ray images:', len(train_images_normal)+len(train_images_pneumonia))

#### Test Data

In [None]:
# Specify directory structure for images
test_folder = 'chest_xray/test/'
test_normal = 'chest_xray/test/NORMAL/'
test_pneumonia = 'chest_xray/test/PNEUMONIA/'

# Store all the relevant image names in specific objects
test_images_normal = [file for file in os.listdir(test_normal) if file.endswith('.jpeg')]
test_images_pneumonia = [file for file in os.listdir(test_pneumonia) if file.endswith('.jpeg')]

# Ascertain the size of the test dataset
print('Number of test chest x-ray images that are normal:', len(test_images_normal))
print('Number of test chest x-ray images that have pneumonia:', len(test_images_pneumonia))
print('\nTotal test chest x-ray images:', len(test_images_normal)+len(test_images_pneumonia))

#### Validation Data

In [None]:
# Specify directory structure for images
val_folder = 'chest_xray/val/'
val_normal = 'chest_xray/val/NORMAL/'
val_pneumonia = 'chest_xray/val/PNEUMONIA/'

# Store all the relevant image names in specific objects
val_images_normal = [file for file in os.listdir(val_normal) if file.endswith('.jpeg')]
val_images_pneumonia = [file for file in os.listdir(val_pneumonia) if file.endswith('.jpeg')]

# Ascertain the size of the validation dataset
print('Number of validation chest x-ray images that are normal:', len(val_images_normal))
print('Number of validation chest x-ray images that have pneumonia:', len(val_images_pneumonia))
print('\nTotal validation chest x-ray images:', len(val_images_normal)+len(val_images_pneumonia))
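
The three counting cells above repeat the same pattern for each split. A minimal sketch of how that pattern could be wrapped in one helper is shown here; `count_images` and `summarise_splits` are hypothetical names introduced for illustration, and the folder layout (`<root>/<split>/<CLASS>/*.jpeg`) is assumed to match the paths used above.

```python
import os


def count_images(split_dir, extension=".jpeg"):
    """Return {class_name: image_count} for each class sub-folder of split_dir."""
    counts = {}
    for class_name in sorted(os.listdir(split_dir)):
        class_dir = os.path.join(split_dir, class_name)
        if os.path.isdir(class_dir):
            counts[class_name] = sum(
                1 for name in os.listdir(class_dir) if name.endswith(extension)
            )
    return counts


def summarise_splits(root, splits=("train", "test", "val")):
    """Print per-class counts, totals and class shares for every split under root."""
    for split in splits:
        split_dir = os.path.join(root, split)
        if not os.path.isdir(split_dir):
            print(f"{split}: folder not found, skipping")
            continue
        counts = count_images(split_dir)
        total = sum(counts.values())
        print(f"{split}: {counts} (total {total})")
        for class_name, n in counts.items():
            share = n / total if total else 0.0
            print(f"  {class_name}: {share:.1%} of the split")
```

Calling `summarise_splits('chest_xray')` would print the same counts as the cells above, plus each class's share of the split, which makes any class imbalance easier to spot.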

<a name="data-preprocessing"></a>
### 2B. Preprocessing

In [None]:
# Load all images in chest_xray/train (5216 images), rescale pixel values to [0, 1] and resize to 128x128
train_generator = ImageDataGenerator(rescale=1./255).flow_from_directory(
        train_folder,
        target_size = (128, 128),
        batch_size = 5216)

# Load all images in chest_xray/test (624 images), rescale pixel values to [0, 1] and resize to 128x128
test_generator = ImageDataGenerator(rescale=1./255).flow_from_directory(
        test_folder,
        target_size = (128, 128),
        batch_size = 624)

# Load all images in chest_xray/val (16 images), rescale pixel values to [0, 1] and resize to 128x128
val_generator = ImageDataGenerator(rescale=1./255).flow_from_directory(
        val_folder,
        target_size = (128, 128),
        batch_size = 16)

In [None]:
# Create the datasets
train_images, train_labels = next(train_generator)
test_images, test_labels = next(test_generator)
val_images, val_labels = next(val_generator)
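
Because each `batch_size` equals the number of images in its split, a single `next()` call returns the whole split as one array. The `rescale=1./255` argument multiplies every pixel by 1/255, mapping 8-bit intensities into the range [0, 1]. The sketch below reproduces that step with NumPy on a made-up 2x2 patch (the pixel values are illustrative, not taken from the dataset).

```python
import numpy as np

# Hypothetical 2x2 greyscale patch with 8-bit pixel values (0 to 255)
raw_patch = np.array([[0, 51], [102, 255]], dtype=np.uint8)

# Same arithmetic as rescale=1./255: convert to float, then scale
scaled_patch = raw_patch.astype("float32") * (1.0 / 255)

print(scaled_patch)
print("min:", scaled_patch.min(), "max:", scaled_patch.max())
```

Keeping inputs in a small, consistent range generally helps gradient-based optimisers such as SGD converge more smoothly.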

In [None]:
# Explore the dataset again
m_train = train_images.shape[0]
num_px = train_images.shape[1]
m_test = test_images.shape[0]
m_val = val_images.shape[0]

print("Number of training samples: " + str(m_train))
print("Number of testing samples: " + str(m_test))
print("Number of validation samples: " + str(m_val))
print("train_images shape: " + str(train_images.shape))
print("train_labels shape: " + str(train_labels.shape))
print("test_images shape: " + str(test_images.shape))
print("test_labels shape: " + str(test_labels.shape))
print("val_images shape: " + str(val_images.shape))
print("val_labels shape: " + str(val_labels.shape))

In [None]:
# Preview the training labels
train_labels[:10]

In [None]:
# Flatten each image into a single feature vector (one row per image)
train_img = train_images.reshape(train_images.shape[0], -1)
test_img = test_images.reshape(test_images.shape[0], -1)
val_img = val_images.reshape(val_images.shape[0], -1)

print(train_img.shape)
print(test_img.shape)
print(val_img.shape)

In [None]:
n_features = train_img.shape[1]
n_features
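
The reshape above turns each 128x128 RGB image into one long row so that it can feed a dense layer. A small sketch of the same operation on a dummy batch is given here (the batch of four zero images is a stand-in, not real data), together with a rough estimate of the memory the full training array needs as `float32`.

```python
import numpy as np

# Dummy batch: 4 images, 128x128 pixels, 3 colour channels
batch = np.zeros((4, 128, 128, 3), dtype="float32")

# Keep the sample axis, collapse height, width and channels into one axis
flat = batch.reshape(batch.shape[0], -1)
print(flat.shape)  # (4, 49152) because 128 * 128 * 3 = 49152

# Rough memory footprint of 5216 flattened float32 images (4 bytes per value)
approx_gb = 5216 * flat.shape[1] * 4 / 1e9
print(f"approx. {approx_gb:.2f} GB for the training array")
```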

In [None]:
# Keep the first one-hot column as a single binary label column
train_y = np.reshape(train_labels[:,0], (-1,1))
test_y = np.reshape(test_labels[:,0], (-1,1))
val_y = np.reshape(val_labels[:,0], (-1,1))

In [None]:
train_y[:10]
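
It is worth checking which class the first one-hot column refers to. `flow_from_directory` infers classes from the sub-folder names in alphanumeric order, so the mapping here is assumed to be `{'NORMAL': 0, 'PNEUMONIA': 1}`; this can be confirmed with `train_generator.class_indices`. Under that assumption, column 0 equals 1 for NORMAL images, so `train_y` flags normal x-rays rather than pneumonia. The sketch below uses three made-up label rows to show the difference between the two columns.

```python
import numpy as np

# Assumed mapping (check train_generator.class_indices in the notebook)
class_indices = {"NORMAL": 0, "PNEUMONIA": 1}

# Three illustrative one-hot rows: NORMAL, PNEUMONIA, PNEUMONIA
one_hot = np.array([[1.0, 0.0],
                    [0.0, 1.0],
                    [0.0, 1.0]])

# Column 0: 1 means NORMAL (what labels[:, 0] selects)
is_normal = one_hot[:, class_indices["NORMAL"]].reshape(-1, 1)

# Column 1: 1 means PNEUMONIA
is_pneumonia = one_hot[:, class_indices["PNEUMONIA"]].reshape(-1, 1)

print(is_normal.ravel(), is_pneumonia.ravel())
```

Either column works as a binary target; what matters is reading the resulting predictions with the matching interpretation.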

In [None]:
# Function for visualising results
def visualize_results(results):
    """Plot training and validation loss and accuracy for each epoch."""
    history = results.history
    # Older Keras releases log 'acc'/'val_acc'; newer ones log 'accuracy'/'val_accuracy'
    acc_key = 'accuracy' if 'accuracy' in history else 'acc'

    plt.figure(figsize=(20,8))

    plt.subplot(1, 2, 1)
    plt.plot(history['val_loss'])
    plt.plot(history['loss'])
    plt.legend(['Validation Loss', 'Training Loss'], fontsize=12)
    plt.title('Loss', fontsize=18)
    plt.xlabel('Epochs', fontsize=14)
    plt.ylabel('Loss', fontsize=14)
    plt.xticks(fontsize=12)
    plt.yticks(fontsize=12)

    plt.subplot(1, 2, 2)
    plt.plot(history['val_' + acc_key])
    plt.plot(history[acc_key])
    plt.legend(['Validation Accuracy', 'Training Accuracy'], fontsize=12)
    plt.title('Accuracy', fontsize=18)
    plt.xlabel('Epochs', fontsize=14)
    plt.ylabel('Accuracy', fontsize=14)
    plt.xticks(fontsize=12)
    plt.yticks(fontsize=12)
    plt.show()

---
<a name="deep-learning-neural-networks"></a>
## 3. Deep Learning Neural Networks

<a name="model-1"></a>
### 3A. Model 1: Create a baseline network

In [None]:
np.random.seed(123)

# Build a baseline model
model_1 = models.Sequential()
model_1.add(layers.Dense(64, activation='tanh', input_shape=(n_features,)))
model_1.add(layers.Dense(2, activation='softmax'))

# View summary for model
model_1.summary()
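
The parameter counts reported by `summary()` can be checked by hand: a dense layer with `n_in` inputs and `n_out` units has `(n_in + 1) * n_out` trainable parameters (weights plus one bias per unit). A small sketch, using a hypothetical helper `dense_param_count` and assuming 128x128x3 inputs as set up in the preprocessing section:

```python
def dense_param_count(n_inputs, layer_sizes):
    """Total trainable parameters for a stack of fully connected layers."""
    total = 0
    for n_units in layer_sizes:
        total += (n_inputs + 1) * n_units  # weights + biases for this layer
        n_inputs = n_units                 # this layer feeds the next one
    return total


n_features_example = 128 * 128 * 3  # 49152 flattened pixel values

# Baseline network: one hidden layer of 64 units, 2 output units
print(dense_param_count(n_features_example, [64, 2]))  # 3145922
```

Almost all of the parameters sit in the first layer, because every one of the 49152 inputs connects to every hidden unit.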

In [None]:
# Compile the baseline model
model_1.compile(loss='categorical_crossentropy', 
                optimizer='sgd', 
                metrics=['accuracy'])
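
For intuition, the softmax output and the categorical cross-entropy loss used above can be reproduced in a few lines of NumPy. This is a simplified sketch of the arithmetic, not Keras's own implementation; the logits and labels are made up for illustration.

```python
import numpy as np


def softmax(logits):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def categorical_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean over samples of -sum(y_true * log(y_prob))."""
    y_prob = np.clip(y_prob, eps, 1.0)
    return float(-(y_true * np.log(y_prob)).sum(axis=1).mean())


# Two illustrative samples with one-hot labels for two classes
logits = np.array([[2.0, 0.0],
                   [0.0, 0.0]])
labels = np.array([[1.0, 0.0],
                   [0.0, 1.0]])

probs = softmax(logits)
loss = categorical_crossentropy(labels, probs)
print(probs)
print(loss)
```

The first sample is predicted fairly confidently and contributes little to the loss; the second is a 50/50 guess and contributes `log(2)`.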

In [None]:
# Fit the baseline model
results_1 = model_1.fit(train_img, 
                        train_labels, 
                        epochs=30, 
                        batch_size=64, 
                        validation_data=(test_img, test_labels))

In [None]:
# Visualise the loss and accuracy of the training and validation sets across epochs
visualize_results(results_1)

In [None]:
# Evaluate the training results
results_1_train = model_1.evaluate(train_img, train_labels)
results_1_train

In [None]:
# Evaluate the test results
results_1_test = model_1.evaluate(test_img, test_labels)
results_1_test

<a name="model-2"></a>
### 3B. Model 2: Deepen the network and increase the number of neurons in each layer

In [None]:
np.random.seed(123)

# Build a deeper model
model_2 = models.Sequential()
model_2.add(layers.Dense(300, activation='tanh', input_shape=(n_features,)))
model_2.add(layers.Dense(100, activation='tanh'))
model_2.add(layers.Dense(2, activation='softmax'))

# View summary for model
model_2.summary()

In [None]:
# Compile the deeper model
model_2.compile(loss='categorical_crossentropy',
                optimizer='sgd',
                metrics=['accuracy'])

# Fit the deeper model
results_2 = model_2.fit(train_img, 
                        train_labels,
                        batch_size=64, 
                        epochs=30, 
                        validation_data=(test_img, test_labels))

In [None]:
# Visualise the loss and accuracy of the training and validation sets across epochs
visualize_results(results_2)

In [None]:
# Evaluate the training results
results_2_train = model_2.evaluate(train_img, train_labels)
results_2_train

In [None]:
# Evaluate the test results
results_2_test = model_2.evaluate(test_img, test_labels)
results_2_test

<a name="model-3"></a>
### 3C. Model 3: A deeper network but with a different activation type and reduce the number of neurons

In [None]:
np.random.seed(123)

# Build a deeper model with fewer neurons and change the activation type
model_3 = models.Sequential()
model_3.add(layers.Dense(64, activation='relu', input_shape=(n_features,)))
model_3.add(layers.Dense(32, activation='relu'))
model_3.add(layers.Dense(16, activation='relu'))
model_3.add(layers.Dense(2, activation='softmax'))

model_3.summary()
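
Model 3 swaps `tanh` for `relu`. The sketch below compares the two activations and their gradients on a handful of illustrative inputs, as a reminder of how they differ: `tanh` squashes values into (-1, 1) and its gradient shrinks for large inputs, while `relu` passes positive values through unchanged and zeroes out the rest.

```python
import numpy as np

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])  # illustrative pre-activations

tanh_out = np.tanh(z)
relu_out = np.maximum(0.0, z)

# Derivatives with respect to z
tanh_grad = 1.0 - tanh_out ** 2
relu_grad = (z > 0).astype(float)

print("tanh:", tanh_out, "gradient:", tanh_grad)
print("relu:", relu_out, "gradient:", relu_grad)
```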

In [None]:
# Compile model
model_3.compile(loss='categorical_crossentropy',
                optimizer='sgd',
                metrics=['accuracy'])

# Fit model
results_3 = model_3.fit(train_img, 
                        train_labels,
                        batch_size=64, 
                        epochs=30, 
                        validation_data=(test_img, test_labels))

In [None]:
# Visualise the loss and accuracy of the training and validation sets across epochs
visualize_results(results_3)

In [None]:
# Evaluate the training results
results_3_train = model_3.evaluate(train_img, train_labels)
results_3_train

In [None]:
# Evaluate the test results
results_3_test = model_3.evaluate(test_img, test_labels)
results_3_test

<a name="model-4"></a>
### 3D. Model 4: Add L2 regularisation and reduce the learning rate

In [69]:
np.random.seed(123)

# Build a single hidden layer model with L2 regularisation
model_4 = models.Sequential()
model_4.add(layers.Dense(64, activation='relu', kernel_regularizer=regularizers.l2(0.005), input_shape=(n_features,)))
model_4.add(layers.Dense(2, activation='softmax'))

# View summary for model
model_4.summary()


In [None]:
# Compile model using SGD with a learning rate of 0.001
optimizer = keras.optimizers.SGD(0.001)

model_4.compile(loss='categorical_crossentropy',
                optimizer=optimizer,
                metrics=['accuracy'])

# Fit model
results_4 = model_4.fit(train_img, 
                        train_labels,
                        batch_size=64, 
                        epochs=30, 
                        validation_data=(test_img, test_labels))
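
Model 4 changes two things: an L2 penalty of 0.005 on the hidden layer's weights and a smaller SGD learning rate of 0.001. A rough NumPy sketch of both is given below. It assumes plain SGD without momentum and an L2 term of the form `l2 * sum(w ** 2)`; the weight matrix and the zero data gradient are made up for illustration.

```python
import numpy as np


def l2_penalty(weights, l2=0.005):
    """Extra loss added by L2 regularisation: l2 * sum of squared weights."""
    return l2 * float(np.sum(weights ** 2))


def sgd_step(weights, data_grad, learning_rate=0.001, l2=0.005):
    """One plain SGD update; the penalty contributes a gradient of 2 * l2 * w."""
    total_grad = data_grad + 2.0 * l2 * weights
    return weights - learning_rate * total_grad


w = np.array([[1.0, -2.0],
              [0.5, 0.0]])
data_grad = np.zeros_like(w)  # pretend the data loss is flat at this point

print(l2_penalty(w))           # 0.005 * 5.25 = 0.02625
print(sgd_step(w, data_grad))  # every weight shrinks slightly towards zero
```

Even when the data gradient is zero the weights decay a little on each step, which is how the penalty discourages large weights.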

In [None]:
# Visualise the loss and accuracy of the training and validation sets across epochs
visualize_results(results_4)

In [None]:
# Evaluate the training results
results_4_train = model_4.evaluate(train_img, train_labels)
results_4_train

In [None]:
# Evaluate the test results
results_4_test = model_4.evaluate(test_img, test_labels)
results_4_test

<a name="model-X"></a>
### 3X. Model X

In [None]:
# Build a fully connected model with 3 hidden layers and a single sigmoid output

model_X = models.Sequential()
model_X.add(layers.Dense(20, activation='relu', input_shape=(n_features,)))
model_X.add(layers.Dense(7, activation='relu'))
model_X.add(layers.Dense(5, activation='relu'))
model_X.add(layers.Dense(1, activation='sigmoid'))

In [None]:
model_X.compile(loss='binary_crossentropy',
                optimizer='sgd',
                metrics=['accuracy'])
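
Model X uses a single sigmoid unit with binary cross-entropy instead of two softmax units with categorical cross-entropy. A simplified NumPy sketch of that pairing follows (not Keras's own implementation; the logits and targets are illustrative).

```python
import numpy as np


def sigmoid(z):
    """Squash a real-valued score into a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean of -[y * log(p) + (1 - y) * log(1 - p)] over all samples."""
    y_prob = np.clip(y_prob, eps, 1.0 - eps)
    losses = y_true * np.log(y_prob) + (1.0 - y_true) * np.log(1.0 - y_prob)
    return float(-np.mean(losses))


# Two illustrative samples: one uncertain, one confidently wrong
logits = np.array([[0.0], [2.0]])
targets = np.array([[1.0], [0.0]])

probs = sigmoid(logits)
loss = binary_crossentropy(targets, probs)
print(probs.ravel(), loss)
```

The targets need the shape `(n_samples, 1)`, which is why the notebook builds `train_y`, `test_y` and `val_y` as single-column arrays.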

In [None]:
# Fit the model on the binary labels, validating against the validation set
results_X = model_X.fit(train_img,
                        train_y,
                        epochs=50,
                        batch_size=32,
                        validation_data=(val_img, val_y))

In [None]:
visualize_results(results_X)

In [None]:
results_X_train = model_X.evaluate(train_img, train_y)
results_X_train

In [None]:
results_X_test = model_X.evaluate(test_img, test_y)
results_X_test

<a name="model-Y"></a>
### 3Y. CNN Model

In [None]:
# Build a convolutional model that takes the unflattened 128x128 RGB images
model_Y = models.Sequential()
model_Y.add(layers.Conv2D(32, (3, 3), activation='relu',
                          input_shape=(128, 128, 3)))
model_Y.add(layers.MaxPooling2D((2, 2)))

model_Y.add(layers.Conv2D(32, (4, 4), activation='relu'))
model_Y.add(layers.MaxPooling2D((2, 2)))

model_Y.add(layers.Conv2D(64, (3, 3), activation='relu'))
model_Y.add(layers.MaxPooling2D((2, 2)))

# Flatten the feature maps and classify with a single sigmoid output
model_Y.add(layers.Flatten())
model_Y.add(layers.Dense(64, activation='relu'))
model_Y.add(layers.Dense(1, activation='sigmoid'))

# A single sigmoid output pairs with binary cross-entropy and binary labels
model_Y.compile(loss='binary_crossentropy',
                optimizer='sgd',
                metrics=['accuracy'])

# Fit on the image arrays (not the flattened vectors) and the binary labels
results_Y = model_Y.fit(train_images,
                        train_y,
                        epochs=30,
                        batch_size=32,
                        validation_data=(val_images, val_y))
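
To see how the convolution and pooling layers shrink the feature maps, the spatial sizes can be traced by hand. The sketch below assumes the Keras defaults (`padding='valid'`, convolution stride 1, pooling stride equal to the pool size) and the 128x128 images produced in the preprocessing section; `trace_cnn` is a hypothetical helper introduced for illustration.

```python
def conv_output_size(size, kernel):
    """Side length after a 'valid' convolution with stride 1."""
    return size - kernel + 1


def pool_output_size(size, pool):
    """Side length after max pooling with stride equal to the pool size."""
    return size // pool


def trace_cnn(size, blocks):
    """Apply (kernel, pool) blocks in order; return the side length after each block."""
    sizes = []
    for kernel, pool in blocks:
        size = pool_output_size(conv_output_size(size, kernel), pool)
        sizes.append(size)
    return sizes


# Three blocks matching the architecture above: 3x3, 4x4 and 3x3 kernels, 2x2 pooling
blocks = [(3, 2), (4, 2), (3, 2)]
sizes = trace_cnn(128, blocks)
print(sizes)                     # [63, 30, 14]

flattened = sizes[-1] ** 2 * 64  # 64 filters in the last convolution
print(flattened)                 # 12544 values feed the first dense layer
```

Compared with the 49152 inputs of the dense models, the flattened convolutional output is much smaller, so the first dense layer of the CNN has far fewer weights.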