# CSCK506 Deep Learning Group Project
The goal is to train a *Convolutional Neural Network* (CNN) model that distinguishes chest X-rays of healthy lungs from those of pneumonia-infected lungs.

Table of Contents
=================
1. [Import Libraries](#Import-Libraries)
2. [Data Preprocessing](#Data-Preprocessing)
    1. [Unzip File into data folder](#Unzip-File-into-data-folder)
    2. [Understanding the Data](#Understanding-the-Data)
    3. [Data Visualization](#Data-Visualization)
    4. [Check for Imbalance Data](#Check-for-Imbalance-Data)
    5. [Data Augmentation](#Data-Augmentation)
    6. [Dataloader for Batching](#Dataloader-for-Batching)
3. [Model Development](#Model-Development)
    1. [Build the CNN Model](#Build-the-CNN-Model)
    2. [Train the CNN Model](#Train-the-CNN-Model)
    3. [Evaluate and Tune the CNN Model](#Evaluate-and-Tune-the-CNN-Model)
    4. [Save the CNN Model](#Save-the-CNN-Model)
4. [Model Testing](#Model-Testing)
    1. [Load the CNN Model](#Load-the-CNN-Model)
    2. [Test the CNN Model](#Test-the-CNN-Model)

## Import Libraries

In [22]:
import os
import hashlib
import zipfile

import numpy
from keras.models import Sequential
from keras.layers import Dense,Dropout,Flatten,Conv2D,MaxPooling2D
from keras.optimizers import SGD
from keras.preprocessing.image import ImageDataGenerator
from keras.models import load_model 

## Data Preprocessing

### Unzip File into data folder
- Download the dataset from [Kaggle](https://www.kaggle.com/paultimothymooney/chest-xray-pneumonia) and extract it to the same directory as this notebook.
- To re-extract the dataset, delete the data folder and run the following code.

In [2]:
""" if not os.path.exists('data'):
    DATA_EXIST = False
    EXTRACT_FROM_ZIP = False  # stays False unless the archive passes the checksum check below
    os.makedirs('data')
else:
    DATA_EXIST = True
    EXTRACT_FROM_ZIP = False
    print('Data folder already exists')

# Check if downloaded data is correct
FILENAME = 'archive.zip'
SHA256SUM ='f569fe885b0f921e836f3d6bcc8d7b3442f5e0ca4db4533d06b8cf25d2114ea1'

if os.path.exists(FILENAME) and not DATA_EXIST:
    with open(FILENAME, 'rb') as f:
        read_bytes = f.read() # read entire file as bytes
        READABLE_HASH = hashlib.sha256(read_bytes).hexdigest()
        if READABLE_HASH != SHA256SUM:
            print('Data corrupted, please download again')
            os.remove(FILENAME)
            EXTRACT_FROM_ZIP = False
        else:
            EXTRACT_FROM_ZIP = True # Ready to extract data from zip file

folder_to_extract = ['chest_xray/test', 'chest_xray/train', 'chest_xray/val']

# Extract data from zip file
if not DATA_EXIST and EXTRACT_FROM_ZIP:
    with zipfile.ZipFile(FILENAME, 'r') as zip_ref:
        for fol in folder_to_extract:
            for file in zip_ref.namelist():
                if file.startswith(fol):
                    zip_ref.extract(file, 'data')
    for fol in folder_to_extract:
        os.rename('data/'+fol, 'data/'+fol.split('/')[1])
    os.rmdir('data/chest_xray') """

Data folder already exists


In [None]:
train_dir = './data/train/'  # images used to fit the model
test_dir = './data/val/'     # images passed as validation data during training

target_size = (224,224)

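# Define the training data generator: normalize pixel values to [0, 1]
# and apply random augmentation to every batch
train_datagen = ImageDataGenerator(rescale=1./255,
    rotation_range=20,        # rotate by up to 20 degrees
    width_shift_range=0.1,    # shift horizontally by up to 10% of the width
    height_shift_range=0.1,   # shift vertically by up to 10% of the height
    shear_range=0.2,
    zoom_range=0.2,           # zoom in or out by up to 20%
    horizontal_flip=True,
    validation_split=0.2)     # no effect unless subset= is passed to flow_from_directory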

test_datagen = ImageDataGenerator(rescale=1./255)

# load and iterate over the image data in batches
train_generator = train_datagen.flow_from_directory(
        train_dir,
        target_size=target_size,
        batch_size=32,
        color_mode='grayscale',
        class_mode='categorical')

test_generator = test_datagen.flow_from_directory(
        test_dir,
        target_size=target_size,
        batch_size=32,
        color_mode='grayscale',
        class_mode='categorical')

### Understanding the Data

### Data Visualization

### Check for Imbalance Data

### Data Augmentation
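Alter the training data with the following random transformations, matching the `ImageDataGenerator` settings used when loading the data:
- Rotate training images by up to 20 degrees
- Shift training images horizontally and vertically by up to 10% of their size
- Shear and zoom training images (range 0.2 each)
- Flip training images horizontally

Data augmentation exposes the model to more varied training examples. This improves its ability to generalize and makes its predictions less sensitive to small changes in the input data.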
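As a rough illustration of what "random" means here, the sketch below draws one set of augmentation parameters per image from the ranges configured in this notebook. The helper `sample_augmentation` is hypothetical, and the way each range is sampled is an assumption about how such settings are usually interpreted, not a copy of the Keras implementation.

```python
import random

# Ranges copied from the ImageDataGenerator call in this notebook
ROTATION_RANGE = 20        # degrees
WIDTH_SHIFT_RANGE = 0.1    # fraction of the image width
HEIGHT_SHIFT_RANGE = 0.1   # fraction of the image height
ZOOM_RANGE = 0.2
IMAGE_SIZE = (224, 224)    # (height, width)


def sample_augmentation(rng):
    """Draw one random set of augmentation parameters (illustrative helper)."""
    height, width = IMAGE_SIZE
    return {
        "rotation_deg": rng.uniform(-ROTATION_RANGE, ROTATION_RANGE),
        "shift_x_px": rng.uniform(-WIDTH_SHIFT_RANGE, WIDTH_SHIFT_RANGE) * width,
        "shift_y_px": rng.uniform(-HEIGHT_SHIFT_RANGE, HEIGHT_SHIFT_RANGE) * height,
        "zoom": rng.uniform(1 - ZOOM_RANGE, 1 + ZOOM_RANGE),
        "flip_horizontal": rng.random() < 0.5,
    }


rng = random.Random(0)  # fixed seed so the sketch is repeatable
samples = [sample_augmentation(rng) for _ in range(5)]
for params in samples:
    print(params)
```

Because new parameters are drawn every time an image is loaded, the model rarely sees exactly the same pixels twice, even though the number of files on disk does not change.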

### Dataloader for Batching
Load the data into batches of images and labels using the Keras `flow_from_directory` generators (batch size 32).
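Whatever loader is used, batching comes down to cutting the list of samples into consecutive slices. The sketch below shows that idea in plain Python; `batch_slices` and the sample count of 100 are hypothetical and only serve the illustration.

```python
import math


def batch_slices(num_samples, batch_size):
    """Yield (start, stop) index pairs that cover all samples in order."""
    for start in range(0, num_samples, batch_size):
        yield start, min(start + batch_size, num_samples)


NUM_SAMPLES = 100  # hypothetical dataset size, not the real image count
BATCH_SIZE = 32    # batch size used in this notebook

batches = list(batch_slices(NUM_SAMPLES, BATCH_SIZE))
steps_per_epoch = math.ceil(NUM_SAMPLES / BATCH_SIZE)

print(batches)          # [(0, 32), (32, 64), (64, 96), (96, 100)]
print(steps_per_epoch)  # 4
```

The last batch is smaller than the others whenever the sample count is not a multiple of the batch size.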

## Model Development

### Build the CNN Model
Build a CNN and train it on the training data, aiming for the lowest loss and the highest accuracy when detecting images that show pneumonia.

1. Sequential: This is a Keras model type that allows you to build a model by adding layers sequentially. In this case, we use it to define a feedforward neural network.

2. Conv2D: This layer performs convolution on the input image. In this case, we use a 2D convolution with a kernel size of 3x3 and 32 output filters. We also specify the relu activation function to introduce non-linearity into the model. The input_shape parameter specifies the shape of the input data, which is a 224x224 grayscale image with a single channel.

3. MaxPooling2D: This layer performs max pooling on the output of the convolution layer. In this case, we use a 2x2 pooling window to reduce the spatial dimensions of the output by a factor of 2.

4. We repeat steps 2 and 3 twice with more output filters (64, then 128) and the same 2x2 pooling window to extract higher-level features from the image.

5. Flatten: This layer flattens the output of the convolutional layers into a 1D array, which can be fed into a fully connected layer.

6. Dense: This layer is a fully connected layer that computes the output of the network using a linear transformation of the input followed by a non-linear activation function. In this case, we use a dense layer with 128 units and the relu activation function.

7. Dropout: This layer applies dropout regularization to the output of the previous layer. Dropout randomly sets a fraction of the output units to zero during training, which helps prevent overfitting.

8. We add another dense layer with 2 units and the softmax activation function as the output layer. The softmax function normalizes the output so that it represents a probability distribution over the two output classes.
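To make the last step concrete, here is a minimal pure-Python sketch of the softmax function applied to two raw output scores. The input values are made up for illustration.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    largest = max(logits)
    # Subtracting the largest score avoids overflow and does not change the result
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2.0, 0.5])  # hypothetical scores for the two classes
print(probs)
print(sum(probs))
```

The class with the larger score receives the larger probability, and the predicted class is the index of the highest probability.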

In [25]:
# Create the model 
model = Sequential()
model.add(Conv2D(32, (3, 3), activation='relu', input_shape=(224, 224, 1)))
model.add(MaxPooling2D((2, 2)))
model.add(Conv2D(64, (3, 3), activation='relu'))
model.add(MaxPooling2D((2, 2)))
model.add(Conv2D(128, (3, 3), activation='relu'))
model.add(MaxPooling2D((2, 2)))
model.add(Flatten())
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(2, activation='softmax'))
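The feature-map sizes and parameter counts of this architecture can be checked by hand. The sketch below assumes the Keras defaults that the model relies on: no padding and stride 1 for `Conv2D`, and a pooling stride equal to the pooling window for `MaxPooling2D`. The helper names are hypothetical.

```python
def conv_out(size, kernel=3):
    """Output width of an unpadded, stride-1 convolution."""
    return size - kernel + 1


def pool_out(size, window=2):
    """Output width of max pooling with stride equal to the window."""
    return size // window


size, channels = 224, 1  # 224x224 grayscale input
params = 0

for filters in (32, 64, 128):
    # Each filter has 3*3*channels weights plus one bias
    params += (3 * 3 * channels + 1) * filters
    size = pool_out(conv_out(size))
    channels = filters
    print(f"after block with {filters} filters: {size}x{size}x{channels}")

flat = size * size * channels  # length of the vector produced by Flatten
params += (flat + 1) * 128     # Dense(128)
params += (128 + 1) * 2        # Dense(2) output layer

print("flattened length:", flat)    # 86528
print("total parameters:", params)  # 11168642
```

Almost all parameters sit in the first dense layer, because the flattened feature map is still large after three pooling steps.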

### Train the CNN Model
Choose:
- Number of convolution-pooling building blocks,
- The strides, padding and activation function that give you the maximum accuracy,
- A solution to avoid overfitting in your code (here: dropout regularization).

Use SGD as the optimizer:
- Create an SGD optimizer with a learning rate of 0.01 and a momentum of 0.9. SGD with momentum is expected to train more stably than Adam on this task.
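For intuition, the momentum update can be sketched in a few lines of plain Python. This is a minimal illustration on a toy loss `w**2` with the same learning rate and momentum, not the optimizer's actual source code; `sgd_momentum_step` is a hypothetical helper.

```python
def sgd_momentum_step(weight, velocity, grad, learning_rate=0.01, momentum=0.9):
    """One SGD-with-momentum update for a single scalar weight."""
    # The velocity keeps a decaying memory of past gradients
    velocity = momentum * velocity - learning_rate * grad
    return weight + velocity, velocity


weight, velocity = 1.0, 0.0
for step in range(300):
    grad = 2.0 * weight  # gradient of the toy loss w**2
    weight, velocity = sgd_momentum_step(weight, velocity, grad)

print(weight)  # very close to 0, the minimum of the toy loss
```

The momentum term smooths the updates across steps, which is the property the stability argument relies on.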

In [None]:
# Compile the model with SGD optimizer
sgd = SGD(learning_rate=0.01, momentum=0.9)  # `lr` is a deprecated alias of `learning_rate`
model.compile(optimizer=sgd, loss='categorical_crossentropy', metrics=['accuracy'])

# Train the model
model.fit(train_generator, epochs=10, validation_data=test_generator)

### Evaluate and Tune the CNN Model
Use the validation dataset to tune the hyperparameters.

In [None]:
scores = model.evaluate(test_generator, verbose=0) 
print("Accuracy: %.2f%%" % (scores[1]*100))
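The accuracy reported above is the share of images whose most probable predicted class matches the label. A minimal sketch of that calculation, using made-up predictions and one-hot labels:

```python
def categorical_accuracy(predictions, labels):
    """Fraction of samples where the most probable class matches the label."""
    correct = 0
    for probs, one_hot in zip(predictions, labels):
        if probs.index(max(probs)) == one_hot.index(max(one_hot)):
            correct += 1
    return correct / len(labels)


# Hypothetical softmax outputs and one-hot labels for four images
predictions = [[0.9, 0.1], [0.3, 0.7], [0.6, 0.4], [0.2, 0.8]]
labels = [[1, 0], [0, 1], [0, 1], [0, 1]]

accuracy = categorical_accuracy(predictions, labels)
print("Accuracy: %.2f%%" % (accuracy * 100))  # Accuracy: 75.00%
```

The third image is the only miss: the model favours the first class while the label is the second.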

### Save the CNN Model

In [None]:
model.save('xray_model.h5')

## Model Testing

### Load the CNN Model

### Test the CNN Model
Use the test dataset after the final tuning to obtain the maximum test accuracy.
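One possible way to do this is sketched below. It assumes the model was saved as `xray_model.h5`, that the test images live in `./data/test/` as produced by the extraction step, and that the same Keras version used elsewhere in this notebook is installed. The helper `evaluate_saved_model` is hypothetical.

```python
def evaluate_saved_model(model_path='xray_model.h5', image_dir='./data/test/'):
    """Load a saved model and return its accuracy on the images in image_dir."""
    # Imported here so the helper can be defined before Keras is needed
    from keras.models import load_model
    from keras.preprocessing.image import ImageDataGenerator

    model = load_model(model_path)

    # Only rescale: no augmentation when measuring test performance
    datagen = ImageDataGenerator(rescale=1./255)
    generator = datagen.flow_from_directory(
        image_dir,
        target_size=(224, 224),
        batch_size=32,
        color_mode='grayscale',
        class_mode='categorical',
        shuffle=False)

    # The model was compiled with metrics=['accuracy'],
    # so evaluate returns the loss followed by the accuracy
    loss, accuracy = model.evaluate(generator, verbose=0)
    return accuracy
```

Calling `evaluate_saved_model()` after training and saving would return the accuracy as a fraction between 0 and 1, which can be printed in the same format as in the evaluation step.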