# CSCK506 Deep Learning Group Project
Train a *Convolutional Neural Network* (CNN) model to distinguish healthy lungs from pneumonia-infected ones.

Table of Contents
=================
1. [Import Libraries](#Import-Libraries)
2. [Data Preprocessing](#Data-Preprocessing)
    1. [Unzip File into data folder](#Unzip-File-into-data-folder)
    2. [Understanding the Data](#Understanding-the-Data)
    3. [Data Visualization](#Data-Visualization)
    4. [Check for Imbalance Data](#Check-for-Imbalance-Data)
    5. [Data Augmentation](#Data-Augmentation)
    6. [Dataloader for Batching](#Dataloader-for-Batching)
3. [Model Development](#Model-Development)
    1. [Build the CNN Model](#Build-the-CNN-Model)
    2. [Train the CNN Model](#Train-the-CNN-Model)
    3. [Evaluate and Tune the CNN Model](#Evaluate-and-Tune-the-CNN-Model)
    4. [Save the CNN Model](#Save-the-CNN-Model)
4. [Model Testing](#Model-Testing)
    1. [Load the CNN Model](#Load-the-CNN-Model)
    2. [Test the CNN Model](#Test-the-CNN-Model)

## Import Libraries

In [1]:
import os
import hashlib
import zipfile

## Data Preprocessing

### Unzip File into data folder
- Download the dataset from [Kaggle](https://www.kaggle.com/paultimothymooney/chest-xray-pneumonia) and place `archive.zip` in the same directory as this notebook. The code below verifies its checksum and extracts it into the `data` folder.
- To re-extract the dataset, delete the `data` folder and run the following code.

In [2]:
DATA_EXIST = os.path.exists('data')
EXTRACT_FROM_ZIP = False  # Becomes True only once the archive passes the checksum
if DATA_EXIST:
    print('Data folder already exists')

# Check if downloaded data is correct
FILENAME = 'archive.zip'
SHA256SUM = 'f569fe885b0f921e836f3d6bcc8d7b3442f5e0ca4db4533d06b8cf25d2114ea1'

if not DATA_EXIST:
    if not os.path.exists(FILENAME):
        print(FILENAME + ' not found, please download the dataset first')
    else:
        sha256 = hashlib.sha256()
        with open(FILENAME, 'rb') as f:
            # Hash in 1 MiB chunks so the whole archive is never held in memory
            for chunk in iter(lambda: f.read(1024 * 1024), b''):
                sha256.update(chunk)
        # The file is closed here, so it can be removed safely if corrupted
        if sha256.hexdigest() != SHA256SUM:
            print('Data corrupted, please download again')
            os.remove(FILENAME)
        else:
            EXTRACT_FROM_ZIP = True  # Ready to extract data from zip file

folder_to_extract = ['chest_xray/test', 'chest_xray/train', 'chest_xray/val']

# Extract data from zip file
if EXTRACT_FROM_ZIP:
    # Create the data folder only now, so a failed download never leaves it empty
    os.makedirs('data')
    with zipfile.ZipFile(FILENAME, 'r') as zip_ref:
        for file in zip_ref.namelist():
            # Keep only the test, train and val folders
            if file.startswith(tuple(folder_to_extract)):
                zip_ref.extract(file, 'data')
    # Move data/chest_xray/<split> up to data/<split>
    for fol in folder_to_extract:
        os.rename('data/' + fol, 'data/' + fol.split('/')[1])
    os.rmdir('data/chest_xray')

Data folder already exists


### Understanding the Data

### Data Visualization

### Check for Imbalance Data

### Data Augmentation
Alter the training data with the following transformations:
- Randomly rotate some training images by 10 degrees
- Randomly resize and crop some training images

Data augmentation increases the variety of the training data, so that the model generalizes better and becomes more invariant to such changes in the input images.

### Dataloader for Batching
Load the data into batches of images and labels using PyTorch's DataLoader class.
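
A minimal sketch of the batching step, assuming a batch size of 32 and using random tensors as a stand-in for the X-ray images so that the snippet runs without the dataset. In the notebook, the stand-in dataset would presumably be replaced by an image dataset that reads the `data/train` folder.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

BATCH_SIZE = 32  # assumed value; tune as needed

# Stand-in dataset: 100 random single-channel 64x64 "images" with binary labels.
# With the real data, a dataset such as torchvision.datasets.ImageFolder('data/train', ...)
# could take its place (an assumption about how the images are loaded).
images = torch.rand(100, 1, 64, 64)
labels = torch.randint(0, 2, (100,))
train_dataset = TensorDataset(images, labels)

# shuffle=True reshuffles the training samples at the start of every epoch
train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)

# Fetch one batch to inspect its shape
batch_images, batch_labels = next(iter(train_loader))
print(batch_images.shape, batch_labels.shape)
```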

## Model Development

### Build the CNN Model
Build a CNN and train it on the training data, aiming for the minimum loss and maximum accuracy when detecting images that show pneumonia.
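
One possible shape for such a model is sketched below. The class name `SimpleCNN`, the two convolution-pooling blocks, the channel counts, and the dropout rate are all illustrative assumptions rather than the architecture chosen in this project.

```python
import torch
from torch import nn


class SimpleCNN(nn.Module):
    """Hypothetical two-block CNN for binary classification (NORMAL vs PNEUMONIA)."""

    def __init__(self, num_classes=2, dropout=0.5):
        super().__init__()
        # Two convolution-pooling building blocks
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, stride=1, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, stride=1, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        # Global average pooling keeps the classifier independent of the input size
        self.classifier = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Dropout(dropout),
            nn.Linear(32, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))


model = SimpleCNN()
# Forward a dummy batch of four single-channel 64x64 images
logits = model(torch.rand(4, 1, 64, 64))
print(logits.shape)
```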

### Train the CNN Model
Choose:
- The number of convolution-pooling building blocks,
- The strides, padding, and activation function that give the maximum accuracy,
- A regularization technique to avoid overfitting.
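
The choices above could be wired into a training loop along these lines. This is a sketch on synthetic data: the single convolution-pooling block, the dropout rate, the `weight_decay` value, the learning rate, and the helper name `train_one_epoch` are assumptions, with dropout and weight decay standing in as two common regularization options.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Synthetic stand-in for the training images: 64 single-channel 32x32 samples
loader = DataLoader(
    TensorDataset(torch.rand(64, 1, 32, 32), torch.randint(0, 2, (64,))),
    batch_size=16,
    shuffle=True,
)

# One convolution-pooling block; stride, padding and activation are the knobs to tune
model = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3, stride=1, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),           # 32x32 -> 16x16
    nn.Flatten(),
    nn.Dropout(0.5),           # regularization: randomly zero activations during training
    nn.Linear(8 * 16 * 16, 2),
)

criterion = nn.CrossEntropyLoss()
# weight_decay adds L2 regularization on the weights
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)


def train_one_epoch(model, loader, criterion, optimizer):
    """Run one pass over the loader and return the mean training loss."""
    model.train()  # enables dropout
    total_loss, n_samples = 0.0, 0
    for x, y in loader:
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
        total_loss += loss.item() * x.size(0)
        n_samples += x.size(0)
    return total_loss / n_samples


epoch_losses = [train_one_epoch(model, loader, criterion, optimizer) for _ in range(3)]
print(epoch_losses)
```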

### Evaluate and Tune the CNN Model
Use the validation dataset to tune the hyperparameters.
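
To compare hyperparameter settings, each candidate model can be scored on the validation set. The helper below is a sketch: the name `evaluate`, the stand-in validation data, and the untrained stand-in model are assumptions made so that the snippet runs on its own.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset


def evaluate(model, loader, criterion):
    """Return (mean loss, accuracy) of the model over the loader without updating weights."""
    model.eval()  # disables dropout
    total_loss, correct, n_samples = 0.0, 0, 0
    with torch.no_grad():  # no gradients are needed for evaluation
        for x, y in loader:
            logits = model(x)
            total_loss += criterion(logits, y).item() * x.size(0)
            correct += (logits.argmax(dim=1) == y).sum().item()
            n_samples += x.size(0)
    return total_loss / n_samples, correct / n_samples


# Stand-in validation data and an untrained stand-in model
val_loader = DataLoader(
    TensorDataset(torch.rand(20, 1, 8, 8), torch.randint(0, 2, (20,))),
    batch_size=8,
)
candidate = nn.Sequential(nn.Flatten(), nn.Linear(64, 2))

val_loss, val_acc = evaluate(candidate, val_loader, nn.CrossEntropyLoss())
print(val_loss, val_acc)
```

Running `evaluate` once per hyperparameter setting and keeping the setting with the best validation score leaves the test set untouched until the final check.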

### Save the CNN Model

## Model Testing

### Load the CNN Model

### Test the CNN Model
Use the test dataset only after the final tuning to measure the test accuracy.