# Preface

In this notebook, we build a simple CNN to classify chest x-ray images into two categories, NORMAL or PNEUMONIA. Along the way, we introduce a practical image-processing pipeline based on `ImageDataGenerator` from `keras`.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pathlib
sns.set(font_scale=1.5, style='dark')
np.random.seed(123)

# Downloading Dataset from Kaggle

We will download the chest x-ray images directly from Kaggle using the [Kaggle API](https://github.com/Kaggle/kaggle-api).

Alternatively, you can download the data manually from [here](https://www.kaggle.com/paultimothymooney/chest-xray-pneumonia).

In [None]:
import kaggle
kaggle.api.authenticate()

kaggle.api.dataset_download_files(
    'paultimothymooney/chest-xray-pneumonia',
    path='./data',
    quiet=False,
    unzip=True,
    force=False,
)
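The download above only works if the Kaggle API can find credentials. As far as we are aware, the client reads them either from the `KAGGLE_USERNAME` and `KAGGLE_KEY` environment variables or from a `kaggle.json` file in `~/.kaggle` (or in the directory named by `KAGGLE_CONFIG_DIR`). The helper below is a hypothetical pre-flight check written for this notebook, not part of the `kaggle` package, and newer client versions may accept other mechanisms as well.

```python
import os
import pathlib


def kaggle_credentials_available(env=None, home=None):
    """Hypothetical helper: return True if Kaggle credentials appear to be configured.

    `env` and `home` default to the real environment and home directory;
    they are parameters only so the check can be exercised in isolation.
    """
    env = os.environ if env is None else env
    home = pathlib.Path.home() if home is None else pathlib.Path(home)

    # Option 1: credentials supplied through environment variables.
    if env.get('KAGGLE_USERNAME') and env.get('KAGGLE_KEY'):
        return True

    # Option 2: a kaggle.json file in the config directory (default ~/.kaggle).
    config_dir = pathlib.Path(env.get('KAGGLE_CONFIG_DIR', home / '.kaggle'))
    return (config_dir / 'kaggle.json').is_file()


print(kaggle_credentials_available())
```

If this prints `False`, falling back to the manual download is likely the quicker route.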

We will now look at some images using the `load_img` function from the `tensorflow.keras.preprocessing.image` module, which relies on `PIL`.

In [None]:
from tensorflow.keras.preprocessing.image import load_img

In [None]:
data_dir = pathlib.Path('./data/chest_xray')
train_dir = data_dir.joinpath('train')
val_dir = data_dir.joinpath('val')
test_dir = data_dir.joinpath('test')

We compare some normal and pneumonia images.

In [None]:
# List the files once instead of re-globbing the directory on every iteration
normal_paths = list(train_dir.glob('NORMAL/*'))
pneumonia_paths = list(train_dir.glob('PNEUMONIA/*'))

for i in range(5):
    normal_image = load_img(normal_paths[i])
    pneumonia_image = load_img(pneumonia_paths[i])
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))
    ax1.imshow(normal_image)
    ax1.set_title('NORMAL')
    ax2.imshow(pneumonia_image)
    ax2.set_title('PNEUMONIA')

To the untrained eye, it is not always straightforward to tell which is which. Let us now build a CNN model to classify these images into NORMAL vs PNEUMONIA.

# Building A Simple CNN for Pneumonia Classification

## Image Processing Pipeline

Before building a model, let us take a look at the input data shapes and scales.

In [None]:
normal_paths = list(train_dir.glob('NORMAL/*'))
for path in normal_paths[:5]:
    normal_image = np.array(load_img(path))
    print(f'Shape: {normal_image.shape}, Min: {normal_image.min()}, Max: {normal_image.max()}')

Of course, the images will require normalization. Moreover, the images are large ($\mathcal{O}(10^6)$ pixels each), so they should be downsized before being fed to the network.

The dataset size is non-trivial (~2.5 GB), so loading everything into memory is not very efficient. Instead, we can load the data on the fly during training.

Both of these can be achieved with the `ImageDataGenerator` class found in `tensorflow.keras.preprocessing.image`. We will use only its most basic functionality for now and gradually expand on it in later lectures.
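To make the two preprocessing steps concrete, here is a minimal NumPy sketch of what rescaling by `1/255` does to pixel values, and how much memory one image takes before and after downsizing to 128 x 128. The synthetic array and its 1000 x 1200 size are assumptions for illustration only; no real x-ray is read.

```python
import numpy as np

# Synthetic stand-in for one decoded x-ray: 8-bit RGB, values in [0, 255].
# The 1000 x 1200 size is an arbitrary assumption, not taken from the dataset.
rng = np.random.default_rng(0)
raw = rng.integers(0, 256, size=(1000, 1200, 3), dtype=np.uint8)

# Same arithmetic as rescale=1./255: cast to float and map [0, 255] onto [0, 1].
scaled = raw.astype(np.float32) / 255.0

# Memory per image as float32, at full size and after resizing to 128 x 128.
full_bytes = scaled.nbytes
small_bytes = 128 * 128 * 3 * 4  # height * width * channels * bytes per float32

print(f'Min: {scaled.min():.3f}, Max: {scaled.max():.3f}')
print(f'{full_bytes / 1e6:.1f} MB at full size vs {small_bytes / 1e6:.2f} MB at 128 x 128')
```

The gap between the two figures is the main reason to resize before training: a batch of full-size images would dominate memory long before the network itself does.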

In [None]:
from tensorflow.keras.preprocessing.image import ImageDataGenerator

In [None]:
data_generator = ImageDataGenerator(rescale=1./255)

In [None]:
img_size = (128, 128)
batch_size = 16

The `flow_from_directory` method reads images from the directory in batches on the fly, so the full dataset never has to be held in memory. Because every batch is read from disk, this will be faster if you are using an SSD.

In [None]:
train_gen = data_generator.flow_from_directory(
    train_dir,
    target_size=img_size,
    batch_size=batch_size,
    class_mode='binary')

val_gen = data_generator.flow_from_directory(
    val_dir,
    target_size=img_size,
    batch_size=batch_size,
    class_mode='binary')

test_gen = data_generator.flow_from_directory(
    test_dir,
    target_size=img_size,
    batch_size=batch_size,
    shuffle=False,
    class_mode='binary')
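Conceptually, the iterators above hand the model one batch at a time instead of one large array. The following is a simplified sketch of that idea, not the Keras implementation: `lazy_batches` and `fake_load` are hypothetical names, and the loader returns a blank array rather than decoding a file. The real iterator additionally decodes, resizes, rescales, shuffles and attaches labels.

```python
import numpy as np


def lazy_batches(paths, batch_size, load_fn):
    """Yield one stacked batch at a time; only `batch_size` images exist in memory at once."""
    for start in range(0, len(paths), batch_size):
        chunk = paths[start:start + batch_size]
        yield np.stack([load_fn(p) for p in chunk])


def fake_load(path):
    """Hypothetical loader: returns a blank 128 x 128 RGB image instead of reading `path`."""
    return np.zeros((128, 128, 3), dtype=np.float32)


# 40 made-up file names, batched 16 at a time.
paths = [f'img_{i}.jpeg' for i in range(40)]
shapes = [batch.shape for batch in lazy_batches(paths, 16, fake_load)]
print(shapes)
```

Note that the last batch is smaller than the others whenever the number of files is not a multiple of the batch size.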

## Build CNN Model

In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPool2D, Flatten, Dense

In [None]:
model = Sequential()
model.add(Conv2D(32, (3, 3), activation='relu', input_shape=(img_size[0], img_size[1], 3)))
model.add(MaxPool2D())

model.add(Conv2D(32, (3, 3),activation='relu'))
model.add(MaxPool2D())

model.add(Conv2D(64, (3, 3), activation='relu'))
model.add(MaxPool2D())

model.add(Flatten())
model.add(Dense(64, activation='relu'))
model.add(Dense(1, activation='sigmoid'))

In [None]:
model.compile(
    loss='binary_crossentropy',
    optimizer='adam',
    metrics=['accuracy'],
)

In [None]:
model.summary()
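The numbers reported by `model.summary()` can be checked by hand. The sketch below assumes the Keras defaults used in the model above (3 x 3 convolutions with `'valid'` padding and stride 1, and 2 x 2 max pooling) and recomputes the feature-map sizes and the parameter count layer by layer. The helper functions are written for this illustration only.

```python
def conv_out(size, kernel=3):
    """Spatial size after a 'valid' convolution with stride 1."""
    return size - kernel + 1


def pool_out(size, pool=2):
    """Spatial size after non-overlapping max pooling (odd sizes round down)."""
    return size // pool


def conv_params(in_ch, out_ch, kernel=3):
    """Weights plus one bias per output channel."""
    return kernel * kernel * in_ch * out_ch + out_ch


def dense_params(in_units, out_units):
    """Weights plus one bias per output unit."""
    return in_units * out_units + out_units


size, channels, total = 128, 3, 0
for filters in (32, 32, 64):  # the three Conv2D + MaxPool2D stages
    total += conv_params(channels, filters)
    size = pool_out(conv_out(size))
    channels = filters

flat = size * size * channels  # length of the Flatten output
total += dense_params(flat, 64) + dense_params(64, 1)
print(f'Final feature map: {size} x {size} x {channels}, flattened: {flat}, parameters: {total}')
```

Almost all of the parameters sit in the first `Dense` layer, which is worth keeping in mind when tweaking the architecture.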

## Train CNN

In [None]:
from tqdm.keras import TqdmCallback

Since we are now working with data generators, we need to find out how many data points there are.

In [None]:
num_train = len(list(train_dir.glob('./*/*')))
num_val = len(list(val_dir.glob('./*/*')))
num_test = len(list(test_dir.glob('./*/*')))
print(f'Num train: {num_train} Num val: {num_val} Num test: {num_test}')
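These counts are turned into step counts with integer division in the training cell that follows. The short sketch below, using a made-up example count rather than the real dataset size, shows the difference between floor and ceiling division: flooring skips a final partial batch, while the ceiling covers every example.

```python
import math

example_batch_size = 16
example_count = 1000  # hypothetical number of images, not the size of this dataset

floor_steps = example_count // example_batch_size           # ignores a final partial batch
ceil_steps = math.ceil(example_count / example_batch_size)  # includes it

covered = floor_steps * example_batch_size
print(f'Floor: {floor_steps} steps covering {covered} examples')
print(f'Ceil:  {ceil_steps} steps covering all {example_count} examples')
```

When the count is an exact multiple of the batch size, both expressions agree.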

In [None]:
model_save_dir = pathlib.Path('cnn_pneumonia_weights.h5')

In [None]:
if model_save_dir.exists():
    model.load_weights(str(model_save_dir))
else:
    history = model.fit(
        train_gen,  # the generator built above replaces (x_train, y_train)
        steps_per_epoch=num_train // batch_size,  # number of batches drawn per epoch
        epochs=10,
        validation_data=val_gen,
        validation_steps=num_val // batch_size,
        verbose=0,
        callbacks=[TqdmCallback(verbose=1)],
        workers=8,
    )
    model.save_weights(str(model_save_dir))
    results = pd.DataFrame(history.history)
    results['epoch'] = history.epoch

## Evaluate the Model

Accuracy can be obtained from `evaluate`.

In [None]:
loss, acc = model.evaluate(test_gen, workers=8, verbose=0)
print(f'Loss: {loss}  Accuracy: {acc}')

We can also look at the precision/recall and the confusion matrix.

In [None]:
from sklearn.metrics import confusion_matrix, classification_report

In [None]:
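# test_gen was created with shuffle=False, so predictions line up with test_gen.classes
y_pred = model.predict(test_gen)
# Threshold the sigmoid outputs at 0.5 to obtain class labels (0 = NORMAL, 1 = PNEUMONIA)
y_pred = (y_pred.squeeze() > 0.5).astype(int)
y_true = test_gen.classes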

print(classification_report(y_true, y_pred))

In [None]:
cmatrix = confusion_matrix(y_true, y_pred)

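ax = sns.heatmap(cmatrix, annot=True, fmt="d")
# confusion_matrix puts true labels on the rows and predicted labels on the columns
ax.set_xlabel('Predicted')
ax.set_ylabel('True')
ax.set_xticklabels(['NORMAL', 'PNEUMONIA'])
ax.set_yticklabels(['NORMAL', 'PNEUMONIA'], rotation=0);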
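To see how precision and recall relate to the confusion matrix, the following sketch computes them by hand on a small set of made-up labels and sigmoid outputs (not results from the model above), treating PNEUMONIA (label 1) as the positive class.

```python
import numpy as np

# Toy labels and sigmoid outputs, invented for illustration; 1 = PNEUMONIA.
y_true_toy = np.array([0, 0, 0, 1, 1, 1, 1, 1])
probs_toy = np.array([0.1, 0.7, 0.4, 0.9, 0.8, 0.3, 0.6, 0.95])
y_pred_toy = (probs_toy > 0.5).astype(int)

# The four cells of the 2 x 2 confusion matrix.
tp = int(np.sum((y_true_toy == 1) & (y_pred_toy == 1)))  # pneumonia, flagged
fp = int(np.sum((y_true_toy == 0) & (y_pred_toy == 1)))  # normal, flagged
fn = int(np.sum((y_true_toy == 1) & (y_pred_toy == 0)))  # pneumonia, missed
tn = int(np.sum((y_true_toy == 0) & (y_pred_toy == 0)))  # normal, cleared

precision = tp / (tp + fp)  # of the flagged cases, how many are truly pneumonia
recall = tp / (tp + fn)     # of the true pneumonia cases, how many were flagged

print(f'TP={tp} FP={fp} FN={fn} TN={tn}')
print(f'Precision: {precision:.2f}  Recall: {recall:.2f}')
```

In a screening setting, recall on the PNEUMONIA class is often the number to watch, since a missed case is usually costlier than a false alarm.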

# Exercise

Tweak the network and/or the training procedure to improve performance. We will introduce a number of such techniques later in this course.