# Processing TFRecords - NIH Chest XRays
Here I load a series of TFRecord files, each containing a set of images from the NIH Chest X-rays dataset.

As each image can carry several labels, this is a multilabel classification task.

The aim is to train a CNN that predicts a label for each possible condition.

# Loading Required Packages

In [None]:
import numpy as np
import pandas as pd
import os
import tensorflow as tf
import IPython.display as display
import matplotlib.pyplot as plt
import seaborn as sns
import random
from functools import partial
import sys
from numpy import load
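# Import Keras through TensorFlow so the layers match the tf.keras optimizer and metrics used later
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D
from tensorflow.keras.layers import MaxPooling2D
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Flatten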
import time as timer

# Starting timer for Code Execution

In [None]:
start_time = timer.time()

# Count of TFRecords
There are 256 files in total.

In [None]:
data_dir = '/kaggle/input/nih-chest-xrays-tfrecords/'

image_dir = data_dir + 'data/'

tfrlist_suffix = os.listdir(image_dir)

print('TFRecord file count: ' + str(len(tfrlist_suffix)))

# Viewing data distribution
A CSV with metadata relating to the data is provided. The data has a series of True-False columns relating to conditions (or lack thereof).

From the graphs, it is clear the data is highly imbalanced per condition, although the No Finding vs Finding overall is fairly balanced.

Since the No Finding column follows directly from the other columns (it holds when none of the conditions is present), I treat it as the default outcome and exclude it from all modelling.

In [None]:
df = pd.read_csv(data_dir + 'preprocessed_data.csv')

In [None]:
sns.countplot(x='No Finding', data=df)

In [None]:
heads = list(df.columns)[2:]
cols = int(np.ceil(len(heads)/2))

_, axs = plt.subplots(cols,2, figsize=(15, 30))

for i, _ in enumerate(heads):
    if i % 2 == 0:
        sns.countplot(x=heads[i], data=df, ax=axs[int(i/2),0])
    else:
        sns.countplot(x=heads[i], data=df, ax=axs[int((i-1)/2),1])

# Grabbing TFRecords
The full paths of the TFRecord files are collected with `tf.io.gfile.glob`, which returns a plain Python list of matching file names.

In [None]:
tfrlist = [image_dir + x for x in tfrlist_suffix]

FILENAMES = tf.io.gfile.glob(tfrlist)

# Creating indexes for test/valid/train
I split the files into train, validation and test sets by random sampling.
I first randomly split the full list 80/20 into train-plus-validation and test files, then randomly set aside 10% of the train-plus-validation files as the validation set.

In [None]:
ALL = list(range(len(FILENAMES)))

TRAIN_AND_VALID_INDEX = random.sample(ALL, int(len(ALL) * 0.8))
TEST_INDEX = list(set(ALL) - set(TRAIN_AND_VALID_INDEX))

TRAIN_INDEX = random.sample(TRAIN_AND_VALID_INDEX, int(len(TRAIN_AND_VALID_INDEX) * 0.9))
VALID_INDEX = list(set(TRAIN_AND_VALID_INDEX) - set(TRAIN_INDEX))
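
The split above draws from the global `random` state, so the three file lists change from run to run. A minimal sketch of a reproducible variant, assuming a fixed seed is acceptable; `split_indices` and the seed value `42` are illustrative and not part of the original notebook:

```python
import random


def split_indices(n_files, test_frac=0.2, valid_frac=0.1, seed=42):
    """Split range(n_files) into train/valid/test index lists.

    Mirrors the two-stage sampling above, but uses a private, seeded
    random generator so the split is identical on every run.
    """
    rng = random.Random(seed)  # hypothetical seed, chosen for illustration
    all_idx = list(range(n_files))

    # Stage 1: hold out the test files.
    train_valid = rng.sample(all_idx, int(n_files * (1 - test_frac)))
    test = sorted(set(all_idx) - set(train_valid))

    # Stage 2: hold out part of the remainder for validation.
    train = rng.sample(train_valid, int(len(train_valid) * (1 - valid_frac)))
    valid = sorted(set(train_valid) - set(train))

    return sorted(train), valid, test


train_idx, valid_idx, test_idx = split_indices(256)
print(len(train_idx), len(valid_idx), len(test_idx))
```

With 256 files this yields 183 training, 21 validation and 52 test files, the same bucket sizes as the unseeded version.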

# Getting lists for Training, Validation and Test
I use the above indices to split the entire list of files into the respective categories.

In [None]:
TRAINING_FILENAMES = [FILENAMES[index] for index in TRAIN_INDEX]
VALID_FILENAMES = [FILENAMES[index] for index in VALID_INDEX]
TEST_FILENAMES = [FILENAMES[index] for index in TEST_INDEX]

# Entire count of samples
This shows how the 256 files are divided between the three sets.

In [None]:
print("Train TFRecord Files:", len(TRAINING_FILENAMES))
print("Validation TFRecord Files:", len(VALID_FILENAMES))
print("Test TFRecord Files:", len(TEST_FILENAMES))

# Feature space for parsing
I use the column headers from the dataframe to build the feature dictionary needed to parse each record.

In [None]:
feature_description = {}

for elem in list(df.columns)[2:]:
    feature_description[elem] = tf.io.FixedLenFeature([], tf.int64)
    
feature_description['image'] = tf.io.FixedLenFeature([], tf.string)

# Setting parameters
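I use a batch size of 64, so that each batch has a better chance of containing examples of the rarer conditions in this imbalanced dataset.

I also set the image size to 100 x 100 to limit memory consumption and run time per iteration (Kaggle sessions have a 9-hour limit).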

In [None]:
BATCH_SIZE = 64
IMAGE_ONE_AXIS = 100
IMAGE_SIZE = [IMAGE_ONE_AXIS, IMAGE_ONE_AXIS]
AUTOTUNE = tf.data.experimental.AUTOTUNE

# Function for reading the file
This function uses the feature description defined above to decode the image and its label. In the loop, I collect the condition flags into a multi-hot encoded list (one 0/1 entry per condition, and several entries can be 1 at once).

No Finding corresponds to the zero vector.

In [None]:
def read_tfrecord(example):
    example = tf.io.parse_single_example(example, feature_description)
    image = tf.io.decode_jpeg(example["image"], channels=3)
    image = tf.image.resize(image, IMAGE_SIZE)
    image = tf.cast(image, tf.float32) / 255.0
    
    label = []
    
    for val in heads: label.append(example[val])
    
    return image, label
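
The label built in the loop is a multi-hot vector: one 0/1 entry per condition, in the order of `heads`. A small pure-Python sketch of that encoding, using a hypothetical three-condition subset (`demo_heads`) in place of the real column list:

```python
# Hypothetical subset of condition columns, standing in for `heads`.
demo_heads = ["Atelectasis", "Cardiomegaly", "Effusion"]


def encode_labels(findings, names):
    """Return a multi-hot list: 1 where the condition is present, else 0."""
    present = set(findings)
    return [1 if name in present else 0 for name in names]


print(encode_labels(["Effusion", "Atelectasis"], demo_heads))  # [1, 0, 1]
print(encode_labels([], demo_heads))                           # [0, 0, 0]
```

An image with no conditions maps to the zero vector, which is how No Finding is read back later.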

# Loading the data
`TFRecordDataset` reads the individual TFRecord files as a single dataset. Since the images have no meaningful order, I disable deterministic ordering (`experimental_deterministic = False`), which lets TensorFlow return elements in whatever order is fastest and can speed up loading.

In [None]:
def load_dataset(filenames):
    ignore_order = tf.data.Options()
    ignore_order.experimental_deterministic = False
    dataset = tf.data.TFRecordDataset(filenames)
    dataset = dataset.with_options(ignore_order)
    dataset = dataset.map(read_tfrecord)
    
    return dataset

# Batch and shuffle
This step takes the data loaded above, shuffles it with a buffer of 2048 elements, and groups it into batches for feeding into the model.

In [None]:
def get_dataset(filenames):
    dataset = load_dataset(filenames)
    dataset = dataset.shuffle(2048)
    dataset = dataset.batch(BATCH_SIZE)
    # Prefetch last so that whole batches are prepared while the model trains.
    dataset = dataset.prefetch(buffer_size=AUTOTUNE)
    
    return dataset
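
`Dataset.shuffle(2048)` does not shuffle the whole dataset at once; it keeps a buffer of 2048 elements and emits a random one each time a new element arrives. A simplified pure-Python sketch of that idea (`buffered_shuffle` is an illustrative helper, not a TensorFlow function):

```python
import random


def buffered_shuffle(items, buffer_size, seed=None):
    """Yield items in a locally shuffled order using a fixed-size buffer."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            buffer.append(item)  # still filling the buffer
            continue
        # Buffer is full: emit a random element and put the new one in its slot.
        pos = rng.randrange(buffer_size)
        yield buffer[pos]
        buffer[pos] = item
    # Input exhausted: drain what is left in random order.
    rng.shuffle(buffer)
    yield from buffer


shuffled = list(buffered_shuffle(range(10), buffer_size=4, seed=0))
print(shuffled)
```

A buffer smaller than the dataset only mixes nearby elements, so the order in which the files are read still influences the result to some degree.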

# Generating the data
Each of the train, validation and test datasets is now a batched `tf.data.Dataset`. These objects can be passed directly to the CNN without any further manipulation.

In [None]:
train_dataset = get_dataset(TRAINING_FILENAMES)
valid_dataset = get_dataset(VALID_FILENAMES)
test_dataset = get_dataset(TEST_FILENAMES)

# Visualising some examples
This step gives a quick look at the data. Each title is the multi-hot label translated back into condition names.

In [None]:
image_viz, label_viz = next(iter(train_dataset))

def show_batch(X, Y):
    plt.figure(figsize=(20, 20))
    for n in range(25):
        ax = plt.subplot(5, 5, n + 1)
        plt.imshow(X[n])
        
        result = [x for i, x in enumerate(heads) if Y[n][i]]
        title = "+".join(result)
        
        if result == []: title = "No Finding"
        
        plt.title(title)
        plt.axis("off")

show_batch(image_viz.numpy(), label_viz.numpy())

# Defining learning rate + early stop parameters
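The schedule below lowers the learning rate as training progresses: with `decay_steps=5` and `staircase=True`, the rate is multiplied by 0.96 after every 5 training steps (batches, not epochs). Early stopping, i.e. halting training when the monitored metric has not improved by more than a threshold for 10 epochs, is not set up in this cell: no callback is defined or passed to `model.fit`.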

In [None]:
initial_learning_rate = 0.01
lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate, decay_steps=5, decay_rate=0.96, staircase=True
)
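
With `staircase=True`, `ExponentialDecay` computes `initial_learning_rate * decay_rate ** floor(step / decay_steps)`, where `step` counts optimizer updates rather than epochs. A plain-Python sketch of that formula, with `staircase_lr` as an illustrative helper, shows how quickly the rate falls when `decay_steps=5`:

```python
def staircase_lr(step, initial_lr=0.01, decay_steps=5, decay_rate=0.96):
    """Learning rate after `step` optimizer updates (staircase decay)."""
    return initial_lr * decay_rate ** (step // decay_steps)


for step in (0, 4, 5, 100, 1000):
    print(step, staircase_lr(step))
```

After 1000 updates the rate is already about 3e-6, which may be worth keeping in mind when reading the training curves.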

# Define CNN model
The CNN follows a VGG-style design (pairs of 3x3 convolutions followed by max pooling), similar to the ImageNet model described in the article: [How to Use The Pre-Trained VGG Model to Classify Objects in Photographs](https://machinelearningmastery.com/use-pre-trained-vgg-model-classify-objects-photographs/).

In [None]:
def define_model(in_shape=(IMAGE_SIZE[0], IMAGE_SIZE[1], 3), out_shape=len(heads)):
    model = Sequential()
    model.add(Conv2D(32, (3, 3), activation='relu', kernel_initializer='he_uniform', padding='same', input_shape=in_shape))
    model.add(Conv2D(32, (3, 3), activation='relu', kernel_initializer='he_uniform', padding='same'))
    model.add(MaxPooling2D((2, 2)))
    model.add(Conv2D(64, (3, 3), activation='relu', kernel_initializer='he_uniform', padding='same'))
    model.add(Conv2D(64, (3, 3), activation='relu', kernel_initializer='he_uniform', padding='same'))
    model.add(MaxPooling2D((2, 2)))
    model.add(Conv2D(128, (3, 3), activation='relu', kernel_initializer='he_uniform', padding='same'))
    model.add(Conv2D(128, (3, 3), activation='relu', kernel_initializer='he_uniform', padding='same'))
    model.add(MaxPooling2D((2, 2)))
    model.add(Flatten())
    model.add(Dense(128, activation='relu', kernel_initializer='he_uniform'))
    model.add(Dense(out_shape, activation='sigmoid'))

    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=lr_schedule),
                  loss='binary_crossentropy',
                  metrics=[tf.keras.metrics.AUC(name="auc")])
    return model

# Calculating steps_per_epoch and validation_steps
These quantities are defined most straightforwardly as the number of images in the dataset divided by the batch size, i.e. $ \textrm{steps for sample} = \lceil \frac{\textrm{sample size}}{\textrm{batch size}} \rceil $.

However, the number of records in each set is not known in advance, so I count them here by iterating over the TFRecord files once.

In [None]:
train_size = sum(1 for _ in tf.data.TFRecordDataset(TRAINING_FILENAMES))
validation_size = sum(1 for _ in tf.data.TFRecordDataset(VALID_FILENAMES))

epoch_steps = int(np.ceil(train_size/BATCH_SIZE))
validation_steps = int(np.ceil(validation_size/BATCH_SIZE))

epochs = 10

print("steps_per_epoch: " + str(epoch_steps))
print("validation_steps: " + str(validation_steps))
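
As a worked example of the ceiling formula, here is a minimal sketch with made-up sample sizes (`steps_for` is an illustrative helper; the real sizes come from the counts above):

```python
import math


def steps_for(sample_size, batch_size):
    """Number of batches needed to cover `sample_size` items once."""
    return math.ceil(sample_size / batch_size)


print(steps_for(1000, 64))  # 16: 15 full batches plus one batch of 40
print(steps_for(128, 64))   # 2: divides exactly, no partial batch
```

The ceiling matters because the last batch is usually smaller than `BATCH_SIZE` and would otherwise be dropped from the count.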

# Running the model
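When I set both `steps_per_epoch` and `validation_steps`, training kept stopping at epoch 2, so I only set `validation_steps`. A likely cause is that the training dataset is not repeated (there is no `.repeat()` call): it is exhausted after one pass, so a fixed `steps_per_epoch` leaves no data for the second epoch.

As a result, the progress bar shows `/Unknown` for the step total during epoch 1. The expected number of steps is known from the cell above.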

In [None]:
model = define_model()

history = model.fit(
    train_dataset,
    epochs=epochs,
    validation_data=valid_dataset,
    validation_steps = validation_steps
)
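
Early stopping is mentioned earlier but is not passed to `model.fit` above. The usual patience rule can be sketched in plain Python, assuming validation loss is the monitored quantity; `stopping_epoch` is a hypothetical helper and the loss values are invented. Keras offers `tf.keras.callbacks.EarlyStopping` (with `monitor`, `patience` and `min_delta` arguments) for the same purpose, supplied through the `callbacks` argument of `model.fit`.

```python
def stopping_epoch(val_losses, patience=10, min_delta=0.0):
    """Return the 1-based epoch at which training would stop, or None.

    Training stops once `patience` consecutive epochs fail to improve on
    the best validation loss seen so far by more than `min_delta`.
    """
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None


# Hypothetical validation losses: improvement stalls after epoch 3.
losses = [0.30, 0.25, 0.22, 0.22, 0.23, 0.22]
print(stopping_epoch(losses, patience=3))   # 6
print(stopping_epoch(losses, patience=10))  # None
```

Note that with `epochs = 10`, a patience of 10 could never trigger, since the first epoch always counts as an improvement; either more epochs or a smaller patience would be needed.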

# Model evaluation

In [None]:
_, test_auc = model.evaluate(test_dataset, verbose=0)

print('Test auc:', test_auc)

# Plot model loss and AUC by epochs
This gives an idea of model performance.

In [None]:
# plot loss
ax = plt.subplot(211)
plt.title('Cross Entropy Loss')
plt.plot(history.history['loss'], color='blue', label='train')
plt.plot(history.history['val_loss'], color='orange', label='validation')
plt.legend()
ax.axes.xaxis.set_visible(False)

# plot AUC
plt.subplot(212)
plt.title('AUC')
plt.plot(history.history['auc'], color='blue', label='train')
plt.plot(history.history['val_auc'], color='orange', label='validation')
plt.legend()

# Using model to predict on Test data

In [None]:
fitted_model = model.predict(test_dataset)

# Visualising predictions
The predictions are plotted here. The output is an array with values in $[0,1]$.

I have also added a "My interpretation" element, where I reverse the encoding by thresholding each probability:

$y_i =
\left\{
	\begin{array}{ll}
		1  & \mbox{if } \textrm{Pr}(x_i) > 0.5 \\
		0 & \mbox{if } \textrm{Pr}(x_i) \leq 0.5
	\end{array}
\right.$

where $y_i$ is the $i$th assigned feature prediction and $\textrm{Pr}(x_i)$ is the probability of the $i$th feature as predicted by the CNN.

If the resultant vector is the 0 vector, then this is No Finding.



In [None]:
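image_viz, label_viz = next(iter(test_dataset))

# test_dataset is reshuffled on every iteration, so predict on this exact batch
# to keep the predictions aligned with the images and labels shown below.
fitted_model = model.predict(image_viz)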

def show_batch(X, Y_act):
    plt.figure(figsize=(25, 30))
    for n in range(9):
        
        ax = plt.subplot(3, 3, n + 1)
        ax = plt.imshow(X[n])
        
        result = [x for i, x in enumerate(heads) if Y_act[n][i]]
        
        title = "+".join(result)
        
        if result == []: title = "No Finding"
        
        title = "Actual:\n" + title
        
        title += "\n\n Prediction:\n" + str(fitted_model[n]) + "\n\n My interpretation:\n"
        
        threshold = 0.5
        
        result = []
        for i, _ in enumerate(heads):
            if fitted_model[n][i] > threshold:
                result.append(1)
            else:
                result.append(0)
        
        result = np.asarray(result)

        if np.linalg.norm(result) == 0:
            title += "No Finding"
        else:
            result = [x for i, x in enumerate(heads) if result[i]]
            additional_title = "+".join(result)
            title += additional_title
            
        plt.title(title)
        plt.axis("off")

show_batch(image_viz.numpy(), label_viz.numpy())
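
The thresholding rule used for "My interpretation" can also be written as a small standalone function. This is a sketch with hypothetical condition names and made-up probabilities; `interpret` is not part of the original notebook:

```python
def interpret(probabilities, names, threshold=0.5):
    """Map per-condition probabilities to a readable label string."""
    predicted = [name for name, p in zip(names, probabilities) if p > threshold]
    return "+".join(predicted) if predicted else "No Finding"


# Hypothetical condition names and model outputs, for illustration only.
demo_heads = ["Atelectasis", "Cardiomegaly", "Effusion"]
print(interpret([0.81, 0.12, 0.64], demo_heads))  # Atelectasis+Effusion
print(interpret([0.20, 0.50, 0.03], demo_heads))  # No Finding
```

A probability of exactly 0.5 is treated as absent, matching the strict inequality in the rule above.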

In [None]:
end_time = timer.time()

time = end_time - start_time

day = time // (24 * 3600)
time = time % (24 * 3600)
hour = time // 3600
time %= 3600
minutes = time // 60
time %= 60
seconds = np.round(time,0)
print(f"Total code execution time: {day} days, {hour} hours, {minutes} minutes, {seconds} seconds")

# Future work
* Check the effect of image size on AUC.
* Check whether other metrics (e.g. Fbeta) are more informative.
* Check whether the learning rate ($\alpha$) decay used here improves the result compared with a constant rate.
* Adjust the decay steps and decay rate to see influence.
* Check if different learning scheduler for $\alpha$ improves the result.
* Apply transformations to enhance the training set (e.g. rotating and zooming/cropping images).
* See if Transfer learning is beneficial for this use case.
* See if there is any correlation between conditions present in dataset.

# References
1. [How to train a Keras model on TFRecord files](https://keras.io/examples/keras_recipes/tfrecord/)
2. [Basics of image classification with Keras](https://towardsdatascience.com/basics-of-image-classification-with-keras-43779a299c8b)
3. [Multi-Label Classification of Satellite Photos of the Amazon Rainforest](https://machinelearningmastery.com/how-to-develop-a-convolutional-neural-network-to-classify-satellite-photos-of-the-amazon-rainforest/)