**Problem:** Classification of chest X-rays for malpositioned lines and tubes in patients.

**Evaluation:** Mean AUROC scores.

## Imports

Let's also install `tensorflow_addons` so that we can train with a cyclical learning rate.


In [None]:
!pip3 install tensorflow_addons -q

In [None]:
# inbuilt imports
import os
import glob
import pathlib
import tempfile
import functools

# numeric imports
import numpy as np
import pandas as pd

# visual imports
import seaborn as sns
import matplotlib.pyplot as plt

# modeling imports
import tensorflow as tf
import tensorflow_addons as tfa

In [None]:
%matplotlib inline
sns.set()

## Load dataset and analyse

In [None]:
base_path = pathlib.Path('..')
path = pathlib.Path('../input/ranzcr-clip-catheter-line-classification')

### Set up constants and hyperparameters

Let's set the constants and hyperparameters in one place so that they are easy to modify.

In [None]:
IMG_SIZE = 224
BATCH_SIZE = 16

all_labels = ['ETT - Abnormal', 'ETT - Borderline', 'ETT - Normal', 
              'NGT - Abnormal', 'NGT - Borderline','NGT - Incompletely Imaged', 
              'NGT - Normal', 'CVC - Abnormal', 'CVC - Borderline', 'CVC - Normal', 
              'Swan Ganz Catheter Present']

In [None]:
train = pd.read_csv(path/'train.csv', low_memory=False)

train.head()

Hmm, I can see a PatientID column here. This rings some bells. Let's check the total number of examples and the number of unique patient IDs.

We can also see that a record can have several positive labels at once. That is, this is a multi-label classification problem.
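
A quick way to see this structure is to count the positive labels per row. The sketch below uses three made-up rows and a shortened label list (neither is taken from `train.csv`); with the real data the same count would come from the label columns of `train`.

```python
# Toy multi-hot rows; the values are invented for illustration only.
labels = ['ETT - Normal', 'NGT - Normal', 'CVC - Normal']
rows = [
    [1, 0, 1],  # two findings on one image
    [0, 0, 1],  # a single finding
    [1, 1, 1],  # three findings
]

# Number of positive labels per image.
positives_per_row = [sum(r) for r in rows]

# Names of the positive labels for each image.
names_per_row = [[l for l, v in zip(labels, r) if v == 1] for r in rows]

print(positives_per_row)
print(names_per_row)
```

Any row with more than one positive label is why the output layer later uses one sigmoid per label rather than a single softmax.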

### Further analysis

Let's check whether what we stated above is true.

In [None]:
len(train)

In [None]:
len(train.PatientID.unique())

In [None]:
grouped_df = train.set_index('PatientID')

grouped_df.head()

In [None]:
grouped_df.loc['ec89415d1'].sort_values(by=['StudyInstanceUID'])

As suspected, there are multiple records for the same patient. We need to handle this while splitting to prevent data leakage. If we split the records at random into training and validation sets, some records of a patient can end up in the validation set while others are in the training set. The model would then have seen that patient during training and could exploit patient-specific cues at validation time, so the validation score would look better than the model's real ability to generalise to unseen patients.

So instead we will group the data by patient and split the groups. This ensures that all the records of a patient end up in the same split. The split is not stratified by label, which is not ideal, but at least it addresses the leakage problem.
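
To make the idea concrete, here is a minimal sketch of a patient-level split on made-up records. The record tuples, the helper name `split_by_patient` and the seeded shuffle are all assumptions for illustration; the splitter used later in this notebook keeps the groups in their original order.

```python
import random

# Hypothetical (patient_id, study_id) records; not real data.
records = [
    ('p1', 's1'), ('p1', 's2'), ('p2', 's3'),
    ('p3', 's4'), ('p3', 's5'), ('p3', 's6'),
    ('p4', 's7'), ('p5', 's8'),
]

def split_by_patient(records, train_size=0.8, seed=42):
    """Split records so that each patient lands in exactly one subset."""
    patients = sorted({pid for pid, _ in records})
    random.Random(seed).shuffle(patients)
    cut = int(train_size * len(patients))
    train_ids = set(patients[:cut])
    train = [r for r in records if r[0] in train_ids]
    valid = [r for r in records if r[0] not in train_ids]
    return train, valid

train_recs, valid_recs = split_by_patient(records)

# No patient appears on both sides of the split.
overlap = {p for p, _ in train_recs} & {p for p, _ in valid_recs}
print(len(train_recs), len(valid_recs), overlap)
```

Note that `train_size` applies to the number of patients, not the number of records, so the record-level ratio can drift from 80/20 when some patients have many studies.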

## Data set-up and data pipeline creation

Now, let's add the image paths to our dataframe so that we can easily use `ImageDataGenerator`.

In [None]:
train['path'] = train['StudyInstanceUID'].map(lambda x: str(path/'train'/(x+'.jpg')))

print(train.loc[0, ['StudyInstanceUID', 'path']].values)

### Analyse the data per classes

In [None]:
fig = plt.figure(figsize=(6, 5))

data = train.loc[:, all_labels].sum().sort_values()
dist = pd.DataFrame({'Class': data.index, 'Pos': data.values}, 
                    columns=['Class', 'Pos'])

sns.barplot(data=dist, y='Class', x='Pos')
plt.xticks(rotation=45)
plt.show()

This shows that the data is highly imbalanced for every class. We will address this problem later in the notebook.

In [None]:
dist['Neg'] = len(train) - dist.Pos
dist.sort_values(by='Pos')

### Splitting Data

Remember that we previously talked about splitting the data by grouping it by patient ID. Now's the time to do so. Let's go.

In [None]:
grouped_df = train.groupby('PatientID')

train_list = [group for _, group in grouped_df]

In [None]:
def train_valid_splitter(d, train_size=0.8):
    n = len(d)
    trains = d[:int(train_size*n)]
    valids = d[int(train_size*n):]
    return trains, valids

train_split, valid_split = train_valid_splitter(train_list)
train_df = pd.concat(train_split, axis=0)
valid_df = pd.concat(valid_split, axis=0)

print(f'Train Size: {len(train_df)}, Valid Size: {len(valid_df)}')

In [None]:
train_df.head()

### Creating data generators

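Let's use the `ImageDataGenerator` class to build the input pipeline for the model. 

As for augmentations, we only flip images horizontally and rotate them slightly. Note that `rotation_range` is specified in degrees, so the value 0.40 (an arbitrarily chosen number) allows rotations of at most 0.4 degrees, not 0.4 radians.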
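
Since Keras documents `rotation_range` in degrees, it may help to see how far apart the two readings of 0.4 are. The snippet is only a unit conversion; the variable names are made up for illustration.

```python
import math

rotation_range = 0.40  # value passed to ImageDataGenerator

# Read as degrees, 0.40 corresponds to a very small angle in radians.
as_radians = math.radians(rotation_range)

# If 0.4 radians was the intent, the equivalent setting in degrees is:
intended_degrees = math.degrees(0.4)

print(f'{rotation_range} deg = {as_radians:.4f} rad')
print(f'0.4 rad = {intended_degrees:.1f} deg')
```

So a rotation of roughly 0.4 radians would need `rotation_range` to be set to about 23.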

In [None]:
train_generator = tf.keras.preprocessing.image.ImageDataGenerator(
    rescale=1./255,
    horizontal_flip=True,
    rotation_range=0.40
)

train_datagen = train_generator.flow_from_dataframe(
    dataframe=train_df,
    x_col='path',
    y_col=all_labels,
    target_size=(IMG_SIZE, IMG_SIZE),
    batch_size=BATCH_SIZE,
    seed=42,
    class_mode='raw'
)

In [None]:
valid_generator = tf.keras.preprocessing.image.ImageDataGenerator(rescale=1./255)

valid_datagen = valid_generator.flow_from_dataframe(
    dataframe=valid_df,
    x_col='path',
    y_col=all_labels,
    target_size=(IMG_SIZE, IMG_SIZE),
    batch_size=BATCH_SIZE,
    seed=42,
    class_mode='raw'
)

In [None]:
def display_batch(batch, n_imgs=9):
    r = int(n_imgs**0.5)
    fig, axs = plt.subplots(r, r, figsize=(12, 15))
    imgs, labels = batch[0], batch[1]
    for i, ax in zip(range(n_imgs), axs.flatten()):
        title = '\n'.join(list(np.array(all_labels)[labels[i].flatten()==1]))
        ax.imshow(imgs[i], cmap='bone')
        ax.set_title(title)
        ax.grid(False)
        ax.set_xticks([])
        ax.set_yticks([])
    plt.show()

In [None]:
display_batch(next(train_datagen))

## Model building

Let's start building the model. As a baseline, I first used a pretrained ResNet18, but its performance was not very good. So I moved up to a DenseNet121.

**Note: We have not addressed the data imbalance problem yet. The simplest way of dealing with it is to use weighted binary crossentropy.**
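
As a rough illustration of what weighted binary crossentropy means, here is a plain-Python sketch for a single label. The weighting scheme (`neg / pos`), the helper names and the counts are assumptions chosen for the example, not something this notebook trains with.

```python
import math

def class_weight(pos, neg):
    """Weight for positives so both outcomes contribute equally (assumed scheme)."""
    return neg / pos

def weighted_bce(y_true, y_pred, pos_weight, eps=1e-7):
    """Binary crossentropy for one label, with positive terms scaled by pos_weight."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)  # avoid log(0)
        total += -(pos_weight * t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y_true)

# Made-up counts for a rare class: 100 positives, 900 negatives.
w = class_weight(pos=100, neg=900)

y_true = [1, 0, 0, 0]
y_pred = [0.2, 0.1, 0.1, 0.1]

plain = weighted_bce(y_true, y_pred, pos_weight=1.0)
weighted = weighted_bce(y_true, y_pred, pos_weight=w)
print(w, plain, weighted)
```

With the weight applied, the missed positive dominates the loss, which is the effect we want for rare classes. In a real model the same idea would be applied per label, with one weight for each of the 11 classes.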

**Note: All the training code has been commented out and the logs have been pasted as markdown to avoid wasting compute during commit.**

In [None]:
# !pip3 install image-classifiers -q

In [None]:
# from classification_models.tfkeras import Classifiers

# ResNet18, proc_func = Classifiers.get('resnet18')

A pretrained ResNet18 is not available in `tf.keras.applications`, so we used the one provided by --. We have since moved to DenseNet121, so that code is commented out.

In [None]:
densenet = tf.keras.applications.DenseNet121(weights='imagenet', include_top=False, input_shape=(IMG_SIZE, IMG_SIZE, 3))
densenet.trainable = False

inputs = densenet.inputs
x = densenet(inputs)

x = tf.keras.layers.GlobalAveragePooling2D()(x)
outputs = tf.keras.layers.Dense(len(all_labels), activation='sigmoid')(x)
model = tf.keras.Model(inputs, outputs)

model.summary()

### Learning Rate

Let's build a learning rate finder to find a good learning rate for fine-tuning the model.
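
The two ingredients of such a finder are an exponential sweep of the learning rate and a smoothed loss curve. The sketch below shows both in isolation, without any model; the step count and the constant loss are made up for illustration.

```python
# Exponential learning-rate sweep: lr_k = start_lr * factor**k
start_lr, end_lr, n_steps = 1e-10, 1e1, 100
factor = (end_lr / start_lr) ** (1.0 / n_steps)
lrs = [start_lr * factor**k for k in range(n_steps + 1)]

def smooth(losses, beta=0.98):
    """Exponential moving average of the loss with bias correction."""
    avg, out = 0.0, []
    for i, loss in enumerate(losses, start=1):
        avg = beta * avg + (1 - beta) * loss
        out.append(avg / (1 - beta**i))
    return out

# Sanity check: a constant loss stays constant after bias correction.
smoothed = smooth([0.5] * 5)
print(lrs[0], lrs[-1], smoothed)
```

Without the division by `1 - beta**i`, the first smoothed values would be biased towards zero because the average starts at 0.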

In [None]:
class LRFinder:
    def __init__(self, model, lr_range=(1e-10, 1e1), beta=0.98, stop_factor=4):
        self.model = model
        self.lr_range = lr_range
        self.beta = beta
        self.stop_factor = stop_factor
        self.stop_training = False
        self.iterations = 0
        self.mvg_avg_loss = 0
        self.min_loss = 1e9
        self.lrs = []
        self.losses = []

    def _reset(self):
        self.stop_training = False
        self.iterations = 0
        self.mvg_avg_loss = 0
        self.min_loss = 1e9
        self.lrs = []
        self.losses = []

    def _scheduler(self, start_lr, end_lr, iterations):
        self.lr_factor = (end_lr / start_lr)**(1./iterations)

    def on_train_begin(self, logs=None):
        self._reset()

    def on_batch_end(self, batch, logs=None):
        self.iterations += 1

        lr = tf.keras.backend.get_value(self.model.optimizer.lr)
        self.lrs.append(lr)
        tf.keras.backend.set_value(self.model.optimizer.lr, lr*self.lr_factor)

        loss = logs['loss']
        self.mvg_avg_loss = (self.beta*self.mvg_avg_loss) + ((1-self.beta)*loss)
        smooth_loss = self.mvg_avg_loss / (1-(self.beta**self.iterations))
        self.losses.append(smooth_loss)

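        stop_loss = self.stop_factor * self.min_loss
        # stop once the smoothed loss exceeds stop_factor times the best loss so far
        if self.iterations > 1 and smooth_loss > stop_loss:
            self.stop_training = True

        # iterations is incremented at the top, so the first batch has iterations == 1
        if self.iterations == 1 or smooth_loss < self.min_loss:
            self.min_loss = smooth_loss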
#         print(f'\nIterations: {self.iterations}, lr: {lr}, loss: {smooth_loss}/{loss}, lrf: {self.lr_factor}')

    def on_epoch_end(self, epoch, logs=None):
        if self.stop_training:
            self.model.stop_training = True
            return

    def find(self, train_ds, epochs=None, steps_per_epoch=None, batch_size=32):
        if epochs is None:
            raise ValueError(f'Invalid value {epochs} for epochs')

        if steps_per_epoch is None:
            steps_per_epoch = len(train_ds)
            
        self._scheduler(self.lr_range[0], self.lr_range[1], steps_per_epoch*epochs)

        with tempfile.NamedTemporaryFile(prefix='init', suffix='.h5') as init_config:
            # save model config
            self.model.save_weights(init_config.name)
            init_lr = tf.keras.backend.get_value(self.model.optimizer.lr)
            
            tf.keras.backend.set_value(self.model.optimizer.lr, self.lr_range[0])

            lr_finder_cb = tf.keras.callbacks.LambdaCallback(
                on_train_begin= lambda logs: self.on_train_begin(logs),
                on_batch_end= lambda batch, logs: self.on_batch_end(batch, logs),
                on_epoch_end= lambda epoch, logs: self.on_epoch_end(epoch, logs)
            )

            self.model.fit(train_ds, epochs=epochs, steps_per_epoch=steps_per_epoch,
                           callbacks=[lr_finder_cb])

            # restore model config
            tf.keras.backend.set_value(self.model.optimizer.lr, init_lr)
            self.model.load_weights(init_config.name)

    def plot_loss(self, skip_begin=10, skip_end=1, title=""):
        lrs = self.lrs[skip_begin:-skip_end]
        losses = self.losses[skip_begin:-skip_end]
        plt.plot(lrs, losses)
        plt.xscale("log")
        plt.xlabel("Learning Rate (Log Scale)")
        plt.ylabel("Loss")
        if title:
            plt.title(title)

In [None]:
# model.compile(optimizer='adam',
#               loss='binary_crossentropy',
#               metrics=['binary_accuracy', tf.keras.metrics.AUC()])

# lr_finder = LRFinder(model)
# lr_finder.find(train_datagen, epochs=10)

In [None]:
# lr_finder.plot_loss()

![](https://storage.googleapis.com/kagglesdsdata/datasets/1225983/2068384/lrfinder.png?X-Goog-Algorithm=GOOG4-RSA-SHA256&X-Goog-Credential=databundle-worker-v2%40kaggle-161607.iam.gserviceaccount.com%2F20210329%2Fauto%2Fstorage%2Fgoog4_request&X-Goog-Date=20210329T145149Z&X-Goog-Expires=172799&X-Goog-SignedHeaders=host&X-Goog-Signature=3e445be1f2c3d9455e46b2e540cdb7db59199d602bbe54a7ef185d164b4dc86ae1beaafd9d3817bba456bba35ea0ae45e811653470a36d36018bd896fb299099e18d75bd1349a63097f816fe6ca89832bf2c47e13c8d1d039fa9fa6eea07aef69fc16dedc35f22c2e80c502e53231eddc0a3503a50b22c112524027de61c51eebb9a2c5ecc3255340d47a97f20ccaee668d571dcc96a094ffc4f4a0462f467a316465831e0091058773bb11e661e64a303d3426e4010b13a432eb3b70333d6f161a88db4cc22c921edd4dd46136400b8e2b4de878bbb36d02c280643d8e188c185ae0ff34e1c06f4ab639284cc489cae3b28b8a56e4647e0d450c123d49161b4)

For our CLR we can use a minimum learning rate of 1e-5 and a maximum learning rate of 1e-2.

Let's use the `CyclicalLearningRate` schedule provided by the `tensorflow_addons` library, with an initial learning rate of 1e-5, a maximum learning rate of 1e-2, and the 'triangular' CLR policy. We keep the step size at 2 epochs, so one full cycle takes 4 epochs.
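
The triangular policy can be sketched in a few lines of plain Python. This is my reading of the standard triangular CLR formula, not code taken from `tensorflow_addons`; it assumes the step size is counted in optimizer steps (batches), and `triangular_lr` is a made-up helper name. The value 1507 is the number of batches per epoch shown in the training logs.

```python
import math

def triangular_lr(step, initial_lr, max_lr, step_size):
    """Triangular cyclical learning rate at a given optimizer step."""
    cycle = math.floor(1 + step / (2 * step_size))
    x = abs(step / step_size - 2 * cycle + 1)
    return initial_lr + (max_lr - initial_lr) * max(0.0, 1 - x)

steps_per_epoch = 1507           # batches per epoch, as in the training logs
step_size = 2 * steps_per_epoch  # half a cycle = 2 epochs

for epoch in range(5):
    lr = triangular_lr(epoch * steps_per_epoch, 1e-5, 1e-2, step_size)
    print(f'start of epoch {epoch}: lr = {lr:.6f}')
```

The learning rate climbs linearly for 2 epochs, peaks at the maximum, and falls back to the initial value by the end of epoch 4.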

In [None]:
def scale_fn(x):  return 1.

In [None]:
# clr = tfa.optimizers.CyclicalLearningRate(
#     initial_learning_rate=1e-5,
#     maximal_learning_rate=1e-2,
#     scale_fn=scale_fn,
#     step_size=2*len(train_datagen),  # counted in optimizer steps: 2 epochs of batches
#     scale_mode='cyclic'
# )

To monitor training we will use 'binary_accuracy' and the AUROC score, since the competition itself is evaluated on mean AUROC.
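
For reference, mean AUROC is the average of the per-label AUROC values. The sketch below computes it in plain Python on made-up targets and scores; the helper names are hypothetical, and the pairwise implementation is only meant for small examples.

```python
def auroc(y_true, y_score):
    """AUROC as the probability that a positive outranks a negative (ties count half)."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def mean_auroc(Y_true, Y_score):
    """Average the per-label AUROC over the columns of a multi-label problem."""
    n_labels = len(Y_true[0])
    per_label = [
        auroc([row[j] for row in Y_true], [row[j] for row in Y_score])
        for j in range(n_labels)
    ]
    return sum(per_label) / n_labels

# Made-up targets and scores for 4 images and 2 labels.
Y_true = [[1, 0], [0, 1], [1, 1], [0, 0]]
Y_score = [[0.9, 0.2], [0.1, 0.8], [0.8, 0.3], [0.3, 0.4]]
score = mean_auroc(Y_true, Y_score)
print(score)
```

With its default settings, `tf.keras.metrics.AUC()` flattens all labels into one before computing the AUC, so the value it logs is not exactly this per-label mean. Passing `multi_label=True` should bring it closer to the competition metric.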

In [None]:
# model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=clr),
#               loss='binary_crossentropy',
#               metrics=['binary_accuracy', tf.keras.metrics.AUC()])

In [None]:
# model.fit(train_datagen, epochs=4, batch_size=BATCH_SIZE, 
#           steps_per_epoch=len(train_datagen), 
#           validation_data=valid_datagen,
#           validation_steps=len(valid_datagen))

### Logs for fine-tuning for 1 cycle (Frozen pre-trained model)

Epoch 1/4
1507/1507 [==============================] - 1215s 801ms/step - loss: 0.4272 - binary_accuracy: 0.7996 - auc_2: 0.7669 - val_loss: 0.2668 - val_binary_accuracy: 0.8886 - val_auc_2: 0.8993

Epoch 2/4
1507/1507 [==============================] - 1221s 811ms/step - loss: 0.2634 - binary_accuracy: 0.8909 - auc_2: 0.9037 - val_loss: 0.2560 - val_binary_accuracy: 0.8892 - val_auc_2: 0.9080

Epoch 3/4
1507/1507 [==============================] - 1217s 807ms/step - loss: 0.2555 - binary_accuracy: 0.8929 - auc_2: 0.9105 - val_loss: 0.2560 - val_binary_accuracy: 0.8916 - val_auc_2: 0.9098

Epoch 4/4
1507/1507 [==============================] - 1217s 808ms/step - loss: 0.2507 - binary_accuracy: 0.8938 - auc_2: 0.9145 - val_loss: 0.2523 - val_binary_accuracy: 0.8927 - val_auc_2: 0.9134


Now let's unfreeze the pretrained backbone and fine-tune the entire model.

In [None]:
densenet.trainable = True
model.summary()

In [None]:
# clr = tfa.optimizers.CyclicalLearningRate(
#     initial_learning_rate=1e-9,
#     maximal_learning_rate=1e-4,
#     scale_fn=scale_fn,
#     step_size=2*len(train_datagen),  # counted in optimizer steps: 2 epochs of batches
#     scale_mode='cyclic'
# )

In [None]:
# model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=clr),
#               loss='binary_crossentropy',
#               metrics=['binary_accuracy', tf.keras.metrics.AUC()])

In [None]:
# model.fit(train_datagen, epochs=4, batch_size=BATCH_SIZE, 
#           steps_per_epoch=len(train_datagen), 
#           validation_data=valid_datagen,
#           validation_steps=len(valid_datagen))

### Logs for fine-tuning for 1 cycle (Unfrozen pre-trained model)

Epoch 1/4
1507/1507 [==============================] - 1338s 878ms/step - loss: 0.3342 - binary_accuracy: 0.8664 - auc_3: 0.8492 - val_loss: 0.2725 - val_binary_accuracy: 0.8832 - val_auc_3: 0.8958

Epoch 2/4
1507/1507 [==============================] - 1297s 861ms/step - loss: 0.2648 - binary_accuracy: 0.8899 - auc_3: 0.9045 - val_loss: 0.2523 - val_binary_accuracy: 0.8907 - val_auc_3: 0.9123

Epoch 3/4
1507/1507 [==============================] - 1291s 857ms/step - loss: 0.2425 - binary_accuracy: 0.8967 - auc_3: 0.9216 - val_loss: 0.2401 - val_binary_accuracy: 0.8957 - val_auc_3: 0.9219

Epoch 4/4
1507/1507 [==============================] - 1283s 851ms/step - loss: 0.2243 - binary_accuracy: 0.9048 - auc_3: 0.9343 - val_loss: 0.2319 - val_binary_accuracy: 0.8941 - val_auc_3: 0.9277

In [None]:
# model.save('finetuned.h5')

In [None]:
model = tf.keras.models.load_model(base_path/'input'/'extras'/'ranzcr-pretrained.h5',
                                   custom_objects={'scale_fn': scale_fn})

In [None]:
test_files = {
    os.path.basename(f)[:-4] : f
    for f in glob.glob(str(path/'test'/'*.jpg'))
}

In [None]:
test_df = pd.DataFrame({'StudyInstanceUID': list(test_files.keys()), 'path': list(test_files.values())})

In [None]:
test_df.head()

In [None]:
def test_predictor(df, model):
    res = {
        l: []
        for l in all_labels
    }
    for f, p in df.values:
        img = tf.keras.preprocessing.image.load_img(p, target_size=(IMG_SIZE, IMG_SIZE))
        # same 1/255 rescaling as the training and validation generators
        img = np.expand_dims(np.asarray(img)/255, axis=0)
        pred = model.predict(img).flatten()
        for i in range(len(all_labels)):
            res[all_labels[i]].append(pred[i])
    return res

res = test_predictor(test_df, model)

In [None]:
r = pd.DataFrame(res)
submission = pd.concat([test_df['StudyInstanceUID'], r], axis=1)
submission.head()

In [None]:
submission.to_csv('submission.csv', index=False)

In [None]:
model.evaluate(valid_datagen)