# **My Notebook for RANZCR CLiP - Catheter and Line Position Challenge**

This notebook is my first attempt at an image classification competition using a CNN built with Keras. I hope it serves as a reference for myself and for beginners exploring image classification with CNNs.

DEBUG controls whether this run is in playground mode (the exploration stage) or is for competition submission.

In [None]:
DEBUG = False

## Step 1 - Import Packages and Modules

The first step is to import packages and modules that will be used in the rest of this notebook.

In [None]:
# Import packages and modules
from IPython.display import FileLink
from sklearn.model_selection import train_test_split
from tensorflow import keras
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau
from tensorflow.keras.layers import BatchNormalization, Conv2D, Dense, Dropout, Flatten, GlobalAveragePooling2D, Input, MaxPooling2D
from tensorflow.keras.models import Sequential, Model

import glob
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
import tensorflow as tf

## Step 2 - Read Input Files

The input files are stored at ../input/ranzcr-clip-catheter-line-classification, and they are:
1. train.csv: This stores image paths and labels for training
2. test: This folder stores the testing images
3. test_tfrecords: This folder stores the testing images in tfrecord format, but I will not be using these tfrecord files
4. train: This folder stores the training images
5. train_tfrecords: This folder stores the training images and labels in tfrecord format, but I will not be using these tfrecord files

The training images and labels are loaded as follows:
1. First, train.csv is imported into a Pandas DataFrame, from which the paths to the training images and the training labels are built.
2. During the exploration (debug) stage, the data from train.csv is divided into train, validation and test sets. When predictions are generated for submission, the test set is skipped.
3. Pipelines load the train, validation and test sets into TensorFlow datasets using from_tensor_slices(), with the image files being loaded through the map() function.

In [None]:
# Read train.csv which contains data pointing to the paths of training images and target labels
df_train = pd.read_csv('../input/ranzcr-clip-catheter-line-classification/train.csv')
df_train.head()

In [None]:
# Format paths to training images
df_train['path'] = '../input/ranzcr-clip-catheter-line-classification/train/' + df_train['StudyInstanceUID']+'.jpg'

# Define target labels
labels = ['ETT - Abnormal', 'ETT - Borderline', 'ETT - Normal', 
          'NGT - Abnormal', 'NGT - Borderline', 'NGT - Incompletely Imaged', 'NGT - Normal', 
          'CVC - Abnormal', 'CVC - Borderline', 'CVC - Normal', 
          'Swan Ganz Catheter Present']

In [None]:
# During "Debug" mode, proceed with a smaller train dataset to reduce runtime
if DEBUG:
    df_train = df_train.sample(n = df_train.shape[0] // 5).reset_index(drop = True)

In [None]:
for label in labels:
    print("#"*25)
    print(label)
    print(df_train[label].value_counts(normalize=True) * 100)

In [None]:
# Split training data into train and validation sets
X_train, X_valid = train_test_split(df_train, test_size = 0.1, stratify=np.argmax(df_train[labels].to_numpy(), axis=1))

if DEBUG:
    # Hold out part of the remaining training data as a test set
    X_train, X_test = train_test_split(X_train, test_size = 0.1, stratify=np.argmax(X_train[labels].to_numpy(), axis=1))
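
The splits above stratify on np.argmax() of the label matrix, which collapses each multi-label row into a single class index. The sketch below uses a made-up label matrix (toy_labels is hypothetical, not competition data) to show what that key looks like: argmax returns the position of the first maximum, so a row with several positive labels is represented by its first one, and a row with no positive label gets key 0.

```python
import numpy as np

# Hypothetical multi-hot label rows (4 labels per image), for illustration only
toy_labels = np.array([
    [0, 1, 0, 0],   # one positive label   -> key 1
    [0, 1, 0, 1],   # two positive labels  -> key 1 (first maximum wins)
    [0, 0, 0, 1],   # one positive label   -> key 3
    [0, 0, 0, 0],   # no positive label    -> key 0
])

# Same reduction as the stratify argument in the notebook
stratify_key = np.argmax(toy_labels, axis=1)
print(stratify_key)
```

Under this reading, the stratification is approximate: it balances the first positive label per image, not every label independently.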

In [None]:
for label in labels:
    print("#"*25)
    print(label)
    print(X_train[label].value_counts(normalize=True) * 100)

In [None]:
for label in labels:
    print("#"*25)
    print(label)
    print(X_valid[label].value_counts(normalize=True) * 100)

In [None]:
if DEBUG:
    for label in labels:
        print("#"*25)
        print(label)
        print(X_test[label].value_counts(normalize=True) * 100)

In [None]:
print(df_train.shape)
print(X_train.shape)
print(X_valid.shape)
if DEBUG:
    print(X_test.shape)

In [None]:
# Create training and validation tensorflow datasets
train_ds = tf.data.Dataset.from_tensor_slices((X_train.path.values, X_train[labels].values))
valid_ds = tf.data.Dataset.from_tensor_slices((X_valid.path.values, X_valid[labels].values))
if DEBUG:
    # Create test dataset
    test_ds  = tf.data.Dataset.from_tensor_slices((X_test.path.values, X_test[labels].values))

In [None]:
AUTOTUNE = tf.data.experimental.AUTOTUNE
batch_size = 32
target_size_dim = 224

# Mapping function for the training and validation datasets (and the test set)
def process_data(image_path, label):

    img = tf.io.read_file(image_path)
    img = tf.image.decode_jpeg(img, channels=3)
    img = tf.image.resize(img, [target_size_dim,target_size_dim])
        
    return img, label

def data_augment(image, label):
    image = tf.image.random_flip_left_right(image)
    image = tf.image.random_flip_up_down(image)
    image = tf.image.random_hue(image, 0.01)
    image = tf.image.random_saturation(image, 0.70, 1.30)
    image = tf.image.random_contrast(image, 0.80, 1.20)
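    # Note: after tf.image.resize the image is float32 on a 0-255 scale, so a brightness delta of 0.10 is a very small shift
    image = tf.image.random_brightness(image, 0.10)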
    return image, label

In [None]:
# Turn the training and validation datasets into batches (and the test dataset too)
train_ds_batch = train_ds.map(process_data, num_parallel_calls=AUTOTUNE).map(data_augment, num_parallel_calls=AUTOTUNE).shuffle(buffer_size=1024).repeat().batch(batch_size).prefetch(buffer_size=AUTOTUNE)
valid_ds_batch = valid_ds.map(process_data, num_parallel_calls=AUTOTUNE).batch(batch_size)
if DEBUG:
    test_ds_batch  = test_ds.map(process_data, num_parallel_calls=AUTOTUNE).batch(batch_size)

## Step 3 - Train a Model Using Transfer Learning

In [None]:
# Define the training model using transfer learning
def create_model():
    
    base_model = keras.applications.EfficientNetB0(weights='../input/keras-pretrained-models/EfficientNetB0_NoTop_ImageNet.h5', 
                                                   include_top=False,
                                                   drop_connect_rate=0.4)
    #base_model = keras.applications.NASNetMobile(weights='../input/keras-pretrained-models/NASNetMobile_NoTop_ImageNet.h5', include_top=False)
    base_model.trainable = True
    
    inputs = Input(shape=(target_size_dim, target_size_dim, 3)) 
    #a1     = data_augmentation(inputs)
    bm1    = base_model(inputs)
    avg1   = GlobalAveragePooling2D()(bm1)
    d1     = Dense(32, activation='relu')(avg1)
    x1     = Dropout(rate=0.4)(d1)
    d2     = Dense(32, activation='relu')(x1)
    x2     = Dropout(rate=0.4)(d2)
    d3     = Dense(32, activation='relu')(x2)
    x3     = Dropout(rate=0.4)(d3)
    predictions = Dense(len(labels), activation='sigmoid')(x3)
    
    model = Model(inputs=inputs, outputs=predictions)
    model.compile("adam", loss="binary_crossentropy", metrics=[tf.keras.metrics.AUC(multi_label=True)])
    
    return model
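
The model is compiled with AUC(multi_label=True), which treats each of the 11 labels as its own binary problem, computes an AUC per label and averages the results. The sketch below illustrates that idea in plain Python on made-up targets and predictions (y_true, y_pred and the helper pairwise_auc are hypothetical). Note that it computes an exact pairwise AUC, whereas Keras approximates the curve with a fixed set of thresholds, so the numbers will not match Keras exactly.

```python
def pairwise_auc(y_true, y_score):
    """Exact ROC AUC for one label: share of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(pos) * len(neg))

# Hypothetical targets and sigmoid outputs for 4 images and 2 labels
y_true = [[1, 0], [0, 1], [1, 1], [0, 0]]
y_pred = [[0.9, 0.2], [0.3, 0.4], [0.6, 0.7], [0.1, 0.5]]

# One AUC per label (column), then the unweighted mean across labels
per_label = []
for j in range(2):
    col_true = [row[j] for row in y_true]
    col_pred = [row[j] for row in y_pred]
    per_label.append(pairwise_auc(col_true, col_pred))

mean_auc = sum(per_label) / len(per_label)
print(per_label, mean_auc)
```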

In [None]:
# Create the training model and show its summary
model = create_model()
model.summary()

In [None]:
early_stopping = EarlyStopping(
    patience=5, # how many epochs to wait before stopping
    monitor='val_loss', 
    mode='min',
    restore_best_weights=True,
)

reduceLROnPlat = ReduceLROnPlateau(
    monitor='val_loss', 
    factor=0.8, 
    patience=2, 
    mode='auto', 
    cooldown=3,
    min_lr=0.00001
)
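
To get a feel for the ReduceLROnPlateau settings above, the sketch below counts how many reductions it would take to go from the starting learning rate down to min_lr. It assumes the starting rate is 0.001, the Keras default for Adam, since the model is compiled with the string "adam"; the variable names are only for illustration.

```python
# Assumption: Adam's default learning rate in Keras (0.001)
initial_lr = 1e-3
factor = 0.8      # same value as in ReduceLROnPlateau above
min_lr = 1e-5     # same value as in ReduceLROnPlateau above

lr = initial_lr
reductions = 0
schedule = [lr]
while lr > min_lr:
    # Each reduction multiplies the rate by factor, floored at min_lr
    lr = max(lr * factor, min_lr)
    reductions += 1
    schedule.append(lr)

print("reductions needed to reach min_lr:", reductions)
print("first few learning rates:", [round(v, 6) for v in schedule[:4]])
```

With factor=0.8 the decay is gentle, and since patience=2 and cooldown=3 only allow a reduction every few epochs, a 30-epoch run would probably stay well above min_lr.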

In [None]:
history = model.fit(train_ds_batch, 
                    validation_data = valid_ds_batch, 
                    epochs = 30, 
                    steps_per_epoch = len(X_train) // batch_size,
                    callbacks = [early_stopping, reduceLROnPlat]
                   )
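
Because train_ds_batch calls repeat(), the training dataset never runs out, so fit() relies on steps_per_epoch to decide where an epoch ends. The arithmetic can be sketched as follows; n_train is a made-up row count for illustration, not the real size of X_train.

```python
# Hypothetical training-set size, for illustration only
n_train = 1000
batch_size = 32   # same value as in the notebook

steps_per_epoch = n_train // batch_size           # floor division, as in model.fit above
images_per_epoch = steps_per_epoch * batch_size   # images actually consumed in one epoch
leftover = n_train - images_per_epoch             # images that do not fit into a full batch

print(steps_per_epoch, images_per_epoch, leftover)
```

The leftover images are not dropped: since the dataset repeats, they should simply end up in the first batches of the following epoch.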

In [None]:
if DEBUG:
    model.evaluate(test_ds_batch)

## Step 4 - Import the Competition Test Dataset, Predict and Generate the Submission File

In [None]:
def process_data_competition(image_path):
    # load the raw data from the file as a string
    img = tf.io.read_file(image_path)
    img = tf.image.decode_jpeg(img, channels=3)
    img = tf.image.resize(img, [target_size_dim,target_size_dim])
    #img = tf.keras.applications.mobilenet.preprocess_input(img)
    
    return img

In [None]:
if not DEBUG: 
    comp_images = glob.glob('../input/ranzcr-clip-catheter-line-classification/test/*.jpg')

    df_comp = pd.DataFrame(np.array(comp_images), columns=['Path'])
    print(df_comp.head())  # a bare expression inside an if block is not displayed, so print it

    comp_ds = tf.data.Dataset.from_tensor_slices(df_comp.Path.values)
    comp_ds_batch = comp_ds.map(process_data_competition, num_parallel_calls=AUTOTUNE).batch(batch_size)

    pred_y = model.predict(comp_ds_batch, verbose=1)
    #pred_y = np.array([[1 if i > 0.5 else 0 for i in j] for j in pred_y])

    df_result = pd.DataFrame(pred_y, columns = labels)
    df_result['StudyInstanceUID'] = df_comp.Path.str.split('/').str[-1].str[:-4]
    print(df_result.head())  # a bare expression inside an if block is not displayed, so print it

    # Put the ID column first, followed by the label columns in their defined order
    cols_reordered = ['StudyInstanceUID'] + labels

    df_result = df_result[cols_reordered]

    df_result.to_csv('submission.csv', index=False)
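
The StudyInstanceUID column above is recovered from each file path by taking the last path component and dropping the final four characters (".jpg"). The sketch below shows that step on a single made-up path (the UID in it is hypothetical) and compares it with an os.path-based alternative that does not depend on the extension being exactly four characters long.

```python
import os

# Hypothetical path in the same layout as the competition test folder
path = '../input/ranzcr-clip-catheter-line-classification/test/1.2.826.0.1.3680043.8.498.123.jpg'

# Same approach as the notebook: last path component, minus the final 4 characters
uid_slice = path.split('/')[-1][:-4]

# Alternative: let os.path strip the directory and the extension
uid_ospath = os.path.splitext(os.path.basename(path))[0]

print(uid_slice)
print(uid_slice == uid_ospath)
```

Both give the same result here because splitext only splits at the last dot, so the dots inside the UID are left alone.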

## References
* EfficientNetB3 tf2/Keras Baseline (https://www.kaggle.com/harveenchadha/efficientnetb3-tf2-keras-baseline)
* RANZCR for Machine Learning Beginners, in Japanese (https://www.kaggle.com/tomohiroh/ranzcr)