# 3D ConvNet with 3D Augmentations

This is a kernel I was playing around with to see if a 3D convolutional network could produce useful features from scan images preprocessed to the same size using code as published in my previous [kernel](https://www.kaggle.com/donkeys/preprocessing-images-to-normalize-colors-and-sizes). To recap, this includes:

- resizing all images to same size (width, height)
- rescaling the z-axis of all 3D arrays to the same depth.
- removing boundaries from images where present.

Since there are not that many images to start with, I also added augmentations to the 3D images, according to my earlier [kernel](https://www.kaggle.com/donkeys/ct-slices-basic-eda):

- 3D gaussian blur
- 3D flips on x- and y-axis
- 3D rotation on x-axis
- 3D shift on x- and y-axis
- 3D zoom on x- and y-axis

The 3D in the above list simply refers to applying the augmentations/transformations on the whole z-axis of the 3D numpy array at once.
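
As a minimal sketch of what "on the whole z-axis at once" means, assuming the volumes are stored as (depth, height, width) arrays as the shapes later in this kernel suggest: a single slicing operation mirrors every slice in the stack. The toy array is made up for illustration.

```python
import numpy as np

# Toy volume: 2 slices (z), each 2x3 pixels (y, x). Values are made up.
volume = np.arange(12).reshape(2, 2, 3)

# Reversing the last axis mirrors every slice left-to-right in one operation.
flipped = volume[:, :, ::-1]

# The same result built slice by slice, to show the two are equivalent.
per_slice = np.stack([s[:, ::-1] for s in volume])

print(np.array_equal(flipped, per_slice))
```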

One point of this kernel was also to allow faster and separate iterations of trying to build a 3D CNN to use as part of other models (e.g., with the tabular dataset).

The accuracy of the model in this kernel is low, but it could provide some useful building blocks for other kernels to build on. Or maybe someone will spot some errors and let me know..

In [None]:
import numpy as np
import pandas as pd

import cv2

import scipy
from scipy import ndimage
from scipy.ndimage import gaussian_filter
from scipy.ndimage import zoom
import random

from tqdm.auto import tqdm
tqdm.pandas()

from skimage.io import imread
from skimage.transform import resize
from tensorflow.keras.utils import Sequence
import math

import PIL
from PIL import Image, ImageOps
import matplotlib.pylab as plt

from sklearn.model_selection import StratifiedKFold, GroupKFold, GroupShuffleSplit # Used to use Kfold to train our model
from sklearn.model_selection import train_test_split

import tensorflow
import tensorflow as tf

from tensorflow import keras
from tensorflow.keras.datasets import mnist
from tensorflow.keras.layers import Dense, Flatten, Input
from tensorflow.keras.layers import Conv2D, MaxPooling2D, BatchNormalization, Dropout, Activation
from tensorflow.keras.layers import Conv3D, MaxPooling3D
from tensorflow.keras import layers
from tensorflow.keras import backend as K # The backend give us access to tensorflow operations and allow us to create the Attention class
#from tensorflow.keras.models import Sequential
import matplotlib.pylab as plt

from tensorflow.keras.callbacks import (ModelCheckpoint, LearningRateScheduler,
                             EarlyStopping, ReduceLROnPlateau,CSVLogger)
                             
from sklearn.model_selection import train_test_split


# A Few Configuration Variables

In [None]:
batch_size = 8
epochs = 30
#how many splits to do on the training data, or how many cross-validation rounds to use
N_SPLITS = 5
#my_test_pct is a percentage of values left out of test/validation data to compare the final model against known results
my_test_pct = 0.05

#there are some different sizes of images in my rescaled dataset, so using some of those here
#because the 3D arrays are quite large, they tend to take memory and this (192) was a size that did not cause out of memory errors
img_size = 192
#img_size = 256
#img_size = 512
img_depth = 30


# Brief Overview of the Data

The preprocessed data is in a [dataset](https://www.kaggle.com/donkeys/osic-pulmonary-fibrosispreprocessed) I previously uploaded. This is where it mounts:

In [None]:
!ls /kaggle/input/osic-pulmonary-fibrosispreprocessed

In this case I used a continuous value for the smoking feature, although you could argue it is categorical: 0 for non-smoking, 0.5 for used to smoke, and 1 for currently smoking.
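
The preprocessed CSV already contains the encoded value, so the following is only a sketch of how such an encoding could be produced. The raw category labels in `smoke_map` are an assumption and should be checked against `df_train_orig["SmokingStatus"].unique()`.

```python
import pandas as pd

# Assumed raw category labels; verify them against the original train.csv.
smoke_map = {"Never smoked": 0.0, "Ex-smoker": 0.5, "Currently smokes": 1.0}

raw = pd.Series(["Ex-smoker", "Never smoked", "Currently smokes"])
encoded = raw.map(smoke_map)
print(encoded.tolist())
```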

In [None]:
df_train_orig = pd.read_csv("/kaggle/input/osic-pulmonary-fibrosis-progression/train.csv")
df_train = pd.read_csv("/kaggle/input/osic-pulmonary-fibrosispreprocessed/dataset/df_train_scaled_continous_smoke.csv").drop("Unnamed: 0", axis=1)
df_train.head()

In [None]:
df_train["SmokingStatus"].unique()

In [None]:
df_test = pd.read_csv("/kaggle/input/osic-pulmonary-fibrosis-progression/test.csv")
df_test.head()

In [None]:
DATA_DIR = "/kaggle/input/osic-pulmonary-fibrosispreprocessed/dataset"

In [None]:
df_train[df_train["Patient"] == "ID00126637202218610655908"]

List of patients in the training set. This can be useful, for example, to make patient-grouped data-splits.

In [None]:
patient_ids = df_train["Patient"].unique()
patient_ids.shape

In [None]:
patient_id = "ID00047637202184938901501"
patient_fvc = df_train[df_train["Patient"] == patient_id]
patient_fvc

In [None]:
#this is the original data from kaggle competition, no preprocessing done
!ls /kaggle/input/osic-pulmonary-fibrosis-progression

In [None]:
df_train.head()

## Create Train and Test Rows

This is just a little something I ended up with to format my data after playing with various approaches. Maybe not really necessary, but it works for this kernel. I really just need the patient ID for accessing the image in this kernel; the rest are just something I used in other kernels.

In [None]:
patient_ids = df_train["Patient"].unique()
training_rows = []
patient_count = 0
for patient_id in tqdm(patient_ids):
    df_patient = df_train[df_train["Patient"] == patient_id]
    patient_row_count = 0
    row = df_patient.iloc[0]
    row_fvc = row["FVC"]
    patient_row_count += 1
    training_row = {}
    training_row["patient_id"] = row["Patient"]
    training_row["base_week"] = row["Weeks"]
    training_row["pct"] = row["Percent"]
    training_row["age"] = row["Age"]
    training_row["gender_female"] = row["Sex_Female"]
    training_row["gender_male"] = row["Sex_Male"]
    training_row["smoking_status"] = row["SmokingStatus"]
    training_row["target_fvc"] = row["fvc_raw"]
    training_rows.append(training_row)
    patient_count += 1
print(f"processed {patient_count} patients")


In [None]:
df_test = pd.read_csv(f"{DATA_DIR}/df_test_scaled_continous_smoke.csv")
df_test.head()

In [None]:
patient_ids = df_test["Patient"].unique()
test_rows = []
patient_count = 0
for patient_id in tqdm(patient_ids):
    df_patient = df_test[df_test["Patient"] == patient_id]
    patient_row_count = 0
    for idx, row in df_patient.iterrows():
        row_fvc = row["FVC"]
        patient_row_count += 1
        test_row = {}
        test_row["patient_id"] = row["Patient"]
        test_row["base_fvc"] = row_fvc
        test_row["base_week"] = row["Weeks"]
        test_row["pct"] = row["Percent"]
        test_row["age"] = row["Age"]
        test_row["gender_female"] = row["Sex_Female"]
        test_row["gender_male"] = row["Sex_Male"]
        test_row["smoking_status"] = row["SmokingStatus"]
        test_rows.append(test_row)
    print(f"created {patient_row_count} instances for patient {patient_id}")
    patient_count += 1
print(f"processed {patient_count} patients")
    #break

In [None]:
df_new_train = pd.DataFrame(training_rows)
df_new_train


In [None]:
df_new_train.columns

In [None]:
df_train[df_train["Patient"] == "ID00007637202177411956430"]

I have scaled the columns to range 0..1, except the column now named *target_fvc*. I use this column as the target variable to predict here. It is the same as the base fvc given for each patient. Just renamed target here, as it is the prediction target..

In [None]:
#the only thing this kernel uses from the x_cols is actually the patient id, as that can be used to find the filename of the image.
#some other kernels I published use the other columns as well, which is why they are there
x_cols = [col for col in df_new_train.columns if col != "target_fvc"]
df_x = df_new_train[x_cols]
df_y = df_new_train["target_fvc"]


In [None]:
df_x.head()

In [None]:
df_y.head()

# Custom Keras Sequence Generator

Here I define a Keras Sequence generator to build the augmented dataset on the fly.

First, a small utility function to shuffle x and y at the end of each training epoch:

In [None]:
def unison_shuffled_copies(a, b):
    assert len(a) == len(b)
    p = np.random.permutation(len(a))
    new_a = a.iloc[p]
    new_b = b.iloc[p]
    return new_a, new_b
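
A quick check of the idea, as a sketch on made-up data: because the same permutation indexes both objects, each feature row stays paired with its target after shuffling.

```python
import numpy as np
import pandas as pd

def unison_shuffled_copies(a, b):
    # The same permutation is applied to both, so rows stay paired.
    assert len(a) == len(b)
    p = np.random.permutation(len(a))
    return a.iloc[p], b.iloc[p]

# Made-up data: the target is 10x the feature, which makes pairing easy to verify.
x = pd.DataFrame({"feature": [1, 2, 3, 4]})
y = pd.Series([10, 20, 30, 40])

new_x, new_y = unison_shuffled_copies(x, y)
print((new_x["feature"].values * 10 == new_y.values).all())
```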

A custom Keras generator sequence to produce batches of 3D augmented images:

In [None]:
class MySequence3D(Sequence):

    def __init__(self, x_set, y_set, batch_size, mode="train", augment=True):
        self.x, self.y = x_set, y_set
        self.batch_size = batch_size
        self.mode = mode
        self.max_idx = math.ceil(len(x_set)/batch_size)
        self.augment = augment

    def __len__(self):
        #TODO: check is correct
        return int(np.ceil(len(self.x) / float(self.batch_size)))

    def __getitem__(self, idx):
        idx2 = idx % self.max_idx
        start = idx2 * self.batch_size
        end = min(start + self.batch_size, len(self.x))
        batch_x = self.x.iloc[start : end]
        batch_y = self.y.iloc[start : end]
        
        next_batch_img = []
        next_batch_num = []
        for index, row in batch_x.iterrows():
            #print(row)
            #nums = [row[col] for col in dense_cols]
            nums = []
            nums = np.array(nums)
            file_name = row["patient_id"]
            #file_path = row["path"]
            augmented = img_augment_3d(self.mode, self.x, file_name, self.augment)
            next_batch_num.append(nums)
            next_batch_img.append(augmented)
        np_y = np.array(batch_y)
        np_x_img = np.array(next_batch_img)
        np_x_num = np.array(next_batch_num)
        del next_batch_img
        del next_batch_num
        del batch_y
        #print(f"loaded shape: {np_x.shape}, batch={idx}")
#        result = [np_x_img, np_x_num], np_y
        result = [np_x_img], np_y

        #print(f"shapes: {result[0].shape}, {result[1].shape}")
        return result

    def on_epoch_end(self):
        self.x, self.y = unison_shuffled_copies(self.x, self.y)
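
The batch arithmetic in `__len__` and `__getitem__` can be checked in isolation. This is a small sketch with a hypothetical helper `batch_bounds` and made-up sizes; it shows that rounding up gives one extra, shorter batch when the dataset size is not a multiple of the batch size.

```python
import math

def batch_bounds(n_items, batch_size):
    """Return (start, end) row ranges, one per batch; the last may be shorter."""
    n_batches = math.ceil(n_items / batch_size)
    return [(i * batch_size, min((i + 1) * batch_size, n_items))
            for i in range(n_batches)]

# Hypothetical numbers: 20 patients in batches of 8.
print(batch_bounds(20, 8))  # [(0, 8), (8, 16), (16, 20)]
```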

# 3D Augmentation Helpers

A number of functions to perform augmentations, one per function, on given 3D images. Most of them start with a random check on whether the augmentation is applied at all, so that not every augmentation hits every image, followed by the augmentation itself.

### Rotate

First, a helper to rotate a given 3D image by a random angle between given min and max angle:

In [None]:
#https://stackoverflow.com/questions/43922198/how-to-rotate-a-3d-image-by-a-random-angle-in-python
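def random_rotation_3d(img, min_angle, max_angle):
    """ Randomly rotate an image by a random angle in [min_angle, max_angle], in either direction.

    Arguments:
    img: 3D numpy array.
    min_angle: `float`. The minimum rotation angle in degrees.
    max_angle: `float`. The maximum rotation angle in degrees.

    Returns:
    rotated 3D image
    """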
    if random.randint(1,100) > 30:
        #with some chance, do not rotate at all
        return img
    img_rot = np.zeros(img.shape)
    angle = random.uniform(min_angle, max_angle)
    if random.randint(1,100) > 50:
        #in half the cases, rotate left. in other half, rotate right.
        angle *= -1
        # Following lines would rotate on z and y axis as well, but not using them in this kernel
#        # rotate along z-axis
#        image2 = scipy.ndimage.interpolation.rotate(image1, angle, mode='nearest', axes=(0, 1), reshape=False)
#        # rotate along y-axis
#        image3 = scipy.ndimage.interpolation.rotate(image2, angle, mode='nearest', axes=(0, 2), reshape=False)

    # rotate in the plane spanned by axes 1 and 2, i.e. every slice along axis 0 gets the same rotation
    img_rot = ndimage.rotate(img, angle, mode='nearest', axes=(1, 2), reshape=False)
    return img_rot.reshape(img.shape)
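
A small sketch of what the `axes=(1, 2)` and `reshape=False` arguments do, using a made-up random volume and a fixed 5 degree angle: the output keeps the input shape, and each slice ends up the same as if it had been rotated on its own.

```python
import numpy as np
from scipy import ndimage

rng = np.random.default_rng(0)
volume = rng.random((3, 16, 16))  # made-up (depth, height, width) volume

# axes=(1, 2) selects the height/width plane; reshape=False keeps the shape.
rotated = ndimage.rotate(volume, 5.0, axes=(1, 2), reshape=False, mode="nearest")

# Rotating one slice on its own gives the same pixels as that slice of the rotated volume.
single = ndimage.rotate(volume[0], 5.0, reshape=False, mode="nearest")

print(rotated.shape, np.allclose(rotated[0], single))
```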

### Gaussian Blur

A helper to do a gaussian blur on a 3D array (of image pixels):

In [None]:
#https://stackoverflow.com/questions/29920114/how-to-gauss-filter-blur-a-floating-point-numpy-array
def gaussian_blur_3d(img):
    if random.randint(1,100) > 15:
        return img
    sigma = random.uniform(0.1,0.9)
    blurred = gaussian_filter(img, sigma=sigma)
    return blurred
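
One thing worth knowing here: a scalar `sigma` blurs along every axis, including depth, so neighbouring slices bleed into each other. The sketch below, on a made-up two-slice volume, contrasts that with a per-axis `sigma` that keeps slices independent. Which behaviour is preferable for CT volumes is a judgment call; this only illustrates the difference.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

# Made-up volume: one bright pixel in slice 0, nothing in slice 1.
volume = np.zeros((2, 5, 5))
volume[0, 2, 2] = 1.0

# A scalar sigma is applied along every axis, depth included.
blur_all_axes = gaussian_filter(volume, sigma=0.8)

# A per-axis sigma of 0 for depth keeps each slice independent.
blur_in_plane = gaussian_filter(volume, sigma=(0, 0.8, 0.8))

print(blur_all_axes[1].sum() > 0)   # brightness leaked into slice 1
print(blur_in_plane[1].sum() == 0)  # slice 1 untouched
```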


### Horizontal Flip

Flip an entire 3D array along x-axis:

In [None]:
#https://stackoverflow.com/questions/7416170/numpy-reverse-multidimensional-array
def x_flip(img):
    if random.randint(1,100) > 50:
        flipped = img[:, :, ::-1]
    else:
        flipped = img
    return flipped

### Vertical Flip

Flip an entire 3D array along y-axis (some images seem to be upside down as noted in my [previous kernel](https://www.kaggle.com/donkeys/ct-slices-basic-eda)):

In [None]:
def y_flip(img):
    if random.randint(1,100) > 70:
        flipped = img[:, ::-1, :]
    else:
        flipped = img
    return flipped

### Shift X-axis

Shift a 3D array on x-axis by random number of pixels between given min and max. Left or right, random choice of direction.

In [None]:
def x_shift(img, min_shift, max_shift):
    if random.randint(1,100) > 30:
        return img
    shift_dir = 1
    if random.randint(1,100) > 50:
        shift_dir = -1
    roll_amount = random.randint(min_shift, max_shift)
    roll_amount *= shift_dir
    img = np.roll(img, roll_amount, axis=2)
    #z,y,x?
    if shift_dir > 0:
        img[:, :, 0:roll_amount] = 0
    else:
        img[:, :, roll_amount:] = 0
#    print(img)
    return img
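
The roll-then-blank logic is easier to see on a tiny made-up array. `np.roll` wraps pixels around the edge, so the wrapped columns are zeroed afterwards to turn the roll into a real shift:

```python
import numpy as np

# Made-up volume: 1 slice, 1 row, 5 columns, so the shift is easy to read.
volume = np.array([[[1, 2, 3, 4, 5]]])

# Shift right by 2: roll, then blank the 2 columns that wrapped to the left edge.
right = np.roll(volume, 2, axis=2)
right[:, :, 0:2] = 0

# Shift left by 2: roll by -2, then blank the last 2 columns.
left = np.roll(volume, -2, axis=2)
left[:, :, -2:] = 0

print(right[0, 0].tolist())  # [0, 0, 1, 2, 3]
print(left[0, 0].tolist())   # [3, 4, 5, 0, 0]
```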
    

### Shift Y-axis

Shift a 3D array on y-axis by random number of pixels between given min and max. Up or down, random choice of direction.

In [None]:
def y_shift(img, min_shift, max_shift):
    if random.randint(1,100) > 30:
        return img
    shift_dir = 1
    if random.randint(1,100) > 50:
        shift_dir = -1
    roll_amount = random.randint(min_shift, max_shift)
    roll_amount *= shift_dir
    img = np.roll(img, roll_amount, axis=1)
    #z,y,x?
    if shift_dir > 0:
        img[:, 0:roll_amount, :] = 0
    else:
        img[:, roll_amount:, :] = 0
#    print(img)
    return img
    

### Zoom

Zoom the 3D image on the x- and y-axis by a random zoom factor between given min and max (the same factor for both axes):

In [None]:
#https://stackoverflow.com/questions/37119071/scipy-rotate-and-zoom-an-image-without-changing-its-dimensions
def zoom_xy(img, min_zoom, max_zoom):
    if random.randint(1,100) > 20:
        return img
    zoom_factor = random.uniform(min_zoom, max_zoom)
    h, w = img_size, img_size

    # The first array axis is depth (z), which we do not want to zoom, so its
    # factor is 1. Height and width share the same zoom factor.
    zoom_tuple = (1, zoom_factor, zoom_factor)

    # Zooming out
    if zoom_factor < 1:

        # Bounding box of the zoomed-out image within the output array
        zh = int(np.round(h * zoom_factor))
        zw = int(np.round(w * zoom_factor))
        top = (h - zh) // 2
        left = (w - zw) // 2

        # Zero-padding
        out = np.zeros_like(img)
        zoomed_img = zoom(img, zoom_tuple, order=0)
        #print(f"zoomed shape: {zoomed_img.shape}")
        #print(f"out shape:{out.shape}")
        #print(f"w:{w},h:{h},l:{left},t:{top},zw:{zw}, zh:{zh}")
        out[:, top:top+zh, left:left+zw] = zoomed_img

    # Zooming in
    elif zoom_factor > 1:

        # Bounding box of the zoomed-in region within the input array
        zh = int(np.ceil(h / zoom_factor))
        zw = int(np.ceil(w / zoom_factor))
        top = (h - zh) // 2
        left = (w - zw) // 2

        #out_template = np.zeros_like(img)
        out = zoom(img[:, top:top+zh, left:left+zw], zoom_tuple, order=0)
        #print(f"out shape:{out.shape}")
        #print(f"w:{w},h:{h},l:{left},t:{top},zw:{zw}, zh:{zh}")

        # `out` might still be slightly larger than `img` due to rounding, so
        # trim off any extra pixels at the edges
        trim_top = ((out.shape[1] - h) // 2)
        trim_left = ((out.shape[2] - w) // 2)
        #print(f"out shape before:{out.shape}")
        out = out[:, trim_top:trim_top+h, trim_left:trim_left+w]
        #print(f"out shape after:{out.shape}")
        #print(f"w:{w},h:{h},l:{left},trimtop:{trim_top},trimleft:{trim_left}")

    # If zoom_factor == 1, just return the input array
    else:
        out = img
    #print(out.shape)
    return out
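
The zoom-out branch can be sketched on a small made-up volume: shrink height and width, leave depth alone, then centre the result in a zero array of the original size. The 20x20 size and the 0.9 factor are arbitrary choices for illustration.

```python
import numpy as np
from scipy.ndimage import zoom

volume = np.ones((4, 20, 20))  # made-up (depth, height, width) volume
factor = 0.9

# Zoom factor 1 on the depth axis leaves the number of slices unchanged.
small = zoom(volume, (1, factor, factor), order=0)

# Centre the shrunken slices in a zero array of the original size.
out = np.zeros_like(volume)
top = (volume.shape[1] - small.shape[1]) // 2
left = (volume.shape[2] - small.shape[2]) // 2
out[:, top:top + small.shape[1], left:left + small.shape[2]] = small

print(small.shape, out.shape)
```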

### Failed Attempt at Zoom

This was an attempt at using CV2 library for zooming, keeping it here for posterity:

In [None]:
#https://stackoverflow.com/questions/37119071/scipy-rotate-and-zoom-an-image-without-changing-its-dimensions
#open cv does not seem to support 3d image resizing so cannot do that..
def cv2_zoom_xy(img, min_zoom, max_zoom):
    if random.randint(1,100) > 20:
        return img
    zoom_factor = random.uniform(min_zoom, max_zoom)
    zoom_factor = 2.0
    print(f"zf:{zoom_factor}")
    height, width = img_size, img_size
    new_height, new_width = int(height * zoom_factor), int(width * zoom_factor)

    ### Crop only the part that will remain in the result (more efficient)
    # Centered bbox of the final desired size in resized (larger/smaller) image coordinates
    y1, x1 = max(0, new_height - height) // 2, max(0, new_width - width) // 2
    y2, x2 = y1 + height, x1 + width
    bbox = np.array([y1,x1,y2,x2])
    # Map back to original image coordinates
    bbox = (bbox / zoom_factor).astype(np.int)
    y1, x1, y2, x2 = bbox
    cropped_img = img[:, y1:y2, x1:x2]

    # Handle padding when downscaling
    resize_height, resize_width = min(new_height, height), min(new_width, width)
    pad_height1, pad_width1 = (height - resize_height) // 2, (width - resize_width) //2
    pad_height2, pad_width2 = (height - resize_height) - pad_height1, (width - resize_width) - pad_width1
    pad_spec = [(0,0), (pad_height1, pad_height2), (pad_width1, pad_width2)] + [(0,0)] * (img.ndim - 2)

    result = cv2.resize(cropped_img, (img.shape[0], resize_width, resize_height))
    result = np.pad(result, pad_spec, mode='constant')
    assert result.shape[0] == height and result.shape[1] == width
    return result

### The One Function to Bind them Together

The actual augmentation function that aggregates all of the above into one:

In [None]:
def img_augment_3d(df_name, df, patient_id, do_augment):
    filename = f"{DATA_DIR}/scaled_png/{df_name}_{img_depth}/{patient_id}/full_3d_{img_size}.npy"
    img = np.load(filename)
    img = img / 255.0
    if do_augment:
        #give it 15% chance of not doing any augmentation
        if random.randint(1,100) > 15:
            #you can just comment all but one of below calls to just see effects of a single one later
            #in which case, you might want to modify above if the > 0 to make it always true and see changes
            img = gaussian_blur_3d(img)
            img = x_flip(img)
            img = y_flip(img)
            img = random_rotation_3d(img, 1, 5)
            img = x_shift(img, 5, 15)
            img = y_shift(img, 5, 15)
            img = zoom_xy(img, 0.9, 1.1)
            #pass
    return img
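
Since each helper has its own random gate and the function above adds an outer one, the effective chance of any single augmentation is the product of the two. The following is just arithmetic on the thresholds in the code above (ignoring the negligible case of a zoom factor of exactly 1):

```python
# Chance that each helper changes the image once it is called, read off the
# `random.randint(1, 100) > N` gates in the helpers above.
inner = {
    "gaussian_blur_3d": 0.15,
    "x_flip": 0.50,
    "y_flip": 0.30,
    "random_rotation_3d": 0.30,
    "x_shift": 0.30,
    "y_shift": 0.30,
    "zoom_xy": 0.20,
}
outer = 0.85  # img_augment_3d skips all augmentation 15% of the time

effective = {name: round(outer * p, 4) for name, p in inner.items()}
for name, p in effective.items():
    print(f"{name}: {p:.1%}")

# Chance that a sample passes through completely unchanged.
p_none = 1.0
for p in inner.values():
    p_none *= (1 - p)
p_untouched = (1 - outer) + outer * p_none
print(f"untouched: {p_untouched:.1%}")
```

So with these thresholds roughly one in five samples reaches the model unaugmented.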


## Smoke Test

Sometimes the custom sequence generator has issues when some combination of augmentations hits, so this allows testing it with various combinations to see if it crashes (since it loops the batch generator with augmentation, it could also be used as a performance test):

In [None]:
experiment_gen = MySequence3D(df_x, df_y, batch_size, augment=True)

In [None]:
for x in tqdm(range(10)):
    batch = experiment_gen.__getitem__(0)

## Visualizing Augmentations to See How They Work

A helper to plot the 3D augmented images the above custom generator produces (for debugging/testing):

In [None]:
def plot_gen_batch(generator, idx):
    # configure batch size and retrieve one batch of images
    plt.clf() #clears matplotlib data and axes
    #for batch in train_generator:
    rows = int(batch_size / 3) + 1 #subplot needs an integer row count
    plt.figure(figsize=[30,10*rows])
    batch = generator.__getitem__(idx)
    print(f"showing {len(batch[0][0])} images")
    #have to use len(batch[0][0]) here, as batch size can vary if it is the last part of the images (truncated to dataset length)
    for x in range(0, len(batch[0][0])):
    #    print(train_generator.filenames[x])
    
        plt.subplot(rows, 3, x+1)
        #batch[0] is x, which is [imgs, nums]
        #batch[0][0] is imgs
        #so the following line just takes the first slice of each 3D image in the batch
        img_2d_plane = batch[0][0][x][0]
        plt.imshow(img_2d_plane, interpolation='nearest')

        num = "disabled"
        y = batch[1][x]
        print(f"num: {num}, y: {y}, img min: {img_2d_plane.min()} max: {img_2d_plane.max()}")

    plt.show()

In [None]:
#this gives length 2, since it is an array of (x,y) batch items
len(batch)

In [None]:
#with batch size 8 we get 8 images
len(batch[0][0])

In [None]:
#with img_depth=30 and img_size=192, we get 8 images in a batch, each of size depth(z)=30, height(y)=192, width(x)=192
batch[0][0].shape

## Plot Some Example Augmented Batches

Every time `__getitem__()` is called on the generator, it should produce different augmentations on the fly. So here we call it twice to see that it works as expected:

In [None]:
plot_gen_batch(experiment_gen, 0)

In [None]:
plot_gen_batch(experiment_gen, 0)

Just a little cleanup as we are resource-wasting paranoid:

In [None]:
del batch
del experiment_gen

# Create the Model

Here, I build the 3D CNN to play with. First basic config for image sizes etc:

In [None]:
h = img_size
w = img_size
input_shape = (img_depth, h, w, 1) #(depth, height, width, channels), matching the (depth, height, width) arrays the generator yields; the image has just one color channel


The actual `create_model()` function further below calls `create_cnn_model()` to create the CNN. This is because I reused the model built in this kernel in some other kernels, and this split makes it easier to combine them. I built it here separately since that allowed me to focus on the CNN only and run it faster with a smaller dataset, to iterate on model architectures faster and with less use of the limited GPU resources on Kaggle.

In [None]:
def create_cnn_model():
    img_input = Input(shape=(input_shape), name="img_input")
    cnn = Conv3D(32, kernel_size=(15), strides=(2), padding='same', activation='relu', kernel_initializer='he_uniform', name="conv_3d_1")(img_input)
    #cnn = Dropout(0.5)(cnn)
    #strides=(1,1)?
    #batchnorm seems to really wreak havoc on the results if added anywhere in the cnn
    #cnn = BatchNormalization()(cnn)
    #cnn = MaxPooling3D(pool_size=(2,2,2))(cnn) #strides=(2,2)?
    #cnn = Dropout(0.6)(cnn)
    cnn = Conv3D(32, kernel_size=(7), strides=(2), padding='same', activation='relu', kernel_initializer='he_uniform', name="conv_3d_2")(cnn) #chained on conv_3d_1; feeding img_input here would leave conv_3d_1 unused
    #cnn = Dropout(0.5)(cnn)
    #strides=(1,1)?
    #cnn = BatchNormalization()(cnn)
    cnn = MaxPooling3D(pool_size=(2,2,2), strides=(2, 2, 2))(cnn) #strides=(2,2)?
    #cnn = Dropout(0.45)(cnn)
    cnn = Conv3D(64, kernel_size=(3), strides=(1), padding='same', activation='relu', kernel_initializer='he_uniform', name="conv_3d_3")(cnn)
    cnn = Conv3D(64, kernel_size=(3), strides=(2), padding='same', activation='relu', kernel_initializer='he_uniform', name="conv_3d_4")(cnn)
    #cnn = BatchNormalization()(cnn)
    #cnn = MaxPooling3D(pool_size=(2,2,2), strides=(2, 2, 2))(cnn) #strides=(2,2)?
    #cnn = Dropout(0.45)(cnn)
    #cnn = Conv3D(128, kernel_size=(5), strides=(1), padding='same', activation='relu', kernel_initializer='he_uniform', name="conv_3d_5")(cnn)
    #cnn = Conv3D(128, kernel_size=(3), strides=(2), padding='same', activation='relu', kernel_initializer='he_uniform', name="conv_3d_6")(cnn)
    #cnn = BatchNormalization()(cnn)
    cnn = MaxPooling3D(pool_size=(2,2,2), strides=(2, 2, 2))(cnn) #strides=(2,2)?
    #cnn = Dropout(0.3)(cnn)
    flatten = Flatten()(cnn)
    final_cnn_dense = Dense(100, activation='relu')(flatten)
    model = keras.Model(
        inputs=[img_input],
        outputs=[final_cnn_dense],
    )
    return model
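
To get a feel for how quickly the volume shrinks, here is a sketch of the output-size arithmetic for one stride-2 `padding='same'` convolution followed by one 2x2x2 max pooling with the default `'valid'` padding. It uses the `img_size` and `img_depth` values from the configuration; it is not a full trace of the model above.

```python
import math

def same_conv_out(n, stride):
    # With padding='same', the output length is ceil(n / stride), whatever the kernel size.
    return math.ceil(n / stride)

def valid_pool_out(n, pool=2, stride=2):
    # With padding='valid': floor((n - pool) / stride) + 1.
    return (n - pool) // stride + 1

for name, n in [("height/width", 192), ("depth", 30)]:
    after_conv = same_conv_out(n, stride=2)
    after_pool = valid_pool_out(after_conv)
    print(f"{name}: {n} -> {after_conv} -> {after_pool}")
```

The depth axis is the limiting one: starting from 30 slices, only a few stride-2 stages fit before it collapses to 1.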


Another alternative I tried, just for illustration:

In [None]:
#tried a few options, just rename the one to try as "create_cnn_model" and run the thing..
def create_cnn_model_small():
    img_input = Input(shape=(input_shape), name="img_input")
    cnn = Conv3D(32, kernel_size=(5), strides=(2), padding='same', activation='relu', kernel_initializer='he_uniform', name="conv_3d_1")(img_input)
    cnn = MaxPooling3D(pool_size=(2,2,2), strides=(2))(cnn)
    cnn = Conv3D(64, kernel_size=(3), strides=(2), padding='same', activation='relu', kernel_initializer='he_uniform', name="conv_3d_2")(cnn) #chained on the pooled conv_3d_1 output instead of img_input
    cnn = MaxPooling3D(pool_size=(2,2,2), strides=(2, 2, 2))(cnn)
    flatten = Flatten()(cnn)
    final_cnn_dense = Dense(100, activation='relu')(flatten)
    model = keras.Model(
        inputs=[img_input],
        outputs=[final_cnn_dense],
    )
    return model


This creates the actual model for training folds:

In [None]:
from tensorflow.keras.optimizers import Adam

def create_model():
    cnn_model = create_cnn_model()
    x = Dropout(0.5)(cnn_model.output)
    x = Dense(200, activation="relu")(x)
    x = Dropout(0.5)(x)
    x = Dense(1, activation="linear", name="final_dense")(x)
    model = keras.Model(inputs=[cnn_model.input], outputs=x)
    adam = tensorflow.keras.optimizers.Adam(learning_rate=0.01) #lr is the deprecated name of this argument
#    adam = tensorflow.keras.optimizers.Adam(lr=0.001)
#    adam = tensorflow.keras.optimizers.Adam(lr=1e-2, decay=1e-2/epochs)

    model.compile(loss='mean_squared_error',
                  optimizer=adam,  #keras.optimizers.SGD(lr=0.01),
                  metrics=[tf.keras.metrics.RootMeanSquaredError(name='rmse')])
    return model

And our regular Keras callbacks to save the best model during training, lower the learning rate on plateau, and stop a bit earlier if there are no gains:

In [None]:
def create_callbacks(idx):
    checkpoint = ModelCheckpoint(f'../working/weights_best_{idx}.h5', monitor='val_loss', verbose=1, 
                                 save_best_only=True, mode='min', save_weights_only = True)
    reduceLROnPlat = ReduceLROnPlateau(monitor='val_loss', factor=0.5, patience=3, min_lr=1e-6,
                                       verbose=1, mode='auto', min_delta=0.0001)
    early = EarlyStopping(monitor="val_loss", 
                          mode="min", 
                          patience=22) #OK, patience is silly / too high for this number of epochs but whatever :)

    csv_logger = CSVLogger(filename='../working/training_log.csv',
                           separator=',',
                           append=True)

    callbacks_list = [checkpoint, reduceLROnPlat, csv_logger, early]
    return callbacks_list
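
As a rough sketch of what the `ReduceLROnPlateau` settings imply, this counts how many halvings it takes to get from a starting rate of 0.01 down to the `min_lr` floor of 1e-6. It only models the `new_lr = max(lr * factor, min_lr)` rule, not the plateau detection itself.

```python
lr, factor, min_lr = 0.01, 0.5, 1e-6

reductions = 0
while lr > min_lr:
    # Each plateau halves the rate, but never below min_lr.
    lr = max(lr * factor, min_lr)
    reductions += 1

print(reductions, lr)
```

With a patience of 3 epochs between reductions, a 30-epoch run should not get anywhere near that floor.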

## Function to Train the Model on Given Data

This function runs the training for a single fold with given data:

In [None]:
def fit_model(model, callbacks_list, df_x_train, df_y_train, df_x_val, df_y_val):
    train_gen = MySequence3D(df_x_train, df_y_train, batch_size, augment=True)
    valid_gen = MySequence3D(df_x_val, df_y_val, batch_size, augment=False)

    #the total number of images we have:
    train_size = df_x_train.shape[0]
    #train_steps is how many steps per epoch Keras runs the generator. One step is one batch of batch_size images
    train_steps = train_size/batch_size
    train_steps = int(train_steps)
    #same for the validation set
    valid_size = df_x_val.shape[0]
    valid_steps = valid_size/batch_size
    valid_steps = int(valid_steps)

    #model.fit accepts a Sequence directly; fit_generator is deprecated in TF 2.x
    fit_history = model.fit(
        train_gen,
        steps_per_epoch=train_steps,
        epochs=epochs,
        validation_data=valid_gen,
        validation_steps=valid_steps,
        callbacks=callbacks_list,
        use_multiprocessing=True,
        workers=2,
        verbose=1
    )
    return fit_history

## Make Predictions

Make predictions for the Kaggle test set, using target week numbers -12...133. This is just to see how the prediction runs here; I did not use this for an actual submission, since I made this kernel more to explore the CNN architecture. It does not use any time-related data or the tabular data, so it's not really very good for the actual final prediction over all weeks. But keeping these here anyway.

In [None]:
def make_predictions_dict(model, test_rows):
    if isinstance(test_rows, pd.DataFrame):
        test_rows = [row.to_dict() for (idx, row) in test_rows.iterrows()]
    predictions = {}
    col_names = []
    print(f"predicting {test_rows}")
    for target_week in tqdm(range(-12,134)):
        for idx, row in enumerate(test_rows):
            row["target_week"] = (target_week+12)/(133+12)
            patient_id = row["patient_id"]
            img = img_augment_3d("train", None, patient_id, False)
            img = np.array([img])
            #print(img)
            #print(img.shape)
            pred = model.predict([img])
            col_name = f"{idx+1}_{target_week}"
            col_names.append(col_name)
            predictions[col_name] = pred.flatten()[0]
            #print(f"target week: {row['target_week']}, pred: {pred}")
        
    return predictions, col_names
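
The `target_week` expression in the loop above is a plain min-max scaling of the week number. A small sketch, with a hypothetical helper name `scale_week`:

```python
def scale_week(target_week, min_week=-12, max_week=133):
    # Min-max scaling: the first week maps to 0.0 and the last to 1.0.
    return (target_week - min_week) / (max_week - min_week)

weeks = list(range(-12, 134))  # same range as the prediction loop
print(len(weeks), scale_week(-12), scale_week(133))
```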

Predict also for my own test set that was put aside from the training data. So I can compare later how well/bad it did with the final trained model (since training set has known target FVC values):

In [None]:
def make_predictions_my_test(model, test_rows):
    test_rows = [row.to_dict() for (idx, row) in test_rows.iterrows()]
    predictions = []
    for idx, row in tqdm(enumerate(test_rows), total=len(test_rows)):
        patient_id = row["patient_id"]
        img = img_augment_3d("train", None, patient_id, False)
        img = np.array([img])
        pred = model.predict([img])
        predictions.append(pred.flatten()[0])
        #print(f"target week: {row['target_week']}, pred: {pred}")
        
    return predictions

## Split into N Groups to Train N Models

In [None]:
indices = np.arange(df_x.shape[0])
#https://github.com/scikit-learn/scikit-learn/issues/9193
train_indices, my_test_indices = next(GroupShuffleSplit(test_size=my_test_pct, random_state=8).split(indices, groups=df_x["patient_id"]))
my_test_X = df_x.iloc[my_test_indices]
my_test_y = df_y.iloc[my_test_indices]

full_indices = indices
indices = train_indices
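
One detail worth keeping in mind when a further split is run on a subset like this: scikit-learn's `split()` yields positions into the array it is given, not the original row numbers, so they need mapping back before indexing the full dataframe. A sketch with made-up rows and hypothetical patient ids:

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Made-up setup: rows 2, 3, 6, 7 of a larger table survived an earlier split.
kept_rows = np.array([2, 3, 6, 7])
groups = np.array(["a", "a", "b", "b"])  # hypothetical patient ids of those rows

folds = []
for train_pos, val_pos in GroupKFold(n_splits=2).split(kept_rows, groups=groups):
    # train_pos/val_pos are positions 0..3 into kept_rows; map them back to row numbers.
    folds.append((kept_rows[train_pos], kept_rows[val_pos]))
    print(train_pos, "->", kept_rows[train_pos], "|", val_pos, "->", kept_rows[val_pos])
```

Each fold keeps both rows of a patient on the same side, which is the point of the grouped split.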

## Run the Training on Model

In [None]:

#split the training data by patient id, so if patient has multiple rows we put them in the same group
splits = list(GroupKFold(n_splits=N_SPLITS).split(indices, groups=df_x.iloc[indices]["patient_id"]))
preds_test = []
preds_my_test = []
col_names = []
fit_histories = []

for idx, (train_idx, val_idx) in enumerate(splits):
    K.clear_session() # start Keras from clean state in each iteration
    print("Beginning fold {}".format(idx+1))
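    # split() returns positions into `indices`, so map them back to rows of df_x/df_y
    # (using them on df_x directly could pull held-out test rows into the folds)
    fold_train_idx, fold_val_idx = indices[train_idx], indices[val_idx]
    train_X, train_y = df_x.iloc[fold_train_idx], df_y.iloc[fold_train_idx]
    val_X, val_y = df_x.iloc[fold_val_idx], df_y.iloc[fold_val_idx]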
    # instantiate the model for this fold
    model = create_model()
    callbacks = create_callbacks(idx)
    #train the model on this fold
    history = fit_model(model, callbacks, train_X, train_y, val_X, val_y)
    fit_histories.append(history)
    # loads the best weights saved by the checkpoint
    model.load_weights(f'../working/weights_best_{idx}.h5') #same path the checkpoint callback writes to
    pred_test, col_names = make_predictions_dict(model, test_rows)
    preds_test.append(pred_test)
    pred_test = make_predictions_my_test(model, my_test_X)
    preds_my_test.append(pred_test)
    del model
    del callbacks


In [None]:
my_test_X


In [None]:
model = create_model()
model.summary()

In [None]:
from tensorflow.keras.utils import plot_model
plot_model(model)

## Plot Training Losses per Fold

In [None]:
def plot_loss_and_accuracy(fit_history, n=0):
    plt.clf()
    plt.plot(fit_history.history['rmse'][n:])
    plt.plot(fit_history.history['val_rmse'][n:])
    plt.title('model rmse')
    plt.ylabel('rmse')
    plt.xlabel('epoch')
    plt.legend(['train', 'test'], loc='upper left')
    plt.show()
    plt.clf()
    #since the loss is just mse here, the graph is practically identical to RMSE (metric above). Save space and plot it just once..
    # summarize history for loss
    #plt.plot(fit_history.history['loss'][n:])
    #plt.plot(fit_history.history['val_loss'][n:])
    #plt.title('model loss')
    #plt.ylabel('loss')
    #plt.xlabel('epoch')
    #plt.legend(['train', 'test'], loc='upper left')
    plt.show()

Full loss history for all folds:

In [None]:
for idx, fit_history in enumerate(fit_histories):
    print(f"fold {idx}")
    plot_loss_and_accuracy(fit_history)

The above charts often have a few folds where the first few epochs have very high loss, which then drops down significantly. But because the start is high, variance in the later epochs is hard to spot in the chart. To avoid this, here we plot loss history starting at epoch 4 for all folds:
    

In [None]:
for idx, fit_history in enumerate(fit_histories):
    print(f"fold {idx}")
    plot_loss_and_accuracy(fit_history, 4)

## Look at Custom Test Set Predictions vs Actual Values

In [None]:
len(preds_my_test[0])

In [None]:
preds_my_test_mean = np.mean(preds_my_test, axis=0)

In [None]:
from sklearn.metrics import mean_squared_error

np.sqrt(mean_squared_error(my_test_y, preds_my_test_mean)) #RMSE, without the squared=False argument that newer scikit-learn versions deprecate
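
What the two cells above compute, spelled out as a sketch on made-up numbers: average the per-fold predictions for each patient, then take the root of the mean squared error against the true values.

```python
import numpy as np

# Made-up predictions from 2 folds for 3 patients, plus made-up true values.
fold_preds = np.array([[2400.0, 3100.0, 2000.0],
                       [2600.0, 2900.0, 2200.0]])
y_true = np.array([2500.0, 3200.0, 2100.0])

# axis=0 averages across folds, leaving one prediction per patient.
ensemble = fold_preds.mean(axis=0)

# RMSE is the square root of the mean squared error.
rmse = np.sqrt(np.mean((y_true - ensemble) ** 2))
print(ensemble.tolist(), round(float(rmse), 2))
```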

Convert our test set predictions into a dataframe for easier manipulation:

In [None]:
df_mytest_predictions = pd.DataFrame()
for idx, mtp in enumerate(preds_my_test):
    df_mytest_predictions[f"{idx+1}"] = mtp


In [None]:
df_mytest_predictions.T

In [None]:
df_mytest_predictions.T.describe()

In [None]:
df_mytest_predictions.T.describe().T.describe()

## Check Diff per Patient in Test Set

By looking at how much the FVC predictions differ from the actual FVC value per patient, maybe we could identify some trends.

Let's see.

In [None]:
df_mytest_predictions_diff = pd.DataFrame()
for idx, mtp in enumerate(preds_my_test):
    df_mytest_predictions_diff[f"{idx+1}"] = np.abs(np.array(mtp) - my_test_y.values)


In [None]:
df_mytest_predictions_diff.T

Above shows one row per fold (5 folds = 5 rows), where each column is the prediction difference vs actual value for a patient (0-8 patients, for a total of 9 patients in this test set).

In [None]:
df_mytest_predictions_diff.T.describe()

Above shows the mean, std, etc. over the 5 folds for each patient in the test set.

By using describe() on describe(), we can get aggregated statistics on the overall dataset (so mean over all patients means, etc):

In [None]:
df_mytest_predictions_diff.T.describe().T.describe()
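
To make the double `describe()` less mysterious, here is a sketch on a tiny made-up error table (2 patients, 2 folds). The `mean`/`mean` cell of the final table is the mean of the per-patient mean errors.

```python
import pandas as pd

# Made-up absolute errors: rows are patients, columns are folds.
diff = pd.DataFrame({"1": [1.0, 2.0], "2": [3.0, 6.0]})

per_patient = diff.T.describe()     # columns = patients, rows = count/mean/std/...
overall = per_patient.T.describe()  # statistics of those per-patient statistics

# Mean of the per-patient means: (2.0 + 4.0) / 2
print(overall.loc["mean", "mean"])
```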

So on average, when I ran this, we were off across all folds and patients by 566 units. The results on this posted version should be a little bit different but not by much, due to the randomness of the process. When the range of values is roughly 2000-3000, I'd say that's a pretty bad result.

# Predictions for Kaggle Test Set

This would be the test set Kaggle provides, and the one that would be providing the actual competition results if we submitted this. But not going there with these poor results. Let's play anyway.

Partly my idea was to look at the differences above, and compare with these values to see if I could find some trend to apply. Of course not.

In [None]:
df_test_predictions = pd.DataFrame(preds_test)
df_test_predictions = df_test_predictions[col_names] #to get sorted order

In [None]:
df_test_predictions

In [None]:
df_test_predictions.describe()

In [None]:
descriptions = df_test_predictions.describe()

In [None]:
descriptions.max(axis=1)

In [None]:
descriptions.T.describe()