# KERAS - TRAINING WITH FLOAT16 - Test Kernel 1

## Introduction

Because of the large images and the level of detail a good model needs in the Human Protein Atlas Image Classification Challenge, one way to reduce the amount of memory needed is to train with `float16` precision.
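
As a rough illustration of the potential saving, the sketch below counts the bytes of a single input batch in each precision. It uses the batch size (168) and crop side (256) that this kernel sets further down; `batch_bytes` and `BYTES_PER_VALUE` are names made up for this example, and activations and weights are not counted at all.

```python
# Rough size of ONE input batch only; activations and weights are not counted.
batch_size = 168   # batch size used later in this kernel
crop_side = 256    # side of the training crops used later in this kernel
channels = 4       # red, green, blue, yellow

# Storage size of one value in each precision
BYTES_PER_VALUE = {'float16': 2, 'float32': 4}

def batch_bytes(dtype_name):
    """Bytes needed to hold one batch of crops in the given precision."""
    n_values = batch_size * crop_side * crop_side * channels
    return n_values * BYTES_PER_VALUE[dtype_name]

for name in ('float16', 'float32'):
    print(name, batch_bytes(name) / 2**20, 'MiB')
```

This gives 84 MiB for `float16` against 168 MiB for `float32`, for the inputs alone.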

In Keras, this should be as simple as calling `K.set_floatx('float16')`. However, a few other things must be done to avoid `nan` values and to use the BatchNormalization layer with the Tensorflow backend, which requires `float32` in all cases. (This kernel was not tested with other backends, which may behave differently.)

Also, after fixing the normalization layer, the optimizer must be fixed to handle the conflicting types.

### Warning:

Although I believed this would bring faster training or allow bigger batches, it doesn't seem that's the case. I can't explain why.    
If you find a bug or solve the issue, please let us know :)   

It may be a matter of tuning things properly, but my training results with this weren't great either. Kernels with `float32` were able to train to reasonable results, while kernels with `float16` stopped improving very early.

## Differences between Test Kernel 1 and Test Kernel 2

In this kernel, we use the original batch normalization from Keras in `float32`. We just make sure Keras won't automatically create `float16` weights.   
In [Test Kernel 2](https://www.kaggle.com/danmoller/keras-training-with-float16-test-kernel-2), we change the batch normalization layer to use `float16`. (This skips using Tensorflow's fused batch normalization and uses a regular batch normalization) 
Test Kernel 1 resulted in faster training and less memory consumption.    

## Setting to float16 and avoiding NaNs

To do this, simply run these commands before anything else.
`epsilon` is set to a larger value because the default is too small for `float16` and causes `nan` loss values during training.
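
To get a feeling for the scales involved, here is a small NumPy sketch (independent of Keras) that compares both epsilon values with the limits of `float16`. It only shows how coarsely each value is stored; it is not meant as an explanation of exactly where the `nan` values appear inside Keras. `rel_error` is a helper made up for this example.

```python
import numpy as np

info = np.finfo(np.float16)
print('smallest normal float16:', info.tiny)   # about 6.1e-05
print('float16 machine epsilon:', info.eps)    # about 9.8e-04

default_eps = np.float16(1e-7)   # Keras default epsilon, stored as float16
bigger_eps = np.float16(1e-4)    # value set in this kernel

def rel_error(stored, exact):
    """Relative error introduced just by storing `exact` as float16."""
    return abs(float(stored) - exact) / exact

print('1e-7 stored as', default_eps, 'relative error', rel_error(default_eps, 1e-7))
print('1e-4 stored as', bigger_eps, 'relative error', rel_error(bigger_eps, 1e-4))
```

The default `1e-7` lies below the smallest normal `float16` value, so it is stored with very little precision, while `1e-4` lies above that limit.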

In [None]:
import keras.backend as K
K.set_floatx('float16')
K.set_epsilon(1e-4) #default is 1e-7

isTestMode = False #use this for training short epochs to quickly see the results
epochs = 2 
batchSize = 168

## Fixing the BatchNormalization layer

Because of Tensorflow's requirement of using `float32` in batch normalization, the setting above will break some things because Keras will send `float16` values to Tensorflow.

Thus, we will create custom weight initializers and a custom BatchNormalization layer:

In [None]:
from keras.layers import BatchNormalization, InputSpec #InputSpec is used in build() below
from keras.initializers import Initializer

#custom initializers to force float32
class Ones32(Initializer):
    def __call__(self, shape, dtype=None):
        return K.constant(1, shape=shape, dtype='float32')

class Zeros32(Initializer):
    def __call__(self, shape, dtype=None):
        return K.constant(0, shape=shape, dtype='float32')
    


class BatchNormalizationF16(BatchNormalization):

    #class creator with same params as a regular batch normalization
    #uses the float32 initializers as default
    def __init__(self,
                 beta_initializer=Zeros32(), 
                 gamma_initializer=Ones32(),
                 moving_mean_initializer=Zeros32(), 
                 moving_variance_initializer=Ones32(),
                 **kwargs):
        
        super(BatchNormalizationF16, self).__init__(
                            beta_initializer = beta_initializer, 
                            gamma_initializer = gamma_initializer,
                            moving_mean_initializer = moving_mean_initializer, 
                            moving_variance_initializer = moving_variance_initializer, 
                            **kwargs)
        

    #method that creates and initializes the weights - forcing float32
    def build(self, input_shape):
        dim = input_shape[self.axis]
        if dim is None:
            raise ValueError('Axis ' + str(self.axis) + ' of '
                             'input tensor should have a defined dimension '
                             'but the layer received an input with shape ' +
                             str(input_shape) + '.')
        self.input_spec = InputSpec(ndim=len(input_shape),
                                    axes={self.axis: dim})
        shape = (dim,)

        #forcing float32 here
        if self.scale:
            self.gamma = self.add_weight(shape=shape,
                                         name='gamma',
                                         dtype='float32',
                                         initializer=self.gamma_initializer,
                                         regularizer=self.gamma_regularizer,
                                         constraint=self.gamma_constraint)
        else:
            self.gamma = None
        if self.center:
            #forcing float32 here
            self.beta = self.add_weight(shape=shape,
                                        name='beta',
                                        dtype='float32',
                                        initializer=self.beta_initializer,
                                        regularizer=self.beta_regularizer,
                                        constraint=self.beta_constraint)
        else:
            self.beta = None
        
        #forcing float32 here
        self.moving_mean = self.add_weight(
            shape=shape,
            name='moving_mean',
            dtype='float32',
            initializer=self.moving_mean_initializer,
            trainable=False)
        
        #forcing float32 here
        self.moving_variance = self.add_weight(
            shape=shape,
            name='moving_variance',
            dtype='float32',
            initializer=self.moving_variance_initializer,
            trainable=False)
        self.built = True

    #here we need to cast to and back from float32
    def call(self, inputs, training=None):
        inputs = K.cast(inputs, 'float32')
        result = super(BatchNormalizationF16, self).call(inputs, training)
        return K.cast(result,K.floatx())



## Fixing the optimizers

Now, because the model has weights of different types, the optimizer needs to be updated so that `float16` weights and gradients are combined with `float16` learning rates, momentums, etc., and `float32` weights and gradients are combined with `float32` values.

Here we will be fixing the `SGD` optimizer. Others should follow similar patterns.
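
The idea can be sketched outside Keras with plain NumPy arrays. This is only an illustration of casting the hyperparameters per weight, not the Keras optimizer itself; `sgd_momentum_step` is a name made up for this example and it covers only the non-Nesterov case.

```python
import numpy as np

def sgd_momentum_step(params, grads, moments, lr=0.01, momentum=0.9):
    """One plain (non-Nesterov) momentum step, keeping each weight's dtype."""
    new_params, new_moments = [], []
    for p, g, m in zip(params, grads, moments):
        # Cast the hyperparameters to the dtype of this particular weight
        lr_p = p.dtype.type(lr)
        momentum_p = p.dtype.type(momentum)

        v = momentum_p * m - lr_p * g   # velocity
        new_moments.append(v)
        new_params.append(p + v)
    return new_params, new_moments

# A float16 kernel next to float32 batch normalization weights
params = [np.ones((3, 3), dtype='float16'), np.ones((3,), dtype='float32')]
grads = [np.full(p.shape, 0.5, dtype=p.dtype) for p in params]
moments = [np.zeros(p.shape, dtype=p.dtype) for p in params]

params, moments = sgd_momentum_step(params, grads, moments)
print([p.dtype.name for p in params])
```

Each weight keeps its own type after the update, which is the property the Keras version needs for its update operations.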

In [None]:
from keras.optimizers import SGD

#Comments added to parts of the code changed from original
class SGDMultiType(SGD):
    
    def get_updates(self, loss, params):
        grads = self.get_gradients(loss, params)
        self.updates = [K.update_add(self.iterations, 1)]

        lr = self.lr
        if self.initial_decay > 0:
            lr = lr * (1. / (1. + self.decay * K.cast(self.iterations,
                                                      K.dtype(self.decay))))
            
        #Adjusting learning rate for matching each weight type
        learning_rates = [K.cast(lr, K.dtype(p)) for p in params]
            
        # momentum
        shapes = [K.int_shape(p) for p in params]
        
        #adding custom types to moments
        moments = [K.zeros(shape, dtype=K.dtype(p)) for p,shape in zip(params,shapes)]
        self.weights = [self.iterations] + moments
        
        #adjusting "self.momentum" value to weight types
        momentums = [K.cast(self.momentum,K.dtype(p)) for p in params]
        
        #using the typed learning rate and momentums
        for p, g, m, lr, momentum in zip(params, grads, moments, learning_rates, momentums):
            v = momentum * m - lr * g  # velocity
            self.updates.append(K.update(m, v))

            if self.nesterov:
                new_p = p + momentum * v - lr * g
            else:
                new_p = p + v

            # Apply constraints.
            if getattr(p, 'constraint', None) is not None:
                new_p = p.constraint(new_p)

            self.updates.append(K.update(p, new_p))
        return self.updates



# Training

Now we test our changes on a simple model (this model was not carefully designed for this competition, it is just an example).

Let's create the model after a few definitions for loading the data and organizing it in images with 4 channels (RGBY). Don't forget to convert your data to `float16`.
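
As a minimal sketch of that layout and conversion, the snippet below stacks four grayscale arrays into one channels-last image and casts it explicitly. The random arrays are stand-ins for the four PNG files of one sample, not real competition data.

```python
import numpy as np

side = 512  # same image side this kernel uses

# Stand-ins for the four grayscale PNGs of one sample (hypothetical data)
rng = np.random.RandomState(0)
channels = [rng.randint(0, 256, size=(side, side)).astype('uint8')
            for _ in ('red', 'green', 'blue', 'yellow')]

image = np.stack(channels, axis=-1)   # shape (512, 512, 4), channels last
image16 = image.astype('float16')     # explicit conversion to float16

# Every 8-bit pixel value is an integer below 2048, so the cast is exact
print(image16.shape, image16.dtype, np.array_equal(image16, image))
```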

In [None]:
competitionFolder = '../input/' #human-protein-atlas-image-classification/'
trainFolder = competitionFolder + 'train/'
testFolder = competitionFolder + 'test/'
nClasses = 28
side=512
originalSide = 512
cropSide = 256


In [None]:
import numpy as np
from PIL import Image
import random

%matplotlib inline
import matplotlib.pyplot as plt

def loadClasses():
    trainFile = competitionFolder + 'train.csv'
    filesAndClasses = list()
    
    with open(trainFile, 'r') as f:
        _ = next(f)
        for row in f:
            fields = row.split(',')
            file = fields[0]
            
            classesNp = np.zeros((nClasses,), dtype='float16')
            classes = fields[1].split(' ')
            for c in classes: classesNp[int(c)] = 1
                
            filesAndClasses.append((trainFolder + file, classesNp))
    return filesAndClasses

def loadImage(file):
    colors = ['_red.png', '_green.png', '_blue.png', '_yellow.png']
    images = [Image.open(file + color) for color in colors]
    return np.stack(images, axis=-1)

#flips a batch of images, flipMode is an integer in range(8)
def flip(x, flipMode):
    if flipMode in [4,5,6,7]:
        x = np.swapaxes(x,1,2)
    if flipMode in [1,3,5,7]:
        x = np.flip(x,1)
    if flipMode in [2,3,6,7]:
        x = np.flip(x,2)
        
    return x

def inspect(x, name):
    print(name + ": ", 'shape:', x.shape, 'min:', x.min(), 'max:',x.max())
    
def plotChannels(img, minVal = 0, maxVal = 255):
    fig, ax = plt.subplots(1, img.shape[-1], figsize=(20,10))
    for i in range(img.shape[-1]):
        ax[i].imshow(img[:,:,i], vmin = minVal, vmax= maxVal)
        
    plt.show()
    
def competitionMetric(true,pred):
    pred = K.cast(K.greater(pred,0.5), K.floatx())
    
    groundPositives = K.sum(true, axis=0) + K.epsilon()
    correctPositives = K.sum(true * pred, axis=0)
    predictedPositives = K.sum(pred, axis=0) + K.epsilon()

    precision = correctPositives / predictedPositives
    recall = correctPositives / groundPositives

    m = (2 * precision * recall) / (precision + recall + K.epsilon())

    return K.mean(m)

## Data generator with cropping and flipping

In order to train even faster, we're creating a data generator that crops from the 512x512 images.
I did not study whether a crop can cut out the areas that contain the target proteins, but if the crop is big enough, the protein is very likely to be present in it.

Here we will be training with crops of size 256x256.
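
The crop and flip steps can be tried on a toy batch before involving real images. The sketch below repeats the logic of the `flip` function defined above and adds a `random_crop` helper, a name made up for this example that mirrors what the generator does inline.

```python
import random
import numpy as np

def flip(x, flipMode):
    """Same logic as the kernel's flip(): x is a batch, flipMode is in range(8)."""
    if flipMode in [4, 5, 6, 7]:
        x = np.swapaxes(x, 1, 2)   # transpose height and width
    if flipMode in [1, 3, 5, 7]:
        x = np.flip(x, 1)          # flip along height
    if flipMode in [2, 3, 6, 7]:
        x = np.flip(x, 2)          # flip along width
    return x

def random_crop(x, cropSide):
    """Take the same random square window from every image of the batch."""
    startH = random.randint(0, x.shape[1] - cropSide)
    startW = random.randint(0, x.shape[2] - cropSide)
    return x[:, startH:startH + cropSide, startW:startW + cropSide]

# Toy batch: 2 images of 8x8 with 4 channels, all values distinct
batch = np.arange(2 * 8 * 8 * 4).reshape(2, 8, 8, 4)
crop = random_crop(batch, 4)
variants = [flip(crop, mode) for mode in range(8)]

print(crop.shape)                             # (2, 4, 4, 4)
print(len({v.tobytes() for v in variants}))   # number of distinct results
```

Because the toy values are all distinct, the eight modes produce eight different arrays, one for each combination of transposing and flipping a square image.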



In [None]:
from keras.utils import Sequence
from random import shuffle

#works with channels last
class ImageLoader(Sequence):
    
    #class creator, use generationMode = 'predict' for returning only images without labels
    #when using 'predict', pass only a list of files, not files and classes
    def __init__(self, filesAndClasses, batchSize, generationMode = 'train'):
        
        self.filesAndClasses = filesAndClasses
        self.batchSize = batchSize
        self.generationMode = generationMode
        
        assert generationMode in ['train', 'predict']
            

    #gets the number of batches this generator returns
    def __len__(self):
        l,rem = divmod(len(self.filesAndClasses), self.batchSize)
        return (l + (1 if rem > 0 else 0))
    
    #shuffles data on epoch end
    def on_epoch_end(self):
        if self.generationMode == 'train':
            shuffle(self.filesAndClasses)
        
    #gets a batch with index = i
    def __getitem__(self, i):
        
        #x are images   
        #y are labels
        
        pairs = self.filesAndClasses[i*self.batchSize:(i+1)*self.batchSize]
        if self.generationMode == 'train':
            files, classes = zip(*pairs) 
            y = np.stack(classes, axis=0)
            x = [loadImage(f) for f in files]
        elif self.generationMode == 'predict':
            files = pairs
            x = [loadImage(f) for f in files]
        else:
            raise Exception("ImageLoader does not support 'generationMode' of type " + self.generationMode)
    
        x = np.stack(x, axis=0)
        
        #cropping and flipping when training
        if self.generationMode == 'train':
            
            startH = random.randint(0,side - cropSide)
            startW = random.randint(0,side - cropSide)
            
            #axis 1 is height and axis 2 is width (channels last)
            x = x[:, startH:startH + cropSide, startW:startW + cropSide]
            
            flipMode = random.randint(0,7) #see flip function defined above
            x = flip(x, flipMode)

        if self.generationMode == 'predict':
            return x
        else:
            return x, y
        


## Loading data and creating generator

Let's load the data and make a quick inspection of the generator.

In [None]:
valBatchSize = batchSize//4

trainFiles = loadClasses()

#creating a fold for validation
fold = 0
testLen = len(trainFiles)//5
testStart = fold * testLen
testEnd = testStart + testLen

valFiles = trainFiles[testStart:testEnd]
trainFiles = trainFiles[0:testStart] + trainFiles[testEnd:]

if isTestMode:
    valFiles = valFiles[:5*valBatchSize]
    trainFiles = trainFiles[:5*batchSize]

#creating train and val generators
trainGenerator = ImageLoader(trainFiles, batchSize)
valGenerator = ImageLoader(valFiles, valBatchSize)

#quick check
trainFileList, trainLabels = zip(*trainFiles)
predictGenerator = ImageLoader(trainFileList, batchSize, generationMode='predict')
for i in range(5):
    originalX = predictGenerator[i]
    x,y = trainGenerator[i]
    inspect(x,'images')
    inspect(y,'labels')
    print('unique y: ', np.unique(y))
    print('original')
    plotChannels(originalX[0])
    print('cropped')
    plotChannels(x[0])

## Model Creator

Here, a simple model using the custom layer and optimizer, similar to a ResNet. 



In [None]:
from keras.layers import *
from keras.models import Model

def modelCreator(convFilters, denseFilters):
                
    ######################################## definitions for layers ############################
    
    def denseBN(inputs, filters, activation, name):
        out = Dense(filters, name = name, use_bias=False)(inputs)
        out = BatchNormalizationF16(name = name + "BN")(out)
        out = Activation(activation, name = name + "ACT")(out)
        
        return out
    
    def convBN(inputs, filters, kernelSize, activation, name):
        out = Conv2D(filters, kernelSize, name=name, padding='same', use_bias=False)(inputs)
        out = BatchNormalizationF16(name = name + "BN")(out)
        out = Activation(activation, name = name + 'ACT')(out)
            
        return out
    
    ##################################### block definitions ####################################

    def downBlock(i, filters, inputs):
        
        name = str(i)
        out = inputs
        
        #make maxpooling and resnet connection if not first block
        if (i != 0):
            
            out = MaxPooling2D(poolSizes[i], name='Down' + name)(out)
            connection = convBN(out, filters, 3, activation = 'linear', name = 'resConnDownB' + name)


        out = convBN(out,filters, 3, activation='relu', name = 'downConvA' + name)
        out = convBN(out, filters, 3, activation='relu', name = 'downConvB' + name )
        out = convBN(out, filters, 3, activation='relu', name = 'downConvC' + name )
        
        #resnet connection
        if i != 0:
            out = convBN(out, filters, 3, activation = 'linear', name = 'resConnDownA' + name)
            out = Add(name='resAddDown' + name)([out,connection])
            out = BatchNormalizationF16(name = 'resNormDown' + name)(out)
            out = Activation('relu', name = 'downAct' + name)(out)
        
        return out
    
    ####################################### model creation #############################################
    
        
    poolSizes = [0,4,4,4,4]
    
    #None as size, so the model accepts both the 256x256 crops and the full 512x512 images
    inp = Input((None,None,4))
    out = BatchNormalizationF16(name='initNorm')(inp)

    for i,filts in enumerate(convFilters):
        out = downBlock(i,filts,out)
    
    out = GlobalMaxPooling2D(name='globalPool')(out)
    
    for i, filts in enumerate(denseFilters):
        out = denseBN(out, filts, 'relu', name = 'dense' + str(i))
    
    out = denseBN(out, nClasses, activation='sigmoid', name="FinalDense")

    model = Model(inp,out)
                       
    return model



## Creating, compiling and fitting

Let's do it. 

**Warning:** the loss function selected may not be the best for this competition. 

In [None]:
model = modelCreator(convFilters =  [20,40,90,130,200],
                     denseFilters = [100,50,30])

#confirm dtype is float16
print("type is: ", K.dtype(model.get_layer('downConvA0').kernel))

model.compile(optimizer = SGDMultiType(lr=0.01,momentum=.9), loss = 'categorical_crossentropy', metrics=[competitionMetric])

model.fit_generator(trainGenerator,len(trainGenerator), 
                    validation_data = valGenerator, validation_steps = len(valGenerator),
                   epochs = epochs, workers=5, max_queue_size=10)

# Saving and Loading

In order to save and load the model using custom layers and optimizers, one needs to create a custom objects dictionary to tell Keras how to recreate these objects.

So we should include our layer, initializers, custom metric and optimizer in this object.

In [None]:
from keras.models import load_model

customObjects = {
    'BatchNormalizationF16': BatchNormalizationF16,
    'SGDMultiType': SGDMultiType,
    'competitionMetric': competitionMetric,
    'Ones32': Ones32,
    'Zeros32': Zeros32
}

model.save('savedModel')
loadedModel = load_model('savedModel', customObjects)

#training starts from where it ended, including optimizer state
loadedModel.fit_generator(trainGenerator,len(trainGenerator), 
                    validation_data = valGenerator, validation_steps = len(valGenerator),
                   epochs = epochs, workers=5, max_queue_size=10)

# Comparing with float32

Let's try to train with the same generator and the same batch size on a model with `float32` precision, expecting the GPU to return an "out of memory (OOM)" error.

Curiously, this does not bring an OOM error (even though this batch size was the maximum I could use in the `float16` tests above).

In [None]:
dtype='float32'
K.set_floatx(dtype)


model = modelCreator(convFilters =  [20,40,90,130,200],
                     denseFilters = [100,50,30])

#confirm dtype is float32
print("type is: ", K.dtype(model.get_layer("downConvA0").kernel))

#use a regular SGD
model.compile(optimizer = SGD(lr=0.01,momentum=.9), loss = 'categorical_crossentropy', metrics=[competitionMetric])

model.fit_generator(trainGenerator,len(trainGenerator), 
                    validation_data = valGenerator, validation_steps = len(valGenerator),
                   epochs = epochs, workers=5, max_queue_size=10)