#### Author: Sameer Kesava

## Examining the Batch Normalization method

    * From the paper "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift" by
      Sergey Ioffe and Christian Szegedy
      
    * Coding the method using keras API layers and custom layers in tensorflow 2.1.0 to understand the math and comparing against the API
    
    * Using MNIST data set

    * This document uses only the existing Keras BatchNorm layers. v2 will use layers with batch normalization explicitly coded.

    * From analysis, it appears that during training, the mini-batch mean and variance are used in the calculation of batch normalization,
      and not the moving mean and variance.

    * The meanings of the arguments 'training' and 'trainable' need to be distinguished; they are easy to confuse.
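
As a point of reference for the notes above, here is a minimal NumPy sketch of the batch normalization transform in its two modes. It is not the Keras implementation; the function name `batch_norm_forward` and the toy data are assumptions made for illustration, and epsilon is set to the 0.001 used later in this notebook.

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, moving_mean, moving_var,
                       training, epsilon=1e-3):
    """Normalize x of shape (batch, features) with one statistic per feature."""
    if training:
        # training mode: statistics of the current mini-batch
        mean = x.mean(axis=0)
        var = x.var(axis=0)
    else:
        # inference mode: stored moving (population) statistics
        mean, var = moving_mean, moving_var
    x_hat = (x - mean) / np.sqrt(var + epsilon)
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(60, 4))   # toy mini-batch of 60 samples
gamma, beta = np.ones(4), np.zeros(4)
moving_mean, moving_var = np.zeros(4), np.ones(4)  # untrained moving statistics

out_train = batch_norm_forward(x, gamma, beta, moving_mean, moving_var, training=True)
out_infer = batch_norm_forward(x, gamma, beta, moving_mean, moving_var, training=False)
print(out_train.mean(axis=0))   # per-feature means are close to zero
print(out_infer.mean(axis=0))   # not centered: the moving statistics do not match the batch
```

The `training` flag only selects which statistics are used in the forward pass; whether gamma and beta receive gradient updates is a separate question, which is what 'trainable' refers to.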

In [1]:
import tensorflow as tf

In [3]:
tf.__version__

'2.1.0'

In [341]:
tf.keras.__version__

'2.2.4-tf'

In [2]:
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline

In [144]:
# garbage collection, used later to free memory
import gc

### Loading MNIST data

In [4]:
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()

In [5]:
print(x_train.shape, x_test.shape)

(60000, 28, 28) (10000, 28, 28)


#### Checking min and max values of the input

In [6]:
np.max(x_train)

255

In [7]:
np.min(x_train)

0

#### Scaling the data using the Standard Scaler

In [8]:
from sklearn.preprocessing import StandardScaler

In [9]:
stdscaler = StandardScaler(with_mean=True, with_std=True)

In [10]:
stdscaler_fit = stdscaler.fit(x_train.reshape(-1, 28*28))

In [11]:
stdscaler_fit.mean_.shape

(784,)

In [12]:
# square root of the average per-feature mean (not the average mean itself)
np.sqrt(stdscaler_fit.mean_.mean())

5.772211140440891

In [13]:
stdscaler_fit.n_samples_seen_

60000

In [14]:
stdscaler_fit.scale_.shape

(784,)

In [15]:
# square root of the average per-feature variance, i.e. an RMS standard deviation
np.sqrt(stdscaler_fit.var_.mean())

66.12879201995706

In [16]:
x_train_scaled =  stdscaler_fit.transform(x_train.reshape(-1,28*28)).reshape(-1,28,28)
x_train_scaled.shape

(60000, 28, 28)

In [17]:
x_test_scaled =  stdscaler_fit.transform(x_test.reshape(-1,28*28)).reshape(-1,28,28)
x_test_scaled.shape

(10000, 28, 28)

#### Checking the min and max values

In [18]:
np.min(x_train_scaled)

-1.2742078920822268

In [19]:
np.max(x_train_scaled)

244.94693302873063

In [20]:
np.var(x_train_scaled)

0.9145408163265558

In [21]:
np.mean(x_train_scaled)

-2.1974863349995617e-18
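
The overall variance of 0.9145 rather than 1 is consistent with some pixels being constant across the training set: 717/784 = 0.91454..., which would correspond to 67 constant pixels. The sketch below reproduces the effect on toy data. It assumes, as StandardScaler does, that a zero standard deviation is replaced by 1 before dividing; the array names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
# toy data: 1000 samples, 8 features, the last 2 features are constant
data = rng.integers(0, 256, size=(1000, 8)).astype(np.float64)
data[:, 6:] = 0.0

mean = data.mean(axis=0)
std = data.std(axis=0)
# a zero std is replaced by 1 to avoid dividing by zero
safe_std = np.where(std == 0.0, 1.0, std)
scaled = (data - mean) / safe_std

# 6 features with unit variance and 2 with zero variance
print(np.var(scaled))   # close to 6/8 = 0.75
print(717 / 784)        # the ratio that matches the value observed above
```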

#### One-hot encoding the y labels

In [22]:
np.unique(y_train, return_counts=True)

(array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=uint8),
 array([5923, 6742, 5958, 6131, 5842, 5421, 5918, 6265, 5851, 5949]))

    * The labels are more or less balanced

In [23]:
np.unique(y_test, return_counts=True)

(array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=uint8),
 array([ 980, 1135, 1032, 1010,  982,  892,  958, 1028,  974, 1009]))

In [24]:
y_train_coded =  tf.keras.utils.to_categorical(y_train, num_classes=10)
y_train_coded.shape

(60000, 10)

In [25]:
print("Label: ", y_train[0],'\n',"One-hot encoded: ", y_train_coded[0])

Label:  5 
 One-hot encoded:  [0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]


In [26]:
y_test_coded =  tf.keras.utils.to_categorical(y_test, num_classes=10)
y_test_coded.shape

(10000, 10)
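
For reference, the same one-hot encoding can be sketched in plain NumPy by indexing an identity matrix. The labels below are illustrative and not taken from the data set.

```python
import numpy as np

labels = np.array([5, 0, 4, 1, 9], dtype=np.uint8)   # illustrative class labels
one_hot = np.eye(10, dtype=np.float32)[labels]        # row i of the identity encodes class i
recovered = one_hot.argmax(axis=1)                    # argmax inverts the encoding

print(one_hot[0])
print(recovered)
```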

#### Creating validation data

In [27]:
from sklearn.utils import shuffle

In [28]:
# for reproducibility
random_seed = 100 

In [29]:
x_train_scaled.shape

(60000, 28, 28)

In [30]:
y_train_coded.shape

(60000, 10)

In [31]:
x_train_scaled, y_train_coded = shuffle(x_train_scaled, y_train_coded, random_state = random_seed) 

In [32]:
x_train_scaled.shape

(60000, 28, 28)

In [33]:
y_train_coded.shape

(60000, 10)

In [34]:
x_valid_scaled = x_train_scaled[:5000]
y_valid_coded = y_train_coded[:5000]

In [35]:
x_valid_scaled.shape

(5000, 28, 28)

In [36]:
y_valid_coded.shape

(5000, 10)

In [37]:
x_train_scaled = x_train_scaled[5000:]
y_train_coded = y_train_coded[5000:]
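
The shuffle-then-slice split above relies on the inputs and labels being permuted identically. A minimal NumPy sketch of the same idea, with hypothetical toy arrays, is:

```python
import numpy as np

rng = np.random.default_rng(100)
toy_x = np.arange(20).reshape(10, 2)   # row i holds (2i, 2i+1)
toy_y = np.arange(10)                  # label i belongs to row i

perm = rng.permutation(len(toy_x))     # one permutation applied to both arrays
x_shuf, y_shuf = toy_x[perm], toy_y[perm]

n_valid = 3
toy_x_valid, toy_y_valid = x_shuf[:n_valid], y_shuf[:n_valid]   # first slice for validation
toy_x_train, toy_y_train = x_shuf[n_valid:], y_shuf[n_valid:]   # remainder for training
print(toy_y_valid, toy_y_train)
```

Because the slices do not overlap, no validation sample can leak into the training set.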

#### Creating a tf dataset for training on a Model with dense layers

In [38]:
train_dataset = tf.data.Dataset.from_tensor_slices((x_train_scaled.reshape(-1, 784), y_train_coded))

In [39]:
valid_dataset = tf.data.Dataset.from_tensor_slices((x_valid_scaled.reshape(-1, 784), y_valid_coded))

In [40]:
test_dataset = tf.data.Dataset.from_tensor_slices((x_test_scaled.reshape(-1,784), y_test_coded))

In [41]:
minibatch =  60 # in the paper

In [42]:
buffer_size = len(y_train)
buffer_size

60000

In [43]:
# shuffle first, batch 2nd, then prefetch
train_dataset = train_dataset.shuffle(buffer_size=buffer_size, seed=random_seed, 
                                      reshuffle_each_iteration=True).batch(batch_size=minibatch, 
                                                                           drop_remainder=True).prefetch(buffer_size=tf.data.experimental.AUTOTUNE)

In [44]:
valid_dataset = valid_dataset.shuffle(buffer_size=buffer_size, seed=random_seed, 
                                      reshuffle_each_iteration=False).batch(batch_size=minibatch).prefetch(buffer_size=tf.data.experimental.AUTOTUNE)

In [45]:
test_dataset = test_dataset.shuffle(buffer_size=buffer_size, seed=random_seed, 
                                      reshuffle_each_iteration=False).batch(batch_size=minibatch).prefetch(buffer_size=tf.data.experimental.AUTOTUNE)

In [46]:
for i in test_dataset.take(1):

    print(i[0][0].shape, '\n')
    print(i[1][0])

(784,) 

tf.Tensor([0. 1. 0. 0. 0. 0. 0. 0. 0. 0.], shape=(10,), dtype=float32)
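
The number of batches each pipeline yields follows from the data set sizes and the drop_remainder setting. A short worked calculation, using the sizes from this notebook (55000 training, 5000 validation and 10000 test samples):

```python
import math

n_train, n_valid, n_test = 55000, 5000, 10000
minibatch = 60

train_steps = n_train // minibatch             # drop_remainder=True discards the partial batch
valid_steps = math.ceil(n_valid / minibatch)   # the partial last batch is kept
test_steps = math.ceil(n_test / minibatch)
last_valid_batch = n_valid - (valid_steps - 1) * minibatch

print(train_steps, valid_steps, test_steps, last_valid_batch)   # 916 84 167 20
```

Keeping every training batch at exactly 60 samples means the batch statistics are always computed over the same number of samples during training.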


### Building a Model using Dense Layers and Batch Norm from Keras

    * Using the model architecture from the paper

In [47]:
tf.keras.backend.clear_session()

In [48]:
input_layer = tf.keras.Input(shape = (784,), name = 'input')

In [49]:
units1, units2, units3 = 100,100,100

In [50]:
kerasdense1 = tf.keras.layers.Dense(units = units1, activation=None, 
                                    kernel_initializer=tf.keras.initializers.Orthogonal(gain=1,seed=random_seed))(input_layer)

In [51]:
# mean and variance are calculated for the minibatch.
# setting momentum = 0 for now. Population statistics will be calculated separately for inference.
# Initializing moving mean and variance using the statistics from the StandardScaler fit

# """bn_layer = tf.keras.layers.BatchNormalization(axis = [-1], momentum=0.0, epsilon=0.001, center=True, scale=True, 
#                                              beta_initializer = tf.keras.initializers.zeros(), 
#                                               gamma_initializer = tf.keras.initializers.ones(), 
#                                              moving_mean_initializer = tf.keras.initializers.zeros(), 
#                                              moving_variance_initializer = tf.keras.initializers.ones(), trainable=True)"""
# a single "bn_layer" object would share its variables wherever it is reused, so every position needs its own batch norm layer. Hence, will use a function

In [52]:
def bn_layer(axis = [-1]):
    
    """
    returns a batch norm layer
    
    Parameters:
    axis: list of integers. Default is [-1], the feature axis. This can be confusing: the axis names the
    dimension that keeps its own statistics, so the mean and variance are computed per feature
    across the minibatch (all other axes).
    
    This will be cross-checked against custom code later.
    
    """
    
    
    return tf.keras.layers.BatchNormalization(axis = axis, momentum=0.99, epsilon=0.001, center=True, scale=True, 
                                             beta_initializer = tf.keras.initializers.zeros(), 
                                              gamma_initializer = tf.keras.initializers.ones(), 
                                             moving_mean_initializer = tf.keras.initializers.zeros(), 
                                             moving_variance_initializer = tf.keras.initializers.ones(), trainable=True)
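
To make the role of momentum concrete, the sketch below applies the moving-average update documented for the Keras layer, moving = moving * momentum + batch_stat * (1 - momentum), to toy mini-batches. The helper name `update_moving` and the toy distribution are assumptions for illustration only.

```python
import numpy as np

def update_moving(moving, batch_stat, momentum=0.99):
    # exponential moving average of a batch statistic
    return moving * momentum + batch_stat * (1.0 - momentum)

rng = np.random.default_rng(1)
true_mean = np.array([2.0, -1.0, 0.5])
moving_mean = np.zeros(3)                # zeros initializer, as in bn_layer
for step in range(916):                  # one epoch of 916 mini-batches
    batch = rng.normal(loc=true_mean, scale=1.0, size=(60, 3))
    batch_mean = batch.mean(axis=0)      # axis 0 is the batch axis: one value per feature
    moving_mean = update_moving(moving_mean, batch_mean)

print(moving_mean)   # approaches true_mean after enough updates
```

With momentum=0.99 each update moves the stored value only 1% of the way toward the current batch statistic, so the moving statistics need several hundred batches to forget their initial values.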

In [53]:
kerasdensebn1 = bn_layer(axis = [-1])(kerasdense1)

In [54]:
activation1 = tf.keras.layers.Activation(activation = tf.nn.tanh)(kerasdensebn1)

In [55]:
kerasdense2 = tf.keras.layers.Dense(units = units2, activation=None, 
                                    kernel_initializer=tf.keras.initializers.Orthogonal(gain=1,seed=random_seed))(activation1)
kerasdensebn2 = bn_layer(axis = [-1])(kerasdense2, training=True)

activation2 = tf.keras.layers.Activation(activation = tf.nn.tanh)(kerasdensebn2)

In [56]:
kerasdense3 = tf.keras.layers.Dense(units = units3, activation=None, 
                                    kernel_initializer=tf.keras.initializers.Orthogonal(gain=1,seed=random_seed))(activation2)
kerasdensebn3 = bn_layer(axis = [-1])(kerasdense3, training=True)
activation3 = tf.keras.layers.Activation(activation = tf.nn.tanh)(kerasdensebn3)

In [57]:
output_layer = tf.keras.layers.Dense(units = 10, activation=None,
                                    kernel_initializer=tf.keras.initializers.Orthogonal(gain=1,seed=random_seed))(activation3)
# no softmax activation

In [58]:
keras_fc_model = tf.keras.Model(inputs = [input_layer], outputs = [output_layer], name = 'kerasmodel1')

In [59]:
keras_fc_model.summary()

Model: "kerasmodel1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input (InputLayer)           [(None, 784)]             0         
_________________________________________________________________
dense (Dense)                (None, 100)               78500     
_________________________________________________________________
batch_normalization (BatchNo (None, 100)               400       
_________________________________________________________________
activation (Activation)      (None, 100)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 100)               10100     
_________________________________________________________________
batch_normalization_1 (Batch (None, 100)               400       
_________________________________________________________________
activation_1 (Activation)    (None, 100)               

In [346]:
# saving model to csv
model_config = keras_fc_model.get_config()
with open('Issues/test.csv', 'w') as f:
    for key in model_config.keys():
        f.write("%s,%s\n"%(key,model_config[key]))

In [348]:
import json

In [349]:
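# saving as json config
# to_json() already returns a JSON string, so it is written directly (json.dump would encode it a second time)
model_config_json = keras_fc_model.to_json()
with open('Issues/model.json', 'w') as f:
    f.write(model_config_json)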

In [60]:
keras_fc_model.losses

[]

In [61]:
len(keras_fc_model.weights)

20
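
The count of 20 weight tensors and the parameter counts in the summary can be checked by hand: each dense layer has a kernel and a bias, and each batch norm layer has gamma, beta, moving mean and moving variance.

```python
dense_shapes = [(784, 100), (100, 100), (100, 100), (100, 10)]
n_bn_layers, bn_features = 3, 100

# kernel (n_in x n_out) plus bias (n_out) per dense layer
dense_params = [n_in * n_out + n_out for n_in, n_out in dense_shapes]
# gamma, beta, moving_mean, moving_variance per batch norm layer
bn_params = 4 * bn_features
# only gamma and beta are updated by gradient descent
bn_gradient_params = 2 * bn_features
n_weight_tensors = 2 * len(dense_shapes) + 4 * n_bn_layers

print(dense_params, bn_params, bn_gradient_params, n_weight_tensors)
```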

In [62]:
# batch norm layer variables
keras_fc_model.layers[2].weights

[<tf.Variable 'batch_normalization/gamma:0' shape=(100,) dtype=float32, numpy=
 array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
        1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
        1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
        1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
        1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
        1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
       dtype=float32)>,
 <tf.Variable 'batch_normalization/beta:0' shape=(100,) dtype=float32, numpy=
 array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.

#### Compiling

In [63]:
keras_fc_model.compile(optimizer =  tf.keras.optimizers.Adam(learning_rate=0.001), 
                      loss = tf.keras.losses.CategoricalCrossentropy(from_logits=True), 
                      metrics = [tf.keras.metrics.CategoricalAccuracy()])

In [64]:
# Using tf.keras.metrics.AUC() gives:

# """InvalidArgumentError:  assertion failed: [predictions must be >= 0] [Condition x >= y did not hold element-wise:] 
# [x (kerasmodel1/dense_3/BiasAdd:0) = ] [[0.7673769 2.2196033 1.78565049...]...] [y (metrics/auc/Cast_1/x:0) = ] [0]
#	 [[{{node metrics/auc/assert_greater_equal/Assert/AssertGuard/else/_1/Assert}}]] [Op:__inference_distributed_function_103831]"""

# To use this metric, the predictions have to be between 0 and 1, which means a softmax needs to be applied to the outputs.
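
Since the model outputs raw logits, the sketch below shows in NumPy how a softmax maps logits to values between 0 and 1, and that cross-entropy computed directly from logits agrees with cross-entropy computed on the softmax output. The function names are illustrative; the first row of logits reuses the three values visible in the error message above, the second row is made up.

```python
import numpy as np

def softmax(logits):
    # subtract the row maximum for numerical stability
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def cross_entropy_from_logits(y_onehot, logits):
    # log-softmax computed directly from the logits
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -(y_onehot * log_probs).sum(axis=1).mean()

logits = np.array([[0.7673769, 2.2196033, 1.78565049],
                   [-1.2, 0.3, 4.0]])
y_onehot = np.array([[0.0, 1.0, 0.0],
                     [0.0, 0.0, 1.0]])

probs = softmax(logits)
loss_from_logits = cross_entropy_from_logits(y_onehot, logits)
loss_from_probs = -(y_onehot * np.log(probs)).sum(axis=1).mean()
print(probs.sum(axis=1), loss_from_logits, loss_from_probs)
```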
                      
   

In [65]:
keras_fc_model.name

'kerasmodel1'

#### Callbacks

##### Learning rate
    * The Adam optimizer already adapts the step size per parameter, so no learning-rate callback is added

##### Not setting EarlyStopping

##### Save best model

In [66]:
import os

In [67]:
keras_savedmodels = 'keras_savedmodels'

# create the directory if it does not exist yet
os.makedirs(keras_savedmodels, exist_ok=True)

In [68]:
cb_savemodel = tf.keras.callbacks.ModelCheckpoint(os.path.join(keras_savedmodels, 'model_{epoch}-{val_loss:.3f}.h5'), mode = 'min',
                                                  monitor = 'val_loss',
                                                 verbose = 1, save_best_only=True)

##### Tensorboard

In [69]:
keras_models_logs = 'keras_models_logs'

# create the directory if it does not exist yet
os.makedirs(keras_models_logs, exist_ok=True)

In [70]:
cb_tboard = tf.keras.callbacks.TensorBoard(log_dir=keras_models_logs,histogram_freq=1, write_graph=True,
                                          write_images=True)

In [71]:
cblist = [cb_savemodel, cb_tboard]

#### Fitting

In [72]:
epochs1=10

In [73]:
keras_history1 = keras_fc_model.fit(train_dataset, epochs=epochs1, verbose = 1, callbacks=cblist,
                                   validation_data=valid_dataset, shuffle = True)

Train for 916 steps, validate for 84 steps
Epoch 1/10
Epoch 00001: val_loss improved from inf to 0.21771, saving model to keras_savedmodels/model_1-0.218.h5
Epoch 2/10
Epoch 00002: val_loss improved from 0.21771 to 0.16106, saving model to keras_savedmodels/model_2-0.161.h5
Epoch 3/10
Epoch 00003: val_loss improved from 0.16106 to 0.15292, saving model to keras_savedmodels/model_3-0.153.h5
Epoch 4/10
Epoch 00004: val_loss improved from 0.15292 to 0.13310, saving model to keras_savedmodels/model_4-0.133.h5
Epoch 5/10
Epoch 00005: val_loss improved from 0.13310 to 0.13114, saving model to keras_savedmodels/model_5-0.131.h5
Epoch 6/10
Epoch 00006: val_loss improved from 0.13114 to 0.12137, saving model to keras_savedmodels/model_6-0.121.h5
Epoch 7/10
Epoch 00007: val_loss did not improve from 0.12137
Epoch 8/10
Epoch 00008: val_loss did not improve from 0.12137
Epoch 9/10
Epoch 00009: val_loss did not improve from 0.12137
Epoch 10/10
Epoch 00010: val_loss improved from 0.12137 to 0.11878,

#### Test data

In [80]:
keras_fc_model.evaluate(test_dataset)



[0.11143299131751797, 0.9711]

### Testing

In [337]:
# Taking the first batch in the test dataset
for i in test_dataset.take(1):
    test_x = i[0]
    test_y = i[1]
    print(tf.argmax(i[1][-1]))

tf.Tensor(7, shape=(), dtype=int64)


In [104]:
test_x.shape

TensorShape([60, 784])

In [106]:
test_prediction = keras_fc_model.predict(test_x, batch_size=60)

In [107]:
test_prediction.shape

(60, 10)

In [108]:
test_labels = tf.argmax(test_prediction, axis=1)

In [109]:
test_labels

<tf.Tensor: shape=(60,), dtype=int64, numpy=
array([1, 9, 7, 6, 1, 9, 5, 9, 1, 6, 2, 2, 4, 4, 7, 1, 4, 5, 8, 6, 4, 3,
       3, 5, 6, 1, 2, 5, 9, 5, 4, 8, 2, 5, 3, 3, 4, 5, 0, 1, 2, 4, 2, 1,
       2, 8, 3, 2, 1, 1, 8, 3, 8, 4, 8, 3, 6, 5, 4, 7])>

In [120]:
# call output for input of batch size = 1
keras_fc_model(tf.reshape(test_x[-1], shape=(1,-1)), training = False)

<tf.Tensor: shape=(1, 10), dtype=float32, numpy=
array([[-1.6399686 , -1.8865588 ,  0.6251957 ,  2.1267636 , -1.6664417 ,
        -0.23559336, -1.7577692 , -0.3431458 ,  2.1472526 ,  0.9643423 ]],
      dtype=float32)>

In [123]:
# different output using bs=60
test_prediction[-1]

array([-0.47502184, -2.1236982 ,  0.6776048 ,  0.5870802 , -4.3017907 ,
        1.2738266 , -7.350608  , 16.93022   , -6.412077  ,  0.8857768 ],
      dtype=float32)

In [121]:
# output label
tf.argmax(keras_fc_model(tf.reshape(test_x[-1], shape=(1,-1)), training = False), axis=1)

<tf.Tensor: shape=(1,), dtype=int64, numpy=array([8])>

In [122]:
# setting training = True does not change the output
keras_fc_model(tf.reshape(test_x[-1], shape=(1,-1)), training = True)

<tf.Tensor: shape=(1, 10), dtype=float32, numpy=
array([[-1.6399686 , -1.8865588 ,  0.6251957 ,  2.1267636 , -1.6664417 ,
        -0.23559336, -1.7577692 , -0.3431458 ,  2.1472526 ,  0.9643423 ]],
      dtype=float32)>

    * No change in model output between training = True and False

In [124]:
keras_fc_model.state_updates

[]

In [74]:
keras_fc_model.layers

[<tensorflow.python.keras.engine.input_layer.InputLayer at 0x7f33f44998d0>,
 <tensorflow.python.keras.layers.core.Dense at 0x7f340c27a8d0>,
 <tensorflow.python.keras.layers.normalization_v2.BatchNormalization at 0x7f340c228dd8>,
 <tensorflow.python.keras.layers.core.Activation at 0x7f340c228f60>,
 <tensorflow.python.keras.layers.core.Dense at 0x7f340c1faeb8>,
 <tensorflow.python.keras.layers.normalization_v2.BatchNormalization at 0x7f340c1fe048>,
 <tensorflow.python.keras.layers.core.Activation at 0x7f340c1fe2b0>,
 <tensorflow.python.keras.layers.core.Dense at 0x7f340c1a4d68>,
 <tensorflow.python.keras.layers.normalization_v2.BatchNormalization at 0x7f340c1ac198>,
 <tensorflow.python.keras.layers.core.Activation at 0x7f340c1ac438>,
 <tensorflow.python.keras.layers.core.Dense at 0x7f340c1fab70>]

### Exploring the weights of the 1st dense and the following bnorm layer to see what's happening with the variables

In [258]:
del keras_fc_model
tf.keras.backend.clear_session()
gc.collect()

# loading the best model
keras_fc_model = tf.keras.models.load_model(
    os.path.join(keras_savedmodels,'model_10-0.119.h5'), compile=False)

In [222]:
keras_fc_model.trainable

True

In [223]:
for layer in keras_fc_model.layers:
    print(layer.trainable)

True
True
True
True
True
True
True
True
True
True
True


    * All layers are trainable

In [224]:
keras_fc_model.summary()

Model: "kerasmodel1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input (InputLayer)           [(None, 784)]             0         
_________________________________________________________________
dense (Dense)                (None, 100)               78500     
_________________________________________________________________
batch_normalization (BatchNo (None, 100)               400       
_________________________________________________________________
activation (Activation)      (None, 100)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 100)               10100     
_________________________________________________________________
batch_normalization_1 (Batch (None, 100)               400       
_________________________________________________________________
activation_1 (Activation)    (None, 100)               

In [259]:
# Copying the weights of the 1st Batch Norm layer before calling or prediction
bn1wghts_before = [i.numpy() for i in keras_fc_model.layers[2].weights]

    * Doing "layer.weights.copy()" does not work: it copies only the list, which still references the same weight variables.
      Using the copy module also does not work. The right way is to take a snapshot with the .numpy() method, or to convert it to a list with .numpy().tolist()
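
The difference between copying the list and taking a snapshot of the values can be sketched with NumPy arrays standing in for the layer's variables. The analogy is an assumption: the arrays only mimic objects that are updated in place.

```python
import numpy as np

# stand-ins for the layer's variables: mutable arrays updated in place
weights = [np.ones(3), np.zeros(3)]

shallow = weights.copy()                  # new list, same array objects
snapshot = [w.copy() for w in weights]    # independent copies of the values

weights[0] += 1.0                         # in-place update, like a variable assignment

print(shallow[0])    # follows the update
print(snapshot[0])   # keeps the old values
```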

In [260]:
type(bn1wghts_before)

list

In [261]:
bn1wghts_before

[array([1.2269483, 1.1349252, 1.3410976, 0.9179288, 1.2856462, 1.3862823,
        1.3595926, 1.2745571, 1.5197902, 1.3166919, 1.4408834, 1.2638034,
        1.2629784, 1.2740654, 1.415783 , 1.4326637, 1.4493729, 1.3970267,
        1.3195894, 1.3734947, 1.3420596, 1.1721978, 1.2968619, 1.5249897,
        1.0370381, 1.3912894, 1.5404277, 1.3771281, 1.4850557, 1.3972007,
        1.4163164, 1.1762176, 1.1790816, 1.3884237, 1.3342997, 1.303082 ,
        1.1480755, 1.2126306, 1.2919899, 1.1597459, 1.485534 , 1.2900969,
        1.221908 , 1.2401632, 1.4433057, 1.4404896, 1.508135 , 1.3095189,
        1.4880893, 1.3983922, 1.2017698, 1.2355157, 1.4851977, 1.5640281,
        1.5173646, 1.3068354, 1.62453  , 1.5176066, 1.4924428, 1.2629938,
        1.327193 , 1.4941337, 1.6335568, 1.4370104, 1.5315871, 1.3623064,
        1.284233 , 1.1915743, 1.4893486, 1.3606577, 1.1597953, 1.3838465,
        1.2619832, 1.2759168, 1.5620822, 1.2270515, 1.5284364, 1.2064832,
        1.3470151, 1.4076749, 1.360958

In [262]:
# running a prediction bs=1
keras_fc_model.predict(test_x[-1].numpy().reshape(1,-1), batch_size=1)

array([[-1.6399686 , -1.8865588 ,  0.6251957 ,  2.1267636 , -1.6664417 ,
        -0.23559336, -1.7577692 , -0.3431458 ,  2.1472526 ,  0.9643423 ]],
      dtype=float32)

In [235]:
keras_fc_model.reset_states()  # does nothing as there are no "states"

In [263]:
tf.equal(bn1wghts_before, keras_fc_model.layers[2].weights)

<tf.Tensor: shape=(4, 100), dtype=bool, numpy=
array([[ True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,
         True],
       [ True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,

In [265]:
# Checking if the weights of the bn1 layer changed  
tf.reduce_all(tf.equal(bn1wghts_before, keras_fc_model.layers[2].weights), axis=1)

<tf.Tensor: shape=(4,), dtype=bool, numpy=array([ True,  True, False, False])>

    * So the moving mean and variance (the last two weights) have changed, while gamma and beta have not.
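
A per-weight comparison keyed by name makes this kind of check easier to read, and it also works when the weights have different shapes (for example a dense kernel and its bias), where stacking them into one tensor would fail. The values below are hypothetical and only mimic the pattern observed above.

```python
import numpy as np

names = ["gamma", "beta", "moving_mean", "moving_variance"]
before = [np.ones(5), np.zeros(5), np.full(5, 0.2), np.full(5, 1.5)]
# hypothetical "after" state: only the moving statistics were updated
after = [before[0].copy(), before[1].copy(), before[2] * 0.99 + 0.01, before[3] * 0.99]

# True means the weight is identical before and after
unchanged = {n: bool(np.array_equal(b, a)) for n, b, a in zip(names, before, after)}
print(unchanged)
```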

#### Setting the trainable property of the layers to False

In [267]:
keras_fc_model.trainable = False

In [268]:
for layer in keras_fc_model.layers:
    print(layer.trainable)

False
False
False
False
False
False
False
False
False
False
False


In [269]:
# again, saving the weights of the 1st bn layer before calling/prediction
bn1wghts_before = [i.numpy() for i in keras_fc_model.layers[2].weights]

In [270]:
# moving variance weights
bn1wghts_before[-1]

array([20.866217, 16.674694, 20.625671, 23.798456, 39.635723, 27.115152,
       21.22787 , 34.49331 , 33.87815 , 16.109518, 23.215616, 17.046965,
       30.134575, 20.38069 , 27.3178  , 20.839792, 22.857933, 19.267096,
       39.58637 , 22.35907 , 26.873194, 23.787848, 18.168957, 22.845085,
       15.745033, 21.828392, 25.764683, 20.84526 , 22.719908, 24.619804,
       26.2426  , 20.258219, 15.888993, 17.197428, 36.061516, 25.197315,
       25.331924, 22.368452, 16.280714, 25.014906, 24.380272, 22.902052,
       15.686818, 15.817322, 26.417524, 18.015802, 24.712282, 28.918398,
       33.78847 , 23.302309, 23.56081 , 20.972233, 28.08964 , 26.214094,
       16.348656, 19.581034, 27.686527, 19.609587, 23.421644, 25.146906,
       28.153814, 24.018486, 26.390099, 18.504633, 19.971884, 14.986987,
       14.20498 , 17.22383 , 25.607748, 19.73726 , 23.27182 , 16.208055,
       20.27844 , 21.736525, 67.823204, 18.464857, 31.59728 , 16.72421 ,
       25.090193, 25.58346 , 22.373672, 24.414333, 

In [271]:
# prediction on input with bs=1
keras_fc_model.predict(test_x[-2].numpy().reshape(1,-1))

array([[-1.6399693 , -1.886558  ,  0.62519604,  2.1267643 , -1.6664422 ,
        -0.23559408, -1.757769  , -0.34314552,  2.1472523 ,  0.9643422 ]],
      dtype=float32)

In [272]:
# 1st bn layer moving variance weights after prediction
keras_fc_model.layers[2].weights[-1]

<tf.Variable 'batch_normalization/moving_variance:0' shape=(100,) dtype=float32, numpy=
array([20.866217, 16.674694, 20.625671, 23.798456, 39.635723, 27.115152,
       21.22787 , 34.49331 , 33.87815 , 16.109518, 23.215616, 17.046965,
       30.134575, 20.38069 , 27.3178  , 20.839792, 22.857933, 19.267096,
       39.58637 , 22.35907 , 26.873194, 23.787848, 18.168957, 22.845085,
       15.745033, 21.828392, 25.764683, 20.84526 , 22.719908, 24.619804,
       26.2426  , 20.258219, 15.888993, 17.197428, 36.061516, 25.197315,
       25.331924, 22.368452, 16.280714, 25.014906, 24.380272, 22.902052,
       15.686818, 15.817322, 26.417524, 18.015802, 24.712282, 28.918398,
       33.78847 , 23.302309, 23.56081 , 20.972233, 28.08964 , 26.214094,
       16.348656, 19.581034, 27.686527, 19.609587, 23.421644, 25.146906,
       28.153814, 24.018486, 26.390099, 18.504633, 19.971884, 14.986987,
       14.20498 , 17.22383 , 25.607748, 19.73726 , 23.27182 , 16.208055,
       20.27844 , 21.736525, 67.8232

In [273]:
tf.equal(bn1wghts_before[-1],keras_fc_model.layers[2].weights[-1] )

<tf.Tensor: shape=(100,), dtype=bool, numpy=
array([ True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True])>

    * With trainable=False, the moving mean and variance have not changed.

In [275]:
# Loading the best model again
del keras_fc_model
tf.keras.backend.clear_session()
gc.collect()

keras_fc_model = tf.keras.models.load_model(
    os.path.join(keras_savedmodels,'model_10-0.119.h5'), compile=False)

In [276]:
keras_fc_model.trainable = False

In [296]:
for layer in keras_fc_model.layers:
    print(layer.trainable)

False
False
False
False
False
False
False
False
False
False
False


In [277]:
# copying the weights of the 1st dense and bn layers
dns1wght_before = [i.numpy() for i in keras_fc_model.layers[1].weights]
bn1wghts_before = [i.numpy() for i in keras_fc_model.layers[2].weights]

In [416]:
dns1wght_before[0].shape

(784, 100)

In [417]:
dns1wght_before[1].shape

(100,)

In [278]:
# prediction with bs=60
tf.argmax(keras_fc_model.predict(test_x), axis=1)

<tf.Tensor: shape=(60,), dtype=int64, numpy=
array([1, 9, 7, 6, 1, 9, 5, 9, 1, 6, 2, 2, 4, 4, 7, 1, 4, 5, 8, 6, 4, 3,
       3, 5, 6, 1, 2, 5, 9, 5, 4, 8, 2, 5, 3, 3, 4, 5, 0, 1, 2, 4, 2, 1,
       2, 8, 3, 2, 1, 1, 8, 3, 8, 4, 8, 3, 6, 5, 4, 7])>

In [274]:
test_labels # from prediction immediately after fitting (above)

<tf.Tensor: shape=(60,), dtype=int64, numpy=
array([1, 9, 7, 6, 1, 9, 5, 9, 1, 6, 2, 2, 4, 4, 7, 1, 4, 5, 8, 6, 4, 3,
       3, 5, 6, 1, 2, 5, 9, 5, 4, 8, 2, 5, 3, 3, 4, 5, 0, 1, 2, 4, 2, 1,
       2, 8, 3, 2, 1, 1, 8, 3, 8, 4, 8, 3, 6, 5, 4, 7])>

In [339]:
tf.argmax(test_y, axis=1) # true labels.

<tf.Tensor: shape=(60,), dtype=int64, numpy=
array([1, 9, 7, 6, 1, 9, 5, 9, 1, 6, 2, 2, 4, 4, 7, 1, 4, 5, 8, 6, 4, 3,
       3, 5, 6, 1, 2, 5, 9, 5, 9, 8, 2, 5, 3, 3, 4, 5, 0, 1, 2, 4, 2, 1,
       2, 8, 3, 2, 1, 1, 8, 3, 8, 4, 8, 3, 6, 5, 4, 7])>

In [285]:
# copying the weights of the 1st dense and bn layers after prediction
dns1wght_afterpred = [i.numpy() for i in keras_fc_model.layers[1].weights]
bn1wghts_afterpred = [i.numpy() for i in keras_fc_model.layers[2].weights]

In [289]:
# comparing the weights of the dense layer before and after prediction. No change.
tf.reduce_all(tf.equal(dns1wght_before[0], dns1wght_afterpred[0]))

<tf.Tensor: shape=(), dtype=bool, numpy=True>

In [290]:
# comparing the weights of the bn layer before and after prediction. No change, because trainable was set to False
tf.reduce_all(tf.equal(bn1wghts_before, bn1wghts_afterpred), axis=1)

<tf.Tensor: shape=(4,), dtype=bool, numpy=array([ True,  True,  True,  True])>

In [340]:
# Predicting on each sample individually to compare against running on a batch
test_out = []
for i in range(test_x.shape[0]):
    # print(i.numpy().mean())
    test_out.append(tf.argmax(keras_fc_model.predict(test_x[i:i+1]), axis=1))

test_out

[<tf.Tensor: shape=(1,), dtype=int64, numpy=array([8])>,
 <tf.Tensor: shape=(1,), dtype=int64, numpy=array([8])>,
 <tf.Tensor: shape=(1,), dtype=int64, numpy=array([8])>,
 <tf.Tensor: shape=(1,), dtype=int64, numpy=array([8])>,
 <tf.Tensor: shape=(1,), dtype=int64, numpy=array([8])>,
 <tf.Tensor: shape=(1,), dtype=int64, numpy=array([8])>,
 <tf.Tensor: shape=(1,), dtype=int64, numpy=array([8])>,
 <tf.Tensor: shape=(1,), dtype=int64, numpy=array([8])>,
 <tf.Tensor: shape=(1,), dtype=int64, numpy=array([8])>,
 <tf.Tensor: shape=(1,), dtype=int64, numpy=array([8])>,
 <tf.Tensor: shape=(1,), dtype=int64, numpy=array([8])>,
 <tf.Tensor: shape=(1,), dtype=int64, numpy=array([8])>,
 <tf.Tensor: shape=(1,), dtype=int64, numpy=array([8])>,
 <tf.Tensor: shape=(1,), dtype=int64, numpy=array([8])>,
 <tf.Tensor: shape=(1,), dtype=int64, numpy=array([8])>,
 <tf.Tensor: shape=(1,), dtype=int64, numpy=array([8])>,
 <tf.Tensor: shape=(1,), dtype=int64, numpy=array([8])>,
 <tf.Tensor: shape=(1,), dtype=

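    * The output does not make sense: every sample produces the same prediction. This suggests that, despite trainable=False,
      the batch norm layers still use the batch mean and batch variance, not the moving mean and moving variance, to compute the output.

    * A likely cause, not verified here, is that the second and third batch norm layers were called with training=True when the model
      was built (In [55] and In [56]), which keeps them in training mode on every call. With a single sample, the batch mean equals
      the sample and the batch variance is zero, so the normalized value is zero and the layer returns beta regardless of the input.

    * In this model, a batch size greater than 1 is therefore needed for meaningful predictions. Even then, the model uses the batch mean
      and batch variance rather than the population mean and variance. If the bn layer's trainable is True, the moving mean and variance,
      i.e. the population statistics, are updated; otherwise they are not.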
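
Why a single sample gives an input-independent result can be shown with a small NumPy sketch of normalization with batch statistics. It is a simplified stand-in for the Keras layer, and the helper name `bn_batch_stats` is an assumption.

```python
import numpy as np

def bn_batch_stats(x, gamma, beta, epsilon=1e-3):
    # normalization with the statistics of the batch itself
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    return gamma * (x - mean) / np.sqrt(var + epsilon) + beta

rng = np.random.default_rng(2)
gamma = rng.uniform(0.9, 1.6, size=4)
beta = rng.normal(size=4)

# two very different samples, each fed as a batch of size 1
sample_a = rng.normal(size=(1, 4))
sample_b = rng.normal(size=(1, 4)) * 10.0

out_a = bn_batch_stats(sample_a, gamma, beta)
out_b = bn_batch_stats(sample_b, gamma, beta)

print(np.array_equal(out_a, out_b))   # True: both outputs equal beta
```

With a batch of one, x - mean is exactly zero, so everything downstream of such a layer sees the same activations whatever the input, which matches the constant prediction of '8' observed above.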

In [318]:
# Setting the layers as trainable
keras_fc_model.trainable = True
for layer in keras_fc_model.layers:
    print(layer.trainable)

True
True
True
True
True
True
True
True
True
True
True


In [319]:
# on a batch of size = 60
tf.argmax(keras_fc_model(test_x), axis=1)

<tf.Tensor: shape=(60,), dtype=int64, numpy=
array([1, 9, 7, 6, 1, 9, 5, 9, 1, 6, 2, 2, 4, 4, 7, 1, 4, 5, 8, 6, 4, 3,
       3, 5, 6, 1, 2, 5, 9, 5, 4, 8, 2, 5, 3, 3, 4, 5, 0, 1, 2, 4, 2, 1,
       2, 8, 3, 2, 1, 1, 8, 3, 8, 4, 8, 3, 6, 5, 4, 7])>

In [329]:
# on a batch of size = 5, predictions remain similar
tf.argmax(keras_fc_model(test_x[55:]), axis=1)

<tf.Tensor: shape=(5,), dtype=int64, numpy=array([3, 6, 5, 4, 7])>

In [330]:
# on a batch of size = 3, predictions remain similar
tf.argmax(keras_fc_model(test_x[57:]), axis=1)

<tf.Tensor: shape=(3,), dtype=int64, numpy=array([5, 4, 7])>

In [331]:
# on a batch of size = 2, prediction accuracy is 50%
tf.argmax(keras_fc_model(test_x[58:]), axis=1)

<tf.Tensor: shape=(2,), dtype=int64, numpy=array([4, 3])>

In [332]:
tf.argmax(keras_fc_model(test_x[59:]), axis=1)

<tf.Tensor: shape=(1,), dtype=int64, numpy=array([8])>

In [304]:
# on a batch of size = 1, prediction is as before, spitting out '8' for all inputs
tf.argmax(keras_fc_model(tf.expand_dims(test_x[-1], axis=0)), axis=1)

<tf.Tensor: shape=(1,), dtype=int64, numpy=array([8])>

In [320]:
# Checking if the weights of the 1st bn layer have changed
tf.reduce_all(tf.equal(bn1wghts_before, keras_fc_model.layers[2].weights), axis=1)

<tf.Tensor: shape=(4,), dtype=bool, numpy=array([ True,  True, False, False])>

### So what is happening when using only one sample?

In [350]:
test_sample = test_x[-1]
test_sample.shape

TensorShape([784])

In [351]:
# reshaping into bs x input = 1 x 784
test_sample = tf.expand_dims(test_sample, axis=0)
test_sample.shape

TensorShape([1, 784])

In [352]:
x = keras_fc_model.layers[1](test_sample)

In [353]:
x

<tf.Tensor: shape=(1, 100), dtype=float32, numpy=
array([[ 7.2454110e-02,  1.6121851e-01, -5.9279698e-01, -3.5625021e+00,
         5.6486158e+00,  1.8131368e+00,  2.2250998e+00, -9.2394991e+00,
         6.8580999e+00, -3.2521808e+00, -3.8725848e+00, -4.8101954e+00,
         7.5291615e+00, -1.2260950e+01, -2.0900023e+00,  1.2195963e+00,
         6.7551619e-01,  8.5417891e+00, -4.6665339e+00,  4.9510393e+00,
        -3.3225839e+00,  5.9273677e+00,  1.6514000e+00, -3.2149615e+00,
        -6.2358837e+00, -1.3371571e+01,  3.7234601e-01, -6.7863178e+00,
         1.5272391e+00, -7.8243756e+00,  5.1648498e+00,  7.5447268e+00,
         1.5534180e+00,  2.4607325e+00, -7.1062841e+00,  2.2235289e+00,
         5.1989765e+00, -2.8489923e+00,  3.7170084e+00, -4.5380349e+00,
         3.2168820e+00,  2.8787351e-01,  1.9058392e+00,  4.1891546e+00,
         5.2643147e+00,  2.4832835e+00,  5.3992143e+00, -9.1668167e+00,
        -5.2137561e+00,  1.1860637e+01,  4.8817563e+00, -7.7666149e+00,
        -5.140

In [354]:
bn1wghts_before = [i.numpy() for i in keras_fc_model.layers[2].weights]

In [355]:
x_bn = keras_fc_model.layers[2](x, training = False)
x_bn

<tf.Tensor: shape=(1, 100), dtype=float32, numpy=
array([[ 0.15085383,  0.31743085, -0.16629188, -0.65695924,  0.97057295,
         0.3213714 ,  0.5426107 , -2.1190796 ,  1.9093887 , -1.4385841 ,
        -1.6424987 , -1.710846  ,  1.8375623 , -4.998596  , -1.2291566 ,
         0.42653888,  0.29996243,  3.8062656 , -0.8616546 ,  2.1698096 ,
        -2.015799  ,  1.3508562 ,  1.2605891 , -1.0878319 , -2.2595797 ,
        -5.1802454 ,  0.01518461, -2.780314  ,  0.22559097, -2.7845726 ,
         1.7277735 ,  2.6498253 ,  0.5636972 ,  1.3715785 , -2.3905523 ,
         0.8962747 ,  1.680484  , -0.9752423 ,  1.486685  , -2.0054598 ,
         0.94583297,  0.6109275 ,  0.54913896,  2.0541725 ,  1.6639935 ,
         0.944753  ,  2.1851091 , -2.839921  , -1.5892298 ,  4.1117544 ,
         1.4021511 , -2.4831624 , -1.4514474 ,  1.4872153 ,  3.000867  ,
        -3.6977327 , -3.2329707 ,  2.2383838 ,  2.717575  , -0.5416679 ,
        -0.55600256, -0.22708406,  1.176469  , -1.3580976 , -2.952281  ,
 

In [356]:
bn1wghts_aftercall = [i.numpy() for i in keras_fc_model.layers[2].weights]

In [358]:
tf.reduce_all(tf.equal(bn1wghts_aftercall, bn1wghts_before), axis=1)

<tf.Tensor: shape=(4,), dtype=bool, numpy=array([ True,  True,  True,  True])>

    * No change in the moving mean/var, presumably because training=False was passed in the layer call.
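
    * The before/after comparison can also be written per weight array with plain NumPy. This is a sketch with hypothetical snapshots; the helper
      weights_unchanged and the toy arrays are not part of the notebook. The order gamma, beta, moving_mean, moving_variance is the order in which a
      Keras BatchNormalization layer lists its weights.

```python
import numpy as np

def weights_unchanged(before, after):
    """Return one boolean per weight array: True if it is element-wise identical."""
    return [bool(np.array_equal(b, a)) for b, a in zip(before, after)]

# Hypothetical snapshots in the order gamma, beta, moving_mean, moving_variance
before = [np.ones(3), np.zeros(3), np.zeros(3), np.ones(3)]
after = [w.copy() for w in before]
after[2] = after[2] + 0.05   # pretend the moving mean was updated
after[3] = after[3] * 0.98   # pretend the moving variance was updated

print(weights_unchanged(before, before))  # nothing changed
print(weights_unchanged(before, after))   # only the running stats changed
```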

In [359]:
x_activ = keras_fc_model.layers[3](x_bn)
x_activ.shape

TensorShape([1, 100])

In [360]:
x_activ

<tf.Tensor: shape=(1, 100), dtype=float32, numpy=
array([[ 0.14971982,  0.307182  , -0.16477582, -0.57633626,  0.74895597,
         0.31074643,  0.49496162, -0.9715424 ,  0.957034  , -0.8934123 ,
        -0.92782104, -0.9367512 ,  0.95056057, -0.999909  , -0.84233457,
         0.40242475,  0.2912782 ,  0.9990122 , -0.6971092 ,  0.97425276,
        -0.96512705,  0.8742553 ,  0.85122645, -0.7960855 , -0.9784387 ,
        -0.9999367 ,  0.01518344, -0.9923367 ,  0.22184041, -0.9924014 ,
         0.9387922 ,  0.990063  ,  0.5107154 ,  0.87905145, -0.98336613,
         0.71447915,  0.9329244 , -0.750999  ,  0.90271294, -0.96441114,
         0.7378908 ,  0.54477966,  0.4998746 ,  0.9676615 ,  0.93075305,
         0.7373984 ,  0.975019  , -0.993195  , -0.92003113,  0.99946356,
         0.8858157 , -0.9861591 , -0.8959787 ,  0.902811  ,  0.9950633 ,
        -0.9987726 , -0.99689376,  0.97751546,  0.9913167 , -0.49424943,
        -0.5050055 , -0.22325954,  0.826335  , -0.87595093, -0.9945609 ,
 

In [363]:
keras_fc_model.layers[3].activation

<function tensorflow.python.keras.activations.tanh(x)>

In [364]:
x2 = keras_fc_model.layers[4](x_activ)
x2.shape

TensorShape([1, 100])

In [365]:
x2_bn = keras_fc_model.layers[5](x2, training = False)
x2_bn.shape

TensorShape([1, 100])

In [366]:
x2_bn

<tf.Tensor: shape=(1, 100), dtype=float32, numpy=
array([[ 1.4469844 ,  0.81933117,  0.08188009,  5.5216475 ,  1.6466796 ,
         3.1818552 ,  6.1577783 , -6.268153  ,  0.13917148,  2.8146408 ,
         1.6124916 ,  2.125453  ,  1.1283283 , -0.6871972 ,  0.27393034,
         1.8312399 , -0.2513807 ,  0.63431036,  2.1236696 ,  2.640796  ,
        -1.4139752 , -1.0399275 , -0.04609195,  0.09193619, -1.9930353 ,
         0.9907995 , -0.5948348 ,  3.446775  ,  3.2425377 ,  0.26977125,
        -1.5345777 ,  2.3626957 , -0.34507048,  0.826348  , -3.725133  ,
         2.1427503 ,  3.036732  , -2.370662  , -1.422682  ,  1.0785854 ,
        -0.8406631 , -1.0621421 , -1.2442622 , -3.2096956 , -0.37212515,
         0.9509739 ,  0.20337313,  2.1173682 ,  1.1640146 , -3.1060717 ,
         2.06516   , -1.8573284 , -4.5774794 ,  2.2338474 ,  1.2312582 ,
        -4.235283  , -1.2487199 , -3.3013825 ,  2.0299222 , -0.6114552 ,
         1.9935496 ,  0.41533   , -4.1426725 ,  1.5959333 , -1.4516155 ,
 

In [367]:
x2_activ = keras_fc_model.layers[6](x2_bn)
x2_activ

<tf.Tensor: shape=(1, 100), dtype=float32, numpy=
array([[ 0.895095  ,  0.6747057 ,  0.08169758,  0.99996805,  0.92840064,
         0.99656   ,  0.9999912 , -0.99999285,  0.13827984,  0.99284345,
         0.9235274 ,  0.9718979 ,  0.8104464 , -0.5961785 ,  0.2672782 ,
         0.94994724, -0.2462161 ,  0.56101316,  0.97179884,  0.9898829 ,
        -0.8883353 , -0.77785945, -0.04605933,  0.09167803, -0.9635322 ,
         0.7577031 , -0.53336394,  0.99797344,  0.9969525 ,  0.26341194,
        -0.9112047 ,  0.9824213 , -0.33199653,  0.67851025, -0.99883795,
         0.9728405 ,  0.9954043 , -0.9826969 , -0.89015716,  0.79267395,
        -0.68616015, -0.7864824 , -0.8466669 , -0.996746  , -0.35584915,
         0.7402237 ,  0.20061485,  0.97144616,  0.82234395, -0.9959981 ,
         0.9683534 , -0.9524313 , -0.99978864,  0.97731274,  0.8429439 ,
        -0.9995809 , -0.84792435, -0.99729043,  0.9660816 , -0.5451507 ,
         0.9635689 ,  0.39298886, -0.99949574,  0.9210541 , -0.89601177,
 

In [368]:
keras_fc_model.layers[6].activation

<function tensorflow.python.keras.activations.tanh(x)>

In [369]:
x3 = keras_fc_model.layers[7](x2_activ)

In [370]:
x3_bn = keras_fc_model.layers[8](x3, training = False)
x3_bn

<tf.Tensor: shape=(1, 100), dtype=float32, numpy=
array([[-2.6575303 ,  4.0253205 , -4.5723295 , -0.8328315 , -2.1877398 ,
         5.7970123 , -2.1550064 ,  1.343811  ,  4.4645624 , -3.1028416 ,
        -0.01754965, -0.06544566, -2.495442  , -0.9473952 , -3.1625395 ,
        -4.8681808 ,  5.214006  ,  3.11476   , -2.796949  ,  4.2898436 ,
        -2.6451263 , -5.971562  ,  2.9875793 ,  2.7249525 , -2.207279  ,
         3.0586452 , -3.7173393 , -5.3977556 , -2.6516745 , -3.4322805 ,
        -1.1940911 , -0.8222678 ,  3.7170627 , -6.393489  , -1.188656  ,
         2.353908  , -3.9905925 , -3.1419573 ,  2.200204  ,  4.343115  ,
         3.1492703 , -5.1185412 , -1.9381793 , -0.46158677,  2.03314   ,
         0.10080343, -0.5024499 ,  1.4315917 ,  2.958827  ,  4.1813436 ,
         5.5145955 , -3.4113038 ,  0.01222157, -3.2777324 , -0.01305201,
         0.6775641 ,  1.1302983 ,  2.2492542 , -3.693972  , -2.5032449 ,
         4.2869644 ,  2.4097655 ,  0.623179  ,  2.629727  , -0.61606425,
 

In [371]:
x3_activ = keras_fc_model.layers[9](x3_bn)
x3_activ

<tf.Tensor: shape=(1, 100), dtype=float32, numpy=
array([[-0.99021417,  0.99936235, -0.9997865 , -0.68199354, -0.97514844,
         0.9999816 , -0.97348946,  0.8725846 ,  0.99973506, -0.9959723 ,
        -0.01754784, -0.06535237, -0.9864926 , -0.73860157, -0.99642485,
        -0.9998818 ,  0.9999408 ,  0.99606705, -0.9925866 ,  0.9996244 ,
        -0.98996955, -0.9999869 ,  0.99493057,  0.9914434 , -0.9760894 ,
         0.9956008 , -0.9988198 , -0.99995905, -0.99009955, -0.9979139 ,
        -0.83184344, -0.6763024 ,  0.9988191 , -0.99999446, -0.8301616 ,
         0.9821125 , -0.9993166 , -0.99627477,  0.97575283,  0.9996622 ,
         0.99632883, -0.9999283 , -0.9593894 , -0.43137658,  0.9662956 ,
         0.10046338, -0.46404174,  0.8919924 ,  0.9946314 ,  0.9995333 ,
         0.9999677 , -0.99782467,  0.01222096, -0.9971594 , -0.01305127,
         0.5899336 ,  0.81112134,  0.97799367, -0.99876344, -0.9867003 ,
         0.99962217,  0.98398817,  0.5533376 ,  0.9896575 , -0.54838175,
 

In [372]:
keras_fc_model.layers[9].activation

<function tensorflow.python.keras.activations.tanh(x)>

In [374]:
x_out = keras_fc_model.layers[10](x3_activ)
x_out

<tf.Tensor: shape=(1, 10), dtype=float32, numpy=
array([[-1.0399033 ,  0.8363577 ,  0.57764983, -1.0306426 , -3.062043  ,
         1.4964693 , -5.39317   , 17.652716  , -9.008466  , -0.6458832 ]],
      dtype=float32)>

In [377]:
x_label = tf.argmax(x_out, axis=1)
x_label

<tf.Tensor: shape=(1,), dtype=int64, numpy=array([7])>

In [387]:
tf.squeeze(x_label)

<tf.Tensor: shape=(), dtype=int64, numpy=7>

    * Agrees with the predictions obtained earlier with batch sizes of 3 or more.

    * Testing on other inputs for confirmation

In [396]:
def pred_fn(x_input, model, training = False):
    """
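    Predict the label of a single sample by calling the model's layers one by one.

    Parameters:
    x_input: tensor or np.array of shape: (784,)

    model: keras model for prediction

    training: bool passed to the batch norm layers

    Returns:
    predicted label as a scalar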
    """
    model_layers = model.layers

    x_input = tf.expand_dims(x_input, axis=0) 
    # reshaping to batchsize x input: 1 * 784
    x = model_layers[1](x_input) # dense1
    x = model_layers[2](x, training=training) # bn1
    x = model_layers[3](x) # activation1

    x = model_layers[4](x) # dense2
    x = model_layers[5](x, training=training) # bn2
    x = model_layers[6](x) # activation2

    x = model_layers[7](x) # dense3
    x = model_layers[8](x, training=training) # bn3
    x = model_layers[9](x) # activation3

    x_out = model_layers[10](x)
    label = tf.argmax(x_out, axis=1) # output of shape (1,) - a vector, 
                                     # need a scalar of shape: ()
    
    return tf.squeeze(label).numpy()

In [380]:
test_labels

<tf.Tensor: shape=(60,), dtype=int64, numpy=
array([1, 9, 7, 6, 1, 9, 5, 9, 1, 6, 2, 2, 4, 4, 7, 1, 4, 5, 8, 6, 4, 3,
       3, 5, 6, 1, 2, 5, 9, 5, 4, 8, 2, 5, 3, 3, 4, 5, 0, 1, 2, 4, 2, 1,
       2, 8, 3, 2, 1, 1, 8, 3, 8, 4, 8, 3, 6, 5, 4, 7])>

In [392]:
pred_labels = []
for i in test_x:
    pred_labels.append(pred_fn(i, keras_fc_model, training=False))
pred_labels[:5]

[1, 9, 7, 6, 1]

In [393]:
tf.equal(test_labels, pred_labels)

<tf.Tensor: shape=(60,), dtype=bool, numpy=
array([ True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True, False,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True, False,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True])>

In [394]:
# Not all are True, but at least the majority are
tf.reduce_all(tf.equal(test_labels, pred_labels))

<tf.Tensor: shape=(), dtype=bool, numpy=False>
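
    * tf.reduce_all only reports whether every prediction matches; in the element-wise comparison above, 58 of the 60 entries are True. The fraction of
      matches and the positions of the mismatches are often more informative. A small NumPy sketch with made-up labels (not the notebook's data):

```python
import numpy as np

# Illustrative labels only
true_labels = np.array([1, 9, 7, 6, 1, 9])
pred_labels = np.array([1, 9, 7, 6, 4, 9])

matches = true_labels == pred_labels
accuracy = matches.mean()              # fraction of correct predictions
wrong_idx = np.flatnonzero(~matches)   # positions of the mismatches

print(f"accuracy = {accuracy:.3f}, mismatches at {wrong_idx.tolist()}")
```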

    * Running again and setting training = True

In [397]:
pred_labels = []
for i in test_x:
    pred_labels.append(pred_fn(i, keras_fc_model, training=True))
pred_labels[:5]

[8, 8, 8, 8, 8]

In [398]:
keras_fc_model(tf.expand_dims(test_x[-1], axis=0), training = False)

<tf.Tensor: shape=(1, 10), dtype=float32, numpy=
array([[-1.6399686 , -1.8865588 ,  0.6251957 ,  2.1267636 , -1.6664417 ,
        -0.23559336, -1.7577692 , -0.3431458 ,  2.1472526 ,  0.9643423 ]],
      dtype=float32)>

    * This happened because, in the original model definition, the training argument of the batch norm layers was set to True.
      Hence, in pred_fn(), when the training argument was set to False, the prediction was correct, and when it was set to True,
      the outputs were as before, i.e. 8.

    * This also implies that during training, minibatch statistics are used, i.e. minibatch mean/var, rather than moving mean and
      variance. Why is this the case? Why not moving mean/var?
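
    * The constant output for single-sample batches can be reproduced in a few lines of NumPy. This is a sketch under the assumption that the layer
      normalizes with the statistics of the batch it is given; batchnorm_batch_stats, gamma and beta below are hypothetical stand-ins, not the
      model's weights.

```python
import numpy as np

def batchnorm_batch_stats(x, gamma, beta, eps=1e-3):
    """Normalize x of shape (batch, features) with its own batch mean/variance."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

rng = np.random.default_rng(1)
gamma = rng.normal(size=5)   # hypothetical learned scale
beta = rng.normal(size=5)    # hypothetical learned shift

sample_a = rng.normal(size=(1, 5))          # one sample
sample_b = 10.0 * rng.normal(size=(1, 5))   # a very different sample

out_a = batchnorm_batch_stats(sample_a, gamma, beta)
out_b = batchnorm_batch_stats(sample_b, gamma, beta)

# With a single sample, mean == x and var == 0, so the normalized value is 0
# and the layer output collapses to beta, whatever the input was.
print(np.allclose(out_a, beta), np.allclose(out_a, out_b))
```

    * If the first batch norm layer returns beta for every single-sample input, all later layers see the same activations, which is consistent with
      the model predicting 8 regardless of the input.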

### Redefining the model correctly

In [399]:
input_layer = tf.keras.Input(shape = (784,), name = 'input')
kerasdense1 = tf.keras.layers.Dense(units = units1, activation=None, 
                                    kernel_initializer=tf.keras.initializers.Orthogonal(gain=1,seed=random_seed))(input_layer)
kerasdensebn1 = bn_layer(axis = [-1])(kerasdense1)
activation1 = tf.keras.layers.Activation(activation = tf.nn.tanh)(kerasdensebn1)

kerasdense2 = tf.keras.layers.Dense(units = units2, activation=None, 
                                    kernel_initializer=tf.keras.initializers.Orthogonal(gain=1,seed=random_seed))(activation1)
kerasdensebn2 = bn_layer(axis = [-1])(kerasdense2)
activation2 = tf.keras.layers.Activation(activation = tf.nn.tanh)(kerasdensebn2)

kerasdense3 = tf.keras.layers.Dense(units = units3, activation=None, 
                                    kernel_initializer=tf.keras.initializers.Orthogonal(gain=1,seed=random_seed))(activation2)
kerasdensebn3 = bn_layer(axis = [-1])(kerasdense3)
activation3 = tf.keras.layers.Activation(activation = tf.nn.tanh)(kerasdensebn3)

output_layer = tf.keras.layers.Dense(units = 10, activation=None,
                                    kernel_initializer=tf.keras.initializers.Orthogonal(gain=1,seed=random_seed))(activation3)

In [400]:
keras_fc_model2 = tf.keras.Model(inputs = [input_layer], outputs = [output_layer], name = 'kerasmodel2')

In [402]:
modelconfig = keras_fc_model2.get_config()

In [403]:
del keras_fc_model2
tf.keras.backend.clear_session()
gc.collect()

286785

In [407]:
# initializing a new model without any weights
keras_fc_model2 = tf.keras.Model.from_config(modelconfig)

In [409]:
# loading the weights of the best model
keras_fc_model2.load_weights(os.path.join(keras_savedmodels, 'model_10-0.119.h5'))

In [410]:
keras_fc_model2.trainable

True

In [411]:
# Setting all layers' trainable to False
keras_fc_model2.trainable = False
for layer in keras_fc_model2.layers:
    print(layer.trainable)

False
False
False
False
False
False
False
False
False
False
False


#### Testing on one sample

In [413]:
last_elem = keras_fc_model2(tf.expand_dims(test_x[-1], axis=0))
last_elem

<tf.Tensor: shape=(1, 10), dtype=float32, numpy=
array([[-1.0326976 , -1.3163404 ,  1.1134243 ,  1.1836618 , -4.406678  ,
         0.57405937, -7.777385  , 16.786484  , -6.2431746 ,  1.109576  ]],
      dtype=float32)>

In [414]:
tf.argmax(last_elem, axis=1)

<tf.Tensor: shape=(1,), dtype=int64, numpy=array([7])>

    * Matches the main prediction. The mistake was setting the training argument to True when defining the batch
      norm layers.

## Conclusion

    * The lesson learnt is that the training argument of layers such as BatchNorm and Dropout must not be set to True in the original definition
      of the model. It is automatically set to the appropriate value by Keras during fitting and inference.
    
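    * The difference between training and trainable needs to be understood: training is a call argument that selects how a layer behaves in the
      forward pass, whereas trainable is a layer attribute that controls whether its weights are updated.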

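    * The Keras BatchNorm layer uses mini-batch stats rather than moving/running stats when called with training=True, and in that mode it also
      updates the running stats if trainable is set to True. When called with training=False (inference), it uses the running stats and, as the
      weight comparison above showed, leaves them unchanged.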