<a href="https://colab.research.google.com/github/zayedmohamed/Crepe/blob/master/train_keras_tpu.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# How to train a Keras model 20x faster with a TPU for free

An overview of the workflow:

- Build a Keras model for training with the functional API and a static input `batch_size`.
- Convert the Keras model to a TPU model.
- Train the TPU model with a batch size of `batch_size * 8` and save the weights to a file.
- Build a Keras model for inference with the same structure but a variable input batch size.
- Load the model weights.
- Predict with the inferencing model.

In [0]:
import tensorflow as tf
from tensorflow.keras.datasets import imdb
from tensorflow.keras.preprocessing import sequence
from tensorflow.keras.layers import Input, LSTM, Bidirectional, Dense, Embedding

In [2]:
# Number of words to consider as features
max_features = 10000
# Cut texts after this number of words (among top max_features most common words)
maxlen = 500

# Load data
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)

# Reverse sequences
x_train = [x[::-1] for x in x_train]
x_test = [x[::-1] for x in x_test]

# Pad sequences
x_train = sequence.pad_sequences(x_train, maxlen=maxlen)
x_test = sequence.pad_sequences(x_test, maxlen=maxlen)

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb.npz
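
The reverse-and-pad step above can be illustrated without TensorFlow. This is a minimal sketch, assuming the default `pad_sequences` behaviour (`padding='pre'`, `truncating='pre'`, pad value 0); the helper name `reverse_and_pad` is hypothetical and only mimics that behaviour on a single Python list.

```python
def reverse_and_pad(seq, maxlen, pad_value=0):
    """Reverse one sequence, then left-pad or left-truncate it to `maxlen` (maxlen > 0)."""
    reversed_seq = seq[::-1]
    # 'pre' truncation keeps the LAST `maxlen` items of the (reversed) sequence.
    truncated = reversed_seq[-maxlen:]
    # 'pre' padding adds the pad value on the left.
    return [pad_value] * (maxlen - len(truncated)) + truncated


print(reverse_and_pad([1, 2, 3], maxlen=5))           # [0, 0, 3, 2, 1]
print(reverse_and_pad([1, 2, 3, 4, 5, 6], maxlen=4))  # [4, 3, 2, 1]
```

Note that for a review longer than `maxlen`, reversing first and then truncating from the front keeps the first `maxlen` words of the original review, in reverse order.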


## Static input Batch size
Input pipelines running on CPU and GPU are mostly free from the static shape requirement, while the XLA/TPU environment imposes static shapes and a static batch size.

The Cloud TPU contains 8 TPU cores, which operate as independent processing units. The TPU is not fully utilized unless all eight cores are used. To fully speed up the training with vectorization, we can choose a larger batch size compared to training the same model on a single GPU. A total batch size of 1024 (128 per core) is generally a good starting point.

If you are training a larger model and the batch size is too large, try gradually reducing the batch size until it fits in TPU memory, making sure that the total batch size is a multiple of 64 (so that the per-core batch size is a multiple of 8).
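
The batch-size arithmetic above can be sketched in plain Python. The helper names `per_core_batch_size` and `candidate_batch_sizes` and the constant `NUM_TPU_CORES` are illustrative assumptions, not part of any TensorFlow API; the 8 cores and the multiple-of-64 rule come from the text.

```python
NUM_TPU_CORES = 8  # a Cloud TPU exposes 8 cores, as described above


def per_core_batch_size(total_batch_size, num_cores=NUM_TPU_CORES):
    """Return the batch size each core sees; raise if it does not divide evenly."""
    if total_batch_size % num_cores != 0:
        raise ValueError("total batch size must be divisible by the number of cores")
    return total_batch_size // num_cores


def candidate_batch_sizes(start=1024, step=64):
    """Yield total batch sizes from `start` downward, all multiples of `step`."""
    size = start - start % step  # round down to a multiple of `step`
    while size >= step:
        yield size
        size -= step


print(per_core_batch_size(1024))          # 128
print(list(candidate_batch_sizes(256)))   # [256, 192, 128, 64]
```

In practice you would try each candidate in turn and stop at the first one that trains without running out of TPU memory; that check is specific to your model and is not shown here.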

It is also worth mentioning that when training with a larger batch size, it is generally safe to increase the learning rate of the optimizer to allow even faster convergence. You can find a reference in this paper: "[Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour](https://arxiv.org/pdf/1706.02677.pdf)".
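
One heuristic described in the cited paper is the linear scaling rule: when the mini-batch size is multiplied by k, multiply the learning rate by k. Here is a minimal sketch of that rule. The helper name `scaled_learning_rate` and the baseline values (learning rate 0.001 at batch size 128) are assumptions for illustration. The paper studies SGD, and this notebook does not verify the rule for RMSProp, so treat the result as a starting point to tune from.

```python
def scaled_learning_rate(base_lr, base_batch_size, new_batch_size):
    """Linear scaling rule: scale the learning rate by the batch-size ratio."""
    k = new_batch_size / base_batch_size
    return base_lr * k


# Hypothetical baseline: lr=0.001 tuned at batch size 128, scaled up to 1024.
lr = scaled_learning_rate(base_lr=0.001, base_batch_size=128, new_batch_size=1024)
print(lr)  # 0.008
```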

In Keras, to define a static batch size, we use the functional API and specify the `batch_size` parameter of the Input layer. Notice that the model is built in a function that takes a `batch_size` parameter, so we can come back later and build another model for inference on CPU or GPU that accepts variable batch size inputs.

In [0]:
def make_model(batch_size=None):
  source = Input(shape=(maxlen,), batch_size=batch_size, dtype=tf.int32, name='Input')
  embedding = Embedding(input_dim=max_features, output_dim=128, name='Embedding')(source)
  # lstm = Bidirectional(LSTM(32, name = 'LSTM'), name='Bidirectional')(embedding)
  lstm = LSTM(32, name = 'LSTM')(embedding)
  predicted_var = Dense(1, activation='sigmoid', name='Output')(lstm)
  model = tf.keras.Model(inputs=[source], outputs=[predicted_var])
  model.compile(
      optimizer=tf.train.RMSPropOptimizer(learning_rate=0.01),
      loss='binary_crossentropy',
      metrics=['acc'])
  return model

Also, use a `tf.train.Optimizer` subclass (here `tf.train.RMSPropOptimizer`) instead of a standard Keras optimizer, since Keras optimizer support is still experimental for TPU.

In [5]:
tf.keras.backend.clear_session()
training_model = make_model(batch_size = 128)
training_model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
Input (InputLayer)           (128, 500)                0         
_________________________________________________________________
Embedding (Embedding)        (128, 500, 128)           1280000   
_________________________________________________________________
LSTM (LSTM)                  (128, 32)                 20608     
_________________________________________________________________
Output (Dense)               (128, 1)                  33        
Total params: 1,300,641
Trainable params: 1,300,641
Non-trainable params: 0
_________________________________________________________________


# Convert Keras model to TPU model
The `tf.contrib.tpu.keras_to_tpu_model` function converts a tf.keras model to an equivalent TPU version.

In [6]:
import os
# This address identifies the TPU we'll use when configuring TensorFlow.
TPU_WORKER = 'grpc://' + os.environ['COLAB_TPU_ADDR']
tf.logging.set_verbosity(tf.logging.INFO)

tpu_model = tf.contrib.tpu.keras_to_tpu_model(
    training_model,
    strategy=tf.contrib.tpu.TPUDistributionStrategy(
        tf.contrib.cluster_resolver.TPUClusterResolver(TPU_WORKER)))

tpu_model.summary()

INFO:tensorflow:Querying Tensorflow master (b'grpc://10.115.195.114:8470') for TPU system metadata.
INFO:tensorflow:Found TPU system:
INFO:tensorflow:*** Num TPU Cores: 8
INFO:tensorflow:*** Num TPU Workers: 1
INFO:tensorflow:*** Num TPU Cores Per Worker: 8
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:CPU:0, CPU, -1, 14291680080413639861)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:XLA_CPU:0, XLA_CPU, 17179869184, 8114354152376501122)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:XLA_GPU:0, XLA_GPU, 17179869184, 14622084438864888236)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:0, TPU, 17179869184, 16675989505216646498)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:1, TPU, 17179869184, 7336732595318902999)
INFO:tensorflow:*** Available Device: _Dev

We then use the standard Keras methods to train the model, save the weights, and evaluate the model. Notice that `batch_size` is set to eight times the model input `batch_size`, since the input samples are evenly distributed across the 8 TPU cores.
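
To see why the factor of 8 appears, here is a rough sketch of an even split in plain Python. The helper `shard_batch` is hypothetical. How the TPU runtime actually assigns samples to cores is not shown in this notebook, so the contiguous slicing below is an assumption; only the resulting shard sizes matter.

```python
NUM_CORES = 8
GLOBAL_BATCH = 128 * NUM_CORES  # 1024, the value passed to fit()


def shard_batch(batch, num_cores=NUM_CORES):
    """Split one global batch into equal, contiguous per-core shards."""
    if len(batch) % num_cores != 0:
        raise ValueError("batch length must be divisible by the number of cores")
    shard = len(batch) // num_cores
    return [batch[i * shard:(i + 1) * shard] for i in range(num_cores)]


batch = list(range(GLOBAL_BATCH))  # stand-in for 1024 sample indices
shards = shard_batch(batch)
print(len(shards), len(shards[0]))  # 8 128
```

Each shard has 128 samples, which matches the static `batch_size=128` the training model was built with.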

In [7]:
import time
start_time = time.time()

history = tpu_model.fit(x_train, y_train,
                    epochs=20,
                    batch_size=128 * 8,
                    validation_split=0.2)
tpu_model.save_weights('./tpu_model.h5', overwrite=True)

print("--- %s seconds ---" % (time.time() - start_time))

Train on 25000 samples, validate on 5000 samples
Epoch 1/20
INFO:tensorflow:New input shapes; (re-)compiling: mode=train (# of cores 8), [TensorSpec(shape=(128,), dtype=tf.int32, name='core_id0'), TensorSpec(shape=(128, 500), dtype=tf.int32, name='Input_10'), TensorSpec(shape=(128, 1), dtype=tf.float32, name='Output_target_30')]
INFO:tensorflow:Overriding default placeholder.
INFO:tensorflow:Remapping placeholder for Input
INFO:tensorflow:Started compiling
INFO:tensorflow:Finished compiling. Time elapsed: 2.755854845046997 secs
INFO:tensorflow:Setting weights on TPU model.
INFO:tensorflow:Overriding default placeholder.
INFO:tensorflow:Remapping placeholder for Input
INFO:tensorflow:Started compiling
INFO:tensorflow:Finished compiling. Time elapsed: 3.7938554286956787 secs
INFO:tensorflow:Overriding default placeholder.
INFO:tensorflow:Remapping placeholder for Input
INFO:tensorflow:Started compiling
INFO:tensorflow:Finished compiling. Time elapsed: 4.910218954086304 secs
INFO:tensorfl

I set up an experiment to compare the training speed between a single GTX1070 running locally on my Windows PC and the TPU on Colab. Here is the result.

Both the GPU and the TPU take an input batch size of 128.

GPU: **179 seconds per epoch.** 20 epochs reach 76.9% validation accuracy, total 3600 seconds.

TPU: **5 seconds per epoch** except for the very first epoch which takes 36 seconds. 20 epochs reach 97.39% validation accuracy, total 104 seconds.

The higher validation accuracy on the TPU after 20 epochs may be caused by training on 8 mini-batches of 128 samples at a time.
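
For reference, the speedup implied by these timings can be worked out directly. The sketch below uses only the figures reported above; the variable names are illustrative.

```python
gpu_seconds_per_epoch = 179
tpu_seconds_per_epoch = 5    # steady state, excluding the slower first epoch
gpu_total_seconds = 3600
tpu_total_seconds = 104

per_epoch_speedup = gpu_seconds_per_epoch / tpu_seconds_per_epoch
total_speedup = gpu_total_seconds / tpu_total_seconds

print(round(per_epoch_speedup, 1))  # 35.8
print(round(total_speedup, 1))      # 34.6
```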

# Inferencing on CPU
Once we have the model weights, we can load them as usual and make predictions on another device such as a CPU or GPU. We also want the inferencing model to accept a flexible input batch size, which can be done with the previous `make_model()` function.

You can see that the inferencing model now accepts a variable number of input samples:

In [8]:
inferencing_model = make_model(batch_size=None)
inferencing_model.load_weights('./tpu_model.h5')
inferencing_model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
Input (InputLayer)           (None, 500)               0         
_________________________________________________________________
Embedding (Embedding)        (None, 500, 128)          1280000   
_________________________________________________________________
LSTM (LSTM)                  (None, 32)                20608     
_________________________________________________________________
Output (Dense)               (None, 1)                 33        
Total params: 1,300,641
Trainable params: 1,300,641
Non-trainable params: 0
_________________________________________________________________


Then you can use the standard `fit()` and `evaluate()` functions with the inferencing model.

In [9]:
inferencing_model.evaluate(x_test, y_test)



[0.5920307851076126, 0.8178]

In [10]:
tpu_model.evaluate(x_test, y_test, batch_size=128 * 8)

INFO:tensorflow:Overriding default placeholder.
INFO:tensorflow:Remapping placeholder for Input
INFO:tensorflow:Started compiling
INFO:tensorflow:Finished compiling. Time elapsed: 10.642410278320312 secs


[0.5920634329032898, 0.8177599998283386]

In [11]:
tpu_model.evaluate(x_test, y_test, batch_size=128 * 8)



[0.5920634329032898, 0.8177599998283386]

In [12]:
inferencing_model.predict(x_test[:10]) > 0.5

array([[False],
       [ True],
       [False],
       [ True],
       [ True],
       [ True],
       [ True],
       [False],
       [ True],
       [ True]])

In [13]:
y_test[:10]

array([0, 1, 1, 0, 1, 1, 1, 0, 0, 1])
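
As a quick sanity check, the ten thresholded predictions can be compared with the ten labels by hand. This is a small sketch in plain Python, with the values copied from the two cell outputs above; the variable names are illustrative.

```python
# Thresholded predictions for x_test[:10], copied from the output above.
predictions = [False, True, False, True, True, True, True, False, True, True]
# Ground-truth labels y_test[:10], copied from the output above.
labels = [0, 1, 1, 0, 1, 1, 1, 0, 0, 1]

matches = [int(p) == y for p, y in zip(predictions, labels)]
accuracy = sum(matches) / len(labels)
print(accuracy)  # 0.7
```

Seven of the ten samples are classified correctly. With only ten samples this is a coarse estimate, so it is not expected to match the full test-set accuracy exactly.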

In [14]:
tpu_model.predict_on_batch(x_train[:128 * 8]) > 0.5

INFO:tensorflow:New input shapes; (re-)compiling: mode=infer (# of cores 8), [TensorSpec(shape=(128, 500), dtype=tf.int32, name='Input_10')]
INFO:tensorflow:Overriding default placeholder.
INFO:tensorflow:Remapping placeholder for Input
INFO:tensorflow:Started compiling
INFO:tensorflow:Finished compiling. Time elapsed: 12.780016660690308 secs


array([[ True],
       [False],
       [False],
       ...,
       [False],
       [ True],
       [False]])

# Download the trained model weights to your local file system


In [0]:
from google.colab import files

files.download('./tpu_model.h5')