##### Copyright 2018 The TensorFlow Authors.



In [0]:
#@title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Distributed Training in TensorFlow

<table class="tfo-notebook-buttons" align="left">
  <td>
    <a target="_blank" href="https://www.tensorflow.org/tutorials/distribute/distribution_strategy_keras"><img src="https://www.tensorflow.org/images/tf_logo_32px.png" />View on TensorFlow.org</a>
  </td>
  <td>
    <a target="_blank" href="https://colab.research.google.com/github/tensorflow/docs/blob/master/site/en/r2/tutorials/distribute/distribution_strategy_keras.ipynb"><img src="https://www.tensorflow.org/images/colab_logo_32px.png" />Run in Google Colab</a>
  </td>
  <td>
    <a target="_blank" href="https://github.com/tensorflow/docs/blob/master/site/en/r2/tutorials/distribute/distribution_strategy_keras.ipynb"><img src="https://www.tensorflow.org/images/GitHub-Mark-32px.png" />View source on GitHub</a>
  </td>
</table>

## Overview

The `tf.distribute.Strategy` API is an easy way to distribute your training
across multiple devices/machines. Our goal is to allow users to use existing
models and training code with minimal changes to enable distributed training.

Currently, core TensorFlow includes `tf.distribute.MirroredStrategy`. This
does in-graph replication with synchronous training on many GPUs on one machine.
Essentially, it creates copies of all variables in the model's layers on each
device. It then uses all-reduce to combine gradients across the devices before
applying them to the variables to keep them in sync.
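
The replicate-then-all-reduce idea described above can be sketched in plain Python. This is a toy illustration, not TensorFlow's implementation: the names `all_reduce_mean` and `train_step` are hypothetical, variables are plain lists, and averaging is assumed as the way gradients are combined.

```python
# Toy illustration of synchronous training with mirrored variables.
# Pure Python; none of these names come from TensorFlow.

def all_reduce_mean(per_replica_grads):
    """Combine gradients element-wise across replicas (assumed: averaging)."""
    num_replicas = len(per_replica_grads)
    return [sum(grads) / num_replicas for grads in zip(*per_replica_grads)]


def train_step(replica_weights, per_replica_grads, lr=0.1):
    """Apply the same combined gradient to every replica's copy."""
    combined = all_reduce_mean(per_replica_grads)
    return [[w - lr * g for w, g in zip(weights, combined)]
            for weights in replica_weights]


# Two replicas start with identical copies of a 2-element variable.
replica_weights = [[1.0, 2.0], [1.0, 2.0]]
# Each replica computes different gradients on its own share of the batch.
per_replica_grads = [[0.5, 1.0], [1.5, 3.0]]

replica_weights = train_step(replica_weights, per_replica_grads)
print(replica_weights)
```

Because every replica applies the same combined gradient, the copies remain identical after the step, which is what keeps the variables in sync.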

Many other strategies will soon be
available in core TensorFlow. You can find more information about them in the
[README](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/distribute).



## Example with Keras API

The easiest way to get started with multiple GPUs on one machine using `MirroredStrategy` is with `tf.keras`.

In [0]:
from __future__ import absolute_import, division, print_function

In [0]:
# Import TensorFlow
!pip install tf-nightly-gpu-2.0-preview
import tensorflow_datasets as tfds
import tensorflow as tf

import os

## Download the dataset

Download the MNIST dataset and load it with [TensorFlow Datasets](https://www.tensorflow.org/datasets). This returns a dataset in `tf.data` format.

In [0]:
datasets, ds_info = tfds.load(name='mnist', with_info=True, as_supervised=True)
mnist_train, mnist_test = datasets['train'], datasets['test']

Setting `with_info=True` also returns the metadata for the entire dataset, stored here in `ds_info`.
In this example, `ds_info.splits.total_num_examples` is 70000.


In [0]:
num_train_examples = ds_info.splits['train'].num_examples
num_test_examples = ds_info.splits['test'].num_examples

BUFFER_SIZE = num_train_examples
BATCH_SIZE = 64

## Input data pipeline

In [0]:
def scale(image, label):
  image = tf.cast(image, tf.float32)
  image /= 255
  return image, label

In [0]:
train_dataset = mnist_train.map(scale).shuffle(BUFFER_SIZE).batch(BATCH_SIZE)
eval_dataset = mnist_test.map(scale).batch(BATCH_SIZE)

## Define Distribution Strategy

To distribute a Keras model on multiple GPUs using `MirroredStrategy`, we first instantiate a `MirroredStrategy` object.

In [0]:
strategy = tf.distribute.MirroredStrategy()

In [0]:
print('Number of devices: {}'.format(strategy.num_replicas_in_sync))
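
The number of replicas matters for how much data each device processes per step. The sketch below works through the arithmetic in plain Python, assuming that the batch size passed to the dataset is a global batch that is split evenly across replicas. The helper names `per_replica_batch_size` and `steps_per_epoch` are hypothetical, and the 60,000 figure is the size of the MNIST training split.

```python
import math


def per_replica_batch_size(global_batch_size, num_replicas):
    """Share of a global batch that each replica sees (assumes an even split)."""
    if global_batch_size % num_replicas != 0:
        raise ValueError('global batch size must be divisible by num_replicas')
    return global_batch_size // num_replicas


def steps_per_epoch(num_examples, global_batch_size):
    """Number of batches needed to cover the dataset once."""
    return math.ceil(num_examples / global_batch_size)


for num_replicas in (1, 2, 4):
    print('{} replica(s): {} examples per replica per step'.format(
        num_replicas, per_replica_batch_size(64, num_replicas)))

print('Steps per epoch:', steps_per_epoch(60000, 64))
```

Under this assumption, adding devices shrinks the per-replica batch unless the global batch size is increased to match.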

## Create the model

Create and compile the Keras model inside `strategy.scope()`.

In [0]:
with strategy.scope():
  model = tf.keras.Sequential([
      tf.keras.layers.Conv2D(32, 3, activation='relu', input_shape=(28, 28, 1)),
      tf.keras.layers.MaxPooling2D(),
      tf.keras.layers.Flatten(),
      tf.keras.layers.Dense(64, activation='relu'),
      tf.keras.layers.Dense(10, activation='softmax')
  ])
  # TODO(yashkatariya): Add accuracy when b/122371345 is fixed.
  model.compile(loss='sparse_categorical_crossentropy',
                optimizer=tf.keras.optimizers.Adam())
                #metrics=['accuracy'])

## Define the callbacks



The callbacks used here are:

*   *TensorBoard*: This callback writes a log for TensorBoard, which allows you to visualize the graphs.
*   *Model Checkpoint*: This callback saves the model after every epoch.
*   *Learning Rate Scheduler*: Using this callback, you can schedule the learning rate to change after every epoch/batch.



In [0]:
# Define the checkpoint directory to store the checkpoints

checkpoint_dir = './training_checkpoints'
# Name of the checkpoint files
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_{epoch}")
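
The `{epoch}` part of the checkpoint name is an ordinary Python format field. A minimal sketch of how such a template expands, assuming the epoch number is substituted with `str.format` and counted from 1 (the `paths` list is only for illustration):

```python
import os

checkpoint_dir = './training_checkpoints'
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_{epoch}")

# Assumption: the placeholder is filled with a 1-based epoch number.
paths = [checkpoint_prefix.format(epoch=epoch) for epoch in range(1, 4)]
for path in paths:
    print(path)
```

Each epoch therefore gets its own file prefix rather than overwriting the previous checkpoint.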

In [0]:
# Function for decaying the learning rate.
# You can use a complicated decay equation too.
def decay(epoch):
  if epoch < 3:
    return 1e-3
  elif 3 <= epoch < 7:
    return 1e-4
  else:
    return 1e-5
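
To see what this schedule produces, the function can be evaluated outside of training. The sketch below repeats `decay` so that it runs on its own and tabulates the rate for a 10-epoch run, assuming the scheduler passes a 0-based epoch index (the `schedule` list is only for illustration):

```python
def decay(epoch):
  if epoch < 3:
    return 1e-3
  elif 3 <= epoch < 7:
    return 1e-4
  else:
    return 1e-5


# Assumption: epoch indices start at 0, so index 0 is the first epoch.
schedule = [decay(epoch) for epoch in range(10)]
for epoch, lr in enumerate(schedule):
  print('epoch index {}: learning rate {}'.format(epoch, lr))
```

The rate stays at 1e-3 for the first three epochs, drops to 1e-4 for the next four, and is 1e-5 afterwards.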

In [0]:
# Callback for printing the LR at the end of each epoch.
class PrintLR(tf.keras.callbacks.Callback):
  def on_epoch_end(self, epoch, logs=None):
    print('\nLearning rate for epoch {} is {}'.format(
        epoch + 1, model.optimizer.lr.numpy()))

In [0]:
callbacks = [
    tf.keras.callbacks.TensorBoard(log_dir='./logs'),
    tf.keras.callbacks.ModelCheckpoint(filepath=checkpoint_prefix, 
                                       save_weights_only=True),
    tf.keras.callbacks.LearningRateScheduler(decay),
    PrintLR()
]

## Train and evaluate

To train the model, call the Keras `fit` API on the input dataset that was
created earlier, just as you would in a non-distributed case.

In [0]:
model.fit(train_dataset, epochs=10, callbacks=callbacks)

The listing below shows that the checkpoints are being saved.

In [0]:
# check the checkpoint directory
!ls {checkpoint_dir}

Let's load the latest checkpoint and see how the model performs on the test dataset.

Call `evaluate` on the evaluation dataset, just as you would in a non-distributed case.

In [0]:
model.load_weights(tf.train.latest_checkpoint(checkpoint_dir))

eval_loss = model.evaluate(eval_dataset)
print('Eval loss: {}'.format(eval_loss))

You can download the TensorBoard logs and then use the following command to view the output.

```
tensorboard --logdir=path/to/log-directory
```

In [0]:
!ls -sh ./logs

## What's next?

Read the [distribution strategy guide](../../../guide/distribute_strategy.ipynb).

Try the [distribution strategy with training loops](training_loops.ipynb) tutorial to use `tf.distribute.Strategy` with custom training loops.



`tf.distribute.Strategy` is under active development, and we will be adding more examples and tutorials in the near future. Please give it a try. We welcome your feedback via [issues on GitHub](https://github.com/tensorflow/tensorflow/issues/new).