##### Copyright 2019 The TensorFlow Authors.

In [None]:
#@title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Custom training with tf.distribute.Strategy

<table class="tfo-notebook-buttons" align="left">
  <td>     <a target="_blank" href="https://tensorflow.google.cn/tutorials/distribute/custom_training"><img src="https://tensorflow.google.cn/images/tf_logo_32px.png">View on TensorFlow.org</a>   </td>
  <td>     <a target="_blank" href="https://colab.research.google.com/github/tensorflow/docs-l10n/blob/master/site/zh-cn/tutorials/distribute/custom_training.ipynb"><img src="https://tensorflow.google.cn/images/colab_logo_32px.png">Run in Google Colab</a>   </td>
  <td>     <a target="_blank" href="https://github.com/tensorflow/docs-l10n/blob/master/site/zh-cn/tutorials/distribute/custom_training.ipynb"><img src="https://tensorflow.google.cn/images/GitHub-Mark-32px.png">View source on GitHub</a>   </td>
  <td>
    <a href="https://storage.googleapis.com/tensorflow_docs/docs/site/en/tutorials/distribute/custom_training.ipynb"><img src="https://www.tensorflow.org/images/download_logo_32px.png">Download notebook</a>
  </td>
</table>

This tutorial demonstrates how to use `tf.distribute.Strategy`—a TensorFlow API that provides an abstraction for [distributing your training](../../guide/distributed_training.ipynb) across multiple processing units (GPUs, multiple machines, or TPUs)—with custom training loops. In this example, you will train a simple convolutional neural network on the [Fashion MNIST dataset](https://github.com/zalandoresearch/fashion-mnist) containing 70,000 images of size 28 x 28.

[Custom training loops](../customization/custom_training_walkthrough.ipynb) provide flexibility and greater control over training. They also make it easier to debug the model and the training loop.

In [None]:
# Import TensorFlow
import tensorflow as tf

# Helper libraries
import numpy as np
import os

print(tf.__version__)

## Download the Fashion MNIST dataset

In [None]:
fashion_mnist = tf.keras.datasets.fashion_mnist

(train_images, train_labels), (test_images, test_labels) = fashion_mnist.load_data()

# Adding a dimension to the array -> new shape == (28, 28, 1)
# We are doing this because the first layer in our model is a convolutional
# layer and it requires a 4D input (batch_size, height, width, channels).
# batch_size dimension will be added later on.
train_images = train_images[..., None]
test_images = test_images[..., None]

# Getting the images in [0, 1] range.
train_images = train_images / np.float32(255)
test_images = test_images / np.float32(255)

## Create a strategy to distribute the variables and the graph

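How does the `tf.distribute.MirroredStrategy` strategy work?

- All the variables and the model graph are replicated across the replicas.
- Input is evenly distributed across the replicas.
- Each replica calculates the loss and gradients for the input it received.
- The gradients are synced across all the replicas by summing them.
- After the sync, the same update is made to the copies of the variables on each replica.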
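
The steps in the list above can be sketched without TensorFlow. The following is a minimal, illustrative simulation in plain Python, assuming a hypothetical one-parameter model `y = w * x` with a squared-error loss. The helper names (`replica_gradient`, `mirrored_step`), the data, and the learning rate are made up for this sketch and are not part of the `tf.distribute` API.

```python
# Illustrative simulation only: plain Python, no TensorFlow.
# Hypothetical model: prediction = w * x, per-example loss = (w * x - y) ** 2.

def replica_gradient(w, shard, global_batch_size):
    """Gradient of the scaled loss for the examples held by one replica."""
    return sum(2.0 * (w * x - y) * x for x, y in shard) / global_batch_size

def mirrored_step(replica_weights, batch, learning_rate=0.1):
    num_replicas = len(replica_weights)
    global_batch_size = len(batch)
    shard_size = global_batch_size // num_replicas  # assumes an even split
    # The input is split evenly across the replicas.
    shards = [batch[i * shard_size:(i + 1) * shard_size]
              for i in range(num_replicas)]
    # Each replica computes the gradient for its own shard.
    grads = [replica_gradient(w, shard, global_batch_size)
             for w, shard in zip(replica_weights, shards)]
    # The gradients are synchronized by summing them.
    synced_grad = sum(grads)
    # Every replica applies the same update to its copy of the variable.
    return [w - learning_rate * synced_grad for w in replica_weights]

# The variable is replicated: two replicas start with the same value.
weights = [0.0, 0.0]
batch = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]  # made-up data, y = 2 * x
new_weights = mirrored_step(weights, batch)
print(new_weights)
```

Because every replica starts from the same value and applies the same summed gradient, the copies stay identical after the update.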

Note: You can put all the code below inside a single scope. This example divides it into several code cells for illustration purposes.


In [None]:
# If the list of devices is not specified in the
# `tf.distribute.MirroredStrategy` constructor, it will be auto-detected.
strategy = tf.distribute.MirroredStrategy()

In [None]:
print ('Number of devices: {}'.format(strategy.num_replicas_in_sync))

## Set up the input pipeline

In [None]:
BUFFER_SIZE = len(train_images)

BATCH_SIZE_PER_REPLICA = 64
GLOBAL_BATCH_SIZE = BATCH_SIZE_PER_REPLICA * strategy.num_replicas_in_sync

EPOCHS = 10

Create the datasets and distribute them:

In [None]:
train_dataset = tf.data.Dataset.from_tensor_slices((train_images, train_labels)).shuffle(BUFFER_SIZE).batch(GLOBAL_BATCH_SIZE) 
test_dataset = tf.data.Dataset.from_tensor_slices((test_images, test_labels)).batch(GLOBAL_BATCH_SIZE) 

train_dist_dataset = strategy.experimental_distribute_dataset(train_dataset)
test_dist_dataset = strategy.experimental_distribute_dataset(test_dataset)
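
To see how the global batch size affects the number of steps per epoch, here is a small back-of-the-envelope calculation in plain Python. It assumes the standard Fashion MNIST split of 60,000 training images and hypothetical replica counts of 1 and 4. `steps_per_epoch` is an illustrative helper, not a TensorFlow function.

```python
import math

NUM_TRAIN_EXAMPLES = 60000  # assumed: size of the Fashion MNIST training split
BATCH_SIZE_PER_REPLICA = 64

def steps_per_epoch(num_examples, num_replicas):
    """Return (steps, size of the final batch) for one pass over the data."""
    global_batch_size = BATCH_SIZE_PER_REPLICA * num_replicas
    steps = math.ceil(num_examples / global_batch_size)
    last_batch = num_examples - (steps - 1) * global_batch_size
    return steps, last_batch

for replicas in (1, 4):
    steps, last_batch = steps_per_epoch(NUM_TRAIN_EXAMPLES, replicas)
    print(f"{replicas} replica(s): {steps} steps, final batch of {last_batch}")
```

Under these assumptions, 60,000 is not a multiple of either global batch size, so the final batch of each epoch is smaller than the others.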

## Create the model

Create a model using `tf.keras.Sequential`. You can also use the [Model Subclassing API](https://www.tensorflow.org/guide/keras/custom_layers_and_models) or the [functional API](https://www.tensorflow.org/guide/keras/functional) to do this.

In [None]:
def create_model():
  model = tf.keras.Sequential([
      tf.keras.layers.Conv2D(32, 3, activation='relu'),
      tf.keras.layers.MaxPooling2D(),
      tf.keras.layers.Conv2D(64, 3, activation='relu'),
      tf.keras.layers.MaxPooling2D(),
      tf.keras.layers.Flatten(),
      tf.keras.layers.Dense(64, activation='relu'),
      tf.keras.layers.Dense(10)
    ])

  return model

In [None]:
# Create a checkpoint directory to store the checkpoints.
checkpoint_dir = './training_checkpoints'
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt")

## Define the loss function

Normally, on a single machine with a single GPU/CPU, the loss is divided by the number of examples in the batch of input.

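*So, how should the loss be calculated when using a `tf.distribute.Strategy`?*

- For example, say you have 4 GPUs and a batch size of 64. One batch of input is distributed across the replicas (4 GPUs), with each replica getting an input of size 16.

- The model on each replica does a forward pass with its respective input and calculates the loss. Now, instead of dividing the loss by the number of examples in its respective input (BATCH_SIZE_PER_REPLICA = 16), the loss should be divided by the GLOBAL_BATCH_SIZE (64).

*Why do this?*

- This is needed because after the gradients are calculated on each replica, they are synced across the replicas by **summing** them.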
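
The arithmetic behind this can be checked with a small example. The sketch below uses plain Python and made-up per-example loss values for the 4-GPU, batch-size-64 scenario. It works with the loss rather than the gradients, on the assumption that the same reasoning carries over because the gradient is linear in the loss.

```python
NUM_REPLICAS = 4
BATCH_SIZE_PER_REPLICA = 16
GLOBAL_BATCH_SIZE = NUM_REPLICAS * BATCH_SIZE_PER_REPLICA  # 64

# Made-up per-example losses: replica r holds 16 examples, each with loss r + 1.
per_replica_losses = [[float(r + 1)] * BATCH_SIZE_PER_REPLICA
                      for r in range(NUM_REPLICAS)]

# The value a single device would compute: the mean over all 64 examples.
all_losses = [loss for shard in per_replica_losses for loss in shard]
single_device_mean = sum(all_losses) / len(all_losses)

# Divide by the global batch size, then sum across replicas.
scaled_by_global = sum(sum(shard) / GLOBAL_BATCH_SIZE
                       for shard in per_replica_losses)

# Divide by the per-replica batch size, then sum across replicas.
scaled_by_local = sum(sum(shard) / BATCH_SIZE_PER_REPLICA
                      for shard in per_replica_losses)

print(single_device_mean, scaled_by_global, scaled_by_local)
```

Scaling by the global batch size reproduces the single-device mean, while scaling by the per-replica batch size gives a value that is too large by a factor equal to the number of replicas.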

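*How to do this in TensorFlow?*

- If you're writing a custom training loop, as in this tutorial, you should sum the per-example losses and divide the sum by the GLOBAL_BATCH_SIZE: `scale_loss = tf.reduce_sum(loss) * (1. / GLOBAL_BATCH_SIZE)`. Alternatively, you can use `tf.nn.compute_average_loss`, which takes the per-example loss, optional sample weights, and GLOBAL_BATCH_SIZE as arguments and returns the scaled loss.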

- If you are using regularization losses in your model then you need to scale the loss value by the number of replicas. You can do this by using the `tf.nn.scale_regularization_loss` function.

- Using `tf.reduce_mean` is not recommended. Doing so divides the loss by the actual per-replica batch size, which may vary from step to step.

- This reduction and scaling is done automatically in Keras `Model.compile` and `Model.fit`.

- If using `tf.keras.losses` classes (as in the example below), the loss reduction needs to be explicitly specified to be one of `NONE` or `SUM`. `AUTO` and `SUM_OVER_BATCH_SIZE`  are disallowed when used with `tf.distribute.Strategy`. `AUTO` is disallowed because the user should explicitly think about what reduction they want to make sure it is correct in the distributed case. `SUM_OVER_BATCH_SIZE` is disallowed because currently it would only divide by per replica batch size, and leave the dividing by number of replicas to the user, which might be easy to miss. So, instead, you need to do the reduction yourself explicitly.

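- If `labels` is multi-dimensional, then average the `per_example_loss` across the number of elements in each sample. For example, if the shape of `predictions` is `(batch_size, H, W, n_classes)` and `labels` is `(batch_size, H, W)`, you will need to update `per_example_loss` like: `per_example_loss /= tf.cast(tf.reduce_prod(tf.shape(labels)[1:]), tf.float32)`

    Caution: **Verify the shape of your loss**. Loss functions in `tf.losses`/`tf.keras.losses` typically return the average over the last dimension of the input. The loss classes wrap these functions. Passing `reduction=Reduction.NONE` when creating an instance of a loss class means "no **additional** reduction". For categorical losses with an example input shape of `[batch, W, H, n_classes]`, the `n_classes` dimension is reduced. For pointwise losses like `losses.mean_squared_error` or `losses.binary_crossentropy`, include a dummy axis so that `[batch, W, H, 1]` is reduced to `[batch, W, H]`. Without the dummy axis, `[batch, W, H]` will be incorrectly reduced to `[batch, W]`.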
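
To illustrate why `tf.reduce_mean` is discouraged, the following plain-Python sketch compares the weight each example receives under the two scaling schemes. The numbers are hypothetical: a final batch of 5 examples split unevenly across 2 replicas, with a nominal global batch size of 6. Python's built-in arithmetic stands in for `tf.reduce_mean` and `tf.reduce_sum`.

```python
GLOBAL_BATCH_SIZE = 6  # hypothetical nominal global batch size

# Hypothetical final batch: 5 examples split unevenly across 2 replicas.
per_replica_losses = [[1.0, 1.0, 1.0], [4.0, 4.0]]

# Mean per replica (the role tf.reduce_mean would play on each replica):
# the divisor changes with the number of examples each replica received.
mean_weights = [1.0 / len(shard) for shard in per_replica_losses]

# Sum per replica divided by the fixed global batch size:
# every example gets the same weight no matter where it landed.
global_weights = [1.0 / GLOBAL_BATCH_SIZE for _ in per_replica_losses]

mean_scaled = [sum(shard) / len(shard) for shard in per_replica_losses]
global_scaled = [sum(shard) / GLOBAL_BATCH_SIZE for shard in per_replica_losses]

print("per-example weight with mean:", mean_weights)
print("per-example weight with global scaling:", global_weights)
print("summed across replicas:", sum(mean_scaled), sum(global_scaled))
```

With the per-replica mean, examples on the smaller shard count for more than examples on the larger one. With the fixed divisor, every example contributes equally.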


In [None]:
with strategy.scope():
  # Set reduction to `none` so we can do the reduction afterwards and divide by
  # global batch size.
  loss_object = tf.keras.losses.SparseCategoricalCrossentropy(
      from_logits=True,
      reduction=tf.keras.losses.Reduction.NONE)
  def compute_loss(labels, predictions):
    per_example_loss = loss_object(labels, predictions)
    return tf.nn.compute_average_loss(per_example_loss, global_batch_size=GLOBAL_BATCH_SIZE)

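## Define the metrics to track loss and accuracy

These metrics track the test loss and the training and test accuracy. You can use `.result()` to get the accumulated statistics at any time.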

In [None]:
with strategy.scope():
  test_loss = tf.keras.metrics.Mean(name='test_loss')

  train_accuracy = tf.keras.metrics.SparseCategoricalAccuracy(
      name='train_accuracy')
  test_accuracy = tf.keras.metrics.SparseCategoricalAccuracy(
      name='test_accuracy')

## Training loop

In [None]:
# model, optimizer, and checkpoint must be created under `strategy.scope`.
with strategy.scope():
  model = create_model()

  optimizer = tf.keras.optimizers.Adam()

  checkpoint = tf.train.Checkpoint(optimizer=optimizer, model=model)

In [None]:
def train_step(inputs):
  images, labels = inputs

  with tf.GradientTape() as tape:
    predictions = model(images, training=True)
    loss = compute_loss(labels, predictions)

  gradients = tape.gradient(loss, model.trainable_variables)
  optimizer.apply_gradients(zip(gradients, model.trainable_variables))

  train_accuracy.update_state(labels, predictions)
  return loss 

def test_step(inputs):
  images, labels = inputs

  predictions = model(images, training=False)
  t_loss = loss_object(labels, predictions)

  test_loss.update_state(t_loss)
  test_accuracy.update_state(labels, predictions)

In [None]:
# `run` replicates the provided computation and runs it
# with the distributed input.
@tf.function
def distributed_train_step(dataset_inputs):
  per_replica_losses = strategy.run(train_step, args=(dataset_inputs,))
  return strategy.reduce(tf.distribute.ReduceOp.SUM, per_replica_losses,
                         axis=None)

@tf.function
def distributed_test_step(dataset_inputs):
  return strategy.run(test_step, args=(dataset_inputs,))

for epoch in range(EPOCHS):
  # TRAIN LOOP
  total_loss = 0.0
  num_batches = 0
  for x in train_dist_dataset:
    total_loss += distributed_train_step(x)
    num_batches += 1
  train_loss = total_loss / num_batches

  # TEST LOOP
  for x in test_dist_dataset:
    distributed_test_step(x)

  if epoch % 2 == 0:
    checkpoint.save(checkpoint_prefix)

  template = ("Epoch {}, Loss: {}, Accuracy: {}, Test Loss: {}, "
              "Test Accuracy: {}")
  print (template.format(epoch+1, train_loss,
                         train_accuracy.result()*100, test_loss.result(),
                         test_accuracy.result()*100))

  test_loss.reset_states()
  train_accuracy.reset_states()
  test_accuracy.reset_states()

Things to note in the example above:

- Iterate over the `train_dist_dataset` and `test_dist_dataset` using  a `for x in ...` construct.
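- The scaled loss is the return value of `distributed_train_step`. This value is aggregated across replicas using the `tf.distribute.Strategy.reduce` call, and then across batches by summing the return values of the `tf.distribute.Strategy.reduce` calls.
- `tf.keras.Metrics` should be updated inside `train_step` and `test_step`, which are executed by `tf.distribute.Strategy.run`.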
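
The two levels of aggregation, across replicas and then across batches, can be written out by hand. The sketch below is plain Python with made-up per-replica loss values for three training steps on two replicas. The built-in `sum` stands in for `strategy.reduce(tf.distribute.ReduceOp.SUM, ...)`.

```python
# Made-up scaled losses returned by train_step: one row per step,
# one entry per replica.
per_replica_losses_by_step = [
    [0.50, 0.25],
    [0.25, 0.25],
    [0.125, 0.125],
]

total_loss = 0.0
num_batches = 0
for per_replica_losses in per_replica_losses_by_step:
    # Across replicas: sum the already-scaled per-replica losses.
    step_loss = sum(per_replica_losses)
    # Across batches: accumulate the per-step results.
    total_loss += step_loss
    num_batches += 1

# Average over the number of batches, as in the training loop.
train_loss = total_loss / num_batches
print(train_loss)
```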


## Restore the latest checkpoint and test

A model checkpointed with a `tf.distribute.Strategy` can be restored with or without a strategy.

In [None]:
eval_accuracy = tf.keras.metrics.SparseCategoricalAccuracy(
      name='eval_accuracy')

new_model = create_model()
new_optimizer = tf.keras.optimizers.Adam()

test_dataset = tf.data.Dataset.from_tensor_slices((test_images, test_labels)).batch(GLOBAL_BATCH_SIZE)

In [None]:
@tf.function
def eval_step(images, labels):
  predictions = new_model(images, training=False)
  eval_accuracy(labels, predictions)

In [None]:
checkpoint = tf.train.Checkpoint(optimizer=new_optimizer, model=new_model)
checkpoint.restore(tf.train.latest_checkpoint(checkpoint_dir))

for images, labels in test_dataset:
  eval_step(images, labels)

print ('Accuracy after restoring the saved model without strategy: {}'.format(
    eval_accuracy.result()*100))

## Alternate ways of iterating over a dataset

### Using iterators

If you want to iterate over a given number of steps and not through the entire dataset, you can create an iterator using the `iter` call and explicitly call `next` on the iterator. You can choose to iterate over the dataset both inside and outside the `tf.function`. Here is a small snippet demonstrating iteration of the dataset outside the `tf.function` using an iterator.


In [None]:
for epoch in range(EPOCHS):
  total_loss = 0.0
  num_batches = 0
  train_iter = iter(train_dist_dataset)

  for _ in range(10):
    total_loss += distributed_train_step(next(train_iter))
    num_batches += 1
  average_train_loss = total_loss / num_batches

  template = ("Epoch {}, Loss: {}, Accuracy: {}")
  print (template.format(epoch+1, average_train_loss, train_accuracy.result()*100))
  train_accuracy.reset_states()

### Iterating inside a tf.function

You can also iterate over the entire input `train_dist_dataset` inside a `tf.function` using the `for x in ...` construct or by creating iterators like you did above. The example below demonstrates wrapping one epoch of training with a `@tf.function` decorator and iterating over `train_dist_dataset` inside the function.

In [None]:
@tf.function
def distributed_train_epoch(dataset):
  total_loss = 0.0
  num_batches = 0
  for x in dataset:
    per_replica_losses = strategy.run(train_step, args=(x,))
    total_loss += strategy.reduce(
      tf.distribute.ReduceOp.SUM, per_replica_losses, axis=None)
    num_batches += 1
  return total_loss / tf.cast(num_batches, dtype=tf.float32)

for epoch in range(EPOCHS):
  train_loss = distributed_train_epoch(train_dist_dataset)

  template = ("Epoch {}, Loss: {}, Accuracy: {}")
  print (template.format(epoch+1, train_loss, train_accuracy.result()*100))

  train_accuracy.reset_states()

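### Tracking training loss across replicas

Note: As a general rule, you should use `tf.keras.Metrics` to track per-sample values and avoid values that have already been aggregated within a replica.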

Because of the loss scaling computation that is carried out, it's not recommended to use `tf.keras.metrics.Mean` to track the training loss across different replicas.

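For example, if you run a training job with the following characteristics:

- Two replicas
- Two samples are processed on each replica
- Resulting loss values: [2, 3] and [4, 5] on each replica
- Global batch size = 4

With loss scaling, you calculate the per-sample value of loss on each replica by adding the loss values, and then dividing by the global batch size. In this case: `(2 + 3) / 4 = 1.25` and `(4 + 5) / 4 = 2.25`.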

If you use `tf.keras.metrics.Mean` to track loss across the two replicas, the result is different. In this example, you end up with a `total` of 3.50 and `count` of 2, which results in `total`/`count` = 1.75  when `result()` is called on the metric. Loss calculated with `tf.keras.Metrics` is scaled by an additional factor that is equal to the number of replicas in sync.
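
The numbers in this example can be reproduced with a small stand-in for the metric. `RunningMean` below is a hypothetical plain-Python class that mimics only the `total`/`count` bookkeeping described above. It is not the `tf.keras.metrics.Mean` implementation.

```python
class RunningMean:
    """Hypothetical stand-in that tracks a running total and count."""

    def __init__(self):
        self.total = 0.0
        self.count = 0

    def update_state(self, value):
        self.total += value
        self.count += 1

    def result(self):
        return self.total / self.count

GLOBAL_BATCH_SIZE = 4
per_replica_losses = [[2.0, 3.0], [4.0, 5.0]]

metric = RunningMean()
for shard in per_replica_losses:
    # Each replica reports its loss already scaled by the global batch size.
    metric.update_state(sum(shard) / GLOBAL_BATCH_SIZE)

# The mean over all four examples, for comparison.
true_mean = sum(sum(shard) for shard in per_replica_losses) / GLOBAL_BATCH_SIZE

print(metric.total, metric.count, metric.result(), true_mean)
```

In this sketch the metric reports 1.75, whereas the mean loss over the four examples is 3.5. The two differ by a factor of 2, the number of replicas.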

### Examples and tutorials

Here are some examples of using a distribution strategy with custom training loops:

1. [Distributed training guide](../../guide/distributed_training)
2. [DenseNet](https://github.com/tensorflow/examples/blob/master/tensorflow_examples/models/densenet/distributed_train.py) example using `MirroredStrategy`.
3. [BERT](https://github.com/tensorflow/models/blob/master/official/legacy/bert/run_classifier.py) example trained using `MirroredStrategy` and `TPUStrategy`. This example is particularly helpful for understanding how to load from a checkpoint and generate periodic checkpoints during distributed training etc.
4. [NCF](https://github.com/tensorflow/models/blob/master/official/recommendation/ncf_keras_main.py) example trained using `MirroredStrategy` that can be enabled using the `keras_use_ctl` flag.
5. [NMT](https://github.com/tensorflow/examples/blob/master/tensorflow_examples/models/nmt_with_attention/distributed_train.py) example trained using `MirroredStrategy`.

You can find more examples listed under *Examples and tutorials* in the [Distribution strategy guide](../../guide/distributed_training.ipynb).

## Next steps

- Try out the new `tf.distribute.Strategy` API on your models.
- Visit the [Better performance with `tf.function`](../../guide/function.ipynb) and [TensorFlow Profiler](../../guide/profiler.md) guides to learn more about tools to optimize the performance of your TensorFlow models.
- Check out the [Distributed training in TensorFlow](../../guide/distributed_training.ipynb) guide, which provides an overview of the available distribution strategies.