Merged
41 changes: 21 additions & 20 deletions site/en/tutorials/distribute/multi_worker_with_keras.ipynb
@@ -194,7 +194,7 @@
"id": "fLW6D2TzvC-4"
},
"source": [
"Next, create an `mnist.py` file with a simple model and dataset setup. This Python file will be used by the worker-processes in this tutorial:"
"Next, create an `mnist_setup.py` file with a simple model and dataset setup. This Python file will be used by the worker processes in this tutorial:"
]
},
{
@@ -205,7 +205,7 @@
},
"outputs": [],
"source": [
"%%writefile mnist.py\n",
"%%writefile mnist_setup.py\n",
"\n",
"import os\n",
"import tensorflow as tf\n",
@@ -256,11 +256,11 @@
},
"outputs": [],
"source": [
"import mnist\n",
"import mnist_setup\n",
"\n",
"batch_size = 64\n",
"single_worker_dataset = mnist.mnist_dataset(batch_size)\n",
"single_worker_model = mnist.build_and_compile_cnn_model()\n",
"single_worker_dataset = mnist_setup.mnist_dataset(batch_size)\n",
"single_worker_model = mnist_setup.build_and_compile_cnn_model()\n",
"single_worker_model.fit(single_worker_dataset, epochs=3, steps_per_epoch=70)"
]
},
@@ -439,7 +439,7 @@
"\n",
"This tutorial demonstrates how to perform synchronous multi-worker training using an instance of `tf.distribute.MultiWorkerMirroredStrategy`.\n",
"\n",
"`MultiWorkerMirroredStrategy` creates copies of all variables in the model's layers on each device across all workers. It uses `CollectiveOps`, a TensorFlow op for collective communication, to aggregate gradients and keep the variables in sync. The [`tf.distribute.Strategy` guide](../../guide/distributed_training.ipynb) has more details about this strategy."
"`MultiWorkerMirroredStrategy` creates copies of all variables in the model's layers on each device across all workers. It uses `CollectiveOps`, a TensorFlow op for collective communication, to aggregate gradients and keep the variables in sync. The `tf.distribute.Strategy` [guide](../../guide/distributed_training.ipynb) has more details about this strategy."
]
},
{
@@ -459,7 +459,7 @@
"id": "N0iv7SyyAohc"
},
"source": [
"Note: `TF_CONFIG` is parsed and TensorFlow's GRPC servers are started at the time `MultiWorkerMirroredStrategy()` is called, so the `TF_CONFIG` environment variable must be set before a `tf.distribute.Strategy` instance is created. Since `TF_CONFIG` is not set yet, the above strategy is effectively single-worker training."
"Note: `TF_CONFIG` is parsed and TensorFlow's GRPC servers are started at the time `MultiWorkerMirroredStrategy` is called, so the `TF_CONFIG` environment variable must be set before a `tf.distribute.Strategy` instance is created. Since `TF_CONFIG` is not set yet, the above strategy is effectively single-worker training."
]
},
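As a minimal sketch of what the note above describes, the snippet below builds a hypothetical `TF_CONFIG` for a two-worker cluster on localhost and exports it before any strategy would be created. The ports and worker count are illustrative assumptions, not values from the tutorial:

```python
import json
import os

# Hypothetical two-worker cluster: every worker runs the same program and
# identifies itself by its index in the `worker` task list.
tf_config = {
    'cluster': {
        'worker': ['localhost:12345', 'localhost:23456']
    },
    'task': {'type': 'worker', 'index': 0}  # this process is worker 0
}

# `TF_CONFIG` must be in the environment *before* the strategy is created,
# because `MultiWorkerMirroredStrategy()` parses it and starts gRPC servers
# at construction time.
os.environ['TF_CONFIG'] = json.dumps(tf_config)

num_workers = len(tf_config['cluster']['worker'])
print(num_workers)  # 2
```

Each worker process would export the same `cluster` dict but a different `task` index.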
{
@@ -468,7 +468,7 @@
"id": "FMy2VM4Akzpr"
},
"source": [
"`MultiWorkerMirroredStrategy` provides multiple implementations via the [`CommunicationOptions`](https://www.tensorflow.org/api_docs/python/tf/distribute/experimental/CommunicationOptions) parameter: 1) `RING` implements ring-based collectives using gRPC as the cross-host communication layer; 2) `NCCL` uses the [NVIDIA Collective Communication Library](https://developer.nvidia.com/nccl) to implement collectives; and 3) `AUTO` defers the choice to the runtime. The best choice of collective implementation depends upon the number and kind of GPUs, and the network interconnect in the cluster."
"`MultiWorkerMirroredStrategy` provides multiple implementations via the `tf.distribute.experimental.CommunicationOptions` parameter: 1) `RING` implements ring-based collectives using gRPC as the cross-host communication layer; 2) `NCCL` uses the [NVIDIA Collective Communication Library](https://developer.nvidia.com/nccl) to implement collectives; and 3) `AUTO` defers the choice to the runtime. The best choice of collective implementation depends upon the number and kind of GPUs, and the network interconnect in the cluster."
]
},
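The trade-off described above can be sketched as a plain-Python heuristic. This is *not* TensorFlow's actual `AUTO` selection logic, only an illustration of the kind of decision the runtime makes (the function name and rule are assumptions for illustration):

```python
def choose_collective_implementation(num_gpus: int, nccl_available: bool) -> str:
    """Illustrative stand-in for the choice `AUTO` defers to the runtime:
    prefer NCCL when multiple NVIDIA GPUs and the NCCL library are present,
    otherwise fall back to ring-based collectives over gRPC."""
    if num_gpus > 1 and nccl_available:
        return 'NCCL'
    return 'RING'

print(choose_collective_implementation(4, True))   # NCCL
print(choose_collective_implementation(0, False))  # RING
```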
{
@@ -492,7 +492,7 @@
"source": [
"with strategy.scope():\n",
" # Model building/compiling need to be within `strategy.scope()`.\n",
" multi_worker_model = mnist.build_and_compile_cnn_model()"
" multi_worker_model = mnist_setup.build_and_compile_cnn_model()"
]
},
{
@@ -512,7 +512,7 @@
"source": [
"To actually run with `MultiWorkerMirroredStrategy` you'll need to run worker processes and pass a `TF_CONFIG` to them.\n",
"\n",
"Like the `mnist.py` file written earlier, here is the `main.py` that each of the workers will run:"
"Like the `mnist_setup.py` file written earlier, here is the `main.py` that each of the workers will run:"
]
},
{
@@ -529,7 +529,7 @@
"import json\n",
"\n",
"import tensorflow as tf\n",
"import mnist\n",
"import mnist_setup\n",
"\n",
"per_worker_batch_size = 64\n",
"tf_config = json.loads(os.environ['TF_CONFIG'])\n",
@@ -538,11 +538,11 @@
"strategy = tf.distribute.MultiWorkerMirroredStrategy()\n",
"\n",
"global_batch_size = per_worker_batch_size * num_workers\n",
"multi_worker_dataset = mnist.mnist_dataset(global_batch_size)\n",
"multi_worker_dataset = mnist_setup.mnist_dataset(global_batch_size)\n",
"\n",
"with strategy.scope():\n",
" # Model building/compiling need to be within `strategy.scope()`.\n",
" multi_worker_model = mnist.build_and_compile_cnn_model()\n",
" multi_worker_model = mnist_setup.build_and_compile_cnn_model()\n",
"\n",
"\n",
"multi_worker_model.fit(multi_worker_dataset, epochs=3, steps_per_epoch=70)"
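The global batch size arithmetic in `main.py` above can be sketched in isolation: the dataset is built with the *global* batch size, and under the default sharding each worker then processes its share of every batch, so per-worker load stays roughly constant as workers are added.

```python
per_worker_batch_size = 64

# Scaling the global batch size with the worker count keeps the
# per-worker share fixed at `per_worker_batch_size`.
for num_workers in (1, 2, 4):
    global_batch_size = per_worker_batch_size * num_workers
    print(num_workers, global_batch_size, global_batch_size // num_workers)
```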
@@ -820,7 +820,7 @@
"options.experimental_distribute.auto_shard_policy = tf.data.experimental.AutoShardPolicy.OFF\n",
"\n",
"global_batch_size = 64\n",
"multi_worker_dataset = mnist.mnist_dataset(batch_size=64)\n",
"multi_worker_dataset = mnist_setup.mnist_dataset(batch_size=64)\n",
"dataset_no_auto_shard = multi_worker_dataset.with_options(options)"
]
},
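With auto-sharding switched off as above, every worker receives the full dataset, and any splitting is up to you. If you do shard manually, the semantics resemble `tf.data.Dataset.shard(num_shards, index)`; a stdlib sketch of that behavior (the `shard` helper here is illustrative, not the TensorFlow API):

```python
def shard(data, num_shards, index):
    """Sketch of `Dataset.shard` semantics: shard `index` keeps every
    `num_shards`-th element, starting at offset `index`."""
    return data[index::num_shards]

data = list(range(8))
print(shard(data, 2, 0))  # [0, 2, 4, 6]  -> worker 0
print(shard(data, 2, 1))  # [1, 3, 5, 7]  -> worker 1
```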
@@ -882,7 +882,7 @@
"\n",
"When a worker becomes unavailable, other workers will fail (possibly after a timeout). In such cases, the unavailable worker needs to be restarted, as well as other workers that have failed.\n",
"\n",
"Note: Previously, the `ModelCheckpoint` callback provided a mechanism to restore the training state upon a restart from a job failure for multi-worker training. The TensorFlow team are introducing a new [`BackupAndRestore`](#scrollTo=kmH8uCUhfn4w) callback, to also add the support to single worker training for a consistent experience, and removed fault tolerance functionality from existing `ModelCheckpoint` callback. From now on, applications that rely on this behavior should migrate to the new callback."
"Note: Previously, the `ModelCheckpoint` callback provided a mechanism to restore the training state upon a restart from a job failure for multi-worker training. The TensorFlow team is introducing a new [`BackupAndRestore`](#scrollTo=kmH8uCUhfn4w) callback, which also adds support for single-worker training for a consistent experience, and has removed the fault tolerance functionality from the existing `ModelCheckpoint` callback. Going forward, applications that rely on this behavior should migrate to the new `BackupAndRestore` callback."
]
},
{
@@ -1129,8 +1129,9 @@
"\n",
"The `BackupAndRestore` callback uses the `CheckpointManager` to save and restore the training state, which generates a file called `checkpoint` that tracks existing checkpoints together with the latest one. For this reason, `backup_dir` should not be reused to store other checkpoints, in order to avoid name collisions.\n",
"\n",
"Currently, the `BackupAndRestore` callback supports single worker with no strategy, MirroredStrategy, and multi-worker with MultiWorkerMirroredStrategy.\n",
"Below are two examples for both multi-worker training and single worker training."
"Currently, the `BackupAndRestore` callback supports single-worker training with no strategy, `MirroredStrategy`, and multi-worker training with `MultiWorkerMirroredStrategy`.\n",
"\n",
"Below are two examples for both multi-worker training and single-worker training:"
]
},
{
@@ -1141,12 +1142,12 @@
},
"outputs": [],
"source": [
"# Multi-worker training with MultiWorkerMirroredStrategy\n",
"# and the BackupAndRestore callback.\n",
"# Multi-worker training with `MultiWorkerMirroredStrategy`\n",
"# and the `BackupAndRestore` callback.\n",
"\n",
"callbacks = [tf.keras.callbacks.BackupAndRestore(backup_dir='/tmp/backup')]\n",
"with strategy.scope():\n",
" multi_worker_model = mnist.build_and_compile_cnn_model()\n",
" multi_worker_model = mnist_setup.build_and_compile_cnn_model()\n",
"multi_worker_model.fit(multi_worker_dataset,\n",
" epochs=3,\n",
" steps_per_epoch=70,\n",