##### Copyright 2020 The TensorFlow Authors.

In [None]:
#@title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# High-performance Simulation with Kubernetes

---

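This tutorial describes how to set up high-performance simulation using a TFF runtime deployed on Kubernetes.

For demonstration purposes, we'll use the TFF simulation for image classification from the [Federated Learning for Image Classification](https://tensorflow.google.cn/federated/tutorials/federated_learning_for_image_classification) tutorial, but we'll run it against a multi-machine setup consisting of two TFF workers running in Kubernetes. We'll train on the same [EMNIST dataset](https://tensorflow.google.cn/federated/tutorials/federated_learning_for_image_classification#preparing_the_input_data), split into two partitions, one for each TFF worker.

This tutorial refers to the following Google Cloud services:

- [GKE](https://cloud.google.com/kubernetes-engine/) to create the Kubernetes cluster, although all the steps after cluster creation can be used with any Kubernetes installation.
- [Filestore](https://cloud.google.com/filestore) to serve the training data, although any storage medium that can be mounted as a Kubernetes [persistent volume](https://kubernetes.io/docs/concepts/storage/persistent-volumes/) would work as well.

> **Note**: This tutorial assumes you have an existing GCP project.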

<table class="tfo-notebook-buttons" align="left">
  <td>     <a target="_blank" href="https://tensorflow.google.cn/federated/tutorials/high_performance_simulation_with_kubernetes"><img src="https://tensorflow.google.cn/images/tf_logo_32px.png">View on TensorFlow.org</a> </td>
  <td>     <a target="_blank" href="https://colab.research.google.com/github/tensorflow/docs-l10n/blob/master/site/zh-cn/federated/tutorials/high_performance_simulation_with_kubernetes.ipynb"><img src="https://tensorflow.google.cn/images/colab_logo_32px.png">Run in Google Colab</a>
</td>
  <td>     <a target="_blank" href="https://github.com/tensorflow/docs-l10n/blob/master/site/zh-cn/federated/tutorials/high_performance_simulation_with_kubernetes.ipynb"><img src="https://tensorflow.google.cn/images/GitHub-Mark-32px.png">View source on GitHub</a>
</td>
  <td>     <a href="https://storage.googleapis.com/tensorflow_docs/docs-l10n/site/zh-cn/federated/tutorials/high_performance_simulation_with_kubernetes.ipynb"><img src="https://tensorflow.google.cn/images/download_logo_32px.png">Download notebook</a> </td>
</table>

## Launch the TFF Workers on Kubernetes

### Package the TFF Worker Binary

[worker_service.py](https://github.com/tensorflow/federated/blob/main/docs/tutorials/high_performance_simulation_with_kubernetes/worker_service.py) contains the source code for our custom TFF worker. It runs a simulation server with custom logic for loading a dataset partition and sampling from it for each round of federated learning. (See [Loading Remote Data in TFF](https://tensorflow.google.cn/federated/tutorials/loading_remote_data) for details.)
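The per-round sampling idea can be illustrated in isolation. The following is a minimal sketch, not the logic in worker_service.py: it assumes a partition can be represented as a list of client IDs and that each round draws a uniform sample without replacement. The function `sample_clients` and the `client_{i}` IDs are hypothetical names introduced only for this example.

```python
import random
from typing import List, Sequence


def sample_clients(partition_client_ids: Sequence[str],
                   num_clients: int,
                   seed: int) -> List[str]:
  """Uniformly samples `num_clients` distinct client IDs from one partition."""
  if num_clients > len(partition_client_ids):
    raise ValueError('Cannot sample more clients than the partition holds.')
  rng = random.Random(seed)  # Seeded so that a given round is reproducible.
  return rng.sample(list(partition_client_ids), num_clients)


# Hypothetical client IDs standing in for the contents of one partition.
partition = [f'client_{i}' for i in range(100)]
for round_num in range(1, 4):
  cohort = sample_clients(partition, num_clients=5, seed=round_num)
  print(f'round {round_num}: {cohort}')
```

Seeding by round number is a design choice for this sketch only; it makes a round's cohort repeatable when debugging.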

We'll deploy the TFF worker as a containerized application on Kubernetes, starting by building a Docker image. Using this [Dockerfile](https://github.com/tensorflow/federated/blob/main/docs/tutorials/high_performance_simulation_with_kubernetes/Dockerfile), we can package the code by running:

```
$ WORKER_IMAGE=tff-worker-service:latest

$ docker build --tag $WORKER_IMAGE --file "./Dockerfile" .
```

(This assumes that [worker_service.py](https://github.com/tensorflow/federated/blob/main/docs/tutorials/high_performance_simulation_with_kubernetes/worker_service.py) and the [Dockerfile](https://github.com/tensorflow/federated/blob/main/docs/tutorials/high_performance_simulation_with_kubernetes/Dockerfile) are located in your working directory.)

Then publish the image to a container repository that the Kubernetes cluster we're about to create can access, for example:

```
$ docker push $WORKER_IMAGE
```

### Create a Kubernetes Cluster

The following step only needs to be done once. The cluster can be re-used for future workloads.

Follow the GKE instructions to [create a cluster with the Filestore CSI driver enabled](https://cloud.google.com/kubernetes-engine/docs/how-to/persistent-volumes/filestore-csi-driver#enabling_the_on_a_new_cluster), for example:

```
gcloud container clusters create tff-cluster --addons=GcpFilestoreCsiDriver
```

The commands to interact with GCP can be run [locally](https://cloud.google.com/kubernetes-engine/docs/tutorials/hello-app#option_b_use_command-line_tools_locally) or in the [Google Cloud Shell](https://cloud.google.com/shell/). We recommend the Google Cloud Shell since it doesn't require additional setup.

The rest of this tutorial assumes that the cluster is named `tff-cluster`, but the actual name isn't important.

### Deploy the TFF Worker Application

[worker_deployment.yaml](https://github.com/tensorflow/federated/blob/main/docs/tutorials/high_performance_simulation_with_kubernetes/worker_deployment.yaml) declares the configuration for standing up two TFF workers, each running in its own Kubernetes pod with two replicas. We can apply this configuration to our running cluster:

```
kubectl apply -f worker_deployment.yaml
```

Once the change has been requested, you can check that the pods are ready:

```
kubectl get pod
NAME                                        READY   STATUS    RESTARTS   AGE
tff-workers-deployment-1-6bb8d458d5-hjl9d   1/1     Running   0          5m
tff-workers-deployment-1-6bb8d458d5-jgt4b   1/1     Running   0          5m
tff-workers-deployment-2-6cb76c6f5d-hqt88   1/1     Running   0          5m
tff-workers-deployment-2-6cb76c6f5d-xk92h   1/1     Running   0          5m
```
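If you would rather check readiness from a script than by eye, the tabular output can be parsed. The sketch below assumes the default column layout shown above (`NAME`, `READY`, `STATUS` as the first three columns); `pods_not_ready` is a hypothetical helper, and the sample text is the listing from this tutorial rather than live output.

```python
from typing import List


def pods_not_ready(kubectl_output: str) -> List[str]:
  """Returns the names of pods that are not Running with all containers ready."""
  not_ready = []
  lines = kubectl_output.strip().splitlines()
  for line in lines[1:]:  # Skip the header row.
    fields = line.split()
    if len(fields) < 3:
      continue  # Ignore lines that do not look like pod rows.
    name, ready, status = fields[0], fields[1], fields[2]
    ready_count, total_count = ready.split('/')
    if status != 'Running' or ready_count != total_count:
      not_ready.append(name)
  return not_ready


# Sample text copied from the listing above. In practice the text could come
# from subprocess.run(['kubectl', 'get', 'pod'], capture_output=True, text=True).
sample_pod_output = """
NAME                                        READY   STATUS    RESTARTS   AGE
tff-workers-deployment-1-6bb8d458d5-hjl9d   1/1     Running   0          5m
tff-workers-deployment-1-6bb8d458d5-jgt4b   1/1     Running   0          5m
tff-workers-deployment-2-6cb76c6f5d-hqt88   1/1     Running   0          5m
tff-workers-deployment-2-6cb76c6f5d-xk92h   1/1     Running   0          5m
"""
print(pods_not_ready(sample_pod_output))
```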

Each worker instance runs behind a load balancer with an endpoint. Look up the external IP addresses of the load balancers:

```
kubectl get service
NAME                    TYPE           CLUSTER-IP    EXTERNAL-IP     PORT(S)        AGE
tff-workers-service-1   LoadBalancer   XX.XX.X.XXX   XX.XXX.XX.XXX   80:31830/TCP   6m
tff-workers-service-2   LoadBalancer   XX.XX.X.XXX   XX.XXX.XX.XXX   80:31319/TCP   6m
```

You'll need these addresses later to connect the training loop to the running workers.

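> **Note:** This exposes your deployment to the internet and is for demo purposes only. For production use, a firewall and authentication are strongly recommended.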
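The external addresses can also be pulled out of the service listing programmatically. This is a sketch under the assumption that the output uses the default columns shown above; `external_ips` is a hypothetical helper, and the addresses in the sample are placeholders from a documentation-reserved range, not real endpoints.

```python
from typing import Dict


def external_ips(kubectl_output: str) -> Dict[str, str]:
  """Maps service name to external IP, skipping services without one."""
  lines = kubectl_output.strip().splitlines()
  header = lines[0].split()
  name_col = header.index('NAME')
  ip_col = header.index('EXTERNAL-IP')
  result = {}
  for line in lines[1:]:
    fields = line.split()
    ip = fields[ip_col]
    # Values such as <pending> or <none> mean no address has been assigned.
    if not ip.startswith('<'):
      result[fields[name_col]] = ip
  return result


# Placeholder addresses (203.0.113.0/24 is reserved for documentation).
sample_service_output = """
NAME                    TYPE           CLUSTER-IP    EXTERNAL-IP     PORT(S)        AGE
tff-workers-service-1   LoadBalancer   10.0.0.11     203.0.113.10    80:31830/TCP   6m
tff-workers-service-2   LoadBalancer   10.0.0.12     <pending>       80:31319/TCP   6m
"""
print(external_ips(sample_service_output))
```

A service whose load balancer is still being provisioned shows `<pending>`, so an empty or partial result is a hint to wait and query again.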

## Prepare Training Data

The EMNIST partitions that we'll use for training can be downloaded from TFF's public [datasets repository](https://console.cloud.google.com/storage/browser/tff-datasets-public/emnist-partitions/2-partition):

```
gsutil cp -r gs://tff-datasets-public/emnist-partitions/2-partition .
```

The partitions can then be uploaded to each pod by copying them to a replica, for example:

```
kubectl cp emnist_part_1.sqlite tff-workers-deployment-1-6bb8d458d5-hjl9d:/root/worker/data/emnist_partition.sqlite

kubectl cp emnist_part_2.sqlite tff-workers-deployment-2-6cb76c6f5d-hqt88:/root/worker/data/emnist_partition.sqlite
```
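Before copying a partition, it can be worth confirming that the downloaded file opens as a SQLite database. The sketch below only lists table names, so it makes no assumption about the partition schema; `list_tables`, the demo file, and the `demo_examples` table are hypothetical and exist only to make the example self-contained.

```python
from contextlib import closing
import os
import sqlite3
import tempfile
from typing import List


def list_tables(sqlite_path: str) -> List[str]:
  """Returns the table names in a SQLite file, failing if the file is missing."""
  if not os.path.exists(sqlite_path):
    # sqlite3.connect would silently create an empty database otherwise.
    raise FileNotFoundError(sqlite_path)
  with closing(sqlite3.connect(sqlite_path)) as conn:
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
  return [name for (name,) in rows]


# Self-contained demo on a throwaway database with a made-up table.
with tempfile.TemporaryDirectory() as tmp_dir:
  demo_path = os.path.join(tmp_dir, 'demo_partition.sqlite')
  with closing(sqlite3.connect(demo_path)) as conn:
    conn.execute('CREATE TABLE demo_examples (client_id TEXT, payload BLOB)')
    conn.commit()
  demo_tables = list_tables(demo_path)
print(demo_tables)
```

Calling `list_tables('emnist_part_1.sqlite')` on a real download should return a non-empty list if the file transferred intact.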

## Run Simulation

We are now ready to run a simulation against our cluster.

### Set Up the TFF Environment

In [None]:
#@test {"skip": true}
!pip install --quiet --upgrade tensorflow-federated
!pip install --quiet --upgrade nest-asyncio

import nest_asyncio
nest_asyncio.apply()

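### Define the Training Procedure

The following defines the dataset iteration, the model architecture, and the round-over-round process for federated learning. ([Learn more](https://tensorflow.google.cn/federated/tutorials/loading_remote_data#training_the_model).)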

In [None]:
import collections
from typing import Any, Optional, List
import tensorflow as tf
import tensorflow_federated as tff


class FederatedData(tff.program.FederatedDataSource,
                    tff.program.FederatedDataSourceIterator):
  """Interface for interacting with the federated training data."""

  def __init__(self, type_spec: tff.FederatedType):
    self._type_spec = type_spec
    self._capabilities = [tff.program.Capability.RANDOM_UNIFORM]

  @property
  def federated_type(self) -> tff.FederatedType:
    return self._type_spec

  @property
  def capabilities(self) -> List[tff.program.Capability]:
    return self._capabilities

  def iterator(self) -> tff.program.FederatedDataSourceIterator:
    return self

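  def select(self, num_clients: Optional[int] = None) -> Any:
    # `range(None)` would raise an opaque TypeError, so fail early instead.
    if num_clients is None or num_clients < 0:
      raise ValueError('num_clients must be a non-negative integer.')
    data_uris = [f'uri://{i}' for i in range(num_clients)]
    return tff.framework.CreateDataDescriptor(
        arg_uris=data_uris, arg_type=self._type_spec)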


input_spec = collections.OrderedDict([
    ('x', tf.TensorSpec(shape=(1, 784), dtype=tf.float32, name=None)),
    ('y', tf.TensorSpec(shape=(1, 1), dtype=tf.int32, name=None))
])
element_type = tff.types.StructWithPythonType(
    input_spec, container_type=collections.OrderedDict)
dataset_type = tff.types.SequenceType(element_type)

train_data_source = FederatedData(type_spec=dataset_type)
train_data_iterator = train_data_source.iterator()

def model_fn():
  model = tf.keras.models.Sequential([
      tf.keras.layers.InputLayer(input_shape=(784,)),
      tf.keras.layers.Dense(units=10, kernel_initializer='zeros'),
      tf.keras.layers.Softmax(),
  ])
  return tff.learning.from_keras_model(
      model,
      input_spec=input_spec,
      loss=tf.keras.losses.SparseCategoricalCrossentropy(),
      metrics=[tf.keras.metrics.SparseCategoricalAccuracy()])


trainer = tff.learning.algorithms.build_weighted_fed_avg(
    model_fn,
    client_optimizer_fn=lambda: tf.keras.optimizers.SGD(learning_rate=0.02),
    server_optimizer_fn=lambda: tf.keras.optimizers.SGD(learning_rate=1.0))


def train_loop(num_rounds=10, num_clients=10):
  state = trainer.initialize()
  for round in range(1, num_rounds + 1):
    train_data = train_data_iterator.select(num_clients)
    result = trainer.next(state, train_data)
    state = result.state
    train_metrics = result.metrics['client_work']['train']
    print('round {:2d}, metrics={}'.format(round, train_metrics))
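
The aggregation step behind a weighted federated averaging round can be shown without TFF. The sketch below is a simplified illustration in plain Python, not TFF's implementation: it assumes each client reports an update vector (here defined as final weights minus initial weights) together with its example count, and that the server applies the weighted mean with a learning rate of 1.0. `weighted_average` and `server_step` are hypothetical names, and TFF's internal sign convention and data structures may differ.

```python
from typing import List, Sequence, Tuple


def weighted_average(
    client_updates: Sequence[Tuple[List[float], int]]) -> List[float]:
  """Averages client update vectors, weighting each by its example count."""
  total_examples = sum(num_examples for _, num_examples in client_updates)
  if total_examples <= 0:
    raise ValueError('At least one client must report examples.')
  averaged = [0.0] * len(client_updates[0][0])
  for update, num_examples in client_updates:
    for i, value in enumerate(update):
      averaged[i] += value * num_examples / total_examples
  return averaged


def server_step(weights: List[float],
                averaged_update: List[float],
                learning_rate: float = 1.0) -> List[float]:
  """Moves the server weights along the averaged client update."""
  return [w + learning_rate * u for w, u in zip(weights, averaged_update)]


# Two made-up clients: the second has three times as many examples,
# so its update dominates the average.
updates = [([1.0, 2.0], 10), ([3.0, 4.0], 30)]
averaged = weighted_average(updates)
print(averaged)  # [2.5, 3.5]
print(server_step([0.0, 0.0], averaged))  # [2.5, 3.5]
```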

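### Connect to the TFF Workers

By default, TFF executes all computations locally. In this step we tell TFF to connect to the Kubernetes services we set up above. Be sure to copy the IP addresses of your services here.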

In [None]:
import grpc

ip_address_1 = '0.0.0.0'  #@param {type:"string"}
ip_address_2 = '0.0.0.0'  #@param {type:"string"}
port = 80

channels = [
    grpc.insecure_channel(f'{ip_address_1}:{port}'),
    grpc.insecure_channel(f'{ip_address_2}:{port}')
]

tff.backends.native.set_remote_python_execution_context(channels)
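
Because the cell above ships with `0.0.0.0` placeholders, a small check before creating channels can catch a forgotten edit. This sketch uses only the standard library and assumes IPv4 addresses; `build_targets` is a hypothetical helper, and the sample addresses are placeholders from a documentation-reserved range.

```python
import ipaddress
from typing import List, Sequence


def build_targets(ip_addresses: Sequence[str], port: int) -> List[str]:
  """Validates IPv4 addresses and returns 'ip:port' target strings."""
  targets = []
  for ip in ip_addresses:
    parsed = ipaddress.IPv4Address(ip)  # Raises ValueError if malformed.
    if parsed.is_unspecified:
      raise ValueError(
          f'{ip} is a placeholder; replace it with a service external IP.')
    targets.append(f'{ip}:{port}')
  return targets


# Placeholder addresses (203.0.113.0/24 is reserved for documentation).
targets = build_targets(['203.0.113.10', '203.0.113.11'], port=80)
print(targets)
```

Each resulting string can be passed to `grpc.insecure_channel` in place of the f-strings used in the cell above.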

### Run Training

In [None]:
train_loop()

round  1, metrics=OrderedDict([('sparse_categorical_accuracy', 0.10557769), ('loss', 12.475689), ('num_examples', 5020), ('num_batches', 5020)])
round  2, metrics=OrderedDict([('sparse_categorical_accuracy', 0.11940298), ('loss', 10.497084), ('num_examples', 5360), ('num_batches', 5360)])
round  3, metrics=OrderedDict([('sparse_categorical_accuracy', 0.16223507), ('loss', 7.569645), ('num_examples', 5190), ('num_batches', 5190)])
round  4, metrics=OrderedDict([('sparse_categorical_accuracy', 0.2648384), ('loss', 6.0947175), ('num_examples', 5105), ('num_batches', 5105)])
round  5, metrics=OrderedDict([('sparse_categorical_accuracy', 0.29003084), ('loss', 6.2815433), ('num_examples', 4865), ('num_batches', 4865)])
round  6, metrics=OrderedDict([('sparse_categorical_accuracy', 0.40237388), ('loss', 4.630901), ('num_examples', 5055), ('num_batches', 5055)])
round  7, metrics=OrderedDict([('sparse_categorical_accuracy', 0.4288425), ('loss', 4.2358975), ('num_examples', 5270), ('num_batches