
##### Copyright 2020 The TensorFlow Authors.

In [None]:
#@title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# High-performance Simulation with Kubernetes

This tutorial describes how to set up a high-performance simulation using a
TFF runtime running on Kubernetes.

The tutorial uses Google Cloud's [GKE](https://cloud.google.com/kubernetes-engine/) to create the Kubernetes cluster,
but every step after cluster creation works with any Kubernetes
installation.

<table class="tfo-notebook-buttons" align="left">
  <td>
    <a target="_blank" href="https://www.tensorflow.org/federated/tutorials/high_performance_simulation_with_kubernetes"><img src="https://www.tensorflow.org/images/tf_logo_32px.png" />View on TensorFlow.org</a>
  </td>
  <td>
    <a target="_blank" href="https://colab.research.google.com/github/tensorflow/federated/blob/v0.33.0/docs/tutorials/high_performance_simulation_with_kubernetes.ipynb"><img src="https://www.tensorflow.org/images/colab_logo_32px.png" />Run in Google Colab</a>
  </td>
  <td>
    <a target="_blank" href="https://github.com/tensorflow/federated/blob/v0.33.0/docs/tutorials/high_performance_simulation_with_kubernetes.ipynb"><img src="https://www.tensorflow.org/images/GitHub-Mark-32px.png" />View source on GitHub</a>
  </td>
  <td>
    <a href="https://storage.googleapis.com/tensorflow_docs/federated/docs/tutorials/high_performance_simulation_with_kubernetes.ipynb"><img src="https://www.tensorflow.org/images/download_logo_32px.png" />Download notebook</a>
  </td>
</table>

## Launch the TFF Workers on GKE

> **Note:** This tutorial assumes the user has an existing GCP project.

### Create a Kubernetes Cluster

The following step only needs to be done once. The cluster can be re-used for future workloads.

Follow the GKE instructions to [create a container cluster](https://cloud.google.com/kubernetes-engine/docs/tutorials/hello-app#step_4_create_a_container_cluster). The rest of this tutorial assumes that the cluster is named `tff-cluster`, but the actual name isn't important.
Stop following the instructions when you get to "*Step 5: Deploy your application*".

### Deploy the TFF Worker Application

The commands to interact with GCP can be run [locally](https://cloud.google.com/kubernetes-engine/docs/tutorials/hello-app#option_b_use_command-line_tools_locally) or in the [Google Cloud Shell](https://cloud.google.com/shell/). We recommend the Google Cloud Shell since it doesn't require additional setup.

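1. Run one of the following commands to launch the Kubernetes application. Both create a deployment named `tff-workers`, so running the second after the first fails because the deployment already exists.

```
# Option A: use the most recent image.
kubectl create deployment tff-workers --image=gcr.io/tensorflow-federated/remote-executor-service:latest

# Option B: pin the image to a specific digest.
kubectl create deployment tff-workers --image=gcr.io/tensorflow-federated/remote-executor-service@sha256:b38b785e64b6e366c51bf0ac8657961155a285cf04ac10283b6bd6cb2a43a672
```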

2. Add a load balancer for the application.

```
kubectl expose deployment tff-workers --type=LoadBalancer --port 80 --target-port 8000
```

> **Note:** This exposes your deployment to the internet and is for demo
purposes only. For production use, a firewall and authentication are strongly
recommended.

Look up the IP address of the load balancer in the Google Cloud Console. You'll need it later to connect the training loop to the worker app.
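
Before moving on, it can help to confirm that the address you looked up accepts TCP connections. The following is a minimal sketch using only the Python standard library. `can_connect` is a hypothetical helper, not part of TFF or gRPC, and the demo address is a placeholder for a worker on the local machine. An open port only shows that something is listening; it does not prove that the TFF worker service itself is healthy.

```python
import socket


def can_connect(host, port, timeout_s=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout_s.

    Hypothetical helper for illustration; it checks reachability only and
    does not speak gRPC to the worker.
    """
    try:
        # create_connection resolves the host and opens a TCP socket.
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution errors.
        return False


if __name__ == "__main__":
    # Placeholder address: substitute the load balancer IP and port 80.
    print("worker reachable:", can_connect("127.0.0.1", 8000, timeout_s=1.0))
```

If the check returns `False` for the load balancer address, the service may still be provisioning its external IP, or a firewall rule may be blocking the port.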

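### (Alternatively) Launch the Docker Container Locally

```
docker run --rm -p 8000:8000 gcr.io/tensorflow-federated/remote-executor-service:latest
```

With this port mapping, the worker is reachable at `localhost:8000`. Use that address and port instead of the load balancer's when setting up the remote executors.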

## Set Up TFF Environment

In [1]:
#@test {"skip": true}

!pip install --quiet --upgrade tensorflow-federated==0.19.0
!pip install --quiet --upgrade nest-asyncio

import nest_asyncio
nest_asyncio.apply()


## Define the Model to Train

In [2]:
import collections
import time

import tensorflow as tf
import tensorflow_federated as tff

source, _ = tff.simulation.datasets.emnist.load_data()


def map_fn(example):
  return collections.OrderedDict(
      x=tf.reshape(example['pixels'], [-1, 784]), y=example['label'])


def client_data(n):
  ds = source.create_tf_dataset_for_client(source.client_ids[n])
  return ds.repeat(10).batch(20).map(map_fn)


train_data = [client_data(n) for n in range(10)]
input_spec = train_data[0].element_spec


def model_fn():
  model = tf.keras.models.Sequential([
      tf.keras.layers.InputLayer(input_shape=(784,)),
      tf.keras.layers.Dense(units=10, kernel_initializer='zeros'),
      tf.keras.layers.Softmax(),
  ])
  return tff.learning.from_keras_model(
      model,
      input_spec=input_spec,
      loss=tf.keras.losses.SparseCategoricalCrossentropy(),
      metrics=[tf.keras.metrics.SparseCategoricalAccuracy()])


trainer = tff.learning.build_federated_averaging_process(
    model_fn, client_optimizer_fn=lambda: tf.keras.optimizers.SGD(0.02))


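def evaluate(num_rounds=10):
  """Runs `num_rounds` of federated training, printing loss and wall time."""
  state = trainer.initialize()
  # `round_num` avoids shadowing the built-in `round`.
  for round_num in range(num_rounds):
    t1 = time.time()
    state, metrics = trainer.next(state, train_data)
    t2 = time.time()
    print('Round {}: loss {}, round time {}'.format(round_num, metrics.loss, t2 - t1))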



## Set Up the Remote Executors

By default, TFF executes all computations locally. In this step, we tell TFF to connect to the Kubernetes service set up above. Replace the IP address below with the address of your own service.

In [3]:
import grpc

ip_address = '34.94.39.30'  #@param {type:"string"}
port = 80  #@param {type:"integer"}

channels = [grpc.insecure_channel(f'{ip_address}:{port}') for _ in range(10)]

tff.backends.native.set_remote_execution_context(channels)
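
The cell above opens ten channels to the same `ip:port` target, so a mistyped address affects every channel. As a hedged sketch, the target strings could be validated before any channel is created. `build_targets` is a hypothetical helper, not part of TFF or gRPC, and it assumes the worker is addressed by a literal IP address rather than a hostname.

```python
import ipaddress


def build_targets(ip_address, port, num_channels=10):
    """Return `num_channels` identical 'host:port' target strings.

    Hypothetical helper for illustration. Raises ValueError if the IP
    address is malformed, the port is out of range, or fewer than one
    channel is requested.
    """
    ip = ipaddress.ip_address(ip_address)  # Raises ValueError if malformed.
    if not 0 < port < 65536:
        raise ValueError(f"port out of range: {port}")
    if num_channels < 1:
        raise ValueError(f"need at least one channel, got {num_channels}")
    # Bracket IPv6 literals, following the usual host:port convention.
    host = f"[{ip}]" if ip.version == 6 else str(ip)
    return [f"{host}:{port}"] * num_channels


if __name__ == "__main__":
    # Placeholder values matching the local Docker setup.
    print(build_targets("127.0.0.1", 8000, num_channels=2))
```

Each returned string could then be passed to `grpc.insecure_channel` in place of the f-string used in the cell.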

## Run Training

In [4]:
evaluate()
