# Chapter 12: Distributing TensorFlow Across Devices and Servers
----

TensorFlow's support for distributed computing is one of its main highlights: it gives you full control over how to split your computation graph across servers and devices.

## Multiple Devices on a Single Machine
----

#### Installation

- [AWS instructions](http://goo.gl/kbge5b)
- [Google Cloud Learning](https://cloud.google.com/ml)
- [Good build options for deep learning](https://goo.gl/pCtSAn)
- Currently using Colab

Steps: [GPU TensorFlow](https://www.tensorflow.org/install/gpu)
- Install Nvidia drivers
- Install CUDA toolkit
- Install cuDNN 

In [0]:
# Common imports
import numpy as np
import os
import tensorflow as tf  # needed by reset_graph() below

# to make this notebook's output stable across runs
def reset_graph(seed=42):
    tf.reset_default_graph()
    tf.set_random_seed(seed)
    np.random.seed(seed)

# To plot pretty figures
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
plt.rcParams['axes.labelsize'] = 14
plt.rcParams['xtick.labelsize'] = 12
plt.rcParams['ytick.labelsize'] = 12

In [4]:
import tensorflow as tf
device_name = tf.test.gpu_device_name()
if device_name != '/device:GPU:0':
  raise SystemError('GPU device not found')
print('Found GPU at: {}'.format(device_name))

Found GPU at: /device:GPU:0


In [5]:
from tensorflow.python.client import device_lib
print(device_lib.list_local_devices())

[name: "/device:CPU:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
incarnation: 6249105289000007423
, name: "/device:GPU:0"
device_type: "GPU"
memory_limit: 11281553818
locality {
  bus_id: 1
  links {
  }
}
incarnation: 6822170813712466320
physical_device_desc: "device: 0, name: Tesla K80, pci bus id: 0000:00:04.0, compute capability: 3.7"
]


#### Managing the GPU RAM

By default, TensorFlow grabs all of the RAM on all available GPUs the first time you run a graph, so a second TensorFlow program cannot start while the first one is still running.


One solution is to run each program on different GPUs by restricting which cards each process can see:
```
$ CUDA_VISIBLE_DEVICES=0,1 python3 program_1.py
# and in another terminal
$ CUDA_VISIBLE_DEVICES=3,2 python3 program_2.py
```
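
The same restriction can be applied from a launcher script instead of the shell. This is a minimal sketch, assuming a hypothetical helper `env_for_gpus` (not part of TensorFlow) that builds the environment for a child process. The variable has to be set before the child initializes CUDA.

```python
import os

def env_for_gpus(gpu_ids, base_env=None):
    """Return a copy of the environment that exposes only the given GPU cards.

    Hypothetical helper for illustration; it only sets CUDA_VISIBLE_DEVICES.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in gpu_ids)
    return env

# Empty base environments keep the demo independent of the current shell
env_1 = env_for_gpus([0, 1], base_env={})
env_2 = env_for_gpus([3, 2], base_env={})
print(env_1["CUDA_VISIBLE_DEVICES"])
print(env_2["CUDA_VISIBLE_DEVICES"])
```

Each environment could then be passed to something like `subprocess.run(["python3", "program_1.py"], env=env_1)`. Inside the second program, card 3 would show up as `/gpu:0` and card 2 as `/gpu:1`, since devices are renumbered in the order listed.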

Alternatively, tell TensorFlow to grab only a fraction of each GPU's memory:

```(python)
config = tf.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.4
session = tf.Session(config=config)
```
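
As a rough worked example of what a 0.4 fraction means, the arithmetic below uses the `memory_limit` value reported for `/device:GPU:0` in the device listing earlier in this notebook. Treating that figure as the base of the fraction is an assumption made for illustration; the exact accounting may differ.

```python
# memory_limit reported for /device:GPU:0 in the device listing (bytes)
gpu_memory_bytes = 11281553818
fraction = 0.4  # value given to per_process_gpu_memory_fraction

# Approximate memory one process may grab on this card
budget_bytes = int(gpu_memory_bytes * fraction)
print("Per-process budget: {:.2f} GiB".format(budget_bytes / 2**30))

# Number of such processes that fit on the card at the same time
max_processes = int(1 / fraction)
print("Processes that fit:", max_processes)
```

Two programs configured this way can share the card, but not three, since 3 × 0.4 > 1.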

Or, let TensorFlow grab memory only when it needs it (it never releases memory once it has grabbed it):
```(python)
config.gpu_options.allow_growth = True
```

#### Placing Operations on Devices
[The Tensorflow whitepaper](http://goo.gl/vSjA14)

##### Simple placement
Rules:
- If a node was placed on a device in a previous run of the graph, it is left on that device
- Else, if the user pinned a node to a device, the placer places it on that device
- Else, it defaults to GPU #0, or to the CPU if there is no GPU
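
The three rules above can be sketched as a small pure-Python function. This is only an illustration of the decision order, not TensorFlow's actual placer. The function name `simple_place` and the dictionaries are hypothetical.

```python
def simple_place(node, previous_placement, pinned, gpu_available=True):
    """Pick a device for `node` following the three simple-placement rules."""
    if node in previous_placement:   # rule 1: keep the earlier placement
        return previous_placement[node]
    if node in pinned:               # rule 2: honor the user's pin
        return pinned[node]
    # rule 3: default to GPU #0, or to the CPU when there is no GPU
    return "/gpu:0" if gpu_available else "/cpu:0"

nodes = ["a", "b", "c"]
pinned = {"a": "/cpu:0", "b": "/cpu:0"}

# First run: nothing has been placed yet
first_run = {n: simple_place(n, {}, pinned) for n in nodes}
print(first_run)

# Second run: a new pin on c is ignored because c already has a placement
second_run = {n: simple_place(n, first_run, {"c": "/cpu:0"}) for n in nodes}
print(second_run)
```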

```(python)
with tf.device("/cpu:0"): 
    a = tf.Variable(3.0)
    b = tf.Variable(4.0)
c = a * b
```

`a` and `b` are pinned to `/cpu:0`. `c` is not pinned to any device, so it is placed on the default device, `/gpu:0`.

##### Logging placement

In [0]:
config = tf.ConfigProto()
config.log_device_placement = True
sess = tf.Session(config=config)

##### Dynamic placement function

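Instead of a device name, you can pass `tf.device()` a function that returns the device for each operation. This one places variables on the CPU and everything else on GPU #0: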

In [0]:
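def variables_on_cpu(op):
    # tf.Variable creates ops of type "VariableV2" in recent TF 1.x releases
    # ("Variable" in older ones), so accept both
    if op.type in ("Variable", "VariableV2"):
        return "/cpu:0"
    else:
        return "/gpu:0"

with tf.device(variables_on_cpu):
    a = tf.Variable(3.0)
    b = tf.Variable(4.0)
    c = a * b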

##### Operations and kernels
For a TensorFlow operation to run on a device, it needs to have an implementation for that device; this is called a kernel.

NB: Not every operation has a kernel for every device. For example, integer variables have no GPU kernel, so pinning one to a GPU fails:

In [11]:
reset_graph()
sess = tf.Session()
with tf.device("/gpu:0"):
  i = tf.Variable(3)
sess.run(i.initializer)

InvalidArgumentError: ignored

##### Soft placement
To fall back to the CPU when an operation has no GPU kernel, set **allow_soft_placement** to `True`:

In [0]:
reset_graph()

with tf.device("/gpu:0"):
  i = tf.Variable(3)  # integer variable: no GPU kernel, so soft placement moves it to the CPU
config = tf.ConfigProto()
config.allow_soft_placement = True
sess = tf.Session(config=config)
sess.run(i.initializer)
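
The interaction between kernels and soft placement can be sketched without TensorFlow. The registry `KERNELS` and the function `place` below are hypothetical stand-ins, and the registry contents are simplified assumptions that only mirror the integer-variable case above.

```python
# Hypothetical kernel registry: (op type, dtype) -> device types with a kernel
KERNELS = {
    ("Variable", "int32"): {"CPU"},
    ("Variable", "float32"): {"CPU", "GPU"},
}

def place(op_type, dtype, requested, allow_soft_placement=False):
    """Return the device type the op ends up on, or raise if none is valid."""
    supported = KERNELS[(op_type, dtype)]
    if requested in supported:
        return requested
    if allow_soft_placement and "CPU" in supported:
        return "CPU"  # fall back to the CPU instead of failing
    raise ValueError("no {} kernel for {} {}".format(requested, dtype, op_type))

print(place("Variable", "float32", "GPU"))
print(place("Variable", "int32", "GPU", allow_soft_placement=True))
```

Without `allow_soft_placement=True`, the second call raises an error, which corresponds to the `InvalidArgumentError` seen in the previous cell.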

#### Parallel Execution

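The thread pools used for parallel execution can be tuned through these `tf.ConfigProto` options:
 - `inter_op_parallelism_threads`: number of threads per inter-op pool, which runs independent operations in parallel
 - `use_per_session_threads`: set it to `True` to give a session its own inter-op pools instead of reusing the ones created by the first session
 - `intra_op_parallelism_threads`: number of threads per intra-op pool, which parallelizes a single multithreaded kernel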

#### Control Dependencies
To postpone the evaluation of some nodes, a simple solution is to add _control dependencies_:

In [0]:
reset_graph()

a = tf.constant(1.0)
b = a + 2.0

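# x and y are evaluated only after a and b have been evaluated
with tf.control_dependencies([a,b]):
  x = tf.constant(3.0)
  y = tf.constant(4.0)
# z is outside the block but depends on x and y, so it also waits for a and b
z = x + y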
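
The effect on evaluation order can be illustrated with a small dependency-ordering sketch in plain Python. The function `evaluation_order` is a hypothetical helper, not how TensorFlow schedules operations. It treats data and control dependencies alike as prerequisites.

```python
def evaluation_order(deps):
    """Return one valid evaluation order given {node: set of prerequisites}."""
    remaining = {node: set(pre) for node, pre in deps.items()}
    order = []
    while remaining:
        # Nodes whose prerequisites have all been evaluated are ready to run
        ready = sorted(n for n, pre in remaining.items() if not pre)
        if not ready:
            raise ValueError("dependency cycle")
        for n in ready:
            order.append(n)
            del remaining[n]
        for pre in remaining.values():
            pre.difference_update(ready)
    return order

deps = {
    "a": set(),
    "b": {"a"},        # data dependency: b = a + 2.0
    "x": {"a", "b"},   # control dependencies from the with block
    "y": {"a", "b"},   # control dependencies from the with block
    "z": {"x", "y"},   # data dependency: z = x + y
}
order = evaluation_order(deps)
print(order)
```

Because `b` already depends on `a`, listing only `b` in the control dependencies would give the same ordering.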

## Multiple Devices Across Multiple Servers
----
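First, define a cluster specification that maps each job name to the addresses of its tasks, then start one server per task: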

In [0]:
cluster_spec = tf.train.ClusterSpec({
    "ps": [
        "127.0.0.1:2221",  # /job:ps/task:0
        "127.0.0.1:2222",  # /job:ps/task:1
    ],
    "worker": [
        "127.0.0.1:2223",  # /job:worker/task:0
        "127.0.0.1:2224",  # /job:worker/task:1
        "127.0.0.1:2225",  # /job:worker/task:2
    ]})

task_ps0 = tf.train.Server(cluster_spec, job_name="ps", task_index=0)
task_ps1 = tf.train.Server(cluster_spec, job_name="ps", task_index=1)
task_worker0 = tf.train.Server(cluster_spec, job_name="worker", task_index=0)
task_worker1 = tf.train.Server(cluster_spec, job_name="worker", task_index=1)
task_worker2 = tf.train.Server(cluster_spec, job_name="worker", task_index=2)

#### Opening a Session
Once all the tasks are up and running, a client can open a session on any of the servers by passing its address:



In [0]:
a = tf.constant(1.0)
b = a + 2
c = a * 2

with tf.Session("grpc://machine-b.example.com:2222") as sess:
  print(c.eval())

#### The Master and Worker Services

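The client uses the gRPC protocol (Google Remote Procedure Call) to communicate with the server. Every TensorFlow server provides two services: the master service, which lets clients open sessions and run graphs, and the worker service, which actually executes computations on the server's local devices.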

#### Pinning Operations Across Tasks
You can use device blocks to pin operations on any device managed by any task, by specifying the job name, task index, device type and device index.

In [0]:
with tf.device("/job:ps/task:0/cpu:0"):
  pass  # ...

with tf.device("/job:worker/task:0/gpu:1"):
  pass  # ...
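
The device strings follow a fixed pattern built from the four parts named above. A minimal sketch, assuming a hypothetical formatting helper `device_string` (TensorFlow itself just takes the string):

```python
def device_string(job, task, device_type, device_index):
    """Build a device name from job name, task index, device type and device index."""
    return "/job:{}/task:{}/{}:{}".format(job, task, device_type, device_index)

ps_cpu = device_string("ps", 0, "cpu", 0)
worker_gpu = device_string("worker", 0, "gpu", 1)
print(ps_cpu)
print(worker_gpu)
```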

#### Sharding Variables Across Multiple Parameter Servers

A common pattern is to store the model parameters on a set of parameter servers (the "ps" job) while the other tasks focus on computations (the "worker" job).

`tf.train.replica_device_setter()` distributes variables across all the "ps" tasks in a round-robin fashion.
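
Round-robin assignment simply cycles through the "ps" tasks in order. The sketch below illustrates the idea in plain Python. `round_robin_devices` and the variable names are hypothetical, and this is not the library's implementation.

```python
from itertools import cycle

def round_robin_devices(variable_names, ps_tasks):
    """Assign each variable to a "ps" task, cycling through the tasks in order."""
    ps_devices = cycle("/job:ps/task:{}".format(i) for i in range(ps_tasks))
    return {name: next(ps_devices) for name in variable_names}

# Hypothetical variable names, sharded over the two "ps" tasks of the cluster above
assignment = round_robin_devices(["w1", "b1", "w2", "b2", "w3"], ps_tasks=2)
for name, device in assignment.items():
    print(name, "->", device)
```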

#### Sharing State Across Sessions Using Resource Containers

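Variable state is managed by resource containers located on the cluster itself, not by the session. A variable created through one client session is therefore available to any other session on the same cluster.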
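
A toy model may make the idea concrete. The classes `ToyCluster` and `ToySession` below are hypothetical and unrelated to the TensorFlow API. They only show state living in named containers on the cluster rather than in the session object.

```python
class ToyCluster:
    """Hypothetical stand-in for a cluster: it owns the variable state."""
    def __init__(self):
        self.containers = {}  # container name -> {variable name: value}

class ToySession:
    """Hypothetical session: it holds no state, only a reference to the cluster."""
    def __init__(self, cluster, container="default"):
        self.cluster = cluster
        self.container = container

    def assign(self, name, value):
        self.cluster.containers.setdefault(self.container, {})[name] = value

    def read(self, name):
        return self.cluster.containers[self.container][name]

cluster = ToyCluster()
ToySession(cluster).assign("x", 3.0)   # first client writes x
print(ToySession(cluster).read("x"))   # a second client sees the same x

# A differently named container keeps its own, isolated copy of x
isolated = ToySession(cluster, container="my_problem_1")
isolated.assign("x", 7.0)
print(isolated.read("x"))
```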

#### Asynchronous Communications Using TensorFlow Queues


In [0]:
# Client 1: create the shared queue and push a training instance onto it
q = tf.FIFOQueue(capacity=10, dtypes=[tf.float32], shapes=[[2]], name="q", shared_name="shared_q")

training_instance = tf.placeholder(tf.float32, shape=[2])
enqueue = q.enqueue([training_instance])

with tf.Session("grpc://machine-a.example.com:2222") as sess:
  sess.run(enqueue, feed_dict={training_instance: [1., 2.]})
  # ...

# Client 2 (reusing q here; a separate client would recreate the queue
# with the same shared_name): pull the instance off the queue
dequeue = q.dequeue()

with tf.Session("grpc://machine-a.example.com:2222") as sess:
  print(sess.run(dequeue))  # [1., 2.]
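
The producer/consumer pattern behind this can be tried locally with Python's standard `queue` module, where two threads play the roles of the two clients. This is an analogy only. The names are illustrative and nothing here touches a TensorFlow cluster.

```python
import queue
import threading

q_local = queue.Queue(maxsize=10)  # plays the role of the shared FIFO queue
received = []

def producer():
    for i in range(3):
        q_local.put([float(i), float(i) + 1.0])  # blocks while the queue is full
    q_local.put(None)  # sentinel: no more instances

def consumer():
    while True:
        item = q_local.get()  # blocks while the queue is empty
        if item is None:
            break
        received.append(item)

threads = [threading.Thread(target=producer), threading.Thread(target=consumer)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(received)
```

The producer and the consumer never call each other directly; the queue decouples them, which is what makes the communication asynchronous.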

#### Loading Data Directly From the Graph



## Parallelizing Neural Networks on a TensorFlow Cluster
----
First we look at how to parallelize several neural nets by placing each one on a different device. Then we look at training a single network across multiple devices and servers.

#### One Neural Net per Device

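The simplest approach is to reuse the code written for a single device on a single machine and specify the master server's address when creating the session.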


#### Model Parallelism
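Model parallelism means splitting a single neural network across several devices. It can be tricky, and how well it works depends on the architecture of the network.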