# 1 Performance Guide
---
- [General best practices](#General-best-practices)
- [Optimizing for GPU](#Optimizing-for-GPU)
- [Optimizing for CPU](#Optimizing-for-CPU)

## General best practices
---
- [Input pipeline optimization](#Input-pipeline-optimization)
- [Data formats](#Data-formats)
- [Common fused Ops](#Common-fused-Ops)
- [RNN Performance](#RNN-Performance)
- [Building and installing from source](#Building-and-installing-from-source)

### Input pipeline optimization
---
A typical image input pipeline runs through the following steps: load image > decode image into tensor > crop and pad > flip and distort > batch

1. Preprocessing on the CPU. Data preprocessing is usually placed on the CPU, which leaves the GPU free to focus on training.

```python
with tf.device('/cpu:0'):
    distorted_inputs = load_and_distort_image()
```
If you use `tf.estimator.Estimator`, the input function is automatically placed on the CPU.

2. Using the tf.data API. The `tf.data` API replaces `queue_runner` as the recommended way to build input pipelines.

For large inputs, feeding data directly through `feed_dict` is not recommended. If you do use it, feed the data in batches:
```python
sess.run(train_op, feed_dict={x: batch_xs, y: batch_ys})
```

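3. Fused decode and crop. When processing JPEG images, prefer **`tf.image.decode_and_crop_jpeg`**: it crops first and then decodes, so only the part of the image inside the crop window is decoded, which can noticeably speed up processing.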

```python
import tensorflow as tf

def _image_preprocess_fn(image_buffer):
    # image_buffer: string tensor holding the raw JPEG image buffer

    # extract the image shape from the raw JPEG buffer without decoding it
    image_shape = tf.image.extract_jpeg_shape(image_buffer)

    # get a crop window from a randomly distorted bounding box
    sample_distorted_bounding_box = tf.image.sample_distorted_bounding_box(image_shape, ...)
    bbox_begin, bbox_size, distort_bbox = sample_distorted_bounding_box

    # decode and crop in a single fused op
    offset_y, offset_x, _ = tf.unstack(bbox_begin)
    target_height, target_width, _ = tf.unstack(bbox_size)

    crop_window = tf.stack([offset_y, offset_x, target_height, target_width])
    cropped_image = tf.image.decode_and_crop_jpeg(image_buffer, crop_window)
    return cropped_image
```
4. Use large files instead of large numbers of small files.
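
The reasoning behind item 4 is that every small file costs a separate open and seek, while a few large files can be streamed sequentially. In TensorFlow this is usually done by converting the data to TFRecord files. The sketch below only illustrates the idea with the standard library: it assumes a simple length-prefixed layout (not the actual TFRecord format), and the helper names `pack_records` and `read_records` are hypothetical.

```python
import io
import struct

def pack_records(records, stream):
    """Write each bytes record as a 4-byte little-endian length plus the payload."""
    for payload in records:
        stream.write(struct.pack("<I", len(payload)))
        stream.write(payload)

def read_records(stream):
    """Yield the payloads back in order, reading one length prefix at a time."""
    while True:
        header = stream.read(4)
        if not header:
            return  # clean end of stream
        if len(header) < 4:
            raise ValueError("truncated length prefix")
        (length,) = struct.unpack("<I", header)
        payload = stream.read(length)
        if len(payload) < length:
            raise ValueError("truncated record")
        yield payload

# Stand-ins for the contents of many small image files.
small_files = [b"image-0", b"", b"image-2-bytes"]
buffer = io.BytesIO()  # a real pipeline would open one large file on disk instead
pack_records(small_files, buffer)
buffer.seek(0)
restored = list(read_records(buffer))
```

After packing, the reader touches a single stream and recovers every record in its original order, which is the access pattern that large input files are meant to enable.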


### Data formats
---
TensorFlow accepts two layouts for 4-D image tensors: NCHW and NHWC.
- NCHW or `channels_first`: mainly used on NVIDIA GPUs with cuDNN.
- NHWC or `channels_last`: TensorFlow's default format, and usually the faster one on the CPU.
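
To make the difference between the two layouts concrete, here is a small pure-Python sketch that computes where the same element lives in a flat buffer under each layout and reorders an NHWC buffer into NCHW. The function names are hypothetical and chosen only for illustration; in TensorFlow the same reordering is normally done with `tf.transpose(x, [0, 3, 1, 2])`.

```python
def nhwc_offset(n, h, w, c, shape):
    """Flat offset of element (n, h, w, c) when memory is laid out as NHWC."""
    _, H, W, C = shape
    return ((n * H + h) * W + w) * C + c

def nchw_offset(n, h, w, c, shape):
    """Flat offset of the same element when memory is laid out as NCHW."""
    _, H, W, C = shape
    return ((n * C + c) * H + h) * W + w

def nhwc_to_nchw(flat, shape):
    """Reorder a flat NHWC buffer into a flat NCHW buffer."""
    N, H, W, C = shape
    out = [None] * len(flat)
    for n in range(N):
        for h in range(H):
            for w in range(W):
                for c in range(C):
                    src = nhwc_offset(n, h, w, c, shape)
                    dst = nchw_offset(n, h, w, c, shape)
                    out[dst] = flat[src]
    return out

shape = (1, 2, 2, 3)    # one 2x2 image with 3 channels, given as (N, H, W, C)
nhwc = list(range(12))  # pixel-major: all channels of pixel 0, then pixel 1, ...
nchw = nhwc_to_nchw(nhwc, shape)  # channel-major: all pixels of channel 0, then channel 1, ...
```

In NHWC the channels of one pixel are adjacent in memory, while in NCHW each channel forms its own contiguous plane, which is why different hardware and kernels tend to prefer one or the other.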

### Common fused Ops
---
```python
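# tf.layers.batch_normalization takes the channel axis rather than a data_format argument
bn = tf.layers.batch_normalization(input_layer, fused=True, axis=1)  # axis=1 is the channel axis in NCHW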
```
### RNN Performance
---
- `tf.nn.static_rnn`
- `tf.nn.dynamic_rnn`: well suited to training on long sequences.

### Building and installing from source
---
## Optimizing for GPU
---
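To train in parallel, the model is replicated; each replica is called a tower. One tower is placed on each GPU, each tower processes a different batch of data, and the results are then used to update the variables.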
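
The tower scheme can be sketched without any GPU at all. The example below is a minimal pure-Python illustration under stated assumptions: a one-parameter model `y = w * x`, a squared-error loss, and synchronous averaging of the per-tower gradients. The function names are hypothetical, and the loop over shards stands in for work that would run concurrently, one tower per GPU.

```python
def split_batch(examples, num_towers):
    """Give each tower its own contiguous slice of the global batch.

    Assumes len(examples) is divisible by num_towers.
    """
    per_tower = len(examples) // num_towers
    return [examples[i * per_tower:(i + 1) * per_tower] for i in range(num_towers)]

def tower_gradient(w, shard):
    """Gradient of the mean squared error of y = w * x over one tower's shard."""
    return sum(2.0 * (w * x - y) * x for x, y in shard) / len(shard)

def train_step(w, examples, num_towers, learning_rate):
    """One synchronous step: per-tower gradients, averaged, applied to the shared variable."""
    shards = split_batch(examples, num_towers)
    grads = [tower_gradient(w, shard) for shard in shards]  # one gradient per tower
    mean_grad = sum(grads) / len(grads)
    return w - learning_rate * mean_grad

# Toy data generated from y = 3x; two towers see two examples each.
examples = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0), (4.0, 12.0)]
w = 0.0
for _ in range(50):
    w = train_step(w, examples, num_towers=2, learning_rate=0.01)
```

With equally sized shards, averaging the tower gradients gives the same update as computing the gradient over the whole batch on one device, so the towers only change where the work happens, not the result.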
## Optimizing for CPU
---

# 2 Building High-Performance Model
---
- Build the model with both NHWC and NCHW

NHWC runs faster on the CPU; NCHW runs faster on the GPU.

- Use Fused Batch-Normalization

```python
bn = tf.contrib.layers.batch_norm(input_layer, fused=True, data_format='NCHW', scope=scope)
```
- Variable Distribution and Gradient Aggregation

- Parameter Server Variables
- Replicated Variables

# 3 dataset_performance
---
```python
# set num_parallel_calls to the number of CPU cores
dataset = dataset.map(map_func, num_parallel_calls=num_parallel_calls)

dataset = dataset.batch(batch_size)
# use prefetch
dataset = dataset.prefetch(buffer_size=FLAGS.prefetch_buffer_size)
```
Alternatively, fuse the map and batch steps with `dataset.apply`:
```python
dataset = dataset.apply(tf.contrib.data.map_and_batch(map_func, batch_size))
```
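
To show what the map, batch, and prefetch stages are doing conceptually, here is a stdlib-only sketch of the same three steps. It is not how `tf.data` is implemented; the helper names `parallel_map`, `batch`, and `prefetch` are hypothetical, and threads stand in for the parallelism that TensorFlow provides internally.

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor

def parallel_map(map_func, items, num_parallel_calls):
    """Apply map_func with a pool of worker threads; results keep the input order."""
    with ThreadPoolExecutor(max_workers=num_parallel_calls) as pool:
        yield from pool.map(map_func, items)

def batch(items, batch_size):
    """Group consecutive items into lists of at most batch_size elements."""
    current = []
    for item in items:
        current.append(item)
        if len(current) == batch_size:
            yield current
            current = []
    if current:
        yield current  # final partial batch

def prefetch(items, buffer_size):
    """Produce items on a background thread, up to buffer_size ahead of the consumer."""
    sentinel = object()
    buffer = queue.Queue(maxsize=buffer_size)

    def producer():
        try:
            for item in items:
                buffer.put(item)  # blocks while the buffer is full
        finally:
            buffer.put(sentinel)  # always signal the end of the stream

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = buffer.get()
        if item is sentinel:
            return
        yield item

# Same order of stages as above: map, then batch, then prefetch.
mapped = parallel_map(lambda x: x * x, range(7), num_parallel_calls=2)
result = list(prefetch(batch(mapped, batch_size=3), buffer_size=1))
```

The point of the prefetch stage is that the producer keeps preparing the next batch while the consumer, for example a training step, is still busy with the current one.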