## Data Input (2): Image Data

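The previous example covered reading numpy data, which already handles many use cases. Two cases may still call for a different way of packaging data: reading images, and reading variable-length sequences. Examples for images are easy to find, but I have not yet found a good solution for variable-length sequences.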

Main references:
- [TensorFlow全新的数据读取方式：Dataset API入门教程](https://zhuanlan.zhihu.com/p/30751039)
- https://github.com/yongyehuang/TensorFlow-Examples/blob/master/examples/5_DataManagement/build_an_image_dataset.py

In [1]:
import warnings
warnings.filterwarnings('ignore')  # suppress warnings

import tensorflow as tf

# Let GPU memory grow on demand
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
sess = tf.Session(config=config)

import numpy as np
import sys
import os
import time

```
The data/sketchy_000000000000/ directory contains 125 subdirectories, one per class, and each subdirectory holds multiple images.
airplane/
    ***1.png
    ***2.png
    ...
ant/ 
    ***1.png
    ***2.png
    ...
...
   
```

- step1: Collect the file path and the label of every image.
- step2: Parse the images with dataset.map.

In [2]:
def get_file_path(data_path='../data/sketchy_000000000000/'):
    """Walk the data directory and collect the path and label of every file."""
    img_paths = list()
    labels = list()
    class_dirs = sorted(os.listdir(data_path))
    dict_class2id = dict()
    for i in range(len(class_dirs)):
        label = i
        class_dir = class_dirs[i]
        dict_class2id[class_dir] = label
        class_path = os.path.join(data_path, class_dir)  # path of this class
        file_names = sorted(os.listdir(class_path))
        for file_name in file_names:
            file_path = os.path.join(class_path, file_name)
            img_paths.append(file_path)
            labels.append(label)
    return img_paths, labels

img_paths, labels = get_file_path()
print(len(img_paths))
print(len(labels))
img0 = img_paths[0]
print(img0)


75481
75481
../data/sketchy_000000000000/airplane/n02691156_10151-1.png
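To check how labels are assigned without the real dataset, the same directory walk can be tried on a small throwaway tree. This is a minimal sketch: `list_images_with_labels`, the class names, and the file names are made up for illustration, and unlike the cell above it skips stray files at the top level and returns the class-to-id mapping.

```python
import os
import tempfile


def list_images_with_labels(data_path):
    """Walk data_path/<class>/<file> and return (paths, labels, class-to-id map)."""
    img_paths, labels, class2id = [], [], {}
    # Keep only directories; sorting makes the label ids reproducible.
    class_dirs = sorted(d for d in os.listdir(data_path)
                        if os.path.isdir(os.path.join(data_path, d)))
    for label, class_dir in enumerate(class_dirs):
        class2id[class_dir] = label
        class_path = os.path.join(data_path, class_dir)
        for file_name in sorted(os.listdir(class_path)):
            img_paths.append(os.path.join(class_path, file_name))
            labels.append(label)
    return img_paths, labels, class2id


# Build a tiny hypothetical dataset: two classes, three empty files.
with tempfile.TemporaryDirectory() as root:
    for class_dir, names in [("ant", ["a1.png"]), ("airplane", ["p2.png", "p1.png"])]:
        os.makedirs(os.path.join(root, class_dir))
        for name in names:
            open(os.path.join(root, class_dir, name), "w").close()
    img_paths, labels, class2id = list_images_with_labels(root)

file_names = [os.path.basename(p) for p in img_paths]
print(class2id)
print(list(zip(file_names, labels)))
```

Because the labels follow the sorted directory order, the returned list is grouped by class, which is why shuffling matters in the next step.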


Build the dataset and read the data

In [None]:
def parse_png(img_path, label, height=256, width=256, channel=3):
    """Read the image at img_path and preprocess it."""
    # Read the raw file from disk
    img = tf.read_file(img_path)
    img_decoded = tf.image.decode_png(img, channels=channel)
    # resize
    img_resized = tf.image.resize_images(img_decoded, [height, width])
    # normalize 
    img_norm = img_resized * 1.0 / 127.5 - 1.0
    return img_norm, label

dataset = tf.data.Dataset.from_tensor_slices((img_paths, labels))
dataset = dataset.map(parse_png)
print('parsing image', dataset)
dataset = dataset.shuffle(buffer_size=5000).repeat().batch(256)
print('batch', dataset)

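# Create the iterator
iterator = dataset.make_one_shot_iterator()
print(iterator)
# Build the get_next op once, outside the loop; calling iterator.get_next()
# inside the loop would add a new op to the graph on every iteration.
next_batch = iterator.get_next()

time0 = time.time()
for count in range(1000):
    X_batch, y_batch = sess.run(next_batch)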
#     print('count = {} : X.shape = {}, y[:10] = {}, pass {}s'.format(count, X_batch.shape, y_batch[:10], time.time() - time0))
#     time0 = time.time()
print(time.time() - time0)

### batch_size=4, no shuffle

```
parsing image <MapDataset shapes: ((256, 256, 3), ()), types: (tf.float32, tf.int32)>
batch <BatchDataset shapes: ((?, 256, 256, 3), (?,)), types: (tf.float32, tf.int32)>
<tensorflow.python.data.ops.iterator_ops.Iterator object at 0x7f0bd00e05c0>
count = 0 : X.shape = (4, 256, 256, 3), y[:10] = [0 0 0 0], pass 0.4704289436340332s
count = 1 : X.shape = (4, 256, 256, 3), y[:10] = [0 0 0 0], pass 0.38211870193481445s
count = 2 : X.shape = (4, 256, 256, 3), y[:10] = [0 0 0 0], pass 0.4049665927886963s
count = 3 : X.shape = (4, 256, 256, 3), y[:10] = [0 0 0 0], pass 0.38713955879211426s
count = 4 : X.shape = (4, 256, 256, 3), y[:10] = [0 0 0 0], pass 0.3735225200653076s
count = 5 : X.shape = (4, 256, 256, 3), y[:10] = [0 0 0 0], pass 0.37211036682128906s
count = 6 : X.shape = (4, 256, 256, 3), y[:10] = [0 0 0 0], pass 0.36931586265563965s
count = 7 : X.shape = (4, 256, 256, 3), y[:10] = [0 0 0 0], pass 0.36876487731933594s
count = 8 : X.shape = (4, 256, 256, 3), y[:10] = [0 0 0 0], pass 0.3712954521179199s
count = 9 : X.shape = (4, 256, 256, 3), y[:10] = [0 0 0 0], pass 0.38353633880615234s
```


### batch_size=4, buffer_size=1000

```
parsing image <MapDataset shapes: ((256, 256, 3), ()), types: (tf.float32, tf.int32)>
batch <BatchDataset shapes: ((?, 256, 256, 3), (?,)), types: (tf.float32, tf.int32)>
<tensorflow.python.data.ops.iterator_ops.Iterator object at 0x7f81c9faba90>
count = 0 : X.shape = (4, 256, 256, 3), y = [0 0 0 0], pass 2.277726411819458s
count = 1 : X.shape = (4, 256, 256, 3), y = [0 0 1 1], pass 0.3655529022216797s
count = 2 : X.shape = (4, 256, 256, 3), y = [0 1 0 0], pass 0.3675987720489502s
count = 3 : X.shape = (4, 256, 256, 3), y = [1 0 0 0], pass 0.3594975471496582s
count = 4 : X.shape = (4, 256, 256, 3), y = [0 0 0 0], pass 0.35462403297424316s
count = 5 : X.shape = (4, 256, 256, 3), y = [0 0 0 1], pass 0.3697807788848877s
count = 6 : X.shape = (4, 256, 256, 3), y = [0 0 0 0], pass 0.37018513679504395s
count = 7 : X.shape = (4, 256, 256, 3), y = [1 0 0 0], pass 0.3580958843231201s
count = 8 : X.shape = (4, 256, 256, 3), y = [0 0 1 1], pass 0.35622262954711914s
count = 9 : X.shape = (4, 256, 256, 3), y = [0 1 0 0], pass 0.35665392875671387s
```
### batch_size=4, buffer_size=5000

```
parsing image <MapDataset shapes: ((256, 256, 3), ()), types: (tf.float32, tf.int32)>
batch <BatchDataset shapes: ((?, 256, 256, 3), (?,)), types: (tf.float32, tf.int32)>
<tensorflow.python.data.ops.iterator_ops.Iterator object at 0x7f8de8a718d0>
count = 0 : X.shape = (4, 256, 256, 3), y = [3 6 4 4], pass 10.296564102172852s
count = 1 : X.shape = (4, 256, 256, 3), y = [5 0 4 3], pass 0.43836045265197754s
count = 2 : X.shape = (4, 256, 256, 3), y = [4 7 4 2], pass 0.4133310317993164s
count = 3 : X.shape = (4, 256, 256, 3), y = [6 8 0 4], pass 0.37926673889160156s
count = 4 : X.shape = (4, 256, 256, 3), y = [6 7 0 1], pass 0.41953492164611816s
count = 5 : X.shape = (4, 256, 256, 3), y = [3 4 6 6], pass 0.39876246452331543s
count = 6 : X.shape = (4, 256, 256, 3), y = [1 5 2 7], pass 0.39066076278686523s
count = 7 : X.shape = (4, 256, 256, 3), y = [3 8 7 0], pass 0.4210829734802246s
count = 8 : X.shape = (4, 256, 256, 3), y = [0 2 0 2], pass 0.39181971549987793s
count = 9 : X.shape = (4, 256, 256, 3), y = [6 3 5 2], pass 0.3952939510345459s
```

### batch_size=4, buffer_size=10000

```
parsing image <MapDataset shapes: ((256, 256, 3), ()), types: (tf.float32, tf.int32)>
batch <BatchDataset shapes: ((?, 256, 256, 3), (?,)), types: (tf.float32, tf.int32)>
<tensorflow.python.data.ops.iterator_ops.Iterator object at 0x7fe7971adb70>
count = 0 : X.shape = (4, 256, 256, 3), y[:10] = [7 6 7 5], pass 20.642507076263428s
count = 1 : X.shape = (4, 256, 256, 3), y[:10] = [ 5  1 13  5], pass 0.42054080963134766s
count = 2 : X.shape = (4, 256, 256, 3), y[:10] = [0 8 3 9], pass 0.41022515296936035s
count = 3 : X.shape = (4, 256, 256, 3), y[:10] = [5 3 0 4], pass 0.38398194313049316s
count = 4 : X.shape = (4, 256, 256, 3), y[:10] = [ 3 13 10  2], pass 0.39125919342041016s
count = 5 : X.shape = (4, 256, 256, 3), y[:10] = [ 9 14 11  3], pass 0.3871769905090332s
count = 6 : X.shape = (4, 256, 256, 3), y[:10] = [ 0  3 12  1], pass 0.39249467849731445s
count = 7 : X.shape = (4, 256, 256, 3), y[:10] = [10 11  4  7], pass 0.39815545082092285s
count = 8 : X.shape = (4, 256, 256, 3), y[:10] = [6 3 3 4], pass 0.3916475772857666s
count = 9 : X.shape = (4, 256, 256, 3), y[:10] = [ 7 13 13  5], pass 0.3907802104949951s
```


-----
### batch_size=256, no shuffle

```
parsing image <MapDataset shapes: ((256, 256, 3), ()), types: (tf.float32, tf.int32)>
batch <BatchDataset shapes: ((?, 256, 256, 3), (?,)), types: (tf.float32, tf.int32)>
<tensorflow.python.data.ops.iterator_ops.Iterator object at 0x7fa731d6da58>
count = 0 : X.shape = (256, 256, 256, 3), y[:10] = [0 1 0 0 0 0 0 1 0 0], pass 2.809150218963623s
count = 1 : X.shape = (256, 256, 256, 3), y[:10] = [0 0 0 1 0 1 0 1 1 0], pass 0.9180774688720703s
count = 2 : X.shape = (256, 256, 256, 3), y[:10] = [1 1 2 1 1 1 0 0 0 2], pass 0.9346837997436523s
count = 3 : X.shape = (256, 256, 256, 3), y[:10] = [0 1 0 1 0 2 1 1 2 1], pass 0.9159815311431885s
count = 4 : X.shape = (256, 256, 256, 3), y[:10] = [1 1 0 0 3 0 3 0 3 0], pass 0.9176328182220459s
count = 5 : X.shape = (256, 256, 256, 3), y[:10] = [1 2 1 3 2 1 3 2 3 3], pass 0.9095683097839355s
count = 6 : X.shape = (256, 256, 256, 3), y[:10] = [3 3 1 1 3 1 1 3 1 3], pass 0.9027786254882812s
count = 7 : X.shape = (256, 256, 256, 3), y[:10] = [4 4 2 2 2 3 3 3 0 0], pass 0.8944077491760254s
count = 8 : X.shape = (256, 256, 256, 3), y[:10] = [3 4 3 4 2 4 3 4 4 3], pass 0.9207658767700195s
count = 9 : X.shape = (256, 256, 256, 3), y[:10] = [0 2 5 1 4 5 4 0 5 5], pass 0.9587688446044922s
```

### batch_size=256, buffer_size=1000

```
parsing image <MapDataset shapes: ((256, 256, 3), ()), types: (tf.float32, tf.int32)>
batch <BatchDataset shapes: ((?, 256, 256, 3), (?,)), types: (tf.float32, tf.int32)>
<tensorflow.python.data.ops.iterator_ops.Iterator object at 0x7f6461269b38>
count = 0 : X.shape = (256, 256, 256, 3), y[:10] = [0 1 1 0 0 0 1 1 1 0], pass 2.8092191219329834s
count = 1 : X.shape = (256, 256, 256, 3), y[:10] = [0 0 1 0 1 0 1 1 0 0], pass 0.9225723743438721s
count = 2 : X.shape = (256, 256, 256, 3), y[:10] = [0 0 0 0 0 0 1 1 2 2], pass 0.929398775100708s
count = 3 : X.shape = (256, 256, 256, 3), y[:10] = [2 2 0 1 0 2 2 2 1 2], pass 0.9135322570800781s
count = 4 : X.shape = (256, 256, 256, 3), y[:10] = [1 2 3 1 2 2 0 1 1 2], pass 0.9033732414245605s
count = 5 : X.shape = (256, 256, 256, 3), y[:10] = [0 3 0 3 3 3 3 3 2 2], pass 0.8910338878631592s
count = 6 : X.shape = (256, 256, 256, 3), y[:10] = [1 0 3 3 3 3 3 2 0 1], pass 0.9181504249572754s
count = 7 : X.shape = (256, 256, 256, 3), y[:10] = [3 4 3 3 3 4 2 1 1 3], pass 0.8899593353271484s
count = 8 : X.shape = (256, 256, 256, 3), y[:10] = [2 4 4 4 3 1 3 4 4 4], pass 0.8846793174743652s
count = 9 : X.shape = (256, 256, 256, 3), y[:10] = [4 3 0 4 4 4 4 2 5 4], pass 0.8913383483886719s
```

### batch_size=256, buffer_size=5000
```
batch <BatchDataset shapes: ((?, 256, 256, 3), (?,)), types: (tf.float32, tf.int32)>
<tensorflow.python.data.ops.iterator_ops.Iterator object at 0x7f5d6ba2cb38>
count = 0 : X.shape = (256, 256, 256, 3), y[:10] = [0 7 4 0 7 6 1 8 5 3], pass 11.391936540603638s
count = 1 : X.shape = (256, 256, 256, 3), y[:10] = [6 7 5 1 6 2 7 4 1 6], pass 1.4594790935516357s
count = 2 : X.shape = (256, 256, 256, 3), y[:10] = [3 0 2 4 6 6 7 2 5 4], pass 1.4608898162841797s
count = 3 : X.shape = (256, 256, 256, 3), y[:10] = [7 9 1 1 5 6 8 3 5 1], pass 1.2903695106506348s
count = 4 : X.shape = (256, 256, 256, 3), y[:10] = [5 4 0 6 8 9 8 3 3 8], pass 1.3749797344207764s
count = 5 : X.shape = (256, 256, 256, 3), y[:10] = [2 2 3 4 8 6 1 8 4 7], pass 1.3508849143981934s
count = 6 : X.shape = (256, 256, 256, 3), y[:10] = [8 7 7 9 0 1 0 6 2 6], pass 1.3386600017547607s
count = 7 : X.shape = (256, 256, 256, 3), y[:10] = [10  7  0  4  8  6  8  2  1  1], pass 1.38677978515625s
count = 8 : X.shape = (256, 256, 256, 3), y[:10] = [ 9  0  9  2  6  1  7  7  3 10], pass 1.3200161457061768s
count = 9 : X.shape = (256, 256, 256, 3), y[:10] = [ 4  7  3  8  8  0 10  1  9  3], pass 1.3426642417907715s
```

### batch_size=256, buffer_size=10000
```
parsing image <MapDataset shapes: ((256, 256, 3), ()), types: (tf.float32, tf.int32)>
batch <BatchDataset shapes: ((?, 256, 256, 3), (?,)), types: (tf.float32, tf.int32)>
<tensorflow.python.data.ops.iterator_ops.Iterator object at 0x7fecb8eafb38>
count = 0 : X.shape = (256, 256, 256, 3), y[:10] = [ 1 11 12  2  3  3  2 10 13  8], pass 26.30490732192993s
count = 1 : X.shape = (256, 256, 256, 3), y[:10] = [ 0  1 13  9 16  6 16 14  0 12], pass 1.4967591762542725s
count = 2 : X.shape = (256, 256, 256, 3), y[:10] = [12  8 10  9 14  3  3  4 15  4], pass 1.6755847930908203s
count = 3 : X.shape = (256, 256, 256, 3), y[:10] = [ 9 16  1 14 17  0 11 15  5  0], pass 1.2678115367889404s
count = 4 : X.shape = (256, 256, 256, 3), y[:10] = [10 15 16  1  0  6 11 10 16  4], pass 1.2361955642700195s
count = 5 : X.shape = (256, 256, 256, 3), y[:10] = [10 13 10 15 12 13  7  0  1 14], pass 1.4305830001831055s
count = 6 : X.shape = (256, 256, 256, 3), y[:10] = [10  7  1 14  3 13 14 12  4  0], pass 1.262021541595459s
count = 7 : X.shape = (256, 256, 256, 3), y[:10] = [16 15 17 19  5 14  8  6  1 16], pass 1.4019050598144531s
count = 8 : X.shape = (256, 256, 256, 3), y[:10] = [ 5  9  9 18 12 14 14  0  3 17], pass 1.3533422946929932s
count = 9 : X.shape = (256, 256, 256, 3), y[:10] = [ 0  8 19 14  2 10 17  9  7  2], pass 1.429227352142334s
```

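Below is a rough summary of how fast batches are fetched for different batch_size and buffer_size values (buffer_size 0 means no shuffle). The results were measured on a mechanical hard disk. This machine has no SSD, so I have not tested on one yet, but it should be several times faster.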

|batch_size|buffer_size|startup time (s)|time per batch (s)|
|:----:|:---:|:---:|:---:|
|4|0|0|0.37|
|4|1000|2|0.37|
|4|5000|10|0.39|
|4|10000|20|0.39|
|256|0|2|0.9|
|256|1000|2|0.9|
|256|5000|10|1.35|
|256|10000|20|1.35 (high variance)|
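The startup time in the table grows roughly in proportion to buffer_size, and in the logs above only labels 0 and 1 appear with buffer_size=1000, while labels up to 8 appear with buffer_size=5000. Both effects follow from how a shuffle buffer works: nothing is emitted until the buffer is full, and only elements already read into the buffer can be emitted. The pure-Python simulation below is a sketch of that idea, not TensorFlow's implementation. The `shuffle_buffer` helper and the figure of 600 images per class are assumptions made for illustration.

```python
import itertools
import random


def shuffle_buffer(items, buffer_size, seed=0):
    """Emit items the way a fixed-size shuffle buffer would (buffer_size >= 1)."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            # Nothing is emitted while the buffer fills up: this is the startup cost.
            buffer.append(item)
            continue
        # Emit a random buffered element and put the new one in its slot.
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = item
    # Input exhausted: drain what is left in random order.
    rng.shuffle(buffer)
    for item in buffer:
        yield item


# Hypothetical label list sorted by class: 125 classes with 600 images each.
per_class = 600
sorted_labels = [c for c in range(125) for _ in range(per_class)]

first_k = 40
results = {}
for buffer_size in (1000, 5000, 10000):
    head = list(itertools.islice(shuffle_buffer(sorted_labels, buffer_size), first_k))
    # Only the first buffer_size + first_k inputs can have been read so far.
    bound = (buffer_size + first_k - 1) // per_class
    results[buffer_size] = (max(head), bound)
    print("buffer_size={:>5}: max label in first {} samples = {} (upper bound {})".format(
        buffer_size, first_k, max(head), bound))
```

With data sorted by class, a buffer much smaller than the dataset mixes only neighbouring classes, so shuffling the list of file paths once before building the dataset is a cheap complement to a small buffer.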

Looking at these numbers, my first reaction was: **Wow, why is this so slow! The I/O probably takes longer than training the network itself.**
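The table also allows a rough estimate of where the time goes. Assuming the time per batch is a fixed per-call overhead plus a constant cost per image (a two-point fit, so only a back-of-the-envelope model), the two no-shuffle rows give both numbers:

```python
# Steady-state seconds per batch without shuffling, taken from the table above.
t_small, n_small = 0.37, 4
t_large, n_large = 0.90, 256

# Solve t = fixed + per_image * n from the two measurements.
per_image = (t_large - t_small) / (n_large - n_small)
fixed = t_small - per_image * n_small

print("per-image cost : {:.2f} ms".format(per_image * 1e3))
print("fixed overhead : {:.2f} s per batch".format(fixed))

# Filling the shuffle buffer means loading buffer_size images up front.
for buffer_size in (1000, 5000, 10000):
    print("buffer_size={:>5}: predicted fill time {:.1f} s".format(
        buffer_size, buffer_size * per_image))
```

Under this model each image costs about 2.1 ms and each batch carries about 0.36 s of overhead regardless of its size. The predicted fill times (about 2, 10, and 21 s) are close to the measured startup times, which suggests the startup delay is mostly the time needed to load buffer_size images.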

Next, the data is read using queues instead, which is much faster.

In [1]:
"""Build an input pipeline for image (png) data with TF queues.
The queue shuffles the data.

refer: https://github.com/aymericdamien/TensorFlow-Examples/blob/master/examples/5_DataManagement/build_an_image_dataset.py
"""

from __future__ import print_function
from __future__ import division
from __future__ import absolute_import

import warnings

warnings.filterwarnings('ignore')  # suppress warnings
import tensorflow as tf

# Let GPU memory grow on demand
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
sess = tf.Session(config=config)

import numpy as np
import sys
import os
import time


def get_file_path(data_path='../data/sketchy_000000000000/'):
    """Walk the data directory and collect the path and label of every file."""
    img_paths = list()
    labels = list()
    class_dirs = sorted(os.listdir(data_path))
    dict_class2id = dict()
    for i in range(len(class_dirs)):
        label = i
        class_dir = class_dirs[i]
        dict_class2id[class_dir] = label
        class_path = os.path.join(data_path, class_dir)  # path of this class
        file_names = sorted(os.listdir(class_path))
        for file_name in file_names:
            file_path = os.path.join(class_path, file_name)
            img_paths.append(file_path)
            labels.append(label)
    return img_paths, labels


def get_batch(img_paths, labels, batch_size=128, height=256, width=256, channel=3):
    """Read the images in img_paths, preprocess them, and return batch tensors."""
    # Convert the Python lists to arrays, then to tensors
    img_paths = np.asarray(img_paths)
    labels = np.asarray(labels)

    img_paths = tf.convert_to_tensor(img_paths, dtype=tf.string)
    labels = tf.convert_to_tensor(labels, dtype=tf.int32)
    # Build a TF Queue, shuffle data
    image, label = tf.train.slice_input_producer([img_paths, labels], shuffle=True)
    # Read images from disk
    image = tf.read_file(image)
    image = tf.image.decode_png(image, channels=channel)  # the files are png
    # Resize images to a common size
    image = tf.image.resize_images(image, [height, width])
    # Normalize
    image = image * 1.0 / 127.5 - 1.0
    # Create batches
    X_batch, y_batch = tf.train.batch([image, label], batch_size=batch_size,
                                      capacity=batch_size * 8,
                                      num_threads=4)
    return X_batch, y_batch


img_paths, labels = get_file_path()
X_batch, y_batch = get_batch(img_paths, labels)

sess.run(tf.global_variables_initializer())
tf.train.start_queue_runners(sess=sess)

time0 = time.time()
for count in range(100):   # 11s for 100batch
    _X_batch, _y_batch = sess.run([X_batch, y_batch])
    sys.stdout.write("\rloop {}, pass {:.2f}s".format(count, time.time() - time0))
    sys.stdout.flush()


loop 99, pass 11.43s
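One likely reason the queue version is faster is that several threads load images in the background while the main loop consumes finished batches. The standard-library sketch below is only an analogy for `num_threads` and `capacity` in `tf.train.batch`, not how TensorFlow implements it. `load_image`, `producer`, and `run` are hypothetical names, and the 1 ms sleep stands in for disk latency.

```python
import queue
import threading
import time


def load_image(index):
    """Stand-in for read + decode + resize; the sleep simulates disk latency."""
    time.sleep(0.001)
    return index


def producer(work, out):
    """Take indices from the work queue until it is empty and push loaded items."""
    while True:
        try:
            index = work.get_nowait()
        except queue.Empty:
            return
        out.put(load_image(index))  # blocks while the bounded queue is full


def run(num_items=256, batch_size=32, num_threads=4, capacity=64):
    work = queue.Queue()
    for i in range(num_items):
        work.put(i)
    out = queue.Queue(maxsize=capacity)  # bounded buffer of ready examples
    threads = [threading.Thread(target=producer, args=(work, out), daemon=True)
               for _ in range(num_threads)]
    for t in threads:
        t.start()
    # The consumer assembles batches from whatever the producers have finished.
    batches = [[out.get() for _ in range(batch_size)]
               for _ in range(num_items // batch_size)]
    for t in threads:
        t.join()
    return batches


for num_threads in (1, 4):
    start = time.time()
    batches = run(num_threads=num_threads)
    print("{} thread(s): {} batches in {:.2f}s".format(
        num_threads, len(batches), time.time() - start))
```

The order of items inside a batch depends on thread scheduling, so this sketch only guarantees that every item is delivered exactly once.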

For reading png data I tried three approaches: the one-shot iterator, TF queues, and tfrecord. All three were run on the same mechanical hard disk, and tfrecord was clearly the fastest.

|iter_mode|buffer_size|time for 100 batches (s)|
|:----:|:---:|:---:|
|one-shot|2000|75|
|one-shot|5000|86|
|tf.queue|2000|11|
|tf.queue|5000|11|
|tfrecord (3 threads)|2000|5.3|
|tfrecord (3 threads)|5000|5.3|

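On an SSD, the TF queue approach should also be fairly fast. Packing the data into tfrecord format only cuts down on reading many small files; under the hood it uses queues as well.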
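The idea of replacing many small files with one sequential file can be illustrated without TensorFlow. The sketch below packs byte payloads into a single file with a length prefix per record and reads them back in one pass. It is not the TFRecord format (which also stores checksums and is normally used with serialized protocol buffers); `write_records` and `read_records` are hypothetical helpers.

```python
import os
import struct
import tempfile


def write_records(path, payloads):
    """Write each payload as an 8-byte little-endian length followed by the bytes."""
    with open(path, "wb") as f:
        for payload in payloads:
            f.write(struct.pack("<Q", len(payload)))
            f.write(payload)


def read_records(path):
    """Yield payloads in write order using one sequential pass over the file."""
    with open(path, "rb") as f:
        while True:
            header = f.read(8)
            if not header:
                return  # clean end of file
            (length,) = struct.unpack("<Q", header)
            yield f.read(length)


# Fake "image files" of different sizes, standing in for encoded png bytes.
payloads = ["fake-png-{}".format(i).encode() * (i + 1) for i in range(5)]

with tempfile.TemporaryDirectory() as tmp_dir:
    record_path = os.path.join(tmp_dir, "images.records")
    write_records(record_path, payloads)
    restored = list(read_records(record_path))

print(len(restored), "records restored, identical:", restored == payloads)
```

Reading such a file needs one open call and mostly sequential disk access, which matters most on a mechanical hard disk where seeking between thousands of small files is expensive.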

The table below compares, for the tfrecord approach with batch_size=128, the cost of resizing the images (256 -> 224):

|threads|resize|time for 500 batches (s)|
|:----:|:---:|:---:|
|2|no|35.58|
|4|no|21.43|
|6|no|16.35|
|2|yes|80|
|4|yes|44|
|6|yes|34|
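Converting these timings to images per second makes the scaling easier to read. The short calculation below uses only the numbers from the table and the stated batch_size of 128:

```python
batch_size = 128
num_batches = 500
images = batch_size * num_batches  # images read per measurement

# (threads, resize) -> seconds for 500 batches, copied from the table above.
timings = {
    (2, False): 35.58, (4, False): 21.43, (6, False): 16.35,
    (2, True): 80.0, (4, True): 44.0, (6, True): 34.0,
}

throughput = {key: images / seconds for key, seconds in timings.items()}

for resize in (False, True):
    for threads in (2, 4, 6):
        rate = throughput[(threads, resize)]
        speedup = rate / throughput[(2, resize)]  # relative to 2 threads
        print("threads={} resize={!s:<5} {:7.1f} images/s  ({:.2f}x vs 2 threads)".format(
            threads, resize, rate, speedup))
```

Going from 2 to 6 threads gives a bit more than a 2x speedup in both settings, so the scaling is sublinear, and resizing roughly halves the throughput at every thread count.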