- [Introduction to TensorFlow's high-level APIs (estimator, keras, data, experiment)](https://www.cnblogs.com/arkenstone/p/8448208.html)

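`tf.data` makes it easy to feed data from many kinds of sources into a network (with `tf.feature_column`, structured data such as data stored in `csv` files can be read and processed conveniently; see [this post](https://developers.googleblog.com/2017/11/introducing-tensorflow-feature-columns.html) for details).

`tf.estimator` provides `tf.estimator.Estimator.export_savedmodel`, which converts trained `ckpt` checkpoint files into a `pb` file that can be combined with Docker and TensorFlow Serving for flexible, stable model deployment and updates.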

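# 1. Data pipelines with `tf.data` (TFRecords)

In Keras, the [`keras.preprocessing.image.ImageDataGenerator` class](https://keras-cn.readthedocs.io/en/latest/preprocessing/image/) and its `.flow_from_directory()` method make it easy to read data stored in a folder; `.flow()` can instead read data directly from an `np.array` and feed it to the network for training (see the official documentation for details). This approach is simple and effective for image tasks where folder names serve as class names. However, the storage format TensorFlow officially recommends is TFRecords, and Keras does not officially support reading directly from TFRecords files (neither does tf.keras, and the Keras author does not particularly recommend doing so). This is where `tf.data` comes in for handling data from `TFRecords` (the older `tf.train.batch()` or `tf.train.shuffle_batch()` can also be used to process training data).

TensorFlow provides detailed official documentation on how `tf.data` works and how to use it. For `TFRecords` data, elements are mainly pulled out through a `tf.data.Iterator`. The following briefly shows how to extract data from `TFRecords`:

The following code comes from the official documentation.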

In [9]:
import tensorflow as tf


def dataset_input_fn(num_epochs=None):
    filenames = ["/var/data/file1.tfrecord", "/var/data/file2.tfrecord"]
    dataset = tf.data.TFRecordDataset(filenames)  # specify the TFRecords files to read from

    # Use `tf.parse_single_example()` to extract data from a `tf.Example`
    # protocol buffer, and perform any additional per-record preprocessing.
    def parser(record):   # parse one serialized record from the TFRecords file
        keys_to_features = {
            "image_data": tf.FixedLenFeature((), tf.string, default_value=""),
            "date_time": tf.FixedLenFeature((), tf.int64, default_value=0),
            "label": tf.FixedLenFeature((), tf.int64,
                                        default_value=tf.zeros([], dtype=tf.int64)),
        }
        parsed = tf.parse_single_example(record, keys_to_features)

        # Perform additional preprocessing on the parsed data.
        image = tf.image.decode_jpeg(parsed["image_data"])
        image = tf.reshape(image, [299, 299, 1])
        label = tf.cast(parsed["label"], tf.int32)

        return {"image_data": image, "date_time": parsed["date_time"]}, label

    # Use `Dataset.map()` to build a pair of a feature dictionary and a label
    # tensor for each example.
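    # map() is typically where input images are preprocessed; the preprocessing
    # can live inside the parser function above.
    dataset = dataset.map(parser)
    # During training the input order is usually shuffled to improve generalization.
    dataset = dataset.shuffle(buffer_size=10000)
    dataset = dataset.batch(32)  # batch size of each read
    dataset = dataset.repeat(num_epochs)   # number of passes over the dataset; None repeats indefinitely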
    iterator = dataset.make_one_shot_iterator()

    # `features` is a dictionary in which each value is a batch of values for
    # that feature; `labels` is a batch of labels.
    features, labels = iterator.get_next()
    return features, labels
    # return {"input": features}, labels  # for an estimator, the former is a dict and the latter a tensor

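Note: `dataset.make_one_shot_iterator()` creates the simplest kind of `Iterator`. It needs no explicit initialization, and it is currently the only iterator that is easy to use with an `estimator`. A `Dataset` that has to be reused (for example, training and test sets with the same structure) would normally call for a `reinitializable iterator`; because of that limitation in `estimator`, the usual practice today is to write two separate pipelines, one for the training set and one for the validation set, each using `make_one_shot_iterator`.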

# 2. Dataset + Keras
Once `tf.data` handles the `TFRecords` data stream, we can train with `keras`.

In [11]:
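import tensorflow as tf
import keras
from keras.applications.xception import Xception

# _NUM_CLASSES, _EPOCHS and focal_loss are used below but not defined in this cell
_PATCH_SIZE = 32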
def architecture(input_shape=(_PATCH_SIZE, _PATCH_SIZE, 3)):
    """
    Model architecture
    Args:
        input_shape: input image shape (excluding the batch dimension)

    Returns: a Keras model instance
    """
    base_model = Xception(include_top=True,
                          weights=None,   # no pre-trained weights used
                          pooling="max",
                          input_shape=input_shape,  # modify first layer
                          classes=_NUM_CLASSES)
    base_model.summary()
    return base_model


def train(source_dir, model_save_path):
    """
    Train patch based model
    Args:
        source_dir: a directory where training tfrecords file stored. All TF records start with train will be used!
        model_save_path: weights save path
    """
    if tf.gfile.Exists(source_dir):
        train_data_paths = tf.gfile.Glob(source_dir+"/train*tfrecord")
        val_data_paths = tf.gfile.Glob(source_dir+"/val*tfrecord")
        if not len(train_data_paths):
            raise Exception(
                "[Train Error]: unable to find train*.tfrecord file")
        if not len(val_data_paths):
            raise Exception("[Eval Error]: unable to find val*.tfrecord file")
    else:
        raise Exception("[Train Error]: unable to find input directory!")
    (images, labels) = dataset_input_fn(train_data_paths)
    # A Keras model expects a keras.Input, but a TensorFlow tensor can also be used directly
    model_input = keras.Input(tensor=images, shape=(
        _PATCH_SIZE, _PATCH_SIZE, 3), dtype=tf.float32, name="input")
    base_model = architecture()
    model_output = base_model(model_input)
    model = keras.models.Model(inputs=model_input, outputs=model_output)
    # rho is the RMSprop moving-average factor; in Keras, `decay` is learning-rate decay
    optimizer = keras.optimizers.RMSprop(lr=2e-3, rho=0.9)
    model.compile(optimizer=optimizer,
                  loss=focal_loss,
                  metrics=['accuracy'],
                  target_tensors=[labels])   # 1
    tensorboard = tf.keras.callbacks.TensorBoard(log_dir=model_save_path)
    model_checkpoint = tf.keras.callbacks.ModelCheckpoint(
        filepath=model_save_path+"/saved_model.h5")
    model.fit(steps_per_epoch=8000, epochs=_EPOCHS,
              callbacks=[tensorboard, model_checkpoint])


def evaluate(source_dir, weights_path):
    """
    Eval patch based model
    Args:
        source_dir: directory where val tf records file stored. All TF records start with val will be used!
        weights_path: model weights save path
    """
    # load model
    base_model = architecture()
    base_model.load_weights(weights_path)
    # load test dataset
    if tf.gfile.Exists(source_dir):
        val_data_paths = tf.gfile.Glob(source_dir+"/val*.tfrecord")
        if not len(val_data_paths):
            raise Exception("[Eval Error]: unable to find val*.tfrecord file")
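    else:
        raise Exception("[Eval Error]: unable to find input directory!")
    (images, labels) = dataset_input_fn(val_data_paths)
    probs = base_model(images)
    predictions = tf.argmax(probs, axis=-1)
    # compare predicted classes with the labels, then average the matches
    correct = tf.equal(predictions, tf.cast(labels, predictions.dtype))
    accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))
    # in graph mode the tensor must be run to get a number (this covers one batch)
    accuracy_score = keras.backend.get_session().run(accuracy)
    print("Accuracy of testing images: {}".format(accuracy_score))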

Reference: [Keras as a simplified interface to TensorFlow: tutorial](https://blog.keras.io/keras-as-a-simplified-interface-to-tensorflow-tutorial.html)

# 3. Dataset + estimator
`tf.layers` can be used in place of `keras` to build the network, and it offers a richer set of layers.

In [13]:
import tensorflow as tf
import keras

from keras.preprocessing.image import ImageDataGenerator

In [8]:
def xception(features, classes=2, is_training=True):
    """
    The Xception architecture written in tf.layers
    Args:
        features: input image tensor
        classes: number of classes to classify images into
        is_training: is training stage or not

    Returns:
        2-D logits prediction output after pooling and activation
    """
    x = tf.layers.conv2d(features, 32, (3, 3), strides=(
        2, 2), use_bias=False, name='block1_conv1')
    x = tf.layers.batch_normalization(
        x, training=is_training, name='block1_conv1_bn')
    x = tf.nn.relu(x, name='block1_conv1_act')
    x = tf.layers.conv2d(x, 64, (3, 3), use_bias=False, name='block1_conv2')
    x = tf.layers.batch_normalization(
        x, training=is_training, name='block1_conv2_bn')
    x = tf.nn.relu(x, name='block1_conv2_act')

    residual = tf.layers.conv2d(x, 128, (1, 1), strides=(
        2, 2), padding='same', use_bias=False)
    residual = tf.layers.batch_normalization(residual, training=is_training)

    x = tf.layers.separable_conv2d(
        x, 128, (3, 3), padding='same', use_bias=False, name='block2_sepconv1')
    x = tf.layers.batch_normalization(
        x, training=is_training, name='block2_sepconv1_bn')
    x = tf.nn.relu(x, name='block2_sepconv2_act')
    x = tf.layers.separable_conv2d(
        x, 128, (3, 3), padding='same', use_bias=False, name='block2_sepconv2')
    x = tf.layers.batch_normalization(
        x, training=is_training, name='block2_sepconv2_bn')

    x = tf.layers.max_pooling2d(x, (3, 3), strides=(
        2, 2), padding='same', name='block2_pool')
    x = tf.add(x, residual, name='block2_add')

    residual = tf.layers.conv2d(x, 256, (1, 1), strides=(
        2, 2), padding='same', use_bias=False)
    residual = tf.layers.batch_normalization(residual, training=is_training)

    x = tf.nn.relu(x, name='block3_sepconv1_act')
    x = tf.layers.separable_conv2d(
        x, 256, (3, 3), padding='same', use_bias=False, name='block3_sepconv1')
    x = tf.layers.batch_normalization(
        x, training=is_training, name='block3_sepconv1_bn')
    x = tf.nn.relu(x, name='block3_sepconv2_act')
    x = tf.layers.separable_conv2d(
        x, 256, (3, 3), padding='same', use_bias=False, name='block3_sepconv2')
    x = tf.layers.batch_normalization(
        x, training=is_training, name='block3_sepconv2_bn')

    x = tf.layers.max_pooling2d(x, (3, 3), strides=(
        2, 2), padding='same', name='block3_pool')
    x = tf.add(x, residual, name="block3_add")

    residual = tf.layers.conv2d(x, 728, (1, 1), strides=(
        2, 2), padding='same', use_bias=False)
    residual = tf.layers.batch_normalization(residual, training=is_training)

    x = tf.nn.relu(x, name='block4_sepconv1_act')
    x = tf.layers.separable_conv2d(
        x, 728, (3, 3), padding='same', use_bias=False, name='block4_sepconv1')
    x = tf.layers.batch_normalization(
        x, training=is_training, name='block4_sepconv1_bn')
    x = tf.nn.relu(x, name='block4_sepconv2_act')
    x = tf.layers.separable_conv2d(
        x, 728, (3, 3), padding='same', use_bias=False, name='block4_sepconv2')
    x = tf.layers.batch_normalization(
        x, training=is_training, name='block4_sepconv2_bn')

    x = tf.layers.max_pooling2d(x, (3, 3), strides=(
        2, 2), padding='same', name='block4_pool')
    x = tf.add(x, residual, name="block4_add")

    for i in range(8):
        residual = x
        prefix = 'block' + str(i + 5)

        x = tf.nn.relu(x, name=prefix + '_sepconv1_act')
        x = tf.layers.separable_conv2d(
            x, 728, (3, 3), padding='same', use_bias=False, name=prefix + '_sepconv1')
        x = tf.layers.batch_normalization(
            x, training=is_training, name=prefix + '_sepconv1_bn')
        x = tf.nn.relu(x, name=prefix + '_sepconv2_act')
        x = tf.layers.separable_conv2d(
            x, 728, (3, 3), padding='same', use_bias=False, name=prefix + '_sepconv2')
        x = tf.layers.batch_normalization(
            x, training=is_training, name=prefix + '_sepconv2_bn')
        x = tf.nn.relu(x, name=prefix + '_sepconv3_act')
        x = tf.layers.separable_conv2d(
            x, 728, (3, 3), padding='same', use_bias=False, name=prefix + '_sepconv3')
        x = tf.layers.batch_normalization(
            x, training=is_training, name=prefix + '_sepconv3_bn')

        x = tf.add(x, residual, name=prefix+"_add")

    residual = tf.layers.conv2d(x, 1024, (1, 1), strides=(
        2, 2), padding='same', use_bias=False)
    residual = tf.layers.batch_normalization(residual, training=is_training)

    x = tf.nn.relu(x, name='block13_sepconv1_act')
    x = tf.layers.separable_conv2d(
        x, 728, (3, 3), padding='same', use_bias=False, name='block13_sepconv1')
    x = tf.layers.batch_normalization(
        x, training=is_training, name='block13_sepconv1_bn')
    x = tf.nn.relu(x, name='block13_sepconv2_act')
    x = tf.layers.separable_conv2d(
        x, 1024, (3, 3), padding='same', use_bias=False, name='block13_sepconv2')
    x = tf.layers.batch_normalization(
        x, training=is_training, name='block13_sepconv2_bn')

    x = tf.layers.max_pooling2d(x, (3, 3), strides=(
        2, 2), padding='same', name='block13_pool')
    x = tf.add(x, residual, name="block13_add")

    x = tf.layers.separable_conv2d(
        x, 1536, (3, 3), padding='same', use_bias=False, name='block14_sepconv1')
    x = tf.layers.batch_normalization(
        x, training=is_training, name='block14_sepconv1_bn')
    x = tf.nn.relu(x, name='block14_sepconv1_act')

    x = tf.layers.separable_conv2d(
        x, 2048, (3, 3), padding='same', use_bias=False, name='block14_sepconv2')
    x = tf.layers.batch_normalization(
        x, training=is_training, name='block14_sepconv2_bn')
    x = tf.nn.relu(x, name='block14_sepconv2_act')
    # replace conv layer with fc
    x = tf.layers.average_pooling2d(
        x, (3, 3), (2, 2), name="global_average_pooling")
    x = tf.layers.conv2d(
        x, 2048, [1, 1], activation=None, name="block15_conv1")
    x = tf.layers.conv2d(x, classes, [1, 1],
                         activation=None, name="block15_conv2")
    x = tf.squeeze(x, axis=[1, 2], name="logits")
    return x


def model_fn(features, labels, mode, params):
    """
    Model_fn for estimator model
    Args:
        features (Tensor): Input features to the model.
        labels (Tensor): Labels tensor for training and evaluation.
        mode (ModeKeys): Specifies if training, evaluation or prediction.
        params (HParams): hyper-parameters for estimator model
    Returns:
        (EstimatorSpec): Model to be run by Estimator.
    """
    # check if training stage
    if mode == tf.estimator.ModeKeys.TRAIN:
        is_training = True
    else:
        is_training = False
    # is_training = False   # 1
    input_tensor = features["input"]
    logits = xception(input_tensor, classes=_NUM_CLASSES,
                      is_training=is_training)
    probs = tf.nn.softmax(logits, name="output_score")
    predictions = tf.argmax(probs, axis=-1, name="output_label")
    onehot_labels = tf.one_hot(tf.cast(labels, tf.int32), _NUM_CLASSES)
    # provide a tf.estimator spec for PREDICT
    predictions_dict = {"score": probs,
                        "label": predictions}
    if mode == tf.estimator.ModeKeys.PREDICT:
        predictions_output = tf.estimator.export.PredictOutput(
            predictions_dict)
        return tf.estimator.EstimatorSpec(mode=mode,
                                          predictions=predictions_dict,
                                          export_outputs={
                                              tf.saved_model.signature_constants.DEFAULT_SERVING_SIGNATURE_DEF_KEY: predictions_output
                                          })
    # calculate loss
    # loss = focal_loss(onehot_labels, logits, gamma=1.5)
    gamma = 1.5
    weights = tf.reduce_sum(tf.multiply(
        onehot_labels, tf.pow(1. - probs, gamma)), axis=-1)
    loss = tf.losses.softmax_cross_entropy(
        onehot_labels, logits, weights=weights)
    accuracy = tf.metrics.accuracy(labels=labels,
                                   predictions=predictions)
    if mode == tf.estimator.ModeKeys.TRAIN:
        lr = params.learning_rate
        # train optimizer
        optimizer = tf.train.RMSPropOptimizer(learning_rate=lr, decay=0.9)
        # batch-norm moving averages are updated through the UPDATE_OPS collection
        update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
        with tf.control_dependencies(update_ops):
            train_op = optimizer.minimize(
                loss, global_step=tf.train.get_global_step())
        tensors_to_log = {'batch_accuracy': accuracy[1],
                          'logits': logits,
                          'label': labels}
        logging_hook = tf.train.LoggingTensorHook(
            tensors=tensors_to_log, every_n_iter=1000)
        return tf.estimator.EstimatorSpec(mode=mode,
                                          loss=loss,
                                          train_op=train_op,
                                          training_hooks=[logging_hook])
    else:
        eval_metric_ops = {"accuracy": accuracy}
        return tf.estimator.EstimatorSpec(mode=mode,
                                          loss=loss,
                                          eval_metric_ops=eval_metric_ops)


def get_estimator_model(config=None, params=None):
    """
    Get estimator model by definition of model_fn
    """
    est_model = tf.estimator.Estimator(model_fn=model_fn,
                                       config=config,
                                       params=params)
    return est_model


def train(source_dir, model_save_path):
    """
    Train patch based model
    Args:
        source_dir: a directory where training tfrecords file stored. All TF records start with train will be used!
        model_save_path: weights save path
    """
    if tf.gfile.Exists(source_dir):
        train_data_paths = tf.gfile.Glob(source_dir+"/train*tfrecord")
        val_data_paths = tf.gfile.Glob(source_dir+"/val*tfrecord")
        if not len(train_data_paths):
            raise Exception(
                "[Train Error]: unable to find train*.tfrecord file")
        if not len(val_data_paths):
            raise Exception("[Eval Error]: unable to find val*.tfrecord file")
    else:
        raise Exception("[Train Error]: unable to find input directory!")
    train_config = tf.estimator.RunConfig()
    new_config = train_config.replace(model_dir=model_save_path,
                                      save_checkpoints_steps=1000,
                                      keep_checkpoint_max=5)
    params = tf.contrib.training.HParams(
        learning_rate=0.001,
        train_steps=5000,
        min_eval_frequency=1000
    )
    est_model = get_estimator_model(config=new_config,
                                    params=params)
    # define training config
    train_spec = tf.estimator.TrainSpec(input_fn=lambda: input_fn(data_path=train_data_paths,
                                                                  batch_size=_BATCH_SIZE,
                                                                  is_training=True),
                                        max_steps=_MAX_STEPS)

    eval_spec = tf.estimator.EvalSpec(input_fn=lambda: input_fn(data_path=val_data_paths,
                                                                batch_size=_BATCH_SIZE),
                                      steps=100,
                                      throttle_secs=900)
    # train and evaluate model
    tf.estimator.train_and_evaluate(estimator=est_model,
                                    train_spec=train_spec,
                                    eval_spec=eval_spec)


def evaluate(source_dir, model_save_path):
    """
    Eval patch based model
    Args:
        source_dir: directory where val tf records file stored. All TF records start with val will be used!
        model_save_path: model save path
    """
    # load model
    run_config = tf.estimator.RunConfig(model_dir=model_save_path)
    est_model = get_estimator_model(run_config)
    # load test dataset
    if tf.gfile.Exists(source_dir):
        val_data_paths = tf.gfile.Glob(source_dir+"/val*.tfrecord")
        if not len(val_data_paths):
            raise Exception("[Eval Error]: unable to find val*.tfrecord file")
    else:
        raise Exception("[Eval Error]: unable to find input directory!")
    accuracy_score = est_model.evaluate(input_fn=lambda: input_fn(val_data_paths,
                                                                  batch_size=_BATCH_SIZE,
                                                                  is_training=False))
    print("Accuracy of testing images: {}".format(accuracy_score))
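
The loss in `model_fn` weights each example's softmax cross-entropy by `(1 - p_true) ** gamma`, where `p_true` is the predicted probability of the true class (this is what the `tf.reduce_sum` over the one-hot labels selects). The plain-Python sketch below works through that arithmetic for a single example to show how confident predictions are down-weighted. The helper names `softmax` and `focal_weighted_loss` are hypothetical, and the sketch leaves out the batch reduction that `tf.losses.softmax_cross_entropy` applies.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    shift = max(logits)
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def focal_weighted_loss(logits, label, gamma=1.5):
    """Return (weight, weighted cross-entropy) for one example."""
    p_true = softmax(logits)[label]
    weight = (1.0 - p_true) ** gamma      # small when the model is confident
    cross_entropy = -math.log(p_true)
    return weight, weight * cross_entropy


# A confident, correct prediction versus an undecided one (two classes).
easy_weight, easy_loss = focal_weighted_loss([4.0, 0.0], label=0)
hard_weight, hard_loss = focal_weighted_loss([0.0, 0.0], label=0)
print(easy_weight, easy_loss)
print(hard_weight, hard_loss)
```

The undecided example keeps a much larger weight than the confident one, so it contributes more to the loss.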

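# 4. [TensorFlow's new way of reading data: an introductory tutorial to the Dataset API](https://www.leiphone.com/news/201711/zV7yM5W1dFrzs8W5.html)

## `Dataset` and `Iterator`

A `Dataset` can be thought of as an ordered list of "elements" of the same type. In practice a single element can be a vector, a string, an image, or even a `tuple` or a `dict`.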

In [16]:
import tensorflow as tf
import numpy as np

a = [3, 4, 5]
b = 'fg'
c = np.array([3, 4])

In [19]:
a1 = tf.data.Dataset.from_tensor_slices(np.array([1.0, 2.0, 3.0, 4.0, 5.0]))

In [20]:
a2 = tf.data.Dataset.from_tensor_slices(a)

How do we get the elements out of this `dataset`? Instantiate an `Iterator` from the `Dataset`, then iterate over it:
- In non-`Eager` mode, the elements of the `dataset` above are read as follows:

In [25]:
iterator = a1.make_one_shot_iterator()
one_element = iterator.get_next()
with tf.Session() as sess:
    for i in range(5):
        print(sess.run(one_element))

1.0
2.0
3.0
4.0
5.0


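`make_one_shot_iterator()` instantiates an `Iterator` from the `dataset`. It is a "one shot iterator", meaning it can only be read once from start to finish. Reading past the end raises `tf.errors.OutOfRangeError`, which the next cell uses to detect the end of the data.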

In [26]:
dataset = tf.data.Dataset.from_tensor_slices(np.array([1.0, 2.0, 3.0, 4.0, 5.0]))
iterator = dataset.make_one_shot_iterator()
one_element = iterator.get_next()
with tf.Session() as sess:
    try:
        while True:
            print(sess.run(one_element))
    except tf.errors.OutOfRangeError:
        print("end!")

1.0
2.0
3.0
4.0
5.0
end!
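
As a rough analogy only, a plain Python generator behaves like a one shot iterator: it can be consumed once, and reading past the end raises an exception. The sketch below is not TensorFlow code, and `one_shot` is a hypothetical name.

```python
def one_shot(values):
    """Yield each value once; the generator is exhausted afterwards."""
    for value in values:
        yield value


iterator = one_shot([1.0, 2.0, 3.0, 4.0, 5.0])
first_pass = list(iterator)    # consumes every element
second_pass = list(iterator)   # nothing is left to read

# next() on an exhausted generator raises StopIteration, which plays the
# role that tf.errors.OutOfRangeError plays in the cell above.
try:
    next(iterator)
    exhausted = False
except StopIteration:
    exhausted = True

print(first_pass, second_pass, exhausted)
```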


In `Eager` mode the `Iterator` is created differently: `tfe.Iterator(dataset)` creates it directly, and it can be iterated right away. Values come out during iteration without any `sess.run()`:

In [27]:
import tensorflow.contrib.eager as tfe
tfe.enable_eager_execution()

dataset = tf.data.Dataset.from_tensor_slices(
    np.array([1.0, 2.0, 3.0, 4.0, 5.0]))

for one_element in tfe.Iterator(dataset):
    print(one_element)

AttributeError: module 'pandas.core.computation' has no attribute 'expressions'

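## Creating more complex Datasets from memory

In fact `tf.data.Dataset.from_tensor_slices` does more than this: what it really does is slice the first dimension of the `Tensor` passed in and build the corresponding `dataset`.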

In [28]:
dataset = tf.data.Dataset.from_tensor_slices(np.random.uniform(size=(5, 2)))

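The value passed in is a matrix of shape `(5, 2)`. `tf.data.Dataset.from_tensor_slices` slices it along the first dimension, so the resulting `dataset` contains `5` elements, each of shape `(2,)`; that is, each element is one row of the matrix.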

In [30]:
iterator = dataset.make_one_shot_iterator()
one_element = iterator.get_next()
with tf.Session() as sess:
    try:
        while True:
            print(sess.run(one_element))
    except tf.errors.OutOfRangeError:
        print("end!")

[0.39768109 0.71553004]
[0.98190965 0.03232115]
[0.8452184  0.17540232]
[0.12115773 0.51115376]
[0.43379943 0.87275321]
end!


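In practice we may want each element of a `Dataset` to have a richer structure, such as a Python tuple or a Python dict. In an image recognition problem, for instance, an element could take the form `{"image": image_tensor, "label": label_tensor}`, which is more convenient to work with. `tf.data.Dataset.from_tensor_slices` supports building such a `dataset` too. For example, each element can be a dict: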

In [32]:
dataset = tf.data.Dataset.from_tensor_slices({
    "a":
    np.array([1.0, 2.0, 3.0, 4.0, 5.0]),
    "b":
    np.random.uniform(size=(5, 2))
})

In [33]:
iterator = dataset.make_one_shot_iterator()
one_element = iterator.get_next()
with tf.Session() as sess:
    try:
        while True:
            print(sess.run(one_element))
    except tf.errors.OutOfRangeError:
        print("end!")

{'a': 1.0, 'b': array([0.11961808, 0.48213857])}
{'a': 2.0, 'b': array([0.65146467, 0.96469436])}
{'a': 3.0, 'b': array([0.63114947, 0.61944099])}
{'a': 4.0, 'b': array([0.04219048, 0.02537955])}
{'a': 5.0, 'b': array([0.44816809, 0.60973009])}
end!


`tf.data.Dataset.from_tensor_slices` can also build a `dataset` whose elements are `tuple`s:

In [42]:
dataset = tf.data.Dataset.from_tensor_slices(
    (np.array([1.0, 2.0, 3.0, 4.0, 5.0]), np.random.uniform(size=(5, 2))))

iterator = dataset.make_one_shot_iterator()
one_element = iterator.get_next()

with tf.Session() as sess:
    try:
        while True:
            print(sess.run(one_element))
    except tf.errors.OutOfRangeError:
        print("end!")

(1.0, array([0.65244424, 0.29549035]))
(2.0, array([0.00764112, 0.7995072 ]))
(3.0, array([0.67443196, 0.62777513]))
(4.0, array([0.56733892, 0.91038492]))
(5.0, array([0.37998656, 0.13630998]))
end!
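
The three cases above (array, dict, tuple) share one rule: every component is sliced along its first dimension, and the slices with the same index are put back together in the original structure. The following plain-Python sketch models that rule with nested lists standing in for arrays. `slice_first_dim` is a hypothetical helper, not a TensorFlow API, and unlike TensorFlow it does not check that all components have the same first dimension.

```python
def slice_first_dim(data):
    """Yield one element per index along the first dimension of `data`."""
    if isinstance(data, dict):
        keys = list(data)
        # Slice every value, then re-assemble one dict per index.
        for values in zip(*(slice_first_dim(data[key]) for key in keys)):
            yield dict(zip(keys, values))
    elif isinstance(data, tuple):
        # Slice every component, then re-assemble one tuple per index.
        for values in zip(*(slice_first_dim(part) for part in data)):
            yield values
    else:
        # A (nested) list stands in for an array: each row is one element.
        for row in data:
            yield row


matrix = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]   # shape (3, 2)
vector = [1.0, 2.0, 3.0]                        # shape (3,)

rows = list(slice_first_dim(matrix))
dicts = list(slice_first_dim({"a": vector, "b": matrix}))
pairs = list(slice_first_dim((vector, matrix)))

print(rows[0])    # one row of the matrix
print(dicts[0])   # one dict with a scalar and a row
print(pairs[0])   # one tuple with a scalar and a row
```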


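## Transforming the elements of a `Dataset`: `Transformation`
`Dataset` supports a special kind of operation called a `Transformation`. A `Transformation` turns one `Dataset` into a new `Dataset`. Transformations are typically used to transform the data, shuffle it, group it into `batch`es, and produce `epoch`s.

Commonly used `Transformation`s: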
- `map`
- `batch`
- `shuffle`
- `repeat`

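### (1) `map`
`map` takes a function. Every element of the `Dataset` is passed to that function, and the return values form the new `Dataset`. For example, we can add `1` to every element of a `dataset`: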

In [35]:
dataset = tf.data.Dataset.from_tensor_slices(
    np.array([1.0, 2.0, 3.0, 4.0, 5.0]))
dataset = dataset.map(lambda x: x + 1)     # 2.0, 3.0, 4.0, 5.0, 6.0

### (2) `batch`
`batch` groups multiple elements into a `batch`. The program below groups the elements of `dataset` into batches of size `32`:

In [49]:
dataset = tf.data.Dataset.from_tensor_slices(np.random.sample((1024, 28, 28)))
dataset = dataset.batch(32)

In [50]:
iterator = dataset.make_one_shot_iterator()
one_element = iterator.get_next()
print(one_element.shape)
with tf.Session() as sess:
    print(sess.run(one_element).shape)

(?, 28, 28)
(32, 28, 28)
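
The static shape above shows `?` for the batch dimension because the last batch may hold fewer than `32` elements when the dataset size is not a multiple of the batch size. The plain-Python sketch below models that grouping; the `batch` function here is a hypothetical stand-in, not the TensorFlow implementation, and its `drop_remainder` flag mirrors the option of the same name that newer versions of `Dataset.batch` accept.

```python
def batch(elements, batch_size, drop_remainder=False):
    """Group consecutive elements into lists of at most `batch_size`."""
    current = []
    for element in elements:
        current.append(element)
        if len(current) == batch_size:
            yield current
            current = []
    # Whatever is left forms a final, smaller batch unless it is dropped.
    if current and not drop_remainder:
        yield current


sizes = [len(b) for b in batch(range(70), 32)]
sizes_dropped = [len(b) for b in batch(range(70), 32, drop_remainder=True)]
print(sizes)
print(sizes_dropped)
```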


### (3) `shuffle`
`shuffle` randomizes the order of the elements in a `dataset`. Its `buffer_size` parameter sets the size of the buffer used for shuffling:

In [51]:
dataset = tf.data.Dataset.from_tensor_slices(np.random.sample((60000, 28, 28)))
dataset = dataset.shuffle(buffer_size=10000)
dataset = dataset.batch(32)

In [52]:
iterator = dataset.make_one_shot_iterator()
one_element = iterator.get_next()
print(one_element.shape)
with tf.Session() as sess:
    print(sess.run(one_element).shape)

(?, 28, 28)
(32, 28, 28)
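
To see what `buffer_size` means, here is a plain-Python model of buffer-based shuffling: the buffer is filled first, then each output is drawn at random from the buffer and its slot is refilled with the next input. This is a sketch of the idea under that assumption, not TensorFlow's actual code; `buffered_shuffle` is a hypothetical name and it expects `buffer_size >= 1`.

```python
import random


def buffered_shuffle(elements, buffer_size, seed=None):
    """Shuffle a stream using only `buffer_size` elements of memory."""
    rng = random.Random(seed)
    buffer = []
    for element in elements:
        if len(buffer) < buffer_size:
            buffer.append(element)        # still filling the buffer
            continue
        index = rng.randrange(buffer_size)
        yield buffer[index]               # emit a random buffered element
        buffer[index] = element           # refill the slot with the new one
    rng.shuffle(buffer)                   # input exhausted: drain the rest
    yield from buffer


shuffled = list(buffered_shuffle(range(10), buffer_size=3, seed=0))
unshuffled = list(buffered_shuffle(range(10), buffer_size=1, seed=0))
print(shuffled)
print(unshuffled)
```

With `buffer_size=1` the order is unchanged, and with a small buffer an element can only move a limited distance toward the front. This is why a buffer at least as large as the dataset is needed for a uniform shuffle.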


### (4) `repeat`
`repeat` repeats the whole sequence a number of times and is mainly used to handle `epoch`s in machine learning. If the original data is one `epoch`, `repeat(5)` turns it into `5` epochs:

In [58]:
dataset = tf.data.Dataset.from_tensor_slices(np.arange(3))
dataset = dataset.repeat(5)

Calling `repeat()` with no argument makes the sequence repeat forever, so it never ends and never raises `tf.errors.OutOfRangeError`:
```py
dataset = dataset.repeat()
```
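
The difference between a finite and an endless `repeat` can be modeled in plain Python. In this sketch `repeat` is a hypothetical generator, not the TensorFlow method, and `make_epoch` is an assumed callable that returns a fresh pass over the data each time it is called.

```python
import itertools


def repeat(make_epoch, count=None):
    """Yield the elements of `count` passes; loop forever if count is None."""
    epochs = itertools.count() if count is None else range(count)
    for _ in epochs:
        yield from make_epoch()


def one_epoch():
    return iter([0, 1, 2])


finite = list(repeat(one_epoch, 5))
# An endless stream can only be sampled, never fully listed.
endless_head = list(itertools.islice(repeat(one_epoch), 7))
print(finite)
print(endless_head)
```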

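### Example: reading images and their labels from disk

At this point we can consider a simple but very common example: read images and their labels from disk, shuffle them, and group them into training samples with `batch_size=32`, repeating for `10` epochs during training.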

In [209]:
import os

import pandas as pd
import tensorflow as tf

root = 'E:/Data/datasets/flower_photos/'


def image_string(root):
    '''Collect the file names and labels of the images under the root directory'''
    file_names = []
    label_names = []
    for k in os.listdir(root):
        if os.path.isdir(root + k):
            # a sub-directory of root holds the files to process;
            # its name is used as the label
            for file_name in os.listdir(root + k):
                file_names.append(root + k + '/' + file_name)
                label_names.append(k)
    return file_names, label_names

file_names, label_names = image_string(root)

class Index(object):
    '''Process the labels'''

    def __init__(self, label_names):
        self.label_names = label_names

    def one_hot(self):
        '''Convert the labels to one-hot encoding'''
        df = pd.get_dummies(self.label_names)
        classes = df.columns
        return df.values, classes

    def to_categorical_index(self):
        '''Convert the labels to a CategoricalIndex to speed up computation'''
        labels = pd.CategoricalIndex(self.label_names)
        labels_to_array = labels.codes
        classes = labels.categories
        return labels_to_array, classes

In [210]:
def _parse_function(file_name):
    '''Read the image file that file_name points to and decode it'''
    image_string = tf.read_file(file_name)
    image_decoded = tf.image.decode_image(image_string)
    return image_decoded

In [214]:
index = Index(label_names)
labels, classes = index.to_categorical_index()

In [236]:
# At this point each element of dataset is a filename
dataset = tf.data.Dataset.from_tensor_slices(file_names)
# At this point each element of dataset is a decoded image
dataset = dataset.map(_parse_function)

iterator = dataset.make_one_shot_iterator()
one_element = iterator.get_next()
with tf.Session() as sess:
    img = sess.run(one_element)
    # resize to a uniform size
    image_resized = tf.image.resize_images(img, [28, 28])
# Within each epoch, shuffle the images, group them into batches of 32, and repeat 10 times.
#dataset = dataset.shuffle(buffer_size=1000).batch(32).repeat(10)

In [237]:
image_resized.shape

TensorShape([Dimension(28), Dimension(28), Dimension(3)])

In [222]:
with tf.Session() as sess:
    img = sess.run(image_decoded)
    # resize to a uniform size
    image_resized = tf.image.resize_images(img, [28, 28])

In [70]:
img[0].shape

(263, 320, 3)

Next, we can use these two Tensors to build a model.

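Besides `tf.data.Dataset.from_tensor_slices`, the Dataset API currently offers three other ways to create a `Dataset`:
- `tf.data.TextLineDataset()`: takes a list of files and returns a `dataset` in which each element corresponds to one line of a file. It can be used to read `CSV` files.
- `tf.data.FixedLengthRecordDataset()`: takes a list of files and a `record_bytes` value; each element of the `dataset` is then `record_bytes` bytes of file content. It is usually used for files stored in binary form, such as the `CIFAR10` dataset.
- `tf.data.TFRecordDataset()`: as the name suggests, this reads `TFRecord` files; each element of the `dataset` is a `TFExample`.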
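
To make the fixed-length case concrete, the plain-Python sketch below splits a byte string into records of `record_bytes` bytes each. It only illustrates the idea and is not how TensorFlow reads files. `fixed_length_records` is a hypothetical helper, and the toy layout (1 label byte followed by 4 data bytes) is an assumption chosen to resemble the label-then-pixels layout of the CIFAR-10 binary files.

```python
def fixed_length_records(raw, record_bytes):
    """Split `raw` into consecutive records of exactly `record_bytes` bytes."""
    if len(raw) % record_bytes != 0:
        raise ValueError("input length is not a multiple of record_bytes")
    return [raw[start:start + record_bytes]
            for start in range(0, len(raw), record_bytes)]


# Two toy records: label 7 with data 1..4, then label 9 with data 5..8.
raw = bytes([7, 1, 2, 3, 4, 9, 5, 6, 7, 8])
records = fixed_length_records(raw, record_bytes=5)

for record in records:
    label, data = record[0], list(record[1:])
    print(label, data)
```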

For details, see [Module: tf.data](https://www.tensorflow.org/api_docs/python/tf/data)

In non-Eager mode, the simplest way to create an Iterator is `dataset.make_one_shot_iterator()`, which creates a one shot iterator. Besides the one shot iterator, there are three more sophisticated Iterators:
- `initializable iterator`
- `reinitializable iterator`
- `feedable iterator`

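An `initializable iterator` must be initialized with `sess.run()` before use. With an `initializable iterator`, a `placeholder` can be fed into the Iterator, which makes it easy to define a new Iterator quickly through parameters. A simple example of using an initializable iterator: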

In [41]:
limit = tf.placeholder(dtype=tf.int32, shape=[])

dataset = tf.data.Dataset.from_tensor_slices(tf.range(start=0, limit=limit))

iterator = dataset.make_initializable_iterator()
next_element = iterator.get_next()

with tf.Session() as sess:
    sess.run(iterator.initializer, feed_dict={limit: 10})
    for i in range(10):
        value = sess.run(next_element)
        assert i == value

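Here `limit` acts as a "parameter" that sets the upper bound of the numbers in the Dataset.

An `initializable iterator` has another use: reading in large arrays.

With `tf.data.Dataset.from_tensor_slices(array)`, what actually happens is that `array` is saved into the computation graph as a `tf.constant`. When `array` is large, the graph becomes large too, which makes it inconvenient to transfer and save. In that case we can replace `array` with a `placeholder` and use an `initializable iterator`, passing `array` in only when needed. This avoids storing a large array in the graph. Example code (from the official examples):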

```py
# Load two NumPy arrays from disk
with np.load("/var/data/training_data.npy") as data:
    features = data["features"]
    labels = data["labels"]

features_placeholder = tf.placeholder(features.dtype, features.shape)
labels_placeholder = tf.placeholder(labels.dtype, labels.shape)

dataset = tf.data.Dataset.from_tensor_slices(
    (features_placeholder, labels_placeholder))
iterator = dataset.make_initializable_iterator()
sess.run(iterator.initializer, feed_dict={features_placeholder: features,
                                          labels_placeholder: labels})
```

A `reinitializable iterator` and a `feedable iterator` are more complex than an `initializable iterator` and used less often. To learn what they do, see the [official introduction](https://www.tensorflow.org/programmers_guide/datasets#creating_an_iterator); they are not covered further here.

`tf.data`
- [Official Guide](https://www.tensorflow.org/programmers_guide/datasets)
- [API documentation](https://www.tensorflow.org/api_docs/python/tf/data)
- [How to use Dataset together with Estimator](https://developers.googleblog.com/2017/09/introducing-tensorflow-datasets.html)

In [43]:
dataset1 = tf.data.Dataset.from_tensor_slices(tf.random_uniform([4, 10]))
print(dataset1.output_types)  # ==> "tf.float32"
print(dataset1.output_shapes)  # ==> "(10,)"

dataset2 = tf.data.Dataset.from_tensor_slices(
   (tf.random_uniform([4]),
    tf.random_uniform([4, 100], maxval=100, dtype=tf.int32)))
print(dataset2.output_types)  # ==> "(tf.float32, tf.int32)"
print(dataset2.output_shapes)  # ==> "((), (100,))"

dataset3 = tf.data.Dataset.zip((dataset1, dataset2))
print(dataset3.output_types)  # ==> (tf.float32, (tf.float32, tf.int32))
print(dataset3.output_shapes)  # ==> "(10, ((), (100,)))"

<dtype: 'float32'>
(10,)
(tf.float32, tf.int32)
(TensorShape([]), TensorShape([Dimension(100)]))
(tf.float32, (tf.float32, tf.int32))
(TensorShape([Dimension(10)]), (TensorShape([]), TensorShape([Dimension(100)])))


In [31]:
dataset = tf.data.Dataset.from_tensor_slices(
   {"a": tf.random_uniform([4]),
    "b": tf.random_uniform([4, 100], maxval=100, dtype=tf.int32)})
print(dataset.output_types)  # ==> "{'a': tf.float32, 'b': tf.int32}"
print(dataset.output_shapes)  # ==> "{'a': (), 'b': (100,)}"

{'a': tf.float32, 'b': tf.int32}
{'a': TensorShape([]), 'b': TensorShape([Dimension(100)])}


```py
dataset1 = dataset1.map(lambda x: ...)

dataset2 = dataset2.flat_map(lambda x, y: ...)
```
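
The function bodies are elided in the snippet above. As a rough plain-Python model of the difference between the two calls: `map` produces exactly one output per input, while `flat_map` expects the function to return a sequence (in TensorFlow, a `Dataset`) and concatenates those sequences. The helper names below are hypothetical, and the tuple unpacking mirrors the way a two-component element arrives as `lambda x, y`.

```python
def call_on(fn, element):
    """Call fn, unpacking tuple elements into separate arguments."""
    return fn(*element) if isinstance(element, tuple) else fn(element)


def map_elements(elements, fn):
    """One output element per input element."""
    for element in elements:
        yield call_on(fn, element)


def flat_map_elements(elements, fn):
    """fn returns a sequence per input; the sequences are concatenated."""
    for element in elements:
        yield from call_on(fn, element)


singles = [1, 2, 3]
pairs = [(1, "a"), (2, "b")]

mapped = list(map_elements(singles, lambda x: x * 10))
flattened = list(flat_map_elements(pairs, lambda x, y: [y] * x))
print(mapped)
print(flattened)
```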