# Building Input Functions with tf.estimator: Reading Notes

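[Original article on the TensorFlow website](https://www.tensorflow.org/get_started/input_fn)

The purpose of an input_fn is to pass features into an Estimator's train, evaluate, and predict methods. Feature engineering or pre-processing can also be done inside the input_fn.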

In [1]:
import tensorflow as tf


## Basic form of an input_fn
``` python
def my_input_fn():

    # Preprocess your data here...

    # ...then return 1) a mapping of feature columns to Tensors with
    # the corresponding feature data, and 2) a Tensor containing labels
    return feature_cols, labels
```

The return value is always the pair feature_cols and labels.

**feature_cols**: A dict containing key/value pairs that map feature column names to Tensors (or SparseTensors) containing the corresponding feature data.

**labels**: A Tensor containing your label (target) values: the values your model aims to predict.
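
To make the shape of that return value concrete, here is a minimal sketch in plain Python. The helper `split_features_and_labels` and the sample rows are made up for illustration; a real input_fn would return Tensors rather than Python lists.

```python
def split_features_and_labels(rows, feature_names, label_name):
    """Split row dicts into (feature_cols, labels), the shape an input_fn returns."""
    # One entry per feature name, each holding that feature's values for all rows.
    feature_cols = {name: [row[name] for row in rows] for name in feature_names}
    # A single sequence of target values, aligned with the rows.
    labels = [row[label_name] for row in rows]
    return feature_cols, labels


# Illustrative rows only, not taken from the real data set.
rows = [
    {"rm": 6.5, "age": 65.2, "medv": 24.0},
    {"rm": 6.4, "age": 78.9, "medv": 21.6},
]
feature_cols, labels = split_features_and_labels(rows, ["rm", "age"], "medv")
print(feature_cols)
print(labels)
```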

The official explanation of feature columns:

A **FeatureColumn** represents a single feature in your data. A **FeatureColumn** may represent a quantity like 'height', or it may represent a category like 'eye_color' where the value is drawn from a set of discrete possibilities like {'blue', 'brown', 'green'}.

In the case of both continuous features like 'height' and categorical features like 'eye_color', a single value in the data might get transformed into a sequence of numbers before it is input into the model. The FeatureColumn abstraction lets you manipulate the feature as a single semantic unit in spite of this fact. You can specify transformations and select features to include without dealing with specific indices in the tensors you feed into the model.

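My own understanding: sometimes a specific feature needs its own preprocessing. pandas lets you pull out a single column directly, and TensorFlow presumably supports the same idea, hence the name feature column.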
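
The phrase "a single value in the data might get transformed into a sequence of numbers" can be illustrated with one-hot encoding. The sketch below is plain Python, not the TensorFlow feature column API; the `one_hot` helper is hypothetical and only shows the idea for the 'eye_color' example from the quote.

```python
EYE_COLORS = ["blue", "brown", "green"]


def one_hot(value, vocabulary):
    """Turn one categorical value into a list with a single 1.0 entry."""
    if value not in vocabulary:
        raise ValueError("unknown category: {}".format(value))
    return [1.0 if value == item else 0.0 for item in vocabulary]


# One semantic value ('brown') becomes three numbers for the model.
encoded = one_hot("brown", EYE_COLORS)
print(encoded)
```

A feature column hides this kind of expansion, so the rest of the code can keep referring to the feature by name.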

## Handling numpy and pandas data

If the data is in pandas or numpy, an input_fn can be generated as follows:
```python
import numpy as np

my_input_fn = tf.estimator.inputs.numpy_input_fn(
    x={"x": np.array(x_data)},
    y=np.array(y_data),
    ...)

import pandas as pd

my_input_fn = tf.estimator.inputs.pandas_input_fn(
    x=pd.DataFrame({"x": x_data}),
    y=pd.Series(y_data),
    ...)
```

## Passing the input_fn to the model

Pass the function object directly as the input_fn argument.
```python
classifier.train(input_fn=my_input_fn, steps=2000)
```
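Because input_fn is passed as a function object, we cannot give it arguments at that point. Wrapping the call in a lambda lets us supply them, so the same input_fn can serve both train and evaluate without defining a separately named function for each.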
```python
classifier.train(input_fn=lambda: my_input_fn(training_set), steps=2000)
classifier.evaluate(input_fn=lambda: my_input_fn(test_set), steps=2000)
```
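
The argument-binding trick is ordinary Python and can be tried without TensorFlow. In this sketch, `call_like_estimator` is a hypothetical stand-in for an Estimator that calls its input_fn with no arguments, and the data sets are toy values. It also shows `functools.partial` as an alternative to the lambda.

```python
import functools


def my_input_fn(data_set):
    """Toy input_fn: split (x, y) pairs into a feature dict and labels."""
    feature_cols = {"x": [row[0] for row in data_set]}
    labels = [row[1] for row in data_set]
    return feature_cols, labels


def call_like_estimator(input_fn):
    """Hypothetical stand-in: invoke input_fn with no arguments."""
    return input_fn()


training_set = [(1.0, 2.0), (3.0, 6.0)]
test_set = [(5.0, 10.0)]

# A lambda closes over the data set and exposes a zero-argument callable.
train_result = call_like_estimator(lambda: my_input_fn(training_set))

# functools.partial binds the argument up front and does the same job.
test_result = call_like_estimator(functools.partial(my_input_fn, test_set))

print(train_result)
print(test_result)
```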
If more parameters are needed to control the function, the following approach builds the input_fn from a pandas or numpy data set.

```python
import pandas as pd

def get_input_fn_from_pandas(data_set, num_epochs=None, shuffle=True):
  return tf.estimator.inputs.pandas_input_fn(
      x=pd.DataFrame(...),
      y=pd.Series(...),
      num_epochs=num_epochs,
      shuffle=shuffle)

import numpy as np

def get_input_fn_from_numpy(data_set, num_epochs=None, shuffle=True):
  return tf.estimator.inputs.numpy_input_fn(
      x={...},
      y=np.array(...),
      num_epochs=num_epochs,
      shuffle=shuffle)
```

[tf.estimator.inputs.pandas_input_fn api](https://www.tensorflow.org/api_docs/python/tf/estimator/inputs/pandas_input_fn)

[tf.estimator.inputs.numpy_input_fn api](https://www.tensorflow.org/api_docs/python/tf/estimator/inputs/numpy_input_fn)

# Boston House Values DNN

Implementing an input_fn for a DNN.

[itertools documentation](https://docs.python.org/3/library/itertools.html)

In [2]:
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import itertools

import pandas as pd

tf.logging.set_verbosity(tf.logging.INFO)

Define the columns and use pandas to read the data from the CSV files.

skipinitialspace : boolean, default False
Skip spaces after delimiter.

Why would there be spaces after the delimiter? In my own tests, setting it to True or False did not seem to make much difference.
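
One likely answer: some CSV files are written as `a, b, c`, with a space after each comma. The sketch below uses the standard library `csv` module, which has an option with the same name, rather than pandas itself, and the sample text is made up. It shows that the difference is visible on string fields such as header names.

```python
import csv
import io

# Made-up sample with a space after every comma.
raw = "crim, zn, medv\n0.1, 18.0, 24.0\n"

# Default behaviour: the space stays attached to the field.
plain = list(csv.reader(io.StringIO(raw)))

# skipinitialspace=True drops spaces that directly follow the delimiter.
stripped = list(csv.reader(io.StringIO(raw), skipinitialspace=True))

print(plain[0])
print(stripped[0])
```

As far as I can tell, numeric fields parse to the same numbers in pandas with or without the leading space. Because the notebook skips the header row and passes its own `names`, that would explain why True and False look the same here.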

In [3]:
COLUMNS = ["crim", "zn", "indus", "nox", "rm", "age", "dis", "tax", "ptratio", "medv"]
FEATURES = ["crim", "zn", "indus", "nox", "rm", "age", "dis", "tax", "ptratio"]
LABEL = "medv"

training_set = pd.read_csv("./boston/boston_train.csv",
                           skipinitialspace=True,
                           skiprows=1,
                           names=COLUMNS)
test_set = pd.read_csv("./boston/boston_test.csv",
                      skipinitialspace=True,
                      skiprows=1,
                      names=COLUMNS)
prediction_set = pd.read_csv("./boston/boston_predict.csv",
                            skipinitialspace=True,
                            skiprows=1,
                            names=COLUMNS)

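### Building the FeatureColumns

We need to create a list of numeric feature columns, one `tf.feature_column.numeric_column` per feature.

Build get_input_fn so that it returns an input function object; this lets us pass in some parameters.

**num_epochs**: set to None for training, so that the steps argument of train controls the number of iterations. For evaluation set it to 1, so that the input_fn makes just one pass over the data.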
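
How num_epochs and steps interact can be simulated in plain Python. This is only a sketch of the idea, not how pandas_input_fn works internally; the `batches` and `run` helpers are hypothetical, and shuffling is left out.

```python
import itertools


def batches(data, batch_size, num_epochs=None):
    """Yield batches of data; repeat forever when num_epochs is None."""
    epochs = itertools.count() if num_epochs is None else range(num_epochs)
    for _ in epochs:
        for start in range(0, len(data), batch_size):
            yield data[start:start + batch_size]


def run(batch_iter, steps=None):
    """Consume at most `steps` batches (all of them if steps is None)."""
    return sum(1 for _ in itertools.islice(batch_iter, steps))


data = list(range(10))

# Training: the input never runs out, so `steps` decides when to stop.
train_steps = run(batches(data, batch_size=4, num_epochs=None), steps=7)

# Evaluation: one pass over the data, so the input itself ends the loop.
eval_steps = run(batches(data, batch_size=4, num_epochs=1))

print(train_steps, eval_steps)
```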

In [8]:
feature_cols = [tf.feature_column.numeric_column(k) for k in FEATURES]
regressor = tf.estimator.DNNRegressor(feature_columns=feature_cols,
                                     hidden_units=[10,10],
                                     model_dir="./model/boston_model")

# Building the input_fn

def get_input_fn(data_set, num_epochs=None, shuffle=True):
    return tf.estimator.inputs.pandas_input_fn(
        x=pd.DataFrame({k: data_set[k].values for k in FEATURES}),
        y=pd.Series(data_set[LABEL].values),
        num_epochs=num_epochs,
        shuffle=shuffle)

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_model_dir': './model/boston_model', '_session_config': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_tf_random_seed': 1, '_save_summary_steps': 100, '_log_step_count_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600}


Train for 5000 steps, passing in training_set.

In [9]:
regressor.train(input_fn=get_input_fn(training_set), steps=5000)

INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Saving checkpoints for 1 into ./model/boston_model/model.ckpt.
INFO:tensorflow:loss = 64316.5, step = 1
INFO:tensorflow:global_step/sec: 271.077
INFO:tensorflow:loss = 10938.7, step = 101 (0.373 sec)
INFO:tensorflow:global_step/sec: 319.752
INFO:tensorflow:loss = 9922.81, step = 201 (0.313 sec)
INFO:tensorflow:global_step/sec: 294.247
INFO:tensorflow:loss = 11458.7, step = 301 (0.336 sec)
INFO:tensorflow:global_step/sec: 303.462
INFO:tensorflow:loss = 5194.75, step = 401 (0.332 sec)
INFO:tensorflow:global_step/sec: 214.842
INFO:tensorflow:loss = 6338.51, step = 501 (0.465 sec)
INFO:tensorflow:global_step/sec: 280.574
INFO:tensorflow:loss = 11045.4, step = 601 (0.355 sec)
INFO:tensorflow:global_step/sec: 245.094
INFO:tensorflow:loss = 7759.94, step = 701 (0.408 sec)
INFO:tensorflow:global_step/sec: 305.579
INFO:tensorflow:loss = 6263.56, step = 801 (0.328 sec)
INFO:tensorflow:global_step/sec: 266.529
INFO:tensorflow:loss = 7780

<tensorflow.python.estimator.canned.dnn.DNNRegressor at 0x111024400>

### Evaluating the model
Use test_set and the model's evaluate method to assess how good the model is.

In [12]:
ev = regressor.evaluate(input_fn=get_input_fn(test_set, num_epochs=1, shuffle=False))
loss_score = ev["loss"]
print("Loss: {0:f}".format(loss_score))

INFO:tensorflow:Starting evaluation at 2017-08-21-12:36:08
INFO:tensorflow:Restoring parameters from ./model/boston_model/model.ckpt-5000
INFO:tensorflow:Finished evaluation at 2017-08-21-12:36:08
INFO:tensorflow:Saving dict for global step 5000: average_loss = 11.9585, global_step = 5000, loss = 1195.85
Loss: 1195.853882
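
The log reports two numbers, `average_loss` and `loss`. The arithmetic below is a sketch under an assumption that the log does not confirm: that `average_loss` is the mean squared error per example and `loss` is that error summed over a batch.

```python
import math

# Values copied from the evaluation log above.
average_loss = 11.9585
loss = 1195.85

# Under the stated assumption, the ratio is the number of examples per batch.
examples_per_batch = loss / average_loss

# The square root of a mean squared error is in the same units as the label.
rmse = math.sqrt(average_loss)

print(round(examples_per_batch))
print(round(rmse, 2))
```

If the assumption holds, `average_loss` is the more convenient number to compare across runs, since it does not depend on the batch size.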


### Making predictions

Use the model's predict method to make predictions on prediction_set.

In [13]:
y = regressor.predict(
    input_fn=get_input_fn(prediction_set, num_epochs=1, shuffle=False))
# .predict() returns an iterator of dicts; convert to a list and print
# predictions
predictions = list(p["predictions"] for p in itertools.islice(y, 6))
print("Predictions: {}".format(str(predictions)))

INFO:tensorflow:Restoring parameters from ./model/boston_model/model.ckpt-5000
Predictions: [array([ 32.69150925], dtype=float32), array([ 17.34743881], dtype=float32), array([ 22.3914566], dtype=float32), array([ 33.71511459], dtype=float32), array([ 14.71958637], dtype=float32), array([ 18.5411911], dtype=float32)]
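
The `itertools.islice` call in the cell above takes only the first six items from the iterator that predict returns. A minimal sketch of the same pattern, with a hypothetical `fake_predict` generator standing in for `regressor.predict` and made-up values:

```python
import itertools


def fake_predict():
    """Hypothetical stand-in for regressor.predict: yields dicts lazily."""
    value = 0.0
    while True:
        yield {"predictions": [value]}
        value += 1.5


# islice stops after six items, even though the generator never ends.
first_six = [p["predictions"] for p in itertools.islice(fake_predict(), 6)]
print(first_six)
```

Because the results are produced lazily, nothing past the sixth item is ever computed.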
