# Logging and Monitoring Basics with tf.contrib.learn: Reading Notes
[Original TensorFlow tutorial](https://www.tensorflow.org/get_started/monitors)

These notes build on the [quickstart model](quickstart.ipynb).

From using Keras I already knew that monitoring matters a great deal: overfitting can be handled with early stopping, and early stopping depends directly on monitoring.

TensorFlow provides the [Monitor API](https://www.tensorflow.org/api_docs/python/tf/contrib/learn/monitors) for monitoring the training phase.

## Enabling Logging in TensorFlow

TensorFlow has five log levels. In increasing order of severity they are **DEBUG, INFO, WARN, ERROR, FATAL**. Setting the level to ERROR logs both ERROR and FATAL messages; likewise, setting it to DEBUG logs everything.
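As far as I know, `tf.logging` is a thin wrapper over Python's standard `logging` module, so the threshold behaviour can be sketched without TensorFlow. The helper `emitted_levels` and the logger name `verbosity_demo` are made up for this illustration; only the standard-library calls are real.

```python
import logging

# The five level names used in these notes; WARN and FATAL are aliases
# that the standard logging module defines for WARNING and CRITICAL.
LEVELS = ["DEBUG", "INFO", "WARN", "ERROR", "FATAL"]


def emitted_levels(threshold):
    """Return the level names that a logger set to `threshold` would emit."""
    logger = logging.getLogger("verbosity_demo")  # hypothetical logger name
    logger.setLevel(getattr(logging, threshold))
    return [name for name in LEVELS
            if logger.isEnabledFor(getattr(logging, name))]


print(emitted_levels("ERROR"))  # ['ERROR', 'FATAL']
print(emitted_levels("DEBUG"))  # all five levels
```

A level acts as a lower bound: everything at that severity or above gets through.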

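TensorFlow's default level is **WARN**, but to see the messages logged inside `fit` it has to be set to **INFO** *(this seems to have changed since; the default now appears to be **INFO**)*.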

```python
tf.logging.set_verbosity(tf.logging.INFO)
```

With the level set to INFO, a log line appears every 100 steps showing the loss, similar to the output below.

```
INFO:tensorflow:loss = 1.18812, step = 1
INFO:tensorflow:loss = 0.210323, step = 101
INFO:tensorflow:loss = 0.109025, step = 201
```
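The step numbers 1, 101, 201 suggest that the first step is always logged, followed by every 100th step after it. A minimal sketch of that cadence, assuming this is all the logger does (the helper `logged_steps` is hypothetical and not part of TensorFlow):

```python
def logged_steps(total_steps, every_n=100):
    """Steps at which a loss line would appear: step 1, then every `every_n` steps."""
    return [step for step in range(1, total_steps + 1)
            if (step - 1) % every_n == 0]


print(logged_steps(250))  # [1, 101, 201]
```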

## ValidationMonitor

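TensorFlow has other monitors besides ValidationMonitor, but this one seems to be the most commonly used, and it looks sufficient for my purposes.

Google's description of ValidationMonitor: Logs a specified set of evaluation metrics at every n steps of training, and, if desired, implements early stopping under certain conditions.

Here the data in test_set is used to guard against overfitting, with an evaluation every 50 steps.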

```python
validation_monitor = tf.contrib.learn.monitors.ValidationMonitor(
    test_set.data,
    test_set.target,
    every_n_steps=50)
```

For details, see [ValidationMonitor \_\_init\_\_](https://www.tensorflow.org/versions/r0.12/api_docs/python/contrib.learn.monitors/ops#ValidationMonitor).

## RunConfig

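ValidationMonitor relies on saved checkpoints to run its evaluations, so a RunConfig object has to be passed in when the model is constructed to control how often checkpoints are written. In practice the RunConfig is required here; otherwise validation does not work as intended.

[RunConfig API](https://www.tensorflow.org/api_docs/python/tf/contrib/learn/RunConfig)

Pass the RunConfig through the `config` argument: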
```python
classifier = tf.contrib.learn.DNNClassifier(
    feature_columns=feature_columns,
    hidden_units=[10, 20, 10],
    n_classes=3,
    model_dir="/tmp/iris_model",
    config=tf.contrib.learn.RunConfig(save_checkpoints_secs=1))
```
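Why the checkpoint interval matters can be shown with a small simulation. It assumes, as a simplification of my understanding of the monitor, that each validation pass evaluates the most recent checkpoint and is skipped when that checkpoint was already evaluated. The function `evaluated_checkpoints` and the step numbers are invented for this sketch and are not TensorFlow APIs.

```python
def evaluated_checkpoints(checkpoint_steps, total_steps, every_n_steps):
    """Return (validation_step, checkpoint_step) pairs that yield a fresh evaluation."""
    evaluations = []
    last_evaluated = None
    for step in range(every_n_steps, total_steps + 1, every_n_steps):
        # Latest checkpoint written at or before this validation step.
        available = [c for c in checkpoint_steps if c <= step]
        latest = max(available) if available else None
        # Assumption: nothing to do without a checkpoint or with a stale one.
        if latest is None or latest == last_evaluated:
            continue
        last_evaluated = latest
        evaluations.append((step, latest))
    return evaluations


# Checkpoints only at the first and last step: almost every validation is skipped.
sparse = evaluated_checkpoints([1, 600], total_steps=600, every_n_steps=50)
# Checkpoints every 25 steps: every validation sees a new model.
frequent = evaluated_checkpoints(list(range(1, 601, 25)),
                                 total_steps=600, every_n_steps=50)
print(len(sparse), len(frequent))  # 2 12
```

This would also explain why the validation log lines in these notes report a `global_step` that lags behind the training step: the evaluated model is the last checkpoint, not the live weights.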

## Using ValidationMonitor

Pass validation_monitor to the `fit` function:
```python
classifier.fit(x=training_set.data,
               y=training_set.target,
               steps=2000,
               monitors=[validation_monitor])
```
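Conceptually, `monitors` is a list of callbacks that the training loop consults after each step, and any of them may ask for training to stop. The following is a plain-Python sketch of that pattern, not the tf.contrib.learn implementation; `EveryNStepsMonitor`, `fit`, and `validate` are hypothetical names.

```python
class EveryNStepsMonitor:
    """Runs `callback` every `every_n_steps`; a truthy result requests a stop."""

    def __init__(self, every_n_steps, callback):
        self.every_n_steps = every_n_steps
        self.callback = callback

    def step_end(self, step):
        if step % self.every_n_steps != 0:
            return False
        return bool(self.callback(step))


def fit(steps, monitors):
    """Toy training loop: returns the step at which training ended."""
    for step in range(1, steps + 1):
        # A real training step (forward pass, gradients, update) would run here.
        stop_requests = [m.step_end(step) for m in monitors]  # every monitor runs
        if any(stop_requests):
            return step
    return steps


validated_at = []


def validate(step):
    validated_at.append(step)
    return step >= 150  # pretend validation stopped improving at step 150


last_step = fit(steps=2000, monitors=[EveryNStepsMonitor(50, validate)])
print(last_step, validated_at)  # 150 [50, 100, 150]
```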

## Evaluation Metrics

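If no evaluation metrics are specified, TensorFlow reports loss and accuracy. As in Keras, you can also specify the evaluation metrics you need.

First build a dict in which each key is the name of a metric and each value is a [**MetricSpec**](https://github.com/tensorflow/tensorflow/blob/r1.3/tensorflow/contrib/learn/python/learn/metric_spec.py); a **MetricSpec** describes one evaluation metric.

### MetricSpec: Google's Explanation

There is too much to paraphrase, so here is the original English text.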

The MetricSpec constructor accepts **four parameters**:

* **metric_fn** The function that calculates and returns the value of a metric. This can be a predefined function available in the [tf.contrib.metrics](https://www.tensorflow.org/api_docs/python/tf/contrib/metrics) module, such as [tf.contrib.metrics.streaming_precision](https://www.tensorflow.org/api_docs/python/tf/contrib/metrics/streaming_precision) or [tf.contrib.metrics.streaming_recall](https://www.tensorflow.org/api_docs/python/tf/contrib/metrics/streaming_recall). Alternatively, you can define your own custom metric function, which must take **predictions** and **labels** tensors as arguments (a weights argument can also optionally be supplied). The function must return the value of the metric in one of two formats:

  * A single tensor
  * A pair of ops (value_op, update_op), where value_op returns the metric value and update_op performs a corresponding operation to update internal model state.
  
* **prediction_key** The key of the tensor containing the predictions returned by the model. This argument may be omitted if the model returns either a single tensor or a dict with a single entry. For a DNNClassifier model, class predictions will be returned in a tensor with the key tf.contrib.learn.PredictionKey.CLASSES.

* **label_key** The key of the tensor containing the labels returned by the model, as specified by the model's input_fn. As with prediction_key, this argument may be omitted if the input_fn returns either a single tensor or a dict with a single entry. In the iris example in this tutorial, the DNNClassifier does not have an input_fn (x,y data is passed directly to fit), so it's not necessary to provide a label_key.

* **weights_key** Optional. The key of the tensor (returned by the input_fn) containing weights inputs for the metric_fn.
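The `(value_op, update_op)` pair returned by a streaming metric can be pictured as a running accumulator: the update folds one batch into the running totals, and the value reads the current result. Below is a plain-Python analogy, not TensorFlow code; the class `StreamingAccuracy` is a hypothetical stand-in for what a metric such as `streaming_accuracy` tracks.

```python
class StreamingAccuracy:
    """Accumulates correct/total counts across batches."""

    def __init__(self):
        self.correct = 0
        self.total = 0

    def update(self, predictions, labels):
        """Plays the role of update_op: fold one batch into the running counts."""
        self.correct += sum(int(p == l) for p, l in zip(predictions, labels))
        self.total += len(labels)
        return self.value()

    def value(self):
        """Plays the role of value_op: read the metric without changing state."""
        return self.correct / self.total if self.total else 0.0


metric = StreamingAccuracy()
metric.update([0, 1, 2], [0, 1, 1])  # 2 of 3 correct
metric.update([2, 2], [2, 2])        # 2 of 2 correct
print(metric.value())                # 4 of 5 correct overall
```

Keeping counts rather than averaging per-batch accuracies is what makes the result independent of how the evaluation data is split into batches.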

Code example:
```python
validation_metrics = {
    "accuracy":
        tf.contrib.learn.MetricSpec(
            metric_fn=tf.contrib.metrics.streaming_accuracy,
            prediction_key=tf.contrib.learn.PredictionKey.CLASSES),
    "precision":
        tf.contrib.learn.MetricSpec(
            metric_fn=tf.contrib.metrics.streaming_precision,
            prediction_key=tf.contrib.learn.PredictionKey.CLASSES),
    "recall":
        tf.contrib.learn.MetricSpec(
            metric_fn=tf.contrib.metrics.streaming_recall,
            prediction_key=tf.contrib.learn.PredictionKey.CLASSES)
}

validation_monitor = tf.contrib.learn.monitors.ValidationMonitor(
    test_set.data,
    test_set.target,
    every_n_steps=50,
    metrics=validation_metrics)
```

Result:

```
INFO:tensorflow:Validation (step 50): recall = 0.0, loss = 1.20626, global_step = 1, precision = 0.0, accuracy = 0.266667
...
INFO:tensorflow:Validation (step 600): recall = 1.0, loss = 0.0530696, global_step = 571, precision = 1.0, accuracy = 0.966667
...
INFO:tensorflow:Validation (step 1500): recall = 1.0, loss = 0.0617403, global_step = 1452, precision = 1.0, accuracy = 0.966667
```
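Precision and recall are binary metrics, while the iris labels have three classes. The sketch below assumes that any nonzero class id is treated as "positive", which is my reading of how the streaming metrics cast their inputs to booleans; treat that as an assumption rather than documented behaviour. The function `precision_recall` is a hypothetical helper.

```python
def precision_recall(predictions, labels):
    """Binary precision and recall, treating nonzero class ids as positive."""
    pred_pos = [bool(p) for p in predictions]  # assumption: nonzero means positive
    true_pos = [bool(l) for l in labels]
    tp = sum(p and t for p, t in zip(pred_pos, true_pos))
    fp = sum(p and not t for p, t in zip(pred_pos, true_pos))
    fn = sum(not p and t for p, t in zip(pred_pos, true_pos))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


labels = [0, 1, 2, 1]
# A model that always predicts class 0 has no positives at all.
print(precision_recall([0, 0, 0, 0], labels))  # (0.0, 0.0)
# Confusing class 2 with class 1 does not hurt either metric under this assumption.
print(precision_recall([0, 1, 2, 2], labels))  # (1.0, 1.0)
```

Under that assumption, precision and recall can both reach 1.0 while accuracy stays below 1.0, which is the pattern visible in the log lines above.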

## Early Stopping with ValidationMonitor

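This is the key point (taps the blackboard): everything above exists only to make early stopping possible.

To use early stopping, add three more parameters to ValidationMonitor (the original explanation is quoted here because the English is clearer):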

|Param|Description|
|:-|:-|
|early_stopping_metric|Metric that triggers early stopping (e.g., loss or accuracy) under conditions specified in early_stopping_rounds and early_stopping_metric_minimize. Default is "loss".|
|early_stopping_metric_minimize|True if desired model behavior is to minimize the value of early_stopping_metric; False if desired model behavior is to maximize the value of early_stopping_metric. Default is True.|
|early_stopping_rounds|Sets a number of steps during which if the early_stopping_metric does not decrease (if early_stopping_metric_minimize is True) or increase (if early_stopping_metric_minimize is False), training will be stopped. Default is None, which means early stopping will never occur.|

Create a monitor that stops training early if the loss has not improved for 200 steps:
```python
validation_monitor = tf.contrib.learn.monitors.ValidationMonitor(
    test_set.data,
    test_set.target,
    every_n_steps=50,
    metrics=validation_metrics,
    early_stopping_metric="loss",
    early_stopping_metric_minimize=True,
    early_stopping_rounds=200)
```
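The rule described in the table can be sketched in plain Python: remember the step at which the metric was last at its best, and stop once that step is `early_stopping_rounds` or more behind the current one. This is a simplified reading of the documented behaviour, not the library code; `early_stopping_step` and the sample histories are hypothetical.

```python
def early_stopping_step(history, early_stopping_rounds, minimize=True):
    """Return the step at which training would stop, or None if it never does.

    `history` is a list of (step, metric_value) pairs, one per validation pass.
    """
    best_value = None
    best_step = None
    for step, value in history:
        improved = (best_value is None
                    or (minimize and value < best_value)
                    or (not minimize and value > best_value))
        if improved:
            best_value, best_step = value, step
        # Stop once the best value is at least `early_stopping_rounds` steps old.
        if step - best_step >= early_stopping_rounds:
            return step
    return None


# Validation loss every 50 steps; the best loss (0.5) is seen at step 100.
loss_history = [(50, 1.0), (100, 0.5), (150, 0.6),
                (200, 0.55), (250, 0.7), (300, 0.52)]
print(early_stopping_step(loss_history, early_stopping_rounds=200))  # 300
```

Note that the window is measured in training steps, not in validation passes: with `every_n_steps=50`, a value of 200 allows four validations without improvement.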

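## Visualizing Log Data with TensorBoard

Run `tensorboard --logdir=/tmp/iris_model/` (the folder where the model is stored) in a terminal to view the information logged during training. I will cover this in more detail later.

## Code

The official TensorFlow site does not provide a single complete listing, so the pieces are collected here.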

```python
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import numpy as np
import tensorflow as tf
```

```python
tf.logging.set_verbosity(tf.logging.INFO)

# Data sets
IRIS_TRAINING = "./data/iris/iris_training.csv"
IRIS_TEST = "./data/iris/iris_test.csv"


def main(unused_argv):
    # Load datasets. The built-in int replaces np.int, which was only an
    # alias for it and has been removed from recent NumPy releases.
    training_set = tf.contrib.learn.datasets.base.load_csv_with_header(
        filename=IRIS_TRAINING, target_dtype=int, features_dtype=np.float32)
    test_set = tf.contrib.learn.datasets.base.load_csv_with_header(
        filename=IRIS_TEST, target_dtype=int, features_dtype=np.float32)

    # Specify that all features have real-value data
    feature_columns = [tf.contrib.layers.real_valued_column("", dimension=4)]

    # Create validation metrics
    validation_metrics = {
        "accuracy":
            tf.contrib.learn.MetricSpec(
                metric_fn=tf.contrib.metrics.streaming_accuracy,
                prediction_key=tf.contrib.learn.PredictionKey.CLASSES),
        "precision":
            tf.contrib.learn.MetricSpec(
                metric_fn=tf.contrib.metrics.streaming_precision,
                prediction_key=tf.contrib.learn.PredictionKey.CLASSES),
        "recall":
            tf.contrib.learn.MetricSpec(
                metric_fn=tf.contrib.metrics.streaming_recall,
                prediction_key=tf.contrib.learn.PredictionKey.CLASSES),
    }

    # Create validation_monitor with early stopping
    validation_monitor = tf.contrib.learn.monitors.ValidationMonitor(
        test_set.data,
        test_set.target,
        every_n_steps=50,
        metrics=validation_metrics,
        early_stopping_metric="loss",
        early_stopping_metric_minimize=True,
        early_stopping_rounds=200)

    # Build 3 layer DNN with 10, 20, 10 units respectively.
    # save_checkpoints_secs=1 keeps checkpoints fresh for the monitor.
    classifier = tf.contrib.learn.DNNClassifier(
        feature_columns=feature_columns,
        hidden_units=[10, 20, 10],
        n_classes=3,
        model_dir="/tmp/iris_model",
        config=tf.contrib.learn.RunConfig(save_checkpoints_secs=1))

    # Fit model.
    classifier.fit(x=training_set.data,
                   y=training_set.target,
                   steps=6000,
                   monitors=[validation_monitor])

    # Evaluate accuracy.
    accuracy_score = classifier.evaluate(x=test_set.data,
                                         y=test_set.target)["accuracy"]
    print('Accuracy: {0:f}'.format(accuracy_score))

    # Classify two new flower samples.
    new_samples = np.array(
        [[6.4, 3.2, 4.5, 1.5], [5.8, 3.1, 5.0, 1.7]], dtype=float)
    y = list(classifier.predict(new_samples, as_iterable=True))
    print('Predictions: {}'.format(str(y)))


if __name__ == "__main__":
    tf.app.run()
```

Running the cell in a notebook prints output like the following (truncated in the original run, with repeated deprecation warnings shown once). The trailing `SystemExit` is expected: `tf.app.run()` calls `sys.exit()` once `main` returns.

```
Instructions for updating:
Monitors are deprecated. Please use tf.train.SessionRunHook.
INFO:tensorflow:Using config: {'_save_summary_steps': 100, '_keep_checkpoint_max': 5, '_task_type': None, '_is_chief': True, '_save_checkpoints_steps': None, '_master': '', '_evaluation_master': '', '_tf_random_seed': None, '_num_ps_replicas': 0, '_environment': 'local', '_task_id': 0, '_save_checkpoints_secs': 1, '_log_step_count_steps': 100, '_model_dir': '/tmp/iris_model', '_num_worker_replicas': 0, '_session_config': None, '_keep_checkpoint_every_n_hours': 10000, '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1
}
, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x1142dda20>}
Instructions for updating:
Estimator is decoupled from Scikit Learn interface by moving into
separate class SKCompat. Arguments x, y and batch_size are only
available in the SKCompat class, Estimator will only accept input_fn.
Example conversion:
  est = Estimator(...) -> est = SKCompat(Estimator(...))
Instructions for updating:
Please switch to tf.summary.scalar. Note that tf.summary.scalar uses the node name instead of the tag. This means that TensorFlow will automatically de-duplicate summary names based on the scope they are created in. Also, passing a tensor or list of tags to a scalar summary op is no longer supported.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Saving checkpoints for 1 into /tmp/iris_model/model.ckpt.
INFO:tensorflow:loss = 1.81938, step = 1
...

SystemExit: 

  warn("To exit: use 'exit', 'quit', or Ctrl-D.", stacklevel=1)
```
