Writing a Custom Estimator

Usage of the tf.estimator.Estimator class

Reference links: https://www.cnblogs.com/zongfa/p/10149483.html

https://blog.csdn.net/qq_31573519/article/details/108062970

https://www.cnblogs.com/gogoSandy/p/12435372.html

In [1]:
import matplotlib as mpl
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np
import sklearn
import pandas as pd
import os
import sys
import time
import tensorflow as tf

from tensorflow import keras

print(tf.__version__)
print(sys.version_info)
for module in mpl, np, pd, sklearn, tf, keras:
    print(module.__name__, module.__version__)

1.14.0
sys.version_info(major=3, minor=6, micro=5, releaselevel='final', serial=0)
matplotlib 2.2.2
numpy 1.16.6
pandas 0.23.0
sklearn 0.23.2
tensorflow 1.14.0
tensorflow.python.keras.api._v1.keras 2.2.4-tf


In [2]:
# https://storage.googleapis.com/tf-datasets/titanic/train.csv
# https://storage.googleapis.com/tf-datasets/titanic/eval.csv
train_file = "./data/titanic/train.csv"
eval_file = "./data/titanic/eval.csv"

train_df = pd.read_csv(train_file)
eval_df = pd.read_csv(eval_file)

print(train_df.head())
print(eval_df.head())

   survived     sex   age  n_siblings_spouses  parch     fare  class     deck  \
0         0    male  22.0                   1      0   7.2500  Third  unknown   
1         1  female  38.0                   1      0  71.2833  First        C   
2         1  female  26.0                   0      0   7.9250  Third  unknown   
3         1  female  35.0                   1      0  53.1000  First        C   
4         0    male  28.0                   0      0   8.4583  Third  unknown   

   embark_town alone  
0  Southampton     n  
1    Cherbourg     n  
2  Southampton     y  
3  Southampton     n  
4   Queenstown     y  
   survived     sex   age  n_siblings_spouses  parch     fare   class  \
0         0    male  35.0                   0      0   8.0500   Third   
1         0    male  54.0                   0      0  51.8625   First   
2         1  female  58.0                   0      0  26.5500   First   
3         1  female  55.0                   0      0  16.0000  Second   
4         

In [3]:
y_train = train_df.pop('survived')
y_eval = eval_df.pop('survived')

print(train_df.head())
print(eval_df.head())
print(y_train.head())
print(y_eval.head())

      sex   age  n_siblings_spouses  parch     fare  class     deck  \
0    male  22.0                   1      0   7.2500  Third  unknown   
1  female  38.0                   1      0  71.2833  First        C   
2  female  26.0                   0      0   7.9250  Third  unknown   
3  female  35.0                   1      0  53.1000  First        C   
4    male  28.0                   0      0   8.4583  Third  unknown   

   embark_town alone  
0  Southampton     n  
1    Cherbourg     n  
2  Southampton     y  
3  Southampton     n  
4   Queenstown     y  
      sex   age  n_siblings_spouses  parch     fare   class     deck  \
0    male  35.0                   0      0   8.0500   Third  unknown   
1    male  54.0                   0      0  51.8625   First        E   
2  female  58.0                   0      0  26.5500   First        C   
3  female  55.0                   0      0  16.0000  Second  unknown   
4    male  34.0                   0      0  13.0000  Second        D   

  

In [4]:
train_df.describe()

             age  n_siblings_spouses     parch       fare
count      627.0               627.0     627.0      627.0
mean   29.631308            0.545455  0.379585  34.385399
std    12.511818             1.15109  0.792999   54.59773
min         0.75                 0.0       0.0        0.0
25%         23.0                 0.0       0.0     7.8958
50%         28.0                 0.0       0.0    15.0458
75%         35.0                 1.0       0.0    31.3875
max         80.0                 8.0       5.0   512.3292


In [5]:
# Define the feature columns
# Each tf.feature_column identifies a feature name, its type, and any input preprocessing

categorical_columns = ['sex', 'n_siblings_spouses', 'parch', 'class',
                       'deck', 'embark_town', 'alone']
numeric_columns = ['age', 'fare']

feature_columns = []
for categorical_column in categorical_columns:
    vocab = train_df[categorical_column].unique()
    print(categorical_column, vocab)
    feature_columns.append(
        tf.feature_column.indicator_column(
            # Maps each value to its index in the vocabulary list; indicator_column then one-hot encodes that index
            tf.feature_column.categorical_column_with_vocabulary_list(
                categorical_column, vocab)))

for numeric_column in numeric_columns:
    feature_columns.append(
        tf.feature_column.numeric_column(
            numeric_column, dtype=tf.float32))

sex ['male' 'female']
n_siblings_spouses [1 0 3 4 2 5 8]
parch [0 1 2 5 3 4]
class ['Third' 'First' 'Second']
deck ['unknown' 'C' 'G' 'A' 'B' 'D' 'F' 'E']
embark_town ['Southampton' 'Cherbourg' 'Queenstown' 'unknown']
alone ['n' 'y']
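The vocabulary-to-one-hot mapping used for the categorical columns can be sketched in plain Python. This is only an illustration of the idea, not the tf.feature_column implementation; the helper name `one_hot_from_vocab` is hypothetical, and returning an all-zero vector for an unseen value is an assumption of this sketch.

```python
def one_hot_from_vocab(value, vocab):
    """Return a one-hot list for `value`, using its position in `vocab`."""
    # Assumption of this sketch: a value missing from the vocabulary
    # produces an all-zero vector.
    return [1.0 if value == entry else 0.0 for entry in vocab]


# Vocabulary order copied from the printed output for the 'class' column.
class_vocab = ['Third', 'First', 'Second']

print(one_hot_from_vocab('First', class_vocab))   # [0.0, 1.0, 0.0]
print(one_hot_from_vocab('Fourth', class_vocab))  # [0.0, 0.0, 0.0]
```

Because the vocabulary comes from `unique()` on the training frame, the position of each category depends on the order in which values first appear in the training data.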


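Building the dataset:

    Every input function must return two objects:
    a dict whose keys are feature names and whose values are tensors (or SparseTensor) holding the corresponding feature data
    a tensor containing one or more labels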

In [6]:
# Build the dataset
def make_dataset(data_df, label_df, epochs = 10, shuffle = True,
                 batch_size = 32):
    dataset = tf.data.Dataset.from_tensor_slices(
        (dict(data_df), label_df))
    if shuffle:
        dataset = dataset.shuffle(10000)
    dataset = dataset.repeat(epochs).batch(batch_size)
    return dataset.make_one_shot_iterator().get_next()
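The shuffle, repeat, batch order used in make_dataset can be mimicked with the standard library to see what batches come out. This is a rough sketch rather than tf.data itself: the function name `batches` and the `seed` parameter are hypothetical, and it assumes a full reshuffle on every pass, which is reasonable here because the shuffle buffer (10000) is larger than the dataset.

```python
import random


def batches(rows, epochs=10, shuffle=True, batch_size=32, seed=0):
    """Mimic dataset.shuffle(...).repeat(epochs).batch(batch_size) on a list."""
    rng = random.Random(seed)
    stream = []
    for _ in range(epochs):
        epoch_rows = list(rows)
        if shuffle:
            rng.shuffle(epoch_rows)  # assumed: reshuffled on every pass
        stream.extend(epoch_rows)
    # Batching happens after repeat, so a batch may span two epochs and
    # only the very last batch can be smaller than batch_size.
    return [stream[i:i + batch_size] for i in range(0, len(stream), batch_size)]


out = batches(range(10), epochs=3, batch_size=4)
print([len(b) for b in out])  # [4, 4, 4, 4, 4, 4, 4, 2]
```

Swapping the order to batch first and repeat second would instead produce a short batch at the end of every epoch.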

In [7]:
x = make_dataset(eval_df, y_eval, epochs = 1)
print(x)

Instructions for updating:
Use `for ... in dataset:` to iterate over a dataset. If using `tf.estimator`, return the `Dataset` object directly from your input function. As a last resort, you can use `tf.compat.v1.data.make_one_shot_iterator(dataset)`.
({'sex': <tf.Tensor 'IteratorGetNext:8' shape=(?,) dtype=string>, 'age': <tf.Tensor 'IteratorGetNext:0' shape=(?,) dtype=float64>, 'n_siblings_spouses': <tf.Tensor 'IteratorGetNext:6' shape=(?,) dtype=int64>, 'parch': <tf.Tensor 'IteratorGetNext:7' shape=(?,) dtype=int64>, 'fare': <tf.Tensor 'IteratorGetNext:5' shape=(?,) dtype=float64>, 'class': <tf.Tensor 'IteratorGetNext:2' shape=(?,) dtype=string>, 'deck': <tf.Tensor 'IteratorGetNext:3' shape=(?,) dtype=string>, 'embark_town': <tf.Tensor 'IteratorGetNext:4' shape=(?,) dtype=string>, 'alone': <tf.Tensor 'IteratorGetNext:1' shape=(?,) dtype=string>}, <tf.Tensor 'IteratorGetNext:9' shape=(?,) dtype=int64>)


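1. Custom estimator

tf.estimator.Estimator()

1. __init__(self, model_fn, model_dir=None, config=None, params=None, warm_start_from=None)

Parameters:

    model_fn: the model function. It has the following form:
      Parameters:
      1. features: the first item returned by input_fn (input_fn is an argument of train, evaluate, and predict). Should be a single Tensor or a dict.
      2. labels: the second item returned by input_fn. Should be a single Tensor or a dict. When mode is ModeKeys.PREDICT, labels=None is passed. If model_fn does not accept mode, it should still be able to handle labels=None.
      3. mode: optional. Specifies whether this is training, evaluation, or prediction. See ModeKeys.
      4. params: optional dict of hyperparameters. It allows Estimators to be configured from hyperparameter tuning.
      5. config: optional configuration object. If it is not passed, the default is used. It lets model_fn adapt to settings such as num_ps_replicas or model_dir.
      Returns:
      EstimatorSpec
    model_dir: directory in which model parameters, the graph, and so on are saved. It can also be used to load checkpoints from that directory into the estimator to continue training a previously saved model. If it is a PathLike object, the path is resolved. If it is None, the model_dir in config is used when set. If both are set, they must be the same. If both are None, a temporary directory is used.
    config: the configuration object.
    params: dict of hyperparameters that is passed to model_fn. Keys are parameter names and values are basic Python types.
    warm_start_from: optional. Either a string file path to a checkpoint to warm-start from, or a tf.estimator.WarmStartSettings object that fully configures warm-starting. If a string path is given, all variables are warm-started, and Tensor names and vocabularies are assumed to be unchanged.
Raises:

RuntimeError: if eager execution is enabled

ValueError: if the parameters of model_fn do not match params

ValueError: if this is called via a subclass that overrides a member of Estimator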
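The model_fn contract listed above can be illustrated without TensorFlow. In this toy sketch, `ToySpec` is a hypothetical stand-in for tf.estimator.EstimatorSpec, the mode constants are plain strings standing in for tf.estimator.ModeKeys, and the "model" is just a weighted sum. The point is only the control flow: prediction returns before labels are touched, evaluation needs a loss, and training needs a loss plus a training op.

```python
from collections import namedtuple

# Hypothetical stand-in for tf.estimator.EstimatorSpec; not the real class.
ToySpec = namedtuple("ToySpec", ["mode", "predictions", "loss", "train_op"])

# Plain strings standing in for tf.estimator.ModeKeys.
TRAIN, EVAL, PREDICT = "train", "eval", "infer"


def toy_model_fn(features, labels, mode, params):
    # Toy "model": a weighted sum of the feature values.
    score = params["weight"] * sum(features.values())

    if mode == PREDICT:
        # labels is None in this mode, so return before using it.
        return ToySpec(mode, {"score": score}, None, None)

    loss = (score - labels) ** 2  # squared error against the label
    if mode == EVAL:
        return ToySpec(mode, None, loss, None)

    # Training additionally needs an op that updates the weights;
    # a string placeholder is used here.
    return ToySpec(mode, None, loss, "apply-gradients")


features = {"a": 1.0, "b": 2.0}
params = {"weight": 2.0}
print(toy_model_fn(features, None, PREDICT, params).predictions)  # {'score': 6.0}
print(toy_model_fn(features, 5.0, EVAL, params).loss)             # 1.0
```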

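2. tf.feature_column.input_layer

In TensorFlow, once a set of features has been assembled, tf.feature_column.input_layer
converts them into data the model can consume directly and feeds it into the neural network.
A single Example in the training data is generally described with FeatureColumns, and at the
first layer of the model this column-oriented data is converted into a single Tensor.

    tf.feature_column.input_layer(
        features,
        feature_columns,
        weight_collections=None,
        trainable=True,
        cols_to_vars=None,
        cols_to_output_tensors=None
    )

Key parameters

    features: a mapping from key to tensors. _FeatureColumns look up their data through these keys; the values are Tensor or SparseTensor.
    feature_columns: an iterable of the FeatureColumns to use as inputs to the model. Every item must be an instance of a class derived from _DenseColumn,
    such as numeric_column, embedding_column, bucketized_column, or indicator_column. Categorical features
    must be wrapped in an embedding_column or indicator_column.
   
Returns

    A Tensor representing the model's input layer, with shape (batch_size, first_layer_dimension) and dtype float32. The value of first_layer_dimension is determined by the given feature_columns.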
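For the feature columns defined earlier, first_layer_dimension can be worked out by hand: each indicator column contributes one slot per vocabulary entry, and each numeric column contributes one slot. The short calculation below uses the vocabularies printed in cell 5 and assumes the default scalar shape for numeric_column.

```python
# Vocabularies copied from the output of the feature-column cell.
vocabularies = {
    'sex': ['male', 'female'],
    'n_siblings_spouses': [1, 0, 3, 4, 2, 5, 8],
    'parch': [0, 1, 2, 5, 3, 4],
    'class': ['Third', 'First', 'Second'],
    'deck': ['unknown', 'C', 'G', 'A', 'B', 'D', 'F', 'E'],
    'embark_town': ['Southampton', 'Cherbourg', 'Queenstown', 'unknown'],
    'alone': ['n', 'y'],
}
numeric_columns = ['age', 'fare']

# One slot per vocabulary entry for every one-hot (indicator) column.
one_hot_width = sum(len(vocab) for vocab in vocabularies.values())
# One slot per numeric column (assumes the default scalar shape).
first_layer_dimension = one_hot_width + len(numeric_columns)

print(one_hot_width, first_layer_dimension)  # 32 34
```

So the first dense layer in this notebook would receive a float32 tensor of shape (batch_size, 34).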


In [8]:
# Directory in which model parameters, the graph, and so on are saved
output_dir = "customized_estimator"
if not os.path.exists(output_dir):
    os.mkdir(output_dir)

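# Model function used to construct the Estimator instance
def model_fn(features, labels, mode, params):
    """
        features: the first item returned by input_fn (input_fn is an argument
                  of train, evaluate, and predict). Should be a single Tensor or a dict.
        labels:   the second item returned by input_fn. Should be a single Tensor or a dict.
                  When mode is ModeKeys.PREDICT, labels=None is passed.
                  If model_fn does not accept mode, it should still handle labels=None.
        mode:     specifies whether this is training, evaluation, or prediction.
        params:   dict of hyperparameters; allows Estimators to be configured
                  from hyperparameter tuning.
    """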
    input_for_next_layer = tf.feature_column.input_layer(
        features, params['feature_columns'])
    
    for n_unit in params['hidden_units']:
        input_for_next_layer = tf.layers.dense(
            input_for_next_layer,
            units = n_unit,
            activation = tf.nn.relu)
    logits = tf.layers.dense(
        input_for_next_layer,
        params['n_classes'],
        activation = None)
    predicted_classes = tf.argmax(logits, 1)
    
    if mode == tf.estimator.ModeKeys.PREDICT:
        predictions = {
            "class_ids": predicted_classes[:, tf.newaxis],
            "probabilities": tf.nn.softmax(logits),
            "logits": logits
        }
        
        return tf.estimator.EstimatorSpec(mode,
                                         predictions = predictions)
    
    loss = tf.losses.sparse_softmax_cross_entropy(labels= labels,
                                                 logits = logits)
    accuracy = tf.metrics.accuracy(labels= labels,
                                  predictions = predicted_classes,
                                  name = "acc_op")
    
    metrics = {"accuracy": accuracy}
    if mode == tf.estimator.ModeKeys.EVAL:
        return tf.estimator.EstimatorSpec(mode, 
                                          loss = loss,
                                          eval_metric_ops = metrics)
    
    optimizer = tf.train.AdamOptimizer()
    train_op = optimizer.minimize(loss, global_step = tf.train.get_global_step())
    
    return tf.estimator.EstimatorSpec(mode, loss = loss, 
                                     train_op = train_op)


estimator = tf.estimator.Estimator(
    model_fn = model_fn,
    model_dir = output_dir,
    params = {
        "feature_columns": feature_columns,
        "hidden_units": [100, 100], # number of units in each hidden layer
        "n_classes": 2 # binary classification
    })

estimator.train(input_fn = lambda : make_dataset(
    train_df, y_train, epochs = 100))

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_model_dir': 'customized_estimator', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': allow_soft_placement: true
graph_options {
  rewrite_options {
    meta_optimizer_iterations: ONE
  }
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x00000000125553C8>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}
Instructions for updating:
Use Variable.read_value. Variables in 2.X are initialized aut

INFO:tensorflow:loss = 0.3493362, step = 3361 (0.117 sec)
INFO:tensorflow:global_step/sec: 869.567
INFO:tensorflow:loss = 0.20475882, step = 3461 (0.115 sec)
INFO:tensorflow:global_step/sec: 869.565
INFO:tensorflow:loss = 0.24628563, step = 3561 (0.115 sec)
INFO:tensorflow:global_step/sec: 854.702
INFO:tensorflow:loss = 0.3514607, step = 3661 (0.115 sec)
INFO:tensorflow:global_step/sec: 900.9
INFO:tensorflow:loss = 0.35510308, step = 3761 (0.112 sec)
INFO:tensorflow:global_step/sec: 884.956
INFO:tensorflow:loss = 0.33746642, step = 3861 (0.114 sec)
INFO:tensorflow:Saving checkpoints for 3920 into customized_estimator\model.ckpt.
INFO:tensorflow:Loss for final step: 0.41045007.
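The final checkpoint step in the log can be cross-checked with a little arithmetic. The sketch assumes 627 training rows (the count reported by train_df.describe()) together with the epochs = 100 and default batch_size = 32 used in the train call.

```python
import math

n_train = 627      # row count reported by train_df.describe()
epochs = 100       # value passed to make_dataset in the train call
batch_size = 32    # make_dataset default

# repeat(epochs) comes before batch(batch_size), so only the last batch is partial.
steps_per_call = math.ceil(n_train * epochs / batch_size)

print(steps_per_call)         # 1960
print(3920 / steps_per_call)  # 2.0
```

One call to train should therefore run 1960 steps. The logged 3920 is exactly twice that, which would be consistent with train having been run twice against the same model_dir, since an Estimator resumes from the latest checkpoint it finds there. That reading is an inference from the numbers, not something the log states.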


<tensorflow_estimator.python.estimator.estimator.Estimator at 0x12555048>

In [9]:
estimator.evaluate(lambda : make_dataset(
    eval_df, y_eval, epochs = 1))

INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2020-12-28T15:39:45Z
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from customized_estimator\model.ckpt-3920
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Finished evaluation at 2020-12-28-15:39:45
INFO:tensorflow:Saving dict for global step 3920: accuracy = 0.79545456, global_step = 3920, loss = 0.52268475
INFO:tensorflow:Saving 'checkpoint_path' summary for global step 3920: customized_estimator\model.ckpt-3920


{'accuracy': 0.79545456, 'loss': 0.52268475, 'global_step': 3920}
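To make the two reported metrics concrete, here is a rough pure-Python sketch of what a sparse softmax cross-entropy loss and an argmax accuracy compute for a single batch. The function names and the example logits are made up for illustration; the sketch does not reproduce how the Estimator aggregates these values across batches.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def evaluate_batch(batch_logits, labels):
    """Return (mean cross-entropy loss, accuracy) for integer class labels."""
    losses, correct = [], 0
    for logits, label in zip(batch_logits, labels):
        probs = softmax(logits)
        losses.append(-math.log(probs[label]))  # sparse cross-entropy
        predicted = max(range(len(logits)), key=logits.__getitem__)  # argmax
        correct += int(predicted == label)
    return sum(losses) / len(losses), correct / len(labels)


# Made-up logits for three examples and two classes.
loss, accuracy = evaluate_batch([[2.0, 0.0], [0.0, 1.0], [1.0, 0.0]], [0, 1, 1])
print(loss, accuracy)  # accuracy is 2/3: the third example is misclassified
```

With two equal logits the per-example loss is log(2), roughly 0.693, which gives a useful baseline when reading the loss of 0.52 reported above.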