<div class="devsite-table-wrapper"><table class="tfo-notebook-buttons" align="left">
<td><a target="_blank" href="https://tensorflow.google.cn/tfx/tutorials/transform/census">
<img src="https://tensorflow.google.cn/images/tf_logo_32px.png" />View on TensorFlow.org</a></td>
<td><a target="_blank" href="https://colab.research.google.com/github/tensorflow/docs-l10n/blob/master/site/zh-cn/tfx/tutorials/transform/census.ipynb">
<img src="https://tensorflow.google.cn/images/colab_logo_32px.png">Run in Google Colab</a></td>
<td><a target="_blank" href="https://github.com/tensorflow/docs-l10n/blob/master/site/zh-cn/tfx/tutorials/transform/census.ipynb">
<img width=32px src="https://tensorflow.google.cn/images/GitHub-Mark-32px.png">View source on GitHub</a></td>
<td><a target="_blank" href="https://storage.googleapis.com/tensorflow_docs/docs-l10n/site/zh-cn/tfx/tutorials/transform/census.ipynb">
<img width=32px src="https://tensorflow.google.cn/images/download_logo_32px.png">Download notebook</a></td>
</table></div>

##### Copyright 2020 The TensorFlow Authors.

In [None]:
#@title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

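# Preprocess data with TensorFlow Transform

***The Feature Engineering Component of TensorFlow Extended (TFX)***

This example Colab notebook provides a somewhat more advanced example of how <a target="_blank" href="https://tensorflow.google.cn/tfx/transform/">TensorFlow Transform</a> (`tf.Transform`) can be used to preprocess data, using exactly the same code for both training a model and serving inferences in production.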

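TensorFlow Transform is a library for preprocessing input data for TensorFlow, including creating features that require a full pass over the training dataset. For example, using TensorFlow Transform you could:

- Normalize an input value by using the mean and standard deviation
- Convert strings to integers by generating a vocabulary over all of the input values
- Convert floats to integers by assigning them to buckets, based on the observed data distribution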
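
To make the idea of a "full pass" concrete, here is a minimal plain-Python sketch of the three computations listed above. It does not use `tf.Transform` at all; the toy columns, the frequency-based vocabulary ordering, and the choice of quartiles as bucket boundaries are assumptions made only for illustration.

```python
import bisect
import statistics
from collections import Counter

# Hypothetical toy "training set"; the values are made up for illustration.
ages = [25.0, 38.0, 28.0, 44.0, 18.0, 34.0]
workclass = ['Private', 'Private', 'Local-gov', 'Private', '?', 'Local-gov']

# 1. Normalize with the mean and standard deviation of the whole column.
mean = statistics.mean(ages)
std = statistics.pstdev(ages)
z_scores = [(a - mean) / std for a in ages]

# 2. Build a vocabulary over all values, then map each string to an integer.
#    Ordering by descending frequency is a choice made for this sketch.
counts = Counter(workclass)
vocab = [w for w, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))]
word_to_id = {w: i for i, w in enumerate(vocab)}
ids = [word_to_id[w] for w in workclass]

# 3. Bucketize floats using boundaries taken from the observed distribution
#    (here: the three quartile cut points of the column).
boundaries = statistics.quantiles(ages, n=4)
buckets = [bisect.bisect_right(boundaries, a) for a in ages]

print(z_scores)
print(vocab, ids)
print(boundaries, buckets)
```

In each case the statistic (mean and standard deviation, vocabulary, boundaries) can only be known after looking at every training value, which is why a per-example or per-batch operation alone is not enough.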

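TensorFlow has built-in support for manipulations on a single example or a batch of examples. `tf.Transform` extends these capabilities to support full passes over the entire training dataset.

The output of `tf.Transform` is exported as a TensorFlow graph that you can use for both training and serving. Using the same graph for both training and serving can prevent skew, since the same transformations are applied in both stages.

Key Point: To understand `tf.Transform` and how it works with Apache Beam, you'll need to know a little about Apache Beam itself. The <a target="_blank" href="https://beam.apache.org/documentation/programming-guide/">Beam Programming Guide</a> is a good place to start.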

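## What we're doing in this example

In this example we'll be processing a <a target="_blank" href="https://archive.ics.uci.edu/ml/machine-learning-databases/adult">widely used dataset containing census data</a>, and training a model to do classification. Along the way we'll be transforming the data using `tf.Transform`.

Key Point: As a modeler and developer, think about how this data is used and the potential benefits and harm a model's predictions can cause. A model like this could reinforce societal biases and disparities. Is a feature relevant to the problem you want to solve, or will it introduce bias? For more information, read about <a target="_blank" href="https://developers.google.com/machine-learning/fairness-overview/">ML fairness</a>.

Note: <a target="_blank" href="https://tensorflow.google.cn/tfx/model_analysis">TensorFlow Model Analysis</a> is a powerful tool for understanding how well your model predicts for various segments of your data, including understanding how your model may reinforce societal biases and disparities.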

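### Upgrade Pip

To avoid upgrading Pip on a system when running locally, we check that we're running in Colab first. Local systems can of course be upgraded separately.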

In [None]:
try:
  import colab
  !pip install --upgrade pip
except:
  pass

### Install TensorFlow Transform

**Note: In Google Colab, because of package updates, the first time you run this cell you must restart the runtime (Runtime &gt; Restart runtime ...).**

In [None]:
!pip install tensorflow-transform

## Python check, imports, and globals

First, we'll make sure that we're using Python 3, and then go ahead and import the things we need.

In [None]:
import sys

# Confirm that we're using Python 3
assert sys.version_info.major == 3, 'Oops, not running Python 3. Use Runtime > Change runtime type'

In [None]:
import math
import os
import pprint
import tempfile  # used by transform_data() for the Beam temp_dir

import tensorflow as tf
print('TF: {}'.format(tf.__version__))

import apache_beam as beam
print('Beam: {}'.format(beam.__version__))

import tensorflow_transform as tft
import tensorflow_transform.beam as tft_beam
print('Transform: {}'.format(tft.__version__))

from tfx_bsl.public import tfxio
from tfx_bsl.coders.example_coder import RecordBatchToExamples

!wget https://storage.googleapis.com/artifacts.tfx-oss-public.appspot.com/datasets/census/adult.data
!wget https://storage.googleapis.com/artifacts.tfx-oss-public.appspot.com/datasets/census/adult.test

train = './adult.data'
test = './adult.test'

### Name our columns

We'll create some handy lists for referencing the columns in our dataset.

In [None]:
CATEGORICAL_FEATURE_KEYS = [
    'workclass',
    'education',
    'marital-status',
    'occupation',
    'relationship',
    'race',
    'sex',
    'native-country',
]
NUMERIC_FEATURE_KEYS = [
    'age',
    'capital-gain',
    'capital-loss',
    'hours-per-week',
]
OPTIONAL_NUMERIC_FEATURE_KEYS = [
    'education-num',
]
ORDERED_CSV_COLUMNS = [
    'age', 'workclass', 'fnlwgt', 'education', 'education-num',
    'marital-status', 'occupation', 'relationship', 'race', 'sex',
    'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'label'
]
LABEL_KEY = 'label'

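### Define our features and schema

Let's define a schema based on the types of the columns in our input. Among other things, this will help with importing them correctly.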

In [None]:
RAW_DATA_FEATURE_SPEC = dict(
    [(name, tf.io.FixedLenFeature([], tf.string))
     for name in CATEGORICAL_FEATURE_KEYS] +
    [(name, tf.io.FixedLenFeature([], tf.float32))
     for name in NUMERIC_FEATURE_KEYS] +
    [(name, tf.io.VarLenFeature(tf.float32))
     for name in OPTIONAL_NUMERIC_FEATURE_KEYS] +
    [(LABEL_KEY, tf.io.FixedLenFeature([], tf.string))]
)

SCHEMA = tft.tf_metadata.dataset_metadata.DatasetMetadata(
    tft.tf_metadata.schema_utils.schema_from_feature_spec(RAW_DATA_FEATURE_SPEC)).schema
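
To see what the column order and the feature spec contribute, here is a plain-Python sketch that parses one CSV line by position and keeps only the columns that have a spec entry. `toy_spec` and `parse_row` are hypothetical stand-ins, not part of TensorFlow or TFX, and the sample line is only an illustration of the dataset's format.

```python
import csv

# Column order copied from ORDERED_CSV_COLUMNS above.
ordered_columns = [
    'age', 'workclass', 'fnlwgt', 'education', 'education-num',
    'marital-status', 'occupation', 'relationship', 'race', 'sex',
    'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'label'
]

# Hypothetical stand-in for RAW_DATA_FEATURE_SPEC: column name -> parser.
# 'fnlwgt' is deliberately absent, mirroring the spec above.
float_columns = {'age', 'capital-gain', 'capital-loss', 'hours-per-week',
                 'education-num'}
toy_spec = {
    name: (float if name in float_columns else str)
    for name in ordered_columns if name != 'fnlwgt'
}


def parse_row(line):
    """Splits one CSV line by position and converts the columns in the spec."""
    values = next(csv.reader([line], skipinitialspace=True))
    row = dict(zip(ordered_columns, values))
    return {name: parse(row[name]) for name, parse in toy_spec.items()}


sample = ('39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, '
          'Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K')
parsed = parse_row(sample)
print(parsed)
```

The CSV file has no header that names its fields, so the ordered column list is what ties each position to a name, while the spec decides the type and which columns are carried forward.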

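### Setting hyperparameters and basic housekeeping

Constants and hyperparameters used for training. The bucket size includes all of the categories listed in the dataset description, plus one extra category for "?", which represents unknown.

Note: In future versions the number of instances will be computed by `tf.Transform`, in which case it can be read from the metadata. Similarly, `BUCKET_SIZES` will not be needed, because this information will be stored in the metadata for each column.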

In [None]:
testing = os.getenv("WEB_TEST_BROWSER", False)
NUM_OOV_BUCKETS = 1
if testing:
  TRAIN_NUM_EPOCHS = 1
  NUM_TRAIN_INSTANCES = 1
  TRAIN_BATCH_SIZE = 1
  NUM_TEST_INSTANCES = 1
else:
  TRAIN_NUM_EPOCHS = 16
  NUM_TRAIN_INSTANCES = 32561
  TRAIN_BATCH_SIZE = 128
  NUM_TEST_INSTANCES = 16281

# Names of temp files
TRANSFORMED_TRAIN_DATA_FILEBASE = 'train_transformed'
TRANSFORMED_TEST_DATA_FILEBASE = 'test_transformed'
EXPORTED_MODEL_DIR = 'exported_model_dir'

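## Preprocessing with `tf.Transform`

### Create a `tf.Transform` preprocessing_fn

The *preprocessing function* is the most important concept of tf.Transform. A preprocessing function is where the transformation of the dataset really happens. It accepts and returns a dictionary of tensors, where a tensor means a [`Tensor`](https://tensorflow.google.cn/versions/r1.15/api_docs/python/tf/Tensor) or [`SparseTensor`](https://tensorflow.google.cn/versions/r1.15/api_docs/python/tf/SparseTensor). There are two main groups of API calls that typically form the heart of a preprocessing function:

1. **TensorFlow Ops:** Any function that accepts and returns tensors, which usually means TensorFlow ops. These add TensorFlow operations to the graph that transforms raw data into transformed data, one feature vector at a time. They run for every example, during both training and serving.
2. **TensorFlow Transform Analyzers:** Any of the analyzers provided by tf.Transform. Analyzers also accept and return tensors, but unlike TensorFlow ops they run only once, during training, and typically make a full pass over the entire training dataset. They create [tensor constants](https://tensorflow.google.cn/versions/r1.15/api_docs/python/tf/constant), which are then added to the graph. For example, `tft.min` computes the minimum of a tensor over the training dataset. tf.Transform provides a fixed set of analyzers, but this set will be extended in future versions.

Caution: When you apply your preprocessing function to serving inferences, the constants that were created by analyzers during training do not change. If your data has trend or seasonality components, plan accordingly.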
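
The split between analyzers and per-example ops, and the caution about frozen constants, can be sketched in plain Python. `analyze_min_max` and `scale_to_0_1` below are hypothetical helpers written for this illustration (they are not the `tft` functions), and the numbers are made up.

```python
def analyze_min_max(training_values):
    """Analyzer phase: one full pass over the training data.

    Returns the constants that would be frozen into the graph.
    """
    return {'min': min(training_values), 'max': max(training_values)}


def scale_to_0_1(value, constants):
    """Per-example phase: only reads the frozen constants."""
    span = constants['max'] - constants['min']
    return (value - constants['min']) / span


# Training time: compute the constants once, then transform each example.
hours_train = [20.0, 40.0, 60.0]
constants = analyze_min_max(hours_train)
train_scaled = [scale_to_0_1(v, constants) for v in hours_train]

# Serving time: the constants are reused unchanged, so a value outside the
# range seen in training lands outside [0, 1].
serving_scaled = scale_to_0_1(80.0, constants)

print(train_scaled)    # [0.0, 0.5, 1.0]
print(serving_scaled)  # 1.5
```

If the serving distribution drifts away from the training distribution, the frozen minimum and maximum no longer describe the data, which is the situation the caution above asks you to plan for.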

In [None]:
def preprocessing_fn(inputs):
  """Preprocess input columns into transformed columns."""
  # Since we are modifying some features and leaving others unchanged, we
  # start by setting `outputs` to a copy of `inputs`.
  outputs = inputs.copy()

  # Scale numeric columns to have range [0, 1].
  for key in NUMERIC_FEATURE_KEYS:
    outputs[key] = tft.scale_to_0_1(inputs[key])

  for key in OPTIONAL_NUMERIC_FEATURE_KEYS:
    # This is a SparseTensor because it is optional. Here we fill in a default
    # value when it is missing.
    sparse = tf.sparse.SparseTensor(inputs[key].indices, inputs[key].values,
                                    [inputs[key].dense_shape[0], 1])
    dense = tf.sparse.to_dense(sp_input=sparse, default_value=0.)
    # Reshaping from a batch of vectors of size 1 to a batch of scalars.
    dense = tf.squeeze(dense, axis=1)
    outputs[key] = tft.scale_to_0_1(dense)

  # For all categorical columns except the label column, we compute a
  # vocabulary over the training data and replace each string with its
  # integer id. Values outside the vocabulary map to one of the
  # NUM_OOV_BUCKETS out-of-vocabulary buckets.
  for key in CATEGORICAL_FEATURE_KEYS:
    outputs[key] = tft.compute_and_apply_vocabulary(
        tf.strings.strip(inputs[key]),
        num_oov_buckets=NUM_OOV_BUCKETS,
        vocab_filename=key)

  # For the label column we provide the mapping from string to index.
  table_keys = ['>50K', '<=50K']
  with tf.init_scope():
    initializer = tf.lookup.KeyValueTensorInitializer(
        keys=table_keys,
        values=tf.cast(tf.range(len(table_keys)), tf.int64),
        key_dtype=tf.string,
        value_dtype=tf.int64)
    table = tf.lookup.StaticHashTable(initializer, default_value=-1)
  # Remove trailing periods for test data when the data is read with tf.data.
  label_str = tf.strings.regex_replace(inputs[LABEL_KEY], r'\.', '')
  label_str = tf.strings.strip(label_str)
  data_labels = table.lookup(label_str)
  transformed_label = tf.one_hot(
      indices=data_labels, depth=len(table_keys), on_value=1.0, off_value=0.0)
  outputs[LABEL_KEY] = tf.reshape(transformed_label, [-1, len(table_keys)])

  return outputs
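
The label handling at the end of `preprocessing_fn` is easier to follow on single values. The sketch below mirrors those steps in plain Python; `encode_label` is a hypothetical helper written for this illustration and is not part of the notebook's pipeline.

```python
import re

# Same order as `table_keys` in preprocessing_fn above.
TABLE_KEYS = ['>50K', '<=50K']
LABEL_TO_INDEX = {key: i for i, key in enumerate(TABLE_KEYS)}


def encode_label(raw_label):
    """Removes periods, strips whitespace, looks up the index, one-hot encodes."""
    cleaned = re.sub(r'\.', '', raw_label).strip()
    # Unknown strings map to -1, like the lookup table's default_value.
    index = LABEL_TO_INDEX.get(cleaned, -1)
    # An index of -1 matches no position, so the row is all zeros.
    return [1.0 if i == index else 0.0 for i in range(len(TABLE_KEYS))]


print(encode_label('>50K'))     # [1.0, 0.0]
print(encode_label(' <=50K.'))  # [0.0, 1.0]
print(encode_label('unknown'))  # [0.0, 0.0]
```

The second call shows why the period removal matters: a label written as `<=50K.` would otherwise miss the lookup table and be encoded as all zeros.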

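### Transform the data

Now we're ready to start transforming our data in an Apache Beam pipeline.

1. Read in the data using the CSV reader
2. Transform it using a preprocessing pipeline that scales numeric data and converts categorical data from strings to int64 value indices, by creating a vocabulary for each category
3. Write out the result as a `TFRecord` of `Example` protos, which we will use for training a model later

<aside class="key-term"><b>Key Term:</b> <a target="_blank" href="https://beam.apache.org/">Apache Beam</a> uses a <a target="_blank" href="https://beam.apache.org/documentation/programming-guide/#applying-transforms">special syntax to define and invoke transforms</a>. For example, in this line:</aside>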

<code><blockquote>result = pass_this | 'name this step' &gt;&gt; to_this_call</blockquote></code>

The method <code>to_this_call</code> is being invoked and passed the object called <code>pass_this</code>, and <a target="_blank" href="https://stackoverflow.com/questions/50519662/what-does-the-redirection-mean-in-apache-beam-python">this operation will be referred to as <code>name this step</code> in a stack trace</a>. The result of the call to <code>to_this_call</code> is returned in <code>result</code>. You will often see stages of a pipeline chained together like this:

<code><blockquote>result = apache_beam.Pipeline() | 'first step' &gt;&gt; do_this_first() | 'second step' &gt;&gt; do_this_last()</blockquote></code>

and since that started with a new pipeline, you can continue like this:

<code><blockquote>next_result = result | 'doing more stuff' &gt;&gt; another_function()</blockquote></code>
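
If the `|` and `>>` syntax looks unfamiliar, it may help to see that it is ordinary Python operator overloading. The toy classes below are a sketch written for this explanation, not Beam's implementation: `Step` and `Collection` are hypothetical names, and unlike Beam (which builds a deferred pipeline) this toy applies each step immediately.

```python
class Step:
    """Toy stand-in for a Beam transform: wraps a function over a list."""

    def __init__(self, fn, label=None):
        self.fn = fn
        self.label = label or fn.__name__

    def __rrshift__(self, label):
        # Handles `'some label' >> step`. str defines no `>>` operator, so
        # Python falls back to the right operand's __rrshift__.
        return Step(self.fn, label)


class Collection:
    """Toy stand-in for a PCollection: holds items and the labels applied."""

    def __init__(self, items, applied=()):
        self.items = list(items)
        self.applied = tuple(applied)

    def __or__(self, step):
        # Handles `collection | step`.
        return Collection(step.fn(self.items), self.applied + (step.label,))


# `>>` binds more tightly than `|`, so each label attaches to its step
# before the step is applied to the collection on the left.
result = (
    Collection([1, 2, 3])
    | 'double' >> Step(lambda xs: [x * 2 for x in xs])
    | 'keep big' >> Step(lambda xs: [x for x in xs if x > 2]))

print(result.items)    # [4, 6]
print(result.applied)  # ('double', 'keep big')
```

Because each `|` returns a new collection, the result of one line can be piped into further steps later, which is the pattern used in `transform_data`.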

In [None]:
def transform_data(train_data_file, test_data_file, working_dir):
  """Transform the data and write out as a TFRecord of Example protos.

  Read in the data using the CSV reader, and transform it using a
  preprocessing pipeline that scales numeric data and converts categorical data
  from strings to int64 values indices, by creating a vocabulary for each
  category.

  Args:
    train_data_file: File containing training data
    test_data_file: File containing test data
    working_dir: Directory to write transformed data and metadata to
  """

  # The "with" block will create a pipeline, and run that pipeline at the exit
  # of the block.
  with beam.Pipeline() as pipeline:
    with tft_beam.Context(temp_dir=tempfile.mkdtemp()):
      # Create a TFXIO to read the census data with the schema. To do this we
      # need to list all columns in order since the schema doesn't specify the
      # order of columns in the csv.
      # We first read CSV files and use BeamRecordCsvTFXIO whose .BeamSource()
      # accepts a PCollection[bytes] because we need to patch the records first
      # (see "FixCommasTrainData" below). Otherwise, tfxio.CsvTFXIO can be used
      # to both read the CSV files and parse them to TFT inputs:
      # csv_tfxio = tfxio.CsvTFXIO(...)
      # raw_data = (pipeline | 'ToRecordBatches' >> csv_tfxio.BeamSource())
      csv_tfxio = tfxio.BeamRecordCsvTFXIO(
          physical_format='text',
          column_names=ORDERED_CSV_COLUMNS,
          schema=SCHEMA)

      # Read in raw data and convert using CSV TFXIO.  Note that we apply
      # some Beam transformations here, which will not be encoded in the TF
      # graph since we don't do them from within tf.Transform's methods
      # (AnalyzeDataset, TransformDataset etc.).  These transformations are just
      # to get data into a format that the CSV TFXIO can read, in particular
      # removing spaces after commas.
      raw_data = (
          pipeline
          | 'ReadTrainData' >> beam.io.ReadFromText(
              train_data_file, coder=beam.coders.BytesCoder())
          | 'FixCommasTrainData' >> beam.Map(
              lambda line: line.replace(b', ', b','))
          | 'DecodeTrainData' >> csv_tfxio.BeamSource())

      # Combine data and schema into a dataset tuple.  Note that we already used
      # the schema to read the CSV data, but we also need it to interpret
      # raw_data.
      raw_dataset = (raw_data, csv_tfxio.TensorAdapterConfig())

      # The TFXIO output format is chosen for improved performance.
      transformed_dataset, transform_fn = (
          raw_dataset | tft_beam.AnalyzeAndTransformDataset(
              preprocessing_fn, output_record_batches=True))

      # Transformed metadata is not necessary for encoding.
      transformed_data, _ = transformed_dataset

      # Extract transformed RecordBatches, encode and write them to the given
      # directory.
      _ = (
          transformed_data
          | 'EncodeTrainData' >>
          beam.FlatMapTuple(lambda batch, _: RecordBatchToExamples(batch))
          | 'WriteTrainData' >> beam.io.WriteToTFRecord(
              os.path.join(working_dir, TRANSFORMED_TRAIN_DATA_FILEBASE)))

      # Now apply transform function to test data.  In this case we remove the
      # trailing period at the end of each line, and also ignore the header line
      # that is present in the test data file.
      raw_test_data = (
          pipeline
          | 'ReadTestData' >> beam.io.ReadFromText(
              test_data_file, skip_header_lines=1,
              coder=beam.coders.BytesCoder())
          | 'FixCommasTestData' >> beam.Map(
              lambda line: line.replace(b', ', b','))
          | 'RemoveTrailingPeriodsTestData' >> beam.Map(lambda line: line[:-1])
          | 'DecodeTestData' >> csv_tfxio.BeamSource())

      raw_test_dataset = (raw_test_data, csv_tfxio.TensorAdapterConfig())

      # The TFXIO output format is chosen for improved performance.
      transformed_test_dataset = (
          (raw_test_dataset, transform_fn)
          | tft_beam.TransformDataset(output_record_batches=True))

      # Transformed metadata is not necessary for encoding.
      transformed_test_data, _ = transformed_test_dataset

      # Extract transformed RecordBatches, encode and write them to the given
      # directory.
      _ = (
          transformed_test_data
          | 'EncodeTestData' >>
          beam.FlatMapTuple(lambda batch, _: RecordBatchToExamples(batch))
          | 'WriteTestData' >> beam.io.WriteToTFRecord(
              os.path.join(working_dir, TRANSFORMED_TEST_DATA_FILEBASE)))

      # Will write a SavedModel and metadata to working_dir, which can then
      # be read by the tft.TFTransformOutput class.
      _ = (
          transform_fn
          | 'WriteTransformFn' >> tft_beam.WriteTransformFn(working_dir))
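
The two line-patching steps used in `transform_data` (`FixCommas...` and `RemoveTrailingPeriodsTestData`) are plain functions over `bytes`, so they can be tried outside Beam. The sketch below assumes a sample line in the test file's format; `remove_trailing_period_if_present` is a hypothetical, more defensive variant and is not what the pipeline above uses.

```python
def fix_commas(line: bytes) -> bytes:
    """Same logic as the 'FixCommas...' steps: drop the space after commas."""
    return line.replace(b', ', b',')


def remove_trailing_period(line: bytes) -> bytes:
    """Same logic as 'RemoveTrailingPeriodsTestData': drop the last byte."""
    return line[:-1]


def remove_trailing_period_if_present(line: bytes) -> bytes:
    """Hypothetical variant that only drops the last byte if it is a period."""
    return line[:-1] if line.endswith(b'.') else line


# Illustrative line in the format of the test file (note the final period).
test_line = (b'25, Private, 226802, 11th, 7, Never-married, '
             b'Machine-op-inspct, Own-child, Black, Male, 0, 0, 40, '
             b'United-States, <=50K.')

cleaned = remove_trailing_period(fix_commas(test_line))
print(cleaned)
```

Note that `line[:-1]` removes the last byte whatever it is, so it relies on every test line ending with a period; the guarded variant leaves other lines untouched.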

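## Using our preprocessed data to train a model using tf.keras

To show how `tf.Transform` enables us to use the same code for both training and serving, and thus prevent skew, we're going to train a model. To train our model and prepare the trained model for production, we need to create input functions. The main difference between the training input function and the serving input function is that training data contains the labels, and production data does not. The arguments and return values are also somewhat different.

Note: This section uses tf.keras for training. If you are looking for an example that uses tf.estimator for training, see the next section.

### Create an input function for training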

In [None]:
def _make_training_input_fn(tf_transform_output, transformed_examples,
                            batch_size):
  """An input function reading from transformed data, converting to model input.

  Args:
    tf_transform_output: Wrapper around output of tf.Transform.
    transformed_examples: Base filename of examples.
    batch_size: Batch size.

  Returns:
    An input function that returns a batched `tf.data.Dataset` of
    (features, label) pairs for training or eval.
  """
  def input_fn():
    return tf.data.experimental.make_batched_features_dataset(
        file_pattern=transformed_examples,
        batch_size=batch_size,
        features=tf_transform_output.transformed_feature_spec(),
        reader=tf.data.TFRecordDataset,
        label_key=LABEL_KEY,
        shuffle=True).prefetch(tf.data.experimental.AUTOTUNE)

  return input_fn

### Create an input function for serving

Let's create an input function that we could use in production, and prepare our trained model for serving.

In [None]:
def _make_serving_input_fn(tf_transform_output, raw_examples, batch_size):
  """An input function reading from raw data, converting to model input.

  Args:
    tf_transform_output: Wrapper around output of tf.Transform.
    raw_examples: Base filename of examples.
    batch_size: Batch size.

  Returns:
    An input function that returns a `tf.data.Dataset` of
    (transformed features, label) pairs read from raw CSV data.
  """

  def get_ordered_raw_data_dtypes():
    result = []
    for col in ORDERED_CSV_COLUMNS:
      if col not in RAW_DATA_FEATURE_SPEC:
        result.append(0.0)
        continue
      spec = RAW_DATA_FEATURE_SPEC[col]
      if isinstance(spec, tf.io.FixedLenFeature):
        result.append(spec.dtype)
      else:
        result.append(0.0)
    return result

  def input_fn():
    dataset = tf.data.experimental.make_csv_dataset(
        file_pattern=raw_examples,
        batch_size=batch_size,
        column_names=ORDERED_CSV_COLUMNS,
        column_defaults=get_ordered_raw_data_dtypes(),
        prefetch_buffer_size=0,
        ignore_errors=True)

    tft_layer = tf_transform_output.transform_features_layer()

    def transform_dataset(data):
      raw_features = {}
      for key, val in data.items():
        if key not in RAW_DATA_FEATURE_SPEC:
          continue
        if isinstance(RAW_DATA_FEATURE_SPEC[key], tf.io.VarLenFeature):
          raw_features[key] = tf.RaggedTensor.from_tensor(
              tf.expand_dims(val, -1)).to_sparse()
          continue
        raw_features[key] = val
      transformed_features = tft_layer(raw_features)
      data_labels = transformed_features.pop(LABEL_KEY)
      return (transformed_features, data_labels)

    return dataset.map(
        transform_dataset,
        num_parallel_calls=tf.data.experimental.AUTOTUNE).prefetch(
            tf.data.experimental.AUTOTUNE)

  return input_fn

## Train, Evaluate, and Export our model

In [None]:
def export_serving_model(tf_transform_output, model, output_dir):
  """Exports a keras model for serving.

  Args:
    tf_transform_output: Wrapper around output of tf.Transform.
    model: A keras model to export for serving.
    output_dir: A directory where the model will be exported to.
  """
  # The layer has to be saved to the model for keras tracking purposes.
  model.tft_layer = tf_transform_output.transform_features_layer()

  @tf.function
  def serve_tf_examples_fn(serialized_tf_examples):
    """Serving tf.function model wrapper."""
    feature_spec = RAW_DATA_FEATURE_SPEC.copy()
    feature_spec.pop(LABEL_KEY)
    parsed_features = tf.io.parse_example(serialized_tf_examples, feature_spec)
    transformed_features = model.tft_layer(parsed_features)
    outputs = model(transformed_features)
    classes_names = tf.constant([['0', '1']])
    classes = tf.tile(classes_names, [tf.shape(outputs)[0], 1])
    return {'classes': classes, 'scores': outputs}

  concrete_serving_fn = serve_tf_examples_fn.get_concrete_function(
      tf.TensorSpec(shape=[None], dtype=tf.string, name='inputs'))
  signatures = {'serving_default': concrete_serving_fn}

  # This is required in order to make this model servable with model_server.
  versioned_output_dir = os.path.join(output_dir, '1')
  model.save(versioned_output_dir, save_format='tf', signatures=signatures)  

In [None]:
def train_and_evaluate(working_dir,
                       num_train_instances=NUM_TRAIN_INSTANCES,
                       num_test_instances=NUM_TEST_INSTANCES):
  """Train the model on training data and evaluate on test data.

  Args:
    working_dir: The location of the Transform output.
    num_train_instances: Number of instances in train set
    num_test_instances: Number of instances in test set

  Returns:
    A dict mapping metric names to the values returned by the Keras model's
    `evaluate` method.
  """
  train_data_path_pattern = os.path.join(
      working_dir, TRANSFORMED_TRAIN_DATA_FILEBASE + '*')
  eval_data_path_pattern = os.path.join(
      working_dir, TRANSFORMED_TEST_DATA_FILEBASE + '*')
  tf_transform_output = tft.TFTransformOutput(working_dir)

  train_input_fn = _make_training_input_fn(
      tf_transform_output, train_data_path_pattern, batch_size=TRAIN_BATCH_SIZE)
  train_dataset = train_input_fn()

  # Evaluate model on test dataset.
  eval_input_fn = _make_training_input_fn(
      tf_transform_output, eval_data_path_pattern, batch_size=TRAIN_BATCH_SIZE)
  validation_dataset = eval_input_fn()

  feature_spec = tf_transform_output.transformed_feature_spec().copy()
  feature_spec.pop(LABEL_KEY)

  inputs = {}
  for key, spec in feature_spec.items():
    if isinstance(spec, tf.io.VarLenFeature):
      inputs[key] = tf.keras.layers.Input(
          shape=[None], name=key, dtype=spec.dtype, sparse=True)
    elif isinstance(spec, tf.io.FixedLenFeature):
      inputs[key] = tf.keras.layers.Input(
          shape=spec.shape, name=key, dtype=spec.dtype)
    else:
      raise ValueError('Spec type is not supported: ', key, spec)

  encoded_inputs = {}
  for key in inputs:
    feature = tf.expand_dims(inputs[key], -1)
    if key in CATEGORICAL_FEATURE_KEYS:
      num_buckets = tf_transform_output.num_buckets_for_transformed_feature(key)
      encoding_layer = (
          tf.keras.layers.experimental.preprocessing.CategoryEncoding(
              max_tokens=num_buckets, output_mode='binary', sparse=False))
      encoded_inputs[key] = encoding_layer(feature)
    else:
      encoded_inputs[key] = feature

  stacked_inputs = tf.concat(tf.nest.flatten(encoded_inputs), axis=1)
  output = tf.keras.layers.Dense(100, activation='relu')(stacked_inputs)
  output = tf.keras.layers.Dense(70, activation='relu')(output)
  output = tf.keras.layers.Dense(50, activation='relu')(output)
  output = tf.keras.layers.Dense(20, activation='relu')(output)
  output = tf.keras.layers.Dense(2, activation='sigmoid')(output)
  model = tf.keras.Model(inputs=inputs, outputs=output)

  model.compile(optimizer='adam',
                loss='binary_crossentropy',
                metrics=['accuracy'])
  model.summary()  # prints the summary itself and returns None

  model.fit(train_dataset, validation_data=validation_dataset,
            epochs=TRAIN_NUM_EPOCHS,
            steps_per_epoch=math.ceil(num_train_instances / TRAIN_BATCH_SIZE),
            validation_steps=math.ceil(num_test_instances / TRAIN_BATCH_SIZE))

  # Export the model.
  exported_model_dir = os.path.join(working_dir, EXPORTED_MODEL_DIR)
  export_serving_model(tf_transform_output, model, exported_model_dir)

  metrics_values = model.evaluate(validation_dataset, steps=num_test_instances)
  metrics_labels = model.metrics_names
  return {l: v for l, v in zip(metrics_labels, metrics_values)}
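
The step counts passed to `model.fit` and `model.evaluate` above follow from simple arithmetic on the instance counts and the batch size. The sketch below works them out with the non-testing constants from this notebook; the "passes over the test set" figure assumes the dataset repeats indefinitely, which is our reading of the default behavior of `make_batched_features_dataset` and not something this notebook states.

```python
import math

# Values copied from the hyperparameter cell (non-testing branch).
num_train_instances = 32561
num_test_instances = 16281
batch_size = 128

# Rounding up means the final, partially filled batch still counts as a step.
steps_per_epoch = math.ceil(num_train_instances / batch_size)
validation_steps = math.ceil(num_test_instances / batch_size)

# model.evaluate(..., steps=num_test_instances) draws one batch per step.
# Assuming the dataset repeats, that corresponds to this many passes:
evaluate_examples = num_test_instances * batch_size
evaluate_passes = evaluate_examples / num_test_instances

print(steps_per_epoch)    # 255
print(validation_steps)   # 128
print(evaluate_passes)    # 128.0
```

If a single pass over the test set is what you want from the final evaluation, the `validation_steps` value is the natural step count to use instead.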

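## Put it all together

We've created everything we need to preprocess our census data, train a model, and prepare it for serving. So far we've just been getting things ready. It's time to start running!

Note: Scroll the output from this cell to see the whole process. The results will be at the bottom.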

In [None]:
import tempfile
temp = os.path.join(tempfile.gettempdir(), 'keras')

transform_data(train, test, temp)
results = train_and_evaluate(temp)
pprint.pprint(results)

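## (Optional) Using our preprocessed data to train a model using tf.estimator

If you would rather use an Estimator model instead of a Keras model, the code in this section shows how to do that.

### Create an input function for training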

In [None]:
def _make_training_input_fn(tf_transform_output, transformed_examples,
                            batch_size):
  """Creates an input function reading from transformed data.

  Args:
    tf_transform_output: Wrapper around output of tf.Transform.
    transformed_examples: Base filename of examples.
    batch_size: Batch size.

  Returns:
    The input function for training or eval.
  """
  def input_fn():
    """Input function for training and eval."""
    dataset = tf.data.experimental.make_batched_features_dataset(
        file_pattern=transformed_examples,
        batch_size=batch_size,
        features=tf_transform_output.transformed_feature_spec(),
        reader=tf.data.TFRecordDataset,
        shuffle=True)

    transformed_features = tf.compat.v1.data.make_one_shot_iterator(
        dataset).get_next()

    # Extract features and label from the transformed tensors.
    transformed_labels = tf.where(
        tf.equal(transformed_features.pop(LABEL_KEY), 1))

    return transformed_features, transformed_labels[:,1]

  return input_fn
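
The last lines of `input_fn` turn the one-hot labels back into class indices with `tf.where(tf.equal(..., 1))` followed by `[:, 1]`. The plain-Python sketch below mimics that logic on nested lists; `where_equal_one` is a hypothetical helper written for this illustration.

```python
def where_equal_one(one_hot_rows):
    """Returns [row, column] coordinates of every entry equal to 1."""
    return [[r, c]
            for r, row in enumerate(one_hot_rows)
            for c, value in enumerate(row)
            if value == 1]


# Three one-hot labels: class 1, class 0, class 1.
labels = [[0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
coords = where_equal_one(labels)
# Keeping only the column coordinate is the `[:, 1]` slice in input_fn.
class_ids = [c for _, c in coords]

print(coords)     # [[0, 1], [1, 0], [2, 1]]
print(class_ids)  # [1, 0, 1]

# A row of all zeros (an unrecognized label) yields no coordinate, so the
# result would be shorter than the batch.
with_unknown = [[0.0, 1.0], [0.0, 0.0], [1.0, 0.0]]
ids_with_unknown = [c for _, c in where_equal_one(with_unknown)]
print(ids_with_unknown)  # [1, 0]
```

This works because each valid label row contains exactly one 1, so there is one coordinate per example and its column is the class index.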

### Create an input function for serving

Let's create an input function that we could use in production, and prepare our trained model for serving.

In [None]:
def _make_serving_input_fn(tf_transform_output):
  """Creates an input function reading from raw data.

  Args:
    tf_transform_output: Wrapper around output of tf.Transform.

  Returns:
    The serving input function.
  """
  raw_feature_spec = RAW_DATA_FEATURE_SPEC.copy()
  # Remove label since it is not available during serving.
  raw_feature_spec.pop(LABEL_KEY)

  def serving_input_fn():
    """Input function for serving."""
    # Get raw features by generating the basic serving input_fn and calling it.
    # Here we generate an input_fn that expects a parsed Example proto to be fed
    # to the model at serving time.  See also
    # tf.estimator.export.build_raw_serving_input_receiver_fn.
    raw_input_fn = tf.estimator.export.build_parsing_serving_input_receiver_fn(
        raw_feature_spec, default_batch_size=None)
    serving_input_receiver = raw_input_fn()

    # Apply the transform function that was used to generate the materialized
    # data.
    raw_features = serving_input_receiver.features
    transformed_features = tf_transform_output.transform_raw_features(
        raw_features)

    return tf.estimator.export.ServingInputReceiver(
        transformed_features, serving_input_receiver.receiver_tensors)

  return serving_input_fn

### Wrap our input data in FeatureColumns

Our model will expect our data in TensorFlow FeatureColumns.

In [None]:
def get_feature_columns(tf_transform_output):
  """Returns the FeatureColumns for the model.

  Args:
    tf_transform_output: A `TFTransformOutput` object.

  Returns:
    A list of FeatureColumns.
  """
  # Wrap scalars as real valued columns.
  real_valued_columns = [tf.feature_column.numeric_column(key, shape=())
                         for key in NUMERIC_FEATURE_KEYS]

  # Wrap categorical columns.
  one_hot_columns = [
      tf.feature_column.indicator_column(
          tf.feature_column.categorical_column_with_identity(
              key=key,
              num_buckets=(NUM_OOV_BUCKETS +
                  tf_transform_output.vocabulary_size_by_name(
                      vocab_filename=key))))
      for key in CATEGORICAL_FEATURE_KEYS]

  return real_valued_columns + one_hot_columns
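
The `num_buckets` expression in `get_feature_columns` adds the out-of-vocabulary buckets to the vocabulary size, and the indicator column then one-hot encodes the integer id. Here is a plain-Python sketch of that bookkeeping, assuming a single OOV bucket as in this notebook; the vocabulary shown and the helpers `to_id` and `indicator` are hypothetical.

```python
NUM_OOV_BUCKETS = 1

# Hypothetical vocabulary for one categorical column.
vocab = ['Private', 'Self-emp-not-inc', 'Local-gov']
num_buckets = NUM_OOV_BUCKETS + len(vocab)


def to_id(value):
    """In-vocabulary values get their position; others share the OOV bucket.

    With more than one OOV bucket a hash would pick the bucket; this sketch
    only covers the single-bucket case.
    """
    return vocab.index(value) if value in vocab else len(vocab)


def indicator(feature_id):
    """One-hot vector of length num_buckets for an integer id."""
    return [1.0 if i == feature_id else 0.0 for i in range(num_buckets)]


print(num_buckets)                        # 4
print(indicator(to_id('Local-gov')))      # [0.0, 0.0, 1.0, 0.0]
print(indicator(to_id('Never-worked')))   # [0.0, 0.0, 0.0, 1.0]
```

Leaving out the OOV term would make the identity column too narrow for ids that fall in the out-of-vocabulary bucket.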

## Train, Evaluate, and Export our model

In [None]:
def train_and_evaluate(working_dir, num_train_instances=NUM_TRAIN_INSTANCES,
                       num_test_instances=NUM_TEST_INSTANCES):
  """Train the model on training data and evaluate on test data.

  Args:
    working_dir: Directory to read transformed data and metadata from and to
        write exported model to.
    num_train_instances: Number of instances in train set
    num_test_instances: Number of instances in test set

  Returns:
    The results from the estimator's 'evaluate' method
  """
  tf_transform_output = tft.TFTransformOutput(working_dir)

  run_config = tf.estimator.RunConfig()

  estimator = tf.estimator.LinearClassifier(
      feature_columns=get_feature_columns(tf_transform_output),
      config=run_config,
      loss_reduction=tf.losses.Reduction.SUM)

  # Fit the model using the default optimizer.
  train_input_fn = _make_training_input_fn(
      tf_transform_output,
      os.path.join(working_dir, TRANSFORMED_TRAIN_DATA_FILEBASE + '*'),
      batch_size=TRAIN_BATCH_SIZE)
  estimator.train(
      input_fn=train_input_fn,
      max_steps=TRAIN_NUM_EPOCHS * num_train_instances / TRAIN_BATCH_SIZE)

  # Evaluate model on test dataset.
  eval_input_fn = _make_training_input_fn(
      tf_transform_output,
      os.path.join(working_dir, TRANSFORMED_TEST_DATA_FILEBASE + '*'),
      batch_size=1)

  # Export the model.
  serving_input_fn = _make_serving_input_fn(tf_transform_output)
  exported_model_dir = os.path.join(working_dir, EXPORTED_MODEL_DIR)
  estimator.export_saved_model(exported_model_dir, serving_input_fn)

  return estimator.evaluate(input_fn=eval_input_fn, steps=num_test_instances)

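## Put it all together

We've created everything we need to preprocess our census data, train a model, and prepare it for serving. So far we've just been getting things ready. It's time to start running!

Note: Scroll the output from this cell to see the whole process. The results will be at the bottom.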

In [None]:
import tempfile
temp = os.path.join(tempfile.gettempdir(), 'estimator')

transform_data(train, test, temp)
results = train_and_evaluate(temp)
pprint.pprint(results)

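## What we did

In this example we used `tf.Transform` to preprocess a dataset of census data, and trained a model with the cleaned and transformed data. We also created an input function that we could use when we deploy our trained model in a production environment to perform inference. By using the same code for both training and inference, we avoid any issues with data skew. Along the way we learned about creating an Apache Beam transform to perform the transformation that we needed for cleaning the data. We also saw how to use this transformed data to train a model using either `tf.keras` or `tf.estimator`. This is just a small piece of what TensorFlow Transform can do! We encourage you to dive into `tf.Transform` and discover what it can do for you.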