##### Copyright 2020 The TensorFlow Authors.

In [None]:
#@title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Working with preprocessing layers

<table class="tfo-notebook-buttons" align="left">
  <td>     <a target="_blank" href="https://www.tensorflow.org/guide/keras/preprocessing_layers"><img src="https://www.tensorflow.org/images/tf_logo_32px.png">View on TensorFlow.org</a>   </td>
  <td><a target="_blank" href="https://colab.research.google.com/github/tensorflow/docs/blob/snapshot-keras/site/en/guide/keras/preprocessing_layers.ipynb"><img src="https://www.tensorflow.org/images/colab_logo_32px.png">Run in Google Colab</a></td>
  <td>     <a target="_blank" href="https://github.com/keras-team/keras-io/blob/master/guides/preprocessing_layers.py"><img src="https://www.tensorflow.org/images/GitHub-Mark-32px.png">View source on GitHub</a>   </td>
  <td>     <a href="https://storage.googleapis.com/tensorflow_docs/docs/site/en/guide/keras/preprocessing_layers.ipynb"><img src="https://www.tensorflow.org/images/download_logo_32px.png">Download notebook</a>   </td>
</table>

## Keras preprocessing layers

The Keras preprocessing layers API allows developers to build Keras-native input
processing pipelines. These input processing pipelines can be used as independent
preprocessing code in non-Keras workflows, combined directly with Keras models, and
exported as part of a Keras SavedModel.

With Keras preprocessing layers, you can build and export models that are truly
end-to-end: models that accept raw images or raw structured data as input; models that
handle feature normalization or feature value indexing on their own.

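## Available preprocessing layers

### Core preprocessing layers

- `TextVectorization` layer: turns raw strings into an encoded representation that can be read by an `Embedding` layer or `Dense` layer.
- `Normalization` layer: performs feature-wise normalization of input features.

### Structured data preprocessing layers

These layers are for structured data encoding and feature engineering.

- `CategoryEncoding` layer: turns integer categorical features into one-hot, multi-hot, or count dense representations.
- `Hashing` layer: performs categorical feature hashing, also known as the "hashing trick".
- `Discretization` layer: turns continuous numerical features into integer categorical features.
- `StringLookup` layer: turns string categorical values into an encoded representation that can be read by an `Embedding` layer or `Dense` layer.
- `IntegerLookup` layer: turns integer categorical values into an encoded representation that can be read by an `Embedding` layer or `Dense` layer.
- `CategoryCrossing` layer: combines categorical features into co-occurrence features. For example, if you have feature values "a" and "b", it can provide the combination feature "a and b are present at the same time".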
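
To make the `Discretization` idea concrete, here is a minimal plain-Python sketch of bucketing continuous values into integer categories. It is not the layer's implementation: the function name `discretize` and the bucket edges are hypothetical, and the rule that a value equal to a boundary falls into the upper bucket is an assumption about how the layer behaves.

```python
from bisect import bisect_right


def discretize(values, bin_boundaries):
    """Map each continuous value to the index of the bucket it falls in."""
    # bisect_right counts how many boundaries are <= value, so a value equal
    # to a boundary lands in the bucket above it (assumed tie rule).
    return [bisect_right(bin_boundaries, v) for v in values]


ages = [3.0, 17.5, 18.0, 42.0, 70.0]
boundaries = [18.0, 35.0, 65.0]  # hypothetical bucket edges
bucket_ids = discretize(ages, boundaries)
print(bucket_ids)  # [0, 0, 1, 2, 3]
```

Three boundaries produce four buckets, so the resulting integers can be fed to a one-hot encoder or an `Embedding` layer like any other integer categorical feature.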

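### Image preprocessing layers

These layers are for standardizing the inputs of an image model.

- `Resizing` layer: resizes a batch of images to a target size.
- `Rescaling` layer: rescales and offsets the values of a batch of images (for example, going from inputs in the `[0, 255]` range to inputs in the `[0, 1]` range).
- `CenterCrop` layer: returns a center crop of a batch of images.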
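
The arithmetic behind `Rescaling` is a single multiply-and-add, `output = input * scale + offset`. The sketch below illustrates it in plain Python on a short list of pixel values; the helper name `rescale` is hypothetical and only mirrors the layer's `scale` and `offset` arguments.

```python
def rescale(pixels, scale, offset=0.0):
    """Apply output = input * scale + offset to every value."""
    return [p * scale + offset for p in pixels]


pixels = [0, 51, 255]

# Map the [0, 255] range to [0, 1].
unit_range = rescale(pixels, scale=1.0 / 255)

# Map the [0, 255] range to [-1, 1].
signed_range = rescale(pixels, scale=1.0 / 127.5, offset=-1.0)

print(unit_range)
print(signed_range)
```

Which target range to use depends on what the downstream model expects, which is one reason to keep the rescaling step inside the model.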

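### Image data augmentation layers

These layers apply random augmentation transforms to a batch of images. They are only active during training.

- `RandomCrop` layer
- `RandomFlip` layer
- `RandomTranslation` layer
- `RandomRotation` layer
- `RandomZoom` layer
- `RandomHeight` layer
- `RandomWidth` layer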

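## The `adapt()` method

Some preprocessing layers have an internal state that must be computed based on a sample of the training data. The stateful preprocessing layers are:

- `TextVectorization`: holds a mapping between string tokens and integer indices.
- `StringLookup` and `IntegerLookup`: hold a mapping between input values and output indices.
- `Normalization`: holds the mean and standard deviation of the features.
- `Discretization`: holds information about value bucket boundaries.

Crucially, these layers are **non-trainable**. Their state is not set during training; it must be set **before training**, a step called "adaptation".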
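
As a rough illustration of what "state" means for a normalization step, the plain-Python sketch below computes per-feature statistics once and then reuses them unchanged. The function names are hypothetical, and the sketch leaves out details a real layer would need, such as guarding against zero variance.

```python
import math


def adapt_normalizer(rows):
    """Compute per-feature mean and variance from sample rows (the state)."""
    n = len(rows)
    columns = list(zip(*rows))
    means = [sum(col) / n for col in columns]
    variances = [
        sum((v - m) ** 2 for v in col) / n for col, m in zip(columns, means)
    ]
    return means, variances


def normalize(rows, means, variances):
    """Apply (x - mean) / sqrt(variance) per feature, using fixed state."""
    return [
        [(v - m) / math.sqrt(var) for v, m, var in zip(row, means, variances)]
        for row in rows
    ]


data = [[0.1, 0.2, 0.3], [0.8, 0.9, 1.0], [1.5, 1.6, 1.7]]
means, variances = adapt_normalizer(data)  # done once, before training
normalized = normalize(data, means, variances)
print(means)
print(normalized)
```

The statistics are never updated by gradient descent, which is the sense in which such a layer is non-trainable.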

You set the state of a preprocessing layer by exposing it to training data, via the `adapt()` method:

In [None]:
import numpy as np
import tensorflow as tf
from tensorflow.keras.layers.experimental import preprocessing

data = np.array([[0.1, 0.2, 0.3], [0.8, 0.9, 1.0], [1.5, 1.6, 1.7],])
layer = preprocessing.Normalization()
layer.adapt(data)
normalized_data = layer(data)

print("Features mean: %.2f" % (normalized_data.numpy().mean()))
print("Features std: %.2f" % (normalized_data.numpy().std()))

The `adapt()` method takes either a NumPy array or a `tf.data.Dataset` object. In the case of `StringLookup` and `TextVectorization`, you can also pass a list of strings:

In [None]:
data = [
    "ξεῖν᾽, ἦ τοι μὲν ὄνειροι ἀμήχανοι ἀκριτόμυθοι",
    "γίγνοντ᾽, οὐδέ τι πάντα τελείεται ἀνθρώποισι.",
    "δοιαὶ γάρ τε πύλαι ἀμενηνῶν εἰσὶν ὀνείρων:",
    "αἱ μὲν γὰρ κεράεσσι τετεύχαται, αἱ δ᾽ ἐλέφαντι:",
    "τῶν οἳ μέν κ᾽ ἔλθωσι διὰ πριστοῦ ἐλέφαντος,",
    "οἵ ῥ᾽ ἐλεφαίρονται, ἔπε᾽ ἀκράαντα φέροντες:",
    "οἱ δὲ διὰ ξεστῶν κεράων ἔλθωσι θύραζε,",
    "οἵ ῥ᾽ ἔτυμα κραίνουσι, βροτῶν ὅτε κέν τις ἴδηται.",
]
layer = preprocessing.TextVectorization()
layer.adapt(data)
vectorized_text = layer(data)
print(vectorized_text)

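In addition, adaptable layers always expose an option to directly set state via constructor arguments or weight assignment. If the intended state values are known at layer construction time, or are calculated outside of the `adapt()` call, they can be set without relying on the layer's internal computation. For instance, if external vocabulary files for the `TextVectorization`, `StringLookup`, or `IntegerLookup` layers already exist, those can be loaded directly into the lookup tables by passing a path to the vocabulary file in the layer's constructor arguments.

Here's an example where we instantiate a `StringLookup` layer with a precomputed vocabulary: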

In [None]:
vocab = ["a", "b", "c", "d"]
data = tf.constant([["a", "c", "d"], ["d", "z", "b"]])
layer = preprocessing.StringLookup(vocabulary=vocab)
vectorized_data = layer(data)
print(vectorized_data)

## Preprocessing data before the model or inside the model

There are two ways you could be using preprocessing layers:

**Option 1:** Make them part of the model, like this:

```python
inputs = keras.Input(shape=input_shape)
x = preprocessing_layer(inputs)
outputs = rest_of_the_model(x)
model = keras.Model(inputs, outputs)
```

With this option, preprocessing will happen on device, synchronously with the rest of the
model execution, meaning that it will benefit from GPU acceleration.
If you're training on GPU, this is the best option for the `Normalization` layer, and for
all image preprocessing and data augmentation layers.

**Option 2:** Apply them to your `tf.data.Dataset`, so as to obtain a dataset that yields
batches of preprocessed data, like this:

```python
dataset = dataset.map(
  lambda x, y: (preprocessing_layer(x), y))
```

With this option, your preprocessing will happen on CPU, asynchronously, and will be
buffered before going into the model.

This is the best option for `TextVectorization`, and all structured data preprocessing
layers. It can also be a good option if you're training on CPU
and you use image preprocessing layers.

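## Benefits of doing preprocessing inside the model at inference time

Even if you go with option 2, you may later want to export an inference-only end-to-end model that includes the preprocessing layers. The key benefit of doing this is that **it makes your model portable** and it **helps reduce the [training/serving skew](https://developers.google.com/machine-learning/guides/rules-of-ml#training-serving_skew)**.

When all data preprocessing is part of the model, other people can load and use your model without having to be aware of how each feature is expected to be encoded and normalized. Your inference model will be able to process raw images or raw structured data, and will not require users of the model to know the details of, for example, the tokenization scheme used for text, the indexing scheme used for categorical features, or whether image pixel values are normalized to `[-1, +1]` or to `[0, 1]`. This is especially powerful if you're exporting your model to another runtime, such as TensorFlow.js: you won't have to reimplement your preprocessing pipeline in JavaScript.

If you initially put your preprocessing layers in your `tf.data` pipeline, you can export an inference model that packages the preprocessing. Simply instantiate a new model that chains your preprocessing layers and your training model: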

```python
inputs = keras.Input(shape=input_shape)
x = preprocessing_layer(inputs)
outputs = training_model(x)
inference_model = keras.Model(inputs, outputs)
```

## Quick recipes

### Image data augmentation (on-device)

Note that image data augmentation layers are only active during training (similarly to
the `Dropout` layer).

In [None]:
from tensorflow import keras
from tensorflow.keras import layers

# Create a data augmentation stage with horizontal flipping, rotations, zooms
data_augmentation = keras.Sequential(
    [
        preprocessing.RandomFlip("horizontal"),
        preprocessing.RandomRotation(0.1),
        preprocessing.RandomZoom(0.1),
    ]
)

# Create a model that includes the augmentation stage
input_shape = (32, 32, 3)
classes = 10
inputs = keras.Input(shape=input_shape)
# Augment images
x = data_augmentation(inputs)
# Rescale image values to [0, 1]
x = preprocessing.Rescaling(1.0 / 255)(x)
# Add the rest of the model
outputs = keras.applications.ResNet50(
    weights=None, input_shape=input_shape, classes=classes
)(x)
model = keras.Model(inputs, outputs)

You can see a similar setup in action in the example [image classification from scratch](https://keras.io/examples/vision/image_classification_from_scratch/).

### Normalizing numerical features

In [None]:
# Load some data
(x_train, y_train), _ = keras.datasets.cifar10.load_data()
x_train = x_train.reshape((len(x_train), -1))
input_shape = x_train.shape[1:]
classes = 10

# Create a Normalization layer and set its internal state using the training data
normalizer = preprocessing.Normalization()
normalizer.adapt(x_train)

# Create a model that includes the normalization layer
inputs = keras.Input(shape=input_shape)
x = normalizer(inputs)
outputs = layers.Dense(classes, activation="softmax")(x)
model = keras.Model(inputs, outputs)

# Train the model
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.fit(x_train, y_train)

### Encoding string categorical features via one-hot encoding

In [None]:
# Define some toy data
data = tf.constant([["a"], ["b"], ["c"], ["b"], ["c"], ["a"]])

# Use StringLookup to build an index of the feature values and encode output.
lookup = preprocessing.StringLookup(output_mode="binary")
lookup.adapt(data)

# Convert new test data (which includes unknown feature values)
test_data = tf.constant([["a"], ["b"], ["c"], ["d"], ["e"], [""]])
encoded_data = lookup(test_data)
print(encoded_data)

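Note that index 0 is reserved for missing values (which you should specify as the empty string `""`), and index 1 is reserved for out-of-vocabulary values (values that were not seen during `adapt()`). You can configure this by using the `mask_token` and `oov_token` constructor arguments of `StringLookup`.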
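
The index convention described here can be sketched in plain Python. This is only an illustration of the idea, not the layer's implementation: the helper names are hypothetical, and the ordering of known values (by frequency, ties broken alphabetically) is a choice made for this sketch, so the exact indices the real layer assigns to known values may differ.

```python
from collections import Counter

MASK_TOKEN = ""  # missing value, mapped to index 0
OOV_INDEX = 1    # any value not seen while building the vocabulary


def build_vocabulary(values):
    """Order seen values by descending frequency, ties alphabetically."""
    counts = Counter(v for v in values if v != MASK_TOKEN)
    return sorted(counts, key=lambda v: (-counts[v], v))


def lookup(value, vocabulary):
    """Return 0 for missing, 1 for unknown, and 2+ for known values."""
    if value == MASK_TOKEN:
        return 0
    if value in vocabulary:
        return vocabulary.index(value) + 2
    return OOV_INDEX


def one_hot(index, depth):
    return [1.0 if i == index else 0.0 for i in range(depth)]


vocabulary = build_vocabulary(["a", "b", "c", "b", "c", "a"])
depth = len(vocabulary) + 2  # two reserved slots plus the known values
indices = [lookup(v, vocabulary) for v in ["a", "b", "c", "d", "e", ""]]
encoded = [one_hot(i, depth) for i in indices]
print(indices)  # [2, 3, 4, 1, 1, 0]
```

Both unseen values, `"d"` and `"e"`, collapse onto the same out-of-vocabulary slot, so the encoded width stays fixed no matter what appears at inference time.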

You can see the `StringLookup` layer in action in the example [structured data classification from scratch](https://keras.io/examples/structured_data/structured_data_classification_from_scratch/).

### Encoding integer categorical features via one-hot encoding

In [None]:
# Define some toy data
data = tf.constant([[10], [20], [20], [10], [30], [0]])

# Use IntegerLookup to build an index of the feature values and encode output.
lookup = preprocessing.IntegerLookup(output_mode="binary")
lookup.adapt(data)

# Convert new test data (which includes unknown feature values)
test_data = tf.constant([[10], [10], [20], [50], [60], [0]])
encoded_data = lookup(test_data)
print(encoded_data)

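Note that index 0 is reserved for missing values (which you should specify as the value 0), and index 1 is reserved for out-of-vocabulary values (values that were not seen during `adapt()`). You can configure this by using the `mask_token` and `oov_token` constructor arguments of `IntegerLookup`.

You can see the `IntegerLookup` layer in action in the example [structured data classification from scratch](https://keras.io/examples/structured_data/structured_data_classification_from_scratch/).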

### Applying the hashing trick to an integer categorical feature

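If you have a categorical feature that can take many different values (on the order of 10e3 or higher), where each value only appears a few times in the data, it becomes impractical and ineffective to index and one-hot encode the feature values. Instead, it can be a good idea to apply the "hashing trick": hash the values to a vector of fixed size. This keeps the size of the feature space manageable, and removes the need for explicit indexing.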
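
The idea can be sketched without TensorFlow. The example below uses `hashlib.md5` purely as a stand-in hash function (an assumption for illustration; the `Hashing` layer uses its own hash), and the helper names are hypothetical. What matters is that the mapping is deterministic and always lands in a fixed number of bins.

```python
import hashlib


def hash_to_bin(value, num_bins, salt="1337"):
    """Deterministically map any value to a bin index in [0, num_bins)."""
    digest = hashlib.md5(f"{salt}:{value}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_bins


def one_hot(index, depth):
    return [1 if i == index else 0 for i in range(depth)]


num_bins = 64
values = [7, 93021, 7, 55555]  # arbitrary categorical ids
bins = [hash_to_bin(v, num_bins) for v in values]
encoded = [one_hot(b, num_bins) for b in bins]
print(bins)
```

No vocabulary is stored: equal inputs always share a bin, while distinct inputs may collide, which is the trade-off accepted in exchange for a bounded feature space.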

In [None]:
# Sample data: 10,000 random integers with values between 0 and 100,000
data = np.random.randint(0, 100000, size=(10000, 1))

# Use the Hashing layer to hash the values to the range [0, 64)
hasher = preprocessing.Hashing(num_bins=64, salt=1337)

# Use the CategoryEncoding layer to one-hot encode the hashed values
encoder = preprocessing.CategoryEncoding(num_tokens=64, output_mode="binary")
encoded_data = encoder(hasher(data))
print(encoded_data.shape)

### Encoding text as a sequence of token indices

This is how you should preprocess text to be passed to an `Embedding` layer.

In [None]:
# Define some text data to adapt the layer
data = tf.constant(
    [
        "The Brain is wider than the Sky",
        "For put them side by side",
        "The one the other will contain",
        "With ease and You beside",
    ]
)
# Instantiate TextVectorization with "int" output_mode
text_vectorizer = preprocessing.TextVectorization(output_mode="int")
# Index the vocabulary via `adapt()`
text_vectorizer.adapt(data)

# You can retrieve the vocabulary we indexed via get_vocabulary()
vocab = text_vectorizer.get_vocabulary()
print("Vocabulary:", vocab)

# Create an Embedding + LSTM model
inputs = keras.Input(shape=(1,), dtype="string")
x = text_vectorizer(inputs)
x = layers.Embedding(input_dim=len(vocab), output_dim=64)(x)
outputs = layers.LSTM(1)(x)
model = keras.Model(inputs, outputs)

# Call the model on test data (which includes unknown tokens)
test_data = tf.constant(["The Brain is deeper than the sea"])
test_output = model(test_data)

You can see the `TextVectorization` layer in action, combined with an `Embedding` layer, in the example [text classification from scratch](https://keras.io/examples/nlp/text_classification_from_scratch/).

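Note that when training such a model, for best performance, you should use the `TextVectorization` layer as part of the input pipeline (which is what we do in the text classification example above).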
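
For intuition about what an integer-sequence encoding involves, here is a small plain-Python sketch: lowercase and strip punctuation, split on whitespace, then map each token to an index, with 0 reserved for padding and 1 for unknown tokens. The helper names and the alphabetical tie-breaking are assumptions made for this sketch, so indices of equally frequent tokens may differ from what the layer produces.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase, strip punctuation, and split on whitespace."""
    return re.sub(r"[^\w\s]", "", text.lower()).split()


def build_vocabulary(texts):
    """Index 0 is padding, index 1 is unknown, then tokens by frequency."""
    counts = Counter(tok for text in texts for tok in tokenize(text))
    ordered = sorted(counts, key=lambda t: (-counts[t], t))
    return ["", "[UNK]"] + ordered


def encode(text, vocabulary):
    """Map each token to its vocabulary index, or 1 if it is unknown."""
    index = {tok: i for i, tok in enumerate(vocabulary)}
    return [index.get(tok, 1) for tok in tokenize(text)]


corpus = [
    "The Brain is wider than the Sky",
    "For put them side by side",
    "The one the other will contain",
    "With ease and You beside",
]
vocabulary = build_vocabulary(corpus)
token_ids = encode("The Brain is deeper than the sea", vocabulary)
print(vocabulary[:4])
print(token_ids)
```

The words "deeper" and "sea" never appear in the corpus, so both map to index 1, while "the" receives the lowest non-reserved index because it is the most frequent token.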

### Encoding text as a dense matrix of ngrams with multi-hot encoding

This is how you should preprocess text to be passed to a `Dense` layer.
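
As a rough mental model of multi-hot n-gram encoding, the plain-Python sketch below builds a vocabulary of unigrams and bigrams and turns a sentence into a fixed-length 0/1 vector. Several things here are assumptions for illustration only: the helper names, the choice to include every n-gram size up to `n`, and the use of slot 0 to flag unknown n-grams.

```python
def tokenize(text):
    return text.lower().split()


def ngrams_up_to(tokens, n):
    """Return all 1-grams through n-grams, each joined by a space."""
    grams = []
    for size in range(1, n + 1):
        grams += [
            " ".join(tokens[i:i + size])
            for i in range(len(tokens) - size + 1)
        ]
    return grams


def multi_hot(text, vocabulary, n=2):
    """Return a fixed-length 0/1 vector; slot 0 flags any unknown n-gram."""
    index = {gram: i + 1 for i, gram in enumerate(vocabulary)}
    vector = [0] * (len(vocabulary) + 1)
    for gram in ngrams_up_to(tokenize(text), n):
        vector[index.get(gram, 0)] = 1
    return vector


corpus = ["The Brain is wider than the Sky", "For put them side by side"]
vocabulary = sorted(
    {g for text in corpus for g in ngrams_up_to(tokenize(text), 2)}
)
vector = multi_hot("The Brain is deeper than the sea", vocabulary)
print(len(vocabulary), sum(vector))
```

Unlike the integer-sequence encoding, the output length depends only on the vocabulary size and not on the sentence length, which is why it can feed a `Dense` layer directly.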

In [None]:
# Define some text data to adapt the layer
data = tf.constant(
    [
        "The Brain is wider than the Sky",
        "For put them side by side",
        "The one the other will contain",
        "With ease and You beside",
    ]
)
# Instantiate TextVectorization with "binary" output_mode (multi-hot)
# and ngrams=2 (index all bigrams)
text_vectorizer = preprocessing.TextVectorization(output_mode="binary", ngrams=2)
# Index the bigrams via `adapt()`
text_vectorizer.adapt(data)

print(
    "Encoded text:\n",
    text_vectorizer(["The Brain is deeper than the sea"]).numpy(),
    "\n",
)

# Create a Dense model
inputs = keras.Input(shape=(1,), dtype="string")
x = text_vectorizer(inputs)
outputs = layers.Dense(1)(x)
model = keras.Model(inputs, outputs)

# Call the model on test data (which includes unknown tokens)
test_data = tf.constant(["The Brain is deeper than the sea"])
test_output = model(test_data)

print("Model output:", test_output)

### Encoding text as a dense matrix of ngrams with TF-IDF weighting

This is an alternative way of preprocessing text before passing it to a `Dense` layer.
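
To show what TF-IDF weighting adds over a plain multi-hot vector, here is a plain-Python sketch on unigrams. The smoothed formula `log(1 + N / (1 + document_frequency))` is one common variant chosen as an assumption for this sketch, and the helper names are hypothetical; the weights produced by the layer may differ.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def fit_idf(corpus):
    """Learn one IDF weight per token from the adaptation corpus."""
    num_docs = len(corpus)
    doc_freq = Counter(tok for text in corpus for tok in set(tokenize(text)))
    return {
        tok: math.log(1 + num_docs / (1 + df)) for tok, df in doc_freq.items()
    }


def tf_idf(text, idf):
    """Weight each known token's count in `text` by its learned IDF."""
    counts = Counter(tokenize(text))
    return {tok: counts[tok] * idf[tok] for tok in counts if tok in idf}


corpus = [
    "The Brain is wider than the Sky",
    "For put them side by side",
    "The one the other will contain",
    "With ease and You beside",
]
idf = fit_idf(corpus)
weights = tf_idf("The Brain is deeper than the sea", idf)
print(weights)
```

Tokens that occur in many documents, such as "the", get a smaller per-occurrence weight than rarer tokens such as "brain", so frequent words contribute less per occurrence.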

In [None]:
# Define some text data to adapt the layer
data = tf.constant(
    [
        "The Brain is wider than the Sky",
        "For put them side by side",
        "The one the other will contain",
        "With ease and You beside",
    ]
)
# Instantiate TextVectorization with "tf-idf" output_mode
# (multi-hot with TF-IDF weighting) and ngrams=2 (index all bigrams)
text_vectorizer = preprocessing.TextVectorization(output_mode="tf-idf", ngrams=2)
# Index the bigrams and learn the TF-IDF weights via `adapt()`
text_vectorizer.adapt(data)

print(
    "Encoded text:\n",
    text_vectorizer(["The Brain is deeper than the sea"]).numpy(),
    "\n",
)

# Create a Dense model
inputs = keras.Input(shape=(1,), dtype="string")
x = text_vectorizer(inputs)
outputs = layers.Dense(1)(x)
model = keras.Model(inputs, outputs)

# Call the model on test data (which includes unknown tokens)
test_data = tf.constant(["The Brain is deeper than the sea"])
test_output = model(test_data)
print("Model output:", test_output)