### 1. Prepare the Data

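The goal of the `imdb` dataset is to predict the sentiment label of a movie review from its text.

The training set contains `20000` movie reviews and the test set contains `5000`; in each set, positive and negative reviews each make up half.

Preprocessing text data is fairly tedious: it involves Chinese word segmentation (not needed in this example), building a vocabulary, encoding words as integers, padding sequences, building the data pipeline, and so on.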
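
To make the vocabulary-building and encoding steps concrete, here is a minimal plain-Python sketch. The helper names `build_vocab` and `encode` and the toy reviews are illustrative assumptions, not part of the notebook; reserving index 0 for padding and index 1 for unknown words mirrors the convention `TextVectorization` uses in integer mode.

```python
from collections import Counter

PAD, UNK = "", "[UNK]"  # index 0 = padding, index 1 = out-of-vocabulary


def build_vocab(texts, max_tokens):
    """Return the most frequent words, capped at max_tokens entries.

    The two reserved tokens count toward the cap.
    """
    counts = Counter(word for text in texts for word in text.split())
    # Sort by descending frequency, then alphabetically so ties are deterministic.
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    return [PAD, UNK] + ranked[: max_tokens - 2]


def encode(text, vocab):
    """Map each word to its vocabulary index; unseen words map to UNK (1)."""
    index = {word: i for i, word in enumerate(vocab)}
    return [index.get(word, 1) for word in text.split()]


reviews = ["a great movie", "a bad movie", "a great cast"]  # toy data
vocab = build_vocab(reviews, max_tokens=5)
print(vocab)                          # ['', '[UNK]', 'a', 'great', 'movie']
print(encode("a great plot", vocab))  # [2, 3, 1]
```

The word `plot` never appears in the toy training texts, so it is encoded as `1`, the unknown-word index.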

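There are two common approaches to text preprocessing in `tensorflow`. The first uses the `Tokenizer` vocabulary-building utility from `tf.keras.preprocessing` together with `tf.keras.utils.Sequence` to build a generator pipeline for the text data.

The second uses `tf.data.Dataset` together with the `tf.keras.layers.experimental.preprocessing.TextVectorization` preprocessing layer.

The first approach is more involved; the following article gives a usage example.

https://zhuanlan.zhihu.com/p/67697840

The second approach is native to `TensorFlow` and is also somewhat simpler.

We use the second approach here.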

![](./data/%E7%94%B5%E5%BD%B1%E8%AF%84%E8%AE%BA.jpg)

In [6]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import tensorflow as tf
from tensorflow.keras import models, layers, preprocessing, optimizers, losses, metrics
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization
import re, string

train_data_path = "./data/imdb/train.csv"
test_data_path = "./data/imdb/test.csv"

MAX_WORDS = 10000  # consider only the 10000 most frequent words
MAX_LEN = 200  # keep at most 200 words per sample
BATCH_SIZE = 20  # train on 20 samples per batch

# Build the data pipeline
def split_line(line):
    arr = tf.strings.split(line, "\t")
    label = tf.expand_dims(tf.cast(tf.strings.to_number(arr[0]), tf.int32), axis=0)
    text = tf.expand_dims(arr[1], axis=0)
    return (text, label)


ds_train_raw = (
    tf.data.TextLineDataset(filenames=[train_data_path])
    .map(split_line, num_parallel_calls=tf.data.experimental.AUTOTUNE)
    .shuffle(buffer_size=1000)
    .batch(BATCH_SIZE)
    .prefetch(tf.data.experimental.AUTOTUNE)
)

ds_test_raw = (
    tf.data.TextLineDataset(filenames=[test_data_path])
    .map(split_line, num_parallel_calls=tf.data.experimental.AUTOTUNE)
    .batch(BATCH_SIZE)
    .prefetch(tf.data.experimental.AUTOTUNE)
)

# Build the vocabulary
def clean_text(text):
    lowercase = tf.strings.lower(text)
    stripped_html = tf.strings.regex_replace(lowercase, "<br />", " ")
    clean_punctuation = tf.strings.regex_replace(
        stripped_html, "[%s]" % re.escape(string.punctuation), ""
    )
    return clean_punctuation


vectorize_layer = TextVectorization(
    standardize=clean_text,
    split="whitespace",
    max_tokens=MAX_WORDS - 1,  # reserve one slot for the padding token
    output_mode="int",
    output_sequence_length=MAX_LEN,
)

ds_text = ds_train_raw.map(lambda text, label: text)
vectorize_layer.adapt(ds_text)
print(vectorize_layer.get_vocabulary()[0:100])

# Encode words as integers
ds_train = ds_train_raw.map(
    lambda text, label: (vectorize_layer(text), label)
).prefetch(tf.data.experimental.AUTOTUNE)
ds_test = ds_test_raw.map(lambda text, label: (vectorize_layer(text), label)).prefetch(
    tf.data.experimental.AUTOTUNE
)

['', '[UNK]', 'the', 'and', 'a', 'of', 'to', 'is', 'in', 'it', 'i', 'this', 'that', 'was', 'as', 'for', 'with', 'movie', 'but', 'film', 'on', 'not', 'you', 'his', 'are', 'have', 'be', 'he', 'one', 'its', 'at', 'all', 'by', 'an', 'they', 'from', 'who', 'so', 'like', 'her', 'just', 'or', 'about', 'has', 'if', 'out', 'some', 'there', 'what', 'good', 'more', 'when', 'very', 'she', 'even', 'my', 'no', 'would', 'up', 'time', 'only', 'which', 'story', 'really', 'their', 'were', 'had', 'see', 'can', 'me', 'than', 'we', 'much', 'well', 'get', 'been', 'will', 'into', 'people', 'also', 'other', 'do', 'bad', 'because', 'great', 'first', 'how', 'him', 'most', 'dont', 'made', 'then', 'them', 'films', 'movies', 'way', 'make', 'could', 'too', 'any']
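
Entries such as `dont` in the vocabulary above are a consequence of the `clean_text` standardization, which removes punctuation before splitting on whitespace. The sketch below is a plain-Python counterpart of `clean_text`, written with `re` and `str` methods instead of `tf.strings`; the function name `clean_text_py` and the sample sentence are assumptions made for illustration.

```python
import re
import string


def clean_text_py(text):
    """Lowercase, replace <br /> with a space, then strip ASCII punctuation."""
    lowered = text.lower()
    no_html = lowered.replace("<br />", " ")
    # Same character class as in the notebook: every ASCII punctuation mark.
    return re.sub("[%s]" % re.escape(string.punctuation), "", no_html)


sample = "I don't like it.<br />It's BAD!"  # made-up review fragment
cleaned = clean_text_py(sample)
print(cleaned)          # i dont like it its bad
print(cleaned.split())  # ['i', 'dont', 'like', 'it', 'its', 'bad']
```

Because the apostrophe is deleted rather than replaced by a space, `don't` becomes the single token `dont`.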


In [7]:
for x, y in ds_train.unbatch().take(1):
    print(x.shape, y.shape)

(200,) (1,)
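
The `(200,)` feature shape printed above follows from `output_sequence_length=MAX_LEN`: every encoded review is cut or padded to exactly `MAX_LEN` integers. A minimal sketch of that step in plain Python, assuming padding with `0` on the right and truncation that keeps the leading tokens (the helper name `pad_or_truncate` is hypothetical):

```python
def pad_or_truncate(ids, length, pad_id=0):
    """Cut the sequence to `length` items, or right-pad it with pad_id."""
    return ids[:length] + [pad_id] * max(0, length - len(ids))


short = pad_or_truncate([2, 17, 49], length=5)
long_ = pad_or_truncate(list(range(2, 12)), length=5)
print(short)  # [2, 17, 49, 0, 0]
print(long_)  # [2, 3, 4, 5, 6]
```

Fixed-length sequences are what allow the samples to be stacked into a single batch tensor.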


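### 2. Define the Model

The `Keras` API offers three ways to build a model: stacking layers in order with `Sequential`, building arbitrary architectures with the functional `API`, and subclassing the `Model` base class to define a custom model.

Here we subclass the `Model` base class to define a custom model.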

In [8]:
tf.keras.backend.clear_session()


class CnnModel(models.Model):
    def __init__(self):
        super(CnnModel, self).__init__()

    def build(self, input_shape):
        self.embedding = layers.Embedding(MAX_WORDS, 7, input_length=MAX_LEN)
        self.conv_1 = layers.Conv1D(16, kernel_size=5, name="conv_1", activation="relu")
        self.pool_1 = layers.MaxPool1D(name="pool_1")
        self.conv_2 = layers.Conv1D(
            128, kernel_size=2, name="conv_2", activation="relu"
        )
        self.pool_2 = layers.MaxPool1D(name="pool_2")
        self.flatten = layers.Flatten()
        self.dense = layers.Dense(1, activation="sigmoid")
        super(CnnModel, self).build(input_shape)

    def call(self, x):
        x = self.embedding(x)
        x = self.conv_1(x)
        x = self.pool_1(x)
        x = self.conv_2(x)
        x = self.pool_2(x)
        x = self.flatten(x)
        x = self.dense(x)
        return x

    def summary(self):
        x_input = layers.Input(shape=(MAX_LEN,))
        output = self.call(x_input)
        model = tf.keras.Model(inputs=x_input, outputs=output)
        model.summary()


model = CnnModel()
model.build(input_shape=(None, MAX_LEN))
model.summary()

Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_1 (InputLayer)        [(None, 200)]             0         
                                                                 
 embedding (Embedding)       (None, 200, 7)            70000     
                                                                 
 conv_1 (Conv1D)             (None, 196, 16)           576       
                                                                 
 pool_1 (MaxPooling1D)       (None, 98, 16)            0         
                                                                 
 conv_2 (Conv1D)             (None, 97, 128)           4224      
                                                                 
 pool_2 (MaxPooling1D)       (None, 48, 128)           0         
                                                                 
 flatten (Flatten)           (None, 6144)              0     
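
The output shapes and parameter counts listed above can be checked by hand. The sketch below assumes the Keras defaults used in the model (`padding="valid"` and stride 1 for `Conv1D`, `pool_size=2` for `MaxPool1D`); the helper function names are made up for this illustration.

```python
def conv1d_out_len(length, kernel_size, stride=1):
    """Output length of a 1-D convolution with 'valid' padding."""
    return (length - kernel_size) // stride + 1


def pool1d_out_len(length, pool_size=2):
    """Output length of max pooling with 'valid' padding, stride == pool_size."""
    return (length - pool_size) // pool_size + 1


def conv1d_params(in_channels, filters, kernel_size):
    """Kernel weights plus one bias per filter."""
    return kernel_size * in_channels * filters + filters


MAX_WORDS, MAX_LEN, EMBED_DIM = 10000, 200, 7

embedding_params = MAX_WORDS * EMBED_DIM      # 70000
len_conv_1 = conv1d_out_len(MAX_LEN, 5)       # 196
len_pool_1 = pool1d_out_len(len_conv_1)       # 98
len_conv_2 = conv1d_out_len(len_pool_1, 2)    # 97
len_pool_2 = pool1d_out_len(len_conv_2)       # 48
flatten_size = len_pool_2 * 128               # 6144

print(embedding_params, conv1d_params(EMBED_DIM, 16, 5), conv1d_params(16, 128, 2))
print(len_conv_1, len_pool_1, len_conv_2, len_pool_2, flatten_size)
```

Pooling a sequence of odd length 97 drops the last position, which is why `pool_2` outputs 48 steps rather than 48.5.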

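### 3. Train the Model

There are usually `3` ways to train a model: the built-in `fit` method, the built-in `train_on_batch` method, and a custom training loop. Here we train the model with a custom training loop.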

In [9]:
# Print a separator line with the current time
@tf.function
def printbar():
    today_ts = tf.timestamp() % (24 * 60 * 60)
    hour = tf.cast(today_ts // 3600 + 8, tf.int32) % tf.constant(24)
    minute = tf.cast((today_ts % 3600) // 60, tf.int32)
    second = tf.cast(tf.floor(today_ts % 60), tf.int32)

    def timeformat(m):
        if tf.strings.length(tf.strings.format("{}", m)) == 1:
            return tf.strings.format("0{}", m)
        else:
            return tf.strings.format("{}", m)

    timestring = tf.strings.join(
        [timeformat(hour), timeformat(minute), timeformat(second)], separator=":"
    )
    tf.print("==========" * 2 + timestring)
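
The arithmetic inside `printbar` takes the seconds elapsed since midnight UTC, shifts the hour by 8 (UTC+8), and zero-pads each field. A plain-Python sketch of the same computation follows; the function name `format_bar` and the explicit `ts` argument are assumptions added so the result is reproducible.

```python
def format_bar(ts, utc_offset_hours=8):
    """Return the separator string printbar would print for Unix time `ts`."""
    today_ts = ts % (24 * 60 * 60)                        # seconds since midnight UTC
    hour = int(today_ts // 3600 + utc_offset_hours) % 24  # shift to local time
    minute = int((today_ts % 3600) // 60)
    second = int(today_ts % 60)
    return "=" * 20 + "{:02d}:{:02d}:{:02d}".format(hour, minute, second)


# 25861 seconds after midnight UTC is 07:11:01, i.e. 15:11:01 at UTC+8.
print(format_bar(25861.0))  # ====================15:11:01
```

The `{:02d}` format specifier replaces the `timeformat` helper, which prepends a `0` to single-digit values.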

In [10]:
optimizer = optimizers.Nadam()
loss_func = losses.BinaryCrossentropy()

train_loss = metrics.Mean(name="train_loss")
train_metric = metrics.BinaryAccuracy(name="train_accuracy")

valid_loss = metrics.Mean(name="valid_loss")
valid_metric = metrics.BinaryAccuracy(name="valid_accuracy")


@tf.function
def train_step(model, features, labels):
    with tf.GradientTape() as tape:
        predictions = model(features, training=True)
        loss = loss_func(labels, predictions)
    gradients = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))

    train_loss.update_state(loss)
    train_metric.update_state(labels, predictions)


@tf.function
def valid_step(model, features, labels):
    predictions = model(features, training=False)
    batch_loss = loss_func(labels, predictions)
    valid_loss.update_state(batch_loss)
    valid_metric.update_state(labels, predictions)


def train_model(model, ds_train, ds_valid, epochs):
    for epoch in tf.range(1, epochs + 1):
        for features, labels in ds_train:
            train_step(model, features, labels)
        for features, labels in ds_valid:
            valid_step(model, features, labels)

        logs = "Epoch={}, Loss:{}, Accuracy:{}, Valid Loss:{}, Valid Accuracy:{}"

        if epoch % 1 == 0:
            printbar()
            tf.print(
                tf.strings.format(
                    logs,
                    (
                        epoch,
                        train_loss.result(),
                        train_metric.result(),
                        valid_loss.result(),
                        valid_metric.result(),
                    ),
                )
            )
            tf.print("")
        train_loss.reset_states()
        valid_loss.reset_states()
        train_metric.reset_states()
        valid_metric.reset_states()


train_model(model, ds_train, ds_test, epochs=10)

2022-04-22 15:11:01.454338: I tensorflow/stream_executor/cuda/cuda_dnn.cc:368] Loaded cuDNN version 8303


Epoch=1, Loss:0.446296602, Accuracy:0.76495, Valid Loss:0.320769042, Valid Accuracy:0.8624

Epoch=2, Loss:0.247373357, Accuracy:0.90055, Valid Loss:0.322131723, Valid Accuracy:0.868

Epoch=3, Loss:0.178686142, Accuracy:0.9301, Valid Loss:0.36509043, Valid Accuracy:0.865

Epoch=4, Loss:0.120583653, Accuracy:0.95685, Valid Loss:0.450968593, Valid Accuracy:0.86

Epoch=5, Loss:0.0750969499, Accuracy:0.97565, Valid Loss:0.612897396, Valid Accuracy:0.854

Epoch=6, Loss:0.044601161, Accuracy:0.98515, Valid Loss:0.780356705, Valid Accuracy:0.8544

Epoch=7, Loss:0.0249042232, Accuracy:0.9921, Valid Loss:0.970365345, Valid Accuracy:0.85

Epoch=8, Loss:0.0145032955, Accuracy:0.9957, Valid Loss:1.04256284, Valid Accuracy:0.8518

Epoch=9, Loss:0.0173363537, Accuracy:0.99425, Valid Loss:1.11363089, Valid Accuracy:0.8484

Epoch=10, Loss:0.0148880268, Accuracy:0.9949, Valid Loss:1.2527951, Valid Accuracy:0.8474
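
The `Loss` and `Accuracy` columns above are averages of binary cross-entropy and of thresholded correctness. The following plain-Python sketch shows what the two quantities measure on a toy batch; the labels and probabilities are made up, and the clipping constant `eps` and the `0.5` threshold are assumptions chosen to match the usual Keras defaults.

```python
import math


def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over the batch."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


def binary_accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of samples whose thresholded prediction equals the label."""
    hits = sum(int(p > threshold) == y for y, p in zip(y_true, y_prob))
    return hits / len(y_true)


labels = [1, 0, 1, 0]          # toy ground truth
probs = [0.9, 0.2, 0.4, 0.6]   # toy sigmoid outputs
loss = binary_crossentropy(labels, probs)
acc = binary_accuracy(labels, probs)
print(round(loss, 4), acc)  # 0.5403 0.5
```

Cross-entropy penalizes confident wrong predictions heavily, so validation loss can keep rising while validation accuracy changes only slightly, as in the log above.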



### 4. Evaluate the Model

A model trained with a custom loop has not been compiled, so `model.evaluate(ds_valid)` cannot be called on it directly.

In [18]:
def evaluate_model(model, ds_valid):
    for features, labels in ds_valid:
        valid_step(model, features, labels)
    logs = "Valid Loss:{}, Valid Accuracy:{}"
    tf.print(tf.strings.format(logs, (valid_loss.result(), valid_metric.result())))
    # print(logs.format(valid_loss.result(), valid_metric.result()))
    valid_loss.reset_states()
    valid_metric.reset_states()


evaluate_model(model, ds_test)

Valid Loss:1.2527951, Valid Accuracy:0.8474


### 5. Use the Model

The following methods are available:

* `model.predict(ds_test)`
* `model(x_test)`
* `model.call(x_test)`
* `model.predict_on_batch(x_test)`

Prefer `model.predict(ds_test)`, which works on both a `Dataset` and a `Tensor`.

In [20]:
model.predict(ds_test)

array([[0.21379325],
       [0.9999994 ],
       [0.9999975 ],
       ...,
       [0.99999666],
       [0.47219488],
       [1.        ]], dtype=float32)

In [23]:
for x_test, _ in ds_test.take(1):
    print(model(x_test))
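    # The following calls give the same predictions
    # (predict_on_batch returns a NumPy array instead of a Tensor):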
    # print(model.call(x_test))
    # print(model.predict_on_batch(x_test))

tf.Tensor(
[[2.13793248e-01]
 [9.99999404e-01]
 [9.99997497e-01]
 [1.43055489e-12]
 [9.99670744e-01]
 [2.58837136e-11]
 [2.78694867e-09]
 [2.80160157e-06]
 [1.00000000e+00]
 [9.99019980e-01]
 [1.00000000e+00]
 [7.03761399e-01]
 [3.84878284e-12]
 [9.99991775e-01]
 [1.94970307e-11]
 [2.77695153e-02]
 [1.42233901e-07]
 [9.98865128e-01]
 [1.21927485e-02]
 [9.99902487e-01]], shape=(20, 1), dtype=float32)
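
The model outputs sigmoid probabilities rather than class labels. A minimal sketch of turning them into labels, using the first five values printed above; the `0.5` threshold and the reading of label `1` as the positive class are assumptions, and `to_labels` is a hypothetical helper.

```python
def to_labels(probabilities, threshold=0.5):
    """Map each probability to 1 if it exceeds the threshold, else 0."""
    return [1 if p > threshold else 0 for p in probabilities]


# First five probabilities from the printed tensor.
probs = [2.13793248e-01, 9.99999404e-01, 9.99997497e-01, 1.43055489e-12, 9.99670744e-01]
labels = to_labels(probs)
print(labels)  # [0, 1, 1, 0, 1]
```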


### 6. Save the Model

Saving the model in the native `TensorFlow` format is recommended.

In [24]:
model.save("./data/tf_model_saved_model", save_format="tf")
print("export saved model.")

model_loaded = tf.keras.models.load_model("./data/tf_model_saved_model")
model_loaded.predict(ds_test)

2022-04-22 15:21:06.494532: W tensorflow/python/util/util.cc:368] Sets are not currently considered sequences, but this may change in the future, so consider avoiding using them.


INFO:tensorflow:Assets written to: ./data/tf_model_saved_model/assets
export saved model.


array([[0.21379325],
       [0.9999994 ],
       [0.9999975 ],
       ...,
       [0.99999666],
       [0.47219488],
       [1.        ]], dtype=float32)