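# Loading text data with tf.data

This tutorial shows how to use `tf.data.TextLineDataset` to load examples from text files. `TextLineDataset` builds a dataset from a text file, with each line of the original file becoming one example. This suits most line-based text data (for example, poetry or error logs). Below we use three different English translations of the same work, Homer's Iliad, and train a model to identify the translator from a single line of text.


## Setup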

In [23]:
from __future__ import absolute_import, division, print_function, unicode_literals

import tensorflow as tf

import tensorflow_datasets as tfds
import os

The three translations are by:

 - [William Cowper](https://en.wikipedia.org/wiki/William_Cowper): [text](https://storage.googleapis.com/download.tensorflow.org/data/illiad/cowper.txt)
 - [Edward, Earl of Derby](https://en.wikipedia.org/wiki/Edward_Smith-Stanley,_14th_Earl_of_Derby): [text](https://storage.googleapis.com/download.tensorflow.org/data/illiad/derby.txt)
 - [Samuel Butler](https://en.wikipedia.org/wiki/Samuel_Butler_%28novelist%29): [text](https://storage.googleapis.com/download.tensorflow.org/data/illiad/butler.txt)

The text files used in this tutorial have already had some typical preprocessing applied, mainly the removal of document headers and footers, line numbers, and chapter titles. The code below downloads these lightly edited files to your machine.


In [24]:
DIRECTORY_URL = 'https://storage.googleapis.com/download.tensorflow.org/data/illiad/'
FILE_NAMES = ['cowper.txt', 'derby.txt', 'butler.txt']

for name in FILE_NAMES:
  text_dir = tf.keras.utils.get_file(name, origin=DIRECTORY_URL+name)
  
parent_dir = os.path.dirname(text_dir)

parent_dir

'C:\\Users\\wangxingda\\.keras\\datasets'

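## Load text into datasets

Iterate through the files, loading each one into its own dataset.

Each example needs to be labeled individually, so use `tf.data.Dataset.map` to apply a labeler function to each one. This iterates over every example in the dataset and returns (`example, label`) pairs.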

In [25]:
def labeler(example, index):
  return example, tf.cast(index, tf.int64)  

labeled_data_sets = []

for i, file_name in enumerate(FILE_NAMES):
  lines_dataset = tf.data.TextLineDataset(os.path.join(parent_dir, file_name))
  labeled_dataset = lines_dataset.map(lambda ex: labeler(ex, i))
  labeled_data_sets.append(labeled_dataset)
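
To make the labeling step concrete, here is a minimal pure-Python sketch of the same idea: every line of a file is paired with the index of the file it came from. It does not use `tf.data`; the file names `a.txt` and `b.txt` and the helper `label_lines` are hypothetical stand-ins for illustration only.

```python
import os
import tempfile


def label_lines(paths):
    """Yield (line, label) pairs; the label is the index of the source file."""
    for label, path in enumerate(paths):
        with open(path, encoding="utf-8") as f:
            for line in f:
                yield line.rstrip("\n"), label


# Hypothetical demo files written to a temporary directory.
tmp_dir = tempfile.mkdtemp()
demo_files = {"a.txt": "first line\nsecond line\n", "b.txt": "third line\n"}
paths = []
for name, content in demo_files.items():
    path = os.path.join(tmp_dir, name)
    with open(path, "w", encoding="utf-8") as f:
        f.write(content)
    paths.append(path)

labeled = list(label_lines(paths))
print(labeled)
```

In the tutorial code the label is additionally cast to `tf.int64` so that it has a fixed tensor dtype.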

Combine these labeled datasets into a single dataset, then shuffle it.


In [26]:
BUFFER_SIZE = 50000
BATCH_SIZE = 64
TAKE_SIZE = 5000

In [27]:
all_labeled_data = labeled_data_sets[0]
for labeled_dataset in labeled_data_sets[1:]:
  all_labeled_data = all_labeled_data.concatenate(labeled_dataset)
  
all_labeled_data = all_labeled_data.shuffle(
    BUFFER_SIZE, reshuffle_each_iteration=False)
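
The buffer size matters because `shuffle` draws from a fixed-size buffer rather than from the whole dataset at once. The following is a rough pure-Python sketch of that buffered strategy, not TensorFlow's implementation; the function name `buffered_shuffle` and its `seed` parameter are assumptions made for this illustration.

```python
import random


def buffered_shuffle(items, buffer_size, seed=None):
    """Yield items in a random order using a fixed-size buffer."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            # Fill the buffer before emitting anything.
            buffer.append(item)
            continue
        # Emit a random buffered element and put the new item in its slot.
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = item
    # Input exhausted: drain what is left in random order.
    rng.shuffle(buffer)
    yield from buffer


data = list(range(10))
fully_mixed = list(buffered_shuffle(data, buffer_size=10, seed=0))
unmixed = list(buffered_shuffle(data, buffer_size=1, seed=0))
print(fully_mixed)
print(unmixed)
```

With a buffer at least as large as the dataset the result is a full shuffle, while a buffer of size 1 leaves the order unchanged. Because the three files are concatenated one after another, a small buffer would keep lines from the same translator clustered together.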

You can use `tf.data.Dataset.take` and `print` to see what the `(example, label)` pairs look like. The `numpy` property shows each Tensor's value.

In [28]:
for ex in all_labeled_data.take(5):
  print(ex)

(<tf.Tensor: id=620469, shape=(), dtype=string, numpy=b"And, stripping slain Patroclus, thought'st thee safe,">, <tf.Tensor: id=620470, shape=(), dtype=int64, numpy=0>)
(<tf.Tensor: id=620471, shape=(), dtype=string, numpy=b'stock of all kinds would come from far and near to water; here, then,'>, <tf.Tensor: id=620472, shape=(), dtype=int64, numpy=2>)
(<tf.Tensor: id=620473, shape=(), dtype=string, numpy=b"take Amphimachus's helmet from off his temples, and in a moment Ajax">, <tf.Tensor: id=620474, shape=(), dtype=int64, numpy=2>)
(<tf.Tensor: id=620475, shape=(), dtype=string, numpy=b'But when they came, at length, where Xanthus winds'>, <tf.Tensor: id=620476, shape=(), dtype=int64, numpy=0>)
(<tf.Tensor: id=620477, shape=(), dtype=string, numpy=b'better, that earth should open and swallow us here in this place, than'>, <tf.Tensor: id=620478, shape=(), dtype=int64, numpy=2>)


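## Encode text lines as numbers

Machine learning models work on numbers, not text, so the strings need to be converted into lists of numbers.
To do that, we map each unique word to a unique integer.

### Build vocabulary


First, build a vocabulary by tokenizing the text into a collection of individual unique words. There are several ways to do this in both TensorFlow and Python. In this tutorial:

1. Iterate over each example's `numpy` value.
2. Use `tfds.features.text.Tokenizer` to split it into tokens.
3. Collect these tokens into a Python set to remove duplicates.
4. Get the size of the vocabulary for later use.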
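
The four steps above can be sketched without TensorFlow. The snippet below is only an approximation: `simple_tokenize` is a hypothetical regex-based stand-in for the `tfds` tokenizer, and the two byte strings are made-up inputs.

```python
import re


def simple_tokenize(text):
    """Split on runs of non-word characters and drop empty pieces."""
    return [tok for tok in re.split(r"\W+", text) if tok]


# Made-up example lines, given as bytes like the values returned by `.numpy()`.
lines = [b"And, stripping slain Patroclus,", b"stripping the slain"]

demo_vocabulary = set()
for raw in lines:
    tokens = simple_tokenize(raw.decode("utf-8"))
    demo_vocabulary.update(tokens)  # the set discards duplicates

demo_vocab_size = len(demo_vocabulary)
print(sorted(demo_vocabulary), demo_vocab_size)
```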


In [29]:
tokenizer = tfds.features.text.Tokenizer()

vocabulary_set = set()
for text_tensor, _ in all_labeled_data:
  some_tokens = tokenizer.tokenize(text_tensor.numpy())
  vocabulary_set.update(some_tokens)

vocab_size = len(vocabulary_set)
vocab_size

17178

### Encode examples

Create an encoder by passing `vocabulary_set` to `tfds.features.text.TokenTextEncoder`. The encoder's `encode` method takes a line of text and returns a list of integers.

In [30]:
encoder = tfds.features.text.TokenTextEncoder(vocabulary_set)

You can try the encoder on a single line to see what the output looks like.

In [31]:
example_text = next(iter(all_labeled_data))[0].numpy()
print(example_text)

b"And, stripping slain Patroclus, thought'st thee safe,"


In [32]:
encoded_example = encoder.encode(example_text)
print(encoded_example)

[3571, 1104, 985, 16725, 8867, 9302, 757, 8239]
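
Conceptually, an encoder of this kind is a dictionary from token to integer ID. The sketch below assumes IDs start at 1 so that 0 stays free for padding, and assigns one extra ID to unknown tokens; `build_encoder` is a hypothetical helper, and the actual ID assignment inside `TokenTextEncoder` may differ.

```python
def build_encoder(vocabulary):
    """Return a function mapping a list of tokens to a list of integer IDs."""
    # Assumption of this sketch: IDs start at 1, leaving 0 free for padding.
    token_to_id = {tok: i for i, tok in enumerate(sorted(vocabulary), start=1)}
    oov_id = len(token_to_id) + 1  # single ID shared by all unknown tokens

    def encode_tokens(tokens):
        return [token_to_id.get(tok, oov_id) for tok in tokens]

    return encode_tokens


demo_encode = build_encoder({"And", "slain", "stripping"})
ids = demo_encode(["And", "stripping", "slain", "Hector"])
print(ids)
```

Sorting the vocabulary first keeps the mapping deterministic across runs, which a plain iteration over a Python set would not guarantee for strings.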


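Now run the encoder on the dataset by wrapping it in `tf.py_function` and passing that to the dataset's `map` method. The wrapper is needed because `encode` calls `.numpy()`, which is only available when the function runs eagerly.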

In [33]:
def encode(text_tensor, label):
  encoded_text = encoder.encode(text_tensor.numpy())
  return encoded_text, label

def encode_map_fn(text, label):
  return tf.py_function(encode, inp=[text, label], Tout=(tf.int64, tf.int64))

all_encoded_data = all_labeled_data.map(encode_map_fn)

## Split the dataset into test and train batches

Use `tf.data.Dataset.take` and `tf.data.Dataset.skip` to create a small test dataset and a larger training set.
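
The semantics of `take` and `skip` can be illustrated with `itertools.islice` on an ordinary Python list. This is a sketch of the idea only; the helper names `take` and `skip` mirror the dataset methods but are defined here for the demo.

```python
from itertools import islice


def take(iterable, n):
    """First n elements."""
    return islice(iterable, n)


def skip(iterable, n):
    """Everything after the first n elements."""
    return islice(iterable, n, None)


examples = list(range(10))
test_part = list(take(examples, 3))
train_part = list(skip(examples, 3))
print(test_part, train_part)
```

The two parts are disjoint only if the underlying order is the same every time the data is read, which is why the earlier shuffle was created with `reshuffle_each_iteration=False`.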

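Before being passed to the model, the datasets need to be batched. Typically, the examples inside a batch need to have the same size and shape. But the examples in these datasets are not all the same size, since each line of text has a different number of words. So use `tf.data.Dataset.padded_batch` (instead of `batch`) to pad the examples to the same size.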

In [34]:
train_data = all_encoded_data.skip(TAKE_SIZE).shuffle(BUFFER_SIZE)
train_data = train_data.padded_batch(BATCH_SIZE, padded_shapes=([-1],[]))

test_data = all_encoded_data.take(TAKE_SIZE)
test_data = test_data.padded_batch(BATCH_SIZE, padded_shapes=([-1],[]))
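
What padded batching does to variable-length sequences can be shown in plain Python. The sketch below pads each batch with zeros up to the length of its longest member; `pad_batch` and `padded_batches` are hypothetical helpers, not part of `tf.data`.

```python
def pad_batch(sequences, pad_value=0):
    """Right-pad every sequence to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    return [list(seq) + [pad_value] * (max_len - len(seq)) for seq in sequences]


def padded_batches(sequences, batch_size, pad_value=0):
    """Group sequences into consecutive batches and pad each batch separately."""
    for start in range(0, len(sequences), batch_size):
        yield pad_batch(sequences[start:start + batch_size], pad_value)


encoded_lines = [[1, 2, 3], [4], [5, 6], [7]]
batches = list(padded_batches(encoded_lines, batch_size=2))
print(batches)
```

Padding is computed per batch, so different batches can end up with different sequence lengths.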


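Now `test_data` and `train_data` are no longer collections of (`example, label`) pairs but collections of batches. Each batch is a pair of (*many examples*, *many labels*) represented as arrays.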


In [35]:
sample_text, sample_labels = next(iter(test_data))

sample_text[0], sample_labels[0]

(<tf.Tensor: id=719942, shape=(16,), dtype=int64, numpy=
 array([ 3571,  1104,   985, 16725,  8867,  9302,   757,  8239,     0,
            0,     0,     0,     0,     0,     0,     0], dtype=int64)>,
 <tf.Tensor: id=719946, shape=(), dtype=int64, numpy=0>)


Since we have introduced a new token encoding (the zero used for padding), the vocabulary size has increased by one.

In [36]:
vocab_size += 1

## Build the model



In [37]:
model = tf.keras.Sequential()

The first layer converts the integer representations into dense vector embeddings. See the [Word Embeddings](../../tutorials/sequences/word_embeddings) tutorial for more details.

In [38]:
model.add(tf.keras.layers.Embedding(vocab_size, 64))
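
An embedding layer can be pictured as a lookup table with one row per token ID. The sketch below uses small made-up sizes and arbitrary random values; in a real model the rows are trainable weights, and `make_embedding_table` and `embed` are hypothetical names used only here.

```python
import random


def make_embedding_table(vocab_size, dim, seed=0):
    """Create a vocab_size x dim table filled with small random numbers."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.05, 0.05) for _ in range(dim)]
            for _ in range(vocab_size)]


def embed(table, token_ids):
    """Replace each token ID with its row from the table."""
    return [table[i] for i in token_ids]


table = make_embedding_table(vocab_size=6, dim=4)
vectors = embed(table, [3, 1, 0, 0])  # the trailing zeros are padding IDs
print(len(vectors), len(vectors[0]))
```

The table needs a row for every ID that can occur, including the padding ID 0, so its first dimension must be larger than the largest token ID.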


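The next layer is an [LSTM](http://colah.github.io/posts/2015-08-Understanding-LSTMs/) layer, which lets the model understand words in the context of the words around them. The bidirectional wrapper on the LSTM helps the model relate each data point to the data points that come before and after it.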


In [39]:
model.add(tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)))


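Finally, we add a series of one or more densely connected layers, the last of which is the output layer. The output layer produces a probability for each label, and the label with the highest probability is the model's prediction.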


In [40]:
# One or more densely connected layers.
# Edit the list in the `for` line to experiment with layer sizes.
for units in [64, 64]:
  model.add(tf.keras.layers.Dense(units, activation='relu'))

# Output layer. The first argument is the number of labels.
model.add(tf.keras.layers.Dense(3, activation='softmax'))
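
To illustrate how a softmax output turns into a prediction, the sketch below computes a softmax over three made-up scores and picks the index of the largest probability. The helper names and the input values are assumptions for this example only.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_label(probabilities):
    """Return the index of the most probable class."""
    return max(range(len(probabilities)), key=probabilities.__getitem__)


probs = softmax([2.0, 0.5, -1.0])  # made-up scores for the three translators
label = predict_label(probs)
print(probs, label)
```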

Finally, compile the model. For a softmax classification model, `sparse_categorical_crossentropy` is the usual loss function. You can try other optimizers, but `adam` is a common default.

In [41]:
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
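
As a rough illustration of what this loss measures, the sketch below averages the negative log-probability assigned to the correct integer label. It ignores details such as the numerical clipping a real implementation applies, and the labels and probabilities are made up.

```python
import math


def sparse_categorical_crossentropy(labels, probabilities):
    """Mean of -log(p[true_label]) over a batch of integer labels."""
    losses = [-math.log(p[y]) for y, p in zip(labels, probabilities)]
    return sum(losses) / len(losses)


labels = [0, 2]
predicted = [[0.7, 0.2, 0.1],
             [0.1, 0.1, 0.8]]
loss = sparse_categorical_crossentropy(labels, predicted)
uniform_loss = sparse_categorical_crossentropy([1], [[1 / 3, 1 / 3, 1 / 3]])
print(loss, uniform_loss)
```

"Sparse" refers to the labels being plain integers rather than one-hot vectors. For three classes, a model that predicts uniformly has a loss of ln 3, roughly 1.0986, which is a useful baseline when reading training logs.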

## Train the model

Trained on this data, the model reaches decent accuracy (about 83%).

In [42]:
model.fit(train_data, epochs=3, validation_data=test_data)

Epoch 1/3



Epoch 2/3
Epoch 3/3


<tensorflow.python.keras.callbacks.History at 0x2276c45f4a8>

In [47]:
eval_loss, eval_acc = model.evaluate(test_data)
print('\nEval loss: {}, Eval accuracy: {}'.format(eval_loss, eval_acc))


Eval loss: 0.3797863131459755, Eval accuracy: 0.826200008392334
