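This tutorial shows how to use `tf.data.TextLineDataset` to load examples from text files. `TextLineDataset` creates a dataset from a text file in which each example is one line of text from the original file. This is useful for any text data that is primarily line-based, such as poetry or error logs.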

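In this tutorial, we use three different English translations of the same work, Homer's Iliad, and train a model to identify the translator given a single line of text.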

## 1. Loading data with `tf.data.TextLineDataset`

In [1]:
from __future__ import absolute_import, division, print_function, unicode_literals

import tensorflow as tf

import tensorflow_datasets as tfds
import os

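The texts of the three translations are by:

* William Cowper - text

* Edward, Earl of Derby - text

* Samuel Butler - text

The text files used in this tutorial have already undergone some typical preprocessing, mainly the removal of document headers and footers, line numbers, and chapter titles. Download these lightly preprocessed files locally.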

In [2]:
DIRECTORY_URL = 'https://storage.googleapis.com/download.tensorflow.org/data/illiad/'
FILE_NAMES = ['cowper.txt', 'derby.txt', 'butler.txt']

for name in FILE_NAMES:
    text_dir = tf.keras.utils.get_file(name, origin=DIRECTORY_URL+name)

parent_dir = os.path.dirname(text_dir)

parent_dir

Downloading data from https://storage.googleapis.com/download.tensorflow.org/data/illiad/cowper.txt
Downloading data from https://storage.googleapis.com/download.tensorflow.org/data/illiad/derby.txt
Downloading data from https://storage.googleapis.com/download.tensorflow.org/data/illiad/butler.txt


'C:\\Users\\sha\\.keras\\datasets'

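Load text into datasets<br>
Iterate through the files, loading each one into its own dataset.

Each example needs to be labeled individually, so use `tf.data.Dataset.map` to apply a labeler function to each one. This iterates over every example in the dataset and returns `(example, label)` pairs.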

In [14]:
def labeler(example, index):
    return example, tf.cast(index, tf.int64)  

labeled_data_sets = []

for i, file_name in enumerate(FILE_NAMES):
    lines_dataset = tf.data.TextLineDataset(os.path.join(parent_dir, file_name))
    labeled_dataset = lines_dataset.map(lambda ex: labeler(ex, i))
    labeled_data_sets.append(labeled_dataset)

In [15]:
len(labeled_data_sets)

3
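
As a rough, TensorFlow-free picture of what the loop above produces, the sketch below reads lines from files and pairs each line with the index of the file it came from. The helper `labeled_lines` and the two tiny files are made up for illustration; they are not part of the tutorial's data or of the `tf.data` API.

```python
import os
import tempfile

def labeled_lines(paths):
    """Yield (line, label) pairs, where label is the index of the source file."""
    for index, path in enumerate(paths):
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                # One example per line, with the trailing newline removed.
                yield line.rstrip("\n"), index

# Two small stand-in files (hypothetical content, not the Iliad texts).
tmp_dir = tempfile.mkdtemp()
contents = {"a.txt": "first line\nsecond line\n", "b.txt": "third line\n"}
paths = []
for name, text in contents.items():
    path = os.path.join(tmp_dir, name)
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(text)
    paths.append(path)

pairs = list(labeled_lines(paths))
print(pairs)
```

In the tutorial itself the label is the position of the file in `FILE_NAMES`, so 0, 1, and 2 stand for the Cowper, Derby, and Butler translations respectively.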

Combine these labeled datasets into a single dataset, then shuffle it.

In [16]:
BUFFER_SIZE = 50000
BATCH_SIZE = 64
TAKE_SIZE = 5000

In [17]:
all_labeled_data = labeled_data_sets[0]
for labeled_dataset in labeled_data_sets[1:]:
    all_labeled_data = all_labeled_data.concatenate(labeled_dataset)

In [18]:
all_labeled_data = all_labeled_data.shuffle(
    BUFFER_SIZE, reshuffle_each_iteration=False)
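
To see why the buffer size matters, here is a minimal plain-Python sketch of buffer-based shuffling: fill a buffer, emit a random element from it, and refill the freed slot with the next input. The function `buffered_shuffle` is a hypothetical illustration of the idea, not TensorFlow's implementation.

```python
import random

def buffered_shuffle(items, buffer_size, seed=None):
    """Yield items in shuffled order using a fixed-size buffer."""
    if buffer_size < 1:
        raise ValueError("buffer_size must be at least 1")
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            # Still filling the buffer; nothing is emitted yet.
            buffer.append(item)
            continue
        # Buffer is full: emit a random element and put the new item in its slot.
        pick = rng.randrange(buffer_size)
        yield buffer[pick]
        buffer[pick] = item
    # Input exhausted: emit whatever remains, in random order.
    rng.shuffle(buffer)
    yield from buffer

data = list(range(100))
shuffled = list(buffered_shuffle(data, buffer_size=10, seed=0))
print(shuffled[:10])
```

When the buffer is at least as large as the dataset, every ordering is equally likely. With a smaller buffer, elements are only mixed locally, and a buffer of size 1 leaves the order unchanged.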

You can use `tf.data.Dataset.take` and `print` to see what the `(example, label)` pairs look like. The `numpy` property shows each tensor's value.

In [19]:
for ex in all_labeled_data.take(5):
    print(ex)

(<tf.Tensor: id=202, shape=(), dtype=string, numpy=b'mounted the chariot sick and sorry at heart, while Iris sat beside her'>, <tf.Tensor: id=203, shape=(), dtype=int64, numpy=2>)
(<tf.Tensor: id=204, shape=(), dtype=string, numpy=b'Crazed as he is, and by the stroke of Jove'>, <tf.Tensor: id=205, shape=(), dtype=int64, numpy=0>)
(<tf.Tensor: id=206, shape=(), dtype=string, numpy=b"Slew and despoil'd, and through the Grecian host">, <tf.Tensor: id=207, shape=(), dtype=int64, numpy=0>)
(<tf.Tensor: id=208, shape=(), dtype=string, numpy=b'Then he called on his horses and said to them, "Keep your pace, and'>, <tf.Tensor: id=209, shape=(), dtype=int64, numpy=2>)
(<tf.Tensor: id=210, shape=(), dtype=string, numpy=b'Disease portends to miserable man;'>, <tf.Tensor: id=211, shape=(), dtype=int64, numpy=0>)


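## 2. Encoding

Encode text lines as numbers<br>
Machine learning models work on numbers, not words, so the string values need to be converted into lists of numbers. To do that, map each unique word to a unique integer.

Build the vocabulary<br>
First, build a vocabulary by tokenizing the text into a collection of individual unique words. There are several ways to do this in both TensorFlow and Python. For this tutorial:

1. Iterate over each example's `numpy` value.
1. Use `tfds.features.text.Tokenizer` to split it into tokens.
1. Collect these tokens into a Python set to remove duplicates.
1. Get the size of the vocabulary for later use.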

In [20]:
tokenizer = tfds.features.text.Tokenizer()

vocabulary_set = set()
for text_tensor, _ in all_labeled_data:
    some_tokens = tokenizer.tokenize(text_tensor.numpy())
    vocabulary_set.update(some_tokens)

vocab_size = len(vocabulary_set)
vocab_size

17178
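
The same steps can be sketched without TensorFlow. The regular expression below is only a rough stand-in for the tokenizer (it splits on anything that is not a letter, digit, or underscore), and the three lines are taken from the sample output shown earlier.

```python
import re

def tokenize(text):
    """Split text into word tokens; an approximation, not the tfds tokenizer."""
    return re.findall(r"\w+", text)

lines = [
    "Crazed as he is, and by the stroke of Jove",
    "Disease portends to miserable man;",
    "Then he called on his horses and said to them",
]

vocabulary = set()
for line in lines:
    # A set keeps one copy of each token, so repeated words are counted once.
    vocabulary.update(tokenize(line))

print(len(vocabulary))
```

Note that the tokens are case-sensitive in this sketch, so "Then" and "the" count as different words.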

Encode examples<br>
Create an encoder by passing `vocabulary_set` to `tfds.features.text.TokenTextEncoder`. The encoder's `encode` method takes a string of text and returns a list of integers.

In [21]:
encoder = tfds.features.text.TokenTextEncoder(vocabulary_set)

You can try this on a single line to see what the output looks like.

In [22]:
example_text = next(iter(all_labeled_data))[0].numpy()
print(example_text)

b'mounted the chariot sick and sorry at heart, while Iris sat beside her'


In [23]:
encoded_example = encoder.encode(example_text)
print(encoded_example)

[9077, 16923, 7733, 10293, 10794, 9463, 8400, 8203, 13312, 3596, 17020, 1323, 17007]
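
Conceptually, the encoder is a lookup table from tokens to integers. The sketch below is a simplified stand-in with two design choices that are assumptions of this example: ids start at 1 so that 0 stays free for padding, and unknown tokens share one extra id. The helper names are hypothetical.

```python
import re

def build_encoder(vocabulary):
    """Map each token to an integer id starting at 1 (0 is kept for padding)."""
    # Sorting makes the ids reproducible; a set has no guaranteed order.
    return {token: i for i, token in enumerate(sorted(vocabulary), start=1)}

def encode(text, token_to_id):
    """Return one id per token; unknown tokens get one shared extra id."""
    oov_id = len(token_to_id) + 1
    return [token_to_id.get(tok, oov_id) for tok in re.findall(r"\w+", text)]

token_to_id = build_encoder({"and", "heart", "sick", "sorry", "the"})
encoded = encode("sick and sorry at heart", token_to_id)
print(encoded)
```

Here "at" is not in the toy vocabulary, so it receives the out-of-vocabulary id.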


Now run the encoder on the dataset by wrapping it in `tf.py_function` and passing that to the dataset's `map` method.

In [24]:
def encode(text_tensor, label):
    encoded_text = encoder.encode(text_tensor.numpy())
    return encoded_text, label

def encode_map_fn(text, label):
    return tf.py_function(encode, inp=[text, label], Tout=(tf.int64, tf.int64))

all_encoded_data = all_labeled_data.map(encode_map_fn)

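Split the dataset into test and training batches

Use `tf.data.Dataset.take` and `tf.data.Dataset.skip` to create a small test dataset and a larger training set.

Before being passed to the model, the datasets need to be batched. Typically, the examples inside a batch must have the same size and shape. However, the examples in these datasets are not all the same size, because each line of text has a different number of words. So use `tf.data.Dataset.padded_batch` (instead of `batch`) to pad the examples to the same size.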
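
A minimal plain-Python sketch of the padding idea: each batch is padded with zeros up to the length of its longest sequence. The function `padded_batches` and the toy id lists are illustrative assumptions, not the `tf.data` implementation.

```python
def padded_batches(sequences, batch_size, pad_value=0):
    """Group sequences into batches, padding each batch to its longest member."""
    for start in range(0, len(sequences), batch_size):
        batch = sequences[start:start + batch_size]
        width = max(len(seq) for seq in batch)
        # Append pad_value until every sequence in the batch has the same length.
        yield [seq + [pad_value] * (width - len(seq)) for seq in batch]

encoded_lines = [[5, 2, 9], [7], [4, 4, 4, 4], [1, 2]]
batches = list(padded_batches(encoded_lines, batch_size=2))
print(batches)
```

Because padding is applied per batch, different batches can end up with different widths.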

## 3. Building the training data

In [25]:
train_data = all_encoded_data.skip(TAKE_SIZE).shuffle(BUFFER_SIZE)
train_data = train_data.padded_batch(BATCH_SIZE, padded_shapes=([-1],[]))

test_data = all_encoded_data.take(TAKE_SIZE)
test_data = test_data.padded_batch(BATCH_SIZE, padded_shapes=([-1],[]))
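
The `take`/`skip` split in the cell above relies on both calls seeing the examples in the same order, which is why the earlier shuffle used `reshuffle_each_iteration=False`. The idea can be sketched in plain Python with `itertools.islice`; the helpers `take` and `skip` below are hypothetical stand-ins for the dataset methods.

```python
from itertools import islice

def take(items, n):
    """Return an iterator over the first n items."""
    return islice(items, n)

def skip(items, n):
    """Return an iterator over everything after the first n items."""
    return islice(items, n, None)

examples = list(range(10))  # stand-in for the encoded examples
test_part = list(take(examples, 3))
train_part = list(skip(examples, 3))
print(test_part, train_part)
```

As long as the underlying order is fixed, the two parts are disjoint and together cover the whole dataset.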

In [28]:
BATCH_SIZE

64

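Now `test_data` and `train_data` are no longer collections of `(example, label)` pairs but collections of batches. Each batch is a pair of (many examples, many labels), represented as arrays.

To illustrate: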

In [26]:
sample_text, sample_labels = next(iter(test_data))

sample_text[0], sample_labels[0]

(<tf.Tensor: id=99675, shape=(16,), dtype=int64, numpy=
 array([ 9077, 16923,  7733, 10293, 10794,  9463,  8400,  8203, 13312,
         3596, 17020,  1323, 17007,     0,     0,     0], dtype=int64)>,
 <tf.Tensor: id=99679, shape=(), dtype=int64, numpy=2>)

In [29]:
sample_text.shape

TensorShape([64, 16])

Because padding introduces a new token encoding (the zero used for padding), the vocabulary size increases by one.

In [30]:
vocab_size += 1

## 4. Building, training, and testing the model

Build the model

In [31]:
model = tf.keras.Sequential()

The first layer converts the integer representations to dense vector embeddings. See the Word Embeddings tutorial for more details.

In [32]:
model.add(tf.keras.layers.Embedding(vocab_size, 64))

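The next layer is a Long Short-Term Memory (LSTM) layer, which lets the model understand each word in the context of the other words. The bidirectional wrapper around the LSTM helps it learn how each data point relates to the data points that come before and after it.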

In [33]:
model.add(tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)))

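Finally, we add a series of one or more densely connected layers, the last of which is the output layer. The output layer produces a probability for each label. The label with the highest probability is the model's prediction for the example.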

In [34]:
# One or more dense layers.
# Edit the list in the `for` line to experiment with layer sizes.
for units in [64, 64]:
    model.add(tf.keras.layers.Dense(units, activation='relu'))

# Output layer. The first argument is the number of labels.
model.add(tf.keras.layers.Dense(3, activation='softmax'))
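
To make the output layer concrete, here is a small plain-Python sketch of softmax followed by picking the most probable label. The raw scores are hypothetical; the label order follows `FILE_NAMES` from earlier in the tutorial.

```python
import math

FILE_NAMES = ['cowper.txt', 'derby.txt', 'butler.txt']

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [1.0, 3.0, 0.5]  # hypothetical raw scores for one line of text
probs = softmax(logits)

# The prediction is the index of the largest probability.
predicted = max(range(len(probs)), key=probs.__getitem__)
print(probs, FILE_NAMES[predicted])
```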

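Finally, compile the model. For a softmax classification model, use `sparse_categorical_crossentropy` as the loss function. You can try other optimizers, but `adam` is a very common choice.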

In [35]:
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
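
For a single example, sparse categorical cross-entropy is the negative log of the probability the model assigns to the true label. The plain-Python function below is a sketch of that definition for one example, not the Keras implementation, and the probability vectors are made up.

```python
import math

def sparse_categorical_crossentropy(probs, label):
    """Loss for one example: -log of the probability given to the true label."""
    return -math.log(probs[label])

# A model that guesses uniformly among three labels.
uniform_loss = sparse_categorical_crossentropy([1 / 3, 1 / 3, 1 / 3], 2)

# A model that is fairly confident in the correct label.
confident_loss = sparse_categorical_crossentropy([0.1, 0.8, 0.1], 1)

print(round(uniform_loss, 4), round(confident_loss, 4))
```

The uniform case gives ln 3, roughly 1.0986, which is about where the training log starts before the model has learned anything.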

Train the model

Running this model on this data produces decent results (about 83% accuracy on the test set).

In [20]:
model.fit(train_data, epochs=3, validation_data=test_data)

Epoch 1/3
(progress-bar output truncated in the source; over epoch 1 the logged loss falls from about 1.10 to about 0.56 and accuracy rises from about 0.37 to about 0.72)
Epoch 2/3
Epoch 3/3


<tensorflow.python.keras.callbacks.History at 0x1bd70b0e160>

In [21]:
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding (Embedding)        (None, None, 64)          1099456   
_________________________________________________________________
bidirectional (Bidirectional (None, 128)               66048     
_________________________________________________________________
dense (Dense)                (None, 64)                8256      
_________________________________________________________________
dense_1 (Dense)              (None, 64)                4160      
_________________________________________________________________
dense_2 (Dense)              (None, 3)                 195       
=================================================================
Total params: 1,178,115
Trainable params: 1,178,115
Non-trainable params: 0
_________________________________________________________________
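
The parameter counts in the summary above can be checked by hand. The arithmetic below assumes the usual four-gate LSTM parameterization (input weights, recurrent weights, and one bias vector per gate); under that assumption it reproduces the figures reported by `model.summary()`.

```python
vocab_size = 17178 + 1  # vocabulary plus the padding token
embed_dim = 64
lstm_units = 64

# Embedding: one 64-dimensional vector per token id.
embedding = vocab_size * embed_dim

# One LSTM direction: 4 gates, each with input weights, recurrent weights, and a bias.
one_direction = 4 * (lstm_units * (embed_dim + lstm_units) + lstm_units)
bidirectional = 2 * one_direction

# Dense layers: inputs * units weights plus one bias per unit.
dense = (2 * lstm_units) * 64 + 64  # input is the 128-wide bidirectional output
dense_1 = 64 * 64 + 64
dense_2 = 64 * 3 + 3

total = embedding + bidirectional + dense + dense_1 + dense_2
print(embedding, bidirectional, dense, dense_1, dense_2, total)
```

Almost all of the parameters sit in the embedding layer, so the vocabulary size dominates the size of this model.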


In [25]:
eval_loss, eval_acc = model.evaluate(test_data)



In [24]:
print('Eval loss: {:.3f}, Eval accuracy: {:.3f}'.format(eval_loss, eval_acc))

Eval loss: 0.398, Eval accuracy: 0.832
