# Ungraded Lab: Building a Vocabulary

In most natural language processing (NLP) tasks, the first step in preparing your data is to extract a vocabulary of words from your corpus (i.e. the input texts). You then need to define how to represent the texts as numeric features that can be used to train a neural network. TensorFlow and Keras make this easy through their APIs. You will see how to do that in the next cells.

The code below takes a list of sentences and assigns an integer to each word that appears in them. This is done using the [TextVectorization()](https://www.tensorflow.org/api_docs/python/tf/keras/layers/TextVectorization) preprocessing layer and its [adapt()](https://www.tensorflow.org/api_docs/python/tf/keras/layers/TextVectorization#adapt) method.

As mentioned in the docs above, this layer does several things including:

1. Standardizing each example. The default behavior is to lowercase and strip punctuation. See its `standardize` argument for other options.
2. Splitting each example into substrings. By default, it will split into words. See its `split` argument for other options.
3. Recombining substrings into tokens. See its `ngrams` argument for reference.
4. Indexing tokens.
5. Transforming each example using this index, either into a vector of ints or a dense float vector.
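
The first three steps above can be approximated in plain Python to make them concrete. This is only a rough sketch, not the layer's actual implementation: `standardize`, `split_words`, and `make_ngrams` are hypothetical helper names, `string.punctuation` is assumed to be a close stand-in for the punctuation the layer strips by default, and `make_ngrams` builds n-grams of exactly one size purely for illustration.

```python
import string

def standardize(text):
    """Lowercase the text and drop ASCII punctuation (approximates the default)."""
    return text.lower().translate(str.maketrans("", "", string.punctuation))

def split_words(text):
    """Split on whitespace, which yields one substring per word."""
    return text.split()

def make_ngrams(words, n):
    """Join each run of n consecutive words into a single token."""
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

words = split_words(standardize("I, love my cat"))
print(words)                  # ['i', 'love', 'my', 'cat']
print(make_ngrams(words, 2))  # ['i love', 'love my', 'my cat']
```

With the default settings no n-gram recombination is requested, so each word is its own token.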

Run the cells below to see this in action.

In [1]:
import tensorflow as tf

# Sample inputs
sentences = [
    'i love my dog',
    'I, love my cat'
]

# Initialize the layer
vectorize_layer = tf.keras.layers.TextVectorization()

# Build the vocabulary
vectorize_layer.adapt(sentences)

# Get the vocabulary list. Ignore special tokens for now.
vocabulary = vectorize_layer.get_vocabulary(include_special_tokens=False)

The resulting `vocabulary` is a list in which more frequently used words get a lower index. By default, the layer also reserves indices for special tokens but, for clarity, those are left out for now and covered later.

In [2]:
# Print the token index
for index, word in enumerate(vocabulary):
  print(index, word)

0 my
1 love
2 i
3 dog
4 cat
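
The ordering printed above is consistent with simple token counts. The following plain-Python sketch counts the tokens in the same two sentences; `tokenize` is a hypothetical helper that only approximates the layer's default standardization and splitting, and the order it gives to words with equal counts may differ from the order the layer chooses.

```python
from collections import Counter
import string

sentences = ['i love my dog', 'I, love my cat']

def tokenize(text):
    """Lowercase, strip ASCII punctuation, and split on whitespace."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return cleaned.split()

# Count how often each token occurs across all sentences
counts = Counter(token for s in sentences for token in tokenize(s))

# Most frequent tokens first; ties are not guaranteed to match the layer
for word, count in counts.most_common():
    print(word, count)
```

Here `my`, `love`, and `i` each occur twice while `dog` and `cat` occur once, which is why the first three take the lowest indices.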


If you add another sentence, you'll notice that its new words appear in the vocabulary, while the new punctuation is still ignored, as expected.


In [3]:
# Add another input
sentences = [
    'i love my dog',
    'I, love my cat',
    'You love my dog!'
]

# Initialize the layer
vectorize_layer = tf.keras.layers.TextVectorization()

# Build the vocabulary
vectorize_layer.adapt(sentences)

# Get the vocabulary list. Ignore special tokens for now.
vocabulary = vectorize_layer.get_vocabulary(include_special_tokens=False)

In [4]:
# Print the token index
for index, word in enumerate(vocabulary):
  print(index, word)

0 my
1 love
2 i
3 dog
4 you
5 cat


Now that you've seen how the layer behaves, let's include the two special tokens. Index `0` is reserved for padding and index `1` for out-of-vocabulary (OOV) words. These matter when you use the layer to convert input texts to integer sequences. You'll see that in the next lab.

In [5]:
# Get the vocabulary list.
vocabulary = vectorize_layer.get_vocabulary()

# Print the token index
for index, word in enumerate(vocabulary):
  print(index, word)

0 
1 [UNK]
2 my
3 love
4 i
5 dog
6 you
7 cat
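
As a preview of why the two special tokens matter, here is a minimal plain-Python sketch of converting words to integers with the vocabulary printed above. It is an illustration only, not how the layer is implemented: `encode` is a hypothetical helper, and padding to a fixed `length` is an assumption made for this example.

```python
# Vocabulary as printed above: '' is the padding token, '[UNK]' the OOV token
vocabulary = ['', '[UNK]', 'my', 'love', 'i', 'dog', 'you', 'cat']
token_to_index = {token: index for index, token in enumerate(vocabulary)}

def encode(words, length):
    """Map words to indices, using 1 for unknown words and 0 for padding."""
    ids = [token_to_index.get(word, 1) for word in words][:length]
    return ids + [0] * (length - len(ids))

# 'hamster' is not in the vocabulary, so it maps to index 1
print(encode(['i', 'love', 'my', 'hamster'], 5))  # [4, 3, 2, 1, 0]
```

The unknown word falls back to index `1`, and the trailing `0` fills the sequence up to the requested length.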


That concludes this short exercise on building a vocabulary!