Conversation

vaharoni
Contributor

@vaharoni vaharoni commented Aug 27, 2023

In the Beginner tutorials > Load and preprocess data > Text, under Example 2, I believe there is an issue with how OOV tokens are handled. The code does this:

keys = vocab
values = range(2, len(vocab) + 2)  # Reserve `0` for padding, `1` for OOV tokens.

init = tf.lookup.KeyValueTensorInitializer(
    keys, values, key_dtype=tf.string, value_dtype=tf.int64)

num_oov_buckets = 1
vocab_table = tf.lookup.StaticVocabularyTable(init, num_oov_buckets)

However, note that StaticVocabularyTable does not return the value 1 for OOV tokens. Per its documentation, it returns hash(<term>) % num_oov_buckets + vocab_size. With num_oov_buckets = 1, this is always vocab_size, which here is 10000 (not 1, and not 10002). That value collides with the token at original index 9998, which the range starting at 2 also maps to 10000. To fix this, I suggest first mapping OOV tokens to 10002 by providing a vocabulary that inherently includes the padding and OOV tokens:

# Reserve `0` for padding, `1` for OOV tokens.
keys = ['', '[UNK]'] + vocab
values = range(len(keys))

Then, in the subsequent cell, remap the 10002 to 1. I'm using len(keys) here rather than vocab_size + 2, since later in the tutorial we do vocab_size += 2.

def preprocess_text(text, label):
  standardized = tf_text.case_fold_utf8(text)
  tokenized = tokenizer.tokenize(standardized)
  vectorized = vocab_table.lookup(tokenized)
  # StaticVocabularyTable returns the OOV token as vocab_size + 2. We overwrite it to be 1.
  vectorized = tf.where(vectorized == len(keys), tf.constant(1, dtype=tf.int64), vectorized)
  return vectorized, label
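
The collision and the fix can be checked with plain-Python stand-ins for the table (a sketch of the index arithmetic only, simulating the documented hash(term) % num_oov_buckets + vocab_size formula rather than calling TensorFlow; the `token_{i}` vocabulary is hypothetical):

```python
# Simulate the two index schemes; this is a sketch, not the real TF lookup.
vocab_size = 10000
vocab = [f"token_{i}" for i in range(vocab_size)]  # hypothetical vocabulary

# Original tutorial mapping: values 2 .. vocab_size + 1.
orig_table = {tok: i + 2 for i, tok in enumerate(vocab)}
num_oov_buckets = 1
# Per the StaticVocabularyTable docs, an OOV term maps into
# [table_size, table_size + num_oov_buckets). With one bucket this is
# always table_size, i.e. vocab_size = 10000 here, whatever the hash is.
oov_value = 0 % num_oov_buckets + vocab_size
assert oov_value == orig_table[vocab[9998]]  # collides with original index 9998

# Proposed mapping: padding and OOV are part of the keys themselves.
keys = ['', '[UNK]'] + vocab
fixed_table = {tok: i for i, tok in enumerate(keys)}
fixed_oov = 0 % num_oov_buckets + len(keys)  # 10002: no collision
assert fixed_oov not in fixed_table.values()

# Remap the out-of-range OOV value to 1, as preprocess_text does with tf.where.
remapped = [1 if v == len(keys) else v for v in [5, fixed_oov, 42]]
assert remapped == [5, 1, 42]
```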

@vaharoni vaharoni requested a review from a team as a code owner August 27, 2023 11:27
@github-actions

Preview

Preview and run these notebook edits with Google Colab. Rendered notebook diffs are available on ReviewNB.com.

Format and style

Use the TensorFlow docs notebook tools to format for consistent source diffs and lint for style:
$ python3 -m pip install -U --user git+https://github.com/tensorflow/docs

$ python3 -m tensorflow_docs.tools.nbfmt notebook.ipynb
$ python3 -m tensorflow_docs.tools.nblint --arg=repo:tensorflow/docs notebook.ipynb
If commits are added to the pull request, synchronize your local branch: git pull origin tutorial_text_fix_2

@8bitmp3 8bitmp3 added the review in progress Someone is actively reviewing this PR label Aug 29, 2023
MarkDaoust
MarkDaoust previously approved these changes Aug 30, 2023
Member

@MarkDaoust MarkDaoust left a comment

Nice catch. Thanks for all the PRs!

I did a quick edit to use the builtin OOV indexes instead of remapping, and added a cell to test run it.

@github-actions github-actions bot added the lgtm Community-added approval label Aug 30, 2023
@MarkDaoust MarkDaoust added ready to pull Start merge process and removed lgtm Community-added approval review in progress Someone is actively reviewing this PR labels Aug 30, 2023
@vaharoni
Contributor Author

vaharoni commented Aug 31, 2023

This approach works for that section. However, note that in the subsequent section "Export the model", we do the following:

To make the model capable of taking raw strings as input, you will create a Keras TextVectorization layer that performs the same steps as your custom preprocessing function. Since you have already trained a vocabulary, you can use TextVectorization.set_vocabulary instead of TextVectorization.adapt (which would train a new vocabulary).

preprocess_layer = TextVectorization(
    max_tokens=vocab_size,
    standardize=tf_text.case_fold_utf8,
    split=tokenizer.tokenize,
    output_mode='int',
    output_sequence_length=MAX_SEQUENCE_LENGTH)

preprocess_layer.set_vocabulary(vocab)

export_model = tf.keras.Sequential(
    [preprocess_layer, model,
     layers.Activation('sigmoid')])

We do not train export_model; we just export it, relying on the weights of model. However, there is a subtle difference here: model was trained to expect 10001 for the OOV token, as returned by the StaticVocabularyTable from your make_vocab_table function, whereas export_model receives 1 for the OOV token from the TextVectorization layer. This could impact inference, and the statement "you will create a Keras TextVectorization layer that performs the same steps as your custom preprocessing function" seems incorrect.
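
The mismatch can be illustrated with plain-Python stand-ins for the two pipelines (a sketch only; the exact index layout of each lookup is an assumption based on the numbers discussed above, not a reproduction of TF behavior, and the `token_{i}` vocabulary is hypothetical):

```python
vocab = [f"token_{i}" for i in range(10000)]  # hypothetical vocabulary

# Stand-in for the StaticVocabularyTable pipeline used during training:
# known tokens occupy 1 .. len(vocab), and the single built-in OOV
# bucket lands at len(vocab) + 1 = 10001 (assumed layout).
def table_lookup(token):
    known = {tok: i + 1 for i, tok in enumerate(vocab)}
    return known.get(token, len(vocab) + 1)

# Stand-in for TextVectorization with set_vocabulary: 0 is padding,
# 1 is '[UNK]', and known tokens start at 2.
def text_vectorization_lookup(token):
    known = {tok: i + 2 for i, tok in enumerate(vocab)}
    return known.get(token, 1)

# The same OOV token is encoded differently by the two pipelines, so a
# model trained on one encoding sees shifted ids from the other.
assert table_lookup("never-seen") == 10001
assert text_vectorization_lookup("never-seen") == 1
assert table_lookup(vocab[0]) != text_vectorization_lookup(vocab[0])
```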

@8bitmp3 8bitmp3 added ready to pull Start merge process and removed ready to pull Start merge process labels Sep 7, 2023
@MarkDaoust MarkDaoust removed the ready to pull Start merge process label Sep 7, 2023
@MarkDaoust
Member

MarkDaoust commented Sep 7, 2023

Thanks again. What a strange setup! I'll fix it.

@MarkDaoust
Member

Okay, I've removed my commits, let's merge this, and I'll fix the weirdness in a separate PR.

@MarkDaoust MarkDaoust added the ready to pull Start merge process label Sep 8, 2023
@copybara-service copybara-service bot merged commit 6c7e49a into tensorflow:master Sep 9, 2023