Conversation

vaharoni
Contributor

@vaharoni vaharoni commented Aug 27, 2023

In the Beginner tutorials > Load and preprocess data > Text, under Example 2, I believe there is an issue with how OOV tokens are handled. The code does this:

keys = vocab
values = range(2, len(vocab) + 2)  # Reserve `0` for padding, `1` for OOV tokens.

init = tf.lookup.KeyValueTensorInitializer(
    keys, values, key_dtype=tf.string, value_dtype=tf.int64)

num_oov_buckets = 1
vocab_table = tf.lookup.StaticVocabularyTable(init, num_oov_buckets)

However, note that StaticVocabularyTable does not return the value 1 for OOV tokens. Per its documentation, it returns hash(<term>) % num_oov_buckets + vocab_size. With num_oov_buckets = 1, this is always vocab_size, which here is 10000 (not 1, and not 10002). That value collides with the token at original index 9998, which the range starting at 2 also maps to 10000. To fix this, I suggest first mapping OOV tokens to 10002 by providing a vocabulary that inherently includes the padding and OOV tokens:

# Reserve `0` for padding, `1` for OOV tokens.
keys = ['', '[UNK]'] + vocab
values = range(len(keys))

Then, in the subsequent cell, remap the 10002 to 1. I'm using len(keys) here rather than vocab_size + 2, since later in the tutorial we do vocab_size += 2.

def preprocess_text(text, label):
  standardized = tf_text.case_fold_utf8(text)
  tokenized = tokenizer.tokenize(standardized)
  vectorized = vocab_table.lookup(tokenized)
  # StaticVocabularyTable returns the OOV token as vocab_size + 2. We overwrite it to be 1.
  vectorized = tf.where(vectorized == len(keys), tf.constant(1, dtype=tf.int64), vectorized)
  return vectorized, label
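
The collision and the fix can be checked with plain-Python stand-ins for the table (a sketch of the index arithmetic only, simulating the documented hash(term) % num_oov_buckets + vocab_size formula rather than calling TensorFlow; the `token_{i}` vocabulary is hypothetical):

```python
# Simulate the two index schemes; this is a sketch, not the real TF lookup.
vocab_size = 10000
vocab = [f"token_{i}" for i in range(vocab_size)]  # hypothetical vocabulary

# Original tutorial mapping: values 2 .. vocab_size + 1.
orig_table = {tok: i + 2 for i, tok in enumerate(vocab)}
num_oov_buckets = 1
# Per the StaticVocabularyTable docs, an OOV term maps into
# [table_size, table_size + num_oov_buckets). With one bucket this is
# always table_size, i.e. vocab_size = 10000 here, whatever the hash is.
oov_value = 0 % num_oov_buckets + vocab_size
assert oov_value == orig_table[vocab[9998]]  # collides with original index 9998

# Proposed mapping: padding and OOV are part of the keys themselves.
keys = ['', '[UNK]'] + vocab
fixed_table = {tok: i for i, tok in enumerate(keys)}
fixed_oov = 0 % num_oov_buckets + len(keys)  # 10002: no collision
assert fixed_oov not in fixed_table.values()

# Remap the out-of-range OOV value to 1, as preprocess_text does with tf.where.
remapped = [1 if v == len(keys) else v for v in [5, fixed_oov, 42]]
assert remapped == [5, 1, 42]
```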

@vaharoni vaharoni requested a review from a team as a code owner August 27, 2023 11:27
@github-actions

Preview

Preview and run these notebook edits with Google Colab. Rendered notebook diffs are available on ReviewNB.com.

Format and style

Use the TensorFlow docs notebook tools to format for consistent source diffs and lint for style:
$ python3 -m pip install -U --user git+https://github.com/tensorflow/docs

$ python3 -m tensorflow_docs.tools.nbfmt notebook.ipynb
$ python3 -m tensorflow_docs.tools.nblint --arg=repo:tensorflow/docs notebook.ipynb
If commits are added to the pull request, synchronize your local branch: git pull origin tutorial_text_fix_2

@8bitmp3 8bitmp3 added the review in progress Someone is actively reviewing this PR label Aug 29, 2023
MarkDaoust
MarkDaoust previously approved these changes Aug 30, 2023
Member

@MarkDaoust MarkDaoust left a comment

Nice catch. Thanks for all the PRs!

I did a quick edit to use the builtin OOV indexes instead of remapping, and added a cell to test run it.

@github-actions github-actions bot added the lgtm Community-added approval label Aug 30, 2023
@MarkDaoust MarkDaoust added ready to pull Start merge process and removed lgtm Community-added approval review in progress Someone is actively reviewing this PR labels Aug 30, 2023
@vaharoni
Contributor Author

vaharoni commented Aug 31, 2023

This approach works for that section. However, note that in the subsequent section "Export the model", we do the following:

To make the model capable of taking raw strings as input, you will create a Keras TextVectorization layer that performs the same steps as your custom preprocessing function. Since you have already trained a vocabulary, you can use TextVectorization.set_vocabulary instead of TextVectorization.adapt (which would train a new vocabulary).

preprocess_layer = TextVectorization(
    max_tokens=vocab_size,
    standardize=tf_text.case_fold_utf8,
    split=tokenizer.tokenize,
    output_mode='int',
    output_sequence_length=MAX_SEQUENCE_LENGTH)

preprocess_layer.set_vocabulary(vocab)

export_model = tf.keras.Sequential(
    [preprocess_layer, model,
     layers.Activation('sigmoid')])

We do not train export_model; we just export it, relying on the weights of model. However, there is a subtle difference here: model was trained to expect 10001 for the OOV token, as returned by the StaticVocabularyTable from your make_vocab_table function, whereas export_model receives 1 for the OOV token from the TextVectorization layer. This could impact inference, and the statement "you will create a Keras TextVectorization layer that performs the same steps as your custom preprocessing function" seems incorrect.
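
The mismatch can be illustrated with plain-Python stand-ins for the two pipelines (a sketch only; the exact index layout of each lookup is an assumption based on the numbers discussed above, not a reproduction of TF behavior, and the `token_{i}` vocabulary is hypothetical):

```python
vocab = [f"token_{i}" for i in range(10000)]  # hypothetical vocabulary

# Stand-in for the StaticVocabularyTable pipeline used during training:
# known tokens occupy 1 .. len(vocab), and the single built-in OOV
# bucket lands at len(vocab) + 1 = 10001 (assumed layout).
def table_lookup(token):
    known = {tok: i + 1 for i, tok in enumerate(vocab)}
    return known.get(token, len(vocab) + 1)

# Stand-in for TextVectorization with set_vocabulary: 0 is padding,
# 1 is '[UNK]', and known tokens start at 2.
def text_vectorization_lookup(token):
    known = {tok: i + 2 for i, tok in enumerate(vocab)}
    return known.get(token, 1)

# The same OOV token is encoded differently by the two pipelines, so a
# model trained on one encoding sees shifted ids from the other.
assert table_lookup("never-seen") == 10001
assert text_vectorization_lookup("never-seen") == 1
assert table_lookup(vocab[0]) != text_vectorization_lookup(vocab[0])
```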

@8bitmp3 8bitmp3 added ready to pull Start merge process and removed ready to pull Start merge process labels Sep 7, 2023
@MarkDaoust MarkDaoust removed the ready to pull Start merge process label Sep 7, 2023
@MarkDaoust
Member

MarkDaoust commented Sep 7, 2023

Thanks again. What a strange setup! I'll fix it.

@MarkDaoust
Member

Okay, I've removed my commits, let's merge this, and I'll fix the weirdness in a separate PR.

@MarkDaoust MarkDaoust added the ready to pull Start merge process label Sep 8, 2023
@copybara-service copybara-service bot merged commit 6c7e49a into tensorflow:master Sep 9, 2023