Loading text tutorial: fixed OOV handling #2260
Conversation
Preview and run these notebook edits with Google Colab. Rendered notebook diffs are available on ReviewNB.com.

Format and style: use the TensorFlow docs notebook tools to format for consistent source diffs and lint for style:

    $ python3 -m pip install -U --user git+https://github.com/tensorflow/docs

If commits are added to the pull request, synchronize your local branch: git pull origin tutorial_text_fix_2
Nice catch. Thanks for all the PRs!
I did a quick edit to use the built-in OOV indexes instead of remapping, and added a cell to test-run it.
This approach works for that section. However, note that in the subsequent section "Export the model", we do the following:
    preprocess_layer = TextVectorization(
        max_tokens=vocab_size,
        standardize=tf_text.case_fold_utf8,
        split=tokenizer.tokenize,
        output_mode='int',
        output_sequence_length=MAX_SEQUENCE_LENGTH)
    preprocess_layer.set_vocabulary(vocab)

    export_model = tf.keras.Sequential(
        [preprocess_layer, model,
         layers.Activation('sigmoid')])

We do not train …
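As an aside, here is a minimal sketch of what "using the built-in OOV indexes instead of remapping" can look like. The token list and bucket count are made up for illustration and are not the tutorial's actual code; the point is only that the table's OOV bucket ids can be used directly, with the downstream vocabulary sized accordingly:

    import tensorflow as tf

    # Hypothetical in-vocabulary tokens; the tutorial would use its top vocab_size tokens.
    keys = ['the', 'cat', 'sat']
    values = tf.range(len(keys), dtype=tf.int64)       # ids 0 .. len(keys) - 1

    num_oov_buckets = 2
    vocab_table = tf.lookup.StaticVocabularyTable(
        tf.lookup.KeyValueTensorInitializer(keys, values), num_oov_buckets)

    ids = vocab_table.lookup(tf.constant(['the', 'dog', 'sat']))
    # In-vocabulary tokens keep ids 0..2; OOV tokens get
    # len(keys) + hash(token) % num_oov_buckets, i.e. 3 or 4 here.
    # No remapping step: downstream layers simply use a vocabulary size of
    # len(keys) + num_oov_buckets.
    print(ids.numpy())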
Thanks again. What a strange setup! I'll fix it.
The branch was updated from 4daf8a1 to f2ea5cf.
Okay, I've removed my commits. Let's merge this, and I'll fix the weirdness in a separate PR.
In the beginner tutorials > Load and preprocess data > Text, under Example 2, I believe there is an issue with the handling of OOV tokens. The code does this:

However, note that StaticVocabularyTable does not return the value 1 for OOV tokens, but rather hash(<term>) % num_oov_buckets + vocab_size, per its documentation. In our case this is equal to vocab_size, which is actually 10000 (not 1, and not 10002), and it in fact collides with the token at original index 9998.

To fix this, I suggest first mapping OOV tokens to 10002 by providing a vocabulary that inherently includes the padding and OOV tokens:

Then, in the subsequent cell, remap 10002 to 1. I'm using len(keys) here rather than vocab_size + 2 because later in the tutorial we do vocab_size += 2.
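For illustration, here is a toy-scale sketch of the collision and the proposed fix. The token list, the '[PAD]'/'[UNK]' entries, and the bucket count are made-up stand-ins, not the tutorial's actual data; the point is only that StaticVocabularyTable assigns OOV terms the id len(keys) (10002 in the tutorial once padding and OOV are part of the vocabulary), which a subsequent cell can then remap to 1:

    import tensorflow as tf

    # Hypothetical vocabulary that already contains the padding and OOV tokens,
    # mirroring the suggestion above (the tutorial would have ~10002 entries).
    keys = ['[PAD]', '[UNK]', 'the', 'cat', 'sat']
    values = tf.range(len(keys), dtype=tf.int64)          # ids 0 .. len(keys) - 1

    num_oov_buckets = 1
    vocab_table = tf.lookup.StaticVocabularyTable(
        tf.lookup.KeyValueTensorInitializer(keys, values), num_oov_buckets)

    ids = vocab_table.lookup(tf.constant(['the', 'dinosaur']))
    # OOV terms get hash(<term>) % num_oov_buckets + len(keys), i.e. len(keys) here
    # (analogous to 10002 in the tutorial), instead of colliding with a real token id.
    print(ids.numpy())            # [2, 5] -- 'dinosaur' falls in the single OOV bucket

    # Subsequent cell: remap the OOV bucket id back to 1, the [UNK] slot.
    oov_id = len(keys)
    remapped = tf.where(ids == oov_id, tf.ones_like(ids), ids)
    print(remapped.numpy())       # [2, 1]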