
A bug for tf.keras.layers.TextVectorization when built from saved configs and weights #52109

Closed
lankuohsing opened this issue Sep 23, 2021 · 4 comments
Labels
comp:keras Keras related issues type:bug Bug


lankuohsing commented Sep 23, 2021

I tried writing a Python program that saves a tf.keras.layers.TextVectorization layer to disk and loads it back, following the answer at https://stackoverflow.com/questions/65103526/how-to-save-textvectorization-to-disk-in-tensorflow.
A TextVectorization layer rebuilt from the saved config outputs a vector of the wrong length when output_sequence_length is not None and output_mode='int'.
For example, with output_sequence_length=10 and output_mode='int', the layer should output a vector of length 10 for any input text; see vectorizer and new_v2 in the code below.
However, when output_mode='int' is taken from the saved config, the layer does not output a vector of length 10 (it is actually 9, the token count of the sentence, so it seems output_sequence_length is not applied). See new_v1 in the code below.
The interesting thing is that I compared from_disk['config']['output_mode'] with 'int', and they are equal.

import tensorflow as tf
from tensorflow.keras.models import load_model
import pickle
# In[]
max_len = 10  # Sequence length to pad the outputs to.
text_dataset = tf.data.Dataset.from_tensor_slices([
                                                   "I like natural language processing",
                                                   "You like computer vision",
                                                   "I like computer games and computer science"])
# Fit a TextVectorization layer
VOCAB_SIZE = 10  # Maximum vocab size.
vectorizer = tf.keras.layers.TextVectorization(
        max_tokens=None,
        standardize="lower_and_strip_punctuation",
        split="whitespace",
        output_mode='int',
        output_sequence_length=max_len
        )
vectorizer.adapt(text_dataset.batch(64))
# In[]
#print(vectorizer.get_vocabulary())
#print(vectorizer.get_config())
#print(vectorizer.get_weights())
# In[]


# Pickle the config and weights
pickle.dump({'config': vectorizer.get_config(),
             'weights': vectorizer.get_weights()},
            open("./models/tv_layer.pkl", "wb"))


# Later you can unpickle and use
# `config` to create object and
# `weights` to load the trained weights.

from_disk = pickle.load(open("./models/tv_layer.pkl", "rb"))

new_v1 = tf.keras.layers.TextVectorization(
        max_tokens=None,
        standardize="lower_and_strip_punctuation",
        split="whitespace",
        output_mode=from_disk['config']['output_mode'],
        output_sequence_length=from_disk['config']['output_sequence_length'],
        )
# You have to call `adapt` with some dummy data (BUG in Keras)
new_v1.adapt(tf.data.Dataset.from_tensor_slices(["xyz"]))
new_v1.set_weights(from_disk['weights'])
new_v2 = tf.keras.layers.TextVectorization(
        max_tokens=None,
        standardize="lower_and_strip_punctuation",
        split="whitespace",
        output_mode='int',
        output_sequence_length=from_disk['config']['output_sequence_length'],
        )

# You have to call `adapt` with some dummy data (BUG in Keras)
new_v2.adapt(tf.data.Dataset.from_tensor_slices(["xyz"]))
new_v2.set_weights(from_disk['weights'])
print("*" * 10)
# In[]
test_sentence = "Jack likes computer scinece, computer games, and foreign language"

print(vectorizer(test_sentence))
print(new_v1(test_sentence))
print(new_v2(test_sentence))
print(from_disk['config']['output_mode'] == 'int')

Here are the print() outputs:

**********
tf.Tensor([ 1  1  3  1  3 11 12  1 10  0], shape=(10,), dtype=int64)
tf.Tensor([ 1  1  3  1  3 11 12  1 10], shape=(9,), dtype=int64)
tf.Tensor([ 1  1  3  1  3 11 12  1 10  0], shape=(10,), dtype=int64)
True

Does anyone know why?
I have also raised the same issue in the Keras repo: keras-team/keras#15382
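For anyone hitting this on affected versions: instead of copying individual kwargs out of the saved config (the path that triggers the bug), the whole config can be handed back to the class via from_config, which restores every constructor argument, including output_sequence_length, in one step. A minimal, self-contained sketch of that approach (the names restored and blob are illustrative, and the pickle round-trip happens in memory rather than on disk):

```python
import pickle
import tensorflow as tf

# Fit a layer as in the report above.
vectorizer = tf.keras.layers.TextVectorization(
    output_mode='int', output_sequence_length=10)
vectorizer.adapt(tf.data.Dataset.from_tensor_slices(
    ["I like natural language processing",
     "You like computer vision"]).batch(64))

# Round-trip the config and weights through pickle.
blob = pickle.dumps({'config': vectorizer.get_config(),
                     'weights': vectorizer.get_weights()})
saved = pickle.loads(blob)

# Rebuild via from_config instead of passing selected kwargs by hand,
# so all saved arguments are applied together.
restored = tf.keras.layers.TextVectorization.from_config(saved['config'])
# adapt() on dummy data builds the internal lookup tables so that
# set_weights() has matching variables to write into.
restored.adapt(tf.data.Dataset.from_tensor_slices(["xyz"]))
restored.set_weights(saved['weights'])

print(restored("I like computer vision").shape)  # expected: (10,)
```

This avoids the buggy comparison entirely because no config value is re-passed as a hand-picked keyword argument.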

@lankuohsing lankuohsing added the type:bug Bug label Sep 23, 2021
@sushreebarsa sushreebarsa added the comp:keras Keras related issues label Sep 27, 2021
@sushreebarsa
Contributor

@lankuohsing We see that the issue is posted in the Keras repo and PR 15422 has been merged for this issue. Please let us know if we can close this ticket here. Thanks!

@sushreebarsa sushreebarsa added the stat:awaiting response Status - Awaiting response from author label Sep 28, 2021
@lankuohsing
Author

> @lankuohsing We see that the issue is posted in the Keras repo and PR 15422 has been merged for this issue. Please let us know if we can close this ticket here. Thanks!

@sushreebarsa The solution in that PR is useful! Thanks! You can close the issue.

@sushreebarsa sushreebarsa removed the stat:awaiting response Status - Awaiting response from author label Sep 29, 2021
@sushreebarsa
Contributor

@lankuohsing
Thank you for your update; moving this issue to closed status as it is resolved. Thanks!
