
A bug for tf.keras.layers.TextVectorization when built from saved configs and weights #52109

Closed
lankuohsing opened this issue Sep 23, 2021 · 4 comments
Labels
comp:keras Keras related issues type:bug Bug


lankuohsing commented Sep 23, 2021

I tried writing a Python program that saves a tf.keras.layers.TextVectorization layer to disk and loads it back, following the answer at https://stackoverflow.com/questions/65103526/how-to-save-textvectorization-to-disk-in-tensorflow.
A TextVectorization layer rebuilt from the saved config outputs a vector of the wrong length when output_sequence_length is not None and output_mode='int'.
For example, with output_sequence_length=10 and output_mode='int', the layer should output a vector of length 10 for any input text; see vectorizer and new_v2 in the code below.
However, when output_mode='int' is taken from the saved config, the layer does not output a vector of length 10 (it is actually 9, the token count of the sentence, so it seems output_sequence_length is not applied). See new_v1 in the code below.
The interesting thing is that I compared from_disk['config']['output_mode'] with 'int', and they are equal.

import tensorflow as tf
from tensorflow.keras.models import load_model
import pickle
# In[]
max_len = 10  # Sequence length to pad the outputs to.
text_dataset = tf.data.Dataset.from_tensor_slices([
                                                   "I like natural language processing",
                                                   "You like computer vision",
                                                   "I like computer games and computer science"])
# Fit a TextVectorization layer
VOCAB_SIZE = 10  # Maximum vocab size.
vectorizer = tf.keras.layers.TextVectorization(
        max_tokens=None,
        standardize="lower_and_strip_punctuation",
        split="whitespace",
        output_mode='int',
        output_sequence_length=max_len
        )
vectorizer.adapt(text_dataset.batch(64))
# In[]
#print(vectorizer.get_vocabulary())
#print(vectorizer.get_config())
#print(vectorizer.get_weights())
# In[]


# Pickle the config and weights
pickle.dump({'config': vectorizer.get_config(),
             'weights': vectorizer.get_weights()},
            open("./models/tv_layer.pkl", "wb"))


# Later you can unpickle and use
# `config` to create object and
# `weights` to load the trained weights.

from_disk = pickle.load(open("./models/tv_layer.pkl", "rb"))

new_v1 = tf.keras.layers.TextVectorization(
        max_tokens=None,
        standardize="lower_and_strip_punctuation",
        split="whitespace",
        output_mode=from_disk['config']['output_mode'],
        output_sequence_length=from_disk['config']['output_sequence_length'],
        )
# You have to call `adapt` with some dummy data (BUG in Keras)
new_v1.adapt(tf.data.Dataset.from_tensor_slices(["xyz"]))
new_v1.set_weights(from_disk['weights'])
new_v2 = tf.keras.layers.TextVectorization(
        max_tokens=None,
        standardize="lower_and_strip_punctuation",
        split="whitespace",
        output_mode='int',
        output_sequence_length=from_disk['config']['output_sequence_length'],
        )

# You have to call `adapt` with some dummy data (BUG in Keras)
new_v2.adapt(tf.data.Dataset.from_tensor_slices(["xyz"]))
new_v2.set_weights(from_disk['weights'])
print("*" * 10)
# In[]
test_sentence = "Jack likes computer scinece, computer games, and foreign language"

print(vectorizer(test_sentence))
print(new_v1(test_sentence))
print(new_v2(test_sentence))
print(from_disk['config']['output_mode'] == 'int')

Here are the print() outputs:

**********
tf.Tensor([ 1  1  3  1  3 11 12  1 10  0], shape=(10,), dtype=int64)
tf.Tensor([ 1  1  3  1  3 11 12  1 10], shape=(9,), dtype=int64)
tf.Tensor([ 1  1  3  1  3 11 12  1 10  0], shape=(10,), dtype=int64)
True

Does anyone know why?
I have also raised the same issue in the Keras repo: keras-team/keras#15382
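For anyone hitting this on affected versions: instead of copying individual kwargs out of the saved config (the path that triggers the bug), the whole config can be handed back to the class via from_config, which restores every constructor argument, including output_sequence_length, in one step. A minimal, self-contained sketch of that approach (the names restored and blob are illustrative, and the pickle round-trip happens in memory rather than on disk):

```python
import pickle
import tensorflow as tf

# Fit a layer as in the report above.
vectorizer = tf.keras.layers.TextVectorization(
    output_mode='int', output_sequence_length=10)
vectorizer.adapt(tf.data.Dataset.from_tensor_slices(
    ["I like natural language processing",
     "You like computer vision"]).batch(64))

# Round-trip the config and weights through pickle.
blob = pickle.dumps({'config': vectorizer.get_config(),
                     'weights': vectorizer.get_weights()})
saved = pickle.loads(blob)

# Rebuild via from_config instead of passing selected kwargs by hand,
# so all saved arguments are applied together.
restored = tf.keras.layers.TextVectorization.from_config(saved['config'])
# adapt() on dummy data builds the internal lookup tables so that
# set_weights() has matching variables to write into.
restored.adapt(tf.data.Dataset.from_tensor_slices(["xyz"]))
restored.set_weights(saved['weights'])

print(restored("I like computer vision").shape)  # expected: (10,)
```

This avoids the buggy comparison entirely because no config value is re-passed as a hand-picked keyword argument.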

@lankuohsing lankuohsing added the type:bug Bug label Sep 23, 2021
@sushreebarsa sushreebarsa added the comp:keras Keras related issues label Sep 27, 2021
@sushreebarsa
Contributor

@lankuohsing We see that the issue is posted in the Keras repo and PR 15422 has been merged for this issue. Please let us know if we can close this ticket here. Thanks!

@sushreebarsa sushreebarsa added the stat:awaiting response Status - Awaiting response from author label Sep 28, 2021
@lankuohsing
Author

> @lankuohsing We see that the issue is posted in the Keras repo and PR 15422 has been merged for this issue. Please let us know if we can close this ticket here. Thanks!

@sushreebarsa The solution in that PR is useful! Thanks! You can close the issue.

@sushreebarsa sushreebarsa removed the stat:awaiting response Status - Awaiting response from author label Sep 29, 2021
@sushreebarsa
Contributor

@lankuohsing
Thank you for your update; moving this issue to closed status as it is resolved. Thanks!
