Loading embeddings into the graph fails with: libprotobuf ERROR google/protobuf/io/zero_copy_stream_impl_lite.cc 164 Cannot allocate buffer larger than kint32max for StringOutputStream #31093

Closed
vitalyli opened this issue Jul 27, 2019 · 6 comments
Assignees
Labels
comp:keras (Keras related issues), TF 1.13 (Issues related to TF 1.13), type:bug (Bug)

Comments

@vitalyli

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow):
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04):
    macOS 10.14
  • TensorFlow installed from (source or binary):
    Binary
  • TensorFlow version (use command below):
    1.13.1
  • Python version:
    Python 3.6.7 |Anaconda

Describe the current behavior
Loading 1.4 million 100-dim embeddings into the graph fails with the libprotobuf error shown below.
words: 1457657; dim: 100

Describe the expected behavior
I want to be able to use a TFRecord dataset containing words, have the graph look up indexes via a TF table, and then do a parallel lookup of the embedding vectors, computing with a 3-dim tensor.
This used to work with a smaller set of embeddings.
Is there any way to overcome this problem without rewriting the data feed?

Code to reproduce the issue
import numpy as np
import tensorflow as tf
from tensorflow.python.ops import lookup_ops

# embDic holds the vocab, vocab_size, dim and embedding matrix loaded from the Word2Vec file.
w_embedding_vocab = tf.constant(embDic.vocab, dtype=tf.string, shape=[embDic.vocab_size], name="w_embedding_vocab")

w_embedding_vocab_table = lookup_ops.index_table_from_tensor(w_embedding_vocab, default_value=0, name="word_embidx_tbl")

w_embeddings = tf.get_variable(name="word_embeddings", shape=[embDic.vocab_size, embDic.dim],
                               initializer=tf.constant_initializer(np.asmatrix(embDic.embeddings)),
                               dtype=tf.float32, trainable=False)

Other info / logs
No other logs, just one ERROR message

[libprotobuf ERROR google/protobuf/io/zero_copy_stream_impl_lite.cc:164] Cannot allocate buffer larger than kint32max for StringOutputStream.

@vitalyli
Author

vitalyli commented Jul 27, 2019

The workaround so far was to reduce the phrase vocabulary, which shrank the graph from 1.4 GB to 500 MB so the above error no longer appears, but it's not really a scalable solution.

@gadagashwini-zz gadagashwini-zz self-assigned this Jul 29, 2019
@gadagashwini-zz gadagashwini-zz added TF 1.13 Issues related to TF 1.13 comp:ops OPs related issues labels Jul 29, 2019
@gadagashwini-zz
Contributor

@vitalyli, please provide us with a full minimal code snippet. It will indeed help us move faster.

@gadagashwini-zz gadagashwini-zz added the stat:awaiting response Status - Awaiting response from author label Jul 29, 2019
@vitalyli
Author

vitalyli commented Jul 29, 2019

I can't share the embeddings file, but the core issue is that it's too large for the graph; the file is about 1.5 GB in size, with words and word vectors of dimension 100.
My guess is that some phrase words may have ended up lengthy and take more space.

The error prints while loading the w_embeddings matrix. I don't remember seeing this in the past, but that could be because I had not crossed this limit before.
So my question is whether there is a better/more scalable way to do this that avoids the graph size limit, or whether there is a way to upgrade something and relax this limit.
It's definitely not a machine memory problem, as there is enough memory to load many times that size.
The way I could solve this was by removing all phrase vectors, which prevented the error from happening;
however, I'm looking for a more general solution to this issue, not one that drops the pre-trained embeddings.
Training data is loaded from TFRecord files via the Dataset API. There I have a tensor with a list of words. The code below maps each word to a word index and then to its word embeddings.

The TFRecords are parsed this way, which works, but there seems to be no place for an external mapping of word -> index -> embedding unless it's done as part of the graph.
The only other alternative is to generate TFRecords with word indexes instead of the words themselves, which would avoid the hash table lookup in the graph, but that's an operational headache and less efficient.

def _parse_function(example_proto):

    template = {

        'qw': tf.FixedLenFeature([8], tf.string),
        'qw_seq': tf.FixedLenFeature([1], tf.int64),

        'ph_w': tf.FixedLenFeature([1], tf.string),
        'ph_seq': tf.FixedLenFeature([1], tf.int64),

        'qu_f': tf.FixedLenFeature([N_QU_F], tf.float32),

        'lbl': tf.FixedLenFeature([2], tf.int64)
    }

    parsed_features = tf.parse_single_example(example_proto, template)

    return parsed_features
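
For context, here is a minimal sketch (my reconstruction, not code from the original report) of how _parse_function could be wired into the tf.data pipeline that produces the ds_handle/train_handle fed into the training loop further below; the file path and batch size are placeholders:

    # Hypothetical tf.data wiring (TF 1.x); "train.tfrecord" and the batch size are placeholders.
    train_ds = (tf.data.TFRecordDataset(["train.tfrecord"])
                .map(_parse_function)
                .batch(128)
                .repeat())

    # Feedable iterator so train and eval datasets can share the same graph input.
    ds_handle = tf.placeholder(tf.string, shape=[], name="ds_handle")
    iterator = tf.data.Iterator.from_string_handle(
        ds_handle, train_ds.output_types, train_ds.output_shapes)
    features = iterator.get_next()  # dict with 'qw', 'ph_w', 'qu_f', 'lbl', ...

    train_iter = train_ds.make_one_shot_iterator()
    # Inside the session: train_handle = sess.run(train_iter.string_handle())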

Below is how the graph embeddings are initialized, where:
embDic.embeddings is an array of 100-dim numpy vectors loaded from a Word2Vec file.
embDic.vocab is the corresponding vocabulary, as an array of words, also loaded from the Word2Vec file.

    with graph.as_default():

        with tf.name_scope("EmbLoadScope"):

            w_embedding_vocab = tf.constant(embDic.vocab, dtype=tf.string, shape=[embDic.vocab_size], name="w_embedding_vocab")

            w_embedding_vocab_table = lookup_ops.index_table_from_tensor(w_embedding_vocab, default_value=0, name="word_embidx_tbl")

            w_embeddings = tf.get_variable(name="word_embeddings", shape=[embDic.vocab_size, embDic.dim],
                                           initializer=tf.constant_initializer(np.asmatrix(embDic.embeddings)),
                                           dtype=tf.float32, trainable=False)

    with tf.name_scope("InitVar"):
        init_variables = tf.group(tf.local_variables_initializer(), tf.global_variables_initializer(), tf.tables_initializer(), name='init_var_op')

    config = tf.ConfigProto(allow_soft_placement=True)

    with graph.as_default() as gg:

        with tf.Session(graph=gg, config=config) as sess:

            init_variables.run()

            # etc. training loop here
            while True:
                _, g_step = sess.run([train_op, global_step],
                                     feed_dict={ds_handle: train_handle,
                                                x_learning_rate: learn_rate,
                                                x_pkeep: training_dropout})
@tensorflowbutler tensorflowbutler removed the stat:awaiting response Status - Awaiting response from author label Jul 30, 2019
@gadagashwini-zz gadagashwini-zz added the comp:data tf.data related issues label Jul 30, 2019
@jvishnuvardhan jvishnuvardhan added stat:awaiting tensorflower Status - Awaiting response from tensorflower comp:keras Keras related issues and removed comp:ops OPs related issues labels Jul 30, 2019
@jsimsa jsimsa removed the comp:data tf.data related issues label Jul 30, 2019
@jsimsa
Contributor

jsimsa commented Jul 30, 2019

This issue is not related to tf.data.

@tensorflowbutler tensorflowbutler removed the stat:awaiting tensorflower Status - Awaiting response from tensorflower label Jul 31, 2019
@vitalyli
Author

vitalyli commented Aug 9, 2019

Never mind; the solution to this is to avoid the constant initializer and load the vectors via a placeholder, which keeps the protobuf size small.
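
For anyone hitting the same error, a minimal sketch of that placeholder-based approach (my reconstruction of the pattern, not the exact code used here): the variable gets no constant initializer, and the numpy matrix is fed once through a placeholder, so the ~1.5 GB of vectors is never serialized into the GraphDef.

    with graph.as_default():
        w_embeddings = tf.get_variable(name="word_embeddings",
                                       shape=[embDic.vocab_size, embDic.dim],
                                       dtype=tf.float32, trainable=False)

        # The placeholder carries the numpy matrix at session run time instead of
        # baking it into the graph as a constant.
        emb_placeholder = tf.placeholder(tf.float32, shape=[embDic.vocab_size, embDic.dim])
        emb_init_op = w_embeddings.assign(emb_placeholder)

    with tf.Session(graph=graph) as sess:
        sess.run(tf.global_variables_initializer())
        sess.run(emb_init_op, feed_dict={emb_placeholder: embDic.embeddings})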

@vitalyli vitalyli closed this as completed Aug 9, 2019