Loading embeddings into the graph fails with: libprotobuf ERROR google/protobuf/io/zero_copy_stream_impl_lite.cc 164 Cannot allocate buffer larger than kint32max for StringOutputStream #31093

Closed
vitalyli opened this issue Jul 27, 2019 · 6 comments
Assignees
Labels
comp:keras (Keras related issues), TF 1.13 (Issues related to TF 1.13), type:bug (Bug)

Comments

@vitalyli

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow):
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04):
    macOS 10.14
  • TensorFlow installed from (source or binary):
    Binary
  • TensorFlow version (use command below):
    1.13.1
  • Python version:
    Python 3.6.7 |Anaconda

Describe the current behavior
Loading 1.4 million 100-dim embeddings into the graph fails with the libprotobuf error shown below.
words: 1457657; dim: 100

Describe the expected behavior
I want to be able to use a TFRecord dataset containing words, have the graph look up indexes via a TF table, and then do a parallel lookup of the embedding vectors, computing with a 3-dim tensor.
This used to work with a smaller set of embeddings.
Is there any way to overcome this problem without rewriting the data feed?

Code to reproduce the issue
import numpy as np
import tensorflow as tf
from tensorflow.python.ops import lookup_ops

# embDic holds the vocab, vocab_size, dim and embedding matrix loaded from the Word2Vec file.
w_embedding_vocab = tf.constant(embDic.vocab, dtype=tf.string, shape=[embDic.vocab_size], name="w_embedding_vocab")

w_embedding_vocab_table = lookup_ops.index_table_from_tensor(w_embedding_vocab, default_value=0, name="word_embidx_tbl")

w_embeddings = tf.get_variable(name="word_embeddings", shape=[embDic.vocab_size, embDic.dim],
                               initializer=tf.constant_initializer(np.asmatrix(embDic.embeddings)),
                               dtype=tf.float32, trainable=False)

Other info / logs
No other logs, just one ERROR message

[libprotobuf ERROR google/protobuf/io/zero_copy_stream_impl_lite.cc:164] Cannot allocate buffer larger than kint32max for StringOutputStream.

@vitalyli
Author

vitalyli commented Jul 27, 2019

The workaround so far was to reduce the phrase vocabulary, which shrank the graph from 1.4 GB to 500 MB so the above error no longer appears, but it's not really a scalable solution.

@gadagashwini-zz gadagashwini-zz self-assigned this Jul 29, 2019
@gadagashwini-zz gadagashwini-zz added TF 1.13 Issues related to TF 1.13 comp:ops OPs related issues labels Jul 29, 2019
@gadagashwini-zz
Contributor

@vitalyli, please provide us with a full minimal code snippet. It will indeed help us move faster.

@gadagashwini-zz gadagashwini-zz added the stat:awaiting response Status - Awaiting response from author label Jul 29, 2019
@vitalyli
Author

vitalyli commented Jul 29, 2019

I can't share the embeddings file, but the core issue is that it's too large for the graph; the file is about 1.5 GB in size, with words and word vectors of dimension 100.
My guess is that some phrase words may have ended up lengthy and take more space.

The error prints while loading the w_embeddings matrix. I don't remember seeing this in the past, but that could be because I had not crossed this limit before.
So my question is whether there is a better/more scalable way to do this that avoids the graph size limit, or whether there is a way to upgrade something and relax this limit.
It's definitely not a machine memory problem, as there is enough memory to load many times that size.
The way I could solve this was by removing all phrase vectors, which prevented the error from happening;
however, I'm looking for a more general solution to this issue, not one that drops the pre-trained embeddings.
Training data is loaded from TFRecord files via the Dataset API. There I have a tensor with a list of words. The code below maps each word to a word index and then to its word embeddings.

The TFRecords are parsed this way, which works, but there seems to be no place for an external mapping of word -> index -> embedding unless it's done as part of the graph.
The only other alternative is to generate TFRecords with word indexes instead of the words themselves, which would avoid the hash table lookup in the graph, but that's an operational headache and less efficient.

def _parse_function(example_proto):

    template = {

        'qw': tf.FixedLenFeature([8], tf.string),
        'qw_seq': tf.FixedLenFeature([1], tf.int64),

        'ph_w': tf.FixedLenFeature([1], tf.string),
        'ph_seq': tf.FixedLenFeature([1], tf.int64),

        'qu_f': tf.FixedLenFeature([N_QU_F], tf.float32),

        'lbl': tf.FixedLenFeature([2], tf.int64)
    }

    parsed_features = tf.parse_single_example(example_proto, template)

    return parsed_features
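
For context, here is a minimal sketch (my reconstruction, not code from the original report) of how _parse_function could be wired into the tf.data pipeline that produces the ds_handle/train_handle fed into the training loop further below; the file path and batch size are placeholders:

    # Hypothetical tf.data wiring (TF 1.x); "train.tfrecord" and the batch size are placeholders.
    train_ds = (tf.data.TFRecordDataset(["train.tfrecord"])
                .map(_parse_function)
                .batch(128)
                .repeat())

    # Feedable iterator so train and eval datasets can share the same graph input.
    ds_handle = tf.placeholder(tf.string, shape=[], name="ds_handle")
    iterator = tf.data.Iterator.from_string_handle(
        ds_handle, train_ds.output_types, train_ds.output_shapes)
    features = iterator.get_next()  # dict with 'qw', 'ph_w', 'qu_f', 'lbl', ...

    train_iter = train_ds.make_one_shot_iterator()
    # Inside the session: train_handle = sess.run(train_iter.string_handle())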

Below is how the graph embeddings are initialized, where:
embDic.embeddings is an array of 100-dim numpy vectors loaded from a Word2Vec file.
embDic.vocab is the corresponding vocabulary, as an array of words, also loaded from the Word2Vec file.

    with graph.as_default():

        with tf.name_scope("EmbLoadScope"):

            w_embedding_vocab = tf.constant(embDic.vocab, dtype=tf.string, shape=[embDic.vocab_size], name="w_embedding_vocab")

            w_embedding_vocab_table = lookup_ops.index_table_from_tensor(w_embedding_vocab, default_value=0, name="word_embidx_tbl")

            w_embeddings = tf.get_variable(name="word_embeddings", shape=[embDic.vocab_size, embDic.dim],
                                           initializer=tf.constant_initializer(np.asmatrix(embDic.embeddings)),
                                           dtype=tf.float32, trainable=False)

    with tf.name_scope("InitVar"):
        init_variables = tf.group(tf.local_variables_initializer(), tf.global_variables_initializer(), tf.tables_initializer(), name='init_var_op')

    config = tf.ConfigProto(allow_soft_placement=True)

    with graph.as_default() as gg:

        with tf.Session(graph=gg, config=config) as sess:

            init_variables.run()

            # etc. training loop here
            while True:
                _, g_step = sess.run([train_op, global_step],
                                     feed_dict={ds_handle: train_handle,
                                                x_learning_rate: learn_rate,
                                                x_pkeep: training_dropout})
@tensorflowbutler tensorflowbutler removed the stat:awaiting response Status - Awaiting response from author label Jul 30, 2019
@gadagashwini-zz gadagashwini-zz added the comp:data tf.data related issues label Jul 30, 2019
@jvishnuvardhan jvishnuvardhan added stat:awaiting tensorflower Status - Awaiting response from tensorflower comp:keras Keras related issues and removed comp:ops OPs related issues labels Jul 30, 2019
@jsimsa jsimsa removed the comp:data tf.data related issues label Jul 30, 2019
@jsimsa
Contributor

jsimsa commented Jul 30, 2019

This issue is not related to tf.data.

@tensorflowbutler tensorflowbutler removed the stat:awaiting tensorflower Status - Awaiting response from tensorflower label Jul 31, 2019
@vitalyli
Author

vitalyli commented Aug 9, 2019

Never mind; the solution to this is to avoid the constant initializer and load the vectors via a placeholder, which keeps the protobuf size small.
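
For anyone hitting the same error, a minimal sketch of that placeholder-based approach (my reconstruction of the pattern, not the exact code used here): the variable gets no constant initializer, and the numpy matrix is fed once through a placeholder, so the ~1.5 GB of vectors is never serialized into the GraphDef.

    with graph.as_default():
        w_embeddings = tf.get_variable(name="word_embeddings",
                                       shape=[embDic.vocab_size, embDic.dim],
                                       dtype=tf.float32, trainable=False)

        # The placeholder carries the numpy matrix at session run time instead of
        # baking it into the graph as a constant.
        emb_placeholder = tf.placeholder(tf.float32, shape=[embDic.vocab_size, embDic.dim])
        emb_init_op = w_embeddings.assign(emb_placeholder)

    with tf.Session(graph=graph) as sess:
        sess.run(tf.global_variables_initializer())
        sess.run(emb_init_op, feed_dict={emb_placeholder: embDic.embeddings})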

@vitalyli vitalyli closed this as completed Aug 9, 2019