
ran out of memory in eager execution #20218

Closed
qijimrc opened this issue Jun 22, 2018 · 20 comments
@qijimrc

qijimrc commented Jun 22, 2018

System information

  • Have I written custom code: Yes
  • OS Platform and Distribution: Linux Ubuntu 16.04
  • TensorFlow installed from: source
  • TensorFlow version: 1.8.0-gpu
  • Python version: 3.5
  • CUDA/cuDNN version: 9.0/7.0
  • GPU model and memory: GeForce GTX 1060 / 6 GB
  • Bazel version (if compiling from source): N/A
  • Exact command to reproduce:
    General model-training steps

Describe the problem

I wrote a neural language model using TF eager mode; it has only one LSTM cell.
Some of its hyperparameters are:

hidden_dim : 128
vocab_size : 7393
seq_len : <=50
batch_size : 32

In particular, I did not write the model by subclassing the Keras base class; instead I used tfe.Variable directly to implement the details of the model step by step. Training is normal at first and the model performs well, but a memory-exhaustion error occurs every time training reaches roughly step 960. I suspect some tensors are not released and accumulate during training, because the error still appears at around step 960 even after I adjust the batch size.
The error log is below. I think this is a memory-usage problem of TF eager mode; however, a custom model is a necessary choice for some of my experiments.
How can I avoid this problem and let the model train smoothly?
Thanks a lot.

Source code / logs

loss at epoch 0 step 954: 6.665831
loss at epoch 0 step 955: 6.716213
loss at epoch 0 step 956: 6.761374
loss at epoch 0 step 957: 6.712702
loss at epoch 0 step 958: 6.753476
loss at epoch 0 step 959: 6.481436
loss at epoch 0 step 960: 6.544363
loss at epoch 0 step 961: 6.562449
loss at epoch 0 step 962: 6.701943
loss at epoch 0 step 963: 6.415178
loss at epoch 0 step 964: 6.630089
loss at epoch 0 step 965: 6.580114
2018-06-22 11:53:38.547278: W tensorflow/core/common_runtime/bfc_allocator.cc:275] Allocator (GPU_0_bfc) ran out of memory trying to allocate 92.05MiB. Current allocation summary follows.
2018-06-22 11:53:38.547428: I tensorflow/core/common_runtime/bfc_allocator.cc:630] Bin (256): Total Chunks: 3, Chunks in use: 3. 768B allocated for chunks. 768B in use in bin. 12B client-requested in use in bin.
2018-06-22 11:53:38.547459: I tensorflow/core/common_runtime/bfc_allocator.cc:630] Bin (512): Total Chunks: 17, Chunks in use: 16. 8.5KiB allocated for chunks. 8.0KiB in use in bin. 8.0KiB client-requested in use in bin.
2018-06-22 11:53:38.547482: I tensorflow/core/common_runtime/bfc_allocator.cc:630] Bin (1024): Total Chunks: 1, Chunks in use: 1. 1.2KiB allocated for chunks. 1.2KiB in use in bin. 1.0KiB client-requested in use in bin.
2018-06-22 11:53:38.547504: I tensorflow/core/common_runtime/bfc_allocator.cc:630] Bin (2048): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2018-06-22 11:53:38.547525: I tensorflow/core/common_runtime/bfc_allocator.cc:630] Bin (4096): Total Chunks: 1, Chunks in use: 0. 4.0KiB allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2018-06-22 11:53:38.547544: I tensorflow/core/common_runtime/bfc_allocator.cc:630] Bin (8192): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2018-06-22 11:53:38.547568: I tensorflow/core/common_runtime/bfc_allocator.cc:630] Bin (16384): Total Chunks: 5, Chunks in use: 4. 135.5KiB allocated for chunks. 110.0KiB in use in bin. 108.8KiB client-requested in use in bin.
2018-06-22 11:53:38.547593: I tensorflow/core/common_runtime/bfc_allocator.cc:630] Bin (32768): Total Chunks: 920, Chunks in use: 918. 28.87MiB allocated for chunks. 28.81MiB in use in bin. 28.69MiB client-requested in use in bin.
2018-06-22 11:53:38.547617: I tensorflow/core/common_runtime/bfc_allocator.cc:630] Bin (65536): Total Chunks: 410, Chunks in use: 410. 26.12MiB allocated for chunks. 26.12MiB in use in bin. 25.66MiB client-requested in use in bin.
2018-06-22 11:53:38.547640: I tensorflow/core/common_runtime/bfc_allocator.cc:630] Bin (131072): Total Chunks: 20, Chunks in use: 20. 2.69MiB allocated for chunks. 2.69MiB in use in bin. 2.69MiB client-requested in use in bin.
...
2018-06-22 11:53:38.548461: I tensorflow/core/common_runtime/bfc_allocator.cc:665] Chunk at 0x102089bc700 of size 512
2018-06-22 11:53:38.548478: I tensorflow/core/common_runtime/bfc_allocator.cc:665] Chunk at 0x102089bc900 of size 131072
2018-06-22 11:53:38.548494: I tensorflow/core/common_runtime/bfc_allocator.cc:665] Chunk at 0x102089dc900 of size 512
2018-06-22 11:53:38.548510: I tensorflow/core/common_runtime/bfc_allocator.cc:665] Chunk at 0x102089dcb00 of size 131072
2018-06-22 11:53:38.548526: I tensorflow/core/common_runtime/bfc_allocator.cc:665] Chunk at 0x102089fcb00 of size 512
2018-06-22 11:53:38.548542: I tensorflow/core/common_runtime/bfc_allocator.cc:665] Chunk at 0x102089fcd00 of size 131072
2018-06-22 11:53:38.548558: I tensorflow/core/common_runtime/bfc_allocator.cc:665] Chunk at 0x10208a1cd00 of size 512
2018-06-22 11:53:38.548574: I tensorflow/core/common_runtime/bfc_allocator.cc:665] Chunk at 0x10208a1cf00 of size 131072
2018-06-22 11:53:38.548589: I tensorflow/core/common_runtime/bfc_allocator.cc:665] Chunk at 0x10208a3cf00 of size 512
2018-06-22 11:53:38.548605: I tensorflow/core/common_runtime/bfc_allocator.cc:665] Chunk at 0x10208a3d100 of size 131072
2018-06-22 11:53:38.548622: I tensorflow/core/common_runtime/bfc_allocator.cc:665] Chunk at 0x10208a5d100 of size 512
2018-06-22 11:53:38.548638: I tensorflow/core/common_runtime/bfc_allocator.cc:665] Chunk at 0x10208a5d300 of size 131072
2018-06-22 11:53:38.548654: I tensorflow/core/common_runtime/bfc_allocator.cc:665] Chunk at 0x10208a7d300 of size 512
2018-06-22 11:53:38.548669: I tensorflow/core/common_runtime/bfc_allocator.cc:665] Chunk at 0x10208a7d500 of size 131072
2018-06-22 11:53:38.548685: I tensorflow/core/common_runtime/bfc_allocator.cc:665] Chunk at 0x10208a9d500 of size 512
2018-06-22 11:53:38.548702: I tensorflow/core/common_runtime/bfc_allocator.cc:665] Chunk at 0x10208a9d700 of size 7570432
2018-06-22 11:53:38.548718: I tensorflow/core/common_runtime/bfc_allocator.cc:665] Chunk at 0x102091d5b00 of size 29696
2018-06-22 11:53:38.548735: I tensorflow/core/common_runtime/bfc_allocator.cc:665] Chunk at 0x102091dcf00 of size 32768
2018-06-22 11:53:38.548752: I tensorflow/core/common_runtime/bfc_allocator.cc:665] Chunk at 0x102091e4f00 of size 32768
2018-06-22 11:53:38.548768: I tensorflow/core/common_runtime/bfc_allocator.cc:665] Chunk at 0x102091ecf00 of size 32768
2018-06-22 11:53:38.548783: I tensorflow/core/common_runtime/bfc_allocator.cc:665] Chunk at 0x102091f4f00 of size 32768
...
2018-06-22 11:53:38.796751: I tensorflow/core/common_runtime/bfc_allocator.cc:680] Stats:
Limit: 5272961024
InUse: 3087996928
MaxInUse: 3174426368
NumAllocs: 2214674
MaxAllocSize: 140052992

2018-06-22 11:53:38.796832: W tensorflow/core/common_runtime/bfc_allocator.cc:279] ____________****
2018-06-22 11:53:38.796866: W tensorflow/core/framework/op_kernel.cc:1318] OP_REQUIRES failed at bias_op.cc:341 : Resource exhausted: OOM when allocating tensor with shape[3200,7393] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc

@tensorflowbutler tensorflowbutler added the stat:awaiting response Status - Awaiting response from author label Jun 22, 2018
@tensorflowbutler
Member

Thank you for your post. We noticed you have not filled out the following fields in the issue template. Could you update them if they are relevant in your case, or leave them as N/A? Thanks.
Bazel version
Exact command to reproduce

@tatianashp tatianashp added the comp:eager Eager related issues label Jun 22, 2018
@tatianashp tatianashp assigned alextp and unassigned tatianashp Jun 22, 2018
@alextp
Contributor

alextp commented Jun 22, 2018

Interesting. Can you share some code about how you're creating your variables?

Alternatively, an easy way to tell whether you're leaking variables is to set log_device_placement=True in the ConfigProto you can pass to enable_eager_execution. If you see it creating VarHandleOp many times after the first minibatch, then you know somewhere in your code is creating variables on every iteration.
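
For reference, a minimal sketch of that suggestion (assuming TF 1.8 and that eager execution has not yet been enabled elsewhere in the script):

```python
import tensorflow as tf

# Device placement logging has to be configured before eager execution starts.
config = tf.ConfigProto(log_device_placement=True)
tf.enable_eager_execution(config=config)

# From here on, every executed op is logged together with its device.
# Repeated VarHandleOp lines after the first minibatch suggest that
# variables are being recreated inside the training loop.
```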

@qijimrc
Author

qijimrc commented Jun 25, 2018

Hi @alextp, thank you very much for your help. Here is part of my model code:

```python
import tensorflow as tf
import tensorflow.contrib.eager as tfe
import numpy as np

class Model():
    def __init__(self,params,vocab_size):

        # Hyperparameters
        self.vocab_size = vocab_size
        self.embedding_dim = params.embedding_dim
        self.hidden_dim = params.hidden_dim
        self.num_layers = params.num_layers
        self.keep_ratio = params.keep_ratio
        self.time_major = params.time_major

        # Embeddings
        self.embedding = Embedding(self.vocab_size,self.embedding_dim,self.time_major)

        # Dynamic RNN
        self.rnn = RNN(self.hidden_dim, self.embedding_dim,self.num_layers,
                               self.keep_ratio,self.time_major, name='Forward')

        # Fully Connections Layer
        self.fc = FC(self.embedding_dim,self.vocab_size,self.time_major)

        # Build all trainable variables
        self.build()

        # Add saver
        self.saver = tfe.Saver(self.variables)

    def build(self):
        # initialize trainable variables
        self.variables = []

        self.variables.append(self.embedding.embedding)
        for variable in self.rnn.variables:
            self.variables.append(variable)
        self.variables.append(self.fc.weights)
        self.variables.append(self.fc.bias)

    def __call__(self,inputs,training):
        # Parallel computing
        if self.time_major:
            inputs = tf.transpose(inputs,(1,0))
        # Get word embeddings for all
        embed = self.embedding(inputs)
        # moving
        rnn_outputs = self.rnn(embed,training=training)

        # Fully connections
        logits = self.fc(rnn_outputs)
        if self.time_major:
            seq_len,batch_size,last_dim=logits.shape.as_list()
            logits = tf.reshape(logits,(batch_size,seq_len,-1))

        return logits
class Embedding():
    # Embedding with different lengths.
    def __init__(self, vocab_size, embedding_dim,time_major=True):
        self.vocab_size = vocab_size
        self.embedding_dim = embedding_dim
        self.time_major = time_major

        self.embedding = tfe.Variable(initial_value=tf.random_normal(shape=(vocab_size,embedding_dim)),dtype=tf.float32,name='embedding',trainable=True)

    def __call__(self, x):
        # implement word embedding
        # whether parallel computing
        if self.time_major:
                #all_embedded=tf.nn.embedding_lookup(self.embedding,x)
                seq_len, batch_size = x.shape.as_list()
                x = tf.one_hot(x, depth=self.vocab_size)
                flat_x = tf.reshape(x, shape=(-1,self.vocab_size))
                all_embedded = tf.matmul(flat_x,self.embedding)
                all_embedded = tf.reshape(all_embedded,(seq_len,batch_size,-1))
        else:
                all_embedded = []
                for seq in x:
                      #embedded = tf.nn.embedding_lookup(self.embedding,seq)
                      seq = tf.one_hot(seq, depth=self.vocab_size)
                      embedded = tf.matmul(seq, self.embedding)
                      all_embedded.append(embedded)

        return all_embedded
class LSTMCell():
    def __init__(self,inputs_dim, hidden_dim, name):

        self.inputs_dim = inputs_dim
        self.hidden_dim = hidden_dim

        with tf.name_scope(name=name):
            # Forget gate
            self.W_f = tfe.Variable(initial_value=tf.random_normal(shape=(inputs_dim + hidden_dim, hidden_dim)),
                                    dtype=tf.float32, name='forget_weights', trainable=True)
            self.b_f = tfe.Variable(initial_value=tf.random_normal(shape=(hidden_dim,)),
                                    dtype=tf.float32, name='forget_bias', trainable=True)
            # Input gate
            self.W_i = tfe.Variable(initial_value=tf.random_normal(shape=(inputs_dim + hidden_dim, hidden_dim)),
                                    dtype=tf.float32, name='input_weights', trainable=True)
            self.b_i = tfe.Variable(initial_value=tf.random_normal(shape=(hidden_dim,)),
                                    dtype=tf.float32, name='input_bias', trainable=True)
            self.W_c = tfe.Variable(initial_value=tf.random_normal(shape=(inputs_dim + hidden_dim, hidden_dim)),
                                    dtype=tf.float32, name='input_filter_weights', trainable=True)
            self.b_c = tfe.Variable(initial_value=tf.random_normal(shape=(hidden_dim,)),
                                    dtype=tf.float32, name='input_filter_bias', trainable=True)
            # Output gate
            self.W_o = tfe.Variable(initial_value=tf.random_normal(shape=(inputs_dim + hidden_dim, hidden_dim)),
                                    dtype=tf.float32, name='output_weights', trainable=True)
            self.b_o = tfe.Variable(initial_value=tf.random_normal(shape=(hidden_dim,)),
                                dtype=tf.float32, name='output_bias', trainable=True)

            self.build()

    def build(self,):
        # build all variables
        self.variables = [self.W_f, self.b_f,self.W_i, self.b_i,
                          self.W_c, self.b_c,self.W_o, self.b_o]

    def zero_state(self,batch_size):
        c_t = tf.zeros(shape=(batch_size,self.hidden_dim),name='c_t')
        h_t = tf.zeros(shape=(batch_size,self.hidden_dim),name='h_t')
        return c_t, h_t

    def __call__(self,x_t,c_t,h_t):
        # Define LSTM forward propagation

        # Forget
        f_t=tf.sigmoid(tf.nn.xw_plus_b(tf.concat((h_t,x_t),-1),self.W_f,self.b_f),name='Forget_Gate')
        # Input
        i_t=tf.sigmoid(tf.nn.xw_plus_b(tf.concat((h_t,x_t),-1),self.W_i,self.b_i),name='Input_Gate')
        uc_t=tf.tanh(tf.nn.xw_plus_b(tf.concat((h_t,x_t),-1),self.W_c,self.b_c),name='Input_Filter')
        # Update Cell State
        c_t = f_t*c_t + i_t*uc_t
        # Output
        o_t=tf.sigmoid(tf.nn.xw_plus_b(tf.concat((h_t,x_t),-1),self.W_o,self.b_o),name='Output_Gate')
        h_t = o_t * tf.tanh(c_t)

        return o_t,c_t,h_t
class RNN():
    # A static RNN. Similar to tf.nn.static_rnn, implemented as a class.
    def __init__(self, hidden_dim,inputs_dim, num_layers, keep_ratio,time_major=True,name=''):
        self.keep_ratio = keep_ratio
        self.time_major = time_major
        self.inputs_dim=inputs_dim
        self.hidden_dim=hidden_dim

        self.cells = [
            LSTMCell(inputs_dim,hidden_dim,name=name+'_LSTMCell'+str(idx))
            for idx in range(num_layers)
        ]

        self.build()

    def build(self):
        # build all trainable variables
        self.variables = []

        for cell in self.cells:
            for variable in cell.variables:
                self.variables.append(variable)


    def __call__(self, inputs_seq, training=False):
        # rnn with different length sequences, inputs : [batch, (embedded tensor)]

        # parallel computing
        if self.time_major:
            seq_len,batch_size,last_dim = inputs_seq.shape.as_list()
            for cell in self.cells:
                c_t, h_t = cell.zero_state(batch_size)
                outputs = []
                for inp in inputs_seq:
                    output, c_t, h_t = cell(inp, c_t, h_t)
                    outputs.append(output)

                inputs_seq = tf.stack(outputs, axis=0)
                if training:
                    inputs_seq = tf.nn.dropout(inputs_seq, self.keep_ratio)
            batch_outputs = inputs_seq
        # linear computing
        else:
            for cell in self.cells:
                c_t,h_t = cell.zero_state(1)
                outputs = []
                for seq in inputs_seq:
                    seq_outputs = []
                    for word in seq:
                        word = tf.reshape(word,(1,-1))
                        output, c_t,h_t = cell(word, c_t,h_t)
                        if training:
                            output = tf.nn.dropout(output, self.keep_ratio)
                        seq_outputs.append(output)
                    outputs.append(seq_outputs)
                inputs_seq = outputs
            batch_outputs = inputs_seq
        return batch_outputs
class FC():
    # A Fully Connection Layer.
    def __init__(self, inputs_dim, outputs_dim,time_major=True,active=tf.nn.relu,name=None):
        self.inputs_dim = inputs_dim
        self.outputs_dim= outputs_dim
        self.time_major = time_major
        self.active = active

        # Build trainable variables
        self.weights = tfe.Variable(initial_value=tf.random_normal(shape=(inputs_dim,outputs_dim)),
                                  dtype=tf.float32,name=name,trainable=True)
        self.bias = tfe.Variable(initial_value=tf.random_normal(shape=(outputs_dim,)),
                                  dtype=tf.float32,name=name,trainable=True)

    def __call__(self, x):
        # different length sequences, inputs : [batch, (RNN outputs tensor)]
        # whether to parallel computing
        if self.time_major:
            seq_len,batch_size,last_dim = x.shape.as_list()
            # flatten
            flat = tf.reshape(x,(-1,last_dim))
            fc = tf.nn.xw_plus_b(flat,self.weights,self.bias)
            outputs = self.active(fc)
            outputs = tf.reshape(outputs,(seq_len,batch_size,-1))
        else:
            outputs = []
            for seq in x:
                fc = tf.nn.xw_plus_b(seq, self.weights, self.bias)
                fc = self.active(fc)
                outputs.append(fc)
        return outputs

```

Part of the training code is as follows:

```python
with tf.GradientTape() as tape:
    logits = model(batch_forward, batch_backward, training=True)
    loss = training_helper.loss_fn_time_major(batch_labels, logits, batch_lengths)
    # add loss summary
    tf.contrib.summary.scalar('loss', loss)
grads = tape.gradient(loss, model.variables)
optimizer.apply_gradients(zip(grads, model.variables))
print('loss at epoch %d step %d: %f' % (epoch, train_step, loss))
```

I used the setting you recommended, log_device_placement=True, and observed an interesting phenomenon. Every certain number of training steps, the same three ops are logged, and the interval (in training steps) between these log bursts gradually decreases.
The log after turning on log_device_placement=True is below. At the beginning, the three ops are logged about every 16 training steps; as training progresses, they are logged every three steps, until the OOM error happens. I don't know why this happens or what these three operations (Pack, Unpack, AddN) represent. I hope this information helps to solve the problem. Thank you again for your reply.

some logs

loss at epoch 0 step 805: 6.740626
loss at epoch 0 step 806: 7.095036
loss at epoch 0 step 807: 7.000319
loss at epoch 0 step 808: 6.904347
loss at epoch 0 step 809: 6.922751
loss at epoch 0 step 810: 6.838681
loss at epoch 0 step 811: 6.905680
loss at epoch 0 step 812: 7.053514
loss at epoch 0 step 813: 6.855484
loss at epoch 0 step 814: 6.755508
loss at epoch 0 step 815: 7.050679
loss at epoch 0 step 816: 7.023479
loss at epoch 0 step 817: 6.827959
loss at epoch 0 step 818: 6.776998
loss at epoch 0 step 819: 6.765797
loss at epoch 0 step 820: 6.759283
2018-06-25 14:57:15.878743: I tensorflow/c/eager/c_api.cc:856] Executing op Pack in device /job:localhost/replica:0/task:0/device:GPU:0
2018-06-25 14:57:15.935075: I tensorflow/c/eager/c_api.cc:856] Executing op Unpack in device /job:localhost/replica:0/task:0/device:GPU:0
2018-06-25 14:57:15.970760: I tensorflow/c/eager/c_api.cc:856] Executing op AddN in device /job:localhost/replica:0/task:0/device:GPU:0
loss at epoch 0 step 821: 6.984869
loss at epoch 0 step 822: 6.748662
loss at epoch 0 step 823: 6.772003
loss at epoch 0 step 824: 6.942709
loss at epoch 0 step 825: 6.892979
loss at epoch 0 step 826: 6.680350
loss at epoch 0 step 827: 6.867269
loss at epoch 0 step 828: 7.118308
loss at epoch 0 step 829: 6.601588
loss at epoch 0 step 830: 6.652170
loss at epoch 0 step 831: 7.263901
loss at epoch 0 step 832: 6.902826
loss at epoch 0 step 833: 6.791662
loss at epoch 0 step 834: 7.087551
loss at epoch 0 step 835: 6.820101
2018-06-25 14:57:18.650329: I tensorflow/c/eager/c_api.cc:856] Executing op Pack in device /job:localhost/replica:0/task:0/device:GPU:0
2018-06-25 14:57:18.710324: I tensorflow/c/eager/c_api.cc:856] Executing op Unpack in device /job:localhost/replica:0/task:0/device:GPU:0
2018-06-25 14:57:18.746490: I tensorflow/c/eager/c_api.cc:856] Executing op AddN in device /job:localhost/replica:0/task:0/device:GPU:0
loss at epoch 0 step 836: 7.031796
loss at epoch 0 step 837: 6.905032
loss at epoch 0 step 838: 6.889287
loss at epoch 0 step 839: 6.732944
loss at epoch 0 step 840: 6.715842
loss at epoch 0 step 841: 6.757300
loss at epoch 0 step 842: 6.882929
loss at epoch 0 step 843: 6.658679
loss at epoch 0 step 844: 6.788300
loss at epoch 0 step 845: 6.952966
loss at epoch 0 step 846: 7.061527
loss at epoch 0 step 847: 6.603632
loss at epoch 0 step 848: 7.163596
loss at epoch 0 step 849: 7.038760
2018-06-25 14:57:21.325933: I tensorflow/c/eager/c_api.cc:856] Executing op Pack in device /job:localhost/replica:0/task:0/device:GPU:0
2018-06-25 14:57:21.386127: I tensorflow/c/eager/c_api.cc:856] Executing op Unpack in device /job:localhost/replica:0/task:0/device:GPU:0
2018-06-25 14:57:21.423910: I tensorflow/c/eager/c_api.cc:856] Executing op AddN in device /job:localhost/replica:0/task:0/device:GPU:0
loss at epoch 0 step 850: 6.885034
loss at epoch 0 step 851: 6.931499
loss at epoch 0 step 852: 6.713628
loss at epoch 0 step 853: 6.777599
loss at epoch 0 step 854: 6.799969
loss at epoch 0 step 855: 6.985339
loss at epoch 0 step 856: 6.620006
loss at epoch 0 step 857: 6.878152
loss at epoch 0 step 858: 6.783222
loss at epoch 0 step 859: 6.712417
loss at epoch 0 step 860: 6.734530
loss at epoch 0 step 861: 7.009552
2018-06-25 14:57:23.706079: I tensorflow/c/eager/c_api.cc:856] Executing op Pack in device /job:localhost/replica:0/task:0/device:GPU:0
2018-06-25 14:57:23.769647: I tensorflow/c/eager/c_api.cc:856] Executing op Unpack in device /job:localhost/replica:0/task:0/device:GPU:0
2018-06-25 14:57:23.809172: I tensorflow/c/eager/c_api.cc:856] Executing op AddN in device /job:localhost/replica:0/task:0/device:GPU:0
loss at epoch 0 step 862: 6.570765
loss at epoch 0 step 863: 6.818643
loss at epoch 0 step 864: 6.707229
loss at epoch 0 step 865: 6.860226
loss at epoch 0 step 866: 6.832708
loss at epoch 0 step 867: 6.746671
loss at epoch 0 step 868: 6.925843
loss at epoch 0 step 869: 6.845281
loss at epoch 0 step 870: 6.901897
loss at epoch 0 step 871: 6.936728
loss at epoch 0 step 872: 6.793565
2018-06-25 14:57:25.894904: I tensorflow/c/eager/c_api.cc:856] Executing op Pack in device /job:localhost/replica:0/task:0/device:GPU:0
2018-06-25 14:57:25.958699: I tensorflow/c/eager/c_api.cc:856] Executing op Unpack in device /job:localhost/replica:0/task:0/device:GPU:0
2018-06-25 14:57:25.998658: I tensorflow/c/eager/c_api.cc:856] Executing op AddN in device /job:localhost/replica:0/task:0/device:GPU:0
loss at epoch 0 step 873: 6.795174
loss at epoch 0 step 874: 6.897914
loss at epoch 0 step 875: 6.964663
loss at epoch 0 step 876: 6.933444
loss at epoch 0 step 877: 6.889087
loss at epoch 0 step 878: 6.728238
loss at epoch 0 step 879: 6.543335
loss at epoch 0 step 880: 6.785100
loss at epoch 0 step 881: 6.867868
loss at epoch 0 step 882: 6.626728
2018-06-25 14:57:28.011803: I tensorflow/c/eager/c_api.cc:856] Executing op Pack in device /job:localhost/replica:0/task:0/device:GPU:0
2018-06-25 14:57:28.078783: I tensorflow/c/eager/c_api.cc:856] Executing op Unpack in device /job:localhost/replica:0/task:0/device:GPU:0
2018-06-25 14:57:28.121272: I tensorflow/c/eager/c_api.cc:856] Executing op AddN in device /job:localhost/replica:0/task:0/device:GPU:0
loss at epoch 0 step 883: 6.632171
loss at epoch 0 step 884: 6.638273
loss at epoch 0 step 885: 6.727106
loss at epoch 0 step 886: 6.755214
loss at epoch 0 step 887: 6.788674
loss at epoch 0 step 888: 6.618323
loss at epoch 0 step 889: 6.802135
loss at epoch 0 step 890: 6.812809
loss at epoch 0 step 891: 6.905511
2018-06-25 14:57:29.982565: I tensorflow/c/eager/c_api.cc:856] Executing op Pack in device /job:localhost/replica:0/task:0/device:GPU:0
2018-06-25 14:57:30.048872: I tensorflow/c/eager/c_api.cc:856] Executing op Unpack in device /job:localhost/replica:0/task:0/device:GPU:0
2018-06-25 14:57:30.089943: I tensorflow/c/eager/c_api.cc:856] Executing op AddN in device /job:localhost/replica:0/task:0/device:GPU:0
loss at epoch 0 step 892: 6.660409
loss at epoch 0 step 893: 6.817567
loss at epoch 0 step 894: 6.782594
loss at epoch 0 step 895: 6.746886
loss at epoch 0 step 896: 6.679479
loss at epoch 0 step 897: 6.696771
loss at epoch 0 step 898: 6.642287
loss at epoch 0 step 899: 6.870078
2018-06-25 14:57:31.693334: I tensorflow/c/eager/c_api.cc:856] Executing op Pack in device /job:localhost/replica:0/task:0/device:GPU:0
2018-06-25 14:57:31.761710: I tensorflow/c/eager/c_api.cc:856] Executing op Unpack in device /job:localhost/replica:0/task:0/device:GPU:0
2018-06-25 14:57:31.805339: I tensorflow/c/eager/c_api.cc:856] Executing op AddN in device /job:localhost/replica:0/task:0/device:GPU:0
loss at epoch 0 step 900: 6.774796
loss at epoch 0 step 901: 6.841524
loss at epoch 0 step 902: 6.682564
loss at epoch 0 step 903: 6.693571
loss at epoch 0 step 904: 6.603614
loss at epoch 0 step 905: 6.705801
loss at epoch 0 step 906: 6.713558
loss at epoch 0 step 907: 6.889330
2018-06-25 14:57:33.459800: I tensorflow/c/eager/c_api.cc:856] Executing op Pack in device /job:localhost/replica:0/task:0/device:GPU:0
2018-06-25 14:57:33.528563: I tensorflow/c/eager/c_api.cc:856] Executing op Unpack in device /job:localhost/replica:0/task:0/device:GPU:0
2018-06-25 14:57:33.573187: I tensorflow/c/eager/c_api.cc:856] Executing op AddN in device /job:localhost/replica:0/task:0/device:GPU:0
loss at epoch 0 step 908: 6.726191
loss at epoch 0 step 909: 6.945642
loss at epoch 0 step 910: 6.768640
loss at epoch 0 step 911: 6.801554
loss at epoch 0 step 912: 6.867862
loss at epoch 0 step 913: 6.563491
loss at epoch 0 step 914: 6.853379
loss at epoch 0 step 915: 6.724048
2018-06-25 14:57:35.341966: I tensorflow/c/eager/c_api.cc:856] Executing op Pack in device /job:localhost/replica:0/task:0/device:GPU:0
2018-06-25 14:57:35.412082: I tensorflow/c/eager/c_api.cc:856] Executing op Unpack in device /job:localhost/replica:0/task:0/device:GPU:0
2018-06-25 14:57:35.456774: I tensorflow/c/eager/c_api.cc:856] Executing op AddN in device /job:localhost/replica:0/task:0/device:GPU:0
loss at epoch 0 step 916: 6.639415
loss at epoch 0 step 917: 6.881902
loss at epoch 0 step 918: 6.735335
loss at epoch 0 step 919: 6.688423
loss at epoch 0 step 920: 6.636367
loss at epoch 0 step 921: 6.717305
loss at epoch 0 step 922: 6.629582
2018-06-25 14:57:36.960963: I tensorflow/c/eager/c_api.cc:856] Executing op Pack in device /job:localhost/replica:0/task:0/device:GPU:0
2018-06-25 14:57:37.033898: I tensorflow/c/eager/c_api.cc:856] Executing op Unpack in device /job:localhost/replica:0/task:0/device:GPU:0
2018-06-25 14:57:37.079990: I tensorflow/c/eager/c_api.cc:856] Executing op AddN in device /job:localhost/replica:0/task:0/device:GPU:0
loss at epoch 0 step 923: 6.682681
loss at epoch 0 step 924: 6.882014
loss at epoch 0 step 925: 6.731138
loss at epoch 0 step 926: 6.588252
loss at epoch 0 step 927: 6.824276
loss at epoch 0 step 928: 6.707603
loss at epoch 0 step 929: 6.748030
2018-06-25 14:57:38.690123: I tensorflow/c/eager/c_api.cc:856] Executing op Pack in device /job:localhost/replica:0/task:0/device:GPU:0
2018-06-25 14:57:38.767782: I tensorflow/c/eager/c_api.cc:856] Executing op Unpack in device /job:localhost/replica:0/task:0/device:GPU:0
2018-06-25 14:57:38.816409: I tensorflow/c/eager/c_api.cc:856] Executing op AddN in device /job:localhost/replica:0/task:0/device:GPU:0
loss at epoch 0 step 930: 6.658401
loss at epoch 0 step 931: 6.748807
loss at epoch 0 step 932: 6.802845
loss at epoch 0 step 933: 6.590845
loss at epoch 0 step 934: 6.749659
loss at epoch 0 step 935: 6.701960
2018-06-25 14:57:40.158598: I tensorflow/c/eager/c_api.cc:856] Executing op Pack in device /job:localhost/replica:0/task:0/device:GPU:0
2018-06-25 14:57:40.234263: I tensorflow/c/eager/c_api.cc:856] Executing op Unpack in device /job:localhost/replica:0/task:0/device:GPU:0
2018-06-25 14:57:40.282647: I tensorflow/c/eager/c_api.cc:856] Executing op AddN in device /job:localhost/replica:0/task:0/device:GPU:0
loss at epoch 0 step 936: 6.775355
loss at epoch 0 step 937: 6.781133
loss at epoch 0 step 938: 6.600548
loss at epoch 0 step 939: 6.747424
loss at epoch 0 step 940: 6.605201
2018-06-25 14:57:41.416798: I tensorflow/c/eager/c_api.cc:856] Executing op Pack in device /job:localhost/replica:0/task:0/device:GPU:0
2018-06-25 14:57:41.496970: I tensorflow/c/eager/c_api.cc:856] Executing op Unpack in device /job:localhost/replica:0/task:0/device:GPU:0
2018-06-25 14:57:41.550523: I tensorflow/c/eager/c_api.cc:856] Executing op AddN in device /job:localhost/replica:0/task:0/device:GPU:0
loss at epoch 0 step 941: 6.672260
loss at epoch 0 step 942: 6.614303
loss at epoch 0 step 943: 6.726663
loss at epoch 0 step 944: 6.694652
loss at epoch 0 step 945: 6.590765
2018-06-25 14:57:42.720835: I tensorflow/c/eager/c_api.cc:856] Executing op Pack in device /job:localhost/replica:0/task:0/device:GPU:0
2018-06-25 14:57:42.800870: I tensorflow/c/eager/c_api.cc:856] Executing op Unpack in device /job:localhost/replica:0/task:0/device:GPU:0
2018-06-25 14:57:42.852718: I tensorflow/c/eager/c_api.cc:856] Executing op AddN in device /job:localhost/replica:0/task:0/device:GPU:0
loss at epoch 0 step 946: 6.639668
loss at epoch 0 step 947: 6.581154
loss at epoch 0 step 948: 6.780156
loss at epoch 0 step 949: 6.619957
loss at epoch 0 step 950: 6.680293
2018-06-25 14:57:44.034047: I tensorflow/c/eager/c_api.cc:856] Executing op Pack in device /job:localhost/replica:0/task:0/device:GPU:0
2018-06-25 14:57:44.116253: I tensorflow/c/eager/c_api.cc:856] Executing op Unpack in device /job:localhost/replica:0/task:0/device:GPU:0
2018-06-25 14:57:44.167625: I tensorflow/c/eager/c_api.cc:856] Executing op AddN in device /job:localhost/replica:0/task:0/device:GPU:0
loss at epoch 0 step 951: 6.767905
loss at epoch 0 step 952: 6.694592
loss at epoch 0 step 953: 6.586446
loss at epoch 0 step 954: 6.691846
2018-06-25 14:57:45.122539: I tensorflow/c/eager/c_api.cc:856] Executing op Pack in device /job:localhost/replica:0/task:0/device:GPU:0
2018-06-25 14:57:45.204973: I tensorflow/c/eager/c_api.cc:856] Executing op Unpack in device /job:localhost/replica:0/task:0/device:GPU:0
2018-06-25 14:57:45.258423: I tensorflow/c/eager/c_api.cc:856] Executing op AddN in device /job:localhost/replica:0/task:0/device:GPU:0
loss at epoch 0 step 955: 6.994371
loss at epoch 0 step 956: 6.858140
loss at epoch 0 step 957: 6.646007
loss at epoch 0 step 958: 6.721205
2018-06-25 14:57:46.238911: I tensorflow/c/eager/c_api.cc:856] Executing op Pack in device /job:localhost/replica:0/task:0/device:GPU:0
2018-06-25 14:57:46.325236: I tensorflow/c/eager/c_api.cc:856] Executing op Unpack in device /job:localhost/replica:0/task:0/device:GPU:0
2018-06-25 14:57:46.380011: I tensorflow/c/eager/c_api.cc:856] Executing op AddN in device /job:localhost/replica:0/task:0/device:GPU:0
loss at epoch 0 step 959: 6.589066
loss at epoch 0 step 960: 6.683684
loss at epoch 0 step 961: 6.672047
2018-06-25 14:57:47.046723: I tensorflow/c/eager/c_api.cc:856] Executing op Pack in device /job:localhost/replica:0/task:0/device:GPU:0
2018-06-25 14:57:47.203337: I tensorflow/c/eager/c_api.cc:856] Executing op Unpack in device /job:localhost/replica:0/task:0/device:GPU:0
2018-06-25 14:57:47.260355: I tensorflow/c/eager/c_api.cc:856] Executing op AddN in device /job:localhost/replica:0/task:0/device:GPU:0
loss at epoch 0 step 962: 6.725653
loss at epoch 0 step 963: 6.720684
loss at epoch 0 step 964: 6.636925
2018-06-25 14:57:47.932850: I tensorflow/c/eager/c_api.cc:856] Executing op Pack in device /job:localhost/replica:0/task:0/device:GPU:0
2018-06-25 14:57:48.021669: I tensorflow/c/eager/c_api.cc:856] Executing op Unpack in device /job:localhost/replica:0/task:0/device:GPU:0
2018-06-25 14:57:48.077942: I tensorflow/c/eager/c_api.cc:856] Executing op AddN in device /job:localhost/replica:0/task:0/device:GPU:0
loss at epoch 0 step 965: 6.461943
loss at epoch 0 step 966: 6.716685
loss at epoch 0 step 967: 6.541767
2018-06-25 14:57:48.839285: I tensorflow/c/eager/c_api.cc:856] Executing op Pack in device /job:localhost/replica:0/task:0/device:GPU:0
2018-06-25 14:57:58.895255: W tensorflow/core/common_runtime/bfc_allocator.cc:275] Allocator (GPU_0_bfc) ran out of memory trying to allocate 90.19MiB. Current allocation summary follows.
2018-06-25 14:57:58.895406: I tensorflow/core/common_runtime/bfc_allocator.cc:630] Bin (256): Total Chunks: 3, Chunks in use: 3. 768B allocated for chunks. 768B in use in bin. 12B client-requested in use in bin.
2018-06-25 14:57:58.895440: I tensorflow/core/common_runtime/bfc_allocator.cc:630] Bin (512): Total Chunks: 17, Chunks in use: 16. 8.5KiB allocated for chunks. 8.0KiB in use in bin. 8.0KiB client-requested in use in bin.
2018-06-25 14:57:58.895466: I tensorflow/core/common_runtime/bfc_allocator.cc:630] Bin (1024): Total Chunks: 1, Chunks in use: 1. 1.2KiB allocated for chunks. 1.2KiB in use in bin. 1.0KiB client-requested in use in bin.
2018-06-25 14:57:58.895488: I tensorflow/core/common_runtime/bfc_allocator.cc:630] Bin (2048): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2018-06-25 14:57:58.895517: I tensorflow/core/common_runtime/bfc_allocator.cc:630] Bin (4096): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2018-06-25 14:57:58.895538: I tensorflow/core/common_runtime/bfc_allocator.cc:630] Bin (8192): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2018-06-25 14:57:58.895565: I tensorflow/core/common_runtime/bfc_allocator.cc:630] Bin (16384): Total Chunks: 5, Chunks in use: 3. 124.0KiB allocated for chunks. 79.0KiB in use in bin. 78.9KiB client-requested in use in bin.
2018-06-25 14:57:58.895592: I tensorflow/core/common_runtime/bfc_allocator.cc:630] Bin (32768): Total Chunks: 903, Chunks in use: 901. 28.30MiB allocated for chunks. 28.24MiB in use in bin. 28.15MiB client-requested in use in bin.
2018-06-25 14:57:58.895618: I tensorflow/core/common_runtime/bfc_allocator.cc:630] Bin (65536): Total Chunks: 402, Chunks in use: 402. 25.56MiB allocated for chunks. 25.56MiB in use in bin. 25.16MiB client-requested in use in bin.
2018-06-25 14:57:58.895639: I tensorflow/core/common_runtime/bfc_allocator.cc:630] Bin (131072): Total Chunks: 19, Chunks in use: 19. 2.51MiB allocated for chunks. 2.51MiB in use in bin. 2.47MiB client-requested in use in bin.
2018-06-25 14:57:58.895673: I tensorflow/core/common_runtime/bfc_allocator.cc:630] Bin (262144): Total Chunks: 8, Chunks in use: 8. 2.97MiB allocated for chunks. 2.97MiB in use in bin. 2.75MiB client-requested in use in bin.

@alextp
Contributor

alextp commented Jun 25, 2018 via email

@allenlavoie
Member

Looks like one issue is that summaries are adding to global collections. Could you try taking summaries out and seeing if there's still a leak? Or post a training loop and I can give it a try.

@qijimrc
Author

qijimrc commented Jun 26, 2018

Hi @allenlavoie, I tried removing the summaries but got the same result, so it does not look like an error related to summaries. Thank you all the same for your reply.

@qijimrc
Author

qijimrc commented Jun 26, 2018

@alextp, I think so. Is there any solution? Or, if I send you the complete program code, could you help me debug it? Can I communicate with you about this issue via email? Thank you very much!

@tensorflowbutler tensorflowbutler removed the stat:awaiting response Status - Awaiting response from author label Jun 26, 2018
@allenlavoie
Member

Well, the summary issue is fixed at least. I'm happy to track down the other memory issue, but you'll need to post a training loop which reproduces the issue (and which I can run).

@allenlavoie
Member

(Leaving it closed for now, but feel free to re-open with a repro.)

@LeoLai930603

Still facing this problem in 1.10.0; I hope it can be fixed soon.

@allenlavoie
Member

@LeoLai930603 could you post a snippet to reproduce?

@LeoLai930603

Sure, I posted it on my GitHub: https://github.com/LeoLai930603/Eager_OOM_reproduce/tree/master

BTW, I tried this code on Windows 10 with Anaconda (Python 3.5, TensorFlow 1.8.0), but I think the error could also occur on Linux. I hit this error on both CPU and GPU; I hope that information is helpful.

@allenlavoie
Member

@LeoLai930603 thanks, that's super helpful. It looks like one issue was assigning functions to EagerTensors (as when slicing); I have a workaround for that in the process of being submitted (and will properly patch it soon).

Looks like something RNN related is still holding on to LSTMStateTuples, so I'll take a look at that too.

@allenlavoie
Member

I will follow up and properly patch variable slicing.

After that it's still leaking memory, but it looks like it's a zeros/ones caching issue with backprop. @akshaym was going to take a look (at https://github.com/LeoLai930603/Eager_OOM_reproduce/tree/master ).

I think it's a caching issue because clearing the cache after train_step (in eager_biLSTM_trainer.py) keeps the memory stable and means the number of PyObjects doesn't increase:

```python
from tensorflow.python.eager import context
context.context()._clear_caches()  # Increasing memory and PyObject count without this
import gc
gc.collect()
print(len(gc.get_objects()))
```
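
For context, a sketch of how that workaround might sit inside a training loop; the loop structure, dataset, and train_step names are assumptions, and context()._clear_caches() is the private API quoted above, so it may change between versions:

```python
from tensorflow.python.eager import context
import gc

for step, batch in enumerate(dataset):   # hypothetical dataset iterable
    train_step(batch)                    # hypothetical single training step

    # Workaround: drop eager's internal caches so tensors cached during
    # backprop on previous steps can be freed.
    context.context()._clear_caches()

    if step % 100 == 0:
        gc.collect()                     # optional: stabilize the object count
        print(step, len(gc.get_objects()))
```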

@LeoLai930603

Thanks a lot! I will give it a try later. Would this work for Eager on GPU? I am not sure whether this issue is only related to the gc on host. As the author mentioned at first, the problem could also be caused by insufficient GPU memory allocation.

@allenlavoie
Member

The PyObject issue was probably just host-side (and now that I think about it, you may not have run into it unless you were using a nightly from the past week or so). Not sure the gc.collect() is necessary there; just using it to get a stable count of objects in case there are reference cycles.

The zeros caching issue will affect GPU too, since (I assume) we cache GPU Tensors. But all my testing was CPU-only so far.

This caching helps avoid re-creating zero Tensors during backprop and is limited to 256 entries, but when running your model, ~60 entries already added up to tens of gigabytes. It sounds from Akshay like the solution will be to cap the size of the cache. (And maybe we should cap it more aggressively for GPU Tensors, since memory will be more constrained?)

@LeoLai930603

Agreed. I was also wondering whether this setting (cache size or entry limit) could be exposed to the user, so we can manage it ourselves for training and deployment. I also believe the RNN implementation deserves more attention in eager mode. For example, it is inconvenient to work with the trainable variables of LSTMCells during training because 1) the cell is lazily initialized, and I would expect an explicit way to initialize its weights, as with Conv2D, and 2) the trainable variables form a nested list, which makes pairing them with gradients harder (you have to flatten the list yourself instead of just using zip as in the tutorial; a sketch of that is below). I hope such cells can become more eager-friendly; at least that would help when debugging a memory leak.
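
To illustrate the nested-variables point, a minimal sketch of flattening cell variables before pairing them with gradients (cells, compute_loss, batch, and optimizer are assumed to exist already; tf.contrib.framework.nest is the TF 1.x nest utility):

```python
import tensorflow as tf

# Hypothetical: each cell exposes a (possibly nested) list of variables.
nested_vars = [cell.trainable_variables for cell in cells]

# Flatten the nested structure so it lines up with the flat gradient list.
flat_vars = tf.contrib.framework.nest.flatten(nested_vars)

with tf.GradientTape() as tape:
    loss = compute_loss(batch)           # assumed loss function
grads = tape.gradient(loss, flat_vars)
optimizer.apply_gradients(zip(grads, flat_vars))
```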

@allenlavoie
Member

Akshay's change limits caching to a few megabytes at most, so hopefully never an issue.

I believe the future of RNN APIs is being decided now (until Sept. 10): tensorflow/community#15. It may be worth reading over the design document and seeing whether your concerns are already addressed. If not, describing the pain points in a comment would be helpful.

@LeoLai930603

Much appreciated! I will wait for the dev20180831 build to test it. I hope this gets fixed completely.

@rmanak

rmanak commented Feb 5, 2020

There still seems to be a problem with the slicing operator resulting in a ResourceExhaustedError when using GradientTape. I was able to get around it by avoiding slicing operators in the loss calculation; a sketch of that kind of rewrite is below.
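
A minimal sketch of that kind of workaround, assuming a loss that originally indexed the logits with Python slicing; the names logits, targets, and positions are hypothetical:

```python
import tensorflow as tf

def loss_without_slicing(logits, targets, positions):
    # Gather the needed rows explicitly instead of slicing (e.g. logits[i])
    # inside the taped loss computation.
    selected = tf.gather(logits, positions, axis=0)
    return tf.reduce_mean(
        tf.nn.sparse_softmax_cross_entropy_with_logits(
            labels=targets, logits=selected))
```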
