
ran out of memory in eager execution #20218

Closed
qijimrc opened this issue Jun 22, 2018 · 20 comments
@qijimrc

qijimrc commented Jun 22, 2018

System information

  • Have I written custom code: Yes
  • OS Platform and Distribution: Linux Ubuntu 16.04
  • TensorFlow installed from: source
  • TensorFlow version: 1.8.0-gpu
  • Python version: 3.5
  • CUDA/cuDNN version: 9.0/7.0
  • GPU model and memory: GeForce GTX 1060 / 6 GB
  • Bazel version (if compiling from source): N/A
  • Exact command to reproduce:
    General model-training steps

Describe the problem

I wrote a neural language model using TF eager mode; it has only one LSTM cell.
Some of its hyperparameters are:

hidden_dim : 128
vocab_size : 7393
seq_len : <=50
batch_size : 32

In particular, I did not write the model by subclassing the Keras base class; instead I used tfe.Variable directly to implement the details of the model step by step. Training is normal at first and the model performs well, but a memory-exhaustion error occurs every time training reaches roughly step 960. I suspect some tensors are not released and accumulate during training, because the error still appears at around step 960 even after I adjust the batch size.
The error log is below. I think this is a memory-usage problem of TF eager mode; however, a custom model is a necessary choice for some of my experiments.
How can I avoid this problem and let the model train smoothly?
Thanks a lot.

Source code / logs

loss at epoch 0 step 954: 6.665831
loss at epoch 0 step 955: 6.716213
loss at epoch 0 step 956: 6.761374
loss at epoch 0 step 957: 6.712702
loss at epoch 0 step 958: 6.753476
loss at epoch 0 step 959: 6.481436
loss at epoch 0 step 960: 6.544363
loss at epoch 0 step 961: 6.562449
loss at epoch 0 step 962: 6.701943
loss at epoch 0 step 963: 6.415178
loss at epoch 0 step 964: 6.630089
loss at epoch 0 step 965: 6.580114
2018-06-22 11:53:38.547278: W tensorflow/core/common_runtime/bfc_allocator.cc:275] Allocator (GPU_0_bfc) ran out of memory trying to allocate 92.05MiB. Current allocation summary follows.
2018-06-22 11:53:38.547428: I tensorflow/core/common_runtime/bfc_allocator.cc:630] Bin (256): Total Chunks: 3, Chunks in use: 3. 768B allocated for chunks. 768B in use in bin. 12B client-requested in use in bin.
2018-06-22 11:53:38.547459: I tensorflow/core/common_runtime/bfc_allocator.cc:630] Bin (512): Total Chunks: 17, Chunks in use: 16. 8.5KiB allocated for chunks. 8.0KiB in use in bin. 8.0KiB client-requested in use in bin.
2018-06-22 11:53:38.547482: I tensorflow/core/common_runtime/bfc_allocator.cc:630] Bin (1024): Total Chunks: 1, Chunks in use: 1. 1.2KiB allocated for chunks. 1.2KiB in use in bin. 1.0KiB client-requested in use in bin.
2018-06-22 11:53:38.547504: I tensorflow/core/common_runtime/bfc_allocator.cc:630] Bin (2048): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2018-06-22 11:53:38.547525: I tensorflow/core/common_runtime/bfc_allocator.cc:630] Bin (4096): Total Chunks: 1, Chunks in use: 0. 4.0KiB allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2018-06-22 11:53:38.547544: I tensorflow/core/common_runtime/bfc_allocator.cc:630] Bin (8192): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2018-06-22 11:53:38.547568: I tensorflow/core/common_runtime/bfc_allocator.cc:630] Bin (16384): Total Chunks: 5, Chunks in use: 4. 135.5KiB allocated for chunks. 110.0KiB in use in bin. 108.8KiB client-requested in use in bin.
2018-06-22 11:53:38.547593: I tensorflow/core/common_runtime/bfc_allocator.cc:630] Bin (32768): Total Chunks: 920, Chunks in use: 918. 28.87MiB allocated for chunks. 28.81MiB in use in bin. 28.69MiB client-requested in use in bin.
2018-06-22 11:53:38.547617: I tensorflow/core/common_runtime/bfc_allocator.cc:630] Bin (65536): Total Chunks: 410, Chunks in use: 410. 26.12MiB allocated for chunks. 26.12MiB in use in bin. 25.66MiB client-requested in use in bin.
2018-06-22 11:53:38.547640: I tensorflow/core/common_runtime/bfc_allocator.cc:630] Bin (131072): Total Chunks: 20, Chunks in use: 20. 2.69MiB allocated for chunks. 2.69MiB in use in bin. 2.69MiB client-requested in use in bin.
...
2018-06-22 11:53:38.548461: I tensorflow/core/common_runtime/bfc_allocator.cc:665] Chunk at 0x102089bc700 of size 512
2018-06-22 11:53:38.548478: I tensorflow/core/common_runtime/bfc_allocator.cc:665] Chunk at 0x102089bc900 of size 131072
2018-06-22 11:53:38.548494: I tensorflow/core/common_runtime/bfc_allocator.cc:665] Chunk at 0x102089dc900 of size 512
2018-06-22 11:53:38.548510: I tensorflow/core/common_runtime/bfc_allocator.cc:665] Chunk at 0x102089dcb00 of size 131072
2018-06-22 11:53:38.548526: I tensorflow/core/common_runtime/bfc_allocator.cc:665] Chunk at 0x102089fcb00 of size 512
2018-06-22 11:53:38.548542: I tensorflow/core/common_runtime/bfc_allocator.cc:665] Chunk at 0x102089fcd00 of size 131072
2018-06-22 11:53:38.548558: I tensorflow/core/common_runtime/bfc_allocator.cc:665] Chunk at 0x10208a1cd00 of size 512
2018-06-22 11:53:38.548574: I tensorflow/core/common_runtime/bfc_allocator.cc:665] Chunk at 0x10208a1cf00 of size 131072
2018-06-22 11:53:38.548589: I tensorflow/core/common_runtime/bfc_allocator.cc:665] Chunk at 0x10208a3cf00 of size 512
2018-06-22 11:53:38.548605: I tensorflow/core/common_runtime/bfc_allocator.cc:665] Chunk at 0x10208a3d100 of size 131072
2018-06-22 11:53:38.548622: I tensorflow/core/common_runtime/bfc_allocator.cc:665] Chunk at 0x10208a5d100 of size 512
2018-06-22 11:53:38.548638: I tensorflow/core/common_runtime/bfc_allocator.cc:665] Chunk at 0x10208a5d300 of size 131072
2018-06-22 11:53:38.548654: I tensorflow/core/common_runtime/bfc_allocator.cc:665] Chunk at 0x10208a7d300 of size 512
2018-06-22 11:53:38.548669: I tensorflow/core/common_runtime/bfc_allocator.cc:665] Chunk at 0x10208a7d500 of size 131072
2018-06-22 11:53:38.548685: I tensorflow/core/common_runtime/bfc_allocator.cc:665] Chunk at 0x10208a9d500 of size 512
2018-06-22 11:53:38.548702: I tensorflow/core/common_runtime/bfc_allocator.cc:665] Chunk at 0x10208a9d700 of size 7570432
2018-06-22 11:53:38.548718: I tensorflow/core/common_runtime/bfc_allocator.cc:665] Chunk at 0x102091d5b00 of size 29696
2018-06-22 11:53:38.548735: I tensorflow/core/common_runtime/bfc_allocator.cc:665] Chunk at 0x102091dcf00 of size 32768
2018-06-22 11:53:38.548752: I tensorflow/core/common_runtime/bfc_allocator.cc:665] Chunk at 0x102091e4f00 of size 32768
2018-06-22 11:53:38.548768: I tensorflow/core/common_runtime/bfc_allocator.cc:665] Chunk at 0x102091ecf00 of size 32768
2018-06-22 11:53:38.548783: I tensorflow/core/common_runtime/bfc_allocator.cc:665] Chunk at 0x102091f4f00 of size 32768
...
2018-06-22 11:53:38.796751: I tensorflow/core/common_runtime/bfc_allocator.cc:680] Stats:
Limit: 5272961024
InUse: 3087996928
MaxInUse: 3174426368
NumAllocs: 2214674
MaxAllocSize: 140052992

2018-06-22 11:53:38.796832: W tensorflow/core/common_runtime/bfc_allocator.cc:279] ____________****
2018-06-22 11:53:38.796866: W tensorflow/core/framework/op_kernel.cc:1318] OP_REQUIRES failed at bias_op.cc:341 : Resource exhausted: OOM when allocating tensor with shape[3200,7393] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc

@tensorflowbutler tensorflowbutler added the stat:awaiting response Status - Awaiting response from author label Jun 22, 2018
@tensorflowbutler
Member

Thank you for your post. We noticed you have not filled out the following fields in the issue template. Could you update them if they are relevant in your case, or leave them as N/A? Thanks.
Bazel version
Exact command to reproduce

@tatianashp tatianashp added the comp:eager Eager related issues label Jun 22, 2018
@tatianashp tatianashp assigned alextp and unassigned tatianashp Jun 22, 2018
@alextp
Contributor

alextp commented Jun 22, 2018

Interesting. Can you share some code about how you're creating your variables?

Alternatively, an easy way to tell whether you're leaking variables is to set log_device_placement=True in the ConfigProto you can pass to enable_eager_execution. If you see it creating VarHandleOp many times after the first minibatch, then you know somewhere in your code is creating variables on every iteration.
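
For reference, a minimal sketch of that suggestion (assuming TF 1.8 and that eager execution has not yet been enabled elsewhere in the script):

```python
import tensorflow as tf

# Device placement logging has to be configured before eager execution starts.
config = tf.ConfigProto(log_device_placement=True)
tf.enable_eager_execution(config=config)

# From here on, every executed op is logged together with its device.
# Repeated VarHandleOp lines after the first minibatch suggest that
# variables are being recreated inside the training loop.
```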

@qijimrc
Author

qijimrc commented Jun 25, 2018

Hi @alextp, thank you very much for your help. Here is part of my model code:

```python
import tensorflow as tf
import tensorflow.contrib.eager as tfe
import numpy as np

class Model():
    def __init__(self,params,vocab_size):

        # Hyperparameters
        self.vocab_size = vocab_size
        self.embedding_dim = params.embedding_dim
        self.hidden_dim = params.hidden_dim
        self.num_layers = params.num_layers
        self.keep_ratio = params.keep_ratio
        self.time_major = params.time_major

        # Embeddings
        self.embedding = Embedding(self.vocab_size,self.embedding_dim,self.time_major)

        # Dynamic RNN
        self.rnn = RNN(self.hidden_dim, self.embedding_dim,self.num_layers,
                               self.keep_ratio,self.time_major, name='Forward')

        # Fully Connections Layer
        self.fc = FC(self.embedding_dim,self.vocab_size,self.time_major)

        # Build all trainable variables
        self.build()

        # Add saver
        self.saver = tfe.Saver(self.variables)

    def build(self):
        # initialize trainable variables
        self.variables = []

        self.variables.append(self.embedding.embedding)
        for variable in self.rnn.variables:
            self.variables.append(variable)
        self.variables.append(self.fc.weights)
        self.variables.append(self.fc.bias)

    def __call__(self,inputs,training):
        # Parallel computing
        if self.time_major:
            inputs = tf.transpose(inputs,(1,0))
        # Get word embeddings for all
        embed = self.embedding(inputs)
        # moving
        rnn_outputs = self.rnn(embed,training=training)

        # Fully connections
        logits = self.fc(rnn_outputs)
        if self.time_major:
            seq_len,batch_size,last_dim=logits.shape.as_list()
            logits = tf.reshape(logits,(batch_size,seq_len,-1))

        return logits
class Embedding():
    # Embedding with different lengths.
    def __init__(self, vocab_size, embedding_dim,time_major=True):
        self.vocab_size = vocab_size
        self.embedding_dim = embedding_dim
        self.time_major = time_major

        self.embedding = tfe.Variable(initial_value=tf.random_normal(shape=(vocab_size,embedding_dim)),dtype=tf.float32,name='embedding',trainable=True)

    def __call__(self, x):
        # implement word embedding
        # whether parallel computing
        if self.time_major:
                #all_embedded=tf.nn.embedding_lookup(self.embedding,x)
                seq_len, batch_size = x.shape.as_list()
                x = tf.one_hot(x, depth=self.vocab_size)
                flat_x = tf.reshape(x, shape=(-1,self.vocab_size))
                all_embedded = tf.matmul(flat_x,self.embedding)
                all_embedded = tf.reshape(all_embedded,(seq_len,batch_size,-1))
        else:
                all_embedded = []
                for seq in x:
                      #embedded = tf.nn.embedding_lookup(self.embedding,seq)
                      seq = tf.one_hot(seq, depth=self.vocab_size)
                      embedded = tf.matmul(seq, self.embedding)
                      all_embedded.append(embedded)

        return all_embedded
class LSTMCell():
    def __init__(self,inputs_dim, hidden_dim, name):

        self.inputs_dim = inputs_dim
        self.hidden_dim = hidden_dim

        with tf.name_scope(name=name):
            # Forget gate
            self.W_f = tfe.Variable(initial_value=tf.random_normal(shape=(inputs_dim + hidden_dim, hidden_dim)),
                                    dtype=tf.float32, name='forget_weights', trainable=True)
            self.b_f = tfe.Variable(initial_value=tf.random_normal(shape=(hidden_dim,)),
                                    dtype=tf.float32, name='forget_bias', trainable=True)
            # Input gate
            self.W_i = tfe.Variable(initial_value=tf.random_normal(shape=(inputs_dim + hidden_dim, hidden_dim)),
                                    dtype=tf.float32, name='input_weights', trainable=True)
            self.b_i = tfe.Variable(initial_value=tf.random_normal(shape=(hidden_dim,)),
                                    dtype=tf.float32, name='input_bias', trainable=True)
            self.W_c = tfe.Variable(initial_value=tf.random_normal(shape=(inputs_dim + hidden_dim, hidden_dim)),
                                    dtype=tf.float32, name='input_filter_weights', trainable=True)
            self.b_c = tfe.Variable(initial_value=tf.random_normal(shape=(hidden_dim,)),
                                    dtype=tf.float32, name='input_filter_bias', trainable=True)
            # Output gate
            self.W_o = tfe.Variable(initial_value=tf.random_normal(shape=(inputs_dim + hidden_dim, hidden_dim)),
                                    dtype=tf.float32, name='output_weights', trainable=True)
            self.b_o = tfe.Variable(initial_value=tf.random_normal(shape=(hidden_dim,)),
                                dtype=tf.float32, name='output_bias', trainable=True)

            self.build()

    def build(self,):
        # build all variables
        self.variables = [self.W_f, self.b_f,self.W_i, self.b_i,
                          self.W_c, self.b_c,self.W_o, self.b_o]

    def zero_state(self,batch_size):
        c_t = tf.zeros(shape=(batch_size,self.hidden_dim),name='c_t')
        h_t = tf.zeros(shape=(batch_size,self.hidden_dim),name='h_t')
        return c_t, h_t

    def __call__(self,x_t,c_t,h_t):
        # Define LSTM forward propagation

        # Forget
        f_t=tf.sigmoid(tf.nn.xw_plus_b(tf.concat((h_t,x_t),-1),self.W_f,self.b_f),name='Forget_Gate')
        # Input
        i_t=tf.sigmoid(tf.nn.xw_plus_b(tf.concat((h_t,x_t),-1),self.W_i,self.b_i),name='Input_Gate')
        uc_t=tf.tanh(tf.nn.xw_plus_b(tf.concat((h_t,x_t),-1),self.W_c,self.b_c),name='Input_Filter')
        # Update Cell State
        c_t = f_t*c_t + i_t*uc_t
        # Output
        o_t=tf.sigmoid(tf.nn.xw_plus_b(tf.concat((h_t,x_t),-1),self.W_o,self.b_o),name='Output_Gate')
        h_t = o_t * tf.tanh(c_t)

        return o_t,c_t,h_t
class RNN():
    # A static RNN. Similar to tf.nn.static_rnn, implemented as a class.
    def __init__(self, hidden_dim,inputs_dim, num_layers, keep_ratio,time_major=True,name=''):
        self.keep_ratio = keep_ratio
        self.time_major = time_major
        self.inputs_dim=inputs_dim
        self.hidden_dim=hidden_dim

        self.cells = [
            LSTMCell(inputs_dim,hidden_dim,name=name+'_LSTMCell'+str(idx))
            for idx in range(num_layers)
        ]

        self.build()

    def build(self):
        # build all trainable variables
        self.variables = []

        for cell in self.cells:
            for variable in cell.variables:
                self.variables.append(variable)


    def __call__(self, inputs_seq, training=False):
        # rnn with different length sequences, inputs : [batch, (embedded tensor)]

        # parallel computing
        if self.time_major:
            seq_len,batch_size,last_dim = inputs_seq.shape.as_list()
            for cell in self.cells:
                c_t, h_t = cell.zero_state(batch_size)
                outputs = []
                for inp in inputs_seq:
                    output, c_t, h_t = cell(inp, c_t, h_t)
                    outputs.append(output)

                inputs_seq = tf.stack(outputs, axis=0)
                if training:
                    inputs_seq = tf.nn.dropout(inputs_seq, self.keep_ratio)
            batch_outputs = inputs_seq
        # linear computing
        else:
            for cell in self.cells:
                c_t,h_t = cell.zero_state(1)
                outputs = []
                for seq in inputs_seq:
                    seq_outputs = []
                    for word in seq:
                        word = tf.reshape(word,(1,-1))
                        output, c_t,h_t = cell(word, c_t,h_t)
                        if training:
                            output = tf.nn.dropout(output, self.keep_ratio)
                        seq_outputs.append(output)
                    outputs.append(seq_outputs)
                inputs_seq = outputs
            batch_outputs = inputs_seq
        return batch_outputs
class FC():
    # A Fully Connection Layer.
    def __init__(self, inputs_dim, outputs_dim,time_major=True,active=tf.nn.relu,name=None):
        self.inputs_dim = inputs_dim
        self.outputs_dim= outputs_dim
        self.time_major = time_major
        self.active = active

        # Build trainable variables
        self.weights = tfe.Variable(initial_value=tf.random_normal(shape=(inputs_dim,outputs_dim)),
                                  dtype=tf.float32,name=name,trainable=True)
        self.bias = tfe.Variable(initial_value=tf.random_normal(shape=(outputs_dim,)),
                                  dtype=tf.float32,name=name,trainable=True)

    def __call__(self, x):
        # different length sequences, inputs : [batch, (RNN outputs tensor)]
        # whether to parallel computing
        if self.time_major:
            seq_len,batch_size,last_dim = x.shape.as_list()
            # flatten
            flat = tf.reshape(x,(-1,last_dim))
            fc = tf.nn.xw_plus_b(flat,self.weights,self.bias)
            outputs = self.active(fc)
            outputs = tf.reshape(outputs,(seq_len,batch_size,-1))
        else:
            outputs = []
            for seq in x:
                fc = tf.nn.xw_plus_b(seq, self.weights, self.bias)
                fc = self.active(fc)
                outputs.append(fc)
        return outputs

```

Part of the training code is as follows:

```python
with tf.GradientTape() as tape:
    logits = model(batch_forward, batch_backward, training=True)
    loss = training_helper.loss_fn_time_major(batch_labels, logits, batch_lengths)
    # add loss summary
    tf.contrib.summary.scalar('loss', loss)
grads = tape.gradient(loss, model.variables)
optimizer.apply_gradients(zip(grads, model.variables))
print('loss at epoch %d step %d: %f' % (epoch, train_step, loss))
```

I used the setting you recommended, log_device_placement=True, and observed an interesting phenomenon. Every certain number of training steps, the same three ops are logged, and the interval (in training steps) between these log bursts gradually decreases.
The log after turning on log_device_placement=True is below. At the beginning, the three ops are logged about every 16 training steps; as training progresses, they are logged every three steps, until the OOM error happens. I don't know why this happens or what these three operations (Pack, Unpack, AddN) represent. I hope this information helps to solve the problem. Thank you again for your reply.

some logs

loss at epoch 0 step 805: 6.740626
loss at epoch 0 step 806: 7.095036
loss at epoch 0 step 807: 7.000319
loss at epoch 0 step 808: 6.904347
loss at epoch 0 step 809: 6.922751
loss at epoch 0 step 810: 6.838681
loss at epoch 0 step 811: 6.905680
loss at epoch 0 step 812: 7.053514
loss at epoch 0 step 813: 6.855484
loss at epoch 0 step 814: 6.755508
loss at epoch 0 step 815: 7.050679
loss at epoch 0 step 816: 7.023479
loss at epoch 0 step 817: 6.827959
loss at epoch 0 step 818: 6.776998
loss at epoch 0 step 819: 6.765797
loss at epoch 0 step 820: 6.759283
2018-06-25 14:57:15.878743: I tensorflow/c/eager/c_api.cc:856] Executing op Pack in device /job:localhost/replica:0/task:0/device:GPU:0
2018-06-25 14:57:15.935075: I tensorflow/c/eager/c_api.cc:856] Executing op Unpack in device /job:localhost/replica:0/task:0/device:GPU:0
2018-06-25 14:57:15.970760: I tensorflow/c/eager/c_api.cc:856] Executing op AddN in device /job:localhost/replica:0/task:0/device:GPU:0
loss at epoch 0 step 821: 6.984869
loss at epoch 0 step 822: 6.748662
loss at epoch 0 step 823: 6.772003
loss at epoch 0 step 824: 6.942709
loss at epoch 0 step 825: 6.892979
loss at epoch 0 step 826: 6.680350
loss at epoch 0 step 827: 6.867269
loss at epoch 0 step 828: 7.118308
loss at epoch 0 step 829: 6.601588
loss at epoch 0 step 830: 6.652170
loss at epoch 0 step 831: 7.263901
loss at epoch 0 step 832: 6.902826
loss at epoch 0 step 833: 6.791662
loss at epoch 0 step 834: 7.087551
loss at epoch 0 step 835: 6.820101
2018-06-25 14:57:18.650329: I tensorflow/c/eager/c_api.cc:856] Executing op Pack in device /job:localhost/replica:0/task:0/device:GPU:0
2018-06-25 14:57:18.710324: I tensorflow/c/eager/c_api.cc:856] Executing op Unpack in device /job:localhost/replica:0/task:0/device:GPU:0
2018-06-25 14:57:18.746490: I tensorflow/c/eager/c_api.cc:856] Executing op AddN in device /job:localhost/replica:0/task:0/device:GPU:0
loss at epoch 0 step 836: 7.031796
loss at epoch 0 step 837: 6.905032
loss at epoch 0 step 838: 6.889287
loss at epoch 0 step 839: 6.732944
loss at epoch 0 step 840: 6.715842
loss at epoch 0 step 841: 6.757300
loss at epoch 0 step 842: 6.882929
loss at epoch 0 step 843: 6.658679
loss at epoch 0 step 844: 6.788300
loss at epoch 0 step 845: 6.952966
loss at epoch 0 step 846: 7.061527
loss at epoch 0 step 847: 6.603632
loss at epoch 0 step 848: 7.163596
loss at epoch 0 step 849: 7.038760
2018-06-25 14:57:21.325933: I tensorflow/c/eager/c_api.cc:856] Executing op Pack in device /job:localhost/replica:0/task:0/device:GPU:0
2018-06-25 14:57:21.386127: I tensorflow/c/eager/c_api.cc:856] Executing op Unpack in device /job:localhost/replica:0/task:0/device:GPU:0
2018-06-25 14:57:21.423910: I tensorflow/c/eager/c_api.cc:856] Executing op AddN in device /job:localhost/replica:0/task:0/device:GPU:0
loss at epoch 0 step 850: 6.885034
loss at epoch 0 step 851: 6.931499
loss at epoch 0 step 852: 6.713628
loss at epoch 0 step 853: 6.777599
loss at epoch 0 step 854: 6.799969
loss at epoch 0 step 855: 6.985339
loss at epoch 0 step 856: 6.620006
loss at epoch 0 step 857: 6.878152
loss at epoch 0 step 858: 6.783222
loss at epoch 0 step 859: 6.712417
loss at epoch 0 step 860: 6.734530
loss at epoch 0 step 861: 7.009552
2018-06-25 14:57:23.706079: I tensorflow/c/eager/c_api.cc:856] Executing op Pack in device /job:localhost/replica:0/task:0/device:GPU:0
2018-06-25 14:57:23.769647: I tensorflow/c/eager/c_api.cc:856] Executing op Unpack in device /job:localhost/replica:0/task:0/device:GPU:0
2018-06-25 14:57:23.809172: I tensorflow/c/eager/c_api.cc:856] Executing op AddN in device /job:localhost/replica:0/task:0/device:GPU:0
loss at epoch 0 step 862: 6.570765
loss at epoch 0 step 863: 6.818643
loss at epoch 0 step 864: 6.707229
loss at epoch 0 step 865: 6.860226
loss at epoch 0 step 866: 6.832708
loss at epoch 0 step 867: 6.746671
loss at epoch 0 step 868: 6.925843
loss at epoch 0 step 869: 6.845281
loss at epoch 0 step 870: 6.901897
loss at epoch 0 step 871: 6.936728
loss at epoch 0 step 872: 6.793565
2018-06-25 14:57:25.894904: I tensorflow/c/eager/c_api.cc:856] Executing op Pack in device /job:localhost/replica:0/task:0/device:GPU:0
2018-06-25 14:57:25.958699: I tensorflow/c/eager/c_api.cc:856] Executing op Unpack in device /job:localhost/replica:0/task:0/device:GPU:0
2018-06-25 14:57:25.998658: I tensorflow/c/eager/c_api.cc:856] Executing op AddN in device /job:localhost/replica:0/task:0/device:GPU:0
loss at epoch 0 step 873: 6.795174
loss at epoch 0 step 874: 6.897914
loss at epoch 0 step 875: 6.964663
loss at epoch 0 step 876: 6.933444
loss at epoch 0 step 877: 6.889087
loss at epoch 0 step 878: 6.728238
loss at epoch 0 step 879: 6.543335
loss at epoch 0 step 880: 6.785100
loss at epoch 0 step 881: 6.867868
loss at epoch 0 step 882: 6.626728
2018-06-25 14:57:28.011803: I tensorflow/c/eager/c_api.cc:856] Executing op Pack in device /job:localhost/replica:0/task:0/device:GPU:0
2018-06-25 14:57:28.078783: I tensorflow/c/eager/c_api.cc:856] Executing op Unpack in device /job:localhost/replica:0/task:0/device:GPU:0
2018-06-25 14:57:28.121272: I tensorflow/c/eager/c_api.cc:856] Executing op AddN in device /job:localhost/replica:0/task:0/device:GPU:0
loss at epoch 0 step 883: 6.632171
loss at epoch 0 step 884: 6.638273
loss at epoch 0 step 885: 6.727106
loss at epoch 0 step 886: 6.755214
loss at epoch 0 step 887: 6.788674
loss at epoch 0 step 888: 6.618323
loss at epoch 0 step 889: 6.802135
loss at epoch 0 step 890: 6.812809
loss at epoch 0 step 891: 6.905511
2018-06-25 14:57:29.982565: I tensorflow/c/eager/c_api.cc:856] Executing op Pack in device /job:localhost/replica:0/task:0/device:GPU:0
2018-06-25 14:57:30.048872: I tensorflow/c/eager/c_api.cc:856] Executing op Unpack in device /job:localhost/replica:0/task:0/device:GPU:0
2018-06-25 14:57:30.089943: I tensorflow/c/eager/c_api.cc:856] Executing op AddN in device /job:localhost/replica:0/task:0/device:GPU:0
loss at epoch 0 step 892: 6.660409
loss at epoch 0 step 893: 6.817567
loss at epoch 0 step 894: 6.782594
loss at epoch 0 step 895: 6.746886
loss at epoch 0 step 896: 6.679479
loss at epoch 0 step 897: 6.696771
loss at epoch 0 step 898: 6.642287
loss at epoch 0 step 899: 6.870078
2018-06-25 14:57:31.693334: I tensorflow/c/eager/c_api.cc:856] Executing op Pack in device /job:localhost/replica:0/task:0/device:GPU:0
2018-06-25 14:57:31.761710: I tensorflow/c/eager/c_api.cc:856] Executing op Unpack in device /job:localhost/replica:0/task:0/device:GPU:0
2018-06-25 14:57:31.805339: I tensorflow/c/eager/c_api.cc:856] Executing op AddN in device /job:localhost/replica:0/task:0/device:GPU:0
loss at epoch 0 step 900: 6.774796
loss at epoch 0 step 901: 6.841524
loss at epoch 0 step 902: 6.682564
loss at epoch 0 step 903: 6.693571
loss at epoch 0 step 904: 6.603614
loss at epoch 0 step 905: 6.705801
loss at epoch 0 step 906: 6.713558
loss at epoch 0 step 907: 6.889330
2018-06-25 14:57:33.459800: I tensorflow/c/eager/c_api.cc:856] Executing op Pack in device /job:localhost/replica:0/task:0/device:GPU:0
2018-06-25 14:57:33.528563: I tensorflow/c/eager/c_api.cc:856] Executing op Unpack in device /job:localhost/replica:0/task:0/device:GPU:0
2018-06-25 14:57:33.573187: I tensorflow/c/eager/c_api.cc:856] Executing op AddN in device /job:localhost/replica:0/task:0/device:GPU:0
loss at epoch 0 step 908: 6.726191
loss at epoch 0 step 909: 6.945642
loss at epoch 0 step 910: 6.768640
loss at epoch 0 step 911: 6.801554
loss at epoch 0 step 912: 6.867862
loss at epoch 0 step 913: 6.563491
loss at epoch 0 step 914: 6.853379
loss at epoch 0 step 915: 6.724048
2018-06-25 14:57:35.341966: I tensorflow/c/eager/c_api.cc:856] Executing op Pack in device /job:localhost/replica:0/task:0/device:GPU:0
2018-06-25 14:57:35.412082: I tensorflow/c/eager/c_api.cc:856] Executing op Unpack in device /job:localhost/replica:0/task:0/device:GPU:0
2018-06-25 14:57:35.456774: I tensorflow/c/eager/c_api.cc:856] Executing op AddN in device /job:localhost/replica:0/task:0/device:GPU:0
loss at epoch 0 step 916: 6.639415
loss at epoch 0 step 917: 6.881902
loss at epoch 0 step 918: 6.735335
loss at epoch 0 step 919: 6.688423
loss at epoch 0 step 920: 6.636367
loss at epoch 0 step 921: 6.717305
loss at epoch 0 step 922: 6.629582
2018-06-25 14:57:36.960963: I tensorflow/c/eager/c_api.cc:856] Executing op Pack in device /job:localhost/replica:0/task:0/device:GPU:0
2018-06-25 14:57:37.033898: I tensorflow/c/eager/c_api.cc:856] Executing op Unpack in device /job:localhost/replica:0/task:0/device:GPU:0
2018-06-25 14:57:37.079990: I tensorflow/c/eager/c_api.cc:856] Executing op AddN in device /job:localhost/replica:0/task:0/device:GPU:0
loss at epoch 0 step 923: 6.682681
loss at epoch 0 step 924: 6.882014
loss at epoch 0 step 925: 6.731138
loss at epoch 0 step 926: 6.588252
loss at epoch 0 step 927: 6.824276
loss at epoch 0 step 928: 6.707603
loss at epoch 0 step 929: 6.748030
2018-06-25 14:57:38.690123: I tensorflow/c/eager/c_api.cc:856] Executing op Pack in device /job:localhost/replica:0/task:0/device:GPU:0
2018-06-25 14:57:38.767782: I tensorflow/c/eager/c_api.cc:856] Executing op Unpack in device /job:localhost/replica:0/task:0/device:GPU:0
2018-06-25 14:57:38.816409: I tensorflow/c/eager/c_api.cc:856] Executing op AddN in device /job:localhost/replica:0/task:0/device:GPU:0
loss at epoch 0 step 930: 6.658401
loss at epoch 0 step 931: 6.748807
loss at epoch 0 step 932: 6.802845
loss at epoch 0 step 933: 6.590845
loss at epoch 0 step 934: 6.749659
loss at epoch 0 step 935: 6.701960
2018-06-25 14:57:40.158598: I tensorflow/c/eager/c_api.cc:856] Executing op Pack in device /job:localhost/replica:0/task:0/device:GPU:0
2018-06-25 14:57:40.234263: I tensorflow/c/eager/c_api.cc:856] Executing op Unpack in device /job:localhost/replica:0/task:0/device:GPU:0
2018-06-25 14:57:40.282647: I tensorflow/c/eager/c_api.cc:856] Executing op AddN in device /job:localhost/replica:0/task:0/device:GPU:0
loss at epoch 0 step 936: 6.775355
loss at epoch 0 step 937: 6.781133
loss at epoch 0 step 938: 6.600548
loss at epoch 0 step 939: 6.747424
loss at epoch 0 step 940: 6.605201
2018-06-25 14:57:41.416798: I tensorflow/c/eager/c_api.cc:856] Executing op Pack in device /job:localhost/replica:0/task:0/device:GPU:0
2018-06-25 14:57:41.496970: I tensorflow/c/eager/c_api.cc:856] Executing op Unpack in device /job:localhost/replica:0/task:0/device:GPU:0
2018-06-25 14:57:41.550523: I tensorflow/c/eager/c_api.cc:856] Executing op AddN in device /job:localhost/replica:0/task:0/device:GPU:0
loss at epoch 0 step 941: 6.672260
loss at epoch 0 step 942: 6.614303
loss at epoch 0 step 943: 6.726663
loss at epoch 0 step 944: 6.694652
loss at epoch 0 step 945: 6.590765
2018-06-25 14:57:42.720835: I tensorflow/c/eager/c_api.cc:856] Executing op Pack in device /job:localhost/replica:0/task:0/device:GPU:0
2018-06-25 14:57:42.800870: I tensorflow/c/eager/c_api.cc:856] Executing op Unpack in device /job:localhost/replica:0/task:0/device:GPU:0
2018-06-25 14:57:42.852718: I tensorflow/c/eager/c_api.cc:856] Executing op AddN in device /job:localhost/replica:0/task:0/device:GPU:0
loss at epoch 0 step 946: 6.639668
loss at epoch 0 step 947: 6.581154
loss at epoch 0 step 948: 6.780156
loss at epoch 0 step 949: 6.619957
loss at epoch 0 step 950: 6.680293
2018-06-25 14:57:44.034047: I tensorflow/c/eager/c_api.cc:856] Executing op Pack in device /job:localhost/replica:0/task:0/device:GPU:0
2018-06-25 14:57:44.116253: I tensorflow/c/eager/c_api.cc:856] Executing op Unpack in device /job:localhost/replica:0/task:0/device:GPU:0
2018-06-25 14:57:44.167625: I tensorflow/c/eager/c_api.cc:856] Executing op AddN in device /job:localhost/replica:0/task:0/device:GPU:0
loss at epoch 0 step 951: 6.767905
loss at epoch 0 step 952: 6.694592
loss at epoch 0 step 953: 6.586446
loss at epoch 0 step 954: 6.691846
2018-06-25 14:57:45.122539: I tensorflow/c/eager/c_api.cc:856] Executing op Pack in device /job:localhost/replica:0/task:0/device:GPU:0
2018-06-25 14:57:45.204973: I tensorflow/c/eager/c_api.cc:856] Executing op Unpack in device /job:localhost/replica:0/task:0/device:GPU:0
2018-06-25 14:57:45.258423: I tensorflow/c/eager/c_api.cc:856] Executing op AddN in device /job:localhost/replica:0/task:0/device:GPU:0
loss at epoch 0 step 955: 6.994371
loss at epoch 0 step 956: 6.858140
loss at epoch 0 step 957: 6.646007
loss at epoch 0 step 958: 6.721205
2018-06-25 14:57:46.238911: I tensorflow/c/eager/c_api.cc:856] Executing op Pack in device /job:localhost/replica:0/task:0/device:GPU:0
2018-06-25 14:57:46.325236: I tensorflow/c/eager/c_api.cc:856] Executing op Unpack in device /job:localhost/replica:0/task:0/device:GPU:0
2018-06-25 14:57:46.380011: I tensorflow/c/eager/c_api.cc:856] Executing op AddN in device /job:localhost/replica:0/task:0/device:GPU:0
loss at epoch 0 step 959: 6.589066
loss at epoch 0 step 960: 6.683684
loss at epoch 0 step 961: 6.672047
2018-06-25 14:57:47.046723: I tensorflow/c/eager/c_api.cc:856] Executing op Pack in device /job:localhost/replica:0/task:0/device:GPU:0
2018-06-25 14:57:47.203337: I tensorflow/c/eager/c_api.cc:856] Executing op Unpack in device /job:localhost/replica:0/task:0/device:GPU:0
2018-06-25 14:57:47.260355: I tensorflow/c/eager/c_api.cc:856] Executing op AddN in device /job:localhost/replica:0/task:0/device:GPU:0
loss at epoch 0 step 962: 6.725653
loss at epoch 0 step 963: 6.720684
loss at epoch 0 step 964: 6.636925
2018-06-25 14:57:47.932850: I tensorflow/c/eager/c_api.cc:856] Executing op Pack in device /job:localhost/replica:0/task:0/device:GPU:0
2018-06-25 14:57:48.021669: I tensorflow/c/eager/c_api.cc:856] Executing op Unpack in device /job:localhost/replica:0/task:0/device:GPU:0
2018-06-25 14:57:48.077942: I tensorflow/c/eager/c_api.cc:856] Executing op AddN in device /job:localhost/replica:0/task:0/device:GPU:0
loss at epoch 0 step 965: 6.461943
loss at epoch 0 step 966: 6.716685
loss at epoch 0 step 967: 6.541767
2018-06-25 14:57:48.839285: I tensorflow/c/eager/c_api.cc:856] Executing op Pack in device /job:localhost/replica:0/task:0/device:GPU:0
2018-06-25 14:57:58.895255: W tensorflow/core/common_runtime/bfc_allocator.cc:275] Allocator (GPU_0_bfc) ran out of memory trying to allocate 90.19MiB. Current allocation summary follows.
2018-06-25 14:57:58.895406: I tensorflow/core/common_runtime/bfc_allocator.cc:630] Bin (256): Total Chunks: 3, Chunks in use: 3. 768B allocated for chunks. 768B in use in bin. 12B client-requested in use in bin.
2018-06-25 14:57:58.895440: I tensorflow/core/common_runtime/bfc_allocator.cc:630] Bin (512): Total Chunks: 17, Chunks in use: 16. 8.5KiB allocated for chunks. 8.0KiB in use in bin. 8.0KiB client-requested in use in bin.
2018-06-25 14:57:58.895466: I tensorflow/core/common_runtime/bfc_allocator.cc:630] Bin (1024): Total Chunks: 1, Chunks in use: 1. 1.2KiB allocated for chunks. 1.2KiB in use in bin. 1.0KiB client-requested in use in bin.
2018-06-25 14:57:58.895488: I tensorflow/core/common_runtime/bfc_allocator.cc:630] Bin (2048): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2018-06-25 14:57:58.895517: I tensorflow/core/common_runtime/bfc_allocator.cc:630] Bin (4096): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2018-06-25 14:57:58.895538: I tensorflow/core/common_runtime/bfc_allocator.cc:630] Bin (8192): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2018-06-25 14:57:58.895565: I tensorflow/core/common_runtime/bfc_allocator.cc:630] Bin (16384): Total Chunks: 5, Chunks in use: 3. 124.0KiB allocated for chunks. 79.0KiB in use in bin. 78.9KiB client-requested in use in bin.
2018-06-25 14:57:58.895592: I tensorflow/core/common_runtime/bfc_allocator.cc:630] Bin (32768): Total Chunks: 903, Chunks in use: 901. 28.30MiB allocated for chunks. 28.24MiB in use in bin. 28.15MiB client-requested in use in bin.
2018-06-25 14:57:58.895618: I tensorflow/core/common_runtime/bfc_allocator.cc:630] Bin (65536): Total Chunks: 402, Chunks in use: 402. 25.56MiB allocated for chunks. 25.56MiB in use in bin. 25.16MiB client-requested in use in bin.
2018-06-25 14:57:58.895639: I tensorflow/core/common_runtime/bfc_allocator.cc:630] Bin (131072): Total Chunks: 19, Chunks in use: 19. 2.51MiB allocated for chunks. 2.51MiB in use in bin. 2.47MiB client-requested in use in bin.
2018-06-25 14:57:58.895673: I tensorflow/core/common_runtime/bfc_allocator.cc:630] Bin (262144): Total Chunks: 8, Chunks in use: 8. 2.97MiB allocated for chunks. 2.97MiB in use in bin. 2.75MiB client-requested in use in bin.

@alextp
Contributor

alextp commented Jun 25, 2018 via email

@allenlavoie
Member

Looks like one issue is that summaries are adding to global collections. Could you try taking summaries out and seeing if there's still a leak? Or post a training loop and I can give it a try.

@qijimrc
Author

qijimrc commented Jun 26, 2018

Hi @allenlavoie, I tried removing the summaries but got the same result, so it does not look like an error related to summaries. Thank you all the same for your reply.

@qijimrc
Author

qijimrc commented Jun 26, 2018

@alextp, I think so. Is there any solution? Or, if I send you the complete program code, could you help me debug it? Can I communicate with you about this issue via email? Thank you very much!

@tensorflowbutler tensorflowbutler removed the stat:awaiting response Status - Awaiting response from author label Jun 26, 2018
@allenlavoie
Member

Well, the summary issue is fixed at least. I'm happy to track down the other memory issue, but you'll need to post a training loop which reproduces the issue (and which I can run).

@allenlavoie
Member

(Leaving it closed for now, but feel free to re-open with a repro.)

@LeoLai930603

Still facing this problem in 1.10.0; I hope it can be fixed soon.

@allenlavoie
Member

@LeoLai930603 could you post a snippet to reproduce?

@LeoLai930603

Sure, I posted it on my GitHub: https://github.com/LeoLai930603/Eager_OOM_reproduce/tree/master

BTW, I tried this code on Windows 10 with Anaconda (Python 3.5, TensorFlow 1.8.0), but I think the error could also occur on Linux. I hit this error on both CPU and GPU; I hope that information is helpful.

@allenlavoie
Member

@LeoLai930603 thanks, that's super helpful. It looks like one issue was assigning functions to EagerTensors (as when slicing); I have a workaround for that in the process of being submitted (and will properly patch it soon).

Looks like something RNN related is still holding on to LSTMStateTuples, so I'll take a look at that too.

@allenlavoie
Member

I will follow up and properly patch variable slicing.

After that it's still leaking memory, but it looks like it's a zeros/ones caching issue with backprop. @akshaym was going to take a look (at https://github.com/LeoLai930603/Eager_OOM_reproduce/tree/master ).

I think it's a caching issue because clearing the cache after train_step (in eager_biLSTM_trainer.py) keeps the memory stable and means the number of PyObjects doesn't increase:

```python
from tensorflow.python.eager import context
context.context()._clear_caches()  # Increasing memory and PyObject count without this
import gc
gc.collect()
print(len(gc.get_objects()))
```
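
For context, a sketch of how that workaround might sit inside a training loop; the loop structure, dataset, and train_step names are assumptions, and context()._clear_caches() is the private API quoted above, so it may change between versions:

```python
from tensorflow.python.eager import context
import gc

for step, batch in enumerate(dataset):   # hypothetical dataset iterable
    train_step(batch)                    # hypothetical single training step

    # Workaround: drop eager's internal caches so tensors cached during
    # backprop on previous steps can be freed.
    context.context()._clear_caches()

    if step % 100 == 0:
        gc.collect()                     # optional: stabilize the object count
        print(step, len(gc.get_objects()))
```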

@LeoLai930603

Thanks a lot! I will give it a try later. Would this work for Eager on GPU? I am not sure whether this issue is only related to the gc on host. As the author mentioned at first, the problem could also be caused by insufficient GPU memory allocation.

@allenlavoie
Member

The PyObject issue was probably just host-side (and now that I think about it, you may not have run into it unless you were using a nightly from the past week or so). Not sure the gc.collect() is necessary there; just using it to get a stable count of objects in case there are reference cycles.

The zeros caching issue will affect GPU too, since (I assume) we cache GPU Tensors. But all my testing was CPU-only so far.

This caching helps avoid re-creating zero Tensors during backprop and is limited to 256 entries, but when running your model, ~60 entries already added up to tens of gigabytes. It sounds from Akshay like the solution will be to cap the size of the cache. (And maybe we should cap it more aggressively for GPU Tensors, since memory will be more constrained?)

@LeoLai930603

Agreed. I was also wondering whether this setting (cache size or entry limit) could be exposed to the user, so we can manage it ourselves for training and deployment. I also believe the RNN implementation deserves more attention in eager mode. For example, it is inconvenient to work with the trainable variables of LSTMCells during training because 1) the cell is lazily initialized, and I would expect an explicit way to initialize its weights, as with Conv2D, and 2) the trainable variables form a nested list, which makes pairing them with gradients harder (you have to flatten the list yourself instead of just using zip as in the tutorial; a sketch of that is below). I hope such cells can become more eager-friendly; at least that would help when debugging a memory leak.
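
To illustrate the nested-variables point, a minimal sketch of flattening cell variables before pairing them with gradients (cells, compute_loss, batch, and optimizer are assumed to exist already; tf.contrib.framework.nest is the TF 1.x nest utility):

```python
import tensorflow as tf

# Hypothetical: each cell exposes a (possibly nested) list of variables.
nested_vars = [cell.trainable_variables for cell in cells]

# Flatten the nested structure so it lines up with the flat gradient list.
flat_vars = tf.contrib.framework.nest.flatten(nested_vars)

with tf.GradientTape() as tape:
    loss = compute_loss(batch)           # assumed loss function
grads = tape.gradient(loss, flat_vars)
optimizer.apply_gradients(zip(grads, flat_vars))
```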

@allenlavoie
Member

Akshay's change limits caching to a few megabytes at most, so hopefully never an issue.

I believe the future of RNN APIs is being decided now (until Sept. 10): tensorflow/community#15. It may be worth reading over the design document and seeing whether your concerns are already addressed. If not, describing the pain points in a comment would be helpful.

@LeoLai930603

Much appreciated! I will wait for the dev20180831 build to test it. I hope this gets fixed completely.

@rmanak

rmanak commented Feb 5, 2020

There still seems to be a problem with the slicing operator resulting in a ResourceExhaustedError when using GradientTape. I was able to get around it by avoiding slicing operators in the loss calculation; a sketch of that kind of rewrite is below.
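
A minimal sketch of that kind of workaround, assuming a loss that originally indexed the logits with Python slicing; the names logits, targets, and positions are hypothetical:

```python
import tensorflow as tf

def loss_without_slicing(logits, targets, positions):
    # Gather the needed rows explicitly instead of slicing (e.g. logits[i])
    # inside the taped loss computation.
    selected = tf.gather(logits, positions, axis=0)
    return tf.reduce_mean(
        tf.nn.sparse_softmax_cross_entropy_with_logits(
            labels=targets, logits=selected))
```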
