
[train_task_classifyapp.py] GPU memory grows unlimitedly #12

Closed
FineArtz opened this issue Aug 20, 2019 · 9 comments

@FineArtz

When I try to train the model on another dataset, I find that the program takes up more and more GPU memory and eventually triggers an out-of-memory (OOM) error. Then I add

gp = tf.get_default_graph()
gp.finalize()

before model.train_gen is called (line 402) to test whether new TensorFlow ops are added to the graph while training. The result is

RuntimeError: Graph is finalized and cannot be modified.

which is raised at self.model.fit_generator (line 237).
I don't know whether this is a bug or not. Since all the interfaces are written in Keras, it is difficult for me to pinpoint the exact problem in the TensorFlow backend.
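For reference, here is a minimal sketch of the kind of check I mean, assuming TF 1.x; the training call is only a placeholder, not the actual code from train_task_classifyapp.py:

import tensorflow as tf

graph = tf.get_default_graph()
ops_before = len(graph.get_operations())

# ... run one training epoch here, e.g. the fit_generator call ...

ops_after = len(graph.get_operations())
print("ops added during training:", ops_after - ops_before)

# Alternatively, freeze the graph so that any attempt to add an op raises
# "RuntimeError: Graph is finalized and cannot be modified."
graph.finalize()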

@tbennun
Collaborator

tbennun commented Aug 20, 2019

As far as I know, manually running finalize on a TF graph is definitely not a good idea unless there is a specific need for it.

I'm not sure whether the problem is Keras or TF not deallocating memory, your dataset/loader, or the GPU you are using simply running out of memory. When we run the training process on our dataset, we do not encounter this issue.

What is your sequence length limit? Maybe that, or the minibatch size, should be reduced.

@FineArtz
Author

On my dataset, the longest sequence has length 8618 and the mean sequence length is 425. I've tried reducing the minibatch size to 1, but it didn't help.

The reason I manually ran finalize on the graph was to test whether new nodes are added to the graph during training, and that does indeed happen. I'm now trying other ways to solve this problem.

Another question: could you please briefly explain which part of the model each of those parameters affects? I'm frustrated to find that the accuracy on my dataset is only about 0.3 and remains almost unchanged until the OOM error occurs. I have tried many combinations of parameters, but none of them performs well. Is there possibly something wrong?

@Zacharias030
Collaborator

Zacharias030 commented Aug 21, 2019 via email

@Zacharias030
Collaborator

Zacharias030 commented Aug 21, 2019

In fact, the POJ-104 dataset used for the classifyapp task does not contain very large files.

The histogram shows the number of statements per file for a subset of the dataset. As you can see, 8000 lines is an order of magnitude larger than any file in that subset.

[Figure: histogram of statements per file for a subset of the POJ-104 dataset]

In order to train on significantly longer sequences than that, you probably need a few tricks that go beyond the code provided here, but you can try training on the shorter sequences in your dataset.
The network probably generalizes to longer sequences at inference time fairly well and inference through an LSTM is quite memory efficient (activations don't need to be stored for backpropagation).
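For example, a minimal sketch of filtering out the long sequences before training; the sequence format and the max_len cutoff are placeholder assumptions, not code from this repository:

# Minimal sketch: keep only sequences at or below a length cutoff.
# `sequences` are assumed to be lists of token ids; `max_len` is a
# hypothetical cutoff chosen from the distribution in the histogram above.
def filter_by_length(sequences, labels, max_len=1000):
    kept = [(s, y) for s, y in zip(sequences, labels) if len(s) <= max_len]
    if not kept:
        return [], []
    seqs, labs = zip(*kept)
    return list(seqs), list(labs)

# Example usage:
# train_x, train_y = filter_by_length(train_x, train_y, max_len=1000)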

Hope this helps,
Zacharias

@FineArtz
Author

FineArtz commented Sep 2, 2019

Thanks for your patient reply. I finally found that it is probably the function tf.nn.embedding_lookup (lines 148 and 169 in train_task_classifyapp.py) that leads to the OOM error. It is used in the batch generator and adds new nodes to the current TensorFlow graph, which makes the graph grow at runtime. I manually rewrote it and the program now works well. I know that tf.nn.embedding_lookup is computed in parallel and is therefore faster, but I had to replace it to avoid the runtime error. Thanks again!
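To illustrate the general idea of the workaround (the names below are placeholders, not my actual rewrite): the lookup can be done with plain NumPy indexing inside the batch generator, which adds nothing to the TensorFlow graph:

import numpy as np

def embed_batch(embedding_matrix, index_batch):
    """embedding_matrix: (vocab_size, dim) float array.
    index_batch: (batch, seq_len) int array of token ids.
    Returns a (batch, seq_len, dim) array, equivalent to an embedding lookup."""
    # NumPy fancy indexing performs the lookup on the host, so no new
    # TensorFlow ops are created each time the generator yields a batch.
    return embedding_matrix[index_batch]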

@tbennun
Collaborator

tbennun commented Sep 2, 2019

If this is an issue with the current code base, would you mind creating a pull request with your fix? Thanks!

@Zacharias030
Collaborator

Strangely, I didn't encounter this OOM issue with the current code base. How to reproduce the problem and how did you eventually fix it?

@FineArtz
Author

I did some more tests and now think it is caused by some unknown error in my GPU server or in Keras/TensorFlow. I have given up on finding the true reason this bug appears, so this issue can be closed.

@tbennun
Collaborator

tbennun commented Sep 23, 2019

Thanks for reporting anyway. Good luck

@tbennun tbennun closed this as completed Sep 23, 2019