tcmalloc: large alloc on Colab and Tensorflow killed on local machine due to over consumption of RAM #7652

Open
arunumd opened this issue Oct 11, 2019 · 6 comments
Assignees: sguada, marksandler2
Labels: models:research (models that come under research directory), type:support

Comments

arunumd commented Oct 11, 2019

System information

  • What is the top-level directory of the model you are using: /home
  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): Yes
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 18.04
  • TensorFlow installed from (source or binary): Binary
  • TensorFlow version (use command below): 1.9.0
  • Bazel version (if compiling from source): N/A
  • CUDA/cuDNN version: 10.1.243
  • GPU model and memory: NVIDIA Quadro RTX 5000; and 16 GB RAM
  • Exact command to reproduce:
    I ran the following commands in an IPython notebook, on both my local machine (local GPU) and on Google Colab:
!git clone https://github.com/charlesq34/pointnet.git
cd pointnet/sem_seg/
!sh download_data.sh
!python train.py --log_dir log6 --test_area 6

Describe the problem

The TensorFlow runtime always tries to consume all available RAM even though I have a GPU, and the kernel gets killed while training my deep-learning model. I consulted multiple online sources (1, 2, 3, 4, 5, 6) and tried the following:

  1. Reducing the batch size
  2. Changing the optimizer from Adam to momentum

However, none of these suggestions solved the problem (both changes are sketched below for reference).
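
For illustration, this is roughly what those two changes look like against the TF 1.x APIs. The --batch_size flag and all values shown are assumptions made for the sake of the sketch, not pointnet's actual train.py code:

# 1. Reduce the batch size, assuming train.py exposes a --batch_size flag:
#    !python train.py --log_dir log6 --test_area 6 --batch_size 8

# 2. Swap the optimizer from Adam to momentum in a TF 1.x training graph:
import tensorflow as tf

w = tf.Variable(0.0)       # dummy variable so the sketch is self-contained
loss = tf.square(w - 1.0)  # stand-in for the model's real loss

# optimizer = tf.train.AdamOptimizer(learning_rate=0.001)   # original choice
optimizer = tf.train.MomentumOptimizer(learning_rate=0.001, momentum=0.9)
train_op = optimizer.minimize(loss)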

Source code / logs

The error log is very long, so I am attaching it as a separate text file here:
ERROR_LOG.txt


rolba commented Nov 25, 2019

Hello.
Make sure you have actually reduced your batch size enough. I had the same issue with my code:
https://github.com/rolba/ai-nimals/blob/master/ai_nimals_train_alexnet.py
Reducing the batch size to 32 for the generators did the job.
I also kept an eye on RAM usage while training, using htop in the console. When swap started to overflow, that was my sign that the batch size was still too large.

You can find HDF5 generators on my GitHub account. Please check them out, use them, and let me know if you are still having problems.
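
For illustration, a minimal sketch of a batched HDF5 generator in that spirit (the dataset key names and file layout here are assumptions, not the exact code from ai-nimals):

import h5py

def hdf5_batch_generator(path, batch_size=32, x_key="images", y_key="labels"):
    """Yield (x, y) batches straight from disk so the full dataset never has to sit in RAM at once."""
    with h5py.File(path, "r") as f:
        n = f[x_key].shape[0]
        while True:  # loop forever, as fit_generator-style training expects
            for start in range(0, n, batch_size):
                stop = min(start + batch_size, n)
                yield f[x_key][start:stop], f[y_key][start:stop]
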
Br.
Pawel

@PrakashSuthar

Hello, I get the tcmalloc error very often when running code on Colab from a Python file (say, train.py), but the same code (the contents of train.py copied into a cell) runs from the cell without any such error. I would like to know the cause of this behaviour.

ravikyram self-assigned this on Jun 21, 2020
@ravikyram

@arunumd

Is this still an issue? Please close this thread if your issue has been resolved. Thanks!

ravikyram added the stat:awaiting response (waiting on input from the contributor) label on Jun 21, 2020
arunumd (author) commented Jun 22, 2020

@ravikyram Yes, this is still the same issue.

@ravikyram

@arunumd

Please let us know which pretrained model you are using and share the related code. Thanks!

tensorflowbutler removed the stat:awaiting response (waiting on input from the contributor) label on Jun 24, 2020
ravikyram added the models:research (models that come under research directory) label on Jul 22, 2020
ravikyram assigned sguada and marksandler2 and unassigned ravikyram on Jul 22, 2020

entorius commented Apr 6, 2021

This issue still persists, for example, when I try to run the model at https://github.com/dorarad/gansformer.
I'm using:
TensorFlow 1.15.0
Google Colab on a GPU runtime
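
For completeness, a quick way to confirm that environment from a Colab cell (a small sketch against the TF 1.x API; nothing here is specific to gansformer):

import tensorflow as tf

print(tf.__version__)               # expect 1.15.0
print(tf.test.is_gpu_available())   # True when a GPU runtime is attached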
