tcmalloc: large alloc on Colab and Tensorflow killed on local machine due to over consumption of RAM #7652

Open
arunumd opened this issue Oct 11, 2019 · 6 comments
Assignees: sguada, marksandler2
Labels: models:research (models that come under research directory), type:support

Comments

arunumd commented Oct 11, 2019

System information

  • What is the top-level directory of the model you are using: /home
  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): Yes
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 18.04
  • TensorFlow installed from (source or binary): Binary
  • TensorFlow version (use command below): 1.9.0
  • Bazel version (if compiling from source): N/A
  • CUDA/cuDNN version: 10.1.243
  • GPU model and memory: NVIDIA Quadro RTX 5000; and 16 GB RAM
  • Exact command to reproduce:
    I ran the following commands in an IPython notebook, on both my local machine (local GPU) and on Google Colab:
!git clone https://github.com/charlesq34/pointnet.git
cd pointnet/sem_seg/
!sh download_data.sh
!python train.py --log_dir log6 --test_area 6

Describe the problem

The TensorFlow runtime always tries to consume all available RAM even though I have a GPU, and the kernel gets killed while training my deep-learning model. I consulted multiple online sources (1, 2, 3, 4, 5, 6) and tried the following:

  1. Reducing the batch size
  2. Changing the optimizer from Adam to momentum

However, none of these suggestions solved the problem (both changes are sketched below for reference).
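
For illustration, this is roughly what those two changes look like against the TF 1.x APIs. The --batch_size flag and all values shown are assumptions made for the sake of the sketch, not pointnet's actual train.py code:

# 1. Reduce the batch size, assuming train.py exposes a --batch_size flag:
#    !python train.py --log_dir log6 --test_area 6 --batch_size 8

# 2. Swap the optimizer from Adam to momentum in a TF 1.x training graph:
import tensorflow as tf

w = tf.Variable(0.0)       # dummy variable so the sketch is self-contained
loss = tf.square(w - 1.0)  # stand-in for the model's real loss

# optimizer = tf.train.AdamOptimizer(learning_rate=0.001)   # original choice
optimizer = tf.train.MomentumOptimizer(learning_rate=0.001, momentum=0.9)
train_op = optimizer.minimize(loss)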

Source code / logs

The error log is very long, so I am attaching it as a separate text file here:
ERROR_LOG.txt


rolba commented Nov 25, 2019

Hello.
Make sure you have actually reduced your batch size enough. I had the same issue with my code:
https://github.com/rolba/ai-nimals/blob/master/ai_nimals_train_alexnet.py
Reducing the batch size to 32 for the generators did the job.
I also kept an eye on RAM usage while training, using htop in the console. When swap started to overflow, that was my sign that the batch size was still too large.

You can find HDF5 generators on my GitHub account. Please check them out, use them, and let me know if you are still having problems.
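
For illustration, a minimal sketch of a batched HDF5 generator in that spirit (the dataset key names and file layout here are assumptions, not the exact code from ai-nimals):

import h5py

def hdf5_batch_generator(path, batch_size=32, x_key="images", y_key="labels"):
    """Yield (x, y) batches straight from disk so the full dataset never has to sit in RAM at once."""
    with h5py.File(path, "r") as f:
        n = f[x_key].shape[0]
        while True:  # loop forever, as fit_generator-style training expects
            for start in range(0, n, batch_size):
                stop = min(start + batch_size, n)
                yield f[x_key][start:stop], f[y_key][start:stop]
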
Br.
Pawel

@PrakashSuthar

Hello, I get the tcmalloc error very often when running code on Colab from a Python file (say, train.py), but the same code (the contents of train.py copied into a cell) runs from the cell without any such error. I would like to know the cause of this behaviour.

ravikyram self-assigned this on Jun 21, 2020
@ravikyram

@arunumd

Is this still an issue? Please close this thread if your issue has been resolved. Thanks!

ravikyram added the stat:awaiting response (waiting on input from the contributor) label on Jun 21, 2020
arunumd (author) commented Jun 22, 2020

@ravikyram Yes, this is still the same issue.

@ravikyram

@arunumd

Please let us know which pretrained model you are using and share the related code. Thanks!

tensorflowbutler removed the stat:awaiting response (waiting on input from the contributor) label on Jun 24, 2020
ravikyram added the models:research (models that come under research directory) label on Jul 22, 2020
ravikyram assigned sguada and marksandler2 and unassigned ravikyram on Jul 22, 2020

entorius commented Apr 6, 2021

This issue still persists, for example, when I try to run the model at https://github.com/dorarad/gansformer.
I'm using:
TensorFlow 1.15.0
Google Colab on a GPU runtime
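
For completeness, a quick way to confirm that environment from a Colab cell (a small sketch against the TF 1.x API; nothing here is specific to gansformer):

import tensorflow as tf

print(tf.__version__)               # expect 1.15.0
print(tf.test.is_gpu_available())   # True when a GPU runtime is attached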
