
Tensorflow Object Detection API - SSD Continuously Increasing RAM Usage during Training #5296

Open
jonbakerfish opened this issue Sep 12, 2018 · 26 comments

@jonbakerfish

System information

  • What is the top-level directory of the model you are using: models/research/object_detection/
  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): No
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 16.04
  • TensorFlow installed from (source or binary): binary
  • TensorFlow version (use command below): 1.10.1
  • Bazel version (if compiling from source):
  • CUDA/cuDNN version: CUDA 9.0 / cuDNN 7.1
  • GPU model and memory: GeForce GTX 1080 Ti / 12GB
  • Exact command to reproduce: python object_detection/model_main.py --pipeline_config_path=${PIPELINE_CONFIG_PATH} --model_dir=${MODEL_DIR} --alsologtostderr

Describe the problem

I use the Object Detection API to train different models (ssd_mobilenet_v1_fpn_shared_box_predictor_640x640_coco14_sync, ssd_resnet50_v1_fpn_shared_box_predictor_640x640_coco14_sync, faster_rcnn_resnet50_coco) on my own dataset. Faster R-CNN runs without problems, but the SSD models continuously increase RAM usage during training and eventually hit OOM. There is a similar problem reported on Stack Overflow.

Source code / logs

NA

@FreestylePocker

Hi, I have a similar problem with this config when using model_main.py.
I trained on my own dataset with batch size 1 (my GTX 970 simply can't handle more), with TensorFlow v1.10 compiled from source with CUDA 9.2 support.
I think this issue is related to #5139.
It seems some leak occurs on every evaluation step, so increasing the delay between evaluations may help slow it down.
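For reference, a minimal sketch of spacing evaluations further apart with the plain tf.estimator API (this is not the Object Detection API's own training loop; estimator, train_input_fn and eval_input_fn are placeholders for your own setup, and throttle_secs is a standard tf.estimator.EvalSpec argument):

import tensorflow as tf  # TF 1.x Estimator API

# estimator, train_input_fn and eval_input_fn are assumed to exist already,
# e.g. built with the Object Detection API's model_lib helpers.
train_spec = tf.estimator.TrainSpec(input_fn=train_input_fn, max_steps=200000)
eval_spec = tf.estimator.EvalSpec(
    input_fn=eval_input_fn,
    steps=100,            # evaluate on fewer batches per evaluation run
    throttle_secs=3600)   # wait at least an hour between evaluations
tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)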

@karmel karmel assigned pkulzc and unassigned karmel Sep 14, 2018
@karmel

karmel commented Sep 14, 2018

@pkulzc -- can you comment on the memory issue described here? Also CC @robieta for SSD.

@KapoorHitesh

Any update on this? This memory leak issue has been around for 2 months now.

@pkulzc

pkulzc commented Nov 12, 2018

Could you sync to HEAD and try again? I think one of our earlier PRs should have fixed this.

@MirkoArnold1

I'm using the current HEAD, ssd_mobilenet_v2_coco_2018_03_29, runtime 1.10, python 3.5, on a standard_p100 master on google cloud. Memory utilization still grows over time:
[screenshot from 2018-11-20 09-40-36: chart of memory utilization growing over time]

@pkulzc

pkulzc commented Nov 20, 2018

@MirkoArnold1 Did you sync to latest?

@MirkoArnold1

@pkulzc Yes, I did, and I rebuilt the packages for cloud ml engine.

@1byte2bytes

I'm having the same issue, though I'm using CPU instead of GPU. Within a few steps it has completely eaten all 12GB of my RAM and my pagefile has ballooned.

@pkulzc

pkulzc commented Nov 23, 2018

The chart @MirkoArnold1 showed did indicate a memory leak. I'll look into that.

But using up 12GB of RAM seems normal. @1byte2bytes, did you try reducing the batch size?
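In case it is not obvious where that lives: the batch size is set in the train_config block of the pipeline .config file. A minimal excerpt (the value is just a placeholder):

train_config {
  batch_size: 1   # lower this if the machine runs out of memory
}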

@1byte2bytes

It used all 12GB of my system RAM, plus an additional 24GB of my page file (max size) pretty quickly. I could try a reduced batch size, or try it on a machine with more RAM I suppose.

@donghyeon

donghyeon commented Nov 29, 2018

I got the same problem while training Faster R-CNN on the COCO dataset. Initially the Object Detection API allocates a few gigabytes of memory (7-8GB). Usage grows gradually, and at some point the process consumes all of the memory and gets killed. At that moment the API was allocating more than 70GB (I ran 4 parallel experiments on 4 GPUs). My system has 256GB of memory, so that amount is quite large compared to the usual requirements. I didn't measure the elapsed time precisely, but I estimate it takes 10-20+ hours to reach that point.

@donghyeon

donghyeon commented Dec 13, 2018

I don't know exactly which factor makes memory allocation increase continuously, but in my experience the evaluation steps of the Object Detection API are strongly related to this problem. I'm training a Faster R-CNN model (which means SSD and Faster R-CNN architectures share the same problem) on the COCO dataset, whose validation set has over 40k examples. I found an obvious difference in the rate of memory growth when I change the flag "sample_1_of_n_eval_examples": a lower value results in much faster memory growth. This happens not only with the "tf.estimator.train_and_evaluate" method but also with the "estimator.evaluate" method. Please check the evaluation steps if anyone is willing to fix this problem.

For example, in my experiments, about 60GB of memory was allocated after running evaluation on the COCO dataset for roughly 16 hours with "sample_1_of_n_eval_examples=1". When I changed it to "sample_1_of_n_eval_examples=20", memory usage dropped to 25GB over the same period.
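For reference, sample_1_of_n_eval_examples is passed on the model_main.py command line; a sketch with the placeholder paths from the original report (keeping only 1 of every 20 eval examples):

python object_detection/model_main.py \
    --pipeline_config_path=${PIPELINE_CONFIG_PATH} \
    --model_dir=${MODEL_DIR} \
    --sample_1_of_n_eval_examples=20 \
    --alsologtostderr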

@xxllp

xxllp commented Dec 29, 2018

so bad ~~

@jewes

jewes commented Jan 4, 2019

Same issue here. The problem does not happen in an old version that I downloaded in May 2018.

@renanwille

Using TF v1.12, and it seems to me that the issue continues. It may be related to TF issue 24047.

@Jasonnor

Same issue here using TF 1.12; it seems I can only write a watchdog script to monitor the memory and restart training. 😞
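In case it is useful to anyone, a rough watchdog sketch along those lines (not code from this thread; it assumes psutil is installed, the command and memory limit are placeholders, and training resumes from the latest checkpoint in --model_dir after a restart):

import subprocess
import time

import psutil  # third-party: pip install psutil

# Placeholder training command and memory limit -- adjust to your setup.
CMD = ["python", "object_detection/model_main.py",
       "--pipeline_config_path=pipeline.config",
       "--model_dir=model_dir",
       "--alsologtostderr"]
LIMIT_BYTES = 20 * 1024 ** 3  # restart once the trainer uses ~20 GB of RAM

while True:
    proc = subprocess.Popen(CMD)
    watched = psutil.Process(proc.pid)
    while proc.poll() is None:
        try:
            rss = watched.memory_info().rss  # resident memory of the trainer
        except psutil.NoSuchProcess:
            break  # the trainer already exited
        if rss > LIMIT_BYTES:
            proc.terminate()  # training resumes from the last checkpoint on restart
            proc.wait()
            break
        time.sleep(30)
    if proc.poll() == 0:
        break  # training finished cleanly, stop the watchdog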

@renanwille

Tip: for the record, one can also use legacy/train.py to train the models.
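A sketch of the legacy invocation, with placeholder paths (flags as documented for the legacy binaries at the time):

python object_detection/legacy/train.py \
    --logtostderr \
    --pipeline_config_path=${PIPELINE_CONFIG_PATH} \
    --train_dir=${TRAIN_DIR}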

@wiseosho

wiseosho commented Jun 5, 2019

The same issue continues with TF 1.13; I guess the tf_func operation could be the cause.
I observed that memory usage jumps once evaluation starts and does not fall back down even after the evaluation is done. Also, the amount of leakage increases with the size of the evaluation data. Any other updates on this issue?

@liuchangf

You can try the following steps:

  1. Set batch_size=1 (or try your own value).
  2. Lower the default shuffle_buffer_size, e.g. optional uint32 shuffle_buffer_size = 11 [default = 256] (or try your own value).
    The field is defined here:
    https://github.com/tensorflow/models/blob/ce03903f516731171633d92a50e2218a4d3303b6/research/object_detection/protos/input_reader.proto#L40
    The original setting is:
    optional uint32 shuffle_buffer_size = 11 [default = 2048];
    In my opinion the default of 2048 is too big for batch_size=1 and consumes a lot of RAM, so it should be reduced accordingly. (See also the pipeline-config alternative sketched after this list.)
  3. Recompile the Protobuf libraries. From tensorflow/models/research/:
    protoc object_detection/protos/*.proto --python_out=.
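A possible alternative to editing the proto default (my own suggestion, not part of the steps above): since shuffle_buffer_size is an ordinary field of the input reader proto, it can be overridden per model in the pipeline .config instead, e.g.:

train_input_reader {
  tf_record_input_reader {
    input_path: "path/to/train.record"   # placeholder path
  }
  label_map_path: "path/to/label_map.pbtxt"   # placeholder path
  shuffle_buffer_size: 256   # smaller buffer, lower host RAM usage
}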

@Mageswaran1989

I see a similar increase in memory with dataset and custom estimator usage.

tensorflow/tensorflow#32052

@amussell

amussell commented Oct 4, 2019

Can you post the config you are using?

@Jacqueline-L-Lane

Has anyone gotten around this out-of-memory error on a Raspberry Pi 4? I tried making the modifications above, but the process is still killed due to running out of memory. This is the error I get:

Out of memory: Kill process 945 (train.py) score 569 or sacrifice child
[Fri Feb 21 17:25:32 2020] Killed process 945 (train.py) total-vm:915108kB, anon-rss:507480kB, file-rss:0kB, shmem-rss:0kB

@jaeyounkim jaeyounkim added this to Needs triage in Object Detection May 8, 2020
@moulicm111

This increasing-memory issue exists in TensorFlow 2 as well.

@veonua

veonua commented Jun 12, 2020

I believe this bug tensorflow/tensorflow#33516 is related.

In dataset_builder.py I changed
dataset.map( ... , tf.data.experimental.AUTOTUNE)
to
dataset.map( ... , num_parallel_calls)

and the memory leak seems to be fixed.
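To illustrate the kind of substitution described above (this is a generic tf.data sketch, not the actual dataset_builder.py code; the record path and parser are placeholders):

import tensorflow as tf

def decode(serialized_example):
    # Placeholder parser; the real one comes from the Object Detection API's decoder.
    return tf.io.parse_single_example(
        serialized_example,
        {"image/encoded": tf.io.FixedLenFeature([], tf.string)})

dataset = tf.data.TFRecordDataset(["train.record"])  # placeholder path

# Pin the parallelism to a fixed number of calls instead of letting tf.data
# autotune it, which the linked issue suspected of leaking memory:
num_parallel_calls = 4
dataset = dataset.map(decode, num_parallel_calls=num_parallel_calls)
# Original form:
# dataset = dataset.map(decode, num_parallel_calls=tf.data.experimental.AUTOTUNE)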

@ravikyram ravikyram added models:research models that come under research directory type:support labels Jul 15, 2020
@MoscowskyAnton

(quoting @liuchangf's steps above: set batch_size=1, lower the shuffle_buffer_size default in input_reader.proto, and recompile the protos)
Hello!
I did exactly what you mentioned, but when training starts it still says the shuffle buffer is 2048:
2021-02-05 11:19:52.076080: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:177] Filling up shuffle buffer (this may take a while): 1417 of 2048
Any suggestions?

@orangeronald

I had a similar issue before. After reducing the size of the model, the RAM usage somehow dropped back to a low point before reaching the RAM limit. The underlying issue of RAM growth during training is still unacknowledged.
