
Tensorflow Object Detection API - SSD Continuously Increasing RAM Usage during Training #5296

Open
jonbakerfish opened this issue Sep 12, 2018 · 26 comments

@jonbakerfish

System information

  • What is the top-level directory of the model you are using: models/research/object_detection/
  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): No
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 16.04
  • TensorFlow installed from (source or binary): binary
  • TensorFlow version (use command below): 1.10.1
  • Bazel version (if compiling from source):
  • CUDA/cuDNN version: CUDA 9.0 / cuDNN 7.1
  • GPU model and memory: GeForce GTX 1080 Ti / 12GB
  • Exact command to reproduce: python object_detection/model_main.py --pipeline_config_path=${PIPELINE_CONFIG_PATH} --model_dir=${MODEL_DIR} --alsologtostderr

Describe the problem

I use the Object Detection API to train different models (ssd_mobilenet_v1_fpn_shared_box_predictor_640x640_coco14_sync, ssd_resnet50_v1_fpn_shared_box_predictor_640x640_coco14_sync, faster_rcnn_resnet50_coco) on my own dataset. Faster R-CNN runs without problems, but the SSD models continuously increase RAM usage during training and eventually hit OOM. There is a similar problem reported on Stack Overflow.

Source code / logs

NA

@FreestylePocker

Hi, I have a similar problem with this config when using model_main.py.
I trained on my own dataset with batch size 1 (my GTX 970 simply can't handle more), with TensorFlow v1.10 compiled from source with CUDA 9.2 support.
I think this issue is related to #5139.
It seems some leak occurs on every evaluation step, so increasing the delay between evaluations may help slow it down.
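For reference, a minimal sketch of spacing evaluations further apart with the plain tf.estimator API (this is not the Object Detection API's own training loop; estimator, train_input_fn and eval_input_fn are placeholders for your own setup, and throttle_secs is a standard tf.estimator.EvalSpec argument):

import tensorflow as tf  # TF 1.x Estimator API

# estimator, train_input_fn and eval_input_fn are assumed to exist already,
# e.g. built with the Object Detection API's model_lib helpers.
train_spec = tf.estimator.TrainSpec(input_fn=train_input_fn, max_steps=200000)
eval_spec = tf.estimator.EvalSpec(
    input_fn=eval_input_fn,
    steps=100,            # evaluate on fewer batches per evaluation run
    throttle_secs=3600)   # wait at least an hour between evaluations
tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)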

@karmel karmel assigned pkulzc and unassigned karmel Sep 14, 2018
@karmel

karmel commented Sep 14, 2018

@pkulzc -- can you comment on the memory issue described here? Also CC @robieta for SSD.

@KapoorHitesh

Any update on this? This memory leak issue has been around for 2 months now.

@pkulzc

pkulzc commented Nov 12, 2018

Could you sync to HEAD and try again? I think one of our earlier PRs should have fixed this.

@MirkoArnold1

I'm using the current HEAD, ssd_mobilenet_v2_coco_2018_03_29, runtime 1.10, python 3.5, on a standard_p100 master on google cloud. Memory utilization still grows over time:
[screenshot from 2018-11-20 09-40-36: chart of memory utilization growing over time]

@pkulzc

pkulzc commented Nov 20, 2018

@MirkoArnold1 Did you sync to latest?

@MirkoArnold1

@pkulzc Yes, I did, and I rebuilt the packages for cloud ml engine.

@1byte2bytes

I'm having the same issue, though I'm using CPU instead of GPU. Within a few steps it has completely eaten all 12GB of my RAM and my pagefile has ballooned.

@pkulzc

pkulzc commented Nov 23, 2018

The chart @MirkoArnold1 showed did indicate a memory leak. I'll look into that.

But using up 12GB of RAM seems normal. @1byte2bytes, did you try reducing the batch size?
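In case it is not obvious where that lives: the batch size is set in the train_config block of the pipeline .config file. A minimal excerpt (the value is just a placeholder):

train_config {
  batch_size: 1   # lower this if the machine runs out of memory
}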

@1byte2bytes

It used all 12GB of my system RAM, plus an additional 24GB of my page file (max size) pretty quickly. I could try a reduced batch size, or try it on a machine with more RAM I suppose.

@donghyeon

donghyeon commented Nov 29, 2018

I got the same problem while training Faster R-CNN on the COCO dataset. Initially the Object Detection API allocates a few gigabytes of memory (7-8GB). Usage grows gradually, and at some point the process consumes all of the memory and gets killed. At that moment the API was allocating more than 70GB (I ran 4 parallel experiments on 4 GPUs). My system has 256GB of memory, so that amount is quite large compared to the usual requirements. I didn't measure the elapsed time precisely, but I estimate it takes 10-20+ hours to reach that point.

@donghyeon

donghyeon commented Dec 13, 2018

I don't know exactly which factor makes memory allocation increase continuously, but in my experience the evaluation steps of the Object Detection API are strongly related to this problem. I'm training a Faster R-CNN model (which means SSD and Faster R-CNN architectures share the same problem) on the COCO dataset, whose validation set has over 40k examples. I found an obvious difference in the rate of memory growth when I change the flag "sample_1_of_n_eval_examples": a lower value results in much faster memory growth. This happens not only with the "tf.estimator.train_and_evaluate" method but also with the "estimator.evaluate" method. Please check the evaluation steps if anyone is willing to fix this problem.

For example, in my experiments, about 60GB of memory was allocated after running evaluation on the COCO dataset for roughly 16 hours with "sample_1_of_n_eval_examples=1". When I changed it to "sample_1_of_n_eval_examples=20", memory usage dropped to 25GB over the same period.
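For reference, sample_1_of_n_eval_examples is passed on the model_main.py command line; a sketch with the placeholder paths from the original report (keeping only 1 of every 20 eval examples):

python object_detection/model_main.py \
    --pipeline_config_path=${PIPELINE_CONFIG_PATH} \
    --model_dir=${MODEL_DIR} \
    --sample_1_of_n_eval_examples=20 \
    --alsologtostderr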

@xxllp

xxllp commented Dec 29, 2018

so bad ~~

@jewes

jewes commented Jan 4, 2019

Same issue here. The problem does not happen in an old version that I downloaded in May 2018.

@renanwille

Using TF v1.12, and it seems to me that the issue continues. It may be related to TF issue 24047.

@Jasonnor

Same issue here using TF 1.12; it seems I can only write a watchdog script to monitor the memory and restart training. 😞
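In case it is useful to anyone, a rough watchdog sketch along those lines (not code from this thread; it assumes psutil is installed, the command and memory limit are placeholders, and training resumes from the latest checkpoint in --model_dir after a restart):

import subprocess
import time

import psutil  # third-party: pip install psutil

# Placeholder training command and memory limit -- adjust to your setup.
CMD = ["python", "object_detection/model_main.py",
       "--pipeline_config_path=pipeline.config",
       "--model_dir=model_dir",
       "--alsologtostderr"]
LIMIT_BYTES = 20 * 1024 ** 3  # restart once the trainer uses ~20 GB of RAM

while True:
    proc = subprocess.Popen(CMD)
    watched = psutil.Process(proc.pid)
    while proc.poll() is None:
        try:
            rss = watched.memory_info().rss  # resident memory of the trainer
        except psutil.NoSuchProcess:
            break  # the trainer already exited
        if rss > LIMIT_BYTES:
            proc.terminate()  # training resumes from the last checkpoint on restart
            proc.wait()
            break
        time.sleep(30)
    if proc.poll() == 0:
        break  # training finished cleanly, stop the watchdog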

@renanwille

Tip: for the record, one can also use legacy/train.py to train the models.
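A sketch of the legacy invocation, with placeholder paths (flags as documented for the legacy binaries at the time):

python object_detection/legacy/train.py \
    --logtostderr \
    --pipeline_config_path=${PIPELINE_CONFIG_PATH} \
    --train_dir=${TRAIN_DIR}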

@wiseosho

wiseosho commented Jun 5, 2019

The same issue continues with TF 1.13; I guess the tf_func operation could be the cause.
I observed that memory usage jumps once evaluation starts and does not fall back down even after the evaluation is done. Also, the amount of leakage increases with the size of the evaluation data. Any other updates on this issue?

@liuchangf

You can try the following steps:

  1. Set batch_size=1 (or try your own value).
  2. Lower the default shuffle_buffer_size, e.g. optional uint32 shuffle_buffer_size = 11 [default = 256] (or try your own value).
    The field is defined here:
    https://github.com/tensorflow/models/blob/ce03903f516731171633d92a50e2218a4d3303b6/research/object_detection/protos/input_reader.proto#L40
    The original setting is:
    optional uint32 shuffle_buffer_size = 11 [default = 2048];
    In my opinion the default of 2048 is too big for batch_size=1 and consumes a lot of RAM, so it should be reduced accordingly. (See also the pipeline-config alternative sketched after this list.)
  3. Recompile the Protobuf libraries. From tensorflow/models/research/:
    protoc object_detection/protos/*.proto --python_out=.
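A possible alternative to editing the proto default (my own suggestion, not part of the steps above): since shuffle_buffer_size is an ordinary field of the input reader proto, it can be overridden per model in the pipeline .config instead, e.g.:

train_input_reader {
  tf_record_input_reader {
    input_path: "path/to/train.record"   # placeholder path
  }
  label_map_path: "path/to/label_map.pbtxt"   # placeholder path
  shuffle_buffer_size: 256   # smaller buffer, lower host RAM usage
}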

@Mageswaran1989

I see a similar increase in memory with dataset and custom estimator usage.

tensorflow/tensorflow#32052

@amussell

amussell commented Oct 4, 2019

Can you post the config you are using?

@Jacqueline-L-Lane

Has anyone gotten around this out-of-memory error on a Raspberry Pi 4? I tried making the modifications above, but the process is still killed due to running out of memory. This is the error I get:

Out of memory: Kill process 945 (train.py) score 569 or sacrifice child
[Fri Feb 21 17:25:32 2020] Killed process 945 (train.py) total-vm:915108kB, anon-rss:507480kB, file-rss:0kB, shmem-rss:0kB

@jaeyounkim jaeyounkim added this to Needs triage in Object Detection May 8, 2020
@moulicm111

This increasing-memory issue exists in TensorFlow 2 as well.

@veonua

veonua commented Jun 12, 2020

I believe this bug tensorflow/tensorflow#33516 is related.

In dataset_builder.py I changed
dataset.map( ... , tf.data.experimental.AUTOTUNE)
to
dataset.map( ... , num_parallel_calls)

and the memory leak seems to be fixed.
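To illustrate the kind of substitution described above (this is a generic tf.data sketch, not the actual dataset_builder.py code; the record path and parser are placeholders):

import tensorflow as tf

def decode(serialized_example):
    # Placeholder parser; the real one comes from the Object Detection API's decoder.
    return tf.io.parse_single_example(
        serialized_example,
        {"image/encoded": tf.io.FixedLenFeature([], tf.string)})

dataset = tf.data.TFRecordDataset(["train.record"])  # placeholder path

# Pin the parallelism to a fixed number of calls instead of letting tf.data
# autotune it, which the linked issue suspected of leaking memory:
num_parallel_calls = 4
dataset = dataset.map(decode, num_parallel_calls=num_parallel_calls)
# Original form:
# dataset = dataset.map(decode, num_parallel_calls=tf.data.experimental.AUTOTUNE)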

@ravikyram ravikyram added models:research models that come under research directory type:support labels Jul 15, 2020
@MoscowskyAnton

(quoting @liuchangf's steps above: set batch_size=1, lower the shuffle_buffer_size default in input_reader.proto, and recompile the protos)
Hello!
I did exactly what you mentioned, but when training starts it still says the shuffle buffer is 2048:
2021-02-05 11:19:52.076080: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:177] Filling up shuffle buffer (this may take a while): 1417 of 2048
Any suggestions?

@orangeronald

I had a similar issue before. After reducing the size of the model, the RAM usage somehow dropped back to a low point before reaching the RAM limit. The underlying issue of RAM growth during training is still unacknowledged.
