Tensorflow Object Detection API - SSD Continuously Increasing RAM Usage during Training #5296
Comments
Hi, I have a similar problem with this config when using model_main.py.
Any update on this? The memory leak issue has been open for two months now.
Could you sync to HEAD and try again? I think one of our earlier PRs should have fixed this.
@MirkoArnold1 Did you sync to latest?
@pkulzc Yes, I did, and I rebuilt the packages for Cloud ML Engine.
I'm having the same issue, though I'm using CPU instead of GPU. Within a few steps it has completely eaten all 12 GB of my RAM and my pagefile has ballooned.
The chart @MirkoArnold1 showed did indicate a memory leak; I'll look into that. But using up 12 GB of RAM seems normal. @1byte2bytes, did you try reducing the batch size?
It used all 12 GB of my system RAM, plus an additional 24 GB of my page file (the maximum size), pretty quickly. I could try a reduced batch size, or try it on a machine with more RAM, I suppose.
I ran into the same problem while training Faster R-CNN on the COCO dataset. Initially the Object Detection API allocates a few gigabytes (7–8 GB). Usage grows gradually and then suddenly consumes all of the memory, at which point the process is killed. By that point the API was allocating over 70 GB (I ran 4 parallel experiments on 4 GPUs). My system has 256 GB of memory, so that amount is far beyond the usual requirements. I didn't measure the elapsed time precisely, but I estimate it takes 10–20+ hours to reach that point.
I don't know exactly which factor makes memory allocation grow continuously, but in my experience the evaluation steps of the Object Detection API are strongly related to this problem. I'm training a Faster R-CNN model (so both SSD and Faster R-CNN architectures show the same problem) on the COCO dataset, whose validation set has over 40k examples. I found an obvious difference in how fast memory grows when I change the value of the "sample_1_of_n_eval_examples" flag: a lower "sample_1_of_n_eval_examples" results in a much faster increase in memory usage. The problem happens not only with the "tf.estimator.train_and_evaluate" method but also with "estimator.evaluate". Please check the evaluation steps if anyone is willing to fix this problem. For example, in my experiments, 60 GB of memory was allocated after running evaluation on the COCO dataset for about 16 hours with "sample_1_of_n_eval_examples=1". When I changed it to "sample_1_of_n_eval_examples=20", memory usage at the same point was reduced to 25 GB.
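A minimal sketch of what this flag effectively does, assuming a plain tf.data input pipeline (this is an illustration of the subsampling behaviour, not the API's actual implementation; the record file name is hypothetical):

```python
import tensorflow as tf

def subsample_every_nth(dataset, sample_1_of_n):
    """Keep only every Nth example, mimicking sample_1_of_n_eval_examples."""
    if sample_1_of_n <= 1:
        return dataset
    # shard(N, 0) keeps the elements whose index is a multiple of N.
    return dataset.shard(num_shards=sample_1_of_n, index=0)

# Hypothetical eval input: a larger sample_1_of_n means far fewer examples
# (and their decoded tensors) flow through the evaluation pipeline.
eval_dataset = tf.data.TFRecordDataset(["coco_val.record"])
eval_dataset = subsample_every_nth(eval_dataset, sample_1_of_n=20)
```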
So bad ~~
Same issue here. The issue does not happen in an old version which was downloaded in May 2018.
Using TF v1.12, and it seems to me that the issue continues. It may be related to TF issue 24047.
Same issue here using TF 1.12; it seems I can only write a watchdog script to monitor memory and restart the training. 😞
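A minimal sketch of such a watchdog, assuming psutil is installed; the training command and the 10 GB limit below are placeholders to adapt to your setup:

```python
import subprocess
import time

import psutil  # assumption: installed via `pip install psutil`

TRAIN_CMD = [
    "python", "model_main.py",
    "--pipeline_config_path=pipeline.config",
    "--model_dir=training/",
]  # placeholder command and paths
RSS_LIMIT_BYTES = 10 * 1024 ** 3  # restart once the trainer exceeds ~10 GB

def run_with_memory_watchdog():
    while True:
        proc = subprocess.Popen(TRAIN_CMD)
        while proc.poll() is None:
            rss = psutil.Process(proc.pid).memory_info().rss
            if rss > RSS_LIMIT_BYTES:
                proc.terminate()  # training resumes from the latest checkpoint
                proc.wait()
                break
            time.sleep(30)
        else:
            return  # the training process exited on its own

if __name__ == "__main__":
    run_with_memory_watchdog()
```

Since the estimator restores from the newest checkpoint in model_dir on restart, each restart only loses the steps since the last save.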
Tip: Only for documentation, one can use
The same issue continues with TF 1.13; I guess the tf_func operation could be the cause.
You can try the following: in the original setting the default value is 2048, which is too big for batch_size=1 and should be reduced accordingly; in my opinion it consumes a lot of RAM.
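The comment above doesn't name the exact field, but the input reader's shuffle_buffer_size (whose default is 2048 in recent versions of the API) is a likely candidate. A minimal sketch of shrinking the input buffers in a pipeline config; treat the field names and replacement values as assumptions to adapt to your own config:

```python
from google.protobuf import text_format
from object_detection.protos import pipeline_pb2

# Load the pipeline config, shrink the input buffers, and write it back.
pipeline = pipeline_pb2.TrainEvalPipelineConfig()
with open("pipeline.config", "r") as f:
    text_format.Merge(f.read(), pipeline)

# Assumed fields: these are the large defaults the comment appears to refer to.
pipeline.train_input_reader.shuffle_buffer_size = 256  # default is 2048
pipeline.train_input_reader.queue_capacity = 200       # smaller placeholder value
pipeline.train_input_reader.min_after_dequeue = 100

with open("pipeline_small_buffers.config", "w") as f:
    f.write(text_format.MessageToString(pipeline))
```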
I see a similar increase in memory with dataset and custom estimator usage.
Can you post the config you are using?
Has anyone gotten around this out-of-memory error on a Raspberry Pi 4? I tried making the modifications above, but the process is still killed due to running out of memory. This is the error I get: Out of memory: Kill process 945 (train.py) score 569 or sacrifice child
This memory-growth issue also exists in TensorFlow 2.
I believe this bug, tensorflow/tensorflow#33516, is related. I've changed dataset_builder.py and the memory leak seems to be fixed.
Hello!
System information
Describe the problem
I use the Object Detection API to train different models (ssd_mobilenet_v1_fpn_shared_box_predictor_640x640_coco14_sync, ssd_resnet50_v1_fpn_shared_box_predictor_640x640_coco14_sync, faster_rcnn_resnet50_coco) on my own dataset. The Faster R-CNN model runs without problems, but the SSD models continuously increase RAM usage during training and finally hit OOM. There is a similar problem reported on Stack Overflow.
Source code / logs
NA