calling trace_attributes continuously is gradually accumulating memory #4667
Might very well be related to #4322. But yeah, we've heard this before. Kevin suggests it's a tile cache clear omission, but no time to look into that.
Hi @nilsnolde, thank you for the info. I ran my test with a debug build under Valgrind's massif tool and am attaching the output file for reference. My observation: the memory at the code location above is consistently increasing according to the massif visualisation. Possibly some memory control is required at this place? Let me know if this helps so I can continue debugging.
Yep, that is the place I expected. There is a secondary cache there that we use to find location candidates for the input coordinates, and I suspect we just don't ever clear it. @mkandulavm give me a sec and I'll throw up a branch you can test to see if it fixes it.
OK, it looks like cache clearing is properly implemented both for the command line tool and for the regular worker path, see:
- line 323 in 9338820
- valhalla/src/meili/map_matcher_factory.cc, line 88 in 9338820
- valhalla/valhalla/meili/candidate_search.h, lines 68 to 72 in 9338820
First question: if you set the grid cache size (valhalla/scripts/valhalla_build_config, line 169 in 9338820) to a lower value, does it fix the problem for you? Perhaps you are using the default and your machine is just not well matched to that?
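For reference, a minimal sketch of how that override could look when loading the config in C++. The meili.grid.cache_size and meili.grid.size key names are taken from the standard JSON config and the file path is illustrative; both may differ across Valhalla versions:

```cpp
#include <boost/property_tree/ptree.hpp>
#include <boost/property_tree/json_parser.hpp>

int main() {
  // Load the existing service config (the path here is just for illustration).
  boost::property_tree::ptree config;
  boost::property_tree::read_json("valhalla.json", config);

  // The candidate-search grid cache is bounded by an item count, not bytes,
  // so lowering it trades RAM for more cache misses during map matching.
  config.put("meili.grid.cache_size", 100);

  // meili.grid.size is a separate knob (discussed below); the experiment
  // suggested here is to leave it at its default and only shrink the cache.
  return 0;
}
```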
I will try with lower numbers, but shouldn't cache_size be in bytes? It seems it's being checked against the map size.
Yeah, it uses an item count to control the cache, so the units are not bytes. The number of bytes of RAM used is a function of the number of items, but it's non-trivial to count the exact number of bytes this uses. There are several other caches that are controlled by item count rather than bytes; the code is as intended.
OK. Just one more clarification: should I leave the grid size at 500? Will reducing it affect any computation? I will run the test and share the massif capture to verify.
Reducing the grid size will use less RAM, but it will make map matching slower.
The problem seems rectified at my end with grid.cache_size = 100.
Perhaps we inadvertently changed the size of the objects being cached there and thus invalidated the default. How much RAM does your system have? We typically run this on servers with 32 or 64 GB of RAM (in the cloud).
I am using a server with 64 GB RAM and ~250 GB of swap. As a rough estimate, memory consumption was up to ~15 GB per thread; I think with my 8 threads it was ~100 GB in total. Also, once the test was done this memory was not released, which possibly means that each map entry in the grid (and its associated references) represents a very large amount of memory. After setting grid.cache_size to 100, RAM consumption did not exceed ~3-4 GB per thread. All of the above was running only map matching (trace_attributes) calls. Again, this is a very high-level estimate; I can try writing some code to calculate the actual size if it helps, but it will take some time.
Jeez, that's still an awful lot, and more than most standard machine configs with 2 GB/thread. Will that be filled over time or pre-allocated?
I apologise that I could not make the time to estimate the memory. Currently I am experiencing a crash in routing after ~24 hours, with only this information in the log: [INFO] algorithm::bidirectional_a* I have reset grid.cache_size to 100240 to see if the crash has something to do with it. Will update again and try to close the issue if it's not related. Thank you.
A double free would be a bug, not usually an out-of-memory situation. Can you make a debug build and attach the stack trace here?
Yes, making the build and updating our server. Once I get the core dump and a stack trace, I will share it.
After using a debug build, I found the assert is failing at this location in graphtile.h: const NodeInfo* node(const GraphId& node) const { We are currently pre-processing the OSM data and stripping out data not related to routing, which has sped up graph tile generation. I am wondering if this is causing the issue, but I can't pinpoint it. Please note that the issue is not easily reproducible.
Yeah, that leaves a lot of room for imagination as to what could lead to the problem you are seeing. Basically, the problem is that there is an edge whose end node index is not found in your data. This could mean the node isn't there, or that the edge's data is corrupted. It's hard to say.
If the edges are corrupted, then the crash should be reproducible for the same request, right? It's possible the memory corruption happened somewhere else (in our own code), leading to the routing failure; this is typically the case with segfaults. I will update once I check. Thank you.
Hi, I got a stack trace after long testing. The system becomes unresponsive and eventually crashes with a dump (I was able to repeat this for 2 dumps and got similar stacks). The stack reflects the double free. This was run with graph tiles generated from unaltered Geofabrik OSM. The top of the stack: 0 0x000055bfad3b27c5 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (
UPDATE: I think the source of the problem is that I set auto_cleanup to true as a default when the actor_t object is created. To use the routing multi-threaded, I am creating multiple actors with a parent actor's graph reader passed into them, but for all these actors I am setting auto_cleanup to true.
If you want threads to share a GraphReader you need to use the synchronized tile cache option for GraphReader. What is happening is that when you call cleanup on one of the actors, it removes tiles from the GraphReader cache; if another thread is currently using one of those tiles when its memory is destroyed, all kinds of crazy stuff can occur.
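A rough sketch of the pattern being described: a single GraphReader shared by all actors, with the synchronized tile cache turned on. The actor_t overload that accepts an external reader and the mjolnir.global_synchronized_cache key reflect one reading of the code and may not match every Valhalla version:

```cpp
#include <boost/property_tree/ptree.hpp>
#include <boost/property_tree/json_parser.hpp>
#include <valhalla/baldr/graphreader.h>
#include <valhalla/tyr/actor.h>

int main() {
  boost::property_tree::ptree config;
  boost::property_tree::read_json("valhalla.json", config);  // path is illustrative

  // Assumption: this flag switches GraphReader to its thread-safe
  // (synchronized) tile cache; the exact key name may differ by version.
  config.put("mjolnir.global_synchronized_cache", true);

  // One reader shared by all worker threads.
  valhalla::baldr::GraphReader shared_reader(config.get_child("mjolnir"));

  // Each thread gets its own actor, all pointing at the shared reader.
  // With the synchronized cache, one actor's cleanup can no longer free a
  // tile that another thread is still reading.
  valhalla::tyr::actor_t actor(config, shared_reader, /*auto_cleanup=*/true);
  return 0;
}
```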
OK, will check with this. It makes sense now, but it was not trivial to know. Maybe add a comment or an assert?
This was helpful; the crash issue is fixed. I will close this thread as I am no longer having any issues related to this or the trace_attributes call.
Hi
I am facing an issue where trace_attributes calls gradually increase the RAM used over time (for the same map region, with each call made synchronously).
I have even set the auto_cleanup boolean to true for the actor object.
It seems the memory is not being released somewhere and is being held by the object. Is this issue known, or has any similar instance been reported?
Furthermore, when the calls are made asynchronously (where I use the parent actor's GraphReader for the cloned actors), I was getting a segmentation fault after some time if ENABLE_THREAD_SAFE_TILE_REF_COUNT is OFF.
When ENABLE_THREAD_SAFE_TILE_REF_COUNT is ON and I use more threads, the accumulation is much faster.
I am trying to find more information and will report back if possible.
Thank you.
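For context, a minimal sketch of the synchronous call pattern described above, assuming an actor_t constructor with an auto_cleanup flag and a string-in/string-out trace_attributes; the signatures, config path, and request shape are illustrative and may not match the installed version exactly:

```cpp
#include <exception>
#include <string>
#include <boost/property_tree/ptree.hpp>
#include <boost/property_tree/json_parser.hpp>
#include <valhalla/tyr/actor.h>

int main() {
  boost::property_tree::ptree config;
  boost::property_tree::read_json("valhalla.json", config);  // path is illustrative

  // auto_cleanup=true asks the actor to clear per-request state after each call.
  valhalla::tyr::actor_t actor(config, /*auto_cleanup=*/true);

  // The same trace_attributes request issued synchronously in a loop; the
  // shape coordinates here are placeholders, not the data from the report.
  const std::string request =
      R"({"costing":"auto","shape_match":"map_snap","shape":[)"
      R"({"lat":52.5200,"lon":13.4050},{"lat":52.5205,"lon":13.4060}]})";
  for (int i = 0; i < 100000; ++i) {
    try {
      actor.trace_attributes(request);
    } catch (const std::exception& e) {
      // A failed match throws; ignore it here, the point is the repeated calls.
    }
  }
  return 0;
}
```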