calling trace_attributes continuously is gradually accumulating memory #4667
Might very well be related to #4322. But yeah, we've heard this before. Kevin suggests it's a tile cache clear omission, but no time to look into that.
Hi @nilsnolde, thank you for the info. I ran my test with a debug build under Valgrind's massif tool and am attaching the output file for reference. My observation: the memory at the code location above is consistently increasing according to the massif visualisation. Possibly some memory control is required at this place? Let me know if this helps so I can continue debugging.
Yep, that is the place I expected. There is a secondary cache there that we use to find location candidates for the input coordinates, and I suspect we just don't ever clear it. @mkandulavm give me a sec and I'll throw up a branch you can test to see if it fixes it.
OK, it looks like cache clearing is properly implemented both for the command line tool and for the regular worker path, see:
- line 323 in 9338820
- valhalla/src/meili/map_matcher_factory.cc, line 88 in 9338820
- valhalla/valhalla/meili/candidate_search.h, lines 68 to 72 in 9338820
First question: if you set the grid cache size (valhalla/scripts/valhalla_build_config, line 169 in 9338820) to a lower value, does it fix the problem for you? Perhaps you are using the default and your machine is just not well matched to that?
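For reference, a minimal sketch of how that override could look when loading the config in C++. The meili.grid.cache_size and meili.grid.size key names are taken from the standard JSON config and the file path is illustrative; both may differ across Valhalla versions:

```cpp
#include <boost/property_tree/ptree.hpp>
#include <boost/property_tree/json_parser.hpp>

int main() {
  // Load the existing service config (the path here is just for illustration).
  boost::property_tree::ptree config;
  boost::property_tree::read_json("valhalla.json", config);

  // The candidate-search grid cache is bounded by an item count, not bytes,
  // so lowering it trades RAM for more cache misses during map matching.
  config.put("meili.grid.cache_size", 100);

  // meili.grid.size is a separate knob (discussed below); the experiment
  // suggested here is to leave it at its default and only shrink the cache.
  return 0;
}
```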
I will try with lower numbers, but shouldn't cache_size be in bytes? It seems it's being checked against the map size.
Yeah, it uses an item count to control the cache, so the units are not bytes. The number of bytes of RAM used is a function of the number of items, but it's non-trivial to count the exact number of bytes this uses. There are several other caches that are controlled by item count rather than bytes; the code is as intended.
OK. Just one more clarification: should I leave the grid size at 500? Will reducing it affect any computation? I will run the test and share the massif capture to verify.
Reducing the grid size will use less RAM, but it will make map matching slower.
The problem seems rectified at my end with grid.cache_size = 100.
Perhaps we inadvertently changed the size of the objects being cached there and thus invalidated the default. How much RAM does your system have? We typically run this on servers with 32 or 64 GB of RAM (in the cloud).
I am using a server with 64 GB RAM and ~250 GB of swap. As a rough estimate, memory consumption was up to ~15 GB per thread; I think with my 8 threads it was ~100 GB in total. Also, once the test was done this memory was not released, which possibly means that each map entry in the grid (and its associated references) represents a very large amount of memory. After setting grid.cache_size to 100, RAM consumption did not exceed ~3-4 GB per thread. All of the above was running only map matching (trace_attributes) calls. Again, this is a very high-level estimate; I can try writing some code to calculate the actual size if it helps, but it will take some time.
Jeez, that's still an awful lot, and more than most standard machine configs with 2 GB/thread. Will that be filled over time or pre-allocated?
I apologise that I could not make the time to estimate the memory. Currently I am experiencing a crash in routing after ~24 hours, with only this information in the log: [INFO] algorithm::bidirectional_a* I have reset grid.cache_size to 100240 to see if the crash has something to do with it. Will update again and try to close the issue if it's not related. Thank you.
A double free would be a bug, not usually an out-of-memory situation. Can you make a debug build and attach the stack trace here?
Yes, making the build and updating our server. Once I get the core dump and a stack trace, I will share it.
After using a debug build, I found the assert is failing at this location in graphtile.h: const NodeInfo* node(const GraphId& node) const { We are currently pre-processing the OSM data and stripping out data not related to routing, which has sped up graph tile generation. I am wondering if this is causing the issue, but I can't pinpoint it. Please note that the issue is not easily reproducible.
Yeah, that leaves a lot of room for imagination as to what could lead to the problem you are seeing. Basically, the problem is that there is an edge whose end node index is not found in your data. This could mean the node isn't there, or that the edge's data is corrupted. It's hard to say.
If the edges are corrupted, then the crash should be reproducible for the same request, right? It's possible the memory corruption happened somewhere else (in our own code), leading to the routing failure; this is typically the case with segfaults. I will update once I check. Thank you.
Hi, I got a stack trace after long testing. The system becomes unresponsive and eventually crashes with a dump (I was able to repeat this for 2 dumps and got similar stacks). The stack reflects the double free. This was run with graph tiles generated from unaltered Geofabrik OSM. The top of the stack: 0 0x000055bfad3b27c5 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (
UPDATE: I think the source of the problem is that I set auto_cleanup to true as a default when the actor_t object is created. To use the routing multi-threaded, I am creating multiple actors with a parent actor's graph reader passed into them, but for all these actors I am setting auto_cleanup to true.
If you want threads to share a GraphReader you need to use the synchronized tile cache option for GraphReader. What is happening is that when you call cleanup on one of the actors, it removes tiles from the GraphReader cache; if another thread is currently using one of those tiles when its memory is destroyed, all kinds of crazy stuff can occur.
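A rough sketch of the pattern being described: a single GraphReader shared by all actors, with the synchronized tile cache turned on. The actor_t overload that accepts an external reader and the mjolnir.global_synchronized_cache key reflect one reading of the code and may not match every Valhalla version:

```cpp
#include <boost/property_tree/ptree.hpp>
#include <boost/property_tree/json_parser.hpp>
#include <valhalla/baldr/graphreader.h>
#include <valhalla/tyr/actor.h>

int main() {
  boost::property_tree::ptree config;
  boost::property_tree::read_json("valhalla.json", config);  // path is illustrative

  // Assumption: this flag switches GraphReader to its thread-safe
  // (synchronized) tile cache; the exact key name may differ by version.
  config.put("mjolnir.global_synchronized_cache", true);

  // One reader shared by all worker threads.
  valhalla::baldr::GraphReader shared_reader(config.get_child("mjolnir"));

  // Each thread gets its own actor, all pointing at the shared reader.
  // With the synchronized cache, one actor's cleanup can no longer free a
  // tile that another thread is still reading.
  valhalla::tyr::actor_t actor(config, shared_reader, /*auto_cleanup=*/true);
  return 0;
}
```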
OK, will check with this. It makes sense now, but it was not trivial to know. Maybe add a comment or an assert?
This was helpful; the crash issue is fixed. I will close this thread as I am no longer having any issues related to this or the trace_attributes call.
Hi
I am facing an issue where trace_attributes calls gradually increase the RAM used over time (for the same map region, with each call made synchronously).
I have even set the auto_cleanup boolean to true for the actor object.
It seems the memory is not being released somewhere and is being held by the object. Is this issue known, or has any similar instance been reported?
Furthermore, when the calls are made asynchronously (where I use the parent actor's GraphReader for the cloned actors), I was getting a segmentation fault after some time if ENABLE_THREAD_SAFE_TILE_REF_COUNT is OFF.
When ENABLE_THREAD_SAFE_TILE_REF_COUNT is ON and I use more threads, the accumulation is much faster.
I am trying to find more information and will report back if possible.
Thank you.
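For context, a minimal sketch of the synchronous call pattern described above, assuming an actor_t constructor with an auto_cleanup flag and a string-in/string-out trace_attributes; the signatures, config path, and request shape are illustrative and may not match the installed version exactly:

```cpp
#include <exception>
#include <string>
#include <boost/property_tree/ptree.hpp>
#include <boost/property_tree/json_parser.hpp>
#include <valhalla/tyr/actor.h>

int main() {
  boost::property_tree::ptree config;
  boost::property_tree::read_json("valhalla.json", config);  // path is illustrative

  // auto_cleanup=true asks the actor to clear per-request state after each call.
  valhalla::tyr::actor_t actor(config, /*auto_cleanup=*/true);

  // The same trace_attributes request issued synchronously in a loop; the
  // shape coordinates here are placeholders, not the data from the report.
  const std::string request =
      R"({"costing":"auto","shape_match":"map_snap","shape":[)"
      R"({"lat":52.5200,"lon":13.4050},{"lat":52.5205,"lon":13.4060}]})";
  for (int i = 0; i < 100000; ++i) {
    try {
      actor.trace_attributes(request);
    } catch (const std::exception& e) {
      // A failed match throws; ignore it here, the point is the repeated calls.
    }
  }
  return 0;
}
```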