
Caching Issues

Michiel's "real-life" 3C343 trees have grown tall enough to start pushing against the RAM ceiling, at least in the extreme case of the entire MS being solved for at once. A major reason for this is the node result cache: up to now, all nodes have been caching and holding their results for eternity, which is great for tree debugging, but rather heavy on memory use. In order to relieve the pressure, I have implemented a more efficient caching strategy.

Caching can be considerably optimized at no cost to performance, by taking into account two simple considerations (note the locality!):

  • Rule 1: If the next request is bound to need recalculating (as is the case, e.g., for all nodes that depend on a solvable parameter, when the next request is another solve iteration), holding cache is useless beyond the point at which all parents have collected the result.

  • Rule 2: If a parent node holds cache, child caches are not needed.

Following these two simple rules (and incorporating a next_request_id field into the Request object), the amount of result caching can be lowered rather dramatically in most cases; an illustrative sketch of the two rules follows the list below. Note that for some nodes you may want to override this behaviour, e.g.:

  • Root data collector nodes need to cache always, otherwise there will be nothing for the browser to visualize.

  • During tree debugging, it may be very useful to have selected nodes hold cache, so that their results may be examined from the browser.

  • You may want to explicitly disable caching at some nodes, to reduce RAM use even further.
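
To make the two rules above concrete, here is a small illustrative glish sketch. It is not actual kernel code: the function and its boolean arguments are hypothetical stand-ins for what the kernel tracks internally (the next_request_id comparison and the parent bookkeeping).

# glish sketch (illustrative only, not kernel code)
should_hold_cache := function (parent_holds_cache, all_parents_collected, next_request_differs)
{
  # Rule 2: a caching parent makes the child's cache redundant
  if (parent_holds_cache) return F;
  # Rule 1: once every parent has collected the result, keep the cache
  # only if the next request could still be served from it
  if (all_parents_collected && next_request_differs) return F;
  return T;
}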

Cache Policy

Cache policy is now selectable via forest and node state. The cache_policy field of forest state defines the default forest-wide caching policy. This is a single integer, higher values generally meaning more caching, as follows:

  • -10 ("never"): no caching at all, result is recalculated for every request and every parent. -1 ("minimal"): cache is held temporarily until all parents have collected the result. 0 or 1 ("smart"): smart caching, the two rules above are followed to hold cache only where needed. This is the default setting. 20 ("always"): cache always held.
# glish example
mqs.meq('Set.Forest.State',[state=[cache_policy=20]]);

The caching policy may be overridden on a node-by-node basis, by setting the cache_policy field of node state (or the node init-record). A value of 0 (the default) causes the node to follow the forest policy, while one of the values above overrides the forest policy. Note that "smart-caching" nodes will automatically adjust to parents or children whose caching policy has been overridden, and will continue to cache in an optimal way.
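
For example, a node could be forced to cache always (useful for a root data collector, or for a node you want to inspect in the browser), or to never cache at all. The 'Node.Set.State' command name below is assumed by analogy with the 'Set.Forest.State' call above, and the node names are placeholders:

# glish example (command name and node names are assumptions)
mqs.meq('Node.Set.State',[name='collect',state=[cache_policy=20]]);    # always cache this node
mqs.meq('Node.Set.State',[name='big_node',state=[cache_policy=-10]]);  # never cache this node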

NB: the default caching policy of the Spigot node is "never", since it effectively has its own internal cache of tiles.

The "minimal" and "smart" caching policies have one weakness that is triggered by the use of ReqSeq nodes: they assume that all parents will need to collect a result, and retain cache accordingly. A sequencer node may then induce "overcaching syndrome" in some of its children. The current band-aid solution to this (until I think of something better) is to set cache_num_active_parents in node state to the number of "actually active" parents. This is a relatively minor issue affecting only a few nodes (e.g., two per baseline, in the solve343.g case discussed below), since the worst that can happen is slight over- or under-caching.

Visualizing & Profiling

If you enable the tree debugger in the browser, cache will be represented by blue icons in the "(R)" column of the tree view (press the blue "(i)" button in the toolbar for a quick reference of what each icon means). Open a few branches all the way down to get a feel for what is being cached and where.

A sub-record of node state called cache_stats collects caching statistics. The fields all_requests and new_requests contain two vectors of counters (the difference between "all requests" and "new requests" is usually due to multiple parents). The elements of each vector have the following meaning:

  • 0: number of requests;
  • 1: cache hits (i.e. the request was served from cache);
  • 2: cache misses (i.e. a wrong result was in cache -- if smart caching is in effect and this is not 0, then something is wrong!);
  • 3: cache was empty;
  • 4: number of times the result of a request was cached;
  • 5: number of times the result was long-cached (i.e. retained beyond the last parent).

The cache_stats.parents vector shows the current cache situation w.r.t. the node's parents. The numbers are: total known parents, number of active parents, how many parents have signed off on our cache, and how many of those asked us to retain it. This is used by smart caching to figure out how long to retain cache; a sketch of how these fields might be examined follows below.
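
For example, the statistics could be pulled up from glish along the following lines. This assumes a 'Node.Get.State' command (analogous to the 'Set.Forest.State' call above) and that the third argument to mqs.meq() asks for the reply record to be returned; both are assumptions, and the node name is a placeholder.

# glish example (command name and call signature are assumptions)
st := mqs.meq('Node.Get.State',[name='phasedft.src.s1.s2'],T);
print st.cache_stats.all_requests;   # the six counters described above
print st.cache_stats.new_requests;
print st.cache_stats.parents;        # total / active / signed-off / retain-requested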

Last but certainly not least, the profiling_stats sub-record collects run-time profiling information. The following info is collected: total for time spent inside this node and its children, children for time spent in children only, and getresult for time spent actually executing the node's getResult() function (the latter two almost, but not quite, add up to the total time -- the remainder is the node housekeeping overhead). For each measurement, three values are reported: cumulative time, number of executions, and average time. Times are given in milliseconds.
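
Under the same assumptions as the previous sketch, the profiling numbers could be examined in the same way; each field holds the triple of cumulative time, execution count and average time, in milliseconds.

# glish example (same Node.Get.State assumptions as above)
st := mqs.meq('Node.Get.State',[name='phasedft.src.s1.s2'],T);
print st.profiling_stats.total;      # this node plus its children
print st.profiling_stats.children;   # children only
print st.profiling_stats.getresult;  # the node's own getResult() call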

Some tests with solve343.g

A short report on a real-life test. This is derived from Michiel's older scalar-ME script. Note that memory usage always tops out after one iteration, once the entire predict tree has been evaluated. Here are some memory footprints, as shown by the "ps u" shell command (VSZ and RSS values, in kB). As usual, this is the "full Monty" case of having one snippet equal to the entire MS (~1500 timeslots); using smaller tiling in time will lower the footprint proportionally.

Baseline footprint: 1081540 985712. This is when all spigots/sinks are loaded with data, but nothing has been executed yet. No easy way to lower this, since this is driven primarily by MS size.

Now, solving for two scalar fluxes. After one iteration we have the following footprints. Worst case: 2678300 2582432. This is with everything cached (forest cache_policy=20).

"Naive" smart caching 2124508 2021960 : (forest cache_policy=1)

One somewhat sub-optimal thing remains here: the presence of the ReqSeq causes the xx.s1.s2 and predict.s1.s2 nodes to "overcache" (see above). This is easily eliminated by specifying cache_num_active_parents=1 on each. We then arrive at a "smarter" footprint: 1771812 1676372. Cache is now retained only at the phasedft.src.s1.s2, observed_i.s1.s2, and n.src nodes.

Another easy optimization is available. If you look at the profiling stats of the phasedft.src.s1.s2 nodes, their actual re-execution time is relatively small. Setting their cache_policy to "minimal" (see the sketch below) causes a slight performance hit, but moves the effective caching to their children, which are per-station rather than per-baseline. This is worth doing if you're approaching the memory limit. "Optimal" footprint: 1608292 1513140. Note that at this point there is somewhat wasteful caching at the spigot nodes, which could be eliminated with a slight rework of MeqSpigot; this would free up another ~200 MB.
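
A sketch of that per-node setting, again assuming the 'Node.Set.State' command from above (in practice it would be repeated for each per-baseline phasedft node):

# glish example (Node.Set.State is an assumption, see above)
mqs.meq('Node.Set.State',[name='phasedft.src.s1.s2',state=[cache_policy=-1]]);  # "minimal"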

Now, how does this scale with the number of solvables? Solving for two 2x3 fluxes (12 solvables), all timeslots, was previously impossible, as the thing ran out of memory and crashed. Now it works; after one iteration we have an "optimal" footprint of 2762340 2665928. For comparison, the "smarter" footprint is 2927800 2829128.
