Reachability Analysis #669

dgrove-oss · 2021-10-01T15:07:21Z

dgrove-oss
Oct 1, 2021
Maintainer

Problem Statement

Starting from a main class and perhaps an optional list of additional entry points (for example methods invoked reflectively), we need to determine the reachable program entities (types, methods, fields, etc.) that should be included in the final native executable.

General Approach

The general approach to solving this problem is to optimistically assume nothing is reachable except the entrypoints and then iteratively add new entities as we discover they are reachable. Eventually this process reaches a fixed point (nothing else is added). This worklist-driven approach to discovering the reachable program is built in to qbicc's top-level compilation driver. A key piece of this reachability analysis is processing reachable dynamic method invocations (invokeinterface, invokevirtual, invokedynamic) and deciding what new methods each processed callsite makes reachable. Such a callsite may invoke multiple potential callee methods, and the set of possible callees depends on the runtime values that reach it. There have been hundreds of research papers on this general topic (control flow analysis, call graph construction, class analysis, pointer analysis, etc.).
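As a rough illustration of the worklist-to-fixed-point idea (not qbicc's actual driver), here is a minimal sketch. The class name `WorklistReachability`, the string method names, and the `calls` map are all hypothetical, and every call is treated as having statically known callees, which sidesteps the dynamic dispatch problem described above.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical illustration only; this is not qbicc's compilation driver. */
final class WorklistReachability {
    /**
     * Computes the methods reachable from the entry points.
     * The map gives, for each method name, the methods it may invoke directly.
     */
    static Set<String> reachable(Map<String, List<String>> calls, List<String> entryPoints) {
        // Optimistic start: nothing is reachable except the entry points.
        Set<String> reached = new LinkedHashSet<>();
        Deque<String> worklist = new ArrayDeque<>();
        for (String entry : entryPoints) {
            if (reached.add(entry)) {
                worklist.add(entry);
            }
        }
        // Process each newly discovered method exactly once.
        while (!worklist.isEmpty()) {
            String method = worklist.remove();
            for (String callee : calls.getOrDefault(method, List.of())) {
                if (reached.add(callee)) {
                    worklist.add(callee);
                }
            }
        }
        // Fixed point: the worklist is empty, so nothing else can be added.
        return reached;
    }
}
```

Each method is enqueued at most once, so the loop terminates as soon as no new entity is discovered.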

Suitable Call Graph Discovery Algorithms for Qbicc

Although there may be scenarios in which the user is willing to spend significant compile time producing a highly optimized native executable, we expect that the majority of compilations will need to use relatively inexpensive call graph construction algorithms. In particular, we are looking for algorithms that have time and space complexities that are roughly linear as a function of program size (and since all Java programs include some part of the JDK, even trivial programs are fairly large).

I've attached a copy of "Fast Interprocedural Class Analysis" from POPL'98 (FastInterproceduralClassAnalysis-POPL98.pdf). We tried to systematically look at some of the design possibilities for this part of the algorithm space.

Our initial implementation in qbicc's reachability plugin is Rapid Type Analysis (RTA). It has linear time and space complexity and is very simple to implement, but it is imprecise. In effect, RTA maintains a single set of reachable types that it uses to compute the possible method receivers at every callsite.
I think that "Bounded Linear-Edge OO 0-CFA" is implementable within the reachability plugin without needing to change the overall qbicc compiler structure. The algorithm is more complex, but still near linear time. The precision gain vs. RTA is that this algorithm starts with a distinct set of reachable types for each field/method in the program, tracks the flow of classes between them by adding edges as it analyses the program, and then propagates reachability along those edges. Near linear time is maintained by bounding the number of times information can be propagated along an edge before the edge is collapsed.
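To make the "single set of reachable types" idea concrete, here is a small RTA-style sketch over a toy program model. Everything in it (`RtaSketch`, `Method`, the string-based class hierarchy) is hypothetical and unrelated to the real reachability plugin. It favors clarity over the linear complexity bound: when a new type is instantiated, it simply revisits every call site seen so far.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Toy RTA over a hypothetical program model; not the qbicc implementation. */
final class RtaSketch {
    static final class Method {
        final String owner;
        final String name;
        final List<String> instantiates = new ArrayList<>();   // types allocated with "new"
        final List<String[]> virtualCalls = new ArrayList<>(); // {static receiver type, method name}
        Method(String owner, String name) { this.owner = owner; this.name = name; }
        String id() { return owner + "." + name; }
    }

    final Map<String, String> superOf = new HashMap<>();  // class -> direct superclass
    final Map<String, Method> methods = new HashMap<>();  // "Owner.name" -> declaration

    Method declare(String owner, String name) {
        Method m = new Method(owner, name);
        methods.put(m.id(), m);
        return m;
    }

    private boolean isSubtype(String type, String ancestor) {
        for (String c = type; c != null; c = superOf.get(c)) {
            if (c.equals(ancestor)) return true;
        }
        return false;
    }

    /** Finds the implementation a receiver of the given type would run. */
    private Method resolve(String type, String name) {
        for (String c = type; c != null; c = superOf.get(c)) {
            Method m = methods.get(c + "." + name);
            if (m != null) return m;
        }
        return null;
    }

    Set<String> analyze(Method entry) {
        Set<String> liveTypes = new LinkedHashSet<>(); // the single, program-wide type set
        Set<String> reached = new LinkedHashSet<>();
        List<String[]> seenCalls = new ArrayList<>();
        Deque<Method> worklist = new ArrayDeque<>();
        reached.add(entry.id());
        worklist.add(entry);
        while (!worklist.isEmpty()) {
            Method m = worklist.remove();
            for (String type : m.instantiates) {
                if (liveTypes.add(type)) {
                    // A new live type may add targets to call sites already processed.
                    for (String[] call : seenCalls) dispatch(call, type, reached, worklist);
                }
            }
            for (String[] call : m.virtualCalls) {
                seenCalls.add(call);
                for (String type : liveTypes) dispatch(call, type, reached, worklist);
            }
        }
        return reached;
    }

    private void dispatch(String[] call, String type, Set<String> reached, Deque<Method> worklist) {
        if (!isSubtype(type, call[0])) return;
        Method target = resolve(type, call[1]);
        if (target != null && reached.add(target.id())) worklist.add(target);
    }
}
```

The imprecision is visible in `liveTypes`: a type instantiated anywhere in the reachable program is considered a possible receiver at every call site whose static type admits it, regardless of whether an instance could actually flow there. A per-field/per-method type set, as in the second algorithm above, is what removes that conflation.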

dmlloyd · 2021-10-04T13:59:52Z

dmlloyd
Oct 4, 2021
Maintainer

I'll contribute a little practical information about how the compiler implementation plays into any algorithm we might select.

Compilation presently uses the worklist to pass program elements between four phases: ADD, ANALYZE, LOWER, and GENERATE. The first phase, ADD, is the only phase in which it is permitted (generally speaking) to add new elements to the overall program; this is also the phase where program elements may be interpreted for initialization purposes. The remaining phases remove unreachable elements by copying the program elements repeatedly, passing them through a few different simplifying optimizations which (among other things) remove unreachable code paths. Therefore there are multiple opportunities to reduce the size of the reachable program.

Gathering information can happen in three ways:

Locally (implementing BBB and gathering information or optimizing based only on what information is immediately available)
- Note that gathering information here is pessimistic, because code paths that are built might not actually be reachable by the time the element is fully constructed, due to optimizations
Whole-element (using a visitor or other means to analyze each element after it is completely created)
Whole-program (using a post-hook, or pre-hook of the next phase, to examine the entire program between phases)

During ADD, no previous information exists, so each approach to information gathering may rely only on what was gathered by the preceding approaches. Local optimizations in phases after ADD, however, may need to rely on whole-element or whole-program information. Generally speaking, this information is not available unless each particular plugin takes measures to gather it by coordinating across phases. This is done on an ad-hoc basis, with different modules "owning" different bits of information, which occasionally results in trouble in the form of circular dependencies between modules.

It would be beneficial (not only to reachability analysis but also to escape analysis, dead code elimination, inlining, memory model optimizations, etc.) to provide more general facilities for tracking at least the following information:

Which fields were read and/or written in the previous phase, with a link to the node(s)
Which local variables were read and/or written in the previous phase
Information presently contained in/consumed by RTA such as:
- The set of methods called (and the type of invocation) in the previous phase, with a link to the call site(s)
- The set of reached new invocations and heap objects found in the previous phase, with links to the node(s)

This could still be done via a plugin, but it would be just one plugin that has no dependencies and simply tracks the raw information, so that other plugins, including the call analysis plugin(s), could depend on it.
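A dependency-free tracker of this kind might look roughly like the following. This is only a sketch under assumptions: `UsageTracker`, its method names, and the use of strings for fields and nodes are invented for illustration, only field accesses are shown, and it is not thread-safe.

```java
import java.util.ArrayList;
import java.util.EnumMap;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical raw-information tracker; names and shapes are assumptions. */
final class UsageTracker {
    enum Access { READ, WRITE }

    // field -> access kind -> nodes that performed the access
    private Map<String, Map<Access, List<String>>> current = new HashMap<>();
    private Map<String, Map<Access, List<String>>> previous = new HashMap<>();

    /** Called while a phase is building or copying program elements. */
    void recordFieldAccess(String field, Access access, String nodeId) {
        current.computeIfAbsent(field, f -> new EnumMap<>(Access.class))
               .computeIfAbsent(access, a -> new ArrayList<>())
               .add(nodeId);
    }

    /** Called between phases: what was just gathered becomes "the previous phase". */
    void endPhase() {
        previous = current;
        current = new HashMap<>();
    }

    /** Nodes that accessed the field in the previous phase (empty if none). */
    List<String> sitesInPreviousPhase(String field, Access access) {
        Map<Access, List<String>> byAccess = previous.get(field);
        if (byAccess == null) return List.of();
        List<String> sites = byAccess.get(access);
        return sites == null ? List.of() : sites;
    }

    boolean wasWrittenInPreviousPhase(String field) {
        return !sitesInPreviousPhase(field, Access.WRITE).isEmpty();
    }
}
```

A consumer such as dead code elimination could then ask `wasWrittenInPreviousPhase("Foo.bar")` without knowing which plugin gathered the data. Method invocations and allocation sites could be recorded in the same way.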

The other suggestion I wanted to make is that we can run almost any given analysis after more than one phase, and thereby get a better result. Any analysis that may result in a reduction of the program size is not only beneficial for that reason but also for the practical reason of speeding up the LLVM stage of compilation.

Finally I wanted to mention that there's a cleaner version of "Fast Interprocedural Class Analysis" here (it's already on the big reading list).

2 replies

dgrove-oss Mar 8, 2022
Maintainer Author

I've been thinking again about trying to add a second implementation of Reachability Analysis. I'm not sure what the right implementation strategy is for fitting the typical algorithmic control flow structure of these analyses into the overall qbicc compilation structure.

Reachability doesn't fit into the nice pattern that Escape Analysis does where we have a BasicBlockBuilder that constructs a connection graph during a phase followed by a postHook that does the propagation along that graph after the phase. This is because in Reachability Analysis, the global propagation is how we discover the reachable methods that we need to process in the current phase and thus the global propagation can't wait for a post hook.

The overall algorithmic structure is an alternation between analyzing a newly reachable method to add new edges/nodes to a global graph structure that represents the entire program (good fit for a BasicBlockBuilder) and propagating information along the nodes/edges of that graph until they reach a fixed point (not really related to the primary task of the BBB). The current RTA implementation effectively "hijacks" the thread that is executing the BasicBlockBuilder to do the very limited amount of global propagation that the RTA algorithm requires. This works because the global propagation done by RTA is trivial. In more complex algorithms, the amount of propagation work done for a visited Node is unpredictable and may be quite significant.

I can see three options and I wonder which of these is best (and if there is a fourth I am not seeing):

1. Keep the current overall control structure of ReachabilityAnalysis and just accept that we may end up doing a lot of propagation work as a result of processing a single Node in a BasicBlockBuilder. It is very likely that much of this propagation happens "not at the end" of a Phase, so we don't run much of a risk of work imbalance adding to compilation time.
2. Have a work queue of tasks to do during a phase which will handle the propagation work of ReachabilityAnalysis (and potentially of other similar global analyses we need to run during a phase as we later add them...). Coordinating this with the primary work queue to ensure that a Phase isn't considered complete until both queues are empty would take a little new logic in Driver.
3. Extend the main work queue to enable it to contain different kinds of work items (i.e., instead of enqueuing an ExecutableElement and invoking a lambda on it, enqueue a lambda that encapsulates the work item).

I'm leaning to option 1 as the most contained, but can also see 3 or 2 as being attractive and not that much more work if we think we have other scenarios where we may want to have a more general work queue of tasks to do within a Phase.
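For discussion, option 3 could be sketched as a queue of `Runnable` tasks with the element-style enqueue layered on top. This is a simplified, single-threaded sketch: `PhaseQueue` and its methods are hypothetical and do not reflect how Driver is actually structured.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;
import java.util.function.Consumer;

/** Hypothetical single-threaded sketch of a general per-phase task queue. */
final class PhaseQueue {
    private final Deque<Runnable> tasks = new ArrayDeque<>();
    private final Set<Object> enqueuedElements = new HashSet<>();

    /** Any kind of work item, e.g. a propagation step of a global analysis. */
    void enqueueTask(Runnable task) {
        tasks.add(task);
    }

    /** Element-style enqueue expressed as a task; each element is handled at most once. */
    <E> boolean enqueueElement(E element, Consumer<E> handler) {
        if (!enqueuedElements.add(element)) {
            return false; // already queued or processed in this phase
        }
        enqueueTask(() -> handler.accept(element));
        return true;
    }

    /** Runs until the queue is drained; tasks may enqueue further tasks. Returns the count run. */
    int runPhase() {
        int executed = 0;
        Runnable task;
        while ((task = tasks.poll()) != null) {
            task.run();
            executed++;
        }
        return executed;
    }
}
```

With a single queue, "the phase is complete" remains the simple condition "the queue is empty", which is the coordination that option 2 would need extra logic for.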

dmlloyd Mar 11, 2022
Maintainer

I've been thinking about this a lot since you posted it. I have some (potentially disconnected) thoughts.

In general terms, a concurrent work queue could cause problems in two different ways. The first way, which you mention, is when a queue task is outsized and causes task starvation (most likely due, as you say, to being at the end of the phase, but also potentially due to long-running computations which yield work items only at the end). The other (opposite) potential problem would be some throughput loss due to excessive context switching caused by too-small tasks; for example when a modestly-sized task (say, performing a small computation on each item of a fairly long list) is broken into very small individual work list items where the context switch may be more expensive than the actual task.

Given that compile performance is important (especially to the development cycle use cases), I think it is important to ensure that the selected solution is reasonable in that regard and that we can balance the number of tasks with their size.
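One common way to balance task count against task size is to batch per-item work into chunks, so that each queue entry amortizes its scheduling cost over several items. The helper below is a hypothetical sketch of that idea (the name `Batching` and the choice of batch size are assumptions, and no claim is made about what sizes suit qbicc).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical helper that groups per-item work into coarser tasks. */
final class Batching {
    /** Splits items into chunks of at most batchSize and wraps each chunk in one task. */
    static <T> List<Runnable> batch(List<T> items, int batchSize, Consumer<T> action) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<Runnable> tasks = new ArrayList<>();
        for (int start = 0; start < items.size(); start += batchSize) {
            int end = Math.min(start + batchSize, items.size());
            // Copy the chunk so later changes to the source list cannot affect the task.
            List<T> chunk = new ArrayList<>(items.subList(start, end));
            tasks.add(() -> chunk.forEach(action));
        }
        return tasks;
    }
}
```

A batch size of 1 reproduces the "too-small tasks" extreme, and a batch size of `items.size()` reproduces the single outsized task.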

Okay, next thought. :)

Since we're already essentially doing option 1, the question is really about entertaining options 2 and 3. I will be honest and say that I don't much like the idea of multiple work queues. I think this could add quite a lot of specialized complexity and create new opportunities for bugs to crop up - particularly concurrency-related bugs, my least-favorite type. So I'm a bit "cool" on option 2. Option 3 is quite interesting to me though.

I've been picking away at the "facts" API in spare moments for quite a while now, so I am really looking at things through that lens lately. It seems to me that the current work queue really is itself a simplified expression of reachability. When we determine that an executable member is reachable, that fact is recorded by adding it to the queued set and, if the item was not previously enqueued, this produces the action of processing it via the element handler chain by way of the work queue processor.

It seems to me that this fact - I guess I'll just call it "direct reachability" - is really just one of many facts that we consider as a part of reachability analysis and really effectively duplicates the logical (if not physical) state that we're already recording there. Further, it seems to me that we also care about reachability of other things - at a minimum, field access and literal objects - at many points during compilation. So an "option 3" seems like a natural companion to the facts API.

I imagine that, in the presence of a working facts API, the element queue could be trivially replaced like this:

Introduce the general work queue, e.g. ctxt.enqueueTask(someRunnable); the mechanics would be the same as the current element queue in that once the queue is fully drained, all work is considered completed
Operations which currently call ctxt.enqueue(element) would instead call e.g. facts.discover(element, DIRECTLY_REACHABLE)
The current action of actually adding the work item to the queue would be replaced by a registered action, something like this one-time code (probably registered from Driver): facts.registerAction(element, DIRECTLY_REACHABLE, e -> ctxt.enqueueTask(() -> elementHandlers.handle(e)))
The element-based work queue is then eliminated completely

Then the door is open to using the work queue for traversing object literals, fields, etc. in the same way. I expect that analysis code would not even need any direct awareness of the work queue, other than registering actions to be taken when facts are first discovered (though the actions might in turn enqueue work items in some cases). This would eliminate the ad-hoc tests we have now, where there are multiple different combinations of states which can trigger a particular action and thus have to be tested at all the sites. Instead, it's just facts.discover(typeDef, IS_INSTANTIATED) or something like that - the action is encapsulated and completely hidden from the discovery.
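To illustrate the shape of this idea, here is a minimal sketch of "discover a fact once, run registered actions on first discovery". It is not the real facts API: the `Facts` class, the `Fact` enum, and the two-argument `registerAction` signature are simplifying assumptions that differ from the three-argument call quoted above, and the sketch is single-threaded.

```java
import java.util.ArrayList;
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

/** Hypothetical, single-threaded sketch; not the actual facts API. */
final class Facts<E> {
    enum Fact { DIRECTLY_REACHABLE, IS_INSTANTIATED }

    private final Map<E, EnumSet<Fact>> known = new HashMap<>();
    private final Map<Fact, List<Consumer<E>>> actions = new EnumMap<>(Fact.class);

    /** Registers an action to run the first time the fact is discovered for any element. */
    void registerAction(Fact fact, Consumer<E> action) {
        actions.computeIfAbsent(fact, f -> new ArrayList<>()).add(action);
    }

    /** Records the fact; returns true and fires actions only on first discovery. */
    boolean discover(E element, Fact fact) {
        boolean isNew = known.computeIfAbsent(element, e -> EnumSet.noneOf(Fact.class)).add(fact);
        if (isNew) {
            for (Consumer<E> action : actions.getOrDefault(fact, List.of())) {
                action.accept(element);
            }
        }
        return isNew;
    }

    boolean has(E element, Fact fact) {
        EnumSet<Fact> facts = known.get(element);
        return facts != null && facts.contains(fact);
    }
}
```

Wiring it to a task queue would then be one-time setup code, for example `facts.registerAction(Fact.DIRECTLY_REACHABLE, e -> queue.add(() -> handle(e)))`, where `queue` and `handle` are placeholders. Discovery sites only call `discover` and never test state combinations themselves.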
