
Push-based query invalidation #41

Open

matklad opened this Issue Oct 7, 2018 · 3 comments
matklad commented Oct 7, 2018

Currently, applying any change advances the global revision and "marks" all queries as potentially invalid (because their `verified_at` becomes smaller than the current revision). The "marking" process is O(1) (we only bump one atomic usize), but subsequent validation can be O(N). Validation is on-demand (pull-based) and cheap, because we don't execute queries and just compare hashes, but it still has to walk large parts of the query graph.
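To make the scheme concrete, here is a minimal sketch of the mechanism described above (the names `QueryCell` and `apply_change` are made up for illustration; this is not salsa's actual API): one global atomic revision counter, and per-query `verified_at` stamps that are compared lazily at query time.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// One global revision; bumping it is the O(1) "marking" step.
static CURRENT_REVISION: AtomicUsize = AtomicUsize::new(1);

struct QueryCell {
    verified_at: usize, // revision at which this query was last validated
    value_hash: u64,    // used for the cheap hash comparison during validation
}

impl QueryCell {
    // Pull-based check: the query is trusted only if it was verified at the
    // current revision; otherwise its inputs must be re-checked on demand.
    fn needs_validation(&self) -> bool {
        self.verified_at < CURRENT_REVISION.load(Ordering::SeqCst)
    }
}

// Applying any change is O(1): bump one atomic counter.
fn apply_change() {
    CURRENT_REVISION.fetch_add(1, Ordering::SeqCst);
}

fn main() {
    let q = QueryCell { verified_at: 1, value_hash: 0 };
    let _ = q.value_hash;
    apply_change(); // every existing query is now *potentially* invalid
    assert!(q.needs_validation());
}
```

The O(N) cost hides in what `needs_validation() == true` triggers: walking the query's inputs to find out whether anything actually changed.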

Here are two examples where we might want something more efficient.

function body modification

In IDEs, a lot of typing happens inside a single function body. At first sight, this seems like it could be handled very efficiently: changing a body can't (in most cases) have any global effects, so we should be able to recompute types inside the function's body very cheaply. However, in the current setup, types inside a function indirectly depend on all source files (because, to infer types, you need to know all of the impls, which might be anywhere), so, after a change to a function, we need to check that all inputs are up to date, and this is O(N) in the number of files in the project.

unrelated crates modification

Suppose we have two independent crates, A and B, and a third crate C, which depends on A and B. Ideally, any modification of A should invalidate queries for C, but all queries for B should remain fresh. In the current setup, because there's a single global revision counter, we will have to re-validate B's queries anyway.

It seems like we could achieve better big-O here, if we move validation work from query time (pull) to modification time (push).

In the second example, for instance, we can define a per-crate revision counter: the sum of the revisions of the crates it depends on, plus the revision of the crate's own source. During modification, we'll have to update the revisions of all reverse-dependencies of the modified crate, but, during a query, we can skip validation altogether if the crate-local revision hasn't changed.
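A toy model of that per-crate counter (all names here are invented for the sketch; for simplicity it derives a crate's revision on demand rather than eagerly pushing to reverse-dependencies, which is equivalent for the A/B/C example):

```rust
use std::collections::HashMap;

type CrateId = &'static str;

struct CrateGraph {
    source_rev: HashMap<CrateId, u64>,  // revision of each crate's own source
    deps: HashMap<CrateId, Vec<CrateId>>, // direct dependencies
}

impl CrateGraph {
    // Per-crate revision = own source revision + sum of dependencies' revisions.
    fn crate_revision(&self, krate: CrateId) -> u64 {
        let own = self.source_rev[krate];
        let dep_sum: u64 = self.deps[krate]
            .iter()
            .map(|&d| self.crate_revision(d))
            .sum();
        own + dep_sum
    }

    // Editing a crate bumps its own source revision; revisions of its
    // reverse-dependencies change as a consequence.
    fn modify(&mut self, krate: CrateId) {
        *self.source_rev.get_mut(krate).unwrap() += 1;
    }
}

fn main() {
    let mut g = CrateGraph {
        source_rev: [("A", 1), ("B", 1), ("C", 1)].into_iter().collect(),
        deps: [("A", vec![]), ("B", vec![]), ("C", vec!["A", "B"])]
            .into_iter()
            .collect(),
    };
    let b_before = g.crate_revision("B");
    let c_before = g.crate_revision("C");
    g.modify("A");
    assert_eq!(g.crate_revision("B"), b_before); // B's queries stay fresh
    assert!(g.crate_revision("C") > c_before);   // C must revalidate
}
```

A query against B compares only B's crate-local revision, so modifying A costs nothing for B.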

In the first example, we can define a "set of item declarations" query whose results are not affected by modifications of function bodies. As the "set of all items" is a union of per-file item sets (more or less; macros complicate everything), we should be able to update it when a single file is modified without checking that all other files are fresh.
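A sketch of that idea, assuming a deliberately crude stand-in for a real parser (the helper names are hypothetical): the union is rebuilt from cached per-file sets, and an edit inside a function body leaves its file's item set, and hence the union, unchanged.

```rust
use std::collections::{BTreeSet, HashMap};

// Crude stand-in for a real parser: treat `fn name` headers as item
// declarations and ignore bodies entirely.
fn items_in_file(text: &str) -> BTreeSet<String> {
    text.lines()
        .filter_map(|l| l.trim().strip_prefix("fn "))
        .filter_map(|rest| rest.split('(').next())
        .map(|name| name.trim().to_string())
        .collect()
}

// "Set of all items" = union of the per-file item sets.
fn all_items(per_file: &HashMap<&str, BTreeSet<String>>) -> BTreeSet<String> {
    per_file.values().flatten().cloned().collect()
}

fn main() {
    let mut per_file = HashMap::new();
    per_file.insert("a.rs", items_in_file("fn foo() { 1 }"));
    per_file.insert("b.rs", items_in_file("fn bar() { 2 }"));
    let before = all_items(&per_file);

    // Editing only foo's body: recompute a.rs's set, leave b.rs untouched.
    per_file.insert("a.rs", items_in_file("fn foo() { 1 + 1 }"));
    assert_eq!(all_items(&per_file), before); // union unchanged, stays valid
}
```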

nikomatsakis commented Oct 9, 2018

This is one of the key differences between salsa and Adapton (cc @matthewhammer). We followed the Glimmer approach here (cc @wycats) — in part because it is much simpler: you only need to track the inputs to any query, and don't have to be able to walk the graph the other way. In the context of Glimmer, it was also found to be more efficient: you only do work you know you have to do.

I think I would like to wait a bit until we see a true problem before pre-emptively making changes here. That might reveal whether indeed any changes are needed.

I would expect that in the context of an IDE, as you type into a fn, the IDE also prioritizes re-evaluation of the queries related to that function, so they need to be re-evaluated anyway. It will also be re-evaluating the set of errors in other files, of course, but that can be done at a lower priority and will be continuously canceled as you type. Once you reach quiescence, we'll have to revalidate the inputs to those other files, but I expect that will be fast. In particular, they will share many of the same inputs, so the idea of O(N) time to revalidate a single query isn't really correct -- don't forget that once we revalidate the work for a given function, we cache that by updating `verified_at`, so we won't revalidate it twice. Hence it's really O(N) time to revalidate the entire graph.
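The amortization argument can be sketched as follows (names invented for illustration): once a shared input has been revalidated at the current revision, its `verified_at` stamp is updated, so a second query that depends on the same input skips the check entirely.

```rust
struct Input {
    verified_at: u64,     // revision at which this input was last checked
    checks_performed: u32, // counts the actual hash comparisons done
}

fn revalidate(input: &mut Input, current_rev: u64) {
    if input.verified_at < current_rev {
        input.checks_performed += 1; // the real validation work happens here
        input.verified_at = current_rev; // cache the result of the check
    }
    // else: already verified at this revision, nothing to do
}

fn main() {
    let mut shared = Input { verified_at: 1, checks_performed: 0 };
    let current_rev = 2;
    // Two queries depend on the same input; only the first one pays.
    revalidate(&mut shared, current_rev);
    revalidate(&mut shared, current_rev);
    assert_eq!(shared.checks_performed, 1);
}
```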

But I guess time will out. =)


matklad commented Oct 9, 2018

100% agree that the current approach might turn out to be just fast enough!

Hence it's really O(n) time to revalidate the entire graph.

That's true, yeah, but, OTOH, it's one of the main optimizations of IDEs that they avoid showing errors in other files. It's interesting that Visual Studio with Roslyn has a "whole solution analysis" checkbox, which controls the "show errors in all files" behavior. It is on by default (all errors are shown), but the docs suggest disabling it for large projects. I wouldn't be surprised if, in Rust, due to the strong compilation boundaries between crates, we will be able to just always show all errors, regardless of project size, as long as it is split into crates of manageable size.

matthewhammer commented Oct 10, 2018

I'm interested to see how well other points in the design space of "cache invalidation" play out in practice. The approach taken by salsa-rs (and glimmer) seems like an interesting one, and comparable to, but distinct from, the approach taken by Adapton.

@nikomatsakis characterizes Adapton accurately above: When a cell changes value, the Adapton runtime system eagerly dirties all of the thunks that observed the prior value (actually, it dirties the edge on which each observation is recorded; thunks are "dirty" when they have one or more dirty edges). To maintain its internal invariants about what is dirty and what is clean, Adapton also dirties the observers of thunks with dirty edges, transitively. Because this traversal is eager, however, it can also "short circuit" when it encounters an edge that is already dirty.
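A toy model of that eager traversal (this is not Adapton's actual data structure or API; see the linked `engine.rs` for the real implementation): each observation is an edge, an edge becomes dirty when the value it observed may be stale, and the traversal short-circuits on edges that are already dirty.

```rust
use std::collections::HashMap;

struct Graph {
    observers: HashMap<u32, Vec<u32>>, // node -> thunks that observed it
    dirty: HashMap<(u32, u32), bool>,  // (observer, observed) edge flags
}

impl Graph {
    // Eagerly dirty every edge that observed `node`, transitively.
    fn dirty_observers_of(&mut self, node: u32) {
        for obs in self.observers.get(&node).cloned().unwrap_or_default() {
            let edge = self.dirty.entry((obs, node)).or_insert(false);
            if *edge {
                // Short-circuit: this edge, and everything reachable above
                // it, was already dirtied by an earlier traversal.
                continue;
            }
            *edge = true;
            // A thunk with a dirty edge is dirty; dirty its observers too.
            self.dirty_observers_of(obs);
        }
    }
}

fn main() {
    // Cell 0 is observed by thunk 1, which is observed by thunk 2.
    let mut g = Graph {
        observers: [(0, vec![1]), (1, vec![2]), (2, vec![])]
            .into_iter()
            .collect(),
        dirty: HashMap::new(),
    };
    g.dirty_observers_of(0); // cell 0 changed value
    assert_eq!(g.dirty[&(1, 0)], true);
    assert_eq!(g.dirty[&(2, 1)], true);
}
```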

The logic for this traversal is currently implemented here:
https://docs.rs/adapton/0.3.30/src/adapton/engine.rs.html#1306

(This OOPSLA 2015 paper discusses how it works formally, including the graph and its invariants: https://arxiv.org/abs/1503.07792)

In practice, this ("naive"?) approach to dirtying can be problematic for several reasons. We address most of these with refinements to the approach (also implemented above):

First, when "long chains" of dependencies arise, it is not practical to dirty and clean these long chains, since the traversals are too expensive. The general strategy here is not to change the Adapton runtime system itself, but instead to refactor the program (and thus, its dependency graph) by introducing a "firewall" (or more typically, a "chain of firewalls"). Each "firewall" consists of a named reference cell, whose (fixed) pointer address is returned from a memoized thunk, rather than the (changing) value contained in this reference cell. Later in a dependency chain, other thunks read the reference cell, and create an "indirect" kind of dependency between the producer thunk and the consumer thunk. A more detailed explanation and example are here: https://docs.rs/adapton/0.3.30/adapton/#nominal-firewalls
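The firewall idea can be illustrated in plain Rust, outside Adapton (this is only an analogy, not Adapton's real API; see the linked docs for the genuine `nominal firewalls` feature): the producer's result is the *identity* of a named cell rather than its contents, so when the contents change but the identity does not, the producer-to-consumer edge looks unchanged.

```rust
use std::cell::Cell;
use std::rc::Rc;

fn main() {
    // The "firewall": a named cell with a fixed identity.
    let firewall: Rc<Cell<i32>> = Rc::new(Cell::new(10));

    // Producer "thunk": its memoized output is the cell itself.
    let produced: Rc<Cell<i32>> = Rc::clone(&firewall);

    // Consumer reads *through* the firewall at use time.
    let consume = |cell: &Rc<Cell<i32>>| cell.get() * 2;
    assert_eq!(consume(&produced), 20);

    // The cell's contents change, but the producer's output (the cell's
    // identity) does not, so the long chain above it need not be dirtied.
    firewall.set(11);
    assert!(Rc::ptr_eq(&firewall, &produced));
    assert_eq!(consume(&produced), 22); // fresh value still visible
}
```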

Second, sometimes a reference cell will hold a tuple (or some other "structured value"), and the thunks that depend on this result only inspect one projection of this value, and do not depend on the entire (aggregate) value. In this case, dirtying the dependencies naively may dirty too much. To overcome this issue, Adapton permits observations that carry (pure) mapping functions; the dirtying logic mentioned above uses these pure mapping functions to perform comparisons after mapping the observation. Here are some concrete examples of using this feature: https://docs.rs/adapton/0.3.30/adapton/#use-force_map-for-more-precise-dependencies
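The projection idea reduces to the following check (a sketch with invented names, not Adapton's `force_map` signature): the observation records a pure projection of the observed value, and dirtying compares projections rather than whole aggregates.

```rust
// An observation that remembered only a projection of the observed value.
struct Observation<T> {
    projected: T,
}

// The edge stays clean if the projection of the new value matches what
// the observer actually saw, even when the aggregate value changed.
fn still_clean<V, T: PartialEq>(
    obs: &Observation<T>,
    new_value: &V,
    project: impl Fn(&V) -> T,
) -> bool {
    project(new_value) == obs.projected
}

fn main() {
    // The cell holds a pair, but the consumer only ever looked at `.0`.
    let old = (1, "alpha");
    let obs = Observation { projected: old.0 };

    // Changing the un-observed component does not dirty this observer...
    let new1 = (1, "beta");
    assert!(still_clean(&obs, &new1, |v: &(i32, &str)| v.0));

    // ...but changing the projected component does.
    let new2 = (2, "beta");
    assert!(!still_clean(&obs, &new2, |v: &(i32, &str)| v.0));
}
```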
