Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

VAET index #351

Open
darkleaf opened this issue Jun 2, 2020 · 10 comments
Open

VAET index #351

darkleaf opened this issue Jun 2, 2020 · 10 comments

Comments

@darkleaf
Copy link
Contributor

darkleaf commented Jun 2, 2020

Why the VAET index wasn't implemented?
Will be it useful to implement it?

@tonsky
Copy link
Owner

tonsky commented Jun 2, 2020

For all use-cases AVET covers what VAET does. I’m not sure what’s its purpose in Datomic is. If you have a use-case, I might consider adding it. Otherwise, it’ll just be a performance penalty

@refset
Copy link
Contributor

refset commented Jun 28, 2020

VAET is only necessary when you don't know which reverse reference-type attributes might be applicable for a given entity. For a small DataScript-sized system this is probably never an issue. At a larger scale, with potentially millions of attributes, you need VAET in order to do explorative navigation of the graph efficiently (when you can't be sure which reverse attributes might be relevant beforehand). Without VAET you have to scan through all the possible combinations in AVET.

@tonsky
Copy link
Owner

tonsky commented Jun 28, 2020

This only arises when deleting entities, right?

@refset
Copy link
Contributor

refset commented Jun 28, 2020

I'm saying that if you have many different :db.type/ref attributes then you can't query [?e _ e123] without naively scanning through all combinations in AVET.

I don't know exactly how VAET would be wired up internally, but I would expect to see something else more advanced than the EAVT filter here:

(filter (fn [^Datom d] (and (= v (.-v d))

Again, I doubt most users would be storing enough attributes or overall data in DataScript to warrant changing anything. Most of the time people know exactly which attributes they need in their queries.

For the curious, this is where near enough the same kind of scanning occurs for deleting entities:

v-datoms (vec (mapcat (fn [a] (-search db [nil a e])) (-attrs-by db :db.type/ref)))]

@wbrown
Copy link

wbrown commented Jun 29, 2020 via email

@tonsky
Copy link
Owner

tonsky commented Jun 29, 2020

[?e _ e123]

That would require VAET, of course, but I can’t imagine an use-case where anybody would need that type of query. The only use-case I come up with is deletion, and it’s a tradeoff — faster deletions or faster insertions (one less index to fill).

@refset
Copy link
Contributor

refset commented Jun 29, 2020

Yeah, it's certainly a tradeoff. I suspect that without VAET the deletion time cost might be too unpredictable at a large-enough scale to maintain consistent throughput, whereas the space (+ time) costs of inserting into VAET are amortised which makes it a much safer default for a highly-distributed production system like Datomic.

I think the non-deletion use-cases can be described as "exploratory" queries, where your data set is large (many attributes) and not well understood (can't predict which attributes might be relevant). I imagine Roam's knowledge graph "backlinks" feature might be a good use-case for benefiting from a native VAET (backlink!) index, if such a graph ever grew large enough (i.e. multi-user).

Incidentally, I just came across these discussions that mention Datomic's ability to do wildcard reverse lookups with the Entity API via (keys (.cache (.touch ent))):

Providing a similar index to VAET is something that the Crux team has been contemplating recently, motivated by our work on a generic entity navigation UI. Crux avoids the deletion issue by choosing not to enforce referential integrity, i.e. there are no explicit reference-type attributes, only values and IDs that might happen to correspond.

I'll leave it at that for my contributions to this thread but hopefully it's useful context for some. I remember thinking about this very question several years ago, so it's good to have finally written it down :)

@bluesun
Copy link

bluesun commented Jul 7, 2020

Here is a use case that produces potentially large numbers of attributes and in which a query of the form of [?es _ e123] is relevant:

One way to import large vectors (for example an ordered sequence of events) is to use “numbered” attributes. For example [event-source-id :345 event-id] where event-source is the entity representing the source of the ordered sequence of events and event is the event at index 345 in the vector.

In this case, [?event-source _ specific-event-id] is a legitimate query clause to find the source metadata relating to a specific event.

This use case uses potentially a large number of attributes. And this data design is the one of the few ones having similar performance characteristic to a vector in Clojure or array in other languages.

This use case was important enough to be integrated in the RDF schema standard:
https://www.w3.org/TR/rdf-schema/#ch_containermembershipproperty

@bluesun
Copy link

bluesun commented Aug 11, 2020

Here is a use case that produces potentially large numbers of attributes and in which a query of the form of [?es _ e123] is relevant:

One way to import large vectors (for example an ordered sequence of events) is to use “numbered” attributes. For example [event-source-id :345 event-id] where event-source is the entity representing the source of the ordered sequence of events and event is the event at index 345 in the vector.

In this case, [?event-source _ specific-event-id] is a legitimate query clause to find the source metadata relating to a specific event.

This use case uses potentially a large number of attributes. And this data design is the one of the few ones having similar performance characteristic to a vector in Clojure or array in other languages.

This use case was important enough to be integrated in the RDF schema standard:
https://www.w3.org/TR/rdf-schema/#ch_containermembershipproperty

Hi @tonsky , I am curious about your feedback on this point. I can easily extend the argumentation if you are interested...

@tonsky
Copy link
Owner

tonsky commented Aug 11, 2020

Thank you @bluesun, your argument is clear. I do not like the approach, it feels like misuse of the attribute field, it produces an unbounded amount of keywords, requires string concatenation/parsing if you want two ordered attributes. And, it works poorly with existing indexes structure :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants