Tuple of n-items #98

Closed
amirouche opened this issue Jun 16, 2019 · 12 comments

Comments

@amirouche amirouche commented Jun 16, 2019

Why?

As explained in Frey:2017, named graphs (ngraph) allow metadata to be represented in a more performant way. I propose to generalize that scheme to support any number of metadata items that add information to every tuple, e.g. provenance, licence, or historical significance.

Previous work

  • SRFI-168, aka. nstore, a proposal to support nstore in Scheme programming language implementations.
  • hoply, a Python implementation of nstore.
  • datae, a work-in-progress versioned nstore that relies on 7-tuples (graph subject predicate object alive change-id uid).

Proposed solution

In the case of a quad store, instead of

SELECT DISTINCT ?person
WHERE {
    ?person foaf:name ?name .
    GRAPH ?g1 { ?person a foaf:Person }
    GRAPH ?g2 { ?person a foaf:Person }
    GRAPH ?g3 { ?person a foaf:Person }
    FILTER(?g1 != ?g2 && ?g1 != ?g3 && ?g2 != ?g3) .
}     

The proposal is to also support the following more general syntax:

SELECT DISTINCT ?person
WHERE {
    default-graph ?person foaf:name ?name .
    ?g1 ?person a foaf:Person .
    ?g2 ?person a foaf:Person .
    ?g3 ?person a foaf:Person .
    FILTER(?g1 != ?g2 && ?g1 != ?g3 && ?g2 != ?g3) .
}        
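To make the "graph is just another slot" idea concrete, here is a minimal Python sketch of pattern matching over n-tuples. All names here (`Var`, `match`) are illustrative, not taken from hoply or SRFI-168:

```python
# Minimal sketch of pattern matching over n-tuples, where the graph
# is an ordinary slot that can hold a variable like any other.

class Var:
    """A pattern variable, e.g. Var('g')."""
    def __init__(self, name):
        self.name = name

def match(pattern, tuples):
    """Yield variable bindings for every tuple matching the pattern."""
    for t in tuples:
        if len(t) != len(pattern):
            continue
        bindings = {}
        ok = True
        for p, v in zip(pattern, t):
            if isinstance(p, Var):
                if p.name in bindings and bindings[p.name] != v:
                    ok = False
                    break
                bindings[p.name] = v
            elif p != v:
                ok = False
                break
        if ok:
            yield bindings

# A 4-tuple (quad) store: (graph, subject, predicate, object)
store = [
    ("g1", "alice", "a", "foaf:Person"),
    ("g2", "alice", "a", "foaf:Person"),
    ("g1", "alice", "foaf:name", "Alice"),
]

# Equivalent of the pattern: ?g alice a foaf:Person .
results = list(match((Var("g"), "alice", "a", "foaf:Person"), store))
# g binds to "g1" and "g2"
```

The same `match` works unchanged for any fixed n, which is the point of the generalization.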

Considerations for backward compatibility

Backward compatible.

@namedgraph namedgraph commented Jun 16, 2019

How is this backward-compatible or even RDF-compatible?..

@amirouche amirouche commented Jun 16, 2019

The database must be set up with a fixed n. For n=3 the proposed syntax is the same as before. For n=4 the syntax is different, but the parser can support both the new and the current syntax.

@amirouche amirouche commented Jun 16, 2019

RDF-compatible

What particular aspect of RDF does the proposal conflict with, outside serialization formats like Turtle, which could also be adapted to work on tuples of n items?

@namedgraph namedgraph commented Jun 16, 2019

Like, the whole RDF stack, which has always been based on triples?

@amirouche amirouche commented Jun 16, 2019

The whole RDF stack, which has always been based on triples?

This is a conservative stance, without much argument to uphold it. Moreover, it is plainly wrong: named graphs are 4-tuple stores.

@namedgraph namedgraph commented Jun 16, 2019

RDF statements consist of three elements; that is why they are called triples. Quads are an extension of the model, as it says below.

@JervenBolleman JervenBolleman commented Jun 17, 2019

While formally quite different in its consequences from RDF*/SPARQL*, this does seem to point to the same need: making it easier to query the provenance of certain triples.

The new syntax does not seem to add any new capabilities to the platforms (other than a default graph name/keyword; see also #43). I.e., I can't imagine a query written in this syntax that could not be written in the existing GRAPH syntax.

@VladimirAlexiev VladimirAlexiev commented Jul 10, 2019

But @JervenBolleman, I think the provided query example does not expose the complexity of what is being proposed. I can imagine a query, e.g. like this for a 6-store:

select * {graph <?g1,?g2,?g3> {?person a foaf:Person}}

@amirouche I have some questions:

  • Are the implementations you mention performant? Have they been tested, e.g. on 10B n-tuples?
  • Do you have any design proposal for how to efficiently implement an n-store where n can be configured per repository?
  • To enable efficient graph retrieval, quad stores implement indexes such as CPO, CSP, CSPO, etc. (where C=context is the graph), in addition to the usual SPO, POS, etc. Do you have any research on how many indexes will be needed for an n-store, and on the attendant tradeoffs?
  • If you have a quad store <C,S,P,O>, you can easily model additional components by attaching them to C. Why is this not sufficient?

@amirouche amirouche commented Jul 10, 2019

Thanks for responding to my proposal!

select * {graph <?g1,?g2,?g3> {?person a foaf:Person}}

Sorry, I don't understand what this query does.

Are the implementations you mention performant?

Performant in terms of speed, yes. In terms of on-disk space it is another story. I have optimized that, but I think there is no way to reduce "the indexation factor" further. In the case of 6-tuples the indexation factor is 20: the central binomial coefficient. Without compression, it requires 20 times the original size of the data to store it on disk.

Have they been tested eg on 10B n-tuples?

Not yet. But that is more a question for the underlying storage engine: wiredtiger, foundationdb, rocksdb. All of them are used in production for big workloads.

Do you have any design proposal for how to implement efficiently a n-store where n can be configured per repository?

Yes: hoply and srfi-168 provide two implementations, one for Python and the other for Scheme.

To enable efficient graph retrieval, quad stores implement indexes such as CPO, CSP, CSPO etc (where C=context is the graph), in addition to the usual SPO, POS, etc. Have you any research of how many indexes will be needed for a n-store

That is exactly the gist of this "invention". The minimal number of indexes required is the central binomial coefficient. It is explained in https://math.stackexchange.com/q/3146568/23663 and the related https://stackoverflow.com/q/55143485/140837.
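For reference, the number he is describing is the central binomial coefficient C(n, ⌊n/2⌋): the minimal number of prefix-ordered indexes such that any combination of bound/unbound columns can be answered by one of them. A quick check (the function name is ours, for illustration):

```python
from math import comb

def minimal_indexes(n):
    """Minimal number of prefix indexes so that every pattern of
    bound/unbound columns over an n-tuple can be answered with a
    single range scan: the central binomial coefficient C(n, n//2)."""
    return comb(n, n // 2)

print(minimal_indexes(3))  # 3, a classic triple store
print(minimal_indexes(4))  # 6, the usual quad-store index count
print(minimal_indexes(6))  # 20, the indexation factor mentioned above
```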

and the attendant tradeoffs?

I think the tradeoff made by n-tuples vs. "n-tuples in 3-tuples" is trading disk space for lower CPU usage. That said, reifying every 3-tuple will also require more space; it is an open question whether the same data stored as 3-tuples vs. n-tuples has different disk usage. "n-tuples in 3-tuples" will always be slower. Does that answer the question?

If you have a quad store <C,S,P,O>, you can easily model additional components by attaching them to C. Why is this not sufficient?

I already use C to represent the "collection". Yes, there are reification techniques that make it possible to represent n-tuples in 3-tuples or 4-tuples. The advantage of n-tuples is that the application can configure the store up-front so that all the primordial information is kept in a single row/tuple, available in a single lookup for every fact. All other representations of n-tuples in 3-tuples require more than one lookup to fetch similar information.

It seems odd that the "provenance" of a given tuple is deferred to another set of tuples.
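The single-lookup vs. multi-lookup difference can be sketched as follows; this is a hypothetical in-memory layout, not hoply's actual schema:

```python
# n-tuple layout: metadata stored inline with the fact,
# so one lookup returns everything (illustrative layout).
ntuple_store = {
    ("alice", "a", "foaf:Person"): {
        "graph": "g1", "licence": "CC0", "change-id": 42,
    },
}

def fact_metadata(s, p, o):
    # single lookup
    return ntuple_store[(s, p, o)]

# Reified layout: the same metadata spread over extra triples.
triples = {
    ("stmt1", "rdf:subject", "alice"),
    ("stmt1", "rdf:predicate", "a"),
    ("stmt1", "rdf:object", "foaf:Person"),
    ("stmt1", "licence", "CC0"),
}

def reified_metadata(s, p, o):
    # first lookup: find the statement id for the subject
    stmt = next(x for (x, pr, ob) in triples
                if pr == "rdf:subject" and ob == s)
    # further lookups: gather the metadata attached to that id
    return {pr: ob for (x, pr, ob) in triples
            if x == stmt and not pr.startswith("rdf:")}
```

Both return the licence of the fact, but the reified form needs at least two scans where the n-tuple form needs one.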

I read on the mailing list that n-tuples are called "chunks", and collections of them "graphs of chunks", in cognitive science.

@VladimirAlexiev VladimirAlexiev commented Jul 11, 2019

Adding more indexes slows down data loading and updates: this should be taken into account in a realistic benchmark.

If you use C as a singleton, i.e. as a tuple id, then you can attach any extra fields to that id.
Of course, the best indexing/storage strategy depends on the amount and regularity of "primordial" data. E.g. if only 1% of tuples have a "confidence" field, it makes better sense to attach it as an extra field to C, rather than have it as a slot in every tuple.
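A small sketch of the sparse-field alternative being suggested here; the layout and names are illustrative only:

```python
# Quad layout <C, S, P, O>, where C doubles as a tuple id.
quads = [
    ("c1", "alice", "a", "foaf:Person"),
    ("c2", "bob", "a", "foaf:Person"),
]

# Sparse metadata attached to C off to the side, instead of
# widening every tuple with a slot most rows would leave empty.
context_fields = {
    "c2": {"confidence": 0.9},  # only the rare tuples that need it
}

def confidence(c):
    """Return the confidence attached to context c, or None."""
    return context_fields.get(c, {}).get("confidence")
```

When a field is present on only a small fraction of tuples, this side table costs far less than an extra column in every index.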

@namedgraph namedgraph commented Jul 11, 2019

This feature is out of scope for SPARQL 1.2.

@amirouche amirouche commented Dec 6, 2019

@VladimirAlexiev Thanks a lot for the feedback! By the way, I found a way to reduce the indexation factor to roughly 1 time the original size of the data, plus something that depends on the total number of n-tuples. That is much less than 20 times the size of the original data.

I will close the issue since it is out of scope for the current iteration.

Thanks all!
