Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tuple of n-items #98

Open
amirouche opened this issue Jun 16, 2019 · 11 comments

Comments

@amirouche
Copy link

commented Jun 16, 2019

Why?

Like explained in Frey:2017, the named graph (ngraph) allows to represent metadata in a more performant way. I propose to generalize that scheme to support any number of metadata item that must add information to every tuple e.g. like provenance, licence or history significance.

Previous work

  • SRFI-168 aka. nstore a proposal to support nstore in scheme programming language implementations.
  • hoply a python implementation of nstore
  • datae work-in-progress versioned nstore that rely on 7-tuples (graph subject predicate object alive change-id uid).

Proposed solution

In the case of quad store, instead of

SELECT DISTINCT ?person
WHERE {
    ?person foaf:name ?name .
    GRAPH ?g1 { ?person a foaf:Person }
    GRAPH ?g2 { ?person a foaf:Person }
    GRAPH ?g3 { ?person a foaf:Person }
    FILTER(?g1 != ?g2 && ?g1 != ?g3 && ?g2 != ?g3) .
}     

The proposition is to also support the following more general syntax:

SELECT DISTINCT ?person
WHERE {
    default-graph ?person foaf:name ?name .
    ?g1 ?person a foaf:Person .
    ?g2 ?person a foaf:Person .
    ?g3 ?person a foaf:Person .
    FILTER(?g1 != ?g2 && ?g1 != ?g3 && ?g2 != ?g3) .
}        

Considerations for backward compatibility

Backward compatible.

@namedgraph

This comment has been minimized.

Copy link

commented Jun 16, 2019

How is this backward-compatible or even RDF-compatible?..

@amirouche

This comment has been minimized.

Copy link
Author

commented Jun 16, 2019

The database must be setup with fixed n. For n=3 the proposed syntax is the same as before. For n=4 the syntax is different but the parser can support both the new and current syntax.

@amirouche

This comment has been minimized.

Copy link
Author

commented Jun 16, 2019

RDF-compatible

What particular aspect of RDF the proposal enter with conflict with outside serialization formats like Turtle that could also be adapted to work on tuple of n-items.

@namedgraph

This comment has been minimized.

Copy link

commented Jun 16, 2019

Like, the whole RDF stack which has been always based on triples?

@amirouche

This comment has been minimized.

Copy link
Author

commented Jun 16, 2019

The whole RDF stack which has been always based on triples?

This is a conservative stance. Without much argument(s) to uphold it. Moreover, it is plain wrong, named graphs are 4-tuple stores.

@namedgraph

This comment has been minimized.

Copy link

commented Jun 16, 2019

RDF statements consist of three elements they are called triples. Quads are an extension of the model, as it says below.

@JervenBolleman

This comment has been minimized.

Copy link
Collaborator

commented Jun 17, 2019

While formally quite different in consequences to RDF*/SPARQL* it does seem to point to the same need to make it easier to query for provenance to certain triples.

The new syntax does not seem to add any new capabilities to the platforms (other than a default graph name/keyword also see #43). I.e. I can't imagine a query written in this syntax that could not be written in the existing GRAPH syntax.

@VladimirAlexiev

This comment has been minimized.

Copy link

commented Jul 10, 2019

But @JervenBolleman I think the provide query example does not expose the complexity of what is being proposed. I can imagine a query eg like this for a 6-store:

select * {graph <?g1,?g2,?g3> {?person a foaf:Person}}

@amirouche I have some questions:

  • Are the implementations you mention performant? Have they been tested eg on 10B n-tuples?
  • Do you have any design proposal for how to implement efficiently a n-store where n can be configured per repository?
  • To enable efficient graph retrieval, quad stores implement indexes such as CPO, CSP, CSPO etc (where C=context is the graph), in addition to the usual SPO, POS, etc. Have you any research of how many indexes will be needed for a n-store and the attendant tradeoffs?
  • If you have a quad store <C,S,P,O>, you can easily model additional components by attaching them to C. Why is this not sufficient?
@amirouche

This comment has been minimized.

Copy link
Author

commented Jul 10, 2019

Thanks for responding to my proposal!

select * {graph <?g1,?g2,?g3> {?person a foaf:Person}}

Sorry, I don't understand what this query does.

Are the implementations you mention performant?

Performant in terms of speed yes. In terms of on-disk space it is another story, I have optimized that, I think there is no way to reduce "the indexation factor". The indexation factor is in the case of 6-tuple equal to 20. It is the central binomial coefficient. Without compression, it requires 20 times the original size of the data, to store it on disk.

Have they been tested eg on 10B n-tuples?

Not yet. But that is more a question for the underlying storage engine: wiredtiger, foundationdb, rocksdb. All of them are used in production for big workloads.

Do you have any design proposal for how to implement efficiently a n-store where n can be configured per repository?

Yes, that is both hoply and srfi-168 provide two implementations respectively for Python and the other for Scheme.

To enable efficient graph retrieval, quad stores implement indexes such as CPO, CSP, CSPO etc (where C=context is the graph), in addition to the usual SPO, POS, etc. Have you any research of how many indexes will be needed for a n-store

That is exactly the gist of this "invention". The minimal number of indexes required is the central binomial coefficient. It is explained in https://math.stackexchange.com/q/3146568/23663 and the related https://stackoverflow.com/q/55143485/140837

and the attendant tradeoffs?

I think that the tradeoff that is made by n-tuples vs. "n-tuples in 3-tuples" is trading space disk for lower CPU usage. That said, reification of every 3-tuple will also require more space. That is an open question if the same data stored in 3-tuple vs n-tuple have different disk usage. "n-tuple in 3-tuples" will always be slower. Does it answer the question?

If you have a quad store <C,S,P,O>, you can easily model additional components by attaching them to C. Why is this not sufficient?

I already use C to represent the "collection". Yes there are reification technics that makes it possible to represent n-tuple in 3-tuples or 4-tuples. The advantage of n-tuples, is that the application can configure it up-front to store in a single row / tuple all the primordial information that will be available in a single look up on all facts. All other representation of n-tuple in 3-tuples require more than one lookup to fetch similar information.

It seems odd that the "provenance" of a given tuple is deferred to another set of tuple.

I read on the mailing list that n-tuple are called "chunk" and "graph of chunks" in cognitive science.

@VladimirAlexiev

This comment has been minimized.

Copy link

commented Jul 11, 2019

Adding more indexes slows down data loading and update: this should be taken into account in a realistic benchmark.

If you use C as singleton, I.e. as a tuple id, then you can attach any extra fields to that id.
Of course, the best indexing/ storage strategy depends on the amount and regularity of "primordial" data. Eg if only 1% of rules have a "confidence" field, it makes better sense to attach that as an extra field to C, rather than have it as a slot in every tuple .

@namedgraph

This comment has been minimized.

Copy link

commented Jul 11, 2019

This feature is out of scope for SPARQL 1.2

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants
You can’t perform that action at this time.