Tuple of n-items #98

Closed
amirouche opened this issue Jun 16, 2019 · 12 comments

Comments

@amirouche amirouche commented Jun 16, 2019

Why?

As explained in Frey:2017, named graphs (ngraph) allow metadata to be represented in a more performant way. I propose to generalize that scheme to support any number of metadata items that add information to every tuple, e.g. provenance, licence, or historical significance.

Previous work

  • SRFI-168, aka. nstore, a proposal to support nstore in Scheme programming language implementations.
  • hoply, a Python implementation of nstore.
  • datae, a work-in-progress versioned nstore that relies on 7-tuples (graph subject predicate object alive change-id uid).

Proposed solution

In the case of a quad store, instead of

SELECT DISTINCT ?person
WHERE {
    ?person foaf:name ?name .
    GRAPH ?g1 { ?person a foaf:Person }
    GRAPH ?g2 { ?person a foaf:Person }
    GRAPH ?g3 { ?person a foaf:Person }
    FILTER(?g1 != ?g2 && ?g1 != ?g3 && ?g2 != ?g3) .
}     

The proposal is to also support the following more general syntax:

SELECT DISTINCT ?person
WHERE {
    default-graph ?person foaf:name ?name .
    ?g1 ?person a foaf:Person .
    ?g2 ?person a foaf:Person .
    ?g3 ?person a foaf:Person .
    FILTER(?g1 != ?g2 && ?g1 != ?g3 && ?g2 != ?g3) .
}        
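To make the "graph is just another slot" idea concrete, here is a minimal Python sketch of pattern matching over n-tuples. All names here (`Var`, `match`) are illustrative, not taken from hoply or SRFI-168:

```python
# Minimal sketch of pattern matching over n-tuples, where the graph
# is an ordinary slot that can hold a variable like any other.

class Var:
    """A pattern variable, e.g. Var('g')."""
    def __init__(self, name):
        self.name = name

def match(pattern, tuples):
    """Yield variable bindings for every tuple matching the pattern."""
    for t in tuples:
        if len(t) != len(pattern):
            continue
        bindings = {}
        ok = True
        for p, v in zip(pattern, t):
            if isinstance(p, Var):
                if p.name in bindings and bindings[p.name] != v:
                    ok = False
                    break
                bindings[p.name] = v
            elif p != v:
                ok = False
                break
        if ok:
            yield bindings

# A 4-tuple (quad) store: (graph, subject, predicate, object)
store = [
    ("g1", "alice", "a", "foaf:Person"),
    ("g2", "alice", "a", "foaf:Person"),
    ("g1", "alice", "foaf:name", "Alice"),
]

# Equivalent of the pattern: ?g alice a foaf:Person .
results = list(match((Var("g"), "alice", "a", "foaf:Person"), store))
# g binds to "g1" and "g2"
```

The same `match` works unchanged for any fixed n, which is the point of the generalization.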

Considerations for backward compatibility

Backward compatible.

@namedgraph namedgraph commented Jun 16, 2019

How is this backward-compatible or even RDF-compatible?..

@amirouche amirouche commented Jun 16, 2019

The database must be set up with a fixed n. For n=3 the proposed syntax is the same as before. For n=4 the syntax is different, but the parser can support both the new and the current syntax.

@amirouche amirouche commented Jun 16, 2019

RDF-compatible

What particular aspect of RDF does the proposal conflict with, outside serialization formats like Turtle, which could also be adapted to work on tuples of n items?

@namedgraph namedgraph commented Jun 16, 2019

Like, the whole RDF stack, which has always been based on triples?

@amirouche amirouche commented Jun 16, 2019

The whole RDF stack, which has always been based on triples?

This is a conservative stance, without much argument to uphold it. Moreover, it is plainly wrong: named graphs are 4-tuple stores.

@namedgraph namedgraph commented Jun 16, 2019

RDF statements consist of three elements; that is why they are called triples. Quads are an extension of the model, as it says below.

@JervenBolleman JervenBolleman commented Jun 17, 2019

While formally quite different in its consequences from RDF*/SPARQL*, this does seem to point to the same need: making it easier to query the provenance of certain triples.

The new syntax does not seem to add any new capabilities to the platforms (other than a default graph name/keyword; see also #43). I.e., I can't imagine a query written in this syntax that could not be written in the existing GRAPH syntax.

@VladimirAlexiev VladimirAlexiev commented Jul 10, 2019

But @JervenBolleman, I think the provided query example does not expose the complexity of what is being proposed. I can imagine a query, e.g. like this for a 6-store:

select * {graph <?g1,?g2,?g3> {?person a foaf:Person}}

@amirouche I have some questions:

  • Are the implementations you mention performant? Have they been tested, e.g. on 10B n-tuples?
  • Do you have any design proposal for how to efficiently implement an n-store where n can be configured per repository?
  • To enable efficient graph retrieval, quad stores implement indexes such as CPO, CSP, CSPO, etc. (where C=context is the graph), in addition to the usual SPO, POS, etc. Do you have any research on how many indexes will be needed for an n-store, and on the attendant tradeoffs?
  • If you have a quad store <C,S,P,O>, you can easily model additional components by attaching them to C. Why is this not sufficient?

@amirouche amirouche commented Jul 10, 2019

Thanks for responding to my proposal!

select * {graph <?g1,?g2,?g3> {?person a foaf:Person}}

Sorry, I don't understand what this query does.

Are the implementations you mention performant?

Performant in terms of speed, yes. In terms of on-disk space it is another story. I have optimized that, but I think there is no way to reduce "the indexation factor" further. In the case of 6-tuples the indexation factor is 20: the central binomial coefficient. Without compression, it requires 20 times the original size of the data to store it on disk.

Have they been tested eg on 10B n-tuples?

Not yet. But that is more a question for the underlying storage engine: wiredtiger, foundationdb, rocksdb. All of them are used in production for big workloads.

Do you have any design proposal for how to implement efficiently a n-store where n can be configured per repository?

Yes: hoply and srfi-168 provide two implementations, one for Python and the other for Scheme.

To enable efficient graph retrieval, quad stores implement indexes such as CPO, CSP, CSPO etc (where C=context is the graph), in addition to the usual SPO, POS, etc. Have you any research of how many indexes will be needed for a n-store

That is exactly the gist of this "invention". The minimal number of indexes required is the central binomial coefficient. It is explained in https://math.stackexchange.com/q/3146568/23663 and the related https://stackoverflow.com/q/55143485/140837.
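For reference, the number he is describing is the central binomial coefficient C(n, ⌊n/2⌋): the minimal number of prefix-ordered indexes such that any combination of bound/unbound columns can be answered by one of them. A quick check (the function name is ours, for illustration):

```python
from math import comb

def minimal_indexes(n):
    """Minimal number of prefix indexes so that every pattern of
    bound/unbound columns over an n-tuple can be answered with a
    single range scan: the central binomial coefficient C(n, n//2)."""
    return comb(n, n // 2)

print(minimal_indexes(3))  # 3, a classic triple store
print(minimal_indexes(4))  # 6, the usual quad-store index count
print(minimal_indexes(6))  # 20, the indexation factor mentioned above
```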

and the attendant tradeoffs?

I think the tradeoff made by n-tuples vs. "n-tuples in 3-tuples" is trading disk space for lower CPU usage. That said, reifying every 3-tuple will also require more space; it is an open question whether the same data stored as 3-tuples vs. n-tuples has different disk usage. "n-tuples in 3-tuples" will always be slower. Does that answer the question?

If you have a quad store <C,S,P,O>, you can easily model additional components by attaching them to C. Why is this not sufficient?

I already use C to represent the "collection". Yes, there are reification techniques that make it possible to represent n-tuples in 3-tuples or 4-tuples. The advantage of n-tuples is that the application can configure the store up-front so that all the primordial information is kept in a single row/tuple, available in a single lookup for every fact. All other representations of n-tuples in 3-tuples require more than one lookup to fetch similar information.

It seems odd that the "provenance" of a given tuple is deferred to another set of tuples.
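The single-lookup vs. multi-lookup difference can be sketched as follows; this is a hypothetical in-memory layout, not hoply's actual schema:

```python
# n-tuple layout: metadata stored inline with the fact,
# so one lookup returns everything (illustrative layout).
ntuple_store = {
    ("alice", "a", "foaf:Person"): {
        "graph": "g1", "licence": "CC0", "change-id": 42,
    },
}

def fact_metadata(s, p, o):
    # single lookup
    return ntuple_store[(s, p, o)]

# Reified layout: the same metadata spread over extra triples.
triples = {
    ("stmt1", "rdf:subject", "alice"),
    ("stmt1", "rdf:predicate", "a"),
    ("stmt1", "rdf:object", "foaf:Person"),
    ("stmt1", "licence", "CC0"),
}

def reified_metadata(s, p, o):
    # first lookup: find the statement id for the subject
    stmt = next(x for (x, pr, ob) in triples
                if pr == "rdf:subject" and ob == s)
    # further lookups: gather the metadata attached to that id
    return {pr: ob for (x, pr, ob) in triples
            if x == stmt and not pr.startswith("rdf:")}
```

Both return the licence of the fact, but the reified form needs at least two scans where the n-tuple form needs one.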

I read on the mailing list that n-tuples are called "chunks", and collections of them "graphs of chunks", in cognitive science.

@VladimirAlexiev VladimirAlexiev commented Jul 11, 2019

Adding more indexes slows down data loading and updates: this should be taken into account in a realistic benchmark.

If you use C as a singleton, i.e. as a tuple id, then you can attach any extra fields to that id.
Of course, the best indexing/storage strategy depends on the amount and regularity of "primordial" data. E.g. if only 1% of tuples have a "confidence" field, it makes better sense to attach it as an extra field to C, rather than have it as a slot in every tuple.
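A small sketch of the sparse-field alternative being suggested here; the layout and names are illustrative only:

```python
# Quad layout <C, S, P, O>, where C doubles as a tuple id.
quads = [
    ("c1", "alice", "a", "foaf:Person"),
    ("c2", "bob", "a", "foaf:Person"),
]

# Sparse metadata attached to C off to the side, instead of
# widening every tuple with a slot most rows would leave empty.
context_fields = {
    "c2": {"confidence": 0.9},  # only the rare tuples that need it
}

def confidence(c):
    """Return the confidence attached to context c, or None."""
    return context_fields.get(c, {}).get("confidence")
```

When a field is present on only a small fraction of tuples, this side table costs far less than an extra column in every index.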

@namedgraph namedgraph commented Jul 11, 2019

This feature is out of scope for SPARQL 1.2.

@amirouche amirouche commented Dec 6, 2019

@VladimirAlexiev Thanks a lot for the feedback! By the way, I found a way to reduce the indexation factor to roughly 1 time the original size of the data, plus something that depends on the total number of n-tuples. That is much less than 20 times the size of the original data.

I will close the issue since it is out of scope for the current iteration.

Thanks all!
