how to implement SHACL Rules (forward chaining) efficiently #350
-
@steveray What is the expansion ratio (total/explicit triples) you obtain?
-
@VladimirAlexiev, first, apologies that the open223 sample model files haven't been updated lately. I will also need to double-check why the entire QUDT vocabulary somehow made its way into the compiled version of the first sample model! Running the inferences on that model myself yields an expansion ratio of 1.77, mostly due to the "connection" inferencing mentioned in #343. As an aside, I view the inferencing rules used in the ASHRAE standard as similar to the way calculations are applied in an Excel spreadsheet: a unidirectional flow rather than loops or recursion.
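For anyone who wants to reproduce this kind of measurement, here is a minimal sketch using rdflib and pySHACL. The file names are placeholders, and `iterate_rules` may not be available in older pySHACL versions:

```python
# Sketch: measure the expansion ratio (total / explicit triples)
# produced by SHACL-AF rules. File names are placeholders.
from rdflib import Graph
from pyshacl import validate

data = Graph().parse("model.ttl")
shapes = Graph().parse("shapes.ttl")

explicit = len(data)

# advanced=True enables SHACL-AF rules; inplace=True writes the
# inferred triples back into `data`; iterate_rules=True (if your
# pySHACL version supports it) re-runs rules until a fixed point.
validate(data, shacl_graph=shapes, advanced=True, inplace=True,
         iterate_rules=True)

print(f"expansion ratio: {len(data) / explicit:.2f}")  # e.g. 1.77
```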
-
This is due to a bug in the script that generates all of the model pages. I'll work on fixing that today; hopefully it's as simple as regenerating that model. I will probably edit the script so it generates three links for each model: the base Turtle, Turtle + inferred triples, and Turtle + inferred triples + all dependent ontologies (easiest for demos -- just load a single graph!).
I've always thought of them as a fixed-point computation, though that view is rooted in my experience building an OWL 2 RL inference engine on top of datalog. An efficient forward-chaining implementation of SHACL rules would be very welcome (at least to me!).
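As a concrete reading of "fixed-point computation", here is a minimal sketch that treats each rule as a SPARQL CONSTRUCT query over an rdflib graph; the rule strings are assumed to be supplied by the caller:

```python
# Naive fixed-point evaluation: re-run every rule until no rule
# produces a triple the graph does not already contain.
from rdflib import Graph

def fixpoint(g: Graph, rules: list[str]) -> Graph:
    """`rules` are SPARQL CONSTRUCT query strings."""
    while True:
        before = len(g)
        for rule in rules:
            # a CONSTRUCT result iterates as triples
            for triple in g.query(rule):
                g.add(triple)
        if len(g) == before:  # nothing new: fixed point reached
            return g
```

The obvious inefficiency is that every rule is re-evaluated against the whole graph on every pass; the datalog literature on evaluation strategies addresses exactly this.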
-
Why forward chaining?
Maybe it is, maybe it isn't. It depends on the rules and the data.
That page says what is used: RETE for forward chaining, SLD resolution for backward chaining. A hybrid is forward rules that generate backward rules. There are many algorithms for datalog; see the semi-naive sketch at the end of this comment.
Yes.
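To make "many algorithms for datalog" concrete, here is a sketch of semi-naive evaluation, the standard improvement over naive re-evaluation, for a single transitive-closure rule (plain tuples, no RDF machinery):

```python
# Semi-naive evaluation sketch for one datalog rule:
#   path(x, z) :- path(x, y), edge(y, z)
# Only facts derived in the previous round (the "delta") are joined,
# so already-processed facts are not re-examined every iteration.
def transitive_closure(edges: set) -> set:
    path = set(edges)
    delta = set(edges)                 # new facts from the last round
    while delta:
        new = {(x, z)
               for (x, y) in delta     # join only the delta...
               for (y2, z) in edges    # ...against the base relation
               if y == y2}
        delta = new - path             # keep only unseen facts
        path |= delta
    return path

print(transitive_closure({(1, 2), (2, 3), (3, 4)}))
# {(1, 2), (2, 3), (3, 4), (1, 3), (2, 4), (1, 4)}
```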
-
The above puzzles me. I'm not aware of having censored anything. "Recipes for disaster" have certainly existed in various standardization efforts. Where databases of any model are concerned, queries that would deliver Cartesian products as results are often considered such recipes, especially as the database in question grows.

You were commenting on scalability. Scalability is often a question of specific query execution and/or construction tactics. Deployment resources (RAM, persistent storage, temporary storage, physical processor cores, logical processor threads, etc.) also play a significant part.

Iteration of inference rules may or may not be scalable. It depends on how complex the rules are, how much data is being reasoned over, whether all that data is local, and various other considerations. Given that such iteration is in regular use by Holger and/or users of TopBraid, it certainly seems sufficiently scalable for their current purposes. Whether that will continue to be the case is beyond my assessment.
-
I don't see how iterating over the inferences can improve the scalability situation. Is it because we can define the number of iterations to limit it? Generally, I think even with simple rules you can easily run into scalability/recursion issues; my understanding is that recursion was left out of the current/previous SHACL version for exactly this reason. I think using Datalog would be a good way to implement rules. @afs, how would you address the fact that rules would then no longer be implementable in SPARQL (which I understand was a design principle of SHACL)?
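One common way to keep recursion under control is to stratify the rules: build a dependency graph between rules and only iterate the groups that are genuinely recursive. A minimal sketch, with made-up rule names and predicate sets:

```python
# Sketch: if the rule-dependency graph is acyclic, every rule can run
# exactly once in topological order; a cycle means genuine recursion,
# and only the rules on that cycle need fixed-point iteration.
# The rules and their produces/consumes predicate sets are illustrative.
from graphlib import TopologicalSorter, CycleError

rules = {
    "connections": {"produces": {"ex:hasConnection"},
                    "consumes": {"ex:connectedTo"}},
    "reachable":   {"produces": {"ex:reachable"},
                    "consumes": {"ex:hasConnection", "ex:reachable"}},
}

# rule A depends on rule B if B produces a predicate that A consumes
deps = {a: {b for b in rules
            if rules[b]["produces"] & rules[a]["consumes"]}
        for a in rules}

try:
    print("run once, in order:",
          list(TopologicalSorter(deps).static_order()))
except CycleError as err:
    print("recursive rule group, iterate to fixed point:", err.args[1])
```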
-
Absolutely. SHACL is already highly parallelizable, and there is existing work on parallel datalog. If you limit the expressivity for "efficiency reasons", you are removing or restricting features; at that point, I want to see an explanation of how the user is going to achieve their task.
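To illustrate the parallelism point: within a single pass, every rule only reads the current state of the graph, so rule bodies can be evaluated concurrently and their outputs merged afterwards. A sketch, assuming the underlying store tolerates concurrent reads (with CPython threads the gain is limited unless the store releases the GIL):

```python
# Sketch: evaluate all rule bodies against the same snapshot of the
# graph in parallel, then merge their outputs sequentially (graph
# writes are not assumed to be thread-safe).
from concurrent.futures import ThreadPoolExecutor
from rdflib import Graph

def parallel_pass(g: Graph, rules: list) -> int:
    """Run one forward-chaining pass; return the number of new triples."""
    with ThreadPoolExecutor() as pool:
        batches = list(pool.map(lambda rule: list(g.query(rule)), rules))
    before = len(g)
    for batch in batches:
        for triple in batch:
            g.add(triple)
    return len(g) - before
```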
-
What is happening in one rule depends on the output of another rule (which may even depend on the first rule). It's about completeness of deductions rather than scalability. There are techniques to refine the iteration (e.g. grouping rules by dependencies); that requires being able to look into the rule, which is part of why the SHACL-AF…
The goal is that each rule can be translated to SPARQL; a rule may sometimes need to be run more than once. I'm happy to do an AMA session at a WG meeting, or outside one, if it helps.
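To make the "each rule can be translated to SPARQL" point concrete, here is an illustrative sh:TripleRule and the CONSTRUCT query it corresponds to, written as string constants (the ex: vocabulary is made up for the example, and prefix declarations are omitted):

```python
# An illustrative SHACL-AF triple rule: for every ex:Connection,
# assert a direct ex:connectedTo link between its two endpoints.
TRIPLE_RULE = """
ex:ConnectionShape a sh:NodeShape ;
    sh:targetClass ex:Connection ;
    sh:rule [
        a sh:TripleRule ;
        sh:subject   [ sh:path ex:from ] ;
        sh:predicate ex:connectedTo ;
        sh:object    [ sh:path ex:to ] ;
    ] .
"""

# The same rule as the SPARQL CONSTRUCT it translates to:
CONSTRUCT_RULE = """
CONSTRUCT { ?from ex:connectedTo ?to }
WHERE {
    ?c a ex:Connection ;
       ex:from ?from ;
       ex:to   ?to .
}
"""
```

If ex:Connection instances (or their endpoints) can themselves be inferred by other rules, this query may need to run more than once before its output stabilizes, which is the "run more than once" caveat above.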
-
@afs, I was a little surprised to see this. I have been trying to use native SHACL rules rather than SPARQL whenever it looks straightforward, because I assumed they would be more efficient. But I'm much more comfortable writing SPARQL. Should I not bother writing rules in native SHACL?
-
The following "provoked" the posting of this issue:
It's not just that an a-priori-unknown number of iterations is needed.
Another problem is that the "productivity" of these iterations decreases (a diminishing-returns problem): each iteration re-runs all the rules, so it performs the same number of checks (or even slightly more, because of the newly inferred data) while yielding fewer new triples. The instrumentation sketch below makes this visible.
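The effect is easy to observe by instrumenting the naive loop sketched earlier in the thread (again treating rules as SPARQL CONSTRUCT strings):

```python
# Sketch: log the "productivity" of each pass. The per-pass cost stays
# roughly constant (every rule runs every time, over a growing graph)
# while the yield of new triples typically falls off quickly.
from rdflib import Graph

def fixpoint_with_stats(g: Graph, rules: list) -> None:
    iteration = 0
    while True:
        iteration += 1
        before = len(g)
        for rule in rules:            # all rules, every pass
            for triple in g.query(rule):
                g.add(triple)
        gained = len(g) - before
        print(f"pass {iteration}: +{gained} triples")
        if gained == 0:
            break
```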
Let's gather some info on how repositories optimize forward chaining:
Which SHACL rule features are easier to implement?
Some excuses on my part: