Reasoning using the semi-naive algorithm #295
Conversation
Nice results @jeswr!
Sure! The key point is to pre-compute as much information as possible from the rules before going to the data - in particular, determining which set of indexes to use for each premise, based on which terms in the rules are variables and which variables will already have been substituted in by the time we look at a particular term in a premise. To improve over #294, I now pre-compute the rule dependencies, so whenever a new implicit fact is found we can substitute the new fact into 'dependent rules' to create restricted rules that we then need to check. I'll clean this up so this is easier to understand - but this may not be until later in the month, depending on how I track with other commitments.
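To make that concrete, here is a minimal, self-contained sketch of a semi-naive loop over plain JavaScript objects. It is only an illustration of the strategy described above, not the code in this PR: the helper names (`unify`, `evaluateRule`, `reason`) and the fact/rule shapes are invented for the example, and the actual implementation works against the store's internal indexes rather than arrays.

```js
// Sketch only: plain-object facts { s, p, o }, variables prefixed with '?'.
function unify(pattern, fact, bindings = {}) {
  const result = { ...bindings };
  for (const pos of ['s', 'p', 'o']) {
    let term = pattern[pos];
    if (term.startsWith('?') && result[term] !== undefined) term = result[term];
    if (term.startsWith('?')) result[term] = fact[pos];
    else if (term !== fact[pos]) return null;
  }
  return result;
}

function applyBindings(pattern, bindings) {
  const resolve = term => (term.startsWith('?') ? bindings[term] ?? term : term);
  return { s: resolve(pattern.s), p: resolve(pattern.p), o: resolve(pattern.o) };
}

// Naive evaluation of one (possibly restricted) rule against all facts.
function evaluateRule(rule, facts, bindings = {}, i = 0) {
  if (i === rule.premises.length) return [applyBindings(rule.conclusion, bindings)];
  return facts.flatMap(fact => {
    const next = unify(rule.premises[i], fact, bindings);
    return next ? evaluateRule(rule, facts, next, i + 1) : [];
  });
}

function reason(facts, rules) {
  const seen = new Set(facts.map(fact => JSON.stringify(fact)));
  // Seed pass: evaluate every rule once over the initial data.
  const queue = rules.flatMap(rule => evaluateRule(rule, facts));

  while (queue.length > 0) {
    const fact = queue.pop();
    const key = JSON.stringify(fact);
    if (seen.has(key)) continue;                  // already known
    seen.add(key);
    facts.push(fact);

    // Semi-naive step: substitute the new fact into each premise it can
    // satisfy, and only evaluate the resulting restricted rules.
    for (const rule of rules) {
      rule.premises.forEach((premise, index) => {
        const bindings = unify(premise, fact);
        if (bindings === null) return;
        const restricted = {
          premises: rule.premises.filter((_, j) => j !== index),
          conclusion: rule.conclusion,
        };
        queue.push(...evaluateRule(restricted, facts, bindings));
      });
    }
  }
  return facts;
}

// Example: subclass transitivity, as in the deep taxonomy benchmark.
const rules = [{
  premises: [
    { s: '?x', p: 'rdfs:subClassOf', o: '?y' },
    { s: '?y', p: 'rdfs:subClassOf', o: '?z' },
  ],
  conclusion: { s: '?x', p: 'rdfs:subClassOf', o: '?z' },
}];
const facts = [
  { s: ':A', p: 'rdfs:subClassOf', o: ':B' },
  { s: ':B', p: 'rdfs:subClassOf', o: ':C' },
];
console.log(reason(facts, rules).length); // 3: the two inputs plus :A rdfs:subClassOf :C
```

The important property is in the while loop: once the seed pass is done, each newly derived fact only triggers the rules it can actually feed, with that premise already bound.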
This is really exciting @jeswr, awesome work! So is this implementation N3-specific? Or could it also be generalized to run with the same level of performance over other RDF/JS Sources?
This is N3 specific as it leverages facts about the internal indexes, and also benefits from not having to convert between the internal IDs and RDFJS term representations. I expect that the reasoning components I am implementing over in Comunica will be at least 2-3x slower (even after merging all of the optimizations that I have been working on here and over at AsyncIterator) as a result of not having access to these internals. In terms of generalising to other sources @rubensworks, I have an unpublished forward chaining algorithm for Comunica that uses similar ideas to what I have done here.
Right, that makes sense. At some point in time, I would love us to get Comunica to work directly on these internal indexes (comunica/comunica#873), as this would make these kinds of use cases a lot faster.
Indeed - that was the motivation for #289 as well.
Oh, interesting, I missed that issue.
Thanks @jeswr for keeping the deep taxonomy ball rolling 😉 For the moment, however, I can't reproduce your performance results.
Just to make sure that we are on the same page, what I have is
and running on
Weird @josd ... I have cloned this to a new location and managed to reproduce the results I initially posted. You seem to be working from the right commit. A couple of potential things could be causing what you are seeing. For completeness, my OS specs are
Also, if you could link to the 1_000_000 deep taxonomy once you have this working, that would be great @josd.
I started fresh and made a pull request jeswr#10
src/N3Store.js (review comment on an outdated diff):
    // return;
    let newRules = [];
Thanks for checking @josd - this is where the memory error would be occurring. I can probably tweak the logic here so that, rather than iterating over all the rules and then buffering newRules, I do this one rule at a time.
As noted earlier, I can't actually download the 1_000_000 benchmark to test this - where do I find it? Thanks for the PR with the 1_000_000 benchmark.
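If it helps, the difference between the two strategies could look roughly like this. `evaluateRule` is a hypothetical helper here (`addQuad` is the normal store method), so this is a sketch of the refactor rather than the PR's actual code:

```js
// Buffered variant: conclusions from every rule pile up in one array
// before anything is added to the store.
function applyRulesBuffered(store, rules, evaluateRule) {
  const newConclusions = [];
  for (const rule of rules)
    newConclusions.push(...evaluateRule(store, rule));
  for (const quad of newConclusions)
    store.addQuad(quad);
}

// One-rule-at-a-time variant: the buffer never holds more than one rule's
// conclusions, which keeps peak memory much lower on large runs.
function applyRulesOneAtATime(store, rules, evaluateRule) {
  for (const rule of rules)
    for (const quad of evaluateRule(store, rule))
      store.addQuad(quad);
}
```

One side effect of the second variant is that later rules in the same pass already see the quads added by earlier ones, which is harmless when forward chaining to a fixpoint.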
As a future note for myself, I can also reduce the amount that needs to be stored in the buffer by storing variable substitutions and a pointer to a rule, rather than all the generated conclusions.
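Roughly, and with hypothetical shapes (a rule is assumed to carry a `conclusions` array of patterns; this is not the PR's actual data structure), the buffer entries would shrink to a pointer plus bindings:

```js
// One small entry per rule match, regardless of how many conclusion
// patterns the matched rule has.
const buffer = [];

function remember(rule, bindings) {
  buffer.push({ rule, bindings });
}

// Expand an entry into concrete triples only when it is consumed.
function* expandEntry({ rule, bindings }) {
  const resolve = term => (term.startsWith('?') ? bindings[term] ?? term : term);
  for (const pattern of rule.conclusions) {
    yield { s: resolve(pattern.s), p: resolve(pattern.p), o: resolve(pattern.o) };
  }
}
```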
It is now at https://sourceforge.net/p/eulersharp/code/HEAD/tree/trunk/2009/12dtb/test-dl-1000000.n3
and also in the pull request jeswr#10
Interestingly, it turns out that the problem is not in the length of this buffer, which in the tests that I have run appears not to exceed a size of 3 on this benchmark.
In addition, this fault occurs when there are 6-7 million triples in the store, but it can definitely handle a lot more: 16.7M triples are loaded in the default tests. I'm very bemused as to where the problem is arising...
testing test-dl-1000000
I've just run some benchmarks (running https://github.com/comunica/comunica-feature-reasoning/tree/chore/benchmark-forward-chaining/packages/actor-rdf-reason-forward-chaining/perf) in Comunica with some of the AsyncIterator and N3.js optimisations included, and it is 2-3 orders of magnitude slower than the results we are seeing here.
Do you know what the cause is? Inefficient code? Or just a different algorithm?
@jeswr amazing work. I understand that the source-agnostic version would benefit from some form of shared dictionary encoding as per recent conversations? Also, would you be able to profile one of those test runs, both for the Comunica version and the RDF/JS version? I think we might be able to learn a lot from that.
Is something like this what you were thinking @jacoscaz? (Also be aware that the time scale is logarithmic.)
@jeswr I can't seem to zoom in on that image as it's quite small in resolution, but that looks like a comparison of different reasoning solutions. What I'm thinking about is profiling CPU and memory usage across a run, using either a sampling profiler (https://nodejs.org/en/docs/guides/simple-profiling/) or something more advanced, depending on how comfortable you are around profilers and whether you have used one before. Some IDEs (I tend to use WebStorm) can automatically hook up to V8's profiler if you configure them to do so in their launch configurations.
@RubenVerborgh - I think this is getting close to being ready for review, and whilst I think it can be further cleaned up in the future, I don't know if it is worth putting in the effort to do so until the other future optimizations that I have planned are in place. Before I rebase and mark it as ready for review, there is one remaining hurdle: getting full test coverage on this line. The reason I have included it is that I was following a similar pattern to here. Could you please explain why this condition is required in the latter case, since it appears that graphs are removed if they are empty anyway? I quickly tried running git blame, but the most recent change to that line was a variable renaming.
Amazing stuff. That's what we need, and what I've wanted to have since the day I started this. May I suggest implementing this in a separate class?
It was probably just a way to get around an overzealous linter not liking
I would also very much think so, e.g. adding to the store (like now, which is like forward chaining), or just goal-oriented querying (backward chaining), or a "reasoned" combination of those.
Perhaps this is overcomplicating it - but if I go down that track, is it worth trying to implement it as an
I've pushed the RDFJS implementation-agnostic version to https://github.com/jesse-wright-reasoning-honours/rdfjs-reasoner/blob/master/reasoner.ts. This is tied to neither n3 nor Comunica. Happy to clean up and move this repo to the
It might be worth getting some input on the API of it first, though. In particular, should it be a class with methods like
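For the sake of discussion, the two shapes I can picture would be roughly the following; all names here are placeholders, not what reasoner.ts currently exposes:

```js
// Option A: a class constructed with the rules, applied to any RDF/JS store.
class Reasoner {
  constructor(rules) {
    this.rules = rules;
  }
  reason(store) {
    // forward-chain this.rules into `store`
  }
}

// Option B: a single standalone function with no instance state.
function reason(store, rules) {
  // same work, just without a class
}

// Callers would then write either:
//   new Reasoner(rules).reason(store);
// or:
//   reason(store, rules);
```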
Sweet, how's perf? Should we still do #296?
@RubenVerborgh It ranges between 2x and 4x slower in the RDFJS version.
Unfortunately, there isn't much logic that can be shared either, due to the 'pre-compilation' of index lookups that I'm doing in #296.
I think having a RDF/JS reasoner that doesn't depend on N3.js would be worth the 2x - 4x loss, given the absolute numbers I'm seeing. Don't get me wrong, N3.js is awesome, but enriching the RDF/JS ecosystem with such a wonderful component would be amazing.
Same; I'd strategically suggest writing it on N3.js now. The burden of proof is on us to show that it can all work fast in the browser. Once we have that, we can generalize. Will follow up on the reasoning PR.
I would also be in favor of focusing on the N3-specific reasoner for now. I guess once we have an idea of how to tackle #289 (and standardize this at RDF/JS-level), we can port this to an RDF/JS-level reasoner, without any loss in performance.
Agreed - in further support of this, I am now getting the logic for reasoning in Comunica closer to that of the generic RDFJS algorithm. The primary bottleneck came down to the unnecessary creation of many iterators in the process, which can be resolved by re-organising some code and creating some custom iterators. For reference, these are the variants of iterator usage I have been testing - though I doubt they are of much use to anyone else at the present moment. So an RDFJS implementation separate from Comunica may become somewhat redundant.
@jeswr I really love https://github.com/jeswr/N3.js/blob/feat/N3Reasoner/src/N3Reasoner.js and maybe even more your
The performance is one of the best I have ever seen:
Thank you so much and Keep Up The Good Work!
Also a very nice performance on a smartphone (https://twitter.com/josderoo/status/1525209755471462402), albeit that there was not enough memory to run the deep taxonomy of depth 1000000 (subsumption with 3 million classes).
Also have a look at the extended deep taxonomy test, which takes the depth as the number of elements in the classes. It is closer to reality, e.g. an ontology with 300000 classes (like SNOMED-CT) and 100000 individuals belonging to those classes (like patients having observations, diseases, treatments, ...).
The current results are:
So from depth 10000 onwards it doesn't work anymore. For EYE and depth 100000 we currently get:
Is it correct to assume that EYE doesn't suffer from the scaling problem that the N3Reasoner does because it is performing backwards chaining in this case? I definitely think backwards chaining should be added in here at some point, meaning that backwards-chaining reasoning can be done on calls like
This is likely because the array for buffering elements is getting too long by this point. One strategy that I think should be implemented to overcome this is to instead have a single s, p, o index in each
Very much indeed!
(last quote from https://josd.github.io/quote.html and thanks to Werner Heisenberg)
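As a side note on the backward-chaining idea raised above, here is a small, self-contained sketch of what goal-directed evaluation of a single pattern could look like: a goal is answered by matching facts directly and by recursing into rules whose conclusion unifies with it, instead of materialising all consequences up front. The names, shapes, and crude depth guard are all illustrative; this is not the N3Reasoner API.

```js
const isVar = term => term.startsWith('?');
const walk = (term, bindings) => {
  while (isVar(term) && term in bindings) term = bindings[term];
  return term;
};

// Unify two triple patterns; either side may contain variables.
function unify(a, b, bindings = {}) {
  for (const pos of ['s', 'p', 'o']) {
    const x = walk(a[pos], bindings);
    const y = walk(b[pos], bindings);
    if (x === y) continue;
    if (isVar(x)) bindings = { ...bindings, [x]: y };
    else if (isVar(y)) bindings = { ...bindings, [y]: x };
    else return null;
  }
  return bindings;
}

let renameId = 0;

// Yield every set of bindings under which `goal` holds.
function* prove(goal, facts, rules, bindings = {}, depth = 8) {
  if (depth === 0) return;                        // crude recursion guard for the sketch
  for (const fact of facts) {
    const found = unify(goal, fact, bindings);
    if (found) yield found;
  }
  for (const rule of rules) {
    // Rename the rule's variables so separate applications never clash.
    const suffix = ++renameId;
    const renameTerm = term => (isVar(term) ? `${term}_${suffix}` : term);
    const renameTriple = ({ s, p, o }) => ({ s: renameTerm(s), p: renameTerm(p), o: renameTerm(o) });
    const matched = unify(renameTriple(rule.conclusion), goal, bindings);
    if (!matched) continue;
    yield* proveAll(rule.premises.map(renameTriple), facts, rules, matched, depth - 1);
  }
}

function* proveAll(goals, facts, rules, bindings, depth) {
  if (goals.length === 0) { yield bindings; return; }
  const [first, ...rest] = goals;
  for (const found of prove(first, facts, rules, bindings, depth))
    yield* proveAll(rest, facts, rules, found, depth);
}

// Example: is :A a subclass of :C, given only direct links plus transitivity?
const rules = [{
  premises: [
    { s: '?x', p: 'rdfs:subClassOf', o: '?y' },
    { s: '?y', p: 'rdfs:subClassOf', o: '?z' },
  ],
  conclusion: { s: '?x', p: 'rdfs:subClassOf', o: '?z' },
}];
const facts = [
  { s: ':A', p: 'rdfs:subClassOf', o: ':B' },
  { s: ':B', p: 'rdfs:subClassOf', o: ':C' },
];
const goal = { s: ':A', p: 'rdfs:subClassOf', o: ':C' };
console.log(!prove(goal, facts, rules).next().done); // true
```

A real implementation would need tabling or cycle detection rather than a fixed depth limit, but the shape of the recursion is the interesting part here.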
Great work!
Note - the code is nasty at the moment and needs some love and refactoring, but it does work!
@josd @pbonte - alternative to #293 now implementing a better forward chaining algorithm.
Somewhat surprisingly, it now takes longer to do RDFS rules over the union of timbl's profile card and the foaf ontology (which is not deep at all; you only need one round of reasoning).
It is far more competitive on the Deep Taxonomy Benchmark now @josd - this is the output of running the Reasoning-perf.js file in this branch. I have not benchmarked test-dl-1000000.n3 as I could not download this from SourceForge.