
Explore and resolve simple loop editor's notes #23

Closed
dlongley opened this issue Nov 2, 2022 · 2 comments · Fixed by #40
Comments

dlongley (Contributor) commented Nov 2, 2022

There are two related editor's notes in the URDNA2015 spec about looping over the first step of the algorithm: one in the "4.4 Canonicalization Algorithm" section and one in the "4.6 Hash First Degree Quads" section.

These notes concern a mistake made during the initial implementation and rollout of the URDNA2015 spec many years ago. The first step of the algorithm, which leverages "Hash First Degree Quads", was meant to be looped over multiple times, assigning canonical labels for as long as "simple" differences could be found. Once no more "simple" differences could be determined, that loop would end and the second step of the algorithm would run.

However, the "Hash First Degree Quads" function in the spec did not include text to use these canonical bnode labels, as they were assigned, when generating hashes. Implementations did use them at some point, but in an effort to conform to the spec (or in some other refactoring probably lost to history) removed them -- so the algorithm never considers canonical bnode labels. This means the hash output never changes as those labels are assigned, and therefore the "simple" loop in section 4.4 only ever does useful work on its first pass. The algorithm text could remove the loop without affecting the algorithm's outputs.
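To make that concrete, here is a minimal sketch in TypeScript (the implementation language mentioned later in this thread) of how the section 4.4 "simple" loop plays out. All names here are hypothetical illustrations, not the spec's or any implementation's API; the key point is that `hashFirstDegreeQuads` ignores already-issued canonical labels, so every later pass recomputes identical hashes.

```typescript
type BlankNodeId = string;

interface CanonState {
  // Blank nodes that have not yet received canonical labels.
  nonNormalized: Set<BlankNodeId>;
  // Assigns the next canonical label (e.g. "c14n0", "c14n1", ...).
  issueCanonicalId(id: BlankNodeId): void;
  // Hashes the quads mentioning `id`. Per the issue above, this hash
  // never consults issued canonical labels.
  hashFirstDegreeQuads(id: BlankNodeId): string;
}

function simpleLoop(state: CanonState): void {
  let simple = true;
  while (simple) {
    simple = false;
    // Group the remaining blank nodes by their first-degree hash.
    const hashToBnodes = new Map<string, BlankNodeId[]>();
    for (const id of state.nonNormalized) {
      const hash = state.hashFirstDegreeQuads(id);
      const ids = hashToBnodes.get(hash) ?? [];
      ids.push(id);
      hashToBnodes.set(hash, ids);
    }
    // A hash held by exactly one blank node yields a canonical label
    // (processed in hash order, as in the spec).
    const sorted = [...hashToBnodes.entries()].sort(([a], [b]) =>
      a.localeCompare(b)
    );
    for (const [, ids] of sorted) {
      if (ids.length === 1) {
        state.issueCanonicalId(ids[0]);
        state.nonNormalized.delete(ids[0]);
        // Loop again -- but because the hashes never change, the next
        // pass cannot produce any new singletons and is a guaranteed no-op.
        simple = true;
      }
    }
  }
}
```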

The end result is that some differences between datasets that are simple enough to detect without moving into the second step of the algorithm (which recursively and comprehensively considers all paths and relationships) aren't caught early. They are, of course, still handled by the second step of the overall algorithm, but originally they were meant to be handled by looping over the simpler step.

The URDNA2015 spec text today is still "correct" (in that it matches the test suite and implementations in the wild); it could just be simplified, because the "simple" loop isn't actually used. Alternatively, the missing optimization could be added to produce a different algorithm (with different outputs), but that would come at some cost to existing implementations and signed documents out in the wild. It's not clear how costly it would be -- there may not be much data that ever needed a second run through the "simple" loop, just as there is very little data expected to need the second step at all.

iherman (Member) commented Nov 22, 2022

(I am working on an implementation in TypeScript… I have just finished coding, but not yet testing, the "simple" case.)

I must admit I got hung up on this loop when coding based on the spec. Indeed, it is correct (the while loop will run twice, and the second run will essentially be a no-op) but, by itself, it looks (and is) unnecessary. The algorithm is already complex to follow, and we should do everything possible to make it simpler.

How often would that (missing) optimization step actually be fruitful? And how fruitful? Would it significantly improve the efficiency of the algorithm? Would the price (in terms of making the algorithm more complex to understand and/or implement) be worth it?

My gut feeling is that it is not worth it. In other words, I would be strongly in favour of removing that cycle altogether.
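For illustration, removing that cycle would collapse the loop into a single pass, as in this sketch (reusing the hypothetical `CanonState` interface from the sketch above; again, not any implementation's actual API):

```typescript
function simpleStepOnce(state: CanonState): void {
  // Group blank nodes by first-degree hash, exactly once.
  const hashToBnodes = new Map<string, BlankNodeId[]>();
  for (const id of state.nonNormalized) {
    const hash = state.hashFirstDegreeQuads(id);
    const ids = hashToBnodes.get(hash) ?? [];
    ids.push(id);
    hashToBnodes.set(hash, ids);
  }
  // Label every blank node whose hash is unique, in hash order.
  const sorted = [...hashToBnodes.entries()].sort(([a], [b]) =>
    a.localeCompare(b)
  );
  for (const [, ids] of sorted) {
    if (ids.length === 1) {
      state.issueCanonicalId(ids[0]);
      state.nonNormalized.delete(ids[0]);
    }
  }
  // Whatever remains in state.nonNormalized shares a first-degree hash
  // with another blank node and falls through to Hash N-Degree Quads.
}
```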

dlongley (Contributor, Author) commented

@iherman,

> How often would that (missing) optimization step actually be fruitful? And how fruitful? Would it significantly improve the efficiency of the algorithm? Would the price (in terms of making the algorithm more complex to understand and/or implement) be worth it?

> My gut feeling is that it is not worth it. In other words, I would be strongly in favour of removing that cycle altogether.

My gut feeling is the same (and this is also why a big effort wasn't made to correct the problem in existing implementations). My thought is that the (missing) optimization would only have been helpful in cases that the N-Degree Quads part of the algorithm could solve fairly quickly (in a single iteration) anyway. I would think these cases are fairly rare on their own, but they may actually constitute "useful data" as opposed to "poison graphs". This signals to me that it's probably not worth the complexity.
