Explore and resolve `simple` loop editor's notes #23

dlongley · 2022-11-02T17:16:45Z

There are two related editor's notes in the URDN2015 spec about looping over the first step of the algorithm. The first note is in the "4.4 Canonicalization Algorithm" section and the second in the "4.6 Hash First Degree Quads" section.

These notes are about a mistake that was made during the initial implementation and roll out of the URDNA2015 spec many years ago. The first step of the algorithm that leverages the "Hash First Degree Quads" was meant to be looped over multiple times while canonical labels could be assigned based on "simple" differences. Once no more "simple" differences could be determined, that loop would end and the second step of the algorithm would be run.

However, the "Hash First Degree Quads" function in the spec did not include text to use these canonical bnode labels as they were assigned, when generating hashes. Implementations did use them at some point, but in an effort to conform to the spec (or some other refactoring probably lost to history) removed them -- and the algorithm instead never considers canonical bnode labels. This means that the hash output never changes on account of them as they are assigned. Therefore, the "simple" loop in section 4.4 will only ever run once. Therefore, the algorithm text could remove the loop without affecting the algorithm outputs.

The end result of this is that some differences in datasets that are simple enough to detect without moving into the second step of the algorithm (which recursively and comprehensively considers all paths / relationships) aren't caught early. They, of course, are still handled by the second step of the overall algorithm, but originally they were meant to be handled by looping over the simpler step.

The URDNA2015 spec text today is still "correct" (in that it matches the test suite and implementations in the wild), it could just be simplified because the "simple loop" isn't actually used. Alternatively, this missing optimization could be added to produce an algorithm (with different outputs), but that would come at some cost to existing implementations and signed documents out in the wild. It's not clear how costly it would be -- there may not be that much data that ever needed a second run through the "simple loop", just like there is very little data that is expected to need the second step at all.

iherman · 2022-11-22T11:23:05Z

(I am working on an implementation in TypeScript… I have just finished coding, but not yet testing, the "simple" case.)

I must admit I got hooked up on this loop when coding in based on the spec. Indeed, it is correct (the while loop will run twice, and it will essentially be a no-op in the second run) but, by itself, it looks (and it is) unnecessary. The algorithm is already complex to follow and we should make everything possible to make it look simpler.

How frequent is when that (missing) optimization step is indeed fruitful? How much fruitful? Would it significantly improve the efficiency of the algorithms? Would the price (in terms of making the algorithm more complex to understand and/or implement) worth it?

My gut feeling is that it is not worth it. In other words, I would strongly be in favour of removing that cycle altogether.

dlongley · 2022-11-22T15:47:56Z

@iherman,

How frequent is when that (missing) optimization step is indeed fruitful? How much fruitful? Would it significantly improve the efficiency of the algorithms? Would the price (in terms of making the algorithm more complex to understand and/or implement) worth it?

My gut feeling is that it is not worth it. In other words, I would strongly be in favour of removing that cycle altogether.

My gut feeling is the same (and this is also why a big effort wasn't made to correct the problem out there in existing implementations). My thought is that the (missing) optimization code seems like it would only have been helpful in cases the N-Degree quads part of the algorithm could solve fairly quickly (a single iteration) anyway. I would think these cases would be fairly rare on their own as well, but may actually constitute "useful data" as opposed to "poison graphs". This signals to me that it's probably not worth the complexity.

gkellogg added a commit that referenced this issue Nov 2, 2022

Replace ednotes on the use of "simple" with issue markers to #23.

512b714

gkellogg mentioned this issue Nov 2, 2022

Replace ednotes on the use of "simple" with issue markers #24

Merged

gkellogg added a commit that referenced this issue Nov 4, 2022

Replace ednotes on the use of "simple" with issue markers to #23.

393048a

gkellogg added the spec:substantive label Nov 22, 2022

iherman mentioned this issue Nov 23, 2022

Removed the simple loop - issue 23 #40

Merged

gkellogg closed this as completed in #40 Nov 24, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Explore and resolve `simple` loop editor's notes #23

Explore and resolve `simple` loop editor's notes #23

dlongley commented Nov 2, 2022

iherman commented Nov 22, 2022

dlongley commented Nov 22, 2022

Explore and resolve simple loop editor's notes #23

Explore and resolve simple loop editor's notes #23

Comments

dlongley commented Nov 2, 2022

iherman commented Nov 22, 2022

dlongley commented Nov 22, 2022

Explore and resolve `simple` loop editor's notes #23

Explore and resolve `simple` loop editor's notes #23