forked from sctweedie/csvdiff3
-
Notifications
You must be signed in to change notification settings - Fork 0
/
notes.txt
202 lines (152 loc) · 7.47 KB
/
notes.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
Notes on line ordering
======================
The 3-way merge needs to handle lines being reordered in either the A
or B side of the merge or both, wrt. the LCA (Latest Common Ancestor).
We do not do chunk analysis to perform any sort of global matching of
lines; we don't need perfection here, the primary keys in a CSV file
ensure we still have valid output as long as the lines themselves
still exist.
But we do try to handle line ordering decisions locally as best as
possible.
"Local" here means "given the information at, or near, the current
cursors walking through each input file", and performing actions one
line at a time.
We indicate the "cursor" in examples with []: eg.
Line: P Q [R] S T U V W X Y Z
means that we have 11 lines to consider (not including the header),
those lines having primary keys "P" through "Z" respectively, and we
have processed "P" and "Q"; the line being actively considered (at the
cursor) is "R".
Lines before the cursor may be in the "backlog" (stashed for future
processing) if we know that they are still needed later on (to match
against a line in another file that has not yet been processed); lines
after the cursor may already have been "consumed" (processed and
merged by lookahead due to an earlier match in a different intput
file).
So to know which lines are still candidates for processing, we mark
processed lines as lowercase keys; eg. in
Line: p Q [R] S T U V W X y z
"P" has been processed completely but "Q" is in the backlog; "Y" and
"Z" are ahead of the cursor but are already consumed.
Moved lines
===========
Consider a useful example:
state *EXAMPLE1*:
LCA: [P] Q R S T U V W X Y Z
A: [Q] R S T U V W X Y Z P
B: [R] S T U V W X Y Z P Q
In A, "P" has been moved to the end of the file; in B, both "P" and
"Q" have moved.
How do we handle the state shown, when the cursor is at keys "P", "Q"
and "R" respectively in LCA/A/B? The correct state should be as in
file B: both files have moved key "P", so that key should be moved in
the output; B has the additional change of moving "Q", so that change
should be merged on top of A.
Just looking at LCA/A/B above, it is not clear which of they keys "Q"
or "R" should be emitted next. We *can* emit either; we will find a
match in the other files in any case, by lookahead. But "R" matches
against a nearby key in A, whereas "Q" matches against a distant key
in B.
But there is a simpler way to handle this. If we move line "P" in LCA
to the backlog, we get the state:
state *BACKLOG1*:
LCA: p [Q] R S T U V W X Y Z
A: [Q] R S T U V W X Y Z P
B: [R] S T U V W X Y Z P Q
and we can now see that LCA and A are back in sync; LCA and A have the
same next content, but B represents a different change, so we should
merge the change in B, which is that the next lines in LCA/A has been
moved and so should be put into the backlog for now:
state *MOVED1*:
LCA: p q [R] S T U V W X Y Z
A: q [R]S T U V W X Y Z P
B: [R] S T U V W X Y Z P Q
And now all three files are completely in sync, and will remain so for
the rest of the file, until we reach the final "P"/"Q" lines that were
moved between LCA and A/B. Being in sync allows us to detect other
line reorderings accurately, so we want to be in sync as much as
possible.
Resyncing
=========
"Resyncing" is the process of attempting to keep the cursors in all 3
files at corresponding lines. This has 2 major benefits:
* It is much faster to process lines when we can match keys in all
files at once; we simply need to do a one-line merge and move onto
the next line, rather than searching for keys in the backlog /
lookahead; and
* If the next lines in A and B have different keys, We can more
accurately tell why this is if LCA is also in sync, and so make
correct decisions about whether the discrepancy is due to a change
in A or B.
Let's go back to the original state of that example:
state *EXAMPLE1*:
LCA: [P] Q R S T U V W X Y Z
A: [Q] R S T U V W X Y Z P
B: [R] S T U V W X Y Z P Q
The desired action here is to move "P" in LCA to the backlog. How do
we make that determination?
We can look at the distances between different key matches.
The keys at the head of both A and B ("Q" and "R" respectively) are
both found soon in the LCA, at distances of 1 and 2.
The key at the head of LCA ("P") is found in both A and B, but at much
longer distances: 10 lines away (for A) or 9 lines (for B).
Remember that the order of lines in the output is determined by A and
B, not LCA (LCA is used only to merge differences between A and B by
determining which file represents a change that we need to preserve.)
So we are going to emit lines nearer the start of A/B sooner than
lines near the start of LCA.
So "P" is going to be useful only, at best, in 9 lines' time (distance
to occurrence in B). But "Q" and "R" may be useful in 0 or 1 lines'
time.
This tells us that P is better off moving to the backlog for later
consumption. By pushing it out of the way and advancing the LCA
cursor to "Q", we give the merge a chance to get LCA's cursor back in
sync with A and B.
We call the distance function here "distance-to-relevance", or
"relevance" for short; it is the distance to the next line where a
given key may matter for ordering decisions. It does *not* imply that
the line will actually be emitted at that distance (eg. in *BACKLOG1*
above, LCA and A were resynced on "Q", but we then used that
information to tell that B had moved Q, so Q ended up in the backlog,
only to be emmitted much later on. Q was relevant for this decision
but was not output at that time.
Counterexample:
Compare this with the example:
state *EXAMPLE2*:
LCA: [P] Q R S T U V W X Y Z
A: [Z] P Q R S T U V W X Y
B: [Y] Z P Q R S T U V W X
in which lines are moving from the tail of LCA to the head of A/B.
In this case, the LCA line has high relevance (it is useful for making
ordering decisions against A[1] and B[2]), whereas the lines in A/B
("Z" and "Y") have low relevance *for the purpose of describing
original state*; in this case, we should preserve the LCA cursor where
it is; pushing everything in the LCA from "P" through "X" onto the
backlog will achieve nothing.
3-way ordering conflicts
========================
"Resyncing" above is only concerned with advancing LCA to maximise the
ability to keep lines in sync with A and B.
But *EXAMPLE2* above raises another question, about syncing between A
and B when we have a 3-way ordering conflict. Look again at the
example in which A moves one line from the tail of the file to the
head, and B moves two lines:
state *EXAMPLE2*:
LCA: [P] Q R S T U V W X Y Z
A: [Z] P Q R S T U V W X Y
B: [Y] Z P Q R S T U V W X
In this situation we cannot usefully advance LCA, so we need to decide
which key out of A and B to emit next. Yet we have a 3-way line-order
conflict, so neither A nor B can be identified as the changed line
just by comparing with LCA.
In the case above, we should prefer to emit "Y" next; A has moved one
line ("Z"), and B contains that same change plus an additional one
("Y"). So we should keep "Z" as a common change and add th move of
"Y" as an additional change introduced by B.
How can we tell that B should be preferred in the example above?
We use the same "relevance" distance that we used in determining
whether to resync the LCA above. In EXAMPLE2, A is very relevant for
matching "soon" (it has a match in B nearby); B is much less relevant
(it has a distant match). So we want to *keep* the cursor at "Z" in
A, to facilitate a match against B in the near future; matching on B
and emitting "Y" will maximise our chance of resyncing on "Z" soon.