dataflow: Buffer full replays in unions by tag
If multiple diamond unions exist in a chain, each of those unions will receive a number of replays equal to the number of replay paths (the product of the number of path splits). Previously, each union would buffer replays until it had received a number of replays equal to its number of parents, but in the case of successive diamonds that would break if the replays were received out of order: e.g. a union of two nodes could receive two replays from its left parent, and release those as a complete replay, before ever receiving a replay from its right parent.

f8c3582 (Added replay path sorting to ensure we traverse them correctly., 2021-10-26) attempted to work around this by sorting the replay paths so that earlier unions would receive replays from all parents (and then successive replays would just overwrite the downstream state with the same set of records). That happened to work in the case that was tested, where all the nodes were in the same domain and replays all happened synchronously, since we fully processed each replay through the nodes before starting another. It doesn't work if things happen more concurrently, for the same reason as before.

Instead of that approach, this commit fixes the issue more sustainably by buffering full replays within unions on a *per tag* basis, rather than overall. This works because the "replay grouping" code in the materialization planner already groups replay paths with identical suffixes under the same tag, so we know that if we've received a number of replays with the same tag equal to the number of parents, then we've received a complete picture of our upstream state and can release the replay downstream. There's also a comment within the code that explains how all of this currently works in another way.

Since this is a better fix for the same issue, this also reverts the changes in f8c3582 (Added replay path sorting to ensure we traverse them correctly., 2021-10-26).
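The per-tag buffering described above can be sketched roughly as follows. This is a minimal illustration, not the code from this commit: `UnionReplayBuffer`, `on_replay`, and the `Tag` and `Record` aliases are hypothetical names, and it assumes that each parent sends exactly one full replay per tag.

```rust
use std::collections::HashMap;

// Hypothetical stand-ins for the real dataflow types (assumptions for this sketch).
type Tag = u32;
type Record = i64;

/// Buffers full replays on a per-tag basis until one replay per parent has
/// arrived for that tag.
struct UnionReplayBuffer {
    num_parents: usize,
    // Replays buffered so far, grouped by the tag they arrived under.
    pending: HashMap<Tag, Vec<Vec<Record>>>,
}

impl UnionReplayBuffer {
    fn new(num_parents: usize) -> Self {
        Self {
            num_parents,
            pending: HashMap::new(),
        }
    }

    /// Buffers `records` under `tag`. Returns the combined records once the
    /// number of replays seen for this tag equals the number of parents, and
    /// `None` while replays for this tag are still outstanding.
    fn on_replay(&mut self, tag: Tag, records: Vec<Record>) -> Option<Vec<Record>> {
        let buffered = self.pending.entry(tag).or_default();
        buffered.push(records);
        if buffered.len() < self.num_parents {
            // Replays under other tags never count towards this tag.
            return None;
        }
        // All parents have replayed under this tag: release it downstream.
        let complete = self.pending.remove(&tag)?;
        Some(complete.into_iter().flatten().collect())
    }
}
```

With a two-parent union, two replays from the left parent under different tags both stay buffered; the release only happens once a second replay with a matching tag arrives. A single overall counter would have released after the second replay regardless of which parent sent it.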
This doesn't fix any test cases as of this commit (getting a failing test here for the nodes-in-different-domains case is somewhat annoying given node assignment heuristics) but paves the way for making full replays stream asynchronously, which was another way to hit this issue. Along the way, this also adds some new trace logging which I used to debug this issue.

Fixes: REA-2989
Change-Id: Iea0adb55d7f7d7139b51319c04ad88c464297e35
Reviewed-on: https://gerrit.readyset.name/c/readyset/+/5391
Tested-by: Buildkite CI
Reviewed-by: Dan Wilbanks <dan@readyset.io>