
[REA-2664] Sequence of ALTER and DROP on a table causes us to get stuck in a resnapshotting loop #106

Closed
nickelization opened this issue Jun 27, 2023 · 6 comments
Labels
bug Something isn't working High priority Created by Linear-GitHub Sync

Comments

@nickelization
Contributor

The vertical DDL test found a failure, with a minimal failing case that looks like this:

```sql
CREATE TABLE dogs (woof INT);
ALTER TABLE cats RENAME COLUMN meow TO purr;
DROP TABLE cats;
INSERT INTO dogs VALUES (1);
```

After this, the results from ReadySet never seem to converge with MySQL, as the INSERT appears to get lost.

Notably, ReadySet repeatedly logs `Aborting replication task on error error=Change in DDL requires partial resnapshot` every ~4 seconds, so it seems like somehow we’re getting stuck in a loop where we keep trying to resnapshot but keep failing for some reason.

This looks superficially somewhat similar to [https://readysettech.atlassian.net/browse/ENG-2655](https://readysettech.atlassian.net/browse/ENG-2655) but it doesn’t seem to be the same issue, as the fallback test created to reproduce that issue continues to pass.

From SyncLinear.com | REA-2664

@nickelization nickelization self-assigned this Jun 27, 2023
@nickelization nickelization added bug Something isn't working High priority Created by Linear-GitHub Sync labels Jun 27, 2023
@nickelization
Contributor Author

@gvsg-rs yes, this is that ticket. I created a fix, but after additional testing I found that there are other similar issues that my fix didn’t cover.

Current status is that I would like to try to find a better fix, but I am not actively working on it.

@nickelization
Contributor Author

@nickelization — Is this the ticket that you wanted to do more analysis for a better solution? In either case, can you update the status of this ticket?

@nickelization
Contributor Author

<~accountid:609236b4b9ac3a007151a40b> yeah, that seems like an okay solution, I guess. I don’t love it because it seems like it increases the odds of the https://readysettech.atlassian.net/browse/ENG-2785 fix failing to do its job, but if that fix isn’t guaranteed to avoid us getting stuck anyway, then updating the replication offset when tables have been deleted doesn’t really make the situation much worse. Certainly having a small chance of getting temporarily stuck in the catch-up phase is better than definitely getting stuck in a snapshot loop forever.

@nickelization
Contributor Author

Maybe “update the schema replication offset if any tables no longer exist” is the better solution here?
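To make the proposal concrete, here is a minimal sketch of what that startup check could look like. All names here (`maybe_advance_schema_offset`, the offset representation, etc.) are hypothetical illustrations, not ReadySet's actual API:

```rust
// Hypothetical sketch of the proposed startup check; names and the u64
// offset representation are illustrative, not ReadySet's real code.
use std::collections::HashSet;

/// On replicator startup, advance the schema replication offset past stale
/// DDL events if any previously snapshotted table no longer exists upstream,
/// so those events are skipped rather than replayed forever.
fn maybe_advance_schema_offset(
    snapshotted: &HashSet<String>,
    upstream: &HashSet<String>,
    schema_offset: &mut u64,
    latest_offset: u64,
) {
    let any_dropped = snapshotted.iter().any(|t| !upstream.contains(t));
    if any_dropped {
        *schema_offset = latest_offset;
    }
}

fn main() {
    // Mirrors the repro: `cats` was snapshotted but has since been dropped.
    let snapshotted: HashSet<String> =
        ["dogs", "cats"].iter().map(|s| s.to_string()).collect();
    let upstream: HashSet<String> =
        ["dogs"].iter().map(|s| s.to_string()).collect();
    let mut offset = 4;
    maybe_advance_schema_offset(&snapshotted, &upstream, &mut offset, 5);
    assert_eq!(offset, 5); // `cats` is gone, so the offset advances
    println!("ok");
}
```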

@nickelization
Contributor Author

This was a tricky one, but I’m pretty sure I’ve figured out what’s going on. Looks like this is a relatively recent regression caused by https://gerrit.readyset.name/c/readyset//4637. (Though, since the vertical DDL tests caught this, the fact that it slipped past us is at least partly my fault for not integrating the vertical DDL tests into CI yet.)

The trouble is that when we drop the cats table before the replicator restarts, we no longer need to resnapshot cats (as we otherwise would have due to the schema change), and we don't trigger the catch-up process, so we skip updating the schema replication offset. (That update used to happen unconditionally on replicator startup, but now only happens conditionally, due to the aforementioned fix for https://readysettech.atlassian.net/browse/ENG-2785.) Then, because the schema replication offset wasn't updated, we don't skip the ALTER TABLE DDL event at the beginning of handle_action, as we would have if we'd incremented the offset.

So then we try to re-apply the same DDL event again, and end up returning Err(ReadySetError::ResnapshotNeeded) from handle_ddl_change, and the whole thing starts over again, and we’re stuck in a resnapshotting loop forever.
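To illustrate the failure mode, here's a toy model of the control flow described above. The struct, method, and offsets are hypothetical stand-ins, not ReadySet's actual replicator code:

```rust
// Toy model of the resnapshot loop; all names and offsets are illustrative.
#[derive(Debug, PartialEq)]
enum Outcome {
    Applied,
    Skipped,
    ResnapshotNeeded,
}

struct Replicator {
    // Offset up to which DDL events have already been applied.
    schema_offset: u64,
}

impl Replicator {
    // DDL events at or below the schema replication offset get skipped;
    // re-applying a DDL event against a table that no longer exists
    // forces a resnapshot.
    fn handle_ddl(&mut self, event_offset: u64, table_exists: bool) -> Outcome {
        if event_offset <= self.schema_offset {
            return Outcome::Skipped;
        }
        if !table_exists {
            // The replicator aborts with ResnapshotNeeded, restarts, and --
            // because the offset was never advanced -- hits this branch
            // again on the next pass: the loop.
            return Outcome::ResnapshotNeeded;
        }
        self.schema_offset = event_offset;
        Outcome::Applied
    }
}

fn main() {
    // Startup skipped the catch-up that would have advanced the offset past
    // the ALTER TABLE event (offset 5), so the stale event replays forever.
    let mut r = Replicator { schema_offset: 4 };
    assert_eq!(r.handle_ddl(5, false), Outcome::ResnapshotNeeded);

    // If the offset had been advanced (as proposed above), the stale event
    // would simply be skipped on the next pass.
    r.schema_offset = 5;
    assert_eq!(r.handle_ddl(5, false), Outcome::Skipped);
    println!("ok");
}
```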

Not sure what the right fix is here, and my brain is pretty fried from debugging this, so I’m going to step away for a while. But <accountid:609236b4b9ac3a007151a40b> <accountid:609236f3d3538000687db999> if either of you have any suggestions on how to approach this I would love to hear any ideas you might have 🙂

@nickelization nickelization removed their assignment Jul 7, 2023
@nickelization
Contributor Author

nickelization commented Jul 26, 2023

This was fixed by reverting a prior commit that was found to have introduced this as a regression - see a0fe5ae

readysetbot pushed a commit that referenced this issue Aug 24, 2023
This was fixed some time ago when we fixed #106 by reverting the
following commit:

670413429 (replicators: Skip catch up if ReadySet did not snapshot any table., 2023-03-30)

in this revert commit here:

0e8746ac3 (Revert "replicators: Skip catch up if ReadySet did not snapshot any table.", 2023-07-10)

But I neglected to also enable the test after the revert was merged.
Since the test now passes, I'm removing the ignore.

Refs: #106
Refs: REA-2664
Refs: REA-2437
Change-Id: I2ff061f1deeaab65c862c5140f497c60e02fb9b2
Reviewed-on: https://gerrit.readyset.name/c/readyset/+/5881
Tested-by: Buildkite CI
Reviewed-by: Ethan Donowitz <ethan@readyset.io>