Exporter fails to recover from log after failover #4350
Saw this also on another cluster, log:
On another cluster with 3 brokers and 4 partitions (replication: 3):
Also on a 6-broker, 8-partition (replication: 3) cluster:
Right off the bat, mega ultra bug: look at the first index of your log (as posted above). Now, we know that Atomix isn't great with indexes (in fact we plan on fixing that this quarter), but essentially it does not write the indexes in the log; it just "counts". My guess is that it is bad at counting and giving you a bad index, since we store the position and fetch the index from it. I'll dig deeper into it.
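To illustrate what "counting" instead of persisting indexes means, here is a minimal, hypothetical sketch (names and structure are illustrative assumptions, not Atomix's actual code): a position-to-index mapping that is filled while entries are appended and later queried by record position, so a counting mistake on append yields a bad index on lookup.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentSkipListMap;

// Hypothetical sketch: the raft index is not stored in the entry itself,
// it is derived by counting appended entries and later looked up by position.
final class PositionIndexMapping {
  // record position -> raft log index, filled while entries are appended
  private final ConcurrentSkipListMap<Long, Long> positionToIndex = new ConcurrentSkipListMap<>();
  private long nextIndex = 1; // "counting" instead of persisting the index

  synchronized void onAppend(final long recordPosition) {
    positionToIndex.put(recordPosition, nextIndex++);
  }

  // Returns the index of the closest entry at or below the given position, or -1 if unknown.
  long lookupIndex(final long recordPosition) {
    final Map.Entry<Long, Long> entry = positionToIndex.floorEntry(recordPosition);
    return entry == null ? -1 : entry.getValue();
  }
}
```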
So the log looks fine, and to me it doesn't look like an issue with failover per se. What I can see is:
So it looks like we compacted but the reader was still behind? How is this possible? Shouldn't we compact only after we've exported (meaning that the reader would be ahead)? /cc @deepthidevaki since you fixed issues with reader thread safety recently.
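For context, the invariant being questioned here can be sketched as follows (a minimal illustration with assumed names, not Zeebe's actual compaction code): the log may only be compacted up to the lowest position that the processor and every exporter have already passed.

```java
import java.util.Collection;

// Sketch of the compaction bound: never compact past what exporters have exported.
final class CompactionBound {

  // Lowest exported position across all exporters; unbounded if there are none.
  static long lowestExportedPosition(final Collection<Long> exporterPositions) {
    return exporterPositions.stream().mapToLong(Long::longValue).min().orElse(Long.MAX_VALUE);
  }

  // Compaction must never pass the processed position or the lowest exported position.
  static long compactablePosition(final long lastProcessedPosition, final Collection<Long> exporterPositions) {
    return Math.min(lastProcessedPosition, lowestExportedPosition(exporterPositions));
  }
}
```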
BTW, I think the only interesting setting which might be related is that I set the snapshot period to 5 minutes (
4374: fix(broker): use exporter positions to calculate compactable index in snapshots r=npepinpe a=npepinpe

## Description

- use concrete entry supplier in AtomixSnapshotStorageTest instead of mocks
- use ZeebeIndexMapping and RaftLogReader directly to supply the correct Atomix entry for snapshot metadata
- do not create a snapshot if a snapshot with that index already exists
- add context map to stackdriver logs
- update exported position in RecordingExporter
- ensure log density is 1 in QA tests that require snapshotting with low synthetic loads
- include partition in ZeebeRaftStateMachine log context

## Related issues

closes #4350

Co-authored-by: Deepthi Devaki Akkoorath <deepthidevaki@gmail.com>
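One of the items above, "do not create a snapshot if a snapshot with that index already exists", could look roughly like the following hedged sketch; the `SnapshotStore`/`Snapshot` interfaces are assumptions for illustration, not Zeebe's API.

```java
import java.util.Optional;

// Illustrative only: skip snapshot creation when a snapshot for the same index already exists.
final class SnapshotGuard {

  interface Snapshot { long index(); }

  interface SnapshotStore {
    Optional<Snapshot> latestSnapshot();
    Snapshot takeSnapshot(long index);
  }

  static Optional<Snapshot> takeSnapshotIfNew(final SnapshotStore store, final long index) {
    final boolean alreadyExists =
        store.latestSnapshot().map(s -> s.index() == index).orElse(false);
    if (alreadyExists) {
      return Optional.empty(); // nothing to do, a snapshot at this index is already present
    }
    return Optional.of(store.takeSnapshot(index));
  }
}
```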
Describe the bug
A three-node cluster with a single partition and replication factor 3 experienced multiple failovers. It continued in a state where broker 3 was leader for the partition, but the exporter failed to recover, so no further events were exported.
To Reproduce
Expected behavior
The exporter should be able to recover. In the rare case that this is not possible, the broker should step down so that another broker can hopefully continue with a working exporter.
Log/Stacktrace
Full Stacktrace
Full Log File: https://drive.google.com/open?id=1UVq2qRfxTg18EiyFM7Xp3X10_6CDQyjx
Data folder: https://drive.google.com/open?id=1jfY2Kbs80PTSoFDt7d4ktEa4xbWsNyvo
Environment:
Configuration: