Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(core): fix WAL segment housekeeping edge case leading to a failure to write to WAL table #3381

Merged
merged 10 commits into from
May 24, 2023

Conversation

amunra
Copy link
Contributor

@amunra amunra commented May 22, 2023

Fixed a possible but unlikely scenario where the WalPurgeJob could delete a wal segment before it was applied.

The scenario consists of the following:

  • The purge job scans for wal segments and tracks a locked wal segment (in this scenario (1,6).
  • The WalPurgeJob thread is parked.
  • In the meantime, another wal segment (1,7) is created and then closed and unlocked.
  • The WalPurgeJob thread is re-scheduled.
  • It discovers (1,7) as finds it to be unlocked.

Faulty logic assumed that a locked segment would always be the last segment for a WAL, when this is not the case.

The logic has now been simplified, extracted out to a separate class for testability and the bug fixed.

@amunra amunra force-pushed the fix_premature_cleanup branch 4 times, most recently from 79d6938 to 9e8c30c Compare May 22, 2023 16:19
@amunra amunra changed the title fix(wal): WalPurgeJob no longer prematurely deletes unlocked segments. fix(wal): WalPurgeJob no longer prematurely deletes unlocked segments May 22, 2023
@amunra amunra requested a review from ideoma May 22, 2023 16:27
@amunra amunra marked this pull request as ready for review May 22, 2023 16:27
@amunra
Copy link
Contributor Author

amunra commented May 22, 2023

It's a little difficult to tell from the diff what actually changed in the logic.

The old logic accumulated state about what walId/segmentId is next to be applied in a LongList.

The bug was in this code:

// If the current segment is locked, scroll to next WAL id.
final int walIdLimit = segmentLocked ? walId + 1 : walId;    // <--- actual bug
while (committedWalId < walIdLimit && ++committedWalSegmentIndex < committedSegmentSize) {
    long sequencerPair = sequencerWalIDSegmentIDPairs.get(committedWalSegmentIndex);
    committedWalId = Numbers.decodeHighInt(sequencerPair);
    committedSegmentId = Numbers.decodeLowInt(sequencerPair);
}

In the scenario from the test case it ended up holding committedWalId == 2 and committedSegmentId == 0, when iterating segment (1,7).

This bad state was set up in the previous iteration of the loop and caused the wrong branch in the logic to be executed.

if (walId == committedWalId) {
    ...
} else if (!segmentLocked) {
    ...  // this code got executed when it should not have been.
}

Ultimately it was easier to replace the LongList with an IntIntHashMap and remove the complex iteration logic.

ideoma
ideoma previously approved these changes May 22, 2023
@amunra amunra changed the title fix(wal): WalPurgeJob no longer prematurely deletes unlocked segments fix(wal): edge case where a segment was purged prematurely May 23, 2023
ideoma
ideoma previously approved these changes May 23, 2023
core/src/main/java/io/questdb/cairo/wal/WalPurgeJob.java Outdated Show resolved Hide resolved
@ideoma
Copy link
Collaborator

ideoma commented May 23, 2023

[PR Coverage check]

😍 pass : 169 / 173 (97.69%)

file detail

path covered line new line coverage
🔵 io/questdb/cairo/wal/WalWriter.java 47 51 92.16%
🔵 io/questdb/cairo/wal/WalPurgeJob.java 121 121 100.00%
🔵 io/questdb/cairo/TableUtils.java 1 1 100.00%

@ideoma ideoma changed the title fix(wal): edge case where a segment was purged prematurely fix(core): fix WAL segment housekeeping edge case leading to a failure to write to WAL table May 24, 2023
@ideoma ideoma merged commit c749cff into master May 24, 2023
21 checks passed
@ideoma ideoma deleted the fix_premature_cleanup branch May 24, 2023 14:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants