@mkuratczyk
Contributor

@mkuratczyk mkuratczyk commented Jun 14, 2022

If a server was terminated before data in the page cache was flushed to disk, stream member recovery could fail with errors. This PR improves the recovery / truncation phase in several ways to handle situations where unflushed data is lost.

  • fsync index and segment files when closing them (to be implemented separately)
  • during init, delete files that contain no useful data (empty, header-only, or without any valid chunks)
  • during osiris_log:overview, which can run before init, ignore such files (they are deleted later, during init)
  • refactor how osiris_log detects and handles a full segment. Previously the check ran after each write; now it runs before each write, which should work better when data loss leaves the last segment with no data.
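To illustrate the last point, here is a minimal Python sketch of the difference between checking for a full segment after a write and before a write. It is not the osiris_log code: the `Log` class, its in-memory segments, and the `max_segment_bytes` limit are hypothetical names used only for illustration. One plausible reading of the change is that the check-after approach opens a fresh segment as soon as the current one fills up, so the newest segment can exist with no data in it.

```python
from dataclasses import dataclass, field


@dataclass
class Log:
    """Toy in-memory log (hypothetical): each segment is a list of chunks."""

    max_segment_bytes: int
    segments: list = field(default_factory=lambda: [[]])

    def _current_size(self) -> int:
        # Total number of bytes stored in the newest segment.
        return sum(len(chunk) for chunk in self.segments[-1])

    def write_check_after(self, chunk: bytes) -> None:
        # Check after the write: append, then roll over if the segment
        # is now full. This can leave an empty segment at the tail.
        self.segments[-1].append(chunk)
        if self._current_size() >= self.max_segment_bytes:
            self.segments.append([])

    def write_check_before(self, chunk: bytes) -> None:
        # Check before the write: roll over only when there is a chunk
        # to put into the new segment, so the tail is never left empty.
        if self._current_size() >= self.max_segment_bytes:
            self.segments.append([])
        self.segments[-1].append(chunk)


after = Log(max_segment_bytes=4)
after.write_check_after(b"ab")
after.write_check_after(b"cd")

before = Log(max_segment_bytes=4)
before.write_check_before(b"ab")
before.write_check_before(b"cd")
```

In this sketch `after.segments` ends with an empty segment once the first one fills up, while `before.segments` only gains a new segment when the next chunk is written into it.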

Fixes #79

@kjnilsson kjnilsson changed the title Handle corrupted index and segment files DO NOT MERGE YET: Handle corrupted index and segment files Jul 4, 2022
@kjnilsson
Contributor

I've found an issue where the index is 0 and there is segment data which it wasn't able to correctly truncate

As this may affect the order in which the respective dirty pages
will be flushed. Maybe.
@kjnilsson kjnilsson changed the title DO NOT MERGE YET: Handle corrupted index and segment files Handle corrupted index and segment files Jul 5, 2022
@kjnilsson kjnilsson changed the title Handle corrupted index and segment files Handle data loss in index and segment files Jul 5, 2022
@kjnilsson
Contributor

I've found an issue where the index is 0 and there is segment data which it wasn't able to correctly truncate

This appears fixed since 80600af

@kjnilsson kjnilsson merged commit 4bed6b5 into main Jul 5, 2022
@kjnilsson kjnilsson deleted the corrupted-segment branch March 20, 2024 11:57

Development

Successfully merging this pull request may close these issues.

Operating-system restart/crash can leave a log in an usable state

3 participants