Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

compactor: Rare race of sharded compactors / slow upload causing data loss. #1919

Closed
bwplotka opened this issue Dec 20, 2019 · 5 comments · Fixed by #1937
Closed

compactor: Rare race of sharded compactors / slow upload causing data loss. #1919

bwplotka opened this issue Dec 20, 2019 · 5 comments · Fixed by #1937
Labels

Comments

@bwplotka
Copy link
Member

bwplotka commented Dec 20, 2019

As per discussion, #1901 sharding compactor is currently not fully safe. Please do not use it or use only for smaller setups.

We are working on this, but the reason is also partially explained in https://docs.google.com/document/d/1QvHt9NXRvmdzy51s4_G21UW00QwfL9BfqBDegcqNvg0/edit#

TL;DR:

  • Compactor A for shard cluster = A is uploading the block X - upload takes more than 30m
  • Compactor B is syncing all blocks. It see block X that is being uploaded for more than 30m, removing block.
@bwplotka bwplotka added the bug label Dec 20, 2019
@bwplotka
Copy link
Member Author

bwplotka commented Dec 20, 2019

The workaround here is to change --consistencyDelay on Compactor to larger value like 2h, given block upload time being for Compactor, for biggest block being quicker than this. This has side effect of compaction process being delayed by 2h which should be fine.

cc @mattrco

bwplotka added a commit that referenced this issue Jan 3, 2020
This replaces man 4 inconsistent meta.json syncs places in other components.

Fixes: #1335
Fixes: #1919
Fixes: #1300

* One place for sync logic for both compactor and store
* Corrupted disk cache for meta.json is handled gracefully.
* Blocks without meta.json are handled properly for all compactor phases.
* Synchronize was not taking into account deletion by removing meta.json.
* Prepare for future implementation of https://thanos.io/proposals/201901-read-write-operations-bucket.md/
* Better observability for syncronize process.
* More logs for store startup process.
* Remove Compactor Syncer.
* Added metric for partialUploadAttempt deletions.
* More tests.

TODO in separate PR:
* More observability for index-cache loading / adding time.

Signed-off-by: Bartek Plotka <bwplotka@gmail.com>
Signed-off-by: Bartlomiej Plotka <bwplotka@gmail.com>
bwplotka added a commit that referenced this issue Jan 3, 2020
This replaces man 4 inconsistent meta.json syncs places in other components.

Fixes: #1335
Fixes: #1919
Fixes: #1300

* One place for sync logic for both compactor and store
* Corrupted disk cache for meta.json is handled gracefully.
* Blocks without meta.json are handled properly for all compactor phases.
* Synchronize was not taking into account deletion by removing meta.json.
* Prepare for future implementation of https://thanos.io/proposals/201901-read-write-operations-bucket.md/
* Better observability for syncronize process.
* More logs for store startup process.
* Remove Compactor Syncer.
* Added metric for partialUploadAttempt deletions.
* More tests.

TODO in separate PR:
* More observability for index-cache loading / adding time.

Signed-off-by: Bartlomiej Plotka <bwplotka@gmail.com>
bwplotka added a commit that referenced this issue Jan 3, 2020
Fixes: #1335
Fixes: #1919
Fixes: #1300

* Clean up of meta files are now started only if block which is being uploaded is older than 2 days (only a mitigation).
* Blocks without meta.json are handled properly for all compactor phases.
* Prepare for future implementation of https://thanos.io/proposals/201901-read-write-operations-bucket.md/
* Added metric for partialUploadAttempt deletions and delayed it.
* More tests.

Signed-off-by: Bartlomiej Plotka <bwplotka@gmail.com>
@bwplotka
Copy link
Member Author

bwplotka commented Jan 3, 2020

bwplotka added a commit that referenced this issue Jan 3, 2020
Fixes: #1335
Fixes: #1919
Fixes: #1300

* Clean up of meta files are now started only if block which is being uploaded is older than 2 days (only a mitigation).
* Blocks without meta.json are handled properly for all compactor phases.
* Prepare for future implementation of https://thanos.io/proposals/201901-read-write-operations-bucket.md/
* Added metric for partialUploadAttempt deletions and delayed it.
* More tests.

Signed-off-by: Bartlomiej Plotka <bwplotka@gmail.com>
bwplotka added a commit that referenced this issue Jan 3, 2020
Fixes: #1335
Fixes: #1919
Fixes: #1300

* Clean up of meta files are now started only if block which is being uploaded is older than 2 days (only a mitigation).
* Blocks without meta.json are handled properly for all compactor phases.
* Prepare for future implementation of https://thanos.io/proposals/201901-read-write-operations-bucket.md/
* Added metric for partialUploadAttempt deletions and delayed it.
* More tests.

Signed-off-by: Bartlomiej Plotka <bwplotka@gmail.com>
bwplotka added a commit that referenced this issue Jan 3, 2020
Fixes: #1335
Fixes: #1919
Fixes: #1300

* Clean up of meta files are now started only if block which is being uploaded is older than 2 days (only a mitigation).
* Blocks without meta.json are handled properly for all compactor phases.
* Prepare for future implementation of https://thanos.io/proposals/201901-read-write-operations-bucket.md/
* Added metric for partialUploadAttempt deletions and delayed it.
* More tests.

Signed-off-by: Bartlomiej Plotka <bwplotka@gmail.com>
bwplotka added a commit that referenced this issue Jan 6, 2020
Fixes: #1335
Fixes: #1919
Fixes: #1300

* Clean up of meta files are now started only if block which is being uploaded is older than 2 days (only a mitigation).
* Blocks without meta.json are handled properly for all compactor phases.
* Prepare for future implementation of https://thanos.io/proposals/201901-read-write-operations-bucket.md/
* Added metric for partialUploadAttempt deletions and delayed it.
* More tests.

Signed-off-by: Bartlomiej Plotka <bwplotka@gmail.com>
bwplotka added a commit that referenced this issue Jan 6, 2020
Fixes: #1335
Fixes: #1919
Fixes: #1300

* Clean up of meta files are now started only if block which is being uploaded is older than 2 days (only a mitigation).
* Blocks without meta.json are handled properly for all compactor phases.
* Prepare for future implementation of https://thanos.io/proposals/201901-read-write-operations-bucket.md/
* Added metric for partialUploadAttempt deletions and delayed it.
* More tests.


Signed-off-by: Bartlomiej Plotka <bwplotka@gmail.com>
bwplotka added a commit that referenced this issue Jan 6, 2020
Fixes: #1335
Fixes: #1919
Fixes: #1300

* Clean up of meta files are now started only if block which is being uploaded is older than 2 days (only a mitigation).
* Blocks without meta.json are handled properly for all compactor phases.
* Prepare for future implementation of https://thanos.io/proposals/201901-read-write-operations-bucket.md/
* Added metric for partialUploadAttempt deletions and delayed it.
* More tests.


Signed-off-by: Bartlomiej Plotka <bwplotka@gmail.com>
@bwplotka
Copy link
Member Author

bwplotka commented Jan 6, 2020

@mattrco this should be fixed on master now.

@jaseemabid
Copy link
Contributor

Thanks @bwplotka, is --consistencyDelay required for the master release too?

@bwplotka
Copy link
Member Author

bwplotka commented Jan 7, 2020

It is required but for different reasons, mainly for non-consistent object storages. I would suggest leaving the default value.

@bwplotka bwplotka changed the title compactor: Race of sharded compactors causing data loss. compactor: Rare race of sharded compactors / slow upload causing data loss. Jan 24, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants