Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

compact/main: verify compacted and downsampled blocks #194

Merged
merged 2 commits into from
Mar 1, 2018
Merged

Conversation

fabxc
Copy link
Collaborator

@fabxc fabxc commented Jan 30, 2018

This runs the verification procedure before and after we create new blocks from previous blocks. The compactor will block indefinitely and the downsample
run will terminate with an error.

for _, g := range sy.Groups() {
if _, err := g.Compact(ctx, comp); err != nil {
level.Error(logger).Log("msg", "compaction failed", "err", err)
// The HaltError type signals that we hit a critical bug and should block
// for investigation.
if compact.IsHaltError(err) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I totally see the reason here, but I feel like this need to be automated a little bit.. I can see cases when people will start it and don't have (or remember to have) alert for halted compactor.. And from orchestration point of view it will be seen as healthy..

Not sure what is the best way to handle it, maybe halting that as an option? (configured by flag) or maybe just return error and fail everything. What do you think?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From a debugging point of view this is the best course of action I think. Remember that it should[tm] not happen to begin with. But if it did and we just exit with an error, we'll get rescheduled and debugging gets harder.

A less severe alternative would be marking ourselves as unhealthy. But in that case k8s will also just kill us eventually and reschedule.

I get that it's a bit of a dirty approach. But we do log an error and we are setting a respective metric.
Once we are fully confident that the issue is resolved we can maybe just exit?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure I understand it, but let's hide it behind some debug flag then.

@fabxc
Copy link
Collaborator Author

fabxc commented Feb 27, 2018

Added a hidden flag to enable this behavior.

This runs the verification procedure whenever we encounter a block that
is not valid. The compactor will block indefinitely and the downsample
run will terminate with an error.
@fabxc fabxc merged commit b77877f into master Mar 1, 2018
}
// Ensure all input blocks are valid.
if err := block.VerifyIndex(filepath.Join(pdir, "index")); err != nil {
return id, errors.Wrapf(halt(err), "invalid plan block %s", pdir)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This section seems to claim this error: #246
Let's revisit this code to investigate. Looks like index is not broken, just index file is no longer there.
@fabxc

Will have time for it tomorrow.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants