Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fetcher: Fix data races #4663

Merged
merged 7 commits into from Sep 21, 2021
Merged

Fetcher: Fix data races #4663

merged 7 commits into from Sep 21, 2021

Conversation

matej-g
Copy link
Collaborator

@matej-g matej-g commented Sep 16, 2021

  • I added CHANGELOG entry for this change.
  • Change is not relevant to the end user.

Changes

In the process of refactoring E2E tests, it was discovered that some compactor tests are occasionally failing due to a data race (#4579 (comment)).

After building the Docker image with race detector enabled and running the compactor E2E tests, it was discovered that there is a number of data races occurring, due to concurrent map write / reading.

This PR attempts to address all the discovered races.

Verification

Ran compactor tests with race detector enabled on the Thanos binary - ✔️

Signed-off-by: Matej Gera <matejgera@gmail.com>
Signed-off-by: Matej Gera <matejgera@gmail.com>
Signed-off-by: Matej Gera <matejgera@gmail.com>
Signed-off-by: Matej Gera <matejgera@gmail.com>
@matej-g matej-g marked this pull request as ready for review September 16, 2021 08:09
Copy link
Member

@bwplotka bwplotka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Amazing, job, just one nit!

@@ -723,7 +739,10 @@ func (r *ReplicaLabelRemover) Modify(_ context.Context, metas map[ulid.ULID]*met
level.Warn(r.logger).Log("msg", "block has no labels left, creating one", r.replicaLabels[0], "deduped")
l[r.replicaLabels[0]] = "deduped"
}
metas[u].Thanos.Labels = l

nm := *meta
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

deep copy! 🤗

}

// Filter filters out blocks that are marked for deletion after a given delay.
// It also returns the blocks that can be deleted since they were uploaded delay duration before current time.
func (f *IgnoreDeletionMarkFilter) Filter(ctx context.Context, metas map[ulid.ULID]*metadata.Meta, synced *extprom.TxGaugeVec) error {
f.mtx.Lock()
f.deletionMarkMap = make(map[ulid.ULID]*metadata.DeletionMark)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is that needed, do we access this map somewhere else?

If that's the case I think we have to build this map locally and only then swap in locked fashion, right?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Otherwise we will have "partial" results

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's being accessed in DeletionMarkBlocks. I'm wondering what are the ramifications of having only 'partial' results in the map, shouldn't we pick up any leftovers during the next run of the cleaner func? (Isn't that how it works now, as current version is also potentially allowing to access map with 'partial' results?).

Otherwise you're right, we can also build it separately and swap afterwards while using lock.

Copy link
Member

@bwplotka bwplotka Sep 20, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am just assuming there is a reader somewhere else accessing the deletion map concurrently. For a short period instead of 20000 deleted items it would have zero elements and the back to 20000. I would need to check what exactly wrong can happen with this, but it is just asking for problems here, no? (: Maybe now, things would eventually heal, but tomorrow no. No one would assume that internal map is flaky unless it's called flakyDeleionMarkMap across instances with 20 lines of comments on why and still that might be surprising for next writer/reader (:

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would fix this before moving on

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for clarifying, I see the point now 👍 I adjusted it, please let me know if it looks good now!

Copy link
Member

@bwplotka bwplotka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Responded (:

f.mtx.Lock()
defer f.mtx.Unlock()

deletionMarkMap := make(map[ulid.ULID]*metadata.DeletionMark)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍🏽

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
deletionMarkMap := make(map[ulid.ULID]*metadata.DeletionMark)
deletionMarkMap := make(map[ulid.ULID]*metadata.DeletionMark, len(f.deletionMarkMap))

}

// Filter filters out blocks that are marked for deletion after a given delay.
// It also returns the blocks that can be deleted since they were uploaded delay duration before current time.
func (f *IgnoreDeletionMarkFilter) Filter(ctx context.Context, metas map[ulid.ULID]*metadata.Meta, synced *extprom.TxGaugeVec) error {
f.mtx.Lock()
f.deletionMarkMap = make(map[ulid.ULID]*metadata.DeletionMark)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would fix this before moving on

Signed-off-by: Matej Gera <matejgera@gmail.com>
Copy link
Member

@bwplotka bwplotka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank You! LGTM! 💪🏽

@@ -871,6 +900,10 @@ func (f *IgnoreDeletionMarkFilter) Filter(ctx context.Context, metas map[ulid.UL
return errors.Wrap(err, "filter blocks marked for deletion")
}

f.mtx.Lock()
f.deletionMarkMap = deletionMarkMap
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍🏽

@bwplotka bwplotka enabled auto-merge (squash) September 21, 2021 11:13
@bwplotka bwplotka merged commit b77b8ee into thanos-io:main Sep 21, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants