
fix(storage): fix race condition between pin version and new shared-buffer #3651

Merged
merged 3 commits into from
Jul 5, 2022

Conversation

Little-Wallace
Contributor

@Little-Wallace Little-Wallace commented Jul 5, 2022

Signed-off-by: Little-Wallace bupt2013211450@gmail.com

I hereby agree to the terms of the Singularity Data, Inc. Contributor License Agreement.

What's changed and what's your intention?

close #3639

This bug may cause write loss: the shared buffer was created in the BTreeMap of the old version, while another thread could replace the whole LocalVersion, including both the shared-buffer BTreeMap and the HummockVersion.
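To illustrate the race described above, here is a minimal, self-contained sketch (the `SharedBuffers` type and `demo_lost_write` function are hypothetical simplifications, not the actual RisingWave code): one path clones the old version, a concurrent writer inserts a shared buffer into the live map, and the clone is then installed wholesale, dropping that buffer.

```rust
use std::collections::BTreeMap;
use std::sync::RwLock;

// Hypothetical, simplified stand-in for LocalVersion's shared-buffer
// map, keyed by epoch.
type SharedBuffers = BTreeMap<u64, Vec<u8>>;

// Mirrors the buggy update path: clone the old state, then replace the
// whole thing, discarding anything inserted concurrently.
fn demo_lost_write(local: &RwLock<SharedBuffers>) {
    // Version-update path snapshots the old state (`old_version.clone()`).
    let stale_clone = local.read().unwrap().clone();

    // Meanwhile a write path registers a new shared buffer for epoch 7
    // in the *live* map.
    local.write().unwrap().insert(7, b"buffered write".to_vec());

    // The version-update path installs its stale clone wholesale,
    // silently dropping epoch 7's buffer -- the write loss from #3639.
    *local.write().unwrap() = stale_clone;
}
```

Mutating only the pinned version in place, instead of replacing the whole structure, avoids this loss.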

Checklist

  • I have written necessary rustdoc comments
  • I have added necessary unit tests and integration tests
  • All checks passed in ./risedev check (or alias, ./risedev c)

Documentation

If your pull request contains user-facing changes, please specify the types of the changes, and create a release note. Otherwise, please feel free to remove this section.

Types of user-facing changes

Please keep the types that apply to your changes, and remove those that do not apply.

  • Installation and deployment
  • Connector (sources & sinks)
  • SQL commands, functions, and operators
  • RisingWave cluster configuration changes
  • Other (please specify in the release note below)

Release note

Please create a release note for your changes. In the release note, focus on the impact on users, and mention the environment or conditions where the impact may occur.

Refer to a related PR or issue link (optional)

Signed-off-by: Little-Wallace <bupt2013211450@gmail.com>
Signed-off-by: Little-Wallace <bupt2013211450@gmail.com>
Collaborator

@hzxa21 hzxa21 left a comment


Given that we only have one thread updating the pinned version, can we avoid the complexity and inefficiency of first acquiring a read lock and then a write lock by:

  1. Do not check version id in try_update_pinned_version, rename it to update_pinned_version.
  2. Let the caller ensure the updated version is newer. It is easier to do so because the caller anyway needs to get the local pinned version id before sending pin_version RPC.

@Little-Wallace
Copy link
Contributor Author

Given that we only have one thread updating the pinned version, can we avoid the complexity and inefficiency of first acquiring a read lock and then a write lock by:

  1. Do not check version id in try_update_pinned_version, rename it to update_pinned_version.
  2. Let the caller ensure the updated version is newer. It is easier to do so because the caller anyway needs to get the local pinned version id before sending pin_version RPC.

But it is dangerous to depend on another module to guarantee that the version id supplied as a parameter is monotonically increasing.

```diff
@@ -230,10 +230,10 @@ impl LocalVersionManager {
             conflict_detector.set_watermark(newly_pinned_version.max_committed_epoch);
         }

-        let mut new_version = old_version;
+        let mut new_version = old_version.clone();
```
Copy link
Collaborator


We only need to update the pinned_version in the local version here, so we don't need to clone and replace the whole local version.

```rust
{
    let mut guard = RwLockUpgradableReadGuard::upgrade(old_version);
    guard.set_pinned_version(newly_pinned_version);
    RwLockWriteGuard::unlock_fair(guard);
}
```

How about this?
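The snippet above relies on parking_lot's upgradable read guard. With the standard library's `RwLock`, which has no upgrade operation, the equivalent pattern is a cheap check under the read lock, a release, a write-lock re-acquisition, and a re-check. A minimal sketch, assuming a simplified `LocalVersion` with only a version id (the type and function names are illustrative, not RisingWave's):

```rust
use std::sync::RwLock;

// Hypothetical, simplified local version holding just the pinned id.
pub struct LocalVersion {
    pinned_version_id: u64,
}

// Reject stale updates cheaply under a read lock; only take the write
// lock when the update looks newer, and re-check after acquiring it.
pub fn try_update_pinned_version(local: &RwLock<LocalVersion>, new_id: u64) -> bool {
    {
        let guard = local.read().unwrap();
        if guard.pinned_version_id >= new_id {
            return false; // cheap rejection, no write lock taken
        }
    } // read lock released here

    let mut guard = local.write().unwrap();
    // Another writer may have slipped in between the two locks, so re-check.
    if guard.pinned_version_id >= new_id {
        return false;
    }
    guard.pinned_version_id = new_id; // mutate in place: no clone, no replace
    true
}
```

The re-check after re-locking is what parking_lot's `RwLockUpgradableReadGuard::upgrade` makes unnecessary, since at most one upgradable reader exists at a time.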

Contributor Author


It's no different from the code before #3620....

Contributor Author


We do not want to run set_pinned_version while holding a write lock, because it will also change the BTreeMap.

Collaborator


The differences are:

  1. We don't need to acquire a write lock if the version id is not larger.
  2. We release the write lock before version_update_notifier_tx.send.
  3. We ensure a fair unlock.

Given that the version id check is very lightweight, I don't think we can benefit from 1). I am not sure about 2) either. I doubt the necessity of #3620, but maybe 3) is the key point?

Collaborator


We do not want to run set_pinned_version while holding a write lock, because it will also change the BTreeMap.

I see your point now. LGTM.

Contributor Author


No... it is not enough. Our goal in #3620 was to reduce the time spent holding the write mutex.
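The goal stated above, keeping write-lock hold time short, usually means doing the expensive work outside the lock and holding the write lock only for the final swap. A minimal sketch of that general pattern (the `PinnedVersion` type and `install_new_version` function are hypothetical, not RisingWave's actual API):

```rust
use std::sync::RwLock;

// Hypothetical pinned-version record.
#[derive(Clone)]
pub struct PinnedVersion {
    pub id: u64,
}

// Build the new version outside the lock; hold the write lock only for
// a cheap comparison and assignment, so readers are blocked briefly.
pub fn install_new_version(slot: &RwLock<PinnedVersion>, new_id: u64) -> bool {
    // Expensive construction/validation would happen here, lock-free.
    let fresh = PinnedVersion { id: new_id };

    let mut guard = slot.write().unwrap();
    if fresh.id > guard.id {
        *guard = fresh; // short critical section: just the swap
        true
    } else {
        false // never install an older version
    }
}
```

This keeps the critical section to a comparison and an assignment, which is the property #3620 was after.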

@Little-Wallace Little-Wallace marked this pull request as ready for review July 5, 2022 07:39
@codecov

codecov bot commented Jul 5, 2022

Codecov Report

Merging #3651 (1e4d1f5) into main (0efa4dd) will decrease coverage by 0.00%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main    #3651      +/-   ##
==========================================
- Coverage   74.40%   74.40%   -0.01%     
==========================================
  Files         781      781              
  Lines      110788   110788              
==========================================
- Hits        82432    82430       -2     
- Misses      28356    28358       +2     
Flag Coverage Δ
rust 74.40% <100.00%> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
src/storage/src/hummock/local_version_manager.rs 81.38% <100.00%> (-0.12%) ⬇️
src/meta/src/hummock/mock_hummock_meta_client.rs 40.56% <0.00%> (-0.95%) ⬇️
src/connector/src/filesystem/file_common.rs 80.35% <0.00%> (-0.45%) ⬇️
src/frontend/src/expr/utils.rs 98.99% <0.00%> (ø)
src/storage/src/hummock/iterator/merge_inner.rs 90.00% <0.00%> (+0.62%) ⬆️


@skyzh
Contributor

skyzh commented Jul 5, 2022

Can we get this merged first so that our release version won't contain the race condition? We can solve the other issues later; it's okay to have some regression for now. 🤣

@skyzh skyzh added the mergify/can-merge Indicates that the PR can be added to the merge queue label Jul 5, 2022
@mergify mergify bot merged commit 480010f into risingwavelabs:main Jul 5, 2022
nasnoisaac pushed a commit to nasnoisaac/risingwave that referenced this pull request Aug 9, 2022
…uffer (risingwavelabs#3651)

* fix

Signed-off-by: Little-Wallace <bupt2013211450@gmail.com>

* fix warn

Signed-off-by: Little-Wallace <bupt2013211450@gmail.com>
Labels
mergify/can-merge Indicates that the PR can be added to the merge queue
Development

Successfully merging this pull request may close these issues.

test: frequent missing data in e2e test
3 participants