
feat(storage): sync shared buffer based on threshold #1671

Merged: 11 commits merged into main on Apr 13, 2022

Conversation

@StrikeW (Contributor) commented Apr 7, 2022

What's changed and what's your intention?

Add a threshold size to the hummock shared buffer and sync batches to S3 when there is not enough space for incoming batches:

  • add shared_buffer_cur_size and shared_buffer_threshold_size to SharedBufferManager
  • write_batch() will first allocate space for the incoming batch
  • if there is not enough space, trigger a sync(None) to flush the shared buffer
  • use a tokio watch channel to avoid sending multiple flush messages (a sketch of this flow follows below)
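
A minimal sketch of this write path, with simplified names and types (not the actual RisingWave code; `flush()` stands in for `sync(None)`):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

struct SharedBufferManager {
    shared_buffer_cur_size: AtomicU64,
    shared_buffer_threshold_size: u64,
}

impl SharedBufferManager {
    /// Reserve `size` bytes for an incoming batch; flush first if the buffer
    /// would exceed the threshold.
    async fn allocate_space(&self, size: u64) {
        let cur = self.shared_buffer_cur_size.load(Ordering::SeqCst);
        if cur + size > self.shared_buffer_threshold_size {
            // Flush all epochs and wait for the uploader to finish.
            self.flush().await;
        }
        self.shared_buffer_cur_size.fetch_add(size, Ordering::SeqCst);
    }

    async fn flush(&self) {
        // In the PR, this sends a sync request to the uploader and waits on a
        // tokio watch channel so that concurrent writers share a single flush.
    }
}
```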

Checklist

  • I have written necessary docs and comments
  • I have added necessary unit tests and integration tests

Refer to a related PR or issue link (optional)

close #990

@StrikeW (Contributor Author) commented Apr 7, 2022

I will append the e2e benchmark results later.

storage_config.as_ref(),
registry.clone(),
));
log::info!("State store shared buffer threshold {} MB", storage_config.shared_buffer_threshold_mb);
Collaborator:

We already printed all configs in L74 so I think there is no need to print shared_buffer_threshold_mb here.

Comment on lines 91 to 92
pub shared_buffer_cur_size: atomic::AtomicU64,
pub shared_buffer_threshold_size: u64,
Collaborator:

These two fields are used solely by SharedBufferFlusher and not exposed. Should we track shared_buffer_cur_size in SharedBufferManager instead?

Member:

+1.

Contributor Author:

Will fix

Comment on lines 374 to 377
while self.stats.shared_buffer_threshold_size <= shared_buff_cur_size {
yield_now().await;
shared_buff_cur_size = self.stats.shared_buffer_cur_size.load(atomic::Ordering::SeqCst);
}
Collaborator:

Passively waiting in a while loop for the shared buffer to free up (which can take up to the SharedBufferFlusher check interval of 10 ms) is less optimal than proactively triggering a sync immediately. Also, I think we should have two thresholds for the shared buffer size:

  • size_low_watermark: start syncing to S3 when shared_buffer_cur_size > size_low_watermark
  • size_high_watermark: halt ingest_batch when shared_buffer_cur_size > size_high_watermark. Moreover, ingest_batch should trigger a sync when there is no ongoing sync at the moment (sketched below).
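
A hedged sketch of the suggested two-watermark policy (hypothetical names, not code from this PR):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

struct SharedBufferLimits {
    cur_size: AtomicU64,
    size_low_watermark: u64,
    size_high_watermark: u64,
}

impl SharedBufferLimits {
    /// Returns (should_start_sync, must_wait_before_ingest) for a batch of
    /// `incoming` bytes.
    fn check(&self, incoming: u64) -> (bool, bool) {
        let projected = self.cur_size.load(Ordering::SeqCst) + incoming;
        (
            projected > self.size_low_watermark,  // start syncing to S3 early
            projected > self.size_high_watermark, // halt ingest_batch until space frees up
        )
    }
}
```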

Contributor Author:

Very helpful! You are right. Low and high watermarks would give a smoother strategy.

@StrikeW (Contributor Author) commented Apr 7, 2022

It's weird. The Grafana diagram shows that executors can still ingest batches even after the shared buffer threshold has been reached.
[Grafana screenshot]

If I use an older version as the baseline (ae9222d, the version where I first implemented the code), the sync throughput is higher than the ingest throughput. But the compute node still gets killed due to OOM in both cases (MinIO as storage).

.load(atomic::Ordering::SeqCst);
// yield current task if threshold has been reached
while self.stats.shared_buffer_threshold_size <= shared_buff_cur_size {
yield_now().await;
Member:

This doesn't seem to give any scheduling guarantee; in the worst case it becomes a busy spin-loop. I suggest using a watch notifier here.
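
A minimal sketch of a watch-based wait, assuming the flusher publishes a success flag on a `tokio::sync::watch` channel (hypothetical wiring, not this PR's exact code):

```rust
use tokio::sync::watch;

async fn wait_for_flush(mut rx: watch::Receiver<bool>) -> Result<(), &'static str> {
    // Park this task until the flusher publishes a result, instead of spinning
    // with yield_now().
    rx.changed().await.map_err(|_| "flusher dropped the sender")?;
    if *rx.borrow() {
        Ok(())
    } else {
        Err("shared buffer flush failed")
    }
}
```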

Comment on lines 15 to 17
pub mod shared_buffer_batch;
pub mod shared_buffer_flusher;
pub mod shared_buffer_manager;
pub mod shared_buffer_uploader;
Member @BugenZhao, Apr 8, 2022:

Just to mention, the shared_buffer_ prefix here seems to be redundant. 🤣

shared_buff_prev_size,
sync_size
);
assert!(shared_buff_prev_size >= sync_size);
Member:

No need to assert this, since a subtraction overflow will already cause a panic at L122.


@@ -25,14 +25,15 @@ use crate::write_batch::WriteBatch;
pub trait GetFutureTrait<'a> = Future<Output = Result<Option<Bytes>>> + Send;
pub trait ScanFutureTrait<'a, R, B> = Future<Output = Result<Vec<(Bytes, Bytes)>>> + Send;
pub trait EmptyFutureTrait<'a> = Future<Output = Result<()>> + Send;
pub trait IngestBatchFutureTrait<'a> = Future<Output = Result<u64>> + Send;
Member:

Consider updating the doc for ingest_batch as well.

src/storage/src/memory.rs (outdated comment, resolved)
@BugenZhao (Member):

> It's weird. The Grafana diagram shows that executors can still ingest batches even after the shared buffer threshold has been reached.
>
> If I use an older version as the baseline (ae9222d, the version where I first implemented the code), the sync throughput is higher than the ingest throughput. But the compute node still gets killed due to OOM in both cases (MinIO as storage).

Once the threshold is no longer exceeded, all waiting writers seem to get a chance to write their chunks, so this may also lead to OOM.

A simple design might be to use a semaphore with initial permits reflecting the size threshold. When a writer needs to ingest a batch, it can try acquiring some permits according to the size of the batch. 🤣
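
A hedged sketch of the semaphore idea, with permits standing for bytes of shared-buffer budget (hypothetical constants and wiring, not code from this PR):

```rust
use std::sync::Arc;
use tokio::sync::Semaphore;

async fn ingest_batch(budget: Arc<Semaphore>, batch_size: u32) {
    // Wait until `batch_size` bytes of budget are available.
    let permit = budget
        .acquire_many(batch_size)
        .await
        .expect("semaphore closed");
    // ... write the batch into the shared buffer ...
    // Keep the budget reserved until the uploader syncs this batch to S3;
    // the flusher would then call `budget.add_permits(batch_size as usize)`.
    permit.forget();
}
```

The budget could be created once per compute node, e.g. `Arc::new(Semaphore::new(128 << 20))` for an assumed 128 MB threshold.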

@StrikeW (Contributor Author) commented Apr 8, 2022

After I fixed the problem (a concurrency bug), the diagram shows that the two curves follow the same trend, and the compute node doesn't consume much memory. I will refine the PR later.
[Grafana screenshot]

@CLAassistant commented Apr 8, 2022

CLA assistant check
All committers have signed the CLA.

@StrikeW changed the title from "feat(storage): add shared buffer flusher to sync based on threshold" to "[WIP] feat(storage): sync shared buffer based on threshold" on Apr 12, 2022
@codecov bot commented Apr 13, 2022

Codecov Report

Merging #1671 (ff21261) into main (5575b70) will increase coverage by 0.10%.
The diff coverage is 86.56%.

@@            Coverage Diff             @@
##             main    #1671      +/-   ##
==========================================
+ Coverage   71.33%   71.43%   +0.10%     
==========================================
  Files         605      606       +1     
  Lines       79168    79481     +313     
==========================================
+ Hits        56474    56778     +304     
- Misses      22694    22703       +9     
| Flag | Coverage Δ |
| --- | --- |
| rust | 71.43% <86.56%> (+0.10%) ⬆️ |

Flags with carried forward coverage won't be shown.

| Impacted Files | Coverage Δ |
| --- | --- |
| src/bench/ss_bench/main.rs | 0.80% <0.00%> (-0.01%) ⬇️ |
| src/compute/src/server.rs | 0.00% <0.00%> (ø) |
| src/storage/src/hummock/local_version_manager.rs | 83.00% <0.00%> (-0.62%) ⬇️ |
| src/storage/src/monitor/monitored_store.rs | 2.81% <0.00%> (+0.13%) ⬆️ |
| src/storage/src/store.rs | 100.00% <ø> (ø) |
| ...src/hummock/shared_buffer/shared_buffer_manager.rs | 93.53% <79.23%> (-3.61%) ⬇️ |
| ...rc/hummock/shared_buffer/shared_buffer_uploader.rs | 87.06% <91.66%> (+7.97%) ⬆️ |
| src/common/src/config.rs | 78.44% <100.00%> (+0.76%) ⬆️ |
| src/storage/src/hummock/mod.rs | 75.00% <100.00%> (+0.70%) ⬆️ |
| ...e/src/hummock/shared_buffer/shared_buffer_batch.rs | 95.94% <100.00%> (+0.18%) ⬆️ |
| ... and 17 more | |


@StrikeW changed the title from "[WIP] feat(storage): sync shared buffer based on threshold" to "feat(storage): sync shared buffer based on threshold" on Apr 13, 2022
@StrikeW requested a review from hzxa21 on April 13, 2022 07:57
@StrikeW (Contributor Author) commented Apr 13, 2022

Benchmark results of two compute nodes on a single EC2 instance.
Workload: select * from lineitem;
Setup: shared_buffer_threshold_size 128MB

[benchmark screenshot]

@StrikeW requested a review from BugenZhao on April 13, 2022 09:05
@hzxa21 (Collaborator) left a comment

LGTM. Great work! Thanks for the PR.

@StrikeW merged commit 6ee339c into main on Apr 13, 2022
@StrikeW deleted the sy/shared_buffer_threshold branch on April 13, 2022 10:23
@skyzh (Contributor) commented Apr 13, 2022

Can we have more complicated test setups, like TPC-H with joins? I'd like to see whether barriers can successfully pass through the system compared with the state before this PR.

Comment on lines +248 to +249
let mut res = Ok(());
let sync_res = rx.await.unwrap();
Member:

Why not let sync_size = rx.await.unwrap()?; here?

.stats
.shared_buffer_cur_size
.fetch_sub(sync_size, Ordering::SeqCst);
assert!(shared_buff_prev_size >= sync_size);
Member:

No need for this assertion.

));
}
} else {
self.ongoing_flush.init();
Member:

What if there are two coroutines running here at the same time? It seems this will panic on the assertion. The subscribe and init should be executed atomically.

Contributor Author:

That's a problem.
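
A hedged sketch of one way to make the subscribe and init steps atomic, as suggested above: guard the flush state with a mutex (hypothetical types, not this PR's code):

```rust
use tokio::sync::{watch, Mutex};

#[derive(Default)]
struct OngoingFlush {
    // Some(receiver) while a flush is in progress.
    state: Mutex<Option<watch::Receiver<bool>>>,
}

impl OngoingFlush {
    /// Atomically either become the flush leader (receiving the sender) or
    /// subscribe to the ongoing flush (receiving only a receiver).
    async fn init_or_subscribe(&self) -> (Option<watch::Sender<bool>>, watch::Receiver<bool>) {
        let mut state = self.state.lock().await;
        if let Some(rx) = state.as_ref() {
            (None, rx.clone())
        } else {
            let (tx, rx) = watch::channel(false);
            *state = Some(rx.clone());
            // The caller holding `tx` is the leader and triggers the actual flush.
            (Some(tx), rx)
        }
    }
}
```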

// Wait for notification
rx.changed().await.unwrap();
if !(*rx.borrow()) {
res = Err(HummockError::shared_buffer_error(
Member:

Directly return here.

Comment on lines +44 to +45
HummockValue::Put(_, val) => VALUE_META_SIZE + val.len(),
HummockValue::Delete(_) => VALUE_META_SIZE,
Member @BugenZhao, Apr 13, 2022:

What if the inner implementation of HummockValue changes in the future? IIRC, there is an encoded_len function.

Contributor Author @StrikeW, Apr 14, 2022:

encoded_len seems to be the length in the SST file, but we need the in-memory size of the kv pair. Maybe there is a sizeof(HummockValue) equivalent in Rust?
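
A hedged note on the sizeof question: Rust's `std::mem::size_of` only covers the fixed-size part of the enum, so heap payloads (e.g. the value bytes) still need to be counted separately. Simplified stand-in type below, not the real HummockValue:

```rust
use bytes::Bytes;

enum Value {
    Put(Bytes),
    Delete,
}

fn in_memory_size(v: &Value) -> usize {
    // Fixed-size enum footprint plus the length of any heap-allocated payload.
    std::mem::size_of::<Value>()
        + match v {
            Value::Put(payload) => payload.len(),
            Value::Delete => 0,
        }
}
```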

Comment on lines +173 to 192
pub async fn write_batch(
&self,
batch: Vec<SharedBufferItem>,
epoch: HummockEpoch,
) -> HummockResult<u64> {
let batch = SharedBufferBatch::new(batch, epoch);
let size = batch.size;

self.allocate_space(size).await?;

self.shared_buffer
.write()
.entry(epoch)
.or_insert(BTreeMap::new())
.insert(batch.end_user_key().to_vec(), batch.clone());
self.uploader_tx
.send(SharedBufferUploaderItem::Batch(batch))
.map_err(HummockError::shared_buffer_error)?;
Ok(size)
}
Contributor @MrCroxx, Apr 13, 2022:

Consider three writers entering L181. Writer 1 exceeds the threshold and triggers a flush, writer 2 advances the current size by some amount, then writer 3 exceeds the threshold and triggers a flush again, but none of the three writers has stepped to L183 yet. It seems that in this case the shared buffer will be flushed twice (the second flush on an empty buffer), and the shared buffer can still OOM.

Contributor Author @StrikeW, Apr 14, 2022:

For the multiple-flush concern: in the flush() method, only the first writer sends the flush request; the other writers see there is an ongoing flush and subscribe to the flush notification.
As for the scenario in your case, I assume all three writers are on the same compute node, so writer 2 cannot advance the size until writer 1 has finished the flush.

Contributor:

But after writer 1 finishes flushing and before it writes to the shared buffer, there is still a chance that writers 2 and 3 call flush again.

@StrikeW (Contributor Author) commented Apr 14, 2022

> Can we have more complicated test setups, like TPC-H with joins? I'd like to see whether barriers can successfully pass through the system compared with the state before this PR.

Sure, but tpch-bench cannot run against RisingWave right now. The Java frontend complains "could not find source catalog for x". It seems that Kafka source support is broken at the moment (maybe under refactoring).

@skyzh (Contributor) commented Apr 14, 2022

The Rust frontend should already support it. You may join #proj-connector for more information.


Successfully merging this pull request may close these issues.

Avoid shared buffer thrashing.
6 participants