feat(metrics): add backpressure metrics #3636

Sunt-ing · 2022-07-04T12:18:22Z

I hereby agree to the terms of the Singularity Data, Inc. Contributor License Agreement.

What's changed and what's your intention?

Add backpressure metrics. Similar to Flink's backpressure monitoring, the backpressure rate is derived from the sampled occupancy rate of the output buffer.
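
A minimal sketch of the sampling idea, not the code in this PR: assuming a monitor periodically records the output buffer's queue length, the occupancy rate is the average fill fraction over those samples. The function name `sampled_occupancy` and its parameters are hypothetical.

```rust
/// Average fill fraction of an output buffer over periodic samples.
/// `samples` holds the queue length observed at each sampling tick;
/// `capacity` is the buffer's maximum length. Both are illustrative inputs.
fn sampled_occupancy(samples: &[usize], capacity: usize) -> f64 {
    if samples.is_empty() || capacity == 0 {
        return 0.0;
    }
    let total: f64 = samples
        .iter()
        // Clamp each sample so a bad reading cannot push the rate above 1.0.
        .map(|&len| len.min(capacity) as f64 / capacity as f64)
        .sum();
    total / samples.len() as f64
}
```

For example, samples of 0, 5, and 10 queued messages in a buffer of capacity 10 average to an occupancy rate of 0.5. A sampled rate like this only reflects the instants at which the monitor happened to run.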

Bench result in 3-CN mode:

Checklist

I have written necessary docs and comments
I have added necessary unit tests and integration tests
All checks passed in ./risedev check (or alias, ./risedev c)

Refer to a related PR or issue link (optional)

#3284

src/stream/src/executor/dispatch.rs

skyzh · 2022-07-04T12:51:57Z

src/stream/src/task/stream_manager.rs

+    actor_coroutine_monitor_tasks: HashMap<ActorId, JoinHandle<()>>,
+
+    /// Stores all actor output buffer montioring tasks.
+    actor_output_buffer_monitor_tasks: HashMap<ActorId, JoinHandle<()>>,


Can we have a single extra coroutine for all metrics collection?
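
Roughly what I have in mind, as a sketch only: one registry of per-actor probes that a single monitoring loop polls on each tick, instead of one `JoinHandle` map per metric. `MetricsCollector`, its methods, and `ActorId = u32` are assumptions for illustration and do not exist in the codebase.

```rust
use std::collections::HashMap;

// Assumed alias for illustration.
type ActorId = u32;

/// A probe reads one metric value for one actor.
type Probe = Box<dyn Fn() -> u64 + Send>;

/// Registry polled by a single monitoring loop.
struct MetricsCollector {
    probes: HashMap<ActorId, Vec<(&'static str, Probe)>>,
}

impl MetricsCollector {
    fn new() -> Self {
        Self { probes: HashMap::new() }
    }

    /// Registers one named probe for an actor.
    fn register(&mut self, actor: ActorId, name: &'static str, probe: impl Fn() -> u64 + Send + 'static) {
        self.probes.entry(actor).or_default().push((name, Box::new(probe)));
    }

    /// Removes every probe of an actor; returns whether any were registered.
    fn unregister(&mut self, actor: ActorId) -> bool {
        self.probes.remove(&actor).is_some()
    }

    /// One tick of the monitoring loop: read every probe of every actor.
    fn collect_once(&self) -> Vec<(ActorId, &'static str, u64)> {
        let mut out = Vec::new();
        for (&actor, probes) in &self.probes {
            for (name, probe) in probes {
                out.push((actor, *name, probe()));
            }
        }
        // Sort for a stable order, since HashMap iteration order is unspecified.
        out.sort();
        out
    }
}
```

With this shape, dropping an actor is a single `unregister` call, and there is no second map that has to be kept in sync.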

src/stream/src/executor/dispatch.rs

skyzh · 2022-07-04T12:58:31Z

thread 'tokio-runtime-worker' panicked at 'called `Option::unwrap()` on a `None` value', src/stream/src/task/stream_manager.rs:792:14
stack backtrace:
   0: rust_begin_unwind
             at /rustc/bb8c2f41174caceec00c28bc6c5c20ae9f9a175c/library/std/src/panicking.rs:584:5
   1: core::panicking::panic_fmt
             at /rustc/bb8c2f41174caceec00c28bc6c5c20ae9f9a175c/library/core/src/panicking.rs:142:14
   2: core::panicking::panic
             at /rustc/bb8c2f41174caceec00c28bc6c5c20ae9f9a175c/library/core/src/panicking.rs:48:5
   3: core::option::Option<T>::unwrap
             at /rustc/bb8c2f41174caceec00c28bc6c5c20ae9f9a175c/library/core/src/option.rs:775:21
   4: risingwave_stream::task::stream_manager::LocalStreamManagerCore::drop_actor
             at ./src/stream/src/task/stream_manager.rs:790:9
   5: risingwave_stream::task::stream_manager::LocalStreamManager::drop_actor
             at ./src/stream/src/task/stream_manager.rs:239:13

This panics because the dispatcher metrics future is not started for some actors. Please check.
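
One possible way to make the cleanup tolerant of actors that never had a monitor task, sketched with stand-in types: `MonitorTask` is a hypothetical placeholder for the stored `JoinHandle<()>`, and this `drop_actor` is a free function written for illustration, not the method in `stream_manager.rs`.

```rust
use std::collections::HashMap;

// Assumed alias for illustration.
type ActorId = u32;

/// Hypothetical stand-in for a monitoring task handle.
struct MonitorTask;

impl MonitorTask {
    /// A real handle would cancel the spawned task here.
    fn abort(self) {}
}

/// Removes and aborts the actor's monitor task if one was registered.
/// Returns `false` instead of panicking when no task was ever started.
fn drop_actor(tasks: &mut HashMap<ActorId, MonitorTask>, actor_id: ActorId) -> bool {
    match tasks.remove(&actor_id) {
        Some(task) => {
            task.abort();
            true
        }
        None => false,
    }
}
```

Matching on the result of `remove` replaces the `unwrap()` that appears in the backtrace above.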

skyzh

https://github.com/apache/flink/blob/16109a31468949f09c2a7bba9003761726e3d61c/flink-streaming-java/src/main/java/org/apache/flink/streaming/runtime/tasks/StreamTask.java#L573-L584

Flink does not measure backpressure the way this PR does. I think this metric should be accurate; otherwise future developers will be confused about why it doesn't reflect the real situation.
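
To illustrate what a time-based measure could look like, here is a minimal sketch. It assumes each output keeps a monotonically increasing counter of nanoseconds spent blocked and that the counter is read at the start and end of a window; `blocked_ratio` and its parameters are hypothetical names, and this is not a description of Flink's implementation.

```rust
use std::time::Duration;

/// Fraction of a time window that an output spent blocked on a full buffer.
/// `blocked_ns_before` and `blocked_ns_after` are readings of an assumed
/// monotonically increasing "nanoseconds blocked" counter.
fn blocked_ratio(blocked_ns_before: u64, blocked_ns_after: u64, window: Duration) -> f64 {
    let window_ns = window.as_nanos() as f64;
    if window_ns == 0.0 {
        return 0.0;
    }
    // A counter reset would make the difference negative; treat it as zero.
    let blocked_ns = blocked_ns_after.saturating_sub(blocked_ns_before) as f64;
    (blocked_ns / window_ns).min(1.0)
}
```

A counter that advances by 250 ms over a 1 s window yields a ratio of 0.25, which reads directly as "blocked 25% of the time".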

StrikeW · 2022-07-04T13:49:53Z

The current implementation uses a separate coroutine task to calculate the backpressure rate. The problem is that the metric's visibility depends on whether the monitor task is scheduled on time. A better approach would be for each actor to report the metric itself.

Sunt-ing · 2022-07-04T13:51:22Z

This PR does not implement accurate backpressure monitoring. If that's necessary, the implementation can be changed to measure the blocking time in RemoteOutput::send and LocalOutput::send as follows:

    async fn send(&mut self, message: Message) -> Result<()> {
        // Start timing only when the buffer is already full, i.e. when this
        // send is expected to block.
        let start_time = if self.ch.capacity() == 0 {
            Some(Instant::now())
        } else {
            None
        };
        // local channel should never fail
        self.ch
            .send(message)
            .await
            .map_err(|_| internal_error("failed to send"))?;
        // Record how long the send was blocked by the full buffer.
        if let Some(start_time) = start_time {
            self.metrics
                .output_buffer_blocking_time
                .with_label_values(&[&up_actor_id])
                .inc_by(start_time.elapsed().as_nanos() as u64);
        }
        Ok(())
    }
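
The same idea can be tried outside the codebase with only the standard library. The sketch below is a stand-in under stated assumptions: it uses the blocking `std::sync::mpsc::sync_channel` rather than the async channel in the snippet above, and `send_timed` is a hypothetical helper that returns the blocked duration instead of updating a metric.

```rust
use std::sync::mpsc::{SendError, SyncSender, TrySendError};
use std::time::{Duration, Instant};

/// Sends `msg` and returns how long the caller was blocked by a full buffer.
/// A send that finds free space reports `Duration::ZERO` and never reads the clock.
fn send_timed<T>(tx: &SyncSender<T>, msg: T) -> Result<Duration, SendError<T>> {
    match tx.try_send(msg) {
        // Fast path: there was room in the buffer, so nothing blocked.
        Ok(()) => Ok(Duration::ZERO),
        // Buffer full: time the blocking send that waits for the receiver.
        Err(TrySendError::Full(msg)) => {
            let start = Instant::now();
            tx.send(msg)?;
            Ok(start.elapsed())
        }
        // Receiver dropped: hand the message back to the caller.
        Err(TrySendError::Disconnected(msg)) => Err(SendError(msg)),
    }
}
```

Summing the returned durations per upstream actor would give a blocking-time counter comparable to the one in the snippet above.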

Sunt-ing · 2022-07-04T14:23:54Z

I benchmarked the overhead on my M1:

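#[tokio::main]
async fn main() {
    let buffer_size = 10000;
    // Keep the receiver alive so that `send` does not fail. The buffer never
    // fills, so this measures the timing overhead around a non-blocking send.
    let (tx, _rx) = tokio::sync::mpsc::channel(buffer_size);
    tokio::spawn(async move {
        let mut cnt = 0;
        for i in 0..buffer_size {
            let start_time = madsim::time::Instant::now();
            tx.send(i).await.unwrap();
            cnt += start_time.elapsed().as_nanos();
        }
        // Average time per send, in nanoseconds.
        println!("{} ns", cnt as f64 / buffer_size as f64);
    })
    .await
    .unwrap();
}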

The result is only hundreds of nanoseconds per send.

Sunt-ing · 2022-07-04T18:24:19Z

The benchmark result of the updated implementation looks like this:

codecov · 2022-07-04T18:30:35Z

Codecov Report

Merging #3636 (d14684e) into main (c3a17ce) will decrease coverage by 0.01%.
The diff coverage is 51.47%.

@@            Coverage Diff             @@
##             main    #3636      +/-   ##
==========================================
- Coverage   74.37%   74.36%   -0.02%     
==========================================
  Files         776      776              
  Lines      110163   110213      +50     
==========================================
+ Hits        81937    81958      +21     
- Misses      28226    28255      +29

Flag	Coverage Δ
rust	`74.36% <51.47%> (-0.02%)`	⬇️

Impacted Files	Coverage Δ
src/stream/src/task/stream_manager.rs	`0.00% <0.00%> (ø)`
src/stream/src/executor/dispatch.rs	`75.42% <46.93%> (-1.69%)`	⬇️
src/stream/src/executor/integration_tests.rs	`98.56% <100.00%> (ø)`
src/stream/src/executor/monitor/streaming_stats.rs	`100.00% <100.00%> (ø)`
src/meta/src/hummock/mock_hummock_meta_client.rs	`40.56% <0.00%> (-0.95%)`	⬇️
src/frontend/src/expr/utils.rs	`98.99% <0.00%> (-0.26%)`	⬇️
src/storage/src/hummock/local_version_manager.rs	`81.38% <0.00%> (-0.12%)`	⬇️
src/meta/src/manager/id.rs	`95.50% <0.00%> (ø)`
src/meta/src/rpc/server.rs	`80.51% <0.00%> (+0.23%)`	⬆️

grafana/risingwave-dashboard.py

StrikeW

LGTM. There are 39 commits in this PR; please remember to clean up the commit log when you merge it.

StrikeW · 2022-07-05T01:51:19Z

src/stream/src/task/stream_manager.rs

@@ -65,8 +65,10 @@ pub struct LocalStreamManagerCore {

    /// Stores all actor information, taken after actor built.
    actors: HashMap<ActorId, stream_plan::StreamActor>,
-    /// Store all actor execution time montioring tasks.
+
+    /// Stores all actor tokio runtime montioring tasks.
    actor_monitor_tasks: HashMap<ActorId, JoinHandle<()>>,


remove this?

It's still used for another metric. I'm thinking of moving that metric into the actor itself instead of spawning new tasks.

grafana/risingwave-dashboard.py

Sunt-ing · 2022-07-05T02:29:48Z

I believe I have addressed all the comments.

Sunt-ing added 22 commits June 27, 2022 19:27

add backpressure metrics

73d5dc4

add backpressure metrics

a90d14c

change grafana

5fa8189

change metrics name

339d140

change metrics name

d9f309e

change metrics name

1a790f7

change metrics name

f19ed1e

change metrics name

2d64451

change metrics name

5e4de41

change metrics name

6b9bf65

change metrics name

71de720

change metrics name

b34cb59

fix comflicts

a95461e

add backpressure metrics

afb3fcb

add backpressure metrics

dbc73a0

add backpressure metrics

40ced82

add backpressure metrics

89c3dca

clippy

a118f48

add backpressure metrics

8ea7094

add backpressure metrics

b1836cc

add backpressure metrics

3314784

add backpressure metrics

b088cef

Sunt-ing requested a review from StrikeW July 4, 2022 12:18

fix conflicts

be717de

github-actions bot added the type/feature label Jul 4, 2022

skyzh reviewed Jul 4, 2022

skyzh suggested changes Jul 4, 2022

Sunt-ing added 13 commits July 4, 2022 22:52

change to measure accurate blocking time

4fd012a

change to measure accurate blocking time

e044d8c

change to measure accurate blocking time

cd3e355

change to measure accurate blocking time

16944b0

change to measure accurate blocking time

b68455c

change to measure accurate blocking time

a81719c

change to measure accurate blocking time

d942b0a

change to measure accurate blocking time

ee81d35

change to measure accurate blocking time

8596c77

change to measure accurate blocking time

e7a3b6d

change to measure accurate blocking time

a006d75

change to measure accurate blocking time

36af308

change to measure accurate blocking time

2d545b9

skyzh suggested changes Jul 5, 2022

grafana/risingwave-dashboard.py (outdated)

Sunt-ing added 3 commits July 5, 2022 10:05

reword

596cf79

reword

86bd12e

reword

e6a8510

StrikeW approved these changes Jul 5, 2022

fix Grafana

f34c358

Merge branch 'main' into sunt_backpressure_metrics

d14684e

Sunt-ing requested a review from skyzh July 5, 2022 02:41

skyzh approved these changes Jul 5, 2022

Sunt-ing enabled auto-merge (squash) July 5, 2022 02:45

Sunt-ing merged commit be7830e into main Jul 5, 2022

Sunt-ing deleted the sunt_backpressure_metrics branch July 5, 2022 02:54

nasnoisaac pushed a commit to nasnoisaac/risingwave that referenced this pull request Aug 9, 2022

feat(metrics): add backpressure metrics (risingwavelabs#3636)

dfb2ade
