
feat(metrics): add actor input and output row number #3391

Merged
7 commits merged into main on Jun 23, 2022

Conversation

@MingjiHan99 (Contributor) commented on Jun 21, 2022

I hereby agree to the terms of the Singularity Data, Inc. Contributor License Agreement.

What's changed and what's your intention?

Add actor input and output row-count metrics by collecting the row counts of messages passing through DispatchExecutor, ReceiverExecutor, and MergeExecutor.
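For context, a minimal sketch of how such per-actor counters might be registered with the prometheus crate. The struct layout, constructor, and help strings below are assumptions for illustration and not the PR's actual streaming_stats.rs code; the metric names stream_actor_in_record_cnt and stream_actor_out_record_cnt are taken from the review discussion below.

```rust
use prometheus::{register_int_counter_vec_with_registry, IntCounterVec, Registry};

/// Hypothetical sketch of the two per-actor counters, keyed by actor_id.
pub struct StreamingMetrics {
    pub actor_in_record_cnt: IntCounterVec,
    pub actor_out_record_cnt: IntCounterVec,
}

impl StreamingMetrics {
    pub fn new(registry: Registry) -> Self {
        // Rows received by an actor (incremented on the input side).
        let actor_in_record_cnt = register_int_counter_vec_with_registry!(
            "stream_actor_in_record_cnt",
            "Total number of rows an actor has received",
            &["actor_id"],
            registry
        )
        .unwrap();
        // Rows emitted by an actor (incremented on the dispatch/output side).
        let actor_out_record_cnt = register_int_counter_vec_with_registry!(
            "stream_actor_out_record_cnt",
            "Total number of rows an actor has emitted",
            &["actor_id"],
            registry
        )
        .unwrap();
        Self {
            actor_in_record_cnt,
            actor_out_record_cnt,
        }
    }
}
```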

Checklist

  • I have written necessary docs and comments
  • I have added necessary unit tests and integration tests
  • All checks passed in ./risedev check (or alias, ./risedev c)

@codecov (codecov bot) commented on Jun 21, 2022

Codecov Report

Merging #3391 (efbda6a) into main (c88f830) will increase coverage by 0.01%.
The diff coverage is 95.91%.

@@            Coverage Diff             @@
##             main    #3391      +/-   ##
==========================================
+ Coverage   73.81%   73.83%   +0.01%     
==========================================
  Files         765      765              
  Lines      105357   105450      +93     
==========================================
+ Hits        77767    77855      +88     
- Misses      27590    27595       +5     
| Flag | Coverage Δ |
| ---- | ---------- |
| rust | 73.83% <95.91%> (+0.01%) ⬆️ |

Flags with carried forward coverage won't be shown.

| Impacted Files | Coverage Δ |
| -------------- | ---------- |
| src/stream/src/executor/debug/trace.rs | 0.00% <ø> (ø) |
| src/stream/src/from_proto/merge.rs | 0.00% <0.00%> (ø) |
| src/stream/src/task/stream_manager.rs | 0.00% <0.00%> (ø) |
| src/stream/src/executor/dispatch.rs | 77.32% <100.00%> (+0.37%) ⬆️ |
| src/stream/src/executor/integration_tests.rs | 98.57% <100.00%> (+0.16%) ⬆️ |
| src/stream/src/executor/merge.rs | 91.91% <100.00%> (+0.67%) ⬆️ |
| src/stream/src/executor/monitor/streaming_stats.rs | 100.00% <100.00%> (ø) |
| src/stream/src/executor/receiver.rs | 72.72% <100.00%> (+9.31%) ⬆️ |
| src/frontend/src/expr/utils.rs | 98.74% <0.00%> (-0.51%) ⬇️ |
| src/connector/src/filesystem/file_common.rs | 80.80% <0.00%> (+0.44%) ⬆️ |


@skyzh (Contributor) left a comment

Doesn't stream_actor_row_count already do what you want?

@MingjiHan99 requested a review from skyzh on June 22, 2022 17:37
@MingjiHan99 (Contributor, Author) commented:

> Doesn't stream_actor_row_count already do what you want?

We hope to get the actor input and output rates directly from the dashboard.

@skyzh (Contributor) left a comment

I might be being captious and holding this PR to a somewhat high standard rather than treating it as research code. Just a reminder: if you need to move quickly on your project, you can always fork and develop on your own branch instead of getting everything carefully reviewed and merged to main, which can be time-consuming for you. Anyway, feel free to ping me and the other developers involved in the scaling work to get things reviewed. For me, a large PR might take ~3 days to a week to review, depending on my workload at the time. Small PRs like this can be reviewed much more quickly.

Resolved review threads on:

  • grafana/risingwave-dashboard.py
  • src/stream/src/executor/debug/trace.rs
  • src/stream/src/executor/dispatch.rs
  • src/stream/src/executor/merge.rs
actor_id: u32,
status: OperatorInfoStatus,
upstreams: Vec<Receiver<Message>>,
metrics: Arc<StreamingMetrics>,
Reviewer comment (Contributor):

I think you can also map and record metrics at:

SelectReceivers::new(self.actor_id, status, upstreams, self.metrics.clone());
         // Channels that're blocked by the barrier to align.

         select_all.boxed()

instead of here. SelectReceivers just multiplexes the streams, and the metrics should not be coupled with this part.
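A minimal sketch of that suggestion, counting input rows by inspecting the merged stream rather than inside SelectReceivers. The actor_in_record_cnt field and chunk.cardinality() come from this PR's snippets; the surrounding variables and wiring are assumptions for illustration.

```rust
use futures::StreamExt;

// Hypothetical wiring: `select_all` is the merged stream returned by
// SelectReceivers::new(...), and `metrics` is the shared StreamingMetrics.
let actor_id_str = actor_id.to_string();
let metrics = metrics.clone();
let counted = select_all
    .inspect(move |msg| {
        // Only data chunks carry rows; barriers pass through uncounted.
        if let Message::Chunk(chunk) = msg {
            metrics
                .actor_in_record_cnt
                .with_label_values(&[&actor_id_str])
                .inc_by(chunk.cardinality() as u64);
        }
    })
    .boxed();
```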

@MingjiHan99 (Contributor, Author) replied:

Will do.

@MingjiHan99 (Contributor, Author) commented:

> I might be being captious and holding this PR to a somewhat high standard rather than treating it as research code. Just a reminder: if you need to move quickly on your project, you can always fork and develop on your own branch instead of getting everything carefully reviewed and merged to main, which can be time-consuming for you. Anyway, feel free to ping me and the other developers involved in the scaling work to get things reviewed. For me, a large PR might take ~3 days to a week to review, depending on my workload at the time. Small PRs like this can be reviewed much more quickly.

The true rate metrics and the input/output row counts are metrics used for auto-scaling. I plan to add more metrics after this PR, and I will get in touch with the other developers working on auto-scaling.

@MingjiHan99 requested a review from skyzh on June 23, 2022 03:32
@@ -23,7 +23,7 @@ use crate::executor::error::StreamExecutorError;
use crate::executor::monitor::StreamingMetrics;
use crate::executor::{ExecutorInfo, Message, MessageStream};
use crate::task::ActorId;

const ENABLE_EXECUTOR_ROW_COUNT: bool = false;
Reviewer comment (Contributor):

Better to add docs to this global variable:

/// Set to true to enable per-executor row count metrics. This will produce a lot of timeseries and might affect Prometheus performance. If you only need actor input and output row data, see stream_actor_in_record_cnt and stream_actor_out_record_cnt instead.
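Applied to the constant from the diff above, the suggested doc comment would look roughly like this:

```rust
/// Set to true to enable per-executor row count metrics. This will produce a lot of
/// timeseries and might affect Prometheus performance. If you only need actor input
/// and output row data, see `stream_actor_in_record_cnt` and
/// `stream_actor_out_record_cnt` instead.
const ENABLE_EXECUTOR_ROW_COUNT: bool = false;
```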

@@ -162,6 +164,12 @@ impl DispatchExecutorInner {
async fn dispatch(&mut self, msg: Message) -> Result<()> {
match msg {
Message::Chunk(chunk) => {
let actor_id_str = self.actor_id.to_string();
Reviewer comment (Contributor):

Pre-calculating means we should put this in the struct body, so that the string is generated only once over the actor's lifetime.
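A sketch of what that pre-calculation could look like; the trimmed-down struct and the actor_id_str field are hypothetical, shown only to illustrate computing the label once per actor lifetime instead of per chunk.

```rust
use std::sync::Arc;

// Hypothetical, trimmed-down DispatchExecutorInner: the stringified actor id
// is computed once at construction and reused for every metric update.
struct DispatchExecutorInner {
    actor_id: u32,
    actor_id_str: String, // pre-computed Prometheus label value
    metrics: Arc<StreamingMetrics>,
}

impl DispatchExecutorInner {
    fn new(actor_id: u32, metrics: Arc<StreamingMetrics>) -> Self {
        Self {
            actor_id,
            actor_id_str: actor_id.to_string(),
            metrics,
        }
    }
}

// Inside dispatch(), the label is then `&self.actor_id_str`, with no per-message
// `to_string()` allocation.
```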

metrics
.actor_in_record_cnt
.with_label_values(&[&actor_id_str])
.inc_by(chunk.cardinality().try_into().unwrap());
Reviewer comment (Contributor):

What's that try_into().unwrap() thing? Do you mean as u64?
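For comparison, the plain cast used in the other snippets of this PR reads:

```rust
metrics
    .actor_in_record_cnt
    .with_label_values(&[&actor_id_str])
    .inc_by(chunk.cardinality() as u64);
```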

@skyzh (Contributor) left a comment

Rest LGTM. Please address the remaining comments before merging. The new merging process is to add the mergify/can-merge label instead of clicking the merge button.

@Sunt-ing (Contributor) left a comment

LGTM. BTW, this PR is used for auto-scaling. Since auto-scaling is an important feature for our "ease of use" claims, this is expected to stay in the repo.

Comment on lines +55 to +60
if ENABLE_EXECUTOR_ROW_COUNT {
metrics
.executor_row_count
.with_label_values(&[&actor_id_string, &executor_id_string])
.inc_by(chunk.cardinality() as u64);
}
Reviewer comment (Contributor):

The sampling method can be used in executor_row_count. Maybe implement it in the next PR.
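One possible reading of the sampling idea, sketched as a simple buffer-and-flush so the Prometheus counter is only touched every Nth chunk; the sampler type, constant, and threshold below are hypothetical and not part of this PR.

```rust
use prometheus::IntCounter;

/// Hypothetical sampler: accumulate row counts locally and flush to the
/// Prometheus counter only once every SAMPLE_EVERY chunks.
const SAMPLE_EVERY: u64 = 16;

struct RowCountSampler {
    pending_rows: u64,
    chunks_seen: u64,
}

impl RowCountSampler {
    fn new() -> Self {
        Self { pending_rows: 0, chunks_seen: 0 }
    }

    fn record(&mut self, rows: u64, counter: &IntCounter) {
        self.pending_rows += rows;
        self.chunks_seen += 1;
        if self.chunks_seen % SAMPLE_EVERY == 0 {
            counter.inc_by(self.pending_rows);
            self.pending_rows = 0;
        }
    }
}
```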

Comment on lines +82 to +89
if ENABLE_EXECUTOR_ROW_COUNT {
if let Message::Chunk(chunk) = &message {
if chunk.cardinality() > 0 {
metrics
.executor_row_count
.with_label_values(&[&actor_id_string, &executor_id_string])
.inc_by(chunk.cardinality() as u64);
}
Reviewer comment (Contributor):

And here

Comment on lines +150 to +153
if let Ok(Message::Chunk(chunk)) = &msg {
metrics
.actor_in_record_cnt
.with_label_values(&[&actor_id_str])
Reviewer comment (Contributor):

and maybe here

@skyzh (Contributor) commented on Jun 23, 2022

> LGTM. BTW, this PR is used for auto-scaling. Since auto-scaling is an important feature for our "ease of use" claims, this is expected to stay in the repo.

Of course, as long as the code looks good enough :)

@Sunt-ing added, then removed, the mergify/can-merge label (indicates that the PR can be added to the merge queue) on Jun 23, 2022
@MingjiHan99 enabled auto-merge (squash) on June 23, 2022 11:53
@skyzh (Contributor) commented on Jun 23, 2022

I believe not all comments are resolved. Please help resolve them before merging :)

@MingjiHan99 merged commit 6701289 into main on Jun 23, 2022
@MingjiHan99 deleted the actor_in_out_record branch on June 23, 2022 15:34