
feat(metrics): add actor input and output row number #3391

Merged
7 commits merged into main on Jun 23, 2022

Conversation

@MingjiHan99 (Contributor) commented on Jun 21, 2022

I hereby agree to the terms of the Singularity Data, Inc. Contributor License Agreement.

What's changed and what's your intention?

Add actor input and output row-count metrics by collecting the row counts of messages passing through DispatchExecutor, ReceiverExecutor, and MergeExecutor.
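For context, a minimal sketch of how such per-actor counters might be registered with the prometheus crate. The struct layout, constructor, and help strings below are assumptions for illustration and not the PR's actual streaming_stats.rs code; the metric names stream_actor_in_record_cnt and stream_actor_out_record_cnt are taken from the review discussion below.

```rust
use prometheus::{register_int_counter_vec_with_registry, IntCounterVec, Registry};

/// Hypothetical sketch of the two per-actor counters, keyed by actor_id.
pub struct StreamingMetrics {
    pub actor_in_record_cnt: IntCounterVec,
    pub actor_out_record_cnt: IntCounterVec,
}

impl StreamingMetrics {
    pub fn new(registry: Registry) -> Self {
        // Rows received by an actor (incremented on the input side).
        let actor_in_record_cnt = register_int_counter_vec_with_registry!(
            "stream_actor_in_record_cnt",
            "Total number of rows an actor has received",
            &["actor_id"],
            registry
        )
        .unwrap();
        // Rows emitted by an actor (incremented on the dispatch/output side).
        let actor_out_record_cnt = register_int_counter_vec_with_registry!(
            "stream_actor_out_record_cnt",
            "Total number of rows an actor has emitted",
            &["actor_id"],
            registry
        )
        .unwrap();
        Self {
            actor_in_record_cnt,
            actor_out_record_cnt,
        }
    }
}
```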

Checklist

  • I have written necessary docs and comments
  • I have added necessary unit tests and integration tests
  • All checks passed in ./risedev check (or alias, ./risedev c)

@codecov (codecov bot) commented on Jun 21, 2022

Codecov Report

Merging #3391 (efbda6a) into main (c88f830) will increase coverage by 0.01%.
The diff coverage is 95.91%.

@@            Coverage Diff             @@
##             main    #3391      +/-   ##
==========================================
+ Coverage   73.81%   73.83%   +0.01%     
==========================================
  Files         765      765              
  Lines      105357   105450      +93     
==========================================
+ Hits        77767    77855      +88     
- Misses      27590    27595       +5     
| Flag | Coverage Δ |
| ---- | ---------- |
| rust | 73.83% <95.91%> (+0.01%) ⬆️ |

Flags with carried forward coverage won't be shown.

| Impacted Files | Coverage Δ |
| -------------- | ---------- |
| src/stream/src/executor/debug/trace.rs | 0.00% <ø> (ø) |
| src/stream/src/from_proto/merge.rs | 0.00% <0.00%> (ø) |
| src/stream/src/task/stream_manager.rs | 0.00% <0.00%> (ø) |
| src/stream/src/executor/dispatch.rs | 77.32% <100.00%> (+0.37%) ⬆️ |
| src/stream/src/executor/integration_tests.rs | 98.57% <100.00%> (+0.16%) ⬆️ |
| src/stream/src/executor/merge.rs | 91.91% <100.00%> (+0.67%) ⬆️ |
| src/stream/src/executor/monitor/streaming_stats.rs | 100.00% <100.00%> (ø) |
| src/stream/src/executor/receiver.rs | 72.72% <100.00%> (+9.31%) ⬆️ |
| src/frontend/src/expr/utils.rs | 98.74% <0.00%> (-0.51%) ⬇️ |
| src/connector/src/filesystem/file_common.rs | 80.80% <0.00%> (+0.44%) ⬆️ |


@skyzh (Contributor) left a comment

Doesn't stream_actor_row_count already do what you want?

@MingjiHan99 requested a review from skyzh on June 22, 2022 17:37
@MingjiHan99 (Contributor, Author) commented:

> Doesn't stream_actor_row_count already do what you want?

We hope to get the actor input and output rates directly from the dashboard.

@skyzh (Contributor) left a comment

I might be being captious and holding this PR to a somewhat high standard rather than treating it as research code. Just a reminder: if you need to move quickly on your project, you can always fork and develop on your own branch instead of getting everything carefully reviewed and merged to main, which can be time-consuming for you. Anyway, feel free to ping me and the other developers involved in the scaling work to get things reviewed. For me, a large PR might take ~3 days to a week to review, depending on my workload at the time. Small PRs like this can be reviewed much more quickly.

Resolved review threads on:

  • grafana/risingwave-dashboard.py
  • src/stream/src/executor/debug/trace.rs
  • src/stream/src/executor/dispatch.rs
  • src/stream/src/executor/merge.rs
actor_id: u32,
status: OperatorInfoStatus,
upstreams: Vec<Receiver<Message>>,
metrics: Arc<StreamingMetrics>,
Reviewer comment (Contributor):

I think you can also map and record metrics at:

SelectReceivers::new(self.actor_id, status, upstreams, self.metrics.clone());
         // Channels that're blocked by the barrier to align.

         select_all.boxed()

instead of here. SelectReceivers just multiplexes the streams, and the metrics should not be coupled with this part.
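A minimal sketch of that suggestion, counting input rows by inspecting the merged stream rather than inside SelectReceivers. The actor_in_record_cnt field and chunk.cardinality() come from this PR's snippets; the surrounding variables and wiring are assumptions for illustration.

```rust
use futures::StreamExt;

// Hypothetical wiring: `select_all` is the merged stream returned by
// SelectReceivers::new(...), and `metrics` is the shared StreamingMetrics.
let actor_id_str = actor_id.to_string();
let metrics = metrics.clone();
let counted = select_all
    .inspect(move |msg| {
        // Only data chunks carry rows; barriers pass through uncounted.
        if let Message::Chunk(chunk) = msg {
            metrics
                .actor_in_record_cnt
                .with_label_values(&[&actor_id_str])
                .inc_by(chunk.cardinality() as u64);
        }
    })
    .boxed();
```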

@MingjiHan99 (Contributor, Author) replied:

Will do.

@MingjiHan99 (Contributor, Author) commented:

> I might be being captious and holding this PR to a somewhat high standard rather than treating it as research code. Just a reminder: if you need to move quickly on your project, you can always fork and develop on your own branch instead of getting everything carefully reviewed and merged to main, which can be time-consuming for you. Anyway, feel free to ping me and the other developers involved in the scaling work to get things reviewed. For me, a large PR might take ~3 days to a week to review, depending on my workload at the time. Small PRs like this can be reviewed much more quickly.

The true rate metrics and the input/output row counts are metrics used for auto-scaling. I plan to add more metrics after this PR, and I will get in touch with the other developers working on auto-scaling.

@MingjiHan99 requested a review from skyzh on June 23, 2022 03:32
@@ -23,7 +23,7 @@ use crate::executor::error::StreamExecutorError;
use crate::executor::monitor::StreamingMetrics;
use crate::executor::{ExecutorInfo, Message, MessageStream};
use crate::task::ActorId;

const ENABLE_EXECUTOR_ROW_COUNT: bool = false;
Reviewer comment (Contributor):

Better to add docs to this global variable:

/// Set to true to enable per-executor row count metrics. This will produce a lot of timeseries and might affect Prometheus performance. If you only need actor input and output row data, see stream_actor_in_record_cnt and stream_actor_out_record_cnt instead.
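Applied to the constant from the diff above, the suggested doc comment would look roughly like this:

```rust
/// Set to true to enable per-executor row count metrics. This will produce a lot of
/// timeseries and might affect Prometheus performance. If you only need actor input
/// and output row data, see `stream_actor_in_record_cnt` and
/// `stream_actor_out_record_cnt` instead.
const ENABLE_EXECUTOR_ROW_COUNT: bool = false;
```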

@@ -162,6 +164,12 @@ impl DispatchExecutorInner {
async fn dispatch(&mut self, msg: Message) -> Result<()> {
match msg {
Message::Chunk(chunk) => {
let actor_id_str = self.actor_id.to_string();
Reviewer comment (Contributor):

Pre-calculating means we should put this in the struct body, so that the string is generated only once over the actor's lifetime.
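A sketch of what that pre-calculation could look like; the trimmed-down struct and the actor_id_str field are hypothetical, shown only to illustrate computing the label once per actor lifetime instead of per chunk.

```rust
use std::sync::Arc;

// Hypothetical, trimmed-down DispatchExecutorInner: the stringified actor id
// is computed once at construction and reused for every metric update.
struct DispatchExecutorInner {
    actor_id: u32,
    actor_id_str: String, // pre-computed Prometheus label value
    metrics: Arc<StreamingMetrics>,
}

impl DispatchExecutorInner {
    fn new(actor_id: u32, metrics: Arc<StreamingMetrics>) -> Self {
        Self {
            actor_id,
            actor_id_str: actor_id.to_string(),
            metrics,
        }
    }
}

// Inside dispatch(), the label is then `&self.actor_id_str`, with no per-message
// `to_string()` allocation.
```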

metrics
.actor_in_record_cnt
.with_label_values(&[&actor_id_str])
.inc_by(chunk.cardinality().try_into().unwrap());
Reviewer comment (Contributor):

What's that try_into().unwrap() thing? Do you mean as u64?
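For comparison, the plain cast used in the other snippets of this PR reads:

```rust
metrics
    .actor_in_record_cnt
    .with_label_values(&[&actor_id_str])
    .inc_by(chunk.cardinality() as u64);
```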

@skyzh (Contributor) left a comment

Rest LGTM. Please address the remaining comments before merging. The new merging process is to add the mergify/can-merge label instead of clicking the merge button.

@Sunt-ing (Contributor) left a comment

LGTM. BTW, this PR is used for auto-scaling. Since auto-scaling is an important feature for our "ease of use" claims, this is expected to stay in the repo.

Comment on lines +55 to +60
if ENABLE_EXECUTOR_ROW_COUNT {
metrics
.executor_row_count
.with_label_values(&[&actor_id_string, &executor_id_string])
.inc_by(chunk.cardinality() as u64);
}
Reviewer comment (Contributor):

The sampling method can be used in executor_row_count. Maybe implement it in the next PR.
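One possible reading of the sampling idea, sketched as a simple buffer-and-flush so the Prometheus counter is only touched every Nth chunk; the sampler type, constant, and threshold below are hypothetical and not part of this PR.

```rust
use prometheus::IntCounter;

/// Hypothetical sampler: accumulate row counts locally and flush to the
/// Prometheus counter only once every SAMPLE_EVERY chunks.
const SAMPLE_EVERY: u64 = 16;

struct RowCountSampler {
    pending_rows: u64,
    chunks_seen: u64,
}

impl RowCountSampler {
    fn new() -> Self {
        Self { pending_rows: 0, chunks_seen: 0 }
    }

    fn record(&mut self, rows: u64, counter: &IntCounter) {
        self.pending_rows += rows;
        self.chunks_seen += 1;
        if self.chunks_seen % SAMPLE_EVERY == 0 {
            counter.inc_by(self.pending_rows);
            self.pending_rows = 0;
        }
    }
}
```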

Comment on lines +82 to +89
if ENABLE_EXECUTOR_ROW_COUNT {
if let Message::Chunk(chunk) = &message {
if chunk.cardinality() > 0 {
metrics
.executor_row_count
.with_label_values(&[&actor_id_string, &executor_id_string])
.inc_by(chunk.cardinality() as u64);
}
Reviewer comment (Contributor):

And here

Comment on lines +150 to +153
if let Ok(Message::Chunk(chunk)) = &msg {
metrics
.actor_in_record_cnt
.with_label_values(&[&actor_id_str])
Reviewer comment (Contributor):

and maybe here

@skyzh (Contributor) commented on Jun 23, 2022

> LGTM. BTW, this PR is used for auto-scaling. Since auto-scaling is an important feature for our "ease of use" claims, this is expected to stay in the repo.

Of course, as long as the code looks good enough :)

@Sunt-ing added, then removed, the mergify/can-merge label (indicates that the PR can be added to the merge queue) on Jun 23, 2022
@MingjiHan99 enabled auto-merge (squash) on June 23, 2022 11:53
@skyzh (Contributor) commented on Jun 23, 2022

I believe not all comments are resolved. Please help resolve them before merging :)

@MingjiHan99 merged commit 6701289 into main on Jun 23, 2022
@MingjiHan99 deleted the actor_in_out_record branch on June 23, 2022 15:34