*: add metrics about raft message wait flush duration #16214

Open · wants to merge 4 commits into master
Conversation

crazycs520
Contributor

@crazycs520 crazycs520 commented Dec 22, 2023

What is changed and how it works?

Issue Number: ref #12362

What's Changed:

add metrics about raft message wait flush duration 

In the following metrics, I restarted 2 TiKV servers every 2 minutes. During the restarts, the raft message wait flush duration was high, probably caused by waiting for the connection to be re-established. However, the reason for the high wait flush latency at other times is not known.

[screenshot: raft message wait flush duration metrics]
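The pattern discussed in this PR can be sketched as follows: record one timestamp when the first message is pushed after a flush, and observe the elapsed time when the flusher drains the queue. This is a minimal illustration only; the type and method names are made up and do not match TiKV's actual code.

```rust
use std::time::{Duration, Instant};

/// Minimal sketch (illustrative names, not TiKV's actual types) of the
/// "record_begin_wait / take_begin_wait" pattern discussed in this PR:
/// one timestamp per flush cycle instead of one per message.
struct RaftMsgQueue {
    msgs: Vec<u64>,              // message payloads, simplified to u64
    begin_wait: Option<Instant>, // set on the first push after a flush
}

impl RaftMsgQueue {
    fn new() -> Self {
        RaftMsgQueue { msgs: Vec::new(), begin_wait: None }
    }

    fn push(&mut self, msg: u64) {
        if self.begin_wait.is_none() {
            self.begin_wait = Some(Instant::now()); // record_begin_wait
        }
        self.msgs.push(msg);
    }

    /// Drains the queue; returns the batch plus the wait-flush duration
    /// to observe into the histogram, if a wait was in progress.
    fn flush(&mut self) -> (Vec<u64>, Option<Duration>) {
        let batch = std::mem::take(&mut self.msgs);
        let waited = self.begin_wait.take().map(|t| t.elapsed()); // take_begin_wait
        (batch, waited)
    }
}
```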

Related changes

  • Need to cherry-pick to the release branch

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No code

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Release note

None

Signed-off-by: crazycs520 <crazycs520@gmail.com>
Signed-off-by: crazycs520 <crazycs520@gmail.com>
Contributor

ti-chi-bot bot commented Dec 22, 2023

[REVIEW NOTIFICATION]

This pull request has been approved by:

  • Connor1996
  • bufferflies

To complete the pull request process, please ask the reviewers in the list to review by filling /cc @reviewer in the comment.
After your PR has acquired the required number of LGTMs, you can assign this pull request to the committer in the list by filling /assign @committer in the comment to help you merge this pull request.

The full list of commands accepted by this bot can be found here.

Reviewer can indicate their review by submitting an approval review.
Reviewer can cancel approval by submitting a request changes review.

@ti-chi-bot ti-chi-bot bot added release-note-none contribution Type: PR - From contributors size/XXL labels Dec 22, 2023
@crazycs520
Contributor Author

@cfzjywxk @overvenus PTAL

Signed-off-by: crazycs520 <crazycs520@gmail.com>
Member

@Connor1996 Connor1996 left a comment


So why is the wait flush duration so high?

@ti-chi-bot ti-chi-bot bot added the status/LGT1 Status: PR - There is already 1 approval label Dec 26, 2023
@crazycs520
Contributor Author

So why is the wait flush duration so high?

As shown in the monitoring in the PR description, I restarted 2 TiKV servers every 2 minutes. During the restarts, the wait flush duration was high, probably because of waiting for the connection to be re-established. However, the reason for the high wait flush latency at other times is not known.

@crazycs520
Contributor Author

@overvenus @cfzjywxk PTAL

@cfzjywxk
Collaborator

So why is the wait flush duration so high?

As shown in the monitoring in the PR description, I restarted 2 TiKV servers every 2 minutes. During the restarts, the wait flush duration was high, probably because of waiting for the connection to be re-established. However, the reason for the high wait flush latency at other times is not known.

@crazycs520
It would be better to describe the tests and the observed phenomenon; that would help explain why this metric is useful and how it is used for diagnosis.

@ti-chi-bot ti-chi-bot bot added status/LGT2 Status: PR - There are already 2 approvals and removed status/LGT1 Status: PR - There is already 1 approval labels Dec 26, 2023
@cfzjywxk cfzjywxk requested a review from zyguan December 26, 2023 08:20
Contributor

@zyguan zyguan left a comment


I don't get the meaning of the metric. Think about the following case, observe(t5-t1) can be thought as the flush wait duration of m1, but how to understand observe(t8-t6)?

t1: push(m1)       record_begin_wait
t2: push(m2)
t3: push(m3)
t4: push(m4)
t5: flush([m1,m2]) take_begin_wait     observe(t5-t1)
t6: push(m5)       record_begin_wait
t7: push(m6)
t8: flush([m3,m4]) take_begin_wait     observe(t8-t6)
t9: flush([m5,m6])

@crazycs520
Contributor Author

I don't get the meaning of the metric. Think about the following case, observe(t5-t1) can be thought as the flush wait duration of m1, but how to understand observe(t8-t6)?

t1: push(m1)       record_begin_wait
t2: push(m2)
t3: push(m3)
t4: push(m4)
t5: flush([m1,m2]) take_begin_wait     observe(t5-t1)
t6: push(m5)       record_begin_wait
t7: push(m6)
t8: flush([m3,m4]) take_begin_wait     observe(t8-t6)
t9: flush([m5,m6])

Nice catch, but this shouldn't happen, since the default config is:

raft_client_queue_size =  8192;
raft_client_grpc_send_msg_buffer = 512 * 1024;

So each flush will send all messages in the queue. The example should therefore be:

t1: push(m1)       record_begin_wait
t2: push(m2)
t3: push(m3)
t4: push(m4)
t5: flush([m1,m2,m3,m4]) take_begin_wait     observe(t5-t1)
t6: push(m5)       record_begin_wait
t7: push(m6)
t8: flush([m5,m6]) take_begin_wait     observe(t8-t6)

I don't record the timestamp for each message to avoid performance issues.
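For contrast, the per-message alternative mentioned here (not what this PR implements; the names below are purely illustrative) would tag every message with its push time, giving exact per-message wait durations at the cost of one timestamp per message plus one histogram observation per flushed message:

```rust
use std::time::{Duration, Instant};

/// Hypothetical per-message variant, shown only to illustrate the
/// trade-off: exact wait durations, but an extra Instant per message.
struct TimedQueue<T> {
    msgs: Vec<(T, Instant)>,
}

impl<T> TimedQueue<T> {
    fn new() -> Self {
        TimedQueue { msgs: Vec::new() }
    }

    fn push(&mut self, msg: T) {
        // Record the push time alongside each message.
        self.msgs.push((msg, Instant::now()));
    }

    /// Drains the queue, returning each message with its exact wait time
    /// (one histogram observation per message would be made here).
    fn flush(&mut self) -> Vec<(T, Duration)> {
        self.msgs.drain(..).map(|(m, t)| (m, t.elapsed())).collect()
    }
}
```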

@zyguan
Contributor

zyguan commented Dec 26, 2023

I don't get the meaning of the metric. Think about the following case, observe(t5-t1) can be thought as the flush wait duration of m1, but how to understand observe(t8-t6)?

t1: push(m1)       record_begin_wait
t2: push(m2)
t3: push(m3)
t4: push(m4)
t5: flush([m1,m2]) take_begin_wait     observe(t5-t1)
t6: push(m5)       record_begin_wait
t7: push(m6)
t8: flush([m3,m4]) take_begin_wait     observe(t8-t6)
t9: flush([m5,m6])

Nice catch, but this shouldn't happen, since the default config is:

raft_client_queue_size =  8192;
raft_client_grpc_send_msg_buffer = 512 * 1024;

So each flush will send all messages in the queue. The example should therefore be:

t1: push(m1)       record_begin_wait
t2: push(m2)
t3: push(m3)
t4: push(m4)
t5: flush([m1,m2,m3,m4]) take_begin_wait     observe(t5-t1)
t6: push(m5)       record_begin_wait
t7: push(m6)
t8: flush([m5,m6]) take_begin_wait     observe(t8-t6)

I don't record the timestamp for each message to avoid performance issues.

According to the code, the capacity of the buffer is raft_msg_max_batch_size, which is 128 by default. Besides, the observed duration could be hard to understand even when cap(buffer) > cap(queue). Think about the following case, where cap(queue) = 2 and cap(buffer) = 3, how to explain observe(t13-t11)?

t0:  init           queue:[]           buffer:[]
t1:  push(m1)       queue:[m1]         buffer:[]               record_begin_wait
t2:  push(m2)       queue:[m1,m2]      buffer:[]
t3:  fill_msg       queue:[]           buffer:[m1,m2]
t4:  push(m3)       queue:[m3]         buffer:[m1,m2]
t5:  push(m4)       queue:[m3,m4]      buffer:[m1,m2]
t6:  fill_msg       queue:[]           buffer:[m1,m2,m3]+m4
t7:  push(m5)       queue:[m5]         buffer:[m1,m2,m3]+m4
t8:  push(m6)       queue:[m5,m6]      buffer:[m1,m2,m3]+m4
t9:  flush          queue:[m5,m6]      buffer:[m4]             take_begin_wait    observe(t9-t1)
t10: fill_msg       queue:[]           buffer:[m4,m5,m6]
t11: push(m7)       queue:[m7]         buffer:[m4,m5,m6]       record_begin_wait
t12: fill_msg       queue:[]           buffer:[m4,m5,m6]+m7
t13: flush          queue:[]           buffer:[m7]             take_begin_wait    observe(t13-t11)

@crazycs520
Contributor Author

According to the code, the capacity of the buffer is raft_msg_max_batch_size, which is 128 by default. Besides, the observed duration could be hard to understand even when cap(buffer) > cap(queue). Think about the following case, where cap(queue) = 2 and cap(buffer) = 3, how to explain observe(t13-t11)?

t0:  init           queue:[]           buffer:[]
t1:  push(m1)       queue:[m1]         buffer:[]               record_begin_wait
t2:  push(m2)       queue:[m1,m2]      buffer:[]
t3:  fill_msg       queue:[]           buffer:[m1,m2]
t4:  push(m3)       queue:[m3]         buffer:[m1,m2]
t5:  push(m4)       queue:[m3,m4]      buffer:[m1,m2]
t6:  fill_msg       queue:[]           buffer:[m1,m2,m3]+m4
t7:  push(m5)       queue:[m5]         buffer:[m1,m2,m3]+m4
t8:  push(m6)       queue:[m5,m6]      buffer:[m1,m2,m3]+m4
t9:  flush          queue:[m5,m6]      buffer:[m4]             take_begin_wait    observe(t9-t1)
t10: fill_msg       queue:[]           buffer:[m4,m5,m6]
t11: push(m7)       queue:[m7]         buffer:[m4,m5,m6]       record_begin_wait
t12: fill_msg       queue:[]           buffer:[m4,m5,m6]+m7
t13: flush          queue:[]           buffer:[m7]             take_begin_wait    observe(t13-t11)

It seems that when the send buffer overflows, this metric becomes inaccurate: the observed duration is smaller than the actual flush wait time.

Contributor

ti-chi-bot bot commented Jan 5, 2024

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@cfzjywxk
Collaborator

@crazycs520
Please resolve the conflicts.

@zyguan PTAL again

@zyguan
Contributor

zyguan commented Jan 10, 2024

@crazycs520 Please resolve the conflicts.

@zyguan PTAL again

It seems there has been no change since my last review. The issue with the current implementation is that the observed duration is not well defined, which may lead to:

  • the 1st case since raft_client_queue_size >> raft_msg_max_batch_size
  • the 2nd case even when the batch size > the queue size

The raft client can be described as follows (note that grpc_raft_conn_num can be greater than 1). IMO, push events and flush events are relatively independent.

[diagram: raft client push/flush architecture]

Labels
contribution Type: PR - From contributors needs-rebase release-note-none size/XXL status/LGT2 Status: PR - There are already 2 approvals
6 participants