-
Notifications
You must be signed in to change notification settings - Fork 248
A100: Client-side WRR slow start configuration #498
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
A100: Client-side WRR slow start configuration #498
Conversation
CC @dfawley |
@dfawley did we get a chance to check this |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This would be a nice addition (it's something we've had requests for internally at Datadog). Generally the approach sounds good to me. Ideally we would be able to support this mechanism for other load balancers supported by gRPC and Envoy (lr & rr). But given that lr & rr do not implement endpoint weights, it seems difficult to fit the feature in those balancers.
@atollena I have addressed your comments regarding the proposal, let me know if I need to make any changes in it |
@ejona86 can I start with the implementation of the feature? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be useful to see an implementation of this in one of the 3 languages. I wouldn't wait for the gRFC to be approved before starting, especially since the result can be used as a custom, private LB policy before it is merged to upstream-supported gRPC implementations.
|
||
## Implementation | ||
|
||
This will be implemented in all languages C++, Java, and Go. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a good place to add the actual implementation of your choice when you have it ready (it can be a draft or closed PR that you can re-open once the gRFC has been approved).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have added the implementation for Java, would love if you could review the changes :)
10be215
to
43b016a
Compare
@atollena I have opened the respective MR at the envoy side as well, they will merge it once the proposal is approved at our end. For this proposal I just need to handle the case for the timing part and make the respective changes in the proposal once done I will assign for re-review. |
Signed-off-by: anurag.ag <anuragagarwal561994@users.noreply.github.com>
Signed-off-by: anurag.ag <anuragagarwal561994@users.noreply.github.com>
Signed-off-by: anurag.ag <anuragagarwal561994@users.noreply.github.com>
Signed-off-by: anurag.ag <anuragagarwal561994@users.noreply.github.com>
Signed-off-by: anurag.ag <anuragagarwal561994@users.noreply.github.com>
Signed-off-by: anurag.ag <anuragagarwal561994@users.noreply.github.com>
Signed-off-by: anurag.ag <anuragagarwal561994@users.noreply.github.com>
Signed-off-by: anurag.ag <anuragagarwal561994@users.noreply.github.com>
Signed-off-by: anurag.ag <anuragagarwal561994@users.noreply.github.com>
0a3ea16
to
e490283
Compare
Signed-off-by: anurag.ag <anuragagarwal561994@users.noreply.github.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
Co-authored-by: Antoine Tollenaere <atollena@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This mostly seems fine to me, I just have some nits. Also, since I don't see a slow start config in the envoy WRR policy proto, I'm curious if you've (or someone else has?) sent a proposal to add it there already.
|
||
## Metrics | ||
|
||
The following metric will be exposed to help monitor the slow start behavior: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How important is this metric to include? Should we consider not adding it until it's needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ideally in most of the cases the slow_start_window will be applicable only for a short duration like 1-2 mins and if everything works well, it will not be called again.
I just thought that we can implement it in disabled mode, if someone wants to keeps a track of it.
We can decide this and I will make the respective changes in the proposal
@dfawley envoyproxy/envoy#40090 this is the MR I have created at envoy's end to include the slow_start_config in proto. However it requires this grpc proposal to be approved first before it is merged to the master. Then we will need to sync across repo and then we are good to implement this feature. I have added sample MR with the java implementation as well, please do check it as well. |
Co-authored-by: Doug Fawley <dfawley@google.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for writing this up!
// | ||
// As time progresses, more and more traffic would be sent to endpoint, which is in slow start window. | ||
// Once host exits slow start, time_factor and aggression no longer affect its weight. | ||
google.protobuf.FloatValue aggression = 2; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Assuming 0 is not a valid value here, I think we can use float
instead of google.protobuf.FloatValue
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@markdroth I am actually confused here on how to integrate this. Actually envoy already has this proto for SlowStartConfig
which it uses for RR and LR load balancing.
Hence, there was no need of any changes in SlowStartConfig
proto anywhere, the only change required was to inlcude it in the ClientSideWeightedRoundRobin
proto at the envoy side.
So I think in an earlier suggestion on this proposal, I was asked to change it to a google type while envoy types were being used 60cc616
I just wanted to reference this proto for others to relate to the equations, but if we are going to use the envoy proto only then there we should make minimal changes for not confusing others. What would you suggest.
// Configures the minimum percentage of the original weight that will be used for an endpoint | ||
// in slow start. This helps to avoid a scenario in which endpoints receive no traffic during the | ||
// slow start window. Must be between 0 and 100. If not specified, the default is 10%. | ||
google.protobuf.UInt32Value min_weight_percent = 3; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If 0 is not a valid value here, I think we can use uint32
instead of google.protobuf.UInt32Value
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same concern as stated above
Signed-off-by: anurag.ag <anuragagarwal561994@users.noreply.github.com>
Signed-off-by: anurag.ag <anuragagarwal561994@users.noreply.github.com>
@markdroth I have made the respective changes and added a few comments for where I was confused, please help me resolve the same. |
No description provided.