Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add basic connection management and keep alive parameters #1334

Merged
merged 6 commits into from
Mar 30, 2021

Conversation

vitarb
Copy link
Contributor

@vitarb vitarb commented Feb 26, 2021

Summary:
Added default keep-alive server parameters and enforcement policy for gRPC connection management.
This compliments client-side changes that we recently made in Go and java SDKs.

List of added properties (with frontend.keepAlive prefix) and their default values:

  • MinTime: minimum amount of time a client should wait before sending a keepalive ping. Default 1 min. Rationale - set slightly higher than minimal interval to avoid lots of spamming requests.
  • PermitWithoutStream: If true, server allows keepalive pings even when there are no active streams(RPCs). Default is true. Rationale - allow clients ping server without having active RPCs.
  • MaxConnectionIdle: is a duration for the amount of time after which an idle connection would be closed by sending a GoAway. Default is 10 min. Rationale - default client-side ping interval is 60 sec and long poll interval is every 60-70 sec. If client is idle for 2 minutes it must be dead and connection can be closed.
  • MaxConnectionAge: is a duration for the maximum amount of time a connection may exist before it will be closed by sending a GoAway. Default is 20 min. Rationale - periodically close connections to re-balance them more evenly between front-ends.
  • MaxConnectionAgeGrace: is an additive period after MaxConnectionAge after which the connection will be forcibly closed. Default is 2 min. Rationale - give some time for the client to finish existing RPCs and then close the channel gracefully.
  • Time: After a duration of this time if the server doesn't see any activity it pings the client to see if the transport is still alive. Default is 10 min. Server pings client if it is idle. Rationale - we want to avoid pings to workers as much as possible, setting it reasonably high so we don't hit it often.
  • Timeout: After having pinged for keepalive check, the server waits for a duration of Timeout and if no activity is seen even after that the connection is closed. Defaults to 10 sec.

@vitarb vitarb requested a review from a team February 26, 2021 05:53
@alexshtin
Copy link
Member

My proposal for default values:

  • MinTime: 30 sec
  • PermitWithoutStream: true
  • MaxConnectionIdle: 70 sec (may be just link it to longPollInterval config)
  • MaxConnectionAge: 10 min
  • MaxConnectionAgeGrace: 70 sec (let ongoing long poll to finish)
  • Time: 10 min
  • Timeout: 10 sec

KeepAlivePermitWithoutStream: dc.GetBoolProperty(dynamicconfig.KeepAlivePermitWithoutStream, true),
KeepAliveMaxConnectionIdle: dc.GetDurationProperty(dynamicconfig.KeepAliveMaxConnectionIdle, 10*time.Minute),
KeepAliveMaxConnectionAge: dc.GetDurationProperty(dynamicconfig.KeepAliveMaxConnectionAge, 20*time.Minute),
KeepAliveMaxConnectionAgeGrace: dc.GetDurationProperty(dynamicconfig.KeepAliveMaxConnectionAgeGrace, 2*time.Minute),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add comment here saying that this needs to be slightly bigger than long poll interval. And there is another parameter which also correlates with long poll interval.

@vitarb
Copy link
Contributor Author

vitarb commented Mar 26, 2021

Discussed with Alex/Wenquan, and adjusted numbers accordingly.

@vitarb
Copy link
Contributor Author

vitarb commented Mar 26, 2021

Discussed with @mfateev and agreed on a set of parameters that are slightly different from what we've discussed:
List of added properties (with frontend.keepAlive prefix) and their default values:

  • MinTime: minimum amount of time a client should wait before sending a keepalive ping. Default 10 sec Rationale - make sure that clients can set minimal value of 10 sec without breaking connections.
  • PermitWithoutStream: If true, server allows keepalive pings even when there are no active streams(RPCs). Default is true. Rationale - allow clients ping server without having active RPCs.
  • MaxConnectionIdle: is a duration for the amount of time after which an idle connection would be closed by sending a GoAway. Default is 2 min. Rationale - default client-side ping interval is 60 sec and long poll interval is every 60-70 sec. If client is idle for 2 minutes it must be dead and connection can be closed.
  • MaxConnectionAge: is a duration for the maximum amount of time a connection may exist before it will be closed by sending a GoAway. Default is 5 min. Rationale - periodically close connections to re-balance them more evenly between front-ends.
  • MaxConnectionAgeGrace: is an additive period after MaxConnectionAge after which the connection will be forcibly closed. Default is 70 sec. Rationale - give some time for the client to finish existing RPCs and then close the channel gracefully.
  • Time: After a duration of this time if the server doesn't see any activity it pings the client to see if the transport is still alive. Default is 1 min. Server pings client if it is idle. Rationale - if the client is idle for a minute, it's probably not a valid temporal client.
  • Timeout: After having pinged for keepalive check, the server waits for a duration of Timeout and if no activity is seen even after that the connection is closed. Defaults to 10 sec.

Copy link
Contributor

@wxing1292 wxing1292 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MinTime: minimum amount of time a client should wait before sending a keepalive ping. Default 10 sec Rationale - make sure that clients can set minimal value of 10 sec without breaking connections.

at least this number is too small, let us sync

changing from every 60s to every 10s is 6x increase in amount of load in terms of pinging.
although ping itself is lightweight, it still consume network traffic and requires frontend to respond, meaning health check is not free

@vitarb
Copy link
Contributor Author

vitarb commented Mar 27, 2021

Note that this property defines low end of the range, it doesn't mean we would go as low. In fact I anticipate default on the client side should be 30 sec or 1 min, with a possibility of an override to a lower value, which some OSS clients may want if they would like to detect failures faster and don't have too many workers.

@@ -268,11 +292,25 @@ func (s *Service) Start() {
)

opts, err := s.params.RPCFactory.GetFrontendGRPCServerOptions()
kep := keepalive.EnforcementPolicy{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comment here specifying that the configs below are intended for small scale use case

@vitarb vitarb merged commit df0c869 into temporalio:master Mar 30, 2021
@vitarb vitarb deleted the conn-mgmt branch March 30, 2021 22:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants