
Add periodic health checks to TChannel #318

Merged
merged 3 commits into dev from health on Sep 29, 2017
Conversation

prashantv
Contributor

@prashantv prashantv commented Apr 15, 2016

If a TChannel connection stalls (e.g., the remote side has disappeared and traffic starts black-holing), TChannel waits for TCP to detect the failure, but TCP's detection timeout is often not configurable and can be very long (we've observed many minutes in production).

In many P2P situations, it's desirable to detect these failures as soon as possible. Support active health checking using TChannel pings on a periodic basis, with configurable timeouts, number of allowed failures, etc.
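At a high level, the mechanism this describes is a per-connection loop that pings on a timer and closes the connection after too many consecutive failures. Below is a minimal Go sketch of that idea; the names (healthCheckOptions, runHealthChecks, pingFn, closeFn) and the values used are illustrative stand-ins, not this PR's actual identifiers.

package main

import (
	"context"
	"fmt"
	"time"
)

// healthCheckOptions mirrors the knobs described above (hypothetical names).
type healthCheckOptions struct {
	Interval        time.Duration // how often to send a ping
	Timeout         time.Duration // per-ping timeout
	FailuresToClose int           // consecutive failures before closing the connection
}

// runHealthChecks pings until ctx is done or too many consecutive pings fail.
// pingFn stands in for a TChannel ping; closeFn stands in for closing the connection.
func runHealthChecks(ctx context.Context, opts healthCheckOptions,
	pingFn func(context.Context) error, closeFn func(reason string)) {

	ticker := time.NewTicker(opts.Interval)
	defer ticker.Stop()

	consecutiveFailures := 0
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
		}

		pingCtx, cancel := context.WithTimeout(ctx, opts.Timeout)
		err := pingFn(pingCtx)
		cancel() // release the per-ping context right away

		if err == nil {
			consecutiveFailures = 0
			continue
		}
		consecutiveFailures++
		if consecutiveFailures >= opts.FailuresToClose {
			closeFn(fmt.Sprintf("health check failed %d times: %v", consecutiveFailures, err))
			return
		}
	}
}

func main() {
	// Demo against a fake peer that always times out, to exercise the failure path.
	ctx, stop := context.WithTimeout(context.Background(), 3*time.Second)
	defer stop()

	ping := func(c context.Context) error { <-c.Done(); return c.Err() }
	runHealthChecks(ctx, healthCheckOptions{
		Interval:        500 * time.Millisecond,
		Timeout:         100 * time.Millisecond,
		FailuresToClose: 3,
	}, ping, func(reason string) { fmt.Println("closing connection:", reason) })
}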

connection.go Outdated
}
if timeout == 0 {
timeout = time.Second
}
Contributor

Do we want to validate/enforce that timeout <= interval?
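One way to enforce that, sketched below with an illustrative clampTimeout helper (not code from this PR): apply the one-second default from the snippet above, then cap the timeout at the interval so a slow ping can never outlive the period that scheduled it.

package health

import "time"

// clampTimeout defaults a zero timeout to one second and then caps it at the
// check interval (when an interval is set).
func clampTimeout(timeout, interval time.Duration) time.Duration {
	if timeout == 0 {
		timeout = time.Second
	}
	if interval > 0 && timeout > interval {
		timeout = interval
	}
	return timeout
}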

@motiejus

The "health" branch of tchannel-go is good for ringpop: if destination blackholes, we see that within 1 second, without filling any outgoing buffers, and ringpop has a better chance to detect the transport error.

I am all-in for this feature.

@prashantv
Contributor Author

Are there any concerns about having constant pings on every TCP connection? tchannel-go does not currently close working TCP connections, so over time, if ringpop ends up connecting to every other host, each of those connections will have constant health checks.

The main concern I have about this change is the default health timeout period, since some services may have much slower responses. We could ship this change with a pretty high default (like 10 seconds) -- although I imagine ringpop would want a much lower value. Since ringpop doesn't create the channel, the service owner would need to set the lower value themselves, which seems like extra configuration that we could avoid by having a lower default.

connection.go Outdated
}
return
}
cancel()
Contributor

Is it OK that cancel() is not called when this returns at L948?

@yurishkuro
Contributor

@prashantv is there an ETA for this change?

@prashantv
Contributor Author

I probably won't get to this in the next week; there's a bunch of relay work that's higher priority right now.

@madhuravi

@prashantv Are there any plans to get this in? It looks like it's been over a year since it was originally created. This would be useful for Cadence.

@prashantv prashantv changed the title WIP: Add periodic health checks to TChannel Add periodic health checks to TChannel Sep 27, 2017
@codecov

codecov bot commented Sep 27, 2017

Codecov Report

Merging #318 into dev will decrease coverage by 0.43%.
The diff coverage is 96.39%.


@@            Coverage Diff             @@
##              dev     #318      +/-   ##
==========================================
- Coverage   85.94%   85.51%   -0.44%     
==========================================
  Files          37       38       +1     
  Lines        4668     4762      +94     
==========================================
+ Hits         4012     4072      +60     
- Misses        534      565      +31     
- Partials      122      125       +3
Impacted Files Coverage Δ
introspection.go 90.94% <100%> (+0.07%) ⬆️
channel.go 87.64% <100%> (+0.14%) ⬆️
connection.go 78.89% <100%> (-5.31%) ⬇️
health.go 95% <95%> (ø)
relay.go 83.5% <0%> (-1.27%) ⬇️
peer.go 93.02% <0%> (-0.3%) ⬇️
mex.go 76.2% <0%> (+0.34%) ⬆️
logger.go 87.01% <0%> (+1.29%) ⬆️

Continue to review the full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b9dc4c1...89c0c1c.

@prashantv
Contributor Author

I've updated the diff with full test coverage. Health checking is opt-in, and by default no active health checks are performed.

The caller can choose the interval between checks, what the timeouts are for each ping message, and how many consecutive failures should cause a connection to be closed.
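For illustration only, the options might look roughly like the sketch below; the type and field names (HealthCheckOptions, Interval, Timeout, FailuresToClose) are assumptions based on this description, not necessarily the identifiers this PR adds. The zero value leaves health checking disabled, which matches the opt-in behaviour.

package health

import "time"

// HealthCheckOptions gathers the knobs described above (hypothetical names).
type HealthCheckOptions struct {
	// Interval is the time between health-check pings. Zero disables active
	// health checking entirely, keeping the feature opt-in.
	Interval time.Duration

	// Timeout is applied to each individual ping.
	Timeout time.Duration

	// FailuresToClose is the number of consecutive failed pings after which
	// the connection is closed.
	FailuresToClose int
}

// enabled reports whether active health checking should run at all.
func (o HealthCheckOptions) enabled() bool {
	return o.Interval > 0
}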

health.go Outdated
}

ctx, cancel := context.WithTimeout(c.healthCheckCtx, opts.Timeout)
defer cancel()
Contributor

Do we want to defer this until the function returns, or should we make sure to execute it at the end of this loop iteration?

Contributor Author

Good catch! This could have been a huge memory leak; changed it to cancel immediately after the ping call, since we don't use the ctx for anything else.
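To make the pitfall concrete, here is a minimal sketch (healthCheckLoop and doPing are illustrative stand-ins, not this PR's code): a defer cancel() inside the loop would keep every per-ping context and its timer alive until the goroutine exits, whereas calling cancel() right after the ping releases each one at the end of its iteration.

package health

import (
	"context"
	"time"
)

// healthCheckLoop pings on every tick and stops on the first failure.
func healthCheckLoop(parent context.Context, interval, timeout time.Duration,
	doPing func(context.Context) error) {

	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	for range ticker.C {
		ctx, cancel := context.WithTimeout(parent, timeout)
		err := doPing(ctx)
		cancel() // a "defer cancel()" here would pile up until the loop exits
		if err != nil {
			return
		}
	}
}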

@akshayjshah
Contributor

I am very excited about this.

@prashantv prashantv merged commit 1fcf82e into dev Sep 29, 2017
@prashantv prashantv deleted the health branch September 29, 2017 22:49