
Add periodic health checks to TChannel #318

Merged
merged 3 commits into dev from health on Sep 29, 2017
Conversation

prashantv
Contributor

@prashantv prashantv commented Apr 15, 2016

If a TChannel connection stalls (e.g., the remote side has disappeared and traffic starts black-holing), TChannel waits for TCP to detect the failure, but TCP's detection timeout is often not configurable and can be very long (we've observed many minutes in production).

In many P2P situations, it's desirable to detect these failures as soon as possible. Support active health checking using TChannel pings on a periodic basis, with configurable timeouts, number of allowed failures, etc.
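At a high level, the mechanism this describes is a per-connection loop that pings on a timer and closes the connection after too many consecutive failures. Below is a minimal Go sketch of that idea; the names (healthCheckOptions, runHealthChecks, pingFn, closeFn) and the values used are illustrative stand-ins, not this PR's actual identifiers.

package main

import (
	"context"
	"fmt"
	"time"
)

// healthCheckOptions mirrors the knobs described above (hypothetical names).
type healthCheckOptions struct {
	Interval        time.Duration // how often to send a ping
	Timeout         time.Duration // per-ping timeout
	FailuresToClose int           // consecutive failures before closing the connection
}

// runHealthChecks pings until ctx is done or too many consecutive pings fail.
// pingFn stands in for a TChannel ping; closeFn stands in for closing the connection.
func runHealthChecks(ctx context.Context, opts healthCheckOptions,
	pingFn func(context.Context) error, closeFn func(reason string)) {

	ticker := time.NewTicker(opts.Interval)
	defer ticker.Stop()

	consecutiveFailures := 0
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
		}

		pingCtx, cancel := context.WithTimeout(ctx, opts.Timeout)
		err := pingFn(pingCtx)
		cancel() // release the per-ping context right away

		if err == nil {
			consecutiveFailures = 0
			continue
		}
		consecutiveFailures++
		if consecutiveFailures >= opts.FailuresToClose {
			closeFn(fmt.Sprintf("health check failed %d times: %v", consecutiveFailures, err))
			return
		}
	}
}

func main() {
	// Demo against a fake peer that always times out, to exercise the failure path.
	ctx, stop := context.WithTimeout(context.Background(), 3*time.Second)
	defer stop()

	ping := func(c context.Context) error { <-c.Done(); return c.Err() }
	runHealthChecks(ctx, healthCheckOptions{
		Interval:        500 * time.Millisecond,
		Timeout:         100 * time.Millisecond,
		FailuresToClose: 3,
	}, ping, func(reason string) { fmt.Println("closing connection:", reason) })
}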

connection.go Outdated
}
if timeout == 0 {
timeout = time.Second
}
Contributor

Do we want to validate/enforce that timeout <= interval?
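One way to enforce that, sketched below with an illustrative clampTimeout helper (not code from this PR): apply the one-second default from the snippet above, then cap the timeout at the interval so a slow ping can never outlive the period that scheduled it.

package health

import "time"

// clampTimeout defaults a zero timeout to one second and then caps it at the
// check interval (when an interval is set).
func clampTimeout(timeout, interval time.Duration) time.Duration {
	if timeout == 0 {
		timeout = time.Second
	}
	if interval > 0 && timeout > interval {
		timeout = interval
	}
	return timeout
}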

@motiejus

The "health" branch of tchannel-go is good for ringpop: if destination blackholes, we see that within 1 second, without filling any outgoing buffers, and ringpop has a better chance to detect the transport error.

I am all-in for this feature.

@prashantv
Contributor Author

Are there any concerns about having constant pings on every TCP connection? tchannel-go does not currently close working TCP connections, so over time, if ringpop ends up connecting to every other host, each of those connections will have constant health checks.

The main concern I have about this change is the default health timeout period, since some services may have much slower responses. We could ship this change with a pretty high default (like 10 seconds) -- although I imagine ringpop would want a much lower value. Since ringpop doesn't create the channel, the service owner would need to set the lower value themselves, which seems like extra configuration that we could avoid by having a lower default.

connection.go Outdated
}
return
}
cancel()
Contributor

Is it OK that cancel() is not called when this returns at L948?

@yurishkuro
Contributor

@prashantv is there an ETA for this change?

@prashantv
Contributor Author

I probably won't get to this in the next week; there's a bunch of relay work that's higher priority right now.

@madhuravi

@prashantv Are there any plans to get this in? It looks like it's been over a year since it was originally created. This would be useful for Cadence.

@prashantv prashantv changed the title WIP: Add periodic health checks to TChannel Add periodic health checks to TChannel Sep 27, 2017
@codecov

codecov bot commented Sep 27, 2017

Codecov Report

Merging #318 into dev will decrease coverage by 0.43%.
The diff coverage is 96.39%.


@@            Coverage Diff             @@
##              dev     #318      +/-   ##
==========================================
- Coverage   85.94%   85.51%   -0.44%     
==========================================
  Files          37       38       +1     
  Lines        4668     4762      +94     
==========================================
+ Hits         4012     4072      +60     
- Misses        534      565      +31     
- Partials      122      125       +3
Impacted Files Coverage Δ
introspection.go 90.94% <100%> (+0.07%) ⬆️
channel.go 87.64% <100%> (+0.14%) ⬆️
connection.go 78.89% <100%> (-5.31%) ⬇️
health.go 95% <95%> (ø)
relay.go 83.5% <0%> (-1.27%) ⬇️
peer.go 93.02% <0%> (-0.3%) ⬇️
mex.go 76.2% <0%> (+0.34%) ⬆️
logger.go 87.01% <0%> (+1.29%) ⬆️

Continue to review the full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b9dc4c1...89c0c1c.

@prashantv
Contributor Author

I've updated the diff with full test coverage. Health checking is opt-in, and by default no active health checks are performed.

The caller can choose the interval between checks, what the timeouts are for each ping message, and how many consecutive failures should cause a connection to be closed.
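For illustration only, the options might look roughly like the sketch below; the type and field names (HealthCheckOptions, Interval, Timeout, FailuresToClose) are assumptions based on this description, not necessarily the identifiers this PR adds. The zero value leaves health checking disabled, which matches the opt-in behaviour.

package health

import "time"

// HealthCheckOptions gathers the knobs described above (hypothetical names).
type HealthCheckOptions struct {
	// Interval is the time between health-check pings. Zero disables active
	// health checking entirely, keeping the feature opt-in.
	Interval time.Duration

	// Timeout is applied to each individual ping.
	Timeout time.Duration

	// FailuresToClose is the number of consecutive failed pings after which
	// the connection is closed.
	FailuresToClose int
}

// enabled reports whether active health checking should run at all.
func (o HealthCheckOptions) enabled() bool {
	return o.Interval > 0
}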

health.go Outdated
}

ctx, cancel := context.WithTimeout(c.healthCheckCtx, opts.Timeout)
defer cancel()
Contributor

Do we want to defer this until the function returns, or should we make sure to execute it at the end of this loop iteration?

Contributor Author

Good catch! This could have been a huge memory leak; changed it to cancel immediately after the ping call, since we don't use the ctx for anything else.
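To make the pitfall concrete, here is a minimal sketch (healthCheckLoop and doPing are illustrative stand-ins, not this PR's code): a defer cancel() inside the loop would keep every per-ping context and its timer alive until the goroutine exits, whereas calling cancel() right after the ping releases each one at the end of its iteration.

package health

import (
	"context"
	"time"
)

// healthCheckLoop pings on every tick and stops on the first failure.
func healthCheckLoop(parent context.Context, interval, timeout time.Duration,
	doPing func(context.Context) error) {

	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	for range ticker.C {
		ctx, cancel := context.WithTimeout(parent, timeout)
		err := doPing(ctx)
		cancel() // a "defer cancel()" here would pile up until the loop exits
		if err != nil {
			return
		}
	}
}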

@akshayjshah
Contributor

I am very excited about this.

@prashantv prashantv merged commit 1fcf82e into dev Sep 29, 2017
@prashantv prashantv deleted the health branch September 29, 2017 22:49