
Cancelled contexts in watchers cause high CPU usage #3012

Closed
palourde opened this issue May 30, 2019 · 4 comments

@palourde
Member

commented May 30, 2019

We discovered during 5.7.0 QA that, upon removing 1000 proxy entities, the CPU usage would remain extremely high on at least one backend node. Here's the graph of the profile the reliability team took against a 5.7.0 cluster: caviar_pprof.pdf

From @ccressent:

Basically, it boils down to this loop going haywire:

for ctx.Err() == nil {
Checking ctx.Err() very rapidly causes all this time spent locking/unlocking. There are things I want to try to finish understanding what's going on, but like you, I've been unable to reproduce the problem in the first place...
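
For illustration, here is a minimal sketch of how a loop like that can end up spinning (assumed shape, not the exact sensu-go code): once the watch channel is closed while the context is still live, the receive returns immediately and the loop runs flat out, hammering ctx.Err().

```go
// Sketch only (assumed shape): if watchChan has been closed by the etcd
// client while ctx is still live, the receive returns immediately with
// ok == false, so this loop spins at full speed, calling ctx.Err() on
// every iteration.
for ctx.Err() == nil {
	resp, ok := <-watchChan
	if !ok {
		continue // channel closed, ctx still alive: instant retry, 100% CPU
	}
	_ = resp // process events...
}
```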

Here are some things we could try to reproduce the problem:

  • Try to reproduce this problem against 5.9.0
  • Try to reproduce this problem against 5.7.0
  • Try to reproduce this problem with a cluster first deployed with 5.6.0 and then upgraded to 5.7.0 to simulate the conditions against which QA was done.
@palourde

Member Author

commented May 30, 2019

I was unable to reproduce this issue against a 5.9.0 cluster. Is it possible this issue was fixed in 5.8.0 or 5.9.0? Could it be related to the serialization changes we made in 5.7.0?

@ccressent

Contributor

commented May 30, 2019

@echlebek made a good catch: the limiter used in that function should be inside the for ctx.Err() == nil { ... } loop! I'd still like to find a way to reproduce the issue so that we can "prove" that reworking the limiter fixes it.
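
A hedged sketch of that fix, assuming the limiter is a golang.org/x/time/rate limiter (the names here are illustrative, not the actual sensu-go identifiers): with the wait moved inside the loop, every retry is throttled, so even a dead watch channel can't drive the loop at full CPU.

```go
package example

import (
	"context"
	"time"

	"golang.org/x/time/rate"
)

// retryLoop is a sketch of the suggested fix: waiting on the rate limiter
// inside the loop throttles every iteration, instead of only limiting the
// first one.
func retryLoop(ctx context.Context) {
	limiter := rate.NewLimiter(rate.Every(time.Second), 1)
	for ctx.Err() == nil {
		if err := limiter.Wait(ctx); err != nil {
			return // context cancelled while waiting
		}
		// ...re-establish the watcher and/or read from its channel...
	}
}
```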

@palourde

Member Author

commented May 30, 2019

@ccressent I've deployed a 5.6.0 cluster, upgraded to 5.7.0, then created/destroyed 1000 proxy entities without any success 😓. The load always quickly comes back to almost nothing.

@ccressent ccressent added this to the 5.10.0 milestone Jun 6, 2019

@ccressent

Contributor

commented Jun 13, 2019

Based on my testing and profiling, this issue is caused by a combination of 2
things: a failure of the etcd client to properly report that a watcher is
effectively dead, and our code assuming that it can always detect when a watcher
dies.

etcd's watchers can fail to properly report their death

As reported in this issue (still open), watchers can fail to properly report
that they are not watching anything anymore. Despite the maintainer's claim that
giving a client multiple node endpoints should allow for automatic failover, I'm
unable to observe this behaviour.

These types of watchers waiting indefinitely in the client can also be triggered
by short network connectivity issues!

Our code assumes we will always detect a watcher's death

In a nutshell, it assumes that the context's error will be non-nil if the
watcher's channel has been closed. This is not always the case. This leads to
our code entering a tight loop that tries to read a closed channel.
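
A hedged sketch of what detecting this could look like (illustrative names, not the actual sensu-go watcher code): the two-value receive and the response's Canceled flag reveal a dead watcher even when ctx.Err() is still nil.

```go
package example

import (
	"context"

	"go.etcd.io/etcd/clientv3"
)

// handleWatch is a sketch: instead of assuming the context error is non-nil
// when the channel closes, check the second return value of the receive and
// resp.Canceled, so we never spin on a closed channel.
func handleWatch(ctx context.Context, watchChan clientv3.WatchChan) {
	for {
		select {
		case <-ctx.Done():
			return
		case resp, ok := <-watchChan:
			if !ok || resp.Canceled {
				// The watcher is dead: recreate it (or bail out) rather
				// than looping over a closed channel.
				return
			}
			for _, ev := range resp.Events {
				_ = ev // handle the event...
			}
		}
	}
}
```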

We seem to have modelled our code on the examples provided in the etcd
documentation, but those examples are bad in the sense that they only show how
to grab values from the watcher's channel, not how to completely manage the life
cycle of the watcher.

How to reproduce

It boils down to managing to make etcd create those "zombie" watchers. Once
that happens, any goroutine in our backend that's using one of those zombies
will go haywire.

I've found two ways to create a zombie: overloading etcd and temporarily
severing the connection between some cluster members. The latter is much faster
and much simpler to execute: temporarily adding an appropriate firewall rule
does the trick.

sudo iptables -A INPUT -p tcp --sport 2380 -j DROP

This will make the backend it's executed on disappear from the etcd cluster as
far as the other members are concerned. Removing that rule makes the backend
reappear:

sudo iptables -D INPUT -p tcp --sport 2380 -j DROP

From then on, sensuctl cluster health will report all the backends as
connected and healthy, but some of them might still be using zombie watchers in
a tight loop, creating a spike in CPU usage.

Potential fixes

Periodically throw away all the watchers

This is the approach taken by these guys. This gives an upper bound on the
length of time zombie watchers might be used. I'm not sure this is great from a
correctness standpoint: until a zombie watcher is reaped that way, it won't
report any changes to the key that's being watched… But at least the system
eventually corrects itself.
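
A hedged sketch of that approach (function and parameter names are illustrative, and the import path depends on the vendored etcd version): each watcher lives at most recreateEvery, so a zombie gets reaped within that bound.

```go
package example

import (
	"context"
	"time"

	"go.etcd.io/etcd/clientv3"
)

// watchWithPeriodicRestart is a sketch of the "throw away all the watchers
// periodically" idea: every recreateEvery the watch context times out, the
// channel closes, and a fresh watcher is created. A zombie watcher can
// therefore only burn CPU (and miss updates) for at most recreateEvery.
func watchWithPeriodicRestart(ctx context.Context, client *clientv3.Client, key string, recreateEvery time.Duration) {
	for ctx.Err() == nil {
		watchCtx, cancel := context.WithTimeout(ctx, recreateEvery)
		for resp := range client.Watch(watchCtx, key, clientv3.WithPrefix()) {
			if resp.Canceled {
				break
			}
			for _, ev := range resp.Events {
				_ = ev // handle the event...
			}
		}
		cancel()
	}
}
```

As noted above, events that occur between teardown and re-creation can be missed unless the new watch is resumed from the last observed revision (e.g. via clientv3.WithRev).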

Better attempt to detect a disconnection

The etcd maintainers mention trying to use Sessions to do that. According
to them, Watchers were never designed to be able to detect disconnections and
they just retry forever (see this issue).

Alternatively, we might be able to use the WithProgressNotify() option on the
watcher and then time it out if there is no notification for a given amount of
time, 1 minute being suggested. Though one of the maintainers mentions this is a
bad idea in general and suggests using Sessions again.
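
A hedged sketch of the WithProgressNotify idea (names and the idle interval are illustrative; the import path depends on the vendored etcd version): if neither an event nor a progress notification arrives within maxIdle, treat the watcher as a zombie, cancel it, and start over.

```go
package example

import (
	"context"
	"time"

	"go.etcd.io/etcd/clientv3"
)

// watchWithIdleTimeout is a sketch: with WithProgressNotify the server sends
// periodic empty responses, so a healthy watcher produces *some* traffic.
// If nothing arrives within maxIdle, assume the watcher is a zombie and
// recreate it.
func watchWithIdleTimeout(ctx context.Context, client *clientv3.Client, key string, maxIdle time.Duration) {
	for ctx.Err() == nil {
		watchCtx, cancel := context.WithCancel(ctx)
		wc := client.Watch(watchCtx, key, clientv3.WithPrefix(), clientv3.WithProgressNotify())
		idle := time.NewTimer(maxIdle)

	recv:
		for {
			select {
			case <-ctx.Done():
				break recv
			case <-idle.C:
				break recv // silent for maxIdle: assume zombie, recreate
			case resp, ok := <-wc:
				if !ok || resp.Canceled {
					break recv // watcher reported (or silently became) dead
				}
				if !idle.Stop() {
					<-idle.C
				}
				idle.Reset(maxIdle)
				for _, ev := range resp.Events {
					_ = ev // handle the event; progress notifies carry no events
				}
			}
		}
		cancel()
		idle.Stop()
	}
}
```

For this to avoid false positives, maxIdle has to be longer than the server's progress-notify interval; the 1 minute suggested above would map to maxIdle here.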

Interesting links
