Timeout processing could be more robust to slow callbacks. #947

Closed
paul-ollis opened this Issue Nov 29, 2013 · 1 comment

@paul-ollis

I have seen this occur in version 3.1.1.

The affected code is in ioloop.py, in the start method, which begins at line
549. More specifically, the issue arises in the loop starting at line 605,
which handles expired timeouts.

If an invoked callback takes an unreasonably long time to execute and adds a
new timeout for itself such that the expiry time has elapsed before it
returns, then the affected code will (I believe) loop forever. Adding the
line:

    now = self.time()

immediately after line 618, which invokes _run_callback, fixes this, at least
for non-pathological code.
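
For illustration, here is a simplified paraphrase of that expired-timeout loop
and of where the suggested refresh of now would go. The names and structure
below are hypothetical and abridged; this is not the actual Tornado 3.1.1
source.

    import heapq
    import time

    def run_due_timeouts(timeouts, run_callback, poll_timeout):
        """Simplified paraphrase of the expired-timeout loop in IOLoop.start().

        timeouts is a heap of (deadline, seq, callback) entries; the seq field
        keeps entries orderable when deadlines collide.
        """
        now = time.time()
        while timeouts:
            deadline, _seq, callback = timeouts[0]
            if deadline <= now:
                heapq.heappop(timeouts)
                run_callback(callback)  # may be slow and may schedule a new timeout
                now = time.time()       # the suggested fix: refresh the clock here
            else:
                # The next timeout is still in the future; let the poll wake us for it.
                poll_timeout = min(deadline - now, poll_timeout)
                break
        return poll_timeout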

FYI, I discovered this whilst trying to use Tornado to replace the bulk of some
code I am refactoring. Otherwise, I would probably have naturally used the
PeriodicCallback class and not encountered this issue.

If this 'fix' is considered worthwhile, I am happy to add a matching regression
test and patch file.

@bdarnell
Member
bdarnell commented Dec 1, 2013

Simply advancing the clock while executing timeout callbacks could lead to pathological behavior in other ways: slow callbacks could keep the IOLoop stuck forever in the "timeout" phase and never run any non-timeout callbacks or check for IO. The best thing to do is probably to copy all timeout callbacks that are due to be executed to a separate list before running any of them (maybe just copy the callbacks to self._callbacks instead of running them in the 'while self._timeouts' loop).
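
A minimal sketch of that collect-first approach, with hypothetical names (this
is a sketch of the idea, not the code from the eventual fix):

    import heapq
    import time

    def collect_and_run_due_timeouts(timeouts, run_callback):
        """Gather every timeout that is already due, then run them.

        A slow callback that reschedules itself pushes a new entry onto the
        heap, but that entry is not examined until a later pass, so it cannot
        extend this loop indefinitely.
        """
        due = []
        now = time.time()
        while timeouts and timeouts[0][0] <= now:
            due.append(heapq.heappop(timeouts))
        for _deadline, _seq, callback in due:
            run_callback(callback)

Timeouts scheduled while these callbacks run keep their own deadlines on the
heap, so the IOLoop still returns to its poll and callback phases between
passes.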

@bdarnell bdarnell added a commit that closed this issue Jun 15, 2014
@bdarnell bdarnell Improve callback scheduling.
Collect all timeouts to be run before running any of them; this
prevents starvation when a slow callback reschedules itself.

Call time() again before setting poll_timeout to avoid scheduling
anomalies with slow callbacks.

Closes #947.
1591ade
@bdarnell bdarnell closed this in 1591ade Jun 15, 2014
@yannmh yannmh added a commit to DataDog/dd-agent that referenced this issue Mar 4, 2016
@yannmh yannmh [core] detect forwarder pathological activity
With certain combinations of system date and scheduled transaction flush time,
the Datadog Agent forwarder's schedule breaks and it enters a frenetic loop:
it becomes unreachable and flushes (with no transactions) at a very high rate
(several hundred flushes per second). At this point, only a restart fixes the
situation.

Under the hood, the issue is related to `tornado==3.2.2`. More
information is available here:
tornadoweb/tornado#947

Before considering any upgrade, a simple fix is to use the Watchdog to
detect this pathologically high activity.
09db8cb
@yannmh yannmh referenced this issue in DataDog/dd-agent Mar 4, 2016
Merged

[core] detect forwarder pathological activity #2322
