Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix to RESetMapReduce timeout #21233

Closed
embray opened this issue Aug 12, 2016 · 35 comments
Closed

Fix to RESetMapReduce timeout #21233

embray opened this issue Aug 12, 2016 · 35 comments

Comments

@embray
Copy link
Contributor

embray commented Aug 12, 2016

This fixes two sort of related issues.

The first is a test failure I had for this module on my Windows machine (though possibly not all that related to Cygwin):

sage -t src/sage/parallel/map_reduce.py
**********************************************************************
File "src/sage/parallel/map_reduce.py", line 217, in sage.parallel.map_reduce
Failed example:
    try:
        res = EX.run(timeout=0.01)
    except AbortError:
        print("Computation timeout")
    else:
        print("Computation normally finished")
        res
Expected:
    Computation timeout
Got:
    Exception in thread Thread-1:
    Traceback (most recent call last):
      File "/home/embray/src/sagemath/sage-cygwin/local/lib/python/threading.py", line 810, in __bootstrap_inner
        self.run()
      File "/home/embray/src/sagemath/sage-cygwin/local/lib/python/threading.py", line 1082, in run
        self.function(*self.args, **self.kwargs)
      File "/home/embray/src/sagemath/sage-cygwin/local/lib/python2.7/site-packages/sage/parallel/map_reduce.py", line 1208, in abort
        self._abort.value = True
    AttributeError: 'bool' object has no attribute 'value'
    <BLANKLINE>
    Computation normally finished
    40320*x^8 + 5040*x^7 + 720*x^6 + 120*x^5 + 24*x^4 + 6*x^3 + 2*x^2 + x + 1

The computation timeout method crashes and the computation finishes "normally" due to a bug in the ordering of self._timer.cancel() and self.finish() which I fixed immediately.

However, the test was still failing due to the calculation still finishing before the timeout. I lowered the timeout by an order of magnitude, which worked around it. But I also wanted to convince myself that the timeout was really working properly, so I instead modified the test to timeout on a much larger calculation. That was failing which led to finding the second issue fixed here, which is described in more detail in the commit message.

Possibly related: #24241

CC: @hivert

Component: combinatorics

Keywords: map-reduce windows cygwin

Author: Erik Bray

Branch/Commit: c426977

Reviewer: Emmanuel Charpentier, Travis Scrimshaw

Issue created by migration from https://trac.sagemath.org/ticket/21233

@embray embray added this to the sage-7.4 milestone Aug 12, 2016
@embray
Copy link
Contributor Author

embray commented Aug 12, 2016

comment:2

Now I'm thinking that in this part

+            except Empty:
+                aborted = self._aborted.value
+                logger.debug('Timed out waiting for results; aborted: %s' %
+                             aborted)
+                if aborted:
+                    return

I can actually even do away with checking self._aborted since if the get() times out then we've already exceeded the timeout for the computation anyways. Instead maybe it can just call self._timer.join() just to ensure that the timer already went off and called self.abort.

Or maybe even do away with the abort timer altogether and let the timeout be handled entirely here. Not sure what implications that might have, if any. I suppose you'd still want to have the abort timer in the case of a non-trivial reduce function.

@sagetrac-git
Copy link
Mannequin

sagetrac-git mannequin commented Aug 12, 2016

Branch pushed to git repo; I updated commit sha1. New commits:

2cb9701Fix a slight possibility of not aborting until after twice the requested timeout.

@sagetrac-git
Copy link
Mannequin

sagetrac-git mannequin commented Aug 12, 2016

Changed commit from b6caede to 2cb9701

@fchapoton
Copy link
Contributor

comment:4

+from Queue import Empty

this is not compatible with python3

@embray
Copy link
Contributor Author

embray commented Aug 16, 2016

comment:5

I mean, okay. But the only difference is Queue -> queue. Do we have a policy now that all new code should be cross-compatible? Is that documented somewhere? Not that I'd be against that. Is six one of Sage's requirements now because that would be helpful?

@fchapoton
Copy link
Contributor

comment:6

I am trying to move towards py3, so I keep an eye on backward moves. There are surely many that escape my vigilance.

I would be strongly in favor of imposing that new code must be py3-compatible. Some of the patchbot plugins try to look out for that.

"from six.moves import queue" will do the job for you here.

I am using six all over the place, so I guess it is required.

@embray
Copy link
Contributor Author

embray commented Aug 17, 2016

comment:7

Great, happy to make the change. But let's also get a policy documented to that effect.

@sagetrac-git
Copy link
Mannequin

sagetrac-git mannequin commented Aug 17, 2016

Branch pushed to git repo; I updated commit sha1. New commits:

5c6f30bUse six to abstract the Queue/queue difference in Python 3

@sagetrac-git
Copy link
Mannequin

sagetrac-git mannequin commented Aug 17, 2016

Changed commit from 2cb9701 to 5c6f30b

@hivert
Copy link

hivert commented Oct 3, 2016

comment:9

Hi Erik,

In the method get_result you added a timeout on the call to self._results.get. Though I'm Ok with that, if this timeout fire your join the timer and simply return. I'd rather have an AbortError raised in that case. More precisely, I think the right way is to call the abort method, as if the other timer has fired. Actually, I'm not sure why you want this new timer. It seems to be redundant with self.timer, isn't it ?

@hivert
Copy link

hivert commented Mar 14, 2017

@embray
Copy link
Contributor Author

embray commented Mar 22, 2017

Changed commit from 5c6f30b to 112db56

@embray
Copy link
Contributor Author

embray commented Mar 22, 2017

comment:12

Wow, this was a long time ago, and I never got around to looking at it again. Thank you hivert for the review...

Actually, I'm not sure why you want this new timer. It seems to be redundant with self.timer, isn't it ?

By "new timer" do you mean the new timeout in self._results.get() (since I didn't add any new timer per se)? I admit I don't exactly recall, except that the main problem I had was very long blocking (for large computations) at this get() call. In the commit message I wrote:

  Sometimes the main thread could hang at self._results.get() in self.get_results()
  This is because if a worker gets started on walk_branch_locally for a very large
  input before the computation is canceled, then self._results.get() will hang on
  the result from that worker.

  This is fixed by having self._results.get() timeout after no more than the
  timeout for the computation so that it can check the self._aborted flag before
  trying again.  This did necessitate changing self._results from a full-featured
  Queue rather than SimpleQueue; not sure what noticable impact that will have on
  a large calculation.

I think the point here is that the abort timer only calls self._shutdown() which sends poison pills to the workers but doesn't forcibly kill them. So if a worker is somewhere deep in walk_branch_locally it doesn't get the poison pill. Instead it tries to finish the job it's currently doing, and self._results.get() waits forever.

But I guess you might have a point that now the Timer thread is not as important. I guess one use for it still is if the reduce_function takes too long for some reason.


New commits:

112db56Merge branch 'develop' into t/21233/map-reduce-abort-issues

@embray
Copy link
Contributor Author

embray commented Mar 22, 2017

comment:13

I see your point though about raising AbortError and calling self.abort(), consistent with the timeout from the Timer.

@embray
Copy link
Contributor Author

embray commented Mar 23, 2017

comment:14

Actually, I took a closer look at this again, and I think it's correct as-is. In get_results() if a timeout occurs it calls self._timer.join() is it still waits for the timer in self._timer to run out naturally (there's a race condition here between the timer ending and self._results.get() timing out, but it shouldn't matter).

The point is that the timer ends and self.abort() gets called no matter what. So by the time get_results() exits, if the computation timed out self._aborted will still be set to True, resulting in an AbortError being raised.

@embray
Copy link
Contributor Author

embray commented Apr 24, 2017

comment:15

This fix is needed to fix some test results on Cygwin, but is not a major blocker otherwise.

@embray embray modified the milestones: sage-7.4, sage-8.0 Apr 24, 2017
@embray
Copy link
Contributor Author

embray commented Jun 30, 2017

Changed keywords from map-reduce to map-reduce windows cygwin

@EmmanuelCharpentier
Copy link
Mannequin

EmmanuelCharpentier mannequin commented Jul 1, 2017

Reviewer: Emmanuel Charpentier

@kiwifb
Copy link
Member

kiwifb commented Jul 22, 2017

comment:19

It just plain timed out for me on sage-on-gentoo but the backtrace doesn't look useful, I may have to work on the debugging setup.

@embray
Copy link
Contributor Author

embray commented Jul 31, 2017

comment:20

Can't reproduce that at all. You get it only with this patch?

@embray
Copy link
Contributor Author

embray commented Jul 31, 2017

comment:21

Sage 8.0 has been released, I guess, so the milestone should be changed, I guess.

@embray embray modified the milestones: sage-8.0, sage-8.1 Jul 31, 2017
@embray
Copy link
Contributor Author

embray commented Jul 31, 2017

comment:22

It occurs to me that that test is just bad. The RESetParallelIterator does not necessarily return results in a predictable order, so it's no wonder this would randomly fail.

@embray
Copy link
Contributor Author

embray commented Jul 31, 2017

@embray
Copy link
Contributor Author

embray commented Jul 31, 2017

New commits:

e5bd44eFix misbehaved test

@embray
Copy link
Contributor Author

embray commented Jul 31, 2017

Changed commit from 112db56 to e5bd44e

@embray
Copy link
Contributor Author

embray commented Aug 22, 2017

comment:24

This is still the only consistently failing test on Cygwin. Any chance someone can check this? It had positive_review except for one randomly failing test, which actually wasn't particularly related to this ticket, which I fixed.

@fchapoton
Copy link
Contributor

comment:25

does not apply

@sagetrac-git
Copy link
Mannequin

sagetrac-git mannequin commented Oct 23, 2017

Changed commit from e5bd44e to c426977

@sagetrac-git
Copy link
Mannequin

sagetrac-git mannequin commented Oct 23, 2017

Branch pushed to git repo; I updated commit sha1. This was a forced push. New commits:

af0071eFixes a couple issues with aborting a map_reduce computation:
65746a0This test was failing on my Windows machine, due to the calculation finishing before the timeout.
4fc88a8Fix a slight possibility of not aborting until after twice the requested timeout.
57b4b41Use six to abstract the Queue/queue difference in Python 3
c426977Fix misbehaved test

@jdemeyer

This comment has been minimized.

@tscrim
Copy link
Collaborator

tscrim commented Nov 22, 2017

comment:29

LGTM.

@tscrim
Copy link
Collaborator

tscrim commented Nov 22, 2017

Changed reviewer from Emmanuel Charpentier to Emmanuel Charpentier, Travis Scrimshaw

@vbraun
Copy link
Member

vbraun commented Dec 11, 2017

Changed branch from u/embray/map-reduce-abort-issues to c426977

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

7 participants