
is_alive doctest failure in map_reduce #24241

Closed
videlec opened this issue Nov 19, 2017 · 26 comments

Comments

@videlec
Contributor

videlec commented Nov 19, 2017

Some patchbots report unstopped workers:

sage -t --long src/sage/parallel/map_reduce.py
**********************************************************************
File "src/sage/parallel/map_reduce.py", line 1090, in sage.parallel.map_reduce.RESetMapReduce.start_workers
Failed example:
    all(w.is_alive() for w in S._workers)
Expected:
    True
Got:
    False
**********************************************************************
1 item had failures:
   1 of   9 in sage.parallel.map_reduce.RESetMapReduce.start_workers
    [296 tests, 1 failure, 28.70 s]
----------------------------------------------------------------------
sage -t --long src/sage/parallel/map_reduce.py  # 1 doctest failed
----------------------------------------------------------------------


CC: @hivert

Component: combinatorics

Keywords: random_fail

Author: Florent Hivert

Branch/Commit: 6eeda41

Reviewer: Jeroen Demeyer

Issue created by migration from https://trac.sagemath.org/ticket/24241

@videlec videlec added this to the sage-8.1 milestone Nov 19, 2017

@embray
Contributor

embray commented Nov 22, 2017

comment:2

I'm not sure of an obvious way to reproduce this, but maybe we could go ahead and merge #21233 and see if that fixes it? I've been waiting forever for someone to just give it positive review (which it had previously, but Volker removed it...)

@videlec
Contributor Author

videlec commented Nov 22, 2017

comment:3

Replying to @embray:

I'm not sure of an obvious way to reproduce this, but maybe we could go ahead and merge #21233 and see if that fixes it?

+1. And I would love to have a way to grep through the patchbot reports easily!

@embray
Contributor

embray commented Nov 22, 2017

comment:4

Replying to @videlec:

Replying to @embray:

I'm not sure of an obvious way to reproduce this, but maybe we could go ahead and merge #21233 and see if that fixes it?

+1. And I would love to have a way to grep through the patchbot reports easily!

Open an issue on the patchbot GitHub project for that. I would love that too but it's probably not entirely trivial (if nothing else we'd want to index the report logs).

@embray
Contributor

embray commented Dec 21, 2017

comment:5

I'm not totally sure this was fixed by #21233. Now, on several of my Cygwin patchbot runs, this module fails on the initial test run, not quite in the way reported by this ticket, but possibly similar. I get

    sage -t --long src/sage/parallel/map_reduce.py  # Timed out after testing finished

which is something I've never seen before...

@vbraun
Member

vbraun commented Dec 25, 2017

comment:6

I'm also seeing this on the buildbot

@vbraun
Member

vbraun commented Dec 25, 2017

Changed keywords from none to random_fail

@jdemeyer

comment:7

This is just to say that I got this again.

@jdemeyer jdemeyer changed the title unstopped MapReduce workers is_alive doctest failure in map_reduce Jul 12, 2018

@jdemeyer

comment:9

The doctest failure looks like a race condition. If I'm understanding things correctly, the workers are started and will then stop naturally (after an unspecified amount of time). If they stop really quickly, then this doctest will fail:

            sage: from sage.parallel.map_reduce import RESetMapReduce
            sage: S = RESetMapReduce(roots=[])
            sage: S.setup_workers(2)
            sage: S.start_workers()
            sage: all(w.is_alive() for w in S._workers)
            True
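For illustration, here is a minimal standalone sketch of the same race using only Python's standard library (this is not Sage's worker code): a process with nothing to do can exit before the parent gets around to checking is_alive().

    import multiprocessing
    import time

    def worker():
        # Nothing to do: the process exits almost immediately after starting.
        pass

    if __name__ == "__main__":
        procs = [multiprocessing.Process(target=worker) for _ in range(2)]
        for p in procs:
            p.start()
        time.sleep(0.02)  # a tiny delay is enough for the workers to finish
        print(all(p.is_alive() for p in procs))  # typically False by now

Whether the check sees True or False depends purely on scheduling, which is exactly why the doctest fails only intermittently.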

@jdemeyer

comment:10

I can make this test fail pretty consistently with

    sage: sleep(0.02); all(w.is_alive() for w in S._workers)

If a doctest is sensitive to 20ms delays, it's a bad test.

@embray
Contributor

embray commented Jul 12, 2018

comment:11

Indeed; I see the problem now. When I originally commented on this ticket, I admit I didn't look very closely at the exact test that was failing.

If there's no work for the workers to do, then there's no guarantee that you'll ever find them all running simultaneously.

If you really wanted to test this, one possibility might be to set up a test logger that collects all log messages in a list, and then check that the expected log messages are found (e.g. one "Started" and one "Exiting" for each worker started).
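For instance, a rough sketch of such a collecting handler with Python's standard logging module (the logger name and the messages here are hypothetical, not the ones Sage's code actually emits):

    import logging

    class ListHandler(logging.Handler):
        """A handler that stores every emitted log message in a list."""
        def __init__(self):
            super().__init__()
            self.messages = []

        def emit(self, record):
            self.messages.append(record.getMessage())

    handler = ListHandler()
    logger = logging.getLogger("map_reduce_test")  # hypothetical logger name
    logger.addHandler(handler)
    logger.setLevel(logging.DEBUG)

    # In the real test the workers would emit these; done by hand here.
    logger.debug("Started")
    logger.debug("Exiting")

    # Timing-independent check: one "Exiting" for every "Started".
    assert handler.messages.count("Started") == handler.messages.count("Exiting")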

@hivert

hivert commented Jul 12, 2018

comment:12

Replying to @embray:

If you really wanted to test this, one possibility might be to set up a test logger that collects all log messages in a list, and then check that the expected log messages are found (e.g. one "Started" and one "Exiting" for each worker started).

Thanks to all of you for catching this one. I'm confirming jdemeyer's analysis: if there is no work to do, there is no robust lower bound on the time a worker stays alive.

@embray: there is a logger in the code, but its level is normally set so that these messages are not shown. Another possibility would be to give the workers a sleep(1) instruction as work.
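A sketch of that second idea as a doctest (illustrative only; it assumes the sleep(1) goes into the children function, which is not necessarily how the final patch does it):

    sage: from sage.parallel.map_reduce import RESetMapReduce
    sage: def children(x):
    ....:     sleep(1)  # keep each worker busy long enough to observe it
    ....:     return []
    sage: S = RESetMapReduce(roots=[1, 2], children=children)
    sage: S.setup_workers(2)
    sage: S.start_workers()
    sage: all(w.is_alive() for w in S._workers)  # workers are still sleeping
    True
    sage: S.finish()

With the workers pinned in sleep(1), the is_alive() check no longer races against their natural exit.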

@hivert

hivert commented Jul 12, 2018

Branch: u/hivert/is_alive_doctest_failure_in_map_reduce

@hivert

hivert commented Jul 12, 2018

comment:13

Sorry, I based my branch on the wrong base... Fixing it.

@jdemeyer

comment:14

The doctest fix looks good at first sight; I would still keep the sleep(1) in the last test, though.

I cannot really comment on the other changes, which seem to be related to Python 3.


New commits:

8778e24  Tentative fix of MapReduce.is_alive

@jdemeyer

Commit: 8778e24

@sagetrac-git
Mannequin

sagetrac-git mannequin commented Jul 12, 2018

Changed commit from 8778e24 to 9c1ab33

@sagetrac-git
Mannequin

sagetrac-git mannequin commented Jul 12, 2018

Branch pushed to git repo; I updated commit sha1. New commits:

9c1ab33  Fixed wrong base

@hivert

hivert commented Jul 12, 2018

comment:16

Replying to @jdemeyer:

The doctest fix looks good at first sight; I would still keep the sleep(1) in the last test, though.

I cannot really comment on the other changes, which seem to be related to Python 3.

Sorry, I based my branch on the wrong base... Should be fixed now.


New commits:

9c1ab33  Fixed wrong base

@sagetrac-git
Mannequin

sagetrac-git mannequin commented Jul 12, 2018

Branch pushed to git repo; I updated commit sha1. New commits:

6eeda41  Put back timeout to 1

@sagetrac-git
Mannequin

sagetrac-git mannequin commented Jul 12, 2018

Changed commit from 9c1ab33 to 6eeda41

@hivert

hivert commented Jul 12, 2018

Author: Florent Hivert

@jdemeyer

Reviewer: Jeroen Demeyer

@hivert

hivert commented Jul 13, 2018

comment:20

Replying to @jdemeyer:
Thanks Jeroen

@vbraun
Member

vbraun commented Aug 5, 2018

Changed branch from u/hivert/is_alive_doctest_failure_in_map_reduce to 6eeda41
