Travis intermittent failures on our submit_mult_calcs external client tests in 2.7 only #259

Closed
spencerahill opened this issue Mar 21, 2018 · 9 comments

@spencerahill
Owner

This is starting to happen with more regularity now, e.g. https://travis-ci.org/spencerahill/aospy/jobs/356503697#L854. I've pasted the full error message below.

Further motivation to drop 2.7? 😜 cf. #256

==================================== ERRORS ====================================
___ ERROR at setup of test_submit_mult_calcs_external_client[exec_options0] ____
    @pytest.fixture
    def external_client():
        # Explicitly specify we want only 4 workers so that when running on
        # Travis we don't request too many.
>       cluster = distributed.LocalCluster(n_workers=4)
aospy/test/test_automate.py:221: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
../../../miniconda/envs/test_env/lib/python2.7/site-packages/distributed/deploy/local.py:127: in __init__
    self.start(ip=ip, n_workers=n_workers)
../../../miniconda/envs/test_env/lib/python2.7/site-packages/distributed/deploy/local.py:145: in start
    sync(self.loop, self._start, **kwargs)
../../../miniconda/envs/test_env/lib/python2.7/site-packages/distributed/utils.py:254: in sync
    six.reraise(*error[0])
../../../miniconda/envs/test_env/lib/python2.7/site-packages/distributed/utils.py:238: in f
    result[0] = yield make_coro()
../../../miniconda/envs/test_env/lib/python2.7/site-packages/tornado/gen.py:1099: in run
    value = future.result()
../../../miniconda/envs/test_env/lib/python2.7/site-packages/tornado/concurrent.py:260: in result
    raise_exc_info(self._exc_info)
../../../miniconda/envs/test_env/lib/python2.7/site-packages/tornado/gen.py:1107: in run
    yielded = self.gen.throw(*exc_info)
../../../miniconda/envs/test_env/lib/python2.7/site-packages/distributed/deploy/local.py:163: in _start
    yield [self._start_worker(**self.worker_kwargs) for i in range(n_workers)]
../../../miniconda/envs/test_env/lib/python2.7/site-packages/tornado/gen.py:1099: in run
    value = future.result()
../../../miniconda/envs/test_env/lib/python2.7/site-packages/tornado/concurrent.py:260: in result
    raise_exc_info(self._exc_info)
../../../miniconda/envs/test_env/lib/python2.7/site-packages/tornado/gen.py:849: in callback
    result_list.append(f.result())
../../../miniconda/envs/test_env/lib/python2.7/site-packages/tornado/concurrent.py:260: in result
    raise_exc_info(self._exc_info)
../../../miniconda/envs/test_env/lib/python2.7/site-packages/tornado/gen.py:1107: in run
    yielded = self.gen.throw(*exc_info)
../../../miniconda/envs/test_env/lib/python2.7/site-packages/distributed/deploy/local.py:180: in _start_worker
    yield w._start()
../../../miniconda/envs/test_env/lib/python2.7/site-packages/tornado/gen.py:1099: in run
    value = future.result()
../../../miniconda/envs/test_env/lib/python2.7/site-packages/tornado/concurrent.py:260: in result
    raise_exc_info(self._exc_info)
../../../miniconda/envs/test_env/lib/python2.7/site-packages/tornado/gen.py:1107: in run
    yielded = self.gen.throw(*exc_info)
../../../miniconda/envs/test_env/lib/python2.7/site-packages/distributed/nanny.py:155: in _start
    response = yield self.instantiate()
../../../miniconda/envs/test_env/lib/python2.7/site-packages/tornado/gen.py:1099: in run
    value = future.result()
../../../miniconda/envs/test_env/lib/python2.7/site-packages/tornado/concurrent.py:260: in result
    raise_exc_info(self._exc_info)
../../../miniconda/envs/test_env/lib/python2.7/site-packages/tornado/gen.py:1107: in run
    yielded = self.gen.throw(*exc_info)
../../../miniconda/envs/test_env/lib/python2.7/site-packages/distributed/nanny.py:223: in instantiate
    self.process.start()
../../../miniconda/envs/test_env/lib/python2.7/site-packages/tornado/gen.py:1099: in run
    value = future.result()
../../../miniconda/envs/test_env/lib/python2.7/site-packages/tornado/concurrent.py:260: in result
    raise_exc_info(self._exc_info)
../../../miniconda/envs/test_env/lib/python2.7/site-packages/tornado/gen.py:1107: in run
    yielded = self.gen.throw(*exc_info)
../../../miniconda/envs/test_env/lib/python2.7/site-packages/distributed/nanny.py:363: in start
    self._wait_until_started())
../../../miniconda/envs/test_env/lib/python2.7/site-packages/tornado/gen.py:1099: in run
    value = future.result()
../../../miniconda/envs/test_env/lib/python2.7/site-packages/tornado/concurrent.py:260: in result
    raise_exc_info(self._exc_info)
../../../miniconda/envs/test_env/lib/python2.7/site-packages/tornado/gen.py:1113: in run
    yielded = self.gen.send(value)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
self = <distributed.nanny.WorkerProcess object at 0x7f202dbe8e10>
    @gen.coroutine
    def _wait_until_started(self):
        delay = 0.05
        while True:
            if self.status != 'starting':
                return
            try:
                msg = self.init_result_q.get_nowait()
>               assert msg == 'started', msg
E               AssertionError: {'dir': '/home/travis/build/spencerahill/aospy/dask-worker-space/worker-mIruo9', 'address': 'tcp://127.0.0.1:36375'}
../../../miniconda/envs/test_env/lib/python2.7/site-packages/distributed/nanny.py:471: AssertionError
---------------------------- Captured stderr setup -----------------------------
distributed.utils - ERROR - {'dir': '/home/travis/build/spencerahill/aospy/dask-worker-space/worker-mIruo9', 'address': 'tcp://127.0.0.1:36375'}
Traceback (most recent call last):
  File "/home/travis/miniconda/envs/test_env/lib/python2.7/site-packages/distributed/utils.py", line 238, in f
    result[0] = yield make_coro()
  File "/home/travis/miniconda/envs/test_env/lib/python2.7/site-packages/tornado/gen.py", line 1099, in run
    value = future.result()
  File "/home/travis/miniconda/envs/test_env/lib/python2.7/site-packages/tornado/concurrent.py", line 260, in result
    raise_exc_info(self._exc_info)
  File "/home/travis/miniconda/envs/test_env/lib/python2.7/site-packages/tornado/gen.py", line 1107, in run
    yielded = self.gen.throw(*exc_info)
  File "/home/travis/miniconda/envs/test_env/lib/python2.7/site-packages/distributed/deploy/local.py", line 163, in _start
    yield [self._start_worker(**self.worker_kwargs) for i in range(n_workers)]
  File "/home/travis/miniconda/envs/test_env/lib/python2.7/site-packages/tornado/gen.py", line 1099, in run
    value = future.result()
  File "/home/travis/miniconda/envs/test_env/lib/python2.7/site-packages/tornado/concurrent.py", line 260, in result
    raise_exc_info(self._exc_info)
  File "/home/travis/miniconda/envs/test_env/lib/python2.7/site-packages/tornado/gen.py", line 849, in callback
    result_list.append(f.result())
  File "/home/travis/miniconda/envs/test_env/lib/python2.7/site-packages/tornado/concurrent.py", line 260, in result
    raise_exc_info(self._exc_info)
  File "/home/travis/miniconda/envs/test_env/lib/python2.7/site-packages/tornado/gen.py", line 1107, in run
    yielded = self.gen.throw(*exc_info)
  File "/home/travis/miniconda/envs/test_env/lib/python2.7/site-packages/distributed/deploy/local.py", line 180, in _start_worker
    yield w._start()
  File "/home/travis/miniconda/envs/test_env/lib/python2.7/site-packages/tornado/gen.py", line 1099, in run
    value = future.result()
  File "/home/travis/miniconda/envs/test_env/lib/python2.7/site-packages/tornado/concurrent.py", line 260, in result
    raise_exc_info(self._exc_info)
  File "/home/travis/miniconda/envs/test_env/lib/python2.7/site-packages/tornado/gen.py", line 1107, in run
    yielded = self.gen.throw(*exc_info)
  File "/home/travis/miniconda/envs/test_env/lib/python2.7/site-packages/distributed/nanny.py", line 155, in _start
    response = yield self.instantiate()
  File "/home/travis/miniconda/envs/test_env/lib/python2.7/site-packages/tornado/gen.py", line 1099, in run
    value = future.result()
  File "/home/travis/miniconda/envs/test_env/lib/python2.7/site-packages/tornado/concurrent.py", line 260, in result
    raise_exc_info(self._exc_info)
  File "/home/travis/miniconda/envs/test_env/lib/python2.7/site-packages/tornado/gen.py", line 1107, in run
    yielded = self.gen.throw(*exc_info)
  File "/home/travis/miniconda/envs/test_env/lib/python2.7/site-packages/distributed/nanny.py", line 223, in instantiate
    self.process.start()
  File "/home/travis/miniconda/envs/test_env/lib/python2.7/site-packages/tornado/gen.py", line 1099, in run
    value = future.result()
  File "/home/travis/miniconda/envs/test_env/lib/python2.7/site-packages/tornado/concurrent.py", line 260, in result
    raise_exc_info(self._exc_info)
  File "/home/travis/miniconda/envs/test_env/lib/python2.7/site-packages/tornado/gen.py", line 1107, in run
    yielded = self.gen.throw(*exc_info)
  File "/home/travis/miniconda/envs/test_env/lib/python2.7/site-packages/distributed/nanny.py", line 363, in start
    self._wait_until_started())
  File "/home/travis/miniconda/envs/test_env/lib/python2.7/site-packages/tornado/gen.py", line 1099, in run
    value = future.result()
  File "/home/travis/miniconda/envs/test_env/lib/python2.7/site-packages/tornado/concurrent.py", line 260, in result
    raise_exc_info(self._exc_info)
  File "/home/travis/miniconda/envs/test_env/lib/python2.7/site-packages/tornado/gen.py", line 1113, in run
    yielded = self.gen.send(value)
  File "/home/travis/miniconda/envs/test_env/lib/python2.7/site-packages/distributed/nanny.py", line 471, in _wait_until_started
    assert msg == 'started', msg
AssertionError: {'dir': '/home/travis/build/spencerahill/aospy/dask-worker-space/worker-mIruo9', 'address': 'tcp://127.0.0.1:36375'}
@spencerkclark
Collaborator

Interesting...I've been getting these sorts of intermittent errors offline (outside the aospy test suite) with the latest version of distributed as well. I wonder if it is an upstream issue?

@spencerahill
Owner Author

Good to know. Apart from 3.4, the other environments are using the same distributed version as 2.7.

Were your errors using 2.7 or otherwise?

@spencerkclark
Collaborator

spencerkclark commented Mar 22, 2018

My errors were using Python 3.6.1 and distributed version 1.21.3.

@spencerahill
Owner Author

OK. So it's possible that our seeing them only on 2.7 is just a coincidence.

I'd say this is unimportant enough to just wait it out for now...if it persists much longer we can dig in more/file a report.

@spencerahill
Owner Author

Just saw this: http://matthewrocklin.com/blog/work/2018/03/21/dask-0.17.2

This may be related to the version of Tornado (which Dask uses). The new 5.0 release introduced some funniness w/ Dask.

@spencerkclark
Collaborator

I think this may have been fixed in the latest release of distributed. We can keep an eye on it. See dask/distributed#1865.

@spencerahill
Owner Author

Nice, thanks for catching that. Once we have a few commits where this doesn't occur, I'll close this.

@spencerahill
Owner Author

In the most recent batch of these errors, the distributed version is 1.21.6 in both the failing and passing cases. So it's not as simple as a particular version.

The Tornado version is also the same across the passing and failing tests here, 5.0.2.
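
For comparing passing and failing jobs, something like the following could be run at the start of each CI job to record the versions in the log. This is only a sketch: the helper name `installed_versions` is made up for illustration, and it assumes the packages expose their version as either `__version__` or `version` (Tornado uses the latter).

```python
import importlib


def installed_versions(names=("distributed", "tornado", "dask")):
    """Return a dict mapping each package name to its version string.

    Packages that cannot be imported map to None.
    """
    versions = {}
    for name in names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            versions[name] = None
            continue
        # Most packages define __version__; Tornado defines `version` instead.
        fallback = getattr(module, "version", None)
        versions[name] = getattr(module, "__version__", fallback)
    return versions


if __name__ == "__main__":
    for package, version in sorted(installed_versions().items()):
        print("{}: {}".format(package, version))
```

Printing this in every job makes it easy to check whether a failing build really differs from a passing one in its dependency versions.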

@spencerkclark
Collaborator

I resolved my issues on the GFDL PPAN cluster by specifying death_timeout=None in my calls to LocalCluster; this disables the timeout logic. That could potentially be a fix here.
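
A rough sketch of what that could look like in a test helper, assuming `LocalCluster` forwards `death_timeout` to the worker processes it starts, as it did for me. The names `CLUSTER_KWARGS` and the context-manager form of `external_client` are illustrative, not the actual fixture in `aospy/test/test_automate.py`; the `distributed` import is deferred so the module can be imported without it.

```python
import contextlib

# Keep the worker count at 4 (as in the existing fixture) and disable the
# worker startup timeout. death_timeout=None is the workaround described
# above, not a confirmed fix for the Travis failures.
CLUSTER_KWARGS = {"n_workers": 4, "death_timeout": None}


@contextlib.contextmanager
def external_client(**overrides):
    """Yield a Client attached to a LocalCluster, then shut both down."""
    import distributed  # deferred import; requires distributed at call time

    kwargs = dict(CLUSTER_KWARGS, **overrides)
    cluster = distributed.LocalCluster(**kwargs)
    client = distributed.Client(cluster)
    try:
        yield client
    finally:
        # Close the client first, then the cluster it is attached to.
        client.close()
        cluster.close()
```

A test would then use `with external_client() as client: ...`, and the same keyword arguments could be passed directly to `LocalCluster` inside the existing pytest fixture instead.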

spencerkclark added a commit to spencerkclark/aospy that referenced this issue Apr 26, 2018