
pytest-xdist hangs if max-slave-restart is set and the restart limit is reached #45

Closed
fizyk opened this issue Feb 9, 2016 · 15 comments

@fizyk

fizyk commented Feb 9, 2016

It can be observed in this build:
https://travis-ci.org/ClearcodeHQ/pytest-dbfixtures/jobs/108118580

I'm testing whether a package is configured properly (all non-.py files are included, hence the change to install), and installing the package with "pip install ." before the tests.
pytest_load_initial_conftests throws a ValidationError if the additional files cannot be accessed. However, I end up with a loop of slaves restarting indefinitely, or, if I set --max-slave-restart, with a hanging py.test instead.

@RonnyPfannschmidt
Member

I found time to take a look; it's indeed two bugs. One is in pytest and will require the upcoming 2.9 release (parameter passing).

The other is a deadlock condition in xdist that's scheduled to be fixed in xdist 2.0, since it needs an extensive internal refactoring to address the issue properly.

@fizyk
Author

fizyk commented Feb 10, 2016

Thanks. I'll look forward to the new xdist and py.test then :)

@alfredodeza

alfredodeza commented Oct 26, 2016

I am hitting this same issue. I can trigger the problem when using --rsyncdir and my conftest.py file gets executed remotely. In my case an exception is raised in conftest.py, which causes the slave to be restarted.

Since the slave restarts got into a loop (slave connects, conftest raises an exception, slave restarts), I added --max-slave-restart=1 initially and then --max-slave-restart=0, both with the same effect: the connection hangs forever. There is no way to exit unless a Ctrl-C is sent to the process.

[node2] node down: Traceback (most recent call last):
  File "<string>", line 1072, in executetask
  File "<string>", line 1, in do_exec
  File "<remote exec>", line 155, in <module>
  File "<remote exec>", line 129, in remote_initconfig
  File "/home/vagrant/pyexecnetcache/_pytest/config.py", line 912, in fromdictargs
    config.parse(args, addopts=False)
  File "/home/vagrant/pyexecnetcache/_pytest/config.py", line 1039, in parse
    self.hook.pytest_cmdline_preparse(config=self, args=args)
  File "/home/vagrant/pyexecnetcache/_pytest/vendored_packages/pluggy.py", line 745, in __call__
    return self._hookexec(self, self._nonwrappers + self._wrappers, kwargs)
  File "/home/vagrant/pyexecnetcache/_pytest/vendored_packages/pluggy.py", line 339, in _hookexec
    return self._inner_hookexec(hook, methods, kwargs)
  File "/home/vagrant/pyexecnetcache/_pytest/vendored_packages/pluggy.py", line 334, in <lambda>
    _MultiCall(methods, kwargs, hook.spec_opts).execute()
  File "/home/vagrant/pyexecnetcache/_pytest/vendored_packages/pluggy.py", line 614, in execute
    res = hook_impl.function(*args)
  File "/home/vagrant/pyexecnetcache/tests/conftest.py", line 100, in pytest_cmdline_preparse
    components = load_component_config(component_path, args=args)
  File "/home/vagrant/pyexecnetcache/tests/conftest.py", line 42, in load_component_config
    raise RuntimeError('`%s` is not a file.' % abspath)
RuntimeError: `/home/vagrant/pyexecnetcache/components/base.py` is not a file.

Slave restarting disabled

Versions used:

py.test 3.0.3
pytest-xdist 1.15.0
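For reference, the failure mode in the traceback above can be reduced to a minimal conftest.py sketch. The helper name load_component_config and the components/base.py path come from the traceback; the rest is a hypothetical reconstruction, not the actual project code:

```python
# conftest.py (sketch): fails during config parsing when a data file
# that should have been rsynced to the remote side is missing.
import os


def load_component_config(component_path, args=None):
    # Same check as in the traceback: reject anything that is not
    # a regular file on this machine.
    abspath = os.path.abspath(component_path)
    if not os.path.isfile(abspath):
        raise RuntimeError('`%s` is not a file.' % abspath)
    return abspath


def pytest_cmdline_preparse(config, args):
    # Raising here on a worker makes xdist restart it, which is what
    # produces the restart loop (or, with --max-slave-restart, the hang)
    # described in this issue.
    load_component_config('components/base.py', args=args)
```

With --max-slave-restart unset the workers restart in a loop; with it set, the master hangs once the restart limit is reached.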

@alfredodeza

Maybe allow a timeout value? execnet allows setting a timeout (in seconds) to disconnect gracefully from the remote connection.

This is pretty bad for us because we rely on this feature for (automated) CI test runs and we keep hitting it.

@nicoddemus
Member

Thanks @alfredodeza.

Just to comment that this issue has not been forgotten; it's just that we maintainers haven't yet had the time to look at it properly. If someone wants to try a PR, we would appreciate it.

@lemontree255

Hi,

Are there any updates on this issue?

@RonnyPfannschmidt
Member

nope

@timj
Contributor

timj commented Sep 28, 2017

In case it helps, the problem seems to be that the while loop in DSession.loop_once() doesn't realize that all the workers have disappeared. It keeps checking whether the queue is empty with a 2-second timeout, the answer is always "yes, empty", and then it tries again. Does DSession have any way to know that there are no workers left?

The other thing I see is that forked processes that die don't seem to be collected properly: they hang around until the parent dies (which is the problem I have when infinite worker restarts are allowed: eventually no more processes can be created and the system locks up).
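The polling behaviour described above can be reproduced in isolation. A minimal sketch (simplified signatures, not the real DSession API) showing why a guard on the set of live workers is needed before polling the queue:

```python
import queue


def loop_once(event_queue, active_nodes, timeout=0.1):
    """Simplified sketch of the DSession.loop_once polling loop.

    Without the ``active_nodes`` guard, an empty queue with no live
    workers makes this spin forever: queue.Empty, retry, repeat.
    """
    while True:
        if not active_nodes:
            # All workers are gone; nothing will ever be enqueued again.
            raise RuntimeError("No active nodes")
        try:
            return event_queue.get(timeout=timeout)
        except queue.Empty:
            continue
```

With at least one node registered and an event in the queue, the call returns the event; with an empty node set it bails out instead of blocking forever.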

@timj
Contributor

timj commented Sep 28, 2017

This works for me, checking whether there are any active nodes, although I'm sure there's a better way to force an exit than raising RuntimeError:

    def loop_once(self):
        """Process one callback from one of the slaves."""
        while 1:
            if not self._active_nodes:
                # If everything has died stop looping
                self.triggershutdown()
                raise RuntimeError("No active nodes")
            try:
                eventcall = self.queue.get(timeout=2.0)
                break
            except queue.Empty:
                continue

@nicoddemus
Member

Did you try just returning from it after triggershutdown?

    def loop_once(self):
        """Process one callback from one of the slaves."""
        while 1:
            if not self._active_nodes:
                # If everything has died stop looping
                self.triggershutdown()
                return
            try:
                eventcall = self.queue.get(timeout=2.0)
                break
            except queue.Empty:
                continue

@timj
Contributor

timj commented Oct 2, 2017

Yes. If you do that, the process exits with "no tests found". I think it should indicate more strongly that things went wrong.

@nicoddemus
Member

Oh right. Hmm, not sure then; perhaps @RonnyPfannschmidt has a suggestion on which exception should be thrown?

@RonnyPfannschmidt
Member

nope

@nicoddemus
Member

In that case I think raising an internal error is better than the alternative, which is to just loop forever. We can worry about providing a nicer message later.

@timj would you like to open a PR with this fix?

@nicoddemus
Member

Fixed by #238
