Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix random test failure #84

Merged
merged 4 commits into from Aug 8, 2019
Merged

Fix random test failure #84

merged 4 commits into from Aug 8, 2019

Conversation

bmwiedemann
Copy link
Contributor

@bmwiedemann bmwiedemann commented Aug 25, 2018

While working on reproducible builds for openSUSE, I noticed
that test_popen_nice would randomly fail on 1-core-VMs

         remotenice = gw.remote_exec(getnice).receive()
         gw.exit()
         if remotenice is not None:
             gw = makegateway("popen//nice=5")
             remotenice2 = gw.remote_exec(getnice).receive()
 >           assert remotenice2 == remotenice + 5
 E           assert 0 == (0 + 5)

@bmwiedemann
Copy link
Contributor Author

rebased to master and added more sleeps to make tests pass.
Maybe instead there should be some code change so that gw.remote_foo waits a bit for the status to settle, to make the whole library easier to use?
Does not need to be constant waiting times, but could wait for some "update done" message

@bmwiedemann
Copy link
Contributor Author

Often the behaviour of 1-core-VMs can be simulated by prefixing a command with taskset 1

@nicoddemus
Copy link
Member

I don't really know execnet internals in order to evaluate the changes, but the tests pass and the change is in the test suite only... @RonnyPfannschmidt @hpk42 should we merge this as is, or does someone wants to take the time to implement the "update done" mechanism proposed by @bmwiedemann?

@RonnyPfannschmidt
Copy link
Member

i'm no longer working on execnet

@@ -90,6 +91,7 @@ def test_gateway_status_busy(self, gw):
ch2.waitclose()
for i in range(10):
status = gw.remote_status()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

gw.remote_status() is a synchronous operation -- no time.sleep() should be neccessary.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but it was already repeated 10 times - so it is waiting on something.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See also the test run from the first commit:
https://travis-ci.org/pytest-dev/execnet/jobs/427581580

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i believe with the introduction of the execmodel backends behaviour changed the expectations and a wait-for with some sleep time is needed to accommodate for the extra sleep time

@hpk42
Copy link
Contributor

hpk42 commented Sep 24, 2018

gw.remote_status() does not change the dict it returns, afterwards.

Do i see it correctly that all test failures are bout "nice" settings and verifications?

If so, i'd say there is some flakyness there and the time.sleep() just make it less likely issues occur.

testing/test_gateway.py Outdated Show resolved Hide resolved
@@ -137,6 +138,7 @@ def getnice(channel):
gw.exit()
if remotenice is not None:
gw = makegateway("popen//nice=5")
time.sleep(0.1)
remotenice2 = gw.remote_exec(getnice).receive()
assert remotenice2 == remotenice + 5
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you have a supporting traceback for changing this line?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is part of the first commit's message.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I uploaded a full log at https://www.zq1.de/~bernhard/temp/python-execnet-build.log
To trigger it, I generated some parallel load with perl -e 'fork;fork;fork;while(1){}'

@bmwiedemann
Copy link
Contributor Author

Added 4th sleep because of https://travis-ci.org/pytest-dev/execnet/jobs/432459677

@bmwiedemann
Copy link
Contributor Author

ping. Anything left to do?

@nicoddemus
Copy link
Member

Thanks for the ping @bmwiedemann, I'll let @hpk42 decide on this one.

@hpk42
Copy link
Contributor

hpk42 commented Oct 2, 2018

Can any of these test failures (without your sleep's) be reproduced on python3?

time.sleep()s should not be neccessary, obviously. I suggest to rather add something like:

flakytest = pytest.mark.xfail(reason='on some systems this test fails due to timing problems')

And then marking the offending tests as @flakytest. If it's only happening on python2, then the decorator can use that condition in the marker.

@bmwiedemann
Copy link
Contributor Author

@hpk42 https://travis-ci.org/pytest-dev/execnet/jobs/432459677 from above said py35
and the one I can reproduce locally talks about py.test-3.6

so I think it is a general timing problem.
I'm not sure if this could be problematic outside of tests - in that case, adding @flakytest would not be the right change.

@hpk42
Copy link
Contributor

hpk42 commented Oct 2, 2018

thanks for the clarification. i don't see a difference between @FlakyTest as compared to time.sleep -- in fact silently passing a timing problem (time.sleep) is worse than marking a test and getting it reported as "x" or "X".

@bmwiedemann
Copy link
Contributor Author

Updated to use @flakytest but could only fully test the test_xspec part that I can reproducible locally.

While working on reproducible builds for openSUSE, I noticed
that test_popen_nice would randomly fail on 1-core-VMs

         remotenice = gw.remote_exec(getnice).receive()
         gw.exit()
         if remotenice is not None:
             gw = makegateway("popen//nice=5")
             remotenice2 = gw.remote_exec(getnice).receive()
 >           assert remotenice2 == remotenice + 5
 E           assert 0 == (0 + 5)
In travis-ci it did
test_status_with_threads
E       AssertionError: numexecuting didn't drop to zero

test_gateway_status_busy
E           Failed: did not get correct remote status

test_popen_stderr_tracing
>       assert slave_line in err
E       AssertionError: assert '[2361] creating slavegateway' in 'ay_base.Popen2IO object at 0x7f19024bba58>\n[2361] gw0-slave [serve] spawning receiver thread\n[2361] gw0-slave [serv...1 lendata=6>\n[2361] gw0-slave execution finished\n[2361] gw0-slave sent <Message CHANNEL_CLOSE channel=1 lendata=0>\n'
@nicoddemus nicoddemus merged commit 8cae029 into pytest-dev:master Aug 8, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants