torproject / stem
test_take_ownership_via_controller fails in Tor's "make test-stem" #52
Comments
---
Hi teor. You previously asked this on... https://trac.torproject.org/projects/tor/ticket/32819. Stem waits up to five seconds for the process to terminate, and tor did not do so. The ball is now in tor's court to fix this, or to tell me if the SLA should be something else.
---
Tor checks every 15 seconds, so the limit should be at least 20 seconds:
---
> Tor checks for the owning process every fifteen seconds, so this test's timeout is way too low... #52

Puzzling that this passes reliably for me. With that poll rate this should be failing far more often...
---
Thanks teor! Timeout changed...
---
(The polling is a fallback on some systems; other systems get a notification about process changes immediately.)
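The fix above amounts to raising stem's wait well past tor's fifteen-second owning-process poll interval. As a general sketch of that pattern (not stem's actual implementation), a deadline-based wait looks like this; the function name and defaults are illustrative:

```python
import time

def wait_until(predicate, timeout, poll_interval=0.5):
    """Poll predicate() until it returns True or the deadline passes.

    The timeout must comfortably exceed the other side's own polling
    interval (tor re-checks its owning process every fifteen seconds),
    plus some margin for load, or the wait will race the poller.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(poll_interval)
    return predicate()  # one final check at the deadline
```

With a fifteen-second poll on tor's side, a twenty-second `timeout` leaves a five-second margin, which is the shape of the change discussed above.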
---
Tor's CI is still failing with the same Stem test error. Can you display the tor logs when a stem test fails?
---
Hi teor. Tor's logs are available on disk at...
Dumping the log for any and all test failures would be unhelpfully verbose. Rather, I'd suggest having the Travis process that runs the tests cat that file when there is a non-zero exit status.
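The "cat the log only on a non-zero exit status" suggestion can be sketched as follows. The function name, command, and log path are hypothetical placeholders for the CI's actual `make test-stem` invocation and tor's on-disk log, not the Travis configuration that was eventually used:

```python
import subprocess
import sys

def run_and_dump_log_on_failure(cmd, log_path):
    """Run a test command; if it exits non-zero, replay the tor log.

    This keeps passing runs quiet while still surfacing tor's log
    whenever the test suite fails.
    """
    result = subprocess.run(cmd)
    if result.returncode != 0:
        try:
            with open(log_path) as log:
                sys.stderr.write(log.read())
        except OSError:
            sys.stderr.write('no tor log found at %s\n' % log_path)
    return result.returncode
```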
---
Thanks, give me a few moments and I'll come back with the tor logs, and hopefully the next steps to fix these issues.
---
Here are the Tor logs: And the stem trace logs: I can't see the stem test names in either log, so I'm going to set up a CI branch that just runs the owning controller process test. Unless you can find the output for that particular test?
---
Here's the failure in Travis: Stem logs: Tor logs: I can't grep for log domains, because they're not shown in these logs. But I can try to find the right functions. I can also reproduce this failure locally with the latest tor and stem master.
---
This test passes on 0.4.2 but fails on master, so it's probably a Tor bug:
---
We're tracking it in https://trac.torproject.org/projects/tor/ticket/33006
---
This issue is caused by the precise timing of Tor's controller implementation. On my machine, it passes on:
And it fails on:
On Travis, we see the same results, but Travis always uses --enable-fragile-hardening for its stem tests.

It looks like this timing issue was introduced in the https://trac.torproject.org/projects/tor/ticket/30984 refactor, in commit torproject/tor@c744d23 (at least on my machine). Tor doesn't guarantee control reply timing, and we're unlikely to be able to restore the old timing behaviour. So stem's tests should be adapted to work with the timing in both Tor 0.4.2 and Tor master. See https://trac.torproject.org/projects/tor/ticket/33006 for details.

---
Hi teor. Sorry, I don't follow: what duration should our tests wait for tor to terminate, or are you saying we should simply drop the test?
---
This test works on Tor 0.4.2, and on Tor master when compiled without extra hardening checks. But it fails when master is compiled with hardening. So either:
I can't see any obvious changes in Tor's output, and I don't know enough about the design of stem's tests to diagnose this issue further. We could try diffing stem's trace logs on master, with and without hardening? But I won't be able to do that today; I have to finish off a proposal for sponsored work.

---
Gotcha. The test is simple...
Nothing about this involves precise timing; the only timebound part is a big twenty-second window. This should be manually reproducible without running the tests. I'd be happy to adjust the test if there is something different it should do, but for the moment I'm unsure if this is actionable on my part.
---
The tests connect to tor 4 times (or send 4 batches of commands), and they do a lot more than you've described. The third and fourth connections are interleaved, and that interleaving appears to trigger a race condition in stem's tests (or stem, or tor). But it's hard to diagnose this issue, because it's unclear from the stem trace logs:
Here are the detailed logs from the failing test at: The first connection isn't part of the test, but it sets __OwningControllerProcess=20405, runs TAKEOWNERSHIP, then resets __OwningControllerProcess:
Are these commands sent to the same tor process as the test? The next connections are part of the test:
The second connection gets the control port:
The third connection gets __OwningControllerProcess=20486, runs TAKEOWNERSHIP, and runs RESETCONF __OwningControllerProcess, and then is interrupted by the fourth connection:
The fourth connection starts:
It's unclear which connection this is, but it's probably the fourth connection:
It's unclear which connection this is, but it's probably the third connection:
It's unclear which connection this is, but it's probably the fourth connection:
It's unclear which connection this is, but it's probably the end of the third connection:
It appears that stem stopped writing commands to the fourth connection partway through and never finished them. There's no final [DEBUG] log for the fourth connection, so I can't tell what it was trying to do. And then stem makes a few system calls:
And 20 seconds later, the test finishes, due to a timeout failure:
Here's what I need to know to diagnose this issue:
Here's what stem might be doing wrong:
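For reference, the ownership handshake that the log excerpts above describe (set the owning PID, take ownership, then drop the PID fallback) can be written out as the control-port command sequence below. This is a simplified sketch based on tor's control protocol; authentication and reply handling are omitted:

```python
def ownership_handshake(pid):
    """The controller-side command sequence shown in the trace excerpts.

    After TAKEOWNERSHIP succeeds, RESETCONF __OwningControllerProcess
    drops the PID-polling fallback: tor then exits when this control
    connection closes, rather than when the owning PID disappears.
    """
    return [
        'SETCONF __OwningControllerProcess=%d' % pid,
        'TAKEOWNERSHIP',
        'RESETCONF __OwningControllerProcess',
    ]
```

The race described above would sit somewhere in how two connections each run this sequence against the same tor instance.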
---
I suspect you're getting confused by multiple tests running in parallel. Above I advised reproducing this outside the tests: did you try that? If not, here's how you do so... When I run that I get... To prove the test worked, I commented out the 'take_ownership' line and it failed after twenty seconds, as expected, with an assertion error. Try that with your hardened tor instance by adding that argument to tor_cmd and see if that reproduces the problem. If so, this narrows the haystack. And if not, then the next step is for us to puzzle out why.
---
Those logs come from this command-line:
As far as I can see, it only runs the problematic test:
I'll have time to do more on this issue next week.
---
Yes, it's only running one test, but it's still bootstrapping a copy of tor in anticipation of running all of the other tests. I can explain the log output if you'd like, but it would be simpler if we narrow the investigation to the script provided above.
---
Nick and Taylor have discovered that Tor is leaking memory due to a control refactor in master. The memory leaks result in a different exit status when tor is built with --enable-fragile-hardening: Here are two stem changes that will help us diagnose issues like this faster in future: At the moment, stem just hangs when poll() returns an unexpected exit status. It would also be helpful if stem logged tor's stderr output to a file. Address sanitiser errors are always output on stderr. Tor crash backtraces are also usually output on stderr.
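The two suggestions above (fail fast when `poll()` reports an unexpected exit status, and capture stderr to a file) can be sketched together like this. The function and paths are illustrative, not stem's implementation:

```python
import subprocess
import time

def launch_and_watch(cmd, stderr_path, timeout=20):
    """Launch a process with stderr captured to a file, and raise as
    soon as it exits with an unexpected status instead of hanging.

    Sanitiser errors and crash backtraces land in stderr_path, so a
    failure report can point at that file rather than being lost.
    """
    with open(stderr_path, 'w') as stderr_file:
        proc = subprocess.Popen(cmd, stderr=stderr_file)
        deadline = time.monotonic() + timeout
        while time.monotonic() < deadline:
            status = proc.poll()
            if status is not None:
                if status != 0:
                    raise RuntimeError(
                        'process exited with status %d, stderr in %s'
                        % (status, stderr_path))
                return 0  # clean exit
            time.sleep(0.1)
        proc.kill()
        raise RuntimeError('process still running after %ds' % timeout)
```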
---
I have a PR for this part, for your consideration: see #54.
---
Thanks Nick! Great catch, merged your patch. As for stderr, we do emit that when tor crashes... https://trac.torproject.org/projects/tor/ticket/27717 However, this only concerns the main tor process we test against. Our integ/process.py spawns separate tor instances to exercise startup and exit behavior. Their asynchronicity also complicates this (these tests each run in subprocesses). Doable if necessary, but between the significant effort and the tiny scope, I'd advise against it for now.

---
I'm not sure if this is a stem or tor issue, or even a Travis load issue:

```
FAIL: test_take_ownership_via_controller
Traceback (most recent call last):
  File "/home/travis/build/torproject/tor/stem/stem/util/test_tools.py", line 143, in
    self.method = lambda test: self.result(test) # method that can be mixed into TestCases
  File "/home/travis/build/torproject/tor/stem/stem/util/test_tools.py", line 210, in result
    test.fail(self._result.msg)
AssertionError: tor didn't quit after the controller that owned it disconnected
```

See https://travis-ci.org/torproject/tor/jobs/637520130#L3650 for details