
test_take_ownership_via_controller fails in Tor's "make test-stem" #52

Closed
teor2345 opened this issue Jan 16, 2020 · 22 comments

teor2345 (Contributor) commented Jan 16, 2020

I'm not sure if this is a stem or tor issue, or even a Travis load issue:

FAIL: test_take_ownership_via_controller

Traceback (most recent call last):
  File "/home/travis/build/torproject/tor/stem/stem/util/test_tools.py", line 143, in
    self.method = lambda test: self.result(test) # method that can be mixed into TestCases
  File "/home/travis/build/torproject/tor/stem/stem/util/test_tools.py", line 210, in result
    test.fail(self._result.msg)
AssertionError: tor didn't quit after the controller that owned it disconnected

See https://travis-ci.org/torproject/tor/jobs/637520130#L3650 for details

atagar (Contributor) commented Jan 16, 2020

Hi teor. You previously asked this on...

https://trac.torproject.org/projects/tor/ticket/32819

Stem waits up to five seconds for the process to terminate, and tor did not do so. The ball is now in tor's court to fix this, or tell me if the SLA should be something else.

atagar closed this Jan 16, 2020
teor2345 (Contributor, Author) commented Jan 16, 2020

Tor checks every 15 seconds, so the limit should be at least 20 seconds:
https://github.com/torproject/tor/blob/4f02812242d1fd90d859eb98ac3fb1ed182f18cf/src/lib/evloop/procmon.c#L168
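
In other words, any wait shorter than one full poll interval can race tor's procmon fallback. A minimal sketch of a wait loop whose timeout accounts for the poll interval (a hypothetical helper for illustration, not stem's actual code):

```python
import time

def wait_for_exit(process, poll_interval=15, grace=5):
    """Wait for a subprocess to exit, allowing for at least one full
    poll tick of the process monitor plus some shutdown grace time."""
    deadline = time.time() + poll_interval + grace
    while time.time() < deadline:
        if process.poll() is not None:  # process has terminated
            return True
        time.sleep(0.1)
    return False
```

With a 15 second poll interval, this gives a 20 second ceiling, matching the limit suggested above.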

@teor2345 teor2345 reopened this Jan 16, 2020
torproject-pusher pushed a commit that referenced this issue Jan 16, 2020
Tor checks for the owning process every fifteen seconds, so this test's timeout
is way too low...

  #52

Puzzling that this passes reliably for me. With that poll rate this should be
failing far more often...
atagar closed this Jan 16, 2020
teor2345 (Contributor, Author) commented Jan 21, 2020

(The polling is a fallback on some systems, other systems get a notification about process changes immediately.)

teor2345 reopened this Jan 21, 2020
teor2345 (Contributor, Author) commented Jan 21, 2020

Tor's CI is still failing with the same Stem test error:
https://travis-ci.org/torproject/tor/jobs/639546796#L3657

Can you display the tor logs when a stem test fails?
Then we'll have more information to help diagnose this issue.

atagar (Contributor) commented Jan 21, 2020

Hi teor. Tor's logs are available on disk at...

Log debug file /home/travis/build/torproject/tor/stem/test/data/tor_log

Dumping the log for any and all test failures would be unhelpfully verbose. Rather, I'd suggest having the Travis process that runs the tests cat that file when there is a non-zero exit status.
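
That suggestion can be wired into the CI step with a small wrapper; a sketch (the helper name is illustrative, and the log path comes from the comment above):

```shell
# Run a command and, on non-zero exit, dump the given log file.
run_with_log_dump() {
  local log="$1"; shift
  if ! "$@"; then
    echo "=== dumping ${log} ==="
    cat "${log}"
    return 1
  fi
}

# e.g. in the Travis script step:
#   run_with_log_dump stem/test/data/tor_log make test-stem
```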

teor2345 (Contributor, Author) commented Jan 21, 2020

Thanks, give me a few moments, and I'll come back with the tor logs, and hopefully the next steps to fix these issues.

teor2345 (Contributor, Author) commented Jan 21, 2020

Here are the Tor logs:
https://travis-ci.org/torproject/tor/jobs/639546796#L4730

And the stem trace logs:
https://travis-ci.org/torproject/tor/jobs/639546796#L5731

I can't see the stem test names in either log, so I'm going to set up a CI branch that just runs the owning controller process test.

Unless you can find the output for that particular test?

teor2345 (Contributor, Author) commented Jan 21, 2020

Here's the failure in Travis:

Stem logs:
https://travis-ci.org/teor2345/tor/builds/639735548#L5757

Tor logs:
https://travis-ci.org/teor2345/tor/builds/639735548#L4708
I think the Tor logs are too verbose to show anything interesting. Is there a way of restricting them to info-level? Or particular log domains?

I can't grep for log domains, because they're not shown in these logs. But I can try to find the right functions.

I can also reproduce this failure locally with the latest tor and stem master.

teor2345 (Contributor, Author) commented Jan 21, 2020

This test passes on 0.4.2, but fails on master. So it's probably a Tor bug:
https://travis-ci.org/torproject/tor/jobs/637709536

teor2345 closed this Jan 21, 2020
teor2345 reopened this Jan 21, 2020
teor2345 (Contributor, Author) commented Jan 21, 2020

This issue is caused by the precise timing of Tor's controller implementation.

On my machine, it passes on:

  • master without --enable-fragile-hardening
  • Tor 0.4.2 with --enable-fragile-hardening
  • Tor 0.4.2 without --enable-fragile-hardening

And it fails on:

  • master with --enable-fragile-hardening

On Travis, we see the same results, but Travis always uses --enable-fragile-hardening for its stem tests.

It looks like this timing issue was introduced in the https://trac.torproject.org/projects/tor/ticket/30984 refactor, in commit torproject/tor@c744d23. (At least on my machine.)

Tor doesn't guarantee control reply timing. And we're unlikely to be able to restore the old timing behaviour. So stem's tests should be adapted to work with the timing in both Tor 0.4.2 and Tor master.

See https://trac.torproject.org/projects/tor/ticket/33006 for details.

atagar (Contributor) commented Jan 22, 2020

Tor doesn't guarantee control reply timing. And we're unlikely to be able to restore the old timing behaviour. So stem's tests should be adapted to work with the timing in both Tor 0.4.2 and Tor master.

Hi teor. Sorry, I don't follow - what is the duration our tests should wait for tor to terminate, or are you saying we should simply drop the test?

teor2345 (Contributor, Author) commented Jan 22, 2020

This test works on Tor 0.4.2, and Tor master when compiled without extra hardening checks. But it fails when master is compiled with hardening.

So either:

  • there is a bug in Tor, where it sends different replies when hardening is turned on, or
  • there is a timing race condition in stem's test, where it depends on the exact millisecond timing of tor's replies.

I can't see any obvious changes in Tor's output, and I don't know enough about the design of stem's tests to further diagnose this issue.

We could try doing a diff of stem's trace logs on master, with and without hardening?

But I won't be able to do that today; I have to finish off a proposal for sponsored work.

atagar (Contributor) commented Jan 22, 2020

there is a timing race condition in stem's test, where it depends on the exact millisecond timing of tor's replies

Gotcha. The test is simple...

  • Spawn a tor process with "__OwningControllerProcess [test pid]" among its arguments.
  • Connect then disconnect from its control port.
  • Wait up to twenty seconds for it to terminate.

Nothing about this involves precise timing; the only time-bound part is a big twenty second window. This should be manually reproducible without running the tests.

I'd be happy to adjust the test if there is something different that it should do, but for the moment I'm unsure if this is actionable on my part.

teor2345 (Contributor, Author) commented Jan 23, 2020

The tests connect to tor 4 times (or send 4 batches of commands), and they do a lot more than you've described.

The third and fourth connections are interleaved, and that interleaving appears to trigger a race condition in stem's tests (or stem, or tor).

But it's hard to diagnose this issue, because it's unclear from the stem trace logs:

  • how many tor instances there are,
  • how many connections there are,
  • what stem code is running on each connection (particularly unfinished connections), and
  • which TRACE logs belong to each tor instance and each connection.
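The attribution problem in the last two bullets could be addressed by tagging each record with a connection identifier; a sketch of what such logging could look like (this is an illustration, not stem's logging API):

```python
import logging

def make_conn_logger(conn_id, stream):
    """Build a logger whose records carry a per-connection tag, so
    interleaved TRACE lines can be attributed to their connection."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter('[conn %(conn)s] %(message)s'))
    logger = logging.getLogger('trace.conn%d' % conn_id)
    logger.setLevel(logging.DEBUG)
    logger.addHandler(handler)
    return logging.LoggerAdapter(logger, {'conn': conn_id})
```

With tags like these, the "it's unclear which connection this is" guesses below would be unnecessary.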

Here are the detailed logs from the failing test at:
https://travis-ci.org/torproject/tor/builds/639737322#L5715

The first connection isn't part of the test, but it sets __OwningControllerProcess=20405, runs TAKEOWNERSHIP, then resets __OwningControllerProcess:

01/21/2020 01:31:27 [TRACE] Sent to tor: PROTOCOLINFO 1
01/21/2020 01:31:27 [TRACE] Received from tor:
250-PROTOCOLINFO 1
250-AUTH METHODS=NULL
250-VERSION Tor="0.4.3.0-alpha-dev"
250 OK
01/21/2020 01:31:27 [TRACE] Sent to tor: AUTHENTICATE
01/21/2020 01:31:27 [TRACE] Received from tor: 250 OK
01/21/2020 01:31:27 [TRACE] Sent to tor: SETEVENTS SIGNAL CONF_CHANGED STATUS_SERVER
01/21/2020 01:31:27 [TRACE] Received from tor: 250 OK
01/21/2020 01:31:27 [TRACE] Sent to tor: GETCONF __owningcontrollerprocess
01/21/2020 01:31:27 [TRACE] Received from tor: 250 __OwningControllerProcess=20405
01/21/2020 01:31:27 [DEBUG] GETCONF __owningcontrollerprocess (runtime: 0.0007)
01/21/2020 01:31:27 [TRACE] Sent to tor: TAKEOWNERSHIP
01/21/2020 01:31:27 [TRACE] Received from tor: 250 OK
01/21/2020 01:31:27 [TRACE] Sent to tor: RESETCONF __OwningControllerProcess
01/21/2020 01:31:27 [TRACE] Received from tor: 250 OK
01/21/2020 01:31:27 [TRACE] Received from tor:
650-CONF_CHANGED
650-__OwningControllerProcess
650 OK
01/21/2020 01:31:27 [DEBUG] RESETCONF __OwningControllerProcess (runtime: 0.0127)
...

Are these commands sent to the same tor process as the test?

The next connections are part of the test:

01/21/2020 01:31:27 [NOTICE] Beginning test test.integ.process.TestProcess.test_take_ownership_via_controller
01/21/2020 01:31:28 [TRACE] Sent to tor: PROTOCOLINFO 1
...

The second connection gets the control port:

01/21/2020 01:31:28 [TRACE] Received from tor:
250-PROTOCOLINFO 1
250-AUTH METHODS=NULL
250-VERSION Tor="0.4.3.0-alpha-dev"
250 OK
01/21/2020 01:31:28 [TRACE] Sent to tor: AUTHENTICATE
01/21/2020 01:31:28 [TRACE] Received from tor: 250 OK
01/21/2020 01:31:28 [TRACE] Sent to tor: GETCONF ControlPort
01/21/2020 01:31:28 [TRACE] Received from tor: 250 ControlPort=14241

The third connection gets __OwningControllerProcess=20486, runs TAKEOWNERSHIP, and runs RESETCONF __OwningControllerProcess, and then is interrupted by the fourth connection:

01/21/2020 01:31:28 [TRACE] Sent to tor: PROTOCOLINFO 1
01/21/2020 01:31:28 [TRACE] Received from tor:
250-PROTOCOLINFO 1
250-AUTH METHODS=NULL
250-VERSION Tor="0.4.3.0-alpha-dev"
250 OK
01/21/2020 01:31:28 [TRACE] Received from tor: 250 OK
01/21/2020 01:31:28 [TRACE] Sent to tor: AUTHENTICATE
01/21/2020 01:31:28 [TRACE] Sent to tor: SETEVENTS SIGNAL CONF_CHANGED STATUS_SERVER
01/21/2020 01:31:28 [TRACE] Received from tor: 250 OK
01/21/2020 01:31:28 [TRACE] Sent to tor: GETCONF __owningcontrollerprocess
01/21/2020 01:31:28 [TRACE] Received from tor: 250 __OwningControllerProcess=20486
01/21/2020 01:31:28 [DEBUG] GETCONF __owningcontrollerprocess (runtime: 0.0007)
01/21/2020 01:31:28 [TRACE] Sent to tor: TAKEOWNERSHIP
01/21/2020 01:31:28 [TRACE] Received from tor: 250 OK
01/21/2020 01:31:28 [TRACE] Sent to tor: RESETCONF __OwningControllerProcess

The fourth connection starts:

01/21/2020 01:31:28 [TRACE] Sent to tor: PROTOCOLINFO 1
01/21/2020 01:31:28 [TRACE] Received from tor:
250-PROTOCOLINFO 1
250-AUTH METHODS=NULL
250-VERSION Tor="0.4.3.0-alpha-dev"
250 OK
01/21/2020 01:31:28 [TRACE] Sent to tor: AUTHENTICATE
01/21/2020 01:31:28 [TRACE] Received from tor: 250 OK

It's unclear which connection this is, but it's probably the fourth connection:

01/21/2020 01:31:28 [TRACE] Sent to tor: GETCONF ControlPort
01/21/2020 01:31:28 [TRACE] Received from tor: 250 OK

It's unclear which connection this is, but it's probably the third connection:

01/21/2020 01:31:28 [TRACE] Received from tor:
650-CONF_CHANGED
650-__OwningControllerProcess
650 OK

It's unclear which connection this is, but it's probably the fourth connection:

01/21/2020 01:31:28 [TRACE] Received from tor: 250 ControlPort=50000

It's unclear which connection this is, but it's probably the end of the third connection:

01/21/2020 01:31:28 [DEBUG] RESETCONF __OwningControllerProcess (runtime: 0.0146)

It appears that stem stopped writing commands to the fourth connection and never finished it. There's no final [DEBUG] log for the fourth connection, so I can't tell what it was trying to do.

And then stem makes a few system calls:

01/21/2020 01:31:28 [DEBUG] System call: /usr/bin/python3 setup.py install --prefix /tmp/stem_test (runtime: 1.56)
...

And 20 seconds later, the test finishes, due to a timeout failure:

01/21/2020 01:31:48 [NOTICE] Finished test test.integ.process.TestProcess.test_take_ownership_via_controller

Here's what I need to know to diagnose this issue:

  • are these connections, or batches of commands?
  • are they all sent to the same tor process?
  • which control commands are running on each connection and tor process?

Here's what stem might be doing wrong:

  • does the third connection actually close? (If it doesn't, then the TAKEOWNERSHIP from that connection won't make tor exit.)
  • why is there a fourth connection? (As long as the third connection closes, it shouldn't matter what it does, but it seems weird.)
  • does tor actually close, but stem is hung on the fourth connection, so it never notices?
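
That last hypothesis (tor exited but stem never noticed) can be checked independently of the control connection by probing the pid directly; a diagnostic sketch:

```python
import os

def pid_alive(pid):
    """Return True if a process with this pid still exists, using the
    POSIX signal-0 probe (no signal is actually delivered)."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # exists, but is owned by another user
    return True
```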

atagar (Contributor) commented Jan 23, 2020

The tests connect to tor 4 times (or send 4 batches of commands), and they do a lot more than you've described.

I suspect you're getting confused by multiple tests running in parallel. Above I advised reproducing this outside the tests - did you try that?

If not, here's how you do so...

import logging
import os
import tempfile
import time

import stem.control
import stem.process
import stem.util.log

DATA_DIR = '/tmp/demo_test'


def test_take_ownership_via_controller():
  with tempfile.TemporaryDirectory() as data_directory:
    control_port = '1244'

    tor_process = stem.process.launch_tor_with_config(
      tor_cmd = 'tor',
      config = {
        'SocksPort': '1245',
        'ControlPort': control_port,
        'DataDirectory': data_directory,  # use the temp dir created above
      },
      completion_percent = 0,
      take_ownership = True,
    )

    # We're the controlling process. Just need to connect then disconnect.

    controller = stem.control.Controller.from_port(port = int(control_port))
    controller.authenticate()
    controller.close()

    # give tor a few seconds to quit
    start_time = time.time()

    while time.time() - start_time < 20:
      if tor_process.poll() == 0:
        print('tor terminated after %0.2f seconds' % (time.time() - start_time))
        return  # tor exited

      time.sleep(0.01)

    raise AssertionError("tor didn't quit after the controller that owned it disconnected")


if __name__ == '__main__':
  if not os.path.exists(DATA_DIR):
    os.makedirs(DATA_DIR)

  # set up trace logging

  handler = logging.FileHandler(os.path.join(DATA_DIR, 'log'), mode = 'w')
  handler.setLevel(logging.DEBUG - 5)
  handler.setFormatter(stem.util.log.FORMATTER)
  stem.util.log.get_logger().addHandler(handler)

  test_take_ownership_via_controller()

When I run that I get...

atagar@morrigan:~/Desktop/stem$ python3.7 demo.py 
tor terminated after 0.01 seconds

atagar@morrigan:~/Desktop/stem$ cat /tmp/demo_test/log 
01/22/2020 18:21:10 [DEBUG] System call: tor --version (runtime: 0.01)
01/22/2020 18:21:10 [TRACE] Received from system (tor --version), stdout:
b'Tor version 0.4.3.0-alpha-dev (git-5e70c27e8560ac18).'
01/22/2020 18:21:10 [TRACE] Sent to tor: PROTOCOLINFO 1
01/22/2020 18:21:10 [TRACE] Received from tor:
250-PROTOCOLINFO 1
250-AUTH METHODS=NULL
250-VERSION Tor="0.4.3.0-alpha-dev"
250 OK
01/22/2020 18:21:10 [TRACE] Sent to tor: AUTHENTICATE
01/22/2020 18:21:10 [TRACE] Received from tor: 250 OK
01/22/2020 18:21:10 [TRACE] Sent to tor: SETEVENTS SIGNAL CONF_CHANGED STATUS_SERVER
01/22/2020 18:21:10 [TRACE] Received from tor: 250 OK
01/22/2020 18:21:10 [TRACE] Sent to tor: GETCONF __owningcontrollerprocess
01/22/2020 18:21:10 [TRACE] Received from tor: 250 __OwningControllerProcess=27969
01/22/2020 18:21:10 [DEBUG] GETCONF __owningcontrollerprocess (runtime: 0.0005)
01/22/2020 18:21:10 [TRACE] Sent to tor: TAKEOWNERSHIP
01/22/2020 18:21:10 [TRACE] Received from tor: 250 OK
01/22/2020 18:21:10 [TRACE] Sent to tor: RESETCONF __OwningControllerProcess
01/22/2020 18:21:10 [TRACE] Received from tor: 250 OK
01/22/2020 18:21:10 [TRACE] Received from tor:
650-CONF_CHANGED
650-__OwningControllerProcess
650 OK
01/22/2020 18:21:10 [DEBUG] RESETCONF __OwningControllerProcess (runtime: 0.0043)
01/22/2020 18:21:10 [INFO] Error while receiving a control message (SocketClosed): received exception "read of closed file"
01/22/2020 18:21:10 [INFO] Error while receiving a control message (SocketClosed): received exception "peek of closed file"
01/22/2020 18:21:10 [INFO] Error while receiving a control message (SocketClosed): received exception "peek of closed file"
01/22/2020 18:21:10 [INFO] Error while receiving a control message (SocketClosed): received exception "peek of closed file"

To prove the test worked I commented out the 'take_ownership' line and it failed after twenty seconds as expected with an assertion error.

Try that with your hardened tor instance by adding that argument to tor_cmd and see if that reproduces the problem. If so, this narrows the haystack. And if not then the next step is for us to puzzle out why.

teor2345 (Contributor, Author) commented Jan 23, 2020

Those logs come from this command-line:

python3 "$STEM_SOURCE_DIR"/run_tests.py --tor src/app/tor --integ --test process.test_take_ownership_via_controller --log TRACE --log-file stem.log
https://travis-ci.org/torproject/tor/jobs/639737323/config

As far as I can see, it only runs the problematic test:

Running tests...

process.test_take_ownership_via_controller... failed (20.74s)

test_take_ownership_via_controller [FAILURE]

======================================================================

FAIL: test_take_ownership_via_controller
https://travis-ci.org/torproject/tor/builds/639737322#L3650

I'll have time to do more on this issue next week.

atagar (Contributor) commented Jan 23, 2020

Yes, it's only running one test but it's still bootstrapping a copy of tor in anticipation of running all of the other tests. I can explain the log output if you'd like, but it would be simpler if we narrow the investigation to the script provided above.

teor2345 (Contributor, Author) commented Jan 23, 2020

Nick and Taylor have discovered that Tor is leaking memory due to a control refactor in master. The memory leaks result in a different exit status when tor is built with --enable-fragile-hardening:
https://trac.torproject.org/projects/tor/ticket/33039

Here are two stem changes that would help us diagnose issues like this faster in future:

At the moment, stem just hangs when poll() returns an unexpected exit status.
Can you please make stem raise an exception instead?
If stem raises an exception, we can discover these kinds of failures quickly in the test logs, rather than having to reproduce each error in an isolated test case.
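
The check described above could look something like this (a sketch with hypothetical names, not the eventual stem implementation):

```python
class TorExitedError(Exception):
    """Raised when the tor subprocess exits with an unexpected status."""

def check_alive(process):
    # poll() returns None while the process is still running
    status = process.poll()
    if status is not None and status != 0:
        raise TorExitedError('tor exited unexpectedly with status %d' % status)
```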

It would also be helpful if stem logged tor's stderr output to a file. Address sanitiser errors are always output on stderr. Tor crash backtraces are also usually output on stderr.
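
Capturing a child's stderr to a file is cheap with subprocess; a sketch of what such logging could look like (illustrative, not stem's actual launch code):

```python
import subprocess

def launch_with_stderr_log(cmd, stderr_path):
    """Spawn a process with stderr redirected to a file, so sanitizer
    reports and crash backtraces survive even if the process dies."""
    stderr_file = open(stderr_path, 'w')
    return subprocess.Popen(cmd, stderr=stderr_file)
```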

nmathewson (Contributor) commented Jan 23, 2020

Nick and Taylor have discovered that Tor is leaking memory due to a control refactor in master. The memory leaks result in a different exit status when tor is built with --enable-fragile-hardening:
https://trac.torproject.org/projects/tor/ticket/33039

Here are two stem changes that would help us diagnose issues like this faster in future:

At the moment, stem just hangs when poll() returns an unexpected exit status.
Can you please make stem raise an exception instead?
If stem raises an exception, we can discover these kinds of failures quickly in the test logs, rather than having to reproduce each error in an isolated test case.

I have a PR for this part, for your consideration: see #54.

atagar (Contributor) commented Jan 23, 2020

Thanks Nick! Great catch, merged your patch. As for stderr we do emit that when tor crashes...

https://trac.torproject.org/projects/tor/ticket/27717
https://gitweb.torproject.org/stem.git/commit/?id=0b734b6

However, this only concerns the main tor process we test against. Our integ/process.py tests spawn separate tor instances to exercise startup/exit behavior. Their asynchronicity also complicates this (these tests each run in subprocesses). Doable if necessary, but between the significant effort and tiny scope I'd advise against it for now.

atagar closed this Jan 23, 2020