Refactor test eth node infrastructure #3809

ulope · 2019-04-12T19:02:10Z

This should fix the intermittent node start errors we've been seeing on
circle ci.

Fixes: #3632, fixes #3661

This implements the following:

Fix get_free_port utility by 'abusing' TIME_WAIT mode (see docstring for details)
Refactor the eth node launcher infrastructure in the tests to use a
mirakuru subclass instead of the home-grown utilities used before.
Using mirakuru allows all node launching utilities to be invoked as
context managers
All test fixtures and the smoketest now use the context manager style
The smoketest gained a new --eth-client option which allows to run
the smoketests with parity

Drive-by fix:

Passing --debug to the smoketest skips stdout capturing, making
working with pdb etc. easier. (Fixes Smoketest should show stdout #3736)

codecov · 2019-04-12T20:58:47Z

Codecov Report

Merging #3809 into master will increase coverage by 0.27%.
The diff coverage is 55.55%.

@@            Coverage Diff             @@
##           master    #3809      +/-   ##
==========================================
+ Coverage   77.75%   78.03%   +0.27%     
==========================================
  Files         109      109              
  Lines       14865    14871       +6     
  Branches     2076     2082       +6     
==========================================
+ Hits        11559    11605      +46     
+ Misses       2598     2553      -45     
- Partials      708      713       +5

Impacted Files	Coverage Δ
raiden/settings.py	`100% <ø> (ø)`	⬆️
raiden/ui/cli.py	`37.7% <10.25%> (+7.02%)`	⬆️
raiden/constants.py	`100% <100%> (ø)`	⬆️
raiden/accounts.py	`48.17% <100%> (+0.6%)`	⬆️
raiden/network/rpc/smartcontract_proxy.py	`80% <66.66%> (ø)`	⬆️
raiden/network/utils.py	`63.04% <68.42%> (-0.85%)`	⬇️
raiden/utils/http.py	`71.55% <72.41%> (+1.91%)`	⬆️
raiden/network/rpc/client.py	`63.23% <75%> (ø)`	⬆️
raiden/utils/cli.py	`50.55% <88.88%> (+0.37%)`	⬆️
raiden/transfer/events.py	`75.28% <0%> (-1.16%)`	⬇️
... and 8 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 8101cae...bb737f6. Read the comment docs.

hackaugusto

I have some questions but the changes looks good

hackaugusto · 2019-04-15T08:24:21Z

.circleci/config.yml

-            - smoketest-udp-production-3.7
-            - smoketest-udp-development-3.7
+            - smoketest-udp-production-geth-3.7
+            - smoketest-udp-development-geth-3.7


I think we should also have the smoketest-matrix-development-parity-3.7 as a requirement for the jobs at the 4th step.

Hm not sure I understand what you mean.

I meant that all smoketests should be a requirement for the integration tests

hackaugusto · 2019-04-15T08:36:39Z

raiden/network/utils.py

-    This should not be used in misson-critical applications - a race condition may occur if
-    someone grabs the port before caller of this function has chance to use it.
+
+    Unless the `reliable` parameter is set to `True` (the default) this is prone to race


Can you provide a bit more of information about TIME_WAIT mode? e.g. a man page or something that explain the behavior on linux+mac.

I'm asking because to me it seems the race is still there, this is what I understand is happening:

pytest is iterating over port numbers

for each new port_candidate it's opening a socket, and setting SO_REUSEADDR

the above socket is set to listening, a connection is open to it, and then both sockets are closed.

pytest spawns geth and gives the port_candidate to geth

For the above to remove the race condition, the port number use open by pytest has to be available to geth but not to other processes in the system, which is not how I understand SO_REUSEADDR, but I don't know about the TIME_WAIT state. Perhaps it allows the port to be reused by processes in the same process group?

Also, doesn't it make sense to split get_free_port into two versions? One that uses port number 0 and another that uses the iterator? It doesn't seem to make sense to have a loop if the OS is giving us a random port.

(Helpful TCP state transition reference)

Ok so here's how I understand the problem and solution:

Binding to port 0 makes the OS chose a random 'free' port. From the OS's perspective 'free' means any port that is in CLOSED state.

In the previous implementation we simply bound to the port, grabbed the port number and closed it. That means it's immediately returned into the 'free' pool since its state will be CLOSED again.

Since we're running multiple things that require listening ports (and some require more than one port) it appears to me that sometimes the OS will re-use the same port number in successive iterations.

By forcing the port into TIME_WAIT state (by connecting to the socket and then closing the connection) we effectively take it out of the OS's 'free' pool for the duration of that state (usually 60s).

By setting SO_REUSEADDR (and relying on the fact that most other 'server' kind of applications do the same) we however allow that port to be reused by another application (i.e. geth, synapse, etc.) while it's still in the TIME_WAIT state. AFAIU SO_REUSEADDR is independent of process group.

Regarding your question about splitting it. It basically already is. The name of the function is probably bad. It should really be called get_free_port_generator since it always returns a generator that will either return ports in increasing order (if you pass an initial_port > 0) or just random ports indefinitely.

Okay, I see. The race across multiple processes is still there, IOW, this will not prevent another process from using the port. However tix fix the problem with subsequent calls with port 0 (which is really the only thing we care about, since we should not really have a rouge process trying to catch the same port as the test)

hackaugusto · 2019-04-15T08:42:22Z

raiden/network/utils.py

+                if reliable:
+                    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
+                try:
+                    sock.bind(('127.0.0.1', port_candidate))


I think we should leave the ip address as an argument, otherwise we cannot write tests for multiple interfaces. (Well, we can also add it back if we write tests like these again, @konradkonrad what do you think?)

The reason I removed it is that the bind-to-zero-TIME_WAIT trick only works for specific interfaces (i.e. not 0.0.0.0) since there you can't guarantee which ports are available on which interface. Also so far we didn't use anything else besides 127.0.0.1 in the test suite.

Of course we could re-add the bind address and enforce that the combination bind_address='0.0.0.0', reliable=True isn't used.

hackaugusto · 2019-04-15T08:43:27Z

raiden/tests/integration/cli/conftest.py

-    for geth_process in testchain['processes_list']:
-        geth_process.kill()
+def testchain_provider(request):
+    eth_client = EthClient(request.config.getoption('blockchain_type'))


It would be better to have def testchain_provider(blockchain_type): instead.

hackaugusto · 2019-04-15T08:44:00Z

raiden/tests/integration/cli/test_cli_production.py

@@ -36,7 +36,7 @@ def test_cli_full_init(cli_args, raiden_spawner):
        child.close()


-@pytest.mark.timeout(35)
+@pytest.mark.timeout(45)


Is this because of parity?

Seems so, yes :/
Apparently recent parity versions have a bug when shutting down on private chains (can't find the issue right now) doesn't work right, causing delays (until the node is killed by the executor).

Example log from one of the test runs with the parity shutdown error at the end: https://39133-50646767-gh.circle-artifacts.com/0/eth-logs/parity/test_filter_start_block_inclusive/53348e51.log

hackaugusto · 2019-04-15T08:49:28Z

raiden/tests/utils/eth_node.py

@@ -85,7 +92,7 @@ def geth_to_cmd(
        '--rpcapi', 'eth,net,web3,personal,txpool',
        '--rpcaddr', '0.0.0.0',
        '--networkid', str(chain_id),
-        '--verbosity', str(verbosity),
+        '--verbosity', str(_GETH_VERBOSITY_LEVEL[verbosity]),


Shouldn't we handle invalid values?

Yeah, we could. But OTOH a KeyError seemed good enough to me.

hackaugusto · 2019-04-15T08:53:04Z

raiden/tests/utils/eth_node.py

    chain_spec = PARITY_CHAIN_SPEC_STUB.copy()
+    chain_spec['params']['networkID'] = chain_id


Just a comment, networkID and chainID are not the same, however it should be fine for our tests to use either, since we don't do hard forks.

Right, good point.

hackaugusto · 2019-04-15T08:53:42Z

raiden/tests/utils/eth_node.py

+    log_file_name = os.path.join(path, 'account_import.log')
+    try:
+        with open(log_file_name, 'w') as log_file:
+            subprocess.check_call(


From the docs:

Code needing to capture stdout or stderr should use run() instead:

https://docs.python.org/3/library/subprocess.html#subprocess.check_call

Err, yeah good point. Or even use check_output... No idea what I was thinking here.

hackaugusto · 2019-04-15T09:01:10Z

raiden/tests/utils/eth_node.py

@@ -523,14 +510,13 @@ def run_private_blockchain(
        )

        if node.miner:
-            config['unlock'] = 0
+            config['unlock'] = to_checksum_address(config['address'])


Why do you want to unlock the miner account?

No idea why that's done. I just converted it to use the explicit address instead of the index in the keystore since geth was giving a deprecation warning that unlock by index is going away.

Hummm, maybe we should remove it, otherwise eth_sendTransaction will work and that is sort of unexpected.

I'll remove it and see if anything breaks

Doesn't work without unlock. Apparently geth requires the Clique miner account to be unlocked:

WARN [04-15|17:22:01.283] Block sealing failed err="authentication needed: password or unlock"

hackaugusto · 2019-04-15T09:01:57Z

raiden/tests/utils/eth_node.py

        )
-        parity_create_account(nodes_configuration[0], base_datadir, chainspec_path)
+        parity_create_account(nodes_configuration[0], base_datadir, genesis_path)


Why are we doing this here?

Yeah I was wondering that myself but didn't want to make the PR even bigger than it already is.
Ideally that should be moved to eth_nodes_to_cmds to be symmetrical to how we do it for geth.

LefterisJP · 2019-04-15T12:31:17Z

raiden/constants.py

@@ -44,8 +44,8 @@


 class EthClient(Enum):
-    GETH = 1


@ulope what's the benefit of this change? Edit: Looking further down I guess it's to allow for this: https://github.com/raiden-network/raiden/pull/3809/files#diff-75ca5ec60bf91194fbc444f1da91a0a6R27

Yes, and a few other places where the string value comes in handy.

LefterisJP · 2019-04-15T12:34:49Z

raiden/network/rpc/client.py

@@ -509,14 +509,14 @@ def __init__(
            # available nonce
            available_nonce = web3.eth.getTransactionCount(address_checksumed, 'pending')

-        elif eth_node == constants.EthClient.PARITY:
+        elif eth_node is constants.EthClient.PARITY:


Why replace == with is for the Enum values? Is this standard practise? You want to compare to the object rather than its value I guess?

It doesn't really make a difference but yes, identity is the recommended way to compare Enums: https://docs.python.org/3/library/enum.html#comparisons

ulope · 2019-04-15T18:35:48Z

@hackaugusto I've addressed the actionable items and re-applied your changes from a79c7f7 (which had conflicts).

If you have no objection I'll merge it as soon as the tests are green.

This should fix the intermittent node start errors we've been seeing on circle ci. Fixes: raiden-network#3632 This implements the following: - Fix get_free_port utility by 'abusing' TIME_WAIT mode (see docstring for details) - Refactor the eth node launcher infrastructure in the tests to use a mirakuru subclass instead of the home-grown utilities used before. - Using mirakuru allows all node launching utilities to be invoked as context managers - All test fixtures and the smoketest now use the context manager style - The smoketest gained a new `--eth-client` option which allows to run the smoketests with parity Drive-by fix: - Passing `--debug` to the smoketest skips stdout capturing, making working with pdb etc. easier. (Fixes raiden-network#3736)

hackaugusto · 2019-04-15T19:21:48Z

If you have no objection I'll merge it as soon as the tests are green.

Yep, go ahead :)

[skip tests]

ulope added the dev: Please Review label Apr 12, 2019

ulope force-pushed the fix/test_eth_node_start branch 6 times, most recently from 95656b5 to a5fb410 Compare April 12, 2019 20:42

hackaugusto approved these changes Apr 15, 2019

View reviewed changes

LefterisJP reviewed Apr 15, 2019

View reviewed changes

ulope force-pushed the fix/test_eth_node_start branch from a5fb410 to 7eb597d Compare April 15, 2019 18:33

ulope force-pushed the fix/test_eth_node_start branch from 7eb597d to bb737f6 Compare April 15, 2019 18:52

hackaugusto merged commit 6756e74 into raiden-network:master Apr 15, 2019

hackaugusto mentioned this pull request Apr 16, 2019

integration tests wait on all smoketests #3823

Merged

ulope added a commit to ulope/raiden that referenced this pull request Apr 16, 2019

Fix deployment error due to raiden-network#3809

bac8b59

ulope added a commit to ulope/raiden that referenced this pull request Apr 16, 2019

Fix deployment error due to raiden-network#3809

01d96af

[skip tests]

ulope added a commit that referenced this pull request Apr 16, 2019

Fix deployment error due to #3809

0e69ae8

[skip tests]

hackaugusto mentioned this pull request Sep 17, 2019

Geth is not properly closed after session closed #3101

Closed

		chain_spec = PARITY_CHAIN_SPEC_STUB.copy()
		chain_spec['params']['networkID'] = chain_id

Refactor test eth node infrastructure #3809

Refactor test eth node infrastructure #3809

Conversation

ulope commented Apr 12, 2019 • edited

codecov bot commented Apr 12, 2019 • edited

Codecov Report

hackaugusto left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ulope Apr 15, 2019 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ulope commented Apr 15, 2019 • edited

hackaugusto commented Apr 15, 2019

ulope commented Apr 12, 2019 •

edited

codecov bot commented Apr 12, 2019 •

edited

ulope Apr 15, 2019 •

edited

ulope commented Apr 15, 2019 •

edited