New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor test eth node infrastructure #3809
Refactor test eth node infrastructure #3809
Conversation
95656b5
to
a5fb410
Compare
Codecov Report
@@ Coverage Diff @@
## master #3809 +/- ##
==========================================
+ Coverage 77.75% 78.03% +0.27%
==========================================
Files 109 109
Lines 14865 14871 +6
Branches 2076 2082 +6
==========================================
+ Hits 11559 11605 +46
+ Misses 2598 2553 -45
- Partials 708 713 +5
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have some questions but the changes looks good
- smoketest-udp-production-3.7 | ||
- smoketest-udp-development-3.7 | ||
- smoketest-udp-production-geth-3.7 | ||
- smoketest-udp-development-geth-3.7 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should also have the smoketest-matrix-development-parity-3.7
as a requirement for the jobs at the 4th step.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm not sure I understand what you mean.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I meant that all smoketests should be a requirement for the integration tests
This should not be used in misson-critical applications - a race condition may occur if | ||
someone grabs the port before caller of this function has chance to use it. | ||
|
||
Unless the `reliable` parameter is set to `True` (the default) this is prone to race |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you provide a bit more of information about TIME_WAIT
mode? e.g. a man page or something that explain the behavior on linux+mac.
I'm asking because to me it seems the race is still there, this is what I understand is happening:
- pytest is iterating over port numbers
- for each new
port_candidate
it's opening a socket, and settingSO_REUSEADDR
- the above socket is set to listening, a connection is open to it, and then both sockets are closed.
- pytest spawns geth and gives the
port_candidate
to geth
For the above to remove the race condition, the port number use open by pytest has to be available to geth but not to other processes in the system, which is not how I understand SO_REUSEADDR
, but I don't know about the TIME_WAIT
state. Perhaps it allows the port to be reused by processes in the same process group?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, doesn't it make sense to split get_free_port
into two versions? One that uses port number 0 and another that uses the iterator? It doesn't seem to make sense to have a loop if the OS is giving us a random port.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Helpful TCP state transition reference)
Ok so here's how I understand the problem and solution:
Binding to port 0 makes the OS chose a random 'free' port. From the OS's perspective 'free' means any port that is in CLOSED
state.
In the previous implementation we simply bound to the port, grabbed the port number and closed it. That means it's immediately returned into the 'free' pool since its state will be CLOSED
again.
Since we're running multiple things that require listening ports (and some require more than one port) it appears to me that sometimes the OS will re-use the same port number in successive iterations.
By forcing the port into TIME_WAIT
state (by connecting to the socket and then closing the connection) we effectively take it out of the OS's 'free' pool for the duration of that state (usually 60s).
By setting SO_REUSEADDR
(and relying on the fact that most other 'server' kind of applications do the same) we however allow that port to be reused by another application (i.e. geth, synapse, etc.) while it's still in the TIME_WAIT
state. AFAIU SO_REUSEADDR
is independent of process group.
Regarding your question about splitting it. It basically already is. The name of the function is probably bad. It should really be called get_free_port_generator
since it always returns a generator that will either return ports in increasing order (if you pass an initial_port > 0
) or just random ports indefinitely.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, I see. The race across multiple processes is still there, IOW, this will not prevent another process from using the port. However tix fix the problem with subsequent calls with port 0
(which is really the only thing we care about, since we should not really have a rouge process trying to catch the same port as the test)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Exactly
if reliable: | ||
sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1) | ||
try: | ||
sock.bind(('127.0.0.1', port_candidate)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should leave the ip address as an argument, otherwise we cannot write tests for multiple interfaces. (Well, we can also add it back if we write tests like these again, @konradkonrad what do you think?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason I removed it is that the bind-to-zero-TIME_WAIT trick only works for specific interfaces (i.e. not 0.0.0.0
) since there you can't guarantee which ports are available on which interface. Also so far we didn't use anything else besides 127.0.0.1
in the test suite.
Of course we could re-add the bind address and enforce that the combination bind_address='0.0.0.0', reliable=True
isn't used.
for geth_process in testchain['processes_list']: | ||
geth_process.kill() | ||
def testchain_provider(request): | ||
eth_client = EthClient(request.config.getoption('blockchain_type')) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be better to have def testchain_provider(blockchain_type):
instead.
@@ -36,7 +36,7 @@ def test_cli_full_init(cli_args, raiden_spawner): | |||
child.close() | |||
|
|||
|
|||
@pytest.mark.timeout(35) | |||
@pytest.mark.timeout(45) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this because of parity
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems so, yes :/
Apparently recent parity versions have a bug when shutting down on private chains (can't find the issue right now) doesn't work right, causing delays (until the node is killed by the executor).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Example log from one of the test runs with the parity shutdown error at the end: https://39133-50646767-gh.circle-artifacts.com/0/eth-logs/parity/test_filter_start_block_inclusive/53348e51.log
@@ -85,7 +92,7 @@ def geth_to_cmd( | |||
'--rpcapi', 'eth,net,web3,personal,txpool', | |||
'--rpcaddr', '0.0.0.0', | |||
'--networkid', str(chain_id), | |||
'--verbosity', str(verbosity), | |||
'--verbosity', str(_GETH_VERBOSITY_LEVEL[verbosity]), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't we handle invalid values?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, we could. But OTOH a KeyError seemed good enough to me.
chain_spec = PARITY_CHAIN_SPEC_STUB.copy() | ||
chain_spec['params']['networkID'] = chain_id |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a comment, networkID
and chainID
are not the same, however it should be fine for our tests to use either, since we don't do hard forks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, good point.
raiden/tests/utils/eth_node.py
Outdated
log_file_name = os.path.join(path, 'account_import.log') | ||
try: | ||
with open(log_file_name, 'w') as log_file: | ||
subprocess.check_call( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From the docs:
Code needing to capture stdout or stderr should use run() instead:
https://docs.python.org/3/library/subprocess.html#subprocess.check_call
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Err, yeah good point. Or even use check_output
... No idea what I was thinking here.
@@ -523,14 +510,13 @@ def run_private_blockchain( | |||
) | |||
|
|||
if node.miner: | |||
config['unlock'] = 0 | |||
config['unlock'] = to_checksum_address(config['address']) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do you want to unlock the miner account?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No idea why that's done. I just converted it to use the explicit address instead of the index in the keystore since geth was giving a deprecation warning that unlock by index is going away.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hummm, maybe we should remove it, otherwise eth_sendTransaction
will work and that is sort of unexpected.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll remove it and see if anything breaks
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't work without unlock
. Apparently geth requires the Clique miner account to be unlocked:
WARN [04-15|17:22:01.283] Block sealing failed err="authentication needed: password or unlock"
) | ||
parity_create_account(nodes_configuration[0], base_datadir, chainspec_path) | ||
parity_create_account(nodes_configuration[0], base_datadir, genesis_path) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are we doing this here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I was wondering that myself but didn't want to make the PR even bigger than it already is.
Ideally that should be moved to eth_nodes_to_cmds
to be symmetrical to how we do it for geth.
@@ -44,8 +44,8 @@ | |||
|
|||
|
|||
class EthClient(Enum): | |||
GETH = 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ulope what's the benefit of this change? Edit: Looking further down I guess it's to allow for this: https://github.com/raiden-network/raiden/pull/3809/files#diff-75ca5ec60bf91194fbc444f1da91a0a6R27
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, and a few other places where the string value comes in handy.
@@ -509,14 +509,14 @@ def __init__( | |||
# available nonce | |||
available_nonce = web3.eth.getTransactionCount(address_checksumed, 'pending') | |||
|
|||
elif eth_node == constants.EthClient.PARITY: | |||
elif eth_node is constants.EthClient.PARITY: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why replace ==
with is
for the Enum values? Is this standard practise? You want to compare to the object rather than its value I guess?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It doesn't really make a difference but yes, identity is the recommended way to compare Enums: https://docs.python.org/3/library/enum.html#comparisons
a5fb410
to
7eb597d
Compare
@hackaugusto I've addressed the actionable items and re-applied your changes from a79c7f7 (which had conflicts). If you have no objection I'll merge it as soon as the tests are green. |
This should fix the intermittent node start errors we've been seeing on circle ci. Fixes: raiden-network#3632 This implements the following: - Fix get_free_port utility by 'abusing' TIME_WAIT mode (see docstring for details) - Refactor the eth node launcher infrastructure in the tests to use a mirakuru subclass instead of the home-grown utilities used before. - Using mirakuru allows all node launching utilities to be invoked as context managers - All test fixtures and the smoketest now use the context manager style - The smoketest gained a new `--eth-client` option which allows to run the smoketests with parity Drive-by fix: - Passing `--debug` to the smoketest skips stdout capturing, making working with pdb etc. easier. (Fixes raiden-network#3736)
7eb597d
to
bb737f6
Compare
Yep, go ahead :) |
This should fix the intermittent node start errors we've been seeing on
circle ci.
Fixes: #3632, fixes #3661
This implements the following:
mirakuru subclass instead of the home-grown utilities used before.
context managers
--eth-client
option which allows to runthe smoketests with parity
Drive-by fix:
--debug
to the smoketest skips stdout capturing, makingworking with pdb etc. easier. (Fixes Smoketest should show stdout #3736)