Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor test eth node infrastructure #3809

Merged
merged 1 commit into from Apr 15, 2019

Conversation

ulope
Copy link
Collaborator

@ulope ulope commented Apr 12, 2019

This should fix the intermittent node start errors we've been seeing on
circle ci.

Fixes: #3632, fixes #3661

This implements the following:

  • Fix get_free_port utility by 'abusing' TIME_WAIT mode (see docstring for details)
  • Refactor the eth node launcher infrastructure in the tests to use a
    mirakuru subclass instead of the home-grown utilities used before.
  • Using mirakuru allows all node launching utilities to be invoked as
    context managers
  • All test fixtures and the smoketest now use the context manager style
  • The smoketest gained a new --eth-client option which allows to run
    the smoketests with parity

Drive-by fix:

@ulope ulope force-pushed the fix/test_eth_node_start branch 6 times, most recently from 95656b5 to a5fb410 Compare April 12, 2019 20:42
@codecov
Copy link

codecov bot commented Apr 12, 2019

Codecov Report

Merging #3809 into master will increase coverage by 0.27%.
The diff coverage is 55.55%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #3809      +/-   ##
==========================================
+ Coverage   77.75%   78.03%   +0.27%     
==========================================
  Files         109      109              
  Lines       14865    14871       +6     
  Branches     2076     2082       +6     
==========================================
+ Hits        11559    11605      +46     
+ Misses       2598     2553      -45     
- Partials      708      713       +5
Impacted Files Coverage Δ
raiden/settings.py 100% <ø> (ø) ⬆️
raiden/ui/cli.py 37.7% <10.25%> (+7.02%) ⬆️
raiden/constants.py 100% <100%> (ø) ⬆️
raiden/accounts.py 48.17% <100%> (+0.6%) ⬆️
raiden/network/rpc/smartcontract_proxy.py 80% <66.66%> (ø) ⬆️
raiden/network/utils.py 63.04% <68.42%> (-0.85%) ⬇️
raiden/utils/http.py 71.55% <72.41%> (+1.91%) ⬆️
raiden/network/rpc/client.py 63.23% <75%> (ø) ⬆️
raiden/utils/cli.py 50.55% <88.88%> (+0.37%) ⬆️
raiden/transfer/events.py 75.28% <0%> (-1.16%) ⬇️
... and 8 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 8101cae...bb737f6. Read the comment docs.

Copy link
Contributor

@hackaugusto hackaugusto left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have some questions but the changes looks good

- smoketest-udp-production-3.7
- smoketest-udp-development-3.7
- smoketest-udp-production-geth-3.7
- smoketest-udp-development-geth-3.7
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should also have the smoketest-matrix-development-parity-3.7 as a requirement for the jobs at the 4th step.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm not sure I understand what you mean.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant that all smoketests should be a requirement for the integration tests

This should not be used in misson-critical applications - a race condition may occur if
someone grabs the port before caller of this function has chance to use it.

Unless the `reliable` parameter is set to `True` (the default) this is prone to race
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you provide a bit more of information about TIME_WAIT mode? e.g. a man page or something that explain the behavior on linux+mac.

I'm asking because to me it seems the race is still there, this is what I understand is happening:

  • pytest is iterating over port numbers
  • for each new port_candidate it's opening a socket, and setting SO_REUSEADDR
  • the above socket is set to listening, a connection is open to it, and then both sockets are closed.
  • pytest spawns geth and gives the port_candidate to geth

For the above to remove the race condition, the port number use open by pytest has to be available to geth but not to other processes in the system, which is not how I understand SO_REUSEADDR, but I don't know about the TIME_WAIT state. Perhaps it allows the port to be reused by processes in the same process group?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, doesn't it make sense to split get_free_port into two versions? One that uses port number 0 and another that uses the iterator? It doesn't seem to make sense to have a loop if the OS is giving us a random port.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Helpful TCP state transition reference)

Ok so here's how I understand the problem and solution:

Binding to port 0 makes the OS chose a random 'free' port. From the OS's perspective 'free' means any port that is in CLOSED state.

In the previous implementation we simply bound to the port, grabbed the port number and closed it. That means it's immediately returned into the 'free' pool since its state will be CLOSED again.

Since we're running multiple things that require listening ports (and some require more than one port) it appears to me that sometimes the OS will re-use the same port number in successive iterations.

By forcing the port into TIME_WAIT state (by connecting to the socket and then closing the connection) we effectively take it out of the OS's 'free' pool for the duration of that state (usually 60s).

By setting SO_REUSEADDR (and relying on the fact that most other 'server' kind of applications do the same) we however allow that port to be reused by another application (i.e. geth, synapse, etc.) while it's still in the TIME_WAIT state. AFAIU SO_REUSEADDR is independent of process group.

Regarding your question about splitting it. It basically already is. The name of the function is probably bad. It should really be called get_free_port_generator since it always returns a generator that will either return ports in increasing order (if you pass an initial_port > 0) or just random ports indefinitely.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I see. The race across multiple processes is still there, IOW, this will not prevent another process from using the port. However tix fix the problem with subsequent calls with port 0 (which is really the only thing we care about, since we should not really have a rouge process trying to catch the same port as the test)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exactly

if reliable:
sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
try:
sock.bind(('127.0.0.1', port_candidate))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should leave the ip address as an argument, otherwise we cannot write tests for multiple interfaces. (Well, we can also add it back if we write tests like these again, @konradkonrad what do you think?)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason I removed it is that the bind-to-zero-TIME_WAIT trick only works for specific interfaces (i.e. not 0.0.0.0) since there you can't guarantee which ports are available on which interface. Also so far we didn't use anything else besides 127.0.0.1 in the test suite.

Of course we could re-add the bind address and enforce that the combination bind_address='0.0.0.0', reliable=True isn't used.

for geth_process in testchain['processes_list']:
geth_process.kill()
def testchain_provider(request):
eth_client = EthClient(request.config.getoption('blockchain_type'))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be better to have def testchain_provider(blockchain_type): instead.

@@ -36,7 +36,7 @@ def test_cli_full_init(cli_args, raiden_spawner):
child.close()


@pytest.mark.timeout(35)
@pytest.mark.timeout(45)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this because of parity?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems so, yes :/
Apparently recent parity versions have a bug when shutting down on private chains (can't find the issue right now) doesn't work right, causing delays (until the node is killed by the executor).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Example log from one of the test runs with the parity shutdown error at the end: https://39133-50646767-gh.circle-artifacts.com/0/eth-logs/parity/test_filter_start_block_inclusive/53348e51.log

@@ -85,7 +92,7 @@ def geth_to_cmd(
'--rpcapi', 'eth,net,web3,personal,txpool',
'--rpcaddr', '0.0.0.0',
'--networkid', str(chain_id),
'--verbosity', str(verbosity),
'--verbosity', str(_GETH_VERBOSITY_LEVEL[verbosity]),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't we handle invalid values?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, we could. But OTOH a KeyError seemed good enough to me.

chain_spec = PARITY_CHAIN_SPEC_STUB.copy()
chain_spec['params']['networkID'] = chain_id
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a comment, networkID and chainID are not the same, however it should be fine for our tests to use either, since we don't do hard forks.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, good point.

log_file_name = os.path.join(path, 'account_import.log')
try:
with open(log_file_name, 'w') as log_file:
subprocess.check_call(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From the docs:

Code needing to capture stdout or stderr should use run() instead:

https://docs.python.org/3/library/subprocess.html#subprocess.check_call

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Err, yeah good point. Or even use check_output... No idea what I was thinking here.

@@ -523,14 +510,13 @@ def run_private_blockchain(
)

if node.miner:
config['unlock'] = 0
config['unlock'] = to_checksum_address(config['address'])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you want to unlock the miner account?

Copy link
Collaborator Author

@ulope ulope Apr 15, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No idea why that's done. I just converted it to use the explicit address instead of the index in the keystore since geth was giving a deprecation warning that unlock by index is going away.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hummm, maybe we should remove it, otherwise eth_sendTransaction will work and that is sort of unexpected.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll remove it and see if anything breaks

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't work without unlock. Apparently geth requires the Clique miner account to be unlocked:

WARN [04-15|17:22:01.283] Block sealing failed                     err="authentication needed: password or unlock"

)
parity_create_account(nodes_configuration[0], base_datadir, chainspec_path)
parity_create_account(nodes_configuration[0], base_datadir, genesis_path)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we doing this here?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I was wondering that myself but didn't want to make the PR even bigger than it already is.
Ideally that should be moved to eth_nodes_to_cmds to be symmetrical to how we do it for geth.

@@ -44,8 +44,8 @@


class EthClient(Enum):
GETH = 1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ulope what's the benefit of this change? Edit: Looking further down I guess it's to allow for this: https://github.com/raiden-network/raiden/pull/3809/files#diff-75ca5ec60bf91194fbc444f1da91a0a6R27

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, and a few other places where the string value comes in handy.

@@ -509,14 +509,14 @@ def __init__(
# available nonce
available_nonce = web3.eth.getTransactionCount(address_checksumed, 'pending')

elif eth_node == constants.EthClient.PARITY:
elif eth_node is constants.EthClient.PARITY:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why replace == with is for the Enum values? Is this standard practise? You want to compare to the object rather than its value I guess?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't really make a difference but yes, identity is the recommended way to compare Enums: https://docs.python.org/3/library/enum.html#comparisons

@ulope
Copy link
Collaborator Author

ulope commented Apr 15, 2019

@hackaugusto I've addressed the actionable items and re-applied your changes from a79c7f7 (which had conflicts).

If you have no objection I'll merge it as soon as the tests are green.

This should fix the intermittent node start errors we've been seeing on
circle ci.

Fixes: raiden-network#3632

This implements the following:
- Fix get_free_port utility by 'abusing' TIME_WAIT mode (see docstring for details)
- Refactor the eth node launcher infrastructure in the tests to use a
  mirakuru subclass instead of the home-grown utilities used before.
- Using mirakuru allows all node launching utilities to be invoked as
  context managers
- All test fixtures and the smoketest now use the context manager style
- The smoketest gained a new `--eth-client` option which allows to run
  the smoketests with parity

Drive-by fix:
- Passing `--debug` to the smoketest skips stdout capturing, making
  working with pdb etc. easier. (Fixes raiden-network#3736)
@hackaugusto
Copy link
Contributor

If you have no objection I'll merge it as soon as the tests are green.

Yep, go ahead :)

@hackaugusto hackaugusto merged commit 6756e74 into raiden-network:master Apr 15, 2019
ulope added a commit to ulope/raiden that referenced this pull request Apr 16, 2019
ulope added a commit to ulope/raiden that referenced this pull request Apr 16, 2019
ulope added a commit that referenced this pull request Apr 16, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants