Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix set_total_deposit failure after channel is closed #4685

Merged
merged 6 commits into from Aug 30, 2019
Merged

Fix set_total_deposit failure after channel is closed #4685

merged 6 commits into from Aug 30, 2019

Conversation

rakanalh
Copy link
Contributor

@rakanalh rakanalh commented Aug 29, 2019

Fixes: #4639

Description

Looking at the stacktrace posted in the issue, it was obvious that the node was being restarted and the withdraw transaction was being sent on-chain during the initialization phase.

What happened?

  1. A withdraw was initialized
  2. Withdraw was signed by both participants and the transaction was queued.
  3. Node goes offline
  4. Partner closes channel
  5. Node goes back online and tries to execute the withdraw transaction on-chain.

The problems that caused this were:

  1. A race condition of trying to withdraw after having lost the race with the partner node which closed the channel was not handled. Raiden was raising an Unrecoverable error when the transaction failed because the channel state on-chain was not open
  2. When initializing the transaction queue, the withdraw transaction was not being "invalidated" by the channel closed state change.

Solution implemented:

  1. If channel state > OPEN then we lost the race with the other node that closed the channel... it should be a recoverable error. If channel state < OPEN, the channel did not exist... crash.
  2. Detect a channel closed state change and make sure this invalidates any withdraw transactions before they get sent on-chain.

PR review check list

Quality check list that cannot be automatically verified.

  • Safety
  • Code quality
    • Error conditions are handled
    • Exceptions are propagated to the correct parent greenlet
    • Exceptions are correctly classified as recoverable or unrecoverable
  • Compatibility
    • State changes are forward compatible
    • Transport messages are backwards and forward compatible
  • Commits
    • Have good messages
    • Squashed unecessary commits
  • If it's a bug fix:
    • Regression test for the bug
      • Properly covers the bug
      • If an integration test is used, it could not be written as a unit test
  • Is it a user facing feature/bug fix?
    • Changelog entry has been added

If the channel was beyond OPEN state (Closed, Settled, Removed), then
a withdraw operation is simply invalidated by a race condition.
If the channel was not even OPEN (Non existent) then an Unrecoverable
error is raised.
@rakanalh rakanalh changed the title Issue 4639 Fix set_total_deposit failure after channel is closed Aug 29, 2019
@Dominik1999 Dominik1999 requested a review from karlb August 29, 2019 15:28
@codecov
Copy link

codecov bot commented Aug 29, 2019

Codecov Report

Merging #4685 into develop will decrease coverage by 0.01%.
The diff coverage is 69.23%.

Impacted file tree graph

@@             Coverage Diff             @@
##           develop    #4685      +/-   ##
===========================================
- Coverage     80.9%   80.89%   -0.02%     
===========================================
  Files          118      118              
  Lines        14222    14234      +12     
  Branches      2193     2195       +2     
===========================================
+ Hits         11506    11514       +8     
  Misses        2074     2074              
- Partials       642      646       +4
Impacted Files Coverage Δ
raiden/waiting.py 81.54% <ø> (ø) ⬆️
raiden/network/proxies/token_network.py 51.45% <0%> (-0.07%) ⬇️
raiden/transfer/node.py 94.24% <100%> (+0.04%) ⬆️
raiden/transfer/events.py 97.24% <100%> (+0.11%) ⬆️
raiden/storage/serialization/types.py 59.09% <0%> (-9.1%) ⬇️
raiden/raiden_event_handler.py 91.5% <0%> (-0.66%) ⬇️
raiden/network/transport/matrix/transport.py 81.28% <0%> (+0.29%) ⬆️
raiden/network/transport/matrix/client.py 69.8% <0%> (+1.29%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7804bd3...8ad601a. Read the comment docs.

Copy link
Contributor

@karlb karlb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the detailed PR description!

If an integration test is used, it could not be written as a unit test

The code changes could have been tested in a unit test, but the complete case from the bug report is covered more thoroughly by this integration test. So I'm not sure how to rate this item.

Changelog entry has been added

Does this count as user facing? I lean towards saying "yes".

@@ -1402,11 +1402,18 @@ def _set_total_withdraw(
block_identifier=failed_at_blockhash,
)

if detail.state != ChannelState.OPENED:
if detail.state > ChannelState.OPENED:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does the comparison work?

https://docs.python.org/3/library/enum.html#comparisons says:

Ordered comparisons between enumeration values are not supported. Enum members are not integers (but see IntEnum below)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ChannelState is an IntEnum so comparisons work.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was never big fan of comparing int values of enums as if you are not careful and add some new value later on in the wrong place it can cause really hard to find bugs. That is just my experience.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh, I was looking at https://github.com/raiden-network/raiden/blob/develop/raiden/transfer/state.py#L70, which is a plain Enum. Is there a reason for having both or is it just for historic reasons?

raiden/tests/integration/test_recovery.py Outdated Show resolved Hide resolved
raiden/tests/integration/test_recovery.py Outdated Show resolved Hide resolved
@rakanalh
Copy link
Contributor Author

If an integration test is used, it could not be written as a unit test

Anything proxy-related or raiden-service related isn't directly unit testable because of the whole bunch of dependencies that are required by those components. The option is to either mock them all (which is bad) or just write an integration test.

Does this count as user facing? I lean towards saying "yes".

We need a better definition here IMO... i would lean towards no because the failure and the fix is a result of an automated process (recovery) after restart which the user did not trigger.

@LefterisJP
Copy link
Contributor

We need a better definition here IMO... i would lean towards no because the failure and the fix is a result of an automated process (recovery) after restart which the user did not trigger.

We generally consider all bugs as changelog-worthy since users can inadvertedly hit them.

@rakanalh rakanalh merged commit 6b9fd0f into raiden-network:develop Aug 30, 2019
@rakanalh rakanalh deleted the issue-4639 branch August 30, 2019 10:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

cannot call setTotalWithdraw on a channel that is not open. causes crash on start
4 participants