
Remove dependency on non-protocol-required confirmation messages (Delivered, Processed) #3254

andrevmatos opened this issue Jan 10, 2019 · 29 comments


andrevmatos commented Jan 10, 2019

Abstract

The Raiden protocol requires strong guarantees about delivery and acceptance (correct processing) of most of its messages, due to strict requirements on state consistency between peers. Initially, because of limitations of the UDP transport, Raiden added confirmation/control messages which are not part of the protocol's inner workings but exist only to control retrying of a message until it is received/accepted or some terminating event happens (e.g. the channel is closed, so there is no need to keep retrying).
Currently, the sender keeps retrying a message until it receives the respective Delivered or Processed message, indicating it was correctly received or accepted by the partner (see #3087). This adds load and latency to the transport which isn't actually required for proper working of the protocol, and also requires additional processing due to signing and validation of every message, especially problematic if one has interactive signing (possibly the case for a future light client implementation, #114) or limited hardware (#3199, #2385).

Motivation (iteration 1)

In talks with @hackaugusto, it became clearer that Delivered and Processed were split from an Ack message in the UDP days. Ack mostly had the function of the current Processed message: confirming correct processing and stopping the retry of a given message, currently sent in reply to BalanceProof-containing messages. Given that, we may remove Delivered messages completely, as most of their function is already provided by Processed messages.

Specification

Currently, Delivered messages are used to confirm three messages: SecretRequest, SecretReveal and Processed. This is how we can remove the need for them:

  • SecretRequest: simplest case, confirmed by the respective SecretReveal.
  • SecretReveal: although it doesn't contain a balance proof, its content needs to be validated against the state (e.g. there's a pending lock for the given secret), so a Processed message should confirm it instead of Delivered (which means little in this context anyway), effectively replacing Delivered with Processed for SecretReveal.
  • Processed: becomes a one-shot message, as Delivered currently is, meaning it isn't retried but sent only once as a response to some other message. One-shot messages don't need confirmation (therefore no Delivered is required), but the receiver also needs to handle incoming duplicate/retried messages (see the sketch after this list):
    • For already-accepted BalanceProof-containing messages/state changes, the message isn't going to be processed again, so if its nonce is smaller than the current one, a new one-shot Processed message is generated and sent anyway.
    • For a SecretReveal message, as proposed above, if there's a pending or unlocked lock with the given secrethash, a Processed should be sent to confirm it.
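A minimal, self-contained sketch of the receiver-side duplicate handling described above (all names are illustrative, not the actual Raiden handlers or data structures):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class BalanceProofMessage:
    message_identifier: int
    nonce: int

@dataclass
class PartnerEndState:
    nonce: int = 0  # nonce of the latest accepted message from the partner

def handle_balance_proof_message(
    state: PartnerEndState, message: BalanceProofMessage
) -> Optional[int]:
    """Return the message_identifier a one-shot Processed should confirm, or None."""
    if message.nonce <= state.nonce:
        # Duplicate/retried message that was already accepted: don't process it
        # again, but answer with a one-shot Processed so the sender stops retrying.
        return message.message_identifier
    if message.nonce == state.nonce + 1:
        # Next expected message: accept it and confirm it with a Processed.
        state.nonce = message.nonce
        return message.message_identifier
    # Out-of-order (a nonce gap): ignore it; the sender's retries will fill the gap.
    return None
```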

This proposal also supersedes #2383 and #2505. It may be related to #490 (receiver-side by-nonce queuing and caching of responses).

Backwards Compatibility

Delivered messages could still be sent for backwards compatibility, and some handling of them could still be done for the SecretReveal confirmation coming from old clients. Nonetheless, this could be a good opportunity to improve the protocol by breaking compatibility, possibly also doing #3283 before the next release.

Timeboxing

Investigation and splitting into more detailed issues, or implementation, is timeboxed to 2 PD.

andrevmatos commented Jan 11, 2019

On the last bullet point, about RefundTransfer: maybe we can rely on the SecretReveal => Secret/Unlock from a mediator/initiator to the next mediator on the failed routes with refunded transfers to stop retrying, removing the need for Processed for it too (the same logic as for the normal reveal/unlock would then apply). To be investigated.

hackaugusto commented Jan 14, 2019

We have discussed this multiple times in the past; in my opinion it is a bad idea. My comments on the proposal:

Regarding the abstract

This adds load and latency to the transport which isn't actually required for proper working of the protocol

I believe the claim about latency here is wrong (I agree with the load though).

From the perspective of a single transfer: There is no waiting involved with either Delivered or Processed messages. These messages do not contain a balance proof and therefore don't need to be ordered, meaning they can be sent as soon as it is appropriate ( == no waiting/latency). Additionally, these messages are sent backwards with respect to the protocol messages, to a different recipient, meaning they are sent in parallel to the protocol messages.

From the perspective of a single node: In this case, I actually think your proposal will add latency instead of removing it. Assume that you have a channel A-B, and that node A is trying to send two transfers through this channel at the same time; now assume the first transfer has a longer path than the second. Assume that messages are sent one at a time; then the second transfer will have to wait for the longer path to complete its protocol before proceeding with the second, increasing the latency. This point generalizes and holds even if messages are batched; the difference is that instead of talking about a single message we would have to talk about a group of messages.

and also requires additional processing due to signing and validation of every message, especially problematic if one has interactive signing

This is true for the current implementation; however, signing is being used to authenticate the messages to prevent an attack. The important thing here is that we can use different mechanisms to do the authentication, e.g. end-to-end encryption; with E2E, user intervention isn't necessary as long as the channel is not compromised (e.g. the symmetric key is safe).

Regarding the motivation

These confirmation messages, so far, are used only to tell the sender it can stop retrying a message.

I agree completely with this statement. Yes, these two messages have a unique purpose. Delivered messages inform the sender that the recipient has received and safely stored the message. Processed messages mean that the message was not only stored, but also that the contents of the message which had to be validated against the blockchain were validated and accepted as valid. I not only think that having a single meaning per message is fine, I also think it's something we should really strive for.

Regarding the specification

LockedTransfer initiator => mediator: either a SecretRequest will be received from the target, or a RefundTransfer will be received from the current mediator.
LockedTransfer mediator => mediator|target: either a SecretReveal or RefundTransfer will be received from next hop/mediator or target

You are assuming that either a secret request or a refund transfer will always arrive; what if neither arrives? This could happen for a variety of reasons:

  • There is no capacity from the mediator to the initiator, so the refund transfer cannot be sent and the lock expires.
  • A node downstream, mediator or target, received the locked transfer inside the danger zone and halted the protocol to wait for a remove expired lock. (This can happen because of 1. a poorly chosen initial lock expiration, 2. a node going offline for a long period, or 3. high latency in a node's transport, be it Matrix or something else.)

SecretRequest target => initiator: a SecretReveal should be received from initiator

This is a good example of why having multiple responsibilities for a single message can be bad: what if the secret is learned by the mediator through the blockchain? Because the node learned it from the blockchain, it knows that everyone else in the path learned it from the blockchain too, including the previous mediator and the initiator, and in the current version of the code base it does not send a secret reveal backwards; that is only done when the secret is learned off-chain.

The proposed change would require us to review all messages and code paths to ensure that they are sent for the unhappy case too; otherwise the node will be unable to clear the message from the queue and make progress.

SecretReveal target|mediator => prev mediator: a Secret/Unlock (see #3132 , #3145) response should be received

This can actually deadlock if messages are sent one at a time and there is a refund transfer. Consider the following path and message order in the queues:

A-B-C-B-D
B->C: SecretReveal, Unlock
C->B: SecretReveal, Unlock

Batching could fix the above example; however, we would have to introduce a minimum batch size requirement for the transport layer and assume this size ensures progress when the queue is bigger than it.

@andrevmatos commented

Thank you, @hackaugusto, for your thoughts. But I think some misunderstandings happened, and I would like to get your position on the clarifications below:

meaning they can be sent as soon as it is appropriate ( == no waiting/latency)

That's true for UDP, but on Matrix you need to consider that each event incurs ramp-up/down times for /sync requests on the receiver and on the sender (the sender receives its own messages too, even if they're ignored), which, for a relatively slow method like HTTP long polling, can add up to quite a large protocol overhead.

Assume that messages are sent one at a time; then the second transfer will have to wait for the longer path to complete its protocol before proceeding with the second, increasing the latency.

Why? My idea is that the queues won't be halted by a message that isn't yet confirmed; instead, all messages should be sent as soon as they are queued (and/or batched), and the protocol will always move forward regardless of the confirmation method. This proposal is about changing the mechanics of when we stop retrying a message being sent, not about requiring that to happen at all for the protocol (including different transfers going through the same channel) to go on.

The important thing here is that we can use different mechanisms to do the authentication, e.g. end-to-end encryption

Agreed, this specific point can also be addressed by transport features. Nonetheless, IMO the other points still hold, and we can shrink the protocol and the needed roundtrips through this proposal.

Yes, these two messages have a unique purpose.

Yes, I know they mean different things, but my proposal is that we can drop this distinction, and implement the actual effect of these messages (in this case, stopping the retry) implicitly through the actual protocol messages, instead of specific ones.

You are assuming that either a secret request or a refund transfer will always arrive; what if neither arrives?

This is built in. I considered that any of these messages might never arrive, in which case (per this proposal) they should be retried until the message is no longer valid (in this case, until the lock expires). If the situation changes, the message can eventually be accepted and the transfer go through. If not, and the transfer never finishes, they should continue to be retried until the lock expires or the channel is closed, making them invalid and stopping the retry.

This is a good example of why having multiple responsibilities for a single message can be bad: what if the secret is learned by the mediator through the blockchain?

This only affects the condition under which the message (in this case, SecretReveal) gets cleared (retry stopped). If the secret is revealed on-chain, we can simply stop retrying the SecretReveal messages in any of the pending mediators, as the receiver will (presumably) have also detected the on-chain reveal and won't send the Secret/Unlock; so this is the corner-case condition to stop retrying the message.

Batching could fix the above example; however, we would have to introduce a minimum batch size requirement for the transport layer and assume this size ensures progress when the queue is bigger than it.

That's not needed. As I said, the messages can be batched, but batching is just an optimization; they're always sent anyway (preferably in order, but not necessarily, which is why we rely on the retry mechanism). They won't deadlock, as all messages will always be sent and should eventually be accepted, or else keep being retried until some higher-tier condition is invalidated (e.g. the channel is closed).

This proposal makes the delivery/acceptance of messages asynchronous with respect to the protocol. Once the state moves on the sender (and the respective message is queued for the partner), the sender can move on, making new state changes and queueing other messages on top of it. Eventually, they should be accepted through the retry mechanism, and once the side effect of that is detected, we can give up retrying the message. Until then, everything is asynchronous, and we can move forward in the protocol.
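To make the "retry until a side effect invalidates the message" idea above concrete, here is a minimal, self-contained sketch (the names and stop conditions are illustrative, not Raiden's actual queue implementation): every queued message carries a predicate telling the transport when it may stop retrying, and no message blocks the ones behind it.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class QueuedMessage:
    payload: str
    # Predicate over the current state: True once the message's side effect was
    # observed (or the message became invalid), so retrying may stop.
    can_stop_retrying: Callable[[Dict[str, bool]], bool]

def retry_round(
    state: Dict[str, bool],
    queue: List[QueuedMessage],
    send: Callable[[str], None],
) -> None:
    """Send every still-pending message once; drop those whose stop condition holds."""
    still_pending = []
    for message in queue:
        if message.can_stop_retrying(state):
            continue  # cleared implicitly by a protocol side effect
        send(message.payload)  # retried independently: no head-of-line blocking
        still_pending.append(message)
    queue[:] = still_pending

# Example: a SecretReveal keeps being retried until an Unlock was received or
# the secret got registered on-chain (both are just flags in this toy state).
state = {"unlock_received": False, "secret_onchain": False}
queue = [QueuedMessage("SecretReveal", lambda s: s["unlock_received"] or s["secret_onchain"])]
retry_round(state, queue, send=print)  # still pending, sent again
state["unlock_received"] = True
retry_round(state, queue, send=print)  # cleared, nothing sent
```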

@hackaugusto commented

That's true for UDP, but on Matrix you need to consider that each event incurs ramp-up/down times for /sync requests on the receiver and on the sender (the sender receives its own messages too, even if they're ignored), which, for a relatively slow method like HTTP long polling, can add up to quite a large protocol overhead.

This is not specific to the Delivered and Processed messages; how does it imply that these messages incur latency?

Why? My idea is that the queues won't be halted by a message that isn't yet confirmed; instead, all messages should be sent as soon as they are queued (and/or batched), and the protocol will always move forward regardless of the confirmation method.

The above assumes that only the first n messages of the queue are sent.

[...], but my proposal is that we can drop this distinction, and implement the actual effect of these messages (in this case, stopping the retry) implicitly through the actual protocol messages, instead of specific ones

That was the point: implicit meaning is not a good thing.

This is built in. I considered that any of these messages might never arrive, in which case (per this proposal) they should be retried until the message is no longer valid (in this case, until the lock expires). If the situation changes, the message can eventually be accepted and the transfer go through. If not, and the transfer never finishes, they should continue to be retried until the lock expires or the channel is closed, making them invalid and stopping the retry.

This is not a sufficient condition: what if the recipient is offline while the lock expires? The locked transfer retries stop and the remove expired lock is added to the queue, but the remove expired lock will never be processed without the locked transfer.

This only affects the condition under which the message (in this case, SecretReveal) gets cleared (retry stopped). If the secret is revealed on-chain, we can simply stop retrying the SecretReveal messages in any of the pending mediators, as the receiver will (presumably) have also detected the on-chain reveal and won't send the Secret/Unlock; so this is the corner-case condition to stop retrying the message.

I know that we can try to figure out what all the conditions for retries are; that is not the point. The point is that this is extra work, more error prone, and difficult.
It may be true that Matrix has some persistence, but IMO we should not design the protocol around a property of this specific transport, so the above holds.

That's not needed. As I said, the messages can be batched, but batching is just an optimization; they're always sent anyway (preferably in order, but not necessarily, which is why we rely on the retry mechanism). They won't deadlock, as all messages will always be sent and should eventually be accepted, or else keep being retried until some higher-tier condition is invalidated (e.g. the channel is closed).

Again, I'm assuming only the first n messages are sent. Your batch is the whole queue.

This proposal makes the delivery/acceptance of messages asynchronous with respect to the protocol. Once the state moves on the sender (and the respective message is queued for the partner), the sender can move on, making new state changes and queueing other messages on top of it. Eventually, they should be accepted through the retry mechanism, and once the side effect of that is detected, we can give up retrying the message. Until then, everything is asynchronous, and we can move forward in the protocol.

So, from what I can see the proposal is to:

  1. reduce number of signatures
  2. reduce latency
  3. reduce load

For 1, I think E2E is a better solution. For 2, I'm not convinced it will have a positive impact. For 3, it's probably negligible: Delivered and Processed messages with a signature are around 77 bytes; without it they are 12 bytes.

andrevmatos commented Jan 14, 2019

This is not specific to the Delivered and Processed messages; how does it imply that these messages incur latency?

They incur latency because they're unneeded; any unneeded message incurs latency, given Matrix internals.

The above assumes that only the first n messages of the queue are sent.

All messages are always retried until we're certain they can be cleared. If an out-of-order message is rejected, the previous message will be retried and eventually accepted. The "queue" won't be halted by unconfirmed messages.

That was the point: implicit meaning is not a good thing.

IMHO it is, if it's inherent to the protocol. E.g. LockedTransfer includes more than one piece of information; we don't send one message for each, because they always go together. This is semantics, and it is inherent to the protocol if we say so.

This is not a sufficient condition: what if the recipient is offline while the lock expires? The locked transfer retries stop and the remove expired lock is added to the queue, but the remove expired lock will never be processed without the locked transfer.

You're right; then keep sending them until the Processed for the LockExpired comes, because it will be accepted only after the expired LockedTransfer is. There's always a condition by which the core can tell the transport it may stop retrying some message, without requiring specific messages for it.

I know that we can try to figure out what all the conditions for retries are; that is not the point. The point is that this is extra work, more error prone, and difficult.

I think it's not more difficult than having specific messages with specific state changes and conditions handled to do the same, which we need to keep in sync and ensure do what's expected.

It may be true that Matrix has some persistence, but IMO we should not design the protocol around a property of this specific transport, so the above holds.

This whole proposal has nothing to do with Matrix's persistence, nor does it require it; that was dropped in favor of a transport-agnostic retry/ordering mechanism.

Again, I'm assuming only the first n messages are sent. Your batch is the whole queue.

Yes, it is, but it doesn't need to be; messages are retried independently of each other and will eventually be accepted in order.

So, for 1, E2E may tie us to the transport; I'd rather not. Latency and load are not the main reasons for this proposal, but a side effect of the actual target: a reduced number of messages required for the core protocol. IMHO latency and load are just a couple of the ways a protocol with just the bare minimum size is better than a fatter one.

ulope commented Jan 16, 2019

In Monday's transport team meeting we decided on a deadline for discussion of this issue of no later than 2019-01-25.

@rakanalh commented

My takeaway from this discussion:

The overall question is whether we should remove the dependency on receiving Processed when a node sends a balance-proof message, because we can depend on other messages being sent as a side effect which eventually tell us that the node received and actually processed/accepted the previous message.
This is somewhat related to our pending_transactions implementation, where receiving a specific blockchain event can invalidate/remove a pending transaction because it is no longer required.

On the one hand, if I take the example:

LockedTransfer mediator => mediator|target: either a SecretReveal or RefundTransfer will be received from next hop/mediator or target

What could happen is that either the SecretReveal or the RefundTransfer message is received, thereby removing the original LockedTransfer message from the queue, OR the node received the locked transfer and went offline or ran out of capacity, in which case sending the locked transfer again does no harm, as the message will be rejected for having the same nonce (a duplicate message the receiving node already processed).

On the other hand, the same example of:

LockedTransfer mediator => mediator|target: either a SecretReveal or RefundTransfer will be received from next hop/mediator or target

Could raise different questions, such as: what if the secret was registered on-chain? Since the transport is agnostic about what's happening on the blockchain, this could result in the LockedTransfer being retried multiple times even though the secret became known. The way to stop this message from being retried is either to have the transport also check blockchain events, or to have the state machine inform the transport about it, which is a bad design (the current architecture implements the complete opposite).

One more example can be given with a topology of a channel A-B, where A sends a locked transfer to B, and B processes that message and goes offline. Later on, A expires the lock; the lock expiry will never be sent because the locked transfer was not removed from the queue, since A never received a message from B saying that this message was processed. Therefore, the protocol is halted.

My opinion: I agree with Augusto on the point:

I not only think that having a single meaning per message is fine, I also think it's something we should really strive for.

IMO the Processed message IS part of the protocol, and its meaning is: the node has received and accepted the previous message.
Having (for example) the SecretRequest message say "I accepted the previous message and now I am reacting to it by requesting the secret" makes our protocol messages have multiple meanings.

To be discussed and decided on.

andrevmatos commented Jan 22, 2019

@rakanalh good points. Yes, your understanding is mostly right, only some clarifications:

The way to stop this message from being retried is either to have the transport also check blockchain events, or to have the state machine inform the transport about it, which is a bad design (the current architecture implements the complete opposite)

The second case is actually the current implementation, and the intended one: the core/state machine knows when a message was received/processed (through handling of the confirmation messages) and pushes to the transport the information that it can give up retrying that message. The only observation is that it does so implicitly, by removing the message from the queue, which the transport monitors to know when to give up (#3252 contains a proposal to make that explicit as well).

The case you pointed out is an edge case of the one raised by Augusto, in which the expected side effects of the LockedTransfer don't happen, but the LockExpired, which would be the next message to stop the retry of the LockedTransfer, isn't sent either (because the secret was revealed, so the lock didn't actually expire).
I agree this is an issue, and I haven't had time yet to think of another side effect that could be used to confirm the LockedTransfer in this specific case. Worst case, we could always keep specific or generic (Processed) confirmation messages for these very specific cases, and that would IMHO already improve the protocol by reducing the required messages.

Later on, A expires the lock; the lock expiry will never be sent because the locked transfer was not removed from the queue

Not true: in the current implementation and also in this proposal, messages continue to be retried until they're cleared. LockExpired doesn't depend on LockedTransfer being cleared from the queue in order to be sent. All messages are retried, and ordered both on the sender and (as per this proposal) on the receiver as well. LockedTransfer will eventually be accepted, and LockExpired as well, and Processed will confirm both were accepted and clear them on the sender. A retrying message doesn't lock the queue; messages are always retried in order and eventually accepted and cleared.

andrevmatos commented Jan 22, 2019

IMO the Processed message IS part of the protocol, and its meaning is: the node has received and accepted the previous message.

You're right, currently they're part of the protocol. This proposal argues that they aren't strictly needed for it, that they are somewhat superfluous and could (mostly) be taken out of it (at least as a hard requirement of the protocol; they could be kept for compatibility for some time). Again, the protocol is what we say it is.

The granularity of messages is subjective; what we consider different meanings for a given message depends a lot on perspective. If we look closely enough, most messages have multiple meanings, and if we zoom out, we could merge some of them to reduce the protocol size.

@andrevmatos commented

If this "side effect instead of Processed" isn't accepted, a compromise would be to use this approach at least for getting rid of Delivered messages, which already compose 50% of sent messages: Delivered is completely superfluous for all messages that require Processed, and the cases where it's used for confirmation are simpler than the cases for LockedTransfer raised by Augusto and Rakan.

@andrevmatos commented

I'm working on a small PoC for the Delivered part I want to address within the limited scope of this issue, to guide me in commenting on and better describing the next iteration of this proposal. I will edit the PR as soon as I get a better overview of it, hopefully this week.

@andrevmatos commented

At the workshop, the following approach was decided on for now:

  • Remove the requirement for a signature in the Delivered message; nodes can keep signing it for now, for backwards compatibility, but the sender will be set by the Matrix transport from displayname-based sender recovery instead of from the internal signature, so Delivered messages without a signature (like the ones we're going to send from light clients) are still accepted protocol-wise. Delivered messages are used only for control and have minimal safety requirements; transport-side sender recovery is enough (see the sketch after this list).
  • Remove the requirement for a signature in the SecretReveal message; the validity of the secret is enough confirmation for it, and it doesn't make sense to also require a signature. It can still be signed by full nodes (as currently) for backwards compatibility as well.
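A rough sketch of what transport-side sender recovery could look like, assuming the Matrix displayname carries a hex-encoded signature over the user's Matrix user_id (the helper name and the exact message encoding here are illustrative assumptions, not the exact Raiden implementation):

```python
from eth_account import Account
from eth_account.messages import encode_defunct

def recover_sender_from_displayname(user_id: str, displayname: str) -> str:
    """Recover the Ethereum address of a Matrix user from its displayname.

    Assumes the displayname is a hex-encoded signature over the user_id
    (illustrative sketch only).
    """
    hex_signature = displayname[2:] if displayname.startswith("0x") else displayname
    signature = bytes.fromhex(hex_signature)
    message = encode_defunct(text=user_id)
    return Account.recover_message(message, signature=signature)
```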

konradkonrad added a commit to konradkonrad/raiden that referenced this issue Feb 19, 2019
Note: this is just to make it work with the current code base, it should
however not be a retried message. See also
raiden-network#3254 (comment)
@andrevmatos commented

The issue was edited to include the current proposal, after talks with @hackaugusto. This shouldn't take too long, but could break compatibility (it can be kept with a little more work, though we could skip keeping it for now and just ask clients to upgrade).

hackaugusto commented Feb 19, 2019

Just to be clear, my opinion is written below and it includes only the things written below, nothing else:

  • Sending a delivered in response to a processed is a bug
  • Sending a delivered and a processed for the same message is a bug

And the bugs above should be fixed. We have a few ways to fix this:

  • Upgrade the software and just fix the above issues. This, however, requires a client upgrade, because the current version expects a Delivered in response to a Processed; otherwise the Processed message is not cleared from the queue, making it stuck.
  • Add a protocol version upgrade, and fix the above in the new version; this would maintain backwards compatibility.
  • Once there is a new deployment of the smart contracts, do the fixes, because the networks will be disjoint anyway.

Edit: Important note: the Processed message is added to the global queue, meaning it does not hold off balance proof messages.

andrevmatos commented Mar 19, 2020

This was done in the Light Client, in a backwards-compatible manner.
Here's how we got it working:

  • LockedTransfer, Unlock, LockExpired, RefundTransfer: BP-messages are retried until receiving the respective Processed
    • Processed for a received BP-message is cached, and sent one-shot whenever the same message is received again
  • SecretRequest: retried until the secret is known (from an off-chain or on-chain reveal) or it expires
  • SecretReveal to target: sent once for each valid received SecretRequest
  • SecretReveal to previous hop: sent when the secret is known, retried until receiving the Unlock or the secret gets registered on-chain
  • WithdrawRequest: retried until we receive the WithdrawConfirmation or the request expires
  • WithdrawConfirmation: cached and sent once per valid (non-expired) request
  • WithdrawExpiration: we don't care about it; nodes can expire requests on their own, but we may send Delivered to confirm it for backwards compatibility

Delivered messages are superfluous in this setup; they are still sent for Processed, SecretRequest and SecretReveal iff the peer doesn't advertise the noDelivery capability (for backwards compatibility), but they can be safely removed.
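A small sketch of how that backwards-compatibility decision could be gated on the peer's advertised capabilities (the class and helper names are illustrative; only the noDelivery name comes from the comment above):

```python
from dataclasses import dataclass, field
from typing import Set

@dataclass
class PeerCapabilities:
    """Capabilities advertised by a peer (illustrative representation)."""
    flags: Set[str] = field(default_factory=set)

    def supports(self, capability: str) -> bool:
        return capability in self.flags

def should_send_delivered(peer: PeerCapabilities) -> bool:
    # Only send the backwards-compatibility Delivered if the peer did NOT
    # advertise that it can live without it.
    return not peer.supports("noDelivery")

# Usage: a legacy peer without the capability still gets Delivered.
assert should_send_delivered(PeerCapabilities())
assert not should_send_delivered(PeerCapabilities({"noDelivery"}))
```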

fredo commented Mar 19, 2020

#5581 partially resolves the issue. What it does can be read there.

  • get Fix Delivered and Processed message handling (#5581) merged
  • implement the confirmation for partner's old messages (messages with a nonce <= the latest processed nonce)
  • discuss Andre's solution for backward compatibility
  • decide whether we want to break backward compatibility (potentially the last chance before the RC)

Breaking backward compatibility would make the codebase simpler. After the RC and release we will have to stick with the protocol.

@Dominik1999 commented

Have a call with all stakeholders to decide the best approach.
@hackaugusto @ulope @andrevmatos @fredo @Dominik1999: have a call and decide whether this is a blocker for our release and how to go on.

@andrevmatos commented

In #3283 (comment) I describe how we implemented Capabilities. If you decide to implement them, you can worry less about breaking compatibility, as protocol differences can be negotiated through them, as long as the node can fall back to the previous behavior.
If you don't, as I said, our behavior is still backwards compatible with your current one, so we will negotiate optimizations with nodes that support them, and fall back if a node doesn't.

@hackaugusto commented

Results from the call:

  • Just removing the Delivered message for the current sprint is sufficient. The light client does not rely on Delivered, so the removal of the message will not break compatibility among the clients. (Note: the current capability in the light client would have to change its name and semantics, since the Python client will not have Delivered messages anymore, but it will still wait for Processed messages. @andrevmatos)
  • Raiden should send a Processed in response to received messages if the nonce of the message is lower than or equal to the partner's nonce in the local state (this is necessary for synchronization in case Processed messages are dropped).
  • The withdraw messages are a bit "special", since they already implement the approach of not always sending a Processed back. (Note for the implementer: it looks like inplace_delete_message is missing the cleanup for SendWithdrawRequest when a ReceiveWithdrawExpired is received; this needs a test and a fix.)

andrevmatos commented Mar 19, 2020

@hackaugusto We do send Processed and rely on it, for EnvelopeMessages, so the behavior change you describe should be fully compatible with our current approach and expectations.
The current capability on the LightClient is just a way to tell partners what is already true: that we don't need Delivered messages (the behavior you described), and therefore they don't need to be sent to us, saving the wasted roundtrips. Thanks

@Dominik1999 commented

So #5581 will be part of Alderaan, whereas the further optimizations described in this issue are not.

@andrevmatos commented

OK, we're revisiting what was left of this on the LC for raiden-network/light-client#1514 and raiden-network/light-client#249.
From what we see, WithdrawConfirmation confirms WithdrawRequest, so neither needs Delivered/Processed, but WithdrawExpired has no confirmation and to this day still requires Processed to confirm it.
Our suggestion is to get rid of WithdrawExpired as per #4721, and this should conclude this issue. We're still implementing replying with the required Processed for now, but we won't require it to expire a past WithdrawRequest on our side, therefore making it a non-breaking change/protocol optimization, and possibly introducing a capability to tell other [light] clients they don't need to bother sending WithdrawExpired when withdrawing from compatible partners.

ulope commented Sep 2, 2020

Basic capability publishing support has been implemented via #6482.
Now this needs to be done so we can take advantage of it.

konradkonrad commented Sep 7, 2020

General approach

  • Remove Delivered
  • All protocol messages that don't elicit an immediate protocol response message require a Processed to clear the re-sending. Example: LockedTransfer does not have an immediate response, because the answer (RevealSecret / LockExpired) depends on the outcome of the protocol (success/failure) along the complete path of the payment.

ToDo

  • compare the approach in PR#5581 to what we want to do now
  • currently message clearing is triggered by state changes - I wonder if that should change to a "ClearMessage" event, because the clearing may now depend on a state transition (see the sketch after this list)?
  • consider moving inplace_delete_message to the transport -- the current implementation of the _RetryQueue is rather opaque when it comes to clearing items from the queue.
  • Check: "Note for the implementer: it looks like inplace_delete_message is missing the cleanup for SendWithdrawRequest when a ReceiveWithdrawExpired is received; this needs a test and a fix" (context).
  • Remove the Delivered message; it should not be needed anymore
  • Add edge case tests for Withdraw
  • Change the advertised default capabilities (depends on Discuss format of capabilities #6503 + resolution)
  • Decide on fallback behavior for legacy peers, i.e. how to treat peers that don't advertise noDelivery. If we implement legacy behavior, add an integration test.
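A rough sketch of what the hypothetical ClearMessage event mentioned in the second item could look like (all names are illustrative; this is not existing Raiden code):

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass(frozen=True)
class ClearMessage:
    """Hypothetical event emitted by the state machine when a queued message
    no longer needs to be retried."""
    recipient: str
    queue_identifier: str
    message_identifier: int

def handle_clear_message(
    retry_queues: Dict[Tuple[str, str], List[dict]], event: ClearMessage
) -> None:
    # The transport drops the matching message from its retry queue explicitly,
    # instead of inferring the removal from diffs of the state's queues.
    queue = retry_queues.get((event.recipient, event.queue_identifier), [])
    queue[:] = [m for m in queue if m["message_identifier"] != event.message_identifier]
```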

konradkonrad commented Sep 8, 2020

  • when re-sending Processed, we have to memorize/recall past message ids
  • option: use a (modified) cachetools TTLCache for non-expired locks, or their messages. Scenario: the partner node crashes before receiving the terminating Processed message and retries. Since the max number of in-flight transfers is bounded and we have a timeout, this is safe (see the sketch after this list)
  • compare with the LC implementation
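A minimal sketch of the TTL-cache idea from the first two bullets, assuming cachetools is available (the cache size, TTL and handler shape are illustrative):

```python
from typing import Callable, Dict, Tuple
from cachetools import TTLCache

# Keyed by (partner_address, message_identifier); values are the Processed
# replies that were already sent, kept long enough to outlive retries.
sent_processed: TTLCache = TTLCache(maxsize=1024, ttl=2 * 60 * 60)

def on_message_received(
    partner: str, message_identifier: int, process: Callable[[], Dict]
) -> Dict:
    key = (partner, message_identifier)
    if key in sent_processed:
        # Duplicate/retried message: resend the cached one-shot Processed.
        return sent_processed[key]
    reply = process()  # normal handling produces the Processed reply
    sent_processed[key] = reply
    return reply
```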
