Fix last duty per epoch validation #328
Do not check limit if not the first message of the duty
Co-authored-by: diego <diego@sigmaprime.io>
diegomrsantos
left a comment
LGTM, great job finding this issue. There are limits for the number of messages in a round and the number of rounds in a duty. Therefore, we are assuming this is safe.
cc @nkryuchkov
@dknopik @diegomrsantos Great find! We haven't merged ssvlabs/ssv#2190 yet, so we haven't verified that it is free of issues like the one you found. However, we use a mutex lock per peer ID + message ID, so it shouldn't happen. Have you considered adding a similar lock? Without it, hard-to-find issues might arise. E.g., it's not guaranteed that if message A is received before message B, all checks on A will happen before all checks on B. As a result, A might fail some checks although it's correct, and B might pass some checks although it's malformed.
Hey @nkryuchkov, the issue is not about a data race. In fact, we do have synchronization via the use of

We have not checked if your PR is affected as well, but I assume so, as currently our code is closely modeled after yours.

To further illustrate the issue, let me show the steps resulting in the issue. None of the steps here "overlap", i.e. locking behaviour does not matter.

Let's have a committee with a lot of validators, resulting in a committee duty happening in every slot of an epoch. At the start of the committee duty in the last slot of an epoch, the leader sends a PROPOSE and a PREPARE message. Another operator receives these messages. We are in the last slot of an epoch, so the
This PR effectively changes the validation check to
Issue Addressed
err=ExcessiveDutyCount { got: 32, limit: 32 } #324
Proposed Changes
Current behaviour: We count the number of duties per epoch per operator. If the operator sends another message while we are at the limit, the message is rejected. However, this behaviour is not quite correct, as we need to allow messages belonging to the duty that pushed the operator to the limit. This PR changes the check to only occur on the first message of a new duty. All further messages of that duty will not increase the duty count, so it does not make sense to check the count then.
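The intended behaviour can be illustrated with a minimal sketch. This is not the code from this PR: `DutyTracker`, `check_and_record`, the type aliases, and the use of a slot set to identify duties are all hypothetical names and simplifications chosen for illustration. Only the `ExcessiveDutyCount { got, limit }` error shape is taken from the linked issue.

```rust
use std::collections::{HashMap, HashSet};

// Hypothetical identifiers; the real types in the codebase may differ.
type OperatorId = u64;
type Epoch = u64;
type Slot = u64;

#[derive(Debug, PartialEq)]
enum ValidationError {
    ExcessiveDutyCount { got: usize, limit: usize },
}

/// Tracks, per operator and epoch, the slots whose duty has already been counted.
/// Assumption for this sketch: one duty per slot, so a slot identifies a duty.
#[derive(Default)]
struct DutyTracker {
    seen: HashMap<(OperatorId, Epoch), HashSet<Slot>>,
}

impl DutyTracker {
    /// Applies the duty limit only to the first message of a new duty.
    fn check_and_record(
        &mut self,
        operator: OperatorId,
        epoch: Epoch,
        slot: Slot,
        limit: usize,
    ) -> Result<(), ValidationError> {
        let slots = self.seen.entry((operator, epoch)).or_default();

        // Follow-up message of a duty that was already counted:
        // it does not increase the count, so the limit is not checked.
        if slots.contains(&slot) {
            return Ok(());
        }

        // First message of a new duty: reject it if the operator is at the limit.
        if slots.len() >= limit {
            return Err(ValidationError::ExcessiveDutyCount {
                got: slots.len(),
                limit,
            });
        }

        slots.insert(slot);
        Ok(())
    }
}
```

In this sketch, the duty that brings an operator to the limit can still receive its remaining messages, while a first message for any additional duty in the same epoch is rejected.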
Additional Info
While the issue started occurring after #311, it was not the root cause. The bug fixed by #311 merely masked this issue.