Skip to content

Fix last duty per epoch validation#328

Merged
dknopik merged 3 commits into
sigp:unstablefrom
dknopik:fix-validation-duty-count
May 19, 2025
Merged

Fix last duty per epoch validation#328
dknopik merged 3 commits into
sigp:unstablefrom
dknopik:fix-validation-duty-count

Conversation

@dknopik
Copy link
Copy Markdown
Member

@dknopik dknopik commented May 16, 2025

Issue Addressed

Proposed Changes

Current behaviour: We count the number of duties per epoch per operator. If the operator sends another message while we are at the limit, the message is rejected. However, this behaviour is not quite correct, as we need to allow messages belonging to the duty that pushed the operator to the limit. This PR changes the check to only occur on the first message of a new duty. All further messages of that duty will not increase the duty count, so it does not make sense to check the count then.

Additional Info

While the issue started occurring after #311, it was not the root cause. The bug fixed by #311 merely masked this issue.

Do not check limit if not the first message of the duty

Co-authored-by: diego <diego@sigmaprime.io>
@dknopik dknopik requested a review from diegomrsantos May 16, 2025 14:31
@dknopik dknopik added bug Something isn't working validation labels May 16, 2025
diegomrsantos
diegomrsantos previously approved these changes May 16, 2025
Copy link
Copy Markdown
Member

@diegomrsantos diegomrsantos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, great job finding this issue. There are limits for the number of messages in a round and the number of rounds in a duty. Therefore, we are assuming this is safe.

@diegomrsantos diegomrsantos requested a review from Copilot May 16, 2025 15:38

This comment was marked as resolved.

@diegomrsantos
Copy link
Copy Markdown
Member

cc @nkryuchkov

@nkryuchkov
Copy link
Copy Markdown
Contributor

@dknopik @diegomrsantos Great find!

We haven't merged ssvlabs/ssv#2190 yet, so we didn't make sure it doesn't cause issues like you found, but we use a mutex lock per peer ID + message ID, so it shouldn't happen.

Have you considered adding a similar lock? Without it, there might be complex issues to find.

E.g., it's not guaranteed that if message A is received before message B, then all checks on A will happen before all checks on B. So as a result, A might fail some checks, although it's correct, and B might pass some checks, although it's malformed

@dknopik
Copy link
Copy Markdown
Member Author

dknopik commented May 19, 2025

Hey @nkryuchkov,

the issue is not about a data race. In fact, we do have synchronization via the use of DashMap, which only allows one writer per entry.

We have not checked if your PR is affected as well, but I assume so, as currently our code is closely modeled after yours.

To further illustrate the issue, let me show the steps resulting in the issue. None of the steps here "overlap", i.e. locking behaviour does not matter.

Let's have a committee with a lot of validators, resulting in a committee duty happening in every slot of an epoch. At the start of the committee duty in the last slot of an epoch, the leader sends a PROPOSE and a PREPARE message. Another operator receives these messages. We are in the last slot of an epoch, so the duty_count is 31 before handling these messages. For simplicity, we only look at this validation step.

  1. PROPOSE arrives.
  2. We check if duty_count < 32. It is, so validation passes!
  3. We increment the duty_count as this message refers to a duty we have not seen before. It is now 32.
  4. We do the rest of the message processing.
  5. PREPARE arrives.
  6. We check if duty_count < 32. It is not, so validation fails. But it is not supposed to fail, as this message still belongs to the validly started duty.

This PR effectively changes the validation check to known_duty || duty_count < 32. Then, the steps look like this:

  1. PROPOSE arrives.
  2. We check if known_duty. It is not, so we check duty_count < 32. It is, so validation passes!
  3. We increment the duty_count as this message refers to a duty we have not seen before. It is now 32.
  4. We do the rest of the message processing.
  5. PREPARE arrives.
  6. We check if known_duty. It is, so validation passes!
  7. We do not increment the duty_count, as this message does not start a new duty.
  8. We do the rest of the message processing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working validation

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants