
Resolve race conditions in shard ownership #3

Closed
colmsnowplow opened this issue Sep 28, 2021 · 2 comments


@colmsnowplow

Because a record's 'staleness' interval is tied to shardCheckFrequency, setting this frequency to a low value means we can capture a shard and have another client claim ownership of it before it's released.

We can resolve this race condition by decoupling shardCheckFrequency from maxAgeForClientRecord, since it is entirely reasonable to take longer than 2.5 seconds to process a record when we poll for records every half-second.

We should also investigate whether maxAgeForLeaderRecord has a similar issue.

@colmsnowplow
Author

colmsnowplow commented Oct 27, 2021

Summary after investigating and finding resolution options:

clientRecordMaxAge is the maximum time allowed for 'healthy' processing of a record. This setting determines the point at which a checkpointer may forcibly claim ownership of a shard, even though another checkpointer already owns it.

By default it is set to shardCheckFrequency*5. At low shardCheckFrequencies (e.g. 500ms), if the app is running slowly for some reason, this can cause a race condition whereby the owner of the shard is processing a record normally, but another client claims ownership anyway.

In refreshShards(), clientRecordMaxAge also determines which shards have changed (via getClients()). This can cause refreshShards() to detect shard ownership changes when none have actually occurred. Further, when this happens, the kinsumer Run() function stops registering the client to DDB while it reboots shard consumers.

That also causes issues with leadership: if the leader is rebooting while another client runs refreshShards(), there's a good chance the second client incorrectly claims leadership.

At low shardCheckFrequencies, rebooting takes longer than clientRecordMaxAge, so a chain of incorrect shard-change detections cascades across the other clients too.

Solutions:

  • 1 Allow configuration of clientRecordMaxAge
    A simple mitigation is to make clientRecordMaxAge configurable. This doesn't solve the issue, but it lets the user set a value that reduces the chances of hitting it. There is no reason to tie shardCheckFrequency to how long we expect the app to take to process a given event, especially since this fork allows us to decouple checkpointing from the Next() function. Depending on what happens between reading and checkpointing, we may want very low-latency reads while expecting the processing of the data to take longer, so it is sensible to offer this option.

  • 2 Don't stop registering to the clients table while restarting consumers
    A client is still alive when it has detected a change in shard ownership, so instead of pausing and restarting the shardChangeTicker while we stop and start consumers, we should leave it running but change its behaviour to only register with the clients table while the restart is happening.

  • 3 Add a buffer to the important places where this cutoff is used
    When a checkpointer forcibly claims ownership of a shard, it does so as soon as clientRecordMaxAge is reached. However, the other checkpointer doesn't just need time to process the last event; it also needs time to run refreshShards() and recognise that the shard ownership model has changed. So we should add at least one shardCheckFrequency to clientRecordMaxAge to allow enough time to avoid the conflict. Again, this doesn't eliminate the problem, but it reduces the chances of hitting an ownership challenge.

@colmsnowplow colmsnowplow changed the title Resolve race condition in shard ownership Resolve race conditions in shard ownership Oct 27, 2021
@colmsnowplow
Author

Resolved in #13
