Too many "shards" in Alternator Streams #13080

Open
nyh opened this issue Mar 6, 2023 · 6 comments
Labels
area/alternator Alternator related Issues area/alternator-streams area/cdc symptom/performance Issues causing performance problems user request

@nyh
Contributor

nyh commented Mar 6, 2023

docs/alternator/compatibility.md lists the following difference between Alternator Streams and DynamoDB Streams:

The number of separate "shards" in Alternator's streams is significantly larger than is typical on DynamoDB.

Indeed, whereas in DynamoDB the number of separate "shards" (confusingly, ScyllaDB calls these "CDC streams" and uses the name "shards" for something else) is reasonably low, and only grows when the volume of changes grows (to facilitate handling those changes in parallel), in Alternator we create a potentially huge number of these shards: the number of CPU cores in the cluster multiplied by num_tokens (256 by default).
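
For a sense of scale, here is a quick back-of-the-envelope calculation; the cluster size and per-node vCPU count below are hypothetical, not figures from this issue:

```python
# Hypothetical example: estimate the number of Alternator Streams "shards"
# (i.e., CDC streams) the client sees in one CDC generation.
nodes = 3            # assumed cluster size
vcpus_per_node = 16  # assumed number of shards (vCPUs) per node
num_tokens = 256     # Scylla's default

stream_shards = nodes * vcpus_per_node * num_tokens
print(stream_shards)  # 12288 shards the client may have to poll
```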

It is very inefficient for the user to poll tens of thousands of shards only to find that most of them have no new data. This design can therefore force users to read much less often than they would like, and to accept a higher delay in handling changes. Another problem, which @elcallio noticed in the past, is that some libraries reading from DynamoDB Streams assume there is a low number of shards and use a separate thread for each shard - making these libraries blow up on ScyllaDB with its huge number of shards.
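
To illustrate the client-side cost, here is a minimal sketch of the naive polling loop a DynamoDB Streams consumer typically runs against Alternator; the endpoint URL and table name are placeholders, and real consumers (e.g. KCL-based ones) additionally spawn per-shard workers, which is exactly what falls apart with tens of thousands of shards:

```python
import boto3

# Placeholders: a local Alternator endpoint and an example table name.
streams = boto3.client("dynamodbstreams", endpoint_url="http://localhost:8000",
                       region_name="us-east-1")

stream_arn = streams.list_streams(TableName="example_table")["Streams"][0]["StreamArn"]

# Enumerate *all* shards of the stream (paginated - tens of thousands on Alternator).
shards, last_shard = [], None
while True:
    kwargs = {"StreamArn": stream_arn}
    if last_shard:
        kwargs["ExclusiveStartShardId"] = last_shard
    desc = streams.describe_stream(**kwargs)["StreamDescription"]
    shards += desc["Shards"]
    last_shard = desc.get("LastEvaluatedShardId")
    if not last_shard:
        break

# One GetShardIterator + GetRecords round trip per shard, even if the shard is empty.
for shard in shards:
    it = streams.get_shard_iterator(StreamArn=stream_arn, ShardId=shard["ShardId"],
                                    ShardIteratorType="TRIM_HORIZON")["ShardIterator"]
    records = streams.get_records(ShardIterator=it)["Records"]
    # Most iterations return no records, but the request was still paid for.
```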

We also had users complaining about this - see for example https://groups.google.com/g/scylladb-dev/c/l68LPFmWVwQ/m/-eDAp3cYCgAJ

The ideal fix for this issue would be to improve our CDC backend to not have so many separate streams - such an improvement would also benefit users of CQL's CDC.

But if that does not happen, we could consider how to "hide" the large number of CDC streams behind a lower number of Alternator Streams shards. One thing we could do, for example, is to have a single stream shard per CPU core. This stream would merge the num_tokens physical partitions that exist in our CDC. To reduce the load on the server (and not just on the client), the server should be able to quickly notice CDC partitions that have no new events (a common thing when there is a huge number of these shards) without reading from the table. We could perhaps cache in memory the last event in each shard, or even maintain a list of shards that have seen new events (à la epoll()).
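
None of the following is existing Scylla code; it is only a toy, in-memory sketch of the epoll()-like bookkeeping suggested above, where the write path marks a logical stream shard as having news so the read path can skip quiet shards without reading the table:

```python
import time
from collections import defaultdict

class StreamShardIndex:
    """Toy model of the proposed bookkeeping: writers mark a logical stream
    shard as 'dirty' when they append a CDC event, and readers can skip
    shards that saw nothing since their last poll."""

    def __init__(self):
        self._last_event_time = defaultdict(float)  # shard id -> last write timestamp

    def on_cdc_write(self, shard_id: str) -> None:
        # Called on the write path, once per logical shard (e.g., one per vCPU).
        self._last_event_time[shard_id] = time.time()

    def shards_with_news(self, since: float) -> list[str]:
        # A cheap in-memory check, instead of issuing a read per shard.
        return [s for s, t in self._last_event_time.items() if t > since]

index = StreamShardIndex()
index.on_cdc_write("shard-0")
print(index.shards_with_news(since=0.0))  # ['shard-0'] -- only this shard needs a real read
```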

@nyh nyh added symptom/performance Issues causing performance problems user request area/cdc area/alternator Alternator related Issues area/alternator-streams labels Mar 6, 2023
nyh added a commit to nyh/scylla that referenced this issue Mar 6, 2023
docs/alternator/compatibility.md mentions a known problem that
Alternator Streams are divided into too many "shards". This patch
adds a link to a github issue to track our work on this issue - like
we did for most other differences mentioned in compatibility.md.

Refs scylladb#13080

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
@ptrsmrn ptrsmrn added this to the 5.x milestone Mar 6, 2023
@Felix-zhoux

> One thing you could do is to reduce num-tokens but that hurts load balancing and may not be a good idea.

Hi @nyh, I saw that you gave this reply on the mailing list. Reducing num_tokens seems to be the only thing we can do at present. If all nodes have the same hardware configuration and we reduce num_tokens moderately, keeping it greater than the number of vCPUs of a single node, will that still hurt load balancing?

denesb pushed a commit that referenced this issue Mar 7, 2023
docs/alternator/compatibility.md mentions a known problem that
Alternator Streams are divided into too many "shards". This patch
adds a link to a github issue to track our work on this issue - like
we did for most other differences mentioned in compatibility.md.

Refs #13080

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #13081
@nyh
Contributor Author

nyh commented Feb 15, 2024

@kbr-scylla @tgrabiec @avikivity I'm wondering, would tablets be able to drastically reduce the number of CDC streams and help improve the usability of Alternator Streams? (of course, when CDC on tablets is actually implemented - see #16317).

I'm thinking: on one hand, tablets allow us to replace the 256 vnodes we had per CPU with a much smaller number of tablets per CPU. For example, if we replace 256 vnodes with 4 tablets, the number of CDC streams drops 64-fold.
On the other hand, tablets move around a lot more than the old vnodes, and these movements create new "generations" and complicate everything. Perhaps we should make an effort to:

1. Migrate tablets less frequently in a table that uses CDC, and
2. Consider a technique or API where moving a single tablet doesn't feel to the application like a whole new generation where everything changed, but just one tablet moved (i.e., compare select() to poll()).

@kbr-scylla
Contributor

> @kbr-scylla @tgrabiec @avikivity I'm wondering, would tablets be able to drastically reduce the number of CDC streams and help improve the usability of Alternator Streams? (of course, when CDC on tablets is actually implemented - see #16317).

I don't have the numbers, but I think the number of tablets can also grow pretty large. I'm not sure if it would be more than vnodes per node * shards per node * number of nodes, as is the case with current CDC generations.

> 2. Consider a technique or API where moving a single tablet doesn't feel to the application like a whole new generation where everything changed, but just one tablet moved (i.e., compare select() to poll()).

Definitely this. Moving a tablet should not affect CDC streams for other tablets IMO. So we should redesign the concept of generations for tablets.

Also, it's not moving a tablet that would change that tablet's stream ID (assuming we somehow manage to keep colocation while moving tablets -- the stream's tablet should have the same set of replicas as the corresponding base tablet). The event that causes new stream IDs to be created is tablet splitting or merging.

@avikivity
Member

If moving a tablet doesn't affect CDC streams, we need one stream per tablet. This means the number of streams is ~100 × shards_in_cluster (or really ~33 × shards_in_cluster if RF=3). I think it's too large.

Better have one stream per vcpu. This means that if we move a tablet we have to close two streams and open two new ones (as two shards are affected).
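
For context only, here is a rough sketch of where numbers in this ballpark can come from, under the assumptions that a tablet is ~5 GB and a shard owns on the order of 0.5 TB of tablet-replica data; both figures are assumptions for illustration, not taken from the comment above:

```python
# Rough, assumption-laden arithmetic for one-stream-per-tablet.
data_per_shard_gb = 500   # assumed data owned by one vCPU (shard)
tablet_size_gb = 5        # assumed target tablet size
rf = 3                    # replication factor

tablet_replicas_per_shard = data_per_shard_gb / tablet_size_gb   # ~100 per shard
streams_per_shard = tablet_replicas_per_shard / rf               # ~33 distinct tablets (= streams) per shard
print(tablet_replicas_per_shard, streams_per_shard)              # 100.0 33.33...
```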

@kbr-scylla
Contributor

> Better have one stream per vcpu.

I don't think that's possible.

Streams must be chosen so log writes are colocated with base table writes. Therefore if two base table writes go to different replica sets, their corresponding log table writes need to go to different streams. Therefore two different tablets need to have in general two different corresponding streams because they have different replica sets.

As an optimization, we could reuse a stream ID between multiple tablets if they have the exact same replica sets, but that's probably a condition that rapidly becomes false (as soon as at least one tablet in the set is migrated).
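
A minimal sketch of that optimization, using a hypothetical (node, shard) representation of a replica; tablets share a stream ID only for as long as their replica sets are literally identical:

```python
import uuid

# Hypothetical mapping: a replica set -> a shared stream ID, so tablets with
# identical replica sets write to the same stream.
stream_by_replica_set: dict[frozenset, uuid.UUID] = {}

def stream_id_for(tablet_replicas: set) -> uuid.UUID:
    """tablet_replicas: e.g. {('n1', 0), ('n2', 3), ('n3', 1)} as (node, shard) pairs."""
    key = frozenset(tablet_replicas)
    if key not in stream_by_replica_set:
        stream_by_replica_set[key] = uuid.uuid4()  # new stream only for a new replica set
    return stream_by_replica_set[key]

# Two tablets with identical replica sets share one stream...
a = stream_id_for({("n1", 0), ("n2", 0), ("n3", 0)})
b = stream_id_for({("n1", 0), ("n2", 0), ("n3", 0)})
assert a == b
# ...but as soon as one of them is migrated, it gets its own stream.
c = stream_id_for({("n1", 0), ("n2", 0), ("n4", 1)})
assert c != a
```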

@avikivity
Member

>> Better have one stream per vcpu.

> I don't think that's possible.

> Streams must be chosen so log writes are colocated with base table writes. Therefore if two base table writes go to different replica sets, their corresponding log table writes need to go to different streams. Therefore two different tablets need to have in general two different corresponding streams because they have different replica sets.

Right. So we'll have one stream per 5GB of data.

> As an optimization, we could reuse a stream ID between multiple tablets if they have the exact same replica sets, but that's probably a condition that rapidly becomes false (as soon as at least one tablet in the set is migrated).

I don't think it's so bad. If we could influence the load balancer to favor maintaining replica set similarity, then it could keep the number of streams low.

For example, if we have three nodes with two shards each, and add a node, we'd migrate shard 0 to shard 0 and shard 1 to shard 1. The number of distinct replica sets would rise from 2 to 8.
