Too many "shards" in Alternator Streams #13080
Hi @nyh, I saw your reply on the mailing list. Reducing num_tokens seems to be the only thing we can do at present. When all nodes have the same hardware configuration, we can reduce it moderately while keeping num_tokens greater than the number of vCPUs of a single node. Will this still hurt load balancing?
docs/alternator/compatibility.md mentions a known problem that Alternator Streams are divided into too many "shards". This patch adds a link to a GitHub issue to track our work on this issue, like we did for most other differences mentioned in compatibility.md. Refs #13080 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes #13081
@kbr-scylla @tgrabiec @avikivity I'm wondering: would tablets be able to drastically reduce the number of CDC streams and help improve the usability of Alternator Streams? (Of course, only when CDC on tablets is actually implemented - see #16317.) I'm thinking: on one hand, tablets can allow us to replace the 256 vnodes we had per CPU with a much smaller number of tablets per CPU. For example, if we replace 256 vnodes with 4 tablets, it can reduce the number of CDC streams 64-fold.
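The 64-fold figure above follows directly from the ratio 256/4. A quick sketch for a hypothetical cluster (the node and vCPU counts here are illustrative assumptions, not from the thread):

```python
# Hypothetical cluster; the 256 and 4 figures come from the comment above.
nodes, vcpus_per_node = 3, 8          # assumed example cluster
num_tokens = 256                      # vnodes (ScyllaDB default)
tablets_per_cpu = 4                   # assumed tablet density

total_cores = nodes * vcpus_per_node                  # 24
streams_with_vnodes = total_cores * num_tokens        # 24 * 256 = 6144
streams_with_tablets = total_cores * tablets_per_cpu  # 24 * 4 = 96

print(streams_with_vnodes // streams_with_tablets)    # 64-fold reduction
```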
I don't have the numbers, but I think the number of tablets can also grow pretty large. Not sure if more than
Definitely this. Moving a tablet should not affect CDC streams for other tablets, IMO, so we should redesign the concept of generations for tablets. Also, it's not moving a tablet that would change the stream ID for that tablet (if we somehow manage to keep colocation while moving tablets - the stream's tablet should have the same set of replicas as the corresponding base tablet). The event that causes new stream IDs to be created is tablet splitting or merging.
If moving a tablet doesn't affect CDC streams, we need one stream per tablet. This means the number of streams is ~100 × shards_in_cluster (or really 33 × shards_in_cluster if RF=3). I think it's too large. Better to have one stream per vCPU. This means that if we move a tablet, we have to close two streams and open two new ones (as two shards are affected).
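The ~100 and ~33 multipliers above can be reconstructed under assumptions: a 5 GB target tablet size (mentioned later in the thread) and an assumed per-shard data load (the 500 GB here is a hypothetical figure for illustration):

```python
# Back-of-the-envelope for "one stream per tablet".
tablet_size_gb = 5        # target tablet size (see "5GB of data" below)
data_per_shard_gb = 500   # ASSUMED data load per shard, for illustration
rf = 3

# Tablet replicas hosted by one shard:
tablet_replicas_per_shard = data_per_shard_gb // tablet_size_gb  # 100
# Each tablet has rf replicas, so distinct tablets (hence streams)
# per shard of the cluster is roughly:
streams_per_shard = tablet_replicas_per_shard // rf              # 33
```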
I don't think that's possible. Streams must be chosen so log writes are colocated with base table writes. Therefore, if two base table writes go to different replica sets, their corresponding log table writes need to go to different streams. Therefore, two different tablets in general need two different corresponding streams because they have different replica sets. As an optimization we could reuse a stream ID between multiple tablets if they have the exact same replica sets, but that's probably a condition that rapidly becomes false (as soon as at least one tablet in the set is migrated).
Right. So we'll have one stream per 5GB of data.
I don't think it's so bad. If we could influence the load balancer to favor maintaining replica set similarity, then it could keep the number of streams low. For example, if we have three nodes with two shards each, and add a node, we'd migrate shard 0 to shard 0 and shard 1 to shard 1. The number of distinct replica sets would rise from 2 to 8.
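The 2 → 8 jump above can be checked with a small combinatorial sketch, under the reading that RF=3 and the balancer keeps tablets "shard-aligned" (same shard index on every replica node), so a replica set is a node triple plus a shard index:

```python
from itertools import combinations

rf = 3
shards_per_node = 2

# Before: 3 nodes, RF=3 -> every tablet lives on all 3 nodes, so the only
# variation is the shard index: C(3,3) * 2 = 2 distinct replica sets.
replica_sets_before = [(triple, s)
                       for triple in combinations(range(3), rf)
                       for s in range(shards_per_node)]

# After adding a 4th node, tablets spread over any 3 of the 4 nodes:
# C(4,3) * 2 = 8 distinct replica sets.
replica_sets_after = [(triple, s)
                      for triple in combinations(range(4), rf)
                      for s in range(shards_per_node)]

print(len(replica_sets_before), len(replica_sets_after))  # 2 8
```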
docs/alternator/compatibility.md lists the following difference between Alternator Streams and DynamoDB Streams: in DynamoDB the number of separate "shards" (confusingly, ScyllaDB calls these "CDC streams" and uses the name "shards" for something else) is reasonably low, and only grows when the amount of changes grows (to facilitate handling these changes in parallel), whereas in Alternator we create a potentially huge number of these shards: the number of CPU cores in the cluster multiplied by num_tokens (=256 by default).

It is very inefficient for the user to try to read from tens of thousands of shards, often only to find out that most of them have no new data. So this design can force users to read much less often than they wish, and get a higher delay in handling changes. Another problem that @elcallio noticed in the past is that some libraries reading from DynamoDB Streams assume there is a low number of shards and use a separate thread for each shard - making these libraries blow up on ScyllaDB with its huge number of shards.
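To see how quickly the cores × num_tokens formula reaches "tens of thousands", here is the arithmetic for a hypothetical mid-size cluster (the node and core counts are assumptions for illustration):

```python
# Shard count formula from the issue text: CPU cores in cluster * num_tokens.
nodes, cores_per_node = 9, 16   # ASSUMED example cluster
num_tokens = 256                # ScyllaDB default

stream_shards = nodes * cores_per_node * num_tokens
print(stream_shards)  # 36864 -- tens of thousands of stream shards
```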
We also had users complaining about this - see for example https://groups.google.com/g/scylladb-dev/c/l68LPFmWVwQ/m/-eDAp3cYCgAJ
The ideal fix for this issue would be to improve our CDC backend to not have so many separate streams - such an improvement would also benefit users of CQL's CDC.
But if that does not happen, we could consider how to "hide" the large number of CDC streams behind a lower number of Alternator Streams shards. One thing we could do, for example, is to have a single stream shard per CPU core. This stream will merge the physical num_tokens partitions that exist in our CDC. To reduce the load on the server (and not just on the client), the server should be able to quickly notice CDC partitions that have no new events (a common thing when there's a huge number of these shards) without reading from the table. We could perhaps cache in memory the last event in each shard, or even maintain a list of shards that have seen new events (à la epoll()).
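A minimal sketch of the epoll()-like idea above (not ScyllaDB code; class and method names are made up for illustration): the server keeps an in-memory set of shards that received new events, so a poll can skip the commonly empty shards without reading the table.

```python
class ActiveShardTracker:
    """Hypothetical server-side set of shards with unread CDC events."""

    def __init__(self):
        self._active = set()

    def record_event(self, shard_id):
        # Called on the write path when an event lands in a shard.
        self._active.add(shard_id)

    def drain_active(self):
        # Return shards with new events since the last drain;
        # everything not in this set can be skipped without a table read.
        active, self._active = self._active, set()
        return active

tracker = ActiveShardTracker()
tracker.record_event("shard-17")
tracker.record_event("shard-42")
tracker.record_event("shard-17")        # duplicates collapse into the set
print(sorted(tracker.drain_active()))   # ['shard-17', 'shard-42']
print(tracker.drain_active())           # set() -- nothing new to poll
```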