Too many "shards" in Alternator Streams #13080

Open
nyh opened this issue Mar 6, 2023 · 6 comments
Labels
area/alternator Alternator related Issues area/alternator-streams area/cdc symptom/performance Issues causing performance problems user request

@nyh
Contributor

nyh commented Mar 6, 2023

docs/alternator/compatibility.md lists the following difference between Alternator Streams and DynamoDB Streams:

The number of separate "shards" in Alternator's streams is significantly larger than is typical on DynamoDB.

Indeed, whereas in DynamoDB the number of separate "shards" (confusingly, ScyllaDB calls these "CDC streams" and uses the name "shards" for something else) is reasonably low, and only grows when the volume of changes grows (to facilitate handling those changes in parallel), in Alternator we create a potentially huge number of these shards: the number of CPU cores in the cluster multiplied by num_tokens (256 by default).
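
For a sense of scale, here is a quick back-of-the-envelope calculation; the cluster size and per-node vCPU count below are hypothetical, not figures from this issue:

```python
# Hypothetical example: estimate the number of Alternator Streams "shards"
# (i.e., CDC streams) the client sees in one CDC generation.
nodes = 3            # assumed cluster size
vcpus_per_node = 16  # assumed number of shards (vCPUs) per node
num_tokens = 256     # Scylla's default

stream_shards = nodes * vcpus_per_node * num_tokens
print(stream_shards)  # 12288 shards the client may have to poll
```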

It is very inefficient for the user to poll tens of thousands of shards only to find that most of them have no new data. This design can therefore force users to read much less often than they would like, and to accept a higher delay in handling changes. Another problem, which @elcallio noticed in the past, is that some libraries reading from DynamoDB Streams assume there is a low number of shards and use a separate thread for each shard - making these libraries blow up on ScyllaDB with its huge number of shards.
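
To illustrate the client-side cost, here is a minimal sketch of the naive polling loop a DynamoDB Streams consumer typically runs against Alternator; the endpoint URL and table name are placeholders, and real consumers (e.g. KCL-based ones) additionally spawn per-shard workers, which is exactly what falls apart with tens of thousands of shards:

```python
import boto3

# Placeholders: a local Alternator endpoint and an example table name.
streams = boto3.client("dynamodbstreams", endpoint_url="http://localhost:8000",
                       region_name="us-east-1")

stream_arn = streams.list_streams(TableName="example_table")["Streams"][0]["StreamArn"]

# Enumerate *all* shards of the stream (paginated - tens of thousands on Alternator).
shards, last_shard = [], None
while True:
    kwargs = {"StreamArn": stream_arn}
    if last_shard:
        kwargs["ExclusiveStartShardId"] = last_shard
    desc = streams.describe_stream(**kwargs)["StreamDescription"]
    shards += desc["Shards"]
    last_shard = desc.get("LastEvaluatedShardId")
    if not last_shard:
        break

# One GetShardIterator + GetRecords round trip per shard, even if the shard is empty.
for shard in shards:
    it = streams.get_shard_iterator(StreamArn=stream_arn, ShardId=shard["ShardId"],
                                    ShardIteratorType="TRIM_HORIZON")["ShardIterator"]
    records = streams.get_records(ShardIterator=it)["Records"]
    # Most iterations return no records, but the request was still paid for.
```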

We also had users complaining about this - see for example https://groups.google.com/g/scylladb-dev/c/l68LPFmWVwQ/m/-eDAp3cYCgAJ

The ideal fix for this issue would be to improve our CDC backend to not have so many separate streams - such an improvement would also benefit users of CQL's CDC.

But if that does not happen, we could consider how to "hide" the large number of CDC streams behind a lower number of Alternator Streams shards. One thing we could do, for example, is to have a single stream shard per CPU core. This stream would merge the num_tokens physical partitions that exist in our CDC. To reduce the load on the server (and not just on the client), the server should be able to quickly notice CDC partitions that have no new events (a common thing when there is a huge number of these shards) without reading from the table. We could perhaps cache in memory the last event in each shard, or even maintain a list of shards that have seen new events (à la epoll()).
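
None of the following is existing Scylla code; it is only a toy, in-memory sketch of the epoll()-like bookkeeping suggested above, where the write path marks a logical stream shard as having news so the read path can skip quiet shards without reading the table:

```python
import time
from collections import defaultdict

class StreamShardIndex:
    """Toy model of the proposed bookkeeping: writers mark a logical stream
    shard as 'dirty' when they append a CDC event, and readers can skip
    shards that saw nothing since their last poll."""

    def __init__(self):
        self._last_event_time = defaultdict(float)  # shard id -> last write timestamp

    def on_cdc_write(self, shard_id: str) -> None:
        # Called on the write path, once per logical shard (e.g., one per vCPU).
        self._last_event_time[shard_id] = time.time()

    def shards_with_news(self, since: float) -> list[str]:
        # A cheap in-memory check, instead of issuing a read per shard.
        return [s for s, t in self._last_event_time.items() if t > since]

index = StreamShardIndex()
index.on_cdc_write("shard-0")
print(index.shards_with_news(since=0.0))  # ['shard-0'] -- only this shard needs a real read
```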

@nyh nyh added symptom/performance Issues causing performance problems user request area/cdc area/alternator Alternator related Issues area/alternator-streams labels Mar 6, 2023
nyh added a commit to nyh/scylla that referenced this issue Mar 6, 2023
docs/alternator/compatibility.md mentions a known problem that
Alternator Streams are divided into too many "shards". This patch
adds a link to a github issue to track our work on this issue - like
we did for most other differences mentioned in compatibility.md.

Refs scylladb#13080

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
@ptrsmrn ptrsmrn added this to the 5.x milestone Mar 6, 2023
@Felix-zhoux

> One thing you could do is to reduce num-tokens but that hurts load balancing and may not be a good idea.

Hi @nyh, I saw that you gave this reply on the mailing list. Reducing num_tokens seems to be the only thing we can do at present. If all nodes have the same hardware configuration and we reduce num_tokens moderately, keeping it greater than the number of vCPUs of a single node, will that still hurt load balancing?

denesb pushed a commit that referenced this issue Mar 7, 2023
docs/alternator/compatibility.md mentions a known problem that
Alternator Streams are divided into too many "shards". This patch
adds a link to a github issue to track our work on this issue - like
we did for most other differences mentioned in compatibility.md.

Refs #13080

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #13081
@nyh
Contributor Author

nyh commented Feb 15, 2024

@kbr-scylla @tgrabiec @avikivity I'm wondering, would tablets be able to drastically reduce the number of CDC streams and help improve the usability of Alternator Streams? (of course, when CDC on tablets is actually implemented - see #16317).

I'm thinking: on one hand, tablets allow us to replace the 256 vnodes we had per CPU with a much smaller number of tablets per CPU. For example, if we replace 256 vnodes with 4 tablets, the number of CDC streams drops 64-fold.
On the other hand, tablets move around a lot more than the old vnodes, and these movements create new "generations" and complicate everything. Perhaps we should make an effort to:

1. Migrate tablets less frequently in a table that uses CDC, and
2. Consider a technique or API where moving a single tablet doesn't feel to the application like a whole new generation where everything changed, but just one tablet moved (i.e., compare select() to poll()).

@kbr-scylla
Contributor

> @kbr-scylla @tgrabiec @avikivity I'm wondering, would tablets be able to drastically reduce the number of CDC streams and help improve the usability of Alternator Streams? (of course, when CDC on tablets is actually implemented - see #16317).

I don't have the numbers, but I think the number of tablets can also grow pretty large. I'm not sure if it would be more than vnodes per node * shards per node * number of nodes, as is the case with current CDC generations.

> 2. Consider a technique or API where moving a single tablet doesn't feel to the application like a whole new generation where everything changed, but just one tablet moved (i.e., compare select() to poll()).

Definitely this. Moving a tablet should not affect CDC streams for other tablets IMO. So we should redesign the concept of generations for tablets.

Also, it's not moving a tablet that would change that tablet's stream ID (assuming we somehow manage to keep colocation while moving tablets -- the stream's tablet should have the same set of replicas as the corresponding base tablet). The event that causes new stream IDs to be created is tablet splitting or merging.

@avikivity
Member

If moving a tablet doesn't affect CDC streams, we need one stream per tablet. This means the number of streams is ~100 × shards_in_cluster (or really ~33 × shards_in_cluster if RF=3). I think it's too large.

Better have one stream per vcpu. This means that if we move a tablet we have to close two streams and open two new ones (as two shards are affected).
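
For context only, here is a rough sketch of where numbers in this ballpark can come from, under the assumptions that a tablet is ~5 GB and a shard owns on the order of 0.5 TB of tablet-replica data; both figures are assumptions for illustration, not taken from the comment above:

```python
# Rough, assumption-laden arithmetic for one-stream-per-tablet.
data_per_shard_gb = 500   # assumed data owned by one vCPU (shard)
tablet_size_gb = 5        # assumed target tablet size
rf = 3                    # replication factor

tablet_replicas_per_shard = data_per_shard_gb / tablet_size_gb   # ~100 per shard
streams_per_shard = tablet_replicas_per_shard / rf               # ~33 distinct tablets (= streams) per shard
print(tablet_replicas_per_shard, streams_per_shard)              # 100.0 33.33...
```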

@kbr-scylla
Contributor

> Better have one stream per vcpu.

I don't think that's possible.

Streams must be chosen so log writes are colocated with base table writes. Therefore if two base table writes go to different replica sets, their corresponding log table writes need to go to different streams. Therefore two different tablets need to have in general two different corresponding streams because they have different replica sets.

As an optimization, we could reuse a stream ID between multiple tablets if they have the exact same replica sets, but that's probably a condition that rapidly becomes false (as soon as at least one tablet in the set is migrated).
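
A minimal sketch of that optimization, using a hypothetical (node, shard) representation of a replica; tablets share a stream ID only for as long as their replica sets are literally identical:

```python
import uuid

# Hypothetical mapping: a replica set -> a shared stream ID, so tablets with
# identical replica sets write to the same stream.
stream_by_replica_set: dict[frozenset, uuid.UUID] = {}

def stream_id_for(tablet_replicas: set) -> uuid.UUID:
    """tablet_replicas: e.g. {('n1', 0), ('n2', 3), ('n3', 1)} as (node, shard) pairs."""
    key = frozenset(tablet_replicas)
    if key not in stream_by_replica_set:
        stream_by_replica_set[key] = uuid.uuid4()  # new stream only for a new replica set
    return stream_by_replica_set[key]

# Two tablets with identical replica sets share one stream...
a = stream_id_for({("n1", 0), ("n2", 0), ("n3", 0)})
b = stream_id_for({("n1", 0), ("n2", 0), ("n3", 0)})
assert a == b
# ...but as soon as one of them is migrated, it gets its own stream.
c = stream_id_for({("n1", 0), ("n2", 0), ("n4", 1)})
assert c != a
```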

@avikivity
Member

>> Better have one stream per vcpu.

> I don't think that's possible.

> Streams must be chosen so log writes are colocated with base table writes. Therefore if two base table writes go to different replica sets, their corresponding log table writes need to go to different streams. Therefore two different tablets need to have in general two different corresponding streams because they have different replica sets.

Right. So we'll have one stream per 5GB of data.

> As an optimization, we could reuse a stream ID between multiple tablets if they have the exact same replica sets, but that's probably a condition that rapidly becomes false (as soon as at least one tablet in the set is migrated).

I don't think it's so bad. If we could influence the load balancer to favor maintaining replica set similarity, then it could keep the number of streams low.

For example, if we have three nodes with two shards each, and add a node, we'd migrate shard 0 to shard 0 and shard 1 to shard 1. The number of distinct replica sets would rise from 2 to 8.
