Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Load balancing is useless unless we allow configuring the conn.maxStreams value #112

Closed
vladzcloudius opened this issue Feb 15, 2023 · 15 comments · Fixed by #113
Closed
Labels

Comments

@vladzcloudius
Copy link

vladzcloudius commented Feb 15, 2023

HEAD: c8cd0ba

Description
The hard codded value for "protocol" version greater than 2 is 32768: https://github.com/scylladb/gocql/blob/master/internal/streams/streams.go#L24

This makes #86 pretty much useless with a shard-aware driver because any concurrency above 100-200 per shard is already going to cause issues.

The issue is a follow up for the comment here: #86 (comment)

Two things:

  1. The current default is too high IMO.
  2. There should be a way to control the conn.maxStreams value by the user.
@vladzcloudius
Copy link
Author

@nuivall FYI

@vladzcloudius
Copy link
Author

BTW, I found this patch from Michal Matczuk in my local repo which didn't make into the master:

commit 6de342bab28cf0bc1a3380fdb563f93e77c7df9e (HEAD -> expose_max_streams, origin/mmt/expose_max_streams, mmt/expose_max_streams)
Author: Michał Matczuk <michal@scylladb.com>
Date:   Wed Oct 20 16:05:34 2021 +0200

    session: conn, add possibility to cap the nr. of streams per TCP connection
    
    To cap the number add MaxStreams value to ClusterConfig when initializing the driver.

diff --git a/cluster.go b/cluster.go
index 3b476fe..c6f64f6 100644
--- a/cluster.go
+++ b/cluster.go
@@ -51,6 +51,7 @@ type ClusterConfig struct {
        Port               int                                      // port (default: 9042)
        Keyspace           string                                   // initial keyspace (optional)
        NumConns           int                                      // number of connections per host (default: 2), this option has no effect when working with Scylla - instead, one connection for each shard will be created
+       MaxStreams         int                                      // maximal number of streams per TCP connection
        Consistency        Consistency                              // default consistency level (default: Quorum)
        Compressor         Compressor                               // compression algorithm (default: nil)
        Authenticator      Authenticator                            // authenticator (default: nil)
diff --git a/conn.go b/conn.go
index 262280b..0e9a81b 100644
--- a/conn.go
+++ b/conn.go
@@ -289,7 +289,7 @@ func (s *Session) dialWithoutObserver(ctx context.Context, host *HostInfo, cfg *
                errorHandler:  errorHandler,
                compressor:    cfg.Compressor,
                session:       s,
-               streams:       streams.New(cfg.ProtoVersion),
+               streams:       s.streamIDGenerator(cfg.ProtoVersion),
                host:          host,
                frameObserver: s.frameObserver,
                w: &deadlineWriter{
@@ -310,6 +310,13 @@ func (s *Session) dialWithoutObserver(ctx context.Context, host *HostInfo, cfg *
        return c, nil
 }
 
+func (s *Session) streamIDGenerator(protocol int) *streams.IDGenerator {
+       if s.cfg.MaxStreams > 0 {
+               return streams.NewLimited(s.cfg.MaxStreams)
+       }
+       return streams.New(protocol)
+}
+
 func (c *Conn) init(ctx context.Context) error {
        if c.session.cfg.AuthProvider != nil {
                var err error
diff --git a/internal/streams/streams.go b/internal/streams/streams.go
index ea43412..00fbbf3 100644
--- a/internal/streams/streams.go
+++ b/internal/streams/streams.go
@@ -24,7 +24,13 @@ func New(protocol int) *IDGenerator {
        if protocol > 2 {
                maxStreams = 32768
        }
+       return NewLimited(maxStreams)
+}
 
+func NewLimited(maxStreams int) *IDGenerator {
+       if maxStreams < 128 {
+               maxStreams = 128
+       }
        buckets := maxStreams / 64
        // reserve stream 0
        streams := make([]uint64, buckets)

@vladzcloudius
Copy link
Author

vladzcloudius commented Feb 15, 2023

Would it do the trick, @nuivall ?

@DoronArazii @tomer-sandler @avelanarius FYI

@vladzcloudius
Copy link
Author

@mykaul FYI - this is very urgent. Many users complain about this missing feature.

@avelanarius
Copy link
Member

BTW, I found this patch from Michal Matczuk in my local repo which didn't make into the master:

This is from this PR: #87

I'll review it and merge it if it looks ok.

@mykaul mykaul added the high label Feb 15, 2023
@avelanarius
Copy link
Member

avelanarius commented Feb 15, 2023

@vladzcloudius To double check: #87 implementation will immediately throw gocql: no streams available on connection if the number of inflight requests on some connection exceed the limit. In contrast, Java Driver has a limit of inflight requests on a connection, but if this limit is exceeded, it starts to queue the requests client-side (not sending them yet to Scylla). As far as I know and checked, there isn't any such "fail-safe" queue in gocql.

In some testing I did in the past (scylladb/scylladb#7748 (comment)) we saw that there were intermittent spikes of inflight requests on a connection. In such case the additional client-side queue helps the driver avoid throwing "overloaded" exception, while still maintaing an arbitrarly low number of inflight (sent to Scylla and awaiting response) requests.

If this is not a concern, the PR looks good to me. I would only change the default maxStreams = 32768 to a more reasonable value (for example 1024).

@vladzcloudius
Copy link
Author

@vladzcloudius To double check: #87 implementation will immediately throw gocql: no streams available on connection if the number of inflight requests on some connection exceed the limit. In contrast, Java Driver has a limit of inflight requests on a connection, but if this limit is exceeded, it starts to queue the requests client-side (not sending them yet to Scylla). As far as I know and checked, there isn't any such "fail-safe" queue in gocql.

In some testing I did in the past (scylladb/scylladb#7748 (comment)) we saw that there were intermittent spikes of inflight requests on a connection. In such case the additional client-side queue helps the driver avoid throwing "overloaded" exception, while still maintaing an arbitrarly low number of inflight (sent to Scylla and awaiting response) requests.

If this is not a concern, the PR looks good to me. I would only change the default maxStreams = 32768 to a more reasonable value (for example 1024).

All these are relevant concerns that should be eventually addressed.
#86 is supposed to reduce the phenomena you described, @avelanarius till all queues on all shards are overloaded, wouldn't it?

But regardless, I suggest to merge this and file a separate GH issue for a client side "queuing".

BTW, it's not obvious to me at all that the Java driver behavior is better than the current GoCQL one. The former can cause high client side latencies despite Scylla side ones would be low.

@nuivall
Copy link
Member

nuivall commented Feb 15, 2023

I think the patch should help (unsolved part of the issue is that there is still no limit to max connections with shard aware).

As for the default max streams value how it compares to other drivers? Perhaps it indeed should be lower.

For the queue I guess it can help to smooth things out, but in general long queues should not accumulate in the system - it's better to fail fast than create an unbalanced state which is then difficult to reason about.

Better idea would be to have a controller for both max streams and max connections (I guess one conn perc shard policy won't work well when combined with bigger clusters and distributed clients).

Another thing not mentioned is user's ability to react on overload errors. I haven't found how it is done in the gocql currently. This applies also to d71301a. There should be some exported error user can check (i.e. with error.Is(...)) or some other method.

@harel-z
Copy link

harel-z commented Feb 15, 2023

Please note that we need this fix asap to deal with an ongoing latency spike issue at Kiwi. The issue is related to an unbounded concurrency spikes and the bottlenecks they create.

avelanarius pushed a commit to avelanarius/gocql that referenced this issue Feb 15, 2023
Before this change, a connection had a hardcoded maximum number of
streams. Since this value was so large (32768 for CQL v3+) this
essentially meant an unbound concurrency.

This commit allows a user to configure this parameter by a newly
added MaxRequestsPerConn option.

The "MaxRequestsPerConn" naming was chosen to be very general
deliberately to allow us to safely introduce a more advanced behavior in
the future: where the concurrency limiting happens not at a IDGenerator
level and an additional client-side queue is introduced for requests
to wait to be sent to Scylla (in addition to inflight requests).

This commit deliberately does not change the defaults as to be
non-pessimizing for the users.

Fixes scylladb#112
@avelanarius
Copy link
Member

Pushed this PR: #113

I'd be comfortable with merging it "now" as I tried to make it as non-pessimizing and as safe as much as possible.

Some limitations of that PR:

  • Not well tested yet
  • Defaults not changed
  • maxStreams (MaxRequestsPerConn) has to be a multiple of 64. The issue is not only in this division: maxStreams / 64, but in the whole structure of the code which assumes that this is a multiple of 64 (their own custom implementation of bitset). Modifying the implementation will take some time and add small potential risk for bugs.

@vladzcloudius
Copy link
Author

Thanks, @avelanarius.
Do you want to keep this open till above limitations are addressed or you want a separate GH issue for them?

@avelanarius
Copy link
Member

Probably a separate GH issue will be better.

avelanarius pushed a commit to avelanarius/gocql that referenced this issue Feb 16, 2023
Before this change, a connection had a hardcoded maximum number of
streams. Since this value was so large (32768 for CQL v3+) this
essentially meant an unbound concurrency.

This commit allows a user to configure this parameter by a newly
added MaxRequestsPerConn option.

The "MaxRequestsPerConn" naming was chosen to be very general
deliberately to allow us to safely introduce a more advanced behavior in
the future: where the concurrency limiting happens not at a IDGenerator
level and an additional client-side queue is introduced for requests
to wait to be sent to Scylla (in addition to inflight requests).

This commit deliberately does not change the defaults as to be
non-pessimizing for the users.

Fixes scylladb#112
@vladzcloudius
Copy link
Author

maxStreams (MaxRequestsPerConn) has to be a multiple of 64. The issue is not only in this division: maxStreams / 64, but in the whole structure of the code which assumes that this is a multiple of 64 (their own custom implementation of bitset). Modifying the implementation will take some time and add small potential risk for bugs.

@avelanarius I posted comments on the avelanarius@8baecfb
I agree that the above restriction doesn't have to fixed right now - however it has to be properly enforced and documented.
Otherwise a user experience is going to be very frustrating.

@vladzcloudius
Copy link
Author

Actually, their version looks better in this regard: kiwicom#31

@nuivall
Copy link
Member

nuivall commented Feb 16, 2023

Actually, their version looks better in this regard: kiwicom#31

In that patch having both MaxStreamID and NumStreams I think is quite confusing.

Gor027 pushed a commit to Gor027/gocql that referenced this issue Feb 16, 2023
Before this change, a connection had a hardcoded maximum number of
streams. Since this value was so large (32768 for CQL v3+) this
essentially meant an unbound concurrency.

This commit allows a user to configure this parameter by a newly
added MaxRequestsPerConn option.

The "MaxRequestsPerConn" naming was chosen to be very general
deliberately to allow us to safely introduce a more advanced behavior in
the future: where the concurrency limiting happens not at a IDGenerator
level and an additional client-side queue is introduced for requests
to wait to be sent to Scylla (in addition to inflight requests).

This commit deliberately does not change the defaults as to be
non-pessimizing for the users.

Fixes scylladb#112
avelanarius pushed a commit that referenced this issue Feb 16, 2023
Before this change, a connection had a hardcoded maximum number of
streams. Since this value was so large (32768 for CQL v3+) this
essentially meant an unbound concurrency.

This commit allows a user to configure this parameter by a newly
added MaxRequestsPerConn option.

The "MaxRequestsPerConn" naming was chosen to be very general
deliberately to allow us to safely introduce a more advanced behavior in
the future: where the concurrency limiting happens not at a IDGenerator
level and an additional client-side queue is introduced for requests
to wait to be sent to Scylla (in addition to inflight requests).

This commit deliberately does not change the defaults as to be
non-pessimizing for the users.

Fixes #112
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants