pools: option to prefill #4769
Conversation
@tirsen @gak Here's the approach @demmer proposed, which can be used as a stand-in until we figure out the course of action for #4697. In fact, if this addresses the core problem, we could just stay with it until we encounter newer issues in the future. PR is in draft state because tests need to be written.
Force-pushed from 968ec28 to a0ae7d7
@gak's rewrite tried to address two issues:
Correct me if I'm wrong (I don't know the connection pool deeply), but AFAICT this PR doesn't address either of these issues. It does not address 1) because it only prefills the pool up front; once connections expire due to idleness, the pool is once again empty. It obviously does not address 2) because it will again fill the entire pool before it starts reusing connections.
The idle pool closer actually opens a new connection after closing the expired one, so the pool generally remains full. I thought of doing the same thing when a closed connection was returned to the pool, but felt it wasn't worth it: it's rare enough that the overhead should get absorbed.
@tirsen to your points:
I agree with you that it would be cleaner to change this PR so that we never returned a closed connection to the pool. @sougou it seems simple enough to change things so that whenever
Can you explain the use case where you would want to allocate N connections up front, allowing a burst up to M, but where you're willing to pay the cost of waiting to connect for the burst spillover? It really seems to me that if we're worried about connection latency, then it'd be better to just preallocate the whole pool's worth.
Background: this change allows one to create prefilled resource pools. This is useful when traffic suddenly shifts while the pool is still empty: the empty pool triggers a thundering herd of Open requests that can cause an outage. There is a proposal to rewrite this pool to natively accommodate this feature. This change is a stand-in until that effort is completed. Signed-off-by: Sugu Sougoumarane <ssougou@gmail.com>
I've finished the work on this PR. The following actions now always happen, whether the prefill feature is enabled or not:
The prefill feature is enabled by passing a non-zero value for a new argument to the resource pool:
@@ -170,9 +195,6 @@ func (rp *ResourcePool) get(ctx context.Context, wait bool) (resource Resource,
	select {
	case wrapper, ok = <-rp.resources:
	default:
		if !wait {
Was this intended? Or is it not used anywhere?
I answered my own question. It was always true.
LGTM
		rp.Put(r)
	}()
}
wg.Wait()
Is it always necessary to wait for all the prefill actions to complete here?
I am somewhat reticent to add a ton of optionality here, but especially given the unbounded context.TODO(), this code as written might block startup for an arbitrary amount of time, no?
This is a good point. I'm not sure what the best approach would be. The problem with doing it asynchronously is that we'll get into trouble if that function hangs. Then we'll have to protect it from races with Close, etc. The other option is to add a timeout to the context.
I'm thinking we should let people try this and observe failure modes. That may give us better clarity on the best way forward. I've added log messages in pool.go so we can collect some data about this.
This is the one part of this change that still worries me a bit: as written this could potentially block startup forever, which seems like a bad idea. It would also be more efficient to parallelize this with the other startup tasks.
At the same time if users really want to make sure the connection pool is prewarmed before serving queries... then that seems like the thing we should give them.
All in all though... I wonder if adding a bounded timeout of something like 30 seconds by default would be sufficient here?
That seems better than waiting potentially forever.
I've added a 30-second timeout for prefilling the pool. We'll see how that works out.
	rp.idleClosed.Add(1)
	rp.active.Add(-1)
}
func() {
Was there some reason this needed to be wrapped in a function and not just included inline like it was before?
I find the extra layers of abstraction and the defer of putting the wrapper back into the resource pool to be more confusing this way.
It's necessary for panic-proofing. If Close or reopenResource panicked, this defensive code prevents an outage; otherwise a panic here would cause vttablet to lock up in many unexpected ways.
Ah that explains it 👍
if wrapper.resource != nil && idleTimeout > 0 && time.Until(wrapper.timeUsed.Add(idleTimeout)) < 0 {
	wrapper.resource.Close()
	rp.idleClosed.Add(1)
	rp.reopenResource(&wrapper)
I wonder if a cleaner implementation that would obviate the need for the manually refactored reopenResource would be for the idle closer to actually just call get and put internally.
This would require keeping the wait parameter that you removed, but changing this to call get to obtain the resource, then closing it, then calling put to return it means that we would only need to do the reopen in one place.
I personally prefer the current approach because I've always been uncomfortable with the complexity of get. I was so happy to get rid of that wait. Now you want me to reintroduce it? :)
OK -- what if instead we didn't reintroduce wait, but instead just called in with a short context deadline. In the usual case we will immediately get the connection, since the whole thing is only going to grab what it thinks is available. Then we use ErrTimeout to indicate that we've gone through all we need to, instead of the "stop early" case above.
The main thing I don't like about this approach (both before and after your change) is that it duplicates a lot of the logic related to the stats and reopening the resource, etc.
If we use the existing get()/put() interface then closeIdleResources just operates like any other client to the pool: it grabs a connection to quickly close it, then relies on put() to reopen it, and all the stats management would be done in get/put.
(I actually wish I had done that originally tbh)
Another related point which just came up in an internal discussion -- it would be nice to have vitess implement a clean mysql shutdown flow for the mysql protocol connections.
Currently our logs are filled up with things like:
2019-05-28T16:11:52.895744-08:00 163 [Note] Aborted connection 163 to db: 'vt_byuser' user: 'vt_app_a' host: 'localhost' (Got an error reading communication packets)
That's because Close() simply shuts down the underlying tcp socket.
I think instead we could bound the clean shutdown handshake by a relatively short context deadline (say 1-2 seconds) after which we summarily close the socket anyway.
That, to me, argues again for using the regular Get/Put interface for this.
It turns out that we can't use get here, because it's the wrapper that has the timeUsed metadata. We could look at changing get to return the wrapper instead, or an empty one if no resource was available. WDYT?
Can we merge this in and do the refactor later if we still want it?
I think it'd be worthwhile to change get to return the possibly-nil-or-empty wrapper, and thereby DRY this up and make the idle closer a bit cleaner.
@@ -123,7 +123,7 @@ func TestOpen(t *testing.T) {
	}
	r.Close()
	p.Put(nil)
-	if count.Get() != 4 {
+	if count.Get() != 5 {
The error message below doesn't match up with this expectation...
Also, we should clarify with a comment that calling Put reopens the resource.
In staging we want to have multiple apps that share the same underlying physical database, because Aurora clusters are expensive. Most of the time they are idle, but we do want them to be able to burst every once in a while. I think this is a valid use case, but we can work around it by simply allocating fewer apps to an Aurora cluster. Unfortunately this does make Vitess much more expensive than using MySQL without Vitess, since the JDBC connection pool we're using (Hikari) does support this feature.
I didn't know about this requirement. If this is the case, we need a different approach. The reason is that we need a way to shrink the pool back to a smaller size when things go idle. Otherwise, you'll eventually run out of connections as each app bursts beyond the preallocated limit. The original conn-pool implementation used to prefer reusing available connections, but we found it was production-unfriendly because things broke unexpectedly during bursts, so we changed it to always fill the pool. It does make sense for a testing setup, though. I'm thinking we could dynamically expand or shrink the pool size by calling
We don't need to solve this now. What we have here solves our most pressing needs. Shrinking is optimizing for cost, but that's of lower priority right now.
@@ -37,6 +37,8 @@ var (
	// ErrTimeout is returned if a resource get times out.
	ErrTimeout = errors.New("resource pool timed out")

	prefillTimeout = 30 * time.Second
This may not be enough based on our testing. On a cold start, opening 600 connections takes ~38s.
So maybe make this configurable?
🤷♂️ Yet another flag. I'll begrudgingly add it.
I know this is getting to be a lot, but I just don't see how we can make a one-size-fits-all solution here.
I'd say let's go with 30 seconds for now. If it actually becomes a problem in real life then we can address it by making it configurable then.
🎉
Overall I'm satisfied with this as-is.