pools: option to prefill #4769

Merged
merged 6 commits on Jun 17, 2019
81 changes: 64 additions & 17 deletions go/pools/resource_pool.go
@@ -21,6 +21,7 @@ package pools
import (
"errors"
"fmt"
"sync"
"time"

"golang.org/x/net/context"
@@ -36,6 +37,8 @@ var (

// ErrTimeout is returned if a resource get times out.
ErrTimeout = errors.New("resource pool timed out")

prefillTimeout = 30 * time.Second
Collaborator: May not be enough based on our testing. On a cold start opening 600 connections takes about ~38s.

Member: So maybe make this configurable?

Contributor Author: 🤷‍♂️ Yet another flag. I'll begrudgingly add it.

Member: I know this is getting to be a lot, but I just don't see how we can make a one-size-fits-all solution here.

Collaborator: I'd say let's go with 30 seconds for now. If it actually becomes a problem in real life, then we can address it by making it configurable.

Contributor Author: 🎉
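For illustration, a minimal sketch of what a configurable prefill budget could look like if it is ever needed; the flag name is invented for this sketch, and the merged change keeps the hard-coded 30-second value:

package pools

import (
	"flag"
	"time"
)

// Hypothetical sketch only: register the prefill budget as a flag instead of
// a package-level constant. flag.Duration returns a *time.Duration, so call
// sites would dereference it, e.g. context.WithTimeout(ctx, *prefillTimeout).
var prefillTimeout = flag.Duration("pool_prefill_timeout", 30*time.Second,
	"maximum time to spend prefilling a resource pool at startup")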

)

// Factory is a function that can be used to create a resource.
@@ -77,9 +80,12 @@ type resourceWrapper struct {
// maxCap specifies the extent to which the pool can be resized
// in the future through the SetCapacity function.
// You cannot resize the pool beyond maxCap.
// If a resource is unused beyond idleTimeout, it's discarded.
// If a resource is unused beyond idleTimeout, it's replaced
// with a new one.
// An idleTimeout of 0 means that there is no timeout.
func NewResourcePool(factory Factory, capacity, maxCap int, idleTimeout time.Duration) *ResourcePool {
// A non-zero value of prefillParallelism causes the pool to be pre-filled.
// The value specifies how many resources can be opened in parallel.
func NewResourcePool(factory Factory, capacity, maxCap int, idleTimeout time.Duration, prefillParallelism int) *ResourcePool {
if capacity <= 0 || maxCap <= 0 || capacity > maxCap {
panic(errors.New("invalid/out of range capacity"))
}
@@ -94,6 +100,35 @@ func NewResourcePool(factory Factory, capacity, maxCap int, idleTimeout time.Dur
rp.resources <- resourceWrapper{}
}

ctx, cancel := context.WithTimeout(context.TODO(), prefillTimeout)
defer cancel()
if prefillParallelism != 0 {
sem := sync2.NewSemaphore(prefillParallelism, 0 /* timeout */)
var wg sync.WaitGroup
for i := 0; i < capacity; i++ {
wg.Add(1)
go func() {
defer wg.Done()
_ = sem.Acquire()
defer sem.Release()

// If context has expired, give up.
select {
case <-ctx.Done():
return
default:
}

r, err := rp.Get(ctx)
if err != nil {
return
}
rp.Put(r)
}()
}
wg.Wait()
Member: Is it always necessary to wait for all the prefill actions to complete here?

I am somewhat reticent to add a ton of optionality here, but especially given the unbounded context.TODO(), this code as written might block startup for an arbitrary amount of time, no?

Contributor Author: This is a good point. I'm not sure what the best approach would be. The problem with doing it asynchronously is that we'll get into trouble if that function hangs. Then we'll have to protect it from races with Close, etc. The other option is to add a timeout to the context.

I'm thinking we should let people try this and observe failure modes. That may give us better clarity on the best way forward. I've added log messages in pool.go so we can collect some data about this.

Member: This is the one part of this change that still worries me a bit -- as written this could potentially block startup forever, which seems like a bad idea. It would also be efficient to parallelize this with the other startup tasks.

At the same time if users really want to make sure the connection pool is prewarmed before serving queries... then that seems like the thing we should give them.

All in all though... I wonder if adding a bounded timeout of something like 30 seconds by default would be sufficient here?

Member: That seems better than waiting potentially forever.

Contributor Author: I've added a 30 second timeout for prefilling the pool. We'll see how that works out.

}

if idleTimeout != 0 {
rp.idleTimer = timer.NewTimer(idleTimeout / 10)
rp.idleTimer.Start(rp.closeIdleResources)
@@ -131,14 +166,16 @@ func (rp *ResourcePool) closeIdleResources() {
return
}

if wrapper.resource != nil && idleTimeout > 0 && time.Until(wrapper.timeUsed.Add(idleTimeout)) < 0 {
wrapper.resource.Close()
wrapper.resource = nil
rp.idleClosed.Add(1)
rp.active.Add(-1)
}
func() {
Member: Was there some reason this needed to be wrapped in a function and not just included inline like it was before?

I find the extra layers of abstraction and the defer of putting the wrapper back into the resource pool to be more confusing this way.

Contributor Author: It's necessary for panic-proofing. If Close or reopenResource panicked, this defensive code will prevent an outage. Otherwise, it would cause vttablet to lock up in many unexpected ways.

Member: Ah that explains it 👍
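To make the panic-proofing concrete, here is a tiny standalone demonstration (not pool code) of the pattern being discussed: a deferred send still runs while a panic unwinds, so the pool slot is returned no matter what the body does.

package main

import "fmt"

func main() {
	slots := make(chan string, 1)
	defer func() {
		recover()            // demo only: keep the program alive after the panic
		fmt.Println(<-slots) // prints "slot returned"
	}()
	func() {
		// Mirrors closeIdleResources: return the slot even if Close panics.
		defer func() { slots <- "slot returned" }()
		panic("Close blew up")
	}()
}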

defer func() { rp.resources <- wrapper }()

if wrapper.resource != nil && idleTimeout > 0 && time.Until(wrapper.timeUsed.Add(idleTimeout)) < 0 {
wrapper.resource.Close()
rp.idleClosed.Add(1)
rp.reopenResource(&wrapper)
Member: I wonder if a cleaner implementation that would obviate the need for the manually refactored reopenResource would be for the idle closer to actually just call get and put internally.

This would require actually keeping the wait parameter that you removed, but it seems like changing this to call get to obtain the resource, then closing it, then calling put to return it back means that we only would need to do the reopen in one place.

Contributor Author: I personally prefer the current approach because I've always been uncomfortable with the complexity of get. I was so happy to get rid of that wait. Now you want me to reintroduce it? :)

Member: OK -- what if we didn't reintroduce wait, but instead just called in with a short context deadline? In the usual case we will immediately get the connection, since the whole thing is only going to grab what it thinks is available.

Then we use ErrTimeout to indicate that we've gone through all we need to, instead of the "stop early" case above.

Member: The main thing I don't like about this approach (both before and after your change) is that it duplicates a lot of the logic related to the stats and reopening the resource, etc.

If we use the existing get()/put() interface, then closeIdleResources just operates like any other client to the pool -- it grabs a connection, quickly closes it, then relies on put() to reopen it, and all the stats management is done in get/put.

Member: (I actually wish I had done that originally tbh)

Member: Another related point which just came up in an internal discussion -- it would be nice to have vitess implement a clean mysql shutdown flow for the mysql protocol connections.

Currently our logs are filled up with things like:
2019-05-28T16:11:52.895744-08:00 163 [Note] Aborted connection 163 to db: 'vt_byuser' user: 'vt_app_a' host: 'localhost' (Got an error reading communication packets)

That's because Close() simply shuts down the underlying tcp socket.

I think instead we could bound the clean shutdown handshake by a relatively short context deadline (say 1-2 seconds) after which we summarily close the socket anyway.

That, to me, biases again for using the regular Get/Put interface for this.
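A rough sketch of that bounded clean-shutdown idea, independent of the pool code; closeGracefully stands in for a real COM_QUIT-style handshake and is an assumption, not an existing vitess API:

package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// closeWithDeadline attempts a graceful close but never waits longer than the
// deadline; if the handshake is slow or fails, it falls back to a hard close.
func closeWithDeadline(closeGracefully func(context.Context) error, hardClose func()) {
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()
	if err := closeGracefully(ctx); err != nil {
		hardClose() // deadline exceeded or handshake failed: drop the socket anyway
	}
}

func main() {
	slowHandshake := func(ctx context.Context) error {
		select {
		case <-time.After(5 * time.Second): // pretend the server never answers
			return nil
		case <-ctx.Done():
			return errors.New("clean shutdown timed out")
		}
	}
	closeWithDeadline(slowHandshake, func() { fmt.Println("hard close") })
}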

Contributor Author: It turns out that we can't use get here because it's the wrapper that has the timeUsed metadata. We could look at changing get to return the wrapper instead, or an empty one if no resource was available. WDYT?

Collaborator: Can we merge this in and do the refactor later if we still want it?

Member: I think it'd be worthwhile to change get to return the possibly-nil-or-empty wrapper and thereby DRY this up and make the idle closer a bit cleaner.
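A self-contained toy of the shape being discussed (simplified stand-in types, not the merged code): get hands back the wrapper itself so a caller such as the idle closer can inspect timeUsed, and an empty wrapper signals that nothing was available.

package main

import (
	"fmt"
	"time"
)

// wrapper mirrors resourceWrapper in spirit: the pool tracks timeUsed here.
type wrapper struct {
	timeUsed time.Time
}

type pool struct{ resources chan wrapper }

// getWrapper is the hypothetical variant of get: non-blocking, and it returns
// the wrapper (possibly empty) instead of just the bare resource.
func (p *pool) getWrapper() (wrapper, bool) {
	select {
	case w := <-p.resources:
		return w, true
	default:
		return wrapper{}, false // nothing available right now
	}
}

func main() {
	p := &pool{resources: make(chan wrapper, 1)}
	p.resources <- wrapper{timeUsed: time.Now().Add(-time.Minute)}
	if w, ok := p.getWrapper(); ok {
		fmt.Println("idle for:", time.Since(w.timeUsed))
		p.resources <- w // put it back
	}
}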

}
}()

rp.resources <- wrapper
}
}

@@ -153,10 +190,10 @@ func (rp *ResourcePool) Get(ctx context.Context) (resource Resource, err error)
span.Annotate("available", rp.available.Get())
span.Annotate("active", rp.active.Get())
defer span.Finish()
return rp.get(ctx, true)
return rp.get(ctx)
}

func (rp *ResourcePool) get(ctx context.Context, wait bool) (resource Resource, err error) {
func (rp *ResourcePool) get(ctx context.Context) (resource Resource, err error) {
// If ctx has already expired, avoid racing with rp's resource channel.
select {
case <-ctx.Done():
@@ -170,9 +207,6 @@ func (rp *ResourcePool) get(ctx context.Context, wait bool) (resource Resource,
select {
case wrapper, ok = <-rp.resources:
default:
if !wait {
Collaborator: Was this intended? Or is it not used anywhere?

Collaborator: I answered my own question. It was always true.

return nil, nil
}
startTime := time.Now()
select {
case wrapper, ok = <-rp.resources:
@@ -204,13 +238,16 @@ func (rp *ResourcePool) get(ctx context.Context, wait bool) (resource Resource,
// Put will return a resource to the pool. For every successful Get,
// a corresponding Put is required. If you no longer need a resource,
// you will need to call Put(nil) instead of returning the closed resource.
// The will eventually cause a new resource to be created in its place.
// This will cause a new resource to be created in its place.
func (rp *ResourcePool) Put(resource Resource) {
var wrapper resourceWrapper
if resource != nil {
wrapper = resourceWrapper{resource, time.Now()}
wrapper = resourceWrapper{
resource: resource,
timeUsed: time.Now(),
}
} else {
rp.active.Add(-1)
rp.reopenResource(&wrapper)
}
select {
case rp.resources <- wrapper:
@@ -221,6 +258,16 @@ func (rp *ResourcePool) Put(resource Resource) {
rp.available.Add(1)
}

func (rp *ResourcePool) reopenResource(wrapper *resourceWrapper) {
if r, err := rp.factory(); err == nil {
wrapper.resource = r
wrapper.timeUsed = time.Now()
} else {
wrapper.resource = nil
rp.active.Add(-1)
}
}

// SetCapacity changes the capacity of the pool.
// You can use it to shrink or expand, but not beyond
// the max capacity. If the change requires the pool
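For reference, a minimal sketch of how a caller would opt in to prefilling with the new signature; the import path is the usual vitess.io/vitess/go/pools, and the fake resource and factory below exist only for illustration:

package main

import (
	"time"

	"golang.org/x/net/context"

	"vitess.io/vitess/go/pools"
)

// fakeConn is a stand-in for a real Resource such as a DB connection.
type fakeConn struct{}

func (fakeConn) Close() {}

func main() {
	factory := func() (pools.Resource, error) { return fakeConn{}, nil }

	// capacity=100, maxCap=100, idleTimeout=1m, prefillParallelism=10:
	// the constructor opens resources ten at a time until the pool is full,
	// bounded by the 30-second prefill timeout.
	p := pools.NewResourcePool(factory, 100, 100, time.Minute, 10)
	defer p.Close()

	// Normal usage is unchanged: Get/Put, and Put(nil) to replace a broken resource.
	r, err := p.Get(context.Background())
	if err != nil {
		panic(err)
	}
	p.Put(r)
}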
116 changes: 97 additions & 19 deletions go/pools/resource_pool_test.go
@@ -57,7 +57,7 @@ func TestOpen(t *testing.T) {
ctx := context.Background()
lastID.Set(0)
count.Set(0)
p := NewResourcePool(PoolFactory, 6, 6, time.Second)
p := NewResourcePool(PoolFactory, 6, 6, time.Second, 0)
p.SetCapacity(5)
var resources [10]Resource

@@ -122,9 +122,10 @@ func TestOpen(t *testing.T) {
t.Errorf("Unexpected error %v", err)
}
r.Close()
// A nil Put should cause the resource to be reopened.
p.Put(nil)
if count.Get() != 4 {
t.Errorf("Expecting 4, received %d", count.Get())
if count.Get() != 5 {
Member: The error message below doesn't match up with this expectation...
Also we should clarify with a comment that calling Put reopens the resource

t.Errorf("Expecting 5, received %d", count.Get())
}
for i := 0; i < 5; i++ {
r, err := p.Get(ctx)
@@ -194,11 +195,44 @@ func TestOpen(t *testing.T) {
}
}

func TestPrefill(t *testing.T) {
lastID.Set(0)
count.Set(0)
p := NewResourcePool(PoolFactory, 5, 5, time.Second, 1)
defer p.Close()
if p.Active() != 5 {
t.Errorf("p.Active(): %d, want 5", p.Active())
}
p = NewResourcePool(FailFactory, 5, 5, time.Second, 1)
defer p.Close()
if p.Active() != 0 {
t.Errorf("p.Active(): %d, want 0", p.Active())
}
}

func TestPrefillTimeout(t *testing.T) {
lastID.Set(0)
count.Set(0)
saveTimeout := prefillTimeout
prefillTimeout = 1 * time.Millisecond
defer func() { prefillTimeout = saveTimeout }()

start := time.Now()
p := NewResourcePool(SlowFailFactory, 5, 5, time.Second, 1)
defer p.Close()
if elapsed := time.Since(start); elapsed > 20*time.Millisecond {
t.Errorf("elapsed: %v, should be around 10ms", elapsed)
}
if p.Active() != 0 {
t.Errorf("p.Active(): %d, want 0", p.Active())
}
}

func TestShrinking(t *testing.T) {
ctx := context.Background()
lastID.Set(0)
count.Set(0)
p := NewResourcePool(PoolFactory, 5, 5, time.Second)
p := NewResourcePool(PoolFactory, 5, 5, time.Second, 0)
var resources [10]Resource
// Leave one empty slot in the pool
for i := 0; i < 4; i++ {
@@ -337,7 +371,7 @@ func TestClosing(t *testing.T) {
ctx := context.Background()
lastID.Set(0)
count.Set(0)
p := NewResourcePool(PoolFactory, 5, 5, time.Second)
p := NewResourcePool(PoolFactory, 5, 5, time.Second, 0)
var resources [10]Resource
for i := 0; i < 5; i++ {
r, err := p.Get(ctx)
@@ -391,7 +425,7 @@ func TestIdleTimeout(t *testing.T) {
ctx := context.Background()
lastID.Set(0)
count.Set(0)
p := NewResourcePool(PoolFactory, 1, 1, 10*time.Millisecond)
p := NewResourcePool(PoolFactory, 1, 1, 10*time.Millisecond, 0)
defer p.Close()

r, err := p.Get(ctx)
@@ -414,10 +448,10 @@ func TestIdleTimeout(t *testing.T) {
if p.IdleClosed() != 0 {
t.Errorf("Expecting 0, received %d", p.IdleClosed())
}
time.Sleep(20 * time.Millisecond)
time.Sleep(15 * time.Millisecond)

if count.Get() != 0 {
t.Errorf("Expecting 0, received %d", count.Get())
if count.Get() != 1 {
t.Errorf("Expecting 1, received %d", count.Get())
}
if p.IdleClosed() != 1 {
t.Errorf("Expecting 1, received %d", p.IdleClosed())
@@ -438,7 +472,7 @@ func TestIdleTimeout(t *testing.T) {

// sleep to let the idle closer run while all resources are in use
// then make sure things are still as we expect
time.Sleep(20 * time.Millisecond)
time.Sleep(15 * time.Millisecond)
if lastID.Get() != 2 {
t.Errorf("Expecting 2, received %d", count.Get())
}
@@ -468,7 +502,7 @@ func TestIdleTimeout(t *testing.T) {
p.SetIdleTimeout(1000 * time.Millisecond)
p.Put(r)

time.Sleep(20 * time.Millisecond)
time.Sleep(15 * time.Millisecond)
if lastID.Get() != 2 {
t.Errorf("Expecting 2, received %d", count.Get())
}
@@ -479,24 +513,51 @@ func TestIdleTimeout(t *testing.T) {
t.Errorf("Expecting 1, received %d", p.IdleClosed())
}

// Get and Put to refresh timeUsed
r, err = p.Get(ctx)
if err != nil {
t.Errorf("Unexpected error %v", err)
}
p.Put(r)
p.SetIdleTimeout(10 * time.Millisecond)
time.Sleep(20 * time.Millisecond)
if lastID.Get() != 2 {
t.Errorf("Expecting 2, received %d", count.Get())
time.Sleep(15 * time.Millisecond)
if lastID.Get() != 3 {
t.Errorf("Expecting 3, received %d", lastID.Get())
}
if count.Get() != 0 {
if count.Get() != 1 {
t.Errorf("Expecting 1, received %d", count.Get())
}
if p.IdleClosed() != 2 {
t.Errorf("Expecting 2, received %d", p.IdleClosed())
}
}

func TestIdleTimeoutCreateFail(t *testing.T) {
ctx := context.Background()
lastID.Set(0)
count.Set(0)
p := NewResourcePool(PoolFactory, 1, 1, 10*time.Millisecond, 0)
defer p.Close()
r, err := p.Get(ctx)
if err != nil {
t.Fatal(err)
}
// Change the factory before putting back
// to prevent race with the idle closer, who will
// try to use it.
p.factory = FailFactory
p.Put(r)
time.Sleep(15 * time.Millisecond)
if p.Active() != 0 {
t.Errorf("p.Active(): %d, want 0", p.Active())
}
}

func TestCreateFail(t *testing.T) {
ctx := context.Background()
lastID.Set(0)
count.Set(0)
p := NewResourcePool(FailFactory, 5, 5, time.Second)
p := NewResourcePool(FailFactory, 5, 5, time.Second, 0)
defer p.Close()
if _, err := p.Get(ctx); err.Error() != "Failed" {
t.Errorf("Expecting Failed, received %v", err)
@@ -508,11 +569,28 @@ func TestCreateFail(t *testing.T) {
}
}

func TestCreateFailOnPut(t *testing.T) {
ctx := context.Background()
lastID.Set(0)
count.Set(0)
p := NewResourcePool(PoolFactory, 5, 5, time.Second, 0)
defer p.Close()
_, err := p.Get(ctx)
if err != nil {
t.Fatal(err)
}
p.factory = FailFactory
p.Put(nil)
if p.Active() != 0 {
t.Errorf("p.Active(): %d, want 0", p.Active())
}
}

func TestSlowCreateFail(t *testing.T) {
ctx := context.Background()
lastID.Set(0)
count.Set(0)
p := NewResourcePool(SlowFailFactory, 2, 2, time.Second)
p := NewResourcePool(SlowFailFactory, 2, 2, time.Second, 0)
defer p.Close()
ch := make(chan bool)
// The third Get should not wait indefinitely
@@ -534,7 +612,7 @@ func TestTimeout(t *testing.T) {
ctx := context.Background()
lastID.Set(0)
count.Set(0)
p := NewResourcePool(PoolFactory, 1, 1, time.Second)
p := NewResourcePool(PoolFactory, 1, 1, time.Second, 0)
defer p.Close()
r, err := p.Get(ctx)
if err != nil {
@@ -553,7 +631,7 @@ func TestTimeout(t *testing.T) {
func TestExpired(t *testing.T) {
lastID.Set(0)
count.Set(0)
p := NewResourcePool(PoolFactory, 1, 1, time.Second)
p := NewResourcePool(PoolFactory, 1, 1, time.Second, 0)
defer p.Close()
ctx, cancel := context.WithDeadline(context.Background(), time.Now().Add(-1*time.Second))
r, err := p.Get(ctx)
4 changes: 2 additions & 2 deletions go/vt/dbconnpool/connection_pool.go
@@ -145,7 +145,7 @@ func (cp *ConnectionPool) Open(info *mysql.ConnParams, mysqlStats *stats.Timings
defer cp.mu.Unlock()
cp.info = info
cp.mysqlStats = mysqlStats
cp.connections = pools.NewResourcePool(cp.connect, cp.capacity, cp.capacity, cp.idleTimeout)
cp.connections = pools.NewResourcePool(cp.connect, cp.capacity, cp.capacity, cp.idleTimeout, 0)
// Check if we need to resolve a hostname (The Host is not just an IP address).
if cp.resolutionFrequency > 0 && net.ParseIP(info.Host) == nil {
cp.hostIsNotIP = true
@@ -156,7 +156,7 @@ func (cp *ConnectionPool) Open(info *mysql.ConnParams, mysqlStats *stats.Timings
defer cp.wg.Done()
for {
select {
case _ = <-cp.ticker.C:
case <-cp.ticker.C:
cp.refreshdns()
case <-cp.stop:
return