Describe the bug
The current worker implementation will only grab a lease for a single shard per shard sync interval. This seems like a bug. Based on the commit message (e2a945d), the intent was to "prevent on host tak[ing] more shard[s] than it's configuration allowed". However, the result is that only a single shard is leased per interval. This causes startup times to balloon as more shards are introduced: a cold-started worker needs roughly one shard sync interval per shard, so a 16-shard stream with a 10-second sync interval takes over two and a half minutes before every shard is being processed.
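For context, here is a condensed sketch of what the current loop effectively does (paraphrased from the linked code, not a verbatim copy; startShardConsumer is a hypothetical stand-in for the real goroutine spin-up):
for _, shard := range w.shardStatus {
	// ...skip shards we already own, closed shards, and so on...

	if err := w.checkpointer.GetLease(shard, w.workerID); err != nil {
		continue // could not lease this shard; try the next one
	}
	startShardConsumer(shard)

	// The unconditional break below is the problem: the loop exits after
	// the first successful lease, so each shard sync acquires at most one
	// shard no matter what MaxLeasesForWorker allows.
	break
}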
Reproduction steps
A basic worker on a stream with multiple shards exhibits this behavior:
package main

import (
	"fmt"
	"os"
	"os/signal"

	"github.com/vmware/vmware-go-kcl-v2/clientlibrary/config"
	"github.com/vmware/vmware-go-kcl-v2/clientlibrary/interfaces"
	"github.com/vmware/vmware-go-kcl-v2/clientlibrary/worker"
)

type RecordProcessor struct{}

type RecordProcessorFactory struct{}

func (f *RecordProcessorFactory) CreateProcessor() interfaces.IRecordProcessor {
	return &RecordProcessor{}
}

func (p *RecordProcessor) ProcessRecords(input *interfaces.ProcessRecordsInput) {}

func (p *RecordProcessor) Initialize(input *interfaces.InitializationInput) {}

func (p *RecordProcessor) Shutdown(input *interfaces.ShutdownInput) {}

func main() {
	// Separately, I have no idea why, but the library seems incapable of figuring out the
	// Kinesis service endpoint on its own. Not specifying it manually results in errors
	// where it seemingly is trying to use an empty string as a service endpoint, but that's
	// probably a problem for a separate issue.
	cfg := config.NewKinesisClientLibConfig("test", "caleb-testing", "us-east-2", "worker")
	cfg.KinesisEndpoint = "https://kinesis.us-east-2.amazonaws.com"

	kcl := worker.NewWorker(&RecordProcessorFactory{}, cfg)
	if err := kcl.Start(); err != nil {
		fmt.Printf("[!] failed to start kcl worker: %v\n", err)
		return
	}
	defer kcl.Shutdown()

	// Block until interrupted.
	signals := make(chan os.Signal, 1)
	signal.Notify(signals, os.Interrupt, os.Kill)
	<-signals
}
Expected behavior
A worker should lease as many shards as it can, up to MaxLeasesForWorker, on every shard sync.
Additional context
I believe the solution is to refactor the lease loop (ref) to look something like this:
// max number of leases has not been reached yet
for _, shard := range w.shardStatus {
	// Don't take out more leases than allowed
	if counter >= w.kclConfig.MaxLeasesForWorker {
		break
	}

	// already owner of the shard
	if shard.GetLeaseOwner() == w.workerID {
		continue
	}

	err := w.checkpointer.FetchCheckpoint(shard)
	if err != nil {
		// a checkpoint not existing yet is not an error condition
		if err != chk.ErrSequenceIDNotFound {
			log.Warnf("Couldn't fetch checkpoint: %+v", err)
			// move on to next shard
			continue
		}
	}

	// The shard is closed and we have processed all records
	if shard.GetCheckpoint() == chk.ShardEnd {
		continue
	}

	var stealShard bool
	if w.kclConfig.EnableLeaseStealing && shard.ClaimRequest != "" {
		upcomingStealingInterval := time.Now().UTC().Add(time.Duration(w.kclConfig.LeaseStealingIntervalMillis) * time.Millisecond)
		if shard.GetLeaseTimeout().Before(upcomingStealingInterval) && !shard.IsClaimRequestExpired(w.kclConfig) {
			if shard.ClaimRequest == w.workerID {
				stealShard = true
				log.Debugf("Stealing shard: %s", shard.ID)
			} else {
				log.Debugf("Shard being stolen: %s", shard.ID)
				continue
			}
		}
	}

	err = w.checkpointer.GetLease(shard, w.workerID)
	if err != nil {
		// cannot get lease on the shard
		if !errors.As(err, &chk.ErrLeaseNotAcquired{}) {
			log.Errorf("Cannot get lease: %+v", err)
		}
		continue
	}

	if stealShard {
		log.Debugf("Successfully stole shard: %+v", shard.ID)
		w.shardStealInProgress = false
	}

	// log metrics on lease gained
	w.mService.LeaseGained(shard.ID)

	w.waitGroup.Add(1)
	go func(shard *par.ShardStatus) {
		defer w.waitGroup.Done()
		if err := w.newShardConsumer(shard).getRecords(); err != nil {
			log.Errorf("Error in getRecords: %+v", err)
		}
	}(shard)

	// Increase the number of leases we have
	counter++
}
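The key difference from the current code is that the MaxLeasesForWorker check moves to the top of the loop and becomes the only break: a successful GetLease now falls through to the next shard instead of ending the pass. One assumption worth calling out: counter must already reflect the leases this worker holds when the loop starts. A minimal sketch of that initialization (hypothetical placement; in the real worker it would live earlier in the same sync routine):
// Hypothetical pre-loop initialization: count the shards we already
// own so the MaxLeasesForWorker cap accounts for existing leases.
counter := 0
for _, shard := range w.shardStatus {
	if shard.GetLeaseOwner() == w.workerID {
		counter++
	}
}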