Improve checkpoint publication scheduling#940
Conversation
|
I'm wondering if we can do this with no visible API changes... Inside each of the publishCheckpoint routines, rather than polling frequently, we could instead do a better job of scheduling attempts so they have more chance to succeed; e.g. when we check when the last CP was published and discover that it's too new, we now know exactly what the interval must be before the next one can be published. Would something along the lines of the sketch below work? const retry := 200*time.Milliseconds // whatever
t := time.NewTicker(retry)
for {
select {
case <-ctx.Done():
// whatever
case <-t.C:
}
checkpointAge := ...
if !shouldPublish {
t.Reset(minInterval-checkpointAge)
continue
}
if err := publishCheckpoint(); err != nil {
t.Reset(retry)
continue
}
t.Reset(minInterval)
} |
|
Ah yes 100%, that was my original design but I had understood you didn't want do to that, sorry. I'll get back to it. |
Ok, well great! |
|
I've done something along those lines, have a look. Instead of fetching the last checkpoint update time in I have not implemented a custom retry interval, I think we can just default back to the user provided value, it keeps things simpler. I also took that as an opportunity to decrease |
| nextPublication := pubInterval - time.Since(publishedAt) | ||
| if nextPublication <= 0 { | ||
| t.Reset(time.Millisecond) // Schedule a checkpoint update immediately. | ||
| } else { | ||
| t.Reset(nextPublication) | ||
| } |
There was a problem hiding this comment.
Could the 3 instances of this be simplified to t.Reset(max(time.Millisecond, pubInterval - time.Since(publishedAt)))?
| } | ||
|
|
||
| return otel.TraceErr(ctx, "tessera.storage.gcp.publishCheckpoint", tracer, func(ctx context.Context, span trace.Span) error { | ||
| if err := otel.TraceErr(ctx, "tessera.storage.gcp.publishCheckpoint", tracer, func(ctx context.Context, span trace.Span) error { |
There was a problem hiding this comment.
You could change otel.TraceErr to otel.Trace2 and return (time.Time, error) from the inner func directly to reduce the readability hit from adding the outer if here. (same in other implementations)
This PR schedules checkpoint publication operations as early as possible, as per `checkpoint_interval and the last publication timestamp.