Publish throttles catchup #2292
Conversation
src/catchup/ApplyCheckpointWork.cpp
Outdated
auto done = lm.getLastClosedLedgerNum() == mCheckpointRange.mLast;
auto lcl = mApp.getLedgerManager().getLastClosedLedgerNum();

// In case we just closed a ledger that unblocked publishing of the
why do we wait at all if we're done applying the checkpoint?
src/catchup/ApplyCheckpointWork.cpp
Outdated
if (result && mWaitForPublish)
{
    auto ledger = hm.prevCheckpointLedger(lcl);
    if (ledger == lcl && hm.publishQueueLength() > 0)
isn't the condition on the size of the publish queue enough?
now that I think of it we should move this code into a precondition and use ConditionalWork to wrap ApplyCheckpointWork
> isn't the condition on the size of the publish queue enough?
The sequence of events is as follows: a ledger that is the last one in a checkpoint is closed, which triggers the snapshot being queued for publish. Then ResolveSnapshotWork waits for the next ledger (i.e. the first ledger of the next checkpoint) to be closed in order to proceed. So the "publish queue is not empty" condition is not enough here, as we also need to close the next ledger to unblock ResolveSnapshotWork before going into the WORK_WAITING state.
> now that I think of it we should move this code into a precondition and use ConditionalWork to wrap ApplyCheckpointWork
Right, I think initially this was a bit tricky to do, since I did not allow any drift between catchup and publish. But with the min/max queue sizes to trigger waiting/applying that you suggested in the other comment, this should be cleaner to do with ConditionalWork (since we'd be waiting for a checkpoint far enough in the past to complete publishing, I don't think it will hit the ResolveSnapshotWork problem I mentioned above).
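Roughly what that wrapping could look like (a sketch only: ConditionalWork's constructor signature, ApplyCheckpointWork's arguments, and the threshold constant are assumptions here, not the actual stellar-core API):

```cpp
// Gate ApplyCheckpointWork behind a publish-backlog precondition, so the
// scheduler only runs it while the publish queue is small enough.
auto applyWork =
    std::make_shared<ApplyCheckpointWork>(app, checkpointRange);

// Condition polled by ConditionalWork before starting/resuming the wrapped
// work; publishQueueLength() is from the diffs above, the constant is
// illustrative.
auto canApply = [&app]() {
    return app.getHistoryManager().publishQueueLength() <=
           PUBLISH_QUEUE_UNBLOCK_APPLY;
};

auto gatedApply = std::make_shared<ConditionalWork>(
    app, "apply-checkpoint-when-publish-caught-up", canApply, applyWork);
```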
src/catchup/ApplyCheckpointWork.h
Outdated
@@ -56,14 +56,16 @@ class ApplyCheckpointWork : public BasicWork
    medida::Meter& mApplyLedgerFailure;

    bool mFilesOpen{false};
    bool const mWaitForPublish;
    std::unique_ptr<VirtualTimer> mPublishTimer;
technically it's really mWaitForPublishQueueTimer
src/catchup/ApplyCheckpointWork.cpp
Outdated
@@ -137,6 +140,13 @@ ApplyCheckpointWork::applyHistoryOfSingleLedger()
        return false;
    }

    if (header.ledgerSeq > mLedgerRange.mLast)
unless you have a bug, this should never happen
Oh I see: that's because you're calling applyHistoryOfSingleLedger when coming back from a wait for publish, even though you may be done (see my other comment).
src/catchup/ApplyCheckpointWork.cpp
Outdated
auto& hm = mApp.getHistoryManager();
if (mPublishTimer)
{
    if (hm.publishQueueLength() > 0)
you should use constants instead of 0 in both locations, and take the opportunity to define the two thresholds (0 is too aggressive, I think; see the sketch after this list):
- increase the size of the queue needed to trigger a wait - let's say 32
- define the size of the queue to unblock applying ledgers - let's say 16 for now
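Sketched against the timer-based flow in the diffs above (constant names are illustrative):

```cpp
// Backlog size at which catchup stops applying and starts waiting:
constexpr uint32_t PUBLISH_QUEUE_MAX_SIZE = 32;
// Backlog size at which it is safe to resume applying ledgers:
constexpr uint32_t PUBLISH_QUEUE_UNBLOCK_APPLY = 16;

// After closing a checkpoint's last ledger: start waiting once the
// backlog is too large.
if (ledger == lcl && hm.publishQueueLength() > PUBLISH_QUEUE_MAX_SIZE)
{
    // arm mPublishTimer and go into WORK_WAITING
}

// On timer expiry: keep waiting until the backlog has drained enough.
if (hm.publishQueueLength() > PUBLISH_QUEUE_UNBLOCK_APPLY)
{
    // re-arm mPublishTimer and stay in WORK_WAITING
}
```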
Seems like the way ConditionalWork is implemented, it'd require changing state inside the lambda to know whether to wait or apply when the publish queue size is between 16 and 32 (which I think is quite error-prone and also doesn't comply with the ConditionFn contract). I added the lower boundary of 16 for now to preserve the const-ness of the lambda; what do you think?
What do you mean by changing state inside the lambda, @marta-lokhova? You mean the condition is stateful? I don't think that's a problem: we do this all the time when we capture a pointer to a class.
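For what it's worth, a minimal sketch of such a stateful condition with the 16/32 thresholds (assuming ConditionFn is roughly std::function<bool()> and passing the queue-length getter in; names and signatures here are illustrative, not stellar-core's actual API):

```cpp
#include <cstdint>
#include <functional>
#include <memory>

// Illustrative thresholds from the discussion above.
constexpr uint32_t PUBLISH_QUEUE_MAX_SIZE = 32;      // start waiting above this
constexpr uint32_t PUBLISH_QUEUE_UNBLOCK_APPLY = 16; // resume at or below this

using ConditionFn = std::function<bool()>;

// The lambda captures a shared flag, so it carries state across calls: once
// the backlog exceeds 32 it keeps answering "wait" until the backlog drains
// to 16. That gives hysteresis instead of flapping when the queue size sits
// between 16 and 32.
ConditionFn
makeApplyCondition(std::function<uint32_t()> publishQueueLength)
{
    auto throttled = std::make_shared<bool>(false);
    return [throttled, publishQueueLength]() {
        auto len = publishQueueLength();
        if (*throttled && len <= PUBLISH_QUEUE_UNBLOCK_APPLY)
        {
            *throttled = false; // backlog drained, resume applying
        }
        else if (!*throttled && len > PUBLISH_QUEUE_MAX_SIZE)
        {
            *throttled = true; // backlog too large, start waiting
        }
        return !*throttled; // true == OK to apply ledgers
    };
}
```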
r+ 4cec6af
Publish throttles catchup Reviewed-by: MonsieurNicolas
Update offline catchup flow to avoid creating huge backlog of checkpoints to publish