
Publish throttles catchup #2292

Merged
merged 3 commits into stellar:master on Oct 11, 2019

Conversation

marta-lokhova (Contributor):

Update the offline catchup flow to avoid creating a huge backlog of checkpoints to publish.

auto done = lm.getLastClosedLedgerNum() == mCheckpointRange.mLast;
auto lcl = mApp.getLedgerManager().getLastClosedLedgerNum();

// In case we just closed a ledger that unblocked publishing of the

Contributor:

why do we wait at all if we're done applying the checkpoint?

if (result && mWaitForPublish)
{
auto ledger = hm.prevCheckpointLedger(lcl);
if (ledger == lcl && hm.publishQueueLength() > 0)

Contributor:

isn't the condition on the size of the publish queue enough?

Contributor:

now that I think of it we should move this code into a precondition and use ConditionalWork to wrap ApplyCheckpointWork

Contributor Author:

> isn't the condition on the size of the publish queue enough?

The sequence of events is as follows: the last ledger in a checkpoint is closed, triggering a snapshot being queued for publish. ResolveSnapshotWork then waits for the next ledger (i.e. the first ledger of the next checkpoint) to be closed before it can proceed. So the "publish queue is not empty" condition is not enough here: we also need to close that next ledger to unblock ResolveSnapshotWork before going into the WORK_WAITING state.
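
A minimal sketch of that composite gate, restated as a standalone helper. The helper is hypothetical and not part of the PR; it only spells out the two clauses from the hunk quoted above, using the HistoryManager interface already shown there (include path assumed):

#include "history/HistoryManager.h" // assumed include path

// Hypothetical helper: should ApplyCheckpointWork pause for publishing?
static bool
shouldWaitForPublish(stellar::HistoryManager& hm, uint32_t lcl)
{
    // A snapshot was queued when the last ledger of the previous checkpoint
    // closed, so there is still something left to publish...
    bool snapshotQueued = hm.publishQueueLength() > 0;

    // ...and the ledger we just closed appears to be the one whose close
    // unblocks ResolveSnapshotWork (the first ledger of the new checkpoint),
    // which is what the quoted `prevCheckpointLedger(lcl) == lcl` check and
    // the truncated comment in the hunk above point at.
    bool publishJustUnblocked = hm.prevCheckpointLedger(lcl) == lcl;

    return snapshotQueued && publishJustUnblocked;
}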

> now that I think of it we should move this code into a precondition and use ConditionalWork to wrap ApplyCheckpointWork

Right, I think initially this was a bit tricky to do, since I did not allow any drift between catchup and publish. But with the min/max queue sizes to trigger waiting/applying that you suggested in the other comment, this should be cleaner to do with ConditionalWork: since we would be waiting for a checkpoint far enough in the past to complete publishing, I think it would not hit the ResolveSnapshotWork problem I mentioned above.
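
As a rough illustration of that direction (the lambda and constant names are assumptions, and the exact ConditionFn signature that ConditionalWork expects is not reproduced here), the precondition boils down to "the publish backlog has drained enough":

// Sketch only: the condition under which a ConditionalWork wrapper would let
// ApplyCheckpointWork run. kPublishQueueUnblockApply is a hypothetical
// constant; see the threshold discussion further down.
auto& hm = mApp.getHistoryManager();
auto publishDrainedEnough = [&hm]() {
    // Apply the next checkpoint only once checkpoints queued far enough in
    // the past have finished publishing, i.e. the backlog is small again.
    return hm.publishQueueLength() <= kPublishQueueUnblockApply;
};

Because the checkpoints being published are strictly older than the one about to be applied, this condition should not run into the ResolveSnapshotWork ordering issue described above.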

@@ -56,14 +56,16 @@ class ApplyCheckpointWork : public BasicWork
medida::Meter& mApplyLedgerFailure;

bool mFilesOpen{false};
bool const mWaitForPublish;
std::unique_ptr<VirtualTimer> mPublishTimer;

Contributor:

technically it's really mWaitForPublishQueueTimer

@@ -137,6 +140,13 @@ ApplyCheckpointWork::applyHistoryOfSingleLedger()
return false;
}

if (header.ledgerSeq > mLedgerRange.mLast)

Contributor:

unless you have a bug, this should never happen

Contributor:

Oh I see: that's because you're calling applyHistoryOfSingleLedger when coming back from a wait for publish even though you may be done (see my other comment)

auto& hm = mApp.getHistoryManager();
if (mPublishTimer)
{
if (hm.publishQueueLength() > 0)

Contributor:

you should use constants instead of 0 in both locations, and take the opportunity to define the two thresholds (0 is too aggressive, I think):

  • increase the size of the queue needed to trigger a wait - let's say 32
  • define the size of the queue to unblock applying ledgers - let's say 16 for now
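
Those two thresholds could be expressed as named constants; a minimal sketch with assumed names (the values are the ones proposed above):

#include <cstddef>

// Backlog size at which catchup stops applying and waits for publishing.
constexpr std::size_t kPublishQueueMaxSize = 32;
// Backlog size at which applying ledgers is unblocked again.
constexpr std::size_t kPublishQueueUnblockApply = 16;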

Contributor Author:

Seems like the way ConditionalWork is implemented, it would require changing state inside the lambda to know whether to wait or apply when the publish queue size is between 16 and 32 (which I think is quite error-prone and also doesn't comply with the ConditionFn contract). I added the lower boundary of 16 for now to preserve the const-ness of the lambda; what do you think?
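
A sketch of the stateless variant described here, reusing the assumed names from the sketches above; with a single threshold and a const lambda there is no hysteresis, so any backlog above the lower bound simply waits:

// Const, stateless condition: apply only while the backlog is at or below
// the lower threshold; otherwise report "not ready" and keep waiting.
auto condition = [&hm]() {
    return hm.publishQueueLength() <= kPublishQueueUnblockApply;
};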

Contributor:

What do you mean by changing state inside the lambda, @marta-lokhova? You mean the conditional is stateful? I don't think that is a problem: we do this all the time when we capture a pointer to a class.
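
For completeness, a sketch of the stateful variant being suggested: the lambda captures a small shared state object (analogous to capturing a pointer to the owning class) and implements the 16/32 hysteresis. All names are illustrative, not from the PR:

#include <memory>

struct PublishGateState
{
    bool mDraining{false}; // true while waiting for the backlog to drain
};

auto state = std::make_shared<PublishGateState>();
auto condition = [&hm, state]() {
    auto len = hm.publishQueueLength();
    if (len >= kPublishQueueMaxSize)
    {
        state->mDraining = true; // backlog too large: stop applying
    }
    else if (len <= kPublishQueueUnblockApply)
    {
        state->mDraining = false; // drained enough: resume applying
    }
    // Between the two thresholds, keep doing whatever we were doing last.
    return !state->mDraining;
};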

@MonsieurNicolas added this to In progress in v12.1.0 via automation on Oct 10, 2019

@MonsieurNicolas (Contributor):

r+ 4cec6af

latobarita added a commit that referenced this pull request Oct 11, 2019
Publish throttles catchup

Reviewed-by: MonsieurNicolas
@latobarita merged commit 4cec6af into stellar:master on Oct 11, 2019
v12.1.0 automation moved this from In progress to Done Oct 11, 2019