ingest: Reuse Stellar-Core on-disk DB in online mode #4471

Merged: 4 commits, Jul 26, 2022

@bartekn bartekn commented Jul 21, 2022

PR Checklist

PR Structure

  • This PR has reasonably narrow scope (if not, break it down into smaller PRs).
  • This PR avoids mixing refactoring changes with feature changes (split into two PRs
    otherwise).
  • This PR's title starts with name of package that is most changed in the PR, ex.
    services/friendbot, or all or doc if the changes are broad or impact many
    packages.

Thoroughness

  • This PR adds tests for the most critical parts of the new functionality or fixes.
  • I've updated any docs (developer docs, .md
    files, etc... affected by this change). Take a look in the docs folder for a given service,
    like this one.

Release planning

  • I've updated the relevant CHANGELOG (here for Horizon) if
    needed with deprecations, added features, breaking changes, and DB schema changes.
  • I've decided if this PR requires a new major/minor version according to
    semver, or if it's mainly a patch change. The PR is targeted at the next
    release branch if it's not a patch change.

What

This commit changes the behaviour of stellarCoreRunner when using an on-disk DB in online mode: it now checks whether the existing storage dir contains a DB in a state that allows Captive Core to start without rebuilding Stellar-Core state. In short, it checks (using the stellar-core offline-info command) whether the LCL (last closed ledger) of Stellar-Core matches the requested ledger in startFrom.

Close #4454.

Why

While applying state from buckets was relatively fast in Captive Core's in-memory mode, it can be extremely slow when using disk. This change allows reusing the existing state in most cases.

Known limitations

[TODO or N/A]

@bartekn bartekn requested a review from a team July 21, 2022 15:44
@bartekn bartekn marked this pull request as ready for review July 21, 2022 15:44
if err != nil {
	r.log.Infof("Error running offline-info: %v, removing existing storage-dir contents", err)
	removeStorageDir = true
} else if uint32(info.Info.Ledger.Num) != from {
Contributor

Do you know how core maintains info.Info.Ledger.Num, i.e. does it only bump it once it knows the meta record for that sequence was read off the pipe? I'm wondering whether info.Info.Ledger.Num will tend to be further ahead than from, which represents the last sequence that Horizon read off the pipe (and serialized to history). If it drifts asynchronously from the meta pipe reader's (Horizon's) activity, then this condition won't get hit much, right? The result would be that it ends up in the same new-db/catchup routine.

Contributor Author

To the best of my knowledge, and after some experimenting, it seems that Stellar-Core only closes a ledger once it has been read from the meta pipe. This leaves us with two cases:

  • Horizon is catching up (after a restart or state build): in this case bufferedLedgerMetaReader can read ledgers from the meta pipe ahead of time, which leaves Horizon behind Stellar-Core. If Horizon is stopped while there are ledgers in the buffer, the solution in this PR will not work, because the ledger sequences will not match on restart. We can try removing bufferedLedgerMetaReader in online mode, but I'm not sure about the performance impact of that change. We can explore it in a separate PR.
  • Horizon is ingesting the latest ledgers: in this case bufferedLedgerMetaReader will contain at most one ledger, and if Horizon is shut down gracefully it will process this ledger before shutting down.
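The two cases above can be illustrated with a small simulation. It is only a sketch under simplifying assumptions: the simulate function is hypothetical, a buffered Go channel stands in for bufferedLedgerMetaReader, Stellar-Core is modelled as closing a ledger the moment it is read off the pipe, and reuse is considered possible when the LCL equals the last ledger Horizon processed.

```go
package main

import "fmt"

// simulate is a hypothetical model of the read-ahead behaviour. `read` ledgers
// are pulled off the meta pipe into a buffer, and Horizon processes
// `processed` of them before it is stopped.
func simulate(start uint32, read, processed int) (coreLCL, horizonLast uint32) {
	buffer := make(chan uint32, read)
	for i := 0; i < read; i++ {
		seq := start + uint32(i)
		buffer <- seq // the reader takes the ledger off the pipe...
		coreLCL = seq // ...so (by assumption) core has closed it
	}
	for i := 0; i < processed && len(buffer) > 0; i++ {
		horizonLast = <-buffer // Horizon ingests the oldest buffered ledger
	}
	return coreLCL, horizonLast
}

func main() {
	// Catching up: 10 ledgers read ahead, only 3 processed before the stop.
	lcl, last := simulate(100, 10, 3)
	fmt.Println(lcl, last, lcl == last)

	// Ingesting latest ledgers: one ledger read and processed before shutdown.
	lcl, last = simulate(100, 1, 1)
	fmt.Println(lcl, last, lcl == last)
}
```

In the first run the LCL is ahead of the last processed ledger, so the on-disk DB would be discarded on restart; in the second run the two match and the DB could be reused.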

Contributor

OK, that's interesting. It means there's only one ledger of data present in that pipe at any time; it sounds like the core writer blocks until the pipe is empty, which is the signal that the prior ledger was read. In any case, this at least recovers from any out-of-sync case, and the worst outcome is that it does the same as today: full removal first, then init.

return errors.Wrap(err, "error initializing core db")
// Check if on-disk core DB exists and what's the LCL there. If it's not what
// we need, remove storage dir and start from scratch.
removeStorageDir := false
Contributor

might be worthwhile to add a unit test in stellar_core_runner_test.go to assert this new outcome?

Contributor Author

Just a quick update on this: I'm working on refactoring stellarCoreRunner to allow writing better unit tests. I'll have a new commit ready by the end of today.

Contributor Author

Actually while refactoring I changed some other parts of stellarCoreRunner that seemed inconsistent. Would you mind 👍 this PR (if there is nothing else that requires changes) and I'll open another PR with refactoring and tests?

Contributor Author

@sreuland follow up PR: #4480

@sreuland sreuland left a comment

nice solution with minimal coding!

@bartekn bartekn merged commit 4850d22 into stellar:master Jul 26, 2022
@bartekn bartekn deleted the reuse-core-on-disk-db-online-mode branch July 26, 2022 10:47
sreuland pushed a commit to sreuland/go that referenced this pull request Aug 7, 2022
This commit changes the behaviour of `stellarCoreRunner` when using an on-disk
DB in online mode to check if existing storage dir contains the DB in a state
that allows Captive Core to start without rebuilding Stellar-Core state. In
short, it checks (by using `stellar-core offline-info` command) if the LCL of
Stellar-Core matches the requested ledger in `startFrom`.

This was done because while applying state from buckets was relatively fast in
memory mode of Captive Core it can be extremely slow when using disk. This
change allows reusing existing state in most cases.

Close stellar#4454.
Development

Successfully merging this pull request may close these issues.

/services/horizon/ingest: captive core on-disk ingestion, optimize catchup times
2 participants