support/historyarchive: Provide mechanism for retrying XDR stream errors #1899

Merged: 6 commits into stellar:release-horizon-v0.24.0 on Nov 25, 2019

Conversation

@tamirms (Contributor) commented Nov 1, 2019:

PR Checklist

PR Structure

  • This PR has reasonably narrow scope (if not, break it down into smaller PRs).
  • This PR avoids mixing refactoring changes with feature changes (split into two PRs
    otherwise).
  • This PR's title starts with the name of the package that is most changed in the
    PR, e.g. services/friendbot, or all or doc if the changes are broad or impact
    many packages.

Thoroughness

  • This PR adds tests for the most critical parts of the new functionality or fixes.
  • I've updated any docs (developer docs, .md files, etc.) affected by this change.
    Take a look in the docs folder for a given service.

Release planning

  • I've updated the relevant CHANGELOG (services/horizon/CHANGELOG.md for Horizon) if
    needed with deprecations, added features, breaking changes, and DB schema changes.
  • I've decided if this PR requires a new major/minor version according to
    semver, or if it's mainly a patch change. The PR is targeted at the next
    release branch if it's not a patch change.

What

If a SingleLedgerStateReader fails to read a single bucket entry, the whole reading process is aborted.
This PR provides a retry mechanism that creates new XDR streams when XDR stream errors are encountered.

Close #1649

Why

The XdrStream.ReadOne method returns this error from time to time:

stream error: stream ID 311; PROTOCOL_ERROR

This happens when the stream connection times out. To make ingestion pipelines more robust, we should provide a way to retry when encountering these errors.
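
This error string appears to come from an HTTP/2 stream error raised by the transport while the response body is being read. As a rough sketch of classifying it (the actual check in this PR may differ; the standard library's bundled HTTP/2 transport does not export its error type, so matching on the message is one option):

```go
package ingestio

import "strings"

// isStreamError reports whether err looks like the transient HTTP/2 stream
// error shown above. A sketch only, not the check that landed in this PR.
func isStreamError(err error) bool {
	return err != nil && strings.Contains(err.Error(), "stream error:")
}
```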

Known limitations

Assume we have an XDR stream for a very large bucket and we encounter a transient error halfway through. Unfortunately, when we create the new XDR stream, we need to consume half of the bucket file again so that we resume from the same point of failure. This means we waste time and bandwidth re-downloading the part of the bucket file we have already successfully processed. We cannot avoid re-downloading because the history archive buckets are GZIP compressed, and it is not possible to seek to a random position in a gzip-compressed file (see https://stackoverflow.com/questions/429987/compression-formats-with-good-support-for-random-access-within-archives).
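
In Go terms, the retry loop ends up shaped roughly like the sketch below. Everything here (open, maxRetries, using io.Copy as a stand-in for the XdrStream.ReadOne loop) is illustrative, not the PR's actual code:

```go
package ingestio

import (
	"fmt"
	"io"
)

// readBucketWithRetry reopens the bucket stream after a transient stream
// error and discards the prefix that was already processed before resuming.
// Because buckets are gzip-compressed, that prefix must be re-downloaded and
// re-decompressed on every retry.
func readBucketWithRetry(
	open func() (io.ReadCloser, error),
	isStreamError func(error) bool,
	maxRetries int,
) error {
	var processed int64 // bytes successfully handled so far
	for attempt := 0; ; attempt++ {
		stream, err := open()
		if err != nil {
			return err
		}
		// Skip everything already consumed on a previous attempt.
		if _, err := io.CopyN(io.Discard, stream, processed); err != nil {
			stream.Close()
			return err
		}
		// Stand-in for the real loop of XdrStream.ReadOne calls.
		n, err := io.Copy(io.Discard, stream)
		processed += n
		stream.Close()
		if err == nil {
			return nil // reached the end of the bucket
		}
		if !isStreamError(err) || attempt >= maxRetries {
			return fmt.Errorf("error reading bucket after %d attempts: %w", attempt+1, err)
		}
	}
}
```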

If it were possible to seek to any position within a gzip file, we would be able to start the new XDR stream from the point of failure, without any wasted bandwidth, using an HTTP range request (see https://developer.mozilla.org/en-US/docs/Web/HTTP/Range_requests).
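
For comparison, the range request that gzip rules out would be only a few lines; bucketURL and offset here are hypothetical names:

```go
package ingestio

import (
	"fmt"
	"net/http"
)

// resumeFrom shows the HTTP range request that would let us restart at the
// failure point if bucket files were seekable. With gzip-compressed buckets
// this is not usable, as explained above.
func resumeFrom(bucketURL string, offset int64) (*http.Response, error) {
	req, err := http.NewRequest("GET", bucketURL, nil)
	if err != nil {
		return nil, err
	}
	// Ask the server for only the bytes from the failure point onward.
	req.Header.Set("Range", fmt.Sprintf("bytes=%d-", offset))
	return http.DefaultClient.Do(req)
}
```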

@cla-bot added the cla: yes label on Nov 1, 2019.
@tamirms force-pushed the retry-stream-err branch 11 times, most recently from 4ab8df8 to 1df1212, on November 1, 2019.
@leighmcculloch (Member) left a comment:

The inability to seek is unfortunate, but this looks good, and supporting retries sounds like a good idea 👏.

Review threads on exp/ingest/io/single_ledger_state_reader.go, support/historyarchive/xdrstream.go (two threads), and .circleci/config.yml were resolved.
@tamirms changed the base branch from release-horizon-v0.23.0 to release-horizon-v0.24.0 on November 13, 2019.
"github.com/stellar/go/xdr"
)

var log = logpkg.DefaultLogger.WithField("service", "expingest")
@bartekn (Contributor) commented:

I think we shouldn't use a logger here because it can interfere with whatever log package the user of the ingest package has chosen. That's why we added #1510. I think that returning an error after maxStreamRetries is enough.

@tamirms (Contributor, Author) commented Nov 21, 2019:

I can change it to log = logpkg.DefaultLogger.WithField("component", "SingleLedgerStateReader").

I would like to keep the log statements because it could be useful to know which types of errors cause the stream to disconnect. Also, we only log when we have to retry the XDR stream, and that condition shouldn't arise often, so I don't think it will end up bloating our log stream.

@bartekn (Contributor) commented:

It's not about our log stream but about the future users of exp/ingest in their apps. Personally, I wouldn't want any of the packages in my app to log anything to the output without my consent. @leighmcculloch, what's your opinion on this?

@leighmcculloch (Member) commented:

Does Go have an equivalent of SLF4J? I don't think logging from a library is unreasonable, as long as it's not imposing a logger on the user.

@bartekn (Contributor) commented:

If we want to keep logging, I think the best way would be to pass in an object that implements a Logger interface with Info, Error, etc. methods. Unfortunately, there is no standard library interface like this (see golang/go#13182).
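
A minimal sketch of that idea; the method set is hypothetical, and no such interface exists in the repo:

```go
package ingestio

// Logger is the kind of caller-supplied interface being proposed: the
// library would log only through whatever implementation the user passes in,
// and stay silent when none is provided.
type Logger interface {
	Info(args ...interface{})
	Error(args ...interface{})
}
```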

On this diff hunk:

        break
    }

    stream.Close()
@bartekn (Contributor) commented:

I think we should handle the err here. It can lead to an undetected memory/fd leak otherwise.
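
A sketch of the kind of handling being suggested (wrapping with fmt.Errorf here; the code that actually landed may report it differently):

```go
package ingestio

import (
	"fmt"
	"io"
)

// closeStream surfaces the Close error instead of silently discarding it,
// so that leaked connections or file descriptors become visible.
func closeStream(stream io.Closer) error {
	if err := stream.Close(); err != nil {
		return fmt.Errorf("error closing XDR stream: %w", err)
	}
	return nil
}
```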

Further review threads on exp/ingest/io/single_ledger_state_reader.go (two threads) and exp/ingest/main.go were resolved.
tamirms and others added 2 commits on November 21, 2019 (Co-Authored-By: Bartek Nowotarski <bartek@nowotarski.info>).
@tamirms (Contributor, Author) commented Nov 21, 2019:

@bartekn could you take another look? Thanks!

@bartekn (Contributor) left a comment:

LGTM except for the logging, but I don't want to block merging because of that. Great that we finally have retry code; fewer error notifications expected 😄.

@tamirms merged commit 423a117 into stellar:release-horizon-v0.24.0 on Nov 25, 2019.