Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix deadlock in txsub.System.Tick() and tx_bad_seq errors #815

Merged
merged 4 commits into from Jan 17, 2019

Conversation

bartekn
Copy link
Contributor

@bartekn bartekn commented Jan 16, 2019

This commit fixes two issues in txsub package. First, issue was a
deadlock in Tick() function:

sys.tickMutex.Lock()
if sys.tickInProgress {
  logger.Debug("ticking in progress")
  return
}
sys.tickInProgress = true
sys.tickMutex.Unlock()

When two Tick() methods were called simultanously one of them could
Lock() the mutex without calling Unlock() properly. This could
basically break the txsub system as sequence number could not be
updated.

The second issue was a problem with App initialization.
App.networkPassphrase was set to test network passphrase by default.
If txsub system was initialized before updating the network passphrase
using connected stellar-core, the system would not be able to
correctly find the transaction result in a database as calculated hash
was different than the actual hash of the transaction.

Close #305.

This commit fixes to issues in `txsub` package. First, issue was a
deadlock in `Tick()` function:

```go
sys.tickMutex.Lock()
if sys.tickInProgress {
  logger.Debug("ticking in progress")
  return
}
sys.tickInProgress = true
sys.tickMutex.Unlock()
```

When two `Tick()` methods were called simultanously one of them could
`Lock()` the mutex without calling `Unlock()` it properly. This could
basically break the `txsub` system as sequence number could not be
updated.

The second issue was a problem with `App` initialization.
`App.networkPassphrase` was set to test network passphrase by default.
If `txsub` system was initialized before updating the network passphrase
using connected `stellar-core`, the system would not be able to
correctly find the transaction result in a database as calculated hash
was different than the actual hash of the transaction.
@bartekn bartekn added this to the Horizon patch releases milestone Jan 16, 2019
Copy link
Contributor

@tomquisel tomquisel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like a great fix to me, Bartek! I had a few questions, I know this is a rush fix, so feel free to ignore suggestions in the interest of getting it out the door.

log.Error("Network passhprase of stellar-core does not match Horizon configuration. Exiting...")
os.Exit(1)
}

a.coreVersion = resp.Info.Build
a.networkPassphrase = resp.Info.Network
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line can be removed now, right?

// Check if NetworkPassphrase is different, if so exit Horizon as it can break the
// state of the application.
if resp.Info.Network != a.networkPassphrase {
log.Error("Network passhprase of stellar-core does not match Horizon configuration. Exiting...")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

change passhprase to passphrase, and maybe improve error message to print the two passphrase values.

@@ -64,7 +64,7 @@ func NewApp(config Config) (*App, error) {

result := &App{config: config}
result.horizonVersion = app.Version()
result.networkPassphrase = build.TestNetwork.Passphrase
result.networkPassphrase = config.NetworkPassphrase
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we remove App.networkPassphrase and just use App.Config.NetworkPassphrase as the single source of truth?

@@ -178,7 +187,8 @@ func (sys *System) Tick(ctx context.Context) {
// Make sure Tick is not run concurrently
sys.tickMutex.Lock()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might accomplish this locking with two functions:

SetTickInProgress:

        sys.tickMutex.Lock()
        defer sys.tickMutex.Unlock()
	if sys.tickInProgress {
		logger.Debug("ticking in progress")
		return false
	}
	sys.tickInProgress = true
        return true

UnsetTickInProgress:

        sys.tickMutex.Lock()
        defer sys.tickMutex.Unlock()
        sys.tickInProgress = false 

and then call those from Tick (which checks the results of SetTickInProgress and exits if false, and calls UnsetTickInProgress() in defer.

I think that'd make this a little safer from this kind of bug in the future.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. The ultimate solution is to remove global ticking service completely (#613). It caused a lot of concurrency problems in the past (including this one) and can be implemented much easier using infinite loop.

if viper.GetString("network-passphrase") == "" {
stdLog.Fatal("Invalid config: network-passphrase is blank. Please specify --network-passphrase on the command line or set the NETWORK_PASSPHRASE environment variable.")
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hrm, is it a breaking change to make this required? It seems like it might be, and it'd be better to leave network-passphrase optional + default to testnet as we do currently.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, set default value to testnet passphrase but in the next minor we need to make this empty by default. Ex. if you run ingestion node without web server and when stellar-core is not running it should be explicitly set by Horizon admin (#817).

@oryband
Copy link
Contributor

oryband commented Jan 17, 2019

I've tested this branch in an environment where I can reproduce the issue and the problem is solved.

@bartekn
Copy link
Contributor Author

bartekn commented Jan 17, 2019

@oryband thanks for checking this. Would you mind checking this again in your environment (I pushed one last commit with updates).
@tomquisel all good points, fixed and added 2 comments.

@oryband
Copy link
Contributor

oryband commented Jan 17, 2019

@bartekn I tested again, and it works. Thanks for sorting this issue out! 🏆

@oryband
Copy link
Contributor

oryband commented Jan 17, 2019

@bartekn can you release a new stable version that includes this branch?

@bartekn
Copy link
Contributor Author

bartekn commented Jan 17, 2019

@oryband today.

@oryband
Copy link
Contributor

oryband commented Jan 17, 2019

WHOOHOO.

Copy link
Member

@ire-and-curses ire-and-curses left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@bartekn bartekn merged commit 1a536e3 into master Jan 17, 2019
@bartekn bartekn deleted the txsub-improvements branch January 17, 2019 18:22
Copy link
Contributor

@howardtw howardtw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor suggestion: maybe it's worth breaking PRs fixing two issues into two smaller ones in the future? e.g. one for the tick, one for the passphrase

@@ -170,25 +179,40 @@ func (sys *System) submitOnce(ctx context.Context, env string) SubmissionResult
return sr
}

// setTickInProgress sets `tickInProgress` to `true` if it's not
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like we set tickInProgress to true only if it is false?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! #824

@@ -73,12 +74,20 @@ func (sys *System) Submit(ctx context.Context, env string) (result <-chan Result
// check the configured result provider for an existing result
r := sys.Results.ResultByHash(ctx, info.Hash)

if r.Err != ErrNoResults {
if r.Err == nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nothing wrong with the logic here but what is the reason that we have an Err field in r as oppose to have ResultByHash return r and err directly?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's also used as an object returned by channel.

oryband added a commit to kinecosystem/go that referenced this pull request Feb 7, 2019
will resolve conflicts in following commits

* tag 'horizon-v0.15.4':
  Horizon 0.15.4 CHANGELOG (stellar#818)
  Fix deadlock in `txsub.System.Tick()` and `tx_bad_seq` errors (stellar#815)
  Add information about test scenarios to docs (stellar#813)
  trades endpoint does not have resolution (stellar#733)
  fix links and update orderbook field names (stellar#750)
  Doc: flesh out docs for developers (stellar#804)
  Fix graceful shutdown (stellar#803)
  Log client name and version (stellar#727)
  Use go vet in CI (stellar#611)
  Fix panic when missing an argument for horizon db backfill (stellar#806)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Horizon timeout after transaction is added to ledger
5 participants