Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bench-tps: flush tx queue when too old #6201

Merged

Conversation

@CriesofCarrots
Copy link
Contributor

commented Oct 1, 2019

Problem

Bench-tps stops sending transactions to a cluster after about a minute. Removing thread-batch-sleep-ms argument improves or resolves the problem.
Cluster metrics show a surge of BlockhashNotFound errors before the clusters stops receiving transactions.

Analysis:
It appears that the generate_txs() thread is outrunning the do_tx_transfers() threads, such that when transactions start satisfying this condition linked below (too old), the send transaction loop never catches up, and just continually removes old transactions from the VecDeque.

if now > tx.1 && now - tx.1 > 1000 * 30 {

Summary of Changes

  • Flush the SharedTransactions list when do_tx_transfers() reaches old transactions.
  • Adjust the "old" threshold to minimize BlockhashNotFound errors on the cluster. (This will not necessarily prevent all BlockhashNotFound errors, depending on the speed of a client and the speed of communication between the client and the cluster.)
  • Also reorganize dependencies in file

Toward #6163
Resolves bullet 2

@CriesofCarrots CriesofCarrots requested review from mvines and sagar-solana Oct 1, 2019
@CriesofCarrots CriesofCarrots force-pushed the CriesofCarrots:bench-tps-old-blockhash branch from 82327ca to e3ba4b3 Oct 1, 2019
Copy link
Contributor

left a comment

🎉

@sakridge

This comment has been minimized.

Copy link
Member

commented Oct 1, 2019

Should we throttle generation until the sending catches up?

@CriesofCarrots CriesofCarrots added the v0.19 label Oct 1, 2019
@codecov

This comment has been minimized.

Copy link

commented Oct 1, 2019

Codecov Report

Merging #6201 into master will increase coverage by <.1%.
The diff coverage is n/a.

@@           Coverage Diff            @@
##           master   #6201     +/-   ##
========================================
+ Coverage    70.5%   70.6%   +<.1%     
========================================
  Files         244     244             
  Lines       43859   43823     -36     
========================================
- Hits        30962   30940     -22     
+ Misses      12897   12883     -14
@CriesofCarrots

This comment has been minimized.

Copy link
Contributor Author

commented Oct 1, 2019

Should we throttle generation until the sending catches up?

I think that's not really necessary with this approach, but it would offer an alternate solution.
When do_tx_transfers() encounters an old transaction, it could grab the lock on SharedTransactions until either it's back into current transactions or has cleared the queue

@sakridge

This comment has been minimized.

Copy link
Member

commented Oct 1, 2019

This seems like maybe it helps with the specific hardware/--tx_count parameter, but is it robust against different configurations?

@CriesofCarrots

This comment has been minimized.

Copy link
Contributor Author

commented Oct 1, 2019

This seems like maybe it helps with the specific hardware/--tx_count parameter, but is it robust against different configurations?

Incidentally, tx_count doesn't seem to have a huge correlation with this issue, because it's the same multipler on do_tx_transfers as on generate_tx. tx_count=10 fails in the same way as tx_count=1000

@sakridge

This comment has been minimized.

Copy link
Member

commented Oct 1, 2019

I'm thinking that if we are just going to throw the messages away, we should not generate them in the first place.

@CriesofCarrots

This comment has been minimized.

Copy link
Contributor Author

commented Oct 1, 2019

Fair enough. I'll see about making that adjustment.

@CriesofCarrots CriesofCarrots changed the title Bench-tps: Bench-tps: flush tx queue when too old Oct 1, 2019
@sakridge

This comment has been minimized.

Copy link
Member

commented Oct 1, 2019

I guess if this works, then we can be fine with it for now.

@mergify mergify bot dismissed sagar-solana’s stale review Oct 1, 2019

Pull request has been modified.

@CriesofCarrots

This comment has been minimized.

Copy link
Contributor Author

commented Oct 1, 2019

I guess if this works, then we can be fine with it for now.

Good deal. The generate-transaction throttling was causing a choppy transmission that I wasn't too thrilled about.

@CriesofCarrots CriesofCarrots force-pushed the CriesofCarrots:bench-tps-old-blockhash branch from 8859f24 to d472b2c Oct 1, 2019
@CriesofCarrots CriesofCarrots merged commit 2c6599c into solana-labs:master Oct 1, 2019
12 checks passed
12 checks passed
Rule: v0.19 backport (backport) Backports have been created
Details
Summary 2 rules match and 6 potential rules
Details
buildkite/solana Build #12495 passed (49 minutes, 57 seconds)
Details
buildkite/solana/bench Passed (15 minutes, 54 seconds)
Details
buildkite/solana/checks Passed (1 minute, 38 seconds)
Details
buildkite/solana/coverage Passed (31 minutes, 16 seconds)
Details
buildkite/solana/local-cluster Passed (14 minutes, 55 seconds)
Details
buildkite/solana/pipeline-upload Passed (9 seconds)
Details
buildkite/solana/shellcheck Passed (25 seconds)
Details
buildkite/solana/stable Passed (21 minutes, 35 seconds)
Details
buildkite/solana/stable-perf Passed (9 minutes, 23 seconds)
Details
ci-gate Pull Request accepted for CI pipeline
mergify bot pushed a commit that referenced this pull request Oct 1, 2019
* Flush transaction VecDeque  when hit old transactions

* Fixup too-old threshold

(cherry picked from commit 2c6599c)
solana-grimes added a commit that referenced this pull request Oct 1, 2019
@CriesofCarrots CriesofCarrots deleted the CriesofCarrots:bench-tps-old-blockhash branch Oct 2, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants
You can’t perform that action at this time.