Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve bench-tps stability #7537

Merged
merged 5 commits into from Dec 19, 2019
Merged

Conversation

@jstarry
Copy link
Member

jstarry commented Dec 17, 2019

Problem

Bench-TPS clients get blocked by RPC calls to get the recent blockhash for each new set of transactions

Summary of Changes

  • Create a new thread for polling for the latest blockhash
  • Rotate destination keys using a deque to avoid duplicate transactions
  • Create enough keypairs such that we can avoid "account in use" errors when the sleep time is less than block time or when the network is struggling to keep up.
  • Add --keypair-multiplier arg to specify how many times more keypairs we need than transactions
  • Changed balance ping-ponging logic to accommodate keypair multiplier

Notes

  • Users will need to adjust their mental models on how tx_count and thread-batch-sleep-ms map to TPS after this change. From my testing, it's a lot more intuitive now and roughly maps to:

TPS = tx_count * (1000 / thread-batch-sleep-ms)

  • More keypairs will be generated by default after this change, so more lamports need to be airdropped to the funding account than before

Links

https://metrics.solana.com:3000/d/testnet-edge/testnet-monitor-edge?orgId=2&from=1576614229516&to=1576617829516&var-datasource=Solana%20Metrics%20(read-only)&var-testnet=testnet-dev-jstarry&var-hostid=All

@jstarry jstarry force-pushed the jstarry:bench-tps-throughput branch from cdde98d to f2e71d3 Dec 17, 2019
@jstarry jstarry force-pushed the jstarry:bench-tps-throughput branch from f2e71d3 to 1bb3c2a Dec 17, 2019
jstarry added 2 commits Dec 17, 2019
@codecov

This comment has been minimized.

Copy link

codecov bot commented Dec 17, 2019

Codecov Report

Merging #7537 into master will decrease coverage by 7.7%.
The diff coverage is n/a.

@@           Coverage Diff            @@
##           master   #7537     +/-   ##
========================================
- Coverage    78.2%   70.4%   -7.8%     
========================================
  Files         244     244             
  Lines       51259   56882   +5623     
========================================
- Hits        40090   40086      -4     
- Misses      11169   16796   +5627
@jstarry jstarry marked this pull request as ready for review Dec 17, 2019
@jstarry jstarry requested review from sakridge and sagar-solana Dec 17, 2019
@sagar-solana sagar-solana requested a review from pgarg66 Dec 17, 2019
@jstarry jstarry force-pushed the jstarry:bench-tps-throughput branch from 59b3181 to 7ac7fa8 Dec 18, 2019
@jstarry jstarry added the v0.21 label Dec 18, 2019
@sakridge

This comment has been minimized.

Copy link
Member

sakridge commented Dec 18, 2019

Code looks fine, it would be nice to see that this doesn't affect TPS in all the cases that we test in. High TPS with colo, gce w/GPU, etc and what kind of improvement it brings to whatever configuration it improves.

Copy link
Contributor

pgarg66 left a comment

General question: in case get_recent_blockhash thread is waiting to get the next blockhash, would multiple rounds of generate_txs reuse the same blockhash? I see that you are rotating the dest_keypair_chunks. That would guarantee that same blockhash would not result into duplicate transactions.

if blockhash_time.elapsed().as_secs() > 30 {
panic!("Blockhash is not updating");
}
sleep(Duration::from_millis(100));

This comment has been minimized.

Copy link
@pgarg66

pgarg66 Dec 18, 2019

Contributor

Maybe, use 50ms sleep here. The slot duration in v21 was ~250ms. Now we are seeing between 400/500ms. The 50ms granularity would cover more variations in slot time.

This comment has been minimized.

Copy link
@jstarry

jstarry Dec 18, 2019

Author Member

Done

@jstarry

This comment has been minimized.

Copy link
Member Author

jstarry commented Dec 18, 2019

General question: in case get_recent_blockhash thread is waiting to get the next blockhash, would multiple rounds of generate_txs reuse the same blockhash? I see that you are rotating the dest_keypair_chunks. That would guarantee that same blockhash would not result into duplicate transactions.

Yes, exactly. Since the to account will be rotated each iteration, the signatures won't repeat.

@jstarry

This comment has been minimized.

Copy link
Member Author

jstarry commented Dec 19, 2019

@sakridge I tested on a GCE gpu enabled testnet as well as a colo GPU testnet and for the most part was only able to match our previous performance tests. So looks like there isn't really a throughput improvement with this change.

On colo, I pushed up to Highest TPS: 135729, Average TPS: 76871 with tx_count=140000

It seems like the main win from this PR is to decrease the variability in waiting for a blockhash and also cutting down on account in use errors. I think these changes make TPS runs a bit more stable, but at the very least haven't been making things worse.

@jstarry jstarry changed the title Improve and stabilize bench-tps throughput Improve bench-tps stability Dec 19, 2019
@jstarry jstarry merged commit 01f44f5 into solana-labs:master Dec 19, 2019
13 checks passed
13 checks passed
Rule: v0.21 backport (backport) Backports have been created
Details
Summary 2 rules match and 2 potential rules
Details
buildkite/solana Build #16842 passed (36 minutes, 2 seconds)
Details
buildkite/solana/bench Passed (10 minutes, 56 seconds)
Details
buildkite/solana/checks Passed (1 minute, 39 seconds)
Details
buildkite/solana/coverage Passed (10 minutes, 18 seconds)
Details
buildkite/solana/local-cluster Passed (18 minutes, 10 seconds)
Details
buildkite/solana/move Passed (5 seconds)
Details
buildkite/solana/pipeline-upload Passed (3 seconds)
Details
buildkite/solana/shellcheck Passed (30 seconds)
Details
buildkite/solana/stable Passed (32 minutes, 2 seconds)
Details
buildkite/solana/stable-perf Passed (10 minutes, 32 seconds)
Details
ci-gate Pull Request accepted for CI pipeline
@jstarry jstarry deleted the jstarry:bench-tps-throughput branch Dec 19, 2019
mergify bot pushed a commit that referenced this pull request Dec 19, 2019
* Improve bench-tps throughput

* Fix tests

* Fix more tests

* Fix move test

* Drop blockhash poll sleep interval

(cherry picked from commit 01f44f5)
solana-grimes added a commit that referenced this pull request Dec 19, 2019
automerge
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants
You can’t perform that action at this time.