Loadgen not working as expected #2199

Closed
MonsieurNicolas opened this issue Jul 16, 2019 · 2 comments · Fixed by #2220

@MonsieurNicolas
Contributor

The built-in load generator used for testing has a few problems.

There is a design flaw: after txs get submitted, loadgen moves to a completely different phase, waitTillComplete, which assumes that the transactions were submitted successfully and just need to be processed by the network:

  1. a big problem there is that if nothing was actually submitted (which can happen, see below), that phase will succeed
  2. if some txs are dropped, there is no way it can recover and an observer of loadgen has no way of quantifying the amount of dropped transactions
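
A minimal sketch of why the first point bites, assuming (hypothetically) that the wait phase only checks whether anything is still pending. Both function names and their parameters are illustrative and are not taken from stellar-core:

```python
def wait_till_complete_naive(pending_txs):
    """Hypothetical check: 'complete' as soon as nothing is pending."""
    return len(pending_txs) == 0


def wait_till_complete_checked(submitted_ok, applied, requested):
    """Complete only when every requested tx was both submitted and applied."""
    return submitted_ok == requested and applied == requested


# Nothing was accepted by the validator, so nothing is pending either.
print(wait_till_complete_naive(pending_txs=[]))  # True
print(wait_till_complete_checked(submitted_ok=0, applied=0, requested=100))  # False
```

The second variant also gives an observer the numbers needed to quantify how many transactions were dropped.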

A few more details:

  1. in general, it doesn't handle transactions being dropped by the validator's queue, whatever the reason. The logic there that "just retries" is too simplistic to recover.
  2. it doesn't handle ADD_STATUS_TRY_AGAIN_LATER (banned transactions). I suspect this should be handled the same way as 1 anyway

Here are my recommendations:

  1. load generator should only succeed when work is complete. An observer (such as test automation) only needs to wait for loadgen.run.complete to be set, without having to know how to check for completion. We leave the option of "timing out" to the observer (so "stopping" loadgen should work in this situation)
  2. We already have logic that uses the list of all accounts used in the simulation to sign and generate proper sequence numbers; we can expand on this to build a process that guarantees progress instead.
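
As a rough sketch of the observer side of recommendation 1: the observer polls the completion metric and owns the timeout. How the metric is read and how loadgen is stopped are left as caller-supplied callables (`read_complete_count` and `stop_loadgen` are hypothetical names, as are the `clock` and `sleep` hooks added for testability):

```python
import time


def wait_for_loadgen(read_complete_count, timeout_s, stop_loadgen,
                     poll_s=1.0, clock=time.monotonic, sleep=time.sleep):
    """Return True once the completion counter is set; otherwise stop
    loadgen after timeout_s seconds and return False."""
    deadline = clock() + timeout_s
    while clock() < deadline:
        if read_complete_count() > 0:
            return True  # loadgen reported that all work is complete
        sleep(poll_s)
    stop_loadgen()  # timing out is the observer's decision, not loadgen's
    return False
```

With this split, loadgen never has to guess when to give up; it only has to report completion truthfully.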

Potential updated way of generating load:

  1. start simulation: setup world
    a. synchronize all source accounts used during the entire simulation run (mAccounts)
    b. compute the list expected of pairs <simulation_account_ID=uint64, sequence_number>, giving the sequence number each source account should have at the end of the simulation. For "create", there is a single source account, so something like (rounding) current_seq_num+nbAccount/batchSize; for payments we can assume round robin over the set of mAccounts, so something like current_seq_num+nTx/nAccounts
    c. initialize the lists of accounts done and backlog to empty
  2. load generation step (inject txPerStep transactions), loop is something like:
    a. generate backlog if needed (ie, backlog is empty)
      i. iterate over expected, remove accounts that already have the right sequence number; otherwise add to backlog
      ii. if expected is empty, simulation is "done"
      iii. shuffle backlog
    b. pick an account from backlog, generate and submit one transaction for that account
      i. the logic that we need here is similar to what we have right now: on duplicate, skip (ie, pick the next account from backlog); on error, synchronize the account
      ii. NB: generate has to be fully deterministic based on source account and sequence number (this is already the case)
      iii. we have to break if no progress was made even after rebuilding the backlog (can happen with create if the account gets banned)
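
Step 1.b can be sketched as follows. This is only an illustration of the arithmetic: the function names are hypothetical, rounding up for "create" is an assumption, and the payment case spreads the remainder of nTx/nAccounts over the first accounts in ID order, which is one possible reading of "round robin":

```python
import math


def expected_for_create(source_id, current_seq, nb_accounts, batch_size):
    # One transaction per batch of created accounts, rounded up (assumed).
    return {source_id: current_seq + math.ceil(nb_accounts / batch_size)}


def expected_for_payments(current_seqs, n_tx):
    # current_seqs maps account ID -> current sequence number.
    ids = sorted(current_seqs)
    base, extra = divmod(n_tx, len(ids))
    # The first `extra` accounts send one more transaction than the rest.
    return {acc: current_seqs[acc] + base + (1 if i < extra else 0)
            for i, acc in enumerate(ids)}


print(expected_for_create(0, 10, nb_accounts=250, batch_size=100))  # {0: 13}
print(expected_for_payments({1: 5, 2: 7, 3: 9}, n_tx=7))  # {1: 8, 2: 9, 3: 11}
```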

This may generate a few extra transactions at the end, but this should not really matter (as they would be rejected by the validator with duplicate). We get "retries" for free between steps.
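
The load generation step above could look roughly like this. It is a sketch under assumptions, not the loadgen implementation: `Status` is a stand-in for the validator's submit statuses (only the try-again-later and duplicate cases are named in this issue), `ledger_seq` and `submit` are hypothetical callables, and "synchronize" is modeled by always re-reading the sequence number from the ledger:

```python
import random
from enum import Enum


class Status(Enum):
    PENDING = "pending"
    DUPLICATE = "duplicate"
    ERROR = "error"
    TRY_AGAIN_LATER = "try_again_later"


def run_step(expected, backlog, ledger_seq, submit, tx_per_step, rng):
    """Inject up to tx_per_step transactions; return True when the run is done.

    expected:   {account: target sequence number}, pruned as accounts finish
    backlog:    list of accounts still to be served, carried across steps
    ledger_seq: callable(account) -> sequence number currently on the ledger
    submit:     callable(account, seq) -> Status
    """
    submitted = 0
    progress = True  # allows the first backlog rebuild of this step
    while submitted < tx_per_step:
        if not backlog:
            if not progress:
                break  # a full pass over the backlog achieved nothing
            progress = False
            for account in list(expected):
                if ledger_seq(account) >= expected[account]:
                    del expected[account]
                else:
                    backlog.append(account)
            if not expected:
                return True
            rng.shuffle(backlog)
        account = backlog.pop()
        seq = ledger_seq(account)
        if seq >= expected[account]:
            continue  # finished since the backlog was built
        # Deterministic: the transaction depends only on (account, seq + 1).
        if submit(account, seq + 1) is Status.PENDING:
            submitted += 1
            progress = True
        # DUPLICATE, ERROR and TRY_AGAIN_LATER all fall through to the next
        # account; the account is re-read from the ledger on its next turn.
    return False
```

Calling `run_step` once per step until it returns True gives the "retries for free" behavior: a transaction that was dropped simply leaves its account below the expected sequence number, so the account reappears in the next backlog.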

@MonsieurNicolas MonsieurNicolas added this to To do in v11.4.0 via automation Jul 16, 2019
@marta-lokhova marta-lokhova self-assigned this Aug 7, 2019
@marta-lokhova
Contributor

@MonsieurNicolas is it valid to expect that transactions are banned only when the system is overloaded? (i.e., too many txs are submitted and stellar-core can't keep up; note that I don't mean banned/dropped due to being invalid, since that just points to a bug in loadgen) If so, would it be simpler to just mark loadgen as "failed" and let the user decide what they want to do, instead of trying to recover?
In case of acceptance tests, we should not expect it to fail (though it should retry anyway), since we're generating a tiny amount of load at a very slow txrate. In case of benchmarking, I wouldn't want loadgen to recover, but rather fail, so we can reason about a point in time when the system became overloaded under the stress test without loadgen intervening.

@MonsieurNicolas
Contributor Author

So fast fail the loadgen if the system ends up banning transactions? I think this could work in this context
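
A minimal sketch of that fail-fast behavior, assuming a submit callable that reports a status per transaction. `Status`, `LoadgenFailed` and `submit_all` are hypothetical names used only for illustration:

```python
from enum import Enum


class Status(Enum):
    PENDING = "pending"
    DUPLICATE = "duplicate"
    ERROR = "error"
    TRY_AGAIN_LATER = "try_again_later"


class LoadgenFailed(Exception):
    """Raised as soon as the validator bans a transaction."""


def submit_all(txs, submit):
    """Submit txs in order; stop at the first banned one instead of retrying."""
    for index, tx in enumerate(txs):
        if submit(tx) is Status.TRY_AGAIN_LATER:
            raise LoadgenFailed(f"transaction at index {index} was banned")
    return len(txs)
```

The index in the error gives a benchmark a concrete point at which the system stopped keeping up.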

@MonsieurNicolas MonsieurNicolas moved this from To do to In progress in v11.4.0 Aug 13, 2019
v11.4.0 automation moved this from In progress to Done Aug 14, 2019