Mitch Kelley edited this page May 8, 2019 · 2 revisions

Decisions

Pod dependencies

  • Supergloo
  • Prometheus
  • Prometheus Alert Manager

Experiment consists of

  • [note] Design for cron job (for now) - no fancy timing/lifecycle
  • Set of faults
  • Stop condition
  • Timeout
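A minimal sketch of how these pieces could fit together as a data model. The notes only name the three parts (faults, stop condition, timeout); every field below (`kind`, `target_service`, `percentage`, `query`, `threshold`) is a hypothetical placeholder, not the actual CRD schema.

```python
from dataclasses import dataclass, field
from datetime import timedelta
from typing import List, Optional


@dataclass
class Fault:
    # Hypothetical fields: the notes only say "set of faults".
    kind: str                  # e.g. "delay" or "abort" (assumed values)
    target_service: str        # service the fault is injected into (assumed)
    percentage: float = 100.0  # share of traffic affected (assumed)


@dataclass
class StopCondition:
    # Hypothetical shape: a Prometheus query compared against a threshold.
    query: str
    threshold: float


@dataclass
class Experiment:
    name: str
    faults: List[Fault] = field(default_factory=list)
    stop_condition: Optional[StopCondition] = None
    timeout: timedelta = timedelta(minutes=5)  # default chosen for illustration


# Example instance with made-up values.
example = Experiment(
    name="checkout-latency",
    faults=[Fault(kind="delay", target_service="checkout", percentage=50.0)],
    stop_condition=StopCondition(query="error_rate", threshold=0.05),
    timeout=timedelta(minutes=10),
)
```

Keeping the stop condition and timeout on the experiment itself means the reconciler needs no state beyond the resource it is handed.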

CRDs

  • Experiment

TODO

Watch Prometheus for stop condition

Reconciler:

  • Terminate experiment when fail trigger condition met
  • Terminate experiment when timeout expires
  • Create Faults from experiment
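The two termination rules for the reconciler can be sketched as a single pure function. This is an illustration under assumptions: the function name, the reason strings, and the "metric at or above threshold" comparison are all hypothetical, and the metric value is passed in rather than fetched from Prometheus.

```python
from datetime import datetime, timedelta
from typing import Optional


def termination_reason(
    start_time: datetime,
    timeout: timedelta,
    now: datetime,
    metric_value: Optional[float],
    threshold: float,
) -> Optional[str]:
    """Return why a running experiment should be terminated, or None.

    The stop condition is checked first so that a failing experiment is
    reported as such even if its timeout has also expired.
    """
    # Assumed comparison: the fail trigger fires at or above the threshold.
    if metric_value is not None and metric_value >= threshold:
        return "stop_condition"
    if now - start_time >= timeout:
        return "timeout"
    return None
```

Passing `now` and the metric value as arguments keeps the decision testable without a clock or a live Prometheus instance.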

Init: Sort by state

  • if running, do reconcile loop
    • create routing rules
    • write the start time to the experiment
    • start timeout tick
  • running -> monitor (nothing)
  • completed -> clean up resources
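One possible reading of the init step above, as a sketch. The notes do not name the state an experiment is in before it has been started, so the `pending` state here is an assumption, as are the dict-based experiment shape and the action names. Actions are recorded as tuples instead of being executed.

```python
from collections import defaultdict
from datetime import datetime

# "pending" is an assumed name for a not-yet-started experiment.
PENDING, RUNNING, COMPLETED = "pending", "running", "completed"


def group_by_state(experiments):
    """Sort experiments into buckets keyed by their current state."""
    groups = defaultdict(list)
    for exp in experiments:
        groups[exp["state"]].append(exp)
    return groups


def init(experiments, now: datetime):
    """Walk each state bucket once and return the actions that were planned."""
    actions = []
    groups = group_by_state(experiments)  # snapshot taken before any state change

    for exp in groups[PENDING]:
        actions.append((exp["name"], "create_routing_rules"))
        exp["start_time"] = now  # write the start time to the experiment
        actions.append((exp["name"], "start_timeout_tick"))
        exp["state"] = RUNNING

    for exp in groups[RUNNING]:
        # Already started: nothing to do except keep monitoring.
        actions.append((exp["name"], "monitor"))

    for exp in groups[COMPLETED]:
        actions.append((exp["name"], "cleanup_resources"))

    return actions
```

Because the buckets are computed before any state changes, an experiment started in this pass is not also visited as a running one.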

Watch for failure conditions:

  • sets completed state
  • running -> failed

Load tester

  • various modes:
    • database cache overloaded
    • DDoS
    • regular traffic

CLI features:

  • watch flag for getting status of running experiment

Prometheus in Helm

Future consideration: exclude stale resources from the reconciler (to avoid having to reconcile 10 million dead resources)
