Building Twitter’s Next-Gen Alerting System

@sortalongo @megankanne @justanguyen Slides: https://docs.google.com/presentation/d/1K4e7uB0wBzrlq5EctCvWEyok0gTOz-1Sh9Ya-o6NeRw/edit?usp=sharing

  • Twitter has a new alerting system
  • took 2 years to get here.
  • #talk-megan-justin-dan
  • Observability team's mission:
    • Scalable
    • Robust
    • Real-time
  • Not responsible for each team's monitoring, just the system they use.
    • not the watchers on the wall
    • just the wall
  • Manhattan KV Store (never heard of it)
  • “Observability of Twitter”: second result on Google
  • 300M metrics per minute -> 4.3B metrics per minute (~14x growth)
  • Old: alert -> rule -> monitor
    • The good:
      • simple config language
      • large body of examples
      • easy to write, commit, upload new alerts
    • The bad:
      • simple config language
      • large body of (bad) examples
      • easy to write, commit, upload new alerts... huge bit-rot
    • JSON-ish config language
    • Mustache templating stapled on top
    • Ownership of config was confusing.
      • Orphaned configs
    • Not enough effort to help people do things right.
    • Alert config was disjoint from dashboard config.
    • If it's important enough to alert on, it's important enough to chart.
  • The new system:
    • Goals:
      • Improve config
      • Reduce loss of visibility
      • Make creating alerts simple
    • Combined alert and dashboarding.
      • Alerts are now a predicate on a dashboard.
    • New alerts are written in Python (a minimal sketch follows this list)
      • Includes a shared library
      • Configs must pass tests
      • Adding server-side validation
      • Make sure queries aren't too expensive
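
A minimal sketch of what "an alert is a predicate on a dashboard query" could look like in Python. Everything here (the `Query`/`Alert` classes, field names, the test) is invented for illustration; it is not Twitter's actual shared library.

```python
# Hypothetical sketch only: these classes and names are invented for
# illustration and are not Twitter's actual alerting library.
from dataclasses import dataclass
from typing import Callable


@dataclass
class Query:
    """A metrics query that also backs a dashboard chart."""
    expression: str


@dataclass
class Alert:
    """An alert is just a predicate evaluated over a dashboard query."""
    name: str
    query: Query
    predicate: Callable[[float], bool]
    owner: str  # explicit ownership avoids the orphaned configs of the old system


success_rate = Query(expression="sum(rate(requests_ok)) / sum(rate(requests_total))")

low_success_rate = Alert(
    name="low_success_rate",
    query=success_rate,
    predicate=lambda value: value < 0.999,  # the alert fires when this is true
    owner="observability-team",
)


def test_low_success_rate_fires_on_bad_value():
    """Configs must pass tests before they can be deployed."""
    assert low_success_rate.predicate(0.95)
    assert not low_success_rate.predicate(0.9995)


test_low_success_rate_fires_on_bad_value()
```
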
    • on-call reliability:
      • Alerts need to come out of any available zone.
      • New arch schedules fairly across zones (see the sketch after this list)
      • Alerting system time to detect improved by 30%
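
One way to read "alerts come out of any available zone" is a scheduler that never hard-depends on a single zone. The sketch below is an assumption for illustration (zone names, the health-check shape, and round-robin fairness are all invented), not the talk's actual architecture.

```python
# Hypothetical sketch: rotate alert evaluation over healthy zones so that
# losing any one zone never silences alerting.
import itertools


class ZoneScheduler:
    """Round-robin over healthy zones; no single zone is a hard dependency."""

    def __init__(self, zones):
        self.zones = zones
        self._cycle = itertools.cycle(zones)

    def pick_zone(self, is_healthy):
        # Try each zone at most once per scheduling decision.
        for _ in range(len(self.zones)):
            zone = next(self._cycle)
            if is_healthy(zone):
                return zone
        raise RuntimeError("no healthy zone available")


scheduler = ZoneScheduler(["zone-a", "zone-b", "zone-c"])
print(scheduler.pick_zone(lambda z: z != "zone-a"))  # fails over past zone-a
```
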
    • Human reasoning:
      • Monitoring isn't unit testing... it's closer to integration testing
      • Set an alert within its context: global, peer, and local. Detect changes at any tier (see the sketch after this list)
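
A small sketch of what checking a metric at all three tiers might mean: compare one instance's value against its own history (local), its peer group (peer), and the fleet (global). The threshold, the z-score test, and all names are illustrative assumptions, not the system's actual statistics.

```python
# Hypothetical sketch of tiered alert context: local, peer, and global.
from statistics import mean, stdev


def anomalous(value, baseline, threshold=3.0):
    """Flag values more than `threshold` standard deviations from baseline."""
    if len(baseline) < 2:
        return False
    sigma = stdev(baseline) or 1e-9
    return abs(value - mean(baseline)) / sigma > threshold


def tiered_check(instance_value, instance_history, peer_values, global_values):
    return {
        "local": anomalous(instance_value, instance_history),
        "peer": anomalous(instance_value, peer_values),
        "global": anomalous(instance_value, global_values),
    }


# An instance suddenly reporting 95 when everything hovers around 10
# trips all three tiers.
print(tiered_check(95.0, [10, 11, 9, 10], [10, 12, 11, 9], [10, 11, 10, 12]))
```
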
    • Empower the humans receiving the alerts
    • Distributed systems
      • Availability and scale requirements are bigger than any single system
      • Hard to get right.
      • Hard to maintain consistency
      • Structural complexity
    • Sharding is hard and important
      • Sharded by alert key (see the sketch after this list).
    • rely on existing tools
      • ZooKeeper, Manhattan KV, etc.
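
A minimal sketch of sharding by alert key: each alert hashes to exactly one shard, so evaluators never need to coordinate over ownership. The shard count and hash choice are assumptions for illustration, not the talk's actual scheme.

```python
# Hypothetical sketch: stable shard assignment keyed on the alert's name.
import hashlib


def shard_for(alert_key: str, num_shards: int) -> int:
    """Stable assignment: the same key always lands on the same shard."""
    digest = hashlib.sha256(alert_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


# Deterministic: re-running the assignment never moves the alert.
assert shard_for("low_success_rate", 16) == shard_for("low_success_rate", 16)
print(shard_for("low_success_rate", 16))
```
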
    • Engineering principles:
      • abstract cleanly
      • Minimize coordination
      • Parallelize the work
      • distribute load evenly
      • do it all FAST
    • Sharding rulesets
      • Code of honor
      • Sharding by ruleset causes uneven load (see the worked example after this list)
      • Solved by over-provisioning
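
A worked example of why ruleset-level sharding loads shards unevenly: one team's giant ruleset pins all of its rules to a single shard, while per-alert sharding spreads them out. The ruleset sizes below are invented for illustration.

```python
# Illustrative numbers only: ruleset-level vs alert-level sharding.
from collections import Counter

# A few teams own huge rulesets, most own small ones (assumed distribution).
ruleset_sizes = {"ads": 5000, "timeline": 3000, "search": 50, "dm": 40, "misc": 10}
num_shards = 4

# Sharding by ruleset: a giant ruleset pins its whole load to one shard.
by_ruleset = Counter()
for i, size in enumerate(ruleset_sizes.values()):
    by_ruleset[i % num_shards] += size

# Sharding by alert key: each rule lands on its own shard, spreading the load.
by_alert = Counter()
rule_id = 0
for size in ruleset_sizes.values():
    for _ in range(size):
        by_alert[rule_id % num_shards] += 1
        rule_id += 1

print("by ruleset:", dict(by_ruleset))  # one shard carries ~5000 rules
print("by alert:  ", dict(by_alert))    # shards are within one rule of equal
```
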
    • User support is important.
      • UI needs to offer help
      • CLI needs to offer help
      • Config libs need comments
      • Helping out:
        • Strong validation (see the sketch after this list)
        • helpful error messages
        • Extensive documentation
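
A small sketch of validation with helpful error messages: return human-readable problems instead of a bare stack trace. The field names, limits, and messages are assumptions for illustration, not the real system's rules.

```python
# Hypothetical sketch of server-side config validation with helpful errors.
def validate_alert_config(config: dict) -> list:
    """Collect every problem, phrased so the user knows how to fix it."""
    errors = []
    if not config.get("owner"):
        errors.append("missing 'owner': every alert needs a team to page")
    if not config.get("query"):
        errors.append("missing 'query': an alert must chart what it watches")
    if config.get("period_seconds", 60) < 10:
        errors.append(
            "period_seconds < 10: this query would be too expensive to run; "
            "use a period of at least 10 seconds"
        )
    return errors


problems = validate_alert_config({"query": "rate(errors)", "period_seconds": 5})
for problem in problems:
    print("config error:", problem)
```
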
      • Strong user guides
      • deep FAQs
      • Third line of support:
        • Dedicated support on-call rotation
      • User collaboration to help migration
        • Some excited to try new things
        • Some busy with other things
      • Goal wasn't to get people to migrate
        • Goal was to make Twitter better.
        • Don't treat people like mules (carrots and sticks)
        • Needed a better compromise (which might mean delays in roadmap)