Monitoring #1060

Closed
snarfed opened this issue May 19, 2024 · 8 comments

snarfed commented May 19, 2024

We have a monitoring dashboard, https://console.cloud.google.com/monitoring/dashboards/builder/4f0ac7cc-258d-4058-8208-fafa32088378?project=bridgy-federated , but it's degraded. Time to take another pass at it!

snarfed added now infra and removed now labels May 19, 2024
snarfed changed the title from "Firehose monitoring" to "Monitoring" Jun 19, 2024
snarfed added the now label Jun 19, 2024

snarfed commented Jun 30, 2024

Also: alerts on task queue length.

snarfed commented Jul 18, 2024

Here's a draft of what I'd like:

Monitoring

app

  • requests
  • errors
  • latency: p50, p90, p99
  • /r, /convert, webfinger requests, by domain
  • outbound HTTP requests by domain
  • instances

router

  • tasks by queue
  • task responses, errors
  • task latency
  • already seen activity ids
  • unsupported activity types
    • ...also in protocols like AP that short circuit them
  • receive tasks by protocol, source domain
  • send tasks by protocol, destination domain
  • new users
  • user deactivations
  • activities for blocked, opted out, limited, blocklisted users
  • CPU, memory

atproto-hub

  • firehose client:
    • total events
    • our events: from our users, to our users
  • firehose server:
    • emitted events
    • connected clients
  • CPU, memory

scoreboard

  • users
  • activities processed
  • activities emitted (do we collapse AP inbox deliveries? or separate them?)
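
To make a couple of the app items above concrete (latency percentiles, outbound requests by domain), here is a minimal sketch in plain Python. It is only an illustration, not how the dashboard computes these numbers; the function names and the sample URLs are hypothetical.

```python
from collections import Counter
from statistics import quantiles
from urllib.parse import urlparse


def latency_percentiles(latencies_ms):
    """Return p50, p90, and p99 for a list of latencies (needs >= 2 samples)."""
    # quantiles(n=100) returns the 99 cut points p1..p99, so p_i is at index i - 1.
    cuts = quantiles(latencies_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p90": cuts[89], "p99": cuts[98]}


def requests_by_domain(urls):
    """Count requests per destination hostname."""
    return Counter(urlparse(url).hostname for url in urls)


if __name__ == "__main__":
    print(latency_percentiles(list(range(101))))
    print(requests_by_domain([
        "https://bsky.social/xrpc/hypothetical",  # hypothetical sample URLs
        "https://bsky.social/other",
        "https://example.com/inbox",
    ]))
```

In practice a monitoring backend would usually compute percentiles from histograms rather than raw samples; the sketch just shows what the p50/p90/p99 and by-domain breakdowns mean.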

Alerts

  • app error rate
  • app latency
  • task queue length, errors, rate drops
  • atproto firehose:
    • last connection from Bluesky relay over ~75m ago
    • emitted events rate dropoff
  • app instances sustained spike (can we limit to billed instances?)
  • router, atproto-hub CPU
  • router, atproto-hub serving instances sustained != 1
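
A few of the alert conditions above, sketched as plain Python predicates. This is a hedged illustration of the logic only, not alerting policy configuration; every name and threshold here is an assumption except the ~75 minute relay window, which comes from the list.

```python
from datetime import datetime, timedelta, timezone

RELAY_STALE_AFTER = timedelta(minutes=75)  # the "~75m" from the draft


def relay_connection_stale(last_connected, now):
    """True if the last relay connection was more than ~75 minutes ago."""
    return now - last_connected > RELAY_STALE_AFTER


def sustained(samples, predicate, min_consecutive):
    """True if the most recent `min_consecutive` samples all satisfy `predicate`."""
    if min_consecutive < 1 or len(samples) < min_consecutive:
        return False
    return all(predicate(s) for s in samples[-min_consecutive:])


def rate_dropped(recent_rate, baseline_rate, max_drop_fraction=0.5):
    """True if recent_rate is below baseline by more than max_drop_fraction.

    The 0.5 default is a hypothetical threshold, not one from the draft.
    """
    if baseline_rate <= 0:
        return False
    return recent_rate < baseline_rate * (1 - max_drop_fraction)


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print(relay_connection_stale(now - timedelta(minutes=80), now))
    # Serving instance counts per sampling interval; alert if != 1 for 3 in a row.
    print(sustained([1, 1, 2, 2, 2], lambda n: n != 1, 3))
    print(rate_dropped(recent_rate=40, baseline_rate=100))
```

The same `sustained` helper would cover the task queue length alert, for example `sustained(queue_lengths, lambda n: n > limit, k)` with a limit and window chosen per queue.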

snarfed added a commit to snarfed/lexrpc that referenced this issue Jul 18, 2024
snarfed added a commit to snarfed/webutil that referenced this issue Jul 18, 2024
snarfed added a commit that referenced this issue Jul 18, 2024
snarfed added a commit that referenced this issue Jul 18, 2024

qazmlp commented Jul 19, 2024

> We have a monitoring dashboard, https://console.cloud.google.com/monitoring/dashboards/builder/4f0ac7cc-258d-4058-8208-fafa32088378?project=bridgy-federated , but it's degraded. Time to take another pass at it!

That dashboard is private, by the way.

Not sure if you're planning to have a public status page at all/if the privacy setting there is intentional, but I thought I'd mention it in case it's accidental.

snarfed commented Jul 19, 2024

Hey, yes, not accidental. Not sure if I can make it public, but I'd like to!

snarfed commented Jul 19, 2024

snarfed commented Jul 20, 2024

Got a first pass at a new dashboard that I'm reasonably happy with.

[image: dashboard]

snarfed commented Jul 20, 2024

...and a first pass at alerts:

[image: alerts]

snarfed commented Jul 21, 2024

There are a few still outstanding, but this is pretty much done!

snarfed closed this as completed Jul 21, 2024