Skip to content

feat: add observability metrics for queue monitoring, failure tracking, and Go runtime#34

Open
yaoge123 wants to merge 3 commits into
ustclug:masterfrom
yaoge123:feat/observability-metrics
Open

feat: add observability metrics for queue monitoring, failure tracking, and Go runtime#34
yaoge123 wants to merge 3 commits into
ustclug:masterfrom
yaoge123:feat/observability-metrics

Conversation

@yaoge123
Copy link
Copy Markdown
Contributor

@yaoge123 yaoge123 commented May 28, 2026

Summary

Add observability metrics to rsync-proxy for production monitoring and alerting. All counters that can be attributed to a specific upstream carry an upstream label so dashboards and alerts can pinpoint the failing target.

New Metrics

Queue monitoring (per upstream)

  • rsync_proxy_queued_connections{upstream} — current queued connections (gauge)
  • rsync_proxy_queue_active_max{upstream} — configured max active connections (gauge; 0 means no limit)
  • rsync_proxy_queue_queued_max{upstream} — configured max queued connections (gauge; 0 means no limit)
  • rsync_proxy_queue_full_rejected_total{upstream} — connections rejected due to queue full (counter)

Failure tracking

  • rsync_proxy_upstream_dial_errors_total{upstream} — upstream dial failures, per upstream (counter)
  • rsync_proxy_unknown_module_requests_total — requests for unknown modules (counter, no label since there is no upstream)

Go runtime / process

  • Standard Go runtime and process metrics exposed via promhttp on /metrics (goroutines, heap, GC, FDs, etc.). EnableOpenMetrics is left off so the legacy text-format metrics above can be appended in the same response without an # EOF terminator splitting them.

Files changed

File Change
pkg/queue/queue.go Add ActiveLen(), QueuedLen(), GetMaxQueued() accessors
pkg/server/server.go Add per-upstream counter map (lazy sync.Map) and global unknownModuleCount, instrument the failure paths in relay(), integrate promhttp.HandlerFor on /metrics
pkg/server/metrics.go Export the 7 new metrics; deep-copy upstreams and queue map under reloadLock to avoid races with reload
pkg/server/server_test.go Tests for the new metrics: queue gauges, queue-full counter increment, unknown-module counter increment, Go runtime metrics presence
go.mod / go.sum Add prometheus/client_golang dependency

Tests

All package tests pass, including -race:

go test -count=1 -race ./pkg/...
ok  github.com/ustclug/rsync-proxy/pkg/logging
ok  github.com/ustclug/rsync-proxy/pkg/queue
ok  github.com/ustclug/rsync-proxy/pkg/server

- Add queue depth, queue capacity, and queue full rejection counters
- Add upstream dial error and unknown module request counters
- Expose Go runtime metrics (goroutines, memory, GC) via promhttp
- Expose per-upstream queue configuration (max_active, max_queued)
- Add Len(), QueuedLen(), ActiveLen(), GetMaxQueued() to queue.Queue
Copilot AI review requested due to automatic review settings May 28, 2026 02:03
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

This PR extends the server’s Prometheus metrics endpoint by adding queue/upstream-related metrics and exporting default Prometheus registry metrics via promhttp.

Changes:

  • Add new counters for queue-full rejections, upstream dial errors, and unknown-module requests.
  • Add per-upstream queue gauges (queued length, configured max active/queued).
  • Serve prometheus.DefaultGatherer output in the metrics HTTP handler and relax the metrics Content-Type assertion in tests.

Reviewed changes

Copilot reviewed 5 out of 6 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
pkg/server/server_test.go Updates metrics endpoint test to be tolerant of Content-Type variations.
pkg/server/server.go Adds prom/promhttp integration and increments new metric counters at key failure points.
pkg/server/metrics.go Emits new queue/upstream-related metrics in Prometheus text exposition format.
pkg/queue/queue.go Adds accessors for queue sizes and configured max queued.
go.mod Adds Prometheus client dependency and bumps testify.
go.sum Updates dependency checksums to reflect the new/updated modules.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread pkg/server/metrics.go
Comment on lines +42 to +49
s.reloadLock.RLock()
upstreams := s.upstreams
queues := s.upstreamQueues
s.reloadLock.RUnlock()

sort.Slice(upstreams, func(i, j int) bool {
return upstreams[i].Name < upstreams[j].Name
})
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in ed963e1. Now deep-copies both the upstreams slice and queues map while holding the reloadLock, then operates on the copies after releasing the lock.

Comment thread pkg/server/server.go Outdated
Comment on lines 860 to 862
w.Header().Set("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
promhttp.HandlerFor(prometheus.DefaultGatherer, promhttp.HandlerOpts{DisableCompression: true}).ServeHTTP(w, r)
s.writePrometheusMetrics(w, time.Now())
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in ed963e1. Added EnableOpenMetrics: false to HandlerOpts to prevent OpenMetrics end markers, ensuring appended custom metrics remain valid.

Comment thread pkg/server/metrics.go
Comment on lines +51 to +58
_, _ = fmt.Fprintln(w, "# HELP rsync_proxy_queued_connections Current queued rsync proxy connections per upstream.")
_, _ = fmt.Fprintln(w, "# TYPE rsync_proxy_queued_connections gauge")
for _, u := range upstreams {
if q, ok := queues[u.Name]; ok {
_, _ = fmt.Fprintf(w, "rsync_proxy_queued_connections{upstream=\"%s\"} %d\n",
prometheusEscapeLabelValue(u.Name), q.QueuedLen())
}
}
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Acknowledged. The existing metrics endpoint tests pass (TestMetricsEndpointNoConnections, TestMetricsIncludesActiveConnections, TestMetricsIncludesLifetimeCounters) and exercise the full writePrometheusMetrics path under realistic conditions. The new counters are simple atomic.Uint64 reads that follow the same pattern as existing counters (acceptedConnCount, etc.), which are already covered by TestMetricsIncludesLifetimeCounters. The failure-path increments are implicitly tested by the existing relay tests. Added test assertions for the new metric families can be addressed in a follow-up.

yaoge123 added 2 commits May 28, 2026 10:12
- Deep copy upstreams slice and queues map before releasing reloadLock
  to prevent data races during sort and iteration
- Disable OpenMetrics negotiation in promhttp to prevent # EOF marker
  from breaking appended custom metrics
Address self-review on PR ustclug#34.

* Labelize `rsync_proxy_queue_full_rejected_total` and
  `rsync_proxy_upstream_dial_errors_total` with `upstream`. Failures
  are now attributable to a specific upstream instead of being a single
  global counter, which is essential for alerting and debugging.
* Replace the two global atomic counters on `Server` with a
  `sync.Map` of per-upstream `upstreamCounters` (lazy-initialized on
  first reference). `unknownModuleCount` stays global since it has no
  upstream.
* Drop the explicit `Content-Type` header from the /metrics handler.
  `promhttp.HandlerFor` already sets it via content negotiation, so
  the previous pre-set value was overwritten. Add a comment explaining
  why `EnableOpenMetrics` stays disabled (no '# EOF' terminator so the
  legacy text-format metrics can still be appended).
* Add tests for the new metrics:
  * `TestMetricsIncludesQueueGauges` covers
    `rsync_proxy_queued_connections`, `queue_active_max`,
    `queue_queued_max`, plus zero-state failure counters.
  * `TestMetricsCountsQueueFullRejection` triggers a real queue-full
    rejection and asserts the labeled counter increments.
  * `TestMetricsCountsUnknownModule` exercises the unknown-module
    code path and asserts the counter increments.
  * `TestMetricsIncludesGoRuntime` verifies promhttp's runtime
    metrics (`go_goroutines`, `go_gc_duration_seconds`) appear and
    that legacy text-format metrics still follow them.
@taoky taoky requested a review from iBug May 28, 2026 14:19
Comment thread pkg/server/server.go
@@ -24,6 +24,9 @@ import (

"github.com/ustclug/rsync-proxy/pkg/logging"
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could be merged after this golangci-lint check passes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants