Fix bug: Wrong UDP Average Connect Time metric #1593

josecelano · 2025-06-19T09:31:33Z

Fix bug: Wrong UDP Average Connect Time metric.

Context

Subtasks

Read, recalculate, and update average should be atomic.
Average should be calculated for a time series (label set). We are mixing series: we update the average using a label set (segregate average), but count the requests for the average globally (not segregated).
Add a new metric to count requests for the moving average. This counter is also increased atomically when the new average is updated.
Fix global (all trackers, no labels) value for the average. It should be the average of all average samples, not the SUM.
- Add new average aggregate function to the metrics package.
Add tests:
- With two times series. Two trackers running on different ports (6868, 6969)
- For race conditions, running multiple requests in parallel. To ensure the average is accurate after many iterations.

How to test

Run the tracker: cargo run
Simulate one announce request per UDP server

cargo run -p torrust-tracker-client --bin udp_tracker_client announce udp://127.0.0.1:6868 000620bbc6c52d5a96d98f6c0f1dfa523a40df82 | jq

cargo run -p torrust-tracker-client --bin udp_tracker_client announce udp://127.0.0.1:6969 000620bbc6c52d5a96d98f6c0f1dfa523a40df82 | jq

Check metrics

In the labelled metrics, there should be two metric samples like these:

curl -s "http://localhost:1212/api/v1/metrics?token=MyAccessToken&format=prometheus"

# HELP udp_tracker_server_performance_avg_processing_time_ns Average time to process a UDP request in nanoseconds
# TYPE udp_tracker_server_performance_avg_processing_time_ns gauge
udp_tracker_server_performance_avg_processing_time_ns{request_kind="announce",server_binding_address_ip_family="inet",server_binding_address_ip_type="plain",server_binding_ip="0.0.0.0",server_binding_port="6868",server_binding_protocol="udp"} 54773
udp_tracker_server_performance_avg_processing_time_ns{request_kind="connect",server_binding_address_ip_family="inet",server_binding_address_ip_type="plain",server_binding_ip="0.0.0.0",server_binding_port="6868",server_binding_protocol="udp"} 40326
udp_tracker_server_performance_avg_processing_time_ns{request_kind="announce",server_binding_address_ip_family="inet",server_binding_address_ip_type="plain",server_binding_ip="0.0.0.0",server_binding_port="6969",server_binding_protocol="udp"} 66063.71428571429
udp_tracker_server_performance_avg_processing_time_ns{request_kind="connect",server_binding_address_ip_family="inet",server_binding_address_ip_type="plain",server_binding_ip="0.0.0.0",server_binding_port="6969",server_binding_protocol="udp"} 43497.71428571428

In the global aggregated metrics:

curl -s "http://localhost:1212/api/v1/metrics?token=MyAccessToken&format=prometheus"

The values should be the average of the server's averages:

udp_avg_connect_processing_time_ns 41911
udp_avg_announce_processing_time_ns 60418
udp_avg_scrape_processing_time_ns 0

41911 = (40326 + 43497.71428571428)/2 = 41911,857142857
60418 = (54773 + 66063.71428571429)/2 = 60418,357142857

The values are rounded because we use a u64 for the global aggregated metrics.

…time calculation from Repository to Metrics

codecov · 2025-06-19T09:47:12Z

Codecov Report

Attention: Patch coverage is 97.59036% with 20 lines in your changes missing coverage. Please review.

Project coverage is 85.52%. Comparing base (6d96650) to head (b423bf6).
Report is 18 commits behind head on develop.

Files with missing lines	Patch %	Lines
...kages/udp-tracker-server/src/statistics/metrics.rs	96.45%	9 Missing and 1 partial ⚠️
...es/udp-tracker-server/src/statistics/repository.rs	95.83%	6 Missing and 2 partials ⚠️
packages/metrics/src/metric/aggregate/avg.rs	99.00%	2 Missing ⚠️

Additional details and impacted files

@@             Coverage Diff             @@
##           develop    #1593      +/-   ##
===========================================
+ Coverage    85.13%   85.52%   +0.39%     
===========================================
  Files          287      289       +2     
  Lines        22306    22842     +536     
  Branches     22306    22842     +536     
===========================================
+ Hits         18990    19536     +546     
+ Misses        2993     2986       -7     
+ Partials       323      320       -3

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

…time metric and update atomic It also fixes a division by zero bug when the metrics is updated before the counter for number of conenctions has been increased. It only avoid the division by zero. I will propoerly fixed with independent request counter for the moving average calculation.

… time metric and update atomic It also fixes a division by zero bug when the metrics is updated before the counter for number of conenctions has been increased. It only avoid the division by zero. I will propoerly fixed with independent request counter for the moving average calculation.

…ime metric and update atomic It also fixes a division by zero bug when the metrics is updated before the counter for number of conenctions has been increased. It only avoid the division by zero. I will propoerly fixed with independent request counter for the moving average calculation.

…e series We can't count the total number of UDP requests while calculating the moving average but updating it only for a concrete label set (time series). Averages are calculate for each label set. They could be aggregated by caclulating the average for all time series.

…n moving average calculation Add a new metric `UDP_TRACKER_SERVER_PERFORMANCE_PROCESSED_REQUESTS_TOTAL` to track requests processed specifically for performance metrics, eliminating race conditions in the moving average calculation. **Changes:** - Add new metric constant `UDP_TRACKER_SERVER_PERFORMANCE_PROCESSED_REQUESTS_TOTAL` - Update `recalculate_udp_avg_processing_time_ns()` to use dedicated counter instead of accepted requests total - Add `udp_processed_requests_total()` method to retrieve the new metric value - Add `increment_udp_processed_requests_total()` helper method - Update metric descriptions to include the new counter **Problem Fixed:** Previously, the moving average calculation used the accepted requests counter that could be updated independently, causing race conditions where the same request count was used for multiple calculations. The new implementation increments its own dedicated counter atomically during the calculation, ensuring consistency. **Behavior Change:** The counter now starts at 0 and gets incremented to 1 on the first calculation call, then uses proper moving average formula for subsequent calls. This eliminates division by zero issues and provides more accurate moving averages. **Tests Updated:** Updated repository tests to reflect the new atomic behavior where the processed requests counter is managed specifically for moving average calculations. Fixes race conditions in UDP request processing time metrics while maintaining backward compatibility of all public APIs.

Implements a new aggregate function for calculating averages of metric samples that match specific label criteria, complementing the existing Sum aggregation. - **metrics/src/metric/aggregate/avg.rs**: New metric-level average trait and implementations - `Avg` trait with `avg()` method for calculating averages - Implementation for `Metric<Counter>` returning `f64` - Implementation for `Metric<Gauge>` returning `f64` - Comprehensive unit tests with edge cases (empty samples, large values, etc.) - **metrics/src/metric_collection/aggregate/avg.rs**: New collection-level average trait - `Avg` trait for `MetricCollection` and `MetricKindCollection<T>` - Delegates to metric-level implementations - Handles mixed counter/gauge collections by trying counters first, then gauges - Returns `None` for non-existent metrics - Comprehensive test suite covering various scenarios - **metrics/src/metric/aggregate/mod.rs**: Export new `avg` module - **metrics/src/metric_collection/aggregate/mod.rs**: Export new `avg` module - **metrics/README.md**: Add example usage of the new `Avg` trait in the aggregation section - **Type Safety**: Returns appropriate types (`f64` for both counters and gauges) - **Label Filtering**: Supports filtering samples by label criteria like existing `Sum` - **Edge Case Handling**: Returns `0.0` for empty sample sets - **Performance**: Uses iterator chains for efficient sample processing - **Comprehensive Testing**: 205 tests pass including new avg functionality ```rust use torrust_tracker_metrics::metric_collection::aggregate::Avg; // Calculate average of all matching samples let avg_value = metrics.avg(&metric_name, &label_criteria); ``` The implementation follows the same patterns as the existing `Sum` aggregate function, ensuring consistency in the codebase and maintaining the same level of type safety and performance characteristics.

Improve AI-generated code. Moves the collect_matching_samples helper method from individual aggregate implementations to the generic Metric<T> implementation, making it reusable across all aggregate functions. - Add collect_matching_samples method to Metric<T> for filtering samples by label criteria - Remove code duplication between Sum and Avg aggregate implementations - Improve code organization by centralizing sample collection logic - Maintain backward compatibility and all existing functionality This refactoring improves maintainability by providing a single, well-tested implementation of sample filtering that can be used by current and future aggregate functions.

Division by zero issues was solved. It can't happen now becuase we increase the counter at the beggining of the function. ```rust #[allow(clippy::cast_precision_loss)] pub fn recalculate_udp_avg_processing_time_ns( &mut self, req_processing_time: Duration, label_set: &LabelSet, now: DurationSinceUnixEpoch, ) -> f64 { self.increment_udp_processed_requests_total(label_set, now); let processed_requests_total = self.udp_processed_requests_total(label_set) as f64; let previous_avg = self.udp_avg_processing_time_ns(label_set); let req_processing_time = req_processing_time.as_nanos() as f64; // Moving average: https://en.wikipedia.org/wiki/Moving_average let new_avg = previous_avg as f64 + (req_processing_time - previous_avg as f64) / processed_requests_total; tracing::debug!( "Recalculated UDP average processing time for labels {}: {} ns (previous: {} ns, req_processing_time: {} ns, request_processed_total: {})", label_set, new_avg, previous_avg, req_processing_time, processed_requests_total ); self.update_udp_avg_processing_time_ns(new_avg, label_set, now); new_avg } ```

…etrics When calculating aggregated values for processing time metrics across multiple servers, we need to use the average (.avg()) instead of sum (.sum()) because the metric samples are already averages per server. Using sum() on pre-averaged values would produce incorrect results, as it would add up the averages rather than computing the true average across all servers. Changes: - Add new *_averaged() methods that use .avg() for proper aggregation - Update services.rs to use the corrected averaging methods - Import Avg trait for metric collection averaging functionality Fixes incorrect metric aggregation for: - udp_avg_connect_processing_time_ns - udp_avg_announce_processing_time_ns - udp_avg_scrape_processing_time_ns"

da2ce7 · 2025-06-20T13:03:49Z

wow! big work!

josecelano · 2025-06-20T13:58:20Z

wow! big work!

Hey @da2ce7, I'm getting better at using AI models, and the models are also improving significantly. For example:

This is what the model did to add the new average aggregate function to the metrics package:

384b887

And this is what I changed from its implementation:

8fbcf90

I decided to commit it separately just to document things that the AI is not good at yet.

Adds a comprehensive unit test to validate thread safety when updating UDP_TRACKER_SERVER_PERFORMANCE_AVG_PROCESSING_TIME_NS metrics under concurrent load. The test: - Spawns 200 concurrent tasks (100 per server) simulating two UDP servers - Server 1: cycles through [1000, 2000, 3000, 4000, 5000] ns processing times - Server 2: cycles through [2000, 3000, 4000, 5000, 6000] ns processing times - Validates request counts, average calculations, and metric relationships - Uses tolerance-based assertions (±50ns) to account for moving average calculation variations in concurrent environments - Ensures thread safety and mathematical correctness of the metrics system This test helps ensure the UDP tracker server's metrics collection remains accurate and thread-safe under high-concurrency scenarios.

…cs race condition test Restructures the race condition test to follow clear Arrange-Act-Assert pattern and eliminates code duplication through helper function extraction. The test maintains identical functionality while being more maintainable, readable, and following DRY principles. All 200 concurrent tasks still validate thread safety and mathematical correctness of the metrics system.

josecelano · 2025-06-20T17:00:09Z

ACK b423bf6

refactor(udp-tracker-server): [torrust#1589] move average processing …

e6c05b6

…time calculation from Repository to Metrics

josecelano temporarily deployed to coverage June 19, 2025 09:31 — with GitHub Actions Inactive

josecelano requested a review from da2ce7 June 19, 2025 09:31

josecelano added the Bug Incorrect Behavior label Jun 19, 2025

josecelano linked an issue Jun 19, 2025 that may be closed by this pull request

Overhaul stats: Wrong UDP Average Connect Time metric #1589

Closed

3 tasks

josecelano added 3 commits June 19, 2025 11:19

josecelano temporarily deployed to coverage June 19, 2025 11:06 — with GitHub Actions Inactive

josecelano added 2 commits June 19, 2025 12:27

refactor: [torrust#1589] remvoe duplicate code

164de92

josecelano temporarily deployed to coverage June 19, 2025 13:00 — with GitHub Actions Inactive

josecelano self-assigned this Jun 19, 2025

josecelano temporarily deployed to coverage June 19, 2025 21:29 — with GitHub Actions Inactive

josecelano force-pushed the 1589-increase-moving-average-for-processing-time-atomically branch from 6aaa4e6 to ed5f1e6 Compare June 20, 2025 06:39

josecelano temporarily deployed to coverage June 20, 2025 06:39 — with GitHub Actions Inactive

josecelano added 2 commits June 20, 2025 07:58

josecelano temporarily deployed to coverage June 20, 2025 07:31 — with GitHub Actions Inactive

josecelano added 3 commits June 20, 2025 08:32

chore: remove deprecated comment

f402b02

fix: format

ba3d8a9

josecelano temporarily deployed to coverage June 20, 2025 09:25 — with GitHub Actions Inactive

josecelano force-pushed the 1589-increase-moving-average-for-processing-time-atomically branch from e4ebb50 to cd57f7a Compare June 20, 2025 09:38

josecelano temporarily deployed to coverage June 20, 2025 09:39 — with GitHub Actions Inactive

josecelano added 2 commits June 20, 2025 10:43

refactor: [torrust#1589] make methods private

4c082fa

refactor: [torrust#1589] rename methods and remove unused code

a9acca5

josecelano temporarily deployed to coverage June 20, 2025 10:33 — with GitHub Actions Inactive

josecelano temporarily deployed to coverage June 20, 2025 15:52 — with GitHub Actions Inactive

josecelano temporarily deployed to coverage June 20, 2025 16:51 — with GitHub Actions Inactive

josecelano marked this pull request as ready for review June 20, 2025 16:52

josecelano force-pushed the 1589-increase-moving-average-for-processing-time-atomically branch from 69e4670 to b423bf6 Compare June 20, 2025 16:59

josecelano temporarily deployed to coverage June 20, 2025 16:59 — with GitHub Actions Inactive

josecelano changed the title ~~Wrong UDP Average Connect Time metric~~ Fix bug: Wrong UDP Average Connect Time metric Jun 20, 2025

josecelano merged commit 7b1c190 into torrust:develop Jun 20, 2025
21 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fix bug: Wrong UDP Average Connect Time metric #1593

Fix bug: Wrong UDP Average Connect Time metric #1593

Uh oh!

josecelano commented Jun 19, 2025 •

edited

Loading

Uh oh!

codecov bot commented Jun 19, 2025 •

edited

Loading

Uh oh!

da2ce7 commented Jun 20, 2025

Uh oh!

josecelano commented Jun 20, 2025 •

edited

Loading

Uh oh!

josecelano commented Jun 20, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Fix bug: Wrong UDP Average Connect Time metric #1593

Fix bug: Wrong UDP Average Connect Time metric #1593

Uh oh!

Conversation

josecelano commented Jun 19, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Context

Subtasks

How to test

Uh oh!

codecov bot commented Jun 19, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

da2ce7 commented Jun 20, 2025

Uh oh!

josecelano commented Jun 20, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

josecelano commented Jun 20, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

josecelano commented Jun 19, 2025 •

edited

Loading

codecov bot commented Jun 19, 2025 •

edited

Loading

josecelano commented Jun 20, 2025 •

edited

Loading