Skip to content

Conversation

@josecelano
Copy link
Member

@josecelano josecelano commented Jun 19, 2025

Fix bug: Wrong UDP Average Connect Time metric.

Context

See #1589.

Subtasks

  • Read, recalculate, and update average should be atomic.
  • Average should be calculated for a time series (label set). We are mixing series: we update the average using a label set (segregate average), but count the requests for the average globally (not segregated).
  • Add a new metric to count requests for the moving average. This counter is also increased atomically when the new average is updated.
  • Fix global (all trackers, no labels) value for the average. It should be the average of all average samples, not the SUM.
    • Add new average aggregate function to the metrics package.
  • Add tests:
    • With two times series. Two trackers running on different ports (6868, 6969)
    • For race conditions, running multiple requests in parallel. To ensure the average is accurate after many iterations.

How to test

  1. Run the tracker: cargo run
  2. Simulate one announce request per UDP server
cargo run -p torrust-tracker-client --bin udp_tracker_client announce udp://127.0.0.1:6868 000620bbc6c52d5a96d98f6c0f1dfa523a40df82 | jq

cargo run -p torrust-tracker-client --bin udp_tracker_client announce udp://127.0.0.1:6969 000620bbc6c52d5a96d98f6c0f1dfa523a40df82 | jq
  1. Check metrics

In the labelled metrics, there should be two metric samples like these:

curl -s "http://localhost:1212/api/v1/metrics?token=MyAccessToken&format=prometheus"

# HELP udp_tracker_server_performance_avg_processing_time_ns Average time to process a UDP request in nanoseconds
# TYPE udp_tracker_server_performance_avg_processing_time_ns gauge
udp_tracker_server_performance_avg_processing_time_ns{request_kind="announce",server_binding_address_ip_family="inet",server_binding_address_ip_type="plain",server_binding_ip="0.0.0.0",server_binding_port="6868",server_binding_protocol="udp"} 54773
udp_tracker_server_performance_avg_processing_time_ns{request_kind="connect",server_binding_address_ip_family="inet",server_binding_address_ip_type="plain",server_binding_ip="0.0.0.0",server_binding_port="6868",server_binding_protocol="udp"} 40326
udp_tracker_server_performance_avg_processing_time_ns{request_kind="announce",server_binding_address_ip_family="inet",server_binding_address_ip_type="plain",server_binding_ip="0.0.0.0",server_binding_port="6969",server_binding_protocol="udp"} 66063.71428571429
udp_tracker_server_performance_avg_processing_time_ns{request_kind="connect",server_binding_address_ip_family="inet",server_binding_address_ip_type="plain",server_binding_ip="0.0.0.0",server_binding_port="6969",server_binding_protocol="udp"} 43497.71428571428

In the global aggregated metrics:

curl -s "http://localhost:1212/api/v1/metrics?token=MyAccessToken&format=prometheus"

The values should be the average of the server's averages:

udp_avg_connect_processing_time_ns 41911
udp_avg_announce_processing_time_ns 60418
udp_avg_scrape_processing_time_ns 0

41911 = (40326 + 43497.71428571428)/2 = 41911,857142857
60418 = (54773 + 66063.71428571429)/2 = 60418,357142857

The values are rounded because we use a u64 for the global aggregated metrics.

@josecelano josecelano requested a review from da2ce7 June 19, 2025 09:31
@josecelano josecelano added the Bug Incorrect Behavior label Jun 19, 2025
@josecelano josecelano linked an issue Jun 19, 2025 that may be closed by this pull request
3 tasks
@codecov
Copy link

codecov bot commented Jun 19, 2025

Codecov Report

Attention: Patch coverage is 97.59036% with 20 lines in your changes missing coverage. Please review.

Project coverage is 85.52%. Comparing base (6d96650) to head (b423bf6).
Report is 18 commits behind head on develop.

Files with missing lines Patch % Lines
...kages/udp-tracker-server/src/statistics/metrics.rs 96.45% 9 Missing and 1 partial ⚠️
...es/udp-tracker-server/src/statistics/repository.rs 95.83% 6 Missing and 2 partials ⚠️
packages/metrics/src/metric/aggregate/avg.rs 99.00% 2 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #1593      +/-   ##
===========================================
+ Coverage    85.13%   85.52%   +0.39%     
===========================================
  Files          287      289       +2     
  Lines        22306    22842     +536     
  Branches     22306    22842     +536     
===========================================
+ Hits         18990    19536     +546     
+ Misses        2993     2986       -7     
+ Partials       323      320       -3     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

…time metric and update atomic

It also fixes a division by zero bug when the metrics is updated before
the counter for number of conenctions has been increased.

It only avoid the division by zero. I will propoerly fixed with
independent request counter for the moving average calculation.
… time metric and update atomic

It also fixes a division by zero bug when the metrics is updated before
the counter for number of conenctions has been increased.

It only avoid the division by zero. I will propoerly fixed with
independent request counter for the moving average calculation.
…ime metric and update atomic

It also fixes a division by zero bug when the metrics is updated before
the counter for number of conenctions has been increased.

It only avoid the division by zero. I will propoerly fixed with
independent request counter for the moving average calculation.
…e series

We can't count the total number of UDP requests while calculating the
moving average but updating it only for a concrete label set (time
series).

Averages are calculate for each label set. They could be aggregated by
caclulating the average for all time series.
…n moving average calculation

Add a new metric `UDP_TRACKER_SERVER_PERFORMANCE_PROCESSED_REQUESTS_TOTAL` to track
requests processed specifically for performance metrics, eliminating race conditions
in the moving average calculation.

**Changes:**
- Add new metric constant `UDP_TRACKER_SERVER_PERFORMANCE_PROCESSED_REQUESTS_TOTAL`
- Update `recalculate_udp_avg_processing_time_ns()` to use dedicated counter instead of accepted requests total
- Add `udp_processed_requests_total()` method to retrieve the new metric value
- Add `increment_udp_processed_requests_total()` helper method
- Update metric descriptions to include the new counter

**Problem Fixed:**
Previously, the moving average calculation used the accepted requests counter that could be
updated independently, causing race conditions where the same request count was used for
multiple calculations. The new implementation increments its own dedicated counter atomically
during the calculation, ensuring consistency.

**Behavior Change:**
The counter now starts at 0 and gets incremented to 1 on the first calculation call,
then uses proper moving average formula for subsequent calls. This eliminates division
by zero issues and provides more accurate moving averages.

**Tests Updated:**
Updated repository tests to reflect the new atomic behavior where the processed requests
counter is managed specifically for moving average calculations.

Fixes race conditions in UDP request processing time metrics while maintaining
backward compatibility of all public APIs.
@josecelano josecelano force-pushed the 1589-increase-moving-average-for-processing-time-atomically branch from 6aaa4e6 to ed5f1e6 Compare June 20, 2025 06:39
Implements a new aggregate function for calculating averages of metric samples
that match specific label criteria, complementing the existing Sum aggregation.

- **metrics/src/metric/aggregate/avg.rs**: New metric-level average trait and implementations
  - `Avg` trait with `avg()` method for calculating averages
  - Implementation for `Metric<Counter>` returning `f64`
  - Implementation for `Metric<Gauge>` returning `f64`
  - Comprehensive unit tests with edge cases (empty samples, large values, etc.)

- **metrics/src/metric_collection/aggregate/avg.rs**: New collection-level average trait
  - `Avg` trait for `MetricCollection` and `MetricKindCollection<T>`
  - Delegates to metric-level implementations
  - Handles mixed counter/gauge collections by trying counters first, then gauges
  - Returns `None` for non-existent metrics
  - Comprehensive test suite covering various scenarios

- **metrics/src/metric/aggregate/mod.rs**: Export new `avg` module
- **metrics/src/metric_collection/aggregate/mod.rs**: Export new `avg` module

- **metrics/README.md**: Add example usage of the new `Avg` trait in the aggregation section

- **Type Safety**: Returns appropriate types (`f64` for both counters and gauges)
- **Label Filtering**: Supports filtering samples by label criteria like existing `Sum`
- **Edge Case Handling**: Returns `0.0` for empty sample sets
- **Performance**: Uses iterator chains for efficient sample processing
- **Comprehensive Testing**: 205 tests pass including new avg functionality

```rust
use torrust_tracker_metrics::metric_collection::aggregate::Avg;

// Calculate average of all matching samples
let avg_value = metrics.avg(&metric_name, &label_criteria);
```

The implementation follows the same patterns as the existing `Sum` aggregate function,
ensuring consistency in the codebase and maintaining the same level of type safety
and performance characteristics.
Improve AI-generated code.

Moves the collect_matching_samples helper method from individual aggregate
implementations to the generic Metric<T> implementation, making it reusable
across all aggregate functions.

- Add collect_matching_samples method to Metric<T> for filtering samples
  by label criteria
- Remove code duplication between Sum and Avg aggregate implementations
- Improve code organization by centralizing sample collection logic
- Maintain backward compatibility and all existing functionality

This refactoring improves maintainability by providing a single, well-tested
implementation of sample filtering that can be used by current and future
aggregate functions.
Division by zero issues was solved. It can't happen now becuase we
increase the counter at the beggining of the function.

```rust
    #[allow(clippy::cast_precision_loss)]
    pub fn recalculate_udp_avg_processing_time_ns(
        &mut self,
        req_processing_time: Duration,
        label_set: &LabelSet,
        now: DurationSinceUnixEpoch,
    ) -> f64 {
        self.increment_udp_processed_requests_total(label_set, now);

        let processed_requests_total = self.udp_processed_requests_total(label_set) as f64;
        let previous_avg = self.udp_avg_processing_time_ns(label_set);
        let req_processing_time = req_processing_time.as_nanos() as f64;

        // Moving average: https://en.wikipedia.org/wiki/Moving_average
        let new_avg = previous_avg as f64 + (req_processing_time - previous_avg as f64) / processed_requests_total;

        tracing::debug!(
            "Recalculated UDP average processing time for labels {}: {} ns (previous: {} ns, req_processing_time: {} ns, request_processed_total: {})",
            label_set,
            new_avg,
            previous_avg,
            req_processing_time,
            processed_requests_total
        );

        self.update_udp_avg_processing_time_ns(new_avg, label_set, now);

        new_avg
    }
```
…etrics

When calculating aggregated values for processing time metrics across
multiple servers, we need to use the average (.avg()) instead of sum
(.sum()) because the metric samples are already averages per server.

Using sum() on pre-averaged values would produce incorrect results,
as it would add up the averages rather than computing the true average
across all servers.

Changes:
- Add new *_averaged() methods that use .avg() for proper aggregation
- Update services.rs to use the corrected averaging methods
- Import Avg trait for metric collection averaging functionality

Fixes incorrect metric aggregation for:
- udp_avg_connect_processing_time_ns
- udp_avg_announce_processing_time_ns
- udp_avg_scrape_processing_time_ns"
@josecelano josecelano force-pushed the 1589-increase-moving-average-for-processing-time-atomically branch from e4ebb50 to cd57f7a Compare June 20, 2025 09:38
@da2ce7
Copy link
Contributor

da2ce7 commented Jun 20, 2025

wow! big work!

@josecelano
Copy link
Member Author

josecelano commented Jun 20, 2025

wow! big work!

Hey @da2ce7, I'm getting better at using AI models, and the models are also improving significantly. For example:

This is what the model did to add the new average aggregate function to the metrics package:

384b887

And this is what I changed from its implementation:

8fbcf90

I decided to commit it separately just to document things that the AI is not good at yet.

Adds a comprehensive unit test to validate thread safety when updating
UDP_TRACKER_SERVER_PERFORMANCE_AVG_PROCESSING_TIME_NS metrics under
concurrent load.

The test:
- Spawns 200 concurrent tasks (100 per server) simulating two UDP servers
- Server 1: cycles through [1000, 2000, 3000, 4000, 5000] ns processing times
- Server 2: cycles through [2000, 3000, 4000, 5000, 6000] ns processing times
- Validates request counts, average calculations, and metric relationships
- Uses tolerance-based assertions (±50ns) to account for moving average
  calculation variations in concurrent environments
- Ensures thread safety and mathematical correctness of the metrics system

This test helps ensure the UDP tracker server's metrics collection remains
accurate and thread-safe under high-concurrency scenarios.
@josecelano josecelano marked this pull request as ready for review June 20, 2025 16:52
…cs race condition test

Restructures the race condition test to follow clear Arrange-Act-Assert pattern
and eliminates code duplication through helper function extraction.

The test maintains identical functionality while being more maintainable,
readable, and following DRY principles. All 200 concurrent tasks still
validate thread safety and mathematical correctness of the metrics system.
@josecelano josecelano force-pushed the 1589-increase-moving-average-for-processing-time-atomically branch from 69e4670 to b423bf6 Compare June 20, 2025 16:59
@josecelano
Copy link
Member Author

ACK b423bf6

@josecelano josecelano changed the title Wrong UDP Average Connect Time metric Fix bug: Wrong UDP Average Connect Time metric Jun 20, 2025
@josecelano josecelano merged commit 7b1c190 into torrust:develop Jun 20, 2025
21 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Bug Incorrect Behavior

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Overhaul stats: Wrong UDP Average Connect Time metric

2 participants