Fix bug: Wrong UDP Average Connect Time metric #1593
Conversation
…time calculation from Repository to Metrics
Codecov Report

Attention: Patch coverage is …

Additional details and impacted files:

```
@@            Coverage Diff            @@
##           develop    #1593    +/-   ##
===========================================
+ Coverage    85.13%   85.52%   +0.39%
===========================================
  Files          287      289       +2
  Lines        22306    22842     +536
  Branches     22306    22842     +536
===========================================
+ Hits         18990    19536     +546
+ Misses        2993     2986       -7
+ Partials       323      320       -3
```

View full report in Codecov by Sentry.
…time metric and update atomic

It also fixes a division-by-zero bug that happens when the metric is updated before the counter for the number of connections has been increased. This only avoids the division by zero; I will fix it properly with an independent request counter for the moving average calculation.
…e series

We can't use the total number of UDP requests when calculating the moving average while updating it only for a concrete label set (time series). Averages are calculated for each label set; they could be aggregated by calculating the average across all time series.
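The per-label-set approach described above can be sketched as follows (a minimal illustration with a hypothetical `PerLabelAvg` helper, not the tracker's actual types): each time series keeps its own count and moving average, and the global figure is the average of the per-series averages.

```rust
use std::collections::HashMap;

/// Hypothetical helper: one moving average per label set (time series).
#[derive(Default)]
pub struct PerLabelAvg {
    // label set -> (sample count, current average)
    series: HashMap<String, (u64, f64)>,
}

impl PerLabelAvg {
    /// Update the moving average for a single label set.
    pub fn record(&mut self, label_set: &str, sample: f64) -> f64 {
        let entry = self.series.entry(label_set.to_string()).or_insert((0, 0.0));
        entry.0 += 1;
        // Incremental moving average: avg += (x - avg) / n
        entry.1 += (sample - entry.1) / entry.0 as f64;
        entry.1
    }

    /// Aggregate across all time series by averaging the per-series averages.
    pub fn avg_of_averages(&self) -> f64 {
        if self.series.is_empty() {
            return 0.0;
        }
        let sum: f64 = self.series.values().map(|(_, avg)| avg).sum();
        sum / self.series.len() as f64
    }
}

fn main() {
    let mut m = PerLabelAvg::default();
    m.record("server=a", 1000.0);
    m.record("server=a", 3000.0); // avg for server=a is now 2000
    m.record("server=b", 4000.0); // avg for server=b is 4000
    println!("{}", m.avg_of_averages()); // (2000 + 4000) / 2 = 3000
}
```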
…n moving average calculation

Add a new metric `UDP_TRACKER_SERVER_PERFORMANCE_PROCESSED_REQUESTS_TOTAL` to track requests processed specifically for performance metrics, eliminating race conditions in the moving average calculation.

**Changes:**

- Add new metric constant `UDP_TRACKER_SERVER_PERFORMANCE_PROCESSED_REQUESTS_TOTAL`
- Update `recalculate_udp_avg_processing_time_ns()` to use the dedicated counter instead of the accepted requests total
- Add `udp_processed_requests_total()` method to retrieve the new metric value
- Add `increment_udp_processed_requests_total()` helper method
- Update metric descriptions to include the new counter

**Problem Fixed:**

Previously, the moving average calculation used the accepted requests counter, which could be updated independently, causing race conditions where the same request count was used for multiple calculations. The new implementation increments its own dedicated counter atomically during the calculation, ensuring consistency.

**Behavior Change:**

The counter now starts at 0 and gets incremented to 1 on the first calculation call, then uses the proper moving average formula for subsequent calls. This eliminates division-by-zero issues and provides more accurate moving averages.

**Tests Updated:**

Updated repository tests to reflect the new atomic behavior where the processed requests counter is managed specifically for moving average calculations.

Fixes race conditions in UDP request processing time metrics while maintaining backward compatibility of all public APIs.
Force-pushed from 6aaa4e6 to ed5f1e6 (compare)
Implements a new aggregate function for calculating averages of metric samples that match specific label criteria, complementing the existing `Sum` aggregation.

- **metrics/src/metric/aggregate/avg.rs**: New metric-level average trait and implementations
  - `Avg` trait with `avg()` method for calculating averages
  - Implementation for `Metric<Counter>` returning `f64`
  - Implementation for `Metric<Gauge>` returning `f64`
  - Comprehensive unit tests with edge cases (empty samples, large values, etc.)
- **metrics/src/metric_collection/aggregate/avg.rs**: New collection-level average trait
  - `Avg` trait for `MetricCollection` and `MetricKindCollection<T>`
  - Delegates to metric-level implementations
  - Handles mixed counter/gauge collections by trying counters first, then gauges
  - Returns `None` for non-existent metrics
  - Comprehensive test suite covering various scenarios
- **metrics/src/metric/aggregate/mod.rs**: Export new `avg` module
- **metrics/src/metric_collection/aggregate/mod.rs**: Export new `avg` module
- **metrics/README.md**: Add example usage of the new `Avg` trait in the aggregation section

Key properties:

- **Type Safety**: Returns appropriate types (`f64` for both counters and gauges)
- **Label Filtering**: Supports filtering samples by label criteria like the existing `Sum`
- **Edge Case Handling**: Returns `0.0` for empty sample sets
- **Performance**: Uses iterator chains for efficient sample processing
- **Comprehensive Testing**: 205 tests pass including the new avg functionality

```rust
use torrust_tracker_metrics::metric_collection::aggregate::Avg;

// Calculate average of all matching samples
let avg_value = metrics.avg(&metric_name, &label_criteria);
```

The implementation follows the same patterns as the existing `Sum` aggregate function, ensuring consistency in the codebase and maintaining the same level of type safety and performance characteristics.
Improve AI-generated code.

Moves the `collect_matching_samples` helper method from individual aggregate implementations to the generic `Metric<T>` implementation, making it reusable across all aggregate functions.

- Add `collect_matching_samples` method to `Metric<T>` for filtering samples by label criteria
- Remove code duplication between `Sum` and `Avg` aggregate implementations
- Improve code organization by centralizing sample collection logic
- Maintain backward compatibility and all existing functionality

This refactoring improves maintainability by providing a single, well-tested implementation of sample filtering that can be used by current and future aggregate functions.
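The idea of hoisting the helper onto the generic impl can be sketched like this (hypothetical, much-simplified stand-ins for the crate's `Metric` and `Sample` types; the real signatures may differ):

```rust
use std::marker::PhantomData;

// Hypothetical, simplified stand-ins for the metrics crate's types.
pub struct Sample {
    pub labels: Vec<(String, String)>,
    pub value: f64,
}

pub struct Metric<T> {
    pub samples: Vec<Sample>,
    _kind: PhantomData<T>,
}

pub struct Counter;

impl<T> Metric<T> {
    pub fn new(samples: Vec<Sample>) -> Self {
        Self { samples, _kind: PhantomData }
    }

    /// Defined once on the generic impl so Sum, Avg, and future
    /// aggregates share a single filtering implementation.
    pub fn collect_matching_samples(&self, criteria: &[(String, String)]) -> Vec<&Sample> {
        self.samples
            .iter()
            .filter(|s| criteria.iter().all(|c| s.labels.contains(c)))
            .collect()
    }
}

fn main() {
    let metric: Metric<Counter> = Metric::new(vec![
        Sample { labels: vec![("server".into(), "a".into())], value: 1.0 },
        Sample { labels: vec![("server".into(), "b".into())], value: 2.0 },
    ]);
    let matching = metric.collect_matching_samples(&[("server".into(), "a".into())]);
    println!("{}", matching.len()); // 1
}
```

Both `Sum` and `Avg` can then call `collect_matching_samples` instead of each carrying a private copy of the filtering loop.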
The division-by-zero issue is solved. It can't happen now because we
increase the counter at the beginning of the function.
```rust
#[allow(clippy::cast_precision_loss)]
pub fn recalculate_udp_avg_processing_time_ns(
    &mut self,
    req_processing_time: Duration,
    label_set: &LabelSet,
    now: DurationSinceUnixEpoch,
) -> f64 {
    // The counter is incremented first, so it is always >= 1 below
    // and the division cannot be by zero.
    self.increment_udp_processed_requests_total(label_set, now);

    let processed_requests_total = self.udp_processed_requests_total(label_set) as f64;
    let previous_avg = self.udp_avg_processing_time_ns(label_set);
    let req_processing_time = req_processing_time.as_nanos() as f64;

    // Moving average: https://en.wikipedia.org/wiki/Moving_average
    let new_avg = previous_avg as f64 + (req_processing_time - previous_avg as f64) / processed_requests_total;

    tracing::debug!(
        "Recalculated UDP average processing time for labels {}: {} ns (previous: {} ns, req_processing_time: {} ns, processed_requests_total: {})",
        label_set,
        new_avg,
        previous_avg,
        req_processing_time,
        processed_requests_total
    );

    self.update_udp_avg_processing_time_ns(new_avg, label_set, now);

    new_avg
}
```
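As a sanity check on the formula used above, here is a standalone sketch (independent of the tracker code): applying `avg += (x - avg) / n` sample by sample reproduces the plain arithmetic mean.

```rust
/// Incremental moving average: bump the count first, then
/// avg += (x - avg) / n — the same update order as the tracker
/// function, which is what rules out division by zero.
fn incremental_avg(samples: &[f64]) -> f64 {
    let mut n = 0u64;
    let mut avg = 0.0;
    for &x in samples {
        n += 1; // counter incremented before the division, so n >= 1
        avg += (x - avg) / n as f64;
    }
    avg
}

fn main() {
    let samples = [1000.0, 2000.0, 3000.0, 4000.0, 5000.0];
    // Matches the arithmetic mean of the same samples.
    println!("{}", incremental_avg(&samples)); // 3000
}
```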
…etrics

When calculating aggregated values for processing time metrics across multiple servers, we need to use the average (`.avg()`) instead of the sum (`.sum()`) because the metric samples are already averages per server. Using `sum()` on pre-averaged values would produce incorrect results, as it would add up the averages rather than computing the true average across all servers.

Changes:

- Add new `*_averaged()` methods that use `.avg()` for proper aggregation
- Update `services.rs` to use the corrected averaging methods
- Import the `Avg` trait for metric collection averaging functionality

Fixes incorrect metric aggregation for:

- `udp_avg_connect_processing_time_ns`
- `udp_avg_announce_processing_time_ns`
- `udp_avg_scrape_processing_time_ns`
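A standalone illustration of why `sum()` is the wrong aggregation for pre-averaged samples (the values are examples from this PR's test run; `avg_of` is a local helper, not the crate's API):

```rust
/// Average the pre-averaged per-server samples (average of averages).
fn avg_of(per_server_avgs: &[f64]) -> f64 {
    per_server_avgs.iter().sum::<f64>() / per_server_avgs.len() as f64
}

fn main() {
    // Per-server averages, e.g. avg connect processing time in ns.
    let avgs = [40326.0, 43497.71428571428];

    // Wrong: summing pre-averaged samples just adds the averages up.
    let wrong: f64 = avgs.iter().sum();

    // Right: averaging the averages gives the cross-server figure.
    let right = avg_of(&avgs);

    println!("sum = {wrong}, avg = {right}");
}
```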
Force-pushed from e4ebb50 to cd57f7a (compare)
wow! big work!
Hey @da2ce7, I'm getting better at using AI models, and the models are also improving significantly. For example:

This is what the model did to add the new average aggregate function to the …

And this is what I changed from its implementation: …

I decided to commit it separately just to document things that the AI is not good at yet.
Adds a comprehensive unit test to validate thread safety when updating `UDP_TRACKER_SERVER_PERFORMANCE_AVG_PROCESSING_TIME_NS` metrics under concurrent load. The test:

- Spawns 200 concurrent tasks (100 per server) simulating two UDP servers
- Server 1: cycles through [1000, 2000, 3000, 4000, 5000] ns processing times
- Server 2: cycles through [2000, 3000, 4000, 5000, 6000] ns processing times
- Validates request counts, average calculations, and metric relationships
- Uses tolerance-based assertions (±50ns) to account for moving average calculation variations in concurrent environments
- Ensures thread safety and mathematical correctness of the metrics system

This test helps ensure the UDP tracker server's metrics collection remains accurate and thread-safe under high-concurrency scenarios.
…cs race condition test Restructures the race condition test to follow clear Arrange-Act-Assert pattern and eliminates code duplication through helper function extraction. The test maintains identical functionality while being more maintainable, readable, and following DRY principles. All 200 concurrent tasks still validate thread safety and mathematical correctness of the metrics system.
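The shape of such a concurrency test can be sketched with std threads and a mutex (a much-reduced, hypothetical `concurrent_avg` helper standing in for the tracker's async tasks and repository): count and average are updated under one lock, so the pair stays consistent regardless of interleaving.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

/// Hypothetical stand-in for the metrics repository: a shared
/// (count, average) pair updated together under one lock.
fn concurrent_avg(updates: usize) -> (u64, f64) {
    let state = Arc::new(Mutex::new((0u64, 0.0f64)));
    let samples = [1000.0, 2000.0, 3000.0, 4000.0, 5000.0];

    let handles: Vec<_> = (0..updates)
        .map(|i| {
            let state = Arc::clone(&state);
            thread::spawn(move || {
                let x = samples[i % samples.len()];
                // Count and average change under the same lock, so a
                // concurrent writer can't observe a stale count (the
                // race this PR fixes).
                let mut s = state.lock().unwrap();
                s.0 += 1;
                s.1 += (x - s.1) / s.0 as f64;
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }

    let result = *state.lock().unwrap();
    result
}

fn main() {
    let (count, avg) = concurrent_avg(100);
    // 100 updates cycling through 5 samples: the multiset mean is 3000,
    // and the incremental average lands on it up to float rounding.
    println!("count = {count}, avg = {avg:.0}");
}
```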
Force-pushed from 69e4670 to b423bf6 (compare)
ACK b423bf6
Fix bug: Wrong UDP Average Connect Time metric.
Context
See #1589.
Subtasks
- Use the `Avg` aggregate instead of `Sum` when aggregating per-server average metrics in the `metrics` package.

How to test
Run the tracker with `cargo run` and send one `announce` request per UDP server. In the labelled metrics, there should be two metric samples like these:
curl -s "http://localhost:1212/api/v1/metrics?token=MyAccessToken&format=prometheus"
In the global aggregated metrics:
curl -s "http://localhost:1212/api/v1/metrics?token=MyAccessToken&format=prometheus"
The values should be the average of the server's averages:
41911 = (40326 + 43497.71428571428) / 2 = 41911.857142857
60418 = (54773 + 66063.71428571429) / 2 = 60418.357142857
The values are truncated because we use a `u64` for the global aggregated metrics.
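The arithmetic above can be checked with a short snippet (standalone sketch; `avg` is a local helper, and the `as u64` cast reproduces the truncation the aggregated endpoint applies):

```rust
/// Average of per-server averages, as used for the global
/// aggregated metrics.
fn avg(xs: &[f64]) -> f64 {
    xs.iter().sum::<f64>() / xs.len() as f64
}

fn main() {
    let connect = [40326.0, 43497.71428571428];
    let announce = [54773.0, 66063.71428571429];

    // Casting to u64 drops the fractional part, matching the
    // truncated values reported by the endpoint.
    println!("{}", avg(&connect) as u64);  // 41911
    println!("{}", avg(&announce) as u64); // 60418
}
```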