[WIP] Enhance metrics configuration, add health monitoring tests, security hardening, perf optimizations & loads of tests#463

Merged
axkurcom merged 25 commits into telemt:flow-sec from DavidOsipov:pr-sec-1
Mar 18, 2026

@DavidOsipov DavidOsipov commented Mar 17, 2026

Rebased and added a potential fix for #457.

Metrics endpoint configuration improvements:

  • Added a new metrics_listen option to ServerConfig and config.toml that binds the metrics endpoint to a specific address and port (for example, "0.0.0.0:9090"). This option takes precedence over the legacy metrics_port option. The implementation enforces that precedence and handles invalid addresses (Cargo.toml, config.toml, src/config/types.rs, src/maestro/runtime_tasks.rs, src/metrics.rs).
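The precedence rule can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `MetricsBind` and `resolve_metrics_bind` are hypothetical names, and the sketch assumes that an unparsable `metrics_listen` is reported as an error rather than silently falling back to `metrics_port`.

```rust
use std::net::{AddrParseError, SocketAddr};

/// Where the metrics endpoint should bind (hypothetical type for illustration).
#[derive(Debug, PartialEq)]
enum MetricsBind {
    /// Explicit address taken from `metrics_listen`.
    Listen(SocketAddr),
    /// Legacy `metrics_port`; the server picks the addresses itself.
    Port(u16),
}

/// `metrics_listen` wins over `metrics_port`. An invalid `metrics_listen`
/// yields an error instead of falling back to the port.
fn resolve_metrics_bind(
    metrics_listen: Option<&str>,
    metrics_port: Option<u16>,
) -> Result<Option<MetricsBind>, AddrParseError> {
    match metrics_listen {
        Some(listen) => listen
            .parse::<SocketAddr>()
            .map(|addr| Some(MetricsBind::Listen(addr))),
        None => Ok(metrics_port.map(MetricsBind::Port)),
    }
}
```

With both options set, `resolve_metrics_bind(Some("0.0.0.0:9090"), Some(9100))` resolves to the explicit listen address; with neither set, the endpoint stays disabled.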

Reliability of user connections and IP reservations:

  • Refactored the user connection and IP reservation logic: added a new RAII struct, UserConnectionReservation, which guarantees that connection slots and IP addresses are always released when a relay task is aborted or cut over. This prevents resource leaks and makes enforcement of user limits more reliable (src/proxy/client.rs).
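A minimal sketch of the RAII idea, under simplifying assumptions: `UserState` and `Reservation` are hypothetical stand-ins (not the PR's `UserConnectionReservation`), the slot limit is a plain counter, and IP reservations are reference-counted per address.

```rust
use std::collections::HashMap;
use std::net::IpAddr;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Mutex};

/// Shared per-user accounting (hypothetical stand-in for the real state).
#[derive(Default)]
struct UserState {
    active_connections: AtomicUsize,
    ip_refs: Mutex<HashMap<IpAddr, usize>>,
}

/// RAII guard holding one connection slot and one IP reservation.
struct Reservation {
    state: Arc<UserState>,
    ip: IpAddr,
}

impl Reservation {
    /// Claims a slot and reserves `ip`; returns `None` when the limit is reached.
    fn acquire(state: Arc<UserState>, ip: IpAddr, max_connections: usize) -> Option<Self> {
        state
            .active_connections
            .fetch_update(Ordering::AcqRel, Ordering::Acquire, |n| {
                (n < max_connections).then_some(n + 1)
            })
            .ok()?;
        *state.ip_refs.lock().unwrap().entry(ip).or_insert(0) += 1;
        Some(Self { state, ip })
    }
}

impl Drop for Reservation {
    /// Runs on normal return, early return, panic unwind, and task abort.
    fn drop(&mut self) {
        self.state.active_connections.fetch_sub(1, Ordering::AcqRel);
        // Tolerate a poisoned lock so cleanup still happens during unwinding.
        let mut ips = self.state.ip_refs.lock().unwrap_or_else(|e| e.into_inner());
        let remaining = ips.get_mut(&self.ip).map(|count| {
            *count -= 1;
            *count
        });
        if remaining == Some(0) {
            ips.remove(&self.ip);
        }
    }
}
```

Because cleanup lives in `Drop`, no code path has to remember to release the slot or the IP explicitly.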
Testing improvements:

  • Added regression and integration tests verifying that user connection slots and reserved IP addresses are released correctly when a relay is aborted or cut over, and that the new metrics binding logic works as intended (src/proxy/client_security_tests.rs, src/main.rs).

Direct relay connection management:

  • Updated the direct relay logic: the current connection count is now managed through a new lease-based approach, which improves reliability and mirrors the user connection management logic (src/proxy/direct_relay.rs).
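The lease-based gauge can be illustrated like this. It is only a sketch: `RouteGauge`, `GaugeLease`, and `relay` are hypothetical names, not the `RouteConnectionLease` API added in src/stats/mod.rs.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;

/// Gauge of currently open connections on one route (hypothetical type).
#[derive(Default)]
struct RouteGauge(AtomicU64);

impl RouteGauge {
    fn current(&self) -> u64 {
        self.0.load(Ordering::Acquire)
    }
}

/// Lease that keeps the gauge incremented for as long as it is alive.
struct GaugeLease {
    gauge: Arc<RouteGauge>,
}

impl GaugeLease {
    /// Increments the gauge and returns a lease that undoes it on drop.
    fn acquire(gauge: &Arc<RouteGauge>) -> Self {
        gauge.0.fetch_add(1, Ordering::AcqRel);
        Self { gauge: Arc::clone(gauge) }
    }
}

impl Drop for GaugeLease {
    fn drop(&mut self) {
        self.gauge.0.fetch_sub(1, Ordering::AcqRel);
    }
}

/// A relay body that may fail early; the lease is released on every path.
fn relay(gauge: &Arc<RouteGauge>, fail: bool) -> Result<u64, &'static str> {
    let _lease = GaugeLease::acquire(gauge);
    if fail {
        return Err("upstream closed");
    }
    Ok(gauge.current())
}
```

Compared with a manual increment/decrement pair, an early `return`, a `?`, or a panic between the two calls can no longer leave the gauge permanently inflated.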

This PR deliberately includes a large number of tests (most likely more than 80% of the code changes) in order to catch hidden bugs.

kutovoys and others added 13 commits March 17, 2026 12:58
- Bump telemt dependency version from 3.3.15 to 3.3.19.
- Add `metrics_listen` option to `config.toml` for specifying a custom address for the metrics endpoint.
- Update `ServerConfig` struct to include `metrics_listen` and adjust logic in `spawn_metrics_if_configured` to prioritize this new option over `metrics_port`.
- Enhance error handling for invalid listen addresses in metrics setup.
feat: add metrics_listen option for metrics endpoint bind address
- Introduced adversarial tests to validate the behavior of the health monitoring system under various conditions, including the management of draining writers.
- Implemented integration tests to ensure the health monitor correctly handles expired and empty draining writers.
- Added regression tests to verify the functionality of the draining writers' cleanup process, ensuring it adheres to the defined thresholds and budgets.
- Updated the module structure to include the new test files for better organization and maintainability.
Add health monitoring tests for draining writers
Copilot AI review requested due to automatic review settings March 17, 2026 15:10

Copilot AI left a comment


Pull request overview

This PR improves operational reliability and observability by tightening metrics endpoint binding configuration and hardening connection/resource accounting so limits don’t get “stuck” after task aborts/cutovers, with added regression/integration tests around middle-proxy health monitoring and connection gauges.

Changes:

  • Add server.metrics_listen (takes precedence over metrics_port) and update metrics task spawning/binding logic accordingly.
  • Introduce RAII “lease” guards for route connection gauges and refactor relay paths to use them, with security/regression tests for abort/cutover cases.
  • Rework middle-proxy draining-writer reaping to be budgeted per health tick and add extensive health regression/integration/adversarial tests.
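As a rough illustration of per-tick budgeting, the sketch below closes at most `budget` eligible writers per call and leaves the rest for later ticks. The `DrainingWriter` model, its fields, and the eligibility rule (empty or past its deadline) are assumptions made for the example, not the PR's actual data structures.

```rust
use std::time::{Duration, Instant};

/// A writer that is being drained (hypothetical, simplified model).
#[derive(Debug, Clone, PartialEq)]
struct DrainingWriter {
    id: u64,
    bound_clients: usize,
    drain_deadline: Instant,
}

/// Picks writers to close during one health tick, never more than `budget`.
/// A writer is eligible when it has no bound clients or its deadline passed.
/// Eligible writers beyond the budget stay in `writers` for the next tick.
fn reap_draining(writers: &mut Vec<DrainingWriter>, now: Instant, budget: usize) -> Vec<u64> {
    let mut closed = Vec::new();
    writers.retain(|w| {
        let eligible = w.bound_clients == 0 || w.drain_deadline <= now;
        if eligible && closed.len() < budget {
            closed.push(w.id);
            false // remove from the backlog
        } else {
            true // keep for a later tick
        }
    });
    closed
}

/// Convenience constructor used when experimenting with the sketch.
fn draining(id: u64, bound_clients: usize, deadline_in: Duration, now: Instant) -> DrainingWriter {
    DrainingWriter { id, bound_clients, drain_deadline: now + deadline_in }
}
```

Bounding the work per tick keeps a large draining backlog from stalling a single health cycle, at the cost of the backlog converging over several cycles.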

Reviewed changes

Copilot reviewed 21 out of 22 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| `Cargo.toml` | Bump crate version to 3.3.20. |
| `Cargo.lock` | Lockfile updated for version bump. |
| `config.toml` | Document new metrics_listen option. |
| `docs/fronting-splitting/TLS-F-TCP-S.ru.md` | Fix TOML examples formatting/quoting in docs. |
| `src/config/types.rs` | Add ServerConfig.metrics_listen with docs and default. |
| `src/maestro/runtime_tasks.rs` | Enforce precedence/validation for metrics_listen when spawning metrics task. |
| `src/metrics.rs` | Add listen parameter and bind metrics to a specific address when configured. |
| `src/stats/mod.rs` | Add RouteConnectionLease RAII guard + `Stats::acquire_*_connection_lease()` APIs. |
| `src/stats/connection_lease_security_tests.rs` | Tests ensuring route gauge leases are balanced under panic/abort/concurrency. |
| `src/proxy/direct_relay.rs` | Use direct connection lease instead of manual increment/decrement. |
| `src/proxy/direct_relay_security_tests.rs` | Regression tests for gauge release on abort/cutover in direct relay. |
| `src/proxy/middle_relay.rs` | Use ME connection lease instead of manual increment/decrement. |
| `src/proxy/middle_relay_security_tests.rs` | Regression tests for gauge release on abort/cutover in middle relay. |
| `src/proxy/client.rs` | Add UserConnectionReservation RAII guard for user slot + IP reservation, and refactor auth handler to use it. |
| `src/proxy/client_security_tests.rs` | Regression tests for releasing user slot and IP reservation on abort/cutover; add synthetic_local_addr tests. |
| `src/main.rs` | Include ip_tracker_regression_tests under cfg(test). |
| `src/ip_tracker_regression_tests.rs` | Add broad regression/stress tests for UserIpTracker behavior. |
| `src/transport/middle_proxy/mod.rs` | Register new health test modules under cfg(test). |
| `src/transport/middle_proxy/health.rs` | Budgeted draining-writer close logic + new drain close budget function + tests updates. |
| `src/transport/middle_proxy/health_regression_tests.rs` | New regression tests for draining-writer reaping behavior and invariants. |
| `src/transport/middle_proxy/health_integration_tests.rs` | New integration tests around me_health_monitor draining behavior across cycles. |
| `src/transport/middle_proxy/health_adversarial_tests.rs` | New adversarial/stress tests for draining backlog handling and invariants. |


…lease method, improve cleanup on drop, and implement tests for immediate release and concurrent handling

Copilot AI left a comment


Pull request overview

This PR improves operational robustness and observability by (1) adding a dedicated metrics_listen configuration option (with correct precedence over the legacy metrics_port) and (2) hardening connection/resource accounting via RAII guards and expanded health-monitoring cleanup, backed by new regression/integration/adversarial tests.

Changes:

  • Added server.metrics_listen (TOML + config types) and updated metrics startup/binding logic to prefer a single explicit listen address over the legacy dual-stack metrics_port behavior.
  • Introduced RAII-based leases for route connection gauges and user connection/IP reservations to ensure counters/reservations are released on abort/cutover/panic paths.
  • Reworked middle-proxy draining writer cleanup to be snapshot-based and budgeted per health cycle, and added extensive tests for health-monitor behavior and cleanup invariants.

Reviewed changes

Copilot reviewed 21 out of 22 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| `src/maestro/runtime_tasks.rs` | Implements precedence/validation for metrics_listen and passes it through to metrics serving. |
| `src/metrics.rs` | Adds optional single-address bind path for metrics endpoint. |
| `src/config/types.rs` | Adds ServerConfig.metrics_listen with documentation and defaults. |
| `config.toml` | Documents the new metrics_listen option. |
| `src/stats/mod.rs` | Adds RouteConnectionLease and `Stats::acquire_*_connection_lease()` for RAII gauge management. |
| `src/stats/connection_lease_security_tests.rs` | Adds tests ensuring leases release on drop/panic/abort and under concurrency. |
| `src/proxy/direct_relay.rs` | Switches direct relay connection gauge tracking to RAII lease. |
| `src/proxy/middle_relay.rs` | Switches middle relay connection gauge tracking to RAII lease. |
| `src/proxy/direct_relay_security_tests.rs` | Adds regression tests ensuring gauge release on abort/cutover for direct relay. |
| `src/proxy/middle_relay_security_tests.rs` | Adds regression tests ensuring gauge release on abort/cutover for middle relay. |
| `src/proxy/client.rs` | Adds UserConnectionReservation RAII guard and refactors limit/IP reservation handling to use it. |
| `src/proxy/client_security_tests.rs` | Adds tests for reservation release on abort/cutover and explicit release semantics. |
| `src/transport/middle_proxy/health.rs` | Makes draining-writer reap logic snapshot-based and adds a per-cycle close budget helper. |
| `src/transport/middle_proxy/mod.rs` | Registers new health test modules under `#[cfg(test)]`. |
| `src/transport/middle_proxy/health_regression_tests.rs` | Adds targeted regression tests for draining writer reap behavior. |
| `src/transport/middle_proxy/health_integration_tests.rs` | Adds integration tests that run the health monitor and validate convergence. |
| `src/transport/middle_proxy/health_adversarial_tests.rs` | Adds adversarial/churn/storm tests for budgeted draining cleanup behavior. |
| `src/main.rs` | Wires in new ip_tracker_regression_tests under `#[cfg(test)]`. |
| `src/ip_tracker_regression_tests.rs` | Adds extensive regression tests for IP tracker policies and concurrency behavior. |
| `docs/fronting-splitting/TLS-F-TCP-S.ru.md` | Updates TOML examples formatting/quoting. |
| `Cargo.toml` / `Cargo.lock` | Bumps crate version to 3.3.20. |


Copilot AI left a comment


Pull request overview

This PR addresses connection-limit/resource-leak scenarios (issue #457) by making connection/IP accounting more robust under task abort/cutover, improves metrics endpoint binding configuration, and adds extensive regression/integration coverage around these behaviors.

Changes:

  • Add server.metrics_listen (with precedence over metrics_port) and update metrics spawning/binding behavior accordingly.
  • Introduce RAII “lease/reservation” patterns to ensure route-connection gauges and per-user connection/IP reservations are released on early-return, abort, and cutover.
  • Add a large suite of new regression/integration/adversarial tests for middle-proxy health draining, IP tracker behavior, and abort/cutover cleanup paths.

Reviewed changes

Copilot reviewed 20 out of 21 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| `src/maestro/runtime_tasks.rs` | Implements metrics_listen precedence and passes listen target into metrics server. |
| `src/metrics.rs` | Adds single-address binding mode when metrics_listen is set; keeps dual-stack binding for metrics_port. |
| `src/config/types.rs` | Extends ServerConfig with metrics_listen and documents precedence vs metrics_port. |
| `config.toml` | Documents new metrics_listen option in the sample config. |
| `src/stats/mod.rs` | Adds RouteConnectionLease RAII type and `Stats::{acquire_direct_connection_lease, acquire_me_connection_lease}` helpers. |
| `src/proxy/direct_relay.rs` | Switches direct-route gauge tracking from manual inc/dec to RAII lease. |
| `src/proxy/middle_relay.rs` | Switches middle-route gauge tracking from manual inc/dec to RAII lease. |
| `src/proxy/client.rs` | Adds UserConnectionReservation RAII to ensure per-user slot + IP reservation cleanup on abort/cutover/errors. |
| `src/transport/middle_proxy/health.rs` | Refactors draining-writer reap logic to use registry snapshots, adds close budgeting, and makes helpers visible to test modules. |
| `src/transport/middle_proxy/mod.rs` | Registers new middle-proxy health test modules under `#[cfg(test)]`. |
| `src/transport/middle_proxy/health_regression_tests.rs` | Adds targeted regression tests for draining-writer reaping behavior/state cleanup. |
| `src/transport/middle_proxy/health_integration_tests.rs` | Adds monitor-loop integration tests for drain backlog behavior. |
| `src/transport/middle_proxy/health_adversarial_tests.rs` | Adds stress/adversarial tests to validate bounded draining behavior and warn-state invariants. |
| `src/stats/connection_lease_security_tests.rs` | Adds tests ensuring connection leases are released on panic/unwind, abort, and concurrency churn. |
| `src/proxy/direct_relay_security_tests.rs` | Adds abort/cutover tests verifying direct-route gauge release. |
| `src/proxy/middle_relay_security_tests.rs` | Adds abort/cutover tests verifying middle-route gauge release. |
| `src/proxy/client_security_tests.rs` | Adds tests verifying per-user slot + IP reservation release on abort/cutover/error and explicit release semantics. |
| `src/main.rs` | Wires in new IP-tracker regression test module under `#[cfg(test)]`. |
| `src/ip_tracker_regression_tests.rs` | Adds extensive regression/stress coverage for UserIpTracker behavior across modes and concurrency. |
| `docs/fronting-splitting/TLS-F-TCP-S.ru.md` | Updates TOML examples formatting/quoting. |
| `Cargo.lock` | Bumps crate version entry (3.3.19 → 3.3.20). |


…d IP cleanup and implement tests for user expiration and connection limits

Copilot AI left a comment


Pull request overview

This PR addresses connection-limit/resource-leak scenarios (Issue #457) by improving draining-writer cleanup and making connection/IP accounting more robust, while also enhancing metrics endpoint configuration via a new metrics_listen option.

Changes:

  • Added metrics_listen (with priority over metrics_port) and updated metrics startup/binding logic accordingly.
  • Introduced RAII-style “lease/reservation” patterns to ensure connection gauges, per-user connection slots, and IP reservations are released on normal exit, cutover, and task abort.
  • Added extensive regression/integration/adversarial tests for ME draining-writer reaping, route gauge leases, and IP-tracker behaviors.

Reviewed changes

Copilot reviewed 20 out of 21 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| `src/transport/middle_proxy/mod.rs` | Registers new health monitoring test modules under cfg(test). |
| `src/transport/middle_proxy/health.rs` | Refactors draining-writer reaping; adds close budgeting and exposes helpers for tests. |
| `src/transport/middle_proxy/health_regression_tests.rs` | Adds regression tests for draining-writer reaping and warn-state cleanup. |
| `src/transport/middle_proxy/health_integration_tests.rs` | Adds integration tests validating monitor-driven cleanup across cycles. |
| `src/transport/middle_proxy/health_adversarial_tests.rs` | Adds stress/adversarial tests for bounded cleanup and convergence properties. |
| `src/stats/mod.rs` | Adds RouteConnectionLease and acquisition helpers for direct/ME connection gauges. |
| `src/stats/connection_lease_security_tests.rs` | Adds tests ensuring leases balance under panic/abort/concurrency. |
| `src/proxy/middle_relay.rs` | Switches ME connection gauge tracking to lease-based RAII. |
| `src/proxy/middle_relay_security_tests.rs` | Adds abort/cutover tests ensuring ME gauge lease is released. |
| `src/proxy/direct_relay.rs` | Switches direct connection gauge tracking to lease-based RAII. |
| `src/proxy/direct_relay_security_tests.rs` | Adds abort/cutover tests ensuring direct gauge lease is released. |
| `src/proxy/client.rs` | Adds UserConnectionReservation RAII to reliably release user slot + IP reservation. |
| `src/proxy/client_security_tests.rs` | Adds extensive tests for abort/cutover/release/drop behaviors and IP limits. |
| `src/metrics.rs` | Adds optional listen binding path (single address) for metrics endpoint. |
| `src/maestro/runtime_tasks.rs` | Implements metrics_listen precedence and passes through to metrics server. |
| `src/config/types.rs` | Adds ServerConfig.metrics_listen and documents precedence/behavior. |
| `config.toml` | Documents the new metrics_listen option. |
| `src/main.rs` | Includes new ip_tracker_regression_tests module under cfg(test). |
| `src/ip_tracker_regression_tests.rs` | Adds regression/stress tests for IP tracker policies and state cleanup. |
| `docs/fronting-splitting/TLS-F-TCP-S.ru.md` | Fixes TOML examples formatting/quoting in docs. |
| `Cargo.lock` | Bumps telemt version to 3.3.20. |


…ests: update warning messages, add new tests for draining writers, and improve state management

Copilot AI left a comment


Pull request overview

This PR enhances observability and robustness around connection/health management by adding a dedicated metrics bind option and introducing RAII-based accounting/cleanup to prevent leaked connection state, alongside extensive regression/integration/adversarial tests.

Changes:

  • Add server.metrics_listen (takes precedence over metrics_port) and update metrics startup/binding logic accordingly.
  • Introduce lease/RAII patterns for connection gauges and user/IP reservations to ensure cleanup on abort/cutover/error paths.
  • Expand middle-proxy health monitoring logic (budgeted draining cleanup) and add comprehensive tests for health, leases, and IP tracker behavior.

Reviewed changes

Copilot reviewed 20 out of 21 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| `src/transport/middle_proxy/mod.rs` | Registers new middle-proxy health test modules under cfg(test). |
| `src/transport/middle_proxy/health.rs` | Adds budgeted draining-writer close logic and exposes helpers for tests. |
| `src/transport/middle_proxy/health_regression_tests.rs` | Regression tests for draining-writer reap logic invariants and warn-state behavior. |
| `src/transport/middle_proxy/health_integration_tests.rs` | Integration tests exercising me_health_monitor draining behavior across cycles. |
| `src/transport/middle_proxy/health_adversarial_tests.rs` | Adversarial/churn tests for draining cleanup invariants and close budgeting. |
| `src/stats/mod.rs` | Adds RouteConnectionLease and `acquire_*_connection_lease()` for RAII gauge accounting. |
| `src/stats/connection_lease_security_tests.rs` | Tests for lease correctness under panic, abort, concurrency, and disarm behavior. |
| `src/proxy/middle_relay.rs` | Switches ME route gauge accounting to RAII lease. |
| `src/proxy/middle_relay_security_tests.rs` | Adds abort/cutover tests ensuring ME gauge is released reliably. |
| `src/proxy/direct_relay.rs` | Switches direct route gauge accounting to RAII lease. |
| `src/proxy/direct_relay_security_tests.rs` | Adds abort/cutover tests ensuring direct gauge is released reliably. |
| `src/proxy/client.rs` | Introduces UserConnectionReservation RAII for per-user slot + IP reservation cleanup; adds synthetic_local_addr(). |
| `src/proxy/client_security_tests.rs` | Adds extensive tests for abort/cutover/error cleanup and reservation churn invariants. |
| `src/metrics.rs` | Adds `listen: Option<String>` support to bind metrics to a specific address/port when configured. |
| `src/maestro/runtime_tasks.rs` | Implements metrics_listen precedence and passes listen target into metrics::serve. |
| `src/main.rs` | Adds ip_tracker_regression_tests module under cfg(test). |
| `src/ip_tracker_regression_tests.rs` | Adds regression/stress tests for UserIpTracker policies and concurrency behavior. |
| `src/config/types.rs` | Adds ServerConfig.metrics_listen and documents precedence over metrics_port. |
| `docs/fronting-splitting/TLS-F-TCP-S.ru.md` | Updates TOML examples with explicit tables/quoting for correctness. |
| `config.toml` | Documents metrics_listen option in the example config. |
| `Cargo.lock` | Bumps telemt version to 3.3.20. |


…or cleaner writer removal logic and update related functions to enhance efficiency in handling closed writers.

Copilot AI left a comment


Pull request overview

This PR improves runtime observability and reliability around connection/resource accounting, while adding extensive regression/integration tests to prevent leaks and fingerprinting regressions.

Changes:

  • Added server.metrics_listen (with precedence over metrics_port) to bind the metrics endpoint to an explicit IP:PORT, including invalid-address handling.
  • Refactored middle-proxy draining writer cleanup to make “empty writer” removal atomic, add a CPU-scaled close budget, and improve health-monitor correctness under churn.
  • Introduced RAII-based connection/accounting leases (route gauges + per-user connection/IP reservations) and added broad security/regression tests for abort/cutover/timeout scenarios.
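The race that an atomic "remove if empty" operation avoids can be sketched as below. The `Registry` type and its methods are hypothetical simplifications (a map from writer id to bound-client count behind one mutex), not the PR's registry. The point is only that the emptiness check and the removal happen under the same lock, so a stale "empty" snapshot cannot remove a writer that has since gained a client.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

/// Writer id -> number of bound clients (hypothetical, simplified registry).
#[derive(Default)]
struct Registry {
    writers: Mutex<HashMap<u64, usize>>,
}

impl Registry {
    fn register(&self, writer_id: u64) {
        self.writers.lock().unwrap().insert(writer_id, 0);
    }

    /// Binds one more client to the writer; false if the writer is gone.
    fn bind_client(&self, writer_id: u64) -> bool {
        match self.writers.lock().unwrap().get_mut(&writer_id) {
            Some(count) => {
                *count += 1;
                true
            }
            None => false,
        }
    }

    /// Snapshot of the client count; may be stale as soon as it is returned.
    fn client_count(&self, writer_id: u64) -> Option<usize> {
        self.writers.lock().unwrap().get(&writer_id).copied()
    }

    /// Check and removal under one lock: removes the writer only if it is
    /// still empty at this moment, regardless of any earlier snapshot.
    fn unregister_if_empty(&self, writer_id: u64) -> bool {
        let mut writers = self.writers.lock().unwrap();
        if writers.get(&writer_id) == Some(&0) {
            writers.remove(&writer_id);
            true
        } else {
            false
        }
    }
}
```

A caller that observed `client_count(id) == Some(0)` and then lost a race with `bind_client` gets `false` from `unregister_if_empty` and can fall back to the slower path of closing bound clients.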

Reviewed changes

Copilot reviewed 26 out of 27 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| `src/config/types.rs` | Adds metrics_listen to ServerConfig and documents precedence over metrics_port. |
| `config.toml` | Documents the new metrics_listen option in the example config. |
| `src/maestro/runtime_tasks.rs` | Implements precedence logic for metrics_listen vs metrics_port and passes binding info to metrics task. |
| `src/metrics.rs` | Adds single-address metrics binding when metrics_listen is set; retains dual-stack fallback for metrics_port. |
| `src/transport/middle_proxy/registry.rs` | Adds atomic unregister_writer_if_empty to prevent stale empty snapshots from removing active writers. |
| `src/transport/middle_proxy/pool_writer.rs` | Uses atomic empty-writer removal path before falling back to closing bound clients. |
| `src/transport/middle_proxy/health.rs` | Refactors draining reaping to use registry activity snapshots + close budgets; exposes helper for tests. |
| `src/transport/middle_proxy/mod.rs` | Wires in new health regression/integration/adversarial test modules. |
| `src/transport/middle_proxy/health_regression_tests.rs` | Adds regression coverage for draining reaping invariants and cleanup behavior. |
| `src/transport/middle_proxy/health_integration_tests.rs` | Adds integration tests around me_health_monitor convergence over cycles. |
| `src/transport/middle_proxy/health_adversarial_tests.rs` | Adds adversarial churn tests to validate invariants under randomized/flip scenarios. |
| `src/stats/mod.rs` | Adds RouteConnectionLease RAII for direct/middle current-connection gauges. |
| `src/stats/connection_lease_security_tests.rs` | Tests lease behavior under panic, abort storms, and concurrency races. |
| `src/proxy/direct_relay.rs` | Switches direct route gauge accounting to RAII lease. |
| `src/proxy/middle_relay.rs` | Switches middle route gauge accounting to RAII lease. |
| `src/proxy/client.rs` | Adds UserConnectionReservation RAII to ensure user slot + IP reservation cleanup on abort/cutover. |
| `src/proxy/masking.rs` | Adds connect/outcome timing budgets to reduce fingerprinting variance in masking paths. |
| `src/proxy/handshake.rs` | Applies configured anti-fingerprint delay consistently across more reject paths (and TLS success path). |
| `src/main.rs` | Adds test module for IP-tracker regressions. |
| `src/ip_tracker_regression_tests.rs` | Adds broad regression coverage for unique-IP tracking policies and concurrency. |
| `src/proxy/*_security_tests.rs` | Adds/extends security tests for abort/cutover cleanup, timing budgets, and limit enforcement. |
| `docs/fronting-splitting/TLS-F-TCP-S.ru.md` | Fixes TOML examples formatting/quoting for documentation correctness. |
| `Cargo.lock` | Bumps crate version to 3.3.20. |

Comments suppressed due to low confidence (1)

src/maestro/runtime_tasks.rs:325

  • spawn_metrics_if_configured parses metrics_listen into SocketAddr and then passes the original string down to metrics::serve, which parses it again. To avoid double parsing (and potential inconsistencies if formatting changes), consider passing the parsed SocketAddr through to metrics::serve instead of the raw String.
    // metrics_listen takes precedence; fall back to metrics_port for backward compat.
    let metrics_target: Option<(u16, Option<String>)> =
        if let Some(ref listen) = config.server.metrics_listen {
            match listen.parse::<std::net::SocketAddr>() {
                Ok(addr) => Some((addr.port(), Some(listen.clone()))),
                Err(e) => {
                    startup_tracker
                        .skip_component(
                            COMPONENT_METRICS_START,
                            Some(format!("invalid metrics_listen \"{}\": {}", listen, e)),
                        )
                        .await;
                    None
                }
            }
        } else {
            config.server.metrics_port.map(|p| (p, None))
        };

    if let Some((port, listen)) = metrics_target {
        let fallback_label = format!("port {}", port);
        let label = listen.as_deref().unwrap_or(&fallback_label);
        startup_tracker
            .start_component(
                COMPONENT_METRICS_START,
                Some(format!("spawn metrics endpoint on {}", label)),
            )
            .await;
        let stats = stats.clone();
        let beobachten = beobachten.clone();
        let config_rx_metrics = config_rx.clone();
        let ip_tracker_metrics = ip_tracker.clone();
        let whitelist = config.server.metrics_whitelist.clone();
        tokio::spawn(async move {
            metrics::serve(
                port,
                listen,
                stats,
                beobachten,
                ip_tracker_metrics,
                config_rx_metrics,
                whitelist,
            )
            .await;


- Introduced `copy_with_idle_timeout` function to handle reading and writing with an idle timeout.
- Updated the proxy masking logic to use the new idle timeout function.
- Added tests to verify that idle relays are closed by the idle timeout before the global relay timeout.
- Ensured that connect refusal paths respect the masking budget and that responses followed by silence are cut off by the idle timeout.
- Added tests for adversarial scenarios where clients may attempt to drip-feed data beyond the idle timeout.
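The difference between an idle timeout and an overall relay timeout can be modelled without any async runtime. The sketch below is an assumption-laden model, not the PR's `copy_with_idle_timeout`: it uses a standard-library channel as the data source, and the names `CopyEnd` and `copy_with_idle_timeout_model` are invented for the example.

```rust
use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::time::{Duration, Instant};

/// Why the copy loop stopped (hypothetical type for illustration).
#[derive(Debug, PartialEq)]
enum CopyEnd {
    Eof,
    IdleTimeout,
    TotalTimeout,
}

/// Drains `source` into `sink` until the sender hangs up, no chunk arrives
/// for `idle`, or `total` has elapsed since the call started.
fn copy_with_idle_timeout_model(
    source: &Receiver<Vec<u8>>,
    sink: &mut Vec<u8>,
    idle: Duration,
    total: Duration,
) -> CopyEnd {
    let deadline = Instant::now() + total;
    loop {
        let remaining = deadline.saturating_duration_since(Instant::now());
        if remaining.is_zero() {
            return CopyEnd::TotalTimeout;
        }
        // Wait for the next chunk, but never past the overall deadline.
        match source.recv_timeout(idle.min(remaining)) {
            // Each received chunk restarts the idle wait on the next iteration.
            Ok(chunk) => sink.extend_from_slice(&chunk),
            Err(RecvTimeoutError::Disconnected) => return CopyEnd::Eof,
            Err(RecvTimeoutError::Timeout) if remaining <= idle => {
                return CopyEnd::TotalTimeout
            }
            Err(RecvTimeoutError::Timeout) => return CopyEnd::IdleTimeout,
        }
    }
}
```

In this model a peer that sends a response and then goes silent is cut off after `idle`, while a peer that keeps drip-feeding data faster than `idle` is still bounded by `total`.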

Copilot AI left a comment


Pull request overview

This PR improves operational robustness around connection accounting and Middle Proxy writer health cleanup, expands test coverage for health/abort/cutover scenarios, and adds a more flexible metrics bind configuration.

Changes:

  • Add atomic “remove writer if empty” path in the Middle Proxy registry/pool and rework draining-writer reap logic with a per-tick close budget.
  • Introduce RAII-style leases for route connection gauges and RAII reservations for user connection/IP tracking, plus extensive regression/integration/adversarial tests.
  • Add server.metrics_listen (takes precedence over metrics_port) and update metrics serving/binding paths and config examples.

Reviewed changes

Copilot reviewed 30 out of 31 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| `src/transport/middle_proxy/registry.rs` | Adds atomic unregister_writer_if_empty and exposes writer activity snapshot for health logic. |
| `src/transport/middle_proxy/pool_writer.rs` | Adds remove_writer_if_empty to pair atomic registry cleanup with pool removal. |
| `src/transport/middle_proxy/health.rs` | Refactors draining-writer reap to use activity snapshots + close budget; exports helpers for tests. |
| `src/transport/middle_proxy/mod.rs` | Wires in new health test modules under cfg(test). |
| `src/transport/middle_proxy/health_regression_tests.rs` | Regression tests for draining reap behavior and invariants. |
| `src/transport/middle_proxy/health_integration_tests.rs` | Integration tests around me_health_monitor convergence/draining. |
| `src/transport/middle_proxy/health_adversarial_tests.rs` | Adversarial/churn tests for warning-state and backlog invariants. |
| `src/stats/mod.rs` | Adds RouteConnectionLease RAII for direct/ME connection gauges. |
| `src/stats/connection_lease_security_tests.rs` | Verifies leases balance under drop/panic/abort/concurrency. |
| `src/proxy/direct_relay.rs` | Switches direct route gauge to lease-based accounting. |
| `src/proxy/direct_relay_security_tests.rs` | Adds abort/cutover tests ensuring direct route gauge is released. |
| `src/proxy/middle_relay.rs` | Switches ME route gauge to lease-based accounting. |
| `src/proxy/middle_relay_security_tests.rs` | Adds abort/cutover tests ensuring ME route gauge is released. |
| `src/proxy/client.rs` | Introduces UserConnectionReservation RAII to release user slots + IP reservations reliably. |
| `src/ip_tracker_regression_tests.rs` | Adds broad regression coverage for IP tracker enforcement and concurrency behavior. |
| `src/proxy/masking.rs` | Adds idle-timeout aware relay copying and timing-budget helpers for masking behavior. |
| `src/proxy/masking_security_tests.rs` | Adds timing/idle/budget regression tests for masking behavior. |
| `src/proxy/handshake.rs` | Adds saturation-aware preauth throttle logic and applies delay budgets on reject/success paths. |
| `src/proxy/handshake_security_tests.rs` | Expands coverage for delay budgets and saturation/throttle behavior. |
| `src/protocol/tls.rs` | Tightens allowed time skew constants used by TLS anti-replay validation. |
| `src/protocol/tls_security_tests.rs` | Adds targeted tests for replay-window and TLS record parsing/clamping behaviors. |
| `src/metrics.rs` | Adds `listen: Option<String>` support to bind metrics to a specific address when configured. |
| `src/maestro/runtime_tasks.rs` | Makes metrics_listen override metrics_port and improves startup labels/error reporting. |
| `src/config/types.rs` | Adds ServerConfig.metrics_listen with docs and default initialization. |
| `src/config/defaults.rs` | Changes defaults (e.g., replay window and server hello delay) affecting handshake behavior. |
| `src/cli.rs` | Updates example config snippet for replay window value. |
| `config.toml` | Documents the new metrics_listen option in the sample config. |
| `docs/fronting-splitting/TLS-F-TCP-S.ru.md` | Fixes TOML examples (adds tables/quotes) for censorship-related settings. |
| `src/main.rs` | Includes new test module under cfg(test). |
| `Cargo.lock` | Bumps telemt version in lockfile. |


Comment on lines 75 to 77

     pub(crate) fn default_replay_window_secs() -> u64 {
    -    1800
    +    120
     }
@@ -27,8 +27,8 @@ pub const TLS_DIGEST_POS: usize = 11;
pub const TLS_DIGEST_HALF_LEN: usize = 16;

/// Time skew limits for anti-replay (in seconds)
- Simplified eviction candidate selection in `auth_probe_record_failure_with_state` by tracking the oldest candidate directly.
- Enhanced the handling of stale entries to ensure newcomers are tracked even under capacity constraints.
- Added tests to verify behavior under stress conditions and ensure newcomers are correctly managed.
- Updated `decode_user_secrets` to prioritize preferred users based on SNI hints.
- Introduced new tests for TLS SNI handling and replay protection mechanisms.
- Improved deduplication hash stability and collision resistance in middle relay logic.
- Refined cutover handling in route mode to ensure consistent error messaging and session management.
- Updated `direct_relay_security_tests.rs` to ensure sanitized paths are correctly validated against resolved paths.
- Added tests for symlink handling in `unknown_dc_log_path_revalidation` to prevent symlink target escape vulnerabilities.
- Modified `handshake.rs` to use a more robust hashing strategy for eviction offsets, improving the eviction logic in `auth_probe_record_failure_with_state`.
- Introduced new tests in `handshake_security_tests.rs` to validate eviction logic under various conditions, ensuring low fail streak entries are prioritized for eviction.
- Simplified `route_mode.rs` by removing unnecessary atomic mode tracking, streamlining the transition logic in `RouteRuntimeController`.
- Enhanced `route_mode_security_tests.rs` with comprehensive tests for mode transitions and their effects on session states, ensuring consistency under concurrent modifications.
- Cleaned up `emulator.rs` by removing unused ALPN extension handling, improving code clarity and maintainability.
- Modified `build_emulated_server_hello` to accept ALPN (Application-Layer Protocol Negotiation) as an optional parameter, allowing for the embedding of ALPN markers in the application data payload.
- Implemented logic to handle oversized ALPN values and ensure they do not interfere with the application data payload.
- Added new security tests in `emulator_security_tests.rs` to validate the behavior of the ALPN embedding, including scenarios for oversized ALPN and preference for certificate payloads over ALPN markers.
- Introduced `send_adversarial_tests.rs` to cover edge cases and potential issues in the middle proxy's send functionality, ensuring robustness against various failure modes.
- Updated `middle_proxy` module to include new test modules and ensure proper handling of writer commands during data transmission.
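
The eviction bullets above mention tracking the oldest candidate and prioritizing low fail streak entries. A minimal sketch of one way to combine the two is shown below. `ProbeEntry`, `pick_eviction_candidate`, and the ordering (lowest streak first, oldest as tie-breaker) are assumptions made for illustration and are not taken from `handshake.rs`; the hashed eviction offsets mentioned in the commit message are not modeled.

```rust
use std::collections::HashMap;
use std::net::IpAddr;

/// Hypothetical per-peer record; field names are illustrative only.
#[derive(Clone, Copy, Debug)]
pub struct ProbeEntry {
    pub fail_streak: u32,
    pub last_seen_secs: u64,
}

/// Picks the entry to evict when the table is full.
/// Assumed ordering: lowest fail streak first, then the oldest entry.
pub fn pick_eviction_candidate(entries: &HashMap<IpAddr, ProbeEntry>) -> Option<IpAddr> {
    entries
        .iter()
        .min_by_key(|(_, entry)| (entry.fail_streak, entry.last_seen_secs))
        .map(|(ip, _)| *ip)
}
```

With this ordering, peers with long failure streaks stay tracked the longest, so a flood of one-off newcomers evicts other newcomers first.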

Copilot AI left a comment

Pull request overview

This PR hardens the proxy runtime by tightening metrics binding configuration, improving resource cleanup semantics (connections/IP reservations), and expanding adversarial/integration test coverage—especially around relay cutovers, draining writer cleanup, TLS emulation, masking timing behavior, and user quota enforcement.

Changes:

  • Add metrics_listen (higher priority over metrics_port) and update metrics startup/binding behavior.
  • Refactor middle-proxy draining/cleanup logic (budgeted close, atomic “remove if empty”) and route-mode state handling; add extensive regression/adversarial tests.
  • Strengthen security/robustness around TLS emulation/handshake throttling, masking timing budgets, user quota enforcement, and RAII connection gauges/leases.
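
The priority rule for the metrics options can be sketched as follows. `resolve_metrics_addr` is a hypothetical helper, not the function used in `src/metrics.rs`, and binding the legacy port on `0.0.0.0` is an assumption. The one behavior taken from the review snippets is that an invalid `metrics_listen` value is reported and not silently replaced by `metrics_port`.

```rust
use std::net::{AddrParseError, IpAddr, Ipv4Addr, SocketAddr};

/// Hypothetical helper: picks the metrics bind address.
/// `metrics_listen` wins when present; `metrics_port` is the legacy fallback.
pub fn resolve_metrics_addr(
    metrics_listen: Option<&str>,
    metrics_port: Option<u16>,
) -> Result<Option<SocketAddr>, AddrParseError> {
    if let Some(listen) = metrics_listen {
        // An invalid address is an error, not a silent fallback to the port.
        return listen.parse::<SocketAddr>().map(Some);
    }
    // Assumption: the legacy option binds on all IPv4 interfaces.
    Ok(metrics_port.map(|port| SocketAddr::new(IpAddr::V4(Ipv4Addr::UNSPECIFIED), port)))
}
```

Returning `Ok(None)` when neither option is set lets the caller skip starting the metrics task entirely.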

Reviewed changes

Copilot reviewed 42 out of 43 changed files in this pull request and generated 4 comments.

Show a summary per file
| File | Description |
| --- | --- |
| `src/metrics.rs` | Adds metrics_listen support and binding priority/validation. |
| `src/maestro/runtime_tasks.rs` | Updates metrics task spawn logic to prefer metrics_listen. |
| `src/config/types.rs` / `config.toml` | Introduces/configures metrics_listen. |
| `src/transport/middle_proxy/send.rs` | Changes writer send path to reserve-permit before bind; prunes stale writers. |
| `src/transport/middle_proxy/registry.rs` / `pool_writer.rs` / `health.rs` | Adds atomic unregister-if-empty + budgeted drain close logic. |
| `src/proxy/relay.rs` | Adds quota-aware IO sentinel + per-user serialization to prevent quota races. |
| `src/proxy/handshake.rs` / `src/protocol/tls.rs` / `src/tls_front/emulator.rs` | Tightens anti-replay/skew, adjusts TLS emulation markers/tickets, adds security tests. |
| `src/proxy/masking.rs` | Adds idle timeouts and minimum outcome/connect timing budgets for masking path. |
| `src/stats/mod.rs` | Adds lease-based connection gauges (RAII-style). |
| `**_tests.rs` (multiple) | Adds extensive regression/integration/adversarial test suites for the above behaviors. |
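
The lease-based (RAII-style) connection gauge mentioned for `src/stats/mod.rs` can be illustrated with a minimal sketch. `ConnectionGauge` and `ConnectionLease` are hypothetical names; the actual types in the PR may differ. The point is only that the decrement lives in `Drop`, so it runs on every exit path.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;

/// Hypothetical gauge of currently open connections.
#[derive(Default)]
pub struct ConnectionGauge {
    current: AtomicU64,
}

impl ConnectionGauge {
    pub fn current(&self) -> u64 {
        self.current.load(Ordering::Relaxed)
    }
}

/// Hypothetical lease: holding it counts as one open connection.
pub struct ConnectionLease {
    gauge: Arc<ConnectionGauge>,
}

impl ConnectionLease {
    /// Increments the gauge and returns a guard that undoes it on drop.
    pub fn acquire(gauge: &Arc<ConnectionGauge>) -> Self {
        gauge.current.fetch_add(1, Ordering::Relaxed);
        ConnectionLease { gauge: Arc::clone(gauge) }
    }
}

impl Drop for ConnectionLease {
    fn drop(&mut self) {
        // Runs on normal return, early return, and unwinding alike.
        self.gauge.current.fetch_sub(1, Ordering::Relaxed);
    }
}
```

A relay task would hold the lease for its whole lifetime; when the task is cancelled its locals are dropped, which releases the slot without any explicit cleanup call.
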
Comments suppressed due to low confidence (2)

src/transport/middle_proxy/send.rs:1

  • `permit.send(...)` drops the send result, so a closed channel (or a send failure if the API returns Result) can silently lose the payload while still committing the registry bind and incrementing success stats. Handle the send outcome explicitly (and only commit/increment on success), or if bind must happen first, roll back by unbinding/closing and pruning the writer when the send fails.

src/transport/middle_proxy/send.rs:1

  • Same issue as the try-path: the fallback path commits the bind and increments success metrics even if `permit.send(...)` fails. Ensure the send is checked and failures trigger cleanup/rollback (or reorder operations so bind is only committed after a successful send).
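
The reordering suggested above (commit the bind only after a successful send) can be sketched as follows. This uses a bounded `std::sync::mpsc` channel as a stand-in for the async channel in `send.rs`, and `Registry` and `send_then_bind` are hypothetical names, so it shows only the ordering, not the PR's actual code.

```rust
use std::collections::HashMap;
use std::sync::mpsc::{SyncSender, TrySendError};

/// Hypothetical stand-in for the writer registry and success counter.
#[derive(Default)]
pub struct Registry {
    bound: HashMap<u64, usize>, // connection id -> writer index
    sent_ok: u64,
}

/// Sends `payload` to the writer and binds `conn_id` to it only on success.
pub fn send_then_bind(
    registry: &mut Registry,
    writer: &SyncSender<Vec<u8>>,
    writer_idx: usize,
    conn_id: u64,
    payload: Vec<u8>,
) -> Result<(), TrySendError<Vec<u8>>> {
    // Check the send outcome first; a closed or full channel returns early
    // and leaves the registry and the counter untouched.
    writer.try_send(payload)?;
    registry.bound.insert(conn_id, writer_idx);
    registry.sent_ok += 1;
    Ok(())
}
```

The error value carries the payload back to the caller, so a failed send can be retried on another writer without losing data.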

trace!(peer = %peer, handshake = ?hex::encode(handshake), "MTProto handshake bytes");
trace!(
peer = %peer,
handshake_head = %hex::encode(&handshake[..8]),
Comment on lines +290 to +304
let quota_lock = this
    .quota_limit
    .is_some()
    .then(|| quota_user_lock(&this.user));
let _quota_guard = if let Some(lock) = quota_lock.as_ref() {
    match lock.try_lock() {
        Ok(guard) => Some(guard),
        Err(_) => {
            cx.waker().wake_by_ref();
            return Poll::Pending;
        }
    }
} else {
    None
};
Comment on lines +80 to +91
async fn wait_mask_connect_budget(started: Instant) {
    let elapsed = started.elapsed();
    if elapsed < MASK_TIMEOUT {
        tokio::time::sleep(MASK_TIMEOUT - elapsed).await;
    }
}

async fn wait_mask_outcome_budget(started: Instant) {
    let elapsed = started.elapsed();
    if elapsed < MASK_TIMEOUT {
        tokio::time::sleep(MASK_TIMEOUT - elapsed).await;
    }
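
Both helpers in the snippet pad the elapsed time up to the same `MASK_TIMEOUT` budget. The shared arithmetic can be pulled into one pure function, sketched below under the assumption that a single budget value is intended. `remaining_budget` is a hypothetical name, and the sketch leaves the async sleep to the caller.

```rust
use std::time::Duration;

/// Hypothetical pure helper: how long to sleep so that the total time
/// spent reaches `budget`. Returns `None` when the budget is already used up.
pub fn remaining_budget(elapsed: Duration, budget: Duration) -> Option<Duration> {
    // `checked_sub` avoids the panic that `budget - elapsed` would raise
    // if `elapsed` exceeded `budget`.
    budget.checked_sub(elapsed).filter(|rest| !rest.is_zero())
}
```

A caller would sleep for the returned duration when it is `Some`, which keeps the reject and success paths at a similar wall-clock duration.
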
Comment on lines +38 to +52
            warn!(error = %e, "Invalid metrics_listen address: {}", listen_addr);
            return;
        }
    };
    let is_ipv6 = addr.is_ipv6();
    match bind_metrics_listener(addr, is_ipv6) {
        Ok(listener) => {
            info!("Metrics endpoint: http://{}/metrics and /beobachten", addr);
            serve_listener(
                listener, stats, beobachten, ip_tracker, config_rx, whitelist,
            )
            .await;
        }
        Err(e) => {
            warn!(error = %e, "Failed to bind metrics on {}", addr);
@DavidOsipov DavidOsipov changed the title Enhance metrics configuration and add health monitoring tests [WIP] Enhance metrics configuration, add health monitoring tests, security hardening, perf optimizations & loads of tests Mar 18, 2026
@DavidOsipov

image

- Adjusted QUOTA_USER_LOCKS_MAX based on test and non-test configurations to improve flexibility.
- Implemented logic to retain existing locks when the maximum quota is reached, ensuring efficient memory usage.
- Added comprehensive tests for quota user lock functionality, including cache reuse, saturation behavior, and race conditions.
- Enhanced StatsIo struct to manage wake scheduling for read and write operations, preventing unnecessary self-wakes.
- Introduced separate replay checker domains for handshake and TLS to ensure isolation and prevent cross-pollution of keys.
- Added security tests for replay checker to validate domain separation and window clamping behavior.
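
The domain separation described in the last two bullets can be illustrated with a minimal sketch. `ReplayDomain` and `ReplayChecker::check_and_insert` are hypothetical names chosen here, and the sketch omits the time window and clamping that the real checker applies; it only shows that the same key bytes are tracked independently per domain.

```rust
use std::collections::HashSet;

/// Hypothetical domain tag that keeps handshake and TLS keys apart.
#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
pub enum ReplayDomain {
    Handshake,
    Tls,
}

/// Hypothetical checker without expiry; the real one is window-based.
#[derive(Default)]
pub struct ReplayChecker {
    seen: HashSet<(ReplayDomain, Vec<u8>)>,
}

impl ReplayChecker {
    /// Records the key and returns `true` if it was not seen before
    /// in this domain; returns `false` for a replay.
    pub fn check_and_insert(&mut self, domain: ReplayDomain, key: &[u8]) -> bool {
        self.seen.insert((domain, key.to_vec()))
    }
}
```

Because the domain is part of the set key, a value first observed in a TLS record cannot cause a false replay hit on the handshake path, and the reverse also holds.
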
@axkurcom axkurcom merged commit 44376b5 into telemt:flow-sec Mar 18, 2026