Skip to content

chore(engine): roxiproxi tests#4864

Open
NathanFlurry wants to merge 1 commit into04-28-feat_sqlite_benchmark_cold_readsfrom
05-01-chore_engine_roxiproxi_tests
Open

chore(engine): roxiproxi tests#4864
NathanFlurry wants to merge 1 commit into04-28-feat_sqlite_benchmark_cold_readsfrom
05-01-chore_engine_roxiproxi_tests

Conversation

@NathanFlurry
Copy link
Copy Markdown
Member

Description

Please include a summary of the changes and the related issue. Please also include relevant motivation and context.

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

Please describe the tests that you ran to verify your changes.

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Copy link
Copy Markdown
Member Author

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
Learn more

This stack of pull requests is managed by Graphite. Learn more about stacking.

@NathanFlurry NathanFlurry mentioned this pull request May 1, 2026
11 tasks
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 1, 2026

Preview packages published to npm

Install with:

npm install rivetkit@pr-4864

All packages published as 0.0.0-pr.4864.06929ff with tag pr-4864.

Engine binary is shipped via @rivetkit/engine-cli on linux-x64-musl, linux-arm64-musl, darwin-x64, and darwin-arm64. Windows users should use the release installer or set RIVET_ENGINE_BINARY.

Docker images:

docker pull rivetdev/engine:slim-06929ff
docker pull rivetdev/engine:full-06929ff
Individual packages
npm install rivetkit@pr-4864
npm install @rivetkit/react@pr-4864
npm install @rivetkit/rivetkit-napi@pr-4864
npm install @rivetkit/workflow-engine@pr-4864

@NathanFlurry NathanFlurry force-pushed the 05-01-chore_engine_roxiproxi_tests branch from 7bbb3c9 to 5aff63c Compare May 1, 2026 11:13
@NathanFlurry NathanFlurry marked this pull request as ready for review May 1, 2026 11:14
@claude
Copy link
Copy Markdown

claude Bot commented May 1, 2026

Code Review: chore(engine): roxiproxi tests

Overview

This PR adds Toxiproxy-based network fault testing infrastructure to validate envoy reconnection behavior after server-side TCP resets. Key additions:

  • ToxiproxyTestServer — wraps a Toxiproxy Docker container via testcontainers/bollard, with helpers to inject TCP resets, latency, timeouts, and bandwidth limits.
  • on_connect/on_disconnect callbacks — added to EnvoyCallbacks (with default no-ops, so non-breaking).
  • EnvoyConnectionEventWaiter — broadcast-based helper for test assertions on connection lifecycle.
  • One test: envoy_reconnects_after_server_side_tcp_reset verifying reconnect without actor start replay.

Issues

Single-proxy constraint not enforced (toxiproxy.rs:1113)

ToxiproxyTestServer::proxy() always binds Toxiproxy to DEFAULT_PROXY_PORT (8666). A second call on the same server would fail because the port is already in use — but there's nothing preventing callers from trying. Consider returning an error with a clear message if proxy() is called a second time, or supporting multiple proxies by tracking allocated ports.

wait_with_poll is a polling loop (test_helpers.rs:755–775)

This introduces a loop { check; sleep } utility that CLAUDE.md flags: "If it is inside a loop { check; sleep } body, it is polling and should be event-driven instead." For wait_for_envoy_actor the envoy already has broadcast lifecycle events — subscribe_lifecycle_events() could be awaited deterministically instead of polling. For wait_for_connectable, the same approach is acceptable if there's genuinely no server-push event, but that should be confirmed and documented with a one-line comment explaining why polling is necessary.

wait_for_admin polling loop (toxiproxy.rs:1323–1342)

Same pattern. This is more defensible as external-service readiness polling (equivalent to retry backoff), but it would be cleaner to use the wait_with_poll utility just introduced rather than duplicating the pattern inline.

Timing dependency in test (network_faults.rs:865)

The comment "The reset toxic applies when the engine next writes to the Envoy control WebSocket. The ping task writes every few seconds in the test config" is a hidden timing assumption. The 90-second test timeout gives headroom, but correctness depends on a ping interval set in external test config. Consider verifying the timeout accounts for 2× the worst-case ping interval, or triggering a message send explicitly rather than relying on the background ping.

on_disconnect call placement (connection.rs:385–388)

on_disconnect is called after write_handle.abort(). Confirm this is intentional — callers may expect the handle to still be usable inside on_disconnect (e.g., queuing a reconnect action). If EnvoyHandle is unusable after abort, document that the callback is notification-only.


Minor Points

  • EnvoyConnectionEventWaiter has a hardcoded 20-second timeout. Fine for tests, but consider exposing with_timeout() for flexibility in slower CI environments.
  • latency_downstream, timeout_downstream, bandwidth_downstream, enable, disable methods on ToxiproxyProxy are added but unused by current tests. If they're not needed yet, consider deferring to keep scope tight.

Positives

  • Clean Docker abstraction using testcontainers — lifetime tied to _container drop handles cleanup correctly.
  • assert_no_event() pre-check before injecting the fault is a good defensive pattern that catches state pollution.
  • The assertion message on start_count clearly documents the invariant being tested.
  • Default no-op impls on on_connect/on_disconnect keep the trait change non-breaking.
  • Lagged branch in assert_no_event correctly panics — a lagged receiver means state is indeterminate, so failing fast is right.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant