
chore(depot): fault injection tests #4868

Draft
NathanFlurry wants to merge 18 commits into 04-29-feat_sqlite_pitr_forking from 05-01-chore_depot_fault_injection_tests

@NathanFlurry
Member

Description

Please include a summary of the changes and the related issue. Please also include relevant motivation and context.

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

Please describe the tests that you ran to verify your changes.

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

railway-app Bot commented May 2, 2026

🚅 Deployed to the rivet-pr-4868 environment in rivet-frontend

| Service | Status | Web | Updated (UTC) |
| --- | --- | --- | --- |
| frontend-cloud | 😴 Sleeping (View Logs) | Web | May 2, 2026 at 8:43 am |
| kitchen-sink | ❌ Build Failed (View Logs) | Web | May 2, 2026 at 7:14 am |
| website | 😴 Sleeping (View Logs) | Web | May 2, 2026 at 3:52 am |
| ladle | ✅ Success (View Logs) | Web | May 2, 2026 at 3:44 am |
| mcp-hub | ✅ Success (View Logs) | Web | May 2, 2026 at 3:42 am |
| frontend-inspector | ❌ Build Failed (View Logs) | Web | May 2, 2026 at 3:42 am |

Member Author

NathanFlurry commented May 2, 2026

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.

This stack of pull requests is managed by Graphite. Learn more about stacking.

claude Bot commented May 2, 2026

Review: chore(depot): fault injection tests

This PR introduces a well-designed fault injection testing infrastructure for the depot SQLite storage path. The phased approach (spec -> controller API -> workflow hooks -> VFS tests) is sound, and the production-leak prevention strategy is thorough.

Overview

  • Adds a test-faults Cargo feature gate so that fault injection is compiled only into dev/test builds
  • Implements DepotFaultController with rule-based dispatch: Fail, Pause, Delay, DropArtifact actions
  • Wires fault hooks into commit, read, hot/cold compaction, reclaim, and cold-tier paths
  • Adds DepotCompactionTestDriver for deterministic workflow forcing
  • Adds disable_planning_timers on DbManagerInput to prevent autonomous compaction during tests
  • Adds check-production-fault-leaks.sh to prevent fault symbols from leaking into release builds
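
To make the rule-based dispatch in the overview concrete, here is a minimal sketch of how a controller might map fault points to actions and track which expected rules fired. Every name below (`FaultController`, `FaultPoint`, `add_rule`, `maybe_fire`, `unfired_expected`) is hypothetical and chosen for illustration; only the four action kinds come from the PR, and the real `DepotFaultController` API may look quite different.

```rust
use std::collections::HashMap;
use std::time::Duration;

/// Hypothetical stand-in for the PR's four fault actions.
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum FaultAction {
    Fail { message: String },
    Pause { checkpoint: String },
    Delay(Duration),
    DropArtifact,
}

/// Hypothetical fault point names; the PR's real hook points are not shown here.
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
enum FaultPoint {
    CommitBeforeWrite,
    ReadBeforeFetch,
}

struct Rule {
    action: FaultAction,
    /// When true, the test should fail if this rule never fires.
    expected: bool,
    fired: u32,
}

#[derive(Default)]
struct FaultController {
    rules: HashMap<FaultPoint, Rule>,
}

impl FaultController {
    /// Registers one rule per fault point (a simplification for this sketch).
    fn add_rule(&mut self, point: FaultPoint, action: FaultAction, expected: bool) {
        self.rules.insert(point, Rule { action, expected, fired: 0 });
    }

    /// Called from a hook site; returns the action to apply, if a rule matches.
    fn maybe_fire(&mut self, point: FaultPoint) -> Option<FaultAction> {
        let rule = self.rules.get_mut(&point)?;
        rule.fired += 1;
        Some(rule.action.clone())
    }

    /// Lists the points whose expected rules never fired.
    fn unfired_expected(&self) -> Vec<FaultPoint> {
        self.rules
            .iter()
            .filter(|(_, rule)| rule.expected && rule.fired == 0)
            .map(|(point, _)| *point)
            .collect()
    }
}
```

A test would register its rules, run the workload, and then assert that `unfired_expected()` is empty, which mirrors the role the review attributes to `assert_expected_fired()`.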

Issues

Race condition in pause/notify pattern - src/fault/controller.rs

The wait_reached and pause_until_released patterns violate the "arm before check" rule from CLAUDE.md. If pause_until_released stores reached = true and calls notify_waiters() between the atomic load and notified().await in wait_reached, the notification is permanently missed because notify_waiters() does not store permits.

Current code has a race window:

pub async fn wait_reached(&self) {
    while !self.state.reached.load(Ordering::SeqCst) {
        self.state.reached_notify.notified().await;
    }
}

The correct pattern arms the future before checking the condition:

pub async fn wait_reached(&self) {
    loop {
        let notified = self.state.reached_notify.notified();
        pin_mut!(notified);
        if self.state.reached.load(Ordering::SeqCst) { break; }
        notified.await;
    }
}

The same race exists in pause_until_released between the released load and release_notify.notified().await. The test likely passes today because, on a single-threaded test executor, no other task can run between the load and the call to notified() (there is no await point in between). On a multi-threaded runtime the window is real, especially under heavy load.
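
The underlying lost-wakeup hazard is not specific to tokio. As a blocking analog (a swapped-in technique using `std::sync::Condvar`, not the `tokio::sync::Notify` code in the PR), the sketch below checks the flag while holding the mutex that the notifier also takes, so a notification cannot land between the check and the wait. `PauseState`, `mark_reached`, and `run_once` are hypothetical names for illustration.

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

/// Hypothetical blocking analog of the pause state; not the PR's type.
#[derive(Default)]
struct PauseState {
    reached: Mutex<bool>,
    reached_cv: Condvar,
}

impl PauseState {
    /// Sets the flag and wakes every waiter while holding the mutex.
    fn mark_reached(&self) {
        let mut reached = self.reached.lock().unwrap();
        *reached = true;
        self.reached_cv.notify_all();
    }

    /// Blocks until the flag is set. The check and the wait happen under
    /// the same lock, so the notifier cannot run in between them.
    fn wait_reached(&self) {
        let mut reached = self.reached.lock().unwrap();
        while !*reached {
            reached = self.reached_cv.wait(reached).unwrap();
        }
    }
}

/// Runs one waiter and one notifier on separate threads and returns the final flag.
fn run_once() -> bool {
    let state = Arc::new(PauseState::default());
    let waiter = {
        let state = Arc::clone(&state);
        thread::spawn(move || state.wait_reached())
    };
    state.mark_reached();
    waiter.join().unwrap();
    let flag = *state.reached.lock().unwrap();
    flag
}
```

The async fix proposed above achieves the same ordering guarantee by creating the `notified()` future before loading the flag, since `Notify` has no mutex to hold across the check.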


Production behavior change in read path - src/conveyer/read.rs (not behind test-faults)

The new stale_pidx_pgnos check returns a hard error where the code previously fell through:

} else {
    if stale_pidx_pgnos.contains(&pgno) {
        return Err(SqliteStorageError::ShardCoverageMissing { pgno }.into());
    }

The change is correct per the updated CLAUDE.md. It is worth verifying that the existing test stale_pidx_missing_delta_falls_back_to_fdb_shard was updated: its name implies the old fallback behavior, and it may silently pass without exercising the new error path.
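
As a rough illustration of what an updated test should pin down, here is a toy model of the branch. Only the `stale_pidx_pgnos.contains(&pgno)` check and the `ShardCoverageMissing { pgno }` error come from the excerpt; `resolve_page`, `PageSource`, the `delta_has_page` flag, and the shape of the other branches are assumptions made so the sketch compiles on its own.

```rust
use std::collections::HashSet;

#[derive(Debug, PartialEq)]
enum ReadError {
    ShardCoverageMissing { pgno: u32 },
}

/// Hypothetical: where a page ends up being read from.
#[derive(Debug, PartialEq)]
enum PageSource {
    Delta,
    Shard,
}

/// Toy model of the read decision. The real read path is more involved;
/// only the stale-pgno error branch mirrors the excerpt above.
fn resolve_page(
    pgno: u32,
    delta_has_page: bool,
    stale_pidx_pgnos: &HashSet<u32>,
) -> Result<PageSource, ReadError> {
    if delta_has_page {
        Ok(PageSource::Delta)
    } else {
        // New behavior: a stale page index entry is a hard error, not a fallback.
        if stale_pidx_pgnos.contains(&pgno) {
            return Err(ReadError::ShardCoverageMissing { pgno });
        }
        Ok(PageSource::Shard)
    }
}
```

A test for the new behavior would assert the `Err` variant explicitly rather than asserting that some page bytes came back, so that a regression to the old fallthrough cannot pass unnoticed.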


DepotFaultCheckpoint is defined but unused - src/fault/checkpoint.rs

DepotFaultCheckpoint is a newtype over a string with no callsites in the diff. The pause system uses raw String names throughout. Remove until needed or wire it in.


DropArtifact for staged hot shards clears only chunk 0 - src/workflows/db_hot_compacter.rs

tx.informal().clear(&keys::branch_compaction_stage_hot_shard_key(
    ..., output_ref.shard_id, output_ref.as_of_txid, 0,
));

If a staged hot shard spans multiple chunks, the remaining chunks become orphaned in storage. Either clear all chunks or add a comment that staged hot shards are guaranteed single-chunk blobs.
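
If multi-chunk staged shards are possible, the "clear all chunks" option could look roughly like the following. This is a sketch over an in-memory `BTreeMap` standing in for the key-value store; the `(shard_id, as_of_txid, chunk_index)` tuple key and the `clear_staged_shard` helper are hypothetical and do not reflect the actual key encoding or transaction API.

```rust
use std::collections::BTreeMap;

/// Hypothetical key layout: (shard_id, as_of_txid, chunk_index).
type ChunkKey = (u32, u64, u32);

/// Removes every staged chunk for one shard output, not just chunk 0,
/// and returns how many chunks were removed.
fn clear_staged_shard(
    store: &mut BTreeMap<ChunkKey, Vec<u8>>,
    shard_id: u32,
    as_of_txid: u64,
) -> usize {
    // Collect the keys first so the map is not mutated while it is borrowed.
    let keys: Vec<ChunkKey> = store
        .range((shard_id, as_of_txid, 0)..=(shard_id, as_of_txid, u32::MAX))
        .map(|(key, _)| *key)
        .collect();
    for key in &keys {
        store.remove(key);
    }
    keys.len()
}
```

In a real ordered key-value store the equivalent would typically be a single range clear over the chunk prefix rather than a per-key loop.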


DropArtifact on delete_objects/list_prefix returns a runtime error - src/cold_tier/faulty.rs

anyhow::bail!("cold-tier DropArtifact is not supported for delete_objects");

This reads as a user-facing runtime error rather than a test misconfiguration assertion. A panic! or unreachable! with a test-setup message would communicate intent more clearly.


Polling loop in wait_until needs stronger justification - src/compaction/test_driver.rs

CLAUDE.md prohibits loop { check; sleep } polling. The existing comment is the right justification but should be more explicit:

// FIXME(test-faults): polling because Gasoline debug does not expose event waiters.
tokio::time::sleep(Duration::from_millis(25)).await;


DepotFaultRuleBuilder::pause has a silent side effect - src/fault/controller.rs

pub fn pause(self, checkpoint: impl Into<String>) -> Result<DepotFaultRuleId> {
    let checkpoint = checkpoint.into();
    self.controller.pause_handle(checkpoint.clone());
    self.insert(DepotFaultAction::Pause { checkpoint })
}

Creating a pause state entry as a side effect of rule registration is unexpected. Add a comment explaining the pre-creation is intentional so the handle is live before the fault fires.


Minor

  • cold_delete_fault_output and delete_orphan_cold_fault_output take deleted_object_keys: Vec<String> only used in the error branch. Consider a borrow or lazy clone.
  • Unregistered ShardCacheFillFaultPoint variants (BeforeGetObject, AfterGetObject, BeforeShardWrite, AfterShardWrite, Skipped) have no hook sites yet. A // Phase 3 comment in the enum would help reviewers.
  • rivetkit-sqlite/tests/inline/fault/ files are skeletal stubs. The PR description should clarify what fault tests are and are not included in this PR.
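
For the first minor point, borrowing the key list and cloning only on the error path could be sketched as follows. The function name `delete_fault_output`, its `fail` flag, and the `DeleteFaultError` type are hypothetical; the PR's real signatures are not shown in this review.

```rust
/// Hypothetical error type that needs to own the keys it reports.
#[derive(Debug, PartialEq)]
struct DeleteFaultError {
    deleted_object_keys: Vec<String>,
}

/// Takes the keys by reference and allocates a copy only when the
/// error branch actually needs to own them.
fn delete_fault_output(
    fail: bool,
    deleted_object_keys: &[String],
) -> Result<usize, DeleteFaultError> {
    if fail {
        return Err(DeleteFaultError {
            deleted_object_keys: deleted_object_keys.to_vec(),
        });
    }
    Ok(deleted_object_keys.len())
}
```

On the success path the caller keeps ownership of its `Vec<String>` and no allocation happens.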

Positive notes

  • check-production-fault-leaks.sh (LLVM IR grep + probe crate + metadata scan) is exactly the right level of paranoia for test-only infrastructure.
  • test-faults feature gate (not #[cfg(test)]) is correct for cross-crate test dependencies.
  • DbManagerInput::new / with_planning_timers_disabled constructors prevent struct literal bypass of the new field.
  • FaultBoundary classification (PreDurableCommit, AmbiguousAfterDurableCommit, PostDurableNonData) is essential for correct oracle comparison and well-designed.
  • assert_expected_fired() making unfired expected rules a test failure is exactly right per the spec.
  • Delta chunk contiguity check in tx_load_delta_blob (sorting + index gap validation) is a real correctness improvement independent of fault injection.
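
The contiguity check mentioned in the last note (sorting plus index gap validation) can be illustrated with a small standalone function. This is only a sketch of the idea: `check_contiguous` is a hypothetical name, and the real `tx_load_delta_blob` operates on stored chunks rather than a bare list of indices.

```rust
/// Sorts chunk indices and verifies they form 0, 1, 2, ... with no gaps
/// or duplicates. An empty list is treated as contiguous here, which is
/// an assumption of this sketch.
fn check_contiguous(mut chunk_indices: Vec<u32>) -> Result<(), String> {
    chunk_indices.sort_unstable();
    for (expected, actual) in chunk_indices.iter().enumerate() {
        if *actual as usize != expected {
            return Err(format!(
                "chunk gap: expected index {expected}, found {actual}"
            ));
        }
    }
    Ok(())
}
```

Sorting first means the check does not depend on the order in which chunks were read back, and comparing each index to its position catches both missing and duplicated chunks.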
