chore(depot): fault injection tests#4868
chore(depot): fault injection tests#4868NathanFlurry wants to merge 18 commits into04-29-feat_sqlite_pitr_forkingfrom
Conversation
|
🚅 Deployed to the rivet-pr-4868 environment in rivet-frontend
|
|
Warning This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
This stack of pull requests is managed by Graphite. Learn more about stacking. |
Review: chore(depot): fault injection testsThis PR introduces a well-designed fault injection testing infrastructure for the depot SQLite storage path. The phased approach (spec -> controller API -> workflow hooks -> VFS tests) is sound, and the production-leak prevention strategy is thorough. Overview
IssuesRace condition in pause/notify pattern - The Current code has a race window: pub async fn wait_reached(&self) {
while !self.state.reached.load(Ordering::SeqCst) {
self.state.reached_notify.notified().await;
}
}The correct pattern arms the future before checking the condition: pub async fn wait_reached(&self) {
loop {
let notified = self.state.reached_notify.notified();
pin_mut!(notified);
if self.state.reached.load(Ordering::SeqCst) { break; }
notified.await;
}
}The same race exists in Production behavior change in read path - The new } else {
if stale_pidx_pgnos.contains(&pgno) {
return Err(SqliteStorageError::ShardCoverageMissing { pgno }.into());
}The change is correct per the updated CLAUDE.md. Worth verifying that the existing test
tx.informal().clear(&keys::branch_compaction_stage_hot_shard_key(
..., output_ref.shard_id, output_ref.as_of_txid, 0,
));If a staged hot shard spans multiple chunks, the remaining chunks become orphaned in storage. Either clear all chunks or add a comment that staged hot shards are guaranteed single-chunk blobs.
anyhow::bail!("cold-tier DropArtifact is not supported for delete_objects");This reads as a user-facing runtime error rather than a test misconfiguration assertion. A Polling loop in CLAUDE.md prohibits // FIXME(test-faults): polling because Gasoline debug does not expose event waiters.
tokio::time::sleep(Duration::from_millis(25)).await;
pub fn pause(self, checkpoint: impl Into<String>) -> Result<DepotFaultRuleId> {
let checkpoint = checkpoint.into();
self.controller.pause_handle(checkpoint.clone());
self.insert(DepotFaultAction::Pause { checkpoint })
}Creating a pause state entry as a side effect of rule registration is unexpected. Add a comment explaining the pre-creation is intentional so the handle is live before the fault fires. Minor
Positive notes
|

Description
Please include a summary of the changes and the related issue. Please also include relevant motivation and context.
Type of change
How Has This Been Tested?
Please describe the tests that you ran to verify your changes.
Checklist: