Ensure panic safety #24

matklad · 2018-10-02T17:05:15Z

When a query panics due to a bug during execution, the panic should be propagated to the caller, but the runtime should be left in a consistent state.

This is important for IDE use-case: language server is a long-lived process, and it is being exposed to incomplete code comparatively more often than a command-line compiler. As a result, panics due to bugs are expected to be both more numerous and more annoying.

Currently, I believe we leave InProgress marks in the storage in the case of a panic, which effectively poisons the query, and can't be fixed by changing inputs.

The text was updated successfully, but these errors were encountered:

nikomatsakis · 2018-10-04T09:46:52Z

Hmm. Ordinarily, I'm not a big fan of trying to be panic safe, and would rather that we just "tear it down". But I suppose that in the case of salsa -- since we are controlling all the mutable state -- so long as we propagate the panic cleanly, we can be sure that the system is in a consistent state afterwards. (And, if not, then there is some hidden state that would probably mess up the increment system anyway.)

nikomatsakis · 2018-10-19T13:45:19Z

So I think what we need to do here...

The current strategy when executing a query is that we lock the map and we swap in a placeholder during execution. That is done here, and the value old_memo contains the value we swapped out from the map:

salsa/src/derived.rs

Lines 248 to 264 in 4a6a626

    
           let mut old_memo = match self.probe( 
        
               self.map.upgradable_read(), 
        
               runtime, 
        
               revision_now, 
        
               descriptor, 
        
               key, 
        
           ) { 
        
               ProbeState::UpToDate(v) => return v, 
        
               ProbeState::StaleOrAbsent(map) => { 
        
                   let mut map = RwLockUpgradableReadGuard::upgrade(map); 
        
                   match map.insert(key.clone(), QueryState::in_progress(runtime.id())) { 
        
                       Some(QueryState::Memoized(old_memo)) => Some(old_memo), 
        
                       Some(QueryState::InProgress { .. }) => unreachable!(), 
        
                       None => None, 
        
                   } 
        
               } 
        
           };

Under ordinary circumstances, we eventually invoke overwrite_placeholder to remove the placeholder and replace it with a fresh value:

salsa/src/derived.rs

Lines 456 to 466 in 4a6a626

    
           /// Overwrites the `InProgress` placeholder for `key` that we 
        
           /// inserted; if others were blocked, waiting for us to finish, 
        
           /// the notify them. 
        
           fn overwrite_placeholder( 
        
               &self, 
        
               runtime: &Runtime<DB>, 
        
               descriptor: &DB::QueryDescriptor, 
        
               key: &Q::Key, 
        
               memo: Memo<DB, Q>, 
        
               new_value: &StampedValue<Q::Value>, 
        
           ) {

You can see that this function also has the job of waking up other threads that may be blocked on us.

The danger is that a panic could occur and we would wind up with the placeholder being left in there. That is bad for many reasons: future calls from this thread will error out as if a cycle occurred, and other threads that are blocked waiting for a message will hang indefinitely.

I think that what we want to do is to install an "RAII guard" right here, basically after the InProgress placeholder has been installed, but before we've done anything else:

salsa/src/derived.rs

Line 265 in 4a6a626

This guard would hold on to a reference to the key and looks something like this:

struct PanicGuard<'me, DB, Q> {
    map: &'me RwLock<FxHashMap<Q::Key, QueryState<DB, Q>>>,
    key: &'me Q::Key,
}

impl<'me, DB, Q> Drop for PanicGuard<'me, DB, Q> {..}

the Drop for PanicGuard is intended to execute only if an unexpected panic occurs. That means we have to explicitly forget the guard on the "success" condition. I would probably do this by adding a parameter to overwrite_placeholder and invoking forget the first thing:

    fn overwrite_placeholder(
        &self,
        runtime: &Runtime<DB>,
        descriptor: &DB::QueryDescriptor,
        key: &Q::Key,
        memo: Memo<DB, Q>,
        new_value: &StampedValue<Q::Value>,
        panic_guard: PanicGuard<'_, DB, Q>,
    ) {
        // No panic occurred, do not run the panic-guard destructor:
        std::mem::forget(panic_guard);

        // ... as before ...
    }

If a panic does occur, then I think we probably just want to acquire the write lock on the map and remove the key altogether. I think that should always be ok -- other code treats a removed key as evidence that the value is out of date.

Then we need some tests. =)

nikomatsakis · 2018-10-19T13:45:43Z

There is a Zulip thread dedicated to this issue.

kleimkuhler · 2018-10-20T23:52:39Z

So location of this test case may belong in a new test file within parallel since it's not required the threads be true_parallel or race. Both threads need to sum(a).

Ideally thread1 would overwrite the value of a with an InProgress mark, but panic before it finishes the sum. The current implementation of fn sum may need to change to allow a panic here

salsa/tests/parallel/setup.rs

Line 120 in 4a6a626

sum += db.input(ch);

Once it panics, it should sum_signal_on_exit with a stage value that lets thread2 begin it's sum(a). If thread1 properly panicked and the cleanup of the InProgress mark on a happened, then thread2 should be okay to complete. We would then assert thread1 panicked (not sure right now how to assert that) and thread2 is the value of a.

What I could use a pointer on is how to ensure the fn sum of thread1 panics so that the cleanup happens. Right now fn sum waits and signals, but my thought is the panic would need to happen inside the actual db.input(ch).

We could add a panic Cell in KnobsStruct that is somehow checked at that point? It may required a new derived query that has the sole purpose of panicking when getting a certain key?

nikomatsakis · 2018-10-21T07:46:52Z

@kleimkuhler Hmm. I imagine the simplest test-case would not require threads.

For example, imagine we have two queries, foo and bar, where bar is an input query;

fn foo() -> usize {
}

fn bar() -> usize {
    storage input;
}

and you have

fn foo(db) -> usize {
    assert_eq!(db.bar(), 1);
}

and then you invoke db.foo() without having set db.bar to anything (so that it defaults to 0).

This will panic and — presently — leave an InProgress marker for foo.

Now if you catch that panic (via catch_unwind or by executing db.foo() in another thread), set db.bar to 1 and re-invoke db.foo(), it will give you a cycle error.

But that is incorrect.

nikomatsakis · 2018-10-21T07:53:02Z

But you are absolutely right that we should test the parallel behavior. In particular, I think my write-up neglected to mention an important detail. I wrote:

If a panic does occur, then I think we probably just want to acquire the write lock on the map and remove the key altogether

That is correct, but we also have to inspect the removed key to see if anyone is waiting for us and — if so — propagate a panic to them. I was imagining we would modify our channel so that instead of sending the final result StampedValue<Q::Value>, we send a Result<StampedValue<Q::Value>, PanicOccurred>, where PanicOccurred is some private newtype we use. If we get an Err value, we can then re-propagate the panic.

I think for testing you might like to create this scenario where:

sum begins executing on thread1 but pauses
thread2 then tries to execute sum and hence blocks
thread1 resumes and panics, hence propagating the panic to thread2

Unfortunately we can't quite force that scenario to occur yet (as you can see from the tests). Maybe it's worth modifying the library here -- the problem is that we have no hook that thread2 can execute once it has begun blocking. But we can create a scenario where thread2 signals only once it is about to block, and hence the race is likely to occur. That may be good enough, we can run in a loop a few times or something. 😛

(The only thing I can imagine to do better would be somehow modifying the salsa runtime to send signals when a blocking event occurs or other things via a channel to some other thread which would then signal thread1 to wake up...?)

matklad · 2018-10-21T08:35:52Z

the Drop for PanicGuard is intended to execute only if an unexpected panic occurs.

One thing that feels slightly off (but maybe is OK!) is that this will catch unexpected panics in salsa itself. An alternate is to create a wrapper around QueryFunction::execute (that is, all the code supplied by the user) which turns panics into results using catch_unwind.

fn execute_safely<Q, DB>(db: &DB, query: Q, key: Q::Key) -> std::thread::Result<Q::Value>
where
    Q: QueryFunction,
    DB: Database
{
    std::catch_panic(|| query.execute(db, key))
}

nikomatsakis · 2018-10-23T09:29:27Z

(Note that with #63 you can figure out when one thread is blocking for another and thus induce panics more precisely.)

nikomatsakis · 2018-10-23T09:32:38Z

One thing that feels slightly off (but maybe is OK!) is that this will catch unexpected panics in salsa itself. An alternate is to create a wrapper around QueryFunction::execute (that is, all the code supplied by the user) which turns panics into results using catch_unwind.

To be honest, I can't remember if this was by design or not, but it seems like recovering gracefully from internal salsa panics is a feature, not a bug, no?

Another option would be to use catch_unwind for user code but install a "abort-the-process" guard for salsa's code. This is what rayon does. But I don't think salsa has to manage nearly as many complex invariants as rayon does, so it doesn't seem obviously necessary.

nikomatsakis · 2018-10-30T19:18:16Z

@matklad points out here that it may be very simple to implement the parallel case after all -- in fact, maybe it even works now! Basically, just the act of dropping the channel without sending anything should trigger the "dependent cases" to panic.

nikomatsakis added the bug Something isn't working label Oct 4, 2018

nikomatsakis added the good first issue Good for newcomers label Oct 19, 2018

kleimkuhler mentioned this issue Oct 24, 2018

Panic safely in a single threaded context #62

Merged

kleimkuhler mentioned this issue Oct 31, 2018

Ensure parallel panic safety #81

Merged

nikomatsakis closed this as completed in #81 Nov 4, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Ensure panic safety #24

Ensure panic safety #24

matklad commented Oct 2, 2018

nikomatsakis commented Oct 4, 2018 •

edited

Loading

nikomatsakis commented Oct 19, 2018

nikomatsakis commented Oct 19, 2018

kleimkuhler commented Oct 20, 2018

nikomatsakis commented Oct 21, 2018

nikomatsakis commented Oct 21, 2018

matklad commented Oct 21, 2018 •

edited

Loading

nikomatsakis commented Oct 23, 2018

nikomatsakis commented Oct 23, 2018

nikomatsakis commented Oct 30, 2018

Ensure panic safety #24

Ensure panic safety #24

Comments

matklad commented Oct 2, 2018

nikomatsakis commented Oct 4, 2018 • edited Loading

nikomatsakis commented Oct 19, 2018

nikomatsakis commented Oct 19, 2018

kleimkuhler commented Oct 20, 2018

nikomatsakis commented Oct 21, 2018

nikomatsakis commented Oct 21, 2018

matklad commented Oct 21, 2018 • edited Loading

nikomatsakis commented Oct 23, 2018

nikomatsakis commented Oct 23, 2018

nikomatsakis commented Oct 30, 2018

nikomatsakis commented Oct 4, 2018 •

edited

Loading

matklad commented Oct 21, 2018 •

edited

Loading