Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce ErrorContext for tracking errors across threads #83

Merged
merged 10 commits into from
Dec 9, 2020
Merged

Introduce ErrorContext for tracking errors across threads #83

merged 10 commits into from
Dec 9, 2020

Conversation

kunalmohan
Copy link
Member

fix #61
Tracks and stores function names that send messages across threads. This information will then be passed to PanicInfo as payload for displaying at panic.

@kunalmohan
Copy link
Member Author

This is still a WIP. We now need to replace unwrap() with expect() to pass the info to PanicInfo.

Some part of the new code here has introduced more race conditions causing tests to fail intermittently- they pass on increasing time b/w snapshots.

@imsnif
Copy link
Member

imsnif commented Dec 5, 2020

Hey @kunalmohan - good start on this.

I think, as you point out yourself and we start seeing in the implementation, that any solution forcing us to pass the error context very deep into the stack of functions in the thread itself would be very fragile and hard to maintain. This would mean that any developer who wants to add things would need to be aware of the inner workings of this error mechanism. And worst yet - remember to do things differently. :) This is of course true to being unable to use unwrap and the like.

When I think about it, the only thing we really need in addition to the backtrace of the final thread is knowing which threads the error passed through and which instructions (be it ScreenInstruction, PtyInstruction, etc.) it had in those threads. Do you think we might be able to log some sort of OpenCalls struct globally and match against it when we display the error? I'm not sure what the exact implementation would be for it, but I have a feeling we would be able to do more-or-less exact matching through information we have in the error backtrace. What do you think? Does this make sense? Did I manage to explain my general idea well enough?

@kunalmohan
Copy link
Member Author

Yep, basically logging the send() calls is the main idea. As I see it we have two options here-

  1. Simply log the send() calls with the message they are sending and the function name where the send() is called. Here we won't have to edit any function to accept any new argument. But the user will have to manually identify and filter out each call in the sequence that caused the error. Not very good option in my opinion.
  2. Log the send() call (and other info) along with a unique sequence number assigned at the time of stdin (similar to how we create a new ErrorContext object currently in this PR. The sequence number will have to be passed along with the message between the threads. User will have to figure out the last call made in that sequence, which should be possible with the available backtrace. The user can then simply filter out the calls made in that particular sequence using the sequence number of that call. This would require us to edit just the functions that send any message to another thread.

I'm not sure if any of this matches to your idea of globally logging an OpenCalls struct. If not, could you elaborate your idea a bit more?

@imsnif
Copy link
Member

imsnif commented Dec 6, 2020

To be honest, I'm more leaning toward option 1. I think most (all?) paths are rather unique and there will mostly be just one "open call" the user will have to choose from. I'd even say to just display the first one and assume there are no others. Or am I missing something and this doesn't make sense? (edit: ping @kunalmohan - forgot :) )

@kunalmohan
Copy link
Member Author

This makes sense. We don't have a large number of threads that the user needs to navigate through so it should not be very difficult. And I think you're correct about the paths being unique so I guess we should go ahead with this.

Just to be very clear (before I start work on this), we'll only log the first open call, i.e. only the send() calls in the stdin_handler and the ipc_server threads. These will be logged to a new file (or the existing MOSAIC_TMP_LOG_FILE?). Should this be behind the debug flag or not?

@imsnif
Copy link
Member

imsnif commented Dec 6, 2020

Hmm... maybe I didn't understand you correctly. I was thinking to log the previous calls in memory, and then dump them with the backtrace to screen somehow.

@imsnif
Copy link
Member

imsnif commented Dec 6, 2020

So that in the end we'll have something like
stdin_thread(AcceptInput)->screen_thread(CloseActivePane)->pty_thread(ClosePane)-><backtrace_of_error_in_pty>

EDIT: forgot to add stdin thread :)

@kunalmohan
Copy link
Member Author

Hmm... maybe I didn't understand you correctly. I was thinking to log the previous calls in memory, and then dump them with the backtrace to screen somehow.

Right, sorry, it slipped from my mind. I had just one concern with it. We'll use Arc<Mutex<OpenCalls>> to share it between threads. I was concerned with cases where Mutex lock gets poisoned in panic. But I guess that's rarely going to happen (and we always can switch to parking_lot::Mutex, if it does). So I'll go ahead with this.

I'd even say to just display the first one and assume there are no others.

Also, could you briefly explain what you meant by this?

@imsnif
Copy link
Member

imsnif commented Dec 6, 2020

@kunalmohan - I think I might have been too general here in my thoughts and caused some confusion. Let's take a step back.

If we:

  1. Pass an ErrorContext in every call we make (as in: every time we send an instruction to a different thread)
  2. Attach the name of the instruction that we send so that this context contains a Vec of those instructions which we can attach to a backtrace later if there's an error
  3. Keep that context globally in the thread
  4. use that context in the panic hook (maybe sending it using another panic hook inside each thread somehow?)
  5. in the main panic hook, get:
  • the message if there is one
  • the ErrorContext
  • the backtrace

Could we forgo having something global to the whole app shared with an App<Mutex> (because that will hurt performance)?

@imsnif
Copy link
Member

imsnif commented Dec 6, 2020

I don't know if it will work, but maybe we can somehow use catch_unwind inside each thread to forward the ErrorContext to the main thread?

@kunalmohan
Copy link
Member Author

@imsnif I guess I found a solution. We'll have a LocalKey for each thread.

thread_local!(static OpenCalls: RefCell<ErrorContext> = RefCell::new(ErrorContext::new());

When a thread sends a message to another thread, we'll clone the current ErrorContext of the thread and pass it to the other thread.
When a thread receives a new message, it will replace/update it's current ErrorContext in LocalKey with the one it has received with the new message. Hence at each point in time, the LocalKey of each thread would contain the context of the instruction that it is currently processing.
We can directly access this key in the panic hook and when a thread panics, the hook will automatically select the key of the thread which panicked.
I have tested it here.

@imsnif
Copy link
Member

imsnif commented Dec 6, 2020

Interesting. I haven't used LocalKey before, so maybe you can help me make sure I understand: if one thread panics, the panic hook will access the LocalKey of the panicking thread, rather than the LocalKey of a different thread that did not panic? Or am I misunderstanding this?

@kunalmohan
Copy link
Member Author

kunalmohan commented Dec 6, 2020

When accessing a LocalKey, the LocalKey of the currently executing thread is used.

Since we will be accessing the LocalKey from the panic hook, it will access the LocalKey of the panicking thread (just like thread::current() in panic hook returns the information of panicking thread).

In short, you are correct :)

@imsnif
Copy link
Member

imsnif commented Dec 6, 2020

@kunalmohan - sounds like a great solution. Looking forward to see the implementation :)

@kunalmohan
Copy link
Member Author

An example of how the context is displayed-

thread 'screen' panicked at 'removal index (is 0) should be < len (is 0)': library/alloc/src/vec.rs:1066
ErrorContext {
    calls: [
        "stdin_handler_thread(AcceptInput)",
        "pty_thread(SpawnTerminalVertically)",
        "stream_terminal_bytes(AsyncTask)",
        "screen_thread(HandlePtyEvent)",
    ],
}
   0: mosaic::errors::handle_panic
             at src/errors.rs:11:21
   1: mosaic::start::{{closure}}
             at src/main.rs:146:13
   2: std::panicking::rust_panic_with_hook
             at /rustc/7eac88abb2e57e752f3302f02be5f3ce3d7adfb4/library/std/src/panicking.rs:581:17
   3: std::panicking::begin_panic_handler::{{closure}}
             at /rustc/7eac88abb2e57e752f3302f02be5f3ce3d7adfb4/library/std/src/panicking.rs:484:9
   4: std::sys_common::backtrace::__rust_end_short_backtrace
             at /rustc/7eac88abb2e57e752f3302f02be5f3ce3d7adfb4/library/std/src/sys_common/backtrace.rs:153:18

....

@kunalmohan kunalmohan marked this pull request as ready for review December 7, 2020 07:45
@kunalmohan
Copy link
Member Author

@imsnif this is ready for the first round of review.

All tests are passing on my local machine. I guess again some race conditions are causing failures in CI.

Copy link
Member

@imsnif imsnif left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kunalmohan - I think your solution with thread_local and OPENCALLS is very elegant. I could not have imagined anything simpler. Very cool :)

I left some comments about the implementation. We might have to do some back-and-forth about this, because this is a relatively big and very "deep" feature. I hope you can be a little patient with me. :)

Other than that, about displaying:
What do you think of adding some nice colors? Instead of

ErrorContext {
    calls: [
        "stdin_handler_thread(AcceptInput)",
        "pty_thread(SpawnTerminalVertically)",
        "stream_terminal_bytes(AsyncTask)",
        "screen_thread(HandlePtyEvent)",
    ],
}

something like:

Originating thread(s):
  1. stdin_handler_thread(purple): AcceptInput(green)
  2. pty_thread(purple): SpawnTerminalVertically(green)
  3. stream_terminal_bytes(purple): AsyncTask(green)
  4. screen_thread(purple): HandlePtyEvent(green)

Error: <message>(red)
  <stacktrace>(reset color)

src/main.rs Outdated Show resolved Hide resolved
src/main.rs Outdated Show resolved Hide resolved
src/errors.rs Outdated Show resolved Hide resolved
Copy link
Member

@imsnif imsnif left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good! Added some more comments

src/errors.rs Outdated Show resolved Hide resolved
src/errors.rs Show resolved Hide resolved
src/input.rs Outdated Show resolved Hide resolved
Cargo.lock Outdated Show resolved Hide resolved
Copy link
Member

@imsnif imsnif left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Everything looks great. Thanks for being so patient (and quick!) with all my requests and suggestions. This is top notch work! I would be very happy to keep seeing you contributing and taking part in our project. :)

@imsnif imsnif merged commit 92d1bcf into zellij-org:main Dec 9, 2020
@kunalmohan kunalmohan deleted the error-context branch December 9, 2020 17:01
@PhML PhML mentioned this pull request Jun 17, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Implement Error handling
2 participants