
Add background hang monitor #21673

Open

gterzian wants to merge 18 commits into master from gterzian:add_background_monitor
Conversation

@gterzian
Collaborator

gterzian commented Sep 11, 2018

Early feedback welcome, still need to actually integrate it in the constellation message loop, and use it in Layout and Script...


  • ./mach build -d does not report any errors
  • ./mach test-tidy does not report any errors
  • These changes fix #16740 (github issue number if applicable).
  • There are tests for these changes OR
  • These changes do not require tests because _____


@highfive


highfive commented Sep 11, 2018

Heads up! This PR modifies the following files:

  • @asajeffrey: components/constellation/background_hang_monitor.rs, components/constellation/constellation.rs, components/constellation/tests/hang_monitor_tests.rs, components/constellation/lib.rs
  • @wafflespeanut: python/servo/testing_commands.py
  • @cbrewster: components/constellation/background_hang_monitor.rs, components/constellation/constellation.rs, components/constellation/tests/hang_monitor_tests.rs, components/constellation/lib.rs
  • @paulrouget: components/constellation/background_hang_monitor.rs, components/constellation/constellation.rs, components/constellation/tests/hang_monitor_tests.rs, components/constellation/lib.rs
  • @KiChjang: components/script_traits/script_msg.rs

@gterzian gterzian changed the title from [WIP] Add background monitor to [WIP] Add background hang monitor Sep 11, 2018

@bors-servo


Contributor

bors-servo commented Sep 12, 2018

☔️ The latest upstream changes (presumably #21672) made this pull request unmergeable. Please resolve the merge conflicts.

}
struct MonitoredComponent {
last_activity: Cell<Instant>,


@jdm

jdm Sep 12, 2018

Member

I don't think these need to be cells; we should have mutable access to these members when necessary.
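A minimal sketch of that suggestion, with plain fields and illustrative names (the timeout and alert fields follow the ones that appear later in this review):

```rust
use std::time::{Duration, Instant};

struct MonitoredComponent {
    // Plain fields instead of Cells: the monitor owns the map,
    // so updates go through `&mut MonitoredComponent`.
    last_activity: Instant,
    transient_hang_timeout: Duration,
    permanent_hang_timeout: Duration,
    sent_transient_alert: bool,
    sent_permanent_alert: bool,
}

impl MonitoredComponent {
    fn note_activity(&mut self) {
        // No Cell needed; mutation happens via `&mut self`.
        self.last_activity = Instant::now();
        self.sent_transient_alert = false;
        self.sent_permanent_alert = false;
    }
}
```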

@@ -1095,6 +1106,9 @@ where
};
match content {
FromScriptMsg::ForwardToBackgroundHangMonitor(msg) => {


@jdm

jdm Sep 12, 2018

Member

I fear that including the BHM inside the constellation will cause the amount of traffic that the constellation deals with to increase substantially. I think we should give the BHM its own thread and channel so the constellation remains responsive.
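A rough sketch of that separation, with illustrative names and a simplified message type: the monitor runs its own receive loop on a dedicated thread, so hang bookkeeping never queues behind constellation traffic.

```rust
use std::sync::mpsc::{channel, Sender};
use std::thread;

enum MonitoredComponentMsg {
    NotifyActivity,
    NotifyWait,
}

fn spawn_background_hang_monitor() -> Sender<MonitoredComponentMsg> {
    let (sender, receiver) = channel();
    let _ = thread::Builder::new()
        .name("BackgroundHangMonitor".to_owned())
        .spawn(move || {
            // Dedicated loop: the constellation no longer forwards
            // these messages through its own event queue.
            while let Ok(msg) = receiver.recv() {
                match msg {
                    MonitoredComponentMsg::NotifyActivity => { /* reset hang timers */ },
                    MonitoredComponentMsg::NotifyWait => { /* component is idle */ },
                }
            }
        });
    sender
}
```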


@gterzian

gterzian Sep 13, 2018

Collaborator

I agree. I was also wondering if the current setup would result in spurious hangs due to BHM messages being stuck further down in the queue...

.map(|(id, _)| MonitoredComponentType::Script(id.clone()))
.collect();
let msg = MonitoredComponentMsg::NotifyWait(waiting_components);
let _ = self.script_sender.send((id.clone(), ScriptMsg::ForwardToBackgroundHangMonitor(msg)));


@jdm

jdm Sep 12, 2018

Member

It occurs to me that using the pipeline id isn't an ideal solution here, given that we can have multiple pipelines in a single script thread. I wonder if we should introduce a new EventLoopId concept that the constellation is in charge of tracking and sending to new script and layout threads?
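A sketch of what such an id could look like; it is hypothetical, not part of this PR:

```rust
/// Hypothetical identifier for a script event loop, minted by the
/// constellation; one script thread (and all its pipelines) share one id.
#[derive(Clone, Copy, Debug, Eq, Hash, PartialEq)]
pub struct EventLoopId(pub u32);
```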


@gterzian

gterzian Sep 13, 2018

Collaborator

Yes, that's a good idea. (I looked at using the TopLevelBrowsingContextId, but since we introduced auxiliaries, a script-thread can also manage multiple of those.)

I can add a new type of id to the existing EventLoop concept.


@gterzian

gterzian Sep 16, 2018

Collaborator

An event-loop id might not reflect reality well either: while we will indeed have one script-thread per event-loop, we can have several layout threads. For now I have opted for this: https://github.com/servo/servo/pull/21673/files#diff-8dc9268416cd4f037d50206e76c73cf9R525

@@ -1239,6 +1254,14 @@ impl ScriptThread {
}
}
fn notify_activity_to_hang_monitor(&self, category: &ScriptThreadEventCategory, pipeline_id: &PipelineId) {
let position = ScriptThreadEventCategory::into_enum_iter().position(|x| &x == category).unwrap();
let hang_annotation = ScriptHangAnnotation::into_enum_iter().nth(position).unwrap();


@jdm

jdm Sep 12, 2018

Member

Clever, but also a bit scary how easily the two lists could get out of sync! I would be more comfortable if we could generate the matching enums from a macro at build time somehow.


@gterzian

gterzian Sep 13, 2018

Collaborator

I agree, I will look into it... It could be useful to apply to the below as well:

ScriptThreadEventCategory::AttachLayout => ProfilerCategory::ScriptAttachLayout,
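One way to keep the two enums in lockstep, sketched with an illustrative subset of variants: declare the list once and have a macro emit both enums plus the conversion, making the position-based lookup above unnecessary.

```rust
// Sketch of the macro idea: one variant list, two enums that can
// never drift apart, and a direct conversion instead of enum_iter.
macro_rules! event_categories {
    ($($variant:ident,)*) => {
        #[derive(Clone, Copy, Debug, PartialEq)]
        pub enum ScriptThreadEventCategory { $($variant,)* }

        #[derive(Clone, Copy, Debug, PartialEq)]
        pub enum ScriptHangAnnotation { $($variant,)* }

        impl From<ScriptThreadEventCategory> for ScriptHangAnnotation {
            fn from(category: ScriptThreadEventCategory) -> Self {
                match category {
                    $(ScriptThreadEventCategory::$variant =>
                        ScriptHangAnnotation::$variant,)*
                }
            }
        }
    };
}

// Illustrative subset; the real list has one entry per event category.
event_categories! {
    AttachLayout,
    Resize,
    ScriptEvent,
}
```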

@gterzian


Collaborator

gterzian commented Sep 16, 2018

@bors-servo


Contributor

bors-servo commented Sep 16, 2018

⌛️ Trying commit 4e92898 with merge cd3b0dd...

bors-servo added a commit that referenced this pull request Sep 16, 2018

Auto merge of #21673 - gterzian:add_background_monitor, r=<try>
[WIP] Add background hang monitor

Early feedback welcome, still need to ~~actually integrate it in the constellation message loop, and~~ use it in Layout and Script...
@gterzian


Collaborator

gterzian commented Sep 16, 2018

@jdm I've restructured the whole thing so that we now have one monitoring thread per layout or script thread. I've also added a very rudimentary form of backtrace, via a signal sent to the process by the monitor thread in case of a permanent hang. Unfortunately, I think the signal will be handled by all threads in the process, so you'll get a backtrace of more than just the hanging thread. I guess it's a start... :)

@gterzian


Collaborator

gterzian commented Sep 19, 2018

I'll try to target the hanging thread more specifically using libc::pthread_t and company; however, I'm still figuring out how to get a full stack trace beyond the code run in the signal handler.

This provides a good description of how the Linux part is done in Gecko: https://dxr.mozilla.org/mozilla-central/source/tools/profiler/core/platform-linux-android.cpp?q=SuspendAndSampleAndResumeThread&redirect_type=direct#148
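For the targeting part, a minimal Linux sketch under the assumption that the monitor already knows the samplee's pthread id; the hard part, walking the stack from the handler's context as Gecko does, is elided:

```rust
#[cfg(any(target_os = "android", target_os = "linux"))]
fn request_sample(samplee: libc::pthread_t) {
    unsafe {
        // Deliver SIGPROF to the hanging thread only; its handler
        // (installed elsewhere via sigaction) captures the backtrace.
        libc::pthread_kill(samplee, libc::SIGPROF);
    }
}
```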

@gterzian


Collaborator

gterzian commented Sep 20, 2018

@jdm I think this one could use another round of review. I've added some code pertaining to the SIGPROF handling, and I'm also actually trying to communicate the backtrace back to the monitoring thread (which then passes it on to the constellation).

It's a bit messy and I'm not sure it's actually a good idea... :)

@gterzian


Collaborator

gterzian commented Sep 21, 2018

Ok, I cleaned up the traceback stuff a bit. The good news is that I was able to get the whole thing to crash in the unit test, due to concurrent access to the static (I had made an error with the boolean return value of compare_and_swap), and fixed it. So it would appear to be somewhat more robust now...
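For the record, the pitfall described here is that compare_and_swap returns the previous value rather than a success flag, so the swap succeeded only if the returned value equals the expected one:

```rust
use std::sync::atomic::{AtomicBool, Ordering};

static SAMPLING_IN_PROGRESS: AtomicBool = AtomicBool::new(false);

fn try_begin_sampling() -> bool {
    // The previous value was `false` iff we are the ones who flipped
    // it to `true`; a returned `true` means "already taken".
    !SAMPLING_IN_PROGRESS.compare_and_swap(false, true, Ordering::SeqCst)
}
```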

@gterzian


Collaborator

gterzian commented Sep 21, 2018

@bors-servo


Contributor

bors-servo commented Sep 21, 2018

⌛️ Trying commit 0d84ce3 with merge 3e34206...

bors-servo added a commit that referenced this pull request Sep 21, 2018

Auto merge of #21673 - gterzian:add_background_monitor, r=<try>
[WIP] Add background hang monitor

Early feedback welcome, still need to ~~actually integrate it in the constellation message loop, and~~ use it in Layout and Script...
@bors-servo


Contributor

bors-servo commented Sep 21, 2018

💔 Test failed - windows-msvc-dev

@gterzian


Collaborator

gterzian commented Sep 21, 2018

Ok,

  1. using libc doesn't work on Windows (I guess compiler flags are the solution for now)
  2. Android: Unexpected dynamic symbols in binary: signal
@jdm

I like a lot of what I'm seeing here! I'm mostly concerned about the large number of extra threads.

@@ -864,6 +880,7 @@ where
fn handle_request(&mut self) {
enum Request {
Script((PipelineId, FromScriptMsg)),
BackGroundHangMonitor(HangAlert),


@jdm

jdm Sep 21, 2018

Member

nit: Background

@@ -476,6 +481,13 @@ pub struct ScriptThread {
/// A queue of tasks to be executed in this script-thread.
task_queue: TaskQueue<MainThreadScriptMsg>,
/// A channel on which hang alerts can be sent to the constellation.
/// Stored here for use in the AttachLayout workflow.
background_hang_monitor_to_constellation_chan: IpcSender<HangAlert>,


@jdm

jdm Sep 21, 2018

Member

I think this could be sent as part of NewLayoutInfo instead.

pub fn new(
sender: Sender<(MonitoredComponentId, MonitoredComponentMsg)>,
component_id: MonitoredComponentId,
) -> Self {


@jdm

jdm Sep 21, 2018

Member

nit: formatting

pub fn send(&self, msg: MonitoredComponentMsg) {
if let Err(_) = self.sender.send((self.component_id.clone(), msg)) {
warn!("BackgroundHangMonitor has gone away");


@jdm

jdm Sep 21, 2018

Member

It might be worth storing a boolean to ensure we only report this once; it could end up causing a lot of the same errors given how frequently we send messages on these channels.
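A sketch of that suggestion against the snippet above, reusing the PR's types and adding a hypothetical disconnected flag:

```rust
use std::cell::Cell;

impl BackgroundHangMonitorChan {
    pub fn send(&self, msg: MonitoredComponentMsg) {
        if self.sender.send((self.component_id.clone(), msg)).is_err() {
            // Warn only on the first failed send; `disconnected` is a
            // Cell<bool> field added alongside `sender` and `component_id`.
            if !self.disconnected.replace(true) {
                warn!("BackgroundHangMonitor has gone away");
            }
        }
    }
}
```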

let component_id = MonitoredComponentId(pipeline, component_type);
let bhm_chan = BackgroundHangMonitorChan::new(sender, component_id.clone());
install_sigprof_handler();
let _ = thread::Builder::new().spawn(move || {


@jdm

jdm Sep 21, 2018

Member

So right now we have one background thread per monitored thread? I wonder if it's possible to have a single background thread that covers all monitored threads instead. In any case, right now each hang monitor contains a hashmap with a single entry, as far as I can tell. For inspiration, check out the implementation of the ServiceWorkerManager, which works in both in-process and multiprocess mode and creates a single thread for coordinating service workers.
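A sketch of the single-thread shape, reusing the constructor signature shown above: one shared sender is handed out through a register, so monitored threads get a lightweight handle instead of each spawning their own monitor thread.

```rust
use std::sync::mpsc::Sender;

// Hypothetical register created once per process; cloning the shared
// sender is all a new component needs, so no extra threads are spawned.
pub struct BackgroundHangMonitorRegister {
    sender: Sender<(MonitoredComponentId, MonitoredComponentMsg)>,
}

impl BackgroundHangMonitorRegister {
    pub fn register_component(
        &self,
        component_id: MonitoredComponentId,
    ) -> BackgroundHangMonitorChan {
        // The single monitor thread owns the map of all components.
        BackgroundHangMonitorChan::new(self.sender.clone(), component_id)
    }
}
```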

use std::time::{Duration, Instant};
/// The means of communication between monitored and monitor, inside of a "trace transaction".
pub static mut TRACE_SENDER: Option<Sender<(libc::pthread_t, Backtrace)>> = None;


@jdm

jdm Sep 21, 2018

Member

Rather than using a channel, why not have the signal handler store the values in this global variable Option<(libc::pthread_t, Backtrace)> and read it directly? This will mean that the handler will perform no allocation, which is usually safer. We will probably want to use the atomic crate to ensure the reads and writes are properly synchronized, though.
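A sketch of that combination, with illustrative names: a plain global plus an atomic flag whose Release/Acquire pair makes the handler's write visible to the monitor. (Note that capturing a Backtrace still allocates unless symbol resolution is deferred.)

```rust
use std::sync::atomic::{AtomicBool, Ordering};

use backtrace::Backtrace;

/// Set by the signal handler after it has written `SAMPLE`.
static SAMPLE_READY: AtomicBool = AtomicBool::new(false);
static mut SAMPLE: Option<(libc::pthread_t, Backtrace)> = None;

/// Called from the SIGPROF handler on the samplee thread.
unsafe fn publish_sample(thread: libc::pthread_t, trace: Backtrace) {
    SAMPLE = Some((thread, trace));
    // Release pairs with the Acquire below, publishing the write.
    SAMPLE_READY.store(true, Ordering::Release);
}

/// Polled from the monitor thread.
unsafe fn take_sample() -> Option<(libc::pthread_t, Backtrace)> {
    if SAMPLE_READY.swap(false, Ordering::Acquire) {
        SAMPLE.take()
    } else {
        None
    }
}
```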

let mut component = self
.monitored_components
.get_mut(&component_id)
.expect("Receiced NotifyActivity for an unknown component");


@jdm

jdm Sep 21, 2018

Member

nit: Received

}
let last_annotation = monitored.last_annotation.unwrap();
if monitored.last_activity.elapsed() > monitored.permanent_hang_timeout {
match monitored.sent_permanent_alert {


@jdm

jdm Sep 21, 2018

Member
if monitored.sent_permanent_alert {
    continue;
}
...
}
if monitored.last_activity.elapsed() > monitored.transient_hang_timeout {
match monitored.sent_transient_alert {
true => continue,


@jdm

jdm Sep 21, 2018

Member

Same.

@gterzian


Collaborator

gterzian commented Sep 23, 2018

@jdm Ok, I have addressed your comments:

  • One monitor thread per process (or just a single one if not in multiprocess mode). This somewhat follows the pattern of ServiceWorkerManager.
  • I've tried using the atomic crate; however, there were a couple of problems:
    1. Atomic<T> requires T to be Copy, so it was quite hard to get the backtrace across (although, see a few comments below: perhaps we don't actually want to pass along the backtrace).
    2. On my machine is_lock_free returned false, indicating that the atomic was implemented with locks, which in our case seems to defeat the purpose of using it in the first place, right?

So I simply used a combination of a mutable Option and an AtomicBool to synchronize access to it (and since there won't be multiple monitoring threads, there is no need for the transaction concept anymore).

You can see the switch in this commit: 4976bef

Please let me know what you think.

recv(self.port.select(), event) => {
match event {
Some(msg) => Some(msg),
// Our sender has been dropped, quit.


@gterzian

gterzian Sep 23, 2018

Collaborator

I'm wondering whether the assumption that the sender will be dropped is still valid, although I think it is. In single-process mode even the constellation will own a sender, yet that will be dropped when the constellation exits, right?

@gterzian


Collaborator

gterzian commented Sep 23, 2018

It's also noteworthy that in Gecko they seem to have a more complicated "messaging" structure between the monitor and the monitored thread, apparently in order to do most of the stack-sampling work from within the monitor, and only resume the monitored thread when that is done.

We just get the stack from inside the signal handler and pass it on to the monitor.

I wonder if we should look into taking the stack trace from within the monitor instead (and just passing some minimal data from monitored to monitor that would be enough to access the backtrace from one via the other)...

See https://dxr.mozilla.org/mozilla-central/source/tools/profiler/core/platform-linux-android.cpp?q=SuspendAndSampleAndResumeThread&redirect_type=direct#156

@jdm


Member

jdm commented Nov 18, 2018

Appveyor is trying to build the mach crate and failing, so it looks like some of the platform-specific cfgs aren't quite right.

@bors-servo


Contributor

bors-servo commented Nov 18, 2018

☔️ The latest upstream changes (presumably #22142) made this pull request unmergeable. Please resolve the merge conflicts.

process::abort();
}));
if let Err(()) = suspend_thread(thread_id) {
return None;


@Aaron1011

Aaron1011 Nov 19, 2018

Contributor

The previous panic hook needs to be restored here.
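A sketch of the fix against the snippet above: stash the hook returned by panic::take_hook and reinstall it on every exit path.

```rust
use std::panic;

// Keep the previous hook so it can be restored afterwards, instead of
// leaving the aborting hook installed for the rest of the process.
let previous_hook = panic::take_hook();
panic::set_hook(Box::new(|_| {
    // A panic while the samplee is suspended could deadlock, so bail out hard.
    std::process::abort();
}));
if let Err(()) = suspend_thread(thread_id) {
    // Restore the previous hook before returning early.
    panic::set_hook(previous_hook);
    return None;
}
```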


@gterzian

gterzian Nov 19, 2018

Collaborator

thanks!

fmt
@jdm

jdm approved these changes Nov 19, 2018

@jdm


Member

jdm commented Nov 19, 2018

Go ahead and squash the commits into a set that makes sense!
@bors-servo delegate+

@bors-servo


Contributor

bors-servo commented Nov 19, 2018

✌️ @gterzian can now approve this pull request

/// The means of communication between monitored and monitor.
#[cfg(any(target_os = "android", target_os = "linux"))]
static mut SAMPLEE_CONTEXT: Option<(MonitoredThreadId, libc::mcontext_t)> = None;


@Aaron1011

Aaron1011 Nov 19, 2018

Contributor

I think this still suffers from the same problem - since it's not accessed atomically, the sampler thread may not see writes made by the samplee thread.
