Mozilla bug 845190 isn't reproducing in rr #40
Comments
Pretty sure now that this isn't happening. I let the automation run on bare metal for 850 runs and didn't repro, but when I ran the test outside of rr on the same machine it only repro'd about 1 in 10 times, and I saw the same symptoms as in the VM when it didn't repro. So it's almost certainly a scheduling issue or something in that neighborhood.
Some more data
Skimming the actual bug, it sounds like some code is running on an off-main thread too late after shutdown. So it might be that the scheduling under ptrace is actually too fair: rr isn't letting the buggy thread get far enough behind to exhibit the bug (or, equivalently, the main thread far enough ahead).
That might also explain why the bug reproduces much less frequently on bare metal with many cores: the threads can actually run in parallel.
Whole-system lock-ups are getting annoying. I saw a whole passel of gettimeofday/clock_gettime calls under strace for this test, so I wondered if those calls were disturbing things too much. I finished #43 to test this out, but the results were
Hi Chris, did you try smaller values for thread preemption? I think the current value is set to 1,000,000 retired conditional branches. Smaller values might increase the chance of hitting the bug. Albert
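(For readers not familiar with the knob being discussed: rr's preemption interval is expressed in retired conditional branches, counted with a hardware performance counter. Below is a minimal sketch of programming such a counter with a sample period via `perf_event_open`; it is illustrative only, and uses the generic `PERF_COUNT_HW_BRANCH_INSTRUCTIONS` event as a stand-in for the CPU-specific "retired conditional branches" raw event that rr actually uses.)

```c
/* Illustrative sketch only, not rr's implementation.  Program a hardware
 * branch counter with a sample period of `period`; the kernel notifies the
 * tracer after roughly that many retired branches, which can serve as a
 * preemption point.  PERF_COUNT_HW_BRANCH_INSTRUCTIONS counts all retired
 * branches, not just conditional ones -- it is a stand-in here. */
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <string.h>
#include <unistd.h>

static int open_preempt_counter(pid_t tid, unsigned long long period) {
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_HARDWARE;
    attr.config = PERF_COUNT_HW_BRANCH_INSTRUCTIONS; /* stand-in event */
    attr.sample_period = period;   /* e.g. 1000000, or something much smaller */
    attr.exclude_kernel = 1;       /* count user-space branches only */
    attr.disabled = 1;             /* enable later with PERF_EVENT_IOC_ENABLE */
    /* perf_event_open(attr, pid, cpu, group_fd, flags) has no libc wrapper */
    return syscall(SYS_perf_event_open, &attr, tid, -1, -1, 0);
}
```

The resulting fd can be enabled with `ioctl(fd, PERF_EVENT_IOC_ENABLE, 0)` and wired to deliver a signal on counter overflow (`O_ASYNC` plus `fcntl` `F_SETOWN`/`F_SETSIG`), giving the tracer a periodic point at which it can preempt and switch tracees. Shrinking the period, as suggested above, makes preemption more aggressive.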
Hey Albert, yep, I tried the values
I wrote a library to start gathering information about the executions themselves, but the output is hard to interpret. So nothing substantial to report yet, but getting closer.
Still not able to reproduce in the VM with the fix from #48. Maybe I'll have better luck on bare metal.
Finally, some data. The monitoring lib is dumping context-switching stats on exit (and on receipt of usually-fatal signals).

Here's what I see running the testcase outside of rr, when it passes:

    pass:
    RRMON[6531][xpcshell]: vol: 10; nonvol: 124
    RRMON[6611][xpcshell]: vol: 10; nonvol: 124
    RRMON[6976][xpcshell]: vol: 10; nonvol: 125

I'd like to dump stats from when it fails too, but Gecko is stomping the signal handlers for my monitor lib.

Here's what things look like running under rr-without-filter-lib, which will soon be an obsolete configuration:

    pass:
    RRMON[6409][xpcshell]: vol: 14103; nonvol: 209
    RRMON[6451][xpcshell]: vol: 14097; nonvol: 216
    RRMON[6491][xpcshell]: vol: 14103; nonvol: 222

and with the filter lib:

    pass:
    RRMON[5392][xpcshell]: vol: 5120; nonvol: 134
    RRMON[5433][xpcshell]: vol: 5130; nonvol: 129
    RRMON[6316][xpcshell]: vol: 5115; nonvol: 125

And for comparison, results from

    pass:
    RRMON[7025][xpcshell]: vol: 14086; nonvol: 107
    RRMON[7065][xpcshell]: vol: 14096; nonvol: 108
    RRMON[7104][xpcshell]: vol: 14094; nonvol: 111

This doesn't give any information about the actual schedule itself, but it shows some interesting things
With a better measuring mechanism, I get similar but slightly different data. Each listing is representative.

    no-rr               vol: 61;    nonvol: 142; sig: 0; hard-fault: 0
    ptrace              vol: 14140; nonvol: 137; sig: 0; hard-fault: 0
    rr-no-filter        vol: 15933; nonvol: 257; sig: 0; hard-fault: 0
    rr-10000 (default)  vol: 6538;  nonvol: 146; sig: 0; hard-fault: 0
    rr-1                vol: 7105;  nonvol: 116; sig: 0; hard-fault: 0
    rr-(1<<30)          vol: 6568;  nonvol: 151; sig: 0; hard-fault: 0
    rr-1000             vol: 6570;  nonvol: 141; sig: 0; hard-fault: 0
    rr-100000           vol: 6598;  nonvol: 148; sig: 0; hard-fault: 0
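(For reference, these vol/nonvol/sig/hard-fault numbers map directly onto `getrusage()`'s `ru_nvcsw`, `ru_nivcsw`, `ru_nsignals`, and `ru_majflt` fields, so a minimal sketch of the kind of preloaded monitor that could produce such a listing might look like the following. This is an assumed reconstruction, not the actual RRMON library.)

```c
/* Sketch of an LD_PRELOADed monitor (not the real RRMON lib) that dumps
 * scheduling stats at exit in roughly the shape of the listings above:
 * vol/nonvol = voluntary/involuntary context switches, sig = signals,
 * hard-fault = major page faults. */
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <sys/resource.h>
#include <unistd.h>

static void dump_sched_stats(void) {
    struct rusage ru;
    if (getrusage(RUSAGE_SELF, &ru))
        return;
    /* Note: Linux does not maintain ru_nsignals, so "sig" always reads 0,
     * which matches the data above. */
    fprintf(stderr, "RRMON[%d]: vol: %ld; nonvol: %ld; sig: %ld; hard-fault: %ld\n",
            (int)getpid(), ru.ru_nvcsw, ru.ru_nivcsw, ru.ru_nsignals, ru.ru_majflt);
}

__attribute__((constructor))
static void rrmon_init(void) {
    /* atexit() covers normal exits; dumping on fatal signals additionally
     * requires signal handlers, which (as noted above) Gecko tends to stomp. */
    atexit(dump_sched_stats);
}
```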
Aha! In my VM, the
On bare metal, rbc is working, but the execution profile isn't significantly different. So I expect there's a very frequent syscall that we're not filtering. A likely candidate is
(7f20946 :} ) |
Top unfiltered syscalls on this testcase, summed over 5 runs:

    syscall  count
    224      2644   # gettid [Ed: mostly buffer flushes]
    192      2160   # mmap2
    240      2006   # futex
    195      1640   # stat64
    125      1560   # mprotect
    197       970   # fstat64
(Just to satisfy my own pedantry, here's a histogram without filtering, for one run:

    syscall  count
    78       11250  # gettimeofday
    265        480  # clock_gettime
    192        410  # mmap2
    240        386  # futex
    195        320  # stat64
    125        306  # mprotect
    197        192  # fstat64

So we definitely chose the right syscalls to filter first.)
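(The first column is the raw i386 syscall number: 78 = gettimeofday, 224 = gettid, and so on. A histogram like this can be gathered several ways; the sketch below is one hypothetical approach using plain `PTRACE_SYSCALL` stops, assuming a single-threaded child — the numbers above were presumably collected from rr's own trace machinery. `strace -c -f` produces a similar summary with names rather than numbers.)

```c
/* Sketch: count syscalls-by-number for a child command using ptrace.
 * Assumes x86 (orig_eax/orig_rax holds the syscall number) and a
 * single-threaded child; a real tracer must also handle clones, signal
 * stops, and syscall-entry vs. -exit parity more carefully. */
#include <stdio.h>
#include <sys/ptrace.h>
#include <sys/user.h>
#include <sys/wait.h>
#include <unistd.h>

int main(int argc, char** argv) {
    if (argc < 2) {
        fprintf(stderr, "usage: %s cmd [args...]\n", argv[0]);
        return 2;
    }
    long counts[512] = { 0 };
    pid_t child = fork();
    if (child == 0) {
        ptrace(PTRACE_TRACEME, 0, NULL, NULL);
        execvp(argv[1], argv + 1);
        _exit(127);
    }
    int status, entering = 1;
    waitpid(child, &status, 0);            /* initial stop after exec */
    while (1) {
        ptrace(PTRACE_SYSCALL, child, NULL, NULL);
        waitpid(child, &status, 0);
        if (WIFEXITED(status))
            break;
        if (entering) {                    /* count only syscall entries */
            struct user_regs_struct regs;
            ptrace(PTRACE_GETREGS, child, NULL, &regs);
#ifdef __x86_64__
            long nr = regs.orig_rax;
#else
            long nr = regs.orig_eax;
#endif
            if (nr >= 0 && nr < 512)
                counts[nr]++;
        }
        entering = !entering;
    }
    for (int i = 0; i < 512; ++i)
        if (counts[i])
            printf("%d\t%ld\n", i, counts[i]);
    return 0;
}
```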
Beyond filtering may-block syscalls, there doesn't look like much win here. mprotect was pretty high up in the list, and it's trivial to wrap, but doing so didn't make much difference:

    125      306   # mprotect
    ----- filter -----
    125      294   # mprotect

The vast majority of the calls must be coming from ld/libc, which isn't too surprising. Similarly, I tried un-filtering futex:

    240      386   # futex
    ----- un-filter -----
    240      406   # futex

This one is a bit surprising. libc uses futexes to protect some internal state, and those might invoke the syscall directly. But if that were the case, I would expect libc to enter the kernel through the same interface as for the "real" futex calls. Or it's possible that

I'm going to see if
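(For readers following along: "trivial to wrap" refers to intercepting the call at the libc symbol level before it traps into the kernel. A minimal interposition sketch is below; this is not necessarily how rr's filter lib wraps syscalls, just an illustration of why wrapping barely moved the numbers: calls that ld.so/libc make through their own internal syscall stubs never reach the PLT symbol, so a wrapper like this never sees them.)

```c
/* Sketch of symbol-level interposition via LD_PRELOAD (not rr's real
 * filter library).  Only callers that reach mprotect through the PLT hit
 * this wrapper; ld.so/libc-internal invocations bypass it entirely.
 * Build with: gcc -shared -fPIC -o wrap.so wrap.c -ldl */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <sys/mman.h>

static int (*real_mprotect)(void*, size_t, int);

int mprotect(void* addr, size_t len, int prot) {
    if (!real_mprotect)
        real_mprotect = (int (*)(void*, size_t, int))dlsym(RTLD_NEXT, "mprotect");
    /* ... record the call for the tracer here ... */
    return real_mprotect(addr, len, prot);
}
```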
OK,
I haven't been updating this bug, not because I haven't been working on it, but because I haven't figured it out yet. Early this week I wrote a little tool to visualize perf schedules. That's helped a lot, but I'm still not there. From what I've seen
Anyway, atm I'm enjoying being "distracted" by gdb ;).
It turns out that rr will often busy-wait, which might explain some of the fairly surprising CPU usage. This in turn might be affecting recording fidelity, but that's a bit more of a stretch.
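(The busy-wait in question is the tracer polling for a tracee to change state instead of blocking, along the lines of the hypothetical loop below — the real code paths in rr differ. Spinning like this keeps a core hot and competes with the tracee for CPU, which is the fidelity worry.)

```c
/* Hypothetical illustration of the pattern, not rr's actual code. */
#include <sys/types.h>
#include <sys/wait.h>

void spin_until_stop(pid_t tracee) {
    int status;
    /* waitpid() with WNOHANG returns 0 immediately while the tracee hasn't
     * changed state, so this loop spins at 100% CPU until it does. */
    while (waitpid(tracee, &status, WNOHANG) == 0)
        ; /* busy-wait */
}

void block_until_stop(pid_t tracee) {
    int status;
    /* The blocking form sleeps in the kernel and frees the core for the
     * tracee (or anything else) to use. */
    waitpid(tracee, &status, 0);
}
```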
The Gecko bug is closed now, and we've recorded other oranges, so I'm going to let this one go.
The bug reproduces seemingly every time outside of rr, but I haven't hit it in rr after hundreds of runs. I think rr may be changing execution semantics: there's an error message printed in rr runs that I don't think I see sans rr. But unlucky scheduling is always possible.