
Mozilla bug 845190 isn't reproducing in rr #40

Closed · joneschrisg opened this issue May 14, 2013 · 20 comments

@joneschrisg · Contributor

The bug reproduces seemingly every time outside of rr; I haven't hit it in rr after hundreds of runs. I think that rr may be changing execution semantics. There's an error message printed from rr runs that I don't think I see sans rr. But unlucky scheduling is always possible.

@ghost assigned joneschrisg May 14, 2013
@joneschrisg · Contributor Author

I think that rr may be changing execution semantics. There's an error message printed from rr runs that I don't think I see sans rr.

Pretty sure now that this isn't happening. I let the automation run on bare metal for 850 runs and didn't repro, but when I ran the test outside of rr on the same machine, it only repro'd 1/10 times, and I saw the same symptoms as in the VM when it didn't repro. So almost certainly a scheduling issue or something in that neighborhood.

@joneschrisg · Contributor Author

Some more data

  • taskset 0x1 vs taskset 0x3 doesn't seem to make a difference; bare metal is at ~700 runs with the latter and no failures
  • running the test under a no-op wrapper still reproduces the bug. So some scheduler quirk around terminals or process hierarchies doesn't seem to be "it".
  • the bug doesn't reproduce under strace either

Skimming the actual bug, it sounds like some code is running on an off-main thread too late after shutdown. So it might be that the scheduling under ptrace is actually too fair: rr isn't letting the buggy thread fall far enough behind to exhibit the bug (or, respectively, the main thread get far enough ahead).

@joneschrisg · Contributor Author

That might also explain why the bug reproduces much less frequently on bare metal with many cores: the threads can actually run in parallel.

@joneschrisg · Contributor Author

Whole-system lock-ups are getting annoying.

I saw a whole passel of gettimeofday/clock_gettime calls under strace for this test, so I wondered if those calls were disturbing things too much. I finished #43 to test this out, but the results were

  • a seemingly-related error reproduces more frequently, but not the orange itself
  • recording firefox is now broken under the filter lib :/
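
(For illustration, here is a minimal sketch of the interposition mechanism an LD_PRELOAD shim like the filter lib relies on: hooking gettimeofday/clock_gettime and, in this case, just counting the calls before forwarding them. The library name and build command are hypothetical, and this is not the actual #43 filter lib.)

    /* Hypothetical LD_PRELOAD shim: counts gettimeofday/clock_gettime calls
     * made by the process and forwards them to the real libc implementations.
     * Illustrative only; this is not rr's filter lib.
     * Assumed build: gcc -shared -fPIC -o libtimecount.so timecount.c -ldl */
    #define _GNU_SOURCE
    #include <dlfcn.h>
    #include <stdio.h>
    #include <time.h>

    static unsigned long gtod_calls, cgt_calls;

    /* <sys/time.h> is deliberately not included: the gettimeofday prototype
     * has varied across libc versions, and the arguments are only passed
     * through, never inspected. */
    int gettimeofday(void *tv, void *tz)
    {
        static int (*real)(void *, void *);
        if (!real)
            real = (int (*)(void *, void *))dlsym(RTLD_NEXT, "gettimeofday");
        ++gtod_calls;
        return real(tv, tz);
    }

    int clock_gettime(clockid_t clk, struct timespec *ts)
    {
        static int (*real)(clockid_t, struct timespec *);
        if (!real)
            real = (int (*)(clockid_t, struct timespec *))
                dlsym(RTLD_NEXT, "clock_gettime");
        ++cgt_calls;
        return real(clk, ts);
    }

    /* Report the totals when the library is torn down at process exit. */
    __attribute__((destructor)) static void report(void)
    {
        fprintf(stderr, "gettimeofday: %lu; clock_gettime: %lu\n",
                gtod_calls, cgt_calls);
    }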

@anoll · Contributor · commented May 15, 2013

Hi Chris,

Did you try smaller values for the thread-preemption limit? I think the current value is 1000000 retired conditional branches. A smaller value might increase the chance of hitting the bug.

Albert
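
(For reference, a minimal sketch of how a preemption budget like this can be armed with perf_event_open(2): a hardware counter that overflows after a fixed number of events, at which point the tracer would force a switch. rr counts retired conditional branches via a CPU-specific raw event; the generic branch-instructions event and the helper below are stand-ins, not rr's implementation.)

    /* Hypothetical sketch of the mechanism behind the preemption budget: a
     * hardware counter armed via perf_event_open(2) whose overflow, every
     * `period` events, marks the point where the tracer would preempt the
     * tracee.  The generic branch-instructions event is only a stand-in for
     * the CPU-specific retired-conditional-branch event rr actually counts;
     * 1000000 mirrors the default mentioned above. */
    #define _GNU_SOURCE
    #include <linux/perf_event.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <sys/syscall.h>
    #include <sys/types.h>
    #include <unistd.h>

    static int open_preempt_counter(pid_t tracee, uint64_t period)
    {
        struct perf_event_attr attr;
        memset(&attr, 0, sizeof(attr));
        attr.size = sizeof(attr);
        attr.type = PERF_TYPE_HARDWARE;
        attr.config = PERF_COUNT_HW_BRANCH_INSTRUCTIONS; /* stand-in event */
        attr.sample_period = period;  /* overflow after this many events */
        attr.exclude_kernel = 1;      /* only count the tracee's user code */
        attr.disabled = 1;

        /* perf_event_open has no libc wrapper; invoke the syscall directly. */
        int fd = (int)syscall(__NR_perf_event_open, &attr, tracee,
                              -1 /* any cpu */, -1 /* no group */, 0);
        if (fd < 0)
            perror("perf_event_open");
        return fd;
    }

    int main(void)
    {
        int fd = open_preempt_counter(getpid(), 1000000);
        if (fd < 0)
            return 1;
        ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

        /* ... the tracee runs; overflow is observable via poll() on fd or a
         * signal requested with fcntl(F_SETOWN)/F_SETSIG, which is where a
         * tracer would force a context switch ... */

        uint64_t count = 0;
        read(fd, &count, sizeof(count));   /* current event count */
        printf("branch events so far: %llu\n", (unsigned long long)count);
        close(fd);
        return 0;
    }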


@joneschrisg · Contributor Author

Hey Albert, yep, I tried the values 1 and INT_MAX and a few others in between. Unfortunately they didn't help the bug reproduce :/. That was pretty surprising. Still digging :).

@joneschrisg · Contributor Author

I wrote a library to start gathering information about the executions themselves, but the output is hard to interpret. So nothing substantial to report, but getting closer.

@joneschrisg · Contributor Author

Still not able to reproduce in the VM with the fix from #48. Maybe I'll have better luck on bare metal.

@joneschrisg · Contributor Author

Finally, some data. The monitoring lib is dumping context-switch stats on exit (and on receipt of usually-fatal signals). The vol item counts voluntary context switches, which happen at syscalls; nonvol counts the switches forced when a task exhausts its timeslice.
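
(A minimal sketch of the kind of shim that can produce numbers like these, assuming an LD_PRELOAD'd library: Linux exposes voluntary and involuntary context switches through getrusage(2). Illustrative only, not the actual monitoring lib; the output format is imitated loosely.)

    /* Hypothetical monitor shim: dump context-switch stats at process exit.
     * ru_nvcsw is voluntary ("vol") switches, ru_nivcsw involuntary
     * ("nonvol"), ru_majflt major page faults.  The real monitor lib also
     * hooks usually-fatal signals; that part is omitted here. */
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <sys/resource.h>
    #include <unistd.h>

    __attribute__((destructor)) static void dump_switch_stats(void)
    {
        struct rusage ru;
        if (getrusage(RUSAGE_SELF, &ru) == 0)
            fprintf(stderr,
                    "RRMON[%d]: vol: %ld; nonvol: %ld; hard-fault: %ld\n",
                    (int)getpid(), ru.ru_nvcsw, ru.ru_nivcsw, ru.ru_majflt);
    }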

Here's what I see running the testcase outside of rr, when it passes

pass:
    RRMON[6531][xpcshell]: vol: 10; nonvol: 124
    RRMON[6611][xpcshell]: vol: 10; nonvol: 124
    RRMON[6976][xpcshell]: vol: 10; nonvol: 125

I'd like to dump stats from when it fails too, but Gecko is stomping the signal handlers for my monitor lib.

Here's what things look like running under rr-without-filter-lib, which will soon be an obsolete configuration

pass:
    RRMON[6409][xpcshell]: vol: 14103; nonvol: 209
    RRMON[6451][xpcshell]: vol: 14097; nonvol: 216
    RRMON[6491][xpcshell]: vol: 14103; nonvol: 222

and with the filter lib

pass:
    RRMON[5392][xpcshell]: vol: 5120; nonvol: 134
    RRMON[5433][xpcshell]: vol: 5130; nonvol: 129
    RRMON[6316][xpcshell]: vol: 5115; nonvol: 125

And for comparison, results from ptrace

pass:
    RRMON[7025][xpcshell]: vol: 14086; nonvol: 107
    RRMON[7065][xpcshell]: vol: 14096; nonvol: 108
    RRMON[7104][xpcshell]: vol: 14094; nonvol: 111

This doesn't give any information about the actual schedule itself, but shows some interesting things

  1. The filter lib removes a bunch of switching overhead, as expected. Unexpectedly, though, it also cuts down on nonvol switches, meaning straight-line code is running faster. I can't really explain that except maybe by cache effects.
  2. The trace with the filter lib looks more like the actual execution, except there are 500x (!!) more vol switches. I can't explain that yet either, but it seems like the next place to look. One candidate is overflowing scratch buffers a lot.
  3. strace's execution looks a lot like rr's without the filter lib, so there's hope rr can have lower overhead on the tracee side than strace. The fewer nonvol switches compared to rr-without-filter-lib might be explained by strace being friendlier to the cache.

@joneschrisg · Contributor Author

With a better measuring mechanism, I get similar but slightly different data. Each listing is representative

no-rr
  vol: 61; nonvol: 142; sig: 0; hard-fault: 0
ptrace
  vol: 14140; nonvol: 137; sig: 0; hard-fault: 0
rr-no-filter
  vol: 15933; nonvol: 257; sig: 0; hard-fault: 0
rr-10000 (default)
  vol: 6538; nonvol: 146; sig: 0; hard-fault: 0
rr-1
  vol: 7105; nonvol: 116; sig: 0; hard-fault: 0
rr-(1<<30)
  vol: 6568; nonvol: 151; sig: 0; hard-fault: 0
rr-1000
  vol: 6570; nonvol: 141; sig: 0; hard-fault: 0
rr-100000
  vol: 6598; nonvol: 148; sig: 0; hard-fault: 0

The rr-[n] measurements are done with MAX_SWITCH_COUNTER set to n. So it appears that the number of preemptions we can force between syscalls during this test run is pretty low.

@joneschrisg · Contributor Author

Aha! In my VM, the RBC is always returning 0, but that can't possibly be correct. That means that preemption is effectively disabled.

@joneschrisg · Contributor Author

On bare metal, the RBC (retired-branch counter) is working, but the execution profile isn't significantly different. So I expect there's a very frequent syscall that we're not filtering. A likely candidate is write, since this test is very spew-happy.

@joneschrisg · Contributor Author

Aha! In my VM, the RBC is always returning 0, but that can't possibly be correct.

(7f20946 :} )

@joneschrisg · Contributor Author

Top unfiltered syscalls on this testcase, summed over 5 runs

syscall count
 224      2644  # gettid [Ed: mostly buffer flushes]
 192      2160  # mmap2
 240      2006  # futex
 195      1640  # stat64
 125      1560  # mprotect
 197       970  # fstat64
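
(For reference, a hedged sketch of one way such a histogram can be gathered: a minimal ptrace tracer that stops the tracee at every syscall entry and tallies the syscall number. This isn't necessarily how these counts were collected, and the table above uses 32-bit i386 syscall numbers, while the sketch below reads the x86-64 orig_rax.)

    /* Hypothetical x86-64 sketch: run a command under ptrace, stop at every
     * syscall entry, and print a histogram of syscall numbers on exit. */
    #include <signal.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/ptrace.h>
    #include <sys/user.h>
    #include <sys/wait.h>
    #include <unistd.h>

    #define MAX_SYSCALL 512

    int main(int argc, char **argv)
    {
        static long counts[MAX_SYSCALL];
        if (argc < 2) {
            fprintf(stderr, "usage: %s command [args...]\n", argv[0]);
            return 1;
        }

        pid_t child = fork();
        if (child == 0) {
            ptrace(PTRACE_TRACEME, 0, NULL, NULL);
            execvp(argv[1], argv + 1);
            _exit(127);
        }

        int status;
        waitpid(child, &status, 0);  /* initial stop after execvp */
        /* Mark syscall stops with SIGTRAP|0x80 so they can be told apart
         * from ordinary signal stops. */
        ptrace(PTRACE_SETOPTIONS, child, NULL,
               (void *)(long)PTRACE_O_TRACESYSGOOD);

        int entering = 1;
        for (;;) {
            ptrace(PTRACE_SYSCALL, child, NULL, NULL);
            if (waitpid(child, &status, 0) < 0 ||
                WIFEXITED(status) || WIFSIGNALED(status))
                break;
            if (WIFSTOPPED(status) && WSTOPSIG(status) == (SIGTRAP | 0x80)) {
                if (entering) {  /* count each syscall once, at entry */
                    struct user_regs_struct regs;
                    ptrace(PTRACE_GETREGS, child, NULL, &regs);
                    if (regs.orig_rax < MAX_SYSCALL)
                        ++counts[regs.orig_rax];
                }
                entering = !entering;
            }
            /* Other stops (e.g. signal delivery) are simply resumed; a real
             * tracer would re-inject the pending signal. */
        }

        printf("syscall count\n");
        for (int i = 0; i < MAX_SYSCALL; ++i)
            if (counts[i])
                printf("%4d %8ld\n", i, counts[i]);
        return 0;
    }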

@joneschrisg · Contributor Author

(Just to satisfy my own pedantry, here's a histogram without filtering for one run

syscall count
  78     11250  # gettimeofday
 265       480  # clock_gettime
 192       410  # mmap2
 240       386  # futex
 195       320  # stat64
 125       306  # mprotect
 197       192  # fstat64

So we definitely chose the right syscalls to filter first.)

@joneschrisg · Contributor Author

Beyond filtering may-block syscalls, there doesn't look to be much win here. mprotect was pretty high up in the list, and it's trivial to wrap, but doing so didn't make much difference:

 125       306  # mprotect
----- filter -----
 125       294  # mprotect

The vast majority of the calls must be coming from ld/libc, which isn't too surprising.

Similarly, I tried un-filtering futex and got similar results:

 240       386  # futex
----- un-filter -----
 240       406  # futex

This one is a bit surprising. libc uses futexes to protect some internal state, and those might invoke the syscall directly. But if that were the case, I would expect libc to enter the kernel through the same interface as for the "real" futexes. Or it's possible that ld is doing a crazy number of futex operations on startup. (Which reminds me: the futex() wrapper doesn't do blocking correctly. I need to turn that off.)

I'm going to see if perf sched sheds any light.

@joneschrisg · Contributor Author

OK, perf sched is pretty hot :D. I recorded traces from a failed run not-under-rr and a passing run under rr. There's a huge amount of data, so it will take longer than I have left today to analyze. But very exciting.

@joneschrisg · Contributor Author

I haven't been updating this bug, not because I haven't been working on it, but because I haven't figured it out yet. Early this week I wrote a little tool to visualize perf schedules. That's helped a lot, but I'm still not there. From what I've seen:

  • the failing thread is the 9th xpcshell thread. That thread isn't created on the passing runs I've seen
  • xpcshell eats a lot more CPU when running under rr. This is somewhat unexpected.
  • rr uses quite a bit of CPU (not unexpected), and contends heavily with perf.
  • we may be able to deduce the problem(s) with rr's scheduling, but it's also looking increasingly likely that the important difference is simply that rr adds too much overhead. In that case, it might be better to implement wrapped blocking syscalls and "guess and check" rather than keep sinking time into analysis, since we want those anyway.

Anyway, atm I'm enjoying being "distracted" by gdb ;).

@joneschrisg · Contributor Author

  • rr uses quite a bit of CPU (not unexpected), and contends heavily with perf.

It turns out that rr will often busy-wait, which might explain some of the fairly surprising CPU usage. This in turn might be affecting recording fidelity, but that's a bit more of a stretch.

@joneschrisg · Contributor Author

The Gecko bug is closed now, and we've recorded other oranges, so I'm going to let this one go.
