
Mozilla bug 845190 isn't reproducing in rr #40

Closed · joneschrisg opened this issue May 14, 2013 · 20 comments

@joneschrisg · Contributor

The bug reproduces seemingly every time outside of rr; I haven't hit it in rr after hundreds of runs. I think that rr may be changing execution semantics. There's an error message printed from rr runs that I don't think I see sans rr. But unlucky scheduling is always possible.

@ghost assigned joneschrisg May 14, 2013
@joneschrisg · Contributor Author

I think that rr may be changing execution semantics. There's an error message printed from rr runs that I don't think I see sans rr.

Pretty sure now that this isn't happening. I let the automation run on bare metal for 850 runs and didn't repro, but when I ran the test outside of rr on the same machine, it only repro'd 1/10 times, and I saw the same symptoms as in the VM when it didn't repro. So almost certainly a scheduling issue or something in that neighborhood.

@joneschrisg · Contributor Author

Some more data

  • taskset 0x1 vs taskset 0x3 doesn't seem to make a difference; bare metal is at ~700 runs with the latter and no failures
  • running the test under a no-op wrapper still reproduces the bug. So some scheduler quirk around terminals or process hierarchies doesn't seem to be "it".
  • the bug doesn't reproduce under strace either

Skimming the actual bug, it sounds like some code is running on an off-main thread too late after shutdown. So it might be that the scheduling under ptrace is actually too fair: rr isn't letting the buggy thread fall far enough behind to exhibit the bug (or, respectively, the main thread get far enough ahead).

@joneschrisg · Contributor Author

That might also explain why the bug reproduces much less frequently on bare metal with many cores: the threads can actually run in parallel.

@joneschrisg · Contributor Author

Whole-system lock-ups are getting annoying.

I saw a whole passel of gettimeofday/clock_gettime calls under strace for this test, so I wondered if those calls were disturbing things too much. I finished #43 to test this out, but the results were

  • a seemingly-related error reproduces more frequently, but not the orange itself
  • recording firefox is now broken under the filter lib :/
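
(For illustration, here is a minimal sketch of the interposition mechanism an LD_PRELOAD shim like the filter lib relies on: hooking gettimeofday/clock_gettime and, in this case, just counting the calls before forwarding them. The library name and build command are hypothetical, and this is not the actual #43 filter lib.)

    /* Hypothetical LD_PRELOAD shim: counts gettimeofday/clock_gettime calls
     * made by the process and forwards them to the real libc implementations.
     * Illustrative only; this is not rr's filter lib.
     * Assumed build: gcc -shared -fPIC -o libtimecount.so timecount.c -ldl */
    #define _GNU_SOURCE
    #include <dlfcn.h>
    #include <stdio.h>
    #include <time.h>

    static unsigned long gtod_calls, cgt_calls;

    /* <sys/time.h> is deliberately not included: the gettimeofday prototype
     * has varied across libc versions, and the arguments are only passed
     * through, never inspected. */
    int gettimeofday(void *tv, void *tz)
    {
        static int (*real)(void *, void *);
        if (!real)
            real = (int (*)(void *, void *))dlsym(RTLD_NEXT, "gettimeofday");
        ++gtod_calls;
        return real(tv, tz);
    }

    int clock_gettime(clockid_t clk, struct timespec *ts)
    {
        static int (*real)(clockid_t, struct timespec *);
        if (!real)
            real = (int (*)(clockid_t, struct timespec *))
                dlsym(RTLD_NEXT, "clock_gettime");
        ++cgt_calls;
        return real(clk, ts);
    }

    /* Report the totals when the library is torn down at process exit. */
    __attribute__((destructor)) static void report(void)
    {
        fprintf(stderr, "gettimeofday: %lu; clock_gettime: %lu\n",
                gtod_calls, cgt_calls);
    }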

@anoll · Contributor · commented May 15, 2013

Hi Chris,

Did you try smaller values for the thread-preemption limit? I think the current value is 1000000 retired conditional branches. A smaller value might increase the chance of hitting the bug.

Albert
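
(For reference, a minimal sketch of how a preemption budget like this can be armed with perf_event_open(2): a hardware counter that overflows after a fixed number of events, at which point the tracer would force a switch. rr counts retired conditional branches via a CPU-specific raw event; the generic branch-instructions event and the helper below are stand-ins, not rr's implementation.)

    /* Hypothetical sketch of the mechanism behind the preemption budget: a
     * hardware counter armed via perf_event_open(2) whose overflow, every
     * `period` events, marks the point where the tracer would preempt the
     * tracee.  The generic branch-instructions event is only a stand-in for
     * the CPU-specific retired-conditional-branch event rr actually counts;
     * 1000000 mirrors the default mentioned above. */
    #define _GNU_SOURCE
    #include <linux/perf_event.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <sys/syscall.h>
    #include <sys/types.h>
    #include <unistd.h>

    static int open_preempt_counter(pid_t tracee, uint64_t period)
    {
        struct perf_event_attr attr;
        memset(&attr, 0, sizeof(attr));
        attr.size = sizeof(attr);
        attr.type = PERF_TYPE_HARDWARE;
        attr.config = PERF_COUNT_HW_BRANCH_INSTRUCTIONS; /* stand-in event */
        attr.sample_period = period;  /* overflow after this many events */
        attr.exclude_kernel = 1;      /* only count the tracee's user code */
        attr.disabled = 1;

        /* perf_event_open has no libc wrapper; invoke the syscall directly. */
        int fd = (int)syscall(__NR_perf_event_open, &attr, tracee,
                              -1 /* any cpu */, -1 /* no group */, 0);
        if (fd < 0)
            perror("perf_event_open");
        return fd;
    }

    int main(void)
    {
        int fd = open_preempt_counter(getpid(), 1000000);
        if (fd < 0)
            return 1;
        ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

        /* ... the tracee runs; overflow is observable via poll() on fd or a
         * signal requested with fcntl(F_SETOWN)/F_SETSIG, which is where a
         * tracer would force a context switch ... */

        uint64_t count = 0;
        read(fd, &count, sizeof(count));   /* current event count */
        printf("branch events so far: %llu\n", (unsigned long long)count);
        close(fd);
        return 0;
    }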


@joneschrisg · Contributor Author

Hey Albert, yep, I tried the values 1 and INT_MAX and a few others in between. Unfortunately they didn't help the bug reproduce :/. That was pretty surprising. Still digging :).

@joneschrisg · Contributor Author

I wrote a library to start gathering information about the executions themselves, but the output is hard to interpret. So nothing substantial to report, but getting closer.

@joneschrisg · Contributor Author

Still not able to reproduce in the VM with the fix from #48. Maybe I'll have better luck on bare metal.

@joneschrisg · Contributor Author

Finally, some data. The monitoring lib is dumping context-switch stats on exit (and on receipt of usually-fatal signals). The vol item counts voluntary context switches, which happen at syscalls; nonvol counts the switches forced when a task exhausts its timeslice.
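
(A minimal sketch of the kind of shim that can produce numbers like these, assuming an LD_PRELOAD'd library: Linux exposes voluntary and involuntary context switches through getrusage(2). Illustrative only, not the actual monitoring lib; the output format is imitated loosely.)

    /* Hypothetical monitor shim: dump context-switch stats at process exit.
     * ru_nvcsw is voluntary ("vol") switches, ru_nivcsw involuntary
     * ("nonvol"), ru_majflt major page faults.  The real monitor lib also
     * hooks usually-fatal signals; that part is omitted here. */
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <sys/resource.h>
    #include <unistd.h>

    __attribute__((destructor)) static void dump_switch_stats(void)
    {
        struct rusage ru;
        if (getrusage(RUSAGE_SELF, &ru) == 0)
            fprintf(stderr,
                    "RRMON[%d]: vol: %ld; nonvol: %ld; hard-fault: %ld\n",
                    (int)getpid(), ru.ru_nvcsw, ru.ru_nivcsw, ru.ru_majflt);
    }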

Here's what I see running the testcase outside of rr, when it passes

pass:
    RRMON[6531][xpcshell]: vol: 10; nonvol: 124
    RRMON[6611][xpcshell]: vol: 10; nonvol: 124
    RRMON[6976][xpcshell]: vol: 10; nonvol: 125

I'd like to dump stats from when it fails too, but Gecko is stomping the signal handlers for my monitor lib.

Here's what things look like running under rr-without-filter-lib, which will soon be an obsolete configuration

pass:
    RRMON[6409][xpcshell]: vol: 14103; nonvol: 209
    RRMON[6451][xpcshell]: vol: 14097; nonvol: 216
    RRMON[6491][xpcshell]: vol: 14103; nonvol: 222

and with the filter lib

pass:
    RRMON[5392][xpcshell]: vol: 5120; nonvol: 134
    RRMON[5433][xpcshell]: vol: 5130; nonvol: 129
    RRMON[6316][xpcshell]: vol: 5115; nonvol: 125

And for comparison, results from ptrace

pass:
    RRMON[7025][xpcshell]: vol: 14086; nonvol: 107
    RRMON[7065][xpcshell]: vol: 14096; nonvol: 108
    RRMON[7104][xpcshell]: vol: 14094; nonvol: 111

This doesn't give any information about the actual schedule itself, but shows some interesting things

  1. The filter lib removes a bunch of switching overhead, as expected. Unexpectedly, though, it also cuts down on nonvol switches, meaning straight-line code is running faster. I can't really explain that except maybe by cache effects.
  2. The trace with the filter lib looks more like the actual execution, except there are 500x (!!) more vol switches. I can't explain that yet either, but it seems like the next place to look. One candidate is overflowing scratch buffers a lot.
  3. strace's execution looks a lot like rr's without the filter lib, so there's hope rr can have lower overhead on the tracee side than strace. The fewer nonvol switches compared to rr-without-filter-lib might be explained by strace being friendlier to the cache.

@joneschrisg · Contributor Author

With a better measuring mechanism, I get similar but slightly different data. Each listing is representative

no-rr
  vol: 61; nonvol: 142; sig: 0; hard-fault: 0
ptrace
  vol: 14140; nonvol: 137; sig: 0; hard-fault: 0
rr-no-filter
  vol: 15933; nonvol: 257; sig: 0; hard-fault: 0
rr-10000 (default)
  vol: 6538; nonvol: 146; sig: 0; hard-fault: 0
rr-1
  vol: 7105; nonvol: 116; sig: 0; hard-fault: 0
rr-(1<<30)
  vol: 6568; nonvol: 151; sig: 0; hard-fault: 0
rr-1000
  vol: 6570; nonvol: 141; sig: 0; hard-fault: 0
rr-100000
  vol: 6598; nonvol: 148; sig: 0; hard-fault: 0

The rr-[n] measurements are done with MAX_SWITCH_COUNTER set to n. So it appears that the number of preemptions we can force between syscalls during this test run is pretty low.

@joneschrisg · Contributor Author

Aha! In my VM, the RBC is always returning 0, but that can't possibly be correct. That means that preemption is effectively disabled.

@joneschrisg · Contributor Author

On bare metal, the RBC (retired-branch counter) is working, but the execution profile isn't significantly different. So I expect there's a very frequent syscall that we're not filtering. A likely candidate is write, since this test is very spew-happy.

@joneschrisg · Contributor Author

Aha! In my VM, the RBC is always returning 0, but that can't possibly be correct.

(7f20946 :} )

@joneschrisg · Contributor Author

Top unfiltered syscalls on this testcase, summed over 5 runs

syscall count
 224      2644  # gettid [Ed: mostly buffer flushes]
 192      2160  # mmap2
 240      2006  # futex
 195      1640  # stat64
 125      1560  # mprotect
 197       970  # fstat64
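
(For reference, a hedged sketch of one way such a histogram can be gathered: a minimal ptrace tracer that stops the tracee at every syscall entry and tallies the syscall number. This isn't necessarily how these counts were collected, and the table above uses 32-bit i386 syscall numbers, while the sketch below reads the x86-64 orig_rax.)

    /* Hypothetical x86-64 sketch: run a command under ptrace, stop at every
     * syscall entry, and print a histogram of syscall numbers on exit. */
    #include <signal.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/ptrace.h>
    #include <sys/user.h>
    #include <sys/wait.h>
    #include <unistd.h>

    #define MAX_SYSCALL 512

    int main(int argc, char **argv)
    {
        static long counts[MAX_SYSCALL];
        if (argc < 2) {
            fprintf(stderr, "usage: %s command [args...]\n", argv[0]);
            return 1;
        }

        pid_t child = fork();
        if (child == 0) {
            ptrace(PTRACE_TRACEME, 0, NULL, NULL);
            execvp(argv[1], argv + 1);
            _exit(127);
        }

        int status;
        waitpid(child, &status, 0);  /* initial stop after execvp */
        /* Mark syscall stops with SIGTRAP|0x80 so they can be told apart
         * from ordinary signal stops. */
        ptrace(PTRACE_SETOPTIONS, child, NULL,
               (void *)(long)PTRACE_O_TRACESYSGOOD);

        int entering = 1;
        for (;;) {
            ptrace(PTRACE_SYSCALL, child, NULL, NULL);
            if (waitpid(child, &status, 0) < 0 ||
                WIFEXITED(status) || WIFSIGNALED(status))
                break;
            if (WIFSTOPPED(status) && WSTOPSIG(status) == (SIGTRAP | 0x80)) {
                if (entering) {  /* count each syscall once, at entry */
                    struct user_regs_struct regs;
                    ptrace(PTRACE_GETREGS, child, NULL, &regs);
                    if (regs.orig_rax < MAX_SYSCALL)
                        ++counts[regs.orig_rax];
                }
                entering = !entering;
            }
            /* Other stops (e.g. signal delivery) are simply resumed; a real
             * tracer would re-inject the pending signal. */
        }

        printf("syscall count\n");
        for (int i = 0; i < MAX_SYSCALL; ++i)
            if (counts[i])
                printf("%4d %8ld\n", i, counts[i]);
        return 0;
    }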

@joneschrisg · Contributor Author

(Just to satisfy my own pedantry, here's a histogram without filtering for one run

syscall count
  78     11250  # gettimeofday
 265       480  # clock_gettime
 192       410  # mmap2
 240       386  # futex
 195       320  # stat64
 125       306  # mprotect
 197       192  # fstat64

So we definitely chose the right syscalls to filter first.)

@joneschrisg · Contributor Author

Beyond filtering may-block syscalls, there doesn't look to be much win here. mprotect was pretty high up in the list, and it's trivial to wrap, but doing so didn't make much difference:

 125       306  # mprotect
----- filter -----
 125       294  # mprotect

The vast majority of the calls must be coming from ld/libc, which isn't too surprising.

Similarly, I tried un-filtering futex and got similar results:

 240       386  # futex
----- un-filter -----
 240       406  # futex

This one is a bit surprising. libc uses futexes to protect some internal state, and those might invoke the syscall directly. But if that were the case, I would expect libc to enter the kernel through the same interface as for the "real" futexes. Or it's possible that ld is doing a crazy number of futex operations on startup. (Which reminds me: the futex() wrapper doesn't do blocking correctly. I need to turn that off.)

I'm going to see if perf sched sheds any light.

@joneschrisg · Contributor Author

OK, perf sched is pretty hot :D. I recorded traces from a failed run not-under-rr and a passing run under rr. There's a huge amount of data, so it will take longer than I have left today to analyze. But very exciting.

@joneschrisg · Contributor Author

I haven't been updating this bug, not because I haven't been working on it, but because I haven't figured it out yet. Early this week I wrote a little tool to visualize perf schedules. That's helped a lot, but I'm still not there. From what I've seen:

  • the failing thread is the 9th xpcshell thread. That thread isn't created on the passing runs I've seen
  • xpcshell eats a lot more CPU when running under rr. This is somewhat unexpected.
  • rr uses quite a bit of CPU (not unexpected), and contends heavily with perf.
  • we may be able to deduce the problem(s) with rr's scheduling, but it's also looking increasingly likely that the important difference is simply that rr adds too much overhead. In that case, it might be better to implement wrapped blocking syscalls and "guess and check" rather than keep sinking time into analysis, since we want those anyway.

Anyway, atm I'm enjoying being "distracted" by gdb ;).

@joneschrisg · Contributor Author

  • rr uses quite a bit of CPU (not unexpected), and contends heavily with perf.

It turns out that rr will often busy-wait, which might explain some of the fairly surprising CPU usage. This in turn might be affecting recording fidelity, but that's a bit more of a stretch.

@joneschrisg · Contributor Author

The Gecko bug is closed now, and we've recorded other oranges, so I'm going to let this one go.
