Get rr working with valgrind (or vice versa), as far as possible #16

joneschrisg · 2013-05-08T19:43:20Z

It appears that valgrind identifies itself as Merom, which apparently isn't supported yet. I'm testing with distro-supplied valgrind 3.7.0, so things may be different in later valgrind. If not, then we can either add Merom support to rr, or add later support for a newer architecture to valgrind.

The question of how well rr+valgrind could work is also interesting. Needs some thought. But even basic checking allowed me to diagnose #15.

joneschrisg · 2014-01-07T02:00:17Z

As of valgrind 3.9.0, the valgrind CPU still reports itself as merom. The release notes claim to support AVX2 now, which is way up in haswell land. But crucially, AVX2 is only supported for x86-64. Sigh. The other sandy bridge \ merom features may be gated on x86-64 too.

But we have another option: merom has a deterministic store-insn counter according to this paper. It would be pretty straightforward to add that support, but not sure it's worth it yet.

joneschrisg · 2014-01-07T02:50:06Z

> The other sandy bridge \ merom features may be gated on x86-64 too.

Valgrind reports itself as sandy bridge on x86-64 machines that have AVX, but not on x86. Looks like x86 support is falling behind.
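For readers unfamiliar with how a CPU "identifies itself", here is a minimal sketch of decoding the processor signature that CPUID leaf 1 returns in EAX. The function names and the small model table are illustrative assumptions, not rr's actual detection code; the table covers only the microarchitectures named in this thread.

```python
# Illustrative subset of Intel family-6 model numbers (not rr's real table).
INTEL_FAMILY6_MODELS = {
    0x0F: "Merom",
    0x2A: "Sandy Bridge",
    0x3A: "Ivy Bridge",
    0x3C: "Haswell",
}


def decode_signature(eax):
    """Return (family, model, stepping) from a CPUID.1:EAX value."""
    stepping = eax & 0xF
    model = (eax >> 4) & 0xF
    family = (eax >> 8) & 0xF
    ext_model = (eax >> 16) & 0xF
    ext_family = (eax >> 20) & 0xFF
    # The extended model field only applies to families 6 and 15.
    if family in (0x6, 0xF):
        model |= ext_model << 4
    # The extended family field only applies to family 15.
    if family == 0xF:
        family += ext_family
    return family, model, stepping


def microarch_name(eax):
    """Map a CPUID.1:EAX signature to a name from the table above."""
    family, model, _ = decode_signature(eax)
    if family != 6:
        return "unknown"
    return INTEL_FAMILY6_MODELS.get(model, "unknown")
```

A tool that keys its behavior on this value sees whatever the emulated CPUID returns, which is why an emulator reporting a Merom signature on newer hardware causes a mismatch.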

joneschrisg · 2014-01-07T19:20:07Z

#606 is a way to get valgrind working.

joneschrisg · 2014-01-08T02:44:28Z

On second thought, rr's "syscall injection" is technically self-modifying code, which voids valgrind's warranty. But I somewhat suspect (hope) that things will just work, and indeed that'll be a good test of whether rr is cleaning up after itself properly in tracee tasks.

joneschrisg · 2014-01-08T03:35:37Z

Yet another problem is that valgrind-instrumented code longjmp's out of some signal handlers, which rr doesn't know how to deal with.

joneschrisg · 2014-08-05T21:23:38Z

#1262 is YET ANOTHER annoying bug caused by uninitialized C++ object fields. We deal with enough really nasty bugs in rr that this class is just insulting! Will probably take a bit of time soon to retry valgrind'ing rr.

CC @rocallahan

rocallahan · 2014-08-05T21:27:44Z

This bug is about running valgrind tools on tracee code, right? Whereas for uninitialized fields in rr itself, you'd just need to run valgrind on rr itself, which should just work, I'd have thought...

joneschrisg · 2014-08-05T21:52:12Z

I'd like to have both. I agree it's worth treating those problems separately.

valgrind doesn't work on even just rr out of the box, which is what I care about more, because per discussion above valgrind's emulated CPU identifies itself as Merom and rr barfs. ISTR force-overriding the detected microarch, but something broke. I'd like to try again.

Now that I think about it, for the rr-only case, we can have it launch a subprocess to make a non-emulated cpuid call, and then users wouldn't even have to force their microarch. (Sorry Julian!)
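A rough sketch of the "ask a child process" idea follows. Python cannot execute CPUID directly, so this swaps in a different technique: the child simply prints /proc/cpuinfo, on the assumption that the kernel-provided file is passed through unmodified and that the child is not itself run under the emulator. The helper names are hypothetical.

```python
import subprocess


def parse_cpuinfo(text):
    """Return (family, model) of the first processor in x86 cpuinfo text."""
    fields = {}
    for line in text.splitlines():
        if not line.strip():
            if fields:
                break  # a blank line ends the first processor block
            continue
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return int(fields["cpu family"]), int(fields["model"])


def native_family_model():
    """Read the host CPU identity via a child process (hypothetical helper)."""
    out = subprocess.run(
        ["cat", "/proc/cpuinfo"], capture_output=True, text=True, check=True
    ).stdout
    return parse_cpuinfo(out)
```

The parser is kept separate from the subprocess call so it can be checked against canned text without depending on the host CPU.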

joneschrisg · 2014-08-06T21:12:33Z

> ISTR force-overriding the detected microarch, but something broke.

The reason is that libpfm does its own CPUIDs to decide on the encoding of the raw microarch-specific events that rr chooses based on its own CPUID. This isn't how abstractions are supposed to work. So we'll need #974 to get libpfm out of the way before we can work around the CPUID emulation.

An interesting question is what valgrind does with perf_event_open attributes. If the syscall is passed through, then clients can observe CPU emulation by passing raw events that don't match the emulated microarch. But I doubt anyone cares.
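To make "raw microarch-specific events" concrete, here is a small sketch of how an Intel raw event code is laid out, following the IA32_PERFEVTSELx register format (event select in bits 0-7, unit mask in bits 8-15, control flags above that). The function name is an assumption for illustration, and the example event in the usage note is not claimed to be the one rr selects for any given CPU.

```python
# Control bits of the IA32_PERFEVTSELx layout.
USR = 1 << 16  # count while in user mode
OS = 1 << 17   # count while in kernel mode
INT = 1 << 20  # raise an interrupt on counter overflow
EN = 1 << 22   # enable the counter


def raw_event(event_select, umask, flags=0):
    """Pack an event select and unit mask into a raw event code."""
    if not 0 <= event_select <= 0xFF:
        raise ValueError("event_select must fit in 8 bits")
    if not 0 <= umask <= 0xFF:
        raise ValueError("umask must fit in 8 bits")
    return event_select | (umask << 8) | flags
```

For example, `raw_event(0xC4, 0x01, USR | INT | EN)` packs event select 0xC4 with unit mask 0x01. The same number may select a different event on another microarchitecture, which is why the choice has to agree with the CPU that actually runs the counter rather than the one an emulator reports.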

joneschrisg · 2014-08-07T23:02:42Z

After #1274, here's the status:

valgrind still implements only up to Merom for 32-bit processes, as of valgrind r14237. The workaround is to run rr as rr -a 'ivy bridge' (replacing ivy bridge with your CPU's uarch).
Early in startup, valgrind makes an unexpected write when rr is expecting the write(-1) to check the rcb_cntr. I'm pretty sure this is valgrind warning us about writing to fd -1!! Helpful, but annoying in this case. Run rr as rr -a 'ivy bridge' -f to ignore that unexpected write.
Soon after that, rr aborts because valgrind doesn't implement kcmp. Requires a valgrind fix.

So with valgrind patched to support kcmp, I think we're good to go on the tracer-side checking, which is the more useful of the two.

joneschrisg · 2014-08-07T23:09:27Z

Filed a valgrind bug for kcmp support.

joneschrisg · 2014-09-04T13:50:40Z

The valgrind patch landed. The last to-do item is exec'ing tracees more quickly, in order to "detach" valgrind. Currently in replay, valgrind observes tid values change when the rr fork child switches from "real" execution mode to using emulated data from the trace. I think the easiest thing to do might be to add a "hidden" rr command, rr __launch /foo --args or something like that, which we can exec immediately after the fork and that prepares the child process for replay.
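The fork-then-exec-immediately pattern described above can be sketched as follows. This is only an illustration of the control flow, assuming a POSIX system; `launch_via_exec` is a hypothetical name, and the real "hidden" rr command proposed here would do the replay preparation after the exec.

```python
import os
import sys


def launch_via_exec(argv):
    """Fork, exec argv in the child right away, and return its exit code."""
    pid = os.fork()
    if pid == 0:
        # Child: replace the process image before doing anything else.
        try:
            os.execvp(argv[0], argv)
        finally:
            # Only reached if exec failed; never return into the parent's code.
            os._exit(127)
    _, status = os.waitpid(pid, 0)
    if os.WIFEXITED(status):
        return os.WEXITSTATUS(status)
    return -os.WTERMSIG(status)
```

Because valgrind does not follow children across exec unless asked to (its --trace-children option defaults to no), an exec this early means the rest of the child's setup runs natively.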

joneschrisg · 2015-01-10T00:23:41Z

The problem referenced above may still exist, but we have another problem to deal with: the image after valgrind starts running rr doesn't have a vdso image mapped. I assume this is because valgrind's dynamic linker resolves references to vdso symbols to valgrind helpers. To work around this, we could include code to make traced syscalls in the rr image itself, at a known offset in the binary. Then, if we don't find the vdso, we can use the rr code. But that's pretty tricky and a bit delicate, so I don't think it's worth it at this point.
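The "if we don't find the vdso" check could look roughly like the sketch below, which scans /proc/<pid>/maps-style text for a named mapping. The function names are assumptions for illustration; rr's own address-space tracking is more involved than this.

```python
def find_mapping(maps_text, name):
    """Return (start, end) of the first mapping named `name`, or None."""
    for line in maps_text.splitlines():
        # Fields: address range, perms, offset, device, inode, pathname.
        parts = line.split(None, 5)
        if len(parts) == 6 and parts[5].strip() == name:
            start, end = (int(x, 16) for x in parts[0].split("-"))
            return start, end
    return None


def read_self_maps():
    """Read this process's own memory map (Linux only)."""
    with open("/proc/self/maps") as f:
        return f.read()
```

A caller would try `find_mapping(read_self_maps(), "[vdso]")` and fall back to its own syscall stub when the result is `None`.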

Valgrind unmaps this for its own nefarious purposes (rr-debugger#16). However, it is simple enough to provide our own syscall instruction to use instead.

Keno · 2016-09-08T20:26:14Z

Status update after #1796:
valgrind rr record seems to work fine, but valgrind rr replay still has issues:

valgrind ./bin/rr replay -a
==6829== Memcheck, a memory error detector
==6829== Copyright (C) 2002-2015, and GNU GPL'd, by Julian Seward et al.
==6829== Using Valgrind-3.12.0.SVN and LibVEX; rerun with -h for copyright info
==6829== Command: ./bin/rr replay -a
==6829==

rr: Warning: You appear to be running in a VMWare guest with a bug
    where a conditional branch instruction between two CPUID instructions
    sometimes fails to be counted by the conditional branch performance
    counter. Work around this problem by adding
        monitor_control.disable_hvsim_clusters = true
    to your .vmx file.

--6829-- WARNING: unhandled amd64-linux syscall: 446
--6829-- You may be able to write your own handler.
--6829-- Read the file README_MISSING_SYSCALL_OR_IOCTL.
--6829-- Nevertheless we consider this a bug.  Please report
--6829-- it at http://valgrind.org/support/bug_reports.html.
[FATAL /home/keno/rr-vanilla/src/ReplaySession.cc:454:check_pending_sig() errno: SUCCESS]
 (task 6830 (rec:6820) at time 1310)
 -> Assertion `0 < t->stop_sig()' failed to hold. Replaying `SCHED': expecting tracee signal or trap, but instead at `open' (ticks: 145293624)
Launch gdb with
  gdb '-l' '-1' '-ex' 'target extended-remote :6830' /home/keno/.local/share/rr/latest-trace/mmap_hardlink_2_julia

VelorumS · 2017-06-16T18:29:56Z

Do I understand correctly that running valgrind rr replay -a does the same checks as running valgrind on the application itself?

Basically, this use case is a massive speed boost for valgrind?

rocallahan · 2017-06-16T21:12:27Z

No. If you do that, Valgrind will not check the application.

It is possible to implement binary instrumentation of the replay, but the approach has to be a bit different. We've actually implemented this in a closed-source branch. And yes, we could use this to implement a memcheck-during-replay tool. Unfortunately we have to hold onto this code until we figure out a business model to make this work sustainable.

clevcode · 2019-03-01T03:48:40Z

I'm very interested in being able to apply some type of binary instrumentation during the replay. Any plans on open-sourcing the branch with that, or anyone else interested in looking into reimplementing something like this with me?

dr-m · 2022-01-20T17:22:54Z

For what it is worth, rr works just fine with instrumented -fsanitize=… code. For example, you can continue to a heap-use-after-free report of AddressSanitizer, set a watchpoint on the ASAN shadow address, and reverse-continue to find where the memory was freed. For MemorySanitizer it is a little trickier, because the MSAN output does not mention shadow byte addresses. Origin tracking is your friend.
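For the watchpoint trick, the shadow address can be computed from the application address. This sketch assumes the default AddressSanitizer mapping on x86-64 Linux (shadow scale 3, shadow offset 0x7fff8000); other platforms or builds with a different mapping would need different constants. The function name is illustrative.

```python
SHADOW_SCALE = 3                        # one shadow byte covers 8 app bytes
SHADOW_OFFSET_X86_64_LINUX = 0x7FFF8000  # assumed default offset


def asan_shadow_address(addr, offset=SHADOW_OFFSET_X86_64_LINUX):
    """Return the address of the shadow byte covering `addr`."""
    return (addr >> SHADOW_SCALE) + offset


if __name__ == "__main__":
    # Print the shadow address for a sample heap address.
    print(hex(asan_shadow_address(0x602000000010)))
```

In a replay session you would then set a hardware watchpoint on that one byte (for example `watch *(char*)ADDRESS` in gdb) and reverse-continue until the shadow byte changes.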

I believe that the combination of ASAN and MSAN is equivalent or superior to Valgrind’s default memcheck tool, and debugging multi-threaded code is much easier than with valgrind --vgdb=yes.

MemorySanitizer does involve some additional effort, because all linked code (except libc) must be compiled with clang -fsanitize=memory, or otherwise you will get reports that any memory that was initialized by uninstrumented code is uninitialized. For C++ programs, you have to build and use an instrumented libc++ instead of libstdc++.

khuey · 2022-01-20T17:55:10Z

Yeah rr + the various sanitizers is pretty powerful, and I don't think there's a ton of need to ever deal with valgrind proper.

Time to let this issue die.

joneschrisg mentioned this issue Jan 8, 2014

Integrate chronicle-recorder #609

Open

This was referenced Aug 8, 2014

valgrind aborts after fork during replay #1276

Open

Initialize this field. #1277

Merged

benturner mentioned this issue Aug 27, 2014

replay error with ASAN #1294

Closed

joneschrisg mentioned this issue Jan 10, 2015

Tasks may not have an address space if in an inconsistent state. #1418

Merged

qiankehan mentioned this issue Nov 16, 2015

Unsupported sched_setparam #1592

Closed

Keno mentioned this issue Sep 8, 2016

Fix running valgrind on rr and issues identified it #1796

Merged

khuey mentioned this issue Oct 13, 2016

Firefox crashes under rr: "Bad system call" #1848

Closed

khuey mentioned this issue Oct 18, 2016

exit_with_syscallbuf_signal_32 fails reliably on kernel v4.9rc1 #1851

Closed

mjfroman mentioned this issue Nov 7, 2016

Assertion `!!(blocked & mask) == is_sig_blocked(sig)' failed to hold. SIGTRAP is not blocked #1879

Closed

dcolascione mentioned this issue Nov 28, 2016

Assertion failure with signal unblocking in Emacs #1912

Closed

tbsaunde mentioned this issue Apr 18, 2017

assertion failure at Registers.cc:422 with sched_setaffinity() #2020

Closed

tbsaunde mentioned this issue May 7, 2017

reverse-finish not behaving properly #2026

Closed

hotsphink mentioned this issue Jun 12, 2017

RecordSession.cc:400:handle_seccomp_trap Assertion `!t->delay_syscallbuf_reset' #2042

Closed

dcolascione mentioned this issue Feb 14, 2018

rr segfaults when running UML #2164

Open

stransky mentioned this issue Mar 27, 2020

Crashes on Fedora 32 on replay #2469

Closed

allstarschh mentioned this issue Sep 11, 2020

"./mach run --debugger rr" makes Firefox crashes in dbus_threads_init immediately #2654

Closed

khuey closed this as completed Jan 20, 2022
