mongodb-rr-experiment
Over the years, we at MongoDB have developed tooling within our correctness testing infrastructure
to make it easier to debug crashes (by collecting core dumps), hangs (by collecting thread stacks
and lock requests), and data corruption (by collecting data files). However, we have yet to develop a
good strategy for debugging race conditions. We still depend on an engineer either running the failed
test many times with additional logging or thinking hard about where in the code to add a sleep.
Technologies such as rr may help us form a better story for investigating race-related issues
without requiring an engineer to manually reproduce the failure.
Setup
git clone https://github.com/visemet/mongodb-rr-experiment.git
cd mongodb-rr-experiment
Building rr
The following instructions were adapted from https://github.com/mozilla/rr/wiki/Building-And-Installing.
sudo apt update
sudo apt install \
capnproto \
ccache \
clang \
cmake \
coreutils \
g++-multilib \
gdb \
git \
libcapnp-dev \
make \
manpages-dev \
ninja-build \
pkg-config \
python-pexpect \
python3-pexpect
git clone https://github.com/mozilla/rr.git
cd rr
git checkout 5.2.0
CC=clang CXX=clang++ cmake -B build/ -G Ninja -Ddisable32bit=ON .
cmake --build build/
sudo cmake --build build/ --target install
sudo sysctl kernel.perf_event_paranoid=1
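The sysctl change above matters because rr relies on hardware performance counters, and the kernel restricts access to them at higher `perf_event_paranoid` levels. A minimal sketch of a pre-flight check follows. The function names `rr_paranoid_ok` and `check_rr_prereqs` are hypothetical, and the threshold of 1 is simply the value set above, not a figure taken from rr's documentation.

```shell
# Hypothetical helper: succeeds when the given perf_event_paranoid level is
# at or below the value this README sets (1).
rr_paranoid_ok() {
    local level="$1"
    [ "$level" -le 1 ]
}

# Hypothetical helper: reads the current kernel setting and reports on it.
check_rr_prereqs() {
    local path=/proc/sys/kernel/perf_event_paranoid
    local level
    if [ ! -r "$path" ]; then
        echo "cannot read $path" >&2
        return 2
    fi
    level="$(cat "$path")"
    if rr_paranoid_ok "$level"; then
        echo "perf_event_paranoid=$level: ok"
    else
        echo "perf_event_paranoid=$level: run 'sudo sysctl kernel.perf_event_paranoid=1'" >&2
        return 1
    fi
}
```

Running `check_rr_prereqs` before recording gives a quick answer as to whether the setting is still in place. A value set with `sysctl` on the command line lasts only until the next reboot, so the check is worth repeating on a freshly booted machine.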
Building MongoDB
The following instructions were adapted from https://github.com/mongodb/mongo/wiki/Build-Mongodb-From-Source.
sudo apt install libcurl4-openssl-dev python-pip
git clone https://github.com/mongodb/mongo.git
cd mongo
git remote add visemet https://github.com/visemet/mongo.git
git fetch visemet mongodb-rr-experiment
git checkout visemet/mongodb-rr-experiment
python2 -m pip install -r etc/pip/dev-requirements.txt
python2 -m pip install --user psutil==5.4.8
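With rr and MongoDB both built, one way to hunt for an intermittent failure is to record a test repeatedly until a run fails and then replay that recording. The following is a minimal sketch under stated assumptions, not necessarily how the experiment branch invokes rr. The function name `record_until_failure` and the `RECORDER` variable are hypothetical, while `rr record --chaos` and `rr replay` are rr's own subcommands.

```shell
# Hypothetical helper: run a command under a recorder up to max_attempts
# times and stop at the first failing run.
# RECORDER defaults to rr's chaos mode; override it to change the flags.
record_until_failure() {
    local max_attempts="$1"
    local attempt=1
    shift
    while [ "$attempt" -le "$max_attempts" ]; do
        # Intentionally unquoted so that RECORDER splits into separate words.
        if ! ${RECORDER:-rr record --chaos} "$@"; then
            echo "attempt $attempt failed; inspect it with: rr replay"
            return 0
        fi
        attempt=$((attempt + 1))
    done
    echo "no failure in $max_attempts attempts"
    return 1
}
```

The helper would be called as `record_until_failure 50 <test command>`. Chaos mode randomizes scheduling decisions to make rare interleavings more likely, which is why it is the default here. By default `rr replay` opens the most recent recording in gdb, where `reverse-continue` runs backwards to the previous breakpoint or watchpoint.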
Results
You may notice when comparing the columns in the tables below that (1) there weren't any cases where
a failure could only be reproduced using rr, and (2) there were multiple cases where a failure
could only be reproduced manually. This shouldn't be interpreted as saying rr is ineffective. It
is still very likely that rr would save an engineer both time and effort when investigating a
build failure. The results simply demonstrate that it isn't possible to solely rely on rr as the
answer to investigating all race-related issues.
Single-process failures
Build failure | Reproduced using rr? | Reproduced manually? |
---|---|---|
BF-9810 | | |
BF-9958 | ✓ | ✓ |
BF-10742 | ✓ | ✓ |
BF-10932 | ✓ | ✓ |
Single server process failures
Build failure | Reproduced using rr? | Reproduced manually? |
---|---|---|
BF-6346 | | ✓ |
BF-8424 | ✓ | ✓ |
BF-9030 | | |
Multi server process failures
Build failure | Reproduced using rr? | Reproduced manually? |
---|---|---|
BF-7114 | | ✓ |
BF-7588 | ✓ | ✓ |
BF-7888 | | ✓ |
BF-8258 | | |
BF-8642 | ✓ | ✓ |
BF-9248 | | ✓ |
BF-9426 | | |
BF-9552 | ✓ | ✓ |
BF-9864 | | |
BF-10729 | ✓ | ✓ |
BF-11054 | ✓ | ✓ |