Experiment with rr to see if it can be used to boost the likelihood of reproducing race-related issues

mongodb-rr-experiment

Over the years, we at MongoDB have developed tooling within our correctness testing infrastructure to make it easier to debug crashes (by collecting core dumps), hangs (by collecting thread stacks and lock requests), and data corruption (by collecting data files). However, we have yet to develop an equally good strategy for debugging race conditions. We still depend on an engineer either to run the failed test many times with additional logging or to think really hard about where in the code to add a sleep. Technologies such as rr may help us form a better story for investigating race-related issues without requiring an engineer to reproduce the failure manually.

Setup

git clone https://github.com/visemet/mongodb-rr-experiment.git
cd mongodb-rr-experiment

Building rr

The following instructions were adapted from https://github.com/mozilla/rr/wiki/Building-And-Installing.

sudo apt update
sudo apt install     \
    capnproto        \
    ccache           \
    clang            \
    cmake            \
    coreutils        \
    g++-multilib     \
    gdb              \
    git              \
    libcapnp-dev     \
    make             \
    manpages-dev     \
    ninja-build      \
    pkg-config       \
    python-pexpect   \
    python3-pexpect
git clone https://github.com/mozilla/rr.git
cd rr
git checkout 5.2.0

CC=clang CXX=clang++ cmake -B build/ -G Ninja -Ddisable32bit=ON .
cmake --build build/

sudo cmake --build build/ --target install
sudo sysctl kernel.perf_event_paranoid=1
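
The sysctl setting above does not survive a reboot, so it can be worth checking before a recording session. The following is a minimal sketch of such a check, assuming rr needs kernel.perf_event_paranoid to be at most 1; the function name paranoid_level_ok is hypothetical and not part of rr or this repository.

```shell
#!/bin/sh
# Hypothetical helper (an assumption, not part of rr): succeeds when the
# given perf_event_paranoid level is at most 1, the value set above.
paranoid_level_ok() {
    [ "$1" -le 1 ]
}

# Report on the live kernel value when it is readable (Linux only).
if [ -r /proc/sys/kernel/perf_event_paranoid ]; then
    level="$(cat /proc/sys/kernel/perf_event_paranoid)"
    if paranoid_level_ok "$level"; then
        echo "perf_event_paranoid=${level}: ok for rr"
    else
        echo "perf_event_paranoid=${level}: too restrictive, rerun the sysctl command"
    fi
fi
```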

Building MongoDB

The following instructions were adapted from https://github.com/mongodb/mongo/wiki/Build-Mongodb-From-Source.

sudo apt install libcurl4-openssl-dev python-pip
git clone https://github.com/mongodb/mongo.git
cd mongo

git remote add visemet https://github.com/visemet/mongo.git
git fetch visemet mongodb-rr-experiment
git checkout visemet/mongodb-rr-experiment

python2 -m pip install -r etc/pip/dev-requirements.txt
python2 -m pip install --user psutil==5.4.8
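
With both builds in place, reproducing a race generally comes down to running a test repeatedly until it fails. The sketch below shows one way that loop might look. It is not the script used for this experiment: the function name repeat_until_failure and the RECORDER variable are hypothetical, and it assumes that rr's chaos mode (rr record --chaos) is the recording mode of interest and that rr record exits with the status of the recorded command.

```shell
#!/bin/sh
# Hypothetical helper (not part of this repository): rerun a command until
# it fails, giving up after max_attempts runs.
# RECORDER is an assumed override; it defaults to rr's chaos-mode recorder.
# Set RECORDER="" to run the command directly, without rr.
repeat_until_failure() {
    max_attempts="$1"
    shift
    recorder="${RECORDER-rr record --chaos}"
    attempt=1
    while [ "$attempt" -le "$max_attempts" ]; do
        # $recorder is deliberately unquoted so that it splits into words.
        if ! $recorder "$@"; then
            echo "failed on attempt ${attempt}"
            return 0
        fi
        attempt=$((attempt + 1))
    done
    echo "no failure in ${max_attempts} attempts"
    return 1
}
```

A call would take an attempt limit followed by the test command and its arguments. After a failing run recorded under rr, running rr replay with no arguments replays the most recent trace.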

Results

When comparing the columns in the tables below, you may notice that (1) there weren't any cases where a failure could only be reproduced using rr, and (2) there were multiple cases where a failure could only be reproduced manually. This shouldn't be interpreted as saying rr is ineffective. It is still very likely that rr would save an engineer both time and effort when investigating a build failure. The results simply demonstrate that rr cannot be relied on as the sole answer for investigating all race-related issues.

Single-process failures

Build failure    Able to reproduce using rr?    Able to reproduce manually?
BF-9810    
BF-9958
BF-10742
BF-10932

Single server process failures

Build failure    Able to reproduce using rr?    Able to reproduce manually?
BF-6346  
BF-8424
BF-9030    

Multi server process failures

Build failure    Able to reproduce using rr?    Able to reproduce manually?
BF-7114  
BF-7588
BF-7888  
BF-8258    
BF-8642
BF-9248  
BF-9426    
BF-9552
BF-9864    
BF-10729
BF-11054