Skip to content

Fix: malformed brda records caused bad branching coverage #26342

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

orbl
Copy link

@orbl orbl commented Jun 19, 2025

While working on a large bazel migration project in Volvo Cars, I noticed that for merged coverage reports (i.e. `bazel coverage target1 target2 ... targetN) bazel would generate different branch coverage on each run. It would also be very inflated compared to running the same in our pre-bazel environment. This essentially made it unusable.

After much detective work it turns out that individual tests in bazel can create malformed BRDA records where the same unique branch (i.e. the same, line, block and branch) have multiple values. When this happens the coverageOutputGenerator simply adds all the incorrect records to the merged file. In the merging logic there are some checks that sometimes drops individual records based on the order in which the BRDA records have been merged. Given that bazel does not guarantee that order, this then leads to each run having different branch coverage.

E.g.
Bazel has generated this (counted as 6 branches)
BRDA:131,0,0,1
BRDA:131,0,1,1
BRDA:131,0,0,1
BRDA:131,0,1,0
BRDA:131,0,0,3
BRDA:131,0,1,1

Which should've been only this (counted as 2 branches):
BRDA:131,0,1,2
BRDA:131,0,0,5

This fix ensures that each branch is unique for a particular source file. Duplicates are now merged, ensuring that the tally is correct. I have verified that this works consistently on our large code base and added some regression tests to avoid this bug in the future.

@orbl orbl requested a review from lberki as a code owner June 19, 2025 13:48
Copy link

google-cla bot commented Jun 19, 2025

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

@github-actions github-actions bot added the awaiting-review PR is awaiting review from an assigned reviewer label Jun 19, 2025
@fmeum fmeum requested review from c-mita and removed request for lberki June 19, 2025 20:00
@c-mita
Copy link
Member

c-mita commented Jun 20, 2025

When does this actually occur?

it turns out that individual tests in bazel can create malformed BRDA records where the same unique branch (i.e. the same, line, block and branch) have multiple values

So before the coverage merging runs, a single test has more than one line item for a given branch? What is creating that information and why?

@orbl
Copy link
Author

orbl commented Jun 20, 2025

I have been unable to pinpoint why this happens in the first place. But here are the conditions I can use to reproduce it consistently:

  • Run bazel coverage target1 target2 ... targetN for at least 25 cc_test's in our code base
  • ALL coverage.dat files are taken from cache
  • When merging the .dat files from the cache we get the behavior described above.

We use exclusively the bazel built-in remote_coverage_tools.

The submitted fix aims at solving this problem for the final merging of the .dat files. I am spending the day today trying to find how it occurs in the first place to see if I can fix it there too, but I have limited time to dedicate, so I hope this fix can be submitted independently of the upstream issue. Especially, given that one can use flags to replace the coverage mergers, I figure this needs to be fixed both for individual tests and for the final merging.

@orbl
Copy link
Author

orbl commented Jun 20, 2025

I'll underline that the line 131 example above is taken from actual .dat files generated by Bazel and are not hypothetical.

@c-mita
Copy link
Member

c-mita commented Jun 20, 2025

When merging the .dat files from the cache we get the behavior described above.
Is this just #12142?

There is an issue where two different test targets instrument the same source file, but there's a mismatch in the branch data for some lines (the example I have in that issue is due to C++ template instantiation, but I believe there are other ways this can happen).

But from what I understand, that's not the root of your problem. I would not expect to see duplicated "line, block, branch" within a single report.

My understanding is that, in your case, a single test target is producing multiple branch data for a single "(line, branch, block)" combination? This of course would the trip up merging as per #12142.

In which case, you should be able to identify a single test that produces an anomalous coverage report.

Run bazel coverage target1 target2 ... targetN for at least 25 cc_test's in our code base

Can you run bazel coverage --combined_report=none for each target in turn and examine their individual coverage outputs and see which one has the odd behaviour? Ideally you could also inspect the raw coverage data produced by the compiler, although this might be trickier for you.

And which compiler version are you using? There are some differences between GCC versions, and Clang's coverage also works differently.

@orbl
Copy link
Author

orbl commented Jun 20, 2025

In which case, you should be able to identify a single test that produces an anomalous coverage report.
I already know a test that produces it, but it is proprietary Volvo Cars code and I cannot share it with you. Is there something in particular you want me to check for you in it?

While I can't share the source code, here is the (lightly) obfuscated coverage code for a single test (unmerged & copied straight out of the .dat):

SF:PATHTOFILE
FN:87,_ZN3csp5trace13LogTraceEventIJRKjEEEvNS0_12TraceEventIdEDpOT_
FN:87,_ZN3csp5trace13LogTraceEventIJRKjS3_EEEvNS0_12TraceEventIdEDpOT_
FN:87,_ZN3csp5trace13LogTraceEventIJiEEEvNS0_12TraceEventIdEDpOT_
FN:87,_ZN3csp5trace13LogTraceEventIJiiEEEvNS0_12TraceEventIdEDpOT_
FN:87,_ZN3csp5trace13LogTraceEventIJjjEEEvNS0_12TraceEventIdEDpOT_
FN:140,_ZN3csp5trace20DelayedLogTraceEventIJiEE17CallLogTraceEventIJLm0EEEEvSt16integer_sequenceImJXspT_EEE
FN:107,_ZN3csp5trace20DelayedLogTraceEventIJiEEC2IJRiEEENS0_12TraceEventIdEDpOT_
FN:130,_ZN3csp5trace20DelayedLogTraceEventIJiEED2Ev
FN:140,_ZN3csp5trace20DelayedLogTraceEventIJiiEE17CallLogTraceEventIJLm0ELm1EEEEvSt16integer_sequenceImJXspT_EEE
FN:107,_ZN3csp5trace20DelayedLogTraceEventIJiiEEC2IJRKjS5_EEENS0_12TraceEventIdEDpOT_
FN:107,_ZN3csp5trace20DelayedLogTraceEventIJiiEEC2IJiiEEENS0_12TraceEventIdEDpOT_
FN:130,_ZN3csp5trace20DelayedLogTraceEventIJiiEED2Ev
FN:119,_ZN3csp5trace20DelayedLogTraceEventIJiiEEaSEOS2_
FN:140,_ZN3csp5trace20DelayedLogTraceEventIJjjEE17CallLogTraceEventIJLm0ELm1EEEEvSt16integer_sequenceImJXspT_EEE
FN:114,_ZN3csp5trace20DelayedLogTraceEventIJjjEEC2EOS2_
FN:107,_ZN3csp5trace20DelayedLogTraceEventIJjjEEC2IJRKjS5_EEENS0_12TraceEventIdEDpOT_
FN:130,_ZN3csp5trace20DelayedLogTraceEventIJjjEED2Ev
FN:48,_ZN3csp5trace9LogTracer13LogTraceEventIJRKjEEEvNS0_12TraceEventIdEDpOT_
FN:48,_ZN3csp5trace9LogTracer13LogTraceEventIJRKjS4_EEEvNS0_12TraceEventIdEDpOT_
FN:48,_ZN3csp5trace9LogTracer13LogTraceEventIJiEEEvNS0_12TraceEventIdEDpOT_
FN:48,_ZN3csp5trace9LogTracer13LogTraceEventIJiiEEEvNS0_12TraceEventIdEDpOT_
FN:48,_ZN3csp5trace9LogTracer13LogTraceEventIJjjEEEvNS0_12TraceEventIdEDpOT_
FN:62,_ZNK3csp5trace9LogTracer14IsEventEnabledENS0_12TraceEventIdE
FNDA:1,_ZN3csp5trace13LogTraceEventIJRKjEEEvNS0_12TraceEventIdEDpOT_
FNDA:5,_ZN3csp5trace13LogTraceEventIJRKjS3_EEEvNS0_12TraceEventIdEDpOT_
FNDA:1,_ZN3csp5trace13LogTraceEventIJiEEEvNS0_12TraceEventIdEDpOT_
FNDA:1,_ZN3csp5trace13LogTraceEventIJiiEEEvNS0_12TraceEventIdEDpOT_
FNDA:4,_ZN3csp5trace13LogTraceEventIJjjEEEvNS0_12TraceEventIdEDpOT_
FNDA:1,_ZN3csp5trace20DelayedLogTraceEventIJiEE17CallLogTraceEventIJLm0EEEEvSt16integer_sequenceImJXspT_EEE
FNDA:1,_ZN3csp5trace20DelayedLogTraceEventIJiEEC2IJRiEEENS0_12TraceEventIdEDpOT_
FNDA:1,_ZN3csp5trace20DelayedLogTraceEventIJiEED2Ev
FNDA:1,_ZN3csp5trace20DelayedLogTraceEventIJiiEE17CallLogTraceEventIJLm0ELm1EEEEvSt16integer_sequenceImJXspT_EEE
FNDA:1,_ZN3csp5trace20DelayedLogTraceEventIJiiEEC2IJRKjS5_EEENS0_12TraceEventIdEDpOT_
FNDA:1,_ZN3csp5trace20DelayedLogTraceEventIJiiEEC2IJiiEEENS0_12TraceEventIdEDpOT_
FNDA:2,_ZN3csp5trace20DelayedLogTraceEventIJiiEED2Ev
FNDA:1,_ZN3csp5trace20DelayedLogTraceEventIJiiEEaSEOS2_
FNDA:4,_ZN3csp5trace20DelayedLogTraceEventIJjjEE17CallLogTraceEventIJLm0ELm1EEEEvSt16integer_sequenceImJXspT_EEE
FNDA:1,_ZN3csp5trace20DelayedLogTraceEventIJjjEEC2EOS2_
FNDA:4,_ZN3csp5trace20DelayedLogTraceEventIJjjEEC2IJRKjS5_EEENS0_12TraceEventIdEDpOT_
FNDA:5,_ZN3csp5trace20DelayedLogTraceEventIJjjEED2Ev
FNDA:1,_ZN3csp5trace9LogTracer13LogTraceEventIJRKjEEEvNS0_12TraceEventIdEDpOT_
FNDA:5,_ZN3csp5trace9LogTracer13LogTraceEventIJRKjS4_EEEvNS0_12TraceEventIdEDpOT_
FNDA:1,_ZN3csp5trace9LogTracer13LogTraceEventIJiEEEvNS0_12TraceEventIdEDpOT_
FNDA:1,_ZN3csp5trace9LogTracer13LogTraceEventIJiiEEEvNS0_12TraceEventIdEDpOT_
FNDA:4,_ZN3csp5trace9LogTracer13LogTraceEventIJjjEEEvNS0_12TraceEventIdEDpOT_
FNDA:13,_ZNK3csp5trace9LogTracer14IsEventEnabledENS0_12TraceEventIdE
FNF:23
FNH:23
BRDA:49,0,0,1
BRDA:49,0,1,0
BRDA:49,0,0,1
BRDA:49,0,1,0
BRDA:49,0,0,2
BRDA:49,0,1,2
BRDA:49,0,0,1
BRDA:49,0,1,0
BRDA:49,0,0,3
BRDA:49,0,1,2
BRDA:64,0,0,12
BRDA:64,0,1,1
BRDA:120,0,0,1
BRDA:120,0,1,0
BRDA:131,0,0,1
BRDA:131,0,1,1
BRDA:131,0,0,1
BRDA:131,0,1,0
BRDA:131,0,0,4
BRDA:131,0,1,1
BRF:20
BRH:15
DA:48,12
DA:49,12
DA:50,8
DA:51,8
DA:52,8
DA:55,8
DA:57,8
DA:58,8
DA:60,12
DA:62,13
DA:63,13
DA:64,13
DA:65,12
DA:67,1
DA:87,12
DA:88,12
DA:89,12
DA:107,7
DA:108,7
DA:109,7
DA:114,1
DA:115,1
DA:116,1
DA:117,1
DA:119,1
DA:120,1
DA:121,1
DA:122,1
DA:123,1
DA:125,1
DA:127,1
DA:130,8
DA:131,8
DA:133,6
DA:135,8
DA:140,6
DA:141,6
DA:142,6
LH:38
LF:38
end_of_record

We are using gcc-9.4.0-x86_64-linux as a compiler.

Since the data can get malformed due to upstreams problems (whether it is from the compiler, or the issue you mention), I think a bit of robustification is in order. Multiple BRDA records with the same unique id (line, block and branch number) should not be allowed. Especially since one cannot guarantee the upstream coverage generation.

@c-mita
Copy link
Member

c-mita commented Jun 20, 2025

I believe I understand why this is happening.

You have the same function listed three times in your report:

FN:130,_ZN3csp5trace20DelayedLogTraceEventIJiEED2Ev
FN:130,_ZN3csp5trace20DelayedLogTraceEventIJiiEED2Ev
FN:130,_ZN3csp5trace20DelayedLogTraceEventIJjjEED2Ev

One is defined with <int>, a second with <int, int>, and another with <unsigned int, unsigned int>. Each of those then contributes a "set of branches" separately.

I have actually encountered this before; I just forgot about this particular case. It is also, unfortunately, fundamentally intractable, being the same as the problem described in #12142.

Your patch may solve your particular case, but it does not solve the general problem.

I have a reproduction of the issue here: https://github.com/c-mita/bazel_issues/tree/master/lcov_issues/branch_merge

// lib.h
template <typename T>
T do_t_thing(T x, int y) {
    T z = x;
    z = y > 0 ? z + y : z;
    return z;
}
// baz.cc

// thing is a struct with + and = operators

int do_many_things(int x, int y) {
    thing z(x, y);
    z = do_t_thing(z, y);
    return do_t_thing(z.x[0], y);
}

Executing a test with do_many_things(1, 1) yields the following branch coverage for lib.h:

SF:branch_merge/lib.h
FN:5,_Z10do_t_thingI5thingET_S1_i
FN:5,_Z10do_t_thingIiET_S0_i
FNDA:1,_Z10do_t_thingI5thingET_S1_i
FNDA:1,_Z10do_t_thingIiET_S0_i
FNF:2
FNH:2
BRDA:7,0,0,1
BRDA:7,0,1,0
BRDA:7,0,0,1
BRDA:7,0,1,0
BRDA:7,0,2,1
BRDA:7,0,3,0
BRDA:7,0,4,1
BRDA:7,0,5,0
BRDA:7,0,6,0
BRDA:7,0,7,0
BRDA:7,0,8,1
BRDA:7,0,9,0
BRDA:7,0,10,0
BRDA:7,0,11,0
BRF:14
BRH:5

The instantiation with an int is simple; it only has two branches, but the struct one has 12, of which four are actually taken.

It also isn't immediately obvious to me which of those branches align with the ones in the "simple" case. So simply merging like this is not necessarily "correct".

All that said, while there may not be a "correct" solution, there are varying degrees of "wrong" and I think Bazel should be "less wrong". At the very least, report merging should be a commutative and associative operation (the order in which merge reports should not matter) and this is currently not the case.

I need to think and dig a little more before committing to an answer though.

@iancha1992 iancha1992 added coverage team-Rules-CPP Issues for C++ rules labels Jun 20, 2025
@orbl
Copy link
Author

orbl commented Jun 23, 2025

Glad to see that you have a good theory on what could cause this.

As you point out, while this change "fixes" it for us, it might not fix it in all cases. This is a natural outcome of the fact that you cannot guarantee correct input data. However, to merge and keep records where the same (in lcov language) "unique" branch is reported multiple time is akin to making a hash map with duplicates of the same hash key pointing to different values. It is semantically and functionally broken.

If our aim is to make a solution that guarantees branch merging to be correct in all cases, we need to work upstream: rewriting tests, code, setting compiler flags and making the organisation doing these things highly disciplined. You and I will not be able to solve this through code changes to Bazel.

We can however do some mitigation. Here are some ideas:

  1. If duplicates of the same UNIQUE branch are detected, through an error. Have a flag that end users can use to ignore/silence the error
  2. Merge the branches naively (like in my PR) but make a warning that it is happening
    b) Same, but only do it if a --experimental_merge_duplicate_branch_coverage flag is set.
  3. All of the above

Commutativeness
On your note that coverage needs to be commutative, this PR will not fix that, but does reduce the impact of it. Since there will be fewer records, there will be fewer cases in which case the logic that drops data comes into play. I've made a separate issue here and am happy to make a PR with a mitigation or warning. However, I ask that it be considered separately from this PR, since it is fundamentally a different change.

@orbl
Copy link
Author

orbl commented Jun 23, 2025

Another note, @c-mita, when running coverage on the same files using cmake, gcov and gcovr, the merging works fine and we get the individual merged records, instead of duplicates like with bazels merging.

BRDA:131,0,1,2
BRDA:131,0,0,5

So while you and I agree that this pull request is not necessarily a "fix", given that the problem lies with the input data being merged. From the perspective of volvo developers, the bazel coverage solution is "incorrect" compared to their original experience. Leaving this unchanged should not be an option, and I hope we can find a way forward here.

@c-mita
Copy link
Member

c-mita commented Jun 23, 2025

I have a change internally that simplifies the merging logic and, as a side effect, solves this problem too (it's very similar looking code). It shouldn't take too long to get submitted.

@orbl
Copy link
Author

orbl commented Jun 24, 2025

Alrighty! I was looking forward to become a contributor to the bazel source, but the most important thing is that this is solved. Looking forward to seeing your change.

@orbl
Copy link
Author

orbl commented Jun 25, 2025

@c-mita, how long do you think it will take until you get your changes submitted and merged?

copybara-service bot pushed a commit that referenced this pull request Jun 25, 2025
Coverage report merging should be consistent, no matter the order in
which reports are merged (the merge operation should be commutative and
associative). Dropping branches when the input reports disagree on the
number of branches violated this (we used to drop the RHS, so the
operation didn't commute).

The new merging strategy is simply to combine the execution values for
the same "(block, branch)" combination. If a combination doesn't exist
for one report, it's simply added to the merged result.

This may not still be "correct" for when the number of branches differs
between reports, but it's arguably "less wrong" since it's at least now
consistent.

Moving this logic to the "addBranch" method on SourceFileCoverage also
neatly handles the case where the same "block, branch" pair occurs more
than once in a single coverage report for a file. This can occur
legitimately when C++ templates are instantiated several times (see
#26342 (comment)
and #12142).

To simplify the merging logic and avoid having to handle "BA" and "BRDA"
style branches differently, we ensure that BranchCoverage always has a
branch identifier set. This will just be a simple incrementing counter
when parsing BA lines. A new test to ensure BA lines are correctly
printed out has been added to verify the change to BranchCoverage
doesn't break anything in that regard.

Fixes #12142

PiperOrigin-RevId: 775577636
Change-Id: I1c23c89ca9c4477aeee7f4a7cca70612248c8a1c
@c-mita
Copy link
Member

c-mita commented Jun 25, 2025

@c-mita, how long do you think it will take until you get your changes submitted and merged?

4753fa0 should resolve the issue.

I'll need to create a new release of the coverage tool for it to take effect normally. There's a bit more cleaning up I'd like to do, as well as some long standing performance issues. I'm not sure how much I'll get round to, but either way I should make a new release of the tool within the week.

@orbl
Copy link
Author

orbl commented Jun 25, 2025

Looks very good. I really like replacing the merging function with just adding the branches per file. Makes it much simpler.

A suggestion that could help improve your change even more: lines are currently used throughout the code as identifiers for branches. This is incorrect, the combination of line, branch and block is what makes a branch unique. Either one can remove using "line" entirely as input to the branch related functions (and thus simplify the code) or it should be replaces with a tuple or a hash of the id's (to make it correct).

When you've uploaded the new version, could you pass me the hash and url, so I can point our internal Bazel to it while waiting for it to be included in the next release.

@c-mita
Copy link
Member

c-mita commented Jun 25, 2025

A suggestion that could help improve your change even more: lines are currently used throughout the code as identifiers for branches. This is incorrect, the combination of line, branch and block is what makes a branch unique. Either one can remove using "line" entirely as input to the branch related functions (and thus simplify the code) or it should be replaces with a tuple or a hash of the id's (to make it correct).

You're suggesting changing the mapping from {line -> [(block, branch, eval, exec) ...]} to {(line, block, branch) -> (eval, exec)}?

I can see some merit in that; I don't think we ever actually need to query branch data by line; we just need to walk the mapping in natural key order when printing out or merging.

@orbl
Copy link
Author

orbl commented Jun 25, 2025

Yup, and this way bazels coverage management would be consistent with the way coverage tools treat branch coverage as well.

@orbl
Copy link
Author

orbl commented Jul 2, 2025

@c-mita have you had a chance to publish the fixed coverageOutputRunner on the google mirrors yet?

Also, will this be included in the next release of Bazel?

@c-mita
Copy link
Member

c-mita commented Jul 2, 2025

@c-mita have you had a chance to publish the fixed coverageOutputRunner on the google mirrors yet?

Also, will this be included in the next release of Bazel?

See #26407

It should also make it into 8.4.0 (#26408), but it seems there's a conflict to be resolved first.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
awaiting-review PR is awaiting review from an assigned reviewer coverage team-Rules-CPP Issues for C++ rules
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants