[c++] avrogencpp: emit deterministic include guard#399
Conversation
CodeGen::guard() in avrogencpp.cc was suffixing the generated header's include guard with the output of boost::mt19937 seeded from ::time(nullptr). That produced a different guard on every avrogen invocation, e.g.: #ifndef FOO_AVROGEN_H_3350718792_H #ifndef FOO_AVROGEN_H_2362587291_H Two consequences: 1. Generated headers were non-deterministic. Repeated runs on the same schema produced different bytes. 2. Build systems that key their cache on input-content digests (e.g. Bazel's remote cache, the Nix store) saw every consumer of the generated header miss the cache on every build, even when the schema was byte-identical. In a hermetic two-output-base Bazel build of Redpanda this surfaced as a chain of cascade rebuilds starting at manifest_file.avrogen.h and propagating through every .cc that included it. headerFile_ is already guaranteed-unique per output. The random suffix doesn't add uniqueness, only entropy. Mirrors the same change being proposed upstream at apache/avro.
|
also upstream: apache#3778 |
|
@copilot review |
Agent-Logs-Url: https://github.com/redpanda-data/avro/sessions/b62d1929-ada4-43d6-9a02-d5dc98f08aaf Co-authored-by: travisdowns <2403521+travisdowns@users.noreply.github.com>
The change is straightforward and correct. One thing the original commit left behind: after removing the RNG suffix from
|
dotnwat
left a comment
There was a problem hiding this comment.
any clue why the random value existed? i hit a brick wall git blaming back to the 15 year old commit that added it.
I guess it's because you can pass any small relative path to the generator, depending on how the build works, so you might have a/foo.h and b/foo.h and pass -o foo.h in both cases with a and b being CWD respectively, in which cause the guard would clash. In our build we pass full paths (relative to the repo root) so that shouldn't be a problem, but I guess it does mean the change may not be accepted upstream. Another way to solve this would be to have an option to use |
Summary
CodeGen::guard()(lang/c++/impl/avrogencpp.cc:770) was suffixing the generated header's include guard with the output ofboost::mt19937seeded from::time(nullptr)— producing a different guard on every invocation:This makes generated headers non-deterministic by design, which is a problem for build systems that key their cache on input-content digests (Bazel remote cache, Nix store paths, etc.). On byte-identical schemas they nonetheless see every consumer of the generated header invalidate its action key on every invocation, forcing the entire downstream chain to rebuild.
Discovered while doing hermeticity work on Redpanda —
manifest_file.avrogen.hshowed up as a root non-hermetic action in a two-output-base bazel build comparison, then cascaded throughmanifest_list_avro.cc.o→ iceberg lib → the//src/v/redpanda:redpandabinary, blowing the PGO-instrument cache on the overnight CI builds.Why this is safe
headerFile_is already a unique-per-output path (the path of the file being written).makeCanonical(h, true)turns that path into a valid C identifier, which by itself is a fine guard name. The RNG-derived suffix only added entropy, not uniqueness — there's no scenario where two avrogen runs producing headers at the same output path are supposed to coexist with conflicting guards.After this change the guard is
<canonicalised-path>_H, deterministic across builds.Upstream
Same change being proposed against
apache/avroso users on a stock upstream cppgen pick it up too. Sister PR inredpanda-data/redpanda: #30487 (consumes the fix via a bazel patch on this branch as a stopgap until this lands).