[pull] master from git:master #184
Merged
pull[bot] merged 27 commits into turkdevops:master from git:master on Mar 26, 2026
To make clear that the function `get_midx_checksum()` does not do anything to modify its argument, mark the MIDX pointer as const. The following commit will rename this function altogether to make clear that it returns the raw bytes of the checksum, not a hex-encoded copy of it. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Since 541204a (Documentation: document naming schema for structs and their functions, 2024-07-30), we have adopted a naming convention for functions that would prefer a name like, say, `midx_get_checksum()` over `get_midx_checksum()`. Adopt this convention throughout the midx.h API. Since this function returns a raw (that is, non-hex encoded) hash, let's suffix the function with "_hash()" to make this clear. As a side effect, this prepares us for the subsequent change which will introduce a "_hex()" variant that encodes the checksum itself. Suggested-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
When trying to print out, say, the hexadecimal representation of a
MIDX's hash, our code will do something like:
    hash_to_hex_algop(midx_get_checksum_hash(m),
                      m->source->odb->repo->hash_algo);
, which is both cumbersome and repetitive. In fact, all but a handful of
callers to `midx_get_checksum_hash()` do exactly the above. Reduce the
repetitive nature of calling `midx_get_checksum_hash()` by introducing
`midx_get_checksum_hex()`, which returns a pointer into a static buffer
containing the above result.
For the handful of callers that do need to compare the raw bytes and
don't want to deal with an encoded copy (e.g., because they are passing
it to hasheq() or similar), they may still rely on
`midx_get_checksum_hash()` which returns the raw bytes.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
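The split between a raw accessor and a hex accessor can be sketched in isolation. This is a minimal stand-alone illustration, not Git's implementation: `struct demo_midx`, `DEMO_HASH_RAWSZ`, and both `demo_*` functions are hypothetical names, and a fixed 20-byte checksum is assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DEMO_HASH_RAWSZ 20 /* assumption: SHA-1-sized raw checksum */

/* Hypothetical stand-in for a MIDX; not Git's `struct multi_pack_index`. */
struct demo_midx {
	unsigned char checksum[DEMO_HASH_RAWSZ];
};

/* Return the raw checksum bytes, suitable for byte-wise comparison. */
static const unsigned char *demo_get_checksum_hash(const struct demo_midx *m)
{
	assert(m != NULL);
	return m->checksum;
}

/*
 * Return a hex-encoded copy of the checksum in a static buffer. The
 * buffer is overwritten by the next call, so callers that need the
 * string to outlive that call must copy it.
 */
static const char *demo_get_checksum_hex(const struct demo_midx *m)
{
	static const char digits[] = "0123456789abcdef";
	static char buf[2 * DEMO_HASH_RAWSZ + 1];
	const unsigned char *raw = demo_get_checksum_hash(m);
	size_t i;

	for (i = 0; i < DEMO_HASH_RAWSZ; i++) {
		buf[2 * i] = digits[raw[i] >> 4];
		buf[2 * i + 1] = digits[raw[i] & 0xf];
	}
	buf[2 * DEMO_HASH_RAWSZ] = '\0';
	return buf;
}
```

Callers that only print the checksum would use the hex variant, while callers that compare checksums would keep using the raw one.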
All multi-pack-index sub-commands (write, verify, repack, and expire) support a '--progress' command-line option, despite not listing it as one of the common options in `common_opts`. As a result each sub-command declares its own `OPT_BIT()` for a "--progress" command-line option. Centralize this within the `common_opts` to avoid re-declaring it in each sub-command. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Since fcb2205 (midx: implement support for writing incremental MIDX chains, 2024-08-06), the command-line options '--incremental' and '--bitmap' were declared to be incompatible with one another when running 'git multi-pack-index write'. However, since 27afc27 (midx: implement writing incremental MIDX bitmaps, 2025-03-20), that incompatibility no longer exists, despite the documentation saying so. Correct this by removing the stale reference to their incompatibility. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Since c39fffc (tests: start asserting that *.txt SYNOPSIS matches -h output, 2022-10-13), the manual page for 'git multi-pack-index' has a SYNOPSIS section which differs from 'git multi-pack-index -h'. Correct this while also documenting additional options accepted by the 'write' sub-command. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Commit d4bf1d8 (multi-pack-index: verify missing pack, 2018-09-13) adds a new test to the MIDX test script to test how we handle missing packs. While the commit itself describes the test as "verify missing pack[s]", the test itself is actually called "verify packnames out of order", despite that not being what it tests. Likely this was a copy-and-paste of the test immediately above it of the same name. Correct this by renaming the test to match the commit message. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
In midx_pack_order(), we compute for each bitmapped pack the first bit to correspond to an object in that pack, along with how many bits were assigned to object(s) in that pack.
Initially, each bitmap_nr value is set to zero, and each bitmap_pos value is set to the sentinel BITMAP_POS_UNKNOWN. This is done to ensure that there are no packs which have an unknown bit position but a somehow non-zero number of objects (cf. `write_midx_bitmapped_packs()` in midx-write.c).
Once the pack order is fully determined, midx_pack_order() sets the bitmap_pos field for any bitmapped packs to zero if they are still listed as BITMAP_POS_UNKNOWN.
However, we enumerate the bitmapped packs in order of `ctx->pack_perm`. This is fine for existing cases, since the only time the `ctx->pack_perm` array holds a value outside of the addressable range of `ctx->info` is when there are expired packs, which only occurs via 'git multi-pack-index expire', which does not support writing MIDX bitmaps. As a result, the range of ctx->pack_perm covers all values in [0, `ctx->nr`), so enumerating in this order isn't an issue.
A future change necessary for compaction will complicate this further by introducing a wrapper around the `ctx->pack_perm` array, which turns the given `pack_int_id` into one that is relative to the lower end of the compaction range. As a result, indexing into `ctx->pack_perm` through this helper, say, with "0" will produce a crash when the lower end of the compaction range has >0 pack(s) in its base layer, since the subtraction will wrap around the 32-bit unsigned range, resulting in an uninitialized read.
But the process is completely unnecessary in the first place: we are enumerating all values of `ctx->info`, and there is no reason to process them in a different order than they appear in memory. Index `ctx->info` directly to reflect that.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
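The wrap-around described above is ordinary unsigned arithmetic and can be reproduced in a few lines. The helper names here are hypothetical and only model the subtraction, not the real wrapper:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: translate an absolute pack_int_id into one that
 * is relative to the first pack of the compaction range.
 */
static uint32_t demo_relative_pack_id(uint32_t pack_int_id,
				      uint32_t packs_in_base)
{
	/* Unsigned subtraction wraps modulo 2^32 instead of going negative. */
	return pack_int_id - packs_in_base;
}

/* Returns 1 when the relative ID can safely index an array of `nr` entries. */
static int demo_index_is_valid(uint32_t relative_id, uint32_t nr)
{
	return relative_id < nr;
}
```

With three packs in the base, passing "0" yields 4294967293, far outside any realistic array, which is why indexing `ctx->info` directly sidesteps the problem.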
In the MIDX writing code, there are four functions which perform some sort of MIDX write operation. They are:
  - write_midx_file()
  - write_midx_file_only()
  - expire_midx_packs()
  - midx_repack()
All of these functions are thin wrappers over `write_midx_internal()`, which implements the bulk of these routines. As a result, the `write_midx_internal()` function takes six arguments.
Future commits in this series will want to add additional arguments, and in general this function's signature will be the union of parameters among *all* possible ways to write a MIDX.
Instead of adding yet more arguments to this function to support MIDX compaction, introduce a `struct write_midx_opts`, which has the same struct members as `write_midx_internal()`'s arguments.
Adding additional fields to the `write_midx_opts` struct is preferable to adding additional arguments to `write_midx_internal()`. This is because the callers below all zero-initialize the struct, so each time we add a new piece of information, we do not have to pass the zero value for it in all other call-sites that do not care about it.
For now, no functional changes are included in this patch.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
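The benefit of a zero-initialized options struct can be shown with a small sketch. The struct and its fields below are illustrative assumptions, not the actual members of `struct write_midx_opts`:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical options struct; the field names are illustrative only. */
struct demo_write_opts {
	const char *preferred_pack_name;
	const char *refs_snapshot;
	unsigned flags;
	int incremental; /* imagine this field was added in a later patch */
};

static int demo_write_internal(const struct demo_write_opts *opts)
{
	assert(opts != NULL);
	/* Real code would write a file; here we only report the mode. */
	return opts->incremental ? 2 : 1;
}

/* Thin wrapper: sets only the fields it cares about; the rest stay zero. */
static int demo_write_file(const char *preferred_pack_name, unsigned flags)
{
	struct demo_write_opts opts = {
		.preferred_pack_name = preferred_pack_name,
		.flags = flags,
	};
	return demo_write_internal(&opts);
}

static int demo_write_incremental(unsigned flags)
{
	struct demo_write_opts opts = {
		.flags = flags,
		.incremental = 1,
	};
	return demo_write_internal(&opts);
}
```

Adding the `incremental` field did not require touching `demo_write_file()`, since designated initializers leave unnamed members zeroed.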
The MIDX file format currently requires that pack files be identified by
the lexicographic ordering of their names (that is, a pack having a
checksum beginning with "abc" would have a numeric pack_int_id which is
smaller than the same value for a pack beginning with "bcd").
As a result, it is impossible to combine adjacent MIDX layers together
without permuting bits from bitmaps that are in more recent layer(s).
To see why, consider the following example:
            | packs       | preferred pack
    --------+-------------+---------------
    MIDX #0 | { X, Y, Z } | Y
    MIDX #1 | { A, B, C } | B
    MIDX #2 | { D, E, F } | D
, where MIDX #2's base MIDX is MIDX #1, and so on. Suppose that we want
to combine MIDX layers #0 and #1, to create a new layer #0' containing
the packs from both layers. With the original three MIDX layers, objects
are laid out in the bitmap in the order they appear in their source
pack, and the packs themselves are arranged according to the pseudo-pack
order. In this case, that ordering is Y, X, Z, B, A, C.
But recall that the pseudo-pack ordering is defined by the order that
packs appear in the MIDX, with the exception of the preferred pack,
which sorts ahead of all other packs regardless of its position within
the MIDX. In the above example, that means that pack 'Y' could be placed
anywhere (so long as it is designated as preferred); however, all other
packs must be placed in the location listed above.
Because that ordering isn't sorted lexicographically, it is impossible
to compact MIDX layers in the above configuration without permuting the
object-to-bit-position mapping. Changing this mapping would affect all
bitmaps belonging to newer layers, rendering the bitmaps associated with
MIDX #2 unreadable.
One of the goals of MIDX compaction is that we are able to shrink the
length of the MIDX chain *without* invalidating bitmaps that belong to
newer layers, and the lexicographic ordering constraint is at odds with
this goal.
However, packs do not *need* to be lexicographically ordered within the
MIDX. As far as I can gather, the only reason they are sorted lexically
is to make it possible to perform a binary search over the pack names in
a MIDX, necessary to make `midx_contains_pack()`'s performance
logarithmic in the number of packs rather than linear.
Relax this constraint by allowing MIDX writes to proceed with packs that
are not arranged in lexicographic order. `midx_contains_pack()` will
lazily instantiate a `pack_names_sorted` array on the MIDX, which will
be used to implement the binary search over pack names.
This change produces MIDXs which may not be correctly read with external
tools or older versions of Git. Though older versions of Git know how to
gracefully degrade and ignore any MIDX(s) they consider corrupt,
external tools may not be as robust. To avoid unintentionally breaking
any such tools, guard this change behind a version bump in the MIDX's
on-disk format.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
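A lazily built sorted copy of the pack names might look roughly like the following. Only the `pack_names_sorted` field name comes from the message above; the struct, the function, and the comparison helper are hypothetical and much simpler than the real code:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a MIDX; not Git's actual data structure. */
struct demo_names_midx {
	const char **pack_names;        /* on-disk order, possibly unsorted */
	const char **pack_names_sorted; /* built lazily on first lookup */
	size_t num_packs;
};

static int demo_cmp_names(const void *a, const void *b)
{
	return strcmp(*(const char *const *)a, *(const char *const *)b);
}

static int demo_contains_pack(struct demo_names_midx *m, const char *name)
{
	size_t sz = sizeof(*m->pack_names_sorted);

	if (!m->num_packs)
		return 0;
	if (!m->pack_names_sorted) {
		/* First lookup: copy the pointers and sort the copy once. */
		m->pack_names_sorted = malloc(m->num_packs * sz);
		assert(m->pack_names_sorted != NULL);
		memcpy(m->pack_names_sorted, m->pack_names, m->num_packs * sz);
		qsort(m->pack_names_sorted, m->num_packs, sz, demo_cmp_names);
	}
	/* Binary search over the sorted copy; on-disk order is untouched. */
	return bsearch(&name, m->pack_names_sorted, m->num_packs, sz,
		       demo_cmp_names) != NULL;
}
```

The on-disk order stays free to follow the pseudo-pack order, and the sorting cost is paid at most once per loaded MIDX, only if a lookup actually happens.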
The `ctx->pack_perm` array can be considered as a permutation between the original `pack_int_id` of some given pack to its position in the `ctx->info` array containing all packs.
Today we can always index into this array with any known `pack_int_id`, since there is never a `pack_int_id` which is greater than or equal to the value `ctx->nr`.
That is not necessarily the case with MIDX compaction. For example, suppose we have a MIDX chain with three layers, each containing three packs. The base of the MIDX chain will have packs with IDs 0, 1, and 2, the next layer 3, 4, and 5, and so on. If we are compacting the topmost two layers, we'll have input `pack_int_id` values between [3, 8], but `ctx->nr` will only be 6.
In that example, if we want to know where the pack whose original `pack_int_id` value was, say, 7, we would compute `ctx->pack_perm[7]`, leading to an uninitialized read, since there are only 6 entries allocated in that array.
To address this, there are a couple of options:
  - We could allocate enough entries in `ctx->pack_perm` to accommodate the largest `orig_pack_int_id` value.
  - Or, we could internally shift the input values by the number of packs in the base layer of the lower end of the MIDX compaction range.
This patch prepares us to take the latter approach, since it does not allocate more memory than strictly necessary. (In our above example, the base of the lower end of the compaction range is the first MIDX layer (having three packs), so we would end up indexing `ctx->pack_perm[7-3]`, which is a valid read.)
Note that this patch does not actually implement that approach yet, but merely performs a behavior-preserving refactoring which will make the change easier to carry out in the future.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
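The shifting approach from the three-layer example can be sketched as a small accessor. `struct demo_ctx` and `demo_pack_perm()` are hypothetical stand-ins (the series names its helper `midx_pack_perm()`, whose exact signature is not shown here):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical write context; not Git's actual write context struct. */
struct demo_ctx {
	const uint32_t *pack_perm; /* one entry per pack being written */
	uint32_t nr;               /* number of entries in pack_perm */
	uint32_t packs_in_base;    /* packs in layers below the range */
};

static uint32_t demo_pack_perm(const struct demo_ctx *ctx,
			       uint32_t orig_pack_int_id)
{
	/* Check the lower bound first so the subtraction cannot wrap. */
	assert(orig_pack_int_id >= ctx->packs_in_base);
	assert(orig_pack_int_id - ctx->packs_in_base < ctx->nr);
	return ctx->pack_perm[orig_pack_int_id - ctx->packs_in_base];
}
```

With `nr` equal to 6 and three packs in the base, a lookup for ID 7 reads entry 4, matching the `ctx->pack_perm[7-3]` computation in the message.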
When filling packs from an existing MIDX, `fill_packs_from_midx()` handles preparing a MIDX'd pack, and reading out its pack name from the existing MIDX. MIDX compaction will want to perform an identical operation, though the caller will look quite different than `fill_packs_from_midx()`. To reduce any future code duplication, extract `fill_pack_from_midx()` from `fill_packs_from_midx()` to prepare to call our new helper function in a future change. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Our `midx-write.c::fill_packs_from_midx()` function currently enumerates
the range [0, m->num_packs), and then shifts its index variable up by
`m->num_packs_in_base` to produce a valid `pack_int_id`.
Instead, directly enumerate the range:
    [m->num_packs_in_base, m->num_packs_in_base + m->num_packs)
, which are the original pack_int_ids themselves as opposed to the
indexes of those packs relative to the MIDX layer they are contained
within.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
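The two loop shapes visit the same `pack_int_id` values, as this small sketch illustrates. `struct demo_layer` is a hypothetical stand-in carrying only the two counters named above, and summing the IDs is just a convenient way to compare the loops:

```c
#include <assert.h>
#include <stdint.h>

struct demo_layer {
	uint32_t num_packs;         /* packs in this layer */
	uint32_t num_packs_in_base; /* packs in all earlier layers */
};

/* Old style: iterate layer-relative indexes and shift each one. */
static uint32_t demo_sum_ids_shifted(const struct demo_layer *m)
{
	uint32_t i, sum = 0;

	for (i = 0; i < m->num_packs; i++)
		sum += i + m->num_packs_in_base;
	return sum;
}

/* New style: iterate the pack_int_id values themselves. */
static uint32_t demo_sum_ids_direct(const struct demo_layer *m)
{
	uint32_t id, sum = 0;

	for (id = m->num_packs_in_base;
	     id < m->num_packs_in_base + m->num_packs; id++)
		sum += id;
	return sum;
}
```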
When computing the set of objects to appear in a MIDX, we use compute_sorted_entries(), which handles objects from various existing sources one fanout layer at a time. The process for computing this set is slightly different during MIDX compaction, so factor out the existing functionality into its own routine to prevent `compute_sorted_entries()` from becoming too difficult to read. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Though our 'read-midx' test tool is capable of printing information about a single MIDX layer identified by its checksum, no caller in our test suite exercises this path. Unfortunately, there is a memory leak lurking in this (currently) unused path that would otherwise be exposed by the following commit. This occurs when providing a MIDX layer checksum other than the tip. As we walk over the MIDX chain trying to find the matching layer, we drop our reference to the top-most MIDX layer. Thus, our call to 'close_midx()' later on leaks memory between the top-most MIDX layer and the MIDX layer immediately following the specified one. Plug this leak by holding a reference to the tip of the MIDX chain, and ensure that we call `close_midx()` before terminating the test tool. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
When managing a MIDX chain with many layers, it is convenient to combine a sequence of adjacent layers into a single layer to prevent the chain from growing too long.
While it is conceptually possible to "compact" a sequence of MIDX layers together by running "git multi-pack-index write --stdin-packs", there are a few drawbacks that make this less than desirable:
  - Preserving the MIDX chain is impossible, since there is no way to write a MIDX layer that contains objects or packs found in an earlier MIDX layer already part of the chain. So callers would have to write an entirely new (non-incremental) MIDX containing only the compacted layers, discarding all other objects/packs from the MIDX.
  - There is (currently) no way to write a MIDX layer outside of the MIDX chain to work around the above, such that the MIDX chain could be reassembled substituting the compacted layers with the MIDX that was written.
  - The `--stdin-packs` command-line option does not allow us to specify the order of packs as they appear in the MIDX. Therefore, even if there were workarounds for the previous two challenges, any bitmaps belonging to layers which come after the compacted layer(s) would no longer be valid.
This commit introduces a way to compact a sequence of adjacent MIDX layers into a single layer while preserving the MIDX chain, as well as any bitmap(s) in layers which are newer than the compacted ones.
Implementing MIDX compaction does not require a significant number of changes to how MIDX layers are written. The main changes are as follows:
  - Instead of calling `fill_packs_from_midx()`, we call a new function `fill_packs_from_midx_range()`, which walks backwards along the portion of the MIDX chain which we are compacting, and adds packs one layer at a time. In order to preserve the pseudo-pack order, the concatenated pack order is preserved, with the exception of preferred packs which are always added first.
  - After adding entries from the set of packs in the compaction range, `compute_sorted_entries()` must adjust the `pack_int_id`'s for all objects added in each fanout layer to match their original `pack_int_id`'s (as opposed to the index at which each pack appears in `ctx.info`). Note that we cannot reuse `midx_fanout_add_midx_fanout()` directly here, as it unconditionally recurses through the `->base_midx`. Factor out a `_1()` variant that operates on a single layer, reimplement the existing function in terms of it, and use the new variant from `midx_fanout_add_compact()`. Since we are sorting the list of objects ourselves, the order we add them in does not matter.
  - When writing out the new 'multi-pack-index-chain' file, discard any layers in the compaction range, replacing them with the newly written layer, instead of keeping them and placing the new layer at the end of the chain.
This ends up being sufficient to implement MIDX compaction in such a way that preserves bitmaps corresponding to more recent layers in the MIDX chain.
The tests for MIDX compaction are so far fairly spartan, since the main interesting behavior here is ensuring that the right packs/objects are selected from each layer, and that the pack order is preserved regardless of whether they are sorted in lexicographic order in the original MIDX chain.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
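The last of those changes, substituting the compacted layers in the chain, amounts to a simple splice. The sketch below assumes the chain is held as an in-memory array of layer identifiers ordered oldest first; `demo_rewrite_chain()` and its parameters are hypothetical and do not reflect how the chain file is actually read or written:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: copy `chain` (oldest layer first) into `out`,
 * replacing the layers in [from, to] with the single entry `compacted`.
 * `out` must have room for at least nr - (to - from) entries.
 * Returns the number of entries written to `out`.
 */
static size_t demo_rewrite_chain(const char **chain, size_t nr,
				 size_t from, size_t to,
				 const char *compacted, const char **out)
{
	size_t i, n = 0;

	assert(from <= to && to < nr);
	for (i = 0; i < from; i++)
		out[n++] = chain[i];   /* older layers are kept as-is */
	out[n++] = compacted;          /* takes the place of the range */
	for (i = to + 1; i < nr; i++)
		out[n++] = chain[i];   /* newer layers keep their position */
	return n;
}
```

Because the newer layers keep their relative position after the compacted layer, nothing about their place in the chain changes.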
Enable callers to generate reachability bitmaps when performing MIDX layer compaction by combining all existing bitmaps from the compacted layers. Note that because of the object/pack ordering described by the previous commit, the pseudo-pack order for the compacted MIDX is the same as concatenating the individual pseudo-pack orderings for each layer in the compaction range. As a result, the only non-test or documentation change necessary is to treat all objects as non-preferred during compaction so as not to disturb the object ordering. In the future, we may want to adjust which commit(s) receive reachability bitmaps when compacting multiple .bitmap files into one, or even generate new bitmaps (e.g., if the references have moved significantly since the .bitmap was generated). This commit only implements combining all existing bitmaps in range together in order to demonstrate and lay the groundwork for more exotic strategies. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
* ps/odb-sources:
  odb/source: make `begin_transaction()` function pluggable
  odb/source: make `write_alternate()` function pluggable
  odb/source: make `read_alternates()` function pluggable
  odb/source: make `write_object_stream()` function pluggable
  odb/source: make `write_object()` function pluggable
  odb/source: make `freshen_object()` function pluggable
  odb/source: make `for_each_object()` function pluggable
  odb/source: make `read_object_stream()` function pluggable
  odb/source: make `read_object_info()` function pluggable
  odb/source: make `close()` function pluggable
  odb/source: make `reprepare()` function pluggable
  odb/source: make `free()` function pluggable
  odb/source: introduce source type for robustness
  odb: move reparenting logic into respective subsystems
  odb: embed base source in the "files" backend
  odb: introduce "files" source
  odb: split `struct odb_source` into separate header
The "odb.h" header currently includes the "odb/source.h" file. This is somewhat roundabout though: most callers shouldn't have to care about the `struct odb_source`, but should rather use the ODB-level functions. Furthermore, it means that a couple of definitions have to live on the source level even though they should be part of the generic interface. Reverse the relation between "odb/source.h" and "odb.h" and move the enums and typedefs that relate to the generic interfaces back into "odb.h". Add the necessary includes to all files that rely on the transitive include. Suggested-by: Justin Tobler <jltobler@gmail.com> Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
In a subsequent commit we're about to introduce a new `odb_source_count_objects()` function so that we can make the logic pluggable. Prepare for this change by extracting the logic that we have to count packed objects into a standalone function. Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
In "builtin/gc.c" we have some logic that checks whether we need to repack objects. This is done by counting the number of objects that we have and checking whether it exceeds a certain threshold. We don't really need an accurate object count though, which is why we only open a single object directory shard and then extrapolate from there. Extract this logic into a new function that is owned by the loose object database source. This is done to prepare for a subsequent change, where we'll introduce object counting on the object database source level. Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
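The extrapolation idea can be sketched as follows. This assumes loose objects are spread evenly over 256 shard directories (one per possible first hash byte); the function names and the threshold parameter are hypothetical:

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_NUM_SHARDS 256 /* assumption: one shard per first hash byte */

/* Hypothetical: extrapolate a total from the count found in one shard. */
static size_t demo_approximate_loose_count(size_t objects_in_one_shard)
{
	return objects_in_one_shard * DEMO_NUM_SHARDS;
}

/* Decide whether the estimate exceeds a caller-supplied threshold. */
static int demo_needs_repack(size_t objects_in_one_shard, size_t threshold)
{
	return demo_approximate_loose_count(objects_in_one_shard) > threshold;
}
```

The estimate is only as good as the assumption that object names are uniformly distributed, which is reasonable for hash-derived names.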
Generalize the function introduced in the preceding commit to not only be able to approximate the number of loose objects, but to also provide an accurate count. The behaviour can be toggled via a new flag. Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Introduce generic object counting on the object database source level with a new backend-specific callback function. Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Similar to the preceding commit, introduce counting of objects on the object database level, replacing the logic that we have in `repo_approximate_object_count()`. Note that the function knows to cache the object count. It's unclear whether this cache is really required as we shouldn't have that many cases where we count objects repeatedly. But to be on the safe side the caching mechanism is retained, with the only exception being that we also have to use the passed flags as part of the caching key. Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
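Keying the cache on the flags could look roughly like this. Everything here is a hypothetical model: `struct demo_odb`, the `DEMO_COUNT_APPROXIMATE` flag, and the fixed numbers returned by the stand-in slow path are illustrative only:

```c
#include <assert.h>

#define DEMO_COUNT_APPROXIMATE (1u << 0) /* hypothetical flag */

struct demo_odb {
	unsigned long cached_count;
	unsigned cached_flags;
	int cache_valid;
	unsigned computations; /* how many times we actually counted */
};

/* Stand-in for an expensive walk over all object sources. */
static unsigned long demo_count_slow(struct demo_odb *odb, unsigned flags)
{
	odb->computations++;
	return (flags & DEMO_COUNT_APPROXIMATE) ? 1000 : 1003;
}

static unsigned long demo_count_objects(struct demo_odb *odb, unsigned flags)
{
	/* A cached value is reusable only if computed with the same flags. */
	if (!odb->cache_valid || odb->cached_flags != flags) {
		odb->cached_count = demo_count_slow(odb, flags);
		odb->cached_flags = flags;
		odb->cache_valid = 1;
	}
	return odb->cached_count;
}
```

Without the flags in the key, an approximate count cached by one caller could be handed back to another caller that asked for an accurate one.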
Further work on incremental repacking using MIDX/bitmap
* tb/incremental-midx-part-3.2:
  midx: enable reachability bitmaps during MIDX compaction
  midx: implement MIDX compaction
  t/helper/test-read-midx.c: plug memory leak when selecting layer
  midx-write.c: factor fanout layering from `compute_sorted_entries()`
  midx-write.c: enumerate `pack_int_id` values directly
  midx-write.c: extract `fill_pack_from_midx()`
  midx-write.c: introduce `midx_pack_perm()` helper
  midx: do not require packs to be sorted in lexicographic order
  midx-write.c: introduce `struct write_midx_opts`
  midx-write.c: don't use `pack_perm` when assigning `bitmap_pos`
  t/t5319-multi-pack-index.sh: fix copy-and-paste error in t5319.39
  git-multi-pack-index(1): align SYNOPSIS with 'git multi-pack-index -h'
  git-multi-pack-index(1): remove non-existent incompatibility
  builtin/multi-pack-index.c: make '--progress' a common option
  midx: introduce `midx_get_checksum_hex()`
  midx: rename `get_midx_checksum()` to `midx_get_checksum_hash()`
  midx: mark `get_midx_checksum()` arguments as const
The logic to count objects has been cleaned up.
* ps/object-counting:
  odb: introduce generic object counting
  odb/source: introduce generic object counting
  object-file: generalize counting objects
  object-file: extract logic to approximate object count
  packfile: extract logic to count number of objects
  odb: stop including "odb/source.h"
Signed-off-by: Junio C Hamano <gitster@pobox.com>
See Commits and Changes for more details.
Created by
pull[bot] (v2.0.0-alpha.4)