Add --duplicated-nevra "keep-last" option, and --delayed-dump #325

praiskup · 2022-08-10T11:16:52Z

This is follow-up for #307. Unfortunately, the parallel processing performance degrades a bit
with the "keep-last" option.

Motivation for Copr: https://pagure.io/copr/copr/issue/840

We can tweak this in the future (keep latest pkg per mtime, keep the first one, etc.).

praiskup · 2022-08-10T11:30:32Z

parallel processing performance degrades a bit

By that I mean:

using createrepo_c --no-database --update --skip-stat --recycle-pkglist --pkglist add . --workers 8
on a repository with 250k+ packages, and only one is being added, i.e. mostly a CPU bound task (not I/O)
the slow-down is from 55s to 91s on my machine

We could start another pool of workers there, but I'm not sure it is needed at this point in time.

dralley · 2022-08-11T20:46:17Z

Presumably in a situation where there are many duplicates, this would significantly improve the metadata size correct?

praiskup · 2022-08-12T05:33:22Z

Presumably in a situation where there are many duplicates, this would significantly improve the metadata size correct?

Yes. The delayed-write logic can also be tweaked to implement any additional filtering mechanism in the future (if metadata size is a concern).

…

On Thu, Aug 11, 2022 at 10:47 PM Daniel Alley ***@***.***> wrote: Presumably in a situation where there are many duplicates, this would significantly improve the metadata size correct? — Reply to this email directly, view it on GitHub <#325 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AAZAID3PYT7ATP4X5FX2BP3VYVRCJANCNFSM56ED5DZA> . You are receiving this because you authored the thread.Message ID: ***@***.***>

dralley · 2022-08-18T01:48:18Z

keep-last should probably keep the most recent package as determined by a second-order sort by build time, which is what yum / dnf / zypper does when encountering multiple packages of the same NEVRA.

kontura

keep-last should probably keep the most recent package as determined by a second-order sort by build time, which is what yum / dnf / zypper does when encountering multiple packages of the same NEVRA.

I would also like the build-time ordering. It would break some current CI tests but I can easily fixed that.

src/dumper_thread.c

src/cmd_parser.c

src/dumper_thread.c

src/createrepo_c.c

praiskup · 2022-10-31T17:30:05Z

Thank you for the review! Updated but WIP, I still need to implement build-time ordering.

praiskup · 2022-11-01T09:06:17Z

keep-last should probably keep the most recent package as determined by a second-order sort by build time, which is what yum / dnf / zypper does when encountering multiple packages of the same NEVRA.

So, trying to understand what actually DNF does, and it appears it prefers the oldest package from all with the NVR, aka the oldest per build time. Is it a correct observation?

Even though we are destructively changing the recycled cr_Packages data, we can keep the structure self-contained so the upper level logic will automatically take care of the allocated memory. The recycled metadata cr_Packages use one chunk per all packages (CR_PACKAGE_SINGLE_CHUNK). That's why (pkg->chunk == NULL) and we can re-use it for a new chunk. We just need to change the type of pkg to !CR_PACKAGE_SINGLE_CHUNK so the cr_package_free() cleans the new chunk. This allows us to stop taking care of the locally allocated location_* variables, because they are _cleanup_free_. This makes even the buffer handling simpler.

dralley · 2022-11-01T12:51:28Z

AFAIK it prefers the newest package by NEVRA for any given package name, and the newest per build-time between packages with the same NEVRA.

kontura · 2022-11-01T14:12:53Z

For dnf I think this depends on the order of packages in metadata. I believe libsolv picks just the first package in the metadata.
I made a hacky CI test to demonstrate: rpm-software-management/ci-dnf-stack#1173

However this is just for dnf install package-name we might be doing something else for different arguments/commands.
It is possible that different filtering will send different ids to libsolv resulting in installing the other package but I don't think there is any dedicated logic to it. It might happen by accident if at all.

praiskup · 2022-11-01T14:19:00Z

Still experimenting a bit, and DNF seems to prefer the one which comes first
sorted alphabetically (or that comes first in the repodata, because createrepo_c
sorts the packages alphabetically).

For example this repo, in a Rawhide container:

# dnf -y install dummy-pkg --repofrompath=ordering,https://copr-be.cloud.fedoraproject.org/archive/issues/createrepo_c-pr-325/repo2/ --setopt ordering.gpgcheck=0
...
Installed:
  dummy-pkg-20221031_1057-1.fc38.x86_64 
...
# tail /var/log/dnf.librepo.log | grep dummy-pkg
https://copr-be.cloud.fedoraproject.org/archive/issues/createrepo_c-pr-325/repo2/a/dummy-pkg-20221031_1057-1.fc38.x86_64.rpm

The one starting with ../a/dummy-pkg... was installed. But when you look at RPM headers:

# for rpm in `find -name '*x86_64.rpm'`; do echo "== $rpm =="; rpm -qpi "$rpm" 2>/dev/null| grep Build\ Date; done
== ./a/dummy-pkg-20221031_1057-1.fc38.x86_64.rpm ==
Build Date  : Mon 31 Oct 2022 10:00:35 AM UTC
== ./b/dummy-pkg-20221031_1057-1.fc38.x86_64.rpm ==
Build Date  : Mon 31 Oct 2022 10:00:35 AM UTC
== ./c/dummy-pkg-20221031_1057-1.fc38.x86_64.rpm ==
Build Date  : Tue 01 Nov 2022 08:59:25 AM UTC
== ./d/dummy-pkg-20221031_1057-1.fc38.x86_64.rpm ==
Build Date  : Mon 31 Oct 2022 10:00:21 AM UTC

Then the newest one is ./c which was built today. Though I agree that what
you describe @dralley is the behavior I'd expect and prefer. @j-mracek is this a bug,
accident, or expected behavior? :-)

praiskup · 2022-11-01T14:21:42Z

@kontura was faster! :-) Good to have this tested.

Anyway, for the keep-last option; should we do a two-level sort, by build time (which might still have collisions) and then by path name alphabetically?

kontura · 2022-11-01T14:42:35Z

Anyway, for the keep-last option; should we do a two-level sort, by build time (which might still have collisions) and then by path name alphabetically?

That sounds reasonable to me. 🙂

praiskup · 2022-11-01T15:31:53Z

Ok, please take another look.

dralley · 2022-11-01T20:39:11Z

Are we sure libsolv has any deliberate strategy here, as opposed to some kind of parsing bug where various package entries attempt to overwrite each other?

Is there an easy way that we can ask libsolv "load this repo and then dump the solvables" to compare the input and what libsolv sees at the pool level (pre- any kind of actual dependency solving / choosing logic)?

e.g. If some package was present twice in the metadata by being present in multiple locations, this comment makes me suspicious that you would only see one package with potentially duplicate filelists and changelog metadata.

kontura · 2022-11-02T06:50:09Z

Are we sure libsolv has any deliberate strategy here, as opposed to some kind of parsing bug where various package entries attempt to overwrite each other?

Oh I meant during solving. Libsolv does parse all of the packages and each has a unique id (even duplicates). If you do install foo libdnf basically finds all ids that match name foo and sends them to libsolv. Here the solver has to pick something, I would guess that from duplicate packages it takes the lowest id (that is the first duplicate in metadata).

Is there an easy way that we can ask libsolv "load this repo and then dump the solvables" to compare the input and what libsolv sees at the pool level (pre- any kind of actual dependency solving / choosing logic)?

There is exactly such a thing. Libsolv has these tools (rpm libsolv-tools-0.7.22-3.fc37.x86_64). You can do:

$ repo2solv ./path/to/repo | dumpsolv

dralley · 2022-11-02T20:11:58Z

Oh I meant during solving. Libsolv does parse all of the packages and each has a unique id (even duplicates).

I think that would work while parsing primary.xml since you can create new solvables, I'm just curious what happens if you encounter duplicate pkgid while parsing other.xml and filelists.xml. The files and changelogs have to be associated back to solvables, and all it has to go off of is NEVRA and pkgid, so how would it pick which solvable the files and changelogs get associated with? If they all get dumped onto one solvable, then you'd have duplicate solvables but with broken file deps and changelogs on some or all of them.

Basically similar to #306

However I've just done some testing and nothing stood out as being obviously wrong, so maybe it has some magic up its sleeve.

src/dumper_thread.c

src/createrepo_c.c

kontura · 2022-11-02T20:28:43Z

src/cmd_parser.c

@@ -213,6 +213,11 @@ static GOptionEntry cmd_entries[] =
      "Read the list of packages from old metadata directory and re-use it.  This "
      "option is only useful with --update (complements --pkglist and friends).",
      NULL },
+    { "delayed-dump", 0, G_OPTION_FLAG_HIDDEN, G_OPTION_ARG_NONE,


Is there any benefit to having this as a hidden option? Maybe for debugging? But personally I wouldn't add it here at all.

I used it several times for debugging, yes. There shouldn't be cons in having the option, so I'd prefer to keeping it? Just please re-consider, and I'll drop if you really want.

We have discussed this in our team and such a hidden option could cause more trouble than its worth.

Someone could find it and use it to perhaps workaround some other issue or use it in some other hard to imagine way. On top of that we would still have to maintain it.

I really would prefer if we didn't add it.
For debugging the mode can be turned on in the code.

ACK, dropped the command line option then, thanks for checking. I kept the "internal" delayed_dump option just to not overcomplicate the commit changes if that is OK?

Great, thanks!

kontura · 2022-11-03T09:02:48Z

Basically similar to #306

However I've just done some testing and nothing stood out as being obviously wrong, so maybe it has some magic up its sleeve.

I was interested in how it is done so I spent some time digging into the libsolv parser and there is no magic. It has the same bug as you mentioned.
Its just that I didn't realize that repo2solv by default uses just primary. Testing it with options to use full filelists and changelogs:

repo2solv . -F -C | dumpsolv

It does add all the duplicate entries to the first id, duplicating them.

I am thinking what does this mean for dnf or other tools. dnf would show duplicated changelogs but the filelists probably just add some overhead?

Revamped 'pkg_from_md' that was unused for some time, but renamed to 'clean'.

The 'task_cleanup' target was reached without free-ing the 'pkg' structure in one of the error cases before. The 'res' was allocated on two places.

Will be needed later for the "skip" mechanism.

praiskup · 2022-11-03T09:47:02Z

PR updated.

Normally we dump (to XML and DB) directly from the pool of threading workers, in a streaming fashion (as soon as the metadata about packages are determined it is written). The new option causes that the metadata are loaded first (still in parallel), and then written at one time (by one thread for now). Since all the metadata (cr_Package) is available in the memory in one time, we can do some additional processing and filtering (before we eventually perform the dump). This is an optional behavior because the delayed XML chunk preparation slows down the overall multi-threaded processing time.

Commit 8fb99cd added a detection mechanism for duplicated NEVRA packages and an useful warning. The new option '--duplicated-nevra keep' keeps the default behavior (if specified) — all the duplicate packages are dumped into metadata. When '--duplicated-nevra keep-last' argument is used, only the last NEVRA (ordered by build time, then by its ocation path) is dumped to the metadata and the former duplicates are skipped. Relates: https://pagure.io/copr/copr/issue/840 Fixes: 325

dralley · 2022-11-05T00:07:23Z

@kontura So Pulp has some logic to attempt to pick the package that DNF would pick when syncing a repo that happens to contain duplicate NEVRA (it skips over the other duplicates), and currently it uses build time as the differentiator because that's what we thought it does.

Should we change that to instead just pick the first one?

kontura · 2022-11-07T08:43:24Z

I have tried syncing with DNF reposync and that keeps all the duplicates.
Just to make sure I understand correctly you mean when Pulp is doing the syncing right?

If they have duplicate NEVRA and pkgid (checksum) then libsolv picks the first one.

However for just duplicate NEVRA DNF works with all of them. For example when you try to install something both of them provide then yes it (libsolv) picks the first one. But if you do install /path/to/file and the file is only present in the second one - it uses the second one.
I think that in general libsolv will use any package of the duplicates that satisfies the dependency resolution and since the packages are different it could be any of them.

Based on that if you want to match what DNF does you need all of them.

dralley · 2022-11-07T18:52:46Z

Just to make sure I understand correctly you mean when Pulp is doing the syncing right?

Yeah. The trouble is that Pulp encoded some optimistic assumptions early on that NEVRA would be unique, and it led to bugs when that wasn't the case, but we can't reject those repos because there's too many of them and it would be a pain in the ass, so instead we're just trying to pick the one that DNF / Yum / zypper would prefer.

In theory the dependencies ought to be the same between the packages unless the packager has been doing grievously stupid things anyway.

kontura requested changes Aug 23, 2022

View reviewed changes

src/dumper_thread.c Show resolved Hide resolved

src/cmd_parser.c Outdated Show resolved Hide resolved

src/dumper_thread.c Outdated Show resolved Hide resolved

src/createrepo_c.c Outdated Show resolved Hide resolved

kontura reviewed Aug 23, 2022

View reviewed changes

src/createrepo_c.c Outdated Show resolved Hide resolved

praiskup force-pushed the delayed-dump-and-skip branch from bc4127a to 824d2ce Compare October 31, 2022 17:28

praiskup force-pushed the delayed-dump-and-skip branch from 1427d39 to f5072b9 Compare November 1, 2022 15:31

praiskup force-pushed the delayed-dump-and-skip branch from f5072b9 to 05bcd40 Compare November 1, 2022 15:32

kontura mentioned this pull request Nov 2, 2022

createrepo_c: Update tests with duplicates to include buildtime rpm-software-management/ci-dnf-stack#1174

Merged

kontura requested changes Nov 2, 2022

View reviewed changes

praiskup added 3 commits November 3, 2022 10:40

Don't remove old metadata packages with --update

ed9ce71

Revamped 'pkg_from_md' that was unused for some time, but renamed to 'clean'.

Better 'res' and 'pkg' allocation handling

c85bbf9

The 'task_cleanup' target was reached without free-ing the 'pkg' structure in one of the error cases before. The 'res' was allocated on two places.

New helper wait_for_incremented_ids()

8db3a6a

Will be needed later for the "skip" mechanism.

praiskup force-pushed the delayed-dump-and-skip branch from 05bcd40 to c8dedb6 Compare November 3, 2022 09:45

praiskup force-pushed the delayed-dump-and-skip branch from c8dedb6 to 856bdb8 Compare November 3, 2022 15:35

kontura approved these changes Nov 3, 2022

View reviewed changes

kontura merged commit e656df3 into rpm-software-management:master Nov 3, 2022

praiskup mentioned this pull request Dec 5, 2022

Option to only include the latest successful build in the repos fedora-copr/copr#840

Open

dralley mentioned this pull request Jan 12, 2023

Verify repodata integrity rpm-software-management/libdnf#1591

Open

praiskup mentioned this pull request May 9, 2023

repo with 2 builds of same nevra is giving user the older build to install fedora-copr/copr#2704

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add --duplicated-nevra "keep-last" option, and --delayed-dump #325

Add --duplicated-nevra "keep-last" option, and --delayed-dump #325

praiskup commented Aug 10, 2022

praiskup commented Aug 10, 2022

dralley commented Aug 11, 2022

praiskup commented Aug 12, 2022 via email •

edited

Loading

dralley commented Aug 18, 2022

kontura left a comment

praiskup commented Oct 31, 2022

praiskup commented Nov 1, 2022

dralley commented Nov 1, 2022 •

edited

Loading

kontura commented Nov 1, 2022

praiskup commented Nov 1, 2022 •

edited

Loading

praiskup commented Nov 1, 2022

kontura commented Nov 1, 2022

praiskup commented Nov 1, 2022

dralley commented Nov 1, 2022 •

edited

Loading

kontura commented Nov 2, 2022 •

edited

Loading

dralley commented Nov 2, 2022

kontura Nov 2, 2022

praiskup Nov 2, 2022

kontura Nov 3, 2022

praiskup Nov 3, 2022

kontura Nov 3, 2022

kontura commented Nov 3, 2022

praiskup commented Nov 3, 2022

dralley commented Nov 5, 2022

kontura commented Nov 7, 2022

dralley commented Nov 7, 2022

Add --duplicated-nevra "keep-last" option, and --delayed-dump #325

Add --duplicated-nevra "keep-last" option, and --delayed-dump #325

Conversation

praiskup commented Aug 10, 2022

praiskup commented Aug 10, 2022

dralley commented Aug 11, 2022

praiskup commented Aug 12, 2022 via email • edited Loading

dralley commented Aug 18, 2022

kontura left a comment

Choose a reason for hiding this comment

praiskup commented Oct 31, 2022

praiskup commented Nov 1, 2022

dralley commented Nov 1, 2022 • edited Loading

kontura commented Nov 1, 2022

praiskup commented Nov 1, 2022 • edited Loading

praiskup commented Nov 1, 2022

kontura commented Nov 1, 2022

praiskup commented Nov 1, 2022

dralley commented Nov 1, 2022 • edited Loading

kontura commented Nov 2, 2022 • edited Loading

dralley commented Nov 2, 2022

kontura Nov 2, 2022

Choose a reason for hiding this comment

praiskup Nov 2, 2022

Choose a reason for hiding this comment

kontura Nov 3, 2022

Choose a reason for hiding this comment

praiskup Nov 3, 2022

Choose a reason for hiding this comment

kontura Nov 3, 2022

Choose a reason for hiding this comment

kontura commented Nov 3, 2022

praiskup commented Nov 3, 2022

dralley commented Nov 5, 2022

kontura commented Nov 7, 2022

dralley commented Nov 7, 2022

praiskup commented Aug 12, 2022 via email •

edited

Loading

dralley commented Nov 1, 2022 •

edited

Loading

praiskup commented Nov 1, 2022 •

edited

Loading

dralley commented Nov 1, 2022 •

edited

Loading

kontura commented Nov 2, 2022 •

edited

Loading