Add --duplicated-nevra "keep-last" option, and --delayed-dump #325
Conversation
By that I mean:
We could start another pool of workers there, but I'm not sure it is needed at this point in time.
Presumably in a situation where there are many duplicates, this would significantly improve the metadata size, correct?
Yes. The delayed-write logic can also be tweaked to implement any additional filtering mechanism in the future (if metadata size is a concern).
keep-last should probably keep the most recent package, as determined by a second-order sort by build time, which is what yum / dnf / zypper do when encountering multiple packages of the same NEVRA.
I would also like the build-time ordering. It would break some current CI tests, but I can easily fix that.
Force-pushed bc4127a to 824d2ce
Thank you for the review! Updated but WIP, I still need to implement build-time ordering.
So, trying to understand what DNF actually does, and it appears it prefers the oldest package from all with the same NVR, i.e. the oldest by build time. Is that a correct observation?
Even though we are destructively changing the recycled cr_Package data, we can keep the structure self-contained so the upper-level logic will automatically take care of the allocated memory.

The recycled metadata cr_Packages use one chunk shared by all packages (CR_PACKAGE_SINGLE_CHUNK). That's why (pkg->chunk == NULL), and we can re-use that field for a new chunk. We just need to change the type of pkg to !CR_PACKAGE_SINGLE_CHUNK so that cr_package_free() cleans the new chunk. This allows us to stop taking care of the locally allocated location_* variables, because they are _cleanup_free_. It even makes the buffer handling simpler.
AFAIK it prefers the newest package by NEVRA for any given package name, and the newest per build time between packages with the same NEVRA.
For dnf I think this depends on the order of packages in the metadata. I believe libsolv picks just the first package in the metadata. However, this is just for
Still experimenting a bit, and DNF seems to prefer the one which comes first. For example, with this repo in a Rawhide container:
The one starting with
Then the newest one is
@kontura was faster! :-) Good to have this tested. Anyway, for the
That sounds reasonable to me. 🙂
Force-pushed 1427d39 to f5072b9
Ok, please take another look.
Force-pushed f5072b9 to 05bcd40
Are we sure libsolv has any deliberate strategy here, as opposed to some kind of parsing bug where the various package entries attempt to overwrite each other?

Is there an easy way we can ask libsolv to "load this repo and then dump the solvables", to compare the input against what libsolv sees at the pool level (before any kind of actual dependency solving / choosing logic)?

e.g. If some package was present twice in the metadata by being present in multiple locations, this comment makes me suspect that you would only see one package, with potentially duplicated filelists and changelog metadata.
Oh, I meant during solving. Libsolv does parse all of the packages, and each gets a unique id (even duplicates). If you do
There is exactly such a thing. Libsolv has these tools (rpm libsolv-tools-0.7.22-3.fc37.x86_64). You can do:
I think that would work while parsing primary.xml, since you can create new solvables; I'm just curious what happens if you encounter a duplicate pkgid while parsing other.xml and filelists.xml. The files and changelogs have to be associated back to solvables, and all it has to go off of is NEVRA and pkgid, so how would it pick which solvable the files and changelogs get associated with? If they all get dumped onto one solvable, then you'd have duplicate solvables, but with broken file deps and changelogs on some or all of them. Basically similar to #306.

However, I've just done some testing and nothing stood out as being obviously wrong, so maybe it has some magic up its sleeve.
src/cmd_parser.c
Outdated
@@ -213,6 +213,11 @@ static GOptionEntry cmd_entries[] =
        "Read the list of packages from old metadata directory and re-use it. This "
        "option is only useful with --update (complements --pkglist and friends).",
        NULL },
    { "delayed-dump", 0, G_OPTION_FLAG_HIDDEN, G_OPTION_ARG_NONE,
Is there any benefit to having this as a hidden option? Maybe for debugging? But personally I wouldn't add it here at all.
I used it several times for debugging, yes. There shouldn't be any cons to having the option, so I'd prefer to keep it? Just please re-consider, and I'll drop it if you really want.
We have discussed this in our team, and such a hidden option could cause more trouble than it's worth.
Someone could find it and use it to perhaps work around some other issue, or use it in some other hard-to-imagine way. On top of that, we would still have to maintain it.
I really would prefer if we didn't add it.
For debugging, the mode can be turned on in the code.
ACK, I dropped the command-line option then, thanks for checking. I kept the "internal" delayed_dump option just to not overcomplicate the commit changes, if that is OK?
Great, thanks!
I was interested in how it is done, so I spent some time digging into the libsolv parser, and there is no magic. It has the same bug you mentioned.
It does add all the duplicate entries to the first
I am wondering what this means for dnf or other tools. dnf would show duplicated changelogs, but the filelists probably just add some overhead?
Revamped 'pkg_from_md', which had been unused for some time, and renamed it to 'clean'.
The 'task_cleanup' target was previously reached without freeing the 'pkg' structure in one of the error cases. The 'res' was allocated in two places.
Will be needed later for the "skip" mechanism.
Force-pushed 05bcd40 to c8dedb6
PR updated.
Normally we dump (to XML and DB) directly from the pool of threading workers, in a streaming fashion (as soon as the metadata for a package is determined, it is written). The new option causes the metadata to be loaded first (still in parallel), and then written all at once (by one thread, for now). Since all the metadata (cr_Package) is available in memory at one time, we can do some additional processing and filtering before we eventually perform the dump. This is optional behavior because the delayed XML chunk preparation slows down the overall multi-threaded processing time.
Commit 8fb99cd added a detection mechanism for duplicated-NEVRA packages, and a useful warning. The new option '--duplicated-nevra keep' keeps the default behavior (if specified): all the duplicate packages are dumped into the metadata. When the '--duplicated-nevra keep-last' argument is used, only the last NEVRA (ordered by build time, then by its location path) is dumped to the metadata and the former duplicates are skipped. Relates: https://pagure.io/copr/copr/issue/840 Fixes: #325
Force-pushed c8dedb6 to 856bdb8
@kontura So Pulp has some logic to attempt to pick the package that DNF would pick when syncing a repo that happens to contain duplicate NEVRA (it skips over the other duplicates), and currently it uses build time as the differentiator, because that's what we thought DNF does. Should we change that to instead just pick the first one?
I have tried syncing with DNF
If they have duplicate NEVRA and pkgid (checksum), then libsolv picks the first one. However, for just duplicate NEVRA, DNF works with all of them. For example, when you try to install something both of them provide, then yes, it (libsolv) picks the first one. But if you do
Based on that, if you want to match what DNF does, you need all of them.
Yeah. The trouble is that Pulp encoded some optimistic assumptions early on that NEVRA would be unique, and it led to bugs when that wasn't the case, but we can't reject those repos because there are too many of them and it would be a pain in the ass, so instead we're just trying to pick the one that DNF / Yum / zypper would prefer. In theory the dependencies ought to be the same between the packages anyway, unless the packager has been doing grievously stupid things.
This is a follow-up to #307. Unfortunately, the parallel processing performance degrades a bit with the "keep-last" option.
Motivation for Copr: https://pagure.io/copr/copr/issue/840
We can tweak this in the future (keep latest pkg per mtime, keep the first one, etc.).