Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add --duplicated-nevra "keep-last" option, and --delayed-dump #325

Merged

Conversation

praiskup
Copy link
Member

This is follow-up for #307. Unfortunately, the parallel processing performance degrades a bit
with the "keep-last" option.

Motivation for Copr: https://pagure.io/copr/copr/issue/840

We can tweak this in the future (keep latest pkg per mtime, keep the first one, etc.).

@praiskup
Copy link
Member Author

parallel processing performance degrades a bit

By that I mean:

  • using createrepo_c --no-database --update --skip-stat --recycle-pkglist --pkglist add . --workers 8
  • on a repository with 250k+ packages, and only one is being added, i.e. mostly a CPU bound task (not I/O)
  • the slow-down is from 55s to 91s on my machine

We could start another pool of workers there, but I'm not sure it is needed at this point in time.

@dralley
Copy link
Contributor

dralley commented Aug 11, 2022

Presumably in a situation where there are many duplicates, this would significantly improve the metadata size correct?

@praiskup
Copy link
Member Author

praiskup commented Aug 12, 2022 via email

@dralley
Copy link
Contributor

dralley commented Aug 18, 2022

keep-last should probably keep the most recent package as determined by a second-order sort by build time, which is what yum / dnf / zypper does when encountering multiple packages of the same NEVRA.

Copy link
Contributor

@kontura kontura left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

keep-last should probably keep the most recent package as determined by a second-order sort by build time, which is what yum / dnf / zypper does when encountering multiple packages of the same NEVRA.

I would also like the build-time ordering. It would break some current CI tests but I can easily fixed that.

src/dumper_thread.c Show resolved Hide resolved
src/cmd_parser.c Outdated Show resolved Hide resolved
src/dumper_thread.c Outdated Show resolved Hide resolved
src/createrepo_c.c Outdated Show resolved Hide resolved
src/createrepo_c.c Outdated Show resolved Hide resolved
@praiskup
Copy link
Member Author

Thank you for the review! Updated but WIP, I still need to implement build-time ordering.

@praiskup
Copy link
Member Author

praiskup commented Nov 1, 2022

keep-last should probably keep the most recent package as determined by a second-order sort by build time, which is what yum / dnf / zypper does when encountering multiple packages of the same NEVRA.

So, trying to understand what actually DNF does, and it appears it prefers the oldest package from all with the NVR, aka the oldest per build time. Is it a correct observation?

Even though we are destructively changing the recycled cr_Packages data,
we can keep the structure self-contained so the upper level logic will
automatically take care of the allocated memory.

The recycled metadata cr_Packages use one chunk per all packages
(CR_PACKAGE_SINGLE_CHUNK).  That's why (pkg->chunk == NULL) and we can
re-use it for a new chunk.  We just need to change the type of pkg to
!CR_PACKAGE_SINGLE_CHUNK so the cr_package_free() cleans the new
chunk.

This allows us to stop taking care of the locally allocated location_*
variables, because they are _cleanup_free_.  This makes even the buffer
handling simpler.
@dralley
Copy link
Contributor

dralley commented Nov 1, 2022

AFAIK it prefers the newest package by NEVRA for any given package name, and the newest per build-time between packages with the same NEVRA.

@kontura
Copy link
Contributor

kontura commented Nov 1, 2022

For dnf I think this depends on the order of packages in metadata. I believe libsolv picks just the first package in the metadata.
I made a hacky CI test to demonstrate: rpm-software-management/ci-dnf-stack#1173

However this is just for dnf install package-name we might be doing something else for different arguments/commands.
It is possible that different filtering will send different ids to libsolv resulting in installing the other package but I don't think there is any dedicated logic to it. It might happen by accident if at all.

@praiskup
Copy link
Member Author

praiskup commented Nov 1, 2022

Still experimenting a bit, and DNF seems to prefer the one which comes first
sorted alphabetically (or that comes first in the repodata, because createrepo_c
sorts the packages alphabetically).

For example this repo, in a Rawhide container:

# dnf -y install dummy-pkg --repofrompath=ordering,https://copr-be.cloud.fedoraproject.org/archive/issues/createrepo_c-pr-325/repo2/ --setopt ordering.gpgcheck=0
...
Installed:
  dummy-pkg-20221031_1057-1.fc38.x86_64 
...
# tail /var/log/dnf.librepo.log | grep dummy-pkg
https://copr-be.cloud.fedoraproject.org/archive/issues/createrepo_c-pr-325/repo2/a/dummy-pkg-20221031_1057-1.fc38.x86_64.rpm

The one starting with ../a/dummy-pkg... was installed. But when you look at RPM headers:

# for rpm in `find -name '*x86_64.rpm'`; do echo "== $rpm =="; rpm -qpi "$rpm" 2>/dev/null| grep Build\ Date; done
== ./a/dummy-pkg-20221031_1057-1.fc38.x86_64.rpm ==
Build Date  : Mon 31 Oct 2022 10:00:35 AM UTC
== ./b/dummy-pkg-20221031_1057-1.fc38.x86_64.rpm ==
Build Date  : Mon 31 Oct 2022 10:00:35 AM UTC
== ./c/dummy-pkg-20221031_1057-1.fc38.x86_64.rpm ==
Build Date  : Tue 01 Nov 2022 08:59:25 AM UTC
== ./d/dummy-pkg-20221031_1057-1.fc38.x86_64.rpm ==
Build Date  : Mon 31 Oct 2022 10:00:21 AM UTC

Then the newest one is ./c which was built today. Though I agree that what
you describe @dralley is the behavior I'd expect and prefer. @j-mracek is this a bug,
accident, or expected behavior? :-)

@praiskup
Copy link
Member Author

praiskup commented Nov 1, 2022

@kontura was faster! :-) Good to have this tested.

Anyway, for the keep-last option; should we do a two-level sort, by build time (which might still have collisions) and then by path name alphabetically?

@kontura
Copy link
Contributor

kontura commented Nov 1, 2022

Anyway, for the keep-last option; should we do a two-level sort, by build time (which might still have collisions) and then by path name alphabetically?

That sounds reasonable to me. 🙂

@praiskup
Copy link
Member Author

praiskup commented Nov 1, 2022

Ok, please take another look.

@dralley
Copy link
Contributor

dralley commented Nov 1, 2022

Are we sure libsolv has any deliberate strategy here, as opposed to some kind of parsing bug where various package entries attempt to overwrite each other?

Is there an easy way that we can ask libsolv "load this repo and then dump the solvables" to compare the input and what libsolv sees at the pool level (pre- any kind of actual dependency solving / choosing logic)?

e.g. If some package was present twice in the metadata by being present in multiple locations, this comment makes me suspicious that you would only see one package with potentially duplicate filelists and changelog metadata.

@kontura
Copy link
Contributor

kontura commented Nov 2, 2022

Are we sure libsolv has any deliberate strategy here, as opposed to some kind of parsing bug where various package entries attempt to overwrite each other?

Oh I meant during solving. Libsolv does parse all of the packages and each has a unique id (even duplicates). If you do install foo libdnf basically finds all ids that match name foo and sends them to libsolv. Here the solver has to pick something, I would guess that from duplicate packages it takes the lowest id (that is the first duplicate in metadata).

Is there an easy way that we can ask libsolv "load this repo and then dump the solvables" to compare the input and what libsolv sees at the pool level (pre- any kind of actual dependency solving / choosing logic)?

There is exactly such a thing. Libsolv has these tools (rpm libsolv-tools-0.7.22-3.fc37.x86_64). You can do:

$ repo2solv ./path/to/repo | dumpsolv

@dralley
Copy link
Contributor

dralley commented Nov 2, 2022

Oh I meant during solving. Libsolv does parse all of the packages and each has a unique id (even duplicates).

I think that would work while parsing primary.xml since you can create new solvables, I'm just curious what happens if you encounter duplicate pkgid while parsing other.xml and filelists.xml. The files and changelogs have to be associated back to solvables, and all it has to go off of is NEVRA and pkgid, so how would it pick which solvable the files and changelogs get associated with? If they all get dumped onto one solvable, then you'd have duplicate solvables but with broken file deps and changelogs on some or all of them.

Basically similar to #306

However I've just done some testing and nothing stood out as being obviously wrong, so maybe it has some magic up its sleeve.

src/dumper_thread.c Outdated Show resolved Hide resolved
src/createrepo_c.c Outdated Show resolved Hide resolved
src/createrepo_c.c Outdated Show resolved Hide resolved
src/createrepo_c.c Outdated Show resolved Hide resolved
src/cmd_parser.c Outdated
@@ -213,6 +213,11 @@ static GOptionEntry cmd_entries[] =
"Read the list of packages from old metadata directory and re-use it. This "
"option is only useful with --update (complements --pkglist and friends).",
NULL },
{ "delayed-dump", 0, G_OPTION_FLAG_HIDDEN, G_OPTION_ARG_NONE,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any benefit to having this as a hidden option? Maybe for debugging? But personally I wouldn't add it here at all.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I used it several times for debugging, yes. There shouldn't be cons in having the option, so I'd prefer to keeping it? Just please re-consider, and I'll drop if you really want.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have discussed this in our team and such a hidden option could cause more trouble than its worth.

Someone could find it and use it to perhaps workaround some other issue or use it in some other hard to imagine way. On top of that we would still have to maintain it.

I really would prefer if we didn't add it.
For debugging the mode can be turned on in the code.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ACK, dropped the command line option then, thanks for checking. I kept the "internal" delayed_dump option just to not overcomplicate the commit changes if that is OK?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great, thanks!

@kontura
Copy link
Contributor

kontura commented Nov 3, 2022

Basically similar to #306

However I've just done some testing and nothing stood out as being obviously wrong, so maybe it has some magic up its sleeve.

I was interested in how it is done so I spent some time digging into the libsolv parser and there is no magic. It has the same bug as you mentioned.
Its just that I didn't realize that repo2solv by default uses just primary. Testing it with options to use full filelists and changelogs:

repo2solv . -F -C | dumpsolv

It does add all the duplicate entries to the first id, duplicating them.

I am thinking what does this mean for dnf or other tools. dnf would show duplicated changelogs but the filelists probably just add some overhead?

Revamped 'pkg_from_md' that was unused for some time, but renamed to
'clean'.
The 'task_cleanup' target was reached without free-ing the 'pkg'
structure in one of the error cases before.

The 'res' was allocated on two places.
Will be needed later for the "skip" mechanism.
@praiskup
Copy link
Member Author

praiskup commented Nov 3, 2022

PR updated.

Normally we dump (to XML and DB) directly from the pool of threading
workers, in a streaming fashion (as soon as the metadata about packages
are determined it is written).

The new option causes that the metadata are loaded first (still in
parallel), and then written at one time (by one thread for now).
Since all the metadata (cr_Package) is available in the memory in one
time, we can do some additional processing and filtering (before we
eventually perform the dump).

This is an optional behavior because the delayed XML chunk preparation
slows down the overall multi-threaded processing time.
Commit 8fb99cd added a detection
mechanism for duplicated NEVRA packages and an useful warning.

The new option '--duplicated-nevra keep' keeps the default behavior (if
specified) — all the duplicate packages are dumped into metadata.

When '--duplicated-nevra keep-last' argument is used, only the last
NEVRA (ordered by build time, then by its ocation path) is dumped to
the metadata and the former duplicates are skipped.

Relates: https://pagure.io/copr/copr/issue/840
Fixes: 325
@kontura kontura merged commit e656df3 into rpm-software-management:master Nov 3, 2022
@dralley
Copy link
Contributor

dralley commented Nov 5, 2022

@kontura So Pulp has some logic to attempt to pick the package that DNF would pick when syncing a repo that happens to contain duplicate NEVRA (it skips over the other duplicates), and currently it uses build time as the differentiator because that's what we thought it does.

Should we change that to instead just pick the first one?

@kontura
Copy link
Contributor

kontura commented Nov 7, 2022

I have tried syncing with DNF reposync and that keeps all the duplicates.
Just to make sure I understand correctly you mean when Pulp is doing the syncing right?

If they have duplicate NEVRA and pkgid (checksum) then libsolv picks the first one.

However for just duplicate NEVRA DNF works with all of them. For example when you try to install something both of them provide then yes it (libsolv) picks the first one. But if you do install /path/to/file and the file is only present in the second one - it uses the second one.
I think that in general libsolv will use any package of the duplicates that satisfies the dependency resolution and since the packages are different it could be any of them.

Based on that if you want to match what DNF does you need all of them.

@dralley
Copy link
Contributor

dralley commented Nov 7, 2022

Just to make sure I understand correctly you mean when Pulp is doing the syncing right?

Yeah. The trouble is that Pulp encoded some optimistic assumptions early on that NEVRA would be unique, and it led to bugs when that wasn't the case, but we can't reject those repos because there's too many of them and it would be a pain in the ass, so instead we're just trying to pick the one that DNF / Yum / zypper would prefer.

In theory the dependencies ought to be the same between the packages unless the packager has been doing grievously stupid things anyway.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants