Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[wip] Sorted PKG-INFO file fields to make it reproducible #1305

Closed
wants to merge 1 commit into from

Conversation

@stevenengler
Copy link

stevenengler commented Mar 21, 2018

I have sorted the output to the PKG-INFO file in order to make the file reproducible. Without sorting, fields can be in different orders each time it is run. For example each time I build a certain package, I get changes such as:

Provides-​Extra: docs
Provides-​Extra: test

vs

Provides-​Extra: test
Provides-​Extra: docs

Another benefit is that the sorted output has a more consistent formatting. I'm not affiliated with them, but reproducible-builds.org explains why reproducible builds are important.

@pganssle

This comment has been minimized.

Copy link
Member

pganssle commented Mar 21, 2018

This is only a problem on Python 2 and Python <3.5, right? For Python 3.6+ anything that came from a dictionary should be sorted by insertion order already, and thus entirely reproducible, right?

If you need backwards compatibility, presumably you can pass an OrderedDict into whatever generates these, and it will always give the same answer.

Not that I think sorting the outputs is a serious problem, but it's worth noting that the trade-off is that we make it so that anyone who is already using ordered containers for these things will now have the order changed from whatever order they specified to sorted order.

@@ -74,7 +74,7 @@ def write_pkg_file(self, file):
('Maintainer-email', 'maintainer_email'),
)

for field, attr in optional_fields:
for field, attr in sorted(optional_fields):

This comment has been minimized.

Copy link
@pganssle

pganssle Mar 21, 2018

Member

This already has a fixed order, there's no need to sort it.

file.write('Platform: %s\n' % platform)
else:
self._write_list(file, 'Platform', self.get_platforms())
self._write_list(file, 'Platform', sorted(self.get_platforms()))

This comment has been minimized.

Copy link
@pganssle

pganssle Mar 21, 2018

Member

I think all of these _write_list calls take a list as input, so its order will be reproducible.

@pganssle

This comment has been minimized.

Copy link
Member

pganssle commented Mar 21, 2018

@stevenengler Do you have an example of a setup.py that is generating these PKG-INFO files with changing order? I think that would only happen if you passed a set or something to provides_extra. If you just switch to a list it should have a consistent order (unless you re-ordered it yourself).

@stevenengler

This comment has been minimized.

Copy link
Author

stevenengler commented Mar 21, 2018

@pganssle Thanks for all your comments. I may have gone overboard with the sorted calls, my goal was to make them explicit to developers, but you're right that they're not really needed for lists (assuming they're inserted into the lists in a repeatable fashion). Plus I didn't consider that some developers might expect them to be in some specific unsorted order.

So one example I've seen this for is pyopenssl, which has:

extras_require={
    "test": [
        "flaky",
        "pretend",
        "pytest>=3.0.1",
    ],
    "docs": [
        "sphinx",
        "sphinx_rtd_theme",
    ]
},

Then I think these extras_require items get added to provides_extras here (dist.py#L416).

I have been compiling with Python 3.6, which I would think should have these in a fixed order, but that doesn't seem to be happening. Plus it would be nice if it worked on older versions of Python as well.

What do you think if we sort the order here (dist.py#L410) instead? Then we can remove all the other sorted calls except for maybe the project_urls.items() which may be unordered.

@pganssle

This comment has been minimized.

Copy link
Member

pganssle commented Mar 21, 2018

@stefanholek From my perspective it doesn't really matter much if it gets sorted early or late if it gets sorted somewhere along the pipeline.

Sorting it early does allow users to screw with the order after it's been read from setup but before it gets written to PKG-INFO, but I think the main question here is whether there's any valid use case for respecting the order in which these things are written.

If there's some valid use case, then I think no sorting should happen at all, and if you want reproducible builds you either set PYTHONHASHSEED to some known value as part of the build or use ordered containers for these things (like OrderedDict pre-3.6 or just regular dict in Python 3.6+). If there's no valid use case then we might as well do the latest possible sort in case something changes later on that un-sorts things.

Either way, if this PR is going to be accepted it should probably have some tests.

@jaraco @benoit-pierre Any thoughts on whether there's a valid reason not to sort PKG-INFO metadata?

@stevenengler

This comment has been minimized.

Copy link
Author

stevenengler commented Mar 21, 2018

@pganssle Makes sense to me. The last thing I want to do is break a common workflow with setuptools. If you think it's worthwhile to sort anything, or maybe just sort dictionaries, I'll be happy to make the changes and write some tests. If not, that's okay and I think setting PYTHONHASHSEED to a constant value when building is a reasonable approach (if you can think of any downsides please let me know), or trying to convince package maintainers to use OrderedDict in their setup.py.

@jaraco

This comment has been minimized.

Copy link
Member

jaraco commented May 18, 2018

I'd rather not sort items unnecessarily. Wherever possible, let's retain the order indicated by the packager. Reproducible builds are worthwhile, so let's sort but only where necessary.

I have been compiling with Python 3.6, which I would think should have these in a fixed order, but that doesn't seem to be happening.

That's surprising and unexpected. I'd love to see an example where that's not the case.

Please take another stab at it. Start with tests that illustrate the failure modes (even if intermittent). Once we have that, then we can be more confident that the changes are correcting the failure and not over-correcting.

@jaraco jaraco changed the title Sorted PKG-INFO file fields to make it reproducible [wip] Sorted PKG-INFO file fields to make it reproducible May 18, 2018
@jaraco jaraco added the draft label May 18, 2018
@stevenengler

This comment has been minimized.

Copy link
Author

stevenengler commented May 18, 2018

@jaraco I'm away for the next two weeks, but I'll look at this again once I'm back. I had figured out the reason for the random sort order on Python 3.6, but I've since forgotten it (I think it had to do with the data being converted between types and the order being lost during the conversion).

@pganssle

This comment has been minimized.

Copy link
Member

pganssle commented May 19, 2018

I'm actually still kinda on the side of sorting everything, unless we can come up with a reason not to do so. Unfortunately, it seems like the only way to figure out whether or not people are relying on the current ordering behavior is to just break it and see if it causes any problems.

@kushaldas kushaldas mentioned this pull request Nov 16, 2018
0 of 8 tasks complete
@yan12125

This comment has been minimized.

Copy link

yan12125 commented Jan 5, 2019

Not sure about other fields, but at least provides_extras requires extra handling. In setuptools it's a set, and apparently a set is still not deterministic in Python 3.7.2.

A failure log: https://tests.reproducible-builds.org/archlinux/community/buildbot/buildbot-1.7.0-1-any.pkg.tar.xz.html (buildbot package for Arch Linux)

I write a tiny script to mimc what setuptools does when building this package:

extras_require = {
    'test': [],
    'bundle': [],
    'tls': [],
    'docs': [],
}

s = set()
for k in extras_require.keys():
    s.add(k)

for k in s:
    print(k)

Two sample runs of this script:

$ python t.py
tls
docs
test
bundle

$ python t.py
tls
test
docs
bundle

Actualy I'm not sure why provides_extras is a set. AFAIK dict keys are unique, so a list is fine, right?

@anthraxx

This comment has been minimized.

Copy link

anthraxx commented Feb 16, 2019

Can we get this issue sorted out? It persists for quite a while and sorting isn't really an intrusive operation considering all the other regular code-path and computation done during execution.
This still blocks reproducible builds and is kind of a trivia issue to solve.

@pganssle

This comment has been minimized.

Copy link
Member

pganssle commented Feb 17, 2019

@anthraxx I think for the people who care about reproducible builds, this is not really an issue. Setting HASHSEED to a known value while doing the build should be enough to make at least this part of the build reproducible. I'm not sure what the story is with the other blockers, though.

@anthraxx

This comment has been minimized.

Copy link

anthraxx commented Feb 17, 2019

@pganssle

This comment has been minimized.

Copy link
Member

pganssle commented Feb 17, 2019

@anthraxx I am not saying that we cannot change this, just explaining that this is less critical than it may seem because it's just a default setting that is simple to work around, and from my understanding is not even the most serious problem with reproducible builds in Python.

The best thing to do if you want to get this moving would be to make a PR that ensures that setuptools is not taking ordered containers and turning them into unordered containers anywhere along the way, and also ensuring that unordered containers are converted to ordered containers in a reproducible way. This probably also involves adding a bunch of tests, because this is the sort of thing that is super easy to get regressions for.

Alternatively, you could provide some sort of compelling evidence that no one is relying on the ordering behavior of their ordered containers, in which case we can just sort everything when producing PKG-INFO.

Obviously the second option would be preferable from a maintenance point of view, but it does unfortunately require you to prove a negative. If you have the ability to play around with the Arch package repository's python2.7 packages, there are probably some pretty strong signals you can get by testing their builds against a version of setuptools that simply sorts outputs.

jaraco added a commit that referenced this pull request Feb 17, 2019
@jaraco

This comment has been minimized.

Copy link
Member

jaraco commented Feb 17, 2019

Thanks for reigniting the interest in this change.

I'm fairly confident that the order of requirements is meaningful, as it affects the order that dependencies are resolved... and some packages rely on that order to avoid certain bugs or to enforce certain overrides. I'm unsure if the test suite captures that expectation or if the changes proposed herein would affect that expectation, but it's one example of how making a change like this should be selective and surgical and itself include tests that capture the expectation. If the code doesn't have tests capturing the expectation, the code can be subsequently changed back with a similar pull request.

Correct me if I missed something, but the only use-case reported herein about non-deterministic ordering of PKG-INFO is the Provides-Extra ordering. I suggest we create a more selective PR that adds a test capturing the expectation. I've done just that in #1690.

@stevenengler

This comment has been minimized.

Copy link
Author

stevenengler commented Feb 17, 2019

It's been a long time and I'm not working in this area anymore, but a year ago I built the ~3000 most popular conda-forge packages and I believe there were more than just Provides-Extra items that were out of order. It's probably a good start to fix those, but you'll have to fix it piece-by-piece if you want to make everything reproducible. As for making Python packages reproducible in general, everything's not quite there yet, but it wouldn't be difficult. I was able to modify only the conda-build tool to make almost all Python packages on conda-forge reproducible.

@pganssle

This comment has been minimized.

Copy link
Member

pganssle commented Feb 17, 2019

@stevenengler Was that in Python 3.6? I think there are definitely other places in setuptools that use dict, which is ordered in 3.6+, but not before.

@anthraxx

This comment has been minimized.

Copy link

anthraxx commented Feb 18, 2019

I went through ~100 non-reproducible python packages in Arch and i think i only found Provides-Extra being unreproducible in PKG-INFO.
To be more precise, I have also seen version being non reproducible if setup.cfg uses tag_date, but that's totally unrelated problem (I'm also already writing a fix and test cases).
There also seems to be a requires.txt that writes dependencies in different categories in an unreproducible order, so latter may be something similar as Provides-Extra is.

@lamby

This comment has been minimized.

Copy link

lamby commented Mar 7, 2019

See also #894215 in Debian.

@jaraco

This comment has been minimized.

Copy link
Member

jaraco commented Oct 7, 2019

I've merged #1690 and I'm hopeful that will make builds reproducible on modern Pythons (and even on Python 2.7 if PYTHONHASHSEED is not set for the build). As I result, I expect not to need this change, which is more invasive to the effect of the change. Will be happy to revisit if one can identify other areas where builds are not reproducible. Thanks for your patience on this issue.

@jaraco jaraco closed this Oct 7, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
6 participants
You can’t perform that action at this time.