New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Do not scan whole file tree when making MANIFEST #764

Merged
merged 1 commit into from Oct 15, 2016

Conversation

Projects
None yet
4 participants
@timheap
Contributor

timheap commented Aug 29, 2016

When building a MANIFEST from a MANIFEST.in, setuptools previously
scanned the whole directory tree in to a list, and then picked matching
files based on MANIFEST.in commands from this list.

Now, files are found using the glob library from Python 3.5. This only
explores directories that need to be scanned, resulting in a large speed
improvement for projects with large file trees. A modified glob
module has been included. It has been changed to support back to
Python 2.6, and to include .hidden files in its matches. The previous
functionality included .hidden files in its glob matches. It is
unclear if this behaviour is desired and required, or accidental and not
required, but for strict backwards-compatibility, this behaviour is
kept.

Each command in the MANIFEST.in is now represented by its own function
on the FileList (include, exclude, graft, etc.) to allow for an
efficient implementation. The previous commands
FileList.include_pattern and FileList.exclude_pattern still
exist for backwards compatibility, but these use the slow 'scan all
files' method, so are discouraged. global_include by its nature must
scan all directories in the project to work, so this does not receive
any speed improvements.

The changes will speed up creating packages for the vast majority of
users. There are a few unusual corner cases, such as multiple graft
commands operating on the same set of directories, that will be slower.
These can be solved by consolidating the overlapping graft commands in
to one command.

Note that this is a competing PR to #610. The approach in #610 made it hard to correctly implement graft.

@timheap timheap force-pushed the timheap:fast_egg_info_command branch 3 times, most recently from 7de66bb to 5c655a8 Aug 29, 2016

@timheap

This comment has been minimized.

Contributor

timheap commented Aug 29, 2016

These tests seem to be failing on CI for unrelated reasons. The tests fail even on master currently. I do not know why, and the stack traces are rather confusing.

@timheap

This comment has been minimized.

Contributor

timheap commented Aug 29, 2016

Oh, py.test v3 is still broken. Running the tests with py.test>=2.8,<3 works just fine.

@timheap timheap force-pushed the timheap:fast_egg_info_command branch from 5c655a8 to d568f4a Aug 29, 2016

@timheap

This comment has been minimized.

Contributor

timheap commented Aug 29, 2016

OK, test pass again. I've rebased this on top of #765, which excludes py.test>=3.0 completely.

@benoit-pierre

This comment has been minimized.

Member

benoit-pierre commented Sep 14, 2016

This patch is great, and fix my issue with following symlinks when building the manifest. I however get a test failure on my system:

self = <setuptools.tests.test_manifest.TestFileListTest object at 0x2b1c7b96d850>

    def test_include_pattern(self):
        # return False if no match
        file_list = FileList()
        self.make_files([])
        assert not file_list.include_pattern('*.py')

        # return True if files match
        file_list = FileList()
        self.make_files(['a.py', 'b.txt'])
        assert file_list.include_pattern('*.py')

        # test * matches all files
        file_list = FileList()
        self.make_files(['a.py', 'b.txt'])
        file_list.include_pattern('*')
>       assert file_list.files == ['a.py', 'b.txt']
E       assert ['b.txt', 'a.py'] == ['a.py', 'b.txt']
E         At index 0 diff: 'b.txt' != 'a.py'
E         Use -v to get the full diff

This is both with Python 2.7.12 and Python 3.5.2 on Arch Linux.

@timheap

This comment has been minimized.

Contributor

timheap commented Sep 15, 2016

Thanks for checking this.

The order of files in the file list appears to depend on the underlying file system. My system reported the files in the order they were created, but apparently yours does not. I've wrapped the file list in a set() for the comparison, as the order is not part of the test.

There is already a separate test that checks that FileList.sort() does the sensible thing.

@timheap

This comment has been minimized.

Contributor

timheap commented Sep 28, 2016

The code for this patch is complete, barring any feedback. Is there anything more I can do to help this patch along?

@jaraco

Overall, looks good. I would simply accept it except for the questions I raise above. Please have a look. Thanks.

return found
def include(self, pattern):
"""Include files that match 'pattera'n."""

This comment has been minimized.

@jaraco

jaraco Oct 14, 2016

Member

This comment looks like it has a typo.

This comment has been minimized.

@timheap

timheap Oct 14, 2016

Contributor

👍

@@ -239,7 +322,151 @@ def check_broken_egg_info(self):
class FileList(_FileList):
"""File list that accepts only existing, platform-independent paths"""
# Implementations of the various MANIFEST.in commands

This comment has been minimized.

@jaraco

jaraco Oct 14, 2016

Member

Previously, this class was a subclass of the implementation in distutils. It's not clear from this diff if there's still value in subclassing _FileList. Is this a complete re-implementation or does it also re-use some of the upstream implementation? Do you expect that these changes could be merged upstream at some point?

This comment has been minimized.

@timheap

timheap Oct 14, 2016

Contributor

I kept it for backwards compatibility. Packages or plugins may depend on calling FileList.include_pattern, the old way of adding packages, and I did not want to break these.

self._add_egg_info(cmd=ei_cmd)
self.filelist.include_pattern("*", prefix=ei_cmd.egg_info)
def _add_egg_info(self, cmd):

This comment has been minimized.

@jaraco

jaraco Oct 14, 2016

Member

Previously, cmd.egg_base was processed, but now it appears not to be. What new behavior obviates the need for this behavior?

This comment has been minimized.

@timheap

timheap Oct 14, 2016

Contributor

The behaviour is still required, but is now as simple as self.filelist.graft(ei_cmd.egg_info), assuming I understand what is going in _add_egg_info(). I've replaced that whole function with that one-liner.

This comment has been minimized.

@jaraco

jaraco Oct 14, 2016

Member

Okay. Great!

"""
Filename globbing utility. Mostly a copy of `glob` from Python 3.5.
Changes include:

This comment has been minimized.

@jaraco

jaraco Oct 14, 2016

Member

This file adds a big amount of technical debt to the project. Although I believe it will work, it's a fork of the functionality in Python 3.5 without a merge strategy. Moreover, it has already diverged from the upstream implementation. Thanks, by the way for describing the divergence (I worry other contributors might not be so diligent).

But I wonder - if you could rely on Python 3.5, would the implementation have been different, or would you have had to override all of this behavior not to ignore hidden files? For example, if you look at the recent setuptools.sdist.py36compat.py, you'll see functionality that's a backport of code in Python 3.7, which can be removed when Python 3.7 is the minimum supported version. Could you do something similar here, where the forward-compatibility implementation and the divergent behavior are maintained separately (even in separate files)?

This comment has been minimized.

@timheap

timheap Oct 14, 2016

Contributor

It would not be possible to ship the glob library from Py3.5 without any modifications, as yield from is not valid syntax in older Pythons, so at least that needs to change regardless.

To make the library compatible with Py2, I also had to use six for dealing with strings/bytes. These changes require modifications to internal function code, so are not easily patched from "outside". I notice that I did not mention this in the list of changes, I will add it now.

The same problem exists with including hidden files - is_hidden is called from a number of functions, so these functions need to be changed to not call it. Alternatively, is_hidden could be changed to always return False, which would have the same effect.

Finally, all the functions in the glob library call each other directly. There is no good way of doing dependency injection or something similar to override the specific functions that need patching.

I completely agree with you that this adds some undesirable technical debt. If you can think of any good solutions to the problems above I am very happy to implement them.

This comment has been minimized.

@timheap

timheap Oct 14, 2016

Contributor

It is also worth noting that a diff -U5 /usr/lib/python3.5/glob.py setuptools/glob.py shows a very nice summary of the changes.

@timheap timheap force-pushed the timheap:fast_egg_info_command branch 2 times, most recently from d31f12e to 44630f1 Oct 14, 2016

@timheap timheap force-pushed the timheap:fast_egg_info_command branch from 44630f1 to bb45468 Oct 15, 2016

@jaraco jaraco merged commit 566dc35 into pypa:master Oct 15, 2016

0 of 2 checks passed

continuous-integration/appveyor/pr AppVeyor build failed
Details
continuous-integration/travis-ci/pr The Travis CI build failed
Details
@timheap

This comment has been minimized.

Contributor

timheap commented Oct 15, 2016

Thanks for the merge!

timheap added a commit to timheap/setuptools that referenced this pull request Nov 20, 2016

Fix pypa#849 global-exclude globbing
After pypa#764, `global-exclude .pyc` no longer excluded `.pyc` files.
This fixes that regression, and adds a test for this behaviour.
@spookylukey

This comment has been minimized.

Contributor

spookylukey commented Mar 7, 2017

@timheap thanks for taking this on and succeeding where I ran out of steam! and thanks @jaraco for the merge. This has made a world of difference to my use case - tox-dev/tox#293 (comment)

reinout added a commit to nens/cookiecutter-djangosite-template that referenced this pull request Feb 9, 2018

Removed now-unneeded monkeypatch
Fixed in setuptools (quite some time ago) in pypa/setuptools#764
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment