New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

egg_info command is very slow if there are lots of non source files in directory #450

Closed
bb-migration opened this Issue Oct 16, 2015 · 11 comments

Comments

Projects
None yet
1 participant
@bb-migration

bb-migration commented Oct 16, 2015

Originally reported by: spookylukey (Bitbucket: spookylukey, GitHub: spookylukey)


The approach taken by egg_info is to list all the files in the source directory, recursively, and then apply rules. (via manifest_maker and FileList.findall)

If you have a lot of files, this can take a really long time. It combines especially badly with tools like 'tox', which put lots of virtualenvs in the current dir, putting many thousands of files there.

On one small project I maintain, running tox from a cold start has a 5 minute start up time (because it runs setup.py, which runs egg_info or something similar), which is a real pain when the tests themselves run far more quickly.

Would you consider a patch to improve this? Attempting a full fix is very difficult, due to the way that MANIFEST rules are essentially an imperative language in which order matters, as files are added and removed.

One strategy that might work would be to attempt to optimise very few MANIFEST commands - for example, only 'prune' would be considered. That is, a custom FileList.find_all command would understand the 'prune' command, and avoid recursing into that directory, but only if the prune command comes after any commands that would add files. From my understanding of how MANIFEST is processed https://docs.python.org/2/distutils/sourcedist.html, I believe this would preserve correctness.

This would not be much of an optimisation out-of-the box, but it would allow you to add "prune .tox" and "prune .git" etc. at the end of your MANIFEST.in to get big speedups.

Would this approach be accepted if I worked on it?


@bb-migration

This comment has been minimized.

bb-migration commented Oct 17, 2015

Original comment by jaraco (Bitbucket: jaraco, GitHub: jaraco):


Thanks for the thorough and thoughtful writeup.

I can think of one possible complication that I'd like to be sure we address.

The issue of file finding is already addressed in another way in the sdist command with the 'file_finders' entry point, which allows for the re-use of a source code repository metadata for locating relevant files for distribution. I wonder if that mechanism could be re-used, as these painful directories are already managed by .ignore files in most projects' SCM tool. Perhaps that functionality could be re-used.

I generally consider the MANIFEST.in technique to be deprecated in favor of file finders, because a MANIFEST.in file has to be maintained redundantly and can get out of sync with the SCM metadata. Regardless of one's preference for managing project-relevant sources, there should be a clear way to address the issue you raise, and I think it could be confusing if not a conflict to rely on MANIFEST.in for optimizing findall but not for managing the MANIFEST.

Reviewing the code a bit more, I do see a relationship here, though the order and naming seems awkward to me.

    def run(self):
        self.filelist = FileList()
        if not os.path.exists(self.manifest):
            self.write_manifest()  # it must exist so it'll get in the list
        self.filelist.findall()
        self.add_defaults()
        if os.path.exists(self.template):
            self.read_template()
        self.prune_file_list()
        self.filelist.sort()
        self.filelist.remove_duplicates()
        self.write_manifest()

findall runs before add_defaults and it's in add_defaults that the SCM is queried.

In any case, I'm open to ways to improve it, and will happily accept pull requests if they can be made viable for release and especially if they include tests to enhance the reliability and expectation.

@bb-migration

This comment has been minimized.

bb-migration commented Oct 18, 2015

Original comment by spookylukey (Bitbucket: spookylukey, GitHub: spookylukey):


Thanks for your reply, I'll look into this more. I wasn't aware of file finders at all.

@bb-migration

This comment has been minimized.

bb-migration commented Oct 21, 2015

Original comment by spookylukey (Bitbucket: spookylukey, GitHub: spookylukey):


See pull request #151 for initial steps.

After that, I'm getting a bit stuck. In particular, these lines 352-354 of egg_info.py:

https://bitbucket.org/pypa/setuptools/src/ab1b60d1c77966140b6ed0df78868c26c16d22b4/setuptools/command/egg_info.py?at=default&fileviewer=file-view-default#egg_info.py-352

These currently have no test coverage. They are clearly important - if you break them, then 'tox' fails to run, with a 'pip install' command failing. But I'm finding it hard working out how to exercise those paths in the context of a test, and getting rather lost. When do you have an external 'egg-base'?

The reason I care is that I want to fix this chunk of code - especially because it actually uses a FileList internal i.e. allfiles, which is interfering with my attempts to optimise.

Any help would be appreciated.

@bb-migration

This comment has been minimized.

bb-migration commented Oct 26, 2015

Original comment by avdd (Bitbucket: avdd, GitHub: avdd):


I generally like to keep cruft outside my source tree and when a configuration setting is unavailable I use a symlink. Such is the case with node and node_modules which can get very large.

So one potential optimisation to start with is to not traverse symlinks that point outside the source tree.

@bb-migration

This comment has been minimized.

bb-migration commented Oct 26, 2015

Original comment by spookylukey (Bitbucket: spookylukey, GitHub: spookylukey):


@avdd I'm not aware that there is any rule against setup.py finding sources files via symlinks, and I can think of valid reasons for having something symlinked into the source directory which needs to be included when running sdist. We can only optimise away things that are guaranteed to be excluded anyway.

@bb-migration

This comment has been minimized.

bb-migration commented Nov 16, 2015

Original comment by reinout (Bitbucket: reinout, GitHub: reinout):


See also #249, that's the same issue.

@bb-migration

This comment has been minimized.

bb-migration commented Nov 16, 2015

Original comment by spookylukey (Bitbucket: spookylukey, GitHub: spookylukey):


Hi Jason - any chance you could reply to my comment on 2015-10-21 above?

@bb-migration

This comment has been minimized.

bb-migration commented Nov 18, 2015

Original comment by jaraco (Bitbucket: jaraco, GitHub: jaraco):


Hi Luke. Sorry for the delay. I just haven't had the time to give this issue the attention it deserves. I'll put in on my short list of things to review sooner than later.

@bb-migration

This comment has been minimized.

bb-migration commented Nov 21, 2015

Original comment by jaraco (Bitbucket: jaraco, GitHub: jaraco):


When do you have an external 'egg-base'?

I do see in the pip source one use of egg-base. In this case, it seems pip is generating the package metadata (egg-info) in a distinct folder to avoid its presence creating an implicit installation in ..

I also see here in egg_info that egg_base defaults to the value of the empty package_dir. So if you use the example in distutils where package_dir={'': 'lib'}, that's one case where egg_base will be 'lib'.

@bb-migration

This comment has been minimized.

bb-migration commented Mar 17, 2016

Original comment by reinout (Bitbucket: reinout, GitHub: reinout):


As a temporary measure I've started monkeypatching os.walk in my setup.py files: http://reinout.vanrees.org/weblog/2016/03/17/setuptools-speed.html

I'm starting to think that a new, optional "directories_to_ignore_totally" kwarg to "setup()" might be the best solution...

@bb-migration

This comment has been minimized.

bb-migration commented Mar 18, 2016

Original comment by jaraco (Bitbucket: jaraco, GitHub: jaraco):


Duplicate of #249.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment