egg_info command is very slow if there are lots of non source files in directory #450
Comments
Original comment by jaraco (Bitbucket: jaraco, GitHub: jaraco): Thanks for the thorough and thoughtful writeup. I can think of one possible complication that I'd like to be sure we address. The issue of file finding is already addressed in another way in the sdist command with the 'file_finders' entry point, which allows for the re-use of a source code repository metadata for locating relevant files for distribution. I wonder if that mechanism could be re-used, as these painful directories are already managed by .ignore files in most projects' SCM tool. Perhaps that functionality could be re-used. I generally consider the MANIFEST.in technique to be deprecated in favor of file finders, because a MANIFEST.in file has to be maintained redundantly and can get out of sync with the SCM metadata. Regardless of one's preference for managing project-relevant sources, there should be a clear way to address the issue you raise, and I think it could be confusing if not a conflict to rely on MANIFEST.in for optimizing Reviewing the code a bit more, I do see a relationship here, though the order and naming seems awkward to me.
In any case, I'm open to ways to improve it, and will happily accept pull requests if they can be made viable for release and especially if they include tests to enhance the reliability and expectation. |
Original comment by spookylukey (Bitbucket: spookylukey, GitHub: spookylukey): Thanks for your reply, I'll look into this more. I wasn't aware of file finders at all. |
Original comment by spookylukey (Bitbucket: spookylukey, GitHub: spookylukey): See pull request #151 for initial steps. After that, I'm getting a bit stuck. In particular, these lines 352-354 of egg_info.py: These currently have no test coverage. They are clearly important - if you break them, then 'tox' fails to run, with a 'pip install' command failing. But I'm finding it hard working out how to exercise those paths in the context of a test, and getting rather lost. When do you have an external 'egg-base'? The reason I care is that I want to fix this chunk of code - especially because it actually uses a Any help would be appreciated. |
Original comment by avdd (Bitbucket: avdd, GitHub: avdd): I generally like to keep cruft outside my source tree and when a configuration setting is unavailable I use a symlink. Such is the case with node and node_modules which can get very large. So one potential optimisation to start with is to not traverse symlinks that point outside the source tree. |
Original comment by spookylukey (Bitbucket: spookylukey, GitHub: spookylukey): @avdd I'm not aware that there is any rule against setup.py finding sources files via symlinks, and I can think of valid reasons for having something symlinked into the source directory which needs to be included when running sdist. We can only optimise away things that are guaranteed to be excluded anyway. |
Original comment by spookylukey (Bitbucket: spookylukey, GitHub: spookylukey): Hi Jason - any chance you could reply to my comment on 2015-10-21 above? |
Original comment by jaraco (Bitbucket: jaraco, GitHub: jaraco):
I do see in the pip source one use of egg-base. In this case, it seems pip is generating the package metadata (egg-info) in a distinct folder to avoid its presence creating an implicit installation in I also see here in egg_info that egg_base defaults to the value of the empty package_dir. So if you use the example in distutils where |
Original comment by reinout (Bitbucket: reinout, GitHub: reinout): As a temporary measure I've started monkeypatching os.walk in my setup.py files: http://reinout.vanrees.org/weblog/2016/03/17/setuptools-speed.html I'm starting to think that a new, optional "directories_to_ignore_totally" kwarg to "setup()" might be the best solution... |
Originally reported by: spookylukey (Bitbucket: spookylukey, GitHub: spookylukey)
The approach taken by egg_info is to list all the files in the source directory, recursively, and then apply rules. (via
manifest_maker
andFileList.findall
)If you have a lot of files, this can take a really long time. It combines especially badly with tools like 'tox', which put lots of virtualenvs in the current dir, putting many thousands of files there.
On one small project I maintain, running tox from a cold start has a 5 minute start up time (because it runs setup.py, which runs egg_info or something similar), which is a real pain when the tests themselves run far more quickly.
Would you consider a patch to improve this? Attempting a full fix is very difficult, due to the way that MANIFEST rules are essentially an imperative language in which order matters, as files are added and removed.
One strategy that might work would be to attempt to optimise very few MANIFEST commands - for example, only 'prune' would be considered. That is, a custom
FileList.find_all
command would understand the 'prune' command, and avoid recursing into that directory, but only if the prune command comes after any commands that would add files. From my understanding of how MANIFEST is processed https://docs.python.org/2/distutils/sourcedist.html, I believe this would preserve correctness.This would not be much of an optimisation out-of-the box, but it would allow you to add "prune .tox" and "prune .git" etc. at the end of your MANIFEST.in to get big speedups.
Would this approach be accepted if I worked on it?
The text was updated successfully, but these errors were encountered: