New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bpo-35936: Updates to modulefinder #11787

Open
wants to merge 36 commits into
base: master
from

Conversation

Projects
None yet
8 participants
@brandtbucher
Copy link

brandtbucher commented Feb 8, 2019

This patch includes a few useful fixes to Lib/modulefinder.py. Namely:

  • It doesn't crash if it encounters a syntax error (bpo-17396).
  • It doesn't report certain name collisions as bad (bpo-35376).
  • It doesn't use mutable default arguments in its initializer.
  • Most importantly, it now uses importlib instead of imp, which is deprecated (bpo-25160, bpo-20020) As a benefit, frozen built-in modules (zipimport, for example) are now correctly reported.

With the exception of these bug fixes, I've been very careful to preserve the original behavior, and have not changed any part of the API. This patch also includes 2 new regression tests.

https://bugs.python.org/issue35936

brandtbucher added some commits Feb 7, 2019

Properly handle SyntaxErrors in Python source files.
SyntaxErrors in the target module will rise normally, while SyntaxErrors in dependencies will be added to badmodules. This includes a new regression test.
Fix name collision bug.
This fixes an issue where a "fromlist" import with the same name as a previously failed import would be incorrectly added to badmodules. This includes a new regression test.
Replace mutable default values.
Bound empty lists have been replaced with the "if param is None" idiom.
Replace deprecated imp usage.
Constants imported from imp have been moved to private module-level constants, and ModuleFinder.find_module has been refactored to use importlib. Other than an improvement on how frozen builtin imports are reported (as the frozen imports they are, rather than the stdlib modules they *may* have originated from), these changes maintain complete compatibility with past versions... including odd behavior for returning relative (below current directory, but not a C extension) vs. absolute (above current directory, or a C extension) paths.

brandtbucher and others added some commits Feb 8, 2019

Fixed whitespace.
Whoops.
Handle differing drives on Windows.
Specifically, when checking whether the returned path should be relative or absolute.
Fix relative imports.
Hopefully, this will help the failing tests on Windows.
@ronaldoussoren

This comment has been minimized.

Copy link
Contributor

ronaldoussoren commented Feb 8, 2019

This changeset makes a substantial change to the semantics of modulefinder: using importlib.util.find_spec can change the import state of the script. In particular find_spec("package.module") results in the importing of package (and adds it to sys.modules).

Furthermore using importlib makes it close to impossible to use an alternate search path (the "path" argument for the constructor) due to the API of importlib. There are workarounds for this, but that probably requires replacing the entire import state (not just sys.path as this pull request does) around calls to importlib functions.

P.S. For modulegraph2, which is a rewrite of modulegraph which itself is a library that uses code heavily inspired by modulefinder I choosen to take away the possibility to use an alternate search path.

@brandtbucher

This comment has been minimized.

Copy link
Author

brandtbucher commented Feb 8, 2019

@ronaldoussoren I see, thanks for the input. As a concrete example, we can see import side-effects of the parent module when running $ python -c 'import importlib; importlib.util.find_spec("this.that")'. I'll dig into workarounds a bit more today. If nothing seems viable, as you mentioned, I'll probably revert those changes and leave the other fixes for merging.

It may be worth exploring a ground-up redesign, since the documented API is so small. I need a side-project, so I'd be willing to work on that if others feel it would be valuable.

brandtbucher added some commits Feb 9, 2019

Fix import side-effects.
importlib.util.find_spec(fullname, parent), where fullname contains a dot, results in the parent module being imported. This fix ensures that children are correctly located while also avoiding import side-effects by temporarily clearing sys.modules and modifying sys.path. Now, neither a dotted name nor parent are required.
Wait for sys.path to update.
I have a hunch the failing tests are due to a race condition in how sys.path updates. Sleep a bit here to see.
Bump sleep time.
More test are passing, which supports my theory.
WIP: Replace sleep with busy wait.
*crosses fingers*
Remove busy wait.
I guess that wasn't the issue...
@brandtbucher

This comment has been minimized.

Copy link
Author

brandtbucher commented Feb 9, 2019

I’ve tweaked the implementation to not need a dotted fullname, avoiding the issues mentioned above. I am still getting (what appear to be) random test failures, I suspect somehow stemming from my modifications to sys.path. This weird behavior is even referenced in an old comment in test_modulefinder.py. I’m not able to reproduce the failures on my own machine.

Unless someone sees a clear fix for this behavior, I think I’ll just revert the new importlib bit of this PR. Bummer.

brandtbucher and others added some commits Feb 9, 2019

Remove NEWS entry.
This will be replaced with several more specific entries detailing the various bug fixes.
Mutate sys.path in-place.
Perhaps there is some underlying reference issue with straight-up sys.path replacement. Once again, fingers crossed!
Enter RLock during import search.
Perhaps test threads are stomping on each others' sys.path?
Try multiprocessing.RLock instead of threading.
These are processes, after all...
Try a couple of possible fixes for failing lookups.
This is a longshot, but it's pretty much all I have left in terms of ideas.
Remove sys._clear_type_cache calls.
Is it a refleak issue?
Break out new code into standalone function.
I've also added helpful documentation outlining the reasoning behind various design decisions.

brandtbucher added some commits Feb 10, 2019

Don't modify import state in-place.
Changing references seems like a safer option and doesn't require extra copy operations.
Reorder new import.
Minor. Keep our imports in ASCIIbetical order.
@brandtbucher

This comment has been minimized.

Copy link
Author

brandtbucher commented Feb 10, 2019

Alright, I've gotten all of the tests passing for the new importlib implementation.

It turns out that the cached importers in sys.path_importer_cache were causing trouble, especially since so many of our tests use the same module names. My workaround is to copy and reset the entire import state (sys.path, sys.modules, and sys.path_importer_cache) before each search, and swap the old values back in after. A nice benefit of this approach is that path modifications work properly, so there is no longer a risk of accidentally importing a parent package when searching for a child module.

I broke these modifications out into a new private function, _find_module, to make it clear that this fix is a simple drop-in replacement for imp.find_module.

Let me know if there's anything I've missed or could improve. Otherwise, I feel my work here is done!

@vstinner vstinner requested a review from python/import-team Feb 11, 2019

@vstinner

This comment has been minimized.

Copy link
Member

vstinner commented Feb 11, 2019

@brettcannon @ncoghlan @warsaw: Would you mind to review this PR?

Minor style stuff.
Remove a comment, change a docstring, and add an ACK.
@pablogsal

This comment has been minimized.

Copy link
Member

pablogsal commented Feb 13, 2019

Why there are 3 NEWS entries for this PR?

@brandtbucher

This comment has been minimized.

Copy link
Author

brandtbucher commented Feb 13, 2019

@pablogsal see @pganssle’s resolved comment above. I originally only had one, but they noted that this PR might benefit from more since it covers a few related (but separate) issues.

@vstinner

This comment has been minimized.

Copy link
Member

vstinner commented Feb 13, 2019

As @pablogsal, I would also prefer a single NEWS entry ;-)

@pganssle

This comment has been minimized.

Copy link
Contributor

pganssle commented Feb 13, 2019

The original NEWS entry made reference to "several long standing bugs", I assumed they had their own bpo entries.

That said if the PR fixes three things it's not uncommon in other projects to use three separate news entries. I haven't looked closely at the code of the PR, but the text of the NEWS entries seems to indicate this is fixing three mostly unrelated things.

Show resolved Hide resolved Lib/modulefinder.py Outdated
@ncoghlan

This comment has been minimized.

Copy link
Contributor

ncoghlan commented Feb 13, 2019

As @pganssle notes - it makes sense to have multiple NEWS entries, since the issue and PR are a "roll-up" PR that allows a few different changes to be reviewed and tested together.

However, the news entries should be tweaked so their file names reference the more specific bug reports, rather than the summary issue.

brandtbucher and others added some commits Feb 13, 2019

Replace importlib.util.find_spec with importlib.machinery.PathFinder.…
…find_spec.

This swap avoids behavior changes and concurrency issues. The reverted changes will be proposed and reviewed separately. Thanks @ncoghlan!
Invalidate caches before find_spec call.
It appears that we're still hitting some caches. This *should* help...
Fix whitespace.
Whoops.
Remove duplicate-bpo NEWS entries.
These will be replaced with new ones properly referencing their respective bpo issue numbers.
Empty commit.
This is to force a rebuild on the PR.

@brettcannon brettcannon changed the title bpo-35936: Updates to modulefinder. bpo-35936: Updates to modulefinder Feb 13, 2019

@brandtbucher

This comment has been minimized.

Copy link
Author

brandtbucher commented Feb 14, 2019

I've gone ahead with @ncoghlan's suggestion and updated two of the NEWS entries with filenames that reference their corresponding BPO issue numbers.

I've also updated the _find_module implementation to use importlib.machinery.PathFinder.find_spec. With that said, I feel that the existing scope of this PR justifies replacing this with my suggested threading.Lock/sys/importlib.util.finc_spec fix, rather than breaking out that change into a separate issue.

Any other thoughts?

@ncoghlan
Copy link
Contributor

ncoghlan left a comment

One minor release docs comment, but otherwise this version looks good to me. However, I'd really like @ronaldoussoren to take another look before we merge it.

Show resolved Hide resolved Lib/modulefinder.py

brandtbucher and others added some commits Feb 16, 2019

Remove NEWS entry.
This is going to have info about the mutable defaults added.

importlib.machinery.PathFinder.invalidate_caches()

spec = importlib.machinery.PathFinder.find_spec(name, path+sys.path)

This comment has been minimized.

@ronaldoussoren

ronaldoussoren Feb 17, 2019

Contributor

(path+sys.path) changes the semantics that existed before this patch (_find_module might find modules on sys.path that aren't found on path).

Why is it necessary to include sys.path on the search path here?

def _find_module(name, path=None):
"""An importlib reimplementation of imp.find_module (for our purposes)."""

importlib.machinery.PathFinder.invalidate_caches()

This comment has been minimized.

@ronaldoussoren

ronaldoussoren Feb 17, 2019

Contributor

I'd add a comment here that explains why clearing the cache is necessary, the need is not immediately obvious.

if spec is None:
raise ImportError("No module named {name!r}".format(name=name), name=name)

if isinstance(spec.loader, type): # Some special cases:

This comment has been minimized.

@ronaldoussoren

ronaldoussoren Feb 17, 2019

Contributor

Not sure if using issubclass necessary, just testing for these two special cases is good enough:

if spec.loader is import lib.machinery.BuiltinImporter:
  ...

Another special case is the loader for implicit namespace packages (which is currently an instance of a private class, see bpo-35673). Those are probably not handled at all at the moment by this module.


# If the path is a descendant of the CWD, it is reported as a relative filepath
# ...unless it is an extension, in which case it will be absolute (see below)
# ...otherwise, it is fully resolved.

This comment has been minimized.

@ronaldoussoren

ronaldoussoren Feb 17, 2019

Contributor

I wonder if this behaviour should be emulated, although there's bound to be someone that relies on the current behaviour.

pass

base, suffix = os.path.splitext(file_path)

This comment has been minimized.

@ronaldoussoren

ronaldoussoren Feb 17, 2019

Contributor

I'm not sure about the code that follows. The kind can be deduced from the type of loader (at least for loaders in the stdlib).

elif suffix in importlib.machinery.SOURCE_SUFFIXES:

if name != "__init__" and os.path.basename(base) == "__init__":
return (None, os.path.dirname(file_path), ("", "", _PKG_DIRECTORY))

This comment has been minimized.

@ronaldoussoren

ronaldoussoren Feb 17, 2019

Contributor
  • Loader has a method is_package() that does this test.

  • The package __init__ can also be an extension (although I'm not aware of any code that actually does this)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment