Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We鈥檒l occasionally send you account related emails.

Already on GitHub? Sign in to your account

gh-104212: Add importlib.util.load_source_path() function #105755

Closed
wants to merge 1 commit into from

Conversation

vstinner
Copy link
Member

@vstinner vstinner commented Jun 13, 2023

@vstinner
Copy link
Member Author

imp.load_source() has a 3rd parameter file to pass a file-like object, whereas my proposed function has only 2 arguments. Would it be worth it to implement this parameter?

imp.load_source() uses this special loader to implement this parameter:

class _HackedGetData:

    """Compatibility support for 'file' arguments of various load_*()
    functions."""

    def __init__(self, fullname, path, file=None):
        super().__init__(fullname, path)
        self.file = file

    def get_data(self, path):
        """Gross hack to contort loader to deal w/ load_*()'s bad API."""
        if self.file and path == self.path:
            # The contract of get_data() requires us to return bytes. Reopen the
            # file in binary mode if needed.
            if not self.file.closed:
                file = self.file
                if 'b' not in file.mode:
                    file.close()
            if self.file.closed:
                self.file = file = open(self.path, 'rb')

            with file:
                return file.read()
        else:
            return super().get_data(path)


class _LoadSourceCompatibility(_HackedGetData, machinery.SourceFileLoader):

    """Compatibility support for implementing load_source()."""

@vstinner
Copy link
Member Author

The function was already proposed in 2012 but nobody actually implemented it: #58756 Maybe less people cared since imp.load_source() was still available :-) But the imp module got removed in Python 3.12!

It was recently requested when someone tried to port their code to Python 3.12: #104212

@vstinner
Copy link
Member Author

@warsaw @brettcannon: What do you think of this late Python 3.12 addition? I propose to add it after beta1 since I consider that it's a regression, not a new feature: it replaces removed imp.load_source().

I thought that imp.load_source() supported file-like objects which are not on disk, like ZIP archives or anything else. But the imp implementation can reopen the file from its filename, so I'm not really convinced by its usefulness.

@brettcannon
Copy link
Member

It's a new feature so I wouldn't want it slipped into Python 3.12 (just because imp had it doesn't mean it was a good idea 馃槈). Plus the code required to do this is already documented at https://docs.python.org/3/library/importlib.html#importing-a-source-file-directly and it's 4 lines.

@vstinner
Copy link
Member Author

vstinner commented Jun 15, 2023

It's a new feature so I wouldn't want it slipped into Python 3.12 (just because imp had it doesn't mean it was a good idea wink).

Well, there are projects who need this function: see #104212.

Plus the code required to do this is already documented at https://docs.python.org/3/library/importlib.html#importing-a-source-file-directly and it's 4 lines.

This recipe doesn't work if the filename doesn't end with .py according to #104212.

A recipe in the doc is nice, but people (like me) are unable to find it and so come up with their own bad implementation: https://stackoverflow.com/questions/69884878/replacing-imp-with-importlib-maintaining-old-behavior

Even if the code is short, the code already evolved once: before, load_module() had to be used, and now this method is gone and exec_module() must be used instead.

I'm not going to push too hard to have such function in importlib: I have my own flavor of this function in my project. I already updated it a few time to follow new importlib "trends" and follow deprecations :-) (The first implementation used imp obviously.)

@brettcannon
Copy link
Member

This recipe doesn't work if the filename doesn't end with .py according to #104212.

And they are incorrect. 馃槈 That's only an issue if you don't specify the loader does it fail.

Copy link
Member

@brettcannon brettcannon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you need to decide whether this is meant to only import a module and be as simple as possible, or can you also import a package?

Regardless, it's missing a module spec on the module which is critical.

It also needs to be decided on whether this is necessary.

@@ -246,3 +247,13 @@ def exec_module(self, module):
loader_state['__class__'] = module.__class__
module.__spec__.loader_state = loader_state
module.__class__ = _LazyModule


def load_source(module_name, filename):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The name is misleading because you're not loading from source, but from a file that contains source code. I think you're after something like load_source_path().

You're also missing whether this is a package? If that's not possible for simplicity, then that should probably be stated. Otherwise it's a piece of information that a module spec contains.

def load_source(module_name, filename):
"""Load a module from a filename."""
loader = SourceFileLoader(module_name, filename)
module = types.ModuleType(loader.name)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should get the module via the spec.

"""Load a module from a filename."""
loader = SourceFileLoader(module_name, filename)
module = types.ModuleType(loader.name)
module.__file__ = filename
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be set by the loader, not manually. And you're missing the spec.

@bedevere-bot
Copy link

When you're done making the requested changes, leave the comment: I have made the requested changes; please review again.

@brettcannon
Copy link
Member

I don't know how anyone else feels, but we can see if there's any other responses.

One key thing, though, is the current implementation in the PR is incomplete as it's missing a spec. It also has to decided whether supporting packages is important or not.

@vstinner
Copy link
Member Author

As an user, I have no idea of what a spec is. I have no idea how to use the importlib API correctly. So I copy/paste random recipes on the Internet and hack them for my needs. That's why I like the idea of a clean API to use the importlib API to "load a Python source and import it".

I don't want to load a package. For me, the most common case would be that it's a .py file. I don't see how i would load a package from a filename, since it's a directory. Maybe we need a test to ensure that it doesn't work.

@vstinner
Copy link
Member Author

I updated my PR:

  • PR rebased on main
  • function renamed to load_source_path()
  • use a spec
  • define the behavior (in doc and with tests) for packages: loading a directory raises an error, loading an __init__.py file works as expected (it's a regular file at the end)

New implementation:

def load_source_path(module_name, filename):
    """Load a module from a filename."""
    loader = SourceFileLoader(module_name, filename)
    spec = spec_from_loader(module_name, loader)
    module = module_from_spec(spec)
    sys.modules[module.__name__] = module
    loader.exec_module(module)
    return module

@vstinner
Copy link
Member Author

Another imp.load_source() user who doesn't know how to update the code to Python 3.12 importlib API: https://discuss.python.org/t/how-do-i-migrate-from-imp/27885

Lib/importlib/util.py Outdated Show resolved Hide resolved
@brettcannon
Copy link
Member

As an user, I have no idea of what a spec is.

https://docs.python.org/3/library/importlib.html#importlib.machinery.ModuleSpec 馃槈

Joking aside, it is documented.

I don't want to load a package.

Then you probably should make sure the name doesn't have a . in it, otherwise you're also not making sure any submodules or subpackages end up on the parent.

Another imp.load_source() user who doesn't know how to update the code to Python 3.12 importlib API: https://discuss.python.org/t/how-do-i-migrate-from-imp/27885

As I said over on discuss.python.org, Miro's ask is actually different from yours. You want to load from a specific path, Miro wants to search and load from a specific directory (note how your use case lacks a search component).

A package cannot be imported by its directory path, whereas its
``__init__.py`` file (ex: ``package/__init__.py``) can be imported.

.. versionadded:: 3.12
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR has no needs backport to 3.12 label. Should the version be 3.13 instead?

Suggested change
.. versionadded:: 3.12
.. versionadded:: 3.13

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, it seems there is no support for the backport to satisfy a specific use case:

As I said over on discuss.python.org, Miro's ask is actually different from yours. You want to load from a specific path, Miro wants to search and load from a specific directory (note how your use case lacks a search component).

@vstinner vstinner added the needs backport to 3.12 bug and security fixes label Jun 19, 2023
@vstinner
Copy link
Member Author

The PR has no needs backport to 3.12 label. Should the version be 3.13 instead?

I propose to add this function to Python 3.12 to replace the removed imp.load_source(). If it's too late, it can be added to Python 3.13. But it's unpleasant that Python 3.12 users have no solution (in form of a single function call) in the stdlib: otherwise, people will copy/paste these 3-5 lines to their code, and adding a function to Python 3.13 stdlib becomes less relevant?

@vstinner
Copy link
Member Author

vstinner commented Jun 19, 2023

I updated my PR:

  • Rebased on main
  • Squashed commits
  • Test that the module is always executed, even if it's already cached in sys.modules. Add a "side effect" for that.
  • Reimplement the function with spec_from_file_location(), instead of spec_from_spec(), to make sure that the module.__file__ attribute is always set. spec_from_file_location() behavior fits better to load_source_path() use case, so future spec_from_file_location() changes are less likely to change the load_source_path() behavior.
  • The tests check also module.__spec__ attributes and module.__loader__ attributes: relative filename vs absolute filename, check loader.is_package(), check also module.__path__ list.

@brettcannon:

Then you probably should make sure the name doesn't have a . in it, otherwise you're also not making sure any submodules or subpackages end up on the parent.

Ok, I added a check to reject names which contain a dot character.

@vstinner
Copy link
Member Author

My implementation is different than Python 3.11 imp.load_source() if the module is already cached in sys.modules: the imp function did not re-execute the module, whereas my implement always execute the module. Which behavior is the "correct" one?

Python 3.11 imp.load_source():

def load_source(name, pathname, file=None):
    loader = _LoadSourceCompatibility(name, pathname, file)
    spec = util.spec_from_file_location(name, pathname, loader=loader)
    if name in sys.modules:
        module = _exec(spec, sys.modules[name])
    else:
        module = _load(spec)
    # To allow reloading to potentially work, use a non-hacked loader which
    # won't rely on a now-closed file object.
    module.__loader__ = machinery.SourceFileLoader(name, pathname)
    module.__spec__.loader = module.__loader__
    return module

See the if name in sys.modules: test.

@vstinner vstinner changed the title gh-104212: Add importlib.util.load_source() function gh-104212: Add importlib.util.load_source_path() function Jun 19, 2023
@vstinner
Copy link
Member Author

My implementation is different than Python 3.11 imp.load_source() if the module is already cached in sys.modules: the imp function did not re-execute the module, whereas my implement always execute the module. Which behavior is the "correct" one?

A different design would be to no use sys.modules at all: always create a new module and always execute the module. Getting an module in sys.modules and caching the new module in sys.modules would be the responsibility of the caller.

If load_source_path() returns the module in sys.modules, the issue is that if the name is generic and/or reused to load Python modules under the same name, we get the "wrong module". For example, a buggy test loader using "test" module name, without del sys.modules["test"], would run the same tests.

Moreover, the function disallows the usage of dots in the module name, which makes the name collision even more likely.

@vstinner
Copy link
Member Author

Examples of projects using imp.load_source():

  • coremltools: setup.py : sys.modules is not used, hardcoded "coremltools.version" name (with a dot!)
  • pyinvoke : sys.modules is not used, hardcoded "mod" module name, it was already replaced with SourceFileLoader("mod", path).load_module() before.

@brettcannon
Copy link
Member

My implementation is different than Python 3.11 imp.load_source() if the module is already cached in sys.modules: the imp function did not re-execute the module, whereas my implement always execute the module. Which behavior is the "correct" one?

There is no "correct" semantics since this isn't a typical use case. It's trying to work outside of the import system to accomplish something while still using the import system to do some of the work. I think the key question is why would someone want this functionality? If it's to import some module in an odd spot, then it should go into sys.modules. If it's just to get some source file into a module object to call the code in there, then putting the module in sys.modules isn't necessary (and neither is the module name as that could be a dummy value). And if it's being done just to execute the code, then use exec() appropriately.

There's a reason why I haven't put something like this into importlib. 馃槈

@vstinner
Copy link
Member Author

@brettcannon:

There is no "correct" semantics since this isn't a typical use case. It's trying to work outside of the import system to accomplish something while still using the import system to do some of the work. (...)

The more I dig into the details, the more I understand why it's hard to design a single function implementing "exclusive" use cases, especially the question about caching the module in sys.modules or not (3 cases):

  • don't use the cache and don't store in the cache
  • don't use the cache and store in the cache
  • use the cache and store in the cache

Morever, here we are already about replacing existing imp.load_source() calls. But also code written manually (using the imp and importlib modules).

For all these reasons, I give up on my attempt to add importlib.util.load_source_path() function. Instead, I propose to add a "recipe" explaining how to port code using the removed imp module in What's New in Python 3.12: PR #105951.

@vstinner vstinner closed this Jun 20, 2023
@vstinner vstinner deleted the importlib_load_source branch June 27, 2023 23:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants