Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Un-deprecate functional API for importlib resources & add subdirectory support #116608

Closed
encukou opened this issue Mar 11, 2024 · 27 comments
Closed
Labels
topic-importlib type-feature A feature request or enhancement

Comments

@encukou
Copy link
Member

encukou commented Mar 11, 2024

Feature or enhancement

Proposal:

The importlib.resources functions {open,read}_{text,binary}, path, is_resource and contents, deprecated in 3.11 and removed in 3.13 alphas, are, anecdotally, missed by quite a few users.
They provide a simple API for simple tasks, while the full-featured Traversable API is better suited for complex ones -- especially for implementing new resources-aware loaders.

I'm now in a position where I can add these functions back and support them.

Their main drawback -- not allowing subdirectories -- can be solved by taking multiple path components as positional arguments, for example:

importlib.resources.read_text('modulename', 'subdirectory', 'subsubdir', 'resource.txt')

The additional arguments (encoding and errors) would become keyword-only.


There is a wrinkle in this: in Python 3.9-3.11, the above would mean:

importlib.resources.read_text(
    'modulename', 'subdirectory',
    encoding='subsubdir',
    errors='resource.txt',
)

I believe that this is acceptable, since:

  • pragmatically: typical file names do not match typical encoding/errorhandler names
  • lawyerly: the functions have already been deprecated for 2 releases; no one is using them now, right?

However, if this is a problem, I can

[edit: This is solved by:]

  • make the encoding argument required if a text-reading function more than one path component is given.
  • plan to lift this limitation around 3.15.
importlib.resources.read_text(
    'modulename', 'subdirectory', 'subsubdir', 'resource.txt',
    encoding='utf-8',
)
importlib.resources.read_text('modulename', 'resource.txt')  # OK
importlib.resources.read_text('modulename', 'subdirectory', 'utf-8')  # error

Has this already been discussed elsewhere?

I have already discussed this feature proposal on Discourse

Links to previous discussion of this feature:

https://discuss.python.org/t/deprecating-importlib-resources-legacy-api/11386/29

Linked PRs

@FFY00
Copy link
Member

FFY00 commented Mar 11, 2024

Wouldn't this be a bit late for that? We already went through the deprecation period, and removed the feature in the alpha releases, bringing them back now would be a bit confusing.

The importlib.resources functions {open,read}_{text,binary}, path, is_resource and contents, deprecated in 3.11 and removed in 3.13 alphas, are, anecdotally, missed by quite a few users.

Can you actually show a couple examples of this affecting users downstream? I think that's the most viable argument to bring that API back.

@zooba
Copy link
Member

zooba commented Mar 11, 2024

taking multiple path components as positional arguments

Why not just take multiple components with separators in a single argument? It's easy enough to require forward slash, disallow .. and even to normalise backslashes on Windows if you feel like it.

If they didn't allow subdirectories before (I never noticed, tbh), then presumably using a slash here would have either failed completely or worked. Either way, we can enable them in a new release.

(And add me to the anecdotal list of people who missed them. It's easy enough to add a few lines of code to bring them back, which is how I have been handling it so far, but I'd be happier to have those few lines in the stdlib.)

@jaraco
Copy link
Member

jaraco commented Mar 11, 2024

At least one audience that would like to keep the legacy APIs is in mesonbuild/meson#12401.

I admit, I prefer this approach over keeping the legacy APIs with the cruft that it still had lying around. It adds a mostly-compatible layer and restores these wrappers in a supported way.

On one hand, this approach violates the "preferably one way" to do things; users will need to decide which way works best for them, creating a variety of supported approaches. On the other hand, I do appreciate that it offers a friendlier interface for certain operations (esp. path(...)).

FFY00 and I put a lot of work into this deprecation process, so it'll be disappointing to now see this reversed at the last minute, but it does feel like the right thing to do, especially since someone else is willing to own the implementation (thanks encukuo!). We will have to backport the change to importlib_resources, but that should be fairly straightforward.

Overall, I'm +0 on the change. I'd really like to see more vocal support from other core devs before committing to this approach.

@encukou
Copy link
Member Author

encukou commented Mar 12, 2024

I've made the encoding argument mandatory for _text functions when multiple path names are given.

Wouldn't this be a bit late for that?

Yes, sorry. Previously I couldn't commit to supporting this API.

Why not just take multiple components with separators in a single argument?

I'd rather not derail discussion on this issue. Support for separators can be added later if necessary. If they will, allowing multiple arguments will still be useful.

FFY00 and I put a lot of work into this deprecation process

Sorry to hear that. Sunk costs suck :(
This makes it seem that implementing the deprecation process was similarly (or more) time-consuming as keeping the API working. That's not a good situation to be in, especially considering all the work users need to put in to update their code.

@pfmoore
Copy link
Member

pfmoore commented Mar 12, 2024

(And add me to the anecdotal list of people who missed them. It's easy enough to add a few lines of code to bring them back, which is how I have been handling it so far, but I'd be happier to have those few lines in the stdlib.)

I’ll add a “me too” here as well. Being able to do simple things simply is an advantage.

@pradyunsg
Copy link
Member

pradyunsg commented Mar 12, 2024

+1 from me on the (updated) proposed API as well as un-deprecating these -- for reasons that have been discussed on the d.p.o thread as well as mentioned here by others.

@eli-schwartz
Copy link
Contributor

Their main drawback -- not allowing subdirectories -- can be solved by taking multiple path components as positional arguments, for example:

importlib.resources.read_text('modulename', 'subdirectory', 'subsubdir', 'resource.txt')

I've always sort of wondered why this is a drawback at all, compared to simply doing this:

with importlib.resources.path('modulename.subdirectory.subsubdir', 'resources.txt') as f:
    ...

I'm not objecting to the new API! It's more ergonomic than pretending everything is a namespace module. But for backwards compatibility with python < 3.13 it seems practical to use the two-argument form, and the lack of a new API doesn't seem like it should have been a killer problem before now.

@zooba
Copy link
Member

zooba commented Mar 12, 2024

I don't understand the reason we can't reimplement it as:

def read_text(module, filename, *args, **kwargs): #use proper args if you want here, I just don't know them all off the top of my head
    with (path(module) / filename).open("r", *args, **kwargs) as f:
        return f.read()

Why do we need the module and filename as multiple args instead of just two?

@zooba
Copy link
Member

zooba commented Mar 12, 2024

I'd rather not derail discussion on this issue.

How is it derailing this issue? You're bringing back an API, which I like, and changing the design in a potentially backwards-incompatible way in the process, which I don't. Why is it derailing to ask why it has to have a different design now?

@barneygale
Copy link
Contributor

Why do we need the module and filename as multiple args instead of just two?

I'm catching myself up on this. I think the answer is (somewhere) in this thread: https://gitlab.com/python-devs/importlib_resources/-/issues/58

@jaraco
Copy link
Member

jaraco commented Mar 12, 2024

Why do we need the module and filename as multiple args instead of just two?

I'm catching myself up on this. I think the answer is (somewhere) in this thread: https://gitlab.com/python-devs/importlib_resources/-/issues/58

Which was migrated to python/importlib_resources#58.

@barneygale
Copy link
Contributor

barneygale commented Mar 12, 2024

IIUC:

  • Support for resources in subdirectories required new APIs, e.g. something resembling Traversible.iterdir().
  • The pathlib API was considered a good fit and chosen for the task.
  • The functional interface wasn't enhanced to support subdirectories; instead it was earmarked for eventual removal.

So there's perhaps four levels of support we could offer for the functional APIs:

  1. Deprecate + remove (@jaraco's original plan, already landed)
  2. Restore old APIs + functionality as it was.
  3. Restore old APIs, and add support for subdirectories to existing functions (this issue and @encukou's PR)
  4. Restore old APIs, add add support for subdirectories, including adding new functions where needed (e.g. functional equivalent of traversable.iterdir())

Personally I'd lean towards option 2. If folks need subdirectory support they can use the OOP API - that's it's whole reason to exist!

@jaraco
Copy link
Member

jaraco commented Mar 12, 2024

Early in the design, I'd deemed it infeasible implement path for anything that's expecting subdirectories, because what does it mean to get a path for a directory? And indeed, the replacement, as_file(files(...)) would only allow access to a single file on disk, even though the Traversable API provided access to traverse through a tree of files. Later (and only fairly recently in python/importlib_resources#255), as_file was expanded to add support for directories (by manifesting a directory and all of its contents in a temporary directory when needed). Until that support was added, the meaning of path() for a directory would have been broken. I believe that explains why the functional API was intended to be deprecated but only later seems viable to be revived with subdirectory support.

@pfmoore
Copy link
Member

pfmoore commented Mar 12, 2024

I'm going to backtrack slightly on my support for this. I do find the Traversable API conceptually complex when all I want to do is "read a file from my package" - the functional API was simpler, and as I said above, I believe that "simple things should be simple to do". So I remain +1 on un-deprecating that API.

The complexities around "add subdirectory support" are where I'm less certain. I like the idea of supporting subdirectories without needing the full Traversable API, but I'm not entirely happy with the proposal here (specifically the somewhat-weird compromise of making the encoding parameter mandatory). As an alternative, what's wrong with allowing /-separated resource names, and translating them into the equivalent Traversable invocation? This has, by definition, no more security issues than the joinpath approach, so the problems discussed in the linked issue are addressed at least as well as the Traversable API addresses them. And it matches what people were reported as expecting to be able to do with the functional API.

So there's perhaps four levels of support we could offer for the functional APIs:

  1. Deprecate + remove (@jaraco's original plan, already landed)
  2. Restore old APIs + functionality as it was.
  3. Restore old APIs, and add support for subdirectories to existing functions (this issue and @encukou's PR)
  4. Restore old APIs, add add support for subdirectories, including adding new functions where needed (e.g. functional equivalent of traversable.iterdir())

Personally I'd lean towards option 2. If folks need subdirectory support they can use the OOP API - that's it's whole reason to exist!

I would be happy with that.

I will note that option 2 was essentially what a lot of people asked for in the Discourse thread at the time of the original deprecation, and it felt very much to me that positions got entrenched because the comments were possibly too heated. Maybe with the passage of time, and experience of the impact, all we're doing here is confirming that it turned out that the wrong choice was made? Luckily, we still have time to reverse that choice before the functional API actually gets removed.

@jaraco
Copy link
Member

jaraco commented Mar 12, 2024

One issue that was discussed in comments leading up to this comment was that, by adding support for subdirectories, functions like path and read_text could be providing access to content outside of the package. The files() API avoided this issue by leaving it up to the caller to do any traversal. If there are plans for path() and other calls to accept path parameters, it will re-open those (possibly security-related) concerns.

@pfmoore
Copy link
Member

pfmoore commented Mar 12, 2024

I don't see how - you shouldn't be accepting untrusted input to the Traversable API in just the same way as for the functional API. joinpath("..") is no more secure than a .. component in a /-separated path, surely?

@jaraco
Copy link
Member

jaraco commented Mar 12, 2024

I don't see how - you shouldn't be accepting untrusted input to the Traversable API in just the same way as for the functional API. joinpath("..") is no more secure than a .. component in a /-separated path, surely?

I'm not convinced it's a concern meriting intervention, but it was something that @warsaw raised.

@barneygale
Copy link
Contributor

From the linked comment:

I'll assume that the implementation prevents navigating to non-subdirectories; i.e. you can't escape the enclosing resource package directory, or find your way to /etc, etc. :)

If I'm right in thinking that files() returns a pathlib.Path or zipfile.Path, then in the former case, nothing prevents users from "escaping" the resource root by calling joinpath('..') or joinpath('/etc/passwd'), right?

@zooba
Copy link
Member

zooba commented Mar 12, 2024

Maybe, but in cases where that would work, there are easier ways for someone to do it.

Much easier to say that resources shouldn't be looked up using untrusted input. The vast majority of uses will be using static strings here anyway, because arbitrary files do not exist in your own package.

But I believe it uses a different type anyway and only converts to an actual path at the very end, so it may not even be possible to traverse outside of the package. Which makes this entirely a moot point (and would be a great design if so).

@eli-schwartz
Copy link
Contributor

eli-schwartz commented Mar 12, 2024

I'm pretty sure it produces a traversable that also inherits from Path and can be used as either one (modulo isinstance checks to see whether it's from pathlib or zipfile for cases where the difference matters).

I'd assume that the standard path joining operations use the traversable impl which then checks for attempts to escape the resource root.

@eli-schwartz
Copy link
Contributor

Maybe, but in cases where that would work, there are easier ways for someone to do it.

Much easier to say that resources shouldn't be looked up using untrusted input. The vast majority of uses will be using static strings here anyway, because arbitrary files do not exist in your own package.

Right, there's no reason someone couldn't be passing the same inputs to os.path.join and then to builtin open() directly, instead of importlib.resources, so it's basically splitting hairs to say that the resources API needs super special protections for security.

... Also a distraction from the main topic I guess.

@encukou
Copy link
Member Author

encukou commented Mar 13, 2024

@jaraco, do you think allowing path separators is OK? I'm happy to delegate the decision to you.

My opinion, for what it's worth: it is OK.
I don't think we need to normalize slashes or disallow backslashes. I've often wished for a more opinionated resource system that would enforce names that work the same on all platforms (e.g. [a-z0-9]+ segments separated by [-/._]), as that would eliminate common beginner packaging mistakes. But, importlib.resources isn't that, and I don't think it should try to be that.
I can see the case for avoiding path traversal attacks (disallowing ..), though. Tests will catch a platform-specific FileNotFoundError, but they won't catch a vulnerability.

I prefer allowing multiple *path_parts, as in pathlib's joinpath(), but I'm OK deferring that to a follow-up PR.

@jaraco
Copy link
Member

jaraco commented Mar 13, 2024

But I believe it uses a different type anyway and only converts to an actual path at the very end

In actuality, all importlib.resources.files() returns in most cases is a pathlib.Path or a zipfile.Path. There's no wrapper or additional protections, such as to enforce that only the Traversable interface is supplied. In fact in python/importilb_resources#291, I explored wrapping the result in something that extended the interface but found it is difficult or impossible to do in general. The Traversable interface exists mainly for providers to know the minimum interface they must supply, but for stdlib loaders, the Path classes provide that interface. If we wanted to provide traversal limitations for objects returned from files(), we'd probably want to augment both pathlib.Path and zipfile.Path to supply those limitations and then enable that behavior in files().

If I'm right in thinking that files() returns a pathlib.Path or zipfile.Path, then in the former case, nothing prevents users from "escaping" the resource root by calling joinpath('..') or joinpath('/etc/passwd'), right?

Correct.

@jaraco, do you think allowing path separators is OK?

Yes, I believe it's okay. I believe there's existing precedence for supporting it, in part because 'pathlib.Path' supports it, but also because it's explicitly documented that Traversables should support paths separated by posixpath.sep (forward slash, platform independent).

I've often wished for a more opinionated resource system that would enforce names that work the same on all platforms (e.g. [a-z0-9]+ segments separated by [-/._]), as that would eliminate common beginner packaging mistakes.

importlib.resources has been somewhat opinionated, as seen above, by supporting one path separator (posixpath.sep, which often works universally), but unopinionated in that it allows any valid path names or separators. That is, importlib.resources doesn't want to get into the business of regulating what is a valid resource. If a use-case demands to have names that contain spaces or unicode characters as they're meaningful to the application, they should be allowed to do so.

I prefer allowing multiple *path_parts, as in pathlib's joinpath(), but I'm OK deferring that to a follow-up PR.

Since Traversable.join_path accepts (*descendants), it should be readily supportable. That interface only evolved that behavior somewhat recently, so it may not be available in all Pythons (and thus custom providers may not be reasonably expected to fully support it yet).


One more comment - you may consider contributing this change to importlib_resources simultaneous to or prior to submitting it with CPython as importlib_resources provides a fuller test suite, providing coverage on older Pythons and checks for other aspects like code coverage, type checking, style consistency, and more. Totally your call, though.

@encukou
Copy link
Member Author

encukou commented Mar 20, 2024

Since Traversable.join_path accepts (*descendants), it should be readily supportable.

Oh, that's great! I'll add it to the docs.

The semantics are clear then: I'll delegate to joinpath :)

you may consider contributing this change to importlib_resources

Roger, will send the patch there too.

@encukou
Copy link
Member Author

encukou commented Mar 21, 2024

I've updated the PR to allow path separators.

A PR for importlib_resources is here: python/importlib_resources#303

@encukou
Copy link
Member Author

encukou commented Apr 5, 2024

Thanks for the discussion everyone, and sorry I couldn't make everyone happy.

I merged the PR. If you disagree with that, please preferably continue the discussion here: https://discuss.python.org/t/11386/47

@encukou encukou closed this as completed Apr 5, 2024
@encukou
Copy link
Member Author

encukou commented Apr 5, 2024

Apparently the tests fail on big-endian machines; will fix

@encukou encukou reopened this Apr 5, 2024
encukou added a commit that referenced this issue Apr 5, 2024
…GH-117569)

gh-116609: Ignore UTF-16 BOM in importlib.resources._functional tests

To test the `errors` argument, we read a UTF-16 file as UTF-8
with "backslashreplace" error handling. However, the utf-16
codec adds an endian-specific byte-order mark, so on big-endian
machines the expectation doesn't match the test file (which was
saved on a little-endian machine).

Use endswith to ignore the BOM.
@encukou encukou closed this as completed Apr 8, 2024
diegorusso pushed a commit to diegorusso/cpython that referenced this issue Apr 17, 2024
diegorusso pushed a commit to diegorusso/cpython that referenced this issue Apr 17, 2024
… tests (pythonGH-117569)

pythongh-116609: Ignore UTF-16 BOM in importlib.resources._functional tests

To test the `errors` argument, we read a UTF-16 file as UTF-8
with "backslashreplace" error handling. However, the utf-16
codec adds an endian-specific byte-order mark, so on big-endian
machines the expectation doesn't match the test file (which was
saved on a little-endian machine).

Use endswith to ignore the BOM.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
topic-importlib type-feature A feature request or enhancement
Projects
None yet
Development

No branches or pull requests

8 participants