stdlib stubs are unnecessarily strict with file-like objects #4212

Closed
remram44 opened this issue Jun 10, 2020 · 11 comments
Labels
topic: io I/O related issues

Comments

@remram44

Problem

Currently the IO situation is less than ideal. Not only are IO[str]/TextIO and IO[bytes]/BinaryIO a bit confusing (interchangeable in most cases), but the use of IO throughout the stdlib stubs is inconsistent, and doing things like passing an object with a write() method to json.dump() does not type-check.

This is because the IO object, while describing the actual objects returned by open() perfectly, is not suitable to represent the "file-like object" interface. This interface is well known, documented prominently in the standard library's documentation (glossary: "file object" and "file-like object") and a testament to duck-typing; however it's not compatible with how typeshed is currently written (for the most part).
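To make the mismatch concrete, here is a minimal sketch (the `ListSink` class is hypothetical and not part of the issue). At runtime, `json.dump()` only calls `write()` on its target, so the object below works even though it does not inherit from `IO`; a stub that annotates the parameter as `IO[str]` would still lead a type checker to reject the call.

```python
import json
from typing import List


class ListSink:
    """Hypothetical file-like object: it implements write() and nothing else."""

    def __init__(self) -> None:
        self.chunks: List[str] = []

    def write(self, s: str) -> int:
        # Collect every chunk the encoder hands us.
        self.chunks.append(s)
        return len(s)


sink = ListSink()
json.dump({"a": 1}, sink)  # fine at runtime: only write() is ever called
result = "".join(sink.chunks)
```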

Proposal

I propose to introduce Protocols (not abstract classes) to be used for parameters where a "file object" is expected, allowing one to correctly type their file-like objects without having to inherit one of the abstract base classes. Furthermore, I think we should have two protocols representing files that can be read from or written to.

This work can be done incrementally, and I am willing to spend time doing this if there is no veto to this ticket.

Pros

This would allow a file-like object to be passed to json.dump(), zipfile.ZipFile, and others (as is already possible with csv.writer()).

Using Protocols of this small scale would allow objects that already conform to be used in interfaces expecting a file-like object, without having to implement too many methods (or explicitly inherit from the base class, as is required now). This should lower the effort of bringing libraries to the typing world. Using two separate protocols is similar to how most languages do this, off the top of my head:

It is interesting to note that the protocols I describe already exist in typeshed. Not wanting to put IO where the documentation called for file-like object, protocols have already been introduced:

Introducing those protocols would also allow us to remove some of the IO[str]/TextIO complexity: while TextIO and BinaryIO are still needed for the native file objects (they have additional methods compared to IO), the protocols used for function parameters everywhere can be only Read[str] and Read[bytes].

Cons

This is a sizeable change, and people are likely to use both the base class and those protocols for some time. However, code using the base class should not break when its objects are passed to functions expecting the protocol.

Another caveat is that this might give a false sense of security: libraries in the wild do their own checks to determine whether an object conforms to the interface. For example, pandas will refuse to write to a file object that does not implement __iter__. Therefore objects conforming to the protocol might still not be accepted by (IMHO buggy) libraries, while inheriting from the base class would make their objects look more like file objects (maybe too much, since it gives everything the attributes of both a readable and a writable file!).
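The kind of runtime check described here can be sketched as follows. `looks_file_like` is a hypothetical stand-in written for illustration, not pandas' actual code; it only assumes that the library looks for `read`/`write` and additionally insists on `__iter__`.

```python
import io


class WriteOnly:
    """Conforms to a write-only protocol but is not iterable."""

    def write(self, s: str) -> int:
        return len(s)


def looks_file_like(obj: object) -> bool:
    """Hypothetical attribute-based check of the kind a library might run."""
    has_io_method = hasattr(obj, "read") or hasattr(obj, "write")
    return has_io_method and hasattr(obj, "__iter__")


accepted_custom = looks_file_like(WriteOnly())     # rejected: no __iter__
accepted_stringio = looks_file_like(io.StringIO())  # accepted: real file object
```

An object can therefore satisfy a static protocol and still be turned away by a check like this one.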

Draft

Unfortunately this is where the typeshed-bikeshed starts, but this is my proposal:

from types import TracebackType
from typing import Iterator, Optional, Protocol, Type, TypeVar

AnyStr = TypeVar('AnyStr', str, bytes)  # same constraints as typing.AnyStr

# Anything that can be written to
class WriteIO(Protocol[AnyStr]):
    def write(self, s: AnyStr) -> int: ...
    def flush(self) -> None: ...
    def close(self) -> None: ...
    def __enter__(self) -> 'WriteIO[AnyStr]': ...
    def __exit__(self, exc_type: Optional[Type[BaseException]], exc_val: Optional[BaseException],
                 exc_tb: Optional[TracebackType]) -> Optional[bool]: ...

# Anything that can be read from
class ReadIO(Protocol[AnyStr]):
    def read(self, size: Optional[int] = None) -> AnyStr: ...
    def close(self) -> None: ...
    def __enter__(self) -> 'ReadIO[AnyStr]': ...
    def __exit__(self, exc_type: Optional[Type[BaseException]], exc_val: Optional[BaseException],
                 exc_tb: Optional[TracebackType]) -> Optional[bool]: ...
    def __iter__(self) -> Iterator[AnyStr]: ...

Additional protocols can be added to provide seek()/tell() (similar to Rust's io::Seek trait).

@JelleZijlstra
Member

We're already moving in this direction, thanks mostly to @srittau's efforts. However, our approach has been to create ad-hoc protocols for individual cases, instead of one-size-fits-all protocols like you propose. The trouble is that while in theory the file-like object concept may be "well known", in practice it is ill-defined, and there are lots of variations in exactly what methods are expected to exist on file-like objects.
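As a rough illustration of the ad-hoc approach, a function can declare exactly the one method it needs. The names `_SupportsWriteStr` and `dump_lines` below are hypothetical, made up for this sketch rather than taken from typeshed.

```python
import io
from typing import Iterable, Protocol


class _SupportsWriteStr(Protocol):
    """Ad-hoc protocol: the only thing the function below requires."""

    def write(self, s: str) -> object: ...


def dump_lines(lines: Iterable[str], fp: _SupportsWriteStr) -> None:
    # Only write() is used, so only write() is demanded of the caller.
    for line in lines:
        fp.write(line + "\n")


buf = io.StringIO()
dump_lines(["a", "b"], buf)
written = buf.getvalue()
```

Any object with a compatible `write()` satisfies the annotation, whether or not it inherits from an IO base class.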

@remram44
Author

I unfortunately experienced that first-hand with pandas, however:

  • It is unlikely that the protocols used in the standard library vary that much (for example shutil._Writer and _csv._Writer are identical)
  • We have an opportunity to give library authors a sane model for what those interfaces should be. This will make it easier to go to pandas and tell them that their attribute-checking is unusual, for example

Also adding methods on top of those core protocols is easy (e.g. extend the protocol) so there is probably no need to redefine those from scratch every place they are needed...
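A minimal sketch of what extending a core protocol could look like, assuming a hypothetical one-method `Writer` protocol as the base (neither name comes from typeshed):

```python
import io
from typing import Protocol, runtime_checkable


class Writer(Protocol):
    """Hypothetical core protocol: just write()."""

    def write(self, s: str) -> int: ...


@runtime_checkable
class FlushableWriter(Writer, Protocol):
    """Extends Writer with flush() instead of redefining write()."""

    def flush(self) -> None: ...


class OnlyWrites:
    def write(self, s: str) -> int:
        return len(s)


# runtime_checkable isinstance() only checks that the methods exist.
stringio_ok = isinstance(io.StringIO(), FlushableWriter)
only_writes_ok = isinstance(OnlyWrites(), FlushableWriter)
```

Listing `Protocol` again among the bases is what keeps the subclass a protocol rather than an ordinary class.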


In any case I am glad to hear that an effort is going on. So long as I can get my custom writers to type-check when I pass them to json.dump() and similar, I can move forward with typing. Is that effort documented or tracked somewhere? Would PRs fixing individual module stubs like json be accepted?

@JelleZijlstra
Member

Yes, such PRs would be accepted. There's no centralized tracking, but we recently (#4161) added a _typeshed package for internal types, so that would be a good place for IO protocols.

@srittau
Collaborator

srittau commented Jun 10, 2020

One of the goals of using ad-hoc protocols for now is to determine which protocols are needed in practice and then move those to _typeshed.

@remram44
Author

Once we converge on a set of protocols, wouldn't it be better to expose them publicly? Them being protocols, I can duplicate them in my code, but that still feels wrong.

@JelleZijlstra
Member

Maybe once we make typeshed modular we can also generate .py code for _typeshed and publish it as a package.

@ramalho
Contributor

ramalho commented Jun 14, 2020

@remram44 this is a great proposal. I've been teaching Python for 20+ years and "a file-like object" has always been one of the best examples of the informal protocols that appear often in the standard library.

I think we should seek inspiration with Go, which popularized static duck typing several years before PEP 544. Their philosophy is that interfaces should be narrow, often just a single method, and it works very well for them. Take a look at:

And combinations like:

I also like very much their naming convention of turning a verb into a noun by affixing "er".

@srittau started a discussion about naming in #4174.

@srittau added the "topic: io" (I/O related issues) label Jun 14, 2020
@srittau
Collaborator

srittau commented Jun 14, 2020

Some previous discussion in python/typing#564.

@srittau
Collaborator

srittau commented May 16, 2021

I'll just copy my comment from python/typing#213 here:

Intersection types could be useful for fast ad-hoc protocols, especially IO protocols:

def foo(file: HasRead & HasSeek) -> None:
    pass
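The `&` operator above is proposed syntax, not something the typing module offered at the time of this comment. A rough equivalent that works today is a protocol inheriting from both; `HasRead`, `HasSeek` and `HasReadAndSeek` are defined here as assumptions for the sketch.

```python
import io
from typing import Protocol


class HasRead(Protocol):
    def read(self, size: int = ...) -> bytes: ...


class HasSeek(Protocol):
    def seek(self, offset: int, whence: int = ...) -> int: ...


class HasReadAndSeek(HasRead, HasSeek, Protocol):
    """Stands in for the hypothetical `HasRead & HasSeek`."""


def first_bytes_twice(file: HasReadAndSeek, n: int) -> bytes:
    # Read n bytes, rewind, and read the same n bytes again.
    head = file.read(n)
    file.seek(0)
    return head + file.read(n)


out = first_bytes_twice(io.BytesIO(b"abcdef"), 3)
```

The cost compared with an intersection type is that every combination needs its own named class.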

@srittau
Collaborator

srittau commented Aug 11, 2021

I don't think there's anything immediately actionable here for typeshed. We are already moving in the direction of small ad-hoc protocols in typeshed that can be used for this purpose. Larger protocols only make sense if they are currently used.

@srittau closed this as completed Aug 11, 2021
@twoertwein
Contributor

twoertwein commented Nov 17, 2021

Pandas 1.4 will use protocols for its IO functions pandas-dev/pandas#43951

"for example pandas will not accept to write on a file object that does not implement __iter__"

This will be slightly relaxed in 1.4.
