trio.Path.iterdir wrapping is broken #501

Closed
touilleMan opened this Issue Apr 16, 2018 · 8 comments

3 participants
@touilleMan
Member

touilleMan commented Apr 16, 2018

Given that pathlib.Path.iterdir returns a generator that performs IO on each iteration, trio.Path.iterdir is currently broken: it only creates the generator asynchronously, which is pointless, since no IO is needed at generator creation.

The solution would be to modify trio.Path.iterdir to return an async generator; however, this means adding a special case, since the current implementation is just an async wrapper around pathlib.Path.iterdir.
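For illustration (stdlib only, not trio code), here is a minimal sketch of why wrapping only the generator's creation buys nothing: a Python generator runs no body code until its first next() call, so the listing (stood in for here by a hypothetical fake_listdir) is read during iteration, outside the async wrapper.

```python
calls = []

def fake_listdir():
    calls.append("listdir")  # stands in for the blocking os.listdir call
    return ["a", "b"]

def iterdir():
    for name in fake_listdir():
        yield name

gen = iterdir()            # no IO has happened yet
assert calls == []
first = next(gen)          # the listing is read here, on first iteration
assert calls == ["listdir"]
```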

@njsmith

Member

njsmith commented Apr 16, 2018

You're right, good catch.

@njsmith

Member

njsmith commented Apr 16, 2018

Also this will require changing the signature – right now it's an async method that returns a synchronous iterator, and it needs to become a synchronous method that returns an async iterator.

I think we can just make the change if we do it soon – you're probably the first person to hit this, and the old behavior is simply wrong.
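To make the signature change concrete, here is a sketch of the two shapes, using stdlib asyncio in place of trio (the function names old_iterdir and new_iterdir are hypothetical):

```python
import asyncio

async def old_iterdir():
    # old shape: an async method returning a *synchronous* iterator
    return iter(["a", "b"])

def new_iterdir():
    # new shape: a synchronous method returning an *async* iterator
    async def agen():
        for name in ["a", "b"]:
            yield name
    return agen()

async def main():
    old = [p for p in await old_iterdir()]   # for ... in await path.iterdir()
    new = [p async for p in new_iterdir()]   # async for ... in path.iterdir()
    return old, new

old, new = asyncio.run(main())
```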

@kyrias

Contributor

kyrias commented Apr 16, 2018

It feels like there's no real reason for the listdir call to wait until iteration starts, so the simplest solution would be to make trio.Path.iterdir an async method that calls listdir in a worker thread and then returns a regular generator yielding the results.

Doing some fancy wrapping of the stdlib function feels like it'd be a lot trickier.

It would essentially entail something like the following, which was ripped from the stdlib almost verbatim:

from functools import partial
import os
import trio

async def iterdir(self):
    func = partial(os.listdir, self._wrapped)
    files = await trio.run_sync_in_worker_thread(func)
    def gen():
        for name in files:
            if name in {'.', '..'}:
                # same special-casing as the stdlib: skip '.' and '..'
                continue
            yield self._wrapped._make_child_relpath(name)
            if self._wrapped._closed:
                self._wrapped._raise_closed()
    return gen()

@njsmith

Member

njsmith commented Apr 16, 2018

Hmm, I assumed that the point of this function was that it didn't pull the whole directory listing into memory at once. Is that wrong?

@kyrias

Contributor

kyrias commented Apr 16, 2018

No, it just calls os.listdir and then yields each element. The following is what's in the stdlib:

def iterdir(self):
    """Iterate over the files in this directory.  Does not yield any
    result for the special paths '.' and '..'.
    """
    if self._closed:
        self._raise_closed()
    for name in self._accessor.listdir(self):
        if name in {'.', '..'}:
            # Yielding a path object for these makes little sense
            continue
        yield self._make_child_relpath(name)
        if self._closed:
            self._raise_closed()

where self._accessor.listdir is just os.listdir unless you pass something else as the pathlib.Path template argument, and os.listdir just returns a plain list.

@kyrias

Contributor

kyrias commented Apr 16, 2018

Anyway, doing it this way would change the semantics slightly: we'd fetch the directory listing when creating the generator object rather than when iteration starts. But pathlib.Path.iterdir doesn't actually document when it reads the directory contents; it just happens that it currently does so when you start iterating rather than when the generator is created.

(Furthermore, the template argument to pathlib.Path isn't actually documented, so anyone trying to override the accessors to use a lazy generator rather than os.listdir would be depending on a non-public API.)

@njsmith

Member

njsmith commented Apr 17, 2018

Hmm, tricky. The underlying operating systems do expose incremental reads for large directories, and Python even exposes this via os.scandir, so ideally iterdir might use that in the future. (Also, it'd be nice if pathlib exposed the richer scandir interface too: it gives you DirEntry objects, which carry more information than just the name.) So even if right now iterdir just naively calls listdir and loads all the results at once, there's an argument that we should make this an async iterator for future-proofing, so that we keep the option of switching to incremental disk reads later.

There are some extra complications though:

  • os.scandir holds open a file handle that has to be closed eventually. If iterdir used it, then users would have to be careful to close iterdir to avoid a ResourceWarning on CPython / an fd leak on PyPy. Not necessarily a big deal, but adding this requirement later would be a pretty subtle change.

  • The theoretical benefit of using an incremental OS API for reading the directory is that if you have a tremendously huge directory, we don't have to load the full list into memory all at once. But the naive version, where you keep calling run_sync_in_worker_thread(scandir.__next__), will be incredibly slow, because you're adding a thread dispatch overhead for every file, even though the vast majority of them don't have to hit the disk. What you really want to do is amortize this by using a single call to run_sync_in_worker_thread to load a whole batch of results into memory all at once. Which... is basically what iterdir is already doing, with its naive use of listdir. In fact using listdir like this is probably optimal for 99.9% of cases, because it's extraordinarily rare that you have a directory listing that's so large that simply holding the list of names causes serious memory problems.

....So I think maybe I've convinced myself that trio.Path.iterdir should keep using listdir, and then potentially in the future we could imagine adding a trio.Path.scandir or something with a cleverer batching algorithm, DirEntry support, and an API that nudges you to close the scandir fd.
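The "cleverer batching algorithm" alluded to here could look roughly like the following sketch, which amortizes the per-item dispatch cost by pulling entries from a scandir-style iterator in chunks. run_in_thread is a placeholder for trio.run_sync_in_worker_thread, and the chunk size is arbitrary:

```python
from itertools import islice

def run_in_thread(fn):
    # placeholder: the real code would await trio.run_sync_in_worker_thread
    return fn()

def batched(entries, size):
    # Drain the iterator in chunks so each thread hop fetches
    # `size` entries instead of one.
    while True:
        batch = run_in_thread(lambda: list(islice(entries, size)))
        if not batch:
            return
        yield from batch

entries = iter("abcdefghij")        # stands in for os.scandir(...)
result = list(batched(entries, 4))
```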


Proposed TODO list for this bug

  • Add a note to the docs for trio.Path.iterdir, warning people that despite having iter in the name it does load up the whole list into memory (by keeping the call signature the same, we're committing to never changing that, so it makes sense to document)

  • Call listdir immediately before returning the iterator, like @kyrias suggested

  • Fix another obvious bug while we're at it: currently trio.Path.iterdir yields pathlib.Path objects, not trio.Path objects :-)
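Putting the TODO items together, a rough sketch of the resulting shape, with stdlib asyncio's default executor standing in for trio's worker thread (this is an illustration, not the actual patch):

```python
import asyncio
import os
import pathlib
from functools import partial

async def iterdir(path: pathlib.Path):
    loop = asyncio.get_running_loop()
    # Item 2: call listdir eagerly, in a worker thread, before returning.
    names = await loop.run_in_executor(None, partial(os.listdir, path))

    def gen():
        for name in names:
            # Item 3: wrap children back into the caller's Path type
            # (trio.Path in the real fix) instead of leaking pathlib.Path.
            yield type(path)(os.path.join(path, name))
    return gen()

async def main():
    return sorted(str(p) for p in await iterdir(pathlib.Path(".")))

listing = asyncio.run(main())
```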

@njsmith njsmith changed the title from trio.Path.iterdir is not really async to trio.Path.iterdir wrapping is broken Apr 17, 2018

@touilleMan

Member

touilleMan commented Apr 17, 2018

@njsmith I've submitted a PR for this
