Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

gh-44691: efficient backward read operations for IOBase [WIP] #120728

Draft
wants to merge 63 commits into
base: main
Choose a base branch
from

Conversation

picnixz
Copy link
Contributor

@picnixz picnixz commented Jun 19, 2024

Per Guido's comment (#44691 (comment)):

Last activity was 2014. Anyone interested in getting this over the finish line?

This PR is a WIP where I wanted to revive an old issue, namely reverse reading for IO objects in general (I've added a minimal blurb entry just for the CI/CD but I'll update it when the PR is done).

While the original issue is only for lines, #44691 (comment) suggested to have the interface on BufferedReader rather than TextIOWrapper since [for] full-speed scanning of log files, you probably want to open them in binary mode. Similarly, it was suggested, instead of a full-blown iterator, to implement simpler primitives.

It might make sense to have the possibility for backward iteration independently of whether it is a FileIO, BytesIO or BufferedIO. The Python implementation is complete and documented (at the code level) and the C interface is complete only for simple primitives.

The design choices are as follows:

  • The backread method behaves like read but reads bytes from right to left (e.g., abcdef -> fedbca). It should be roughly equivalent to read the whole buffer in memory, and read from the end (obviously, I don't read the whole buffer first...). A similar observation holds for the backreadinto method.

  • There is an equivalent of readall that I called backreadall which simply reads the whole buffer from the right. It's mostly used as a shortcut to avoid duplications, e.g., when you call backread() without a limit.

There is also an equivalent of readline() named backreadline() which is a bit different than backread() and backreadall(). I'll illustrate the behaviour with an example. Say that the text is "abc\n1234\nthis is the last line". Then, backreadline(9) would return "last line" and backreadline() would return "this is the last line\n". As you can see, it's similar to read characters from the right and put them in a LIFO queue, stopping just before we encounter a new line character, process that queue and print '\n' if needed.

In practice, backreadline() would be used to retrieve the lines of a log file for instance, so I think it would make sense that it behaves like that. I don't really know whether it's preferrable, when halting in the middle of the line, to return the last bytes of the line or to return the first bytes. In the first case, I don't need to read the entire line if the number of bytes I want to read are less than the line's length but in the second case, I need to first know where my line ended in order to read its first n bytes.

Tasks

  • Add Python implementation
  • Add documentation
    • Document behaviours
    • Add usage of backread() & co (seek, cursor behavior, etc)1
    • Add examples
  • Add Whats New's entry
  • Add tests
  • Add C implementation
    • io.IOBase.backreadline
    • io.IOBase.__reversed__
    • io.RawIOBase.backread (should use the abstract backreadinto())
    • io.RawIOBase.backreadall (should use the abstract backreadinto())
    • io.RawIOBase.backreadinto (should be abstract)
    • io.BufferedIOBase.backread (should be abstract)
    • io.BufferedIOBase.backreadinto (should use the abstract backread())
    • io.BytesIO.backread (specialized)
    • io.BytesIO.backreadinto (specialized)
    • io.BytesIO.backreadline (optional?)
    • io.BufferedReader.backread (specialized)
    • io.BufferedReader.backreadinto (specialized)
    • io.BufferedRandom.backread (should by handled by inside BufferedReader.backread by flushing writable streams)
    • io.BufferedRandom.backreadinto (should be handled by BufferedReader.backreadinto() by flushing writable streams)
    • io.FileIO.backread (should work only if seekable, specialized)
    • io.FileIO.backreadall (should work only if seekable, specialized)
    • io.FileIO.backreadinto (should work only if seekable, specialized)
    • io.TextIOBase.backread (should be un-supported for now)
    • io.TextIOBase.backreadline (should be un-supported for now)
    • io.TextIOWrapper.backread (should be un-supported for now)
    • io.TextIOWrapper.backreadline (should be un-supported for now)
  • (optional) benchmarks

📚 Documentation preview 📚: https://cpython-previews--120728.org.readthedocs.build/

Footnotes

  1. https://github.com/python/cpython/pull/120728#discussion_r1646047980

@picnixz
Copy link
Contributor Author

picnixz commented Jun 19, 2024

@nineteendo Ok, I won't force-push anymore :') Heavens told me that it wasn't worth the effort :D

Doc/library/io.rst Outdated Show resolved Hide resolved
@picnixz

This comment was marked as resolved.

Doc/whatsnew/3.14.rst Outdated Show resolved Hide resolved
@nineteendo
Copy link
Contributor

I would just document to use seek():

from io import SEEK_END

file.seek(0, SEEK_END)
data = file.backread()

@picnixz
Copy link
Contributor Author

picnixz commented Jun 19, 2024

Mmh... ok! that's what I'm using currently in the tests but I'm fine with that (though, maybe people would not be really happy in the usability?)

@picnixz
Copy link
Contributor Author

picnixz commented Jun 19, 2024

NB: Test failures are known (it's just because I don't have the C-implementation yet)

Lib/test/test_memoryio.py Outdated Show resolved Hide resolved
Lib/test/test_memoryio.py Outdated Show resolved Hide resolved
@picnixz
Copy link
Contributor Author

picnixz commented Jun 20, 2024

Ah sure. I'm actually writing incrementally and only pushing to save the work. I always forget to update the global objects also...

Lib/_pyio.py Outdated Show resolved Hide resolved
@picnixz
Copy link
Contributor Author

picnixz commented Jun 20, 2024

The functionalities that need to be implemented in C will be a bit more complicate, so for now, I'll leave the PR as it is. I'd like to have some feedback on the functionality in Python and its implementation so that I don't implement something in C that is not in line with should be decided.

@pitrou You were the one that suggested implementing simpler primitives. Here, I decided to implement the same primitives as the read(), readinto() and readline() primitives, so that I can have a very generic interface that may be extended in the future. Strictly speaking, it's also possible to only have a backreadline() without implementing backread() or backreadinto() separately.

Lib/_pyio.py Outdated
Comment on lines 592 to 595
bytes_rev_res = bytes(rev_res)[::-1]
if eol is None:
return bytes_rev_res
# reverse the characters in the line, except the new line character
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's already reversed.

Suggested change
bytes_rev_res = bytes(rev_res)[::-1]
if eol is None:
return bytes_rev_res
# reverse the characters in the line, except the new line character
# reverse the characters in the line, except the new line character
bytes_rev_res = bytes(rev_res)[::-1]
if eol is None:
return bytes_rev_res

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oupsi, the comment was not updated!

@picnixz
Copy link
Contributor Author

picnixz commented Jun 26, 2024

The implementation of BufferedReader.backread() is buggy and I'm still working on making it work. The premises that I thought of were incorrect and I discovered the algorthmic issues by trying some tests. So this PR is still a WIP and should not be reviewed yet in its current form.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants