gh-44691: efficient backward read operations for IOBase [WIP] #120728

picnixz · 2024-06-19T09:36:29Z

Last activity was in 2014. Anyone interested in getting this over the finish line?

This PR is a WIP where I wanted to revive an old issue, namely reverse reading for IO objects in general (I've added a minimal blurb entry just for the CI/CD but I'll update it when the PR is done).

While the original issue is only for lines, #44691 (comment) suggested to have the interface on BufferedReader rather than TextIOWrapper since [for] full-speed scanning of log files, you probably want to open them in binary mode. Similarly, it was suggested, instead of a full-blown iterator, to implement simpler primitives.
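For context, reverse line scanning of a binary file can already be approximated with the existing `seek()` and `read()` calls. The sketch below is not code from this PR; `iter_lines_reversed` and `chunk_size` are hypothetical names, and it assumes a seekable binary stream whose `read(n)` returns exactly `n` bytes when that many are available (true for `BytesIO` and `BufferedReader`).

```python
import io


def iter_lines_reversed(stream, chunk_size=8192):
    """Yield the lines of a seekable binary stream from last to first.

    Hypothetical helper, not part of the io module. Lines are yielded
    without their trailing b"\n". An empty stream yields one empty line.
    """
    stream.seek(0, io.SEEK_END)
    pos = stream.tell()
    tail = b""  # partial line carried over from the chunk to the right
    while pos > 0:
        step = min(chunk_size, pos)
        pos -= step
        stream.seek(pos)
        chunk = stream.read(step) + tail
        lines = chunk.split(b"\n")
        # The first piece may continue into the previous chunk, so hold it back.
        tail = lines[0]
        for line in reversed(lines[1:]):
            yield line
    yield tail


lines = list(iter_lines_reversed(io.BytesIO(b"abc\n1234\nlast"), chunk_size=4))
```

A small `chunk_size` is used here only to exercise the carry-over across chunk boundaries; a real log scanner would use a much larger value.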

It might make sense to have the possibility for backward iteration independently of whether it is a FileIO, BytesIO or BufferedIO. The Python implementation is complete and documented (at the code level) and the C interface is complete only for simple primitives.

The design choices are as follows:

The backread method behaves like read but reads bytes from right to left (e.g., abcdef -> fedcba). It should be roughly equivalent to reading the whole buffer into memory and then reading from the end (obviously, I don't read the whole buffer first...). A similar observation holds for the backreadinto method.
There is an equivalent of readall, which I called backreadall, that simply reads the whole buffer from the right. It's mostly used as a shortcut to avoid duplication, e.g., when you call backread() without a limit.
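The behaviour described above can be emulated with today's `seek()`/`read()` to make it concrete. This is only a sketch under assumptions: `backread_sketch` is a hypothetical helper, not the PR's implementation, and it assumes that the returned bytes are in reversed order (as the abcdef example suggests) and that the stream position moves backward past the consumed bytes.

```python
import io


def backread_sketch(stream, size=-1):
    """Emulate a backward read on a seekable binary stream (hypothetical).

    Reads up to `size` bytes located before the current position and
    returns them right to left; a negative size reads back to the start.
    """
    pos = stream.tell()
    if size is None or size < 0 or size > pos:
        size = pos
    start = pos - size
    stream.seek(start)
    data = stream.read(size)
    stream.seek(start)  # assumption: the position ends up before the consumed bytes
    return data[::-1]


buf = io.BytesIO(b"abcdef")
buf.seek(0, io.SEEK_END)          # start from the end of the stream
first = backread_sketch(buf, 2)   # the two rightmost bytes, reversed
rest = backread_sketch(buf)       # everything that remains, reversed
```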

There is also an equivalent of readline() named backreadline(), which is a bit different from backread() and backreadall(). I'll illustrate the behaviour with an example. Say that the text is "abc\n1234\nthis is the last line". Then, backreadline(9) would return "last line" and backreadline() would return "this is the last line\n". As you can see, it's similar to reading characters from the right and putting them in a LIFO queue, stopping just before we encounter a newline character, then processing that queue and printing '\n' if needed.
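A small reference model may help pin down these semantics. It works on an in-memory `bytes` object rather than a stream, and `backreadline_model` is a hypothetical name. One loud assumption: the model only returns a newline that is actually present in the data, so for the example text it returns the last line without a trailing `\n`; how the PR decides when to append `'\n'` is not modelled here.

```python
def backreadline_model(data: bytes, pos: int, size: int = -1):
    """Return (line, new_pos) for a backward line read ending at `pos`.

    Hypothetical model: the line is returned in forward order, and a
    newline sitting immediately before `pos` is treated as the terminator
    of the line being read.
    """
    end = pos
    # Skip the terminator of the current line when looking for its start.
    search_end = end - 1 if end > 0 and data[end - 1:end] == b"\n" else end
    start = data.rfind(b"\n", 0, search_end) + 1  # 0 when no newline is found
    if size >= 0:
        # A size limit keeps only the rightmost bytes of the line.
        start = max(start, end - size)
    return data[start:end], start


text = b"abc\n1234\nthis is the last line"
limited, _ = backreadline_model(text, len(text), 9)
full, pos_after = backreadline_model(text, len(text))
previous, _ = backreadline_model(text, pos_after)
```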

In practice, backreadline() would be used to retrieve the lines of a log file, for instance, so I think it makes sense for it to behave like that. I don't really know whether it's preferable, when halting in the middle of the line, to return the last bytes of the line or the first bytes. In the first case, I don't need to read the entire line if the number of bytes I want to read is less than the line's length, but in the second case, I first need to know where my line started in order to read its first n bytes.
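The trade-off between the two options can be sketched on plain `bytes`. Both function names are hypothetical and neither is taken from the PR; the point is only that the "last bytes" variant inspects at most `n` bytes, while the "first bytes" variant has to find the start of the line before it can slice.

```python
def tail_of_last_line(data: bytes, n: int) -> bytes:
    """Last `n` bytes of the final line: only the last n bytes are inspected."""
    window = data[max(0, len(data) - n):]
    cut = window.rfind(b"\n")
    return window[cut + 1:]


def head_of_last_line(data: bytes, n: int) -> bytes:
    """First `n` bytes of the final line: the line start must be located first."""
    start = data.rfind(b"\n") + 1  # scans backward over the whole last line
    return data[start:start + n]


text = b"abc\n1234\nthis is the last line"
tail = tail_of_last_line(text, 9)
head = head_of_last_line(text, 9)
```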

Tasks

Issue: Efficient reverse line iterator #44691

📚 Documentation preview 📚: https://cpython-previews--120728.org.readthedocs.build/

https://github.com/python/cpython/pull/120728#discussion_r1646047980 ↩

picnixz · 2024-06-19T09:38:59Z

@nineteendo Ok, I won't force-push anymore :') Heavens told me that it wasn't worth the effort :D

Doc/library/io.rst

Doc/whatsnew/3.14.rst

nineteendo · 2024-06-19T10:56:43Z

I would just document to use seek():

```python
from io import SEEK_END

file.seek(0, SEEK_END)
data = file.backread()
```

picnixz · 2024-06-19T10:59:42Z

Mmh... ok! That's what I'm currently using in the tests, and I'm fine with that (though maybe people would not be really happy with the usability?)

picnixz · 2024-06-19T11:18:55Z

NB: Test failures are known (it's just because I don't have the C-implementation yet)

Doc/library/io.rst

Lib/test/test_memoryio.py

picnixz · 2024-06-20T14:16:45Z

Ah sure. I'm actually writing incrementally and only pushing to save the work. I always forget to update the global objects also...

Lib/_pyio.py

picnixz · 2024-06-20T16:53:12Z

The functionality that needs to be implemented in C will be a bit more complicated, so for now, I'll leave the PR as it is. I'd like to have some feedback on the functionality in Python and its implementation so that I don't implement something in C that is not in line with what should be decided.

@pitrou You were the one that suggested implementing simpler primitives. Here, I decided to implement the same primitives as the read(), readinto() and readline() primitives, so that I can have a very generic interface that may be extended in the future. Strictly speaking, it's also possible to only have a backreadline() without implementing backread() or backreadinto() separately.

nineteendo · 2024-06-25T17:01:20Z

Lib/_pyio.py

+        bytes_rev_res = bytes(rev_res)[::-1]
+        if eol is None:
+            return bytes_rev_res
+        # reverse the characters in the line, except the new line character


It's already reversed.

Suggested change

```diff
-        bytes_rev_res = bytes(rev_res)[::-1]
-        if eol is None:
-            return bytes_rev_res
-        # reverse the characters in the line, except the new line character
+        # reverse the characters in the line, except the new line character
+        bytes_rev_res = bytes(rev_res)[::-1]
+        if eol is None:
+            return bytes_rev_res
```

Oupsi, the comment was not updated!

picnixz · 2024-06-26T12:51:43Z

The implementation of BufferedReader.backread() is buggy and I'm still working on making it work. The assumptions I started from were incorrect, and I discovered the algorithmic issues by trying some tests. So this PR is still a WIP and should not be reviewed yet in its current form.

picnixz added 6 commits June 19, 2024 11:34

add Python implementation

420c9a4

add abstract interface

b995f77

update test

0d9d5b2

add doc

a99bd07

blurb

80ca2f5

add NEWS

e349297

bedevere-app bot mentioned this pull request Jun 19, 2024

Efficient reverse line iterator #44691

Open

fix RST

7c21f65

nineteendo reviewed Jun 19, 2024

Doc/library/io.rst (outdated, resolved)

picnixz added 8 commits June 19, 2024 11:57

update C implementation order

dc21a78

clinic

633b674

update C documentation

523411b

update Py documentation

97f6911

update refs

e91e270

update RST documentation

31ada63

clinic again

d1c1729

fix ref

49869c5

This comment was marked as resolved.

nineteendo reviewed Jun 19, 2024

Doc/whatsnew/3.14.rst (outdated, resolved)

add tests for Python BytesIO

f13513e

nineteendo reviewed Jun 19, 2024

Doc/library/io.rst (resolved)

picnixz added 2 commits June 19, 2024 16:53

Add C implementation for BytesIO.backread[into]()

8640635

update tests

712fcb0

nineteendo reviewed Jun 19, 2024

Lib/test/test_memoryio.py (outdated, resolved)

Lib/test/test_memoryio.py (outdated, resolved)

picnixz added 2 commits June 19, 2024 17:16

use 'BOF' sentinels

6b42a56

make code more readable

7b75186

picnixz added 6 commits June 20, 2024 17:43

add C implementation for reverse read operations for BytesIO objects

51518ec

update tests

516392f

regenerate objects

cefc446

fix range

593c792

update method order

a1dc240

add TODO for me

4e4882c

nineteendo reviewed Jun 20, 2024

Lib/_pyio.py (outdated, resolved)

picnixz added 2 commits June 20, 2024 18:34

add C implementation for io.RawIOBase.backread[all,into] methods

d507d12

regenerate objects

df89fc5

picnixz added 10 commits June 23, 2024 13:53

Merge branch 'main' into reverse-read-io

740e5b8

regenerate objects

82b4e8d

update tests

0d5702a

optimize BytesIO.backreadline

3ca17f2

update tests for backreadinto

3a6276b

optimize BytesIO.backreadinto

71bdf54

revert private API exposure

4ad94b1

fix warnings

53dca79

remove un-necessary import

3c38929

Use [::-1] instead of reversed()

387669b

nineteendo reviewed Jun 25, 2024

picnixz added 6 commits June 26, 2024 10:02

change method order

74ff54a

update comment for IOBase.backreadline

aa743d1

fix __reversed__

287466e

remove un-necessary test

60dc998

remove un-necessary test

f855737

fix backread() of python

ccc1674

Merge remote-tracking branch 'upstream/main' into reverse-read-io

a1a6835
