
StringIO#getc is very slow for non-fixed-size encodings #2281

Closed
tsion opened this Issue · 1 comment


@tsion
Collaborator

Here's a little benchmark:

           user     system      total        real
ascii  0.063334   0.000000   0.063334 (  0.064369)
utf-8  8.616666   0.000000   8.616666 (  8.643624)

(Code is here.)

getbyte and each_char both run in about the same amount of time for ascii and utf-8.
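A comparison along these lines can be sketched as follows. This is not the original benchmark code (that link is not reproduced here); the helper names `drain_with_getc`, `sample_strings`, and `run_getc_benchmark`, as well as the string size, are assumptions made for illustration.

```ruby
require 'benchmark'
require 'stringio'

# Hypothetical helper: read every character with getc and count them.
def drain_with_getc(io)
  count = 0
  count += 1 while io.getc
  count
end

# Build two strings with the same number of characters: one ASCII-only,
# one made of a two-byte UTF-8 character ("\u00e9" is "é").
def sample_strings(chars)
  ascii = ("a" * chars).force_encoding(Encoding::US_ASCII)
  utf8  = "\u00e9" * chars
  { "ascii" => ascii, "utf-8" => utf8 }
end

# Time a full getc pass over each string. The size is an arbitrary choice.
def run_getc_benchmark(chars = 10_000)
  Benchmark.bm(6) do |bm|
    sample_strings(chars).each do |label, str|
      bm.report(label) { drain_with_getc(StringIO.new(str)) }
    end
  end
end

run_getc_benchmark if $PROGRAM_NAME == __FILE__
```

If getc costs time proportional to the current position on the UTF-8 string, the total for that row should grow roughly quadratically as `chars` increases, while the ASCII row grows linearly.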

@dbussink
Owner

The fix probably involves making sure we track the current position in bytes and can fetch a character starting at that byte offset.
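A minimal sketch of that idea in plain Ruby, assuming UTF-8 input (at most 4 bytes per character). `ByteCursorReader` is a hypothetical class for illustration, not the Rubinius StringIO code.

```ruby
# Hypothetical reader that keeps its position as a byte offset.
class ByteCursorReader
  MAX_CHAR_BYTES = 4 # assumption: UTF-8, where a character is 1 to 4 bytes

  attr_reader :pos # current position, in bytes

  def initialize(string)
    @string = string
    @pos = 0
  end

  # Return the next byte as an Integer, or nil at the end of the string.
  def getbyte
    return nil if @pos >= @string.bytesize
    byte = @string.getbyte(@pos)
    @pos += 1
    byte
  end

  # Return the next character, or nil at the end of the string.
  # byteslice works on byte offsets, so this does not scan from the
  # start of the string the way character indexing can.
  def getc
    return nil if @pos >= @string.bytesize
    char = @string.byteslice(@pos, MAX_CHAR_BYTES).chr
    @pos += char.bytesize
    char
  end
end
```

Because both methods advance the same byte offset, calls to `getbyte` and `getc` can be interleaved without the two disagreeing about where the cursor is.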

@dbussink closed this issue from a commit
@tsion: Use byte indexes in StringIO#getc.
Previously it treated d.pos as a character index and indexed into the string,
even though StringIO#getbyte treats d.pos as a byte index.

This makes getc work properly with getbyte (fixes #2282).

This also makes it much more efficient, since string indexing for
non-fixed-width encoded strings such as UTF-8 strings takes linear time
(fixes #2281).
53290f9
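The interaction between getbyte and getc described in the commit message can be checked with a small script like the one below. The helper name `mixed_reads` is made up for illustration, and the behaviour described in the comments assumes an implementation where the StringIO position is a byte index.

```ruby
require 'stringio'

# Hypothetical helper: read one byte, then one character, and report
# the byte, the character, and the resulting position.
def mixed_reads(text)
  io = StringIO.new(text)
  first_byte = io.getbyte
  next_char  = io.getc
  [first_byte, next_char, io.pos]
end

# "a\u00e9b" is "a", a two-byte "é", then "b".
# With a byte-indexed position, getbyte consumes "a" and getc should then
# return the whole "é", leaving the position just before "b".
p mixed_reads("a\u00e9b") if $PROGRAM_NAME == __FILE__
```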
@dbussink closed this in 53290f9