TextIOWrapper.tell extremely slow #55323

Closed
Laurens mannequin opened this issue Feb 4, 2011 · 15 comments
Labels
performance Performance or resource usage topic-IO

Comments

@Laurens
Mannequin

Laurens mannequin commented Feb 4, 2011

BPO 11114
Nosy @amauryfa, @pitrou, @ericvsmith
Files
  • tell v01.py: example program
  • textiotell.patch
  • textiotell2.patch
  Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = <Date 2011-02-25.20:29:37.152>
    created_at = <Date 2011-02-04.08:52:36.587>
    labels = ['expert-IO', 'performance']
    title = 'TextIOWrapper.tell extremely slow'
    updated_at = <Date 2011-02-25.20:29:37.151>
    user = 'https://bugs.python.org/Laurens'

    bugs.python.org fields:

    activity = <Date 2011-02-25.20:29:37.151>
    actor = 'pitrou'
    assignee = 'none'
    closed = True
    closed_date = <Date 2011-02-25.20:29:37.152>
    closer = 'pitrou'
    components = ['IO']
    creation = <Date 2011-02-04.08:52:36.587>
    creator = 'Laurens'
    dependencies = []
    files = ['20674', '20676', '20680']
    hgrepos = []
    issue_num = 11114
    keywords = ['patch']
    message_count = 15.0
    messages = ['127874', '127877', '127881', '127889', '127892', '127893', '127894', '127901', '127907', '127908', '127914', '127915', '127933', '127934', '129418']
    nosy_count = 5.0
    nosy_names = ['amaury.forgeotdarc', 'pitrou', 'dsm001', 'eric.smith', 'Laurens']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'performance'
    url = 'https://bugs.python.org/issue11114'
    versions = ['Python 3.3']

    @Laurens
    Mannequin Author

    Laurens mannequin commented Feb 4, 2011

    file.tell() has become extremely slow in version 3.2, both rc1 and rc2. This problem did not exist in version 2.7.1, nor in version 3.1. It could be reproduced on both Mac and Windows XP.

    @Laurens Laurens mannequin added the topic-IO label Feb 4, 2011
    @ericvsmith
    Member

    Do you have a benchmark program you can post?

    @Laurens
    Mannequin Author

    Laurens mannequin commented Feb 4, 2011

    Correction: the problem also exists in version 3.1. I created a benchmark program and ran it on my machine (iMac, Snow Leopard 10.6), with the following results:

    ------------------------------------------
    2.6.6 (r266:84292, Dec 30 2010, 09:20:14)
    [GCC 4.2.1 (Apple Inc. build 5664)]
    result: 0.0009 s.
    ------------------------------------------
    2.7.1 (r271:86832, Jan 13 2011, 07:38:03)
    [GCC 4.2.1 (Apple Inc. build 5664)]
    result: 0.0008 s.
    ------------------------------------------
    3.1.3 (r313:86882M, Nov 30 2010, 09:55:56)
    [GCC 4.0.1 (Apple Inc. build 5494)]
    result: 9.5682 s.
    ------------------------------------------
    3.2rc2 (r32rc2:88269, Jan 30 2011, 14:30:28)
    [GCC 4.2.1 (Apple Inc. build 5664)]
    result: 8.3531 s.

    Removing the line containing "tell" gives the following results:

    ------------------------------------------
    2.6.6 (r266:84292, Dec 30 2010, 09:20:14)
    [GCC 4.2.1 (Apple Inc. build 5664)]
    result: 0.0007 s.
    ------------------------------------------
    2.7.1 (r271:86832, Jan 13 2011, 07:38:03)
    [GCC 4.2.1 (Apple Inc. build 5664)]
    result: 0.0006 s.
    ------------------------------------------
    3.1.3 (r313:86882M, Nov 30 2010, 09:55:56)
    [GCC 4.0.1 (Apple Inc. build 5494)]
    result: 0.0093 s.
    ------------------------------------------
    3.2rc2 (r32rc2:88269, Jan 30 2011, 14:30:28)
    [GCC 4.2.1 (Apple Inc. build 5664)]
    result: 0.0007 s.

    (Apparently, reading a file became a lot faster from 3.1 to 3.2.)

    Conclusion: Execution of file.tell() makes the program about 10000 times slower.

    Remark: the file mdutch.txt is a dummy text file containing 1000 lines with one word on each line.
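The attached benchmark program is not reproduced in this thread. A minimal sketch of a comparable measurement, assuming a generated stand-in for mdutch.txt (1000 lines, one word per line) and a readline() loop, could look like this. The helper names `make_word_file` and `read_lines` are hypothetical, and the timings will differ from the figures quoted above.

```python
import os
import tempfile
import time


def make_word_file(n_lines=1000):
    """Create a temporary text file with one word per line (stand-in for mdutch.txt)."""
    fd, path = tempfile.mkstemp(suffix=".txt")
    with os.fdopen(fd, "w", encoding="utf-8", newline="\n") as f:
        for i in range(n_lines):
            f.write("word%d\n" % i)
    return path


def read_lines(path, call_tell):
    """Read the file line by line in text mode, optionally calling tell() per line."""
    offsets = []
    start = time.perf_counter()
    with open(path, "r", encoding="utf-8", newline="\n") as infile:
        line = infile.readline()
        while line != "":
            if call_tell:
                offsets.append(infile.tell())
            line = infile.readline()
    return time.perf_counter() - start, offsets


path = make_word_file()
try:
    t_plain, _ = read_lines(path, call_tell=False)
    t_tell, offsets = read_lines(path, call_tell=True)
    print("without tell(): %.4f s" % t_plain)
    print("with tell():    %.4f s" % t_tell)
finally:
    os.remove(path)
```

The loop uses readline() rather than iterating over the file object, because text-mode iteration disables tell() while it is in progress.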

    @amauryfa
    Member

    amauryfa commented Feb 4, 2011

    I found that adding "infile._CHUNK_SIZE = 20" makes the test much faster - 'only' 5 times slower than 2.7.
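A sketch of the workaround mentioned above, assuming an in-memory `io.BytesIO` as a stand-in for the real file. `_CHUNK_SIZE` is a private attribute of `TextIOWrapper`, so it may change or disappear between versions. The helper name `tell_offsets` is hypothetical.

```python
import io

# In-memory stand-in for a text file with one word per line.
data = "".join("word%d\n" % i for i in range(1000)).encode("utf-8")


def tell_offsets(chunk_size=None):
    """Collect tell() after every line, optionally with a smaller decode chunk."""
    infile = io.TextIOWrapper(io.BytesIO(data), encoding="utf-8", newline="\n")
    if chunk_size is not None:
        # Private attribute: fewer bytes are decoded per chunk, so tell()
        # has less buffered data to account for on each call.
        infile._CHUNK_SIZE = chunk_size
    offsets = []
    while infile.readline():
        offsets.append(infile.tell())
    return offsets


default_offsets = tell_offsets()
small_chunk_offsets = tell_offsets(20)
```

The chunk size only changes how much work each tell() call does; the positions reported for the same data should not depend on it.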

    @dsm001
    Mannequin

    dsm001 mannequin commented Feb 4, 2011

    With a similar setup (OS X 10.6) I see the same problem. It seems to go away if the file is opened in binary mode for reading. @laurens, can you confirm?

    @dsm001
    Mannequin

    dsm001 mannequin commented Feb 4, 2011

    (By "go away" I mean "stop being pathological", not "stop differing": I still see a factor of 2.)

    @pitrou
    Member

    pitrou commented Feb 4, 2011

    That's expected. seek() and tell() on text (unicode) files are slow by construction. You should open your file in binary mode instead, if you want to do any seeking.

    Maybe I should add a note in http://docs.python.org/dev/library/io.html#performance
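To illustrate the binary-mode suggestion: in binary mode, tell() is a plain byte offset that can be passed straight back to seek(). The sketch below uses `io.BytesIO` as a stand-in for a file opened with mode "rb"; the sample words are made up for the example.

```python
import io

# Stand-in for open(path, "rb"); each "é" occupies two bytes in UTF-8.
raw = io.BytesIO("één\ntwee\ndrie\n".encode("utf-8"))

first = raw.readline()
pos = raw.tell()          # byte offset just after the first line
second = raw.readline()

raw.seek(pos)             # jump back to the start of the second line
reread = raw.readline()
```

The offset counts bytes, not characters, so it is only meaningful together with the encoding used to decode the lines afterwards.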

    @pitrou
    Member

    pitrou commented Feb 4, 2011

    That said, I think it is possible to make algorithmic improvements to TextIOWrapper.tell() so that at least performance becomes acceptable.

    @pitrou pitrou changed the title from "file.tell extremely slow" to "TextIOWrapper.tell extremely slow" Feb 4, 2011
    @pitrou pitrou added the performance Performance or resource usage label Feb 4, 2011
    @Laurens
    Mannequin Author

    Laurens mannequin commented Feb 4, 2011

    First of all, thanks to all for your cooperation, it is very much appreciated.

    I made some minor changes to the benchmark program. Conclusions are:

    • setting file._CHUNK_SIZE to 20 has a dramatic effect, changing execution time in 3.2rc2 from 8.4s to 0.06s, an improvement of more than a factor of 100. It's a bit clumsy, but acceptable as a performance workaround.

    • opening file binary has a dramatic effect as well, I would say. After 2 minutes I stopped execution of the program, concluding that this change made the program at least a factor of 10 *slower* instead of faster. So I cannot confirm DSM's statement that the performance hit would be a factor of 2. Instead, I see a performance hit of at least a factor of 100000 (1e5) compared to 2.7.1, which is presumably not by construction ;-).

    @pitrou
    Member

    pitrou commented Feb 4, 2011

    > opening file binary has a dramatic effect as well, I would say. After 2
    > minutes I stopped execution of the program

    Hint: b'' is not equal to '' ;)
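The hint can be sketched as follows, assuming (the benchmark is not shown in this thread) that the read loop compared each line against the text sentinel `''`. In binary mode readline() returns `b''` at end of file, which never compares equal to `''`, so such a loop would not terminate. Testing the truthiness of the line works in both modes.

```python
import io

# In Python 3, bytes and str never compare equal, even when both are empty.
sentinel_mismatch = (b"" != "")

# Stand-in for a file opened with mode "rb".
infile = io.BytesIO(b"een\ntwee\n")

count = 0
line = infile.readline()
while line:               # an empty bytes or str object is falsy, so this stops at EOF
    count += 1
    line = infile.readline()
```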

    @pitrou
    Member

    pitrou commented Feb 4, 2011

    Here is a proof-of-concept patch for the pure Python version of TextIOWrapper.tell(). It turns the O(CHUNK_SIZE) operation into an O(1) operation most of the time (still O(CHUNK_SIZE) in the worst case: weird decoders and/or crazy input).

    @pitrou
    Member

    pitrou commented Feb 4, 2011

    > Here is a proof-of-concept patch for the pure Python version of
    > TextIOWrapper.tell(). It turns the O(CHUNK_SIZE) operation into an
    > O(1) operation most of the time (still O(CHUNK_SIZE) worst-case - weird
    > decoders and/or crazy input).

    Actually, that's wrong. The patch is still O(CHUNK_SIZE) but with a
    vastly different multiplier (and, optimistically, a much smaller one,
    as common codecs are implemented in C).
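The attached patches are not reproduced in this thread, so the following is only a simplified illustration of the "different multiplier" idea, not the committed code. It assumes the problem is to find the byte offset in a buffered chunk that corresponds to a number of already-consumed characters. The slow variant calls the decoder once per byte; the fast variant first guesses an offset from the chunk's bytes-per-character ratio and decodes that prefix in a single call. Both function names are hypothetical.

```python
import codecs


def byte_offset_slow(chunk, chars_consumed, encoding="utf-8"):
    """One decoder call per byte: many small calls into the decoder."""
    if chars_consumed == 0:
        return 0
    decoder = codecs.getincrementaldecoder(encoding)()
    chars = 0
    for i in range(len(chunk)):
        chars += len(decoder.decode(chunk[i:i + 1]))
        if chars >= chars_consumed:
            return i + 1
    raise ValueError("chunk holds fewer characters than requested")


def byte_offset_fast(chunk, chars_consumed, encoding="utf-8"):
    """Guess an offset, decode the prefix in one call, then fix up byte by byte."""
    make = codecs.getincrementaldecoder(encoding)
    total = len(make().decode(chunk))
    ratio = len(chunk) / total if total else 1.0
    guess = min(len(chunk), int(ratio * chars_consumed))
    while guess > 0:
        decoder = make()
        n = len(decoder.decode(chunk[:guess]))
        pending, _flags = decoder.getstate()
        # Accept the guess only if it ends on a character boundary
        # and does not overshoot the requested character count.
        if not pending and n <= chars_consumed:
            break
        guess -= 1
    else:
        decoder, n = make(), 0
    pos = guess
    while n < chars_consumed:
        if pos >= len(chunk):
            raise ValueError("chunk holds fewer characters than requested")
        n += len(decoder.decode(chunk[pos:pos + 1]))
        pos += 1
    return pos


text = "héllo wörld\n"
chunk = text.encode("utf-8")
results = [(byte_offset_slow(chunk, n), byte_offset_fast(chunk, n))
           for n in range(len(text) + 1)]
```

For fixed-width or mostly ASCII input the guess tends to land on or near the right offset, so most of the work happens in a few large decode calls instead of one call per byte.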

    @pitrou
    Member

    pitrou commented Feb 4, 2011

    New patch also optimizing the C version. tell() can be more than 100x faster now (still much slower than binary tell()).

    @Laurens
    Mannequin Author

    Laurens mannequin commented Feb 4, 2011

    All,

    thanks for your help. Opening the file in binary mode worked immediately in the toy program (that is, the benchmark code I sent you). (Antoine, thanks for the hint.) In my real-world program, I solved the problem by reading each line from a binary input file and decoding it explicitly and immediately to a string. The performance is acceptable now, and I propose to close this issue.

    Thanks again, cheers,

    Laurens
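The workaround described above (read lines in binary mode, decode each one explicitly) can be sketched as follows. The real program is not shown in this thread, so the helper name `lines_with_offsets`, the UTF-8 encoding, and the sample data are assumptions. The approach relies on an encoding in which a line break is the single byte `b"\n"`, which holds for UTF-8 and Latin-1 but not for UTF-16.

```python
import io


def lines_with_offsets(binary_file, encoding="utf-8"):
    """Yield (byte offset, decoded line) pairs from a file opened in binary mode."""
    while True:
        offset = binary_file.tell()      # cheap: plain byte position
        raw_line = binary_file.readline()
        if not raw_line:                 # b"" signals end of file
            break
        yield offset, raw_line.decode(encoding)


# Stand-in for open(path, "rb").
source = io.BytesIO("één\ntwee\n".encode("utf-8"))
index = list(lines_with_offsets(source))
```

Each recorded offset can later be handed to seek() on the same binary file to jump directly to that line.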

    @pitrou
    Member

    pitrou commented Feb 25, 2011

    Committed in r88607 (3.3).

    @pitrou pitrou closed this as completed Feb 25, 2011
    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022