Seeking to EOF is too inefficient! #65837

Closed
yanlinlin82 mannequin opened this issue Jun 2, 2014 · 6 comments
Labels
performance Performance or resource usage topic-IO

Comments

@yanlinlin82
Mannequin

yanlinlin82 mannequin commented Jun 2, 2014

BPO 21638
Nosy @pitrou, @vstinner

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


GitHub fields:

assignee = None
closed_at = <Date 2014-06-03.18:59:26.116>
created_at = <Date 2014-06-02.13:39:16.725>
labels = ['expert-IO', 'performance']
title = 'Seeking to EOF is too inefficient!'
updated_at = <Date 2014-06-03.18:59:26.114>
user = 'https://bugs.python.org/yanlinlin82'

bugs.python.org fields:

activity = <Date 2014-06-03.18:59:26.114>
actor = 'yanlinlin82'
assignee = 'none'
closed = True
closed_date = <Date 2014-06-03.18:59:26.116>
closer = 'yanlinlin82'
components = ['IO']
creation = <Date 2014-06-02.13:39:16.725>
creator = 'yanlinlin82'
dependencies = []
files = []
hgrepos = []
issue_num = 21638
keywords = []
message_count = 6.0
messages = ['219586', '219637', '219654', '219709', '219715', '219716']
nosy_count = 4.0
nosy_names = ['pitrou', 'vstinner', 'neologix', 'yanlinlin82']
pr_nums = []
priority = 'normal'
resolution = 'third party'
stage = None
status = 'closed'
superseder = None
type = 'performance'
url = 'https://bugs.python.org/issue21638'
versions = ['Python 2.7']

@yanlinlin82
Mannequin Author

yanlinlin82 mannequin commented Jun 2, 2014

I noticed this problem when a Python 2 program (MACS: http://liulab.dfci.harvard.edu/MACS/) ran very inefficiently on a large storage system attached to a high-performance server (64-bit Linux). It was much slower there (more than two days) than on a normal PC (less than two hours).

After ruling out many other factors, I finally traced the problem to the seek() function of Python 2. I can now reproduce it with a very simple example:

#!/usr/bin/python2
f = open("Input.sort.bam", "rb")
f.seek(0, 2)
f.close()

Here, the size of the file 'Input.sort.bam' is 4,110,535,920 bytes. Running the program under 'strace' shows the following system calls on Linux:

$ strace python2 foo.py
...
open("Input.sort.bam", O_RDONLY)        = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=4110535920, ...}) = 0
fstat(3, {st_mode=S_IFREG|0644, st_size=4110535920, ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f23d4492000
fstat(3, {st_mode=S_IFREG|0644, st_size=4110535920, ...}) = 0
lseek(3, 4110532608, SEEK_SET)          = 4110532608
read(3, "f\203\337<\334\350\313\315\345&T\227\211\fC\212a\260\204P\235\366\326\353\230\327>\373\361\221\357\373"..., 3312) = 3312
close(3)                                = 0
...

It seems that Python 2 just moves the file cursor to a specific position (4110532608 in this case) and reads the remaining bytes, rather than seeking directly to the end of the file. When I ran exactly the same program on the large storage, the position changed to 1073741824, leaving 889310448 bytes to read before reaching the end of the file, which reduced performance a lot!
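
The numbers in the trace are consistent with a seek that lands on a block boundary and then reads the tail of the file. The following is a minimal sketch of that arithmetic, not the actual libc code. The helper name `aligned_seek_plan` is hypothetical, the 4096-byte block size is an assumption taken from the 4096-byte mmap() in the trace, and the 1 GiB block size for the large storage is only an inference from the reported 889310448-byte tail.

```python
# Figures copied from the strace output above.
file_size = 4110535920   # st_size reported by fstat()
block_size = 4096        # assumed buffer size (matches the 4096-byte mmap)


def aligned_seek_plan(size, block):
    """Return (offset, tail): the last block boundary at or before `size`
    and the number of bytes between that boundary and the end of the file."""
    tail = size % block
    return size - tail, tail


# Local disk: matches lseek(3, 4110532608, SEEK_SET) and the 3312-byte read().
offset, tail = aligned_seek_plan(file_size, block_size)
print(offset, tail)

# Hypothetical 1 GiB block size (an assumption, not stated in the report):
# the tail equals the 889310448 bytes reported for the large storage.
big_offset, big_tail = aligned_seek_plan(file_size, 1 << 30)
print(big_offset, big_tail)
```

If this reading is right, the cost of seeking to EOF grows with the block size the file system advertises, which would explain why the same script is fast on a PC and slow on the large storage.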

@yanlinlin82 yanlinlin82 mannequin added topic-IO performance Performance or resource usage labels Jun 2, 2014
@vstinner
Member

vstinner commented Jun 2, 2014

I don't think that Python calls read() directly. Python 2 uses fopen / fread.

Python 3 doesn't use buffered files, but calls open / read directly.

@yanlinlin82
Mannequin Author

yanlinlin82 mannequin commented Jun 3, 2014

I agree that Python 2 presumably uses fopen / fread rather than calling
read() directly. But there may be a misunderstanding here. The 'strace' tool
reports Linux system calls, so it shows read() rather than fread(), and I
guess that fread() ends up calling read() in its implementation.

What I mean is that Python 2's seek(0, 2) does not result in
fseek(0, SEEK_END), but in fseek(somewhere, SEEK_SET) followed by
fread(rest-bytes), which is too inefficient on some kinds of storage.

By the way, Python 3 does not behave like this.


@neologix
Mannequin

neologix mannequin commented Jun 3, 2014

> I agree that Python 2 presumably uses fopen / fread rather than calling
> read() directly. But there may be a misunderstanding here. The 'strace' tool
> reports Linux system calls, so it shows read() rather than fread(), and I
> guess that fread() ends up calling read() in its implementation.
>
> What I mean is that Python 2's seek(0, 2) does not result in
> fseek(0, SEEK_END), but in fseek(somewhere, SEEK_SET) followed by
> fread(rest-bytes), which is too inefficient on some kinds of storage.

Actually, Python does use fopen() and fseek(): the culprit is libc:
$ cat /tmp/test.c; gcc -o /tmp/test /tmp/test.c -Wall; strace /tmp/test
open("/etc/fstab", O_RDONLY) = 3
fstat64(3, {st_mode=S_IFREG|0644, st_size=809, ...}) = 0
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb77ae000
fstat64(3, {st_mode=S_IFREG|0644, st_size=809, ...}) = 0
_llseek(3, 0, [0], SEEK_SET) = 0
read(3, "# /etc/fstab: static file system"..., 809) = 809
close(3) = 0

> By the way, Python 3 does not behave like this.

That's because in Python 3, the IO stack is implemented directly on top of open()/read()/lseek().
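
To illustrate the difference, here is a minimal sketch of reaching EOF through the descriptor-level calls rather than through stdio. The helper name `size_via_lseek` and the temporary demo file are hypothetical. The `os.open` and `os.lseek` functions also exist in Python 2, so something along these lines might serve as a workaround there, though that has not been tested against the setup in this report.

```python
import os
import tempfile


def size_via_lseek(path):
    """Seek to EOF with lseek() on a raw file descriptor.

    No stdio stream is involved, so no buffer is filled and no data is read.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        # lseek() returns the new offset, which at SEEK_END is the file size.
        return os.lseek(fd, 0, os.SEEK_END)
    finally:
        os.close(fd)


# Small throwaway file, used only to demonstrate the call.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"x" * 10000)
    demo_path = tmp.name

demo_size = size_via_lseek(demo_path)
os.remove(demo_path)
print(demo_size)
```

Run under strace, a script like this should show an lseek() with SEEK_END and no read() on the data file.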

It's not the first time we've stumbled upon glibc stdio bugs.

I'd suggest closing this.

@yanlinlin82
Mannequin Author

yanlinlin82 mannequin commented Jun 3, 2014

Thanks! I agree with that.

@yanlinlin82
Mannequin Author

yanlinlin82 mannequin commented Jun 3, 2014

I have confirmed that the problem is in libc. I will try to work around it by updating libc or tuning some related parameters.
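
One parameter that may be worth inspecting is the preferred I/O block size the file system reports for the file. The sketch below prints it next to the file size. It rests on the assumption that libc sizes its stdio buffer from `st_blksize`, which is not confirmed anywhere in this thread. The helper name `describe_file` and the temporary sample file are hypothetical.

```python
import os
import tempfile


def describe_file(path):
    """Return (size, preferred_block_size) for `path`.

    st_blksize is not available on every platform, so fall back to None.
    """
    st = os.stat(path)
    return st.st_size, getattr(st, "st_blksize", None)


# Small throwaway file, used only to demonstrate the call.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"\0" * 5000)
    sample_path = tmp.name

sample_size, sample_blksize = describe_file(sample_path)
os.remove(sample_path)
print(sample_size, sample_blksize)
```

Running `describe_file` on the real input file on both machines would show whether the two file systems advertise different block sizes.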

@yanlinlin82 yanlinlin82 mannequin closed this as completed Jun 3, 2014
@ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022