performance: can we update io.DEFAULT_BUFFER_SIZE to make python IO 3 times faster? :) #117151

morotti · 2024-03-22T11:41:23Z

Bug report

Bug description:

Hello,

I was doing some benchmarking of python and package installation.
That got me down a rabbit hole of buffering optimizations between pip, requests, urllib and the cpython interpreter.

TL;DR I would like to discuss updating the value of io.DEFAULT_BUFFER_SIZE. It has been set to 8192 for 16 years.
original commit: https://github.com/python/cpython/blame/main/Lib/_pyio.py#L27

It was a reasonable size given hardware and OS at the time. It's far from optimal today.
Remember, in 2008 you'd run a 32-bit operating system with less than 2 GB of memory, shared between all running applications.
Buffers had to be small, a few kB; it wasn't conceivable to have buffers measured in entire MB.

I will attach benchmarks in the next messages showing 3 to 5 times write performance improvement when adjusting the buffer size.

I think the python interpreter can adopt a buffer size somewhere between 64k to 256k by default.
I think 64k is the minimum for python and it should be safe to adjust to.
Higher is better for performance in most cases, though there may be some cases where it's unwanted
(seeks and small reads/writes, unwanted triggering of write-ahead, slow devices with throughput measured in kB/s where you don't want to block for long).

In addition, I think there is a bug in open() on Linux.
open() sets the buffer size to the device block size on Linux when available (st_blksize, 4k on most disks), instead of io.DEFAULT_BUFFER_SIZE=8k.
I believe this is unwanted behavior: the block size is the minimal size for IO operations on the IO device, not the optimal size, and it should not be preferred.
I think open() on Linux should be corrected to use a default buffer size of max(st_blksize, io.DEFAULT_BUFFER_SIZE) instead of st_blksize.
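
A minimal sketch of that rule as user-level code, for anyone who wants to try it without patching the interpreter. The helper name `suggested_buffer_size` is hypothetical and this is not what open() does today; the `getattr` fallback is an assumption to cover platforms that do not report st_blksize.

```python
import io
import os
import tempfile


def suggested_buffer_size(path, floor=io.DEFAULT_BUFFER_SIZE):
    """Return max(st_blksize, floor) for the given path.

    Hypothetical helper mirroring the proposal; not the current
    behavior of open().
    """
    # st_blksize is not available on every platform; treat it as 0 there.
    blksize = getattr(os.stat(path), "st_blksize", 0) or 0
    return max(blksize, floor)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "out.bin")
        # Size the buffer from the directory, since the file does not exist yet.
        size = suggested_buffer_size(tmp)
        with open(target, "wb", buffering=size) as f:
            f.write(b"x" * 100_000)
        print("buffer size used:", size)
```

Passing a larger `floor` (for example 128 * 1024) is a way to experiment with the sizes discussed above on a per-file basis.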

Related, the doc might be misleading for saying st_blksize is the preferred size for efficient I/O. https://github.com/python/cpython/blob/main/Doc/library/os.rst#L3181
The GNU doc was updated to clarify: "This is not guaranteed to give optimum performance" https://www.gnu.org/software/gnulib/manual/html_node/stat_002dsize.html

Thoughts?

Annex: some historical context and technical considerations around buffering.

On the hardware side:

HDDs had 512-byte blocks historically, then moved to 4096-byte blocks in the 2010s.
SSDs have 4096-byte blocks as far as I know.

On filesystems:

buffer size should never be smaller than device and filesystem blocksize
I think ext3, ext4, xfs, ntfs, etc... follow the device block size of 4k as default, though they can be configured for any block size.
NTFS is capped to 16TB maximum disk size with 4k blocks.
Microsoft recommends a 64k block size for Windows Server 2019+ and larger disks https://learn.microsoft.com/en-us/windows-server/storage/file-server/ntfs-overview
RAID setups and similar with zfs/btrfs/xfs can have a custom block size, I think anywhere from 4kB to 1MB. I don't know if there is any consensus, I think anything 16k-32k-64k-128k can be seen in the wild.

On network filesystems:

shared network home directories are common on linux (NFS share) and windows (SMB share).
enterprise storage vendors like Pure/Vast/NetApp recommend 524488 or 1048576 bytes for IO.
see rsize wsize in mount settings:
host:path on path type nfs (rw,relatime,vers=3,rsize=1048576,wsize=1048576,acregmin=60,acdirmin=60,hard,proto=tcp,nconnect=8,mountproto=tcp, ...)
for windows I cannot find documentation for network clients, though the windows server should have the NTFS filesystem with at least 64k block size as per microsoft recommendation above.

On pipes:

buffering is used by pipes and for interprocess communications. see subprocess.py
POSIX guarantees that writes to pipes are atomic up to PIPE_BUF bytes: 4096 on the Linux kernel, and at least 512 on any POSIX system.
Python had a default of io.DEFAULT_BUFFER_SIZE=8192 so it never benefitted from that atomic property :D
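
A small sketch for checking that limit on a given machine, assuming a Unix system (both `os.fpathconf` and `select.PIPE_BUF` are Unix-only). The function name `pipe_atomic_limit` is made up for illustration.

```python
import os
import select


def pipe_atomic_limit():
    """Return the largest write size the OS keeps atomic on a fresh pipe."""
    read_fd, write_fd = os.pipe()
    try:
        # fpathconf queries the limit for this specific pipe (Unix only).
        return os.fpathconf(write_fd, "PC_PIPE_BUF")
    finally:
        os.close(read_fd)
        os.close(write_fd)


if __name__ == "__main__":
    print("PIPE_BUF reported for a fresh pipe:", pipe_atomic_limit())
    print("select.PIPE_BUF constant:", select.PIPE_BUF)
```

Any buffered write larger than the reported value can be split by the kernel, which is why an 8192-byte default buffer does not get the atomicity guarantee on Linux.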

on compression code, they probably all need to be adjusted:

the buffer size is used by compression code in cpython: gzip.py lzma.py bz2.py
I think lzma and bz2 are using the default size.
gzip is using a 128 kB read buffer; somebody realized it was very slow 2 years ago and rewrote the buffering to 128k.
then somebody else realized last year it was still very slow to write and added an arbitrary write buffer of 4*io.DEFAULT_BUFFER_SIZE.
eae7dad
GzipFile.write should be buffered #89550
base64 is reading in chunks of 76 characters???
https://github.com/python/cpython/blob/main/Lib/base64.py#L532
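
Until those modules are adjusted, one workaround is to put an explicit buffer in front of the compressor. This is only a sketch under assumptions: the 128 KiB size is illustrative rather than measured, and the helper name `write_bz2_buffered` is hypothetical. It relies on `io.BufferedWriter` accepting a writable `bz2.BZ2File` as its underlying stream.

```python
import bz2
import io
import os
import tempfile

WRITE_BUFFER = 128 * 1024  # assumption: illustrative size, not a measured optimum


def write_bz2_buffered(path, chunks, buffer_size=WRITE_BUFFER):
    """Compress an iterable of bytes chunks to path, batching small writes."""
    # BufferedWriter collects small writes and hands the compressor larger
    # blocks; closing it also flushes and closes the underlying BZ2File.
    with io.BufferedWriter(bz2.BZ2File(path, "wb"), buffer_size=buffer_size) as out:
        for chunk in chunks:
            out.write(chunk)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "data.bz2")
        write_bz2_buffered(target, (b"line %d\n" % i for i in range(10_000)))
        with bz2.open(target, "rb") as f:
            print(len(f.read()), "bytes after round trip")
```

Whether this helps depends on how small the individual writes are; it would need to be benchmarked per module.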

On network IO:

On Linux, TCP read and write buffers were a minimum of 16k historically. The read buffer was increased to 64k in kernel v4.20 (2018).
the buffer is resized dynamically with the TCP window, up to 4 MB for writes and 6 MB for reads; let's not get into TCP. see sysctl_tcp_rmem and sysctl_tcp_wmem
linux code: https://github.com/torvalds/linux/blame/master/net/ipv4/tcp.c#L4775
commit Sep 2018: torvalds/linux@a337531
I think socket buffers are managed separately by the kernel, the io.DEFAULT_BUFFER_SIZE matters when you read a file and write to network, or read from network and write to file.
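
For reference, the kernel-side socket buffer sizes can be inspected from Python with `getsockopt`. A minimal sketch; the function name `socket_buffer_sizes` is hypothetical, and the values are whatever the local kernel assigns to a new, unconnected TCP socket, so they vary by system.

```python
import socket


def socket_buffer_sizes():
    """Return (receive, send) kernel buffer sizes for a new TCP socket."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        rcv = s.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
        snd = s.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF)
    return rcv, snd


if __name__ == "__main__":
    rcv, snd = socket_buffer_sizes()
    print("SO_RCVBUF:", rcv, "SO_SNDBUF:", snd)
```

These numbers are independent of io.DEFAULT_BUFFER_SIZE, which only affects the Python-level buffering of file objects.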

on HTTP, a large subset of networking:

HTTP is large file transfer and would benefit from a much larger buffer, but that's probably more of a concern for urllib/requests.
requests.content is 10k chunk by default.
requests iter_lines(chunk_size=512, decode_unicode=False, delimiter=None) is 512 chunk by default.
requests iter_content(chunk_size=1, decode_unicode=False) is 1 byte by default
source: set in 2012 https://github.com/psf/requests/blame/8dd3b26bf59808de24fd654699f592abf6de581e/src/requests/models.py#L80
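
With defaults that small, one option on the caller side is to regroup chunks into larger blocks before writing them out. A sketch under assumptions: `coalesce` is a hypothetical helper, not part of requests, and the 64 KiB target is illustrative.

```python
def coalesce(chunks, target_size=64 * 1024):
    """Yield blocks of at least target_size bytes (the last may be shorter)."""
    pending = bytearray()
    for chunk in chunks:
        pending += chunk
        if len(pending) >= target_size:
            # Hand out an immutable copy, then reuse the accumulator.
            yield bytes(pending)
            pending.clear()
    if pending:
        yield bytes(pending)


if __name__ == "__main__":
    tiny_chunks = [b"a"] * 1000
    blocks = list(coalesce(tiny_chunks, target_size=256))
    print([len(b) for b in blocks])
```

In a download loop this would wrap the chunk iterator, for example `for block in coalesce(r.iter_content(chunk_size=512)): f.write(block)`. Passing a larger chunk_size to iter_content directly is simpler when the caller controls that call.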

note to self: remember to publish code and result in next message

CPython versions tested on:

3.11

Operating systems tested on:

Other

Linked PRs


morotti · 2024-03-22T11:45:31Z

some benchmarking code I used to debug download and write performance.

import io
import os
import platform
import requests
import sys
import time


def download_file(run, url, filepath, chunksize, buffersize):
    if os.path.exists(filepath):
        os.remove(filepath)
    calls = 0
    start = time.perf_counter()
    write_duration = 0.0
    with requests.get(url, stream=True) as r:
        r.raise_for_status()
        with open(filepath, 'wb', buffering=buffersize) as f:
            st_blksize =  os.stat(filepath).st_blksize
            for chunk in r.iter_content(chunk_size=chunksize):
                calls = calls + 1
                t1 = time.perf_counter()
                f.write(chunk)
                t2 = time.perf_counter()
                write_duration = write_duration + (t2 - t1)
    end = time.perf_counter()
    function_duration = end - start
    print(
        "run={} filepath={} total_duration={} download_chunksize={} write_duration={} write_buffersize={} calls={} st_blksize={}".format(
            run, filepath, function_duration, chunksize, write_duration, buffersize, calls, st_blksize
        ))


def main():
    print("python {} running on {}".format(sys.version, platform.platform()))
    NUMPY_WHEEL = "https://example.com/numpy/1.21.6/numpy-1.21.6-cp38-cp38-manylinux_2_12_x86_64.manylinux2010_x86_64.whl"
    for run in range(0, 10):
        for download_directory in [os.path.abspath(".")]:
            for url in [NUMPY_WHEEL]:
                download_path = os.path.join(download_directory, url.rsplit("/", 1)[1])
                for download_chunksize in [512, 1024, 2048, 4096, 8192,
                                           10000, 16384, 32768, 65536, 131072, 262144, 524488,
                                           1048576, 2097152, 4194304, 8388608, 16777216]:
                    for file_buffersize in [0, 4096, 8192, 65536, 262144, 1048576]:
                        download_file(run, url, download_path, download_chunksize, file_buffersize)


if __name__ == "__main__":
    main()

morotti · 2024-03-22T11:49:00Z

benchmark results, running on python 3.11

various OS and storage.

Fidget-Spinner · 2024-03-22T15:25:40Z

I think your argument makes sense, consumer RAM sizes have more than quadrupled in the past 16 years IIRC, so it shouldn't hurt to increase buffer sizes.

I cannot champion this though, because I am currently wrapped up in too many things. Sorry.

masklinn · 2024-03-22T19:52:00Z

> SSD have 4096 bytes blocks as far as I know.

AFAIK SSDs have 4 to 8k pages. An SSD block contains up to 256 pages. The NVMe capabilities of the drive are also a factor, as an NVM command can generally transfer a multiple of the page size.

… are equal to the buffer size. avoid extra memory copy. BufferedWriter() was buffering calls that are the exact same size as the buffer. it's a very common case to read/write in blocks of the exact buffer size. it's pointless to copy a full buffer, it's costing extra memory copy and the full buffer will have to be written in the next call anyway.

eendebakpt · 2024-04-19T13:41:27Z

I can confirm this improves performance. @morotti Could you open a PR?


…he buffer size (GH-118037) BufferedWriter() was buffering calls that are the exact same size as the buffer. it's a very common case to read/write in blocks of the exact buffer size. it's pointless to copy a full buffer, it's costing extra memory copy and the full buffer will have to be written in the next call anyway. Co-authored-by: rmorotti <romain.morotti@man.com>

morotti · 2024-04-29T13:32:26Z

@eendebakpt I opened a PR, can you review?

#118144

eendebakpt · 2024-04-29T21:49:36Z

> @eendebakpt I opened a PR, can you review?
>
> #118144

Yes, i'll have a look in a couple of days.

… to 256k. it was set to 16k in the 1990s. it was raised to 64k in 2019. the discussion at the time mentioned another 5% improvement by raising to 128k and settled for a very conservative setting. it's 2024 now, I think it should be revisited to match modern hardware. I am measuring 0-15% performance improvement when raising to 256k on various types of disk. there is no downside as far as I can tell. this function is only intended for sequential copy of full files (or file like objects). it's the typical use case that benefits from larger operations. for reference, I came across this function while trying to profile pip that is using it to copy files when installing python packages.

serhiy-storchaka · 2024-04-30T15:21:42Z

In what cases st_blksize is larger than 128 KiB?

morotti · 2024-04-30T18:08:06Z

st_blksize is the block size reported by the device.

I see it larger than 128kB on NFS network filesystems, like in the benchmark I submitted above.
The value matches the rsize set in the NFS mount settings.
It is set to 524488 or 1048576 for the two enterprise storage vendors I have hardware from, as per their recommended settings, which are optimal settings for their respective hardware. (apologies, I'm not sure I have permissions to name brands and benchmarks ^^).

It can be seen on any filesystem where a larger block size was set. It's a free setting when the filesystem is created. I think most filesystems (XFS/ZFS/EXT4) allow setting any block size from 4k to 1M or so. I think more than 128k can be seen for some RAID setups with enough large disks.

Microsoft recommends a 64k block size for Windows Server 2019+: a 4k block size is limited to 16 TB volumes, a 64k block size is limited to 256 TB volumes. The block size can be set up to 2M.
It should be possible to see it on Linux, if mounting a volume remotely and the mount can be configured to expose the block size from the server or set to the same size.
https://learn.microsoft.com/en-us/windows-server/storage/file-server/ntfs-overview

I think it should be visible as well for s3 filesystems, but I don't have one to test anymore.
There is a thing to mount s3 buckets directly as a filesystem on Linux. It should hint huge blocks because the HTTP overhead is huge.

Basically, anything involving large disks, storage appliances, network and specialized filesystems.

morotti added the type-bug An unexpected behavior, bug, or error label Mar 22, 2024

hugovk added performance Performance or resource usage stdlib Python modules in the Lib dir labels Mar 22, 2024

Eclips4 added topic-IO and removed type-bug An unexpected behavior, bug, or error labels Mar 22, 2024

morotti pushed a commit to man-group/cpython that referenced this issue Apr 2, 2024

pythongh-117151: IO performance improvement, increase io.DEFAULT_BUFF…

e9b2714

…ER_SIZE to 128k, fix open() to use max(st_blksize, io.DEFAULT_BUFFER_SIZE) performance:

morotti mentioned this issue Apr 18, 2024

gh-117151: optimize BufferedWriter(), do not buffer writes that are the buffer size #118037

Merged

morotti mentioned this issue Apr 22, 2024

gh-117151: IO performance improvement, increase io.DEFAULT_BUFFER_SIZE to 128k #118144

Open

morotti pushed a commit to man-group/cpython that referenced this issue Apr 22, 2024

pythongh-117151: IO performance improvement, increase io.DEFAULT_BUFF…

33cd532

…ER_SIZE to 128k, fix open() to use max(st_blksize, io.DEFAULT_BUFFER_SIZE) performance:

morotti pushed a commit to man-group/cpython that referenced this issue Apr 22, 2024

pythongh-117151: IO performance improvement, increase io.DEFAULT_BUFF…

8726798

…ER_SIZE to 128k, fix open() to use max(st_blksize, io.DEFAULT_BUFFER_SIZE)

morotti pushed a commit to man-group/cpython that referenced this issue Apr 30, 2024

pythongh-117151: IO performance improvement, increase io.DEFAULT_BUFF…

3668d1a

…ER_SIZE to 128k, fix open() to use max(st_blksize, io.DEFAULT_BUFFER_SIZE)
