Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GzipFile.write should be buffered #89550

Closed
rhpvorderman mannequin opened this issue Oct 6, 2021 · 2 comments
Closed

GzipFile.write should be buffered #89550

rhpvorderman mannequin opened this issue Oct 6, 2021 · 2 comments
Labels
3.12 bugs and security fixes performance Performance or resource usage stdlib Python modules in the Lib dir type-feature A feature request or enhancement

Comments

@rhpvorderman
Copy link
Mannequin

rhpvorderman mannequin commented Oct 6, 2021

BPO 45387
Nosy @marcelm, @rhpvorderman

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

Show more details

GitHub fields:

assignee = None
closed_at = None
created_at = <Date 2021-10-06.08:42:34.497>
labels = ['3.8', '3.9', '3.10', 'performance', '3.7', 'library', '3.11']
title = 'GzipFile.write should be buffered'
updated_at = <Date 2021-10-06.08:59:49.938>
user = 'https://github.com/rhpvorderman'

bugs.python.org fields:

activity = <Date 2021-10-06.08:59:49.938>
actor = 'marcelm'
assignee = 'none'
closed = False
closed_date = None
closer = None
components = ['Library (Lib)']
creation = <Date 2021-10-06.08:42:34.497>
creator = 'rhpvorderman'
dependencies = []
files = []
hgrepos = []
issue_num = 45387
keywords = []
message_count = 1.0
messages = ['403289']
nosy_count = 2.0
nosy_names = ['marcelm', 'rhpvorderman']
pr_nums = []
priority = 'normal'
resolution = None
stage = None
status = 'open'
superseder = None
type = 'performance'
url = 'https://bugs.python.org/issue45387'
versions = ['Python 3.6', 'Python 3.7', 'Python 3.8', 'Python 3.9', 'Python 3.10', 'Python 3.11']

Linked PRs

@rhpvorderman
Copy link
Mannequin Author

rhpvorderman mannequin commented Oct 6, 2021

Please consider the following code snippet:

    import gzip
    import sys

    with gzip.open(sys.argv[1], "rt") as in_file_h:
        with gzip.open(sys.argv[2], "wt", compresslevel=1) as out_file_h:
            for line in in_file_h:
                # Do processing on line here
                modified_line = line
                # End processing
                out_file_h.write(modified_line)

This is very slow, due to write being called for every line. This is the current implementation of write:

def write(self,data):

It:

  • Checks if the file is not closed
  • Checks if the correct mode is set
  • Checks if the file is not closed (again, but in a different way)
  • Checks if the data is bytes, bytearray or something that supports the buffer protocol
  • Gets the length
  • Compresses the data
  • updates the size and offset
  • updates the checksum

Doing this for every line written is very costly and creates a lot of overhead in Python calls. We spent a lot of time in Python and a lot less in the fast C zlib code that does the actual compression.

This problem is already solved on the read side. A _GzipReader object is used for reading. This is put in an io.BufferedReader which is used as the underlying buffer for GzipFile.read. This way, lines are read quite fast from a GzipFile without the checksum etc. being updated on every line read.

A similar solution should be written for write.
I volunteer (I have done some other work on gzip.py already), although I cannot give an ETA at this time.

@rhpvorderman rhpvorderman mannequin added stdlib Python modules in the Lib dir 3.7 (EOL) end of life 3.8 only security fixes 3.9 only security fixes 3.10 only security fixes 3.11 only security fixes performance Performance or resource usage labels Oct 6, 2021
@ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
@iritkatriel iritkatriel added type-feature A feature request or enhancement 3.12 bugs and security fixes and removed 3.11 only security fixes 3.10 only security fixes 3.9 only security fixes 3.8 only security fixes 3.7 (EOL) end of life labels Sep 7, 2022
gpshead added a commit that referenced this issue May 8, 2023
…1251)

Use `io.BufferedWriter` to buffer gzip writes.

---------

Co-authored-by: Alex Waygood <Alex.Waygood@Gmail.com>
Co-authored-by: Gregory P. Smith <greg@krypto.org>
jbower-fb pushed a commit to jbower-fb/cpython-jbowerfb that referenced this issue May 8, 2023
python#101251)

Use `io.BufferedWriter` to buffer gzip writes.

---------

Co-authored-by: Alex Waygood <Alex.Waygood@Gmail.com>
Co-authored-by: Gregory P. Smith <greg@krypto.org>
carljm added a commit to carljm/cpython that referenced this issue May 9, 2023
* main: (47 commits)
  pythongh-97696 Remove unnecessary check for eager_start kwarg (python#104188)
  pythonGH-104308: socket.getnameinfo should release the GIL (python#104307)
  pythongh-104310: Add importlib.util.allowing_all_extensions() (pythongh-104311)
  pythongh-99113: A Per-Interpreter GIL! (pythongh-104210)
  pythonGH-104284: Fix documentation gettext build (python#104296)
  pythongh-89550: Buffer GzipFile.write to reduce execution time by ~15% (python#101251)
  pythongh-104223: Fix issues with inheriting from buffer classes (python#104227)
  pythongh-99108: fix typo in Modules/Setup (python#104293)
  pythonGH-104145: Use fully-qualified cross reference types for the bisect module (python#104172)
  pythongh-103193: Improve `getattr_static` test coverage (python#104286)
  Trim trailing whitespace and test on CI (python#104275)
  pythongh-102500: Remove mention of bytes shorthand (python#104281)
  pythongh-97696: Improve and fix documentation for asyncio eager tasks (python#104256)
  pythongh-99108: Replace SHA3 implementation HACL* version (python#103597)
  pythongh-104273: Remove redundant len() calls in argparse function (python#104274)
  pythongh-64660: Don't hardcode Argument Clinic return converter result variable name (python#104200)
  pythongh-104265 Disallow instantiation of `_csv.Reader` and `_csv.Writer` (python#104266)
  pythonGH-102613: Improve performance of `pathlib.Path.rglob()` (pythonGH-104244)
  pythongh-103650: Fix perf maps address format (python#103651)
  pythonGH-89812: Churn `pathlib.Path` methods (pythonGH-104243)
  ...
carljm added a commit to carljm/cpython that referenced this issue May 9, 2023
* main: (29 commits)
  pythongh-104276: Make `_struct.unpack_iterator` type use type flag instead of custom constructor (python#104277)
  pythongh-97696: Move around and update the whatsnew entry for asyncio eager task factory (python#104298)
  pythongh-103193: Fix refleaks in `test_inspect` and `test_typing` (python#104320)
  require-pr-label.yml: Add missing "permissions:" (python#104309)
  pythongh-90656: Add platform triplets for 64-bit LoongArch (LA64) (python#30939)
  pythongh-104180: Read SOCKS proxies from macOS System Configuration (python#104181)
  pythongh-97696 Remove unnecessary check for eager_start kwarg (python#104188)
  pythonGH-104308: socket.getnameinfo should release the GIL (python#104307)
  pythongh-104310: Add importlib.util.allowing_all_extensions() (pythongh-104311)
  pythongh-99113: A Per-Interpreter GIL! (pythongh-104210)
  pythonGH-104284: Fix documentation gettext build (python#104296)
  pythongh-89550: Buffer GzipFile.write to reduce execution time by ~15% (python#101251)
  pythongh-104223: Fix issues with inheriting from buffer classes (python#104227)
  pythongh-99108: fix typo in Modules/Setup (python#104293)
  pythonGH-104145: Use fully-qualified cross reference types for the bisect module (python#104172)
  pythongh-103193: Improve `getattr_static` test coverage (python#104286)
  Trim trailing whitespace and test on CI (python#104275)
  pythongh-102500: Remove mention of bytes shorthand (python#104281)
  pythongh-97696: Improve and fix documentation for asyncio eager tasks (python#104256)
  pythongh-99108: Replace SHA3 implementation HACL* version (python#103597)
  ...
carljm added a commit to carljm/cpython that referenced this issue May 9, 2023
* main: (156 commits)
  pythongh-97696 Add documentation for get_coro() behavior with eager tasks (python#104304)
  pythongh-97933: (PEP 709) inline list/dict/set comprehensions (python#101441)
  pythongh-99889: Fix directory traversal security flaw in uu.decode() (python#104096)
  pythongh-104184: fix building --with-pydebug --enable-pystats (python#104217)
  pythongh-104139: Add itms-services to uses_netloc urllib.parse. (python#104312)
  pythongh-104240: return code unit metadata from codegen (python#104300)
  pythongh-104276: Make `_struct.unpack_iterator` type use type flag instead of custom constructor (python#104277)
  pythongh-97696: Move around and update the whatsnew entry for asyncio eager task factory (python#104298)
  pythongh-103193: Fix refleaks in `test_inspect` and `test_typing` (python#104320)
  require-pr-label.yml: Add missing "permissions:" (python#104309)
  pythongh-90656: Add platform triplets for 64-bit LoongArch (LA64) (python#30939)
  pythongh-104180: Read SOCKS proxies from macOS System Configuration (python#104181)
  pythongh-97696 Remove unnecessary check for eager_start kwarg (python#104188)
  pythonGH-104308: socket.getnameinfo should release the GIL (python#104307)
  pythongh-104310: Add importlib.util.allowing_all_extensions() (pythongh-104311)
  pythongh-99113: A Per-Interpreter GIL! (pythongh-104210)
  pythonGH-104284: Fix documentation gettext build (python#104296)
  pythongh-89550: Buffer GzipFile.write to reduce execution time by ~15% (python#101251)
  pythongh-104223: Fix issues with inheriting from buffer classes (python#104227)
  pythongh-99108: fix typo in Modules/Setup (python#104293)
  ...
carljm added a commit to carljm/cpython that referenced this issue May 9, 2023
* main: (35 commits)
  pythongh-97696 Add documentation for get_coro() behavior with eager tasks (python#104304)
  pythongh-97933: (PEP 709) inline list/dict/set comprehensions (python#101441)
  pythongh-99889: Fix directory traversal security flaw in uu.decode() (python#104096)
  pythongh-104184: fix building --with-pydebug --enable-pystats (python#104217)
  pythongh-104139: Add itms-services to uses_netloc urllib.parse. (python#104312)
  pythongh-104240: return code unit metadata from codegen (python#104300)
  pythongh-104276: Make `_struct.unpack_iterator` type use type flag instead of custom constructor (python#104277)
  pythongh-97696: Move around and update the whatsnew entry for asyncio eager task factory (python#104298)
  pythongh-103193: Fix refleaks in `test_inspect` and `test_typing` (python#104320)
  require-pr-label.yml: Add missing "permissions:" (python#104309)
  pythongh-90656: Add platform triplets for 64-bit LoongArch (LA64) (python#30939)
  pythongh-104180: Read SOCKS proxies from macOS System Configuration (python#104181)
  pythongh-97696 Remove unnecessary check for eager_start kwarg (python#104188)
  pythonGH-104308: socket.getnameinfo should release the GIL (python#104307)
  pythongh-104310: Add importlib.util.allowing_all_extensions() (pythongh-104311)
  pythongh-99113: A Per-Interpreter GIL! (pythongh-104210)
  pythonGH-104284: Fix documentation gettext build (python#104296)
  pythongh-89550: Buffer GzipFile.write to reduce execution time by ~15% (python#101251)
  pythongh-104223: Fix issues with inheriting from buffer classes (python#104227)
  pythongh-99108: fix typo in Modules/Setup (python#104293)
  ...
@hugovk
Copy link
Member

hugovk commented Dec 11, 2023

A PR has been merged, thanks for the report!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3.12 bugs and security fixes performance Performance or resource usage stdlib Python modules in the Lib dir type-feature A feature request or enhancement
Projects
Status: Done
Development

No branches or pull requests

2 participants