Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bpo-34043: Optimize tarfile uncompress performance #8089

Merged
merged 3 commits into from Jul 6, 2018

Conversation

Projects
None yet
5 participants
@methane
Copy link
Member

methane commented Jul 4, 2018

tarfile._Stream has two buffer for compressed and uncompressed data.
Those buffers are not aligned so unnecessary bytes slicing happens
for every reading chunks.

This commit bypass compressed buffering.

In this benchmark 1, user time become 250ms from 300ms.

https://bugs.python.org/issue34043

bpo-34043: Optimize tarfile uncompress performance
tarfile._Stream has two buffer for compressed and uncompressed data.
Those buffers are not aligned so unnecessary bytes slicing happens
for every reading chunks.

This commit bypass compressed buffering.

In this benchmark [1], user time become 250ms from 300ms.

[1]: https://bugs.python.org/msg320763
@vstinner
Copy link
Member

vstinner left a comment

LGTM, just a minor nitpick.

# Skip underlaying buffer to avoid unaligned double
# buffering.
if self.buf:
buf, self.buf = self.buf, b""

This comment has been minimized.

Copy link
@vstinner

vstinner Jul 4, 2018

Member

nitpick: I would prefer to do that on two lines, but it's just a matter of taste.

@serhiy-storchaka serhiy-storchaka self-requested a review Jul 4, 2018

@vstinner
Copy link
Member

vstinner left a comment

LGTM. I just saw a typo in a comment.

buf = self.__read(self.bufsize)
if not buf:
break
# Skip underlaying buffer to avoid unaligned double

This comment has been minimized.

Copy link
@vstinner

vstinner Jul 5, 2018

Member

typo? underlying?

@methane methane merged commit 8d13091 into python:master Jul 6, 2018

9 checks passed

VSTS: Linux-PR Linux-PR_20180705.52 succeeded
Details
VSTS: Linux-PR-Coverage Linux-PR-Coverage_20180705.57 succeeded
Details
VSTS: Windows-PR Windows-PR_20180705.51 succeeded
Details
VSTS: docs docs_20180705.68 succeeded
Details
VSTS: macOS-PR macOS-PR_20180705.52 succeeded
Details
bedevere/issue-number Issue number 34043 found
Details
bedevere/news News entry found in Misc/NEWS.d
continuous-integration/appveyor/pr AppVeyor build succeeded
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details

@methane methane deleted the methane:tarfile branch Jul 6, 2018

If size is not defined, return all bytes of the stream
up to EOF.
"""
if size is None:

This comment has been minimized.

Copy link
@serhiy-storchaka

serhiy-storchaka Jul 6, 2018

Member

I have only one question. Why this branch was removed?

This comment has been minimized.

Copy link
@methane

methane Jul 6, 2018

Author Member

This issue is follow up of bpo-34010, (GH-8020).
See also, https://bugs.python.org/issue34010#msg321040

This comment has been minimized.

Copy link
@serhiy-storchaka

serhiy-storchaka Jul 6, 2018

Member

Thank you. Instances of _Stream are never leaked to the end user? Then this change LGTM.

This comment has been minimized.

Copy link
@methane

methane Jul 6, 2018

Author Member

It can be accessed via TarFile.fileobj.
But using it directly is not pragramatic, and TarFile.fileobj.read() caused TypeError because of "".join() bug.
So it must not used for previous versions.

This comment has been minimized.

Copy link
@vstinner

vstinner Jul 6, 2018

Member

I saw that the class is private, but I didn't notice that TarFile.fileobj is public. Maybe just replace the assertion with a regular if/raise, just in case? Maybe raise a NotImplementedError?

lisroach added a commit to lisroach/cpython that referenced this pull request Sep 10, 2018

bpo-34043: Optimize tarfile uncompress performance (pythonGH-8089)
tarfile._Stream has two buffer for compressed and uncompressed data.
Those buffers are not aligned so unnecessary bytes slicing happens
for every reading chunks.

This commit bypass compressed buffering.

In this benchmark [1], user time become 250ms from 300ms.

[1]: https://bugs.python.org/msg320763

dacut added a commit to dacut/cpython that referenced this pull request Sep 16, 2018

bpo-34043: Optimize tarfile uncompress performance (pythonGH-8089)
tarfile._Stream has two buffer for compressed and uncompressed data.
Those buffers are not aligned so unnecessary bytes slicing happens
for every reading chunks.

This commit bypass compressed buffering.

In this benchmark [1], user time become 250ms from 300ms.

[1]: https://bugs.python.org/msg320763

yahya-abou-imran added a commit to yahya-abou-imran/cpython that referenced this pull request Nov 2, 2018

bpo-34043: Optimize tarfile uncompress performance (pythonGH-8089)
tarfile._Stream has two buffer for compressed and uncompressed data.
Those buffers are not aligned so unnecessary bytes slicing happens
for every reading chunks.

This commit bypass compressed buffering.

In this benchmark [1], user time become 250ms from 300ms.

[1]: https://bugs.python.org/msg320763
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.