Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Don't decode files while computing their hash #4957

Merged
merged 2 commits into from Apr 24, 2018

Conversation

Projects
None yet
2 participants
@msullivan
Copy link
Collaborator

msullivan commented Apr 24, 2018

Instead, cache undecoded bytes and decode in client code. Happily,
this allows us to get rid of FileSystemCache's dependency on the
python version.

This is a big performance win for daemon cold runs on codebases that
make use of custom encodings where decoding is expensive.

This gives about a 5s speed up on cold runs on S when the cache mtimes don't match.

@msullivan msullivan requested review from JukkaL and gvanrossum Apr 24, 2018

Don't decode files while computing their hash
Instead, cache undecoded bytes and decode in client code.  Happily,
this allows us to get rid of FileSystemCache's dependency on the
python version.

This is a big performance win for daemon cold runs on codebases that
make use of custom encodings where decoding is expensive.

@msullivan msullivan force-pushed the no-decode branch from 4e3f61b to f27139f Apr 24, 2018

source = source[3:]
else:
_encoding, _ = find_python_encoding(source_start, pyversion)
# check that the coding isn't mypy. We skip it since

This comment has been minimized.

@gvanrossum

gvanrossum Apr 24, 2018

Member

I'm pretty sure we got rid of the mypy codec in 2016.

@@ -3,6 +3,7 @@
import re
import subprocess
import hashlib
from io import BytesIO

This comment has been minimized.

@gvanrossum

gvanrossum Apr 24, 2018

Member

I'd prefer the style import io here (in general for stdlib modules, though I'm fine with escape below).

@@ -121,20 +121,20 @@ def exists(self, path: str) -> bool:
return True


# TODO: Merge FileSystemMetaCache back into this?

This comment has been minimized.

@gvanrossum

gvanrossum Apr 24, 2018

Member

Would it be much work to do that now? Why keep them separate?

This comment has been minimized.

@msullivan

msullivan Apr 24, 2018

Author Collaborator

Not much work, but I didn't want to clutter up this PR. I'll merge them in another PR after this?

return source_text, m.hexdigest()
# look at first two lines and check if PEP-263 coding is present
f = BytesIO(source)
source_start = f.readline() + f.readline()

This comment has been minimized.

@gvanrossum

gvanrossum Apr 24, 2018

Member

Do you really need to capture the first two lines to pass to find_python_encoding()? It would seem that function works just as well if you just pass it source.

This comment has been minimized.

@msullivan

msullivan Apr 24, 2018

Author Collaborator

Looks that way, yeah

# check for BOM UTF-8 encoding and strip it out if present
if source_start.startswith(b'\xef\xbb\xbf'):
encoding = 'utf8'
source = source[3:]

This comment has been minimized.

@gvanrossum

gvanrossum Apr 24, 2018

Member

(For a large file this copies (nearly) the entire file an extra time. But I don't know if anyone cares.)

This comment has been minimized.

@msullivan

msullivan Apr 24, 2018

Author Collaborator

Nobody caring was my theory, yeah

This comment has been minimized.

@msullivan

msullivan Apr 24, 2018

Author Collaborator

What would the right way to fix that be, though? Is there a way to decode a substring of a bytes?

@msullivan msullivan merged commit 9e564e1 into master Apr 24, 2018

3 checks passed

continuous-integration/appveyor/pr AppVeyor build succeeded
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
continuous-integration/travis-ci/push The Travis CI build passed
Details

@msullivan msullivan deleted the no-decode branch Apr 24, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.