Rework cache to key on hash of file contents instead of mtime #3437

Merged
merged 7 commits into python:master on Jun 13, 2017

Conversation

@pkch
Contributor

pkch commented May 24, 2017

Attempt to fix #3403

At present, I'm only adding the ability to validate the cache using a content hash even when the module file mtime doesn't match the meta mtime.

However, I'm NOT getting rid of the use of data_mtime (the modified time of the cache file) for the purpose of finding dependencies that tell us that a given module needs to be reparsed. This is much more complex than I thought, and it may also be possible to work around it by using @JukkaL's suggestion of putting the entire cache into a tarball to preserve cache mtimes.

To verify that the approach works, I first create a failing test by changing the test runner to touch all the source files before running mypy for incremental tests. Then, in the second commit, I fix the broken tests.

How this whole thing should be tested is something I'd like feedback on (running the tests twice is too time-consuming, I think?). As it stands, this PR only tests the new approach and loses the tests for the old approach (with mtime/size). Maybe some kind of combination of the two would be good?
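Concretely, the test-runner change boils down to bumping each source file's mtime between incremental runs without touching the contents. A minimal sketch of that idea (hypothetical helper, not the actual testcheck code):

```python
import os
import time

def touch_sources(paths):
    # Bump mtime/atime without changing file contents, so a purely
    # mtime-based cache would wrongly consider every file modified.
    new_time = time.time() + 1  # make sure the timestamp actually moves
    for path in paths:
        os.utime(path, (new_time, new_time))
```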

@pkch pkch changed the title from [WIP] to [WIP] Rework cache to key on hash of file contents instead of mtime May 24, 2017

@pkch pkch changed the title from [WIP] Rework cache to key on hash of file contents instead of mtime to Rework cache to key on hash of file contents instead of mtime May 25, 2017

@pkch
Contributor

pkch commented May 25, 2017

To summarize: this PR eliminates the reliance on mtime (the modified time of source files) but leaves untouched the reliance on data_mtime (the modified time of cache files).

mypy/build.py (outdated diff)
@@ -833,6 +836,11 @@ def compute_hash(text: str) -> str:
    return hashlib.md5(text.encode('utf-8')).hexdigest()

def compute_module_hash(path: str) -> str:
    with open(path, 'r') as f:
        return compute_hash(f.read())

@gvanrossum
Member

gvanrossum May 25, 2017

That's pretty inefficient -- we read the source code in text mode (using the default encoding) and then compute_hash() encodes it back to bytes (using utf-8) before hashing.

I also believe that we have the module source code already as an attribute in State in most cases, so there's not even a need to read it from the file (it will be GC'ed by parse_file()).

Also, the source should probably be read using read_with_python_encoding(), which does a few tricks.

Finally, we don't need this hash to be compatible with the interface hash (which is why compute_hash() exists currently).

@pkch
Contributor

pkch May 25, 2017

Then perhaps we should hash the binary, and not even bother with decoding to a string in the case when we need to read the file from disk?

And yes, sometimes .source is already populated; but it's populated with a string. So I'm planning to split read_with_python_encoding into reading the binary and then decoding, so that in between the two steps we can calculate and store the hash.

If that's too troublesome, I can instead standardize the hash to always use the string.

Any preferences?
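For concreteness, the split I have in mind would look roughly like this (a sketch only; the helper name and return type are placeholders, not the eventual read_with_python_encoding signature):

```python
import hashlib
from typing import Tuple

def read_source_and_hash(path: str) -> Tuple[str, str]:
    # Read the raw bytes once, hash them, then decode to a string.
    with open(path, 'rb') as f:
        data = f.read()
    source_hash = hashlib.md5(data).hexdigest()
    # The real code would detect the PEP 263 coding cookie / BOM here
    # instead of assuming UTF-8.
    return data.decode('utf-8'), source_hash
```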

@pkch
Contributor

pkch commented May 25, 2017

@gvanrossum I added the optimization along the lines you suggested. Now whenever the source code is parsed, we also calculate the source hash (based on the byte representation). I don't think there's any need to ever calculate the source hash separately from parsing: we only need the source hash inside write_cache, and by then we of course must have parsed the source.

@pkch
Contributor

pkch commented May 25, 2017

Actually, we need to calculate the source hash in two places: once when parsing the source file, and once when checking whether the cache is in sync with the source. In the first case, we can do it without an extra file read; in the second, we do need to read the file again (assuming mtime/size didn't give us a positive answer).

The new commit implements this optimization.

@gvanrossum gvanrossum self-assigned this Jun 5, 2017

@gvanrossum

I really like this! But there are a few issues still... Let me know if you have time to work on those, else I will take over the development of this PR.

mypy/build.py (outdated diff)
@@ -1421,7 +1433,8 @@ def parse_file(self) -> None:
        if self.path and source is None:
            try:
                path = manager.maybe_swap_for_shadow_path(self.path)
                source = read_with_python_encoding(path, self.options.python_version)
                source, self.source_hash = read_with_python_encoding(path,
                    self.options.python_version)

@gvanrossum
Member

gvanrossum Jun 9, 2017

Fix indent.

mypy/build.py (outdated diff)
@@ -710,7 +715,7 @@ def read_with_python_encoding(path: str, pyversion: Tuple[int, int]) -> str:
        source_bytearray.decode(encoding)
    except LookupError as lookuperr:
        raise DecodeError(str(lookuperr))
    return source_bytearray.decode(encoding)
    return source_bytearray.decode(encoding), hashlib.md5(source_bytearray).hexdigest()

@gvanrossum
Member

gvanrossum Jun 9, 2017

This is incorrect if a BOM marker is present, since on line 705 above the BOM marker is removed from source_bytearray.

@pkch
Contributor

pkch Jun 11, 2017

Fixed in c115f13. I'm not sure why we make the decision about encoding and BOM after reading just 2 lines rather than after reading the entire file; is it just because source_bytearray = source_bytearray[3:] is expensive when source_bytearray contains the entire file? Anyway, I assume that's the reason, so to avoid calling f.read() twice, I use hashlib.md5.update().
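The shape of that change, roughly: feed the hash incrementally from the same reads used for encoding detection, so the raw bytes (BOM included) are hashed without a second f.read(). A simplified sketch, not the exact code in c115f13:

```python
import hashlib
from typing import Tuple

def read_and_hash(path: str) -> Tuple[bytearray, str]:
    m = hashlib.md5()
    raw = bytearray()
    with open(path, 'rb') as f:
        # The first two lines are enough to detect a BOM or coding cookie.
        for _ in range(2):
            line = f.readline()
            m.update(line)      # hash the raw bytes, BOM and all
            raw.extend(line)
        rest = f.read()
        m.update(rest)
        raw.extend(rest)
    # Encoding detection and BOM stripping can now operate on `raw`
    # without affecting the hash, which already saw the untouched bytes.
    return raw, m.hexdigest()
```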

mypy/build.py (outdated diff)
@@ -809,8 +815,11 @@ def is_meta_fresh(meta: Optional[CacheMeta], id: str, path: str, manager: BuildM
    # TODO: Share stat() outcome with find_module()
    st = manager.get_stat(path)  # TODO: Errors
    if st.st_mtime != meta.mtime or st.st_size != meta.size:
        manager.log('Metadata abandoned for {}: file {} is modified'.format(id, path))
        return False
        with open(path, 'rb') as f:

@gvanrossum
Member

gvanrossum Jun 9, 2017

You only need to compare the hash when the sizes are equal.

There's another subtle issue here: if the mtime differs but the size and hash match, we don't rewrite the meta.json file (AFAICT), so that means that from then on we always hash the file. Now, the hashing seems so fast that this barely matters, but I want to use this for a huge codebase, so I'm still worried about this (else why bother with the mtime check).

Also I'd like to see a log message if the mtime differs, but the size and hash are the same (this helps validating that it works).
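Roughly what I have in mind (a sketch only; the meta.hash field name and the exact log wording are placeholders):

```python
import hashlib

def source_matches_meta(meta, path, id, manager):
    st = manager.get_stat(path)
    if st.st_size != meta.size:
        manager.log('Metadata abandoned for {}: file {} has different size'.format(id, path))
        return False
    if st.st_mtime != meta.mtime:
        # Sizes match, so only now pay for hashing the contents.
        with open(path, 'rb') as f:
            source_hash = hashlib.md5(f.read()).hexdigest()
        if source_hash != meta.hash:
            manager.log('Metadata abandoned for {}: file {} has different hash'.format(id, path))
            return False
        manager.log('Metadata ok for {}: file {} (match on size, hash)'.format(id, path))
        # Refresh the recorded mtime so we don't re-hash this file on every
        # subsequent run (the updated meta would also be written back to disk).
        meta = meta._replace(mtime=st.st_mtime)
    return True
```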

@pkch
Contributor

pkch Jun 11, 2017

You only need to compare the hash when the sizes are equal.

Ah right, I left it as a placeholder but then forgot about it. Originally, I was thinking that maybe we should use a slightly more intelligent hash that ignores comments and non-semantic whitespace (in which case a size change doesn't necessarily imply that the cache is invalid). Do you think it's worth doing that, or should I just leave the simple hash in place?

Edit: given that the interface hash is not too expensive and will take care of simple modifications to the file, and given that removing comments is not super fast (it needs the tokenize module), I suppose the "smart hash" isn't worth it.

@pkch
Contributor

pkch Jun 11, 2017

There's another subtle issue here: if the mtime differs but the size and hash match, we don't rewrite the meta.json file (AFAICT), so that means that from then on we always hash the file.

Ah, that is very subtle. We should either completely remove mtime from the cache logic, or we should update it in this scenario. I prefer the latter because writing meta.json is faster than reading and hashing the source file, and I'm guessing on average we can expect at least one additional read of the source file before it's completely invalidated.

@pkch
Contributor

pkch Jun 11, 2017

I attempted to do the mtime optimization in 52a5888.

@gvanrossum
Member

gvanrossum Jun 12, 2017

FWIW "smart hash" sounds like a terrible idea, since it essentially comes down to parsing. People are used to the concept of hashing the contents of a file, and understand it to be some hash of the bytes.

@pkch
Contributor

pkch commented Jun 10, 2017

Yup, I have time, will work on it this weekend ~

@gvanrossum
Member

gvanrossum commented Jun 10, 2017
@gvanrossum

Almost there! I really just have very small refactoring wishes, and they are optional.

mypy/build.py (outdated diff)
@@ -842,6 +875,17 @@ def compute_hash(text: str) -> str:
    return hashlib.md5(text.encode('utf-8')).hexdigest()

def atomic_write(filename: str, s: str) -> bool:

@gvanrossum
Member

gvanrossum Jun 12, 2017

I think it would be more consistent if atomic_write() preceded the first function that uses it (i.e. before validate_meta()). That would mean random_string() also needs to move up there. It's optional to move these.
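For context, an atomic write helper of this kind typically writes to a uniquely named temporary file in the same directory and then os.replace()s it into place. A sketch under that assumption (the PR's actual atomic_write() and random_string() may differ in detail):

```python
import os
import random
import string

def atomic_write(filename: str, s: str) -> bool:
    # Write to a temporary file next to the target, then atomically swap
    # it in, so a concurrent reader never sees a partially written file.
    tmp = filename + '.' + ''.join(random.choice(string.ascii_lowercase) for _ in range(8))
    try:
        with open(tmp, 'w') as f:
            f.write(s)
        os.replace(tmp, filename)
    except OSError:
        return False
    return True
```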

mypy/build.py (outdated diff)
    # This requires two steps. The first is obvious: we check that the module source file
    # contents is the same as it was when the cache file was created. The second is not
    # obvious: we need to check that the dependencies we relied on when creating that
    # cache file have not changed. We use cache file mtime as a way to propagate

@gvanrossum
Member

gvanrossum Jun 12, 2017

cache file --> cache data file (because the rest of this function is about the cache meta file).

mypy/build.py (outdated diff)
    else:
        manager.log('Metadata ok for {}: file {} (match on size, hash)'.format(id, path))
        # Optimization: update meta.mtime (otherwise, this mismatch will not disappear).
        meta = meta._replace(mtime = st.st_mtime)

@gvanrossum
Member

gvanrossum Jun 12, 2017

Whitespace nit: no spaces around = for keyword args.

mypy/build.py (outdated diff)
    else:
        meta_str = json.dumps(meta)
        meta_json, _ = get_cache_names(id, os.path.abspath(path), manager)
        manager.log('Updating mtime {} {} {} {}'.format(id, path, meta_json, meta.mtime))

@gvanrossum
Member

gvanrossum Jun 12, 2017

Can you update this to follow the format of other similar messages, e.g. "Updating mtime for {}: file {}, meta {}, mtime {}"? Or use trace() so it only shows up for double -v.

mypy/build.py (outdated diff)
os.replace(data_json_tmp, data_json)
data_mtime = os.path.getmtime(data_json)
except os.error as err:
data_str += '\n' # Fast in CPython (it does this in-place if len(data_str) under 10^6)

@gvanrossum
Member

gvanrossum Jun 12, 2017

Nice bit of research! Alternatively, you could make atomic_write() take varargs; it could use f.writelines(*args). Though in our internal code base there's only one cache file greater than 1M, so changing that is totally optional.
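A varargs version might look like this (a sketch; note that writelines() takes the tuple of fragments directly and adds no separators):

```python
import os

def atomic_write(filename: str, *lines: str) -> bool:
    # Accept several string fragments and write them in one pass, so the
    # caller doesn't have to build one large concatenated string first.
    tmp = filename + '.tmp'
    try:
        with open(tmp, 'w') as f:
            f.writelines(lines)  # writes each fragment as-is, no newlines added
        os.replace(tmp, filename)
    except OSError:
        return False
    return True
```

The call site could then pass the trailing newline as a separate fragment, e.g. atomic_write(data_json, data_str, '\n'), instead of appending it to data_str first.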

@pkch
Contributor

pkch Jun 13, 2017

I refactored that, because I found out that it's very unpredictable when this optimization kicks in (what I thought was a ~10**6 cutoff turned out to be an OS-dependent zone where realloc starts to fail more and more often).

mypy/build.py (outdated diff)
    '''
    # This requires two steps. The first is obvious: we check that the module source file
    # contents is the same as it was when the cache file was created. The second is not
    # obvious: we need to check that the dependencies we relied on when creating that

@gvanrossum
Member

gvanrossum Jun 12, 2017

Thanks for adding this big comment, but there is still some confusion possible on this line: the second step is not checking the dependencies (which happens after this function returns), but checking data_mtime.
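For reference, the data_mtime step amounts to trusting a module's cache only if none of its dependencies' cache data files were written after its own. A simplified sketch of that propagation idea (not mypy's actual code):

```python
def deps_are_fresh(meta, dep_metas):
    # meta.data_mtime is the mtime of this module's cache data file.
    # If any dependency's data file was written later, the data this
    # module was checked against may be stale, so don't trust the cache.
    return all(dep.data_mtime <= meta.data_mtime for dep in dep_metas)
```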

@gvanrossum gvanrossum merged commit cbeaeb4 into python:master Jun 13, 2017

2 checks passed

continuous-integration/appveyor/pr: AppVeyor build succeeded
continuous-integration/travis-ci/pr: The Travis CI build passed
@gvanrossum
Member

gvanrossum commented Jun 13, 2017

Thanks! I'm going to deploy this ASAP.
