Rework cache to key on hash of file contents instead of mtime #3437
Conversation
To summarize: this PR eliminates the reliance on source file mtimes for cache validation.
mypy/build.py
Outdated
@@ -833,6 +836,11 @@ def compute_hash(text: str) -> str:
    return hashlib.md5(text.encode('utf-8')).hexdigest()


def compute_module_hash(path: str) -> str:
    with open(path, 'r') as f:
        return compute_hash(f.read())
That's pretty inefficient -- we read the source code in text mode (using the default encoding) and then compute_hash() encodes it back to bytes (using utf-8) before hashing.
I also believe that we have the module source code already as an attribute in State in most cases, so there's not even a need to read it from the file (it will be GC'ed by parse_file()).
Also, the source should probably be read using read_with_python_encoding(), which does a few tricks.
Finally, we don't need this hash to be compatible with the interface hash (which is why compute_hash() exists currently).
Then perhaps we should hash the binary, and not even bother with decoding to a string in the case where we need to read the file from disk?
And yes, sometimes .source is already populated, but it's populated with a string. So I'm planning to split read_with_python_encoding into reading the binary and then decoding, so that in between the two steps we can calculate and store the hash.
If that's too troublesome, I can instead standardize the hash to always use the string. Any preferences?
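For illustration, the split could look roughly like this (a minimal sketch, not the PR's actual code; the function name read_and_hash and the hard-coded utf-8 decode are placeholders, since the real reader also has to honor PEP 263 coding declarations and BOMs):

```python
import hashlib
from typing import Tuple


def read_and_hash(path: str) -> Tuple[str, str]:
    """Sketch: read the raw bytes once, hash them, then decode to a string."""
    with open(path, 'rb') as f:
        raw = f.read()
    source_hash = hashlib.md5(raw).hexdigest()  # hash of the byte representation
    source = raw.decode('utf-8')  # placeholder; real code must detect the declared encoding
    return source, source_hash
```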
@gvanrossum I added the optimization along the lines you suggested. Now whenever the source code is parsed, we also calculate the source hash (based on the byte representation).
Actually, we need to calculate source hash in two places: once when parsing the source file, and once when checking if the cache is in sync with the source. In the first case, we can do it without an extra file read; in the second, we do need to read the file again (assuming mtime/size didn't give us the positive answer). The new commit implements this optimization.
I really like this! But there are a few issues still... Let me know if you have time to work on those, else I will take over the development of this PR.
mypy/build.py
Outdated
@@ -1421,7 +1433,8 @@ def parse_file(self) -> None:
        if self.path and source is None:
            try:
                path = manager.maybe_swap_for_shadow_path(self.path)
-               source = read_with_python_encoding(path, self.options.python_version)
+               source, self.source_hash = read_with_python_encoding(path,
+                       self.options.python_version)
Fix indent.
mypy/build.py
Outdated
@@ -710,7 +715,7 @@ def read_with_python_encoding(path: str, pyversion: Tuple[int, int]) -> str:
            source_bytearray.decode(encoding)
        except LookupError as lookuperr:
            raise DecodeError(str(lookuperr))
-       return source_bytearray.decode(encoding)
+       return source_bytearray.decode(encoding), hashlib.md5(source_bytearray).hexdigest()
This is incorrect if a BOM marker is present, since on line 705 above the BOM marker is removed from source_bytearray.
Fixed in c115f13. I'm not sure why we make the decision about encoding and BOM after reading just 2 lines rather than after reading the entire file; is it just because source_bytearray = source_bytearray[3:] is expensive when source_bytearray contains the entire file? Anyway, I assume that's the reason, so to avoid calling f.read() twice, I use hashlib.md5().update().
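A sketch of that approach (the helper name read_bytes_and_hash is hypothetical; mypy's real read_with_python_encoding() also inspects those first two lines for a PEP 263 coding declaration and strips a BOM before decoding): each chunk is fed to update() as it is read, so the raw bytes (including any BOM) are hashed without a second f.read().

```python
import hashlib
from typing import Tuple


def read_bytes_and_hash(path: str) -> Tuple[bytearray, str]:
    """Sketch: accumulate the file and its MD5 incrementally, reading the file only once."""
    md5 = hashlib.md5()
    source_bytearray = bytearray()
    with open(path, 'rb') as f:
        for _ in range(2):           # first two lines: enough for encoding/BOM detection
            line = f.readline()
            md5.update(line)
            source_bytearray.extend(line)
        rest = f.read()              # the remainder of the file
        md5.update(rest)
        source_bytearray.extend(rest)
    return source_bytearray, md5.hexdigest()
```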
mypy/build.py
Outdated
@@ -809,8 +815,11 @@ def is_meta_fresh(meta: Optional[CacheMeta], id: str, path: str, manager: BuildM
    # TODO: Share stat() outcome with find_module()
    st = manager.get_stat(path)  # TODO: Errors
    if st.st_mtime != meta.mtime or st.st_size != meta.size:
-       manager.log('Metadata abandoned for {}: file {} is modified'.format(id, path))
-       return False
+       with open(path, 'rb') as f:
You only need to compare the hash when the sizes are equal.
There's another subtle issue here: if the mtime differs but the size and hash match, we don't rewrite the meta.json file (AFAICT), so that means that from then on we always hash the file. Now, the hashing seems so fast that this barely matters, but I want to use this for a huge codebase, so I'm still worried about this (else why bother with the mtime check).
Also I'd like to see a log message if the mtime differs but the size and hash are the same (this helps validate that it works).
> You only need to compare the hash when the sizes are equal.

Ah right, I left it as a placeholder but then forgot about it. Originally, I was thinking that maybe we should use a slightly more intelligent hash that ignores comments and non-semantic whitespace such as indentation (in which case a size change doesn't necessarily imply that the file is invalid). Do you think it's worth doing that, or should I just leave the simple hash in place?
Edit: given that the interface hash is not too expensive and will take care of simple modifications to the file, and given that removing comments is not super fast (it needs the tokenize module), I suppose the "smart hash" isn't worth it.
> There's another subtle issue here: if the mtime differs but the size and hash match, we don't rewrite the meta.json file (AFAICT), so that means that from then on we always hash the file.

Ah, that is very subtle. We should either completely remove mtime from the cache logic, or we should update it in this scenario. I prefer the latter, because writing meta.json is faster than reading and hashing the source file, and I'm guessing on average we can expect at least one additional read of the source file before it's completely invalidated.
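Putting the pieces of this thread together, the check being discussed is roughly the following (a sketch only; field names like meta.hash and the exact log wording are assumptions based on the hunks above, not the PR's final code):

```python
import hashlib


def source_matches_meta(meta, st, id, path, manager):
    """Sketch: cheap mtime/size check first, hash only when the sizes are equal."""
    if st.st_mtime == meta.mtime and st.st_size == meta.size:
        return True, meta  # fast path: nothing changed as far as stat() can tell
    if st.st_size != meta.size:
        manager.log('Metadata abandoned for {}: file {} is modified'.format(id, path))
        return False, meta
    with open(path, 'rb') as f:
        source_hash = hashlib.md5(f.read()).hexdigest()
    if source_hash != meta.hash:  # 'hash' field name is an assumption
        manager.log('Metadata abandoned for {}: file {} is modified'.format(id, path))
        return False, meta
    manager.log('Metadata ok for {}: file {} (match on size, hash)'.format(id, path))
    # Update the recorded mtime so the next run can take the fast path again.
    return True, meta._replace(mtime=st.st_mtime)
```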
I attempted to do the mtime optimization in 52a5888.
FWIW "smart hash" sounds like a terrible idea, since it essentially comes down to parsing. People are used to the concept of hashing the contents of a file, and understand it to be some hash of the bytes.
Yup, I have time, will work on it this weekend ~
I hope the refactor needed to update the meta file isn't too painful. :-)
Almost there! I really just have very small refactoring wishes, and they are optional.
mypy/build.py
Outdated
@@ -842,6 +875,17 @@ def compute_hash(text: str) -> str:
    return hashlib.md5(text.encode('utf-8')).hexdigest()


def atomic_write(filename: str, s: str) -> bool:
I think it would be more consistent if atomic_write() preceded the first function that uses it (i.e. before validate_meta()). That would mean random_string() also needs to move up there. It's optional to move these.
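For context, an atomic_write() with the signature shown in the hunk above would typically be a write-to-temp-then-rename helper, something like this sketch (not necessarily the PR's implementation; the random-suffix scheme here is an assumption standing in for random_string()):

```python
import os
import random


def atomic_write(filename: str, s: str) -> bool:
    """Sketch: write to a temporary file, then atomically rename it over the target."""
    tmp_filename = '{}.{:08x}.tmp'.format(filename, random.getrandbits(32))
    try:
        with open(tmp_filename, 'w') as f:
            f.write(s)
        os.replace(tmp_filename, filename)  # atomic replace, also on Windows (Python 3.3+)
    except OSError:
        return False
    return True
```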
mypy/build.py
Outdated
    # This requires two steps. The first is obvious: we check that the module source file
    # contents is the same as it was when the cache file was created. The second is not
    # obvious: we need to check that the dependencies we relied on when creating that
    # cache file have not changed. We use cache file mtime as a way to propagate
cache file --> cache data file (because the rest of this function is about the cache meta file).
mypy/build.py
Outdated
        else:
            manager.log('Metadata ok for {}: file {} (match on size, hash)'.format(id, path))
            # Optimization: update meta.mtime (otherwise, this mismatch will not disappear).
            meta = meta._replace(mtime = st.st_mtime)
Whitespace nit: no spaces around = for keyword args.
mypy/build.py
Outdated
        else:
            meta_str = json.dumps(meta)
            meta_json, _ = get_cache_names(id, os.path.abspath(path), manager)
            manager.log('Updating mtime {} {} {} {}'.format(id, path, meta_json, meta.mtime))
Can you update this to follow the format of other similar messages, e.g. "Updating mtime for {}: file {}, meta {}, mtime {}"? Or use trace() so it only shows up for double -v.
mypy/build.py
Outdated
-       os.replace(data_json_tmp, data_json)
-       data_mtime = os.path.getmtime(data_json)
-   except os.error as err:
+   data_str += '\n'  # Fast in CPython (it does this in-place if len(data_str) under 10^6)
Nice bit of research! Alternatively you could make atomic_write() varargs; it could use f.writelines(*args). Though in our internal code base, there's only one cache file greater than 1M, so changing that is totally optional.
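The varargs variant suggested here could look roughly like this (a hypothetical sketch, reusing the same temp-file-plus-os.replace idea as the earlier sketch), which lets callers pass data_str and the trailing '\n' as separate chunks instead of concatenating them:

```python
import os
import random


def atomic_write(filename: str, *lines: str) -> bool:
    """Sketch: varargs version; writelines() writes each chunk without building one big string."""
    tmp_filename = '{}.{:08x}.tmp'.format(filename, random.getrandbits(32))
    try:
        with open(tmp_filename, 'w') as f:
            f.writelines(lines)
        os.replace(tmp_filename, filename)
    except OSError:
        return False
    return True


# Hypothetical usage, avoiding data_str += '\n':
#     atomic_write(data_json, data_str, '\n')
```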
I refactored that, because I found out that it's very unpredictable when this optimization kicks in (what I thought was a ~10**6 cutoff turned out to be an OS-dependent zone where realloc starts to fail more and more often).
mypy/build.py
Outdated
    '''
    # This requires two steps. The first is obvious: we check that the module source file
    # contents is the same as it was when the cache file was created. The second is not
    # obvious: we need to check that the dependencies we relied on when creating that
Thanks for adding this big comment, but there is still some confusion possible on this line: the second step is not checking the dependencies (which happens after this function returns), but checking data_mtime.
Thanks! I'm going to deploy this ASAP.
Attempt to fix #3403
At present, I'm only adding the ability to validate the cache using a hash even though the module file mtime doesn't match the meta mtime.
However, I'm NOT getting rid of the use of data_mtime (the modified time of the cache file) for the purpose of finding dependencies that tell us that a given module needs to be reparsed. This is much more complex than I thought, and it may also be possible to work around it by using @JukkaL's suggestion of putting the entire cache into a tarball to preserve cache mtimes.

To verify that the approach works, I first create a failing test by changing the test runner to touch all the source files before running mypy for incremental tests. Then, in the second commit, I fix the broken tests.
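Concretely, the "touch all the source files" step only needs to bump each file's mtime, so an mtime-only cache check considers everything modified even though the contents (and hence the hashes) are unchanged. Roughly (a sketch, not the actual test-runner change):

```python
import os
import time


def touch_all_sources(root: str) -> None:
    """Sketch: bump mtime on every .py file under root without changing its contents."""
    now = time.time()
    for dirpath, _dirs, filenames in os.walk(root):
        for name in filenames:
            if name.endswith('.py'):
                os.utime(os.path.join(dirpath, name), (now, now))
```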
How this whole thing should be tested is something I'd like feedback on (running the tests twice is too time-consuming, I think?). As it stands, this PR only tests the new approach and loses the tests for the old approach (with mtime/size). Maybe some combination of the two would be good?