Rework cache to key on hash of file contents instead of mtime #3437

Merged
merged 7 commits from the movable-cache branch into python:master on Jun 13, 2017

Conversation

pkch
Contributor

@pkch pkch commented May 24, 2017

Attempt to fix #3403

At present, I'm only adding the ability to validate the cache using a hash even when the module file's mtime doesn't match the meta mtime.

However, I'm NOT getting rid of the use of data_mtime (the modified time of the cache file) for the purpose of finding dependencies that tell us a given module needs to be reparsed. This is much more complex than I thought, and it may also be possible to work around it by using @JukkaL's suggestion of putting the entire cache into a tarball to preserve cache mtimes.

To verify that the approach works, I first create a failing test by changing the test runner to touch all the source files before running mypy for incremental tests. Then, in the second commit, I fix the broken tests.

How this whole thing should be tested is something I'd like feedback on (running the tests twice is too time-consuming, I think?). As it stands, this PR only tests the new approach and loses the tests for the old approach (with mtime/size). Maybe some kind of combination of the two would be good?
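To make the approach concrete, here is a minimal sketch of the hash-fallback freshness check described above. The field layout and the is_source_fresh name are illustrative, not mypy's actual code:

```python
import hashlib
import os
from typing import NamedTuple


class CacheMeta(NamedTuple):
    mtime: float
    size: int
    source_hash: str  # md5 hex digest of the source file's raw bytes


def is_source_fresh(path: str, meta: CacheMeta) -> bool:
    st = os.stat(path)
    if st.st_mtime == meta.mtime and st.st_size == meta.size:
        return True  # fast path: nothing appears to have changed
    if st.st_size != meta.size:
        return False  # different size, so the contents must differ
    with open(path, 'rb') as f:
        # mtime changed (e.g. the file was touched or freshly checked out),
        # but the contents may still be identical; compare hashes to decide.
        return hashlib.md5(f.read()).hexdigest() == meta.source_hash
```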

@pkch pkch changed the title [WIP] [WIP] Rework cache to key on hash of file contents instead of mtime May 24, 2017
@pkch pkch force-pushed the movable-cache branch from b373296 to 9242830 on May 24, 2017 10:35
@pkch pkch force-pushed the movable-cache branch from ee319c6 to 1a3e059 on May 24, 2017 17:56
@pkch pkch changed the title [WIP] Rework cache to key on hash of file contents instead of mtime Rework cache to key on hash of file contents instead of mtime May 25, 2017
@pkch
Contributor Author

pkch commented May 25, 2017

To summarize: this PR eliminates the reliance on mtime (modified time of source files) but leaves untouched the reliance on data_mtime (modified time of cache files).

mypy/build.py Outdated
@@ -833,6 +836,11 @@ def compute_hash(text: str) -> str:
return hashlib.md5(text.encode('utf-8')).hexdigest()


def compute_module_hash(path: str) -> str:
with open(path, 'r') as f:
return compute_hash(f.read())
Member

That's pretty inefficient -- we read the source code in text mode (using the default encoding) and then compute_hash() encodes it back to bytes (using utf-8) before hashing.

I also believe that we have the module source code already as an attribute in State in most cases, so there's not even a need to read it from the file (it will be GC'ed by parse_file()).

Also the source should probably be read using read_with_python_encoding() which does a few tricks.

Finally we don't need this hash to be compatible with the interface hash (which is why compute_hash() exists currently).

Contributor Author

Then perhaps we should hash the binary, and not even bother with decoding to a string in the case where we need to read the file from disk?

And yes, sometimes .source is already populated, but it's populated with a string. So I'm planning to split read_with_python_encoding into reading the binary and then decoding, so that between the two steps we can calculate and store the hash.

If that's too troublesome, I can instead standardize the hash to always use the string.

Any preferences?
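For illustration, a sketch of the split being proposed, assuming we hash the raw bytes. The function names are illustrative; the real read_with_python_encoding() also handles PEP 263 coding declarations and BOMs:

```python
import hashlib
from typing import Tuple


def read_source_bytes(path: str) -> bytes:
    with open(path, 'rb') as f:
        return f.read()


def decode_source(data: bytes, encoding: str = 'utf-8') -> str:
    # The real read_with_python_encoding() also detects PEP 263 coding
    # declarations and BOMs; this sketch decodes with a single encoding.
    return data.decode(encoding)


def read_and_hash(path: str) -> Tuple[str, str]:
    data = read_source_bytes(path)
    source_hash = hashlib.md5(data).hexdigest()  # hash the raw bytes, not the str
    return decode_source(data), source_hash
```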

@pkch
Contributor Author

pkch commented May 25, 2017

@gvanrossum I added the optimization along the lines you suggested. Now whenever the source code is parsed, we also calculate the source hash (based on the byte representation). I don't think there's any need to ever calculate the source hash separately from parsing: we only need the source hash inside write_cache, and by then we of course must have parsed the source.

@pkch pkch force-pushed the movable-cache branch from 2035e3d to de0a794 on May 25, 2017 08:47
@pkch
Contributor Author

pkch commented May 25, 2017

Actually, we need to calculate the source hash in two places: once when parsing the source file, and once when checking whether the cache is in sync with the source. In the first case, we can do it without an extra file read; in the second, we do need to read the file again (assuming mtime/size didn't give us a positive answer).

The new commit implements this optimization.

@pkch pkch force-pushed the movable-cache branch from b94f880 to e37ae6a on May 25, 2017 10:19
@gvanrossum gvanrossum self-assigned this Jun 5, 2017
Member

@gvanrossum gvanrossum left a comment

I really like this! But there are a few issues still... Let me know if you have time to work on those, else I will take over the development of this PR.

mypy/build.py Outdated
@@ -1421,7 +1433,8 @@ def parse_file(self) -> None:
if self.path and source is None:
try:
path = manager.maybe_swap_for_shadow_path(self.path)
source = read_with_python_encoding(path, self.options.python_version)
source, self.source_hash = read_with_python_encoding(path,
self.options.python_version)
Member

Fix indent.

mypy/build.py Outdated
@@ -710,7 +715,7 @@ def read_with_python_encoding(path: str, pyversion: Tuple[int, int]) -> str:
source_bytearray.decode(encoding)
except LookupError as lookuperr:
raise DecodeError(str(lookuperr))
return source_bytearray.decode(encoding)
return source_bytearray.decode(encoding), hashlib.md5(source_bytearray).hexdigest()
Member

This is incorrect if a BOM marker is present, since on line 705 above the BOM marker is removed from source_bytearray.

Contributor Author

Fixed in c115f13. I'm not sure why we make the decision about encoding and BOM after reading just 2 lines rather than after reading the entire file; is it just because source_bytearray = source_bytearray[3:] is expensive when source_bytearray contains the entire file? Anyway, I assume that's the reason, so to avoid calling f.read() twice, I use hashlib.md5.update().
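For reference, a simplified sketch of that idea: feed each chunk read from disk into an incremental md5 via update(), so the hash covers all the raw bytes (including a BOM that gets stripped before decoding) without a second f.read(). This is illustrative only and skips the PEP 263 encoding detection the real code performs:

```python
import hashlib
from typing import Tuple


def read_decode_and_hash(path: str) -> Tuple[str, str]:
    hasher = hashlib.md5()
    with open(path, 'rb') as f:
        head = f.readline() + f.readline()  # enough to see a BOM / coding line
        hasher.update(head)                 # hash includes the BOM bytes
        rest = f.read()
        hasher.update(rest)
    data = head + rest
    if data.startswith(b'\xef\xbb\xbf'):    # strip the UTF-8 BOM before decoding
        data = data[3:]
    return data.decode('utf-8'), hasher.hexdigest()
```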

mypy/build.py Outdated
@@ -809,8 +815,11 @@ def is_meta_fresh(meta: Optional[CacheMeta], id: str, path: str, manager: BuildM
# TODO: Share stat() outcome with find_module()
st = manager.get_stat(path) # TODO: Errors
if st.st_mtime != meta.mtime or st.st_size != meta.size:
manager.log('Metadata abandoned for {}: file {} is modified'.format(id, path))
return False
with open(path, 'rb') as f:
Member

You only need to compare the hash when the sizes are equal.

There's another subtle issue here: if the mtime differs but the size and hash match, we don't rewrite the meta.json file (AFAICT), so that means that from then on we always hash the file. Now, the hashing seems so fast that this barely matters, but I want to use this for a huge codebase, so I'm still worried about this (else why bother with the mtime check).

Also I'd like to see a log message if the mtime differs, but the size and hash are the same (this helps validating that it works).

Contributor Author

@pkch pkch Jun 11, 2017

You only need to compare the hash when the sizes are equal.

Ah right, I left it as a placeholder but then forgot about it. Originally, I was thinking that maybe we should use a slightly more intelligent hash that ignores comments and non-semantic whitespace such as indentation (in which case a size change doesn't necessarily imply that the cache is invalid). Do you think it's worth doing that, or should I just leave the simple hash in place?

Edit: given that the interface hash is not too expensive and will take care of simple modifications to the file, and given that removing comments is not super fast (it needs the tokenize module), I suppose the "smart hash" isn't worth it.

Contributor Author

There's another subtle issue here: if the mtime differs but the size and hash match, we don't rewrite the meta.json file (AFAICT), so that means that from then on we always hash the file.

Ah, that is very subtle. We should either completely remove mtime from the cache logic, or update it in this scenario. I prefer the latter because writing meta.json is faster than reading and hashing the source file, and I'm guessing that on average we can expect at least one additional read of the source file before it's completely invalidated.
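A rough sketch of that optimization (illustrative only; mypy's CacheMeta is a NamedTuple, but a plain dict keeps the example short): when the hash matches but the mtime differs, rewrite the meta file with the new mtime so later runs take the fast mtime/size path again.

```python
import json
import os


def refresh_meta_mtime(meta_json_path: str, meta: dict, new_mtime: float) -> None:
    # 'meta' stands in for mypy's CacheMeta record.
    updated = dict(meta, mtime=new_mtime)
    tmp = meta_json_path + '.tmp'
    with open(tmp, 'w') as f:
        json.dump(updated, f)
    os.replace(tmp, meta_json_path)  # swap in atomically so readers never see a partial file
```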

Contributor Author

I attempted to do the mtime optimization in 52a5888.

Member

FWIW "smart hash" sounds like a terrible idea, since it essentially comes down to parsing. People are used to the concept of hashing the contents of a file, and understand it to be some hash of the bytes.

@pkch
Contributor Author

pkch commented Jun 10, 2017

Yup, I have time, will work on it this weekend ~

@gvanrossum
Member

gvanrossum commented Jun 10, 2017 via email

Member

@gvanrossum gvanrossum left a comment

Almost there! I really just have very small refactoring wishes, and they are optional.

mypy/build.py Outdated
@@ -842,6 +875,17 @@ def compute_hash(text: str) -> str:
return hashlib.md5(text.encode('utf-8')).hexdigest()


def atomic_write(filename: str, s: str) -> bool:
Member

I think it would be more consistent if atomic_write() preceded the first function that uses it (i.e. before validate_meta()). That would mean random_string() also needs to move up there. It's optional to move these.
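Only the signature of atomic_write() appears in the diff above; a plausible body, sketched here as an assumption rather than the actual implementation, writes to a uniquely named temporary file and renames it into place (this is also where random_string() comes in):

```python
import os
import random
import string


def random_string() -> str:
    # Illustrative helper; the PR's real random_string() may differ.
    return ''.join(random.choice(string.ascii_lowercase) for _ in range(8))


def atomic_write(filename: str, s: str) -> bool:
    tmp_filename = filename + '.' + random_string()
    try:
        with open(tmp_filename, 'w') as f:
            f.write(s)
        os.replace(tmp_filename, filename)  # atomic rename on the same filesystem
    except os.error:
        return False
    return True
```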

mypy/build.py Outdated
# This requires two steps. The first is obvious: we check that the module source file
# contents is the same as it was when the cache file was created. The second is not
# obvious: we need to check that the dependencies we relied on when creating that
# cache file have not changed. We use cache file mtime as a way to propagate
Member

cache file --> cache data file (because the rest of this function is about the cache meta file).

mypy/build.py Outdated
else:
manager.log('Metadata ok for {}: file {} (match on size, hash)'.format(id, path))
# Optimization: update meta.mtime (otherwise, this mismatch will not disappear).
meta = meta._replace(mtime = st.st_mtime)
Member

Whitespace nit: no spaces around = for keyword args.

mypy/build.py Outdated
else:
meta_str = json.dumps(meta)
meta_json, _ = get_cache_names(id, os.path.abspath(path), manager)
manager.log('Updating mtime {} {} {} {}'.format(id, path, meta_json, meta.mtime))
Member

Can you update this to follow the format of other similar messages, e.g. "Updating mtime for {}: file {}, meta {}, mtime {}"? Or use trace() so it only shows up with double -v.

mypy/build.py Outdated
os.replace(data_json_tmp, data_json)
data_mtime = os.path.getmtime(data_json)
except os.error as err:
data_str += '\n' # Fast in CPython (it does this in-place if len(data_str) under 10^6)
Member

Nice bit of research! Alternatively, you could make atomic_write() take varargs; it could then use f.writelines(args). Though in our internal code base there's only one cache file greater than 1M, so changing that is totally optional.

Contributor Author

I refactored that, because I found out that it's very unpredictable when this optimization kicks in (what I thought was a ~10**6 cutoff turned out to be an OS-dependent zone where realloc starts to fail more and more often).

mypy/build.py Outdated
'''
# This requires two steps. The first is obvious: we check that the module source file
# contents is the same as it was when the cache file was created. The second is not
# obvious: we need to check that the dependencies we relied on when creating that
Member

Thanks for adding this big comment, but there is still some confusion possible on this line: the second step is not checking the dependencies (which happens after this function returns), but checking data_mtime.

@pkch pkch force-pushed the movable-cache branch 2 times, most recently from 395e156 to 52a5888 on June 13, 2017 03:31
@gvanrossum gvanrossum merged commit cbeaeb4 into python:master Jun 13, 2017
@gvanrossum
Member

Thanks! I'm going to deploy this ASAP.
