Support protocols in fine grained mode #4790
Conversation
…more debugging needed: protocol subtype not checked in second run, although CallExpr visited in checkexpr
…ting cache is still too soon, it does many other things
Here's my first round of review. While writing this, I started thinking about an alternative approach which wouldn't require a global cache file. I'll need to think about that more before I can say more -- I'm not even sure if it's feasible yet.
Another general comment: I don't like the proliferation of mutable cache data in TypeInfo objects. An alternative approach would be to put all global caches into a new shared cache object, and each TypeInfo would have a reference to this object. I think that it would be cleaner, and it would also give us centralized read access to the cache data, which could be helpful.
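A minimal sketch of that alternative, assuming the shared object (called SharedTypeCaches here, a hypothetical name) owns the mutable maps and each TypeInfo holds a reference to one shared instance:

```python
from typing import Dict, Set


class SharedTypeCaches:
    """Hypothetical central store for the mutable protocol caches."""

    def __init__(self) -> None:
        # Class full name -> protocol members it has been checked against.
        self.checked_against_members = {}  # type: Dict[str, Set[str]]
        # Protocol full name -> full names of classes checked as implementations.
        self.attempted_implementations = {}  # type: Dict[str, Set[str]]

    def record_check(self, left_fullname: str, protocol_fullname: str,
                     protocol_members: Set[str]) -> None:
        """Single entry point for writes, giving centralized access to the data."""
        self.checked_against_members.setdefault(left_fullname, set()).update(protocol_members)
        self.attempted_implementations.setdefault(protocol_fullname, set()).add(left_fullname)


# Each TypeInfo would be constructed with a reference to the single shared
# instance (e.g. TypeInfo(..., caches=shared)) instead of growing new mutable
# attributes of its own.
```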
mypy/server/deps.py
Outdated
# (unless invalidated by other deps). We mark this kind of deps
# by a star, since they are higher priority, i.e., they need
# to be processed before any other deps.
deps[trigger].add(info.fullname() + '*')
I'm not excited about adding another kind of dependency just for protocols, particularly since it has somewhat magical properties -- this sounds error-prone. I wonder if there is a way to avoid this?
OK, I will try to put this in a separate structure and generalize the recursive triggering logic so that it can handle both normal and protocol deps.
mypy/server/deps.py
Outdated
processed = set()
deps = defaultdict(set)  # type: DefaultDict[str, Set[str]]
for node in names.values():
    if isinstance(node.node, TypeInfo) and node.node not in processed:
We should probably guard against processing imported classes that are defined in another module -- these would be processed as part of that other module?
Yes, this will improve speed a bit, will do.
Note however that this means we should never skip processing typing, because e.g. the subtype cache of typing.Iterable also needs to be invalidated if something tested against it changed.
mypy/subtypes.py
Outdated
return True
if right.type.is_protocol:
    if not left.type.is_protocol:
        left.type.checked_against_members.update(right.type.protocol_members)
I'd rather not directly modify checked_against_members here since it violates encapsulation. It would be better to add a TypeInfo method that records the information -- say, something like left.type.record_protocol_subtype_check(right.type).
checked_against_members is a good candidate for a _ prefix since it serves such a specialized purpose and is mutated (even though we don't use _ attribute prefixes much).
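A rough sketch of the suggested method, using a simplified stand-in class (the real TypeInfo has many more fields; the body is an assumption about how the underscored cache would be updated):

```python
from typing import Set


class TypeInfoSketch:
    """Simplified stand-in for TypeInfo, just enough to show the idea."""

    def __init__(self, is_protocol: bool, protocol_members: Set[str]) -> None:
        self.is_protocol = is_protocol
        self.protocol_members = protocol_members
        # Underscore prefix: this mutable cache is an implementation detail.
        self._checked_against_members = set()  # type: Set[str]

    def record_protocol_subtype_check(self, protocol: 'TypeInfoSketch') -> None:
        """Record which protocol members this class was checked against."""
        assert protocol.is_protocol
        self._checked_against_members.update(protocol.protocol_members)


# The call site in mypy/subtypes.py would then read:
#     left.type.record_protocol_subtype_check(right.type)
```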
Agree.
mypy/subtypes.py
Outdated
@@ -395,6 +398,7 @@ def f(self) -> A: ...
as well.
"""
assert right.type.is_protocol
right.type.attempted_implementations.add(left.type.fullname())
Similar to above, it's better not to directly poke the state of objects. Use a method instead.
mypy/subtypes.py
Outdated
return True
if right.type.is_protocol:
    if not left.type.is_protocol:
        left.type.checked_against_members.update(right.type.protocol_members)
Similar to above.
mypy/build.py
Outdated
manager.log("Error writing protocol deps JSON file {}".format(proto_cache)) | ||
|
||
|
||
def get_protocol_deps_cache_name(manager: BuildManager) -> Tuple[str, str]: |
This duplicates parts of get_cache_names. It would be better to share the code (such as the cache directory naming logic).
Also add a docstring -- it's not immediately clear what the purpose of having a separate protocol deps cache file is.
OK, I will try to refactor and will certainly add the docstring.
mypy/build.py
Outdated
for id in graph:
    meta_snapshot[id] = graph[id].source_hash
if not atomic_write(proto_meta, json.dumps(meta_snapshot), '\n'):
    manager.log("Error writing protocol meta JSON file {}".format(proto_cache))
Error handling is a bit sketchy -- the cache is unusable without a protocol meta JSON file. Other cache files can be regenerated by reprocessing individual modules, but the global cache can't be built incrementally. So if we write some cache files but not the global cache file, we should probably fail the build. If we don't write any cache files at all, things are probably okay, since we'll need to do a full rebuild, which will also populate the protocol data structures correctly.
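A sketch of the kind of handling being discussed (the later revision of this PR reports a blocking error along these lines; atomic_write, _get_prefix and the messages follow the surrounding code, the control flow here is illustrative):

```python
error = False
if not atomic_write(proto_meta, json.dumps(meta_snapshot), '\n'):
    manager.log("Error writing protocol meta JSON file {}".format(proto_meta))
    error = True
if not atomic_write(proto_cache, json.dumps(listed_proto_deps), '\n'):
    manager.log("Error writing protocol deps JSON file {}".format(proto_cache))
    error = True
if error:
    # Per-module cache files may already be on disk, but the global protocol
    # cache cannot be rebuilt incrementally, so fail the build loudly.
    manager.errors.set_file(_get_prefix(manager), None)
    manager.errors.report(0, 0, "Error writing protocol dependencies cache",
                          blocker=True)
```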
Yes, this is the part where I am not sure what to do. Also, should we invalidate the protocol cache separately if the cache was written with --strict-optional but is used without it?
mypy/build.py
Outdated
manager.log("Error writing protocol meta JSON file {}".format(proto_cache)) | ||
listed_proto_deps = {k: list(v) for (k, v) in proto_deps.items()} | ||
if not atomic_write(proto_cache, json.dumps(listed_proto_deps), '\n'): | ||
manager.log("Error writing protocol deps JSON file {}".format(proto_cache)) |
Similar to above -- error checking needs to be more involved.
mypy/build.py
Outdated
else:
    process_graph(graph, manager)
    if manager.options.cache_fine_grained or manager.options.fine_grained_incremental:
        proto_deps = collect_protocol_deps(graph)
This needs a comment, since the special protocol dependencies are non-trivial.
mypy/build.py
Outdated
@@ -2052,8 +2131,13 @@ def dispatch(sources: List[BuildSource], manager: BuildManager) -> Graph:
# just want to load in all of the cache information.
if manager.use_fine_grained_cache():
    process_fine_grained_cache_graph(graph, manager)
    manager.proto_deps = read_protocol_cache(manager, graph)
read_protocol_cache can return None if it can't load the file. Can we handle it here? Without protocol dependencies we won't be able to produce correct results, so it looks like handling it somehow is the right thing to do. Maybe we should do a full build instead of using the cache?
I was thinking about showing a blocking error and stopping the build. There is a TODO about this in update.py.
… logic used by mypy
…members by using fixup.stale_info
…OK, now I updated the PR to use the newly added
@JukkaL This is ready for your review.
Partial review, will continue tomorrow.
mypy/dmypy_server.py
Outdated
@@ -196,6 +196,7 @@ def serve(self) -> None:


def free_global_state(self) -> None:
    TypeState.reset_all_subtype_caches()
    TypeState.reset_protocol_deps()
These two calls could be wrapped in a function somewhere, since the two lines are repeated in mypy.build -- something like reset_global_state. It could be a module-level function in mypy.typestate, and if we ever have multiple modules with global state, we can just move the function to another module.
OK, I will make it a global function in typestate.
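Roughly what the agreed module-level helper in mypy/typestate.py could look like (a sketch, not necessarily the final code):

```python
def reset_global_state() -> None:
    """Reset all global state used across incremental runs.

    Currently all such state lives in TypeState (defined in this module),
    so no imports are needed; if other modules grow global state, the
    corresponding resets can be added here.
    """
    TypeState.reset_all_subtype_caches()
    TypeState.reset_protocol_deps()
```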
mypy/build.py
Outdated
"""Read and validate protocol dependencies cache.""" | ||
proto_meta, proto_cache = get_protocol_deps_cache_name(manager) | ||
try: | ||
with open(proto_meta, 'r') as f: |
Use file system cache here (and elsewhere in this function)?
Actually, after doing this I am not sure it will give any speed-up. Protocol deps are currently read only once, when we load the cache. So this is more a consistency question (using the same API everywhere) than a performance one.
mypy/subtypes.py
Outdated
@@ -392,6 +392,7 @@ def f(self) -> A: ...
as well.
"""
assert right.type.is_protocol
TypeState.record_protocol_subtype_check(left.type, right.type)
Please add a short comment explaining why we do this ("We need to record this check to generate fine-grained dependencies." or something).
mypy/typestate.py
Outdated
# We also snapshot protocol members of the above protocols.
_checked_against_members = {}  # type: Dict[TypeInfo, Set[str]]
# TypeInfos that has been reprocessed since latest dependency snapshot update.
_reprocessed_types = set()  # type: Set[TypeInfo]
The term 'reprocessed' is a bit unclear here. Can you spell out what it means exactly? Does it only apply to fine-grained incremental mode?
Does it only apply to fine-grained incremental mode?
Essentially yes. This is an optimization to avoid gathering dependencies from all registered types, gathering only from those re-checked during the last update. For a full build, this set simply contains all registered types; we can't optimize that case.
…l proto deps update; attempt at fixing bytes/str on Python 3.4
@JukkaL Thanks for the review! I addressed all your comments and also made a few minor updates (for better separation between normal deps and protocol deps -- now they are mixed only immediately before use, or when dumping in a test).
This is the next batch of my code review, will continue tomorrow.
mypy/build.py
Outdated
    error = True
if error:
    manager.errors.set_file(_get_prefix(manager), None)
    manager.errors.report(0, 0, "Error writing protocol dependencies cache.",
Style nit: error messages don't generally have a trailing period.
Serialize protocol dependencies map for fine grained mode. Also take the snapshot
of current sources to later check consistency between protocol cache and individual
cache files.
Document which dependencies are included in proto_deps (there are three different kinds which might plausibly be included).
mypy/build.py
Outdated


def read_protocol_cache(manager: BuildManager,
                        graph: Graph) -> Optional[Dict[str, Set[str]]]:
    """Read and validate protocol dependencies cache."""
Document which dependencies are read (out of the three kinds), or just add a reference to write_protocol_deps_cache if that contains the description.
mypy/build.py
Outdated
proto_meta, proto_cache = get_protocol_deps_cache_name(manager)
try:
    data = manager.fscache.read(proto_meta).decode()
    manager.trace('Proto meta {}'.format(data.rstrip()))
Style nit: Move this and the following line out of the try statement. Generally it's better to have only the potentially failing statement within the try block for clarity (and to avoid catching unanticipated exceptions).
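Applied to the snippet above, the suggestion would look roughly like this (a sketch; the log message on failure is illustrative):

```python
try:
    data = manager.fscache.read(proto_meta).decode()
except IOError:
    manager.log("Could not load protocol metadata file")
    return None
manager.trace('Proto meta {}'.format(data.rstrip()))
meta_snapshot = json.loads(data)
```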
mypy/build.py
Outdated
    data = manager.fscache.read(proto_meta).decode()
    manager.trace('Proto meta {}'.format(data.rstrip()))
    meta_snapshot = json.loads(data)
except IOError:
Maybe handle unicode decode errors here as well?


Since these dependencies represent a global state of the program, they
are serialized per program, not per module, and the corresponding files
live at the root of the cache folder for a given Python version.
Maybe document that this returns a tuple (meta file path, data file path) and describe what's included in the meta file.
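For instance, the docstring could spell this out along these lines (a sketch based on what the function writes elsewhere in this PR):

```python
def get_protocol_deps_cache_name(manager: BuildManager) -> Tuple[str, str]:
    """Return the file paths used for the protocol dependencies cache.

    Returns a tuple (meta file path, data file path). The meta file stores
    the source hashes of all modules at the time the cache was written, so
    that the protocol cache can be checked for consistency against the
    per-module cache files; the data file stores the dependencies themselves.
    """
```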
mypy/build.py
Outdated
# after we loaded cache for whole graph. The `read_protocol_cache` will also validate
# the protocol cache against the loaded individual cache files.
TypeState.proto_deps = read_protocol_cache(manager, graph)
if TypeState.proto_deps is None and manager.stats.get('fresh_trees', 0):
Maybe the latter condition could be written as manager.stats.get(..., 0) > 0 to be more explicit? Also mention this in the comment, since it looked a bit confusing at first. Also, it could be helpful to document the stats attribute (in BuildManager, not here) so that it's clear that we rely on the stats for correctness.
mypy/build.py
Outdated
TypeState.proto_deps = read_protocol_cache(manager, graph)
if TypeState.proto_deps is None and manager.stats.get('fresh_trees', 0):
    manager.errors.set_file(_get_prefix(manager), None)
    manager.errors.report(0, 0, "Error reading protocol dependencies cache. Aborting.",
Again, leave out the final period (maybe like this: "Error reading protocol dependencies cache -- aborting").
mypy/build.py
Outdated
# processed the whole graph.
TypeState.update_protocol_deps()
if TypeState.proto_deps is not None:
    write_protocol_deps_cache(TypeState.proto_deps, manager, graph)
For consistency, we probably shouldn't write any cache files in fine-grained incremental mode.
mypy/server/update.py
Outdated


The first item in the return tuple is a list of deferred nodes that
needs to be reprocessed. If the target represents a TypeInfo corresponding
to a protocol. Return it as a second item in the return tuple, otherwise None.
Grammar: Should be "... to a protocol, return it as ..."
Here's another batch of comments -- mostly requests for more comments/docstrings.
mypy/typestate.py
Outdated


def reset_global_state() -> None:
    """Reset all existing global states. Currently they are all in this module."""
This doesn't yet reset all global state -- at least the strict optional status is not reset, and we use functools.lru_cache. Maybe reword as "Reset most ..." and discuss known deviations in the docstring.
Grammar nit: use state instead of states.
#
# TODO: Some __* names cause mistriggers. Fix the underlying issue instead of
# special casing them here.
diff.add(id + WILDCARD_TAG)
break
if item.count('.') > package_nesting_level + 1:
Add a comment about this -- these are for changes within classes, used by protocols?
mypy/server/update.py
Outdated
@@ -665,12 +669,14 @@ def calculate_active_triggers(manager: BuildManager,
# Activate catch-all wildcard trigger for top-level module changes (used for
# "from m import *"). This also gets triggered by changes to module-private
# entries, but as these unneeded dependencies only result in extra processing,
# it's a minor problem.
# it's a minor problem. Also used by protocols.
The "Also used by protocols." doesn't refer to top-level module changes, so this comment seems a little confusing, since this particular things doesn't seem to be used by protocols?
mypy/server/deps.py
Outdated
@@ -235,6 +233,13 @@ def process_type_info(self, info: TypeInfo) -> None:
    self.add_type_dependencies(info.typeddict_type, target=make_trigger(target))
if info.declared_metaclass:
    self.add_type_dependencies(info.declared_metaclass, target=make_trigger(target))
if info.is_protocol:
    for base_info in info.mro[:-1]:
        self.add_dependency(make_wildcard_trigger(base_info.fullname()),
Worth adding a clarifying comment here. Protocols a class is compatible with are often not in the MRO, so it's worth discussing this as well -- actually, is this even necessary?
6 tests fail if I only add this for the class itself but not its MRO. The idea is that we can have subprotocols, so the actual interface is determined by the whole MRO; if something is changed in an explicit superprotocol, we need to invalidate the protocol. A simple example is [case testInvalidateProtocolViaSuperClass]. There are other ways to treat this (like including protocol members in a class snapshot), but this one seems conceptually simplest to me and the most similar to how non-protocol classes are treated.
@@ -2211,6 +2208,18 @@ def get_containing_type_info(self, name: str) -> 'Optional[TypeInfo]':
        return cls
    return None

    @property
    def protocol_members(self) -> List[str]:
        # Protocol members are names of all attributes/methods defined in a protocol
Should this return an empty list if the class is not a protocol, similar to how this was defined earlier?
I would say it is not necessary, but I see no particular danger in this.
mypy/typestate.py
Outdated
# a dependency snapshot at the very end, so this set will contain all type-checked
# TypeInfos. After a fine grained update however, we can gather new dependencies only
# from few TypeInfos that were type-checked during this update, because these are
# the only that can generate new protocol dependencies.
This comment was a bit unclear. Which dependencies does the comment refer to (of the three kinds)? It's not clear why only type-checked TypeInfos can generate new dependencies, since elsewhere we've established that dependencies are a global property of a program. Please elaborate a bit. What does type-checked mean exactly here? It could plausibly mean, for example, some of 1) triggered ClsName, 2) triggered <ClsName>, 3) the file containing the class was modified, 4) the target containing the class definition was processed -- but I suspect it means none of these.
Style/grammar nits: "from few TypeInfos" -> "from the (typically) few TypeInfos"; "are the only" -> "are the only ones".
It's not clear why only type-checked TypeInfos can generate new dependencies, since elsewhere we've established that dependencies are a global property of a program
I don't think there is a contradiction, because it is not the TypeInfo that generates the new deps, but the re-checked assignment or function call (i.e. something where is_subtype was called while re-checking a target). This assignment may be in a module that defines neither the protocol nor the implementation. The TypeInfo is just a way to temporarily store the new dependency, so yes, this is none of the 1-4 you mention. I will try to clarify this.
mypy/typestate.py
Outdated


@classmethod
def reset_protocol_deps(cls) -> None:
    cls.proto_deps = {}
Mention when this is expected to be run in a docstring.
mypy/typestate.py
Outdated


The first kind is generated immediately per-module in deps.py. While two other kinds are
generated here after all modules are type checked anf we have recorded all the subtype
checks.
Can you give examples where each of these dependency kinds are needed to produce correct type checking results?
mypy/typestate.py
Outdated
""" | ||
if cls.proto_deps is None: | ||
# Unsuccesful cache loading, nothing to do. | ||
return |
Is silently failing here the right thing to do? Should we instead assert that cls.proto_deps is not None?
I think you are right, we shouldn't call this after an unsuccessful cache load; I will change this to an assert here and below.
mypy/typestate.py
Outdated
in its __init__.
"""
cls.update_protocol_deps()  # just in case
if TypeState.proto_deps is None:
Similar to above, silently failing seems kind of suspicious. I'd rather perform the check in the call site perhaps, and only when it's actually safe to do so.
@JukkaL I addressed the previous round of comments. I abandoned the idea of using the
@JukkaL Thanks! I fixed all the remaining comments. You can continue reviewing.
Currently we don't use the fscache for any of the cache files. We only use it for source files.
This is my final batch of comments. A lot of thought went into this PR -- sorry for taking so many iterations to review this! This seems almost good to merge (some comments can be addressed later in separate PRs) but the caching issue we discussed offline needs to be addressed first.


def _cache_dir_prefix(manager: BuildManager, id: Optional[str] = None) -> str:
    """Get current cache directory (or file if id is given)."""
Hmm, the new name doesn't work with a non-None id argument, since it won't be a directory prefix then? I'd recommend having another function for the file prefix that calls _cache_dir_prefix. (It's okay to do this in a separate PR.)
mypy/typestate.py
Outdated
# a value of type a.A to a function expecting something compatible with Iterable, we'd have
# 'a.A' -> {'__iter__', ...} in the map. This map is also flushed after every incremental
# update. This map is needed to only generate dependencies like <a.A.__iter__> -> <a.A>
# instead of a wildcard to avoid unnecessary invalidating classes.
Grammar nit: unnecessary -> unnecessarily
mypy/typestate.py
Outdated
# update. This map is needed to only generate dependencies like <a.A.__iter__> -> <a.A>
# instead of a wildcard to avoid unnecessary invalidating classes.
_checked_against_members = {}  # type: Dict[str, Set[str]]
# TypeInfos that appeared as a left type (subtype) in a subytpe check since latest
Typo: subytpe
mypy/typestate.py
Outdated
# run we only take a dependency snapshot at the very end, so this set will contain all
# subtype-checked TypeInfos. After a fine grained update however, we can gather only new
# dependencies generated from (typically) few TypeInfos that were subtype-checked
# (i.e. appeared as r.h.snin an assignment or an argument in a function call in
Typo: r.h.sin
    return deps


@classmethod
def update_protocol_deps(cls, second_map: Optional[Dict[str, Set[str]]] = None) -> None:
Maybe the interface or the implementation could be simplified -- this was a bit unclear. One idea would be to split this into a few different methods:
- flush_protocol_deps() would do _snapshot_protocol_deps + clear the attributes.
- add_protocol_deps(deps, to_add) would add to_add deps to an existing dependency map (corresponds to each of the loops in this method).
This method would then mostly contain calls to the two methods above, making the structure easier to see.
It's okay to do this refactoring in a separate PR.
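A sketch of the proposed split, written as methods that would live on TypeState (the method names come from the suggestion above; the bodies are illustrative):

```python
# Inside class TypeState:

@classmethod
def add_protocol_deps(cls, deps: Dict[str, Set[str]], to_add: Dict[str, Set[str]]) -> None:
    """Merge to_add into an existing dependency map (one call per loop in the old method)."""
    for trigger, targets in to_add.items():
        deps.setdefault(trigger, set()).update(targets)

@classmethod
def flush_protocol_deps(cls) -> None:
    """Snapshot deps recorded since the last flush and clear the recording attributes."""
    assert cls.proto_deps is not None, "Protocol cache was not loaded"
    cls.add_protocol_deps(cls.proto_deps, cls._snapshot_protocol_deps())
    cls._checked_against_members.clear()
```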
import b
class C:
    x: int
x: b.P = C()
C() depends on the (abstract) attribute, so a potentially slightly more narrowly targeted test would use a variable with type C instead of calling C() -- maybe not in this case, but somewhere where attributes are added to a protocol, for example.
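For example, a variant of the test above along those lines (illustrative only, not a test that is actually in the PR):

```python
import b
class C:
    x: int
c: C
x: b.P = c  # checks only the protocol members, without touching __init__
```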
test-data/unit/fine-grained.test
Outdated
def f() -> None:
    def g(x: b.P) -> None:
        pass
    g(C())
The note above seems to apply here.
x: str
[out]
==
a.py:8: error: Argument 1 to "g" has incompatible type "C"; expected "P"
Not sure if all these test case variants are worth it. This seems pretty similar to what we had above?
I remember most of these were added for a reason, plus a few variations in target, but I don't remember now exactly which were just variations. I think there shouldn't be many; if you are sure about certain test cases, I can delete them.
test-data/unit/fine-grained.test
Outdated
a.py:8: note: Following member(s) of "C" have conflicts:
a.py:8: note: x: expected "int", got "str"

[case testProtocolConcreteUpdateTypeMetodGeneric]
Typo in test case name: Metod
test-data/unit/fine-grained.test
Outdated
a.py:2: note: Following member(s) of "C" have conflicts:
a.py:2: note: z: expected "int", got "str"

[case testWeAreCarefullWithBuiltinProtocols]
Typo: Carefull.
I think all issues are addressed now and the follow-up issues are opened, so I am going to merge this now. Thanks @JukkaL for the thorough review!
Support for protocols in fine-grained mode is non-trivial because protocols are not very "modular" (they can have action at a distance). To support them we introduce several new kinds of dependencies and global state that represents the protocol dependency map (serialized in a separate cache file).
There are three kinds of protocol dependencies. For example, after a subtype check:
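For instance (a sketch, with P a protocol class and C an ordinary class):

```python
x: P = C()
```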
the following dependencies will be generated:
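Roughly, the three kinds look like this (the exact trigger spellings here are a sketch; the docstring of the snapshot helper in mypy/typestate.py has the authoritative version, with SuperP standing for an explicit super-protocol of P):

1. <SuperP[wildcard]> -> <P>, <P[wildcard]> -> <P> -- wildcard triggers on the protocol and its super-protocols re-trigger the protocol itself.
2. <C.attr> -> <C> for every attr among P's protocol members -- changing a member that C was checked against re-triggers C.
3. <C> -> P -- a plain (non-trigger) target used to invalidate the subtype cache of the protocol when C changes.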
The first kind is generated immediately per-module, while the two other kinds are generated after all modules are type checked and we have recorded all the subtype checks.
Another change is that we reset subtype caches in protocols scheduled for recheck before processing any other targets. This avoids some false negatives in corner cases that are sensitive to module processing order.
I didn't test the scenario where the protocol cache files are tampered with; it is also hard to write any tests for this.
UPDATE: edited the description to reflect the current state of the PR.