Caching for fine-grained incremental mode #4483

Merged: 8 commits merged into master from fg-cache on Feb 7, 2018

Conversation

msullivan (Collaborator)

Work in progress for cache loading for fine-grained incremental mode.

I've done some manual testing and it seems to work all right. This needs a test suite.

A good improvement would be to load from caches without doing a full build.

msullivan force-pushed the fg-cache branch 2 times, most recently from 9a453cf to 61c38dd on January 18, 2018 21:25
msullivan force-pushed the fg-cache branch 2 times, most recently from d7c3d1e to 5a7add8 on February 1, 2018 01:57
msullivan changed the title from "[WIP] Caching for fine-grained incremental mode" to "Caching for fine-grained incremental mode" on Feb 6, 2018
msullivan requested a review from JukkaL on February 6, 2018 01:51
msullivan (Collaborator, Author)

This now has a test suite and supports doing a fine-grained incremental build to initialize.

There is one pretty dodgy hack to disable reading from the cache after the initial build that might need rethinking, and a few test cases have been disabled, but I think this is reviewable.

gvanrossum (Member)

gvanrossum commented Feb 6, 2018 via email

JukkaL (Collaborator) left a comment

Great, this should improve the performance of cold runs significantly when cache data is available. Most of my review comments are about adding code comments.

It would be nice to somehow verify that caching will improve performance. One idea is to verify which modules are fully refreshed in each update. We could ensure that unrelated modules don't get processed.

Note that I wasn't able to test this for real -- I'll ask about this offline.
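
On the idea of checking which modules get refreshed: a minimal sketch of such an assertion helper, assuming the test suite exposes some hook that reports the module ids rebuilt by an update. The names `run_update` and `assert_only_reprocessed` are hypothetical and not part of this PR.

from typing import Callable, Set


def assert_only_reprocessed(run_update: Callable[[], Set[str]],
                            allowed: Set[str]) -> None:
    """Run one fine-grained update and fail if any module outside `allowed` was refreshed."""
    reprocessed = run_update()  # hypothetical hook returning the module ids that were rebuilt
    unexpected = reprocessed - allowed
    assert not unexpected, 'unrelated modules were reprocessed: %s' % sorted(unexpected)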

mypy/build.py Outdated
@@ -2383,6 +2388,13 @@ def process_graph(graph: Graph, manager: BuildManager) -> None:
manager.log("Processing SCC of size %d (%s) as %s" % (size, scc_str, fresh_msg))
process_stale_scc(graph, scc, manager)

# If we are running in fine-grained incremental mode with caching,
# we need to always process fresh SCCs.

Also describe why we need to do this in the comment.
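
A sketch of what such an expanded comment could say; the stated rationale is an assumption about the design, not wording taken from the PR.

# If we are running in fine-grained incremental mode with caching,
# we need to always process fresh SCCs.  (Assumed rationale: the
# fine-grained machinery needs per-module state, at minimum the
# fine-grained dependency information, for everything in the graph,
# and that state is only set up when an SCC is processed rather than
# skipped, even if its cache data is up to date.)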

@@ -737,7 +743,13 @@ def propagate_changes_using_dependencies(
# TODO: Preserve order (set is not optimal)
for id, nodes in sorted(todo.items(), key=lambda x: x[0]):
assert id not in up_to_date_modules
triggered |= reprocess_nodes(manager, graph, id, nodes, deps)
# TODO: Is there a better way to detect that the file isn't loaded?
if not manager.modules[id].defs:

You could move this check to a helper method within BuildManager, which would be a bit cleaner.
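
For illustration, a hypothetical helper of the kind suggested, sketched as a fragment of BuildManager; the method name and placement are made up here, only the check itself comes from the diff.

class BuildManager:
    ...

    def module_not_loaded(self, id: str) -> bool:
        """Return True if we only have cache data for `id` (no parsed AST yet).

        An empty list of top-level definitions is used as a proxy for
        "the tree has not been parsed or deserialized", which is what the
        TODO in the diff is about.
        """
        return not self.modules[id].defs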

mypy/build.py Outdated
@@ -1131,6 +1131,11 @@ def validate_meta(meta: Optional[CacheMeta], id: str, path: Optional[str],
if not stat.S_ISREG(st.st_mode):
manager.log('Metadata abandoned for {}: file {} does not exist'.format(id, path))
return None

if manager.options.use_fine_grained_cache:
manager.log('Using potentially stale metadata for {}'.format(id))

Add comment here describing why we do this.
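
A sketch of the kind of comment being requested; the rationale below is my reading of how the fine-grained cache is meant to be used, not wording from the PR.

if manager.options.use_fine_grained_cache:
    # With a fine-grained cache we deliberately accept metadata that may be
    # out of date: the initial build only needs to load *some* state for
    # every module, and the fine-grained update that runs right afterwards
    # will recheck whatever actually changed since the cache was written.
    manager.log('Using potentially stale metadata for {}'.format(id))
    return meta  # assumed: treat the cached metadata as still valid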

@@ -262,12 +265,31 @@ def initialize_fine_grained(self, sources: List[mypy.build.BuildSource]) -> Dict
messages = result.errors
manager = result.manager
graph = result.graph
manager.options.cache_dir = os.devnull # XXX: HACK

Add comment describing a potential better way to do this?
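
One possible cleaner alternative, purely as a sketch: an explicit option that disables further cache reads instead of pointing cache_dir at os.devnull. The disable_cache_reads flag below is invented for illustration and is not part of this PR.

# Hypothetical: record the intent explicitly instead of repurposing cache_dir.
manager.options.disable_cache_reads = True

# ...and wherever cache metadata would otherwise be read (e.g. validate_meta):
# if manager.options.disable_cache_reads:
#     return None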

@@ -318,6 +319,9 @@ def mark_all_meta_as_memory_only(graph: Dict[str, State],
def get_all_dependencies(manager: BuildManager, graph: Dict[str, State],
options: Options) -> Dict[str, Set[str]]:
"""Return the fine-grained dependency map for an entire build."""
if not options.use_fine_grained_cache:

Add comment about the dependency map being included in the cache if cache is enabled.
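
The comment could say something along these lines (my wording, based on the check shown in the hunk above):

# When the fine-grained cache is in use, the dependency map was written out
# as part of the cache data on a previous run, so it does not need to be
# recomputed here; only the non-cached path builds it from scratch.
if not options.use_fine_grained_cache:
    ...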

triggered |= reprocess_nodes(manager, graph, id, nodes, deps)
# TODO: Is there a better way to detect that the file isn't loaded?
if not manager.modules[id].defs:
# We haven't actually loaded this file! Add it to the

Update comment to mention that we've only loaded the cache file for this, so there's no AST yet.

@@ -43,15 +43,30 @@ class FineGrainedSuite(DataSuite):
optional_out = True

def run_case(self, testcase: DataDrivenTestCase) -> None:
self.run_case_inner(testcase, cache=False)

# Reset the test case and run it again with caching on

Instead of running two variants of a test case within each test case, I'd prefer a separate suite that has caching on. A reasonable way to do this would be to subclass FineGrainedSuite for caching test cases. The subclass would be trivial and only set a flag that turns caching on. This way it would be easy to run only the caching or non-caching variant of a test case, making it easier to debug failures.
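
A sketch of what that trivial subclass might look like; the class name is illustrative, while the use_cache flag is the one consulted by should_skip() further down.

class FineGrainedCacheSuite(FineGrainedSuite):
    """Run the fine-grained test cases again, with caching enabled."""
    use_cache = True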

main_src = '\n'.join(testcase.input)
sources_override = self.parse_sources(main_src)
messages, manager, graph = self.build(main_src, testcase, sources_override)

print("Testing with cache: ", cache)

Once you have two test suites for non-caching and caching variants of test cases, this print statement can be removed. We prefer to produce no output when running passing test cases.

if fine_grained_manager is None:
messages, manager, graph = self.build(main_src, testcase, sources_override,
build_cache=False, enable_cache=cache)
manager.options.cache_dir = os.devnull # XXX: HACK

Document the purpose of the hack and potentially a better way to do it.

@@ -237,7 +237,9 @@ a.py:1: error: invalid syntax
b.py:3: error: Too many arguments for "f"
a.py:3: error: Too many arguments for "g"

[case testDeleteFileWithBlockingError]
[case testDeleteFileWithBlockingError-skip]

What about skipping this only in caching test cases? It would give us better test coverage. For example, add another test case name suffix that causes these test cases to be skipped in caching tests only. You could use this mechanism to have two variants of each test case with different outputs as well (caching / non-caching) by having two suffixes.

Also, please create issue(s) about fixing the remaining skipped test cases.

JukkaL (Collaborator) left a comment

Looks good now, feel free to merge after you've looked at my comments (only minor things). I didn't test this yet; it'll be easier to test and iterate on this once it has been merged.

"""Transitively rechecks targets based on triggers and the dependency map.

Returns a list (module id, path) tuples representing modules that contain
a target that needs to be reprocessed but that has not been parsed yet."""

Style nit: move """ to a separate line.
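
Applied to the docstring quoted above, that would look like this (with a small grammar fix added):

"""Transitively rechecks targets based on triggers and the dependency map.

Returns a list of (module id, path) tuples representing modules that contain
a target that needs to be reprocessed but that has not been parsed yet.
"""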

# as skipped, not just elided.
def should_skip(self, testcase: DataDrivenTestCase) -> bool:
if self.use_cache:
if testcase.name.endswith("-skip-cache"): return True

Style nit: move return True to a separate line.

if testcase.name.endswith("-skip-cache"): return True
# TODO: In caching mode we currently don't well support
# starting from cached states with errors in them.
if testcase.output and testcase.output[0] != '==': return True

Similar to above.

# starting from cached states with errors in them.
if testcase.output and testcase.output[0] != '==': return True
else:
if testcase.name.endswith("-skip-nocache"): return True

Similar to above.
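
For illustration, the same checks with each return on its own line; the logic is unchanged, and the trailing return False is implied by the hunks but not shown in them. The method is shown here as a standalone fragment of the suite class.

from mypy.test.data import DataDrivenTestCase


def should_skip(self, testcase: DataDrivenTestCase) -> bool:
    if self.use_cache:
        if testcase.name.endswith("-skip-cache"):
            return True
        # TODO: In caching mode we currently don't support starting from
        # cached states with errors in them very well.
        if testcase.output and testcase.output[0] != '==':
            return True
    else:
        if testcase.name.endswith("-skip-nocache"):
            return True
    return False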

@@ -90,7 +100,7 @@ def run_case_inner(self, testcase: DataDrivenTestCase, cache: bool) -> None:
# cache, now we need to set it up
if fine_grained_manager is None:
messages, manager, graph = self.build(main_src, testcase, sources_override,
build_cache=False, enable_cache=cache)
build_cache=False, enable_cache=True)

Is the change to a True argument on purpose?

msullivan (Collaborator, Author)

Yeah. This path only executes when the cache is on.

msullivan merged commit 4352a44 into master on Feb 7, 2018
msullivan deleted the fg-cache branch on February 7, 2018 19:02
carljm added a commit to carljm/mypy that referenced this pull request Feb 14, 2018
* master: (32 commits)
  Fix some fine-grained cache/fswatcher problems (python#4560)
  Sync typeshed (python#4559)
  Add _cached suffix to test cases in fine-grained tests with cache (python#4558)
  Add back support for simplified fine-grained logging (python#4557)
  Type checking of class decorators (python#4544)
  Sync typeshed (python#4556)
  When loading from a fine-grained cache, use the real path, not the cached (python#4555)
  Switch all of the fine-grained debug logging to use manager.log (python#4550)
  Caching for fine-grained incremental mode (python#4483)
  Fix --warn-return-any for NotImplemented (python#4545)
  Remove myunit (python#4369)
  Store line numbers of imports in the cache metadata (python#4533)
  README.md: Fix a typo (python#4529)
  Enable generation and caching of fine-grained dependencies from normal runs (python#4526)
  Move argument parsing for the fine-grained flag into the main arg parsing code (python#4524)
  Don't warn about unrecognized options starting with 'x_' (python#4522)
  stubgen: don't append star arg when args list already has varargs appended (python#4518)
  Handle TypedDict in diff and deps (python#4510)
  Fix Options.__repr__ to not infinite recurse (python#4514)
  Fix some fine-grained incremental bugs with newly imported files (python#4502)
  ...