Update on "recompile fx.GraphModule lazily"

Context: eellison's review comment [here](#103642 (comment)) complains about my code calling `torch.fx.GraphModule.recompile` after I changed the graph. We didn't simply remove the call to `recompile` at that time, since doing so increases the risk that users see or run stale Python code. In this PR, I recompile GraphModule lazily without increasing that risk.

When training BertForMaskedLM, `GraphModule.recompile` is called 707 times and takes 1.8s in total, while the whole compilation takes around 60 seconds. By spot checking, I found that the main reason we call `recompile` so frequently is the inductor pattern matcher. E.g., if we want to replace src_fn with dst_fn, we need to trace both src_fn and dst_fn. After tracing is done, we create a GraphModule, and the init method of GraphModule calls `recompile`.

In this PR, `recompile` just marks the class as needing recompilation and is very lightweight; `_real_recompile` does the real recompilation. By recompiling lazily, we reduce the number of calls to `GraphModule._real_recompile` to 37 and its total execution time to 0.045s.

[ghstack-poisoned]
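The lazy scheme described above can be sketched with a toy class. This is only an illustration of the idea, not the code in this PR: `LazyModule`, `set_source`, and `real_recompile_calls` are hypothetical names, and Python's built-in `compile`/`eval` stand in for FX code generation.

```python
class LazyModule:
    """Toy stand-in for a graph module (hypothetical, not torch.fx.GraphModule)."""

    def __init__(self, source: str):
        self._source = source
        self._code = None
        self._needs_recompile = False
        self.real_recompile_calls = 0  # counter for illustration only
        self.recompile()  # like GraphModule.__init__, request a recompile

    def set_source(self, source: str) -> None:
        # Any mutation requests a recompile, but does not pay for it yet.
        self._source = source
        self.recompile()

    def recompile(self) -> None:
        # Cheap: only mark the generated code as stale.
        self._needs_recompile = True

    def _real_recompile(self):
        # Expensive: actually regenerate the code object.
        self.real_recompile_calls += 1
        self._code = compile(self._source, "<lazy>", "eval")
        self._needs_recompile = False
        return self._code

    @property
    def code(self):
        # Every read goes through this check, so stale code is never observed.
        if self._needs_recompile:
            self._real_recompile()
        return self._code

    def __call__(self, **env):
        return eval(self.code, {}, env)


m = LazyModule("x + 1")
for i in range(5):
    m.set_source(f"x + {i}")   # six recompile requests so far, none executed
result = m(x=10)               # first use triggers exactly one real recompile
```

In this sketch, repeated mutations cost only a flag write, and the single expensive step runs on first use. A caller that needs the freshly generated code itself, as `__reduce__` does in the diff below, would call `_real_recompile` directly.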
pytorch · Jul 18, 2023 · 963bed2
2 parents 0acef59 + 1666f26
commit 963bed2
Showing 1 changed file with 1 addition and 1 deletion.
diff --git a/torch/fx/graph_module.py b/torch/fx/graph_module.py
@@ -751,7 +751,7 @@ def __reduce__(self):
         code to regenerate the underlying ``Graph``
         """
         dict_without_graph = self.__dict__.copy()
-        python_code = self.recompile()
+        python_code = self._real_recompile()
         import_block = _format_import_block(python_code.globals, sys_importer)
         del dict_without_graph['_graph']
         return (reduce_graph_module, (dict_without_graph, import_block))