Add LINM (Loop Invariant Node Motion) optimization pass in GraphOptim… #16306

Merged
merged 5 commits into tensorflow:master from minminsun:master on Mar 2, 2018

Conversation

@minminsun (Contributor) commented Jan 23, 2018

…izer
This change was inspired by LICM (Loop Invariant Code Motion) in compilers. We observed in some public models, e.g. seq2seq (https://github.com/google/seq2seq) and tensor2tensor (https://github.com/tensorflow/tensor2tensor), as well as in some of our in-house models, that there are many invariant nodes, including expensive MatMul nodes, inside the loop body.
This optimization pass is applied to the TensorFlow computational graph to detect these invariant nodes and move them out of the loop body, which is why we call it LINM (Loop Invariant Node Motion).
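
To make the motivation concrete, here is a small hypothetical sketch (an illustration, not code from this PR) of the pattern LINM targets: a MatMul whose inputs are all defined outside a tf.while_loop body is recomputed identically on every iteration unless it is hoisted.

```python
import tensorflow as tf

x = tf.random_normal([128, 256])
w = tf.get_variable("w", shape=[256, 256])

def body(i, acc):
    # Loop-invariant node: both inputs come from outside the loop, so this
    # MatMul produces the same value on every iteration.
    invariant = tf.matmul(x, w)
    return i + 1, acc + tf.reduce_sum(invariant)

_, total = tf.while_loop(lambda i, acc: i < 100, body, [0, 0.0])

# What LINM-style hoisting effectively produces: the MatMul is computed once,
# before the loop, and its result is reused inside the body.
hoisted = tf.matmul(x, w)
_, total_hoisted = tf.while_loop(
    lambda i, acc: i < 100,
    lambda i, acc: (i + 1, acc + tf.reduce_sum(hoisted)),
    [0, 0.0])
```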

Although there's already a LICM pass in XLA (51895fe), we still feel it is necessary to add this LINM pass in GraphOptimizer because:

  1. The XLA LICM pass is based on the XlaWhile instruction, but the conversion from the loop nodes (Enter/Exit/Switch/Merge/LoopCond) of tf.while to the XlaWhile instruction is not hooked up yet (https://groups.google.com/forum/#!topic/xla-dev/IqLyL67cemI).
  2. We further found that even if the conversion were hooked up, it would work only when all nodes inside the loop have XLA kernels registered. It is a long way to go before all operators are supported by XLA.
  3. The LINM pass in GraphOptimizer is expected to work regardless of whether XLA is on or off.

@googlebot googlebot added the cla: yes label Jan 23, 2018

@rmlarsen rmlarsen requested a review from eliben Jan 23, 2018

@rmlarsen rmlarsen self-assigned this Jan 23, 2018

@eliben eliben requested a review from hawkinsp Jan 24, 2018

@eliben (Member) commented Jan 24, 2018

I'm not the best person to review TF graph-level optimizations. Assigning to @hawkinsp for further dispensing

Also @jlebar for visibility

@eliben eliben removed their request for review Jan 24, 2018

@yangjunpro commented Jan 29, 2018

@eliben @rmlarsen Could anyone help take a look at this PR?
It has been waiting for review for more than one week.
Actually, we would like to set up closer collaboration with the community. That's why we pushed this PR as quickly as possible, to make sure the community learns the details of our work as early as possible.
Thanks

@eliben (Member) commented Jan 29, 2018

Jun, apologies for the delay. Complicated PRs take a bit longer to review; they require people closely familiar with the exact areas of code involved, and those folks may be on vacation or temporarily occupied.

@ebrevdo (Contributor) commented Jan 29, 2018

This looks like it would also make a good grappler graph optimization @benoitsteiner.

@rmlarsen (Member) commented Jan 29, 2018

@minminsun Most current work on graph-level optimizations is done in the Grappler subdirectory. Grappler does graph optimizations (constant folding & materialization, CSE, arithmetic optimizations, graph pruning, layout optimization etc.), and is on by default in the TF runtime. It operates on the same high-level graph representation (GraphDef) as TensorFlow.

Grappler has a number of advantages over the older graph optimizer framework:

  • It is actively maintained by a team at Google.
  • Test coverage is formidable - optimizers are tested against a Google-scale collection of model graphs. Also, by being on by default, new optimizations are run through every TensorFlow unit test.
  • It is backend agnostic: The optimizations will apply whether you use the TensorFlow runtime, XLA, TensorRT, or any other backend, including mobile and web backends like deeplearn.js.
  • It can be applied offline to stored graphs, which can be useful for mobile or serving from stored graphs.
  • It is applied to the full graph, before partitioning, which enables whole-graph optimizations that may not be possible after partitioning.

The individual passes are here:

https://github.com/tensorflow/tensorflow/tree/master/tensorflow/core/grappler/optimizers

We would be happy to help you migrate your code to the Grappler framework, and change it to operate as another Grappler optimizer pass run by the meta-optimizer. The main difference is that Grappler works at the GraphDef level, but the underlying algorithms for your pass should be the same.
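
To illustrate what working at the GraphDef level means (a generic sketch, not code from this thread): the graph Grappler rewrites is a GraphDef proto whose NodeDef entries carry the op type and input names, and it can be inspected directly from Python.

```python
import tensorflow as tf

a = tf.constant([[1.0, 2.0]], name="a")
b = tf.constant([[3.0], [4.0]], name="b")
c = tf.matmul(a, b, name="c")

# This proto is the representation that Grappler passes (constant folding,
# arithmetic optimization, etc.) read and rewrite.
graph_def = tf.get_default_graph().as_graph_def()
for node in graph_def.node:
    print(node.name, node.op, list(node.input))
```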

@benoitsteiner (Contributor) commented Jan 29, 2018

@minminsun I would also encourage you to refactor this PR to work at the GraphDef level in Grappler. In addition to the technical reasons that @rmlarsen mentioned above, we have just started to work on loop optimizations in Grappler (including invariant code motion) and it would be great to collaborate with you on this.

@rmlarsen rmlarsen removed the request for review from hawkinsp Jan 29, 2018

@minminsun (Contributor) commented Jan 30, 2018

Thanks for your suggestion @ebrevdo @rmlarsen @benoitsteiner . Sure, we'd like to migrate the LINM code to the Grappler framework as a Grappler optimizer pass.

@yangjunpro commented Jan 30, 2018

Thanks for your responses.
We would like to migrate our code from the original TF graph optimization passes to Grappler optimization passes. (We previously had a small PR related to Grappler's memory optimization accepted into r1.4, and recently we also did some op fusion work in the Grappler code base that we would like to push upstream once internal testing is mature enough.)
As @benoitsteiner mentioned, you are also adding loop optimization work to Grappler; could you share a bit more detail about those optimizations? Since we have already implemented the LINM optimization, we would be more than happy to re-implement it as another Grappler optimization pass. We would also like to know more about your Grappler-level loop optimization work so that your internal work does not end up duplicating ours; engineering resources are always limited and there are still quite a lot of things we could put on our radar :)
We have already run into this with the TensorRT/TF integration work :), and I know that working with the community will not completely avoid duplicated and overlapping work, since people looking at the same code base may arrive at the same optimization ideas for their specific scenarios. But I do want to set up a mechanism so that we can stay in sync with the community as early as possible, and then steer our engineering resources in a more productive direction.

Thanks

@benoitsteiner (Contributor) commented Jan 30, 2018

@yangjunpro: Here is where we are with respect to loop optimizations:

  • We have completed the work on shape inference for loops. Since most of the optimizations we do depend on shape information, this allowed us to optimize the body and fanout of loops using existing optimizers.
  • We implemented utilities that are helpful when optimizing loops (frame identification, ...)
  • We barely started to work on loop invariant code motion. Since you're miles ahead of us here, it will be much more productive to leverage your code.
  • We are thinking about optimizing the 0 iteration case by removing the loop body, enter, exit, merge, switch, and next iteration nodes. We haven't found this to be a common pattern though, so we haven't started to work on this.
  • Ultimately, we'd like to infer the iteration count when it is static, or an upper bound otherwise. This will enable us to detect and optimize the 1 iteration case by removing all the loop logic. This will also enable us to apply the memory optimizations we're working on to loops. We haven't started work on this either though.
  • We're also thinking about experimenting with partial loop unrolling. We aren't planning to work on this until we can analyze the loop iteration counter though.

Let us know if any of this is also on your radar, or if you have other optimizations in mind, and we'd be happy to align our efforts with yours. This will avoid further duplication of effort.

@linearhit commented Jan 31, 2018

Thanks for the information.

@benoitsteiner, as you said: "Since most of the optimizations we do depend on shape information, this allowed us to optimize the body and fanout of loops using existing optimizers."

Could you provide some more information about why the shape information is important (apart from XLA)?

@yangjunpro commented Jan 31, 2018

@benoitsteiner

Thanks for sharing the details of your plans for loop optimizations.
We have also discussed refactoring LINM into a Grappler pass. We currently plan to release the Grappler-based LINM PR by the end of February, since the long traditional Chinese Spring Festival holiday is coming up :) and there is also an ongoing project close to its release date, so we don't want to context-switch too frequently.
Let us know whether this time frame works for you.

Before the official PR is submitted, we will keep in sync with the community to make sure what we are doing stays on track. For example, you mentioned that "We implemented utilities that are helpful when optimizing loops (frame identification, ...)"; in LINM we have already implemented the same "frame identification" utility function, and if your implementation is ready, we can base our LINM implementation on your utility functions. If it is not ready, we would also be glad to contribute to its implementation, since frame identification is a somewhat tricky and complex piece of functionality.
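
As a rough illustration of what frame identification refers to here (a sketch, not code from either implementation): every tf.while loop opens a frame via Enter nodes, and the frame_name attribute on those nodes is what lets a pass assign loop-body nodes to their enclosing loop.

```python
import tensorflow as tf

i0 = tf.constant(0)
_ = tf.while_loop(lambda i: i < 10, lambda i: i + 1, [i0])

graph_def = tf.get_default_graph().as_graph_def()
for node in graph_def.node:
    if node.op == "Enter":
        # Nodes inside the loop body belong to the frame opened here; a frame
        # identification utility groups nodes by this attribute.
        print(node.name, node.attr["frame_name"].s)
```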

The loop optimization plans mentioned in your previous reply are really interesting work. Internally we have discussed loop unrolling, but we are not sure about its performance benefit: in our understanding, loop unrolling at the TF graph level may not bring as much gain as at the traditional IR/low-level language level, because the condition-check overhead is already amortized by the long execution of the loop body itself. Another potential improvement from loop unrolling is interleaved execution of loop iterations, but from our analysis it looks like not many DL workloads would trigger that behavior.
An interesting question we keep asking ourselves is: which traditional compiler optimization techniques are suitable for deep-learning graph-level optimization? XLA can re-implement a lot of traditional compiler optimization passes, since HLO IR is quite similar to a traditional programming language (XLA has its LLVM IR emitter and LLVM backends for different targets, GPU or CPU, etc.). But I am also wondering whether it would be more productive to leverage existing compilers for those mature targets, such as NVIDIA's nvcc or Intel's ICC. I don't know the exact answer. I have started a discussion here (https://groups.google.com/forum/#!topic/xla-dev/doFohtEAoLU), and it is still an open thread.

As for our graph-level optimization plans, here is a list:

  1. We are currently implementing a template-based op fusion engine (somewhat like TensorRT's catalog-based behavior). We found that some op fusion patterns occur with high frequency; for those patterns, we think it is worth writing a macro-op and then using template-based op fusion to replace the original subgraph with the macro-op inside the graph optimization passes. We have already made progress with this engine, and significant performance improvements have been observed. TensorFlow already has offline post-processing tools like graph_transform that rewrite the original graph into a faster one (e.g. replacing conv+bias+relu with a single op, and micro-op based BN with fused BN), which is a cheaper approach; however, introducing a post-processing phase requires a behavior change from end users, which can add overhead, especially for existing systems. So we chose to add this template-based op fusion engine as a new graph optimization pass (a small sketch of the kind of pattern we mean follows this list). Fortunately, we have caught up with the TF team in time, so we will make sure most of the logic is implemented as Grappler passes :)
  2. Another idea on our minds is whether the immutability of model weights at inference time could be further leveraged for a performance boost. For example, in the traditional EDA compiler community, dedicated logic optimization passes replace the original logic with a faster equivalent (such as mux logic optimization). We are curious whether such optimizations could be brought into TF. I know Grappler already has an arithmetic optimization pass that does some strength reduction, but I am wondering whether we could do this in a more principled way, such as formulating it as a SAT problem. This is not an easy problem, so after some initial investigation we temporarily switched to other threads. If the Google folks are interested, we could discuss the possibilities in this direction together.
  3. Profile-guided optimization. At Alibaba we have both in-house and public cloud environments in which various workloads run continuously, and many optimization passes may be triggered for each workload. We believe that by collecting this online runtime behavior as profiling data, we can guide the optimization of subsequent workload executions. Of course, judging from Google's MICRO 2010 paper, Google has had a company-wide online profiling tool for a while, so maybe you already have such optimization enabled in your in-house TF environment.
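
As a small sketch of the fusion pattern mentioned in item 1 (hypothetical code, not taken from our engine): the conv+bias+relu chain below is the kind of frequently occurring subgraph that a template-based pass would match and replace with a single macro-op.

```python
import tensorflow as tf

x = tf.random_normal([8, 32, 32, 16])           # NHWC input
w = tf.get_variable("w", shape=[3, 3, 16, 32])  # conv filter
b = tf.get_variable("b", shape=[32])            # bias

# The three separate nodes a fusion template would match...
conv = tf.nn.conv2d(x, w, strides=[1, 1, 1, 1], padding="SAME")
y = tf.nn.relu(tf.nn.bias_add(conv, b))
# ...and replace with one fused macro-op node, so the intermediate conv and
# bias-add results are never materialized as separate graph outputs.
```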

Thanks

@benoitsteiner (Contributor) commented Jan 31, 2018

@linearhit One of the main reasons we need shapes to optimize is that TensorFlow does automatic broadcasting. As a result, optimizations such as replacing A+0 or A*1 (where A, 0, and 1 are tensors) with A are only safe if A is at least as large as the constants. If, for example, A is the vector [a1, a2] and 0 is the tensor [[0,0],[0,0]], then A+0 results in the tensor [[a1,a2],[a1,a2]].
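
A quick way to see this (an illustration using NumPy, which follows the same broadcasting rule): simplifying A+0 to A would silently change the result's shape.

```python
import numpy as np

a = np.array([1.0, 2.0])   # shape (2,)
zero = np.zeros((2, 2))    # shape (2, 2)

print((a + zero).shape)    # (2, 2) -- broadcasting expanded a
print(a + zero)            # [[1. 2.]
                           #  [1. 2.]]
# Rewriting a + zero to just a would change the output shape from (2, 2) to
# (2,), which is why the rewrite needs shape information to be proven safe.
```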

@benoitsteiner (Contributor) commented Jan 31, 2018

@yangjunpro
Releasing the loop invariant code by the end of February would be great. That would leave plenty of time to make it part of the 1.7 release of TensorFlow.
We put common grappler utilities in the grappler/utils directory. In particular, our frame inference API resides in the frame.h header file. Another commonly used tool is our shape and type inference code.

We believe that loop unrolling can help the performance of models during training, since it would cut down some of the overhead needed to feed the activations generated during the forward pass to the corresponding gradient computations during the backward pass. We could also take advantage of unrolling to reduce memory usage: the idea is to unroll the loop k times and only keep the activation generated at the end of each unrolled forward mini-sequence of k iterations. This would reduce the amount of memory needed to keep activations around by a factor of k, at the expense of having to recompute the activations at the beginning of each unrolled backward sequence of k iterations.
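
A rough back-of-the-envelope for that tradeoff (an estimate, assuming a loop of N iterations that each produce one activation of comparable size): keeping only the activation at the end of each unrolled segment of k iterations and recomputing the rest during the backward pass gives peak activation memory of roughly N/k (stored checkpoints) + k (live activations while one segment is recomputed), which is minimized around k ≈ sqrt(N), at the cost of roughly one extra forward computation per segment.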

I've worked on formal verification of ASIC circuits in the past, so I've pondered the use of SAT solvers to drive some of the optimizations we're doing. I believe there is a lot of potential in doing this, but given the initial investment needed to get this off the ground I doubt we will get to this before next year. In any case, we'll get in touch with you before we start.

We've invested a lot of time in making it possible to collect performance data, either by running the graph or by simulating its execution. We're in the process of releasing the first optimizer that takes advantage of these predictions (the memory optimizer). We will start releasing the second one (automated graph placement) later this month. There are many more graph-level optimizations that could benefit from this, though. The first one that comes to mind is to automatically choose between the sparse, dense, or semi-sparse implementations of common TF operations depending on the input workload.

@yangjunpro commented Feb 1, 2018

@benoitsteiner
Regarding the loop unrolling optimization for memory savings, it looks similar to the idea from DeepMind's paper "Memory-Efficient Backpropagation Through Time". There is also related work from OpenAI: https://github.com/openai/gradient-checkpointing.

I agree with the idea of memory optimization via loop unrolling. However, I think it might be more graceful to implement it in the Grappler MemoryOptimizer. Since the MemoryOptimizer already has swap-to-host and recomputation support, this loop-unrolling memory optimization could be integrated there, so that all memory-related optimization code lives in the same pass.
Previously, we used a similar recomputation idea to implement a memory-efficient attention operator (since it is a specialized operator, we haven't pushed it to the community TF repo). It has to be admitted that implementing memory optimization in a specific way (such as a dedicated memory-efficient operator, or a customization of a special graph pattern such as the loop construct) requires less effort, whereas generalizing all of this memory optimization behavior in a principled way requires much more. But as a graph-optimization tool, maybe it would still be better to put this loop-unrolling-based memory optimization into the MemoryOptimizer? Please correct me if I am wrong; I'm sure you have thought about this design more deeply.

As to the "automated graph placement" optimization, is it applicable for local execution(I mean single worker) or both for distributed execution? For local execution, I personally feel that the optimization room may not be quite big. For distributed execution, the story is different, since based on the profiling data, we may choose how to allocate the workload computation among different devices(CPU, GPU, FPGA or other NPUs), also based on the profiling data, we may choose a optimal(or perhaps sub-optimal) distributed solution(worker number, ps number, communication strategy, shard strategy, etc.). Actually, the more complex the execution scenario is, the more room we could have with automatic placement. We are also started working on this job but it is not easy and before figuring out a general solution, we need to do a lot of task analysis to ensure the abstracted general solution is good enough.

Thanks

@martinwicke (Member) commented Feb 13, 2018

@benoitsteiner what should happen with this PR?

@martinwicke martinwicke requested a review from benoitsteiner Feb 13, 2018

@benoitsteiner (Contributor) commented Feb 15, 2018

@martinwicke: Alibaba will update the code, at which point we'll start the review/testing/merging process.

@rmlarsen rmlarsen added the stalled label Feb 15, 2018

@yangjunpro commented Feb 22, 2018

@benoitsteiner @martinwicke

We are already working on moving this optimization pass into Grappler. The major code logic is complete and we are adding unit tests now; hopefully we can submit the first Grappler-based pull request around the end of this month.

@minminsun

@minminsun minminsun closed this Feb 28, 2018

@minminsun minminsun force-pushed the minminsun:master branch from feada2a to 57b32ea Feb 28, 2018

@minminsun (Contributor) commented Feb 28, 2018

Hi @benoitsteiner

The LINM code has been migrated to grappler in this PR. Could you please help to take a look? Thanks!

BTW, I was not aware that (almost empty) loop_optimizer files had been added to Grappler in TF master 15 days ago, after I started the code migration, so my commit has conflicts with them. I'm preparing a new commit based on the latest TF master.

@benoitsteiner (Contributor) commented Feb 28, 2018

@minminsun: thanks, I'll start the code review. @rmlarsen added a new loop optimizer to start working on stack push/pop removal when possible. If this introduces further conflicts, they should be minimal.

@@ -93,6 +98,10 @@ Status MetaOptimizer::Optimize(Cluster* cluster, const GrapplerItem& item,
optimizers.push_back(std::unique_ptr<GraphOptimizer>(
new ArithmeticOptimizer(cfg_.arithmetic_optimization())));
}
if (cfg_.loop_optimization() != RewriterConfig::OFF) {

@benoitsteiner (Contributor) commented Feb 28, 2018

Can you change this to if (cfg_.loop_optimization() == RewriterConfig::ON) for now ? We typically find a few issues whenever we add a new optimization, so we always start by turning new optimizers OFF by default, put the code through its paces (we have a lot of models to try this on internally), solve the problems we uncover, and eventually turn it ON by default.
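
With the optimizer off by default, opting in during the trial period would look roughly like the following from Python (a sketch for context, not code from this PR; the exact proto plumbing may differ across TF 1.x versions).

```python
import tensorflow as tf
from tensorflow.core.protobuf import rewriter_config_pb2

# Explicitly turn the (off-by-default) Grappler loop optimizer ON.
rewrite_options = rewriter_config_pb2.RewriterConfig(
    loop_optimization=rewriter_config_pb2.RewriterConfig.ON)
graph_options = tf.GraphOptions(rewrite_options=rewrite_options)
config = tf.ConfigProto(graph_options=graph_options)

with tf.Session(config=config) as sess:
    pass  # build and run the model as usual; Grappler rewrites before execution
```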

@minminsun (Contributor) commented Mar 1, 2018

Thanks for the info. I have made this change in the new commit.

#include "tensorflow/core/platform/test.h"
#include "tensorflow/cc/ops/standard_ops.h"
#include "tensorflow/core/grappler/grappler_item.h"
#include "tensorflow/core/grappler/inputs/trivial_test_graph_input_yielder.h"

@benoitsteiner (Contributor) commented Feb 28, 2018

Are these 3 includes needed ?

@minminsun (Contributor) commented Mar 1, 2018

It turned out most of the includes in my loop_optimizer_test.cc are not needed, so I removed them and kept the includes exactly the same as in the loop_optimizer_test.cc in TF master, to ease the merge later.

@martinwicke (Member) commented Mar 1, 2018

@minminsun can you resolve the conflicts? Thanks!

@yangjunpro commented Mar 1, 2018

@martinwicke Sure, we will resolve the conflicts ASAP.
@minminsun

@minminsun (Contributor) commented Mar 2, 2018

@martinwicke I have resolved the conflicts. Let me know if there's any further issue, thanks!

LINM: a minor change in BUILD to fix gen_ci_sanity_out failure, and remove 'No newline at end of file' warning

@benoitsteiner benoitsteiner merged commit 84fe908 into tensorflow:master Mar 2, 2018

15 checks passed

Android Demo App Internal CI build successful
GPU CC Internal CI build successful
GPU Python3 Internal CI build successful
MacOS Contrib Internal CI build successful
MacOS Python2 and CC Internal CI build successful
Ubuntu CC Internal CI build successful
Ubuntu Makefile Internal CI build successful
Ubuntu Python2 Internal CI build successful
Ubuntu Python3 Internal CI build successful
Ubuntu Python3 PIP Internal CI build successful
Ubuntu Sanity Internal CI build successful
Ubuntu contrib Internal CI build successful
Windows CMake Internal CI build successful
XLA Internal CI build successful
cla/google All necessary CLAs are signed

StanislawAntol pushed a commit to StanislawAntol/tensorflow that referenced this pull request Mar 23, 2018

Add LINM (Loop Invariant Node Motion) optimization pass in GraphOptim… (tensorflow#16306)

* Add Loop Invariant Node Motion optimization in grappler

* linm: disable loop optimizations by default, remove includes not needed from loop_optimizer_test.cc

* remove redundant lines after merging with master

* LINM: a minor change in BUILD to fix gen_ci_sanity_out failure, and remove 'No newline at end of file' warning