
When GlobalJitLevel is on, disable the Grappler memory opt. #32245

Merged
15 commits merged on Sep 19, 2019

Conversation

@trentlo (Contributor) commented Sep 5, 2019

This commit disables the Grappler memory optimizer when the XLA JIT is detected to be on.

The (current) XLA clustering can result in a loss of concurrency between kernel compute and memory copies. As such, it usually loses the concurrency needed to hide the latencies of the inserted swap-ins and swap-outs, and incurs significant performance overhead.

You may find more details about the performance degradation in this document:
https://docs.google.com/document/d/1q1UPN2CRRNBoUXM0zORT-cTG9m7ctQRB2xfUPhaK1Ek/edit?usp=sharing

@sanjoy, please also help take a look.

@tensorflow-bot tensorflow-bot bot added the size:S CL Change Size: Small label Sep 5, 2019
@rthadur rthadur requested a review from sanjoy September 5, 2019 17:36
@rthadur rthadur self-assigned this Sep 5, 2019
@rthadur rthadur added this to Assigned Reviewer in PR Queue via automation Sep 5, 2019
@rthadur rthadur added comp:xla XLA comp:grappler Grappler related issues and removed comp:xla XLA labels Sep 5, 2019
This relaxes the disable check and should be slightly better behavior, as users still have some way to enable the memory optimizer when they want to.
@trentlo (Contributor, Author) commented Sep 5, 2019

One question:

According to GetGlobalJitLevelForGraph(), there are two ways to enable the XLA JIT: one is through the ConfigProto and the other is through an environment variable.

This PR can only check the ConfigProto; it cannot check the env var flags, as they only exist in the XLA world.

Do you think this is an issue? Any suggestions to address it?
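
For readers who want the concrete picture, here is a rough sketch of the two paths in question; the helper name MakeXlaEnabledConfig is only illustrative and not part of this PR:

#include "tensorflow/core/protobuf/config.pb.h"

// 1. Through the session ConfigProto: set the global JIT level explicitly.
tensorflow::ConfigProto MakeXlaEnabledConfig() {
  tensorflow::ConfigProto config;
  config.mutable_graph_options()
      ->mutable_optimizer_options()
      ->set_global_jit_level(tensorflow::OptimizerOptions::ON_2);
  return config;
}

// 2. Through the environment variable, which is parsed in compiler/jit/flags.cc
//    and therefore only exists when the XLA JIT is built in:
//      TF_XLA_FLAGS="--tf_xla_auto_jit=2" python train.py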

@nouiz (Contributor) commented Sep 5, 2019

maybe call getenv and parse the env variable?

@sanjoy (Contributor) commented Sep 5, 2019

This PR can only check the ConfigProto and cannot check the env var flags as they only exist in the XLA world.

I think we should just expose GetGlobalJitLevelForGraph to TensorFlow by moving it to a header/cc outside compiler/xla.

@rmlarsen WDYT?

@trentlo (Contributor, Author) commented Sep 5, 2019

This PR can only check the ConfigProto and cannot check the env var flags as they only exist in the XLA world.

I think we should just expose GetGlobalJitLevelForGraph to TensorFlow by moving it to a header/cc outside compiler/xla.

@rmlarsen WDYT?

It is more than that: GetGlobalJitLevelForGraph depends on compiler/jit/flags.cc, etc., which do not exist in a TensorFlow build without XLA enabled.

If we want to move the function, we will have to move several XLA-flag-related files with it. Considering that, below are the two options we have.

  1. Just check the proto config without checking the env variables. This may make sense, as the proto config should be the way for clients to enable XLA. Env variables are more for testing? Let me know if this is not the case, though.
  2. Move the XLA-flag-related files into TF.

Personally, I am slightly leaning towards option 1, but please give some guidance. Thanks.

@trentlo (Contributor, Author) commented Sep 5, 2019

maybe call getenv and parse the env variable?

IMHO, re-implementing the parsing logic in TF may be ugly, because there would then be two separate places to maintain for future env variable changes.

@sanjoy (Contributor) commented Sep 5, 2019

This may make sense as the proto config should be the way for clients to enable XLA. Env variables are more for testing? Let me know if this is not the case though.

Env vars are less elegant but they're super convenient when trying out XLA ("no need to make code changes, just flip this flag") so I'd like to make this work with env vars if possible.

Another possibility is to do some form of dependency injection via a registry -- allow the XLA JIT to register a callback that lets TF query whether the XLA global jit is enabled. This will be more code but should be cleaner overall.

@trentlo (Contributor, Author) commented Sep 5, 2019

Another possibility is to do some form of dependency injection via a registry -- allow the XLA JIT to register a callback that lets TF query whether the XLA global jit is enabled. This will be more code but should be cleaner overall.

Good suggestion. Let's proceed with this route.

Callbacks can be registered to this class so that runtime environment flags can be parsed to change configs in the TensorFlow core. A primary use of this is for the TensorFlow core to query the XLA JIT level, which can be configured by runtime environment flags in addition to the ConfigProto.
@trentlo (Contributor, Author) commented Sep 10, 2019

I implemented a config proxy in the TensorFlow core per our previous discussion. Please take a look when you have a moment. Thanks.

PR Queue automation moved this from Assigned Reviewer to Reviewer Requested Changes Sep 11, 2019

// Register callbacks.
XlaConfigProxy::ConfigSetterRegistration<ConfigA>
contrig_setter_registration_a([](ConfigA& config_a) -> bool {

Contributor:

contrib_setter_registration_a

Contributor Author:

Excuse me for the typo. Will fix it immediately.

namespace tensorflow {

// A proxy class of XLA config.
class XlaConfigProxy {

Contributor:

I'm not sure I understand how you're planning to use this; do you mind elaborating a bit?

Here is what I had in mind though: have a registry that lets XLA register an std::function<OptimizerOptions::GlobalJitLevel(const GraphOptimizationPassOptions& options)>. Then Grappler can check if XLA is enabled using this registered std::function; if nothing is registered then it can assume that XLA is not enabled.

Contributor Author:

Very similar, except that we need a singleton (or at least some static functions) for registering this std::function callback. I also slightly generalized the registry class into ConfigSetterRegistry.

ConfigSetterRegistry is a singleton utility class that registers a std::function<bool(ConfigType&)>, meaning it allows a callback to be registered per config type. In the current case, XLA registers a std::function<bool(GlobalJitLevel&)>, and TF can then invoke this std::function to update the GlobalJitLevel value.

Perhaps the generalization hurts readability. If that is the case, let me remove ConfigSetterRegistry and simply make XlaConfigProxy a singleton. Please feel free to let me know what you think.
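
To make the shape of this concrete, a minimal sketch of a ConfigSetterRegistry along the lines described above; this is an assumed outline, not the exact code in the PR:

#include <functional>
#include <utility>

namespace tensorflow {

// One setter slot per config type. XLA registers a callback (e.g. for the
// global JIT level) at static-initialization time, and the TF core calls
// Update() to let that callback overwrite the value.
template <typename ConfigType>
class ConfigSetterRegistry {
 public:
  static ConfigSetterRegistry* Global() {
    static ConfigSetterRegistry* instance = new ConfigSetterRegistry;
    return instance;
  }

  void Register(std::function<bool(ConfigType&)> setter) {
    setter_ = std::move(setter);
  }

  // Returns false (and leaves `config` untouched) when nothing is
  // registered, e.g. in a build without XLA.
  bool Update(ConfigType& config) const {
    return setter_ ? setter_(config) : false;
  }

 private:
  std::function<bool(ConfigType&)> setter_;
};

}  // namespace tensorflow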

@sanjoy (Contributor), Sep 11, 2019:

I think this generality is "YAGNI" for now.

Contributor Author:

I see your point.

@trentlo (Contributor, Author) left a comment:

Thanks for the review. Responses inlined.

@rthadur rthadur requested a review from sanjoy September 11, 2019 15:15
Also, rename XlaConfigProxy to XlaConfigRegistry.
@trentlo (Contributor, Author) left a comment:

Updated. Please take another look when you have a moment. Thanks.

OptimizerOptions::GlobalJitLevel jit_level_in_session_opts) {
XlaGlobalJitLevel xla_global_jit_level =
GetXlaGlobalJitLevel(jit_level_in_session_opts);
// Take the general flag to avoid the dependency on Tensorflow::Graph.

@trentlo (Contributor, Author), Sep 11, 2019:

I simply ignore the single_gpu flag, as tensorflow::Graph is not accessible in InitializeOptimizers(), where Grappler configures its optimizers.

It makes sense to me to simply expose the general flag (although the general and single_gpu flags are currently the same). Let me know what you think about this, though.

Contributor:

I'm worried that this makes XlaConfigRegistry::GetGlobalJitLevel "look" misleading.

How about this: we make the callback return an instance of XlaGlobalJitLevel. Then in optimizers/meta_optimizer.cc we do:

XlaGlobalJitLevel xla_global_jit_level = ...;
if (xla_global_jit_level.general is ON) {
  Disable memory opt
}
// We don't care about single GPU because the regression happens only on multi-GPU 

This is a bit of a hack, because if the graph happens to be single-GPU but the general auto-jit is enabled then we're still going to disable the optimization, but I think this is fine for getting us started. Behaviorally this is identical to what you have, but I think it is clearer.

Also a bit more context: if only single_gpu is enabled we absolutely don't want any behavior change on multi-GPU graphs. We're using --tf_xla_auto_jit=single-gpu(2) to stage the rollout of XLA and we want --tf_xla_auto_jit=single-gpu(2) to only affect single GPU graphs.
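
Roughly, the suggested check in optimizers/meta_optimizer.cc would look like the sketch below; the struct mirrors the XlaGlobalJitLevel pair from xla_cluster_util, and the helper name is illustrative only. (As discussed in the next comment, the PR ultimately checks both the general and single-gpu levels.)

#include "tensorflow/core/protobuf/config.pb.h"

namespace tensorflow {
namespace grappler {

// Mirrors the pair of JIT levels returned by the registered XLA callback:
// one level for single-GPU graphs and one for everything else.
struct XlaGlobalJitLevel {
  OptimizerOptions::GlobalJitLevel single_gpu;
  OptimizerOptions::GlobalJitLevel general;
};

// Per the suggestion above: if auto-jit is on for general (multi-GPU)
// graphs, skip the Grappler memory optimizer.
bool ShouldDisableMemoryOptimizer(const XlaGlobalJitLevel& xla_jit) {
  return xla_jit.general == OptimizerOptions::ON_1 ||
         xla_jit.general == OptimizerOptions::ON_2;
}

}  // namespace grappler
}  // namespace tensorflow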

@trentlo (Contributor, Author), Sep 12, 2019:

I see your point. Thanks for the good feedback.

I made some changes:

  1. Made the callback return XlaGlobalJitLevel, as this way is indeed less ambiguous.
  2. Added a check in the MetaOptimizer such that the memory optimizer is turned off only when the JIT is on for both single-GPU and multi-GPU graphs. This is a conservative approach and should be good enough for now.

Thanks for the context about how the single-gpu flag is used. It was not clear to me before, and the approach looks good to me.

Note that the swap-in/out latency issue should also exist on a single GPU, since the copy stream is different from the device compute stream in the default TensorFlow setting (i.e., concurrency between streams can still be lost).

A bit more context: the memory optimizer can be triggered on a single-GPU graph as long as the estimated device memory pressure is high. In the report, the inserted Horovod nodes simply raise the memory pressure enough to trigger the swap-in/out insertion condition; it does not mean that the memory optimization cannot be triggered on a single-GPU graph.

@kokoro-team kokoro-team removed the kokoro:force-run Tests on submitted change label Sep 13, 2019
@rthadur rthadur added ready to pull PR ready for merge process and removed ready to pull PR ready for merge process labels Sep 13, 2019
@trentlo (Contributor, Author) commented Sep 18, 2019

@rthadur

I noticed that the Ubuntu CC test may be stuck. Will that block the merging?

@sanjoy (Contributor) commented Sep 18, 2019

Will that block the merging?

I'm taking care of the merge. Sorry for the delay, I'm a bit busier than usual.

@trentlo (Contributor, Author) commented Sep 18, 2019

I'm taking care of the merge. Sorry for the delay, I'm a bit busier than usual.

Got it. No worries, take your time then. Thanks.

tensorflow-copybara pushed a commit that referenced this pull request Sep 19, 2019
@tensorflow-copybara tensorflow-copybara merged commit 67c50b0 into tensorflow:master Sep 19, 2019
PR Queue automation moved this from Approved by Reviewer to Merged Sep 19, 2019
@byronyi (Contributor) commented Sep 23, 2019

@trentlo @goldiegadde I am not sure whether this PR caught the r2.0 cherry-pick cycle. It does seem to fix a performance regression when running the official ResNet50 model with XLA on 8 GPUs.

I observed a significant performance boost (~3900 images/s to ~7500 images/s) comparing the 2.0 nightly packages from 2019-09-19 (without this PR) and 2019-09-20 (with this PR).

The benchmark I am running:

python3 official/vision/image_classification/resnet_imagenet_main.py \
    --ds=default \
    --bs=2048 \
    --dtype=fp16 \
    --enable_xla \
    --ng=8 \
    --use_tensor_lr \
    --pgtc=2 \
    --gt_mode=gpu_private \
    --datasets_num_private_threads=48

Ping @tfboyd to confirm. Several other CLs on the same day are suspicious, but my hunch is this one.

EDIT: my guess was wrong. It turns out that 59436c1, which reverts 53bdcf5, improves the performance greatly.

@byronyi (Contributor) commented Sep 23, 2019

I will cherry-pick this on my own r2.0 branch and report the numbers tomorrow.

@trentlo (Contributor, Author) commented Sep 23, 2019

@byronyi FYI below.

Indeed, cherry-picking the commits and measuring the performance difference is the most direct way to verify.

A quicker way is to use the following flags, which dump a set of graph files into ./graph_dump.

TF_DUMP_GRAPH_PREFIX="./graph_dump" TF_XLA_FLAGS="--tf_xla_clustering_debug=true" python3 ...

Then you can use grep -r GpuToHost ./graph_dump to see whether any copy ops were inserted by Grappler. If you see them, they are likely causing a significant performance penalty. These ops should disappear when XLA is on if this PR is included.

@tfboyd (Member) commented Sep 23, 2019

I am seeing the same bounce-back in the nightly tests. The XLA team stated they were not focused on multi-GPU tests, as their focus was XLA on by default for single GPU, so I stopped bothering them with the results. The drop to ~3,900 occurred around 03-AUG in my testing. I see the bounce-back to 8,500 with 2.0.0-dev20190922. This is still not back to the ~9K we saw before the drop. I doubt this will get cherry-picked into TF 2.0, as XLA is still experimental and RC2 is likely to be the last RC needed before release.

@byronyi (Contributor) commented Sep 25, 2019

@tfboyd It turns out that 53bdcf5 was the commit causing the bad performance in the nightly package, not the issue this PR fixes. That CL was reverted in 59436c1, so performance went back to ~7,600 in our environment.

So there must be some other change causing the performance degradation in r2.0. Do you mind pointing me to a snapshot with the expected performance, i.e. before the drop to ~3,900 around 03-AUG?

@tfboyd (Member) commented Sep 26, 2019

@byronyi Here is the info I have that may be most useful. It would be nice to get back to 9K, but it is also possible too many other changes have occurred. I only run on nightly, and only when nightly works.

Command for the test; you would need a copy of ImageNet.

# Flags
--batch_size=2048 --data_dir=/data/imagenet --dtype=fp16 --enable_eager --enable_xla --log_steps=10 --model_dir=/workspace/benchmarks/perfzero/workspace/output/2019-08-02-12-50-22-843844/benchmark_xla_8_gpu_fp16 --num_gpus=8 --noreport_accuracy_metrics --skip_eval --train_steps=110

# [PerfZero](https://github.com/tensorflow/benchmarks/tree/master/perfzero) command. I suggest using PerfZero when possible, as it helps ensure your
# args are the same and a common Docker image is used for repeatability. Not perfect, but lightweight
# and improving.
python3 /workspace/benchmarks/perfzero/lib/benchmark.py --benchmark_methods=official.benchmark.keras_imagenet_benchmark.Resnet50KerasBenchmarkReal.benchmark_xla_8_gpu_fp16 --data_downloads=gs://tf-perf-imagenet-uswest1/tensorflow/imagenet --python_path=models --git_repos="https://github.com/tensorflow/models.git;master;6b586a910d74a44f57da4d2335c79a20dc2803ab"

GOOD
07/31
8,723.9 to 9,000 images/sec
tensorflow hash:3e0ad8a004
model garden hash: 13e7c85

BAD
08/03
We had a break in the builds for a few days and, I believe, a big change to the default code path. Many problems.
2,764.3 to 2,900 images/sec.
tensorflow hash: 3205135
model garden: 6b586a9

Current
09/26
8,300 to 8,500 images/sec
tensorflow hash: 1ae01a0
model garden: 6f1e3b38d80f131e21f0721196df7cfc5ced2b74

It might be hard to bisect, as I think the model garden code changed significantly between the two dates (with the following lines), and we also moved the code around recently, so running the test now vs. back in August uses a different directory:

    # TODO(b/138957587): Remove when force_v2_in_keras_compile is no longer
    # a valid arg for this model. Also remove as a valid flag.
    if flags_obj.force_v2_in_keras_compile is not None:
      model.compile(
          loss='sparse_categorical_crossentropy',
          optimizer=optimizer,
          metrics=(['sparse_categorical_accuracy']
                   if flags_obj.report_accuracy_metrics else None),
          run_eagerly=flags_obj.run_eagerly,
          experimental_run_tf_function=flags_obj.force_v2_in_keras_compile)
    else:
      model.compile(
          loss='sparse_categorical_crossentropy',
          optimizer=optimizer,
          metrics=(['sparse_categorical_accuracy']
                   if flags_obj.report_accuracy_metrics else None),
          run_eagerly=flags_obj.run_eagerly)

@sganeshb has taken over testing. I am not involved in TensorFlow performance going forward.

@byronyi (Contributor) commented Sep 26, 2019

Thanks Toby @tfboyd! That’s much appreciated.

@byronyi (Contributor) commented Sep 27, 2019

For the curious: 1d139ed seems to bring the performance of r1.15 back to 7,600+. However, just one day later another regression, 53bdcf5, degrades the performance to <4,000 again.

@tfboyd (Member) commented Sep 27, 2019

@byronyi I wanted you to know I read your findings, and nice work tracking it down. While not good, your work shows the pain of finding old issues; they often end up layered. I would not want you to think your bisecting work was not looked at. It was a rough situation. The XLA team is/was focused only on single-GPU performance so they could make XLA on by default for that scenario. They seem to be looking more holistically now.

I am not sure there is anything we can do about it at this point other than move forward with TensorFlow 2.1. FYI, TensorFlow 2.0 definitely has the XLA regression for multi-GPU.

@byronyi (Contributor) commented Sep 27, 2019

Thanks Toby! I admit it could be difficult, but I am working hard with my colleagues to fix these regressions for our internal 2.0 fork.

Personally, I have been looking forward to 2.0 for too long. I just can't stand by and watch our users have to choose between 2.0 with new features and old 1.x without the perf regressions. We'd really like to have our cake and eat it too :)

@tfboyd (Member) commented Sep 27, 2019

@byronyi Not irrelevant at all. I have said before: if you can get, say, 10K examples/sec on 8x V100s with FP16 + XLA, but you need to scale to 32 GPUs (making up a number) to do that with FP32, then why are you focused on multi-node and wasting money? Figure out FP16. There are exceptions, for sure; this is a simple mental example. The other point being: if you scale up 1 GPU and 8 GPUs (whatever your single node is), then you are scaling up everything else.

@tfboyd (Member) commented Sep 27, 2019

I also agree; it is not cool that this exists in 1.x. It came down to the XLA team not wanting to focus on multi-GPU, and we stopped TF 1.x testing maybe too early. I do not recall the final data, but I stopped nightly perf tests on TF 1.x a couple of months ago; it was just too much for one person and there was a lack of enthusiasm to resolve issues. :-( Your job/role sounds really cool. I wish we had these groups when I started. I did a lot with multi-node and desperately wanted places to test. All I had was AWS and K80s... yup, a long time ago. I eventually just gave up, as it was pointless once the P100s came out and I did not have access. A very long story not approved for the public. I was ahead of schedule and burned out when the time came.

Super exciting.

@byronyi (Contributor) commented Sep 28, 2019

@anj-s I saw 53bdcf5 submitted to master again. I assume the performance regression discussed above has been fixed?

EDIT: I checked that the real bug is fixed by cc9938e. Thanks!

@byronyi (Contributor) commented Sep 28, 2019

@tfboyd The aforementioned regression seems to be caused by ef9f0e8, identified by bisecting between the 08-01 and 08-04 nightlies. This commit is also present in both r1.15 and r2.0.

I am testing with model garden version 127b158. The benchmark I am running is:

python3 official/vision/image_classification/resnet_imagenet_main.py \
  --dd=/path/to/imagenet \
  --ds=default \
  --bs=2048 \
  --dtype=fp16 \
  --enable_xla \
  --ng=8 \
  --use_tensor_lr \
  --pgtc=2 \
  --gt_mode=gpu_private \
  --datasets_num_private_threads=48 \
  --fp16_implementation=graph_rewrite

With this CL the performance drops from ~7,300 images/s to ~2,000 images/s, tested on an 8x V100-PCIE machine (balanced bus topology) using NCCL and MirroredStrategy.

The nightly 1.x TF packages are built using the following setup:

OS: Debian 9.9
GCC: 6.3
Python: 3.5.3
Bazel: 0.26.1
CUDA: 10.0
cuDNN: 7.6.2.24
TensorRT: 5.1.5

bazel build \
    --action_env=CUDA_TOOLKIT_PATH="/usr/local/cuda" \
    --action_env=GCC_HOST_COMPILER_PATH="/usr/bin/gcc" \
    --action_env=PYTHON_BIN_PATH="/usr/bin/python3" \
    --action_env=PYTHON_LIB_PATH="/usr/local/lib/python3.5/dist-packages" \
    --action_env=TF_CONFIGURE_IOS="0" \
    --action_env=TF_CUDA_COMPUTE_CAPABILITIES="6.1,7.0,7.5" \
    --config=cuda \
    --config=numa \
    --config=tensorrt \
    --copt=-Wno-sign-compare \
    --copt=-march=ivybridge \
    --define=with_default_optimizations=true \
    --define=with_xla_support=true \
    --host_copt=-march=ivybridge \
    --python_path="/usr/bin/python3" \
    //tensorflow/tools/pip_package:build_pip_package

I will test by reverting it on the release branches. In the meantime, I'd like to ping @rachellim to see if she has any quick fix that could be cherry-picked instead of reverting the offending commit.

UPDATE: reverting the following commits improves 8x V100 performance from ~1,600 to ~7,700 images/s on the latest r2.0 branch snapshot 64c3d38:

Ping @goldiegadde; is it still possible to revert these commits for 2.0? It would be very important for professional users to be able to reproduce the expected multi-GPU ResNet50 performance numbers.

I build the 2.0 package using the following options:

bazel build \
    --action_env=CUDA_TOOLKIT_PATH="/usr/local/cuda" \
    --action_env=GCC_HOST_COMPILER_PATH="/usr/bin/gcc" \
    --action_env=PYTHON_BIN_PATH="/usr/bin/python3" \
    --action_env=PYTHON_LIB_PATH="/usr/local/lib/python3.5/dist-packages" \
    --action_env=TF_CONFIGURE_IOS="0" \
    --action_env=TF_CUDA_COMPUTE_CAPABILITIES="6.1,7.0,7.5" \
    --config=cuda \
    --config=numa \
    --config=tensorrt \
    --copt=-Wno-sign-compare \
    --copt=-march=broadwell \
    --define=tf_api_version=2 \
    --define=with_default_optimizations=true \
    --define=with_xla_support=true \
    --host_copt=-march=broadwell \
    --python_path="/usr/bin/python3" \
    //tensorflow/tools/pip_package:build_pip_package

@rachellim (Contributor) commented:

Does 96a407a fix the performance issue? If so, maybe we can cherry-pick it into the release?

@tfboyd (Member) commented Sep 30, 2019

@rachellim Assuming it does fix the issue cleanly as a cherry-pick, it will not get into TF 2.0, as we are at the end after many weeks. XLA is an experimental feature, and a cherry-pick for 1.15 (same story with 2.0), while it may not seem risky, is likely not justified from a risk/reward standpoint. All of that negativity said, you can 100% petition the release owner. For additional context, I am an XLA fan.

@byronyi (Contributor) commented Oct 1, 2019

@rachellim Cherry-picking 96a407a along with 6d8f05a does seem to fix the performance regression. Ping @jsimsa; any possibility we could get these fixes into 2.0.1?

Btw, for r1.15, cherry-picking only 96a407a seems fine. I will test the performance and see how it goes. Ping @goldiegadde; at least we can still cherry-pick for r1.15, right?

@rachellim (Contributor) commented Oct 1, 2019

We'll cherry-pick this into r1.15. As for 2.x, I'll let @goldiegadde comment further.

Edit: Actually, I'll defer to the release owner @goldiegadde on how to handle this for both 1.15 and 2.x, since this PR involves a pretty significant code change (including behavior changes to datasets with distribution strategy).

Labels: cla: yes; comp:grappler (Grappler related issues); ready to pull (PR ready for merge process); size:S (CL Change Size: Small)