When GlobalJitLevel is on, disable the Grappler memory opt. #32245
Conversation
This relaxes the disable check and should be a slightly better behavior, as users still have some ways to enable the memory optimizer when they want to.
One question: this PR can only check the ConfigProto and cannot check the env var flags, as those only exist in the XLA world. Do you think this is an issue? Any suggestions to address it?
Maybe call getenv and parse the env variable?
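For illustration, the getenv route suggested here might look like the following minimal sketch. The TF_XLA_FLAGS variable and the --tf_xla_auto_jit flag do exist in the XLA world, but the parsing below is deliberately simplified and the function name is made up for this example; it is not the actual TensorFlow implementation.

```cpp
#include <cstdlib>
#include <string>

// Hypothetical sketch of parsing an auto-jit level out of the process
// environment. Real XLA flag parsing is more involved; this only shows
// the getenv-and-parse idea discussed above.
int ParseAutoJitLevelFromEnv() {
  const char* flags = std::getenv("TF_XLA_FLAGS");
  if (flags == nullptr) return 0;  // variable absent: treat JIT as off
  const std::string s(flags);
  const std::string key = "--tf_xla_auto_jit=";
  const std::size_t pos = s.find(key);
  if (pos == std::string::npos) return 0;  // flag not present
  // atoi returns 0 for non-numeric values such as "single-gpu(2)".
  return std::atoi(s.c_str() + pos + key.size());
}
```

One drawback, raised later in this thread, is that this duplicates parsing logic that already lives in the XLA flag code.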
I think we should just expose it. @rmlarsen WDYT?
It is more than that: if we want to move the function, we will have to move several XLA-flag-related files with it. Considering that, below are the two options we have.
Personally, I am slightly leaning towards option 1, but please give some guidance. Thanks.
IMHO, re-implementing the parsing logic in TF may be ugly, because there would then be two separate places to maintain for future env-variable changes.
Env vars are less elegant, but they're super convenient when trying out XLA ("no need to make code changes, just flip this flag"), so I'd like to make this work with env vars if possible. Another possibility is to do some form of dependency injection via a registry: allow the XLA JIT to register a callback that lets TF query whether the XLA global JIT is enabled. This will be more code but should be cleaner overall.
Good suggestion. Let's proceed with this route.
Callbacks can be registered with this class so that runtime environment flags can be parsed to change configs in the TensorFlow core. A primary use is for the TensorFlow core to query the XLA JIT level, which can be configured by runtime environment flags in addition to the ConfigProto.
I implemented a config proxy in the TensorFlow core per our previous discussion. Please take a look when you have a moment. Thanks.
// Register callbacks.
XlaConfigProxy::ConfigSetterRegistration<ConfigA>
    contrig_setter_registration_a([](ConfigA& config_a) -> bool {
contrib_setter_registration_a
Excuse me for the typo. Will fix it immediately.
namespace tensorflow {

// A proxy class of XLA config.
class XlaConfigProxy {
I'm not sure I understand how you're planning to use this, do you mind elaborating a bit?
Here is what I had in mind though: have a registry that lets XLA register an std::function<OptimizerOptions::GlobalJitLevel(const GraphOptimizationPassOptions& options)>. Then Grappler can check whether XLA is enabled using this registered std::function; if nothing is registered, it can assume that XLA is not enabled.
Very similar, just that we need a singleton (or at least some static functions) for registering this std::function callback. And I slightly generalized the registry class into ConfigSetterRegistry.

ConfigSetterRegistry is a singleton utility class for registering a std::function<bool(ConfigType&)>; that is, it allows a callback to be registered for each config type. In the current case, XLA registers a std::function<bool(GlobalJitLevel&)>, and TF can then invoke this std::function to update the GlobalJitLevel value.

Perhaps the generalization hurts readability. If that is the case, let me remove ConfigSetterRegistry and simply make XlaConfigProxy a singleton. Please feel free to let me know what you think.
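As a rough illustration of the ConfigSetterRegistry idea described here (the names and signatures below are assumptions modeled on this comment, not the actual TensorFlow code):

```cpp
#include <functional>
#include <utility>

// Sketch: a per-config-type singleton that stores one
// std::function<bool(ConfigType&)> callback. XLA-side code registers the
// callback; TF core invokes it to (possibly) update the config value.
template <typename ConfigType>
class ConfigSetterRegistry {
 public:
  static ConfigSetterRegistry& Global() {
    static ConfigSetterRegistry instance;  // Meyers singleton
    return instance;
  }

  // Called from XLA to install the setter for ConfigType.
  void Register(std::function<bool(ConfigType&)> setter) {
    setter_ = std::move(setter);
  }

  // Called from TF core; returns false when nothing is registered,
  // which callers can treat as "XLA is not enabled".
  bool Update(ConfigType& config) {
    if (!setter_) return false;
    return setter_(config);
  }

 private:
  std::function<bool(ConfigType&)> setter_;
};

// Example config type standing in for OptimizerOptions::GlobalJitLevel.
enum class JitLevel { kOff, kOn };
```

The singleton-per-type shape is what lets each config type carry its own independent callback.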
I think this generality is "YAGNI" for now.
I see your point.
Thanks for the review. Response inlined.
Excuse me for the typo. Will fix it immediately.
Also, rename XlaConfigProxy to XlaConfigRegistry.
Updated. Please take another look. Thanks.
    OptimizerOptions::GlobalJitLevel jit_level_in_session_opts) {
  XlaGlobalJitLevel xla_global_jit_level =
      GetXlaGlobalJitLevel(jit_level_in_session_opts);
  // Take the general flag to avoid the dependency on tensorflow::Graph.
I simply ignore the single_graph flags, as tensorflow::Graph is not accessible in InitializeOptimizers(), where Grappler configures its optimizers.

It makes sense to me to simply expose the general flag (although the general and single_gpu flags are currently the same). Let me know what you think about this, though.
I'm worried that this makes XlaConfigRegistry::GetGlobalJitLevel "look" misleading.

How about this: we make the callback return an instance of XlaGlobalJitLevel. Then in optimizers/meta_optimizer.cc we do:
XlaGlobalJitLevel xla_global_jit_level = ...;
if (xla_global_jit_level.general is ON) {
  Disable memory opt
}
// We don't care about single GPU because the regression happens only on multi-GPU
This is a bit of a hack, because if the graph happens to be single-GPU but the general auto-jit is enabled, we're still going to disable the optimization; but to get us started, I think this is fine. Behaviorally this is identical to what you have, but I think it is clearer.
Also a bit more context: if only single_gpu is enabled, we absolutely don't want any behavior change on multi-GPU graphs. We're using --tf_xla_auto_jit=single-gpu(2) to stage the rollout of XLA, and we want --tf_xla_auto_jit=single-gpu(2) to affect only single-GPU graphs.
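The check proposed above might be sketched like this. The XlaGlobalJitLevel layout and the helper name are assumptions made for illustration, not the actual TensorFlow types:

```cpp
// Sketch of the proposed check for optimizers/meta_optimizer.cc: disable
// the memory optimizer only when the *general* auto-jit is on, so that
// --tf_xla_auto_jit=single-gpu(2) never changes multi-GPU behavior.
struct XlaGlobalJitLevel {
  bool general = false;     // auto-jit requested for all graphs
  bool single_gpu = false;  // auto-jit requested for single-GPU graphs only
};

bool ShouldDisableMemoryOptimizer(const XlaGlobalJitLevel& jit) {
  return jit.general;
}
```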
I see your point. Thanks for the good feedback.

I made some changes:
- Made the callback return XlaGlobalJitLevel, as this way is indeed less ambiguous.
- Added a check in the MetaOptimizer such that the memory optimizer is turned off only when JIT is on for both single-GPU and multi-GPU graphs. This is a conservative approach and should be good enough for now.

Thanks for the context about how the single-gpu flag is used. It was not clear to me before, and the plan looks good to me.
Note that the swap-in/out latency issue should also exist on a single GPU, since the copy stream is different from the device compute stream in the default TensorFlow setting (i.e., concurrency between streams can be lost).

A bit more context: the memory optimizer can be triggered on a single-GPU graph as long as the estimated device memory pressure is high. In the report, it is just that the inserted Horovod nodes increase the memory pressure enough to trigger the swap-in/out insertion condition; it does not mean that the memory optimization cannot be triggered on a single-GPU graph.
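The conservative check described above (memory optimizer off only when JIT is on for both single-GPU and multi-GPU graphs) could be sketched as follows; again, the struct layout and function name are illustrative assumptions, not the actual TensorFlow code:

```cpp
// Sketch of the final, conservative check: the memory optimizer is turned
// off only when JIT is on for both single-GPU and general (multi-GPU)
// graphs, so that neither flag alone changes behavior unexpectedly.
struct XlaGlobalJitLevel {
  bool general = false;
  bool single_gpu = false;
};

bool ShouldDisableMemoryOptimizer(const XlaGlobalJitLevel& jit) {
  return jit.general && jit.single_gpu;
}
```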
I'm taking care of the merge. Sorry for the delay; I'm a bit busier than usual.

Got it. No worries, take your time. Thanks.
PiperOrigin-RevId: 270071858
@trentlo @goldiegadde I am not sure whether this PR caught the r2.0 cherry-pick cycle. It does seem to fix a performance regression when running the official ResNet50 model w/ XLA on 8 GPUs. I observed a significant performance boost (~3900 image/s to ~7500 image/s) comparing the 2.0 nightly packages from 2019-09-19 (wo/ this PR) and 2019-09-20 (w/ this PR). The benchmark I am running:
Ping @tfboyd to confirm. Several other CLs on the same day are suspicious, but my hunch is this one. EDIT: my guess is wrong. It turns out that 59436c1, which reverts 53bdcf5, improves the performance greatly.
I will cherry-pick this on my own r2.0 branch and report the numbers tomorrow.
@byronyi FYI as below. Indeed, cherry-picking the commits and measuring the performance difference is the most direct way to verify. A quick way to check is to use the following flags. It dumps a bunch of logged files into
Then you can use
I am seeing the same bounce back in the nightly tests. XLA stated they were not focused on multi-GPU tests, as their focus was XLA-on-by-default for a single GPU, so I stopped bothering them with the results. The drop to ~3,900 occurred around 03-AUG in my testing. I see the bounce back to 8,500 with 2.0.0-dev20190922. This is still not back to the ~9K we saw before the drop. I doubt this will get cherry-picked into TF 2.0, as XLA is still experimental and RC2 is likely to be the last RC needed before release.
@tfboyd It turns out that 53bdcf5 was the commit causing the bad performance in the nightly package, not the issue this PR fixes. That CL was reverted in 59436c1, so performance went back to ~7,600 in our environment. So there must be some other changes causing performance degradation in r2.0. Do you mind pointing me to a snapshot with the expected performance, i.e., from before the drop to ~3,900 around 03-AUG?
@byronyi Here is the info I have that may be of the most use. It would be nice to get back to the 9K, but it is also possible that too many other changes have occurred. I only run on nightly, and only when nightly works. Command for the test (you would need a copy of ImageNet):

# Flags
--batch_size=2048 --data_dir=/data/imagenet --dtype=fp16 --enable_eager --enable_xla --log_steps=10 --model_dir=/workspace/benchmarks/perfzero/workspace/output/2019-08-02-12-50-22-843844/benchmark_xla_8_gpu_fp16 --num_gpus=8 --noreport_accuracy_metrics --skip_eval --train_steps=110
# [PerfZero](https://github.com/tensorflow/benchmarks/tree/master/perfzero) command. I suggest using PerfZero when
# possible, as it helps ensure your args are the same and a common docker is used for repeatability.
# Not perfect, but lightweight and improving.
python3 /workspace/benchmarks/perfzero/lib/benchmark.py --benchmark_methods=official.benchmark.keras_imagenet_benchmark.Resnet50KerasBenchmarkReal.benchmark_xla_8_gpu_fp16 --data_downloads=gs://tf-perf-imagenet-uswest1/tensorflow/imagenet --python_path=models --git_repos="https://github.com/tensorflow/models.git;master;6b586a910d74a44f57da4d2335c79a20dc2803ab"

GOOD BAD Current

It might be hard to bisect, as I think the model garden code changed significantly between the two dates with the following lines, and we also moved the code around recently, so running the test now vs. back in August uses a different directory:

# TODO(b/138957587): Remove when force_v2_in_keras_compile is no longer
# a valid arg for this model. Also remove as a valid flag.
if flags_obj.force_v2_in_keras_compile is not None:
  model.compile(
      loss='sparse_categorical_crossentropy',
      optimizer=optimizer,
      metrics=(['sparse_categorical_accuracy']
               if flags_obj.report_accuracy_metrics else None),
      run_eagerly=flags_obj.run_eagerly,
      experimental_run_tf_function=flags_obj.force_v2_in_keras_compile)
else:
  model.compile(
      loss='sparse_categorical_crossentropy',
      optimizer=optimizer,
      metrics=(['sparse_categorical_accuracy']
               if flags_obj.report_accuracy_metrics else None),
      run_eagerly=flags_obj.run_eagerly)

@sganeshb has taken over testing. I am not involved in TensorFlow performance going forward.
Thanks, Toby @tfboyd! That's much appreciated.
@byronyi Wanted you to know that I read your findings, and nice work tracking it down. While not good, your work shows the pain of finding old issues; they often end up layered. I would not want you to think your bisecting work was not looked at. It was a rough situation. The XLA team is/was focused only on single-GPU performance so they could make XLA on by default for that scenario. They seem to be looking more holistically now. I am not sure there is anything we can do about it at this point other than move forward with TensorFlow 2.1. FYI, TensorFlow 2.0 100% has the XLA regression for multi-GPU.
Thanks, Toby! I admit it could be difficult, but I am working hard with my colleagues to fix these regressions for our internal 2.0 fork. Personally, I have been looking forward to it for too long. We just can't sit back and watch our users have to choose between 2.0 w/ new features and old 1.x wo/ perf regressions. We'd really like to have our cake and eat it :)
@byronyi Not irrelevant at all. I have said before: if you can get, say, 10K examples/sec on 8x V100s with FP16 + XLA, but you need to scale to 32 GPUs (making up a number) to do that with FP32, then why are you focused on multi-node and wasting money? Figure out FP16. There are exceptions, for sure; this is a simple mental example. The other point being: if you scale up 1 GPU and 8 GPUs (whatever your single node is), then you are scaling up everything else.
I also agree. It is not cool that this exists in 1.x. It came down to the XLA team not wanting to focus on multi-GPU, and we stopped TF 1.x testing maybe too early. I do not recall the final data, but I stopped nightly perf tests on TF 1.x a couple of months ago; it was just too much for one person, and there was a lack of enthusiasm to resolve issues. :-( Your job/role sounds really cool. I wish we had these groups when I started. I did a lot with multi-node and desperately wanted places to test. All I had was AWS and K80s... yup, a long time ago. I eventually just gave up, as it was pointless; the P100s came out and I did not have access. A very long story not approved for the public. I was ahead of schedule and burned out when the time came. Super exciting.
@tfboyd The aforementioned regression seems to be caused by ef9f0e8, identified by bisecting between the 08-01 and 08-04 nightlies. This commit is also present in both r1.15 and r2.0. I am testing with model garden version 127b158. The benchmark I am running is:
With this CL the performance drops from ~7300 image/s to ~2000 image/s, tested on an 8x V100-PCIE machine (balanced bus topology) using NCCL and MirroredStrategy. The nightly 1.x TF packages are built using the following setup: OS: Debian 9.9
I will test by reverting it on the release branches. In the meantime, I'd like to ping @rachellim to see if she has any quick fixes we could cherry-pick instead of reverting the offending commit. UPDATE: reverting the following commits improves 8x V100 performance from ~1,600 to ~7,700 image/s on the latest r2.0 branch snapshot 64c3d38: Ping @goldiegadde; is it still possible to revert these commits for 2.0? It will be very important for professional users to be able to reproduce the expected multi-GPU ResNet50 performance numbers. I build the 2.0 package using the following options:
Does 96a407a fix the performance issue? If so, maybe we can cherry-pick it into the release?
@rachellim Assuming it does fix the issue cleanly as a cherry-pick: it will not get into TF 2.0, as we are at the end after many weeks. XLA is an experimental feature, and doing a cherry-pick for 1.15 (same story with 2.0), while it may not seem risky, is likely not justified from a risk/reward standpoint. All of that negativity said, you can 100% petition the release owner. I am an XLA fan, for additional context.
@rachellim Cherry-picking 96a407a along with 6d8f05a does seem to fix the performance regression. Ping @jsimsa; any possibility we get these fixes into 2.0.1? Btw, for r1.15, cherry-picking 96a407a alone seems fine. I will test the performance and see how it goes. Ping @goldiegadde; at least we can still cherry-pick for r1.15, right?
Edit: Actually, I'll defer to the release owner @goldiegadde on how to handle this for both 1.15 and 2.x, since this PR involves a pretty significant code change (including behavior changes to datasets with distribution strategy).
This commit disables the Grappler memory optimizer when the XLA JIT is detected to be on.

The (current) XLA clustering can result in a loss of concurrency between kernel compute and memory copies. As such, it usually loses the concurrency needed to hide the latencies of the inserted swap-ins and swap-outs, and it incurs a large performance overhead.
You may find more details about the performance degradation in this document:
https://docs.google.com/document/d/1q1UPN2CRRNBoUXM0zORT-cTG9m7ctQRB2xfUPhaK1Ek/edit?usp=sharing
@sanjoy, please also take a look.