
sarkars/Upgrade TF #98

Merged
merged 17 commits into master from sarkars/upgrade_tf1.14 on Jun 6, 2019

Conversation

@sayantan-nervana (Contributor) commented Jun 4, 2019

  1. Upgrade TF to 1.14rc0.
  2. Pull in changes from r0.13 so that it compiles with TF 1.14.
  3. Add FusedBatchNormV3 and FusedBatchNormGradV3 so that the ResNet test passes in CI.
  4. Increase the timeout for building in the Buildkite YAMLs and move the --use_prebuilt_tensorflow test to CentOS.
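
Not part of the PR itself, but a minimal sanity-check sketch for item 3: it assumes the TF 1.x op_def_registry accessor (the registry API changed in TF 2.x) and simply confirms that the installed TensorFlow is 1.14 and that the new batch-norm ops are registered before the bridge tries to translate them.

    # Hypothetical sanity check, not part of this PR: verify the TF version and
    # that FusedBatchNormV3 / FusedBatchNormGradV3 exist in the op registry.
    import tensorflow as tf
    from tensorflow.python.framework import op_def_registry

    print("TensorFlow version:", tf.__version__)  # expected to start with 1.14

    # get_registered_ops() is the TF 1.x accessor and returns {op name: OpDef}.
    registered = op_def_registry.get_registered_ops()
    for op_name in ("FusedBatchNormV3", "FusedBatchNormGradV3"):
        print(op_name, "registered:", op_name in registered)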

@sayantan-nervana added the 🌋 experimental (This is an experimental PR. May not be merged.) and wip (Work in progress) labels on Jun 4, 2019
@@ -0,0 +1,105 @@
- command: |
@sayantan-nervana (Contributor, Author) commented:
Delete this file later after the Buildkite pipeline is fixed.

@sayantan-nervana removed the 🌋 experimental (This is an experimental PR. May not be merged.) label on Jun 5, 2019
@sayantan-nervana (Contributor, Author) commented Jun 5, 2019

TODO:

  1. Enable the grappler, var-opt, prebuilt-TF, and other pipelines.
  2. Run the run-all-models test on TF 1.14.

@sayantan-nervana (Contributor, Author) commented Jun 5, 2019

Failures from run-all-models in a normal build with TF built from source:

run-densenet.sh: seems to run when I just run ./run-densenet.sh
run-fasterRCNN.sh: same error as mobilenet
run-mobilenet-v2.sh: tensorflow.python.framework.errors_impl.UnimplementedError: Unsupported _FusedConv2D FusedBatchNorm,Relu6
Should be fixed when r0.13 is merged (a debugging workaround sketch follows below)
run-rfcn.sh: same error as mobilenet

Failures in a normal build with TF installed from pip, running on CentOS (2 extra failures):

run-densenet.sh
run-fasterRCNN.sh
run-mobilenet-v2.sh
run-rfcn.sh

run-a3c.sh: OSError: /lib64/libstdc++.so.6: version `GLIBCXX_3.4.20' not found (required by /localdisk/sarkars/ngbridge/ngraph-bridge/build_cmake/venv-tf-py3/lib/python3.6/site-packages/ale_python_interface/libale_c.so)
run-yolo-v2.sh: ImportError: /localdisk/sarkars/ngbridge/ngraph-models/tensorflow_scripts/yolov2/darkflow/darkflow/cython_utils/cy_yolo_findboxes.so: undefined symbol: _Py_ZeroStruct

All in all, ./run-all-models seems to be OK.
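
The Unsupported _FusedConv2D error above comes from grappler's remapper fusing Conv2D + FusedBatchNorm + Relu6 into _FusedConv2D, which the bridge cannot yet translate; the real fix is pulling in the r0.13 changes. As a debugging-only sketch (an assumption, not what this PR does), remapping can be switched off in the session config so the unfused ops are kept:

    # Debugging-only workaround sketch (assumption, not the fix in this PR):
    # disable grappler's remapper so Conv2D/FusedBatchNorm/Relu6 are not fused
    # into the unsupported _FusedConv2D node.
    import tensorflow as tf
    from tensorflow.core.protobuf import rewriter_config_pb2

    config = tf.compat.v1.ConfigProto()
    config.graph_options.rewrite_options.remapping = rewriter_config_pb2.RewriterConfig.OFF

    with tf.compat.v1.Session(config=config) as sess:
        # In practice this config would be threaded into the model's Python
        # driver (e.g. the script behind run-mobilenet-v2.sh).
        print(sess.run(tf.constant(1.0)))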

@sayantan-nervana (Contributor, Author) commented Jun 6, 2019

Some other builds fail with:

[ 78%] Linking CXX shared library ../../libinterpreter_backend.so
[ 78%] Built target interpreter_backend
# Received cancellation signal, interrupting
make[2]: *** [src/ngraph/runtime/cpu/CMakeFiles/cpu_backend.dir/builder/dot.cpp.o] Terminated
make[1]: *** [src/ngraph/runtime/cpu/CMakeFiles/cpu_backend.dir/all] Terminated
make: *** [all] Terminated
🚨 Error: The command was interrupted by a signal
2019-06-05 18:48:01 DEBUG  Terminating bootstrap after cancellation with terminated

This is because builds are timing out!

GPU ResNet seems to be failing with:

2019-06-05 18:40:22.723656: W tensorflow/compiler/jit/mark_for_compilation_pass.cc:1412] (One-time warning): Not using XLA:CPU for cluster because envvar TF_XLA_FLAGS=--tf_xla_cpu_global_jit was not set.  If you want XLA:CPU, either set that envvar, or use experimental_jit_scope to enable XLA:CPU.  To confirm that XLA is active, pass --vmodule=xla_compilation_cache=1 (as a proper command-line flag, not via TF_XLA_FLAGS) or set the envvar XLA_FLAGS=--xla_hlo_profile.
I0605 18:40:33.643381 139697568515840 session_manager.py:500] Running local_init_op.
I0605 18:40:34.413573 139697568515840 session_manager.py:502] Done running local_init_op.
Running warm up
terminate called after throwing an instance of 'std::runtime_error'
what():
error: cudaMalloc(static_cast<void**>(&allocated_buffer_pool), buffer_size) failed with error
file: /localdisk/buildkite/nervana-titanxp26-fm-intel-com-1/ngraph/ngtf-gpu-ubuntu-18-04/build_cmake/ngraph/src/ngraph/runtime/gpu/gpu_util.cpp
line: 49
msg: out of memory
Fatal Python error: Aborted

Need to investigate:

  1. Locally tried --enable_variables_and_optimizers with a CPU build; ResNet passes.
  2. Locally tried --enable_variables_and_optimizers --build_gpu_backend on a GPU machine; ResNet passes.
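
For the GPU ResNet failure above, cudaMalloc inside the nGraph GPU backend is running out of device memory. One hedged thing to try while investigating (an assumption, not the resolution adopted here) is to keep TensorFlow from pre-allocating the whole GPU so the backend's own allocations have headroom:

    # Hedged diagnostic sketch (assumption, not the fix adopted in this PR):
    # stop TensorFlow from grabbing all GPU memory up front so the nGraph GPU
    # backend's cudaMalloc calls have room.
    import tensorflow as tf

    config = tf.compat.v1.ConfigProto()
    config.gpu_options.allow_growth = True                    # allocate lazily
    config.gpu_options.per_process_gpu_memory_fraction = 0.5  # cap TF's share

    with tf.compat.v1.Session(config=config) as sess:
        print(sess.run(tf.constant(1.0)))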

@sayantan-nervana removed the wip (Work in progress) label on Jun 6, 2019
@avijit-nervana (Contributor) left a comment:
Great job!

@avijit-nervana merged commit ab6b999 into master on Jun 6, 2019
@avijit-nervana deleted the sarkars/upgrade_tf1.14 branch on June 6, 2019 at 04:54
gopoka pushed a commit that referenced this pull request Oct 28, 2019