
sarkars/Upgrade TF #98

Merged
merged 17 commits into master from sarkars/upgrade_tf1.14 on Jun 6, 2019

Conversation

@sayantan-nervana (Contributor) commented Jun 4, 2019

  1. Upgrade TF to 1.14rc0.
  2. Pull in changes from r0.13 so that it compiles with TF 1.14.
  3. Add FusedBatchNormV3 and FusedBatchNormGradV3 so that the ResNet test passes in CI.
  4. Increase the timeout for building in the Buildkite YAMLs and move the --use_prebuilt_tensorflow test to CentOS.
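
Not part of the PR itself, but a minimal sanity-check sketch for item 3: it assumes the TF 1.x op_def_registry accessor (the registry API changed in TF 2.x) and simply confirms that the installed TensorFlow is 1.14 and that the new batch-norm ops are registered before the bridge tries to translate them.

    # Hypothetical sanity check, not part of this PR: verify the TF version and
    # that FusedBatchNormV3 / FusedBatchNormGradV3 exist in the op registry.
    import tensorflow as tf
    from tensorflow.python.framework import op_def_registry

    print("TensorFlow version:", tf.__version__)  # expected to start with 1.14

    # get_registered_ops() is the TF 1.x accessor and returns {op name: OpDef}.
    registered = op_def_registry.get_registered_ops()
    for op_name in ("FusedBatchNormV3", "FusedBatchNormGradV3"):
        print(op_name, "registered:", op_name in registered)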

@sayantan-nervana added the 🌋 experimental (This is an experimental PR. May not be merged.) and wip (Work in progress) labels on Jun 4, 2019
@@ -0,0 +1,105 @@
- command: |
@sayantan-nervana (Contributor, Author) commented:
Delete this file later after the Buildkite pipeline is fixed.

@sayantan-nervana removed the 🌋 experimental (This is an experimental PR. May not be merged.) label on Jun 5, 2019
@sayantan-nervana (Contributor, Author) commented Jun 5, 2019

TODO:

  1. Enable the grappler, var-opt, prebuilt-TF, and other pipelines.
  2. Run the run-all-models test on TF 1.14.

@sayantan-nervana (Contributor, Author) commented Jun 5, 2019

Failures from run-all-models in a normal build with TF built from source:

run-densenet.sh: seems to run when I just run ./run-densenet.sh
run-fasterRCNN.sh: same error as mobilenet
run-mobilenet-v2.sh: tensorflow.python.framework.errors_impl.UnimplementedError: Unsupported _FusedConv2D FusedBatchNorm,Relu6
Should be fixed when r0.13 is merged (a debugging workaround sketch follows below)
run-rfcn.sh: same error as mobilenet

Failures in a normal build with TF installed from pip, running on CentOS (2 extra failures):

run-densenet.sh
run-fasterRCNN.sh
run-mobilenet-v2.sh
run-rfcn.sh

run-a3c.sh: OSError: /lib64/libstdc++.so.6: version `GLIBCXX_3.4.20' not found (required by /localdisk/sarkars/ngbridge/ngraph-bridge/build_cmake/venv-tf-py3/lib/python3.6/site-packages/ale_python_interface/libale_c.so)
run-yolo-v2.sh: ImportError: /localdisk/sarkars/ngbridge/ngraph-models/tensorflow_scripts/yolov2/darkflow/darkflow/cython_utils/cy_yolo_findboxes.so: undefined symbol: _Py_ZeroStruct

All in all, ./run-all-models seems to be OK.
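
The Unsupported _FusedConv2D error above comes from grappler's remapper fusing Conv2D + FusedBatchNorm + Relu6 into _FusedConv2D, which the bridge cannot yet translate; the real fix is pulling in the r0.13 changes. As a debugging-only sketch (an assumption, not what this PR does), remapping can be switched off in the session config so the unfused ops are kept:

    # Debugging-only workaround sketch (assumption, not the fix in this PR):
    # disable grappler's remapper so Conv2D/FusedBatchNorm/Relu6 are not fused
    # into the unsupported _FusedConv2D node.
    import tensorflow as tf
    from tensorflow.core.protobuf import rewriter_config_pb2

    config = tf.compat.v1.ConfigProto()
    config.graph_options.rewrite_options.remapping = rewriter_config_pb2.RewriterConfig.OFF

    with tf.compat.v1.Session(config=config) as sess:
        # In practice this config would be threaded into the model's Python
        # driver (e.g. the script behind run-mobilenet-v2.sh).
        print(sess.run(tf.constant(1.0)))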

@sayantan-nervana (Contributor, Author) commented Jun 6, 2019

Some other builds fail with:

[ 78%] Linking CXX shared library ../../libinterpreter_backend.so
[ 78%] Built target interpreter_backend
# Received cancellation signal, interrupting
make[2]: *** [src/ngraph/runtime/cpu/CMakeFiles/cpu_backend.dir/builder/dot.cpp.o] Terminated
make[1]: *** [src/ngraph/runtime/cpu/CMakeFiles/cpu_backend.dir/all] Terminated
make: *** [all] Terminated
🚨 Error: The command was interrupted by a signal
2019-06-05 18:48:01 DEBUG  Terminating bootstrap after cancellation with terminated

This is because builds are timing out!

GPU ResNet seems to be failing with:

2019-06-05 18:40:22.723656: W tensorflow/compiler/jit/mark_for_compilation_pass.cc:1412] (One-time warning): Not using XLA:CPU for cluster because envvar TF_XLA_FLAGS=--tf_xla_cpu_global_jit was not set.  If you want XLA:CPU, either set that envvar, or use experimental_jit_scope to enable XLA:CPU.  To confirm that XLA is active, pass --vmodule=xla_compilation_cache=1 (as a proper command-line flag, not via TF_XLA_FLAGS) or set the envvar XLA_FLAGS=--xla_hlo_profile.
I0605 18:40:33.643381 139697568515840 session_manager.py:500] Running local_init_op.
I0605 18:40:34.413573 139697568515840 session_manager.py:502] Done running local_init_op.
Running warm up
terminate called after throwing an instance of 'std::runtime_error'
what():
error: cudaMalloc(static_cast<void**>(&allocated_buffer_pool), buffer_size) failed with error
file: /localdisk/buildkite/nervana-titanxp26-fm-intel-com-1/ngraph/ngtf-gpu-ubuntu-18-04/build_cmake/ngraph/src/ngraph/runtime/gpu/gpu_util.cpp
line: 49
msg: out of memory
Fatal Python error: Aborted

Need to investigate:

  1. Locally tried --enable_variables_and_optimizers with a CPU build; ResNet passes.
  2. Locally tried --enable_variables_and_optimizers --build_gpu_backend on a GPU machine; ResNet passes.
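
For the GPU ResNet failure above, cudaMalloc inside the nGraph GPU backend is running out of device memory. One hedged thing to try while investigating (an assumption, not the resolution adopted here) is to keep TensorFlow from pre-allocating the whole GPU so the backend's own allocations have headroom:

    # Hedged diagnostic sketch (assumption, not the fix adopted in this PR):
    # stop TensorFlow from grabbing all GPU memory up front so the nGraph GPU
    # backend's cudaMalloc calls have room.
    import tensorflow as tf

    config = tf.compat.v1.ConfigProto()
    config.gpu_options.allow_growth = True                    # allocate lazily
    config.gpu_options.per_process_gpu_memory_fraction = 0.5  # cap TF's share

    with tf.compat.v1.Session(config=config) as sess:
        print(sess.run(tf.constant(1.0)))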

@sayantan-nervana removed the wip (Work in progress) label on Jun 6, 2019
@avijit-nervana (Contributor) left a comment:
Great job!

@avijit-nervana merged commit ab6b999 into master on Jun 6, 2019
@avijit-nervana deleted the sarkars/upgrade_tf1.14 branch on June 6, 2019 at 04:54
gopoka pushed a commit that referenced this pull request Oct 28, 2019