@saudet
Contributor

@saudet saudet commented Nov 9, 2021

Builds fine and all tests pass for me on Linux.

I've also updated the proto patch and run the java_api_import tool, which generated a couple of things without prompting...

@karllessard
Collaborator

Thanks a lot @saudet, I'll rebase it on the new master branch and take care of classifying the new ops. It looks like java_api_import got confused because the new ops are located in raw_ops in Python, so it applied the same package in Java. Since all our ops in Java are "raw", we don't want that... I might need to update the tool as well to prevent it.

@rnett
Contributor

rnett commented Nov 11, 2021

Patches from custom gradients are going to need to be looked at too, I think one or two of my core PRs made it into this release.

@karllessard
Collaborator

Yes, no problem @rnett, I noticed that and have already fixed it.

@karllessard
Collaborator

When building on my Mac, I get this garbled code added at the end of the JavaCPP-generated class NativeStatus:

  // Iterates over the stored payloads and calls the
  // `visitor(type_key, payload)` callable for each one.
  //
  // The order of calls to `visitor()` is not specified and may change at
  // any time and any mutation on the same Status object during visitation is
  // forbidden and could result in undefined behavior.
  public native void ForEachPayload(
        @Const @ByRef std::function<void(tensorflow::StringPiece,tensorflow::StringPiece)> visitor);

I guess if you did not encounter this issue before, @saudet, then maybe this mapping was added by @rnett in his last custom gradient PR, which has just been merged?

But anyway, what would be the right way to fix this? Right now, there's only this in the TF presets:

.put(new Info("tensorflow::Status").javaNames("NativeStatus").purify())

@karllessard
Collaborator

Ah found it, by adding it to the skip list...
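
For readers unfamiliar with JavaCPP presets, a skip entry of this kind might look roughly like the sketch below. The exact entry added in this PR is not shown in the thread, so this is only an assumption: the class name `StatusSkipSketch` is hypothetical, and the qualified name `tensorflow::Status::ForEachPayload` is inferred from the garbled output above. It relies on the JavaCPP `Info`/`InfoMap` API, so JavaCPP must be on the classpath.

```java
import org.bytedeco.javacpp.tools.Info;
import org.bytedeco.javacpp.tools.InfoMap;
import org.bytedeco.javacpp.tools.InfoMapper;

// Hypothetical class, for illustration only; the real mappings live in the TF presets.
public class StatusSkipSketch implements InfoMapper {
    @Override
    public void map(InfoMap infoMap) {
        infoMap
            // Existing mapping quoted in the thread.
            .put(new Info("tensorflow::Status").javaNames("NativeStatus").purify())
            // Assumed skip entry: tells the parser not to generate a binding for
            // the method whose std::function parameter could not be mapped.
            .put(new Info("tensorflow::Status::ForEachPayload").skip());
    }
}
```

The same pattern would presumably apply to any other newly added C++ member that the parser cannot translate.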

@rnett
Contributor

rnett commented Nov 12, 2021

Yeah, my guess is that was added in tensorflow core this release and as such wasn't on our ignore list.

@karllessard karllessard force-pushed the upgrade-tensorflow-270 branch from 0ba57ac to 5b58c58 Compare November 12, 2021 14:17
@karllessard karllessard added the CI build Triggers a full native build on a pull request label Nov 12, 2021
@karllessard
Collaborator

@saudet, I've pushed an updated version into your branch, rebased on the latest changes, and I've reclassified the ops a little bit (it looks like java_api_import is not doing a good job right now, maybe something has changed in TF, I need to check...)

I'm having trouble building locally, for various strange reasons including timeouts, so I'm giving it a try on the CI build instead.

@karllessard
Collaborator

Ok, I'm not too sure what to do here. I'm also getting strange errors in the CI build, but they seem related to the Bazel cache timing out: https://github.com/tensorflow/java/runs/4194390356?check_suite_focus=true

I'll give it one more run, but if someone can check out that branch locally and confirm again that it builds successfully for them, I would be more confident merging it.

@rnett
Contributor

rnett commented Nov 13, 2021

I'll run it tonight and see.

@karllessard
Collaborator

Oh, I just found one source of confusion. One of your patches, @rnett, is no longer required since they applied the same change in 2.7. I initially thought you had reversed it... Let's try again without it now.

@karllessard
Collaborator

> Patches from custom gradients are going to need to be looked at too, I think one or two of my core PRs made it into this release.

Ok now I understand what you meant by this :D Sorry!

@rnett
Contributor

rnett commented Nov 13, 2021

Here's what I got, likely because of the patch:

[7,549 / 9,916] Compiling org_tensorflow/tensorflow/compiler/mlir/tensorflow/ir/tf_ops_a_m.cc; 2070s local, remote-cache ... (9 actions, 6 running)
[7,571 / 9,916] Compiling org_tensorflow/tensorflow/compiler/mlir/tensorflow/ir/tf_ops_a_m.cc; 2911s local, remote-cache ... (16 actions running)
WARNING: Reading from Remote Cache:
io.netty.handler.ssl.SslHandshakeTimeoutException: handshake timed out after 10000ms
	at io.netty.handler.ssl.SslHandler$5.run(SslHandler.java:2017)
	at io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98)
	at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:170)
	at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
	at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:500)
	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.base/java.lang.Thread.run(Unknown Source)
[7,572 / 9,916] Compiling org_tensorflow/tensorflow/compiler/mlir/tensorflow/ir/tf_ops_a_m.cc; 3821s local, remote-cache ... (16 actions running)
[7,572 / 9,916] Compiling org_tensorflow/tensorflow/compiler/mlir/tensorflow/ir/tf_ops_a_m.cc; 4876s local, remote-cache ... (16 actions running)
[7,572 / 9,916] Compiling org_tensorflow/tensorflow/compiler/mlir/tensorflow/ir/tf_ops_a_m.cc; 6050s local, remote-cache ... (16 actions running)
[7,572 / 9,916] Compiling org_tensorflow/tensorflow/compiler/mlir/tensorflow/ir/tf_ops_a_m.cc; 7398s local, remote-cache ... (16 actions running)
[7,572 / 9,916] Compiling org_tensorflow/tensorflow/compiler/mlir/tensorflow/ir/tf_ops_a_m.cc; 8939s local, remote-cache ... (16 actions running)
[7,572 / 9,916] Compiling org_tensorflow/tensorflow/compiler/mlir/tensorflow/ir/tf_ops_a_m.cc; 10718s local, remote-cache ... (16 actions running)
ERROR: /root/.cache/bazel/_bazel_root/06374913dc8363bfff1f77c5f1edf456/external/org_tensorflow/tensorflow/compiler/mlir/tensorflow/BUILD:572:11: C++ compilation of rule '@org_tensorflow//tensorflow/compiler/mlir/tensorflow:tensorflow_ops' failed (Exit 4): gcc failed: error executing command 
  (cd /root/.cache/bazel/_bazel_root/06374913dc8363bfff1f77c5f1edf456/execroot/tensorflow_core_api && \
  exec env - \
    LD_LIBRARY_PATH=/opt/rh/httpd24/root/usr/lib64:/opt/rh/devtoolset-7/root/usr/lib64:/opt/rh/devtoolset-7/root/usr/lib:/opt/rh/devtoolset-7/root/usr/lib64/dyninst:/opt/rh/devtoolset-7/root/usr/lib/dyninst:/opt/rh/devtoolset-7/root/usr/lib64:/opt/rh/devtoolset-7/root/usr/lib \
    PATH=/opt/rh/rh-git218/root/usr/bin:/opt/rh/devtoolset-7/root/usr/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin \
    PWD=/proc/self/cwd \
    TF2_BEHAVIOR=1 \
  /opt/rh/devtoolset-7/root/usr/bin/gcc -U_FORTIFY_SOURCE -fstack-protector -Wall -Wunused-but-set-parameter -Wno-free-nonheap-object -fno-omit-frame-pointer -g0 -O2 '-D_FORTIFY_SOURCE=1' -DNDEBUG -ffunction-sections -fdata-sections '-std=c++11' -MD -MF bazel-out/k8-opt/bin/external/org_tensorflow/tensorflow/compiler/mlir/tensorflow/_objs/tensorflow_ops/tf_ops.pic.d '-frandom-seed=bazel-out/k8-opt/bin/external/org_tensorflow/tensorflow/compiler/mlir/tensorflow/_objs/tensorflow_ops/tf_ops.pic.o' -fPIC '-DLLVM_ON_UNIX=1' '-DHAVE_BACKTRACE=1' '-DBACKTRACE_HEADER=<execinfo.h>' '-DLTDL_SHLIB_EXT=".so"' '-DLLVM_PLUGIN_EXT=".so"' '-DLLVM_ENABLE_THREADS=1' '-DHAVE_SYSEXITS_H=1' '-DHAVE_UNISTD_H=1' '-DHAVE_STRERROR_R=1' '-DHAVE_LIBPTHREAD=1' '-DHAVE_PTHREAD_GETNAME_NP=1' '-DHAVE_PTHREAD_SETNAME_NP=1' '-DHAVE_PTHREAD_GETSPECIFIC=1' '-DHAVE_REGISTER_FRAME=1' '-DHAVE_DEREGISTER_FRAME=1' -D_GNU_SOURCE '-DHAVE_LINK_H=1' '-DHAVE_LSEEK64=1' '-DHAVE_MALLINFO=1' '-DHAVE_POSIX_FALLOCATE=1' '-DHAVE_SBRK=1' '-DHAVE_STRUCT_STAT_ST_MTIM_TV_NSEC=1' '-DLLVM_NATIVE_ARCH="X86"' '-DLLVM_NATIVE_ASMPARSER=LLVMInitializeX86AsmParser' '-DLLVM_NATIVE_ASMPRINTER=LLVMInitializeX86AsmPrinter' '-DLLVM_NATIVE_DISASSEMBLER=LLVMInitializeX86Disassembler' '-DLLVM_NATIVE_TARGET=LLVMInitializeX86Target' '-DLLVM_NATIVE_TARGETINFO=LLVMInitializeX86TargetInfo' '-DLLVM_NATIVE_TARGETMC=LLVMInitializeX86TargetMC' '-DLLVM_NATIVE_TARGETMCA=LLVMInitializeX86TargetMCA' '-DLLVM_HOST_TRIPLE="x86_64-unknown-linux-gnu"' '-DLLVM_DEFAULT_TARGET_TRIPLE="x86_64-unknown-linux-gnu"' -D__STDC_LIMIT_MACROS -D__STDC_CONSTANT_MACROS -D__STDC_FORMAT_MACROS -DEIGEN_MPL2_ONLY '-DEIGEN_MAX_ALIGN_BYTES=64' -iquoteexternal/org_tensorflow -iquotebazel-out/k8-opt/bin/external/org_tensorflow -iquoteexternal/llvm-project -iquotebazel-out/k8-opt/bin/external/llvm-project -iquoteexternal/llvm_terminfo -iquotebazel-out/k8-opt/bin/external/llvm_terminfo -iquoteexternal/llvm_zlib -iquotebazel-out/k8-opt/bin/external/llvm_zlib 
-iquoteexternal/com_google_absl -iquotebazel-out/k8-opt/bin/external/com_google_absl -iquoteexternal/nsync -iquotebazel-out/k8-opt/bin/external/nsync -iquoteexternal/eigen_archive -iquotebazel-out/k8-opt/bin/external/eigen_archive -iquoteexternal/gif -iquotebazel-out/k8-opt/bin/external/gif -iquoteexternal/libjpeg_turbo -iquotebazel-out/k8-opt/bin/external/libjpeg_turbo -iquoteexternal/com_google_protobuf -iquotebazel-out/k8-opt/bin/external/com_google_protobuf -iquoteexternal/com_googlesource_code_re2 -iquotebazel-out/k8-opt/bin/external/com_googlesource_code_re2 -iquoteexternal/farmhash_archive -iquotebazel-out/k8-opt/bin/external/farmhash_archive -iquoteexternal/fft2d -iquotebazel-out/k8-opt/bin/external/fft2d -iquoteexternal/highwayhash -iquotebazel-out/k8-opt/bin/external/highwayhash -iquoteexternal/zlib -iquotebazel-out/k8-opt/bin/external/zlib -iquoteexternal/double_conversion -iquotebazel-out/k8-opt/bin/external/double_conversion -Ibazel-out/k8-opt/bin/external/llvm-project/mlir/_virtual_includes/BuiltinAttributeInterfacesIncGen -Ibazel-out/k8-opt/bin/external/llvm-project/mlir/_virtual_includes/BuiltinAttributesIncGen -Ibazel-out/k8-opt/bin/external/llvm-project/mlir/_virtual_includes/BuiltinDialectIncGen -Ibazel-out/k8-opt/bin/external/llvm-project/mlir/_virtual_includes/BuiltinLocationAttributesIncGen -Ibazel-out/k8-opt/bin/external/llvm-project/mlir/_virtual_includes/BuiltinOpsIncGen -Ibazel-out/k8-opt/bin/external/llvm-project/mlir/_virtual_includes/BuiltinTypeInterfacesIncGen -Ibazel-out/k8-opt/bin/external/llvm-project/mlir/_virtual_includes/BuiltinTypesIncGen -Ibazel-out/k8-opt/bin/external/llvm-project/mlir/_virtual_includes/CallOpInterfacesIncGen -Ibazel-out/k8-opt/bin/external/llvm-project/mlir/_virtual_includes/CastOpInterfacesIncGen -Ibazel-out/k8-opt/bin/external/llvm-project/mlir/_virtual_includes/InferTypeOpInterfaceIncGen -Ibazel-out/k8-opt/bin/external/llvm-project/mlir/_virtual_includes/OpAsmInterfaceIncGen 
-Ibazel-out/k8-opt/bin/external/llvm-project/mlir/_virtual_includes/RegionKindInterfaceIncGen -Ibazel-out/k8-opt/bin/external/llvm-project/mlir/_virtual_includes/SideEffectInterfacesIncGen -Ibazel-out/k8-opt/bin/external/llvm-project/mlir/_virtual_includes/SubElementInterfacesIncGen -Ibazel-out/k8-opt/bin/external/llvm-project/mlir/_virtual_includes/SymbolInterfacesIncGen -Ibazel-out/k8-opt/bin/external/llvm-project/mlir/_virtual_includes/TensorEncodingIncGen -Ibazel-out/k8-opt/bin/external/llvm-project/mlir/_virtual_includes/ParserTokenKinds -Ibazel-out/k8-opt/bin/external/llvm-project/mlir/_virtual_includes/ControlFlowInterfacesIncGen -Ibazel-out/k8-opt/bin/external/llvm-project/mlir/_virtual_includes/DerivedAttributeOpInterfaceIncGen -Ibazel-out/k8-opt/bin/external/llvm-project/mlir/_virtual_includes/LoopLikeInterfaceIncGen -Ibazel-out/k8-opt/bin/external/llvm-project/mlir/_virtual_includes/StandardOpsIncGen -Ibazel-out/k8-opt/bin/external/llvm-project/mlir/_virtual_includes/VectorInterfacesIncGen -isystem external/llvm-project/llvm/include -isystem bazel-out/k8-opt/bin/external/llvm-project/llvm/include -isystem external/llvm-project/mlir/include -isystem bazel-out/k8-opt/bin/external/llvm-project/mlir/include -isystem external/nsync/public -isystem bazel-out/k8-opt/bin/external/nsync/public -isystem external/org_tensorflow/third_party/eigen3/mkl_include -isystem bazel-out/k8-opt/bin/external/org_tensorflow/third_party/eigen3/mkl_include -isystem external/eigen_archive -isystem bazel-out/k8-opt/bin/external/eigen_archive -isystem external/gif -isystem bazel-out/k8-opt/bin/external/gif -isystem external/com_google_protobuf/src -isystem bazel-out/k8-opt/bin/external/com_google_protobuf/src -isystem external/farmhash_archive/src -isystem bazel-out/k8-opt/bin/external/farmhash_archive/src -isystem external/zlib -isystem bazel-out/k8-opt/bin/external/zlib -isystem external/double_conversion -isystem bazel-out/k8-opt/bin/external/double_conversion -w 
-DAUTOLOAD_DYNAMIC_KERNELS -msse4.1 -msse4.2 -mavx '-std=c++14' -fno-canonical-system-headers -Wno-builtin-macro-redefined '-D__DATE__="redacted"' '-D__TIMESTAMP__="redacted"' '-D__TIME__="redacted"' -c external/org_tensorflow/tensorflow/compiler/mlir/tensorflow/ir/tf_ops.cc -o bazel-out/k8-opt/bin/external/org_tensorflow/tensorflow/compiler/mlir/tensorflow/_objs/tensorflow_ops/tf_ops.pic.o)
Execution platform: @local_execution_config_platform//:platform
gcc: internal compiler error: Killed (program cc1plus)
Please submit a full bug report,
with preprocessed source if appropriate.
See <http://bugzilla.redhat.com/bugzilla> for instructions.
INFO: Elapsed time: 14744.708s, Critical Path: 11721.38s
INFO: 7582 processes: 298 remote cache hit, 614 internal, 6670 local.
FAILED: Build did NOT complete successfully
FAILED: Build did NOT complete successfully

Trying again.

@rnett
Contributor

rnett commented Nov 14, 2021

The bazel build is working for me now, but I get javacpp compilation errors. Likely more missing skips for new stuff.

Errors: https://gist.github.com/rnett/c7048d0efc355e2a535946de92f2d073

@karllessard
Collaborator

@saudet can you please take a look at this? https://github.com/tensorflow/java/runs/4199332671?check_suite_focus=true#step:4:1870

The Windows build also failed and Linux timed out... I'll relaunch it tomorrow morning to give you some time to review these errors, thanks.

@karllessard
Collaborator

> The bazel build is working for me now, but I get javacpp compilation errors. Likely more missing skips for new stuff.
>
> Errors: https://gist.github.com/rnett/c7048d0efc355e2a535946de92f2d073

Ok, it's probably stuff that came with your custom gradient PR since Samuel was able to do a full build before rebasing. Can you please take a look at it?

@saudet
Contributor Author

saudet commented Nov 16, 2021

Yes, yes, working on it now... I'm having a weird issue with Maven not recompiling properly, which is annoying, so I'm debugging that first before pushing.

@rnett
Contributor

rnett commented Nov 16, 2021

Some probably are, but I don't think it's just that, look at this one:

/root/.cache/bazel/_bazel_root/7b134006cf25080eddce07c921f1114d/external/org_tensorflow/tensorflow/core/framework/tensor_types.h:176:25: error: use of 'auto' in lambda parameter declaration only available with -std=c++14 or -std=gnu++14
     auto all = [](const auto&... bool_vals) {
                         ^~~~

That's not even from our code, it's from tensorflow core.

@saudet
Contributor Author

saudet commented Nov 16, 2021

It looks like we can't rely on the order of the plugins when overriding their configurations like with maven-compiler-plugin:
https://stackoverflow.com/questions/22150209/maven-changes-the-order-of-plugins-of-different-profiles
I'll add a workaround for that too..

@rnett
Contributor

rnett commented Nov 16, 2021

I can't commit to the branch, but tensorflow::Node::set_original_func_names should be skipped. Not sure if that will resolve everything but that should help.

@rnett
Contributor

rnett commented Nov 16, 2021

I'm also not seeing the TF_FinishOperationLocked functions from custom-grad-helpers.patch in tensorflow.java, although I see them in tensorflow core.

Also fix execution order plugins for javacpp-parser
@saudet
Contributor Author

saudet commented Nov 16, 2021

Ok, everything fixed! I think. It's working for me anyway. :)

@karllessard
Collaborator

karllessard commented Nov 17, 2021

Good job @saudet , looks like Mac and Linux are happy now. Windows on the other hand...

ERROR: An error occurred during the fetch of repository 'llvm-project':
   Traceback (most recent call last):
	File "C:/tmp/_bazel_runneradmin/mg3pkz7r/external/llvm-raw/utils/bazel/configure.bzl", line 73, column 25, in _llvm_configure_impl
		_overlay_directories(repository_ctx)
	File "C:/tmp/_bazel_runneradmin/mg3pkz7r/external/llvm-raw/utils/bazel/configure.bzl", line 62, column 13, in _overlay_directories
		fail(("Failed to execute overlay script: '{cmd}'\n" +
Error in fail: Failed to execute overlay script: 'C:/msys64/usr/bin/python3.exe C:/tmp/_bazel_runneradmin/mg3pkz7r/external/llvm-raw/utils/bazel/overlay_directories.py --src C:/tmp/_bazel_runneradmin/mg3pkz7r/external/llvm-raw --overlay C:/tmp/_bazel_runneradmin/mg3pkz7r/external/llvm-raw/utils/bazel/llvm-project-overlay --target .'
Exited with code 2
stdout:

stderr:
/usr/bin/python3: can't open file '/c/tmp/_bazel_runneradmin/mg3pkz7r/external/llvm-project/C:/tmp/_bazel_runneradmin/mg3pkz7r/external/llvm-raw/utils/bazel/overlay_directories.py': [Errno 2] No such file or directory

This looks like a strange path... I don't even know where to start looking. WSL again? Looks like the /usr/bin/python3 command cannot interpret "C:/tmp/..." as an absolute path.
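
To illustrate the diagnosis (not the fix that was eventually applied, which the thread does not show), here is a minimal self-contained sketch. It assumes a POSIX-style rule where a path is absolute only if it starts with `/`, which is why `C:/tmp/...` ends up appended to the working directory in the error above. The class and method names are hypothetical.

```java
public class MsysPathSketch {
    // POSIX-style rule: only a leading '/' makes a path absolute.
    public static boolean isPosixAbsolute(String path) {
        return path.startsWith("/");
    }

    // Rewrites a Windows drive path such as "C:/tmp/x" or "C:\tmp\x" as the
    // MSYS-style "/c/tmp/x". Paths without a drive prefix only get their
    // backslashes normalized.
    public static String toMsysPath(String path) {
        String p = path.replace('\\', '/');
        if (p.length() >= 2 && Character.isLetter(p.charAt(0)) && p.charAt(1) == ':') {
            return "/" + Character.toLowerCase(p.charAt(0)) + p.substring(2);
        }
        return p;
    }

    public static void main(String[] args) {
        // Script path taken from the error message above.
        String script =
            "C:/tmp/_bazel_runneradmin/mg3pkz7r/external/llvm-raw/utils/bazel/overlay_directories.py";
        System.out.println(isPosixAbsolute(script));             // prints false
        System.out.println(isPosixAbsolute(toMsysPath(script))); // prints true
        System.out.println(toMsysPath(script));
    }
}
```

Under that assumption, the MSYS `python3` sees the drive-letter form as a relative path, which matches the doubled path in the stderr output.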

Collaborator

@karllessard karllessard left a comment

Ok, CI build looks promising, let's give it a shot, thanks @saudet !

@karllessard karllessard merged commit 910b337 into tensorflow:master Nov 17, 2021
@karllessard
Collaborator

@saudet, all native builds pass except the Linux-GPU one. Other than a bunch of cache timeouts, we're getting this error: https://github.com/tensorflow/java/runs/4252113624?check_suite_focus=true#step:5:4429

Any idea how to solve this?

@saudet
Contributor Author

saudet commented Nov 18, 2021 via email

@karllessard
Collaborator

Trying with 9 right now…

@karllessard
Collaborator

Great, upgrading to devtoolset-9 did the trick, we’re done here!
