1.15.2: _pywrap_tensorflow_internal.so differs between builds #37997

Closed
bmwiedemann opened this issue Mar 28, 2020 · 9 comments
Labels
subtype: ubuntu/linux Ubuntu/Linux Build/Installation Issues TF 1.15 for issues seen on TF 1.15 type:build/install Build and install issues

Comments

bmwiedemann commented Mar 28, 2020

System information

  • OS Platform and Distribution: openSUSE-Tumbleweed-20200324
  • TensorFlow installed from (source or binary): source
  • TensorFlow version: 1.15.2
  • Python version: 3.8
  • Installed using virtualenv? pip? conda?: built with rpm spec
  • Bazel version (if compiling from source): 0.24
  • GCC/Compiler version (if compiling from source): gcc9 and gcc10
  • CUDA/cuDNN version: -
  • GPU model and memory: -

Describe the problem

While working on reproducible builds for openSUSE, I found that
our tensorflow-1.15.2 package varied across builds.

See https://reproducible-builds.org/ for why this matters.

The variations do not occur when disabling ASLR for the build.
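
The issue does not say how ASLR was disabled for the build. As a minimal sketch, assuming a Linux host with `setarch` from util-linux, a build command can be wrapped so that it runs with address randomization turned off. The helper names `no_aslr_command` and `run_without_aslr` are hypothetical and only serve this illustration.

```python
import platform
import subprocess


def no_aslr_command(command):
    """Prefix a command with `setarch <arch> -R` (util-linux).

    -R is the short form of --addr-no-randomize, which disables
    address space layout randomization for the started process.
    """
    return ["setarch", platform.machine(), "-R", *command]


def run_without_aslr(command):
    """Run the command with ASLR disabled; raises if it exits non-zero."""
    return subprocess.run(no_aslr_command(command), check=True)
```

For example, `run_without_aslr(["osc", "build", "--noservice", "--keep-pkg=RPMS"])` would wrap the build call quoted in this report. Whether the setting reaches the compiler depends on how the build environment is started (an assumption here: a build inside a virtual machine would not inherit it).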

The previous version, 1.13.2, built with python-3.7, still built reproducibly.

Provide the exact sequence of commands / steps that you executed before running into the problem

build tensorflow twice from scratch:
osc checkout openSUSE:Factory/tensorflow && cd $_
osc build --noservice --keep-pkg=RPMS

and compare resulting _pywrap_tensorflow_internal.so content
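
The comparison step can be sketched as follows. This is only an illustration, not the tooling actually used for this report: it hashes every file under two build trees with SHA-256 and lists the relative paths whose content differs or that exist on one side only. The function names are hypothetical.

```python
import hashlib
from pathlib import Path


def file_digests(root):
    """Map each file path (relative to root) to its SHA-256 hex digest."""
    root = Path(root)
    digests = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            rel = path.relative_to(root).as_posix()
            digests[rel] = hashlib.sha256(path.read_bytes()).hexdigest()
    return digests


def differing_files(tree_a, tree_b):
    """Return sorted relative paths that differ or exist in only one tree."""
    a = file_digests(tree_a)
    b = file_digests(tree_b)
    return sorted(p for p in a.keys() | b.keys() if a.get(p) != b.get(p))
```

Called on the unpacked contents of the two resulting RPMs, a reproducible build would return an empty list. Here the list would be expected to contain `_pywrap_tensorflow_internal.so`.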

Any other info / logs

/usr/lib64/python3.8/site-packages/tensorflow_core/python/_pywrap_tensorflow_internal.so differs in assembler output

By building without Link Time Optimization (LTO), I could see that exactly one .o file differed between the two build environments:
Binary files /var/tmp/build-root.10/.mount/home/abuild/rpmbuild/SOURCES/BAZEL/_bazel_abuild/089fd2236bcbfcbcf994cdf39cd6bcb6/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/compiler/mlir/lite/_objs/tensorflow_lite_legalize_tf/prepare_tf.pic.o and /var/tmp/build-root.10b/.mount/home/abuild/rpmbuild/SOURCES/BAZEL/_bazel_abuild/089fd2236bcbfcbcf994cdf39cd6bcb6/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/compiler/mlir/lite/_objs/tensorflow_lite_legalize_tf/prepare_tf.pic.o differ

The asm diff also contained:

  •   lea    offset(%rip),%rsi        #   <_ZTSZN4mlir16PassRegistrationINS_3TFL12_GLOBAL__N_129PrepareCompositeFunctionsPassEEC4EN4llvm9StringRefES6_EUlvE_ + ofs>
    

that comes from
tensorflow-1.15.2/tensorflow/compiler/mlir/lite/transforms/prepare_composite_functions_tf.cc
which is very close to the prepare_tf.cc file used to create the differing .o file

It is possible that the nondeterminism comes from within gcc9 and gcc10, triggered by some special feature used in prepare_tf.cc, but to prove that I would need a preprocessed version of that compilation. Due to the size and complexity of the build process, I have not managed to get that yet.

@bmwiedemann bmwiedemann added the type:build/install Build and install issues label Mar 28, 2020
@bmwiedemann
Author

extra info: our openSUSE tensorflow2-2.1.0 package is also not affected

@amahendrakar amahendrakar added subtype: ubuntu/linux Ubuntu/Linux Build/Installation Issues TF 1.15 for issues seen on TF 1.15 labels Mar 30, 2020
@amahendrakar amahendrakar assigned ymodak and unassigned amahendrakar Mar 30, 2020
@mihaimaruseac mihaimaruseac self-assigned this Mar 30, 2020
@ymodak ymodak removed their assignment Jun 1, 2020
@bmwiedemann
Author

tensorflow-1.15.3 still has these variations in tensorflow_lite_legalize_tf/prepare_tf.pic.o

@bmwiedemann
Author

Looking at the diff of the build trees again, I found this difference:

+++ org_tensorflow/bazel-out/k8-opt/genfiles/tensorflow/compiler/mlir/lite/transforms/generated_prepare_tf.inc        
@@ -289,7 +289,7 @@
 */
 struct GeneratedConvert0 : public RewritePattern {
   GeneratedConvert0(MLIRContext *context)
-      : RewritePattern("tf.FusedBatchNorm", {"tf.Add", "tf.Mul", "tf.Rsqrt", "tf.Sub", "tf.Const"}, 1, context) {}
+      : RewritePattern("tf.FusedBatchNorm", {"tf.Const", "tf.Rsqrt", "tf.Add", "tf.Mul", "tf.Sub"}, 1, context) {}

generated by a bazel call of /bin/bash -c source external/bazel_tools/tools/genrule/genrule-setup.sh; bazel-out/host/bin/external/local_config_mlir/mlir-tblgen -I external/local_config_mlir/include -I external/org_tensorflow -I $(dirname tensorflow/compiler/mlir/lite/transforms/legalize_patterns.td) -gen-rewriters tensorflow/compiler/mlir/lite/transforms/legalize_patterns.td -o bazel-out/k8-opt/genfiles/tensorflow/compiler/mlir/lite/transforms/generated_legalize_tf.inc

mlir-tblgen comes from tensorflow-1.15.3/third_party/mlir/tools/mlir-tblgen/mlir-tblgen.cpp, so this is probably not a gcc bug.
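
The two generated lines in the diff contain the same op names and differ only in their order. The sketch below checks that and shows one hypothetical way a generator could make such output stable: sorting the names before emitting them. It assumes that the order comes from iterating a container whose order depends on memory addresses. That would fit the ASLR observation, but it is not confirmed anywhere in this thread, and `emit_sorted` is not code from mlir-tblgen.

```python
import re

# The two variants of the generated line, copied from the diff above.
OLD = ('      : RewritePattern("tf.FusedBatchNorm", '
       '{"tf.Add", "tf.Mul", "tf.Rsqrt", "tf.Sub", "tf.Const"}, 1, context) {}')
NEW = ('      : RewritePattern("tf.FusedBatchNorm", '
       '{"tf.Const", "tf.Rsqrt", "tf.Add", "tf.Mul", "tf.Sub"}, 1, context) {}')


def generated_ops(line):
    """Extract the op names listed inside the first {...} of the line."""
    braces = re.search(r"\{([^}]*)\}", line)
    return re.findall(r'"([^"]+)"', braces.group(1))


def emit_sorted(ops):
    """Hypothetical fix: emit the names in sorted order, so the output
    no longer depends on the iteration order of the source container."""
    return "{" + ", ".join(f'"{op}"' for op in sorted(ops)) + "}"
```

With this, `set(generated_ops(OLD)) == set(generated_ops(NEW))` holds while the two lists themselves differ, and `emit_sorted` produces the same text for both. Sorting is only safe if the consumer of the list does not rely on its order, which is assumed here and would need checking.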

@mihaimaruseac
Collaborator

Can you check against master please? We won't be able to fix this on patch releases since it seems to be a significant amount of change.

@bmwiedemann
Author

There seems to be a large difference between v1.15.3 and master, and the master branch lacks the third_party/mlir/tools/mlir-tblgen dir. I have trouble building that as a package.
Was mlir-tblgen dropped? Or does some magic only pull it in at release time?

@mihaimaruseac
Collaborator

third_party/mlir is now moved to be a part of LLVM. The location inside TF repo was only temporary.

It is expected that there are differences between master and r1.15, but the question is whether the current builds are reproducible or whether we should work towards making them so (likely). Since we cannot add major changes to release branches after final release, we know that r1.15 based builds are not reproducible.

@bmwiedemann
Author

I built mlir-tblgen from llvm master, and when omitting the --gen-rewriters param, it even created reproducible output. With that param, it failed:

llvm-project/build/bin/mlir-tblgen -I $(pwd)/inc -o generated_legalize_tf.inc --gen-rewriters legalize_patterns.td ; md5sum generated_legalize_tf.inc
Included from legalize_patterns.td:21:
SOMEPATH/inc/tensorflow/compiler/mlir/tensorflow/ir/tf_ops.td:66:1: error: Record `TF_ConstOp' does not have a field named `successors'!            

def TF_ConstOp : TF_Op<"Const", [NoSideEffect]> {
^

I tried to compare it to the mlir-tblgen from 1.15.3 but that coredumped on the same call.
But this one worked with --gen-rewriters and was also reproducible now, so something else is probably missing in this test setup.

So it seems I know too little to properly debug that.
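
The run-then-md5sum check above can be repeated automatically. The following sketch runs a generator command several times and compares a digest of its output file after each run. The function names are hypothetical, SHA-256 is used in place of md5sum, and the command and output path are whatever the caller passes in.

```python
import hashlib
import subprocess
from pathlib import Path


def run_and_hash(command, output_path):
    """Run the command, then return the SHA-256 digest of its output file."""
    subprocess.run(command, check=True)
    return hashlib.sha256(Path(output_path).read_bytes()).hexdigest()


def is_reproducible(command, output_path, runs=3):
    """True if every run leaves byte-identical content in output_path."""
    digests = {run_and_hash(command, output_path) for _ in range(runs)}
    return len(digests) == 1
```

A call shaped like the one above would pass the mlir-tblgen command as a list and `generated_legalize_tf.inc` as `output_path`. A handful of identical runs does not prove determinism. If the variation depends on memory layout, it may only show up occasionally or only with ASLR enabled.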

@tensorflowbutler
Member

Hi There,

We are checking to see if you still need help on this, as you are using an older version of tensorflow which is officially considered end of life. We recommend that you upgrade to the latest 2.x version and let us know if the issue still persists in newer versions. Please open a new issue for any help you need against 2.x, and we will get you the right help.

This issue will be closed automatically 7 days from now. If you still need help with this issue, please provide us with more information.

@google-ml-butler

Are you satisfied with the resolution of your issue?
