Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

scio-tensorflow tests crashing when cross building #1137

Closed
nevillelyh opened this issue May 1, 2018 · 8 comments
Closed

scio-tensorflow tests crashing when cross building #1137

nevillelyh opened this issue May 1, 2018 · 8 comments
Assignees
Labels
bug Something isn't working

Comments

@nevillelyh
Copy link
Contributor

sbt +scio-tensorflow/test crashes with something like:

2018-04-25 13:20:28.292948: F tensorflow/core/lib/monitoring/collection_registry.cc:77] Cannot register 2 metrics with the same name: /tensorflow/cc/saved_model/load_attempt_count
/usr/local/Cellar/sbt/1.1.4/libexec/bin/sbt-launch-lib.bash: line 58: 37046 Abort trap: 6           "$@"

#1003 (comment)

Adding Test / fork := true to scioTensorflow project fixes the issue locally but breaks Circle CI builds since now the containers are OOM due to the extra JVM.

@nevillelyh nevillelyh added the bug Something isn't working label May 1, 2018
jto pushed a commit that referenced this issue May 2, 2018
@jto
Copy link
Contributor

jto commented May 8, 2018

Comes from https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/lib/monitoring/collection_registry.cc#L75-L79 which suggest sbt is somehow loading the library once for the two versions but somehow is initializing it twice.

@jto
Copy link
Contributor

jto commented May 8, 2018

Running the same test twice in the same sbt session also crashes:

➜  scio git:(julient/scalac-lint) sbt
[info] Loading settings from idea.sbt,plugins.sbt ...
[info] Loading global plugins from /Users/julient/.sbt/1.0/plugins
[info] Loading settings from plugins.sbt ...
[info] Loading project definition from /Users/julient/Documents/Spotify/scio/project
[info] Loading settings from version.sbt,apache.sbt,build.sbt ...
[info] Resolving key references (16753 settings) ...
[info] Set current project to scio (in build file:/Users/julient/Documents/Spotify/scio/)
[warn] sbt server could not start because there's another instance of sbt running on this build.
[warn] Running multiple instances is unsupported
sbt:scio> project scio-tensorflow
[info] Set current project to scio-tensorflow (in build file:/Users/julient/Documents/Spotify/scio/)
sbt:scio-tensorflow> testOnly com.spotify.scio.tensorflow.TensorflowSpec
2018-05-08 18:48:39.591399: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.2 AVX AVX2 FMA
[pool-7-thread-2-ScalaTest-running-TensorflowSpec] WARN com.spotify.scio.VersionUtil$ - Using a SNAPSHOT version of Scio: 0.5.4-SNAPSHOT
[pool-7-thread-2-ScalaTest-running-TensorflowSpec] WARN org.apache.beam.sdk.Pipeline - The following transforms do not have stable unique names: main@{NativeMethodAccessorImpl.java:-2}1, main@{NativeMethodAccessorImpl.java:-2}13/main@{NativeMethodAccessorImpl.java:-2}1
[direct-runner-worker] INFO com.spotify.scio.tensorflow.PredictDoFn - Loading TensorFlow graph
[direct-runner-worker] INFO com.spotify.scio.tensorflow.PredictDoFn - Loading TensorFlow graph
[direct-runner-worker] INFO com.spotify.scio.tensorflow.PredictDoFn - Loading TensorFlow graph
[direct-runner-worker] INFO com.spotify.scio.tensorflow.PredictDoFn - Loading TensorFlow graph
[direct-runner-worker] INFO com.spotify.scio.tensorflow.PredictDoFn - Loading TensorFlow graph
[direct-runner-worker] INFO com.spotify.scio.tensorflow.PredictDoFn - Loading TensorFlow graph
[direct-runner-worker] INFO com.spotify.scio.tensorflow.PredictDoFn - Loading TensorFlow graph
[direct-runner-worker] INFO com.spotify.scio.tensorflow.PredictDoFn - Loading TensorFlow graph
[direct-runner-worker] INFO com.spotify.scio.tensorflow.PredictDoFn - TensorFlow graph loaded in 0 ms
[direct-runner-worker] INFO com.spotify.scio.tensorflow.PredictDoFn - TensorFlow graph loaded in 0 ms
[direct-runner-worker] INFO com.spotify.scio.tensorflow.PredictDoFn - TensorFlow graph loaded in 0 ms
[direct-runner-worker] INFO com.spotify.scio.tensorflow.PredictDoFn - TensorFlow graph loaded in 0 ms
[direct-runner-worker] INFO com.spotify.scio.tensorflow.PredictDoFn - TensorFlow graph loaded in 0 ms
[direct-runner-worker] INFO com.spotify.scio.tensorflow.PredictDoFn - TensorFlow graph loaded in 0 ms
[direct-runner-worker] INFO com.spotify.scio.tensorflow.PredictDoFn - TensorFlow graph loaded in 0 ms
[direct-runner-worker] INFO com.spotify.scio.tensorflow.PredictDoFn - TensorFlow graph loaded in 0 ms
[direct-runner-worker] INFO com.spotify.scio.tensorflow.PredictDoFn - Closing down predict DoFn com.spotify.scio.tensorflow.PredictDoFn@664969ed
[direct-runner-worker] INFO com.spotify.scio.tensorflow.PredictDoFn - Closing down TensorFlow session org.tensorflow.Session@6fc22302
[direct-runner-worker] INFO com.spotify.scio.tensorflow.PredictDoFn - Closing down TensorFlow graph null
[direct-runner-worker] INFO com.spotify.scio.tensorflow.PredictDoFn - Closing down predict DoFn com.spotify.scio.tensorflow.PredictDoFn@19355730
[direct-runner-worker] INFO com.spotify.scio.tensorflow.PredictDoFn - Closing down TensorFlow session org.tensorflow.Session@743941b7
[direct-runner-worker] INFO com.spotify.scio.tensorflow.PredictDoFn - Closing down TensorFlow graph null
[direct-runner-worker] INFO com.spotify.scio.tensorflow.PredictDoFn - Closing down predict DoFn com.spotify.scio.tensorflow.PredictDoFn@591258dc
[direct-runner-worker] INFO com.spotify.scio.tensorflow.PredictDoFn - Closing down TensorFlow session org.tensorflow.Session@2736aacf
[direct-runner-worker] INFO com.spotify.scio.tensorflow.PredictDoFn - Closing down TensorFlow graph null
[direct-runner-worker] INFO com.spotify.scio.tensorflow.PredictDoFn - Closing down predict DoFn com.spotify.scio.tensorflow.PredictDoFn@1489f5ce
[direct-runner-worker] INFO com.spotify.scio.tensorflow.PredictDoFn - Closing down TensorFlow session org.tensorflow.Session@2a833a11
[direct-runner-worker] INFO com.spotify.scio.tensorflow.PredictDoFn - Closing down TensorFlow graph null
[direct-runner-worker] INFO com.spotify.scio.tensorflow.PredictDoFn - Closing down predict DoFn com.spotify.scio.tensorflow.PredictDoFn@611c5325
[direct-runner-worker] INFO com.spotify.scio.tensorflow.PredictDoFn - Closing down TensorFlow session org.tensorflow.Session@73eeed1b
[direct-runner-worker] INFO com.spotify.scio.tensorflow.PredictDoFn - Closing down TensorFlow graph null
[direct-runner-worker] INFO com.spotify.scio.tensorflow.PredictDoFn - Closing down predict DoFn com.spotify.scio.tensorflow.PredictDoFn@2892512d
[direct-runner-worker] INFO com.spotify.scio.tensorflow.PredictDoFn - Closing down TensorFlow session org.tensorflow.Session@1bf2a59c
[direct-runner-worker] INFO com.spotify.scio.tensorflow.PredictDoFn - Closing down TensorFlow graph null
[direct-runner-worker] INFO com.spotify.scio.tensorflow.PredictDoFn - Closing down predict DoFn com.spotify.scio.tensorflow.PredictDoFn@18f448bc
[direct-runner-worker] INFO com.spotify.scio.tensorflow.PredictDoFn - Closing down TensorFlow session org.tensorflow.Session@1cf6f88c
[direct-runner-worker] INFO com.spotify.scio.tensorflow.PredictDoFn - Closing down TensorFlow graph null
[direct-runner-worker] INFO com.spotify.scio.tensorflow.PredictDoFn - Closing down predict DoFn com.spotify.scio.tensorflow.PredictDoFn@26d7c906
[direct-runner-worker] INFO com.spotify.scio.tensorflow.PredictDoFn - Closing down TensorFlow session org.tensorflow.Session@28074605
[direct-runner-worker] INFO com.spotify.scio.tensorflow.PredictDoFn - Closing down TensorFlow graph null
[pool-7-thread-2-ScalaTest-running-TensorflowSpec] WARN com.spotify.scio.VersionUtil$ - Using a SNAPSHOT version of Scio: 0.5.4-SNAPSHOT
[pool-7-thread-2-ScalaTest-running-TensorflowSpec] WARN org.apache.beam.sdk.Pipeline - The following transforms do not have stable unique names: main@{NativeMethodAccessorImpl.java:-2}14/main@{NativeMethodAccessorImpl.java:-2}1, main@{NativeMethodAccessorImpl.java:-2}1
[direct-runner-worker] INFO com.spotify.scio.tensorflow.PredictDoFn - Loading TensorFlow graph
[direct-runner-worker] INFO com.spotify.scio.tensorflow.PredictDoFn - Loading TensorFlow graph
[direct-runner-worker] INFO com.spotify.scio.tensorflow.PredictDoFn - Loading TensorFlow graph
[direct-runner-worker] INFO com.spotify.scio.tensorflow.PredictDoFn - Loading TensorFlow graph
[direct-runner-worker] INFO com.spotify.scio.tensorflow.PredictDoFn - Loading TensorFlow graph
[direct-runner-worker] INFO com.spotify.scio.tensorflow.PredictDoFn - Loading TensorFlow graph
[direct-runner-worker] INFO com.spotify.scio.tensorflow.PredictDoFn - Loading TensorFlow graph
[direct-runner-worker] INFO com.spotify.scio.tensorflow.PredictDoFn - TensorFlow graph loaded in 0 ms
[direct-runner-worker] INFO com.spotify.scio.tensorflow.PredictDoFn - Loading TensorFlow graph
[direct-runner-worker] INFO com.spotify.scio.tensorflow.PredictDoFn - TensorFlow graph loaded in 1 ms
[direct-runner-worker] INFO com.spotify.scio.tensorflow.PredictDoFn - TensorFlow graph loaded in 1 ms
[direct-runner-worker] INFO com.spotify.scio.tensorflow.PredictDoFn - TensorFlow graph loaded in 0 ms
[direct-runner-worker] INFO com.spotify.scio.tensorflow.PredictDoFn - TensorFlow graph loaded in 1 ms
[direct-runner-worker] INFO com.spotify.scio.tensorflow.PredictDoFn - TensorFlow graph loaded in 1 ms
[direct-runner-worker] INFO com.spotify.scio.tensorflow.PredictDoFn - TensorFlow graph loaded in 1 ms
[direct-runner-worker] INFO com.spotify.scio.tensorflow.PredictDoFn - TensorFlow graph loaded in 1 ms
[direct-runner-worker] INFO com.spotify.scio.tensorflow.PredictDoFn - Closing down predict DoFn com.spotify.scio.tensorflow.PredictDoFn@60a69da2
[direct-runner-worker] INFO com.spotify.scio.tensorflow.PredictDoFn - Closing down TensorFlow session org.tensorflow.Session@345e14ff
[direct-runner-worker] INFO com.spotify.scio.tensorflow.PredictDoFn - Closing down TensorFlow graph null
[direct-runner-worker] INFO com.spotify.scio.tensorflow.PredictDoFn - Closing down predict DoFn com.spotify.scio.tensorflow.PredictDoFn@177b3c71
[direct-runner-worker] INFO com.spotify.scio.tensorflow.PredictDoFn - Closing down TensorFlow session org.tensorflow.Session@7690c1f3
[direct-runner-worker] INFO com.spotify.scio.tensorflow.PredictDoFn - Closing down TensorFlow graph null
[direct-runner-worker] INFO com.spotify.scio.tensorflow.PredictDoFn - Closing down predict DoFn com.spotify.scio.tensorflow.PredictDoFn@220f090e
[direct-runner-worker] INFO com.spotify.scio.tensorflow.PredictDoFn - Closing down TensorFlow session org.tensorflow.Session@6632085c
[direct-runner-worker] INFO com.spotify.scio.tensorflow.PredictDoFn - Closing down TensorFlow graph null
[direct-runner-worker] INFO com.spotify.scio.tensorflow.PredictDoFn - Closing down predict DoFn com.spotify.scio.tensorflow.PredictDoFn@2864285f
[direct-runner-worker] INFO c[info] TensorflowSpec:
om.spotify.scio.tensorflow.PredictDoFn - Closing down TensorFlow session org.tensorflow.Session@580bfbf9
[info] Tensorflow
[info] - should work for hello-wold
[info] - should work for serde model
[info] - should allow to predict
[direct-runner-worker] INFO com.spotify.scio.tensorflow.PredictDoFn - Closing down TensorFlow graph null
[info] - should allow to predict with 2 inputs
[direct-runner-worker] INFO com.spotify.scio.tensorflow.PredictDoFn - Closing down predict DoFn com.spotify.scio.tensorflow.PredictDoFn@296ce8f
[direct-runner-worker] INFO com.spotify.scio.tensorflow.PredictDoFn - Closing down TensorFlow session org.tensorflow.Session@74104669
[direct-runner-worker] INFO com.spotify.scio.tensorflow.PredictDoFn - Closing down TensorFlow graph null
[direct-runner-worker] INFO com.spotify.scio.tensorflow.PredictDoFn - Closing down predict DoFn com.spotify.scio.tensorflow.PredictDoFn@7dbfbe64
[direct-runner-worker] INFO com.spotify.scio.tensorflow.PredictDoFn - Closing down TensorFlow session org.tensorflow.Session@7a6c4619
[direct-runner-worker] INFO com.spotify.scio.tensorflow.PredictDoFn - Closing down TensorFlow graph null
[direct-runner-worker] INFO com.spotify.scio.tensorflow.PredictDoFn - Closing down predict DoFn com.spotify.scio.tensorflow.PredictDoFn@4eba0d6d
[direct-runner-worker] INFO com.spotify.scio.tensorflow.PredictDoFn - Closing down TensorFlow session org.tensorflow.Session@386f39dd
[direct-runner-worker] INFO com.spotify.scio.tensorflow.PredictDoFn - Closing down TensorFlow graph null
[direct-runner-worker] INFO com.spotify.scio.tensorflow.PredictDoFn - Closing down predict DoFn com.spotify.scio.tensorflow.PredictDoFn@3823ec2a
[direct-runner-worker] INFO com.spotify.scio.tensorflow.PredictDoFn - Closing down TensorFlow session org.tensorflow.Session@20e739b9
[direct-runner-worker] INFO com.spotify.scio.tensorflow.PredictDoFn - Closing down TensorFlow graph null
[info] ScalaTest
[info] Run completed in 6 seconds, 191 milliseconds.
[info] Total number of tests run: 4
[info] Suites: completed 1, aborted 0
[info] Tests: succeeded 4, failed 0, canceled 0, ignored 0, pending 0
[info] All tests passed.
[info] Passed: Total 4, Failed 0, Errors 0, Passed 4
[success] Total time: 11 s, completed May 8, 2018 6:48:45 PM
sbt:scio-tensorflow> testOnly com.spotify.scio.tensorflow.TensorflowSpec
2018-05-08 18:48:49.223958: F tensorflow/core/lib/monitoring/collection_registry.cc:77] Cannot register 2 metrics with the same name: /tensorflow/cc/saved_model/load_attempt_count
/usr/local/Cellar/sbt/1.1.4/libexec/bin/sbt-launch-lib.bash: line 58:  9925 Abort trap: 6           "$@"
➜  scio git:(julient/scalac-lint)

@jto
Copy link
Contributor

jto commented May 9, 2018

If you launch sbt with sbt -Dorg.tensorflow.NativeLibrary.DEBUG=true, you can get a bit of debug which shows that's the library is loaded for each run:

sbt:scio> project scio-tensorflow
[info] Set current project to scio-tensorflow (in build file:/Users/julient/Documents/Spotify/scio/)
sbt:scio-tensorflow> testOnly com.spotify.scio.tensorflow.TensorflowSpec
org.tensorflow.NativeLibrary: tryLoadLibraryFailed: no tensorflow_jni in java.library.path
org.tensorflow.NativeLibrary: jniResourceName: org/tensorflow/native/darwin-x86_64/libtensorflow_jni.dylib
org.tensorflow.NativeLibrary: frameworkResourceName: org/tensorflow/native/darwin-x86_64/libtensorflow_framework.so
org.tensorflow.NativeLibrary: extracting native library to: /private/var/folders/cj/ql5lswyn12v4v7hh10dpttm40000gp/T/tensorflow_native_libraries-1525851389874-0/libtensorflow_framework.so
org.tensorflow.NativeLibrary: copied 12019756 bytes to /private/var/folders/cj/ql5lswyn12v4v7hh10dpttm40000gp/T/tensorflow_native_libraries-1525851389874-0/libtensorflow_framework.so
org.tensorflow.NativeLibrary: extracting native library to: /private/var/folders/cj/ql5lswyn12v4v7hh10dpttm40000gp/T/tensorflow_native_libraries-1525851389874-0/libtensorflow_jni.dylib
org.tensorflow.NativeLibrary: copied 124911544 bytes to /private/var/folders/cj/ql5lswyn12v4v7hh10dpttm40000gp/T/tensorflow_native_libraries-1525851389874-0/libtensorflow_jni.dylib
org.tensorflow.NativeLibrary: isLoaded: true
org.tensorflow.NativeLibrary: isLoaded: true
2018-05-09 09:36:30.342604: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.2 AVX AVX2 FMA
...
[info] TensorflowSpec:
[info] Tensorflow
[info] - should work for hello-wold
[info] - should work for serde model
[info] - should allow to predict
[info] - should allow to predict with 2 inputs
[info] ScalaTest
[info] Run completed in 6 seconds, 277 milliseconds.
[info] Total number of tests run: 4
[info] Suites: completed 1, aborted 0
[info] Tests: succeeded 4, failed 0, canceled 0, ignored 0, pending 0
[info] All tests passed.
[info] Passed: Total 4, Failed 0, Errors 0, Passed 4
[success] Total time: 11 s, completed May 9, 2018 9:36:35 AM
sbt:scio-tensorflow> testOnly com.spotify.scio.tensorflow.TensorflowSpec
org.tensorflow.NativeLibrary: tryLoadLibraryFailed: no tensorflow_jni in java.library.path
org.tensorflow.NativeLibrary: jniResourceName: org/tensorflow/native/darwin-x86_64/libtensorflow_jni.dylib
org.tensorflow.NativeLibrary: frameworkResourceName: org/tensorflow/native/darwin-x86_64/libtensorflow_framework.so
org.tensorflow.NativeLibrary: extracting native library to: /private/var/folders/cj/ql5lswyn12v4v7hh10dpttm40000gp/T/tensorflow_native_libraries-1525851418026-0/libtensorflow_framework.so
org.tensorflow.NativeLibrary: copied 12019756 bytes to /private/var/folders/cj/ql5lswyn12v4v7hh10dpttm40000gp/T/tensorflow_native_libraries-1525851418026-0/libtensorflow_framework.so
org.tensorflow.NativeLibrary: extracting native library to: /private/var/folders/cj/ql5lswyn12v4v7hh10dpttm40000gp/T/tensorflow_native_libraries-1525851418026-0/libtensorflow_jni.dylib
org.tensorflow.NativeLibrary: copied 124911544 bytes to /private/var/folders/cj/ql5lswyn12v4v7hh10dpttm40000gp/T/tensorflow_native_libraries-1525851418026-0/libtensorflow_jni.dylib
2018-05-09 09:36:58.435486: F tensorflow/core/lib/monitoring/collection_registry.cc:77] Cannot register 2 metrics with the same name: /tensorflow/cc/saved_model/load_attempt_count
/usr/local/Cellar/sbt/1.1.4/libexec/bin/sbt-launch-lib.bash: line 58: 16418 Abort trap: 6           "$@"

I'm not sure what's going on because Tensorflow actually checks if the library is already loaded (https://github.com/tensorflow/tensorflow/blob/a9485a4b59477c722ed480baec9043d04cc25ea0/tensorflow/java/src/main/java/org/tensorflow/NativeLibrary.java#L48-L57).

It seems that the lib is properly unloaded between tests. Could be a bug in Tensorflow that makes it leak resources ? What do you think @nevillelyh ?

@lelayf
Copy link

lelayf commented May 9, 2018

I looked a bit at the C++ monitoring code, looks like registered_handle_ is reset where needed so a leak would come from a broader scope level. Shouldn't we expect some unload logic on the Java side though?

@jto
Copy link
Contributor

jto commented May 9, 2018

So I added the following setting in sbt:

testOptions in Test += Tests.Setup { _ =>
    def loadedLibs: Seq[String] = {
      val libs = classOf[ClassLoader].getDeclaredField("loadedLibraryNames")
      libs.setAccessible(true)
      import scala.collection.JavaConverters._
      libs.get(ClassLoader.getSystemClassLoader())
        .asInstanceOf[java.util.Vector[String]]
        .asScala
    }
    loadedLibs.foreach(println)
  },

Here's what happens:

➜  scio git:(master) ✗ sbt -Dorg.tensorflow.NativeLibrary.DEBUG=true
[info] Loading settings from idea.sbt,plugins.sbt ...
[info] Loading global plugins from /Users/julient/.sbt/1.0/plugins
[info] Loading settings from plugins.sbt ...
[info] Loading project definition from /Users/julient/Documents/Spotify/scio/project
[info] Loading settings from version.sbt,apache.sbt,build.sbt ...
[info] Resolving key references (16772 settings) ...
[info] Set current project to scio (in build file:/Users/julient/Documents/Spotify/scio/)
[info] sbt server started at local:///Users/julient/.sbt/1.0/server/3059426f2d1c426345ec/sock
sbt:scio> project scio-tensorflow
[info] Set current project to scio-tensorflow (in build file:/Users/julient/Documents/Spotify/scio/)
sbt:scio-tensorflow> testOnly com.spotify.scio.tensorflow.TensorflowSpec -- -z "hello-wold"
/Library/Java/JavaVirtualMachines/jdk1.8.0_162.jdk/Contents/Home/jre/lib/libzip.dylib
/Library/Java/JavaVirtualMachines/jdk1.8.0_162.jdk/Contents/Home/jre/lib/libnet.dylib
/Library/Java/JavaVirtualMachines/jdk1.8.0_162.jdk/Contents/Home/jre/lib/libnio.dylib
/Library/Java/JavaVirtualMachines/jdk1.8.0_162.jdk/Contents/Home/jre/lib/libmanagement.dylib
/private/var/folders/cj/ql5lswyn12v4v7hh10dpttm40000gp/T/jna--1256278045/jna8323833763036780383.tmp
org.tensorflow.NativeLibrary: tryLoadLibraryFailed: no tensorflow_jni in java.library.path
org.tensorflow.NativeLibrary: jniResourceName: org/tensorflow/native/darwin-x86_64/libtensorflow_jni.dylib
org.tensorflow.NativeLibrary: frameworkResourceName: org/tensorflow/native/darwin-x86_64/libtensorflow_framework.so
org.tensorflow.NativeLibrary: extracting native library to: /private/var/folders/cj/ql5lswyn12v4v7hh10dpttm40000gp/T/tensorflow_native_libraries-1525876152718-0/libtensorflow_framework.so
org.tensorflow.NativeLibrary: copied 12019756 bytes to /private/var/folders/cj/ql5lswyn12v4v7hh10dpttm40000gp/T/tensorflow_native_libraries-1525876152718-0/libtensorflow_framework.so
org.tensorflow.NativeLibrary: extracting native library to: /private/var/folders/cj/ql5lswyn12v4v7hh10dpttm40000gp/T/tensorflow_native_libraries-1525876152718-0/libtensorflow_jni.dylib
org.tensorflow.NativeLibrary: copied 124911544 bytes to /private/var/folders/cj/ql5lswyn12v4v7hh10dpttm40000gp/T/tensorflow_native_libraries-1525876152718-0/libtensorflow_jni.dylib
org.tensorflow.NativeLibrary: isLoaded: true
org.tensorflow.NativeLibrary: isLoaded: true
2018-05-09 16:29:13.163454: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.2 AVX AVX2 FMA
[info] TensorflowSpec:
[info] Tensorflow
[info] - should work for hello-wold
[info] ScalaTest
[info] Run completed in 1 second, 99 milliseconds.
[info] Total number of tests run: 1
[info] Suites: completed 1, aborted 0
[info] Tests: succeeded 1, failed 0, canceled 0, ignored 0, pending 0
[info] All tests passed.
[info] Passed: Total 1, Failed 0, Errors 0, Passed 1
[success] Total time: 6 s, completed May 9, 2018 4:29:13 PM
sbt:scio-tensorflow> testOnly com.spotify.scio.tensorflow.TensorflowSpec -- -z "hello-wold"
/Library/Java/JavaVirtualMachines/jdk1.8.0_162.jdk/Contents/Home/jre/lib/libzip.dylib
/Library/Java/JavaVirtualMachines/jdk1.8.0_162.jdk/Contents/Home/jre/lib/libnet.dylib
/Library/Java/JavaVirtualMachines/jdk1.8.0_162.jdk/Contents/Home/jre/lib/libnio.dylib
/Library/Java/JavaVirtualMachines/jdk1.8.0_162.jdk/Contents/Home/jre/lib/libmanagement.dylib
/private/var/folders/cj/ql5lswyn12v4v7hh10dpttm40000gp/T/jna--1256278045/jna8323833763036780383.tmp
/private/var/folders/cj/ql5lswyn12v4v7hh10dpttm40000gp/T/tensorflow_native_libraries-1525876152718-0/libtensorflow_jni.dylib
org.tensorflow.NativeLibrary: tryLoadLibraryFailed: no tensorflow_jni in java.library.path
org.tensorflow.NativeLibrary: jniResourceName: org/tensorflow/native/darwin-x86_64/libtensorflow_jni.dylib
org.tensorflow.NativeLibrary: frameworkResourceName: org/tensorflow/native/darwin-x86_64/libtensorflow_framework.so
org.tensorflow.NativeLibrary: extracting native library to: /private/var/folders/cj/ql5lswyn12v4v7hh10dpttm40000gp/T/tensorflow_native_libraries-1525876162656-0/libtensorflow_framework.so
org.tensorflow.NativeLibrary: copied 12019756 bytes to /private/var/folders/cj/ql5lswyn12v4v7hh10dpttm40000gp/T/tensorflow_native_libraries-1525876162656-0/libtensorflow_framework.so
org.tensorflow.NativeLibrary: extracting native library to: /private/var/folders/cj/ql5lswyn12v4v7hh10dpttm40000gp/T/tensorflow_native_libraries-1525876162656-0/libtensorflow_jni.dylib
org.tensorflow.NativeLibrary: copied 124911544 bytes to /private/var/folders/cj/ql5lswyn12v4v7hh10dpttm40000gp/T/tensorflow_native_libraries-1525876162656-0/libtensorflow_jni.dylib
2018-05-09 16:29:23.033326: F tensorflow/core/lib/monitoring/collection_registry.cc:77] Cannot register 2 metrics with the same name: /tensorflow/cc/saved_model/load_attempt_count
/usr/local/Cellar/sbt/1.1.4/libexec/bin/sbt-launch-lib.bash: line 58: 34318 Abort trap: 6           "$@"

So the library is already loaded before running the 2nd test but somehow, isLoaded returns false so Tensorflow tries to load the lib again.

@jto
Copy link
Contributor

jto commented May 9, 2018

I found a couple of fishy stuff in tensorflow-java.

There's a static bloc in org.tensorflow.Tensorflow: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/java/src/main/java/org/tensorflow/TensorFlow.java#L70
So when the classloader loads that Tensorflow class, init() is executed, which calls NativeLibrary.load();
That methods starts by checking if the lib is already loaded: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/java/src/main/java/org/tensorflow/NativeLibrary.java#L48
isLoaded is implemented by checking if TensorFlow.version throws an UnsatisfiedLinkError.

At this point your in the middle of the initialisation of the Tensorflow class so I I'm not really sure wether or not it's safe to access version.

There's also the fact that isLoaded is interpreting a UnsatisfiedLinkError as the lib not yet loaded, but it could also be that the library is already loaded according to https://docs.oracle.com/javase/8/docs/technotes/guides/jni/jni-12.html:

In the JDK, each class loader manages its own set of native libraries. The same JNI native library cannot be loaded into more than one class loader. Doing so causes UnsatisfiedLinkError to be thrown. For example, System.loadLibrary throws an UnsatisfiedLinkError when used to load a native library into two class loaders.

@jto
Copy link
Contributor

jto commented May 18, 2018

I had a look at sbt's source code to figure out what it's doing with classloader and I think I know what happens.

The native library il loaded by System.load (https://github.com/tensorflow/tensorflow/blob/master/tensorflow/java/src/main/java/org/tensorflow/NativeLibrary.java#L101). It seems that it this loads the library using the "system" Classloader. This makes sense because

The same JNI native library cannot be loaded into more than one class loader.

The system Classloader lives for the entire duration of your SBT session.
As a consequence, once the tensorflow native lib is loaded, it stays loaded until you stop sbt.

The Java classes (including org.tensorflow.TensorFlow and org.tensorflow.NativeLibrary) are loaded using a dedicated Classloader. SBT has a setting called testLoader and gets the Classloader used for tests from there.

So when you run a test, SBT calls testLoader and get a fresh Classloader. It uses it to load your test classes. Once the test has ran, the Classloader is garbage collected.

You therefore end up in a situation where the JNI libs are still loaded because they sit in the long lived system Classloader, while all the java classes have been "unloaded" which let's Tensorflow in a weird half loaded state.

@ravwojdyla
Copy link
Contributor

Fixed in #1168

4lejandrito added a commit to 4lejandrito/liferay-portal that referenced this issue Jul 23, 2018
4lejandrito added a commit to 4lejandrito/liferay-portal that referenced this issue Jul 23, 2018
This is because tensorflow must be in the root class loader or it will
crash every time the module is redeplouyed.

See spotify/scio#1137 (comment)
4lejandrito added a commit to 4lejandrito/liferay-portal that referenced this issue Jul 24, 2018
This is because tensorflow must be in the root class loader or it will
crash every time the module is redeplouyed.

See spotify/scio#1137 (comment)
shuyangzhou pushed a commit to shuyangzhou/liferay-portal that referenced this issue Jul 25, 2018
This is because tensorflow must be in the root class loader or it will
crash every time the module is redeplouyed.

See spotify/scio#1137 (comment)
shuyangzhou pushed a commit to shuyangzhou/liferay-portal that referenced this issue Jul 25, 2018
This is because tensorflow must be in the root class loader or it will
crash every time the module is redeplouyed.

See spotify/scio#1137 (comment)
4lejandrito added a commit to 4lejandrito/liferay-portal that referenced this issue Jul 25, 2018
This is because tensorflow must be in the root class loader or it will
crash every time the module is redeplouyed.

See spotify/scio#1137 (comment)
brianchandotcom pushed a commit to brianchandotcom/liferay-portal that referenced this issue Jul 26, 2018
This is because tensorflow must be in the root class loader or it will
crash every time the module is redeplouyed.

See spotify/scio#1137 (comment)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

4 participants