mvn install -Pdev -Djavacpp.platform.extension=-gpu -e --> FAILURE! - in org.tensorflow.framework.optimizers.AdaGradDATest #131

Open
aday00 opened this issue Oct 14, 2020 · 4 comments

aday00 commented Oct 14, 2020

Many thanks for TF-Java! On the master branch, mvn install -Pdev -Djavacpp.platform.extension=-gpu -e fails one test (AdaGradDATest.testBasic), so I'm sharing the output in case anyone else encounters this:

# mvn install -Pdev -Djavacpp.platform.extension=-gpu -e
...
Downloading from ossrh-snapshots: https://oss.sonatype.org/content/repositories/snapshots/org/tensorflow/tensorflow-core-api/0.3.0-SNAPSHOT/tensorflow-core-api-0.3.0-20201008.134402-33-linux-x86_64-gpu.jar
...

tensorflow framework build error
[INFO] --- maven-surefire-plugin:2.22.2:test (default-test) @ tensorflow-framework ---
[INFO] Surefire report directory: /tmp/docker-share/tensorflow-java/tensorflow-framework/target/surefire-reports
[INFO] 
[INFO] -------------------------------------------------------
[INFO]  T E S T S
[INFO] -------------------------------------------------------
[INFO] Running org.tensorflow.framework.optimizers.AdaDeltaTest
[INFO] Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 2.573 s - in org.tensorflow.framework.optimizers.AdaDeltaTest
[INFO] Running org.tensorflow.framework.optimizers.NadamTest
[INFO] Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 2.113 s - in org.tensorflow.framework.optimizers.NadamTest
[INFO] Running org.tensorflow.framework.optimizers.AdamTest
[INFO] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 2.134 s - in org.tensorflow.framework.optimizers.AdamTest
[INFO] Running org.tensorflow.framework.optimizers.AdaGradTest
[INFO] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 2.083 s - in org.tensorflow.framework.optimizers.AdaGradTest
[INFO] Running org.tensorflow.framework.optimizers.RMSPropTest
[INFO] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 2.367 s - in org.tensorflow.framework.optimizers.RMSPropTest
[INFO] Running org.tensorflow.framework.optimizers.AdamaxTest
[INFO] Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 2.118 s - in org.tensorflow.framework.optimizers.AdamaxTest
[INFO] Running org.tensorflow.framework.optimizers.FtrlTest
[INFO] Tests run: 5, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 2.273 s - in org.tensorflow.framework.optimizers.FtrlTest
[INFO] Running org.tensorflow.framework.optimizers.MomentumTest
[INFO] Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 2.206 s - in org.tensorflow.framework.optimizers.MomentumTest
[INFO] Running org.tensorflow.framework.optimizers.OptimizersTest
[INFO] Tests run: 10, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 2.054 s - in org.tensorflow.framework.optimizers.OptimizersTest
[INFO] Running org.tensorflow.framework.optimizers.GradientDescentTest
[INFO] Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 2.14 s - in org.tensorflow.framework.optimizers.GradientDescentTest
[INFO] Running org.tensorflow.framework.optimizers.AdaGradDATest
[ERROR] Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 2.055 s <<< FAILURE! - in org.tensorflow.framework.optimizers.AdaGradDATest
[ERROR] testBasic  Time elapsed: 2.044 s  <<< ERROR!
org.tensorflow.exceptions.TFInvalidArgumentException: 
Cannot assign a device for operation adagrad-da_1: Could not satisfy explicit device specification '' because the node {{colocation_node adagrad-da_1}} was colocated with a group of nodes that required incompatible device '/job:localhost/replica:0/task:0/device:GPU:0'. All available devices [/job:localhost/replica:0/task:0/device:CPU:0, /job:localhost/replica:0/task:0/device:GPU:0, /job:localhost/replica:0/task:0/device:GPU:1, /job:localhost/replica:0/task:0/device:GPU:2, /job:localhost/replica:0/task:0/device:GPU:3]. 
Colocation Debug Info:
Colocation group had the following types and supported devices: 
Root Member(assigned_device_name_index_=1 requested_device_name_='/job:localhost/replica:0/task:0/device:GPU:0' assigned_device_name_='/job:localhost/replica:0/task:0/device:GPU:0' resource_device_name_='/job:localhost/replica:0/task:0/device:GPU:0' supported_device_types_=[CPU] possible_devices_=[]
ApplyAdagradDA: CPU 
VariableV2: GPU CPU 
Assign: GPU CPU 
Colocation members, user-requested devices, and framework assigned devices, if any:
  var0 (VariableV2)  framework assigned device=/job:localhost/replica:0/task:0/device:GPU:0
  adagrad-da_1 (Assign) 
  var0-gradient_accumulator (VariableV2) 
  adagrad-da_10 (Assign) 
  var0-gradient_squared_accumulator (VariableV2) 
  adagrad-da_15 (Assign) 
  adagrad-da_36 (ApplyAdagradDA) 

         [[{{node adagrad-da_1}}]]
        at org.tensorflow.framework.optimizers.AdaGradDATest.testBasic(AdaGradDATest.java:90)

[INFO] Running org.tensorflow.framework.data.SkipDatasetTest
[INFO] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 2.029 s - in org.tensorflow.framework.data.SkipDatasetTest
[INFO] Running org.tensorflow.framework.data.BatchDatasetTest
[INFO] Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 2.021 s - in org.tensorflow.framework.data.BatchDatasetTest
...
[INFO] Running org.tensorflow.framework.initializers.OrthogonalTest
[INFO] Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 2.333 s - in org.tensorflow.framework.initializers.OrthogonalTest
[INFO] Running org.tensorflow.framework.initializers.HeTest
[INFO] Tests run: 7, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 2.451 s - in org.tensorflow.framework.initializers.HeTest
[INFO] 
[INFO] Results:
[INFO] 
[ERROR] Errors: 
[ERROR]   AdaGradDATest.testBasic:90 ? TFInvalidArgument Cannot assign a device for oper...
[INFO] 
[ERROR] Tests run: 112, Failures: 0, Errors: 1, Skipped: 0
[INFO] 
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary for TensorFlow Java Parent 0.3.0-SNAPSHOT:
[INFO] 
[INFO] TensorFlow Java Parent ............................. SUCCESS [  3.164 s]
[INFO] TensorFlow NdArray Library ......................... SUCCESS [ 30.072 s]
[INFO] TensorFlow Core Parent ............................. SUCCESS [  0.006 s]
[INFO] TensorFlow Core Annotation Processor ............... SUCCESS [  0.190 s]
[INFO] TensorFlow Core API Library ........................ SUCCESS [ 34.833 s]
[INFO] TensorFlow Core API Library Platform GPU ........... SUCCESS [  0.021 s]
[INFO] TensorFlow Framework Library ....................... FAILURE [01:29 min]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  02:37 min
[INFO] Finished at: 2020-10-14T01:44:35Z
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-surefire-plugin:2.22.2:test (default-test) on project tensorflow-framework: There are test failures.
[ERROR] 
[ERROR] Please refer to /tmp/docker-share/tensorflow-java/tensorflow-framework/target/surefire-reports for the individual test results.
[ERROR] Please refer to dump files (if any exist) [date].dump, [date]-jvmRun[N].dump and [date].dumpstream.
[ERROR] -> [Help 1]
org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute goal org.apache.maven.plugins:maven-surefire-plugin:2.22.2:test (default-test) on project tensorflow-framework: There are test failures.

Please refer to /tmp/docker-share/tensorflow-java/tensorflow-framework/target/surefire-reports for the individual test results.
Please refer to dump files (if any exist) [date].dump, [date]-jvmRun[N].dump and [date].dumpstream.
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:215)
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:156)
...

This new quad-GPU machine presents new challenges, compared to our previous small test systems in #100 etc. Really appreciate the fast -Pdev option for building from artifacts, thanks!
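
For readers parsing the colocation debug info in the log: it lists ApplyAdagradDA as supporting only CPU, while var0 in the same colocation group was already assigned to GPU:0, so no device satisfies every member of the group (possible_devices_=[]). The reasoning can be sketched in plain Java as a set intersection. This is a simplified model for illustration only, not TensorFlow's actual placer; the class and method names (ColocationSketch, commonDeviceTypes, canPlace) are hypothetical, and only the op types and their supported device types are taken from the log.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Simplified, hypothetical model of a colocation check; not TensorFlow's placer code. */
public class ColocationSketch {

  /** Returns the device types supported by every op type in the group (set intersection). */
  public static Set<String> commonDeviceTypes(Map<String, Set<String>> supportedByOpType) {
    Set<String> common = null;
    for (Set<String> supported : supportedByOpType.values()) {
      if (common == null) {
        common = new TreeSet<>(supported); // start from the first op type's devices
      } else {
        common.retainAll(supported); // keep only devices this op type also supports
      }
    }
    return common == null ? new TreeSet<>() : common;
  }

  /** True if the whole group can run on the device type one member is already pinned to. */
  public static boolean canPlace(Map<String, Set<String>> supportedByOpType, String pinnedType) {
    return commonDeviceTypes(supportedByOpType).contains(pinnedType);
  }

  public static void main(String[] args) {
    // Op types and supported device types copied from the "Colocation Debug Info" in the log.
    Map<String, Set<String>> group = new LinkedHashMap<>();
    group.put("ApplyAdagradDA", Set.of("CPU"));
    group.put("VariableV2", Set.of("GPU", "CPU"));
    group.put("Assign", Set.of("GPU", "CPU"));

    System.out.println("common device types: " + commonDeviceTypes(group)); // [CPU]
    System.out.println("placeable on GPU: " + canPlace(group, "GPU"));      // false
    System.out.println("placeable on CPU: " + canPlace(group, "CPU"));      // true
  }
}
```

In this model the group is only placeable on CPU, which matches the error: once var0 is assigned to GPU:0, the CPU-only ApplyAdagradDA op colocated with it has nowhere to go. The sketch does not explain why the variable ends up on the GPU on some machines and not others.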

@karllessard
Collaborator

@Craigacp, any hints on this?

@saudet
Contributor

saudet commented Nov 4, 2020

The tests pass just fine for me with a GeForce GTX 1050.

@Craigacp
Collaborator

Craigacp commented Nov 4, 2020

I can reproduce this when building master on OL7 with two V100 GPUs. I'll poke at it some more over the next few days. It's very odd.

@aday00
Author

aday00 commented Jan 23, 2021

Thanks for investigating! As a workaround to get the build through, is it safe for me to disable this test if I'm not using AdaGrad?
