Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Debug in .cc file. org.tensorflow.exceptions.TFInvalidArgumentException: indices = 1 is not in [0, 1) #189

Open
uygnef opened this issue Jan 14, 2021 · 9 comments

Comments

@uygnef
Copy link

uygnef commented Jan 14, 2021

My tfserver project throw error when it perform an inference.
Something wrong when program get feature raw data and make feature to embedding tensor.
How can I debug into .cc file.
image

org.tensorflow.exceptions.TFInvalidArgumentException: indices = 1 is not in [0, 1)
	 [[{{node dense_features_5/cate_ord_start_hash_embedding_1/cate_ord_start_hash_embedding_weights/GatherV2}}]]

featue "cate_ord_start_hash" is created by

'cate_ord_start_hash': tf.feature_column.embedding_column( tf.feature_column.categorical_column_with_vocabulary_file('cate_ord_start_hash', "cate_ord_start_hash_dict.txt", dtype=tf.string, default_value=0), dimension=8)

full error stack is

16:06:19.623 [main] INFO com.sankuai.qcs.model.common.core.loader.AbstractModelLoader - Unzip model cancel(v1, 1.0) tar file succeed, get [/Users/yu/proj/qcs.dispatch.algo/model-cancelBlameCancelRate/target/test-classes/._pbModel]
Warning: Could not load Loader: java.lang.UnsatisfiedLinkError: no jnijavacpp in java.library.path
Warning: Could not load Pointer: java.lang.UnsatisfiedLinkError: no jnijavacpp in java.library.path
Warning: Could not load BytePointer: java.lang.UnsatisfiedLinkError: no jnijavacpp in java.library.path
Warning: Could not load PointerPointer: java.lang.UnsatisfiedLinkError: no jnijavacpp in java.library.path
2021-01-14 16:06:21.898193: I external/org_tensorflow/tensorflow/cc/saved_model/reader.cc:31] Reading SavedModel from: /Users/yu/proj/qcs.dispatch.algo/model-cancelBlameCancelRate/target/test-classes//pbModel
2021-01-14 16:06:21.913997: I external/org_tensorflow/tensorflow/cc/saved_model/reader.cc:54] Reading meta graph with tags { serve }
2021-01-14 16:06:21.914019: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:250] Reading SavedModel debug info (if present) from: /Users/yu/proj/qcs.dispatch.algo/model-cancelBlameCancelRate/target/test-classes//pbModel
2021-01-14 16:06:21.914108: I external/org_tensorflow/tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN)to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-01-14 16:06:22.037598: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:215] Restoring SavedModel bundle.
2021-01-14 16:06:22.168687: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:199] Running initialization op on SavedModel bundle at path: /Users/yu/proj/qcs.dispatch.algo/model-cancelBlameCancelRate/target/test-classes//pbModel
2021-01-14 16:06:22.261870: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:319] SavedModel load for tags { serve }; Status: success: OK. Took 363683 microseconds.
16:06:22.872 [main] DEBUG com.sankuai.qcs.model.impl.cancelBlameCancelRate.cancelBlameCancelRateModel - #GroupRespRateModel.loadConfig# 配置文件解析成功. featureConfig:{cate_ord_start_hash=com.sankuai.qcs.model.impl.cancelBlameCancelRate.cancelBlameFeatureDesc@11c9af63, cate_ord_hour=com.sankuai.qcs.model.impl.cancelBlameCancelRate.cancelBlameFeatureDesc@757acd7b, cate_ord_end_hash=com.sankuai.qcs.model.impl.cancelBlameCancelRate.cancelBlameFeatureDesc@36b4fe2a, str_im_message=com.sankuai.qcs.model.impl.cancelBlameCancelRate.cancelBlameFeatureDesc@574b560f, num_hold_time=com.sankuai.qcs.model.impl.cancelBlameCancelRate.cancelBlameFeatureDesc@ba54932, order_id_view=com.sankuai.qcs.model.impl.cancelBlameCancelRate.cancelBlameFeatureDesc@28975c28, num_trail_total_moving_distance=com.sankuai.qcs.model.impl.cancelBlameCancelRate.cancelBlameFeatureDesc@3943a2be, num_navi_interval_duration=com.sankuai.qcs.model.impl.cancelBlameCancelRate.cancelBlameFeatureDesc@343570b7, num_trail_interval_distance=com.sankuai.qcs.model.impl.cancelBlameCancelRate.cancelBlameFeatureDesc@157853da, cate_ord_hexhash16=com.sankuai.qcs.model.impl.cancelBlameCancelRate.cancelBlameFeatureDesc@71c3b41, num_user_total_moving_distance=com.sankuai.qcs.model.impl.cancelBlameCancelRate.cancelBlameFeatureDesc@236e3f4e, cate_feature_call_type=com.sankuai.qcs.model.impl.cancelBlameCancelRate.cancelBlameFeatureDesc@3cc1435c, num_navi_offline_cnt=com.sankuai.qcs.model.impl.cancelBlameCancelRate.cancelBlameFeatureDesc@6bf0219d, num_navi_max_distance=com.sankuai.qcs.model.impl.cancelBlameCancelRate.cancelBlameFeatureDesc@dd0c991, num_user_max_distance=com.sankuai.qcs.model.impl.cancelBlameCancelRate.cancelBlameFeatureDesc@5f16132a, num_trail_max_distance=com.sankuai.qcs.model.impl.cancelBlameCancelRate.cancelBlameFeatureDesc@69fb6037, num_navi_interval_distance=com.sankuai.qcs.model.impl.cancelBlameCancelRate.cancelBlameFeatureDesc@36d585c, num_user_avg_distance=com.sankuai.qcs.model.impl.cancelBlameCancelRate.cancelBlameFeatureDesc@87a85e1, num_trail_avg_distance=com.sankuai.qcs.model.impl.cancelBlameCancelRate.cancelBlameFeatureDesc@671a5887, num_user_interval_distance=com.sankuai.qcs.model.impl.cancelBlameCancelRate.cancelBlameFeatureDesc@5552768b, num_navi_avg_duration=com.sankuai.qcs.model.impl.cancelBlameCancelRate.cancelBlameFeatureDesc@3c947bc5, num_ord_cancel_time=com.sankuai.qcs.model.impl.cancelBlameCancelRate.cancelBlameFeatureDesc@609db43b, cate_feature_call_result=com.sankuai.qcs.model.impl.cancelBlameCancelRate.cancelBlameFeatureDesc@55f616cf, num_navi_max_duration=com.sankuai.qcs.model.impl.cancelBlameCancelRate.cancelBlameFeatureDesc@1356d4d4, num_navi_avg_distance=com.sankuai.qcs.model.impl.cancelBlameCancelRate.cancelBlameFeatureDesc@c03cf28, cate_ord_status=com.sankuai.qcs.model.impl.cancelBlameCancelRate.cancelBlameFeatureDesc@1329eff, num_user_avg_speed=com.sankuai.qcs.model.impl.cancelBlameCancelRate.cancelBlameFeatureDesc@6497b078, cate_ord_driver_type=com.sankuai.qcs.model.impl.cancelBlameCancelRate.cancelBlameFeatureDesc@41c2284a, num_trail_driver_speed=com.sankuai.qcs.model.impl.cancelBlameCancelRate.cancelBlameFeatureDesc@53aad5d5}
16:06:22.873 [main] INFO com.sankuai.qcs.model.common.core.loader.AbstractModelLoader - Build model cancel(v1, 1.0) finished
16:06:22.902 [main] INFO com.sankuai.qcs.model.common.core.loader.AbstractModelLoader - Model cancel(v1, 1.0) get attributeVectorDef: [AttributeDef[name=num_ord_cancel_time,attributeType=INT,defaultValue=-1,allowNone=false],AttributeDef[name=cate_ord_hour,attributeType=STRING,defaultValue=0,allowNone=false],AttributeDef[name=cate_ord_hexhash16,attributeType=STRING,defaultValue=0,allowNone=false],AttributeDef[name=cate_ord_start_hash,attributeType=STRING,defaultValue=0,allowNone=false],AttributeDef[name=cate_ord_end_hash,attributeType=STRING,defaultValue=0,allowNone=false],AttributeDef[name=cate_ord_driver_type,attributeType=INT,defaultValue=-1,allowNone=false],AttributeDef[name=cate_ord_status,attributeType=INT,defaultValue=-1,allowNone=false],AttributeDef[name=num_navi_max_distance,attributeType=INT,defaultValue=-1,allowNone=false],AttributeDef[name=num_navi_interval_distance,attributeType=INT,defaultValue=-1,allowNone=false],AttributeDef[name=num_navi_avg_distance,attributeType=INT,defaultValue=-1,allowNone=false],AttributeDef[name=num_navi_max_duration,attributeType=INT,defaultValue=-1,allowNone=false],AttributeDef[name=num_navi_interval_duration,attributeType=INT,defaultValue=-1,allowNone=false],AttributeDef[name=num_navi_avg_duration,attributeType=INT,defaultValue=-1,allowNone=false],AttributeDef[name=num_navi_offline_cnt,attributeType=INT,defaultValue=-1,allowNone=false],AttributeDef[name=num_trail_max_distance,attributeType=INT,defaultValue=-1,allowNone=false],AttributeDef[name=num_trail_avg_distance,attributeType=INT,defaultValue=-1,allowNone=false],AttributeDef[name=num_trail_interval_distance,attributeType=INT,defaultValue=-1,allowNone=false],AttributeDef[name=num_trail_total_moving_distance,attributeType=INT,defaultValue=-1,allowNone=false],AttributeDef[name=num_trail_driver_speed,attributeType=INT,defaultValue=-1,allowNone=false],AttributeDef[name=cate_feature_call_type,attributeType=INT,defaultValue=-1,allowNone=false],AttributeDef[name=cate_feature_call_result,attributeType=INT,defaultValue=-1,allowNone=false],AttributeDef[name=num_hold_time,attributeType=INT,defaultValue=-1,allowNone=false],AttributeDef[name=num_user_max_distance,attributeType=INT,defaultValue=-1,allowNone=false],AttributeDef[name=num_user_interval_distance,attributeType=INT,defaultValue=-1,allowNone=false],AttributeDef[name=num_user_avg_distance,attributeType=INT,defaultValue=-1,allowNone=false],AttributeDef[name=num_user_total_moving_distance,attributeType=INT,defaultValue=-1,allowNone=false],AttributeDef[name=num_user_avg_speed,attributeType=INT,defaultValue=-1,allowNone=false],AttributeDef[name=str_im_message,attributeType=STRING,defaultValue=N,allowNone=false],AttributeDef[name=order_id_view,attributeType=STRING,defaultValue=0,allowNone=false]]
16:06:22.939 [main] INFO com.sankuai.qcs.model.impl.cancelBlameCancelRate.cancelBlameCancelRateModelChecker - CancelBlameRateFeatures: cate_ord_hexhash16 = p0d221h0408
2021-01-14 16:06:24.111474: W external/org_tensorflow/tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at strided_slice_op.cc:108 : Invalid argument: slice index 1 of dimension 0 out of bounds.
2021-01-14 16:06:24.111486: W external/org_tensorflow/tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at strided_slice_op.cc:108 : Invalid argument: slice index 1 of dimension 0 out of bounds.
2021-01-14 16:06:24.111530: W external/org_tensorflow/tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at strided_slice_op.cc:108 : Invalid argument: slice index 1 of dimension 0 out of bounds.
2021-01-14 16:06:24.111539: W external/org_tensorflow/tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at strided_slice_op.cc:108 : Invalid argument: slice index 1 of dimension 0 out of bounds.
2021-01-14 16:06:24.111548: W external/org_tensorflow/tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at strided_slice_op.cc:108 : Invalid argument: slice index 1 of dimension 0 out of bounds.
2021-01-14 16:06:24.111539: W external/org_tensorflow/tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at strided_slice_op.cc:108 : Invalid argument: slice index 1 of dimension 0 out of bounds.
2021-01-14 16:06:24.111576: W external/org_tensorflow/tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at strided_slice_op.cc:108 : Invalid argument: slice index 1 of dimension 0 out of bounds.
2021-01-14 16:06:24.111600: W external/org_tensorflow/tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at strided_slice_op.cc:108 : Invalid argument: slice index 1 of dimension 0 out of bounds.
2021-01-14 16:06:24.111606: W external/org_tensorflow/tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at strided_slice_op.cc:108 : Invalid argument: slice index 1 of dimension 0 out of bounds.
2021-01-14 16:06:24.111625: W external/org_tensorflow/tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at strided_slice_op.cc:108 : Invalid argument: slice index 1 of dimension 0 out of bounds.
2021-01-14 16:06:24.111628: W external/org_tensorflow/tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at strided_slice_op.cc:108 : Invalid argument: slice index 1 of dimension 0 out of bounds.
2021-01-14 16:06:24.111631: W external/org_tensorflow/tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at strided_slice_op.cc:108 : Invalid argument: slice index 1 of dimension 0 out of bounds.
2021-01-14 16:06:24.111641: W external/org_tensorflow/tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at strided_slice_op.cc:108 : Invalid argument: slice index 1 of dimension 0 out of bounds.
2021-01-14 16:06:24.111651: W external/org_tensorflow/tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at strided_slice_op.cc:108 : Invalid argument: slice index 1 of dimension 0 out of bounds.
2021-01-14 16:06:24.111646: W external/org_tensorflow/tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at strided_slice_op.cc:108 : Invalid argument: slice index 1 of dimension 0 out of bounds.
2021-01-14 16:06:24.111668: W external/org_tensorflow/tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at strided_slice_op.cc:108 : Invalid argument: slice index 1 of dimension 0 out of bounds.
2021-01-14 16:06:24.111674: W external/org_tensorflow/tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at strided_slice_op.cc:108 : Invalid argument: slice index 1 of dimension 0 out of bounds.
2021-01-14 16:06:24.111678: W external/org_tensorflow/tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at strided_slice_op.cc:108 : Invalid argument: slice index 1 of dimension 0 out of bounds.
2021-01-14 16:06:24.111688: W external/org_tensorflow/tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at strided_slice_op.cc:108 : Invalid argument: slice index 1 of dimension 0 out of bounds.
2021-01-14 16:06:24.111767: W external/org_tensorflow/tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at strided_slice_op.cc:108 : Invalid argument: slice index 1 of dimension 0 out of bounds.
16:06:24.112 [main] ERROR com.sankuai.qcs.model.common.core.model.inner.AbstractTFDispatchModel - Compute tf model failed, modelName = cancel: 
org.tensorflow.exceptions.TFInvalidArgumentException: indices = 1 is not in [0, 1)
	 [[{{node dense_features_5/cate_ord_start_hash_embedding_1/cate_ord_start_hash_embedding_weights/GatherV2}}]]
	at org.tensorflow.internal.c_api.AbstractTF_Status.throwExceptionIfNotOK(AbstractTF_Status.java:87)
	at org.tensorflow.Session.run(Session.java:666)
	at org.tensorflow.Session.access$100(Session.java:72)
	at org.tensorflow.Session$Runner.runHelper(Session.java:381)
	at org.tensorflow.Session$Runner.run(Session.java:329)
	at com.sankuai.qcs.model.common.core.model.inner.AbstractTFDispatchModel.compute(AbstractTFDispatchModel.java:47)
	at com.sankuai.qcs.model.common.core.model.outer.AbstractTFModel.compute(AbstractTFModel.java:27)
	at com.sankuai.qcs.model.impl.cancelBlameCancelRate.cancelBlameCancelRateModelChecker.check(cancelBlameCancelRateModelChecker.java:49)
	at com.sankuai.qcs.model.common.core.loader.AbstractModelLoader.loadFromFile(AbstractModelLoader.java:58)
	at com.sankuai.qcs.model.impl.cancelBlameCancelRate.cancelBlameCancelRateModelTest.test(cancelBlameCancelRateModelTest.java:41)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
	at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
	at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
	at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
	at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
	at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
	at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
	at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
	at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
	at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
	at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
	at org.junit.runner.JUnitCore.run(JUnitCore.java:137)
	at com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:69)
	at com.intellij.rt.junit.IdeaTestRunner$Repeater.startRunnerWithArgs(IdeaTestRunner.java:33)
	at com.intellij.rt.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:220)
	at com.intellij.rt.junit.JUnitStarter.main(JUnitStarter.java:53)
@karllessard
Copy link
Collaborator

How can I debug into .cc file.

If I recall correctly, I used to do that by

  • adding a breakpoint in the TF Java code before running the graph (to make sure that all native libraries have been loaded)
  • launching the JVM in debug mode and waiting until it breaks
  • starting a gdb session targeting the java process of the JVM
  • adding native breakpoints
  • continuing the debugging session of gdb
  • continuing the debugging session of the JVM

Then, gdb should break when your program reaches the native breakpoints

@Craigacp
Copy link
Collaborator

It looks like your classpath might not be right as there are a bunch of JavaCPP errors at the top:

Warning: Could not load Loader: java.lang.UnsatisfiedLinkError: no jnijavacpp in java.library.path
Warning: Could not load Pointer: java.lang.UnsatisfiedLinkError: no jnijavacpp in java.library.path
Warning: Could not load BytePointer: java.lang.UnsatisfiedLinkError: no jnijavacpp in java.library.path
Warning: Could not load PointerPointer: java.lang.UnsatisfiedLinkError: no jnijavacpp in java.library.path

@saudet any idea how TF is managing to run if it can't load the JavaCPP pointer?

@saudet
Copy link
Contributor

saudet commented Jan 15, 2021

It's just a warning, it still runs for backward compatibility, but it's obviously not recommended since users may hit loading errors...

@karllessard
Copy link
Collaborator

@uygnef , do you still want us to keep track of this issue, i.e. was the purpose only to know how to debug the TF C++ runtime from a Java process or you suspect that the out-of-range issue is related to the Java client?

@saudet
Copy link
Contributor

saudet commented Apr 5, 2021

@Craigacp I've changed the behavior in commit bytedeco/javacpp@f8ad06a so that those messages appear only in the debug log instead of as warnings. When someone encounters one of those corner cases, it'll still be easy to diagnose when that happens and then make sure they put jnijavacpp in the (right) library/class/module path.

@Craigacp
Copy link
Collaborator

Craigacp commented Apr 5, 2021

@saudet What's the difference in behaviour when it can't load these JNI bindings? Is it slower or more resource intensive? If so then making what appears to be a packaging problem into an easily ignored logger message seems like a problem. Why don't we fix the fact that it can't load jnijavacpp by sorting out the packaging instead?

@saudet
Copy link
Contributor

saudet commented Apr 6, 2021

It's not going to do anything differently unless it can't load those functions from a native library somewhere, so it's not a big issue.
As for packaging, you tell me, why aren't you including libjnijavacpp.so in your paths somewhere?

@Craigacp
Copy link
Collaborator

Craigacp commented Apr 6, 2021

My projects which use TF-Java depend on what Maven tells me TF-Java depends on. So if the TF-Java pom files don't depend on a jar which contains jnijavacpp then we should fix the pom files. Given you wrote the JavaCPP related portions of TF-Java, I assumed we already had the necessary JavaCPP dependencies, both Java and native. If we don't have a complete set of them, then let's fix that.

Which jar file is this binary in?

@saudet
Copy link
Contributor

saudet commented Apr 6, 2021

For Linux, it's exactly this one:

<dependency>
<groupId>org.bytedeco</groupId>
<artifactId>javacpp</artifactId>
<version>${javacpp.version}</version>
<classifier>${javacpp.platform.linux-x86_64}</classifier>
</dependency>

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants