caffe2: RuntimeError: [enforce fail at reshape_op.h:110] with Alexnet onnx test with cuda #13598
Comments
For reference, a full test script to recreate:
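The script itself isn't reproduced here; roughly, it follows the end-to-end AlexNet example, exporting the model and then re-importing and running it with the Caffe2 backend. A minimal sketch (the batch size of 10 and the alexnet.onnx file name match the discussion below; everything else is assumed):

```python
import torch
import torchvision
import onnx
import caffe2.python.onnx.backend as backend

# Export: trace AlexNet with a dummy batch of 10 images and write alexnet.onnx.
model = torchvision.models.alexnet(pretrained=False).eval()
dummy = torch.randn(10, 3, 224, 224)
torch.onnx.export(model, dummy, "alexnet.onnx")

# Import and run with the Caffe2 ONNX backend on the GPU (the step that fails intermittently).
onnx_model = onnx.load("alexnet.onnx")
rep = backend.prepare(onnx_model, device="CUDA:0")
outputs = rep.run(dummy.numpy())
print(outputs[0].shape)  # expected (10, 1000)
```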
The failure may occur less frequently when the script is run as a whole, and more frequently when just the Caffe2 ONNX import-and-run side is run against a previously exported "alexnet.onnx" file (but the frequency difference may be due to other factors). The corruption seems to consistently hit the most-significant half of the first shape dimension, turning what should be a "10" into a much larger value. Here's another run where this happened:
Since this is little-endian, the stray data appears at offset 0x4, between two apparently wholesome data words at 0x0 and 0x8, so maybe that suggests this isn't a simple overwrite? The problem doesn't appear to be sensitive to the input data. If we seed the numpy RNG just before generating the input:
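A sketch of that seeding (the exact call site and input shape in the real script are assumptions):

```python
import numpy as np
import torch

np.random.seed(0)  # fix the seed so every run feeds the exported model identical data
x = torch.from_numpy(np.random.rand(10, 3, 224, 224).astype(np.float32))
```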
The problem is still intermittent: sometimes it does not occur, and the stray values differ when it does. In some testing we've seen a complaint from:
Here the test script is specifying:
The stray data is consistent in form but not value:
The stray values always seem to be various integer values; there isn't any float-looking data showing up. Just made 20 runs: 2 finished successfully (either without an overwrite, or where the stray value happened to be 0x0), and the other 18 failed with various stray values (some appearing multiple times across the runs):
Looks like the issue is in the gather operation -
%29 is the output of the pooling operation; that tensor contains all of the data. %36 is the combination of several operations (Shape, Gather, Unsqueeze, and Concat) and contained the data we observed to be corrupted. Working backwards through the net trace, things seemed to go bad at the output of the Gather operation. The output of Shape (%31) is a tensor with dtype int64 and values [10, 256, 6, 6], which is used as input to Gather. Gather contains a kernel function to collect pieces of data into a single tensor.
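In numpy terms, the Shape/Gather step is just extracting the batch dimension from the shape tensor. Illustrative only; the index value and reshape target are inferred from the values above, not taken from the trace:

```python
import numpy as np

shape = np.array([10, 256, 6, 6], dtype=np.int64)   # output of Shape (%31)
batch = shape[np.array([0], dtype=np.int64)]        # Gather with index 0 -> array([10])
# Unsqueeze/Concat then combine this with the flattened feature size (256*6*6 = 9216)
# to build the Reshape target; the corruption shows up in this gathered batch value.
```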
When the kernel function is called in the example, the inputs are:
Since dst_offset and src_offset are both float pointers derived from the tensor data, I think the fix will be to template the types of dst_offset and src_offset instead of just the Index tensor type. I'm working on that now.
I've made a change to use the input data size as the data type in GatherKernel, then began to wonder what types make sense. In the current version of the code I don't see a restriction on the data size for the input; should any valid numerical type be accepted?
* Intermittent data corruption was seen in the Reshape_op when running the end-to-end PyTorch to Caffe2 example; the issue was traced back to the gather_op transferring only float data types. See pytorch/pytorch#13598 for additional details.
The gather code has been refactored recently, but we are still seeing the failure. One thing I noticed with the new code is that the issue only occurs when the input data type is int64. I used the following test to validate:
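The test isn't shown here; a minimal sketch of that kind of check with the Caffe2 Python API (tensor contents and shapes are assumptions):

```python
import numpy as np
from caffe2.python import core, workspace
from caffe2.proto import caffe2_pb2

device_opts = caffe2_pb2.DeviceOption()
device_opts.device_type = caffe2_pb2.CUDA            # the failing path

data = np.arange(16, dtype=np.int64).reshape(4, 4)   # switch to np.int32 to compare
indices = np.array([0, 2], dtype=np.int32)

workspace.FeedBlob("data", data, device_option=device_opts)
workspace.FeedBlob("indices", indices, device_option=device_opts)
workspace.RunOperatorOnce(
    core.CreateOperator("Gather", ["data", "indices"], ["out"],
                        device_option=device_opts))
print(workspace.FetchBlob("out"))  # expect rows 0 and 2 of `data`
```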
The following output was observed for the int32 type:
and the following with int64 as the input data type:
I will continue to investigate a solution.
I also tried gathering multiple int8 values from a tensor; it also gave incorrect results when using the GPU version of gather.
However, if I switch the tensor to the CPU device type (device_opts.device_type = caffe2_pb2.CPU), the results are correct.
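For instance, relative to the sketch above, only the device option and the dtype change (illustrative, not the exact test that was run):

```python
device_opts.device_type = caffe2_pb2.CPU           # same Gather test on CPU: correct results
data = np.arange(16, dtype=np.int8).reshape(4, 4)  # int8 input that also misbehaved on GPU
```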
Created a PR based on the refactored code, #16077, to resolve the issue.
🐛 Bug
We are seeing an intermittent failure in the reshape_op when trying to run the "Example: End-to-end AlexNet from PyTorch to Caffe2" tutorial.
To Reproduce
Steps to reproduce the behavior:
Using the example code and the exported alexnet.onnx file, run the following sample:
Expected behavior
The example code runs successfully every time.
Environment
How you installed PyTorch (conda, pip, source): Build from source
The problem only occurs when device = cuda; if device = cpu the issue is not seen. Seen on ppc64le; unknown whether other platforms are affected.
Additional context
Checking that very large incorrect value, we see the least significant 32 bits of the total size are correct (i.e. 0x16800, or 92160), but the high-order 32 bits are incorrect (they should be 0x0, but contain data):
If we build Caffe2 to crash here, we find the cause of the problem seems to be that the shape information copied back from the GPU is damaged; for example, in one run we see:
Here, 2 values (0x0a / 10, and 0x2400 / 9216) should have been copied in from the GPU, but instead the most-significant 32 bits of the "10" value have been overlaid with "0x05", resulting in a final apparent value of 0x0000 0005 0000 000a (21474836490).
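The arithmetic behind those two observations (a quick check, not taken from the original logs):

```python
expected_total = 10 * 256 * 6 * 6            # 92160 == 0x16800: the correct total size
corrupted_dim = 0x0000000500000000 | 0x0A    # 21474836490: what Reshape saw for dim 0
low = corrupted_dim & 0xFFFFFFFF             # 10 -> the real batch dimension survives
high = corrupted_dim >> 32                   # 5  -> stray data overlaid on the high word
print(hex(expected_total), corrupted_dim, low, high)
```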
Since the problem is intermittent, it's not clear whether the overwrite is always happening (and the stray value just mostly happens to be 0x0) or only happens sometimes.