This repository has been archived by the owner on Mar 8, 2024. It is now read-only.

mnist example panick #1

Closed
davidleon opened this issue Jun 13, 2016 · 11 comments

Comments

@davidleon

stack backtrace:
0: 0x7ff7336a72cc - std::rt::lang_start::hfe4efe1fc39e4a30
1: 0x7ff7336a661d - std::rt::lang_start::hfe4efe1fc39e4a30
2: 0x7ff73369a8fd - std::panicking::rust_panic_with_hook::h587239a80cad02d2
3: 0x7ff7336a996b - rust_begin_unwind
4: 0x7ff73369b99f - std::panicking::begin_panic_fmt::hb3024643f3039337
5: 0x7ff7336942f4 - check
at D:\Dev\deeplearn-rs:8
6: 0x7ff733658be3 - doubledeeplearn::var_store::VarIndex
7: 0x7ff7336584c9 - as_event_list<core::cell::Ref<opencl::hl::Event>,opencl::hl::Event,closure>
at C:\Users\davel\.cargo\git\checkouts\rust-opencl-34f1354d1798ac72\master\src\hl.rs:864
8: 0x7ff7336582d7 - enqueue_async_kernel<(usize, usize),&[core::option::Option<core::cell::Ref<opencl::hl::Event>>]>
at C:\Users\davel\.cargo\git\checkouts\rust-opencl-34f1354d1798ac72\master\src\hl.rs:461
9: 0x7ff733669e1c - matmul
at C:\Users\davel\.cargo\git\checkouts\gpuarray-rs-23536b5f6e829730\master\src\ops.rs:129
10: 0x7ff7336698ba - forward
at D:\Dev\deeplearn-rs\src\op.rs:72
11: 0x7ff733653ad7 - forward
at D:\Dev\deeplearn-rs\src\graph.rs:171
12: 0x7ff733647575 - train
at D:\Dev\deeplearn-rs\src\train.rs:24
13: 0x7ff73362232a - main
at D:\Dev\deeplearn-rs\examples\mnist.rs:136
14: 0x7ff7336a5fac - std::rt::lang_start::hfe4efe1fc39e4a30
15: 0x7ff7336abd41 - _rust_maybe_catch_panic
16: 0x7ff7336a5cee - std::rt::lang_start::hfe4efe1fc39e4a30
17: 0x7ff7336477d9 - main
18: 0x7ff7336b427f - __scrt_common_main_seh
at f:\dd\vctools\crt\vcstartup\src\startup\exe_common.inl:255
19: 0x7ffcc4488101 - BaseThreadInitThunk
error: Process didn't exit successfully: target\debug\examples\mnist.exe (exit code: 101)

@davidleon
Author

I believe this is due to a bug in the OpenCL library. I tried the example in gpuarray-rs and it still doesn't work; it emits the same CL_INVALID_EVENT_WAIT_LIST error.

@tedsta
Owner

tedsta commented Jun 13, 2016

What OS, OpenCL driver, and OpenCL device are you using?

EDIT: Also, can you run cargo test in gpuarray-rs and paste the results back here?

@davidleon
Author

To build on 64-bit Windows, I tweaked the rust-opencl repo a little.
For the MSVC ABI I added #[link(name="opencl")] in cl.rs; for the GNU ABI I added #[link(name="opencl")] extern "win64"{

cargo test outputs the following:
Compiling opencl v0.3.0-dev (file:///D:/Dev/rust-opencl)
D:\Dev\rust-opencl\src\ext.rs:3:10: 3:28 warning: lint raw_pointer_derive has been removed: using derive with raw pointe
rs is ok, #[warn(renamed_and_removed_lints)] on by default
D:\Dev\rust-opencl\src\ext.rs:3 raw_pointer_derive,
^~~~~~~~~~~~~~~~~~
D:\Dev\rust-opencl\src\hl.rs:9:5: 9:20 warning: unused import, #[warn(unused_imports)] on by default
D:\Dev\rust-opencl\src\hl.rs:9 use std::ops::Deref;
^~~~~~~~~~~~~~~
D:\Dev\rust-opencl\src\hl.rs:137:29: 137:51 warning: use of deprecated item: the lazy-static crate suffices for static s
ync primitives and eventually this type shouldn't be necessary as Mutex::new in a static should suffice, #[warn(deprec
ated)] on by default
D:\Dev\rust-opencl\src\hl.rs:137 static mut platforms_mutex: std::sync::StaticMutex = std::sync::MUTEX_INIT;
^~~~~~~~~~~~~~~~~~~~~~
D:\Dev\rust-opencl\src\hl.rs:137:54: 137:75 warning: use of deprecated item: the lazy-static crate suffices for static s
ync primitives and eventually this type shouldn't be necessary as Mutex::new in a static should suffice, #[warn(deprec
ated)] on by default
D:\Dev\rust-opencl\src\hl.rs:137 static mut platforms_mutex: std::sync::StaticMutex = std::sync::MUTEX_INIT;
^~~~~~~~~~~~~~~~~~~~~
D:\Dev\rust-opencl\src\hl.rs:145:37: 145:41 warning: use of deprecated item: the lazy-static crate suffices for static s
ync primitives and eventually this type shouldn't be necessary as Mutex::new in a static should suffice, #[warn(deprec
ated)] on by default
D:\Dev\rust-opencl\src\hl.rs:145 let guard = platforms_mutex.lock();
^~~~
Compiling gpuarray v0.1.0 (file:///D:/Dev/gpuarray-rs)
src\ops.rs:70:13: 70:23 warning: unused variable: event_list, #[warn(unused_variables)] on by default
src\ops.rs:70 let event_list: &[Option<Ref<Rc>>] = &[a.get_event(), b.get_event()];
^~~~~~~~~~
src\ops.rs:70:13: 70:23 warning: unused variable: event_list, #[warn(unused_variables)] on by default
src\ops.rs:70 let event_list: &[Option<Ref<Rc>>] = &[a.get_event(), b.get_event()];
^~~~~~~~~~
Running target\debug\gpuarray-ba59dee6a24fc9c0.exe

running 25 tests
test helper::test_compute_dim_steps ... ok
test array::test_reshape ... ok
test array::test_array_indexing ... ok
test array::test_array_indexing_mut ... ok
test ops::tensor_divide_axis1 ... FAILED
test ops::tensor_add ... FAILED
test ops::tensor_add_reuse ... FAILED
test ops::tensor_add_axis ... FAILED
test ops::tensor_dsigmoid ... ok
test ops::tensor_dtanh ... FAILED
test ops::tensor_fill ... ok
test ops::tensor_matmul ... FAILED
test ops::tensor_multiply_axis1 ... FAILED
test ops::tensor_sigmoid ... ok
test ops::tensor_sum_axis0 ... ok
test ops::tensor_sum_axis1 ... ok
test ops::tensor_tanh ... FAILED
test ops::tensor_transpose ... ok
test ops::test_add_slice ... FAILED
test range_arg::test_s_macro ... ok
test ops::test_copy_to_slice ... ok
test ops::test_fill_slice ... ok
test ops::test_multiply_slice ... FAILED
test tensor::test_tensor_incomplete_slice ... ok
test tensor::test_tensor_read ... ok

failures:

---- ops::tensor_divide_axis1 stdout ----
Using OpenCL Device: Intel(R) HD Graphics 4000
thread 'ops::tensor_divide_axis1' panicked at 'Error enqueuing kernel. (CL_INVALID_EVENT_WAIT_LIST)', D:\Dev\rust-opencl
\src\error.rs:65
note: Run with RUST_BACKTRACE=1 for a backtrace.

---- ops::tensor_add stdout ----
Using OpenCL Device: Intel(R) HD Graphics 4000
thread 'ops::tensor_add' panicked at 'Error enqueuing kernel. (CL_INVALID_EVENT_WAIT_LIST)', D:\Dev\rust-opencl\src\erro
r.rs:65

---- ops::tensor_add_reuse stdout ----
Using OpenCL Device: Intel(R) HD Graphics 4000
thread 'ops::tensor_add_reuse' panicked at 'Error enqueuing kernel. (CL_INVALID_EVENT_WAIT_LIST)', D:\Dev\rust-opencl\sr
c\error.rs:65

---- ops::tensor_add_axis stdout ----
Using OpenCL Device: Intel(R) HD Graphics 4000
thread 'ops::tensor_add_axis' panicked at 'Error enqueuing kernel. (CL_INVALID_EVENT_WAIT_LIST)', D:\Dev\rust-opencl\src
\error.rs:65

---- ops::tensor_dtanh stdout ----
Using OpenCL Device: Intel(R) HD Graphics 4000
[
[1 0.41997433 0.0706507]
[0.009866118 0.0013408661 0.00018155575]
[0.000024437904 0.00000333786 0.00000047683716]
[0 0 0]
[0 0 0]
]

thread 'ops::tensor_dtanh' panicked at 'assertion failed: b.buffer() ==
&[1.0, 0.4199744, 0.070650816, 0.009865999, 0.0013408661, 0.00018167496,
0.000024676323, 0.00000333786, 0.00000047683716, 0.0, 0.0, 0.0, 0.0,
0.0, 0.0]', src\ops.rs:788

---- ops::tensor_matmul stdout ----
Using OpenCL Device: Intel(R) HD Graphics 4000
thread 'ops::tensor_matmul' panicked at 'Error enqueuing kernel. (CL_INVALID_EVENT_WAIT_LIST)', D:\Dev\rust-opencl\src\e
rror.rs:65

---- ops::tensor_multiply_axis1 stdout ----
Using OpenCL Device: Intel(R) HD Graphics 4000
thread 'ops::tensor_multiply_axis1' panicked at 'Error enqueuing kernel. (CL_INVALID_EVENT_WAIT_LIST)', D:\Dev\rust-open
cl\src\error.rs:65

---- ops::tensor_tanh stdout ----
Using OpenCL Device: Intel(R) HD Graphics 4000
[
[0 0.7615942 0.96402764]
[0.9950547 0.9993293 0.9999092]
[0.9999878 0.99999833 0.99999976]
[1 1 1]
[1 1 1]
]

thread 'ops::tensor_tanh' panicked at 'assertion failed: b.buffer() ==
&[0.0, 0.7615941, 0.9640276, 0.9950548, 0.9993293, 0.99990916, 0.99998766,
0.99999833, 0.99999976, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]', src\ops.rs:767

---- ops::test_add_slice stdout ----
Using OpenCL Device: Intel(R) HD Graphics 4000
thread 'ops::test_add_slice' panicked at 'Error enqueuing kernel. (CL_INVALID_EVENT_WAIT_LIST)', D:\Dev\rust-opencl\src
error.rs:65

---- ops::test_multiply_slice stdout ----
Using OpenCL Device: Intel(R) HD Graphics 4000
thread 'ops::test_multiply_slice' panicked at 'Error enqueuing kernel. (CL_INVALID_EVENT_WAIT_LIST)', D:\Dev\rust-opencl
\src\error.rs:65

failures:
ops::tensor_add
ops::tensor_add_axis
ops::tensor_add_reuse
ops::tensor_divide_axis1
ops::tensor_dtanh
ops::tensor_matmul
ops::tensor_multiply_axis1
ops::tensor_tanh
ops::test_add_slice
ops::test_multiply_slice

test result: FAILED. 15 passed; 10 failed; 0 ignored; 0 measured

error: test failed

@tedsta
Owner

tedsta commented Jun 18, 2016

It seems that it's failing on operations that wait on more than one OpenCL event (operations that take only one input, like sigmoid, pass; operations that take two inputs, like add, fail). I've updated my rust-opencl fork and gpuarray-rs. Could you try cargo update and cargo test again in gpuarray-rs? Thanks for helping me catch bugs :)

@davidleon
Author

A quick test: it doesn't change anything. I tried git pull later, and now it doesn't compile, complaining:
src\ops.rs:16:34: 16:54 error: this function takes 5 parameters but 4 parameters were supplied [E0061]
src\ops.rs:16 .enqueue_async_kernel(&kernel, a.len(),
^~~~~~~~~~~~~~~~~~~~
src\ops.rs:16:34: 16:54 help: run rustc --explain E0061 to see a detailed explanation
src\ops.rs:26:35: 26:55 error: this function takes 5 parameters but 4 parameters were supplied [E0061]
src\ops.rs:26 a.set_event(Rc::new(ctx.queue.enqueue_async_kernel(&kernel, a.len(), None, ())));
^~~~~~~~~~~~~~~~~~~~
src\ops.rs:26:35: 26:55 help: run rustc --explain E0061 to see a detailed explanation
src\ops.rs:41:19: 41:39 error: this function takes 5 parameters but 4 parameters were supplied [E0061]
src\ops.rs:41 ctx.queue.enqueue_async_kernel(&kernel, keep_dim, None, a.get_event().as_ref().map(|x| &***x))

@davidleon
Author

Ah, it was a git pull conflict I hadn't resolved. All the opencl tests pass now, and gpuarray improves a lot, with the following:
Compiling gpuarray v0.1.0 (file:///D:/Dev/gpuarray-rs)
Running target\debug\gpuarray-ba59dee6a24fc9c0.exe

running 25 tests
test array::test_array_indexing_mut ... ok
test array::test_reshape ... ok
test helper::test_compute_dim_steps ... ok
test array::test_array_indexing ... ok
test ops::tensor_add ... ok
test ops::tensor_divide_axis1 ... ok
test ops::tensor_add_reuse ... ok
test ops::tensor_add_axis ... ok
test ops::tensor_dsigmoid ... ok
test ops::tensor_dtanh ... FAILED
test ops::tensor_fill ... ok
test ops::tensor_matmul ... ok
test ops::tensor_multiply_axis1 ... ok
test ops::tensor_sigmoid ... ok
test ops::tensor_sum_axis0 ... ok
test ops::tensor_sum_axis1 ... ok
test ops::tensor_tanh ... FAILED
test ops::tensor_transpose ... ok
test ops::test_add_slice ... ok
test range_arg::test_s_macro ... ok
test ops::test_copy_to_slice ... ok
test ops::test_fill_slice ... ok
test ops::test_multiply_slice ... ok
test tensor::test_tensor_incomplete_slice ... ok
test tensor::test_tensor_read ... ok

failures:

---- ops::tensor_dtanh stdout ----
Using OpenCL Device: Intel(R) HD Graphics 4000
[
[1 0.41997433 0.0706507]
[0.009866118 0.0013408661 0.00018155575]
[0.000024437904 0.00000333786 0.00000047683716]
[0 0 0]
[0 0 0]
]

thread 'ops::tensor_dtanh' panicked at 'assertion failed: b.buffer() ==
&[1.0, 0.4199744, 0.070650816, 0.009865999, 0.0013408661, 0.00018167496,
0.000024676323, 0.00000333786, 0.00000047683716, 0.0, 0.0, 0.0, 0.0,
0.0, 0.0]', src\ops.rs:803
note: Run with RUST_BACKTRACE=1 for a backtrace.

---- ops::tensor_tanh stdout ----
Using OpenCL Device: Intel(R) HD Graphics 4000
[
[0 0.7615942 0.96402764]
[0.9950547 0.9993293 0.9999092]
[0.9999878 0.99999833 0.99999976]
[1 1 1]
[1 1 1]
]

thread 'ops::tensor_tanh' panicked at 'assertion failed: b.buffer() ==
&[0.0, 0.7615941, 0.9640276, 0.9950548, 0.9993293, 0.99990916, 0.99998766,
0.99999833, 0.99999976, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]', src\ops.rs:782

failures:
ops::tensor_dtanh
ops::tensor_tanh

test result: FAILED. 23 passed; 2 failed; 0 ignored; 0 measured

error: test failed

@davidleon
Author

davidleon commented Jun 18, 2016

Also, please change the gpuarray context.rs as follows:
diff --git a/src/context.rs b/src/context.rs
index 2d74a6a..bccc242 100644
--- a/src/context.rs
+++ b/src/context.rs
@@ -17,7 +17,7 @@ impl Context {
include_str!("cl/main.cl"),
include_str!("cl/slice_ops.cl"));

- let (device, ctx, queue) = opencl::util::create_compute_context().unwrap();
+ let (device, ctx, queue) = opencl::util::create_compute_context_prefer(opencl::util::PreferedType::GPUPrefered).unwrap();

println!("Using OpenCL Device: {}", device.name());

and change the rust-opencl src/cl.rs as follows:
diff --git a/src/cl.rs b/src/cl.rs
index 1919337..918bd5d 100644
--- a/src/cl.rs
+++ b/src/cl.rs
@@ -436,7 +436,7 @@ pub mod ll {
use cl::*;
use libc;

-
+ #[link(name="opencl")]
extern
{
/* Platform APIs */
Both changes are meant to improve platform compatibility.

@davidleon
Author

Hmm, interesting. There are some rounding errors. Also, I don't understand why the code in gpuarray/ops.rs doesn't need to call queue.write(args) the way the opencl examples do.

@tedsta
Owner

tedsta commented Jun 18, 2016

I applied your patches. In the future you can always do a pull request ;)

Yeah the rounding errors make sense. I should probably change those tanh and dtanh unit tests to compare with an epsilon tolerance.
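
A minimal sketch of what such a tolerance-based comparison could look like. The helper name approx_eq and the tolerance 1e-6 are assumptions for illustration, not code from gpuarray-rs; the sample values are the first few from the failing tensor_tanh output above.

```rust
/// Hypothetical helper (not part of gpuarray-rs): returns true when both
/// slices have the same length and every pair of elements differs by at
/// most `eps` (absolute tolerance).
fn approx_eq(actual: &[f32], expected: &[f32], eps: f32) -> bool {
    actual.len() == expected.len()
        && actual.iter().zip(expected).all(|(a, e)| (a - e).abs() <= eps)
}

fn main() {
    // Values taken from the tensor_tanh failure reported in this thread.
    let actual = [0.0_f32, 0.7615942, 0.96402764, 0.9950547, 0.9993293, 0.9999092];
    let expected = [0.0_f32, 0.7615941, 0.9640276, 0.9950548, 0.9993293, 0.99990916];

    // Exact equality is what makes the current tests fail on this device.
    assert!(actual != expected);
    // An absolute tolerance (1e-6 is an assumed value) accepts the result.
    assert!(approx_eq(&actual, &expected, 1e-6));
}
```

An absolute tolerance is enough here because tanh and dtanh outputs lie in [-1, 1]; values with a wider range would probably call for a relative tolerance instead.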

The gpuarray operations operate on Tensors, which are basically just a wrapper around CLBuffers. It is the user's responsibility to Tensor::set the buffer with the right data.

Does the mnist example work for you now? If so, please close the issue. Thanks for working through it with me!

@davidleon
Author

Yeah, mnist works, but the accuracy is just 90-something percent. Is that expected? If so, you can close it. Also, why on earth doesn't waiting on 2 events work? I'm curious why you wrote it that way and why it doesn't work in my case.

@tedsta
Owner

tedsta commented Jun 19, 2016

Yes, right now the validation accuracy is just 92%. It's just a simple linear model.

I have it set up so each Tensor has an associated event that represents the last kernel that wrote to the Tensor's buffer. Operations using a given Tensor as an input need to wait on its event to make sure the last kernel that wrote to it is done executing. The problem was that the event was an Option<Rc<Event>>, so it was possible to get event lists with NULL events (if the event was None in the Tensor, for instance after creating the Tensor but before any use in kernels). I guessed that your OpenCL implementation didn't like those NULL events, so I now create user events and set them as complete when creating Tensors, which means events are no longer Options in Tensor. I hope that makes sense!
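
The change described above can be sketched in plain Rust, without touching OpenCL. Every type and function here (Event, OptTensor, Tensor, null_entries, wait_list) is a hypothetical stand-in for illustration, not the rust-opencl or gpuarray-rs API. The idea is only to show why an optional event can leave a hole in the wait list and how an always-present, already-complete event avoids it.

```rust
use std::rc::Rc;

/// Hypothetical stand-in for an OpenCL event handle (not the rust-opencl type).
#[derive(Debug)]
struct Event {
    complete: bool,
}

impl Event {
    /// Models a user event that is marked complete as soon as it is created.
    fn completed() -> Rc<Event> {
        Rc::new(Event { complete: true })
    }
}

/// Before: a freshly created tensor has no event yet.
struct OptTensor {
    event: Option<Rc<Event>>,
}

/// After: every tensor always carries an event.
struct Tensor {
    event: Rc<Event>,
}

impl Tensor {
    fn new() -> Tensor {
        Tensor { event: Event::completed() }
    }
}

/// Counts the inputs whose missing event would become a NULL wait-list entry.
fn null_entries(inputs: &[&OptTensor]) -> usize {
    inputs.iter().filter(|t| t.event.is_none()).count()
}

/// With non-optional events, the wait list has one valid entry per input.
fn wait_list(inputs: &[&Tensor]) -> Vec<Rc<Event>> {
    inputs.iter().map(|t| Rc::clone(&t.event)).collect()
}

fn main() {
    let a = OptTensor { event: None };
    let b = OptTensor { event: Some(Event::completed()) };
    println!("NULL entries before: {}", null_entries(&[&a, &b]));

    let (x, y) = (Tensor::new(), Tensor::new());
    let list = wait_list(&[&x, &y]);
    println!("entries after: {}, all complete: {}",
             list.len(), list.iter().all(|e| e.complete));
}
```

Another option would have been to filter the None entries out of the list before enqueuing; giving each Tensor a completed event keeps the wait-list code uniform across one-input and two-input operations.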

@tedsta tedsta closed this as completed Jun 19, 2016