Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Signal an error when number of index argument exceeds array dimension #418

Closed
ukoethe opened this issue Sep 16, 2017 · 15 comments
Closed

Signal an error when number of index argument exceeds array dimension #418

ukoethe opened this issue Sep 16, 2017 · 15 comments

Comments

@ukoethe
Copy link
Contributor

ukoethe commented Sep 16, 2017

Currently, index access to arrays and ufuncs via operator() implements very permissive semantics: When passing too many arguments, only the last D are used, where D is the dimension. I'm convinced that this will lead to errors that are hard to debug. Instead, the function should throw an exception (if D is only known at runtime) or a static_assert (if D is known at compile time). I can't see any negative side effects of this change.

@JohanMabille
Copy link
Member

Actually, assuming a is a 2D expression and b is a 1D expression, the following code should be valid:

auto expr = a + b;
double res = expr(2, 3); // This is computed as a(2, 3) + b(2, 3);

More formally, we want the following to be true, whatever the dimensions of a and b are:

(a+b)(i0, ..., in) = a(i0, ..., in) + b(i0, ..., in);

This is not possible if we forbid more arguments than the dimension.

@ukoethe
Copy link
Contributor Author

ukoethe commented Sep 16, 2017

Then the cleanest solution would be to provide two functions:

  • b(2, 3) signals an error
  • b.broadcast(2, 3) implements the current behavior and makes explicit that this is intended (maybe find a shorter name).

@ukoethe
Copy link
Contributor Author

ukoethe commented Sep 16, 2017

BTW, numpy also signals an error when too many indices are provided and still manages to implement broadcasting.

@ukoethe
Copy link
Contributor Author

ukoethe commented Sep 16, 2017

Another solution to the broadcasting problem worked successfully in my vigra2 prototype: Embed the smaller array in an array view with the correct dimension and set the stride of the missing dimensions to 0. This has the additional advantage that one can broadcast over any dimension, not just the first.

@JohanMabille
Copy link
Member

JohanMabille commented Sep 16, 2017

Numpy implements broadcasting for expression assignment only, it does not provide unevaluated expressions with evaluation upon element access (which is what we're doing in the example that I posted above).

When assigning an expression, the broadcasting mechanism in xtensor is different from the one described here; we use iterators and steppers instead of the strides index scheme to avoid computing inner product; numpy uses a similar one with pointer arithmetic.

I don't really see how allowing more index arguments than the dimension can be hard to debug, as long as the behavior is specified in the documentation (which is something we should really do).

Also I'm not sure to get what you mean about broadcasting over any dimension and not just the first ones. Do you mean something like the following?

xarray<double> a({3, 1, 4});
xarray<double> b({3, 3, 4});
auto expr = a + b;
double res = expr(2, 2, 2);

That is, broadcasting w.r.t. to the second dimension ?

@ukoethe
Copy link
Contributor Author

ukoethe commented Sep 16, 2017

Do you mean broadcasting w.r.t. to the second dimension ?

Yes. But I suppose this already works?

"Hard to debug" refers to the case where one accidentaly writes more indices than intended, and the code silently produces incorrect results. If the results are not too far off, it may take a long time until one even notices that something is wrong.

@JohanMabille
Copy link
Member

JohanMabille commented Sep 16, 2017

About the broadcasting over any dimension, indeed that already works.

Considering an additional broadcasting method, we would have

(a+b).broadcast(i0, ..., in) == a.broadcast(i0, ..., in) + b.broadcast(i0, ..., in);

but (a+b)(i0, ...., in) may throw if a and b do not have the same dimension. Thus if I write some code that must work with any kind of expression, I will end up using the broadcast method everywhere instead of the operator(), to be sure I don't have any exception.

If the problem is debugging only, maybe a better solution would be to turn on exceptions according to a compilation flag (just like we do for bounds checking), or add an element access method that throws when it is provided too many arguments. Besides, I find it more "C++ish" (sorry for this barbarism), where the operator() can be very permissive, and other functions (like at) perform additional checks and may throw.

@SylvainCorlay
Copy link
Member

The desired behavior is that if too many indices are provided, we drop the first until we reach the right number. For containers, it is to prepend with zeros until we reach the right number. We narrowed this recently and some expressions still don't implement correctly the case where too few arguments are provided, (namely xview and xgenerator).

This is the only rule that guarantees that (a + b)(i0, ..., in) = a(i0, ..., in) + b(i0, ..., in) with the broadcasting rules when a and b have different dimensions.

I feel very strongly that this is how it should be across the library and that throwing exceptions on dimension mismatch is actually a mistake

@ukoethe
Copy link
Contributor Author

ukoethe commented Sep 16, 2017

but (a+b)(i0, ...., in) may throw if a and b do not have the same dimension.

I didn't mean to be that strict. (a+b)(i0, ...., in) would only throw if n was larger than the maximum dimension of a and b (or if the shapes were not compatible). Thus, it would be implemented as

(a+b)(i0, ..., in) == a.broadcast(i0, ..., in) + b.broadcast(i0, ..., in)

after checking that the number of indices on the LHS is valid.

maybe a better solution would be to turn on exceptions according to a compilation flag

That's also a possibility. I generally follow this rule: if the check doesn't significantly affect performance, turn it on by default, otherwise have the user turn it on explicitly. One should not forget that the C/C++ tradition of having very permissive functions originated in a time when computers were really slow. We don't have to continue like this just for historical reasons when the performance argument is no longer valid (e.g., a bounds check may incur no performance hit at all if the CPU pipeline gets branch prediction right). Every diagnostic is helpful for the user!

@ukoethe
Copy link
Contributor Author

ukoethe commented Sep 16, 2017

For containers, it is to prepend with zeros until we reach the right number.

I don't argue against this rule at all. I just refer to the case when there are too many indices (meaning: more than the largest dimension in the expression) on the outermost expression/access.

@JohanMabille
Copy link
Member

JohanMabille commented Sep 17, 2017

Even with the less strict implementation

(a+b)(i0, ..., in) == a.broadcast(i0, ..., in) + b.broadcast(i0, ..., in)

writing code that must work with any kind of expressions, whatever the dimension, will end up using the broadcast method everywhere. One of the use case we have is a 2D PDE solver where coeffecients may be constant over a direction (or even the two). In that case, to save memory space, we use 1D (or 0D) expressions but access them as if they were 2D (so the same code works with every kind of coefficients).

About the C++ tradition, my point was more on the intuitive part of it: having used the STL, we expect operator[] / operator() of containers to not check their arguments, and to have dedicated functions for that (the counterpart of the vector<T>::at function). I find it more intuitive to keep this scheme, that is operator() / operator[] does not throw, specific method such at or safe_access do. Thus the user has the choice, he can work with the safe API if he needs to debug or if this does not hit performances on his machine, or with the one guaranteed to be the fastest.

Every diagnostic is helpful for the user!

I strongly agree with you

So here is what I propose to do for fixing that:

  • add an at method that checks dimension and bounds
  • add a compilation flag to activate dimension checking for operator(), for debugging purpose.

@ukoethe
Copy link
Contributor Author

ukoethe commented Sep 18, 2017

Thanks a lot. One can now even compare the speed with and w/o ckecking code activated and eventually keep the check active at places were it doesn't hurt performance or usability.

@JohanMabille
Copy link
Member

I forgot to mention, checking the dimension requires to compute the shape of the expression, which may add some overhead for some kinds of expressions where the shape is computed the first time it is asked.

This won't cause any performance hurt in most of the cases since the shape will be evaluated when assigning the expression, however this could give strange results in benchmarks.

@ukoethe
Copy link
Contributor Author

ukoethe commented Sep 18, 2017

If the expression only involves arrays whose dimension is known at compile time, the dimension check can be done in a metaprogramm followed by static_assert and costs nothing at runtime. This is probably the most powerful variant of the check I can think of.

@ukoethe
Copy link
Contributor Author

ukoethe commented Sep 18, 2017

This is especially useful because arrays with static dimensions are anyway the preferrable alternative in many situations.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants