Input validation refactoring #3440

Closed
amueller opened this Issue Jul 19, 2014 · 18 comments

Comments

Projects
None yet
4 participants
@amueller
Owner

amueller commented Jul 19, 2014

I propose to refactor the input validation. The current zoo of methods is kinda confusing.

Related to #3142.

Checks that we want are:

  • numpy array vs sparse vs list vs anything indexable
  • sparse matrix type
  • inf / nan
  • dtype
  • ndims
  • number of samples is consistent in multiple arrays.
  • contiguous

I'll now check if there is anything else that we currently check and see what functions we have.

Remaining Issues

  • possibly remove "make_float_array" and "1d_or_column" and also make these options of check_array.
  • Currently "ensure2d" makes vectors into rows, which I find pretty counter-intuitive. This is for backward compatibility. 1d input is not currently handled consistently.
@arjoly

This comment has been minimized.

Show comment Hide comment
@arjoly

arjoly Jul 19, 2014

Owner

+1

Owner

arjoly commented Jul 19, 2014

+1

@amueller

This comment has been minimized.

Show comment Hide comment
@amueller

amueller Jul 19, 2014

Owner

What we have

(only public functions)

  • assert_all_finite - elements of numpy / sparse matrix are not inf / nan
  • safe_asarray - let's through array, csr, csc, coo, converts other sparse formats to csr, optional finiteness and order.
  • array2d - raises on sparse, converts lists to arrays, makes atleast2d, optional finite, copy, order.
  • atleast2d_or_csx - converts sparse csr / csc, makes arrays atleast2d, optional finite, copy, order.
  • check_arrays - (I take the blame for large parts of this) checks that all inputs have the same len or shape[0], optionally converts sparse matrices to a given format, or passes through everything iterable, checks contiguity, dtype, finite.
  • column_or_1d - converts lists dense array, ravels 2ndim to 1ndim and optionally warns.
  • as_float_array - make array or sparse into float32 or float64 depending on input type.
Owner

amueller commented Jul 19, 2014

What we have

(only public functions)

  • assert_all_finite - elements of numpy / sparse matrix are not inf / nan
  • safe_asarray - let's through array, csr, csc, coo, converts other sparse formats to csr, optional finiteness and order.
  • array2d - raises on sparse, converts lists to arrays, makes atleast2d, optional finite, copy, order.
  • atleast2d_or_csx - converts sparse csr / csc, makes arrays atleast2d, optional finite, copy, order.
  • check_arrays - (I take the blame for large parts of this) checks that all inputs have the same len or shape[0], optionally converts sparse matrices to a given format, or passes through everything iterable, checks contiguity, dtype, finite.
  • column_or_1d - converts lists dense array, ravels 2ndim to 1ndim and optionally warns.
  • as_float_array - make array or sparse into float32 or float64 depending on input type.
@amueller

This comment has been minimized.

Show comment Hide comment
@amueller

amueller Jul 19, 2014

Owner

git grep array2d | wc -l
131

git grep safe_asarray | wc -l
59

git grep atleast2d_or_csr | wc -l
84

git grep atleast2d_or_csc | wc -l
28

git grep check_arrays | wc -l
152

Owner

amueller commented Jul 19, 2014

git grep array2d | wc -l
131

git grep safe_asarray | wc -l
59

git grep atleast2d_or_csr | wc -l
84

git grep atleast2d_or_csc | wc -l
28

git grep check_arrays | wc -l
152

@amueller

This comment has been minimized.

Show comment Hide comment
@amueller

amueller Jul 19, 2014

Owner

Ok my proposed interface is:

check_consistent_length(X, y)
# hopefully not as long in real live
X = check_array(X, allowed_sparse=['csr', 'lil'], convert_sparse_to='csr', dtype='any_float', order='C', copy=copy, check_finite=True)
y = column_or_1d(y)

and array_2d_or_csx would basically be a shortcut to allowed_sparse=['csx'], convert_sparse_to='csx'

Owner

amueller commented Jul 19, 2014

Ok my proposed interface is:

check_consistent_length(X, y)
# hopefully not as long in real live
X = check_array(X, allowed_sparse=['csr', 'lil'], convert_sparse_to='csr', dtype='any_float', order='C', copy=copy, check_finite=True)
y = column_or_1d(y)

and array_2d_or_csx would basically be a shortcut to allowed_sparse=['csx'], convert_sparse_to='csx'

@GaelVaroquaux

This comment has been minimized.

Show comment Hide comment
@GaelVaroquaux

GaelVaroquaux Jul 19, 2014

Owner

check_consistent_length(X, y)

hopefully not as long in real live

X = check_array(X, allowed_sparse={'csr'}, convert_sparse_to='csr', dtype=np.float, order='C', copy=copy)
y = column_or_1d(y)

I like that. I would also propose the renaming of check_array as
discussed earlier.

Owner

GaelVaroquaux commented Jul 19, 2014

check_consistent_length(X, y)

hopefully not as long in real live

X = check_array(X, allowed_sparse={'csr'}, convert_sparse_to='csr', dtype=np.float, order='C', copy=copy)
y = column_or_1d(y)

I like that. I would also propose the renaming of check_array as
discussed earlier.

@amueller

This comment has been minimized.

Show comment Hide comment
@amueller

amueller Jul 19, 2014

Owner

pr here: #3443

Owner

amueller commented Jul 19, 2014

pr here: #3443

@arjoly

This comment has been minimized.

Show comment Hide comment
@arjoly

arjoly Jul 19, 2014

Owner

We could also add check to raise ValueError on 64 bit sparse matrices if it's not supported.

Owner

arjoly commented Jul 19, 2014

We could also add check to raise ValueError on 64 bit sparse matrices if it's not supported.

@arjoly

This comment has been minimized.

Show comment Hide comment
@arjoly

arjoly Jul 20, 2014

Owner

There is still the as_float_array utility.

Owner

arjoly commented Jul 20, 2014

There is still the as_float_array utility.

@amueller

This comment has been minimized.

Show comment Hide comment
@amueller

amueller Jan 22, 2015

Owner

@arjoly can you comment on the 64bit sparse matrices? What is the issue there again? Otherwise I'd close this as I think the current state is pretty decent.

Owner

amueller commented Jan 22, 2015

@arjoly can you comment on the 64bit sparse matrices? What is the issue there again? Otherwise I'd close this as I think the current state is pretty decent.

@jnothman

This comment has been minimized.

Show comment Hide comment
@jnothman

jnothman Jan 23, 2015

Owner

Sparse matrices rows with intp-like types (csr/c/bsr indices, indptr; coo col, row) were constrained to a 32 bit int until very recently. If the user has a supporting scipy version, and we only perform scipy public API things, 64-bit index support should be seamless. If we perform our own cython ops, or if we use explicit dtyping, things will break with unhelpful error messages. So it's best if we validate the index dtype where that's going to happen, or fix our implementations to allow very big sparse matrices. Basically, we need to alter all the sparse matrix tests to test with both index dtypes, and go from there.

Owner

jnothman commented Jan 23, 2015

Sparse matrices rows with intp-like types (csr/c/bsr indices, indptr; coo col, row) were constrained to a 32 bit int until very recently. If the user has a supporting scipy version, and we only perform scipy public API things, 64-bit index support should be seamless. If we perform our own cython ops, or if we use explicit dtyping, things will break with unhelpful error messages. So it's best if we validate the index dtype where that's going to happen, or fix our implementations to allow very big sparse matrices. Basically, we need to alter all the sparse matrix tests to test with both index dtypes, and go from there.

@amueller

This comment has been minimized.

Show comment Hide comment
@amueller

amueller Jan 23, 2015

Owner

Right. That should actually not be too hard.

Owner

amueller commented Jan 23, 2015

Right. That should actually not be too hard.

@amueller

This comment has been minimized.

Show comment Hide comment
@amueller

amueller Jan 23, 2015

Owner

I think I will make that a new issue, though, as that is more about tighter testing.

Owner

amueller commented Jan 23, 2015

I think I will make that a new issue, though, as that is more about tighter testing.

@amueller amueller closed this Jan 23, 2015

@amueller

This comment has been minimized.

Show comment Hide comment
@amueller

amueller Jan 23, 2015

Owner
Owner

amueller commented Jan 23, 2015

@arjoly

This comment has been minimized.

Show comment Hide comment
@arjoly

arjoly Jan 23, 2015

Owner

Are there reasons to keep as_float_array?

Owner

arjoly commented Jan 23, 2015

Are there reasons to keep as_float_array?

@amueller

This comment has been minimized.

Show comment Hide comment
@amueller

amueller Jan 23, 2015

Owner

Actually, probably not.

Owner

amueller commented Jan 23, 2015

Actually, probably not.

@amueller amueller reopened this Jan 23, 2015

@amueller

This comment has been minimized.

Show comment Hide comment
@amueller

amueller Jan 23, 2015

Owner

There is a lot of logic in that one, and I'm not sure I want to include this all in the check_array function....

Owner

amueller commented Jan 23, 2015

There is a lot of logic in that one, and I'm not sure I want to include this all in the check_array function....

@amueller

This comment has been minimized.

Show comment Hide comment
@amueller

amueller Jan 23, 2015

Owner

Maybe that is a good candidate to fix after #4136 has been merged. A problem is that it is a lot used in function interfaces, which are not tested in the common tests :-/

Owner

amueller commented Jan 23, 2015

Maybe that is a good candidate to fix after #4136 has been merged. A problem is that it is a lot used in function interfaces, which are not tested in the common tests :-/

@amueller

This comment has been minimized.

Show comment Hide comment
@amueller

amueller Sep 13, 2016

Owner

closing, I think we're happy with check_array for now.

Owner

amueller commented Sep 13, 2016

closing, I think we're happy with check_array for now.

@amueller amueller closed this Sep 13, 2016

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment