[MRG] Input validation refactoring #3443

Merged
merged 1 commit into scikit-learn:master from amueller:input_validation_refactoring on Jul 20, 2014

5 participants

@amueller
Member

Refactor input validation to make it more consistent.

Todo

  • Tests
  • docstrings
@amueller
Member

Hopefully more readable and consistent input checking. Also fixed some corner cases.

@coveralls

Coverage Status

Coverage decreased (-0.01%) when pulling c3ecfed on amueller:input_validation_refactoring into 0807e19 on scikit-learn:master.

@arjoly arjoly commented on an outdated diff Jul 19, 2014
sklearn/utils/validation.py
+    elif string_format == "coo":
+        return sp.coo_matrix
+    else:
+        raise ValueError("Don't know how to construct a sparse matrix of type"
+                         " %s" % string_format)
+
+
+def _ensure_sparse_format(spmatrix, allowed_sparse, dtype, order, copy,
+                          force_all_finite, convert_sparse_to):
+    """Convert a sparse matrix to a given format.
+
+    Checks the sparse format of spmatrix and converts if necessary.
+
+    Parameters
+    ----------
+    spmatrix : scipy sparse matrix.
@arjoly
arjoly Jul 19, 2014 Member

The trailing '.' is not needed.

@arjoly arjoly and 2 others commented on an outdated diff Jul 19, 2014
sklearn/utils/validation.py
+    copy : boolean, default=False
+        Whether a forced copy will be triggered. If copy=False, a copy might
+        be triggered by a conversion.
+
+    force_all_finite : boolean, default=True
+        Whether to raise an error on np.inf and np.nan in X.
+
+    convert_sparse_to : string or None (default).
+        Sparse format to convert sparse matrices to if allowed_sparse is not
+        None. By default, the first entry of allowed_sparse will be used.
+
+    make_2d : boolean, default=True
+        Whether to make X at least 2d.
+
+    allow_nd : boolean, default=True
+        Whether to allow nd X.
@arjoly
arjoly Jul 19, 2014 Member

What is nd?

@amueller
amueller Jul 19, 2014 Member

Should say "allow X.ndim > 2"

@GaelVaroquaux
GaelVaroquaux Jul 19, 2014 Member

In the docstring, yes, but I must say that I find the name quite clear.
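
For illustration, a minimal usage sketch of the allow_nd flag, based on the check_array code shown later in this diff (the exact error message may differ):

import numpy as np
from sklearn.utils.validation import check_array

X_3d = np.zeros((2, 3, 4))
check_array(X_3d, allow_nd=True)   # the 3-D array is passed through unchanged
try:
    check_array(X_3d)              # allow_nd=False (the default) rejects ndim > 2
except ValueError as exc:
    print(exc)                     # e.g. "Found array with dim 3. Expected <= 2"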

@arjoly arjoly and 1 other commented on an outdated diff Jul 19, 2014
sklearn/utils/validation.py
+
+    order : 'F', 'C' or None (default)
+        Whether an array will be forced to be fortran or c-style.
+
+    copy : boolean, default=False
+        Whether a forced copy will be triggered. If copy=False, a copy might
+        be triggered by a conversion.
+
+    force_all_finite : boolean, default=True
+        Whether to raise an error on np.inf and np.nan in X.
+
+    convert_sparse_to : string or None (default).
+        Sparse format to convert sparse matrices to if allowed_sparse is not
+        None. By default, the first entry of allowed_sparse will be used.
+
+    make_2d : boolean, default=True
@arjoly
arjoly Jul 19, 2014 Member

make_2d -> ensure_2d?

@GaelVaroquaux
GaelVaroquaux Jul 19, 2014 Member

make_2d -> ensure_2d?

👍

@arjoly arjoly commented on an outdated diff Jul 19, 2014
sklearn/utils/validation.py
+                force_all_finite=True, convert_sparse_to=None, make_2d=True,
+                allow_nd=True):
+    """Input validation on an array, list, sparse matrix or similar.
+
+    By default, the input is converted to an at least 2D numpy array.
+
+    Parameters
+    ----------
+    array : object
+        Input object to check / convert.
+
+    allowed_sparse : string or list of string, default=None
+        String[s] representing allowed sparse matrix formats, such as 'csc',
+        'csr', etc. None means that sparse matrix input will raise an error.
+        If the input is sparse but not in the allowed format, it will be
+        converted to convert_sparse_to.
@arjoly
arjoly Jul 19, 2014 Member

Can you add what the available strings are?

@arjoly arjoly and 1 other commented on an outdated diff Jul 19, 2014
sklearn/utils/validation.py
+
+def _sparse_matrix_constructor(string_format):
+    """Get constructor from a sparse matrix string format."""
+    if string_format == "csr":
+        return sp.csr_matrix
+    elif string_format == "csc":
+        return sp.csc_matrix
+    elif string_format == "coo":
+        return sp.coo_matrix
+    else:
+        raise ValueError("Don't know how to construct a sparse matrix of type"
+                         " %s" % string_format)
+
+
+def _ensure_sparse_format(spmatrix, allowed_sparse, dtype, order, copy,
+                          force_all_finite, convert_sparse_to):
@arjoly
arjoly Jul 19, 2014 Member

Is there a point in making a function for this?

@amueller
amueller Jul 19, 2014 Member

To make the code more readable. I could also put it into the check_array function.

@arjoly arjoly commented on an outdated diff Jul 19, 2014
sklearn/utils/validation.py
+    else:
+        raise ValueError("Unknown sparse matrix format passed: %s"
+                         % type(sparse_matrix))
+
+
+def _sparse_matrix_constructor(string_format):
+    """Get constructor from a sparse matrix string format."""
+    if string_format == "csr":
+        return sp.csr_matrix
+    elif string_format == "csc":
+        return sp.csc_matrix
+    elif string_format == "coo":
+        return sp.coo_matrix
+    else:
+        raise ValueError("Don't know how to construct a sparse matrix of type"
+                         " %s" % string_format)
@arjoly
arjoly Jul 19, 2014 Member

I would use a dictionary for that.
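
A sketch of the dictionary-based lookup suggested here, as an alternative to the if/elif chain (illustrative only, not the code that was merged):

import scipy.sparse as sp

# Hypothetical module-level mapping from format string to constructor.
_SPARSE_CONSTRUCTORS = {"csr": sp.csr_matrix,
                        "csc": sp.csc_matrix,
                        "coo": sp.coo_matrix}


def _sparse_matrix_constructor(string_format):
    """Get constructor from a sparse matrix string format."""
    try:
        return _SPARSE_CONSTRUCTORS[string_format]
    except KeyError:
        raise ValueError("Don't know how to construct a sparse matrix of type"
                         " %s" % string_format)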

@arjoly arjoly and 1 other commented on an outdated diff Jul 19, 2014
sklearn/utils/validation.py
@@ -175,6 +133,177 @@ def _num_samples(x):
     return x.shape[0] if hasattr(x, 'shape') else len(x)
+def check_consistent_length(*arrays):
+    """Check that all arrays have consistent first dimensions.
+
+    Checks whether all objects in arrays have the same shape or length.
+
+    Parameters
+    ----------
+    arrays : list or tuple of input objects.
+        Objects that will be checked for consistent length.
+    """
+
+    n_samples = [_num_samples(X) for X in arrays if X is not None]
+    uniques = np.unique(n_samples)
+    if len(uniques) > 1:
@arjoly
arjoly Jul 19, 2014 Member

len(set(_num_samples(X) for X in arrays if X is not None)) > 1?

@amueller
amueller Jul 19, 2014 Member

And how would you do the error message? Yeah, I could put the unique in the top line.

@arjoly
arjoly Jul 19, 2014 Member

yeah I could put the unique in the top line.

+1
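
A sketch of the variant being discussed, with np.unique hoisted into the first line so it also feeds the error message (not necessarily the exact merged code; _num_samples is the existing private helper in sklearn.utils.validation):

import numpy as np


def check_consistent_length(*arrays):
    """Check that all non-None inputs have the same number of samples."""
    lengths = np.unique([_num_samples(X) for X in arrays if X is not None])
    if len(lengths) > 1:
        raise ValueError("Found arrays with inconsistent numbers of samples: %s"
                         % str(lengths))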

@arjoly arjoly and 1 other commented on an outdated diff Jul 19, 2014
sklearn/utils/validation.py
+    Returns
+    -------
+    spmatrix_convertd : scipy sparse matrix.
+        Matrix that is ensured to have an allowed type (or convert_sparse_to).
+    """
+    if allowed_sparse is None:
+        raise TypeError('A sparse matrix was passed, but dense '
+                        'data is required. Use X.toarray() to '
+                        'convert to a dense numpy array.')
+    sparse_type = spmatrix.format
+    if sparse_type in allowed_sparse:
+        # correct type
+        if dtype == spmatrix.dtype or dtype is None:
+            # correct dtype
+            if copy:
+                spmatrix = spmatrix.copy()
@arjoly
arjoly Jul 19, 2014 Member

Feels like this is not in the right place.

@amueller
amueller Jul 19, 2014 Member

Why? I could also put it at the very end, but that wouldn't really make a difference.

@arjoly
arjoly Jul 19, 2014 Member

I would do this earlier, before the type coercion.

@amueller
amueller Jul 19, 2014 Member

but then you might copy twice!

@arjoly
arjoly Jul 19, 2014 Member

Yeah, but you might change the original data, no?
This requires tests.
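
To make the trade-off above concrete, here is a hypothetical helper illustrating "convert first, then copy only if nothing has been copied yet", which avoids both the double copy and mutating the caller's data (a sketch, not the merged code):

def _convert_then_copy(spmatrix, dtype, copy):
    """Convert dtype first; copy afterwards only if no copy happened yet.

    spmatrix is assumed to be a scipy sparse matrix.
    """
    original = spmatrix
    if dtype is not None and spmatrix.dtype != dtype:
        spmatrix = spmatrix.astype(dtype)   # astype already returns a new matrix
    if copy and spmatrix is original:
        spmatrix = spmatrix.copy()          # explicit copy only when still needed
    return spmatrix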

@coveralls

Coverage Status

Coverage decreased (-0.01%) when pulling f72aa17 on amueller:input_validation_refactoring into 0807e19 on scikit-learn:master.

@arjoly arjoly and 1 other commented on an outdated diff Jul 19, 2014
sklearn/utils/validation.py
+    if sparse_type in allowed_sparse:
+        # correct type
+        if dtype == spmatrix.dtype or dtype is None:
+            # correct dtype
+            if copy:
+                spmatrix = spmatrix.copy()
+        else:
+            # convert dtype
+            spmatrix = spmatrix.astype(dtype)
+    else:
+        # create new
+        spmatrix = _sparse_matrix_constructor(convert_sparse_to)(
+            spmatrix, copy=copy, dtype=dtype)
+    if force_all_finite:
+        _assert_all_finite(spmatrix.data)
+    spmatrix.data = np.array(spmatrix.data, copy=False, order=order)
@arjoly
arjoly Jul 19, 2014 Member

It assumes that you have a .data attribute, which is not the case for every sparse matrix format.

@amueller
amueller Jul 19, 2014 Member

That is true. The tests don't seem to be so great ^^ We need to add a test for that.

@GaelVaroquaux GaelVaroquaux commented on an outdated diff Jul 19, 2014
sklearn/utils/validation.py
+ Whether an array will be forced to be fortran or c-style.
+
+ copy : boolean, default=False
+ Whether a forced copy will be triggered. If copy=False, a copy might
+ be triggered by a conversion.
+
+ force_all_finite : boolean, default=True
+ Whether to raise an error on np.inf and np.nan in X.
+
+ convert_sparse_to : string or None (default).
+ Sparse format to convert sparse matrices to if allowed_sparse is not
+ None. By default, the first entry of allowed_sparse will be used.
+
+ Returns
+ -------
+ spmatrix_convertd : scipy sparse matrix.
@GaelVaroquaux
GaelVaroquaux Jul 19, 2014 Member

I think that there is a typo here: 'converted'.

@coveralls

Coverage Status

Coverage decreased (-0.0%) when pulling d03c84e on amueller:input_validation_refactoring into 4ec8630 on scikit-learn:master.

@GaelVaroquaux GaelVaroquaux and 1 other commented on an outdated diff Jul 19, 2014
sklearn/utils/validation.py
+        raise TypeError('A sparse matrix was passed, but dense '
+                        'data is required. Use X.toarray() to '
+                        'convert to a dense numpy array.')
+    sparse_type = spmatrix.format
+    if sparse_type in allowed_sparse:
+        # correct type
+        if dtype == spmatrix.dtype or dtype is None:
+            # correct dtype
+            if copy:
+                spmatrix = spmatrix.copy()
+        else:
+            # convert dtype
+            spmatrix = spmatrix.astype(dtype)
+    else:
+        # create new
+        spmatrix = _sparse_matrix_constructor(convert_sparse_to)(
@GaelVaroquaux
GaelVaroquaux Jul 19, 2014 Member

I think that here I would use the 'asformat' method of sparse matrices.

@amueller
amueller Jul 19, 2014 Member

Ahh, I had been looking for this method and couldn't find it.
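
For reference, scipy sparse matrices expose an asformat method that does this conversion directly, so the explicit constructor lookup is not needed:

import scipy.sparse as sp

X_coo = sp.coo_matrix([[1, 0], [0, 2]])
X_csr = X_coo.asformat("csr")   # equivalent to sp.csr_matrix(X_coo)
print(X_csr.format)             # 'csr'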

@GaelVaroquaux GaelVaroquaux and 1 other commented on an outdated diff Jul 19, 2014
sklearn/utils/validation.py
+        String[s] representing allowed sparse matrix formats, such as 'csc',
+        'csr', etc. None means that sparse matrix input will raise an error.
+        If the input is sparse but not in the allowed format, it will be
+        converted to convert_sparse_to.
+
+    order : 'F', 'C' or None (default)
+        Whether an array will be forced to be fortran or c-style.
+
+    copy : boolean, default=False
+        Whether a forced copy will be triggered. If copy=False, a copy might
+        be triggered by a conversion.
+
+    force_all_finite : boolean, default=True
+        Whether to raise an error on np.inf and np.nan in X.
+
+    convert_sparse_to : string or None (default).
@GaelVaroquaux
GaelVaroquaux Jul 19, 2014 Member

Do we actually need this argument, or could we do simply with 'allowed_sparse'?

@amueller
amueller Jul 19, 2014 Member

You are right, I introduced it before thinking about the default behavior. I'll remove it.

@arjoly
Member
arjoly commented Jul 19, 2014

Is dense data always allowed?

@amueller
Member

Yes, currently dense data is always allowed. Do we have an application where this is not the case? If we do at some point, we could later add a flag?

@coveralls

Coverage Status

Coverage decreased (-0.01%) when pulling 1be9e46 on amueller:input_validation_refactoring into 4ec8630 on scikit-learn:master.

@amueller
Member

Thanks for the helpful reviews, guys :)

@amueller amueller changed the title from [WIP] Input validation refactoring to [MRG] Input validation refactoring Jul 19, 2014
@coveralls

Coverage Status

Coverage increased (+0.01%) when pulling 33f20bf on amueller:input_validation_refactoring into 4ec8630 on scikit-learn:master.

@GaelVaroquaux
Member

I see that this has switched to MRG, but the TODO still lists tests. What's the status on that?

@arjoly arjoly commented on the diff Jul 19, 2014
sklearn/utils/__init__.py
from .class_weight import compute_class_weight
from sklearn.utils.sparsetools import minimum_spanning_tree
__all__ = ["murmurhash3_32", "as_float_array", "check_arrays", "safe_asarray",
-           "assert_all_finite", "array2d", "atleast2d_or_csc",
+           "assert_all_finite", "array2d", "atleast2d_or_csc", "check_array",
            "atleast2d_or_csr",
            "warn_if_not_float",
            "check_random_state",
@arjoly
arjoly Jul 19, 2014 Member

Can you remove deprecated stuff?

@amueller
amueller Jul 20, 2014 Member

Not in this PR, but in the next PR which will touch all files.

@arjoly arjoly commented on the diff Jul 19, 2014
sklearn/utils/tests/test_validation.py
@@ -223,3 +226,106 @@ def test_check_arrays():
     # check that lists are passed through if force_arrays is true
     X_, Y_ = check_arrays(X, Y, force_arrays=False)
     assert_true(isinstance(X_, list))
+
+
+def test_check_array():
+    # allowed_sparse == None
+    # raise error on sparse inputs
+    X = [[1, 2], [3, 4]]
+    X_csr = sp.csr_matrix(X)
+    assert_raises(TypeError, check_array, X_csr)
+    # ensure_2d
@arjoly
arjoly Jul 19, 2014 Member

Can you add a blank line between each case to ease reading?

@arjoly arjoly commented on an outdated diff Jul 19, 2014
sklearn/utils/tests/test_validation.py
+    # nan check
+    X_nan = np.arange(4).reshape(2, 2).astype(np.float)
+    X_nan[0, 0] = np.nan
+    assert_raises(ValueError, check_array, X_nan)
+    check_array(X_inf, force_all_finite=False)  # no raise
+
+    # dtype and order enforcement.
+    X_C = np.arange(4).reshape(2, 2).copy("C")
+    X_F = X_C.copy("F")
+    X_int = X_C.astype(np.int)
+    X_float = X_C.astype(np.float)
+
+    for X in [X_C, X_F, X_int, X_float]:
+        for dtype in [np.int32, np.int, np.float, np.float32, None]:
+            for order in ['C', 'F', None]:
+                for copy in [True, False]:
@arjoly
arjoly Jul 19, 2014 Member

Can you use the product trick here?
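
A sketch of the itertools.product refactor being suggested for the nested loops above (names mirror the test, but this is not the merged test code; the deprecated np.int / np.float aliases are replaced by explicit dtypes so the sketch runs on current NumPy):

from itertools import product

import numpy as np

X_C = np.arange(4).reshape(2, 2).copy("C")
X_F = X_C.copy("F")
X_int = X_C.astype(np.int32)
X_float = X_C.astype(np.float64)

for X, dtype, order, copy in product([X_C, X_F, X_int, X_float],
                                     [np.int32, np.float64, None],
                                     ['C', 'F', None],
                                     [True, False]):
    # same body as before, e.g.:
    # X_checked = check_array(X, dtype=dtype, order=order, copy=copy)
    pass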

@arjoly arjoly commented on an outdated diff Jul 19, 2014
sklearn/utils/tests/test_validation.py
+    assert_raises(ValueError, check_array, X_inf)
+    check_array(X_inf, force_all_finite=False)  # no raise
+    # nan check
+    X_nan = np.arange(4).reshape(2, 2).astype(np.float)
+    X_nan[0, 0] = np.nan
+    assert_raises(ValueError, check_array, X_nan)
+    check_array(X_inf, force_all_finite=False)  # no raise
+
+    # dtype and order enforcement.
+    X_C = np.arange(4).reshape(2, 2).copy("C")
+    X_F = X_C.copy("F")
+    X_int = X_C.astype(np.int)
+    X_float = X_C.astype(np.float)
+
+    for X in [X_C, X_F, X_int, X_float]:
+        for dtype in [np.int32, np.int, np.float, np.float32, None]:
@arjoly
arjoly Jul 19, 2014 Member

Can you add np.object and np.bool?

@arjoly arjoly commented on an outdated diff Jul 19, 2014
sklearn/utils/tests/test_validation.py
+                        X_checked.flags['C_CONTIGUOUS'] ==
+                        X.flags['C_CONTIGUOUS'] and
+                        X_checked.flags['F_CONTIGUOUS'] ==
+                        X.flags['F_CONTIGUOUS']):
+                    assert_true(X is X_checked)
+
+    # allowed sparse != None
+    X_csc = sp.csc_matrix(X_C)
+    X_coo = X_csc.tocoo()
+    X_dok = X_csc.todok()
+    X_int = X_csc.astype(np.int)
+    X_float = X_csc.astype(np.float)
+    for X in [X_csc, X_coo, X_dok, X_int, X_float]:
+        for dtype in [np.int32, np.int, np.float, np.float32, None]:
+            for allowed_sparse in [['csr', 'coo'], ['coo', 'dok']]:
+                for copy in [True, False]:
@arjoly
arjoly Jul 19, 2014 Member

Can you use product here?

@arjoly arjoly commented on the diff Jul 19, 2014
sklearn/utils/validation.py
@@ -175,6 +133,142 @@ def _num_samples(x):
     return x.shape[0] if hasattr(x, 'shape') else len(x)
+def check_consistent_length(*arrays):
@arjoly
arjoly Jul 19, 2014 Member

Should we keep *arrays or take a list / iterable of arrays instead?

@amueller
amueller Jul 20, 2014 Member

I think in this function it is fine. I'm thinking about marking it as private.

@arjoly arjoly commented on an outdated diff Jul 19, 2014
sklearn/utils/validation.py
+ """Convert a sparse matrix to a given format.
+
+ Checks the sparse format of spmatrix and converts if necessary.
+
+ Parameters
+ ----------
+ spmatrix : scipy sparse matrix
+ Input to validate and convert.
+
+ allowed_sparse : string or list of string, default=None
+ String[s] representing allowed sparse matrix formats ('csc',
+ 'csr', 'coo', 'dok', 'bsr', 'lil', 'dia'). None means that sparse
+ matrix input will raise an error. If the input is sparse but not in
+ the allowed format, it will be converted to the first listed format.
+
+ order : 'F', 'C' or None (default)
@arjoly
arjoly Jul 19, 2014 Member

order : 'F', 'C' or None (default=None)

@arjoly arjoly commented on an outdated diff Jul 19, 2014
sklearn/utils/validation.py
+        raise ValueError("Found arrays with inconsistent numbers of samples: %s"
+                         % str(uniques))
+
+
+def _ensure_sparse_format(spmatrix, allowed_sparse, dtype, order, copy,
+                          force_all_finite):
+    """Convert a sparse matrix to a given format.
+
+    Checks the sparse format of spmatrix and converts if necessary.
+
+    Parameters
+    ----------
+    spmatrix : scipy sparse matrix
+        Input to validate and convert.
+
+    allowed_sparse : string or list of string, default=None
@arjoly
arjoly Jul 19, 2014 Member

default=None => (default=None)

@arjoly arjoly commented on an outdated diff Jul 19, 2014
sklearn/utils/validation.py
+
+    Parameters
+    ----------
+    spmatrix : scipy sparse matrix
+        Input to validate and convert.
+
+    allowed_sparse : string or list of string, default=None
+        String[s] representing allowed sparse matrix formats ('csc',
+        'csr', 'coo', 'dok', 'bsr', 'lil', 'dia'). None means that sparse
+        matrix input will raise an error. If the input is sparse but not in
+        the allowed format, it will be converted to the first listed format.
+
+    order : 'F', 'C' or None (default)
+        Whether an array will be forced to be fortran or c-style.
+
+    copy : boolean, default=False
@arjoly
arjoly Jul 19, 2014 Member

copy : boolean, (default=False)

@arjoly arjoly commented on an outdated diff Jul 19, 2014
sklearn/utils/validation.py
+        Input to validate and convert.
+
+    allowed_sparse : string or list of string, default=None
+        String[s] representing allowed sparse matrix formats ('csc',
+        'csr', 'coo', 'dok', 'bsr', 'lil', 'dia'). None means that sparse
+        matrix input will raise an error. If the input is sparse but not in
+        the allowed format, it will be converted to the first listed format.
+
+    order : 'F', 'C' or None (default)
+        Whether an array will be forced to be fortran or c-style.
+
+    copy : boolean, default=False
+        Whether a forced copy will be triggered. If copy=False, a copy might
+        be triggered by a conversion.
+
+    force_all_finite : boolean, default=True
@arjoly
arjoly Jul 19, 2014 Member

default=True => (default=True)

@arjoly arjoly and 1 other commented on an outdated diff Jul 19, 2014
sklearn/utils/validation.py
+    ----------
+    spmatrix : scipy sparse matrix
+        Input to validate and convert.
+
+    allowed_sparse : string or list of string, default=None
+        String[s] representing allowed sparse matrix formats ('csc',
+        'csr', 'coo', 'dok', 'bsr', 'lil', 'dia'). None means that sparse
+        matrix input will raise an error. If the input is sparse but not in
+        the allowed format, it will be converted to the first listed format.
+
+    order : 'F', 'C' or None (default)
+        Whether an array will be forced to be fortran or c-style.
+
+    copy : boolean, default=False
+        Whether a forced copy will be triggered. If copy=False, a copy might
+        be triggered by a conversion.
@arjoly
arjoly Jul 19, 2014 Member

Could you add that a copy might be triggered if necessary?

@amueller
amueller Jul 20, 2014 Member

Sorry, I don't understand. That's what it says, right?

@arjoly arjoly commented on an outdated diff Jul 19, 2014
sklearn/utils/validation.py
+
+    Parameters
+    ----------
+    array : object
+        Input object to check / convert.
+
+    allowed_sparse : string or list of string, default=None
+        String[s] representing allowed sparse matrix formats, such as 'csc',
+        'csr', etc. None means that sparse matrix input will raise an error.
+        If the input is sparse but not in the allowed format, it will be
+        converted to the first listed format.
+
+    order : 'F', 'C' or None (default)
+        Whether an array will be forced to be fortran or c-style.
+
+    copy : boolean, default=False
@arjoly
arjoly Jul 19, 2014 Member

It would be nice to be consistent with the default.

@arjoly
arjoly Jul 19, 2014 Member

(docstring-wise)

@arjoly arjoly commented on the diff Jul 19, 2014
sklearn/utils/validation.py
+        if not hasattr(spmatrix, "data"):
+            warnings.warn("Can't check %s sparse matrix for nan or inf."
+                          % spmatrix.format)
+        else:
+            _assert_all_finite(spmatrix.data)
+    if hasattr(spmatrix, "data"):
+        spmatrix.data = np.array(spmatrix.data, copy=False, order=order)
+    return spmatrix
+
+
+def check_array(array, allowed_sparse=None, dtype=None, order=None, copy=False,
+                force_all_finite=True, ensure_2d=True, allow_nd=False):
+    """Input validation on an array, list, sparse matrix or similar.
+
+    By default, the input is converted to an at least 2D numpy array.
+
@arjoly
arjoly Jul 19, 2014 Member

Could you add that dense arrays are always allowed?

@arjoly arjoly commented on an outdated diff Jul 19, 2014
sklearn/utils/validation.py
+ """Input validation on an array, list, sparse matrix or similar.
+
+ By default, the input is converted to an at least 2nd numpy array.
+
+ Parameters
+ ----------
+ array : object
+ Input object to check / convert.
+
+ allowed_sparse : string or list of string, default=None
+ String[s] representing allowed sparse matrix formats, such as 'csc',
+ 'csr', etc. None means that sparse matrix input will raise an error.
+ If the input is sparse but not in the allowed format, it will be
+ converted to the first listed format.
+
+ order : 'F', 'C' or None (default)
@arjoly
arjoly Jul 19, 2014 Member

(default=None)

@arjoly
Member
arjoly commented Jul 19, 2014

Except for the cosmetic comment, 👍

@GaelVaroquaux GaelVaroquaux commented on the diff Jul 19, 2014
sklearn/feature_extraction/image.py
@@ -349,7 +349,7 @@ def extract_patches_2d(image, patch_size, max_patches=None, random_state=None):
     i_h, i_w = image.shape[:2]
     p_h, p_w = patch_size
-    image = array2d(image)
+    image = check_array(image, allow_nd=True)
@GaelVaroquaux
GaelVaroquaux Jul 19, 2014 Member

Hmm, that change is really surprising to me: I would read the two lines (the one removed and the one added) as doing very different things. It's probably just a question of the choice of argument names.

@amueller
amueller Jul 19, 2014 Member

That is because the previous behavior was surprising ;)

@GaelVaroquaux GaelVaroquaux and 1 other commented on an outdated diff Jul 19, 2014
sklearn/utils/tests/test_validation.py
+    for X in [X_C, X_F, X_int, X_float]:
+        for dtype in [np.int32, np.int, np.float, np.float32, None]:
+            for order in ['C', 'F', None]:
+                for copy in [True, False]:
+                    X_checked = check_array(X, dtype=dtype, order=order,
+                                            copy=copy)
+                    if dtype is not None:
+                        assert_equal(X_checked.dtype, dtype)
+                    else:
+                        assert_equal(X_checked.dtype, X.dtype)
+                    if order == 'C':
+                        assert_true(X_checked.flags['C_CONTIGUOUS'])
+                        assert_false(X_checked.flags['F_CONTIGUOUS'])
+                    elif order == 'F':
+                        assert_true(X_checked.flags['F_CONTIGUOUS'])
+                        assert_false(X_checked.flags['C_CONTIGUOUS'])
@GaelVaroquaux
GaelVaroquaux Jul 19, 2014 Member

Why are you checking that it is not C contiguous? I agree that, as this is a 2D array, it cannot be F and C contiguous, but in my opinion you are overconstraining the tests, which is not a good thing.

@amueller
amueller Jul 20, 2014 Member

I replicated the previous tests. I didn't want to test less than before.

@ogrisel ogrisel commented on an outdated diff Jul 19, 2014
sklearn/utils/validation.py
+def check_array(array, allowed_sparse=None, dtype=None, order=None, copy=False,
+                force_all_finite=True, ensure_2d=True, allow_nd=False):
+    """Input validation on an array, list, sparse matrix or similar.
+
+    By default, the input is converted to an at least 2D numpy array.
+
+    Parameters
+    ----------
+    array : object
+        Input object to check / convert.
+
+    allowed_sparse : string or list of string, default=None
+        String[s] representing allowed sparse matrix formats, such as 'csc',
+        'csr', etc. None means that sparse matrix input will raise an error.
+        If the input is sparse but not in the allowed format, it will be
+        converted to the first listed format.
@ogrisel
ogrisel Jul 19, 2014 Member

Very good.
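
A small usage sketch of the allowed_sparse conversion described in this docstring, using the parameter name as it appears in this PR (current scikit-learn releases call it accept_sparse):

import scipy.sparse as sp
from sklearn.utils.validation import check_array

X_coo = sp.coo_matrix([[1, 0], [0, 2]])
X_checked = check_array(X_coo, allowed_sparse=['csr', 'csc'])
print(X_checked.format)   # 'csr' -- converted to the first listed format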

@ogrisel ogrisel commented on the diff Jul 19, 2014
sklearn/utils/validation.py
+    if sp.issparse(array):
+        array = _ensure_sparse_format(array, allowed_sparse, dtype, order,
+                                      copy, force_all_finite)
+    else:
+        if ensure_2d:
+            array = np.atleast_2d(array)
+        array = np.array(array, dtype=dtype, order=order, copy=copy)
+        if not allow_nd and array.ndim >= 3:
+            raise ValueError("Found array with dim %d. Expected <= 2" %
+                             array.ndim)
+        if force_all_finite:
+            _assert_all_finite(array)
+
+    return array
+
+
def check_arrays(*arrays, **options):
@ogrisel
ogrisel Jul 19, 2014 Member

@amueller have you thought about deprecating this function now that we have check_array to check assumptions on individual data structures? If yes, what motivates you not to deprecate it in this PR? Too much code to update at once?

Would it make sense to try to deprecate it in a future PR?

@amueller
amueller Jul 20, 2014 Member

I'm working on it. I think it makes sense to merge this one first because it is reviewable ;)

@coveralls

Coverage Status

Coverage increased (+0.04%) when pulling 95afdc7 on amueller:input_validation_refactoring into 4ec8630 on scikit-learn:master.

@ogrisel
Member
ogrisel commented Jul 20, 2014

+1 for merging this PR as it is and dealing with further refactoring in a separate PR.

@arjoly
Member
arjoly commented Jul 20, 2014

+1

@GaelVaroquaux GaelVaroquaux merged commit 8dab222 into scikit-learn:master Jul 20, 2014

1 check passed

continuous-integration/travis-ci: The Travis CI build passed
@coveralls

Coverage Status

Coverage increased (+0.01%) when pulling f7549fd on amueller:input_validation_refactoring into 41d02e0 on scikit-learn:master.

@GaelVaroquaux
Member

Merged from the airport!

Awesome work! Go Go team!

@amueller
Member

Err... I just squashed, rebased and merged... hum... Let's see what'll happen now.

@amueller
Member

Oh, I hadn't pushed. All is good.
