[MRG] Input validation refactoring #3443

Merged
merged 1 commit into from

5 participants

@amueller
Owner

Refactor input validation to make it more consistent.

Todo

  • Tests
  • docstrings
@amueller
Owner

Hopefully more readable and consistent input checking. Also fixed some corner cases.

@coveralls

Coverage Status

Coverage decreased (-0.01%) when pulling c3ecfed on amueller:input_validation_refactoring into 0807e19 on scikit-learn:master.

sklearn/utils/validation.py
((42 lines not shown))
+ elif string_format == "coo":
+ return sp.coo_matrix
+ else:
+ raise ValueError("Don't know how to construct a sparse matrix of type"
+ " %s" % string_format)
+
+
+def _ensure_sparse_format(spmatrix, allowed_sparse, dtype, order, copy,
+ force_all_finite, convert_sparse_to):
+ """Convert a sparse matrix to a given format.
+
+ Checks the sparse format of spmatrix and converts if necessary.
+
+ Parameters
+ ----------
+ spmatrix : scipy sparse matrix.
@arjoly Owner
arjoly added a note

. is not needed

sklearn/utils/validation.py
((130 lines not shown))
+ copy : boolean, default=False
+ Whether a forced copy will be triggered. If copy=False, a copy might
+ be triggered by a conversion.
+
+ force_all_finite : boolean, default=True
+ Whether to raise an error on np.inf and np.nan in X.
+
+ convert_sparse_to : string or None (default).
+ Sparse format to convert sparse matrices to if allowed_sparse is not
+ None. By default, the first entry of allowed_sparse will be used.
+
+ make_2d : boolean, default=True
+ Whether to make X at least 2d.
+
+ allow_nd : boolean, default=True
+ Whether to allow nd X.
@arjoly Owner
arjoly added a note

What is nd?

@amueller Owner

Should say "allow X.ndim > 2"

@GaelVaroquaux Owner

In the docstring, yes, but I must say that I find the name quite clear.

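amueller's clarification ("allow X.ndim > 2") can be sketched with a toy version of check_array's dense path; this is a simplified illustration using names from the diff, not the real implementation:

```python
import numpy as np

def check_array(array, ensure_2d=True, allow_nd=False):
    """Toy sketch of the dense path, to show the allow_nd semantics."""
    if ensure_2d:
        array = np.atleast_2d(array)
    array = np.asarray(array)
    # allow_nd=False rejects anything beyond 2 dimensions.
    if not allow_nd and array.ndim >= 3:
        raise ValueError("Found array with dim %d. Expected <= 2"
                         % array.ndim)
    return array
```

With allow_nd=False a (2, 2, 2) input raises ValueError; with allow_nd=True it passes through unchanged.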
sklearn/utils/validation.py
((126 lines not shown))
+
+ order : 'F', 'C' or None (default)
+ Whether an array will be forced to be fortran or c-style.
+
+ copy : boolean, default=False
+ Whether a forced copy will be triggered. If copy=False, a copy might
+ be triggered by a conversion.
+
+ force_all_finite : boolean, default=True
+ Whether to raise an error on np.inf and np.nan in X.
+
+ convert_sparse_to : string or None (default).
+ Sparse format to convert sparse matrices to if allowed_sparse is not
+ None. By default, the first entry of allowed_sparse will be used.
+
+ make_2d : boolean, default=True
@arjoly Owner
arjoly added a note

make_2d -> ensure_2d?

@GaelVaroquaux Owner
sklearn/utils/validation.py
((110 lines not shown))
+ force_all_finite=True, convert_sparse_to=None, make_2d=True,
+ allow_nd=True):
+ """Input validation on an array, list, sparse matrix or similar.
+
+ By default, the input is converted to an at least 2nd numpy array.
+
+ Parameters
+ ----------
+ array : object
+ Input object to check / convert.
+
+ allowed_sparse : string or list of string, default=None
+ String[s] representing allowed sparse matrix formats, such as 'csc',
+ 'csr', etc. None means that sparse matrix input will raise an error.
+ If the input is sparse but not in the allowed format, it will be
+ converted to convert_sparse_to.
@arjoly Owner
arjoly added a note

Can you add what the available strings are?

sklearn/utils/validation.py
((35 lines not shown))
+
+def _sparse_matrix_constructor(string_format):
+ """Get constructor from a sparse matrix string format."""
+ if string_format == "csr":
+ return sp.csr_matrix
+ elif string_format == "csc":
+ return sp.csc_matrix
+ elif string_format == "coo":
+ return sp.coo_matrix
+ else:
+ raise ValueError("Don't know how to construct a sparse matrix of type"
+ " %s" % string_format)
+
+
+def _ensure_sparse_format(spmatrix, allowed_sparse, dtype, order, copy,
+ force_all_finite, convert_sparse_to):
@arjoly Owner
arjoly added a note

Is there a point of making a function for this?

@amueller Owner

to make the code more readable. I could also put it into the check_array function.

sklearn/utils/validation.py
((31 lines not shown))
+ else:
+ raise ValueError("Unknown sparse matrix format passed: %s"
+ % type(sparse_matrix))
+
+
+def _sparse_matrix_constructor(string_format):
+ """Get constructor from a sparse matrix string format."""
+ if string_format == "csr":
+ return sp.csr_matrix
+ elif string_format == "csc":
+ return sp.csc_matrix
+ elif string_format == "coo":
+ return sp.coo_matrix
+ else:
+ raise ValueError("Don't know how to construct a sparse matrix of type"
+ " %s" % string_format)
@arjoly Owner
arjoly added a note

I would use a dictionary for that.

sklearn/utils/validation.py
@@ -175,6 +133,177 @@ def _num_samples(x):
return x.shape[0] if hasattr(x, 'shape') else len(x)
+def check_consistent_length(*arrays):
+ """Check that all arrays have consistent first dimensions.
+
+ Checks whether all objects in arrays have the same shape or length.
+
+ Parameters
+ ----------
+ arrays : list or tuple of input objects.
+ Objects that will be checked for consistent length.
+ """
+
+ n_samples = [_num_samples(X) for X in arrays if X is not None]
+ uniques = np.unique(n_samples)
+ if len(uniques) > 1:
@arjoly Owner
arjoly added a note

len(set(_num_samples(X) for X in arrays if X is not None)) > 1?

@amueller Owner

and how do you do the error message? yeah I could put the unique in the top line.

@arjoly Owner
arjoly added a note

yeah I could put the unique in the top line.

+1

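The compromise the two settle on — fold the comprehension into the np.unique call, keeping the distinct values for the error message (which a bare set-cardinality check would discard) — could look like this sketch:

```python
import numpy as np

def _num_samples(x):
    # simplified stand-in for sklearn's private helper
    return x.shape[0] if hasattr(x, 'shape') else len(x)

def check_consistent_length(*arrays):
    """Check that all arrays have consistent first dimensions."""
    # np.unique both deduplicates and keeps the offending sizes
    # available for the error message.
    uniques = np.unique([_num_samples(X) for X in arrays if X is not None])
    if len(uniques) > 1:
        raise ValueError("Found arrays with inconsistent numbers of "
                         "samples: %s" % str(uniques))
```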
sklearn/utils/validation.py
((66 lines not shown))
+ Returns
+ -------
+ spmatrix_convertd : scipy sparse matrix.
+ Matrix that is ensured to have an allowed type (or convert_sparse_to).
+ """
+ if allowed_sparse is None:
+ raise TypeError('A sparse matrix was passed, but dense '
+ 'data is required. Use X.toarray() to '
+ 'convert to a dense numpy array.')
+ sparse_type = spmatrix.format
+ if sparse_type in allowed_sparse:
+ # correct type
+ if dtype == spmatrix.dtype or dtype is None:
+ # correct dtype
+ if copy:
+ spmatrix = spmatrix.copy()
@arjoly Owner
arjoly added a note

Feels like that is not at the right place.

@amueller Owner

why? I could also put it at the very end, but that wouldn't really make a difference.

@arjoly Owner
arjoly added a note

I would do this sooner, before the type coercion.

@amueller Owner

but then you might copy twice!

@arjoly Owner
arjoly added a note

Yeah, but you might change the original data, no?
This requires tests.

@coveralls

Coverage Status

Coverage decreased (-0.01%) when pulling f72aa17 on amueller:input_validation_refactoring into 0807e19 on scikit-learn:master.

sklearn/utils/validation.py
((76 lines not shown))
+ if sparse_type in allowed_sparse:
+ # correct type
+ if dtype == spmatrix.dtype or dtype is None:
+ # correct dtype
+ if copy:
+ spmatrix = spmatrix.copy()
+ else:
+ # convert dtype
+ spmatrix = spmatrix.astype(dtype)
+ else:
+ # create new
+ spmatrix = _sparse_matrix_constructor(convert_sparse_to)(
+ spmatrix, copy=copy, dtype=dtype)
+ if force_all_finite:
+ _assert_all_finite(spmatrix.data)
+ spmatrix.data = np.array(spmatrix.data, copy=False, order=order)
@arjoly Owner
arjoly added a note

It assumes that you have a .data attribute, which is not the case for every sparse matrix.

@amueller Owner

that is true. The tests seem not to be so great ^^ we need to add a test for that.

@amueller Owner

fixed.

sklearn/utils/validation.py
((53 lines not shown))
+ Whether an array will be forced to be fortran or c-style.
+
+ copy : boolean, default=False
+ Whether a forced copy will be triggered. If copy=False, a copy might
+ be triggered by a conversion.
+
+ force_all_finite : boolean, default=True
+ Whether to raise an error on np.inf and np.nan in X.
+
+ convert_sparse_to : string or None (default).
+ Sparse format to convert sparse matrices to if allowed_sparse is not
+ None. By default, the first entry of allowed_sparse will be used.
+
+ Returns
+ -------
+ spmatrix_convertd : scipy sparse matrix.
@GaelVaroquaux Owner

I think that there is a typo here: 'converted'.

@coveralls

Coverage Status

Coverage decreased (-0.0%) when pulling d03c84e on amueller:input_validation_refactoring into 4ec8630 on scikit-learn:master.

sklearn/utils/validation.py
((72 lines not shown))
+ raise TypeError('A sparse matrix was passed, but dense '
+ 'data is required. Use X.toarray() to '
+ 'convert to a dense numpy array.')
+ sparse_type = spmatrix.format
+ if sparse_type in allowed_sparse:
+ # correct type
+ if dtype == spmatrix.dtype or dtype is None:
+ # correct dtype
+ if copy:
+ spmatrix = spmatrix.copy()
+ else:
+ # convert dtype
+ spmatrix = spmatrix.astype(dtype)
+ else:
+ # create new
+ spmatrix = _sparse_matrix_constructor(convert_sparse_to)(
@GaelVaroquaux Owner

I think that here I would use the 'asformat' method of sparse matrices.

@amueller Owner

ahh, I have been looking and not finding this method.

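The asformat method GaelVaroquaux points to dispatches to the matrix's own to<format> converter, which makes a hand-rolled constructor lookup unnecessary:

```python
import numpy as np
import scipy.sparse as sp

X_coo = sp.coo_matrix(np.eye(3))
# asformat converts to the requested sparse format; an unknown format
# string raises an error, so no manual validation is needed either.
X_csr = X_coo.asformat("csr")
assert X_csr.format == "csr"
```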
sklearn/utils/validation.py
((108 lines not shown))
+ String[s] representing allowed sparse matrix formats, such as 'csc',
+ 'csr', etc. None means that sparse matrix input will raise an error.
+ If the input is sparse but not in the allowed format, it will be
+ converted to convert_sparse_to.
+
+ order : 'F', 'C' or None (default)
+ Whether an array will be forced to be fortran or c-style.
+
+ copy : boolean, default=False
+ Whether a forced copy will be triggered. If copy=False, a copy might
+ be triggered by a conversion.
+
+ force_all_finite : boolean, default=True
+ Whether to raise an error on np.inf and np.nan in X.
+
+ convert_sparse_to : string or None (default).
@GaelVaroquaux Owner

Do we actually need this argument, or could we do simply with 'allowed_sparse'?

@amueller Owner

You are right, I introduced it before thinking about the default behavior. I'll remove it.

@arjoly
Owner

Is dense data always allowed?

@amueller
Owner

Yes, currently dense data is always allowed. Do we have an application where this is not the case? If we do at some point, we could later add a flag?

@coveralls

Coverage Status

Coverage decreased (-0.01%) when pulling 1be9e46 on amueller:input_validation_refactoring into 4ec8630 on scikit-learn:master.

@amueller
Owner

thanks for the helpful reviews guys :)

@amueller amueller changed the title from [WIP] Input validation refactoring to [MRG] Input validation refactoring
@coveralls

Coverage Status

Coverage increased (+0.01%) when pulling 33f20bf on amueller:input_validation_refactoring into 4ec8630 on scikit-learn:master.

@GaelVaroquaux

I see that this has switched to MRG, but the TODO still lists tests. What's the status on that?

@arjoly arjoly commented on the diff
sklearn/utils/__init__.py
((6 lines not shown))
from .class_weight import compute_class_weight
from sklearn.utils.sparsetools import minimum_spanning_tree
__all__ = ["murmurhash3_32", "as_float_array", "check_arrays", "safe_asarray",
- "assert_all_finite", "array2d", "atleast2d_or_csc",
+ "assert_all_finite", "array2d", "atleast2d_or_csc", "check_array",
"atleast2d_or_csr",
"warn_if_not_float",
"check_random_state",
@arjoly Owner
arjoly added a note

Can you remove deprecated stuff?

@amueller Owner

Not in this PR, but in the next PR which will touch all files.

@arjoly arjoly commented on the diff
sklearn/utils/tests/test_validation.py
@@ -223,3 +226,106 @@ def test_check_arrays():
# check that lists are passed through if force_arrays is true
X_, Y_ = check_arrays(X, Y, force_arrays=False)
assert_true(isinstance(X_, list))
+
+
+def test_check_array():
+ # allowed_sparse == None
+ # raise error on sparse inputs
+ X = [[1, 2], [3, 4]]
+ X_csr = sp.csr_matrix(X)
+ assert_raises(TypeError, check_array, X_csr)
+ # ensure_2d
@arjoly Owner
arjoly added a note

Can you make a blank line between each case to ease reading?

sklearn/utils/tests/test_validation.py
((26 lines not shown))
+ # nan check
+ X_nan = np.arange(4).reshape(2, 2).astype(np.float)
+ X_nan[0, 0] = np.nan
+ assert_raises(ValueError, check_array, X_nan)
+ check_array(X_inf, force_all_finite=False) # no raise
+
+ # dtype and order enforcement.
+ X_C = np.arange(4).reshape(2, 2).copy("C")
+ X_F = X_C.copy("F")
+ X_int = X_C.astype(np.int)
+ X_float = X_C.astype(np.float)
+
+ for X in [X_C, X_F, X_int, X_float]:
+ for dtype in [np.int32, np.int, np.float, np.float32, None]:
+ for order in ['C', 'F', None]:
+ for copy in [True, False]:
@arjoly Owner
arjoly added a note

Can you use the product trick here?

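The "product trick" is itertools.product, which flattens the four nested loops into a single one (parameter values here mirror the test, simplified):

```python
from itertools import product

Xs = [[[1, 2], [3, 4]]]           # stand-in for the arrays under test
dtypes = [int, float, None]
orders = ['C', 'F', None]
copies = [True, False]

# One flat loop over all combinations, rightmost argument varying fastest.
combos = list(product(Xs, dtypes, orders, copies))
assert len(combos) == len(Xs) * len(dtypes) * len(orders) * len(copies)
```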
sklearn/utils/tests/test_validation.py
((24 lines not shown))
+ assert_raises(ValueError, check_array, X_inf)
+ check_array(X_inf, force_all_finite=False) # no raise
+ # nan check
+ X_nan = np.arange(4).reshape(2, 2).astype(np.float)
+ X_nan[0, 0] = np.nan
+ assert_raises(ValueError, check_array, X_nan)
+ check_array(X_inf, force_all_finite=False) # no raise
+
+ # dtype and order enforcement.
+ X_C = np.arange(4).reshape(2, 2).copy("C")
+ X_F = X_C.copy("F")
+ X_int = X_C.astype(np.int)
+ X_float = X_C.astype(np.float)
+
+ for X in [X_C, X_F, X_int, X_float]:
+ for dtype in [np.int32, np.int, np.float, np.float32, None]:
@arjoly Owner
arjoly added a note

can you add np.object, np.bool?

sklearn/utils/tests/test_validation.py
((59 lines not shown))
+ X_checked.flags['C_CONTIGUOUS'] ==
+ X.flags['C_CONTIGUOUS'] and
+ X_checked.flags['F_CONTIGUOUS'] ==
+ X.flags['F_CONTIGUOUS']):
+ assert_true(X is X_checked)
+
+ # allowed sparse != None
+ X_csc = sp.csc_matrix(X_C)
+ X_coo = X_csc.tocoo()
+ X_dok = X_csc.todok()
+ X_int = X_csc.astype(np.int)
+ X_float = X_csc.astype(np.float)
+ for X in [X_csc, X_coo, X_dok, X_int, X_float]:
+ for dtype in [np.int32, np.int, np.float, np.float32, None]:
+ for allowed_sparse in [['csr', 'coo'], ['coo', 'dok']]:
+ for copy in [True, False]:
@arjoly Owner
arjoly added a note

Can you use product here?

@arjoly arjoly commented on the diff
sklearn/utils/validation.py
@@ -175,6 +133,142 @@ def _num_samples(x):
return x.shape[0] if hasattr(x, 'shape') else len(x)
+def check_consistent_length(*arrays):
@arjoly Owner
arjoly added a note

Should we keep *arrays or put a list of array / iterable of array instead?

@amueller Owner

I think in this function it is fine. I'm thinking about marking it as private.

sklearn/utils/validation.py
((23 lines not shown))
+ """Convert a sparse matrix to a given format.
+
+ Checks the sparse format of spmatrix and converts if necessary.
+
+ Parameters
+ ----------
+ spmatrix : scipy sparse matrix
+ Input to validate and convert.
+
+ allowed_sparse : string or list of string, default=None
+ String[s] representing allowed sparse matrix formats ('csc',
+ 'csr', 'coo', 'dok', 'bsr', 'lil', 'dia'). None means that sparse
+ matrix input will raise an error. If the input is sparse but not in
+ the allowed format, it will be converted to the first listed format.
+
+ order : 'F', 'C' or None (default)
@arjoly Owner
arjoly added a note

order : 'F', 'C' or None (default=None)

sklearn/utils/validation.py
((17 lines not shown))
+ raise ValueError("Found arrays with inconsistent numbers of samples: %s"
+ % str(uniques))
+
+
+def _ensure_sparse_format(spmatrix, allowed_sparse, dtype, order, copy,
+ force_all_finite):
+ """Convert a sparse matrix to a given format.
+
+ Checks the sparse format of spmatrix and converts if necessary.
+
+ Parameters
+ ----------
+ spmatrix : scipy sparse matrix
+ Input to validate and convert.
+
+ allowed_sparse : string or list of string, default=None
@arjoly Owner
arjoly added a note

default=None => (default=None)

sklearn/utils/validation.py
((26 lines not shown))
+
+ Parameters
+ ----------
+ spmatrix : scipy sparse matrix
+ Input to validate and convert.
+
+ allowed_sparse : string or list of string, default=None
+ String[s] representing allowed sparse matrix formats ('csc',
+ 'csr', 'coo', 'dok', 'bsr', 'lil', 'dia'). None means that sparse
+ matrix input will raise an error. If the input is sparse but not in
+ the allowed format, it will be converted to the first listed format.
+
+ order : 'F', 'C' or None (default)
+ Whether an array will be forced to be fortran or c-style.
+
+ copy : boolean, default=False
@arjoly Owner
arjoly added a note

copy : boolean, (default=False)

sklearn/utils/validation.py
((30 lines not shown))
+ Input to validate and convert.
+
+ allowed_sparse : string or list of string, default=None
+ String[s] representing allowed sparse matrix formats ('csc',
+ 'csr', 'coo', 'dok', 'bsr', 'lil', 'dia'). None means that sparse
+ matrix input will raise an error. If the input is sparse but not in
+ the allowed format, it will be converted to the first listed format.
+
+ order : 'F', 'C' or None (default)
+ Whether an array will be forced to be fortran or c-style.
+
+ copy : boolean, default=False
+ Whether a forced copy will be triggered. If copy=False, a copy might
+ be triggered by a conversion.
+
+ force_all_finite : boolean, default=True
@arjoly Owner
arjoly added a note

default=True => (default=True)

sklearn/utils/validation.py
((28 lines not shown))
+ ----------
+ spmatrix : scipy sparse matrix
+ Input to validate and convert.
+
+ allowed_sparse : string or list of string, default=None
+ String[s] representing allowed sparse matrix formats ('csc',
+ 'csr', 'coo', 'dok', 'bsr', 'lil', 'dia'). None means that sparse
+ matrix input will raise an error. If the input is sparse but not in
+ the allowed format, it will be converted to the first listed format.
+
+ order : 'F', 'C' or None (default)
+ Whether an array will be forced to be fortran or c-style.
+
+ copy : boolean, default=False
+ Whether a forced copy will be triggered. If copy=False, a copy might
+ be triggered by a conversion.
@arjoly Owner
arjoly added a note

Could you add that a copy might be triggered if necessary?

@amueller Owner

Sorry I don't understand. That's what is says, right?

sklearn/utils/validation.py
((88 lines not shown))
+
+ Parameters
+ ----------
+ array : object
+ Input object to check / convert.
+
+ allowed_sparse : string or list of string, default=None
+ String[s] representing allowed sparse matrix formats, such as 'csc',
+ 'csr', etc. None means that sparse matrix input will raise an error.
+ If the input is sparse but not in the allowed format, it will be
+ converted to the first listed format.
+
+ order : 'F', 'C' or None (default)
+ Whether an array will be forced to be fortran or c-style.
+
+ copy : boolean, default=False
@arjoly Owner
arjoly added a note

It would be nice to be consistent with the default.

@arjoly Owner
arjoly added a note

(docstring wise)

@arjoly arjoly commented on the diff
sklearn/utils/validation.py
((73 lines not shown))
+ if not hasattr(spmatrix, "data"):
+ warnings.warn("Can't check %s sparse matrix for nan or inf."
+ % spmatrix.format)
+ else:
+ _assert_all_finite(spmatrix.data)
+ if hasattr(spmatrix, "data"):
+ spmatrix.data = np.array(spmatrix.data, copy=False, order=order)
+ return spmatrix
+
+
+def check_array(array, allowed_sparse=None, dtype=None, order=None, copy=False,
+ force_all_finite=True, ensure_2d=True, allow_nd=False):
+ """Input validation on an array, list, sparse matrix or similar.
+
+ By default, the input is converted to an at least 2nd numpy array.
+
@arjoly Owner
arjoly added a note

Could you add that dense array is always allowed?

sklearn/utils/validation.py
((85 lines not shown))
+ """Input validation on an array, list, sparse matrix or similar.
+
+ By default, the input is converted to an at least 2nd numpy array.
+
+ Parameters
+ ----------
+ array : object
+ Input object to check / convert.
+
+ allowed_sparse : string or list of string, default=None
+ String[s] representing allowed sparse matrix formats, such as 'csc',
+ 'csr', etc. None means that sparse matrix input will raise an error.
+ If the input is sparse but not in the allowed format, it will be
+ converted to the first listed format.
+
+ order : 'F', 'C' or None (default)
@arjoly Owner
arjoly added a note

(default=None)

@arjoly
Owner

Except for the cosmetic comment, :+1:

@GaelVaroquaux GaelVaroquaux commented on the diff
sklearn/feature_extraction/image.py
@@ -349,7 +349,7 @@ def extract_patches_2d(image, patch_size, max_patches=None, random_state=None):
i_h, i_w = image.shape[:2]
p_h, p_w = patch_size
- image = array2d(image)
+ image = check_array(image, allow_nd=True)
@GaelVaroquaux Owner

Hum, that change is really surprising for me: I would read the 2 lines (the one removed and the one added) as doing very different things. It's probably just a question of choice of names on the arguments.

@amueller Owner

that is because the previous behavior was surprising ;)

sklearn/utils/tests/test_validation.py
((38 lines not shown))
+ for X in [X_C, X_F, X_int, X_float]:
+ for dtype in [np.int32, np.int, np.float, np.float32, None]:
+ for order in ['C', 'F', None]:
+ for copy in [True, False]:
+ X_checked = check_array(X, dtype=dtype, order=order,
+ copy=copy)
+ if dtype is not None:
+ assert_equal(X_checked.dtype, dtype)
+ else:
+ assert_equal(X_checked.dtype, X.dtype)
+ if order == 'C':
+ assert_true(X_checked.flags['C_CONTIGUOUS'])
+ assert_false(X_checked.flags['F_CONTIGUOUS'])
+ elif order == 'F':
+ assert_true(X_checked.flags['F_CONTIGUOUS'])
+ assert_false(X_checked.flags['C_CONTIGUOUS'])
@GaelVaroquaux Owner

Why are you checking that it is not C contiguous? I agree that, as this is a 2D array, it cannot be F and C contiguous, but in my opinion you are overconstraining the tests, which is not a good thing.

@amueller Owner

I replicated the previous tests. I didn't want to test less than before.

sklearn/utils/validation.py
((83 lines not shown))
+def check_array(array, allowed_sparse=None, dtype=None, order=None, copy=False,
+ force_all_finite=True, ensure_2d=True, allow_nd=False):
+ """Input validation on an array, list, sparse matrix or similar.
+
+ By default, the input is converted to an at least 2nd numpy array.
+
+ Parameters
+ ----------
+ array : object
+ Input object to check / convert.
+
+ allowed_sparse : string or list of string, default=None
+ String[s] representing allowed sparse matrix formats, such as 'csc',
+ 'csr', etc. None means that sparse matrix input will raise an error.
+ If the input is sparse but not in the allowed format, it will be
+ converted to the first listed format.
@ogrisel Owner
ogrisel added a note

Very good.

@ogrisel ogrisel commented on the diff
sklearn/utils/validation.py
((124 lines not shown))
+ if sp.issparse(array):
+ array = _ensure_sparse_format(array, allowed_sparse, dtype, order,
+ copy, force_all_finite)
+ else:
+ if ensure_2d:
+ array = np.atleast_2d(array)
+ array = np.array(array, dtype=dtype, order=order, copy=copy)
+ if not allow_nd and array.ndim >= 3:
+ raise ValueError("Found array with dim %d. Expected <= 2" %
+ array.ndim)
+ if force_all_finite:
+ _assert_all_finite(array)
+
+ return array
+
+
def check_arrays(*arrays, **options):
@ogrisel Owner
ogrisel added a note

@amueller have you thought about deprecating this function now that we have check_array to check assumptions on individual data structures? If yes, what motivates you not to deprecate it in this PR? Too much code to update at once?

Would it make sense to try to deprecate it in a future PR?

@amueller Owner

I'm working on it. I think it makes sense to merge this one first because it is reviewable ;)

@coveralls

Coverage Status

Coverage increased (+0.04%) when pulling 95afdc7 on amueller:input_validation_refactoring into 4ec8630 on scikit-learn:master.

@ogrisel
Owner

+1 for merging this PR as it is and deal with further refactoring in a separate PR.

@arjoly
Owner

+1

@GaelVaroquaux GaelVaroquaux merged commit 8dab222 into scikit-learn:master
@coveralls

Coverage Status

Coverage increased (+0.01%) when pulling f7549fd on amueller:input_validation_refactoring into 41d02e0 on scikit-learn:master.

@GaelVaroquaux

Merged from the airport!

Awesome work! Go Go team!

@amueller
Owner

err... I just squashed, rebased and merged... hum.... Let's see what'll happen now.

@amueller
Owner

Oh I didn't push. all is good.

Commits on Jul 20, 2014
  1. @amueller

    Refactor input validation.

    amueller authored
4 sklearn/feature_extraction/image.py
@@ -15,7 +15,7 @@
from scipy import sparse
from numpy.lib.stride_tricks import as_strided
-from ..utils import array2d, check_random_state
+from ..utils import check_array, check_random_state
from ..utils.fixes import astype
from ..base import BaseEstimator
@@ -349,7 +349,7 @@ def extract_patches_2d(image, patch_size, max_patches=None, random_state=None):
i_h, i_w = image.shape[:2]
p_h, p_w = patch_size
- image = array2d(image)
+ image = check_array(image, allow_nd=True)
image = image.reshape((i_h, i_w, -1))
n_colors = image.shape[-1]
4 sklearn/utils/__init__.py
@@ -11,13 +11,13 @@
from .validation import (as_float_array, check_arrays, safe_asarray,
assert_all_finite, array2d, atleast2d_or_csc,
atleast2d_or_csr, warn_if_not_float,
- check_random_state, column_or_1d)
+ check_random_state, column_or_1d, check_array)
from .class_weight import compute_class_weight
from sklearn.utils.sparsetools import minimum_spanning_tree
__all__ = ["murmurhash3_32", "as_float_array", "check_arrays", "safe_asarray",
- "assert_all_finite", "array2d", "atleast2d_or_csc",
+ "assert_all_finite", "array2d", "atleast2d_or_csc", "check_array",
"atleast2d_or_csr",
"warn_if_not_float",
"check_random_state",
106 sklearn/utils/tests/test_validation.py
@@ -5,9 +5,13 @@
from numpy.testing import assert_array_equal
import scipy.sparse as sp
from nose.tools import assert_raises, assert_true, assert_false, assert_equal
+from itertools import product
from sklearn.utils import (array2d, as_float_array, atleast2d_or_csr,
- atleast2d_or_csc, check_arrays, safe_asarray)
+ atleast2d_or_csc, check_arrays, safe_asarray,
+ check_array)
+
+from sklearn.utils.estimator_checks import NotAnArray
from sklearn.random_projection import sparse_random_matrix
@@ -223,3 +227,103 @@ def test_check_arrays():
# check that lists are passed through if force_arrays is true
X_, Y_ = check_arrays(X, Y, force_arrays=False)
assert_true(isinstance(X_, list))
+
+
+def test_check_array():
+ # allowed_sparse == None
+ # raise error on sparse inputs
+ X = [[1, 2], [3, 4]]
+ X_csr = sp.csr_matrix(X)
+ assert_raises(TypeError, check_array, X_csr)
+ # ensure_2d
+ X_array = check_array([0, 1, 2])
+ assert_equal(X_array.ndim, 2)
+ X_array = check_array([0, 1, 2], ensure_2d=False)
+ assert_equal(X_array.ndim, 1)
+ # don't allow ndim > 3
+ X_ndim = np.arange(8).reshape(2, 2, 2)
+ assert_raises(ValueError, check_array, X_ndim)
+ check_array(X_ndim, allow_nd=True) # doesn't raise
+ # force_all_finite
+ X_inf = np.arange(4).reshape(2, 2).astype(np.float)
+ X_inf[0, 0] = np.inf
+ assert_raises(ValueError, check_array, X_inf)
+ check_array(X_inf, force_all_finite=False) # no raise
+ # nan check
+ X_nan = np.arange(4).reshape(2, 2).astype(np.float)
+ X_nan[0, 0] = np.nan
+ assert_raises(ValueError, check_array, X_nan)
+ check_array(X_inf, force_all_finite=False) # no raise
+
+ # dtype and order enforcement.
+ X_C = np.arange(4).reshape(2, 2).copy("C")
+ X_F = X_C.copy("F")
+ X_int = X_C.astype(np.int)
+ X_float = X_C.astype(np.float)
+ Xs = [X_C, X_F, X_int, X_float]
+ dtypes = [np.int32, np.int, np.float, np.float32, None, np.bool, object]
+ orders = ['C', 'F', None]
+ copys = [True, False]
+
+ for X, dtype, order, copy in product(Xs, dtypes, orders, copys):
+ X_checked = check_array(X, dtype=dtype, order=order, copy=copy)
+ if dtype is not None:
+ assert_equal(X_checked.dtype, dtype)
+ else:
+ assert_equal(X_checked.dtype, X.dtype)
+ if order == 'C':
+ assert_true(X_checked.flags['C_CONTIGUOUS'])
+ assert_false(X_checked.flags['F_CONTIGUOUS'])
+ elif order == 'F':
+ assert_true(X_checked.flags['F_CONTIGUOUS'])
+ assert_false(X_checked.flags['C_CONTIGUOUS'])
+ if copy:
+ assert_false(X is X_checked)
+ else:
+ # doesn't copy if it was already good
+ if (X.dtype == X_checked.dtype and
+ X_checked.flags['C_CONTIGUOUS'] == X.flags['C_CONTIGUOUS']
+ and X_checked.flags['F_CONTIGUOUS'] == X.flags['F_CONTIGUOUS']):
+ assert_true(X is X_checked)
+
+ # allowed sparse != None
+ X_csc = sp.csc_matrix(X_C)
+ X_coo = X_csc.tocoo()
+ X_dok = X_csc.todok()
+ X_int = X_csc.astype(np.int)
+ X_float = X_csc.astype(np.float)
+
+ Xs = [X_csc, X_coo, X_dok, X_int, X_float]
+ allowed_sparses = [['csr', 'coo'], ['coo', 'dok']]
+ for X, dtype, allowed_sparse, copy in product(Xs, dtypes, allowed_sparses,
+ copys):
+ X_checked = check_array(X, dtype=dtype, allowed_sparse=allowed_sparse,
+ copy=copy)
+ if dtype is not None:
+ assert_equal(X_checked.dtype, dtype)
+ else:
+ assert_equal(X_checked.dtype, X.dtype)
+ if X.format in allowed_sparse:
+ # no change if allowed
+ assert_equal(X.format, X_checked.format)
+ else:
+ # got converted
+ assert_equal(X_checked.format, allowed_sparse[0])
+ if copy:
+ assert_false(X is X_checked)
+ else:
+ # doesn't copy if it was already good
+ if (X.dtype == X_checked.dtype and X.format == X_checked.format):
+ assert_true(X is X_checked)
+
+ # other input formats
+ # convert lists to arrays
+ X_dense = check_array([[1, 2], [3, 4]])
+ assert_true(isinstance(X_dense, np.ndarray))
+ # raise on too deep lists
+ assert_raises(ValueError, check_array, X_ndim.tolist())
+ check_array(X_ndim.tolist(), allow_nd=True) # doesn't raise
+ # convert weird stuff to arrays
+ X_no_array = NotAnArray(X_dense)
+ result = check_array(X_no_array)
+ assert_true(isinstance(result, np.ndarray))
sklearn/utils/validation.py
@@ -14,7 +14,6 @@
import scipy.sparse as sp
from ..externals import six
-from .fixes import safe_copy
class DataConversionWarning(UserWarning):
@@ -64,20 +63,9 @@ def safe_asarray(X, dtype=None, order=None, copy=False, force_all_finite=True):
If a specific compressed sparse format is required, use atleast2d_or_cs{c,r}
instead.
"""
- if sp.issparse(X):
- if not isinstance(X, (sp.coo_matrix, sp.csc_matrix, sp.csr_matrix)):
- X = X.tocsr()
- elif copy:
- X = X.copy()
- if force_all_finite:
- _assert_all_finite(X.data)
- # enforces dtype on data array (order should be kept the same).
- X.data = np.asarray(X.data, dtype=dtype)
- else:
- X = np.array(X, dtype=dtype, order=order, copy=copy)
- if force_all_finite:
- _assert_all_finite(X)
- return X
+ return check_array(X, allowed_sparse=['csr', 'csc', 'coo'], dtype=dtype,
+ order=order, copy=copy,
+ force_all_finite=force_all_finite, ensure_2d=False)
def as_float_array(X, copy=True, force_all_finite=True):
@@ -114,33 +102,7 @@ def as_float_array(X, copy=True, force_all_finite=True):
def array2d(X, dtype=None, order=None, copy=False, force_all_finite=True):
"""Returns at least 2-d array with data from X"""
- if sp.issparse(X):
- raise TypeError('A sparse matrix was passed, but dense data '
- 'is required. Use X.toarray() to convert to dense.')
- X_2d = np.asarray(np.atleast_2d(X), dtype=dtype, order=order)
- if force_all_finite:
- _assert_all_finite(X_2d)
- if X is X_2d and copy:
- X_2d = safe_copy(X_2d)
- return X_2d
-
-
-def _atleast2d_or_sparse(X, dtype, order, copy, sparse_class, convmethod,
- check_same_type, force_all_finite):
- if sp.issparse(X):
- if check_same_type(X) and X.dtype == dtype:
- X = getattr(X, convmethod)(copy=copy)
- elif dtype is None or X.dtype == dtype:
- X = getattr(X, convmethod)()
- else:
- X = sparse_class(X, dtype=dtype)
- if force_all_finite:
- _assert_all_finite(X.data)
- X.data = np.array(X.data, copy=False, order=order)
- else:
- X = array2d(X, dtype=dtype, order=order, copy=copy,
- force_all_finite=force_all_finite)
- return X
+ return check_array(X, None, dtype, order, copy, force_all_finite)
def atleast2d_or_csc(X, dtype=None, order=None, copy=False,
@@ -149,9 +111,7 @@ def atleast2d_or_csc(X, dtype=None, order=None, copy=False,
Also, converts np.matrix to np.ndarray.
"""
- return _atleast2d_or_sparse(X, dtype, order, copy, sp.csc_matrix,
- "tocsc", sp.isspmatrix_csc,
- force_all_finite)
+ return check_array(X, "csc", dtype, order, copy, force_all_finite)
def atleast2d_or_csr(X, dtype=None, order=None, copy=False,
@@ -160,9 +120,7 @@ def atleast2d_or_csr(X, dtype=None, order=None, copy=False,
Also, converts np.matrix to np.ndarray.
"""
- return _atleast2d_or_sparse(X, dtype, order, copy, sp.csr_matrix,
- "tocsr", sp.isspmatrix_csr,
- force_all_finite)
+ return check_array(X, "csr", dtype, order, copy, force_all_finite)
def _num_samples(x):
@@ -175,6 +133,142 @@ def _num_samples(x):
return x.shape[0] if hasattr(x, 'shape') else len(x)
+def check_consistent_length(*arrays):
@arjoly Owner
arjoly added a note

Should we keep *arrays or take a list / iterable of arrays instead?

@amueller Owner

I think in this function it is fine. I'm thinking about marking it as private.

+ """Check that all arrays have consistent first dimensions.
+
+ Checks whether all objects in arrays have the same shape or length.
+
+ Parameters
+ ----------
+ arrays : list or tuple of input objects.
+ Objects that will be checked for consistent length.
+ """
+
+ uniques = np.unique([_num_samples(X) for X in arrays if X is not None])
+ if len(uniques) > 1:
+ raise ValueError("Found arrays with inconsistent numbers of samples: %s"
+ % str(uniques))
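The new `check_consistent_length` reduces all inputs to their first dimension and demands a single unique value. A self-contained sketch of that logic (helper names are hypothetical, mirroring the private `_num_samples` above):

```python
import numpy as np


def num_samples(x):
    # mirrors sklearn's private _num_samples helper
    return x.shape[0] if hasattr(x, "shape") else len(x)


def consistent_length(*arrays):
    # collect the distinct first dimensions; more than one means a mismatch
    uniques = np.unique([num_samples(X) for X in arrays if X is not None])
    if len(uniques) > 1:
        raise ValueError("Found arrays with inconsistent numbers of "
                         "samples: %s" % str(uniques))


consistent_length(np.zeros((3, 2)), [0, 1, 2], None)  # passes silently
```

Note that `None` entries are skipped, which preserves the optional `y=None` pattern handled in `check_arrays`.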
+
+
+def _ensure_sparse_format(spmatrix, allowed_sparse, dtype, order, copy,
+ force_all_finite):
+ """Convert a sparse matrix to a given format.
+
+ Checks the sparse format of spmatrix and converts if necessary.
+
+ Parameters
+ ----------
+ spmatrix : scipy sparse matrix
+ Input to validate and convert.
+
+    allowed_sparse : string, list of string or None (default=None)
+        String[s] representing allowed sparse matrix formats ('csc',
+        'csr', 'coo', 'dok', 'bsr', 'lil', 'dia'). None means that sparse
+        matrix input will raise an error. If the input is sparse but not in
+        the allowed format, it will be converted to the first listed format.
+
+    dtype : numpy dtype or None (default=None)
+        Data type of the result. If None, the dtype of the input is kept.
+
+ order : 'F', 'C' or None (default=None)
+ Whether an array will be forced to be fortran or c-style.
+
+ copy : boolean (default=False)
+ Whether a forced copy will be triggered. If copy=False, a copy might
+ be triggered by a conversion.
+
+ force_all_finite : boolean (default=True)
+ Whether to raise an error on np.inf and np.nan in X.
+
+ Returns
+ -------
+    spmatrix_converted : scipy sparse matrix
+ Matrix that is ensured to have an allowed type.
+ """
+ if allowed_sparse is None:
+ raise TypeError('A sparse matrix was passed, but dense '
+ 'data is required. Use X.toarray() to '
+ 'convert to a dense numpy array.')
+ sparse_type = spmatrix.format
+ if dtype is None:
+ dtype = spmatrix.dtype
+ if sparse_type in allowed_sparse:
+ # correct type
+ if dtype == spmatrix.dtype:
+ # correct dtype
+ if copy:
+ spmatrix = spmatrix.copy()
+ else:
+ # convert dtype
+ spmatrix = spmatrix.astype(dtype)
+ else:
+ # create new
+ spmatrix = spmatrix.asformat(allowed_sparse[0]).astype(dtype)
+ if force_all_finite:
+ if not hasattr(spmatrix, "data"):
+ warnings.warn("Can't check %s sparse matrix for nan or inf."
+ % spmatrix.format)
+ else:
+ _assert_all_finite(spmatrix.data)
+ if hasattr(spmatrix, "data"):
+ spmatrix.data = np.array(spmatrix.data, copy=False, order=order)
+ return spmatrix
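The conversion branch above leans on two scipy.sparse methods, `asformat` for the format change to `allowed_sparse[0]` and `astype` for the dtype change, with finiteness then checked on the `.data` buffer. A standalone illustration of those calls (not using this PR's helper):

```python
import numpy as np
import scipy.sparse as sp

X = sp.csc_matrix(np.eye(3))

# asformat converts between sparse formats by name, as done for
# allowed_sparse[0] above
X_coo = X.asformat("coo")
assert X_coo.format == "coo"

# astype handles the dtype conversion; the finiteness check then runs
# on the .data array of stored entries
X_f32 = X_coo.astype(np.float32)
assert X_f32.dtype == np.float32
assert np.all(np.isfinite(X_f32.data))
```

Formats without a `.data` attribute (e.g. dok) are the reason for the warning path in the function above.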
+
+
+def check_array(array, allowed_sparse=None, dtype=None, order=None, copy=False,
+ force_all_finite=True, ensure_2d=True, allow_nd=False):
+ """Input validation on an array, list, sparse matrix or similar.
+
+    By default, the input is converted to an at least 2-d numpy array.
+
@arjoly Owner
arjoly added a note

Could you add that dense array is always allowed?

+ Parameters
+ ----------
+ array : object
+ Input object to check / convert.
+
+    allowed_sparse : string, list of string or None (default=None)
+        String[s] representing allowed sparse matrix formats, such as 'csc',
+        'csr', etc. None means that sparse matrix input will raise an error.
+        If the input is sparse but not in the allowed format, it will be
+        converted to the first listed format. Dense array input is always
+        allowed.
+
+    dtype : numpy dtype or None (default=None)
+        Data type that the result is guaranteed to have. If None, the input
+        dtype is preserved.
+
+ order : 'F', 'C' or None (default=None)
+ Whether an array will be forced to be fortran or c-style.
+
+ copy : boolean (default=False)
+ Whether a forced copy will be triggered. If copy=False, a copy might
+ be triggered by a conversion.
+
+ force_all_finite : boolean (default=True)
+ Whether to raise an error on np.inf and np.nan in X.
+
+ ensure_2d : boolean (default=True)
+ Whether to make X at least 2d.
+
+ allow_nd : boolean (default=False)
+ Whether to allow X.ndim > 2.
+
+ Returns
+ -------
+ X_converted : object
+ The converted and validated X.
+ """
+ if isinstance(allowed_sparse, str):
+ allowed_sparse = [allowed_sparse]
+
+ if sp.issparse(array):
+ array = _ensure_sparse_format(array, allowed_sparse, dtype, order,
+ copy, force_all_finite)
+ else:
+ if ensure_2d:
+ array = np.atleast_2d(array)
+ array = np.array(array, dtype=dtype, order=order, copy=copy)
+ if not allow_nd and array.ndim >= 3:
+ raise ValueError("Found array with dim %d. Expected <= 2" %
+ array.ndim)
+ if force_all_finite:
+ _assert_all_finite(array)
+
+ return array
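The dense branch relies on numpy's conversion semantics: `np.array`/`np.asarray` only copy when the dtype or memory layout actually changes, which is what makes the `copy=False` path cheap and what the "doesn't copy if it was already good" assertions in the tests exercise. A standalone illustration:

```python
import numpy as np

a = np.arange(4.0).reshape(2, 2)   # C-contiguous float64

# no conversion needed: asarray hands back the very same object
assert np.asarray(a) is a

# a dtype change forces a new array to be allocated
b = np.asarray(a, dtype=np.float32)
assert b is not a
assert b.dtype == np.float32

# atleast_2d promotes 1-d input, as ensure_2d=True does above
assert np.atleast_2d(np.arange(3)).shape == (1, 3)
```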
+
+
def check_arrays(*arrays, **options):
@ogrisel Owner
ogrisel added a note

@amueller have you thought about deprecating this function now that we have check_array to check assumptions on individual data structures? If yes, what motivated you not to deprecate it in this PR? Too much code to update at once?

Would it make sense to try to deprecate it in a future PR?

@amueller Owner

I'm working on it. I think it makes sense to merge this one first because it is reviewable ;)

"""Check that all arrays have consistent first dimensions.
@@ -237,60 +331,27 @@ def check_arrays(*arrays, **options):
if len(arrays) == 0:
return None
-
- n_samples = _num_samples(arrays[0])
+ check_consistent_length(*arrays)
+
+ order = 'C' if check_ccontiguous else None
+ force_finite = not allow_nans
+ if sparse_format == 'dense':
+ allow_sparse = None
+ elif sparse_format is None:
+ allow_sparse = ['csr', 'csc']
+ else:
+ allow_sparse = sparse_format
checked_arrays = []
for array in arrays:
- array_orig = array
if array is None:
- # special case: ignore optional y=None kwarg pattern
checked_arrays.append(array)
continue
- size = _num_samples(array)
-
- if size != n_samples:
- raise ValueError("Found array with dim %d. Expected %d"
- % (size, n_samples))
-
- if (force_arrays or hasattr(array, "__array__")
- or hasattr(array, "shape")):
- if sp.issparse(array):
- if sparse_format == 'csr':
- array = array.tocsr()
- elif sparse_format == 'csc':
- array = array.tocsc()
- elif sparse_format == 'dense':
- raise TypeError('A sparse matrix was passed, but dense '
- 'data is required. Use X.toarray() to '
- 'convert to a dense numpy array.')
- if check_ccontiguous:
- array.data = np.ascontiguousarray(array.data, dtype=dtype)
- elif hasattr(array, 'data'):
- array.data = np.asarray(array.data, dtype=dtype)
- elif array.dtype != dtype:
- # Cast on the required dtype
- array = array.astype(dtype)
- if not allow_nans:
- if hasattr(array, 'data'):
- _assert_all_finite(array.data)
- else:
- # DOK sparse matrices
- _assert_all_finite(array.values())
- else:
- if check_ccontiguous:
- array = np.ascontiguousarray(array, dtype=dtype)
- elif dtype is not None or force_arrays:
- array = np.asarray(array, dtype=dtype)
- if not allow_nans:
- _assert_all_finite(array)
-
- if force_arrays and not allow_nd and array.ndim >= 3:
- raise ValueError("Found array with dim %d. Expected <= 2" %
- array.ndim)
-
- if copy and array is array_orig:
- array = array.copy()
+
+ if force_arrays or sp.issparse(array):
+ array = check_array(array, allow_sparse, dtype, order, copy=copy,
+ ensure_2d=False, allow_nd=allow_nd,
+ force_all_finite=force_finite)
checked_arrays.append(array)
return checked_arrays
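The `sparse_format` → `allow_sparse` mapping at the top of this hunk can be read as a small pure function (a sketch; the function name is hypothetical):

```python
def sparse_allowance(sparse_format):
    # 'dense' forbids sparse input entirely (check_array raises on sparse);
    # None keeps the historical default of accepting csr/csc;
    # any other value is passed through to check_array unchanged
    if sparse_format == "dense":
        return None
    if sparse_format is None:
        return ["csr", "csc"]
    return sparse_format


assert sparse_allowance("dense") is None
assert sparse_allowance(None) == ["csr", "csc"]
```

Note the inversion relative to `check_array`: there, `allowed_sparse=None` means "reject sparse", whereas here `sparse_format=None` means "accept the default csr/csc formats".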