Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FIX Raise proper error for sparse dataframe with mixed dtypes #17992

Merged
merged 43 commits into from Aug 25, 2020

Conversation

alexshacked
Copy link
Contributor

@alexshacked alexshacked commented Jul 25, 2020

Fixes #17945.
Currently, function check_array() in validation.py throws a confusing exception while validating:

  1. A sparse DataFrame with mixed numeric types
  2. A sparse DataFrame with object types.
    The error happens because for this cases the DataFrame creates a coo_matrix with dtype('object'), and when
    trying to create a csr_matrix from this coo_matrix scipy will raise a ValueError.
    This scipy Exception is not very informative for the scikit-learn user that gets this exception while trying to fit an
    estimator with a sparse DataFrame.

The fix is to handle this two cases in check_array by looking at the coo_matrix and issue a clear Exception to the caller.
The fix includes a uninittest that handles all the intput types scenarios that can generate this issue.
The unnittest will fail if run in the version prior to this fix.

Copy link
Member

@thomasjpfan thomasjpfan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the PR @alexshacked

I was wondering if we can allow for the sparse conversion to happen, and then check the dtype. If it is a object dtype, raise an error regarding pandas extensions arrays etc.

To get more attention on a PR, consider updating the title. (It will be used as the commit message when this PR gets merged).

@alexshacked
Copy link
Contributor Author

Hey @thomasjpfan. In this fix we analyze the coo_matrix , we got from
array.sparse.to_coo() - line 570 in validation.py
and if it is of dtype('object') we raise the new Exception.
If we did not do that, the next line is :
array = _ensure_sparse_format(...
and this would throw us the original Exception while calling coo_matrix.to_csr()
So, in short we can let the code create the coo_matrix and analyze it's type (and this is what we are actually doing),
but we cannot go to the next step and create the csr_matrix from the coo_matrix because this would throw the original scipy exception

@alexshacked alexshacked changed the title Fix 17945 Raise ValueError with clear error message in check_array for sparse DataFrames with mixed types. Fixes issue #17945 Jul 27, 2020
Copy link
Member

@thomasjpfan thomasjpfan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see that the test is trying to cover many test cases, but I think we only need to test the example demonstrated in the issue.

Comment on lines 583 to 586
raise ValueError(
"Pandas DataFrame with sparse extention arrays generated "
"a sparse coo_matrix with dtype 'np.object'\n which "
"scipy cannot convert to a CSR or a CSC. {}".format(mxg))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For now, I do not think we need to be specific about the cause. I think a message such as:

Pandas DataFrame with mixed sparse extension arrays generated a sparse matrix with object dtype which can not be converted to a scipy sparse matrix

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hey @thomasjpfan. There are two scenarios, that can result in scipy coo_matrix.to_csr throwing the Exception:
TypeError: no supported conversion for types: (dtype('O'),)
a. check_array receives a DataFrame with columns of different numeric types
b. check_array receives a data frame where all the columns are the same type but this type is str or bytes or object

tf = pd.DataFrame({'col1': pd.arrays.SparseArray([0, 1, 0],
                                                             dtype='str'),
                   'col2': pd.arrays.SparseArray([1, 0, 1],
                                                             dtype='str')})
a = check_array(tf, **{'accept_sparse': ['csr', 'csc'],
                'ensure_min_features': 2})
...
raise TypeError('no supported conversion for types: %r' % (args,))
TypeError: no supported conversion for types: (dtype('O'),)

In this fix I try to handle both cases since they both throw the same scipy exception.

For scenario b, I don't think we can display the message:
Pandas DataFrame with mixed sparse extension arrays generated a sparse matrix with object dtype which can not be converted to a scipy sparse matrix.

So you propose that we won't handle scenario b, meaning that we let the system throw the original exception?
TypeError: no supported conversion for types: (dtype('O'),)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So you propose that we won't handle scenario b, meaning that we let the system throw the original exception?

Yes.

I do not think check_array should expect SparseArrays with strings or objects. Raising an error for the common use case of mixed numerical data as specified in #17945 would be enough.

I am trying to avoid too much pandas specific code in the test suite. I think a simple test for mixed numerical sparse arrays, would be sufficient to resolve the original issue.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. Will do. Thanks a lot.

@alexshacked
Copy link
Contributor Author

alexshacked commented Aug 3, 2020

I changed the fix to support only the mixed numeric types scenario.
When running circleci I noticed that some tests use the latest pandas version 1.1.0, while other tests still work with pandas 1.0.5.
Pandas 1.1.0 fixes the mixed numerics type issue. So from this version and beyond we do not need this fix.
My current delivery tests pandas version and expects an Exception to be raised only if pandas version is 1.0.5 and below.
My question is, do we still want to add this fix and increase the implementation of check _array() considering that an year from now version 1.0.5 and below will probably not be very frequent? Pandas are releasing a new version nearly every month.

@alexshacked
Copy link
Contributor Author

Hey @thomasjpfan. I chanded the fix to handle only the mixed types scenario. Meanwhile pandas released version 1.1.0 which fixes this issue. Please, see my full comment above.

@@ -1213,3 +1213,85 @@ def test_check_sparse_pandas_sp_format(sp_format):
assert sp.issparse(result)
assert result.format == sp_format
assert_allclose_dense_sparse(sp_mat, result)


def make_types_table():
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer not to have this function at all, we do not need to check every single pandas dtype combinations.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

Comment on lines 1250 to 1251
@pytest.mark.parametrize('dt_name',
['bool', 'float', 'int', 'uint'])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can explicitly list out the combinations we are about there

@pytest.mark.parametrize('ntype1, ntype2', [
	("int32", "long"),
...
])
def test_check_pandas_sparse_invalid(ntype1, ntype2):
	pd = pytest.importskip("pandas")
	df = pd.DataFrame(...)
	with pytest.raises(ValueError, ...):
		check_array(df, ...)

@pytest.mark.parametrize('ntype1, ntype2', [
	("int", "long"),
...
])
def test_check_pandas_sparse_valid(ntype1, ntype2):
	# make sure valid combinations result in the expected sparse matrix.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed the tests accordingly

if is_sparse_df_with_mixed_types(array_orig):
raise ValueError(
"Pandas DataFrame with mixed sparse extension arrays "
"generated a sparse matrix\n with object dtype which "
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"generated a sparse matrix\n with object dtype which "
"generated a sparse matrix with object dtype which "

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Copy link
Contributor

@glemaitre glemaitre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple of style changes and a bit more check required in the tests.

sklearn/utils/validation.py Outdated Show resolved Hide resolved
sklearn/utils/tests/test_validation.py Outdated Show resolved Hide resolved
Comment on lines 1272 to 1273
check_array(df, **{'accept_sparse': ['csr', 'csc'],
'ensure_min_features': 2})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
check_array(df, **{'accept_sparse': ['csr', 'csc'],
'ensure_min_features': 2})
check_array(df, **{'accept_sparse': ['csr', 'csc']})

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should also check the output by checking that we get a numpy array with the expected dtype?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or at least that we have a single dtype

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

check_array returns a scipy csr_matrix. Scipy sparse matrices like numpy arrays always contain elements of the same type. The type is found in the matrix member dtype. So, one type is guaranteed.
I can test that the dtype of the csr_matrix is as expected. Will do.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

raise ValueError(
"Pandas DataFrame with mixed sparse extension arrays "
"generated a sparse matrix with object dtype which "
"can not be converted to a scipy sparse matrix")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"can not be converted to a scipy sparse matrix")
"can not be converted to a scipy sparse matrix."
)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we provide some hints to solve the issue? One might my want to convert manually to a single dtype if possible?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

sklearn/utils/tests/test_validation.py Outdated Show resolved Hide resolved
sklearn/utils/tests/test_validation.py Show resolved Hide resolved
sklearn/utils/tests/test_validation.py Outdated Show resolved Hide resolved
sklearn/utils/tests/test_validation.py Show resolved Hide resolved
sklearn/utils/tests/test_validation.py Outdated Show resolved Hide resolved
sklearn/utils/tests/test_validation.py Outdated Show resolved Hide resolved
@glemaitre glemaitre changed the title Raise ValueError with clear error message in check_array for sparse DataFrames with mixed types. Fixes issue #17945 FIX raise proper error for sparse dataframe with mixed data type Aug 21, 2020
Copy link
Contributor

@glemaitre glemaitre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. @thomasjpfan do you want to have a look at this and potentially merge it.

doc/whats_new/v0.24.rst Outdated Show resolved Hide resolved
sklearn/utils/tests/test_validation.py Show resolved Hide resolved
@glemaitre
Copy link
Contributor

FYI I check codecov. I am sure that we covered any change that we did in this PR.

@glemaitre glemaitre added this to TO BE MERGED in Guillaume's pet Aug 25, 2020
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
@alexshacked
Copy link
Contributor Author

Thanks a lot @glemaitre

Copy link
Member

@thomasjpfan thomasjpfan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor comments. Otherwise LGTM

sklearn/utils/tests/test_validation.py Outdated Show resolved Hide resolved
sklearn/utils/tests/test_validation.py Outdated Show resolved Hide resolved
sklearn/utils/tests/test_validation.py Outdated Show resolved Hide resolved
alexshacked and others added 3 commits August 25, 2020 19:57
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
@alexshacked
Copy link
Contributor Author

@thomasjpfan Thank you very much for guiding me through this issue.

@thomasjpfan thomasjpfan changed the title FIX raise proper error for sparse dataframe with mixed data type FIX Raise proper error for sparse dataframe with mixed dtypes Aug 25, 2020
@thomasjpfan thomasjpfan merged commit 0111511 into scikit-learn:master Aug 25, 2020
5 checks passed
@glemaitre glemaitre moved this from TO BE MERGED to MERGED in Guillaume's pet Aug 26, 2020
jayzed82 pushed a commit to jayzed82/scikit-learn that referenced this pull request Oct 22, 2020
…-learn#17992)

Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Development

Successfully merging this pull request may close these issues.

TruncatedSVD fails on DataFrame with mixture of Sparse columns int and float
4 participants