FIX Raise proper error for sparse dataframe with mixed dtypes #17992
Conversation
Thank you for the PR @alexshacked
I was wondering if we can allow the sparse conversion to happen, and then check the dtype. If it is an object dtype, raise an error regarding pandas extension arrays, etc.
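For illustration, that could look roughly like the following sketch. The helper name and the use of the pandas DataFrame.sparse.to_coo() accessor are assumptions for this example, not the actual scikit-learn code:

def _check_sparse_dataframe_dtype(df):
    # Sketch: let pandas build the sparse matrix first, then inspect the
    # resulting dtype instead of second-guessing the DataFrame columns.
    coo = df.sparse.to_coo()
    if coo.dtype == object:
        raise ValueError(
            "Pandas DataFrame with mixed sparse extension arrays "
            "generated a sparse matrix with object dtype which "
            "can not be converted to a scipy sparse matrix.")
    return coo.tocsr()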
To get more attention on a PR, consider updating the title. (It will be used as the commit message when this PR gets merged).
Hey @thomasjpfan. In this fix we analyze the coo_matrix we got from …
I see that the test is trying to cover many test cases, but I think we only need to test the example demonstrated in the issue.
sklearn/utils/validation.py
Outdated
raise ValueError(
    "Pandas DataFrame with sparse extention arrays generated "
    "a sparse coo_matrix with dtype 'np.object'\n which "
    "scipy cannot convert to a CSR or a CSC. {}".format(mxg))
For now, I do not think we need to be specific about the cause. I think a message such as:
Pandas DataFrame with mixed sparse extension arrays generated a sparse matrix with object dtype which can not be converted to a scipy sparse matrix
Hey @thomasjpfan. There are two scenarios that can result in scipy's coo_matrix.tocsr() throwing the exception:
TypeError: no supported conversion for types: (dtype('O'),)
a. check_array receives a DataFrame with columns of different numeric types (a reproduction sketch follows the traceback below)
b. check_array receives a DataFrame where all the columns have the same dtype, but that dtype is str, bytes, or object
import pandas as pd
from sklearn.utils.validation import check_array

tf = pd.DataFrame({'col1': pd.arrays.SparseArray([0, 1, 0], dtype='str'),
                   'col2': pd.arrays.SparseArray([1, 0, 1], dtype='str')})
a = check_array(tf, **{'accept_sparse': ['csr', 'csc'],
                       'ensure_min_features': 2})
...
    raise TypeError('no supported conversion for types: %r' % (args,))
TypeError: no supported conversion for types: (dtype('O'),)
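Scenario (a) can be reproduced with mixed numeric sparse columns along the following lines (a rough sketch; the df_mixed name is arbitrary, and a pandas version below 1.1 is assumed, since newer pandas finds a common dtype):

import pandas as pd
from sklearn.utils.validation import check_array

# Two sparse columns with different numeric dtypes. Converting the DataFrame
# to a COO matrix yields object dtype, and converting that to CSR fails with
# the same scipy TypeError shown above.
df_mixed = pd.DataFrame({'col1': pd.arrays.SparseArray([0, 1, 0],
                                                       dtype='int64',
                                                       fill_value=0),
                         'col2': pd.arrays.SparseArray([1.0, 0.0, 1.0],
                                                       dtype='float64',
                                                       fill_value=0)})
check_array(df_mixed, **{'accept_sparse': ['csr', 'csc'],
                         'ensure_min_features': 2})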
In this fix I try to handle both cases since they both throw the same scipy exception.
For scenario b, I don't think we can display the message "Pandas DataFrame with mixed sparse extension arrays generated a sparse matrix with object dtype which can not be converted to a scipy sparse matrix".
So you propose that we won't handle scenario b, meaning that we let the system throw the original exception?
TypeError: no supported conversion for types: (dtype('O'),)
So you propose that we won't handle scenario b, meaning that we let the system throw the original exception?
Yes.
I do not think check_array should expect SparseArrays with strings or objects. Raising an error for the common use case of mixed numerical data, as specified in #17945, would be enough.
I am trying to avoid too much pandas-specific code in the test suite. I think a simple test for mixed numerical sparse arrays would be sufficient to resolve the original issue.
Got it. Will do. Thanks a lot.
I changed the fix to support only the mixed numeric types scenario.
Hey @thomasjpfan. I changed the fix to handle only the mixed types scenario. Meanwhile, pandas released version 1.1.0, which fixes this issue. Please see my full comment above.
@@ -1213,3 +1213,85 @@ def test_check_sparse_pandas_sp_format(sp_format):
    assert sp.issparse(result)
    assert result.format == sp_format
    assert_allclose_dense_sparse(sp_mat, result)


def make_types_table():
I would prefer not to have this function at all; we do not need to check every single pandas dtype combination.
ok
@pytest.mark.parametrize('dt_name',
                         ['bool', 'float', 'int', 'uint'])
We can explicitly list out the combinations we care about here:

@pytest.mark.parametrize('ntype1, ntype2', [
    ("int32", "long"),
    ...
])
def test_check_pandas_sparse_invalid(ntype1, ntype2):
    pd = pytest.importorskip("pandas")
    df = pd.DataFrame(...)
    with pytest.raises(ValueError, ...):
        check_array(df, ...)


@pytest.mark.parametrize('ntype1, ntype2', [
    ("int", "long"),
    ...
])
def test_check_pandas_sparse_valid(ntype1, ntype2):
    # make sure valid combinations result in the expected sparse matrix.
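For reference, a filled-in version of the first test could look roughly like the sketch below. The dtype pairs, the fill_value=0 arguments, the test name, and the matched message are illustrative, and a pandas version below 1.1 is assumed (as noted elsewhere in the thread, pandas 1.1 finds a common dtype and no longer triggers the error):

import pytest

from sklearn.utils.validation import check_array


@pytest.mark.parametrize('ntype1, ntype2', [
    ('int64', 'float64'),
    ('int32', 'float32'),
])
def test_check_pandas_sparse_invalid_sketch(ntype1, ntype2):
    # Mixed numeric sparse dtypes should raise an informative ValueError
    # instead of scipy's opaque TypeError (behaviour for pandas < 1.1).
    pd = pytest.importorskip("pandas")
    df = pd.DataFrame(
        {'col1': pd.arrays.SparseArray([0, 1, 0], dtype=ntype1, fill_value=0),
         'col2': pd.arrays.SparseArray([1, 0, 1], dtype=ntype2, fill_value=0)})
    with pytest.raises(ValueError, match="mixed sparse extension arrays"):
        check_array(df, accept_sparse=['csr', 'csc'])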
I changed the tests accordingly
sklearn/utils/validation.py
Outdated
if is_sparse_df_with_mixed_types(array_orig):
    raise ValueError(
        "Pandas DataFrame with mixed sparse extension arrays "
        "generated a sparse matrix\n with object dtype which "
"generated a sparse matrix\n with object dtype which " | |
"generated a sparse matrix with object dtype which " |
done
A couple of style changes and a bit more checking required in the tests.
check_array(df, **{'accept_sparse': ['csr', 'csc'],
                   'ensure_min_features': 2})
-    check_array(df, **{'accept_sparse': ['csr', 'csc'],
-                       'ensure_min_features': 2})
+    check_array(df, **{'accept_sparse': ['csr', 'csc']})
Should we also check the output, i.e. that we get an array with the expected dtype?
Or at least that we have a single dtype
check_array returns a scipy csr_matrix. Scipy sparse matrices, like numpy arrays, always contain elements of a single type, which is stored in the matrix's dtype attribute, so a single dtype is guaranteed.
I can test that the dtype of the csr_matrix is as expected. Will do.
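A minimal sketch of such a check on a valid input (the float64 columns, the df_ok name, and the expected dtype are just an example, not the merged test):

import numpy as np
import pandas as pd
import scipy.sparse as sp

from sklearn.utils.validation import check_array

# Two sparse columns sharing the same numeric dtype: a valid combination.
df_ok = pd.DataFrame({'col1': pd.arrays.SparseArray([0.0, 1.0, 0.0],
                                                    fill_value=0.0),
                      'col2': pd.arrays.SparseArray([1.0, 0.0, 1.0],
                                                    fill_value=0.0)})
result = check_array(df_ok, accept_sparse=['csr', 'csc'])

# A scipy sparse matrix carries a single dtype, so one assertion is enough.
assert sp.issparse(result)
assert result.dtype == np.float64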
Done
sklearn/utils/validation.py
Outdated
raise ValueError(
    "Pandas DataFrame with mixed sparse extension arrays "
    "generated a sparse matrix with object dtype which "
    "can not be converted to a scipy sparse matrix")
"can not be converted to a scipy sparse matrix") | |
"can not be converted to a scipy sparse matrix." | |
) |
Shall we provide some hints on how to solve the issue? One might want to convert manually to a single dtype if possible.
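For instance, the hint could point users at something along these lines (a sketch; the column names, values, and the SparseDtype cast are illustrative):

import numpy as np
import pandas as pd

from sklearn.utils.validation import check_array

# Sparse columns with different numeric dtypes.
df = pd.DataFrame({'col1': pd.arrays.SparseArray([0, 1, 0], dtype='int64',
                                                 fill_value=0),
                   'col2': pd.arrays.SparseArray([1.0, 0.0, 1.0],
                                                 fill_value=0.0)})

# Manually cast every column to a common sparse dtype before validation,
# so the generated scipy matrix ends up with a single numeric dtype.
df = df.astype(pd.SparseDtype(np.float64, fill_value=0))
X = check_array(df, accept_sparse=['csr', 'csc'])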
Done
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
LGTM. @thomasjpfan do you want to have a look at this and potentially merge it?
FYI I checked codecov. I am sure that we covered every change that we made in this PR.
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Thanks a lot @glemaitre
Minor comments. Otherwise LGTM
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
@thomasjpfan Thank you very much for guiding me through this issue.
FIX Raise proper error for sparse dataframe with mixed dtypes (scikit-learn#17992)
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
Fixes #17945.
Currently, the check_array() function in validation.py throws a confusing exception while validating a pandas DataFrame built from sparse extension arrays with mixed dtypes.
The error happens because, in these cases, the DataFrame conversion creates a coo_matrix with dtype('object'), and when trying to create a csr_matrix from this coo_matrix scipy raises a TypeError.
This scipy exception is not very informative for the scikit-learn user who hits it while trying to fit an estimator with a sparse DataFrame.
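A minimal standalone illustration of that scipy behaviour (the matrix values here are arbitrary):

import numpy as np
import scipy.sparse as sp

# An object-dtype COO matrix, like the one produced by the DataFrame
# conversion in the mixed-dtype case, cannot be converted to CSR by scipy.
data = np.array([1, 2.5], dtype=object)
row = np.array([0, 1])
col = np.array([1, 0])
coo = sp.coo_matrix((data, (row, col)), shape=(2, 2))
try:
    coo.tocsr()
except TypeError as exc:
    print(exc)  # no supported conversion for types: (dtype('O'),)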
The fix is to handle these two cases in check_array by looking at the coo_matrix and issuing a clear exception to the caller.
The fix includes a unit test that covers all the input type scenarios that can generate this issue.
The unit test will fail if run against the version prior to this fix.