
stacking_cv_regression.py should use X, y = check_X_y(X, y, accept_sparse=['csc', 'csr'], dtype=None) #562

tjhgit opened this issue Jul 1, 2019 · 9 comments · Fixed by #563

tjhgit commented Jul 1, 2019

currently stacking_cv_regression.py uses X, y = check_X_y(X, y, accept_sparse=['csc', 'csr']) which checks for numeric input.
However if the first layer models perform a categorical encoding it is perfectly fine not to have numeric but rather np.array of type object.

So please change the check to:
X, y = check_X_y(X, y, accept_sparse=['csc', 'csr'], dtype=None)
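For context, the effect of the `dtype` argument can be sketched with NumPy alone (this mimics what `check_X_y`'s default `dtype='numeric'` does versus `dtype=None`; the variable names are illustrative):

```python
import numpy as np

# Raw categorical features of dtype object, as they look before any
# encoding done by the first-layer models.
X = np.array([["red", 1.0], ["blue", 2.0]], dtype=object)

# dtype='numeric' (check_X_y's default) effectively attempts a float cast,
# which fails on non-numeric strings:
try:
    X.astype(np.float64)
    numeric_cast_ok = True
except ValueError:
    numeric_cast_ok = False

# dtype=None leaves the array untouched, so object-dtype input passes through:
X_checked = np.asarray(X, dtype=None)

assert numeric_cast_ok is False
assert X_checked.dtype == object
```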


rasbt commented Jul 1, 2019

Thanks for the note! I think the current implementation should work fine though:

```python
from sklearn.utils import check_X_y
import numpy as np

X = np.random.random((10, 2)).astype(str)
y = np.ones(10)
check_X_y(X, y, accept_sparse=['csc', 'csr'])
```

returns

```
(array([['0.9346316090185622', '0.16042444137263268'],
        ['0.8509128681888157', '0.04362697943384508'],
        ['0.9017565031657032', '0.3581573277097798'],
        ['0.17439346265779732', '0.5833740568279369'],
        ['0.12902336862557318', '0.06860327739703476'],
        ['0.8507041677034654', '0.35158528908301556'],
        ['0.6883309548232583', '0.42166588958658446'],
        ['0.6854723415869861', '0.3736871105678704'],
        ['0.8142315708984', '0.7559914053700448'],
        ['0.14472053378019012', '0.07451959739328617']], dtype='<U32'),
 array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]))
```

However, based on the FutureWarning:

```
/Users/sebastian/miniconda3/lib/python3.7/site-packages/sklearn/utils/validation.py:532: FutureWarning: Beginning in version 0.22, arrays of bytes/strings will be converted to decimal numbers if dtype='numeric'. It is recommended that you convert the array to a float dtype before using it in scikit-learn, for example by using your_array = your_array.astype(np.float64).
  FutureWarning)
```

It may stop working in the next release. What would your recommendation be for handling object-type input like that? One could simply remove check_X_y, but maybe you have an alternative in mind?

EDIT: Oops, I overlooked your suggestion ... I thought dtype=None was already the default. It makes sense, then, to make that change.

@rasbt rasbt mentioned this issue Jul 1, 2019
5 tasks
@rasbt rasbt closed this as completed in #563 Jul 1, 2019

pwl commented Aug 31, 2019

Shouldn't it be check_X_y(X, y, accept_sparse=['csc', 'csr'], force_all_finite='allow-nan')? The NaNs in X should be taken care of by the models, right?


rasbt commented Aug 31, 2019

Thanks for the note. Yes, the idea is that the models should take care of that. One might argue that this is also true for np.inf. So, we could change it to force_all_finite=False (but maybe add a user warning if there is an np.nan or np.inf in the input).
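A minimal sketch of that idea (the helper name `warn_on_nonfinite` is hypothetical, not mlxtend API; it assumes a float-typed NumPy array):

```python
import warnings
import numpy as np

def warn_on_nonfinite(X):
    # Hypothetical helper: warn instead of raising when X contains NaN or
    # inf, leaving the actual handling to the first-level models.
    X = np.asarray(X)
    if np.issubdtype(X.dtype, np.floating) and not np.isfinite(X).all():
        warnings.warn("X contains NaN or inf values; the first-level "
                      "models must be able to handle them.")
    return X

X = np.array([[1.0, np.nan], [np.inf, 2.0]])
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    warn_on_nonfinite(X)
assert len(caught) == 1  # one warning was emitted, no exception raised
```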

@rasbt rasbt reopened this Aug 31, 2019

pwl commented Aug 31, 2019

Good point, best not to discard any kind of data a priori.

There's another issue: check_X_y also converts X and y to NumPy arrays when you pass a DataFrame or a Series, and I couldn't find an option to disable that behavior. In the end, the easiest fix for me for the time being was to comment out the check.
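One way around the conversion (a sketch only; `validate_shapes` is a hypothetical helper, not mlxtend or scikit-learn API) is to validate shapes without ever calling `np.asarray` on the inputs, so a DataFrame or Series is returned untouched:

```python
import numpy as np

def validate_shapes(X, y):
    # Hypothetical helper: check dimensions only and return the inputs
    # unchanged, so a pandas DataFrame stays a DataFrame.
    if len(getattr(X, "shape", ())) != 2:
        raise ValueError("X must be 2-dimensional")
    if X.shape[0] != len(y):
        raise ValueError("X and y have inconsistent numbers of samples")
    return X, y

X = np.ones((5, 3))
y = np.zeros(5)
X_out, y_out = validate_shapes(X, y)
assert X_out is X and y_out is y  # same objects back: no conversion, no copy
```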

Also, the StackingClassifier has no checks whatsoever; perhaps both stacking classifiers should have the same checks in place?


rasbt commented Aug 31, 2019

Yeah, I think it may not be necessary to recast the inputs to NumPy arrays ... depending on the behavior of the estimators used within the stacking classifier.

Regarding the inconsistency between the StackingCVClassifier and the StackingClassifier, that's probably because they were implemented independently and have been modified inconsistently with respect to each other. Some time ago, we tried to unify them in terms of (Python) class inheritance but haven't done so yet.

rchaves33 (Contributor) commented:

> Thanks for the note. Yes, the idea is that the models should take care of that. One might argue that this is also true for np.inf. So, we could change it to force_all_finite=False (but maybe add a user warning if there is an np.nan or np.inf in the input).

I ran into this issue in stacking_cv_classification.py, which has the same check_X_y call. One of my level-1 classifiers is an XGBClassifier, which can handle NaNs in X, but since my features contain many NaNs, this check was failing and .fit was raising an unnecessary error. Adding the argument force_all_finite=False resolved the issue.

Are you open to a PR for this?

(Just curious: the recent scikit-learn implementation of StackingClassifier does not seem to call check_X_y at all. Perhaps the check can be avoided altogether?)


rasbt commented Jan 13, 2020

> Are you open to a PR for this?

Yes, I am open to it, though I think I already addressed that via #606. You are welcome to take another look if there are still issues.


rchaves33 commented Jan 15, 2020

@rasbt Thanks for pointing me to that issue; I hadn't seen it, and it does resolve my error.

Unfortunately, in my firewalled environment I'm restricted to using official pip releases. The PR from #606 came after the current v0.17 on PyPI. Any idea when v0.18 will be released, or is there a possibility of releasing a v0.17.1 with the various changes since? That would be awesome.

As for this issue, it can probably be closed, unless @tjhgit wants to re-introduce the check_X_y call (but now with force_all_finite=False) for the other checks it performs. From my side, it's fine to remove it altogether, as it currently is in master.

@rasbt
Copy link
Owner

rasbt commented Feb 10, 2020

Just catching up with notifications: I made a 0.17.1 release ~2 weeks back that should include that fix :).
