
stacking_cv_regression.py should use X, y = check_X_y(X, y, accept_sparse=['csc', 'csr'], dtype=None) #562

tjhgit opened this issue Jul 1, 2019 · 9 comments · Fixed by #563

tjhgit commented Jul 1, 2019

currently stacking_cv_regression.py uses X, y = check_X_y(X, y, accept_sparse=['csc', 'csr']) which checks for numeric input.
However if the first layer models perform a categorical encoding it is perfectly fine not to have numeric but rather np.array of type object.

So please change the check to:
X, y = check_X_y(X, y, accept_sparse=['csc', 'csr'], dtype=None)
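For context, the effect of the `dtype` argument can be sketched with NumPy alone (this mimics what `check_X_y`'s default `dtype='numeric'` does versus `dtype=None`; the variable names are illustrative):

```python
import numpy as np

# Raw categorical features of dtype object, as they look before any
# encoding done by the first-layer models.
X = np.array([["red", 1.0], ["blue", 2.0]], dtype=object)

# dtype='numeric' (check_X_y's default) effectively attempts a float cast,
# which fails on non-numeric strings:
try:
    X.astype(np.float64)
    numeric_cast_ok = True
except ValueError:
    numeric_cast_ok = False

# dtype=None leaves the array untouched, so object-dtype input passes through:
X_checked = np.asarray(X, dtype=None)

assert numeric_cast_ok is False
assert X_checked.dtype == object
```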


rasbt commented Jul 1, 2019

Thanks for the note! I think the current implementation should work fine though:

```python
from sklearn.utils import check_X_y
import numpy as np

X = np.random.random((10, 2)).astype(str)
y = np.ones(10)
check_X_y(X, y, accept_sparse=['csc', 'csr'])
```

returns

```
(array([['0.9346316090185622', '0.16042444137263268'],
        ['0.8509128681888157', '0.04362697943384508'],
        ['0.9017565031657032', '0.3581573277097798'],
        ['0.17439346265779732', '0.5833740568279369'],
        ['0.12902336862557318', '0.06860327739703476'],
        ['0.8507041677034654', '0.35158528908301556'],
        ['0.6883309548232583', '0.42166588958658446'],
        ['0.6854723415869861', '0.3736871105678704'],
        ['0.8142315708984', '0.7559914053700448'],
        ['0.14472053378019012', '0.07451959739328617']], dtype='<U32'),
 array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]))
```

However, based on the FutureWarning:

```
/Users/sebastian/miniconda3/lib/python3.7/site-packages/sklearn/utils/validation.py:532: FutureWarning: Beginning in version 0.22, arrays of bytes/strings will be converted to decimal numbers if dtype='numeric'. It is recommended that you convert the array to a float dtype before using it in scikit-learn, for example by using your_array = your_array.astype(np.float64).
  FutureWarning)
```

It may stop working in the next release. What would your recommendation be for handling object-type input like that? One could simply remove check_X_y, but maybe you have an alternative in mind?

EDIT: Oops, I overlooked your suggestion ... I thought dtype=None was already the default. It makes sense, then, to make that change.

@rasbt rasbt mentioned this issue Jul 1, 2019
5 tasks
@rasbt rasbt closed this as completed in #563 Jul 1, 2019

pwl commented Aug 31, 2019

Shouldn't it be check_X_y(X, y, accept_sparse=['csc', 'csr'], force_all_finite='allow-nan')? The NaNs in X should be taken care of by the models, right?


rasbt commented Aug 31, 2019

Thanks for the note. Yes, the idea is that the models should take care of that. One might argue that this is also true for np.inf. So, we could change it to force_all_finite=False (but maybe add a user warning if there is an np.nan or np.inf in the input).
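A minimal sketch of that idea (the helper name `warn_on_nonfinite` is hypothetical, not mlxtend API; it assumes a float-typed NumPy array):

```python
import warnings
import numpy as np

def warn_on_nonfinite(X):
    # Hypothetical helper: warn instead of raising when X contains NaN or
    # inf, leaving the actual handling to the first-level models.
    X = np.asarray(X)
    if np.issubdtype(X.dtype, np.floating) and not np.isfinite(X).all():
        warnings.warn("X contains NaN or inf values; the first-level "
                      "models must be able to handle them.")
    return X

X = np.array([[1.0, np.nan], [np.inf, 2.0]])
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    warn_on_nonfinite(X)
assert len(caught) == 1  # one warning was emitted, no exception raised
```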

@rasbt rasbt reopened this Aug 31, 2019

pwl commented Aug 31, 2019

Good point, best not to discard any kind of data a priori.

There's another issue: check_X_y also converts X and y to NumPy arrays when you pass a DataFrame or a Series, and I couldn't find an option to disable that behavior. In the end, the easiest fix for me for the time being was to comment out the check.
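One way around the conversion (a sketch only; `validate_shapes` is a hypothetical helper, not mlxtend or scikit-learn API) is to validate shapes without ever calling `np.asarray` on the inputs, so a DataFrame or Series is returned untouched:

```python
import numpy as np

def validate_shapes(X, y):
    # Hypothetical helper: check dimensions only and return the inputs
    # unchanged, so a pandas DataFrame stays a DataFrame.
    if len(getattr(X, "shape", ())) != 2:
        raise ValueError("X must be 2-dimensional")
    if X.shape[0] != len(y):
        raise ValueError("X and y have inconsistent numbers of samples")
    return X, y

X = np.ones((5, 3))
y = np.zeros(5)
X_out, y_out = validate_shapes(X, y)
assert X_out is X and y_out is y  # same objects back: no conversion, no copy
```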

Also, the StackingClassifier has no checks whatsoever; perhaps both stacking classifiers should have the same checks in place?


rasbt commented Aug 31, 2019

Yeah, I think it may not be necessary to recast the inputs to NumPy arrays ... depending on the behavior of the estimators used within the stacking classifier.

Regarding the inconsistency between the StackingCVClassifier and the StackingClassifier, that's probably because they were implemented independently and have been modified inconsistently with respect to each other. Some time ago, we tried to unify them in terms of (Python) class inheritance but haven't done so yet.

rchaves33 (Contributor) commented:

> Thanks for the note. Yes, the idea is that the models should take care of that. One might argue that this is also true for np.inf. So, we could change it to force_all_finite=False (but maybe add a user warning if there is an np.nan or np.inf in the input).

I ran into this issue in stacking_cv_classification.py, which has the same check_X_y call. One of my level-1 classifiers is an XGBClassifier, which can handle NaNs in X, but since my features contain many NaNs, this check was failing and .fit was raising an unnecessary error. Adding the argument force_all_finite=False resolved the issue.

Are you open to a PR for this?

(Just curious: the recent scikit-learn implementation of StackingClassifier does not seem to call check_X_y at all. Perhaps the check can be avoided altogether?)


rasbt commented Jan 13, 2020

> Are you open to a PR for this?

Yes, I am open to it, though I think I already addressed that via #606. You are welcome to take another look if there are still issues.


rchaves33 commented Jan 15, 2020

@rasbt Thanks for pointing me to that issue; I hadn't seen it, and it does resolve my error.

Unfortunately, in my firewalled environment I'm restricted to using official pip releases. The PR from #606 came after the current v0.17 on PyPI. Any idea when v0.18 will be released, or is there a possibility of releasing a v0.17.1 with the various changes since? That would be awesome.

As for this issue, it can probably be closed, unless @tjhgit wants to re-introduce the check_X_y call (but now with force_all_finite=False) for the other checks it performs. From my side, it's fine to remove it altogether, as it currently is in master.

@rasbt
Copy link
Owner

rasbt commented Feb 10, 2020

Just catching up with notifications: I made a 0.17.1 release ~2 weeks back that should include that fix :).
