PLS reports "array must not contain nan" if a feature is constant #13609

Closed
jnothman opened this issue Apr 10, 2019 · 7 comments · Fixed by #14450
Labels: Bug, Easy, help wanted

Comments

@jnothman (Member)

Originally reported at #2089 (comment) by @Franck-Dernoncourt. Reproduce with:

import numpy as np
import sklearn.cross_decomposition

pls2 = sklearn.cross_decomposition.PLSRegression()
xx = np.random.random((5, 5))
yy = np.zeros((5, 5))

yy[0, :] = [0, 1, 0, 0, 0]
yy[1, :] = [0, 0, 0, 1, 0]
yy[2, :] = [0, 0, 0, 0, 1]
# yy[3, :] = [1, 0, 0, 0, 0]  # Uncommenting this line solves the issue

pls2.fit(xx, yy)  # fails with "array must not contain nan"

The obscure error message is due to the presence of a column containing only 0.

jnothman added the Bug, Easy, and help wanted labels on Apr 10, 2019
@MarcoGorelli (Contributor)

What would you like to see instead? An assertion when the fit method is called that checks that no feature is constant, and returns a clear error if the assertion fails?
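As a rough sketch of that suggestion (the helper name _assert_no_constant_columns is hypothetical and this is not the fix that later landed in #14450), such a check could look something like:

import numpy as np

def _assert_no_constant_columns(arr, name):
    # A column whose standard deviation is zero is constant.
    constant = np.where(np.std(arr, axis=0) == 0)[0]
    if constant.size:
        raise ValueError(
            f"{name} contains constant column(s) at index {constant.tolist()}"
        )

# Called at the start of fit(), e.g.:
# _assert_no_constant_columns(X, "X")
# _assert_no_constant_columns(Y, "Y")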

@jnothman (Member, Author)

jnothman commented Apr 10, 2019 via email

@borysknysh

As far as I understand, we need to remove the error message while keeping the correct answer (the one obtained when the line yy[3,:] = [1,0,0,0,0] is uncommented).
Can I try to solve this issue if nobody minds?

@MarcoGorelli (Contributor)

MarcoGorelli commented Apr 17, 2019 via email

@jnothman (Member, Author)

As far as I understand, we need to remove the error message while keeping the correct answer

I'm not an expert on PLS; I was relying on the comments historically related to this issue to describe it as a simple fix. But certainly the problem is constant features.

Go ahead and submit a pull request, @iodapro

@camilaagw (Contributor)

camilaagw commented Jul 13, 2019

@jnothman there is something I can't understand about the example you give in the issue: even when we uncomment the line yy[3,:] = [1,0,0,0,0], the third column of yy is still constant, yet in that case pls2.fit(xx, yy) works. Do we need two columns to be constant for the PLS to fail?

@camilaagw (Contributor)

camilaagw commented Jul 23, 2019

After taking a deeper look, I see the problem is not constant features: the problem is that the first column of the target (yy) is constant. For instance, this case will work (constant features and some constant columns in the target, none of which is the first column):

import numpy as np
import sklearn.cross_decomposition

pls2 = sklearn.cross_decomposition.PLSRegression()
xx = np.random.random((5, 5))
xx[:, 1] = 1  # constant feature
xx[:, 2] = 0  # constant feature
yy = np.random.random((5, 5))
yy[:, 2] = 5  # constant target column (not the first)
yy[:, 4] = 1  # constant target column (not the first)
pls2.fit(xx, yy)
pls2.predict(xx)

But this case won't (the first column of the target is constant):

import numpy as np
import sklearn.cross_decomposition

pls2 = sklearn.cross_decomposition.PLSRegression()
xx = np.random.random((5, 5))
yy = np.random.random((5, 5))
yy[:, 0] = 4  # constant first target column
pls2.fit(xx, yy)
pls2.predict(xx)

This is because the first step of the _nipals_twoblocks_inner_loop algorithm is to take y_score = Y[:, [0]]. Since _center_scale_xy turns the constant first column of yy into a column of zeros, x_weights = np.dot(X.T, y_score) / np.dot(y_score.T, y_score) is a zero vector divided by zero, i.e. an array of nan.
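A minimal sketch of that mechanism in plain NumPy (standing in for the private _center_scale_xy and _nipals_twoblocks_inner_loop helpers, so the names and steps here are a simplification):

import numpy as np

rng = np.random.RandomState(0)
X = rng.random_sample((5, 5))
Y = rng.random_sample((5, 5))
Y[:, 0] = 4  # constant first target column

# Centering makes the constant column all zeros; its zero standard deviation
# is replaced by 1 before scaling, so the column stays all zeros.
Xc = X - X.mean(axis=0)
Yc = Y - Y.mean(axis=0)

y_score = Yc[:, [0]]  # all zeros
x_weights = np.dot(Xc.T, y_score) / np.dot(y_score.T, y_score)  # 0 / 0 -> nan
print(x_weights)  # array of nan (NumPy emits an invalid-value RuntimeWarning)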
