GradientBoostingClassifier.fit accepts sparse X, but .predict does not #6101

Closed
aflaxman opened this Issue Dec 30, 2015 · 18 comments

Contributor

aflaxman commented Dec 30, 2015

I have a sparse dataset that is too large for main memory if I call X.todense(). If I understand correctly, GradientBoostingClassifier.fit will accept my sparse X, but it is not currently possible to use GradientBoostingClassifier.predict on the results. It would be great if that were not the case.

Here is a minimal example of the issue:

from scipy import sparse
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=20, n_features=5, random_state=0)
X_sp = sparse.coo_matrix(X)

clf = GradientBoostingClassifier()
clf.fit(X,y)
clf.predict(X)  # works

clf.fit(X_sp, y)  # works
clf.predict(X_sp)  # fails with TypeError: A sparse matrix was passed, but dense data is required.
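
For versions where predict still rejects sparse input, one possible workaround (an assumption, not something proposed in this thread) is to densify only a block of rows at a time, so the full dense matrix never has to exist in memory. The helper name predict_in_dense_batches and its batch_size parameter are hypothetical.

```python
import numpy as np
from scipy import sparse


def predict_in_dense_batches(predict_fn, X_sparse, batch_size=1000):
    """Call predict_fn on dense row blocks of X_sparse and join the results."""
    X_csr = sparse.csr_matrix(X_sparse)  # CSR supports efficient row slicing
    outputs = []
    for start in range(0, X_csr.shape[0], batch_size):
        # Only this block of rows is converted to a dense array.
        block = X_csr[start:start + batch_size].toarray()
        outputs.append(np.asarray(predict_fn(block)))
    if not outputs:  # no rows at all
        return np.empty(0)
    return np.concatenate(outputs)
```

With the fitted classifier from the example above, this would be called as predict_in_dense_batches(clf.predict, X_sp); batch_size trades memory use against the number of predict calls.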

olologin Jan 2, 2016

Contributor

Confirmed. I think a small rework of https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/ensemble/_gradient_boosting.pyx#L39 is needed.
It looks like the function _predict_regression_tree_inplace_fast was written to optimize prediction speed. But I see some problems in it: its main loop https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/ensemble/_gradient_boosting.pyx#L92 differs from the main loop here https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/_tree.pyx#L784, and additionally this function doesn't work with sparse matrices.

Maybe it's better to rely on the predict method of the Tree class, but some performance tests are needed.
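
As a rough illustration of relying on the trees' own predict method, the individual regression trees stored in clf.estimators_ can be queried directly with dense and sparse versions of the same data. This is only a sketch: it assumes a scikit-learn version whose decision trees accept sparse input at prediction time, the function name is hypothetical, and it is not the fix that was eventually merged.

```python
import numpy as np
from scipy import sparse
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier


def first_tree_agrees_on_sparse():
    """Compare one boosting stage's tree on dense and CSR copies of the data."""
    X, y = make_classification(n_samples=20, n_features=5, random_state=0)
    clf = GradientBoostingClassifier(n_estimators=5, random_state=0).fit(X, y)
    tree = clf.estimators_[0, 0]  # regression tree fitted at the first stage
    dense_out = tree.predict(X)
    sparse_out = tree.predict(sparse.csr_matrix(X))
    return bool(np.allclose(dense_out, sparse_out))
```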


Sandy4321 Jan 14, 2016

This problem is still present in:
clf.predict(X_sp)
Traceback (most recent call last):
Debug Probe, prompt 11, line 1
File "C:\Anaconda\Lib\site-packages\sklearn\ensemble\gradient_boosting.py", line 1498, in predict
score = self.decision_function(X)
File "C:\Anaconda\Lib\site-packages\sklearn\ensemble\gradient_boosting.py", line 1456, in decision_function
X = check_array(X, dtype=DTYPE, order="C")
File "C:\Anaconda\Lib\site-packages\sklearn\utils\validation.py", line 371, in check_array
force_all_finite)
File "C:\Anaconda\Lib\site-packages\sklearn\utils\validation.py", line 238, in _ensure_sparse_format
raise TypeError('A sparse matrix was passed, but dense '
TypeError: A sparse matrix was passed, but dense data is required. Use X.toarray() to convert to a dense numpy array.


aflaxman Jan 14, 2016

Contributor

@Sandy4321, here is how you can try out the code in PR #6116:

git clone https://github.com/scikit-learn/scikit-learn.git
cd scikit-learn
git pull https://github.com/olologin/scikit-learn.git GradientBoostingFix
python setup.py install
cd ..

It would be cool if GitHub gave this advice on the PR page... I know it has good advice on how to merge a PR manually if you are a repo admin. But it is the non-admins like me who really need the step-by-step instructions.

BTW, when I do this, my minimal example succeeds! Thanks @olologin !


Sandy4321 Jan 14, 2016

I have an Anaconda Python installation. I recently updated scikit-learn with pip,
and this broke scikit-learn on my computer... It took a lot of effort to get
it working again. Does this carry some risk? Would it be better to update only
one or two files?


Sandy4321 Jan 14, 2016

OK. I have a Windows computer, so are all of these commands run in a DOS window?


olologin Jan 15, 2016

Contributor

@Sandy4321, I'm not sure about a Windows computer, because you would need to install the Git VCS.

Read about building the development version here: http://scikit-learn.org/stable/developers/advanced_installation.html#install-bleeding-edge


Sandy4321 Jan 17, 2016

OK, let's keep it simple: may I just have the .py file that implements it?

jnothman closed this in 78dbcb2 Oct 15, 2016


aflaxman Oct 15, 2016

Contributor

So cool, thanks all!

afiodorov added a commit to unravelin/scikit-learn that referenced this issue Apr 25, 2017

Sundrique added a commit to Sundrique/scikit-learn that referenced this issue Jun 14, 2017

paulha added a commit to paulha/scikit-learn that referenced this issue Aug 19, 2017

maskani-moh added a commit to maskani-moh/scikit-learn that referenced this issue Nov 15, 2017


deeptipatil May 17, 2018

I am still getting an error from GradientBoostingClassifier.predict: "A sparse matrix was passed, but dense data is required." It works fine with GradientBoostingClassifier.fit.


jnothman May 17, 2018

Member

Please provide a runnable code snippet that reproduces the issue, and mention the scikit-learn version.


deeptipatil May 17, 2018

text_clf_1 = Pipeline([('vect', CountVectorizer(stop_words=STOPWORDS, ngram_range=(1,2))),('tfidf', TfidfTransformer()),('clf',GradientBoostingClassifier(verbose =100,n_estimators=100))])
text_clf_1fit = text_clf_1.fit(X_train, y_train)
The code above works fine.

import datetime
from datetime import datetime
t1 = datetime.now()
text_clf_1predicted= text_clf_1fit.predict(X_test)
print(datetime.now() - t1)

TypeError Traceback (most recent call last)
in ()
2 from datetime import datetime
3 t1 = datetime.now()
----> 4 text_clf_1predicted= text_clf_1fit.predict(X_test)
5 print(datetime.now() - t1)

C:\Users\deepti.patil\AppData\Local\Continuum\Anaconda3\lib\site-packages\sklearn\utils\metaestimators.py in (*args, **kwargs)
52
53 # lambda, but not partial, allows help() to work with update_wrapper
---> 54 out = lambda *args, **kwargs: self.fn(obj, *args, **kwargs)
55 # update the docstring of the returned function
56 update_wrapper(out, self.fn)

C:\Users\deepti.patil\AppData\Local\Continuum\Anaconda3\lib\site-packages\sklearn\pipeline.py in predict(self, X)
325 if transform is not None:
326 Xt = transform.transform(Xt)
--> 327 return self.steps[-1][-1].predict(Xt)
328
329 @if_delegate_has_method(delegate='_final_estimator')

C:\Users\deepti.patil\AppData\Local\Continuum\Anaconda3\lib\site-packages\sklearn\ensemble\gradient_boosting.py in predict(self, X)
1533 The predicted values.
1534 """
-> 1535 score = self.decision_function(X)
1536 decisions = self.loss_.score_to_decision(score)
1537 return self.classes_.take(decisions, axis=0)

C:\Users\deepti.patil\AppData\Local\Continuum\Anaconda3\lib\site-packages\sklearn\ensemble\gradient_boosting.py in decision_function(self, X)
1491 [n_samples].
1492 """
-> 1493 X = check_array(X, dtype=DTYPE, order="C")
1494 score = self._decision_function(X)
1495 if score.shape[1] == 1:

C:\Users\deepti.patil\AppData\Local\Continuum\Anaconda3\lib\site-packages\sklearn\utils\validation.py in check_array(array, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
378 if sp.issparse(array):
379 array = _ensure_sparse_format(array, accept_sparse, dtype, copy,
--> 380 force_all_finite)
381 else:
382 array = np.array(array, dtype=dtype, order=order, copy=copy)

C:\Users\deepti.patil\AppData\Local\Continuum\Anaconda3\lib\site-packages\sklearn\utils\validation.py in _ensure_sparse_format(spmatrix, accept_sparse, dtype, copy, force_all_finite)
241 """
242 if accept_sparse in [None, False]:
--> 243 raise TypeError('A sparse matrix was passed, but dense '
244 'data is required. Use X.toarray() to '
245 'convert to a dense numpy array.')

TypeError: A sparse matrix was passed, but dense data is required. Use X.toarray() to convert to a dense numpy array.

The scikit-learn version is 0.18.1.
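
For an installation that cannot be upgraded, one possible stopgap (an assumption, not advice given in this thread) is to add a pipeline step that densifies the features before they reach the classifier. This only helps when the dense matrix fits in memory. The step name "dense", the helper to_dense, and the toy documents below are made up for illustration.

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer


def to_dense(X):
    """Convert a scipy sparse matrix to a dense NumPy array."""
    return X.toarray()


def build_dense_pipeline():
    return Pipeline([
        ("vect", CountVectorizer(ngram_range=(1, 2))),
        ("tfidf", TfidfTransformer()),
        # Hypothetical extra step: densify before the classifier sees the data.
        ("dense", FunctionTransformer(to_dense, accept_sparse=True)),
        ("clf", GradientBoostingClassifier(n_estimators=10, random_state=0)),
    ])


# Toy documents and labels, invented for this sketch only.
docs = ["sparse input fails", "dense input works",
        "sparse matrix error", "dense array fine"]
labels = [1, 0, 1, 0]

pipeline = build_dense_pipeline().fit(docs, labels)
predictions = pipeline.predict(docs)
```

Because the classifier only ever receives a dense array here, both fit and predict avoid the sparse code path entirely.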


Sandy4321 May 17, 2018

Friends, we have had this problem for 3 years.
This issue was opened when
aflaxman commented on Dec 29, 2015.
It is important to be able to run both fit and predict on sparse data.
Can you please finally fix it?
We deserve it...


rth May 17, 2018

Member

As far as I can tell, this was fixed in #6116, which was included in v0.19 and later.
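
A quick way to check an installation against that threshold is to compare the version string with 0.19. The helper below is a hypothetical sketch using only the standard library; the 0.19 cutoff is taken from the comment above, not verified independently.

```python
import re


def supports_sparse_predict(version):
    """Return True if the version string is 0.19 or newer (hypothetical helper)."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError("unrecognized version string: %r" % version)
    major, minor = int(match.group(1)), int(match.group(2))
    # Tuple comparison handles both 0.x and 1.x style versions.
    return (major, minor) >= (0, 19)
```

In practice it would be called with sklearn.__version__, for example supports_sparse_predict(sklearn.__version__).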


Sandy4321 May 18, 2018

Super good news, thanks!
It works now.

#S_may18_2018_sparse_data_GBM.py
from scipy import sparse
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=20, n_features=5, random_state=0)
X_sp = sparse.coo_matrix(X)

clf = GradientBoostingClassifier()
clf.fit(X,y)
clf.predict(X)  # works

clf.fit(X_sp, y)  # works
clf.predict(X_sp)  # fails with TypeError: A sparse matrix was passed, but dense data is required

q=1
Why did they not let us know?

As was written:
olologin commented on Aug 27, 2016
I fixed performance issue, now it works almost as fast as dense version in test provided by @ogrisel above. 2.773s for dense and 3.104s for sparse.
Also I've found and fixed stupid mistake in safe_realloc usage from tree.pyx and in function for sparse prediction which I added here. It required more memory to allocate than user needs


So, could somebody share test case code?
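
A minimal test case along those lines might look like the sketch below. It assumes a scikit-learn release that includes the fix (0.19 or later, per the comment above); the function name is hypothetical and this is not the test that ships with scikit-learn.

```python
import numpy as np
from scipy import sparse
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier


def check_sparse_predict_matches_dense():
    """Fit once on dense data, then compare predictions across sparse formats."""
    X, y = make_classification(n_samples=20, n_features=5, random_state=0)
    clf = GradientBoostingClassifier(random_state=0).fit(X, y)
    dense_pred = clf.predict(X)
    results = {}
    for fmt in (sparse.csr_matrix, sparse.csc_matrix, sparse.coo_matrix):
        # Each entry records whether the sparse prediction equals the dense one.
        results[fmt.__name__] = bool(np.array_equal(clf.predict(fmt(X)), dense_pred))
    return results
```

On a version without the fix, the clf.predict call on sparse input would be expected to raise the TypeError quoted earlier in this thread instead of returning.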


rth May 18, 2018

Member

You can use GitHub Markdown formatting for code and citations; it really helps readability. (I edited your code formatting above.)

The above code sample works fine for me with 0.19.1. I am not sure what you mean by the cited comment.


Sandy4321 May 22, 2018

How do I use GitHub Markdown formatting for code?
