WIP: LinearClassifierMixin support for sparse coef_ #1702

Merged
merged 1 commit into scikit-learn:master on Feb 26, 2013

Conversation

@pprett (Member) commented Feb 22, 2013

In the case of high-dimensional feature spaces and/or a large number of classes, it may be wise to store the coef_ array of a linear model as a scipy sparse matrix rather than a dense numpy array.

We might even do this automagically, e.g. once the sparsity of coef_ reaches a certain level; in the meantime, however, I'd suggest leaving the user in charge of deciding whether to convert coef_ to a sparse representation.

This popped up on the mailing list recently: http://sourceforge.net/mailarchive/message.php?msg_id=30515870

This PR is just a minor modification of LinearModelMixin.decision_function that enables the use of sparse coef_.

NOTE: this PR is work in progress - no tests yet.
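
For illustration, here is a minimal sketch of what the change amounts to (an approximation, not the actual diff), relying on safe_sparse_dot from sklearn.utils.extmath:

```python
# Sketch only: a decision_function that works whether coef_ is a dense
# ndarray or a scipy sparse matrix. Not the exact code from this PR.
from sklearn.utils.extmath import safe_sparse_dot


def decision_function(clf, X):
    # safe_sparse_dot picks the right dot product for any dense/sparse
    # combination of X and coef_; dense_output=True guarantees an ndarray
    # result even when both operands are sparse, so the argmax/thresholding
    # done in predict() keeps working unchanged.
    scores = safe_sparse_dot(X, clf.coef_.T, dense_output=True) + clf.intercept_
    return scores.ravel() if scores.shape[1] == 1 else scores
```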

I suspected that adding dense_output=True might cause a performance regression, but this does not seem to be the case.

The following timings are on the test set of RCV1 ccat; shape: (23149, 47152):

Original (dense coef_):

In [7]: %timeit clf.predict(x)
10000 loops, best of 3: 51.6 us per loop

In [8]: %timeit clf.predict(X)
100 loops, best of 3: 6.55 ms per loop

This PR (dense coef_):

In [16]: %timeit clf.predict(x)
10000 loops, best of 3: 54 us per loop

In [15]: %timeit clf.predict(X)
100 loops, best of 3: 6.83 ms per loop

This PR (CSC coef_):

In [12]: %timeit clf.predict(x)
1000 loops, best of 3: 324 us per loop

In [13]: %timeit clf.predict(X)
10 loops, best of 3: 29.6 ms per loop

This PR (CSR coef_):

In [18]: %timeit clf.predict(x)
1000 loops, best of 3: 566 us per loop

In [19]: %timeit clf.predict(X)
10 loops, best of 3: 30.3 ms per loop
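
These timings could be reproduced roughly along the following lines in an IPython session; the estimator, the data loading, and the reading of x as a single row versus X as the full test matrix are assumptions on my part, not stated in the PR:

```python
# Rough reconstruction of the benchmark (assumptions noted above).
import scipy.sparse as sp

# clf: a fitted linear classifier; x: one test sample; X: the full test set
%timeit clf.predict(x)                # dense coef_
%timeit clf.predict(X)

clf.coef_ = sp.csc_matrix(clf.coef_)  # swap in a CSC representation
%timeit clf.predict(x)
%timeit clf.predict(X)

clf.coef_ = sp.csr_matrix(clf.coef_)  # and a CSR representation
%timeit clf.predict(x)
%timeit clf.predict(X)
```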

@larsmans (Member) commented

We've discussed the automagic conversion to CSR before -- the problem is that you can't further train with a sparse coef_ unless you first densify it again.

@ogrisel (Member) commented Feb 22, 2013

> We've discussed the automagic conversion to CSR before -- the problem is that you can't further train with a sparse coef_ unless you first densify it again.

I think this is fair.

Maybe we could have a SparseLinearModelPredictor that would not have a fit method but could be constructed from a fitted LinearModel subclass for the purpose of fast, memory-efficient prediction (e.g. to be deployed "in production").
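
A hypothetical sketch of that idea (the class name and constructor are made up here, not an existing scikit-learn API): a fit-less wrapper built from an already fitted linear classifier that keeps only a sparse copy of the coefficients:

```python
# Hypothetical prediction-only wrapper; not part of scikit-learn.
import scipy.sparse as sp
from sklearn.utils.extmath import safe_sparse_dot


class SparseLinearModelPredictor(object):
    """Holds a CSR copy of coef_ taken from a fitted linear classifier."""

    def __init__(self, fitted_clf):
        self.coef_ = sp.csr_matrix(fitted_clf.coef_)
        self.intercept_ = fitted_clf.intercept_
        self.classes_ = fitted_clf.classes_

    def decision_function(self, X):
        return safe_sparse_dot(X, self.coef_.T, dense_output=True) + self.intercept_

    def predict(self, X):
        scores = self.decision_function(X)
        if scores.shape[1] == 1:           # binary case
            indices = (scores.ravel() > 0).astype(int)
        else:                              # multiclass: pick the highest score
            indices = scores.argmax(axis=1)
        return self.classes_[indices]
```

Since such a wrapper has no fit method, the densification problem mentioned above never arises; the trade-off, as noted below, is a deviation from the usual estimator API.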

@larsmans (Member) commented

I was just contemplating such a solution, but it would mean a deviation from our usual API.

@larsmans (Member) commented

A different solution would be a method, say sparsify, on LinearClassifierMixin or LinearModel. That would cause fewer headaches with pipelines.

As an aside, guessing whether coef_ is sparse enough to warrant automatic conversion is likely to cause trouble if the planned 64-bit index support for scipy.sparse is implemented upstream.
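
A minimal sketch of what such a sparsify method could look like (its placement and exact behaviour are assumptions here), together with a densify counterpart so that further training remains possible, as discussed above:

```python
# Sketch only: sparsify/densify helpers on the linear-model mixin.
import scipy.sparse as sp


class LinearClassifierMixin(object):
    # ... decision_function / predict as modified in this PR ...

    def sparsify(self):
        """Convert coef_ to CSR for compact storage and prediction."""
        self.coef_ = sp.csr_matrix(self.coef_)
        return self  # return self so calls can be chained

    def densify(self):
        """Convert coef_ back to a dense ndarray before fitting again."""
        if sp.issparse(self.coef_):
            self.coef_ = self.coef_.toarray()
        return self
```

Compared with the wrapper sketched earlier, the estimator keeps its class, so it continues to work inside pipelines, which is the point about fewer headaches above.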

@ogrisel (Member) commented Feb 22, 2013

> A different solution would be a method, say sparsify, on LinearClassifierMixin or LinearModel. That would cause fewer headaches with pipelines.

I am ok with that option too.

@larsmans larsmans merged commit e6c85a3 into scikit-learn:master Feb 26, 2013