
[MRG+3] FEA Add PolynomialCountSketch to Kernel Approximation module #13003

Merged
119 commits merged into scikit-learn:master on Aug 18, 2020

Conversation

@lopeLH (Contributor) commented on Jan 17, 2019

This PR adds the Tensor Sketch [1] algorithm for polynomial kernel feature map approximation to the Kernel Approximation module.

Tensor Sketch is a well-established method for kernel feature map approximation that has been broadly applied in the literature. For instance, it has recently gained popularity as a way to accelerate certain bilinear models [2]. While the kernel approximation module already contains various approximation methods, polynomial kernels are missing, so including Tensor Sketch completes the functionality of this module by providing an efficient, data-independent polynomial kernel approximation technique.

The PR contains the implementation of the algorithm, the corresponding tests, an example script, and a description of the algorithm in the documentation page of the kernel approximation module. This implementation has been tested to produce the same results as the original MATLAB implementation provided by the author of the algorithm [1].

[1] Pham, N., & Pagh, R. (2013, August). Fast and scalable polynomial kernels via explicit feature maps. In Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 239-247). ACM.

[2] Gao, Y., Beijbom, O., Zhang, N., & Darrell, T. (2016). Compact bilinear pooling. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 317-326).
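A minimal usage sketch of the estimator as merged (under its final name PolynomialCountSketch, adopted later in this thread); the data and parameter values here are illustrative only, not part of the original PR description:

import numpy as np
from sklearn.kernel_approximation import PolynomialCountSketch
from sklearn.metrics.pairwise import polynomial_kernel

rng = np.random.RandomState(0)
X = rng.randn(200, 20)

# Exact degree-2 polynomial kernel: (gamma * <x, y> + coef0) ** degree
K_exact = polynomial_kernel(X, degree=2, gamma=1.0, coef0=0)

# Tensor Sketch approximation: the inner product of the mapped features
# approximates the kernel, improving as n_components grows.
ps = PolynomialCountSketch(degree=2, gamma=1.0, coef0=0,
                           n_components=1000, random_state=0)
X_mapped = ps.fit_transform(X)
K_approx = X_mapped @ X_mapped.T

print(np.max(np.abs(K_exact - K_approx)))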


Work for follow-up PR:

  • @rth: This can be a follow-up issue/PR, and would need double checking, but since the count_sketches input is real you can likely use rfft and irfft, which would be faster (see the sketch after this list).

  • @rth: The issue is that calling fit twice would produce a different seed and therefore a different result, since a RandomState instance is mutable. In [MRG] Expose random seed in Hashingvectorizer #14605, for a similar use case, we added a seed variable in transform, but I'm not particularly happy with that outcome either. This is probably fine as is; we would just have to address this globally at some point in RFC design of random_state #14042.

  • @rth: It could be worth considering whether it would make sense to threshold (in the above example 1e-15 would be OK as a threshold) and convert back to sparse. Though the intermediary step with the FFT would still be dense, with the associated memory requirements; maybe it could be worth chunking with respect to n_samples, not sure (#13003 (comment)).
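Regarding the first item, a rough sketch (not the merged implementation) of how rfft/irfft could be used, assuming count_sketches is a real-valued array of shape (n_samples, degree, n_components):

from scipy.fft import rfft, irfft

def tensorsketch_combine(count_sketches, n_components):
    # One-sided FFT of each degree-wise count sketch (input is real).
    sketches_fft = rfft(count_sketches, axis=2)
    # Element-wise product across the degree axis corresponds to the
    # circular convolution of the individual count sketches.
    prod = sketches_fft.prod(axis=1)
    # Back to the original domain; n recovers the full length, and the
    # result is real by construction, so no .real is needed.
    return irfft(prod, n=n_components, axis=1)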

@lopeLH changed the title [WIP] Add Tensor Sketch algorithm to Kernel Approximation module → [MRG] Add Tensor Sketch algorithm to Kernel Approximation module on Jan 18, 2019
@rth (Member) commented on Aug 12, 2020

Back to fftpack

You could use,

try:
    from scipy import fft
except ImportError:   # scipy < 1.4
    from scipy import fftpack as fft

scipy.fft uses the newer pocketfft implementation (scipy/scipy#10238), which is more optimized in some cases (numpy/numpy#11888 (comment)).

cf https://github.com/scipy/scipy/wiki/Release-note-entries-for-SciPy-1.4.0#scipyfft-added for more details.

@lopeLH (Contributor, Author) commented on Aug 12, 2020

@rth I think I addressed all your comments. All checks passing :)

Maybe have a look at how I phrased things at doc/modules/kernel_approximation.rst (line 175), regarding Count sketch and its role in TensorSketch.

@lopeLH requested a review from rth on August 12, 2020 13:31
Comment on lines +167 to +171
for d in range(self.degree):
    iHashIndex = self.indexHash_[d, j]
    iHashBit = self.bitHash_[d, j]
    count_sketches[:, d, iHashIndex] += \
        (iHashBit * X_gamma[:, j]).toarray().ravel()

It doesn't need to change now, but the following should be faster, assuming a relatively low-sparsity matrix. For typical matrices obtained by CountVectorizer this makes the transform around 5x faster, e.g. with 10k samples and 8k input features.

Suggested change
-    for d in range(self.degree):
-        iHashIndex = self.indexHash_[d, j]
-        iHashBit = self.bitHash_[d, j]
-        count_sketches[:, d, iHashIndex] += \
-            (iHashBit * X_gamma[:, j]).toarray().ravel()
+    Xg_col = X_gamma[:, j]
+    for d in range(self.degree):
+        iHashIndex = self.indexHash_[d, j]
+        iHashBit = self.bitHash_[d, j]
+        # The following requires X_gamma to be in CSC sparse
+        # format
+        count_sketches[Xg_col.indices, d, iHashIndex] += \
+            (iHashBit * Xg_col.data)

@rth (Member) commented on Aug 13, 2020

Thanks! I have been experimenting with this approach for text classification on a 10k-sample subset of the AG News dataset. Granted, it's probably not a typical use case, but I still wanted to check that the results generally make sense. Below are the classification accuracy results:

| label | fit_time | train accuracy | test accuracy |
| --- | --- | --- | --- |
| baseline: CountVectorizer + LinearSVC | 0.18 | 0.98 | 0.84 |
| baseline w/ PCA(n_components=100) | 0.64 | 0.84 | 0.83 |
| baseline w/ SparseRandomProjection(n_components=1000) | 0.29 | 0.51 | 0.49 |
| baseline w/ SVC(kernel='poly', degree=2) | 9.68 | 0.99 | 0.86 |
| baseline w/ PolynomialSampler(degree=2, n_components=100) | 1.12 | 0.36 | 0.29 |
| baseline w/ PolynomialSampler(degree=2, n_components=1000) | 3.94 | 0.65 | 0.43 |
| baseline w/ PolynomialSampler(degree=2, n_components=10000) | 25.34 | 0.99 | 0.62 |
| baseline w/ PolynomialSampler(degree=2, n_components=20000) | 53.13 | 0.99 | 0.72 |
| baseline w/ PolynomialSampler(degree=2, n_components=40000) | 106.51 | 0.99 | 0.78 |
| baseline w/ PolynomialSampler(degree=2, n_components=100000) | 269.17 | 0.99 | 0.84 |

obtained with the following notebook tensorsketch-experiments-sparse.py. Here the test scores are computed without cross-validation, as it already takes a long time (and requires the above optimization for sparse input), so they are not too reliable. There are also likely overfitting issues with LinearSVC and a large number of components.
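For context, a rough sketch of the kind of pipeline benchmarked above, with a toy dataset standing in for the AG News subset and using the merged class name PolynomialCountSketch (called PolynomialSampler at this point in the thread):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.kernel_approximation import PolynomialCountSketch
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy stand-in for the AG News subset used in the benchmark above.
texts = ["cheap flights to paris", "stocks rally on earnings",
         "team wins championship game", "new phone released today"]
labels = [2, 0, 1, 3]

clf = make_pipeline(
    CountVectorizer(),
    # n_components should typically be (much) larger than the number of
    # vectorized features; see the takeaways below.
    PolynomialCountSketch(degree=2, n_components=1000, random_state=0),
    LinearSVC(),
)
clf.fit(texts, labels)
print(clf.score(texts, labels))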

Main takeaways,

  1. We need to explicitly check in fit that degree >= 1, otherwise the FFT fails with a 0-d input matrix for degree=0. It would also be good to add a test that checks this exception with pytest.raises(ValueError, match='degree=0 should be >=1.') (see the sketch after this list).
  2. Results will be very bad for n_components < n_features, even when evaluating on the training subset. As far as I can tell, both the paper and the example only illustrate the case n_components > n_features, with an optimum of evaluation score / run time cost around n_components=10*n_features. We should add some of these suggestions to the docstring of n_components (and add a sentence to the user guide) to help choose n_components. Otherwise users may be very disappointed with the performance when using the default n_components=100 on a higher-dimensional case.
  3. For the sparse case, point 2 implies that it's not very usable, because one gets dense output matrices with 10k+ features. Empirically, with sparse input and n_components > n_features, the resulting dense matrix actually contains mostly zeros. So in a follow-up PR, it could be worth considering whether it would make sense to threshold (in the above example 1e-15 would be OK as a threshold) and convert back to sparse. Though the intermediary step with the FFT would still be dense, with the associated memory requirements; maybe it could be worth chunking with respect to n_samples, not sure.
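As a sketch of the test suggested in point 1 (the class name follows the final rename adopted later in this thread; the exact error message in the merged code may differ):

import pytest
from sklearn.kernel_approximation import PolynomialCountSketch

def test_polynomial_count_sketch_invalid_degree():
    # degree=0 has no meaningful feature map and breaks the FFT step,
    # so fit should raise a clear ValueError.
    ps = PolynomialCountSketch(degree=0)
    with pytest.raises(ValueError, match="degree=0 should be >=1."):
        ps.fit([[1.0, 2.0], [3.0, 4.0]])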

Overall +1 for merging after a few more documentation clarifications on how to choose n_components and a check/test for degree < 1.

@rth (Member) commented on Aug 13, 2020

Another comment for reviewers is that I'm not convinced by the name PolynomialSampler. This algorithm does no sampling as far as I can see, and there is not a single occurrence of the word "sampling" in either of the cited papers. Rather, roughly, it applies a convolution over hashed feature spaces (via a product in FFT space). Also, the actual implementation looks more like a random projection than hashing. So the current name is a bit misleading IMO.

The original name TensorSketch might be a bit opaque but how about,

  • PolynomialCountSketch or PolynomialTensorSketch: count sketch is the correct name for this hashing technique. It may be a bit exotic/new (invented in 2003), but there is a Wikipedia page https://en.wikipedia.org/wiki/Count_sketch and around 14k search results
  • PolynomialFeatureHasher could have worked except that FeatureHasher works with completely different input
  • PolynomialProjection by analogy with random projection as it's a bit related
  • or some name with a combination of "Polynomial" and "Hashing" or "Projection"

WDYT?

@lopeLH (Contributor, Author) commented on Aug 13, 2020

@rth, as you requested:

  • Added a check enforcing degree >= 1 in the fit method of PolynomialSampler.
  • Added a test checking that PolynomialSampler raises an error when given a degree lower than one.
  • Added the suggested hints regarding the selection of n_components to the docstring and user guide (please have a look, I don't trust my English).

Regarding the name of the main class, PolynomialCountSketch sounds good to me.

@lopeLH requested a review from rth on August 13, 2020 13:25
@TomDLT (Member) left a comment

PolynomialCountSketch sounds better indeed.

sklearn/kernel_approximation.py (outdated review thread, resolved)
@rth (Member) left a comment

Thanks, LGTM. Let's wait a few more days to see if there are any objections to the PolynomialCountSketch name (cf. #13003 (comment)). And if there are none, rename it and merge.

@lorentzenchr (Member)

+1 for PolynomialCountSketch from my side.

@TomDLT changed the title [MRG+1] Add Tensor Sketch algorithm to Kernel Approximation module → [MRG+3] Add Tensor Sketch algorithm to Kernel Approximation module on Aug 17, 2020
@lopeLH (Contributor, Author) commented on Aug 17, 2020

Seems like everyone is happy with the new name (PolynomialCountSketch), so I performed the change.

Do I have to squash the ugly commit history in this branch into a single, clean commit? Anyway, let me know if there is anything left on my side.

Super excited about having my first contribution to sklearn merged! 🥳

@lorentzenchr merged commit daebcac into scikit-learn:master on Aug 18, 2020
@lorentzenchr changed the title [MRG+3] Add Tensor Sketch algorithm to Kernel Approximation module → [MRG+3] FEA Add PolynomialCountSketch to Kernel Approximation module on Aug 18, 2020
@lorentzenchr (Member) commented on Aug 18, 2020

@lopeLH Thank you very much for your contribution and your patience! And feel free to continue, if you'd like, with one of the possible follow-ups that @rth listed.
You don't need to squash commits. That's done automatically when we merge.

I'm also excited, as this is my first merge; hoping everything went fine.

@GaelVaroquaux (Member) commented on Aug 18, 2020 via email

jayzed82 pushed a commit to jayzed82/scikit-learn that referenced this pull request Oct 22, 2020
…learn#13003)

* Add Tensor Sketch algorithm

* Add user guide entry

* Add example

* Add benchmark

Co-authored-by: Christian Lorentzen <lorentzen.ch@googlemail.com>
Co-authored-by: Tom Dupré la Tour <tom.dupre-la-tour@m4x.org>
Co-authored-by: Roman Yurchak <rth.yurchak@gmail.com>