
FIX remove lambdas from text preprocessing #14430

Merged

Conversation

@deniederhut (Contributor) commented Jul 21, 2019

Lambda functions are non-serializable under the stdlib pickle module. This commit replaces the lambdas found in three text preprocessing functions with private helper functions that chain a sequence of preprocessing steps and can be partialed where appropriate.

Reference Issues/PRs

Closes #12833

What does this implement/fix? Explain your changes.

Instead of composing functions with lambdas, the preprocessing steps are chained inside single functions whose behavior can be specialized with functools.partial.
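
To make the pickling issue concrete, here is a minimal sketch of the pattern the PR describes (simplified, not the actual scikit-learn code; the _analyze and _lowercase helpers below are illustrative only): a lambda has no importable name, so the stdlib pickle cannot serialize it, while a partial of a module-level function with explicit keyword arguments round-trips fine.

import pickle
from functools import partial


def _analyze(doc, decoder=None, preprocessor=None, tokenizer=None):
    # Apply each optional step in order, skipping the ones left as None.
    if decoder is not None:
        doc = decoder(doc)
    if preprocessor is not None:
        doc = preprocessor(doc)
    if tokenizer is not None:
        doc = tokenizer(doc)
    return doc


def _lowercase(doc):
    return doc.lower()


# The stdlib pickle serializes functions by qualified name, so a lambda fails.
try:
    pickle.dumps(lambda doc: _lowercase(doc))
except (pickle.PicklingError, AttributeError) as exc:
    print("lambda:", exc)

# A partial of a module-level function with explicit arguments round-trips.
analyzer = partial(_analyze, preprocessor=_lowercase)
restored = pickle.loads(pickle.dumps(analyzer))
print(restored("ABC"))  # -> 'abc'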

@rth (Member) left a comment

Thanks for looking into it!

Could you please run benchmarks/bench_text_vectorizers.py before and after this PR and report results?

-            return lambda doc: self._char_wb_ngrams(
-                preprocess(self.decode(doc)))
+            return partial(_analyze, ngrams=self._char_wb_ngrams,
+                           preprocessor=preprocess, decoder=self.decode)
@rth (Member) commented Jul 22, 2019

Side note: this is really a use case for toolz.functoolz.compose; a shame that we can't use it.
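
For context, a hedged sketch of what that would look like (toolz is a third-party package and not a scikit-learn dependency, which is why it can't be used here; the step functions below are simplified stand-ins, not scikit-learn's own):

from toolz.functoolz import compose


def decode(doc):
    return doc.decode("utf-8") if isinstance(doc, bytes) else doc


def preprocess(doc):
    return doc.lower()


def char_ngrams(doc):
    return list(doc)  # stand-in for the real n-gram extraction


# compose(f, g, h)(x) == f(g(h(x))): decode, then preprocess, then n-grams.
analyzer = compose(char_ngrams, preprocess, decode)
print(analyzer(b"AB"))  # -> ['a', 'b']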

sklearn/feature_extraction/text.py (review thread resolved)
@deniederhut (Contributor, Author) commented Jul 23, 2019

Sure thing!

Before

================================================================================
#    Text vectorizers benchmark
================================================================================

Using a subset of the 20 newsgroups dataset (1000 documents).
This benchmark runs in ~1 min ...

========== Run time performance (sec) ===========

Computing the mean and the standard deviation of the run time over 3 runs...

vectorizer            CountVectorizer HashingVectorizer  TfidfVectorizer
analyzer ngram_range                                                    
char     (4, 4)       2.063 (+-0.031)   1.115 (+-0.010)  2.147 (+-0.023)
char_wb  (4, 4)       1.585 (+-0.022)   1.011 (+-0.006)  1.648 (+-0.012)
word     (1, 1)       0.299 (+-0.006)   0.224 (+-0.004)  0.304 (+-0.001)
         (1, 2)       1.172 (+-0.021)   0.444 (+-0.016)  1.228 (+-0.027)

=============== Memory usage (MB) ===============

vectorizer           CountVectorizer HashingVectorizer TfidfVectorizer
analyzer ngram_range                                                  
char     (4, 4)                284.1             195.3           283.7
char_wb  (4, 4)                232.6             198.2           235.1
word     (1, 1)                155.1             159.5           160.3
         (1, 2)                264.9             167.8           265.8

After

========== Run time performance (sec) ===========

Computing the mean and the standard deviation of the run time over 3 runs...

vectorizer            CountVectorizer HashingVectorizer  TfidfVectorizer
analyzer ngram_range                                                    
char     (4, 4)       2.110 (+-0.024)   1.091 (+-0.026)  2.155 (+-0.016)
char_wb  (4, 4)       1.627 (+-0.037)   1.023 (+-0.039)  1.643 (+-0.003)
word     (1, 1)       0.354 (+-0.075)   0.221 (+-0.001)  0.317 (+-0.009)
         (1, 2)       1.206 (+-0.021)   0.427 (+-0.009)  1.214 (+-0.005)

=============== Memory usage (MB) ===============

vectorizer           CountVectorizer HashingVectorizer TfidfVectorizer
analyzer ngram_range                                                  
char     (4, 4)                304.7             210.8           303.6
char_wb  (4, 4)                255.1             213.5           253.1
word     (1, 1)                176.2             184.1           181.9
         (1, 2)                289.4             190.9           288.9

@jnothman (Member) left a comment

Ideally we'd have a non-regression test that checks that all build_* methods result in objects that can be pickled and restored.
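
Something along these lines, for illustration (the test name and parametrization here are hypothetical, not necessarily what lands in the PR):

import pickle

import pytest

from sklearn.feature_extraction.text import (
    CountVectorizer, HashingVectorizer, TfidfVectorizer)


@pytest.mark.parametrize(
    "Vectorizer", [CountVectorizer, HashingVectorizer, TfidfVectorizer])
def test_build_methods_are_picklable(Vectorizer):
    vec = Vectorizer()
    for build in (vec.build_analyzer, vec.build_preprocessor,
                  vec.build_tokenizer):
        obj = build()
        restored = pickle.loads(pickle.dumps(obj))
        # The restored callable should behave like the original one.
        assert restored("Hello World") == obj("Hello World")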

@rth changed the title from "BUG: 12833 remove lambdas from text preprocessing" to "FIX remove lambdas from text preprocessing" on Jul 25, 2019
@rth approved these changes Jul 25, 2019
@rth (Member) left a comment

Thanks @deniederhut !

I'm not overly enthusiastic about the addition of the _preprocess and _analyze functions, but I don't see another way of fixing pickling.

Unless we ask people to use cloudpickle?
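
For reference, cloudpickle (a third-party package) does serialize lambdas by value, which is what that suggestion amounts to, but it would push an extra dependency onto users who only expect stdlib pickle to work. A quick hedged illustration:

import pickle

import cloudpickle

fn = lambda doc: doc.lower()

# cloudpickle serializes the lambda by value and can restore it...
print(cloudpickle.loads(cloudpickle.dumps(fn))("ABC"))  # -> 'abc'

# ...while the stdlib pickle fails, since a lambda has no importable name.
try:
    pickle.dumps(fn)
except (pickle.PicklingError, AttributeError) as exc:
    print(exc)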

@thomasjpfan (Member) commented Jul 27, 2019

Please add an entry to the change log at doc/whats_new/v0.22.rst. Like the other entries there, please reference this pull request with :pr: and credit yourself (and other contributors if applicable) with :user:.

@deniederhut (Contributor, Author) commented Jul 30, 2019

Hm... Circle is showing

agent key RSA SHA256:oEdX/44m2Y1klY5Uq287qOWuWfJrhusv4e5ImKQ5Uhk returned incorrect signature type

Does this need to be rebased?

@rth (Member) commented Jul 30, 2019

Please resolve conflicts (hopefully merging master in will also fix the CI).

deniederhut added 6 commits Jul 31, 2019
(The first commit message repeats the PR description above; a later commit notes: "The function has been modified, so testing for identity is no longer appropriate.")
@deniederhut force-pushed the bug/12833-remove-lambdas branch from b6dc6fe to 07d7cf4 on Jul 31, 2019
@deniederhut (Contributor, Author) commented Aug 1, 2019

Yup! That did the trick for the CI

@rth approved these changes Aug 1, 2019
@thomasjpfan merged commit 53f76d1 into scikit-learn:master on Aug 1, 2019 (17 checks passed)
@thomasjpfan (Member) commented Aug 1, 2019

Thank you @deniederhut!
