
FIX remove lambdas from text preprocessing #14430

Merged

Conversation

@deniederhut (Contributor) commented Jul 21, 2019

Lambda functions are non-serializable under the stdlib pickle
module. This commit replaces the lambdas found in three text
preprocessing functions with hidden functions for chaining
a sequence of preprocessing steps that can be partialed where
appropriate.

Reference Issues/PRs

Closes #12833

What does this implement/fix? Explain your changes.

Instead of composing functions with lambdas, create chains
of preprocessing steps inside single functions that can
be decomposed with partialing.
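The underlying pickling issue can be illustrated with a small stdlib-only sketch. The `_analyze` chain and the `lower` helper below are hypothetical stand-ins for the private functions this PR introduces, not scikit-learn's actual code:

```python
import pickle
from functools import partial


def lower(doc):
    return doc.lower()


def _analyze(doc, ngrams=None, preprocessor=None, decoder=None):
    # Hypothetical module-level chain of preprocessing steps:
    # each step is applied only if it was configured via partial().
    if decoder is not None:
        doc = decoder(doc)
    if preprocessor is not None:
        doc = preprocessor(doc)
    if ngrams is not None:
        doc = ngrams(doc)
    return doc


# A lambda-based analyzer cannot be serialized by the stdlib pickle module...
analyzer = lambda doc: lower(doc)
try:
    pickle.dumps(analyzer)
    lambda_picklable = True
except (pickle.PicklingError, AttributeError):
    lambda_picklable = False

# ...but a partial of a module-level function round-trips fine.
restored = pickle.loads(pickle.dumps(partial(_analyze, preprocessor=lower)))
print(lambda_picklable, restored("ABC"))
```

Because `partial` pickles by reference to the named module-level function plus its bound keyword arguments, the composed preprocessing chain can also be taken apart again after unpickling.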

@rth (Member) left a comment

Thanks for looking into it!

Could you please run benchmarks/bench_text_vectorizers.py before and after this PR and report results?

- return lambda doc: self._char_wb_ngrams(
-     preprocess(self.decode(doc)))
+ return partial(_analyze, ngrams=self._char_wb_ngrams,
+                preprocessor=preprocess, decoder=self.decode)

@rth (Member) commented Jul 22, 2019

Side note: this is really a use case for toolz.functoolz.compose; a shame that we can't use it.
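For context, a picklable composition helper in the spirit of what toolz provides can be sketched in a few lines of stdlib Python (illustrative only; this is neither toolz's nor scikit-learn's code). The key point is that an instance of a module-level class pickles by reference, while a closure or lambda does not:

```python
import pickle


class Compose:
    """Minimal picklable function composition (hypothetical sketch)."""

    def __init__(self, *funcs):
        # Functions are applied right to left, like mathematical composition.
        self.funcs = funcs

    def __call__(self, x):
        for f in reversed(self.funcs):
            x = f(x)
        return x


def strip(s):
    return s.strip()


def lower(s):
    return s.lower()


analyzer = Compose(lower, strip)  # strip first, then lower
restored = pickle.loads(pickle.dumps(analyzer))
print(restored("  ABC  "))
```

Pulling in a dependency for this is hard to justify in scikit-learn, which is presumably why the PR uses `functools.partial` of module-level functions instead.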

sklearn/feature_extraction/text.py (review thread resolved)
@deniederhut (Contributor, Author) commented Jul 23, 2019

Sure thing!

Before

================================================================================
#    Text vectorizers benchmark
================================================================================

Using a subset of the 20 newsgroups dataset (1000 documents).
This benchmark runs in ~1 min ...

========== Run time performance (sec) ===========

Computing the mean and the standard deviation of the run time over 3 runs...

vectorizer            CountVectorizer HashingVectorizer  TfidfVectorizer
analyzer ngram_range                                                    
char     (4, 4)       2.063 (+-0.031)   1.115 (+-0.010)  2.147 (+-0.023)
char_wb  (4, 4)       1.585 (+-0.022)   1.011 (+-0.006)  1.648 (+-0.012)
word     (1, 1)       0.299 (+-0.006)   0.224 (+-0.004)  0.304 (+-0.001)
         (1, 2)       1.172 (+-0.021)   0.444 (+-0.016)  1.228 (+-0.027)

=============== Memory usage (MB) ===============

vectorizer           CountVectorizer HashingVectorizer TfidfVectorizer
analyzer ngram_range                                                  
char     (4, 4)                284.1             195.3           283.7
char_wb  (4, 4)                232.6             198.2           235.1
word     (1, 1)                155.1             159.5           160.3
         (1, 2)                264.9             167.8           265.8

After

========== Run time performance (sec) ===========

Computing the mean and the standard deviation of the run time over 3 runs...

vectorizer            CountVectorizer HashingVectorizer  TfidfVectorizer
analyzer ngram_range                                                    
char     (4, 4)       2.110 (+-0.024)   1.091 (+-0.026)  2.155 (+-0.016)
char_wb  (4, 4)       1.627 (+-0.037)   1.023 (+-0.039)  1.643 (+-0.003)
word     (1, 1)       0.354 (+-0.075)   0.221 (+-0.001)  0.317 (+-0.009)
         (1, 2)       1.206 (+-0.021)   0.427 (+-0.009)  1.214 (+-0.005)

=============== Memory usage (MB) ===============

vectorizer           CountVectorizer HashingVectorizer TfidfVectorizer
analyzer ngram_range                                                  
char     (4, 4)                304.7             210.8           303.6
char_wb  (4, 4)                255.1             213.5           253.1
word     (1, 1)                176.2             184.1           181.9
         (1, 2)                289.4             190.9           288.9
@jnothman (Member) left a comment

Ideally we'd have a non-regression test that checks that all build_* methods result in objects that can be pickled and restored.
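The shape of such a non-regression test can be sketched with a toy vectorizer in place of scikit-learn's real classes (all names below are hypothetical; the real test would iterate over `build_preprocessor`, `build_tokenizer`, and `build_analyzer` on the actual vectorizers):

```python
import pickle
from functools import partial


def _preprocess(doc, lower=False):
    # Hypothetical module-level preprocessing function, standing in for
    # the private helpers this PR adds.
    if lower:
        doc = doc.lower()
    return doc


class TinyVectorizer:
    """Toy vectorizer whose build_* methods return picklable partials."""

    lowercase = True

    def build_preprocessor(self):
        return partial(_preprocess, lower=self.lowercase)

    def build_analyzer(self):
        return partial(_preprocess, lower=self.lowercase)


def test_build_methods_are_picklable():
    vec = TinyVectorizer()
    for build in (vec.build_preprocessor, vec.build_analyzer):
        func = build()
        # Round-trip through pickle and check behavior is preserved.
        restored = pickle.loads(pickle.dumps(func))
        assert restored("AbC") == func("AbC")


test_build_methods_are_picklable()
print("ok")
```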

@rth rth changed the title BUG: 12833 remove lambdas from text preprocessing FIX remove lambdas from text preprocessing Jul 25, 2019
@rth approved these changes Jul 25, 2019
@rth (Member) left a comment

Thanks @deniederhut !

I'm not overly enthusiastic about the addition of the _preprocess and _analyze functions, but I don't see another way of fixing pickling.

Unless we ask people to use cloudpickle?

@thomasjpfan (Member) commented Jul 27, 2019

Please add an entry to the change log at doc/whats_new/v0.22.rst. Like the other entries there, please reference this pull request with :pr: and credit yourself (and other contributors if applicable) with :user:.
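An entry along these lines could work (the wording and section placement are illustrative; the PR number and handle are taken from this thread):

```rst
:mod:`sklearn.feature_extraction`
.................................

- |Fix| Replaced lambdas in the text vectorizers with picklable
  module-level functions, so the callables returned by the ``build_*``
  methods can now be pickled. :pr:`14430` by :user:`deniederhut`.
```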

@deniederhut (Contributor, Author) commented Jul 30, 2019

Hm... Circle is showing

agent key RSA SHA256:oEdX/44m2Y1klY5Uq287qOWuWfJrhusv4e5ImKQ5Uhk returned incorrect signature type

Does this need to be rebased?

@rth (Member) commented Jul 30, 2019

Please resolve conflicts (hopefully that would also fix CI by merging master in).

deniederhut added 6 commits Jul 21, 2019

The function has been modified, so testing for identity is
no longer appropriate.
@deniederhut deniederhut force-pushed the deniederhut:bug/12833-remove-lambdas branch from b6dc6fe to 07d7cf4 Jul 31, 2019
@deniederhut (Contributor, Author) commented Aug 1, 2019

Yup! That did the trick for the CI

@rth approved these changes Aug 1, 2019
@thomasjpfan thomasjpfan merged commit 53f76d1 into scikit-learn:master Aug 1, 2019
17 checks passed

LGTM analysis: C/C++ No code changes detected
LGTM analysis: JavaScript No code changes detected
LGTM analysis: Python No new or fixed alerts
ci/circleci: deploy Your tests passed on CircleCI!
ci/circleci: doc Your tests passed on CircleCI!
ci/circleci: doc-min-dependencies Your tests passed on CircleCI!
ci/circleci: lint Your tests passed on CircleCI!
codecov/patch 100% of diff hit (target 96.86%)
codecov/project Absolute coverage decreased by -0.06% but relative coverage increased by +3.13% compared to 7e022a7
scikit-learn.scikit-learn Build #20190731.3 succeeded
scikit-learn.scikit-learn (Linux py35_conda_openblas) succeeded
scikit-learn.scikit-learn (Linux py35_ubuntu_atlas) succeeded
scikit-learn.scikit-learn (Linux pylatest_conda_mkl_pandas) succeeded
scikit-learn.scikit-learn (Linux32 py35_ubuntu_atlas_32bit) succeeded
scikit-learn.scikit-learn (Windows py35_pip_openblas_32bit) succeeded
scikit-learn.scikit-learn (Windows py37_conda_mkl) succeeded
scikit-learn.scikit-learn (macOS pylatest_conda_mkl) succeeded
@thomasjpfan (Member) commented Aug 1, 2019

Thank you @deniederhut!
