Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG+1] Ngram Performance #7567

Merged
merged 3 commits into from Jun 18, 2017
Merged

[MRG+1] Ngram Performance #7567

merged 3 commits into from Jun 18, 2017

Conversation

@jtdoepke
Copy link
Contributor

@jtdoepke jtdoepke commented Oct 4, 2016

Reference Issue

What does this implement/fix? Explain your changes.

A couple of small changes to make ngram generation a little bit faster.

  1. Bind the tokens.append() and " ".join() methods outside of the for-loops.
  2. For unigrams, skip slicing entirely and just use tokens = list(original_tokens).

Any other comments?

Benchmarks

jtdoepke added 2 commits Oct 1, 2016
* Improve ngram performance by binding methods outside the loop.
* Create unigrams without slicing.
Copy link
Member

@rth rth left a comment

This is great! Particularly since the case of ngram=[1, ..] is frequently used; it would be definitely useful!

tokens = []
if min_n == 1:
tokens = list(original_tokens)
min_n += 1

This comment has been minimized.

@rth

rth Oct 4, 2016
Member

Maybe just add a comment here to say that this does the same thing as the fist iteration of the loop below (which is then skipped) as that's not entirely clear from reading it.

min_n, max_n = self.ngram_range
if min_n == 1:
ngrams = list(text_document)
min_n += 1

This comment has been minimized.

@rth

rth Oct 4, 2016
Member

Same comment as above

* Added code comment to explain using list() for unigrams.
@jnothman
Copy link
Member

@jnothman jnothman commented Oct 4, 2016

I'm happy with the unigram fast path. I don't think binding methods reduces enough overhead to care. Please benchmark each change (of 1 and 2) separately.

@rth
Copy link
Member

@rth rth commented Oct 4, 2016

@jnothman I think they are benchmarked separately in the link above, and method binding improves performance by more than 10-20%, surprisingly...

@jtdoepke
Copy link
Contributor Author

@jtdoepke jtdoepke commented Oct 4, 2016

The benchmarks are in the order I committed the changes, so: first test has no changes, the second has only the method binding, then the third has method binding and the unigram fast path.

I'll rerun the unigram fast path benchmark separately later tonight, but the method binding on it's own was 13-20% faster.

@NelleV
Copy link
Member

@NelleV NelleV commented Oct 4, 2016

I am surprised that the method binding does anything at all. Can you put the actual times and how you've ran the benchmarks?

@rth
Copy link
Member

@rth rth commented Oct 4, 2016

@NelleV The link to benchmarks is in the "Any other comments?" section in the original post above.

for n in xrange(min_n,
min(max_n + 1, n_original_tokens + 1)):
for i in xrange(n_original_tokens - n + 1):
tokens.append(" ".join(original_tokens[i: i + n]))
tokens_append(space_join(original_tokens[i: i + n]))

This comment has been minimized.

@Carreau

Carreau Oct 4, 2016
Contributor

as a quick glance I would say that most of the time will be spend on the slicing original_tokens[i: i + n], it might be possible to swap the two loops to get something like (roughly)

for i in ...:
    _current = ''
    for _tok in original_tokens[i:i + n]:
            _current = _current+' '+_tok
            tokens_append(
                _current
            )

Thus only iterating over the slice, which should likely be faster as well.
Would need to be profiled though.

This comment has been minimized.

@Carreau

Carreau Oct 4, 2016
Contributor

Actually there seem to be an even simpler way, using the following

import itertools
def iter_window(seq, n):
    l = list(seq[:n])
    append = l.append
    for item in itertools.islice(seq, n-1, len(seq)):
        yield tuple(l)
        l.pop(0)
        l.append(item)
    yield tuple(l)

and then replacing the inner loop by

for n in range(a,b):
            for _tks in iter_window(original_tokens,n, c-1):
                tokens_append(
                    space_join(
                        _tks
                    )
                )

I get a ~2+ speedup (increasing with lenght of ngrams) and the sliding window implementation can likely be made much more efficient using a dequeue maybe.

@jnothman
Copy link
Member

@jnothman jnothman commented Oct 5, 2016

Okay, I'm somewhat persuaded. @Carreau we'd probably get more gains in the nested loop speed by rewriting in Cython (which would also avoid binding issues if static typing were used correctly).

In terms of text classification speed, I've also been thinking of creating a wrapper for Spacy which may smoothly enable the creation of lemmatised, lemma + POS, named entity and dependency path features. I'm not yet sure of the ideal interface, and it would be designed as a separate repository for sklearn-contrib. If anyone's interested in coding this up, I'm happy to share my thoughts.

@Carreau
Copy link
Contributor

@Carreau Carreau commented Oct 5, 2016

@Carreau we'd probably get more gains in the nested loop speed by rewriting in Cython (which would also avoid binding issues if static typing were used correctly).

Sure, I was just trying to show that being careful about the algorithm implementation could also make a significant boost and that micro-optimisations were not the only solution.

@amueller
Copy link
Member

@amueller amueller commented Oct 5, 2016

@jnothman have you tried pattern? http://www.clips.ua.ac.be/pattern I wasn't convinced by spacy.

@amueller
Copy link
Member

@amueller amueller commented Oct 5, 2016

Also, isn't the right data structure for this a trie #2639?

@jnothman
Copy link
Member

@jnothman jnothman commented Oct 5, 2016

Trie will fix different problems.

SpaCy happens to have been made by a former colleague, so I might have my
biases. Very valuable in its parsing, not sure about the rest.

On 6 October 2016 at 06:21, Andreas Mueller notifications@github.com
wrote:

Also, isn't the right data structure for this a trie #2639
#2639?


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#7567 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/AAEz61x9yChR9GpYvsceWPQUYTAxBVZPks5qw_jQgaJpZM4KNOMr
.

@jnothman
Copy link
Member

@jnothman jnothman commented Oct 5, 2016

I think it depends how much weight you put on parsing, really... One
advantage of spaCy is that it naturally produces vector spaces. Although it
doesn't naturally produce vector spaces for conjunctions / n-grams, which
is something I'm a little concerned about.

On 6 October 2016 at 08:15, Joel Nothman joel.nothman@gmail.com wrote:

Trie will fix different problems.

SpaCy happens to have been made by a former colleague, so I might have my
biases. Very valuable in its parsing, not sure about the rest.

On 6 October 2016 at 06:21, Andreas Mueller notifications@github.com
wrote:

Also, isn't the right data structure for this a trie #2639
#2639?


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#7567 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/AAEz61x9yChR9GpYvsceWPQUYTAxBVZPks5qw_jQgaJpZM4KNOMr
.

@jnothman
Copy link
Member

@jnothman jnothman commented Oct 5, 2016

Anyway, @Carreau, a PR is welcome. @jtdoepke, I shall review this properly soon.

@jtdoepke
Copy link
Contributor Author

@jtdoepke jtdoepke commented Oct 5, 2016

Here's some additional benchmarks with everything separately, including @Carreau 's suggestions.

@jnothman
Copy link
Member

@jnothman jnothman commented Oct 6, 2016

Thanks, that's great. I'm not able to invest much thought into this atm,
but wondering what about chars makes them worse for sliding window, unlike
ngrams? If there is such disparity, does that suggest it's dependent on
properties of the documents?

On 6 October 2016 at 10:55, Jaye notifications@github.com wrote:

Here's https://gist.github.com/jtdoepke/bb19edf8e4678246a15ba48b2c47ce3f
some additional benchmarks with everything separately, including @Carreau
https://github.com/Carreau 's suggestions.


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#7567 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/AAEz61FtqIZRVUDfM8koJgT86p08GSquks5qxDjfgaJpZM4KNOMr
.

@rth
Copy link
Member

@rth rth commented Jan 30, 2017

The benchmarks linked above by @jtdoepke can be summarized in the following (badly formatted) table,
best_performance
(the arrows denote the best performing cases)

The "1. method binding" + "2. Unigram shortcut" (the current code in this PR) outperform other proposed method (generators with sliding window etc), for unigram words and character n-grams, but not for word n-grams.

Bottom line is that this shows that there is still room for performance improvement for the n-gram vectoriser either by rewriting some sections in cython or by using generators. However, maybe that would deserve a separate PR?
Meanwhile this PR does exactly what it claims to be doing: a consistent 10-20% speed improvement for word an character n-grams (with a few lines of code)...

@jtdoepke Would you mind renaming this PR to "[MRG] Ngram Performance" to attract attention to it and so it could be added to the backlog or PRs to be reviewed? Thanks.

Update: also related #7107

@jtdoepke jtdoepke changed the title Ngram Performance [MRG] Ngram Performance Jan 30, 2017
Copy link
Member

@jnothman jnothman left a comment

were there not more enticing changes to consider, I'd say this looked good for merge. Thanks for all the benchmarks.


# bind method outside of loop to reduce overhead
ngrams_append = ngrams.append

for n in xrange(min_n, min(max_n + 1, text_len + 1)):

This comment has been minimized.

@jnothman

jnothman Feb 23, 2017
Member

Could we please get a benchmark on writing this as a list comprehension:

n_grams.extend([text_document[i: i + n]]
               for n in xrange(min_n, min(max_n + 1, text_len + 1))
               for i in xrange(text_len - n + 1))

?

This comment has been minimized.

@jnothman

jnothman Feb 23, 2017
Member

Also benchmark returning a generator i.e. use itertools.chain over generator expressions to see if the counting can benefit from not materialising the list.

This comment has been minimized.

@jtdoepke

jtdoepke Mar 7, 2017
Author Contributor

Sure. I tried several variations of list comprehension and generators (and a couple other shortcuts). Looks like creating all the ngrams as a single list comprehension is faster than a generator.

@lesteve
Copy link
Member

@lesteve lesteve commented Mar 7, 2017

I have to admit I am not an expert on this but in order to convince me you will have to use:

  • a more realistic benchmark. AFAICT your vocabulary is ~350 words long.
  • use %timeit to do timings, from a quick look at it, it looks like you are implementing your own timeit doing the mean of all the results, which is not the most robust way of doing timings.
  • profiling and possibly line-profiling to show where significant time is spent to make sure we are optimizing the right lines of code

Ideally I would rather have a benchmark script rather than a notebook.

@rth
Copy link
Member

@rth rth commented Jun 9, 2017

I benchmarked this PR anew with non synthetic data (20 newsgroup) and it appears that some of @lesteve 's concerns were justified.

In these benchmaks, this PR has no performance impact at all when ngram_range=(1, 1) but includes an up to 10% speedup for larger ngram ranges, both with words and character ngrams. The full results are accessible here, obtained with this benchmarks script . Profiling
in the case ngram_range=(1, 1) and that of ngram_range=(1, 2) show that the optimized function _word_ngrams accounts for a relatively small fraction (<20%) of the total run time.

Out of different components, or optimizations considered (unigram shrortcut, method binding, list comprehension) the first two are the ones (this PR exactly) that bring the best performance.

Across different vectorizers and options this PR brings a 5% speedup, and the performance is at worse the same as before. So the net effect of this PR is positive, but considering that generally the optimized functions accounts for a relatively small fraction of total run time, it is probably not worth spending time on additional optimizations here.

Therefore I would suggest that this PR be merged in it's current state. What do you think?
@lesteve @jnothman

@lesteve
Copy link
Member

@lesteve lesteve commented Jun 9, 2017

Thanks a lot @rth for the benchmark, is it worth adding it to the benchmarks folder?

All in all I think this is reasonable to merge. Some small improvement. The code changes are not that controversial.

@jnothman what do you think?

@lesteve lesteve changed the title [MRG] Ngram Performance [MRG+1] Ngram Performance Jun 9, 2017
@jnothman
Copy link
Member

@jnothman jnothman commented Jun 10, 2017

@jnothman jnothman merged commit d8e54d9 into scikit-learn:master Jun 18, 2017
3 checks passed
3 checks passed
ci/circleci Your tests passed on CircleCI!
Details
continuous-integration/appveyor/pr AppVeyor build succeeded
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
@jnothman
Copy link
Member

@jnothman jnothman commented Jun 18, 2017

Thanks @jtdoepke. We could add an entry in what's new if you'd like. Please suggest a text and I'll commit it in.

@jtdoepke
Copy link
Contributor Author

@jtdoepke jtdoepke commented Jun 21, 2017

Small performance improvement to n-gram creation in
:mod:`sklearn.feature_extraction.text` by binding methods outside of loops
and taking a fast path for unigrams.
@jtdoepke jtdoepke deleted the jtdoepke:ngram_performance branch Jun 21, 2017
@jnothman
Copy link
Member

@jnothman jnothman commented Jun 21, 2017

dmohns added a commit to dmohns/scikit-learn that referenced this pull request Aug 7, 2017
dmohns added a commit to dmohns/scikit-learn that referenced this pull request Aug 7, 2017
NelleV added a commit to NelleV/scikit-learn that referenced this pull request Aug 11, 2017
paulha added a commit to paulha/scikit-learn that referenced this pull request Aug 19, 2017
AishwaryaRK added a commit to AishwaryaRK/scikit-learn that referenced this pull request Aug 29, 2017
maskani-moh added a commit to maskani-moh/scikit-learn that referenced this pull request Nov 15, 2017
jwjohnson314 pushed a commit to jwjohnson314/scikit-learn that referenced this pull request Dec 18, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

7 participants