
[MRG+1] Ngram Performance #7567

Merged

merged 3 commits into scikit-learn:master from jtdoepke:ngram_performance on Jun 18, 2017

Conversation

@jtdoepke
Contributor

jtdoepke commented Oct 4, 2016

Reference Issue

What does this implement/fix? Explain your changes.

A couple of small changes to make ngram generation a little bit faster (sketched below):

  1. Bind the tokens.append() and " ".join() methods outside of the for-loops.
  2. For unigrams, skip slicing entirely and just use tokens = list(original_tokens).
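
A rough standalone sketch of both changes combined (the function name and signature are simplified here; the actual sklearn helper is a method that also handles stop-word filtering):

    def word_ngrams(original_tokens, min_n, max_n):
        if min_n == 1:
            # Fast path: unigrams are just a copy of the token list,
            # so skip slicing and joining entirely.
            tokens = list(original_tokens)
            min_n += 1
        else:
            tokens = []
        n_original_tokens = len(original_tokens)
        # Bind methods once, outside the loops, to avoid repeated
        # attribute lookups on every iteration.
        tokens_append = tokens.append
        space_join = " ".join
        for n in range(min_n, min(max_n + 1, n_original_tokens + 1)):
            for i in range(n_original_tokens - n + 1):
                tokens_append(space_join(original_tokens[i: i + n]))
        return tokens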

Any other comments?

Benchmarks

jtdoepke added some commits Oct 1, 2016

Improve ngram performance - method binding
* Improve ngram performance by binding methods outside the loop.
Improve ngram performance - unigram list
* Create unigrams without slicing.
@rth

This is great! Particularly since the case of ngram=[1, ..] is frequently used; it would definitely be useful!

Improve ngram performance - code comment
* Added code comment to explain using list() for unigrams.
@jnothman
Member

jnothman commented Oct 4, 2016

I'm happy with the unigram fast path. I don't think binding methods reduces enough overhead to care. Please benchmark each change (of 1 and 2) separately.

@rth
Member

rth commented Oct 4, 2016

@jnothman I think they are benchmarked separately in the link above, and method binding improves performance by 10-20%, surprisingly...

@jtdoepke
Contributor

jtdoepke commented Oct 4, 2016

The benchmarks are in the order I committed the changes: the first test has no changes, the second has only the method binding, and the third has both method binding and the unigram fast path.

I'll rerun the unigram fast path benchmark separately later tonight, but the method binding on its own was 13-20% faster.

@NelleV
Member

NelleV commented Oct 4, 2016

I am surprised that the method binding does anything at all. Can you post the actual times and how you ran the benchmarks?

@rth
Member

rth commented Oct 4, 2016

@NelleV The link to the benchmarks is in the "Any other comments?" section of the original post above.

    for n in xrange(min_n,
                    min(max_n + 1, n_original_tokens + 1)):
        for i in xrange(n_original_tokens - n + 1):
-           tokens.append(" ".join(original_tokens[i: i + n]))
+           tokens_append(space_join(original_tokens[i: i + n]))

@Carreau

Carreau Oct 4, 2016
Contributor

At a quick glance I would say that most of the time will be spent on the slicing original_tokens[i: i + n]; it might be possible to swap the two loops to get something like (roughly)

    for i in ...:
        _current = ''
        for _tok in original_tokens[i: i + n]:
            _current = (_current + ' ' + _tok) if _current else _tok
            tokens_append(_current)

Thus we only iterate over each slice once, which should likely be faster as well. Would need to be profiled, though.

@Carreau

Carreau Oct 4, 2016
Contributor

Actually there seems to be an even simpler way, using the following:

    import itertools

    def iter_window(seq, n):
        # Sliding window of length n over seq.
        l = list(seq[:n])
        l_append = l.append
        for item in itertools.islice(seq, n, len(seq)):
            yield tuple(l)
            l.pop(0)
            l_append(item)
        yield tuple(l)

and then replacing the inner loop with:

    for n in range(min_n, min(max_n + 1, n_original_tokens + 1)):
        for _tks in iter_window(original_tokens, n):
            tokens_append(space_join(_tks))

I get a ~2x+ speedup (increasing with the length of the ngrams), and the sliding window implementation can likely be made much more efficient, maybe using a deque.
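
As a quick sanity check of the window generator (reusing the iter_window definition from the comment above):

    tokens = "the quick brown fox".split()
    print(list(iter_window(tokens, 2)))
    # [('the', 'quick'), ('quick', 'brown'), ('brown', 'fox')]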

@jnothman
Member

jnothman commented Oct 5, 2016

Okay, I'm somewhat persuaded. @Carreau we'd probably get more gains in the nested loop speed by rewriting in Cython (which would also avoid binding issues if static typing were used correctly).

In terms of text classification speed, I've also been thinking of creating a wrapper for Spacy which may smoothly enable the creation of lemmatised, lemma + POS, named entity and dependency path features. I'm not yet sure of the ideal interface, and it would be designed as a separate repository for sklearn-contrib. If anyone's interested in coding this up, I'm happy to share my thoughts.

@Carreau
Contributor

Carreau commented Oct 5, 2016

> @Carreau we'd probably get more gains in the nested loop speed by rewriting in Cython (which would also avoid binding issues if static typing were used correctly).

Sure, I was just trying to show that being careful about the algorithm's implementation could also bring a significant boost, and that micro-optimisations were not the only solution.

@amueller
Member

amueller commented Oct 5, 2016

@jnothman have you tried pattern? http://www.clips.ua.ac.be/pattern I wasn't convinced by spacy.

@amueller
Member

amueller commented Oct 5, 2016

Also, isn't the right data structure for this a trie #2639?

@jnothman
Member

jnothman commented Oct 5, 2016

A trie will fix different problems.

SpaCy happens to have been made by a former colleague, so I might have my biases. Very valuable in its parsing, not sure about the rest.

@jnothman
Member

jnothman commented Oct 5, 2016

I think it depends how much weight you put on parsing, really... One advantage of spaCy is that it naturally produces vector spaces, although it doesn't naturally produce vector spaces for conjunctions / n-grams, which is something I'm a little concerned about.

@jnothman
Member

jnothman commented Oct 5, 2016

Anyway, @Carreau, a PR is welcome. @jtdoepke, I shall review this properly soon.

@jtdoepke
Contributor

jtdoepke commented Oct 5, 2016

Here are some additional benchmarks (https://gist.github.com/jtdoepke/bb19edf8e4678246a15ba48b2c47ce3f) with everything benchmarked separately, including @Carreau's suggestions.

@jnothman
Member

jnothman commented Oct 6, 2016

Thanks, that's great. I'm not able to invest much thought into this atm, but I'm wondering what it is about chars that makes them worse for the sliding window, unlike word ngrams? If there is such a disparity, does that suggest it's dependent on properties of the documents?

@rth
Member

rth commented Jan 30, 2017

The benchmarks linked above by @jtdoepke can be summarized in the following (badly formatted) table:

[table image: best_performance; the arrows denote the best-performing cases]

The "1. method binding" + "2. unigram shortcut" combination (the current code in this PR) outperforms the other proposed methods (generators with a sliding window, etc.) for unigram words and character n-grams, but not for word n-grams.

The bottom line is that there is still room for performance improvement in the n-gram vectoriser, either by rewriting some sections in Cython or by using generators. However, maybe that would deserve a separate PR? Meanwhile this PR does exactly what it claims to do: a consistent 10-20% speed improvement for word and character n-grams (with a few lines of code)...

@jtdoepke Would you mind renaming this PR to "[MRG] Ngram Performance" to attract attention to it, so it can be added to the backlog of PRs to be reviewed? Thanks.

Update: also related #7107

@jtdoepke jtdoepke changed the title from Ngram Performance to [MRG] Ngram Performance Jan 30, 2017

@jnothman

Were there not more enticing changes to consider, I'd say this looks good for merge. Thanks for all the benchmarks.

# bind method outside of loop to reduce overhead
ngrams_append = ngrams.append
for n in xrange(min_n, min(max_n + 1, text_len + 1)):

@jnothman

jnothman Feb 23, 2017
Member

Could we please get a benchmark on writing this as a list comprehension:

    ngrams.extend(text_document[i: i + n]
                  for n in xrange(min_n, min(max_n + 1, text_len + 1))
                  for i in xrange(text_len - n + 1))

?

@jnothman

jnothman Feb 23, 2017
Member

Also benchmark returning a generator, i.e. using itertools.chain over generator expressions, to see if the counting can benefit from not materialising the list.
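
A sketch of what that could look like for the character n-gram case (the standalone function name char_ngrams_lazy is hypothetical; the real code is a method on the vectorizer):

    from itertools import chain

    def char_ngrams_lazy(text_document, min_n, max_n):
        # Chain one generator expression per n-gram length so the
        # full list of n-grams is never materialised.
        text_len = len(text_document)
        return chain.from_iterable(
            (text_document[i: i + n] for i in range(text_len - n + 1))
            for n in range(min_n, min(max_n + 1, text_len + 1)))

so that, e.g., list(char_ngrams_lazy("scikit", 2, 3)) yields all character 2- and 3-grams without building intermediate lists.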

@jtdoepke

jtdoepke Mar 7, 2017
Contributor

Sure. I tried several variations of list comprehension and generators (and a couple other shortcuts). Looks like creating all the ngrams as a single list comprehension is faster than a generator.

@lesteve
Member

lesteve commented Mar 7, 2017

I have to admit I am not an expert on this, but in order to convince me you will have to:

  • use a more realistic benchmark; AFAICT your vocabulary is only ~350 words long
  • use %timeit for the timings; from a quick look, it seems you are implementing your own timeit by taking the mean of all the results, which is not the most robust way of timing code
  • use profiling, and possibly line-profiling, to show where significant time is spent, to make sure we are optimizing the right lines of code

Ideally I would rather have a benchmark script than a notebook.
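
For example, a minimal script in that spirit might look like this (dataset slice size and vectorizer options are illustrative, not from the thread; 20 newsgroups is what the later benchmarks settle on):

    import textwrap
    import timeit

    setup = textwrap.dedent("""
        from sklearn.datasets import fetch_20newsgroups
        from sklearn.feature_extraction.text import CountVectorizer
        docs = fetch_20newsgroups(subset="train").data[:1000]
        vec = CountVectorizer(ngram_range=(1, 2))
    """)

    # Take the best of several repeats rather than a mean; the minimum
    # is the least noisy estimate of the true cost.
    times = timeit.repeat("vec.fit_transform(docs)",
                          setup=setup, repeat=3, number=1)
    print("best of 3: %.2f s" % min(times))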

@rth
Member

rth commented Jun 9, 2017

I benchmarked this PR anew with non-synthetic data (20 newsgroups) and it appears that some of @lesteve's concerns were justified.

In these benchmarks, this PR has no performance impact at all when ngram_range=(1, 1), but yields up to a 10% speedup for larger ngram ranges, both with word and character ngrams. The full results are accessible here, obtained with this benchmark script. Profiling in the case of ngram_range=(1, 1) and that of ngram_range=(1, 2) shows that the optimized function _word_ngrams accounts for a relatively small fraction (<20%) of the total run time.

Of the different optimizations considered (unigram shortcut, method binding, list comprehension), the first two (exactly this PR) are the ones that bring the best performance.

Across different vectorizers and options this PR brings a 5% speedup, and the performance is at worst the same as before. So the net effect of this PR is positive, but considering that the optimized functions generally account for a relatively small fraction of the total run time, it is probably not worth spending time on additional optimizations here.

Therefore I would suggest that this PR be merged in its current state. What do you think?
@lesteve @jnothman
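
A minimal sketch of that kind of profiling run (document count and options are illustrative):

    import cProfile
    import pstats

    from sklearn.datasets import fetch_20newsgroups
    from sklearn.feature_extraction.text import CountVectorizer

    docs = fetch_20newsgroups(subset="train").data[:2000]
    vec = CountVectorizer(analyzer="word", ngram_range=(1, 2))

    profiler = cProfile.Profile()
    profiler.enable()
    vec.fit_transform(docs)
    profiler.disable()

    # Sort by cumulative time to see how much of the total run time
    # the n-gram helper (_word_ngrams) actually accounts for.
    pstats.Stats(profiler).sort_stats("cumulative").print_stats(15)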

@lesteve
Member

lesteve commented Jun 9, 2017

Thanks a lot @rth for the benchmark; is it worth adding it to the benchmarks folder?

All in all I think this is reasonable to merge: a small improvement, and the code changes are not that controversial.

@jnothman what do you think?

@lesteve lesteve changed the title from [MRG] Ngram Performance to [MRG+1] Ngram Performance Jun 9, 2017

@jnothman jnothman merged commit d8e54d9 into scikit-learn:master Jun 18, 2017

3 checks passed:

  • ci/circleci: Your tests passed on CircleCI!
  • continuous-integration/appveyor/pr: AppVeyor build succeeded
  • continuous-integration/travis-ci/pr: The Travis CI build passed
@jnothman
Member

jnothman commented Jun 18, 2017

Thanks @jtdoepke. We could add an entry in what's new if you'd like. Please suggest a text and I'll commit it in.

@jtdoepke
Contributor

jtdoepke commented Jun 21, 2017

    Small performance improvement to n-gram creation in
    :mod:`sklearn.feature_extraction.text` by binding methods outside of loops
    and taking a fast path for unigrams.

@jtdoepke jtdoepke deleted the jtdoepke:ngram_performance branch Jun 21, 2017


dmohns added a commit to dmohns/scikit-learn that referenced this pull request Aug 7, 2017

NelleV added a commit to NelleV/scikit-learn that referenced this pull request Aug 11, 2017

paulha added a commit to paulha/scikit-learn that referenced this pull request Aug 19, 2017

AishwaryaRK added a commit to AishwaryaRK/scikit-learn that referenced this pull request Aug 29, 2017

maskani-moh added a commit to maskani-moh/scikit-learn that referenced this pull request Nov 15, 2017

jwjohnson314 pushed a commit to jwjohnson314/scikit-learn that referenced this pull request Dec 18, 2017
