Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FIX raise an error in CountVectorizer with a custom token pattern that captures several group #15427

Merged
merged 12 commits into from Oct 21, 2020

Conversation

hoffm386
Copy link
Contributor

@hoffm386 hoffm386 commented Nov 1, 2019

Reference Issues/PRs

Fixes #12971

What does this implement/fix? Explain your changes.

  1. Merges in changes from Handle capturing groups in token_pattern #13229, which raises a ValueError if the token pattern regex has more than one capturing group and documents the general behavior of capturing groups in the token pattern regex
  2. Improves the doc for token_pattern per requested changes on Handle capturing groups in token_pattern #13229
  3. Changes the test spec from testing 0 capturing groups to testing 1 capturing group per requested changes on Handle capturing groups in token_pattern #13229
  4. Modifies the test corpus and regexes for test_countvectorizer_custom_token_pattern so they represent a more "realistic" type of custom token pattern than used in Handle capturing groups in token_pattern #13229. The previous test used an example from CountVectorizer token_pattern issue with multi Alternative regex pattern. #12971 where the custom token pattern behaved the same as the default token pattern on that particular corpus. My new test is an adapted version of those examples that uses some of the same selectors but also (a) behaves differently than the default and (b) gives an example of what you might actually use a capturing group for. I have modified both regexes so they meet these requirements and are almost identical, but the first has 1 capturing group and the second has 2 capturing groups.

Any other comments?

Same as #13229, this does not address deprecation or backward compatibility

@hoffm386
Copy link
Contributor Author

@jnothman could I get a review on this? Thanks!

Copy link
Member

@jnothman jnothman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks and sorry for the delayed review

vectorizer = CountVectorizer(token_pattern=token_pattern)
vectorizer.fit_transform(corpus)

assert_raise_message(ValueError, message, func, token_pattern)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We usually try to test a more minimal context that will error. That is, rather than func here, we would have vectorizer.fit

@hoffm386
Copy link
Contributor Author

(Something went wrong with my rebasing strategy here, not expecting a review on this nonsense right now!)

@hoffm386
Copy link
Contributor Author

Ok, that was a mess but I believe I have addressed your comments now!

Copy link
Member

@jnothman jnothman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

Copy link
Contributor

@glemaitre glemaitre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@hoffm386 sorry for the delay to review this. I merge master into your branch and apply really small changes to be able to merge this PR which is useful.

I will merge when the CIs turn green (apart of the MacOS failure).

@glemaitre glemaitre changed the title Finishing up handling capturing group behavior in token_pattern FIX raise an error in CountVectorizer with a custom token pattern that captures several group Oct 21, 2020
@glemaitre glemaitre merged commit 05ce814 into scikit-learn:master Oct 21, 2020
thomasjpfan pushed a commit to thomasjpfan/scikit-learn that referenced this pull request Oct 28, 2020
…t captures several group (scikit-learn#15427)

Co-authored-by: Gangesh Gudmalwar <gangesh.gudmalwar@team.telstra.com>
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

CountVectorizer token_pattern issue with multi Alternative regex pattern.
4 participants