New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG+1] Add validation of vocabulary in get_feature_names #10908

Merged
merged 6 commits into from Apr 25, 2018

Conversation

Projects
None yet
5 participants
@maskani-moh
Contributor

maskani-moh commented Apr 3, 2018

Reference Issues/PRs

Fixes #10901

What does this implement/fix? Explain your changes.

Validates the vocabulary in get_feature_names and inverse_transform to avoid raising errors when object instantiated with vocabulary.

Any other comments?

@maskani-moh maskani-moh changed the title from Add validation of vocabulary in get_feature_names to [MRG]Add validation of vocabulary in get_feature_names Apr 3, 2018

@rth

This comment has been minimized.

Member

rth commented Apr 3, 2018

Thanks for the PR @maskani-moh !

Please add a non-regression test, for instance, adapting the failing example from the original issue. This should also fix the failing codecov patch.

@maskani-moh maskani-moh changed the title from [MRG]Add validation of vocabulary in get_feature_names to [WIP]Add validation of vocabulary in get_feature_names Apr 3, 2018

@maskani-moh

This comment has been minimized.

Contributor

maskani-moh commented Apr 3, 2018

I removed the check for a given custom vocabulary form inverse_transform method as, after consideration, I believe it makes no sense to invert a matrix of word counts using a custom vocabulary that maybe didn't generate the matrix (the indices may not even match and lead to a wrong result).

Example:

# Corpus
JUNK_FOOD_DOCS = (
    "the pizza pizza beer copyright",
    "the pizza burger beer copyright",
    "the the pizza beer beer copyright",
    "the burger beer beer copyright",
    "the coke burger coke copyright",
    "the coke burger burger",
)

# Vocab inferred from corpus
cv = CountVectorizer()
X = cv.fit_transform(JUNK_FOOD_DOCS)

cv.vocabulary_
>>> {'beer': 0, 'burger': 1, 'coke': 2, 'copyright': 3, 'pizza': 4, 'the': 5}

# Custom vocab
vocab = ['beer', 'burger', 'pizza']
cv_custom = CountVectorizer(vocabulary=vocab)

# Try to inverse transform X using cv_custom
cv_custom.inverse_transform() 
>>> IndexError: index 3 is out of bounds for axis 0 with size 3

The above code throws an error as expected, as the vocabulary of cv_custom is smaller than the size of the feature vector in X.

However, inverting a matrix generated by cv_custom using cv works perfectly fine as the vocabulary is larger. The results are rubbish though, as the indices do not match to the right words.

@rth How can we deal with this case? Can we check that the vocabulary used to invert is the same as the one used to generate the matrix? How can we ensure that, without enforcing the user to first run fit_tranform on the original data? Should we warn the user about this case?

@maskani-moh maskani-moh changed the title from [WIP]Add validation of vocabulary in get_feature_names to [MRG]Add validation of vocabulary in get_feature_names Apr 3, 2018

@amueller

This comment has been minimized.

Member

amueller commented Apr 6, 2018

I don't get your point. You can only inverse-transform with the same vocabulary that generated the data. That could have been a custom one, but it needs to be the same as what you used in transform.

@maskani-moh

This comment has been minimized.

Contributor

maskani-moh commented Apr 6, 2018

What I meant is that it does not explicitly say to use the same vocabulary, although this is obvious.

However given a custom vocabulary, the inverse transform can still work (if the vocabulary is larger than the one that generated the data). A safeguard could be to throw a warning if this happens to avoid obtaining rubbish results.

But again, maybe the use of the inverse transform is so obvious that maybe we don't really need a warning here.

@amueller

This comment has been minimized.

Member

amueller commented Apr 6, 2018

I think tampering with attributes after object construction means all bets are off. But yeah, warning if there is a mismatch in size seems like a good idea and doesn't really cost us anything.

@maskani-moh

This comment has been minimized.

Contributor

maskani-moh commented Apr 6, 2018

Should I do that in another PR? I'll add the warning for all the vectorizers then 👍

@rth

This comment has been minimized.

Member

rth commented Apr 10, 2018

You can only inverse-transform with the same vocabulary that generated the data.

Should I do that in another PR?

Yes, I think validating the vocabulary for an inverse transform is an unrelated question.

This LGTM.

@rth rth changed the title from [MRG]Add validation of vocabulary in get_feature_names to [MRG+1] Add validation of vocabulary in get_feature_names Apr 10, 2018

@jnothman

This comment has been minimized.

Member

jnothman commented Apr 11, 2018

LGTM but I suppose this deserves a small changelog entry.

Please add an entry to the change log at doc/whats_new/v0.20.rst. Like the other entries there, please reference this pull request with :issue: and credit yourself (and other contributors if applicable) with :user:

@TomDLT TomDLT merged commit 7d3a59e into scikit-learn:master Apr 25, 2018

0 of 5 checks passed

ci/circleci: python2 Your tests are queued behind your running builds
Details
ci/circleci: python3 Your tests are queued behind your running builds
Details
continuous-integration/appveyor/pr Waiting for AppVeyor build to complete
Details
continuous-integration/travis-ci/pr The Travis CI build is in progress
Details
lgtm analysis: Python Running analyses for revisions
Details
@TomDLT

This comment has been minimized.

Member

TomDLT commented Apr 25, 2018

Thank you @maskani-moh !

(I fixed my name mistake in 2c4ef26.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment