
CountVectorizer sets self.vocabulary_ in transform #14559

Open
amueller opened this issue Aug 2, 2019 · 8 comments

Comments

@amueller
Member

amueller commented Aug 2, 2019

Right now CountVectorizer sometimes sets self.vocabulary_ outside of fit. We usually prohibit this, but the common tests haven't reached the vectorizers yet.
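For concreteness, a minimal reproduction of the behavior described above (a sketch against the public CountVectorizer API, with toy data) could be:

```python
# Sketch reproducing the reported behavior: with a fixed `vocabulary`,
# transform() on a never-fitted CountVectorizer sets the "fitted"
# attribute `vocabulary_` as a side effect.
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(vocabulary=["apple", "banana"])
assert not hasattr(cv, "vocabulary_")  # nothing has been fitted yet

X = cv.transform(["apple banana apple"])

# transform() has now populated an underscore-suffixed attribute,
# which is conventionally reserved for fit-time state:
print(cv.vocabulary_)  # {'apple': 0, 'banana': 1}
```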

@jnothman
Member

jnothman commented Aug 3, 2019 via email

@amueller
Member Author

amueller commented Aug 5, 2019

Yes, both of these sound reasonable to me.

@amueller
Member Author

amueller commented Nov 2, 2019

@thomasjpfan do you remember someone (you? sprint?) working on this? I have some vague memories.

@thomasjpfan
Member

thomasjpfan commented Nov 6, 2019

We may have spoken about this IRL regarding SparseCoder and self.components_?

@amueller
Member Author

amueller commented Nov 6, 2019

Hm, ok, it might also have been me. Either way, the issue still seems open?

@SultanOrazbayev
Contributor

This issue doesn't seem to be relevant anymore?

There are two lines where self.vocabulary_ is set. One is inside .fit_transform (which seems OK):

```python
def fit_transform(self, raw_documents, y=None):
    """Learn the vocabulary dictionary and return document-term matrix.

    This is equivalent to fit followed by transform, but more efficiently
    implemented.

    Parameters
    ----------
    raw_documents : iterable
        An iterable which generates either str, unicode or file objects.

    y : None
        This parameter is ignored.

    Returns
    -------
    X : array of shape (n_samples, n_features)
        Document-term matrix.
    """
    # We intentionally don't call the transform method to make
    # fit_transform overridable without unwanted side effects in
    # TfidfVectorizer.
    if isinstance(raw_documents, str):
        raise ValueError(
            "Iterable over raw text documents expected, string object received."
        )
    self._validate_params()
    self._validate_vocabulary()
    max_df = self.max_df
    min_df = self.min_df
    max_features = self.max_features
    if self.fixed_vocabulary_ and self.lowercase:
        for term in self.vocabulary:
            if any(map(str.isupper, term)):
                warnings.warn(
                    "Upper case characters found in"
                    " vocabulary while 'lowercase'"
                    " is True. These entries will not"
                    " be matched with any documents"
                )
                break
    vocabulary, X = self._count_vocab(raw_documents, self.fixed_vocabulary_)
    if self.binary:
        X.data.fill(1)
    if not self.fixed_vocabulary_:
        n_doc = X.shape[0]
        max_doc_count = (
            max_df if isinstance(max_df, numbers.Integral) else max_df * n_doc
        )
        min_doc_count = (
            min_df if isinstance(min_df, numbers.Integral) else min_df * n_doc
        )
        if max_doc_count < min_doc_count:
            raise ValueError("max_df corresponds to < documents than min_df")
        if max_features is not None:
            X = self._sort_features(X, vocabulary)
        X, self.stop_words_ = self._limit_features(
            X, vocabulary, max_doc_count, min_doc_count, max_features
        )
        if max_features is None:
            X = self._sort_features(X, vocabulary)
        self.vocabulary_ = vocabulary
    return X
```

and one is inside _validate_vocabulary, where it is re-set (which also seems OK):

```python
def _validate_vocabulary(self):
    vocabulary = self.vocabulary
    if vocabulary is not None:
        if isinstance(vocabulary, set):
            vocabulary = sorted(vocabulary)
        if not isinstance(vocabulary, Mapping):
            vocab = {}
            for i, t in enumerate(vocabulary):
                if vocab.setdefault(t, i) != i:
                    msg = "Duplicate term in vocabulary: %r" % t
                    raise ValueError(msg)
            vocabulary = vocab
        else:
            indices = set(vocabulary.values())
            if len(indices) != len(vocabulary):
                raise ValueError("Vocabulary contains repeated indices.")
            for i in range(len(vocabulary)):
                if i not in indices:
                    msg = "Vocabulary of size %d doesn't contain index %d." % (
                        len(vocabulary),
                        i,
                    )
                    raise ValueError(msg)
        if not vocabulary:
            raise ValueError("empty vocabulary passed to fit")
        self.fixed_vocabulary_ = True
        self.vocabulary_ = dict(vocabulary)
    else:
        self.fixed_vocabulary_ = False
```
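To make the two assignment sites concrete, here's a small sketch (toy data of my own; note _validate_vocabulary is private and is normally invoked by fit/transform) showing where vocabulary_ comes from in each case:

```python
# Sketch contrasting the two assignment sites quoted above:
# vocabulary_ is learned from the corpus in fit_transform, or copied
# from a user-supplied mapping in _validate_vocabulary.
from sklearn.feature_extraction.text import CountVectorizer

# Learned path: fit_transform builds vocabulary_ from the documents.
learned = CountVectorizer()
learned.fit_transform(["red green", "green blue"])
print(sorted(learned.vocabulary_))  # ['blue', 'green', 'red']

# Fixed path: _validate_vocabulary copies the supplied mapping, so
# vocabulary_ exists before any document has been seen.
fixed = CountVectorizer(vocabulary={"red": 0, "green": 1})
fixed._validate_vocabulary()  # private; normally called by fit/transform
print(fixed.vocabulary_)  # {'red': 0, 'green': 1}
```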

@amueller
Member Author

It's set in _count_vocab, which is called in transform, right?

@SultanOrazbayev
Contributor

Hmm, I'm not sure: _count_vocab seems to create a local vocabulary variable, but it doesn't appear to alter self.vocabulary_...
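One quick way to probe this (a sketch, not an exhaustive trace): without a fixed vocabulary, transform on an unfitted vectorizer fails as unfitted, so nothing on that path sets the attribute; with a fixed vocabulary, transform succeeds and vocabulary_ appears, via the _validate_vocabulary branch quoted earlier rather than _count_vocab:

```python
# Sketch probing the transform path: _count_vocab builds and returns a
# vocabulary dict, but the attribute assignment reached from transform()
# is the one in _validate_vocabulary (only when `vocabulary` was given).
from sklearn.exceptions import NotFittedError
from sklearn.feature_extraction.text import CountVectorizer

# No fixed vocabulary: transform on an unfitted vectorizer raises, so
# _count_vocab never gets a chance to set anything here.
try:
    CountVectorizer().transform(["some text"])
    reached = True
except NotFittedError:
    reached = False
print(reached)  # False

# Fixed vocabulary: transform succeeds and vocabulary_ appears.
cv = CountVectorizer(vocabulary=["some", "text"])
cv.transform(["some text"])
print(hasattr(cv, "vocabulary_"))  # True
```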
