Compute a stop words list for a given df threshold #10834
Does your posting of this indicate support for the general approach, @rth?
Yes, I think it can be worthwhile to investigate in any case. The main issue with substituting stop words by `max_df` […]; I'm not sure how much of it is in scope for scikit-learn, but I'll try to look into this question a bit more this week.
Thanks. I hope my reasoning above explains why I think this is something we should facilitate.

In research, known stop word lists may improve reproducibility, but only if the tokenizer is also known. An alternative is to make sure we facilitate tf-idf with statistics from a reference corpus.
I'm +1 if it's possible to generate a list which is compatible with the tokenization method in scikit-learn.
My point here is that stop words can be calculated quite reliably by `max_df` from a large collection of text, although `max_df` may not be trivial to set. That text collection should ideally be from an identical domain/source to your task space, and necessarily with the same tokenizer and preprocessing. This is especially the case when your labelled data for classification, for instance, is small. This is why providing a function is better than providing a list, though I'm not against doing that too.

`calculate_stop_words` could return terms in descending frequency such that the user could then apply their own second threshold.
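As a rough illustration of that shape (the function name, signature, and default threshold below are placeholders, not a proposed API), a minimal sketch that returns candidate stop words in descending document frequency might look like this:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer


def calculate_stop_words(raw_documents, max_df=0.7, **vectorizer_params):
    """Return candidate stop words sorted by decreasing document frequency.

    Only terms whose document frequency (as a fraction of documents)
    exceeds ``max_df`` are returned, so the caller can still apply
    their own second threshold on the result.
    """
    vectorizer = CountVectorizer(**vectorizer_params)
    X = vectorizer.fit_transform(raw_documents)
    # Document frequency: fraction of documents in which each term occurs.
    df = np.asarray((X > 0).sum(axis=0)).ravel() / X.shape[0]
    # get_feature_names_out() in recent scikit-learn versions.
    terms = np.asarray(vectorizer.get_feature_names())
    order = np.argsort(df)[::-1]          # most frequent terms first
    keep = df[order] > max_df
    return list(terms[order][keep])
```

Returning the terms most-frequent-first means a user can simply truncate the list if the chosen threshold turns out to be too permissive.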
I have run some preliminary analysis here. The main question I was interested in was whether, by setting an appropriate `max_df`, we could recover the built-in stop word list. The dataset is the 20 newsgroups corpus with headers / footers removed to make this comparison as simple as possible.

My initial intuition was that even then we would retrieve only ~50% of the words in the built-in list. The remaining ~50% are for the most part dubious, and most of them would be wrong to include IMO. And this is just the list for a fairly high `max_df`.

In conclusion, I agree that […]
Regarding the API, the main calculations would roughly be:

```python
import numpy as np
from sklearn.feature_extraction.text import _document_frequency

df = _document_frequency(X)  # X from a fitted CountVectorizer
# _document_frequency returns counts, so max_df here is a document count,
# not a fraction.
mask = df > max_df
stop_words = np.asarray(vectorizer.get_feature_names())[mask]
```

I'm not sure if creating a function for this is worth it. Maybe an example could be sufficient? In particular, with the proposed API for […]
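For reference, a self-contained version of the above sketch on the corpus used in the experiment; the 0.7 threshold is arbitrary here, and `_document_frequency` is a private helper, so this is only for illustration:

```python
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import (CountVectorizer,
                                             ENGLISH_STOP_WORDS,
                                             _document_frequency)

data = fetch_20newsgroups(subset='train',
                          remove=('headers', 'footers')).data

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(data)

# Document frequency as a fraction of documents, then threshold it.
df = _document_frequency(X) / X.shape[0]
mask = df > 0.7  # arbitrary threshold for this illustration
stop_words = set(np.asarray(vectorizer.get_feature_names())[mask])

print(sorted(stop_words))
print(len(stop_words & ENGLISH_STOP_WORDS), "of them are in the built-in list")
```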
Thanks @rth for the detailed experiment. My opinion: […]
Thanks for making this empirical, @rth. I am very interested in the kinds of `max_df` threshold you're coming up with, but I also feel like you're benchmarking with a corpus that is very far from a balanced sample of standard English usage, particularly as regards its overuse of technical terms like "system". I would've been more comfortable with this analysis over the Reuters corpus (though I know that similar newswire corpora have biases due in part to regular reports, e.g. periodic stock market reviews).

I think it's quite clear from its contents that the current stop list is based on frequencies in some corpus, rather than, say, part of speech from some existing lexical resource. I think we are giving our users a false sense that it is generic by maintaining a "definitive" English list.

I also think that it's clear that `max_df` is hard to tune, and indeed, measuring against an existing stop list (particularly one based on known function words rather than corpus statistics) would be a reasonable way to set it. (It's also possible that another parameterisation, such as top 1% of words by df, is more robust, or something that accounts for outlying tf as well as df.)

I think we owe our users a Note on Using Stop Words mentioning tokenization and the extent to which `max_df` can substitute, including an example of building a list from an external corpus...
Thanks for the feedback @jnothman and @qinhanmin2014!

Did you mean RCV1 or Reuters-21578? The latter is quite old (1987) and might not be very representative of modern English, while the publicly accessible RCV1 version only includes stemmed tokens and is not suitable for such analysis (I don't have access to the original RCV1 CD).

I was also thinking about that. Top 1% of words by df is also not ideal, because the vocabulary size is still linked to the corpus size (e.g. much larger for the English Wikipedia than for the 20 newsgroups). Another idea could be a […]
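To make the "top X% of words by df" parameterisation mentioned above concrete (the function name and the 1% default are placeholders), a small sketch follows; it also makes the drawback just described visible, since the cut-off scales with the vocabulary size:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer


def top_fraction_by_df(raw_documents, fraction=0.01):
    """Return the top ``fraction`` of vocabulary terms by document frequency."""
    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(raw_documents)
    df = np.asarray((X > 0).sum(axis=0)).ravel()
    # The cut-off grows with the vocabulary, hence with the corpus size.
    n_top = max(1, int(fraction * len(df)))
    top = np.argsort(df)[::-1][:n_top]
    return list(np.asarray(vectorizer.get_feature_names())[top])
```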
+1
Well, I guess there is still some debate between this, deprecating it, or just correcting the current list as a bug fix #9057 (comment)...
I think deprecating (and introducing a new one) and correcting are actually similar solutions, and I'm +1 for either. I just think that if we need to change the current list a lot, we'd better introduce a new one.
Closing according to the comments from @jnothman and @rth; please reopen if I'm wrong.

Comment from @jnothman: […]

Comment from @rth: […]
In a follow-up of #10735, @jnothman mentioned in #10818 (comment): […]

(Just copying this to a separate issue as it's likely to require another PR, and discussion in a PR does tend to get overlooked once it's merged.)
cc @shaz13 @qinhanmin2014 @kmike @vene