Compute a stop words list for a given df threshold #10834

Closed
rth opened this issue Mar 19, 2018 · 12 comments

@rth
Member

rth commented Mar 19, 2018

In a follow-up to #10735, @jnothman mentioned in #10818 (comment):

In terms of generating new stop lists we should consider just making it from high frequency words in Wikipedia or something for a handful of languages. The point is to handle the case where users have small datasets where max_df is inapplicable. It would be good to try to find someone who's already done this and calibrated max_df for a handful of languages, perhaps.
[..]
I think we need to provide a method/function/example to learn a stop list for a given df threshold, and for a given analyser. This could be CountVectorizer.fit_stop_words(X, min_df) which would set the instance's stop_words. This could call a function calculate_stop_words(vectorizer_or_analyzer, X, min_df) which would return a set.

Perhaps that function should allow the user to augment the learnt stop words with those from a public list ..
[..]
max_df is a valuable substitute [to stop words], but not if the training data is small, and not if we're hashing (although tf.idf works there).

(Just copying this to a separate issue, as it's likely to require another PR, and discussion in a PR tends to get overlooked once it's merged.)
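For illustration, here is a minimal sketch of what the proposed helper could look like. The name calculate_stop_words and its signature follow the quote above; everything else (including treating min_df as a fraction of documents) is only an assumption, not an agreed design:

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

def calculate_stop_words(analyzer, X, min_df):
    """Sketch: return the set of terms whose document frequency exceeds min_df.

    analyzer is a callable mapping a raw document to tokens
    (e.g. CountVectorizer().build_analyzer()), X is an iterable of raw
    documents, and min_df is a fraction of documents.
    """
    vectorizer = CountVectorizer(analyzer=analyzer)
    counts = vectorizer.fit_transform(X)
    # document frequency of each term, as a fraction of documents
    df = np.asarray((counts > 0).sum(axis=0)).ravel() / counts.shape[0]
    # get_feature_names_out() in recent scikit-learn versions
    terms = np.asarray(vectorizer.get_feature_names())
    return set(terms[df > min_df])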

cc @shaz13 @qinhanmin2014 @kmike @vene

@jnothman
Member

cc @amueller, @ogrisel too.

@jnothman
Member

Does your posting of this indicate support for the general approach, @rth?

@rth
Member Author

rth commented Mar 20, 2018

Yes, I think it can be worthwhile to investigate in any case.

The main issue with substituting stop words by max_df is the choice of the threshold. In particular, in research papers I can see someone saying that they used a classical stop word list XYZ, for what it's worth, but not so much max_df=0.55 without some reference or discussion of how that value was chosen. I was always bothered by having some value of max_df in examples without more justification. After a quick search, I haven't seen any relevant literature except possibly Wilbur and Sirotkin (1992), but they use a more complex approach than just thresholding by df, as far as I understand.

I'm not sure how much of it is in scope for scikit-learn, but I'll try to look into this question a bit more this week.

@jnothman
Member

jnothman commented Mar 20, 2018 via email

@qinhanmin2014
Member

I'm +1 if it's possible to generate a list which is compatible with the tokenization method in scikit-learn. max_df cannot do everything for us and it's always good to have a built-in stop word list.
I might be +0 on something like calculate_stop_words. I'm wondering whether stop words can really be calculated from document frequency, especially when users don't have enough data.

@jnothman
Member

jnothman commented Mar 21, 2018 via email

@rth
Member Author

rth commented Mar 22, 2018

I have run some preliminary analysis here. The main question I was interested in was whether by setting the appropriate max_df we could get a list of stop words somewhat equivalent to the current built-in list.

The dataset is the 20 newsgroups corpus with headers/footers removed, to make this comparison as simple as possible.

My initial intuition was that max_df = 0.5 should be a fairly reasonable threshold value; it is frequently used in examples. It turns out that, for this corpus, it would only select ~20 stop words, compared to the 320 words in ENGLISH_STOP_WORDS. To reach a comparable number one would need to set max_df = 0.038. See the figure below:
(figure: number of selected stop words as a function of the max_df threshold)

Even then we would retrieve only ~50% of the words in the built-in list. The remaining ~50% are for the most part dubious and include, for instance:
(screenshot: examples of these remaining words)

most of which would be wrong to include, IMO. And this is just the list for a fairly high max_df.

In conclusion, I agree that max_df filtering could be used to improve the existing stop word list: it can handle non-standard tokens specific to the corpus (e.g. email header fields) and other languages. However,

  • it's definitely not easy to use, and requires a study of the optimal max_df, which most users won't do;
  • it has a much higher potential for misuse; we are not talking about a few stop words or contractions, but rather tens or hundreds of stop words that are either missing or incorrectly added;
  • I have compared it to the stop word list included in spaCy (cf. last section of the notebook), as suggested by @vene in Remove "system" from ENGLISH_STOP_WORDS #10735 (comment), and frankly I'm not convinced that the few words that differ from the current built-in list matter for practical purposes (cf. last paragraph of the notebook).

Regarding the API, the main calculation would roughly be:

import numpy as np
from sklearn.feature_extraction.text import _document_frequency

# number of documents containing each term of the fitted vocabulary
df = _document_frequency(X)
# here max_df is an absolute document count (scale by X.shape[0] for a fraction)
mask = df > max_df
stop_words = np.asarray(vectorizer.get_feature_names())[mask]

I'm not sure if creating a function for this is worth it. Maybe an example could be sufficient?

In particular, with the proposed API for calculate_stop_words there is no simple way to see what a good value of max_df is, short of re-computing the document frequency each time. Also, the current API of CountVectorizer and co. is already complicated and confusing enough, and I'm not sure about adding more methods. IMO, if we decide to go for it, an independent function would be easier. What do you think?
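As a rough sketch of what such an example could show (re-using the fitted vectorizer, its document-term matrix X, and _document_frequency from the snippet above, which are assumptions about the surrounding code), the document frequency can be computed once and several candidate thresholds compared against it:

import numpy as np
from sklearn.feature_extraction.text import _document_frequency

# compute document frequencies once, as a fraction of documents ...
df = _document_frequency(X) / X.shape[0]
terms = np.asarray(vectorizer.get_feature_names())

# ... then compare how many words each candidate threshold would select
for max_df in [0.5, 0.2, 0.1, 0.05]:
    selected = terms[df > max_df]
    print("max_df=%.2f selects %d words" % (max_df, len(selected)))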

@qinhanmin2014
Member

qinhanmin2014 commented Mar 22, 2018

Thanks @rth for the detailed experiment. My opinion:
(kindly forgive me if there are stupid mistakes, I don't specialize in NLP)

  • I'm +1 to deprecate the current stop word list (from http://ir.dcs.gla.ac.uk/resources/linguistic_utils/stop_words) because I think it's not robust:
    (1) There's a typo in the list (fify, corrected in scikit-learn);
    (2) Some words seem really strange to me (e.g., computer, system);
    (3) I don't know the meaning of some words (e.g., amoungst, noone).

  • I'm +1 for a new stop word list because I don't think max_df can do everything for us;

  • I'm -1 for something like calculate_stop_words because I don't think stop words can be calculated through document frequency (in many/most cases);

  • My solution:
    (1) deprecate the current stop word list;
    (2) introduce a new stop word list (e.g., english-robust): we can first borrow a widely used list from somewhere else (e.g., nltk) and then adjust it to the tokenization method in scikit-learn (e.g., replace haven't with haven, see the sketch after this list). I think it's much more convenient than training one ourselves;
    (3) use the new stop word list in the examples/docs; I think we'll get similar results as before.
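As a rough illustration of point (2), each entry of an external list could be passed through scikit-learn's analyzer so that the result matches its tokenization. The use of nltk here is only the example named above, and assumes its stop word data has been downloaded:

from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer

analyzer = CountVectorizer().build_analyzer()

# re-tokenize each external stop word with scikit-learn's analyzer, so that
# e.g. "haven't" becomes "haven" (single-character tokens are dropped by the
# default token_pattern)
adjusted = set()
for word in stopwords.words("english"):
    adjusted.update(analyzer(word))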

@jnothman
Member

Thanks for making this empirical, @rth.

I am very interested in the kinds of max_df thresholds you're coming up with, but I also feel like you're benchmarking with a corpus that is very far from a balanced sample of standard English usage, particularly as regards its overuse of technical terms like "system". I would've been more comfortable with this analysis over the Reuters corpus (though I know that similar newswire corpora have biases due in part to regular reports, e.g. periodic stock market reviews).

I think it's quite clear from its contents that the current stop list is based on frequencies in some corpus, rather than, say, part of speech from some existing lexical resource. I think we are giving our users a false sense that it is generic by maintaining a "definitive" English list.

I also think that it's clear that max_df is hard to tune, and indeed, measuring against an existing stop list (particularly one based on known function words rather than corpus statistics) would be a reasonable way to set it. (It's also possible that another parameterisation, such as the top 1% of words by df, is more robust, or something that accounts for outlying tf as well as df.)
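A minimal sketch of such a df-rank-based parameterisation (again assuming a fitted vectorizer and its document-term matrix X; the 1% figure is just the number mentioned above):

import numpy as np
from sklearn.feature_extraction.text import _document_frequency

df = _document_frequency(X)
terms = np.asarray(vectorizer.get_feature_names())

# take the top 1% of vocabulary terms by document frequency as stop words
n_top = max(1, int(0.01 * len(terms)))
stop_words = set(terms[np.argsort(df)[::-1][:n_top]])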

I think we owe our users a Note on Using Stop Words mentioning tokenization and the extent to which max_df can substitute, including an example of building a list from an external corpus...

@rth
Member Author

rth commented Mar 29, 2018

Thanks for the feedback @jnothman and @qinhanmin2014 !

I would've been more comfortable with this analysis over the Reuters corpus

Did you mean RCV1 or Reuters-21578? The latter is quite old (1987) and might not be very representative of modern English, while the publicly accessible RCV1 version only includes stemmed tokens and is not suitable for such an analysis (I don't have access to the original RCV1 CDs).

It's also possible that another parameterisation, such as top 1% of words by df, is more robust, or something that accounts for outlying tf as well as df.

I was also thinking about that. Top 1% of words by df is also not ideal, because the vocabulary size is still linked to the corpus size (e.g. much larger for the English Wikipedia than for the 20 newsgroups). Another idea could be a max_df chosen to reach a given recall (e.g. 50%) of some stop word list (e.g. the one included in scikit-learn); that should be more robust, but it's also not very general.
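A sketch of that recall-based calibration (assuming, as before, a fitted vectorizer and its document-term matrix X, with ENGLISH_STOP_WORDS as the reference list and the 50% recall target taken from the example above):

import numpy as np
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, _document_frequency

df = _document_frequency(X) / X.shape[0]
terms = np.asarray(vectorizer.get_feature_names())

# lower max_df until at least 50% of the reference stop words are selected
target_recall = 0.5
for max_df in np.linspace(0.5, 0.01, 50):
    selected = set(terms[df > max_df])
    recall = len(selected & ENGLISH_STOP_WORDS) / len(ENGLISH_STOP_WORDS)
    if recall >= target_recall:
        break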

I think we owe our users a Note on Using Stop Words mentioning tokenization and the extent to which max_df can substitute, including an example of building a list from an external corpus...

+1

My solution: (1) deprecate current stop word list; (2) introduce a new stop word list (e.g., english-robust),

Well, I guess there is still some debate between this, deprecating, or just correcting the current list as a bug fix #9057 (comment) ...

@qinhanmin2014
Member

Well, I guess there is still some debate between this, deprecating, or just correcting the current list as a bug fix #9057 (comment) ...

I think deprecating (and introducing a new one) and correcting are actually similar solutions, and I'm +1 for either. I just think that if we need to change the current list a lot, we'd better introduce a new one.

@qinhanmin2014
Member

Closing according to the comments from @jnothman and @rth below; please reopen if I'm wrong.

comment from @jnothman

I agree now that substituting with max_df is insufficient.

comment from @rth

We can mention max_df in the docs as a first step toward a language-generic approach, but I agree that it's not robust enough to recommend by default (frankly, I wouldn't use it personally on projects where I just want things to work).
