Compute a stop words list for a given df threshold #10834

Closed
rth opened this issue Mar 19, 2018 · 12 comments

@rth
Member

rth commented Mar 19, 2018

In a follow-up to #10735, @jnothman mentioned in #10818 (comment):

In terms of generating new stop lists we should consider just making it from high frequency words in Wikipedia or something for a handful of languages. The point is to handle the case where users have small datasets where max_df is inapplicable. It would be good to try to find someone who's already done this and calibrated max_df for a handful of languages, perhaps.
[..]
I think we need to provide a method/function/example to learn a stop list for a given df threshold, and for a given analyser. This could be CountVectorizer.fit_stop_words(X, min_df) which would set the instance's stop_words. This could call a function calculate_stop_words(vectorizer_or_analyzer, X, min_df) which would return a set.

Perhaps that function should allow the user to augment the learnt stop words with those from a public list ..
[..]
max_df is a valuable substitute [to stop words], but not if the training data is small, and not if we're hashing (although tf.idf works there).

(Just copying this to a separate issue, as it's likely to require another PR, and discussion in a PR tends to get overlooked once it's merged.)
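For illustration, here is a minimal sketch of what the proposed helper could look like. The name calculate_stop_words and its signature follow the quote above; everything else (including treating min_df as a fraction of documents) is only an assumption, not an agreed design:

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

def calculate_stop_words(analyzer, X, min_df):
    """Sketch: return the set of terms whose document frequency exceeds min_df.

    analyzer is a callable mapping a raw document to tokens
    (e.g. CountVectorizer().build_analyzer()), X is an iterable of raw
    documents, and min_df is a fraction of documents.
    """
    vectorizer = CountVectorizer(analyzer=analyzer)
    counts = vectorizer.fit_transform(X)
    # document frequency of each term, as a fraction of documents
    df = np.asarray((counts > 0).sum(axis=0)).ravel() / counts.shape[0]
    # get_feature_names_out() in recent scikit-learn versions
    terms = np.asarray(vectorizer.get_feature_names())
    return set(terms[df > min_df])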

cc @shaz13 @qinhanmin2014 @kmike @vene

@jnothman
Member

cc @amueller, @ogrisel too.

@jnothman
Member

Does your posting of this indicate support for the general approach, @rth?

@rth
Member Author

rth commented Mar 20, 2018

Yes, I think it can be worthwhile to investigate in any case.

The main issue with substituting stop words by max_df is the choice of the threshold. In particular, in research papers I can see someone saying that they used a classical stop word list XYZ, for what it's worth, but not so much max_df=0.55 without some reference or discussion of how that value was chosen. I was always bothered by having some value of max_df in examples without more justification. After a quick search, I haven't seen any relevant literature except possibly Wilbur and Sirotkin (1992), but they use a more complex approach than just thresholding by df, as far as I understand.

I'm not sure how much of it is in scope for scikit-learn, but I'll try to look into this question a bit more this week.

@jnothman
Member

jnothman commented Mar 20, 2018 via email

@qinhanmin2014
Member

I'm +1 if it's possible to generate a list which is compatible with the tokenization method in scikit-learn. max_df cannot do everything for us and it's always good to have a built-in stop word list.
I might be +0 on something like calculate_stop_words. I'm wondering whether stop words can really be calculated from document frequency, especially when users don't have enough data.

@jnothman
Member

jnothman commented Mar 21, 2018 via email

@rth
Member Author

rth commented Mar 22, 2018

I have run some preliminary analysis here. The main question I was interested in was whether by setting the appropriate max_df we could get a list of stop words somewhat equivalent to the current built-in list.

The dataset is the 20 newsgroups corpus with headers/footers removed, to make this comparison as simple as possible.

My initial intuition was that max_df = 0.5 should be a fairly reasonable threshold value; it is frequently used in examples. It turns out that, for this corpus, it would only select ~20 stop words, compared to the 320 words in ENGLISH_STOP_WORDS. To reach a comparable number one would need to set max_df = 0.038. See the figure below:
(figure: number of selected stop words as a function of the max_df threshold)

Even then we would retrieve only ~50% of the words in the built-in list. The remaining ~50% are for the most part dubious and include, for instance:
(screenshot: examples of these remaining words)

most of which would be wrong to include, IMO. And this is just the list for a fairly high max_df.

In conclusion, I agree that max_df filtering could be used to improve the existing stop word list: it can handle non-standard tokens specific to the corpus (e.g. email header fields) and other languages. However,

  • it's definitely not easy to use, and requires a study of the optimal max_df, which most users won't do;
  • it has a much higher potential for misuse; we are not talking about a few stop words or contractions, but rather tens or hundreds of stop words that are either missing or incorrectly added;
  • I have compared it to the stop word list included in spaCy (cf. last section of the notebook), as suggested by @vene in Remove "system" from ENGLISH_STOP_WORDS #10735 (comment), and frankly I'm not convinced that the few words that differ from the current built-in list matter for practical purposes (cf. last paragraph of the notebook).

Regarding the API, the main calculation would roughly be:

import numpy as np
from sklearn.feature_extraction.text import _document_frequency

# number of documents containing each term of the fitted vocabulary
df = _document_frequency(X)
# here max_df is an absolute document count (scale by X.shape[0] for a fraction)
mask = df > max_df
stop_words = np.asarray(vectorizer.get_feature_names())[mask]

I'm not sure if creating a function for this is worth it. Maybe an example could be sufficient?

In particular, with the proposed API for calculate_stop_words there is no simple way to see what a good value of max_df is, short of re-computing the document frequency each time. Also, the current API of CountVectorizer and co. is already complicated and confusing enough, and I'm not sure about adding more methods. IMO, if we decide to go for it, an independent function would be easier. What do you think?
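As a rough sketch of what such an example could show (re-using the fitted vectorizer, its document-term matrix X, and _document_frequency from the snippet above, which are assumptions about the surrounding code), the document frequency can be computed once and several candidate thresholds compared against it:

import numpy as np
from sklearn.feature_extraction.text import _document_frequency

# compute document frequencies once, as a fraction of documents ...
df = _document_frequency(X) / X.shape[0]
terms = np.asarray(vectorizer.get_feature_names())

# ... then compare how many words each candidate threshold would select
for max_df in [0.5, 0.2, 0.1, 0.05]:
    selected = terms[df > max_df]
    print("max_df=%.2f selects %d words" % (max_df, len(selected)))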

@qinhanmin2014
Member

qinhanmin2014 commented Mar 22, 2018

Thanks @rth for the detailed experiment. My opinion:
(kindly forgive me if there are stupid mistakes, I don't specialize in NLP)

  • I'm +1 to deprecate the current stop word list (from http://ir.dcs.gla.ac.uk/resources/linguistic_utils/stop_words) because I think it's not robust:
    (1) There's a typo in the list (fify, corrected in scikit-learn);
    (2) Some words seem really strange to me (e.g., computer, system);
    (3) I don't know the meaning of some words (e.g., amoungst, noone).

  • I'm +1 for a new stop word list because I don't think max_df can do everything for us;

  • I'm -1 for something like calculate_stop_words because I don't think stop words can be calculated through document frequency (in many/most cases);

  • My solution:
    (1) deprecate the current stop word list;
    (2) introduce a new stop word list (e.g., english-robust): we can first borrow a widely used list from somewhere else (e.g., nltk) and then adjust it to the tokenization method in scikit-learn (e.g., replace haven't with haven, see the sketch after this list). I think it's much more convenient than training one ourselves;
    (3) use the new stop word list in the examples/docs; I think we'll get similar results as before.
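As a rough illustration of point (2), each entry of an external list could be passed through scikit-learn's analyzer so that the result matches its tokenization. The use of nltk here is only the example named above, and assumes its stop word data has been downloaded:

from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer

analyzer = CountVectorizer().build_analyzer()

# re-tokenize each external stop word with scikit-learn's analyzer, so that
# e.g. "haven't" becomes "haven" (single-character tokens are dropped by the
# default token_pattern)
adjusted = set()
for word in stopwords.words("english"):
    adjusted.update(analyzer(word))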

@jnothman
Member

Thanks for making this empirical, @rth.

I am very interested in the kinds of max_df thresholds you're coming up with, but I also feel like you're benchmarking with a corpus that is very far from a balanced sample of standard English usage, particularly as regards its overuse of technical terms like "system". I would've been more comfortable with this analysis over the Reuters corpus (though I know that similar newswire corpora have biases due in part to regular reports, e.g. periodic stock market reviews).

I think it's quite clear from its contents that the current stop list is based on frequencies in some corpus, rather than, say, part of speech from some existing lexical resource. I think we are giving our users a false sense that it is generic by maintaining a "definitive" English list.

I also think that it's clear that max_df is hard to tune, and indeed, measuring against an existing stop list (particularly one based on known function words rather than corpus statistics) would be a reasonable way to set it. (It's also possible that another parameterisation, such as the top 1% of words by df, is more robust, or something that accounts for outlying tf as well as df.)
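A minimal sketch of such a df-rank-based parameterisation (again assuming a fitted vectorizer and its document-term matrix X; the 1% figure is just the number mentioned above):

import numpy as np
from sklearn.feature_extraction.text import _document_frequency

df = _document_frequency(X)
terms = np.asarray(vectorizer.get_feature_names())

# take the top 1% of vocabulary terms by document frequency as stop words
n_top = max(1, int(0.01 * len(terms)))
stop_words = set(terms[np.argsort(df)[::-1][:n_top]])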

I think we owe our users a Note on Using Stop Words mentioning tokenization and the extent to which max_df can substitute, including an example of building a list from an external corpus...

@rth
Member Author

rth commented Mar 29, 2018

Thanks for the feedback @jnothman and @qinhanmin2014 !

I would've been more comfortable with this analysis over the Reuters corpus

Did you mean RCV1 or Reuters-21578? The latter is quite old (1987) and might not be very representative of modern English, while the publicly accessible RCV1 version only includes stemmed tokens and is not suitable for such an analysis (I don't have access to the original RCV1 CDs).

It's also possible that another parameterisation, such as top 1% of words by df, is more robust, or something that accounts for outlying tf as well as df.

I was also thinking about that. Top 1% of words by df is also not ideal, because the vocabulary size is still linked to the corpus size (e.g. much larger for the English Wikipedia than for the 20 newsgroups). Another idea could be a max_df chosen to reach a given recall (e.g. 50%) of some stop word list (e.g. the one included in scikit-learn); that should be more robust, but it's also not very general.
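A sketch of that recall-based calibration (assuming, as before, a fitted vectorizer and its document-term matrix X, with ENGLISH_STOP_WORDS as the reference list and the 50% recall target taken from the example above):

import numpy as np
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, _document_frequency

df = _document_frequency(X) / X.shape[0]
terms = np.asarray(vectorizer.get_feature_names())

# lower max_df until at least 50% of the reference stop words are selected
target_recall = 0.5
for max_df in np.linspace(0.5, 0.01, 50):
    selected = set(terms[df > max_df])
    recall = len(selected & ENGLISH_STOP_WORDS) / len(ENGLISH_STOP_WORDS)
    if recall >= target_recall:
        break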

I think we owe our users a Note on Using Stop Words mentioning tokenization and the extent to which max_df can substitute, including an example of building a list from an external corpus...

+1

My solution: (1) deprecate current stop word list; (2) introduce a new stop word list (e.g., english-robust),

Well, I guess there is still some debate between this, deprecating, or just correcting the current list as a bug fix #9057 (comment) ...

@qinhanmin2014
Member

Well, I guess there is still some debate between this, deprecating, or just correcting the current list as a bug fix #9057 (comment) ...

I think deprecating (and introducing a new one) and correcting are actually similar solutions, and I'm +1 for either. I just think that if we need to change the current list a lot, we'd better introduce a new one.

@qinhanmin2014
Member

Closing according to the comments from @jnothman and @rth below; please reopen if I'm wrong.

comment from @jnothman

I agree now that substituting with max_df is insufficient.

comment from @rth

We can mention max_df in the docs as a first step toward a language-generic approach, but I agree that it's not robust enough to recommend by default (frankly, I wouldn't use it personally on projects where I just want things to work).
