how to select top most frequent features from a dfm #1253

Closed
Monduiz opened this issue Mar 4, 2018 · 2 comments


Monduiz commented Mar 4, 2018

I am looking through the documentation, trying to do the same as max_features from TfidfVectorizer in scikit-learn.

max_features is very clearly defined in TfidfVectorizer:

If not None, build a vocabulary that only consider the top max_features ordered by term frequency across the corpus.

For example, say I use tokens(x, what = "character") and would like to limit the features with something like
dfm(x, max_features = 50000).

It seems logical to me that this would be a dfm operation. I am missing what the operation is and where it appears in the workflow, if it's already there. Basically, the goal is to reduce the dfm to the top 50000 features by frequency. max_count in dfm_trim would not accomplish that.

What is not clear to me in the documentation:

Does dfm(x[1:50000]) select the first 50000 by top word frequency?
Does dfm(x, max_docfreq = 50000) select the first 50000 by top word frequency?

Do I have to add another step with %>% dfm_sort(margin = "features")? If so, I am not sure how to then select by features and frequency.

Thanks for the guidance!

kbenoit changed the title from "replicate tfidfvectorizer max_features" to "how to select top most frequent features from a dfm" on Mar 4, 2018

kbenoit commented Mar 4, 2018

If you want to keep the 50,000 most frequent features, you can do so using:

x <- dfm_select(x, names(topfeatures(x, n = 50000)))

This also works:

x <- dfm_sort(x)[, seq_len(min(nfeat(x), 50000))]
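
For illustration, here is a minimal sketch of both approaches on a toy corpus. It assumes a recent quanteda where the dfm is built from tokens(). The texts and the small n (standing in for 50000) are made up for the example, and valuetype = "fixed" is an addition here so that feature names containing glob characters such as * or ? are matched literally, which may matter for character tokens.

```r
library(quanteda)

# Toy corpus (hypothetical data, for illustration only)
txt <- c(d1 = "a a a b b c", d2 = "a b d", d3 = "a e")
x <- dfm(tokens(txt))

n <- 2  # stands in for 50000

# Approach 1: keep the features whose names are among the top n by frequency
top <- names(topfeatures(x, n = n))
x1 <- dfm_select(x, pattern = top, selection = "keep", valuetype = "fixed")

# Approach 2: sort features by decreasing frequency, then keep the first n
# columns; min() guards against a dfm with fewer than n features
x2 <- dfm_sort(x, decreasing = TRUE, margin = "features")
x2 <- x2[, seq_len(min(nfeat(x2), n))]

featnames(x1)
featnames(x2)
```

Note that ties at the cutoff are broken by position, so which of several equally frequent features survives may depend on their order in the dfm.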

Also, we are working on this - see #1254, which as currently proposed would be:

dfm_trim(x, min_count = 50000, count_fun = "invrank")


Monduiz commented Mar 4, 2018

Thank you @kbenoit! I like the solution proposed!

kbenoit closed this as completed on Mar 10, 2018