how to select top most frequent features from a dfm #1253

Monduiz · 2018-03-04T13:43:09Z

I am looking through the documentation for a way to do the same thing as max_features in scikit-learn's TfidfVectorizer.

max_features is very clearly defined in tfidfvectorizer:

If not None, build a vocabulary that only consider the top max_features ordered by term frequency across the corpus.

For example, say I use tokens(x, what = "character") and would like to limit the features with something like dfm(x, max_features = 50000).

It seems logical to me that this would be a dfm operation. I can't tell what the operation is, or where it appears in the workflow, if it already exists. Basically, the goal is to reduce the dfm to the top 50,000 terms by frequency. max_count in dfm_trim would not accomplish that.

What is not clear to me in the documentation:

Does dfm(x[1:50000]) select the first 50000 by top word frequency?
Does dfm(x, max_docfreq = 50000) select the first 50000 by top word frequency?

Do I have to add another step with %>% dfm_sort(margin = "features")? If so, I am not sure how to then select by features and frequency.

Thanks for the guidance!


kbenoit · 2018-03-04T19:39:19Z

If you want to keep the top 50,000 most frequent features, you can do so using:

x <- dfm_select(x, names(topfeatures(x, n = 50000)))
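As a rough illustration of the topfeatures() approach, here is a minimal sketch on a toy corpus. The texts, the object names (txt, top, x_top) and the small n are made up for the example, and valuetype = "fixed" is an added assumption so that feature names are matched literally rather than as glob patterns.

```r
library(quanteda)

# Toy corpus, purely illustrative (not from the original question).
txt <- c(d1 = "a a a b b c",
         d2 = "a b b d",
         d3 = "a c e")
x <- dfm(tokens(txt))

n <- 3  # small stand-in for 50000

# Names of the n features with the highest total frequency in the dfm.
top <- names(topfeatures(x, n = n))

# Keep only those features; "fixed" avoids treating characters such as
# "*" or "?" in feature names as glob wildcards.
x_top <- dfm_select(x, pattern = top, selection = "keep", valuetype = "fixed")

featnames(x_top)
```

When several features are tied at the cutoff, which of them end up in the top n depends on how topfeatures() orders ties, so the result at the boundary should not be relied on.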

This also works:

x <- dfm_sort(x)[, seq_len(min(nfeat(x), 50000))]
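The sort-then-subset idea can be sketched on a toy dfm as follows. The texts and the names x_sorted and x_top are hypothetical, and n is a small stand-in for 50000; the point is only that the columns are first ordered by decreasing total frequency and then the first n are kept.

```r
library(quanteda)

# Toy corpus, purely illustrative.
txt <- c(d1 = "a a a b b c",
         d2 = "a b b d",
         d3 = "a c e")
x <- dfm(tokens(txt))

n <- 3  # small stand-in for 50000

# Order features by decreasing total frequency.
x_sorted <- dfm_sort(x, decreasing = TRUE, margin = "features")

# Keep the first n columns; min() guards against a dfm with fewer than n features.
x_top <- x_sorted[, seq_len(min(nfeat(x_sorted), n))]

featnames(x_top)
```

Unlike the dfm_select() route, this also leaves the retained features in frequency order, which may or may not matter downstream.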

Also, we are working on this - see #1254, which as currently proposed would be:

dfm_trim(x, min_count = 50000, count_fun = "invrank")

Monduiz · 2018-03-04T22:44:04Z

Thank you @kbenoit! I like the solution proposed!

kbenoit changed the title ~~replicate tfidfvectorizer max_features~~ how to select top most frequent features from a dfm Mar 4, 2018

kbenoit closed this as completed Mar 10, 2018
