I am looking through the documentation for a way to do the same thing as max_features from TfidfVectorizer in scikit-learn.
max_features is very clearly defined in tfidfvectorizer:
If not None, build a vocabulary that only consider the top max_features ordered by term frequency across the corpus.
For example, say I use tokens(x, what = "character") and would then like to limit the features with something like dfm(x, max_features = 50000).
It seems logical to me that this would be a dfm operation. If it already exists, I can't find what the operation is or where it belongs in the workflow. Basically, the goal is to reduce the dfm to the 50000 most frequent terms. max_count in dfm_trim would not accomplish that.
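To pin down the behaviour I mean, here is a minimal sketch in plain Python. It is not scikit-learn's implementation, only my reading of the definition quoted above. The helper names top_features_vocab and count_matrix are hypothetical, and the alphabetical tie-break is my own assumption.

```python
from collections import Counter


def top_features_vocab(docs, max_features, tokenizer=list):
    """Return the max_features most frequent terms across the whole corpus.

    tokenizer=list splits each document into characters, mirroring
    tokens(x, what = "character").
    """
    totals = Counter()
    for doc in docs:
        totals.update(tokenizer(doc))
    # Rank by descending corpus-wide count; ties are broken alphabetically
    # (an assumption made here only to keep the result deterministic).
    ranked = sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:max_features]]


def count_matrix(docs, vocab, tokenizer=list):
    """Build one row per document and one column per retained term."""
    rows = []
    for doc in docs:
        counts = Counter(tokenizer(doc))
        rows.append([counts[term] for term in vocab])
    return rows


docs = ["aab", "abc", "a"]
vocab = top_features_vocab(docs, max_features=2)
matrix = count_matrix(docs, vocab)
```

The point is that the cut-off is a rank (keep the N most frequent terms), not a threshold on counts or document frequencies, which is why a count-based trim does not give the same result.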
What is not clear to me in the documentation:
Does dfm(x[1:50000]) select the top 50000 features by frequency?
Does dfm(x, max_docfreq = 50000) select the top 50000 features by frequency?
Or do I have to add another step with %>% dfm_sort(margin = "features")? If so, I am not sure how to then select features by frequency.
Thanks for the guidance!
kbenoit changed the title from "replicate tfidfvectorizer max_features" to "how to select top most frequent features from a dfm" on Mar 4, 2018