improve dfm_trim() options #1254
Comments
I like the functionality, so I can start implementing, but I am not sure about the labels. |
OK, maybe One of my earlier ideas was for the user to supply a function for aggregation, such as sum, min, max, quantile, etc with options passed through |
I support your proposal to add Sticking to the behavior of base functions is possible, but we need to ask users for a bit more coding. For top 10 features:
dfm_trim(some_dfm, min_count = nfeat(some_dfm) - 10, count_fun = "rank")
Relatedly, we also need to decide which is more appropriate
textplot_network(x, min_freq = 0.5, ...)
textplot_wordcloud(x, min_size = 0.5, max_size = 4, min_count = 3, ...) |
I also think that the default values for |
A consistent design here would be to use larger |
Can you collect the above into an alternate set of examples and syntax, similar to the first suggestion above? Briefly:
|
It would be like these:
counts
# remove any features occurring fewer than 5 times in total
dfm_trim(x, min_count = 5, count_fun = "raw")
# remove all but the top 200 most frequent features
dfm_trim(x, min_count = nfeat(x) - 200, count_fun = "rank")
# keep all features
dfm_trim(x, min_count = 1, count_fun = "rank")
# remove any features not occurring at a rate of 10% of total feature count
dfm_trim(x, min_count = .10, count_fun = "prop")
# keep only features occurring at the median total frequency or above
dfm_trim(x, min_count = .50, count_fun = "quant")
# keep only 5% most frequent features
dfm_trim(x, min_count = .95, count_fun = "quant")
document frequency
# remove any features occurring in fewer than 5 documents each
dfm_trim(x, min_docfreq = 5, docfreq_fun = "raw")
# keep only the most frequent 200 features in terms of document frequency
dfm_trim(x, min_docfreq = nfeat(x) - 200, docfreq_fun = "rank")
# remove any features not occurring at least 5% of the total document frequency
dfm_trim(x, min_docfreq = .05, docfreq_fun = "prop")
# keep only features with the median document frequency or higher
dfm_trim(x, min_docfreq = .50, docfreq_fun = "quant") |
Updated thinking:
dfm_trim(x,
         min_termfreq = NULL, max_termfreq = NULL, termfreq_type = c("count", "rank", "quantile"),
         min_docfreq = NULL, max_docfreq = NULL, docfreq_type = c("count", "rank", "quantile"),
         sparsity = NULL,
         verbose = quanteda_options("verbose"))
This changes the behaviour since at the moment, calling just We should trap |
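A signature like the one above relies on two common R idioms: a NULL default meaning "no threshold", and a character vector default resolved with match.arg(), so the first value ("count") applies when the type is not supplied. The following is a minimal base R sketch of that pattern only. The function trim_sketch() and its arguments are hypothetical and are not quanteda's implementation, and treating "rank" as "keep the top N" is an assumption made for illustration.

```r
# Hypothetical helper illustrating NULL defaults and match.arg();
# this is NOT quanteda's dfm_trim().
trim_sketch <- function(freq, min_freq = NULL,
                        type = c("count", "rank", "quantile")) {
  type <- match.arg(type)                  # "count" when type is not supplied
  if (is.null(min_freq)) return(freq)      # NULL threshold: nothing is trimmed
  keep <- switch(type,
    count    = freq >= min_freq,
    # assumption: rank 1 is the most frequent feature, keep the top min_freq
    rank     = rank(-freq, ties.method = "min") <= min_freq,
    quantile = freq >= quantile(freq, min_freq, names = FALSE)
  )
  freq[keep]
}

# Toy feature totals (made-up values)
freq <- c(a = 10, b = 1, c = 3, d = 1, e = 4)

trim_sketch(freq)                          # no threshold: all five features
trim_sketch(freq, min_freq = 3)            # count is the default type
trim_sketch(freq, 2, type = "rank")        # two most frequent features
trim_sketch(freq, 0.5, type = "quantile")  # at or above the median
```

With this pattern an unrecognised type fails early inside match.arg() rather than silently falling back to a default, which is one way to trap unexpected input.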
Hi, Sorry, I am a bit confused about the status of
(myDfm <- dfm(data_corpus_inaugural[1:5]))
## Document-feature matrix of: 5 documents, 1,948 features (69.5% sparse).
(dfm_trim(myDfm, min_docfreq = .4))
## Document-feature matrix of: 5 documents, 569 features (44.2% sparse).
In 1.1.6:
(myDfm <- dfm(data_corpus_inaugural[1:5]))
## Document-feature matrix of: 5 documents, 1,948 features (69.5% sparse).
dfm_trim(myDfm, min_docfreq = .4)
## Document-feature matrix of: 5 documents, 1,948 features (69.5% sparse).
In order to get the original result one has to change the call of
dfm_trim(myDfm, min_docfreq = .4, docfreq_type = "prop")
## Document-feature matrix of: 5 documents, 569 features (44.2% sparse).
What is your strategy regarding backward compatibility?
Sorry if this is an obsolete comment because you are already aware of the issues. |
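One reading consistent with the two outputs above is that a fractional threshold such as .4 is compared against raw document counts in the newer version, so every feature that appears in at least one document passes, whereas the proportional reading requires a feature to appear in at least 40% of documents. The base R sketch below illustrates that difference on a made-up 5-document matrix. It does not call quanteda, and the matrix values and variable names are assumptions for illustration.

```r
# Made-up 5-document, 3-feature count matrix (not the inaugural corpus)
m <- matrix(c(1, 0, 2,
              0, 0, 1,
              3, 0, 1,
              1, 1, 1,
              0, 0, 1),
            nrow = 5, byrow = TRUE,
            dimnames = list(paste0("doc", 1:5), c("x", "y", "z")))

# Document frequency: number of documents in which each feature occurs
docfreq <- colSums(m > 0)

# Threshold read as a count: any feature with docfreq >= 0.4, i.e. all of them
keep_as_count <- names(docfreq)[docfreq >= 0.4]

# Threshold read as a proportion of documents: at least 40% of 5 documents
keep_as_prop <- names(docfreq)[docfreq / nrow(m) >= 0.4]

keep_as_count   # "x" "y" "z"
keep_as_prop    # "x" "z"
```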
Good eye @HolgersID! We changed @koheiw I see two things we missed. First, the signature for the generic does not match the |
dfm_trim() currently accepts two sets of min/max arguments, for count and docfreq respectively. Each also accepts a fractional value, currently interpreted as a percentile of the features by their total frequencies. We want to make this interpretation both more consistent and more flexible by adding arguments to dfm_trim().
Proposal
where count_fun refers to the aggregation across documents for the feature count, whether this is weighted or not. Applied to the global set of feature counts, the values mean the total, the proportion, the percentile, and the inverse rank of each feature, respectively. (Inverse rank means selecting the top min_count features by relative frequency.)
docfreq_fun refers to the method of applying the document frequency threshold across documents. boolean is the default, which means count the document frequency of a feature as the number of documents in which it has some weighted value > 0. The other values are the same as applied for counts.
Examples
counts
document frequency
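The interpretations described in the proposal (total, proportion, percentile, inverse rank, and boolean document frequency) can be sketched in base R on a toy matrix. This is only an illustration of the arithmetic, not quanteda code: the matrix values, the thresholds, and the variable names are made up, and the tie handling for ranks is an assumption.

```r
# Toy document-feature matrix: 4 documents (rows) by 5 features (columns)
m <- matrix(c(3, 0, 1, 0, 0,
              2, 1, 0, 0, 4,
              4, 0, 0, 1, 0,
              1, 0, 2, 0, 0),
            nrow = 4, byrow = TRUE,
            dimnames = list(paste0("d", 1:4), c("a", "b", "c", "d", "e")))

total <- colSums(m)   # total count of each feature across documents

# total: keep features occurring at least 3 times overall
keep_raw <- names(total)[total >= 3]

# proportion: keep features making up at least 10% of all counts
keep_prop <- names(total)[total / sum(total) >= 0.10]

# percentile: keep features at or above the median total count
keep_quant <- names(total)[total >= quantile(total, 0.5, names = FALSE)]

# inverse rank: keep the top 2 features (rank 1 = most frequent; ties share
# the lowest rank, which is an assumption of this sketch)
keep_top2 <- names(total)[rank(-total, ties.method = "min") <= 2]

# boolean document frequency: documents in which each feature has a value > 0
docfreq <- colSums(m > 0)
```

On this matrix the total, proportion, and percentile thresholds happen to retain the same three features, while the rank threshold retains only the two most frequent.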