Consider separating the grouping functions for corpus_ and dfm_ #725
Comments
Let's start from |
ok, this should replicate the functionality of the so: `dfm_group(x, by, fun.aggregate = sum)` where Note: this follows the If the dfm's docvars differ in their values in the |
Upon further reflection, I think we would not want to use |
I agree, @kbenoit. I don't think too many people face my problem, and I managed to find a non-quanteda based way to solve it. Yet, |
Following on the discussion in #723 and #720, we should consider adding separate functions to perform the grouping of counts or texts currently present as arguments to the `corpus()` and `dfm()` constructor functions.

These would take grouping variables, and perform an aggregation function on the main object and on other variables.

- `corpus_group()` - would aggregate the texts by pasting (similar to `texts(x, groups = )` now), but take other arguments to allow docvars to be aggregated in a user-specified way, if these differ across the same values of the grouping variables.
- `dfm_group()` - this would simply separate `groups =` functionality from the `dfm()`, and allow the user to aggregate dfm cell values (usually, counts, but could be weighted counts) by a grouping variable for documents. This is very easy with existing functions: we simply replace the docname row attribute by the unique grouping value label, and call `dfm_compress(x, margin = "documents")`. The default aggregation function would be `sum` but we could offer alternatives through a user-defined numerical function. We could add arguments to aggregate docvars too but this might be overly complicated - if the user wants this, he/she should do it at the corpus stage.

Rather than reinvent the wheel, we might consider using reshape2 syntax and putting this into `corpus_reshape()`. This would clarify #619 for instance (what are the differences between `corpus_reshape()` and `corpus_segment()`), since reshaping would offer functionality for moving up to aggregate units, rather than splitting/segmenting. An alternative would be to use dplyr functions such as `summarise_each` but this might be too far from our existing approach to fit in naturally.
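To make the two proposed operations concrete, here is a minimal base R sketch of the idea. It is not quanteda code and not a proposed implementation: the toy texts, the dense matrix standing in for a dfm, and the helper name `group_rows()` are all assumptions made up for illustration. A real dfm is sparse and would be handled differently.

```r
# Toy data: four documents assigned to two groups (illustrative only).
txts   <- c(d1 = "a b", d2 = "b c", d3 = "a c c", d4 = "a")
groups <- c("g1", "g2", "g1", "g2")

# corpus_group()-style step: paste together the texts sharing a group label.
grouped_txts <- sapply(split(txts, groups), paste, collapse = " ")

# A small dense document-feature count matrix standing in for a dfm.
counts <- matrix(c(1, 1, 0,
                   0, 1, 1,
                   1, 0, 2,
                   1, 0, 0),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(names(txts), c("a", "b", "c")))

# dfm_group()-style step with the default aggregation: sum rows per group.
grouped_sum <- rowsum(counts, group = groups)

# Hypothetical helper: the same grouping with a user-supplied numeric function.
group_rows <- function(m, by, fun = sum) {
  parts <- split(seq_len(nrow(m)), by)
  # One aggregated feature vector per group, then transpose to groups x features.
  t(vapply(parts,
           function(i) apply(m[i, , drop = FALSE], 2, fun),
           numeric(ncol(m))))
}

grouped_max <- group_rows(counts, groups, fun = max)
```

Here `grouped_sum` and `group_rows(counts, groups, sum)` give the same result, while `grouped_max` shows how an alternative aggregation function would change the cell values for the same grouping.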