
Consider separating the grouping functions for corpus_ and dfm_ #725

Closed
kbenoit opened this issue May 9, 2017 · 4 comments

@kbenoit (Collaborator) commented May 9, 2017

Following on the discussion in #723 and #720, we should consider adding separate functions to perform the grouping of counts or texts that is currently available only through arguments to the corpus() and dfm() constructor functions.

These would take grouping variables and apply an aggregation function to the main object and to other variables.

  • corpus_group() - would aggregate the texts by pasting (similar to texts(x, groups = ) now), but take other arguments to allow docvars to be aggregated in a user-specified way, if these differ across the same values of the grouping variables.

  • dfm_group() - this would simply separate the groups = functionality from dfm(), and allow the user to aggregate dfm cell values (usually counts, but possibly weighted counts) by a grouping variable for documents. This is very easy with existing functions: we simply replace the docname row attribute with the unique grouping value labels, and call dfm_compress(x, margin = "documents") (see the sketch below this list). The default aggregation function would be sum, but we could offer alternatives through a user-defined numerical function. We could add arguments to aggregate docvars too, but this might be overly complicated - if users want this, they should do it at the corpus stage.
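
To make the second bullet concrete, here is a minimal sketch of how dfm_group() could be built from dfm_compress(), assuming a grouping vector with one element per document. The name dfm_group_sketch and its signature are placeholders for illustration, not the proposed API.

    library(quanteda)

    # rough sketch only: replace the document (row) names with the group
    # labels, then let dfm_compress() sum the cells of identically named rows
    dfm_group_sketch <- function(x, groups) {
        stopifnot(length(groups) == ndoc(x))
        rownames(x) <- as.character(groups)
        dfm_compress(x, margin = "documents")
    }

    # usage: aggregate an inaugural-speech dfm to one row per president
    corp <- corpus_subset(data_corpus_inaugural, Year > 1980)
    dfm_group_sketch(dfm(corp), docvars(corp, "President"))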

Rather than reinvent the wheel, we might consider using reshape2 syntax and putting this into corpus_reshape(). This would also help clarify #619 (what the differences are between corpus_reshape() and corpus_segment()), since reshaping would offer functionality for moving up to aggregated units rather than splitting/segmenting them. An alternative would be to use dplyr functions such as summarise_each, but this might be too far from our existing approach to fit in naturally.

@koheiw (Collaborator) commented May 9, 2017

Let's start with dfm_group, which is easy to program and in high demand. Redesigning corpus_reshape should not be difficult either, as there is a good underlying implementation. I am also keen on improving the internal handling of docvars.

@kbenoit (Collaborator, Author) commented May 9, 2017

ok, this should replicate the functionality of the groups = argument in dfm(). Let's not implement docvar aggregation at this stage, although we could implement a fun.aggregate equivalent whose default is sum.

so:

dfm_group(x, by, fun.aggregate = sum)

where by would be the same as groups in dfm(): it would take a variable from the docvars, or externally supplied column(s) of data whose elements (or rows) match the documents.

Note: this follows the ?cast syntax from reshape2. If you want to call it fun_aggregate that's fine too, since this is probably how Hadley would name it now.

If the dfm's docvars differ in their values within the same by group, we should issue a warning and drop that docvar from the grouped dfm.
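
For illustration, calls under the proposed signature might look like the following. dfm_group() does not exist yet, so the calls themselves are hypothetical and shown as comments; only the setup lines run against the current package.

    library(quanteda)

    corp <- corpus_subset(data_corpus_inaugural, Year > 1980)
    dfmat <- dfm(corp)

    # hypothetical: by named as a docvar, default aggregation (sum)
    # dfm_group(dfmat, by = "President")

    # hypothetical: by supplied externally, with an explicit fun.aggregate
    # dfm_group(dfmat, by = docvars(corp, "President"), fun.aggregate = sum)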

@kbenoit (Collaborator, Author) commented May 9, 2017

Upon further reflection, I think we would not want to use fun.aggregate for now; otherwise we cannot simply call dfm_compress(), since that (only) sums counts.

@stefan-mueller (Collaborator)

I agree, @kbenoit. I don't think too many people face my problem, and I managed to find a non-quanteda-based way to solve it. Still, corpus_group() and dfm_group() seem very useful.
