
Consider separating the grouping functions for corpus_ and dfm_ #725

Closed
kbenoit opened this issue May 9, 2017 · 4 comments

@kbenoit (Collaborator) commented May 9, 2017

Following on the discussion in #723 and #720, we should consider adding separate functions to perform the grouping of counts or texts that is currently available only through arguments to the corpus() and dfm() constructor functions.

These would take grouping variables and apply an aggregation function to the main object and to other variables.

  • corpus_group() - would aggregate the texts by pasting (similar to texts(x, groups = ) now), but take other arguments to allow docvars to be aggregated in a user-specified way, if these differ across the same values of the grouping variables.

  • dfm_group() - this would simply separate the groups = functionality from dfm(), and allow the user to aggregate dfm cell values (usually counts, but possibly weighted counts) by a grouping variable for documents. This is very easy with existing functions: we simply replace the docname row attribute with the unique grouping value labels, and call dfm_compress(x, margin = "documents") (see the sketch below this list). The default aggregation function would be sum, but we could offer alternatives through a user-defined numerical function. We could add arguments to aggregate docvars too, but this might be overly complicated - if users want this, they should do it at the corpus stage.
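
To make the second bullet concrete, here is a minimal sketch of how dfm_group() could be built from dfm_compress(), assuming a grouping vector with one element per document. The name dfm_group_sketch and its signature are placeholders for illustration, not the proposed API.

    library(quanteda)

    # rough sketch only: replace the document (row) names with the group
    # labels, then let dfm_compress() sum the cells of identically named rows
    dfm_group_sketch <- function(x, groups) {
        stopifnot(length(groups) == ndoc(x))
        rownames(x) <- as.character(groups)
        dfm_compress(x, margin = "documents")
    }

    # usage: aggregate an inaugural-speech dfm to one row per president
    corp <- corpus_subset(data_corpus_inaugural, Year > 1980)
    dfm_group_sketch(dfm(corp), docvars(corp, "President"))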

Rather than reinvent the wheel, we might consider using reshape2 syntax and putting this into corpus_reshape(). This would also help clarify #619 (what the differences are between corpus_reshape() and corpus_segment()), since reshaping would offer functionality for moving up to aggregated units rather than splitting/segmenting them. An alternative would be to use dplyr functions such as summarise_each, but this might be too far from our existing approach to fit in naturally.

@koheiw (Collaborator) commented May 9, 2017

Let's start with dfm_group, which is easy to program and in high demand. Redesigning corpus_reshape should not be difficult either, as there is a good underlying implementation. I am also keen on improving the internal handling of docvars.

@kbenoit (Collaborator, Author) commented May 9, 2017

ok, this should replicate the functionality of the groups = argument in dfm(). Let's not implement docvar aggregation at this stage, although we could implement a fun.aggregate equivalent whose default is sum.

so:

dfm_group(x, by, fun.aggregate = sum)

where by would be the same as groups in dfm(): it would take a variable from the docvars, or externally supplied column(s) of data whose elements (or rows) match the documents.

Note: this follows the ?cast syntax from reshape2. If you want to call it fun_aggregate that's fine too, since this is probably how Hadley would name it now.

If the dfm's docvars differ in their values within the same by group, we should issue a warning and drop that docvar from the grouped dfm.
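
For illustration, calls under the proposed signature might look like the following. dfm_group() does not exist yet, so the calls themselves are hypothetical and shown as comments; only the setup lines run against the current package.

    library(quanteda)

    corp <- corpus_subset(data_corpus_inaugural, Year > 1980)
    dfmat <- dfm(corp)

    # hypothetical: by named as a docvar, default aggregation (sum)
    # dfm_group(dfmat, by = "President")

    # hypothetical: by supplied externally, with an explicit fun.aggregate
    # dfm_group(dfmat, by = docvars(corp, "President"), fun.aggregate = sum)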

@kbenoit (Collaborator, Author) commented May 9, 2017

Upon further reflection, I think we would not want to use fun.aggregate for now; otherwise we cannot simply call dfm_compress(), since that (only) sums counts.

@stefan-mueller (Collaborator)

I agree, @kbenoit. I don't think too many people face my problem, and I managed to find a non-quanteda-based way to solve it. Still, corpus_group() and dfm_group() seem very useful.
