Add AccumulateCallback #689

merged 8 commits into tidyverse:master from the accumulate_callback branch on Nov 16, 2018



@blakeboswell blakeboswell commented Jun 17, 2017

This PR addresses issue #683. Judging by the issue's activity, the user community isn't clamoring for this capability. However, I encounter the need regularly enough to think others will find it useful.


AccumulateCallback accumulates the results of operations on chunks in a single variable rather than collecting them in a data frame or list. This allows a reduce operation to be performed without creating an intermediate object that may be too large to hold in memory.


In this example, we sum the mpg feature of mtcars.

f <- function(x, pos, acc) sum(x$mpg) + acc

total_mpg <- read_csv_chunked(readr_example("mtcars.csv"),
                              AccumulateCallback$new(f, acc = 0),
                              chunk_size = 5)
  • The callback function requires an additional argument, acc, which carries the accumulated result from one chunk to the next.
  • The initialize method of AccumulateCallback also takes an optional argument, acc, which sets the initial value of the accumulator. It defaults to NULL if omitted.
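To make the role of acc concrete, the following base R sketch emulates the fold that an accumulate-style callback expresses. The helper fold_chunks is hypothetical and is not readr's implementation; it only assumes that each call to f receives the value returned by the previous call.

```r
# Hypothetical helper: emulates how an accumulate-style callback is driven.
# This is not readr's implementation, only the fold it expresses.
fold_chunks <- function(data, chunk_size, f, acc = NULL) {
  starts <- seq(1, nrow(data), by = chunk_size)
  for (pos in starts) {
    rows <- pos:min(pos + chunk_size - 1, nrow(data))
    chunk <- data[rows, , drop = FALSE]
    acc <- f(chunk, pos, acc)  # the result of each call becomes the next acc
  }
  acc
}

# Same callback as in the example above.
f <- function(x, pos, acc) sum(x$mpg) + acc

# Uses the built-in mtcars data frame instead of reading the CSV in chunks.
total_mpg <- fold_chunks(mtcars, chunk_size = 5, f = f, acc = 0)
```

Only one chunk and the running total are alive at any time, which is the point of accumulating instead of collecting per-chunk results.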

SideEffectChunkCallback Work Around

The same behavior can be accomplished with SideEffectChunkCallback as follows:

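acc <- 0
f <- function(x, pos){
  acc <<- sum(x$mpg) + acc  # update the accumulator in the enclosing environment
}
read_csv_chunked(readr_example("mtcars.csv"),
                 SideEffectChunkCallback$new(f),
                 chunk_size = 5)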

With this in mind, this PR does not really provide a new capability, just a more explicit expression of an existing one.
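One place where the explicit form may read better is when the accumulated state is more than a single number. The sketch below assumes a readr build that includes AccumulateCallback; the names running and res are illustrative. It keeps a running sum and row count in a list so that a mean can be derived at the end, without any variable outside the callback being modified.

```r
library(readr)

# Accumulator is a list holding the running sum and the number of rows seen.
running <- function(x, pos, acc) {
  list(sum = acc$sum + sum(x$mpg), n = acc$n + nrow(x))
}

res <- read_csv_chunked(readr_example("mtcars.csv"),
                        AccumulateCallback$new(running, acc = list(sum = 0, n = 0)),
                        chunk_size = 5)

mean_mpg <- res$sum / res$n
```

The initial value passed as acc defines the shape of the state, so the callback never needs a special case for the first chunk.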

Practical Example: Regression

A practical example is fitting a regression model to a data set that is too large to hold in memory.


library(biglm)

chunk_regress <- function(data, index, mdl){
  if (is.null(mdl)) {
    return(biglm(mpg ~ wt, data))
  }
  update(mdl, data)
}

mdl <- read_csv_chunked(readr_example("mtcars.csv"),
                        AccumulateCallback$new(chunk_regress),
                        chunk_size = 5)


In this example we rely on the fact that the acc argument of the AccumulateCallback initialize method defaults to NULL: chunk_regress sees NULL on the first chunk and fits the initial model, then updates that model on every later chunk.
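The same NULL-initialised pattern can be sketched without biglm by accumulating the cross-products of the normal equations. This is an illustration only: it is not what biglm does internally, the name chunk_crossprod is hypothetical, and the chunk loop over the built-in mtcars data stands in for read_csv_chunked.

```r
# Accumulator holds the running cross-products X'X and X'y for mpg ~ wt.
chunk_crossprod <- function(data, index, acc) {
  X <- cbind(1, data$wt)  # design matrix: intercept column and wt
  y <- data$mpg
  if (is.null(acc)) {
    # First chunk: start from zero matrices of the right shape.
    acc <- list(xtx = matrix(0, 2, 2), xty = matrix(0, 2, 1))
  }
  list(xtx = acc$xtx + crossprod(X), xty = acc$xty + crossprod(X, y))
}

# Stand-in for the chunked reader: feed mtcars five rows at a time.
acc <- NULL
for (pos in seq(1, nrow(mtcars), by = 5)) {
  rows <- pos:min(pos + 4, nrow(mtcars))
  acc <- chunk_crossprod(mtcars[rows, ], pos, acc)
}

# Solve the normal equations for the least-squares coefficients.
beta <- drop(solve(acc$xtx, acc$xty))
names(beta) <- c("(Intercept)", "wt")
```

Because chunk_crossprod takes (data, index, acc) and handles a NULL accumulator, it should also be usable as AccumulateCallback$new(chunk_crossprod) with read_csv_chunked, under the same assumption about the default acc.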

blakeboswell and others added 8 commits June 5, 2017 02:21
Add AccumulateCallback function definition and documentation to callback.R
Modified documentation, added checkcallback mechanism
Add acc parameter to AccumulateCallback definition
Update documentation to reflect new parameter
Update check_call_back to accomodate AccumulateCallback
Add tests of acc as input to AccumulateCallback
Add tests for less than three arguments in callback
Built roxygen documentation with new AccumulateCallback
@jimhester jimhester merged commit 1e0971c into tidyverse:master Nov 16, 2018

This looks good, thanks!

Thanks again for writing the PR and for your patience in getting it merged!

@blakeboswell blakeboswell deleted the accumulate_callback branch November 16, 2018 16:24