Add AccumulateCallback #689

Merged
8 commits merged on Nov 16, 2018

Conversation

@blakeboswell
Contributor

@blakeboswell blakeboswell commented Jun 17, 2017

This PR addresses issue #683. Judging by the issue's activity, the user community isn't clamoring for this capability. However, I encounter the need regularly enough to think others will find it useful.

Summary

AccumulateCallback accumulates the results of operations on chunks in a single variable rather than collecting them in a data frame or list. This allows a reduce operation to be performed without creating an intermediate object that may be too large to hold in memory.

Example

In this example, we sum the mpg feature of mtcars.

f <- function(x, pos, acc) sum(x$mpg) + acc

total_mpg <- read_csv_chunked(readr_example("mtcars.csv"),
                              AccumulateCallback$new(f, acc = 0),
                              chunk_size = 5)
  • The callback function takes an additional argument, acc, which carries the accumulated result from one chunk to the next. The value the callback returns becomes acc for the following chunk.
  • The initialize method of AccumulateCallback also takes an optional argument, acc, which sets the initial value of the accumulator. It defaults to NULL if omitted.
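The accumulation described in the bullets above is essentially a fold over chunks. The following is a minimal base-R sketch of that idea, not readr's implementation: `fold_chunks` is a hypothetical helper introduced only for illustration, and the built-in `mtcars` data frame stands in for the chunked CSV reader.

```r
# Hypothetical helper (not part of readr): folds f over row-chunks of a
# data frame, passing the running accumulator to each call.
fold_chunks <- function(data, f, acc = NULL, chunk_size = 5) {
  if (nrow(data) == 0) return(acc)
  for (pos in seq(1, nrow(data), by = chunk_size)) {
    rows <- pos:min(pos + chunk_size - 1, nrow(data))
    # The return value of f becomes the accumulator for the next chunk.
    acc <- f(data[rows, , drop = FALSE], pos, acc)
  }
  acc
}

# Same callback shape as in the example above: (chunk, position, accumulator).
f <- function(x, pos, acc) sum(x$mpg) + acc

total_mpg <- fold_chunks(mtcars, f, acc = 0, chunk_size = 5)
```

Only one chunk and one accumulator value are live at a time, which is the point of the callback: no per-chunk results are collected.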

SideEffectChunkCallback Workaround

The same behavior can be achieved with SideEffectChunkCallback as follows:

acc <- 0
f <- function(x, pos){
  acc <<- sum(x$mpg) + acc
}
read_csv_chunked(readr_example("mtcars.csv"),
                 SideEffectChunkCallback$new(f),
                 chunk_size = 5)
acc

With this in mind, this PR does not really provide a new capability; it provides a more explicit way to express an existing one.
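To make the comparison concrete, here is a small base-R sketch that runs both styles over the same in-memory chunks. It does not call readr; the built-in `mtcars` data frame and the `split()`-based chunking are stand-ins assumed for illustration, and the names `side_f`, `acc_f`, `side_total`, and `acc_total` are hypothetical.

```r
# Stand-in for the chunked reader: 5-row slices of the built-in mtcars.
chunks <- split(mtcars, ceiling(seq_len(nrow(mtcars)) / 5))

# Side-effect style: the callback updates a variable in the enclosing scope.
side_total <- 0
side_f <- function(x, pos) side_total <<- sum(x$mpg) + side_total
for (i in seq_along(chunks)) {
  side_f(chunks[[i]], (i - 1) * 5 + 1)
}

# Accumulator style: the running value is passed in and returned.
acc_f <- function(x, pos, acc) sum(x$mpg) + acc
acc_total <- 0
for (i in seq_along(chunks)) {
  acc_total <- acc_f(chunks[[i]], (i - 1) * 5 + 1, acc_total)
}
```

Both totals agree. The difference is that `acc_f` is a pure function of its inputs, while `side_f` depends on a variable that has to exist outside it.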

Practical Example: Regression

A practical example would be fitting a linear model (here with biglm) to a data set that is too large to hold in memory.

library(biglm)

chunk_regress <- function(data, index, mdl){
  if(is.null(mdl))
    return(biglm(mpg ~ wt, data))
  update(mdl, data)
}

mdl <- read_csv_chunked(readr_example("mtcars.csv"),
                        AccumulateCallback$new(chunk_regress),
                        chunk_size = 5)

summary(mdl)

In this example, we rely on the fact that acc defaults to NULL in the AccumulateCallback initialize method: chunk_regress treats a NULL accumulator as the signal to fit the initial model on the first chunk.
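The NULL-as-initial-state idiom also works without biglm. Below is a minimal base-R sketch that accumulates a running count and sum to obtain a mean; the callback `chunk_mean_state` is a hypothetical name, and a plain loop over the built-in `mtcars` data frame stands in for the chunked reader.

```r
# Hypothetical callback with the (chunk, position, accumulator) shape.
chunk_mean_state <- function(x, pos, acc) {
  # First chunk: acc is NULL, so create the initial state here.
  if (is.null(acc)) acc <- list(n = 0, total = 0)
  list(n = acc$n + nrow(x), total = acc$total + sum(x$mpg))
}

# Stand-in for the chunked reader: feed 5-row slices of mtcars in order.
acc <- NULL
for (pos in seq(1, nrow(mtcars), by = 5)) {
  rows <- pos:min(pos + 4, nrow(mtcars))
  acc <- chunk_mean_state(mtcars[rows, , drop = FALSE], pos, acc)
}

mean_mpg <- acc$total / acc$n
```

Because the accumulator can be any R object, the initial state can be built from the first chunk instead of being supplied up front, which is what chunk_regress does with the fitted model.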

blakeboswell and others added 8 commits Jun 5, 2017
Add AccumulateCallback function definition and documentation to callback.R
Modified documentation, added checkcallback mechanism
Add acc parameter to AccumulateCallback definition
Update documentation to reflect new parameter
Update check_call_back to accommodate AccumulateCallback
Add tests of acc as input to AccumulateCallback
Add tests for less than three arguments in callback
Built roxygen documentation with new AccumulateCallback documentation
@jimhester jimhester merged commit 1e0971c into tidyverse:master Nov 16, 2018
0 of 2 checks passed
@jimhester
Member

@jimhester jimhester commented Nov 16, 2018

This looks good, thanks!

Thanks again for writing the PR and for your patience in getting it merged!

@blakeboswell blakeboswell deleted the accumulate_callback branch Nov 16, 2018