There are times when it's useful for the chunked callback to return a list (with the per-chunk results combined into a single list at the end).
An example is processing a large JSONL file with purrr functions: (1) read a file chunk, (2) use purrr to transform, filter, reduce, etc. as needed, and (3) combine the results into a final list; call that final object my_data. My specific use-case is to filter out irrelevant records so that the final object fits into memory for interactive work/exploration.
When downstream functions expect my_data to be a data frame, then DataFrameCallback works just fine.
But if there already exist some downstream functions written to deal with my_data as a list (e.g. prepared purrr-enabled pipelines), the DataFrameCallback must be written to return a single list-column data frame, then extract that single list-column, and finally pass the now-extracted list to the downstream functions.
It seems silly to return a single-column data frame, just to then extract that single column later, when we know we'd like to have a list in-hand at the end of the chunked reading/processing phase.
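For reference, a minimal sketch of the work-around described above, under some assumptions: the sample file, the `keep` field, and the `keep_relevant()` helper are made up for illustration, and jsonlite is assumed to be available for parsing each line. It relies on DataFrameCallback row-binding the per-chunk tibbles, which I believe works for list-columns but have not checked across readr versions.

```r
library(readr)
library(purrr)
library(tibble)

# Hypothetical sample data: one JSON object per line.
path <- tempfile(fileext = ".jsonl")
writeLines(c(
  '{"id": 1, "keep": true}',
  '{"id": 2, "keep": false}',
  '{"id": 3, "keep": true}'
), path)

# Per-chunk callback: parse each line, drop irrelevant records, and wrap
# what is left in a one-column tibble whose only column is a list-column.
keep_relevant <- function(lines, pos) {
  records <- map(lines, jsonlite::fromJSON)
  records <- keep(records, ~ isTRUE(.x$keep))
  tibble(record = records)
}

# DataFrameCallback combines the per-chunk tibbles into one data frame.
res <- read_lines_chunked(
  path,
  DataFrameCallback$new(keep_relevant),
  chunk_size = 2,
  progress = FALSE
)

# The extra step this issue is about: pull the list back out of the data frame.
my_data <- res$record
```

The last line is the part that feels redundant: the data frame exists only to carry the list from the callback to the caller.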
Minor suggestion, but certainly not a blocker/bug/etc. (since the work-around is described above).
In all other aspects, readr's great!
Related to (PR) #520.