speed up unpack_nested_data #42 #51
Codecov Report
@@            Coverage Diff            @@
##           master     #51     +/-    ##
=========================================
+ Coverage   70.93%   71.75%   +0.82%
=========================================
  Files           3        3
  Lines         516      531     +15
=========================================
+ Hits          366      381     +15
  Misses        150      150
This looks great, @wdearden208. Please see the comments I left for some things to address. I will review again when you've addressed those. Also, @austin3dickey, I'm going to let you be the final judge on this one, since this function is your masterpiece and you know the nuances of dealing with nested ES results.
# Create the unpacked data.table by replicating the originally unpacked
# columns by the number of rows in each entry in the original unpacked column
group_vars <- setdiff(names(chomped_df), c(names(newDT), col_to_unpack))
n <- pmax(purrr::map_int(listDT, NROW), 1)
jameslamb
Jan 19, 2018
Member
TIL pmax
@@ -391,7 +391,8 @@ unpack_nested_data <- function(chomped_df, col_to_unpack) {
  futile.logger::flog.fatal(msg)
  stop(msg)
  }
- if (!("character" %in% class(col_to_unpack)) || length(col_to_unpack) != 1) {
+ if (!("character" %in% class(col_to_unpack)) || length(col_to_unpack) !=
+ 1) {
jameslamb
Jan 19, 2018
Member
please put this 1) { back on the line above, so the line ends with the opening {
- # If we tried to unpack an empty column, fail
- if (nrow(newDT) == 0) {
+ # Check for empty column
+ if (all(lengths(listDT) == 0)) {
jameslamb
Jan 19, 2018
Member
- TIL lengths()
- out of curiosity, what does this new version catch that nrow(newDT) didn't? It's not obvious to me
austin3dickey
Jan 22, 2018
Member
This fails faster, I like it
austin3dickey
Mar 8, 2018
Member
Actually, I don't think this will fail if all the data.tables have 0 rows, just if they all have 0 columns. Was this intended?
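To illustrate the distinction raised here: lengths() applied to a list of data frames counts columns rather than rows, so a check built on it only fires for zero-column entries. A minimal base R sketch with made-up data (this listDT is hypothetical and is not the package's test data):

```r
# Hypothetical list column: every entry has columns but zero rows
listDT <- list(
  data.frame(x = integer(0)),
  data.frame(x = integer(0), y = character(0))
)

# lengths() reports the number of columns of each data frame
col_counts <- lengths(listDT)

# NROW() reports the number of rows of each data frame
row_counts <- vapply(listDT, NROW, integer(1))

# A column-count check does not flag this input as empty ...
print(all(col_counts == 0))

# ... while a row-count check does
print(all(row_counts == 0))
```

A check on row counts (or on both) is one way to cover the zero-row case; whether that matches the fix that was eventually committed is not shown in this thread.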
wdearden
Mar 13, 2018
Author
Good point. I fixed it.
msg <- "The column given to unpack_nested_data had no data in it."
futile.logger::flog.fatal(msg)
stop(msg)
}

# Fix the ID because we may have removed some empty elements due to that bug
newDT[, .id := oldIDs[.id]]
listDT[lengths(listDT) == 0] <- NA
jameslamb
Jan 19, 2018
Member
oh maybe I'm getting this. Will this catch the cases where you have a cell with value list()? I like this
wdearden
Mar 13, 2018
Author
Yes it will
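For reference, a small base R sketch of what that assignment does to zero-length cells such as list(). The input list is invented for illustration:

```r
# Hypothetical contents of a column to unpack
listDT <- list(1:3, list(), character(0), data.frame(a = 1))

# Every zero-length entry (empty list, empty vector) becomes NA
listDT[lengths(listDT) == 0] <- NA

str(listDT)
```

Note that a data frame with columns but no rows has a non-zero length, so this particular line leaves it untouched.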
} else if (all(is_atomic)) {
newDT <- data.table::as.data.table(unlist(listDT))
} else {
msg <- "For unpack_nested_data, col_to_unpack must be all atomic vectors or all data frames"
jameslamb
Jan 19, 2018
Member
I'd like it if we gave a more informative error message here. If I got this error as a user I wouldn't know what to do next.
Please do two things:
- write a unit test using expect_error() that reaches this line
- propose an error message format that might be able to give a user a bit more guidance on where to go next
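One possible shape for such a message and its test, as a sketch only. The helper check_unpackable() and its wording are hypothetical and are not part of the package; the idea is to name the column, the offending rows, and what was found there. The testthat call is guarded so the snippet still runs where testthat is not installed.

```r
# Hypothetical helper (not in the package): fail with a message that says
# which column and which rows are the problem, and what they contain.
check_unpackable <- function(listDT, col_to_unpack) {
  is_df <- vapply(listDT, is.data.frame, logical(1))
  is_atomic <- vapply(listDT, is.atomic, logical(1))
  bad <- which(!(is_df | is_atomic))
  if (length(bad) > 0) {
    bad_classes <- unique(vapply(listDT[bad], function(x) class(x)[1], character(1)))
    msg <- sprintf(
      "Each row in column '%s' must be a data frame or a vector. Offending row(s): %s (class: %s).",
      col_to_unpack,
      paste(head(bad, 5), collapse = ", "),
      paste(bad_classes, collapse = ", ")
    )
    stop(msg, call. = FALSE)
  }
  invisible(TRUE)
}

# Second entry is a plain list, which is neither atomic nor a data frame
result <- tryCatch(
  check_unpackable(list(1:3, list(a = 1)), "hits"),
  error = function(e) conditionMessage(e)
)
print(result)

# The unit test asked for above could look like this
if (requireNamespace("testthat", quietly = TRUE)) {
  testthat::expect_error(
    check_unpackable(list(1:3, list(a = 1)), "hits"),
    regexp = "must be a data frame or a vector"
  )
}
```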
# Create the unpacked data.table by replicating the originally unpacked
# columns by the number of rows in each entry in the original unpacked column
group_vars <- setdiff(names(chomped_df), c(names(newDT), col_to_unpack))
n <- pmax(purrr::map_int(listDT, NROW), 1)
jameslamb
Jan 19, 2018
Member
can you please use a more descriptive variable name than n?
# columns by the number of rows in each entry in the original unpacked column
group_vars <- setdiff(names(chomped_df), c(names(newDT), col_to_unpack))
n <- pmax(purrr::map_int(listDT, NROW), 1)
rest <- chomped_df[rep(1:nrow(chomped_df), n), ..group_vars, drop = FALSE]
jameslamb
Jan 19, 2018
Member
it's not obvious to me, staring at this line, what it actually does. Can you please add a comment and choose a more informative name than rest?
jameslamb
Jan 19, 2018
Member
what does ..group_vars do?
jameslamb
Jan 19, 2018
Member
also what does drop = FALSE do?
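The thread does not record an answer to these two questions, so here is a hedged sketch. In data.table, the .. prefix in j means "look this variable up in the calling scope and use its value as column names". As far as I know, data.table's own documentation describes its drop argument as unused, so drop = FALSE only has an effect on base data frames. The table and values below are invented for illustration.

```r
library(data.table)

# Hypothetical input with a list column to unpack
chomped_df <- data.table(
  id = 1:2,
  grp = c("a", "b"),
  payload = list(1:3, 4L)
)
group_vars <- c("id", "grp")
times_to_replicate <- c(3L, 1L)
row_index <- rep(seq_len(nrow(chomped_df)), times_to_replicate)

# ..group_vars: use the character vector stored in group_vars as the
# columns to select, while i repeats row 1 three times and row 2 once
replicated <- chomped_df[row_index, ..group_vars]

# with = FALSE is the older spelling of the same selection
same <- chomped_df[row_index, group_vars, with = FALSE]

# In base R, drop = FALSE keeps a one-column selection as a data frame
df <- data.frame(id = 1:2, grp = c("a", "b"))
print(is.data.frame(df[, "id", drop = FALSE]))
print(is.data.frame(df[, "id"]))
```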
@wdearden208 I am going to approve this PR, looks great! Appreciate all the effort you put into this. However, I'm not going to merge. @austin3dickey you have the final
@wdearden208 now that we merged that logging PR, you need to rebase this onto
handle mixed atomic/data frame column and better explanations in comments
4f6cfbc
to
ada870c
I just updated this! It put me as a non-author committer, so you will still get credit for the contribution
# Merge
outDT[, .id := .I]
outDT <- newDT[outDT, on = ".id"]
is_df <- purrr::map_lgl(listDT, is.data.frame)
austin3dickey
Mar 8, 2018
Member
Could you please add #' @importFrom purrr map_lgl to the Roxygen of this function to be explicit?
wdearden
Mar 13, 2018
Author
Done
} else {
msg <- paste0("Each row in column ", col_to_unpack, " must be a data frame or a vector.")
futile.logger::flog.fatal(msg)
stop(msg)
austin3dickey
Mar 8, 2018
Member
You should use log_fatal here instead of these two lines
wdearden
Mar 13, 2018
Author
Done
# Find column name to use for NA vectors
first_df <- min(which(is_df))
col_name <- names(listDT[[first_df]])[1]
austin3dickey
Mar 8, 2018
Member
Hmm, using the first column name here seems pretty arbitrary. I wonder if we should surface "which column should I put atomic data in, if it's a mixture of atomic and data.tables" to the user.
To that point, have you ever seen a payload that would result in listDT being a mixture of atomic and data.tables? I wonder if this whole section is too much complexity for a use case I've never seen.
wdearden
Mar 13, 2018
Author
Ok I dropped the logic for handling mixed atomic/data.table columns but I still needed to grab the column name to give NA entries. I got an error if I didn't because in the "handle NAs" test it created a second x column with all NA.
outDT <- outDT[, !c(".id", col_to_unpack), with = FALSE]
# Create the unpacked data.table by replicating the originally unpacked
# columns by the number of rows in each entry in the original unpacked column
times_to_replicate <- pmax(purrr::map_int(listDT, NROW), 1)
austin3dickey
Mar 8, 2018
Member
Why are you using pmax here? I thought purrr::map_int() is guaranteed to return one integer vector.
wdearden
Mar 13, 2018
Author
I need to use pmax for the cases where the entry in col_to_unpack is empty, so the number of rows is 0. I need to rep it one time because I don't want to drop that row.
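A small base R sketch of that point, using invented entries. pmax() is an element-wise maximum against 1, not a reduction to a single value, so empty entries still get one output row. vapply() stands in for purrr::map_int() here only to keep the snippet dependency-free.

```r
# Hypothetical entries of col_to_unpack: two rows, empty, and a lone NA
listDT <- list(data.frame(x = 1:2), integer(0), NA)

row_counts <- vapply(listDT, NROW, integer(1))    # 2 0 1

# Element-wise floor at 1 so the empty entry keeps its parent row
times_to_replicate <- pmax(row_counts, 1L)        # 2 1 1

# Row indices of the parent table after replication
row_index <- rep(seq_along(listDT), times_to_replicate)   # 1 1 2 3

# By contrast, max() collapses everything to one number
print(max(row_counts, 1L))                        # 2
```

Without the pmax() call, rep() would receive a 0 for the second entry and that parent row would disappear from the output.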
# columns by the number of rows in each entry in the original unpacked column
times_to_replicate <- pmax(purrr::map_int(listDT, NROW), 1)
# Replicate the rows of the data.table by entries of times_to_replicate but drop col_to_unpack
replicatedDT <- chomped_df[rep(1:nrow(chomped_df), times_to_replicate)]
austin3dickey
Mar 8, 2018
Member
Makes sense, though perhaps you should test out my former strategy of using the built-in idcol argument of rbindlist. That's essentially replicating the logic rep(1:nrow(chomped_df), times_to_replicate) but I suspect it will be faster since it's data.table haha
austin3dickey
Mar 8, 2018
Member
So I tried benchmarking a few versions of this function with a large listDT that was full of data.tables (no atomic):
1. the old function
2. the old function with lapply(listDT, data.table::as.data.table) commented out
3. this version
(3) was twice as fast as (1), but (2) was twice as fast as (3). I suspect the difference is all the copies you're making in this chunk.
wdearden
Mar 12, 2018
Author
Yep, I think these changes lost the speedup. The main speedup was from dropping data.table::as.data.table.
wdearden
Mar 13, 2018
Author
Try out this update's speed. For what it's worth I tried out this function where I used the same merge as the original function instead of rep and it ended up being about 20% slower. I think that makes sense because I'm still using a data.table [. So it shouldn't copy the data any more than a data.table merge would copy the data. It's basically going through the same algorithm as merging in the data.table with .id being the same as times_to_replicate. Furthermore, I think it makes sense this way since we don't have to deal with creating a .id column which could cause a problem if chomped_df has a .id column.
austin3dickey
Apr 6, 2018
Member
Hey, thanks for your patience here. No idea why your build is failing. It may just be transient.
Just tested this update's speed on my test dataset again (which has 1000 data frames in the col_to_unpack, no atomic or lists or NAs). The numbers below are the median times each function took on my computer.
- Old function: 153 ms
- Old function with the lapply(listDT, data.table::as.data.table) commented out: 69 ms
- Your first version: 97 ms
- Your second version: 89 ms
- A hybrid: 17 ms
For the hybrid, I built off your second version. I changed the top to:
inDT <- data.table::copy(chomped_df)
listDT <- inDT[[col_to_unpack]]
inDT[, (col_to_unpack) := NULL]
inDT[, .id := .I]
then added the idcol = TRUE back to rbindlist and merged after rbindlist like so:
outDT <- inDT[newDT, on = ".id"]
outDT[, .id := NULL]
return(outDT)
So I still think there's value in researching idcol as an option. If you're worried about the .id column restriction, we could have that join column be a random UUID or something for safety. You can, for example, use idcol = myUUIDvar instead of idcol = TRUE to automatically name that ID column.
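Putting the fragments of the hybrid together, a self-contained sketch might look like the following. It is an illustration under stated assumptions and not the code that was merged: unpack_sketch() is a hypothetical name, it only handles a column whose entries are all data frames, and it picks the join column name with make.unique() as a simple stand-in for the UUID idea.

```r
library(data.table)

# Hypothetical sketch of the idcol-based approach described above
unpack_sketch <- function(chomped_df, col_to_unpack) {
  # Join-column name that does not clash with the outer columns.
  # (A clash with a column inside the nested tables is not handled here.)
  id_col <- tail(make.unique(c(names(chomped_df), ".id")), 1)

  # Work on a copy so the caller's table is not modified by reference
  inDT <- data.table::copy(chomped_df)
  listDT <- inDT[[col_to_unpack]]
  inDT[, (col_to_unpack) := NULL]
  inDT[, (id_col) := .I]

  # idcol records, for every stacked row, which list entry it came from
  newDT <- data.table::rbindlist(listDT, fill = TRUE, idcol = id_col)

  # Join the outer columns back on and drop the helper column
  outDT <- inDT[newDT, on = id_col]
  outDT[, (id_col) := NULL]
  outDT[]
}

# Made-up example data
chomped_df <- data.table(
  user = c("a", "b"),
  hits = list(
    data.table(x = 1:2, y = c("p", "q")),
    data.table(x = 3L, y = "r")
  )
)
out <- unpack_sketch(chomped_df, "hits")
print(out)
```

Entries with zero rows contribute nothing to rbindlist(), so their parent rows would drop out of this join; the NA replacement discussed earlier in the thread is what guards against that in the real function.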
Lookin' good; I love the initiative to get rid of
bbd0997 to 20410ca
20410ca to f9b4516
Whoops I commented on an outdated file just now.
For those keeping track at home, this final version outperforms the original function by 9x and MAJOR props to @wdearden. Thanks for the idea and execution.
Reference #42