
speed up unpack_nested_data #42 #51

Merged

merged 3 commits into uptake:master from wdearden:feature/speedUpUnpack on Apr 12, 2018

Conversation


@wdearden wdearden commented Jan 12, 2018

Reference #42


@codecov-io codecov-io commented Jan 12, 2018

Codecov Report

Merging #51 into master will increase coverage by 0.82%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master      #51      +/-   ##
==========================================
+ Coverage   70.93%   71.75%   +0.82%     
==========================================
  Files           3        3              
  Lines         516      531      +15     
==========================================
+ Hits          366      381      +15     
  Misses        150      150
Impacted Files Coverage Δ
R/elasticsearch_parsers.R 76.74% <100%> (+0.81%) ⬆️

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b3cffef...721cd9c. Read the comment docs.

Member

@jameslamb jameslamb left a comment

This looks great, @wdearden208. Please see the comments I left for some things to address. I will review again when you've addressed those.

Also, @austin3dickey I'm going to let you be the final judge on this one, since this function is your masterpiece and you know the nuances of dealing with nested ES results.

# Create the unpacked data.table by replicating the originally unpacked
# columns by the number of rows in each entry in the original unpacked column
group_vars <- setdiff(names(chomped_df), c(names(newDT), col_to_unpack))
n <- pmax(purrr::map_int(listDT, NROW), 1)

@jameslamb

jameslamb Jan 19, 2018
Member

TIL pmax

@@ -391,7 +391,8 @@ unpack_nested_data <- function(chomped_df, col_to_unpack) {
futile.logger::flog.fatal(msg)
stop(msg)
}
if (!("character" %in% class(col_to_unpack)) || length(col_to_unpack) != 1) {
if (!("character" %in% class(col_to_unpack)) || length(col_to_unpack) !=
1) {

@jameslamb

jameslamb Jan 19, 2018
Member

please put this 1) { back on the line above, so the line ends with the opening {

# If we tried to unpack an empty column, fail
if (nrow(newDT) == 0) {
# Check for empty column
if (all(lengths(listDT) == 0)) {

@jameslamb

jameslamb Jan 19, 2018
Member

  1. TIL lengths()
  2. out of curiosity, what does this new version catch that nrow(newDT) didn't? It's not obvious to me

@austin3dickey

austin3dickey Jan 22, 2018
Member

This fails faster, I like it

@austin3dickey

austin3dickey Mar 8, 2018
Member

Actually, I don't think this will fail if all the data.tables have 0 rows, just if they all have 0 columns. Was this intended?

@wdearden

wdearden Mar 13, 2018
Author

Good point. I fixed it.

msg <- "The column given to unpack_nested_data had no data in it."
futile.logger::flog.fatal(msg)
stop(msg)
}

# Fix the ID because we may have removed some empty elements due to that bug
newDT[, .id := oldIDs[.id]]
listDT[lengths(listDT) == 0] <- NA

@jameslamb

jameslamb Jan 19, 2018
Member

oh maybe I'm getting this. Will this catch the cases where you have a cell with value list()? I like this

@wdearden

wdearden Mar 13, 2018
Author

Yes it will

} else if (all(is_atomic)) {
newDT <- data.table::as.data.table(unlist(listDT))
} else {
msg <- "For unpack_nested_data, col_to_unpack must be all atomic vectors or all data frames"

@jameslamb

jameslamb Jan 19, 2018
Member

I'd like if we gave a more informative error message here. If I got this error as a user I wouldn't know what to do next.

Please do two things:

  1. write a unit test using expect_error() that reaches this line
  2. propose an error message format that might be able to give a user a bit more guidance on where to go next
# Create the unpacked data.table by replicating the originally unpacked
# columns by the number of rows in each entry in the original unpacked column
group_vars <- setdiff(names(chomped_df), c(names(newDT), col_to_unpack))
n <- pmax(purrr::map_int(listDT, NROW), 1)

@jameslamb

jameslamb Jan 19, 2018
Member

can you please use a more descriptive variable name than n?

# columns by the number of rows in each entry in the original unpacked column
group_vars <- setdiff(names(chomped_df), c(names(newDT), col_to_unpack))
n <- pmax(purrr::map_int(listDT, NROW), 1)
rest <- chomped_df[rep(1:nrow(chomped_df), n), ..group_vars, drop = FALSE]

@jameslamb

jameslamb Jan 19, 2018
Member

it's not obvious to me, staring at this line, what it actually does. Can you please add a comment and choose a more informative name than rest?

@jameslamb

jameslamb Jan 19, 2018
Member

what does ..group_vars do?

@jameslamb

jameslamb Jan 19, 2018
Member

also what does drop = FALSE do?
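For readers following along, a toy sketch of what the `..` prefix does (hypothetical data, not from this PR): in data.table's j argument, prefixing a symbol with `..` tells data.table to look the name up in the calling scope rather than among the table's columns. The drop = FALSE appears to be a data.frame-compatibility holdover, since data.table keeps a single-column selection as a table regardless.

```r
library(data.table)

DT <- data.table(x = 1:3, y = letters[1:3], z = 4:6)

# '..cols' means: look up 'cols' in the calling scope, not in DT's columns
cols <- c("x", "z")
DT[, ..cols]              # two-column data.table with x and z
DT[, cols, with = FALSE]  # equivalent older spelling
```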

Member

@jameslamb jameslamb commented Jan 27, 2018

@wdearden208 I am going to approve this PR, looks great! Appreciate all the effort you put into this

However, I'm not going to merge. @austin3dickey you have the final 👍 / 👎 on this one

Member

@jameslamb jameslamb commented Jan 28, 2018

@wdearden208 now that we merged that logging PR, you need to rebase this onto master

@wdearden wdearden force-pushed the wdearden:feature/speedUpUnpack branch from c14c972 to 4f6cfbc Feb 5, 2018
handle mixed atomic/data frame column and better explanations in comments
@jameslamb jameslamb force-pushed the wdearden:feature/speedUpUnpack branch from 4f6cfbc to ada870c Mar 6, 2018
Member

@jameslamb jameslamb commented Mar 6, 2018

I just updated this! It put me as a non-author committer, so you will still get credit for the contribution

# Merge
outDT[, .id := .I]
outDT <- newDT[outDT, on = ".id"]
is_df <- purrr::map_lgl(listDT, is.data.frame)

@austin3dickey

austin3dickey Mar 8, 2018
Member

Could you please add #' @importFrom purrr map_lgl to the Roxygen of this function to be explicit?

} else {
msg <- paste0("Each row in column ", col_to_unpack, " must be a data frame or a vector.")
futile.logger::flog.fatal(msg)
stop(msg)

@austin3dickey

austin3dickey Mar 8, 2018
Member

You should use log_fatal here instead of these two lines

# Find column name to use for NA vectors
first_df <- min(which(is_df))
col_name <- names(listDT[[first_df]])[1]

@austin3dickey

austin3dickey Mar 8, 2018
Member

Hmm, using the first column name here seems pretty arbitrary. I wonder if we should surface "which column should I put atomic data in, if it's a mixture of atomic and data.tables" to the user.

To that point, have you ever seen a payload that would result in listDT being a mixture of atomic and data.tables? I wonder if this whole section is too much complexity for a use case I've never seen.

@wdearden

wdearden Mar 13, 2018
Author

OK, I dropped the logic for handling mixed atomic/data.table columns, but I still needed to grab the column name to give NA entries. Without it I got an error, because the "handle NAs" test created a second x column with all NA.

outDT <- outDT[, !c(".id", col_to_unpack), with = FALSE]
# Create the unpacked data.table by replicating the originally unpacked
# columns by the number of rows in each entry in the original unpacked column
times_to_replicate <- pmax(purrr::map_int(listDT, NROW), 1)

@austin3dickey

austin3dickey Mar 8, 2018
Member

Why are you using pmax here? I thought purrr::map_int() is guaranteed to return one integer vector.

@wdearden

wdearden Mar 13, 2018
Author

I need to use pmax for the cases where the entry in col_to_unpack is empty, so the number of rows is 0. I need to rep it one time because I don't want to drop that row.
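A toy illustration of that point (hypothetical data, not from this PR): without the pmax floor of 1, zero-row entries would be replicated zero times and their parent rows silently dropped.

```r
library(data.table)

# The second nested entry is empty (zero rows)
listDT <- list(
  data.table(a = 1:2),
  data.table(a = integer(0)),
  data.table(a = 5:7)
)

purrr::map_int(listDT, NROW)
# 2 0 3 -> rep(..., times = 0) would drop the parent row entirely

pmax(purrr::map_int(listDT, NROW), 1)
# 2 1 3 -> the empty entry still keeps one (NA-filled) row
```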

# columns by the number of rows in each entry in the original unpacked column
times_to_replicate <- pmax(purrr::map_int(listDT, NROW), 1)
# Replicate the rows of the data.table by entries of times_to_replicate but drop col_to_unpack
replicatedDT <- chomped_df[rep(1:nrow(chomped_df), times_to_replicate)]

@austin3dickey

austin3dickey Mar 8, 2018
Member

Makes sense, though perhaps you should test out my former strategy of using the built-in idcol argument of rbindlist. That's essentially replicating the logic rep(1:nrow(chomped_df), times_to_replicate) but I suspect it will be faster since it's data.table haha
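A sketch of the idcol strategy being suggested (toy data, not the PR's code): rbindlist can stamp each stacked row with the index of the list element it came from, which yields the same parent-row mapping as rep(1:nrow(chomped_df), times_to_replicate).

```r
library(data.table)

listDT <- list(data.table(a = 1:2), data.table(a = 5:7))

# idcol = ".id" records which list element each stacked row came from,
# giving a ready-made join key back to the parent table
newDT <- data.table::rbindlist(listDT, fill = TRUE, idcol = ".id")
newDT$.id
# 1 1 2 2 2
```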

@austin3dickey

austin3dickey Mar 8, 2018
Member

So I tried benchmarking a few versions of this function with a large listDT that was full of data.tables (no atomic):

  1. the old function
  2. the old function with lapply(listDT, data.table::as.data.table) commented out
  3. this version

(3) was twice as fast as (1), but (2) was twice as fast as (3). I suspect the difference is all the copies you're making in this chunk.

@wdearden

wdearden Mar 12, 2018
Author

Yep, I think these changes lost the speedup. The main speedup was from dropping data.table::as.data.table.

@wdearden

wdearden Mar 13, 2018
Author

Try out this update's speed. For what it's worth, I tried a version of this function that used the same merge as the original function instead of rep, and it ended up being about 20% slower. I think that makes sense: I'm still using a data.table [, so it shouldn't copy the data any more than a data.table merge would. It's essentially going through the same algorithm as merging in the data.table, with .id playing the same role as times_to_replicate. It also avoids creating a .id column, which could cause a problem if chomped_df already has a .id column.

@austin3dickey

austin3dickey Apr 6, 2018
Member

Hey, thanks for your patience here. No idea why your build is failing. It may just be transient.

Just tested this update's speed on my test dataset again (which has 1000 data frames in the col_to_unpack, no atomic or lists or NAs). The numbers below are the median times each function took on my computer.

  • Old function: 153 ms
  • Old function with the lapply(listDT, data.table::as.data.table) commented out: 69 ms
  • Your first version: 97 ms
  • Your second version: 89 ms
  • A hybrid: 17 ms

For the hybrid, I built off your second version. I changed the top to:

inDT <- data.table::copy(chomped_df)
listDT <- inDT[[col_to_unpack]]
inDT[, (col_to_unpack) := NULL]
inDT[, .id := .I]

then added the idcol = TRUE back to rbindlist and merged after rbindlist like so:

outDT <- inDT[newDT, on = ".id"]
outDT[, .id := NULL]
return(outDT)

So I still think there's value in researching idcol as an option. If you're worried about the .id column restriction, we could have that join column be a random UUID or something for safety. You can, for example, use idcol = myUUIDvar instead of idcol = TRUE to automatically name that ID column.
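Assembled end to end, the hybrid described above might look roughly like this (a sketch pieced together from the fragments in this comment, not the exact merged code; note it inherits the empty-entry caveat discussed earlier, since zero-row nested tables contribute no rows to the join):

```r
library(data.table)

unpack_hybrid <- function(chomped_df, col_to_unpack) {
  inDT <- data.table::copy(chomped_df)
  listDT <- inDT[[col_to_unpack]]
  inDT[, (col_to_unpack) := NULL]
  inDT[, .id := .I]

  # Stack the nested tables, stamping each row with its parent index
  newDT <- data.table::rbindlist(listDT, fill = TRUE, idcol = TRUE)

  # Right-join the parent rows onto the stacked rows, then drop the key.
  # Parent rows whose nested entry was empty do not survive this join.
  outDT <- inDT[newDT, on = ".id"]
  outDT[, .id := NULL]
  return(outDT)
}
```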

Member

@austin3dickey austin3dickey left a comment

Lookin' good; I love the initiative to get rid of lapply(listDT, data.table::as.data.table). But I don't think we need to handle the case where there's a mixture of atomic vectors and data.frames in the list (feel free to disagree here). And there are a few more speedups you can achieve. Take a look whenever you can!

@wdearden wdearden force-pushed the wdearden:feature/speedUpUnpack branch from bbd0997 to 20410ca Mar 13, 2018
@wdearden wdearden force-pushed the wdearden:feature/speedUpUnpack branch from 20410ca to f9b4516 Mar 13, 2018
Member

@austin3dickey austin3dickey commented Apr 6, 2018

Whoops I commented on an outdated file just now.

Member

@austin3dickey austin3dickey commented Apr 12, 2018

For those keeping track at home, this final version outperforms the original function by 9x and tidyr::unnest by 2x!! (On my real-world test dataset)

MAJOR props to @wdearden. Thanks for the idea and execution.

@austin3dickey austin3dickey merged commit 2479d26 into uptake:master Apr 12, 2018
1 check passed
continuous-integration/travis-ci/pr The Travis CI build passed