speed up unpack_nested_data #42 #51

Merged
merged 3 commits into from
Apr 12, 2018

Conversation

wdearden

Reference #42

codecov-io commented Jan 12, 2018

Codecov Report

Merging #51 into master will increase coverage by 0.82%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master      #51      +/-   ##
==========================================
+ Coverage   70.93%   71.75%   +0.82%     
==========================================
  Files           3        3              
  Lines         516      531      +15     
==========================================
+ Hits          366      381      +15     
  Misses        150      150
Impacted Files Coverage Δ
R/elasticsearch_parsers.R 76.74% <100%> (+0.81%) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b3cffef...721cd9c. Read the comment docs.

Collaborator

@jameslamb jameslamb left a comment

This looks great, @wdearden208. Please see the comments I left for some things to address. I will review again when you've addressed those.

Also, @austin3dickey, I'm going to let you be the final judge on this one, since this function is your masterpiece and you know the nuances of dealing with nested ES results.

# Create the unpacked data.table by replicating the originally unpacked
# columns by the number of rows in each entry in the original unpacked column
group_vars <- setdiff(names(chomped_df), c(names(newDT), col_to_unpack))
n <- pmax(purrr::map_int(listDT, NROW), 1)
Collaborator

TIL pmax

@@ -391,7 +391,8 @@ unpack_nested_data <- function(chomped_df, col_to_unpack) {
futile.logger::flog.fatal(msg)
stop(msg)
}
if (!("character" %in% class(col_to_unpack)) || length(col_to_unpack) != 1) {
if (!("character" %in% class(col_to_unpack)) || length(col_to_unpack) !=
1) {
Collaborator

please put this 1) { back on the line above, so the line ends with the opening {

# If we tried to unpack an empty column, fail
if (nrow(newDT) == 0) {
# Check for empty column
if (all(lengths(listDT) == 0)) {
Collaborator

  1. TIL lengths()
  2. out of curiosity, what does this new version catch that nrow(newDT) didn't? It's not obvious to me

Collaborator

This fails faster, I like it

Collaborator

Actually, I don't think this will fail if all the data.tables have 0 rows, just if they all have 0 columns. Was this intended?

Author

Good point. I fixed it.
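
As an illustration of the distinction raised in this thread, here is a minimal base R sketch with made-up data. It assumes a list column whose entries are data frames that have columns but no rows; the object contents are hypothetical and this is not the fix that was actually committed.

```r
# Hypothetical list column: every entry has columns but zero rows.
listDT <- list(
    data.frame(x = integer(0)),
    data.frame(x = integer(0), y = character(0))
)

# lengths() calls length() on each element; for a data frame that is
# the number of columns, not the number of rows.
col_counts <- lengths(listDT)

# NROW() gives the number of rows of a data frame (or the length of a vector).
row_counts <- vapply(listDT, NROW, integer(1))

# The column-based check does not flag this input as empty...
empty_by_cols <- all(col_counts == 0)

# ...while a row-based check does.
empty_by_rows <- all(row_counts == 0)
```

Here `empty_by_cols` is `FALSE` and `empty_by_rows` is `TRUE`, which is the gap the reviewer pointed out.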

msg <- "The column given to unpack_nested_data had no data in it."
futile.logger::flog.fatal(msg)
stop(msg)
}

# Fix the ID because we may have removed some empty elements due to that bug
newDT[, .id := oldIDs[.id]]
listDT[lengths(listDT) == 0] <- NA
Collaborator

oh maybe I'm getting this. Will this catch the cases where you have a cell with value list()? I like this

Author

Yes it will
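
A small base R sketch of what that line does to zero-length cells, using a made-up list (the values are assumptions for illustration only):

```r
# Hypothetical list column containing an empty list() and an empty vector.
listDT <- list(c(1, 2), list(), character(0), 3)

# Positions whose length is zero, including the list() cell.
is_empty <- lengths(listDT) == 0

# Replace every zero-length cell with a single NA so it still
# contributes one row when the column is unpacked.
listDT[is_empty] <- NA
```

After the assignment, the second and third elements are each a single `NA`, and every element of `listDT` has length at least 1.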

} else if (all(is_atomic)) {
newDT <- data.table::as.data.table(unlist(listDT))
} else {
msg <- "For unpack_nested_data, col_to_unpack must be all atomic vectors or all data frames"
Collaborator

I'd like it if we gave a more informative error message here. If I got this error as a user I wouldn't know what to do next.

Please do two things:

  1. write a unit test using expect_error() that reaches this line
  2. propose an error message format that might be able to give a user a bit more guidance on where to go next

# Create the unpacked data.table by replicating the originally unpacked
# columns by the number of rows in each entry in the original unpacked column
group_vars <- setdiff(names(chomped_df), c(names(newDT), col_to_unpack))
n <- pmax(purrr::map_int(listDT, NROW), 1)
Collaborator

can you please use a more descriptive variable name than n?

# columns by the number of rows in each entry in the original unpacked column
group_vars <- setdiff(names(chomped_df), c(names(newDT), col_to_unpack))
n <- pmax(purrr::map_int(listDT, NROW), 1)
rest <- chomped_df[rep(1:nrow(chomped_df), n), ..group_vars, drop = FALSE]
Collaborator

it's not obvious to me, staring at this line, what it actually does. Can you please add a comment and choose a more informative name than rest?

Collaborator

what does ..group_vars do?

Collaborator

also what does drop = FALSE do?
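
Neither question is answered in the thread. As a hedged sketch, assuming a data.table version that supports the `..` prefix: inside `[.data.table`, `..group_vars` looks up `group_vars` in the calling scope and uses its value as a character vector of column names. The table and column names below are made up. As far as the data.table documentation describes it, the `drop` argument is not used by `[.data.table`, so a single-column selection stays a data.table either way; treat that as an assumption to verify against your installed version.

```r
library(data.table)

# Hypothetical table; the names are for illustration only.
DT <- data.table(id = 1:2, name = c("a", "b"), value = c(10, 20))
group_vars <- c("id", "name")

# `..group_vars` means "the variable group_vars from the calling scope",
# whose value is taken as the column names to select.
selected <- DT[, ..group_vars]

# The older, equivalent spelling uses with = FALSE.
selected_with <- DT[, group_vars, with = FALSE]

# Selecting a single column this way still returns a data.table,
# so there is no vector-dropping behaviour to switch off.
one_var <- "value"
single <- DT[, ..one_var]
still_a_table <- is.data.table(single)
```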

@jameslamb
Collaborator

@wdearden208 I am going to approve this PR, looks great! Appreciate all the effort you put into this

However, I'm not going to merge. @austin3dickey you have the final 👍 / 👎 on this one

@jameslamb
Collaborator

@wdearden208 now that we merged that logging PR, you need to rebase this onto master

handle mixed atomic/data frame column and better explanations in comments
@jameslamb
Collaborator

I just updated this! It put me as a non-author committer, so you will still get credit for the contribution

# Merge
outDT[, .id := .I]
outDT <- newDT[outDT, on = ".id"]
is_df <- purrr::map_lgl(listDT, is.data.frame)
Collaborator

Could you please add #' @importFrom purrr map_lgl to the Roxygen of this function to be explicit?

Author

Done

} else {
msg <- paste0("Each row in column ", col_to_unpack, " must be a data frame or a vector.")
futile.logger::flog.fatal(msg)
stop(msg)
Collaborator

You should use log_fatal here instead of these two lines

Author

Done


# Find column name to use for NA vectors
first_df <- min(which(is_df))
col_name <- names(listDT[[first_df]])[1]
Collaborator

@austin3dickey austin3dickey Mar 8, 2018

Hmm, using the first column name here seems pretty arbitrary. I wonder if we should surface "which column should I put atomic data in, if it's a mixture of atomic and data.tables" to the user.

To that point, have you ever seen a payload that would result in listDT being a mixture of atomic and data.tables? I wonder if this whole section is too much complexity for a use case I've never seen.

Author

Ok I dropped the logic for handling mixed atomic/data.table columns but I still needed to grab the column name to give NA entries. I got an error if I didn't because in the "handle NAs" test it created a second x column with all NA.

outDT <- outDT[, !c(".id", col_to_unpack), with = FALSE]
# Create the unpacked data.table by replicating the originally unpacked
# columns by the number of rows in each entry in the original unpacked column
times_to_replicate <- pmax(purrr::map_int(listDT, NROW), 1)
Collaborator

Why are you using pmax here? I thought purrr::map_int() is guaranteed to return one integer vector.

Author

I need to use pmax for the cases where the entry in col_to_unpack is empty, so the number of rows is 0. I need to rep it one time because I don't want to drop that row.
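
The effect described above can be sketched in base R. The data here are made up, and `vapply()` stands in for `purrr::map_int()` so the sketch has no package dependencies:

```r
# Hypothetical parent rows and their nested entries; "b" has an empty entry.
chomped_df <- data.frame(id = c("a", "b", "c"), stringsAsFactors = FALSE)
listDT <- list(
    data.frame(x = 1:2),
    data.frame(x = integer(0)),
    data.frame(x = 3L)
)

# Number of rows in each nested entry.
rows_per_entry <- vapply(listDT, NROW, integer(1))

# Without pmax(), the zero count removes row "b" from the index entirely.
without_pmax <- rep(seq_len(nrow(chomped_df)), rows_per_entry)

# pmax() takes the element-wise maximum against 1, so every parent row
# is repeated at least once.
times_to_replicate <- pmax(rows_per_entry, 1)
row_index <- rep(seq_len(nrow(chomped_df)), times_to_replicate)
replicated <- chomped_df[row_index, , drop = FALSE]
```

With `pmax()`, `replicated$id` keeps `"b"` once; without it, the index skips the second row.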

# columns by the number of rows in each entry in the original unpacked column
times_to_replicate <- pmax(purrr::map_int(listDT, NROW), 1)
# Replicate the rows of the data.table by entries of times_to_replicate but drop col_to_unpack
replicatedDT <- chomped_df[rep(1:nrow(chomped_df), times_to_replicate)]
Collaborator

@austin3dickey austin3dickey Mar 8, 2018

Makes sense, though perhaps you should test out my former strategy of using the built-in idcol argument of rbindlist. That's essentially replicating the logic rep(1:nrow(chomped_df), times_to_replicate) but I suspect it will be faster since it's data.table haha

Collaborator

So I tried benchmarking a few versions of this function with a large listDT that was full of data.tables (no atomic):

  1. the old function
  2. the old function with lapply(listDT, data.table::as.data.table) commented out
  3. this version

(3) was twice as fast as (1), but (2) was twice as fast as (3). I suspect the difference is all the copies you're making in this chunk.

Author

Yep, I think these changes lost the speedup. The main speedup was from dropping data.table::as.data.table.

Author

Try out this update's speed. For what it's worth I tried out this function where I used the same merge as the original function instead of rep and it ended up being about 20% slower. I think that makes sense because I'm still using a data.table [. So it shouldn't copy the data any more than a data.table merge would copy the data. It's basically going through the same algorithm as merging in the data.table with .id being the same as times_to_replicate. Furthermore, I think it makes sense this way since we don't have to deal with creating a .id column which could cause a problem if chomped_df has a .id column.

Collaborator

Hey, thanks for your patience here. No idea why your build is failing. It may just be transient.

Just tested this update's speed on my test dataset again (which has 1000 data frames in the col_to_unpack, no atomic or lists or NAs). The numbers below are the median times each function took on my computer.

  • Old function: 153 ms
  • Old function with the lapply(listDT, data.table::as.data.table) commented out: 69 ms
  • Your first version: 97 ms
  • Your second version: 89 ms
  • A hybrid: 17 ms

For the hybrid, I built off your second version. I changed the top to:

inDT <- data.table::copy(chomped_df)
listDT <- inDT[[col_to_unpack]]
inDT[, (col_to_unpack) := NULL]
inDT[, .id := .I]

then added the idcol = TRUE back to rbindlist and merged after rbindlist like so:

outDT <- inDT[newDT, on = ".id"]
outDT[, .id := NULL]
return(outDT)

So I still think there's value in researching idcol as an option. If you're worried about the .id column restriction, we could have that join column be a random UUID or something for safety. You can, for example, use idcol = myUUIDvar instead of idcol = TRUE to automatically name that ID column.
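
Putting the two fragments above together, the hybrid might look roughly like the following sketch. The input table, its column names, and the use of `fill = TRUE` are assumptions for illustration; validation, NA handling, and logging from the real function are left out.

```r
library(data.table)

# Hypothetical input: two parent rows, each holding a nested data.table.
chomped_df <- data.table(doc = c("a", "b"))
chomped_df[, details := list(list(data.table(x = 1:2), data.table(x = 3L)))]
col_to_unpack <- "details"

# Copy so the caller's table is not modified by reference.
inDT <- data.table::copy(chomped_df)
listDT <- inDT[[col_to_unpack]]
inDT[, (col_to_unpack) := NULL]
inDT[, .id := .I]

# idcol = TRUE adds a ".id" column recording which list element
# each stacked row came from.
newDT <- data.table::rbindlist(listDT, fill = TRUE, idcol = TRUE)

# Join the parent columns back on, then drop the helper column.
outDT <- inDT[newDT, on = ".id"]
outDT[, .id := NULL]
```

In this sketch, a parent row whose nested entry has zero rows would not appear in `newDT` and so would be missing from `outDT`; presumably the empty-entry handling discussed earlier in the review is what covers that case.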

Collaborator

@austin3dickey austin3dickey left a comment

Lookin' good; I love the initiative to get rid of lapply(listDT, data.table::as.data.table). But I don't think we need to handle the case where there's a mixture of atomic vectors and data.frames in the list (feel free to disagree here). And there are a few more speedups you can achieve. Take a look whenever you can!

@austin3dickey
Collaborator

austin3dickey commented Apr 6, 2018

Whoops I commented on an outdated file just now.

@austin3dickey
Collaborator

For those keeping track at home, this final version outperforms the original function by 9x and tidyr::unnest by 2x!! (On my real-world test dataset)

MAJOR props to @wdearden . Thanks for the idea and execution.
