speed up unpack_nested_data #42 #51

wdearden · 2018-01-12T01:58:42Z

Reference #42

codecov-io · 2018-01-12T02:09:19Z

Codecov Report

Merging #51 into master will increase coverage by 0.82%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master      #51      +/-   ##
==========================================
+ Coverage   70.93%   71.75%   +0.82%     
==========================================
  Files           3        3              
  Lines         516      531      +15     
==========================================
+ Hits          366      381      +15     
  Misses        150      150

Impacted Files	Coverage Δ
R/elasticsearch_parsers.R	`76.74% <100%> (+0.81%)`	⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update b3cffef...721cd9c.

jameslamb

This looks great, @wdearden208. Please see the comments I left for some things to address. I will review again when you've addressed those.

Also, @austin3dickey I'm going to let you be the final judge on this one, since this function is your masterpiece and you know the nuances of dealing with nested ES results.

jameslamb · 2018-01-19T07:20:52Z

R/elasticsearch_parsers.R

+    # Create the unpacked data.table by replicating the originally unpacked
+    # columns by the number of rows in each entry in the original unpacked column
+    group_vars <- setdiff(names(chomped_df), c(names(newDT), col_to_unpack))
+    n <- pmax(purrr::map_int(listDT, NROW), 1)


jameslamb · 2018-01-19T07:22:26Z

R/elasticsearch_parsers.R

@@ -391,7 +391,8 @@ unpack_nested_data <- function(chomped_df, col_to_unpack) {
        futile.logger::flog.fatal(msg)
        stop(msg)
    }
-    if (!("character" %in% class(col_to_unpack)) || length(col_to_unpack) != 1) {
+    if (!("character" %in% class(col_to_unpack)) || length(col_to_unpack) != 
+        1) {


please put this 1) { back on the line above, so the line ends with the opening {

jameslamb · 2018-01-19T07:24:29Z

R/elasticsearch_parsers.R

-    # If we tried to unpack an empty column, fail
-    if (nrow(newDT) == 0) {
+    # Check for empty column
+    if (all(lengths(listDT) == 0)) {


TIL lengths()

out of curiosity, what does this new version catch that nrow(newDT) didn't? It's not obvious to me

This fails faster, I like it

Actually, I don't think this will fail if all the data.tables have 0 rows, just if they all have 0 columns. Was this intended?

Good point. I fixed it.
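For readers following this exchange: on a list of data frames, `lengths()` reports the number of columns of each element, not the number of rows, which is why a 0-row table that still has columns slips past a `lengths(listDT) == 0` check. A minimal base-R sketch follows. The toy `listDT` is made up for illustration, and the row-aware check at the end is one possible fix, not necessarily the one committed in this PR.

```r
# Toy input (hypothetical): two "empty" tables of different kinds
listDT <- list(
    data.frame(x = integer(0)),  # 0 rows, 1 column
    data.frame()                 # 0 rows, 0 columns
)

# lengths() counts columns for data frames
col_counts <- lengths(listDT)

# NROW() counts rows
row_counts <- vapply(listDT, NROW, integer(1))

# The column-based check misses the 0-row table; the row-based one does not
all_empty_by_length <- all(col_counts == 0)
all_empty_by_rows <- all(row_counts == 0)
```

Here `all_empty_by_length` is FALSE while `all_empty_by_rows` is TRUE, which matches the concern raised in the review.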

jameslamb · 2018-01-19T07:25:17Z

R/elasticsearch_parsers.R

        msg <- "The column given to unpack_nested_data had no data in it."
        futile.logger::flog.fatal(msg)
        stop(msg)
    }

-    # Fix the ID because we may have removed some empty elements due to that bug
-    newDT[, .id := oldIDs[.id]]
+    listDT[lengths(listDT) == 0] <- NA


oh maybe I'm getting this. Will this catch the cases where you have a cell with value list()? I like this

Yes it will
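A small base-R sketch of what that assignment does, using a made-up `listDT`: every zero-length entry, including a bare `list()`, is replaced by `NA` so that it still occupies one slot. Assigning `NULL` instead would delete the entries, which is shown for contrast.

```r
# Hypothetical list column: one real vector and two zero-length entries
listDT <- list(1:3, list(), character(0))

# Replace zero-length entries with NA; the list keeps its length
listDT[lengths(listDT) == 0] <- NA

# Contrast: assigning NULL removes the matching elements entirely
shrunk <- list(1:3, list(), character(0))
shrunk[lengths(shrunk) == 0] <- NULL
```

After this runs, `listDT` still has three elements (the last two are `NA`), while `shrunk` has only one.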

jameslamb · 2018-01-19T07:28:47Z

R/elasticsearch_parsers.R

+    } else if (all(is_atomic)) {
+        newDT <- data.table::as.data.table(unlist(listDT))
+    } else {
+        msg <- "For unpack_nested_data, col_to_unpack must be all atomic vectors or all data frames"


I'd like if we gave a more informative error message here. If I got this error as a user I wouldn't know what to do next.

Please do two things:

1. Write a unit test using expect_error() that reaches this line.

2. Propose an error message format that might be able to give a user a bit more guidance on where to go next.
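One way the two requests could look, as a rough sketch only. `check_unpackable()` is a hypothetical stand-in for the validation branch of `unpack_nested_data()`, the column name "details" is made up, and the message wording is a suggestion rather than what the package ships. The test assumes the testthat package is installed.

```r
# Hypothetical stand-in for the validation branch of unpack_nested_data()
check_unpackable <- function(listDT, col_to_unpack) {
    is_df <- vapply(listDT, is.data.frame, logical(1))
    is_atomic <- vapply(listDT, is.atomic, logical(1))
    bad <- which(!(is_df | is_atomic))
    if (length(bad) > 0) {
        # Name the column, the offending rows, and what was found there
        bad_classes <- unique(vapply(listDT[bad], function(x) class(x)[1], character(1)))
        msg <- paste0(
            "Each row in column '", col_to_unpack, "' must be a data frame or a vector. ",
            "Found unsupported class(es): ", paste(bad_classes, collapse = ", "),
            " in row(s) ", paste(bad, collapse = ", "), "."
        )
        stop(msg)
    }
    invisible(TRUE)
}

# Unit test that reaches the error branch
testthat::test_that("unsupported entries raise an informative error", {
    bad_input <- list(data.frame(x = 1), list(a = 1))
    testthat::expect_error(
        check_unpackable(bad_input, "details"),
        regexp = "must be a data frame or a vector"
    )
})
```

Reporting the row positions and classes gives the user something concrete to inspect in their own data before retrying.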

jameslamb · 2018-01-19T07:29:49Z

R/elasticsearch_parsers.R

+    # Create the unpacked data.table by replicating the originally unpacked
+    # columns by the number of rows in each entry in the original unpacked column
+    group_vars <- setdiff(names(chomped_df), c(names(newDT), col_to_unpack))
+    n <- pmax(purrr::map_int(listDT, NROW), 1)


can you please use a more descriptive variable name than n?

jameslamb · 2018-01-19T07:31:14Z

R/elasticsearch_parsers.R

+    # columns by the number of rows in each entry in the original unpacked column
+    group_vars <- setdiff(names(chomped_df), c(names(newDT), col_to_unpack))
+    n <- pmax(purrr::map_int(listDT, NROW), 1)
+    rest <- chomped_df[rep(1:nrow(chomped_df), n), ..group_vars, drop = FALSE]


it's not obvious to me, staring at this line, what it actually does. Can you please add a comment and choose a more informative name than rest?

what does ..group_vars do?

also what does drop = FALSE do?
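To make the questions above concrete, here is a small sketch with a made-up `chomped_df`. In data.table, a `..` prefix in `j` tells `[` to look up the variable in the calling scope and use its value as a vector of column names. As far as I recall from the data.table documentation, the `drop` argument exists only for data.frame compatibility and is not used, so it is left out here.

```r
library(data.table)

# Hypothetical input: two ordinary columns plus a list column to unpack
chomped_df <- data.table(id = c("a", "b", "c"), score = c(10, 20, 30))
chomped_df[, nested := list(list(1:2, integer(0), 7L))]

col_to_unpack <- "nested"
group_vars <- setdiff(names(chomped_df), col_to_unpack)

# One repetition per nested entry, with a floor of 1 so empty entries keep their row
times_to_replicate <- pmax(vapply(chomped_df[[col_to_unpack]], NROW, integer(1)), 1L)

# Row 1 is repeated twice, rows 2 and 3 once each
row_index <- rep(seq_len(nrow(chomped_df)), times_to_replicate)

# ..group_vars selects the columns whose names are stored in group_vars
replicated <- chomped_df[row_index, ..group_vars]
```

The result has four rows and only the `id` and `score` columns, ready to be bound column-wise to the unpacked values.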

jameslamb · 2018-01-27T23:55:50Z

@wdearden208 I am going to approve this PR, looks great! Appreciate all the effort you put into this

However, I'm not going to merge. @austin3dickey you have the final 👍 / 👎 on this one

jameslamb · 2018-01-28T16:05:03Z

@wdearden208 now that we merged that logging PR, you need to rebase this onto master

handle mixed atomic/data frame column and better explanations in comments

jameslamb · 2018-03-06T19:10:16Z

I just updated this! It put me as a non-author committer, so you will still get credit for the contribution

austin3dickey · 2018-03-08T14:23:46Z

R/elasticsearch_parsers.R

-    # Merge
-    outDT[, .id := .I]
-    outDT <- newDT[outDT, on = ".id"]
+    is_df <- purrr::map_lgl(listDT, is.data.frame)


Could you please add #' @importFrom purrr map_lgl to the Roxygen of this function to be explicit?

austin3dickey · 2018-03-08T15:34:56Z

R/elasticsearch_parsers.R

+    } else {
+        msg <- paste0("Each row in column ", col_to_unpack, " must be a data frame or a vector.")
+        futile.logger::flog.fatal(msg)
+        stop(msg)


You should use log_fatal here instead of these two lines

austin3dickey · 2018-03-08T15:37:49Z

R/elasticsearch_parsers.R

+
+        # Find column name to use for NA vectors
+        first_df <- min(which(is_df))
+        col_name <- names(listDT[[first_df]])[1]


Hmm, using the first column name here seems pretty arbitrary. I wonder if we should surface "which column should I put atomic data in, if it's a mixture of atomic and data.tables" to the user.

To that point, have you ever seen a payload that would result in listDT being a mixture of atomic and data.tables? I wonder if this whole section is too much complexity for a use case I've never seen.

Ok I dropped the logic for handling mixed atomic/data.table columns but I still needed to grab the column name to give NA entries. I got an error if I didn't because in the "handle NAs" test it created a second x column with all NA.

austin3dickey · 2018-03-08T15:43:57Z

R/elasticsearch_parsers.R

-    outDT <- outDT[, !c(".id", col_to_unpack), with = FALSE]
+    # Create the unpacked data.table by replicating the originally unpacked
+    # columns by the number of rows in each entry in the original unpacked column
+    times_to_replicate <- pmax(purrr::map_int(listDT, NROW), 1)


Why are you using pmax here? I thought purrr::map_int() is guaranteed to return one integer vector.

I need to use pmax for the cases where the entry in col_to_unpack is empty, so the number of rows is 0. I need to rep it one time because I don't want to drop that row.
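A short base-R illustration of that point, with a made-up `listDT`: `pmax()` takes an elementwise maximum, so it acts as a floor of 1 on each entry's row count. Without the floor, `rep()` silently drops any row whose nested entry is empty.

```r
# Hypothetical nested entries: a 2-row table, an empty table, and an NA
listDT <- list(data.frame(x = 1:2), data.frame(), NA)

row_counts <- vapply(listDT, NROW, integer(1))   # rows per entry
times_to_replicate <- pmax(row_counts, 1L)       # elementwise floor of 1

# Without the floor, the second row disappears from the index
without_floor <- rep(seq_along(listDT), row_counts)

# With the floor, every original row appears at least once
with_floor <- rep(seq_along(listDT), times_to_replicate)
```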

austin3dickey · 2018-03-08T15:54:02Z

R/elasticsearch_parsers.R

+    # columns by the number of rows in each entry in the original unpacked column
+    times_to_replicate <- pmax(purrr::map_int(listDT, NROW), 1)
+    # Replicate the rows of the data.table by entries of times_to_replicate but drop col_to_unpack
+    replicatedDT <- chomped_df[rep(1:nrow(chomped_df), times_to_replicate)]


Makes sense, though perhaps you should test out my former strategy of using the built-in idcol argument of rbindlist. That's essentially replicating the logic rep(1:nrow(chomped_df), times_to_replicate) but I suspect it will be faster since it's data.table haha

So I tried benchmarking a few versions of this function with a large listDT that was full of data.tables (no atomic):

(1) the old function

(2) the old function with lapply(listDT, data.table::as.data.table) commented out

(3) this version

(3) was twice as fast as (1), but (2) was twice as fast as (3). I suspect the difference is all the copies you're making in this chunk.

Yep, I think these changes lost the speedup. The main speedup was from dropping data.table::as.data.table.

Try out this update's speed. For what it's worth, I also tried a version that uses the same merge as the original function instead of rep, and it ended up about 20% slower. I think that makes sense: I'm still subsetting with data.table's [ operator, so it shouldn't copy the data any more than a data.table merge would. It's basically the same algorithm as merging in the data.table, with .id playing the role of times_to_replicate. Furthermore, I think this way is cleaner because we don't have to create a .id column, which could cause a problem if chomped_df already has a .id column.

Hey, thanks for your patience here. No idea why your build is failing. It may just be transient.

Just tested this update's speed on my test dataset again (which has 1000 data frames in the col_to_unpack, no atomic or lists or NAs). The numbers below are the median times each function took on my computer.

Old function: 153 ms

Old function with the lapply(listDT, data.table::as.data.table) commented out: 69 ms

Your first version: 97 ms

Your second version: 89 ms

A hybrid: 17 ms

For the hybrid, I built off your second version. I changed the top to:

inDT <- data.table::copy(chomped_df)
listDT <- inDT[[col_to_unpack]]
inDT[, (col_to_unpack) := NULL]
inDT[, .id := .I]

then added the idcol = TRUE back to rbindlist and merged after rbindlist like so:

outDT <- inDT[newDT, on = ".id"]
outDT[, .id := NULL]
return(outDT)

So I still think there's value in researching idcol as an option. If you're worried about the .id column restriction, we could have that join column be a random UUID or something for safety. You can, for example, use idcol = myUUIDvar instead of idcol = TRUE to automatically name that ID column.
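Putting the pieces of the hybrid together, here is a self-contained sketch on toy data. The input table and its column names are made up, the `rbindlist()` call with `fill = TRUE` is my assumption about the middle of the function, and none of the validation or NA handling is shown. It is meant to show the shape of the idcol-plus-join approach, not the merged implementation.

```r
library(data.table)

# Hypothetical input: an id column plus a list column of data.tables
chomped_df <- data.table(id = c("a", "b"))
chomped_df[, nested := list(list(data.table(x = 1:2), data.table(x = 3L)))]
col_to_unpack <- "nested"

# Top of the hybrid: copy, pull out the list column, and tag each row with an id
inDT <- copy(chomped_df)
listDT <- inDT[[col_to_unpack]]
inDT[, (col_to_unpack) := NULL]
inDT[, .id := .I]

# idcol = TRUE adds a .id column holding each table's position in the list
newDT <- rbindlist(listDT, fill = TRUE, idcol = TRUE)

# Join the unpacked rows back to their parent rows, then drop the helper column
outDT <- inDT[newDT, on = ".id"]
outDT[, .id := NULL]
```

Because the join is driven by `newDT`, a parent row whose nested entry contributes no rows would not appear in `outDT` in this sketch, so empty entries still need separate handling.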

austin3dickey

Lookin' good; I love the initiative to get rid of lapply(listDT, data.table::as.data.table). But I don't think we need to handle the case where there's a mixture of atomic vectors and data.frames in the list (feel free to disagree here). And there are a few more speedups you can achieve. Take a look whenever you can!

austin3dickey · 2018-04-06T19:27:35Z

R/elasticsearch_parsers.R

+    # columns by the number of rows in each entry in the original unpacked column
+    times_to_replicate <- pmax(purrr::map_int(listDT, NROW), 1)
+    # Replicate the rows of the data.table by entries of times_to_replicate but drop col_to_unpack
+    replicatedDT <- chomped_df[rep(1:nrow(chomped_df), times_to_replicate)]


austin3dickey · 2018-04-06T19:28:31Z

Whoops I commented on an outdated file just now.

austin3dickey · 2018-04-12T15:57:40Z

For those keeping track at home, this final version outperforms the original function by 9x and tidyr::unnest by 2x!! (On my real-world test dataset)

MAJOR props to @wdearden. Thanks for the idea and execution.

jameslamb requested review from ngparas, jameslamb and austin3dickey January 12, 2018 05:44

jameslamb requested changes Jan 19, 2018

jameslamb approved these changes Jan 27, 2018

wdearden force-pushed the feature/speedUpUnpack branch from c14c972 to 4f6cfbc Compare February 5, 2018 20:49

speed up unpack_nested_data uptake#42

ada870c

handle mixed atomic/data frame column and better explanations in comments

jameslamb force-pushed the feature/speedUpUnpack branch from 4f6cfbc to ada870c Compare March 6, 2018 19:09

austin3dickey reviewed Mar 8, 2018

austin3dickey requested changes Mar 8, 2018

wdearden force-pushed the feature/speedUpUnpack branch from bbd0997 to 20410ca Compare March 13, 2018 01:57

Remove handling of mixed atomic/data.table columns

f9b4516

wdearden force-pushed the feature/speedUpUnpack branch from 20410ca to f9b4516 Compare March 13, 2018 16:03

austin3dickey requested changes Apr 6, 2018

Made it a merge instead of rep

721cd9c

austin3dickey approved these changes Apr 12, 2018

austin3dickey merged commit 2479d26 into uptake:master Apr 12, 2018

jameslamb mentioned this pull request Apr 15, 2018

unpack_nested_data is slower than tidyr::unnest #42

Closed

jameslamb mentioned this pull request Jun 18, 2018

performance benchmarks #74

Open
