map_df fails on mapping functions that generate single rows #179

Closed
jebyrnes opened this issue Feb 25, 2016 · 14 comments

Comments

@jebyrnes commented Feb 25, 2016

If the mapped function returns a single row of output (for example, the named vector from colMeans), map_df fails to build a data frame. One can force it to work by wrapping the transposed output in a data frame, but that seems like an unnecessary PITA.

While there may be a different function to use, this strikes me as behavior that is not quite sensible.

Example:

library(purrr)

alist <- list(
  data.frame(a = 1:10, b = 1:10),
  data.frame(a = 21:30, b = 31:40)
)

returns normally

map_df(alist, .f = function(x) x+1)

returns Error: cannot convert object to a data frame

map_df(alist, .f = colMeans)

returns with the proper output

map_df(alist, function(x) data.frame(t(colMeans(x))))
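
The failure appears to come down to what colMeans returns: a named numeric vector rather than a data frame, so there is nothing row-shaped to bind. Below is a minimal sketch of the same workaround in base R only, with no purrr or dplyr assumed. The names `means`, `rows`, and `result` are illustrative and not from the thread.

```r
alist <- list(
  data.frame(a = 1:10, b = 1:10),
  data.frame(a = 21:30, b = 31:40)
)

# colMeans() returns a named numeric vector, not a data frame
means <- colMeans(alist[[1]])
is.data.frame(means)  # FALSE
names(means)          # "a" "b"

# Converting the vector to a list first gives one column per name,
# so as.data.frame() produces a single-row data frame
rows <- lapply(alist, function(x) as.data.frame(as.list(colMeans(x))))

# Stack the one-row data frames into a 2 x 2 result
result <- do.call(rbind, rows)
result
```

The `as.list()` step is what makes each result unambiguous: a named list maps one name to one column, whereas a bare vector could be read as either a row or a column.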

@artemklevtsov (Contributor) commented Feb 27, 2016

Related to this issue: tidyverse/dplyr#1450.

@jennybc (Member) commented Feb 27, 2016

More threads on the general topic of: "it can be hard to glue conformable things together row-wise": tidyverse/dplyr#1104, #112. I must admit I haven't really explored the slice functions but the row collation sounds intriguing.

Here's another purrr solution but it also requires explicit transpose:

library(purrr)

alist <- list(
  data.frame(a = 1:2, b = 6:7),
  data.frame(a = 3:5, b = 8:10)
)

alist %>% 
  transpose() %>% 
  map_df(. %>% map_dbl(mean))
#> Source: local data frame [2 x 2]
#> 
#>       a     b
#>   (dbl) (dbl)
#> 1   1.5   6.5
#> 2   4.0   9.0

@artemklevtsov (Contributor) commented Feb 27, 2016

Coercing the colMeans output to a list is fast and simple:

map_df(alist, ~as.list(colMeans(.x)), .id = "df")
#> Source: local data frame [2 x 3]
#> 
#>      df     a     b
#>   (chr) (dbl) (dbl)
#> 1     1   5.5   5.5
#> 2     2  25.5  35.5

@jennybc your example does not work for me:

alist %>% transpose() %>% map_df(. %>% map_dbl(mean))
#> Rcpp::exception in 'eval(expr, envir, enclos)':
#>   cannot convert object to a data frame

map_dbl returns a numeric vector, not a data frame.

@jennybc (Member) commented Feb 28, 2016

@artemklevtsov Hmm... I just successfully ran my example again in a clean R session. FWIW I've got purrr installed from 9312764. Is your version more or less recent?

@artemklevtsov (Contributor) commented Feb 28, 2016

@jennybc I used the stable version. Anyway, transpose plus mean() is not the most efficient solution.

UPD: the git version gives the same error. This is equivalent to:

alist %>% transpose() %>% map(. %>% map_dbl(mean)) %>% dplyr::bind_rows()

But bind_rows doesn't support atomic vectors.

UPD2: it works with the development version of dplyr.

@jennybc (Member) commented Feb 28, 2016

@artemklevtsov Well there still must be some version mismatch between us, because it works for me. I'm not making it up!

@lionel- (Member) commented Feb 29, 2016

The thing is that map_df() is for collating dataframeable objects into a single data frame. What is a dataframeable object is defined in dplyr's bind_rows() and bind_cols() functions, and a vector is currently not part of that set.

Adirectional vectors are normally ambiguous as to their nature: they can be read as either a row vector or a column vector. It looks like the development version of dplyr considers them to be column vectors:

row_vectors <- list(
  c(a = 1, b = 2),
  c(a = 3, b = 4)
)

col_vectors <- list(
  a = c(1, 2),
  b = c(3, 4)
)
bind_rows(row_vectors)
#> Error: cannot convert object to a data frame

bind_rows(col_vectors)
#> Source: local data frame [2 x 2]
#>
#>       a     b
#>   (dbl) (dbl)
#> 1     1     3
#> 2     2     4

This is because a named list of vectors is a dataframeable object. I think that behaviour with vectors is a bit weird because bind_rows() effectively binds vectors on columns and not on rows. It'd probably make more sense to go the rbind() / cbind() way and consider vectors as row or col vectors depending on the direction of the binding. But then there'd be an ambiguity regarding the automatic coercion of lists to data frames... In any case, that's a dplyr issue and not a purrr one.
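
For comparison, here is the rbind() / cbind() convention mentioned above, sketched in base R only. The same named vectors are treated as rows or as columns depending on the direction of the binding. The variable names are illustrative.

```r
v1 <- c(a = 1, b = 2)
v2 <- c(a = 3, b = 4)

# rbind() treats each vector as a row; the vector names become column names
by_row <- rbind(v1, v2)
colnames(by_row)  # "a" "b"

# cbind() treats each vector as a column; the vector names become row names
by_col <- cbind(v1, v2)
rownames(by_col)  # "a" "b"

# Both results are matrices, not data frames
is.matrix(by_row)
is.matrix(by_col)
```

Note that base R returns matrices here, so this only illustrates the direction convention and says nothing about how dplyr would coerce the result to a data frame.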

@jennybc wrote: "Here's another purrr solution but it also requires explicit transpose"

That's a case where it makes sense to use dmap(). It coerces the results to a data frame with dplyr::as_data_frame(). This one works with the release version of dplyr:

alist %>% map_df(. %>% dmap(mean))
#> Source: local data frame [2 x 2]
#>
#>       a     b
#>   (dbl) (dbl)
#> 1   1.5   6.5
#> 2   4.0   9.0

@jebyrnes (Author) commented Feb 29, 2016

@lionel- I think the dmap solution is the one that wins here, although the others are great. But this makes the most sense in terms of generalizing beyond colMeans to other functions.

I see the row vs. column vector conformability argument. It just seemed sensible that if a data frame came in and a single row was produced, a single row of a data frame would come out. But I recognize that's making a lot of assumptions that might not be valid, even with map_df.

Thanks for a great discussion, all! This was illuminating.

@artemklevtsov (Contributor) commented Feb 29, 2016

Final note about performance:

library(purrr)
library(microbenchmark)
list_dfs <- lapply(1:100, function(...) as.data.frame(replicate(10, runif(1000))))
microbenchmark(
    map_df(list_dfs, ~as.list(colMeans(.x))),
    map_df(transpose(list_dfs), . %>% map_dbl(mean)),
    map_df(list_dfs, . %>% dmap(mean)))
#> Unit: milliseconds
#>                                              expr       min        lq      mean   median        uq       max neval cld
#>          map_df(list_dfs, ~as.list(colMeans(.x))) 10.465237 10.879279 12.754523 12.43822 12.870534 21.255816   100  b 
#>  map_df(transpose(list_dfs), . %>% map_dbl(mean))  4.604981  4.712693  5.476855  4.79528  5.652353  9.876513   100 a  
#>                map_df(list_dfs, . %>% dmap(mean)) 20.829029 21.324822 23.918506 21.57097 23.288950 97.466169   100   c

To improve on @jennybc's solution:

mean2 <- function(x) sum(x) / length(x)
microbenchmark(
    map_df(transpose(list_dfs), . %>% map_dbl(mean)),
    map_df(transpose(list_dfs), . %>% map_dbl(mean2)))
#> Unit: milliseconds
#>                                               expr      min       lq     mean   median       uq      max neval cld
#>   map_df(transpose(list_dfs), . %>% map_dbl(mean)) 5.268310 5.326473 5.797630 5.384035 5.540218 8.678301   100   b
#>  map_df(transpose(list_dfs), . %>% map_dbl(mean2)) 2.214055 2.266207 2.495217 2.317600 2.366828 4.773380   100  a 
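
One caveat on mean2, sketched below with made-up data: it agrees with mean() on complete numeric input, but it has no na.rm argument, so missing values propagate. Whether the speedup matters depends on the workload; the timings above are the only measurements in this thread.

```r
mean2 <- function(x) sum(x) / length(x)

set.seed(1)
x <- runif(1000)

# Same result as mean() up to floating-point tolerance
agree <- isTRUE(all.equal(mean(x), mean2(x)))
agree

# mean2() has no na.rm argument, so an NA in the input gives NA
with_na <- c(1, NA, 3)
mean2(with_na)               # NA
mean(with_na, na.rm = TRUE)  # 2
```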

@jennybc (Member) commented Feb 29, 2016

This is educational! I feel like I still struggle with row binding with dplyr + purrr in situations where plyr makes it easy to express myself.

@lionel- How would you do this with purrr?

x <- dplyr::data_frame(int = 1:3,
                       let = letters[int],
                       fac = factor(let),
                       dbl = int + 0.1)
plyr::ldply(x, class, .id = "var_name")
#>   var_name        V1
#> 1      int   integer
#> 2      let character
#> 3      fac    factor
#> 4      dbl   numeric

@lionel- (Member) commented Feb 29, 2016

@jennybc wrote: "@lionel- How would you do this with purrr?"

hmm I'd probably use tidyr:

dmap(x, class) %>% tidyr::gather()
#> Source: local data frame [4 x 2]
#>
#>     key     value
#>   (chr)     (chr)
#> 1   int   integer
#> 2   let character
#> 3   fac    factor
#> 4   dbl   numeric
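
For comparison, a base R sketch of the same column-class table. It assumes every column has exactly one class, since vapply() will raise an error on columns that carry more than one, such as date-times. The names `classes` and `class_tbl` are illustrative.

```r
x <- data.frame(
  int = 1:3,
  let = letters[1:3],
  fac = factor(letters[1:3]),
  dbl = 1:3 + 0.1,
  stringsAsFactors = FALSE
)

# One class string per column, checked to be length 1
classes <- vapply(x, class, character(1))

# Long format: one row per column of x
class_tbl <- data.frame(
  var_name = names(classes),
  class = unname(classes),
  stringsAsFactors = FALSE
)
class_tbl
```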

@pgensler commented May 16, 2017

Is this a good example of where one needs to use dmap, and NOT map_df? I have URLs I want to scrape, which are in a data frame, and when I try to scrape them, map_df fails because of bind_rows:

pacman::p_load("httr","magrittr", "dplyr", "purrr","rvest")
get_page <- function(i=1, pb=NULL){
  if (!is.null(pb)) pb$tick()$print()
  result = POST(data$URL[[i]])
  stop_for_status(result)
  A = content(result, as="parsed", encoding = "iso-8859-1")
  
  #Assign output to table$link new col
  #data$SCRAPED_NAME[[i]] <- 
A %>% 
  html_nodes(css ="#_brand4 span") %>%
  html_text()
  i <- i+1
}

data = data.frame(
  "URL" = c("https://www.ratebeer.com/beer/8481","https://www.ratebeer.com/beer/3228/"),
  "SCRAPED_NAME" = NA, stringsAsFactors = FALSE
)
debug(get_page)
finaldf = map_df(1:(length(data$URL)),get_page)
#> debugging in: .f(.x[[i]], ...)
#> debug at <text>#2: {
#>     if (!is.null(pb)) 
#>         pb$tick()$print()
#>     result = POST(data$URL[[i]])
#>     stop_for_status(result)
#>     A = content(result, as = "parsed", encoding = "iso-8859-1")
#>     A %>% html_nodes(css = "#_brand4 span") %>% html_text()
#>     i <- i + 1
#> }
#> debug at <text>#3: if (!is.null(pb)) pb$tick()$print()
#> debug at <text>#4: result = POST(data$URL[[i]])
#> debug at <text>#5: stop_for_status(result)
#> debug at <text>#6: A = content(result, as = "parsed", encoding = "iso-8859-1")
#> debug at <text>#10: A %>% html_nodes(css = "#_brand4 span") %>% html_text()
#> debug at <text>#13: i <- i + 1
#> exiting from: .f(.x[[i]], ...)
#> debugging in: .f(.x[[i]], ...)
#> debug at <text>#2: {
#>     if (!is.null(pb)) 
#>         pb$tick()$print()
#>     result = POST(data$URL[[i]])
#>     stop_for_status(result)
#>     A = content(result, as = "parsed", encoding = "iso-8859-1")
#>     A %>% html_nodes(css = "#_brand4 span") %>% html_text()
#>     i <- i + 1
#> }
#> debug at <text>#3: if (!is.null(pb)) pb$tick()$print()
#> debug at <text>#4: result = POST(data$URL[[i]])
#> debug at <text>#5: stop_for_status(result)
#> debug at <text>#6: A = content(result, as = "parsed", encoding = "iso-8859-1")
#> debug at <text>#10: A %>% html_nodes(css = "#_brand4 span") %>% html_text()
#> debug at <text>#13: i <- i + 1
#> exiting from: .f(.x[[i]], ...)
#> Error in bind_rows_(x, .id): cannot convert object to a data frame

@lionel- (Member) commented May 16, 2017

It's better if you create a minimal reprex rather than a complex example. I think this should now work if you install the dev version of dplyr.
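
A reduced sketch of what may be going on in the example above, with no network access. An R function returns the value of its last expression, so a body that ends in `i <- i + 1` returns that number and discards the scraped text. `f_bad` and `f_row` are hypothetical stand-ins, not the poster's code, and the letter lookup replaces the HTTP request.

```r
# Stand-in for the scraper: the useful value is computed, then thrown away
f_bad <- function(i) {
  toupper(letters[i])  # result is discarded
  i <- i + 1           # last expression, so this is the return value
}

# Returning a one-row data frame gives the binding step something to stack
f_row <- function(i) {
  data.frame(index = i, name = toupper(letters[i]), stringsAsFactors = FALSE)
}

bad <- lapply(1:2, f_bad)                   # the numbers 2 and 3, not the letters
good <- do.call(rbind, lapply(1:2, f_row))  # two rows, one per input
good
```

Making each call return a data frame (or a named list) sidesteps the question of how a bare vector should be bound, whichever version of dplyr is installed.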

@DataStrategist commented Jun 24, 2017

Sorry for thread-rezzing, but I'd like to add my own explorations. Maybe purrr should not depend on bind_rows? At the very least, map_df could give a more helpful error message, as dmap does, which at least helps users figure out what they could do to get a df:

library(dplyr)
library(purrr)

a <- list(c(45, 108, 29, 65, 94, 67, 53, 107, 114, 218), 
          c(114, 88,79, 126, 71, 105, 44, 119, 224, 176))

## What I want, but in a df
a %>% map(head) 
#> [[1]]
#> [1]  45 108  29  65  94  67
#> 
#> [[2]]
#> [1] 114  88  79 126  71 105

## This DOESNT work...
a %>% map_df(head) 
#> Error in bind_rows_(x, .id): cannot convert object to a data frame

## but if I set names it does
a %>% set_names(1:length(a)) %>% map_df(head) 
#> # A tibble: 6 × 2
#>     `1`   `2`
#>   <dbl> <dbl>
#> 1    45   114
#> 2   108    88
#> 3    29    79
#> 4    65   126
#> 5    94    71
#> 6    67   105

## Also ok
a %>% map(head) %>% unlist %>% matrix(ncol=length(a),byrow = T) %>% as.data.frame 
#>    V1  V2
#> 1  45 108
#> 2  29  65
#> 3  94  67
#> 4 114  88
#> 5  79 126
#> 6  71 105

## Doesn't work... needs names
a %>% dmap(head) 
#> Error: Each variable must be named.
#> Problem variables: 1, 2

## Works
a %>% set_names(1:length(a)) %>% dmap(head) 
#> # A tibble: 6 × 2
#>     `1`   `2`
#>   <dbl> <dbl>
#> 1    45   114
#> 2   108    88
#> 3    29    79
#> 4    65   126
#> 5    94    71
#> 6    67   105
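
A base R sketch of the same reshaping, assuming only that every element has the same length after head(). It also shows why the matrix() variant above is not equivalent to the others: with byrow = T the values are filled across rows, so each column mixes both input vectors, as its printed output shows. The column names V1 and V2 are chosen here for illustration.

```r
a <- list(c(45, 108, 29, 65, 94, 67, 53, 107, 114, 218),
          c(114, 88, 79, 126, 71, 105, 44, 119, 224, 176))

heads <- lapply(a, head)

# A named list of equal-length vectors converts directly to a data frame
names(heads) <- paste0("V", seq_along(heads))
df <- as.data.frame(heads)
df

# Column-wise fill (the default) keeps each input vector in its own column
m_col <- matrix(unlist(heads), ncol = length(heads))

# Row-wise fill interleaves the two vectors across the columns
m_row <- matrix(unlist(heads), ncol = length(heads), byrow = TRUE)
```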
