unnest removes rsample attributes #688

Closed
IndrajeetPatil opened this issue Jul 24, 2019 · 8 comments
Labels: feature, rectangling 🗄️, vctrs ↗️

IndrajeetPatil commented Jul 24, 2019

I am not sure if this is intended behavior, but the rsample + tidyr::unnest() workflow doesn't seem to work.

Let's say I want to calculate bootstrap confidence intervals for an lm model using rsample.

Without unnesting the list column, everything works as expected:

library(tidyverse)
library(rsample)

set.seed(123)
(df_nested <-
  rsample::bootstraps(
    data = iris,
    times = 100,
    apparent = TRUE
  ) %>% 
  dplyr::mutate(
    .data = .,
    results = purrr::map(
      .x = splits,
      .f = ~ broom::tidy(stats::lm(formula = Sepal.Length ~ Species, data = .))
    )
  ) )
#> # Bootstrap sampling with apparent sample 
#> # A tibble: 101 x 3
#>    splits           id           results         
#>  * <list>           <chr>        <list>          
#>  1 <split [150/56]> Bootstrap001 <tibble [3 x 5]>
#>  2 <split [150/58]> Bootstrap002 <tibble [3 x 5]>
#>  3 <split [150/50]> Bootstrap003 <tibble [3 x 5]>
#>  4 <split [150/58]> Bootstrap004 <tibble [3 x 5]>
#>  5 <split [150/56]> Bootstrap005 <tibble [3 x 5]>
#>  6 <split [150/55]> Bootstrap006 <tibble [3 x 5]>
#>  7 <split [150/56]> Bootstrap007 <tibble [3 x 5]>
#>  8 <split [150/52]> Bootstrap008 <tibble [3 x 5]>
#>  9 <split [150/60]> Bootstrap009 <tibble [3 x 5]>
#> 10 <split [150/55]> Bootstrap010 <tibble [3 x 5]>
#> # ... with 91 more rows

class(df_nested)
#> [1] "bootstraps" "rset"       "tbl_df"     "tbl"        "data.frame"

rsample::int_pctl(df_nested, results)
#> Warning: Recommend at least 1000 non-missing bootstrap resamples for terms:
#> `(Intercept)`, `Speciesversicolor`, `Speciesvirginica`.
#> # A tibble: 3 x 6
#>   term              .lower .estimate .upper .alpha .method   
#>   <chr>              <dbl>     <dbl>  <dbl>  <dbl> <chr>     
#> 1 (Intercept)        4.91      5.00    5.10   0.05 percentile
#> 2 Speciesversicolor  0.788     0.937   1.09   0.05 percentile
#> 3 Speciesvirginica   1.41      1.58    1.73   0.05 percentile

But if I unnest the tidy output from broom while preserving everything else from rsample, the rset class is dropped:

(df_unnested <- df_nested %>%
  tidyr::unnest(data = ., results, .drop = FALSE, .preserve = "results"))
#> # A tibble: 303 x 8
#>    splits   id      results  term    estimate std.error statistic   p.value
#>    <list>   <chr>   <list>   <chr>      <dbl>     <dbl>     <dbl>     <dbl>
#>  1 <split ~ Bootst~ <tibble~ (Inter~    4.96     0.0742     66.9  6.11e-112
#>  2 <split ~ Bootst~ <tibble~ Specie~    0.968    0.104       9.35 1.38e- 16
#>  3 <split ~ Bootst~ <tibble~ Specie~    1.64     0.112      14.6  2.17e- 30
#>  4 <split ~ Bootst~ <tibble~ (Inter~    4.91     0.0789     62.3  1.56e-107
#>  5 <split ~ Bootst~ <tibble~ Specie~    1.08     0.111       9.78 1.05e- 17
#>  6 <split ~ Bootst~ <tibble~ Specie~    1.67     0.105      15.9  8.11e- 34
#>  7 <split ~ Bootst~ <tibble~ (Inter~    5.03     0.0752     66.9  5.45e-112
#>  8 <split ~ Bootst~ <tibble~ Specie~    0.979    0.109       8.96 1.35e- 15
#>  9 <split ~ Bootst~ <tibble~ Specie~    1.63     0.104      15.7  3.87e- 33
#> 10 <split ~ Bootst~ <tibble~ (Inter~    5.06     0.0737     68.6  1.55e-113
#> # ... with 293 more rows

class(df_unnested)
#> [1] "tbl_df"     "tbl"        "data.frame"

rsample::int_pctl(df_unnested, results)
#> Error: `.data` should be an `rset` object generated from `bootstraps()`

Is unnest() supposed to drop every class other than "tbl_df", "tbl", and "data.frame"?

Created on 2019-07-24 by the reprex package (v0.3.0)

Session info
devtools::session_info()
#> - Session info ----------------------------------------------------------
#>  setting  value                       
#>  version  R version 3.6.1 (2019-07-05)
#>  os       Windows 10 x64              
#>  system   x86_64, mingw32             
#>  ui       RTerm                       
#>  language (EN)                        
#>  collate  English_United States.1252  
#>  ctype    English_United States.1252  
#>  tz       America/New_York            
#>  date     2019-07-24                  
#> 
#> - Packages --------------------------------------------------------------
#>  package     * version    date       lib
#>  assertthat    0.2.1      2019-03-21 [1]
#>  backports     1.1.4      2019-04-10 [1]
#>  broom         0.5.2.9001 2019-06-26 [1]
#>  callr         3.3.1      2019-07-18 [1]
#>  cellranger    1.1.0      2016-07-27 [1]
#>  cli           1.1.0      2019-03-19 [1]
#>  codetools     0.2-16     2018-12-24 [1]
#>  colorspace    1.4-1      2019-03-18 [1]
#>  crayon        1.3.4      2017-09-16 [1]
#>  desc          1.2.0      2019-04-03 [1]
#>  devtools      2.0.1      2018-10-26 [1]
#>  digest        0.6.20     2019-07-04 [1]
#>  dplyr       * 0.8.3      2019-07-04 [1]
#>  evaluate      0.14       2019-05-28 [1]
#>  fansi         0.4.0      2018-11-05 [1]
#>  forcats     * 0.4.0      2019-02-17 [1]
#>  fs            1.3.1      2019-05-06 [1]
#>  furrr         0.1.0      2018-05-16 [1]
#>  future        1.14.0     2019-07-02 [1]
#>  generics      0.0.2      2019-03-05 [1]
#>  ggplot2     * 3.2.0.9000 2019-06-05 [1]
#>  globals       0.12.4     2018-10-11 [1]
#>  glue          1.3.1      2019-03-12 [1]
#>  gtable        0.3.0      2019-03-25 [1]
#>  haven         2.1.1      2019-07-04 [1]
#>  highr         0.8        2019-03-20 [1]
#>  hms           0.5.0      2019-07-09 [1]
#>  htmltools     0.3.6      2017-04-28 [1]
#>  httr          1.4.0      2018-12-11 [1]
#>  jsonlite      1.6        2018-12-07 [1]
#>  knitr         1.23       2019-05-18 [1]
#>  lazyeval      0.2.2      2019-03-15 [1]
#>  listenv       0.7.0      2018-01-21 [1]
#>  lubridate     1.7.4      2018-04-11 [1]
#>  magrittr      1.5        2014-11-22 [1]
#>  memoise       1.1.0      2017-04-21 [1]
#>  modelr        0.1.4      2019-02-18 [1]
#>  munsell       0.5.0      2018-06-12 [1]
#>  pillar        1.4.2      2019-06-29 [1]
#>  pkgbuild      1.0.3      2019-03-20 [1]
#>  pkgconfig     2.0.2      2018-08-16 [1]
#>  pkgload       1.0.2      2018-10-29 [1]
#>  prettyunits   1.0.2      2015-07-13 [1]
#>  processx      3.4.1      2019-07-18 [1]
#>  ps            1.3.0      2018-12-21 [1]
#>  purrr       * 0.3.2      2019-03-15 [1]
#>  R6            2.4.0      2019-02-14 [1]
#>  Rcpp          1.0.1      2019-03-17 [1]
#>  readr       * 1.3.1      2018-12-21 [1]
#>  readxl        1.3.1      2019-03-13 [1]
#>  remotes       2.1.0      2019-06-24 [1]
#>  rlang         0.4.0      2019-06-25 [1]
#>  rmarkdown     1.14       2019-07-12 [1]
#>  rprojroot     1.3-2      2018-01-03 [1]
#>  rsample     * 0.0.5      2019-07-12 [1]
#>  rvest         0.3.4      2019-05-15 [1]
#>  scales        1.0.0      2018-08-09 [1]
#>  sessioninfo   1.1.1      2018-11-05 [1]
#>  stringi       1.4.3      2019-03-12 [1]
#>  stringr     * 1.4.0      2019-02-10 [1]
#>  testthat      2.2.0      2019-07-22 [1]
#>  tibble      * 2.1.3      2019-06-06 [1]
#>  tidyr       * 0.8.3      2019-03-01 [1]
#>  tidyselect    0.2.5      2018-10-11 [1]
#>  tidyverse   * 1.2.1      2017-11-14 [1]
#>  usethis       1.5.1      2019-07-04 [1]
#>  utf8          1.1.4      2018-05-24 [1]
#>  vctrs         0.2.0      2019-07-05 [1]
#>  withr         2.1.2      2018-03-15 [1]
#>  xfun          0.8        2019-06-25 [1]
#>  xml2          1.2.0      2018-01-24 [1]
#>  yaml          2.2.0      2018-07-25 [1]
#>  zeallot       0.1.0      2018-01-28 [1]
#>  source                            
#>  CRAN (R 3.6.0)                    
#>  CRAN (R 3.6.0)                    
#>  local                             
#>  CRAN (R 3.6.1)                    
#>  CRAN (R 3.5.1)                    
#>  CRAN (R 3.6.0)                    
#>  CRAN (R 3.5.2)                    
#>  CRAN (R 3.6.0)                    
#>  CRAN (R 3.5.1)                    
#>  Github (r-lib/desc@c860e7b)       
#>  CRAN (R 3.6.1)                    
#>  CRAN (R 3.6.0)                    
#>  CRAN (R 3.6.0)                    
#>  CRAN (R 3.6.0)                    
#>  Github (brodieG/fansi@ab11e9c)    
#>  CRAN (R 3.5.2)                    
#>  CRAN (R 3.6.0)                    
#>  CRAN (R 3.5.1)                    
#>  CRAN (R 3.6.0)                    
#>  Github (r-lib/generics@c15ac43)   
#>  Github (tidyverse/ggplot2@b560662)
#>  CRAN (R 3.5.1)                    
#>  CRAN (R 3.6.0)                    
#>  CRAN (R 3.6.0)                    
#>  CRAN (R 3.6.0)                    
#>  CRAN (R 3.6.0)                    
#>  CRAN (R 3.6.1)                    
#>  CRAN (R 3.5.1)                    
#>  CRAN (R 3.6.0)                    
#>  CRAN (R 3.6.0)                    
#>  CRAN (R 3.6.0)                    
#>  CRAN (R 3.6.0)                    
#>  CRAN (R 3.5.1)                    
#>  CRAN (R 3.5.1)                    
#>  CRAN (R 3.5.1)                    
#>  CRAN (R 3.6.0)                    
#>  CRAN (R 3.6.0)                    
#>  CRAN (R 3.5.1)                    
#>  CRAN (R 3.6.0)                    
#>  CRAN (R 3.6.0)                    
#>  CRAN (R 3.5.1)                    
#>  CRAN (R 3.6.0)                    
#>  CRAN (R 3.5.1)                    
#>  CRAN (R 3.6.1)                    
#>  CRAN (R 3.6.0)                    
#>  CRAN (R 3.6.0)                    
#>  CRAN (R 3.6.0)                    
#>  CRAN (R 3.6.0)                    
#>  CRAN (R 3.6.0)                    
#>  CRAN (R 3.6.0)                    
#>  CRAN (R 3.6.0)                    
#>  CRAN (R 3.6.0)                    
#>  CRAN (R 3.6.1)                    
#>  CRAN (R 3.5.1)                    
#>  CRAN (R 3.6.1)                    
#>  CRAN (R 3.6.0)                    
#>  CRAN (R 3.5.1)                    
#>  CRAN (R 3.6.0)                    
#>  CRAN (R 3.6.0)                    
#>  CRAN (R 3.6.0)                    
#>  CRAN (R 3.6.1)                    
#>  CRAN (R 3.6.0)                    
#>  CRAN (R 3.6.1)                    
#>  CRAN (R 3.5.1)                    
#>  CRAN (R 3.6.0)                    
#>  CRAN (R 3.6.0)                    
#>  CRAN (R 3.5.1)                    
#>  CRAN (R 3.6.1)                    
#>  CRAN (R 3.5.1)                    
#>  CRAN (R 3.6.0)                    
#>  CRAN (R 3.5.1)                    
#>  CRAN (R 3.5.1)                    
#>  CRAN (R 3.5.1)                    
#> 
#> [1] C:/Users/inp099/Documents/R/win-library/3.6
#> [2] C:/Program Files/R/R-3.6.1/library

DavisVaughan (Member) commented Jul 24, 2019

I haven't fully thought it through, but what would be the implications of calling vec_restore() from reconstruct_tibble() rather than as_tibble() in the second branch of the if statement? That preserves the rset class. One test breaks, but I think that is related to a vctrs issue (r-lib/vctrs#503):

test_that("never has row names (#305)", {

hadley (Member) commented Jul 25, 2019

I was planning on waiting for r-lib/vctrs#211, rather than continuing to build up hacks in tidyr.

hadley added the feature and rectangling 🗄️ labels Sep 7, 2019
hadley added this to the v1.1.0 milestone Nov 28, 2019

hadley (Member) commented Nov 28, 2019

@DavisVaughan can you please follow up (from the rsample side) on the comments in #812?

hadley (Member) commented Apr 23, 2020

@DavisVaughan can you work around this without any tidyr changes? It might make more sense for tidyr to use the dplyr generics, which would mean we couldn't fix in this release.

hadley (Member) commented Apr 24, 2020

Actually, I'm confident that you can — if you think preserving these attributes is important, you can just provide an unnest() method. In the long run, hopefully we can resolve this without needing explicit methods, but I don't think this needs to block the tidyr release.

hadley closed this as completed Apr 24, 2020

DavisVaughan (Member) commented Apr 24, 2020

Yea I'll do it today

topepo (Member) commented Apr 28, 2020

After looking at this, IMO unnested rset objects should be tibbles and lose their attributes. rset objects are not meant to be unnested, and keeping the rset class after unnesting would break code where people use nrow(rset) to determine how many resamples were created.

@IndrajeetPatil The int_* functions are meant to consume a list column of tibbles (if they are in the broom::tidy format). Your first code block works as intended.
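
A minimal sketch of that advice, assuming the same iris model as the reprex above: keep the nested rset for `int_pctl()` and unnest a separate copy when a flat table of coefficients is wanted. The helper `fit_split()` and the object names are hypothetical, introduced only for illustration.

```r
library(dplyr)
library(purrr)
library(tidyr)
library(rsample)

set.seed(123)

# Hypothetical helper: fit the model on the analysis set of one split
# and return the coefficients in broom::tidy() format.
fit_split <- function(split) {
  broom::tidy(stats::lm(Sepal.Length ~ Species, data = rsample::analysis(split)))
}

# Nested rset: `results` is a list column of tidy tibbles.
boots <- rsample::bootstraps(iris, times = 100, apparent = TRUE) %>%
  dplyr::mutate(results = purrr::map(splits, fit_split))

# Intervals are computed from the nested object.
ci <- rsample::int_pctl(boots, results)

# For inspection, unnest a separate copy; `boots` itself is left untouched.
coefs <- boots %>%
  dplyr::select(id, results) %>%
  tidyr::unnest(results)

nrow(boots)  # one row per resample (100 bootstraps plus the apparent sample)
nrow(coefs)  # one row per resample per model term
```

Because `boots` is never unnested, `nrow(boots)` still counts resamples, which is the property the comment above is concerned about.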

austinwpearce commented Jul 29, 2021

I agree that unnest removes the rset attributes. The question then becomes how to accomplish what the first block of code does, but for multiple groups within a dataset. If you want to nest a tibble by a group, then make bootstraps for each group, you'll have to unnest the bootstraps in order to fit the model. At this point, the unnesting means you cannot use the int_* functions on each group.
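
One possible way around this, sketched under the assumption that each group's bootstraps can live in their own rset: nest by group, then build the bootstraps, fit the models, and call `int_pctl()` inside a per-group function, so that only the final interval tibbles are unnested. The helpers `fit_split()` and `boot_ci()`, the model formula, and `times = 50` are illustrative choices, not something proposed in this thread.

```r
library(dplyr)
library(purrr)
library(tidyr)
library(rsample)

set.seed(123)

# Hypothetical helper: tidy coefficients for one bootstrap split.
fit_split <- function(split) {
  broom::tidy(stats::lm(Sepal.Length ~ Sepal.Width, data = rsample::analysis(split)))
}

# Hypothetical helper: bootstrap one group's data and return percentile
# intervals. The rset never leaves this function, so it is never unnested.
boot_ci <- function(df, times = 50) {
  rsample::bootstraps(df, times = times) %>%
    dplyr::mutate(results = purrr::map(splits, fit_split)) %>%
    rsample::int_pctl(results)
}

ci_by_species <- iris %>%
  tidyr::nest(data = -Species) %>%               # one row per group
  dplyr::mutate(ci = purrr::map(data, boot_ci)) %>%
  dplyr::select(Species, ci) %>%
  tidyr::unnest(ci)                              # plain tibbles, safe to unnest

ci_by_species
```

With so few resamples `int_pctl()` warns that at least 1000 are recommended; raise `times` for real use.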
