Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid dispatch when manipulating df proxies #1129

Open
earowang opened this issue May 28, 2020 · 12 comments
Open

Avoid dispatch when manipulating df proxies #1129

earowang opened this issue May 28, 2020 · 12 comments
Labels
bug an unexpected problem or unintended behavior op:proxy-restore theory

Comments

@earowang
Copy link
Contributor

earowang commented May 28, 2020

I think I need some help here.

I have defined vec_restore() for the tsibble object, which solves the attribute updating issue when vec_slice() a tsibble. But it breaks bind_rows() that uses vec_rbind() internally.

In the tsibble code, I need to make sure that there's no NA in my index column, and validate_index() does the job. I have no idea when NA has been introduced in the vec_restore(). When I debug through my validate_index(), it will not trigger that error because of no NA.

library(tsibble)
library(vctrs)
library(lubridate)
#> 
#> Attaching package: 'lubridate'
#> The following object is masked from 'package:tsibble':
#> 
#>     interval
#> The following objects are masked from 'package:base':
#> 
#>     date, intersect, setdiff, union
idx_day <- seq.Date(ymd("2017-01-01"), ymd("2017-01-20"), by = 4)
tsbl <- tsibble(
  date = idx_day,
  value = rnorm(5),
  index = date
)
tsbl
#> # A tsibble: 5 x 2 [4D]
#>   date         value
#>   <date>       <dbl>
#> 1 2017-01-01  0.965 
#> 2 2017-01-05  0.0740
#> 3 2017-01-09 -1.35  
#> 4 2017-01-13 -0.257 
#> 5 2017-01-17 -0.814
new_data(tsbl)
#> # A tsibble: 1 x 1 [4D]
#>   date      
#>   <date>    
#> 1 2017-01-21

vec_rbind(tsbl, new_data(tsbl))
#> Error: Column `date` (index) must not contain `NA`.

Created on 2020-05-29 by the reprex package (v0.3.0)

The vec_restore() implemention sits in this branch https://github.com/tidyverts/tsibble/tree/vctrs-vec-restore

@lionel-
Copy link
Member

lionel- commented May 29, 2020

I think this is a bug on our end. We work on the proxy without removing the class, which means that we then dispatch on all the vctrs methods while working with the proxy (e.g. vec_init(proxy).

We can't really remove the class for all proxies because atomic vectors can't be shallow-duplicated (even with altrep on recent R there is some overhead). We should at least do it with data frames though.

The other solution is to duplicate our internal API with proxy variants that never dispatch.

In the mean time I'll look into stripping the class of data frame subclasses in vec_proxy(). This should fix this issue. Hopefully this doesn't impact revdeps of tidyr and dplyr and we'll be able to include this fix in the upcoming release.

@lionel- lionel- added bug an unexpected problem or unintended behavior op:proxy-restore theory labels May 29, 2020
@lionel-
Copy link
Member

lionel- commented May 29, 2020

@earowang Can you please add this method in your package while we figure this out?

#' @export
vec_proxy.tbl_ts <- function(x, ...) {
  new_data_frame(x)
}

This strips all the attributes of your data frame and prevents dispatching again during internal memory manipulations.

@DavisVaughan
Copy link
Member

@earowang, I've also been working on vctrs/dplyr compatibility for rsample and dials. One thing that I have found there is that the vec_restore() method will have to decide where x is still a valid tsibble or not, and either:

  • fall back to a tibble if it isn't valid
  • continue restoring back to a tsibble if it is valid

For example, with the current implementation of vec_restore.tsibble in that branch, it is always assumed that you can restore back to a valid tsibble, which isn't always true. Here's an example where vec_slice() makes duplicate rows. The [ implementation knows how to fall back to a tibble, but the vec_restore() method that gets called at the end of vec_slice() currently always assumes the result is a valid tsibble.

# devtools::install_github("tidyverts/tsibble", ref = "vctrs-vec-restore")

library(vctrs)
library(tsibble)

# falls back to tibble b/c of duplicates in index
pedestrian[c(1, 1, 2),]
#> # A tibble: 3 x 5
#>   Sensor         Date_Time           Date        Time Count
#>   <chr>          <dttm>              <date>     <int> <int>
#> 1 Birrarung Marr 2015-01-01 00:00:00 2015-01-01     0  1630
#> 2 Birrarung Marr 2015-01-01 00:00:00 2015-01-01     0  1630
#> 3 Birrarung Marr 2015-01-01 01:00:00 2015-01-01     1   826

# oops
vec_slice(pedestrian, c(1, 1, 2))
#> # A tsibble: 3 x 5 [1h] <Australia/Melbourne>
#> # Key:       Sensor [1]
#>   Sensor         Date_Time           Date        Time Count
#>   <chr>          <dttm>              <date>     <int> <int>
#> 1 Birrarung Marr 2015-01-01 00:00:00 2015-01-01     0  1630
#> 2 Birrarung Marr 2015-01-01 00:00:00 2015-01-01     0  1630
#> 3 Birrarung Marr 2015-01-01 01:00:00 2015-01-01     1   826

Created on 2020-05-29 by the reprex package (v0.3.0)

Here are the rsample and dials PRs. I've added a lot of notes that you might find useful:
tidymodels/rsample#150
tidymodels/dials#114

The general pattern that is emerging would be for you to have a tsibble_reconstructable(x, to) function that holds all of the logic for deciding when x is still a valid tsibble or not (relative to to, which will be the original valid tsibble before any transformation happened). It returns TRUE if you can reconstruct, otherwise it returns FALSE.

Then you have a tsibble_reconstruct() method that looks like:

tsibble_reconstruct(x, to) {
  if (tsibble_reconstructable(x, to)) {
    # reconstruct the tsibble
  } else {
    # "upcast" `x` to a bare tibble
  }
}

Then you could consistently use this helper in the dplyr 1.0.0 methods and vec_restore(). For dials and rsample it is as simple as this, but it might not be for tsibble:

vec_restore.tsibble(x, to) {
  tsibble_reconstruct(x, to)
}

dplyr_reconstruct.tsibble(x, to) {
  tsibble_reconstruct(x, to)
}

@earowang
Copy link
Contributor Author

Much appreciated for all your responses.

@lionel- yep, adding vec_proxy() makes everything work again.

@DavisVaughan Actually a bit of correctness has been sacrificed for performance. My understanding is that vec_restore() is called many times in {vctrs}. If I validate to, it is costly for {tsibble} for little gain.

@lionel-
Copy link
Member

lionel- commented May 30, 2020

A lot of the time it is called with empty data (when doing vec_ptype() which calls vec_slice()). Also with the proxy workaround, it will be called fewer times. So it might not be this costly to validate the data.

I think an overall goal for 0.4.0 is to make sure restore is only called on complete data.

@earowang
Copy link
Contributor Author

I passed R CMD check --as-cran locally on macOS and win-builder devel. When I submitted to CRAN, vec_rbind() errors.

https://win-builder.r-project.org/incoming_pretest/tsibble_0.9.0_20200531_234022/Windows/00check.log

@lionel-
Copy link
Member

lionel- commented Jun 1, 2020

This is very strange. For some reasons it looks like your vec_proxy() method is not registered or picked up properly on this machine?

@lionel-
Copy link
Member

lionel- commented Jun 1, 2020

Which version of vctrs are you using locally? Note that I sent vctrs 0.3.1 to win-builder last week, and it "remembers" the last sent version, even for other packages. So maybe tsibble passes tests with 0.3.1, but not 0.3.0?

@lionel-
Copy link
Member

lionel- commented Jun 1, 2020

oh it would make sense that the proxy thing in vec-rbind does not work on 0.3.0. @DavisVaughan changed the implementation to take the proxy before the loop, rather than inside the loop.

In that case, you'll need to wait before vctrs 0.3.1 is on CRAN. I was holding it for r-spatial/sf#1390, but I might send it anyway since it's not about a regression.

@earowang
Copy link
Contributor Author

earowang commented Jun 1, 2020

Oh yea, I was using 0.3.1, and thought 0.3.1 was on CRAN. Indeed, it doesn't work with vctrs 0.3.0.

@earowang
Copy link
Contributor Author

earowang commented Jun 1, 2020

When do you plan to send 0.3.1 for CRAN? I'll have to submit tsibble to fix CRAN check errors with dplyr v1.0.0.

@lionel-
Copy link
Member

lionel- commented Jun 5, 2020

@earowang on CRAN!

@lionel- lionel- changed the title vec_restore() and vec_rbind() Avoid dispatch when manipulating df proxies Sep 13, 2022
lionel- added a commit that referenced this issue Sep 13, 2022
lionel- added a commit that referenced this issue Sep 13, 2022
lionel- added a commit that referenced this issue Sep 13, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug an unexpected problem or unintended behavior op:proxy-restore theory
Projects
None yet
Development

No branches or pull requests

3 participants