WIP: Avoid stripping attributes in dplyr_reconstruct() #6596

Closed
wants to merge 1 commit into from

krlmlr (Member) commented Dec 11, 2022

Reverts #5277.

This avoids materialization of the lazy relational object.

We don't need this if duckdblabs/duckplyr@9ceb815 is deemed okay.
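
For context, here is a minimal sketch of the attribute stripping that this PR would revert. It assumes a dplyr release that includes #5277; the attribute name `lazy_payload` is made up for illustration and is not something duckdb or dplyr sets.

```r
# A plain data frame carrying a custom attribute.
# "lazy_payload" is a hypothetical name used only for this illustration.
data <- data.frame(a = 1)
attr(data, "lazy_payload") <- "hypothetical marker"

# A bare template without that attribute.
template <- data.frame(a = 1)

# With #5277 in place, dplyr_reconstruct() reduces `data` to a bare data
# frame and then restores only the attributes found on `template`.
out <- dplyr::dplyr_reconstruct(data, template)

# The input still carries the attribute; the output no longer does.
input_has_attr <- !is.null(attr(data, "lazy_payload"))
output_has_attr <- !is.null(attr(out, "lazy_payload"))
```

Rebuilding the data frame in this way is the step that appears to touch the lazy object earlier than intended.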

krlmlr (Member, Author) commented May 17, 2023

@DavisVaughan: The following reprex illustrates the problem.

Stock dplyr_reconstruct() currently strips attributes, which causes data frames to be materialized prematurely:

```r
con <- DBI::dbConnect(duckdb::duckdb())
df <- data.frame(a = 1)

rel1 <- duckdb:::rel_from_df(con, df)
rel2 <- duckdb:::rel_project(
  rel1,
  list(duckdb:::expr_reference("a"))
)

df_out <- duckdb:::rel_to_altrep(rel2)

options(duckdb.materialize_message = TRUE)
invisible(dplyr::dplyr_reconstruct(df_out, df))
#> materializing:
#> ---------------------
#> --- Relation Tree ---
#> ---------------------
#> Projection [a as a]
#>   r_dataframe_scan(0x12f8a9ea0)
#> 
#> ---------------------
#> -- Result Columns  --
#> ---------------------
#> - a (DOUBLE)

df_out
#>   a
#> 1 1
```

Created on 2023-05-17 with reprex v2.0.2

We want the "materializing:" message to appear only when df_out is printed, which is what happens if we don't call dplyr_reconstruct():

```r
con <- DBI::dbConnect(duckdb::duckdb())
df <- data.frame(a = 1)

rel1 <- duckdb:::rel_from_df(con, df)
rel2 <- duckdb:::rel_project(
  rel1,
  list(duckdb:::expr_reference("a"))
)

df_out <- duckdb:::rel_to_altrep(rel2)

options(duckdb.materialize_message = TRUE)
df_out
#> materializing:
#> ---------------------
#> --- Relation Tree ---
#> ---------------------
#> Projection [a as a]
#>   r_dataframe_scan(0x111aaf660)
#> 
#> ---------------------
#> -- Result Columns  --
#> ---------------------
#> - a (DOUBLE)
#> 
#>   a
#> 1 1
```

Created on 2023-05-17 with reprex v2.0.2

This message indicates when the actual computation happens. Earlier I spent some time working out what triggers this materialization; it is buried deep in vctrs (something related to deciding whether the data frame has named rows). I figure that not stripping attributes could be the easier fix, but perhaps you have other suggestions.
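
As a rough illustration of the "named rows or not" question, the base R sketch below shows how row names can be inspected in their stored form, and how a regular attribute lookup expands them. It uses plain data frames only; it does not reproduce the ALTREP row names involved here, and it is not the code path vctrs actually takes.

```r
# Base R only: two ways of looking at row names.
df <- data.frame(a = c(10, 20, 30))
named <- data.frame(a = c(10, 20, 30), row.names = c("x", "y", "z"))

# type = 0L returns the row names as stored internally.
# Automatic row names are kept in the compact form c(NA, -n).
compact <- .row_names_info(df, type = 0L)

# type = 1L returns n, negated when the row names are automatic,
# so the sign answers "named rows or not" without expanding anything.
has_automatic_rows <- .row_names_info(df, type = 1L) < 0
has_named_rows <- .row_names_info(named, type = 1L) > 0

# type = 2L returns the number of rows.
n_rows <- .row_names_info(df, type = 2L)

# attr() expands the compact form into a full integer vector.
expanded <- attr(df, "row.names")
```

For an ordinary data frame the expansion is cheap. For a lazily evaluated object, any access that needs the full row names is presumably what forces the computation.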

To sum up, my goal is the following behavior:

```r
con <- DBI::dbConnect(duckdb::duckdb())
df <- data.frame(a = 1)

rel1 <- duckdb:::rel_from_df(con, df)
rel2 <- duckdb:::rel_project(
  rel1,
  list(duckdb:::expr_reference("a"))
)

df_out <- duckdb:::rel_to_altrep(rel2)

options(duckdb.materialize_message = TRUE)
invisible(dplyr::dplyr_reconstruct(df_out, df))

df_out
#> materializing:
#> ---------------------
#> --- Relation Tree ---
#> ---------------------
#> Projection [a as a]
#>   r_dataframe_scan(0x12f8a9ea0)
#> 
#> ---------------------
#> -- Result Columns  --
#> ---------------------
#> - a (DOUBLE)
#>
#>   a
#> 1 1
```

Created on 2023-05-17 with reprex v2.0.2

krlmlr (Member, Author) commented May 17, 2023

I can create a small package with a self-contained demo if needed; I don't know how to create an ALTREP data type in a reprex.

hadley requested a review from DavisVaughan on June 6, 2023, 13:02

netbsd-srcmastr pushed a commit to NetBSD/pkgsrc that referenced this pull request on Jun 17, 2023:

# vctrs 0.6.3

* Fixed an issue where certain ALTREP row names were being materialized when
  passed to `new_data_frame()`. We've fixed this by removing a safeguard in
  `new_data_frame()` that performed a compatibility check when both `n` and
  `row.names` were provided. Because this is a low level function designed for
  performance, it is up to the caller to ensure these inputs are compatible
  (tidyverse/dplyr#6596).

* Fixed an issue where `vec_set_*()` used with data frames could accidentally
  return an object with the type of the proxy rather than the type of the
  original inputs (#1837).

* Fixed a rare `vec_locate_matches()` bug that could occur when using a max/min
  `filter` (tidyverse/dplyr#6835).