Skip to content

Implement keep_empty for unnest_longer() #1442

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

DavisVaughan
Copy link
Member

@DavisVaughan DavisVaughan commented Jan 4, 2023

Closes #1363
Closes #1339

We are going to use keep_empty = FALSE as the default to be consistent with unnest(), unchop(), separate_longer_position(), etc. This is a breaking change as previously NULL elements implicitly had keep_empty = TRUE applied to them. However, we are now consistent with how we handle NULL and empty vectors like integer(), which I think is important.

No changes to worse in revdeps, so I feel good about this!
"322b4064-9f31-474f-9750-36e8c7c2d8df"

@DavisVaughan DavisVaughan force-pushed the feature/unnest-longer-keep-empty branch from dab3470 to c3b0e34 Compare January 5, 2023 15:29
@DavisVaughan DavisVaughan marked this pull request as ready for review January 5, 2023 15:30
if (is.integer(index)) {
index <- 1L
} else {
index <- NA_character_
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am slightly questioning our decision to use NA when there aren't any names over "", like what names2() would return

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not that there are no names here, it's that there are no values. So I think NA is correct.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm really referring to this

tibble(
  x = list(1:2, c(a = 3, b = 4))
) %>%
  unnest_longer(x)
#> # A tibble: 4 × 2
#>       x x_id 
#>   <dbl> <chr>
#> 1     1 <NA> 
#> 2     2 <NA> 
#> 3     3 a    
#> 4     4 b

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we could change ^ to use "", and then consistently use NA_integer_ or NA_character_ when we promote empty vectors to size 1? So its like:

No names anywhere:

  • Positional names for existing values
  • NA_integer_ if a zero length vector was promoted to size 1

At least one name somewhere:

  • Elements with names use their names
  • Elements without names use ""
  • NA_character_ if a zero length vector was promoted to size 1

Copy link
Member Author

@DavisVaughan DavisVaughan Jan 5, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using "" for values without names is probably more consistent in this case

tibble(
  x = list(c(a = 1, 2), c(3, 4))
) %>%
  unnest_longer(x)
#> # A tibble: 4 × 2
#>       x x_id 
#>   <dbl> <chr>
#> 1     1 "a"  
#> 2     2 ""   
#> 3     3 <NA>  # <- better as ""
#> 4     4 <NA>  # <- better as ""

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I kind of like using "" for unnamed elements and NA for elements that didn't exist. But I can see how that might be slicing things too fine.

Copy link
Member Author

@DavisVaughan DavisVaughan Jan 5, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't that what I'm suggesting? 3 and 4 above are "unnamed elements" to me

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Like this case:

library(tidyr)
tibble(x = list(c(a = 1, 2), c(3, 4), NULL)) %>%
  unnest_longer(x, keep_empty = TRUE)
#> # A tibble: 5 × 2
#>       x x_id 
#>   <dbl> <chr>
#> 1     1 "a"  
#> 2     2 ""   
#> 3     3 <NA>  # should be ""
#> 4     4 <NA>  # should be ""
#> 5    NA <NA> # should be NA

Created on 2023-01-05 with reprex v2.0.2

@DavisVaughan DavisVaughan requested a review from hadley January 5, 2023 15:38
if (is.integer(index)) {
index <- 1L
} else {
index <- NA_character_
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not that there are no names here, it's that there are no values. So I think NA is correct.

And change to:
- `""` index names for unnamed vectors
- `NA` index names for empty vectors that are "kept"
@DavisVaughan
Copy link
Member Author

DavisVaughan commented Jan 6, 2023

This PR grew a bit, but I realized that I could vectorize col_to_long() a bit more by:

  • Moving all names handling to collect_indices_info()
  • Adding list_init_null() to handle converting NULLs to either size 0 or size 1 vectors
  • Adding list_init_typed() to follow that up and convert size 0 vectors to size 1 vectors

We already have list_init_empty() that tries to do both of these at once, and we use it in a few other places. It isn't quite right for our needs here, so I had to break it apart into its two functions.

After this PR I'll go back and remove list_init_empty() in favor of using these two helpers. I'm also confident that when I do that I will be able to fix #1327 based on the way I've implemented these two new helpers.


I've also implemented the changes to the name handling we discussed here #1442 (comment). It was heavily tied to this so it was easiest to change it all at once.

Comment on lines +201 to +203
if (is_list_of(x)) {
cli::cli_abort("`x` can't be a list-of. Unclass first and provide `ptype`.", .internal = TRUE)
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The restriction of "no list-of" isn't technically necessary, in particular the previous list_init_empty() helper tried to allow list-ofs and use their information. But this is part of what caused the failure in #1327, and it is cleaner to extract out the ptype of list-ofs ahead of time, strip them down to bare lists, and pass all that through instead.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes sense to me because list_of() can't contain NULLs, right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They can actually (that's the missing value) but can't contain unspecified() (although that branch probably wouldn't be hit for list-ofs because you'd typically also be supplying ptype)

@DavisVaughan DavisVaughan requested a review from hadley January 6, 2023 17:47
Comment on lines +201 to +203
if (is_list_of(x)) {
cli::cli_abort("`x` can't be a list-of. Unclass first and provide `ptype`.", .internal = TRUE)
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes sense to me because list_of() can't contain NULLs, right?


if (vec_any_missing(x)) {
null <- vec_detect_missing(x)
null <- which(null)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't you assume that vec_assign() will do this?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It definitely will. But since I use null in two separate assignments below, it seemed slightly more efficient to which() it once here

@DavisVaughan DavisVaughan merged commit 672f679 into tidyverse:main Jan 9, 2023
@DavisVaughan DavisVaughan deleted the feature/unnest-longer-keep-empty branch January 9, 2023 14:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants