Preserve missing rows when unnesting #358

Closed
leungi opened this issue Sep 2, 2017 · 19 comments
Labels
feature rectangling 🗄️

Comments

@leungi leungi commented Sep 2, 2017

Hi,

Suppose the tibble looks like this (columns separated by ' | '):

index | text | polarity | polarity_confidence | aspects
1 | blah1 | positive | 0.579939 | list()
2 | blah2 | negative | 0.693546 | list()
3 | blah3 | negative | 0.676733 | list()
4 | blah4 | positive | 0.756442 | list()
5 | blah5 | positive | 0.815249 | list()
6 | blah6 | positive | 0.72212 | list()
7 | blah7 | negative | 0.808398 | list(a = value, b = value, c = value)
8 | blah8 | negative | 0.63281 | list()
9 | blah9 | negative | 0.709047 | list()
10 | blah10 | negative | 0.912631 | list()
11 | blah11 | negative | 0.752882 | list(a = value, b = value, c = value)

Issue:
tibble %>%
  unnest(aspects)

## This drops every row except 7 and 11 (i.e. those with a non-empty list); '.drop = FALSE' doesn't help.

My current workaround is as follows:

  1. by row, determine whether the list is empty (using length())
  2. if the list is empty, substitute a dummy non-empty list (using if_else)
  3. then unnest

Workaround code:
tibble %>%
  mutate(listLength = map_int(aspects, length)) %>%
  mutate(aspects = if_else(listLength <= 0, list(data.frame("NA")), aspects)) %>%
  unnest(aspects)

Desired output:
index | text | polarity | polarity_confidence | a | b | c
1 | blah1 | positive | 0.579939 | NA | NA | NA
2 | blah2 | negative | 0.693546 | NA | NA | NA
3 | blah3 | negative | 0.676733 | NA | NA | NA
4 | blah4 | positive | 0.756442 | NA | NA | NA
5 | blah5 | positive | 0.815249 | NA | NA | NA
6 | blah6 | positive | 0.72212 | NA | NA | NA
7 | blah7 | negative | 0.808398 | value | value | value
8 | blah8 | negative | 0.63281 | NA | NA | NA
9 | blah9 | negative | 0.709047 | NA | NA | NA
10 | blah10 | negative | 0.912631 | NA | NA | NA
11 | blah11 | negative | 0.752882 | value | value | value

Am I missing something?

Looking forward to your insights. Thanks in advance.

@hadley hadley added reprex rectangling 🗄️ labels Nov 15, 2017
Contributor

@markdly markdly commented Nov 19, 2017

Adding a minimal reprex based on my understanding of the OP's issue. I think this is also related to #316.

Conceptually, I think of unnest as producing more rows/columns than the tibble it was given, while nest produces fewer. That may be why these issues have been raised: losing rows during an unnest can be counterintuitive for some users (myself included), even though unnest is working as documented.

I think the desired result for both this issue and #316 is a dplyr::left_join of the non-list columns with the unnest results, as shown in the workaround below.

library(dplyr)
library(tidyr)
library(purrr)
df <- tibble(x = 1:2, y = list(tibble(), tibble(a = 5, b = 7)))

# Row with empty tibble has been removed
df %>% unnest()
#> # A tibble: 1 x 3
#>       x     a     b
#>   <int> <dbl> <dbl>
#> 1     2     5     7

# Would like to keep all rows instead. Possible workaround:
df1 <- df %>% select(-y)
# map_int(y, nrow) tests each element; length(y) would be the length of the whole column
df2 <- df %>% filter(map_int(y, nrow) > 0) %>% unnest()
left_join(df1, df2, by = "x")
#> # A tibble: 2 x 3
#>       x     a     b
#>   <int> <dbl> <dbl>
#> 1     1    NA    NA
#> 2     2     5     7

Perhaps an extra example in the documentation highlighting this behaviour of unnest would help make users more aware of it? (I'd be happy to draft a PR if so.)

Author

@leungi leungi commented Nov 20, 2017

Hadley/Mark, thanks for reviewing this; apologies for the delayed reply, as I got tied up with work.

The original data in question came from an API call, and I didn't save it, but it's similar to what Mark has. He's also on point regarding my issue.

Mark's solution yields the same intended result as my workaround:

  • rows where the column to be unnested is empty (e.g., column 'y' in Mark's example) remain in the resulting tibble, along with rows where it is non-empty
  • the unnested columns have NA filled in for those empty rows

Based on Mark's comments, this behaviour is by design, though I believe it would be useful to have an argument in unnest to keep rows with empty lists after unnesting. I run into these situations quite often in my work.

Member

@hadley hadley commented Nov 20, 2017

Hmmmmm, maybe it's worth having an option for this, but I'm not sure what to call it.

@hadley hadley added feature and removed reprex labels Nov 20, 2017
@hadley hadley changed the title from "Unnest a list column with some rows having empty list" to "Preserve missing rows when unnesting" Nov 20, 2017
Author

@leungi leungi commented Nov 20, 2017

Thanks Hadley.

Suggestion: na.drop = T/F

Contributor

@markdly markdly commented Nov 20, 2017

How about empty = "drop" or "fill"

(e.g. similar approach to the extra and fill option values in separate)
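
For reference, a minimal sketch of the separate() options this naming suggestion borrows from. The data frame and column names are made up for illustration, and the proposed `empty` argument is only an idea in this thread, so it is not used here.

```r
library(tidyr)

# Hypothetical input: rows with too many, exactly enough, and too few pieces
pieces <- data.frame(x = c("a b c", "a b", "a"), stringsAsFactors = FALSE)

# extra = "drop" silently discards surplus pieces;
# fill = "right" pads missing pieces with NA on the right
out <- separate(
  pieces, x,
  into = c("first", "second"),
  sep = " ",
  extra = "drop",
  fill = "right"
)
print(out)
```

The suggested empty = "drop" / "fill" would follow the same pattern: a string option that chooses between discarding a problematic row and padding it with NA.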

Member

@hadley hadley commented Nov 21, 2017

replace_na() now works with list-cols so you can at least do this:

library(tidyr)
library(tibble)

df <- tibble(x = 1:2, y = list(tibble(), tibble(a = 5, b = 7)))

df %>% 
  replace_na(list(y = list(tibble(a = NA, b = NA)))) %>%
  unnest()
#> # A tibble: 2 x 3
#>       x     a     b
#>   <int> <dbl> <dbl>
#> 1     1    NA    NA
#> 2     2     5     7

Member

@hadley hadley commented Jan 4, 2018

I've now hit this use case in two practical problems, so I definitely believe it should be an option.

Member

@hadley hadley commented May 11, 2018

Note: this is related to a left join vs an inner join.
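
The join analogy can be sketched with base R's merge(), assuming the unnested rows from the reprex above are treated as a separate table keyed by x. The object names are illustrative only.

```r
# Outer columns of the reprex (one row per original row)
outer_cols <- data.frame(x = 1:2)

# What unnesting produces: only x = 2 had a non-empty element
unnested <- data.frame(x = 2L, a = 5, b = 7)

# Inner join: rows without a match disappear (current unnest behaviour)
inner <- merge(outer_cols, unnested, by = "x")

# Left join: every row of outer_cols is kept, unmatched ones get NA
left <- merge(outer_cols, unnested, by = "x", all.x = TRUE)

print(inner)
print(left)
```

Seen this way, the requested option switches unnest from inner-join semantics to left-join semantics.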

Author

@leungi leungi commented May 14, 2018

Thanks for the update and for linking the issue, @hadley; I will try it out when nest_join() is available in the dev version.

Member

@hadley hadley commented May 30, 2018

I think this might be best as drop = FALSE and can be implemented internally with something like:

explicit_na <- function(x) {
  dims <- length(dims(x)) 
  if (dims == 0L && length(x) == 0) {
    x[NA_integer]
  } else if (dims == 2L && nrow(x) == 0) {
   x[NA_integer, , drop = FALSE]
  } else {
   x
  }
}

Contributor

@markdly markdly commented Jun 1, 2018

I couldn't get explicit_na to work as it is, but if I tweak it slightly:

library(dplyr)
explicit_na <- function(x) {
  dims <- length(dim(x))
  if (dims == 0L && length(x) == 0) {  
    x <- ifelse(is.list(x) && !is.data.frame(x), list(NA_integer_), NA_integer_)
  } else if (dims == 2L && nrow(x) == 0) {  
    x[TRUE, ] <- NA_integer_
  }
  x
}

These cases return what I'd expect:

character(0) %>% explicit_na()
#> [1] NA

list() %>% explicit_na()
#> [[1]]
#> [1] NA

data.frame(a = character()) %>% explicit_na()
#>      a
#> 1 <NA>

But now I'm wondering what should happen if a dataframe has no names?

df <- data.frame()
df
#> data frame with 0 columns and 0 rows

df %>% explicit_na()
#> data frame with 0 columns and 1 row
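
For comparison, here is a sketch of the indexing-based version from the May 30 comment. It assumes that `dims(x)` was meant to be `dim(x)` and `NA_integer` was meant to be `NA_integer_`; that reading is a guess, not something confirmed in the thread. It uses base R only.

```r
# Assumed reading of the May 30 sketch, with the two apparent typos corrected
explicit_na <- function(x) {
  n_dims <- length(dim(x))
  if (n_dims == 0L && length(x) == 0L) {
    # Empty vector or list: indexing by NA yields one missing element
    x[NA_integer_]
  } else if (n_dims == 2L && nrow(x) == 0L) {
    # Zero-row data frame: indexing by NA yields one row of NAs
    x[NA_integer_, , drop = FALSE]
  } else {
    # Anything non-empty is returned unchanged
    x
  }
}

print(explicit_na(character(0)))
print(explicit_na(list()))
print(explicit_na(data.frame(a = character())))
```

One difference from the tweaked version above: indexing an empty list by NA gives a list holding NULL rather than a list holding NA, so the two variants disagree on what a "missing" list element should be.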

Member

@hadley hadley commented Jun 1, 2018

That function is just a reminder for me. It needs testing.

Member

@hadley hadley commented Feb 9, 2019

Note to self: can't use .drop because it's already used to control if the variables being unnested are dropped.

Member

@hadley hadley commented Apr 23, 2019

Currently implemented in unnest2(), which I'm going to re-unify with unnest() shortly.
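
As far as I know, this option shipped in tidyr 1.0.0 as the keep_empty argument of unnest(). That name does not appear in this thread, so treat the following as a sketch to check against your installed version. It reuses the reprex data from earlier in the discussion.

```r
library(tidyr)
library(tibble)

df <- tibble(x = 1:2, y = list(tibble(), tibble(a = 5, b = 7)))

# Default: the row whose element is empty is dropped
dropped <- unnest(df, y)

# keep_empty = TRUE: empty elements become a single row of missing values
kept <- unnest(df, y, keep_empty = TRUE)

print(dropped)
print(kept)
```

Note that recent tidyr versions expect the column to unnest to be named explicitly, unlike the bare unnest() calls used earlier in the thread.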
