Preserve missing rows when unnesting #358

Closed
leungi opened this issue Sep 2, 2017 · 19 comments

Comments

@leungi

commented Sep 2, 2017

Hi,

Suppose the tibble looks like this (columns separated by ' | '):

index | text | polarity | polarity_confidence | aspects
1 | blah1 | positive | 0.579939 | list()
2 | blah2 | negative | 0.693546 | list()
3 | blah3 | negative | 0.676733 | list()
4 | blah4 | positive | 0.756442 | list()
5 | blah5 | positive | 0.815249 | list()
6 | blah6 | positive | 0.72212 | list()
7 | blah7 | negative | 0.808398 | list(a = value, b = value, c = value)
8 | blah8 | negative | 0.63281 | list()
9 | blah9 | negative | 0.709047 | list()
10 | blah10 | negative | 0.912631 | list()
11 | blah11 | negative | 0.752882 | list(a = value, b = value, c = value)

Issue:
tibble %>%
unnest(aspects)

## This drops every row except 7 and 11 (i.e., those with a non-empty list); '.drop = FALSE' doesn't help.

My current workaround is as follows:

  1. For each row, determine whether the list is empty (using length()).
  2. If the list is empty, substitute a dummy non-empty list (using if_else).
  3. Then unnest.

Workaround code:
tibble %>%
mutate(listLength = map_int(aspects, length)) %>%
mutate(aspects = if_else(listLength <= 0, list(data.frame("NA")), aspects)) %>%
unnest(aspects)
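
For reference, the three steps above can be sketched in a self-contained form. The data here is hypothetical, the nested elements are assumed to be tibbles rather than plain lists, and the placeholder columns a and b are assumed to be known in advance.

```r
library(dplyr)
library(purrr)
library(tibble)
library(tidyr)

# Hypothetical data: rows 1 and 3 hold an empty nested tibble
df <- tibble(
  index = 1:3,
  aspects = list(tibble(), tibble(a = 5, b = 7), tibble())
)

# Assumed placeholder: a single row of NAs with the expected columns
placeholder <- tibble(a = NA_real_, b = NA_real_)

filled <- df %>%
  # Steps 1 and 2: swap every empty element for the placeholder
  mutate(aspects = map(aspects, function(el) {
    if (nrow(el) == 0) placeholder else el
  })) %>%
  # Step 3: unnest, which now keeps all rows
  unnest(aspects)

filled
```

Using map() with a plain if/else sidesteps the type and length checks that if_else() applies to list inputs.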

Desired output:
index | text | polarity | polarity_confidence | a | b | c
1 | blah1 | positive | 0.579939 | NA | NA | NA
2 | blah2 | negative | 0.693546 | NA | NA | NA
3 | blah3 | negative | 0.676733 | NA | NA | NA
4 | blah4 | positive | 0.756442 | NA | NA | NA
5 | blah5 | positive | 0.815249 | NA | NA | NA
6 | blah6 | positive | 0.72212 | NA | NA | NA
7 | blah7 | negative | 0.808398 | value | value | value
8 | blah8 | negative | 0.63281 | NA | NA | NA
9 | blah9 | negative | 0.709047 | NA | NA | NA
10 | blah10 | negative | 0.912631 | NA | NA | NA
11 | blah11 | negative | 0.752882 | value | value | value

Am I missing something?

Looking forward to your insights. Thanks in advance.

@hadley

Member

commented Nov 15, 2017

Could you please rework your reproducible example to use the reprex package? That makes it easier to see both the input and the output, formatted in such a way that I can easily re-run it in a local session.

@markdly

Contributor

commented Nov 19, 2017

Adding a minimal reprex based on my understanding of the OP's issue. I think this is also related to #316.

Conceptually, I think of unnest as something that produces more rows/columns than the tibble provided, while nest produces fewer. Perhaps this is why these issues have been raised: losing rows during an unnest might be counterintuitive for some users (myself included), even though unnest is working as documented.

I think the desired result for both this issue and #316 is a dplyr::left_join of the non-list columns with the unnest results, as shown in the workaround below.

library(dplyr)
library(tidyr)
library(purrr)
df <- tibble(x = 1:2, y = list(tibble(), tibble(a = 5, b = 7)))

# Row with empty tibble has been removed
df %>% unnest()
#> # A tibble: 1 x 3
#>       x     a     b
#>   <int> <dbl> <dbl>
#> 1     2     5     7

# Would like to keep all rows instead. Possible workaround: 
df1 <- df %>% select(-y)
# Keep only rows whose nested tibble has at least one row
df2 <- df %>% filter(map_int(y, nrow) > 0) %>% unnest()
left_join(df1, df2, by = "x")
#> # A tibble: 2 x 3
#>       x     a     b
#>   <int> <dbl> <dbl>
#> 1     1    NA    NA
#> 2     2     5     7

Perhaps an extra example in the documentation to highlight this feature of unnest could help to make users more aware of this situation? (I'd be happy to draft a PR if that was the case)...

@leungi

Author

commented Nov 20, 2017

Hadley/Mark, thanks for reviewing this; apologies for the delayed reply, as I got tied up with work.

The original data in question came from an API call, and I didn't save it, but it's similar to what Mark has. He's also on point regarding my issue.

Mark's solution yields the same intended result as my workaround:

  • rows with empty values in the column to be unnested (e.g., column 'y' in Mark's example) remain in the resulting tibble, along with rows with non-empty values
  • the unnested columns for those empty rows are filled with NA

Based on Mark's comments, this behavior is by design, though I believe it would be useful to have an argument in unnest to keep rows with empty lists after unnesting. I run into these situations quite often in my work.

@hadley

Member

commented Nov 20, 2017

Hmmmmm, maybe it's worth having an option for this, but I'm not sure what to call it.

hadley added the feature label and removed the reprex label on Nov 20, 2017
hadley changed the title from "Unnest a list column with some rows having empty list" to "Preserve missing rows when unnesting" on Nov 20, 2017

@leungi

Author

commented Nov 20, 2017

Thanks Hadley.

Suggestion: na.drop = T/F

@markdly

Contributor

commented Nov 20, 2017

How about empty = "drop" or "fill"

(e.g. similar approach to the extra and fill option values in separate)
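
For context, here is a small sketch of how the extra and fill arguments of separate() behave; the proposed empty argument would follow the same string-option pattern. The data and the column names p and q are made up for illustration.

```r
library(tibble)
library(tidyr)

# Hypothetical input: rows with too many, exactly enough, and too few pieces
df <- tibble(x = c("a b c", "a b", "a"))

out <- separate(
  df, x,
  into = c("p", "q"),
  sep = " ",
  extra = "drop",  # silently discard surplus pieces ("c" in row 1)
  fill = "right"   # pad missing pieces on the right with NA (row 3)
)

out
```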

@hadley

Member

commented Nov 21, 2017

replace_na() now works with list-cols so you can at least do this:

library(tidyr)
library(tibble)

df <- tibble(x = 1:2, y = list(tibble(), tibble(a = 5, b = 7)))

df %>% 
  replace_na(list(y = list(tibble(a = NA, b = NA)))) %>%
  unnest()
#> # A tibble: 2 x 3
#>       x     a     b
#>   <int> <dbl> <dbl>
#> 1     1    NA    NA
#> 2     2     5     7
@leungi

Author

commented Nov 21, 2017

Hadley,

I'm using tibble_1.3.4 and tidyr_0.7.2, but I can't reproduce your output; perhaps the upgraded replace_na is not in the latest CRAN version yet?

library(tidyr)
library(tibble)

df <- tibble(x = 1:2, y = list(tibble(), tibble(a = 5, b = 7)))

df %>% 
  replace_na(list(y = list(tibble(a = NA, b = NA)))) %>%
  unnest()
#> # A tibble: 1 x 3
#>       x     a     b
#>   <int> <dbl> <dbl>
#> 1     2     5     7
@hadley

Member

commented Nov 21, 2017

It's in the dev version, sorry.

@hadley

Member

commented Jan 4, 2018

I've now hit this use case in two practical problems, so I definitely believe it should be an option.

@leungi

Author

commented Jan 7, 2018

Happy 2018; thanks for the update.

Look forward to your enhancements!

@jrgilbertson

commented Feb 7, 2018

Thank you for the temporary workaround (and upcoming feature)! Spent more time than I'd like to admit tonight trying to figure out this exact use case...

@hadley

Member

commented May 11, 2018

Note: this is related to a left join vs an inner join.
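
To make the analogy concrete, here is a small sketch with dplyr joins. The two tables are hypothetical stand-ins for the outer rows and the unnested contents: the current unnest() behaves like the inner join, while the requested behaviour matches the left join.

```r
library(dplyr)
library(tibble)

# Hypothetical stand-ins: all outer rows, and the rows that have nested content
outer_rows <- tibble(x = 1:2)
nested_rows <- tibble(x = 2L, a = 5, b = 7)

# Inner join: x = 1 has no match, so it disappears
inner <- inner_join(outer_rows, nested_rows, by = "x")

# Left join: x = 1 is kept, with NA in a and b
left <- left_join(outer_rows, nested_rows, by = "x")

inner
left
```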

@leungi

Author

commented May 14, 2018

Thanks for the update and for linking the issue, @hadley; I will try it out when nest_join() becomes available in the dev version.

@hadley

Member

commented May 30, 2018

I think this might be best as drop = FALSE and can be implemented internally with something like:

explicit_na <- function(x) {
  dims <- length(dims(x)) 
  if (dims == 0L && length(x) == 0) {
    x[NA_integer]
  } else if (dims == 2L && nrow(x) == 0) {
   x[NA_integer, , drop = FALSE]
  } else {
   x
  }
}
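
As written, the sketch above calls dims() and uses NA_integer, neither of which exists in base R. Assuming those were meant to be dim() and NA_integer_, a runnable version might look like the following. It only illustrates the idea (index an empty object with NA to obtain one missing element or row) and is not necessarily the eventual implementation.

```r
# Assumption: dims() -> dim() and NA_integer -> NA_integer_ in the sketch above
explicit_na <- function(x) {
  n_dims <- length(dim(x))
  if (n_dims == 0L && length(x) == 0) {
    # Empty vector: indexing with NA yields a single missing element
    x[NA_integer_]
  } else if (n_dims == 2L && nrow(x) == 0) {
    # Empty data frame or matrix: indexing with NA yields a single NA row
    x[NA_integer_, , drop = FALSE]
  } else {
    x
  }
}

empty_chr <- explicit_na(character(0))
empty_df <- explicit_na(data.frame(a = character()))
unchanged <- explicit_na(1:3)
```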
@markdly

Contributor

commented Jun 1, 2018

I couldn't get explicit_na to work as it is, but if I tweak it slightly:

library(dplyr)
explicit_na <- function(x) {
  dims <- length(dim(x))
  if (dims == 0L && length(x) == 0) {  
    x <- ifelse(is.list(x) && !is.data.frame(x), list(NA_integer_), NA_integer_)
  } else if (dims == 2L && nrow(x) == 0) {  
    x[TRUE, ] <- NA_integer_
  }
  x
}

These cases return what I'd expect

character(0) %>% explicit_na()
#> [1] NA

list() %>% explicit_na()
#> [[1]]
#> [1] NA

data.frame(a = character()) %>% explicit_na()
#>      a
#> 1 <NA>

But now I'm wondering what should happen if a dataframe has no names?

df <- data.frame()
df
#> data frame with 0 columns and 0 rows

df %>% explicit_na()
#> data frame with 0 columns and 1 row
@hadley

Member

commented Jun 1, 2018

That function is just a reminder for me. It needs testing.

@hadley

Member

commented Feb 9, 2019

Note to self: can't use .drop because it's already used to control if the variables being unnested are dropped.

hadley closed this in 90a13cb on Apr 23, 2019

@hadley

Member

commented Apr 23, 2019

Currently implemented in unnest2(), which I'm going to re-unify with unnest() shortly.
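
In tidyr 1.0.0 and later, this behaviour is available through the keep_empty argument of unnest(). A minimal sketch, assuming tidyr 1.0.0 or newer and reusing the example data from earlier in the thread:

```r
library(tibble)
library(tidyr)

df <- tibble(x = 1:2, y = list(tibble(), tibble(a = 5, b = 7)))

# keep_empty = TRUE keeps rows whose nested element is empty,
# filling the unnested columns with NA instead of dropping the row
kept <- unnest(df, y, keep_empty = TRUE)

kept
```

With the default keep_empty = FALSE, the row with x = 1 is dropped, as in the original report.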
