
Directional unnesting #418

Closed
hadley opened this issue Feb 16, 2018 · 3 comments

Comments

@hadley
Member

commented Feb 16, 2018

Copying notes from lingering file on my desktop:

# Some interesting challenges from Jenny
# https://github.com/tidyverse/googledrive/blob/519452fc4d3257354079324e4afede777604848f/data-raw/discovery-doc-prep.R#L118-L124

library(tidyverse)
df <- tibble(
  g = c("a", "a", "b"),
  x = c(1, 3, 5),
  y = c("x", "y", "z")
)
df

# One missing case z = list(1, 2, 3)
# There you just want to simplify the list col to a vector


# rows and cols change; current nest behaviour
# df %>% nest(x, y, .key = "data")
nest_both <- tribble(
  ~g,  ~data,
  "a", tibble(x = c(1, 3), y = c("x", "y")),
  "b", tibble(x = 5, y = "z")
)
nest_both

# rows change; cols don't
# get list of vectors
# df %>% nest_rows(x, y)
nest_rows <- tribble(
  ~g, ~x,       ~y,
  "a", c(1, 3), c("x", "y"),
  "b", 5,       "z"
)

# cols change; rows don't
# use lists, not tibbles to convey intent.
# df %>% nest_cols(x, y, .key = "data")
nest_cols <- tribble(
  ~g,  ~data,
  "a", list(x = 1, y = "x"),
  "a", list(x = 3, y = "y"),
  "b", list(x = 5, y = "z")
)
nest_cols

# unnest ------------------------------------------------------------------

# All of these should be able to automatically determine the
# unnested direction: data frame = both; named vector = col;
# unnamed vector = row; anything else or mix = error.

unnest(nest_both, data)
unnest(nest_cols, data)
unnest(nest_rows, x, y)

# Lengths must be consistent (otherwise would have to cross?)
# nest_rows %>% unnest_row()
nest_rows %>% unnest(x, y)

# bind_rows() handles name/type consistency
# nest_cols %>% unnest_col()
nest_cols %>%
  mutate(data = data %>% map(as_tibble)) %>%
  unnest(data)

# What happens if we try to do the "wrong" direction?
nest_rows %>% unnest_col()
nest_cols %>% unnest_row()

# needs column names
# can you supply multiple columns? (yes, but how to supply names? provide numbers by default?)
# can you provide maximum number? need to handle potential raggedness
# (this is starting to feel like separate)
nest_rows %>% unnest_col()

# needs option to capture names
# how to manage types of data col? here would be mix of character and integer
# use purrr::simplify? uses unlist() but guarantees length will be ok

nest_cols %>% unnest(data)
nest_cols %>% unnest(.id = "name") # not picking up name

# would simplify to integer or die trying?
nest_cols %>% unnest(.id = "name", .type = "integer")
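
The direction rule in the notes above (data frame = both; named vector = col; unnamed vector = row; anything else or a mix = error) could be sketched roughly as follows. `guess_direction()` is a hypothetical helper written for illustration only; it is not a tidyr function, and it assumes the list-column is passed in as a plain list.

```r
# Hypothetical helper (not part of tidyr): classify a list-column by the
# kind of elements it holds, following the rule stated in the notes.
guess_direction <- function(col) {
  stopifnot(is.list(col))
  kind <- vapply(col, function(el) {
    if (is.data.frame(el)) {
      "both"   # rows and columns both change
    } else if (is.vector(el) && !is.null(names(el))) {
      "col"    # named vector or named list: spread into columns
    } else if (is.vector(el)) {
      "row"    # unnamed vector: spread into rows
    } else {
      "other"  # anything else is unsupported
    }
  }, character(1))
  kind <- unique(kind)
  if (length(kind) != 1 || kind == "other") {
    stop("Can't determine unnesting direction: elements are mixed or unsupported.")
  }
  kind
}

guess_direction(list(c(1, 3), 5))              # unnamed vectors
guess_direction(list(list(x = 1, y = "x")))    # named lists
```

Checking every element, rather than only the first, is what makes the "mix = error" case detectable.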
@dan-reznik

This comment was marked as outdated.


commented Jun 27, 2018

I have been simulating unnest_cols() with the following function (it requires the lists to be named), e.g.:

library(tidyverse)
unnest_cols <- function(df, col) {
  col <- enquo(col)
  df %>%
    mutate(
      .id = row_number(),     # row key, so spread() will know what to do
      .l = map(!!col, names)  # element names become the new column names
    ) %>%
    unnest(.l, !!col) %>%
    spread(.l, !!col, convert = TRUE) %>%
    select(-.id)
}

So assume you have

df <- tribble(
  ~col,                         ~stuff,
  list(a = 10, b = 11, c = 12), "line1",
  list(a = 20, b = 21, c = 22), "line2",
  list(a = 30, b = 31, c = 32), "line3"
)

and you can call

df %>% unnest_cols(col) 

to get:

a,b,c,stuff
10,11,12,"line1"
20,21,22,"line2"
30,31,32,"line3"
@gvelasq

This comment was marked as resolved.


commented Jul 15, 2018

My particular use case is tibblized JSON data from the Zengine API, which most closely resembles nest_rows in the original post above.

I wrote the function below (here and here) as a temporary workaround for the nest_rows workflow. Based on this I have a few feature requests for the forthcoming unnest_col():

  • Concatenate _n onto new column names where n represents a number for each unique element in the list-col;
  • Allow selective unnesting by list-col name, or if not specified default to all list-cols;
  • Optionally keep original list-cols unchanged;
  • Optionally unnest in place, or default to unnesting at the end of the dataset;
  • Optionally allow for unique list-col elements to be specified either by list-col contents or by order of appearance in the list-col. In the example below, matching by list-col contents as I implemented yields a "five" in the third row of the v2_2 variable. I did not implement the alternative option of matching by order of appearance which would yield a "five" in the third row of the v2_1 variable instead. Matching by order would be relevant for variables arising from ordered multiple responses.

Many thanks for considering!

df <- tibble::tribble(
  ~v1,     ~v2,                      ~v3, ~v4,
  "one",   c("four", "five", "six"), 3,   4L,
  "two",   NA_character_,            2,   c(6L, 5L, 4L),
  "three", "five",                   1,   5L
)

unnest_wide <- function(.data) {
  stopifnot(is.data.frame(.data))
  .data <- tibble::rowid_to_column(.data)
  lst_index <- purrr::map_int(.data, is.list)
  lst_cols <- names(lst_index)[lst_index == 1L]
  lst_vals <- paste0(lst_cols, ".")
  unique_vals <- vector("list", length(lst_cols))
  tmp <- vector("list", length(lst_cols))
  for (i in seq_along(lst_cols)) {
    unique_vals[[i]] <- stats::na.omit(unique(unlist(.data[[lst_cols[i]]])))
    tmp[[i]] <- dplyr::select(.data, rowid, lst_cols[i])
    tmp[[i]] <- dplyr::mutate(tmp[[i]], !!lst_vals[i] := .data[[lst_cols[i]]])
    tmp[[i]] <- tidyr::unnest(tmp[[i]])
    tmp[[i]] <- dplyr::mutate(tmp[[i]], !!lst_cols[i] := match(tmp[[i]][[lst_cols[i]]], unique_vals[[i]]))
    tmp[[i]] <- tidyr::spread(tmp[[i]], !!lst_cols[i], !!lst_vals[i], convert = TRUE, sep = "_")
    tmp[[i]] <- dplyr::select_if(tmp[[i]], !grepl(paste0(lst_cols[i], "_NA"), colnames(tmp[[i]])))
    .data <- dplyr::select(.data, -(!!lst_cols[i]))
    .data <- dplyr::left_join(.data, tmp[[i]], by = "rowid")
  }
  .data <- dplyr::select(.data, -rowid)
  return(.data)
}

unnest_wide(df)
#> # A tibble: 3 x 8
#>   v1       v3 v2_1  v2_2  v2_3   v4_1  v4_2  v4_3
#>   <chr> <dbl> <chr> <chr> <chr> <int> <int> <int>
#> 1 one       3 four  five  six       4    NA    NA
#> 2 two       2 <NA>  <NA>  <NA>      4     6     5
#> 3 three     1 <NA>  five  <NA>     NA    NA     5

Created on 2018-07-25 by the reprex package (v0.2.0).
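
For comparison, the alternative that was not implemented above (matching by order of appearance) could look something like the sketch below. `widen_by_order()` is a hypothetical base-R helper, not part of tidyr or of the function above; it takes a plain list of vectors and pads each one with NA to a common length, so the first element of every row always lands in the `_1` column.

```r
# Hypothetical helper: widen a list of vectors by position rather than by
# content. Shorter vectors are padded with NA on the right.
widen_by_order <- function(x, prefix) {
  n <- max(lengths(x))
  padded <- lapply(x, function(v) {
    length(v) <- n  # extending a vector fills the new slots with NA
    v
  })
  out <- as.data.frame(do.call(rbind, padded), stringsAsFactors = FALSE)
  names(out) <- paste0(prefix, "_", seq_len(n))
  out
}

# Same v2 values as in the example above
v2 <- list(c("four", "five", "six"), NA_character_, "five")
by_order <- widen_by_order(v2, "v2")
by_order
```

With this approach the "five" in the third row ends up in `v2_1`, which is the behaviour wanted for ordered multiple responses.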

@hadley

This comment has been minimized.

Member Author

commented Mar 8, 2019

Probably should refocus unnest() on data frames, and create new unnest_long() and unnest_wide(). unnest() would warn once per session when given a vector.

I think unnest_long() and unnest_wide() would still have to handle data frames (if it doesn't add too much complexity).
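
As a rough illustration of that split: later tidyr releases (1.0.0 onward) provide functions with similar names, `unnest_longer()` and `unnest_wider()`. The sketch below applies them to the `nest_rows` and `nest_cols` examples from the first comment. It assumes tidyr 1.2.0 or later, since it passes several columns to `unnest_longer()` at once, and it is only an illustration, not a statement of how this issue was resolved.

```r
library(tidyr)

# List-columns of unnamed vectors: the rows change, the columns don't
nest_rows <- tibble::tribble(
  ~g,  ~x,      ~y,
  "a", c(1, 3), c("x", "y"),
  "b", 5,       "z"
)

# List-column of named lists: the columns change, the rows don't
nest_cols <- tibble::tribble(
  ~g,  ~data,
  "a", list(x = 1, y = "x"),
  "a", list(x = 3, y = "y"),
  "b", list(x = 5, y = "z")
)

long <- unnest_longer(nest_rows, c(x, y))  # one row per vector element
wide <- unnest_wider(nest_cols, data)      # one column per list name

long
wide
```

Both calls should recover the original three-row `df` with columns `g`, `x` and `y`.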
